Elasticsearch (3) -- The IK Analyzer

2022-09-23

Elasticsearch (ES) supports many analyzers. Here I use the commonly used IK analyzer.

1. Analysis mode one: ik_max_word

This mode splits the text at the finest granularity. Try it in Kibana first by sending the following request:

POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "南京市长江大桥"
}

Result:

{
  "tokens": [
    { "token": "南京市", "start_offset": 0, "end_offset": 3, "type": "CN_WORD", "position": 0 },
    { "token": "南京", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 1 },
    { "token": "市长", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 2 },
    { "token": "市", "start_offset": 2, "end_offset": 3, "type": "CN_CHAR", "position": 3 },
    { "token": "长江大桥", "start_offset": 3, "end_offset": 7, "type": "CN_WORD", "position": 4 },
    { "token": "长江", "start_offset": 3, "end_offset": 5, "type": "CN_WORD", "position": 5 },
    { "token": "大桥", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 6 }
  ]
}

2. Analysis mode two: ik_smart

This mode splits the text at the coarsest granularity:

POST _analyze
{
  "analyzer": "ik_smart",
  "text": "南京市长江大桥"
}

Result:

{
  "tokens": [
    { "token": "南京市", "start_offset": 0, "end_offset": 3, "type": "CN_WORD", "position": 0 },
    { "token": "长江大桥", "start_offset": 3, "end_offset": 7, "type": "CN_WORD", "position": 1 }
  ]
}
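To make the difference between the two modes concrete, here is a small Python sketch that takes the tokens from the two responses above (copied by hand as offset tuples) and checks whether any spans overlap. The helper name has_overlap is made up for this illustration, and nothing here talks to Elasticsearch.

```python
# (token, start_offset, end_offset) copied from the two _analyze responses above.
IK_MAX_WORD = [
    ("南京市", 0, 3), ("南京", 0, 2), ("市长", 2, 4), ("市", 2, 3),
    ("长江大桥", 3, 7), ("长江", 3, 5), ("大桥", 5, 7),
]
IK_SMART = [("南京市", 0, 3), ("长江大桥", 3, 7)]


def has_overlap(tokens):
    """Return True if any two token spans share at least one character."""
    spans = sorted((start, end) for _, start, end in tokens)
    # After sorting by start offset, an overlap exists as soon as a span
    # starts before the furthest end offset seen so far.
    furthest_end = 0
    for start, end in spans:
        if start < furthest_end:
            return True
        furthest_end = max(furthest_end, end)
    return False


print(has_overlap(IK_MAX_WORD))  # True
print(has_overlap(IK_SMART))     # False
```

ik_max_word emits every dictionary word it can find, so the same characters end up in several tokens. ik_smart picks one non-overlapping split.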

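Question: suppose 江大桥 (Jiang Daqiao) is a person's name, and that person is the mayor of Nanjing (南京市长). The segmentation above would then clearly be wrong. What should we do?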

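3. Adding an extension dictionary and a stop-word dictionary

Stop words: some words occur very frequently in text but contribute little to its meaning, for example a, an, the, and of in English, or 的, 了, and 呢 in Chinese. Such words are called stop words. They are usually filtered out and not indexed. At search time, if the user's query contains stop words, the system filters them out automatically. Filtering stop words speeds up indexing and reduces the size of the index files.

Extension words: words that you do not want split apart and want kept as a single token, such as 江大桥 above.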
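The idea of stop-word filtering can be sketched in a few lines of Python. This is only a conceptual illustration, not how IK implements it internally. The STOP_WORDS set and the function name drop_stop_words are assumptions made up for the example.

```python
# Hypothetical stop-word set; a real list would come from a stop-word dictionary.
STOP_WORDS = {"的", "了", "呢", "a", "an", "the", "of"}


def drop_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not stop words, preserving their order."""
    return [t for t in tokens if t.lower() not in stop_words]


# The same filter is applied to documents and to queries,
# so stop words are neither indexed nor searched for.
print(drop_stop_words(["the", "bridge", "of", "Nanjing"]))  # ['bridge', 'Nanjing']
print(drop_stop_words(["南京", "的", "大桥"]))                # ['南京', '大桥']
```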

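3.1 Add custom dictionaries under plugins/ik/config in the ES installation directory

Note: the files must be UTF-8 encoded, otherwise the extension words will not take effect.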
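One way to avoid encoding problems is to generate the dictionary files from a script instead of saving them from an editor. The sketch below assumes one word per line. The function names and the file name ext_dict.dic in the usage line are hypothetical, chosen only for this example.

```python
from pathlib import Path


def write_dictionary(path, words):
    """Write one word per line, encoded as UTF-8."""
    text = "\n".join(words) + "\n"
    # "utf-8" (rather than "utf-8-sig") writes no byte-order mark,
    # which keeps the first entry free of invisible leading bytes.
    Path(path).write_text(text, encoding="utf-8")


def is_valid_utf8(path):
    """Return True if the file decodes cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Calling write_dictionary("ext_dict.dic", ["江大桥"]) from inside the config directory would create the extension dictionary, and is_valid_utf8 can be used to check a file that was created some other way.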

3.2 Register the dictionaries in the configuration
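IK reads its dictionary settings from IKAnalyzer.cfg.xml in the same config directory, using the ext_dict and ext_stopwords entries. The Python sketch below renders such a file as text. The dictionary file names are hypothetical, and I am assuming the paths are given relative to the config directory, so compare the output with the file shipped by your IK version before overwriting anything.

```python
from xml.sax.saxutils import escape

# Layout modelled on the Java properties XML format that IKAnalyzer.cfg.xml uses.
TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- extension dictionary (assumed relative to the config directory) -->
    <entry key="ext_dict">{ext_dict}</entry>
    <!-- extension stop-word dictionary -->
    <entry key="ext_stopwords">{ext_stopwords}</entry>
</properties>
"""


def render_ik_config(ext_dict, ext_stopwords):
    """Return config text that registers the two dictionary files."""
    return TEMPLATE.format(
        ext_dict=escape(ext_dict),
        ext_stopwords=escape(ext_stopwords),
    )


# Hypothetical file names, matching whatever was created in step 3.1.
print(render_ik_config("ext_dict.dic", "ext_stopwords.dic"))
```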

3.3 Restart ES and test

With ik_smart, the custom word does not show up in the output.
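To compare both analyzers after the restart without going through Kibana, the same _analyze request can be sent from Python. This is a minimal sketch that assumes an unsecured node at http://localhost:9200, which is an assumption and not something stated in this post. The function names are made up for the example.

```python
import json
import urllib.request


def build_analyze_request(analyzer, text, base_url="http://localhost:9200"):
    """Build the POST _analyze request without sending it."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(analyzer, text, base_url="http://localhost:9200"):
    """Send the request and return only the token strings."""
    request = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [item["token"] for item in payload["tokens"]]


# Usage against a running node (not executed here):
# for name in ("ik_max_word", "ik_smart"):
#     tokens = analyze(name, "南京市长江大桥")
#     print(name, tokens, "江大桥" in tokens)
```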
