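1. Uses of pinyin tokenization

Pinyin tokenization is common in everyday life, and you may well use it every day. Try it on Taobao: type the pinyin "zhonghua" into the search box, and the suggestions underneath list products containing the corresponding Chinese word "中华".

Pinyin tokenization suggests the matching Chinese text from pinyin input, which improves the search experience and speeds up searching. The rest of this post shows how to configure and use pinyin + IK analysis in Elasticsearch 5.1.1.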

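2. Downloading and installing the IK analyzer

I won't introduce the IK analyzer at length. In short, IK is a very widely used Chinese analyzer with good segmentation quality. Nine times out of ten, Chinese word segmentation in ES development is done with IK.

Download: https://github.com/medcl/elasticsearch-analysis-ik
Stop Elasticsearch before installing, and restart it once the setup is finished.
The IK version must match your ES version; the README explains this. I am using ES 5.1.1 with IK 5.1.1. (You may wonder why IK jumps from 1.x straight to 5.x. The reason is that Elastic unified its version numbers: previously ES was at 2.x, Logstash at 2.x, Kibana at 4.x, and IK at 1.x, which was confusing. From 5.0 onward the versions are aligned, so with ES 5.1.1 you simply use 5.1.1 of the other components as well.)

After downloading, change into the elasticsearch-analysis-ik-master directory and build the package with Maven (install Maven first if you don't have it) by running: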

    mvn package

When the build succeeds, it creates a target folder. In elasticsearch-analysis-ik-master/target/releases you will find elasticsearch-analysis-ik-5.1.1.zip, which is the installation file we need. Unzipping elasticsearch-analysis-ik-5.1.1.zip yields the following:

commons-codec-1.9.jar
commons-logging-1.2.jar
config
elasticsearch-analysis-ik-5.1.1.jar
httpclient-4.5.2.jar
httpcore-4.4.4.jar
plugin-descriptor.properties

Then create a folder named ik under elasticsearch-5.1.1/plugins and copy the files extracted from elasticsearch-analysis-ik-5.1.1.zip into elasticsearch-5.1.1/plugins/ik.
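
Because the IK version has to match the ES version, it can be worth checking the copied plugin before restarting. The sketch below assumes that the plugin-descriptor.properties file listed above contains an `elasticsearch.version=` line (usual for ES 5.x plugins, but confirm it in your own file). The function names `prop_value` and `plugin_matches_es` are made up for this example.

```shell
# prop_value FILE KEY: print the value of a simple "key=value" line.
prop_value() {
  # Keep the last match, drop the "key=" prefix and any Windows line ending.
  grep -E "^$2=" "$1" | tail -n 1 | cut -d= -f2- | tr -d '\r'
}

# plugin_matches_es FILE VERSION: succeed if the plugin targets VERSION.
plugin_matches_es() {
  [ "$(prop_value "$1" "elasticsearch.version")" = "$2" ]
}

# Example (path assumed from the layout described above):
# plugin_matches_es elasticsearch-5.1.1/plugins/ik/plugin-descriptor.properties 5.1.1 \
#   && echo "IK was built for this ES version"
```

The same check should apply to any other plugin folder, as long as its descriptor follows the same key=value layout.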

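3. Downloading and installing the pinyin analyzer

Download the pinyin analyzer from:
https://github.com/medcl/elasticsearch-analysis-pinyin

Installation is the same as for IK: download, package, and add to ES, so the steps are not repeated here.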

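4. Analyzer tests

Once IK and pinyin are in place, restart ES. If ES reports errors during startup, something went wrong with the installation; if it starts without errors, the plugins were installed successfully.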
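
Besides watching the startup log, one way to confirm that both plugins were loaded is to ask the node for its plugin list through the `_cat/plugins` API. The sketch below assumes the plugins register under the names `analysis-ik` and `analysis-pinyin` (check the `name=` line in each plugin-descriptor.properties); `has_plugin` is a helper invented for this example.

```shell
# has_plugin NAME: read a plugin list on stdin (one component name per line)
# and succeed if NAME is among them. Surrounding whitespace is ignored.
has_plugin() {
  awk -v name="$1" '$1 == name { found = 1 } END { exit (found ? 0 : 1) }'
}

# Example (needs the restarted node; h=component prints only the name column):
# plugins="$(curl -s "http://localhost:9200/_cat/plugins?h=component")"
# printf '%s\n' "$plugins" | has_plugin analysis-ik     && echo "IK loaded"
# printf '%s\n' "$plugins" | has_plugin analysis-pinyin && echo "pinyin loaded"
```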

4.1 IK analysis test

Create an index:

curl -XPUT "http://localhost:9200/index"

Test the analyzer (the input is "中华人民共和国国歌", matching the nine-character offsets in the result below):

curl -XPOST "http://localhost:9200/index/_analyze?analyzer=ik_max_word&text=中华人民共和国国歌"

Result:

{"tokens": [
  {"token": "中华人民共和国", "start_offset": 0, "end_offset": 7, "type": "CN_WORD", "position": 0},
  {"token": "中华人民", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 1},
  {"token": "中华", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 2},
  {"token": "华人", "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 3},
  {"token": "人民共和国", "start_offset": 2, "end_offset": 7, "type": "CN_WORD", "position": 4},
  {"token": "人民", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 5},
  {"token": "共和国", "start_offset": 4, "end_offset": 7, "type": "CN_WORD", "position": 6},
  {"token": "共和", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 7},
  {"token": "国", "start_offset": 6, "end_offset": 7, "type": "CN_CHAR", "position": 8},
  {"token": "国歌", "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 9}
]}

With ik_smart:

curl -XPOST "http://localhost:9200/index/_analyze?analyzer=ik_smart&text=中华人民共和国国歌"

Result:

{"tokens": [
  {"token": "中华人民共和国", "start_offset": 0, "end_offset": 7, "type": "CN_WORD", "position": 0},
  {"token": "国歌", "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 1}
]}
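
The `_analyze` API also accepts a JSON request body with `analyzer` and `text` fields, which avoids putting Chinese characters into the URL (how those are sent can depend on the shell's encoding). The helper below is a minimal sketch: `analyze_body` is a name invented here, and it does no JSON escaping, so it only suits text without quotes or backslashes.

```shell
# analyze_body ANALYZER TEXT: print a JSON request body for the _analyze API.
# No escaping is performed; TEXT must not contain double quotes or backslashes.
analyze_body() {
  printf '{"analyzer": "%s", "text": "%s"}\n' "$1" "$2"
}

# Example (needs the running node and the index created above):
# curl -XPOST "http://localhost:9200/index/_analyze?pretty" \
#   -d "$(analyze_body ik_max_word 中华人民共和国国歌)"
```

ik_max_word produces the fine-grained, overlapping tokens shown first, while ik_smart produces the coarser two-token split, so comparing the two outputs for the same text is a quick way to choose between them.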

4.2 pinyin analysis test

Test the pinyin analyzer:

curl -XPOST "http://localhost:9200/index/_analyze?analyzer=pinyin&text=张学友"

Result:

{"tokens": [
  {"token": "zhang", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0},
  {"token": "xue", "start_offset": 1, "end_offset": 2, "type": "word", "position": 1},
  {"token": "you", "start_offset": 2, "end_offset": 3, "type": "word", "position": 2},
  {"token": "zxy", "start_offset": 0, "end_offset": 3, "type": "word", "position": 3}
]}
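
The last token in this result, zxy, is made up of the first letter of each syllable. The toy function below only illustrates that relationship on syllables that are already romanized. It is not how the plugin is implemented, and `first_letters` is a name invented for this example.

```shell
# first_letters SYLLABLE...: join the first character of each argument.
# printf reuses the format for every argument, and %.1s keeps one character.
first_letters() {
  printf '%.1s' "$@"
  printf '\n'
}

# Example: first_letters zhang xue you
```

This is why a query such as ldh can later find "刘德华": the abbreviation is indexed alongside the full syllables.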

5. Combined IK + pinyin configuration

5.1 Creating the index and configuring the analyzer

Create an index and configure its analysis settings:

curl -XPUT "http://localhost:9200/medcl/" -d'
{
  "index": {
    "analysis": {
      "analyzer": {
        "ik_pinyin_analyzer": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "filter": ["my_pinyin", "word_delimiter"]
        }
      },
      "filter": {
        "my_pinyin": {
          "type": "pinyin",
          "first_letter": "prefix",
          "padding_char": " "
        }
      }
    }
  }
}'

Create a type and set its mapping:

curl -XPOST http://localhost:9200/medcl/folks/_mapping -d'
{
  "folks": {
    "properties": {
      "name": {
        "type": "keyword",
        "fields": {
          "pinyin": {
            "type": "text",
            "store": "no",
            "term_vector": "with_positions_offsets",
            "analyzer": "ik_pinyin_analyzer",
            "boost": 10
          }
        }
      }
    }
  }
}'

5.2 Indexing test documents

Index two test documents.
Document 1:

curl -XPOST http://localhost:9200/medcl/folks/andy -d'{"name":"刘德华"}'

Document 2:

curl -XPOST http://localhost:9200/medcl/folks/tina -d'{"name":"中华人民共和国国歌"}'

5.3 Test (1): pinyin analysis

All four of the following commands match "刘德华":

curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:liu"
curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:de"
curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:hua"
curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:ldh"
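
To avoid retyping the same command, the four searches can be wrapped in a small loop. This is only a convenience sketch around the URLs used above: `search_url` and `search_all` are names invented here, and the `pretty` parameter is added just to make the responses easier to read.

```shell
# search_url TERM: URL for a query-string search on the name.pinyin sub-field.
search_url() {
  printf 'http://localhost:9200/medcl/folks/_search?q=name.pinyin:%s&pretty\n' "$1"
}

# search_all TERM...: run one search per term (needs the running node).
search_all() {
  for term in "$@"; do
    echo "== $term"
    curl -s -XPOST "$(search_url "$term")"
    echo
  done
}

# Example: search_all liu de hua ldh
```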

5.4 Test (2): IK analysis

curl -XPOST "http://localhost:9200/medcl/_search?pretty" -d'
{"query": {"match": {"name.pinyin": "国歌"}},"highlight": {"fields": {"name.pinyin": {}}}
}'

Response:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {"total" : 5, "successful" : 5, "failed" : 0},
  "hits" : {
    "total" : 1,
    "max_score" : 16.698704,
    "hits" : [
      {
        "_index" : "medcl",
        "_type" : "folks",
        "_id" : "tina",
        "_score" : 16.698704,
        "_source" : {"name" : "中华人民共和国国歌"},
        "highlight" : {"name.pinyin" : ["<em>中华人民共和国</em><em>国歌</em>"]}
      }
    ]
  }
}

The two highlighted segments show that the IK tokenizer is taking effect.

5.5 Test (3): pinyin + IK analysis

curl -XPOST "http://localhost:9200/medcl/_search?pretty" -d'
{"query": {"match": {"name.pinyin": "zhonghua"}},"highlight": {"fields": {"name.pinyin": {}}}
}'

Response:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {"total" : 5, "successful" : 5, "failed" : 0},
  "hits" : {
    "total" : 2,
    "max_score" : 5.9814634,
    "hits" : [
      {
        "_index" : "medcl",
        "_type" : "folks",
        "_id" : "tina",
        "_score" : 5.9814634,
        "_source" : {"name" : "中华人民共和国国歌"},
        "highlight" : {"name.pinyin" : ["<em>中华人民共和国</em>国歌"]}
      },
      {
        "_index" : "medcl",
        "_type" : "folks",
        "_id" : "andy",
        "_score" : 2.2534127,
        "_source" : {"name" : "刘德华"},
        "highlight" : {"name.pinyin" : ["<em>刘德华</em>"]}
      }
    ]
  }
}

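With this setup, searches have to target the field with the .pinyin suffix. Running the same searches against the original name field returns no results, because name is mapped as keyword and is only matched as a whole value.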
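
Given the mapping in 5.1, where name is a keyword field, one would expect the original field to match only the complete stored value, for example through a term query. The sketch below builds such a request body. `term_query` is a name invented for this example, and it does no JSON escaping.

```shell
# term_query FIELD VALUE: print a JSON body for an exact-match term query.
# No escaping is performed; VALUE must not contain double quotes or backslashes.
term_query() {
  printf '{"query": {"term": {"%s": "%s"}}}\n' "$1" "$2"
}

# Example (needs the running node and the documents indexed above):
# curl -XPOST "http://localhost:9200/medcl/_search?pretty" -d "$(term_query name 刘德华)"
```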

6. References

https://github.com/medcl/elasticsearch-analysis-ik
https://github.com/medcl/elasticsearch-analysis-pinyin
https://my.oschina.net/xiaohui249/blog/214505

