ElasticSearch: Hand-Writing an ElasticSearch Analyzer (with Source Code)
1. Analyzer Plugins
ElasticSearch provides a plugin system for tokenizing text. Tokenization rules generally differ from language to language, and this plugin mechanism makes it easy to integrate analyzers for each of them.
Elasticsearch itself does not support Chinese word segmentation, but fortunately it supports writing and installing additional analysis plugins. The open-source Chinese analyzer ik is very capable: it ships with a dictionary of more than 200,000 common words, which covers most everyday segmentation needs.
1.1 What an Analyzer Plugin Does
The main job of an analyzer is to split text into tokens of the smallest useful granularity, which ElasticSearch then uses as terms in its index. Different languages have different splitting rules; the most common cases are Chinese and English segmentation.
The same text is split differently by different analyzers. For example, "中华人民共和国" is split by the ik_max_word analyzer into 中华人民共和国、中华人民、中华、华人、人民共和国、人民、共和国、共和、国, while the standard analyzer splits it into single characters: 中、华、人、民、共、和、国. The resulting terms are what ElasticSearch uses to build its index, and they are what make partial-text search possible.
2. Common Analyzers
2.1 Analyzer Overview
- standard: the default analyzer. Handles English well; it lowercases tokens and strips punctuation (stop-word removal can be enabled, but is off by default). Non-English text is split into single characters.
- whitespace: splits English text on whitespace only, with no further processing; no support for non-English text.
- simple: for English; splits on non-letter characters and lowercases the tokens. Digit characters are dropped.
- stop: goes beyond simple by also removing common English stop words (the, a, and so on); the stop-word list can be customized. No Chinese support.
- keyword: treats the whole input as a single token and never splits it; typically used on fields that require exact matches, such as postal codes and phone numbers.
- pattern: the text is treated as a regular expression, which produces a set of terms that are then used to query Elasticsearch.
- snowball: the standard analyzer with a snowball (stemming) filter added; Lucene officially discourages its use.
- language: a collection of analyzers for parsing text in specific languages; Chinese is not among them.
- ik: an open-source, lightweight Chinese word-segmentation toolkit written in Java. It uses its own "forward iterative finest-granularity segmentation algorithm" and supports two modes, fine-grained and maximum word length. It segments English letters, digits, and Chinese words, is compatible with Korean and Japanese characters, and supports user-defined dictionaries. It ships with two analyzers:
  - ik_max_word: the finest-grained split, producing as many words as possible
  - ik_smart: the coarsest-grained split; words already emitted are not reused as parts of other words
- pinyin: matches Chinese content in Elasticsearch against pinyin typed by the user.
2.2 Analyzer Examples
Results of running different analyzers over the same input.
Input: 栖霞站长江线14w6号断路器
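The token lists in the following subsections can be reproduced with Elasticsearch's _analyze API; the request below shows the call for the standard analyzer, and swapping in "ik_smart", "ik_max_word", or "pinyin" produces the other outputs (assuming the corresponding plugins are installed):

```
POST _analyze
{
  "analyzer": "standard",
  "text": "栖霞站长江线14w6号断路器"
}
```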
2.2.1 standard
{"tokens": [{"token": "栖","start_offset": 0,"end_offset": 1,"type": "<IDEOGRAPHIC>","position": 0},{"token": "霞","start_offset": 1,"end_offset": 2,"type": "<IDEOGRAPHIC>","position": 1},{"token": "站","start_offset": 2,"end_offset": 3,"type": "<IDEOGRAPHIC>","position": 2},{"token": "长","start_offset": 3,"end_offset": 4,"type": "<IDEOGRAPHIC>","position": 3},{"token": "江","start_offset": 4,"end_offset": 5,"type": "<IDEOGRAPHIC>","position": 4},{"token": "线","start_offset": 5,"end_offset": 6,"type": "<IDEOGRAPHIC>","position": 5},{"token": "14w6","start_offset": 6,"end_offset": 10,"type": "<ALPHANUM>","position": 6},{"token": "号","start_offset": 10,"end_offset": 11,"type": "<IDEOGRAPHIC>","position": 7},{"token": "断","start_offset": 11,"end_offset": 12,"type": "<IDEOGRAPHIC>","position": 8},{"token": "路","start_offset": 12,"end_offset": 13,"type": "<IDEOGRAPHIC>","position": 9},{"token": "器","start_offset": 13,"end_offset": 14,"type": "<IDEOGRAPHIC>","position": 10}]
}
2.2.2 ik
ik_smart
{"tokens": [{"token": "栖霞","start_offset": 0,"end_offset": 2,"type": "CN_WORD","position": 0},{"token": "站","start_offset": 2,"end_offset": 3,"type": "CN_CHAR","position": 1},{"token": "长江","start_offset": 3,"end_offset": 5,"type": "CN_WORD","position": 2},{"token": "线","start_offset": 5,"end_offset": 6,"type": "CN_CHAR","position": 3},{"token": "14w6","start_offset": 6,"end_offset": 10,"type": "LETTER","position": 4},{"token": "号","start_offset": 10,"end_offset": 11,"type": "COUNT","position": 5},{"token": "断路器","start_offset": 11,"end_offset": 14,"type": "CN_WORD","position": 6}]
}
ik_max_word
{"tokens": [{"token": "栖霞","start_offset": 0,"end_offset": 2,"type": "CN_WORD","position": 0},{"token": "站长","start_offset": 2,"end_offset": 4,"type": "CN_WORD","position": 1},{"token": "长江","start_offset": 3,"end_offset": 5,"type": "CN_WORD","position": 2},{"token": "线","start_offset": 5,"end_offset": 6,"type": "CN_CHAR","position": 3},{"token": "14w6","start_offset": 6,"end_offset": 10,"type": "LETTER","position": 4},{"token": "14","start_offset": 6,"end_offset": 8,"type": "ARABIC","position": 5},{"token": "w","start_offset": 8,"end_offset": 9,"type": "ENGLISH","position": 6},{"token": "6","start_offset": 9,"end_offset": 10,"type": "ARABIC","position": 7},{"token": "号","start_offset": 10,"end_offset": 11,"type": "COUNT","position": 8},{"token": "断路器","start_offset": 11,"end_offset": 14,"type": "CN_WORD","position": 9},{"token": "断路","start_offset": 11,"end_offset": 13,"type": "CN_WORD","position": 10},{"token": "器","start_offset": 13,"end_offset": 14,"type": "CN_CHAR","position": 11}]
}
2.2.3 pinyin
{"tokens": [{"token": "q","start_offset": 0,"end_offset": 0,"type": "word","position": 0},{"token": "qi","start_offset": 0,"end_offset": 0,"type": "word","position": 0},{"token": "栖霞站长江线14w6号断路器","start_offset": 0,"end_offset": 0,"type": "word","position": 0},{"token": "qixiazhanzhangjiangxian14w6haoduanluqi","start_offset": 0,"end_offset": 0,"type": "word","position": 0},{"token": "qxzzjx14w6hdlq","start_offset": 0,"end_offset": 0,"type": "word","position": 0},{"token": "x","start_offset": 0,"end_offset": 0,"type": "word","position": 1},{"token": "xia","start_offset": 0,"end_offset": 0,"type": "word","position": 1},{"token": "z","start_offset": 0,"end_offset": 0,"type": "word","position": 2},{"token": "zhan","start_offset": 0,"end_offset": 0,"type": "word","position": 2},{"token": "zhang","start_offset": 0,"end_offset": 0,"type": "word","position": 3},{"token": "j","start_offset": 0,"end_offset": 0,"type": "word","position": 4},{"token": "jiang","start_offset": 0,"end_offset": 0,"type": "word","position": 4},{"token": "xian","start_offset": 0,"end_offset": 0,"type": "word","position": 5},{"token": "14","start_offset": 0,"end_offset": 0,"type": "word","position": 6},{"token": "w","start_offset": 0,"end_offset": 0,"type": "word","position": 7},{"token": "6","start_offset": 0,"end_offset": 0,"type": "word","position": 8},{"token": "h","start_offset": 0,"end_offset": 0,"type": "word","position": 9},{"token": "hao","start_offset": 0,"end_offset": 0,"type": "word","position": 9},{"token": "d","start_offset": 0,"end_offset": 0,"type": "word","position": 10},{"token": "duan","start_offset": 0,"end_offset": 0,"type": "word","position": 10},{"token": "l","start_offset": 0,"end_offset": 0,"type": "word","position": 11},{"token": "lu","start_offset": 0,"end_offset": 0,"type": "word","position": 11}]
}
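To actually apply one of these analyzers, it is configured in the index mapping. A hedged sketch (the index and field names here are invented for illustration; a common pattern is fine-grained analysis at index time with coarse-grained analysis at search time):

```
PUT /device_names
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
```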
3. A Custom Analyzer
Each of the three analyzers above may be good enough in certain scenarios; here is why a custom analyzer can still be necessary.
As everyone knows, pinyin search is essential in recommendation systems. Given the input "ls", we want terms related to "ls" back: 零食 (ls), 雷蛇 (ls), 林书豪 (lsh), and so on. Those are all correct hits. But if we rely on the pinyin analyzer alone, terms matching only "l" will be hit as well, for example 李宁 (l) and 兰蔻 (l), and recommendations like those make no sense.
With the pinyin analyzer, the input 栖霞站长江线14w6号断路器 also produces many single-letter index terms such as "h", "d", and "l". A user who searches for "ql" has no interest in this 栖霞站长江线14w6号断路器 record at all, yet because its "l" term is hit, the record is returned anyway.
So how do we keep these single-letter terms out of the index? Below we hand-write our own customized ElasticSearch analyzer!
3.1 How the Analyzer Works
3.1.1 Plugin Workflow
- During startup, ElasticSearch reads the plugins/<analyzer>/plugin-descriptor.properties file, and uses it to locate and initialize the plugin's entry class, which the classname property points to.
- The entry class must extend AnalysisPlugin, so that ElasticSearch can call our custom class to obtain analyzer objects.
- When ElasticSearch runs an analysis, it instantiates an AnalyzerProvider whose get() method returns our custom Analyzer; internally, the Analyzer's tokenStream() method calls createComponents() to instantiate our custom Tokenizer.

The Tokenizer is the core component of a custom analyzer. It has four key methods:
- incrementToken(): reports whether unread terms remain in the token list, and fills in the basic term attributes: length, start offset, end offset, the term text, and so on
- reset(): resets internal state, runs the custom segmentation model over the user's input string, and appends the results to the token list
- end(): sets the offset information for the end of the token stream
- close(): releases the input stream and any custom data

Every time a Tokenizer finishes segmenting one piece of user input, it goes through these four calls in order.
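The four-step lifecycle above can be sketched in plain Java. This is a simplified, dependency-free illustration, not the plugin's real code: an actual implementation extends Lucene's org.apache.lucene.analysis.Tokenizer and exposes term data through attributes such as CharTermAttribute and OffsetAttribute, and here whitespace splitting stands in for a real segmentation model. The class name LifecycleTokenizer is invented for this sketch.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a Lucene Tokenizer, illustrating the four
// lifecycle methods: reset(), incrementToken(), end(), close().
public class LifecycleTokenizer {
    private final Reader input;
    private final List<String> terms = new ArrayList<>();
    private int index;                  // next unread term in the list
    private int offset;                 // running offset into the input
    private String term;                // current term text
    private int startOffset, endOffset; // current term's offsets

    public LifecycleTokenizer(Reader input) {
        this.input = input;
    }

    // reset(): clear old state, read the whole input, and run the
    // (stand-in) segmentation model to fill the term list.
    public void reset() {
        terms.clear();
        index = 0;
        offset = 0;
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = input.read()) != -1) sb.append((char) c);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (String t : sb.toString().split("\\s+")) {
            if (!t.isEmpty()) terms.add(t);
        }
    }

    // incrementToken(): false once the term list is exhausted; otherwise
    // populate the term text plus start/end offsets for the next term.
    public boolean incrementToken() {
        if (index >= terms.size()) return false;
        term = terms.get(index++);
        startOffset = offset;
        endOffset = startOffset + term.length();
        offset = endOffset + 1; // skip the separator character
        return true;
    }

    // end(): record the final offset once tokenization is finished.
    public void end() {
        if (offset > 0) startOffset = endOffset = offset - 1;
    }

    // close(): release the input stream and custom data.
    public void close() throws IOException {
        input.close();
        terms.clear();
    }

    public String term()     { return term; }
    public int startOffset() { return startOffset; }
    public int endOffset()   { return endOffset; }

    public static void main(String[] args) throws IOException {
        LifecycleTokenizer tok = new LifecycleTokenizer(new StringReader("qi xia zhan"));
        tok.reset();                   // 1. load input and segment it
        while (tok.incrementToken()) { // 2. emit term by term
            System.out.println(tok.term() + " [" + tok.startOffset() + "," + tok.endOffset() + ")");
        }
        tok.end();                     // 3. final offsets
        tok.close();                   // 4. release resources
    }
}
```

The same shape carries over to the real plugin: all the segmentation work happens in reset(), while incrementToken() merely walks the prepared term list.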
3.2 Verifying the Analyzer
Install the analyzer
Copy the packaged zip file of the analyzer into the plugins directory under the ElasticSearch installation directory; the elasticsearch-plugin list command shows which plugins are installed.
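The descriptor that ElasticSearch reads at startup sits inside the plugin directory. A minimal plugin-descriptor.properties might look like the following; the version numbers and class name are illustrative, not taken from this article's source:

```
description=custom ik + pinyin analyzer
version=1.0.0
name=search-analyzer
classname=com.example.analyzer.SearchAnalyzerPlugin
java.version=17
elasticsearch.version=8.6.0
```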
Start ElasticSearch
Verify the analyzer
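The verification request goes through the _analyze API. The analyzer name below, my_analyzer, is a placeholder, since the article does not state the name the plugin registers:

```
POST _analyze
{
  "analyzer": "my_analyzer",
  "text": "栖霞站长江线14w6号断路器"
}
```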
{"tokens": [{"token": "栖霞","start_offset": 0,"end_offset": 2,"type": "word","position": 0},{"token": "qixia","start_offset": 0,"end_offset": 2,"type": "word","position": 1},{"token": "qx","start_offset": 0,"end_offset": 2,"type": "word","position": 2},{"token": "栖霞站长江线14w6号断路器","start_offset": 1,"end_offset": 14,"type": "word","position": 3},{"token": "qixiazhanzhangjiangxian14w6haoduanluqi","start_offset": 1,"end_offset": 14,"type": "word","position": 4},{"token": "qxzzjx14w6hdlq","start_offset": 1,"end_offset": 14,"type": "word","position": 5},{"token": "站","start_offset": 2,"end_offset": 3,"type": "word","position": 6},{"token": "长江","start_offset": 3,"end_offset": 5,"type": "word","position": 7},{"token": "zhangjiang","start_offset": 3,"end_offset": 5,"type": "word","position": 8},{"token": "zj","start_offset": 3,"end_offset": 5,"type": "word","position": 9},{"token": "线","start_offset": 5,"end_offset": 6,"type": "word","position": 10},{"token": "14w6","start_offset": 6,"end_offset": 10,"type": "word","position": 11},{"token": "号","start_offset": 10,"end_offset": 11,"type": "word","position": 12},{"token": "断路","start_offset": 11,"end_offset": 13,"type": "word","position": 13},{"token": "duanlu","start_offset": 11,"end_offset": 13,"type": "word","position": 14},{"token": "dl","start_offset": 11,"end_offset": 13,"type": "word","position": 15},{"token": "断路器","start_offset": 11,"end_offset": 14,"type": "word","position": 16},{"token": "duanluqi","start_offset": 11,"end_offset": 14,"type": "word","position": 17},{"token": "dlq","start_offset": 11,"end_offset": 14,"type": "word","position": 18}]
}
3.3 Building the Source
- JDK 17, and an IDEA release that supports JDK 17
- The Lucene version must match the one required by your ElasticSearch version
- The packaged analyzer must be built for the same version as the ElasticSearch instance it is installed into
Source code: https://gitee.com/frank_zxd/elasticsearch-search-analyzer