倒排索引、分词、同义词

倒排索引

正排索引：文档ID =>文档内容和单词
倒排索引：词条 =>文档ID
倒排索引组成：
- 词条字典（Term Dictionary），记录所有的词条与倒排列表的映射关系。这个字典很大，通过B+树或哈希拉链法实现，以满足高性能的插入与查询。
- 倒排列表（Posting List），由倒排索引项组成，包含如下信息：
  - 文档ID
  - 词频（TF）：该单词在文档中出现的次数，用于相关性评分
  - 位置（Position）：单词在文档中分词的位置（也就是第几个分词），用于语句搜索（phrase query）
  - 偏移（Offset）：记录单词的开始结束位置，实现高亮显示（从第几个字符开始，到第几个字符结束）
ES的倒排索引：
- ES文档的每个字段，默认都有自己的倒排索引
- 可以通过mappings，指定某些字段不做索引，以节省存储空间，但会导致该字段无法被搜索。

分词

文本分析（Analysis），就是把全文转换成一系列词条（term / token）的过程（也叫分词）。文本分析是通过分词器（Analyzer）实现的。

https://www.elastic.co/guide/en/elasticsearch/reference/7.2/analysis.html

分词器有两个作用：

数据写入时转换词条
分析查询语句

ES内置了多种分词器：https://www.elastic.co/guide/en/elasticsearch/reference/7.2/analysis-analyzers.html

在mapping中，每个text字段都可以指定专门的分词器：

PUT {index}
{"mappings": {"properties": {"{field}": {"type": "text","analyzer": "{standard}"  // 指定分词器}}}
}

说明：已经存在的索引执行上述操作会报错。可以在创建索引时指定。standard分词器是默认的分词器。

通常情况下，创建（index）和查询（search）时，应该使用同样的分词器。执行全文检索时，比如match查询，会在mapping中为每个字段查找分词器。在查询具体字段时使用何种分词器，遵循如下查找顺序：

查询条件有指定analyzer
mapping中的search_anylyzer参数
mapping中的analyzer参数
索引设置中的default_search
索引设置中的default
standard分词器

分词器由以下三部分组合而成：

Character Filters：处理原始文本，比如去除html标签。一个分词器可以没有或拥有多个char filter。
Tokenizer：接收上一步处理后的文本流，按照规则切分出词条并输出，比如：
- whitespace 按空格切分词条
- keyword 不做分词，直接作为关键词
- path_hierarchy, 按路径切分，保证搜索任意一级目录都可以搜索到结果
更多类别参考官网，每个分词器必须有一个tokenizer。
Token Filters：接收词条流，进行再加工，可能会添加、删除、或者变更词条。比如：
- lowercase 将所有的词条转为小写
- stop将停顿词（比如，the, in, on）从词条流中移除
- synonym往词条流中引入同义词
- multiplexer 一个单词触发多个词条（应用多个过滤器，每个过滤器触发一个），重复的词条会被去除。建议：将同义词过滤器追加到每个multiplexer过滤器列表后面，不要放在multiplexer之后，以避免异常。建议具体见官网底部的note
更过token filter类别可以参考官网，一个分词器可以没有或拥有多个token filter

通过组合以上三部分，我们也可以自定义分词器。

###analyze接口

analyze接口可以用来执行文本分析处理，并查看分词结果，比如：

GET _analyze
{"analyzer": "standard","text": "hello world"
}

使用standard分词器对文本进行分词，返回结果如下：

{"tokens" : [{"token" : "hello","start_offset" : 0,"end_offset" : 5,"type" : "<ALPHANUM>","position" : 0},{"token" : "world","start_offset" : 6,"end_offset" : 11,"type" : "<ALPHANUM>","position" : 1}]
}

自定义分词

也可以自己定义分词器，其实就是组合tokenizer, token filter 和 char filer，比如：

示例1：自定义token filter

POST _analyze
{"tokenizer": "standard","filter":  [ "lowercase", "asciifolding" ],"text":      "Is this déja vu?"
}

示例2：自定义char_filter

POST _analyze
{"tokenizer": "standard","char_filter": [{"type": "mapping","mappings": ["- => _"]  // 将中划线替换为下划线, 甚至可以将emoji表情替换为单词}],"text": "123-456, hello-world"
}// 结果如下：
{"tokens" : [{"token" : "123_456","start_offset" : 0,"end_offset" : 7,"type" : "<NUM>","position" : 0},{"token" : "hello_world","start_offset" : 9,"end_offset" : 20,"type" : "<ALPHANUM>","position" : 1}]
}

也可以为某个索引（集）自定义分词器：

put forum
{"settings": {"analysis": {"analyzer": {"my_analyzer": {  // 自定义分词器名字"type": "custom","tokenizer": "standard","filter": ["lowercase","asciifolding"]}}}},"mappings": {"properties": {"title": {"type": "text","analyzer": "my_analyzer"  // 使用自定义分词器}}}
}

测试分词效果：

GET forum/_analyze
{"analyzer": "my_analyzer","text": "Mr déjà"
}

分词结果如下：

{"tokens" : [{"token" : "mr","start_offset" : 12,"end_offset" : 14,"type" : "<ALPHANUM>","position" : 3},{"token" : "deja","start_offset" : 15,"end_offset" : 19,"type" : "<ALPHANUM>","position" : 4}]
}

mapping自定义分词器完全版：

{'settings': {'analysis': {'analyzer': {  # 自定义analyzer: 可以使用下面自定义的tokenizer, char_filter，token_filter 等内容'my_analyzer': {'type': 'custom','char_filter': 'my_char_flt','tokenizer': 'my_tokenizer','filter': ['my_flt'],  # token_filter}},'char_filter': {  # 自定义 char_filter'my_char_flt': {}},'tokenizer': {  # 自定义 tokenizer'my_tokenizer': {}},'filter': {  # 自定义 token_filter'my_flt': {'type': 'synonym','synonyms_path': 'xxxxx'}},# "search_analyzer": {}  # 也可以定义搜索分词器, 同analyzer},},'mappings': {'properties': {'content': {'type': 'text','analyzer': 'my_analyzer',  # 使用自定义的分词器'search_analyzer': 'ik_smart',}}}
}

中文分词

ES内置的中文分词不好用，可以通过插件安装，比如非常流行的ik分词:

 .\elasticsearch-plugin.bat install https://github.com/medcl/elasticsearch-ana
lysis-ik/releases/download/v7.2.0/elasticsearch-analysis-ik-7.2.0.zip

装好后重启ES，查看插件列表：

GET _cat/plugins

返回刚刚安装的ik插件

DESKTOP-L1E59GR analysis-ik 7.2.0

ik支持一些自定义配置，比如扩展词典和热词更新。配置文件默认位置在es安装目录的configik目录下。具体配置可以参考：https://github.com/medcl/elasticsearch-analysis-ik

安装完成后，我们测试下效果：

GET _analyze
{"analyzer": "ik_smart","text": "创新空间的不朽传奇"
}

分词结果：

{"tokens" : [{"token" : "创新","start_offset" : 0,"end_offset" : 2,"type" : "CN_WORD","position" : 0},{"token" : "空间","start_offset" : 2,"end_offset" : 4,"type" : "CN_WORD","position" : 1},{"token" : "的","start_offset" : 4,"end_offset" : 5,"type" : "CN_CHAR","position" : 2},{"token" : "不朽","start_offset" : 5,"end_offset" : 7,"type" : "CN_WORD","position" : 3},{"token" : "传奇","start_offset" : 7,"end_offset" : 9,"type" : "CN_WORD","position" : 4}]
}

试试官网的例子，定义news索引的mapping：

PUT news
{"mappings": {"properties": {"content": {"type": "text","analyzer": "ik_max_word", "search_analyzer": "ik_smart"}}}
}

按照文档说明，索引一些doc进去：

PUT news/_doc/4
{"content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
}

进行高亮查询：

POST news/_search
{"query": {"match": {"content": "中国"}},"highlight": {"pre_tags": "<bold>","post_tags": "</bold>","fields": {"content": {}}}
}

返回结果如下：

{"took" : 2,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 2,"relation" : "eq"},"max_score" : 0.642793,"hits" : [{"_index" : "news","_type" : "_doc","_id" : "3","_score" : 0.642793,"_source" : {"content" : "中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"},"highlight" : {"content" : ["中韩渔警冲突调查：韩警平均每天扣1艘<bold>中国</bold>渔船"]}},{"_index" : "news","_type" : "_doc","_id" : "4","_score" : 0.642793,"_source" : {"content" : "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"},"highlight" : {"content" : ["<bold>中国</bold>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"]}}]}
}

测试下来后，在索引阶段，建议使用ik_max_word分词器，能更灵活的切割词条。

除了IK分词，还可以使用THULAC分词器，号称准确率目前最高。不过目前只更新到6.4，不知道7.2能否使用。

搜索时指定分词器

GET post001/_search
{"query": {"match": {"content": {"query": "考研","analyzer": "standard"}}},"highlight": {"fields": {"content": {}}}}

同义词图过滤器 Synonym Graph Token Filter

synonym_graph词条过滤器（token filter）用于处理同义词。

注意：该过滤器仅适用于作为搜索分词器的一部分。如果想在索引期间应用同义词，需要使用标准的同义词过滤器

同义词可以使用配置文件指定，如下：

PUT /{index}
{"settings": {"index" : {"analysis" : {"analyzer" : {"search_synonyms" : { // 自定义分词器名字"tokenizer" : "whitespace","filter" : ["graph_synonyms"]  // 使用自定义filter}},"filter" : {"graph_synonyms" : {"type" : "synonym_graph",  // 指定类型为同义词图"synonyms_path" : "analysis/synonym.txt"  // 指定同义词词库位置}}}}}
}

说明：同义词词库的位置是相对于config目录的相对地址。另外以下参数也是可用的：

expand 默认为true
leninet 默认为false，是否忽略同义词配置的解析异常，记住，如果设置为true，只要那些无法被解析到同义词规则会被忽略。如下示例：

PUT /{test_index}
{"settings": {"index" : {"analysis" : {"analyzer" : {"synonym" : {"tokenizer" : "standard","filter" : ["my_stop", "synonym_graph"]}},"filter" : {"my_stop": {"type" : "stop","stopwords": ["bar"]},"synonym_graph" : {"type" : "synonym_graph","lenient": true,"synonyms" : ["foo, bar => baz"]  // 手动定义同义词映射}}}}}
}

说明：bar在自定义的my_stop的filter中被剔除，但是在synonym_graph的filter中，foo => baz仍然被添加成功。但如果添加的是"foo, baz => bar", 那么什么也不会被添加到同义词列表。这时因为映射到目标单词"bar"本身作为停用词已被剔除。相似的，如果映射是"bar, foo, baz"并且expand设置为false，那么不会添加任何同义词映射，因为当expand为false时，目标映射是第一个单词，也就是"bar"。但是如果expand=true，那么映射将会添加为"foo, baz => foo, baz"，即所有词相互映射，除了停用词。

同义词配置文件如下：

# 空行和#号开头的都是注释# 近义词以 => 映射，这种映射会忽略模式中的 expand 参数
# 匹配到"=>"左侧的词条后，被替换为"=>"右侧的词条。
i-pod, i pod => ipod,
sea biscuit, sea biscit => seabiscuit# 近义词以逗号分隔，映射行为由模式中的expand参数决定。如此同样的近义词文件，可以用于不同的策略
ipod, i-pod, i pod
foozball , foosball
universe , cosmos
lol, laughing out loud# 当expand==true, "ipod, i-pod, i pod" 等价于:
ipod, i-pod, i pod => ipod, i-pod, i pod
# 当expand==false, "ipod, i-pod, i pod" 仅映射第一个单词， 等价于:
ipod, i-pod, i pod => ipod# 多个同义词映射将会合并
foo => foo bar
foo => baz
# 等价于：
foo => foo bar, baz

虽然可以直接在filter中使用synonyms定义同义词，比如：

PUT /test_index
{"settings": {"index" : {"analysis" : {"filter" : {"synonym" : {"type" : "synonym_graph","synonyms" : [ // 直接定义"lol, laughing out loud","universe, cosmos"]}}}}}
}

但是对于大型的同义词集合，还是推荐使用synonyms_path参数在文件中定义。

测试

定义mapping

PUT article
{"settings": {"index": {"analysis": {"analyzer": {"my_analyzer": {"tokenizer": "ik_max_word","filter": ["my_synonyms"]}},"filter": {"my_synonyms": {"type": "synonym_graph","synonyms_path": "analysis/synonyms.txt"}},"search_analyzer": "ik_smart"}}}
}

在ES的config/analysis目录下新建synonyms.txt文件，内容如下：

北京大学, 北大

索引一篇文档：

PUT article/_doc/1
{"content": "北京大学今年考研一飞冲天"
}

试着搜索：

GET article/_search
{"query": {"match": {"content": {"query": "北大"}}}
}

搜索结果：

{"took" : 0,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 1,"relation" : "eq"},"max_score" : 0.5753642,"hits" : [{"_index" : "article","_type" : "_doc","_id" : "1","_score" : 0.5753642,"_source" : {"content" : "北京大学今年考研一飞冲天"}}]}
}

同义词过滤器 Synonym Token Filter

同义词过滤器配置示例：

PUT /test_index
{"settings": {"index" : {"analysis" : {"analyzer" : {"my-synonym" : {  // 自定义分词器名字"tokenizer" : "whitespace","filter" : ["my_synonym_fltr"]  // 指定过滤器}},"filter" : {"my_synonym_fltr" : {  // 自定义同义词过滤器名字"type" : "synonym",  // type 为synonym，"synonyms_path" : "analysis/synonym.txt"}}}}}
}

其实同义词过滤器和以上面的同义词图过滤器配置用法都是一样的。区别如下：

前者type=synonym，后者type=synonym_graph
前者可以在文档索引期间使用，后者只能作为搜索分词的一部分。

另外在同义词图中的测试，将type更改后，测试仍然适用。