Table of contents

  • Overview
  • TF/IDF
  • Links
  • Example
    • DSL
    • Plain query
    • dis_max query
  • best fields strategy: dis_max

Overview

Continuing to study Elasticsearch with instructor 中华石杉; this is part 10.

Course URL: https://www.roncoo.com/view/55


TF/IDF

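The classic default scoring mechanism of Apache Lucene (since ES 5.0 the default similarity is BM25, which is still built on TF and IDF)

  • TF (Term Frequency): based on terms (see term vector below); the number of times a term occurs in a given document.
    The higher the term frequency, the higher the document scores.

  • IDF (Inverse Document Frequency): based on terms (see term vector below); it tells the scoring formula how rare the term is.
    The higher the inverse document frequency, the rarer the term. The scoring formula uses this factor to boost documents that contain rare terms.

Term vector: a term vector is a miniature inverted index for a single document. Each dimension of the vector pairs a term with its frequency, and it can also carry the term's positions. Both Lucene and ES leave term vectors disabled by default; they have to be enabled for features that rely on them, such as certain kinds of highlighting.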
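The two factors can be illustrated with a small Python sketch. It is not Lucene's actual formula: it uses the textbook definition idf = log(N / df), and the `tokenize` helper is a hypothetical stand-in for a real analyzer. The corpus is the five title values indexed later in this article.

```python
import math

# Toy corpus: the five title values used later in this article.
docs = {
    1: "this is java and elasticsearch blog",
    2: "this is java blog",
    3: "this is elasticsearch blog",
    4: "this is java, elasticsearch, hadoop blog",
    5: "this is spark blog",
}


def tokenize(text):
    # Hypothetical stand-in for an analyzer: lowercase, drop commas, split on whitespace.
    return text.lower().replace(",", " ").split()


def tf(term, doc_id):
    # Term frequency: how many times the term occurs in one document.
    return tokenize(docs[doc_id]).count(term)


def idf(term):
    # Textbook inverse document frequency, log(N / df).
    # Lucene's real formulas add smoothing terms; this is only the basic idea.
    df = sum(1 for text in docs.values() if term in tokenize(text))
    return math.log(len(docs) / df) if df else 0.0


print("tf(java, doc2) =", tf("java", 2))
print("idf(java)  =", idf("java"))
print("idf(spark) =", idf("spark"))
print("idf(blog)  =", idf("blog"))
```

"spark" occurs in one title only, so its IDF is higher than that of "java", which occurs in three; "blog" occurs in every title and carries no weight at all under this definition.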


Links

Official guide: https://www.elastic.co/guide/en/elasticsearch/guide/current/_tuning_best_fields_queries.html

https://www.elastic.co/guide/en/elasticsearch/reference/7.2/query-dsl-dis-max-query.html


Why dis_max appears not to take effect when there is very little data: https://stackoverflow.com/questions/38065692/dis-max-query-isnt-looking-for-the-best-matching-clause


A related article by another blogger:
https://blog.csdn.net/dm_vincent/article/details/41820537


Example

ES version: 6.4.1

To make the effect easy to see, we delete the forum index created earlier and rebuild it.

The DSL is as follows.

DSL


DELETE /forum

PUT /forum
{ "settings" : { "number_of_shards" : 1 }}

POST /forum/article/_bulk
{ "index": { "_id": 1 }}
{ "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 2 }}
{ "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2017-01-02" }
{ "index": { "_id": 3 }}
{ "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2017-01-01" }
{ "index": { "_id": 4 }}
{ "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-01-02" }

POST /forum/article/_bulk
{"update":{"_id":"1"}}
{"doc":{"tag":["java","hadoop"]}}
{"update":{"_id":"2"}}
{"doc":{"tag":["java"]}}
{"update":{"_id":"3"}}
{"doc":{"tag":["hadoop"]}}
{"update":{"_id":"4"}}
{"doc":{"tag":["java","elasticsearch"]}}

POST /forum/article/_bulk
{"update":{"_id":"1"}}
{"doc":{"tag_cnt":2}}
{"update":{"_id":"2"}}
{"doc":{"tag_cnt":1}}
{"update":{"_id":"3"}}
{"doc":{"tag_cnt":1}}
{"update":{"_id":"4"}}
{"doc":{"tag_cnt":2}}

POST /forum/article/_bulk
{"update":{"_id":"1"}}
{"doc":{"view_cnt":30}}
{"update":{"_id":"2"}}
{"doc":{"view_cnt":50}}
{"update":{"_id":"3"}}
{"doc":{"view_cnt":100}}
{"update":{"_id":"4"}}
{"doc":{"view_cnt":80}}

POST /forum/article/_bulk
{"index":{"_id":5}}
{"articleID":"DHJK-B-1395-#Ky5","userID":3,"hidden":false,"postDate":"2019-06-01","tag":["elasticsearch"],"tag_cnt":1,"view_cnt":10}

POST /forum/article/_bulk
{"update":{"_id":"5"}}
{"doc":{"postDate":"2019-05-01"}}

POST /forum/article/_bulk
{"update":{"_id":"1"}}
{"doc":{"title":"this is java and elasticsearch blog"}}
{"update":{"_id":"2"}}
{"doc":{"title":"this is java blog"}}
{"update":{"_id":"3"}}
{"doc":{"title":"this is elasticsearch blog"}}
{"update":{"_id":"4"}}
{"doc":{"title":"this is java, elasticsearch, hadoop blog"}}
{"update":{"_id":"5"}}
{"doc":{"title":"this is spark blog"}}

POST /forum/article/_bulk
{"update":{"_id":"1"}}
{"doc":{"content":"i like to write best elasticsearch article"}}
{"update":{"_id":"2"}}
{"doc":{"content":"i think java is the best programming language"}}
{"update":{"_id":"3"}}
{"doc":{"content":"i am only an elasticsearch beginner"}}
{"update":{"_id":"4"}}
{"doc":{"content":"elasticsearch and hadoop are all very good solution, i am a beginner"}}
{"update":{"_id":"5"}}
{"doc":{"content":"spark is best big data solution based on scala ,an programming language similar to java"}}

At this point the test data is in place. Now let's see how dis_max works.

GET /forum/article/_search

The data looks like this:

{
  "took": 0, "timed_out": false,
  "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0},
  "hits": {
    "total": 5, "max_score": 1,
    "hits": [
      {"_index": "forum", "_type": "article", "_id": "1", "_score": 1,
       "_source": {"articleID": "XHDK-A-1293-#fJ3", "userID": 1, "hidden": false, "postDate": "2017-01-01", "tag": ["java","hadoop"], "tag_cnt": 2, "view_cnt": 30, "title": "this is java and elasticsearch blog", "content": "i like to write best elasticsearch article"}},
      {"_index": "forum", "_type": "article", "_id": "2", "_score": 1,
       "_source": {"articleID": "KDKE-B-9947-#kL5", "userID": 1, "hidden": false, "postDate": "2017-01-02", "tag": ["java"], "tag_cnt": 1, "view_cnt": 50, "title": "this is java blog", "content": "i think java is the best programming language"}},
      {"_index": "forum", "_type": "article", "_id": "3", "_score": 1,
       "_source": {"articleID": "JODL-X-1937-#pV7", "userID": 2, "hidden": false, "postDate": "2017-01-01", "tag": ["hadoop"], "tag_cnt": 1, "view_cnt": 100, "title": "this is elasticsearch blog", "content": "i am only an elasticsearch beginner"}},
      {"_index": "forum", "_type": "article", "_id": "4", "_score": 1,
       "_source": {"articleID": "QQPX-R-3956-#aD8", "userID": 2, "hidden": true, "postDate": "2017-01-02", "tag": ["java","elasticsearch"], "tag_cnt": 2, "view_cnt": 80, "title": "this is java, elasticsearch, hadoop blog", "content": "elasticsearch and hadoop are all very good solution, i am a beginner"}},
      {"_index": "forum", "_type": "article", "_id": "5", "_score": 1,
       "_source": {"articleID": "DHJK-B-1395-#Ky5", "userID": 3, "hidden": false, "postDate": "2019-05-01", "tag": ["elasticsearch"], "tag_cnt": 1, "view_cnt": 10, "title": "this is spark blog", "content": "spark is best big data solution based on scala ,an programming language similar to java"}}
    ]
  }
}

Plain query

First, a look at the plain DSL

GET /forum/article/_search
{
  "query": {
    "bool": {
      "should": [
        {"match": {"title": "java solution"}},
        {"match": {"content": "java solution"}}
      ],
      "minimum_should_match": 1
    }
  }
}

Response:

{
  "took": 1, "timed_out": false,
  "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0},
  "hits": {
    "total": 4, "max_score": 1.5179626,
    "hits": [
      {"_index": "forum", "_type": "article", "_id": "2", "_score": 1.5179626,
       "_source": {"articleID": "KDKE-B-9947-#kL5", "userID": 1, "hidden": false, "postDate": "2017-01-02", "tag": ["java"], "tag_cnt": 1, "view_cnt": 50, "title": "this is java blog", "content": "i think java is the best programming language"}},
      {"_index": "forum", "_type": "article", "_id": "5", "_score": 1.4233948,
       "_source": {"articleID": "DHJK-B-1395-#Ky5", "userID": 3, "hidden": false, "postDate": "2019-05-01", "tag": ["elasticsearch"], "tag_cnt": 1, "view_cnt": 10, "title": "this is spark blog", "content": "spark is best big data solution based on scala ,an programming language similar to java"}},
      {"_index": "forum", "_type": "article", "_id": "4", "_score": 1.2832261,
       "_source": {"articleID": "QQPX-R-3956-#aD8", "userID": 2, "hidden": true, "postDate": "2017-01-02", "tag": ["java","elasticsearch"], "tag_cnt": 2, "view_cnt": 80, "title": "this is java, elasticsearch, hadoop blog", "content": "elasticsearch and hadoop are all very good solution, i am a beginner"}},
      {"_index": "forum", "_type": "article", "_id": "1", "_score": 0.4889865,
       "_source": {"articleID": "XHDK-A-1293-#fJ3", "userID": 1, "hidden": false, "postDate": "2017-01-01", "tag": ["java","hadoop"], "tag_cnt": 2, "view_cnt": 30, "title": "this is java and elasticsearch blog", "content": "i like to write best elasticsearch article"}}
    ]
  }
}

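Let's analyze the result.

How the relevance score of each document is computed under Lucene's classic TF/IDF similarity: add up the scores of the queries that match, multiply by the number of matched queries, and divide by the total number of queries. (ES 6.x, which is built on Lucene 7, no longer applies the matched/total factor and simply adds up the matching scores. The response above still ranks doc2 first, because doc2 collects a score from both fields.)

Work out the score of doc2

{ "match": { "title": "java solution" }} produces a score for doc2
{ "match": { "content": "java solution" }} also produces a score for doc2

Assume the two scores are 1.1 and 1.2, so they add up to 1.1 + 1.2 = 2.3
matched query count = 2
total query count = 2

2.3 * 2 / 2 = 2.3


Work out the score of doc5

{ "match": { "title": "java solution" }} produces no score for doc5
{ "match": { "content": "java solution" }} produces a score for doc5

So only one query contributes a score, say 2.3
matched query count = 1
total query count = 2

2.3 * 1 / 2 = 1.15

score of doc5 = 1.15 < score of doc2 = 2.3


The document with id=2 is ranked first, but we would actually like id=5 to come first: after all, the content field of id=5 contains both java and solution. So let's look at dis_max.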
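The arithmetic of the walkthrough above can be replayed with a short Python sketch, using the same made-up clause scores. The function name and the `use_coord` flag are hypothetical; the matched/total factor belongs to Lucene's classic TF/IDF similarity, and passing `use_coord=False` returns the plain sum of the matching clause scores.

```python
def bool_should_score(clause_scores, use_coord=True):
    """Combine the scores of several should clauses for one document.

    clause_scores holds one entry per clause; None marks a clause that did not match.
    """
    matched = [s for s in clause_scores if s is not None]
    total = sum(matched)
    if use_coord:
        # Classic TF/IDF only: scale by matched clause count / total clause count.
        total *= len(matched) / len(clause_scores)
    return total


# Made-up scores from the walkthrough: [title clause, content clause].
doc2 = [1.1, 1.2]
doc5 = [None, 2.3]

print("doc2:", round(bool_should_score(doc2), 2))
print("doc5:", round(bool_should_score(doc5), 2))
```

Either way, a document that matches weakly in two fields is rewarded for the number of fields it matches, which is exactly what pushes doc2 ahead of doc5.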


dis_max query


GET /forum/article/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {"match": {"title": "java solution"}},
        {"match": {"content": "java solution"}}
      ]
    }
  }
}

Response:

{
  "took": 0, "timed_out": false,
  "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0},
  "hits": {
    "total": 4, "max_score": 1.4233948,
    "hits": [
      {"_index": "forum", "_type": "article", "_id": "5", "_score": 1.4233948,
       "_source": {"articleID": "DHJK-B-1395-#Ky5", "userID": 3, "hidden": false, "postDate": "2019-05-01", "tag": ["elasticsearch"], "tag_cnt": 1, "view_cnt": 10, "title": "this is spark blog", "content": "spark is best big data solution based on scala ,an programming language similar to java"}},
      {"_index": "forum", "_type": "article", "_id": "2", "_score": 0.93952733,
       "_source": {"articleID": "KDKE-B-9947-#kL5", "userID": 1, "hidden": false, "postDate": "2017-01-02", "tag": ["java"], "tag_cnt": 1, "view_cnt": 50, "title": "this is java blog", "content": "i think java is the best programming language"}},
      {"_index": "forum", "_type": "article", "_id": "4", "_score": 0.79423964,
       "_source": {"articleID": "QQPX-R-3956-#aD8", "userID": 2, "hidden": true, "postDate": "2017-01-02", "tag": ["java","elasticsearch"], "tag_cnt": 2, "view_cnt": 80, "title": "this is java, elasticsearch, hadoop blog", "content": "elasticsearch and hadoop are all very good solution, i am a beginner"}},
      {"_index": "forum", "_type": "article", "_id": "1", "_score": 0.4889865,
       "_source": {"articleID": "XHDK-A-1293-#fJ3", "userID": 1, "hidden": false, "postDate": "2017-01-01", "tag": ["java","hadoop"], "tag_cnt": 2, "view_cnt": 30, "title": "this is java and elasticsearch blog", "content": "i like to write best elasticsearch article"}}
    ]
  }
}
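When the list of fields comes from application code, the same request body can be generated rather than written by hand. The following is a minimal Python sketch; `build_dis_max_query` is a hypothetical helper, not part of any Elasticsearch client, and it only builds the JSON body without sending it.

```python
import json


def build_dis_max_query(text, fields):
    # One match clause per field, all searching for the same text.
    clauses = [{"match": {field: text}} for field in fields]
    return {"query": {"dis_max": {"queries": clauses}}}


body = build_dis_max_query("java solution", ["title", "content"])
print(json.dumps(body, indent=2))
```

The printed JSON has the same structure as the body of the GET /forum/article/_search request above.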

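best fields strategy: dis_max

The best fields strategy: a document in which a single field matches as many of the keywords as possible should rank first, rather than a document in which many fields each match only a few of the keywords.

The dis_max syntax simply takes the score of the highest-scoring query among the queries.

For example

{ "match": { "title": "java solution" }} gives doc2 a score of 1.1
{ "match": { "content": "java solution" }} also gives doc2 a score, 1.2

Take the highest score, 1.2


{ "match": { "title": "java solution" }} gives doc5 no score
{ "match": { "content": "java solution" }} gives doc5 a score of 2.3

Take the highest score, 2.3

So the score of doc2 = 1.2 < the score of doc5 = 2.3, and doc5 is ranked higher.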
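The same example as a minimal Python sketch, again with the made-up scores. `dis_max_score` is a hypothetical name, and the sketch leaves out the optional tie_breaker parameter of dis_max, which is 0 by default.

```python
def dis_max_score(clause_scores):
    """Return the score of the best matching clause for one document.

    clause_scores holds one entry per query; None marks a query that did not match.
    """
    matched = [s for s in clause_scores if s is not None]
    return max(matched) if matched else 0.0


# Made-up scores from the example above: [title query, content query].
scores = {
    "doc2": dis_max_score([1.1, 1.2]),
    "doc5": dis_max_score([None, 2.3]),
}

# Sort by score, highest first, the way search results are ordered.
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```

The real responses earlier in the article show the same effect: doc5 matches in one field only and keeps its score of 1.4233948 under both queries, while doc2 falls from 1.5179626 to 0.93952733 once only its best field counts.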
