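Elasticsearch Phrase Matching, Proximity Matching, and Recall: An In-Depth Case Study (Search Systems in Production)
This blog series focuses on the core technologies behind big data and container clouds, and offers full-stack consulting for big data and cloud-native platforms; please keep following the series. Feel free to get in touch at any time for technical discussion. For more content, follow the 《数据云技术社区》 WeChat official account.
1 Preparing the example data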
POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"author_first_name" : "Peter", "author_last_name" : "Smith"} }
{ "update": { "_id": "2"} }
{ "doc" : {"author_first_name" : "Smith", "author_last_name" : "Williams"} }
{ "update": { "_id": "3"} }
{ "doc" : {"author_first_name" : "Jack", "author_last_name" : "Ma"} }
{ "update": { "_id": "4"} }
{ "doc" : {"author_first_name" : "Robbin", "author_last_name" : "Li"} }
{ "update": { "_id": "5"} }
{ "doc" : {"author_first_name" : "Tonny", "author_last_name" : "Peter Smith"} }

// Implement cross-fields search
PUT /forum/_mapping/article
{
  "properties": {
    "new_author_first_name": { "type": "string", "copy_to": "new_author_full_name" },
    "new_author_last_name": { "type": "string", "copy_to": "new_author_full_name" },
    "new_author_full_name": { "type": "string" }
  }
}

// In practice this does not work well.
// Resulting new_author_full_name per document:
// 1 --> Peter Smith, 2 --> Smith Williams, 3 --> Jack Ma, 4 --> Robbin Li, 5 --> Tonny Peter Smith
POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"new_author_first_name" : "Peter", "new_author_last_name" : "Smith"} }
{ "update": { "_id": "2"} }
{ "doc" : {"new_author_first_name" : "Smith", "new_author_last_name" : "Williams"} }
{ "update": { "_id": "3"} }
{ "doc" : {"new_author_first_name" : "Jack", "new_author_last_name" : "Ma"} }
{ "update": { "_id": "4"} }
{ "doc" : {"new_author_first_name" : "Robbin", "new_author_last_name" : "Li"} }
{ "update": { "_id": "5"} }
{ "doc" : {"new_author_first_name" : "Tonny", "new_author_last_name" : "Peter Smith"} }

GET /forum/article/_search
{
  "query": { "match": { "new_author_full_name": "Peter Smith" } }
}

// Test phrase matching
POST /forum/article/5/_update
{
  "doc": { "content": "spark is best big data solution based on scala ,an programming language similar to java spark" }
}

// Docs that contain only java are returned as well, which is not the result we want
GET /forum/article/_search
{
  "query": { "match": { "content": "java spark" } }
}
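Both match searches above return more than a phrase search would, because a match query with its default operator only needs one of the analyzed query terms to be present. The following is a minimal Python sketch of that behaviour, not Elasticsearch code: `analyze` and `match_any` are hypothetical helpers written for this illustration, the regex tokenizer only approximates the standard analyzer, and the list-valued `new_author_full_name` is an assumed model of what copy_to feeds into the combined field.

```python
import re


def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_any(docs, field, query):
    """Hypothetical helper: ids of docs whose field shares at least one term with the query."""
    wanted = set(analyze(query))
    hits = []
    for doc_id, doc in docs.items():
        values = doc.get(field, [])
        if isinstance(values, str):
            values = [values]
        terms = {term for value in values for term in analyze(value)}
        if wanted & terms:  # one shared term is enough (OR semantics)
            hits.append(doc_id)
    return hits


# copy_to modelled as a list holding the values copied from both name fields.
authors = {
    "1": {"new_author_full_name": ["Peter", "Smith"]},
    "2": {"new_author_full_name": ["Smith", "Williams"]},
    "3": {"new_author_full_name": ["Jack", "Ma"]},
    "4": {"new_author_full_name": ["Robbin", "Li"]},
    "5": {"new_author_full_name": ["Tonny", "Peter Smith"]},
}
print(match_any(authors, "new_author_full_name", "Peter Smith"))  # ['1', '2', '5']

articles = {
    "2": {"content": "i think java is the best programming language"},
    "5": {"content": "spark is best big data solution based on scala ,an programming "
                     "language similar to java spark"},
}
print(match_any(articles, "content", "java spark"))  # ['2', '5']
```

Document 2 never mentions spark, yet it is returned for java spark, which is exactly the behaviour the phrase query in the next section is meant to rule out.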
2 Phrase matching (match_phrase)
- Requirement: only docs that contain the phrase java spark are returned; docs that contain only java are not.
GET /forum/article/_search
{
  "query": { "match_phrase": { "content": "java spark" } }
}
- What term position means
hello world, java spark    doc1
hi, spark java             doc2

hello    doc1(0)
world    doc1(1)
java     doc1(2) doc2(2)
spark    doc1(3) doc2(1)

To see the position of each term after analysis:
GET _analyze
{
  "text": "hello world, java spark",
  "analyzer": "standard"
}
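The position table above is what makes phrase matching possible: a phrase matches only when its terms sit at consecutive positions in the same document. A small Python sketch of the idea follows. It is an illustration under assumptions, not how Elasticsearch is implemented: `analyze`, `build_positional_index`, and `match_phrase` are hypothetical names, and the regex tokenizer only approximates the standard analyzer.

```python
import re
from collections import defaultdict


def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_positional_index(docs):
    """Map term -> doc id -> list of positions at which the term occurs."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(analyze(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index


def match_phrase(index, phrase):
    """Ids of docs in which the phrase terms appear at consecutive positions."""
    terms = analyze(phrase)
    if not terms:
        return []
    hits = []
    for doc_id, starts in index.get(terms[0], {}).items():
        for start in starts:
            # Term number `offset` of the phrase must sit at position start + offset.
            if all(start + offset in index.get(term, {}).get(doc_id, [])
                   for offset, term in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits


docs = {"doc1": "hello world, java spark", "doc2": "hi, spark java"}
index = build_positional_index(docs)
print(index["java"])   # {'doc1': [2], 'doc2': [2]}
print(index["spark"])  # {'doc1': [3], 'doc2': [1]}
print(match_phrase(index, "java spark"))  # ['doc1']
```

doc2 contains both terms, but spark (position 1) comes before java (position 2), so the phrase java spark does not match it.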
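3 Proximity matching (slop)
- The terms of the query string may have to be moved some number of times before they line up with a document; that number of moves is the slop.
- More precisely, slop is not simply the number of moves that happens to produce a match: it is the maximum number of moves the query terms are allowed to make while trying to match a document.
- In a slop search, the closer the query terms are to each other in the document, the higher the relevance score.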
GET /forum/article/_search
{
  "query": { "match_phrase": { "content": { "query": "spark data", "slop": 3 } } }
}

spark is best big data solution based on scala ,an programming language similar to java spark

spark data
      --> data
          --> data
spark         --> data

// data has to move 3 positions to line up with the document, so slop 3 is enough to match

GET /forum/article/_search
{
  "query": { "match_phrase": { "content": { "query": "java best", "slop": 15 } } }
}

{
  "took": 3,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 2,
    "max_score": 0.65380025,
    "hits": [
      {
        "_index": "forum", "_type": "article", "_id": "2", "_score": 0.65380025,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5", "userID": 1, "hidden": false, "postDate": "2017-01-02",
          "tag": ["java"], "tag_cnt": 1, "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language",
          "sub_title": "learned a lot of course",
          "author_first_name": "Smith", "author_last_name": "Williams",
          "new_author_last_name": "Williams", "new_author_first_name": "Smith"
        }
      },
      {
        "_index": "forum", "_type": "article", "_id": "5", "_score": 0.07111243,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5", "userID": 3, "hidden": false, "postDate": "2017-03-01",
          "tag": ["elasticsearch"], "tag_cnt": 1, "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny", "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith", "new_author_first_name": "Tonny"
        }
      }
    ]
  }
}
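The number of moves described above can be computed from term positions. The sketch below uses a simplified model that is an assumption of this illustration, not a statement about Lucene internals: subtract each term's position in the query from its position in the document, and take the spread (max minus min) of the results as the number of moves needed. `analyze` and `min_slop` are hypothetical helpers, the tokenizer only approximates the standard analyzer, and the query terms are assumed to be distinct.

```python
import re
from itertools import product


def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def min_slop(query, text):
    """Smallest slop at which the query would match, or None if a term is missing.

    Simplified model: for every query term, shift its document positions by the
    term's position in the query; the spread of the shifted positions is the
    number of moves needed.
    """
    tokens = analyze(text)
    shifted = []
    for query_pos, term in enumerate(analyze(query)):
        positions = [pos - query_pos for pos, tok in enumerate(tokens) if tok == term]
        if not positions:
            return None
        shifted.append(positions)
    # Try every combination of occurrences and keep the tightest one.
    return min(max(combo) - min(combo) for combo in product(*shifted))


doc2 = "i think java is the best programming language"
doc5 = ("spark is best big data solution based on scala ,an programming "
        "language similar to java spark")

print(min_slop("spark data", doc5))  # 3
print(min_slop("java best", doc2))   # 2
print(min_slop("java best", doc5))   # 13
print(min_slop("java spark", doc5))  # 0
```

Under this model, spark data needs 3 moves in document 5, which agrees with slop 3 in the first query. For java best with slop 15, both documents are within reach, and document 2 (2 moves) is much closer than document 5 (13 moves), consistent with its higher score in the response.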
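4 Prioritizing recall
- Prioritizing recall means that for the query java spark, docs containing only java are returned, docs containing only spark are returned, and docs containing both are returned. Precision is still taken into account: among the docs that contain both java and spark, the closer the two terms are to each other, the higher the doc ranks.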
GET /forum/article/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "content": "java spark" } }
      ],
      "should": [
        { "match_phrase": { "content": { "query": "java spark", "slop": 50 } } }
      ]
    }
  }
}

{
  "took": 5,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 2,
    "max_score": 1.258609,
    "hits": [
      {
        "_index": "forum", "_type": "article", "_id": "5", "_score": 1.258609,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5", "userID": 3, "hidden": false, "postDate": "2017-03-01",
          "tag": ["elasticsearch"], "tag_cnt": 1, "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny", "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith", "new_author_first_name": "Tonny",
          "followers": ["Jack", "Robbin Li"]
        }
      },
      {
        "_index": "forum", "_type": "article", "_id": "2", "_score": 0.68640786,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5", "userID": 1, "hidden": false, "postDate": "2017-01-02",
          "tag": ["java"], "tag_cnt": 1, "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language",
          "sub_title": "learned a lot of course",
          "author_first_name": "Smith", "author_last_name": "Williams",
          "new_author_last_name": "Williams", "new_author_first_name": "Smith",
          "followers": ["Tom", "Jack"]
        }
      }
    ]
  }
}
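The division of labour in the bool query above, where must decides what is returned and should only adds score, can be mimicked with a toy ranking function. This is a rough sketch under loud assumptions: real Elasticsearch scores come from its similarity model and will not equal these numbers, `toy_score` and `min_distance` are hypothetical helpers, the proximity bonus 1 / (1 + distance) is chosen here purely for illustration, and the tokenizer only approximates the standard analyzer.

```python
import re
from itertools import product


def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def min_distance(query_terms, tokens):
    """Moves needed to line the query terms up with the doc, or None if a term is missing."""
    shifted = []
    for query_pos, term in enumerate(query_terms):
        positions = [pos - query_pos for pos, tok in enumerate(tokens) if tok == term]
        if not positions:
            return None
        shifted.append(positions)
    return min(max(combo) - min(combo) for combo in product(*shifted))


def toy_score(query, text, slop=50):
    """Toy score: matched-term count (must) plus a proximity bonus (should)."""
    query_terms = analyze(query)
    tokens = analyze(text)
    matched = sum(1 for term in set(query_terms) if term in tokens)
    if matched == 0:
        return None  # the must clause fails, so the doc is not returned at all
    score = float(matched)
    distance = min_distance(query_terms, tokens)
    if distance is not None and distance <= slop:
        score += 1.0 / (1 + distance)  # closer terms earn a larger bonus
    return score


articles = {
    "2": "i think java is the best programming language",
    "5": ("spark is best big data solution based on scala ,an programming "
          "language similar to java spark"),
}
scores = {doc_id: toy_score("java spark", text) for doc_id, text in articles.items()}
ranking = sorted((d for d in scores if scores[d] is not None),
                 key=lambda d: scores[d], reverse=True)
print(scores)   # {'2': 1.0, '5': 3.0}
print(ranking)  # ['5', '2']
```

Document 2 is still returned because it satisfies the must clause, so recall is preserved; document 5 ranks first because it also earns the proximity bonus, matching the order of the hits in the response above.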
5 Summary
A few notes written down: reviewing the old to learn the new.
Reposted from: https://juejin.im/post/5d62ab6f5188253961299c74