Full-Text Search

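The two most important aspects of full-text search are:

  • Relevance: the ability to evaluate how relevant each result is to the query and to rank the results accordingly. The calculation can be based on TF/IDF, geographic proximity, fuzzy similarity, or some other algorithm.
  • Analysis: the process of converting a block of text into distinct, normalized tokens, used both to build the inverted index and to query it.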
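
As a rough illustration of what analysis feeds into, the sketch below builds a tiny inverted index in Python from three of the hobby strings used later in this article (documents 1, 2 and 4). The `analyze` function is a hypothetical stand-in that only splits on the list separator `、`; it is not the IK analyzer, which also emits sub-words.

```python
from collections import defaultdict

def analyze(text):
    """Hypothetical analyzer: split on the '、' separator and drop empty tokens."""
    return [tok.strip() for tok in text.split("、") if tok.strip()]

def build_inverted_index(docs):
    """Map each token to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}

docs = {
    1: "羽毛球、乒乓球、足球",
    2: "羽毛球、乒乓球、足球、篮球",
    4: "跑步、游泳、篮球",
}
index = build_inverted_index(docs)
print(index["篮球"])  # ids of the documents whose hobby text contains 篮球
```

Looking up a term is then a single dictionary access, which is why the same analysis has to be applied to the query string before the lookup.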

Preparing the Data

PUT /test4
{
  "settings": {
    "index": {
      "number_of_shards": "1",
      "number_of_replicas": "1"
    }
  },
  "mappings": {
    "properties": {
      "name": {"type": "text"},
      "age": {"type": "long"},
      "mail": {"type": "keyword"},
      "hobby": {"type": "text", "analyzer": "ik_max_word"}
    }
  }
}

View the mapping:

GET /test4/_mapping

Insert the data. The index name is taken from the URL, and the `_type` field is left out of the action lines because mapping types are deprecated in Elasticsearch 7.x and rejected in 8.x:

POST /test4/_bulk
{"index":{}}
{"name":"张三","age": 20,"mail": "111@qq.com","hobby":"羽毛球、乒乓球、足球"}
{"index":{}}
{"name":"李四","age": 21,"mail": "222@qq.com","hobby":"羽毛球、乒乓球、足球、篮球"}
{"index":{}}
{"name":"王五","age": 22,"mail": "333@qq.com","hobby":"羽毛球、篮球、游泳、听音乐"}
{"index":{}}
{"name":"赵六","age": 23,"mail": "444@qq.com","hobby":"跑步、游泳、篮球"}
{"index":{}}
{"name":"孙七","age": 24,"mail": "555@qq.com","hobby":"听音乐、看电影、羽毛球"}

Single-Term Search

POST /test4/_search
{
  "query": {
    "match": {
      "hobby": "音乐"
    }
  },
  "highlight": {"fields": {"hobby": {}}}
}

Result:

{"took" : 691,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 2,"relation" : "eq"},"max_score" : 0.816522,"hits" : [{"_index" : "test4","_type" : "_doc","_id" : "gj_2yXcBhFgDDNfpe9bx","_score" : 0.816522,"_source" : {"name" : "王五","age" : 22,"mail" : "333@qq.com","hobby" : "羽毛球、篮球、游泳、听音乐"},"highlight" : {"hobby" : ["羽毛球、篮球、游泳、听<em>音乐</em>"]}},{"_index" : "test4","_type" : "_doc","_id" : "hD_2yXcBhFgDDNfpe9bx","_score" : 0.816522,"_source" : {"name" : "孙七","age" : 24,"mail" : "555@qq.com","hobby" : "听音乐、看电影、羽毛球"},"highlight" : {"hobby" : ["听<em>音乐</em>、看电影、羽毛球"]}}]}
}

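How it works:

  1. Check the field type.
    The hobby field is of type text (with the IK analyzer specified), which means the query string itself must also be analyzed.

  2. Analyze the query string.
    The query string "音乐" is passed through the IK analyzer, which outputs the single term 音乐. Because there is only one term, the match query executes as a single low-level term query.

  3. Find matching documents.
    The term query looks up "音乐" in the inverted index and retrieves the documents that contain it; in this example, documents 3 and 5 (王五 and 孙七).

  4. Score each document.
    The term query computes a relevance _score for each document by combining term frequency (how often "音乐" appears in that document's hobby field; the more often, the more relevant), inverse document frequency (how many documents contain "音乐" in their hobby field; the more documents, the lower the weight), and field length (the shorter the field, the higher the relevance). Elasticsearch 5.0 and later combine these factors with BM25 by default rather than classic TF/IDF.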
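
To make the three factors concrete, here is a minimal sketch of the textbook BM25 formula. The values k1 = 1.2 and b = 0.75 are the commonly cited defaults, and the field lengths passed in are made-up numbers; Elasticsearch's actual scores also depend on how it stores field lengths, so this is not expected to reproduce the `_score` values above exactly.

```python
import math

def bm25_idf(total_docs, docs_with_term):
    """Inverse document frequency: rarer terms get a larger weight."""
    return math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))

def bm25_score(tf, field_len, avg_field_len, total_docs, docs_with_term, k1=1.2, b=0.75):
    """Score of one term in one field: idf times a length-normalized tf component."""
    length_norm = 1 - b + b * field_len / avg_field_len
    tf_part = tf * (k1 + 1) / (tf + k1 * length_norm)
    return bm25_idf(total_docs, docs_with_term) * tf_part

# 音乐 appears in 2 of the 5 documents; the field lengths are illustrative only.
short_field = bm25_score(tf=1, field_len=3, avg_field_len=6, total_docs=5, docs_with_term=2)
long_field = bm25_score(tf=1, field_len=9, avg_field_len=6, total_docs=5, docs_with_term=2)
print(short_field > long_field)  # the shorter field scores higher
```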

Multi-Term Search

POST /test4/_search
{
  "query": {
    "match": {
      "hobby": "音乐 篮球"
    }
  },
  "highlight": {"fields": {"hobby": {}}}
}

Result:

{"took" : 5,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 4,"relation" : "eq"},"max_score" : 1.319227,"hits" : [{"_index" : "test4","_type" : "_doc","_id" : "gj_2yXcBhFgDDNfpe9bx","_score" : 1.319227,"_source" : {"name" : "王五","age" : 22,"mail" : "333@qq.com","hobby" : "羽毛球、篮球、游泳、听音乐"},"highlight" : {"hobby" : ["羽毛球、<em>篮球</em>、游泳、听<em>音乐</em>"]}},{"_index" : "test4","_type" : "_doc","_id" : "hD_2yXcBhFgDDNfpe9bx","_score" : 0.816522,"_source" : {"name" : "孙七","age" : 24,"mail" : "555@qq.com","hobby" : "听音乐、看电影、羽毛球"},"highlight" : {"hobby" : ["听<em>音乐</em>、看电影、羽毛球"]}},{"_index" : "test4","_type" : "_doc","_id" : "gz_2yXcBhFgDDNfpe9bx","_score" : 0.6987338,"_source" : {"name" : "赵六","age" : 23,"mail" : "444@qq.com","hobby" : "跑步、游泳、篮球"},"highlight" : {"hobby" : ["跑步、游泳、<em>篮球</em>"]}},{"_index" : "test4","_type" : "_doc","_id" : "gT_2yXcBhFgDDNfpe9bx","_score" : 0.502705,"_source" : {"name" : "李四","age" : 21,"mail" : "222@qq.com","hobby" : "羽毛球、乒乓球、足球、篮球"},"highlight" : {"hobby" : ["羽毛球、乒乓球、足球、<em>篮球</em>"]}}]}
}

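As you can see, every document containing either "音乐" or "篮球" was returned.
This is not what we wanted, though: we were looking for users whose hobbies include both "音乐" and "篮球", and the results clearly reflect an "or" relationship.
In Elasticsearch, the logical relationship between the terms can be set with the operator parameter: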

POST /test4/_search
{
  "query": {
    "match": {
      "hobby": {
        "query": "音乐 篮球",
        "operator": "and"
      }
    }
  },
  "highlight": {"fields": {"hobby": {}}}
}

Result: as expected, only the document containing both terms is returned.

{"took" : 3,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 1,"relation" : "eq"},"max_score" : 1.319227,"hits" : [{"_index" : "test4","_type" : "_doc","_id" : "gj_2yXcBhFgDDNfpe9bx","_score" : 1.319227,"_source" : {"name" : "王五","age" : 22,"mail" : "333@qq.com","hobby" : "羽毛球、篮球、游泳、听音乐"},"highlight" : {"hobby" : ["羽毛球、<em>篮球</em>、游泳、听<em>音乐</em>"]}}]}
}
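
The difference between the two operators can be pictured as set operations on the postings lists of the inverted index. The sketch below uses hand-written postings for the five sample documents (numbered 1 to 5 in insertion order); the dictionary is an assumption made for illustration, not something read from Elasticsearch.

```python
# Hand-written postings: term -> ids of the documents whose hobby field contains it.
postings = {
    "音乐": {3, 5},
    "篮球": {2, 3, 4},
}

def match(terms, operator="or"):
    """Return the ids matching any term ('or') or every term ('and')."""
    sets = [postings.get(term, set()) for term in terms]
    if not sets:
        return set()
    if operator == "and":
        return set.intersection(*sets)
    return set.union(*sets)

print(sorted(match(["音乐", "篮球"])))         # documents containing any term
print(sorted(match(["音乐", "篮球"], "and")))  # documents containing every term
```

The union has four members and the intersection one (document 3, 王五), which lines up with the hit counts of the two searches above.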

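The "OR" and "AND" searches above are two extremes. Real-world scenarios usually call for something in between: a document should be returned as long as it matches a sufficient share of the query terms. Elasticsearch supports this through the minimum_should_match parameter, which accepts a percentage such as 70%: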

POST /test4/_search
{
  "query": {
    "match": {
      "hobby": {
        "query": "游泳 羽毛球",
        "minimum_should_match": "80%"
      }
    }
  },
  "highlight": {"fields": {"hobby": {}}}
}

Result: with minimum_should_match set to 80%, 4 documents are returned.

{"took" : 4,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 4,"relation" : "eq"},"max_score" : 1.6214579,"hits" : [{"_index" : "test4","_type" : "_doc","_id" : "gj_2yXcBhFgDDNfpe9bx","_score" : 1.6214579,"_source" : {"name" : "王五","age" : 22,"mail" : "333@qq.com","hobby" : "羽毛球、篮球、游泳、听音乐"},"highlight" : {"hobby" : ["<em>羽毛球</em>、篮球、<em>游泳</em>、听音乐"]}},{"_index" : "test4","_type" : "_doc","_id" : "gD_2yXcBhFgDDNfpe9bx","_score" : 0.9608413,"_source" : {"name" : "张三","age" : 20,"mail" : "111@qq.com","hobby" : "羽毛球、乒乓球、足球"},"highlight" : {"hobby" : ["<em>羽毛球</em>、乒乓<em>球</em>、足球"]}},{"_index" : "test4","_type" : "_doc","_id" : "gT_2yXcBhFgDDNfpe9bx","_score" : 0.9134824,"_source" : {"name" : "李四","age" : 21,"mail" : "222@qq.com","hobby" : "羽毛球、乒乓球、足球、篮球"},"highlight" : {"hobby" : ["<em>羽毛球</em>、乒乓<em>球</em>、足球、篮球"]}},{"_index" : "test4","_type" : "_doc","_id" : "hD_2yXcBhFgDDNfpe9bx","_score" : 0.80493593,"_source" : {"name" : "孙七","age" : 24,"mail" : "555@qq.com","hobby" : "听音乐、看电影、羽毛球"},"highlight" : {"hobby" : ["听音乐、看电影、<em>羽毛球</em>"]}}]}
}

Now test with 40%:

POST /test4/_search
{
  "query": {
    "match": {
      "hobby": {
        "query": "游泳 羽毛球",
        "minimum_should_match": "40%"
      }
    }
  },
  "highlight": {"fields": {"hobby": {}}}
}

Result: with minimum_should_match set to 40%, all 5 documents are returned.

{"took" : 6,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 5,"relation" : "eq"},"max_score" : 1.6214579,"hits" : [{"_index" : "test4","_type" : "_doc","_id" : "gj_2yXcBhFgDDNfpe9bx","_score" : 1.6214579,"_source" : {"name" : "王五","age" : 22,"mail" : "333@qq.com","hobby" : "羽毛球、篮球、游泳、听音乐"},"highlight" : {"hobby" : ["<em>羽毛球</em>、篮球、<em>游泳</em>、听音乐"]}},{"_index" : "test4","_type" : "_doc","_id" : "gz_2yXcBhFgDDNfpe9bx","_score" : 1.1349231,"_source" : {"name" : "赵六","age" : 23,"mail" : "444@qq.com","hobby" : "跑步、游泳、篮球"},"highlight" : {"hobby" : ["跑步、<em>游泳</em>、篮球"]}},{"_index" : "test4","_type" : "_doc","_id" : "gD_2yXcBhFgDDNfpe9bx","_score" : 0.9608413,"_source" : {"name" : "张三","age" : 20,"mail" : "111@qq.com","hobby" : "羽毛球、乒乓球、足球"},"highlight" : {"hobby" : ["<em>羽毛球</em>、乒乓<em>球</em>、足球"]}},{"_index" : "test4","_type" : "_doc","_id" : "gT_2yXcBhFgDDNfpe9bx","_score" : 0.9134824,"_source" : {"name" : "李四","age" : 21,"mail" : "222@qq.com","hobby" : "羽毛球、乒乓球、足球、篮球"},"highlight" : {"hobby" : ["<em>羽毛球</em>、乒乓<em>球</em>、足球、篮球"]}},{"_index" : "test4","_type" : "_doc","_id" : "hD_2yXcBhFgDDNfpe9bx","_score" : 0.80493593,"_source" : {"name" : "孙七","age" : 24,"mail" : "555@qq.com","hobby" : "听音乐、看电影、羽毛球"},"highlight" : {"hobby" : ["听音乐、看电影、<em>羽毛球</em>"]}}]}
}
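
The hit counts of the two searches above are consistent with the percentage being applied to the number of analyzed query terms and rounded down. The sketch below assumes that `ik_max_word` turns the query 游泳 羽毛球 into four terms (游泳, 羽毛球, 羽毛, 球), which the highlighted 球 in the responses suggests; the per-document term sets are hand-written and list only the terms relevant to this query.

```python
query_terms = ["游泳", "羽毛球", "羽毛", "球"]  # assumed ik_max_word output

# Assumed subset of the indexed terms per document that matter for this query.
doc_terms = {
    "张三": {"羽毛球", "羽毛", "球"},
    "李四": {"羽毛球", "羽毛", "球"},
    "王五": {"羽毛球", "羽毛", "球", "游泳"},
    "赵六": {"游泳"},
    "孙七": {"羽毛球", "羽毛", "球"},
}

def required_matches(term_count, percentage):
    """Percentage of the terms, rounded down, but never fewer than one."""
    return max(1, term_count * percentage // 100)

def search(percentage):
    """Names of the documents that match at least the required number of terms."""
    needed = required_matches(len(query_terms), percentage)
    return [name for name, terms in doc_terms.items()
            if sum(term in terms for term in query_terms) >= needed]

print(len(search(80)))  # 80% of 4 terms means 3 terms are required
print(len(search(40)))  # 40% of 4 terms means 1 term is required
```

Under these assumptions, 赵六 matches only one term, so it drops out at 80% and reappears at 40%.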

Conclusion: the right threshold depends on the actual requirements and can only be found by testing repeatedly against real data.

Combined Search

Searches can also use the bool compound query covered in the section on filters. For example:

POST /test4/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "hobby": "篮球"
        }
      },
      "must_not": {
        "match": {
          "hobby": "音乐"
        }
      },
      "should": [
        {
          "match": {
            "hobby": "游泳"
          }
        }
      ]
    }
  },
  "highlight": {"fields": {"hobby": {}}}
}

Result:

{"took" : 4,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 2,"relation" : "eq"},"max_score" : 1.8336569,"hits" : [{"_index" : "test4","_type" : "_doc","_id" : "gz_2yXcBhFgDDNfpe9bx","_score" : 1.8336569,"_source" : {"name" : "赵六","age" : 23,"mail" : "444@qq.com","hobby" : "跑步、游泳、篮球"},"highlight" : {"hobby" : ["跑步、<em>游泳</em>、<em>篮球</em>"]}},{"_index" : "test4","_type" : "_doc","_id" : "gT_2yXcBhFgDDNfpe9bx","_score" : 0.502705,"_source" : {"name" : "李四","age" : 21,"mail" : "222@qq.com","hobby" : "羽毛球、乒乓球、足球、篮球"},"highlight" : {"hobby" : ["羽毛球、乒乓球、足球、<em>篮球</em>"]}}]}
}

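The query above means:
Results must contain 篮球 and must not contain 音乐; documents that also contain 游泳 receive a higher score.

How the score is calculated:
The bool query computes a relevance _score for each document by adding up the _score of every matching must and should clause. Older Elasticsearch releases additionally scaled this sum by the fraction of clauses that matched, but the scores in this article are plain sums: 赵六's 1.8336569 equals 0.6987338 for 篮球 plus 1.1349231 for 游泳, and 李四's 0.502705 is the 篮球 score alone.
A must_not clause does not affect the score; it only excludes unwanted documents.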
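
In the responses shown in this article, the bool score equals the plain sum of the matching clause scores. The sketch below replays the query using per-clause scores copied from the earlier responses (王五's entries are derived by subtracting one response score from another); the `bool_query` helper and its data layout are made up for this illustration.

```python
# Per-clause scores taken or derived from the earlier responses in this article,
# keyed by term and then by document name.
clause_scores = {
    "篮球": {"李四": 0.502705, "王五": 0.502705, "赵六": 0.6987338},
    "音乐": {"王五": 0.816522, "孙七": 0.816522},
    "游泳": {"王五": 0.816522, "赵六": 1.1349231},
}

def bool_query(must, must_not, should):
    """Return {doc: score} for docs matching every must term and no must_not term."""
    candidates = set.intersection(*(set(clause_scores[t]) for t in must))
    for term in must_not:
        candidates -= set(clause_scores[term])
    # should clauses are optional here; they only add to the score when they match.
    return {
        doc: sum(clause_scores[t][doc] for t in must + should if doc in clause_scores[t])
        for doc in candidates
    }

result = bool_query(must=["篮球"], must_not=["音乐"], should=["游泳"])
for doc, score in sorted(result.items(), key=lambda item: -item[1]):
    print(doc, round(score, 7))
```

The two surviving documents and their totals agree with the response above: 赵六 gets the 篮球 and 游泳 scores added together, 李四 only the 篮球 score.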

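By default, none of the should clauses has to match. If the bool query contains no must or filter clause, however, at least one should clause must match. This can also be controlled with the minimum_should_match parameter, whose value can be a number or a percentage.

Example: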

POST /test4/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "hobby": "游泳"
          }
        },
        {
          "match": {
            "hobby": "篮球"
          }
        },
        {
          "match": {
            "hobby": "音乐"
          }
        }
      ],
      "minimum_should_match": 2
    }
  },
  "highlight": {"fields": {"hobby": {}}}
}

Result:

{"took" : 3,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 2,"relation" : "eq"},"max_score" : 2.1357489,"hits" : [{"_index" : "test4","_type" : "_doc","_id" : "gj_2yXcBhFgDDNfpe9bx","_score" : 2.1357489,"_source" : {"name" : "王五","age" : 22,"mail" : "333@qq.com","hobby" : "羽毛球、篮球、游泳、听音乐"},"highlight" : {"hobby" : ["羽毛球、<em>篮球</em>、<em>游泳</em>、听<em>音乐</em>"]}},{"_index" : "test4","_type" : "_doc","_id" : "gz_2yXcBhFgDDNfpe9bx","_score" : 1.8336569,"_source" : {"name" : "赵六","age" : 23,"mail" : "444@qq.com","hobby" : "跑步、游泳、篮球"},"highlight" : {"hobby" : ["跑步、<em>游泳</em>、<em>篮球</em>"]}}]}
}

With minimum_should_match set to 2, at least two of the three should clauses must match.

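Weights

Sometimes we need to give certain terms more weight so that they influence a document's score. In the example below the search keywords are "游泳篮球"; if a document also contains "音乐", that clause is boosted by 10, and if it contains "跑步", that clause is boosted by 2.

POST /test4/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "hobby": {"query": "游泳篮球", "operator": "and"}
        }
      },
      "should": [
        {"match": {"hobby": {"query": "音乐", "boost": 10}}},
        {"match": {"hobby": {"query": "跑步", "boost": 2}}}
      ]
    }
  },
  "highlight": {"fields": {"hobby": {}}}
}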

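Result as captured. Note that this output is identical to the response of the three-clause should query in the previous section and shows no effect from the boosts (跑步 is not highlighted for 赵六, and 王五's score does not reflect a tenfold 音乐 clause), so it appears to have been pasted from that earlier query: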

{"took" : 3,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 2,"relation" : "eq"},"max_score" : 2.1357489,"hits" : [{"_index" : "test4","_type" : "_doc","_id" : "gj_2yXcBhFgDDNfpe9bx","_score" : 2.1357489,"_source" : {"name" : "王五","age" : 22,"mail" : "333@qq.com","hobby" : "羽毛球、篮球、游泳、听音乐"},"highlight" : {"hobby" : ["羽毛球、<em>篮球</em>、<em>游泳</em>、听<em>音乐</em>"]}},{"_index" : "test4","_type" : "_doc","_id" : "gz_2yXcBhFgDDNfpe9bx","_score" : 1.8336569,"_source" : {"name" : "赵六","age" : 23,"mail" : "444@qq.com","hobby" : "跑步、游泳、篮球"},"highlight" : {"hobby" : ["跑步、<em>游泳</em>、<em>篮球</em>"]}}]}
}

Without the boosts, the same query returns:

{"took" : 2,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 2,"relation" : "eq"},"max_score" : 3.630794,"hits" : [{"_index" : "test4","_type" : "_doc","_id" : "gz_2yXcBhFgDDNfpe9bx","_score" : 3.630794,"_source" : {"name" : "赵六","age" : 23,"mail" : "444@qq.com","hobby" : "跑步、游泳、篮球"},"highlight" : {"hobby" : ["<em>跑步</em>、<em>游泳</em>、<em>篮球</em>"]}},{"_index" : "test4","_type" : "_doc","_id" : "gj_2yXcBhFgDDNfpe9bx","_score" : 2.1357489,"_source" : {"name" : "王五","age" : 22,"mail" : "333@qq.com","hobby" : "羽毛球、篮球、游泳、听音乐"},"highlight" : {"hobby" : ["羽毛球、<em>篮球</em>、<em>游泳</em>、听<em>音乐</em>"]}}]}
}
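
As a back-of-the-envelope check, the sketch below estimates how the boosts should change the ranking, on the assumption that `boost` simply multiplies the score of the clause it is attached to. The unboosted clause scores are taken from, or derived by subtraction from, the responses in this article, so the totals are estimates rather than Elasticsearch output.

```python
# Unboosted scores taken or derived from the responses in this article.
must_score = {
    "王五": 2.1357489 - 0.816522,  # 游泳 + 篮球 (total minus the 音乐 clause)
    "赵六": 1.8336569,             # 游泳 + 篮球
}
should_score = {
    "音乐": {"王五": 0.816522},
    "跑步": {"赵六": 3.630794 - 1.8336569},  # total minus the must part
}

def estimate(boosts):
    """Add each should clause's score, scaled by its boost, to the must score."""
    totals = dict(must_score)
    for term, boost in boosts.items():
        for doc, score in should_score[term].items():
            totals[doc] += boost * score
    return totals

plain = estimate({"音乐": 1, "跑步": 1})
boosted = estimate({"音乐": 10, "跑步": 2})
print(max(plain, key=plain.get))      # top hit without boosts
print(max(boosted, key=boosted.get))  # top hit with boosts
```

Under that assumption, 赵六 ranks first without boosts, as in the response above, while the tenfold boost on 音乐 would move 王五 to the top.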
