1. Problem recap

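In chapter 1 we introduced the basic workflow of map point-of-interest (POI) retrieval and built a simple demo with Elasticsearch and the IK analyzer. When we searched the demo for "通州区万达广场" (Wanda Plaza, Tongzhou District), the top hit turned out to be "建国路万达广场" (Jianguo Road Wanda Plaza), which is in Chaoyang District. In chapter 2 we explored how ES computes relevance scores and got an overview of the scoring strategy. In this article we use the interfaces ES provides to adjust the scoring rules so that the search results match our expectations.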

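First, let's use the ES explain parameter to print the scoring details and see why the second-ranked address, which is clearly the more sensible match, gets the lower score.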

GET http://localhost:9200/idx_default/_search?explain=true
{
  "query": {
    "match": {
      "address": {
        "query": "通州区万达广场"
      }
    }
  }
}
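The same request can also be issued from a script. The following is a minimal sketch using only the Python standard library; the helper names `build_explain_request` and `run_explain_query` are hypothetical, and the host, index and field are simply the ones used in the request above. `run_explain_query` needs a running ES node and is not called here.

```python
import json
import urllib.request


def build_explain_request(base_url, index, field, text):
    """Return (url, body_bytes) for a match query with explain enabled."""
    url = f"{base_url}/{index}/_search?explain=true"
    body = {"query": {"match": {field: {"query": text}}}}
    return url, json.dumps(body, ensure_ascii=False).encode("utf-8")


def run_explain_query(base_url, index, field, text):
    """Send the query and return the parsed JSON response (requires a live ES node)."""
    url, data = build_explain_request(base_url, index, field, text)
    request = urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",  # _search accepts a JSON body via POST as well as GET
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only build and print the request; nothing is sent over the network.
    url, data = build_explain_request(
        "http://localhost:9200", "idx_default", "address", "通州区万达广场"
    )
    print(url)
    print(data.decode("utf-8"))
```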

The result is as follows (only the top two hits are shown):

{"_shard": "[idx_default][0]","_node": "Crj7_cZOQT6w9sG0ryBbzQ","_index": "idx_default","_type": "_doc","_id": "138069","_score": 17.299044,"_source": {"address": "建国路万达广场","name": "恒大山水城","location": "39.90867476611688,116.46468505121267"},"_explanation": {"value": 17.299044,"description": "sum of:","details": [{"value": 10.175069,"description": "weight(address:万达 in 138410) [PerFieldSimilarity], result of:","details": [{"value": 10.175069,"description": "score(freq=1.0), computed as boost * idf * tf from:","details": [{"value": 2.2,"description": "boost","details": []},{"value": 7.7361317,"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:","details": [{"value": 89,"description": "n, number of documents containing term","details": []},{"value": 204918,"description": "N, total number of documents with field","details": []}]},{"value": 0.59784806,"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:","details": [{"value": 1.0,"description": "freq, occurrences of term within document","details": []},{"value": 1.2,"description": "k1, term saturation parameter","details": []},{"value": 0.75,"description": "b, length normalization parameter","details": []},{"value": 3.0,"description": "dl, length of field","details": []},{"value": 7.245098,"description": "avgdl, average length of field","details": []}]}]}]},{"value": 7.1239743,"description": "weight(address:广场 in 138410) [PerFieldSimilarity], result of:","details": [{"value": 7.1239743,"description": "score(freq=1.0), computed as boost * idf * tf from:","details": [{"value": 2.2,"description": "boost","details": []},{"value": 5.416376,"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:","details": [{"value": 910,"description": "n, number of documents containing term","details": []},{"value": 204918,"description": "N, total number of documents with field","details": []}]},{"value": 0.59784806,"description": "tf, computed as freq / (freq + k1 * (1 
- b + b * dl / avgdl)) from:","details": [{"value": 1.0,"description": "freq, occurrences of term within document","details": []},{"value": 1.2,"description": "k1, term saturation parameter","details": []},{"value": 0.75,"description": "b, length normalization parameter","details": []},{"value": 3.0,"description": "dl, length of field","details": []},{"value": 7.245098,"description": "avgdl, average length of field","details": []}]}]}]}]}},
{"_shard": "[idx_default][0]","_node": "Crj7_cZOQT6w9sG0ryBbzQ","_index": "idx_default","_type": "_doc","_id": "28730","_score": 16.216942,"_source": {"address": "北京市通州区新华西街58号万达广场F2","name": "手寓工坊(万达广场店)","location": "39.904175142894765,116.63712318703388"},"_explanation": {"value": 16.216942,"description": "sum of:","details": [{"value": 2.879858,"description": "weight(address:通州区 in 28165) [PerFieldSimilarity], result of:","details": [{"value": 2.879858,"description": "score(freq=1.0), computed as boost * idf * tf from:","details": [{"value": 2.2,"description": "boost","details": []},{"value": 2.8400025,"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:","details": [{"value": 11972,"description": "n, number of documents containing term","details": []},{"value": 204918,"description": "N, total number of documents with field","details": []}]},{"value": 0.46092433,"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:","details": [{"value": 1.0,"description": "freq, occurrences of term within document","details": []},{"value": 1.2,"description": "k1, term saturation parameter","details": []},{"value": 0.75,"description": "b, length normalization parameter","details": []},{"value": 7.0,"description": "dl, length of field","details": []},{"value": 7.245098,"description": "avgdl, average length of field","details": []}]}]}]},{"value": 7.844697,"description": "weight(address:万达 in 28165) [PerFieldSimilarity], result of:","details": [{"value": 7.844697,"description": "score(freq=1.0), computed as boost * idf * tf from:","details": [{"value": 2.2,"description": "boost","details": []},{"value": 7.7361317,"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:","details": [{"value": 89,"description": "n, number of documents containing term","details": []},{"value": 204918,"description": "N, total number of documents with field","details": []}]},{"value": 0.46092433,"description": "tf, computed as freq / 
(freq + k1 * (1 - b + b * dl / avgdl)) from:","details": [{"value": 1.0,"description": "freq, occurrences of term within document","details": []},{"value": 1.2,"description": "k1, term saturation parameter","details": []},{"value": 0.75,"description": "b, length normalization parameter","details": []},{"value": 7.0,"description": "dl, length of field","details": []},{"value": 7.245098,"description": "avgdl, average length of field","details": []}]}]}]},{"value": 5.4923873,"description": "weight(address:广场 in 28165) [PerFieldSimilarity], result of:","details": [{"value": 5.4923873,"description": "score(freq=1.0), computed as boost * idf * tf from:","details": [{"value": 2.2,"description": "boost","details": []},{"value": 5.416376,"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:","details": [{"value": 910,"description": "n, number of documents containing term","details": []},{"value": 204918,"description": "N, total number of documents with field","details": []}]},{"value": 0.46092433,"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:","details": [{"value": 1.0,"description": "freq, occurrences of term within document","details": []},{"value": 1.2,"description": "k1, term saturation parameter","details": []},{"value": 0.75,"description": "b, length normalization parameter","details": []},{"value": 7.0,"description": "dl, length of field","details": []},{"value": 7.245098,"description": "avgdl, average length of field","details": []}]}]}]}]}}

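The result shows that **"建国路万达广场"** (referred to below as the "Jianguo Road address") scores 17.299044, while **"北京市通州区新华西街58号万达广场F2"** (referred to below as the "Tongzhou address") scores only 16.216942. As mentioned in the previous chapter, the Jianguo Road address is in Chaoyang District, which is clearly far from what the query asks for, while the Tongzhou address is what we expect. The reason for the current ranking can be found inside _explanation; below we analyze it using the score model we studied in the previous chapter.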

2. Cause analysis

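Let's first look at the structure of _explanation. It is a JSON object with three properties, "value", "description" and "details", which hold the score, the formula used, and the values of all variables in that formula, respectively. details is an array whose elements are JSON objects with the same structure. Conceptually these objects form four levels (the weight(...) object and the score(...) object nested directly inside it carry the same value and are counted here as one level): level 1 is the overall score object; level 2 holds the per-term score objects; level 3 holds the component score objects, such as a term's idf score; level 4 holds the component variable objects, such as the value of N in a term's idf formula. The total score is computed as: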
$$\text{final score}=\sum_{i=1}^{n}\text{score of term}_i$$
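The structure described above can be walked programmatically. The sketch below is an illustration only: `term_scores` and `TERM_PATTERN` are hypothetical names, the regular expression assumes the "weight(field:term in docid)" description format seen in the output above, and `sample` is a trimmed copy of the first hit's _explanation with the deeper levels omitted. It assumes a multi-term query, where the top level is a "sum of:" node.

```python
import re

# Matches descriptions such as
# "weight(address:万达 in 138410) [PerFieldSimilarity], result of:"
TERM_PATTERN = re.compile(r"weight\((?P<field>[^:]+):(?P<term>.+?) in \d+\)")


def term_scores(explanation):
    """Collect {term: score} from the per-term objects under the top-level sum."""
    scores = {}
    for detail in explanation.get("details", []):
        match = TERM_PATTERN.match(detail["description"])
        if match:
            scores[match.group("term")] = detail["value"]
    return scores


# Trimmed copy of the first hit's _explanation (levels 3 and 4 omitted).
sample = {
    "value": 17.299044,
    "description": "sum of:",
    "details": [
        {
            "value": 10.175069,
            "description": "weight(address:万达 in 138410) [PerFieldSimilarity], result of:",
            "details": [],
        },
        {
            "value": 7.1239743,
            "description": "weight(address:广场 in 138410) [PerFieldSimilarity], result of:",
            "details": [],
        },
    ],
}

scores = term_scores(sample)
total = sum(scores.values())
print(scores)
# The per-term scores add up to the document score, up to float rounding.
print(abs(total - sample["value"]) < 1e-4)
```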
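Now let's look at a single term, taking the term "万达" in "建国路万达广场" as an example. We locate the JSON object for "万达"; the description inside its details reads "score(freq=1.0), computed as boost * idf * tf from:", which requires three values: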

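**boost** is a query weight. We can set a boost value for a given field through the mapping when creating the index, so that different fields carry different weights in a multi-field query. Note that no boost was configured here, yet the output above shows 2.2, which equals k1 + 1 (1.2 + 1); this suggests that ES folds the (k1 + 1) factor of the BM25 numerator into this item.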

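**idf** is the inverse document frequency, described as "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:". See the previous chapter for its meaning.

**tf** is the term frequency, described as "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:". See the previous chapter for its meaning.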
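As a sanity check, the three values can be recomputed from the variables listed in the explain output. The sketch below applies the two formulas quoted in the descriptions to the "万达" term of the Jianguo Road address; the function names `bm25_idf` and `bm25_tf` are hypothetical, and small differences from the ES output are expected because Lucene computes in single precision.

```python
import math


def bm25_idf(N, n):
    """idf = log(1 + (N - n + 0.5) / (n + 0.5)), as quoted in the explain output."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq, k1, b, dl, avgdl):
    """tf = freq / (freq + k1 * (1 - b + b * dl / avgdl)), as quoted in the explain output."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


# Variables copied from the "万达" entry of the Jianguo Road address.
boost = 2.2
idf = bm25_idf(N=204918, n=89)
tf = bm25_tf(freq=1.0, k1=1.2, b=0.75, dl=3.0, avgdl=7.245098)
score = boost * idf * tf

print(f"idf={idf:.4f} tf={tf:.4f} score={score:.3f}")
```

The printed values can be compared with 7.7361317, 0.59784806 and 10.175069 in the output above.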

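Now that we understand how ES computes scores and what the output means, let's analyze why the Jianguo Road address scores higher than the Tongzhou address. The JSON result can be rearranged into the term score table below, where each row gives the score of the term on the left in each of the two addresses. The term "通州区" appears only in the Tongzhou address, yet even with one extra term the Tongzhou address still scores lower.

| Term | Jianguo Road address | Tongzhou address |
| --- | --- | --- |
| 通州区 | (not matched) | 2.879858 |
| 万达 | 10.175069 | 7.844697 |
| 广场 | 7.1239743 | 5.4923873 |
| Total | 17.299044 | 16.216942 |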

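Breaking down the scores of "万达" and "广场" in the two addresses reveals the specific cause. The table below lists each component used to compute the scores of these two terms. The Tongzhou address is clearly lower on the tf component for both terms, while the other components are identical.

| Term | Component | Jianguo Road address | Tongzhou address |
| --- | --- | --- | --- |
| 万达 | boost | 2.2 | 2.2 |
| 万达 | idf | 7.7361317 | 7.7361317 |
| 万达 | tf | 0.59784806 | 0.46092433 |
| 广场 | boost | 2.2 | 2.2 |
| 广场 | idf | 5.416376 | 5.416376 |
| 广场 | tf | 0.59784806 | 0.46092433 |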

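Looking further at how tf is computed, the only difference is dl, whose description is "dl, length of field", i.e. the length of the address measured in terms: 3.0 for the Jianguo Road address and 7.0 for the Tongzhou address.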

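Recall the tf formula introduced in the previous article (it differs slightly from the formula ES uses by default: the ES version omits the k+1 in the numerator, but the overall effect is the same):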
$$TF=\frac{(k+1)\cdot f_i}{k\cdot\left(1-b+b\cdot\frac{dl}{avg(dl)}\right)+f_i}$$
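Here dl is the length of the current document and avg(dl) is the average document length in the corpus. avg(dl) is the same for every document, and the larger dl is, the lower the tf score. So the analysis shows that the Tongzhou address, "北京市通州区新华西街58号万达广场F2", is simply too long. Although it covers more terms (the extra term "通州区"), dl lowers the score of every term. Next we look at which parameter can be tuned to reduce the influence of dl.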
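To make the effect of dl concrete, here is the tf computation for both addresses, worked by hand. It uses the ES form of the formula quoted in the explain output (without the k+1 in the numerator) and the values reported there: freq = 1, k1 = 1.2, b = 0.75, avgdl = 7.245098. The intermediate values are rounded.

$$
\begin{aligned}
tf_{dl=3} &= \frac{1}{1 + 1.2\cdot\left(1 - 0.75 + 0.75\cdot\frac{3}{7.245098}\right)} \approx \frac{1}{1.6727} \approx 0.5978\\
tf_{dl=7} &= \frac{1}{1 + 1.2\cdot\left(1 - 0.75 + 0.75\cdot\frac{7}{7.245098}\right)} \approx \frac{1}{2.1696} \approx 0.4609
\end{aligned}
$$

These agree with the tf values 0.59784806 and 0.46092433 in the explain output: going from 3 terms to 7 terms costs every matched term roughly 23% of its score.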

3. Adjusting the parameter

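At the end of the previous article we introduced the parameter b in the tf formula and noted that it is the factor BM25 gives us to control how strongly document length affects the score. When b=0 the denominator becomes k+f_i, which removes the effect of document length entirely. The higher b is, the more the length affects the TF score. Here we want to reduce, or even eliminate, the effect of length: the addresses in the address database do not differ much in length, so I want them to compete on equal terms, with whichever matches more terms scoring higher.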
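Before touching the index, the effect of b can be estimated on paper. The sketch below rescoring the two addresses with b=0.75 and b=0 is a back-of-the-envelope check, not what ES will return after reindexing: it assumes the idf, boost, k1 and avgdl values from the explain output above stay unchanged, and the names `address_score`, `IDF` and `ADDRESSES` are hypothetical.

```python
def bm25_tf(freq, k1, b, dl, avgdl):
    """tf = freq / (freq + k1 * (1 - b + b * dl / avgdl)), as quoted in the explain output."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


def address_score(terms, dl, b, boost=2.2, k1=1.2, avgdl=7.245098, freq=1.0):
    """Sum boost * idf * tf over the matched terms of one address."""
    tf = bm25_tf(freq, k1, b, dl, avgdl)
    return sum(boost * IDF[term] * tf for term in terms)


# idf values copied from the explain output (assumed unchanged after reindexing).
IDF = {"通州区": 2.8400025, "万达": 7.7361317, "广场": 5.416376}

# Matched terms and field length (dl) of each address, from the explain output.
ADDRESSES = {
    "jianguo": (["万达", "广场"], 3.0),
    "tongzhou": (["通州区", "万达", "广场"], 7.0),
}

scores = {}
for b in (0.75, 0.0):
    scores[b] = {
        name: address_score(terms, dl, b) for name, (terms, dl) in ADDRESSES.items()
    }
    print(b, {name: round(value, 4) for name, value in scores[b].items()})
```

With b=0.75 the Jianguo Road address comes out ahead, as in the original result; with b=0 every term has the same tf regardless of dl, so the address that matches the extra term "通州区" wins.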

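ES provides a very convenient interface for this: we only need to define the value of b inside settings when creating the index. The command is as follows: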

PUT http://localhost:9200/idx_default
{
  "settings": {
    "index": {
      "similarity": {
        "BM25_b_0": {
          "type": "BM25",
          "b": "0.0"
        }
      }
    }
  },
  "mappings": {
    "poipo": {
      "properties": {
        "location": {
          "type": "geo_point"
        },
        "address": {
          "type": "text",
          "similarity": "BM25_b_0"
        }
      }
    }
  }
}
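When trying several values of b, it can be handy to generate this request body instead of editing it by hand. The following sketch only builds the JSON; the function name `build_index_body` and its parameters are hypothetical, and the structure simply mirrors the request above (including the "poipo" mapping type, which may need adjusting depending on the ES version).

```python
import json


def build_index_body(b, similarity_name="BM25_b_0", doc_type="poipo"):
    """Build index settings and mappings with a custom BM25 similarity on address."""
    return {
        "settings": {
            "index": {
                "similarity": {
                    # b is sent as a string, mirroring the original request.
                    similarity_name: {"type": "BM25", "b": str(b)}
                }
            }
        },
        "mappings": {
            doc_type: {
                "properties": {
                    "location": {"type": "geo_point"},
                    "address": {"type": "text", "similarity": similarity_name},
                }
            }
        },
    }


body = build_index_body(0.0)
print(json.dumps(body, indent=2))
```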

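BM25_b_0 is the similarity model we define; type specifies that it is a BM25 model, and b overrides that parameter with the value 0. Then, in mappings below it, we set the similarity of the address field to the new model. This completes the new index. After re-importing the data we run the query again, and the result is as follows: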

{"_index": "idx_default","_type": "poipo","_id": "56963","_score": 17.46982,"_source": {"address": "北京市通州区新华街道建国路93号院万达广场11号楼","location": {"lon": 116.6574382584145,"lat": 39.92313729883979}}},{"_index": "idx_default","_type": "poipo","_id": "87454","_score": 16.99757,"_source": {"address": "北京市通州区北苑街道手寓工坊(万达广场店)","location": {"lon": 116.64295933891906,"lat": 39.905244856754514}}}
...

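Only the top two results are listed here. Both are clearly Wanda Plaza in Tongzhou District, which shows that our parameter adjustment has taken effect.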

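In this article we used an example to show how to inspect ES query results and their scoring details, and by analyzing how the scores are computed we found the cause of the wrong ranking. Finally, we modified the model through the parameter interface ES provides. This tuning example removes the length factor rather bluntly; in later chapters we will explore finer-grained tuning strategies, starting from term priority.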
