Author: Daulet Nurmanbetov

Chinese translation: ronghuaiyang

Source: AI公园 (AI Park) WeChat official account

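Overview

Semantic search is a problem in NLP that is well worth solving, but also a hard one.

We spend a lot of time looking for specific information in large collections of documents, usually with CTRL + F. Then there is the well-known "Google-fu": in the 21st-century workplace, knowing how to search Google effectively is a valuable skill. All of human knowledge is available to us; the challenge is asking the right question and knowing how to skim the results to find the relevant answer.

Our brains perform semantic search: we scan the results and pick out the sentences that resemble our query. This is especially true in finance and law, where documents keep getting longer and we have to search for many keywords to find the right sentence or paragraph. The cumulative human effort spent on this kind of searching is staggering.

Since the early days of NLP, machine learning has been trying to solve this problem, and an entire research field, semantic search, has emerged. Recently, thanks to advances in deep learning, computers have become able to surface relevant information accurately with minimal human effort.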

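Sentence embedding methods

NLP has a term for the way a word appears when it is mentioned: its "surface form". For example, the word "president" by itself means a head of state, but depending on context and time it may refer to Trump or Obama.

Advances in NLP let us map these surface forms effectively and capture the context of the words in what are called "embeddings", which are vectors of numbers. Two words with similar meanings have similar vectors, which lets us compute how similar they are.

Extending this idea, we should be able to compute the similarity between any two sentences in vector space. This is what sentence embedding models do: they convert any given sentence into a vector, so the similarity or dissimilarity of any pair of sentences can be computed quickly.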
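To make "similarity between vectors" concrete, here is a minimal sketch of cosine similarity, the measure behind the scores shown later in this article. The three-dimensional vectors and their names are made-up toy values for illustration, not real embeddings (embeddings from BERT-base models have 768 dimensions).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "sentence vectors" (assumed values, for illustration only)
economy_improving = [0.90, 0.80, 0.10]
economy_growing = [0.85, 0.82, 0.15]
vaccine_news = [0.10, 0.05, 0.95]

# Sentences with similar meaning should point in a similar direction
print(cosine_similarity(economy_improving, economy_growing))
print(cosine_similarity(economy_improving, vaccine_news))
```

Cosine distance, which is what the search code later in this article sorts by, is simply 1 minus this value, so a smaller distance means a more similar sentence.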

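State-of-the-art semantic search: finding the most similar sentences

The idea is not new. One of the earliest papers, word2vec, proposed representing individual words as vectors back in 2013. Since then, BERT and other Transformer-based models have taken us a long way by capturing the context of those words far more effectively.

Here is how recent embedding models compare with older ones such as word2vec and GloVe.

STS (Semantic Textual Similarity) is an NLP competition on sentence meaning similarity. Higher scores are better.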
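As a rough illustration of what such a score measures: STS results are commonly reported as a correlation between the similarities a model predicts and similarity ratings given by humans. The sketch below computes a Spearman rank correlation by hand. The ratings and predictions are invented numbers, the helper names are my own, and the simple formula used here assumes there are no tied values.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """Spearman rank correlation for two equally long lists without ties."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Invented data: human ratings on a 0-5 scale and model cosine similarities
human_ratings = [4.8, 0.5, 3.0, 2.2]
predicted = [0.91, 0.12, 0.55, 0.60]

print(spearman(human_ratings, predicted))
```

A value near 1 means the model orders sentence pairs the same way humans do, which is why higher is better.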

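These modified and fine-tuned BERT models are very good at recognizing similar sentences, much better than earlier models. Let's see what that means in practice.

I have a set of article headlines from April 2020, and I want to find the ones most similar to a set of search queries.

Here are my search queries: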

1. The economy is more resilient and improving.
2. The economy is in a lot of trouble.
3. Trump is hurting his own reelection chances.

My article headlines are as follows:

Coronavirus:
White House organizing program to slash development time for coronavirus vaccine by as much as eight months (Bloomberg)
Trump says he is pushing FDA to approve emergency-use authorization for Gilead's remdesivir (WSJ)
AstraZeneca to make an experimental coronavirus vaccine developed by Oxford University (Bloomberg)
Trump contradicts US intel, says Covid-19 started in Wuhan lab. (The Hill)
Reopening:
Inconsistent patchwork of state, local and business decision-making on reopening raising concerns about a second wave of the coronavirus (Politico)
White House risks backlash with coronavirus optimism if cases flare up again (The Hill)
Florida plans to start reopening on Monday with restaurants and retail in most areas allowed to resume business in most areas (Bloomberg)
California Governor Newsom plans to order closure of all state beaches and parks starting Friday due to concerns about overcrowding (CNN)
Japan preparing to extend coronavirus state of emergency, which is scheduled to end 6-May, by about another month (Reuters)
Policy/Stimulus:
Economists from a broad range of ideological backgrounds encouraging Congress to keep spending to combat the coronavirus fallout and don't believe now is time to worry about deficit (Politico)
Global economy:
China's official PMIs mixed with beat from services and miss from manufacturing (Bloomberg)
China's Beige Book shows employment situation in Chinese factories worsened in April from end of March, suggesting economy on less solid ground than government data (Bloomberg)
Japan's March factory output fell at the fastest pace in five months, while retail sales also dropped (Reuters)
Eurozone economy contracts by 3.8% in Q1, the fastest decline on record (FT)
US-China:
Trump says China wants to him to lose his bid for re-election and notes he is looking at different options in terms of consequences for Beijing over the virus (Reuters)
Senior White House official confident China will meet obligations under trad deal despite fallout from coronavirus pandemic (WSJ)
Oil:
Trump administration may announce plans as soon as today to offer loans to oil companies, possibly in exchange for a financial stake (Bloomberg)
Munchin says Trump administration could allow oil companies to store another several hundred million barrels (NY Times)
Norway, Europe's biggest oil producer, joins international efforts to cut supply for first time in almost two decades (Bloomberg)
IEA says coronavirus could drive 6% decline in global energy demand in 2020 (FT)
Corporate:
Microsoft reports strong results as shift to more activities online drives growth in areas from cloud-computing to video gams (WSJ)
Facebook revenue beats expectations and while ad revenue fell sharply in March there have been recent signs of stability (Bloomberg)
Tesla posts third straight quarterly profit while Musk rants on call about need for lockdowns to be lifted (Bloomberg)
eBay helped by online shopping surge though classifieds business hurt by closure of car dealerships and lower traffic (WSJ)
Royal Dutch Shell cuts dividend for first time since World War II and also suspends next tranche of buyback program (Reuters)
Chesapeake Energy preparing bankruptcy filing and has held discussions with lenders about a ~$1B loan (Reuters)
Amazon accused by Trump administration of tolerating counterfeit sales, but company says hit politically motivated (WSJ)

After computing the similarity between each query and each headline embedding, here are the top 5 most similar sentences for each query:

======================
Query: The economy is more resilient and improving.
Top 5 most similar sentences in corpus:
Microsoft reports strong results as shift to more activities online drives growth in areas from cloud-computing to video gams (WSJ) (Score: 0.5362)
Facebook revenue beats expectations and while ad revenue fell sharply in March there have been recent signs of stability (Bloomberg) (Score: 0.4632)
Senior White House official confident China will meet obligations under trad deal despite fallout from coronavirus pandemic (WSJ) (Score: 0.3558)
Economists from a broad range of ideological backgrounds encouraging Congress to keep spending to combat the coronavirus fallout and don't believe now is time to worry about deficit (Politico) (Score: 0.3052)
White House risks backlash with coronavirus optimism if cases flare up again (The Hill) (Score: 0.2885)
======================
Query: The economy is in a lot of trouble.
Top 5 most similar sentences in corpus:
Inconsistent patchwork of state, local and business decision-making on reopening raising concerns about a second wave of the coronavirus (Politico) (Score: 0.4667)
eBay helped by online shopping surge though classifieds business hurt by closure of car dealerships and lower traffic (WSJ) (Score: 0.4338)
China's Beige Book shows employment situation in Chinese factories worsened in April from end of March, suggesting economy on less solid ground than government data (Bloomberg) (Score: 0.4283)
Eurozone economy contracts by 3.8% in Q1, the fastest decline on record (FT) (Score: 0.4252)
China's official PMIs mixed with beat from services and miss from manufacturing (Bloomberg) (Score: 0.4052)
======================
Query: Trump is hurting his own reelection chances.
Top 5 most similar sentences in corpus:
Trump contradicts US intel, says Covid-19 started in Wuhan lab. (The Hill) (Score: 0.7472)
Amazon accused by Trump administration of tolerating counterfeit sales, but company says hit politically motivated (WSJ) (Score: 0.7408)
Trump says China wants to him to lose his bid for re-election and notes he is looking at different options in terms of consequences for Beijing over the virus (Reuters) (Score: 0.7111)
Inconsistent patchwork of state, local and business decision-making on reopening raising concerns about a second wave of the coronavirus (Politico) (Score: 0.6213)
White House risks backlash with coronavirus optimism if cases flare up again (The Hill) (Score: 0.6181)

You can see how accurately the model picks out the most similar sentences.

The code I used is below:

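Install the sentence-transformers package (the code imports sentence_transformers, which builds on Hugging Face transformers) and load a pretrained model:

!pip install -U sentence-transformers

# scipy.spatial must be imported explicitly to use scipy.spatial.distance
import scipy.spatial
from sentence_transformers import SentenceTransformer

# Pretrained BERT-base model fine-tuned on NLI data, with mean pooling over tokens
model = SentenceTransformer('bert-base-nli-mean-tokens')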

The corpus is as follows:

# Get a sample corpus to search over
_c="""
Coronavirus:
White House organizing program to slash development time for coronavirus vaccine by as much as eight months (Bloomberg)
Trump says he is pushing FDA to approve emergency-use authorization for Gilead's remdesivir (WSJ)
AstraZeneca to make an experimental coronavirus vaccine developed by Oxford University (Bloomberg)
Reopening:
Inconsistent patchwork of state, local and business decision-making on reopening raising concerns about a second wave of the coronavirus (Politico)
White House risks backlash with coronavirus optimism if cases flare up again (The Hill)
Florida plans to start reopening on Monday with restaurants and retail in most areas allowed to resume business in most areas (Bloomberg)
California Governor Newsom plans to order closure of all state beaches and parks starting Friday due to concerns about overcrowding (CNN)
Japan preparing to extend coronavirus state of emergency, which is scheduled to end 6-May, by about another month (Reuters)
Policy/Stimulus:
Economists from a broad range of ideological backgrounds encouraging Congress to keep spending to combat the coronavirus fallout and don't believe now is time to worry about deficit (Politico)
Global economy:
China's official PMIs mixed with beat from services and miss from manufacturing (Bloomberg)
China's Beige Book shows employment situation in Chinese factories worsened in April from end of March, suggesting economy on less solid ground than government data (Bloomberg)
Japan's March factory output fell at the fastest pace in five months, while retail sales also dropped (Reuters)
Eurozone economy contracts by 3.8% in Q1, the fastest decline on record (FT)
US-China:
Trump says China wants to him to lose his bid for re-election and notes he is looking at different options in terms of consequences for Beijing over the virus (Reuters)
Senior White House official confident China will meet obligations under trad deal despite fallout from coronavirus pandemic (WSJ)
Oil:
Trump administration may announce plans as soon as today to offer loans to oil companies, possibly in exchange for a financial stake (Bloomberg)
Munchin says Trump administration could allow oil companies to store another several hundred million barrels (NY Times)
Norway, Europe's biggest oil producer, joins international efforts to cut supply for first time in almost two decades (Bloomberg)
IEA says coronavirus could drive 6% decline in global energy demand in 2020 (FT)
Corporate:
Microsoft reports strong results as shift to more activities online drives growth in areas from cloud-computing to video gams (WSJ)
Facebook revenue beats expectations and while ad revenue fell sharply in March there have been recent signs of stability (Bloomberg)
Tesla posts third straight quarterly profit while Musk rants on call about need for lockdowns to be lifted (Bloomberg)
eBay helped by online shopping surge though classifieds business hurt by closure of car dealerships and lower traffic (WSJ)
Royal Dutch Shell cuts dividend for first time since World War II and also suspends next tranche of buyback program (Reuters)
Chesapeake Energy preparing bankruptcy filing and has held discussions with lenders about a ~$1B loan (Reuters)
Amazon accused by Trump administration of tolerating counterfeit sales, but company says hit politically motivated (WSJ)
Trump contradicts US intel, says Covid-19 started in Wuhan lab.
"""

# Convert the corpus into a list of headlines, dropping blank lines and
# short section labels such as "Coronavirus:" (fewer than 4 words)
corpus = [i for i in _c.split('\n') if i != '' and len(i.split(' ')) >= 4]

# Get a vector for each headline (sentence) in the corpus
corpus_embeddings = model.encode(corpus)

# Define search queries and embed them to vectors as well
queries = ['The economy is more resilient and improving.', 'The economy is in a lot of trouble.', 'Trump is hurting his own reelection chances.']
query_embeddings = model.encode(queries)

# For each search term return 5 closest sentences
closest_n = 5
for query, query_embedding in zip(queries, query_embeddings):
    # Cosine distance between this query and every headline
    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

    # Pair each headline index with its distance and sort, closest first
    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for idx, distance in results[0:closest_n]:
        # Score is cosine similarity, i.e. 1 - cosine distance
        print(corpus[idx].strip(), "(Score: %.4f)" % (1-distance))

The results are as follows:

======================
Query: The economy is more resilient and improving.
Top 5 most similar sentences in corpus:
Microsoft reports strong results as shift to more activities online drives growth in areas from cloud-computing to video gams (WSJ) (Score: 0.5362)
Facebook revenue beats expectations and while ad revenue fell sharply in March there have been recent signs of stability (Bloomberg) (Score: 0.4632)
Senior White House official confident China will meet obligations under trad deal despite fallout from coronavirus pandemic (WSJ) (Score: 0.3558)
Economists from a broad range of ideological backgrounds encouraging Congress to keep spending to combat the coronavirus fallout and don't believe now is time to worry about deficit (Politico) (Score: 0.3052)
White House risks backlash with coronavirus optimism if cases flare up again (The Hill) (Score: 0.2885)
======================
Query: The economy is in a lot of trouble.
Top 5 most similar sentences in corpus:
Inconsistent patchwork of state, local and business decision-making on reopening raising concerns about a second wave of the coronavirus (Politico) (Score: 0.4667)
eBay helped by online shopping surge though classifieds business hurt by closure of car dealerships and lower traffic (WSJ) (Score: 0.4338)
China's Beige Book shows employment situation in Chinese factories worsened in April from end of March, suggesting economy on less solid ground than government data (Bloomberg) (Score: 0.4283)
Eurozone economy contracts by 3.8% in Q1, the fastest decline on record (FT) (Score: 0.4252)
China's official PMIs mixed with beat from services and miss from manufacturing (Bloomberg) (Score: 0.4052)
======================
Query: Trump is hurting his own reelection chances.
Top 5 most similar sentences in corpus:
Trump contradicts US intel, says Covid-19 started in Wuhan lab. (Score: 0.7472)
Amazon accused by Trump administration of tolerating counterfeit sales, but company says hit politically motivated (WSJ) (Score: 0.7408)
Trump says China wants to him to lose his bid for re-election and notes he is looking at different options in terms of consequences for Beijing over the virus (Reuters) (Score: 0.7111)
Inconsistent patchwork of state, local and business decision-making on reopening raising concerns about a second wave of the coronavirus (Politico) (Score: 0.6213)
White House risks backlash with coronavirus optimism if cases flare up again (The Hill) (Score: 0.6181)

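The example above is simple, but it illustrates an important aspect of semantic search. A human would need several minutes to find the most similar sentences. The model lets us find specific information in text without human involvement, which means we can search thousands of documents for the phrases we care about at machine speed.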
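When the corpus grows to thousands of sentences, one way to keep the search fast is to score all queries against all sentences in a single matrix multiplication. The sketch below is one possible approach using NumPy, not the code I ran above. The function name top_k_cosine is my own, and random vectors stand in for real embeddings.

```python
import numpy as np

def top_k_cosine(query_vecs, corpus_vecs, k=5):
    """Return (indices, scores) of the k most similar corpus rows per query."""
    q = np.asarray(query_vecs, dtype=float)
    c = np.asarray(corpus_vecs, dtype=float)
    # Normalize rows so that a plain dot product equals cosine similarity
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    scores = q @ c.T                          # shape: (n_queries, n_corpus)
    k = min(k, c.shape[0])
    idx = np.argsort(-scores, axis=1)[:, :k]  # highest similarity first
    return idx, np.take_along_axis(scores, idx, axis=1)

# Stand-in data (assumption): 10,000 random "sentence embeddings" of size 768
rng = np.random.default_rng(0)
corpus_vecs = rng.normal(size=(10_000, 768))
# Two queries built as slightly noisy copies of corpus rows 3 and 42
query_vecs = corpus_vecs[[3, 42]] + 0.01 * rng.normal(size=(2, 768))

idx, scores = top_k_cosine(query_vecs, corpus_vecs, k=5)
print(idx[:, 0])   # index of the best match for each query
print(scores[:, 0])
```

With real data, corpus_vecs would be the corpus_embeddings array from the code above, computed once and reused for every new query.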

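This technique is already used to find similar sentences across two documents, or key information in quarterly earnings reports. For example, with this kind of semantic search we can easily find the daily active users of all the social companies such as Twitter, Facebook and Snapchat, even though they define and name the metric differently: daily active users (DAU), monthly active users (MAU), or monetizable active users (mMAU). BERT-powered semantic search can recognize that all these surface forms mean the same thing semantically, a measure of performance, and can extract the sentences we are interested in from the reports.

It is not far-fetched to imagine hedge funds using semantic search to parse and surface metrics from quarterly filings (10-Q/10-K) and using them as quantitative trading signals the moment they are published.

The experiment above shows how effective semantic search has become over the past year.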

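Finding similar sentences: clustering

Another major use of these sentence embeddings is clustering. We can quickly group sentences from a single document, or from several documents, into clusters of similar sentences.

Reusing the code above, we can apply a simple k-means from sklearn.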

from sklearn.cluster import KMeans
import numpy as np

# Cluster the headline embeddings into 10 groups
num_clusters = 10
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_

# Print the headlines that fall into each cluster
for i in range(num_clusters):
    print()
    print(f'Cluster {i + 1} contains:')
    clust_sent = np.where(cluster_assignment == i)
    for k in clust_sent[0]:
        print(f'- {corpus[k]}')

Again, the results are remarkably accurate for a machine. Here are a few of the clusters:

Cluster 2 contains:
- AstraZeneca to make an experimental coronavirus vaccine developed by Oxford University (Bloomberg)
- Trump says he is pushing FDA to approve emergency-use authorization for Gilead's remdesivir (WSJ)

Cluster 3 contains:
- Chesapeake Energy preparing bankruptcy filing and has held discussions with lenders about a ~$1B loan (Reuters)
- Trump administration may announce plans as soon as today to offer loans to oil companies, possibly in exchange for a financial stake (Bloomberg)
- Munchin says Trump administration could allow oil companies to store another several hundred million barrels (NY Times)

Cluster 4 contains:
- Trump says China wants to him to lose his bid for re-election and notes he is looking at different options in terms of consequences for Beijing over the virus (Reuters)
- Amazon accused by Trump administration of tolerating counterfeit sales, but company says hit politically motivated (WSJ)
- Trump contradicts US intel, says Covid-19 started in Wuhan lab. (The Hill)
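If you want to work with the clusters programmatically rather than just print them, it can help to collect the sentences into a dictionary keyed by cluster label. This is a small sketch with a hypothetical helper, group_by_cluster, and made-up sentences and labels standing in for corpus and cluster_assignment from the code above.

```python
from collections import defaultdict

def group_by_cluster(sentences, labels):
    """Map each cluster label to the list of sentences assigned to it."""
    clusters = defaultdict(list)
    for sentence, label in zip(sentences, labels):
        clusters[int(label)].append(sentence)
    return dict(clusters)

# Made-up sentences and labels (assumed values, for illustration only)
sentences = [
    "Vaccine trial starts",
    "Oil prices fall",
    "New vaccine approved",
    "Oil output cut",
]
labels = [0, 1, 0, 1]

for label, members in sorted(group_by_cluster(sentences, labels).items()):
    print(f"Cluster {label + 1} contains:")
    for sentence in members:
        print(f"- {sentence}")
```

The int() conversion is there so that NumPy integer labels, such as those in labels_ from KMeans, become plain Python dictionary keys.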

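Summary

Interestingly, Elasticsearch now supports dense vectors: https://www.elastic.co/blog/text-similarity-search-with-vectors-in-elasticsearch, and it can be compared with other industry tools for fast vector comparison, such as Facebook's faiss. This technology is cutting-edge yet highly practical, and can be rolled out within a few weeks. Advanced AI is within reach of anyone who knows what to look for.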

—END—

Original English article: https://towardsdatascience.com/cutting-edge-semantic-search-and-sentence-similarity-53380328c655

