Contents

  • 1. Smooth Inverse Frequency (SIF)
  • 2. BM25
    • 2.1 BM25 source code implementation
  • 3. Code implementation of a retrieval system based on BM25, TF-IDF and SIF

  • Dataset and code links for the retrieval system based on BM25, TF-IDF and SIF

1. Smooth Inverse Frequency (SIF)

  Smooth Inverse Frequency (SIF) is a vector-based retrieval method. Before introducing SIF, it helps to understand two baselines: the average word vector and the TF-IDF weighted average word vector.

  • The average word vector sums the word embeddings of all words in a sentence and takes the mean, and the resulting vector is used as the sentence embedding. The drawback is that every word is treated as equally important to the meaning of the sentence.
  • The TF-IDF weighted average word vector scores each word by its TF-IDF value and uses those scores as weights in the average, yielding the final sentence representation. (A small sketch of both baselines follows below.)
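To make the two baselines concrete, here is a minimal sketch (my own illustration, not code from the article); the toy sentences and the random 4-dimensional word vectors are placeholders for real embeddings:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# toy tokenized corpus and made-up 4-dimensional word vectors (illustration only)
sentences = [['i', 'like', 'cats'], ['i', 'like', 'dogs', 'too']]
rng = np.random.RandomState(0)
word_vectors = {w: rng.randn(4) for s in sentences for w in s}

# fit a TF-IDF model on the same corpus
tfidf = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
tfidf.fit([' '.join(s) for s in sentences])
vocab_index = tfidf.vocabulary_

def average_embedding(tokens):
    # unweighted mean of the word vectors: every word counts equally
    return np.mean([word_vectors[w] for w in tokens], axis=0)

def tfidf_weighted_embedding(tokens):
    # weight each word vector by the word's TF-IDF score in this sentence
    row = tfidf.transform([' '.join(tokens)]).toarray()[0]
    weights = np.array([row[vocab_index[w]] for w in tokens])
    vectors = np.array([word_vectors[w] for w in tokens])
    return weights.dot(vectors) / weights.sum()

print(average_embedding(sentences[0]))
print(tfidf_weighted_embedding(sentences[0]))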

SIF weighted averaging improves on the two methods above. The SIF algorithm consists of two steps: first, compute a weighted average of the word vectors in each sentence to obtain the sentence vector $v_s$; second, remove (subtract) the projection of $v_s$ onto the first principal component (principal component / singular vector) of the matrix formed by all sentence vectors.

  • The first step improves the TF-IDF weighted average representation. The paper proposes a smooth inverse frequency (SIF) scheme to compute each word's weight: the weight of word $w$ is $a/(a+p(w))$, where $a$ is a smoothing parameter and $p(w)$ is the (estimated) word frequency. Intuitively, when a low-frequency word appears in the current sentence, it carries more of the sentence's meaning, so it receives a larger weight. In fact, if a sentence is treated as a document and we assume no word repeats within it (TF = 1), TF-IDF degenerates into the unsmoothed inverse word frequency, i.e., SIF without smoothing. But unlike TF-IDF, which is an empirical formula, the paper provides a theoretical justification for SIF.
  • For the second step, my intuitive understanding is that it removes the information shared by all sentences, so the remaining sentence vectors better represent the sentences themselves and are more distinguishable from one another.

The algorithm's pseudocode can be found in the original paper (the pseudocode figure is not reproduced here).

The reference implementation is available at: https://github.com/PrincetonML/SIF
A code sketch is given below.
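The following is a minimal sketch of the two SIF steps, under stated assumptions: word vectors and estimated word frequencies $p(w)$ are given as plain dictionaries, the smoothing parameter a is set to 1e-3, and only the first principal component is removed. It is an illustration, not the PrincetonML implementation.

import numpy as np
from sklearn.decomposition import TruncatedSVD

def sif_embeddings(sentences, word_vectors, word_freq, a=1e-3):
    """sentences: list of token lists; word_vectors: dict word -> np.array;
    word_freq: dict word -> estimated frequency p(w)."""
    # step 1: SIF-weighted average of word vectors, weight = a / (a + p(w))
    dim = len(next(iter(word_vectors.values())))
    emb = np.zeros((len(sentences), dim))
    for i, sent in enumerate(sentences):
        words = [w for w in sent if w in word_vectors]
        if not words:
            continue
        weights = np.array([a / (a + word_freq.get(w, 0.0)) for w in words])
        vectors = np.array([word_vectors[w] for w in words])
        emb[i] = weights.dot(vectors) / len(words)

    # step 2: remove the projection onto the first principal component
    svd = TruncatedSVD(n_components=1, n_iter=7, random_state=0)
    svd.fit(emb)
    pc = svd.components_              # shape: (1, dim)
    emb = emb - emb.dot(pc.T) * pc    # subtract each row's projection onto pc
    return emb

A new query can then be embedded with step 1 alone (the weighted average) and compared with the corpus embeddings by cosine similarity, which is essentially what the SIF retrieval model in Section 3 does.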

2. BM25

  BM25 is an algorithm for measuring the relevance between a search query and a document, and it is derived from a probabilistic retrieval model. Given a query and a set of documents Ds, we compute the relevance between the query and each document D. The query is first segmented (tokenized) into words $q_i$, and each word's contribution to the score consists of the following parts:

  • The weight of each word, i.e., its inverse document frequency, where $N$ is the total number of documents and $n(q_i)$ is the number of documents containing the word $q_i$:
    $IDF(q_i)=\log\frac{N-n(q_i)+0.5}{n(q_i)+0.5}$
  • The relevance score $R$ between the word and the document; the per-word scores are then summed over the query to give the query–document score:
    $R(q_i,d)=\frac{f_i\,(k_1+1)}{f_i+K}\cdot\frac{qf_i\,(k_2+1)}{qf_i+k_2}$
    $K=k_1\left(1-b+b\,\frac{dl}{avgdl}\right)$
    where $k_1$, $k_2$ and $b$ are hyper-parameters, $f_i$ is the frequency of $q_i$ in the document $d$, $qf_i$ is the frequency of $q_i$ in the query, $dl$ is the length of the current document, and $avgdl$ is the average document length.
    $Score(Q,d)=\sum_{i=1}^{n}IDF(q_i)\,R(q_i,d)$
    The top-$K$ documents by score are returned.
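The formulas above map almost line by line onto code. Below is a small illustrative sketch of my own (not the gensim implementation shown in the next subsection), assuming the common default values k1 = 1.5, k2 = 1.0 and b = 0.75:

import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, docs, k1=1.5, k2=1.0, b=0.75):
    """Score one document against a query; `docs` (a list of token lists)
    supplies the corpus statistics N, n(q_i) and avgdl."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    dl = len(doc_tokens)
    tf_doc = Counter(doc_tokens)       # f_i: frequency of q_i in the document
    tf_query = Counter(query_tokens)   # qf_i: frequency of q_i in the query
    K = k1 * (1 - b + b * dl / avgdl)

    score = 0.0
    for q in set(query_tokens):
        n_q = sum(1 for d in docs if q in d)           # number of documents containing q_i
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5))  # IDF(q_i)
        f_i, qf_i = tf_doc[q], tf_query[q]
        r = (f_i * (k1 + 1) / (f_i + K)) * (qf_i * (k2 + 1) / (qf_i + k2))
        score += idf * r
    return score

# rank a toy tokenized corpus and keep the top-K (here top-2) documents
docs = [['公司', '着装', '要求'], ['食堂', '几点', '开门'], ['上班', '穿', '正装']]
query = ['上班', '着装']
topk = sorted(range(len(docs)), key=lambda i: bm25_score(query, docs[i], docs), reverse=True)[:2]
print(topk)

Note that the gensim implementation shown below keeps only the document-side factor of $R$; it does not include the query-frequency term with $k_2$.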

2.1 BM25 source code implementation

from gensim.summarization import bm25


class BM25RetrievalModel:
    """BM25 definition: https://en.wikipedia.org/wiki/Okapi_BM25"""

    def __init__(self, corpus):
        self.model = bm25.BM25(corpus)

    def get_top_similarities(self, query, topk=10):
        """query: [word1, word2, ..., wordn]"""
        # score the query against every document
        scores = self.model.get_scores(query)
        # sort document indices by score in descending order and keep the top-k
        rtn = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)[:topk]
        # return the indices of the top-1 and top-2 documents
        return rtn[0][0], rtn[1][0]
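A quick usage sketch (the toy corpus is made up; note that gensim.summarization was removed in gensim 4.0, so this wrapper requires an older gensim such as 3.8.x):

# illustrative only: a tiny tokenized corpus
corpus = [['公司', '着装', '要求'], ['食堂', '几点', '开门'], ['门禁卡', '如何', '补办']]
model = BM25RetrievalModel(corpus)
# indices of the two most relevant documents for the query
top_1, top_2 = model.get_top_similarities(['着装', '要求'], topk=2)
print(top_1, top_2)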

The underlying gensim source is:

import logging
import math
from six import iteritems
from six.moves import range
from functools import partial
from multiprocessing import Pool
from ..utils import effective_n_jobs

PARAM_K1 = 1.5
PARAM_B = 0.75
EPSILON = 0.25

logger = logging.getLogger(__name__)


class BM25(object):
    """Implementation of Best Matching 25 ranking function.

    Attributes
    ----------
    corpus_size : int
        Size of corpus (number of documents).
    avgdl : float
        Average length of document in `corpus`.
    doc_freqs : list of dicts of int
        Dictionary with terms frequencies for each document in `corpus`. Words used as keys and frequencies as values.
    idf : dict
        Dictionary with inversed documents frequencies for whole `corpus`. Words used as keys and frequencies as values.
    doc_len : list of int
        List of document lengths.
    """

    def __init__(self, corpus, k1=PARAM_K1, b=PARAM_B, epsilon=EPSILON):
        """
        Parameters
        ----------
        corpus : list of list of str
            Given corpus.
        k1 : float
            Constant used for influencing the term frequency saturation. After saturation is reached, additional
            presence for the term adds a significantly less additional score. According to [1]_, experiments suggest
            that 1.2 < k1 < 2 yields reasonably good results, although the optimal value depends on factors such as
            the type of documents or queries.
        b : float
            Constant used for influencing the effects of different document lengths relative to average document length.
            When b is bigger, lengthier documents (compared to average) have more impact on its effect. According to
            [1]_, experiments suggest that 0.5 < b < 0.8 yields reasonably good results, although the optimal value
            depends on factors such as the type of documents or queries.
        epsilon : float
            Constant used as floor value for idf of a document in the corpus. When epsilon is positive, it restricts
            negative idf values. Negative idf implies that adding a very common term to a document penalize the overall
            score (with 'very common' meaning that it is present in more than half of the documents). That can be
            undesirable as it means that an identical document would score less than an almost identical one (by
            removing the referred term). Increasing epsilon above 0 raises the sense of how rare a word has to be (among
            different documents) to receive an extra score.
        """
        self.k1 = k1
        self.b = b
        self.epsilon = epsilon
        self.corpus_size = 0
        self.avgdl = 0
        self.doc_freqs = []
        self.idf = {}
        self.doc_len = []
        self._initialize(corpus)

    def _initialize(self, corpus):
        """Calculates frequencies of terms in documents and in corpus. Also computes inverse document frequencies."""
        nd = {}  # word -> number of documents with word
        num_doc = 0
        for document in corpus:
            self.corpus_size += 1
            self.doc_len.append(len(document))
            num_doc += len(document)

            frequencies = {}
            for word in document:
                if word not in frequencies:
                    frequencies[word] = 0
                frequencies[word] += 1
            self.doc_freqs.append(frequencies)

            for word, freq in iteritems(frequencies):
                if word not in nd:
                    nd[word] = 0
                nd[word] += 1

        self.avgdl = float(num_doc) / self.corpus_size
        # collect idf sum to calculate an average idf for epsilon value
        idf_sum = 0
        # collect words with negative idf to set them a special epsilon value.
        # idf can be negative if word is contained in more than half of documents
        negative_idfs = []
        for word, freq in iteritems(nd):
            idf = math.log(self.corpus_size - freq + 0.5) - math.log(freq + 0.5)
            self.idf[word] = idf
            idf_sum += idf
            if idf < 0:
                negative_idfs.append(word)
        self.average_idf = float(idf_sum) / len(self.idf)

        if self.average_idf < 0:
            logger.warning(
                'Average inverse document frequency is less than zero. Your corpus of {} documents'
                ' is either too small or it does not originate from natural text. BM25 may produce'
                ' unintuitive results.'.format(self.corpus_size)
            )

        eps = self.epsilon * self.average_idf
        for word in negative_idfs:
            self.idf[word] = eps

    def get_score(self, document, index):
        """Computes BM25 score of given `document` in relation to item of corpus selected by `index`.

        Parameters
        ----------
        document : list of str
            Document to be scored.
        index : int
            Index of document in corpus selected to score with `document`.

        Returns
        -------
        float
            BM25 score.
        """
        score = 0.0
        doc_freqs = self.doc_freqs[index]
        numerator_constant = self.k1 + 1
        denominator_constant = self.k1 * (1 - self.b + self.b * self.doc_len[index] / self.avgdl)
        for word in document:
            if word in doc_freqs:
                df = self.doc_freqs[index][word]
                idf = self.idf[word]
                score += (idf * df * numerator_constant) / (df + denominator_constant)
        return score

    def get_scores(self, document):
        """Computes and returns BM25 scores of given `document` in relation to
        every item in corpus.

        Parameters
        ----------
        document : list of str
            Document to be scored.

        Returns
        -------
        list of float
            BM25 scores.
        """
        scores = [self.get_score(document, index) for index in range(self.corpus_size)]
        return scores

    def get_scores_bow(self, document):
        """Computes and returns BM25 scores of given `document` in relation to
        every item in corpus.

        Parameters
        ----------
        document : list of str
            Document to be scored.

        Returns
        -------
        list of float
            BM25 scores.
        """
        scores = []
        for index in range(self.corpus_size):
            score = self.get_score(document, index)
            if score > 0:
                scores.append((index, score))
        return scores

3. Code implementation of a retrieval system based on BM25, TF-IDF and SIF

The main function, main.py, is:


import argparse
from NLP贪心.information_retrieval_demo.utils import get_corpus, word_tokenize, build_word_embedding
from model_zoo.bm25_model import BM25RetrievalModel
from model_zoo.tfidf_model import TFIDFRetrievalModel
from model_zoo.sif_model import SIFRetrievalModel
from model_zoo.bert_model import BertRetrievalModel

# hyper-parameter configuration
parser = argparse.ArgumentParser(description='Information retrieval model hyper-parameter setting.')
parser.add_argument('--input_file_path', type=str, default='./ChangCheng.xls', help='Training data location.')
# default='bm25' or 'sif' or 'tfidf'
parser.add_argument('--model_type', type=str, default='sif', help='Different information retrieval models.')
# gensim model path
parser.add_argument('--gensim_model_path', type=str, default='./cached/gensim_model.pkl')
parser.add_argument('--pretrained_gensim_embddings_file', type=str, default='./cached/gensim_word_embddings.pkl')
parser.add_argument('--cached_gensim_embedding_file', type=str, default='./cached/embeddings_gensim.pkl')
# embedding dimension
parser.add_argument('--embedding_dim', type=int, default=100)
# BERT model
parser.add_argument('--bert_model_ckpt', type=str, default='./model_zoo/bert/chinese_L-12_H-768_A-12/bert_model.ckpt')
parser.add_argument('--bert_config_file', type=str, default='./model_zoo/bert/chinese_L-12_H-768_A-12/bert_config.json')
parser.add_argument('--bert_vocab_file', type=str, default='./model_zoo/bert/chinese_L-12_H-768_A-12/vocab.txt')
# maximum sequence length
parser.add_argument('--max_seq_len', type=int, default=30)
# pooling strategy
parser.add_argument('--pooling_strategy', type=int, default=0)
# pooling layer
parser.add_argument('--pooling_layer', type=str, default='-2')

# parse_args() gathers everything registered with add_argument into the args namespace,
# so all options defined above are available as attributes of args.
args = parser.parse_args()

# load question-answer pairs
questions_src, answers = get_corpus(args.input_file_path)

# tokenize; returns a list of token lists
questions = [word_tokenize(line) for line in questions_src]
answers_corpus = [word_tokenize(line) for line in answers]

# word vectors need to be trained on the first run
print('\nBuild gensim model and word vectors...')
build_word_embedding(questions + answers_corpus, args.gensim_model_path, args.pretrained_gensim_embddings_file)


def predict(model, query):
    """
    Prediction
    :param model: retrieval model
    :param query: input text
    :return: top-k questions and answers
    """
    # tokenize the input text
    query = word_tokenize(query)
    # return the indices of the two most similar questions
    top_1, top_2 = model.get_top_similarities(query, topk=2)
    return questions_src[top_1], answers[top_1], questions_src[top_2], answers[top_2]


if __name__ == '__main__':
    # input text
    query = '上班穿什么'

    # model selection
    # BM25
    if args.model_type == 'bm25':
        bm25_model = BM25RetrievalModel(questions)
        res = predict(bm25_model, query)
    # TFIDF
    elif args.model_type == 'tfidf':
        tfidf_model = TFIDFRetrievalModel(questions)
        res = predict(tfidf_model, query)
    # SIF
    elif args.model_type == 'sif':
        # SIF model
        sif_model = SIFRetrievalModel(questions, args.pretrained_gensim_embddings_file,
                                      args.cached_gensim_embedding_file, args.embedding_dim)
        # prediction
        res = predict(sif_model, query)
    # BERT
    elif args.model_type == 'bert':
        bert_model = BertRetrievalModel(questions, args.bert_model_ckpt, args.bert_config_file,
                                        args.bert_vocab_file, args.max_seq_len,
                                        args.pooling_strategy, args.pooling_layer)
        res = predict(bert_model, query)
    else:
        raise ValueError('Invalid model type!')

    # print the results
    print('Query: ', query)
    print('\nQuestion 1: ', res[0])
    print('Answer 1: ', res[1])
    print('\nQuestion 2: ', res[2])
    print('Answer 2: ', res[3])

The utilities module, utils.py, is:

import jieba as jie
import pandas as pd
import numpy as np
import pickle
from gensim.models import Word2Vec
import re
from tqdm import tqdm


def get_corpus(file_path, header_idx=0):
    """
    Read questions and answers
    :param file_path: file path
    :param header_idx: header row index
    :return:
    """
    # read the file
    src_df = pd.read_excel(file_path, header=header_idx)
    print('Corpus shape before: ', src_df.shape)
    # drop rows without a 'Response'
    src_df = src_df.dropna(subset=['Response'])
    print('Corpus shape after: ', src_df.shape)
    # return questions and answers
    return src_df['Question'].tolist(), src_df['Response'].tolist()


def clean_text(text):
    """
    Normalize the text with regular expressions
    :param text:
    :return:
    """
    # regex-based cleaning
    text = re.sub(u"([hH]ttp[s]{0,1})://[a-zA-Z0-9\.\-]+\.([a-zA-Z]{2,4})(:\d+)?(/[a-zA-Z0-9\-~!@#$%^&*+?:_/=<>.',;]*)?", '',
                  text)  # remove http:xxx
    text = re.sub(u'#[^#]+#', '', text)  # remove #xxx#
    text = re.sub(u'回复@[\u4e00-\u9fa5a-zA-Z0-9_-]{1,30}:', '', text)  # remove "回复@xxx:"
    text = re.sub(u'@[\u4e00-\u9fa5a-zA-Z0-9_-]{1,30}', '', text)  # remove "@xxx"
    text = re.sub(r'[0-9]+', 'DIG', text.strip()).lower()   # replace digits with DIG
    text = ''.join(text.split())  # split remove spaces
    # return the cleaned text
    return text


def word_tokenize(line):
    """
    Tokenize the input text
    :param line: input text
    :return:
    """
    # normalize the input text and remove formatting noise
    content = clean_text(line)
    #content_words = [m for m in jie.lcut(content) if m not in self.stop_words]
    # tokenize with jieba and return the token list directly
    return jie.lcut(content)


def load_embedding(cached_embedding_file):
    """load embeddings"""
    with open(cached_embedding_file, mode='rb') as f:
        return pickle.load(f)


def save_embedding(word_embeddings, cached_embedding_file):
    """save word embeddings"""
    with open(cached_embedding_file, mode='wb') as f:
        pickle.dump(word_embeddings, f)


def get_word_embedding_matrix(word2idx, pretrained_embeddings_file, embedding_dim=200):
    """Load pre-trained embeddings"""
    # initialize an empty array
    pre_trained_embeddings = np.zeros((len(word2idx), embedding_dim))
    initialized = 0
    exception = 0
    num = 0
    with open(pretrained_embeddings_file, mode='r') as f:
        try:
            for line in f:
                word_vec = line.split()
                idx = word2idx.get(word_vec[0], -1)
                # if current word exists in word2idx
                if idx != -1:
                    pre_trained_embeddings[idx] = np.array(word_vec[1:], dtype=np.float)
                    initialized += 1
                num += 1
                if num % 10000 == 0:
                    print(num)
        except:
            exception += 1
    print('Pre-trained embedding initialization proportion: ', (initialized + 0.0) / len(word2idx))
    print('exception num: ', exception)
    return pre_trained_embeddings


def build_word_embedding(corpus, gensim_model_path, gensim_word_embdding_path):
    """
    Train word vectors on the corpus
    :param corpus: corpus
    :param gensim_model_path: model path
    :param gensim_word_embdding_path: embedding path
    :return:
    """
    # build the model
    # initialize an empty model
    model = Word2Vec(min_count=1, size=100, window=5, sg=1, negative=5, sample=0.001, iter=30)
    # build the vocabulary from the sentence sequence
    model.build_vocab(sentences=corpus)
    # update the model's neural weights on the sentence sequence
    model.train(sentences=corpus, total_examples=model.corpus_count, epochs=model.iter)
    # save the pre-trained word vectors
    model.wv.save_word2vec_format(gensim_word_embdding_path, binary=False)
    # save the trained model
    model.save(gensim_model_path)
    print('\nGensim model build successfully!')

    print('\nTest the performance of word2vec model')
    # test a few words
    for test_word in ['门禁卡', '食堂', '试用期']:
        # return the most similar words
        aa = model.wv.most_similar(test_word)[0:10]
        print('\nMost similar word of %s is:' % test_word)
        for word, score in aa:
            print('{} {}'.format(word, score))

    '''
    # save word counts
    sorted_word_counts = OrderedDict(sorted(model.wv.vocab.items(), key=lambda x: x[1].count, reverse=True))
    word_counts_file = codecs.open('./word_counts.txt', mode='w', encoding='utf-8')
    for k, v in sorted_word_counts.items():
        word_counts_file.write(k + ' ' + str(v.count) + '\n')
    '''

The BM25 model component (model_zoo/bm25_model.py) is the same BM25RetrievalModel wrapper already shown in Section 2.1 above.

The TF-IDF model component is:


from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import MatrixSimilarity
from sklearn.feature_extraction.text import TfidfVectorizer
from annoy import AnnoyIndex
import numpy as np


class TFIDFRetrievalModel:
    def __init__(self, corpus):
        '''
        min_freq = 3
        self.dictionary = Dictionary(corpus)
        # Filter low frequency words from dictionary.
        low_freq_ids = [id_ for id_, freq in
                        self.dictionary.dfs.items() if freq <= min_freq]
        self.dictionary.filter_tokens(low_freq_ids)
        self.dictionary.compactify()
        self.corpus = [self.dictionary.doc2bow(line) for line in corpus]
        self.model = TfidfModel(self.corpus)
        self.corpus_mm = self.model[self.corpus]
        self.index = MatrixSimilarity(self.corpus_mm)
        '''
        corpus_str = []
        for line in corpus:
            corpus_str.append(' '.join(line))

        self.tfidf = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
        sentence_tfidfs = np.asarray(self.tfidf.fit_transform(corpus_str).todense().astype('float'))

        # build search model
        self.t = AnnoyIndex(sentence_tfidfs.shape[1])
        for i in range(sentence_tfidfs.shape[0]):
            self.t.add_item(i, sentence_tfidfs[i, :])
        self.t.build(10)

    def get_top_similarities(self, query, topk=10):
        """query: [word1, word2, ..., wordn]"""
        '''
        query_vec = self.model[self.dictionary.doc2bow(query)]
        scores = self.index[query_vec]
        rtn = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)[:topk]
        return rtn
        '''
        '''
        query2tfidf = []
        for word in query:
            if word in self.word2tfidf:
                query2tfidf.append(self.word2tfidf[word])
        query2tfidf = np.array(query2tfidf)
        '''
        query2tfidf = np.asarray(self.tfidf.transform([' '.join(query)]).todense().astype('float'))[0]
        top_ids, top_distances = self.t.get_nns_by_vector(query2tfidf, n=topk, include_distances=True)
        return top_ids

The SIF model component is:

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
from NLP贪心.information_retrieval_demo.utils import get_word_embedding_matrix, save_embedding, load_embedding
import os
from sklearn.decomposition import PCA
from annoy import AnnoyIndex


class SIFRetrievalModel:
    """
    A simple but tough-to-beat baseline for sentence embedding.
    from https://openreview.net/pdf?id=SyK00v5xx
    Principle : Represent the sentence by a weighted average of the word vectors, and then modify them using Principal Component Analysis.

    Issue 1: how to deal with big input size ?
    randomized SVD version will not be affected by scale of input, see https://github.com/PrincetonML/SIF/issues/4

    Issue 2: how to preprocess input data ?
    Even if you dont remove stop words SIF will take care, but its generally better to clean the data,
    see https://github.com/PrincetonML/SIF/issues/23

    Issue 3: how to obtain the embedding of new sentence ?
    Weighted average is enough, see https://www.quora.com/What-are-some-interesting-techniques-for-learning-sentence-embeddings
    """
    def __init__(self, corpus, pretrained_embedding_file, cached_embedding_file, embedding_dim):
        self.embedding_dim = embedding_dim
        self.max_seq_len = 0
        corpus_str = []
        # iterate over the corpus
        for line in corpus:
            corpus_str.append(' '.join(line))
            self.max_seq_len = max(self.max_seq_len, len(line))

        # count word occurrences
        counter = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
        # bag of words format, i.e., [[1.0, 2.0, ...], []]
        bow = counter.fit_transform(corpus_str).todense().astype('float')
        # word count
        word_count = np.sum(bow, axis=0)
        # word frequency, i.e., p(w)
        word_freq = word_count / np.sum(word_count)

        # the parameter in the SIF weighting scheme, usually in the range [1e-5, 1e-3]
        SIF_weight = 1e-3
        # compute the SIF word weights
        self.word2weight = np.asarray(SIF_weight / (SIF_weight + word_freq))

        # number of principal components to remove in SIF weighting scheme
        self.SIF_npc = 1
        self.word2id = counter.vocabulary_

        # word ids for the corpus
        seq_matrix_id = np.zeros(shape=(len(corpus_str), self.max_seq_len), dtype=np.int64)
        # word weights for the corpus
        seq_matrix_weight = np.zeros((len(corpus_str), self.max_seq_len), dtype=np.float64)

        # iterate over every sample
        for idx, seq in enumerate(corpus):
            seq_id = []
            for word in seq:
                if word in self.word2id:
                    seq_id.append(self.word2id[word])

            seq_len = len(seq_id)
            seq_matrix_id[idx, :seq_len] = seq_id

            seq_weight = [self.word2weight[0][id] for id in seq_id]
            seq_matrix_weight[idx, :seq_len] = seq_weight

        if os.path.exists(cached_embedding_file):
            self.word_embeddings = load_embedding(cached_embedding_file)
        else:
            self.word_embeddings = get_word_embedding_matrix(counter.vocabulary_, pretrained_embedding_file, embedding_dim=self.embedding_dim)
            save_embedding(self.word_embeddings, cached_embedding_file)

        # compute the sentence vectors
        self.sentence_embeddings = self.SIF_embedding(seq_matrix_id, seq_matrix_weight)

        # build search model
        # Annoy partitions the vector space with many hyperplanes to speed up vector retrieval
        self.t = AnnoyIndex(self.embedding_dim)
        # add the sentence embeddings to Annoy and build the trees; the index can then be queried for nearest vectors
        for i in range(self.sentence_embeddings.shape[0]):
            self.t.add_item(i, self.sentence_embeddings[i, :])
        self.t.build(10)

    def SIF_embedding(self, x, w):
        """Compute the sentence vectors"""
        # weighted averages
        n_samples = x.shape[0]
        emb = np.zeros((n_samples, self.word_embeddings.shape[1]))
        for i in range(n_samples):
            emb[i, :] = w[i, :].dot(self.word_embeddings[x[i, :], :]) / np.count_nonzero(w[i, :])

        # removing the projection on the first principal component
        # randomized SVD version will not be affected by scale of input, see https://github.com/PrincetonML/SIF/issues/4
        #svd = TruncatedSVD(n_components=self.SIF_npc, n_iter=7, random_state=0)
        svd = PCA(n_components=self.SIF_npc, svd_solver='randomized')
        svd.fit(emb)
        self.pc = svd.components_
        #print('pc shape:', pc.shape)

        if self.SIF_npc == 1:
            # pc.transpose().shape : embedding_size * 1
            # emb.dot(pc.transpose()).shape: num_sample * 1
            # (emb.dot(pc.transpose()) * pc).shape: num_sample * embedding_size
            common_component_removal = emb - emb.dot(self.pc.transpose()) * self.pc
        else:
            # pc.shape: self.SIF_npc * embedding_size
            # emb.dot(pc.transpose()).shape: num_sample * self.SIF_npc
            # emb.dot(pc.transpose()).dot(pc).shape: num_sample * embedding_size
            common_component_removal = emb - emb.dot(self.pc.transpose()).dot(self.pc)

        # return the SIF embedding vectors
        return common_component_removal

    def get_top_similarities(self, query, topk=10):
        """query: [word1, word2, ..., wordn]"""
        query2id = []
        for word in query:
            if word in self.word2id:
                query2id.append(self.word2id[word])
        query2id = np.array(query2id)

        id2weight = np.array([self.word2weight[0][id] for id in query2id])
        # compute the query embedding
        query_embedding = id2weight.dot(self.word_embeddings[query2id, :]) / query2id.shape[0]
        # return top_ids and top_distances
        top_ids, top_distances = self.t.get_nns_by_vector(query_embedding, n=topk, include_distances=True)
        return top_ids

The output is:

Query:  上班穿什么
Question 1:  公司有着装要求吗?
Answer 1:  公司要求周一至周四着正装,佩戴工牌和司徽(司徽可在六楼综合处贾洋洋处领取),男士需打领带。并且保证桌面整洁,座位处不放杂物。
Question 2:  公司有着装要求么?
Answer 2:  公司要求周一至周四着正装,佩戴工牌和司徽(司徽可在六楼综合处贾洋洋处领取),男士需打领带。并且保证桌面整洁,座位处不放杂物。


