High-Dimensional Vector Search: Similarity Retrieval over the Tencent Word Vectors in Practice

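Recently I ran into OOV (out-of-vocabulary) problems with a corpus at work. It occurred to me that replacing OOV words with synonyms could offset part of the damage, which led to the experiments in this post.
The simplest approach is to train a word2vec model on the corpus with gensim and compute similarities directly with gensim's built-in similarity API; this is the easiest route as long as memory consumption and computation time are ignored. But my own corpus is where the OOV problem comes from, so a model trained on it would probably not perform well either. That led me to the Tencent word vectors. There is an article online about computing similar words with them, but it can only be queried through a WeChat official account and provides no code.
This post records my experiments with the full set of Tencent word vectors and hnsw, a high-dimensional vector search tool.
It covers the following: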

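1. Reading the file

Search online for the Tencent word vector download. After downloading and unpacking, the file is about 16 GB and contains 200-dimensional vectors.
Reading it is straightforward: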

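import numpy as np

def load_tencent_emb_data(path):
    """
    Load the Tencent word vectors.
    :param path: path to the unpacked embedding text file
    :return: (vectors as an np.ndarray, dict mapping word -> row index)
    """
    datas = []
    word_id_map = {}
    with open(path, 'r', encoding='utf-8') as fd:
        for idx, line in enumerate(fd):
            if idx == 0:
                # skip the first line, which is not a word vector
                continue
            line = line.strip().split(' ')
            # use the row position in datas as the id; the file line number
            # idx would be off by one because the first line was skipped
            word_id_map[line[0]] = len(datas)
            datas.append([float(x) for x in line[1:]])
    return np.asarray(datas), word_id_map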

Because hnsw needs its input in numpy format, the vectors are converted to an np.ndarray.
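Building a Python list of lists for a file of this size uses far more memory than the final array does. Below is a minimal sketch of an alternative loader that preallocates a float32 matrix. It assumes the first line of the file is a word2vec-style header of the form "<number of words> <dimension>", and the function name load_embeddings_preallocated is made up for this illustration. The tiny sample file exists only to show the expected layout.

```python
import os
import tempfile

import numpy as np


def load_embeddings_preallocated(path):
    """Load a word2vec-style text file into a preallocated float32 matrix."""
    word_id_map = {}
    with open(path, "r", encoding="utf-8") as fd:
        # Assumption: the header line is "<number of words> <dimension>".
        num_words, dim = (int(x) for x in fd.readline().split())
        data = np.empty((num_words, dim), dtype=np.float32)
        row = 0
        for line in fd:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                # Skip malformed lines, e.g. a token that itself contains spaces.
                continue
            word_id_map[parts[0]] = row
            data[row] = [float(x) for x in parts[1:]]
            row += 1
    # Trim unused rows in case some lines were skipped.
    return data[:row], word_id_map


# Toy file with a header and three 2-dimensional vectors (made-up values).
sample = "3 2\nalpha 1.0 0.0\nbeta 0.0 1.0\ngamma 0.5 0.5\n"
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "toy_vectors.txt")
    with open(sample_path, "w", encoding="utf-8") as out:
        out.write(sample)
    toy_data, toy_word_id_map = load_embeddings_preallocated(sample_path)
```

hnswlib stores vectors as 32-bit floats internally, so loading as float32 should not cost any precision in the index compared with float64 input.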

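2. Building the index

High-dimensional vector search generally requires building an index first; the same goes for faiss, NSG, SSG and similar tools (I will add demos for those when I have time).
The hnsw README explains how to build an index very well; I am only relaying it here.
First, install the hnsw package: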

pip install hnswlib

Below is the code that builds the index over the Tencent word vectors:

import hnswlib
import numpy as np

def build_hnsw_search_index(data):
    num_elements, dim = data.shape
    # Use each vector's row number as its label
    data_labels = np.arange(num_elements)
    # Declaring index
    p = hnswlib.Index(space='cosine', dim=dim)  # possible options are l2, cosine or ip
    # Initing index - the maximum number of elements should be known beforehand
    p.init_index(max_elements=num_elements, ef_construction=200, M=16)
    # Element insertion (can be called several times):
    p.add_items(data, data_labels)
    return p

3. Vector search

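def search_word_similarity(word, k=10):
    """
    Find the k words closest to word.
    Relies on the globals word_id_map and data (from load_tencent_emb_data)
    and p (from build_hnsw_search_index).
    :param word: query word
    :param k: number of neighbours to return
    :return: (labels, distances) from knn_query, or [] if word is unknown
    """
    w_id = word_id_map.get(word, None)
    # compare against None explicitly: id 0 is a valid id but is falsy
    if w_id is None:
        print('no embedding found for {}'.format(word))
        return []
    return p.knn_query(data[w_id], k=k)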
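knn_query returns a pair (labels, distances), each a 2-D array with one row per query, so the result above is a set of ids rather than words. Here is a small sketch of mapping those ids back to words. The helper names build_id_word_map and labels_to_words are hypothetical, and the labels and distances in the demo are made-up stand-ins for a real query result. With space='cosine', hnswlib reports distance as 1 minus the cosine similarity, which is why the sketch converts it back.

```python
def build_id_word_map(word_id_map):
    """Invert the word -> id mapping once, so lookups by id are cheap."""
    return {idx: word for word, idx in word_id_map.items()}


def labels_to_words(labels, distances, id_word_map):
    """Turn the first row of a knn_query result into (word, similarity) pairs."""
    pairs = []
    for label, dist in zip(labels[0], distances[0]):
        # Cosine distance in hnswlib is 1 - cosine similarity.
        pairs.append((id_word_map[int(label)], 1.0 - float(dist)))
    return pairs


# Made-up vocabulary and query result, shaped like knn_query output for one query.
toy_word_id_map = {"alpha": 0, "beta": 1, "gamma": 2}
toy_labels = [[0, 2]]
toy_distances = [[0.0, 0.25]]

toy_id_word_map = build_id_word_map(toy_word_id_map)
toy_pairs = labels_to_words(toy_labels, toy_distances, toy_id_word_map)
```

The query word itself normally comes back as the first hit with distance close to 0, so callers who want synonyms may prefer to ask for k + 1 neighbours and drop that first entry.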

For example, searching for 北京 (Beijing) returns the following words:

For 中方 (the Chinese side):

Because this is approximate search, there is a trade-off between query time and accuracy; see https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md
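One way to see where an index sits on that trade-off is to compare its answers with exact brute-force results on a sample of queries. The sketch below computes exact cosine top-k with numpy and a simple recall measure; the function names exact_cosine_top_k and recall_at_k are made up for illustration, and the data is random rather than real word vectors.

```python
import numpy as np


def exact_cosine_top_k(data, query, k):
    """Exact top-k row ids by cosine similarity, computed by brute force."""
    data_unit = data / np.linalg.norm(data, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    sims = data_unit @ query_unit
    # Sort by descending similarity and keep the first k ids.
    return np.argsort(-sims)[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbours that the approximate result recovered."""
    approx = set(int(i) for i in approx_ids)
    exact = set(int(i) for i in exact_ids)
    return len(approx & exact) / len(exact)


# Random toy data; the query is row 0, so row 0 must be its own nearest neighbour.
rng = np.random.default_rng(0)
toy_data = rng.random((1000, 8))
toy_exact = exact_cosine_top_k(toy_data, toy_data[0], k=5)
toy_recall = recall_at_k([0, 1, 2], [0, 1, 9])
```

In practice the approximate ids would be the first row of the labels returned by p.knn_query. If recall turns out too low, raising the query-time parameter with p.set_ef(...) usually trades speed for accuracy, as the linked page describes.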

4. Serializing and loading the index

4.1 Serialization

p.save_index(path)

4.2 Loading

# p must be an hnswlib.Index created with the same space and dim as the saved index
p.load_index(path)

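5. Summary

In my experience hnsw has an edge over faiss in accuracy. Once the 15 GB of Tencent word vectors are converted into an index, it takes about 8 GB of memory. faiss is also a very good library, but in use its accuracy did not feel as good as hnsw's. Zhejiang University and Alibaba have also published related work recently, and I will post the corresponding experiment code when I have time (the Zhejiang University library is entirely C++, so the barrier to entry is a bit high and compilation is somewhat complicated).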
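The 8 GB figure is roughly what a back-of-the-envelope estimate gives. The sketch below assumes 4 bytes per vector component (float32) plus about 10 bytes of graph links per unit of M for each element, the rough figure quoted in hnswlib's ALGO_PARAMS.md. The word count of 8,824,330 is my assumption about the size of the downloaded release and does not come from the measurements above.

```python
def estimate_hnsw_memory_bytes(num_elements, dim, m, link_bytes_per_m=10):
    """Rough index size: float32 vectors plus an approximate link overhead."""
    vector_bytes = dim * 4            # 4 bytes per float32 component
    link_bytes = m * link_bytes_per_m  # rough per-element graph overhead
    return num_elements * (vector_bytes + link_bytes)


# Assumption: the release holds 8,824,330 words of dimension 200; M=16 as in section 2.
ASSUMED_WORD_COUNT = 8_824_330
estimated_bytes = estimate_hnsw_memory_bytes(ASSUMED_WORD_COUNT, dim=200, m=16)
estimated_gib = estimated_bytes / 2**30
print("estimated index size: {:.1f} GiB".format(estimated_gib))
```

Most of the estimate is the raw float32 vectors themselves. The text file is larger than the index mainly because each number is stored as a decimal string.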