Searching for Similar Text in Python with a Few Lines of Code: An NLP Project

Natural Language Processing

What is Natural Language Processing?

Natural Language Processing (NLP) refers to building applications that understand human language. NLP has many use cases today, because people generate thousands of gigabytes of text data every day through blogs, social media comments, product reviews, news archives, official reports, and more. Search engines are the biggest example of NLP in use. I don’t think you will find many people around you who have never used a search engine.

Project Overview

In my experience, the best way to learn is by doing a project. In this article, I will explain NLP with a real project. The dataset I will use is called ‘people_wiki.csv’. I found this dataset on Kaggle, where it is available to download.

The dataset contains the names of some famous people, their Wikipedia URLs, and the text of their Wikipedia pages, so it is quite large. The goal of this project is to find people with related backgrounds. In the end, if you give the algorithm the name of a famous person, it will return the names of a predefined number of people who have a similar background according to their Wikipedia text. Does this sound a bit like a search engine?

Step By Step Implementation

1. Import the necessary packages and the dataset.

import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv('people_wiki.csv')
df.head()

2. Vectorize the ‘text’ column

How to Vectorize?

Python’s scikit-learn library has a class named ‘CountVectorizer’. It assigns an index to each word and generates a vector that contains the number of times each word appears in a piece of text. Here, I will demonstrate it with a small text. Suppose this is our text:

text = ["Jen is a good student. Jen plays guiter as well"]

CountVectorizer was already imported from scikit-learn above, so let’s create a vectorizer and fit it on the text.

vectorizer = CountVectorizer()
vectorizer.fit(text)

Here, I am printing the vocabulary:

print(vectorizer.vocabulary_)
# Output:
# {'jen': 4, 'is': 3, 'good': 1, 'student': 6, 'plays': 5, 'guiter': 2, 'as': 0, 'well': 7}

Look, each word of the text received a number, which is the index of that word. By default, CountVectorizer lowercases the text and ignores single-character tokens such as ‘a’, so eight distinct words remain and the indices run from 0 to 7. Next, we need to transform the text. I will print the transformed vector as an array.

vector = vectorizer.transform(text)
print(vector.toarray())

Here is the output: [[1 1 1 1 2 1 1 1]]. ‘jen’ has index 4 and appears twice, so the element at index 4 of the output vector is 2. All the other words appear only once, so the remaining elements are ones.
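To make the counting step less of a black box, here is a minimal pure-Python sketch of the same idea. It is not scikit-learn’s implementation; the helper name `count_vectorize` is hypothetical, and the only behaviors it imitates are the defaults described above (lowercasing, dropping one-character tokens, indexing words alphabetically).

```python
import re
from collections import Counter


def count_vectorize(document):
    """Return (vocabulary, counts) for a single document."""
    # Lowercase the text and keep only tokens of two or more word
    # characters, mirroring CountVectorizer's default tokenization.
    tokens = re.findall(r"\b\w\w+\b", document.lower())
    # Give each distinct token an index, in alphabetical order.
    vocabulary = {word: index for index, word in enumerate(sorted(set(tokens)))}
    # Put each token's count at that token's index.
    counts = [0] * len(vocabulary)
    for word, n in Counter(tokens).items():
        counts[vocabulary[word]] = n
    return vocabulary, counts


vocabulary, counts = count_vectorize("Jen is a good student. Jen plays guiter as well")
print(vocabulary)
print(counts)  # [1, 1, 1, 1, 2, 1, 1, 1]
```

The count list matches the array printed by the vectorizer above: ‘jen’ sits at index 4 with a count of 2.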

Now, vectorize the ‘text’ column of the dataset, using the same technique.

vect = CountVectorizer()
word_weight = vect.fit_transform(df['text'])

In the demonstration, I used ‘fit’ first and ‘transform’ afterwards. Conveniently, ‘fit_transform’ does both at once. Here, word_weight holds the count vectors I explained before, stored as a sparse matrix with one row for each row of text in the ‘text’ column.
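As a quick check of that equivalence, the sketch below runs both routes on a tiny made-up corpus (the three sentences are assumptions for illustration, not rows from people_wiki.csv) and compares the results.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, used only for illustration.
docs = [
    "Jen is a good student",
    "Jen plays guitar as well",
    "Sam is a good guitar player",
]

# Two steps: learn the vocabulary, then count the words.
two_step = CountVectorizer().fit(docs).transform(docs)

# One step: fit_transform learns the vocabulary and counts in one call.
one_step = CountVectorizer().fit_transform(docs)

# Both results are sparse matrices with one row per document
# and one column per vocabulary word.
print(one_step.shape)
same = bool((two_step.toarray() == one_step.toarray()).all())
print(same)
```

Calling `.toarray()` is fine for a corpus this small, but on the full dataset it is usually better to keep the matrix sparse.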

3. Fit the Nearest Neighbors model on the ‘word_weight’ from the previous step.

The idea of the nearest neighbors model is to find a predefined number of training points that are closest to a query point, along with their distances from it. If it’s not clear, do not worry. Looking at the implementation will make it easier.
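Before using the library, it may help to see the idea written out by hand. The following is a rough brute-force sketch, not what scikit-learn does internally; the function `k_nearest` and the four small count vectors are hypothetical.

```python
import math


def k_nearest(points, query, k):
    """Return the k points closest to `query` as (distance, index) pairs."""
    # Euclidean distance from the query to every stored point.
    scored = [(math.dist(point, query), index) for index, point in enumerate(points)]
    # Smallest distances first; keep only the first k.
    return sorted(scored)[:k]


# Hypothetical word-count vectors, one per document.
points = [
    [2, 1, 0, 0],
    [2, 0, 1, 0],
    [0, 0, 3, 1],
    [1, 1, 0, 0],
]

# Query with the first document itself.
result = k_nearest(points, points[0], k=3)
for distance, index in result:
    print(index, round(distance, 3))
```

The query point comes back first with a distance of 0, followed by the documents whose word counts differ the least from it.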

nn = NearestNeighbors(metric='euclidean')
nn.fit(word_weight)

4. Find 10 people with backgrounds similar to President Barack Obama’s.

First, find the index of ‘Barack Obama’ in the dataset.

obama_index = df[df['name'] == 'Barack Obama'].index[0]

Calculate the distances and the indices of the 10 people whose backgrounds are closest to President Obama’s. The rows of the word weight matrix are in the same order as the rows of the dataset, so the row containing the information about ‘Barack Obama’ has the same index as in the dataset. We need to pass that row and the number of people we want. This returns the distances of those people from ‘Barack Obama’ and their indices.

distances, indices = nn.kneighbors(word_weight[obama_index], n_neighbors = 10)
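To see what `kneighbors` hands back, here is a small sketch on a made-up corpus (the four sentences are assumptions standing in for the ‘text’ column). It shows that both return values are 2-D arrays with one row per query, and that the query row is its own nearest neighbor.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical mini-corpus standing in for the 'text' column.
docs = [
    "senator from illinois elected president",
    "senator from arizona ran for president",
    "guitarist in a rock band",
    "drummer in a rock band",
]

counts = CountVectorizer().fit_transform(docs)
model = NearestNeighbors(metric="euclidean").fit(counts)

# Query with row 0 and ask for the 2 closest rows.
distances, indices = model.kneighbors(counts[0], n_neighbors=2)

print(distances.shape, indices.shape)  # one row per query, one column per neighbor
print(indices[0])    # row positions, closest first
print(distances[0])  # matching distances, in the same order
```

Because the query is itself part of the fitted data, asking for 10 neighbors on the real dataset returns the person queried plus 9 others.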

Organize the result in a DataFrame.

neighbors = pd.DataFrame({'distance': distances.flatten(), 'id': indices.flatten()})
print(neighbors)

Let’s find the names of the people from the indices. There are several ways to look up names from an index. I used the merge function: I merged the ‘neighbors’ DataFrame above with the original DataFrame ‘df’, matching the ‘id’ column of ‘neighbors’ against the index of ‘df’, and sorted the result by distance. President Obama has zero distance from himself, so he comes out on top.

nearest_info = (
    df.merge(neighbors, right_on='id', left_index=True)
    .sort_values('distance')[['id', 'name', 'distance']]
)
print(nearest_info)

These are the 10 people closest to President Obama according to the text of their Wikipedia pages. The results make sense, right?
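The steps above can be gathered into one reusable function. This is only a sketch under some assumptions: the function name `find_similar` is hypothetical, the DataFrame is assumed to have ‘name’ and ‘text’ columns like people_wiki.csv, and names are looked up with `iloc` instead of `merge`, which is one of the other ways mentioned earlier. The sample rows are made up for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors


def find_similar(df, name, k=10):
    """Return the k rows of `df` whose 'text' is closest to that of `name`."""
    # Reset the index so row labels match row positions in the count matrix.
    df = df.reset_index(drop=True)
    counts = CountVectorizer().fit_transform(df["text"])
    model = NearestNeighbors(metric="euclidean").fit(counts)

    # Row position of the person being queried.
    query_row = df.index[df["name"] == name][0]
    distances, indices = model.kneighbors(counts[query_row], n_neighbors=k)

    # Look the names up by position and attach the distances.
    result = df.iloc[indices.flatten()][["name"]].copy()
    result["distance"] = distances.flatten()
    return result.reset_index(drop=True)


# Hypothetical stand-in for people_wiki.csv (made-up names and text).
people = pd.DataFrame({
    "name": ["Person A", "Person B", "Person C", "Person D"],
    "text": [
        "senator from illinois elected president",
        "senator from arizona ran for president",
        "guitarist in a rock band",
        "drummer in a rock band",
    ],
})

similar = find_similar(people, "Person A", k=2)
print(similar)
```

This version refits the vectorizer and the model on every call, which keeps the sketch short. For repeated queries on the full dataset, it would make more sense to fit once and reuse the fitted objects.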

A similar-text search can be useful in many areas, such as finding similar articles, similar resumes, similar profiles as in this project, similar news items, or similar songs. I hope you find this small project useful.