A deep dive into word embeddings for sentiment analysis

When applying one-hot encoding to words, we end up with sparse (containing many zeros) vectors of high dimensionality. On large data sets, this could cause performance issues.

Additionally, one-hot encoding does not take into account the semantics of the words. So words like airplane and aircraft are considered to be two different features while we know that they have a very similar meaning. Word embeddings address these two issues.

First, word embeddings are dense vectors of much lower dimensionality. Second, the semantic relationships between words are reflected in the distance and direction of the vectors.

We will work with the TwitterAirlineSentiment data set on Kaggle. This data set contains roughly 15K tweets with 3 possible classes for the sentiment (positive, negative and neutral). In my previous post, we tried to classify the tweets by tokenizing the words and applying two classifiers. Let’s see if word embeddings can outperform that.

After reading this tutorial you will know how to compute task-specific word embeddings with the Embedding layer of Keras. Secondly, we will investigate whether word embeddings trained on a larger corpus can improve the accuracy of our model.

The structure of this tutorial is:

  • Intuition behind word embeddings
  • Project set-up
  • Data preparation
  • Keras and its Embedding layer
  • Pre-trained word embeddings — GloVe
  • Training word embeddings with more dimensions

Intuition behind word embeddings

Before we can use words in a classifier, we need to convert them into numbers. One way to do that is to simply map words to integers. Another way is to one-hot encode words. Each tweet could then be represented as a vector with a dimensionality equal to the size of (a limited subset of) the corpus vocabulary. The words occurring in the tweet have a value of 1 in the vector; all other vector values are zero.

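As a quick illustrative sketch of this bag-of-words style encoding (not part of the original notebook), scikit-learn's CountVectorizer with binary=True produces exactly such 0/1 vectors for a couple of toy tweets:

from sklearn.feature_extraction.text import CountVectorizer

# Two toy tweets, purely for illustration
toy_tweets = ["the flight was delayed again",
              "great service on this flight"]

# binary=True gives 0/1 indicators instead of word counts
vectorizer = CountVectorizer(binary=True)
bow = vectorizer.fit_transform(toy_tweets)

print(vectorizer.vocabulary_)  # maps each word to its vector dimension
print(bow.toarray())           # one 0/1 row per tweet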

Word embeddings are computed differently. Each word is positioned into a multi-dimensional space. The number of dimensions in this space is chosen by the data scientist. You can experiment with different dimensions and see what provides the best result.

The vector values of a word represent its position in this embedding space. Synonyms are found close to each other, while words with opposite meanings have a large distance between them. You can also apply mathematical operations on the vectors, which should produce semantically correct results. A typical example is that the embedding of king minus the embedding of man plus the embedding of woman lands close to the embedding of queen.

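As a concrete sketch, assuming the GloVe embeddings loaded in the GloVe section below are available in emb_dict (and that all of these words are in its vocabulary), you can check this kind of relationship with cosine similarity:

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land close to queen in the embedding space
result = emb_dict['king'] - emb_dict['man'] + emb_dict['woman']
print(cosine_similarity(result, emb_dict['queen']))     # relatively high
print(cosine_similarity(result, emb_dict['airplane']))  # much lower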

Project set-up

Let’s start by importing all packages for this project.

import pandas as pd
import numpy as np
import re
import collections
import matplotlib.pyplot as plt
from pathlib import Path
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical
from sklearn.preprocessing import LabelEncoder
from keras import models
from keras import layers

We define some parameters and paths that are used throughout the project. Most of them are self-explanatory; the rest are explained where they first appear in the code.

NB_WORDS = 10000  # Parameter indicating the number of words we'll put in the dictionary
VAL_SIZE = 1000  # Size of the validation set
NB_START_EPOCHS = 10  # Number of epochs we usually start to train with
BATCH_SIZE = 512  # Size of the batches used in the mini-batch gradient descent
MAX_LEN = 24  # Maximum number of words in a sequence
GLOVE_DIM = 100  # Number of dimensions of the GloVe word embeddings
root = Path('../')
input_path = root / 'input/'
output_path = root / 'output/'
source_path = root / 'source/'

Throughout this code, we will also use some helper functions for data preparation, modeling and visualization. These function definitions are not shown here to keep the post clutter-free. You can always refer to the notebook on GitHub to look at the code.

Data preparation

Reading and cleaning the data

We read in the CSV file with the tweets and randomly shuffle its index. After that, we remove stop words and @ mentions. A test set of 10% is split off to evaluate the model on new data.

df = pd.read_csv(input_path / 'Tweets.csv')
df = df.reindex(np.random.permutation(df.index))
df = df[['text', 'airline_sentiment']]
df.text = df.text.apply(remove_stopwords).apply(remove_mentions)
X_train, X_test, y_train, y_test = train_test_split(df.text, df.airline_sentiment, test_size=0.1, random_state=37)
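
The exact definitions of remove_stopwords and remove_mentions live in the notebook; a minimal sketch of what they could look like, using the NLTK stop word list and a regular expression (both imported above), is:

def remove_stopwords(text):
    # Drop common English stop words (run nltk.download('stopwords') once beforehand)
    stop_words = set(stopwords.words('english'))
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

def remove_mentions(text):
    # Strip Twitter handles such as @united
    return re.sub(r'@\w+', '', text)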

Convert words into integers

With the Tokenizer from Keras, we convert the tweets into sequences of integers. We limit the number of words to the NB_WORDS most frequent words. Additionally, the tweets are cleaned with some character filters, converted to lowercase and split on spaces.

tk = Tokenizer(num_words=NB_WORDS,
               filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
               lower=True,
               split=" ")
tk.fit_on_texts(X_train)
X_train_seq = tk.texts_to_sequences(X_train)
X_test_seq = tk.texts_to_sequences(X_test)
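
To see what the tokenizer produced, you can inspect one converted tweet and part of the word-to-index mapping (the exact indices depend on the random shuffle):

print(X_train.iloc[0])                  # the cleaned tweet text
print(X_train_seq[0])                   # the same tweet as a list of word indices
print(list(tk.word_index.items())[:5])  # word-to-index mapping, lower index = more frequent word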

Equal length of sequences

Each batch needs to provide sequences of equal length. We achieve this with the pad_sequences method. By specifying maxlen, the sequences are either padded with zeros or truncated.

X_train_seq_trunc = pad_sequences(X_train_seq, maxlen=MAX_LEN)
X_test_seq_trunc = pad_sequences(X_test_seq, maxlen=MAX_LEN)
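
To verify the effect of maxlen, compare one sequence before and after padding; by default, pad_sequences pre-pads with zeros and truncates from the front:

print(len(X_train_seq[0]), X_train_seq[0])              # variable length
print(len(X_train_seq_trunc[0]), X_train_seq_trunc[0])  # always MAX_LEN (24) values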

Encoding the target variable

The target classes are strings which need to be converted into numeric vectors. This is done with the LabelEncoder from Sklearn and the to_categorical method from Keras.

le = LabelEncoder()
y_train_le = le.fit_transform(y_train)
y_test_le = le.transform(y_test)
y_train_oh = to_categorical(y_train_le)
y_test_oh = to_categorical(y_test_le)
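
LabelEncoder assigns the integer labels in alphabetical order of the class names, and to_categorical then turns each integer into a one-hot row:

print(le.classes_)     # ['negative' 'neutral' 'positive'] -> encoded as 0, 1, 2
print(y_train_le[:3])  # integer labels of the first three tweets
print(y_train_oh[:3])  # the corresponding one-hot vectors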

Splitting off the validation set

From the training data, we split off a validation set of 10% to use during training.

X_train_emb, X_valid_emb, y_train_emb, y_valid_emb = train_test_split(X_train_seq_trunc, y_train_oh, test_size=0.1, random_state=37)

Modeling

Keras and the Embedding layer

Keras provides a convenient way to convert each word into a multi-dimensional vector. This can be done with the Embedding layer. It will compute the word embeddings (or use pre-trained embeddings) and look up each word in a dictionary to find its vector representation. Here we will train word embeddings with 8 dimensions.

emb_model = models.Sequential()
emb_model.add(layers.Embedding(NB_WORDS, 8, input_length=MAX_LEN))
emb_model.add(layers.Flatten())
emb_model.add(layers.Dense(3, activation='softmax'))
emb_history = deep_model(emb_model, X_train_emb, y_train_emb, X_valid_emb, y_valid_emb)
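
deep_model is one of the helper functions mentioned earlier and is defined in the notebook. A minimal sketch of what it does, assuming categorical cross-entropy, the rmsprop optimizer and NB_START_EPOCHS epochs (details that may differ from the notebook), could be:

def deep_model(model, X_train, y_train, X_valid, y_valid):
    # Compile and train the model, returning the Keras History object
    model.compile(optimizer='rmsprop',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    history = model.fit(X_train, y_train,
                        epochs=NB_START_EPOCHS,
                        batch_size=BATCH_SIZE,
                        validation_data=(X_valid, y_valid),
                        verbose=0)
    return history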

We have a validation accuracy of about 74%. The number of words in the tweets is rather low, so this result is quite good. By comparing the training and validation loss, we see that the model starts overfitting from epoch 6.

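The train/validation comparison comes from one of the visualization helpers mentioned earlier; a minimal sketch with matplotlib (imported above), using an illustrative helper name, could look like this:

def plot_metric(history, metric_name):
    # Plot a training metric against its validation counterpart per epoch
    plt.plot(history.history[metric_name], label='train ' + metric_name)
    plt.plot(history.history['val_' + metric_name], label='validation ' + metric_name)
    plt.xlabel('epoch')
    plt.legend()
    plt.show()

plot_metric(emb_history, 'loss')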

In a previous article, I discussed how we can avoid overfitting. You might want to read it if you want to dive deeper into that topic.

When we train the model on all data (including the validation data, but excluding the test data) and set the number of epochs to 6, we get a test accuracy of 78%. This test result is OK, but let’s see if we can improve with pre-trained word embeddings.

emb_results = test_model(emb_model, X_train_seq_trunc, y_train_oh, X_test_seq_trunc, y_test_oh, 6)
print('\n')
print('Test accuracy of word embeddings model: {0:.2f}%'.format(emb_results[1]*100))
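
test_model is the second helper used here; it retrains the model on the full training data for a fixed number of epochs and evaluates it on the test set. Under those assumptions, a minimal sketch is:

def test_model(model, X_train, y_train, X_test, y_test, epoch_stop):
    # Retrain on all training data (no validation split) and evaluate on the test set
    model.compile(optimizer='rmsprop',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(X_train, y_train,
              epochs=epoch_stop,
              batch_size=BATCH_SIZE,
              verbose=0)
    results = model.evaluate(X_test, y_test)  # [test loss, test accuracy]
    return results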

Pre-trained word embeddings — GloVe

Because the training data is not so large, the model might not be able to learn good embeddings for the sentiment analysis. Alternatively, we can load pre-trained word embeddings built on much larger training data.

The GloVe database contains multiple pre-trained word embeddings, including more specific embeddings trained on tweets. So this might be useful for the task at hand.

First, we put the word embeddings in a dictionary where the keys are the words and the values are the corresponding embedding vectors.

glove_file = 'glove.twitter.27B.' + str(GLOVE_DIM) + 'd.txt'
emb_dict = {}
glove = open(input_path / glove_file)
for line in glove:
    values = line.split()
    word = values[0]
    vector = np.asarray(values[1:], dtype='float32')
    emb_dict[word] = vector
glove.close()

With the GloVe embeddings loaded in a dictionary, we can look up the embedding for each word in the corpus of the airline tweets. These are stored in a matrix of shape (NB_WORDS, GLOVE_DIM). If a word is not found in the GloVe dictionary, its embedding values remain zero.

emb_matrix = np.zeros((NB_WORDS, GLOVE_DIM))
for w, i in tk.word_index.items():
    if i < NB_WORDS:
        vect = emb_dict.get(w)
        if vect is not None:
            emb_matrix[i] = vect
    else:
        break

Then we specify the model just like we did with the model above.

glove_model = models.Sequential()
glove_model.add(layers.Embedding(NB_WORDS, GLOVE_DIM, input_length=MAX_LEN))
glove_model.add(layers.Flatten())
glove_model.add(layers.Dense(3, activation='softmax'))

In the Embedding layer (which is layer 0 here) we set the weights for the words to those found in the GloVe word embeddings. By setting trainable to False we make sure that the GloVe word embeddings cannot be changed. After that, we run the model.

glove_model.layers[0].set_weights([emb_matrix])
glove_model.layers[0].trainable = False
glove_history = deep_model(glove_model, X_train_emb, y_train_emb, X_valid_emb, y_valid_emb)

The model starts overfitting quickly, after only 3 epochs. Furthermore, the validation accuracy is lower than that of the embeddings trained on the training data.

glove_results = test_model(glove_model, X_train_seq_trunc, y_train_oh, X_test_seq_trunc, y_test_oh, 3)
print('\n')
print('Test accuracy of GloVe model: {0:.2f}%'.format(glove_results[1]*100))

As a final exercise, let’s see what results we get when we train the embeddings with the same number of dimensions as the GloVe data.

Training word embeddings with more dimensions

We will train the word embeddings with the same number of dimensions as the GloVe embeddings (i.e. GLOVE_DIM).

emb_model2 = models.Sequential()
emb_model2.add(layers.Embedding(NB_WORDS, GLOVE_DIM, input_length=MAX_LEN))
emb_model2.add(layers.Flatten())
emb_model2.add(layers.Dense(3, activation='softmax'))
emb_history2 = deep_model(emb_model2, X_train_emb, y_train_emb, X_valid_emb, y_valid_emb)
emb_results2 = test_model(emb_model2, X_train_seq_trunc, y_train_oh, X_test_seq_trunc, y_test_oh, 3)
print('\n')
print('Test accuracy of word embedding model 2: {0:.2f}%'.format(emb_results2[1]*100))

On the test data we get good results, but we do not outperform the LogisticRegression with the CountVectorizer from the previous post. So there is still room for improvement.

Conclusion

The best result is achieved with 100-dimensional word embeddings that are trained on the available data. This even outperforms the use of word embeddings that were trained on a much larger Twitter corpus.

Until now we have just put a Dense layer on the flattened embeddings. By doing this, we do not take into account the relationships between the words in the tweet. This can be achieved with a recurrent neural network or a 1D convolutional network. But that’s something for a future post :)

Source: https://www.freecodecamp.org/news/word-embeddings-for-sentiment-analysis/
