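This article is based on the Week 2 exercise of Course 5 in Andrew Ng's Deep Learning series.

0. Background

When sending text messages, we often use emoji to express how we feel at the moment; for example, ❤️ stands for "love". Picking an emoji from the emoji panel takes time, however, so this program recognizes the meaning of a sentence automatically and matches a suitable emoji to it.

The third-party libraries, dataset, and helper programs used in this article can be downloaded here.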

import numpy as np
import pandas as pd  # used later for pd.crosstab
from emo_utils import *
import emoji
import matplotlib.pyplot as plt

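1. Emojifier-V1

1.1 Dataset and emoji set

We start by building a simple baseline classifier. The code below loads a small dataset: X contains 132 training sentences, and Y contains, for each sentence in X, an integer label in the range 0-4 that identifies its emoji, as shown in the figure below: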

X_train, Y_train = read_csv('data/train_emoji.csv')
X_test, Y_test = read_csv('data/test_emoji.csv')

maxLen = len(max(X_train, key=len).split())

Let's look at a concrete example from the dataset:

index = 1
print(X_train[index], label_to_emoji(Y_train[index]))

I am proud of your achievements ὠ4

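Note: the emoji library represents emoji in UCS-2 while IDLE uses UTF-8, which was taken to be the reason for the garbled output at run time. Writing each emoji as a UCS-2 escape did not solve the problem, and corrections from readers who understand it are welcome. The UCS-2 codes for the emoji come from reference (1). In the code below, a "\u" escape displays correctly when the code has 4 hex digits but is garbled when it has 5. The likely cause is that Python's "\u" escape reads exactly four hex digits, so "\u1F604" is parsed as the character U+1F60 followed by a literal "4"; code points above U+FFFF need the eight-digit "\U" form, such as "\U0001F604".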

emoji_dictionary = {"0": "\u2764",    "1": "\u26BE","2": "\u1F604","3": "\u1F61E","4": "\u1F374"}
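A short sketch of the difference between the four-digit `\u` and eight-digit `\U` escapes in Python 3 string literals, which appears to explain the garbling described above. The name `emoji_dictionary_fixed` is hypothetical, and whether the emoji then render correctly still depends on font support in the terminal or IDE.

```python
# "\u" takes exactly 4 hex digits: this is U+1F60 (a Greek letter) plus "4".
broken = "\u1F604"
# "\U" takes exactly 8 hex digits: this is the single code point U+1F604.
fixed = "\U0001F604"

print(len(broken), [hex(ord(c)) for c in broken])
print(len(fixed), [hex(ord(c)) for c in fixed])

# Hypothetical corrected version of the dictionary above.
emoji_dictionary_fixed = {
    "0": "\u2764",      # heavy black heart
    "1": "\u26BE",      # baseball
    "2": "\U0001F604",  # smiling face
    "3": "\U0001F61E",  # disappointed face
    "4": "\U0001F374",  # fork and knife
}
```

Each value in `emoji_dictionary_fixed` is a single character, whereas the five-digit `\u` entries in the original dictionary are two characters each.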

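1.2 Overview of Emojifier-V1

As the figure above shows, the input to the model is the sequence of words in a sentence, and the output is a probability vector of shape (1, 5). The labels Y therefore need to be converted to a one-hot representation of shape (m, 5).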

Y_oh_train = convert_to_one_hot(Y_train, C=5)
Y_oh_test = convert_to_one_hot(Y_test, C=5)

index = 50
print(Y_train[index], "is converted into one hot", Y_oh_train[index])

0 is converted into one hot [1. 0. 0. 0. 0.]
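`convert_to_one_hot` comes from the course's `emo_utils` helper file, which is not listed in this article. A minimal sketch of what such a helper presumably does, using the same `np.eye(C)[...]` indexing that appears elsewhere in this article; `to_one_hot` is a hypothetical name, not the helper's actual implementation.

```python
import numpy as np

def to_one_hot(Y, C):
    """Map integer labels of shape (m,) to one-hot rows of shape (m, C)."""
    # Row k of the identity matrix is the one-hot vector for class k,
    # so indexing np.eye(C) with the labels selects one row per example.
    return np.eye(C)[np.asarray(Y).reshape(-1)]

labels = np.array([0, 3, 4])
one_hot = to_one_hot(labels, C=5)
print(one_hot.shape)
print(one_hot[0])
```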

1.3 Implementing Emojifier-V1

To turn the input sentence into word vectors, we use a pre-trained 50-dimensional GloVe word embedding.

word_to_index, index_to_word, word_to_vec_map = read_glove_vecs('data/glove.6B.50d.txt')

word = "cucumber"
index = 289846
print("the index of", word, "in the vocabulary is", word_to_index[word])
print("the", str(index) + "th word in the vocabulary is", index_to_word[index])

the index of cucumber in the vocabulary is 113317
the 289846th word in the vocabulary is potatos

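Next we reduce the word vectors of the input sentence to a single averaged vector, the avg vector in the model diagram.

def sentence_to_avg(sentence, word_to_vec_map):
    # Lower-case the sentence and split it into words
    words = sentence.lower().split()
    # Accumulator for the 50-dimensional GloVe vectors
    avg = np.zeros((50,))
    for w in words:
        avg += word_to_vec_map[w]
    # Divide the sum by the number of words to get the mean vector
    avg = avg / len(words)
    return avg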

avg = sentence_to_avg("Morrocan couscous is my favorite dish", word_to_vec_map)
print("avg = ", avg)

avg =  [-0.008005    0.56370833 -0.50427333  0.258865    0.55131103  0.03104983
 -0.21013718  0.16893933 -0.09590267  0.141784   -0.15708967  0.18525867
  0.6495785   0.38371117  0.21102167  0.11301667  0.02613967  0.26037767
  0.05820667 -0.01578167 -0.12078833 -0.02471267  0.4128455   0.5152061
  0.38756167 -0.898661   -0.535145    0.33501167  0.68806933 -0.2156265
  1.797155    0.10476933 -0.36775333  0.750785    0.10282583  0.348925
 -0.27262833  0.66768    -0.10706167 -0.283635    0.59580117  0.28747333
 -0.3366635   0.23393817  0.34349183  0.178405    0.1166155  -0.076433
  0.1445417   0.09808667]

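Next, we pass the averaged vector through forward propagation to compute the cost, and use backpropagation to update the softmax parameters. The model function below implements these steps.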
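Read off from the `model` code in this section (a reconstruction, since no formula is shown in the text), the computation for training sentence $i$ is the following, where $Y_{oh}^{(i)}$ is the one-hot label and $n_y = 5$ is the number of classes:

```latex
\begin{aligned}
z^{(i)} &= W \cdot \mathrm{avg}^{(i)} + b \\
a^{(i)} &= \mathrm{softmax}\left(z^{(i)}\right) \\
\mathcal{L}^{(i)} &= -\sum_{k=0}^{n_y - 1} Y_{oh,k}^{(i)} \, \log a_k^{(i)} \\
dz^{(i)} &= a^{(i)} - Y_{oh}^{(i)}, \qquad
dW = dz^{(i)} \left(\mathrm{avg}^{(i)}\right)^{\top}, \qquad
db = dz^{(i)}
\end{aligned}
```

The parameters are then updated by plain gradient descent after every sentence: $W \leftarrow W - \alpha \, dW$ and $b \leftarrow b - \alpha \, db$, with $\alpha$ the `learning_rate`.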

def model(X, Y, word_to_vec_map, learning_rate=0.01, num_iterations = 400):
    np.random.seed(1)
    m = Y.shape[0]
    n_y = 5
    n_h = 50
    W = np.random.randn(n_y, n_h) / np.sqrt(n_h)
    b = np.zeros((n_y,))
    Y_oh = convert_to_one_hot(Y, C = n_y)
    for t in range(num_iterations):
        for i in range(m):
            avg = sentence_to_avg(X[i], word_to_vec_map)
            z = np.dot(W, avg) + b
            a = softmax(z)
            cost = -np.sum(Y_oh[i] * np.log(a))
            dz = a - Y_oh[i]
            dW = np.dot(dz.reshape(n_y,1), avg.reshape(1, n_h))
            db = dz
            W = W - learning_rate * dW
            b = b - learning_rate * db
        if t % 100 == 0:
            print("Epoch: " + str(t) + " --- cost = " + str(cost))
            pred = predict(X, Y, W, b, word_to_vec_map)
    return pred, W, b
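The update in `model` relies on the identity `dz = a - Y_oh[i]` for a softmax output combined with cross-entropy loss. A small numerical check of that identity, as a sketch: the `softmax` here is a stand-in written for this example (the real one lives in `emo_utils`), and the scores are random toy values.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D score vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(z, y_oh):
    """Cross-entropy loss of softmax(z) against a one-hot target."""
    return -np.sum(y_oh * np.log(softmax(z)))

rng = np.random.default_rng(0)
z = rng.normal(size=5)            # toy scores for the 5 classes
y_oh = np.eye(5)[2]               # pretend the true class is 2

analytic = softmax(z) - y_oh      # the dz used in model()

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.zeros_like(z)
for k in range(z.size):
    step = np.zeros_like(z)
    step[k] = eps
    numeric[k] = (cross_entropy(z + step, y_oh)
                  - cross_entropy(z - step, y_oh)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))
```

The printed value is the largest gap between the analytic and finite-difference gradients; a value near zero supports the identity.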

print(X_train.shape)
print(Y_train.shape)
print(np.eye(5)[Y_train.reshape(-1)].shape)
print(X_train[0])
print(type(X_train))
Y = np.asarray([5,0,0,5, 4, 4, 4, 6, 6, 4, 1, 1, 5, 6, 6, 3, 6, 3, 4, 4])
print(Y.shape)
X = np.asarray(['I am going to the bar tonight', 'I love you', 'miss you my dear',
                'Lets go party and drinks', 'Congrats on the new job', 'Congratulations',
                'I am so happy for you', 'Why are you feeling bad', 'What is wrong with you',
                'You totally deserve this prize', 'Let us go play football',
                'Are you down for football this afternoon', 'Work hard play harder',
                'It is suprising how people can be dumb sometimes',
                'I am very disappointed', 'It is the best day in my life',
                'I think I will end up alone', 'My life is so boring', 'Good job',
                'Great so awesome'])

(132,)
(132,)
(132, 5)
never talk to me again
<class 'numpy.ndarray'>
(20,)

Then we train the model to learn the softmax parameters (W, b):

pred, W, b = model(X_train, Y_train, word_to_vec_map)
#print(pred)

Epoch: 0 --- cost = 1.7664588711088183
Accuracy: 0.11363636363636363
Epoch: 100 --- cost = 0.2872073477835263
Accuracy: 0.9242424242424242
Epoch: 200 --- cost = 0.2177810402059889
Accuracy: 0.9621212121212122
...
[[3.]
 [2.]
 [3.]
 [0.]
 [4.]
 ...
 [1.]
 [4.]
 [3.]
 [0.]
 [2.]]

1.4 Performance on the test set

print("Training set:")
pred_train = predict(X_train, Y_train, W, b, word_to_vec_map)
print("Test set:")
pred_test = predict(X_test, Y_test, W, b, word_to_vec_map)

Training set:
Accuracy: 0.9772727272727273
Test set:
Accuracy: 0.625

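The accuracy on the test set is not ideal, but it is far better than random guessing, which with five classes would be right only about 20% of the time.

In the training set, the algorithm learns to label "I love you" with ❤️. Let's see what happens with "adore", a word that does not appear in the training set: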

X_my_sentences = np.array(["i adore you", "i love you", "funny lol", "lets play with a ball", "food is ready", "you are not happy"])
Y_my_labels = np.array([[0], [0], [2], [1], [4],[3]])
pred = predict(X_my_sentences, Y_my_labels , W, b, word_to_vec_map)
print_predictions(X_my_sentences, pred)

i adore you ❤️
i love you ❤️
funny lol ὠ4️
lets play with a ball ⚾️
food is ready ἷ4️
you are not happy ❤️

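The sentence containing "adore" is also labeled ❤️, because words such as heart, dear, beloved, and adore have embedding vectors similar to that of love. However, the algorithm takes no account of word order, so it cannot properly understand a sentence like "not happy". Finally, we print a confusion matrix to analyze the misclassified cases.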
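Before the confusion matrix, a brief aside on the word-order point: averaging word vectors gives the same result for any permutation of the same words, so "not" cannot be tied to the word it negates. A toy sketch with made-up 2-d vectors (not GloVe values); `average_vector` is a hypothetical stand-in for `sentence_to_avg`.

```python
import numpy as np

# Made-up 2-d "embeddings" for illustration only (not GloVe values).
toy_vectors = {
    "you": np.array([0.1, 0.3]),
    "are": np.array([0.0, 0.2]),
    "not": np.array([-0.5, 0.1]),
    "happy": np.array([0.9, 0.8]),
}

def average_vector(sentence, vectors):
    """Mean of the word vectors, as in sentence_to_avg."""
    words = sentence.lower().split()
    return np.mean([vectors[w] for w in words], axis=0)

a = average_vector("you are not happy", toy_vectors)
b = average_vector("happy you are not", toy_vectors)

# Both orderings collapse to the same averaged input vector.
print(np.allclose(a, b))
```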

print(Y_test.shape)
print('           '+ label_to_emoji(0)+ '    ' + label_to_emoji(1) + '    ' +  label_to_emoji(2)+ '    ' + label_to_emoji(3)+'   ' + label_to_emoji(4))
print(pd.crosstab(Y_test, pred_test.reshape(56,), rownames=['Actual'], colnames=['Predicted'], margins=True))
plot_confusion_matrix(Y_test, pred_test)
plt.show()

(56,)
           ❤    ⚾    ὠ4    ὡE   ἷ4
Predicted  0.0  1.0  2.0  3.0  4.0  All
Actual
0            6    0    4    2    0   12
1            0    5    0    0    0    5
2            2    2   11    3    0   18
3            1    1    3    9    1   15
4            0    0    1    1    4    6
All          9    8   19   15    5   56
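The crosstab can be summarized with a few standard metrics. The sketch below copies the counts from the table by hand (rows are actual classes, columns are predicted classes) and computes overall accuracy plus per-class recall and precision; the variable names are chosen for this example.

```python
import numpy as np

# Counts copied from the crosstab above (rows: actual, columns: predicted).
confusion = np.array([
    [6, 0, 4, 2, 0],
    [0, 5, 0, 0, 0],
    [2, 2, 11, 3, 0],
    [1, 1, 3, 9, 1],
    [0, 0, 1, 1, 4],
])

correct = np.diag(confusion)
accuracy = correct.sum() / confusion.sum()
recall = correct / confusion.sum(axis=1)      # per actual class
precision = correct / confusion.sum(axis=0)   # per predicted class

print("accuracy:", accuracy)
print("recall per class:", np.round(recall, 3))
print("precision per class:", np.round(precision, 3))
```

The diagonal sums to 35 of 56 sentences, which agrees with the 0.625 test accuracy reported earlier. Class 0 (the heart) has the lowest recall, with 6 of its 12 sentences recovered.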

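2. Emojifier-V2 (using an LSTM in Keras)

To address the problems with Emojifier-V1, we bring in an LSTM to improve the algorithm. The third-party libraries and helper programs used by this model are as follows: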

import numpy as np
np.random.seed(0)
from keras.models import Model
from keras.layers import Dense, Input, Dropout, LSTM, Activation
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
np.random.seed(1)
from keras.initializers import glorot_uniform
from emo_utils import *

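2.1 Model overview

2.2 Keras and mini-batching

Deep learning frameworks require all sequences in a mini-batch to have the same length, so the input sentences must be converted to a uniform length. For example, with the sentence length set to 20, sentences shorter than 20 words are padded with zeros, and longer ones are truncated so that only the first 20 words are kept.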
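A minimal sketch of this pad-or-truncate rule on plain Python lists of word indices. `pad_or_truncate` is a hypothetical helper written for illustration, not a Keras function, and the index values are arbitrary.

```python
def pad_or_truncate(indices, max_len, pad_value=0):
    """Return a list of exactly max_len items."""
    clipped = list(indices[:max_len])   # drop anything past max_len
    # Fill the remaining positions, if any, with the padding value.
    return clipped + [pad_value] * (max_len - len(clipped))

short = [12, 7, 42]            # a 3-word sentence
long = list(range(1, 26))      # 25 fake word indices

print(pad_or_truncate(short, 20))
print(pad_or_truncate(long, 20))
```

Both calls return lists of length 20: the first ends in 17 zeros, and the second keeps only the first 20 indices.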

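2.3 The Embedding layer

In Keras, the embedding matrix is represented as a layer. The Embedding layer automatically converts the input matrix of sentences, given as word indices, into the corresponding word vectors.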
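Conceptually, an embedding layer is a table lookup: each integer index selects one row of the embedding matrix. A NumPy sketch of that lookup with toy sizes and random values; it illustrates the idea only, not Keras's internal implementation, and reserving row 0 for padding is an assumption made for this example.

```python
import numpy as np

vocab_size, emb_dim = 6, 4                     # toy sizes, not GloVe's
rng = np.random.default_rng(0)
emb_matrix = rng.normal(size=(vocab_size, emb_dim))
emb_matrix[0] = 0.0                            # assumption: row 0 is padding

# Two padded sentences of length 3, given as word indices.
sentence_indices = np.array([[2, 5, 0],
                             [1, 3, 4]])

# Integer-array indexing replaces each index by its embedding row.
embedded = emb_matrix[sentence_indices]
print(embedded.shape)
```

The result has one extra axis compared with the input: (batch size, sentence length, embedding dimension).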

As shown in the figure, the first step is to convert each sentence into a list of word indices:

def sentences_to_indices(X, word_to_index, max_len):
    m = X.shape[0]
    X_indices = np.zeros((m, max_len))
    for i in range(m):
        sentence_words = X[i].lower().split()
        j = 0
        for w in sentence_words:
            X_indices[i, j] = word_to_index[w]
            j = j + 1
    return X_indices

X1 = np.array(["funny lol", "lets play baseball", "food is ready for you"])
X1_indices = sentences_to_indices(X1, word_to_index, max_len = 5)
print("X1 = ", X1)
print("X1_indices = ",X1_indices)

X1 =  ['funny lol' 'lets play baseball' 'food is ready for you']
X1_indices =  [[155345. 225122.      0.      0.      0.]
 [220930. 286375.  69714.      0.      0.]
 [151204. 192973. 302254. 151349. 394475.]]

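Next, we create an Embedding layer initialized with the pre-trained vectors.

def pretrained_embedding_layer(word_to_vec_map, word_to_index):
    # One extra row, because word indices start at 1 and 0 is used for padding
    vocab_len = len(word_to_index) + 1
    emb_dim = word_to_vec_map["cucumber"].shape[0]
    # Row i of the matrix holds the GloVe vector of the word with index i
    emb_matrix = np.zeros((vocab_len, emb_dim))
    for word, index in word_to_index.items():
        emb_matrix[index, :] = word_to_vec_map[word]
    # trainable=False keeps the pre-trained vectors fixed during training
    embedding_layer = Embedding(vocab_len, emb_dim, trainable=False)
    embedding_layer.build((None,))
    embedding_layer.set_weights([emb_matrix])
    return embedding_layer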

embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
print("weights[0][1][3] =", embedding_layer.get_weights()[0][1][3])

weights[0][1][3] = -0.3403

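2.4 Building the Emojifier-V2 model

def Emojify_V2(input_shape, word_to_vec_map, word_to_index):
    # Input: a batch of sentences, each given as max_len word indices
    sentence_indices = Input(shape = input_shape, dtype = 'int32')
    # Frozen embedding layer initialized with the pre-trained GloVe vectors
    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
    embeddings = embedding_layer(sentence_indices)
    # First LSTM returns the full sequence of hidden states
    X = LSTM(128, return_sequences = True)(embeddings)
    X = Dropout(0.5)(X)
    # Second LSTM returns only the last hidden state
    X = LSTM(128, return_sequences = False)(X)
    X = Dropout(0.5)(X)
    # Dense produces raw scores; softmax is applied once, by the Activation layer
    X = Dense(5)(X)
    X = Activation('softmax')(X)
    model = Model(inputs = sentence_indices, outputs=X)
    return model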

maxLen = 10
word_to_index, index_to_word, word_to_vec_map = read_glove_vecs('data/glove.6B.50d.txt')
model = Emojify_V2((maxLen,), word_to_vec_map, word_to_index)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 10)                0
_________________________________________________________________
embedding_1 (Embedding)      (None, 10, 50)            20000050
_________________________________________________________________
lstm_1 (LSTM)                (None, 10, 128)           91648
_________________________________________________________________
dropout_1 (Dropout)          (None, 10, 128)           0
_________________________________________________________________
lstm_2 (LSTM)                (None, 128)               131584
_________________________________________________________________
dropout_2 (Dropout)          (None, 128)               0
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 645
_________________________________________________________________
activation_1 (Activation)    (None, 5)                 0
=================================================================
Total params: 20,223,927
Trainable params: 223,877
Non-trainable params: 20,000,050
_________________________________________________________________
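The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias; `lstm_params` and `dense_params` are hypothetical helper names. The embedding row count of 400,001 is inferred from 20,000,050 / 50, which corresponds to the vocabulary size plus one.

```python
def lstm_params(input_dim, units):
    """4 gates, each with input weights, recurrent weights and a bias."""
    return 4 * (input_dim + units + 1) * units

def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units

embedding = 400001 * 50          # rows inferred from 20,000,050 / 50
lstm_1 = lstm_params(50, 128)    # input: 50-d word vectors
lstm_2 = lstm_params(128, 128)   # input: 128-d states from lstm_1
dense = dense_params(128, 5)

trainable = lstm_1 + lstm_2 + dense
print(embedding, lstm_1, lstm_2, dense)
print(trainable, embedding + trainable)
```

These reproduce the figures in the summary: 91,648 and 131,584 for the two LSTM layers, 645 for the dense layer, 223,877 trainable parameters, and 20,223,927 in total.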

As usual when building a model in Keras, we need to specify the loss function, the optimizer, and the metrics:

model.compile(loss = 'categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

Train the model; its accuracy on the training set approaches 100%:

X_train, Y_train = read_csv('data/train_emoji.csv')
X_test, Y_test = read_csv('data/test_emoji.csv')
X_train_indices = sentences_to_indices(X_train, word_to_index, maxLen)
Y_train_oh = convert_to_one_hot(Y_train, C=5)
model.fit(X_train_indices, Y_train_oh, epochs=50, batch_size=32,shuffle=True)

Evaluate on the test set:

X_test_indices = sentences_to_indices(X_test, word_to_index, max_len = maxLen)
Y_test_oh = convert_to_one_hot(Y_test, C = 5)
loss, acc = model.evaluate(X_test_indices, Y_test_oh)
print("Test accuracy = ", acc)

32/56 [================>.............] - ETA: 0s
Test accuracy =  0.5892857142857143

Run the following code to see the mislabeled samples:

C = 5
y_test_oh = np.eye(C)[Y_test.reshape(-1)]
X_test_indices = sentences_to_indices(X_test, word_to_index, max_len = maxLen)
pred = model.predict(X_test_indices)
for i in range(len(X_test)):
    x = X_test_indices
    num = np.argmax(pred[i])
    if (num != Y_test[i]):
        print("Expected emoji:" + label_to_emoji(Y_test[i]) +
              'prediction:' + X_test[i] + label_to_emoji(num).strip())

Expected emoji:❤prediction:she got me a present  ὠ4
Expected emoji:❤prediction:he is a good friend  ὠ4
Expected emoji:❤prediction:I am upset   ὡE
Expected emoji:❤prediction:We had such a lovely dinner tonight  ὠ4
Expected emoji:ὡEprediction:This girl is messing with me    ❤
Expected emoji:ὠ4prediction:are you serious ha ha   ὡE
Expected emoji:ἷ4prediction:any suggestions for dinner  ὠ4
Expected emoji:❤prediction:I love taking breaks ὠ4
Expected emoji:ὠ4prediction:you brighten my day ❤
Expected emoji:ὡEprediction:she is a bully  ὠ4
Expected emoji:ὡEprediction:I worked during my birthday ὠ4
Expected emoji:❤prediction:valentine day is near    ὠ4
Expected emoji:❤prediction:I miss you so much   ὠ4
Expected emoji:ὡEprediction:My life is so boring    ὠ4
Expected emoji:❤prediction:will you be my valentine ὠ4
Expected emoji:ἷ4prediction:I am starving   ὡE
Expected emoji:ὠ4prediction:I like your jacket  ❤
Expected emoji:⚾prediction:what is your favorite baseball game  ὠ4
Expected emoji:❤prediction:I love to the stars and back ὠ4
Expected emoji:ὠ4prediction:I want to joke  ὡE
Expected emoji:ὡEprediction:go away ⚾
Expected emoji:ὡEprediction:yesterday we lost again ⚾
Expected emoji:❤prediction:family is all I have ὡE
Expected emoji:ὡEprediction:I did not have breakfast ὠ4

Now let's test the sentence that Emojifier-V1 handled poorly, "you are not happy":

x_test = np.array(['you are not happy'])
X_test_indices = sentences_to_indices(x_test, word_to_index, maxLen)
print(x_test[0]+''+label_to_emoji(np.argmax(model.predict(X_test_indices))))

can you stop angry now ὡE

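ὡE is the garbled rendering of the "disappointed" emoji (U+1F61E), so Emojifier-V2 gets this one right where Emojifier-V1 did not.

References:

(1) UCS-2 codes for emoji: https://blog.csdn.net/xiaoai_911/article/details/21175049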
