

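When texting, we often use emoji to express how we feel at the moment; for example, ❤️ stands for "love". Picking one from the emoji panel takes time, though, so this program recognizes the meaning of a sentence automatically and then matches a suitable emoji.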


import numpy as np
from emo_utils import *
import emoji
import matplotlib.pyplot as plt


1.1 Dataset and emoji set


X_train, Y_train = read_csv('data/train_emoji.csv')
X_test, Y_test = read_csv('data/test_emoji.csv')
maxLen = len(max(X_train, key=len).split())


index = 1
print(X_train[index], label_to_emoji(Y_train[index]))
I am proud of your achievements 😄


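# Code points above U+FFFF need the eight-digit \U escape; \u reads only four hex digits.
emoji_dictionary = {"0": "\u2764", "1": "\u26BE", "2": "\U0001F604", "3": "\U0001F61E", "4": "\U0001F374"}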
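
One detail worth checking in a table like this: Python's `\u` escape consumes exactly four hexadecimal digits, while code points above U+FFFF need the eight-digit `\U` form. The short sketch below is not part of the original program; it only illustrates the difference for U+1F604, and the variable names are made up for the demonstration.

```python
# "\u" reads exactly four hex digits; anything after them is literal text.
four_digit = "\u1F604"        # U+1F60 followed by the character "4"
eight_digit = "\U0001F604"    # the single code point U+1F604

print(len(four_digit), [hex(ord(c)) for c in four_digit])
print(len(eight_digit), [hex(ord(c)) for c in eight_digit])
```

The four-digit form silently produces two characters instead of raising an error, which makes this kind of slip easy to miss.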

1.2 Overview of Emojifier-V1


Y_oh_train = convert_to_one_hot(Y_train, C=5)
Y_oh_test = convert_to_one_hot(Y_test, C=5)
index = 50
print(Y_train[index], "is converted into one hot", Y_oh_train[index])
0 is converted into one hot [1. 0. 0. 0. 0.]
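
`convert_to_one_hot` comes from `emo_utils`, whose source is not shown here. A minimal sketch, assuming it behaves like the `np.eye(C)[...]` indexing that this post later uses for `y_test_oh`, could look as follows. The helper name `one_hot` and the demo labels are hypothetical.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return an (m, num_classes) one-hot matrix for integer labels."""
    labels = np.asarray(labels, dtype=int).reshape(-1)
    # Row k of the identity matrix is the one-hot vector for class k.
    return np.eye(num_classes)[labels]

Y_demo = np.array([0, 3, 4])
Y_demo_oh = one_hot(Y_demo, num_classes=5)
print(Y_demo_oh)
```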

1.3 Applying Emojifier-V1


word_to_index, index_to_word, word_to_vec_map = read_glove_vecs('data/glove.6B.50d.txt')
word = "cucumber"
index = 289846
print("the index of", word, "in the vocabulary is", word_to_index[word])
print("the", str(index) + "th word in the vocabulary is", index_to_word[index])
the index of cucumber in the vocabulary is 113317
the 289846th word in the vocabulary is potatos


def sentence_to_avg(sentence, word_to_vec_map):
    # Split the lower-cased sentence into words.
    words = sentence.lower().split()
    # Accumulate the 50-dimensional GloVe vectors, then average them.
    avg = np.zeros((50,))
    for w in words:
        avg += word_to_vec_map[w]
    avg = avg / len(words)
    return avg

avg = sentence_to_avg("Morrocan couscous is my favorite dish", word_to_vec_map)
print("avg = ", avg)
avg =  [-0.008005    0.56370833 -0.50427333  0.258865    0.55131103  0.03104983
 -0.21013718  0.16893933 -0.09590267  0.141784   -0.15708967  0.18525867
  0.6495785   0.38371117  0.21102167  0.11301667  0.02613967  0.26037767
  0.05820667 -0.01578167 -0.12078833 -0.02471267  0.4128455   0.5152061
  0.38756167 -0.898661   -0.535145    0.33501167  0.68806933 -0.2156265
  1.797155    0.10476933 -0.36775333  0.750785    0.10282583  0.348925
 -0.27262833  0.66768    -0.10706167 -0.283635    0.59580117  0.28747333
 -0.3366635   0.23393817  0.34349183  0.178405    0.1166155  -0.076433
  0.1445417   0.09808667]
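
`sentence_to_avg` assumes that every word of the sentence is present in `word_to_vec_map`. The sketch below shows the same averaging idea on hand-made 3-dimensional vectors, with one variation that is not in the original: unknown words are skipped instead of raising a `KeyError`. The names `toy_vectors` and `average_embedding` are hypothetical.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, only for illustration.
toy_vectors = {
    "i": np.array([1.0, 0.0, 0.0]),
    "love": np.array([0.0, 2.0, 0.0]),
    "you": np.array([0.0, 0.0, 3.0]),
}

def average_embedding(sentence, vectors, dim):
    """Average the vectors of known words; unknown words are skipped."""
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not known:
        # No known word at all: fall back to the zero vector.
        return np.zeros((dim,))
    return np.mean(known, axis=0)

avg_demo = average_embedding("I love you", toy_vectors, dim=3)
avg_unknown = average_embedding("xyzzy", toy_vectors, dim=3)
print(avg_demo, avg_unknown)
```

Whether skipping unknown words is appropriate depends on the data; it hides vocabulary gaps that the strict version would surface immediately.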


def model(X, Y, word_to_vec_map, learning_rate=0.01, num_iterations=400):
    np.random.seed(1)
    m = Y.shape[0]   # number of training examples
    n_y = 5          # number of classes
    n_h = 50         # dimension of the GloVe vectors
    # Xavier-style initialization of the softmax layer
    W = np.random.randn(n_y, n_h) / np.sqrt(n_h)
    b = np.zeros((n_y,))
    Y_oh = convert_to_one_hot(Y, C=n_y)
    for t in range(num_iterations):
        for i in range(m):
            # Forward pass: averaged embedding -> linear layer -> softmax
            avg = sentence_to_avg(X[i], word_to_vec_map)
            z = np.dot(W, avg) + b
            a = softmax(z)
            cost = -np.sum(Y_oh[i] * np.log(a))
            # Backward pass and stochastic gradient descent update
            dz = a - Y_oh[i]
            dW = np.dot(dz.reshape(n_y, 1), avg.reshape(1, n_h))
            db = dz
            W = W - learning_rate * dW
            b = b - learning_rate * db
        if t % 100 == 0:
            print("Epoch: " + str(t) + " --- cost = " + str(cost))
            pred = predict(X, Y, W, b, word_to_vec_map)
    return pred, W, b
Y = np.asarray([5,0,0,5, 4, 4, 4, 6, 6, 4, 1, 1, 5, 6, 6, 3, 6, 3, 4, 4])
print(Y.shape)
X = np.asarray(['I am going to the bar tonight', 'I love you', 'miss you my dear',
                'Lets go party and drinks', 'Congrats on the new job', 'Congratulations',
                'I am so happy for you', 'Why are you feeling bad', 'What is wrong with you',
                'You totally deserve this prize', 'Let us go play football',
                'Are you down for football this afternoon', 'Work hard play harder',
                'It is suprising how people can be dumb sometimes',
                'I am very disappointed', 'It is the best day in my life',
                'I think I will end up alone', 'My life is so boring', 'Good job',
                'Great so awesome'])
(132,)
(132, 5)
never talk to me again
<class 'numpy.ndarray'>


pred, W, b = model(X_train, Y_train, word_to_vec_map)
Epoch: 0 --- cost = 1.7664588711088183
Accuracy: 0.11363636363636363
Epoch: 100 --- cost = 0.2872073477835263
Accuracy: 0.9242424242424242
Epoch: 200 --- cost = 0.2177810402059889
Accuracy: 0.9621212121212122
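
The update step in `model` relies on the fact that, for a softmax output with cross-entropy loss, the gradient with respect to `z` is `a - Y_oh[i]`. The following sketch checks that identity numerically with central finite differences. `softmax` is imported from `emo_utils` in the post; the version defined here is an assumption about its behavior, and the random input is arbitrary.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax for a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(z, y_onehot):
    return -np.sum(y_onehot * np.log(softmax(z)))

rng = np.random.default_rng(0)
z = rng.normal(size=5)
y = np.eye(5)[2]           # the true class is 2

analytic = softmax(z) - y  # same expression as dz in the training loop

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.zeros_like(z)
for k in range(z.size):
    step = np.zeros_like(z)
    step[k] = eps
    numeric[k] = (cross_entropy(z + step, y) - cross_entropy(z - step, y)) / (2 * eps)

max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)
```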

1.4 Performance on the test set

print("Training set:")
pred_train = predict(X_train, Y_train, W, b, word_to_vec_map)
print("Test set:")
pred_test = predict(X_test, Y_test, W, b, word_to_vec_map)
Training set:
Accuracy: 0.9772727272727273
Test set:
Accuracy: 0.625


In the training set, the algorithm labels "I love you" with ❤️. Let's see what happens with "adore", a word that never appears in the training set.

X_my_sentences = np.array(["i adore you", "i love you", "funny lol", "lets play with a ball", "food is ready", "you are not happy"])
Y_my_labels = np.array([[0], [0], [2], [1], [4], [3]])
pred = predict(X_my_sentences, Y_my_labels, W, b, word_to_vec_map)
print_predictions(X_my_sentences, pred)
i adore you ❤️
i love you ❤️
funny lol 😄
lets play with a ball ⚾️
food is ready 🍴
you are not happy ❤️

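The sentence containing "adore" is also labeled ❤️, because words such as heart, dear, beloved and adore have embedding vectors similar to that of love. However, since the algorithm ignores word order, it cannot properly understand a sentence like "not happy". Finally, we print a confusion matrix to analyze the misclassified cases.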
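
Both effects can be reproduced on a toy scale. The sketch below uses hand-picked 2-dimensional vectors (hypothetical values, not GloVe) to show that two related words can have a cosine similarity close to 1, and that averaging gives the same result for any ordering of the same words.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings, chosen by hand for illustration.
toy_vectors = {
    "love": np.array([0.9, 0.1]),
    "adore": np.array([0.8, 0.2]),
    "not": np.array([-0.1, 0.3]),
    "happy": np.array([0.7, 0.6]),
    "you": np.array([0.2, 0.2]),
    "are": np.array([0.1, 0.1]),
}

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def average_embedding(sentence):
    return np.mean([toy_vectors[w] for w in sentence.split()], axis=0)

sim_love_adore = cosine_similarity(toy_vectors["love"], toy_vectors["adore"])
original = average_embedding("you are not happy")
shuffled = average_embedding("happy not are you")
same_average = bool(np.allclose(original, shuffled))
print(sim_love_adore, same_average)
```

Because the average is identical for every word order, a classifier built on it has no way to tell where "not" sits relative to "happy".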

print('           '+ label_to_emoji(0)+ '    ' + label_to_emoji(1) + '    ' +  label_to_emoji(2)+ '    ' + label_to_emoji(3)+'   ' + label_to_emoji(4))
print(pd.crosstab(Y_test, pred_test.reshape(56,), rownames=['Actual'], colnames=['Predicted'], margins=True))
plot_confusion_matrix(Y_test, pred_test)
(56,)
           ❤    ⚾    😄    😞   🍴
Predicted  0.0  1.0  2.0  3.0  4.0  All
0            6    0    4    2    0   12
1            0    5    0    0    0    5
2            2    2   11    3    0   18
3            1    1    3    9    1   15
4            0    0    1    1    4    6
All          9    8   19   15    5   56
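
The table can also be summarized numerically. The sketch below copies the counts from the crosstab (margins excluded) and derives overall accuracy plus per-class recall and precision; the variable names are chosen for this illustration only.

```python
import numpy as np

# Rows are actual labels 0-4, columns are predicted labels 0-4,
# copied from the crosstab (the "All" margins are left out).
confusion = np.array([
    [6, 0, 4, 2, 0],
    [0, 5, 0, 0, 0],
    [2, 2, 11, 3, 0],
    [1, 1, 3, 9, 1],
    [0, 0, 1, 1, 4],
])

accuracy = np.trace(confusion) / confusion.sum()
recall = np.diag(confusion) / confusion.sum(axis=1)      # per actual class
precision = np.diag(confusion) / confusion.sum(axis=0)   # per predicted class

print("accuracy:", accuracy)
print("recall per class:", np.round(recall, 3))
print("precision per class:", np.round(precision, 3))
```

The diagonal sums to 35 out of 56 examples, which agrees with the test accuracy of 0.625 reported earlier.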

2. Emojifier-V2 (using LSTM in Keras)


import numpy as np
from keras.models import Model
from keras.layers import Dense, Input, Dropout, LSTM, Activation
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
np.random.seed(1)
from keras.initializers import glorot_uniform
from emo_utils import *

2.1 Model overview

2.2 Keras and mini-batching


2.3 The Embedding layer



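def sentences_to_indices(X, word_to_index, max_len):
    m = X.shape[0]   # number of sentences
    # Zero-initialized, so unused positions act as padding.
    X_indices = np.zeros((m, max_len))
    for i in range(m):
        sentence_words = X[i].lower().split()
        j = 0
        for w in sentence_words:
            # Store the vocabulary index of the j-th word of sentence i.
            X_indices[i, j] = word_to_index[w]
            j = j + 1
    return X_indices
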
X1 = np.array(["funny lol", "lets play baseball", "food is ready for you"])
X1_indices = sentences_to_indices(X1, word_to_index, max_len = 5)
print("X1 = ", X1)
print("X1_indices = ",X1_indices)
X1 =  ['funny lol' 'lets play baseball' 'food is ready for you']
X1_indices =  [[155345. 225122.      0.      0.      0.]
 [220930. 286375.  69714.      0.      0.]
 [151204. 192973. 302254. 151349. 394475.]]
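
As written, `sentences_to_indices` expects every sentence to fit into `max_len` words and every word to be in the vocabulary. The sketch below is a defensive variant on a hypothetical miniature vocabulary: it truncates long sentences and maps unknown words to a fallback index. Both behaviors are choices made for this illustration, not something the original code does.

```python
import numpy as np

# Hypothetical miniature vocabulary; index 0 is reserved for padding.
toy_word_to_index = {"food": 1, "is": 2, "ready": 3, "for": 4, "you": 5}

def to_padded_indices(sentences, word_to_index, max_len, unknown_index=0):
    """Map sentences to an (m, max_len) integer array, zero-padded on the right.

    Words beyond max_len are dropped and out-of-vocabulary words map to
    unknown_index; both behaviors are assumptions of this sketch.
    """
    indices = np.zeros((len(sentences), max_len), dtype=np.int32)
    for i, sentence in enumerate(sentences):
        words = sentence.lower().split()[:max_len]
        for j, w in enumerate(words):
            indices[i, j] = word_to_index.get(w, unknown_index)
    return indices

demo = to_padded_indices(
    ["food is ready", "Food is ready for you now please"],
    toy_word_to_index,
    max_len=5,
)
print(demo)
```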


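def pretrained_embedding_layer(word_to_vec_map, word_to_index):
    # One extra row, as required by the Keras Embedding layer.
    vocab_len = len(word_to_index) + 1
    emb_dim = word_to_vec_map["cucumber"].shape[0]
    # Row `index` of emb_matrix holds the GloVe vector of the word with that index.
    emb_matrix = np.zeros((vocab_len, emb_dim))
    for word, index in word_to_index.items():
        emb_matrix[index, :] = word_to_vec_map[word]
    # Frozen layer: the pretrained vectors are not updated during training.
    embedding_layer = Embedding(vocab_len, emb_dim, trainable=False)
    embedding_layer.build((None,))
    embedding_layer.set_weights([emb_matrix])
    return embedding_layer
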
embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
print("weights[0][1][3] =", embedding_layer.get_weights()[0][1][3])
weights[0][1][3] = -0.3403
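
Conceptually, a frozen embedding layer acts as a row lookup into `emb_matrix`. The NumPy sketch below mirrors the matrix construction on a hypothetical three-word vocabulary with 4-dimensional vectors and shows the lookup for one padded sentence; it does not use Keras at all.

```python
import numpy as np

# Hypothetical vocabulary and 4-dimensional vectors, for illustration only.
toy_word_to_index = {"ball": 1, "food": 2, "love": 3}
toy_word_to_vec = {
    "ball": np.array([0.1, 0.2, 0.3, 0.4]),
    "food": np.array([0.5, 0.6, 0.7, 0.8]),
    "love": np.array([0.9, 1.0, 1.1, 1.2]),
}

vocab_len = len(toy_word_to_index) + 1   # row 0 stays all-zero for padding
emb_dim = 4
emb_matrix = np.zeros((vocab_len, emb_dim))
for word, idx in toy_word_to_index.items():
    emb_matrix[idx, :] = toy_word_to_vec[word]

# One padded sentence: "love food" followed by one padding slot.
sentence_indices = np.array([[3, 2, 0]])
embedded = emb_matrix[sentence_indices]   # shape (batch, max_len, emb_dim)
print(embedded.shape)
```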

2.4 Building the Emojifier-V2 model

def Emojify_V2(input_shape, word_to_vec_map, word_to_index):
    # Input: a batch of sentences encoded as word indices
    sentence_indices = Input(shape=input_shape, dtype='int32')
    # Frozen GloVe embeddings
    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
    embeddings = embedding_layer(sentence_indices)
    # First LSTM returns the full sequence, the second only its last hidden state
    X = LSTM(128, return_sequences=True)(embeddings)
    X = Dropout(0.5)(X)
    X = LSTM(128, return_sequences=False)(X)
    X = Dropout(0.5)(X)
    # Dense outputs raw scores; softmax is applied once, by the Activation layer
    X = Dense(5)(X)
    X = Activation('softmax')(X)
    model = Model(inputs=sentence_indices, outputs=X)
    return model

maxLen = 10
word_to_index, index_to_word, word_to_vec_map = read_glove_vecs('data/glove.6B.50d.txt')
model = Emojify_V2((maxLen,), word_to_vec_map, word_to_index)
model.summary()
Layer (type)                 Output Shape              Param #
input_1 (InputLayer)         (None, 10)                0
embedding_1 (Embedding)      (None, 10, 50)            20000050
lstm_1 (LSTM)                (None, 10, 128)           91648
dropout_1 (Dropout)          (None, 10, 128)           0
lstm_2 (LSTM)                (None, 128)               131584
dropout_2 (Dropout)          (None, 128)               0
dense_1 (Dense)              (None, 5)                 645
activation_1 (Activation)    (None, 5)                 0
Total params: 20,223,927
Trainable params: 223,877
Non-trainable params: 20,000,050
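
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for an LSTM layer with biases (four gates, each with input weights, recurrent weights and a bias) and for a dense layer. The helper names are hypothetical, and the number of embedding rows is derived from the summary rather than read from the data.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    return input_dim * units + units

emb_dim, units, n_classes = 50, 128, 5
embedding = 20_000_050                 # taken from the summary
vocab_rows = embedding // emb_dim      # rows of the embedding matrix

lstm_1 = lstm_params(emb_dim, units)   # input is the 50-d embedding
lstm_2 = lstm_params(units, units)     # input is the 128-d output of lstm_1
dense = dense_params(units, n_classes)
trainable = lstm_1 + lstm_2 + dense
total = trainable + embedding

print(vocab_rows, lstm_1, lstm_2, dense, trainable, total)
```

The 400,001 embedding rows correspond to `len(word_to_index) + 1` in `pretrained_embedding_layer`, and they make up all of the non-trainable parameters because the layer is frozen.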


model.compile(loss = 'categorical_crossentropy', optimizer='adam', metrics=['accuracy'])


X_train, Y_train = read_csv('data/train_emoji.csv')
X_test, Y_test = read_csv('data/test_emoji.csv')
X_train_indices = sentences_to_indices(X_train, word_to_index, maxLen)
Y_train_oh = convert_to_one_hot(Y_train, C=5)
model.fit(X_train_indices, Y_train_oh, epochs=50, batch_size=32, shuffle=True)


X_test_indices = sentences_to_indices(X_test, word_to_index, max_len = maxLen)
Y_test_oh = convert_to_one_hot(Y_test, C = 5)
loss, acc = model.evaluate(X_test_indices, Y_test_oh)
print("Test accuracy = ", acc)
32/56 [================>.............] - ETA: 0s
Test accuracy =  0.5892857142857143


C = 5
y_test_oh = np.eye(C)[Y_test.reshape(-1)]
X_test_indices = sentences_to_indices(X_test, word_to_index, max_len = maxLen)
pred = model.predict(X_test_indices)
# Print every test sentence whose predicted emoji differs from the expected one.
for i in range(len(X_test)):
    num = np.argmax(pred[i])
    if num != Y_test[i]:
        print("Expected emoji:" + label_to_emoji(Y_test[i]) +
              'prediction:' + X_test[i] + label_to_emoji(num).strip())
Expected emoji:❤prediction:she got me a present  😄
Expected emoji:❤prediction:he is a good friend  😄
Expected emoji:❤prediction:I am upset   😞
Expected emoji:❤prediction:We had such a lovely dinner tonight  😄
Expected emoji:😞prediction:This girl is messing with me    ❤
Expected emoji:😄prediction:are you serious ha ha   😞
Expected emoji:🍴prediction:any suggestions for dinner  😄
Expected emoji:❤prediction:I love taking breaks 😄
Expected emoji:😄prediction:you brighten my day ❤
Expected emoji:😞prediction:she is a bully  😄
Expected emoji:😞prediction:I worked during my birthday 😄
Expected emoji:❤prediction:valentine day is near    😄
Expected emoji:❤prediction:I miss you so much   😄
Expected emoji:😞prediction:My life is so boring    😄
Expected emoji:❤prediction:will you be my valentine 😄
Expected emoji:🍴prediction:I am starving   😞
Expected emoji:😄prediction:I like your jacket  ❤
Expected emoji:⚾prediction:what is your favorite baseball game  😄
Expected emoji:❤prediction:I love to the stars and back 😄
Expected emoji:😄prediction:I want to joke  😞
Expected emoji:😞prediction:go away ⚾
Expected emoji:😞prediction:yesterday we lost again ⚾
Expected emoji:❤prediction:family is all I have 😞
Expected emoji:😞prediction:I did not have breakfast 😄

Next, let's test the sentence "you are not happy", which Emojifier-V1 did not handle well.

x_test = np.array(['you are not happy'])
X_test_indices = sentences_to_indices(x_test, word_to_index, maxLen)
can you stop angry now 😞





