

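  • Load pretrained word embeddings and measure word similarity with cosine similarity
  • Word embeddings can solve word analogy problems: given man → woman, the model can infer king → ?
  • Some word embeddings need to be modified (debiased) so that they do not encode unwanted bias, such as gender bias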
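
The last point above is about debiasing. One common step, neutralization, can be sketched as follows: it removes from a word vector its component along a bias direction g. The function name `neutralize`, the toy 3-D vectors, and the variable names are illustrative assumptions, not values from the GloVe file. In practice g might be estimated from a difference such as e_woman - e_man.

```python
import numpy as np

def neutralize(e, g):
    """Remove from e its projection onto the bias direction g."""
    e_bias_component = (np.dot(e, g) / np.dot(g, g)) * g
    return e - e_bias_component

# Toy 3-D vectors, invented for illustration only.
g = np.array([1.0, 0.0, 0.0])                # hypothetical bias direction
e_receptionist = np.array([0.4, 0.3, -0.2])  # hypothetical word vector

e_debiased = neutralize(e_receptionist, g)
print(e_debiased)             # the first coordinate (along g) is removed
print(np.dot(e_debiased, g))  # component along g after neutralization
```

After neutralization the vector is orthogonal to g, so its cosine similarity with the bias direction is zero.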


# 1 Imports
import numpy as np
from w2v_utils import *
words, word_to_vec_map = read_glove_vecs('data/glove.6B.50d.txt')
# 2 Given the embedding matrix, compute the similarity between words
def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = np.dot(u, v)
    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)
    return dot / (norm_u * norm_v)
# 2.1 Similarity between two words
father = word_to_vec_map["father"]
mother = word_to_vec_map["mother"]
print("cosine_similarity(father, mother) = ", cosine_similarity(father, mother))  # 0.89
# 2.2 Given a word pair a -> b, find the best match d for a third word c
def complete_analogy(word_a, word_b, word_c, word_to_vec_map):
    """Find the word d such that a is to b as c is to d."""
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()
    e_a, e_b, e_c = word_to_vec_map[word_a], word_to_vec_map[word_b], word_to_vec_map[word_c]
    words = word_to_vec_map.keys()
    max_cosine_sim = -100
    best_word = None
    for w in words:
        # Skip the input words so they cannot be returned as the answer
        if w in [word_a, word_b, word_c]:
            continue
        # Compare the direction (e_b - e_a) with the direction (e_w - e_c)
        cosine_sim = cosine_similarity(e_b - e_a, word_to_vec_map[w] - e_c)
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w
    return best_word
triads_to_try = [('italy', 'italian', 'spain'), ('india', 'delhi', 'japan'), ('man', 'woman', 'boy'), ('small', 'smaller', 'large')]
for triad in triads_to_try:
    print('{} -> {} :: {} -> {}'.format(*triad, complete_analogy(*triad, word_to_vec_map)))
'''italy -> italian :: spain -> spanish
india -> delhi :: japan -> tokyo
man -> woman :: boy -> girl
small -> smaller :: large -> larger'''
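
To see why this search works without loading the GloVe file, here is a minimal sketch on hand-made 2-D vectors. The vocabulary `toy_vecs`, its values, and the helper name `toy_analogy` are invented for illustration; the logic mirrors complete_analogy above.

```python
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hand-made 2-D "embeddings": axis 0 ~ gender, axis 1 ~ royalty (invented values).
toy_vecs = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([-1.0, 1.0]),
    "apple": np.array([0.1, -1.0]),
}

def toy_analogy(a, b, c, vecs):
    """Return the word d that maximizes cos(e_b - e_a, e_d - e_c)."""
    best_word, best_sim = None, -np.inf
    for w, e_w in vecs.items():
        if w in (a, b, c):
            continue  # the inputs themselves are not valid answers
        sim = cosine_similarity(vecs[b] - vecs[a], e_w - vecs[c])
        if sim > best_sim:
            best_word, best_sim = w, sim
    return best_word

print(toy_analogy("man", "woman", "king", toy_vecs))
```

In this toy space the offset from king to queen is exactly the offset from man to woman, so queen scores a cosine similarity of 1 and wins over apple.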




# 1 Imports
import numpy as np
from emo_utils import *
import emoji
import matplotlib.pyplot as plt
X_train, Y_train = read_csv('data/train_emoji.csv') # m=127
X_test, Y_test = read_csv('data/tesss.csv') # m=56
maxLen = len(max(X_train, key=len).split())
# 1.1 Preview one example
index = 1
print(X_train[index], label_to_emoji(Y_train[index]))
"""I am proud of your achievements ?"""
# 1.2 Preprocessing: convert Y to one-hot vectors of shape (m, 5)
Y_oh_train = convert_to_one_hot(Y_train, C = 5)
Y_oh_test = convert_to_one_hot(Y_test, C = 5)
# 1.3 Inspect the vocabulary
word_to_index, index_to_word, word_to_vec_map = read_glove_vecs('data/glove.6B.50d.txt') # 400,001 words
word = "cucumber"
index = 289846
print("the index of", word, "in the vocabulary is", word_to_index[word])
print("the", str(index) + "th word in the vocabulary is", index_to_word[index])
"""the index of cucumber in the vocabulary is 113317
the 289846th word in the vocabulary is potatos"""
# 2 Implement the model
# 2.1 Turn a sentence into an input feature vector
def sentence_to_avg(sentence, word_to_vec_map):
    """Average the GloVe vectors of the words in a sentence to get the sentence's feature vector."""
    words = sentence.lower().split()
    avg = np.zeros(50,)
    for w in words:
        avg += word_to_vec_map[w]
    avg = avg / len(words)
    return avg
# 2.2 Build the baseline model (softmax on averaged word vectors; there is no recurrence here)
def model(X, Y, word_to_vec_map, learning_rate = 0.01, num_iterations = 400):
    """
    Arguments:
    X -- shape (m, 1)
    Y -- shape (m, 1)

    Returns:
    pred -- vector of predictions, numpy-array of shape (m, 1)
    W -- weight matrix of the softmax layer, of shape (n_y, n_h)
    b -- bias of the softmax layer, of shape (n_y,)
    """
    np.random.seed(1)
    m = Y.shape[0]                          # number of training examples
    n_y = 5                                 # number of classes
    n_h = 50                                # dimensions of the GloVe vectors
    W = np.random.randn(n_y, n_h) / np.sqrt(n_h)
    b = np.zeros((n_y,))
    Y_oh = convert_to_one_hot(Y, C = n_y)
    # Optimization loop: stochastic gradient descent, one example at a time
    for t in range(num_iterations):
        for i in range(m):
            avg = sentence_to_avg(X[i], word_to_vec_map)
            z = np.dot(W, avg) + b
            a = softmax(z)
            # Note: this multiplies by the integer label Y[i], so cost is a length-5 vector
            # (which is what the training log shows); the usual scalar cross-entropy
            # would be -np.sum(Y_oh[i] * np.log(a)). The gradients below do not use cost.
            cost = -1 * np.multiply(Y[i], np.log(a))
            dz = a - Y_oh[i]
            dW = np.dot(dz.reshape(n_y, 1), avg.reshape(1, n_h))
            db = dz
            W = W - learning_rate * dW
            b = b - learning_rate * db
        if t % 100 == 0:
            print("Epoch: " + str(t) + " --- cost = " + str(cost))
            pred = predict(X, Y, W, b, word_to_vec_map)
    return pred, W, b
# 2.3 Train
pred, W, b = model(X_train, Y_train, word_to_vec_map)
'''Epoch: 0 --- cost = [ 2.82117539  2.22537435  3.90409976  3.65077617  4.17192113]
Accuracy: 0.348484848485
Epoch: 100 --- cost = [  7.39085514   6.39666398   0.15943637   9.61056197  11.77782592]
Accuracy: 0.931818181818
Epoch: 200 --- cost = [  7.86956435   7.883712     0.08912738  11.25652113  13.75952996]
Accuracy: 0.954545454545
Epoch: 300 --- cost = [  8.06494045   8.67838712   0.06864535  12.0741376   14.92485916]
Accuracy: 0.969696969697'''
# 2.4 Evaluate the model
print("Training set:")
pred_train = predict(X_train, Y_train, W, b, word_to_vec_map)
print('Test set:')
pred_test = predict(X_test, Y_test, W, b, word_to_vec_map)
'''Training set:
Accuracy: 0.977272727273
Test set:
Accuracy: 0.857142857143'''
X_my_sentences = np.array(["i adore you", "i love you", "funny lol", "lets play with a ball", "food is ready", "not feeling happy"])
Y_my_labels = np.array([[0], [0], [2], [1], [4],[3]])
pred = predict(X_my_sentences, Y_my_labels , W, b, word_to_vec_map)
print_predictions(X_my_sentences, pred)
'''Accuracy: 0.833333333333
i adore you ❤️
i love you ❤️
funny lol ?
lets play with a ball ⚾
food is ready ?
not feeling happy ?'''

Amazing! Because adore has an embedding similar to that of love, the algorithm has generalized correctly even to a word it has never seen before. Words such as heart, dear, beloved or adore have embedding vectors similar to love, and so might work too.

What you should remember from this part:

  • Even with only 127 training examples, you can get a reasonably good model for Emojifying. This is due to the generalization power that word vectors give you.
  • Emojify-V1 will perform poorly on sentences such as “This movie is not good and not enjoyable” because it doesn’t understand combinations of words: it just averages all the words’ embedding vectors together, without paying attention to the ordering of words. You will build a better algorithm in the next part.
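
The order-insensitivity of averaging can be checked directly. The sketch below uses random stand-in vectors (the dictionary `toy_map` is hypothetical and does not contain real GloVe values) and a compact equivalent of sentence_to_avg.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-ins for word vectors (not real GloVe values).
toy_map = {w: rng.normal(size=50)
           for w in ["this", "movie", "is", "good", "not", "bad"]}

def sentence_to_avg(sentence, word_to_vec_map):
    """Average the vectors of the words in the sentence."""
    words = sentence.lower().split()
    return np.mean([word_to_vec_map[w] for w in words], axis=0)

a = sentence_to_avg("this movie is good not bad", toy_map)
b = sentence_to_avg("this movie is bad not good", toy_map)
print(np.allclose(a, b))  # same bag of words, so the same feature vector
```

The two sentences have opposite meanings but identical averaged features, so no classifier built on top of this representation can tell them apart.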


# 1 Imports
import numpy as np
from keras.models import Model
from keras.layers import Dense, Input, Dropout, LSTM, Activation
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.initializers import glorot_uniform
np.random.seed(1)
# 2 Input preprocessing: padding
def sentences_to_indices(X, word_to_index, max_len):
    """
    Convert an array of sentences into an array of word indices, zero-padded to max_len.

    Arguments:
    X -- array of sentences (strings), of shape (m, 1)
    word_to_index -- a dictionary mapping each word to its index
    max_len -- maximum number of words in a sentence. You can assume every sentence in X is no longer than this.

    Returns:
    X_indices -- array of indices corresponding to words in the sentences from X, of shape (m, max_len)
    """
    m = X.shape[0]
    X_indices = np.zeros((m, max_len))
    for i in range(m):
        sentence_words = X[i].lower().split()
        j = 0
        for w in sentence_words:
            X_indices[i, j] = word_to_index[w]
            j = j + 1
    return X_indices
# 3 Embedding layer
def pretrained_embedding_layer(word_to_vec_map, word_to_index):
    """
    Creates a Keras Embedding() layer and loads in pre-trained GloVe 50-dimensional vectors.

    Arguments:
    word_to_vec_map -- dictionary mapping words to their GloVe vector representation.
    word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)

    Returns:
    embedding_layer -- pretrained layer Keras instance
    """
    vocab_len = len(word_to_index) + 1                  # adding 1 to fit Keras embedding (requirement)
    emb_dim = word_to_vec_map["cucumber"].shape[0]      # define dimensionality of your GloVe word vectors (= 50)
    emb_matrix = np.zeros((vocab_len, emb_dim))
    # Set each row "index" of the embedding matrix to be the word vector representation of the "index"th word of the vocabulary
    for word, index in word_to_index.items():
        emb_matrix[index, :] = word_to_vec_map[word]
    # trainable=False keeps the pretrained vectors fixed during training
    embedding_layer = Embedding(input_dim = vocab_len, output_dim = emb_dim, trainable=False)
    # Build the embedding layer, it is required before setting the weights of the embedding layer. Do not modify the "None".
    embedding_layer.build((None,))
    # Set the weights of the embedding layer to the embedding matrix. Your layer is now pretrained.
    embedding_layer.set_weights([emb_matrix])
    return embedding_layer
# 4 Build the model
def Emojify_V2(input_shape, word_to_vec_map, word_to_index):
    """
    Function creating the Emojify-v2 model's graph.

    Arguments:
    input_shape -- shape of the input, usually (max_len,)
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its 50-dimensional vector representation
    word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)

    Returns:
    model -- a model instance in Keras
    """
    # Define sentence_indices as the input of the graph, it should be of shape input_shape and dtype 'int32' (as it contains indices).
    sentence_indices = Input(input_shape, dtype = 'int32')
    # Create the embedding layer pretrained with GloVe vectors
    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
    # Propagate sentence_indices through your embedding layer, you get back the embeddings
    embeddings = embedding_layer(sentence_indices)
    # Propagate the embeddings through an LSTM layer with 128-dimensional hidden state
    # Be careful, the returned output should be a batch of sequences.
    X = LSTM(128, return_sequences = True)(embeddings)
    # Add dropout with a probability of 0.5
    X = Dropout(0.5)(X)
    # Propagate X through another LSTM layer with 128-dimensional hidden state
    # Be careful, the returned output should be a single hidden state, not a batch of sequences.
    X = LSTM(128, return_sequences = False)(X)
    # Add dropout with a probability of 0.5
    X = Dropout(0.5)(X)
    # Propagate X through a Dense layer to get back a batch of 5-dimensional vectors.
    X = Dense(5)(X)
    # Add a softmax activation
    X = Activation("softmax")(X)
    # Create Model instance which converts sentence_indices into X.
    model = Model(inputs = sentence_indices, outputs = X)
    return model
model = Emojify_V2((maxLen,), word_to_vec_map, word_to_index)
model.summary()
"""
Layer (type)                 Output Shape              Param #
input_2 (InputLayer)         (None, 10)                0
embedding_3 (Embedding)      (None, 10, 50)            20000050
lstm_3 (LSTM)                (None, 10, 128)           91648
dropout_3 (Dropout)          (None, 10, 128)           0
lstm_4 (LSTM)                (None, 128)               131584
dropout_4 (Dropout)          (None, 128)               0
dense_2 (Dense)              (None, 5)                 645
activation_2 (Activation)    (None, 5)                 0
Total params: 20,223,927
Trainable params: 223,877
Non-trainable params: 20,000,050
_________________________________________________________________"""
# 5 Training
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
X_train_indices = sentences_to_indices(X_train, word_to_index, maxLen)
Y_train_oh = convert_to_one_hot(Y_train, C = 5)
model.fit(X_train_indices, Y_train_oh, epochs = 50, batch_size = 32, shuffle=True)
"""Epoch 50/50
132/132 [==============================] - 0s - loss: 0.0797 - acc: 0.9848"""
# 6 Evaluation
X_test_indices = sentences_to_indices(X_test, word_to_index, max_len = maxLen)
Y_test_oh = convert_to_one_hot(Y_test, C = 5)
loss, acc = model.evaluate(X_test_indices, Y_test_oh)
print("Test accuracy = ", acc)
"""Test accuracy =  0.925000008515"""
# This code allows you to see the mislabelled examples
C = 5
y_test_oh = np.eye(C)[Y_test.reshape(-1)]
X_test_indices = sentences_to_indices(X_test, word_to_index, maxLen)
pred = model.predict(X_test_indices)
for i in range(len(X_test)):
    num = np.argmax(pred[i])
    if num != Y_test[i]:
        print('Expected emoji:' + label_to_emoji(Y_test[i]) + ' prediction: ' + X_test[i] + label_to_emoji(num).strip())
"""Expected emoji:❤️ prediction: I love taking breaks  ?
Expected emoji:? prediction: she is a bully ?
Expected emoji:? prediction: she said yes   ?
Expected emoji:❤️ prediction: I love you to the stars and back  ?"""
# Change the sentence below to see your prediction. Make sure all the words are in the Glove embeddings.
x_test = np.array(['not feeling happy'])
X_test_indices = sentences_to_indices(x_test, word_to_index, maxLen)
print(x_test[0] +' '+  label_to_emoji(np.argmax(model.predict(X_test_indices))))
"""not feeling happy ?"""

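The reminder that every word must be in the GloVe vocabulary exists because sentences_to_indices raises a KeyError on an unknown word. One possible workaround, sketched here as an assumption and not part of the original notes, is to map unknown words to index 0, on the assumption that index 0 is not assigned to any word (the embedding matrix above reserves one extra row, which stays all zeros). The names `sentences_to_indices_safe`, `unk_index`, and `toy_word_to_index` are hypothetical.

```python
import numpy as np

def sentences_to_indices_safe(X, word_to_index, max_len, unk_index=0):
    """Like sentences_to_indices, but unknown words map to unk_index
    and sentences longer than max_len are truncated."""
    X_indices = np.zeros((len(X), max_len))
    for i, sentence in enumerate(X):
        words = sentence.lower().split()[:max_len]
        for j, w in enumerate(words):
            X_indices[i, j] = word_to_index.get(w, unk_index)
    return X_indices

toy_word_to_index = {"not": 1, "feeling": 2, "happy": 3}  # toy vocabulary
X = np.array(["not feeling happy", "not feeling zzzunknown today"])
print(sentences_to_indices_safe(X, toy_word_to_index, max_len=3))
```

With this choice an unknown word contributes a zero vector, exactly like padding. That keeps prediction from crashing, at the cost of silently ignoring the word.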

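  • Keras requires every example in a mini-batch to have the same length so that the batch can be vectorized, but sentences usually differ in length. The fix is padding.
  • Learn how to create a Keras embedding layer: keras.layers.Embedding(vocab_len, emb_dim)
    • step 1: convert the sentences of each mini-batch of X into lists of word indices
    • step 2: pad each list to max length
    • step 3: feed the padded indices to the embedding layer; the embedding matrix E has shape (400001, 50)
    • step 4: the layer returns the corresponding matrix of word vectors for each sentence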
  • If you have an NLP task where the training set is small, using word embeddings can help your algorithm significantly. Word embeddings allow your model to work on words in the test set that may not even have appeared in your training set.
  • Training sequence models in Keras (and in most other deep learning frameworks) requires a few important details:
    • To use mini-batches, the sequences need to be padded so that all the examples in a mini-batch have the same length.
    • An Embedding() layer can be initialized with pretrained values. These values can be either fixed or trained further on your dataset. If however your labeled dataset is small, it’s usually not worth trying to train a large pre-trained set of embeddings.
    • LSTM() has a flag called return_sequences to decide whether to return every hidden state or only the last one.
    • You can use Dropout() right after LSTM() to regularize your network.
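
The padding and embedding steps listed above can be sketched in plain NumPy, without Keras. The toy vocabulary, the random matrix `E`, and the small `emb_dim` are assumptions chosen to keep the example short; the notes themselves use 50-dimensional GloVe vectors.

```python
import numpy as np

toy_word_to_index = {"i": 1, "love": 2, "you": 3, "food": 4, "is": 5, "ready": 6}
sentences = ["i love you", "food is ready", "love"]
max_len = 4
emb_dim = 5  # toy size; the GloVe vectors in the notes are 50-dimensional

# Steps 1-2: words -> indices, zero-padded to max_len.
X_indices = np.zeros((len(sentences), max_len), dtype=int)
for i, s in enumerate(sentences):
    for j, w in enumerate(s.lower().split()):
        X_indices[i, j] = toy_word_to_index[w]

# Step 3: embedding matrix E with one row per index (row 0 is the padding row).
vocab_len = len(toy_word_to_index) + 1
rng = np.random.default_rng(1)
E = rng.normal(size=(vocab_len, emb_dim))
E[0] = 0.0

# Step 4: an embedding lookup is just row indexing into E.
embeddings = E[X_indices]
print(X_indices.shape, embeddings.shape)
```

Every sentence becomes a (max_len, emb_dim) matrix regardless of its original length, which is what lets a whole mini-batch be stacked into a single array.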


