1. A sample of the IMDB dataset is shown below

[{"rating": 5, "title": "The dark is rising!", "movie": "tt0484562", "review": "It is adapted from the book. I did not read the book and maybe that is why I still enjoyed the movie. There are recent famous books adapted into movies like Eragon which is an unsuccessful movie compared to the rest but I like it better than The Seeker adaptation, another one is The Chronicles of Narnia: The lion, The witch and The wardrobe which is successful and has a sequel under it. The Seeker is this year adaptation. It did a fair job. It is not bad and it is not good. It depends on the viewer. If fans hate the unfaithful adaptation because it does not really follow the line of the story, then be it. Those who have not read the book like me would want to go and watch this movie for entertainment. It did make me a little interested but not enough.It does have its good and bad points. The director failed to bring the spark of the movie. The cast are okay, not too bad. The special effects are considered good for a fantasy movie. What I don't like it is that it is quite short, it just bring straight to the point and that is it. By the time, you will realise it is going to end like that with some short fantasy action. The story is like any fantasy movies. Fast and straight-forward plot. The talking seems long and boring followed by some short action. That is about it. Nothing else. Nothing so interesting to catch your eyes.Overall, it makes a harmless movie to watch in free time or the boring weekends. It is considered dark for children but they still can handle it. It seems long but it is short. Overall, I still think Eragon is better than this. Either you don't like it or like it, it does not matter. It is your view. In this case, I can't say anything. It is just okay.", "link": "http://www.imdb.com/title/tt0484562/reviews-73", "user": "ur12930537"}, {"rating": 5, "title": "Bad attempt by the people that borough us Eragon.", "movie": "tt0484562", "review": "Ever since Lord of the Rings became a hit and was internationally acclaimed all other studios are trying to do the same thing and I can tell you now we are not getting many successes out of these half hearted attempts. The decent ones are Chronicles of Narnia which Disney snapped up and Harry Potter from Warner Brothers. Even the Golden Compass was pretty good by the same people who did Lord of the Rings but then we get to the bad ones. Fox studios gave us Eragon which I still believe is the worst movie I have ever seen. Now Fox studios tries again with the Seeker: The Dark is Rising and I can tell you it is a lot better than Eragon. However, it still is not very good. The director filmed the movie and then realised that his movie was too short so he had a great idea of just making characters appear for no reason and just look scary. I have not read the books but from what I have heard it isn't even faithful their. Overall, it was a decent try but still not worth seeing.", "link": "http://www.imdb.com/title/tt0484562/reviews-108", "user": "ur15303216"}, {"rating": 3, "title": "fantasy movie lacks magic", "movie": "tt0484562", "review": "I've not read the novel this movie was based on, but do enjoy fantasy movies, and thought it looked interesting. But after seeing it...... oh dear.An American boy, Will living with his family in a small village somewhere in England, discovers on his 14th birthday that he's The Seeker for a group of old ones, who fight for the Light. 
He's got days to find them, before the Rider who fights for the Dark comes to full strength....As I said, I've not read the novel, but seeing the movie several things spring to mind. There are echoes of Harry Potter, the Russian movies Night Watch and Day Watch amongst other fantasy movies tossed into the mix. The script is all over the place, though perhaps this is due to some brutal editing as the movie seems disjointed in parts and the director can't resist having his camera moving all the time and with some quick editing it's almost as if he's trying to be Micheal Bay!! You also get the feeling that despite the production team's efforts, the movie didn't have the budget it really needed. There are a couple of so-called twists in the mix, but they are too obvious to work effectively.The acting isn't too bad, with special mention going to Ian McShane, as one of the elder ones but try as they might, they can't save the movie.As the first of a trio of fantasy movies coming out, the others being Stardust and The Golden Compass, I hope this is not a sign of things to come.", "link": "http://www.imdb.com/title/tt0484562/reviews-60", "user": "ur0680065"}
]

2. Load the dataset

The TensorFlow 2.0 datasets collection (tensorflow-datasets) includes this dataset, so nothing needs to be downloaded on Colab. If it is not available locally, install it with the following command:

!pip install -q tensorflow-datasets

Load the dataset:

import tensorflow_datasets as tfds
imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)
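
The info object returned alongside the data describes the dataset; for example, you can check the split sizes through the standard tfds metadata fields:

print(info.splits['train'].num_examples)  # 25000
print(info.splits['test'].num_examples)   # 25000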

3. Set up the training and test sets

Use the .numpy() method to convert the tensor-format data stored by TensorFlow into the NumPy format needed for training.
In Python 3 you need str(s.numpy()).

import numpy as np

train_data, test_data = imdb['train'], imdb['test']

training_sentences = []
training_labels = []
testing_sentences = []
testing_labels = []

# str(s.numpy()) is needed in Python 3 instead of just s.numpy()
for s, l in train_data:
    training_sentences.append(str(s.numpy()))
    training_labels.append(l.numpy())
for s, l in test_data:
    testing_sentences.append(str(s.numpy()))
    testing_labels.append(l.numpy())

training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)
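
Note that str(s.numpy()) keeps Python's bytes representation, which is why the printed samples in section 5 below still carry a b'...' wrapper. If you prefer plain strings, decode the bytes explicitly; a minimal alternative for the loop above:

# s.numpy() returns bytes; .decode('utf8') yields a clean str without the b'...' prefix.
for s, l in train_data:
    training_sentences.append(s.numpy().decode('utf8'))
    training_labels.append(l.numpy())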

4. Text preprocessing

Set the vocabulary size to 10000, the embedding dimension to 16, and the maximum sequence length to 120.
padding: 'pre' or 'post': pad at the front or at the end of each sequence. truncating: 'pre' or 'post': if a sequence is longer than maxlen, cut it from the front or from the end (a short demonstration follows).
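
A small standalone sketch of these two options, using toy sequences:

from tensorflow.keras.preprocessing.sequence import pad_sequences

seqs = [[1, 2, 3], [4, 5, 6, 7, 8]]
print(pad_sequences(seqs, maxlen=4))  # defaults: padding='pre', truncating='pre'
# [[0 1 2 3]
#  [5 6 7 8]]
print(pad_sequences(seqs, maxlen=4, padding='post', truncating='post'))
# [[1 2 3 0]
#  [4 5 6 7]]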

vocab_size = 10000
embedding_dim = 16
max_length = 120
trunc_type='post'
oov_tok = "<>"from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequencestokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(training_sentences)
padded = pad_sequences(sequences, maxlen=max_length, truncating=trunc_type)

testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences,maxlen=max_length)
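
A quick sanity check on the resulting arrays (the shapes assume the standard 25,000-review train and test splits):

print(padded.shape)          # (25000, 120)
print(testing_padded.shape)  # (25000, 120)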

5. Display the data processed by pad_sequences

To display the data, invert the word-index pairs with reverse_word_index = dict([(value, key) for (key, value) in word_index.items()]), then write a helper function decode_review to decode and display the padded data.

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

print(decode_review(padded[1]))
print(training_sentences[1])

b'i have been known to fall asleep during films but this is usually due to a combination of things including really tired being warm and comfortable on the <> and having just eaten a lot however on this occasion i fell asleep because the film was rubbish the plot development was constant constantly slow and boring things seemed to happen but with no explanation of what was causing them or why i admit i may have missed part of the film but i watched the majority of it and everything just seemed to happen of its own <> without any real concern for anything else i cant recommend this film at all '
b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot development was constant. Constantly slow and boring. Things seemed to happen, but with no explanation of what was causing them or why. I admit, I may have missed part of the film, but i watched the majority of it and everything just seemed to happen of its own accord without any real concern for anything else. I cant recommend this film at all.'

Comparing the two, you can see that the preprocessing ignored the words' case, dropped the punctuation, and marked out-of-vocabulary words as <>; these are defaults of the Tokenizer (the padding itself does not alter the text).
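
A minimal demonstration of those Tokenizer defaults (lowercasing, punctuation filtering, and OOV mapping), with a toy corpus:

from tensorflow.keras.preprocessing.text import Tokenizer

demo = Tokenizer(num_words=10, oov_token="<>")
demo.fit_on_texts(["Hello, world!"])
print(demo.word_index)                                # {'<>': 1, 'hello': 2, 'world': 3}
print(demo.texts_to_sequences(["Hello WORLD zzz."]))  # [[2, 3, 1]]  ('zzz' maps to <>)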

6. Build the network

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
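
As a sanity check on model.summary(), the parameter counts can be worked out by hand from the hyperparameters defined earlier:

# Embedding: one embedding_dim-vector per vocabulary entry
embedding_params = vocab_size * embedding_dim            # 10000 * 16 = 160000
# Dense(6): Flatten feeds max_length * embedding_dim inputs, plus 6 biases
dense1_params = max_length * embedding_dim * 6 + 6       # 1920 * 6 + 6 = 11526
# Dense(1): 6 weights plus 1 bias
dense2_params = 6 * 1 + 1                                # 7
print(embedding_params + dense1_params + dense2_params)  # 171533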


Here you can also use GlobalAveragePooling1D instead of Flatten; the 1D variant matches the (samples, steps, features) output of the Embedding layer, while GlobalAveragePooling2D is the analogous layer for 4D image tensors.

Flatten takes a tensor of any shape and converts it into a one-dimensional tensor (plus the samples dimension), with all values kept. For example, a tensor of shape (samples, 10, 20, 1) is flattened to (samples, 10 * 20 * 1).
GlobalAveragePooling does something different: it applies average pooling over the spatial (or time-step) dimensions, so only the samples and channel dimensions remain and the individual values are replaced by their mean. For example, with dimensions 2 and 3 as spatial (channels last), GlobalAveragePooling2D turns a (samples, 10, 20, 1) tensor into (samples, 1); in this model, GlobalAveragePooling1D turns (samples, 120, 16) into (samples, 16).
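
A sketch of the same model with GlobalAveragePooling1D swapped in for Flatten (same compile settings as above; this variant feeds far fewer inputs into the first Dense layer):

model_gap = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),  # (samples, 120, 16) -> (samples, 16)
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model_gap.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])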

7. Train the model

num_epochs = 10
model.fit(padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))
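
If you also want to inspect the training curves, keep the History object that fit() returns; a minimal sketch, assuming matplotlib is available:

import matplotlib.pyplot as plt

history = model.fit(padded, training_labels_final, epochs=num_epochs,
                    validation_data=(testing_padded, testing_labels_final))
plt.plot(history.history['accuracy'], label='train')        # 'acc' in older Keras versions
plt.plot(history.history['val_accuracy'], label='validation')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()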

8. Check the shape of the embedding matrix

e = model.layers[0]
weights = e.get_weights()[0]
print(weights.shape) # shape: (vocab_size, embedding_dim)

(10000, 16)
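
Each row of this matrix is the vector for one word, so a single word can be looked up directly (a sketch using the word_index built earlier; it assumes the word falls within the top 10000 words, since only those rows exist):

# Index 0 is reserved for padding, so word indices start at 1.
vec = weights[word_index['interesting']]
print(vec.shape)  # (16,)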

9. Visualization

Displaying the embeddings at http://projector.tensorflow.org/ requires two files; the following code writes them and downloads them:

import io

out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('meta.tsv', 'w', encoding='utf-8')
for word_num in range(1, vocab_size):
    word = reverse_word_index[word_num]
    embeddings = weights[word_num]
    out_m.write(word + "\n")
    out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")
out_v.close()
out_m.close()

try:
    from google.colab import files
except ImportError:
    pass
else:
    files.download('vecs.tsv')
    files.download('meta.tsv')

Open the site and scroll down on the left to find Load; at marker 2 upload the vecs.tsv you just downloaded, and at marker 3 upload meta.tsv.
Enter the word interesting on the right to see its position in the embedding space and the words nearest to it.
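
You can also approximate what the projector shows without leaving Python, by ranking words by cosine similarity to a query vector; a small sketch using only NumPy (the exact neighbour list will vary between training runs):

query = weights[word_index['interesting']]
# Cosine similarity between every embedding row and the query vector.
sims = weights @ query / (np.linalg.norm(weights, axis=1) * np.linalg.norm(query) + 1e-8)
nearest = np.argsort(-sims)[:10]  # indices of the ten most similar words
print([reverse_word_index.get(i, '?') for i in nearest])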
