1. A sample of the IMDB dataset is shown below

[{"rating": 5, "title": "The dark is rising!", "movie": "tt0484562", "review": "It is adapted from the book. I did not read the book and maybe that is why I still enjoyed the movie. There are recent famous books adapted into movies like Eragon which is an unsuccessful movie compared to the rest but I like it better than The Seeker adaptation, another one is The Chronicles of Narnia: The lion, The witch and The wardrobe which is successful and has a sequel under it. The Seeker is this year adaptation. It did a fair job. It is not bad and it is not good. It depends on the viewer. If fans hate the unfaithful adaptation because it does not really follow the line of the story, then be it. Those who have not read the book like me would want to go and watch this movie for entertainment. It did make me a little interested but not enough.It does have its good and bad points. The director failed to bring the spark of the movie. The cast are okay, not too bad. The special effects are considered good for a fantasy movie. What I don't like it is that it is quite short, it just bring straight to the point and that is it. By the time, you will realise it is going to end like that with some short fantasy action. The story is like any fantasy movies. Fast and straight-forward plot. The talking seems long and boring followed by some short action. That is about it. Nothing else. Nothing so interesting to catch your eyes.Overall, it makes a harmless movie to watch in free time or the boring weekends. It is considered dark for children but they still can handle it. It seems long but it is short. Overall, I still think Eragon is better than this. Either you don't like it or like it, it does not matter. It is your view. In this case, I can't say anything. It is just okay.", "link": "http://www.imdb.com/title/tt0484562/reviews-73", "user": "ur12930537"}, {"rating": 5, "title": "Bad attempt by the people that borough us Eragon.", "movie": "tt0484562", "review": "Ever since Lord of the Rings became a hit and was internationally acclaimed all other studios are trying to do the same thing and I can tell you now we are not getting many successes out of these half hearted attempts. The decent ones are Chronicles of Narnia which Disney snapped up and Harry Potter from Warner Brothers. Even the Golden Compass was pretty good by the same people who did Lord of the Rings but then we get to the bad ones. Fox studios gave us Eragon which I still believe is the worst movie I have ever seen. Now Fox studios tries again with the Seeker: The Dark is Rising and I can tell you it is a lot better than Eragon. However, it still is not very good. The director filmed the movie and then realised that his movie was too short so he had a great idea of just making characters appear for no reason and just look scary. I have not read the books but from what I have heard it isn't even faithful their. Overall, it was a decent try but still not worth seeing.", "link": "http://www.imdb.com/title/tt0484562/reviews-108", "user": "ur15303216"}, {"rating": 3, "title": "fantasy movie lacks magic", "movie": "tt0484562", "review": "I've not read the novel this movie was based on, but do enjoy fantasy movies, and thought it looked interesting. But after seeing it...... oh dear.An American boy, Will living with his family in a small village somewhere in England, discovers on his 14th birthday that he's The Seeker for a group of old ones, who fight for the Light. 
He's got days to find them, before the Rider who fights for the Dark comes to full strength....As I said, I've not read the novel, but seeing the movie several things spring to mind. There are echoes of Harry Potter, the Russian movies Night Watch and Day Watch amongst other fantasy movies tossed into the mix. The script is all over the place, though perhaps this is due to some brutal editing as the movie seems disjointed in parts and the director can't resist having his camera moving all the time and with some quick editing it's almost as if he's trying to be Micheal Bay!! You also get the feeling that despite the production team's efforts, the movie didn't have the budget it really needed. There are a couple of so-called twists in the mix, but they are too obvious to work effectively.The acting isn't too bad, with special mention going to Ian McShane, as one of the elder ones but try as they might, they can't save the movie.As the first of a trio of fantasy movies coming out, the others being Stardust and The Golden Compass, I hope this is not a sign of things to come.", "link": "http://www.imdb.com/title/tt0484562/reviews-60", "user": "ur0680065"}
]

2. Load the dataset

The TensorFlow 2.0 datasets collection (tensorflow-datasets) includes this dataset, so nothing needs to be downloaded on Colab. If it is not available locally, install it with the following command:

!pip install -q tensorflow-datasets

Load the dataset:

import tensorflow_datasets as tfds
imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)
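
The info object returned alongside the data describes the dataset; for example, you can check the split sizes through the standard tfds metadata fields:

print(info.splits['train'].num_examples)  # 25000
print(info.splits['test'].num_examples)   # 25000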

3. Set up the training and test sets

Use the .numpy() method to convert the tensor-format data stored by TensorFlow into the NumPy format needed for training.
In Python 3 you need str(s.numpy()).

import numpy as np

train_data, test_data = imdb['train'], imdb['test']

training_sentences = []
training_labels = []
testing_sentences = []
testing_labels = []

# str(s.numpy()) is needed in Python 3 instead of just s.numpy()
for s, l in train_data:
    training_sentences.append(str(s.numpy()))
    training_labels.append(l.numpy())
for s, l in test_data:
    testing_sentences.append(str(s.numpy()))
    testing_labels.append(l.numpy())

training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)
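
Note that str(s.numpy()) keeps Python's bytes representation, which is why the printed samples in section 5 below still carry a b'...' wrapper. If you prefer plain strings, decode the bytes explicitly; a minimal alternative for the loop above:

# s.numpy() returns bytes; .decode('utf8') yields a clean str without the b'...' prefix.
for s, l in train_data:
    training_sentences.append(s.numpy().decode('utf8'))
    training_labels.append(l.numpy())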

4. Text preprocessing

Set the vocabulary size to 10000, the embedding dimension to 16, and the maximum sequence length to 120.
padding: 'pre' or 'post': pad at the front or at the end of each sequence. truncating: 'pre' or 'post': if a sequence is longer than maxlen, cut it from the front or from the end (a short demonstration follows).
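
A small standalone sketch of these two options, using toy sequences:

from tensorflow.keras.preprocessing.sequence import pad_sequences

seqs = [[1, 2, 3], [4, 5, 6, 7, 8]]
print(pad_sequences(seqs, maxlen=4))  # defaults: padding='pre', truncating='pre'
# [[0 1 2 3]
#  [5 6 7 8]]
print(pad_sequences(seqs, maxlen=4, padding='post', truncating='post'))
# [[1 2 3 0]
#  [4 5 6 7]]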

vocab_size = 10000
embedding_dim = 16
max_length = 120
trunc_type='post'
oov_tok = "<>"from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequencestokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(training_sentences)
padded = pad_sequences(sequences, maxlen=max_length, truncating=trunc_type)

testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences,maxlen=max_length)
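
A quick sanity check on the resulting arrays (the shapes assume the standard 25,000-review train and test splits):

print(padded.shape)          # (25000, 120)
print(testing_padded.shape)  # (25000, 120)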

5. Display the data processed by pad_sequences

To display the data, invert the word-index pairs with reverse_word_index = dict([(value, key) for (key, value) in word_index.items()]), then write a helper function decode_review to decode and display the padded data.

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

print(decode_review(padded[1]))
print(training_sentences[1])

b'i have been known to fall asleep during films but this is usually due to a combination of things including really tired being warm and comfortable on the <> and having just eaten a lot however on this occasion i fell asleep because the film was rubbish the plot development was constant constantly slow and boring things seemed to happen but with no explanation of what was causing them or why i admit i may have missed part of the film but i watched the majority of it and everything just seemed to happen of its own <> without any real concern for anything else i cant recommend this film at all '
b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot development was constant. Constantly slow and boring. Things seemed to happen, but with no explanation of what was causing them or why. I admit, I may have missed part of the film, but i watched the majority of it and everything just seemed to happen of its own accord without any real concern for anything else. I cant recommend this film at all.'

Comparing the two, you can see that the preprocessing ignored the words' case, dropped the punctuation, and marked out-of-vocabulary words as <>; these are defaults of the Tokenizer (the padding itself does not alter the text).
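
A minimal demonstration of those Tokenizer defaults (lowercasing, punctuation filtering, and OOV mapping), with a toy corpus:

from tensorflow.keras.preprocessing.text import Tokenizer

demo = Tokenizer(num_words=10, oov_token="<>")
demo.fit_on_texts(["Hello, world!"])
print(demo.word_index)                                # {'<>': 1, 'hello': 2, 'world': 3}
print(demo.texts_to_sequences(["Hello WORLD zzz."]))  # [[2, 3, 1]]  ('zzz' maps to <>)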

6. Build the network

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
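
As a sanity check on model.summary(), the parameter counts can be worked out by hand from the hyperparameters defined earlier:

# Embedding: one embedding_dim-vector per vocabulary entry
embedding_params = vocab_size * embedding_dim            # 10000 * 16 = 160000
# Dense(6): Flatten feeds max_length * embedding_dim inputs, plus 6 biases
dense1_params = max_length * embedding_dim * 6 + 6       # 1920 * 6 + 6 = 11526
# Dense(1): 6 weights plus 1 bias
dense2_params = 6 * 1 + 1                                # 7
print(embedding_params + dense1_params + dense2_params)  # 171533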


Here you can also use GlobalAveragePooling1D instead of Flatten; the 1D variant matches the (samples, steps, features) output of the Embedding layer, while GlobalAveragePooling2D is the analogous layer for 4D image tensors.

Flatten takes a tensor of any shape and converts it into a one-dimensional tensor (plus the samples dimension), with all values kept. For example, a tensor of shape (samples, 10, 20, 1) is flattened to (samples, 10 * 20 * 1).
GlobalAveragePooling does something different: it applies average pooling over the spatial (or time-step) dimensions, so only the samples and channel dimensions remain and the individual values are replaced by their mean. For example, with dimensions 2 and 3 as spatial (channels last), GlobalAveragePooling2D turns a (samples, 10, 20, 1) tensor into (samples, 1); in this model, GlobalAveragePooling1D turns (samples, 120, 16) into (samples, 16).
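
A sketch of the same model with GlobalAveragePooling1D swapped in for Flatten (same compile settings as above; this variant feeds far fewer inputs into the first Dense layer):

model_gap = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),  # (samples, 120, 16) -> (samples, 16)
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model_gap.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])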

7. Train the model

num_epochs = 10
model.fit(padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))
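
If you also want to inspect the training curves, keep the History object that fit() returns; a minimal sketch, assuming matplotlib is available:

import matplotlib.pyplot as plt

history = model.fit(padded, training_labels_final, epochs=num_epochs,
                    validation_data=(testing_padded, testing_labels_final))
plt.plot(history.history['accuracy'], label='train')        # 'acc' in older Keras versions
plt.plot(history.history['val_accuracy'], label='validation')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()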

8. Check the shape of the embedding matrix

e = model.layers[0]
weights = e.get_weights()[0]
print(weights.shape) # shape: (vocab_size, embedding_dim)

(10000, 16)
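
Each row of this matrix is the vector for one word, so a single word can be looked up directly (a sketch using the word_index built earlier; it assumes the word falls within the top 10000 words, since only those rows exist):

# Index 0 is reserved for padding, so word indices start at 1.
vec = weights[word_index['interesting']]
print(vec.shape)  # (16,)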

9. Visualization

Displaying the embeddings at http://projector.tensorflow.org/ requires two files; the following code writes them and downloads them:

import io

out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('meta.tsv', 'w', encoding='utf-8')
for word_num in range(1, vocab_size):
    word = reverse_word_index[word_num]
    embeddings = weights[word_num]
    out_m.write(word + "\n")
    out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")
out_v.close()
out_m.close()

try:
    from google.colab import files
except ImportError:
    pass
else:
    files.download('vecs.tsv')
    files.download('meta.tsv')

Open the site and scroll down on the left to find Load; at marker 2 upload the vecs.tsv you just downloaded, and at marker 3 upload meta.tsv.
Enter the word interesting on the right to see its position in the embedding space and the words nearest to it.
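
You can also approximate what the projector shows without leaving Python, by ranking words by cosine similarity to a query vector; a small sketch using only NumPy (the exact neighbour list will vary between training runs):

query = weights[word_index['interesting']]
# Cosine similarity between every embedding row and the query vector.
sims = weights @ query / (np.linalg.norm(weights, axis=1) * np.linalg.norm(query) + 1e-8)
nearest = np.argsort(-sims)[:10]  # indices of the ten most similar words
print([reverse_word_index.get(i, '?') for i in nearest])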
