Course 3. Natural Language Processing in TensorFlow

Tokenizing text, encoding it, building a vocabulary, Embedding, sentiment analysis, and text generation.

week 1 Encoding and padding

The idea: every word in the text is assigned an integer code, so a sentence can be represented as a sequence of numbers. But sentences differ in length, so padding is used to bring them to a uniform length, ready to feed into a neural network.
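For example, with maxlen=5 and the default pre-padding, the sequence [3, 2, 4] becomes [0, 0, 3, 2, 4], while a seven-token sentence loses its first two tokens (truncating also defaults to "pre").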

  • Code 1
# Imports
import tensorflow as tf
import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# ----------- 1. Prepare the data ---------------
sentences = ['I love my dog','I love my cat','You love my dog!','Do you think my dog is amazing?']

# ----------- 2. Tokenize the words -------------
# Instantiate the tokenizer. num_words=100 means only the 100 most frequent
# words are encoded; anything the tokenizer cannot recognize is marked '<OOV>'.
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
# Fit the tokenizer on the texts
tokenizer.fit_on_texts(sentences)
# Extract the word-to-index mapping
word_index = tokenizer.word_index

# ----------- 3. Convert the texts to integer sequences -----------
sequences = tokenizer.texts_to_sequences(sentences)

# ----------- 4. Padding ----------------------
# The generated sequences have different lengths, so unify them:
# sequences shorter than maxlen are padded with 0, longer ones are truncated.
padded = pad_sequences(sequences, maxlen=5)

print("\n Word Index", word_index)
print("\n sequences = ", sequences)
print("\n Padded Sequences:")
print(padded)

# ----------- 5. Encode some test data with the same tokenizer -------
test_data = ['I really love my dog', 'my dog loves my manatee']
test_seq = tokenizer.texts_to_sequences(test_data)
print("\n Test sequences", test_seq)
# padding: "pre" pads with zeros at the front, "post" at the back
padded = pad_sequences(test_seq, maxlen=10, padding="pre")
print("\n Padded Test Sequence:")
print(padded)
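Note how the test sentences exercise the OOV handling: 'really', 'loves' and 'manatee' were never seen by fit_on_texts, so they are all encoded as index 1, the index the Tokenizer reserves for '<OOV>'.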

week 1.2 Parsing JSON

.json is a common data format in text analysis, especially for data gathered by web crawlers. This example analyzes sarcasm.json, a dataset of sarcastic headlines.
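Each record in sarcasm.json has the three fields read out in the code below; a representative entry (values illustrative) looks like:

{"article_link": "https://example.com/some-headline", "headline": "some news headline text", "is_sarcastic": 0}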

  • Code
import json
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Open the json file with a context manager
with open("tmp/sarcasm.json", "r") as f:
    datastore = json.load(f)

sentences = []
labels = []
urls = []
for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])

tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')
print(len(word_index))
print(sentences[2])
print(padded[2])
print(padded.shape)
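Since no maxlen is passed here, pad_sequences pads every headline to the length of the longest one, so padded.shape is (number of headlines, longest headline length).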

week 2.1 IMDB sentiment analysis

Week 2 uses the imdb movie-review dataset from tensorflow_datasets: the task is to classify reviews as positive or negative. However, the tensorflow_datasets URLs all point to servers abroad and time out for me, so here I downloaded aclImdb locally and analyze it from disk; you can find the dataset yourself with a search. Another option is to import the imdb data directly from keras.datasets, but that data is already preprocessed, and since the whole point here is to practice preprocessing, keras.datasets is not used.

  • Code
    The main idea: clean the local files with a regular expression, then tokenize and pad them, and feed them to the model for training.
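    This assumes the aclImdb archive has been unpacked under tmp/, so the reviews sit in tmp/aclImdb/train/pos, tmp/aclImdb/train/neg, tmp/aclImdb/test/pos and tmp/aclImdb/test/neg — 12,500 files per folder, which is what the hard-coded labels below rely on.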

# ------------- 1. Clean the data ---------------------
import re
import os

def rm_tags(text):
    re_tags = re.compile(r'<[^>]+>')   # matches HTML tags such as <br />
    return re_tags.sub(' ', text)      # replace them with a space

def read_files(filetype):
    path = 'tmp/aclImdb/'
    file_list = []
    positive_path = path + filetype + "/pos/"
    for f in os.listdir(positive_path):
        file_list += [positive_path + f]
    negative_path = path + filetype + "/neg/"
    for f in os.listdir(negative_path):
        file_list += [negative_path + f]
    print("read", filetype, "files:", len(file_list))
    all_labels = [1] * 12500 + [0] * 12500   # 12,500 positive then 12,500 negative reviews
    all_texts = []
    for fi in file_list:
        with open(fi, encoding='utf8') as file_input:
            all_texts += [rm_tags(" ".join(file_input.readlines()))]
    return all_labels, all_texts

# -------------- 2. Prepare the data ----------------------
y_train, train_text = read_files("train")
y_test, test_text = read_files("test")
# Shuffle texts and labels with the same random state so they stay aligned
import numpy as np
state = np.random.get_state()
np.random.shuffle(train_text)
np.random.shuffle(test_text)
np.random.set_state(state)
np.random.shuffle(y_train)
np.random.shuffle(y_test)

# -------------- 3. Convert the labels to arrays -----------------------
train_label = np.array(y_train)
test_label = np.array(y_test)

# -------------- 4. Encode the texts ---------------------------
vocab_size = 10000      # vocabulary size
embedding_dim = 16      # embedding dimension
max_length = 120        # maximum number of words per review
trunc_type = "post"     # truncate at the end
oov_tok = "<OOV>"       # replacement for out-of-vocabulary words
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
# Build the vocabulary
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train_text)
word_index = tokenizer.word_index
# Encode the training and test sets with that vocabulary
sequences = tokenizer.texts_to_sequences(train_text)
x_train = pad_sequences(sequences, maxlen=max_length, truncating=trunc_type)
test_sequences = tokenizer.texts_to_sequences(test_text)
x_test = pad_sequences(test_sequences, maxlen=max_length)

# Decoding helper
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
def decode_review(text):
    # ' '.join separates the words with spaces; .get(i, '?') substitutes '?'
    # for any index that cannot be matched
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

# ---------------- 5. Build the model ------------------
import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
# Train the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
num_epochs = 10
model.fit(x_train, train_label, epochs=num_epochs, validation_data=(x_test, test_label))

# ---------------- 6. Export the word vectors for visualization ---------------
weights = model.layers[0].get_weights()[0]   # embedding weights, shape (vocab_size, embedding_dim)
import io
out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('meta.tsv', 'w', encoding='utf-8')
for word_num in range(1, vocab_size):
    word = reverse_word_index[word_num]
    embeddings = weights[word_num]
    out_m.write(word + "\n")
    out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")
out_v.close()
out_m.close()
  • Output:

    Two files are generated, vecs.tsv and meta.tsv. Loading them into the word-vector visualization page at https://projector.tensorflow.org/ gives the following result:
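    meta.tsv holds one word per line and vecs.tsv the matching tab-separated 16-dimensional vector (one line per word); in the projector, load vecs.tsv as the tensor file and meta.tsv as its metadata.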

week 3.1 Multi-layer LSTM

An LSTM (long short-term memory) network processes the text while keeping track of both short- and long-range context.
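When LSTM layers are stacked, every layer except the last needs return_sequences=True, so that it hands the full sequence of outputs (rather than only the final state) to the layer above it.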

#------------- 1. Prepare the data ----------------------
from keras.datasets import imdb
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=100)
from keras.preprocessing.sequence import pad_sequences
trainpad = pad_sequences(x_train, maxlen=120, truncating="post")
testpad = pad_sequences(x_test, maxlen=120, truncating="post")
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)

#--------------- 2. Build the model ---------------------
import keras
model = keras.Sequential([
    keras.layers.Embedding(10000, 64),
    keras.layers.Bidirectional(keras.layers.LSTM(64, return_sequences=True)),
    keras.layers.Bidirectional(keras.layers.LSTM(32)),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

#-------------- 3. Train the model -----------------------
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
NUM_EPOCHS = 10
history = model.fit(trainpad, y_train, epochs=NUM_EPOCHS, validation_data=(testpad, y_test))

#------------ 4. Plot the accuracy ---------------------------
import matplotlib.pyplot as plt
%matplotlib inline
def plot_graphs(history, string):
    plt.plot(history.history[string])
    plt.plot(history.history['val_' + string])
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.legend([string, 'val_' + string])
    plt.show()

plot_graphs(history, 'acc')
plot_graphs(history, 'loss')
  • Output

week 3.2 Multiple Layer GRU

#--------------------- 1. Prepare the data ---------------------
import keras
from keras.datasets import imdb
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=100)
Maxlen = 120
from keras.preprocessing.sequence import pad_sequences
import numpy as np
trainpad = pad_sequences(x_train, maxlen=Maxlen, truncating="post")
testpad = pad_sequences(x_test, maxlen=Maxlen, truncating="post")
train_label = np.array(y_train)
test_label = np.array(y_test)

#--------------------- 2. Build the model ------------------
model = keras.Sequential([
    keras.layers.Embedding(10000, 64),
    keras.layers.Conv1D(128, 5, activation='relu'),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])
model.summary()

#----------------- 3. Train the model ------------------
model.compile(loss=keras.losses.binary_crossentropy, optimizer='adam', metrics=['acc'])
NUM_EPOCHS = 20
history = model.fit(trainpad, train_label, epochs=NUM_EPOCHS, validation_data=(testpad, test_label))

#---------------- 4. Visualize ----------------------
import matplotlib.pyplot as plt
def plot_graphs(history, string):
    plt.plot(history.history[string])
    plt.plot(history.history['val_' + string])
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.legend([string, 'val_' + string])
    plt.show()

plot_graphs(history, 'acc')
plot_graphs(history, 'loss')
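Despite the heading, the model above is actually a Conv1D/pooling model. For reference, a minimal stacked-GRU version in the same spirit — my own sketch, not the course notebook — would swap the convolution for bidirectional GRU layers:

model = keras.Sequential([
    keras.layers.Embedding(10000, 64),
    # all but the last recurrent layer must return full sequences
    keras.layers.Bidirectional(keras.layers.GRU(64, return_sequences=True)),
    keras.layers.Bidirectional(keras.layers.GRU(32)),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])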


week 4.1 Text generation

By learning from a body of text, the network writes new text in the same style. In essence this is still a prediction problem: given the words so far, predict the next word.
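For example (token values illustrative), if a lyric line encodes to [4, 2, 66, 8, 67], the loop below generates the n-gram prefixes [4, 2], [4, 2, 66], [4, 2, 66, 8] and [4, 2, 66, 8, 67]; after padding and splitting, each prefix minus its last token is an input and that last token is the label, e.g. ([4], 2) and ([4, 2], 66).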

import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.optimizers import Adam
import numpy as np

#---------------- 1. Tokenize the data -----------------------
tokenizer = Tokenizer()
data = "In the town of Athy one Jeremy Lanigan \n Battered away til he hadnt a pound. \nHis father died and made him a man again \n Left him a farm and ten acres of ground. \nHe gave a grand party for friends and relations \nWho didnt forget him when come to the wall, \nAnd if youll but listen Ill make your eyes glisten \nOf the rows and the ructions of Lanigans Ball. \nMyself to be sure got free invitation, \nFor all the nice girls and boys I might ask, \nAnd just in a minute both friends and relations \nWere dancing round merry as bees round a cask. \nJudy ODaly, that nice little milliner, \nShe tipped me a wink for to give her a call, \nAnd I soon arrived with Peggy McGilligan \nJust in time for Lanigans Ball. \nThere were lashings of punch and wine for the ladies, \nPotatoes and cakes; there was bacon and tea, \nThere were the Nolans, Dolans, OGradys \nCourting the girls and dancing away. \nSongs they went round as plenty as water, \nThe harp that once sounded in Taras old hall,\nSweet Nelly Gray and The Rat Catchers Daughter,\nAll singing together at Lanigans Ball. \nThey were doing all kinds of nonsensical polkas \nAll round the room in a whirligig. \nJulia and I, we banished their nonsense \nAnd tipped them the twist of a reel and a jig. \nAch mavrone, how the girls got all mad at me \nDanced til youd think the ceiling would fall. \nFor I spent three weeks at Brooks Academy \nLearning new steps for Lanigans Ball. \nThree long weeks I spent up in Dublin, \nThree long weeks to learn nothing at all,\n Three long weeks I spent up in Dublin, \nLearning new steps for Lanigans Ball. \nShe stepped out and I stepped in again, \nI stepped out and she stepped in again, \nShe stepped out and I stepped in again, \nLearning new steps for Lanigans Ball. \nBoys were all merry and the girls they were hearty \nAnd danced all around in couples and groups, \nTil an accident happened, young Terrance McCarthy \nPut his right leg through miss Finnertys hoops. \nPoor creature fainted and cried Meelia murther, \nCalled for her brothers and gathered them all. \nCarmody swore that hed go no further \nTil he had satisfaction at Lanigans Ball. \nIn the midst of the row miss Kerrigan fainted, \nHer cheeks at the same time as red as a rose. \nSome of the lads declared she was painted, \nShe took a small drop too much, I suppose. \nHer sweetheart, Ned Morgan, so powerful and able, \nWhen he saw his fair colleen stretched out by the wall, \nTore the left leg from under the table \nAnd smashed all the Chaneys at Lanigans Ball. \nBoys, oh boys, twas then there were runctions. \nMyself got a lick from big Phelim McHugh. \nI soon replied to his introduction \nAnd kicked up a terrible hullabaloo. \nOld Casey, the piper, was near being strangled. \nThey squeezed up his pipes, bellows, chanters and all. \nThe girls, in their ribbons, they got all entangled \nAnd that put an end to Lanigans Ball."
corpus = data.lower().split("\n")   # returns a list of lines
tokenizer.fit_on_texts(corpus)      # build the vocabulary
total_words = len(tokenizer.word_index) + 1

# Encode the text as n-gram sequences:
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

# padding
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

# Split into predictors and label
xs, labels = input_sequences[:, :-1], input_sequences[:, -1]
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)   # labels as one-hot vectors

# Build the model
model = tf.keras.Sequential()
model.add(Embedding(total_words, 64, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(20)))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['acc'])
history = model.fit(xs, ys, epochs=500, verbose=1)

# ----------- Visualize ------------
import matplotlib.pyplot as plt
%matplotlib inline
def plot_graphs(history, string):
    plt.plot(history.history[string])
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.show()

plot_graphs(history, 'acc')

# ------------- Predict ----------------
seed_text = "Laurence went to dublin"
next_words = 100
for _ in range(next_words):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    predicted = model.predict_classes(token_list, verbose=0)
    output_word = ""
    for word, index in tokenizer.word_index.items():
        if index == predicted:
            output_word = word
            break
    seed_text += " " + output_word
print(seed_text)
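model.predict_classes exists on Sequential models in older Keras/TensorFlow releases but was removed in later TensorFlow versions; a drop-in replacement (a sketch, assuming TF 2.x) is:

probs = model.predict(token_list, verbose=0)   # shape (1, total_words)
predicted = np.argmax(probs, axis=-1)[0]       # index of the most probable next word

The same substitution applies to the generation loop in week 4.2 below.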



week 4.2 Generating text from a lyrics file

Learn from a text corpus, then automatically generate an article from a seed text.

#------------------ 1. Imports -----------------
import tensorflow as tf
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Bidirectional
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.optimizers import Adam
import numpy as np

#----------------- 2. Build the vocabulary --------------------
tokenizer = Tokenizer()
data = open('tmp/irish-lyrics-eof.txt').read()
corpus = data.lower().split("\n")
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1

#------------- 3. Build the training data ------------------------
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

# pad sequences
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

# create predictors and label
xs, labels = input_sequences[:, :-1], input_sequences[:, -1]
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)

#----------------- 4. Build and train the model ----------------
model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(150)))
model.add(Dense(total_words, activation='softmax'))
adam = Adam(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
#earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=5, verbose=0, mode='auto')
history = model.fit(xs, ys, epochs=100, verbose=1)
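# Note: the commented-out earlystop line above would also need an import.
# A sketch of actually wiring early stopping into the fit call — monitoring
# the training loss here, since this fit call has no validation data:
from keras.callbacks import EarlyStopping
earlystop = EarlyStopping(monitor='loss', min_delta=0, patience=5, verbose=0, mode='auto')
# history = model.fit(xs, ys, epochs=100, verbose=1, callbacks=[earlystop])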
#-------------- 5. Visualize -----------------
import matplotlib.pyplot as plt
%matplotlib inline
def plot_graphs(history, string):
    plt.plot(history.history[string])
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.show()

plot_graphs(history, 'accuracy')

#------------ 6. Generate text -------------------
seed_text = "I've got a bad feeling about this"
next_words = 100
for _ in range(next_words):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    predicted = model.predict_classes(token_list, verbose=0)
    output_word = ""
    for word, index in tokenizer.word_index.items():
        if index == predicted:
            output_word = word
            break
    seed_text += " " + output_word
print(seed_text)
  • Output

