Summarizing Text with Amazon Reviews

Dataset: ~500,000 Amazon reviews (568,454 rows in Reviews.csv)
This section covers:
• Data preprocessing
• Building the Seq2Seq model
• Training the network
• Testing the results
seq2seq tutorial: https://github.com/j-min/tf_tutorial_plus/tree/master/RNN_seq2seq/contrib_seq2seq

import pandas as pd
import numpy as np
import tensorflow as tf
import re
from nltk.corpus import stopwords
import time
from tensorflow.python.layers.core import Dense
from tensorflow.python.ops.rnn_cell_impl import _zero_state_tensors
print('TensorFlow Version: {}'.format(tf.__version__))

TensorFlow Version: 1.1.0
Inspecting the Data

reviews = pd.read_csv("Reviews.csv")
reviews.shape
(568454, 10)
reviews.head()

# Check for any null values
reviews.isnull().sum()

# Remove null values and unneeded features
reviews = reviews.dropna()
reviews = reviews.drop(['Id','ProductId','UserId','ProfileName','HelpfulnessNumerator','HelpfulnessDenominator','Score','Time'], 1)
reviews = reviews.reset_index(drop=True)
reviews.head()

# Inspecting some of the reviews
for i in range(5):print("Review #",i+1)print(reviews.Summary[i])print(reviews.Text[i])print()
Review # 1
Good Quality Dog Food
I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.

Review # 2
Not as Advertised
Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".

Review # 3
"Delight" says it all
This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.

Review # 4
Cough Medicine
If you are looking for the secret ingredient in Robitussin I believe I have found it.  I got this in addition to the Root Beer Extract I ordered (which was good) and made some cherry soda.  The flavor is very medicinal.

Review # 5
Great taffy
Great taffy at a great price.  There was a wide assortment of yummy taffy.  Delivery was very quick.  If your a taffy lover, this is a deal.

Data Preprocessing
• Convert everything to lowercase
• Expand contractions
• Remove stopwords (only from the review texts, not the summaries)

contractions = {
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he's": "he is",
"how'd": "how did",
"how'll": "how will",
"how's": "how is",
"i'd": "i would",
"i'll": "i will",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'll": "it will",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"must've": "must have",
"mustn't": "must not",
"needn't": "need not",
"oughtn't": "ought not",
"shan't": "shall not",
"sha'n't": "shall not",
"she'd": "she would",
"she'll": "she will",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"that'd": "that would",
"that's": "that is",
"there'd": "there had",
"there's": "there is",
"they'd": "they would",
"they'll": "they will",
"they're": "they are",
"they've": "they have",
"wasn't": "was not",
"we'd": "we would",
"we'll": "we will",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"where'd": "where did",
"where's": "where is",
"who'll": "who will",
"who's": "who is",
"won't": "will not",
"wouldn't": "would not",
"you'd": "you would",
"you'll": "you will",
"you're": "you are"
}
def clean_text(text, remove_stopwords=True):
    '''Remove unwanted characters, stopwords, and format the text to create fewer nulls word embeddings'''

    # Convert words to lower case
    text = text.lower()

    # Replace contractions with their longer forms
    text = text.split()
    new_text = []
    for word in text:
        if word in contractions:
            new_text.append(contractions[word])
        else:
            new_text.append(word)
    text = " ".join(new_text)

    # Format words and remove unwanted characters
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    text = re.sub(r'\<a href', ' ', text)
    text = re.sub(r'&amp;', '', text)
    text = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', text)
    text = re.sub(r'<br />', ' ', text)
    text = re.sub(r'\'', ' ', text)

    # Optionally, remove stop words
    if remove_stopwords:
        text = text.split()
        stops = set(stopwords.words("english"))
        text = [w for w in text if not w in stops]
        text = " ".join(text)

    return text

We will remove the stopwords from the texts because they do not provide much use for training our model. However, we will keep them for our summaries so that they sound more like natural phrases.
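As a quick sanity check of the cleaning function above, the two modes can be compared on a made-up sentence (the sample string below is illustrative and not from the dataset):

# Quick illustration on a made-up sentence (not from the dataset):
sample = "I haven't tried a better coffee, it's great!"
print(clean_text(sample, remove_stopwords=False))  # contractions expanded, punctuation stripped
print(clean_text(sample))                          # additionally drops stopwords such as "i", "a", "it"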

# Clean the summaries and texts
clean_summaries = []
for summary in reviews.Summary:
    clean_summaries.append(clean_text(summary, remove_stopwords=False))
print("Summaries are complete.")

clean_texts = []
for text in reviews.Text:
    clean_texts.append(clean_text(text))
print("Texts are complete.")

Summaries are complete.
Texts are complete.

# Inspect the cleaned summaries and texts to ensure they have been cleaned well
for i in range(5):print("Clean Review #",i+1)print(clean_summaries[i])print(clean_texts[i])print()
Clean Review # 1
good quality dog food
bought several vitality canned dog food products found good quality product looks like stew processed meat smells better labrador finicky appreciates product better

Clean Review # 2
not as advertised
product arrived labeled jumbo salted peanuts peanuts actually small sized unsalted sure error vendor intended represent product jumbo

Clean Review # 3
delight  says it all
confection around centuries light pillowy citrus gelatin nuts case filberts cut tiny squares liberally coated powdered sugar tiny mouthful heaven chewy flavorful highly recommend yummy treat familiar story c lewis lion witch wardrobe treat seduces edmund selling brother sisters witch

Clean Review # 4
cough medicine
looking secret ingredient robitussin believe found got addition root beer extract ordered good made cherry soda flavor medicinal

Clean Review # 5
great taffy
great taffy great price wide assortment yummy taffy delivery quick taffy lover deal
def count_words(count_dict, text):
    '''Count the number of occurrences of each word in a set of text'''
    for sentence in text:
        for word in sentence.split():
            if word not in count_dict:
                count_dict[word] = 1
            else:
                count_dict[word] += 1

# Find the number of times each word was used and the size of the vocabulary
word_counts = {}
count_words(word_counts, clean_summaries)
count_words(word_counts, clean_texts)
print("Size of Vocabulary:", len(word_counts))

Size of Vocabulary: 132884
Using Pre-trained Word Embeddings

# Load Conceptnet Numberbatch's (CN) embeddings, similar to GloVe, but probably better
# (https://github.com/commonsense/conceptnet-numberbatch)
embeddings_index = {}
with open('numberbatch-en-17.04b.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split(' ')
        word = values[0]
        embedding = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = embedding

print('Word embeddings:', len(embeddings_index))

Word embeddings: 484557

# Find the number of words that are missing from CN, and are used more than our threshold.
missing_words = 0
threshold = 20

for word, count in word_counts.items():
    if count > threshold:
        if word not in embeddings_index:
            missing_words += 1

missing_ratio = round(missing_words/len(word_counts), 4)*100

print("Number of words missing from CN:", missing_words)
print("Percent of words that are missing from vocabulary: {}%".format(missing_ratio))

Number of words missing from CN: 3044
Percent of words that are missing from vocabulary: 2.29%
The threshold is set to 20: any word that is not in the pre-trained embeddings but appears more than 20 times will need an embedding vector of its own (created later as a random vector).

# Limit the vocab that we will use to words that appear >= threshold or are in CN

# Dictionary to convert words to integers
vocab_to_int = {}
value = 0
for word, count in word_counts.items():
    if count >= threshold or word in embeddings_index:
        vocab_to_int[word] = value
        value += 1

# Special tokens that will be added to our vocab
codes = ["<UNK>", "<PAD>", "<EOS>", "<GO>"]

# Add codes to vocab
for code in codes:
    vocab_to_int[code] = len(vocab_to_int)

# Dictionary to convert integers to words
int_to_vocab = {}
for word, value in vocab_to_int.items():
    int_to_vocab[value] = word

usage_ratio = round(len(vocab_to_int) / len(word_counts), 4)*100

print("Total number of unique words:", len(word_counts))
print("Number of words we will use:", len(vocab_to_int))
print("Percent of words we will use: {}%".format(usage_ratio))

Total number of unique words: 132884
Number of words we will use: 65469
Percent of words we will use: 49.27%

# Need to use 300 for embedding dimensions to match CN's vectors.
embedding_dim = 300
nb_words = len(vocab_to_int)

# Create matrix with default values of zero
word_embedding_matrix = np.zeros((nb_words, embedding_dim), dtype=np.float32)
for word, i in vocab_to_int.items():
    if word in embeddings_index:
        word_embedding_matrix[i] = embeddings_index[word]
    else:
        # If word not in CN, create a random embedding for it
        new_embedding = np.array(np.random.uniform(-1.0, 1.0, embedding_dim))
        embeddings_index[word] = new_embedding
        word_embedding_matrix[i] = new_embedding

# Check if value matches len(vocab_to_int)
print(len(word_embedding_matrix))

65469

def convert_to_ints(text, word_count, unk_count, eos=False):
    '''Convert words in text to an integer.
       If word is not in vocab_to_int, use UNK's integer.
       Total the number of words and UNKs.
       Add EOS token to the end of texts.'''
    ints = []
    for sentence in text:
        sentence_ints = []
        for word in sentence.split():
            word_count += 1
            if word in vocab_to_int:
                sentence_ints.append(vocab_to_int[word])
            else:
                sentence_ints.append(vocab_to_int["<UNK>"])
                unk_count += 1
        if eos:
            sentence_ints.append(vocab_to_int["<EOS>"])
        ints.append(sentence_ints)
    return ints, word_count, unk_count
# Apply convert_to_ints to clean_summaries and clean_texts
word_count = 0
unk_count = 0

int_summaries, word_count, unk_count = convert_to_ints(clean_summaries, word_count, unk_count)
int_texts, word_count, unk_count = convert_to_ints(clean_texts, word_count, unk_count, eos=True)

unk_percent = round(unk_count/word_count, 4)*100

print("Total number of words in headlines:", word_count)
print("Total number of UNKs in headlines:", unk_count)
print("Percent of words that are UNK: {}%".format(unk_percent))

Total number of words in headlines: 25679946
Total number of UNKs in headlines: 170450
Percent of words that are UNK: 0.66%

def create_lengths(text):
    '''Create a data frame of the sentence lengths from a text'''
    lengths = []
    for sentence in text:
        lengths.append(len(sentence))
    return pd.DataFrame(lengths, columns=['counts'])

lengths_summaries = create_lengths(int_summaries)
lengths_texts = create_lengths(int_texts)

print("Summaries:")
print(lengths_summaries.describe())
print()
print("Texts:")
print(lengths_texts.describe())

Summaries:
counts
count 568412.000000
mean 4.181620
std 2.657872
min 0.000000
25% 2.000000
50% 4.000000
75% 5.000000
max 48.000000

Texts:
counts
count 568412.000000
mean 41.996782
std 42.520854
min 1.000000
25% 18.000000
50% 29.000000
75% 50.000000
max 2085.000000

# Inspect the length of texts
print(np.percentile(lengths_texts.counts, 90))
print(np.percentile(lengths_texts.counts, 95))
print(np.percentile(lengths_texts.counts, 99))

84.0
115.0
207.0

# Inspect the length of summaries
print(np.percentile(lengths_summaries.counts, 90))
print(np.percentile(lengths_summaries.counts, 95))
print(np.percentile(lengths_summaries.counts, 99))

8.0
9.0
13.0
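These percentiles motivate the hard-coded cutoffs used in the sorting step below (84 tokens for texts, 13 for summaries). A minimal sketch of deriving them programmatically instead (the variable names here are illustrative and not part of the original code):

# Illustrative alternative to hard-coding the cutoffs:
# use the 90th percentile of text lengths and the 99th percentile
# of summary lengths as the maximum allowed lengths.
suggested_max_text_length = int(np.percentile(lengths_texts.counts, 90))         # ~84
suggested_max_summary_length = int(np.percentile(lengths_summaries.counts, 99))  # ~13
print(suggested_max_text_length, suggested_max_summary_length)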

def unk_counter(sentence):
    '''Counts the number of time UNK appears in a sentence.'''
    unk_count = 0
    for word in sentence:
        if word == vocab_to_int["<UNK>"]:
            unk_count += 1
    return unk_count
# Sort the summaries and texts by the length of the texts, shortest to longest
# Limit the length of summaries and texts based on the min and max ranges.
# Remove reviews that include too many UNKs

sorted_summaries = []
sorted_texts = []
max_text_length = 84
max_summary_length = 13
min_length = 2
unk_text_limit = 1
unk_summary_limit = 0

for length in range(min(lengths_texts.counts), max_text_length):
    for count, words in enumerate(int_summaries):
        if (len(int_summaries[count]) >= min_length and
            len(int_summaries[count]) <= max_summary_length and
            len(int_texts[count]) >= min_length and
            unk_counter(int_summaries[count]) <= unk_summary_limit and
            unk_counter(int_texts[count]) <= unk_text_limit and
            length == len(int_texts[count])):
            sorted_summaries.append(int_summaries[count])
            sorted_texts.append(int_texts[count])

# Compare lengths to ensure they match
print(len(sorted_summaries))
print(len(sorted_texts))

429210
429210
Building the Model

def model_inputs():
    '''Create placeholders for inputs to the model'''
    input_data = tf.placeholder(tf.int32, [None, None], name='input')
    targets = tf.placeholder(tf.int32, [None, None], name='targets')
    lr = tf.placeholder(tf.float32, name='learning_rate')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
    summary_length = tf.placeholder(tf.int32, (None,), name='summary_length')
    max_summary_length = tf.reduce_max(summary_length, name='max_dec_len')
    text_length = tf.placeholder(tf.int32, (None,), name='text_length')

    return input_data, targets, lr, keep_prob, summary_length, max_summary_length, text_length
def process_encoding_input(target_data, vocab_to_int, batch_size):
    '''Remove the last word id from each batch and concat the <GO> to the beginning of each batch'''
    ending = tf.strided_slice(target_data, [0, 0], [batch_size, -1], [1, 1])
    dec_input = tf.concat([tf.fill([batch_size, 1], vocab_to_int['<GO>']), ending], 1)

    return dec_input
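To see what this shifting does, here is a hypothetical sanity check on a toy batch (the token ids and the <GO> id of 3 are made up for illustration; it assumes the TF 1.x imports at the top of this post):

# Toy check of process_encoding_input (illustrative only; ids are made up).
toy_targets = tf.constant([[8, 6, 2],
                           [5, 9, 2]], dtype=tf.int32)
toy_dec_input = process_encoding_input(toy_targets, {'<GO>': 3}, batch_size=2)
with tf.Session() as sess:
    print(sess.run(toy_dec_input))  # [[3 8 6]
                                    #  [3 5 9]]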
def encoding_layer(rnn_size, sequence_length, num_layers, rnn_inputs, keep_prob):
    '''Create the encoding layer'''
    for layer in range(num_layers):
        with tf.variable_scope('encoder_{}'.format(layer)):
            cell_fw = tf.contrib.rnn.LSTMCell(rnn_size,
                                              initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            cell_fw = tf.contrib.rnn.DropoutWrapper(cell_fw, input_keep_prob=keep_prob)

            cell_bw = tf.contrib.rnn.LSTMCell(rnn_size,
                                              initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            cell_bw = tf.contrib.rnn.DropoutWrapper(cell_bw, input_keep_prob=keep_prob)

            enc_output, enc_state = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw,
                                                                    rnn_inputs,
                                                                    sequence_length,
                                                                    dtype=tf.float32)
    # Join outputs since we are using a bidirectional RNN
    enc_output = tf.concat(enc_output, 2)

    return enc_output, enc_state
def training_decoding_layer(dec_embed_input, summary_length, dec_cell, initial_state, output_layer,
                            vocab_size, max_summary_length):
    '''Create the training logits'''
    training_helper = tf.contrib.seq2seq.TrainingHelper(inputs=dec_embed_input,
                                                        sequence_length=summary_length,
                                                        time_major=False)

    training_decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell,
                                                       training_helper,
                                                       initial_state,
                                                       output_layer)

    training_logits, _ = tf.contrib.seq2seq.dynamic_decode(training_decoder,
                                                           output_time_major=False,
                                                           impute_finished=True,
                                                           maximum_iterations=max_summary_length)
    return training_logits
def inference_decoding_layer(embeddings, start_token, end_token, dec_cell, initial_state, output_layer,
                             max_summary_length, batch_size):
    '''Create the inference logits'''
    start_tokens = tf.tile(tf.constant([start_token], dtype=tf.int32), [batch_size], name='start_tokens')

    inference_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(embeddings,
                                                                start_tokens,
                                                                end_token)

    inference_decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell,
                                                        inference_helper,
                                                        initial_state,
                                                        output_layer)

    inference_logits, _ = tf.contrib.seq2seq.dynamic_decode(inference_decoder,
                                                            output_time_major=False,
                                                            impute_finished=True,
                                                            maximum_iterations=max_summary_length)
    return inference_logits
def decoding_layer(dec_embed_input, embeddings, enc_output, enc_state, vocab_size, text_length,
                   summary_length, max_summary_length, rnn_size, vocab_to_int, keep_prob, batch_size, num_layers):
    '''Create the decoding cell and attention for the training and inference decoding layers'''
    for layer in range(num_layers):
        with tf.variable_scope('decoder_{}'.format(layer)):
            lstm = tf.contrib.rnn.LSTMCell(rnn_size,
                                           initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            dec_cell = tf.contrib.rnn.DropoutWrapper(lstm, input_keep_prob=keep_prob)

    output_layer = Dense(vocab_size,
                         kernel_initializer=tf.truncated_normal_initializer(mean=0.0, stddev=0.1))

    attn_mech = tf.contrib.seq2seq.BahdanauAttention(rnn_size,
                                                     enc_output,
                                                     text_length,
                                                     normalize=False,
                                                     name='BahdanauAttention')

    dec_cell = tf.contrib.seq2seq.DynamicAttentionWrapper(dec_cell,
                                                          attn_mech,
                                                          rnn_size)

    initial_state = tf.contrib.seq2seq.DynamicAttentionWrapperState(enc_state[0],
                                                                    _zero_state_tensors(rnn_size,
                                                                                        batch_size,
                                                                                        tf.float32))

    with tf.variable_scope("decode"):
        training_logits = training_decoding_layer(dec_embed_input,
                                                  summary_length,
                                                  dec_cell,
                                                  initial_state,
                                                  output_layer,
                                                  vocab_size,
                                                  max_summary_length)

    with tf.variable_scope("decode", reuse=True):
        inference_logits = inference_decoding_layer(embeddings,
                                                    vocab_to_int['<GO>'],
                                                    vocab_to_int['<EOS>'],
                                                    dec_cell,
                                                    initial_state,
                                                    output_layer,
                                                    max_summary_length,
                                                    batch_size)

    return training_logits, inference_logits
def seq2seq_model(input_data, target_data, keep_prob, text_length, summary_length, max_summary_length,
                  vocab_size, rnn_size, num_layers, vocab_to_int, batch_size):
    '''Use the previous functions to create the training and inference logits'''

    # Use Numberbatch's embeddings and the newly created ones as our embeddings
    embeddings = word_embedding_matrix

    enc_embed_input = tf.nn.embedding_lookup(embeddings, input_data)
    enc_output, enc_state = encoding_layer(rnn_size, text_length, num_layers, enc_embed_input, keep_prob)

    dec_input = process_encoding_input(target_data, vocab_to_int, batch_size)
    dec_embed_input = tf.nn.embedding_lookup(embeddings, dec_input)

    training_logits, inference_logits = decoding_layer(dec_embed_input,
                                                       embeddings,
                                                       enc_output,
                                                       enc_state,
                                                       vocab_size,
                                                       text_length,
                                                       summary_length,
                                                       max_summary_length,
                                                       rnn_size,
                                                       vocab_to_int,
                                                       keep_prob,
                                                       batch_size,
                                                       num_layers)

    return training_logits, inference_logits
def pad_sentence_batch(sentence_batch):
    """Pad sentences with <PAD> so that each sentence of a batch has the same length"""
    max_sentence = max([len(sentence) for sentence in sentence_batch])
    return [sentence + [vocab_to_int['<PAD>']] * (max_sentence - len(sentence)) for sentence in sentence_batch]
def get_batches(summaries, texts, batch_size):
    """Batch summaries, texts, and the lengths of their sentences together"""
    for batch_i in range(0, len(texts)//batch_size):
        start_i = batch_i * batch_size
        summaries_batch = summaries[start_i:start_i + batch_size]
        texts_batch = texts[start_i:start_i + batch_size]
        pad_summaries_batch = np.array(pad_sentence_batch(summaries_batch))
        pad_texts_batch = np.array(pad_sentence_batch(texts_batch))

        # Need the lengths for the _lengths parameters
        pad_summaries_lengths = []
        for summary in pad_summaries_batch:
            pad_summaries_lengths.append(len(summary))

        pad_texts_lengths = []
        for text in pad_texts_batch:
            pad_texts_lengths.append(len(text))

        yield pad_summaries_batch, pad_texts_batch, pad_summaries_lengths, pad_texts_lengths
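Before training, a quick way to sanity-check the batching helpers (assuming sorted_summaries and sorted_texts from the sorting step above) is to pull a single batch and inspect its shapes; this snippet is illustrative only:

# Illustrative check of get_batches: grab one batch and inspect it.
summaries_batch, texts_batch, summaries_lengths, texts_lengths = next(
    get_batches(sorted_summaries, sorted_texts, batch_size=64))
print(summaries_batch.shape, texts_batch.shape)  # (64, padded summary len), (64, padded text len)
print(summaries_lengths[:3], texts_lengths[:3])  # per-sentence lengths after padding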
# Set the Hyperparameters
epochs = 100
batch_size = 64
rnn_size = 256
num_layers = 2
learning_rate = 0.005
keep_probability = 0.75
# Build the graph
train_graph = tf.Graph()
# Set the graph to default to ensure that it is ready for training
with train_graph.as_default():

    # Load the model inputs
    input_data, targets, lr, keep_prob, summary_length, max_summary_length, text_length = model_inputs()

    # Create the training and inference logits
    training_logits, inference_logits = seq2seq_model(tf.reverse(input_data, [-1]),
                                                      targets,
                                                      keep_prob,
                                                      text_length,
                                                      summary_length,
                                                      max_summary_length,
                                                      len(vocab_to_int)+1,
                                                      rnn_size,
                                                      num_layers,
                                                      vocab_to_int,
                                                      batch_size)

    # Create tensors for the training logits and inference logits
    training_logits = tf.identity(training_logits.rnn_output, 'logits')
    inference_logits = tf.identity(inference_logits.sample_id, name='predictions')

    # Create the weights for sequence_loss
    masks = tf.sequence_mask(summary_length, max_summary_length, dtype=tf.float32, name='masks')

    with tf.name_scope("optimization"):
        # Loss function
        cost = tf.contrib.seq2seq.sequence_loss(training_logits,
                                                targets,
                                                masks)

        # Optimizer
        optimizer = tf.train.AdamOptimizer(learning_rate)

        # Gradient Clipping
        gradients = optimizer.compute_gradients(cost)
        capped_gradients = [(tf.clip_by_value(grad, -5., 5.), var) for grad, var in gradients if grad is not None]
        train_op = optimizer.apply_gradients(capped_gradients)
print("Graph is built.")

Training the Network

# Subset the data for training
start = 200000
end = start + 50000
sorted_summaries_short = sorted_summaries[start:end]
sorted_texts_short = sorted_texts[start:end]
print("The shortest text length:", len(sorted_texts_short[0]))
print("The longest text length:",len(sorted_texts_short[-1]))
The shortest text length: 25
The longest text length: 31
# Train the Model
learning_rate_decay = 0.95
min_learning_rate = 0.0005
display_step = 20 # Check training loss after every 20 batches
stop_early = 0
stop = 3 # If the update loss does not decrease in 3 consecutive update checks, stop training
per_epoch = 3 # Make 3 update checks per epoch
update_check = (len(sorted_texts_short)//batch_size//per_epoch)-1

update_loss = 0
batch_loss = 0
summary_update_loss = [] # Record the update losses for saving improvements in the model

checkpoint = "best_model.ckpt"

with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())

    # If we want to continue training a previous session
    #loader = tf.train.import_meta_graph("./" + checkpoint + '.meta')
    #loader.restore(sess, checkpoint)

    for epoch_i in range(1, epochs+1):
        update_loss = 0
        batch_loss = 0
        for batch_i, (summaries_batch, texts_batch, summaries_lengths, texts_lengths) in enumerate(
                get_batches(sorted_summaries_short, sorted_texts_short, batch_size)):
            start_time = time.time()
            _, loss = sess.run(
                [train_op, cost],
                {input_data: texts_batch,
                 targets: summaries_batch,
                 lr: learning_rate,
                 summary_length: summaries_lengths,
                 text_length: texts_lengths,
                 keep_prob: keep_probability})

            batch_loss += loss
            update_loss += loss
            end_time = time.time()
            batch_time = end_time - start_time

            if batch_i % display_step == 0 and batch_i > 0:
                print('Epoch {:>3}/{} Batch {:>4}/{} - Loss: {:>6.3f}, Seconds: {:>4.2f}'
                      .format(epoch_i,
                              epochs,
                              batch_i,
                              len(sorted_texts_short) // batch_size,
                              batch_loss / display_step,
                              batch_time*display_step))
                batch_loss = 0

            if batch_i % update_check == 0 and batch_i > 0:
                print("Average loss for this update:", round(update_loss/update_check, 3))
                summary_update_loss.append(update_loss)

                # If the update loss is at a new minimum, save the model
                if update_loss <= min(summary_update_loss):
                    print('New Record!')
                    stop_early = 0
                    saver = tf.train.Saver()
                    saver.save(sess, checkpoint)
                else:
                    print("No Improvement.")
                    stop_early += 1
                    if stop_early == stop:
                        break
                update_loss = 0

        # Reduce learning rate, but not below its minimum value
        learning_rate *= learning_rate_decay
        if learning_rate < min_learning_rate:
            learning_rate = min_learning_rate

        if stop_early == stop:
            print("Stopping Training.")
            break
Epoch   1/100 Batch   20/781 - Loss:  4.470, Seconds: 156.00
Epoch   1/100 Batch   40/781 - Loss:  2.863, Seconds: 105.20
Epoch   1/100 Batch   60/781 - Loss:  2.652, Seconds: 151.58
Epoch   1/100 Batch   80/781 - Loss:  2.736, Seconds: 117.19
Epoch   1/100 Batch  100/781 - Loss:  2.686, Seconds: 118.42
Epoch   1/100 Batch  120/781 - Loss:  2.423, Seconds: 140.21
Epoch   1/100 Batch  140/781 - Loss:  2.696, Seconds: 152.89
Epoch   1/100 Batch  160/781 - Loss:  2.606, Seconds: 128.19
Epoch   1/100 Batch  180/781 - Loss:  2.525, Seconds: 151.52
Epoch   1/100 Batch  200/781 - Loss:  2.597, Seconds: 140.84
Epoch   1/100 Batch  220/781 - Loss:  2.515, Seconds: 130.87
Epoch   1/100 Batch  240/781 - Loss:  2.402, Seconds: 131.02
Average loss for this update: 2.734
New Record!
Epoch   1/100 Batch  260/781 - Loss:  2.382, Seconds: 106.18
Epoch   1/100 Batch  280/781 - Loss:  2.354, Seconds: 124.90
Epoch   1/100 Batch  300/781 - Loss:  2.306, Seconds: 148.73
Epoch   1/100 Batch  320/781 - Loss:  2.637, Seconds: 142.09
Epoch   1/100 Batch  340/781 - Loss:  2.680, Seconds: 140.91
Epoch   1/100 Batch  360/781 - Loss:  2.559, Seconds: 96.81
Epoch   1/100 Batch  380/781 - Loss:  2.448, Seconds: 130.00
Epoch   1/100 Batch  400/781 - Loss:  2.615, Seconds: 108.58
Epoch   1/100 Batch  420/781 - Loss:  2.193, Seconds: 124.55
Epoch   1/100 Batch  440/781 - Loss:  2.315, Seconds: 131.78
Epoch   1/100 Batch  460/781 - Loss:  2.276, Seconds: 131.26
Epoch   1/100 Batch  480/781 - Loss:  2.391, Seconds: 88.43
Epoch   1/100 Batch  500/781 - Loss:  2.455, Seconds: 120.87
Average loss for this update: 2.436
...

Testing the Results

def text_to_seq(text):
    '''Prepare the text for the model'''
    text = clean_text(text)
    return [vocab_to_int.get(word, vocab_to_int['<UNK>']) for word in text.split()]
# Create your own review or use one from the dataset
#input_sentence = "I have never eaten an apple before, but this red one was nice. \#I think that I will try a green apple next time."
#text = text_to_seq(input_sentence)
random = np.random.randint(0,len(clean_texts))
input_sentence = clean_texts[random]
text = text_to_seq(clean_texts[random])

checkpoint = "./best_model.ckpt"

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(checkpoint + '.meta')
    loader.restore(sess, checkpoint)

    input_data = loaded_graph.get_tensor_by_name('input:0')
    logits = loaded_graph.get_tensor_by_name('predictions:0')
    text_length = loaded_graph.get_tensor_by_name('text_length:0')
    summary_length = loaded_graph.get_tensor_by_name('summary_length:0')
    keep_prob = loaded_graph.get_tensor_by_name('keep_prob:0')

    # Multiply by batch_size to match the model's input parameters
    answer_logits = sess.run(logits, {input_data: [text]*batch_size,
                                      summary_length: [np.random.randint(5,8)],
                                      text_length: [len(text)]*batch_size,
                                      keep_prob: 1.0})[0]

# Remove the padding from the summary
pad = vocab_to_int["<PAD>"]

print('Original Text:', input_sentence)

print('\nText')
print('  Word Ids:    {}'.format([i for i in text]))
print('  Input Words: {}'.format(" ".join([int_to_vocab[i] for i in text])))

print('\nSummary')
print('  Word Ids:       {}'.format([i for i in answer_logits if i != pad]))
print('  Response Words: {}'.format(" ".join([int_to_vocab[i] for i in answer_logits if i != pad])))
INFO:tensorflow:Restoring parameters from ./best_model.ckpt
Original Text: love individual oatmeal cups found years ago sam quit selling sound big lots quit selling found target expensive buy individually trilled get entire case time go anywhere need water microwave spoon know quaker flavor packets

Text
  Word Ids:    [70595, 18808, 668, 45565, 51927, 51759, 32488, 13510, 32036, 59599, 11693, 444, 23335, 32036, 59599, 51927, 67316, 726, 24842, 50494, 48492, 1062, 44749, 38443, 42344, 67973, 14168, 7759, 5347, 29528, 58763, 18927, 17701, 20232, 47328]
  Input Words: love individual oatmeal cups found years ago sam quit selling sound big lots quit selling found target expensive buy individually trilled get entire case time go anywhere need water microwave spoon know quaker flavor packets

Summary
  Word Ids:       [70595, 28738]
  Response Words: love it
Examples of reviews and summaries:
•Review(1): The coffee tasted great and was at such a good price! I highly recommend this to everyone!
•Summary(1): great coffee
•Review(2): This is the worst cheese that I have ever bought! I will never buy it again and I hope you won't either!
•Summary(2): omg gross gross
•Review(3): love individual oatmeal cups found years ago sam quit selling sound big lots quit selling found target expensive buy individually trilled get entire case time go anywhere need water microwave spoon know quaker flavor packets
•Summary(3): love it
