
Text generation sits at the intersection of computational linguistics and AI: the task is to produce natural language text automatically. In deep learning, recurrent neural networks (RNNs) have proven to work well with sequential data such as text. In this example, I demonstrate applying LSTMs with pretrained word embeddings to generate Hamilton lyrics. Many of the ideas came from Karpathy¹ and Bansal². All of the code can be found on my GitHub.

Let’s import the required libraries from TensorFlow and Keras:


from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Bidirectional, Dense, Dropout
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
import keras.utils as ku
import numpy as np

Now we provide a path to the word embeddings:


glove_path = 'glove.twitter.27B/glove.twitter.27B.200d.txt'

The lyrics were scraped from the internet and placed in a plain text file:


with open('ham_lyrics.txt', encoding='latin1') as f:
    text = f.read()

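The corpus is lowercased, split into lines, and tokenized. Each line is then expanded into input sequences: every prefix of the line's token list that contains at least two tokens becomes one sequence. The sequences are padded to the maximum sequence length in a later step: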

tokenizer = Tokenizer()
corpus = text.lower().split("\n")
tokenizer.fit_on_texts(corpus)
# Index 0 is reserved for padding, hence the + 1
total_words = len(tokenizer.word_index) + 1
input_seq = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    # Every prefix of length >= 2 becomes one training sequence
    for i in range(1, len(token_list)):
        n_gram_seq = token_list[:i+1]
        input_seq.append(n_gram_seq)
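
To make the prefix expansion concrete, here is a minimal sketch in plain Python. The `build_vocab` and `prefix_sequences` helpers are hypothetical stand-ins for the Keras `Tokenizer` and the loop above, not part of the original code; the toy vocabulary numbers words by first appearance, whereas Keras orders them by frequency.

```python
# Hypothetical stand-in for the Keras Tokenizer: assigns integer ids starting at 1
# (0 is left free for padding). Ids follow first appearance, not word frequency.
def build_vocab(lines):
    vocab = {}
    for line in lines:
        for word in line.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab

def prefix_sequences(lines, vocab):
    """Expand each line into all of its prefixes of length >= 2."""
    sequences = []
    for line in lines:
        token_list = [vocab[w] for w in line.lower().split()]
        for i in range(1, len(token_list)):
            sequences.append(token_list[:i + 1])
    return sequences

toy_corpus = ["I am not throwing away my shot"]
toy_vocab = build_vocab(toy_corpus)
toy_sequences = prefix_sequences(toy_corpus, toy_vocab)
for seq in toy_sequences:
    print(seq)
```

A seven-word line yields six sequences, from `[1, 2]` up to the full line, so each one teaches the model to predict one more word from the words before it.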

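We then separate our input sequences into predictors and labels for our learning algorithm. The sequences are first left-padded with zeros to a common length; the last token of each padded sequence becomes the label, and the tokens before it become the predictors. This is treated as a classification task in which the number of classes equals the number of words the tokenizer recognized, plus one for the reserved index 0: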

max_seq_len = max([len(x) for x in input_seq])
input_seq = np.array(pad_sequences(input_seq, maxlen=max_seq_len, padding='pre'))
# All tokens but the last are the predictors; the last token is the label
predictors, label = input_seq[:,:-1],input_seq[:,-1]
label = ku.to_categorical(label, num_classes=total_words)
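
The padding, splitting, and one-hot steps can be sketched on a toy example without Keras. The helpers `pad_pre` and `one_hot` below are hypothetical and only mimic what `pad_sequences(..., padding='pre')` and `to_categorical` do for this particular case; the numbers are made up.

```python
def pad_pre(sequences, maxlen):
    """Left-pad each sequence with zeros to length maxlen (no truncation needed here)."""
    return [[0] * (maxlen - len(seq)) + list(seq) for seq in sequences]

def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position index."""
    row = [0.0] * num_classes
    row[index] = 1.0
    return row

toy_sequences = [[1, 2], [1, 2, 3], [1, 2, 3, 4]]
toy_total_words = 5  # 4 words plus the reserved padding index 0

toy_max_len = max(len(seq) for seq in toy_sequences)
padded = pad_pre(toy_sequences, toy_max_len)

# Everything except the last token is the predictor; the last token is the label.
toy_predictors = [row[:-1] for row in padded]
toy_labels = [one_hot(row[-1], toy_total_words) for row in padded]

print(toy_predictors)  # [[0, 0, 1], [0, 1, 2], [1, 2, 3]]
print(toy_labels[0])   # [0.0, 0.0, 1.0, 0.0, 0.0]
```

Padding on the left keeps the most recent words next to the label, which is why the predictors always end with the word that immediately precedes the target.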

We need to load the word embedding file so that its vectors can be accessed by our embedding layer. The embedding index, a dictionary mapping each word to its vector, is an intermediate step toward the embedding matrix. GloVe embeddings are used here:

embeddings_index = dict()
# The with block closes the file automatically, so no explicit close() is needed
with open(glove_path, encoding="utf8") as glove:
    for line in glove:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
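
As an illustration of the file layout this loop expects, the sketch below parses two made-up lines in the same format (a word followed by its vector components). The words and values are invented for the example, and plain lists stand in for NumPy arrays.

```python
import io

# Two made-up lines in the GloVe text layout: a word followed by its vector components.
# The article uses 200-dimensional vectors; 3 dimensions are used here for brevity.
fake_glove_file = io.StringIO(
    "shot 0.1 0.2 0.3\n"
    "room -0.5 0.0 0.5\n"
)

toy_index = {}
for line in fake_glove_file:
    values = line.split()
    word = values[0]
    toy_index[word] = [float(v) for v in values[1:]]

print(toy_index["room"])      # [-0.5, 0.0, 0.5]
print(toy_index.get("burr"))  # None, since the word is not in the file
```

Using `.get` rather than indexing matters later: words from the lyrics that GloVe does not contain simply return `None` instead of raising an error.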

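The embedding matrix is what we will actually feed into our network. Row i holds the GloVe vector for the word with tokenizer index i; words without a GloVe vector keep a row of zeros: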


embedding_matrix = np.zeros((total_words, 200))
for word, index in tokenizer.word_index.items():
    if index > total_words - 1:
        break
    else:
        embedding_vector = embeddings_index.get(word)
        # Words missing from GloVe keep their all-zero row
        if embedding_vector is not None:
            embedding_matrix[index] = embedding_vector
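
A quick sanity check worth running at this point is how much of the vocabulary actually received a pretrained vector. The sketch below shows the idea on toy data with plain lists; the names `toy_word_index`, `toy_embeddings`, and `coverage` are assumptions for illustration and do not appear in the original code.

```python
toy_word_index = {"shot": 1, "room": 2, "burr": 3}   # word -> tokenizer index
toy_embeddings = {"shot": [0.1, 0.2, 0.3], "room": [-0.5, 0.0, 0.5]}
dim = 3
total = len(toy_word_index) + 1  # row 0 is reserved for padding

toy_matrix = [[0.0] * dim for _ in range(total)]
found = 0
for word, index in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is not None:
        toy_matrix[index] = list(vector)
        found += 1

coverage = found / len(toy_word_index)
print(toy_matrix[3])        # [0.0, 0.0, 0.0]
print(round(coverage, 2))   # 0.67
```

A low coverage figure would suggest that many lyric-specific words start from zero vectors, which may be a reason to revisit the preprocessing or the choice of embeddings.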

Now that the data and word embeddings are prepared, we can start setting up the layers of our RNN. We start by adding our embedding layer followed by the bidirectional LSTM with 256 units and an LSTM with 128 units:

model = Sequential()
# The embedding layer is initialized with the GloVe matrix built above
model.add(Embedding(total_words, 200, weights=[embedding_matrix],
                    input_length=max_seq_len-1))
model.add(Bidirectional(LSTM(256, dropout=0.2, recurrent_dropout=0.2, return_sequences=True)))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
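
To get a feel for the size of these layers, the following sketch computes their parameter counts by hand using the standard LSTM formula (four gates, each with input weights, recurrent weights, and a bias). The vocabulary size of 3000 is an assumed figure for illustration only; the real value is `total_words` from the tokenizer.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units) * units + units)

vocab_size = 3000   # assumed vocabulary size, for illustration only
embed_dim = 200

layers = [
    ("embedding", embedding_params(vocab_size, embed_dim)),
    # A bidirectional wrapper holds a forward and a backward LSTM.
    ("bidirectional_lstm_256", 2 * lstm_params(embed_dim, 256)),
    # Its concatenated output is 512 wide, which is the input to the next LSTM.
    ("lstm_128", lstm_params(2 * 256, 128)),
]
for name, count in layers:
    print(f"{name}: {count:,}")
```

Under these assumptions the bidirectional layer is the largest of the three, which is consistent with the long runtimes mentioned in the article.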

We follow with a dropout layer, which randomly zeroes a fraction of activations during training to reduce overfitting. Within the LSTM layers, recurrent dropout “drops” the connections between recurrent units, whereas regular dropout “drops” the connections to the layer inputs. A final dense layer with softmax activation closes out the model, producing a probability for every word in the vocabulary. An early-stopping callback halts training if the monitored loss stops improving for several consecutive epochs. As the runtime can be quite long, the number of epochs is set moderately low:

model.add(Dropout(0.2))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit() below receives no validation data, so 'val_loss' would never be available;
# monitor the training loss instead
earlystop = EarlyStopping(monitor='loss', min_delta=0, patience=5, verbose=0, mode='auto')
model.fit(predictors, label, epochs=25, verbose=1, callbacks=[earlystop])
model.save('hamilton_model.h5')
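
The role of `patience` can be illustrated with a simplified sketch of patience-based early stopping. This is not Keras's implementation, only the general idea: training stops once the loss has failed to improve for `patience` epochs in a row. The function name and the loss values are made up.

```python
def early_stop_epoch(losses, patience, min_delta=0.0):
    """Return the 0-based epoch at which training would stop, or None if it never does."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses):
        if loss < best - min_delta:
            # Improvement: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

toy_losses = [1.00, 0.80, 0.70, 0.71, 0.72, 0.73, 0.74]
print(early_stop_epoch(toy_losses, patience=3))  # 5
print(early_stop_epoch(toy_losses, patience=5))  # None
```

With a patience of 5 and only 25 epochs, the callback leaves room for a few noisy epochs before giving up on further improvement.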

Finally, a helper function was added to display the generated text:


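def generate_text(seed_text, next_words, max_seq_len):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_seq_len-1, padding='pre')
        # predict_classes is not available in recent Keras releases;
        # take the argmax of the softmax output instead
        predicted = np.argmax(model.predict(token_list, verbose=0), axis=-1)[0]
        output_word = ""
        # Map the predicted index back to its word
        for word, index in tokenizer.word_index.items():
            if index == predicted:
                output_word = word
                break
        seed_text += " " + output_word
    # Return the text so that the caller can print it
    return seed_text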

The function takes the seed text, the number of words to generate, and the max sequence length as arguments. The seed text is the prompt the model conditions on: at each step, the most probable next word is predicted and appended to the text, and this repeats for as many words as we ask for.
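
The generation loop is a greedy decoder, which can be sketched without a trained network. In the example below, the lookup table `toy_scores` is a hypothetical stand-in for the model and only looks at the last word, whereas the LSTM sees the whole padded sequence; the scores are invented.

```python
# Hypothetical stand-in for the trained network: maps the last word to next-word scores.
toy_scores = {
    "these": {"united": 0.9, "states": 0.1},
    "united": {"states": 0.8, "these": 0.2},
    "states": {"of": 0.6, "united": 0.4},
    "of": {"america": 1.0},
}

def toy_generate(seed_text, next_words):
    words = seed_text.lower().split()
    for _ in range(next_words):
        scores = toy_scores.get(words[-1])
        if not scores:
            break  # no known continuation for this word
        # Greedy choice: take the highest-scoring word, like an argmax over the softmax.
        words.append(max(scores, key=scores.get))
    return " ".join(words)

print(toy_generate("These United States", 2))  # these united states of america
```

Because the choice is always the single most likely word, the same seed always produces the same continuation; sampling from the predicted distribution instead would be one way to get more varied output.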

We run our pipeline and print our results:


print(generate_text("These United States", 3, max_seq_len))

With several lines of text generated, we can expect results such as this:


[Image: sample of generated lyrics]

With more text preprocessing, feature engineering, and more robust modeling, we can expect fewer of the grammar and syntax errors seen above. The LSTMs can be swapped for GRUs for faster runtimes, at the possible expense of accuracy on longer text sequences. Text generation with character embeddings or VAEs could be worth exploring as well. As Aaron Burr would note, the world is wide enough for different modeling approaches.

[1]: Andrej Karpathy. (May 21, 2015). The Unreasonable Effectiveness of Recurrent Neural Networks http://karpathy.github.io/2015/05/21/rnn-effectiveness/

[2]: Shivam Bansal. (March 26, 2018). Language Modelling Text Generation using LSTMs — Deep Learning for NLP https://mc.ai/language-modelling-text-generation-using-lstms-deep-learning-for-nlp/?fbclid=IwAR2mR7QkpnwzCzszwN1mOXUWHBhIGOtfvxGA4AapS52RJZW6wSpKhckI1HY

Translated from: https://towardsdatascience.com/spamilton-text-generation-with-lstms-and-hamilton-lyrics-ec7938ae830c




