Predicting poetry with LSTM

Within written media, poetry is often regarded as enigmatic, frivolous, or too niche. As a result, poems (even by established poets) are often overlooked by larger publishers and literature-focused websites alike. (The anti-capitalist nature of poetry may play a role here as well.) There are services for rating and recommending entire books (including poetry collections, to be fair) like Goodreads, Amazon, or Bookish, but to my knowledge, there aren’t any sites or services that recommend poems on an individual level.

With this in mind, I wondered how poem recommendation might even work. Readers often find a genre or two that they like and seek it out, but there must be elements of poetry that transcend genre. If there are, machine learning seems like a perfect tool for finding them. In this article, I’ll explore some features of poetry that make it unique as a style of writing and investigate differences between four umbrella genres I’ll be referring to as “movements”. After building a model, I can create a recommendation system that recommends poetry based on a word, multiple words, or another poem.

The data

With a history that dates back to 1912, the Poetry Foundation is one of the largest purveyors of poetry in the world and a crucial resource for poets and readers alike. I scraped 4,307 poems from their website, each of which was labeled with a genre. There were a total of 13 genres, which I broke down into four movements:

Pre-1900 (Victorian and Romantic)
Modern (a standalone category)
Metropolitan (New York School [1st and 2nd Generation], Confessional, Beat, Harlem Renaissance, Black Arts Movement)
Avant-Garde (Imagist, Black Mountain, Language Poetry, Objectivist)

By using four roughly balanced categories instead of the original thirteen, I was able to more easily analyze and classify each class of poem. Modern poetry (both the genre and the movement) made up about 29% of the data. Avant-Garde, the movement with the fewest poems, made up about 22% of the data.

A note on the scraping process

The scraping process presented challenges in that poems came in two forms: HTML-text and scanned images. I was able to use BeautifulSoup to easily capture the text-based ones, but had to rely on PyTesseract for poems from scanned images. While I’m confident that a large majority have been scraped properly, there are undoubtedly some poems that are truncated, contain typos, or have extra lines, merely as a result of the inaccuracies of the image-to-text library. Still, in the name of having more data, using the scanned image poetry was a necessity.

Outline

After a lengthy scraping (and re-scraping) process, I cleaned the data by removing section headers (roman numerals and things like Part 1, Part 2, etc.), empty lines, and any extra lines containing the poet’s name and year of publication. This allowed me to more accurately engineer several features, including the number of lines in the poem, average number of words per line, average number of syllables per word, and lexical richness. I also looked at the polarity and subjectivity of poems.
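
The cleaning step described above might look something like the following. This is only a minimal sketch: the two regular expressions and the function name `clean_poem_lines` are my own assumptions rather than the project's actual rules, and removing poet names and publication years is left out because it would need per-poem metadata.

```python
import re

# Hypothetical patterns; the project's actual cleaning rules are not shown in the article.
ROMAN_NUMERAL = re.compile(r"^[IVXLCDM]+\.?$")
PART_HEADER = re.compile(r"^part\s+\d+\.?$", re.IGNORECASE)


def clean_poem_lines(raw_text):
    """Return the poem's lines with blank lines and section headers removed."""
    cleaned = []
    for line in raw_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # drop empty lines
        if ROMAN_NUMERAL.match(stripped) or PART_HEADER.match(stripped):
            continue  # drop section headers such as "II" or "Part 2"
        cleaned.append(stripped)
    return cleaned


sample = "II\nthe river bends\n\nPart 2\ntoward the sea"
print(clean_poem_lines(sample))
```

One caveat of this naive pattern: a genuine line that consists only of the word "I" would be treated as a roman numeral and dropped, so a real pipeline would need a more careful rule.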

After feature engineering, I explored the data alongside these new features and processed the text to investigate the most frequently used words. I created a variety of visualizations to support my findings. Finally, I ran several prediction models to provide further insights into what I looked at during my data exploration.

Feature engineering and EDA

I was very excited by the range of features that can be engineered within poetic text, most of which proved very useful in both analysis and classification. Poetry is a unique medium of writing in which structure and form are integral to the style (and sometimes even the meaning) of a poem.

As I will show, Avant-Garde poetry, often seen as a more experimental style and an outright rejection of the past, is almost always at the opposite end of the spectrum from Pre-1900 poetry, which is unsurprising from a literary criticism standpoint. In short, the formal and structural elements that I quantified in this project provide statistical confirmation of well-established literary theories and analysis.

Number of lines

The number of lines, as opposed to word count, is the standard measurement of a poem’s length. That is why the data cleaning I described earlier was so crucial: it removed any lines that aren’t part of the poem itself.

I was surprised to find that the median values were fairly similar across all movements, with the exception of Modern poetry, which had the smallest median value.

Despite these similarities, Pre-1900 poems do tend to be much longer on average. The average length is 55 lines, whereas the next highest, Metropolitan, is only 38. The distribution of the upper quartiles in the chart above further depicts a movement that’s no stranger to a long poem. The lower whisker for Pre-1900 also shows that those poems tend to be at least a few lines long (the minimum was 4), whereas the other movements have no problems with a one-line poem.

Modern poetry tends to be the shortest with both the lowest average (33 lines) and a median that is 4 lines fewer than the next lowest. Avant-Garde and Metropolitan poetries are statistically similar to each other, as are Avant-Garde and Modern poetries.

Average line length (words per line)

Another key metric that greatly affects how a poem appears on the page, as well as how it is read, is the average number of words per line. A poem that averages two words per line is going to look and feel very different from, say, a sonnet that averages eight.

One important discovery was how the advent of the prose poem skewed my data. A prose poem is a poem that looks much more like a piece of fiction, using paragraphs or large chunks of text as opposed to the line breaks one usually associates with poems. So some of the one-line poems discussed in the previous section may simply have been one-paragraph prose poems.

These types of poems became much more prevalent in the 20th century and are not present in my data’s Pre-1900 category. As a result, the maximum value for average line length in Pre-1900 poetry is 23, whereas the maximums for the other three movements are in the upper hundreds and even well above one thousand.

While this obviously skews the averages of Avant-Garde, Metropolitan, and Modern poetry, their median values tell a different story.

Avant-Garde tends to have the fewest words per line by far, with a median value of about 5.1 words, compared to the next lowest, Metropolitan, at about 6.6 words. Avant-Garde simultaneously happens to have the highest average at 9.3 words per line, which suggests a prevalence of prose poetry within the movement.
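
To see why a handful of prose poems can raise a movement's mean so much while leaving its median nearly untouched, here is a small sketch. The helper `words_per_line` and the toy poems are made up for illustration and are not taken from the dataset.

```python
from statistics import mean, median


def words_per_line(lines):
    """Average number of whitespace-separated words per line of one poem."""
    return sum(len(line.split()) for line in lines) / len(lines)


# Toy data: one lineated poem and one prose poem stored as a single long line.
lineated = ["the light", "falls slowly", "on the", "empty street"]
prose_poem = [" ".join(["word"] * 300)]

# Per-poem scores for a tiny, invented "movement" of four poems.
scores = [words_per_line(lineated), words_per_line(prose_poem), 6.0, 7.0]
print(mean(scores), median(scores))
```

The single 300-word line drags the mean far above every lineated poem in the toy set, while the median stays close to the typical value, which is why the medians are the more telling statistic here.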

Pre-1900 poetry tends to have the longest lines, with a median value of 7.0 words, and also tends to be the most regular, with the smallest range of values. This makes sense given the adherence to established structures such as sonnets and villanelles. It is also worth noting that Pre-1900 poetry has the smallest average value (7.2 compared to the next lowest of 8.3), which is again most likely due to there being no examples of prose poetry.

Polarity

Pre-1900 poetry is overwhelmingly positive, with a median value of 0.90! In the box-and-whisker plot below, notice the position of the red line compared to the other movements. The other three movements are all similar to each other, and their polarities have no statistically significant differences between them.

Poetry is rarely neutral and tends to be positive; as depicted in the chart below, at least 61% of the poems in each movement have a positive polarity score. 71% of Pre-1900 poems have a positive polarity score.

Avant-Garde poetry contains the most neutral poems at just below 5%, but it’s still a relatively small share.

End rhymes

I was able to use Allison Parrish’s Pronouncing package to determine the number of end rhymes a poem contains. An end rhyme occurs when the word at the end of one line rhymes with another word at the end of a different line. I divided that number by the total number of lines to get a ratio that became one of my classification model’s most important features. (Note: I counted only unique rhymes.)
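
One way such a ratio could be computed is sketched below. The exact counting rule is not spelled out in the article, so this version makes an assumption: it counts each distinct end word that rhymes with at least one other end word, then divides by the number of lines. The rhyme lookup is passed in as a function; a toy dictionary stands in for it here, and the Pronouncing package's `pronouncing.rhymes` could presumably be supplied in its place.

```python
import string


def end_rhyme_ratio(lines, rhymes_with):
    """Distinct end words that rhyme with another end word, divided by line count.

    `rhymes_with(word)` must return a collection of words that rhyme with `word`.
    """
    end_words = set()
    for line in lines:
        tokens = line.lower().split()
        if tokens:
            end_words.add(tokens[-1].strip(string.punctuation))
    rhymed = set()
    for word in end_words:
        candidates = set(rhymes_with(word))
        if candidates & (end_words - {word}):
            rhymed.add(word)
    return len(rhymed) / len(lines) if lines else 0.0


# Toy lookup standing in for a pronunciation dictionary (illustrative only).
TOY_RHYMES = {"night": {"light", "bright"}, "light": {"night", "bright"}}


def toy_rhymes(word):
    return TOY_RHYMES.get(word, set())


poem = ["Deep in the night,", "a single light", "over the water"]
print(end_rhyme_ratio(poem, toy_rhymes))
```

In the toy poem, two of the three end words rhyme with each other, so the ratio is two thirds.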

Unsurprisingly, there is a lot of separation between Pre-1900 poetry and the other movements.

Avant-Garde poetry tends not to use end rhymes, and they are relatively infrequent in Metropolitan poetry. End rhymes are not uncommon in Modern poetry, but they are truly at home in Pre-1900 poetry (and almost seem to be a requirement!), as shown below.

Only 8% of Avant-Garde poems had an end rhyme ratio above 0.25, compared to 85% of Pre-1900 poems.

Complexity of language (syllables per word)

Again using the Pronouncing package, I calculated the average number of syllables per word in each poem. I used this as a measure of the complexity of the language used within a poem; words with more syllables tend to be more complex than words with only one syllable.
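
For readers without a pronunciation dictionary at hand, the same feature can be roughly approximated with a vowel-group heuristic. To be clear, this is a swapped-in technique and not the dictionary-based count the Pronouncing package provides; the function names are hypothetical and the estimate will be wrong for many irregular words.

```python
import re

VOWEL_GROUPS = re.compile(r"[aeiouy]+")


def estimate_syllables(word):
    """Rough syllable estimate: count vowel groups, ignoring a silent final 'e'."""
    word = word.lower()
    if word.endswith("e") and not word.endswith("le") and len(word) > 2:
        word = word[:-1]
    return max(1, len(VOWEL_GROUPS.findall(word)))


def syllables_per_word(words):
    """Average estimated syllables per word for one poem's word list."""
    return sum(estimate_syllables(w) for w in words) / len(words)


print(syllables_per_word(["love", "water", "eternity"]))
```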

I had expected Pre-1900 poetry, with its flowery Victorian-era English, to have a much higher average number of syllables per word. Instead, it has the simplest word usage (fewest syllables), whereas Metropolitan has the highest median value, narrowly edging out Avant-Garde.

It’s worth noting that Avant-Garde has the largest range by far, as shown in the above chart. This indicates a varied movement of poems that employ simple and complex language.

Lexical richness

Another measure of complexity of language is lexical richness, which is calculated by dividing a poem’s vocabulary (the number of unique words) by the total number of words in the poem. A repetitious poem would have a low value, whereas a poem with a high value (almost or entirely unique words) would be described as “lexically rich”. A poem in which each word appears only once would have a score of 1.0.
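
The definition above translates almost directly into code. This is a minimal sketch that assumes simple lowercasing and whitespace tokenization, which may differ from the tokenization used in the project.

```python
def lexical_richness(text):
    """Unique words divided by total words (also called the type-token ratio)."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0


print(lexical_richness("row row row your boat"))        # 3 unique / 5 total = 0.6
print(lexical_richness("each word appears only once"))  # 5 unique / 5 total = 1.0
```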

Pre-1900 poetry appears to be the most repetitious movement, whereas Avant-Garde poetry is the most lexically rich. In the chart below, Avant-Garde is the only movement where a whisker reaches a value of 1.0, and all of its quartiles are well above the other movements.

It’s important to combine a couple of these observations to realize that Pre-1900 is wordy and repetitive, whereas Avant-Garde tends to be concise and full of unique language.

Text processing

I processed the poems in order to get a better look at what words were most frequently used within each movement. To process the text, I:

made the poems lowercase
converted contractions to root words
removed punctuation
lemmatized the words
removed stop words

My stop words included:

NLTK stop words
older English equivalents of those stop words (e.g. thy, doth, ere)
poet names (because some may have gotten through in the scraping process), minus any names that may also be used as words
HTML tags that may have gotten through the scraping process (this was an issue during my initial scrape, but I believe was corrected during the cleaning process; still, better safe than sorry)
words of questionable value discovered in the first round of EDA (such as would, upon, and may)

There were 119,285 unique words in the corpus and 1,165,726 total words. After processing, this went down to 36,443 unique words and 585,256 total words.
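
The processing steps and stop-word filtering above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: the stop list is a tiny stand-in for the much longer one described earlier, and contraction handling and lemmatization are omitted because they need an NLP library. The last lines show how total and unique word counts, like the ones quoted above, can be derived with a `Counter`.

```python
import string
from collections import Counter

# Tiny illustrative stop list; the article used NLTK's list plus archaic words and more.
STOP_WORDS = {"the", "a", "and", "of", "to", "in", "thy", "doth", "ere"}


def process(text):
    """Lowercase, strip punctuation, and drop stop words (no lemmatization here)."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in no_punct.split() if w not in STOP_WORDS]


poems = ["The light of day, the light of night.", "Thy heart doth love the day."]
tokens = [w for poem in poems for w in process(poem)]
counts = Counter(tokens)

print(len(tokens), len(counts))  # total words, unique words
print(counts.most_common(2))     # most frequent words after processing
```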

The 25 most frequently used words are (again, after processing and lemmatization):

There are a lot of visual (see, eye, light, look, white, face), temporal (day, night, time, old, long, never), and conceptual (love, life, man, heart, thing, still, world) terms.

I find it interesting that “come” just barely edged out “love” for the top spot. Again, this is after lemmatization, so it combines come, comes, coming, came, etc. This perhaps simultaneously points to a call to action (a beckoning, à la “Come here!”), a passive observation (“He comes from a distant city…”, from Diane di Prima’s An Exercise in Love), as well as the sexual verb, which is undoubtedly more common in the post-19th-century movements.

Breaking down word frequency by movement paints a clearer picture of some of the differences in language used:

Metropolitan, Modern, and Avant-Garde poetries tend to focus more on the visual and temporal, with Avant-Garde also including some more specifically natural words like water, tree, sea, and leaf. It is also worth noting that love is only the eighth most popular term for Avant-Garde, whereas it’s in the top three of the other movements.

Pre-1900 poetry skews more conceptual and ethereal, with words like soul and god, which are unique to this movement’s top 25. I’m also surprised at the relative lack of natural terms (with the exception of sea), considering this movement includes the Romantic genre, which is known for glorifying nature.

Black is unique to Metropolitan’s list, which can presumably be explained by the Harlem Renaissance and Black Arts Movement genres, as well as the darker, gritty aesthetic of city-based poetry by Beat and New York School poets.

Finally, it is worth noting the scale of each of these graphs, which reflects the wordiness and repetition of Pre-1900 poetry and the opposite qualities in Avant-Garde poetry. As has generally been the case in my analyses, Metropolitan and Modern poetries lie somewhere in the middle.

Modeling

I ran Naive Bayes, KNN, Decision Tree, Random Forest, and SVM models using a TF-IDF vectorizer. My final implementation, however, was an SVM model using Doc2Vec document vectors instead, which provided me with a decent F1 score and the best fit by far. Although I kept them out of my final Jupyter notebook for the sake of brevity, I also ran XGBoost and LSTM models, which showed some promise but weren’t quite up to the level of my final model.

The importance of numerical data

All of my models consistently performed much better when using both the word (or document) vectors and my engineered features. Generally, a model would see a boost of around 10% when including these features.

The baseline model, for which I used Bernoulli Naive Bayes on both TF-IDF vectors and my engineered features, achieved an F1 score of 42.7%. This is considerably better than just predicting the dominant class, which accounted for 29% of the data. Still, as you can see in the confusion matrix below, it did indeed overpredict on the dominant class, Modern, even though there wasn’t much of an imbalance.
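
For context on that comparison, the sketch below computes the F1 score of a model that always predicts the dominant class. Two things here are assumptions on my part: the article does not say which F1 averaging it used (this sketch uses class-weighted averaging), and while the 29% and 22% shares come from the text above, the split between the remaining two movements is invented for illustration.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Return {label: F1} computed from true and predicted labels."""
    scores = {}
    for label in set(y_true):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by each class's true count."""
    support = Counter(y_true)
    per_class = f1_per_class(y_true, y_pred)
    return sum(per_class[c] * n for c, n in support.items()) / len(y_true)


# Hypothetical 100-poem sample; the Metropolitan and Pre-1900 counts are made up.
y_true = ["Modern"] * 29 + ["Avant-Garde"] * 22 + ["Metropolitan"] * 25 + ["Pre-1900"] * 24
y_pred = ["Modern"] * len(y_true)  # always predict the dominant class

print(round(weighted_f1(y_true, y_pred), 3))
```

Under these assumptions, always guessing Modern is right 29% of the time but earns a weighted F1 well below 29%, because three of the four classes score zero.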

I had some success with K-Nearest Neighbors, which suggests a certain amount of clustering in the data, and with a Random Forest. The latter was extremely overfit, however.

Similarly overfit was my best-scoring model, an SVM trained on the TF-IDF vectors and numerical data. Its strong score was relatively unsurprising given SVM’s general success with text classification and with data that has more features than datapoints. This model achieved the best F1 score but would not generalize well to unseen data.

Combining my numerical features with Doc2Vec embeddings produced the model that best generalizes to unseen data, without taking too much of a hit in F1 score.

Other than the baseline, my models were consistently better at picking out Pre-1900 poetry, without much confusion between that movement and the other three. Avant-Garde, Metropolitan, and Modern proved more difficult to differentiate and were generally confused with one another. The final model seems to suggest that Modern is the closest movement to Pre-1900, with 15% of Modern poems being incorrectly classified as Pre-1900 poems. Avant-Garde and Metropolitan appear very similar to each other, which makes sense from a poetry standpoint.

Run time was not an issue for most of my models, which is partially a result of having a relatively small dataset. My Doc2Vec model runs nearly instantly, having only 100 dimensions and 7 engineered features.

Final model

I trained a final model using all of the data, and the F1 score increased to 66.8%. Pre-1900 was indeed the easiest to identify, and the other three movements were fairly similar to each other, with Modern being the most difficult to correctly identify (an F1 score of 60%).

The F1 score of each individual movement increased after training on the entire dataset. Modern saw the biggest jump, from 46% to 60%. Avant-Garde saw a surprisingly large boost in accuracy after training on the entire dataset, with its accuracy score moving from 51% to 65%. Being the smallest class (at about 22%) may explain this; more data is almost always a good thing. Its F1 score jumped from 53% to 62%.

Top features

Except in my baseline and TF-IDF SVM models, many if not all of my engineered features ranked within the top ten most important features.

In my final model, five of my seven features were in the top ten:

The ratio of end rhymes to total lines took the top spot by a healthy margin, followed by the average number of words per line, the total number of lines, and lexical richness. The average number of syllables per word was the other engineered feature that made the top ten. Polarity and subjectivity scores were the only two that didn’t carry much importance in the model.

By using document vectors instead of TF-IDF vectors, I do end up losing some interpretability, given that the other features in the above chart are merely five out of 100 mysterious dimensions. Still, by using a set of features that totals 107 as opposed to 43,053, I produced a much simpler model with similar efficacy and a better ability to generalize.

This will help me more easily produce a recommendation system as well!

Conclusions

The power of form and structure! Numerical features based upon the form and structure of a poem proved to be consistently effective predictors of a poem’s movement.

Pre-1900 poems tend to be long, wordy, positive, full of rhymes, and use simpler, repetitious language.

Avant-Garde poems tend to be short, sparsely worded, unrhyming, and use complex, lexically rich language.

Metropolitan and Modern poems lie somewhere in between. Metropolitan poetry is most similar to Avant-Garde poetry, whereas Modern poetry shares similarities with all of the movements and is the only other movement to be somewhat similar to Pre-1900 poetry.

Future considerations

In the future, it would be interesting to engineer even more features, such as other types of rhyming (use of internal rhymes or slant rhymes), verb tenses (whether a poem predominantly uses present or past tense), and use of white space (e.g. whether every line of a poem starts at the left margin). Topic modeling may yield some interesting results as well.

Furthermore, I plan on trying to build this out using the actual genres (of which there are 13), as opposed to the four umbrella-like movements discussed here. This will present some notable challenges, not least of which is the large class imbalance. Modern poetry, which is its own genre and movement, accounts for over a quarter of all the poems. Although this will assuredly result in much less accurate models, it will also shed some light on the intricacies within poetic movements.

Project repo

You can check out my project repo on GitHub: https://github.com/p-szymo/poetry_genre_classifier

Translated from: https://towardsdatascience.com/predicting-poetic-movements-51006847cc6f