LSTM poetry prediction: Predicting poetic movements

Within written media, poetry is often regarded as enigmatic, frivolous, or too niche. As a result, poems (even by established poets) are often overlooked by larger publishers and literature-focused websites alike. (The anti-capitalist nature of poetry may play a role here as well.) There are services for rating and recommending entire books (including poetry collections, to be fair) like Goodreads, Amazon, or Bookish, but to my knowledge, there aren’t any sites or services that recommend individual poems.

With this in mind, I wondered how poem recommendation might even work. Readers often find a genre or two that they like and seek it out, but there must be elements of poetry that transcend genre. If there are, machine learning seems like a perfect tool for finding them. In this article, I’ll explore some features of poetry that make it unique as a style of writing and investigate differences between four umbrella genres I’ll be referring to as “movements”. After building a model, I can create a recommendation system that recommends poetry based on a word, multiple words, or another poem.

The data

With a history that dates back to 1912, the Poetry Foundation is one of the largest purveyors of poetry in the world and a crucial resource for poets and readers alike. I scraped 4,307 poems from their website, each of which was labeled with a genre. There were a total of 13 genres, which I broke down into four movements:

  • Pre-1900 (Victorian and Romantic)
  • Modern (a standalone category)
  • Metropolitan (New York School [1st and 2nd Generation], Confessional, Beat, Harlem Renaissance, Black Arts Movement)
  • Avant-Garde (Imagist, Black Mountain, Language Poetry, Objectivist)

By using four roughly balanced categories instead of the original thirteen, I was able to more easily analyze and classify each class of poem. Modern poetry (both the genre and the movement) made up about 29% of the data. Avant-Garde, the movement with the fewest poems, made up about 22% of the data.

A note on the scraping process

The scraping process presented challenges in that poems came in two forms: HTML-text and scanned images. I was able to use BeautifulSoup to easily capture the text-based ones, but had to rely on PyTesseract for poems from scanned images. While I’m confident that a large majority have been scraped properly, there are undoubtedly some poems that are truncated, contain typos, or have extra lines, merely as a result of the inaccuracies of the image-to-text library. Still, in the name of having more data, using the scanned image poetry was a necessity.

Outline

After a lengthy scraping (and re-scraping) process, I cleaned the data by removing section headers (roman numerals and things like Part 1, Part 2, etc.), empty lines, and any extra lines containing the poet’s name and year of publication. This allowed me to more accurately engineer several features, including the number of lines in the poem, average number of words per line, average number of syllables per word, and lexical richness. I also looked at the polarity and subjectivity of poems.
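The cleaning step described above can be sketched roughly as follows. This is a minimal illustration, not the article's actual code: the two regular expressions and the function name `clean_poem_lines` are assumptions, and removing poet-name and publication-year lines is left out because it depends on metadata not shown here.

```python
import re

# Hypothetical patterns; the article does not show its actual cleaning rules.
# Uppercase-only roman numerals, optionally followed by a period ("II", "IV.").
# Caveat: a poem line consisting solely of "I" would also be dropped.
ROMAN_NUMERAL = re.compile(r"^[IVXLCDM]+\.?$")
# Headers such as "Part 1" or "part ii".
PART_HEADER = re.compile(r"^part\s+(\d+|[ivxlcdm]+)\.?$", re.IGNORECASE)


def clean_poem_lines(raw_text):
    """Return the poem's lines with section headers and empty lines removed."""
    cleaned = []
    for line in raw_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # drop empty lines
        if ROMAN_NUMERAL.match(stripped) or PART_HEADER.match(stripped):
            continue  # drop section headers
        cleaned.append(stripped)
    return cleaned


sample = "I\nFirst line\n\nPart 2\nSecond line\nIV.\nThird line"
print(clean_poem_lines(sample))
```

Stripping each line also discards indentation, which is fine for counting lines and words but would matter for any feature based on white space.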

After feature engineering, I explored the data alongside these new features and processed the text to investigate the most frequently used words. I created a variety of visualizations to support my findings. Finally, I ran several prediction models to provide further insights into what I looked at during my data exploration.

Feature engineering and EDA

I was very excited by the range of features that can be engineered within poetic text, most of which proved very useful in both analysis and classification. Poetry is a unique medium of writing in which structure and form are integral to the style (and sometimes even the meaning) of a poem.

As I will show, Avant-Garde poetry, often seen as a more experimental style and an abject rejection of the past, is almost always at the opposite end of the spectrum from Pre-1900 poetry, which is unsurprising from a literary criticism standpoint. In short, the formal and structural elements that I quantified in this project provide statistical confirmation of well-established literary theories and analysis.

Number of lines

This is the standard measurement of the length of a poem, as opposed to word count. That is why the data cleaning I described earlier was so crucial: it removed any lines that aren’t part of the actual poem itself.

I was surprised to find that the median values were fairly similar across all movements, with the exception of Modern poetry, which had the smallest median value.

(Image by author)

Despite these similarities, Pre-1900 poems do tend to be much longer on average. The average length is 55 lines, whereas the next highest, Metropolitan, is only 38. The distribution of the upper quartiles in the chart above further depicts a movement that’s no stranger to a long poem. The lower whisker for Pre-1900 also shows that those poems tend to be at least a few lines long (the minimum was 4), whereas the other movements have no problems with a one-line poem.

Modern poetry tends to be the shortest with both the lowest average (33 lines) and a median that is 4 lines fewer than the next lowest. Avant-Garde and Metropolitan poetries are statistically similar to each other, as are Avant-Garde and Modern poetries.

Average line length (words per line)

Another key metric that greatly affects how a poem appears on the page, as well as how it is read, is the average number of words per line. A poem with a word per line average of two is going to look and feel very different than, say, a sonnet with a word per line average of eight.
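As a rough sketch of the two structural features discussed so far, the number of lines and the average number of words per line can be computed from a cleaned list of lines. The function name `line_features` and the choice to split words on whitespace are assumptions for illustration; the article does not show how it tokenized lines.

```python
def line_features(lines):
    """Return (number of lines, average words per line) for a cleaned poem."""
    num_lines = len(lines)
    if num_lines == 0:
        return 0, 0.0
    # Words are approximated by whitespace-separated tokens.
    total_words = sum(len(line.split()) for line in lines)
    return num_lines, total_words / num_lines


short_lined = ["the river bends", "again", "a grey heron", "waits"]
print(line_features(short_lined))  # (4, 2.0)

# A one-paragraph prose poem is a single very long "line".
prose_poem = [" ".join(["word"] * 120)]
print(line_features(prose_poem))  # (1, 120.0)
```

The second call shows why a single prose poem can pull a movement's mean line length far above its median.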

One important discovery was how the advent of the prose poem skewed my data. A prose poem is a poem that looks much more like a piece of fiction, using paragraphs or large chunks of text as opposed to the line breaks one usually associates with poems. So some of those one-line poems discussed in the previous section may have simply been a one-paragraph prose poem.

These types of poems became much more prevalent in the 20th century and are not present in my data’s Pre-1900 category. As a result, the maximum value for average line length in Pre-1900 poetry is 23, whereas the maximums for the other three movements are in the upper hundreds or even well above one thousand.

While this obviously skews the averages of Avant-Garde, Metropolitan, and Modern poetry, their median values tell a different story.

(Image by author)

Avant-Garde tends to have the fewest words per line by far, with a median value of about 5.1 words, compared to the next lowest, Metropolitan, at about 6.6 words. Avant-Garde simultaneously happens to have the highest average at 9.3 words per line, which suggests a prevalence of prose poetry within the movement.

Pre-1900 poetry tends to have the longest lines, with a median value of 7.0 words, and also tends to be the most regular, with the smallest range of values. This makes sense given the adherence to established structures such as sonnets and villanelles. It is also worth noting that Pre-1900 poetry has the smallest average value (7.2 compared to the next lowest of 8.3), which is again most likely due to there being no examples of prose poetry.

Polarity

Pre-1900 poetry is overwhelmingly positive, with a median value of 0.90! In the box-and-whisker plot below, notice the position of the red line compared to the other movements. The other three movements are all similar to each other, and their polarities have no statistically significant differences between them.

(Image by author)

Poetry is rarely neutral and tends to be positive; as depicted in the chart below, at least 61% of the poems in each movement have a positive polarity score. 71% of Pre-1900 poems have a positive polarity score.

Avant-Garde poetry contains the most neutral poems at just below 5%, but it’s still a relatively small share.

(Image by author)

End rhymes

I was able to use Allison Parrish’s Pronouncing package to determine the number of end rhymes a poem contains. An end rhyme occurs when the word at the end of one line rhymes with another word at the end of a different line. I divided that number by the number of total lines to get a ratio that became one of my classification model’s most important features. (Note: I counted only unique rhymes.)
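One possible reading of this ratio is sketched below. It is an interpretation, not the article's implementation: "unique rhymes" is taken here to mean distinct end words that rhyme with at least one other distinct end word, and `naive_rhyme` is a toy stand-in (matching the last three letters) so that the example runs without any external package.

```python
import string


def naive_rhyme(a, b):
    """Toy stand-in: two different words 'rhyme' if their last three letters match."""
    return a != b and a[-3:] == b[-3:]


def end_rhyme_ratio(lines, rhymes=naive_rhyme):
    """Distinct rhyming end words divided by the total number of lines."""
    if not lines:
        return 0.0
    end_words = set()
    for line in lines:
        words = line.split()
        if words:
            # Normalize the final word: drop ASCII punctuation, lowercase.
            end_word = words[-1].strip(string.punctuation).lower()
            if end_word:
                end_words.add(end_word)
    rhyming = {w for w in end_words if any(rhymes(w, other) for other in end_words)}
    return len(rhyming) / len(lines)


stanza = [
    "The sun was bright",
    "upon the night,",
    "and all was still",
    "beside the water.",
]
print(end_rhyme_ratio(stanza))  # 0.5
```

With the Pronouncing package installed, the `rhymes` argument could be swapped for something like `lambda a, b: b in pronouncing.rhymes(a)`, which relies on the CMU Pronouncing Dictionary rather than spelling.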

Unsurprisingly, there is a lot of separation between Pre-1900 poetry and the other movements.

(Image by author)

Avant-Garde poetry tends not to use end rhymes, and they are relatively infrequent in Metropolitan poetry. End rhymes are not uncommon in Modern poetry, but they are truly at home in Pre-1900 poetry (and almost seem to be a requirement!), as shown below.

(Image by author)

Only 8% of Avant-Garde poems had an end rhyme ratio above 0.25, compared to 85% of Pre-1900 poems.

Complexity of language (syllables per word)

Again using the Pronouncing package, I calculated the average number of syllables per word in each poem. I used this as a measure of the complexity of the language used within a poem; words with more syllables tend to be more complex than words with only one syllable.
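The shape of this calculation can be sketched without the dictionary lookup. The snippet below swaps in a crude vowel-group heuristic for syllable counting, which is an assumption made only to keep the example self-contained; it miscounts many words (a silent final "e", for instance), whereas the Pronouncing package looks up real pronunciations.

```python
import re

VOWEL_GROUP = re.compile(r"[aeiouy]+")
WORD = re.compile(r"[A-Za-z']+")


def estimate_syllables(word):
    """Rough heuristic: count runs of consecutive vowels, with a minimum of one."""
    return max(1, len(VOWEL_GROUP.findall(word.lower())))


def avg_syllables_per_word(lines):
    """Average estimated syllables per word across all lines of a poem."""
    words = [w for line in lines for w in WORD.findall(line)]
    if not words:
        return 0.0
    return sum(estimate_syllables(w) for w in words) / len(words)


print(avg_syllables_per_word(["the sea", "a city"]))  # 1.25
```

With Pronouncing, the per-word count could instead come from `pronouncing.phones_for_word(word)` followed by `pronouncing.syllable_count(...)` on one of the returned pronunciations, with some fallback for words missing from the dictionary.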

I had expected Pre-1900 poetry, with its flowery Victorian-era English, to have a much higher average of syllables per word. Instead, it has the simplest word usage (fewest syllables), whereas Metropolitan has the highest median value, narrowly edging out Avant-Garde.

(Image by author)

It’s worth noting that Avant-Garde has the largest range by far, as shown in the above chart. This indicates a varied movement of poems that employ simple and complex language.

Lexical richness

Another measure of complexity of language is lexical richness, which is calculated by dividing a poem’s vocabulary (the number of unique words) by the number of total words in a text. A repetitious poem would have a low value, whereas a poem with a high value (almost or entirely unique words) would be described as “lexically rich”. A poem in which each word appears only once would have a score of 1.0.
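The ratio is simple enough to show directly. In this small sketch the function name `lexical_richness` is an assumption, and the input is taken to be a list of already-tokenized, lowercased words.

```python
def lexical_richness(words):
    """Unique words divided by total words; 1.0 means no word repeats."""
    if not words:
        return 0.0
    return len(set(words)) / len(words)


print(lexical_richness(["the", "sea", "the", "sky"]))      # 0.75
print(lexical_richness(["salt", "wind", "grey", "gull"]))  # 1.0
```

Because longer texts naturally repeat more words, this ratio tends to fall as a poem gets longer, which is worth keeping in mind when comparing movements with very different typical lengths.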

Pre-1900 poetry appears to be the most repetitious movement, whereas Avant-Garde poetry is the most lexically rich. In the chart below, Avant-Garde is the only movement where a whisker reaches a value of 1.0, and all of its quartiles are well above the other movements.

(Image by author)

It’s important to combine a couple of these observations to realize that Pre-1900 is wordy and repetitive, whereas Avant-Garde tends to be concise and full of unique language.

Text processing

I processed the poems in order to get a better look at what words were most frequently used within each movement. To process the text, I:

  • made the poems lowercase
  • converted contractions to root words
  • removed punctuation
  • lemmatized
  • removed stop words

My stop words included:

  • NLTK stop words
  • older English equivalents to those stop words (e.g. thy, doth, ere)
  • poet names (because some may have gotten through in the scraping process), minus any names that may also be used as words
  • HTML tags that may have gotten through the scraping process (this was an issue during my initial scrape, but I believe was corrected during the cleaning process; still, better safe than sorry)
  • words of questionable value discovered in the first round of EDA (such as would, upon, and may)

There were 119,285 unique words in the corpus and 1,165,726 total words. After processing, this went down to 36,443 unique words and 585,256 total words.
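The processing steps listed above can be sketched in order as follows. Everything named here is illustrative: the tiny `CONTRACTIONS` and `STOP_WORDS` tables stand in for the much larger lists the article describes (NLTK stop words plus additions), and the `lemmatize` parameter defaults to a do-nothing function where a real lemmatizer would be plugged in.

```python
import string

# Small hypothetical tables for illustration only.
CONTRACTIONS = {"can't": "can not", "i'm": "i am", "it's": "it is"}
STOP_WORDS = {
    "the", "a", "is", "it", "i", "am", "can", "not",  # modern stop words
    "thy", "doth", "ere",                             # older English equivalents
    "would", "upon", "may",                           # low-value words found in EDA
}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def process_poem(text, lemmatize=lambda word: word):
    """Lowercase, expand contractions, strip ASCII punctuation, lemmatize, drop stop words."""
    tokens = []
    # Normalize curly apostrophes so contractions match the lookup table.
    for raw in text.lower().replace("’", "'").split():
        word = raw.strip(string.punctuation)
        expanded = CONTRACTIONS.get(word, word)
        for part in expanded.split():
            part = part.translate(PUNCT_TABLE)
            if part:
                tokens.append(lemmatize(part))
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(process_poem("I can't see the light, ere Thy eyes!"))
# ['see', 'light', 'eyes']
```

Passing a real lemmatizer as `lemmatize` would further collapse forms such as "eyes" into "eye", which is how "come", "comes", "coming", and "came" end up counted together.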

The 25 most frequently used words are (again, after processing and lemmatization):

(Image by author)

There are a lot of visual (see, eye, light, look, white, face), temporal (day, night, time, old, long, never), and conceptual (love, life, man, heart, thing, still, world) terms.

I find it interesting that “come” just barely edged out “love” for the top spot. Again, this is after lemmatization, so this is a combination of come, comes, coming, came, etc. This perhaps simultaneously points to a call to action (a beckoning, à la “Come here!”), a passive observation (“He comes from a distant city…”, from Diane di Prima’s An Exercise in Love), as well as the sexual verb, which is undoubtedly more common in the post-19th-century movements.

Breaking down word frequency by movement paints a clearer picture of some of the differences in language used:

(Image by author)

Metropolitan, Modern, and Avant-Garde poetries tend to focus more on the visual and temporal, with Avant-Garde also including some more specifically natural words like water, tree, sea, and leaf. It is also worth noting that love is only the eighth most popular term for Avant-Garde, whereas it’s in the top three of the other movements.

Pre-1900 poetry skews more conceptual and ethereal, with words like soul and god, which are unique to this movement’s top 25. I’m also surprised at the relative lack of natural terms (with the exception of sea), considering this movement includes the Romantic genre, which is known for glorifying nature.

Black is unique to Metropolitan’s list, which can presumably be explained by the Harlem Renaissance and Black Arts Movement genres, as well as the darker, gritty aesthetic of city-based poetry by Beat and New York School poets.

Finally, it is worth noting the scale of each of these graphs, which reflects the wordiness and repetition of Pre-1900 poetry and the opposite qualities in Avant-Garde poetry. As has generally been the case in my analyses, Metropolitan and Modern poetries lie somewhere in the middle.

Modeling

I ran Naive Bayes, KNN, Decision Tree, Random Forest, and SVM models using a TF-IDF vectorizer. My final implementation, however, was an SVM model using Doc2Vec document vectors instead, which provided me with a decent F1 score and the best fit by far. Although I kept them out of my final Jupyter notebook for the sake of brevity, I also ran XGBoost and LSTM models, which showed some promise but weren’t quite up to the level of my final model.

The importance of numerical data

All of my models consistently performed much better when using both the word (or document) vectors plus my engineered features. Generally, a model would see around a 10% boost when including these features.
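Combining the two kinds of input amounts to placing the engineered features next to each document vector in a single row. The sketch below assumes the engineered features are standardized first, since SVMs are sensitive to feature scale; the article does not say whether or how it scaled them, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Z-score each column; constant columns are left at 0.0."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stdevs = [pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stdevs)] for row in rows]


def build_design_matrix(doc_vectors, engineered_rows):
    """Append standardized engineered features to each document vector."""
    scaled = standardize_columns(engineered_rows)
    return [list(vec) + feats for vec, feats in zip(doc_vectors, scaled)]


# Two made-up poems: 100-dimensional document vectors plus 7 engineered features.
doc_vectors = [[0.1] * 100, [0.2] * 100]
engineered = [
    [10, 5.0, 1.2, 0.80, 0.5, 0.4, 0.10],
    [30, 7.0, 1.4, 0.60, 0.9, 0.6, 0.50],
]
matrix = build_design_matrix(doc_vectors, engineered)
print(len(matrix[0]))  # 107
```

In practice the means and standard deviations would be computed on the training split only and then reused for the test split, so that no information leaks from held-out poems.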

The baseline model, for which I used Bernoulli Naive Bayes on both TF-IDF vectors and my engineered features, achieved an F1 score of 42.7%. This is considerably better than just predicting the dominant class, which accounted for 29% of the data. Still, as you can see in the confusion matrix below, it did indeed overpredict on the dominant class, Modern, even though there wasn’t much of an imbalance.
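To make the comparison with "just predicting the dominant class" concrete, here is a small sketch of per-class and averaged F1. It assumes macro averaging (an unweighted mean over classes); the article does not state which averaging its F1 scores use, and the labels below are made up.

```python
def f1_per_class(y_true, y_pred):
    """Return {label: F1} using F1 = 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# A classifier that always predicts the dominant class.
y_true = ["Modern", "Modern", "Pre-1900", "Avant-Garde"]
y_pred = ["Modern"] * 4
print(f1_per_class(y_true, y_pred))
print(round(macro_f1(y_true, y_pred), 4))  # 0.2222
```

The always-Modern predictor scores zero on every other class, so its averaged F1 is low even though it is right half the time in this toy sample.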

I had some success with K-Nearest Neighbors, which suggests a certain amount of clustering in the data, and with a Random Forest. The latter was extremely overfit, however.

Similarly overfit was my best-scoring model, an SVM with the TF-IDF vectors and numerical data. Its strong score was relatively unsurprising given SVM’s general success with text classification and with data that has more features than datapoints. It achieved the best F1 score, but would not generalize well to unseen data.

(Image by author)

Combining my numerical features with Doc2Vec embeddings proved to be the model that best generalizes on unseen data, without taking too much of a hit in F1 score.

Other than the baseline, my models were consistently better at picking out Pre-1900 poetry, without much confusion between that movement and the other three. Avant-Garde, Metropolitan, and Modern proved more difficult to differentiate and were generally confused for each other. The final model seems to suggest Modern being the closest movement to Pre-1900, with 15% of Modern poems being incorrectly classified as Pre-1900 poems. Avant-Garde and Metropolitan appear very similar to each other, which makes sense from a poetry standpoint.

Run time was not an issue for most of my models, which is partially a result of having a relatively small dataset. My Doc2Vec model runs nearly instantly, having only 100 dimensions and 7 engineered features.

Final model

I trained a final model using all of the data, and the F1 score increased to 66.8%. Pre-1900 was indeed the easiest to identify, and the other three movements were fairly similar to each other, with Modern being the most difficult to correctly identify (an F1 score of 60%).

(Image by author)

The F1 score of each individual movement increased after training on the entire dataset. Modern saw the biggest jump, from 46% to 60%. Avant-Garde saw a surprisingly large boost in accuracy after training on the entire dataset, with its accuracy score moving from 51% to 65%. Being the smallest class (at about 22%) may explain this; more data is almost always a good thing. Its F1 score jumped from 53% to 62%.

Top features

Except for my baseline and TF-IDF SVM models, many if not all of my engineered features ranked prominently within the top ten most important features.

In my final model, five of my seven features were in the top ten:

(Image by author)

The ratio of end rhymes to total lines took the top spot by a healthy margin, followed by the average number of words per line, the total number of lines, and lexical richness. The average number of syllables per word was the other engineered feature that made the top ten. Polarity and subjectivity scores were the only two that didn’t carry much importance in the model.

By using document vectors instead of TF-IDF vectors, I do end up losing some interpretability, given that the other features in the above chart are merely five out of 100 mysterious dimensions. Still, by using a set of features that totals 107 as opposed to 43,053, I produced a much simpler model with similar efficacy and a better ability to generalize.

This will help me more easily produce a recommendation system as well!

Recommendation system

Tune in later this week for a breakdown on how I built PO-REC, an algorithm that can recommend poems based on one word, multiple words in any format, and another poem within my dataset.

Conclusions

The power of form and structure! Numerical data based upon the form and structure of a poem proved to be consistently effective predictors of a poem’s movement.

Pre-1900 poems tend to be long, wordy, positive, full of rhymes, and use simpler, repetitious language.

Avant-Garde poems tend to be short, sparsely worded, unrhyming, and use complex, lexically rich language.

Metropolitan and Modern poems lie somewhere in between. Metropolitan poetry is most similar to Avant-Garde poetry, whereas Modern poetry shares similarities with all of the movements and is the only other movement to be somewhat similar to Pre-1900 poetry.

(Image: Wikimedia Commons)

Future considerations

In the future, it would be interesting to engineer even more features, such as other types of rhyming (use of internal rhymes or slant rhymes), verb tenses (whether a poem predominantly uses present or past tense), and use of white space (e.g. whether a poem always starts on the left part of the line). Topic modeling may yield some interesting results as well.

Furthermore, I plan on trying to build this out using the actual genres (of which there are 13), as opposed to the four umbrella-like movements discussed here. This will present some notable challenges, not least of which is the large class imbalance. Modern poetry, which is its own genre and movement, accounts for over a quarter of all the poems. Although this will assuredly result in much less accurate models, it will also shed some light on the intricacies within poetic movements.

Project repo

You can check out my project repo on GitHub: https://github.com/p-szymo/poetry_genre_classifier

(Image by author)

Translated from: https://towardsdatascience.com/predicting-poetic-movements-51006847cc6f
