by Sofia Godovykh

How I used machine learning to explore the differences between British and American literature

As I delved deeper into English literature to improve my own language skills, my interest was piqued: how do American and British English differ?

With this question in mind, the next step was to apply natural language processing and machine learning techniques to find concrete examples. I was curious to know whether it would be possible to train a classifier that could distinguish the two kinds of literary texts.

It is quite easy to distinguish texts written in different languages, since the cardinality of the intersection of their words (features, in machine learning terms) is relatively small. Text classification by category (such as science, atheism, computer graphics, and so on) is the well-known "hello world" of text classification tasks. I faced a more difficult task when I tried to compare two dialects of the same language, since the texts have no common theme.

The most time-consuming stage of machine learning deals with data retrieval. For the training sample, I used texts from Project Gutenberg, which can be downloaded freely. As for the list of American and British authors, I used names of authors I found on Wikipedia.

One of the challenges I encountered was matching the author of a text to the right Wikipedia page. The site has a good search by name, but since it doesn't allow its data to be parsed, I proposed instead to use the files that contain metadata. This meant I needed to solve a non-trivial name-matching task (Sir Arthur Ignatius Conan Doyle and Doyle, C. are the same person, but Doyle, M.E. is a different person), and I had to do so with a very high level of accuracy.
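To see why this matching is non-trivial, here is a toy sketch (not the code actually used in the project) that reduces a name to a surname-plus-initial key. Note how the catalog form and the natural form of Conan Doyle's name produce different keys, so a naive rule would treat them as different people:

```python
# Toy illustration (hypothetical helper, not the author's actual code):
# matching catalog names like "Doyle, C." against full names like
# "Sir Arthur Ignatius Conan Doyle" via a (surname, initial) key.

HONORIFICS = {"sir", "dr", "lady", "lord"}

def name_key(name):
    """Reduce a name to a (surname, first-initial) pair for naive matching."""
    if "," in name:  # catalog form: "Doyle, C."
        surname, rest = [p.strip() for p in name.split(",", 1)]
        initial = rest[0].lower() if rest else ""
    else:            # natural form: "Sir Arthur Ignatius Conan Doyle"
        parts = [p for p in name.split() if p.lower().strip(".") not in HONORIFICS]
        surname, initial = parts[-1], parts[0][0].lower()
    return surname.lower(), initial

print(name_key("Sir Arthur Ignatius Conan Doyle"))  # ('doyle', 'a')
print(name_key("Doyle, C."))                        # ('doyle', 'c')
```

The keys disagree ('a' from "Arthur" versus 'c' from "Conan"), even though this is one person, which is exactly the kind of ambiguity that pushed me toward a more reliable identifier.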

Instead, I chose to sacrifice sample size for the sake of high accuracy, and to save some time. As a unique identifier, I chose an author's Wikipedia link, which was included in some of the metadata files. With these files, I was able to acquire about 1,600 British and 2,500 American texts and use them to begin training my classifier.

For this project I used the sklearn package. The first step after the data collection and analysis stage is pre-processing, for which I used a CountVectorizer. A CountVectorizer takes text data as input and returns a vector of features as output. Next, I needed to calculate tf-idf (term frequency–inverse document frequency). A brief explanation of why I needed to use it, and how:

For example, take the word “the” and count the number of occurrences of the word in a given text, A. Let’s suppose that we have 100 occurrences, and the total number of words in a document is 1000.

Thus,

tf(“the”) = 100/1000 = 0.1

Next, take the word “sepal”, which has occurred 50 times:

tf(“sepal”) = 50/1000 = 0.05

To calculate the inverse document frequency for these words, we take the logarithm of the ratio of the total number of texts to the number of texts containing at least one occurrence of the word. If there are 10,000 texts in total, and the word "the" occurs in every one of them:

idf(“the”) = log(10000/10000) = 0

tf-idf(“the”) = idf(“the”) * tf(“the”) = 0 * 0.1 = 0

The word "sepal" is far rarer, and was found in only 5 of the texts. Therefore:

idf(“sepal”) = log(10000/5) ≈ 7.6 and tf-idf(“sepal”) = 7.6 * 0.05 = 0.38

Thus, the most frequently occurring words carry less weight, and rarer, more specific ones carry more weight. If the word "sepal" occurs many times, we can assume the text is botanical. We cannot feed a classifier with raw words, so we use the tf-idf measure instead.
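The worked example above can be reproduced in a few lines of Python (using the natural logarithm, which is what gives the ≈7.6 figure):

```python
import math

N_DOCS = 10_000  # total number of texts in the collection

def tf(count, total_words):
    # term frequency: occurrences of the word / words in the document
    return count / total_words

def idf(docs_with_word):
    # inverse document frequency: log of (total docs / docs containing the word)
    return math.log(N_DOCS / docs_with_word)

# "the": 100 occurrences out of 1,000 words, present in all 10,000 documents
tfidf_the = tf(100, 1000) * idf(10_000)   # 0.1 * log(1) = 0.0

# "sepal": 50 occurrences out of 1,000 words, present in only 5 documents
tfidf_sepal = tf(50, 1000) * idf(5)       # 0.05 * log(2000) ≈ 0.38

print(tfidf_the, round(tfidf_sepal, 2))   # 0.0 0.38
```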

After presenting the data as a set of features, I needed to train the classifier. I was working with text data, which is represented as sparse data, so the best option is a linear classifier, which works well with large numbers of features.

First, I ran the CountVectorizer, TfidfTransformer and SGDClassifier with their default parameters. By analyzing the plot of accuracy against sample size, where accuracy fluctuated from 0.6 to 0.85, I discovered that the classifier was very much dependent on the particular sample used, and therefore not very effective.
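A minimal sketch of that default pipeline, with a few toy sentences standing in for the roughly 4,000-text Gutenberg corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Toy stand-ins for the corpus; the real sample was ~1,600 British
# and ~2,500 American Gutenberg texts.
texts = [
    "the color of the harbor in new york",
    "dollars and labor in the city of boston",
    "the colour of the harbour in london",
    "my lord and lady spoke of honour",
] * 10  # repeated so the classifier has enough examples to fit
labels = ["american", "american", "british", "british"] * 10

pipeline = Pipeline([
    ("vect", CountVectorizer()),             # text -> token counts
    ("tfidf", TfidfTransformer()),           # counts -> tf-idf weights
    ("clf", SGDClassifier(random_state=0)),  # linear model, good for sparse data
])
pipeline.fit(texts, labels)
print(pipeline.predict(["the colour of honour", "dollars in new york"]))
```

On this toy data the spelling differences (color/colour, harbor/harbour) do all the work, which is a compressed version of what the real classifier learned.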

After inspecting the list of classifier weights, I noticed part of the problem: the classifier had been fed words like "of" and "he", which we should have treated as noise. I could easily solve this problem by removing these words from the features via the CountVectorizer's stop_words parameter: stop_words = 'english' (or your own custom list of stop words).
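The effect of the stop_words parameter can be seen directly on the vectorizer's vocabulary:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["of the he to the sea", "the lord of the manor"]

with_stops = CountVectorizer().fit(docs)                      # default: keep everything
without_stops = CountVectorizer(stop_words="english").fit(docs)  # drop built-in stop list

print(sorted(with_stops.vocabulary_))     # includes 'of', 'the', 'he', 'to'
print(sorted(without_stops.vocabulary_))  # only content words remain
```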

With the default stop words removed, I got an accuracy of 0.85. After that, I ran automatic parameter selection using GridSearchCV and achieved a final accuracy of 0.89. I might be able to improve this result with a larger training sample, but for now I stuck with this classifier.
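A sketch of such a GridSearchCV run over the pipeline, again on toy data; the parameter grid here is purely illustrative, since the article doesn't say which parameters were actually tuned:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

texts = [
    "the color of the harbor in new york",
    "dollars and labor in the city of boston",
    "the colour of the harbour in london",
    "my lord and lady spoke of honour",
] * 10
labels = ["american", "american", "british", "british"] * 10

pipeline = Pipeline([
    ("vect", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(random_state=0)),
])

# Illustrative grid: step name + "__" + parameter name.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": [True, False],
    "clf__alpha": [1e-3, 1e-4, 1e-5],
}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```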

Now on to what interests me most: which words point to the origin of the text? Here’s a list of words, sorted in descending order of weight in the classifier:

American: dollars, new, york, girl, gray, american, carvel, color, city, ain, long, just, parlor, boston, honor, washington, home, labor, got, finally, maybe, hodder, forever, dorothy, dr

British: round, sir, lady, london, quite, mr, shall, lord, grey, dear, honour, having, philip, poor, pounds, scrooge, soames, things, sea, man, end, come, colour, illustration, english, learnt

While having fun with the classifier, I was able to single out the most "American" British authors and the most "British" American authors (a tricky way to see how badly my classifier could perform).
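This trick can be sketched by averaging the model's decision function over each author's texts; the data below is hypothetical and the author names are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

train = [
    "the color of the harbor in new york",
    "dollars and labor in the city of boston",
    "the colour of the harbour in london",
    "my lord and lady spoke of honour",
] * 10
labels = ["american", "american", "british", "british"] * 10

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", SGDClassifier(random_state=0)),
]).fit(train, labels)

# Two hypothetical American authors; classes_ is sorted, so a positive
# decision score leans toward 'british'.
texts_by_author = {
    "Author A": ["the colour of honour, my lord"],    # writes rather "British"
    "Author B": ["dollars in the city of new york"],  # writes plainly "American"
}
scores = {a: pipe.decision_function(t).mean() for a, t in texts_by_author.items()}
most_british_american = max(scores, key=scores.get)
print(most_british_american)
```

Sorting all authors of one nationality by this mean score gives lists like the ones below.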

The most “British” Americans:

  • Frances Hodgson Burnett (born in England, moved to the USA at the age of 17, so I treat her as an American writer)
  • Henry James (born in the USA, moved to England at the age of 33)
  • Owen Wister (yes, the father of Western fiction)
  • Mary Roberts Rinehart (called the American Agatha Christie for a reason)
  • William McFee (another writer who moved to America at a young age)

The most “American” British:

  • Rudyard Kipling (he lived in America for several years, and also wrote "American Notes")
  • Anthony Trollope (the author of "North America")
  • Frederick Marryat (a veteran of the Anglo-American War of 1812, whose "Narrative of the Travels and Adventures of Monsieur Violet in California, Sonora, and Western Texas" made him fall into the American category)
  • Arnold Bennett (the author of "Your United States: Impressions of a first visit"; one more gentleman who wrote travel notes)
  • E. Phillips Oppenheim

And also the most “British” British and “American” American authors (because the classifier still works well):

Americans:

  • Francis Hopkinson Smith
  • Hamlin Garland
  • George Ade
  • Charles Dudley Warner
  • Mark Twain

British:

  • George Meredith
  • Samuel Richardson
  • John Galsworthy
  • Gilbert Keith Chesterton
  • Anthony Trollope (oh, hi)

I was inspired to do this work by this @TragicAllyHere tweet:

Well, wourds really matter, as I realised.

Translated from: https://www.freecodecamp.org/news/how-to-differentiate-between-british-and-american-literature-being-a-machine-learning-engineer-ac842662da1c/
