编程神奇算法

由Bryan Berend | 2017年3月23日 (by Bryan Berend | March 23, 2017)

About Bryan: Bryan is the Lead Data Scientist at Nielsen.

关于布莱恩(Bryan) : 布莱恩(Bryan)是尼尔森(Nielsen)的首席数据科学家。

介绍 (Introduction)

When you first start learning about data science, one of the first things you learn about are classification algorithms. The concept behind these algorithms is pretty simple: take some information about a data point and place the data point in the correct group or class.

当您第一次开始学习数据科学时,首先要学习的是分类算法。 这些算法的概念非常简单:获取有关数据点的一些信息,并将数据点放置在正确的组或类中。

A good example is the email spam filter. The goal of a spam filter is to label incoming emails (i.e. data points) as “Spam” or “Not Spam” using information about the email (the sender, number of capitalized words in the message, etc.).

一个很好的例子是电子邮件垃圾邮件过滤器。 垃圾邮件过滤器的目标是使用有关电子邮件的信息(发件人,邮件中大写单词的数量等)将传入的电子邮件(即数据点)标记为“垃圾邮件”或“非垃圾邮件”。

The email spam filter is a good example, but it gets boring after a while. Spam classification is the default example for lectures or conference presentations, so you hear about it over and over again. What if we could talk about a different classification algorithm that was a bit more interesting? Something more nerdy? Something more…magical?

电子邮件垃圾邮件过滤器是一个很好的例子,但过了一会儿,它变得很无聊。 垃圾邮件分类是讲座或会议演示的默认示例,因此您一遍又一遍地了解到它。 如果我们可以谈论一种更有趣的不同分类算法怎么办? 还有书呆子吗? 还有更多……神奇吗?

That’s right folks! Today we’ll be talking about the Sorting Hat from the Harry Potter universe. We’ll pull some Harry Potter data from the web, analyze it, and then build a classifier to sort characters into the different houses. Should be fun!

没错,伙计们! 今天我们将讨论哈利·波特宇宙中的分拣帽。 我们将从网络上获取一些Harry Potter数据,对其进行分析,然后构建一个分类器以将字符分类到不同的房屋中。 应该很有趣!

Disclaimer:

免责声明:

The classifier built below is not incredibly sophisticated. Thus, it should be treated as a “first pass” of the problem in order to demonstrate some basic web-scraping and text-analysis techniques. Also, due to a relatively small sample size, we will not be employing classic training techniques like cross-validation. We are simply gathering some data, building a simple rule-based classifier, and seeing the results.

下面构建的分类器并不十分复杂。 因此,应将其视为问题的“第一步”,以证明一些基本的网页抓取和文本分析技术。 同样,由于样本量相对较小,我们将不会采用经典的训练技术,例如交叉验证 。 我们只是在收集一些数据,建立一个基于规则的简单分类器,然后查看结果。

Side note:

边注:

The idea for this blog post came from Brian Lange’s excellent presentation on classification algorithms at PyData Chicago 2016. You can find the video of the talk here and the slides here. Thanks Brian!

这篇博客文章的想法来自Brian Lange在2016年PyData芝加哥上有关分类算法的精彩演讲。 你可以找到谈话的视频在这里和幻灯片在这里 。 谢谢布莱恩!

第一步:从网络提取数据 (Step One: Pulling Data from the Web)

In case you’ve been living under a rock for the last 20 years, the Sorting Hat is a magical hat that places incoming Hogwarts students into the four Hogwarts houses: Gryffindor, Slytherin, Hufflepuff, and Ravenclaw. Each house has certain characteristics, and when the Sorting Hat is placed on a student’s head, it reads their minds and determines which house they would be the best fit for. By this definition, the Sorting Hat is a multiclass classifier (more than two groups) as opposed to a binary classifier (exactly two groups), like an spam filter.

如果您在岩石下生活了20年 ,Sorting Hat就是一顶神奇的帽子,它将来港的Hogwarts学生安置在四个Hogwarts房屋中:格兰芬多,斯莱特林,赫奇帕奇和拉文克劳。 每个房屋都有其特定的特征,当将分拣帽放在学生的头上时,它会读懂他们的想法,并确定他们最适合的房屋。 根据此定义,排序帽是一个多类分类器(超过两个组),而二元分类器(恰好是两个组)则类似于垃圾邮件过滤器。

If we are going sort students into different houses, we’ll need some information about the students. Thankfully, there is a lot of information on harrypotter.wikia.com. This website has articles on nearly every facet of the Harry Potter universe, including students and faculty. As an added bonus, Fandom, the company that runs the website, has an easy-to-use API with lots of great documentation. Hazzah!

如果我们要将学生分类到不同的房子里,我们将需要有关学生的一些信息。 幸运的是,在harrypotter.wikia.com上有很多信息。 该网站上的文章几乎涵盖了哈利波特宇宙的方方面面,包括学生和教职员工。 此外,运营网站的公司Fandom具有易于使用的API,其中包含许多出色的文档 。 哈扎!

We’ll start by importing pandas and requests. The former will be used for organizing the data, while the later will be used to actually make the data requests to the API.

我们将从导入pandasrequests 。 前者将用于组织数据,而后者将用于实际向API提出数据请求。

We’ll also need a smart way to loop through all the different students at Hogwarts and record the house they are sorted into by the Sorting Hat (this will be the “truth” that we will compare our results to). By poking around the website, it appears that articles are grouped by “Category”, such as “Hogwarts_students” and “Films_(real-world)”. The Fandom API allows us to list out all of the articles of a given category.

我们还需要一种聪明的方法来遍历霍格沃茨的所有不同学生,并记录他们被分拣帽分拣到的房子(这将是我们将结果与之比较的“真相”)。 通过浏览该网站,似乎可以按照“类别”对文章进行分组,例如“ Hogwarts_students”和“ Films_(真实世界)”。 通过Fandom API,我们可以列出给定类别的所有文章。

Let’s use Ravenclaw as an example. We’ll get all the data into a variable called info and then we’ll put it into a Pandas DataFrame.

让我们以Ravenclaw为例。 我们将所有数据放入名为info的变量中,然后将其放入Pandas DataFrame中。

# Import modules
import pandas as pd
import requests# Get Ravenclaw articles
category = 'Ravenclaws'
url = 'http://harrypotter.wikia.com/api/v1/Articles/List?expand=1&limit=1000&category=' + category
requested_url = requests.get(url)
json_results = requested_url.json()
info = json_results['items']
ravenclaw_df = pd.DataFrame(info)print('Number of articles: {}'.format(len(info)))
print('')
ravenclaw_df.head()

# Import modules
import pandas as pd
import requests# Get Ravenclaw articles
category = 'Ravenclaws'
url = 'http://harrypotter.wikia.com/api/v1/Articles/List?expand=1&limit=1000&category=' + category
requested_url = requests.get(url)
json_results = requested_url.json()
info = json_results['items']
ravenclaw_df = pd.DataFrame(info)print('Number of articles: {}'.format(len(info)))
print('')
ravenclaw_df.head()

Number of articles: 158

文章数:158

Rodeo!Rodeo中进行整个分析!

Yhat note:

注意

If you’re following along in our Python IDE, Rodeo, just copy and paste the code above into the Editor or Terminal tab. You can view results in either the History or Terminal tab. Bonus: Did you know you can drag and drop the tabs and panes to rearrange and resize?

如果您要遵循我们的Python IDE Rodeo进行操作 ,只需将上面的代码复制并粘贴到“编辑器”或“终端”选项卡中即可。 您可以在“历史记录”或“终端”选项卡中查看结果。 奖励:您知道您可以拖放选项卡和窗格来重新排列和调整大小吗?

abstract 抽象 comments 注释 id ID ns ns original_dimensions 原尺寸 revision 修订版 thumbnail 缩图 title 标题 type 类型 url 网址
0 0 {{Ravenclaw individual… {{Ravenclaw个人… 0 0 5080 5080 10 10 None 没有 {‘id’: 964956, ‘timestamp’: ‘1460047333’, ‘use… {'id':964956,'timestamp':'1460047333','use ... None 没有 Ravenclaw individual infobox 拉文克劳个人信息箱 NaN N /wiki/Template:Ravenclaw_individual_infobox / wiki / Template:Ravenclaw_individual_infobox
1 1个 Roland Abberley was a Ravenclaw student at Hog… Roland Abberley是Hog的Ravenclaw学生。 0 0 33946 33946 0 0 None 没有 {‘id’: 1024340, ‘timestamp’: ‘1479282062’, ‘us… {'id':1024340,'timestamp':'1479282062','us ... None 没有 Roland Abberley 罗兰·阿伯利 article 文章 /wiki/Roland_Abberley / wiki /罗兰·阿伯利
2 2 Stewart Ackerley (born c. 1982-1983) was a wiz… 斯图尔特·阿克利(Stewart Ackerley)(生于1982-1983年)是一位奇才 0 0 7011 7011 0 0 None 没有 {‘id’: 1024309, ‘timestamp’: ‘1479281746’, ‘us… {'id':1024309,'timestamp':'1479281746','us ... None 没有 Stewart Ackerley 斯图尔特·阿克利 article 文章 /wiki/Stewart_Ackerley / wiki / Stewart_Ackerley
3 3 Jatin Agarkar was a Ravenclaw student at Hogwa… Jatin Agarkar是Hogwa的Ravenclaw学生。 0 0 99467 99467 0 0 None 没有 {‘id’: 1039350, ‘timestamp’: ‘1482842767’, ‘us… {'id':1039350,'timestamp':'1482842767','us ... None 没有 Jatin Agarkar 贾廷·阿加卡(Jatin Agarkar) article 文章 /wiki/Jatin_Agarkar / wiki / Jatin_Agarkar
4 4 Alannis was a female Ravenclaw student at Hogw… 阿兰尼斯(Alannis)是霍格(Hogw)的女拉文克劳(Ravenclaw)女学生。 0 0 27126 27126 0 0 {‘width’: 322, ‘height’: 546} {'width':322,'height':546} {‘id’: 1024320, ‘timestamp’: ‘1479281862’, ‘us… {'id':1024320,'timestamp':'1479281862','us ... http://vignette3.wikia.nocookie.net/harrypotte… http://vignette3.wikia.nocookie.net/harrypotte… Alannis 阿兰尼斯 article 文章 /wiki/Alannis / wiki /阿兰尼斯

We can see a few things from this:

我们可以从中看到一些东西:

  • The first observation in this list is “Ravenclaw individual infobox”. Since this is not a student, we want to filter our results on the “type” column.
  • Unfortunately ravenclaw_df doesn’t have the articles’ contents…just article abstracts. In order to get the contents, we need to use a different API request and query data based on the articles’ ids.
  • Furthermore, we can write a loop to run over all of the houses and get one dataframe with all the data we need.
  • 该列表中的第一个观察结果是“ Ravenclaw个人信息框”。 由于这不是学生,因此我们希望在“类型”列上过滤结果。
  • 不幸的是ravenclaw_df没有文章的内容,只是文章摘要。 为了获取内容,我们需要使用不同的API请求并根据文章的ID查询数据。
  • 此外,我们可以编写一个循环以在所有房屋上运行,并获得一个包含所有所需数据的数据框。

Number of student articles: 748

学生人数:748

      id             title                     url       house
0  33349     Astrix Alixan     /wiki/Astrix_Alixan  Gryffindor
1  33353   Filemina Alchin   /wiki/Filemina_Alchin  Gryffindor
2   7018  Euan Abercrombie  /wiki/Euan_Abercrombie  Gryffindor
3  99282      Sakura Akagi      /wiki/Sakura_Akagi  Gryffindor
4  99036       Zakir Akram       /wiki/Zakir_Akram  Gryffindorid             title                     url      house
743  100562  Phylis Whitehead  /wiki/Phylis_Whitehead  Slytherin
744    3153            Wilkes            /wiki/Wilkes  Slytherin
745   35971      Ella Wilkins      /wiki/Ella_Wilkins  Slytherin
746   44393    Rufus Winickus    /wiki/Rufus_Winickus  Slytherin
747     719     Blaise Zabini     /wiki/Blaise_Zabini  Slytherin

      id             title                     url       house
0  33349     Astrix Alixan     /wiki/Astrix_Alixan  Gryffindor
1  33353   Filemina Alchin   /wiki/Filemina_Alchin  Gryffindor
2   7018  Euan Abercrombie  /wiki/Euan_Abercrombie  Gryffindor
3  99282      Sakura Akagi      /wiki/Sakura_Akagi  Gryffindor
4  99036       Zakir Akram       /wiki/Zakir_Akram  Gryffindorid             title                     url      house
743  100562  Phylis Whitehead  /wiki/Phylis_Whitehead  Slytherin
744    3153            Wilkes            /wiki/Wilkes  Slytherin
745   35971      Ella Wilkins      /wiki/Ella_Wilkins  Slytherin
746   44393    Rufus Winickus    /wiki/Rufus_Winickus  Slytherin
747     719     Blaise Zabini     /wiki/Blaise_Zabini  Slytherin

获取文章内容 (Getting article contents)

Now that we have the article ids, we can start pulling article contents. But some these articles are MASSIVE with incredible amounts of detail…just take a look at Harry Potter’s or Voldemort’s articles!

现在我们有了商品编号,我们可以开始提取商品内容了。 但是其中一些文章的内容令人难以置信,其细节令人难以置信……只要看看哈利·波特 ( Harry Potter)或伏地魔(Voldemort)的文章!

If we look at some of the most important characters, we’ll see that they all have a “Personality and traits” section in their article. This seems like a logical place to extract information that the Sorting Hat would use in its decision. Not all characters have a “Personality and traits” section (such as Zakir Akram), so this step will reduce the number of students in our data by a significant amount.

如果我们看一些最重要的角色,我们会发现他们在文章中都有一个“个性和特质”部分。 这似乎是提取排序帽子将在其决策中使用的信息的逻辑位置。 并非所有角色都有“个性和特质”部分(例如Zakir Akram ),因此此步骤将大大减少我们数据中的学生数量。

The following code pulls the “Personality and traits” section from each article and computes the length of that section (i.e. number of text characters). Then it merges that data with our initial dataframe mydf by “id” (this takes a little while to run).

以下代码从每篇文章中提取“个性和特质”部分,并计算该部分的长度(即文本字符的数量)。 然后,它通过“ id”将该数据与我们的初始数据框mydf合并(这需要一些时间才能运行)。

# Creates a new DataFrame with just the students who have a "Personality and traits" section
mydf_relevant = mydf_all[mydf_all['text_len'] > 0]print('Number of useable articles: {}'.format(len(mydf_relevant)))
print('')
mydf_relevant.head()

# Creates a new DataFrame with just the students who have a "Personality and traits" section
mydf_relevant = mydf_all[mydf_all['text_len'] > 0]print('Number of useable articles: {}'.format(len(mydf_relevant)))
print('')
mydf_relevant.head()

Number of useable articles: 94

可用物品数:94

id ID title 标题 url 网址 house text 文本 text_len text_len
689 689 343 343 Tom Riddle 汤姆·里德尔 /wiki/Tom_Riddle / wiki / Tom_Riddle Slytherin 斯莱特林 Voldemort was considered by many to be “the mo… 许多人认为伏地魔是“魔鬼”。 26924 26924
169 169 13 13 Harry Potter 哈利·波特 /wiki/Harry_Potter / wiki / Harry_Potter Gryffindor 格兰芬多 Harry was an extremely brave, loyal, and selfl… 哈里是一个非常勇敢,忠诚和自私的人。 12987 12987
726 726 49 49 Dolores Umbridge 多洛雷斯·乌姆里奇 /wiki/Dolores_Umbridge / wiki / Dolores_Umbridge Slytherin 斯莱特林 Dolores Umbridge was nothing short of a sociop… 多洛雷斯·乌姆里奇(Dolores Umbridge)颇有社交感。 9668 9668
703 703 259 259 Horace Slughorn 霍拉斯·斯拉格霍恩 /wiki/Horace_Slughorn / wiki / Horace_Slughorn Slytherin 斯莱特林 Horace Slughorn was described as having a bumb… 霍拉斯·斯拉格霍恩(Horace Slughorn)被描述为有泡沫。 7944 7944
54 54 4178 4178 Albus Dumbledore 阿不思·邓布利多 /wiki/Albus_Dumbledore / wiki / Albus_Dumbledore Gryffindor 格兰芬多 Considered to be the most powerful wizard of h… 被认为是人类最强大的向导。 7789 7789

第二步:使用NLTK获取霍格沃茨房屋特征 (Step Two: Getting Hogwarts House Characteristics using NLTK)

Now that we have data on a number of students, we want to classify students into different houses. In order to do that, we’ll need a list of the characteristics for each house. We will start with the characteristics on harrypotter.wikia.com.

现在我们有了许多学生的数据,我们希望将学生分类到不同的房屋中。 为了做到这一点,我们需要每个房屋的特性列表。 我们将从harrypotter.wikia.com上的特征开始。

Notice that all of these characteristics are nouns, which is a good thing; we want to be consistent with our traits. Some of the traits on the wiki were non-nouns, so I changed them as follows:

注意所有这些特征都是名词,这是一件好事。 我们希望与我们的特质保持一致。 Wiki上的某些特征是非名词,因此我将其更改如下:

  • “ambitious” (an adjective) – this can be easily changed to ‘ambition’
  • “hard work”, “fair play”, and “unafraid of toil” – these multi-word phrases can also be changed to single-word nouns:
  • “hard work” –> ‘diligence’
  • “fair play” –> ‘fairness’
  • “unafraid of toil” –> ‘persistence’
  • “雄心勃勃”(形容词)–可以很容易地更改为“雄心勃勃”
  • “辛勤工作”,“公平竞争”和“不怕辛苦” –这些多词短语也可以更改为单词名词:
  • “努力” –>“勤奋”
  • “公平竞争” –>“公平”
  • “不怕辛苦” –>“坚持不懈”

Now that we have a list of characteristics for each house, we can simply scan through the “text” column in our DataFrame and count the number of times a characteristic appears. Sounds simple, right?

现在我们有了每个房屋的特征列表,我们可以简单地扫描DataFrame中的“ text”列并计算特征出现的次数。 听起来很简单,对吧?

Unfortunately we aren’t done yet. Take the following sentences from Neville Longbottom’s “Personality and traits” section:

不幸的是我们还没有完成。 从Neville Longbottom的“个性和特质”部分中选取以下句子:

When he was younger, Neville was clumsy, forgetful, shy, and many considered him ill-suited for Gryffindor house because he seemed timid.

内维尔(Neville)年轻时,他笨拙,健忘,害羞,许多人认为他不适合格兰芬多(Gryffindor)的房子,因为他看上去很胆小。

With the support of his friends, to whom he was very loyal, the encouragement of Professor Remus Lupin to face his fears in his third year, and the motivation of knowing his parents’ torturers were on the loose, Neville became braver, more self-assured, and dedicated to the fight against Lord Voldemort and his Death Eaters.

在他非常忠实的朋友的支持下,雷木斯·卢平(Remus Lupine)教授鼓励他在三年级时面对恐惧,并且知道父母的拷打者的动机越来越松散,内维尔变得更加勇敢 ,更加自我放心,并致力于与伏地魔领主及其食死徒的斗争。

The bold words in this passage should be counted towards one of the houses, but they won’t be because they are adjectives. Similarly, words like “bravely” and “braveness” also would not count. In order to make our classification algorithm work properly, we need to identify synonyms, antonyms, and word forms.

此段中的大胆词语应计入其中一所房子,但不会因为它们是形容词而已。 同样, “勇敢”“勇敢”也不会算在内。 为了使我们的分类算法正常工作,我们需要识别同义词,反义词和单词形式。

同义字 (Synonyms)

We can explore synonyms of words using the synsets function in WordNet, a lexical database of English words that is included in the nltk module (“NLTK” stands for Natural Language Toolkit). A “synset”, short for “synonym set”, is a collection of synonymous words, or “lemmas”. The synsets function returns the “synsets” that are associated with a particular word.

我们可以使用WordNet中的synsets函数探索单词的同义词, WordNet是nltk模块中包含的英语单词的词汇数据库(“ NLTK”代表自然语言工具包)。 “同义词集”(synset)是“同义词集”的缩写,是同义词或“ lemmas”的集合。 同义词集函数返回与特定单词关联的“同义词集”。

Confused? So was I when first learned about this material. Let’s run some code and then analyze it.

困惑? 当我第一次了解这种材料时,我也是如此。 让我们运行一些代码,然后对其进行分析。

from nltk.corpus import wordnet as wn# Synsets of differents words
foo1 = wn.synsets('bravery')
print("Synonym sets associated with the word 'bravery': {}".format(foo1))foo2 = wn.synsets('fairness')
print('')
print("Synonym sets associated with the word 'fairness': {}".format(foo2))foo3 = wn.synsets('wit')
print('')
print("Synonym sets associated with the word 'wit': {}".format(foo3))foo4 = wn.synsets('cunning')
print('')
print("Synonym sets associated with the word 'cunning': {}".format(foo4))foo4 = wn.synsets('cunning', pos=wn.NOUN)
print('')
print("Synonym sets associated with the *noun* 'cunning': {}".format(foo4))
print('')# Prints out the synonyms ("lemmas") associated with each synset
foo_list = [foo1, foo2, foo3, foo4]
for foo in foo_list:for synset in foo:print((synset.name(), synset.lemma_names()))

from nltk.corpus import wordnet as wn# Synsets of differents words
foo1 = wn.synsets('bravery')
print("Synonym sets associated with the word 'bravery': {}".format(foo1))foo2 = wn.synsets('fairness')
print('')
print("Synonym sets associated with the word 'fairness': {}".format(foo2))foo3 = wn.synsets('wit')
print('')
print("Synonym sets associated with the word 'wit': {}".format(foo3))foo4 = wn.synsets('cunning')
print('')
print("Synonym sets associated with the word 'cunning': {}".format(foo4))foo4 = wn.synsets('cunning', pos=wn.NOUN)
print('')
print("Synonym sets associated with the *noun* 'cunning': {}".format(foo4))
print('')# Prints out the synonyms ("lemmas") associated with each synset
foo_list = [foo1, foo2, foo3, foo4]
for foo in foo_list:for synset in foo:print((synset.name(), synset.lemma_names()))

Synonym sets associated with the word ‘bravery’: [Synset(‘courage.n.01’), Synset(‘fearlessness.n.01’)]

与单词'bravery'相关的同义词集:[Synset('courage.n.01'),Synset('fearlessness.n.01')]

Synonym sets associated with the word ‘fairness’: [Synset(‘fairness.n.01’), Synset(‘fairness.n.02’), Synset(‘paleness.n.02’), Synset(‘comeliness.n.01’)]

与单词“公平”相关的同义词集:[Synset('fairness.n.01'),Synset('fairness.n.02'),Synset('paleness.n.02'),Synset('comeliness.n .01')]

Synonym sets associated with the word ‘wit’: [Synset(‘wit.n.01’), Synset(‘brain.n.02’), Synset(‘wag.n.01’)]

与单词'wit'相关的同义词集:[Synset('wit.n.01'),Synset('brain.n.02'),Synset('wag.n.01')]

Synonym sets associated with the word ‘cunning’: [Synset(‘craft.n.05’), Synset(‘cunning.n.02’), Synset(‘cunning.s.01’), Synset(‘crafty.s.01’), Synset(‘clever.s.03’)]

与单词'cunning'相关的同义词集:[Synset('craft.n.05'),Synset('cunning.n.02'),Synset('cunning.s.01'),Synset('crafty.s .01'),Synset('clever.s.03')]

Synonym sets associated with the noun ‘cunning’: [Synset(‘craft.n.05’), Synset(‘cunning.n.02’)]

与名词'cunning'相关的同义词集:[Synset('craft.n.05'),Synset('cunning.n.02')]

(‘courage.n.01’, [‘courage’, ‘courageousness’, ‘bravery’, ‘braveness’]) (‘fearlessness.n.01’, [‘fearlessness’, ‘bravery’]) (‘fairness.n.01’, [‘fairness’, ‘equity’]) (‘fairness.n.02’, [‘fairness’, ‘fair-mindedness’, ‘candor’, ‘candour’]) (‘paleness.n.02’, [‘paleness’, ‘blondness’, ‘fairness’]) (‘comeliness.n.01’, [‘comeliness’, ‘fairness’, ‘loveliness’, ‘beauteousness’]) (‘wit.n.01’, [‘wit’, ‘humor’, ‘humour’, ‘witticism’, ‘wittiness’]) (‘brain.n.02’, [‘brain’, ‘brainpower’, ‘learning_ability’, ‘mental_capacity’, ‘mentality’, ‘wit’]) (‘wag.n.01’, [‘wag’, ‘wit’, ‘card’]) (‘craft.n.05’, [‘craft’, ‘craftiness’, ‘cunning’, ‘foxiness’, ‘guile’, ‘slyness’, ‘wiliness’]) (‘cunning.n.02’, [‘cunning’])

('courage.n.01',['courage','勇气','勇敢','勇敢'])('fearlessness.n.01',['fearlessness','bravery'])('公平n.01”,[“公平”,“公平”])(“ fairness.n.02”,[“公平”,“公平思想”,“坦率”,“坦率”))(“ paleness.n。 02',['paleness','blondness','fairness'])('comeliness.n.01',['comeliness','fairness','loveliness,'beauteousness'])('wit.n. 01',['wit','humor','humour','witticism,'wittiness'])('brain.n.02',['brain','brainpower','learning_ability','mental_capacity' ,'mentality','wit'])('wag.n.01',['wag','wit','card'])('craft.n.05',['craft','craftiness' ,“狡猾”,“狡猾”,“狡猾”,“狡猾”,“野蛮”])('cunning.n.02',['cunning'])

Okay, that’s a lot of output, so let’s point out some notes & potential problems:

好的,这是很多输出,所以让我们指出一些注意事项和潜在问题:

  • Typing wn.synsets(‘bravery’) yields two synsets: one for ‘courage.n.01’ and one for ‘fearlessness.n.01’. Let’s dive deeper into what this actually means:
  • The first part (‘courage’ and ‘fearlessness’) is the word the synset is centered around…let’s call it the “center” word. This means that the synonyms (“lemmas”) in the synset all mean the same thing as the center word.
  • The second part (‘n’) stands for “noun”. You can see that the synsets associated with the word “cunning” include ‘crafty.s.01’ and ‘clever.s.03’ (adjectives). These are here because the word “cunning” is both a noun and an adjective. To limit our results to just nouns, we can specify wn.synsets(‘cunning’, pos=wn.NOUN).
  • The third part (’01’) refers to the specific meaning of the center word. For example, ‘fairness’ can mean “conformity with rules or standards” as well as “making judgments free from discrimination or dishonesty”.
  • 键入wn.synsets('bravery')会产生两个同义词集:一个代表'courage.n.01',另一个代表'fearlessness.n.01'。 让我们更深入地了解这实际上意味着什么:
  • 第一部分(“勇气”和“无所畏惧”)是同义词集以中心为中心的词……我们称其为“中心”词。 这意味着同义词集中的同义词(“ lemmas”)与中心词的含义相同。
  • 第二部分(“ n”)代表“名词”。 您会看到与“狡猾”一词相关的同义词集包括“ crafty.s.01”和“ clever.s.03”(形容词)。 这些是因为“狡猾”一词既是名词又是形容词。 为了将结果限制为仅名词,我们可以指定wn.synsets('cunning',pos = wn.NOUN)。
  • 第三部分(“ 01”)是指中心词的具体含义。 例如,“公平”可以表示“符合规则或标准”,也可以“做出无歧视或不诚实的判断”。

We also we see that the synset function gives us some synonym sets that we may not want. The synonym sets associated with the word ‘fairness’ includes ‘paleness.n.02 (“having a naturally light complexion”) and ‘comeliness.n.01’ (“being good looking and attractive”). These are not traits associated with Hufflepuff (although Neville Longbottom grew up to be very handsome), so we need to manually exclude these synsets from our analysis.

我们还看到synset函数为我们提供了一些我们可能不需要的同义词集。 与“公平”一词相关的同义词集包括“ paleness.n.02(具有自然的肤色)”和“ comeliness.n.01”(“具有漂亮的外观和吸引力”)。 这些不是与赫奇帕奇(Hufflepuff)相关的特征(尽管内维尔·朗博托( Neville Longbottom)长大后非常帅气),因此我们需要从分析中手动排除这些同义词。

Translation: getting synonyms is harder than it looks

翻译:获得同义词比看起来困难

反义词和单词形式 (Antonyms and Word Forms)

After we get all the synonyms (which we’ll actually do in a moment), we also need to worry about the antonyms (words opposite in meaning) and different word forms (“brave”, “bravely”, and “braver” for “bravery”). We can do a lot of the heavy work in nltk, but we will also have to manually create adverbs and comparative / superlative adjectives.

在获得所有同义词(稍后我们将实际完成)之后,我们还需要担心反义词(含义相反的单词)和不同的单词形式(“勇敢”,“勇敢”和“勇敢”) “勇敢”)。 我们可以在nltk中完成许多繁重的工作,但是我们还必须手动创建副词和比较/最高级形容词。

Synset: courage.n.01; Lemma: courage; Antonyms: [Lemma(‘cowardice.n.01.cowardice’)]; Word Forms: [Lemma(‘brave.a.01.courageous’)]

同义词:courage.n.01; 引理:勇气; 反义词:[Lemma('cowardice.n.01.cowardice')]; 字形:[Lemma('brave.a.01.courageous')]

Synset: courage.n.01; Lemma: courageousness; Antonyms: []; Word Forms: [Lemma(‘brave.a.01.courageous’)]

同义词:courage.n.01; 引理:勇气; 反义词:[]; 字形:[Lemma('brave.a.01.courageous')]

Synset: courage.n.01; Lemma: bravery; Antonyms: []; Word Forms: []

同义词:courage.n.01; 引理:勇敢; 反义词:[]; 字形:[]

Synset: courage.n.01; Lemma: braveness; Antonyms: []; Word Forms: [Lemma(‘brave.a.01.brave’), Lemma(‘audacious.s.01.brave’)]

同义词:courage.n.01; 引理:勇敢; 反义词:[]; 字形:[Lemma('brave.a.01.brave'),Lemma('audacious.s.01.brave')]

Synset: fearlessness.n.01; Lemma: fearlessness; Antonyms: [Lemma(‘fear.n.01.fear’)]; Word Forms: [Lemma(‘audacious.s.01.fearless’), Lemma(‘unafraid.a.01.fearless’)]

同义词:无畏n.01; 引理:无所畏惧; 反义词:[Lemma('fear.n.01.fear')]; 字形:[Lemma('audacious.s.01.fearless'),Lemma('unafraid.a.01.fearless')]

Synset: fearlessness.n.01; Lemma: bravery; Antonyms: []; Word Forms: []

同义词:无畏n.01; 引理:勇敢; 反义词:[]; 字形:[]

放在一起 (Putting it all together)

The following code creates a list of the synonyms, antonyms, and words forms for each of the house traits described earlier. To make sure we’re exhaustive, some of these might not actually be correctly-spelled English words.

以下代码为前面描述的每个房屋特征创建一个同义词,反义词和单词形式的列表。 为确保我们详尽无遗,其中一些可能实际上不是正确拼写的英语单词。

# Manually select the synsets that are relevant to us
relevant_synsets = {}
relevant_synsets['Ravenclaw'] = [wn.synset('intelligence.n.01'), wn.synset('wit.n.01'), wn.synset('brain.n.02'),wn.synset('wisdom.n.01'), wn.synset('wisdom.n.02'), wn.synset('wisdom.n.03'),wn.synset('wisdom.n.04'), wn.synset('creativity.n.01'), wn.synset('originality.n.01'),wn.synset('originality.n.02'), wn.synset('individuality.n.01'), wn.synset('credence.n.01'),wn.synset('acceptance.n.03')]
relevant_synsets['Hufflepuff'] = [wn.synset('dedication.n.01'), wn.synset('commitment.n.04'), wn.synset('commitment.n.02'),wn.synset('diligence.n.01'), wn.synset('diligence.n.02'), wn.synset('application.n.06'),wn.synset('fairness.n.01'), wn.synset('fairness.n.01'), wn.synset('patience.n.01'),wn.synset('kindness.n.01'), wn.synset('forgivingness.n.01'), wn.synset('kindness.n.03'),wn.synset('tolerance.n.03'), wn.synset('tolerance.n.04'), wn.synset('doggedness.n.01'),wn.synset('loyalty.n.01'), wn.synset('loyalty.n.02')]
relevant_synsets['Gryffindor'] = [wn.synset('courage.n.01'), wn.synset('fearlessness.n.01'), wn.synset('heart.n.03'),wn.synset('boldness.n.02'), wn.synset('chivalry.n.01'), wn.synset('boldness.n.01')]
relevant_synsets['Slytherin'] = [wn.synset('resourcefulness.n.01'), wn.synset('resource.n.03'), wn.synset('craft.n.05'),wn.synset('cunning.n.02'), wn.synset('ambition.n.01'), wn.synset('ambition.n.02'),wn.synset('determination.n.02'), wn.synset('determination.n.04'),wn.synset('self-preservation.n.01'), wn.synset('brotherhood.n.02'),wn.synset('inventiveness.n.01'), wn.synset('brightness.n.02'), wn.synset('ingenuity.n.02')]# Function that will get the different word forms from a lemma
def get_forms(lemma):drfs = lemma.derivationally_related_forms()output_list = []if drfs:for drf in drfs:drf_pos = str(drf).split(".")[1]if drf_pos in ['n', 's', 'a']:output_list.append(drf.name().lower())if drf_pos in ['s', 'a']:# Adverbs + "-ness" nouns + comparative & superlative adjectivesif len(drf.name()) == 3:last_letter = drf.name()[-1:]output_list.append(drf.name().lower() + last_letter + 'er')output_list.append(drf.name().lower() + last_letter + 'est')output_list.append(drf.name().lower()+'ness')output_list.append(drf.name().lower()+'ly')elif drf.name()[-4:] in ['able', 'ible']:output_list.append(drf.name().lower()+'r')output_list.append(drf.name().lower()+'st')output_list.append(drf.name().lower()+'ness')output_list.append(drf.name()[:-1].lower()+'y')elif drf.name()[-1:] == 'e':output_list.append(drf.name().lower()+'r')output_list.append(drf.name().lower()+'st')output_list.append(drf.name().lower()+'ness')output_list.append(drf.name().lower()+'ly')elif drf.name()[-2:] == 'ic':output_list.append(drf.name().lower()+'er')output_list.append(drf.name().lower()+'est')output_list.append(drf.name().lower()+'ness')output_list.append(drf.name().lower()+'ally')elif drf.name()[-1:] == 'y':output_list.append(drf.name()[:-1].lower()+'ier')output_list.append(drf.name()[:-1].lower()+'iest')output_list.append(drf.name()[:-1].lower()+'iness')output_list.append(drf.name()[:-1].lower()+'ily')else:output_list.append(drf.name().lower()+'er')output_list.append(drf.name().lower()+'est')output_list.append(drf.name().lower()+'ness')output_list.append(drf.name().lower()+'ly')return output_listelse:return output_list# Creates a copy of our trait dictionary
# If we don't do this, then we constantly update the dictariony we are looping through, causing an infinite loop
import copy
new_trait_dict = copy.deepcopy(trait_dict)
antonym_dict = {}# Add synonyms and word forms to the (new) trait dictionary; also add antonyms (and their word forms) to the antonym dictionary
for house, traits in trait_dict.items():antonym_dict[house] = []for trait in traits:synsets = wn.synsets(trait, pos=wn.NOUN)for synset in synsets:if synset in relevant_synsets[house]:for lemma in synset.lemmas():new_trait_dict[house].append(lemma.name().lower())if get_forms(lemma):new_trait_dict[house].extend(get_forms(lemma))if lemma.antonyms():for ant in lemma.antonyms():antonym_dict[house].append(ant.name().lower())if get_forms(ant):antonym_dict[house].extend(get_forms(ant))new_trait_dict[house] = sorted(list(set(new_trait_dict[house])))antonym_dict[house] = sorted(list(set(antonym_dict[house])))# Print some of our results
print("Gryffindor traits: {}".format(new_trait_dict['Gryffindor']))
print("")
print("Gryffindor anti-traits: {}".format(antonym_dict['Gryffindor']))
print("")

# Manually select the synsets that are relevant to us
relevant_synsets = {}
relevant_synsets['Ravenclaw'] = [wn.synset('intelligence.n.01'), wn.synset('wit.n.01'), wn.synset('brain.n.02'),wn.synset('wisdom.n.01'), wn.synset('wisdom.n.02'), wn.synset('wisdom.n.03'),wn.synset('wisdom.n.04'), wn.synset('creativity.n.01'), wn.synset('originality.n.01'),wn.synset('originality.n.02'), wn.synset('individuality.n.01'), wn.synset('credence.n.01'),wn.synset('acceptance.n.03')]
relevant_synsets['Hufflepuff'] = [wn.synset('dedication.n.01'), wn.synset('commitment.n.04'), wn.synset('commitment.n.02'),wn.synset('diligence.n.01'), wn.synset('diligence.n.02'), wn.synset('application.n.06'),wn.synset('fairness.n.01'), wn.synset('fairness.n.01'), wn.synset('patience.n.01'),wn.synset('kindness.n.01'), wn.synset('forgivingness.n.01'), wn.synset('kindness.n.03'),wn.synset('tolerance.n.03'), wn.synset('tolerance.n.04'), wn.synset('doggedness.n.01'),wn.synset('loyalty.n.01'), wn.synset('loyalty.n.02')]
relevant_synsets['Gryffindor'] = [wn.synset('courage.n.01'), wn.synset('fearlessness.n.01'), wn.synset('heart.n.03'),wn.synset('boldness.n.02'), wn.synset('chivalry.n.01'), wn.synset('boldness.n.01')]
relevant_synsets['Slytherin'] = [wn.synset('resourcefulness.n.01'), wn.synset('resource.n.03'), wn.synset('craft.n.05'),wn.synset('cunning.n.02'), wn.synset('ambition.n.01'), wn.synset('ambition.n.02'),wn.synset('determination.n.02'), wn.synset('determination.n.04'),wn.synset('self-preservation.n.01'), wn.synset('brotherhood.n.02'),wn.synset('inventiveness.n.01'), wn.synset('brightness.n.02'), wn.synset('ingenuity.n.02')]# Function that will get the different word forms from a lemma
def get_forms(lemma):drfs = lemma.derivationally_related_forms()output_list = []if drfs:for drf in drfs:drf_pos = str(drf).split(".")[1]if drf_pos in ['n', 's', 'a']:output_list.append(drf.name().lower())if drf_pos in ['s', 'a']:# Adverbs + "-ness" nouns + comparative & superlative adjectivesif len(drf.name()) == 3:last_letter = drf.name()[-1:]output_list.append(drf.name().lower() + last_letter + 'er')output_list.append(drf.name().lower() + last_letter + 'est')output_list.append(drf.name().lower()+'ness')output_list.append(drf.name().lower()+'ly')elif drf.name()[-4:] in ['able', 'ible']:output_list.append(drf.name().lower()+'r')output_list.append(drf.name().lower()+'st')output_list.append(drf.name().lower()+'ness')output_list.append(drf.name()[:-1].lower()+'y')elif drf.name()[-1:] == 'e':output_list.append(drf.name().lower()+'r')output_list.append(drf.name().lower()+'st')output_list.append(drf.name().lower()+'ness')output_list.append(drf.name().lower()+'ly')elif drf.name()[-2:] == 'ic':output_list.append(drf.name().lower()+'er')output_list.append(drf.name().lower()+'est')output_list.append(drf.name().lower()+'ness')output_list.append(drf.name().lower()+'ally')elif drf.name()[-1:] == 'y':output_list.append(drf.name()[:-1].lower()+'ier')output_list.append(drf.name()[:-1].lower()+'iest')output_list.append(drf.name()[:-1].lower()+'iness')output_list.append(drf.name()[:-1].lower()+'ily')else:output_list.append(drf.name().lower()+'er')output_list.append(drf.name().lower()+'est')output_list.append(drf.name().lower()+'ness')output_list.append(drf.name().lower()+'ly')return output_listelse:return output_list# Creates a copy of our trait dictionary
# If we don't do this, then we constantly update the dictariony we are looping through, causing an infinite loop
import copy
new_trait_dict = copy.deepcopy(trait_dict)
antonym_dict = {}# Add synonyms and word forms to the (new) trait dictionary; also add antonyms (and their word forms) to the antonym dictionary
for house, traits in trait_dict.items():antonym_dict[house] = []for trait in traits:synsets = wn.synsets(trait, pos=wn.NOUN)for synset in synsets:if synset in relevant_synsets[house]:for lemma in synset.lemmas():new_trait_dict[house].append(lemma.name().lower())if get_forms(lemma):new_trait_dict[house].extend(get_forms(lemma))if lemma.antonyms():for ant in lemma.antonyms():antonym_dict[house].append(ant.name().lower())if get_forms(ant):antonym_dict[house].extend(get_forms(ant))new_trait_dict[house] = sorted(list(set(new_trait_dict[house])))antonym_dict[house] = sorted(list(set(antonym_dict[house])))# Print some of our results
print("Gryffindor traits: {}".format(new_trait_dict['Gryffindor']))
print("")
print("Gryffindor anti-traits: {}".format(antonym_dict['Gryffindor']))
print("")

Gryffindor traits: [‘bold’, ‘bolder’, ‘boldest’, ‘boldly’, ‘boldness’, ‘brass’, ‘brassier’, ‘brassiest’, ‘brassily’, ‘brassiness’, ‘brassy’, ‘brave’, ‘bravely’, ‘braveness’, ‘braver’, ‘bravery’, ‘bravest’, ‘cheek’, ‘cheekier’, ‘cheekiest’, ‘cheekily’, ‘cheekiness’, ‘cheeky’, ‘chivalry’, ‘courage’, ‘courageous’, ‘courageouser’, ‘courageousest’, ‘courageously’, ‘courageousness’, ‘daring’, ‘face’, ‘fearless’, ‘fearlesser’, ‘fearlessest’, ‘fearlessly’, ‘fearlessness’, ‘gallantry’, ‘hardihood’, ‘hardiness’, ‘heart’, ‘mettle’, ‘nerve’, ‘nervier’, ‘nerviest’, ‘nervily’, ‘nerviness’, ‘nervy’, ‘politesse’, ‘spunk’, ‘spunkier’, ‘spunkiest’, ‘spunkily’, ‘spunkiness’, ‘spunky’]

格兰芬多(Gryffindor)特质:[“大胆”,“笨拙”,“最大胆”,“大胆”,“大胆”,“黄铜”,“胸肌”,“最胸肌”,“胸肌”,“胸肌”,“粗鲁”,“勇敢”) ','勇敢','勇敢','勇敢','勇敢','勇敢','脸颊','cheekier','cheekiest','cheekily','cheekiness','cheeky','骑士精神, “勇气”,“勇气”,“勇气”,“最勇敢”,“勇气”,“勇气”,“大胆”,“脸”,“无畏”,“无所畏惧”,“最恐惧”,“无所畏惧”,“无所畏惧” ”,“ gallantry”,“ hardihood”,“ hardness”,“ heart”,“ mttle”,“ nerve”,“ nervier”,“ nerviest”,“ nervily”,“ nerviness”,“ nervy”,“ politesse”, 'spunk','spunkier','spunkiest','spunkily','spunkiness','spunky']

Gryffindor anti-traits: [‘cowardice’, ‘fear’, ‘timid’, ‘timider’, ‘timidest’, ‘timidity’, ‘timidly’, ‘timidness’]

格兰芬多反特征:[“怯ward”,“恐惧”,“胆怯”,“怯tim”,“最胆怯”,“胆怯”,“胆怯”,“胆怯”]

Any words overlap in trait dictionary? False Any words overlap in antonym dictionary? False

特质字典中是否有任何单词重叠? False反义词字典中有单词重叠吗? 假

步骤3:将学生分类到房屋中 (Step 3: Sorting Students into Houses)

The time has finally come to sort students into their houses! Our classification algorithm will work like this:

现在终于到了将学生分类到他们家中的时候了! 我们的分类算法将像这样工作:

  • For each student, go through their “Personality and traits” section word by word
  • If a word appears in a house’s trait list, then we add 1 to that house’s score
  • Similarly, if a word appears in a house’s anti-trait list, then we subtract 1 from that house’s score
  • The house with the highest score is the one we assign the student to
  • If there is a tie, we will simply output “Tie!”
  • 对于每个学生,逐字逐一浏览其“个性和特质”部分
  • 如果某个单词出现在房屋的特征列表中,则我们在该房屋的分数中加1
  • 同样,如果某个字词出现在房屋的反特征列表中,那么我们将从该房屋的分数中减去1
  • 分数最高的房子是我们分配给学生的房子
  • 如果有平局,我们将只输出“平局!”

For example, if a character’s “Personality and traits” section was just the sentence “Alice was brave”, then Alice would have a score of 1 for Gryffindor and zero for all other houses; we would sort Alice into Gryffindor.

例如,如果角色的“个性和特质”部分只是句子“爱丽丝很勇敢”,那么爱丽丝对格兰芬多的得分为1,而对其他所有房子的得分为零; 我们将爱丽丝归类为格兰芬多。

# Imports "word_tokenize", which breaks up sentences into words and punctuation
from nltk import word_tokenize# Function that sorts the students
def sort_student(text):text_list = word_tokenize(text)text_list = [word.lower() for word in text_list]score_dict = {}houses = ['Gryffindor', 'Hufflepuff', 'Ravenclaw', 'Slytherin']for house in houses:score_dict[house] = (sum([True for word in text_list if word in new_trait_dict[house]]) -sum([True for word in text_list if word in antonym_dict[house]]))sorted_house = max(score_dict, key=score_dict.get)sorted_house_score = score_dict[sorted_house]if sum([True for i in score_dict.values() if i==sorted_house_score]) == 1:return sorted_houseelse:return "Tie!"# Test our function
print(sort_student('Alice was brave'))
print(sort_student('Alice was British'))

# Imports "word_tokenize", which breaks up sentences into words and punctuation
from nltk import word_tokenize# Function that sorts the students
def sort_student(text):text_list = word_tokenize(text)text_list = [word.lower() for word in text_list]score_dict = {}houses = ['Gryffindor', 'Hufflepuff', 'Ravenclaw', 'Slytherin']for house in houses:score_dict[house] = (sum([True for word in text_list if word in new_trait_dict[house]]) -sum([True for word in text_list if word in antonym_dict[house]]))sorted_house = max(score_dict, key=score_dict.get)sorted_house_score = score_dict[sorted_house]if sum([True for i in score_dict.values() if i==sorted_house_score]) == 1:return sorted_houseelse:return "Tie!"# Test our function
print(sort_student('Alice was brave'))
print(sort_student('Alice was British'))

Gryffindor Tie!

格兰芬多领带!

Our function seems to work, so let’s apply it to our data and see what we get!

我们的函数似乎起作用了,所以让我们将其应用于我们的数据,看看我们得到了什么!

id ID title 标题 url 网址 house text 文本 text_len text_len new_house 新房子
689 689 343 343 Tom Riddle 汤姆·里德尔 /wiki/Tom_Riddle / wiki / Tom_Riddle Slytherin 斯莱特林 Voldemort was considered by many to be “the mo… 许多人认为伏地魔是“魔鬼”。 26924 26924 Hufflepuff 赫奇帕奇
169 169 13 13 Harry Potter 哈利·波特 /wiki/Harry_Potter / wiki / Harry_Potter Gryffindor 格兰芬多 Harry was an extremely brave, loyal, and selfl… 哈里是一个非常勇敢,忠诚和自私的人。 12987 12987 Ravenclaw 拉文克劳
726 726 49 49 Dolores Umbridge 多洛雷斯·乌姆里奇 /wiki/Dolores_Umbridge / wiki / Dolores_Umbridge Slytherin 斯莱特林 Dolores Umbridge was nothing short of a sociop… 多洛雷斯·乌姆里奇(Dolores Umbridge)颇有社交感。 9668 9668 Ravenclaw 拉文克劳
703 703 259 259 Horace Slughorn 霍拉斯·斯拉格霍恩 /wiki/Horace_Slughorn / wiki / Horace_Slughorn Slytherin 斯莱特林 Horace Slughorn was described as having a bumb… 霍拉斯·斯拉格霍恩(Horace Slughorn)被描述为有泡沫。 7944 7944 Slytherin 斯莱特林
54 54 4178 4178 Albus Dumbledore 阿不思·邓布利多 /wiki/Albus_Dumbledore / wiki / Albus_Dumbledore Gryffindor 格兰芬多 Considered to be the most powerful wizard of h… 被认为是人类最强大的向导。 7789 7789 Hufflepuff 赫奇帕奇
709 709 33 33 Severus Snape 西弗勒斯·斯内普 /wiki/Severus_Snape / wiki / Severus_Snape Slytherin 斯莱特林 At times, Snape could appear cold, cynical, ma… 有时,斯内普可能会显得冷漠,愤世嫉俗,残酷…… 6894 6894 Ravenclaw 拉文克劳
164 164 331 331 Peter Pettigrew 彼得·佩蒂格鲁(Peter Pettigrew) /wiki/Peter_Pettigrew / wiki / Peter_Pettigrew Gryffindor 格兰芬多 Peter Pettigrew was characterised by weakness…. 彼得·佩蒂格鲁(Peter Pettigrew)的特点是虚弱…… 6600 6600 Gryffindor 格兰芬多
230 230 14 14 Ronald Weasley 罗纳德·韦斯莱 /wiki/Ronald_Weasley / wiki / Ronald_Weasley Gryffindor 格兰芬多 Ron was a very funny person, but often emotion… 罗恩是一个非常有趣的人,但常常情绪激动…… 6078 6078 Ravenclaw 拉文克劳
646 646 16 16 Draco Malfoy 德拉科·马尔福 /wiki/Draco_Malfoy / wiki /德拉科_马尔福 Slytherin 斯莱特林 Draco was, in general, an arrogant, spiteful b… 一般来说,Draco是一个傲慢,恶意的人。 5435 5435 Tie! 领带!
468 468 53 53 Gilderoy Lockhart 吉尔德罗伊·洛克哈特(Gilderoy Lockhart) /wiki/Gilderoy_Lockhart / wiki / Gilderoy_Lockhart Ravenclaw 拉文克劳 Gilderoy Lockhart’s defining characteristics w… Gilderoy Lockhart的标志性特征是… 5167 5167 Slytherin 斯莱特林
84 84 47 47 Rubeus Hagrid 鲁比乌斯·海格 /wiki/Rubeus_Hagrid / wiki / Rubeus_Hagrid Gryffindor 格兰芬多 Hagrid was an incredibly warm, kind-hearted ma… 海格是一位非常热情,善良的母亲。 4884 4884 Hufflepuff 赫奇帕奇
76 76 15 15 Hermione Granger 赫敏·格兰杰 /wiki/Hermione_Granger / wiki / Hermione_Granger Gryffindor 格兰芬多 Hermione was noted for being extremely intelli… 赫敏因非常聪明而闻名。 4648 4648 Tie! 领带!
114 114 52 52 Remus Lupin 雷木斯·卢平 /wiki/Remus_Lupin / wiki / Remus_Lupin Gryffindor 格兰芬多 Remus was compassionate, intelligent, tolerant… 雷木思(Remus)富有同情心,聪明,宽容…… 4321 4321 Hufflepuff 赫奇帕奇
223 223 26 26 Arthur Weasley 亚瑟·韦斯莱 /wiki/Arthur_Weasley / wiki /亚瑟·韦斯莱 Gryffindor 格兰芬多 While Arthur Weasley was often seen as “fun” i… 虽然亚瑟·韦斯莱(Arthur Weasley)经常被视为“有趣”,但我… 4316 4316 Slytherin 斯莱特林
679 679 5091 5091 Albus Potter 阿不思·波特 /wiki/Albus_Potter / wiki / Albus_Potter Slytherin 斯莱特林 Albus was a quiet, kind, and thoughtful young … 阿不思是个安静,善良,体贴的年轻人…… 3522 3522 Tie! 领带!
23 23 31 31 Sirius Black 小天狼星·布莱克 /wiki/Sirius_Black / wiki / Sirius_Black Gryffindor 格兰芬多 Sirius was true to the ideal of a Gryffindor s… Sirius忠于Gryffindor系列的理想。 3483 3483 Hufflepuff 赫奇帕奇
131 131 32 32 Minerva McGonagall 密涅瓦·麦格教授 /wiki/Minerva_McGonagall / wiki / Minerva_McGonagall Gryffindor 格兰芬多 Minerva almost constantly exuded magnanimity a… 密涅瓦几乎不断散发着宽宏大量的气息。 3188 3188 Hufflepuff 赫奇帕奇
227 227 25 25 Ginevra Weasley 吉内夫拉·韦斯莱(Ginevra Weasley) /wiki/Ginevra_Weasley / wiki / Ginevra_Weasley Gryffindor 格兰芬多 Ginny was a forceful, independent girl who oft… 金妮是一个有力,独立的女孩,经常... 3113 3113 Tie! 领带!
229 229 30 30 Percy Weasley 珀西·韦斯莱 /wiki/Percy_Weasley / wiki / Percy_Weasley Gryffindor 格兰芬多 Percy was extremely ambitious and dedicated to… Percy雄心勃勃,致力于…… 3099 3099 Ravenclaw 拉文克劳
647 647 313 313 Lucius Malfoy 卢修斯·马尔福(Lucius Malfoy) /wiki/Lucius_Malfoy / wiki / Lucius_Malfoy Slytherin 斯莱特林 Despite being the embodiment of wealth and inf… 尽管是财富和通胀的体现 3069 3069 Ravenclaw 拉文克劳
print("Match rate: {}".format(sum(mydf_relevant['house'] == mydf_relevant['new_house']) / len(mydf_relevant)))
print("Percentage of ties: {}".format(sum(mydf_relevant['new_house'] == 'Tie!') / len(mydf_relevant)))

print("Match rate: {}".format(sum(mydf_relevant['house'] == mydf_relevant['new_house']) / len(mydf_relevant)))
print("Percentage of ties: {}".format(sum(mydf_relevant['new_house'] == 'Tie!') / len(mydf_relevant)))

Match rate: 0.2553191489361702 Percentage of ties: 0.32978723404255317

匹配率:0.2553191489361702领带百分比:0.32978723404255317

Hmmm. Those are not the results we were expecting. Let’s try to investigate why Voldemort was sorted into Hufflepuff.

嗯 这些不是我们期望的结果。 让我们尝试调查为什么伏地魔被归类为赫奇帕奇。

{‘Slytherin’: [‘ambition’], ‘Ravenclaw’: [‘intelligent’, ‘intelligent’, ‘mental’, ‘individual’, ‘mental’, ‘intelligent’], ‘Hufflepuff’: [‘kind’, ‘loyalty’, ‘true’, ‘true’, ‘true’, ‘loyalty’], ‘Gryffindor’: [‘brave’, ‘face’, ‘bold’, ‘face’, ‘bravery’, ‘brave’, ‘courageous’, ‘bravery’]}

{'Slytherin':['ambition'],'Ravenclaw':['intelligent','intelligent','mental','individual','mental','intelligent'],'Hufflepuff':['kind', “忠诚度”,“真实”,“真实”,“真实”,“忠诚度”],“格兰芬多”:[“勇敢”,“脸”,“大胆”,“脸”,“勇敢”,“勇敢”, '勇敢','勇敢']}

{‘Slytherin’: [], ‘Ravenclaw’: [‘common’], ‘Hufflepuff’: [], ‘Gryffindor’: [‘fear’, ‘fear’, ‘fear’, ‘fear’, ‘fear’, ‘fear’, ‘cowardice’, ‘fear’, ‘fear’]}

{'Slytherin':[],'Ravenclaw':['common'],'Hufflepuff':[],'Gryffindor':['fear','fear','fear','fear','fear', “恐惧”,“怯ward”,“恐惧”,“恐惧”]}

As you can see, Slytherin had a score of (1-0) = 1, Ravenclaw had (6-1) = 5, Hufflepuff had (6-0) = 6, and Gryffindor had (8-9) = -1.

如您所见,斯莱特林的得分为(1-0)= 1,拉文克劳的得分为(6-1)= 5,赫奇帕奇的得分为(6-0)= 6,格兰芬多的得分为(8-9)= -1。

It’s also interesting to note that Voldemort’s “Personality and Traits section”, which is the longest of any student, matched with only 31 words in our synonym and antonym dictionaries, which means that other students probably had much lower matched word counts. This means that we are making our classification decision off very little data, which explains the misclaffication rate and the high number of ties.

还有趣的是,Voldemort的“个性和特质”部分是所有学生中最长的,在我们的同义词和反义词词典中仅匹配31个单词,这意味着其他学生的匹配单词数可能要低得多。 这意味着我们要根据很少的数据来做出分类决策,这可以解释错配率和高数量的联系。

结论 (Conclusions)

The classifier we built is not very successful (we do slightly better than simplying guessing), but we have to consider that our approach was pretty simplistic. Modern email spam filters are very sophistocated and don’t just classify based on the presence of certain words, so future improvements to our algorithm should similarly take into account more information. Here’s a short list of ideas for future enhancements:

我们构建的分类器不是很成功(我们比简单的猜测做得更好),但是我们必须考虑到我们的方法非常简单。 现代的电子邮件垃圾邮件过滤器非常老练,不仅根据某些单词的存在进行分类,因此将来对我们算法的改进应同样考虑到更多信息。 以下是有关未来增强功能的简短建议列表:

  • Consider which houses other family members were placed
  • Use other sections of the the Harry Potter wiki articles, like “Early Life” or the abstract at the beginning of the article
  • Instead of taking a small list of traits and their synonyms, create a list of the most frequent words in the “Personality and traits” section for each house and classify based on that.
  • Employ more sophisticated text-analysis techniques like sentiment analysis
  • 考虑其他家庭成员安置的房屋
  • 使用《哈利·波特》维基文章的其他部分,例如“早期生活”或本文开头的摘要。
  • 与其列出一小部分特征及其同义词,不如在“个性和特征”部分为每个房屋创建一个最常用的单词列表,然后根据该列表进行分类。
  • 使用更复杂的文本分析技术,例如情感分析

However, we did learn a lot about APIs and nltk in the process, so at the end of the day I’m calling it a win. Now that we have these tools in our pocket, we have a solid base for future endeavours and can go out and conquer Python just like Neville conquered Nagini.

但是,我们在此过程中确实学到了很多关于API和nltk的知识,所以最终,我称其为胜利。 现在我们已经有了这些工具,我们为将来的工作奠定了坚实的基础,并且可以像Neville征服的Nagini一样征服Python。

翻译自: https://www.pybloggers.com/2017/03/a-magical-introduction-to-classification-algorithms/

编程神奇算法

编程神奇算法_分类算法的神奇介绍相关推荐

  1. 基于python的贝叶斯分类算法_分类算法-朴素贝叶斯

    朴素贝叶斯分类器(Naive Bayes Classifier, NBC)发源于古典数学理论,有着坚实的数学基础,以及稳定的分类效率.同时,NBC 模型所需估计的参数很少,对缺失数据不太敏感,算法也比 ...

  2. 算法杂货铺——分类算法之决策树(Decision tree)

    算法杂货铺--分类算法之决策树(Decision tree) 2010-09-19 16:30 by T2噬菌体, 88978 阅读, 29 评论, 收藏, 编辑 3.1.摘要 在前面两篇文章中,分别 ...

  3. 常用十大算法_回溯算法

    回溯算法 回溯算法已经在前面详细的分析过了,详见猛击此处. 简单的讲: 回溯算法是一种局部暴力的枚举算法 循环中,若条件满足,进入递归,开启下一次流程,若条件不满足,就不进行递归,转而进行上一次流程. ...

  4. 算法杂货铺——分类算法之贝叶斯网络(Bayesian networks)

    算法杂货铺--分类算法之贝叶斯网络(Bayesian networks) 2010-09-18 22:50 by T2噬菌体, 66011 阅读, 25 评论, 收藏, 编辑 2.1.摘要 在上一篇文 ...

  5. k近邻算法(KNN)-分类算法

    k近邻算法(KNN)-分类算法 1 概念 定义:如果一个样本在特征空间中的k个最相似(即特征空间中最邻近)的样本中的大多数属于某一个类别,则该样本也属于这个类别. k-近邻算法采用测量不同特征值之间的 ...

  6. 数据挖掘算法——常用分类算法总结

    常用分类算法总结 分类算法 NBC算法 LR算法 SVM算法 ID3算法 C4.5 算法 C5.0算法 KNN 算法 ANN 算法 分类算法 分类是在一群已经知道类别标号的样本中,训练一种分类器,让其 ...

  7. cb32a_c++_STL_算法_查找算法_(5)adjacent_find

    cb32a_c++_STL_算法_查找算法_(5)adjacent_find adjacent_find(b,e),b,begin(),e,end() adjacent_find(b,e,p),p-p ...

  8. NLP专栏简介:数据增强、智能标注、意图识别算法|多分类算法、文本信息抽取、多模态信息抽取、可解释性分析、性能调优、模型压缩算法等

    NLP专栏简介:数据增强.智能标注.意图识别算法|多分类算法.文本信息抽取.多模态信息抽取.可解释性分析.性能调优.模型压缩算法等 专栏链接:NLP领域知识+项目+码源+方案设计 订阅本专栏你能获得什 ...

  9. 分类算法列一下有多少种?应用场景?分类算法介绍、常见分类算法优缺点、如何选择分类算法、分类算法评估

    分类算法 分类算法介绍 概念 分类算法 常见分类算法 NBS LR SVM算法 ID3算法 C4.5 算法 C5.0算法 KNN 算法 ANN 算法 选择分类算法 分类算法性能评估 分类算法介绍 概念 ...

最新文章

  1. UVa1587 Box(排序)
  2. BZOJ3262/Luogu3810 陌上花开 (三维偏序,CDQ)
  3. 认清楚服务器的真正身份--深入ARP工作原理
  4. Hyperledger Fabric VS Ethereum
  5. 深入理解gtest C/C++单元测试经验谈
  6. “兼职”运维的常用命令
  7. JMeter插件模拟发送UDP请求:UDP sampler
  8. mysql多表查询方式_MySQL多表查询方式问题
  9. Docker 常用命令,还有谁不会?
  10. JAVA轻量级ORM框架JOOQ体验
  11. 新概念模拟电路_第一册_晶体管_读书笔记
  12. 移动光猫上插usb储存设备在终端系统中该如何设置才能共享里面的文件。新手,小白,求大神指点
  13. MFC+HPSocket+log4cplus的TCP助手(三、HPSocket)
  14. Android MediaProjection 代码分析
  15. AD19 DRC 时弹出 Design contains shelved or modified (but not repoured) polygons
  16. 微信扫码登录自定义二维码显示信息
  17. C++ 实验3-2本月有几天?
  18. 图的深度(DFS)/广度优先搜索算法(BFS)/Dijkstra
  19. 使用 OSquery 和 YARA 进行审计
  20. EditPlus安装Json格式化工具功能

热门文章

  1. 【译】10 款国外实用、有趣的 GitHub 简介 README
  2. c语言招生信息查询系统,《C语言程序设计》课程设计报-招生信息查询系统.docx...
  3. security的密码加密
  4. Security安全框架
  5. 经济基础知识(初级)【2】
  6. 怎么在网页上嵌入新浪微博页面
  7. [Mesh Order]lumerical MODE软件EME Solver中结构重叠区域(clapping)求解过程中的优先级问题
  8. 网站出现大量垃圾反链的原因以及解决方法
  9. FFmpeg 视频处理工具用法
  10. 在笔记本电脑上运行塔克机器人的语音播报功能