Python3 Natural Language Processing (5): Preprocessing

Note: For reprint permission, please contact the blogger, or follow the WeChat official account "引文空间" and submit a reprint request via the backstage message feature, then wait for a reply. Unauthorized copying will be reported as plagiarism!

1. Tokenization
When a document or a long string needs to be processed, the first thing to do is split it into individual words and punctuation marks. We call this process tokenization. In this section we will look at the types of tokenizers available in NLTK and how to use them.
Create a file named tokenizer.py and add the following code:


from nltk.tokenize import LineTokenizer,SpaceTokenizer,TweetTokenizer
from nltk import word_tokenize

We will start with LineTokenizer. Add the following three lines of code:

str1='My name is Maximus Decimus, commander of the Armies of the North, General of the Felix Legions and loyal servant to the true emperor, Marcus Aurelius. \nFather to a murdered son, husband to a murdered wife. \nAnd I will have my vengeance, in this life or the next.'
lTokenizer=LineTokenizer()
print('Line tokenizer output:',lTokenizer.tokenize(str1))

As the name suggests, this tokenizer splits the input string into lines (not sentences). Let's look at the tokenizer's output:

Line tokenizer output: ['My name is Maximus Decimus, commander of the Armies of the North, General of the Felix Legions and loyal servant to the true emperor, Marcus Aurelius. ', 'Father to a murdered son, husband to a murdered wife. ', 'And I will have my vengeance, in this life or the next.']

As shown above, it returned a list of three strings, which means the given input was split into three lines at the positions of the newline characters. LineTokenizer simply splits the input string into lines.
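
LineTokenizer also accepts a blanklines keyword (with the values 'discard', 'keep', and 'discard-eof') that controls how empty lines are treated; by default they are discarded. A minimal sketch, assuming this parameter behaves as documented:

from nltk.tokenize import LineTokenizer

text='line one\n\nline two\n'
print(LineTokenizer().tokenize(text))                  # blank line discarded
print(LineTokenizer(blanklines='keep').tokenize(text)) # blank line kept as ''
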
Now let's look at SpaceTokenizer. As its name suggests, it splits on space characters. Add the following lines:

rawText='By 11 o\'clock on sunday, the doctor shall open the dispensary.'
sTokenizer=SpaceTokenizer()
print('Space Tokenizer output:',sTokenizer.tokenize(rawText))

sTokenizer is an object of the SpaceTokenizer class; calling its tokenize() method produces the following output:

Space Tokenizer output: ['By', '11', "o'clock", 'on', 'sunday,', 'the', 'doctor', 'shall', 'open', 'the', 'dispensary.']

As expected, the input rawText is split on the space character ' '.
Next, call the word_tokenize() method, as shown below:

print('Word Tokenizer output:',word_tokenize(rawText))

The result is as follows:

Word Tokenizer output: ['By', '11', "o'clock", 'on', 'sunday', ',', 'the', 'doctor', 'shall', 'open', 'the', 'dispensary', '.']

As shown above, the difference between SpaceTokenizer and word_tokenize() is evident: word_tokenize() splits off punctuation such as the comma and the final period into tokens of their own.
Finally, let's look at TweetTokenizer, which can be used when handling special strings:

tTokenizer=TweetTokenizer()
print('Tweet Tokenizer output:',tTokenizer.tokenize("This is a coool #dummysmiley: :-) :-P <3"))

Tweets contain special words, special characters, hashtags, smileys, and so on that we want to keep intact. The result of running the code above is as follows:

Tweet Tokenizer output: ['This', 'is', 'a', 'coool', '#dummysmiley', ':', ':-)', ':-P', '<3']

As we can see, the tokenizer keeps the special characters intact without splitting them, and the smileys are left untouched. This is a special-purpose class, to be used when the need arises.
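
To see concretely what TweetTokenizer preserves, it helps to run word_tokenize on the same string. A small comparison sketch (the exact word_tokenize output may vary slightly by NLTK version):

from nltk import word_tokenize
from nltk.tokenize import TweetTokenizer

tweet="This is a coool #dummysmiley: :-) :-P <3"
# word_tokenize breaks the hashtag and the emoticons into separate pieces,
# while TweetTokenizer keeps '#dummysmiley', ':-)', ':-P' and '<3' intact.
print('word_tokenize :',word_tokenize(tweet))
print('TweetTokenizer:',TweetTokenizer().tokenize(tweet))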

2. Stemming
A stem is the base part of a word without any suffixes; the job of a stemmer is to remove suffixes and output the stem of the word.
Create a file named stemmers.py and add the following import line:

from nltk import PorterStemmer,LancasterStemmer,word_tokenize

Before doing any stemming, we first need to tokenize the input text. Use the following code to do that:

raw='My name is Maximus Decimus, commander of the Armies of the North, General of the Felix Legions and loyal servant to the true emperor, Marcus Aurelius. Father to a murdered son, husband to a murdered wife. And I will have my vengeance, in this life or the next.'
tokens=word_tokenize(raw)

The tokens list contains all the tokens produced from the input string raw.
First we use PorterStemmer. Add the following three lines of code:

porter=PorterStemmer()
pStems=[porter.stem(t) for t in tokens]
print(pStems)

We first initialize the stemmer, then apply it to every token of the input text, and finally print the results. The output tells us quite a bit:

['My', 'name', 'is', 'maximu', 'decimu', ',', 'command', 'of', 'the', 'armi', 'of', 'the', 'north', ',', 'gener', 'of', 'the', 'felix', 'legion', 'and', 'loyal', 'servant', 'to', 'the', 'true', 'emperor', ',', 'marcu', 'aureliu', '.', 'father', 'to', 'a', 'murder', 'son', ',', 'husband', 'to', 'a', 'murder', 'wife', '.', 'and', 'I', 'will', 'have', 'my', 'vengeanc', ',', 'in', 'thi', 'life', 'or', 'the', 'next', '.']

As you can see in the output, suffixes such as 's', 'es', 'e', 'ed', and 'al' have been removed from the words.
Next we use LancasterStemmer. Compared with Porter it is even more error-prone, because it includes many more suffixes to remove:

lancaster=LancasterStemmer()
lStems=[lancaster.stem(t) for t in tokens]
print(lStems)

We run a similar experiment, replacing PorterStemmer with LancasterStemmer. The output is as follows:

['my', 'nam', 'is', 'maxim', 'decim', ',', 'command', 'of', 'the', 'army', 'of', 'the', 'nor', ',', 'gen', 'of', 'the', 'felix', 'leg', 'and', 'loy', 'serv', 'to', 'the', 'tru', 'emp', ',', 'marc', 'aureli', '.', 'fath', 'to', 'a', 'murd', 'son', ',', 'husband', 'to', 'a', 'murd', 'wif', '.', 'and', 'i', 'wil', 'hav', 'my', 'veng', ',', 'in', 'thi', 'lif', 'or', 'the', 'next', '.']

We will discuss the differences in the output section, but it is easy to see that this stemmer strips far more than Porter: suffixes such as 'us', 'e', 'th', 'eral', and 'ered'.
Comparing the outputs of the two stemmers, we find that Lancaster is much more thorough at removing suffixes. It strips as many trailing characters as possible, whereas Porter strips as few as possible.
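
To inspect the differences directly, you can line the two outputs up token by token. A small sketch, reusing the tokens, pStems, and lStems lists built above, that prints only the words on which Porter and Lancaster disagree:

# Show only the tokens on which the two stemmers produce different stems
for token,p,l in zip(tokens,pStems,lStems):
    if p!=l:
        print(token,'-> porter:',p,'| lancaster:',l)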

3. Lemmatization
A lemma is the headword form of a word, or, simply put, its base form. We have already seen what a stem is, but unlike the stemming process, in which the stem is obtained by removing or replacing suffixes, obtaining a lemma is a dictionary-lookup process. Because lemmatization is a dictionary-mapping process, it is more complex than stemming.
Create a file named lemmatizer.py and add the following code:

from nltk import word_tokenize,WordNetLemmatizer

Before doing any lemmatization, we first need to tokenize the input text, using the following code:

raw='My name is Maximus Decimus, commander of the armies of the north, General of the Felix legions and loyal servant to the true emperor, Marcus Aurelius. Father to a murdered son, husband to a murdered wife. And I will have my vengeance, in this life or the next.'
tokens=word_tokenize(raw)

Now we use the lemmatizer. Add the following three lines of code:

lemmatizer=WordNetLemmatizer()
lemmas=[lemmatizer.lemmatize(t) for t in tokens]
print(lemmas)

Run the program; the output of these three lines is as follows:

['My', 'name', 'is', 'Maximus', 'Decimus', ',', 'commander', 'of', 'the', 'army', 'of', 'the', 'north', ',', 'General', 'of', 'the', 'Felix', 'legion', 'and', 'loyal', 'servant', 'to', 'the', 'true', 'emperor', ',', 'Marcus', 'Aurelius', '.', 'Father', 'to', 'a', 'murdered', 'son', ',', 'husband', 'to', 'a', 'murdered', 'wife', '.', 'And', 'I', 'will', 'have', 'my', 'vengeance', ',', 'in', 'this', 'life', 'or', 'the', 'next', '.']

As shown above, the lemmatizer can recognize proper nouns and leaves their trailing 's' alone, while for common nouns (such as legions and armies) it removes and replaces the suffix. Lemmatization is a dictionary-lookup process. Comparing the outputs of the stemmers and the lemmatizer, we see that the stemmers make many mistakes while the lemmatizer makes very few; however, it did nothing with the word murdered, which is a processing error. Judging by the final results, the lemmatizer outperforms the stemmers at extracting the base form of a word.
It is worth mentioning that WordNetLemmatizer removes affixes only if it can find the resulting word in its dictionary. This makes lemmatization slower than stemming. Moreover, it recognizes words whose first letter is capitalized and treats them as special words: it does nothing with them and returns them as-is. To get around this, you may want to convert your input string to lowercase before lemmatizing. Even so, lemmatization is still not perfect and makes mistakes. Checking the input and output of this example, we find that it cannot convert murdered to murder. Similarly, it handles the word women correctly but fails on the word men.
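
The miss on murdered is largely a part-of-speech issue: by default, lemmatize() treats every word as a noun. Passing a POS hint fixes this particular case. A minimal sketch (the outputs in the comments are what WordNetLemmatizer is expected to return):

from nltk import WordNetLemmatizer

lemmatizer=WordNetLemmatizer()
print(lemmatizer.lemmatize('murdered'))          # 'murdered' (treated as a noun, no dictionary match)
print(lemmatizer.lemmatize('murdered',pos='v'))  # 'murder' (treated as a verb)
print(lemmatizer.lemmatize('women'))             # 'woman' (found via dictionary lookup)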

4. The stopwords corpus
In this section we will use the Gutenberg Corpus as an example. The Gutenberg Corpus is part of the NLTK data module. It contains 18 texts selected from the roughly 25,000 e-books in the Project Gutenberg archive. It is a PlainTextCorpus: since it is not categorized in any way, it is well suited for plain word processing, without regard to relevance to any topic. One of the goals of this section is to introduce one of the most important preprocessing steps in text analysis: stopword handling. In line with that goal, we will use this corpus to demonstrate the frequency distribution and the stopwords corpus in Python's NLTK module. In short, stopwords are words that carry little semantic value but very high syntactic value. When you are not doing syntactic analysis and are instead using a bag-of-words approach (e.g., TF/IDF), you usually need to remove stopwords.
Create a file named Gutenberg.py and add the following three lines:

import nltk
from nltk.corpus import gutenberg
print(gutenberg.fileids())

The first two lines import the Gutenberg corpus and the other required modules, and the third line checks whether the corpus loaded successfully. Run this file in your Python environment; the output is as follows:

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

As shown above, the names of the 18 Gutenberg files are printed to the screen.
Add the following two lines to do some simple preprocessing on the corpus's word list:

gb_words=gutenberg.words('bible-kjv.txt')
words_filtered=[e for e in gb_words if len(e)>=3]

The first line copies the list of all words in the corpus sample bible-kjv.txt into the gb_words variable. The second line iterates over that word list and discards every word shorter than 3 characters.

Now we use nltk.corpus.stopwords to remove stopwords from the word list we just filtered. Add the following lines:

stopwords=nltk.corpus.stopwords.words('english')
words=[w for w in words_filtered if w.lower() not in stopwords]

The first line simply loads the English stopwords from the stopwords corpus into the stopwords variable. The second line further processes the previously filtered word list, removing every stopword.

Now we apply nltk.FreqDist to both the preprocessed list words and the unprocessed list gb_words. Add the following lines:

fdistPlain=nltk.FreqDist(gb_words)
fdist=nltk.FreqDist(words)

To see the characteristics of the resulting frequency distributions, add these two lines:

print('The most common 10 words in the bag:\n',fdistPlain.most_common(10))
print('The most common 10 words in the bag minus the stopwords:\n',fdist.most_common(10))

The most_common(10) function returns the 10 most frequent words from the bag of words processed by the frequency distribution. After running the program, you should get output similar to the following:

The most common 10 words in the bag:[(',', 70509), ('the', 62103), (':', 43766), ('and', 38847), ('of', 34480), ('.', 26160), ('to', 13396), ('And', 12846), ('that', 12576), ('in', 12331)]
The most common 10 words in the bag minus the stopwords:[('shall', 9760), ('unto', 8940), ('LORD', 6651), ('thou', 4890), ('thy', 4450), ('God', 4115), ('said', 3995), ('thee', 3827), ('upon', 2730), ('man', 2721)]

If you look closely at the results, you will find that the 10 most frequent words in the unprocessed text are not very meaningful. On the other hand, the 10 most frequent words in the preprocessed text, such as God, LORD, and man, immediately suggest that we are dealing with a text about faith or religion. Stopword handling is a preprocessing technique you need to master before any sophisticated analysis of text data. NLTK's stopwords corpus covers 11 languages. In any text-analysis application where you need to analyze keywords, handling stopwords properly will take you a long way, and a frequency distribution helps you pick out the important words. Statistically speaking, if you plot word frequency against word importance in a two-dimensional plane, the ideal distribution curve looks like a bell curve.
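
If you want to check which languages your installed stopwords corpus actually covers, you can list its file IDs. A quick sketch (the exact list depends on the NLTK data version you have installed; newer distributions include more than the 11 languages mentioned above):

from nltk.corpus import stopwords

# One fileid per language, e.g. 'english', 'french', 'german', ...
print(stopwords.fileids())
print(len(stopwords.words('english')),'English stopwords')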

5. Extracting the common vocabulary of two texts
Create a new file, say common_vocab.py (the source reuses the name lemmatizer.py from the earlier section; any fresh name will do), and define a pair of long strings containing short stories or news articles:

story1='''There was an old man who lived in a little hut in the middle of a forest. His wife was dead, and he had only one son, whom he loved dearly. Near their hut was a group of birch trees, in which some black-game had made their nests, and the youth had often begged his father's permission to shoot the birds, but the old man always strictly forbade him to do anything of the kind.
One day, however, when the father had gone to a little distance to collect some sticks for the fire, the boy fetched his bow, and shot at a bird that was just flying towards its nest. But he had not taken proper aim, and the bird was only wounded, and fluttered along the ground. The boy ran to catch it, but though he ran very fast, and the bird seemed to flutter along very slowly, he never could quite come up with it; it was always just a little in advance. But so absorbed was he in the chase that he did not notice for some time that he was now deep in the forest, in a place where he had never been before. Then he felt it would be foolish to go any further, and he turned to find his way home.
He thought it would be easy enough to follow the path along which he had come, but somehow it was always branching off in unexpected directions. He looked about for a house where he might stop and ask his way, but there was not a sign of one anywhere, and he was afraid to stand still, for it was cold, and there were many stories of wolves being seen in that part of the forest. Night fell, and he was beginning to start at every sound, when suddenly a magician came running towards him, with a pack of wolves snapping at his heels. Then all the boy's courage returned to him. He took his bow, and aiming an arrow at the largest wolf, shot him through the heart, and a few more arrows soon put the rest to flight. The magician was full of gratitude to his deliverer, and promised him a reward for his help if the youth would go back with him to his house.'''
story2='''The newly-coined word "online education" may by no means sound strange to most people. During the past several years, hundreds of online education colleges have sprung up around China.
Why could online education be so popular in such a short period of time? For one thing, If we want to catch up with the development and the great pace of modern society, we all should possess an urgent and strong desire to study, while most people nowadays are under so enormous pressures that they can hardly have time and energy to study full time at school. Furthermore, online education enables them to save a great deal of time on the way spent on the way between home and school. Last but not least, the quick development of internet,which makes possible all our dreams of attending class on the net,should also be another critical reason.
Personally, I appreciate this new form of education. It's indeed a helpful complement to the traditional educational means. It can provide different learners with more flexible and various ways to learn. Most of all, through online education, we can stick to our jobs and at the same time study and absorb the latest knowledge.'''

The first step is to remove some special characters from the texts: all newline characters '\n', commas, periods, exclamation marks, question marks, and so on. Finally, we convert the whole string to lowercase with the casefold() function. Note that replacing '\n' with an empty string (rather than a space) glues together words that straddle a line break; you will see the effect in the output below.

story1=story1.replace(',','').replace('\n','').replace('.','').replace('"','').replace('!','').replace('?','').casefold()
story2=story2.replace(',','').replace('\n','').replace('.','').replace('"','').replace('!','').replace('?','').casefold()

Next, tokenize the texts:

story1_words=story1.split(' ')
print("Story1 words:",story1_words)
story2_words=story2.split(' ')
print("Story2 words:",story2_words)

Calling split(' ') on story1 and story2 splits them on the space character ' ', producing their word lists. Now let's look at the output of this step:

Story1 words: ['there', 'was', 'an', 'old', 'man', 'who', 'lived', 'in', 'a', 'little', 'hut', 'in', 'the', 'middle', 'of', 'a', 'forest', 'his', 'wife', 'was', 'dead', 'and', 'he', 'had', 'only', 'one', 'son', 'whom', 'he', 'loved', 'dearly', 'near', 'their', 'hut', 'was', 'a', 'group', 'of', 'birch', 'trees', 'in', 'which', 'some', 'black-game', 'had', 'made', 'their', 'nests', 'and', 'the', 'youth', 'had', 'often', 'begged', 'his', "father's", 'permission', 'to', 'shoot', 'the', 'birds', 'but', 'the', 'old', 'man', 'always', 'strictly', 'forbade', 'him', 'to', 'do', 'anything', 'of', 'the', 'kindone', 'day', 'however', 'when', 'the', 'father', 'had', 'gone', 'to', 'a', 'little', 'distance', 'to', 'collect', 'some', 'sticks', 'for', 'the', 'fire', 'the', 'boy', 'fetched', 'his', 'bow', 'and', 'shot', 'at', 'a', 'bird', 'that', 'was', 'just', 'flying', 'towards', 'its', 'nest', 'but', 'he', 'had', 'not', 'taken', 'proper', 'aim', 'and', 'the', 'bird', 'was', 'only', 'wounded', 'and', 'fluttered', 'along', 'the', 'ground', 'the', 'boy', 'ran', 'to', 'catch', 'it', 'but', 'though', 'he', 'ran', 'very', 'fast', 'and', 'the', 'bird', 'seemed', 'to', 'flutter', 'along', 'very', 'slowly', 'he', 'never', 'could', 'quite', 'come', 'up', 'with', 'it;', 'it', 'was', 'always', 'just', 'a', 'little', 'in', 'advance', 'but', 'so', 'absorbed', 'was', 'he', 'in', 'the', 'chase', 'that', 'he', 'did', 'not', 'notice', 'for', 'some', 'time', 'that', 'he', 'was', 'now', 'deep', 'in', 'the', 'forest', 'in', 'a', 'place', 'where', 'he', 'had', 'never', 'been', 'before', 'then', 'he', 'felt', 'it', 'would', 'be', 'foolish', 'to', 'go', 'any', 'further', 'and', 'he', 'turned', 'to', 'find', 'his', 'way', 'homehe', 'thought', 'it', 'would', 'be', 'easy', 'enough', 'to', 'follow', 'the', 'path', 'along', 'which', 'he', 'had', 'come', 'but', 'somehow', 'it', 'was', 'always', 'branching', 'off', 'in', 'unexpected', 'directions', 'he', 'looked', 'about', 'for', 'a', 'house', 'where', 'he', 'might', 'stop', 'and', 'ask', 'his', 'way', 'but', 'there', 'was', 'not', 'a', 'sign', 'of', 'one', 'anywhere', 'and', 'he', 'was', 'afraid', 'to', 'stand', 'still', 'for', 'it', 'was', 'cold', 'and', 'there', 'were', 'many', 'stories', 'of', 'wolves', 'being', 'seen', 'in', 'that', 'part', 'of', 'the', 'forest', 'night', 'fell', 'and', 'he', 'was', 'beginning', 'to', 'start', 'at', 'every', 'sound', 'when', 'suddenly', 'a', 'magician', 'came', 'running', 'towards', 'him', 'with', 'a', 'pack', 'of', 'wolves', 'snapping', 'at', 'his', 'heels', 'then', 'all', 'the', "boy's", 'courage', 'returned', 'to', 'him', 'he', 'took', 'his', 'bow', 'and', 'aiming', 'an', 'arrow', 'at', 'the', 'largest', 'wolf', 'shot', 'him', 'through', 'the', 'heart', 'and', 'a', 'few', 'more', 'arrows', 'soon', 'put', 'the', 'rest', 'to', 'flight', 'the', 'magician', 'was', 'full', 'of', 'gratitude', 'to', 'his', 'deliverer', 'and', 'promised', 'him', 'a', 'reward', 'for', 'his', 'help', 'if', 'the', 'youth', 'would', 'go', 'back', 'with', 'him', 'to', 'his', 'house']
Story2 words: ['the', 'newly-coined', 'word', 'online', 'education', 'may', 'by', 'no', 'means', 'sound', 'strange', 'to', 'most', 'people', 'during', 'the', 'past', 'several', 'years', 'hundreds', 'of', 'online', 'education', 'colleges', 'have', 'sprung', 'up', 'around', 'chinawhy', 'could', 'online', 'education', 'be', 'so', 'popular', 'in', 'such', 'a', 'short', 'period', 'of', 'time', 'for', 'one', 'thing', 'if', 'we', 'want', 'to', 'catch', 'up', 'with', 'the', 'development', 'and', 'the', 'great', 'pace', 'of', 'modern', 'society', 'we', 'all', 'should', 'possess', 'an', 'urgent', 'and', 'strong', 'desire', 'to', 'study', 'while', 'most', 'people', 'nowadays', 'are', 'under', 'so', 'enormous', 'pressures', 'that', 'they', 'can', 'hardly', 'have', 'time', 'and', 'energy', 'to', 'study', 'full', 'time', 'at', 'school', 'furthermore', 'online', 'education', 'enables', 'them', 'to', 'save', 'a', 'great', 'deal', 'of', 'time', 'on', 'the', 'way', 'spent', 'on', 'the', 'way', 'between', 'home', 'and', 'school', 'last', 'but', 'not', 'least', 'the', 'quick', 'development', 'of', 'internet,which', 'makes', 'possible', 'all', 'our', 'dreams', 'of', 'attending', 'class', 'on', 'the', 'net,should', 'also', 'be', 'another', 'critical', 'reasonpersonally', 'i', 'appreciate', 'this', 'new', 'form', 'of', 'education', "it's", 'indeed', 'a', 'helpful', 'complement', 'to', 'the', 'traditional', 'educational', 'means', 'it', 'can', 'provide', 'different', 'learners', 'with', 'more', 'flexible', 'and', 'various', 'ways', 'to', 'learn', 'most', 'of', 'all', 'through', 'online', 'education', 'we', 'can', 'stick', 'to', 'our', 'jobs', 'and', 'at', 'the', 'same', 'time', 'study', 'and', 'absorb', 'the', 'latest', 'knowledge']

As you can see, all the special characters have been removed and a word list has been created. (Tokens such as 'kindone', 'homehe', and 'chinawhy' are the glued-together words mentioned above.)

Now we build a vocabulary from each word list. A vocabulary is a set of unique words, so we call Python's built-in set() function to convert each list into a set:

story1_vocab=set(story1_words)
print('Story1 vocabulary:',story1_vocab)
story2_vocab=set(story2_words)
print('Story2 vocabulary:',story2_vocab)

The result is as follows:

Story1 vocabulary: {'still', 'being', "boy's", 'strictly', 'ground', 'largest', 'further', 'forbade', 'forest', 'always', 'of', 'put', 'find', 'slowly', 'were', 'now', 'gone', 'branching', 'sticks', 'magician', 'permission', 'afraid', 'only', 'proper', 'come', 'before', 'heels', 'help', 'more', 'ask', 'back', 'trees', 'some', 'which', 'there', 'about', 'seen', 'anywhere', 'off', 'wolf', 'path', 'birch', 'group', 'deep', 'with', 'birds', 'night', 'the', 'shot', 'snapping', 'time', 'go', 'chase', 'loved', 'when', 'catch', 'fire', 'at', 'begged', 'stop', 'old', 'fast', 'fell', 'been', 'arrow', 'distance', 'dead', 'came', 'then', 'one', 'day', 'where', 'for', 'aim', 'fetched', 'quite', 'easy', 'their', 'often', 'just', 'towards', 'but', 'seemed', 'had', 'sign', 'many', 'beginning', 'gratitude', 'along', 'pack', 'flying', 'promised', 'house', 'flight', 'dearly', 'very', 'fluttered', 'might', 'start', 'through', 'suddenly', 'his', 'bow', 'follow', 'do', 'notice', 'never', 'could', 'be', 'courage', 'son', 'and', 'stories', 'would', 'deliverer', 'that', 'soon', 'foolish', 'however', 'returned', 'took', 'advance', 'all', 'near', 'ran', 'absorbed', 'felt', 'he', 'father', 'way', 'every', 'rest', 'anything', 'homehe', 'heart', 'who', 'was', 'part', 'collect', 'so', 'him', 'whom', 'not', 'it', 'running', 'lived', 'unexpected', 'somehow', 'arrows', 'few', 'full', 'stand', 'any', 'aiming', "father's", 'cold', 'a', 'to', 'wolves', 'bird', 'little', 'sound', 'place', 'it;', 'wounded', 'hut', 'man', 'in', 'made', 'nests', 'though', 'looked', 'if', 'up', 'flutter', 'did', 'turned', 'wife', 'directions', 'thought', 'reward', 'black-game', 'taken', 'middle', 'enough', 'kindone', 'nest', 'an', 'shoot', 'its', 'youth', 'boy'}
Story2 vocabulary: {'newly-coined', 'of', 'no', 'great', 'latest', 'dreams', 'attending', 'during', 'helpful', 'nowadays', 'net,should', 'study', 'save', 'more', 'are', 'period', 'also', 'new', 'they', 'spent', 'least', "it's", 'deal', 'desire', 'various', 'may', 'most', 'last', 'thing', 'chinawhy', 'with', 'can', 'the', 'flexible', 'time', 'strange', 'catch', 'i', 'provide', 'at', 'reasonpersonally', 'while', 'home', 'appreciate', 'online', 'hundreds', 'colleges', 'critical', 'strong', 'one', 'urgent', 'possible', 'for', 'another', 'sprung', 'pace', 'our', 'same', 'popular', 'but', 'internet,which', 'stick', 'means', 'educational', 'pressures', 'through', 'modern', 'around', 'could', 'be', 'indeed', 'makes', 'and', 'energy', 'by', 'school', 'education', 'that', 'possess', 'have', 'should', 'all', 'different', 'furthermore', 'way', 'so', 'not', 'jobs', 'enables', 'knowledge', 'it', 'complement', 'this', 'short', 'years', 'full', 'people', 'quick', 'we', 'hardly', 'past', 'on', 'several', 'traditional', 'a', 'to', 'under', 'class', 'sound', 'ways', 'learners', 'between', 'want', 'in', 'if', 'word', 'them', 'up', 'absorb', 'learn', 'such', 'development', 'enormous', 'form', 'an', 'society'}

These are the sets of unique words drawn from the two stories.

Now the final step is to find the words the two stories have in common. Python provides the set intersection operator &, which we use to find the vocabulary shared by the two sets:

common_vocab=story1_vocab&story2_vocab
print('Common Vocabulary:',common_vocab)

The final output is as follows:

Common Vocabulary: {'with', 'full', 'the', 'through', 'of', 'time', 'could', 'be', 'catch', 'a', 'at', 'to', 'and', 'sound', 'that', 'one', 'more', 'in', 'for', 'if', 'all', 'way', 'up', 'but', 'so', 'not', 'it', 'an'}
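
The other set operators are just as handy for comparing texts. Building on story1_vocab, story2_vocab, and common_vocab from above, the following sketch computes the words unique to each story and a simple Jaccard similarity (intersection size over union size) between the two vocabularies:

only_story1=story1_vocab-story2_vocab   # words appearing only in story1
only_story2=story2_vocab-story1_vocab   # words appearing only in story2
union_vocab=story1_vocab|story2_vocab   # all words from both stories

print('Unique to story1:',len(only_story1))
print('Unique to story2:',len(only_story2))
print('Jaccard similarity:',len(common_vocab)/len(union_vocab))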
