2.3. Installing the NLTK Toolkit
2.3.1. Tokenization
2.3.2. The Text Object
2.3.3. Stop Words
2.3.4. Filtering Out Stop Words
2.3.5. Part-of-Speech Tagging
2.3.6. Chunking
2.3.7. Named Entity Recognition
2.3.8. Data Cleaning Example
2.3.9. Reference Articles

2.3. Installing the NLTK Toolkit

NLTK is a very practical text-processing toolkit with a long history; it is mainly used for English data.

(base) C:\Users\toto>pip install nltk -i https://pypi.tuna.tsinghua.edu.cn/simple
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: nltk in d:\installed\anaconda3\lib\site-packages (3.5)
Requirement already satisfied: joblib in d:\installed\anaconda3\lib\site-packages (from nltk) (0.17.0)
Requirement already satisfied: tqdm in d:\installed\anaconda3\lib\site-packages (from nltk) (4.50.2)
Requirement already satisfied: regex in d:\installed\anaconda3\lib\site-packages (from nltk) (2020.10.15)
Requirement already satisfied: click in d:\installed\anaconda3\lib\site-packages (from nltk) (7.1.2)

(base) C:\Users\toto>

The most troublesome part of NLTK is that it needs some fairly large data packages. If you trust your network speed, you can activate the installation environment, start an interpreter with the python command, and run:

import nltk
nltk.download()

Then download the data in the GUI downloader that opens.
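If you only need specific packages, they can also be downloaded by name instead of through the GUI; the calls below are standard NLTK downloads for the packages used later in this section:

import nltk

# Download only the packages used in the examples below.
nltk.download('punkt')                        # tokenizer models used by word_tokenize()
nltk.download('stopwords')                    # stop word lists
nltk.download('averaged_perceptron_tagger')   # POS tagger used by pos_tag()
nltk.download('maxent_ne_chunker')            # NE chunker used by ne_chunk()
nltk.download('words')                        # word list required by the NE chunker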

However, downloading from the NLTK servers this way is not only slow but also prone to all kinds of download problems, so you can instead fetch the data packages directly from the nltk_data repository on GitHub: https://github.com/nltk/nltk_data

After downloading, the files need to be placed in one of the directories that NLTK scans. Those search paths can be listed as follows:
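A minimal way to see them is to print nltk.data.path, which holds the folders NLTK searches for data:

import nltk

# Print the directories NLTK searches for data packages;
# copy the downloaded data into any one of them.
print(nltk.data.path)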

The solution is to put the contents of the packages folder from the GitHub download into D:\installed\Anaconda\nltk_data.

One thing to note about the data package downloaded from GitHub: some of its subfolders still contain zipped content. For example, tokenizing text with NLTK uses word_tokenize(), which relies on the punkt data:

import nltk

sen = 'hello, how are you?'
res = nltk.word_tokenize(sen)
print(res)

but it may raise an error (it did for me), and the message shows that the punkt data was not found:

Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:

>>> import nltk
>>> nltk.download('punkt')

For more information see: https://www.nltk.org/data.html

For errors like this, if you look in the search path, i.e. where we placed the data package above, you can in fact find punkt under the tokenizers folder; the problem is simply that it has not been unzipped. Extract punkt.zip into that folder and the tokenization code runs without problems. Some of the other data packages are the same, so whenever NLTK reports that a resource cannot be found, it is worth checking whether the corresponding .zip file just needs to be extracted, for example with a small script like the one below.
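A minimal sketch that unzips every nested .zip inside the data directory in place; the nltk_data path here is an assumption based on this installation, so adjust it to one of the paths printed by nltk.data.path on your machine:

import zipfile
from pathlib import Path

# Assumed location of the NLTK data directory on this machine.
nltk_data_dir = Path(r'D:\installed\Anaconda\nltk_data')

# Extract every zipped package (e.g. tokenizers\punkt.zip) next to itself.
for zip_path in nltk_data_dir.rglob('*.zip'):
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(zip_path.parent)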

Run the tokenization code again and it now works.

2.3.1. Tokenization

import nltk
from nltk.tokenize import word_tokenize
from nltk.text import Text

input_str = "Today's weather is good, very windy and sunny, we have no classes in the afternoon,We have to play basketball tomorrow."
tokens = word_tokenize(input_str)
tokens = [word.lower() for word in tokens]
print(tokens)
'''
Output:
['today', "'s", 'weather', 'is', 'good', ',', 'very', 'windy', 'and', 'sunny', ',', 'we', 'have', 'no', 'classes', 'in', 'the', 'afternoon', ',', 'we', 'have', 'to', 'play', 'basketball', 'tomorrow', '.']
'''
print(tokens[:5])
'''
Output:
['today', "'s", 'weather', 'is', 'good']
'''

2.3.2. The Text Object

import nltk
# from nltk.tokenize import word_tokenize
from nltk.text import Text

help(nltk.text)

Output:

D:\installed\Anaconda\python.exe E:/workspace/nlp/nltk/demo.py
Help on module nltk.text in nltk:

NAME
    nltk.text

DESCRIPTION
    This module brings together a variety of NLTK functionality for
    text analysis, and provides simple, interactive interfaces.
    Functionality includes: concordancing, collocation discovery,
    regular expression search over tokenized strings, and
    distributional similarity.

CLASSES
    builtins.object
        ConcordanceIndex
        ContextIndex
        Text
            TextCollection
        TokenSearcher

    class Text(builtins.object)
     |  Text(tokens, name=None)
     |
     |  A wrapper around a sequence of simple (string) tokens, which is
     |  intended to support initial exploration of texts (via the
     |  interactive console).  Its methods perform a variety of analyses
     |  on the text's contexts (e.g., counting, concordancing, collocation
     |  discovery), and display the results.
     ...

    (output truncated; the full help text also documents the ConcordanceIndex,
    ContextIndex, TextCollection and TokenSearcher classes and the methods of
    Text such as concordance(), collocations(), common_contexts(), count(),
    dispersion_plot(), findall(), generate(), index(), similar(), plot() and
    vocab().)

FILE
    d:\installed\anaconda\lib\site-packages\nltk\text.py

Create a Text object to make the subsequent operations easier:

import nltk
from nltk.tokenize import word_tokenize
from nltk.text import Text

input_str = "Today's weather is good, very windy and sunny, we have no classes in the afternoon,We have to play basketball tomorrow."
tokens = word_tokenize(input_str)
tokens = [word.lower() for word in tokens]

t = Text(tokens)
print(t.count('good'))
'''
Output:
1
'''
print(t.index('good'))
'''
Output:
4
'''
t.plot(8)
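t.plot(8) draws a frequency plot of the eight most common tokens (it requires matplotlib). The underlying frequency distribution can also be inspected directly through vocab(), which returns an NLTK FreqDist; a small sketch added here for illustration:

# vocab() returns a FreqDist over the tokens of the Text object.
fdist = t.vocab()
print(fdist.most_common(8))   # the same eight tokens that t.plot(8) draws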

2.3.3. Stop Words

You can take a look at the description in the stopwords corpus README:

import nltk
from nltk.corpus import stopwords
print(stopwords.readme().replace('\n', ' '))

Output:

Stopwords Corpus  This corpus contains lists of stop words for several languages.  These are high-frequency grammatical words which are usually ignored in text retrieval applications.  They were obtained from: http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/  The stop words for the Romanian language were obtained from: http://arlc.ro/resources/  The English list has been augmented https://github.com/nltk/nltk_data/issues/22  The German list has been corrected https://github.com/nltk/nltk_data/pull/49  A Kazakh list has been added https://github.com/nltk/nltk_data/pull/52  A Nepali list has been added https://github.com/nltk/nltk_data/pull/83  An Azerbaijani list has been added https://github.com/nltk/nltk_data/pull/100  A Greek list has been added https://github.com/nltk/nltk_data/pull/103  An Indonesian list has been added https://github.com/nltk/nltk_data/pull/112
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# print(stopwords.readme().replace('\n', ' '))
print(stopwords.fileids())
'''
Output:
['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']
'''
print(stopwords.raw('english').replace('\n', ' '))
'''
Output:
i me my myself we our ours ourselves you you're you've you'll you'd your yours yourself yourselves he him his himself she she's her hers herself it it's its itself they them their theirs themselves what which who whom this that that'll these those am is are was were be been being have has had having do does did doing a an the and but if or because as until while of at by for with about against between into through during before after above below to from up down in out on off over under again further then once here there when where why how all any both each few more most other some such no nor not only own same so than too very s t can will just don don't should should've now d ll m o re ve y ain aren aren't couldn couldn't didn didn't doesn doesn't hadn hadn't hasn hasn't haven haven't isn isn't ma mightn mightn't mustn mustn't needn needn't shan shan't shouldn shouldn't wasn wasn't weren weren't won won't wouldn wouldn't
'''

'''
Prepare the data
'''
input_str = "Today's weather is good, very windy and sunny, we have no classes in the afternoon,We have to play basketball tomorrow."
tokens = word_tokenize(input_str)
tokens = [word.lower() for word in tokens]
test_words = [word.lower() for word in tokens]
test_words_set = set(test_words)
print(test_words_set)
'''
Output:
{'no', 'good', 'windy', 'in', 'afternoon', 'very', '.', 'have', 'to', 'basketball', 'classes', 'and', 'the', 'we', 'weather', 'tomorrow', 'is', ',', 'today', "'s", 'play', 'sunny'}
'''

'''
Get the stop words contained in test_words_set
'''
print(test_words_set.intersection(set(stopwords.words('english'))))
'''
Output:
{'no', 'to', 'and', 'is', 'very', 'the', 'we', 'have', 'in'}
'''

2.3.4. Filtering Out Stop Words

filtered = [w for w in test_words_set if(w not in stopwords.words('english'))]
print(filtered)
'''
Output:
['.', 'play', 'windy', 'tomorrow', 'today', 'weather', 'afternoon', 'classes', 'sunny', 'good', "'s", 'basketball', ',']
'''
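The filtered result still contains punctuation tokens such as '.' and ','. As a small extension (not part of the original), pure-punctuation tokens can be dropped as well using Python's string.punctuation:

import string

# Drop both English stop words and pure-punctuation tokens.
stop_set = set(stopwords.words('english'))
filtered = [w for w in test_words_set
            if w not in stop_set and w not in string.punctuation]
print(filtered)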

2.3.5. Part-of-Speech Tagging

import nltk
nltk.download()  # pick the POS tagger data (the third item) in the downloader GUI
'''
Output:
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
'''

from nltk import pos_tag

tags = pos_tag(tokens)
print(tags)
'''
Output:
[('today', 'NN'), ("'s", 'POS'), ('weather', 'NN'), ('is', 'VBZ'), ('good', 'JJ'), (',', ','), ('very', 'RB'), ('windy', 'JJ'), ('and', 'CC'), ('sunny', 'JJ'), (',', ','), ('we', 'PRP'), ('have', 'VBP'), ('no', 'DT'), ('classes', 'NNS'), ('in', 'IN'), ('the', 'DT'), ('afternoon', 'NN'), (',', ','), ('we', 'PRP'), ('have', 'VBP'), ('to', 'TO'), ('play', 'VB'), ('basketball', 'NN'), ('tomorrow', 'NN'), ('.', '.')]
'''
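As a small follow-up (my own addition), the tags can be used to keep only content words, for example nouns and adjectives:

# Keep only tokens tagged as nouns (NN*) or adjectives (JJ*).
content_words = [word for word, tag in tags
                 if tag.startswith('NN') or tag.startswith('JJ')]
print(content_words)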

2.3.6. Chunking

from nltk.chunk import RegexpParser

sentence = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'), ('died', 'VBD')]
grammar = "MY_NP: {<DT>?<JJ>*<NN>}"
cp = RegexpParser(grammar)   # build the chunking rule
result = cp.parse(sentence)  # run the chunker
print(result)
result.draw()                # draw the parse tree in a pop-up window
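The grammar chunks an optional determiner, any number of adjectives, and a noun into a MY_NP phrase. The same parser can also be applied to the POS-tagged tokens from the previous section; a short sketch added here for illustration:

# Reuse the tagger output (tags) from section 2.3.5 as chunker input.
tree = cp.parse(tags)

# Print only the MY_NP chunks found in the sentence.
for subtree in tree.subtrees(filter=lambda t: t.label() == 'MY_NP'):
    print(subtree)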

2.3.7. Named Entity Recognition

import nltk
nltk.download()  # download the maxent_ne_chunker and words data in the GUI
'''
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
'''

from nltk import ne_chunk, pos_tag
from nltk.tokenize import word_tokenize

sentence = "Edison went to Tsinghua University today"
print(ne_chunk(pos_tag(word_tokenize(sentence))))
'''
Output:
(S
  (PERSON Edison/NNP)
  went/VBD
  to/TO
  (ORGANIZATION Tsinghua/NNP University/NNP)
  today/NN)
'''
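The result is an NLTK Tree in which named entities appear as labeled subtrees. A short sketch (my own addition) that walks the tree and collects the entities together with their labels:

ne_tree = ne_chunk(pos_tag(word_tokenize(sentence)))

# Collect (entity text, entity label) pairs from the labeled subtrees.
entities = []
for subtree in ne_tree.subtrees(filter=lambda t: t.label() != 'S'):
    entity = ' '.join(word for word, tag in subtree.leaves())
    entities.append((entity, subtree.label()))
print(entities)  # e.g. [('Edison', 'PERSON'), ('Tsinghua University', 'ORGANIZATION')]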

2.3.8. Data Cleaning Example

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Input data
s = '    RT @Amila #Test\nTom\'s newly listed Co  &amp; Mary\'s unlisted     Group to supply tech for nlTK.\nh $TSLA $AAPL https:// t.co/x34afsfQsh'

# Stop words to use
cache_english_stopwords = stopwords.words('english')

def text_clean(text):
    print('Raw data:', text, '\n')

    # Remove HTML entities (e.g. &amp;), hashtags and @mentions
    text_no_special_entities = re.sub(r'\&\w*;|#\w*|@\w*', '', text)
    print('After removing special tags:', text_no_special_entities, '\n')

    # Remove ticker symbols (e.g. $TSLA)
    text_no_tickers = re.sub(r'\$\w*', '', text_no_special_entities)
    print('After removing ticker symbols:', text_no_tickers, '\n')

    # Remove hyperlinks
    text_no_hyperlinks = re.sub(r'https?:\/\/.*\/\w*', '', text_no_tickers)
    print('After removing hyperlinks:', text_no_hyperlinks, '\n')

    # Remove very short words (one or two letters), mostly abbreviations
    text_no_small_words = re.sub(r'\b\w{1,2}\b', '', text_no_hyperlinks)
    print('After removing short words:', text_no_small_words, '\n')

    # Collapse extra whitespace
    text_no_whitespace = re.sub(r'\s\s+', ' ', text_no_small_words)
    text_no_whitespace = text_no_whitespace.lstrip(' ')
    print('After removing extra whitespace:', text_no_whitespace, '\n')

    # Tokenize
    tokens = word_tokenize(text_no_whitespace)
    print('Tokenization result:', tokens, '\n')

    # Remove stop words
    list_no_stopwords = [i for i in tokens if i not in cache_english_stopwords]
    print('After removing stop words:', list_no_stopwords, '\n')

    # Join the filtered tokens back into a string
    text_filtered = ' '.join(list_no_stopwords)  # ''.join() would join without spaces between words
    print('Filtered result:', text_filtered)

text_clean(s)

Output:

D:\installed\Anaconda\python.exe E:/workspace/nlp/nltk/demo2.py
Raw data:      RT @Amila #Test
Tom's newly listed Co  &amp; Mary's unlisted     Group to supply tech for nlTK.
h $TSLA $AAPL https:// t.co/x34afsfQsh 

After removing special tags:      RT 
Tom's newly listed Co   Mary's unlisted     Group to supply tech for nlTK.
h $TSLA $AAPL https:// t.co/x34afsfQsh 

After removing ticker symbols:      RT 
Tom's newly listed Co   Mary's unlisted     Group to supply tech for nlTK.
h   https:// t.co/x34afsfQsh 

After removing hyperlinks:      RT 
Tom's newly listed Co   Mary's unlisted     Group to supply tech for nlTK.
h 

After removing short words:  
Tom' newly listed    Mary' unlisted     Group  supply tech for nlTK.

After removing extra whitespace: Tom' newly listed Mary' unlisted Group supply tech for nlTK. 

Tokenization result: ['Tom', "'", 'newly', 'listed', 'Mary', "'", 'unlisted', 'Group', 'supply', 'tech', 'for', 'nlTK', '.'] 

After removing stop words: ['Tom', "'", 'newly', 'listed', 'Mary', "'", 'unlisted', 'Group', 'supply', 'tech', 'nlTK', '.'] 

Filtered result: Tom ' newly listed Mary ' unlisted Group supply tech nlTK .

Process finished with exit code 0

2.3.9. Reference Articles

https://pypi.org/project/nltk/#files
https://blog.csdn.net/sinat_34328764/article/details/94830948
