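1. POS tagging English text with NLTK
- 1.1 POS tagging example
- 1.2 Pre-tagged corpus data
2. Taggers
- 2.1 Default tagger
- 2.2 Regular expression tagger
- 2.3 Lookup tagger
3. Training N-gram taggers
- 3.1 General N-gram tagging
- 3.2 Combining taggers
4. Going further
5. Training a Chinese tagger
6. Brown corpus methods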


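Natural language is a system of rules that people have developed through communication. Some rules are strict and others loose: informal settings use spoken language, for example, while formal settings call for written language. Processing natural language means following these established rules as well; otherwise the results make no sense. A few related terms are distinguished briefly below.
Grammar: the norms for writing text, used to describe a language and its structure; it covers both syntactic and lexical rules.
Syntax: the rules governing the structure of a sentence, that is, how its constituents are formed and relate to one another.
Lexical rules: the rules governing how words are formed and inflected.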

Part-of-speech (POS) tagging is a way of analyzing the components of a sentence: it identifies the part of speech of each word.

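The table below briefly lists what some POS tags mean; for the details of the Brown tagset, see nltk.help.brown_tagset(). Note that the table shows a simplified tagset, while the Brown corpus output later on uses Brown's own tags, such as AT and JJ.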

Tag	Part of speech	Examples
ADJ	adjective	new, good, high, special, big, local
ADV	adverb	really, already, still, early, now
CONJ	conjunction	and, or, but, if, while, although
DET	determiner	the, a, some, most, every, no
EX	existential	there, there's
MOD	modal verb	will, can, would, may, must, should
NN	noun	year, home, costs, time
NNP	proper noun	April, China, Washington
NUM	number	fourth, 2016, 09:30
PRON	pronoun	he, they, us
P	preposition	on, over, with, of
TO	the word "to"	to
UH	interjection	ah, ha, oops
VB	verb	
VBD	past tense verb	made, said, went
VBG	present participle	going, lying, playing
VBN	past participle	taken, given, gone
WH	wh-determiner	who, where, when, what

1. POS tagging English text with NLTK

1.1 POS tagging example

import nltk

sent = "I am going to Beijing tomorrow."

# nltk.sent_tokenize(text)      # split a text into sentences (the author notes this did not split sentences under Python 3)
# nltk.word_tokenize(sentence)  # split a sentence into words
# NLTK's word tokenizer works at the sentence level, so for a whole document,
# first split the text into sentences, then tokenize each sentence.

# Tokenize the sentence into words
words = nltk.word_tokenize(sent)
print(words)

['I', 'am', 'going', 'to', 'Beijing', 'tomorrow', '.']

# POS tagging
tagged_sent = nltk.pos_tag(words)
tagged_sent

[('I', 'PRP'),('am', 'VBP'),('going', 'VBG'),('to', 'TO'),('Beijing', 'NNP'),('tomorrow', 'NN'),('.', '.')]
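
Since the tagged result is just a list of (word, tag) tuples, ordinary Python is enough to query it. The following is a minimal sketch that uses the output above as literal data, so it runs without NLTK; the names words_by_tag and nouns are introduced here only for illustration.

```python
from collections import defaultdict

# Tagged output copied from the example above
tagged_sent = [('I', 'PRP'), ('am', 'VBP'), ('going', 'VBG'), ('to', 'TO'),
               ('Beijing', 'NNP'), ('tomorrow', 'NN'), ('.', '.')]

# Group the words by their tag
words_by_tag = defaultdict(list)
for word, tag in tagged_sent:
    words_by_tag[tag].append(word)

# In this output, the noun tags are the ones that start with "NN"
nouns = [word for word, tag in tagged_sent if tag.startswith('NN')]
print(nouns)  # ['Beijing', 'tomorrow']
```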

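1.2 Pre-tagged corpus data

Corpus reader classes provide the following methods for returning pre-tagged data.

Method	Description
tagged_words(fileids, categories)	Returns the tagged data as a list of words
tagged_sents(fileids, categories)	Returns the tagged data as a list of sentences
tagged_paras(fileids, categories)	Returns the tagged data as a list of paragraphs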
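
The three methods differ only in how deeply the (word, tag) pairs are nested. The sketch below illustrates the shapes with literal stand-in data rather than a real corpus reader; the tokens are the first few from the Brown news category, but the paragraph grouping and the truncated sentences are invented for illustration.

```python
from itertools import chain

# Shape of tagged_paras(): a list of paragraphs,
# each a list of sentences, each a list of (word, tag) pairs.
tagged_paras = [
    [
        [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL')],
        [('The', 'AT'), ('jury', 'NN')],
    ],
]

# Shape of tagged_sents(): one level flatter, a list of sentences
tagged_sents = list(chain.from_iterable(tagged_paras))

# Shape of tagged_words(): a flat list of (word, tag) pairs
tagged_words = list(chain.from_iterable(tagged_sents))

print(len(tagged_paras), len(tagged_sents), len(tagged_words))  # 1 2 5
```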

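2. Taggers

2.1 Default tagger

The simplest POS tagger tags every word as a noun (NN). Such a tagger is of little use on its own, because its accuracy is very low. The following shows how to use the default tagger that NLTK provides.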

import nltk
from nltk.corpus import brown

# Load the data
brown_tagged_sents = brown.tagged_sents(categories='news') # [[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'),
brown_sents = brown.sents(categories='news')
# brown_tagged_sents

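# The simplest tagger assigns the same tag to every token. This may seem a rather
# banal approach, but it establishes an important baseline for tagger performance.
# To get the best result, we tag every word with the most likely tag.
# The following finds out which tag is most likely.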
tags = [tag for (word,tag) in brown.tagged_words(categories='news')]
tags

['AT','NP-TL','NN-TL','JJ-TL','NN-TL','VBD','NR','AT','NN','IN','NP$','JJ','NN','NN','VBD','``','AT','NN',"''",'CS','DTI','NNS','VBD','NN',
...,'IN','NN','.','NP','NPS','BER','VBG','JJ','NN','TO','VB','AT','NN','IN','AT','CD','NN$',...]

tag = nltk.FreqDist(tags).max()
tag

'NN'

# We can now create a tagger that tags every word as NN.
default_tagger = nltk.DefaultTagger('NN')
sent = "I am going to Beijing tomorrow."
default_tagger.tag(nltk.word_tokenize(sent))

[('I', 'NN'),('am', 'NN'),('going', 'NN'),('to', 'NN'),('Beijing', 'NN'),('tomorrow', 'NN'),('.', 'NN')]

default_tagger.evaluate(brown_tagged_sents)  # newer NLTK releases deprecate evaluate() in favor of accuracy()

0.13089484257215028
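
To make clear what the score above measures, here is a plain-Python sketch of the same idea: pick the most frequent tag, assign it to every token, and count how many tokens match the gold tags. It is not NLTK's implementation; the toy data and the helper names default_tag and accuracy are assumptions made for this illustration.

```python
from collections import Counter

# Toy gold-standard data (hypothetical, not taken from the Brown corpus)
gold = [
    [('the', 'AT'), ('dog', 'NN'), ('barked', 'VBD'), ('.', '.')],
    [('time', 'NN'), ('flies', 'VBZ'), ('.', '.')],
    [('the', 'AT'), ('cat', 'NN'), ('saw', 'VBD'), ('snow', 'NN'), ('.', '.')],
]

# Find the most frequent tag, as FreqDist(tags).max() does above
tag_counts = Counter(tag for sent in gold for (_, tag) in sent)
most_common_tag = tag_counts.most_common(1)[0][0]

def default_tag(words, tag):
    """Assign the same tag to every word."""
    return [(w, tag) for w in words]

def accuracy(tagger, gold_sents):
    """Fraction of tokens whose predicted (word, tag) pair equals the gold pair."""
    correct = total = 0
    for sent in gold_sents:
        words = [w for w, _ in sent]
        predicted = tagger(words)
        correct += sum(p == g for p, g in zip(predicted, sent))
        total += len(sent)
    return correct / total

baseline = accuracy(lambda ws: default_tag(ws, most_common_tag), gold)
print(most_common_tag, baseline)
```

On this toy data NN covers 4 of the 12 tokens, so the baseline is one third; on the Brown news category the same computation gives the roughly 13% shown above.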

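2.2 Regular expression tagger

The regular expression tagger assigns tags to tokens on the basis of matching patterns. For example, we might guess that any word ending in "ed" is the past tense of a verb, and that any word ending in "'s" is a possessive noun. These guesses can be expressed as a list of regular expressions, as in the example below.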

patterns = [
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*ed$', 'VBD'),                 # simple past
    (r'.*es$', 'VBZ'),                 # 3rd singular present
    (r'.*ould$', 'MD'),                # modals
    (r'.*\'s$', 'NN$'),                # possessive nouns
    (r'.*s$', 'NNS'),                  # plural nouns
    (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),   # cardinal numbers
    (r'.*', 'NN')                      # nouns (default)
]

The patterns are processed in order, and the first one that matches is applied. Now we can build a tagger and use it to tag a sentence.

regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(brown_sents[3])
regexp_tagger.evaluate(brown_tagged_sents)
# 0.20326391789486245, i.e. about one fifth of the tokens are tagged correctly

0.20326391789486245
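
The "first match wins" behavior is easy to see in a few lines of plain Python. This is a sketch of the idea using the standard re module, not NLTK's RegexpTagger itself; the function name regexp_tag and the sample words are made up for the example.

```python
import re

# A shortened version of the pattern list above; order matters
patterns = [
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*es$', 'VBZ'),
    (r'.*\'s$', 'NN$'),
    (r'.*s$', 'NNS'),
    (r'.*', 'NN'),      # catch-all, so every word gets some tag
]
compiled = [(re.compile(p), tag) for p, tag in patterns]

def regexp_tag(words):
    """Tag each word with the tag of the first pattern that matches it."""
    tagged = []
    for w in words:
        for regex, tag in compiled:
            if regex.match(w):
                tagged.append((w, tag))
                break
    return tagged

result = regexp_tag(['walking', 'jumped', 'watches', "Mary's", 'dogs', 'table'])
print(result)
```

Because "watches" matches the '.*es$' pattern before it ever reaches '.*s$', it is tagged VBZ rather than NNS; reordering the list would change the result.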

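2.3 Lookup tagger

Many high-frequency words do not have the NN tag. Let's find the 100 most frequent words and store their most likely tag; this information can then serve as the model for a "lookup tagger" (an NLTK UnigramTagger), as in the example below: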

# First collect the words
fd = nltk.FreqDist(brown.words(categories='news'))  # ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

# A ConditionalFreqDist collects the frequency distributions of a single experiment
# run under different conditions: it records how often each sample occurred under a
# given condition. For example, it can record the frequency of each word (type) of a
# given length in a document. Formally, it maps each condition to the FreqDist for
# the experiment under that condition.
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
# print(cfd.items())  # dict_items([('The', FreqDist({'AT': 775, 'AT-TL': 28, 'AT-HL': 3})), ('Fulton', FreqDist({'NP-TL': 10, 'NP': 4})), ...

# "Top 100" words
# NOTE: fd.keys() yields words in order of first appearance, not by frequency, so this
# selects the first 100 distinct words; fd.most_common(100) gives the 100 most frequent.
most_freq_words = fd.keys()
# In Python 3 this is a dict_keys view, so convert it to a list before slicing
most_freq_words = list(most_freq_words)[:100]  # ['The','Fulton','County','Grand','Jury','said','Friday','an', ...

# For each selected word, take the most frequent tag in its distribution as its tag
likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)
# likely_tags  # {'The': 'AT','Fulton': 'NP-TL','County': 'NN-TL','Grand': 'JJ-TL','Jury': 'NN-TL','said': 'VBD','Friday': 'NR', ...

# Given model=..., UnigramTagger tags each word by looking it up in that word-to-tag table
baseline_tagger = nltk.UnigramTagger(model=likely_tags)
baseline_tagger.evaluate(brown_tagged_sents)  # 0.3329355371243312
# brown.tagged_words(categories='news')  # [('The', 'AT'), ('Fulton', 'NP-TL'), ...]
# tagged_words() is a flat list, so wrap it in a list to get the nested shape evaluate() expects
baseline_tagger.evaluate([brown.tagged_words(categories='news')])  # 0.3329355371243312
# Individual sentences can score very high
baseline_tagger.evaluate([brown.tagged_sents(categories='news')[3]])  # 0.972972972972973

0.972972972972973

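This result differs from the book's, which is about 0.45. A likely reason is that fd.keys() returns words in order of first appearance rather than by frequency, so the model above holds the first 100 distinct words of the corpus, not the 100 most frequent ones (fd.most_common(100) would give those). The book's point is that knowing the tags of just the 100 most frequent words is enough to tag a large fraction of tokens correctly.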

Let's see how it does on untagged input text:

sent = brown.sents(categories='news')[10]
baseline_tagger.tag(sent)

[('It', 'PPS'),('urged', None),('that', 'CS'),('the', 'AT'),('city', 'NN'),('``', '``'),('take', None),('steps', None),('to', 'TO'),('remedy', None),("''", "''"),('this', 'DT'),('problem', None),('.', '.')]

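Many words were assigned the tag None, because they are not among the 100 words in the lookup table. In these cases we would like to assign the default tag NN. In other words, we want to use the lookup table first and, if it cannot assign a tag, fall back to the default tagger; this process is known as "backoff".

# Set a default tagger to use when the lookup finds no match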
baseline_tagger = nltk.UnigramTagger(model = likely_tags,backoff = nltk.DefaultTagger('NN'))
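
Stripped of the NLTK classes, a lookup tagger with backoff is little more than a dictionary lookup with a default value. The sketch below assumes a tiny hand-written model; the dictionary contents and function names are hypothetical and stand in for likely_tags and the two tagger objects above.

```python
# Hypothetical four-word model standing in for likely_tags
model = {'the': 'AT', 'to': 'TO', 'that': 'CS', '.': '.'}

def lookup_tag(words, model):
    """Lookup only: words missing from the model get None."""
    return [(w, model.get(w)) for w in words]

def lookup_tag_with_backoff(words, model, default='NN'):
    """Lookup first; fall back to the default tag when the word is unknown."""
    return [(w, model.get(w, default)) for w in words]

words = ['the', 'city', 'urged', 'that', '.']
without_backoff = lookup_tag(words, model)
with_backoff = lookup_tag_with_backoff(words, model)
print(without_backoff)
print(with_backoff)
```

The fallback tag is not always right ("urged" is not a noun), but a guessed NN can sometimes be correct, whereas None never is.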

Finally, with the lookup tagger and the default tagger combined, let's see how it performs with models of varying size:

def performance(cfd, wordlist):
    lt = dict((word, cfd[word].max()) for word in wordlist)
    baseline_tagger = nltk.UnigramTagger(model=lt, backoff=nltk.DefaultTagger('NN'))
    return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))

def display():
    import pylab
    # most_common() sorts by frequency; iterating over a FreqDist directly does not
    word_freqs = nltk.FreqDist(brown.words(categories='news')).most_common()
    words_by_freq = [w for (w, _) in word_freqs]
    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
    sizes = 2 ** pylab.arange(16)
    perfs = [performance(cfd, words_by_freq[:size]) for size in sizes]
    pylab.plot(sizes, perfs, '-bo')
    pylab.title('Lookup Tagger Performance with Varying Model Size')
    pylab.xlabel('Model Size')
    pylab.ylabel('Performance')
    pylab.show()

display()

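As the model grows, performance rises quickly at first and then levels off; beyond that point, even large increases in model size yield little further improvement.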

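3. Training N-gram taggers

3.1 General N-gram tagging

The previous section already used a 1-gram, or unigram, tagger. Taking more context into account gives 2-gram and 3-gram taggers, collectively called N-gram taggers here. Note that a longer context does not necessarily improve accuracy.
Besides supplying an N-gram tagger with a lookup-table model, another way to build a tagger is to train it. The constructor of an N-gram tagger is __init__(train=None, model=None, backoff=None), so a tagged corpus can be passed in as training data to build a tagger.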

import nltk
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories = 'news')
train_num = int(len(brown_tagged_sents) * 0.9)
x_train = brown_tagged_sents[0:train_num]
x_test = brown_tagged_sents[train_num:]
tagger = nltk.UnigramTagger(train = x_train)
print(tagger.evaluate(x_test)) # 0.8121200039868434

0.8121200039868434

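With a unigram tagger trained on 90% of the data, accuracy on the remaining 10% is about 81%. Switching to a bigram tagger on its own drops accuracy to around 10%. A bigram tagger can only tag a word that it has seen together with the preceding tag during training; once it fails on one word it also fails on the words that follow, because it never saw them preceded by a None tag.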
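
The collapse of the bigram tagger can be reproduced on a toy corpus. The following plain-Python sketch trains a table keyed on (previous tag, word) and tags without any backoff; it illustrates the sparse-data effect and is not NLTK's BigramTagger. The training sentences, the '<s>' start marker, and the function names are assumptions made for this example.

```python
from collections import Counter, defaultdict

# Toy training data (hypothetical)
train = [
    [('the', 'AT'), ('dog', 'NN'), ('barks', 'VBZ')],
    [('the', 'AT'), ('cat', 'NN'), ('sleeps', 'VBZ')],
]

def train_bigram(sents):
    """Map each (previous tag, word) context to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in sents:
        prev = '<s>'                      # start-of-sentence marker
        for word, tag in sent:
            counts[(prev, word)][tag] += 1
            prev = tag
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def bigram_tag(words, table):
    """Tag left to right; an unseen context yields None."""
    tagged, prev = [], '<s>'
    for w in words:
        tag = table.get((prev, w))
        tagged.append((w, tag))
        prev = tag                        # a None here poisons the next context
    return tagged

table = train_bigram(train)
seen = bigram_tag(['the', 'dog', 'sleeps'], table)
unseen = bigram_tag(['the', 'bird', 'sleeps'], table)
print(seen)
print(unseen)
```

In the second sentence the unknown word "bird" gets None, and so does "sleeps", even though "sleeps" appeared in training: the context (None, 'sleeps') was never observed.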

3.2 Combining taggers

The backoff parameter can be used to chain several taggers together to improve tagging accuracy.

import nltk
from nltk.corpus import brown
pattern = [
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*es$', 'VBZ'),
    (r'.*\'s$', 'NN$'),
    (r'.*s$', 'NNS'),
    (r'.*', 'NN')  # anything unmatched is still tagged NN
]
brown_tagged_sents = brown.tagged_sents(categories = 'news')
train_num = int(len(brown_tagged_sents) * 0.9)
x_train =  brown_tagged_sents[0:train_num]
x_test = brown_tagged_sents[train_num:]

t0 = nltk.RegexpTagger(pattern)
t1 = nltk.UnigramTagger(x_train,backoff = t0)
t2 = nltk.BigramTagger(x_train,backoff = t1)
print(t2.evaluate(x_test)) # 0.8627529153792485

0.8627529153792485

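As the above shows, no linguistic knowledge is needed: statistics alone are enough for POS tagging to work quite well.
For Chinese, an N-gram tagger can be trained by the same procedure, as long as a tagged corpus is available.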
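
The backoff chain used in this section can be sketched in plain Python as a list of taggers tried in order, each returning either a tag or None. This is an illustration of the control flow only, under the assumption of small hand-written tables; the tables, the '<s>' start marker, and the names make_chain and tag_sentence are hypothetical.

```python
def make_chain(*taggers):
    """Return a function that tries each tagger in turn until one gives a tag."""
    def tag_word(word, prev_tag):
        for tagger in taggers:
            tag = tagger(word, prev_tag)
            if tag is not None:
                return tag
        return None
    return tag_word

# Hypothetical hand-written tables standing in for trained models
bigram_table = {('AT', 'dog'): 'NN', ('TO', 'walk'): 'VB'}
unigram_table = {'the': 'AT', 'to': 'TO', 'walk': 'NN', 'dog': 'NN'}

def t2(word, prev_tag):      # most specific: previous tag plus word
    return bigram_table.get((prev_tag, word))

def t1(word, prev_tag):      # word only
    return unigram_table.get(word)

def t0(word, prev_tag):      # last resort: always answers
    return 'NN'

def tag_sentence(words, tag_word):
    tagged, prev = [], '<s>'
    for w in words:
        tag = tag_word(w, prev)
        tagged.append((w, tag))
        prev = tag
    return tagged

combined = make_chain(t2, t1, t0)
result = tag_sentence(['the', 'dog', 'likes', 'to', 'walk'], combined)
print(result)
```

Here the bigram table overrides the unigram guess for "walk" after TO, while the unknown word "likes" falls all the way through to the default. Because some tagger always answers, no None is ever passed on as context.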

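4. Going further

nltk.tag.BrillTagger implements transformation-based tagging: starting from the output of a base tagger, it applies rule-based corrections to reach higher accuracy.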

import nltk
import nltk.tag.brill
import nltk.tag.brill_trainer
from nltk.corpus import brown

pattern = [
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*es$', 'VBZ'),
    (r'.*\'s$', 'NN$'),
    (r'.*s$', 'NNS'),
    (r'.*', 'NN')  # anything unmatched is still tagged NN
]
# Split the dataset
brown_tagged_sents = brown.tagged_sents(categories = ['news'])
train_num = int(len(brown_tagged_sents)*0.9)
x_train = brown_tagged_sents[:train_num]
x_test = brown_tagged_sents[train_num:]
# Baseline tagger: unigram with a regexp backoff
baseline_tagger = nltk.UnigramTagger(x_train,backoff = nltk.RegexpTagger(pattern))
tt = nltk.tag.brill_trainer.BrillTaggerTrainer(baseline_tagger, nltk.tag.brill.brill24())
brill_tagger = tt.train(x_train,max_rules=20,min_acc=0.99)
# Evaluate
print(brill_tagger.evaluate(x_test))  # 0.8683344961626632

0.8683344961626632

brown_sents = brown.sents(categories="news")
print(brown_tagged_sents[2007])
print(brill_tagger.tag(brown_sents[2007]))

[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'CS'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
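
Of the two lines above, the first is the gold annotation and the second is the Brill tagger's output; they differ only in the tag for "so". To show what a transformation rule does, here is a plain-Python sketch that applies rules of the form "change tag A to B when the previous tag is C" to a base tagger's output. The rule, the sample sentence, and the function names are hypothetical, and the rule templates NLTK actually learns are richer than this single shape.

```python
def apply_rule(tagged, rule):
    """Apply one rule (from_tag, to_tag, required_previous_tag) to a sentence."""
    from_tag, to_tag, prev_tag = rule
    # Find all matching positions first, then rewrite them
    positions = [i for i in range(1, len(tagged))
                 if tagged[i][1] == from_tag and tagged[i - 1][1] == prev_tag]
    result = list(tagged)
    for i in positions:
        result[i] = (result[i][0], to_tag)
    return result

def apply_rules(tagged, rules):
    """Apply the rules one after another, each over the whole sentence."""
    for rule in rules:
        tagged = apply_rule(tagged, rule)
    return tagged

# Hypothetical rule: a word tagged NN directly after TO is really a verb
rules = [('NN', 'VB', 'TO')]

# Hypothetical output of a base tagger, with "increase" mis-tagged as a noun
initial = [('they', 'PPSS'), ('want', 'VB'), ('to', 'TO'),
           ('increase', 'NN'), ('taxes', 'NNS')]
corrected = apply_rules(initial, rules)
print(corrected)
```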

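5. Training a Chinese tagger

The following trains a Chinese POS tagger based on a unigram model, using the tagged People's Daily corpus from January 1998, which can be downloaded online.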

import nltk

with open('./词性标注人民日报199801.txt', encoding='utf-8') as f:
    lines = f.readlines()

all_tagged_sents = []
for line in lines:
    sent = line.split()
    tagged_sent = []
    for item in sent:
        pair = nltk.str2tuple(item)  # 'word/tag' -> (word, TAG); the tag is upper-cased
        tagged_sent.append(pair)
    if len(tagged_sent) > 0:
        all_tagged_sents.append(tagged_sent)

train_size = int(len(all_tagged_sents) * 0.8)
x_train = all_tagged_sents[:train_size]
x_test = all_tagged_sents[train_size:]
# NOTE: str2tuple upper-cases the tags, so the lower-case backoff tag 'n' never
# matches a gold 'N'; the score below was obtained with 'n' as written.
tagger = nltk.UnigramTagger(train=x_train, backoff=nltk.DefaultTagger('n'))
print(tagger.evaluate(x_test))  # 0.8714095491725319

# Example of one input line and what the loop produces from it.
# line:
# 19980101-01-001-001/m  迈向/v  充满/v  希望/n  的/u  新/a  世纪/n  ——/w  一九九八年/t  新年/t  讲话/n  （/w  附/v  图片/n  １/m  张/q  ）/w
# line.split():
# ['19980101-01-001-001/m', '迈向/v', '充满/v', ...]
# tagged_sent:
# [('19980101-01-001-001', 'M'), ('迈向', 'V'), ('充满', 'V'), ('希望', 'N'), ('的', 'U'), ('新', 'A'), ('世纪', 'N'), ('——', 'W'), ('一九九八年', 'T'), ('新年', 'T'), ('讲话', 'N'), ('（', 'W'), ('附', 'V'), ('图片', 'N'), ('１', 'M'), ('张', 'Q'), ('）', 'W')]

0.8714095491725319
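
The parsing step is the only corpus-specific part of this pipeline. The sketch below is a plain-Python approximation of what nltk.str2tuple does to each token, judging by the output shown above: split on the last slash and upper-case the tag. The function name parse_token is made up here, and the sample line is a fragment of the one quoted above.

```python
def parse_token(token, sep='/'):
    """Split 'word/tag' on the last separator and upper-case the tag."""
    loc = token.rfind(sep)
    if loc >= 0:
        return (token[:loc], token[loc + len(sep):].upper())
    return (token, None)   # no separator: the token carries no tag

line = '迈向/v  充满/v  希望/n  的/u  新/a  世纪/n'
tagged_sent = [parse_token(item) for item in line.split()]
print(tagged_sent)
```

Because the tags come out upper-cased, a lower-case default tag such as 'n' would never equal a gold tag 'N', which is worth keeping in mind when choosing the backoff tag.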

6. Brown corpus methods

# List the file IDs in the corpus
brown.fileids()

['ca01','ca02','ca03','ca04',
...,'cp17','cp18','cp19','cp20','cp21','cp22','cp23','cp24','cp25','cp26','cp27','cp28','cp29','cr01','cr02','cr03','cr04','cr05','cr06','cr07','cr08','cr09']

# List the file IDs that belong to a given category ('news')
brown.fileids('news')

['ca01','ca02','ca03','ca04','ca05','ca06','ca07','ca08',...'ca26','ca27','ca28','ca29','ca30','ca31','ca32','ca33','ca34','ca35','ca36','ca37','ca38','ca39','ca40','ca41','ca42','ca43','ca44']

# Return the raw text of a given category
brown.raw(categories=['news'])

"\n\n\tThe/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.\n\n\n\tThe/at jury/nn further/rbr said/vbd in/in term-end/nn presentments/nns that/cs the/at City/nn-tl Executive/jj-tl Committee/nn-tl ,/, which/wdt had/hvd over-all/jj charge/nn of/in the/at election/nn ,/, ``/`` deserves/vbz the/at praise/nn and/cc thanks/nns of/in the/at City/nn-tl of/in-tl Atlanta/np-tl ''/'' for/in the/at manner/nn in/in which/wdt the/at election/nn was/bedz conducted/vbn ./.\n\n\n\tThe/at September-October/np term/nn jury/nn had/hvd been/ben charged/vbn by/in Fulton/np-tl Superior/jj-tl
…
Steve/np Barber/np joined/vbd the/at club/nn one/cd week/nn ago/rb after/cs completing/vbg his/pp$ hitch/nn under/in the/at Army's/nn$ … ,/, Ky./np ./.\nThe/at 22-year-old … bulky/jj spring-training/nn contingent/nn now/rb gradually/rb will/md be/be reduced/vbn as/cs Manager/nn-tl Paul/np Richards/np and/cc his/pp$ coaches/nns seek/vb to/to trim/vb it/ppo down/rp to/in a/at more/ql streamlined/vbn and/cc workable/jj unit/nn ./.\n\n\n\n\n``/`` Take/vb a/at ride/nn on/in this/dt one/cd ''/'' ,/, Brooks/np Robinson/np greeted/vbd Hansen/np as/cs the/at Bird/np third/od sacker/nn grabbed/vbd a/at bat/nn ,/, headed/vbd for/in the/at plate/nn and/cc bounced/vbd a/at third-inning/nn two-run/jj double/nn off/in the/at left-centerfield/nn wall/nn tonight/nr ./.\n\n\n\tIt/pps was/bedz the/at first/od of/in two/cd doubles/nns by/in Robinson/np ,/, who/wps was/bedz in/in a/at mood/nn to/to celebrate/vb ./.\n\n\n\tJust/rb before/in game/nn time/nn ,/, Robinson's/np$ pretty/jj wife/nn ,/, Connie/np informed/vbd him/ppo that/cs an/at addition/nn to/in the/at family/nn can/md be/be expected/vbn late/jj next/ap summer/nn ./.\n\n\n\tUnfortunately/rb ,/, Brooks's/np$ teammates/nns were/bed not/* in/in such/ql festive/jj mood/nn as/cs the/at Orioles/nps expired/vbd before/in the/at seven-hit

# Return the raw text of the given files as a string
brown.raw(fileids=['ca01','ca02'])

"\n\n\tThe/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.\n\n\n\tThe/at jury/nn further/rbr said/vbd in/in term-end/nn presentments/nns that/cs the/at City/nn-tl Executive/jj-tl Committee/nn-tl ,/, which/wdt had/hvd over-all/jj charge/nn of/in the/at election/nn ,/, ``/`` deserves/vbz the/at praise/nn and/cc thanks/nns of/in the/at City/nn-tl of/in-tl ... for/in-hl extension/nn-hl \nOther/ap recommendations/nns made/vbn by/in the/at committee/nn are/ber :/: \n\n\tExtension/nn of/in the/at ADC/nn program/nn to/in all/abn children/nns in/in need/nn living/vbg with/in any/dti relatives/nns ,/, including/in both/abx parents/nns ,/, as/cs a/at means/nns of/in preserving/vbg family/nn unity/nn ./.\n\n\n\tResearch/nn projects/nns as/ql soon/rb as/cs possible/jj on/in the/at causes/nns and/cc prevention/nn of/in dependency/nn and/cc illegitimacy/nn ./.\n\n"

# Return the sentences of the given files as a list
brown.sents(fileids=['ca01','ca02'])

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

# Return the sentences of a given category as a list
brown.sents(categories=['news'])

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

# Return the words of the given file as a list
brown.words('ca01')

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

# Return the words of a given category as a list
brown.words(categories=['news'])

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

# Return the tagged sentences as a nested list: one list of (word, tag) pairs per sentence
brown.tagged_sents(categories=['news'])

[[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')], [('The', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ('said', 'VBD'), ('in', 'IN'), ('term-end', 'NN'), ('presentments', 'NNS'), ('that', 'CS'), ('the', 'AT'), ('City', 'NN-TL'), ('Executive', 'JJ-TL'), ('Committee', 'NN-TL'), (',', ','), ('which', 'WDT'), ('had', 'HVD'), ('over-all', 'JJ'), ('charge', 'NN'), ('of', 'IN'), ('the', 'AT'), ('election', 'NN'), (',', ','), ('``', '``'), ('deserves', 'VBZ'), ('the', 'AT'), ('praise', 'NN'), ('and', 'CC'), ('thanks', 'NNS'), ('of', 'IN'), ('the', 'AT'), ('City', 'NN-TL'), ('of', 'IN-TL'), ('Atlanta', 'NP-TL'), ("''", "''"), ('for', 'IN'), ('the', 'AT'), ('manner', 'NN'), ('in', 'IN'), ('which', 'WDT'), ('the', 'AT'), ('election', 'NN'), ('was', 'BEDZ'), ('conducted', 'VBN'), ('.', '.')], ...]

NLTK2: Part-of-Speech Tagging