Note: the code in this post runs under Python 3. Python 3 differs from Python 2 in a number of details, so readers should keep that in mind. This post is code-centric, with detailed comments inside the code. Related articles are published in my blog column "Python自然语言处理" (Python Natural Language Processing); you are welcome to follow it.


I. Information Extraction and Chunking

1. Information Extraction

# Information extraction: preprocess raw text into POS-tagged sentences
import nltk

def ie_preprocess(document):
    sentences = nltk.sent_tokenize(document)                      # sentence segmentation
    sentences = [nltk.word_tokenize(sent) for sent in sentences]  # tokenization
    sentences = [nltk.pos_tag(sent) for sent in sentences]        # POS tagging
    return sentences

2. Chunking

# Chunking with regular expressions
grammar = r"""
NP: {<DT|PP\$>?<JJ>*<NN>} # chunk determiner/possessive, adjectives and nouns
{<NNP>+} # chunk sequences of proper nouns
"""  # the chunking grammar
cp = nltk.RegexpParser(grammar)  # build the chunker
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
            ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
result = cp.parse(sentence)  # apply the chunker to a tagged sentence
print(result)  # print the result
result.draw()  # display the result as a tree

The chunked result (also shown as a tree by draw()) is:

(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))
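The same chunk structure can also be viewed as IOB tags (B- begins a chunk, I- continues it, O is outside any chunk), which is the format `nltk.chunk.tree2conlltags` produces. As a minimal pure-Python sketch, the triples below are written out by hand for the sentence above, and a small loop recovers the chunk word groups from them:

```python
# Hand-written IOB triples equivalent to the chunk tree above.
conlltags = [
    ("Rapunzel", "NNP", "B-NP"),
    ("let",      "VBD", "O"),
    ("down",     "RP",  "O"),
    ("her",      "PP$", "B-NP"),
    ("long",     "JJ",  "I-NP"),
    ("golden",   "JJ",  "I-NP"),
    ("hair",     "NN",  "I-NP"),
]

# Recover the chunk word groups from the IOB tags:
chunks, current = [], []
for word, pos, iob in conlltags:
    if iob == "B-NP":          # a new chunk starts
        if current:
            chunks.append(current)
        current = [word]
    elif iob == "I-NP":        # continue the current chunk
        current.append(word)
    else:                      # O: outside any chunk
        if current:
            chunks.append(current)
            current = []
if current:
    chunks.append(current)

print(chunks)  # [['Rapunzel'], ['her', 'long', 'golden', 'hair']]
```

The two recovered groups correspond exactly to the two NP subtrees in the parse above.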

3. Examples

# Searching a text corpus with a chunker
cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')
brown = nltk.corpus.brown
for sent in brown.tagged_sents():
    tree = cp.parse(sent)
    for subtree in tree.subtrees():
        if subtree.label() == 'CHUNK':
            print(subtree)

# Chinking
grammar = r"""
NP:
{<.*>+} # Chunk everything
}<VBD|IN>+{ # Chink sequences of VBD and IN
"""
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser(grammar)
print(cp.parse(sentence))
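Chinking removes a sequence of tokens from an already-formed chunk: the grammar above first chunks the whole sentence, then splits it wherever a VBD or IN occurs. The splitting step can be sketched in pure Python (a toy reimplementation for illustration, not NLTK's actual code):

```python
def chink(tagged, chink_tags):
    """Start from one chunk spanning the whole sentence, then split it
    wherever a token's tag is in chink_tags (a toy sketch of chinking)."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag in chink_tags:      # this token is chinked out
            if current:
                chunks.append(current)
                current = []
        else:                      # this token stays inside a chunk
            current.append(word)
    if current:
        chunks.append(current)
    return chunks

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
print(chink(sentence, {"VBD", "IN"}))
# [['the', 'little', 'yellow', 'dog'], ['the', 'cat']]
```

This matches the two NP chunks that the RegexpParser produces for the same sentence.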

II. Evaluating Chunkers

1. A Baseline

from nltk.corpus import conll2000

cp = nltk.RegexpParser("")
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print(cp.evaluate(test_sents))

With an empty RegexpParser grammar, the evaluation result is:

ChunkParse score:
    IOB Accuracy:  43.4%
    Precision:      0.0%
    Recall:         0.0%
    F-Measure:      0.0%
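The empty grammar proposes no chunks at all, so every token is tagged O. IOB accuracy then equals the fraction of O tags in the gold data (about 43% for CoNLL-2000 NPs), while chunk precision and recall are 0 because no chunks are ever proposed. A minimal pure-Python sketch on a toy gold sequence (the tags below are invented for illustration):

```python
# Toy gold IOB sequence (invented for illustration): half the tags are O.
gold = ["B-NP", "I-NP", "O", "B-NP", "O", "O"]

# The empty-grammar baseline predicts O everywhere.
pred = ["O"] * len(gold)

# IOB accuracy: fraction of positions where the predicted tag matches gold.
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(accuracy)  # 0.5 -- exactly the proportion of O tags in gold

# Chunk precision/recall are 0: the baseline proposes no chunks at all.
proposed_chunks = pred.count("B-NP")
print(proposed_chunks)  # 0
```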

2. A Simple Evaluation

grammar = r"NP: {<[CDJNP].*>+}"
cp = nltk.RegexpParser(grammar)
print(cp.evaluate(test_sents))

The evaluation result is now:

ChunkParse score:
    IOB Accuracy:  87.7%
    Precision:     70.6%
    Recall:        67.8%
    F-Measure:     69.2%

3. Chunking Noun Phrases with a Unigram Tagger

# Noun-phrase chunking with a unigram tagger
class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
unigram_chunker = UnigramChunker(train_sents)
print(unigram_chunker.evaluate(test_sents))

The result is:

ChunkParse score:
    IOB Accuracy:  92.9%
    Precision:     79.9%
    Recall:        86.8%
    F-Measure:     83.2%

4. Chunking with a Bigram Tagger

class BigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.BigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

bigram_chunker = BigramChunker(train_sents)
print(bigram_chunker.evaluate(test_sents))

The result is:

ChunkParse score:
    IOB Accuracy:  93.3%
    Precision:     82.3%
    Recall:        86.8%
    F-Measure:     84.5%

5. Training a Classifier-Based Chunker

(1)

# Training a classifier-based chunker
class ConsecutiveNPChunkTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        # Use a maximum-entropy classifier (it performs better here than naive Bayes)
        self.classifier = nltk.MaxentClassifier.train(train_set, trace=0)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)

class ConsecutiveNPChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        tagged_sents = [[((w, t), c) for (w, t, c) in nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)

    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w, t, c) for ((w, t), c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)

def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    return {"pos": pos}

chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))

(2)

def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i - 1]
    return {"pos": pos, "prevpos": prevpos}

chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))

(3)

def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i - 1]
    return {"pos": pos, "word": word, "prevpos": prevpos}

chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))

(4)

def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i - 1]
    if i == len(sentence) - 1:
        nextword, nextpos = "<END>", "<END>"
    else:
        nextword, nextpos = sentence[i + 1]
    return {"pos": pos,
            "word": word,
            "prevpos": prevpos,
            "nextpos": nextpos,
            "prevpos+pos": "%s+%s" % (prevpos, pos),
            "pos+nextpos": "%s+%s" % (pos, nextpos),
            "tags-since-dt": tags_since_dt(sentence, i)}

def tags_since_dt(sentence, i):
    tags = set()
    for word, pos in sentence[:i]:
        if pos == 'DT':
            tags = set()
        else:
            tags.add(pos)
    return '+'.join(sorted(tags))

chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))
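To see what the classifier actually consumes, the feature extractor from (4) can be run on a single position of a tagged sentence, with no training required (the sample sentence is reused from the chinking example earlier):

```python
def tags_since_dt(sentence, i):
    # Collect the POS tags seen since the most recent determiner (DT)
    tags = set()
    for word, pos in sentence[:i]:
        if pos == 'DT':
            tags = set()
        else:
            tags.add(pos)
    return '+'.join(sorted(tags))

def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    prevword, prevpos = ("<START>", "<START>") if i == 0 else sentence[i - 1]
    nextword, nextpos = ("<END>", "<END>") if i == len(sentence) - 1 else sentence[i + 1]
    return {"pos": pos, "word": word, "prevpos": prevpos, "nextpos": nextpos,
            "prevpos+pos": "%s+%s" % (prevpos, pos),
            "pos+nextpos": "%s+%s" % (pos, nextpos),
            "tags-since-dt": tags_since_dt(sentence, i)}

sent = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
        ("dog", "NN"), ("barked", "VBD")]
features = npchunk_features(sent, 3, [])  # features for "dog"
print(features)
# {'pos': 'NN', 'word': 'dog', 'prevpos': 'JJ', 'nextpos': 'VBD',
#  'prevpos+pos': 'JJ+NN', 'pos+nextpos': 'NN+VBD', 'tags-since-dt': 'JJ'}
```

Each such dictionary, paired with the token's IOB tag, is one training instance for the maximum-entropy classifier.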

III. Recursion in Linguistic Structure

1. Building Nested Structure with a Cascaded Chunker

# Building nested structure with a cascaded chunker
grammar = r"""
NP: {<DT|JJ|NN.*>+} # Chunk sequences of DT, JJ, NN
PP: {<IN><NP>} # Chunk prepositions followed by NP
VP: {<VB.*><NP|PP|CLAUSE>+$} # Chunk verbs and their arguments
CLAUSE: {<NP><VP>} # Chunk NP, VP
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Mary", "NN"), ("saw", "VBD"), ("the", "DT"), ("cat", "NN"),("sit", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]
print(cp.parse(sentence))

The result is:

(S
  (NP Mary/NN)
  saw/VBD
  (CLAUSE
    (NP the/DT cat/NN)
    (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))

Testing again, this time looping over the grammar:

cp = nltk.RegexpParser(grammar, loop=2)
print(cp.parse(sentence))

The result is:

(S
  (CLAUSE
    (NP Mary/NN)
    (VP
      saw/VBD
      (CLAUSE
        (NP the/DT cat/NN)
        (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))))

2. Trees and Traversal

(1) Trees

tree1 = nltk.Tree('NP', ['Alice'])
tree2 = nltk.Tree('NP', ['the', 'rabbit'])
tree3 = nltk.Tree('VP', ['chased', tree2])
tree4 = nltk.Tree('S', [tree1, tree3])
print(tree4)

The result is:

(S (NP Alice) (VP chased (NP the rabbit)))

(2) Traversal

# Traversing a tree
def traverse(t):
    try:
        t.label()
    except AttributeError:
        print(t)          # a leaf: just print it
    else:
        # Now we know that t is a Tree with a label
        print('(', t.label())
        for child in t:
            traverse(child)
        print(')')

traverse(tree4)

The result is:

( S
( NP
Alice
)
( VP
chased
( NP
the
rabbit
)
)
)

3. Relation Extraction

import re

IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.relextract.rtuple(rel))

from nltk.corpus import conll2002

vnv = """
(
is/V| # 3rd sing present and
was/V| # past forms of the verb zijn ('be')
werd/V| # and also present
wordt/V # past of worden ('become')
)
.* # followed by anything
van/Prep # followed by van ('of')
"""
VAN = re.compile(vnv, re.VERBOSE)
for doc in conll2002.chunked_sents('ned.train'):
    for r in nltk.sem.extract_rels('PER', 'ORG', doc, corpus='conll2002', pattern=VAN):
        print(nltk.sem.relextract.rtuple(r))
