
Chunking with NLTK

The chunked data structure can be rendered graphically, which is useful for analyzing the backbone structure of an English sentence.

# -*- coding: utf-8 -*-
"""
Created on Sun Nov 13 09:14:13 2016

@author: daxiong
"""
import nltk

sentence = "GW.Bush is a big pig."
# Split the sentence into words
words = nltk.word_tokenize(sentence)
# Part-of-speech tagging
tagged = nltk.pos_tag(words)
# Regular expression covering all noun tags
NPGram = r"""NP: {<NNP>|<NN>|<NNS>|<NNPS>}"""
chunkParser = nltk.RegexpParser(NPGram)
chunked = chunkParser.parse(tagged)
# Display the result as a tree diagram
chunked.draw()
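If no display is available for chunked.draw(), printing the tree shows the same structure in bracket notation (a quick sketch; the exact tags depend on the tagger's output):

print(chunked)
# Something like: (S (NP GW.Bush/NNP) is/VBZ a/DT big/JJ (NP pig/NN) ./.)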

# -*- coding: utf-8 -*-
"""
Created on Sun Nov 13 09:14:13 2016

@author: daxiong
"""
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

# Training data
train_text = state_union.raw("2005-GWBush.txt")
# Test data
sample_text = state_union.raw("2006-GWBush.txt")
'''
Punkt is designed to learn parameters (a list of abbreviations, etc.)
unsupervised from a corpus similar to the target domain. The pre-packaged
models may therefore be unsuitable: use PunktSentenceTokenizer(text) to
learn parameters from the given text.
'''
# Train the Punkt sentence tokenizer on the 2005 speech
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
# Once trained, use it to split the 2006 speech into sentences
tokenized = custom_sent_tokenizer.tokenize(sample_text)
'''
nltk.pos_tag(["fire"])  # pos_tag takes a list of tokens
Out[19]: [('fire', 'NN')]
'''
words = nltk.word_tokenize(tokenized[0])
tagged = nltk.pos_tag(words)
chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
chunkParser = nltk.RegexpParser(chunkGram)
chunked = chunkParser.parse(tagged)
# Keep only the subtrees labeled 'Chunk'
for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
    print(subtree)

Data type: chunked is a tree structure (an nltk.tree.Tree).

The filter lambda t: t.label() == 'Chunk' keeps only the subtrees labeled Chunk, so the output contains just those chunks.
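Because chunked is a Tree, the standard nltk.tree.Tree API applies; a quick sketch of a few useful calls:

type(chunked)      # nltk.tree.Tree
chunked.label()    # 'S' -- the default root label used by RegexpParser
chunked.leaves()   # the flat list of (word, tag) pairs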

Full code:

# -*- coding: utf-8 -*-
"""
Created on Sun Nov 13 09:14:13 2016

@author: daxiong
"""
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

# Training data
train_text = state_union.raw("2005-GWBush.txt")
# Test data
sample_text = state_union.raw("2006-GWBush.txt")
'''
Punkt is designed to learn parameters (a list of abbreviations, etc.)
unsupervised from a corpus similar to the target domain. The pre-packaged
models may therefore be unsuitable: use PunktSentenceTokenizer(text) to
learn parameters from the given text.
'''
# Train the Punkt sentence tokenizer
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
# Once trained, split the test text into sentences
tokenized = custom_sent_tokenizer.tokenize(sample_text)
'''
nltk.pos_tag(["fire"])  # pos_tag takes a list of tokens
Out[19]: [('fire', 'NN')]
'''
'''
# Test on a single sentence
words = nltk.word_tokenize(tokenized[0])
tagged = nltk.pos_tag(words)
chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
chunkParser = nltk.RegexpParser(chunkGram)
chunked = chunkParser.parse(tagged)
# Keep only the subtrees labeled 'Chunk'
for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
    print(subtree)
'''

# POS-tag and chunk the first five sentences
def process_content():
    try:
        for i in tokenized[0:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            # RB adverb, VB verb, NNP singular proper noun, NN singular noun
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            # print(chunked)
            for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                print(subtree)
            # chunked.draw()
    except Exception as e:
        print(str(e))

process_content()

This extracts all of the noun chunks.

Now that we know the parts of speech, we can do what is called chunking, and group words into hopefully meaningful chunks. One of the main goals of chunking is to group into what are known as "noun phrases." These are phrases of one or more words that contain a noun, maybe some descriptive words, maybe a verb, and maybe something like an adverb. The idea is to group nouns with the words that are in relation to them.

In order to chunk, we combine the part-of-speech tags with regular expressions. From regular expressions, we will mainly use the following:

+ = match 1 or more
? = match 0 or 1 repetitions
* = match 0 or MORE repetitions
. = any character except a new line

See the tutorial linked above if you need help with regular expressions. The last thing to note is that part-of-speech tags are denoted with "<" and ">", and we can also place regular expressions within the tags themselves to account for things like "all nouns" (<N.*>).
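For example, a grammar that groups any run of nouns with that <N.*> wildcard might look like the following sketch (the sample sentence is made up for illustration):

import nltk

tagged = nltk.pos_tag(nltk.word_tokenize("The White House press office issued a statement"))
# <N.*>+ matches one or more of any noun tag (NN, NNS, NNP, NNPS)
nounParser = nltk.RegexpParser(r"""NP: {<N.*>+}""")
print(nounParser.parse(tagged))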

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            chunked.draw()
    except Exception as e:
        print(str(e))

process_content()

The result of this is something like:

The main line here in question is:

chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""

This line, broken down:

<RB.?>* = "0 or more of any tense of adverb," followed by:

<VB.?>* = "0 or more of any tense of verb," followed by:

<NNP>+ = "One or more proper nouns," followed by

<NN>? = "zero or one singular noun."

Try playing around with combinations to group various instances until you feel comfortable with chunking.

Not covered in the video, but also a reasonable task is to actually access the chunks specifically. This is something rarely talked about, but can be an essential step depending on what you're doing. Say you print the chunks out, you are going to see output like:

(S
  (Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
  'S/POS
  (Chunk
    ADDRESS/NNP BEFORE/NNP A/NNP JOINT/NNP SESSION/NNP OF/NNP THE/NNP
    CONGRESS/NNP ON/NNP THE/NNP STATE/NNP OF/NNP THE/NNP UNION/NNP
    January/NNP)
  31/CD
  ,/,
  2006/CD
  THE/DT
  (Chunk PRESIDENT/NNP)
  :/:
  (Chunk Thank/NNP)
  you/PRP
  all/DT
  ./.)

Cool, that helps us visually, but what if we want to access this data via our program? Well, what is happening here is our "chunked" variable is an NLTK tree. Each "chunk" and "non chunk" is a "subtree" of the tree. We can reference these by doing something like chunked.subtrees. We can then iterate through these subtrees like so:

for subtree in chunked.subtrees():
    print(subtree)

Next, we might be only interested in getting just the chunks, ignoring the rest. We can use the filter parameter in the chunked.subtrees() call.

for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
    print(subtree)

Now, we're filtering to only show the subtrees with the label of "Chunk." Keep in mind, this isn't "Chunk" as in the NLTK chunk attribute... this is "Chunk" literally because that's the label we gave it here: chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""

Had we said instead something like chunkGram = r"""Pythons: {<RB.?>*<VB.?>*<NNP>+<NN>?}""", then we would filter by the label of "Pythons." The result here should be something like:

(Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
(Chunk ADDRESS/NNP BEFORE/NNP A/NNP JOINT/NNP SESSION/NNP OF/NNP THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP OF/NNP THE/NNP UNION/NNP January/NNP)
(Chunk PRESIDENT/NNP)
(Chunk Thank/NNP)
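A natural follow-up, not shown in the original, is to pull each chunk's text out as a plain string instead of printing the subtree. A minimal sketch (the helper name chunk_to_string is my own):

def chunk_to_string(subtree):
    # Each leaf of a chunk subtree is a (word, tag) pair
    return ' '.join(word for word, tag in subtree.leaves())

for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
    print(chunk_to_string(subtree))
# e.g. PRESIDENT GEORGE W. BUSH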

Full code for this would be:

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            print(chunked)
            for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                print(subtree)
            chunked.draw()
    except Exception as e:
        print(str(e))

process_content()

If you get particular enough, you may find you would be better off if there were a way to chunk everything except some stuff. That process is known as chinking, and it's what we're going to cover next.


Reposted from: https://www.cnblogs.com/webRobot/p/6080135.html
