Python NLP sentence extraction: getting only the tokenized sentences as output from Stanford CoreNLP

Try the new "shiny" Stanford CoreNLP API in NLTK =)

First:

pip install -U nltk[corenlp]

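Then, on the command line, start the CoreNLP server from the directory that contains the CoreNLP jar files (the "*" classpath picks them up):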

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
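
Before switching to Python, it can help to confirm that something is actually listening on the chosen port. The following is a minimal sketch using only the standard library. The helper name `corenlp_server_is_up` is hypothetical (it is not part of NLTK or CoreNLP), and it assumes the server answers plain HTTP requests at its base URL on the port given above.

```python
import urllib.error
import urllib.request


def corenlp_server_is_up(url="http://localhost:9000", timeout=5.0):
    """Return True if anything answers HTTP at `url`, else False.

    Hypothetical helper: it only checks reachability, not that the
    process on the other end really is a CoreNLP server.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied, just with an error status; it is still up.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timed out, host not found, and so on.
        return False
```

Calling `corenlp_server_is_up()` right after launching the Java process may return False for a short while, since the server needs a moment to start.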

Then in Python, the standard usage is:

>>> from nltk.parse.corenlp import CoreNLPParser

>>> stanford = CoreNLPParser('http://localhost:9000')

>>> text = 'Pusheen and Smitha walked along the beach. Pusheen wanted to surf, but fell off the surfboard.'

# Gets you the tokens.

>>> ' '.join(next(stanford.raw_parse(text)).leaves())

u'Pusheen and Smitha walked along the beach . Pusheen wanted to surf , but fell off the surfboard .'

# Gets you the Tree object.

>>> next(stanford.raw_parse(text))

Tree('ROOT', [Tree('S', [Tree('S', [Tree('NP', [Tree('NNP', ['Pusheen']), Tree('CC', ['and']), Tree('NNP', ['Smitha'])]), Tree('VP', [Tree('VBD', ['walked']), Tree('PP', [Tree('IN', ['along']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['beach'])])])]), Tree('.', ['.'])]), Tree('NP', [Tree('NNP', ['Pusheen'])]), Tree('VP', [Tree('VP', [Tree('VBD', ['wanted']), Tree('PP', [Tree('TO', ['to']), Tree('NP', [Tree('NN', ['surf'])])])]), Tree(',', [',']), Tree('CC', ['but']), Tree('VP', [Tree('VBD', ['fell']), Tree('PRT', [Tree('RP', ['off'])]), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['surfboard'])])])]), Tree('.', ['.'])])])

# Gets you the pretty png tree.

>>> next(stanford.raw_parse(text)).draw()

[out]: (parse tree image not preserved)

To get the tokenized sentences, you need a little finesse:

>>> from nltk.parse.corenlp import CoreNLPParser

>>> stanford = CoreNLPParser('http://localhost:9000')

# Using the CoreNLPParser.api_call() function, ...

>>> stanford.api_call

# ... , you can get the JSON output from the CoreNLP tool.

>>> stanford.api_call(text, properties={'annotators': 'tokenize,ssplit'})

{u'sentences': [{u'tokens': [{u'index': 1, u'word': u'Pusheen', u'after': u' ', u'characterOffsetEnd': 7, u'characterOffsetBegin': 0, u'originalText': u'Pusheen', u'before': u''}, {u'index': 2, u'word': u'and', u'after': u' ', u'characterOffsetEnd': 11, u'characterOffsetBegin': 8, u'originalText': u'and', u'before': u' '}, {u'index': 3, u'word': u'Smitha', u'after': u' ', u'characterOffsetEnd': 18, u'characterOffsetBegin': 12, u'originalText': u'Smitha', u'before': u' '}, {u'index': 4, u'word': u'walked', u'after': u' ', u'characterOffsetEnd': 25, u'characterOffsetBegin': 19, u'originalText': u'walked', u'before': u' '}, {u'index': 5, u'word': u'along', u'after': u' ', u'characterOffsetEnd': 31, u'characterOffsetBegin': 26, u'originalText': u'along', u'before': u' '}, {u'index': 6, u'word': u'the', u'after': u' ', u'characterOffsetEnd': 35, u'characterOffsetBegin': 32, u'originalText': u'the', u'before': u' '}, {u'index': 7, u'word': u'beach', u'after': u'', u'characterOffsetEnd': 41, u'characterOffsetBegin': 36, u'originalText': u'beach', u'before': u' '}, {u'index': 8, u'word': u'.', u'after': u' ', u'characterOffsetEnd': 42, u'characterOffsetBegin': 41, u'originalText': u'.', u'before': u''}], u'index': 0}, {u'tokens': [{u'index': 1, u'word': u'Pusheen', u'after': u' ', u'characterOffsetEnd': 50, u'characterOffsetBegin': 43, u'originalText': u'Pusheen', u'before': u' '}, {u'index': 2, u'word': u'wanted', u'after': u' ', u'characterOffsetEnd': 57, u'characterOffsetBegin': 51, u'originalText': u'wanted', u'before': u' '}, {u'index': 3, u'word': u'to', u'after': u' ', u'characterOffsetEnd': 60, u'characterOffsetBegin': 58, u'originalText': u'to', u'before': u' '}, {u'index': 4, u'word': u'surf', u'after': u'', u'characterOffsetEnd': 65, u'characterOffsetBegin': 61, u'originalText': u'surf', u'before': u' '}, {u'index': 5, u'word': u',', u'after': u' ', u'characterOffsetEnd': 66, u'characterOffsetBegin': 65, u'originalText': u',', u'before': u''}, {u'index': 6, 
u'word': u'but', u'after': u' ', u'characterOffsetEnd': 70, u'characterOffsetBegin': 67, u'originalText': u'but', u'before': u' '}, {u'index': 7, u'word': u'fell', u'after': u' ', u'characterOffsetEnd': 75, u'characterOffsetBegin': 71, u'originalText': u'fell', u'before': u' '}, {u'index': 8, u'word': u'off', u'after': u' ', u'characterOffsetEnd': 79, u'characterOffsetBegin': 76, u'originalText': u'off', u'before': u' '}, {u'index': 9, u'word': u'the', u'after': u' ', u'characterOffsetEnd': 83, u'characterOffsetBegin': 80, u'originalText': u'the', u'before': u' '}, {u'index': 10, u'word': u'surfboard', u'after': u'', u'characterOffsetEnd': 93, u'characterOffsetBegin': 84, u'originalText': u'surfboard', u'before': u' '}, {u'index': 11, u'word': u'.', u'after': u'', u'characterOffsetEnd': 94, u'characterOffsetBegin': 93, u'originalText': u'.', u'before': u''}], u'index': 1}]}

>>> output_json = stanford.api_call(text, properties={'annotators': 'tokenize,ssplit'})

>>> len(output_json['sentences'])

2

>>> for sent in output_json['sentences']:
...     start_offset = sent['tokens'][0]['characterOffsetBegin']  # Begin offset of first token.
...     end_offset = sent['tokens'][-1]['characterOffsetEnd']  # End offset of last token.
...     sent_str = text[start_offset:end_offset]  # Slice the original text, keeping its spacing.
...     print(sent_str)
...

Pusheen and Smitha walked along the beach.

Pusheen wanted to surf, but fell off the surfboard.
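
The same offset logic can be wrapped in small functions that work on the JSON dictionary directly. The sketch below is written for Python 3 and uses a hand-built dictionary with the same shape as the `api_call()` output shown above (only the keys used here), so it runs without a server. The function names `sentence_strings` and `sentence_tokens` and the sample data are hypothetical, not part of NLTK.

```python
def sentence_strings(text, output_json):
    """Slice each sentence out of `text` using the token character offsets."""
    sentences = []
    for sent in output_json["sentences"]:
        tokens = sent["tokens"]
        if not tokens:
            continue  # Skip a sentence with no tokens rather than fail.
        start = tokens[0]["characterOffsetBegin"]  # Begin offset of first token.
        end = tokens[-1]["characterOffsetEnd"]     # End offset of last token.
        sentences.append(text[start:end])
    return sentences


def sentence_tokens(output_json):
    """Return one list of token strings per sentence."""
    return [[tok["word"] for tok in sent["tokens"]]
            for sent in output_json["sentences"]]


def _tok(word, begin, end):
    """Build a token dict with only the keys this sketch reads."""
    return {"word": word, "characterOffsetBegin": begin, "characterOffsetEnd": end}


# Hand-built sample mimicking the structure of the server response.
sample_text = "Hello world. Bye now."
sample_json = {
    "sentences": [
        {"index": 0, "tokens": [_tok("Hello", 0, 5), _tok("world", 6, 11), _tok(".", 11, 12)]},
        {"index": 1, "tokens": [_tok("Bye", 13, 16), _tok("now", 17, 20), _tok(".", 20, 21)]},
    ]
}

print(sentence_strings(sample_text, sample_json))
print(sentence_tokens(sample_json))
```

With a running server, the same functions should accept the dictionary returned by `stanford.api_call(text, properties={'annotators': 'tokenize,ssplit'})`. Slicing by offsets keeps the original spacing and punctuation, while joining the `word` values gives the space-separated tokenized form. If the tokenizer normalizes some tokens, `originalText` is likely the field to read for the surface form.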
