spaCy is a Python natural language processing toolkit, released in mid-2014, that bills itself as "Industrial-Strength Natural Language Processing in Python". Unlike the more academically oriented NLTK, spaCy makes heavy use of Cython to speed up its core modules, which gives it real practical value for industrial applications.

Installing and building spaCy is straightforward; on Ubuntu you can install it directly with pip:

sudo apt-get install build-essential python-dev git
sudo pip install -U spacy

After installation you still need to download the model data. Taking the English models as an example, you can download everything with the "all" argument:

sudo python -m spacy.en.download all

Alternatively, you can download the models and the GloVe-trained word vectors separately:

# Downloads the English tokenizer, POS tagging, syntactic parsing and named entity recognition models
python -m spacy.en.download parser

# Downloads the GloVe-trained word vector data
python -m spacy.en.download glove

The downloaded data is placed in the data directory under the spaCy installation path; on my Ubuntu machine:

textminer@textminer:/usr/local/lib/python2.7/dist-packages/spacy/data$ du -sh *
776M en-1.1.0
774M en_glove_cc_300_1m_vectors-1.0.0

Inside the English model directory:

textminer@textminer:/usr/local/lib/python2.7/dist-packages/spacy/data/en-1.1.0$ du -sh *
424M deps
8.0K meta.json
35M  ner
12M  pos
84K  tokenizer
300M vocab
6.3M wordnet

You can check whether the model data was installed successfully with the following command:

textminer@textminer:~$ python -c "import spacy; spacy.load('en'); print('OK')"
OK

You can also run the test suite with pytest:

# First, find the spaCy installation path:
python -c "import os; import spacy; print(os.path.dirname(spacy.__file__))"
/usr/local/lib/python2.7/dist-packages/spacy

# Then install pytest:
sudo python -m pip install -U pytest

# Finally, run the tests:
python -m pytest /usr/local/lib/python2.7/dist-packages/spacy --vectors --model --slow
============================= test session starts ==============================
platform linux2 -- Python 2.7.12, pytest-3.0.4, py-1.4.31, pluggy-0.4.0
rootdir: /usr/local/lib/python2.7/dist-packages/spacy, inifile:
collected 318 items

../../usr/local/lib/python2.7/dist-packages/spacy/tests/test_matcher.py ........
../../usr/local/lib/python2.7/dist-packages/spacy/tests/matcher/test_entity_id.py ....
../../usr/local/lib/python2.7/dist-packages/spacy/tests/matcher/test_matcher_bugfixes.py ...........
../../usr/local/lib/python2.7/dist-packages/spacy/tests/vocab/test_vocab.py .......Xx
../../usr/local/lib/python2.7/dist-packages/spacy/tests/website/test_api.py x...............
../../usr/local/lib/python2.7/dist-packages/spacy/tests/website/test_home.py ............

============== 310 passed, 5 xfailed, 3 xpassed in 53.95 seconds ===============

Now we can quickly try out spaCy's main features, using the English data as an example. spaCy currently focuses on English and German, with support for other languages being added gradually:

textminer@textminer:~$ ipython
Python 2.7.12 (default, Jul 1 2016, 15:12:24)
Type "copyright", "credits" or "license" for more information.

IPython 2.4.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import spacy

# Load the English model data (this may take a moment)
In [2]: nlp = spacy.load('en')

Word tokenization. As of version 1.2, spaCy also provides a Chinese tokenization interface based on the Jieba segmenter (a sketch of it follows the English example below):

In [3]: test_doc = nlp(u"it's word tokenize test for spacy")

In [4]: print(test_doc)
it's word tokenize test for spacy

In [5]: for token in test_doc:
   ...:     print(token)
   ...:
it
's
word
tokenize
test
for
spacy
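As mentioned above, spaCy 1.2 added alpha Chinese tokenization via Jieba. Here is a minimal sketch of that interface, assuming spaCy >= 1.2 with the jieba package installed; spacy.zh.Chinese is the 1.x-era entry point and the module path may differ in later releases:

# Chinese tokenization via Jieba -- a sketch for the spaCy 1.x series.
# Assumes `pip install jieba`; spacy.zh.Chinese is the 1.x-era entry point
# and may be located elsewhere in newer versions.
from spacy.zh import Chinese

nlp_zh = Chinese()
zh_doc = nlp_zh(u'自然语言处理是人工智能和语言学领域的分支学科')
for token in zh_doc:
    print(token)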

Sentence segmentation:

In [6]: test_doc = nlp(u'Natural language processing (NLP) deals with the application of computational models to text or speech data. Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways. NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form. From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.')

In [7]: for sent in test_doc.sents:
   ...:     print(sent)
   ...:
Natural language processing (NLP) deals with the application of computational models to text or speech data.
Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways.
NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form.
From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.

Lemmatization:

In [8]: test_doc = nlp(u"you are best. it is lemmatize test for spacy. I love these books")

In [9]: for token in test_doc:
   ...:     print(token, token.lemma_, token.lemma)
   ...:
(you, u'you', 472)
(are, u'be', 488)
(best, u'good', 556)
(., u'.', 419)
(it, u'it', 473)
(is, u'be', 488)
(lemmatize, u'lemmatize', 1510296)
(test, u'test', 1351)
(for, u'for', 480)
(spacy, u'spacy', 173783)
(., u'.', 419)
(I, u'i', 570)
(love, u'love', 644)
(these, u'these', 642)
(books, u'book', 1011)

Part-of-speech (POS) tagging:

In [10]: for token in test_doc:
    ....:     print(token, token.pos_, token.pos)
    ....:
(you, u'PRON', 92)
(are, u'VERB', 97)
(best, u'ADJ', 82)
(., u'PUNCT', 94)
(it, u'PRON', 92)
(is, u'VERB', 97)
(lemmatize, u'ADJ', 82)
(test, u'NOUN', 89)
(for, u'ADP', 83)
(spacy, u'NOUN', 89)
(., u'PUNCT', 94)
(I, u'PRON', 92)
(love, u'VERB', 97)
(these, u'DET', 87)
(books, u'NOUN', 89)
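token.pos_ gives the coarse-grained universal tag; spaCy also exposes a fine-grained tag (Penn Treebank style for the English model) through token.tag_. A small sketch on the same test_doc from the session above:

# Coarse-grained universal POS (pos_) vs. fine-grained treebank tag (tag_)
for token in test_doc:
    print(token, token.pos_, token.tag_)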

Named entity recognition (NER):

In [11]: test_doc = nlp(u"Rami Eid is studying at Stony Brook University in New York")

In [12]: for ent in test_doc.ents:
    ....:     print(ent, ent.label_, ent.label)
    ....:
(Rami Eid, u'PERSON', 346)
(Stony Brook University, u'ORG', 349)
(New York, u'GPE', 350)

Noun phrase (noun chunk) extraction:

In [13]: test_doc = nlp(u'Natural language processing (NLP) deals with the application of computational models to text or speech data. Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways. NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form. From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.')

In [14]: for np in test_doc.noun_chunks:
    ....:     print(np)
    ....:
Natural language processing
Natural language processing (NLP) deals
the application
computational models
text
speech
data
Application areas
NLP
automatic (machine) translation
languages
dialogue systems
a human
a machine
natural language
information extraction
the goal
unstructured text
structured (database) representations
flexible ways
NLP technologies
a dramatic impact
the way
people
computers
the way
people
the use
language
the way
people
the vast amount
linguistic data
electronic form
a scientific viewpoint
NLP
fundamental questions
formal models
example
natural language phenomena
algorithms
these models
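Each noun chunk is a Span, so you can also see how it attaches to the rest of the sentence through its root token. A short sketch on the same test_doc, using attribute names from the spaCy docs (Span.root, Token.dep_, Token.head):

# For each noun chunk, print the chunk, its head token within the chunk (root),
# the dependency label of that root, and the token the root attaches to.
for chunk in test_doc.noun_chunks:
    print(chunk.text, '|', chunk.root.text, chunk.root.dep_, chunk.root.head.text)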

Computing the similarity of two words based on their word vectors:

In [15]: test_doc = nlp(u"Apples and oranges are similar. Boots and hippos aren't.")

In [16]: apples = test_doc[0]

In [17]: print(apples)
Apples

In [18]: oranges = test_doc[2]

In [19]: print(oranges)
oranges

In [20]: boots = test_doc[6]

In [21]: print(boots)
Boots

In [22]: hippos = test_doc[8]

In [23]: print(hippos)
hippos

In [24]: apples.similarity(oranges)
Out[24]: 0.77809414836023805

In [25]: boots.similarity(hippos)
Out[25]: 0.038474555379008429
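The scores above come from the GloVe vectors downloaded earlier; each token exposes its raw vector as token.vector, and .similarity() is, as the documentation describes it, cosine similarity over those vectors. A minimal sketch reproducing the numbers with numpy, reusing the apples/oranges/boots/hippos tokens from the session above:

import numpy as np

def cosine(u, v):
    # cosine similarity between two vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# token.vector is the word vector loaded from the GloVe data
print(cosine(apples.vector, oranges.vector))   # should roughly match apples.similarity(oranges)
print(cosine(boots.vector, hippos.vector))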

spaCy also includes dependency parsing and related functionality. It is also worth noting that since version 1.0, spaCy has added hooks for deep learning tools such as TensorFlow and Keras; for details, see the sentiment analysis example in the official documentation: Hooking a deep learning model into spaCy.
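For example, the dependency parse is available on every token through token.dep_ (the dependency label) and token.head (the governing token); a minimal sketch using the nlp object loaded above:

# Print each token with its dependency label and syntactic head
dep_doc = nlp(u"Rami Eid is studying at Stony Brook University in New York")
for token in dep_doc:
    print(token.text, token.dep_, token.head.text)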

References: the spaCy official documentation; Getting Started with spaCy.
