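spaCy is a natural language processing toolkit for Python that first appeared in mid-2014. It bills itself as "Industrial-Strength Natural Language Processing in Python". spaCy makes heavy use of Cython to speed up its core modules, which sets it apart from the more academically oriented NLTK and makes it practical for production use.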

Installing and building spaCy is fairly straightforward. On Ubuntu, install the build dependencies and then install spaCy with pip:

sudo apt-get install build-essential python-dev git

sudo pip install -U spacy

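Once the installation finishes, you still need to download the model data. Taking the English model data as an example, the "all" argument downloads everything: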

sudo python -m spacy.en.download all

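Alternatively, you can download the models and the pretrained GloVe word vectors separately:

# Downloads the English models for tokenization, POS tagging, dependency parsing, and named entity recognition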

python -m spacy.en.download parser

# Downloads the pretrained GloVe word vectors

python -m spacy.en.download glove

The downloaded data is stored in the data directory under the spaCy installation directory. On my Ubuntu machine, for example:

textminer@textminer:/usr/local/lib/python2.7/dist-packages/spacy/data$ du -sh *

776M    en-1.1.0

774M    en_glove_cc_300_1m_vectors-1.0.0

Inside the English model directory:

textminer@textminer:/usr/local/lib/python2.7/dist-packages/spacy/data/en-1.1.0$ du -sh *

424M    deps

8.0K    meta.json

35M     ner

12M     pos

84K     tokenizer

300M    vocab

6.3M    wordnet

The following command checks whether the model data was installed correctly:

textminer@textminer:~$ python -c "import spacy; spacy.load('en'); print('OK')"

OK
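The one-liner above fails with a traceback if something is missing. As a gentler pre-check, the sketch below only tests whether a package can be found on the import path. It is a minimal illustration under stated assumptions: the helper name `is_installed` is hypothetical, it relies on `importlib.util.find_spec` from Python 3.4+ (the session in this post uses Python 2.7), and it says nothing about whether the model data itself is present.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found on the import path."""
    # find_spec returns None when a top-level package cannot be located.
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # "spacy" and "pytest" are the two packages this post installs.
    for name in ("spacy", "pytest"):
        status = "found" if is_installed(name) else "missing"
        print("{}: {}".format(name, status))
```

If spaCy is reported as found, the `spacy.load('en')` check above is still the one that confirms the model data loads.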

You can also run the test suite with pytest:

# First, find the spaCy installation path:

python -c "import os; import spacy; print(os.path.dirname(spacy.__file__))"

/usr/local/lib/python2.7/dist-packages/spacy

# Then install pytest:

sudo python -m pip install -U pytest

# Finally, run the tests:

python -m pytest /usr/local/lib/python2.7/dist-packages/spacy --vectors --model --slow

============================= test session starts ==============================

platform linux2 -- Python 2.7.12, pytest-3.0.4, py-1.4.31, pluggy-0.4.0

rootdir: /usr/local/lib/python2.7/dist-packages/spacy, inifile:

collected 318 items

../../usr/local/lib/python2.7/dist-packages/spacy/tests/test_matcher.py ........

../../usr/local/lib/python2.7/dist-packages/spacy/tests/matcher/test_entity_id.py ....

../../usr/local/lib/python2.7/dist-packages/spacy/tests/matcher/test_matcher_bugfixes.py .....

......

../../usr/local/lib/python2.7/dist-packages/spacy/tests/vocab/test_vocab.py .......Xx

../../usr/local/lib/python2.7/dist-packages/spacy/tests/website/test_api.py x...............

../../usr/local/lib/python2.7/dist-packages/spacy/tests/website/test_home.py ............

============== 310 passed, 5 xfailed, 3 xpassed in 53.95 seconds ===============

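Now we can take a quick tour of spaCy's features, using the English model data as an example. At the time of writing, spaCy mainly supports English and German, and support for other languages is being added gradually: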

textminer@textminer:~$ ipython

Python 2.7.12 (default, Jul 1 2016, 15:12:24)

Type "copyright", "credits" or "license" for more information.

IPython 2.4.1 -- An enhanced Interactive Python.

? -> Introduction and overview of IPython's features.

%quickref -> Quick reference.

help -> Python's own help system.

object? -> Details about 'object', use 'object??' for extra details.

In [1]: import spacy

# Load the English model data; this takes a moment

In [2]: nlp = spacy.load('en')

Word tokenization (spaCy 1.2 added a Chinese tokenization interface based on the Jieba Chinese word segmenter):

In [3]: test_doc = nlp(u"it's word tokenize test for spacy")

In [4]: print(test_doc)

it's word tokenize test for spacy

In [5]: for token in test_doc:

   ...:     print(token)

   ...:

it

's

word

tokenize

test

for

spacy

English sentence segmentation:

In [6]: test_doc = nlp(u'Natural language processing (NLP) deals with the application of computational models to text or speech data. Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways. NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form. From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.')

In [7]: for sent in test_doc.sents:

   ...:     print(sent)

   ...:

Natural language processing (NLP) deals with the application of computational models to text or speech data.

Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways.

NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form.

From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.

Lemmatization (token.lemma_ is the lemma string, token.lemma is its integer ID):

In [8]: test_doc = nlp(u"you are best. it is lemmatize test for spacy. I love these books")

In [9]: for token in test_doc:

   ...:     print(token, token.lemma_, token.lemma)

   ...:

(you, u'you', 472)

(are, u'be', 488)

(best, u'good', 556)

(., u'.', 419)

(it, u'it', 473)

(is, u'be', 488)

(lemmatize, u'lemmatize', 1510296)

(test, u'test', 1351)

(for, u'for', 480)

(spacy, u'spacy', 173783)

(., u'.', 419)

(I, u'i', 570)

(love, u'love', 644)

(these, u'these', 642)

(books, u'book', 1011)
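Each row above pairs a lemma string with an integer, and "are" and "is" share the same ID (488) because both reduce to "be". The toy class below sketches the general idea of storing each string once and referring to it by an integer. It is only an illustration: `ToyStringStore` is a hypothetical name, its IDs are assigned sequentially and will not match the numbers printed above, and it is not spaCy's actual implementation.

```python
class ToyStringStore(object):
    """Toy string-to-ID table: each distinct string gets one integer ID."""

    def __init__(self):
        self._ids = {}       # string -> integer ID
        self._strings = []   # integer ID -> string

    def add(self, text):
        """Return the ID for `text`, assigning a new one on first sight."""
        if text not in self._ids:
            self._ids[text] = len(self._strings)
            self._strings.append(text)
        return self._ids[text]

    def lookup(self, string_id):
        """Return the string stored under `string_id`."""
        return self._strings[string_id]


store = ToyStringStore()
# Lemmas of the first six tokens, copied from the output above.
lemmas = ["you", "be", "good", ".", "it", "be"]
ids = [store.add(lemma) for lemma in lemmas]
print(ids)  # the two "be" entries share one ID
```

Comparing integer IDs is cheaper than comparing strings, which is one plausible reason to expose both forms of each attribute.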

Part-of-speech (POS) tagging:

In [10]: for token in test_doc:

   ....:     print(token, token.pos_, token.pos)

   ....:

(you, u'PRON', 92)

(are, u'VERB', 97)

(best, u'ADJ', 82)

(., u'PUNCT', 94)

(it, u'PRON', 92)

(is, u'VERB', 97)

(lemmatize, u'ADJ', 82)

(test, u'NOUN', 89)

(for, u'ADP', 83)

(spacy, u'NOUN', 89)

(., u'PUNCT', 94)

(I, u'PRON', 92)

(love, u'VERB', 97)

(these, u'DET', 87)

(books, u'NOUN', 89)
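Once tokens carry coarse tags like these, a common next step is to tally them or filter by tag. The sketch below does this with the standard library only. The pairs are copied from the output above, so it runs without spaCy; with a live document you would build the pairs from `token.pos_` instead. The names `tagged`, `tag_counts`, and `nouns` are illustrative.

```python
from collections import Counter

# (token, coarse POS tag) pairs copied from the output above.
tagged = [
    ("you", "PRON"), ("are", "VERB"), ("best", "ADJ"), (".", "PUNCT"),
    ("it", "PRON"), ("is", "VERB"), ("lemmatize", "ADJ"), ("test", "NOUN"),
    ("for", "ADP"), ("spacy", "NOUN"), (".", "PUNCT"),
    ("I", "PRON"), ("love", "VERB"), ("these", "DET"), ("books", "NOUN"),
]

# Count how often each coarse tag occurs.
tag_counts = Counter(tag for _, tag in tagged)

# Keep only the tokens tagged as nouns.
nouns = [word for word, tag in tagged if tag == "NOUN"]

print(tag_counts.most_common())
print(nouns)
```

Note that the tagger labels "lemmatize" as ADJ here, so any downstream filter inherits such tagging errors.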

Named entity recognition (NER):

In [11]: test_doc = nlp(u"Rami Eid is studying at Stony Brook University in New York")

In [12]: for ent in test_doc.ents:

   ....:     print(ent, ent.label_, ent.label)

   ....:

(Rami Eid, u'PERSON', 346)

(Stony Brook University, u'ORG', 349)

(New York, u'GPE', 350)

Noun phrase extraction:

In [13]: test_doc = nlp(u'Natural language processing (NLP) deals with the application of computational models to text or speech data. Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways. NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form. From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.')

In [14]: for np in test_doc.noun_chunks:

   ....:     print(np)

   ....:

Natural language processing

Natural language processing (NLP) deals

the application

computational models

text

speech

data

Application areas

NLP

automatic (machine) translation

languages

dialogue systems

a human

a machine

natural language

information extraction

the goal

unstructured text

structured (database) representations

flexible ways

NLP technologies

a dramatic impact

the way

people

computers

the way

people

the use

language

the way

people

the vast amount

linguistic data

electronic form

a scientific viewpoint

NLP

fundamental questions

formal models

example

natural language phenomena

algorithms

these models

Computing the similarity of two words based on their word vectors:

In [15]: test_doc = nlp(u"Apples and oranges are similar. Boots and hippos aren't.")

In [16]: apples = test_doc[0]

In [17]: print(apples)

Apples

In [18]: oranges = test_doc[2]

In [19]: print(oranges)

oranges

In [20]: boots = test_doc[6]

In [21]: print(boots)

Boots

In [22]: hippos = test_doc[8]

In [23]: print(hippos)

hippos

In [24]: apples.similarity(oranges)

Out[24]: 0.77809414836023805

In [25]: boots.similarity(hippos)

Out[25]: 0.038474555379008429
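Scores like these are typically the cosine of the angle between the two word vectors, and to my understanding that is what spaCy's default similarity computes. The sketch below implements cosine similarity in plain Python. The three-dimensional vectors in `toy_vectors` are made-up values for illustration, not the 300-dimensional GloVe vectors downloaded earlier, so the resulting numbers will not match the output above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # A zero vector has no direction; return 0.0 instead of dividing by zero.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors: the two fruits point in a similar direction.
toy_vectors = {
    "apples": [0.9, 0.8, 0.1],
    "oranges": [0.8, 0.9, 0.2],
    "boots": [0.1, 0.2, 0.9],
}

fruit_score = cosine_similarity(toy_vectors["apples"], toy_vectors["oranges"])
mixed_score = cosine_similarity(toy_vectors["apples"], toy_vectors["boots"])
print(fruit_score, mixed_score)
```

With these toy values the fruit pair scores higher than the apples and boots pair, mirroring the pattern in the real output.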

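spaCy also offers dependency parsing and other features, of course. It is also worth noting that, starting with version 1.0, spaCy added support for hooking in deep learning tools such as TensorFlow and Keras. For details, see the sentiment analysis example in the official documentation: Hooking a deep learning model into spaCy.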
