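spaCy is a Python natural language processing toolkit that got its start in mid-2014 and bills itself as "Industrial-Strength Natural Language Processing in Python". It makes heavy use of Cython to speed up its core modules, which sets it apart from the more academically oriented NLTK and makes it practical for production use.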

Installing and building spaCy is straightforward. On Ubuntu, install the build dependencies and then install spaCy with pip:

sudo apt-get install build-essential python-dev git

sudo pip install -U spacy

After installation you still need to download the model data. Taking the English models as an example, the "all" argument downloads everything:

sudo python -m spacy.en.download all

Alternatively, download the models and the pretrained GloVe word vectors separately:

# Downloads the English models for tokenization, POS tagging, dependency parsing, and named entity recognition

python -m spacy.en.download parser

# Downloads the pretrained GloVe word vectors

python -m spacy.en.download glove

The downloaded data is stored in the data directory under the spaCy installation directory. On my Ubuntu machine:

textminer@textminer:/usr/local/lib/python2.7/dist-packages/spacy/data$ du -sh *

776M    en-1.1.0

774M    en_glove_cc_300_1m_vectors-1.0.0

Inside the English model directory:

textminer@textminer:/usr/local/lib/python2.7/dist-packages/spacy/data/en-1.1.0$ du -sh *

424M    deps

8.0K    meta.json

35M     ner

12M     pos

84K     tokenizer

300M    vocab

6.3M    wordnet

Use the following command to check that the model data was installed correctly:

textminer@textminer:~$ python -c "import spacy; spacy.load('en'); print('OK')"

OK
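The same check can be scripted so that a failure is reported as a message rather than a traceback. This is only a minimal sketch: the function name `check_spacy_model` is made up for illustration, and the model name "en" is the one used by the spaCy 1.x setup in this post (newer spaCy releases use different model names).

```python
import importlib


def check_spacy_model(name="en"):
    """Return "OK" if spaCy imports and the named model loads, else a short reason."""
    try:
        spacy = importlib.import_module("spacy")
    except Exception:  # usually ImportError; a broken install may raise something else
        return "spaCy could not be imported"
    try:
        spacy.load(name)
    except Exception as exc:  # spaCy versions differ in what they raise here
        return "model %r failed to load: %s" % (name, exc.__class__.__name__)
    return "OK"


print(check_spacy_model("en"))
```

On a machine set up as described above, this should print OK; otherwise it prints the reason the check failed.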

You can also run the test suite with pytest:

# First, find the spaCy installation path:

python -c "import os; import spacy; print(os.path.dirname(spacy.__file__))"

/usr/local/lib/python2.7/dist-packages/spacy

# Then install pytest:

sudo python -m pip install -U pytest

# Finally, run the tests:

python -m pytest /usr/local/lib/python2.7/dist-packages/spacy --vectors --model --slow

============================= test session starts ==============================

platform linux2 -- Python 2.7.12, pytest-3.0.4, py-1.4.31, pluggy-0.4.0

rootdir: /usr/local/lib/python2.7/dist-packages/spacy, inifile:

collected 318 items

../../usr/local/lib/python2.7/dist-packages/spacy/tests/test_matcher.py ........

../../usr/local/lib/python2.7/dist-packages/spacy/tests/matcher/test_entity_id.py ....

../../usr/local/lib/python2.7/dist-packages/spacy/tests/matcher/test_matcher_bugfixes.py .....

......

../../usr/local/lib/python2.7/dist-packages/spacy/tests/vocab/test_vocab.py .......Xx

../../usr/local/lib/python2.7/dist-packages/spacy/tests/website/test_api.py x...............

../../usr/local/lib/python2.7/dist-packages/spacy/tests/website/test_home.py ............

============== 310 passed, 5 xfailed, 3 xpassed in 53.95 seconds ===============

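Now let's take a quick tour of spaCy's features, using the English models. At the time of writing, spaCy mainly supports English and German, with support for other languages being added gradually: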

textminer@textminer:~$ ipython

Python 2.7.12 (default, Jul 1 2016, 15:12:24)

Type "copyright", "credits" or "license" for more information.

IPython 2.4.1 -- An enhanced Interactive Python.

? -> Introduction and overview of IPython's features.

%quickref -> Quick reference.

help -> Python's own help system.

object? -> Details about 'object', use 'object??' for extra details.

In [1]: import spacy

# Load the English model data; this takes a moment

In [2]: nlp = spacy.load('en')

Word tokenization (spaCy 1.2 also added a Chinese tokenization interface, based on the Jieba Chinese word segmenter):

In [3]: test_doc = nlp(u"it's word tokenize test for spacy")

In [4]: print(test_doc)

it's word tokenize test for spacy

In [5]: for token in test_doc:

   ...:     print(token)

   ...:

it

's

word

tokenize

test

for

spacy

English sentence segmentation:

In [6]: test_doc = nlp(u'Natural language processing (NLP) deals with the application of computational models to text or speech data. Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways. NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form. From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.')

In [7]: for sent in test_doc.sents:

   ...:     print(sent)

   ...:

Natural language processing (NLP) deals with the application of computational models to text or speech data.

Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways.

NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form.

From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.

Lemmatization (token.lemma_ is the lemma string, token.lemma is its integer ID):

In [8]: test_doc = nlp(u"you are best. it is lemmatize test for spacy. I love these books")

In [9]: for token in test_doc:

   ...:     print(token, token.lemma_, token.lemma)

   ...:

(you, u'you', 472)

(are, u'be', 488)

(best, u'good', 556)

(., u'.', 419)

(it, u'it', 473)

(is, u'be', 488)

(lemmatize, u'lemmatize', 1510296)

(test, u'test', 1351)

(for, u'for', 480)

(spacy, u'spacy', 173783)

(., u'.', 419)

(I, u'i', 570)

(love, u'love', 644)

(these, u'these', 642)

(books, u'book', 1011)

Part-of-speech (POS) tagging:

In [10]: for token in test_doc:

   ....:     print(token, token.pos_, token.pos)

   ....:

(you, u'PRON', 92)

(are, u'VERB', 97)

(best, u'ADJ', 82)

(., u'PUNCT', 94)

(it, u'PRON', 92)

(is, u'VERB', 97)

(lemmatize, u'ADJ', 82)

(test, u'NOUN', 89)

(for, u'ADP', 83)

(spacy, u'NOUN', 89)

(., u'PUNCT', 94)

(I, u'PRON', 92)

(love, u'VERB', 97)

(these, u'DET', 87)

(books, u'NOUN', 89)
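A quick way to summarize a tagging result is to count how often each coarse tag occurs. The sketch below works on the (word, tag) pairs copied from the output above, so it runs without spaCy; the helper name `pos_counts` is made up for illustration. With the model loaded, the same list could presumably be built as `[(t.text, t.pos_) for t in test_doc]`.

```python
from collections import Counter

# (word, coarse POS tag) pairs copied from the session output above.
tagged = [
    ("you", "PRON"), ("are", "VERB"), ("best", "ADJ"), (".", "PUNCT"),
    ("it", "PRON"), ("is", "VERB"), ("lemmatize", "ADJ"), ("test", "NOUN"),
    ("for", "ADP"), ("spacy", "NOUN"), (".", "PUNCT"),
    ("I", "PRON"), ("love", "VERB"), ("these", "DET"), ("books", "NOUN"),
]


def pos_counts(pairs):
    """Count how many tokens carry each POS tag."""
    return Counter(tag for _, tag in pairs)


counts = pos_counts(tagged)
for tag, n in counts.most_common():
    print(tag, n)
```

Note that "lemmatize" is tagged ADJ in this output, so the counts reflect the tagger's mistakes as well as its correct decisions.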

Named entity recognition (NER):

In [11]: test_doc = nlp(u"Rami Eid is studying at Stony Brook University in New York")

In [12]: for ent in test_doc.ents:

   ....:     print(ent, ent.label_, ent.label)

   ....:

(Rami Eid, u'PERSON', 346)

(Stony Brook University, u'ORG', 349)

(New York, u'GPE', 350)
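Downstream code usually wants entities grouped by type rather than printed one by one. The following is a small sketch that groups the three entities from the output above by label; `group_by_label` is a hypothetical helper, and the input is a plain list of (text, label) pairs so that it runs without spaCy. With the model loaded, the pairs could presumably be built as `[(e.text, e.label_) for e in test_doc.ents]`.

```python
from collections import defaultdict

# (entity text, label) pairs copied from the session output above.
entities = [
    ("Rami Eid", "PERSON"),
    ("Stony Brook University", "ORG"),
    ("New York", "GPE"),
]


def group_by_label(pairs):
    """Map each entity label to the list of entity texts that carry it."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


by_label = group_by_label(entities)
print(by_label["ORG"])  # ['Stony Brook University']
```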

Noun phrase extraction:

In [13]: test_doc = nlp(u'Natural language processing (NLP) deals with the application of computational models to text or speech data. Application areas within NLP include automatic (machine) translation between languages; dialogue systems, which allow a human to interact with a machine using natural language; and information extraction, where the goal is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways. NLP technologies are having a dramatic impact on the way people interact with computers, on the way people interact with each other through the use of language, and on the way people access the vast amount of linguistic data now in electronic form. From a scientific viewpoint, NLP involves fundamental questions of how to structure formal models (for example statistical models) of natural language phenomena, and of how to design algorithms that implement these models.')

In [14]: for np in test_doc.noun_chunks:

   ....:     print(np)

   ....:

Natural language processing

Natural language processing (NLP) deals

the application

computational models

text

speech

data

Application areas

NLP

automatic (machine) translation

languages

dialogue systems

a human

a machine

natural language

information extraction

the goal

unstructured text

structured (database) representations

flexible ways

NLP technologies

a dramatic impact

the way

people

computers

the way

people

the use

language

the way

people

the vast amount

linguistic data

electronic form

a scientific viewpoint

NLP

fundamental questions

formal models

example

natural language phenomena

algorithms

these models
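The list above contains repeats ("the way" and "people" each appear three times), which is often unwanted when the chunks are used as keywords. A minimal sketch of removing duplicates while keeping first-seen order, using a run of chunks copied from the output above; `unique_in_order` is a made-up helper name, and the code assumes Python 3.7 or later, where dicts preserve insertion order.

```python
# A run of noun chunks copied from the output above (third sentence).
chunks = [
    "the way", "people", "computers",
    "the way", "people", "the use", "language",
    "the way", "people",
]


def unique_in_order(items):
    """Drop repeated items, keeping the first occurrence of each."""
    return list(dict.fromkeys(items))


print(unique_in_order(chunks))
# ['the way', 'people', 'computers', 'the use', 'language']
```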

Computing the similarity of two words from their word vectors:

In [15]: test_doc = nlp(u"Apples and oranges are similar. Boots and hippos aren't.")

In [16]: apples = test_doc[0]

In [17]: print(apples)

Apples

In [18]: oranges = test_doc[2]

In [19]: print(oranges)

oranges

In [20]: boots = test_doc[6]

In [21]: print(boots)

Boots

In [22]: hippos = test_doc[8]

In [23]: print(hippos)

hippos

In [24]: apples.similarity(oranges)

Out[24]: 0.77809414836023805

In [25]: boots.similarity(hippos)

Out[25]: 0.038474555379008429
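As far as I know, similarity() defaults to the cosine similarity of the two word vectors. The sketch below shows that computation in plain Python. The three-dimensional vectors are toy values invented for illustration, not real GloVe vectors, so the numbers will not match the output above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, chosen only so that the fruit words point the same way.
apples = [0.9, 0.8, 0.1]
oranges = [0.8, 0.9, 0.2]
boots = [0.1, 0.0, 0.9]

print(cosine_similarity(apples, oranges))
print(cosine_similarity(apples, boots))
```

Vectors that point in nearly the same direction score close to 1, and unrelated ones score close to 0, which is the pattern seen in the apples/oranges versus boots/hippos results.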

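spaCy also provides dependency parsing and other features, of course. Also worth noting: starting with version 1.0, spaCy added support for hooking in deep learning tools such as TensorFlow and Keras. For details, see the example in the official documentation that walks through a sentiment analysis model: "Hooking a deep learning model into spaCy".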
