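Recommended: Quickly Master Natural Language Processing in Python with spaCy (with Code Links)
Author: Paco Nathan  Translator: 笪洁琼  Proofreader: 和中华
This article gives a brief introduction to natural language processing (sometimes called "text analytics") using spaCy and related libraries in Python, along with some of the latest related applications.
Instructions page: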
https://support.dominodatalab.com/hc/en-us/articles/115000392643-Environment-management
import spacy
nlp = spacy.load("en_core_web_sm")
text = "The rain in Spain falls mainly on the plain."
doc = nlp(text)
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.is_stop)
The the DET True
rain rain NOUN False
in in ADP True
Spain Spain PROPN False
falls fall VERB False
mainly mainly ADV False
on on ADP True
the the DET True
plain plain NOUN False
. . PUNCT False
import pandas as pd
cols = ("text", "lemma", "POS", "explain", "stopword")
rows = []
for t in doc:
    row = [t.text, t.lemma_, t.pos_, spacy.explain(t.pos_), t.is_stop]
    rows.append(row)
df = pd.DataFrame(rows, columns=cols)
df
The original text
The lemma: the root form of the word
The part of speech
A flag for whether the word is a stop word, i.e., a common word that may be filtered out
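The stop-word flag is typically used to drop very common words before counting or comparing terms. The sketch below illustrates that filtering step in plain Python. The rows are copied from the token output above, and `content_lemmas` is a hypothetical helper for illustration, not part of spaCy.

```python
# Rows copied from the token output above: (text, lemma, POS, is_stop).
rows = [
    ("The", "the", "DET", True),
    ("rain", "rain", "NOUN", False),
    ("in", "in", "ADP", True),
    ("Spain", "Spain", "PROPN", False),
    ("falls", "fall", "VERB", False),
    ("mainly", "mainly", "ADV", False),
    ("on", "on", "ADP", True),
    ("the", "the", "DET", True),
    ("plain", "plain", "NOUN", False),
    (".", ".", "PUNCT", False),
]


def content_lemmas(rows):
    """Hypothetical helper: keep the lemmas of tokens that are
    neither stop words nor punctuation."""
    return [
        lemma
        for _text, lemma, pos, is_stop in rows
        if not is_stop and pos != "PUNCT"
    ]


print(content_lemmas(rows))
```

With a spaCy `doc`, a comparable filter could be written as `[t.lemma_ for t in doc if not t.is_stop and not t.is_punct]`.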
from spacy import displacy
displacy.render(doc, )
text = "We were all out at the zoo one day, I was doing some acting, walking on the railing of the gorilla exhibit. I fell in. Everyone screamed and Tommy jumped in after me, forgetting that he had blueberries in his front pocket. The gorillas just went wild."
doc = nlp(text)
for sent in doc.sents:
    print(">", sent)
> We were all out at the zoo one day, I was doing some acting, walking on the railing of the gorilla exhibit.
> I fell in.
> Everyone screamed and Tommy jumped in after me, forgetting that he had blueberries in his front pocket.
> The gorillas just went wild.
for sent in doc.sents:
    print(">", sent.start, sent.end)
doc[48:54]
The gorillas just went wild.
token = doc[51]
print(token.text, token.lemma_, token.pos_)
went go VERB
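The `start` and `end` values are token offsets into the document, with `end` exclusive, which is why `doc[48:54]` returns the whole last sentence and `doc[51]` is one token inside it. The toy sketch below shows the same half-open indexing on a plain list of tokens. The whitespace tokenization and the `sentence_bounds` helper are illustrative assumptions, not how spaCy segments sentences.

```python
def sentence_bounds(tokens):
    """Return (start, end) token offsets per sentence, end exclusive.

    Toy rule (an assumption for this sketch): a sentence ends at a "." token.
    """
    bounds = []
    start = 0
    for i, tok in enumerate(tokens):
        if tok == ".":
            bounds.append((start, i + 1))
            start = i + 1
    if start < len(tokens):
        # Trailing tokens without a final period form one last sentence.
        bounds.append((start, len(tokens)))
    return bounds


tokens = "I fell in . The gorillas just went wild .".split()
for start, end in sentence_bounds(tokens):
    # Slicing with [start:end] recovers exactly the tokens of that sentence.
    print(start, end, tokens[start:end])
```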
import sys
import warnings
warnings.filterwarnings("ignore")
from bs4 import BeautifulSoup
import requests
import traceback

def get_text(url):
    # Download the page and join the text of all <p> elements with newlines
    buf = []
    try:
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        for p in soup.find_all("p"):
            buf.append(p.get_text())
        return "\n".join(buf)
    except Exception:
        # Print the stack trace and exit if the download or parsing fails
        print(traceback.format_exc())
        sys.exit(-1)
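To see what `get_text` extracts without making a network request, the sketch below applies the same idea (collect the text of every `<p>` element and join the pieces with newlines) to an HTML string, using only the standard library's `html.parser`. The `ParagraphText` class, the `paragraphs_to_text` function, and the sample HTML are made up for illustration; the article's own code relies on Beautiful Soup.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text inside each <p> element (nested <p> is not handled)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.paragraphs.append("".join(self._parts))
            self._in_p = False

    def handle_data(self, data):
        # Text inside inline tags such as <b> still belongs to the paragraph.
        if self._in_p:
            self._parts.append(data)


def paragraphs_to_text(html):
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.paragraphs)


sample = (
    "<html><body><h1>License</h1>"
    "<p>Permission is <b>hereby</b> granted.</p>"
    "<p>No warranty.</p></body></html>"
)
print(paragraphs_to_text(sample))
```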
https://opensource.org/licenses/
lic = {}
lic["mit"] = nlp(get_text("https://opensource.org/licenses/MIT"))
lic["asl"] = nlp(get_text("https://opensource.org/licenses/Apache-2.0"))
lic["bsd"] = nlp(get_text("https://opensource.org/licenses/BSD-3-Clause"))
for sent in lic["bsd"].sents: print(">", sent)
> SPDX short identifier: BSD-3-Clause
> Note: This license has also been called the "New BSD License" or "Modified BSD License"
> See also the 2-clause BSD License.
…
pairs = [
    ["mit", "asl"],
    ["asl", "bsd"],
    ["bsd", "mit"]
]
for a, b in pairs:
    print(a, b, lic[a].similarity(lic[b]))
mit asl 0.9482039305669306
asl bsd 0.9391555350757145
bsd mit 0.9895838089575453
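By default, spaCy's `similarity` estimate is a cosine similarity between document vectors. The sketch below computes cosine similarity over simple word-count vectors to show the formula at work. This is a stand-in chosen for illustration (an assumption): spaCy uses vectors from its language model rather than raw word counts, so these numbers will not match the output above.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (dict-like)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def bag_of_words(text):
    """Lowercase, split on whitespace, and count the words."""
    return Counter(text.lower().split())


x = bag_of_words("permission is hereby granted free of charge")
y = bag_of_words("permission is granted to any person")
z = bag_of_words("the gorillas just went wild")

print(round(cosine_similarity(x, y), 3))  # some shared words
print(cosine_similarity(x, z))            # no shared words
```

Texts that share more vocabulary score closer to 1, and texts with no words in common score 0.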
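Now let's dig into the NLU features in spaCy. Suppose we have a document to parse; from a purely grammatical standpoint, we can extract the noun chunks (https://spacy.io/usage/linguistic-features#noun-chunks), i.e., each of the noun phrases: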
text = "Steve Jobs and Steve Wozniak incorporated Apple Computer on January 3, 1977, in Cupertino, California."
doc = nlp(text)
for chunk in doc.noun_chunks:
    print(chunk.text)
Steve Jobs
Steve Wozniak
Apple Computer
January
Cupertino
California
for ent in doc.ents:
    print(ent.text, ent.label_)
displacy.render(doc, )
import nltk
nltk.download("wordnet")
[nltk_data] Downloading package wordnet to /home/ceteri/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
True
from spacy_wordnet.wordnet_annotator import WordnetAnnotator
print("before", nlp.pipe_names)
if "WordnetAnnotator" not in nlp.pipe_names:
    nlp.add_pipe(WordnetAnnotator(nlp.lang), after="tagger")
print("after", nlp.pipe_names)
before ['tagger', 'parser', 'ner']
after ['tagger', 'WordnetAnnotator', 'parser', 'ner']
token = nlp("withdraw")[0]
token._.wordnet.synsets()
[Synset('withdraw.v.01'),
Synset('retire.v.02'),
Synset('disengage.v.01'),
Synset('recall.v.07'),
Synset('swallow.v.05'),
Synset('seclude.v.01'),
Synset('adjourn.v.02'),
Synset('bow_out.v.02'),
Synset('withdraw.v.09'),
Synset('retire.v.08'),
Synset('retreat.v.04'),
Synset('remove.v.01')]
token._.wordnet.lemmas()
[Lemma('withdraw.v.01.withdraw'),
Lemma('withdraw.v.01.retreat'),
Lemma('withdraw.v.01.pull_away'),
Lemma('withdraw.v.01.draw_back'),
Lemma('withdraw.v.01.recede'),
Lemma('withdraw.v.01.pull_back'),
Lemma('withdraw.v.01.retire'),
…
token._.wordnet.wordnet_domains()
['astronomy',
'school',
'telegraphy',
'industry',
'psychology',
'ethnology',
'ethnology',
'administration',
'school',
'finance',
'economy',
'exchange',
'banking',
'commerce',
'medicine',
'ethnology',
'university',
…
domains = ["finance", "banking"]
sentence = nlp("I want to withdraw 5,000 euros.")
enriched_sent = []
for token in sentence:
    # get synsets within the desired domains
    synsets = token._.wordnet.wordnet_synsets_for_domain(domains)
    if synsets:
        lemmas_for_synset = []
        for s in synsets:
            # get synset variants and add to the enriched sentence
            lemmas_for_synset.extend(s.lemma_names())
        enriched_sent.append("({})".format("|".join(set(lemmas_for_synset))))
    else:
        enriched_sent.append(token.text)
print(" ".join(enriched_sent))
I (require|want|need) to (draw_off|withdraw|draw|take_out) 5,000 euros .
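The loop above replaces each token that has synsets in the chosen domains with a group of alternative lemmas. The sketch below reproduces only that substitute-and-join step, with a hand-written lookup table in place of WordNet. The `domain_synonyms` dictionary is hypothetical sample data taken from the output above, and the groups are sorted so the result is deterministic (the original uses `set`, whose ordering can vary).

```python
# Hypothetical lookup table standing in for WordNet domain synsets.
domain_synonyms = {
    "want": ["want", "need", "require"],
    "withdraw": ["withdraw", "draw", "take_out", "draw_off"],
}


def enrich(tokens, synonyms):
    """Replace tokens found in `synonyms` with "(a|b|c)" groups."""
    enriched = []
    for tok in tokens:
        variants = synonyms.get(tok.lower())
        if variants:
            # Deduplicate and sort so the output order is stable.
            enriched.append("({})".format("|".join(sorted(set(variants)))))
        else:
            enriched.append(tok)
    return " ".join(enriched)


tokens = ["I", "want", "to", "withdraw", "5,000", "euros", "."]
print(enrich(tokens, domain_synonyms))
```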
import scattertext as st
if "merge_entities" not in nlp.pipe_names:
    nlp.add_pipe(nlp.create_pipe("merge_entities"))
if "merge_noun_chunks" not in nlp.pipe_names:
    nlp.add_pipe(nlp.create_pipe("merge_noun_chunks"))
convention_df = st.SampleCorpora.ConventionData2012.get_data()
corpus = st.CorpusFromPandas(convention_df,
category_col="party",
text_col="text",
nlp=nlp).build()
html = st.produce_scattertext_explorer(
    corpus,
    category="democrat",
    category_name="Democratic",
    not_category_name="Republican",
    width_in_pixels=1000,
    metadata=convention_df["speaker"]
)
from IPython.display import IFrame
file_name = "foo.html"
with open(file_name, "wb") as f:
    f.write(html.encode("utf-8"))
IFrame(src=file_name, width=1200, height=700)
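The explorer compares how terms are used across the two categories of the corpus. As a rough, simplified illustration of the raw per-category counts that such a comparison starts from (this is not Scattertext's actual scoring), the sketch below tallies term frequencies per category with the standard library. The sample documents and the `term_counts_by_category` helper are invented for this example.

```python
from collections import Counter, defaultdict

# Invented sample data: (category, text) pairs.
docs = [
    ("democrat", "jobs and health care for the middle class"),
    ("democrat", "health care is a right"),
    ("republican", "lower taxes create jobs"),
    ("republican", "taxes and spending are too high"),
]


def term_counts_by_category(docs):
    """Count whitespace-separated, lowercased terms for each category."""
    counts = defaultdict(Counter)
    for category, text in docs:
        counts[category].update(text.lower().split())
    return counts


counts = term_counts_by_category(docs)
for term in ("jobs", "health", "taxes"):
    # A term used mostly by one category stands out in the comparison.
    print(term, counts["democrat"][term], counts["republican"][term])
```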
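Summary
It is worth noting that machine learning for natural language has advanced considerably since the mid-2000s, when Google began winning international language translation competitions. Another major shift came during 2017-2018: following the many successes of deep learning, these approaches began to outperform earlier machine learning models.
For example, see the ELMo language embedding model from Allen AI research, followed by Google's BERT (https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html), and more recently by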
Reposted from: the 数据派THU WeChat official account