A Quick Guide to Tokenization, Lemmatization, Stop Words and Phrase Matching using spaCy

spaCy is designed specifically for production use. It helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning. In this article, you will learn about tokenization, lemmatization, stop words and phrase matching operations using spaCy.

This is article 2 in the spaCy series. In my last article, I explained spaCy installation and basic operations. If you are new to this, I would suggest starting from article 1 for a better understanding.

Article 1 — spaCy-installation-and-basic-operations-nlp-text-processing-library/

Tokenization

Tokenization is the first step in any text processing task. It is not just breaking the text into pieces such as words and punctuation marks, known as tokens; it is more than that. spaCy's tokenizer is intelligent: it internally identifies whether a “.” is punctuation that should be separated into its own token, or whether it is part of an abbreviation like “U.S.” and should not be separated.

spaCy applies rules specific to the language being processed. Let's understand this with an example.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("\"Next Week, We're coming from U.S.!\"")

for token in doc:
    print(token.text)
  • spaCy first splits the raw text on whitespace.
  • It then processes the text from left to right and, on each whitespace-separated substring, performs the following two checks:
  • Exception rule check: punctuation that is part of an abbreviation such as “U.S.” should not be split into further tokens; it should remain one token. However, “we're” should be split into “we” and “'re”.
  • Prefix, suffix and infix check: punctuation such as commas, periods, hyphens or quotes should be treated as tokens and separated out.

If there’s a match, the rule is applied and the Tokenizer continues its loop, starting with the newly split substrings. This way, spaCy can split complex, nested tokens like combinations of abbreviations and multiple punctuation marks.
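
To see the exception mechanism in action, you can register a special case of your own. The sketch below mirrors the example from spaCy's documentation; the sentence itself is only an illustration:

from spacy.symbols import ORTH

doc = nlp(u"gimme that book")
print([t.text for t in doc])   # default behaviour: ['gimme', 'that', 'book']

# Register a special-case rule so that "gimme" is always split into two tokens
nlp.tokenizer.add_special_case(u"gimme", [{ORTH: u"gim"}, {ORTH: u"me"}])

doc = nlp(u"gimme that book")
print([t.text for t in doc])   # now: ['gim', 'me', 'that', 'book']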

  • Prefix: Look for character(s) at the beginning ▸ $ ( " ¿
  • Suffix: Look for character(s) at the end ▸ mm ) , . ! " (mm is an example of a unit)
  • Infix: Look for character(s) in between ▸ - -- / ...
  • Exception: Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied ▸ St. N.Y. (a combined example follows this list)
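
As a quick illustration of these rules (the sentence below is made up for this sketch), the opening quote and the “$” are prefixes, the “!”, “)” and closing quote are suffixes, the “--” is an infix, and “U.S.” is protected by an exception rule:

doc = nlp(u'"Let\'s pay $9.99 (U.S. dollars) for this--right now!"')
print([token.text for token in doc])
# The quotes, '$', '(', ')', '--' and '!' should come out as separate tokens,
# while 'U.S.' and '9.99' stay intact.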

Notice that tokens are pieces of the original text. Tokens are the basic building blocks of a Doc object — everything that helps us understand the meaning of the text is derived from tokens and their relationship to one another.

Prefixes, Suffixes and Infixes as Tokens

  • spaCy will separate punctuation that does not form an integral part of a word.
  • Quotation marks, commas, and punctuation at the end of a sentence will be assigned their own token.
  • However, punctuation that exists as part of an email address, website or numerical value will be kept as part of the token.
doc2 = nlp(u"We're here to guide you! Send your query, email contact@enetwork.ai or visit us at http://www.enetwork.ai!")

for t in doc2:
    print(t)

Note that the exclamation points and the comma are assigned their own tokens. However, the periods and the colon inside the email address and the website URL are not split out, so both the email address and the URL are preserved as single tokens.

doc3 = nlp(u'A 40km U.S. cab ride costs $100.60')

for t in doc3:
    print(t)

Here the distance unit and the dollar sign are assigned their own tokens, while the dollar amount is preserved: the decimal point inside the amount is not split out.

Exceptions in Token generation

Punctuation that exists as part of a known abbreviation will be kept as part of the token.

doc4 = nlp(u"Let's visit the St. Louis in the U.S. next year.")

for t in doc4:
    print(t)

Here the abbreviations for “Saint” and “United States” are both preserved: the period after “St.” is not separated out as its own token, and the same applies to “U.S.”

Counting Tokens

Using the len() function, you can count the number of tokens in a document.

len(doc4)

Counting Vocab Entries

Vocab objects contain a full library of items!

All the Doc objects above were created from the same English language model, which we loaded at the beginning using:

nlp = spacy.load("en_core_web_sm")

Hence the vocab length will be the same for all of them.
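
A minimal check of this (doc4 is the St. Louis example above; the exact number depends on the model version and grows as more strings are processed):

print(len(doc4.vocab))   # number of lexemes stored in the shared Vocab
print(len(nlp.vocab))    # same underlying Vocab object, so the same number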

Indexing and Slicing Tokens

  • Doc objects can be thought of as lists of Token objects.
  • As such, individual tokens can be retrieved by index position.
  • Spans of tokens can be retrieved through slicing, as shown below:
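
For example, using doc4 (the St. Louis sentence) from above:

print(doc4[0])          # first token: Let
print(doc4[2:5])        # a Span covering tokens 2, 3 and 4: visit the St.
print(doc4[-1])         # last token: .
print(type(doc4[2:5]))  # <class 'spacy.tokens.span.Span'>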

Assignment to a token is not allowed
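
Doc objects do not support item assignment, so trying to overwrite a token raises an error. A minimal sketch:

try:
    doc4[0] = u"Hello"
except TypeError as e:
    print(e)   # 'spacy.tokens.doc.Doc' object does not support item assignment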

Lemmatization

  • In contrast to stemming, Lemmatization looks beyond word reduction and considers a language's full vocabulary to apply a morphological analysis to words.
  • The lemma of ‘was’ is ‘be’, the lemma of ‘rats’ is ‘rat’ and the lemma of ‘mice’ is ‘mouse’. Further, the lemma of ‘meeting’ might be ‘meet’ or ‘meeting’ depending on its use in a sentence.
  • Lemmatization looks at the surrounding text to determine a given word's part of speech. It does not categorize phrases.

Note that spaCy does not include a stemmer, because lemmatization is seen as more informative than stemming.

doc1 = nlp(u"I am a runner running in a race because I love to run since I ran today")

for token in doc1:
    print(token.text, '\t', token.pos_, '\t', token.lemma, '\t', token.lemma_)

Let's create a function to find and print lemmas in a more structured way.

def find_lemmas(text):
    for token in text:
        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}}{token.lemma_}')

Here we’re using an f-string to format the printed text by setting minimum field widths and adding a left-align to the lemma hash value.

Now, let’s call that function

doc2 = nlp(u"I saw eighteen mice today!")

find_lemmas(doc2)

Note that the lemma of ‘saw’ is ‘see’, and the lemma of ‘mice’ (the plural of ‘mouse’) is ‘mouse’. Also note that ‘eighteen’ is a number, not an inflected form of ‘eight’; spaCy detects this while computing lemmas and leaves ‘eighteen’ untouched.

doc3 = nlp(u"I am meeting him tomorrow at the meeting.")

find_lemmas(doc3)

Here the lemma of meeting is determined by its Part of Speech tag.

For the first ‘meeting’, which is a verb, the computed lemma is ‘meet’; for the second ‘meeting’, which is a noun, the lemma is ‘meeting’ itself.

This is where we can see that spaCy takes the part of speech into account when calculating lemmas.

doc4 = nlp(u"That's an enormous automobile")

find_lemmas(doc4)

Note that Lemmatization does not reduce words to their most basic synonym — that is, enormous doesn't become big and automobile doesn't become car.

Stop Words

  • Words like “a” and “the” appear so frequently that they don't require tagging as thoroughly as nouns, verbs and modifiers.
  • We call them stop words, and they can be filtered from the text to be processed.
  • spaCy holds a built-in list of some 305 English stop words.

You can print the total number of stop words using the len() function.
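
A minimal sketch (the exact count may vary slightly between spaCy versions):

print(len(nlp.Defaults.stop_words))   # roughly 305 for this model

# You can also check whether an individual word is a stop word:
print(nlp.vocab['is'].is_stop)        # True
print(nlp.vocab['mystery'].is_stop)   # False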

Adding a user-defined stop word

There may be times when you wish to add a stop word to the default set. Perhaps you decide that 'btw' (common shorthand for "by the way") should be considered a stop word.

# Add the word to the set of stop words. Use lowercase!
nlp.Defaults.stop_words.add('btw')   # always use lowercase when adding stop words

# Set the stop_word tag on the lexeme
nlp.vocab['btw'].is_stop = True

Removing a stop word

Alternatively, you may decide that 'without' should not be considered a stop word.

# Remove the word from the set of stop words
nlp.Defaults.stop_words.remove('without')

# Remove the stop_word tag from the lexeme
nlp.vocab['without'].is_stop = False

len(nlp.Defaults.stop_words)
nlp.vocab['beyond'].is_stop

Vocabulary and Matching

In this section, we will identify and label specific phrases that match patterns we can define ourselves.

Rule-based Matching

  • spaCy offers a rule-matching tool called Matcher.
  • It allows you to build a library of token patterns.
  • It then matches those patterns against a Doc object to return a list of found matches.

You can match on any part of the token including text and annotations, and you can add multiple patterns to the same matcher.

# Import the Matcher library
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

Creating patterns

In text, the phrase ‘united states’ might appear as one word or two, with or without a hyphen. In this section we'll develop a matcher named ‘UnitedStates’ that finds all three:

pattern1 = [{'LOWER': 'unitedstates'}]
pattern2 = [{'LOWER': 'united'}, {'LOWER': 'states'}]
pattern3 = [{'LOWER': 'united'}, {'IS_PUNCT': True}, {'LOWER': 'states'}]

matcher.add('UnitedStates', None, pattern1, pattern2, pattern3)

Breaking these down further:

  • pattern1 looks for a single token whose lowercase text reads 'unitedstates'
  • pattern2 looks for two adjacent tokens that read 'united' and 'states' in that order
  • pattern3 looks for three adjacent tokens, with a middle token that can be any punctuation.*

* Remember that single spaces are not tokenized, so they don't count as punctuation. Once we define our patterns, we pass them into matcher with the name 'UnitedStates', and set the callback to None.

Applying the matcher to a Doc object

To illustrate, I have written ‘United States’ in different ways in the text below: “United States”, “UnitedStates” and “United-States”.

doc = nlp(u'The United States of America is a country consisting of 50 independent states. The first constitution of the UnitedStates was adopted in 1788. The current United-States flag was designed by a high school student - Robert G. Heft.')

found_matches = matcher(doc)
print(found_matches)

for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

Setting pattern options and quantifiers

You can make token rules optional by passing an 'OP':'*' argument. This lets us streamline our patterns list:

# Redefine the patterns:
pattern1 = [{'LOWER': 'unitedstates'}]
pattern2 = [{'LOWER': 'united'}, {'IS_PUNCT': True, 'OP': '*'}, {'LOWER': 'states'}]

# Remove the old patterns to avoid duplication:
matcher.remove('UnitedStates')

# Add the new set of patterns under a new matcher name:
matcher.add('someNameToMatcher', None, pattern1, pattern2)

doc = nlp(u"United--States has the world's largest coal reserves.")

found_matches = matcher(doc)
print(found_matches)

This found both two-word patterns, with and without the hyphen!

The following quantifiers can be passed to the 'OP' key:
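
Per the spaCy documentation, the supported values are:

  • '!' requires the pattern to match exactly 0 times (negation)
  • '?' makes the pattern optional, allowing it to match 0 or 1 times
  • '+' requires the pattern to match 1 or more times
  • '*' allows the pattern to match 0 or more times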

Careful with lemmas!

Suppose we have another phrase, ‘solar power’, in some sentence. Now, if we want to match on both ‘solar power’ and ‘solar powered’, it might be tempting to look for the lemma of ‘powered’ and expect it to be ‘power’. This is not always the case! The lemma of the adjective ‘powered’ is still ‘powered’:

pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP': '*'}, {'LEMMA': 'power'}]  # CHANGE THIS PATTERN

# Remove the old patterns to avoid duplication:
matcher.remove('someNameToMatcher')   # remove the previously added matcher name

# Add the new set of patterns to the 'SolarPower' matcher:
matcher.add('SolarPower', None, pattern1, pattern2)

doc2 = nlp(u'Solar-powered energy runs solar-powered cars.')

found_matches = matcher(doc2)
print(found_matches)

The matcher found the first occurrence because the lemmatizer treated ‘Solar-powered’ as a verb, but not the second, which it considered an adjective. For this case it may be better to set explicit token patterns.

pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP': '*'}, {'LOWER': 'power'}]
pattern3 = [{'LOWER': 'solarpowered'}]
pattern4 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP': '*'}, {'LOWER': 'powered'}]

# Remove the old patterns to avoid duplication:
matcher.remove('SolarPower')

# Add the new set of patterns to the 'SolarPower' matcher:
matcher.add('SolarPower', None, pattern1, pattern2, pattern3, pattern4)

found_matches = matcher(doc2)
print(found_matches)

Other Token Attributes

Besides lemmas, there are a variety of token attributes we can use to determine matching rules:
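
Per the spaCy Matcher documentation, commonly used attributes include:

  • ORTH, TEXT, LOWER: the exact verbatim or lowercased text of the token
  • LENGTH: the length of the token text
  • IS_ALPHA, IS_PUNCT, IS_DIGIT, IS_SPACE, IS_STOP: boolean token flags
  • LIKE_NUM, LIKE_URL, LIKE_EMAIL: whether the token resembles a number, URL or email address
  • POS, TAG, DEP, LEMMA, SHAPE: the token's coarse part of speech, fine-grained tag, dependency label, lemma or shape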

Token wildcard

You can pass an empty dictionary {} as a wildcard to represent any token. For example, you might want to retrieve hashtags without knowing what might follow the # character:
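
A minimal sketch of that idea (the tweet text and the 'HashTag' label are invented for illustration; matcher.add uses the same spaCy 2.x signature as the rest of this article):

pattern = [{'ORTH': '#'}, {}]   # a literal '#' followed by any single token

matcher.add('HashTag', None, pattern)

doc = nlp(u'Loving the new release! #spacy #nlp')
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)   # '#spacy' and '#nlp'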

Phrase Matcher

In the above section, we used token patterns to perform rule-based matching. An alternative — and often more efficient — method is to match on terminology lists. In this case, we use PhraseMatcher to create a Doc object from a list of phrases and pass that into matcher instead.

# Perform standard imports, reset nlp
import spacy
nlp = spacy.load('en_core_web_sm')

# Import the PhraseMatcher library
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

For this exercise, we're going to import a Wikipedia article on Reaganomics. Source: https://en.wikipedia.org/wiki/Reaganomics

with open('../TextFiles/reaganomics.txt') as f:
    doc3 = nlp(f.read())

# First, create a list of match phrases:
phrase_list = ['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']

# Next, convert each phrase to a Doc object:
phrase_patterns = [nlp(text) for text in phrase_list]

# Pass each Doc object into matcher (note the use of the asterisk!):
matcher.add('VoodooEconomics', None, *phrase_patterns)

# Build a list of matches:
matches = matcher(doc3)

# (match_id, start, end)
matches

The first four matches are where these terms are used in the definition of Reaganomics:

doc3[:70]

Viewing Matches

There are a few ways to fetch the text surrounding a match. The simplest is to grab a slice of tokens from the doc that is wider than the match:
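
A minimal sketch of that approach, continuing with doc3 and the matches list built above (the five-token window on each side is arbitrary):

for match_id, start, end in matches[:5]:
    span = doc3[max(start - 5, 0):min(end + 5, len(doc3))]   # widen the match for context
    print(span.text, '\n')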

This is all about text pre-processing operations, which include tokenization, lemmatization, stop words and phrase matching. I hope you enjoyed the post.

Related Articles:

  • Spacy Installation and Basic Operations | NLP Text Processing Library | Part 1
  • Parts of Speech Tagging and Dependency Parsing using spaCy | NLP | Part 3
  • Named Entity Recognition NER using spaCy | NLP | Part 4
  • How to Perform Sentence Segmentation or Sentence Tokenization using spaCy | NLP Series | Part 5
  • Numerical Feature Extraction from Text | NLP Series | Part 6
  • Word2Vec and Semantic Similarity using spaCy | NLP spaCy Series | Part 7

If you have any feedback to improve the content, or any thoughts, please write them in the comment section below. Your comments are very valuable.

Thank You!

Originally published at http://ashutoshtripathi.com on April 6, 2020.

Translated from: https://towardsdatascience.com/a-quick-guide-to-tokenization-lemmatization-stop-words-and-phrase-matching-using-spacy-nlp-b29b407adbfc
