A Quick Guide to Tokenization, Lemmatization, Stop Words and Phrase Matching using spaCy

spaCy is designed specifically for production use. It helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning. In this article, you will learn about tokenization, lemmatization, stop words and phrase matching operations using spaCy.

This is article 2 in the spaCy series. In my last article, I explained spaCy installation and basic operations. If you are new to this, I would suggest starting from article 1 for a better understanding.

Article 1 — spaCy-installation-and-basic-operations-nlp-text-processing-library/

Tokenization

Tokenization is the first step in any text processing task. It is not just breaking the text into pieces such as words and punctuation marks, known as tokens; it is more than that. spaCy's tokenizer is intelligent: it internally identifies whether a “.” is punctuation that should be separated into its own token, or whether it is part of an abbreviation like “U.S.” and should not be separated.

spaCy applies rules specific to the language being processed. Let's understand this with an example.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("\"Next Week, We're coming from U.S.!\"")

for token in doc:
    print(token.text)
  • spaCy first splits the raw text on whitespace.
  • It then processes the text from left to right and, on each whitespace-separated substring, performs the following two checks:
  • Exception rule check: punctuation that is part of an abbreviation such as “U.S.” should not be split into further tokens; it should remain one token. However, “we're” should be split into “we” and “'re”.
  • Prefix, suffix and infix check: punctuation such as commas, periods, hyphens or quotes should be treated as tokens and separated out.

If there’s a match, the rule is applied and the Tokenizer continues its loop, starting with the newly split substrings. This way, spaCy can split complex, nested tokens like combinations of abbreviations and multiple punctuation marks.
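
To see the exception mechanism in action, you can register a special case of your own. The sketch below mirrors the example from spaCy's documentation; the sentence itself is only an illustration:

from spacy.symbols import ORTH

doc = nlp(u"gimme that book")
print([t.text for t in doc])   # default behaviour: ['gimme', 'that', 'book']

# Register a special-case rule so that "gimme" is always split into two tokens
nlp.tokenizer.add_special_case(u"gimme", [{ORTH: u"gim"}, {ORTH: u"me"}])

doc = nlp(u"gimme that book")
print([t.text for t in doc])   # now: ['gim', 'me', 'that', 'book']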

  • Prefix: Look for character(s) at the beginning ▸ $ ( " ¿
  • Suffix: Look for character(s) at the end ▸ mm ) , . ! " (mm is an example of a unit)
  • Infix: Look for character(s) in between ▸ - -- / ...
  • Exception: Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied ▸ St. N.Y. (a combined example follows this list)
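
As a quick illustration of these rules (the sentence below is made up for this sketch), the opening quote and the “$” are prefixes, the “!”, “)” and closing quote are suffixes, the “--” is an infix, and “U.S.” is protected by an exception rule:

doc = nlp(u'"Let\'s pay $9.99 (U.S. dollars) for this--right now!"')
print([token.text for token in doc])
# The quotes, '$', '(', ')', '--' and '!' should come out as separate tokens,
# while 'U.S.' and '9.99' stay intact.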

Notice that tokens are pieces of the original text. Tokens are the basic building blocks of a Doc object — everything that helps us understand the meaning of the text is derived from tokens and their relationship to one another.

Prefixes, Suffixes and Infixes as Tokens

  • spaCy will separate punctuation that does not form an integral part of a word.
  • Quotation marks, commas, and punctuation at the end of a sentence will be assigned their own token.
  • However, punctuation that exists as part of an email address, website or numerical value will be kept as part of the token.
doc2 = nlp(u"We're here to guide you! Send your query, email contact@enetwork.ai or visit us at http://www.enetwork.ai!")

for t in doc2:
    print(t)

Note that the exclamation points and the comma are assigned their own tokens. However, the periods and the colon inside the email address and the website URL are not split out, so both the email address and the URL are preserved as single tokens.

doc3 = nlp(u'A 40km U.S. cab ride costs $100.60')

for t in doc3:
    print(t)

Here the distance unit and the dollar sign are assigned their own tokens, while the dollar amount is preserved: the decimal point inside the amount is not split out.

Exceptions in Token generation

Punctuation that exists as part of a known abbreviation will be kept as part of the token.

doc4 = nlp(u"Let's visit the St. Louis in the U.S. next year.")

for t in doc4:
    print(t)

Here the abbreviations for “Saint” and “United States” are both preserved: the period after “St.” is not separated out as its own token, and the same applies to “U.S.”

Counting Tokens

Using the len() function, you can count the number of tokens in a document.

len(doc4)

Counting Vocab Entries

Vocab objects contain a full library of items!

All the Doc objects above were created from the same English language model, which we loaded at the beginning using:

nlp = spacy.load("en_core_web_sm")

Hence the vocab length will be the same for all of them.
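
A minimal check of this (doc4 is the St. Louis example above; the exact number depends on the model version and grows as more strings are processed):

print(len(doc4.vocab))   # number of lexemes stored in the shared Vocab
print(len(nlp.vocab))    # same underlying Vocab object, so the same number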

Indexing and Slicing Tokens

  • Doc objects can be thought of as lists of Token objects.
  • As such, individual tokens can be retrieved by index position.
  • Spans of tokens can be retrieved through slicing, as shown below:
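
For example, using doc4 (the St. Louis sentence) from above:

print(doc4[0])          # first token: Let
print(doc4[2:5])        # a Span covering tokens 2, 3 and 4: visit the St.
print(doc4[-1])         # last token: .
print(type(doc4[2:5]))  # <class 'spacy.tokens.span.Span'>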

Assignment to a token is not allowed
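
Doc objects do not support item assignment, so trying to overwrite a token raises an error. A minimal sketch:

try:
    doc4[0] = u"Hello"
except TypeError as e:
    print(e)   # 'spacy.tokens.doc.Doc' object does not support item assignment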

Lemmatization

  • In contrast to stemming, Lemmatization looks beyond word reduction and considers a language's full vocabulary to apply a morphological analysis to words.
  • The lemma of ‘was’ is ‘be’, the lemma of ‘rats’ is ‘rat’ and the lemma of ‘mice’ is ‘mouse’. Further, the lemma of ‘meeting’ might be ‘meet’ or ‘meeting’ depending on its use in a sentence.
  • Lemmatization looks at the surrounding text to determine a given word's part of speech. It does not categorize phrases.

Note that spaCy does not include a stemmer, because lemmatization is seen as more informative than stemming.

doc1 = nlp(u"I am a runner running in a race because I love to run since I ran today")

for token in doc1:
    print(token.text, '\t', token.pos_, '\t', token.lemma, '\t', token.lemma_)

Let's create a function to find and print lemmas in a more structured way.

def find_lemmas(text):
    for token in text:
        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}}{token.lemma_}')

Here we’re using an f-string to format the printed text by setting minimum field widths and adding a left-align to the lemma hash value.

Now, let’s call that function

doc2 = nlp(u"I saw eighteen mice today!")

find_lemmas(doc2)

Note that the lemma of ‘saw’ is ‘see’, and the lemma of ‘mice’ (the plural of ‘mouse’) is ‘mouse’. Also note that ‘eighteen’ is a number, not an inflected form of ‘eight’; spaCy detects this while computing lemmas and leaves ‘eighteen’ untouched.

doc3 = nlp(u"I am meeting him tomorrow at the meeting.")

find_lemmas(doc3)

Here the lemma of meeting is determined by its Part of Speech tag.

For the first ‘meeting’, which is a verb, the computed lemma is ‘meet’; for the second ‘meeting’, which is a noun, the lemma is ‘meeting’ itself.

This is where we can see that spaCy takes the part of speech into account when calculating lemmas.

doc4 = nlp(u"That's an enormous automobile")

find_lemmas(doc4)

Note that Lemmatization does not reduce words to their most basic synonym — that is, enormous doesn't become big and automobile doesn't become car.

Stop Words

  • Words like “a” and “the” appear so frequently that they don't require tagging as thoroughly as nouns, verbs and modifiers.
  • We call them stop words, and they can be filtered from the text to be processed.
  • spaCy holds a built-in list of some 305 English stop words.

You can print the total number of stop words using the len() function.
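
A minimal sketch (the exact count may vary slightly between spaCy versions):

print(len(nlp.Defaults.stop_words))   # roughly 305 for this model

# You can also check whether an individual word is a stop word:
print(nlp.vocab['is'].is_stop)        # True
print(nlp.vocab['mystery'].is_stop)   # False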

Adding a user-defined stop word

There may be times when you wish to add a stop word to the default set. Perhaps you decide that 'btw' (common shorthand for "by the way") should be considered a stop word.

# Add the word to the set of stop words. Use lowercase!
nlp.Defaults.stop_words.add('btw')   # always use lowercase when adding stop words

# Set the stop_word tag on the lexeme
nlp.vocab['btw'].is_stop = True

Removing a stop word

Alternatively, you may decide that 'without' should not be considered a stop word.

# Remove the word from the set of stop words
nlp.Defaults.stop_words.remove('without')

# Remove the stop_word tag from the lexeme
nlp.vocab['without'].is_stop = False

len(nlp.Defaults.stop_words)
nlp.vocab['beyond'].is_stop

Vocabulary and Matching

In this section, we will identify and label specific phrases that match patterns we can define ourselves.

Rule-based Matching

  • spaCy offers a rule-matching tool called Matcher.
  • It allows you to build a library of token patterns.
  • It then matches those patterns against a Doc object to return a list of found matches.

You can match on any part of the token including text and annotations, and you can add multiple patterns to the same matcher.

# Import the Matcher library
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

Creating patterns

In text, the phrase ‘united states’ might appear as one word or two, with or without a hyphen. In this section we'll develop a matcher named ‘UnitedStates’ that finds all three:

pattern1 = [{'LOWER': 'unitedstates'}]
pattern2 = [{'LOWER': 'united'}, {'LOWER': 'states'}]
pattern3 = [{'LOWER': 'united'}, {'IS_PUNCT': True}, {'LOWER': 'states'}]

matcher.add('UnitedStates', None, pattern1, pattern2, pattern3)

Breaking these down further:

  • pattern1 looks for a single token whose lowercase text reads 'unitedstates'
  • pattern2 looks for two adjacent tokens that read 'united' and 'states' in that order
  • pattern3 looks for three adjacent tokens, with a middle token that can be any punctuation.*

* Remember that single spaces are not tokenized, so they don't count as punctuation. Once we define our patterns, we pass them into matcher with the name 'UnitedStates', and set the callback to None.

Applying the matcher to a Doc object

To illustrate, I have written ‘United States’ in different ways in the text below: “United States”, “UnitedStates” and “United-States”.

doc = nlp(u'The United States of America is a country consisting of 50 independent states. The first constitution of the UnitedStates was adopted in 1788. The current United-States flag was designed by a high school student - Robert G. Heft.')

found_matches = matcher(doc)
print(found_matches)

for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

Setting pattern options and quantifiers

You can make token rules optional by passing an 'OP':'*' argument. This lets us streamline our patterns list:

# Redefine the patterns:
pattern1 = [{'LOWER': 'unitedstates'}]
pattern2 = [{'LOWER': 'united'}, {'IS_PUNCT': True, 'OP': '*'}, {'LOWER': 'states'}]

# Remove the old patterns to avoid duplication:
matcher.remove('UnitedStates')

# Add the new set of patterns under a new matcher name:
matcher.add('someNameToMatcher', None, pattern1, pattern2)

doc = nlp(u"United--States has the world's largest coal reserves.")

found_matches = matcher(doc)
print(found_matches)

This found both two-word patterns, with and without the hyphen!

The following quantifiers can be passed to the 'OP' key:
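
Per the spaCy documentation, the supported values are:

  • '!' requires the pattern to match exactly 0 times (negation)
  • '?' makes the pattern optional, allowing it to match 0 or 1 times
  • '+' requires the pattern to match 1 or more times
  • '*' allows the pattern to match 0 or more times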

Careful with lemmas!

Suppose we have another phrase, ‘solar power’, in some sentence. Now, if we want to match on both ‘solar power’ and ‘solar powered’, it might be tempting to look for the lemma of ‘powered’ and expect it to be ‘power’. This is not always the case! The lemma of the adjective ‘powered’ is still ‘powered’:

pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP': '*'}, {'LEMMA': 'power'}]  # CHANGE THIS PATTERN

# Remove the old patterns to avoid duplication:
matcher.remove('someNameToMatcher')   # remove the previously added matcher name

# Add the new set of patterns to the 'SolarPower' matcher:
matcher.add('SolarPower', None, pattern1, pattern2)

doc2 = nlp(u'Solar-powered energy runs solar-powered cars.')

found_matches = matcher(doc2)
print(found_matches)

The matcher found the first occurrence because the lemmatizer treated ‘Solar-powered’ as a verb, but not the second, which it considered an adjective. For this case it may be better to set explicit token patterns.

pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP': '*'}, {'LOWER': 'power'}]
pattern3 = [{'LOWER': 'solarpowered'}]
pattern4 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP': '*'}, {'LOWER': 'powered'}]

# Remove the old patterns to avoid duplication:
matcher.remove('SolarPower')

# Add the new set of patterns to the 'SolarPower' matcher:
matcher.add('SolarPower', None, pattern1, pattern2, pattern3, pattern4)

found_matches = matcher(doc2)
print(found_matches)

Other Token Attributes

Besides lemmas, there are a variety of token attributes we can use to determine matching rules:
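
Per the spaCy Matcher documentation, commonly used attributes include:

  • ORTH, TEXT, LOWER: the exact verbatim or lowercased text of the token
  • LENGTH: the length of the token text
  • IS_ALPHA, IS_PUNCT, IS_DIGIT, IS_SPACE, IS_STOP: boolean token flags
  • LIKE_NUM, LIKE_URL, LIKE_EMAIL: whether the token resembles a number, URL or email address
  • POS, TAG, DEP, LEMMA, SHAPE: the token's coarse part of speech, fine-grained tag, dependency label, lemma or shape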

Token wildcard

You can pass an empty dictionary {} as a wildcard to represent any token. For example, you might want to retrieve hashtags without knowing what might follow the # character:
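
A minimal sketch of that idea (the tweet text and the 'HashTag' label are invented for illustration; matcher.add uses the same spaCy 2.x signature as the rest of this article):

pattern = [{'ORTH': '#'}, {}]   # a literal '#' followed by any single token

matcher.add('HashTag', None, pattern)

doc = nlp(u'Loving the new release! #spacy #nlp')
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)   # '#spacy' and '#nlp'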

Phrase Matcher

In the above section, we used token patterns to perform rule-based matching. An alternative — and often more efficient — method is to match on terminology lists. In this case, we use PhraseMatcher to create a Doc object from a list of phrases and pass that into matcher instead.

# Perform standard imports, reset nlp
import spacy
nlp = spacy.load('en_core_web_sm')

# Import the PhraseMatcher library
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

For this exercise, we're going to import a Wikipedia article on Reaganomics. Source: https://en.wikipedia.org/wiki/Reaganomics

with open('../TextFiles/reaganomics.txt') as f:
    doc3 = nlp(f.read())

# First, create a list of match phrases:
phrase_list = ['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']

# Next, convert each phrase to a Doc object:
phrase_patterns = [nlp(text) for text in phrase_list]

# Pass each Doc object into matcher (note the use of the asterisk!):
matcher.add('VoodooEconomics', None, *phrase_patterns)

# Build a list of matches:
matches = matcher(doc3)

# (match_id, start, end)
matches

The first four matches are where these terms are used in the definition of Reaganomics:

doc3[:70]

Viewing Matches

There are a few ways to fetch the text surrounding a match. The simplest is to grab a slice of tokens from the doc that is wider than the match:
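
A minimal sketch of that approach, continuing with doc3 and the matches list built above (the five-token window on each side is arbitrary):

for match_id, start, end in matches[:5]:
    span = doc3[max(start - 5, 0):min(end + 5, len(doc3))]   # widen the match for context
    print(span.text, '\n')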

This is all about text pre-processing operations, which include tokenization, lemmatization, stop words and phrase matching. I hope you enjoyed the post.

Related Articles:

  • Spacy Installation and Basic Operations | NLP Text Processing Library | Part 1
  • Parts of Speech Tagging and Dependency Parsing using spaCy | NLP | Part 3
  • Named Entity Recognition NER using spaCy | NLP | Part 4
  • How to Perform Sentence Segmentation or Sentence Tokenization using spaCy | NLP Series | Part 5
  • Numerical Feature Extraction from Text | NLP Series | Part 6
  • Word2Vec and Semantic Similarity using spaCy | NLP spaCy Series | Part 7

If you have any feedback to improve the content, or any thoughts, please write them in the comment section below. Your comments are very valuable.

Thank You!

Originally published at http://ashutoshtripathi.com on April 6, 2020.

Translated from: https://towardsdatascience.com/a-quick-guide-to-tokenization-lemmatization-stop-words-and-phrase-matching-using-spacy-nlp-b29b407adbfc
