An Overview of Natural Language Processing

Weren't we all surprised the first time a smart device understood what we said? And it answered in the friendliest manner too, didn't it? Assistants like Apple’s Siri and Amazon’s Alexa understand when we ask about the weather, ask for directions, or ask them to play a certain genre of music. Ever since, I have wondered how these computers understand our language. This long-overdue curiosity finally got the better of me, and I decided to write this post as a newcomer to the field.

In this article, I will be using a popular NLP library called NLTK. The Natural Language Toolkit (NLTK) is one of the most powerful and probably the most popular natural language processing libraries. Not only does it offer one of the most comprehensive Python-based toolkits, it also supports a large number of human languages.

What is Natural Language Processing?

Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human languages, in particular how to train computers to process and analyze large amounts of natural language data.

Why is handling unstructured data so important?

With every tick of the clock, the world generates an overwhelming amount of data, which is mind-boggling, and the majority of it is unstructured. Formats such as text, audio, video, and images are classic examples of unstructured data. Unstructured data has no fixed dimensions or structure like the traditional rows and columns of a relational database, so it is harder to analyze and not easily searchable. Even so, it is important for business organizations to find ways of addressing these challenges and embracing the opportunity to derive insights from such data in order to prosper in highly competitive environments. With the help of natural language processing and machine learning, this is changing fast.

Are Computers Confused by Our Natural Language?

Human language is one of our most powerful tools of communication. The words, the tone, the sentences, and the gestures we use all convey information. There are countless ways of assembling words into a phrase, and words can carry many shades of meaning, so comprehending human language with its intended meaning is a challenge. A linguistic paradox is a phrase or sentence that contradicts itself, for example, “oh, this is my open secret” or “can you please act naturally”. Though these sound pointedly foolish, we humans can understand and use them in everyday speech; for machines, the ambiguity and imprecision of natural language are hurdles that are hard to sail past.

Most-Used NLP Libraries

In the past, only pioneers with superior knowledge of mathematics, machine learning, and linguistics could be part of NLP projects. Now developers can use ready-made libraries that simplify the pre-processing of text, so that they can concentrate on building machine learning models. These libraries enable text comprehension, interpretation, and sentiment analysis with only a few lines of code. The most popular NLP libraries are:

Spark NLP, NLTK, PyTorch-Transformers, TextBlob, spaCy, Stanford CoreNLP, Apache OpenNLP, AllenNLP, Gensim, NLP Architect, and scikit-learn.

The question is: where should we start, and how?

Have you ever observed how kids start to understand and learn a language? Yes, by picking up individual words first and then sentence formation, right? Making computers understand our language works in much the same way.

Pre-processing Steps:

  1. Sentence Tokenization
  2. Word Tokenization
  3. Text Lemmatization and Stemming
  4. Stop Words
  5. POS Tagging
  6. Chunking
  7. Wordnet
  8. Bag-of-Words
  9. TF-IDF

1. Sentence Tokenization (Sentence Segmentation)

To make computers understand natural language, the first step is to break paragraphs into sentences. Punctuation marks are an easy way to split the sentences apart.

import nltk
nltk.download('punkt')

text = "Home Farm is one of the biggest junior football clubs in Ireland and their senior team, from 1970 up to the late 1990s, played in the League of Ireland. However, the link between Home Farm and the senior team was severed in the late 1990s. The senior side was briefly known as Home Farm Fingal in an effort to identify it with the north Dublin area."

sentences = nltk.sent_tokenize(text)
print("The number of sentences in the paragraph:", len(sentences))
for sentence in sentences:
    print(sentence)

OUTPUT:
The number of sentences in the paragraph: 3
Home Farm is one of the biggest junior football clubs in Ireland and their senior team, from 1970 up to the late 1990s, played in the League of Ireland.
However, the link between Home Farm and the senior team was severed in the late 1990s.
The senior side was briefly known as Home Farm Fingal in an effort to identify it with the north Dublin area.

2. Word Tokenization (Word Segmentation)

By now we have the sentences separated, and the next step is to break them into words, which are often called tokens.

Just as creating space in one’s own life helps for the good, the spaces between words help break a phrase apart into its words. We can treat punctuation marks as separate tokens as well, since punctuation serves a purpose too.

for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    print("The number of words in a sentence:", len(words))
    print(words)

OUTPUT:
The number of words in a sentence: 32
['Home', 'Farm', 'is', 'one', 'of', 'the', 'biggest', 'junior', 'football', 'clubs', 'in', 'Ireland', 'and', 'their', 'senior', 'team', ',', 'from', '1970', 'up', 'to', 'the', 'late', '1990s', ',', 'played', 'in', 'the', 'League', 'of', 'Ireland', '.']
The number of words in a sentence: 18
['However', ',', 'the', 'link', 'between', 'Home', 'Farm', 'and', 'the', 'senior', 'team', 'was', 'severed', 'in', 'the', 'late', '1990s', '.']
The number of words in a sentence: 22
['The', 'senior', 'side', 'was', 'briefly', 'known', 'as', 'Home', 'Farm', 'Fingal', 'in', 'an', 'effort', 'to', 'identify', 'it', 'with', 'the', 'north', 'Dublin', 'area', '.']

As a prerequisite to using the word_tokenize() or sent_tokenize() functions in a program, we should have the punkt package downloaded.

3. Stemming and Text Lemmatization

In every text document, we usually come across different forms of a word, such as write, writes, and writing, with a similar meaning and the same base word. But how do we make a computer analyze such words? That is where text lemmatization and stemming come into the picture.

Stemming and text lemmatization are normalization techniques built on the same idea: chopping the ends of a word down to a core word. While both try to solve the same problem, they go about it in entirely different ways. Stemming is often a crude heuristic process, whereas lemmatization reduces a word to its vocabulary-based morphological base form. Let’s take a closer look!

Stemming: words are reduced to their stem. A word stem need not be identical to the dictionary-based morphological root (the smallest meaningful unit) of the word; it is just an equal or smaller form of the word.

from nltk.stem import PorterStemmer

# create an object of class PorterStemmer
porter = PorterStemmer()

# A list of words to be stemmed
word_list = ['running', ',', 'driving', 'sung', 'between', 'lasted', 'was', 'paticipated', 'before', 'severed', '1990s', '.']
print("{0:20}{1:20}".format("Word", "Porter Stemmer"))
for word in word_list:
    print("{0:20}{1:20}".format(word, porter.stem(word)))

OUTPUT:
Word                Porter Stemmer
running             run
,                   ,
driving             drive
sung                sung
between             between
lasted              last
was                 wa
paticipated         paticip
before              befor
severed             sever
1990s               1990
.                   .

Stemming is not as easy as it looks :( We might run into two issues: under-stemming and over-stemming of a word.
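To make the two failure modes concrete, here is a small illustration with word choices of my own (not from the original post): Porter collapses "universal", "university", and "universe" to one stem even though their meanings differ (over-stemming), while "alumnus" and "alumni" end up with different stems despite being forms of the same word (under-stemming).

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Over-stemming: words with different meanings collapse to one stem
print([porter.stem(w) for w in ['universal', 'university', 'universe']])
# all three reduce to the same stem

# Under-stemming: forms of the same word get different stems
print([porter.stem(w) for w in ['alumnus', 'alumni']])
# the two stems do not match
```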

Lemmatization: while stemming is a best-guess method that snips a word based on how it appears, lemmatization is a more deliberate way of pruning the word. It resolves words through a dictionary; indeed, a word’s lemma is its dictionary or canonical form.

nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

# A list of words to lemmatize
word_list = ['running', ',', 'drives', 'sung', 'between', 'lasted', 'was', 'paticipated', 'before', 'severed', '1990s', '.']
print("{0:20}{1:20}".format("Word", "Lemma"))
for word in word_list:
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word)))

OUTPUT:
Word                Lemma
running             running
,                   ,
drives              drive
sung                sung
between             between
lasted              lasted
was                 wa
paticipated         paticipated
before              before
severed             severed
1990s               1990s
.                   .

If speed is needed, stemming is the better choice, but when accuracy matters, it is better to use lemmatization.

4. Stop Words

Words like ‘in’, ‘at’, ‘on’, and ‘so’ are considered stop words. Stop words don’t carry much meaning on their own in NLP, but removing them plays an important role in tasks such as sentiment analysis.

NLTK ships with stop word lists for 16 different languages.

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
print("The stop words in NLTK lib are:", stop_words)

para = """Home Farm is one of the biggest junior football clubs in Ireland and their senior team, from 1970 up to the late 1990s, played in the League of Ireland. However, the link between Home Farm and the senior team was severed in the late 1990s. The senior side was briefly known as Home Farm Fingal in an effort to identify it with the north Dublin area."""
tokenized_para = word_tokenize(para)
modified_token_list = [word for word in tokenized_para if not word in stop_words]
print("After removing the stop words in the sentence:")
print(modified_token_list)

OUTPUT:
The stop words in NLTK lib are: {'about', 'ma', "shouldn't", 's', 'does', 't', 'our', 'mightn', 'doing', 'while', 'ourselves', 'themselves', 'will', 'some', 'you', "aren't", 'by', "needn't", 'in', 'can', 'he', 'into', 'as', 'being', 'between', 'very', 'after', 'couldn', 'himself', 'herself', 'had', 'its', 've', 'him', 'll', "isn't", 'through', 'should', 'was', 'now', 'them', "you'll", 'again', 'who', 'don', 'been', 'they', 'weren', "you're", 'both', 'd', 'me', 'didn', "won't", "you'd", 'only', 'itself', 'hadn', "should've", 'than', 'how', 'few', 're', 'down', 'these', 'y', "haven't", "mightn't", 'won', "hadn't", 'other', 'above', 'all', "doesn't", 'isn', "that'll", 'not', 'yourselves', 'at', 'mustn', "it's", 'on', 'the', 'for', "didn't", 'what', "mustn't", 'his', 'haven', 'doesn', "you've", 'are', 'out', 'hers', 'with', 'has', 'she', 'most', 'ain', 'those', 'when', 'myself', 'before', 'their', 'during', 'there', 'or', 'until', 'that', 'more', "hasn't", 'o', 'we', 'and', "shan't", 'which', 'because', "don't", 'why', 'shan', 'an', 'my', 'if', 'did', 'having', "couldn't", 'your', 'theirs', 'aren', 'just', 'further', 'here', 'of', "wouldn't", 'be', 'too', 'her', 'no', 'same', 'it', 'is', 'were', 'yourself', 'have', 'off', 'this', 'needn', 'once', "wasn't", 'against', 'wouldn', 'up', 'a', 'i', 'below', "weren't", 'over', 'own', 'then', 'so', 'do', 'from', 'shouldn', 'am', 'under', 'any', 'yours', 'ours', 'hasn', 'such', 'nor', 'wasn', 'to', 'where', 'm', "she's", 'each', 'whom', 'but'}
After removing the stop words in the sentence:
['Home', 'Farm', 'one', 'biggest', 'junior', 'football', 'clubs', 'Ireland', 'senior', 'team', ',', '1970', 'late', '1990s', ',', 'played', 'League', 'Ireland', '.', 'However', ',', 'link', 'Home', 'Farm', 'senior', 'team', 'severed', 'late', '1990s', '.', 'The', 'senior', 'side', 'briefly', 'known', 'Home', 'Farm', 'Fingal', 'effort', 'identify', 'north', 'Dublin', 'area', '.']

5. POS Tagging

Down memory lane to our early English grammar classes: can we all remember how our teachers gave us relevant instruction on the basic parts of speech for effective communication? Yeah, the good old days! Let’s teach the parts of speech to our computers too. :)

The eight parts of speech are nouns, verbs, pronouns, adjectives, adverbs, prepositions, conjunctions, and interjections.

POS tagging is the ability to identify and assign a part of speech to each word in a sentence. There are different tag sets available, but we will use the universal tagset.

nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

# tag each tokenized sentence with the universal tagset
pos_tags = [nltk.pos_tag(nltk.word_tokenize(sentence), tagset="universal") for sentence in sentences]
print(pos_tags[0])

OUTPUT:
[('Home', 'NOUN'), ('Farm', 'NOUN'), ('is', 'VERB'), ('one', 'NUM'), ('of', 'ADP'), ('the', 'DET'), ('biggest', 'ADJ'), ('junior', 'NOUN'), ('football', 'NOUN'), ('clubs', 'NOUN'), ('in', 'ADP'), ('Ireland', 'NOUN'), ('and', 'CONJ'), ('their', 'PRON'), ('senior', 'ADJ'), ('team', 'NOUN'), (',', '.'), ('from', 'ADP'), ('1970', 'NUM'), ('up', 'ADP'), ('to', 'PRT'), ('the', 'DET'), ('late', 'ADJ'), ('1990s', 'NUM'), (',', '.'), ('played', 'VERB'), ('in', 'ADP'), ('the', 'DET'), ('League', 'NOUN'), ('of', 'ADP'), ('Ireland', 'NOUN'), ('.', '.')]

One application of POS tagging is analyzing the qualities of a product in feedback: by sorting out the adjectives in customers’ reviews, we can evaluate the sentiment of the feedback. Say, for example, “How was your shopping with us?”
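A minimal sketch of that idea, with a hand-written review and hand-assigned universal tags so the example stays self-contained (in practice the tuples would come from nltk.pos_tag):

```python
# A customer review, already POS-tagged in the universal tagset
tagged_review = [('the', 'DET'), ('delivery', 'NOUN'), ('was', 'VERB'),
                 ('quick', 'ADJ'), ('but', 'CONJ'), ('the', 'DET'),
                 ('packaging', 'NOUN'), ('was', 'VERB'), ('terrible', 'ADJ')]

# Keep only the adjectives; these carry most of the sentiment
adjectives = [word for word, tag in tagged_review if tag == 'ADJ']
print(adjectives)  # ['quick', 'terrible']
```

Feeding such adjective lists to a sentiment lexicon or classifier is one simple way to score the review.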

6. Chunking

Chunking is used to add more structure to the sentence by grouping words according to their part-of-speech (POS) tags. It is also known as shallow parsing. The resulting word groups are called “chunks.” There are no predefined rules for chunking; we define our own grammar patterns.

Phrase structure conventions:

  • S (Sentence) → NP VP.
  • NP → {Determiner, Noun, Pronoun, Proper name}.
  • VP → V (NP)(PP)(Adverb).
  • PP → Preposition (NP).
  • AP → Adjective (PP).

I never had a good time with complex regular expressions and used to stay as far away from them as I could, but I lately realized how important it is to have a grip on regular expressions in data science. Let’s start with a simple instance.

If we need to tag nouns, verbs (past tense), adjectives, and coordinating conjunctions in a sentence, we can use the rule below:

chunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}

import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, RegexpParser

content = "Home Farm is one of the biggest junior football clubs in Ireland and their senior team, from 1970 up to the late 1990s, played in the League of Ireland. However, the link between Home Farm and the senior team was severed in the late 1990s. The senior side was briefly known as Home Farm Fingal in an effort to identify it with the north Dublin area."
tokenized_text = nltk.word_tokenize(content)
print("After Split:", tokenized_text)
tokens_tag = pos_tag(tokenized_text)
print("After Token:", tokens_tag)
patterns = """mychunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}"""
chunker = RegexpParser(patterns)
print("After Regex:", chunker)
output = chunker.parse(tokens_tag)
print("After Chunking", output)

OUTPUT:
After Regex: chunk.RegexpParser with 1 stages:
RegexpChunkParser with 1 rules:
       <ChunkRule: '<NN.?>*<VBD.?>*<JJ.?>*<CC>?'>
After Chunking (S
  (mychunk Home/NN Farm/NN)
  is/VBZ
  one/CD
  of/IN
  the/DT
  (mychunk biggest/JJS)
  (mychunk junior/NN football/NN clubs/NNS)
  in/IN
  (mychunk Ireland/NNP and/CC)
  their/PRP$
  (mychunk senior/JJ)
  (mychunk team/NN)
  ,/,
  from/IN
  1970/CD
  up/IN
  to/TO
  the/DT
  (mychunk late/JJ)
  1990s/CD
  ,/,
  played/VBN
  in/IN
  the/DT
  (mychunk League/NNP)
  of/IN
  (mychunk Ireland/NNP)
  ./.)

7. Wordnet

WordNet is an NLTK corpus reader and a lexical database for English. It can be used to find synonyms or antonyms.

from nltk.corpus import wordnet

synonyms = []
antonyms = []
for syn in wordnet.synsets("active"):
    for lemmas in syn.lemmas():
        synonyms.append(lemmas.name())
for syn in wordnet.synsets("active"):
    for lemmas in syn.lemmas():
        if lemmas.antonyms():
            antonyms.append(lemmas.antonyms()[0].name())
print("Synonyms are:", synonyms)
print("Antonyms are:", antonyms)

OUTPUT:
Synonyms are: ['active_agent', 'active', 'active_voice', 'active', 'active', 'active', 'active', 'combat-ready', 'fighting', 'active', 'active', 'participating', 'active', 'active', 'active', 'active', 'alive', 'active', 'active', 'active', 'dynamic', 'active', 'active', 'active']
Antonyms are: ['passive_voice', 'inactive', 'passive', 'inactive', 'inactive', 'inactive', 'quiet', 'passive', 'stative', 'extinct', 'dormant', 'inactive']

8. Bag of Words

A bag-of-words model turns the raw text into individual words and counts the frequency of each word in the text.

import nltk
import re  # to match regular expressions

text = "Home Farm is one of the biggest junior football clubs in Ireland and their senior team, from 1970 up to the late 1990s, played in the League of Ireland. However, the link between Home Farm and the senior team was severed in the late 1990s. The senior side was briefly known as Home Farm Fingal in an effort to identify it with the north Dublin area."
sentences = nltk.sent_tokenize(text)
for i in range(len(sentences)):
    sentences[i] = sentences[i].lower()
    sentences[i] = re.sub(r'\W', ' ', sentences[i])
    sentences[i] = re.sub(r'\s+', ' ', sentences[i])

bag_of_words = {}
for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    for word in words:
        if word not in bag_of_words.keys():
            bag_of_words[word] = 1
        else:
            bag_of_words[word] += 1
print(bag_of_words)

OUTPUT:
{'home': 3, 'farm': 3, 'is': 1, 'one': 1, 'of': 2, 'the': 8, 'biggest': 1, 'junior': 1, 'football': 1, 'clubs': 1, 'in': 4, 'ireland': 2, 'and': 2, 'their': 1, 'senior': 3, 'team': 2, 'from': 1, '1970': 1, 'up': 1, 'to': 2, 'late': 2, '1990s': 2, 'played': 1, 'league': 1, 'however': 1, 'link': 1, 'between': 1, 'was': 2, 'severed': 1, 'side': 1, 'briefly': 1, 'known': 1, 'as': 1, 'fingal': 1, 'an': 1, 'effort': 1, 'identify': 1, 'it': 1, 'with': 1, 'north': 1, 'dublin': 1, 'area': 1}

9. TF-IDF

TF-IDF stands for Term Frequency - Inverse Document Frequency.

Text data needs to be converted to a numerical format in which each word is represented in matrix form. The simplest encoding of a given word is a vector in which the corresponding element is set to one and all other elements are zero. Techniques such as TF-IDF, which represent words as numeric vectors, are thus sometimes loosely referred to as word embeddings.
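The one-hot encoding described above can be sketched in a few lines of plain Python (the toy sentence is mine, used only for illustration):

```python
sentence = "the senior side was briefly known as home farm fingal"
vocab = sorted(set(sentence.split()))

# Each word maps to a vector with a single 1 at its own index
one_hot = {word: [1 if i == j else 0 for j in range(len(vocab))]
           for i, word in enumerate(vocab)}

print(one_hot['farm'])  # exactly one element is 1, the rest are 0
```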

TF-IDF works on two concepts:

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

IDF(t) = log_e(Total number of documents / Number of documents with term t in it)

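To make the two formulas concrete before handing the work to a library, here is a worked sketch in plain Python. The three tiny "documents" are made up for illustration; note that scikit-learn's TfidfTransformer uses a smoothed IDF, so its numbers will differ slightly.

```python
import math

def tf(term, doc):
    # times the term appears in the document / total terms in the document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log_e(total documents / documents containing the term)
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

docs = [["home", "farm", "is", "a", "club"],
        ["home", "farm", "fingal"],
        ["the", "north", "dublin", "area"]]

# "home" appears in 2 of 3 documents: common, so a lower score
print(round(tf("home", docs[0]) * idf("home", docs), 4))  # 0.2 * ln(1.5) ≈ 0.0811
# "club" appears in only 1 document: rare, so a higher score
print(round(tf("club", docs[0]) * idf("club", docs), 4))  # 0.2 * ln(3)   ≈ 0.2197
```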
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

docs = ["Home Farm is one of the biggest junior football clubs in Ireland and their senior team, from 1970 up to the late 1990s, played in the League of Ireland",
        "However, the link between Home Farm and the senior team was severed in the late 1990s",
        "The senior side was briefly known as Home Farm Fingal in an effort to identify it with the north Dublin area"]

# instantiate CountVectorizer()
cv = CountVectorizer()

# this step generates word counts for the words in the docs
word_count_vector = cv.fit_transform(docs)

tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(word_count_vector)

# print idf values
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names(), columns=["idf_weights"])
# sort ascending
df_idf.sort_values(by=['idf_weights'])

# count matrix
count_vector = cv.transform(docs)
# tf-idf scores
tf_idf_vector = tfidf_transformer.transform(count_vector)

feature_names = cv.get_feature_names()
# get the tfidf vector for the first document
first_document_vector = tf_idf_vector[0]

# print the scores
df = pd.DataFrame(first_document_vector.T.todense(), index=feature_names, columns=["tfidf"])
df.sort_values(by=["tfidf"], ascending=False)

OUTPUT:
             tfidf
of        0.374810
ireland   0.374810
the       0.332054
in        0.221369
1970      0.187405
football  0.187405
up        0.187405
as        0.000000
an        0.000000
and so on...

What are these scores telling us? The more common a word is across documents, the lower its score; the more unique the word, the higher its score.

So far, we have walked through the steps of cleaning and pre-processing text. What can we do with the processed data after all this? We could use it for sentiment analysis, chatbots, or market intelligence, or perhaps build a recommender system based on user purchases or item reviews, or perform customer segmentation with clustering.

Computers are still not as accurate with human language as they are with numbers. With the massive volume of text data generated every day, NLP is indeed becoming ever more significant for making sense of that data, and it is being used in many other applications. Hence, there are endless ways to explore NLP.

Translated from: https://medium.com/analytics-vidhya/natural-language-processing-bedb2e1c8ceb
