What Is Tokenization? A Beginner's Guide for NLPers

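Introduction

Are you fascinated by the vast amount of text data on the internet? Are you looking for ways to work with it but unsure where to start? After all, machines understand numbers, not the letters of our languages, which makes text a tricky problem in machine learning.

So how do we manipulate and process text data in order to build models? The answer lies in the wonderful world of Natural Language Processing (NLP).

Solving an NLP problem is a multi-stage process. Before moving on to modeling, we first need to clean the unstructured text data. This involves several key steps:

Tokenization

Predicting the part of speech of each word

Lemmatization

Identifying and removing stop words, and so on

In this article, we discuss the first step: tokenization. We first look at what tokenization is and why NLP needs it, and then walk through six distinct ways of performing tokenization in Python.

There are no real prerequisites for this article; anyone interested in NLP or data science can follow along.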

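What is tokenization in NLP?

Tokenization is one of the most common tasks when working with text data. But what does the term actually mean? Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.

Tokens can be words, numbers, or punctuation marks. In tokenization, the smaller units are created by locating word boundaries. Wait, what is a word boundary?

A word boundary is the point where one word ends and the next word begins. Tokenization is regarded as the first step before stemming and lemmatization.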
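To make the idea of a word boundary concrete, here is a minimal sketch that uses the \b anchor from Python's standard re module, which matches the empty position between a word character and a non-word character. The helper name words_from_boundaries is made up for this illustration, and real tokenizers apply more elaborate rules than this.

```python
import re

def words_from_boundaries(text):
    """Recover words by slicing the text between regex word boundaries."""
    # \b matches the empty position where a run of word characters starts or ends.
    bounds = [m.start() for m in re.finditer(r"\b", text)]
    # Boundaries come in (start, end) pairs, one pair per word.
    return [text[start:end] for start, end in zip(bounds[::2], bounds[1::2])]

sentence = "This is a cat."
print(words_from_boundaries(sentence))
```

The final period is not a word character, so it falls outside every (start, end) pair and is dropped.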

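Why is tokenization needed in NLP?

Here I want you to think about the English language. Pick any English sentence you can think of and keep it in mind as you read this section. It will make the importance of tokenization much easier to see.

Before processing a natural language, we need to identify the words that make up a string, which is why tokenization is the most basic step in processing text data for NLP. This matters because much of a text's meaning can be interpreted by analyzing the words it contains.

Let's take an example. Consider the following string: "This is a cat."

What do you think happens after we tokenize this string? That's right, we get ['This', 'is', 'a', 'cat'].

Tokenized text has many uses. With the tokens we can:

Count the total number of words in the text

Count the frequency of a word, that is, how many times a particular word occurs

There are other uses as well. We can extract much more information, which will be discussed in detail in future articles. For now, it is time to dive into the main topic of this article: the different ways of performing tokenization in NLP.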
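The two uses listed above, the total word count and the per-word frequency, can be sketched with the standard library alone. The sample sentence here is invented for the illustration, and whitespace splitting stands in for whichever tokenizer you end up using.

```python
from collections import Counter

# A made-up sample sentence, tokenized on whitespace for simplicity.
tokens = "the cat sat on the mat near the door".split()

total_words = len(tokens)    # total number of tokens in the text
word_freq = Counter(tokens)  # maps each token to its number of occurrences

print(total_words)
print(word_freq.most_common(2))
```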

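Methods to perform tokenization in Python

We will cover six distinct ways of tokenizing English text data. I have provided Python code for each method, so you can run the examples on your own machine as you learn.

1. Tokenization using Python's split() function

Let's start with the split() method, since it is the most basic one. It splits the given string on the specified separator and returns a list of strings. By default, split() uses runs of one or more whitespace characters as the separator, and we can change the separator to any other string. Let's take a look.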
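Before the full example, here is a small sketch of how the default separator differs from an explicit one. The sample strings are invented for the illustration.

```python
# With no argument, split() treats any run of whitespace as one separator.
default_split = "a  b   c".split()

# With an explicit separator, every occurrence counts, so two consecutive
# spaces produce an empty string between them.
space_split = "a  b".split(" ")

# Any other string can serve as the separator, for example a comma.
comma_split = "2002,SpaceX,Falcon 1".split(",")

print(default_split)
print(space_split)
print(comma_split)
```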

Word tokenization:

text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed
liquid-fuel launch vehicle to orbit the Earth."""

# Split on whitespace (the default separator)
text.split()

Output : ['Founded', 'in', '2002,', 'SpaceX’s', 'mission', 'is', 'to', 'enable', 'humans',
'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi-planet',
'species', 'by', 'building', 'a', 'self-sustaining', 'city', 'on', 'Mars.', 'In',
'2008,', 'SpaceX’s', 'Falcon', '1', 'became', 'the', 'first', 'privately',
'developed', 'liquid-fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'Earth.']

Sentence tokenization:

This is similar to word tokenization, except that here we look at the structure of sentences. A sentence usually ends with a period (.), so we can split the string using '. ' (a period followed by a space) as the separator:

text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed
liquid-fuel launch vehicle to orbit the Earth."""

# Split on '. ' (a period followed by a space)
text.split('. ')

Output : ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring
civilization and a multi-planet \nspecies by building a self-sustaining city on
Mars',
'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel
launch vehicle to orbit the Earth.']

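A major drawback of Python's split() method is that it accepts only one separator at a time. Also note that in word tokenization, split() does not treat punctuation marks as separate tokens: '2002,' and 'Mars.' in the output above keep their trailing punctuation.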
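One possible workaround for the punctuation issue, shown here only as a sketch, is to strip leading and trailing ASCII punctuation from each token after splitting. The helper name split_and_strip is hypothetical, and string.punctuation covers only ASCII marks, so a curly apostrophe such as ’ is left untouched.

```python
import string

def split_and_strip(text):
    """Split on whitespace, then strip ASCII punctuation from token edges."""
    tokens = []
    for raw in text.split():
        token = raw.strip(string.punctuation)
        if token:  # skip tokens that consisted only of punctuation
            tokens.append(token)
    return tokens

print(split_and_strip("Founded in 2002, SpaceX aims for a multi-planet future on Mars."))
```

Punctuation inside a token, such as the hyphen in 'multi-planet', is preserved because strip() only touches the ends of the string.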

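2. Tokenization using regular expressions (RegEx)

First, what is a regular expression? It is a special sequence of characters that defines a pattern, which you can use to match or find other strings or sets of strings.

We can work with regular expressions through Python's re module, which is part of the standard library and needs no separate installation.

With regular expressions in mind, let's perform word tokenization and sentence tokenization.

Word tokenization: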

import re

text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed
liquid-fuel launch vehicle to orbit the Earth."""

# Use a raw string so that \w reaches the regex engine unchanged
tokens = re.findall(r"[\w']+", text)
tokens

Output : ['Founded', 'in', '2002', 'SpaceX', 's', 'mission', 'is', 'to', 'enable',
'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a',
'multi', 'planet', 'species', 'by', 'building', 'a', 'self', 'sustaining',
'city', 'on', 'Mars', 'In', '2008', 'SpaceX', 's', 'Falcon', '1', 'became',
'the', 'first', 'privately', 'developed', 'liquid', 'fuel', 'launch', 'vehicle',
'to', 'orbit', 'the', 'Earth']

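The re.findall() function finds all substrings that match the pattern passed to it and returns them in a list. \w matches any word character, meaning letters, digits, and the underscore (_). + means one or more occurrences. So [\w']+ matches runs of word characters and straight apostrophes until any other character is encountered. Because the text uses the curly apostrophe (’) rather than the straight one ('), 'SpaceX’s' is split into 'SpaceX' and 's'.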
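As a small sketch of how the character class controls what counts as part of a token, the pattern can be widened to include the curly apostrophe and the hyphen. The widened pattern is my own variation for illustration, not part of the original example.

```python
import re

sample = "SpaceX’s multi-planet mission"

# Original pattern: word characters and the straight apostrophe only.
basic = re.findall(r"[\w']+", sample)

# Widened pattern: also accept the curly apostrophe and the hyphen.
# A hyphen placed last inside [...] is taken literally.
extended = re.findall(r"[\w'’-]+", sample)

print(basic)
print(extended)
```

Which characters belong in the class depends on the task; keeping hyphens together is convenient for compound words but also keeps strings such as '--' as tokens.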

Sentence tokenization:

To perform sentence tokenization, we can use re.split(), which splits the text into sentences wherever the pattern we pass matches.

import re

text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed
liquid-fuel launch vehicle to orbit the Earth."""

# Split wherever '.', '!' or '?' is followed by a space
sentences = re.compile(r'[.!?] ').split(text)
sentences

Output : ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring
civilization and a multi-planet \nspecies by building a self-sustaining city on
Mars',
'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel
launch vehicle to orbit the Earth.']

Here we have an advantage over the split() method: we can specify multiple delimiters at once. In the code above, we used re.compile() with the pattern '[.!?] ', so the text is split wherever one of these characters is followed by a space. Note that the matched delimiter is discarded, which is why the first sentence loses its final period.
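If the sentence-ending punctuation should be kept, one option is a lookbehind assertion, which checks for the punctuation without consuming it. This is a sketch under that assumption; the helper name split_sentences_keep_punct is hypothetical, and abbreviations such as "Dr." would still be split incorrectly.

```python
import re

def split_sentences_keep_punct(text):
    """Split on whitespace that follows '.', '!' or '?', keeping the mark."""
    # (?<=[.!?]) is a lookbehind: it requires the punctuation to be present
    # just before the split point but leaves it attached to the sentence.
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(split_sentences_keep_punct("Is it ready? Yes! Launch is at noon."))
```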


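3. Tokenization using NLTK

NLTK, short for Natural Language Toolkit, is a Python library for symbolic and statistical natural language processing.

You can install NLTK with the following command: pip install --user -U nltk

NLTK contains a module named tokenize (nltk.tokenize), whose functions fall into two categories:

Word tokenize: we use word_tokenize() to split a sentence into tokens

Sentence tokenize: we use sent_tokenize() to split a document or paragraph into sentences

Let's go through them one by one.

Word tokenization: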

from nltk.tokenize import word_tokenize

text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed
liquid-fuel launch vehicle to orbit the Earth."""

word_tokenize(text)

Output: ['Founded', 'in', '2002', ',', 'SpaceX', '’', 's', 'mission', 'is', 'to', 'enable',
'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a',
'multi-planet', 'species', 'by', 'building', 'a', 'self-sustaining', 'city', 'on',
'Mars', '.', 'In', '2008', ',', 'SpaceX', '’', 's', 'Falcon', '1', 'became',
'the', 'first', 'privately', 'developed', 'liquid-fuel', 'launch', 'vehicle',
'to', 'orbit', 'the', 'Earth', '.']

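Notice how NLTK treats punctuation marks as tokens? For later tasks that work only on words, we therefore need to remove these punctuation tokens from the initial list.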
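Removing the punctuation tokens can be sketched with the standard library alone. The token list below is a shortened, hard-coded copy of the kind of output shown above, so the snippet runs without NLTK installed; the helper name is_punctuation is made up for this example. It relies on Unicode categories, so it also catches the curly apostrophe.

```python
import unicodedata

def is_punctuation(token):
    """Return True if every character of the token is Unicode punctuation."""
    # All Unicode punctuation categories start with "P" (Po, Pf, Pd, ...).
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )

# Shortened, hard-coded token list in the shape that word_tokenize returns.
tokens = ['Founded', 'in', '2002', ',', 'SpaceX', '’', 's', 'mission', '.']
words = [tok for tok in tokens if not is_punctuation(tok)]

print(words)
```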

Sentence tokenization:

from nltk.tokenize import sent_tokenize

text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed
liquid-fuel launch vehicle to orbit the Earth."""

sent_tokenize(text)

Output: ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring
civilization and a multi-planet \nspecies by building a self-sustaining city on
Mars.',
'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel
launch vehicle to orbit the Earth.']

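4. Tokenization using the spaCy library

I love the spaCy library. I can't remember the last time I worked on an NLP project without it. It is just that useful.

spaCy is an open-source library for advanced Natural Language Processing (NLP). It supports more than 49 languages and is built for speed.

To install spaCy on Linux: pip install -U spacy

python -m spacy download en_core_web_sm

(spaCy 2.x also accepted the shortcut name en for this model. The examples below use a blank English pipeline and do not need the downloaded model.)

For other operating systems, see the installation instructions on the spaCy website.

So, let's see how we can use spaCy to perform tokenization. We will use spacy.lang.en, which supports the English language.

Word tokenization: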

from spacy.lang.en import English

# Create a blank English pipeline; it contains only the tokenizer
# (no tagger, parser, named entity recognizer, or word vectors)
nlp = English()

text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed
liquid-fuel launch vehicle to orbit the Earth."""

# Calling the "nlp" object on a string returns a Doc object
my_doc = nlp(text)

# Build a list of word tokens
token_list = []
for token in my_doc:
    token_list.append(token.text)
token_list

Output : ['Founded', 'in', '2002', ',', 'SpaceX', '’s', 'mission', 'is', 'to', 'enable',
'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a',
'multi', '-', 'planet', '\n', 'species', 'by', 'building', 'a', 'self', '-',
'sustaining', 'city', 'on', 'Mars', '.', 'In', '2008', ',', 'SpaceX', '’s',
'Falcon', '1', 'became', 'the', 'first', 'privately', 'developed', '\n',
'liquid', '-', 'fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'Earth', '.']

Sentence tokenization:

from spacy.lang.en import English

# Create a blank English pipeline; it contains only the tokenizer
nlp = English()

# Add the rule-based 'sentencizer' component to the pipeline
# (spaCy v3 syntax; in spaCy v2 this was nlp.add_pipe(nlp.create_pipe('sentencizer')))
nlp.add_pipe('sentencizer')

text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed
liquid-fuel launch vehicle to orbit the Earth."""

# Calling the "nlp" object on a string returns a Doc object
doc = nlp(text)

# Build a list of sentences
sents_list = []
for sent in doc.sents:
    sents_list.append(sent.text)
sents_list

Output : ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring
civilization and a multi-planet \nspecies by building a self-sustaining city on
Mars.',
'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel
launch vehicle to orbit the Earth.']

spaCy is quite fast compared with other libraries when performing NLP tasks (yes, even compared with NLTK). The DataHack Radio podcast has an episode that covers how spaCy was created and where you can use it.

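5. Tokenization using Keras

Keras is one of the most popular deep learning frameworks in the industry right now. It is an open-source neural network library for Python. Keras is very easy to use and can run on top of TensorFlow.

In the NLP context, we can use Keras to clean the unstructured text data that we typically collect.

Installing Keras on your machine takes just one line: pip install Keras

To perform word tokenization with Keras, we use the text_to_word_sequence function from the keras.preprocessing.text module. (This module belongs to the Keras 2.x API and may not be available in newer releases.)

Let's see Keras in action.

Word tokenization: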

from keras.preprocessing.text import text_to_word_sequence

# Text data
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed
liquid-fuel launch vehicle to orbit the Earth."""

# Tokenize
result = text_to_word_sequence(text)
result

Output : ['founded', 'in', '2002', 'spacex’s', 'mission', 'is', 'to', 'enable', 'humans',
'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi',
'planet', 'species', 'by', 'building', 'a', 'self', 'sustaining', 'city', 'on',
'mars', 'in', '2008', 'spacex’s', 'falcon', '1', 'became', 'the', 'first',
'privately', 'developed', 'liquid', 'fuel', 'launch', 'vehicle', 'to', 'orbit',
'the', 'earth']

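Keras converts all letters to lowercase before tokenizing, and by default it also filters out ASCII punctuation. As you can imagine, this saves us quite a bit of time!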
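To show what lowercasing plus filtering amounts to, here is a rough pure-Python approximation of that behavior. It is a sketch, not Keras's actual code: the function name simple_word_sequence is hypothetical, and the filter string is my assumption, modeled on the documented default of text_to_word_sequence.

```python
# Assumed filter set: ASCII punctuation (without the apostrophe), tab, newline.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=DEFAULT_FILTERS, lower=True, split=" "):
    """Lowercase the text, replace filtered characters, then split."""
    if lower:
        text = text.lower()
    # Replace every filtered character with the split character.
    text = text.translate(str.maketrans({ch: split for ch in filters}))
    # Drop the empty strings produced by consecutive separators.
    return [tok for tok in text.split(split) if tok]

print(simple_word_sequence("Hello, World! It's multi-planet."))
```

Because the curly apostrophe is not in the filter set, a token such as 'spacex’s' survives intact, which matches the Keras output shown above.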

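6. Tokenization using Gensim

The last tokenization method we cover uses the Gensim library. It is an open-source library for unsupervised topic modeling and natural language processing, designed to automatically extract semantic topics from a given document.

Install Gensim with: pip install gensim

We can import the tokenize function from the gensim.utils module to perform word tokenization.

Word tokenization: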

from gensim.utils import tokenize

text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed
liquid-fuel launch vehicle to orbit the Earth."""

# tokenize() returns a generator, so wrap it in list() to see the tokens
list(tokenize(text))

Output : ['Founded', 'in', 'SpaceX', 's', 'mission', 'is', 'to', 'enable', 'humans', 'to',
'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi', 'planet',
'species', 'by', 'building', 'a', 'self', 'sustaining', 'city', 'on', 'Mars',
'In', 'SpaceX', 's', 'Falcon', 'became', 'the', 'first', 'privately',
'developed', 'liquid', 'fuel', 'launch', 'vehicle', 'to', 'orbit', 'the',
'Earth']

Sentence tokenization:

# Note: the gensim.summarization package was removed in Gensim 4.0,
# so this import requires Gensim 3.x
from gensim.summarization.textcleaner import split_sentences

text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed
liquid-fuel launch vehicle to orbit the Earth."""

result = split_sentences(text)
result

Output : ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring
civilization and a multi-planet ',
'species by building a self-sustaining city on Mars.',
'In 2008, SpaceX’s Falcon 1 became the first privately developed ',
'liquid-fuel launch vehicle to orbit the Earth.']

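You may have noticed that Gensim is quite strict with punctuation: it splits whenever it encounters a punctuation mark, and the word-level output above also shows that numeric tokens such as '2002' are dropped. In sentence splitting, Gensim also breaks the text at \n, whereas the other libraries ignore it.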
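The behavior described above can be imitated with two small regular expressions. This is only an approximation written for illustration, not Gensim's implementation; the names ALPHABETIC, alphabetic_tokens, and split_on_newlines_and_stops are all hypothetical.

```python
import re

# Runs of word characters that are not digits: drops numbers and punctuation.
ALPHABETIC = re.compile(r"(?:(?!\d)\w)+")

def alphabetic_tokens(text):
    """Return alphabetic tokens only, skipping digits and punctuation."""
    return ALPHABETIC.findall(text)

def split_on_newlines_and_stops(text):
    """Split at every newline and after sentence-ending punctuation."""
    parts = re.split(r"\n|(?<=[.!?])\s+", text)
    return [part for part in parts if part]

print(alphabetic_tokens("In 2008, SpaceX’s Falcon 1 flew."))
print(split_on_newlines_and_stops("First line\nsecond part. Next one."))
```

Treating a newline as a sentence break is convenient for text with one sentence per line, but it cuts sentences in half when the text is hard-wrapped, as in the SpaceX example.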

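Summary

Tokenization is a key step in the overall NLP pipeline. We cannot simply jump to the model-building stage without first cleaning the text.

In this article, we used six different methods to tokenize a given English text at both the word and the sentence level. There are other methods too, of course, but these are enough to get you started with tokenization.

[1]: Translator's note: some Chinese sources render tokenization as 分词 (word segmentation). Because segmentation works differently for Chinese and English text, and because this article splits English text into sentences as well as words, the Chinese version of this article uses 标识化 rather than 分词.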
