What Is Natural Language Processing, and How Does It Work?

NicoElNino / Shutterstock.com

Natural language processing enables computers to turn what we say into commands that they can execute. Find out the basics of how it works and how it’s being used to improve our lives.

What Is Natural Language Processing?

Whether it’s Alexa, Siri, Google Assistant, Bixby, or Cortana, everyone with a smartphone or smart speaker has a voice-activated assistant nowadays. Every year, these voice assistants seem to get better at recognizing and executing the things we tell them to do. But have you ever wondered how these assistants process the things we’re saying? They manage to do this thanks to Natural Language Processing, or NLP.

Historically, most software has only been able to respond to a fixed set of specific commands. A file will open because you clicked Open, or a spreadsheet will compute a formula based on certain symbols and formula names. A program communicates using the programming language that it was coded in, and will thus produce an output when it is given input that it recognizes. In this context, words are like a set of different mechanical levers that always provide the desired output.

This is in contrast to human languages, which are complex, unstructured, and have a multitude of meanings based on sentence structure, tone, accent, timing, punctuation, and context. Natural Language Processing is a branch of artificial intelligence that attempts to bridge that gap between what a machine recognizes as input and the human language. This is so that when we speak or type naturally, the machine produces an output in line with what we said.

This is done by analyzing vast numbers of data points to derive meaning from the various elements of human language, beyond the meanings of the individual words. This process is closely tied to machine learning, which enables computers to learn more as they obtain more data. That is why most of the natural language processing systems we interact with frequently seem to get better over time.

To illustrate the concept, let’s have a look at two of the most basic techniques used in NLP to process language and information.

Tokenization

Tokenization means splitting up speech into words or sentences. Each piece of text is a token, and these tokens are what show up when your speech is processed. It sounds simple, but in practice, it’s a tricky process.

Let’s say that you are using speech-to-text software, such as the Google Keyboard, to send a message to a friend. You want to message, “Meet me at the park.” When your phone takes that recording and processes it through Google’s speech-to-text algorithm, Google must then split what you just said into tokens. These tokens would be “meet,” “me,” “at,” “the,” and “park.”

People leave pauses of different lengths between words, and some languages have very little in the way of an audible pause between words. As a result, the tokenization process varies drastically between languages and dialects.
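
As a rough illustration of the idea, the sketch below tokenizes text that has already been transcribed, using the “Meet me at the park” message from above. It is a minimal example built on simple regular expressions, not the method any real speech-to-text product uses; the function names `tokenize_words` and `tokenize_sentences` are made up for this example, and the rules assume English text with spaces between words.

```python
import re

def tokenize_words(text):
    """Lowercase the text and return its word tokens in order."""
    # A token is a run of letters or digits, optionally with inner apostrophes.
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text.lower())

def tokenize_sentences(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

print(tokenize_words("Meet me at the park."))
# ['meet', 'me', 'at', 'the', 'park']
print(tokenize_sentences("Meet me at the park. Bring the dog!"))
# ['Meet me at the park.', 'Bring the dog!']
```

Rules this simple only work when words are separated by spaces and punctuation, which is one reason tokenization differs so much from language to language.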

Stemming and Lemmatization

Stemming and lemmatization both reduce a word to a root form that the machine can recognize by removing additions or variations. This makes the interpretation of speech consistent across different words that all mean essentially the same thing, which makes NLP processing faster.

Stemming is a crude, fast process that involves removing affixes, which are additions attached before or after the root of a word. This turns the word into its simplest base form by simply removing letters. For example:

“Walking” turns into “walk”
“Faster” turns into “fast”
“Severity” turns into “sever”

As you can see, stemming may have the adverse effect of changing the meaning of a word entirely. “Severity” and “sever” do not mean the same thing, but the suffix “ity” was removed in the process of stemming.
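
To make the “simply removing letters” idea concrete, here is a toy suffix stripper that reproduces the three examples above. It is only a sketch: the suffix list, the minimum stem length, and the function name `crude_stem` are assumptions chosen for this illustration, and established stemmers such as the Porter stemmer apply a much larger and more careful set of rules.

```python
# Suffixes to strip, checked longest first (an illustrative list, not a standard one).
SUFFIXES = ("ity", "ing", "er", "ed", "s")

def crude_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    # No suffix matched, or stripping would leave too short a stem.
    return word

for example in ("Walking", "Faster", "Severity"):
    print(example, "->", crude_stem(example))
# Walking -> walk
# Faster -> fast
# Severity -> sever
```

The function never looks at what a word means, so “severity” loses its suffix just as mechanically as “walking” does.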

On the other hand, lemmatization is a more sophisticated process that involves reducing a word to its base form, known as the lemma. This takes into consideration the context of the word and how it’s used in a sentence. It also involves looking up a term in a database of words and their respective lemmas. For example:

“Are” turns into “be”
“Operation” turns into “operate”
“Severity” turns into “severe”

In this example, lemmatization managed to turn the term “severity” into “severe,” which is its lemma form and root word.
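
The lookup side of lemmatization can be sketched with a small table keyed by a word and its part of speech. Everything here is hypothetical and hand-written for illustration: the table `LEMMA_TABLE` mirrors the examples above plus one extra word (“meeting”) to show why context matters, and a real lemmatizer would rely on a large lexical database and would work out the part of speech from the sentence itself.

```python
# A tiny hand-written lemma table keyed by (word, part of speech).
# Real systems use far larger dictionaries; this one is purely illustrative.
LEMMA_TABLE = {
    ("are", "VERB"): "be",
    ("operation", "NOUN"): "operate",
    ("severity", "NOUN"): "severe",
    ("meeting", "VERB"): "meet",     # "We are meeting at the park."
    ("meeting", "NOUN"): "meeting",  # "The meeting ran late."
}

def lemmatize(word, part_of_speech):
    """Look up the lemma for a word used as the given part of speech."""
    key = (word.lower(), part_of_speech.upper())
    # Unknown words are returned unchanged (lowercased).
    return LEMMA_TABLE.get(key, word.lower())

print(lemmatize("Are", "verb"))       # be
print(lemmatize("Severity", "noun"))  # severe
print(lemmatize("meeting", "verb"))   # meet
print(lemmatize("meeting", "noun"))   # meeting
```

The same spelling, “meeting,” maps to two different lemmas depending on how it is used, which is something a stemmer that only removes letters cannot capture.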

NLP Use Cases and the Future

The previous examples only begin to scratch the surface of what Natural Language Processing is. It encompasses a wide range of practices and usage scenarios, many of which we use in our daily lives. These are a few examples of where NLP is currently in use:

Predictive Text: When you type a message on your smartphone, it automatically suggests words that fit into the sentence or that you’ve used before.

Machine Translation: Widely used consumer translation services, such as Google Translate, incorporate a high-level form of NLP to process language and translate it.

Chatbots: NLP is the foundation for intelligent chatbots, especially in customer service, where they can assist customers and process their requests before they face a real person.
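
To give a feel for the predictive text item in the list above, the following sketch suggests a next word by counting which words have followed the current word in earlier messages. This is one simple approach, sometimes called a bigram model, and not necessarily how any particular phone keyboard works; the function names `train_bigrams` and `suggest` and the sample messages are invented for this example.

```python
from collections import Counter, defaultdict

def train_bigrams(messages):
    """Count, for every word, which words followed it in past messages."""
    following = defaultdict(Counter)
    for message in messages:
        words = message.lower().split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following

def suggest(following, word, k=3):
    """Return up to k words that most often followed the given word."""
    counts = following.get(word.lower(), Counter())
    return [candidate for candidate, _ in counts.most_common(k)]

# Hypothetical message history used as training data.
history = [
    "meet me at the park",
    "meet me at the station",
    "see you at the park",
]
model = train_bigrams(history)
print(suggest(model, "the"))  # ['park', 'station']
print(suggest(model, "at"))   # ['the']
```

Because the counts come from the user’s own messages, the suggestions shift toward words that the user has typed before.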

There’s more to come. NLP applications are currently being developed and deployed in fields such as news media, medical technology, workplace management, and finance. In the future, we may even be able to have a full-fledged, sophisticated conversation with a robot.

If you’re interested in learning more about NLP, there are a lot of fantastic resources that you can check out on the Towards Data Science blog or from the Stanford Natural Language Processing Group.

Translated from: https://www.howtogeek.com/665702/what-is-natural-language-processing-and-how-does-it-work/