Source: 看个通俗理解吧

Chinese Natural Language Understanding (NLU) in Dialogue Systems, Parts 1 & 2: Task Introduction and Chinese vs. English in NLP

When I first came across the term "natural language understanding" (NLU), I was confused: how does it differ from natural language processing (NLP)? The purpose of NLP is to understand text produced by humans, so from that perspective every NLP task should fall within the scope of NLU, and the two terms would be more or less equivalent.

On closer inspection, it turns out that people generally use "natural language understanding" to refer to a specific set of tasks. This series follows that established usage, so here NLU means the understanding tasks that arise in dialogue systems.

This will again be a series of articles:

  1. Task Introduction (← this article)

  2. Chinese vs. English in NLP (← this article)

  3. Academic Methods

  4. Industry Methods

  5. Chinese Dialogue System Challenge Tracks

  6. Useful Resources

Who is this series for? Whether you are new to natural language processing, an expert who has worked in the field for years, or someone who needs a reasonably complete outline to explain these tasks to others, I hope you will find something useful here. The series is well suited to getting a quick overview of the general ideas behind different approaches, or to reviewing what previous work has done.

1. Task Introduction

We start with a brief description of the task; the specific sub-tasks will be covered in detail in later articles. In the simplest terms, we want a model to understand what we say and respond accordingly.

In the example below, we give a smart speaker two commands: play today's news, and turn up the volume.

The smart speaker carries out both: it plays the news and raises the volume.

These two examples give a simple picture of the natural language understanding task. We expect the model to pick out the following key pieces of information:

  • The intent is to play the news

    • Beijing (where the news is from)

    • Today (when the news is from)

  • The intent is to adjust the volume: increase
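The key information listed above is often represented as an intent plus a set of slots. The following is a minimal sketch of such a structure; the class name `NLUResult`, the intent labels and the slot names are hypothetical choices for illustration, not a standard schema.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class NLUResult:
    """Structured meaning of one command (hypothetical schema)."""
    intent: str
    slots: Dict[str, str] = field(default_factory=dict)


# Hand-written target outputs for the two commands in the example.
# Intent and slot names are illustrative assumptions.
play_news = NLUResult(intent="play_news",
                      slots={"location": "Beijing", "date": "today"})
adjust_volume = NLUResult(intent="adjust_volume",
                          slots={"direction": "increase"})


def to_action(result: NLUResult) -> str:
    """Render a parsed result as a readable action string."""
    args = ", ".join(f"{name}={value}"
                     for name, value in sorted(result.slots.items()))
    return f"{result.intent}({args})"


print(to_action(play_news))
print(to_action(adjust_volume))
```

An NLU model's job is to produce such a structure from raw text; what the speaker then does with it is left to downstream components.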

In practice, applications like these are gradually becoming part of our daily lives:

  • Voice assistants such as Siri and Cortana

  • Self-service ordering, ticket booking and reservations

  • Smart homes (voice control of lights, TVs, curtains, etc.)

2. Chinese vs. English in NLP

Characters and Words

Both Chinese and English have the notion of a character, and in both languages characters combine to form words. The difference lies in how words are delimited: English marks clear boundaries between words (spaces), whereas written Chinese has no such boundaries.

As a result, in some scenarios and methods Chinese word segmentation becomes one step of pre-processing, a step that English does not need. (Note: the diagram on the left does not mean that English also requires word segmentation. It is only an example to show readers who do not know Chinese roughly what word segmentation means.)

However, Chinese word segmentation is not as easy as it might seem. Different segmentations of the same sentence can give it completely different meanings, so a wrong segmentation can lead a model to misunderstand the whole sentence.

Take this sentence as an example: “看我头像牛不?”

Two possible segmentations of it are:

1)看我 头像 牛不 ?Look at my avatar, is it awesome?

2)看我头 像 牛 不?Look at my head, does it look like a cow?
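To make the ambiguity concrete, here is a small sketch that enumerates every way of splitting the example sentence using a dictionary. The vocabulary `TOY_VOCAB` is an assumption, chosen by hand so that both readings above are possible; real segmenters use much larger dictionaries and statistical or neural scoring to choose among candidates.

```python
from typing import List, Set


def all_segmentations(text: str, vocab: Set[str]) -> List[List[str]]:
    """Return every split of `text` into consecutive words found in `vocab`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocab:
            # Keep this word and segment the remainder recursively.
            for rest in all_segmentations(text[end:], vocab):
                results.append([head] + rest)
    return results


# Toy vocabulary (an assumption), just large enough for both readings.
TOY_VOCAB = {"看我", "头像", "牛不", "看我头", "像", "牛", "不"}

segmentations = all_segmentations("看我头像牛不", TOY_VOCAB)
for words in segmentations:
    print(" ".join(words))
```

Even with seven vocabulary entries the sentence has several valid splits, which is why a segmenter needs some way to rank candidates rather than just look words up.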

General Steps for Chinese Natural Language Processing

Because of these characteristics of Chinese, methods for processing Chinese text can roughly be divided into the following three categories:

  • Chinese text → pre-processing step (e.g. Chinese word segmentation) → word vectors/embeddings → input to the model

  • Chinese text → character vectors/embeddings → input to the model

  • Chinese text → pre-processing step (e.g. Chinese word segmentation) → pre-processing results (e.g. word embeddings) combined with character embeddings → input to the model
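The three input schemes above can be sketched as follows. This is only an illustration under stated assumptions: `toy_vector` is a made-up, deterministic stand-in for a learned embedding lookup, the segmentation in `words` is written by hand rather than produced by a real segmenter, and concatenation is just one possible way to combine word and character vectors.

```python
from typing import List

DIM = 4  # toy embedding size (assumption)


def toy_vector(token: str, dim: int = DIM) -> List[float]:
    """Deterministic stand-in for an embedding lookup (not a trained model)."""
    seed = sum(ord(ch) for ch in token)
    return [((seed * (i + 1)) % 97) / 97.0 for i in range(dim)]


words = ["看我", "头像", "牛不"]  # assumed output of a word segmenter
text = "".join(words)

# Scheme 1: one vector per segmented word.
word_inputs = [toy_vector(w) for w in words]

# Scheme 2: one vector per character, no segmentation needed.
char_inputs = [toy_vector(c) for c in text]

# Scheme 3: per character, concatenate its own vector with the vector
# of the word that contains it.
combined_inputs = [toy_vector(c) + toy_vector(w) for w in words for c in w]

print(len(word_inputs), len(char_inputs), len(combined_inputs))
print(len(combined_inputs[0]))
```

Note that scheme 3 still depends on the segmenter, so segmentation errors reach the model in schemes 1 and 3 but not in scheme 2.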

Note that none of these approaches is inherently better or worse than the others; each is simply more or less suitable for a given case. Combining more and more features will not necessarily serve your needs better, and the same method may perform differently in different settings, so judge a method by how it works in your actual situation.

Besides Chinese word segmentation, what other pre-processing could provide useful features to the model?

Chinese characters carry a great deal of cultural depth, and many of their properties can be used as features. Only the most commonly used ones are listed here, and we try not to go into too much detail about the culture of Chinese characters. If this part does not interest you, feel free to skip it. Before you do, keep one thing in mind: many properties of Chinese characters can be exploited to help a model better understand the meaning of a whole sentence. These include a character's components (radicals), its pronunciation (pinyin, the romanisation of Chinese) and its shape (glyph).

Chinese Radicals

For readers who are not familiar with Chinese characters, here is a brief explanation of radicals. A Chinese character can be made up of several parts (you can think of each character as a miniature painting composed of different elements), and a radical is one of those parts. Generally speaking, once we identify the radical of a character, we can roughly guess what its meaning is related to. This is why some studies have reported that incorporating radical features improves a model's ability to understand Chinese.

Take the following two figures as examples (note: the figures are taken from the internet):

  • In the figure on the left, the radical in the middle is associated with eyes. It is surrounded by a ring of characters that all contain this radical and whose meanings are closely related to eyes.

  • In the figure on the right, the character "木" means tree or wood. It is followed by "林" (woods) and "森" (forest): as more and more "木" appear in a character, its meaning grows from a single tree into a grove and then a forest.
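As a rough sketch of how radicals can become model features, the snippet below looks up the radical of each character in a small hand-written table. The table `RADICALS` and the helper `radical_features` are assumptions for illustration, and so is the choice of "目" as the eye-related radical, since the text above does not name it. A real system would draw on a full character dictionary.

```python
from typing import List, Optional, Tuple

# Tiny hand-written table (assumption): character -> radical.
RADICALS = {
    "眼": "目", "睛": "目", "盯": "目", "睡": "目",
    "林": "木", "森": "木", "树": "木",
}

# Rough meaning associated with each radical.
RADICAL_HINT = {"目": "eye", "木": "tree/wood"}


def radical_features(text: str) -> List[Tuple[str, Optional[str]]]:
    """Pair every character with its radical, or None if it is not in the table."""
    return [(ch, RADICALS.get(ch)) for ch in text]


for ch, radical in radical_features("眼睛森林"):
    hint = RADICAL_HINT.get(radical, "unknown")
    print(ch, radical, hint)
```

The radical can then be embedded like any other token and supplied alongside the character itself, so that characters sharing a radical also share part of their representation.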

Pinyin (Romanisation of Chinese)

In English, phonetic symbols indicate how a word is pronounced. The closest counterpart in Chinese is pinyin, the romanisation of Chinese. The pinyin of a character has two parts: the letters and the tone (flat, rising, falling-rising, falling or neutral). This article will not go into tones in detail. It is enough to know that the same letters can carry different tones, and that different tones express very different meanings.

What drives learners of Chinese crazy is that the tone and pronunciation of a single character are not fixed either: the same character can be read differently depending on the context, and each reading expresses a different meaning.
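Both points can be illustrated with a few well-known cases, written here in numbered pinyin (letters followed by a tone digit). This is a hand-written sketch: the tables and helper functions are assumptions for illustration, and using 5 for the neutral tone is one common convention among several.

```python
from typing import Tuple

TONE_NAMES = {1: "flat", 2: "rising", 3: "falling-rising",
              4: "falling", 5: "neutral"}


def split_pinyin(syllable: str) -> Tuple[str, int]:
    """Split numbered pinyin such as 'ma3' into (letters, tone number)."""
    if syllable and syllable[-1].isdigit():
        return syllable[:-1], int(syllable[-1])
    return syllable, 5  # no digit: treat as neutral tone (assumed convention)


# Same letters, different tones, different characters and meanings.
SAME_LETTERS = {"妈": "ma1", "麻": "ma2", "马": "ma3", "骂": "ma4"}

# A polyphonic character: the reading of 行 depends on the word.
WORD_READINGS = {"银行": ["yin2", "hang2"], "行走": ["xing2", "zou3"]}


def reading_in_word(char: str, word: str) -> str:
    """Look up how `char` is read inside `word` (toy table only)."""
    return WORD_READINGS[word][word.index(char)]


for char, syllable in SAME_LETTERS.items():
    letters, tone = split_pinyin(syllable)
    print(char, letters, TONE_NAMES[tone])

print(reading_in_word("行", "银行"), reading_in_word("行", "行走"))
```

Because the reading depends on the surrounding word, pinyin features are usually derived after, or jointly with, word segmentation rather than character by character.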

Glyphs

The meaning of some Chinese characters can also be guessed from their shape. The figure below shows three examples:

  • Leftmost: three real-world things (the sun, a mountain, an elephant)

  • Middle: how the written characters evolved

  • Rightmost: the Chinese characters in use today

Summary

As we can see, there is far more behind a Chinese character than its surface form. If a model can uncover these implicit features, its understanding of Chinese may improve. English NLP has similar practices, for example extracting word roots, prefixes and suffixes during pre-processing.

The table below briefly lists which features have been used in work on different Chinese natural language processing tasks. It is not an exhaustive list; it is only meant to show roughly which Chinese-specific features can be used in which tasks. We hope it gives you some inspiration for your current work.

Next

In the next articles, we will start looking at how Chinese natural language understanding has been approached in academia.


The content of this article may be used freely for non-commercial purposes only; please cite it wherever you can. Before using any of the pictures or text (except original content taken from publications or other authors' articles) in any form (including but not limited to translations or screenshots) for any commercial purpose (including but not limited to using the pictures or text in slides for online or offline training courses in any industry), please contact me via the WeChat public account or by other means to obtain permission.


