Chapter 6: Introduction to RASA NLU Components -- Featurizers

1. Preface

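When RASA processes a conversation, the overall flow is a pipeline: natural language understanding (NLU), dialogue state tracking (DST), and dialogue policy learning (DPL) run in sequence, and the result determines the next action to execute. Within this flow, the NLU component turns user input into structured output. It is mainly used for entity extraction, intent classification, response selection, and preprocessing. The NLU component is itself a pipeline that can be broken down further: Tokenize -> Featurize -> NER Extract -> Intent Classify.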

2. Featurizers

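RASA text featurizers fall into two categories: sparse featurizers, such as one-hot encoding, and dense featurizers, such as BERT. Sparse featurizers return feature vectors in which most entries are missing values (for example, zeros). Because such vectors would normally take up a lot of memory, they are stored as sparse features, which keep only the non-zero values and their positions in the vector. This makes it possible to train on larger datasets.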
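To make the storage idea concrete, here is a minimal sketch in plain Python. It is not how RASA stores features internally; the helper names to_sparse and to_dense and the 10-word vocabulary are assumptions made only for illustration.

```python
def to_sparse(dense):
    """Keep only the non-zero values together with their positions."""
    return {i: v for i, v in enumerate(dense) if v != 0}


def to_dense(sparse, size):
    """Rebuild the full vector from the sparse form."""
    return [sparse.get(i, 0) for i in range(size)]


# A one-hot vector over a hypothetical 10-word vocabulary.
one_hot_vector = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
sparse_vector = to_sparse(one_hot_vector)

print(sparse_vector)  # {3: 1}
print(to_dense(sparse_vector, len(one_hot_vector)) == one_hot_vector)  # True
```

Only one position/value pair is kept instead of ten numbers, and the saving grows with the size of the vocabulary.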

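All featurizers can return two types of features: sequence features and sentence features. Sequence features are a matrix of size (number-of-tokens x feature-dimension) that contains a feature vector for every token in the sequence. They are used to train sequence models such as entity recognition. Sentence features are a matrix of size (1 x feature-dimension) that contains a feature vector for the complete utterance, and can be used for intent classification. Sentence features can be used in any bag-of-words model. Which kind of feature is used is determined by the classifier. Note: the feature-dimension of sequence features and of sentence features does not have to be the same.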
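The two shapes can be illustrated with a small sketch. The vocabulary, the tokens, and the choice of one-hot rows summed into a bag-of-words vector are assumptions for illustration, not RASA's implementation.

```python
def one_hot(index, size):
    """Return a list of zeros with a single 1 at the given index."""
    row = [0] * size
    row[index] = 1
    return row


# Hypothetical vocabulary and tokenized message.
vocab = {"book": 0, "a": 1, "flight": 2, "to": 3, "berlin": 4}
tokens = ["book", "a", "flight"]

# Sequence features: one row per token (number-of-tokens x feature-dimension).
sequence_features = [one_hot(vocab[t], len(vocab)) for t in tokens]

# Sentence features: one row for the whole utterance (1 x feature-dimension),
# here a bag-of-words count obtained by summing the token rows.
sentence_features = [[sum(column) for column in zip(*sequence_features)]]

print(len(sequence_features), len(sequence_features[0]))  # 3 5
print(len(sentence_features), len(sentence_features[0]))  # 1 5
```

An entity extractor would consume the three token rows, while an intent classifier would consume the single sentence row.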

MitieFeaturizer

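This featurizer outputs dense vectors that can be used as features for entity extraction, intent classification, and response classification. It requires the MitieNLP language model in the pipeline. Interestingly, MitieIntentClassifier does not use this featurizer, because MitieIntentClassifier implements tokenization and feature extraction itself.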

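MitieFeaturizer outputs one feature-dimension vector per token. The sentence vector is produced by pooling over these token vectors; max pooling or mean pooling can be selected in the configuration file. The result is a 1 x feature-dimension sentence vector. With max pooling, each dimension of the sentence vector is the largest value of that dimension across all tokens. With mean pooling, it is the average of that dimension across all tokens.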
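The two pooling operations can be sketched in a few lines of plain Python. The function name pool and the three-dimensional token vectors are made up for illustration; real token vectors come from the language model and are much longer.

```python
def pool(token_vectors, pooling="mean"):
    """Collapse per-token vectors into one sentence vector, dimension by dimension."""
    columns = list(zip(*token_vectors))  # one tuple per feature dimension
    if pooling == "mean":
        return [sum(column) / len(column) for column in columns]
    if pooling == "max":
        return [max(column) for column in columns]
    raise ValueError("pooling must be 'mean' or 'max'")


# Three tokens, each with a hypothetical 3-dimensional dense vector.
token_vectors = [
    [1.0, 4.0, -2.0],
    [3.0, 0.0, 2.0],
    [2.0, 2.0, 3.0],
]

print(pool(token_vectors, "mean"))  # [2.0, 2.0, 1.0]
print(pool(token_vectors, "max"))   # [3.0, 4.0, 3.0]
```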

MitieFeaturizer is configured in the pipeline as follows:

pipeline:
- name: "MitieFeaturizer"
  # Specify what pooling operation should be used to calculate the vector of
  # the complete utterance. Available options: 'mean' and 'max'.
  "pooling": "mean"

SpacyFeaturizer

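This featurizer outputs dense vectors that can be used as features for entity extraction, intent classification, and response classification. It requires the SpacyNLP language model in the pipeline.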

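SpacyFeaturizer outputs one feature-dimension vector per token. The sentence vector is produced by pooling over these token vectors; max pooling or mean pooling can be selected in the configuration file. The result is a 1 x feature-dimension sentence vector. With max pooling, each dimension of the sentence vector is the largest value of that dimension across all tokens. With mean pooling, it is the average of that dimension across all tokens.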

SpacyFeaturizer is configured in the pipeline as follows:

pipeline:
- name: "SpacyNLP"
  model: "en_core_web_md"
- name: "SpacyTokenizer"
  "intent_tokenization_flag": False
  "intent_split_symbol": "+"
  "token_pattern": None
- name: "SpacyFeaturizer"
  # Specify what pooling operation should be used to calculate the vector of
  # the complete utterance. Available options: 'mean' and 'max'.
  "pooling": "mean"

ConveRTFeaturizer

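The ConveRT model produces a feature representation of the sentence; the output is a dense vector. Because ConveRT is only available as an English model, a Chinese chatbot cannot use ConveRTFeaturizer unless you train a ConveRT model yourself. When using this featurizer, the model_url parameter must be set to the location of the model files; otherwise training fails with an error.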

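ConveRTFeaturizer already implements the functionality of ConveRTTokenizer, so any tokenizer can be configured in the pipeline. Judging from the code, ConveRTTokenizer and LanguageModelTokenizer both inherit from WhitespaceTokenizer without any special handling, and WhitespaceTokenizer reports an error that Chinese is not supported. For Chinese, the choice of tokenizer is therefore limited to MitieTokenizer and SpacyTokenizer.
Note: to use ConveRTFeaturizer, install Rasa with pip3 install rasa[convert].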

The configuration is as follows:

pipeline:
- name: "ConveRTFeaturizer"
  # Remote URL/Local directory of model files (Required)
  "model_url": None

LanguageModelFeaturizer

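This featurizer uses a pre-trained language model to generate features. Its input is user messages, response messages, and so on, and its output is dense vectors. To use LanguageModelFeaturizer, first choose a pre-trained model according to whether the bot works in Chinese or English. The table below lists the currently supported pre-trained models. Because LanguageModelFeaturizer already implements tokenization itself, any tokenizer can be configured in the pipeline. Note, however, that dense vectors are required, so WhitespaceTokenizer and JiebaTokenizer cannot be configured. LanguageModelTokenizer in turn does not support Chinese, which leaves only MitieTokenizer and SpacyTokenizer. Which tokenizer is configured here does not affect the final result.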

The pre-trained models supported by LanguageModelFeaturizer are:

The name of the pre-trained weights is set with the model_weights parameter. For BERT models, the available values of model_weights are:

bert-base-uncased
bert-large-uncased
bert-base-cased
bert-large-cased
bert-base-multilingual-uncased
bert-base-multilingual-cased
bert-base-chinese
bert-base-german-cased
bert-large-uncased-whole-word-masking
bert-large-cased-whole-word-masking
bert-large-uncased-whole-word-masking-finetuned-squad
bert-large-cased-whole-word-masking-finetuned-squad
bert-base-cased-finetuned-mrpc
bert-base-german-dbmdz-cased
bert-base-german-dbmdz-uncased
TurkuNLP/bert-base-finnish-cased-v1
TurkuNLP/bert-base-finnish-uncased-v1
wietsedv/bert-base-dutch-cased

For RoBERTa models, the available values of model_weights are:

roberta-base
roberta-large
roberta-large-mnli
distilroberta-base
roberta-base-openai-detector
roberta-large-openai-detector

For XLNet models, the available values of model_weights are:

xlnet-base-cased
xlnet-large-cased

The configuration is as follows:

pipeline:
- name: LanguageModelFeaturizer
  # Name of the language model to use
  model_name: "bert"
  # Pre-Trained weights to be loaded
  model_weights: "rasa/LaBSE"
  # An optional path to a directory from which
  # to load pre-trained model weights.
  # If the requested model is not found in the
  # directory, it will be downloaded and
  # cached in this directory for future use.
  # The default value of `cache_dir` can be
  # set using the environment variable
  # `TRANSFORMERS_CACHE`, as per the
  # Transformers library.
  cache_dir: null

RegexFeaturizer

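RegexFeaturizer creates a message representation from regular expressions. Note that, unlike the other featurizers, its input can only be user messages. Its output is a sparse vector.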

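During training, a list of regular expressions is built from the configuration and the training data. At run time, every regular expression is matched against the user message. Each match creates a feature, and together these features form a list that is passed on to the classifier and the extractor. The classifier and the entity extractor take these features as a signal that a class label or an entity has been found, and then use that label or entity.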
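The matching step can be sketched as follows. This is a simplified, sentence-level illustration using Python's standard re module; the pattern names and expressions are hypothetical examples, and the real featurizer works on tokens and emits sparse matrices rather than a plain list.

```python
import re

# Hypothetical patterns, as they might be declared in the training data.
patterns = [
    {"name": "zipcode", "pattern": r"[0-9]{5}"},
    {"name": "greet", "pattern": r"hey[^\s]*"},
]


def regex_features(text, patterns):
    """Return one binary feature per pattern: 1 if it matches the text, else 0."""
    return [1 if re.search(p["pattern"], text) else 0 for p in patterns]


print(regex_features("hey, my zip code is 10115", patterns))  # [1, 1]
print(regex_features("where is my order", patterns))          # [0, 0]
```

The resulting vector has one position per pattern, so a downstream model can learn, for example, that a set zipcode feature correlates with a particular entity.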

Note that currently only CRFEntityExtractor and DIETClassifier support RegexFeaturizer.

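RegexFeaturizer itself is configured in the pipeline, while the regular expressions are defined in the training data. Adding the option case_sensitive: False makes the featurizer case insensitive; the default is case_sensitive: True.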

To handle languages such as Chinese that do not separate words with spaces, add the option use_word_boundaries: False; the default is use_word_boundaries: True.

pipeline:
- name: "RegexFeaturizer"
  # Text will be processed with case sensitive as default
  "case_sensitive": True
  # use match word boundaries for lookup table
  "use_word_boundaries": True
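Why word boundaries get in the way for Chinese can be seen with Python's standard re module. The helper lookup_pattern below is a hypothetical stand-in, not RASA's code; it only assumes that a lookup table is turned into an alternation that is optionally wrapped in \b word-boundary anchors.

```python
import re


def lookup_pattern(entries, use_word_boundaries=True):
    """Build one regex from lookup-table entries, optionally anchored with \\b."""
    alternatives = "|".join(re.escape(entry) for entry in entries)
    if use_word_boundaries:
        return re.compile(rf"\b(?:{alternatives})\b")
    return re.compile(rf"(?:{alternatives})")


cities = ["北京", "上海"]
text = "我想去北京旅游"

# With word boundaries there is no match: the neighbouring Chinese characters
# also count as word characters, so no boundary exists around "北京".
print(lookup_pattern(cities, True).search(text))            # None

# Without word boundaries the entry is found.
print(lookup_pattern(cities, False).search(text).group())   # 北京

# For space-separated languages the boundaries work as intended.
print(lookup_pattern(["berlin"], True).search("fly to berlin").group())  # berlin
```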

CountVectorsFeaturizer

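CountVectorsFeaturizer creates a bag-of-words representation of user messages, intents, and response messages. Its output is also a sparse matrix.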

The features are generated with sklearn's CountVectorizer. See the sklearn CountVectorizer documentation for the full set of configuration parameters. A few key parameters are explained below.

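analyzer, min_ngram, max_ngram: analyzer can be word, char, or char_wb, and determines whether word n-grams or character n-grams are used. Take the text “中国北京”, tokenized as “中国” and “北京”, as an example. With the default n-gram range of 1, analyzer word yields the features “中国” and “北京”, while analyzer char yields the four features “中”, “国”, “北”, and “京”. With min_ngram set to 1 and max_ngram set to 2, analyzer word yields “中国”, “北京”, and the bigram “中国 北京”; analyzer char yields “中”, “国”, “北”, “京”, “中国”, “国北”, and “北京”. char_wb differs from char in how word edges are handled: with char_wb, n-grams at the edges of a word are padded with a space. With min_ngram set to 1 and max_ngram set to 2, “中国北京” therefore yields “中”, “国”, “北”, “京”, “中国”, “国北”, “北京”, plus space-padded edge n-grams such as “京 ”.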
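The n-gram ranges above can be reproduced with a short plain-Python sketch. It is a simplified re-implementation for illustration only, not sklearn's code; in particular, the char_wb case is approximated by padding the single word with one space on each side.

```python
def ngrams(units, min_n, max_n, joiner):
    """Return all n-grams of the given units for n in [min_n, max_n]."""
    result = []
    for n in range(min_n, max_n + 1):
        for start in range(len(units) - n + 1):
            result.append(joiner.join(units[start:start + n]))
    return result


tokens = ["中国", "北京"]   # assumed tokenizer output for "中国北京"
text = "".join(tokens)

# analyzer: word, min_ngram: 1, max_ngram: 2
print(ngrams(tokens, 1, 2, " "))
# ['中国', '北京', '中国 北京']

# analyzer: char, min_ngram: 1, max_ngram: 2
print(ngrams(list(text), 1, 2, ""))
# ['中', '国', '北', '京', '中国', '国北', '北京']

# analyzer: char_wb (approximation): pad the word with spaces first
print(ngrams(list(" " + text + " "), 1, 2, ""))
# [' ', '中', '国', '北', '京', ' ', ' 中', '中国', '国北', '北京', '京 ']
```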

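use_lemma: True or False. This mainly matters for English, where tense and voice give the same word different surface forms; with use_lemma set to True, use and used become the same feature. Currently only SpacyTokenizer provides lemmas. Lemmatization can be turned off by setting use_lemma to False.
Note that all purely numeric tokens are encoded as the same feature vector.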

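OOV_token, OOV_words: Training data covers only a limited vocabulary, so there is no guarantee that every word encountered during prediction also appeared in the training data. To teach the algorithm how to handle such out-of-vocabulary (OOV) words, two parameters are provided: OOV_token and OOV_words.
OOV_token is a string, for example "_oov_". During prediction, every unknown word is treated as the word given by OOV_token. The training data can include an intent, for example one named out_of_scope, whose examples contain a number of OOV_token occurrences together with other words; during prediction, the algorithm will then classify messages containing unknown words as this intent.
OOV_words is a list of words; every occurrence of a listed word in the training data is treated as OOV_token. Why is this useful? Suppose a word is not important and should be treated as out-of-vocabulary. Without this option, we would have to go through the training data example by example and replace the word with the OOV token by hand. With OOV_words, we only need to add the word to the OOV_words list. OOV_words requires OOV_token to be configured first.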

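Note that CountVectorsFeaturizer builds its bag-of-words representation by counting word frequencies, so it is sensitive to the number of OOV_token occurrences in a sentence.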
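The interplay of OOV_token, OOV_words, and word counting can be sketched in plain Python. The function names, training sentences, and word lists below are hypothetical and only illustrate the behaviour described above; they are not RASA's implementation.

```python
OOV_TOKEN = "_oov_"


def replace_oov_words(tokens, oov_words):
    """Training time: turn every word listed in OOV_words into the OOV token."""
    return [OOV_TOKEN if token in oov_words else token for token in tokens]


def map_unknown(tokens, vocabulary):
    """Prediction time: turn every word missing from the vocabulary into the OOV token."""
    return [token if token in vocabulary else OOV_TOKEN for token in tokens]


def count_vector(tokens, vocabulary):
    """Bag-of-words counts, one position per vocabulary entry."""
    return [tokens.count(word) for word in vocabulary]


# Hypothetical training data; "asdf" and "qwer" are declared as OOV_words.
training = [["book", "a", "flight"], ["asdf", "qwer"]]
training = [replace_oov_words(sentence, {"asdf", "qwer"}) for sentence in training]
vocabulary = sorted({token for sentence in training for token in sentence})
print(vocabulary)  # ['_oov_', 'a', 'book', 'flight']

# A message with two words never seen in training.
message = map_unknown(["book", "a", "zzz", "yyy"], vocabulary)
print(message)                            # ['book', 'a', '_oov_', '_oov_']
print(count_vector(message, vocabulary))  # [2, 1, 1, 0]
```

The first position of the count vector grows with every unknown word, which is why the number of OOV_token occurrences in the training examples matters.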

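use_shared_vocab: set use_shared_vocab to True to share the vocabulary between user messages and intents.
additional_vocabulary_size: dense features have a fixed length, but the length of sparse features varies: if newly written training data contains a new word, the vector becomes longer. The additional_vocabulary_size parameter was introduced to keep the length of the sparse features fixed. In essence it reserves slots for new words: while there are no new words the slots hold placeholders, and when a new word appears it replaces a placeholder. For example: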

pipeline:
- name: CountVectorsFeaturizer
  additional_vocabulary_size:
    text: 1000
    response: 1000
    action_text: 1000

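In the example above, an additional vocabulary size is defined for each of text (user messages), response (bot responses used by ResponseSelector), and action_text (bot responses not used by ResponseSelector). With use_shared_vocab=True, only a value for the text attribute needs to be defined. If no attribute is configured, additional_vocabulary_size defaults to half of the current vocabulary size. To avoid running out of additional vocabulary too often during incremental training, this number is kept at a minimum of 1000. Once the component has used up its additional vocabulary slots, new vocabulary tokens are dropped; at that point the model has to be retrained from scratch, and incremental training is no longer possible.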
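The slot mechanism can be pictured with the following sketch. The class BufferedVocabulary and the function default_additional_size are hypothetical names; the default rule (half of the current vocabulary, but at least 1000) is taken from the description above, and the rest is an assumption about how such a buffer could behave, not RASA's code.

```python
def default_additional_size(vocab_size, minimum=1000):
    """Half of the current vocabulary size, but never less than the minimum."""
    return max(minimum, vocab_size // 2)


class BufferedVocabulary:
    """Vocabulary with a fixed number of spare slots reserved for new words."""

    def __init__(self, words, additional_size):
        self.index = {word: i for i, word in enumerate(words)}
        # Fixed length of the sparse feature vector, including spare slots.
        self.size = len(words) + additional_size

    def add(self, word):
        """Assign a spare slot to a new word; return False once the slots are used up."""
        if word in self.index:
            return True
        if len(self.index) >= self.size:
            return False  # the word is dropped: full retraining would be needed
        self.index[word] = len(self.index)
        return True


print(default_additional_size(500))   # 1000
print(default_additional_size(5000))  # 2500

vocab = BufferedVocabulary(["hello", "bye"], additional_size=1)
print(vocab.add("thanks"))  # True, takes the only spare slot
print(vocab.add("sorry"))   # False, no slots left
print(vocab.size)           # 3, unchanged throughout
```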

To share the vocabulary between user messages and intents, set use_shared_vocab to True. In that case a common vocabulary is built from the tokens in intents and user messages.

pipeline:
- name: "CountVectorsFeaturizer"
  # Analyzer to use, either 'word', 'char', or 'char_wb'
  "analyzer": "word"
  # Set the lower and upper boundaries for the n-grams
  "min_ngram": 1
  "max_ngram": 1
  # Set the out-of-vocabulary token
  "OOV_token": "_oov_"
  # Whether to use a shared vocab
  "use_shared_vocab": False

Parameters not covered above are described in the following table:


+-------------------+----------------------+--------------------------------------------------------------+
| Parameter         | Default Value        | Description                                                  |
+===================+======================+==============================================================+
| use_shared_vocab  | False                | If set to 'True' a common vocabulary is used for labels      |
|                   |                      | and user message.                                            |
+-------------------+----------------------+--------------------------------------------------------------+
| analyzer          | word                 | Whether the features should be made of word n-gram or        |
|                   |                      | character n-grams. Option 'char_wb' creates character        |
|                   |                      | n-grams only from text inside word boundaries;               |
|                   |                      | n-grams at the edges of words are padded with space.         |
|                   |                      | Valid values: 'word', 'char', 'char_wb'.                     |
+-------------------+----------------------+--------------------------------------------------------------+
| strip_accents     | None                 | Remove accents during the pre-processing step.               |
|                   |                      | Valid values: 'ascii', 'unicode', 'None'.                    |
+-------------------+----------------------+--------------------------------------------------------------+
| stop_words        | None                 | A list of stop words to use.                                 |
|                   |                      | Valid values: 'english' (uses an internal list of            |
|                   |                      | English stop words), a list of custom stop words, or         |
|                   |                      | 'None'.                                                      |
+-------------------+----------------------+--------------------------------------------------------------+
| min_df            | 1                    | When building the vocabulary ignore terms that have a        |
|                   |                      | document frequency strictly lower than the given threshold.  |
+-------------------+----------------------+--------------------------------------------------------------+
| max_df            | 1                    | When building the vocabulary ignore terms that have a        |
|                   |                      | document frequency strictly higher than the given threshold  |
|                   |                      | (corpus-specific stop words).                                |
+-------------------+----------------------+--------------------------------------------------------------+
| min_ngram         | 1                    | The lower boundary of the range of n-values for different    |
|                   |                      | word n-grams or char n-grams to be extracted.                |
+-------------------+----------------------+--------------------------------------------------------------+
| max_ngram         | 1                    | The upper boundary of the range of n-values for different    |
|                   |                      | word n-grams or char n-grams to be extracted.                |
+-------------------+----------------------+--------------------------------------------------------------+
| max_features      | None                 | If not 'None', build a vocabulary that only consider the top |
|                   |                      | max_features ordered by term frequency across the corpus.    |
+-------------------+----------------------+--------------------------------------------------------------+
| lowercase         | True                 | Convert all characters to lowercase before tokenizing.       |
+-------------------+----------------------+--------------------------------------------------------------+
| OOV_token         | None                 | Keyword for unseen words.                                    |
+-------------------+----------------------+--------------------------------------------------------------+
| OOV_words         | []                   | List of words to be treated as 'OOV_token' during training.  |
+-------------------+----------------------+--------------------------------------------------------------+
| alias             | CountVectorFeaturizer| Alias name of featurizer.                                    |
+-------------------+----------------------+--------------------------------------------------------------+
| use_lemma         | True                 | Use the lemma of words for featurization.                    |
+-------------------+----------------------+--------------------------------------------------------------+

LexicalSyntacticFeaturizer

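LexicalSyntacticFeaturizer creates lexical and syntactic features from the incoming message for use in subsequent entity recognition. Its input is the user message and its output is a sparse vector. The available features are: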


==============  ==========================================================================================
Feature Name    Description
==============  ==========================================================================================
BOS             Checks if the token is at the beginning of the sentence.
EOS             Checks if the token is at the end of the sentence.
low             Checks if the token is lower case.
upper           Checks if the token is upper case.
title           Checks if the token starts with an uppercase character and all remaining characters are lowercased.
digit           Checks if the token contains just digits.
prefix5         Take the first five characters of the token.
prefix2         Take the first two characters of the token.
suffix5         Take the last five characters of the token.
suffix3         Take the last three characters of the token.
suffix2         Take the last two characters of the token.
suffix1         Take the last character of the token.
pos             Take the Part-of-Speech tag of the token (``SpacyTokenizer`` required).
pos2            Take the first two characters of the Part-of-Speech tag of the token (``SpacyTokenizer`` required).
==============  ==========================================================================================

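Featurizers are the foundation for all subsequent computation: the more information they provide, the better the later stages work, so several of them are usually combined. RegexFeaturizer in particular is among the most commonly used, and LexicalSyntacticFeaturizer also contributes useful information. Note that the classifiers discussed later place requirements on their features, namely whether they must be sparse or dense vectors.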

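The featurizer moves a sliding window over the tokens of the user message, so features can be defined for the previous token, the current token, and the next token in the window. Features are specified as a [before, token, after] array. To define features for the previous, current, and next token, the configuration looks like this: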

pipeline:
- name: LexicalSyntacticFeaturizer
  "features": [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit"],
    ["low", "title", "upper"],
  ]
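A rough picture of what this configuration computes is given by the sketch below. It is a simplified illustration using Python string methods; the function names and the "offset:name" key format are assumptions, and the real featurizer turns such values into sparse one-hot features.

```python
def token_features(tokens, idx, names):
    """Compute the requested lexical features for the token at position idx."""
    token = tokens[idx]
    checks = {
        "BOS": idx == 0,
        "EOS": idx == len(tokens) - 1,
        "low": token.islower(),
        "upper": token.isupper(),
        "title": token.istitle(),
        "digit": token.isdigit(),
    }
    return {name: checks[name] for name in names}


def window_features(tokens, config):
    """Apply the [before, token, after] feature lists with a sliding window."""
    before, current, after = config
    rows = []
    for i in range(len(tokens)):
        row = {}
        for offset, names in ((-1, before), (0, current), (1, after)):
            j = i + offset
            if 0 <= j < len(tokens):  # skip positions outside the sentence
                for name, value in token_features(tokens, j, names).items():
                    row[f"{offset}:{name}"] = value
        rows.append(row)
    return rows


config = [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit"],
    ["low", "title", "upper"],
]
rows = window_features(["Fly", "to", "BERLIN"], config)

# Features of the middle token "to", including its left and right neighbours.
print(rows[1])
```

For the middle token, the row records that the previous token is title-cased ("-1:title" is True) and the next token is upper-cased ("1:upper" is True), which is the kind of context an entity extractor can exploit.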