Contents

Tuning Your NLU Model

How to Choose a Pipeline

Component Lifecycle

Doing Multi-Intent Classification

Comparing Pipelines

Choosing the Right Components

Tokenization

Featurization

Intent Classification / Response Selectors

Entity Extraction

Handling Class Imbalance

Configuring TensorFlow

Optimizing CPU Performance

Optimizing GPU Performance


Tuning Your NLU Model

Rasa Open Source will provide you with a suggested NLU config on initialization of the project, but as your project grows, it's likely that you will need to adjust your config to suit your training data.

How to Choose a Pipeline

In Rasa Open Source, incoming messages are processed by a sequence of components. These components are executed one after another in a so-called processing pipeline defined in your config.yml. Choosing an NLU pipeline allows you to customize your model and finetune it on your dataset.

To get started, you can let the Suggested Config feature choose a default pipeline for you. Just provide your bot's language in the config.yml file and leave the pipeline key empty.

language: fr  # your 2-letter language code

pipeline:
# intentionally left empty
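When you next train a model with a config like this, Rasa Open Source should select a default pipeline for your language and dump it into the config file, so you can review and adjust it later.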

If you're starting from scratch, it's often helpful to begin with pre-trained word embeddings. Pre-trained word embeddings are helpful because they already encode some kind of linguistic knowledge. For example, if you have a sentence like "I want to buy apples" in your training data, and Rasa is asked to predict the intent for "get pears", your model already knows that the words "apples" and "pears" are very similar. This is especially useful if you don't have enough training data.

If you're getting started with one of the languages supported by spaCy, we recommend the following pipeline:

language: "fr"  # your two-letter language codepipeline:- name: SpacyNLP- name: SpacyTokenizer- name: SpacyFeaturizer- name: RegexFeaturizer- name: LexicalSyntacticFeaturizer- name: CountVectorsFeaturizer- name: CountVectorsFeaturizeranalyzer: "char_wb"min_ngram: 1max_ngram: 4- name: DIETClassifierepochs: 100- name: EntitySynonymMapper- name: ResponseSelectorepochs: 100

It uses the SpacyFeaturizer, which provides pre-trained word embeddings from either GloVe or fastText in many different languages (see Language Models).

If you don't use any pre-trained word embeddings inside your pipeline, you are not bound to a specific language and can train your model to be more domain specific.

If there are no word embeddings for your language, or you have very domain specific terminology, we recommend using the following pipeline:

language: "fr"  # your two-letter language codepipeline:- name: WhitespaceTokenizer- name: RegexFeaturizer- name: LexicalSyntacticFeaturizer- name: CountVectorsFeaturizer- name: CountVectorsFeaturizeranalyzer: "char_wb"min_ngram: 1max_ngram: 4- name: DIETClassifierepochs: 100- name: EntitySynonymMapper- name: ResponseSelectorepochs: 100

This pipeline uses the CountVectorsFeaturizer to train on only the training data you provide. It can handle any language in which words are separated by spaces. If this is not the case for your language, check out alternatives to the WhitespaceTokenizer.
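For example, Rasa Open Source ships a JiebaTokenizer for Chinese, which is not whitespace-tokenized. A minimal sketch of such a pipeline (component options left at their defaults, the epochs value is illustrative):

language: "zh"

pipeline:
- name: JiebaTokenizer
- name: RegexFeaturizer
- name: CountVectorsFeaturizer
- name: DIETClassifier
  epochs: 100
- name: EntitySynonymMapper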

Component Lifecycle

Each component processes an input and/or creates an output. The order of the components is determined by the order in which they are listed in the config.yml; the output of a component can be used by any other component that comes after it in the pipeline. Some components only produce information that is used by other components in the pipeline. Other components produce output attributes that are returned after the processing has finished. For example, for the sentence "I am looking for Chinese food", the output is:

{"text": "I am looking for Chinese food","entities": [{"start": 8,"end": 15,"value": "chinese","entity": "cuisine","extractor": "DIETClassifier","confidence": 0.864}],"intent": {"confidence": 0.6485910906220309, "name": "restaurant_search"},"intent_ranking": [{"confidence": 0.6485910906220309, "name": "restaurant_search"},{"confidence": 0.1416153159565678, "name": "affirm"}]
}

This is created as a combination of the results of the different components in the following pipeline:

pipeline:
- name: WhitespaceTokenizer
- name: RegexFeaturizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: "char_wb"
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
- name: EntitySynonymMapper
- name: ResponseSelector

For example, the entities attribute here is created by the DIETClassifier component.

Every component can implement several methods from the Component base class; in a pipeline these different methods will be called in a specific order. Assuming we added the following pipeline to our config.yml:

pipeline:- name: "Component A"- name: "Component B"- name: "Last Component"

Before the first component is created using the create function, a so-called context is created (which is nothing more than a Python dict). This context is used to pass information between the components. For example, one component can calculate feature vectors for the training data, store them within the context, and another component can retrieve these feature vectors from the context and do intent classification.

Initially the context is filled with all configuration values. The arrows in the component lifecycle diagram in the Rasa documentation show the call order and visualize the path of the passed context. After all components are trained and persisted, the final context dictionary is used to persist the model's metadata.

Doing Multi-Intent Classification

You can use Rasa Open Source components to split intents into multiple labels. For example, you can predict multiple intents (thank+goodbye) or model hierarchical intent structure (feedback+positive being more similar to feedback+negative than chitchat). To do this, you need to use the DIETClassifier in your pipeline. You'll also need to define these flags in whichever tokenizer you are using:

  • intent_tokenization_flag: Set it to True, so that intent labels are tokenized.

  • intent_split_symbol: Set it to the delimiter string that splits the intent labels (e.g. + for thank+goodbye); the default is _.

Here's an example configuration:

language: "en"pipeline:
- name: "WhitespaceTokenizer"intent_tokenization_flag: Trueintent_split_symbol: "_"
- name: "CountVectorsFeaturizer"
- name: "DIETClassifier"

Comparing Pipelines

Rasa gives you the tools to compare the performance of multiple pipelines on your data directly. See Comparing NLU Pipelines for more information.
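For example, you can pass several configuration files to rasa test nlu, e.g. something like rasa test nlu --nlu data/nlu.yml --config config_1.yml config_2.yml, to train and evaluate each pipeline on the same data.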

Choosing the Right Components

There are components for entity extraction, for intent classification, response selection, pre-processing, and others. If you want to add your own component, for example to run a spell-check or to do sentiment analysis, check out Custom NLU Components.

A pipeline usually consists of three main parts:

Tokenization

For tokenization of English input, we recommend the ConveRTTokenizer. You can process other whitespace-tokenized (i.e. words are separated by spaces) languages with the WhitespaceTokenizer. If your language is not whitespace-tokenized, you should use a different tokenizer. We support a number of different tokenizers, or you can create your own custom tokenizer.

NOTE

Some components further down the pipeline may require a specific tokenizer. You can find those requirements on the individual components' requires parameter. If a required component is missing from the pipeline, an error will be thrown.

Featurization

You need to decide whether to use components that provide pre-trained word embeddings or not. If you have only a small amount of training data, we recommend starting with pre-trained word embeddings. Once you have a larger amount of data and can be sure that most relevant words will appear in your data (and therefore have a word embedding), supervised embeddings, which learn word meanings directly from your training data, can make your model more specific to your domain. If you can't find a pre-trained model for your language, you should use supervised embeddings.

Pre-trained Embeddings

The advantage of using pre-trained word embeddings in your pipeline is that if you have a training example like: “I want to buy apples”, and Rasa is asked to predict the intent for “get pears”, your model already knows that the words “apples” and “pears” are very similar. This is especially useful if you don't have enough training data. We support a few components that provide pre-trained word embeddings:

  1. MitieFeaturizer

  2. SpacyFeaturizer

  3. ConveRTFeaturizer

  4. LanguageModelFeaturizer

If your training data is in English, we recommend using the ConveRTFeaturizer. The advantage of the ConveRTFeaturizer is that it doesn't treat each word of the user message independently, but creates a contextual vector representation for the complete sentence. For example, if you have a training example, like: “Can I book a car?”, and Rasa is asked to predict the intent for “I need a ride from my place”, since the contextual vector representation for both examples are already very similar, the intent classified for both is highly likely to be the same. This is also useful if you don't have enough training data.
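For instance, a minimal English pipeline built around ConveRT might look like the following sketch (the epochs value is illustrative):

language: "en"

pipeline:
- name: ConveRTTokenizer
- name: ConveRTFeaturizer
- name: CountVectorsFeaturizer
- name: DIETClassifier
  epochs: 100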

An alternative to ConveRTFeaturizer is the LanguageModelFeaturizer, which uses pre-trained language models such as BERT, GPT-2, etc. to extract similar contextual vector representations for the complete sentence. See HFTransformersNLP for a full list of supported language models.

If your training data is not in English, you can also use a different variant of a language model which is pre-trained in the language specific to your training data. For example, there are Chinese (bert-base-chinese) and Japanese (bert-base-japanese) variants of the BERT model. A full list of different variants of these language models is available in the official documentation of the Transformers library.
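As a sketch, a pipeline using the Chinese BERT variant might look like this (assuming the model_name and model_weights options of HFTransformersNLP; the epochs value is illustrative):

language: "zh"

pipeline:
- name: HFTransformersNLP
  model_name: "bert"                  # model architecture
  model_weights: "bert-base-chinese"  # pre-trained weights from the Transformers library
- name: LanguageModelTokenizer
- name: LanguageModelFeaturizer
- name: DIETClassifier
  epochs: 100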

SpacyNLP also provides word embeddings in many different languages, so you can use it as another alternative, depending on the language of your training data.
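For example, a sketch pointing SpacyNLP at a specific spaCy model (fr_core_news_md is illustrative and must be installed separately):

language: "fr"

pipeline:
- name: SpacyNLP
  model: "fr_core_news_md"
- name: SpacyTokenizer
- name: SpacyFeaturizer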

Supervised Embeddings

If you don't use any pre-trained word embeddings inside your pipeline, you are not bound to a specific language and can train your model to be more domain specific. For example, in general English, the word “balance” is closely related to “symmetry”, but very different to the word “cash”. In a banking domain, “balance” and “cash” are closely related and you'd like your model to capture that. You should only use featurizers from the category sparse featurizers, such as CountVectorsFeaturizer, RegexFeaturizer or LexicalSyntacticFeaturizer, if you don't want to use pre-trained word embeddings.

Intent Classification / Response Selectors

Depending on your data you may want to only perform intent classification, entity recognition or response selection. Or you might want to combine multiple of those tasks. We support several components for each of the tasks. We recommend using DIETClassifier for intent classification and entity recognition and ResponseSelector for response selection.

By default all of these components consume all available features produced in the pipeline. However, sometimes it makes sense to restrict the features that are used by a specific component. For example, ResponseSelector is likely to perform better if no features from the RegexFeaturizer or LexicalSyntacticFeaturizer are used. To achieve that, set an alias for every featurizer in your pipeline via the option alias. By default the alias is set to the full featurizer class name, for example, RegexFeaturizer. You can then specify, for example, on the ResponseSelector via the option featurizers which features from which featurizers should be used. If you don't set the option featurizers, all available features will be used.

Here is an example configuration file where the DIETClassifier is using all available features and the ResponseSelector is just using the features from the ConveRTFeaturizer and the CountVectorsFeaturizer.

language: "en"pipeline:- name: ConveRTTokenizer- name: ConveRTFeaturizeralias: "convert"- name: RegexFeaturizeralias: "regex"- name: LexicalSyntacticFeaturizeralias: "lexical-syntactic"- name: CountVectorsFeaturizeralias: "cvf-word"- name: CountVectorsFeaturizeralias: "cvf-char"analyzer: "char_wb"min_ngram: 1max_ngram: 4- name: DIETClassifierepochs: 100- name: EntitySynonymMapper- name: ResponseSelectorfeaturizers: ["convert", "cvf-word"]epochs: 100

Entity Extraction

Entity extraction involves parsing user messages for required pieces of information. Rasa Open Source provides entity extractors for custom entities as well as pre-trained ones like dates and locations. Here is a summary of the available extractors and what they are best used for:

Component               | Requires          | Model                                            | Notes
DIETClassifier          | N/A               | conditional random field on top of a transformer | good for training custom entities
CRFEntityExtractor      | sklearn-crfsuite  | conditional random field                         | good for training custom entities
SpacyEntityExtractor    | spaCy             | averaged perceptron                              | provides pre-trained entities
DucklingEntityExtractor | running duckling  | context-free grammar                             | provides pre-trained entities
MitieEntityExtractor    | MITIE             | structured SVM                                   | good for training custom entities
EntitySynonymMapper     | existing entities | N/A                                              | maps known synonyms
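Extractors can also be combined. The following sketch uses DIETClassifier for custom entities and DucklingEntityExtractor for dates and numbers; the url assumes a duckling server running locally, and the dimensions list is illustrative:

pipeline:
- name: WhitespaceTokenizer
- name: CountVectorsFeaturizer
- name: DIETClassifier
  epochs: 100
- name: DucklingEntityExtractor
  url: "http://localhost:8000"
  dimensions: ["time", "number"]
- name: EntitySynonymMapper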

Handling Class Imbalance

Classification algorithms often do not perform well if there is a large class imbalance, for example if you have a lot of training data for some intents and very little for others. To mitigate this problem, you can use a balanced batching strategy. This algorithm ensures that all classes are represented in every batch, or at least in as many subsequent batches as possible, while still mimicking the fact that some classes are more frequent than others. Balanced batching is used by default. To turn it off and use a classic batching strategy, include batch_strategy: sequence in your config file.

language: "en"pipeline:
# - ... other components
- name: "DIETClassifier"batch_strategy: sequence

Configuring TensorFlow

TensorFlow allows configuring options in the runtime environment via the TF Config submodule. Rasa Open Source supports a smaller subset of these configuration options and makes appropriate calls to the tf.config submodule. This smaller subset comprises configurations that developers frequently use with Rasa Open Source. All configuration options are specified using environment variables, as shown in the subsequent sections.

Optimizing CPU Performance

NOTE

We recommend that you configure these options only if you are an advanced TensorFlow user and understand the implementation of the machine learning components in your pipeline. These options affect how operations are carried out under the hood in TensorFlow. Leaving them at their default values is fine.

Depending on the TensorFlow operations an NLU component or Core policy uses, you can leverage multi-core CPU parallelism by tuning these options.

Parallelizing One Operation

Set TF_INTRA_OP_PARALLELISM_THREADS as an environment variable to specify the maximum number of threads that can be used to parallelize the execution of one operation. For example, operations like tf.matmul() and tf.reduce_sum can be executed on multiple threads running in parallel. The default value for this variable is 0, which means TensorFlow would allocate one thread per CPU core.

Parallelizing Multiple Operations

Set TF_INTER_OP_PARALLELISM_THREADS as an environment variable to specify the maximum number of threads that can be used to parallelize the execution of multiple non-blocking operations. These would include operations that do not have a directed path between them in the TensorFlow graph. In other words, the computation of one operation does not affect the computation of the other operation. The default value for this variable is 0, which means TensorFlow would allocate one thread per CPU core.
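For example, starting training as TF_INTRA_OP_PARALLELISM_THREADS=4 TF_INTER_OP_PARALLELISM_THREADS=4 rasa train caps both thread pools at four threads (the values are illustrative).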

To understand more about how these two options differ from each other, refer to this Stack Overflow thread.

Optimizing GPU Performance

Limiting GPU Memory Growth

TensorFlow by default blocks all the available GPU memory for the running process. This can be limiting if you are running multiple TensorFlow processes and want to distribute memory across them. To prevent Rasa Open Source from blocking all of the available GPU memory, set the environment variable TF_FORCE_GPU_ALLOW_GROWTH to True.

Restricting Absolute GPU Memory Available

You may want to limit the absolute amount of GPU memory that can be used by a Rasa Open Source process.

For example, say you have two visible GPUs (GPU:0 and GPU:1) and you want to allocate 1024 MB from the first GPU and 2048 MB from the second GPU. You can do this by setting the environment variable TF_GPU_MEMORY_ALLOC to "0:1024, 1:2048".
