How to Measure and Improve Automatic FAQ Answers

When we start a new recruitment chatbot project that includes the FAQ automation feature, we use a Starter Set of questions, which is then enriched with FAQs relevant to the client before go-live. After go-live, the system is put to the test with real usage and improves over time with training.

When I say “Starter Set,” I mean something like a sourdough starter: it may not look like the end goal, but it is essential for success. In a good automated FAQ system, robust algorithms are crucial; however, the dataset also has a big influence, especially because (in the world of chatbots) it is often developed before go-live and with limited exposure to real data. What the chatbot knows and doesn’t know on Day 1 already shapes how candidates interact with it, and influences its learning over time. Fortunately, we have experience with many multi-year live projects, which allows us to find consistent trends in the FAQ topics relevant to candidates. These trends guided the development of Starter Set v1.0 in 2018, and its revision this summer.

How can the chatbot know more from Day 1?

How were we able to substantially improve our Starter Set? We used a combination of automation and manual review to grow our initial v1.0 set from ~1K questions over 47 categories to ~3–4K questions (in German and English, respectively) over 68 categories. The new v2.0 set reflects patterns in 130,000 questions asked by real candidates in multiple live projects over a year, and has been thoroughly anonymized. In this post, I’ll answer:

What is the difference between accuracy and automation, and why does it matter?
How do these measures help guide the iterative process of creating a good, clean starting dataset for an FAQ chatbot?
When we went from 47 categories to 68, we both added and removed categories; how did we decide what to change, and what was the role of automation and manual review in this process?

Crucial to this human-machine collaborative data development process is how FAQ answer performance is measured. We use two measures, automation and accuracy, which are related but distinct. When it comes to an FAQ chatbot in recruitment, not every incoming question is a frequently asked question. The key insight is that not everything should be automated: we need to measure not only the capacity to automate as much as possible, as well as possible, but also the capacity to correctly decline to answer something that is outside the FAQ dataset. We developed a measure of accuracy, nex-cv (cross-validation with negative examples), that is especially useful for this and is described in a previous publication.

However, nex-cv is not the only important measure for the Starter Set. It is an internal estimate of accuracy, and v1.0 already had relatively high accuracy compared to typical live FAQ sets, in part because live sets tend to have more categories, which makes high data quality difficult to maintain. Therefore, we also use automated response as the main measure of success: the goal was to improve the coverage of the Starter Set. In other words, how can the chatbot know more from Day 1?

What is the difference between accuracy and automation, and why does it matter? In the example performance in Figure 2 below, automation refers to the part of each pie that isn’t “No Response” / “Not FAQ.” Accuracy, meanwhile, is independent of the pie chart: it is a comparison to the ground truth, and includes the case where the ground truth label indicates that no automated response is the best response. For the Starter Set, high accuracy is a vital and difficult prerequisite, but increasing automation is what helps us expand the chatbot’s knowledge prior to go-live.

Ground truth: given a question, the recruiters’ judgment of what the right answer category is — as opposed to an automated guess.
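
To make the distinction concrete, here is a minimal sketch of both measures on a toy example. The category names and the convention that `None` stands for “No Response” / “Not FAQ” are assumptions made for illustration. The accuracy shown is plain agreement with ground-truth labels, not the nex-cv procedure itself.

```python
from typing import Optional, Sequence

# Hypothetical convention: None means "no automated response" / "Not FAQ".
Label = Optional[str]


def automation_rate(predictions: Sequence[Label]) -> float:
    """Share of questions that received any automated answer."""
    answered = sum(1 for p in predictions if p is not None)
    return answered / len(predictions)


def accuracy(predictions: Sequence[Label], ground_truth: Sequence[Label]) -> float:
    """Share of questions where the prediction matches the ground truth.

    A None prediction counts as correct when the ground truth is also None,
    i.e. the chatbot correctly declined to answer.
    """
    correct = sum(1 for p, g in zip(predictions, ground_truth) if p == g)
    return correct / len(ground_truth)


# Toy data (invented for illustration only).
predictions = ["salary", None, "application_status", "salary", None, None]
ground_truth = ["salary", None, "application_status", "benefits", None, "salary"]

print(automation_rate(predictions))         # 3 of 6 questions were answered
print(accuracy(predictions, ground_truth))  # 4 of 6 predictions match the ground truth
```

In this toy data the two numbers differ: declining to answer lowers automation, but it raises accuracy whenever declining was the right call.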

With Starter Set v2.0, we see that the top 20 topics in each language cover nearly three quarters of the questions in the test data (Fig 2). At a high level, these topics are similar to the main topics in v1.0: details of the application process, and questions about qualifications and benefits. The differences are subtle, but important.
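
A coverage figure like “the top 20 topics cover nearly three quarters of the questions” can be computed roughly as follows. This is a sketch with invented labels and a hypothetical helper name, `top_k_coverage`; it is not the tooling used for the figures in this post.

```python
from collections import Counter
from typing import Sequence


def top_k_coverage(labels: Sequence[str], k: int) -> float:
    """Share of questions that fall into the k most frequent categories."""
    counts = Counter(labels)
    covered = sum(n for _, n in counts.most_common(k))
    return covered / len(labels)


# Toy test data: one category label per question (invented for illustration).
labels = (
    ["application_process"] * 5
    + ["qualifications"] * 3
    + ["benefits"] * 1
    + ["company_language"] * 1
)

print(top_k_coverage(labels, k=2))  # the two largest categories cover 8 of 10 questions
```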

During development, we aimed to improve coherence: we split categories whose questions often end up in different categories in practice.

For example, between v1.0 and v2.0, one category that used to be about the company language (e.g., English or German) was split into two: (1) about the company language, and (2) about the language acceptable in the application. This split allows each individual category to perform better; and because both of the language categories are part of the top 20, it has a surprisingly big impact on the capacity of the dataset as a whole to handle the test set.
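
One way to surface candidates for this kind of split is to compare the category a question has in the dataset with the category it ends up in in practice. The sketch below is an assumption about how such a check could look; the function name, the `min_leak` threshold and the labels are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Sequence, Tuple


def split_candidates(
    dataset_labels: Sequence[str],
    observed_labels: Sequence[str],
    min_leak: float = 0.3,
) -> Dict[str, Tuple[str, float]]:
    """Flag categories whose questions often land in one other category.

    Returns {category: (most common other category, share of questions leaking there)}.
    """
    by_category = defaultdict(Counter)
    for expected, observed in zip(dataset_labels, observed_labels):
        by_category[expected][observed] += 1

    flagged = {}
    for category, observed_counts in by_category.items():
        total = sum(observed_counts.values())
        others = [(c, n) for c, n in observed_counts.items() if c != category]
        if not others:
            continue  # every question stayed in its own category
        other, n = max(others, key=lambda pair: pair[1])
        if n / total >= min_leak:
            flagged[category] = (other, n / total)
    return flagged


# Toy data (invented): 2 of 5 "language" questions end up elsewhere in practice.
dataset_labels = ["language"] * 5 + ["salary"] * 3
observed_labels = (
    ["language"] * 3 + ["application_documents"] * 2 + ["salary"] * 3
)

print(split_candidates(dataset_labels, observed_labels))
```

A flagged category is only a suggestion: whether to split it, and along which line, is still a manual decision.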

… fell into a different feature, like small talk. The rest cover topics that, at a high level, stay very similar to the main topics in v1.0. Although the Top 20 topics cover a large portion of the data, the overall high performance would not be possible without the other 48 topics that cover the “long tail”, shown here in the “All Others” section.

So, how do these measures help guide the iterative process of creating a good, clean starting dataset for an FAQ chatbot?

We had a total of 130K questions over 2 languages, and at the very beginning we split these into a test set and a development set. The test set was 14K questions for EN and 30K for DE, and it was not used at all until the end. In any data-driven project where you are experimenting with the structure of the data or the algorithm, it is essential to hold out a test set. All the numbers reported (in Figures 1 and 2) reflect the results on the v2.0 sets after they were completed, which allowed testing with unseen data.
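
Holding out a test set once, up front, can be as simple as the sketch below. The `test_fraction` value and the fixed seed are assumptions for illustration; the post only reports the resulting test set sizes, not how the split was drawn.

```python
import random
from typing import List, Sequence, Tuple


def split_holdout(
    questions: Sequence[str], test_fraction: float, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle once with a fixed seed and return (development_set, test_set)."""
    rng = random.Random(seed)
    shuffled = list(questions)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy data: 100 placeholder questions for one language.
questions = [f"question {i}" for i in range(100)]
development_set, test_set = split_holdout(questions, test_fraction=0.2)

print(len(development_set), len(test_set))
```

The fixed seed keeps the split reproducible, so the same questions stay out of reach for every development iteration.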

The development set, on the other hand, was used extensively and repeatedly to decide which categories stay in the Starter Set and which do not. For v1.0, this process was based on automated unsupervised clustering and manual review; for v2.0 it was significantly more data-driven, and started from v1.0. Each language went through 8–10 distinct iterations, adding and removing categories in every round.

How did we decide what to change, and what was the role of automation and manual review in this process? Each iteration went like this:

Grow the dataset by automatically accepting very confident guesses.
Automatically suggest changes in categories to (1) improve coherence: split categories whose questions often end up in different categories in practice, and (2) reduce unrealistic overestimation: remove or reduce categories that appear more in the training data than in real, incoming questions.
Manually review newly added questions, including anonymizing them, and the suggested changes. Especially in the first iterations, the suggestion list is very long, so the suggestions are prioritized by performance at the level of each category (F1 score, a common measure).
Check measures of success: accuracy (nex-cv) and auto-response. This still uses a test set from within the development set; the held-out test sample is used only at the end.
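
The automated parts of one iteration can be sketched as three small helpers. Everything here is illustrative: the function names, the 0.9 confidence threshold, the 2x overrepresentation ratio and the choice to review the lowest-F1 categories first are assumptions, not values taken from the actual process.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Guess = Tuple[str, str, float]  # (question, guessed category, confidence)


def accept_confident(guesses: Sequence[Guess], threshold: float = 0.9):
    """Step 1: auto-accept confident guesses, leave the rest for manual review."""
    accepted = [(q, c) for q, c, conf in guesses if conf >= threshold]
    needs_review = [(q, c) for q, c, conf in guesses if conf < threshold]
    return accepted, needs_review


def overrepresented(
    train_labels: Sequence[str], incoming_labels: Sequence[str], ratio: float = 2.0
) -> List[str]:
    """Step 2: categories far more common in training data than in real questions."""
    train, incoming = Counter(train_labels), Counter(incoming_labels)
    flagged = []
    for category, n in train.items():
        train_share = n / len(train_labels)
        incoming_share = incoming.get(category, 0) / len(incoming_labels)
        if train_share > ratio * incoming_share:
            flagged.append(category)
    return sorted(flagged)


def f1_per_category(predicted: Sequence[str], truth: Sequence[str]) -> Dict[str, float]:
    """Per-category F1 = 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for category in set(predicted) | set(truth):
        tp = sum(1 for p, t in zip(predicted, truth) if p == category and t == category)
        fp = sum(1 for p, t in zip(predicted, truth) if p == category and t != category)
        fn = sum(1 for p, t in zip(predicted, truth) if p != category and t == category)
        denominator = 2 * tp + fp + fn
        scores[category] = 2 * tp / denominator if denominator else 0.0
    return scores


def review_order(predicted: Sequence[str], truth: Sequence[str]) -> List[str]:
    """Step 3: review the weakest categories first (assumed ordering)."""
    scores = f1_per_category(predicted, truth)
    return sorted(scores, key=lambda category: (scores[category], category))


# Toy data (invented for illustration).
guesses = [
    ("When will I hear back?", "application_status", 0.97),
    ("Do you offer remote work?", "benefits", 0.55),
]
accepted, needs_review = accept_confident(guesses)

train_labels = ["dress_code"] * 4 + ["salary"] * 4 + ["benefits"] * 2
incoming_labels = ["dress_code"] * 1 + ["salary"] * 5 + ["benefits"] * 4
flagged = overrepresented(train_labels, incoming_labels)

order = review_order(
    predicted=["salary", "salary", "benefits", "benefits"],
    truth=["salary", "benefits", "benefits", "benefits"],
)
print(accepted, flagged, order)
```

The output of these helpers is a shortlist for the manual review step, not a replacement for it.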

This process was repeated until the dataset could no longer grow. For example, at one point the EN v2.0 dataset had nearly 10K questions, but through anonymization and review this was reduced to about 3K, with the same high accuracy and automation. The role of automation is to greatly speed up the manual review; however, manual review is crucial because:

All data must be fully anonymous if the Starter Set is to be used for new projects. This means that individual questions cannot contain any non-anonymous data, and also that topics that are highly specific to particular live projects should be excluded.
The categories and topics must make sense, and match between the different languages: although this process takes place over each language individually at first, it is important that there is coherence between the localized versions of the dataset.

Over the last several years, we have considered many aspects of the human element of chatbot learning and data quality maintenance, and the development of the Starter Set v2.0 was no exception: although guided by optimising accuracy and automation, it was ultimately a collaboration between the development and implementation team. This collaboration enabled the data from real needs of job-seekers and candidates to be contextualized and understood throughout the process.

Translated from: https://medium.com/jobpal-dev/how-to-measure-and-improve-automatic-faq-answers-90f97cbdcc3
