Evaluating Coherence in Dialogue Systems using Entailment

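"Coherence" means connectedness: the parts of a text hang together and follow a logical order.
This paper targets dialogue applications and focuses on coherence between an utterance and its context in the conversation.
1. It casts the task directly as an NLI problem over premise-hypothesis pairs.
2. The dataset is constructed by the authors; for quality control, it appears that five human annotators were brought in to check it.
3. Coherence is judged by a three-level graded classification: contradiction, neutral, and entailment. Can such a classification scheme really capture the correlation between human and machine judgments?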

The approach evaluates a single criterion from the evaluation matrix, such as consistency, completeness, or some other property.

This paper evaluates coherence in dialogue systems.

Abstract

The abstract is reasonably well written:
Background: Evaluating open-domain dialogue systems is difficult due to the diversity of possible correct answers.
Prior work: Automatic metrics such as BLEU correlate weakly with human annotations, resulting in a significant bias across different models and datasets.
This work: In this paper, we present interpretable metrics for evaluating topic coherence by making use of distributed sentence representations.
Results: Results show that our metrics can be used as a surrogate for human judgment.

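surrogate: US /ˈsʌrəɡət/, UK /ˈsʌrəɡət/
v. to deputize; (law) to substitute
n. a deputy, a substitute; (Brit.) a bishop's deputy in an ecclesiastical court; (psychology) a surrogate figure
adj. substitute, stand-in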

Introduction

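Outline of the introduction:
What is a dialogue system, and what is its key difficulty? Coherence is the most important criterion for evaluating dialogue systems.
A key challenge in building dialogue systems lies in evaluating them.
What makes a dialogue good? A good dialogue is one that can sustain coherence.
The last one or two paragraphs introduce the authors' own work: recasting the coherence (consistency) of a dialogue system as an NLI problem.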
NLI stands for natural language inference.
An NLI example consists of a premise and a hypothesis.
The core of NLI is recognizing whether the hypothesis can be inferred from the premise.

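Why NLI? The intuition behind this choice is that utterances in a human conversation tend to follow a consistent, coherent flow, with each utterance inferable from the preceding interactions.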

Model

Core idea: Given a conversation history H and a generated response r, the goal is to determine whether the premise-hypothesis pair (H, r) is entailing.
The model's prediction is framed as a classification problem.
Learn a function that, given a premise-hypothesis pair, predicts one of three categories (entailment / contradiction / neutral).
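A minimal sketch of the output side of this classifier, assuming the trained model returns one logit per class: a softmax turns the three logits into probabilities and the arg-max gives the predicted category. The `LABELS` order and the helper names `softmax` and `classify` are illustrative assumptions, not the paper's code.

```python
import math
from typing import Dict, List, Tuple

# Assumed label order; real NLI checkpoints each define their own mapping.
LABELS = ("entailment", "neutral", "contradiction")


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits: List[float]) -> Tuple[str, Dict[str, float]]:
    """Turn the three logits produced for a (premise, hypothesis) pair into a label."""
    if len(logits) != len(LABELS):
        raise ValueError("expected one logit per class")
    probs = softmax(logits)
    best = max(range(len(LABELS)), key=lambda i: probs[i])
    return LABELS[best], dict(zip(LABELS, probs))


label, probs = classify([2.0, 0.5, -1.0])
print(label)  # entailment
```

Keeping the full probability dictionary, not just the label, leaves room for a graded score rather than a hard three-way decision.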

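What counts as incoherent, and what as coherent?
A machine response is considered incoherent if it directly contradicts its previous utterances or follows illogical reasoning over the course of the conversation.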
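The contradiction half of this definition can be sketched as a check against each of the machine's earlier turns. The `nli` argument stands for any trained three-way NLI model; `toy_nli` is a deliberately crude, hypothetical stand-in (it only looks for the word "not") so the example runs without a model. The "illogical reasoning" half of the definition is not covered.

```python
from typing import Callable, Sequence

# Hypothetical interface: nli(premise, hypothesis) -> "entailment" | "neutral" | "contradiction"
NLIFn = Callable[[str, str], str]


def contradicts_own_history(machine_turns: Sequence[str], response: str, nli: NLIFn) -> bool:
    """Return True if the response contradicts any earlier machine utterance.

    Only the "direct contradiction" part of the definition is modelled;
    illogical reasoning across the whole dialogue is out of scope here.
    """
    return any(nli(prev, response) == "contradiction" for prev in machine_turns)


def toy_nli(premise: str, hypothesis: str) -> str:
    """Crude stand-in for a trained model: negation on exactly one side => contradiction."""
    p_neg = " not " in f" {premise.lower()} "
    h_neg = " not " in f" {hypothesis.lower()} "
    return "contradiction" if p_neg != h_neg else "neutral"


history = ["I have a dog.", "She is three years old."]
print(contradicts_own_history(history, "I do not have a dog.", toy_nli))  # True
print(contradicts_own_history(history, "Her name is Rex.", toy_nli))      # False
```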

Data

The data is constructed synthetically.
The result is a corpus of premise-hypothesis pairs named InferConvAI.

Model (continued)

The entailment model is used to predict a score for each generated utterance.

The dialogue history serves as the premise and the generated response r as the hypothesis.
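A sketch of this setup, under two assumptions not spelled out in these notes: the history turns are simply joined with spaces to form the premise, and the score is P(entailment) - P(contradiction) from a three-way NLI model. `nli_probs` is a hypothetical callable standing in for the model trained on InferConvAI, and `stub` returns fixed probabilities only so the example runs.

```python
from typing import Callable, Dict, Sequence

# Hypothetical interface: nli_probs(premise, hypothesis) -> {"entailment": p, "neutral": p, "contradiction": p}
NLIProbs = Callable[[str, str], Dict[str, float]]


def build_premise(history: Sequence[str], sep: str = " ") -> str:
    """Join the dialogue history H into a single premise string (joining scheme assumed)."""
    return sep.join(turn.strip() for turn in history)


def coherence_score(history: Sequence[str], response: str, nli_probs: NLIProbs) -> float:
    """Score response r against history H as P(entailment) - P(contradiction), which lies in [-1, 1]."""
    probs = nli_probs(build_premise(history), response)
    return probs["entailment"] - probs["contradiction"]


# Fixed-output stub so the example runs without a trained model.
def stub(premise: str, hypothesis: str) -> Dict[str, float]:
    return {"entailment": 0.7, "neutral": 0.2, "contradiction": 0.1}


print(build_premise(["Hi, how are you? ", " Fine, thanks."]))  # Hi, how are you? Fine, thanks.
print(round(coherence_score(["Hi"], "Hello!", stub), 2))       # 0.6
```

Subtracting the contradiction probability means a confidently contradictory response is pushed below a merely neutral one, which matches the notion of incoherence above; other mappings are equally plausible.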

These models were trained on the InferConvAI dataset. During evaluation, we use our test dialogue corpus from Reddit and OpenSubtitles, in which the majority vote of the 4-scale human rating constitutes the labels.
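The label construction described here can be sketched as a majority vote over the annotators' ratings. The 1-4 range and the tie-breaking rule (lowest tied rating wins) are assumptions for illustration; the notes do not say how ties were handled.

```python
from collections import Counter
from typing import Sequence


def majority_label(ratings: Sequence[int]) -> int:
    """Most frequent rating on an assumed 1-4 scale; ties go to the lowest tied rating."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r not in (1, 2, 3, 4) for r in ratings):
        raise ValueError("ratings must be on the 1-4 scale")
    counts = Counter(ratings)
    top = max(counts.values())
    return min(r for r, c in counts.items() if c == top)


print(majority_label([4, 4, 3, 2, 4]))  # 4
print(majority_label([1, 2, 2, 1]))     # 1
```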

Effectiveness of the metrics

(1) Three baselines
When assessing the metrics, the three baselines are textual similarity metrics (Liu et al., 2016) based on word embeddings: Average (A), Greedy (G), and Extrema (E).

These metrics treat a sentence as a bag of words and ignore word order.
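A minimal pure-Python sketch of the three baselines, following the usual definitions from Liu et al. (2016): Average compares mean word vectors, Greedy matches each word to its most similar counterpart and symmetrizes, and Extrema keeps the most extreme value per dimension before comparing. The tiny embedding table `EMB` is made up for illustration; in practice the vectors would come from pretrained word embeddings. The last line shows the order-blindness noted above: shuffling the words leaves all three scores at 1.

```python
import math
from typing import Dict, List, Sequence

Vec = List[float]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def lookup(sentence: str, emb: Dict[str, Vec]) -> List[Vec]:
    """Whitespace-tokenize and keep vectors for in-vocabulary words only."""
    return [emb[w] for w in sentence.lower().split() if w in emb]


def average_score(x: List[Vec], y: List[Vec]) -> float:
    """Embedding Average: cosine between the mean word vectors of the two sentences."""
    def mean(vs: List[Vec]) -> Vec:
        return [sum(col) / len(vs) for col in zip(*vs)]
    return cosine(mean(x), mean(y))


def greedy_score(x: List[Vec], y: List[Vec]) -> float:
    """Greedy Matching: best cosine match per word, averaged, then symmetrized."""
    def one_way(a: List[Vec], b: List[Vec]) -> float:
        return sum(max(cosine(u, v) for v in b) for u in a) / len(a)
    return (one_way(x, y) + one_way(y, x)) / 2


def extrema_score(x: List[Vec], y: List[Vec]) -> float:
    """Vector Extrema: per dimension keep the value with the largest magnitude, then cosine."""
    def extrema(vs: List[Vec]) -> Vec:
        return [max(col) if max(col) >= abs(min(col)) else min(col) for col in zip(*vs)]
    return cosine(extrema(x), extrema(y))


EMB = {  # hypothetical 3-d embeddings, for illustration only
    "the": [0.1, 0.0, 0.2],
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "sat": [0.0, 0.8, 0.3],
}
a, b = lookup("the cat sat", EMB), lookup("sat the cat", EMB)
print([round(f(a, b), 6) for f in (average_score, greedy_score, extrema_score)])  # [1.0, 1.0, 1.0]
```

All three functions assume each sentence has at least one in-vocabulary word; an empty list would raise an error.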

(2) Semantic similarity (SS), which measures the distance between the generated response and the utterances in the dialogue history.
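A sketch of SS under stated assumptions: any sentence encoder (USE or averaged BERT vectors in the paper) is passed in as `encode`, the score averages the cosine between the response and each of the last `k` history utterances, and `k` plays the role of the numeric subscript. Averaging per-utterance cosines is an assumption; the paper may aggregate the history differently. `char_count_encoder` is a toy, hypothetical encoder so the example runs.

```python
import math
from typing import Callable, List, Optional, Sequence

Encoder = Callable[[str], List[float]]  # hypothetical: maps a sentence to a vector


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def semantic_similarity(history: Sequence[str], response: str,
                        encode: Encoder, k: Optional[int] = None) -> float:
    """Mean cosine between the response and the last k history utterances (all turns if k is None)."""
    if k is not None and k < 1:
        raise ValueError("k must be a positive number of turns")
    turns = list(history) if k is None else list(history)[-k:]
    if not turns:
        raise ValueError("history is empty")
    r = encode(response)
    return sum(cosine(encode(u), r) for u in turns) / len(turns)


def char_count_encoder(sentence: str) -> List[float]:
    """Toy encoder: counts of the letters a, b, c (illustration only)."""
    return [float(sentence.count(ch)) for ch in "abc"]


history = ["abc", "xyz"]
print(round(semantic_similarity(history, "abc", char_count_encoder, k=1), 3))  # 0.0
print(round(semantic_similarity(history, "abc", char_count_encoder), 3))       # 0.5
```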

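Sentence encoders: the Universal Sentence Encoder (USE) (Cer et al., 2018), or BERT (Abert).
A numeric subscript denotes the dialogue turn.
Abert means using BERT to embed the sentence and averaging the (token) embeddings to obtain the final sentence embedding.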
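The averaging step can be sketched as mask-aware mean pooling over token vectors. The vectors would come from a BERT encoder (for example the `last_hidden_state` output of a Hugging Face `transformers` model), but here they are passed in as plain lists so the sketch runs without a model; whether special tokens such as [CLS] and [SEP] are included in the average is an assumption these notes leave open.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]], mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1 (padding positions are skipped)."""
    kept = [vec for vec, m in zip(token_vectors, mask) if m]
    if not kept:
        raise ValueError("mask selects no tokens")
    return [sum(col) / len(kept) for col in zip(*kept)]


# Three token vectors; the last one is padding and must not affect the average.
print(mean_pool([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], [1, 1, 0]))  # [2.0, 3.0]
```

Masking matters when sentences are batched: without it, padding vectors would drag every short sentence's embedding toward the same point.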

SS presumably computes its score from whole-sentence embeddings.
A/G/E compute theirs from individual word embeddings.
