DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications

Paper: https://arxiv.org/abs/1711.05073

Page: http://ai.baidu.com/broad/subordinate?dataset=dureader

Code: https://github.com/baidu/DuReader/

This paper introduces DuReader, a new large-scale, open-domain Chinese machine reading comprehension (MRC) dataset, designed to address real-world MRC.

DuReader has three advantages over previous MRC datasets:

  1. Data sources: questions and documents are based on Baidu Search and Baidu Zhidao; answers are manually generated.

  2. Question types: it provides rich annotations for more question types, especially yes-no and opinion questions, which leaves more opportunity for the research community.

  3. Scale: it contains 200K questions, 420K answers and 1M documents; it is the largest Chinese MRC dataset so far.

Introduction

| Dataset | Lang | #Que. | #Docs | Source of Que. | Source of Docs | Answer Type |
|---|---|---|---|---|---|---|
| CNN/DM | EN | 1.4M | 300K | Synthetic cloze | News | Fill in entity |
| HLF-RC | ZH | 100K | 28K | Synthetic cloze | Fairy/News | Fill in word |
| CBT | EN | 688K | 108 | Synthetic cloze | Children's books | Multi. choices |
| RACE | EN | 870K | 50K | English exam | English exam | Multi. choices |
| MCTest | EN | 2K | 500 | Crowd-sourced | Fictional stories | Multi. choices |
| NewsQA | EN | 100K | 10K | Crowd-sourced | CNN | Span of words |
| SQuAD | EN | 100K | 536 | Crowd-sourced | Wiki. | Span of words |
| SearchQA | EN | 140K | 6.9M | QA site | Web doc. | Span of words |
| TriviaQA | EN | 40K | 660K | Trivia websites | Wiki./Web doc. | Span/substring of words |
| NarrativeQA | EN | 46K | 1.5K | Crowd-sourced | Book&movie | Manual summary |
| MS-MARCO | EN | 100K | 200K | User logs | Web doc. | Manual summary |
| DuReader | ZH | 200K | 1M | User logs | Web doc./CQA | Manual summary |

Table 1: Comparison of machine reading comprehension datasets.

Pilot Study

| Examples | Fact | Opinion |
|---|---|---|
| Entity | iphone哪天发布 (On which day will the iPhone be released?) | 2017最好看的十部电影 (Top 10 movies of 2017) |
| Description | 消防车为什么是红的 (Why are fire trucks red?) | 丰田卡罗拉怎么样 (How is the Toyota Corolla?) |
| YesNo | 39.5度算高烧吗 (Is 39.5 degrees a high fever?) | 学围棋能开发智力吗 (Does learning to play Go improve intelligence?) |

Table 2: Examples of the six types of questions in Chinese.

Scaling up from the Pilot to DuReader

Data Collection and Annotation

Data Collection

DuReader is a sequence of 4-tuples $\{q, t, D, A\}$, where $q$ is a question, $t$ is a question type, $D$ is a set of relevant documents, and $A$ is an answer set produced by human annotators.
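One way to picture the 4-tuple is as a small record type. The sketch below is only an illustration: the class name `Sample` and its field names are hypothetical and do not reflect the schema of the released data files, and the document and answer strings are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sample:
    """One DuReader-style sample {q, t, D, A} (field names are hypothetical)."""
    question: str                                        # q: the question text
    question_type: str                                   # t: e.g. Entity, Description, YesNo
    documents: List[str] = field(default_factory=list)   # D: relevant documents
    answers: List[str] = field(default_factory=list)     # A: human-written answers


# Placeholder content; only the question comes from Table 2.
sample = Sample(
    question="消防车为什么是红的",
    question_type="Description",
    documents=["<text of a retrieved document>"],
    answers=["<a human-written answer>"],
)
```

A question can carry several answers, which is why `answers` is a list rather than a single string.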

Question Type Annotation

Answer Annotation

Crowd-sourcing

Quality Control

Training, Development and Test Sets

| Count | Training | Development | Test |
|---|---|---|---|
| Questions | 181K | 10K | 10K |
| Documents | 855K | 45K | 46K |
| Answers | 376K | 20K | 21K |

The training, development and test sets consist of 181K, 10K and 10K questions, 855K, 45K and 46K documents, 376K, 20K and 21K answers, respectively.

DuReader is (Relatively) Challenging

DuReader is challenging in three respects:

  1. The number of answers.

    Figure 1: Distribution of the number of answers.

  2. The edit distance.

    The difference between the human-generated answers and the source documents is large.

  3. The document length.

    In DuReader, questions tend to be short (4.8 words on average) compared to answers (69.6 words), and answers tend to be short compared to documents (396 words on average).
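To make the edit-distance challenge concrete, here is a minimal sketch of the standard Levenshtein distance computed over characters. It is an illustration only: these notes do not say how the paper aligns an answer with a document or whether the distance is normalized, so the function `edit_distance` is an assumption about the general idea, not the paper's exact measurement.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions that turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(
                prev[j] + 1,         # delete ch_a
                curr[j - 1] + 1,     # insert ch_b
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]
```

A large distance between an answer and any span of its source documents indicates that the answer was summarized or rephrased rather than copied, which is what makes span-extraction models a weaker fit.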

Experiments

Baseline Systems

Our baseline systems have two steps:

  1. Select the most relevant paragraph from each document.

  2. Apply state-of-the-art MRC models to the selected paragraphs.

Paragraph Selection

In the training stage, we select from each document the paragraph that has the largest overlap with the human-generated answer as the most relevant one.

In the testing stage, since no human-generated answer is available, we select the paragraph that has the largest overlap with the corresponding question.
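The selection rule can be sketched as follows. The overlap measure here (a character-level multiset intersection) is an assumption made for illustration, since these notes do not specify which overlap metric the baseline uses; `overlap` and `select_paragraph` are hypothetical names.

```python
from collections import Counter
from typing import List


def overlap(text: str, reference: str) -> int:
    """Number of characters shared by both strings, counted with multiplicity.
    This particular measure is an assumption, not the paper's stated metric."""
    shared = Counter(text) & Counter(reference)
    return sum(shared.values())


def select_paragraph(paragraphs: List[str], reference: str) -> str:
    """Return the paragraph with the largest overlap with the reference.
    Training: reference is the human-generated answer.
    Testing: reference is the question, since no answer is available."""
    return max(paragraphs, key=lambda p: overlap(p, reference))


paragraphs = ["the sky is blue", "fire trucks are red for visibility"]
best = select_paragraph(paragraphs, "why are fire trucks red")
```

The same function serves both stages; only the reference text changes, which also means the test-time selection is driven by a much shorter string than at training time.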

Answer Span Selection

  • Match-LSTM

    To find an answer in a paragraph, it goes through the paragraph sequentially and dynamically aggregates the matching of an attention-weighted question representation to each token of the paragraph.

    Finally, an answer pointer layer is used to find an answer span in the paragraph.

  • BiDAF

    It uses both context-to-question attention and question-to-context attention in order to highlight the important parts in both question and context.

    After that, the so-called attention flow layer is used to fuse all useful information in order to get a vector representation for each position.
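As a rough illustration of the two attention directions described for BiDAF, the sketch below works on plain Python lists of token vectors. It makes simplifying assumptions: the similarity is a plain dot product rather than the trainable similarity function of the original model, there are no learned parameters, and the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def weighted_sum(weights: List[float], vectors: List[Vector]) -> Vector:
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]


def bidirectional_attention(context: List[Vector], question: List[Vector]) -> List[Vector]:
    # sim[i][j]: match between context token i and question token j
    # (dot product is an assumption made to keep the sketch parameter-free).
    sim = [[dot(c, q) for q in question] for c in context]
    # Context-to-question: each context token attends over the question tokens.
    c2q = [weighted_sum(softmax(row), question) for row in sim]
    # Question-to-context: weight context tokens by their best question match.
    q2c = weighted_sum(softmax([max(row) for row in sim]), context)
    # Fuse per position: [c; c2q; c * c2q; c * q2c] (element-wise products).
    fused = []
    for c, a in zip(context, c2q):
        fused.append(c + a
                     + [x * y for x, y in zip(c, a)]
                     + [x * y for x, y in zip(c, q2c)])
    return fused
```

Each context position ends up with a vector four times the input dimension, which later layers consume to predict the start and end of the answer span.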

Results and Analysis

We evaluate the reading comprehension task via character-level BLEU-4 and Rouge-L.
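For intuition, character-level Rouge-L can be sketched from the longest common subsequence (LCS) between a candidate answer and a reference answer. This is a simplified single-reference version; the weighting `beta=1.2` is an assumed default rather than a value taken from these notes, and the official evaluation script may differ in details such as handling multiple references.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b, over characters."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str, beta: float = 1.2) -> float:
    """LCS-based F-measure; beta > 1 weights recall more than precision.
    The default beta is an assumption."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```

Working at the character level avoids depending on a Chinese word segmenter, so scores are comparable across systems that tokenize differently.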

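| Systems | Baidu Search BLEU-4% | Baidu Search Rouge-L% | Baidu Zhidao BLEU-4% | Baidu Zhidao Rouge-L% | All BLEU-4% | All Rouge-L% |
|---|---|---|---|---|---|---|
| Selected Paragraph | 15.8 | 22.6 | 16.5 | 38.3 | 16.4 | 30.2 |
| Match-LSTM | 23.1 | 31.2 | 42.5 | 48.0 | 31.9 | 39.2 |
| BiDAF | 23.1 | 31.1 | 42.2 | 47.5 | 31.8 | 39.0 |
| Human | 55.1 | 54.4 | 57.1 | 60.7 | 56.1 | 57.4 |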

Table 6: Performance of typical MRC systems on the DuReader.

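| Systems | Description BLEU-4% | Description Rouge-L% | Entity BLEU-4% | Entity Rouge-L% | YesNo BLEU-4% | YesNo Rouge-L% |
|---|---|---|---|---|---|---|
| Match-LSTM | 32.8 | 40.0 | 29.5 | 38.5 | 5.9 | 7.2 |
| BiDAF | 32.6 | 39.7 | 29.8 | 38.4 | 5.5 | 7.5 |
| Human | 58.1 | 58.0 | 44.6 | 52.0 | 56.2 | 57.4 |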

Table 8: Performance on various question types.

Opinion-aware Evaluation

| Model | Fact BLEU-4% | Fact Rouge-L% | Opinion BLEU-4% | Opinion Rouge-L% |
|---|---|---|---|---|
| Opinion-unaware | 6.3 | 8.3 | 5.0 | 7.1 |
| Opinion-aware | 12.0 | 13.9 | 8.0 | 8.9 |

Table 9: Performance of opinion-aware model on YesNo questions.

Discussion

Conclusion

The paper presents the DuReader dataset and provides several baseline systems.
