NLP: Building Your Own Corpus

INSIDE AI NLP365

Project #NLP365 (+1) is where I document my NLP learning journey every single day in 2020. Feel free to check out what I have been learning over the last 262 days here. At the end of this article, you can find summaries of previous papers grouped by NLP area :)

Today’s NLP paper is An Argument-Annotated Corpus of Scientific Publications. Below are the key takeaways of the research paper.

Objective and Contribution

We extended the Dr. Inventor corpus with annotations of argumentative components and relations, and conducted an annotation study. The goal is to understand the different arguments within scientific text and how they are linked together. We analysed the annotated arguments and explored the relations between the arguments that exist within scientific writing. The contributions are as follows:

Proposed a general argumentative annotation scheme for scientific text that covers different research domains
Extended the Dr. Inventor corpus with annotations of argumentative components and relations
Conducted an information-theoretic analysis of the corpus

Annotation Scheme

There are many theoretical frameworks for argumentation, and we initially used the Toulmin model for its simplicity and its relevance to AI and argument mining. The Toulmin model has 6 types of argumentative components: claim, data, warrant, backing, qualifier, and rebuttal. However, after the initial annotations, we realised that not all of these components occur in the text. We therefore simplified our annotation scheme to the following three argumentative components:

Own Claim. An argumentative statement that relates to the author’s own work
Background Claim. An argumentative statement that relates to work that is related to the author’s work
Data Component. Facts that support or contradict a claim. This includes references and facts with examples

With those argumentative components set, we introduced the following three relation types:

Supports. This relation holds between two components if the factual accuracy of one component increases with the other
Contradicts. This relation holds between two components if the factual accuracy of one component decreases with the other
Semantically Same. This relation captures claims or data components that are semantically the same. This is similar to argument coreference and/or event coreference
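
To make the scheme concrete, here is a minimal sketch of how the three component types and three relation types could be represented in code. The class names, field names, and the example sentence are my own illustration and do not come from the paper or the Dr. Inventor file format; spans are assumed to be character offsets.

```python
from dataclasses import dataclass
from enum import Enum


class ComponentType(Enum):
    OWN_CLAIM = "own_claim"
    BACKGROUND_CLAIM = "background_claim"
    DATA = "data"


class RelationType(Enum):
    SUPPORTS = "supports"
    CONTRADICTS = "contradicts"
    SEMANTICALLY_SAME = "semantically_same"


@dataclass(frozen=True)
class Component:
    start: int  # character offset of the span start (inclusive)
    end: int    # character offset of the span end (exclusive)
    kind: ComponentType


@dataclass(frozen=True)
class Relation:
    source: Component  # the component that supports / contradicts / restates
    target: Component  # the component being supported / contradicted / restated
    kind: RelationType


def make_component(text, phrase, kind):
    """Locate `phrase` in `text` and wrap its span as a Component."""
    start = text.index(phrase)
    return Component(start, start + len(phrase), kind)


# Invented example sentence, not taken from the corpus.
text = "Our method is faster than the baseline, as shown in Table 1."
claim = make_component(text, "Our method is faster than the baseline",
                       ComponentType.OWN_CLAIM)
data = make_component(text, "as shown in Table 1", ComponentType.DATA)
relation = Relation(source=data, target=claim, kind=RelationType.SUPPORTS)

print(text[data.start:data.end], "--", relation.kind.value, "->",
      text[claim.start:claim.end])
```

Storing relations as directed (source, target) pairs keeps the direction explicit, which matters later because the strict agreement measure also compares relation direction.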

Annotation Study

We performed an annotation study of the Dr. Inventor corpus and extended the dataset. The Dr. Inventor corpus has four layers of rhetorical annotations with sub-labels, as shown below:

Discourse Role
Citation Purpose
Subjective Aspects
Summarisation Relevance

The annotation was carried out by one expert and three non-expert annotators. The annotators were trained in a calibration phase in which all annotators annotated one publication together. We computed the inter-annotator agreement (IAA) for each iteration and discussed any disagreements. The figure below shows how the IAA score progresses across 5 iterations. There are two versions: strict and weak. The strict version requires entities to match exactly in span and type, and relations to match exactly in both components, direction, and relation type. The weak version requires a match in type and only an overlap in span. As expected, agreement increases with the iterations. Agreement on relations is lower, because relations are usually a lot more subjective and because agreement on relations is influenced by agreement on components.

Figure: The Inter-Annotator Agreement (IAA) [1]
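
The difference between strict and weak matching of components can be sketched in a few lines. This summary does not say which agreement statistic is reported, so the pairwise F1 below is an assumption on my part, and the `Span` type, function names, and toy annotations are hypothetical.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # component type, e.g. "own_claim"


def strict_match(a, b):
    # Exact same boundaries and the same type.
    return a == b


def weak_match(a, b):
    # Same type, and the two spans share at least one character.
    return a.label == b.label and a.start < b.end and b.start < a.end


def f1_agreement(ann_a, ann_b, match):
    """Pairwise F1 between two annotators (assumed measure, see lead-in)."""
    if not ann_a or not ann_b:
        return 0.0
    hits_a = sum(any(match(a, b) for b in ann_b) for a in ann_a)
    hits_b = sum(any(match(b, a) for a in ann_a) for b in ann_b)
    precision = hits_a / len(ann_a)
    recall = hits_b / len(ann_b)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Two annotators agree on the claim but pick slightly different data spans.
annotator_1 = [Span(0, 38, "own_claim"), Span(40, 59, "data")]
annotator_2 = [Span(0, 38, "own_claim"), Span(43, 59, "data")]

print("strict:", f1_agreement(annotator_1, annotator_2, strict_match))
print("weak:  ", f1_agreement(annotator_1, annotator_2, weak_match))
```

In this toy case the boundary disagreement on the data span halves the strict score while the weak score stays perfect, which is why the weak version sits above the strict one.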

Corpus Analysis

Argumentation annotations analysis

Table 2 shows summary statistics for each argumentative component and relation type in the Dr. Inventor corpus. There are approximately twice as many own claims as background claims, which is expected as the corpus consists of original research papers. In addition, there are only half as many data components as claims. This could be because not all claims are supported, or because claims can be supported by other claims. Naturally, there are a lot of supports relations, as authors tend to strengthen their claims by supporting them with data components or other claims. Table 3 shows the length of the argumentative components. Own and background claims are of similar length, whereas data components are half as long. This could be attributed to the fact that in computer science explanations tend to be short and, most often, authors just refer to tables and figures for support.

The argument structure of a scientific paper forms a directed acyclic graph (DAG) in which the argumentative components are the nodes and the relations are the edges. Table 4 below shows a graph analysis of these argument structures. There are 27 standalone claims and 39 unsupported claims. The max in-degree is the largest number of incoming relations at any single node; an average of 6 tells us that there are many claims with strong supporting evidence. We also ran the PageRank algorithm to identify the most important claims and listed some examples in Table 5. The results show that the majority of the highest-ranked claims are background claims, which tells us that computer graphics papers tend to put more emphasis on research gaps as the motivation for their work than on empirical results.

Left: Graph-based analysis of the argumentative structures | Right: Examples of types of claims and sentences associated with those claims [1]
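
The graph analysis can be illustrated on a toy argument graph. This is a plain-Python sketch under my own assumptions: the node names and edges are invented, "standalone" is taken to mean a claim with no relations at all, "unsupported" a claim with no incoming relation, and the PageRank is a textbook power iteration with a damping factor of 0.85 rather than whatever implementation the authors used.

```python
def in_degrees(nodes, edges):
    """Count incoming relations for every node."""
    degree = {n: 0 for n in nodes}
    for _, target in edges:
        degree[target] += 1
    return degree


def pagerank(nodes, edges, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank; dangling mass is spread evenly."""
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank


# Invented toy graph: an edge (a, b) means "a supports b".
nodes = ["bg_claim", "own_claim_1", "own_claim_2", "data_1", "data_2", "lone_claim"]
claims = {"bg_claim", "own_claim_1", "own_claim_2", "lone_claim"}
edges = [
    ("data_1", "own_claim_1"),
    ("data_2", "own_claim_1"),
    ("own_claim_1", "own_claim_2"),
    ("bg_claim", "own_claim_2"),
]

degree = in_degrees(nodes, edges)
connected = {n for edge in edges for n in edge}
standalone = sorted(c for c in claims if c not in connected)
unsupported = sorted(c for c in claims if degree[c] == 0)
rank = pagerank(nodes, edges)

print("max in-degree:", max(degree.values()))
print("standalone claims:", standalone)
print("unsupported claims:", unsupported)
print("top-ranked node:", max(rank, key=rank.get))
```

Because support flows along the edges, a claim that is backed by other well-supported claims accumulates rank, which is the intuition behind using PageRank to surface the most important claims.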

Connections to other rhetorical aspects

How well do our new argumentative components connect with the existing annotations in the Dr. Inventor corpus? In Table 6 below, we show the normalised mutual information (NMI), which measures the amount of information shared between the five annotation layers. We show the NMI scores for all pairs of annotation layers:

Argument Components (AC)
Discourse Roles (DR)
Subjective Aspects (SA)
Summarisation Relevances (SR)
Citation Contexts (CC)

There is a strong NMI score between AC and DR, which makes sense as background claims are likely to be found in text whose discourse role is background. Another high NMI score is between AC and CC. This makes sense as citations are often referenced in background claims.
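
For readers who want to see how an NMI score between two annotation layers is obtained, here is a small sketch. NMI has several normalisations and this summary does not say which one the paper uses, so dividing by the arithmetic mean of the two entropies is my assumption; the label sequences are toy values, one label per text unit, and are not corpus data.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(xs, ys):
    """Mutual information (in nats) between two aligned label sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def nmi(xs, ys):
    """MI divided by the mean entropy (assumed normalisation, see lead-in)."""
    denominator = (entropy(xs) + entropy(ys)) / 2
    if denominator == 0:
        return 0.0  # convention chosen here for constant label sequences
    return mutual_information(xs, ys) / denominator


# Toy labels for six text units in two annotation layers.
argument_components = ["background_claim", "background_claim", "own_claim",
                       "own_claim", "data", "own_claim"]
discourse_roles = ["background", "background", "approach",
                   "outcome", "outcome", "approach"]

print("NMI(AC, DR):", round(nmi(argument_components, discourse_roles), 3))
```

A score near 1 means that knowing the label in one layer almost determines the label in the other, while a score near 0 means the two layers carry largely independent information.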

Conclusion and Future Work

We created the first argument-annotated corpus of scientific papers and provided key summary statistics of the corpus along with an argumentative analysis. Potential future work could involve extending the corpus with papers from other domains and further developing models to analyse scientific writing.

Source:

[1] Lauscher, A., Glavaš, G. and Ponzetto, S.P., 2018, November. An argument-annotated corpus of scientific publications. In Proceedings of the 5th Workshop on Argument Mining (pp. 40–46).

Originally published at https://ryanong.co.uk on April 28, 2020.

Aspect Extraction / Aspect-based Sentiment Analysis

https://towardsdatascience.com/day-102-of-nlp365-nlp-papers-summary-implicit-and-explicit-aspect-extraction-in-financial-bdf00a66db41
https://towardsdatascience.com/day-103-nlp-research-papers-utilizing-bert-for-aspect-based-sentiment-analysis-via-constructing-38ab3e1630a3
https://towardsdatascience.com/day-104-of-nlp365-nlp-papers-summary-sentihood-targeted-aspect-based-sentiment-analysis-f24a2ec1ca32
https://towardsdatascience.com/day-105-of-nlp365-nlp-papers-summary-aspect-level-sentiment-classification-with-3a3539be6ae8
https://towardsdatascience.com/day-106-of-nlp365-nlp-papers-summary-an-unsupervised-neural-attention-model-for-aspect-b874d007b6d0
https://towardsdatascience.com/day-110-of-nlp365-nlp-papers-summary-double-embeddings-and-cnn-based-sequence-labelling-for-b8a958f3bddd
https://towardsdatascience.com/day-112-of-nlp365-nlp-papers-summary-a-challenge-dataset-and-effective-models-for-aspect-based-35b7a5e245b5

Summarisation

https://towardsdatascience.com/day-107-of-nlp365-nlp-papers-summary-make-lead-bias-in-your-favor-a-simple-and-effective-4c52b1a569b8
https://towardsdatascience.com/day-109-of-nlp365-nlp-papers-summary-studying-summarization-evaluation-metrics-in-the-619f5acb1b27
https://towardsdatascience.com/day-113-of-nlp365-nlp-papers-summary-on-extractive-and-abstractive-neural-document-87168b7e90bc
https://towardsdatascience.com/day-116-of-nlp365-nlp-papers-summary-data-driven-summarization-of-scientific-articles-3fba016c733b
https://towardsdatascience.com/day-117-of-nlp365-nlp-papers-summary-abstract-text-summarization-a-low-resource-challenge-61ae6cdf32f
https://towardsdatascience.com/day-118-of-nlp365-nlp-papers-summary-extractive-summarization-of-long-documents-by-combining-aea118a5eb3f

Others

https://towardsdatascience.com/day-108-of-nlp365-nlp-papers-summary-simple-bert-models-for-relation-extraction-and-semantic-98f7698184d7
https://towardsdatascience.com/day-111-of-nlp365-nlp-papers-summary-the-risk-of-racial-bias-in-hate-speech-detection-bff7f5f20ce5
https://towardsdatascience.com/day-115-of-nlp365-nlp-papers-summary-scibert-a-pretrained-language-model-for-scientific-text-185785598e33

Translated from: https://towardsdatascience.com/day-119-nlp-papers-summary-an-argument-annotated-corpus-of-scientific-publications-d7b9e2ea1097