Gendered Ambiguous Pronouns (GAP) Shared Task at the Gender Bias in NLP Workshop 2019

https://www.aclweb.org/anthology/W19-3801.pdf

Abstract

The 1st ACL workshop on Gender Bias in Natural Language Processing included a shared task on gendered ambiguous pronoun (GAP) resolution. This task was based on the coreference challenge defined in Webster et al. (2018), designed to benchmark the ability of systems to resolve pronouns in real-world contexts in a gender-fair way. 263 teams competed via a Kaggle competition, with the winning system achieving a logloss of 0.13667 and near gender parity. We review the approaches of eleven systems with accepted description papers, noting their effective use of BERT (Devlin et al., 2019), both via fine-tuning and for feature extraction, as well as ensembling.

1 Introduction

Gender bias is one of the types of social bias (alongside, e.g., race and politics) that is alarming the Natural Language Processing (NLP) community. An illustration of the problematic behaviour is the recurring occupational stereotype that homemaker is to woman as programmer is to man (Bolukbasi et al., 2016). Recent studies have aimed to detect, analyse and mitigate gender bias in different NLP tools and applications, including word embeddings (Bolukbasi et al., 2016; Gonen and Goldberg, 2019), coreference resolution (Rudinger et al., 2018; Zhao et al., 2018), sentiment analysis (Park et al., 2018; Bhaskaran and Bhallamudi, 2019) and machine translation (Vanmassenhove et al., 2018; Font and Costa-jussà, 2019). One of the main sources of gender bias is believed to be societal artefacts in the data from which our algorithms learn. To address this, many have created gender-labelled and gender-balanced datasets (Rudinger et al., 2018; Zhao et al., 2018; Vanmassenhove et al., 2018).

We present the results of a shared task evaluation conducted at the 1st Workshop on Gender Bias in Natural Language Processing at the ACL 2019 conference. The shared task is based on the gender-balanced GAP coreference dataset (Webster et al., 2018) and allows us to test the hypothesis that fair datasets would be enough to solve the gender bias challenge in NLP.

The strong results of the submitted systems tend to support this hypothesis and give the community a great starting point for mitigating bias in models. Indeed, the enthusiastic participation we saw for this shared task has yielded systems which achieve near-human accuracy while reaching near gender parity at 0.99, measured by the ratio between F1 scores on feminine and masculine examples. We are excited for future work extending this success to more languages, domains, and tasks. However, we especially encourage future work on algorithms which achieve fair outcomes from biased data, given the wealth of information in existing unbalanced datasets.

2 Task

The goal of our shared task was to encourage research in gender-fair models for NLP by providing a well-defined task that is known to be sensitive to gender bias and an evaluation procedure addressing this issue. We chose the GAP resolution task (Webster et al., 2018), which measures the ability of systems to resolve gendered pronoun reference from real-world contexts in a gender-fair way. Specifically, GAP asks systems to resolve a target personal pronoun to one of two names, or neither name. For instance, a perfect resolver would resolve that she refers to Fujisawa and not to Mari Motohashi in the Wikipedia excerpt:

In May, Fujisawa joined Mari Motohashi’s rink as the team’s skip, moving back from Karuizawa to Kitami where she had spent her junior days.

The original GAP challenge encourages fairness by balancing its datasets by the gender of the pronoun, as well as using disaggregated evaluation with separate scores for masculine and feminine examples. To simplify evaluation, we did not disaggregate evaluation for this shared task, but instead encouraged fairness by not releasing the balance of masculine to feminine examples in the final evaluation data.

The competition was run on Kaggle, a well-known platform for competitive data science and machine learning projects with an active community of participants and support.

2.1 Setting

The original GAP challenge defines four evaluation settings, depending on whether the candidate systems have to identify potential antecedents or are given a fixed choice of antecedent candidates, and whether or not they have access to the entire Wikipedia page from which the example was extracted. Our task was run in the gold-two-mention setting with page context. This means that, for our task, systems had access to the two names being evaluated at inference time, so that they were not required to do mention detection and full coreference resolution. For each example, the systems had to consider whether the target pronoun was coreferent with the first, the second, or neither of the two given antecedent candidates. A valid submission consisted of a probability estimate for each of these three cases. The systems were also given the source URL for the text snippet (a Wikipedia page), enabling unlimited access to context. This minimized the chance that systems could cheat, intentionally or inadvertently, by accessing information outside the task setting.

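To make the submission format concrete, here is a minimal sketch in Python, assuming hypothetical example IDs and illustrative column names (the exact header required by the competition platform may differ):

```python
import csv

# Hypothetical probability estimates: for each example ID, a distribution over
# the three classes (the pronoun refers to Name A, to Name B, or to neither).
predictions = {
    "test-1": (0.85, 0.10, 0.05),
    "test-2": (0.20, 0.75, 0.05),
    "test-3": (0.30, 0.30, 0.40),
}

with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ID", "A", "B", "NEITHER"])          # illustrative header
    for example_id, (p_a, p_b, p_neither) in predictions.items():
        writer.writerow([example_id, p_a, p_b, p_neither])
```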

2.2 Data

To ensure blind evaluation, we sourced 760 new annotated examples for official evaluation using the same techniques as the original GAP work (Webster et al., 2018), with three changes. To ensure the highest quality of annotations for this task, we (i) only accepted examples on which the three raters provided unanimous judgement, (ii) added heuristics to remove cases with errors in entity span labeling, and (iii) did an additional, manual round to remove assorted errors. The final set of 760 clean examples was dispersed in a larger set of 11,599 unlabeled examples to produce a set of 12,359 examples that competing systems had to rate. This augmentation was to discourage submissions based on manual labeling.

We note that many competing systems used the original GAP evaluation data as training data for this task, given that the two have the same format, base domain (Wikipedia), and task definition.

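For reference, a minimal sketch of reading GAP-style examples, assuming the tab-separated format of the original GAP data release (the column names follow that release and are not defined in this paper):

```python
import csv

def load_gap_tsv(path):
    """Read a GAP-style TSV file into a list of dicts, with columns such as
    ID, Text, Pronoun, Pronoun-offset, A, A-offset, A-coref, B, B-offset,
    B-coref, and URL (names as in the original GAP release)."""
    with open(path, encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter="\t"))

def gold_label(example):
    """Collapse the two boolean coreference columns into a 3-way label."""
    if example["A-coref"] == "TRUE":
        return "A"
    if example["B-coref"] == "TRUE":
        return "B"
    return "NEITHER"
```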

2.3 Evaluation

The original GAP work defined two official evaluation metrics, F1 score and Bias, the ratio between the F1 scores on feminine and masculine examples. Bias takes a value of 1 at gender parity; a value below 1 indicates that masculine entities are resolved more accurately than feminine ones.

In contrast, the official evaluation metric of the competition was the logloss of the submitted probability estimates:

$$\mathrm{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\,\log p_{ij}$$

where $N$ is the number of samples in the test set, $M = 3$ is the number of classes to be predicted, $y_{ij}$ is 1 if observation $i$ belongs to class $j$ according to the gold-standard annotations and 0 otherwise, and $p_{ij}$ is the probability estimated by the system that observation $i$ belongs to class $j$.

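A small sketch of both metrics, assuming gold labels encoded as integer class indices and predictions as rows of per-class probabilities (all names are illustrative):

```python
import numpy as np

def multiclass_logloss(y_true, y_prob, eps=1e-15):
    """Shared-task metric: mean negative log-likelihood of the gold class.

    y_true: (N,) integer class indices in {0, 1, 2}; y_prob: (N, 3) probabilities.
    """
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0)
    p = p / p.sum(axis=1, keepdims=True)   # guard against unnormalised rows
    gold = np.asarray(y_true, dtype=int)
    return float(-np.log(p[np.arange(len(gold)), gold]).mean())

def bias(f1_feminine, f1_masculine):
    """Original GAP Bias metric: the ratio of F1 on feminine examples to F1 on
    masculine examples; 1 indicates gender parity, below 1 favours masculine."""
    return f1_feminine / f1_masculine
```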

Table 1 tabulates results based on the original and shared task metrics. Logloss and GAP F1 both place the winners in the same order.

[Table 1: Results for the prize-winning systems under the shared-task metric (logloss) and the original GAP metrics (F1 and Bias).]

2.4 Prizes

A total prize pool of USD 25,000 was provided by Google. The pool was broken down into prizes of USD 12,000, 8,000, and 5,000 for the top three systems, respectively. This attracted submissions from 263 teams, covering a wide diversity of geographic locations and affiliations (see Section 3.1). Table 1 lists results for the three prize-winning systems: Attree (2019), Wang (2019), and Abzaliev (2019).

3 Submissions

In this section, we describe the diverse set of teams who competed in the shared task, and the systems they designed for the GAP challenge. We note effective use of BERT (Devlin et al., 2019), both via fine-tuning and for feature extraction, as well as ensembling. Despite very little modeling targeted at debiasing for gender, the submitted systems narrowed the gender gap to near parity at 0.99, while achieving remarkably strong performance.

3.1 Teams

We accepted ten system description papers from 11 of the 263 teams who competed (Ionita et al. (2019) is a combined submission from the teams placing 5th and 22nd). Table 2 characterises the teams by their number of members, whether their affiliation is to industry or an academic institution, and the geographic location of their affiliation. Details about participant gender were not collected.

[Table 2: Teams with accepted system description papers, characterised by number of members, industry or academic affiliation, and geographic location.]

Our first observation is that 7 of the top 10 teams submitted system descriptions, which gives us good insight into what approaches work well for the GAP task (see Section 3.2). Also, all of these teams publicly released their code, promoting transparency and further development.

We note the geographic diversity of teams: there is at least one team from each of Africa, Asia, Europe, and USA, and one team collaborating across regions (Europe and USA). Five teams had industry affiliations and four academic; the geographically diverse team was diverse here also, comprising both academic and industry researchers.

There is a correlation between team size and affiliation: industry submissions were all from individual contributors, while academic researchers worked in groups. This correlation is somewhat indicative of performance: individual contributors from industry won all three monetary prizes, and only one academic group featured in the top ten submissions. A possible factor was that the competition ran concurrently with other conference deadlines.

3.2 Systems

All system descriptions were from teams who used BERT (Devlin et al., 2019), a method for creating context-sensitive word embeddings by pretraining a deep self-attention neural network on the objectives of cloze-style word prediction and adjacent-sentence recognition. This is perhaps not surprising, given the recent success of BERT in modeling a wide range of NLP tasks (Tenney et al., 2019; Kwiatkowski et al., 2019) and the small amount of training data available for GAP resolution (which makes LM pretraining particularly attractive). The different models built from BERT are summarized in Table 3.

[Table 3: Summary of how the submitted systems used BERT (fine-tuning vs. feature extraction, single model vs. ensemble).]

Eight of the eleven system descriptions used BERT via fine-tuning, the technique recommended in Devlin et al. (2019). To do this, the original GAP data release was used as a tuning set to learn a classifier on top of BERT that predicts whether the target pronoun refers to Name A, Name B, or Neither. Abzaliev (2019) also made use of the available datasets for coreference resolution: OntoNotes 5.0 (Pradhan and Xue, 2009), WinoBias (Zhao et al., 2018), WinoGender (Rudinger et al., 2018), and the Definite Pronoun Resolution Dataset (Rahman and Ng, 2012). Given the multiple BERT models available, it was possible to learn multiple such classifiers; teams marked as ensemble in Table 3 fine-tuned multiple base BERT models and ensembled their predictions, while teams marked as single produced just one, from a BERT-Large variant.

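As a rough illustration of this fine-tuning recipe (not the exact architecture of any submission), the sketch below attaches a three-way classification head to a pretrained BERT model with the Hugging Face transformers library; the input encoding and hyperparameters are assumptions:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-uncased"   # submissions mostly used BERT-Large variants
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
label_map = {"A": 0, "B": 1, "NEITHER": 2}

def encode(example):
    # Crude input format: the snippet as segment 1, the mention triple as segment 2.
    mentions = f"pronoun: {example['Pronoun']} A: {example['A']} B: {example['B']}"
    return tokenizer(example["Text"], mentions, truncation=True,
                     padding="max_length", max_length=256, return_tensors="pt")

def train_step(example, gold):
    """One gradient step on a single GAP-style example (batching omitted)."""
    model.train()
    inputs = encode(example)
    labels = torch.tensor([label_map[gold]])
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

An ensemble in this setting simply averages the predicted class probabilities of several such fine-tuned models before computing the logloss.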

An alternative way to use BERT in NLP modeling is as a feature extractor. Teams using BERT in this capacity represented mention spans as input vectors to a neural structure (typically a linear structure, e.g. a feed-forward network) that learned some sort of mention compatibility, via interaction or feature crossing. To derive mention-span representations from BERT subtoken encodings, Wang (2019) found that pooling using an attention-mediated process was more effective than simple mean-pooling; most teams pooled using AllenAI's SelfAttentionSpanExtractor. An interesting finding was that certain BERT layers were more suitable for feature extraction than others (see Abzaliev (2019) and Yang et al. (2019) for an exploration).

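A simplified sketch of the feature-extraction route, using mean-pooling in place of the attention-based span extractors and assuming the token-index ranges of the pronoun and name spans have already been computed:

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
bert = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
bert.eval()                        # BERT is frozen; only the small head is trained

class SpanPairScorer(nn.Module):
    """Small feed-forward head over the concatenated pronoun/A/B span vectors."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.ffnn = nn.Sequential(
            nn.Linear(3 * hidden_size, 256), nn.ReLU(), nn.Linear(256, 3))

    def forward(self, pronoun_vec, a_vec, b_vec):
        return self.ffnn(torch.cat([pronoun_vec, a_vec, b_vec], dim=-1))

def span_vector(layer, start, end):
    """Mean-pool the vectors of tokens [start, end) from a (seq_len, dim) layer."""
    return layer[start:end].mean(dim=0)

def extract_features(text, spans, layer=-2):
    """`spans` maps "P"/"A"/"B" to token-index ranges, e.g. {"P": (10, 11), ...};
    computing those ranges from character offsets is omitted here."""
    with torch.no_grad():
        enc = tokenizer(text, return_tensors="pt", truncation=True)
        hidden = bert(**enc).hidden_states[layer][0]     # (seq_len, dim)
    return {k: span_vector(hidden, s, e) for k, (s, e) in spans.items()}
```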

The winning solution (Attree, 2019) used a novel evidence pooling technique, which used the output of off-the-shelf coreference resolvers in a way that combines aspects of ensembling and feature crossing. This perhaps explains the system’s impressive performance despite its relative simplicity. Two other systems stood out as novel in their approach to the task: Chada (2019) reformulated GAP reference resolution as a question answering task, and Lois et al. (2019) used BERT in a third way, directly applying the masked language modeling task to predicting resolutions.

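One way such a masked-LM probe might look (a sketch only; the exact formulation in Lois et al. (2019) may differ) is to replace the target pronoun with the mask token and compare the scores the language-model head assigns to each candidate name at that position:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_NAME = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
mlm = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
mlm.eval()

def candidate_scores(text, pronoun, pronoun_offset, candidates):
    """Score each candidate name as a fill for the masked pronoun position.
    Only the first wordpiece of each name is scored, a simplification."""
    masked = (text[:pronoun_offset] + tokenizer.mask_token
              + text[pronoun_offset + len(pronoun):])
    enc = tokenizer(masked, return_tensors="pt", truncation=True)
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = mlm(**enc).logits[0, mask_pos]          # (vocab_size,)
    return {name: logits[tokenizer.convert_tokens_to_ids(
                tokenizer.tokenize(name)[0])].item()
            for name in candidates}
```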

Despite the scarcity of data for this challenge, there was little use of extra resources. Only two teams made use of the URL given in the example, with Attree (2019) using it only indirectly as part of a coreference heuristic fed into evidence pooling. Two teams augmented the GAP data by using name substitutions (Liu, 2019; Lois et al., 2019) and two others automatically created extra examples of the minority label Neither (Attree, 2019; Bao and Qiao, 2019).

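A minimal sketch of name-substitution augmentation over GAP-style fields (replacement names are supplied by the caller, e.g. drawn from a list of names of the same gender; adjusting the pronoun offset is omitted for brevity):

```python
def substitute_names(example, replacement_a, replacement_b):
    """Return a copy of a GAP-style example with both candidate names replaced,
    recomputing the name offsets for the rewritten text."""
    spans = sorted([("A", replacement_a), ("B", replacement_b)],
                   key=lambda kv: int(example[f"{kv[0]}-offset"]))
    text, shift, new = example["Text"], 0, dict(example)
    for key, repl in spans:              # left to right, tracking length changes
        start = int(example[f"{key}-offset"]) + shift
        end = start + len(example[key])
        text = text[:start] + repl + text[end:]
        new[key], new[f"{key}-offset"] = repl, start
        shift += len(repl) - len(example[key])
    new["Text"] = text
    return new
```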

4 Discussion

Running the GAP shared task has taught us many valuable things about reference, gender, and BERT models. Based on these, we make recommendations for future work expanding from this shared task into different languages and domains.

GAP Given the incredibly strong performance of the submitted systems, it is tempting to ask whether GAP resolution is solved. We suggest the answer is no. Firstly, the shared task only tested one of the four original GAP settings. A more challenging setting would be snippet-context, in which use of Wikipedia is not allowed, a restriction we would extend to LM pre-training. Also, GAP only targets particular types of pronoun usage, and the time is ripe for exploring others. We are particularly excited for future work in languages with different pronoun systems (esp. pro-drop languages such as Portuguese, Chinese, and Japanese) and on gender-neutral personal pronouns, e.g. English they, Spanish su or Turkish o.

Gender It is encouraging to see the submitted systems narrow the gender gap to near parity, at 0.99, particularly as no special modeling strategies were required. Indeed, Abzaliev (2019) reported that a handcrafted pronoun gender feature had no impact. Moreover, Bao and Qiao (2019) report that BERT encodings show no significant gender bias on either WEAT (Caliskan et al., 2017) or SEAT (May et al., 2019). We look forward to studies considering potential biases in BERT across more tasks and dimensions of diversity.

BERT The teams competing in the shared task made effective use of BERT in at least three distinct ways: fine-tuning, feature extraction, and masked language modeling. Many system papers noted the incredible power of the model (see, e.g., Attree (2019) for a good analysis), particularly when compared to hand-crafted features (Abzaliev, 2019). We also believe the widespread use of BERT is related to the low rate of external data usage, as it is easier for most teams to reuse an existing model than to clean and integrate new data. Besides the phenomenal modeling power of BERT, one possible reason for this observation is that the public releases of BERT are trained on the same domain as the GAP examples, Wikipedia. Future work could benchmark non-Wikipedia BERT models on the shared task examples, or collect more GAP examples from different domains.

5 Conclusion

This paper describes the insights from the shared task on GAP coreference resolution held as part of the 1st ACL Workshop on Gender Bias in Natural Language Processing. The task drew a generous prize pool from Google and saw enthusiastic participation from a diverse set of researchers. Winning systems made effective use of BERT and ensembling, achieving near-human accuracy and near gender parity despite little effort targeted at mitigating gender bias. We learned where the next research challenges in gender-fair pronoun resolution lie, as well as promising directions for testing the robustness of powerful language model pre-training methods, especially BERT.

Acknowledgements

We would like to extend very many thanks to the Kaggle team (especially Julia Elliot and Will Cukierski) and the Google Data Compute team (especially Daphne Luong and Ashwin Kakarla) who made this shared task possible. This work is supported in part by the Spanish Ministerio de Economía y Competitividad, the European Regional Development Fund and the Agencia Estatal de Investigación, through the post-doctoral senior grant Ramón y Cajal, and by the Swedish Research Council through grant 2017-930.
