LaBSE: Language-agnostic BERT Sentence Embedding

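Abstract:
Use a multilingual BERT to produce language-agnostic sentence embeddings for 109 languages:
1. A BERT pre-trained with MLM and TLM combined
2. A bidirectional dual encoder fine-tuned on a translation ranking task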

Bitext retrieval accuracy reaches 83.7% over 112 languages

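Introduction
1. Sentence embeddings taken directly from MLM outputs perform poorly
2. SentenceBERT (a dual encoder fine-tuned from monolingual BERT) achieves strong results on STS (semantic textual similarity) tasks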

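Multilingual embedding strategies:
Bilingual data: LASER, which requires large amounts of parallel data
Bilingual data + input-response prediction: m-USE, which does not match models trained for a single language pair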

Advantages of cross-lingual influence among languages:

MLM+TLM -> Pre-trained BERT
final layer [CLS] representations cosine
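
A minimal sketch of the scoring step noted above: two final-layer [CLS] vectors are compared by cosine similarity. The three vectors are made-up toy values standing in for encoder outputs, not real model activations, and the helper names are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def cosine_similarity(a, b):
    """Cosine similarity, computed as the dot product of the normalized vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))

# Toy stand-ins for final-layer [CLS] vectors (assumed values).
src_cls = [0.2, 0.1, 0.7]
tgt_cls = [0.4, 0.2, 1.4]     # same direction as src_cls, different length
other_cls = [0.9, -0.3, 0.0]  # unrelated direction

sim_translation = cosine_similarity(src_cls, tgt_cls)
sim_unrelated = cosine_similarity(src_cls, other_cls)
print(sim_translation, sim_unrelated)
```

Because the vectors are normalized first, only their direction matters, which is why the output of the encoder can be stored already L2-normalized and compared with a plain dot product.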

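Results:
State of the art on bitext mining tasks: UN, BUCC
Compared with LASER: similar on high-resource languages, better on low-resource languages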

30+ languages with no training data:

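1. A pre-training + fine-tuning strategy that is state of the art on bitext mining tasks
2. A single model for 109 languages, with zero-shot transfer
3. Analysis of data quantity, data quality, pre-training, and the negative sampling strategy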

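2 Corpus
Monolingual:
CommonCrawl and Wikipedia
Cleaning: a classifier (trained with the main content of pages as positive examples and text from other areas as negative examples)
17B sentences, about 50% of the unfiltered version
Bilingual:
Bitext mining (Uszkoreit et al. (2010))
Filtered with a CDS (contrastive data selection) scoring model
Human evaluation of a subset: pairs marked GOOD or BAD
To balance low-resource languages, at most 100M pairs per language pair; 6B in total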
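
The per-language-pair cap can be illustrated with a small sketch. It assumes that pairs above the cap are dropped by uniform random sampling; the notes do not say how the cap is enforced, so that choice, the function name, and the toy data are all assumptions.

```python
import random
from collections import defaultdict

def cap_per_language_pair(pairs, max_per_pair, seed=0):
    """Group (lang_pair, src, tgt) tuples and keep at most max_per_pair per group.

    Groups above the cap are down-sampled uniformly at random (an assumption).
    """
    rng = random.Random(seed)
    by_pair = defaultdict(list)
    for lang_pair, src, tgt in pairs:
        by_pair[lang_pair].append((src, tgt))
    capped = {}
    for lang_pair, items in by_pair.items():
        if len(items) > max_per_pair:
            items = rng.sample(items, max_per_pair)
        capped[lang_pair] = items
    return capped

# Toy corpus: a high-resource pair with 5 examples, a low-resource pair with 2.
toy_pairs = [("en-fr", f"en {i}", f"fr {i}") for i in range(5)]
toy_pairs += [("en-sw", f"en {i}", f"sw {i}") for i in range(2)]

capped = cap_per_language_pair(toy_pairs, max_per_pair=3)
print({k: len(v) for k, v in capped.items()})
```

With the cap set to 3, the larger pair shrinks while the smaller one is kept whole, which is the balancing effect the cap is meant to have.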

3 模型

3.1 Bidirectional Dual Encoder with Additive Margin Softmax

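x_i, y_i is a true translation pair; the margin m is subtracted from its similarity score
Batch size N

trg->src: the loss is also computed in the reverse direction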
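
The notes above describe the loss without writing it out. A reconstruction consistent with them (margin m subtracted from the true-pair score only, a batch of N pairs, and a second term for the trg->src direction) is sketched below. The symbol \phi for the similarity function and the names \mathcal{L}, \mathcal{L}' are notation choices made here.

```latex
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N}
  \log \frac{e^{\phi(x_i, y_i) - m}}
            {e^{\phi(x_i, y_i) - m} + \sum_{n=1,\, n \neq i}^{N} e^{\phi(x_i, y_n)}}

\bar{\mathcal{L}} = \mathcal{L} + \mathcal{L}'
```

Here \mathcal{L}' is the same expression with the roles of x and y swapped, so every source sentence ranks the targets in the batch and every target ranks the sources.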

3.2 Cross-Accelerator Negative Sampling

Training acceleration

Negative sampling rate

Normal: 128

Cross-accelerator sampling
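
A rough simulation of the idea, under the assumption that each core encodes its own batch and then receives the encoded targets of every other core to use as extra negatives. The `all_gather` function is a plain-Python stand-in for a collective operation, not a real accelerator API, and the strings stand in for embeddings.

```python
def all_gather(per_core_batches):
    """Stand-in for an all-gather collective: concatenate every core's batch."""
    return [item for batch in per_core_batches for item in batch]

def candidates_for_core(core_id, per_core_targets):
    """Return the global candidate list and, for each local example,
    the index of its own positive inside that list."""
    gathered = all_gather(per_core_targets)
    offset = sum(len(batch) for batch in per_core_targets[:core_id])
    positive_idx = [offset + i for i in range(len(per_core_targets[core_id]))]
    return gathered, positive_idx

# Three simulated cores, each holding a local batch of two encoded targets.
per_core_targets = [["t0", "t1"], ["t2", "t3"], ["t4", "t5"]]

gathered, pos = candidates_for_core(1, per_core_targets)
local_negatives = len(per_core_targets[1]) - 1  # negatives without sharing
global_negatives = len(gathered) - 1            # negatives with sharing
print(gathered, pos, local_negatives, global_negatives)
```

In this toy setup each example sees 5 negatives instead of 1, without any core having to encode more sentences itself.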

3.3 Pre-training and parameter sharing

a transformer encoder
MLM+TLM

Three-stage progressive stacking algorithm:
L/4 L/2 L
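
A schematic sketch of the three stages, assuming that each deeper stage is initialized by duplicating the layers trained in the previous stage. Layers are represented by small dictionaries and `toy_train` only increments a counter; both are placeholders, not a real training loop.

```python
import copy

def stack_layers(layers):
    """Double the depth by placing a copy of the trained layers on top."""
    return layers + copy.deepcopy(layers)

def progressive_stacking(total_layers, train_stage):
    """Train at depth L/4, then L/2, then L, reusing earlier parameters."""
    layers = [{"trained_stages": 0} for _ in range(total_layers // 4)]
    layers = train_stage(layers)                # stage 1: L/4 layers
    layers = train_stage(stack_layers(layers))  # stage 2: L/2 layers
    layers = train_stage(stack_layers(layers))  # stage 3: L layers
    return layers

def toy_train(layers):
    """Placeholder for a training stage: just records that it ran."""
    for layer in layers:
        layer["trained_stages"] += 1
    return layers

encoder = progressive_stacking(total_layers=12, train_stage=toy_train)
print(len(encoder), [layer["trained_stages"] for layer in encoder])
```

For a 12-layer encoder this gives depths 3, 6, and 12, so the cheaper shallow stages provide the starting point for the full-depth model.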

Evaluation

参数
Vocabulary: wordpiece model (Sennrich et al., 2016), 500K
encoder: BERT Base model 12 layers 12 heads 768hidden size
last layer [CLS] token l2 norm as output

pre-trained BERT model
512 cores TPU V3: batch-size:8192 max-len 512
min(20% , 80) tokens masked MLM TLM
Training steps for the three stacking stages: 400K, 800K, 1.8M
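
One reading of "min(20%, 80) tokens masked" is that 20% of the tokens in a sequence are masked, capped at 80 per sequence. The helper below sketches that reading; the function name and the rounding down are assumptions.

```python
def num_masked_tokens(seq_len, mask_percent=20, max_predictions=80):
    """Number of tokens to mask: mask_percent of the sequence, capped.

    Integer arithmetic is used so the result rounds down (an assumption).
    """
    return min(seq_len * mask_percent // 100, max_predictions)

for seq_len in (64, 100, 512):
    print(seq_len, num_masked_tokens(seq_len))
```

At the maximum length of 512, 20% would be 102 tokens, so under this reading the cap of 80 is what applies to full-length sequences.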

LaBSE models
32 cores TPU V3 : batch-size:2048 max-len 64
margin=0.3
50K steps (less than 1 epoch) -> 200M bilingual pairs
x10 scaling factor
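
To make the role of the margin concrete, here is a small pure-Python sketch of a bidirectional additive margin softmax loss with in-batch negatives. It assumes the embeddings are already L2-normalized so that a dot product is a cosine similarity, and it leaves out the scaling factor listed above. Function names and the toy batch are illustrative only.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def one_direction_loss(queries, candidates, margin):
    """Mean over i of -log softmax of (sim(q_i, c_i) - margin)
    against sim(q_i, c_n) for all other candidates in the batch."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, c) for c in candidates]
        logits[i] -= margin      # the margin applies to the true pair only
        max_logit = max(logits)  # stabilizes the log-sum-exp
        log_z = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)

def bidirectional_loss(src, tgt, margin=0.3):
    """Sum of the src->trg and trg->src ranking losses."""
    return one_direction_loss(src, tgt, margin) + one_direction_loss(tgt, src, margin)

# Toy batch of two unit-length embedding pairs.
src = [[1.0, 0.0], [0.0, 1.0]]
tgt = [[1.0, 0.0], [0.0, 1.0]]

loss_no_margin = bidirectional_loss(src, tgt, margin=0.0)
loss_with_margin = bidirectional_loss(src, tgt, margin=0.3)
print(loss_no_margin, loss_with_margin)
```

Even for perfectly aligned pairs the margin raises the loss, which pushes the model to separate true translations from in-batch negatives by more than the margin.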

BUCC

United Nations

Tatoeba

Analysis

Additive Margin

Pre-training

500K, 1B (bilingual) vs 50K, 200M

Comparison to Multilingual BERT

multi_cased_L-12_H-768_A-12
Reasons for the improvement:
Larger vocabulary: 500K vs 30K
TLM improves transfer
Common Crawl (more data, though also noisier) vs Wikipedia

Importance of the Data Selection

CDS model vs none
precision 99% 80%
CDS selection is not only based on the quality but also based on a domain match with the training data

5.1 Zero-shot Transfer to Languages without Training Data

Effect of the vocabulary

Negative Sampling
Cross-accelerator negative sampling

5.2 Semantic Similarity

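vs sBERT
LaBSE leans toward judging whether two sentences are semantically equivalent, and has no advantage at distinguishing finer degrees of meaning overlap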

vs m-USE
Prediction of input-response performs well at judging semantic similarity

6 Mining Parallel Text from CommonCrawl

Use LaBSE to mine CommonCrawl
Build an index over trg
Mine with an ANN (approximate nearest neighbor) algorithm
Similarity >= 0.6
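
The mining steps above can be sketched as follows. For brevity the sketch uses an exact brute-force search in place of an ANN index, and a lookup table of made-up 2-d vectors in place of the LaBSE encoder; only the 0.6 threshold comes from the notes.

```python
import math

def normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_index(target_sentences, embed):
    """Index = list of (sentence, unit-length embedding) for the trg side."""
    return [(t, normalize(embed(t))) for t in target_sentences]

def mine_pairs(source_sentences, index, embed, threshold=0.6):
    """For each source, keep its nearest target if the similarity >= threshold."""
    mined = []
    for s in source_sentences:
        s_vec = normalize(embed(s))
        best_t, best_sim = max(
            ((t, dot(s_vec, t_vec)) for t, t_vec in index),
            key=lambda pair: pair[1],
        )
        if best_sim >= threshold:
            mined.append((s, best_t, best_sim))
    return mined

# Toy "encoder": a lookup table of assumed vectors, not real LaBSE output.
toy_vectors = {
    "hello": [1.0, 0.1], "hallo": [0.9, 0.2],
    "katze": [0.0, 1.0], "tree": [-1.0, 0.2],
}
embed = toy_vectors.__getitem__

index = build_index(["hallo", "katze"], embed)
mined = mine_pairs(["hello", "tree"], index, embed)
print(mined)
```

At CommonCrawl scale the linear scan inside `mine_pairs` is what an ANN index would replace; the thresholding step stays the same.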

Conclusion