Training XLM with MLM and TLM

Official XLM source code: https://github.com/facebookresearch/XLM

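I'm on Linux with a GTX 1080 Ti, CUDA 9.2, PyTorch 1.4.0 and apex 0.1, and fp16 does not work in this setup. Because of the PyTorch version, some functions and types have changed, so training prints a flood of warnings (they do not affect training). For example, torch.uint8 should be changed to torch.bool; PyTorch 1.1.0 and earlier does not emit this warning. Setting up the environment had already taken most of a day, so I made do with 1.4.0.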

The officially tested PyTorch versions are 0.4 and 1.0:

PyTorch (currently tested on version 0.4 and 1.0)

Data processing

MLM only needs monolingual data; TLM needs parallel (bilingual) data.

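To be clear, the official pipeline is mainly a reference for which processing steps are needed. The concrete tools are up to you, as long as the final data ends up in the same format.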

This post is my own record of the workflow; it does not differ from the official documentation.

Step 1: Data preparation

Download the data, tokenize it, and split it into training, validation and test sets.

tools

This directory mainly contains the tokenization script tokenize.sh. It uses the KyTea tokenizer for Japanese, the Stanford segmenter for Chinese, the PyThaiNLP tokenizer for Thai, and the Moses tokenizer for all other languages.

It also contains lowercase_and_remove_accent.py, which, as its name suggests, lowercases the text and strips accents (diacritical marks).

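Download the monolingual Wikipedia data, extract the text, run tokenize.sh and then lowercase_and_remove_accent.py on it, and randomly split it into training, validation and test sets (lg is the language code):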

./get-data-wiki.sh lg

Download the parallel data; the pipeline is largely the same as for monolingual data:

./get-data-para.sh src-tgt &

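In other words, the first job is to obtain tokenized monolingual data or parallel corpora, already split into training, validation and test sets. The details (such as the tokenizer) need not match the official ones. I used data I had prepared myself.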
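If you bring your own tokenized corpus, the split can be done with a few coreutils commands. The following is only a minimal sketch: split_corpus, the file names and the split sizes are my own illustrative assumptions, not what the official scripts do.

```shell
# Sketch: shuffle a tokenized corpus and cut it into valid/test/train.
# split_corpus, the paths and the split sizes are illustrative assumptions.
split_corpus() {
  local input=$1 outdir=$2 lg=$3 n_valid=$4 n_test=$5
  local held_out=$((n_valid + n_test))
  local shuffled
  shuffled=$(mktemp)
  mkdir -p "$outdir"
  shuf "$input" > "$shuffled"   # random order, every line exactly once
  # first n_valid lines -> valid, next n_test lines -> test, the rest -> train
  head -n "$n_valid" "$shuffled" > "$outdir/valid.$lg"
  head -n "$held_out" "$shuffled" | tail -n "$n_test" > "$outdir/test.$lg"
  tail -n +"$((held_out + 1))" "$shuffled" > "$outdir/train.$lg"
  rm -f "$shuffled"
}

# Example (hypothetical paths): 5000 validation and 5000 test sentences.
# split_corpus data/mono/all.tok.zh data/mono zh 5000 5000
```

For parallel data the two sides must be shuffled together (for instance by pasting them into one tab-separated file first), otherwise the sentence alignment is lost.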

Step 2: BPE

The official pipeline uses fastBPE; I used subword-nmt.

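Sample ten million sentences from each language's training set (with replacement, hence shuf -r in the commands below) and concatenate them into bpe.train, the file used to learn BPE.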

# build the training set for BPE tokenization (50k codes)
OUTPATH=data/processed/XLM_en_zh/50k
mkdir -p $OUTPATH
shuf -r -n 10000000 data/wiki/train.en >> $OUTPATH/bpe.train
shuf -r -n 10000000 data/wiki/train.zh >> $OUTPATH/bpe.train
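
Two details of these commands are easy to miss: -r samples with replacement, so a corpus with fewer lines than requested is simply oversampled, and >> appends, so running the commands twice doubles bpe.train. A small wrapper makes both explicit. This is a sketch; build_bpe_train is a hypothetical helper of mine, not part of XLM.

```shell
# Sketch: build a BPE training file from several monolingual training sets.
# build_bpe_train is a hypothetical helper; n is the number of lines drawn per file.
build_bpe_train() {
  local out=$1 n=$2
  shift 2
  : > "$out"                         # truncate first, so re-runs do not keep appending
  local f
  for f in "$@"; do
    shuf -r -n "$n" "$f" >> "$out"   # -r: sample with replacement, always n lines
  done
}

# Example, matching the commands above:
# build_bpe_train $OUTPATH/bpe.train 10000000 data/wiki/train.en data/wiki/train.zh
```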

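Learn the BPE codes. Note that this snippet uses a different OUTPATH and 30k codes rather than the 50k above; whichever values you choose, keep them consistent across all steps.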

OUTPATH=data/processed/XLM_en/30k  # path where processed files will be stored
FASTBPE=tools/fastBPE/fast  # path to the fastBPE tool

# learn bpe codes on the training set (or only use a subset of it)
$FASTBPE learnbpe 30000 data/wiki/txt/en.train > $OUTPATH/codes
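
Since I used subword-nmt rather than fastBPE, here is roughly what the equivalent steps look like with its command-line tool. Treat it as a sketch: it assumes subword-nmt was installed with pip, bpe_with_subword_nmt and the output file names are made up for illustration, and I have not checked that the resulting vocab file is accepted by preprocess.py in every XLM version.

```shell
# Sketch: subword-nmt counterpart of fastBPE's learnbpe / applybpe / getvocab.
# Assumes `pip install subword-nmt`; the function and file names are hypothetical.
bpe_with_subword_nmt() {
  local train=$1 outpath=$2 n_codes=$3
  mkdir -p "$outpath"
  # learn n_codes merge operations on the training text
  subword-nmt learn-bpe -s "$n_codes" < "$train" > "$outpath/codes"
  # segment the training text with the learned codes
  subword-nmt apply-bpe -c "$outpath/codes" < "$train" > "$outpath/train.bpe"
  # count the resulting subword units (one "token count" pair per line)
  subword-nmt get-vocab < "$outpath/train.bpe" > "$outpath/vocab"
}

# Example (hypothetical paths):
# bpe_with_subword_nmt data/processed/bpe.train data/processed/XLM_mn_zh 50000
```

The validation and test files then go through the same apply-bpe call with the same codes file.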

Apply BPE to every file, then binarize the data files to save memory at load time.

Monolingual data

$FASTBPE applybpe $OUTPATH/train.en data/wiki/txt/en.train $OUTPATH/codes &
$FASTBPE applybpe $OUTPATH/valid.en data/wiki/txt/en.valid $OUTPATH/codes &
$FASTBPE applybpe $OUTPATH/test.en data/wiki/txt/en.test $OUTPATH/codes &
wait  # let the background applybpe jobs finish before reading their output

# get the post-BPE vocabulary (in the foreground, preprocess.py needs the complete file)
cat $OUTPATH/train.en | $FASTBPE getvocab - > $OUTPATH/vocab

# This will create three files: $OUTPATH/{train,valid,test}.en.pth
# After that we're all set
python preprocess.py $OUTPATH/vocab $OUTPATH/train.en &
python preprocess.py $OUTPATH/vocab $OUTPATH/valid.en &
python preprocess.py $OUTPATH/vocab $OUTPATH/test.en &

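Bilingual data. Below is the official example, but in my testing it has a small problem: the output path $OUTPATH/$pair.$lg.$split should be changed to $OUTPATH/$split.$pair.$lg to match the file names expected at training time.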

pair=en-zh

for lg in $(echo $pair | sed -e 's/\-/ /g'); do
  for split in train valid test; do
    $FASTBPE applybpe $OUTPATH/$pair.$lg.$split data/wiki/para/$pair.$lg.$split $OUTPATH/codes
    python preprocess.py $OUTPATH/vocab $OUTPATH/$pair.$lg.$split
  done
done
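
With the renaming applied, the loop would look roughly like the sketch below. It only prints the commands so that the file names can be checked before anything is run; plan_para_bpe is a hypothetical helper, and the input layout data/wiki/para/$pair.$lg.$split is simply carried over from the official example.

```shell
# Sketch: print the applybpe / preprocess commands with outputs named
# $split.$pair.$lg. plan_para_bpe is a hypothetical helper; pipe its output
# to `sh` (or drop the echos) to actually run the commands.
plan_para_bpe() {
  local pair=$1 outpath=$2 fastbpe=$3
  local lg split out
  # "en-zh" -> "en" and "zh"
  for lg in "${pair%%-*}" "${pair#*-}"; do
    for split in train valid test; do
      out="$outpath/$split.$pair.$lg"
      echo "$fastbpe applybpe $out data/wiki/para/$pair.$lg.$split $outpath/codes"
      echo "python preprocess.py $outpath/vocab $out"
    done
  done
}

plan_para_bpe en-zh data/processed/XLM_en_zh/50k tools/fastBPE/fast
```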

Training the model

MLM only

python train.py

## main parameters
--exp_name xlm_en_zh                       # experiment name
--dump_path ./dumped                       # where to store the experiment

## data location / training objective
--data_path $OUTPATH                       # data location
--lgs 'en-zh'                              # considered languages
--clm_steps ''                             # CLM objective (for training GPT-2 models)
--mlm_steps 'en,zh'                        # MLM objective

## transformer parameters
--emb_dim 1024                             # embeddings / model dimension (2048 is big, reduce if only 16Gb of GPU memory)
--n_layers 12                              # number of layers
--n_heads 16                               # number of heads
--dropout 0.1                              # dropout
--attention_dropout 0.1                    # attention dropout
--gelu_activation true                     # GELU instead of ReLU

## optimization
--batch_size 32                            # sequences per batch
--bptt 256                                 # sequences length  (streams of 256 tokens)
--optimizer adam,lr=0.0001                 # optimizer (training is quite sensitive to this parameter)
--epoch_size 300000                        # number of sentences per epoch
--max_epoch 100000                         # max number of epochs (~infinite here)
--validation_metrics _valid_mlm_ppl        # validation metric (when to save the best model)
--stopping_criterion _valid_mlm_ppl,25     # stopping criterion (if criterion does not improve 25 times)
--fp16 true                                # use fp16 training

## There are other parameters that are not specified here (see [here](https://github.com/facebookresearch/XLM/blob/master/train.py#L24-L198)).

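To train with TLM as well, add the language pair to the --mlm_steps parameter (I still need to confirm whether the language pair is tied to where the files were saved during data processing):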

--mlm_steps 'en,zh,en-zh'
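
Judging from the file naming noted in the BPE step, each entry in --mlm_steps appears to select a set of binarized files under --data_path: split.lg.pth for a single language and split.pair.lg.pth for a pair. Under that assumption, a quick check before launching a long run could look like this. expected_files and check_data are hypothetical helpers, not part of XLM.

```shell
# Sketch: list the .pth files a given --mlm_steps value would need, assuming
# split.lg.pth for MLM and split.pair.lg.pth for TLM (an assumption, see above).
expected_files() {
  local outpath=$1 steps=$2
  local step split lg
  for step in $(echo "$steps" | tr ',' ' '); do
    for split in train valid test; do
      case $step in
        *-*)  # language pair: one file per side
          for lg in "${step%%-*}" "${step#*-}"; do
            echo "$outpath/$split.$step.$lg.pth"
          done ;;
        *)    # single language
          echo "$outpath/$split.$step.pth" ;;
      esac
    done
  done
}

# Print every missing file; the return status is non-zero if any is missing.
check_data() {
  local f missing=0
  for f in $(expected_files "$1" "$2"); do
    [ -f "$f" ] || { echo "missing: $f"; missing=1; }
  done
  return "$missing"
}

# Example: check_data $OUTPATH 'en,zh,en-zh'
```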

My final training command was as follows (one of the cards was in use by another program):

export NGPU=7;CUDA_VISIBLE_DEVICES=0,1,2,4,5,6,7 nohup python -m torch.distributed.launch --nproc_per_node=7 train.py \
--exp_name xlm_mn_zh \
--dump_path ./dumped \
--data_path $OUTPATH \
--lgs 'mn-zh' \
--clm_steps '' \
--mlm_steps 'mn,zh,mn-zh' \
--emb_dim 1024 \
--n_layers 12  \
--n_heads 16  \
--dropout 0.1 \
--attention_dropout 0.1 \
--gelu_activation true \
--batch_size 8 \
--bptt 256 \
--optimizer adam,lr=0.0001 \
--epoch_size 300000 \
--max_epoch 100000 \
--validation_metrics _valid_mlm_ppl \
--stopping_criterion _valid_mlm_ppl,25 \
--fp16 false &

It ran successfully. (nohup xx & puts the process in the background, and its output is redirected to the file nohup.out.)
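
When a card is skipped like this, it is easy to let the process count drift out of sync with the device list, and nohup.out gets crowded if several runs share a directory. The sketch below derives the count from the device list and writes each run to its own log; count_gpus and launch_bg are hypothetical helpers of mine.

```shell
# Sketch: derive the process count from the device list and run a command in
# the background with an explicit log file. Both helpers are hypothetical.
count_gpus() {
  # "0,1,2,4,5,6,7" -> 7
  echo "$1" | tr ',' '\n' | grep -c .
}

launch_bg() {
  local log=$1
  shift
  nohup "$@" > "$log" 2>&1 &   # stdout and stderr go to $log instead of nohup.out
  echo $!                      # PID of the background job
}

# Example (not executed here; remaining train.py arguments as in the command above):
# export CUDA_VISIBLE_DEVICES=0,1,2,4,5,6,7
# NGPU=$(count_gpus "$CUDA_VISIBLE_DEVICES")
# launch_bg train.log python -m torch.distributed.launch --nproc_per_node="$NGPU" train.py ...
```

Progress can then be followed with tail -f train.log.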