Notes on setting up the environment for the CPT deep learning model

CPT code: https://github.com/fastnlp/CPT
CPT paper: https://arxiv.org/pdf/2109.05729.pdf

Data preprocessing

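https://zhuanlan.zhihu.com/p/388830967 is a detailed walkthrough of preprocess_data.py in megatron-lm. In the JSON input, the key that matters most is text; as long as text has a value, the record is usable.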
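As a minimal sketch of such an input file, assuming the one-JSON-object-per-line layout that megatron-lm's preprocess_data.py reads (the helper names and any file name you pass in are made up for illustration):

```python
import json


def write_loose_json(texts, path):
    """Write one JSON object per line, each holding its content under "text"."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            if not text.strip():
                continue  # the "text" key must carry a value, so skip empty documents
            # ensure_ascii=False keeps Chinese characters readable in the file
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")


def read_loose_json(path):
    """Read the file back and return the list of "text" values."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["text"] for line in f if line.strip()]
```

Reading the file back with read_loose_json is a cheap way to confirm every line parses before handing it to the preprocessing script.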

Use the following command to prepare the training dataset:

jsonfile="/Users/phoenixbai/workspace/github/CPT/tmp/eight.files3.json"
vocabfile="/Users/phoenixbai/workspace/github/CPT/finetune/generation/output/adgen/2/vocab.txt"
prefix="test"
python ../pretrain/tools/preprocess_data.py \
    --input $jsonfile \
    --output-prefix $prefix \
    --vocab $vocabfile \
    --dataset-impl mmap \
    --tokenizer-type BertWordPieceCase

Environment setup

You need a machine with a GPU. How to upgrade the GPU driver will be covered in a separate post.

To continue pretraining from the already-trained cpt-base, the code needs a small change (see https://github.com/fastnlp/CPT/issues/30):

# Original code, which initialised only the encoder from roberta_zh:
# model_path = 'roberta_zh'
# self.language_model = HFBartModel(config, encoder_config)
# self.language_model = HFBartModel(config)
# encoder_state = torch.load(model_path + '/pytorch_model.bin', map_location='cpu')
# self.language_model.model.encoder.load_state_dict(encoder_state)

# Replacement: load the full pretrained cpt-base weights instead
model_name = "fnlp/cpt-base"
self.language_model = HFBartModel.from_pretrained(model_name)

apex here means NVIDIA's apex (https://github.com/NVIDIA/apex/blob/master/README.md). It has to be installed with CUDA support. If you hit errors such as amp_C module not found or shared_mutex:: error: taking address of temporary [-fpermissive], upgrading to gcc-8.5 should fix them:
```
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```
If import amp_C fails with ImportError: libc10.so: cannot open shared object file: No such file or directory, locate libc10.so under anaconda3/ (for example lib/python3.7/site-packages/torch/lib/libc10.so) and add its directory to LD_LIBRARY_PATH.
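The search can be scripted. The following is a rough sketch, not part of CPT or apex; both helper names are hypothetical, and the root directory is whatever your Anaconda install path happens to be:

```python
import os


def find_shared_lib_dirs(root, lib_name="libc10.so"):
    """Return every directory under root that contains lib_name."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if lib_name in filenames:
            hits.append(dirpath)
    return sorted(hits)


def export_line(lib_dir):
    """Build the shell line that prepends lib_dir to LD_LIBRARY_PATH."""
    return f"export LD_LIBRARY_PATH={lib_dir}:$LD_LIBRARY_PATH"
```

Calling find_shared_lib_dirs(os.path.expanduser("~/anaconda3")) and printing export_line for the directory that belongs to the active environment gives a line that can be pasted into .bash_profile.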
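Python libraries to install, with notes:
- Install the torch build that matches your CUDA version. Mine is CUDA 11.3, so: pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
- transformers==4.4.1
- Go into pretrain/ and run pip install -r requirements.txt
- pip install jieba_fast
- packaging may fail with an error about a missing minor version; if so, uninstall the current version and install one >=20.0 (pip picks the latest by default)
- deepspeed also needs to be installed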
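To see at a glance what actually ended up installed, here is a small sketch that assumes Python 3.8 or newer (for importlib.metadata); the helper names are illustrative and the version comparison is deliberately naive:

```python
from importlib import metadata  # in the standard library from Python 3.8 on


def installed_version(package):
    """Return the installed version string of package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def release_tuple(version):
    """Turn '4.4.1' into (4, 4, 1); stops at the first non-numeric component."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Naive numeric comparison; good enough for plain x.y.z versions."""
    return release_tuple(installed) >= release_tuple(minimum)


# Print what is installed for the packages named in the list above
for name in ["torch", "transformers", "packaging", "deepspeed", "jieba_fast"]:
    print(name, installed_version(name))
```

For example, meets_minimum(installed_version("packaging"), "20.0") tells you whether the packaging requirement is satisfied, provided the package is installed at all.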
Installing gcc-8.5 (compiling the CUDA code with gcc-6.1 fails with shared_mutex:: error: taking address of temporary [-fpermissive], hence the upgrade to 8.5):
- The ./contrib/download_prerequisites script that ships with gcc downloads the prerequisites automatically. If that does not work, see: How to install GCC piece by piece with GMP, MPFR, MPC, ELF (http://stackoverflow.com/questions/9450394/how-to-install-gcc-piece-by-piece-with-gmp-mpfr-mpc-elf-without-shared-libra)
- Installation steps:
```
# Download gcc-8.5.0.tar.gz from https://ftp.gnu.org/gnu/gcc/
tar xzf gcc-8.5.0.tar.gz
cd gcc-8.5.0
./contrib/download_prerequisites  # if this fails, change ftp to http in the script
cd ..
mkdir objdir
cd objdir
$PWD/../gcc-8.5.0/configure --prefix=$HOME/app/gcc-8.5.0 --enable-languages=c,c++,fortran,go --disable-multilib  # 64-bit support only
make -j20
make -j20 install
```

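Add these variables to .bash_profile (GCC_HOME should be the directory that was passed to --prefix when running configure):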

GCC_HOME=/home/phoenixbai/workspace/gcc-8.5
export LD_LIBRARY_PATH=$GCC_HOME/lib:$LD_LIBRARY_PATH
export PATH=$GCC_HOME/bin:$PATH
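
After reloading the shell it is worth confirming that the gcc found first on PATH really is the new one. A minimal sketch, assuming gcc --version prints an x.y.z version on its first line (the function names are hypothetical):

```python
import re
import shutil
import subprocess


def parse_gcc_version(version_output):
    """Extract (major, minor, patch) from the text printed by `gcc --version`."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("no x.y.z version found in gcc output")
    return tuple(int(part) for part in match.groups())


def active_gcc_version():
    """Version of the first gcc on PATH, or None when gcc is not installed."""
    gcc = shutil.which("gcc")
    if gcc is None:
        return None
    result = subprocess.run(
        [gcc, "--version"], capture_output=True, text=True, check=True
    )
    return parse_gcc_version(result.stdout)
```

If active_gcc_version() returns a tuple below (8, 5, 0), the PATH entry from .bash_profile has most likely not taken effect in the current shell.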

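To download the pretrained cpt-base model, go to: https://huggingface.co/fnlp/cpt-base
- Run sha256sum pytorch_model.bin to check that the file matches the published one. The direct download shows up as a zip; just rename it directly to pytorch_model.bin.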
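
On a machine without sha256sum, the same digest can be computed with Python's hashlib. This is a sketch; the helper name is illustrative and the expected digest has to be copied from the model page yourself:

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so a large model file never sits fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Compare sha256_of_file("pytorch_model.bin") with the SHA256 shown for the file on the Hugging Face page; the two hex strings should be identical.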

Problems encountered during training

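If training fails with the error below, the parameters stored in a checkpoint from an earlier run do not match those of the current run, and the program has probably loaded that old checkpoint. Deleting the earlier checkpoint fixes it.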

File "/gruntdata/phoenixbai/CPT/pretrain/megatron/learning_rates.py", line 133, in _check_and_set
f'AnnealingLR: class input value {cls_value} and checkpoint'
AssertionError: AnnealingLR: class input value 5120.0 and checkpointvalue 1280.0 for warmup iterations do not match
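
A gentler variant of that fix, sketched here as an assumption rather than the method described above, is to rename the stale checkpoint directory instead of deleting it, so the next run starts clean but the old weights remain recoverable. The function name and backup suffix are hypothetical:

```python
import os
import time


def set_aside_checkpoint(ckpt_dir):
    """Rename ckpt_dir to a timestamped backup; return the new path or None."""
    if not os.path.isdir(ckpt_dir):
        return None  # nothing to move, the next run will start from scratch anyway
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = ckpt_dir.rstrip(os.sep) + ".bak-" + stamp
    os.rename(ckpt_dir, backup)
    return backup
```

Pass it the checkpoint directory your run script loads from before relaunching training.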

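A batch_size as large as 512 easily triggers CUDA OOM, so use a smaller value.
An iteration processes one batch, so with 20 million records and a batch size of 256, one epoch takes 20,000,000 / 256 = 78,125 iterations.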
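
The arithmetic as a tiny helper (the function name is illustrative; rounding up accounts for a final partial batch):

```python
def iterations_per_epoch(num_records, batch_size):
    """Number of batches needed to see every record once, rounding up."""
    return (num_records + batch_size - 1) // batch_size


print(iterations_per_epoch(20_000_000, 256))  # 78125
print(iterations_per_epoch(20_000_000, 512))  # 39063
```
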
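Once everything was working, I launched the job with nohup sh pretrain_run.sh &, but when the session disconnected it failed with the error below. Based on the discussion in https://github.com/pytorch/pytorch/issues/67538, running the job inside tmux solves this: with nohup the PyTorch program may still be terminated when the terminal closes, whereas tmux worked.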

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 9645 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 9646 closing signal SIGHUP
Traceback (most recent call last):
…
packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 9564 got signal: 1

Results

Convert the checkpoint to pytorch_model.bin:

python pretrain/tools/convert_ckpt.py pretrain/checkpoints/cpt-base/last/mp_rank_00_000/  model/cpt-base-zhijian/

The script below gives a quick sanity check of the result. To run prediction on a GPU, see: https://discuss.huggingface.co/t/is-transformers-using-gpu-by-default/8500

import sys

# Make the repo's finetune/ directory importable so modeling_cpt can be found
sys.path.extend(['/home/phoenixbai/workspace/CPT/finetune'])
from modeling_cpt import CPTForConditionalGeneration
from transformers import BertTokenizer

MODEL_NAME = "/home/phoenixbai/workspace/CPT/model/cpt-base-test/"
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = CPTForConditionalGeneration.from_pretrained(MODEL_NAME)

# Mask filling
input_ids = tokenizer.encode("你好, 菜鸟[MASK], 你有两个天猫[MASK]到了", return_tensors='pt')
pred_ids = model.generate(input_ids, num_beams=4, max_length=20)
print(tokenizer.convert_ids_to_tokens(pred_ids[0]))
# Output: ['[SEP]', '[CLS]', '你', '好', '，', '菜', '鸟', '驿', '站', ',', '你', '有', '两', '个', '天', '猫', '快', '递', '到', '[SEP]']

# Regenerating the whole sentence
input_ids = tokenizer.encode("你好, 菜鸟, 你有两个天猫到了", return_tensors='pt')
pred_ids = model.generate(input_ids, num_beams=4, max_length=20)
print(tokenizer.convert_ids_to_tokens(pred_ids[0]))
# Output: ['[SEP]', '[CLS]', '你', '好', '啊', '菜', '鸟', '直', '送', '你', '有', '两', '个', '天', '猫', '到', '了', '了', '你', '[SEP]']

Thanks
