Fine-tuning a pretrained BERT model for text classification

Environment

  • python==3.7
  • torch==1.7.1
  • transformers==4.9.2
  • scikit-learn==0.21.3
  • tensorboard==2.5.0
  • pandas
  • numpy

Building the dataset

Put the data into a DataFrame with a text column holding the raw sentences and a label column whose integer value is the index of each category.
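A minimal sketch of the expected layout (the column names text and label come from the code below; the example rows are hypothetical):

import pandas as pd

# hypothetical rows; label holds the integer index of each category
df = pd.DataFrame({
    "text": ["great value for the price", "battery died after two days", "average build quality"],
    "label": [0, 1, 2],
})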


Random seed setup

import torch
import numpy as np

random_seed = 2018
np.random.seed(random_seed)
torch.manual_seed(random_seed)
torch.cuda.manual_seed_all(random_seed)
torch.backends.cudnn.deterministic = True

Dataset split

# split into training, validation and test sets
# stratify performs stratified sampling so each split keeps the label distribution
from sklearn.model_selection import train_test_split

train_text, temp_text, train_labels, temp_labels = train_test_split(
    df['text'], df['label'], random_state=random_seed, test_size=0.3, stratify=df['label'])
val_text, test_text, val_labels, test_labels = train_test_split(
    temp_text, temp_labels, random_state=random_seed, test_size=0.5, stratify=temp_labels)

Loading the pretrained model

import transformers
from transformers import AutoModel, BertTokenizerFast

# load the pretrained model and its tokenizer
pretrained_model_path = "bert-base-uncased"
bert = AutoModel.from_pretrained(pretrained_model_path)
tokenizer = BertTokenizerFast.from_pretrained(pretrained_model_path)

Sentence length distribution

import pandas as pd

# inspect sentence lengths in the training set to decide on the padding length
seq_len = [len(i.split()) for i in train_text]
seq_df = pd.DataFrame(seq_len)
seq_df.describe()


The describe() output shows that 75% of the sentences contain no more than 46 tokens, so the padding length can be set to around 50.
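The same cutoff can also be read off directly with numpy (a quick sketch; the value 46 comes from the author's dataset):

import numpy as np

# length percentiles of the training sentences, used to pick the padding length
print(np.percentile(seq_len, [50, 75, 95, 99]))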


Tokenizer

padding_len = 50
# the tokenizer turns text into input ids and also returns the attention mask
# (padding='max_length' replaces the deprecated pad_to_max_length=True)
tokens_train = tokenizer.batch_encode_plus(train_text.tolist(), max_length=padding_len, padding='max_length', truncation=True)
tokens_test = tokenizer.batch_encode_plus(test_text.tolist(), max_length=padding_len, padding='max_length', truncation=True)
tokens_val = tokenizer.batch_encode_plus(val_text.tolist(), max_length=padding_len, padding='max_length', truncation=True)

# to tensors
train_seq = torch.tensor(tokens_train['input_ids'])
train_mask = torch.tensor(tokens_train['attention_mask'])
train_y = torch.tensor(train_labels.tolist())

val_seq = torch.tensor(tokens_val['input_ids'])
val_mask = torch.tensor(tokens_val['attention_mask'])
val_y = torch.tensor(val_labels.tolist())

test_seq = torch.tensor(tokens_test['input_ids'])
test_mask = torch.tensor(tokens_test['attention_mask'])
test_y = torch.tensor(test_labels.tolist())
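To sanity-check the encoding (an optional sketch, not in the original), you can inspect the tensor shapes and decode one example back to text:

print(train_seq.shape, train_mask.shape)  # both (n_train, padding_len)
print(tokenizer.decode(train_seq[0]))     # original tokens plus [CLS], [SEP] and [PAD]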

Dataset and DataLoader

# create the dataloaders
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

batch_size = 16

train_data = TensorDataset(train_seq, train_mask, train_y)
# random sampler shuffles the data during training
train_sampler = RandomSampler(train_data)
# dataloader for the training set
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

val_data = TensorDataset(val_seq, val_mask, val_y)
val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)
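Each batch yields input ids, attention masks and labels (a quick check, not in the original):

sent_id, mask, labels = next(iter(train_dataloader))
print(sent_id.shape, mask.shape, labels.shape)  # [16, 50], [16, 50], [16]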

Freezing the pretrained model's parameters

# freeze all the parameters
for param in bert.parameters():
    param.requires_grad = False

Model definition

# model definition
import torch.nn as nn

class_num = df['label'].nunique()  # number of label categories (not defined in the original snippet)

class BertClassifier(nn.Module):
    def __init__(self, bert_pretrained_model):
        super(BertClassifier, self).__init__()
        self.bert = bert_pretrained_model
        self.dropout = nn.Dropout(0.1)
        self.relu = nn.ReLU()
        self.fc1 = nn.Linear(768, 512)
        self.fc2 = nn.Linear(512, class_num)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, sent_id, mask):
        bert_res = self.bert(sent_id, attention_mask=mask)
        # use the pooled [CLS] representation as the sentence embedding
        cls_hs = bert_res.pooler_output
        x = self.fc1(cls_hs)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        x = self.softmax(x)
        return x
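A dummy forward pass (optional, not in the original) confirms the output shape before training:

m = BertClassifier(bert)
dummy_ids = torch.randint(0, tokenizer.vocab_size, (2, padding_len))
dummy_mask = torch.ones(2, padding_len, dtype=torch.long)
print(m(dummy_ids, dummy_mask).shape)  # torch.Size([2, class_num])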

Initialization

device = torch.device("cuda:1") if torch.cuda.is_available() else torch.device("cpu")
model = BertClassifier(bert)
model.to(device)
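As a quick sanity check (not in the original), you can confirm that only the classification head's parameters will be updated after freezing BERT:

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable} / {total}")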

Optimizer

from transformers import AdamW

# the BERT paper recommends fine-tuning learning rates of 2e-5, 3e-5 or 5e-5
optimizer = AdamW(model.parameters(), lr=2e-5)
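Since the BERT weights are frozen, an equivalent variant (not in the original) is to hand the optimizer only the trainable parameters:

optimizer = AdamW((p for p in model.parameters() if p.requires_grad), lr=2e-5)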

Computing the per-class weights

Because the label distribution is imbalanced, class weights are passed to the loss function.

from sklearn.utils.class_weight import compute_class_weight

group_size = df.groupby(["label"]).size()
print(group_size)
# the keyword form also works on newer scikit-learn releases, where it is required
class_weights = compute_class_weight('balanced', classes=np.unique(train_labels), y=train_labels)
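For reference, the 'balanced' heuristic weights class c as n_samples / (n_classes * n_samples_c), so rarer classes get larger weights. A tiny sketch with hypothetical counts:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0, 0, 0, 1])  # class 0 is three times as frequent as class 1
print(compute_class_weight('balanced', classes=np.unique(y), y=y))  # -> approximately [0.667, 2.0]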

Loss function

weights = torch.tensor(class_weights, dtype=torch.float)
weights = weights.to(device)
cross_entropy = nn.NLLLoss(weight=weights)
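Note that nn.NLLLoss expects log-probabilities, which is why the model ends with nn.LogSoftmax; the pair is equivalent to applying nn.CrossEntropyLoss to raw logits, as this small check (not part of the original) illustrates:

import torch
import torch.nn as nn

logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 1])
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)
ce = nn.CrossEntropyLoss()(logits, targets)
print(torch.allclose(nll, ce))  # True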

Training

from datetime import datetime
from sklearn.metrics import f1_score, classification_report
from torch.utils.tensorboard import SummaryWriter

saved_path = '/data/yuhengshi/saved_model/9_aug_other_roberta_val185.plk'
writer = SummaryWriter("/data/yuhengshi/saved_model")

def evaluate():
    print("evaluate start")
    model.eval()
    total_loss = 0
    total_preds = []
    total_labels = []
    for step, batch in enumerate(val_dataloader):
        batch = [t.to(device) for t in batch]
        sent_id, mask, labels = batch
        with torch.no_grad():
            preds = model(sent_id, mask)
            loss = cross_entropy(preds, labels)
            total_loss = total_loss + loss.item()
            preds = preds.detach().cpu().numpy()
            total_preds.append(preds)
            labels = labels.detach().cpu().numpy()
            total_labels.append(labels)
    avg_loss = total_loss / len(val_dataloader)
    total_preds = np.concatenate(total_preds, axis=0)
    total_labels = np.concatenate(total_labels, axis=0)
    val_f1 = f1_score(total_labels, np.argmax(total_preds, axis=1), average='micro')
    return avg_loss, val_f1, total_preds, total_labels

def train(epochs=10):
    best_valid_loss = float('inf')
    start_time = datetime.now()
    for epoch in range(epochs):
        print(f"Epoch {epoch}/{epochs} start")
        model.train()
        total_loss = 0
        # store predictions and labels for the whole epoch
        total_preds = []
        total_labels = []
        for step, batch in enumerate(train_dataloader):
            batch = [x.to(device) for x in batch]
            sent_id, mask, labels = batch
            model.zero_grad()
            preds = model(sent_id, mask)
            loss = cross_entropy(preds, labels)
            total_loss = total_loss + loss.item()
            loss.backward()
            # clip gradients to avoid exploding gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            # model predictions are on the GPU, so push them to the CPU
            preds = preds.detach().cpu().numpy()
            total_preds.append(preds)
            total_labels.append(labels.detach().cpu().numpy())
            # log every 50 steps
            if step != 0 and step % 50 == 0:
                current_time = datetime.now()
                print(f"Epoch {epoch}/{epochs}, step:{step}, train_loss:{loss}, cost_time:{current_time - start_time}")
        # at the end of each epoch, compute the average loss and metrics
        total_preds = np.concatenate(total_preds, axis=0)
        total_labels = np.concatenate(total_labels, axis=0)
        avg_loss = total_loss / len(train_dataloader)
        preds = np.argmax(total_preds, axis=1)
        train_f1 = f1_score(total_labels, preds, average='micro')
        valid_loss, valid_f1, valid_preds, valid_labels = evaluate()
        valid_preds = np.argmax(valid_preds, axis=1)
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            torch.save(model.state_dict(), saved_path)
        current_time = datetime.now()
        metrics_report = classification_report(valid_labels, valid_preds)
        # tensorboard
        writer.add_scalar("loss/train", avg_loss, epoch)
        writer.add_scalar("loss/valid", valid_loss, epoch)
        writer.add_scalar("f1/train", train_f1, epoch)
        writer.add_scalar("f1/valid", valid_f1, epoch)
        print(f"Epoch {epoch}/{epochs}, train_loss: {avg_loss}, train_f1:{train_f1}\n"
              f"valid_loss: {valid_loss}, valid_f1: {valid_f1}\n"
              f"best_valid_loss: {best_valid_loss}, cost_time: {current_time - start_time}\n"
              f"{metrics_report}")
        print("=====================================")
    writer.close()

train()

Prediction

def predict(model_path, test_seq, test_mask):
    # restore the best checkpoint and run inference on the test tensors
    model.load_state_dict(torch.load(model_path))
    model.eval()
    with torch.no_grad():
        preds = model(test_seq.to(device), test_mask.to(device))
        preds = preds.detach().cpu().numpy()
    return np.argmax(preds, axis=1)
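A possible way to call it, reusing the checkpoint path and the classification_report import from the training section:

test_preds = predict(saved_path, test_seq, test_mask)
print(classification_report(test_y.numpy(), test_preds))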

Using TensorBoard

tensorboard --logdir=./log2 --port 8088
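Note that --logdir has to point at the directory passed to SummaryWriter above (here /data/yuhengshi/saved_model) for the training and validation curves to show up.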
