Series table of contents

Deep Learning NLP (1): Attention Model
Deep Learning NLP (2): Self-attention, Multi-attention and Transformer
Deep Learning NLP (3): ELMO, BERT, GPT
Deep Learning NLP (4): BERT in Practice for IMDB Movie Review Sentiment Analysis


BERT in Practice: IMDB Movie Review Sentiment Analysis

  • Series table of contents
  • 0. Preface
  • 1. What is WordPiece?
  • 2. Method 1: fine-tune BERT with Google Research's open-source GitHub code
    • 2.1 Required environment
    • 2.2 Code changes and walkthrough
  • 3. Method 2: fine-tune BERT with tensorflow_hub and bert-tensorflow
    • 3.1 Required environment
    • 3.2 Code walkthrough
  • 4. Method 3: fine-tune BERT with Hugging Face Transformers
    • 4.1. Required environment
    • 4.2. Code walkthrough
    • 4.3. Possible issues
      • ModuleNotFoundError: No module named 'tools.nnwrap'
  • 5. Additional notes
  • 6. Summary
  • 7. References

0. Preface

IMDB movie review dataset: introduction and download

Below, I fine-tune BERT on the IMDB movie review dataset in three different ways.

1. What is WordPiece?

Most of the stronger NLP models today, such as OpenAI's GPT and Google's BERT, include a WordPiece step in their data preprocessing. WordPiece means exactly what it says: a word is split into pieces.

One of the main implementations of WordPiece-style subword splitting is BPE (Byte-Pair Encoding).

The BPE process can be understood as splitting words into smaller units, which keeps the vocabulary compact and makes the meaning of each unit clearer.

Take the three words "loved", "loving" and "loves". They all share the core meaning "love", but if we treat whole words as our units they count as three different words. English has a huge number of words that differ only in their suffixes, so a word-level vocabulary becomes very large, training slows down, and results suffer.

Through training, the BPE algorithm can split the three words above into the pieces "lov", "ed", "ing" and "es", separating the core meaning from the tense or inflection and effectively shrinking the vocabulary.
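
To make the merging idea concrete, here is a minimal, illustrative sketch of a BPE-style training loop on a toy corpus (the corpus, helper names and merge count are invented for illustration; real tokenizers are trained on large corpora with tens of thousands of merges):

import collections
import re

def get_stats(vocab):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the chosen symbol pair into one new symbol."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: each word is split into characters, '</w>' marks the end of a word.
vocab = {'l o v e d </w>': 5, 'l o v i n g </w>': 4, 'l o v e s </w>': 3}

# After just two merges the shared stem 'lov' becomes a single symbol,
# while the endings 'ed', 'ing', 'es' are still built from smaller pieces.
for step in range(2):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print('merge', step + 1, ':', best, '->', list(vocab.keys()))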

In BERT the procedure is actually to first check whether the whole word exists in the vocabulary; if it does not, the tokenizer tries to split it into subwords.
For example, "tokenizer" --> "token", "##izer".

2. Method 1: fine-tune BERT with Google Research's open-source GitHub code

2.1 Required environment

2.1.1 If you have a GPU:

conda install python=3.6
conda install tensorflow-gpu=1.11.0

If you do not have a GPU, the CPU version of TensorFlow also works; it is just slower.

pip install tensorflow==1.11.0

2.1.2 Download Google Research's open-source BERT code (google-research/bert) from GitHub.

2.1.3 Download the pre-trained BERT checkpoint uncased_L-12_H-768_A-12 and unzip it (a small download script is sketched below):
https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
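
If you prefer to script the download and extraction, a small sketch using only the standard library could look like this (the ./data target folder is just an example; any location works as long as the paths used later match):

import os
import urllib.request
import zipfile

url = 'https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip'
target_dir = './data'
os.makedirs(target_dir, exist_ok=True)

zip_path = os.path.join(target_dir, 'uncased_L-12_H-768_A-12.zip')
urllib.request.urlretrieve(url, zip_path)   # download the checkpoint archive (several hundred MB)
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(target_dir)               # creates ./data/uncased_L-12_H-768_A-12/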

2.2 Code changes and walkthrough

Open run_classifier.py and add the ImdbProcessor class below.
For the datasets imdb_train.npz, imdb_test.npz and imdb_val.npz, see my blog post:
IMDB movie review dataset: introduction and download


# Note: this processor uses numpy, so add "import numpy as np" at the top of run_classifier.py.
class ImdbProcessor(DataProcessor):
    """Processor for the IMDB movie review data set."""

    def get_train_examples(self, data_dir):
        data = np.load('./data/imdb_train.npz')
        return self._create_examples(data, 'train')

    def get_dev_examples(self, data_dir):
        data = np.load('./data/imdb_val.npz')
        return self._create_examples(data, 'val')

    def get_test_examples(self, data_dir):
        data = np.load('./data/imdb_test.npz')
        return self._create_examples(data, 'test')

    def get_labels(self):
        """See base class."""
        return ["0", "1"]

    def _create_examples(self, train_data, set_type):
        """Creates examples for the training and dev sets."""
        X = train_data['x']
        Y = train_data['y']
        examples = []
        i = 0
        for data, label in zip(X, Y):
            guid = "%s-%s" % (set_type, i)
            text_a = tokenization.convert_to_unicode(data)
            label1 = tokenization.convert_to_unicode(str(label))
            examples.append(InputExample(guid=guid, text_a=text_a, label=label1))
            i = i + 1
        return examples

Then locate the main() function in run_classifier.py and register the ImdbProcessor class you just added, as shown below.
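
The processors dict inside main() ends up looking roughly like this (the existing entries are from the upstream run_classifier.py and may differ slightly in your copy; the new key must match the --task_name flag used when launching training):

    processors = {
        "cola": ColaProcessor,
        "mnli": MnliProcessor,
        "mrpc": MrpcProcessor,
        "xnli": XnliProcessor,
        "imdb": ImdbProcessor,  # new entry for our IMDB processor
    }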

Run run_classifier.py from the command line:

export BERT_BASE_DIR=./data/uncased_L-12_H-768_A-12
export DATASET=../data/

python run_classifier.py \
  --data_dir=$DATASET \
  --task_name=imdb \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --output_dir=../output/ \
  --do_train=true \
  --do_eval=true \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=200 \
  --train_batch_size=16 \
  --learning_rate=5e-5 \
  --num_train_epochs=2.0

Alternatively, run it from PyCharm: open the Run configuration and paste in the parameters below.
Note that you need to adjust two paths: the dataset directory (mine is ./data/) and the model directory (mine is D:\train_data\uncased_L-12_H-768_A-12).

--data_dir=./data/ --task_name=imdb --vocab_file=D:\train_data\uncased_L-12_H-768_A-12\vocab.txt --bert_config_file=D:\train_data\uncased_L-12_H-768_A-12\bert_config.json --output_dir=../output/ --do_train=true --do_eval=true --init_checkpoint=D:\train_data\uncased_L-12_H-768_A-12\bert_model.ckpt --max_seq_length=200 --train_batch_size=16 --learning_rate=5e-5 --num_train_epochs=2.0


Then run run_classifier.py.
On a GPU it finishes in roughly two hours, with an accuracy of 92.24%.

3. Method 2: fine-tune BERT with tensorflow_hub and bert-tensorflow

3.1 Required environment

conda install python=3.6
conda install tensorflow-gpu=1.11.0
conda install tensorflow-hub
pip install bert-tensorflow

3.2 Code walkthrough

import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
from datetime import datetime
import bert
from bert import run_classifier
from bert import optimization
from bert import tokenization
import numpy as np

OUTPUT_DIR = 'output1'

def download_and_load_datasets():
    train_data = np.load('./data/bert_train.npz')
    test_data = np.load('./data/bert_test.npz')
    train_df = pd.DataFrame({'sentence': train_data['x'], 'polarity': train_data['y']})
    test_df = pd.DataFrame({'sentence': test_data['x'], 'polarity': test_data['y']})
    return train_df, test_df

# This is a path to an uncased (all lowercase) version of BERT
BERT_MODEL_HUB = "D:/train_data/tf-hub/bert_uncased_L-12_H-768_A-12_1"

def create_tokenizer_from_hub_module():
    """Get the vocab file and casing info from the Hub module."""
    with tf.Graph().as_default():
        bert_module = hub.Module(BERT_MODEL_HUB)
        tokenization_info = bert_module(signature="tokenization_info", as_dict=True)
        with tf.Session() as sess:
            vocab_file, do_lower_case = sess.run(
                [tokenization_info["vocab_file"], tokenization_info["do_lower_case"]])
    return bert.tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case)

# We'll set sequences to be at most 128 tokens long.
MAX_SEQ_LENGTH = 128
# label_list is the list of labels, i.e. True, False or 0, 1 or 'dog', 'cat'
label_list = [0, 1]

def getDataSet():
    train, test = download_and_load_datasets()
    # train = train.sample(5000)
    # test = test.sample(5000)
    DATA_COLUMN = 'sentence'
    LABEL_COLUMN = 'polarity'
    # Use the InputExample class from BERT's run_classifier code to create examples from the data
    train_InputExamples = train.apply(
        lambda x: bert.run_classifier.InputExample(
            guid=None,  # Globally unique ID for bookkeeping, unused in this example
            text_a=x[DATA_COLUMN],
            text_b=None,
            label=x[LABEL_COLUMN]),
        axis=1)
    test_InputExamples = test.apply(
        lambda x: bert.run_classifier.InputExample(
            guid=None, text_a=x[DATA_COLUMN], text_b=None, label=x[LABEL_COLUMN]),
        axis=1)
    tokenizer = create_tokenizer_from_hub_module()
    tokenizer.tokenize("This here's an example of using the BERT tokenizer")
    # Convert our train and test features to InputFeatures that BERT understands.
    train_features = bert.run_classifier.convert_examples_to_features(
        train_InputExamples, label_list, MAX_SEQ_LENGTH, tokenizer)
    test_features = bert.run_classifier.convert_examples_to_features(
        test_InputExamples, label_list, MAX_SEQ_LENGTH, tokenizer)
    return train_features, test_features

def create_model(is_predicting, input_ids, input_mask, segment_ids, labels, num_labels):
    """Creates a classification model."""
    bert_module = hub.Module(BERT_MODEL_HUB, trainable=True)
    bert_inputs = dict(input_ids=input_ids, input_mask=input_mask, segment_ids=segment_ids)
    bert_outputs = bert_module(inputs=bert_inputs, signature="tokens", as_dict=True)
    # Use "pooled_output" for classification tasks on an entire sentence.
    # Use "sequence_output" for token-level output.
    output_layer = bert_outputs["pooled_output"]
    hidden_size = output_layer.shape[-1].value
    print('hidden_size=', hidden_size)
    # Create our own classification layer to fine-tune.
    output_weights = tf.get_variable(
        "output_weights", [num_labels, hidden_size],
        initializer=tf.truncated_normal_initializer(stddev=0.02))
    output_bias = tf.get_variable("output_bias", [num_labels], initializer=tf.zeros_initializer())
    with tf.variable_scope("loss"):
        # Dropout helps prevent overfitting
        output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
        logits = tf.matmul(output_layer, output_weights, transpose_b=True)
        logits = tf.nn.bias_add(logits, output_bias)
        log_probs = tf.nn.log_softmax(logits, axis=-1)
        # Convert labels into one-hot encoding
        one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
        predicted_labels = tf.squeeze(tf.argmax(log_probs, axis=-1, output_type=tf.int32))
        # If we're predicting, we want predicted labels and the probabilities.
        if is_predicting:
            return (predicted_labels, log_probs)
        # If we're train/eval, compute loss between predicted and actual label
        per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
        loss = tf.reduce_mean(per_example_loss)
        return (loss, predicted_labels, log_probs)

# model_fn_builder actually creates our model function
# using the passed parameters for num_labels, learning_rate, etc.
def model_fn_builder(num_labels, learning_rate, num_train_steps, num_warmup_steps):
    """Returns `model_fn` closure for TPUEstimator."""

    def model_fn(features, labels, mode, params):  # pylint: disable=unused-argument
        """The `model_fn` for TPUEstimator."""
        input_ids = features["input_ids"]
        input_mask = features["input_mask"]
        segment_ids = features["segment_ids"]
        label_ids = features["label_ids"]
        is_predicting = (mode == tf.estimator.ModeKeys.PREDICT)

        # TRAIN and EVAL
        if not is_predicting:
            (loss, predicted_labels, log_probs) = create_model(
                is_predicting, input_ids, input_mask, segment_ids, label_ids, num_labels)
            train_op = bert.optimization.create_optimizer(
                loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu=False)

            # Calculate evaluation metrics.
            def metric_fn(label_ids, predicted_labels):
                accuracy = tf.metrics.accuracy(label_ids, predicted_labels)
                f1_score = tf.contrib.metrics.f1_score(label_ids, predicted_labels)
                auc = tf.metrics.auc(label_ids, predicted_labels)
                recall = tf.metrics.recall(label_ids, predicted_labels)
                precision = tf.metrics.precision(label_ids, predicted_labels)
                true_pos = tf.metrics.true_positives(label_ids, predicted_labels)
                true_neg = tf.metrics.true_negatives(label_ids, predicted_labels)
                false_pos = tf.metrics.false_positives(label_ids, predicted_labels)
                false_neg = tf.metrics.false_negatives(label_ids, predicted_labels)
                return {
                    "eval_accuracy": accuracy,
                    "f1_score": f1_score,
                    "auc": auc,
                    "precision": precision,
                    "recall": recall,
                    "true_positives": true_pos,
                    "true_negatives": true_neg,
                    "false_positives": false_pos,
                    "false_negatives": false_neg}

            eval_metrics = metric_fn(label_ids, predicted_labels)

            if mode == tf.estimator.ModeKeys.TRAIN:
                return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
            else:
                return tf.estimator.EstimatorSpec(mode=mode, loss=loss, eval_metric_ops=eval_metrics)
        else:
            (predicted_labels, log_probs) = create_model(
                is_predicting, input_ids, input_mask, segment_ids, label_ids, num_labels)
            predictions = {'probabilities': log_probs, 'labels': predicted_labels}
            return tf.estimator.EstimatorSpec(mode, predictions=predictions)

    # Return the actual model function in the closure
    return model_fn

# Compute train and warmup steps from batch size
# These hyperparameters are copied from this colab notebook (https://colab.sandbox.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb)
BATCH_SIZE = 16
LEARNING_RATE = 2e-5
NUM_TRAIN_EPOCHS = 3.0
# Warmup is a period of time where the learning rate
# is small and gradually increases--usually helps training.
WARMUP_PROPORTION = 0.1
# Model configs
SAVE_CHECKPOINTS_STEPS = 500
SAVE_SUMMARY_STEPS = 100

def get_estimator(train_features):
    # Compute train and warmup steps from batch size
    num_train_steps = int(len(train_features) / BATCH_SIZE * NUM_TRAIN_EPOCHS)
    num_warmup_steps = int(num_train_steps * WARMUP_PROPORTION)
    # Specify output directory and number of checkpoint steps to save
    run_config = tf.estimator.RunConfig(
        model_dir=OUTPUT_DIR,
        save_summary_steps=SAVE_SUMMARY_STEPS,
        save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS)
    model_fn = model_fn_builder(
        num_labels=len(label_list),
        learning_rate=LEARNING_RATE,
        num_train_steps=num_train_steps,
        num_warmup_steps=num_warmup_steps)
    estimator = tf.estimator.Estimator(
        model_fn=model_fn, config=run_config, params={"batch_size": BATCH_SIZE})
    return estimator

def train_bert_model(train_features):
    num_train_steps = int(len(train_features) / BATCH_SIZE * NUM_TRAIN_EPOCHS)
    estimator = get_estimator(train_features)
    # Create an input function for training. drop_remainder = True for using TPUs.
    train_input_fn = bert.run_classifier.input_fn_builder(
        features=train_features, seq_length=MAX_SEQ_LENGTH, is_training=True, drop_remainder=False)
    print(f'Beginning Training!')
    current_time = datetime.now()
    estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
    print("Training took time ", datetime.now() - current_time)

def test_bert_model(test_features):
    estimator = get_estimator(test_features)
    test_input_fn = run_classifier.input_fn_builder(
        features=test_features, seq_length=MAX_SEQ_LENGTH, is_training=False, drop_remainder=False)
    test_result = estimator.evaluate(input_fn=test_input_fn, steps=None)
    print(test_result)

def getPrediction(in_sentences, train_features):
    estimator = get_estimator(train_features)
    labels = ["Negative", "Positive"]
    tokenizer = create_tokenizer_from_hub_module()
    input_examples = [run_classifier.InputExample(guid="", text_a=x, text_b=None, label=0)
                      for x in in_sentences]  # here, "" is just a dummy label
    input_features = run_classifier.convert_examples_to_features(
        input_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
    predict_input_fn = run_classifier.input_fn_builder(
        features=input_features, seq_length=MAX_SEQ_LENGTH, is_training=False, drop_remainder=False)
    predictions = estimator.predict(predict_input_fn)
    return [(sentence, prediction['probabilities'], labels[prediction['labels']])
            for sentence, prediction in zip(in_sentences, predictions)]

def predict(train_features):
    pred_sentences = [
        "That movie was absolutely awful",
        "The acting was a bit lacking",
        "The film was creative and surprising",
        "Absolutely fantastic!"]
    predictions = getPrediction(pred_sentences, train_features)
    print(predictions)

def main(_):
    train_features, test_features = getDataSet()
    print(type(train_features), len(train_features))
    print(type(test_features), len(test_features))
    train_bert_model(train_features)
    test_bert_model(test_features)
    # predict(train_features)

if __name__ == '__main__':
    tf.app.run()

Result: the accuracy is 89.81%.

{'auc': 0.89810693, 'eval_accuracy': 0.8981, 'f1_score': 0.897557, 'false_negatives': 501.0, 'false_positives': 518.0, 'loss': 0.52041256, 'precision': 0.8960257, 'recall': 0.8990936, 'true_negatives': 4517.0, 'true_positives': 4464.0, 'global_step': 6750}
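
As a quick sanity check, the headline numbers in this dict can be recomputed from the confusion-matrix counts it contains:

# values taken from the evaluation output above
tp, tn, fp, fn = 4464.0, 4517.0, 518.0, 501.0

accuracy = (tp + tn) / (tp + tn + fp + fn)            # 0.8981
precision = tp / (tp + fp)                            # ~0.89603
recall = tp / (tp + fn)                               # ~0.89909
f1 = 2 * precision * recall / (precision + recall)    # ~0.89756
print(accuracy, precision, recall, f1)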

4. Method 3: fine-tune BERT with Hugging Face Transformers

4.1. Required environment

Install transformers:

pip install transformers

Download the pre-trained bert-base-uncased model files from the Hugging Face website.

Note that bert-base-uncased-tf_model.h5 and bert-base-uncased-pytorch_model.bin are alternatives: you only need one of them.
tf_model.h5 is for TensorFlow, pytorch_model.bin is for PyTorch.

https://cdn.huggingface.co/bert-base-uncased-tf_model.h5
https://cdn.huggingface.co/bert-base-uncased-pytorch_model.bin
https://cdn.huggingface.co/bert-base-uncased-vocab.txt
https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json

After downloading, put the files in a single folder, e.g. bert-base-uncased, and rename each one as follows (a quick load check is sketched after the list):
bert-base-uncased-tf_model.h5 --> tf_model.h5
bert-base-uncased-pytorch_model.bin --> pytorch_model.bin
bert-base-uncased-vocab.txt --> vocab.txt
bert-base-uncased-config.json --> config.json
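
A quick check that the renamed folder loads correctly (the path is mine; point it at your own folder; this sketch assumes the TensorFlow weights tf_model.h5 are present):

from transformers import BertTokenizer, TFBertForSequenceClassification

folder = r'D:\train_data\bert\bert-base-uncased'  # your renamed folder

tokenizer = BertTokenizer.from_pretrained(folder, do_lower_case=True)   # needs vocab.txt
model = TFBertForSequenceClassification.from_pretrained(folder)         # needs config.json + tf_model.h5
print(tokenizer.tokenize('a quick smoke test'))
print(model.config.num_hidden_layers)  # 12 for bert-base-uncased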

4.2. Code walkthrough

For the imdb_train.npz and imdb_test.npz data files, see the blog post:
IMDB movie review dataset: introduction and download

The Transformers package is Hugging Face's wrapper around Google's open-source BERT (and many other models), and it makes BERT much more convenient to use.

The code below does by hand exactly what tokenizer.encode_plus() does for you.

from transformers import BertTokenizer

bert_weight_folder = r'D:\train_data\bert\bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(bert_weight_folder, do_lower_case=True)

max_length_test = 20
test_sentence = 'Test tokenization sentence. Followed by another sentence'

# add special tokens
test_sentence_with_special_tokens = '[CLS]' + test_sentence + '[SEP]'
tokenized = tokenizer.tokenize(test_sentence_with_special_tokens)
print('tokenized', tokenized)

# convert tokens to ids in WordPiece
input_ids = tokenizer.convert_tokens_to_ids(tokenized)

# precalculation of pad length, so that we can reuse it later on
padding_length = max_length_test - len(input_ids)

# attention should focus just on the non-padded tokens
# (build the mask before padding, so it contains a 1 only for real tokens)
attention_mask = [1] * len(input_ids)

# map tokens to the WordPiece dictionary and add pad tokens for text shorter than our max length
input_ids = input_ids + ([0] * padding_length)

# do not focus attention on padded tokens
attention_mask = attention_mask + ([0] * padding_length)

# token types, needed for example for question answering; for our purpose we just set 0 as we have one sequence
token_type_ids = [0] * max_length_test

bert_input = {
    "token_ids": input_ids,
    "token_type_ids": token_type_ids,
    "attention_mask": attention_mask
}
print(bert_input)
tokenized ['[CLS]', 'test', 'token', '##ization', 'sentence', '.', 'followed', 'by', 'another', 'sentence', '[SEP]']
{'token_ids': [101, 3231, 19204, 3989, 6251, 1012, 2628, 2011, 2178, 6251, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0],'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
}
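
For comparison, a single tokenizer.encode_plus() call should produce the same three lists (it returns the ids under the key input_ids rather than token_ids; exact keyword support, e.g. pad_to_max_length, depends on your transformers version):

encoded = tokenizer.encode_plus(
    test_sentence,
    add_special_tokens=True,      # adds [CLS] and [SEP] for us
    max_length=max_length_test,   # pad/truncate to 20 positions
    pad_to_max_length=True,       # append [PAD] ids up to max_length
    return_attention_mask=True,   # 1 for real tokens, 0 for padding
)
print(encoded['input_ids'])
print(encoded['token_type_ids'])
print(encoded['attention_mask'])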

The complete code is as follows:

import tensorflow as tf
import os as os
import numpy as np
from transformers import BertTokenizer
from transformers import TFBertForSequenceClassification
from sklearn.model_selection import train_test_split

max_length = 200
batch_size = 16
learning_rate = 2e-5
number_of_epochs = 10
bert_weight_folder = r'D:\train_data\bert\bert-base-uncased'

tokenizer = BertTokenizer.from_pretrained(bert_weight_folder, do_lower_case=True)

def convert_example_to_feature(review):
    return tokenizer.encode_plus(review,
                                 add_special_tokens=True,  # add [CLS], [SEP]
                                 max_length=max_length,  # max length of the text that can go to BERT
                                 pad_to_max_length=True,  # add [PAD] tokens
                                 return_attention_mask=True,  # add attention mask to not focus on pad tokens
                                 truncation=True)

# map to the expected input to TFBertForSequenceClassification, see here
def map_example_to_dict(input_ids, attention_masks, token_type_ids, label):
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_masks,
    }, label

def encode_examples(x, y, limit=-1):
    # prepare lists, so that we can build up the final TensorFlow dataset from slices.
    input_ids_list = []
    token_type_ids_list = []
    attention_mask_list = []
    label_list = []
    for review, label in zip(x, y):
        bert_input = convert_example_to_feature(review)
        input_ids_list.append(bert_input['input_ids'])
        token_type_ids_list.append(bert_input['token_type_ids'])
        attention_mask_list.append(bert_input['attention_mask'])
        label_list.append([label])
    print('input_ids_list', input_ids_list)
    print('token_type_ids_list', token_type_ids_list)
    print('attention_mask_list', attention_mask_list)
    return tf.data.Dataset.from_tensor_slices(
        (input_ids_list, attention_mask_list, token_type_ids_list, label_list)).map(map_example_to_dict)

def get_raw_dataset():
    train_data = np.load('./data/bert_train.npz')
    test_data = np.load('./data/bert_test.npz')
    val_data = np.load('./data/bert_val.npz')
    print("X_train:", train_data['x'].shape)
    print("y_train:", train_data['y'].shape)
    print("X_test:", test_data['x'].shape)
    print("y_test:", test_data['y'].shape)
    print("X_val:", val_data['x'].shape)
    print("y_val:", val_data['y'].shape)
    return train_data['x'], train_data['y'], test_data['x'], test_data['y'], val_data['x'], val_data['y']

def get_dataset():
    x_train, y_train, x_test, y_test, x_val, y_val = get_raw_dataset()
    ds_train_encoded = encode_examples(x_train, y_train).shuffle(10000).batch(batch_size)
    # test dataset
    ds_test_encoded = encode_examples(x_test, y_test).batch(batch_size)
    ds_val_encoded = encode_examples(x_val, y_val).batch(batch_size)
    return ds_train_encoded, ds_test_encoded, ds_val_encoded

def get_module():
    # model initialization
    model = TFBertForSequenceClassification.from_pretrained(bert_weight_folder)
    print(dir(model))
    # classifier Adam recommended
    optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=1e-08)
    # we do not have one-hot vectors, so we can use sparse categorical cross entropy and accuracy
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
    model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
    print(model.summary())
    return model

root_folder = r'.\bert'
weight_dir = root_folder + '\\weight.h5'

def train_bert_module(model, ds_train_encoded, ds_val_encoded):
    if os.path.isfile(weight_dir):
        print('load weight')
        model.load_weights(weight_dir)
    current_max_loss = 9999

    def save_weight(epoch, logs):
        # keep the weights with the lowest validation loss seen so far
        nonlocal current_max_loss
        if logs['val_loss'] is not None and logs['val_loss'] < current_max_loss:
            current_max_loss = logs['val_loss']
            print('save_weight', epoch, current_max_loss)
            model.save_weights(weight_dir)
            # model.save(root_folder + '\module.h5', include_optimizer=False, save_format="tf")

    batch_print_callback = tf.keras.callbacks.LambdaCallback(on_epoch_end=save_weight)
    callbacks = [
        tf.keras.callbacks.EarlyStopping(patience=4, monitor='loss'),
        batch_print_callback,
        # tf.keras.callbacks.TensorBoard(log_dir=root_folder + '\logs')
    ]
    print('start')
    bert_history = model.fit(ds_train_encoded, epochs=number_of_epochs,
                             validation_data=ds_val_encoded, callbacks=callbacks)
    print('bert_history', bert_history)

def test_bert_module(model, ds_test_encoded):
    if os.path.isfile(weight_dir):
        print('load weight')
        model.load_weights(weight_dir)
    scores = model.evaluate(ds_test_encoded)
    print(scores)

if __name__ == '__main__':
    ds_train_encoded, ds_test_encoded, ds_val_encoded = get_dataset()
    bert_module = get_module()
    train_bert_module(bert_module, ds_train_encoded, ds_val_encoded)
    test_bert_module(bert_module, ds_test_encoded)

Result: after the first epoch, the accuracy on the validation set is already 90.72%.
However, training is slow: roughly one hour per epoch on a GPU, and at least 12 hours on a CPU.

2186/2188 [============================>.] - ETA: 0s - loss: 0.2793 - accuracy: 0.8811
2187/2188 [============================>.] - ETA: 0s - loss: 0.2793 - accuracy: 0.8811
2188/2188 [==============================] - 1023s 467ms/step - loss: 0.2793 - accuracy: 0.8811 - val_loss: 0.2340 - val_accuracy: 0.9072

After the second epoch the validation accuracy is 90.78%; the model is starting to overfit.

2186/2188 [============================>.] - ETA: 0s - loss: 0.1526 - accuracy: 0.9438
2187/2188 [============================>.] - ETA: 0s - loss: 0.1526 - accuracy: 0.9438
2188/2188 [==============================] - 1010s 462ms/step - loss: 0.1526 - accuracy: 0.9438 - val_loss: 0.2995 - val_accuracy: 0.9078

After the third epoch the validation accuracy is 91.46%.

2186/2188 [============================>.] - ETA: 0s - loss: 0.0758 - accuracy: 0.9747
2187/2188 [============================>.] - ETA: 0s - loss: 0.0758 - accuracy: 0.9747
2188/2188 [==============================] - 1007s 460ms/step - loss: 0.0757 - accuracy: 0.9747 - val_loss: 0.2978 - val_accuracy: 0.9146

After the fourth epoch the validation accuracy is 91.35%.

2185/2188 [============================>.] - ETA: 1s - loss: 0.0425 - accuracy: 0.9862
2186/2188 [============================>.] - ETA: 0s - loss: 0.0425 - accuracy: 0.9861
2187/2188 [============================>.] - ETA: 0s - loss: 0.0425 - accuracy: 0.9861
2188/2188 [==============================] - 999s 457ms/step - loss: 0.0425 - accuracy: 0.9861 - val_loss: 0.3121 - val_accuracy: 0.9135

4.3. Possible issues

ModuleNotFoundError: No module named 'tools.nnwrap'

You may hit the error above when installing PyTorch.

The fix is to install the PyTorch wheels manually.
Go to https://download.pytorch.org/whl/torch_stable.html,
download torch-1.1.0-cp36-cp36m-win_amd64.whl and torchvision-0.3.0-cp36-cp36m-win_amd64.whl,
and then install them (a quick import check follows the commands):
pip install yourfolder\torch-1.1.0-cp36-cp36m-win_amd64.whl
pip install yourfolder\torchvision-0.3.0-cp36-cp36m-win_amd64.whl
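
After installing the wheels, a quick import check confirms the versions:

import torch
import torchvision

print(torch.__version__)        # expected: 1.1.0
print(torchvision.__version__)  # expected: 0.3.0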

Alternatively, install PyTorch directly with the command from the official website, https://pytorch.org/. That did not work for me, probably because of my network.

5. Additional notes

tokens --> {input_ids, input_mask, segment_ids}
Input parameters of the BERT model: {guid, input_ids, input_mask, segment_ids}


A snippet of the TensorFlow log:

I0818 00:30:55.294839  4156 run_classifier.py:465] guid: None
INFO:tensorflow:tokens: [CLS] this here ' s an example of using the bert token ##izer [SEP] fuck you [SEP]
I0818 00:30:55.294839  4156 run_classifier.py:467] tokens: [CLS] this here ' s an example of using the bert token ##izer [SEP] fuck you [SEP]
INFO:tensorflow:input_ids: 101 2023 2182 1005 1055 2019 2742 1997 2478 1996 14324 19204 17629 102 6616 2017 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I0818 00:30:55.294839  4156 run_classifier.py:468] input_ids: 101 2023 2182 1005 1055 2019 2742 1997 2478 1996 14324 19204 17629 102 6616 2017 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I0818 00:30:55.294839  4156 run_classifier.py:469] input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I0818 00:30:55.294839  4156 run_classifier.py:470] segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:label: 0 (id = 0)
I0818 00:30:55.294839  4156 run_classifier.py:471] label: 0 (id = 0)

6. Summary

BERT is extremely memory- and compute-hungry, and training is slow; use a GPU if at all possible, otherwise a single epoch may not even finish in a day.
After one epoch the validation accuracy is already 90.72%, which is quite high, but the second epoch barely improves on it, which points to overfitting.

7. References

[1] https://github.com/atherosai/python-graphql-nlp-transformers/tree/master/notebooks/BERT%20fine-tunning%20in%20Tensorflow%202%20with%20Keras%20API
[2] https://medium.com/atheros/text-classification-with-transformers-in-tensorflow-2-bert-2f4f16eff5ad
[3] The Hugging Face website
[4] Google Research's open-source BERT code on GitHub
