Selected tBERT code (for self-study)

tbert.py

from src.loaders.load_data import load_data
from src.models.base_model_bert import model,test_opt
import argparse

# run tbert with different learning rates on a certain dataset
# example usage: python src/experiments/tbert.py -learning_rate 5e-05 -gpu 0 -topic_type ldamallet -topic word -dataset MSRP --debug

# command line arguments
parser = argparse.ArgumentParser()
parser.register("type", "bool", lambda v: v.lower() == "true")
parser.add_argument('-dataset', action="store", dest="dataset", type=str, default='MSRP')
parser.add_argument('-learning_rate', action="store", dest="learning_rate", type=str, default='3e-5') # learning_rates = [5e-5, 2e-5, 4e-5, 3e-5]
parser.add_argument('-layers', action="store", dest="hidden_layers", type=str, default='0')
parser.add_argument('-topic', action="store", dest="topic", type=str, default='word')
parser.add_argument('-gpu', action="store", dest="gpu", type=int, default=-1)
parser.add_argument("--speedup_new_layers",type="bool",nargs="?",const=True,default=False,help="Use 100 times higher learning rate for new layers.")
parser.add_argument("--debug",type="bool",nargs="?",const=True,default=False,help="Try to use small number of examples for troubleshooting")
parser.add_argument("--train_longer",type="bool",nargs="?",const=True,default=False,help="Train for 9 epochs")
parser.add_argument("--early_stopping",type="bool",nargs="?",const=True,default=False)
parser.add_argument("--unk_topic_zero",type="bool",nargs="?",const=True,default=False)
parser.add_argument('-seed', action="store", dest="seed", type=str, default='fixed')
parser.add_argument('-topic_type', action="store", dest="topic_type", type=str, default='ldamallet')
FLAGS, unparsed = parser.parse_known_args()

# sanity check command line arguments
if len(unparsed) > 0:
    parser.print_help()
    # raise manually aborts the script here and attaches the offending arguments to the error message
    raise ValueError('Unidentified command line arguments passed: {}\n'.format(str(unparsed)))

# setting model options based on flags
# set and check dataset
dataset = FLAGS.dataset
assert dataset in ['MSRP', 'Semeval_A', 'Semeval_B', 'Semeval_C', 'Quora']  # fail early if the dataset is not supported

# set and check hidden layers
hidden_layers = [int(h) for h in FLAGS.hidden_layers.split(',')]
for h in hidden_layers:
    assert h in [0, 1, 2]

# set and check topic scope (word level or document level)
topics = FLAGS.topic.split(',')
for t in topics:
    assert t in ['word', 'doc']

# initialise experiment lists and default parameters
priority = []
todo = []
last = []
stopping_criterion = None  # 'F1'
patience = None
batch_size = 32  # standard minibatch size

# set task, batch_size, num_topics and alpha depending on the dataset
if 'Semeval' in dataset:
    dataset, task = dataset.split('_')
    subsets = ['train_large', 'test2016', 'test2017']
    if task in ['A']:
        batch_size = 16  # need smaller minibatch to fit on GPU due to long sentences
        num_topics = 70
        if FLAGS.topic_type == 'gsdmm':
            alpha = 0.1
        else:
            alpha = 50
    elif task == 'B':
        num_topics = 80
        if FLAGS.topic_type == 'gsdmm':
            alpha = 0.1
        else:
            alpha = 10
    elif task == 'C':
        batch_size = 16  # need smaller minibatch to fit on GPU due to long sentences
        num_topics = 70
        if FLAGS.topic_type == 'gsdmm':
            alpha = 0.1
        else:
            alpha = 10
else:
    task = 'B'
    if dataset == 'Quora':
        subsets = ['train', 'dev', 'test']
        num_topics = 90
        if FLAGS.topic_type == 'gsdmm':
            alpha = 0.1
        else:
            alpha = 1
        task = 'B'
    else:
        subsets = ['train', 'dev', 'test']  # MSRP
        num_topics = 80
        if FLAGS.topic_type == 'gsdmm':
            alpha = 0.1
        else:
            alpha = 1
        task = 'B'

if FLAGS.debug:
    max_m = 100
else:
    max_m = None

if FLAGS.train_longer:
    epochs = 9
    predict_every_epoch = True
else:
    epochs = 3
    predict_every_epoch = False

if FLAGS.early_stopping:
    patience = 2
    stopping_criterion = 'F1'

try:
    seed = int(FLAGS.seed)
except:
    seed = None

if FLAGS.unk_topic_zero:
    unk_topic = 'zero'
else:
    unk_topic = 'uniform'

# build one option dict per topic scope (word or doc level) and hidden layer setting
for topic_scope in topics:
    for hidden_layer in hidden_layers:
        opt = {'dataset': dataset,
               'datapath': 'data/',
               'model': 'bert_simple_topic',
               'bert_update': True,
               'bert_cased': False,
               'tasks': [task],
               'subsets': subsets,
               'seed': seed,
               'minibatch_size': batch_size,
               'L2': 0,
               'max_m': max_m,
               'load_ids': True,
               'topic': topic_scope,
               'topic_update': False,
               'num_topics': num_topics,
               'topic_alpha': alpha,
               'unk_topic': unk_topic,
               'topic_type': FLAGS.topic_type,
               'unk_sub': False,
               'padding': False,
               'simple_padding': True,
               'learning_rate': float(FLAGS.learning_rate),
               'num_epochs': epochs,
               'hidden_layer': hidden_layer,
               'sparse_labels': True,
               'max_length': 'minimum',
               'optimizer': 'Adam',
               'dropout': 0.1,
               'gpu': FLAGS.gpu,
               'speedup_new_layers': FLAGS.speedup_new_layers,
               'predict_every_epoch': predict_every_epoch,
               'stopping_criterion': stopping_criterion,
               'patience': patience}
        todo.append(opt)  # append to the todo list defined above

tasks = todo  # list of experiment configurations

if __name__ == '__main__':
    for i, opt in enumerate(tasks):
        print('Starting experiment {} of {}'.format(i + 1, len(tasks)))
        l_rate = str(opt['learning_rate']).replace('-0', '-')  # drop the superfluous zero from the learning rate string
        if FLAGS.speedup_new_layers:
            log = 'tbert_{}_seed_speedup_new_layers.json'.format(str(seed))
        elif FLAGS.train_longer:
            log = 'tbert_{}_seed_train_longer.json'.format(str(seed))
        else:
            log = 'tbert_{}_seed.json'.format(str(seed))
        if FLAGS.early_stopping:
            log = log.replace('.json', '_early_stopping.json')
        print(log)
        print(opt)
        test_opt(opt)  # check that only accepted option keys were passed
        data = load_data(opt, cache=True, write_vocab=False)  # load text and topic information
        if FLAGS.debug:
            # print(data[''])
            print(data['E1'][0].shape)
            print(data['E1'][1].shape)
            print(data['E1'][2].shape)
            print(data['E1_mask'][0])
            print(data['E1_seg'][0])
        opt = model(data, opt, logfile=log, print_dim=True)
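For self-checking, the dataset-dependent defaults derived above can be summarised in one small table. The snippet below is my own illustration (not part of the repository) and assumes the ldamallet topic type; for gsdmm the script always sets alpha = 0.1.

# Illustrative summary of the per-dataset defaults chosen by tbert.py (ldamallet assumed).
DATASET_DEFAULTS = {
    # dataset:    (batch_size, num_topics, alpha)
    'MSRP':       (32, 80, 1),
    'Quora':      (32, 90, 1),
    'Semeval_A':  (16, 70, 50),
    'Semeval_B':  (32, 80, 10),
    'Semeval_C':  (16, 70, 10),
}

for name, (bs, k, a) in DATASET_DEFAULTS.items():
    print('{:10s} batch_size={:2d} num_topics={:2d} alpha={}'.format(name, bs, k, a))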

load_data.py

import importlib
import os
import pickle
import numpy as np

from src.loaders.Quora.build import build
from src.loaders.augment_data import create_large_train, double_task_training_data
from src.preprocessing.Preprocessor import Preprocessor, get_onehot_encoding, reduce_embd_id_len
from src.topic_model.topic_loader import load_document_topics, load_word_topics
from src.topic_model.topic_visualiser import read_topic_key_table


# returns file names combining dataset, subset and task, e.g. m_train_B, m_dev_B, m_test_B
def get_filenames(opt):
    filenames = []  # name of cache files
    for s in opt['subsets']:
        for t in opt['tasks']:
            prefix = ''
            if opt['dataset'] == 'Quora':
                if s.startswith('p_'):
                    prefix = ''
                else:
                    prefix = 'q_'
            if opt['dataset'] == 'PAWS':
                prefix = 'p_'
            if opt['dataset'] == 'MSRP':
                prefix = 'm_'
            filenames.append(prefix + s + '_' + t)  # m_train_B, m_dev_B, m_test_B
    return filenames


# returns file paths such as data/MSRP/m_train_B.txt
def get_filepath(opt):
    filepaths = []
    for name in get_filenames(opt):
        if 'quora' in name:  # why handled separately?
            filepaths.append(os.path.join(opt['datapath'], 'Quora', name + '.txt'))
            print('quora in filename')
        else:
            filepaths.append(os.path.join(opt['datapath'], opt['dataset'], name + '.txt'))
    return filepaths  # e.g. data/MSRP/m_train_B.txt


# loads a file and returns (ID1, ID2, D1, D2, L): document ids, document texts and labels
def load_file(filename, onehot=True):
    """Reads file and returns tuple of (ID1, ID2, D1, D2, L) if ids=False"""
    # todo: return dictionary
    ID1 = []
    ID2 = []
    D1 = []
    D2 = []
    L = []
    with open(filename, 'r', encoding='utf-8') as read:
        for i, line in enumerate(read):
            if not len(line.split('\t')) == 5:
                print(line.split('\t'))
            id1, id2, d1, d2, label = line.rstrip().split('\t')
            ID1.append(id1)
            ID2.append(id2)
            D1.append(d1)
            D2.append(d2)
            if 's_' in filename:
                if float(label) >= 4:
                    label = 1
                elif float(label) < 4:
                    label = 0
                else:
                    ValueError()
            L.append(int(label))
    L = np.array(L)  # label array
    # L = L.reshape(len(D1),1)
    if onehot:
        classes = L.shape[1] + 1
        L = get_onehot_encoding(L)
        print('Encoding labels as one hot vector.')
    return (ID1, ID2, D1, D2, L)


# sets the maximum length of the two sentences s1 and s2 depending on the dataset
def get_dataset_max_length(opt):
    '''
    Determine maximum number of tokens in both sentences, as well as highest max length for current task
    :param opt:
    :return: [maximum length of sentence in tokens, should first sentence be shortened?]
    '''
    tasks = opt['tasks']
    if opt['dataset'] in ['Quora', 'PAWS', 'GlueQuora']:
        cutoff = opt.get('max_length', 24)
        if cutoff == 'minimum':
            cutoff = 24
        s1_len, s2_len = cutoff, cutoff
    elif opt['dataset'] == 'MSRP':
        cutoff = opt.get('max_length', 40)
        if cutoff == 'minimum':
            cutoff = 40
        s1_len, s2_len = cutoff, cutoff
    elif 'B' in tasks:
        cutoff = opt.get('max_length', 100)
        if cutoff == 'minimum':
            cutoff = 100
        s1_len, s2_len = cutoff, cutoff
    elif 'A' in tasks or 'C' in tasks:
        cutoff = opt.get('max_length', 200)
        if cutoff == 'minimum':
            s1_len = 100
            s2_len = 200
        else:
            s1_len, s2_len = cutoff, cutoff
    return s1_len, s2_len, max([s1_len, s2_len])


# truncates the data, keeping only the first m examples
def reduce_examples(matrices, m):
    '''
    Reduces the size of matrices
    :param matrices:
    :param m: maximum number of examples
    :return:
    '''
    return [matrix[:m] for matrix in matrices]  # matrix[:m] keeps the first m examples


def create_missing_datafiles(opt, datafile, datapath):
    if not os.path.exists(datapath) and 'large' in datafile:
        create_large_train()
    if not os.path.exists(datapath) and 'double' in datafile:
        double_task_training_data()
    if not os.path.exists(datapath) and 'quora' in datafile:
        quora_opt = opt
        quora_opt['dataset'] = 'Quora'
        build(quora_opt)


# returns the cache folder, e.g. data/cache/
def get_cache_folder(opt):
    return opt['datapath'] + 'cache/'


# returns id1, id2, r1, r2, t1, t2, l per subset
def load_cache_or_process(opt, cache, onehot):
    ID1 = []
    ID2 = []
    R1 = []
    R2 = []
    T1 = []
    T2 = []
    L = []
    filenames = get_filenames(opt)  # m_train_B, m_dev_B, m_test_B
    print(filenames)
    filepaths = get_filepath(opt)  # data/MSRP/m_train_B.txt
    print(filepaths)
    for datafile, datapath in zip(filenames, filepaths):  # zip() pairs each cache name with its file path
        create_missing_datafiles(opt, datafile, datapath)  # if necessary
        cache_folder = get_cache_folder(opt)  # data/cache/
        if not os.path.exists(cache_folder):  # create the cache folder if it does not exist
            os.mkdir(cache_folder)
        cached_path = cache_folder + datafile + '.pickle'  # e.g. data/cache/m_train_B.pickle
        print(cached_path)
        if cache and os.path.isfile(cached_path):  # is there a cached file for this subset?
            print("Loading cached input for " + datafile)
            try:
                with open(cached_path, 'rb') as f:
                    id1, id2, r1, r2, t1, t2, l = pickle.load(f)
            except ValueError:
                Warning('No ids loaded from cache: {}.'.format(cached_path))
                with open(cached_path, 'rb') as f:
                    r1, r2, l = pickle.load(f)
                id1 = None
                id2 = None
        # do preprocessing if cache not available
        else:
            print('Creating cache...')
            load_ids = opt.get('load_ids', True)  # True
            if not load_ids:
                DeprecationWarning('Load_ids is deprecated setting. Now loaded automatically.')
            id1, id2, r1, r2, l = load_file(datapath, onehot)
            t1 = Preprocessor.basic_pipeline(r1)  # process text: lowercase, tokenise, remove stopwords
            t2 = Preprocessor.basic_pipeline(r2)  # process text: lowercase, tokenise, remove stopwords
            if cache:  # don't overwrite existing data if cache=False
                pickle.dump((id1, id2, r1, r2, t1, t2, l), open(cached_path, "wb"))  # store the new data as the current cache
        ID1.append(id1)
        ID2.append(id2)
        R1.append(r1)
        R2.append(r2)
        L.append(l)
        T1.append(t1)
        T2.append(t2)
    return {'ID1': ID1, 'ID2': ID2, 'R1': R1, 'R2': R2, 'T1': T1, 'T2': T2, 'L': L}


# reads data, does preprocessing based on the option dict and returns a data dictionary
def load_data(opt, cache=True, numerical=True, onehot=False, write_vocab=False):
    """
    Reads data and does preprocessing based on options file and returns a data dictionary.
    Tokens will always be loaded, other keys depend on settings and will contain None if not available.
    :param opt: option dictionary, containing task and dataset info
    :param numerical: map tokens to embedding ids or not
    :param onehot: load labels as one hot representation or not
    :param write_vocab: write vocabulary to file or not
    :param cache: try to use cached preprocessed data or not
    :return: { # essential:
              'ID1': ID1, 'ID2': ID2,  # doc ids
              'R1': R1, 'R2': R2,  # raw text
              'L': L,  # labels
              # optional for word embds:
              'E1': E1, 'E2': E2,  # embedding ids
              'embd': embd,  # word embedding matrix
              'mapping_rates': mapping_rates,
              # optional for topics:
              'D_T1': D_T1, 'D_T2': D_T2,  # document topics
              'word_topics': word_topics,  # word topic matrix
              'topic_keys': topic_key_table}  # key word explanation for topics
    """
    E1 = None        # embedding ids
    E1_mask = None
    E1_seg = None
    E2 = None        # embedding ids
    D_T1 = None      # document topics
    D_T2 = None      # document topics
    W_T1 = None
    W_T2 = None
    W_T = None
    topic_key_table = None
    mapping_rates = None
    word_topics = None  # word topic matrix
    word2id = None
    id2word = None
    embd = None      # word embedding matrix
    vocab = []

    # get options
    # import the dataset-specific build module and download/reformat the dataset (train/dev/test) if necessary
    dataset = opt['dataset']  # dataset name
    module_name = "src.loaders.{}.build".format(dataset)  # build module in the dataset folder
    my_module = importlib.import_module(module_name)  # importlib.import_module imports a module dynamically
    my_module.build(opt)  # download and reformat if not existing

    # determine topic scope and topic model type
    topic_scope = opt.get('topic', '')  # dict.get(key, default=None) returns the value for key or the default
    if not topic_scope == '':  # if a topic scope is set, choose the topic model type
        topic_type = opt['topic_type'] = opt.get('topic_type', 'ldamallet')  # topic model type, default ldamallet
        topic_update = opt.get('topic_update', False)
        assert topic_update in [True, False]  # no backward compatibility
    assert topic_scope in ['', 'word', 'doc', 'word+doc', 'word+avg']
    recover_topic_peaks = opt['unflat_topics'] = opt.get('unflat_topics', False)

    # read further required parameters
    w2v_limit = opt.get('w2v_limit', None)
    assert w2v_limit is None  # discontinued
    calculate_mapping_rate = opt.get('mapping_rate', False)
    tasks = opt.get('tasks', '')
    assert len(tasks) > 0  # at least one task is required
    unk_topic = opt['unk_topic'] = opt.get('unk_topic', 'uniform')
    assert unk_topic in ['uniform', 'zero', 'min', 'small']
    s1_max_len, s2_max_len, max_len = get_dataset_max_length(opt)  # sentence length limits (both 40 for MSRP)
    max_m = opt.get('max_m', None)  # maximum number of examples
    bert_processing = 'bert' in opt.get('model', '')  # special tokens for BERT

    # load or create cache
    cache = load_cache_or_process(opt, cache, onehot)  # load max_m examples
    ID1 = cache['ID1']
    ID2 = cache['ID2']
    R1 = cache['R1']
    R2 = cache['R2']
    T1 = cache['T1']
    T2 = cache['T2']
    L = cache['L']

    # map words to embedding ids
    if numerical:
        print('Mapping words to BERT ids...')
        bert_cased = opt['bert_cased'] = opt.get('bert_cased', False)  # False
        bert_large = opt['bert_large'] = opt.get('bert_large', False)  # False
        # use raw text rather than tokenized text as input due to different preprocessing steps for BERT
        processor_output = Preprocessor.map_files_to_bert_ids(R1, R2, s1_max_len + s2_max_len,
                                                              calculate_mapping_rate, bert_cased=bert_cased, bert_large=bert_large)
        print('Finished word id mapping.')
        E1 = processor_output['E1']
        E1_mask = processor_output['E1_mask']
        E1_seg = processor_output['E1_seg']
        # E2 = processor_output['E2']  # None
        word2id = processor_output['word2id']
        id2word = processor_output['id2word']
        mapping_rates = processor_output['mapping_rates']
        if not bert_processing and not s1_max_len == max_len:
            E1 = reduce_embd_id_len(E1, tasks, cutoff=s1_max_len)
        if not bert_processing and not s2_max_len == max_len:
            E2 = reduce_embd_id_len(E2, tasks, cutoff=s2_max_len)

    if 'doc' in topic_scope:
        # load the topic distribution inferred for each sentence in the dataset
        doc_topics = load_document_topics(opt, recover_topic_peaks=recover_topic_peaks, max_m=None)
        D_T1 = doc_topics['D_T1']
        D_T2 = doc_topics['D_T2']

    # reduce number of examples after mapping words to ids to ensure static mapping regardless of max_m
    if not ID1 is None:
        ID1 = reduce_examples(ID1, max_m)  # keep at most max_m examples
        ID2 = reduce_examples(ID2, max_m)
    R1 = reduce_examples(R1, max_m)
    R2 = reduce_examples(R2, max_m)
    T1 = reduce_examples(T1, max_m)
    T2 = reduce_examples(T2, max_m)
    if not E1_mask is None:
        # reduce examples for bert
        E1 = reduce_examples(E1, max_m)  # [train, dev, test]
        E1_mask = reduce_examples(E1_mask, max_m)
        E1_seg = reduce_examples(E1_seg, max_m)
    elif not E1 is None:
        E1 = reduce_examples(E1, max_m)
        E2 = reduce_examples(E2, max_m)
    if 'doc' in topic_scope:
        # reduce doc topics here after shuffling
        D_T1 = reduce_examples(D_T1, max_m)
        D_T2 = reduce_examples(D_T2, max_m)
    L = reduce_examples(L, max_m)

    # load topic related data
    if 'word' in topic_scope:
        # read the word topic vectors from file;
        # load_word_topics returns {'complete_topic_dict': ..., 'topic_dict': ..., 'word_id_dict': ..., 'topic_matrix': ...}
        word_topics = load_word_topics(opt, recover_topic_peaks=recover_topic_peaks)
        word2id_dict = word_topics['word_id_dict']  # look up ids by word
        print('Mapping words to topic ids...')
        W_T1 = [Preprocessor.map_topics_to_id(r, word2id_dict, s1_max_len, opt) for r in T1]
        W_T2 = [Preprocessor.map_topics_to_id(r, word2id_dict, s2_max_len, opt) for r in T2]  # map each token to its word topic id
    if ('word' in topic_scope) or ('doc' in topic_scope):
        topic_key_table = read_topic_key_table(opt)  # read the topic -> key words table
    print('Done.')

    data_dict = {'ID1': ID1, 'ID2': ID2,  # doc ids
                 'R1': R1, 'R2': R2,  # raw text
                 'T1': T1, 'T2': T2,  # tokenized text
                 'E1': E1, 'E2': E2,  # embedding ids
                 'E1_mask': E1_mask, 'E1_seg': E1_seg,  # BERT input mask and segment ids
                 'W_T1': W_T1, 'W_T2': W_T2,  # separate word topic ids
                 'W_T': W_T,  # joined word topic ids
                 'D_T1': D_T1, 'D_T2': D_T2,  # document topics
                 'L': L,  # labels
                 # misc
                 'mapping_rates': mapping_rates,  # optional
                 'id2word': id2word,
                 'word2id': word2id,
                 'word_topics': word_topics,  # word topic distributions
                 'topic_keys': topic_key_table  # key words per topic
                 }
    return data_dict


# Example usage
if __name__ == '__main__':
    opt = {'dataset': 'MSRP',
           'datapath': 'data/',
           'tasks': ['B'],
           'max_length': 'minimum',
           'subsets': ['train', 'dev', 'test'],
           'model': 'bert_simple_topic',
           'load_ids': True,
           'cache': True,
           'w2v_limit': None,
           # 'pretrained_embeddings': 'Deriu',
           'topic': 'word',
           'num_topics': 50,
           'topic_alpha': 10,
           'topic_type': 'ldamallet'
           # , 'max_m': 10000
           }
    data_dict = load_data(opt, cache=True, numerical=True, onehot=False)
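map_files_to_bert_ids produces three parallel arrays per sentence pair: token ids, an attention mask, and segment ids. The toy sketch below (my own illustration, not repository code) only imitates that packing with whitespace tokens and a made-up vocabulary, just to make the three arrays concrete; the real pipeline uses the BERT WordPiece tokenizer.

# Toy illustration of BERT-style inputs for one sentence pair.
def toy_bert_inputs(s1, s2, max_len=12):
    tokens = ['[CLS]'] + s1.split() + ['[SEP]'] + s2.split() + ['[SEP]']
    segments = [0] * (len(s1.split()) + 2) + [1] * (len(s2.split()) + 1)
    vocab = {tok: i + 1 for i, tok in enumerate(dict.fromkeys(tokens))}  # toy ids, 0 reserved for padding
    ids = [vocab[t] for t in tokens]
    mask = [1] * len(ids)
    pad = max_len - len(ids)
    return ids + [0] * pad, mask + [0] * pad, segments + [0] * pad

ids, mask, seg = toy_bert_inputs('the cat sat', 'a cat sat down')
print(ids)   # [1, 2, 3, 4, 5, 6, 3, 4, 7, 5, 0, 0]
print(mask)  # 1 for real tokens, 0 for padding
print(seg)   # 0 for sentence 1 (incl. [CLS] and first [SEP]), 1 for sentence 2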

bert_simple_topic.py

import tensorflow as tf
from src.models.tf_helpers import maybe_print


def forward_propagation(input_dict, classes, hidden_layer=0, reduction_factor=2, dropout=0, seed_list=[], print_dim=False):
    """
    Defines forward pass for tBERT model
    Returns: logits
    """
    if print_dim:
        print('---')
        print('Model: tBERT')
        print('---')
    # word topics
    if input_dict['D_T1'] is None:
        W_T1 = input_dict['W_T1']  # (batch, sent_len_1, num_topics)
        W_T2 = input_dict['W_T2']  # (batch, sent_len_2, num_topics)
        maybe_print([W_T1, W_T2], ['W_T1', 'W_T2'], print_dim)
        # compute mean of word topics over the tokens of each sentence
        D_T1 = tf.reduce_mean(W_T1, axis=1)  # (batch, num_topics)
        D_T2 = tf.reduce_mean(W_T2, axis=1)  # (batch, num_topics)
    # document topics
    elif input_dict['W_T1'] is None:
        D_T1 = input_dict['D_T1']  # (batch, num_topics)
        D_T2 = input_dict['D_T2']  # (batch, num_topics)
    else:
        ValueError('Word or document topics need to be provided for bert_simple_topic.')
    maybe_print([D_T1, D_T2], ['D_T1', 'D_T2'], print_dim)

    # BERT representation
    bert = input_dict['E1']
    # bert has 2 keys: sequence_output, the output embedding for each token, and pooled_output, the output embedding for the entire sequence.
    with tf.name_scope('bert_rep'):
        # pooled output (containing extra dense layer)
        bert_rep = bert['pooled_output']  # pooled output over entire sequence
        maybe_print([bert_rep], ['pooled BERT'], print_dim)
        # C vector from last layer corresponding to CLS token
        # bert_rep = bert['sequence_output'][:, 0, :]  # shape (batch, BERT_hidden)
        # maybe_print([bert_rep], ['BERT C vector'], print_dim)
        # dropout randomly zeroes part of the units to reduce reliance on individual weights and limit overfitting
        bert_rep = tf.layers.dropout(inputs=bert_rep, rate=dropout, seed=seed_list.pop(0))

    # combine BERT with document topics
    combined = tf.concat([bert_rep, D_T1, D_T2], -1)
    maybe_print([combined], ['combined'], print_dim)

    if hidden_layer > 0:
        with tf.name_scope('hidden_1'):
            hidden_size = combined.shape[-1].value / reduction_factor
            combined = tf.layers.dense(  # fully connected layer
                combined,       # layer input
                hidden_size,    # output dimension
                activation=tf.tanh,  # activation function
                kernel_initializer=tf.truncated_normal_initializer(stddev=0.02, seed=seed_list.pop(0)))
            # tf.truncated_normal_initializer draws from a truncated normal; values more than 2 stddev from the mean are redrawn
            maybe_print([combined], ['hidden 1'], print_dim)
            combined = tf.layers.dropout(inputs=combined, rate=dropout, seed=seed_list.pop(0))  # dropout after the dense layer against overfitting
    if hidden_layer > 1:
        with tf.name_scope('hidden_2'):
            hidden_size = combined.shape[-1].value / reduction_factor
            combined = tf.layers.dense(  # second fully connected layer
                combined,
                hidden_size,
                activation=tf.tanh,
                kernel_initializer=tf.truncated_normal_initializer(stddev=0.02, seed=seed_list.pop(0)))
            maybe_print([combined], ['hidden 2'], print_dim)
            combined = tf.layers.dropout(inputs=combined, rate=dropout, seed=seed_list.pop(0))
    if hidden_layer > 2:
        raise ValueError('Only 2 hidden layers supported.')

    with tf.name_scope('output_layer'):  # feed the combined representation into the output layer to compute logits
        hidden_size = combined.shape[-1].value
        output_weights = tf.get_variable("output_weights", [classes, hidden_size],
                                         initializer=tf.truncated_normal_initializer(stddev=0.02, seed=seed_list.pop(0)))
        # tf.get_variable looks up a variable by name and creates it if it does not exist yet
        output_bias = tf.get_variable("output_bias", [classes], initializer=tf.zeros_initializer())
        logits = tf.matmul(combined, output_weights, transpose_b=True)  # matrix multiplication
        logits = tf.nn.bias_add(logits, output_bias)  # add the bias term
        maybe_print([logits], ['output layer'], print_dim)
    output = {'logits': logits}
    return output  # the returned logits are used to compute the loss
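The core of this forward pass is simply "mean-pool the word topics, then concatenate with the pooled BERT vector". A small NumPy sketch of the word-topic branch (illustrative shapes only, not repository code):

import numpy as np

batch, sent_len, num_topics, bert_hidden = 2, 5, 70, 768
W_T1 = np.random.rand(batch, sent_len, num_topics)   # word topic distributions, sentence 1
W_T2 = np.random.rand(batch, sent_len, num_topics)   # word topic distributions, sentence 2
bert_pooled = np.random.rand(batch, bert_hidden)     # stands in for bert['pooled_output']

D_T1 = W_T1.mean(axis=1)                             # like tf.reduce_mean(W_T1, axis=1) -> (batch, num_topics)
D_T2 = W_T2.mean(axis=1)
combined = np.concatenate([bert_pooled, D_T1, D_T2], axis=-1)
print(combined.shape)                                # (2, 768 + 70 + 70) = (2, 908)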

base_model_bert.py

import math
import traceback
import numpy as np
import sys
import importlib
import argparse
import tensorflow as tf
import os
#rootPath = os.path.abspath(os.path.join(os.getcwd(), "../.."))
#sys.path.append(rootPath)
sys.path.append('E:\Project\Python-Project\A-tBERT')
sys.path.append('E:\Project\Python-Project\A-tBERT\src')
sys.path.append('E:\Project\Python-Project\A-tBERT\src\models')

from src.models.tf_helpers import lookup_embedding, initialise_pretrained_embedding
from src.models.tf_helpers import compute_cost
from src.models.tf_helpers import create_placeholders,create_doc_topic_placeholders,create_word_topic_placeholders
from src.models.helpers.minibatching import create_minibatches
from src.models.tf_helpers import maybe_print
from tensorflow.python import debug as tf_debug
from tensorflow.python.framework import ops
import tensorflow_hub as hub
import random
np.random.seed(1)
from src.loaders.load_data import load_data
from src.logs.training_logs import write_log_entry, start_timer, end_timer, get_new_id
from src.models.save_load import save_model, load_model, create_saver, get_model_dir, create_model_folder, \
    delete_all_checkpoints_but_best
from src.evaluation.evaluate import output_predictions, get_confidence_scores, save_eval_metrics
from src.models.helpers.base import add_git_version,skip_MAP,extract_data
from src.models.helpers.bert import get_bert_version
from src.models.helpers.training_regimes import standard_training_regime, layer_specific_regime
import os
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning, module="tensorflow")
# warnings.filterwarnings('error')

"""
Implements Base Model with early stopping and auxiliary loss
"""

predefined_opts = [
    # loader related keywords
    'dataset', 'datapath', 'tasks', 'subsets', 'w2v_limit', 'max_m', 'max_length', 'n_gram_embd', 'padding', 'unk_sub', 'simple_padding',
    # model related keywords
    'model', 'load_ids', 'learning_rate', 'num_epochs', 'minibatch_size', 'bert_update', 'bert_cased', 'bert_large',
    'optimizer', 'epsilon', 'rho', 'L2', 'sparse_labels', 'dropout', 'hidden_reduce', 'hidden_layer', 'checkpoints', 'stopping_criterion', 'model',
    # topic related keywords
    'patience', 'unk_topic', 'topic_encoder', 'unflat_topics', 'topic_update', 'topic_alpha', 'num_topics', 'topic_type', 'topic', 'injection_location',
    # other keywords
    'speedup_new_layers', 'freeze_thaw_tune', 'predict_every_epoch', 'gpu', 'git', 'seed']


def test_opt(opt):
    '''
    Test if opt contains unused or wrong options
    :param opt:
    :return:
    '''
    for k in opt.keys():
        assert k in predefined_opts, '{} not in accepted options'.format(k)


def model(data_dict, opt, logfile=None, print_dim=False):
    """
    Creates and executes Tensorflow graph for BERT-based models
    Arguments:
    data_dict -- contains all necessary data for model
    opt -- option log, contains learning_rate, num_epochs, minibatch_size, ...
    logfile -- path of file to save opt and results
    print_dim -- print dimensions for debugging purposes
    Returns:
    opt -- updated option log
    parameters -- trained parameters of model
    """
    #####
    # Read options, set defaults and update log
    #####
    try:
        # check input options
        print(opt)
        test_opt(opt)
        if opt.get('git', None) is None:
            add_git_version(opt)  # keep track of git SHA

        ## read options
        # assign variables (trailing comments note the values used in this self-study run)
        opt['model'] = opt.get('model', 'bert')  # bert_simple_topic
        assert 'bert' in opt['model']
        learning_rate = opt['learning_rate'] = opt.get('learning_rate', 5e-5)  # small learning rate for pretrained BERT layers
        speedup_new_layers = opt['speedup_new_layers'] = opt.get('speedup_new_layers', False)  # False
        freeze_thaw_tune = opt['freeze_thaw_tune'] = opt.get('freeze_thaw_tune', False)  # False
        layer_specific_lr = speedup_new_layers or freeze_thaw_tune  # False
        num_epochs = opt.get('num_epochs', None)  # get num of planned epochs (3)
        opt['num_epochs'] = 0  # use this to keep track of finished epochs
        minibatch_size = opt['minibatch_size'] = opt.get('minibatch_size', 64)  # 32
        bert_embd = True
        bert_update = opt['bert_update'] = opt.get('bert_update', False)  # True
        bert_large = opt['bert_large'] = opt.get('bert_large', False)  # False
        cased = opt['bert_cased'] = opt.get('bert_cased', False)  # False
        starter_seed = opt['seed'] = opt.get('seed', None)  # fixed
        if not type(starter_seed) == int:
            assert starter_seed == None
        # layers = opt['layers'] = opt.get('layers', 1)
        hidden_layer = opt['hidden_layer'] = opt.get('hidden_layer', 0)  # 0  # add hidden layer before softmax layer?
        assert hidden_layer in [0, 1, 2]
        topic_encoder = opt['topic_encoder'] = opt.get('topic_encoder', None)  # None
        L_R_unk = opt.get('unk_sub', False)  # False
        assert L_R_unk is False
        # assert encoder in ['word', 'ffn', 'cnn', 'lstm', 'bilstm', 'word+cnn', 'word+ffn', 'word+lstm', 'word+bilstm']
        assert topic_encoder in [None, 'ffn', 'cnn', 'lstm', 'bilstm']  # None
        optimizer_choice = opt['optimizer'] = opt.get('optimizer', 'Adadelta')  # Adam  # which optimiser to use?
        assert optimizer_choice in ['Adam', 'Adadelta']
        epsilon = opt['epsilon'] = opt.get('epsilon', 1e-08)  # 1e-08
        rho = opt['rho'] = opt.get('rho', 0.95)  # 0.95
        L2 = opt['L2'] = opt.get('L2', 0)  # L2 regularisation (0)
        dropout = opt['dropout'] = opt.get('dropout', 0)  # 0.1
        assert not (L2 > 0 and dropout > 0), 'Use dropout or L2 regularisation, not both. Current settings: L2={}, dropout={}.'.format(L2, dropout)
        sparse = opt['sparse_labels'] = opt.get('sparse_labels', True)  # True  # are labels encoded as sparse?
        save_checkpoints = opt.get('checkpoints', False)  # False  # save all checkpoints?
        stopping_criterion = opt['stopping_criterion'] = opt.get('stopping_criterion', None)  # None  # which metric is used as early stopping criterion?
        assert stopping_criterion in [None, 'cost', 'MAP', 'F1', 'Accuracy']
        if stopping_criterion is None and num_epochs is None:
            raise ValueError('Invalid parameter combination. Stopping criterion and number of epochs cannot both be None.')
        early_stopping = stopping_criterion in ['F1', 'cost', 'MAP', 'Accuracy']  # False
        predict_every_epoch = opt['predict_every_epoch'] = opt.get('predict_every_epoch', False)  # False
        reduction_factor = opt['hidden_reduce'] = opt.get('hidden_reduce', 2)  # 2
        patience = opt['patience'] = opt.get('patience', 20)  # 20

        # topic models
        topic_scope = opt['topic'] = opt.get('topic', '')  # word
        if opt['model'] == 'bert_simple_topic':
            assert topic_scope in ['word', 'doc']
        elif opt['model'] == 'bert':
            topic_scope = ''
        else:
            raise NotImplementedError()

        # dynamically import the module that defines the forward pass for the chosen model
        module_name = "src.models.forward.{}".format(opt['model'])
        model = importlib.import_module(module_name)

        if 'word' in topic_scope:
            topic_update = opt['topic_update'] = opt.get('topic_update', False)  # False  # None for backward compatibility
        num_topics = opt['num_topics'] = opt.get('num_topics', None)  # 80
        topic_type = opt['topic_type'] = opt.get('topic_type', None)  # ldamallet
        if not topic_scope == '':
            assert 'topic' in opt['model']
            assert num_topics > 1
            assert topic_type in ['LDA', 'ldamallet', 'gsdmm']
            opt['topic_alpha'] = opt.get('topic_alpha', 50)  # 1
        else:
            assert num_topics is None
            assert topic_type is None
        if opt['dataset'] == 'Quora' and opt['subsets'] == ['train', 'dev', 'test', 'p_test']:
            extra_test = True
        else:
            extra_test = False  # False
        injection_location = opt['injection_location'] = opt.get('injection_location', None)  # None
        if 'inject' in opt['model']:
            assert str(injection_location) in ['embd', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11']
        else:
            assert injection_location is None

        # gpu settings
        gpu = opt.get('gpu', -1)  # 0
        # general settings
        session_config = tf.ConfigProto()
        if not gpu == -1:
            print('Running on GPU: {}'.format(gpu))
            os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu)  # specifies which GPU to use (if multiple are available)
        ops.reset_default_graph()  # clear the default graph stack so the model can be rerun without overwriting tf variables
        if not starter_seed == None:
            random.seed(starter_seed)  # use starter seed to set seed for random library
        seed_list = [random.randint(1, 100000) for i in range(100)]  # generate a list of 100 seeds (1-100000) to be used in the model
        np.random.seed(seed_list.pop(0))
        tf.set_random_seed(seed_list.pop(0))  # set tensorflow seed to keep results consistent

        #####
        # unpack data and assign to model variables
        #####
        assert data_dict.get('embd', None) is None  # (565852, 200)
        if 'word' in topic_scope:
            topic_embd = data_dict['word_topics'].get('topic_matrix', None)  # word topic matrix, e.g. shape (32635, 70)
        # assign word ids
        if extra_test:
            ID1_train, ID1_dev, ID1_test, ID1_test_extra = data_dict['ID1']
            ID2_train, ID2_dev, ID2_test, ID2_test_extra = data_dict['ID2']
        else:
            ID1_train, ID1_dev, ID1_test = data_dict['ID1']
            ID2_train, ID2_dev, ID2_test = data_dict['ID2']
        train_dict, dev_dict, test_dict, test_dict_extra = extract_data(data_dict, topic_scope, extra_test)  # extra_test=False
        # train_dict = {'E1': X1_train, 'E1_mask': X1_mask_train, 'E1_seg': X1_seg_train, 'D_T1': D_T1_train, 'D_T2': D_T2_train,
        #               'W_T1': W_T1_train, 'W_T2': W_T2_train, 'W_T': W_T_train, 'Y': Y_train}

        #####
        # check input dimensions
        #####
        if sparse:
            classes = 2
        else:
            classes = train_dict['Y'].shape[1]
        (m, sentence_length_1) = train_dict['E1'].shape
        # (m, sentence_length_2) = train_dict['E2'].shape

        #####
        # Define Tensorflow graph
        #####
        # Create Placeholders and initialise weights of the correct shape
        X1, X1_mask, X1_seg, Y = create_placeholders([sentence_length_1, None], classes, bicnn=True, sparse=sparse, bert=bert_embd)
        # Create topic placeholders
        print('Topic scope: {}'.format(topic_scope))
        if 'doc' in topic_scope:
            D_T1, D_T2 = create_doc_topic_placeholders(num_topics)
        else:
            D_T1, D_T2 = None, None
        if 'word' in topic_scope:
            W_T_embedded = None
            (m, sentence_length_1) = train_dict['W_T1'].shape
            (m, sentence_length_2) = train_dict['W_T2'].shape
            W_T1, W_T2 = create_word_topic_placeholders([sentence_length_1, sentence_length_2])
        else:
            W_T1_embedded, W_T2_embedded, W_T_embedded = None, None, None

        # tensors for feed_dict
        bert_inputs = dict(input_ids=X1, input_mask=X1_mask, segment_ids=X1_seg)
        maybe_print([X1], ['input ids'], True)
        dropout_prob = tf.placeholder_with_default(0.0, name='dropout_rate', shape=())  # placeholder that falls back to 0.0 if nothing is fed

        # load and lookup BERT
        BERT_version = get_bert_version(cased, bert_large)
        BERT_URL = 'https://tfhub.dev/google/bert_{}/1'.format(BERT_version)
        print('Loading pretrained model from {}'.format(BERT_URL))
        # hub.Module loads a packaged TensorFlow Hub model, either from a local path or from an http address
        bert_lookup = hub.Module(BERT_URL, name='bert_lookup', trainable=bert_update)
        # call the hub module on bert_inputs; session.run(X_embedded) yields the concrete output values
        # (important to use tf 1.11, as tf 1.7 will produce an error for sess.run(X_embedded))
        X_embedded = bert_lookup(bert_inputs, signature="tokens", as_dict=True)
        # X_embedded has 2 keys:
        # pooled_output is [batch_size, hidden_size] --> output embedding for the entire sequence
        # sequence_output is [batch_size, sequence_length, hidden_size] --> output embedding for each token

        # Create topic embedding matrix
        if 'word' in topic_scope:
            topic_vocabulary_size, topic_dim = topic_embd.shape  # topic embedding matrix, e.g. (32635, 70)
            # assert(topic_vocabulary_size==embd_vocabulary_size) # currently using the same id to index topic and embd matrix
            topic_embedding_matrix = initialise_pretrained_embedding(topic_vocabulary_size, topic_dim, topic_embd,
                                                                     name='word_topics', trainable=topic_update)
            # Lookup topic embedding: map each word id in the input sentences to its topic distribution
            W_T1_embedded = lookup_embedding(W_T1, topic_embedding_matrix, expand=False, transpose=False, name='topic_lookup_L')
            W_T2_embedded = lookup_embedding(W_T2, topic_embedding_matrix, expand=False, transpose=False, name='topic_lookup_R')

        # Forward propagation: Build forward propagation as tensorflow graph
        input_dict = {'E1': X_embedded, 'E2': None,
                      'D_T1': D_T1, 'D_T2': D_T2,
                      'W_T1': W_T1_embedded, 'W_T2': W_T2_embedded, 'W_T': W_T_embedded}
        forward_pass = model.forward_propagation(input_dict, classes, hidden_layer, reduction_factor, dropout_prob, seed_list, print_dim)
        logits = forward_pass['logits']

        with tf.name_scope('cost'):
            # Cost function: Add cost function to tensorflow graph
            main_cost = compute_cost(logits, Y, loss_fn='bert')
            cross_entropy_scalar = tf.summary.scalar('cross_entropy', main_cost)  # scalar summary for TensorBoard
            cost = main_cost
            cost_summary = tf.summary.merge([cross_entropy_scalar])

        # Backpropagation: choose training regime (creates tensorflow optimizer which minimizes the cost).
        if layer_specific_lr:
            learning_rate_old_layers = tf.placeholder_with_default(0.0, name='learning_rate_old', shape=())
            learning_rate_new_layers = tf.placeholder_with_default(0.0, name='learning_rate_new', shape=())
            train_step = layer_specific_regime(optimizer_choice, cost, learning_rate_old_layers, learning_rate_new_layers, epsilon, rho)
        else:
            learning_rate_old_layers = tf.placeholder_with_default(0.0, name='learning_rate', shape=())
            learning_rate_new_layers = None
            train_step = standard_training_regime(optimizer_choice, cost, learning_rate_old_layers, epsilon, rho)
            # train_step = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

        # Prediction and Evaluation tensors
        with tf.name_scope('evaluation_metrics'):
            predicted_label = tf.argmax(logits, 1, name='predict')  # which column has the highest activation value?
            if sparse:
                actual_label = Y
            else:
                actual_label = tf.argmax(Y, 1)
            conf_scores = get_confidence_scores(logits, False)
            maybe_print([predicted_label, actual_label, conf_scores], ['Predicted label', 'Actual label', 'Confidence Scores'], False)
            label_idx = tf.expand_dims(tf.where(tf.not_equal(Y, 0))[:, 0], 0, name='label_idx')  # indices of positive labels
            rank_scores = tf.expand_dims(get_confidence_scores(logits), 0, name='rank_scores')  # tf.expand_dims adds a dimension of size 1 at position 0
            maybe_print([label_idx, rank_scores], ['Label index', 'Rank scores'], False)
            eps = 1e-11  # prevent division by zero
            # create streaming metrics: http://ronny.rest/blog/post_2017_09_11_tf_metrics/
            # tf.metrics.accuracy returns two ops: the accuracy up to the previous batch and an update op including the current batch
            streaming_accuracy, streaming_accuracy_update = tf.metrics.accuracy(labels=actual_label, predictions=predicted_label)
            streaming_map, streaming_map_update = tf.metrics.average_precision_at_k(label_idx, rank_scores, 10)
            streaming_recall, streaming_recall_update = tf.contrib.metrics.streaming_recall(predictions=predicted_label, labels=actual_label)
            streaming_precision, streaming_precision_update = tf.contrib.metrics.streaming_precision(predictions=predicted_label, labels=actual_label)
            streaming_f1 = 2 * (streaming_precision * streaming_recall) / (streaming_precision + streaming_recall + eps)
            # create and merge summaries
            accuracy_scalar = tf.summary.scalar('Accuracy', streaming_accuracy)
            recall_scalar = tf.summary.scalar('Recall', streaming_recall)
            precision_scalar = tf.summary.scalar('Precision', streaming_precision)
            f1_scalar = tf.summary.scalar('F1', streaming_f1)
            map_scalar = tf.summary.scalar('MAP', streaming_map)
            eval_summary = tf.summary.merge([accuracy_scalar, recall_scalar, precision_scalar, f1_scalar, map_scalar])

        def predict(sess, subset, writer, epoch, ignore_MAP, topic_scope, layer_specific_lr):
            '''
            Predict in minibatch loop to prevent out of memory error (for large datasets or complex models)
            :return: complete prediction results as list [confidence_scores, predictions, minibatch_cost, eval_metrics]
            '''
            predictions = []
            confidence_scores = []
            minibatch_size = 10
            minibatches = create_minibatches(subset, minibatch_size, sparse=sparse, random=False, topic_scope=topic_scope)
            sess.run(tf.local_variables_initializer())  # for streaming metrics
            for minibatch in minibatches:
                feed_dict = {X1: minibatch['E1'],
                             X1_mask: minibatch['E1_mask'],
                             X1_seg: minibatch['E1_seg'],
                             # X2: minibatch['E2'],
                             Y: minibatch['Y'],
                             learning_rate_old_layers: 0,
                             dropout_prob: 0}  # don't use dropout during prediction
                if layer_specific_lr:
                    feed_dict[learning_rate_new_layers] = 0
                if 'doc' in topic_scope:
                    feed_dict[D_T1] = minibatch['D_T1']
                    feed_dict[D_T2] = minibatch['D_T2']
                if 'word' in topic_scope:
                    feed_dict[W_T1] = minibatch['W_T1']
                    feed_dict[W_T2] = minibatch['W_T2']
                # Run the session to execute the prediction; evaluating merged_summary would mess up streaming metrics
                pred, conf = sess.run([predicted_label, conf_scores], feed_dict=feed_dict)
                predictions.extend(pred)
                confidence_scores.extend(conf)
            if not ignore_MAP:
                eval_metrics = [None, None, None, None, None]
            else:
                eval_metrics = [None, None, None, None]
            return confidence_scores, predictions, None, eval_metrics

        def predict_eval(sess, subset, writer, epoch, ignore_MAP, topic_scope, layer_specific_lr):
            '''
            Predict and evaluate in a minibatch loop to prevent out of memory errors (for large datasets or complex models)
            :return: complete prediction results as list [confidence_scores, predictions, minibatch_cost, eval_metrics]
            '''
            predictions = []
            confidence_scores = []
            minibatch_size = 10
            # create_minibatches returns the data split into minibatches, each of the form
            # {'E1': ..., 'E1_mask': ..., 'E1_seg': ..., 'W_T1': ..., 'W_T2': ..., 'Y': ...}
            minibatches = create_minibatches(subset, minibatch_size, sparse=sparse, random=False, topic_scope=topic_scope)
            sess.run(tf.local_variables_initializer())  # for streaming metrics
            minibatch_cost = 0.
            num_minibatches = int(m / minibatch_size)  # number of minibatches of size minibatch_size in the train set
            for minibatch in minibatches:
                feed_dict = {X1: minibatch['E1'],
                             X1_mask: minibatch['E1_mask'],
                             X1_seg: minibatch['E1_seg'],
                             # X2: minibatch['E2'],
                             Y: minibatch['Y'],
                             learning_rate_old_layers: 0,
                             dropout_prob: 0}  # don't use dropout during prediction
                if layer_specific_lr:
                    feed_dict[learning_rate_new_layers] = 0
                if 'doc' in topic_scope:
                    feed_dict[D_T1] = minibatch['D_T1']
                    feed_dict[D_T2] = minibatch['D_T2']
                if 'word' in topic_scope:
                    feed_dict[W_T1] = minibatch['W_T1']
                    feed_dict[W_T2] = minibatch['W_T2']
                # Run the session to execute the prediction and evaluation, the feed_dict should contain a minibatch for (X, Y).
                if not ignore_MAP:
                    # merged_summary would mess up streaming metrics!
                    pred, conf, batch_cost, c, _, _, _, _ = sess.run(
                        [predicted_label, conf_scores, cost, cost_summary, streaming_accuracy_update,
                         streaming_recall_update, streaming_precision_update, streaming_map_update],
                        feed_dict=feed_dict)
                else:
                    pred, conf, batch_cost, c, _, _, _ = sess.run(
                        [predicted_label, conf_scores, cost, cost_summary, streaming_accuracy_update,
                         streaming_recall_update, streaming_precision_update],
                        feed_dict=feed_dict)
                confidence_scores.extend(conf)
                predictions.extend(pred)  # extend() appends all elements of another sequence to the list
                writer.add_summary(c, epoch)  # save the cost summary together with the epoch
                minibatch_cost += batch_cost / num_minibatches
            if not ignore_MAP:
                eval, acc, rec, prec, f_1, ma_p = sess.run([eval_summary, streaming_accuracy, streaming_recall, streaming_precision, streaming_f1, streaming_map])
                eval_metrics = [acc, prec, rec, f_1, ma_p]
            else:
                eval, acc, rec, prec, f_1 = sess.run([eval_summary, streaming_accuracy, streaming_recall, streaming_precision, streaming_f1])
                eval_metrics = [acc, prec, rec, f_1]
            writer.add_summary(eval, epoch)
            return confidence_scores, predictions, minibatch_cost, eval_metrics

        def training_loop(sess, train_dict, dev_dict, test_dict, train_writer, dev_writer, opt, dropout, seed_list,
                          num_epochs, early_stopping, optimizer, lr_bert, lr_new, layer_specific_lr,
                          stopping_criterion='MAP', patience=patience, topic_scope=None, predict_every_epoch=False):
            '''
            Trains the model
            :return: [opt, epoch]
            '''
            if predict_every_epoch:
                epoch = opt['num_epochs']  # 0
                sess.run(tf.local_variables_initializer())
                _, _, train_cost, train_metrics = predict_eval(sess, train_dict, train_writer, epoch, skip_MAP(train_dict['E1']), topic_scope, layer_specific_lr)
                dev_scores, dev_pred, dev_cost, dev_metrics = predict_eval(sess, dev_dict, dev_writer, epoch, skip_MAP(dev_dict['E1']), topic_scope, layer_specific_lr)
                print('Predicting for epoch {}'.format(epoch))
                test_scores, test_pred, _, test_metrics = predict_eval(sess, test_dict, test_writer, epoch, skip_MAP(test_dict['E1']), topic_scope, layer_specific_lr)
                output_predictions(ID1_dev, ID2_dev, dev_scores, dev_pred, 'dev_{}'.format(epoch), opt)
                opt = save_eval_metrics(dev_metrics, opt, 'dev', 'score_{}'.format(epoch))  # log dev metrics
                output_predictions(ID1_test, ID2_test, test_scores, test_pred, 'test_{}'.format(epoch), opt)
                opt = save_eval_metrics(test_metrics, opt, 'test', 'score_{}'.format(epoch))  # log test metrics
                write_log_entry(opt, 'data/logs/' + logfile)
            epoch = opt['num_epochs'] + 1  # continue counting after freeze epochs (1)
            best_dev_value = None
            best_dev_round = 0
            ep = 'num_epochs'
            while True:
                print('Epoch {}'.format(epoch))
                minibatch_cost = 0.
                # minibatches holds the shuffled training data split into small batches
                minibatches = create_minibatches(train_dict, minibatch_size, seed_list.pop(0), sparse=sparse, random=True, topic_scope=topic_scope)
                # fill the feed_dict from each minibatch and run the optimiser
                for minibatch in minibatches:
                    feed_dict = {X1: minibatch['E1'],
                                 X1_mask: minibatch['E1_mask'],
                                 X1_seg: minibatch['E1_seg'],
                                 # X2: minibatch['E2'],
                                 Y: minibatch['Y'],
                                 learning_rate_old_layers: lr_bert,  # 5e-5
                                 dropout_prob: dropout}
                    if layer_specific_lr:
                        feed_dict[learning_rate_new_layers] = lr_new  # 5e-5
                    # print(minibatch.keys())
                    if 'doc' in topic_scope:
                        feed_dict[D_T1] = minibatch['D_T1']
                        feed_dict[D_T2] = minibatch['D_T2']
                    if 'word' in topic_scope:
                        feed_dict[W_T1] = minibatch['W_T1']
                        feed_dict[W_T2] = minibatch['W_T2']
                    # run the optimisation step; _ is the optimizer op and temp_cost the cost value
                    _, temp_cost = sess.run([optimizer, cost], feed_dict=feed_dict)

                # write summaries and checkpoints every few epochs
                if not logfile is None:
                    # print("Train cost after epoch %i: %f" % (epoch, minibatch_cost))
                    sess.run(tf.local_variables_initializer())  # reset streaming metrics
                    _, _, train_cost, train_metrics = predict_eval(sess, train_dict, train_writer, epoch, skip_MAP(train_dict['E1']), topic_scope, layer_specific_lr)
                    dev_scores, dev_pred, dev_cost, dev_metrics = predict_eval(sess, dev_dict, dev_writer, epoch, skip_MAP(dev_dict['E1']), topic_scope, layer_specific_lr)
                    if predict_every_epoch and (not epoch == num_epochs):
                        print('Predicting for epoch {}'.format(epoch))
                        test_scores, test_pred, _, test_metrics = predict_eval(sess, test_dict, test_writer, epoch, skip_MAP(test_dict['E1']), topic_scope, layer_specific_lr)
                        # output_predictions writes the predictions to a csv file; save_eval_metrics logs the metric scores
                        output_predictions(ID1_dev, ID2_dev, dev_scores, dev_pred, 'dev_{}'.format(epoch), opt)
                        opt = save_eval_metrics(dev_metrics, opt, 'dev', 'score_{}'.format(epoch))  # log dev metrics
                        output_predictions(ID1_test, ID2_test, test_scores, test_pred, 'test_{}'.format(epoch), opt)
                        opt = save_eval_metrics(test_metrics, opt, 'test', 'score_{}'.format(epoch))  # log test metrics
                        # write to the log file
                        write_log_entry(opt, 'data/logs/' + logfile)
                    # dev_metrics = [acc, prec, rec, f_1, ma_p]
                    # use cost or another metric as early stopping criterion
                    if stopping_criterion == 'cost':
                        stopping_metric = dev_cost
                        print("Dev {} after epoch {}: {}".format(stopping_criterion, epoch, stopping_metric))
                    elif stopping_criterion == 'MAP':
                        assert len(dev_metrics) == 5  # X1_dev must have 10 * x examples
                        current_result = dev_metrics[-1]  # MAP
                        print("Dev {} after epoch {}: {}".format(stopping_criterion, epoch, current_result))
                        stopping_metric = 1 - current_result  # dev error
                    elif stopping_criterion == 'F1':
                        current_result = dev_metrics[3]  # F1
                        print("Dev {} after epoch {}: {}".format(stopping_criterion, epoch, current_result))
                        stopping_metric = 1 - current_result  # dev error
                    elif stopping_criterion == 'Accuracy':
                        current_result = dev_metrics[0]  # Accuracy
                        print("Dev {} after epoch {}: {}".format(stopping_criterion, epoch, current_result))
                        stopping_metric = 1 - current_result  # dev error
                    if early_stopping:
                        # save checkpoint for first or better models
                        if (best_dev_value is None) or (stopping_metric < best_dev_value):
                            best_dev_value = stopping_metric
                            best_dev_round = epoch
                            save_model(opt, saver, sess, epoch)  # save model
                    opt = save_eval_metrics(train_metrics, opt, 'train')  # update train metrics in log

                # check stopping criteria
                # stop training if predefined number of epochs reached
                if (not early_stopping) and (epoch == num_epochs):
                    print('Reached predefined number of training epochs.')
                    save_model(opt, saver, sess, epoch)  # save model
                    break
                if early_stopping and (epoch == num_epochs):
                    print('Maximum number of epochs reached during early stopping.')
                    break
                # stop training if early stopping criterion reached
                if early_stopping and epoch >= best_dev_round + patience:
                    print('Early stopping criterion reached after training for {} epochs.'.format(epoch))
                    break
                # stop training if gradient is vanishing
                if math.isnan(minibatch_cost):  # math.isnan checks whether the cost is NaN
                    print('Cost is Nan at epoch {}!'.format(epoch))
                    break
                epoch += 1
            print('Finished training.')
            # restore weights from saved model in best epoch
            if early_stopping:
                print('Load best model from epoch {}'.format(best_dev_round))
                opt[ep] = best_dev_round  # set num_epochs to the best round
                epoch = best_dev_round  # log final predictions with correct epoch info
                load_model(opt, saver, sess, best_dev_round)  # ToDo: fix Too many open files
                # clean up previous checkpoints to save space
                delete_all_checkpoints_but_best(opt, best_dev_round)
            else:
                opt[ep] = epoch
            opt = save_eval_metrics(train_metrics, opt, 'train')  # log train metrics
            return opt, epoch

        # Initialize all the variables globally
        init = tf.global_variables_initializer()  # the init op is run inside the session below
        if (not logfile is None) or (early_stopping):
            saver = create_saver()
        start_time, opt = start_timer(opt, logfile)
        print('Model {}'.format(opt['id']))

        #####
        # Start session to execute Tensorflow graph
        #####
        with tf.Session(config=session_config) as sess:  # config=tf.ConfigProto(log_device_placement=True)
            # add debugger (but not for batch experiments)
            if __name__ == '__main__' and FLAGS.debug:
                sess = tf_debug.TensorBoardDebugWrapperSession(sess, "NPMacBook.local:7000")
            # Run the initialization
            sess.run(init)
            # create folders for the TensorBoard summaries
            if logfile is None:
                train_writer = None
                dev_writer = None
                test_writer = None
                if extra_test:
                    test_writer_extra = None
            else:
                print('logfile: {}'.format(logfile))
                create_model_folder(opt)  # create model folder
                model_dir = get_model_dir(opt)  # e.g. data/models/model_3
                train_writer = tf.summary.FileWriter(model_dir + '/train', sess.graph)  # save graph first
                dev_writer = tf.summary.FileWriter(model_dir + '/dev')
                test_writer = tf.summary.FileWriter(model_dir + '/test')
                if extra_test:
                    test_writer_extra = tf.summary.FileWriter(model_dir + '/test_extra')

            # additional input for predict every epoch
            if predict_every_epoch:
                td = test_dict
            else:
                td = None

            # set learning rates per layer
            if speedup_new_layers:
                lr_bert = learning_rate
                lr_new = learning_rate * 100
            else:
                lr_bert = learning_rate
                lr_new = learning_rate

            # Freeze BERT and only train new weights
            if freeze_thaw_tune:  # False
                print('Freeze BERT and train new layers...')
                opt, epoch = training_loop(sess, train_dict, dev_dict, td, train_writer, dev_writer, opt, dropout, seed_list,
                                           num_epochs, early_stopping, train_step, 0, lr_new, layer_specific_lr,
                                           stopping_criterion, patience, topic_scope, predict_every_epoch)
                num_epochs += epoch
                lr_new = learning_rate

            # Normal Finetuning
            print('Finetune...')
            opt, epoch = training_loop(sess, train_dict, dev_dict, td, train_writer, dev_writer, opt, dropout, seed_list,
                                       num_epochs, early_stopping, train_step, lr_bert, lr_new, layer_specific_lr,
                                       stopping_criterion, patience, topic_scope, predict_every_epoch)

            # Predict + evaluate on dev and test set
            # train_scores, train_pred, _, train_metrics = predict(X1_train, X2_train, Y_train, train_writer, epoch)
            dev_scores, dev_pred, _, dev_metrics = predict_eval(sess, dev_dict, dev_writer, epoch, skip_MAP(dev_dict['E1']), topic_scope, layer_specific_lr)
            opt = save_eval_metrics(dev_metrics, opt, 'dev')
            if opt['dataset'] == 'GlueQuora':
                test_scores, test_pred, _, test_metrics = predict(sess, test_dict, test_writer, epoch, skip_MAP(test_dict['E1']), topic_scope, layer_specific_lr)
            else:
                test_scores, test_pred, _, test_metrics = predict_eval(sess, test_dict, test_writer, epoch, skip_MAP(test_dict['E1']), topic_scope, layer_specific_lr)
            opt = save_eval_metrics(test_metrics, opt, 'test')
            # record the end time to obtain the training time
            opt = end_timer(opt, start_time, logfile)
            if extra_test:
                test_scores_extra, test_pred_extra, _, test_metrics_extra = predict_eval(sess, test_dict_extra, test_writer_extra, epoch, skip_MAP(test_dict['E1']), topic_scope, layer_specific_lr)
                opt = save_eval_metrics(test_metrics_extra, opt, 'PAWS')

            # print dev and test scores for the chosen metric
            if print_dim:  # True
                if stopping_criterion is None:
                    stopping_criterion = 'Accuracy'
                print('Dev {}: {}'.format(stopping_criterion, opt['score'][stopping_criterion]['dev']))
                print('Test {}: {}'.format(stopping_criterion, opt['score'][stopping_criterion]['test']))

            # write log, write results to csv, save model, close the writers
            if not logfile is None:
                # save log
                write_log_entry(opt, 'data/logs/' + logfile)
                # write predictions to file for scorer
                # output_predictions(ID1_train, ID2_train, train_scores, train_pred, 'train', opt)
                output_predictions(ID1_dev, ID2_dev, dev_scores, dev_pred, 'dev', opt)
                output_predictions(ID1_test, ID2_test, test_scores, test_pred, 'test', opt)
                if extra_test:
                    output_predictions(ID1_test_extra, ID2_test_extra, test_scores_extra, test_pred_extra, 'PAWS_test', opt)
                print('Wrote predictions for model_{}.'.format(opt['id']))
                # save model
                if save_checkpoints:
                    save_model(opt, saver, sess, epoch)  # save disk space
                # close all writers to prevent "too many open files" errors
                train_writer.close()
                dev_writer.close()
                test_writer.close()
                if extra_test:
                    test_writer_extra.close()

    except Exception as e:
        print("Error: {0}".format(e.__doc__))
        traceback.print_exc(file=sys.stdout)  # print the exception traceback
        opt['status'] = 'Error'
        write_log_entry(opt, 'data/logs/' + logfile)
        # print('==============')
    return opt


if __name__ == '__main__':
    # command line arguments
    parser = argparse.ArgumentParser()
    parser.register("type", "bool", lambda v: v.lower() == "true")
    parser.add_argument("--debug", type="bool", nargs="?", const=True, default=False,
                        help="Use debugger to track down bad values during training. "
                             "Mutually exclusive with the --tensorboard_debug_address flag.")
    parser.add_argument('-gpu', action="store", dest="gpu", type=int, default=-1)
    FLAGS, unparsed = parser.parse_known_args()

    # adjust settings by changing the option dictionary
    opt = {'dataset': 'MSRP',
           'datapath': 'data/',
           'model': 'bert_simple_topic',
           'tasks': ['B'],
           'subsets': ['train', 'dev', 'test'],  # 'train_large','test2016','test2017' for Semeval
           'minibatch_size': 10,
           'bert_update': True,
           'bert_large': False,
           'L2': 0,
           'load_ids': True,
           'max_m': 10,  # only load 10 examples for debugging
           'unk_sub': False,
           'padding': False,
           'simple_padding': True,
           'hidden_layer': 1,
           'learning_rate': 0.3,
           'num_epochs': 1,
           'bert_cased': False,
           'sparse_labels': True,
           'max_length': 'minimum',
           'stopping_criterion': 'F1',
           'optimizer': 'Adadelta',
           'dropout': 0.1,
           'gpu': FLAGS.gpu,
           'freeze_thaw_tune': False,
           'speedup_new_layers': False,
           'seed': 1,
           'num_topics': 80,
           'topic_type': 'ldamallet',
           'topic': 'doc',
           'topic_alpha': 1,
           'topic_update': True,
           'unk_topic': 'zero',
           # 'unflat_topics': True
           # 'sparse_labels': True,
           'predict_every_epoch': False
           # 'patience': 2,
           }
    data_dict = load_data(opt, cache=True, write_vocab=False)  # load the data
    opt = model(data_dict, opt, logfile='test.json', print_dim=True)
    # T1 = data_dict['T1']
    # T2 = data_dict['T2']
    # E1 = data_dict['E1']
    # E2 = data_dict['E2']
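The early-stopping bookkeeping in training_loop boils down to tracking 1 - dev F1 (a "dev error") and stopping once it has not improved for `patience` epochs. A stripped-down, framework-free sketch of that logic (my own illustration, not repository code):

def early_stopping_run(dev_f1_per_epoch, patience=2, max_epochs=9):
    best_value, best_round = None, 0
    for epoch, f1 in enumerate(dev_f1_per_epoch[:max_epochs], start=1):
        stopping_metric = 1 - f1                                # dev error, lower is better
        if best_value is None or stopping_metric < best_value:
            best_value, best_round = stopping_metric, epoch     # the real code calls save_model() here
        if epoch >= best_round + patience:                      # no improvement for `patience` epochs
            break
    return best_round                                           # the checkpoint that gets reloaded

print(early_stopping_run([0.80, 0.84, 0.83, 0.83, 0.82]))  # -> 2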

tf_helpers.py

import tensorflow as tf
import numpy as np


def maybe_print(elements, names, test_print):
    if test_print:
        for e, n in zip(elements, names):
            print(n + " shape: " + str(e.get_shape()))


def compute_vocabulary_size(files):
    """
    Counts number of distinct vocabulary indices
    :param files: only X, not Y
    :return: size of vocabulary
    """
    vocabulary = set()
    for f in files:
        for row in f:
            for integer in row:
                if integer not in vocabulary:
                    vocabulary.add(integer)
    return max(vocabulary) + 1


def create_placeholders(sentence_lengths, classes, bicnn=False, sparse=True, bert=False):
    """
    Creates the placeholders for the tensorflow session.
    Arguments:
    sentence_lengths -- widths of the sentence matrices
    classes -- scalar, number of classes
    Returns:
    X -- placeholder for the data input, of shape [None, sentence_length]
    Y -- placeholder for the input labels
    """
    sentence_length = sentence_lengths[0]
    if sparse:
        Y = tf.placeholder(tf.int64, [None, ], name='labels')
    else:
        Y = tf.placeholder(tf.int32, [None, classes], name='labels')
    if bert:
        # BERT placeholders (no names!)
        X1 = tf.placeholder(dtype=tf.int32, shape=[None, None])  # input ids
        X1_mask = tf.placeholder(dtype=tf.int32, shape=[None, None])  # input masks
        X1_seg = tf.placeholder(dtype=tf.int32, shape=[None, None])  # segment ids
        return X1, X1_mask, X1_seg, Y
    else:
        X = tf.placeholder(tf.int32, [None, sentence_length], name='XL')
        if bicnn:
            sentence_length2 = sentence_lengths[1]
            X2 = tf.placeholder(tf.int32, [None, sentence_length2], name='XR')
            return X, X2, Y
        else:
            return X, Y


def create_text_placeholders(sentence_lengths):
    T1 = tf.placeholder(tf.string, [None, sentence_lengths[0]], name='TL')
    T2 = tf.placeholder(tf.string, [None, sentence_lengths[1]], name='TR')
    return T1, T2


def create_word_topic_placeholders(sentence_lengths):
    """
    Creates the placeholders for the word topic ids of both sentences.
    :param sentence_lengths: widths of the two sentence matrices
    """
    T1 = tf.placeholder(tf.int32, [None, sentence_lengths[0]], name='W_TL')
    T2 = tf.placeholder(tf.int32, [None, sentence_lengths[1]], name='W_TR')
    return T1, T2


def create_word_topic_placeholder(sentence_length):
    """
    Creates a single placeholder for joined word topic ids.
    :param sentence_length: width of the sentence matrix
    """
    WT = tf.placeholder(tf.int32, [None, sentence_length], name='W_T')
    return WT


def create_doc_topic_placeholders(num_topics):
    """
    Creates the placeholders for the document topic distributions of both sentences.
    :param num_topics: number of topics of the topic model
    """
    T1 = tf.placeholder(tf.float32, [None, num_topics], name='D_TL')
    T2 = tf.placeholder(tf.float32, [None, num_topics], name='D_TR')
    return T1, T2


def create_embedding(vocab_size, embedding_dim, name='embedding'):
    with tf.name_scope(name):
        embedding_matrix = tf.Variable(tf.random_uniform([vocab_size, embedding_dim], -1.0, 1.0), name="W", trainable=True)
    return embedding_matrix


def create_embedding_placeholder(doc_vocab_size, embedding_dim):
    # use this or load data directly as variable?
    embedding_placeholder = tf.placeholder(tf.float32, [doc_vocab_size, embedding_dim], name='embd_placeholder')
    return embedding_placeholder


def initialise_pretrained_embedding(doc_vocab_size, embedding_dim, embedding_placeholder, name='embedding', trainable=True):
    with tf.name_scope(name):
        if trainable:
            print('init pretrained embds')
            embedding_matrix = tf.Variable(embedding_placeholder, trainable=True, name="W", dtype=tf.float32)
        else:
            # create a non-trainable zero matrix, then assign the pretrained values so each word keeps its topic distribution
            W = tf.Variable(tf.constant(0.0, shape=[doc_vocab_size, embedding_dim]), trainable=False, name="W")
            embedding_matrix = W.assign(embedding_placeholder)
    return embedding_matrix


def lookup_embedding(X, embedding_matrix, expand=True, transpose=True, name='embedding_lookup'):
    '''
    Looks up embeddings based on word ids
    :param X: word id matrix with shape (m, sentence_length)
    :param embedding_matrix: embedding matrix with shape (vocab_size, embedding_dim)
    :param expand: add dimension to embedded matrix or not
    :param transpose: switch dimensions of embedding matrix or not
    :param name: name used in TF graph
    :return: embedded_matrix
    '''
    # look up the embedding of each id in X from embedding_matrix
    embedded_matrix = tf.nn.embedding_lookup(embedding_matrix, X, name=name)  # dim [m, sentence_length, embedding_dim]
    if transpose:
        embedded_matrix = tf.transpose(embedded_matrix, perm=[0, 2, 1])  # dim [m, embedding_dim, sentence_length]
    if expand:
        # add a dimension of size 1 at the end
        embedded_matrix = tf.expand_dims(embedded_matrix, -1)  # dim [m, embedding_dim, sentence_length, 1]
    return embedded_matrix


def compute_cost(logits, Y, loss_fn='cross_entropy', name='main_cost'):
    """
    Computes the cost
    Arguments:
    logits -- output of forward propagation (output of the last LINEAR unit), of shape (batch, classes)
    Y -- "true" labels vector of shape (batch,)
    Returns:
    cost -- Tensor of the cost function
    """
    # multi class classification (binary classification as special case)
    with tf.name_scope(name):
        if loss_fn == 'cross_entropy':
            # maybe_print([logits, Y], ['logits', 'Y'], True)
            # average the per-example loss over the batch
            cost = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=Y), name='cost')
        elif loss_fn == 'bert':
            # from https://github.com/google-research/bert/blob/master/run_classifier_with_tfhub.py
            # probabilities = tf.nn.softmax(logits, axis=-1)
            log_probs = tf.nn.log_softmax(logits, axis=-1)  # log softmax over the classes
            one_hot_labels = tf.one_hot(Y, depth=2, dtype=tf.float32)  # tf.one_hot turns each label into a one-hot vector
            per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)  # sum over the class dimension
            cost = tf.reduce_mean(per_example_loss)  # average over the batch
        else:
            raise NotImplemented()
    return cost
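The 'bert' branch of compute_cost is just cross-entropy written out by hand (log-softmax, one-hot labels, negative sum). A quick NumPy check of the arithmetic on a toy batch (illustration only, not repository code):

import numpy as np

logits = np.array([[2.0, 0.5], [0.2, 1.5]])   # (batch=2, classes=2)
labels = np.array([0, 1])

log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))  # log_softmax
one_hot = np.eye(2)[labels]
per_example_loss = -(one_hot * log_probs).sum(axis=-1)
print(per_example_loss.mean())   # same value tf.reduce_mean(per_example_loss) would give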
