Machine Learning-Based Automatic Word Segmentation and Tagging Algorithms and Corpus Construction for Ancient Chinese

Abstract
In recent years deep learning has spread into nearly every area of research and daily life. This thesis studies its application to natural language processing (NLP), and to ancient Chinese NLP in particular. The aim is to use computers to help researchers of ancient texts with specialized and tedious tasks: dating, sentence breaking, word segmentation, and part-of-speech (POS) tagging. Sentence breaking and word segmentation are tasks that Chinese NLP faces and English NLP does not, and sentence breaking is specific to ancient Chinese. Handling these tasks by computer improves researchers' efficiency, reduces errors caused by subjective human judgment, and frees researchers from heavy groundwork so that they can put more energy into later research on the transmission and philosophical interpretation of the texts.
The models in this thesis are built around long short-term memory (LSTM) neural networks, with a different input and output structure designed for each task. The training data are ancient Chinese corpora that are publicly available for download online, and we manually annotated part of the Old Chinese texts. The resulting models can date, break into sentences, segment, and POS-tag ancient Chinese text. The main work and contributions are as follows:
(1) A dating model for ancient texts built around an LSTM network. Each character in a text is converted into a high-dimensional vector, and the vectors of the whole text are fed into the model, which analyzes the nonlinear relationships among them and outputs a period label for the passage. Experiments show that a model built on a bi-directional long short-term memory (Bi-LSTM) network handles the dating task well, with an accuracy above 80%. The model offers an efficient and accurate way to date ancient texts and saves researchers time in the dating process.
(2) A sentence-breaking model, motivated by the lack of punctuation in the original editions of some ancient Chinese books. A deep neural network is trained on a large amount of ancient Chinese text that has already been broken into sentences, so that the model learns the sentence-breaking conventions of a given period and genre. In later digitization of ancient Chinese documents the sentence-breaking work can then be handed to the computer, which reduces part of the researchers' workload.
(3) An integrated model for automatic word segmentation and POS tagging. Because no public ancient Chinese corpus with word segmentation and POS tags is currently available, we manually annotated part of the corpus to obtain a small data set, stored it in a database, and used it as the training set. Experiments show that the proposed model performs ancient Chinese segmentation and tagging reasonably well. The database can be extended further by combining model output with manual correction.
With the Bi-LSTM network as the main structure, the thesis builds a series of models for different tasks on ancient Chinese text. Experiments show that the models already perform well on the limited ancient Chinese corpora currently available, and that they can serve as auxiliary tools that help researchers annotate text when larger corpora are built. The newly produced corpora can in turn be used to retrain the models and raise their accuracy, so that corpus and model improve each other and advance both the digitization of ancient Chinese and the construction of large ancient Chinese corpora.
Key words: ancient Chinese, natural language processing, dating, sentence breaking, word segmentation, part-of-speech tagging
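Contents
Acknowledgments
Abstract (Chinese)
Abstract (English)
1 Introduction
1.1 Research background and significance
1.2 Research content
1.3 Thesis organization
2 Literature Review
2.1 Dating methods for ancient texts
2.2 Sentence-breaking methods for ancient texts
2.3 Word segmentation methods for ancient texts
2.4 Overview of part-of-speech tagging
2.5 Chapter summary
3 Dating Model for Ancient Texts
3.1 Data sources and preprocessing
3.2 Model structure
3.3 Experiments
3.4 Chapter summary
4 Sentence-Breaking Model for Ancient Chinese
4.1 Data sources and preprocessing
4.2 Model construction
4.3 Experiments and results
4.4 Chapter summary
5 Word Segmentation and Tagging System and Database Construction for Ancient Chinese
5.1 Data sources and preprocessing
5.2 Evaluation criteria for classification models
5.3 Model architecture
5.4 Experiments and performance analysis
5.5 Part-of-speech tagging
5.6 Chapter summary
6 Conclusion and Outlook
6.1 Conclusion
6.2 Outlook
References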
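Research Content
The goal of this project is to apply mature deep-learning NLP techniques to build a set of models for ancient Chinese that perform automatic dating, sentence breaking, and word segmentation with tagging. Handing this tedious work to machines lightens the load on ancient Chinese researchers and speeds up the digitization of ancient Chinese. Following from this goal, the research falls into three parts:
(1) To address the dating of ancient books, we build a dating model around a bi-directional long short-term memory network. Texts of known date collected from the Internet serve as the training set. A word2vec model converts each character of a text into a high-dimensional vector, and the character vectors of the whole text are fed into the model, which analyzes the nonlinear relationships among them and outputs a period label for the passage. Experiments show that a model built on the Bi-LSTM network handles the dating task well, with an accuracy above 80%. The model offers an efficient and accurate way to date ancient texts and saves researchers time in the dating process.
(2) To address the lack of punctuation in the original editions of some ancient Chinese books, we propose a sentence-breaking model. A deep neural network is trained on a large amount of ancient Chinese text that has already been broken into sentences, so that the model learns the sentence-breaking conventions of a given period and genre. Given an unbroken character sequence as input, the machine then inserts the sentence breaks automatically.
(3) For word segmentation and POS tagging, the first problem is obtaining training data. The task needs text that is already segmented and POS-tagged, but no public ancient Chinese corpus with such annotation is currently available. We therefore manually annotated part of the corpus to obtain a small data set and ran a limited set of experiments with it, to verify that the proposed segmentation and tagging model performs the task reasonably well.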
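The sentence-breaking task in (2) can be framed as character-level labeling. The sketch below is only an illustration under stated assumptions, not the thesis's implementation: the helper names `to_break_labels` and `apply_breaks`, the punctuation set `PUNCT`, and the two-label scheme (1 = a break follows this character, 0 = no break) are all hypothetical.

```python
# Hypothetical helpers and label scheme, for illustration only.
PUNCT = set("，。！？、；：")


def to_break_labels(punctuated):
    """Strip punctuation and label each character: 1 if a break follows it, else 0."""
    chars, labels = [], []
    for ch in punctuated:
        if ch in PUNCT:
            if labels:  # ignore punctuation before the first character
                labels[-1] = 1
        else:
            chars.append(ch)
            labels.append(0)
    return chars, labels


def apply_breaks(chars, labels, mark="。"):
    """Rebuild text from characters and (predicted) labels, inserting `mark` at each break."""
    return "".join(ch + (mark if lab else "") for ch, lab in zip(chars, labels))


chars, labels = to_break_labels("學而時習之，不亦說乎？")
print(labels)                                 # [0, 0, 0, 0, 1, 0, 0, 0, 1]
print(apply_breaks(chars, labels, mark="/"))  # 學而時習之/不亦說乎/
```

Pairs produced by `to_break_labels` could serve as training examples for a sequence labeler, and `apply_breaks` turns predicted labels back into readable text.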
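Thesis Organization
The thesis is organized as follows:
Chapter 1 is the introduction. It first outlines the significance and background of the research, then reviews the main problems faced and the work done, summarizing them so that the reader can see the nature of each problem and the corresponding solution. It closes with a brief description of the overall structure of the thesis.
Chapter 2 reviews the related research. It covers the methods that traditional Chinese scholars used to judge the date of a work, the application of NLP methods to dating ancient Chinese texts, the traditional methods commonly used for breaking ancient texts into sentences, and the state of research on word segmentation and POS tagging in NLP, with a summary of the common methods and algorithms and their strengths and weaknesses.
Chapter 3 selects the Bi-LSTM network as the core of the model, given the scarcity of ancient Chinese corpora, and presents the overall framework of the dating model. It then walks through the model layer by layer, explaining the composition and role of each layer. The chapter applies Bi-LSTM networks to the dating of ancient Chinese texts for the first time, analyzes the model's performance in two groups of experiments, and briefly examines how texts from the same period, both within one book and across different books, are mutually independent yet consistent.
Chapter 4 addresses the absence of punctuation in ancient Chinese books. One or more input sentences are labeled character by character, and the labels mark the positions where punctuation belongs. The chapter first describes the data sources and overall structure of the model, then its code implementation, and finally reports experiments on real data. The analysis shows that the model's accuracy is fairly high and that it can serve researchers as a sentence-breaking aid.
Because the task in Chapter 5 is special, the chapter begins with the data sources and preprocessing of the model's training set and introduces several criteria for evaluating classification performance. It then presents its main content: an integrated system for ancient Chinese word segmentation and POS tagging based on Bi-LSTM networks. For this system we propose an encoding that merges the two kinds of labels into one output, so that the model's output carries both the segmentation label and the POS label. To evaluate the model we ran two experiments on a small manually annotated data set, one on segmentation and one on POS tagging. They show that the integrated system performs well on both tasks, and its accuracy should improve further once a larger and more accurate data set is available. Finally, the integrated model is used to build a simple Old Chinese corpus, together with a website for managing it.
Chapter 6 summarizes the work and describes the directions and plans for further research.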
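The Chapter 5 description says the model emits a single label that carries both the segmentation tag and the POS tag, but this post does not give the exact encoding. One plausible scheme, shown here purely as an assumption, joins a position tag (b/m/e/s) and a POS tag with a hyphen. The function names `encode_joint` and `decode_joint` and the POS abbreviations `n` and `v` are hypothetical.

```python
def encode_joint(words_with_pos):
    """[(word, pos), ...] -> (chars, joint_tags), with joint tags such as 'b-n'."""
    chars, tags = [], []
    for word, pos in words_with_pos:
        if len(word) == 1:
            positions = ["s"]
        else:
            positions = ["b"] + ["m"] * (len(word) - 2) + ["e"]
        for ch, position in zip(word, positions):
            chars.append(ch)
            tags.append(position + "-" + pos)
    return chars, tags


def decode_joint(chars, tags):
    """Inverse of encode_joint: rebuild (word, pos) pairs from joint tags."""
    result, current = [], ""
    for ch, tag in zip(chars, tags):
        position, pos = tag.split("-", 1)
        current += ch
        if position in ("s", "e"):  # a word ends here
            result.append((current, pos))
            current = ""
    return result


# Illustrative annotation only; the POS labels are not taken from the thesis corpus.
sample = [("楚霸王", "n"), ("大軍", "n"), ("盡", "v")]
chars, tags = encode_joint(sample)
print(tags)  # ['b-n', 'm-n', 'e-n', 'b-n', 'e-n', 's-v']
print(decode_joint(chars, tags) == sample)  # True
```

With such an encoding a single softmax layer over the joint tag set yields both results at once, at the cost of a larger label set (positions times POS tags).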
Reposted from: http://www.biyezuopin.vip/onews.asp?id=16559

# Load the data
import os
import pickle
import re
import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from tqdm import tqdm

# https://github.com/yongyehuang/Tensorflow-Tutorial/blob/master/Tutorial_6%20-%20Bi-directional%20LSTM%20for%20sequence%20labeling%20(Chinese%20segmentation).ipynb

with open('cleaned_train800-1160.txt', 'rb') as inp:
    texts = inp.read().decode('utf-16')
sentences = texts.split('\r\n')  # split on line breaks


# Strip irregular content (such as the marks at the start of each line)
def clean(s):
    if u'“/s' not in s:  # quotation marks inside a sentence should not be removed
        return s.replace(u'“ ', '')
    elif u'”/s' not in s:
        return s.replace(u'” ', '')
    elif u'‘/s' not in s:
        return s.replace(u'‘', '')
    elif u'’/s' not in s:
        return s.replace(u'’', '')
    else:
        return s


texts = u''.join(map(clean, sentences))  # join all the words together
# print('Length of texts is %d' % len(texts))
# print('Example of texts: \n', texts[:300])

sentence = re.split(u'[，。！？、‘’“”：（）—《》]', texts)
# print('Sentences number:', len(sentence))
# print('Sentence Example:\n', sentence[2])


# ---------- Add a tag to every character ----------
def tag_word(word):
    """Tag each character of a word: s (single), b (begin), m (middle), e (end)."""
    if len(word) > 8:  # as in the original code, words longer than 8 characters stay untagged
        return word
    if len(word) == 1:
        return word + '/s  '
    middle = u''.join(ch + '/m  ' for ch in word[1:-1])
    return word[0] + '/b  ' + middle + word[-1] + '/e  '


sentences = []
# f = open('E:\\pyCode\\Bi-directional_LSTM\\a.txt','w')
for sentenc in sentence:
    words = sentenc.split()  # words are separated by whitespace
    sentences.append(u''.join(tag_word(w) for w in words))
# print(sentences)


def get_Xy(sentence):
    """Turn a tagged sentence into ([char1, c2, ..., cn], [tag1, t2, ..., tn])."""
    words_tags = re.findall('(.)/(.)', sentence)
    if words_tags:
        words_tags = np.asarray(words_tags)
        words = words_tags[:, 0]
        tags = words_tags[:, 1]
        return words, tags  # characters and tags are stored as data / label
    return None


datas = list()
labels = list()
# print('Start creating words and tags data ...')
for tagged in tqdm(iter(sentences)):
    result = get_Xy(tagged)
    if result:
        datas.append(result[0])
        labels.append(result[1])
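The next script loads `X`, `y`, `word2id`, `id2word`, `tag2id` and `id2tag` from `data/data.pkl`, but the step that produces that file is not shown in this post. The sketch below illustrates how such arrays are commonly built and should be read as an assumption, not as the author's code: it uses plain dicts (the later code indexes `word2id` with a list, which suggests a pandas Series in the original), and the helper names `build_vocab` and `pad_ids` are hypothetical.

```python
def build_vocab(sequences):
    """Map every distinct item to an id starting at 1; id 0 is reserved for padding."""
    item2id = {}
    for seq in sequences:
        for item in seq:
            if item not in item2id:
                item2id[item] = len(item2id) + 1
    id2item = {i: item for item, i in item2id.items()}
    return item2id, id2item


def pad_ids(seq, item2id, max_len):
    """Convert a sequence to ids, truncating or right-padding with 0 to max_len."""
    ids = [item2id[item] for item in seq][:max_len]
    return ids + [0] * (max_len - len(ids))


# Toy stand-ins for the `datas` and `labels` lists built above.
toy_datas = [["大", "軍", "盡"], ["楚", "霸", "王"]]
toy_labels = [["b", "e", "s"], ["b", "m", "e"]]
max_len = 4

word2id, id2word = build_vocab(toy_datas)
# Fixed tag ids; the order matches the ['s', 'b', 'm', 'e'] order used when decoding later.
tag2id = {"s": 1, "b": 2, "m": 3, "e": 4}
id2tag = {i: t for t, i in tag2id.items()}

X = [pad_ids(s, word2id, max_len) for s in toy_datas]
y = [pad_ids(t, tag2id, max_len) for t in toy_labels]
print(X)  # [[1, 2, 3, 0], [4, 5, 6, 0]]
print(y)  # [[2, 4, 1, 0], [2, 3, 4, 0]]
```

Four tags plus the padding id give five classes, which is consistent with `class_num = 5` in the model configuration further down.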
# Load the preprocessed arrays and lookup tables
with open('data/data.pkl', 'rb') as inp:
    X = pickle.load(inp)
    y = pickle.load(inp)
    word2id = pickle.load(inp)
    id2word = pickle.load(inp)
    tag2id = pickle.load(inp)
    id2tag = pickle.load(inp)

# Split into training / validation / test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
# print('X_train.shape={}, y_train.shape={}; \nX_valid.shape={}, y_valid.shape={};\nX_test.shape={}, y_test.shape={}'.format(
#     X_train.shape, y_train.shape, X_valid.shape, y_valid.shape, X_test.shape, y_test.shape))


# ** 3. Build the data generator
class BatchGenerator(object):
    """Construct a data generator. The inputs X, y should be ndarray or list-like.

    Example:
        Data_train = BatchGenerator(X=X_train_all, y=y_train_all, shuffle=False)
        Data_test = BatchGenerator(X=X_test_all, y=y_test_all, shuffle=False)
        X = Data_train.X
        y = Data_train.y
    or:
        X_batch, y_batch = Data_train.next_batch(batch_size)
    """

    def __init__(self, X, y, shuffle=False):
        if not isinstance(X, np.ndarray):
            X = np.asarray(X)
        if not isinstance(y, np.ndarray):
            y = np.asarray(y)
        self._X = X
        self._y = y
        self._epochs_completed = 0
        self._index_in_epoch = 0
        self._number_examples = self._X.shape[0]
        self._shuffle = shuffle
        if self._shuffle:
            new_index = np.random.permutation(self._number_examples)
            self._X = self._X[new_index]
            self._y = self._y[new_index]

    @property
    def X(self):
        return self._X

    @property
    def y(self):
        return self._y

    @property
    def num_examples(self):
        return self._number_examples

    @property
    def epochs_completed(self):
        return self._epochs_completed

    def next_batch(self, batch_size):
        """Return the next 'batch_size' examples from this data set."""
        start = self._index_in_epoch
        self._index_in_epoch += batch_size
        if self._index_in_epoch > self._number_examples:
            # Finished an epoch
            self._epochs_completed += 1
            # Shuffle the data
            if self._shuffle:
                new_index = np.random.permutation(self._number_examples)
                self._X = self._X[new_index]
                self._y = self._y[new_index]
            start = 0
            self._index_in_epoch = batch_size
            assert batch_size <= self._number_examples
        end = self._index_in_epoch
        return self._X[start:end], self._y[start:end]


# print('Creating the data generator ...')
data_train = BatchGenerator(X_train, y_train, shuffle=True)
data_valid = BatchGenerator(X_valid, y_valid, shuffle=False)
data_test = BatchGenerator(X_test, y_test, shuffle=False)
# print('Finished creating the data generator.')

# Note: this code uses the TensorFlow 1.x API (tf.placeholder, tf.Session, tf.contrib);
# tf.contrib was removed in TensorFlow 2.0.
import tensorflow as tf
from tensorflow.contrib import rnn

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)

# For Chinese word segmentation.
# ##################### config ######################
decay = 0.85
max_epoch = 5
max_max_epoch = 10
timestep_size = max_len = 32  # sentence length
vocab_size = 7010  # number of distinct characters in the samples + 1 (padding 0), obtained during data processing
input_size = embedding_size = 64  # character-vector length
class_num = 5
hidden_size = 128  # number of hidden units
layer_num = 2  # number of bi-lstm layers
max_grad_norm = 5.0  # maximum gradient norm (larger gradients are clipped)

lr = tf.placeholder(tf.float32, [])
keep_prob = tf.placeholder(tf.float32, [])
batch_size = tf.placeholder(tf.int32, [])  # note: the type must be tf.int32
model_save_path = 'ckpt/bi-lstm.ckpt'  # where the model is saved

with tf.variable_scope('embedding'):
    embedding = tf.get_variable("embedding", [vocab_size, embedding_size], dtype=tf.float32)


def weight_variable(shape):
    """Create a weight variable with appropriate initialization."""
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)


def bias_variable(shape):
    """Create a bias variable with appropriate initialization."""
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)


def lstm_cell():
    cell = rnn.LSTMCell(hidden_size, reuse=tf.get_variable_scope().reuse)
    return rnn.DropoutWrapper(cell, output_keep_prob=keep_prob)


def bi_lstm(X_inputs):
    """Build the bi-LSTM network and return its output, shape [-1, hidden_size * 2]."""
    # X_inputs.shape = [batchsize, timestep_size]  ->  inputs.shape = [batchsize, timestep_size, embedding_size]
    inputs = tf.nn.embedding_lookup(embedding, X_inputs)

    # ** 1. Build the multi-layer forward and backward LSTMs
    cell_fw = rnn.MultiRNNCell([lstm_cell() for _ in range(layer_num)], state_is_tuple=True)
    cell_bw = rnn.MultiRNNCell([lstm_cell() for _ in range(layer_num)], state_is_tuple=True)

    # ** 2. Initial states
    initial_state_fw = cell_fw.zero_state(batch_size, tf.float32)
    initial_state_bw = cell_bw.zero_state(batch_size, tf.float32)

    # The two variants of step 3 below are equivalent.
    # ** 3a. bi-lstm computation with the TensorFlow wrapper (the usual choice, commented out here).
    #    rnn.static_bidirectional_rnn expects `inputs` as a length-T list of tensors of shape
    #    [batch_size, input_size], so inputs must first be unstacked along the timestep axis.
    # inputs = tf.unstack(inputs, timestep_size, 1)
    # try:
    #     outputs, _, _ = rnn.static_bidirectional_rnn(
    #         cell_fw, cell_bw, inputs,
    #         initial_state_fw=initial_state_fw, initial_state_bw=initial_state_bw, dtype=tf.float32)
    # except Exception:  # old TensorFlow versions return only outputs, not states
    #     outputs = rnn.static_bidirectional_rnn(
    #         cell_fw, cell_bw, inputs,
    #         initial_state_fw=initial_state_fw, initial_state_bw=initial_state_bw, dtype=tf.float32)
    # output = tf.reshape(tf.concat(outputs, 1), [-1, hidden_size * 2])

    # ** 3b. bi-lstm computation, unrolled by hand to make the details explicit
    with tf.variable_scope('bidirectional_rnn'):
        # The two networks compute their outputs and states separately.
        # Forward direction
        outputs_fw = list()
        state_fw = initial_state_fw
        with tf.variable_scope('fw'):
            for timestep in range(timestep_size):
                if timestep > 0:
                    tf.get_variable_scope().reuse_variables()
                (output_fw, state_fw) = cell_fw(inputs[:, timestep, :], state_fw)
                outputs_fw.append(output_fw)

        # Backward direction
        outputs_bw = list()
        state_bw = initial_state_bw
        with tf.variable_scope('bw') as bw_scope:
            inputs = tf.reverse(inputs, [1])
            for timestep in range(timestep_size):
                if timestep > 0:
                    tf.get_variable_scope().reuse_variables()
                (output_bw, state_bw) = cell_bw(inputs[:, timestep, :], state_bw)
                outputs_bw.append(output_bw)

        # Reverse outputs_bw along the timestep dimension
        # outputs_bw.shape = [timestep_size, batch_size, hidden_size]
        outputs_bw = tf.reverse(outputs_bw, [0])
        # Concatenate the two outputs into [timestep_size, batch_size, hidden_size*2]
        output = tf.concat([outputs_fw, outputs_bw], 2)
        output = tf.transpose(output, perm=[1, 0, 2])
        output = tf.reshape(output, [-1, hidden_size * 2])
    return output  # [-1, hidden_size*2]


with tf.variable_scope('Inputs'):
    X_inputs = tf.placeholder(tf.int32, [None, timestep_size], name='X_input')
    y_inputs = tf.placeholder(tf.int32, [None, timestep_size], name='y_input')
    wordNum = tf.placeholder(tf.int32, name='wordNum')

bilstm_output = bi_lstm(X_inputs)

with tf.variable_scope('outputs'):
    softmax_w = weight_variable([hidden_size * 2, class_num])
    softmax_b = bias_variable([class_num])
    y_pred = tf.matmul(bilstm_output, softmax_w) + softmax_b

# Extra statistics to monitor
# y_inputs.shape = [batch_size, timestep_size]
correct_prediction = tf.equal(tf.cast(tf.argmax(y_pred, 1), tf.int32), tf.reshape(y_inputs, [-1]))
# accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
# Accuracy relative to wordNum, the number of real characters in the batch. The hard-coded 16000
# presumably equals batch_size * timestep_size (e.g. 500 * 32); adjust it if those values change.
accuracy = (tf.reduce_sum(tf.cast(correct_prediction, tf.int32)) + wordNum - 16000) / wordNum
cost = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=tf.reshape(y_inputs, [-1]), logits=y_pred))

# ***** Optimization *******
tvars = tf.trainable_variables()  # all trainable parameters of the model
grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars), max_grad_norm)  # clipped gradients of the loss
optimizer = tf.train.AdamOptimizer(learning_rate=lr)  # optimizer

# Gradient-descent update
train_op = optimizer.apply_gradients(zip(grads, tvars), global_step=tf.train.get_or_create_global_step())
# print('Finished creating the bi-lstm model.')

# Restore the trained model from a checkpoint
saver = tf.train.Saver()
best_model_path = 'ckpt/bi-lstm.ckpt-6'
saver.restore(sess, best_model_path)
X_tt, y_tt = data_train.next_batch(2)
# print('X_tt.shape=', X_tt.shape,'y_tt.shape=', y_tt.shape)
# print('X_tt = ', X_tt)
# print('y_tt = ', y_tt)

# Use labels (the state sequences) to estimate the transition probabilities.
# There are only a few states, so a dict {'I_tI_{t+1}': p} is used.
# A counts how often each state transition occurs.
A = {
    'sb': 0, 'ss': 0,
    'be': 0, 'bm': 0,
    'me': 0, 'mm': 0,
    'eb': 0, 'es': 0,
}

# zy holds the transition probabilities
zy = dict()
for label in labels:
    for t in range(len(label) - 1):
        key = label[t] + label[t + 1]
        A[key] += 1.0

zy['sb'] = A['sb'] / (A['sb'] + A['ss'])
zy['ss'] = 1.0 - zy['sb']
zy['be'] = A['be'] / (A['be'] + A['bm'])
zy['bm'] = 1.0 - zy['be']
zy['me'] = A['me'] / (A['me'] + A['mm'])
zy['mm'] = 1.0 - zy['me']
zy['eb'] = A['eb'] / (A['eb'] + A['es'])
zy['es'] = 1.0 - zy['eb']
keys = sorted(zy.keys())
print('the transition probability: ')
for key in keys:
    print(key, zy[key])
print(zy)
zy = {i: np.log(zy[i]) for i in zy.keys()}  # work with log-probabilities from here on
print(zy)


def viterbi(nodes):
    """Viterbi decoding.

    Every layer except the first has 4 nodes. For each node of the current layer (the first
    layer needs no computation), compute the new path length (probability) through each node
    of the previous layer and keep the maximum (the "shortest path"). The best paths of the
    previous layer are held in `paths`; while the current layer is computed they are kept in
    `paths_`, and the best paths of the current layer are then stored in `paths`, a dict of
    the form {path: path length}. After the last layer four paths remain, and the one with
    the largest probability is returned.
    """
    paths = {'b': nodes[0]['b'], 's': nodes[0]['s']}  # first layer: only two nodes
    for layer in range(1, len(nodes)):  # every following layer
        paths_ = paths.copy()  # keep the paths of the previous layer
        paths = {}  # reset paths
        # node_now is a node of this layer; path_last is a path ending in the previous layer
        for node_now in nodes[layer].keys():
            # nodes[layer] looks like {'s': 6.416104, 'b': 1.4611748, 'm': -2.1077693, 'e': 1.5862198}
            sub_paths = {}
            # Connect every path of the previous layer to this node
            for path_last in paths_.keys():
                if path_last[-1] + node_now in zy.keys():  # transition probability is non-zero
                    sub_paths[path_last + node_now] = (paths_[path_last] + nodes[layer][node_now]
                                                       + zy[path_last[-1] + node_now])
            # The "shortest path" is the one with the highest probability
            sr_subpaths = pd.Series(sub_paths)
            sr_subpaths = sr_subpaths.sort_values()  # ascending order
            node_subpath = sr_subpaths.index[-1]  # best path into node_now
            node_value = sr_subpaths.iloc[-1]  # its score
            paths[node_subpath] = node_value
    # After the last layer, pick the best of the remaining paths
    sr_paths = pd.Series(paths)
    sr_paths = sr_paths.sort_values()  # ascending order
    return sr_paths.index[-1]  # path with the highest probability


def text2ids(text):
    """Convert a text fragment into ids, shape [1, max_len]."""
    words = list(text)
    ids = list(word2id[words])
    if len(ids) >= max_len:  # too long: truncate
        print(u'The part of the fragment beyond %d characters cannot be processed' % (max_len))
        ids = ids[:max_len]
    else:
        ids.extend([0] * (max_len - len(ids)))  # too short: pad with 0
    return np.asarray(ids).reshape([-1, max_len])


def simple_cut(text):
    """Predict the segmentation of one fragment (punctuation splits a sentence into fragments)."""
    if text:
        text_len = len(text)
        X_batch = text2ids(text)  # each batch is a single sample here
        fetches = [y_pred]
        feed_dict = {X_inputs: X_batch, lr: 1.0, batch_size: 1, keep_prob: 1.0}
        _y_pred = sess.run(fetches, feed_dict)[0][:text_len]  # drop the padded part
        nodes = [dict(zip(['s', 'b', 'm', 'e'], each[1:])) for each in _y_pred]
        tags = viterbi(nodes)
        words = []
        for i in range(len(tags)):  # len(tags) <= max_len, so truncated text is handled safely
            if tags[i] in ['s', 'b']:
                words.append(text[i])
            else:
                words[-1] += text[i]
        return words
    else:
        return []


def cut_word(sentence):
    """Split a sentence into fragments at punctuation and Latin/digit strings, then segment each fragment."""
    not_cuts = re.compile(r'([0-9\da-zA-Z ]+)|[。，、？！.\.\?,!]')
    result = []
    start = 0
    for seg_sign in not_cuts.finditer(sentence):
        result.extend(simple_cut(sentence[start:seg_sign.start()]))
        result.append(sentence[seg_sign.start():seg_sign.end()])
        start = seg_sign.end()
    result.extend(simple_cut(sentence[start:]))
    return result


sentence = "你看我盡節存忠立功勛，單注著楚霸王大軍盡。"  # 你  看  我  盡  節  存  忠  立  功勛  ，單  注  著  楚霸王  大軍  盡  。
result = cut_word(sentence)
rss = ''
for each in result:
    rss = rss + each + ' / '
print(rss)
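The decoding step can also be tried without a trained checkpoint. The following is a minimal, self-contained sketch of log-space Viterbi decoding over the s/b/m/e tags using only the standard library. The transition probabilities and per-character scores are invented toy numbers, and `tags_to_words` is a hypothetical helper; unlike the code above, this version keys the surviving paths by their last tag and does not depend on pandas or TensorFlow.

```python
import math

# Toy log transition probabilities (invented numbers); pairs not listed are forbidden.
trans = {k: math.log(v) for k, v in {
    "sb": 0.4, "ss": 0.6, "be": 0.8, "bm": 0.2,
    "me": 0.7, "mm": 0.3, "eb": 0.5, "es": 0.5,
}.items()}


def viterbi(nodes, trans):
    """Return the highest-scoring tag string; nodes is a list of {tag: score} dicts."""
    # A sequence can only start with 'b' or 's'.
    paths = {t: (nodes[0][t], t) for t in ("b", "s")}
    for scores in nodes[1:]:
        new_paths = {}
        for tag, score in scores.items():
            candidates = [
                (prev_score + trans[prev + tag] + score, path + tag)
                for prev, (prev_score, path) in paths.items()
                if prev + tag in trans
            ]
            if candidates:
                new_paths[tag] = max(candidates)  # best path ending in this tag
        paths = new_paths
    return max(paths.values())[1]


def tags_to_words(text, tags):
    """Group characters into words: 's' and 'b' start a new word."""
    words = []
    for ch, tag in zip(text, tags):
        if tag in ("s", "b") or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words


text = "楚霸王盡"
nodes = [  # toy per-character scores standing in for the network output
    {"s": 0.1, "b": 2.0, "m": 0.0, "e": 0.0},
    {"s": 0.2, "b": 0.1, "m": 1.5, "e": 1.4},
    {"s": 0.3, "b": 0.0, "m": 0.2, "e": 2.0},
    {"s": 2.5, "b": 0.5, "m": 0.0, "e": 0.1},
]
tags = viterbi(nodes, trans)
print(tags, tags_to_words(text, tags))  # bmes ['楚霸王', '盡']
```

Keeping one best path per final tag is enough because a transition score depends only on the previous tag, so any worse path ending in the same tag can never win later.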
