CS224n Study Notes 03: Subword Models (fastText, with code)

Course contents

A little linguistics
Word-level and character-level models
The n-gram idea
The fastText model

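1. Sounds of human language: phonetics and phonology

Phonetics studies the sound stream, which belongs to the physical level.

Morphology: character n-grams as an alternative

Word-based models have some problems:
They need to handle a very large vocabulary. In English, a word whose form changes even slightly counts as a different word, for example the informal spelling "gooooood bye".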

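Character-level models
Word embeddings can be composed from character embeddings. This approach:

can generate embeddings for unknown words
gives similarly spelled words similar embeddings
solves the OOV (out-of-vocabulary) problem

Traditionally, a phoneme or a letter was not regarded as a semantic unit, but deep language models compose them into larger units.
Writing systems below the word level
Most deep learning work on language operates on its written form, because written text is easy to process and data is easy to find.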

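2. Word-level and character-level models

Word-level language modeling treats the word as the smallest unit of text. In semantic space, each word is like a node. Feature vectors or word vectors are produced with TF (term frequency) techniques, topic models, or word embedding models, so that every word is represented by a number or a vector, which can then be fed to a model such as a recurrent neural network for training.

Currently, the more popular approach is to use word embeddings as the feature vectors: a Word2Vec model is trained on a large corpus. The resulting model contains a vocabulary in which every word is mapped to a vector.

However, a word-level model must handle a very large vocabulary, and in English a word that changes form becomes a different word. Informal spellings (the original figure here listed some examples) therefore cause a lot of trouble.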

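Character-level language modeling represents each character with a 0-1 (one-hot) vector and feeds these vectors into the model for training. The grammar and word semantics of the text are not encoded explicitly, on the belief that the model can capture them by itself. The idea of character-level language modeling comes from signal processing.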
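To make the 0-1 representation concrete, here is a minimal sketch of one-hot encoding at the character level. The function name `one_hot_encode` and the 27-symbol alphabet are my own assumptions for illustration; a real system would use whatever character set its corpus needs.

```python
def one_hot_encode(text, alphabet):
    """Return one 0-1 vector per character of `text`."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    vectors = []
    for ch in text:
        vec = [0] * len(alphabet)
        vec[index[ch]] = 1  # raises KeyError for characters outside the alphabet
        vectors.append(vec)
    return vectors


# Hypothetical alphabet: lowercase letters plus the space character
alphabet = "abcdefghijklmnopqrstuvwxyz "
encoded = one_hot_encode("good bye", alphabet)
print(len(encoded), len(encoded[0]))  # 8 27
```

Each character becomes a vector with a single 1, so the model input carries no notion of words or grammar; any such structure has to be learned.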

A character-level model, by contrast, can generate a word embedding for an unknown word and gives similarly spelled words similar word embeddings, which solves the OOV problem.
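A rough sketch of how an embedding for an unseen word can be built from character n-grams, in the spirit of fastText's subword vectors. Everything here is a toy under stated assumptions: the helper names (`char_ngrams`, `build_ngram_table`, `embed`), the random 4-dimensional vectors, and the plain averaging are mine, and real fastText learns the n-gram vectors and hashes them into buckets.

```python
import random


def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in boundary markers < and >."""
    token = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(token) - n + 1):
            grams.append(token[i:i + n])
    return grams


def build_ngram_table(words, dim=4, seed=0):
    """Assign a random toy vector to every n-gram seen in `words`."""
    rng = random.Random(seed)
    table = {}
    for w in words:
        for g in char_ngrams(w):
            if g not in table:
                table[g] = [rng.uniform(-1, 1) for _ in range(dim)]
    return table


def embed(word, table, dim=4):
    """Average the vectors of the word's known n-grams (zeros if none are known)."""
    known = [table[g] for g in char_ngrams(word) if g in table]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


table = build_ngram_table(["playing", "played", "player"])
oov_vector = embed("replaying", table)  # "replaying" was never seen as a word
shared = set(char_ngrams("playing")) & set(char_ngrams("replaying"))
print(sorted(shared))
print(oov_vector)
```

Because "replaying" shares n-grams such as "play" and "ing>" with the known words, it still gets a non-zero vector, which is the mechanism behind the OOV claim above.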

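Pure character-level models did not give satisfactory results at first, but since 2015 researchers have gradually achieved some success.

For example, Luong and Manning tested a pure character-level seq2seq (LSTM) NMT system as a baseline. It performed as well as the word-level model but was very slow to train. The original figure here showed the system's BLEU scores.

Another figure in the original showed an example translated by this system, in which the character-level model performs well.
In 2018, researchers at Google analyzed LSTM-based character-level seq2seq models and found that deeper models give better results and that the char-level system outperforms the word-level system, although it takes much longer to run (shown in a figure in the original).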

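3. n-gram

The N-gram (sometimes called the N-gram model) is a very important concept in natural language processing. In NLP, given a corpus, N-grams are typically used to predict or evaluate whether a sentence is plausible. Another use of N-grams is to measure how different two strings are, which is a common technique in fuzzy matching.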
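As an illustration of the fuzzy-matching use, here is a small sketch that compares two strings by their character bigrams. The notes do not say which measure is meant, so the Dice coefficient used here is my own choice, and the function names are hypothetical.

```python
def char_bigrams(s):
    """Set of adjacent character pairs in a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_similarity(a, b):
    """Dice coefficient over character bigrams: 1.0 means identical bigram sets."""
    ga, gb = char_bigrams(a), char_bigrams(b)
    if not ga and not gb:
        return 1.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


print(dice_similarity("night", "nacht"))  # 0.25
```

"night" and "nacht" have four bigrams each and share only "ht", giving 2 * 1 / 8 = 0.25. A lower score means the strings differ more.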
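The N-gram model rests on one assumption: the occurrence of a word depends only on the N-1 words before it and on no other words. (This is the Markov assumption, which hidden Markov models also use.) The probability of the whole sentence is then the product of the conditional probabilities of its words, each of which can be estimated from counts in a corpus. Suppose the sentence T consists of the word sequence w_1, w_2, ..., w_n. By the chain rule, its exact probability is:

$$P(T) = P(w_1, w_2, \dots, w_n) = p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_1 w_2) \cdots p(w_n \mid w_1 w_2 \dots w_{n-1})$$

An N-gram model truncates each history to the previous N-1 words. The most commonly used N-gram models are the bigram and the trigram:

Bigram:

$$P(T) = p(w_1 \mid \text{begin})\,p(w_2 \mid w_1)\,p(w_3 \mid w_2) \cdots p(w_n \mid w_{n-1})$$

Trigram:

$$P(T) = p(w_1 \mid \text{begin}_2\,\text{begin}_1)\,p(w_2 \mid \text{begin}_1\,w_1)\,p(w_3 \mid w_1 w_2) \cdots p(w_n \mid w_{n-2} w_{n-1})$$

Besides these there are four-gram, five-gram, and so on, but N > 5 is rarely used.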
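The bigram formula can be sketched as a count-based model. This is a minimal, unsmoothed maximum-likelihood version on a made-up three-sentence corpus; the `<s>` token plays the role of "begin", and the function names are assumptions of mine.

```python
from collections import Counter


def train_bigram(sentences):
    """Count contexts and (previous word, word) pairs from whitespace-tokenized text."""
    context_counts, pair_counts = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split()
        context_counts.update(tokens[:-1])  # every token that is followed by a word
        pair_counts.update(zip(tokens, tokens[1:]))
    return context_counts, pair_counts


def sentence_prob(sentence, context_counts, pair_counts):
    """P(T) as the product of p(w_i | w_{i-1}); unseen events give probability 0."""
    tokens = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if context_counts[prev] == 0:
            return 0.0
        prob *= pair_counts[(prev, cur)] / context_counts[prev]
    return prob


# Toy corpus, invented for this example
corpus = ["i like nlp", "i like deep learning", "i enjoy nlp"]
contexts, pairs = train_bigram(corpus)
print(sentence_prob("i like nlp", contexts, pairs))
print(sentence_prob("nlp like i", contexts, pairs))
```

For "i like nlp" the factors are p(i | begin) = 3/3, p(like | i) = 2/3 and p(nlp | like) = 1/2, so the sentence scores about 0.33, while the scrambled sentence scores 0. Without smoothing, any unseen bigram zeroes the whole product, which is why practical models add smoothing.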

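4. The fastText model

fastText is a word-vector and text-classification tool that Facebook open-sourced in 2016. Its typical use case is supervised text classification. It offers a simple and efficient way to do text classification and representation learning, with accuracy comparable to deep learning models and much faster speed.

fastText combines some of the most successful ideas in natural language processing and machine learning: representing sentences with a bag of words and a bag of n-grams, using subword information, and sharing information across classes through a hidden representation. It also uses a hierarchical softmax (which exploits the imbalanced distribution of classes) to speed up computation.

These ideas serve two different tasks:

Efficient text classification: supervised learning
Learning word vector representations: unsupervised learning

For example, fastText can learn that "boy", "girl", "man", and "woman" refer to specific genders and store these values in the associated documents. Then, when a program handles a user request (say, "Where is my girlfriend now?"), it can immediately look things up in the documents fastText generated and understand that the user is asking about a woman.
The fastText method has three parts: the model architecture, hierarchical softmax, and N-gram features.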

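1. The fastText architecture is similar to the CBOW architecture in word2vec. This is no coincidence: Tomas Mikolov is an author of both (he worked on word2vec at Google and on fastText after moving to Facebook), and fastText can indeed be seen as a descendant of word2vec. In the classification setting, the averaged vectors of the words (and n-grams) in a text are fed to a linear classifier that predicts the label instead of the center word.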
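A minimal sketch of that CBOW-like classifier's forward pass: average the word vectors of the text, apply a linear layer, then a softmax. The vocabulary, the dimensions, and the random weights are all invented for illustration, and a flat softmax is used here rather than the hierarchical one.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(tokens, embeddings, weights):
    """Average the known word vectors, then apply a linear layer and softmax."""
    dim = len(weights[0])
    known = [embeddings[t] for t in tokens if t in embeddings]
    if known:
        hidden = [sum(col) / len(known) for col in zip(*known)]
    else:
        hidden = [0.0] * dim
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    return softmax(scores)


# Toy parameters; a trained model would learn these values
rng = random.Random(0)
dim, n_classes = 4, 3
vocab = ["team", "match", "engine", "film"]
embeddings = {w: [rng.uniform(-1, 1) for _ in range(dim)] for w in vocab}
weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_classes)]

probs = predict_proba(["team", "match", "unknown"], embeddings, weights)
print(probs)
```

Because the hidden layer is just an average, reordering the tokens does not change the prediction. That limitation is what the N-gram features in part 3 address.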

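2. For datasets with a large number of classes, fastText uses a hierarchical classifier instead of a flat one. The classes are organized in a tree (think of a binary tree rather than a list). In some text classification tasks there are many classes, so computing a linear classifier over all of them is expensive. To improve running time, fastText uses hierarchical softmax, which builds on Huffman coding to encode the labels and greatly reduces the number of targets the model has to evaluate for each prediction.

fastText also exploits the fact that classes are imbalanced (some classes occur more often than others) by using the Huffman algorithm to build the tree that represents the classes. Frequent classes therefore sit at a shallower depth in the tree than infrequent ones, which makes the computation even more efficient.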
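The depth claim can be checked with a small Huffman construction. This sketch only computes how deep each class ends up in the tree; the class counts are invented, and it is not how fastText itself stores the tree.

```python
import heapq
from itertools import count


def huffman_depths(freqs):
    """Return {label: depth} for a Huffman tree built from label frequencies."""
    tiebreak = count()  # unique counter so equal frequencies never compare dicts
    heap = [(f, next(tiebreak), {label: 0}) for label, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every label in them one level deeper
        merged = {label: depth + 1 for label, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


# Hypothetical, deliberately imbalanced class counts
class_counts = {"sports": 50, "technology": 25, "car": 15,
                "military": 7, "entertainment": 3}
depths = huffman_depths(class_counts)
print(depths)
```

The most frequent class ("sports") ends up at depth 1 and the rarest ("entertainment") at depth 4, so predicting common labels needs fewer binary decisions than a pass over all classes would.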

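3. fastText can be used for both text (document) classification and sentence classification. In either case the usual features are bag-of-words features. A bag of words ignores word order, however, so fastText also adds N-gram features. For the sentence "我爱她" ("I love her"), the bag-of-words features are "我", "爱", "她", which are exactly the features of "她爱我" ("she loves me"). Adding bigrams gives the first sentence the extra features "我-爱" and "爱-她", so the two sentences can now be told apart. Of course, to stay efficient, low-frequency N-grams need to be filtered out.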
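The "我爱她" / "她爱我" example can be reproduced in a few lines. The feature functions below are a sketch of the idea only; fastText itself hashes n-grams into buckets rather than keeping string keys.

```python
from collections import Counter


def bow_features(tokens):
    """Bag of words: token counts, order ignored."""
    return Counter(tokens)


def bow_bigram_features(tokens):
    """Bag of words plus adjacent-token bigrams joined with '-'."""
    feats = Counter(tokens)
    feats.update(a + "-" + b for a, b in zip(tokens, tokens[1:]))
    return feats


s1 = ["我", "爱", "她"]
s2 = ["她", "爱", "我"]
print(bow_features(s1) == bow_features(s2))                # True
print(bow_bigram_features(s1) == bow_bigram_features(s2))  # False
```

With unigrams only, the two sentences are indistinguishable. The bigrams "我-爱" and "爱-她" versus "她-爱" and "爱-我" separate them.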

fastText supervised classification

# -*- coding: utf-8 -*-
# This script uses the legacy fasttext Python bindings (fasttext.supervised, predict_proba).
# Newer releases of the package expose fasttext.train_supervised and model.predict instead.
import random

import fasttext
import jieba
import pandas as pd

cate_dic = {'technology': 1, 'car': 2, 'entertainment': 3, 'military': 4, 'sports': 5}


def loadData():
    """Load the five news datasets."""
    # Read the data with pandas and drop empty rows
    df_technology = pd.read_csv("./data/technology_news.csv", encoding="utf-8")
    df_technology = df_technology.dropna()
    df_car = pd.read_csv("./data/car_news.csv", encoding="utf-8")
    df_car = df_car.dropna()
    df_entertainment = pd.read_csv("./data/entertainment_news.csv", encoding="utf-8")
    df_entertainment = df_entertainment.dropna()
    df_military = pd.read_csv("./data/military_news.csv", encoding="utf-8")
    df_military = df_military.dropna()
    df_sports = pd.read_csv("./data/sports_news.csv", encoding="utf-8")
    df_sports = df_sports.dropna()

    technology = df_technology.content.values.tolist()[1000:21000]
    car = df_car.content.values.tolist()[1000:21000]
    entertainment = df_entertainment.content.values.tolist()[:20000]
    military = df_military.content.values.tolist()[:20000]
    sports = df_sports.content.values.tolist()[:20000]
    return technology, car, entertainment, military, sports


def getStopWords(datapath):
    """Load stop words.

    datapath: path to the stop-word file
    returns: the set of stop words
    """
    stopwords = pd.read_csv(datapath, index_col=False, quoting=3, sep="\t",
                            names=['stopword'], encoding='utf-8')
    return set(stopwords["stopword"].values)  # a set makes membership tests fast


def preprocess_text(content_line, sentences, category, stopwords):
    """Segment each text, remove stop words, and append a labeled line.

    content_line: the texts
    sentences: list that collects the processed lines
    category: class id of these texts
    """
    for line in content_line:
        try:
            segs = jieba.lcut(line)    # Chinese word segmentation with jieba
            segs = filter(lambda x: len(x) > 1, segs)    # drop tokens of length 1 or less
            segs = filter(lambda x: x not in stopwords, segs)    # drop stop words
            # Join the label and the text into fastText's training format
            sentences.append("__label__" + str(category) + " " + " ".join(segs))
        except Exception as e:
            print(line)
            continue


def writeData(sentences, fileName):
    """Write the processed lines to a file for later use."""
    print("writing data to fasttext format...")
    with open(fileName, 'w', encoding='utf-8') as out:
        for sentence in sentences:
            out.write(sentence + "\n")
    print("done!")


def preprocessData(stopwords, saveDataFile):
    """Run the whole preprocessing pipeline."""
    technology, car, entertainment, military, sports = loadData()
    # Remove stop words and build the dataset
    sentences = []
    preprocess_text(technology, sentences, cate_dic["technology"], stopwords)
    preprocess_text(car, sentences, cate_dic["car"], stopwords)
    preprocess_text(entertainment, sentences, cate_dic["entertainment"], stopwords)
    preprocess_text(military, sentences, cate_dic["military"], stopwords)
    preprocess_text(sports, sentences, cate_dic["sports"], stopwords)
    random.shuffle(sentences)    # shuffle so samples of one class do not cluster together
    writeData(sentences, saveDataFile)


if __name__ == "__main__":
    stopwordsFile = r"./data/stopwords.txt"
    stopwords = getStopWords(stopwordsFile)
    saveDataFile = r'train_data.txt'
    preprocessData(stopwords, saveDataFile)

    # fasttext.supervised(): supervised learning
    classifier = fasttext.supervised(saveDataFile, 'classifier.model', label_prefix='__label__')
    # Note: this evaluates on the training file itself, so the scores are optimistic
    result = classifier.test(saveDataFile)
    print("P@1:", result.precision)    # precision
    print("R@1:", result.recall)    # recall
    print("Number of examples:", result.nexamples)    # number of evaluated examples

    # Actual prediction
    label_to_cate = {1: 'technology', 2: 'car', 3: 'entertainment', 4: 'military', 5: 'sports'}
    texts = ['中新网 日电 2018 预赛 亚洲区 强赛 中国队 韩国队 较量 比赛 上半场 分钟 主场 作战 中国队 率先 打破 场上 僵局 利用 角球 机会 大宝 前点 攻门 得手 中国队 领先']
    labels = classifier.predict(texts)
    print(labels)
    print(label_to_cate[int(labels[0][0])])
    # Class plus probability
    labels = classifier.predict_proba(texts)
    print(labels)
    # Top k classes
    labels = classifier.predict(texts, k=3)
    print(labels)
    # Top k classes plus probabilities
    labels = classifier.predict_proba(texts, k=3)
    print(labels)

fastText unsupervised learning (word vectors)

# -*- coding: utf-8 -*-
# This script uses the legacy fasttext Python bindings (fasttext.skipgram, fasttext.cbow).
# Newer releases of the package expose fasttext.train_unsupervised instead.
import random

import fasttext
import jieba
import pandas as pd


def loadData():
    """Load the five news datasets."""
    # Read the data with pandas and drop empty rows
    df_technology = pd.read_csv("./data/technology_news.csv", encoding="utf-8")
    df_technology = df_technology.dropna()
    df_car = pd.read_csv("./data/car_news.csv", encoding="utf-8")
    df_car = df_car.dropna()
    df_entertainment = pd.read_csv("./data/entertainment_news.csv", encoding="utf-8")
    df_entertainment = df_entertainment.dropna()
    df_military = pd.read_csv("./data/military_news.csv", encoding="utf-8")
    df_military = df_military.dropna()
    df_sports = pd.read_csv("./data/sports_news.csv", encoding="utf-8")
    df_sports = df_sports.dropna()

    technology = df_technology.content.values.tolist()[1000:21000]
    car = df_car.content.values.tolist()[1000:21000]
    entertainment = df_entertainment.content.values.tolist()[:20000]
    military = df_military.content.values.tolist()[:20000]
    sports = df_sports.content.values.tolist()[:20000]
    return technology, car, entertainment, military, sports


def getStopWords(datapath):
    """Load stop words.

    datapath: path to the stop-word file
    returns: the set of stop words
    """
    stopwords = pd.read_csv(datapath, index_col=False, quoting=3, sep="\t",
                            names=['stopword'], encoding='utf-8')
    return set(stopwords["stopword"].values)  # a set makes membership tests fast


def preprocess_text(content_line, sentences, stopwords):
    """Segment each text, remove stop words, and append the result (no label).

    content_line: the texts
    sentences: list that collects the processed lines
    """
    for line in content_line:
        try:
            segs = jieba.lcut(line)    # Chinese word segmentation with jieba
            segs = filter(lambda x: len(x) > 1, segs)    # drop tokens of length 1 or less
            segs = filter(lambda x: x not in stopwords, segs)    # drop stop words
            sentences.append(" ".join(segs))
        except Exception as e:
            print(line)
            continue


def writeData(sentences, fileName):
    """Write the processed lines to a file for later use."""
    print("writing data to fasttext format...")
    with open(fileName, 'w', encoding='utf-8') as out:
        for sentence in sentences:
            out.write(sentence + "\n")
    print("done!")


def preprocessData(stopwords, saveDataFile):
    """Run the whole preprocessing pipeline."""
    technology, car, entertainment, military, sports = loadData()
    # Remove stop words and build the dataset
    sentences = []
    preprocess_text(technology, sentences, stopwords)
    preprocess_text(car, sentences, stopwords)
    preprocess_text(entertainment, sentences, stopwords)
    preprocess_text(military, sentences, stopwords)
    preprocess_text(sports, sentences, stopwords)
    random.shuffle(sentences)    # shuffle so samples of one class do not cluster together
    writeData(sentences, saveDataFile)


if __name__ == "__main__":
    stopwordsFile = r"./data/stopwords.txt"
    stopwords = getStopWords(stopwordsFile)
    saveDataFile = r'unsupervised_train_data.txt'
    preprocessData(stopwords, saveDataFile)

    # fasttext.load_model loads a saved model, whether supervised or unsupervised.
    # fasttext.skipgram() and fasttext.cbow() are both unsupervised and train word vectors.
    # Separate output names keep the second model from overwriting the first.
    model = fasttext.skipgram(saveDataFile, 'skipgram_model')
    print(model.words)    # the vocabulary (the words that have vectors), not the vectors
    # CBOW model
    model = fasttext.cbow(saveDataFile, 'cbow_model')
    print(model.words)    # the vocabulary of the CBOW model