I. Introduction to gensim

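Gensim is an open-source third-party Python toolkit for learning, without supervision, latent topic-vector representations from raw, unstructured text.
It supports several topic-model and embedding algorithms, including TF-IDF, LSA, LDA, and word2vec, supports streaming training, and provides APIs for common tasks such as similarity computation and information retrieval.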

Basic corpus-processing tools

LSI
LDA
HDP
DTM
DIM
TF-IDF
word2vec, paragraph2vec

Basic concepts

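Corpus: a collection of raw texts used to train the latent topic structure without supervision. No manually annotated information is needed. In Gensim, a corpus is usually an iterable object (for example a list); each iteration yields a sparse vector that represents one document.
Vector: a list of text features. It is the internal representation of a piece of text in Gensim.
Sparse vector (SparseVector): the zero elements of a vector can usually be omitted. Each remaining element is then a (key, value) tuple, where the key is the feature id and the value is its weight.
Model: an abstract term for a transformation between two vector spaces (that is, from one vector representation of a text to another).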
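
To make the sparse-vector idea concrete, here is a minimal pure-Python sketch. It is not Gensim's own code; the helper names `to_sparse` and `to_dense` are made up for illustration. It drops the zero entries of a dense vector and then restores them.

```python
def to_sparse(dense):
    """Keep only the non-zero entries as (index, value) pairs."""
    return [(i, v) for i, v in enumerate(dense) if v != 0]


def to_dense(sparse, length):
    """Rebuild the full vector; indices that are missing become zeros."""
    dense = [0] * length
    for i, v in sparse:
        dense[i] = v
    return dense


dense = [0, 2, 0, 0, 1, 0]
sparse = to_sparse(dense)
print(sparse)                                  # [(1, 2), (4, 1)]
print(to_dense(sparse, len(dense)) == dense)   # True
```

The sparse form only pays for the features that actually occur, which matters when the vocabulary has tens of thousands of entries and a document uses a few dozen of them.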

II. Training models

1. Training

The simplest way to train:

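# Simplest start
import gensim
sentences = [['first', 'sentence'], ['second', 'sentence', 'is']]
# Train the model
model = gensim.models.Word2Vec(sentences, min_count=1)
# min_count: frequency threshold; words that occur at least min_count times are kept
# size: number of units in the neural-network layer, i.e. the dimensionality of the word vectors
#       (named vector_size in gensim 4.x); it also corresponds to the degrees of freedom of the training algorithm
# workers: number of worker threads, e.g. workers=4; a single worker means no parallelization.
#          Parallel training only takes effect when Cython is installed; without Cython only one core is used.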

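The second way to train:

# Second way to train
new_model = gensim.models.Word2Vec(min_count=1)  # start with an empty model
new_model.build_vocab(sentences)                 # can be a non-repeatable, 1-pass generator
# `iter` is the epoch count in gensim 3.x (named `epochs` in gensim 4.x)
new_model.train(sentences, total_examples=new_model.corpus_count, epochs=new_model.iter)
# with more than one epoch the corpus must be restartable (a list or an object with __iter__), not a 1-pass generator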
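
Training reads the corpus once to build the vocabulary and then once per epoch, so the difference between a restartable iterable and a one-pass generator matters. The sketch below uses only the standard library and hypothetical names (`ListCorpus`, `one_pass`) to show that difference; it does not call gensim.

```python
class ListCorpus:
    """Restartable corpus: every call to __iter__ starts a fresh pass."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield line.split()


def one_pass(lines):
    """Plain generator: it is exhausted after the first pass."""
    for line in lines:
        yield line.split()


lines = ["first sentence", "second sentence is"]
restartable = ListCorpus(lines)
gen = one_pass(lines)

# Count the sentences seen on two consecutive passes over each corpus.
passes_restartable = [len(list(restartable)) for _ in range(2)]
passes_generator = [len(list(gen)) for _ in range(2)]
print(passes_restartable)  # [2, 2]
print(passes_generator)    # [2, 0]
```

The second pass over the generator sees nothing, which is why a class with `__iter__` (or a list) is the safer input when the same corpus is used for both `build_vocab` and multi-epoch `train`.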

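Example:

# encoding=utf-8
from gensim.models import word2vec
# the file holds word-segmented toner reviews
sentences = word2vec.Text8Corpus(u'分词后的爽肤水评论.txt')
model = word2vec.Word2Vec(sentences, size=50)
y2 = model.similarity(u"好", u"还行")
print(y2)
for i in model.most_similar(u"滋润"):
    print(i[0], i[1])

The txt file contains 50,000 reviews that have already been word-segmented, so training the model takes a single statement: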

model=word2vec.Word2Vec(sentences,min_count=5,size=50)

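The first argument is the training corpus. The second, min_count, removes words that occur fewer times than this value (default 5).
The third, size, is the number of hidden-layer units in the neural network, which is also the dimensionality of the word vectors (default 100).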
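
The effect of min_count can be illustrated without gensim. The following is a minimal sketch, assuming the rule is simply "keep a word if its corpus-wide count is at least min_count"; `prune_vocab` is a hypothetical helper, not a gensim function.

```python
from collections import Counter


def prune_vocab(sentences, min_count):
    """Count every word in the corpus and keep those seen at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


sentences = [['first', 'sentence'], ['second', 'sentence', 'is']]
print(prune_vocab(sentences, 1))  # all four words survive
print(prune_vocab(sentences, 2))  # {'sentence': 2}
```

With the tiny two-sentence corpus used earlier, the default threshold of 5 would leave an empty vocabulary, which is why that example passes min_count=1.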

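2. Using the model

# Similarity from word vectors
model.similarity('first', 'is')    # cosine similarity between two words
model.most_similar(positive=['first', 'second'], negative=['sentence'], topn=1)  # analogy query
model.doesnt_match("input is lunch he sentence cat".split())                   # find the word that does not fit

How to inspect a word vector stored in the model:

# Word-vector lookup (newer gensim versions use model.wv['first'])
model['first']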

3. Exporting and importing models

The simplest export and import:

# Save and load a model
from gensim.models import Word2Vec
model.save('/tmp/mymodel')
new_model = gensim.models.Word2Vec.load('/tmp/mymodel')
# Word2Vec.load_word2vec_format belongs to older gensim; newer versions use KeyedVectors.load_word2vec_format
model = Word2Vec.load_word2vec_format('/tmp/vectors.txt', binary=False)  # load a .txt file
# using gzipped/bz2 input works too, no need to unzip:
model = Word2Vec.load_word2vec_format('/tmp/vectors.bin.gz', binary=True)  # load a .bin file

word2vec = gensim.models.word2vec.Word2Vec(sentences(), size=256, window=10, min_count=64, sg=1, hs=1, iter=10, workers=25)
word2vec.save('word2vec_wx')

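word2vec.save writes the model to disk; here it is not exported in .bin format.

import pandas as pd
model = gensim.models.Word2Vec.load('xxx/word2vec_wx')
pd.Series(model.most_similar(u'微信', topn=360000))

Load it back with gensim.models.Word2Vec.load.

The NumPy arrays stored alongside the model can be read with numpy.load: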

import numpy
word_2x = numpy.load('xxx/word2vec_wx.wv.syn0.npy')

There are other ways to import:

import gensim
word_vectors = gensim.models.KeyedVectors.load_word2vec_format('/tmp/vectors.txt', binary=False)  # C text format
word_vectors = gensim.models.KeyedVectors.load_word2vec_format('/tmp/vectors.bin', binary=True)  # C binary format

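These calls load the txt format and the bin format.

Other ways to export:

from gensim.models import KeyedVectors
# save
model.save(fname)  # only a model saved this way can continue training
model.wv.save_word2vec_format(outfile + '.model.bin', binary=True)   # C binary format; about half the disk space of the previous method
model.wv.save_word2vec_format(outfile + '.model.txt', binary=False)  # C text format; large on disk, about the same as the first method
# load
model = gensim.models.Word2Vec.load(fname)
word_vectors = KeyedVectors.load_word2vec_format('/tmp/vectors.txt', binary=False)
word_vectors = KeyedVectors.load_word2vec_format('/tmp/vectors.bin', binary=True)
# the most memory-efficient way to load
model = gensim.models.Word2Vec.load('model path')
word_vectors = model.wv
del model
word_vectors.init_sims(replace=True)

Source: Jianshu (简书). Note: if you do not plan to train the model any further, calling init_sims makes the model more memory-efficient; with replace=True the stored vectors are overwritten by their L2-normalized versions.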

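4. Incremental training

model = gensim.models.Word2Vec.load('/tmp/mymodel')
model.train(more_sentences)  # older gensim; newer versions also require total_examples and epochs

A model produced by the C tool (loaded with load_word2vec_format) cannot be trained further.

# Incremental training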
model = gensim.models.Word2Vec.load(temp_path)
more_sentences = [['Advanced', 'users', 'can', 'load', 'a', 'model', 'and', 'continue', 'training', 'it', 'with', 'more', 'sentences']]
model.build_vocab(more_sentences, update=True)
model.train(more_sentences, total_examples=model.corpus_count, epochs=model.iter)

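5. bow2vec + TF-IDF models

5.1 Bow2vec

The main steps are:
split sentences into word-level tokens (tokenization);
build the dictionary;
build the sparse document matrix.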

import os
import tempfile
from gensim import corpora

TEMP_FOLDER = tempfile.gettempdir()  # assumption: any writable folder works here

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
# Tokenize and remove stop words
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
# Resulting token lists:
# [['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
#  ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
#  ['eps', 'user', 'interface', 'management', 'system'],
#  ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
#  ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
#  ['generation', 'random', 'binary', 'unordered', 'trees'],
#  ['intersection', 'graph', 'paths', 'trees'],
#  ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
#  ['graph', 'minors', 'survey']]
# Build the dictionary
dictionary = corpora.Dictionary(texts)
dictionary.save(os.path.join(TEMP_FOLDER, 'deerwester.dict'))  # store the dictionary, for future reference
print(dictionary)
print(dictionary.token2id)  # show every word in the dictionary
# Build the sparse document matrix
# bag-of-words for a single sentence
new_doc = "Human computer interaction Human"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)  # the word "interaction" does not appear in the dictionary and is ignored
# The result is a list of (id, count) pairs. For example, [(0, 1), (1, 1)] would mean that word 0 of the
# dictionary occurs once in the sentence and word 1 occurs once; the actual ids and counts depend on the dictionary built above.
# bag-of-words for several sentences
[dictionary.doc2bow(text) for text in texts]  # word ids and counts for each sentence

5.2 TF-IDF

from gensim import corpora, models, similarities
corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
          [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
          [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
          [(0, 1.0), (4, 2.0), (7, 1.0)],
          [(3, 1.0), (5, 1.0), (6, 1.0)],
          [(9, 1.0)],
          [(9, 1.0), (10, 1.0)],
          [(9, 1.0), (10, 1.0), (11, 1.0)],
          [(8, 1.0), (10, 1.0), (11, 1.0)]]
tfidf = models.TfidfModel(corpus)
# Apply the model to a bag-of-words vector
vec = [(0, 1), (4, 1), (9, 1)]
print(tfidf[vec])
>>>  [(0, 0.695546419520037), (4, 0.5080429008916749), (9, 0.5080429008916749)]

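This looks up the TF-IDF weights of words 0, 4, and 9 in vec. Applying the same transformation to the whole corpus turns the term counts in the document matrix into TF-IDF weights.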
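
The numbers above can be reproduced by hand. The sketch below is plain Python, not gensim's implementation; it assumes the weighting is raw term frequency times log2(N / document frequency), followed by L2 normalization, which matches the values printed above. The function names `fit_idf` and `tfidf_transform` are made up for illustration.

```python
import math


def fit_idf(corpus):
    """Compute idf = log2(N / df) for every term id seen in the corpus."""
    n_docs = len(corpus)
    df = {}
    for doc in corpus:
        for term_id, _ in doc:
            df[term_id] = df.get(term_id, 0) + 1
    return {term_id: math.log2(n_docs / d) for term_id, d in df.items()}


def tfidf_transform(idf, bow):
    """Weight each (id, tf) pair by idf, then scale the vector to unit length."""
    weighted = [(t, tf * idf[t]) for t, tf in bow if t in idf]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(t, w / norm) for t, w in weighted] if norm else []


corpus = [
    [(0, 1.0), (1, 1.0), (2, 1.0)],
    [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
    [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
    [(0, 1.0), (4, 2.0), (7, 1.0)],
    [(3, 1.0), (5, 1.0), (6, 1.0)],
    [(9, 1.0)],
    [(9, 1.0), (10, 1.0)],
    [(9, 1.0), (10, 1.0), (11, 1.0)],
    [(8, 1.0), (10, 1.0), (11, 1.0)],
]
idf = fit_idf(corpus)
result = tfidf_transform(idf, [(0, 1), (4, 1), (9, 1)])
print([(t, round(w, 4)) for t, w in result])  # [(0, 0.6955), (4, 0.508), (9, 0.508)]
```

Word 0 appears in 2 of the 9 documents while words 4 and 9 each appear in 3, so word 0 is rarer and receives the larger weight.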

Using TF-IDF to compute similarity:

# Compute similarity
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=12)
vec = [(0, 1), (4, 1),(9, 1)]
sims = index[tfidf[vec]]
print(list(enumerate(sims)))
>>>[(0, 0.40157393), (1, 0.16485332), (2, 0.21189235), (3, 0.70710677), (4, 0.0), (5, 0.5080429), (6, 0.35924056), (7, 0.25810757), (8, 0.0)]

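A document-level index is built over the 9 documents in corpus; vec is the bag-of-words of a new document, and sims holds the similarity of vec to each of the nine documents in corpus.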
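
Conceptually, such an index stores one unit-length vector per document, and a query is answered with one dot product per document (cosine similarity). The sketch below shows that idea in plain Python on a three-document toy corpus; `build_index` and `query` are hypothetical names, and the real SparseMatrixSimilarity class works on a sparse matrix rather than on dictionaries.

```python
import math


def normalize(vec):
    """Turn a list of (id, weight) pairs into a unit-length {id: weight} dict."""
    norm = math.sqrt(sum(w * w for _, w in vec))
    return {t: w / norm for t, w in vec} if norm else {}


def build_index(docs):
    """Normalize every document once, up front."""
    return [normalize(doc) for doc in docs]


def query(index, vec):
    """Cosine similarity of the query against every indexed document."""
    q = normalize(vec)
    return [sum(w * doc.get(t, 0.0) for t, w in q.items()) for doc in index]


docs = [[(0, 1.0), (1, 1.0)], [(1, 2.0)], [(2, 3.0)]]
index = build_index(docs)
sims = query(index, [(1, 1.0)])
print([round(s, 4) for s in sims])  # [0.7071, 1.0, 0.0]
```

A document that shares no term with the query scores exactly 0, which is what happens for documents 4 and 8 in the output above.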

Saving and loading the index:

index.save('/tmp/deerwester.index')
index = similarities.SparseMatrixSimilarity.load('/tmp/deerwester.index')

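5.3 Further transformations

Latent semantic indexing (LSI) transforms the TF-IDF corpus into a latent 2-D space:

lsi = models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=2)  # initialize an LSI transformation
corpus_lsi = lsi[tfidf[corpus]]  # double wrapper over the original corpus: bow->tfidf->fold-in-lsi

With num_topics=2 set, models.LsiModel.print_topics() shows what this step produced:

lsi.print_topics(2)

According to LSI, "trees", "graph", and "minors" are related words (and contribute the most to the direction of the first topic), while the second topic practically concerns itself with all the other words. As expected, the first five documents are more strongly related to the second topic, and the remaining four are most strongly related to the first: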

>>> for doc in corpus_lsi: # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
...     print(doc)
[(0, -0.066), (1, 0.520)] # "Human machine interface for lab abc computer applications"
[(0, -0.197), (1, 0.761)] # "A survey of user opinion of computer system response time"
[(0, -0.090), (1, 0.724)] # "The EPS user interface management system"
[(0, -0.076), (1, 0.632)] # "System and human system engineering testing of EPS"
[(0, -0.102), (1, 0.574)] # "Relation of user perceived response time to error measurement"
[(0, -0.703), (1, -0.161)] # "The generation of random binary unordered trees"
[(0, -0.877), (1, -0.168)] # "The intersection graph of paths in trees"
[(0, -0.910), (1, -0.141)] # "Graph minors IV Widths of trees and well quasi ordering"
[(0, -0.617), (1, 0.054)] # "Graph minors A survey"

III. Using a word2vec model trained with gensim

1. Similarity

Several word-similarity tasks are supported:
similar words with similarity scores (model.most_similar), model.doesnt_match, and model.similarity (pairwise similarity)

model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.50882536)]
model.most_similar(positive='woman', topn=topn, restrict_vocab=restrict_vocab)  # pass a word directly
model.most_similar(positive=[vector], topn=topn, restrict_vocab=restrict_vocab)  # pass a vector directly
model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
model.similarity('woman', 'man')
0.73723527
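
The idea behind these queries can be sketched with toy vectors and the standard library. This is an illustration under assumptions, not gensim's code: the three-dimensional vectors are invented by hand, and the functions only mimic the general approach (add unit vectors of the positive words, subtract those of the negative words, rank the rest by cosine similarity; for the odd-one-out task, pick the word least similar to the mean).

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def combine(vectors, positive, negative=()):
    """Sum the unit vectors of positive words and subtract those of negative words."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, unit(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, unit(vectors[word]))]
    return target


def most_similar(vectors, positive, negative=(), topn=1):
    target = combine(vectors, positive, negative)
    exclude = set(positive) | set(negative)  # query words are never returned
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


def doesnt_match(vectors, words):
    center = combine(vectors, words)
    return min(words, key=lambda w: cosine(center, vectors[w]))


# Hand-made toy embeddings (axes: royalty, male, female); purely illustrative.
people = {'king': [1.0, 1.0, 0.0], 'queen': [1.0, 0.0, 1.0], 'man': [0.0, 1.0, 0.0],
          'woman': [0.0, 0.0, 1.0], 'prince': [1.0, 1.0, 0.1]}
meals = {'breakfast': [1.0, 0.1], 'lunch': [0.9, 0.2], 'dinner': [1.0, 0.2], 'cereal': [0.2, 1.0]}

best = most_similar(people, positive=['woman', 'king'], negative=['man'])
odd = doesnt_match(meals, ['breakfast', 'cereal', 'dinner', 'lunch'])
print(best[0][0])  # queen
print(odd)         # cereal
```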

2. Word vectors

A word's vector can be obtained like this:

model['computer']  # raw NumPy vector of a word
array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)

3. Vocabulary

model.wv.vocab.keys()  # gensim 3.x; in gensim 4.x the vocabulary is model.wv.key_to_index

Case 1: training on 8 million WeChat articles

Source: 【不可思议的Word2Vec】 2.训练好的模型

Training procedure:

import gensim, logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
import pymongo
import hashlib

db = pymongo.MongoClient('172.16.0.101').weixin.text_articles_words
md5 = lambda s: hashlib.md5(s).hexdigest()

class sentences:
    def __iter__(self):
        texts_set = set()  # md5 digests of the articles already seen
        for a in db.find(no_cursor_timeout=True):
            if md5(a['text'].encode('utf-8')) in texts_set:
                continue
            else:
                texts_set.add(md5(a['text'].encode('utf-8')))
                yield a['words']
        print(u'Processed %s articles in total' % len(texts_set))

word2vec = gensim.models.word2vec.Word2Vec(sentences(), size=256, window=10, min_count=64, sg=1, hs=1, iter=10, workers=25)
word2vec.save('word2vec_wx')

hashlib.md5 is used here to deduplicate the articles (10 million articles originally, 8 million after deduplication); this step is optional.
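
The deduplication step can be tried in isolation. In the sketch below an in-memory list stands in for the MongoDB collection (an assumption made so the example runs anywhere), and `unique_articles` is a hypothetical helper name.

```python
import hashlib


def unique_articles(articles):
    """Yield the token list of each article whose text has not been seen before."""
    seen = set()
    for article in articles:
        digest = hashlib.md5(article['text'].encode('utf-8')).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier article
        seen.add(digest)
        yield article['words']


articles = [
    {'text': 'hello world', 'words': ['hello', 'world']},
    {'text': 'hello world', 'words': ['hello', 'world']},    # exact duplicate
    {'text': 'hello gensim', 'words': ['hello', 'gensim']},
]
unique = list(unique_articles(articles))
print(unique)  # [['hello', 'world'], ['hello', 'gensim']]
```

Keeping fixed-length digests instead of full article texts keeps the `seen` set small, at the cost of catching only byte-for-byte identical articles.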

Case 2: training character vectors and word vectors

Source (GitHub): https://github.com/nlpjoe/daguan-classify-2018/blob/master/src/preprocess/EDA.ipynb

# Train word vectors
import pandas as pd
from tqdm import tqdm
from gensim.models import Word2Vec

# train_df and test_df are DataFrames loaded earlier in the source notebook
def train_w2v_model(type='article', min_freq=5, size=100):
    sentences = []
    if type == 'char':
        corpus = pd.concat((train_df['article'], test_df['article']))
    elif type == 'word':
        corpus = pd.concat((train_df['word_seg'], test_df['word_seg']))
    else:
        # without this branch, any other value leaves `corpus` undefined
        raise ValueError("type must be 'char' or 'word'")
    for e in tqdm(corpus):
        sentences.append([i for i in e.strip().split() if i])
    print('Corpus size:', len(corpus))
    print('Total sentences:', len(sentences))
    model = Word2Vec(sentences, size=size, window=5, min_count=min_freq)
    model.itos = {}       # index -> token
    model.stoi = {}       # token -> index
    model.embedding = {}  # index -> vector
    print('Saving model...')
    for k in tqdm(model.wv.vocab.keys()):
        model.itos[model.wv.vocab[k].index] = k
        model.stoi[k] = model.wv.vocab[k].index
        model.embedding[model.wv.vocab[k].index] = model.wv[k]
    model.save('../../data/word2vec-models/word2vec.{}.{}d.mfreq{}.model'.format(type, size, min_freq))
    return model

model = train_w2v_model(type='char', size=100)
model = train_w2v_model(type='word', size=100)
# model.wv.save_word2vec_format('../../data/laozhu-word-300d', binary=False)
# train_df[:3]
print('OK')

Case 3: text-similarity analysis with Python + gensim

Source: Python+gensim-文本相似度分析

2. Implementation

from gensim import corpora, models, similarities
import jieba
# Texts and search query
texts = ['吃鸡这里所谓的吃鸡并不是真的吃鸡，也不是我们常用的谐音词刺激的意思',
         '而是出自策略射击游戏《绝地求生：大逃杀》里的台词',
         '我吃鸡翅，你吃鸡腿']
keyword = '玩过吃鸡？今晚一起吃鸡'
# 1. Turn the texts into token lists
texts = [jieba.lcut(text) for text in texts]
# 2. Build the dictionary from the texts and get its feature count
dictionary = corpora.Dictionary(texts)
feature_cnt = len(dictionary.token2id)
# 3. Using the dictionary, convert the token lists into sparse vectors; together they form the corpus
corpus = [dictionary.doc2bow(text) for text in texts]
# 4. Process the corpus with the TF-IDF model
tfidf = models.TfidfModel(corpus)
# 5. Likewise, convert the search query into a sparse vector with the dictionary
kw_vector = dictionary.doc2bow(jieba.lcut(keyword))
# 6. Build an index over the sparse vectors
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=feature_cnt)
# 7. Compute similarity
sim = index[tfidf[kw_vector]]
for i in range(len(sim)):
    print('similarity between keyword and text%d: %.2f' % (i + 1, sim[i]))

Printed result:

similarity between keyword and text1: 0.62
similarity between keyword and text2: 0.00
similarity between keyword and text3: 0.12

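3. Step-by-step breakdown

3.1 Generating token lists

Each text in the collection is segmented into Chinese words, which gives a token list of this form:

['word1', 'word2', 'word3', …]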

import jieba
text = '七月七日长生殿，夜半无人私语时。'
words = jieba.lcut(text)

print(words)

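['七月', '七日', '长生殿', '，', '夜半', '无人', '私语', '时', '。']

3.2 Building the `dictionary` from the texts and getting the feature count

corpora.Dictionary: builds the dictionary
len(dictionary.token2id): the number of words in the dictionary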

from gensim import corpora
import jieba
# Texts
text1 = '坚果果实'
text2 = '坚果实在好吃'
texts = [text1, text2]
# Turn the texts into token lists
texts = [jieba.lcut(text) for text in texts]
print('Texts:', texts)
# Build the dictionary from the texts
dictionary = corpora.Dictionary(texts)
print('Dictionary:', dictionary)
# Get the number of dictionary features
feature_cnt = len(dictionary.token2id)
print('Number of dictionary features: %d' % feature_cnt)

Printed result:

Texts: [['坚果', '果实'], ['坚果', '实在', '好吃']]
Dictionary: Dictionary(4 unique tokens: ['坚果', '果实', '好吃', '实在'])
Number of dictionary features: 4

3.3 Building the `corpus` from the dictionary

A corpus is a list that holds sparse vectors.

from gensim import corpora
import jieba
text1 = '来东京吃东京菜'
text2 = '东京啊东京啊东京'
texts = [text1, text2]
texts = [jieba.lcut(text) for text in texts]
dictionary = corpora.Dictionary(texts)
print('Dictionary:', dictionary.token2id)
# Build the corpus from the dictionary
corpus = [dictionary.doc2bow(text) for text in texts]
print('Corpus:', corpus)

Printed result:

Dictionary: {'东京': 0, '吃': 1, '来': 2, '菜': 3, '啊': 4}
Corpus: [[(0, 2), (1, 1), (2, 1), (3, 1)], [(0, 3), (4, 2)]]

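How doc2bow builds a sparse vector

1. Collect all words into a set and assign each word an ID
Take ['东京', '啊', '东京', '啊', '东京'] as an example
IDs assigned to the words: 东京→0; 啊→4
The list becomes: [0, 4, 0, 4, 0]
2. Convert to a sparse vector
0 occurs 3 times, written as (0, 3)
4 occurs 2 times, written as (4, 2)
Final result: [(0, 3), (4, 2)]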
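
The two steps can be sketched in plain Python. This is not gensim's implementation: the token lists are typed in by hand instead of coming from jieba, and the rule that new words receive IDs in sorted order within each document is an assumption chosen because it reproduces the dictionary printed above.

```python
from collections import Counter


def build_token2id(texts):
    """Assign an integer id to every distinct token, document by document."""
    token2id = {}
    for tokens in texts:
        # Assumption: unseen tokens get ids in sorted order within each document.
        for token in sorted(set(tokens)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def doc2bow(token2id, tokens):
    """Map tokens to ids, count them, and return sorted (id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Pre-segmented by hand; the segmentation is assumed, not produced by jieba here.
texts = [['来', '东京', '吃', '东京', '菜'], ['东京', '啊', '东京', '啊', '东京']]
token2id = build_token2id(texts)
corpus = [doc2bow(token2id, tokens) for tokens in texts]
print(token2id)
print(corpus)  # [[(0, 2), (1, 1), (2, 1), (3, 1)], [(0, 3), (4, 2)]]
```

Words that are missing from the dictionary are silently skipped, which is also how a search query containing unknown words ends up with a shorter vector.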

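3.4 Processing the corpus with the `TF-IDF` model and building the `index`

TF-IDF is a statistical measure of how important a word is to one document within a document collection or corpus.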

from gensim import corpora, models, similarities
import jieba
text1 = '南方医院无痛人流'
text2 = '北方人流浪到南方'
texts = [text1, text2]
texts = [jieba.lcut(text) for text in texts]
dictionary = corpora.Dictionary(texts)
feature_cnt = len(dictionary.token2id.keys())
corpus = [dictionary.doc2bow(text) for text in texts]
# Process the corpus with TF-IDF
tfidf = models.TfidfModel(corpus)
# Build an index over the corpus
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=feature_cnt)

print(tfidf)

TfidfModel(num_docs=2, num_nnz=9)

3.5 Converting the search query into a sparse vector with the dictionary

from gensim import corpora
import jieba
text1 = '南方医院无痛人流'
text2 = '北方人流落南方'
texts = [text1, text2]
texts = [jieba.lcut(text) for text in texts]
dictionary = corpora.Dictionary(texts)
# Convert the search query into a sparse vector with the dictionary
keyword = '无痛人流'
kw_vector = dictionary.doc2bow(jieba.lcut(keyword))

print(kw_vector)

[(0, 1), (3, 1)]

3.6 Computing similarity

from gensim import corpora, models, similarities
import jieba
text1 = '无痛人流并非无痛'
text2 = '北方人流浪到南方'
texts = [text1, text2]
keyword = '无痛人流'
texts = [jieba.lcut(text) for text in texts]
dictionary = corpora.Dictionary(texts)
feature_cnt = len(dictionary.token2id)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
new_vec = dictionary.doc2bow(jieba.lcut(keyword))
# Compute similarity
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=feature_cnt)
print('\nSparse vectors of the corpus under the TF-IDF model:')
for i in tfidf[corpus]:
    print(i)
print('\nSparse vector of the keyword under the TF-IDF model:')
print(tfidf[new_vec])
print('\nSimilarity:')
sim = index[tfidf[new_vec]]
for i in range(len(sim)):
    print('Similarity to sentence', i + 1, ':', sim[i])

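4. Appendix

Further reading
jieba Chinese word segmentation
Chinese LDA model
Text-similarity analysis (matrix version)
Notes

Term	Meaning
corpus	n. a collection of texts; in computing, a corpus (plural: corpora)
sparse	adj. thinly populated
vector	n. vector
Sparse Matrix Similarity	similarity computed over a sparse matrix
word2vec	word to vector
doc2bow	document to bag of words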

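Case 4: a concise Chinese LDA model with Python + gensim

Source: Python+gensim【中文LDA】简洁模型

0. Principle

LDA is a generative document-topic model, also described as a three-level Bayesian probability model with a word layer, a topic layer, and a document layer.
It uses the co-occurrence of words within documents to cluster words by topic, which yields two probability distributions: document-topic and topic-word.

gensim workflow

1. Implementation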
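
One way to read the two distributions is through the generative story that LDA assumes: for every word position, first draw a topic from the document-topic distribution, then draw a word from that topic's topic-word distribution. The sketch below samples a document this way using only the standard library. The distributions are invented by hand for illustration; in practice LDA training works in the opposite direction and estimates them from the documents.

```python
import random


def sample(dist, rng):
    """Draw one key from a {key: probability} dictionary."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]


def generate_document(doc_topic, topic_word, length, rng):
    """For each position: pick a topic for the document, then a word for that topic."""
    words = []
    for _ in range(length):
        topic = sample(doc_topic, rng)
        words.append(sample(topic_word[topic], rng))
    return words


# Hand-made distributions, purely illustrative.
topic_word = {
    'sports': {'coach': 0.5, 'volleyball': 0.5},
    'cars': {'SUV': 0.5, 'sedan': 0.5},
}
doc_topic = {'sports': 0.9, 'cars': 0.1}

rng = random.Random(0)
doc = generate_document(doc_topic, topic_word, 8, rng)
print(doc)
```

Because this document puts 90% of its weight on the sports topic, most sampled words come from that topic's vocabulary, which is the co-occurrence pattern LDA exploits when it is trained.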

from gensim import corpora, models
import jieba.posseg as jp, jieba
# Texts
texts = ['美国教练坦言，没输给中国女排，是输给了郎平',
         '美国无缘四强，听听主教练的评价',
         '中国女排晋级世锦赛四强，全面解析主教练郎平的执教艺术',
         '为什么越来越多的人买MPV，而放弃SUV？跑一趟长途就知道了',
         '跑了长途才知道，SUV和轿车之间的差距',
         '家用的轿车买什么好']
# Token filter conditions
jieba.add_word('四强', 9, 'n')
flags = ('n', 'nr', 'ns', 'nt', 'eng', 'v', 'd')  # parts of speech to keep
stopwords = ('没', '就', '知道', '是', '才', '听听', '坦言', '全面', '越来越', '评价', '放弃', '人')  # stop words
# Tokenize
words_ls = []
for text in texts:
    words = [word.word for word in jp.cut(text) if word.flag in flags and word.word not in stopwords]
    words_ls.append(words)
# Build the dictionary
dictionary = corpora.Dictionary(words_ls)
# Using the dictionary, convert each token list into a sparse vector and collect the vectors in a list (the corpus)
corpus = [dictionary.doc2bow(words) for words in words_ls]
# LDA model; num_topics sets the number of topics
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=2)
# Print all topics, showing 4 words per topic
for topic in lda.print_topics(num_words=4):
    print(topic)
# Topic inference
print(lda.inference(corpus))

Results

Topic 0 (sports): '0.081*"郎平" + 0.080*"中国女排" + 0.077*"输给" + 0.074*"主教练"'
Topic 1 (cars): '0.099*"长途" + 0.092*"SUV" + 0.084*"跑" + 0.074*"轿车"'

2. Detailed walkthrough

2.1 Printing intermediate objects

print(words_ls)
[['美国', '输给', '中国女排', '输给', '郎平'],
['美国', '无缘', '四强', '主教练'],
['中国女排', '晋级', '世锦赛', '四强', '主教练', '郎平', '执教', '艺术'],
['买', 'MPV', 'SUV', '跑', '长途'],
['跑', '长途', 'SUV', '轿车', '差距'],
['家用', '轿车', '买']]

print(dictionary.token2id)
{'中国女排': 0, '美国': 1, '输给': 2, '郎平': 3, '主教练': 4, '四强': 5, '无缘': 6, '世锦赛': 7, '执教': 8, '晋级': 9, '艺术': 10, 'MPV': 11, 'SUV': 12, '买': 13, '跑': 14, '长途': 15, '差距': 16, '轿车': 17, '家用': 18}

print(corpus)
[[(0, 1), (1, 1), (2, 2), (3, 1)],
[(1, 1), (4, 1), (5, 1), (6, 1)],
[(0, 1), (3, 1), (4, 1), (5, 1), (7, 1), (8, 1), (9, 1), (10, 1)],
[(11, 1), (12, 1), (13, 1), (14, 1), (15, 1)],
[(12, 1), (14, 1), (15, 1), (16, 1), (17, 1)],
[(13, 1), (17, 1), (18, 1)]]

print(lda)
LdaModel(num_terms=19, num_topics=2, decay=0.5, chunksize=2000)

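2.2 The doc2bow function

['美国', '输给', '中国女排', '输给', '郎平']
↓↓↓ [word → ID]
↓↓↓ (美国→1, 输给→2, 中国女排→0, 郎平→3, as in the token2id printed above)
[1, 2, 0, 2, 3]
↓↓↓ [build the sparse vector]
↓↓↓ (ID 2 occurs twice, every other ID once)
[(0, 1), (1, 1), (2, 2), (3, 1)]
…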

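2.3 Topic inference

for e, values in enumerate(lda.inference(corpus)[0]):
    print(texts[e])
    for ee, value in enumerate(values):
        print('\tTopic %d inferred value %.2f' % (ee, value))

美国教练坦言，没输给中国女排，是输给了郎平
    Topic 0 inferred value 5.29 (sports)
    Topic 1 inferred value 0.71
美国无缘四强，听听主教练的评价
    Topic 0 inferred value 4.44 (sports)
    Topic 1 inferred value 0.56
中国女排晋级世锦赛四强，全面解析主教练郎平的执教艺术
    Topic 0 inferred value 8.44 (sports)
    Topic 1 inferred value 0.56
为什么越来越多的人买MPV，而放弃SUV？跑一趟长途就知道了
    Topic 0 inferred value 0.54
    Topic 1 inferred value 5.46 (cars)
跑了长途才知道，SUV和轿车之间的差距
    Topic 0 inferred value 0.56
    Topic 1 inferred value 5.44 (cars)
家用的轿车买什么好
    Topic 0 inferred value 0.68
    Topic 1 inferred value 3.32 (cars)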

text5 = '中国女排将在郎平的率领下向世界女排三大赛的三连冠发起冲击'
bow = dictionary.doc2bow([word.word for word in jp.cut(text5) if word.flag in flags and word.word not in stopwords])
ndarray = lda.inference([bow])[0]
print(text5)
for e, value in enumerate(ndarray[0]):
    print('\tTopic %d inferred value %.2f' % (e, value))

中国女排将在郎平的率领下向世界女排三大赛的三连冠发起冲击
    Topic 0 inferred value 2.40 (sports)
    Topic 1 inferred value 0.60
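
The inferred values are not probabilities: each row sums to more than 1. They are the parameters of the per-document Dirichlet distribution that inference fits (often called gamma), and dividing a row by its sum gives the expected topic proportions. The sketch below does that with plain Python for the first document above; `topic_proportions` is a hypothetical helper, and to my understanding this is the same normalization that get_document_topics applies, though that is worth checking against the gensim documentation.

```python
def topic_proportions(gamma_row):
    """Normalize one row of inferred values so that it sums to 1."""
    total = sum(gamma_row)
    return [g / total for g in gamma_row]


gamma_row = [5.29, 0.71]  # values printed above for the first document
props = topic_proportions(gamma_row)
print(['%.3f' % p for p in props])  # ['0.882', '0.118']
```

Read this way, the first document is estimated to be about 88% topic 0 (sports) and 12% topic 1 (cars).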

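2.4 Relationship between words and topics

Relationship between a single word and the topics

word_id = dictionary.doc2idx(['长途'])[0]
for i in lda.get_term_topics(word_id):
    print('Relation of 【长途】 to 【topic %d】: %.2f%%' % (i[0], i[1]*100))

Relation of 【长途】 to 【topic 0】: 1.61%
Relation of 【长途】 to 【topic 1】: 7.41% (cars)

Relationship between every word and the topics (minimum_probability sets the probability threshold)

for word, word_id in dictionary.token2id.items():
    print(word, lda.get_term_topics(word_id, minimum_probability=1e-8))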

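3. Appendix

Further reading
jieba Chinese word segmentation
Text-similarity analysis
Notes

Term	Meaning
LDA	Latent Dirichlet Allocation
latent	hidden, not directly observed
allocation	n. assignment; distribution
inference	n. inference, reasoning
term	a word in the vocabulary (in other contexts: technical term; school term)
doc2bow	document to bag of words
doc2idx	document to index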

Links:

python︱gensim训练word2vec及相关函数与功能理解

15分钟入门NLP神器—Gensim

Python+gensim-文本相似度分析

Python+gensim【中文LDA】简洁模型

基于python的gensim word2vec训练词向量

Gensim Word2vec 使用教程

Official tutorial: http://radimrehurek.com/gensim/models/word2vec.html
