Contents

Module 1: Training the LDA model
Module 2: Computing perplexity
Module 3: Getting the topics of a text
Full code and example (runnable as-is)

First, install the gensim library:

pip install gensim

Module 1: Training the LDA model

import gensim  # pip install gensim
from gensim import corpora


def train_lda_model(all_contents, dictionary, num_topic=10):
    """Core routine: train an LDA model on tokenized documents."""
    # Convert each tokenized document into a bag-of-words list of (word_id, count) pairs
    corpus = [dictionary.doc2bow(sentence) for sentence in all_contents]
    lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topic)  # core call
    return lda


if __name__ == '__main__':
    # One document per line, tokens separated by whitespace
    with open('data.txt', encoding='utf-8') as f:
        data = [line.split() for line in f if line.strip()]
    try:
        dictionary = corpora.Dictionary(data)
        num_topic = 3  # number of topics
        lda_model = train_lda_model(data, dictionary, num_topic=num_topic)  # train the LDA model
        lda_model.save('lda_' + str(num_topic) + '.model')  # save the LDA model
    except Exception as e:
        print(e)

The data.txt file contains:

in conjunction with the release of the the allen institute for ai partnered with
the recent outbreak of the deadly and highly infectious covid disease caused by
coronaviruses is related illness that vary from a common cold more severe
it is shown that the evaporation rate of a liquid sample containing the
covid illness an on going epidemic started in wuhan city china in december
in the beginning of december covid virus that slipped from animals humans in
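To make the `doc2bow` step in `train_lda_model` more concrete, here is a minimal pure-Python sketch of a bag-of-words conversion. It is an illustration only, not gensim's implementation: the helpers `build_vocab` and `to_bow` are hypothetical names, and the ids here are assigned in first-seen order, which may differ from the ids gensim assigns.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to each distinct token, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs; tokens missing from vocab are dropped."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


# Two shortened lines in the style of data.txt, already tokenized
docs = [
    "in conjunction with the release of the the allen institute".split(),
    "covid illness an on going epidemic".split(),
]
vocab = build_vocab(docs)
bow = to_bow(docs[0], vocab)
print(bow)  # "the" occurs three times, so its pair carries a count of 3
```

Each document becomes a list of (id, count) pairs, and word order is discarded. That is the only view of the text the LDA model receives.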

Module 2: Computing perplexity

Create a new file, perplexity_cal.py, with the following code:

import math
import gensim


def perplexity(ldamodel: gensim.models.LdaModel, data, dictionary: gensim.corpora.Dictionary):
    """Compute the perplexity of an LDA model.

    :param ldamodel: the trained LDA model
    :param data: tokenized documents (lists of words) on which to compute perplexity
    :param dictionary: the Dictionary obtained with corpora.Dictionary(my_data) from the data my_data used to train the gensim model
    :return: the perplexity
    """
    size_dictionary = len(dictionary.keys())
    testset = [dictionary.doc2bow(doc) for doc in data]
    num_topics = ldamodel.num_topics
    prob_doc_sum = 0.0
    topic_word_list = []  # one {word: p(w|z)} dict per topic
    for topic_id in range(num_topics):
        topic_word = ldamodel.show_topic(topic_id, size_dictionary)
        topic_word_list.append(dict(topic_word))
    doc_topics_list = []  # one {topic_id: p(z|d)} dict per document
    for doc in testset:
        # Stored as a dict because gensim can omit topics with near-zero probability
        doc_topics_list.append(dict(ldamodel.get_document_topics(doc, minimum_probability=0)))
    testset_word_num = 0
    for i in range(len(testset)):
        prob_doc = 0.0  # log-probability of the document
        doc = testset[i]
        doc_word_num = 0  # number of words in the document
        for word_id, num in doc:
            prob_word = 0.0  # probability of the word
            doc_word_num += num
            word = dictionary[word_id]
            for topic_id in range(num_topics):
                # p(w) = sum_z p(z) * p(w|z)
                prob_topic = doc_topics_list[i].get(topic_id, 0.0)
                prob_topic_word = topic_word_list[topic_id][word]
                prob_word += prob_topic * prob_topic_word
            # log p(d) = sum_w n_w * log p(w); weighting by the count keeps the
            # numerator consistent with the token total in the denominator
            prob_doc += num * math.log(prob_word)
        prob_doc_sum += prob_doc
        testset_word_num += doc_word_num
    prep = math.exp(-prob_doc_sum / testset_word_num)  # perplexity = exp(-sum_d log p(d) / sum_d N_d)
    # print("LDA perplexity: %s" % prep)
    return prep

Use it in the main function:

perp = perplexity_cal.perplexity(lda_model, data, dictionary)

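Here lda_model is the trained LDA model, data is the tokenized training set on which perplexity is computed, and dictionary is the Dictionary object obtained by calling corpora.Dictionary(my_data) on the data used to train the LDA model.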
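As a sanity check on the quantity being computed, the sketch below evaluates the standard per-token perplexity, exp(-Σ_d Σ_w n_dw · log p(w|d) / Σ_d N_d) with p(w|d) = Σ_z p(z|d) · p(w|z), on hand-made numbers. It does not use gensim; the function name `toy_perplexity` and all probabilities are hypothetical values chosen for illustration.

```python
import math


def toy_perplexity(doc_topic, topic_word, bows):
    """Per-token perplexity from explicit p(z|d) and p(w|z) tables."""
    log_sum, n_tokens = 0.0, 0
    for theta, bow in zip(doc_topic, bows):
        for word_id, count in bow:
            # p(w|d) = sum_z p(z|d) * p(w|z)
            p_word = sum(theta[z] * topic_word[z][word_id] for z in range(len(theta)))
            log_sum += count * math.log(p_word)
            n_tokens += count
    return math.exp(-log_sum / n_tokens)


# Hypothetical setup: 2 topics, 4-word vocabulary, every word equally likely
topic_word = [[0.25] * 4, [0.25] * 4]   # p(w|z), one row per topic
doc_topic = [[0.7, 0.3], [0.1, 0.9]]    # p(z|d), one row per document
bows = [[(0, 2), (3, 1)], [(1, 1), (2, 4)]]  # (word_id, count) pairs per document

ppl = toy_perplexity(doc_topic, topic_word, bows)
print(ppl)  # close to 4.0, the vocabulary size
```

A model that spreads probability uniformly over V words has perplexity V, so lower values mean the model predicts the words better than uniform guessing.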

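Module 3: Getting the topics of a text

First make sure the words in the text occurred in the LDA training set (words the model has never seen carry no topic information), then write a function: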

import gensim


def get_topic_from_model(lda_model: gensim.models.ldamodel.LdaModel, text: str = "related illness that"):
    """Use the LDA model to get the topics of a text."""
    text = text.lower().split()
    # Map words to ids with the dictionary the model was trained with;
    # a Dictionary built from this text alone would assign unrelated ids
    bow = lda_model.id2word.doc2bow(text)
    return lda_model.get_document_topics(bow)

Use it in the main function:

topic = get_topic_from_model(lda_model, text="related illness that")
print(topic)  # e.g. [(0, 0.08674477), (1, 0.084886044), (2, 0.8283692)]; each pair is (topic id, probability)
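The returned list is often reduced to a single dominant topic. A minimal sketch, assuming the list has the shape shown above (the numbers are copied from that sample output and will differ from run to run, because LDA training is randomized):

```python
# Sample return value of get_topic_from_model, copied from the output above
topic = [(0, 0.08674477), (1, 0.084886044), (2, 0.8283692)]

# The dominant topic is the pair with the largest probability
best_id, best_prob = max(topic, key=lambda pair: pair[1])

# All topics, most probable first
ranked = sorted(topic, key=lambda pair: pair[1], reverse=True)

print(best_id, best_prob)
print(ranked)
```

Topic ids carry no meaning of their own; inspecting the top words of topic `best_id` with the model's `show_topic` method is the usual way to interpret it.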

Full code and example (runnable as-is)

The data.txt data file:

in conjunction with the release of the the allen institute for ai partnered with
the recent outbreak of the deadly and highly infectious covid disease caused by
coronaviruses is related illness that vary from a common cold more severe
it is shown that the evaporation rate of a liquid sample containing the
covid illness an on going epidemic started in wuhan city china in december
in the beginning of december covid virus that slipped from animals humans in

The Main.py file:

import gensim  # pip install gensim
from gensim import corpora
import perplexity_cal


def train_lda_model(all_contents, dictionary, num_topic=10):
    """Core routine: train an LDA model on tokenized documents."""
    corpus = [dictionary.doc2bow(sentence) for sentence in all_contents]
    lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topic)  # core call
    return lda


def get_topic_from_model(lda_model: gensim.models.ldamodel.LdaModel, text: str = "related illness that"):
    """Use the LDA model to get the topics of a text."""
    text = text.lower().split()
    # Map words to ids with the dictionary the model was trained with;
    # a Dictionary built from this text alone would assign unrelated ids
    bow = lda_model.id2word.doc2bow(text)
    return lda_model.get_document_topics(bow)


if __name__ == '__main__':
    # One document per line, tokens separated by whitespace
    with open('data.txt', encoding='utf-8') as f:
        data = [line.split() for line in f if line.strip()]
    try:
        dictionary = corpora.Dictionary(data)
        num_topic = 3  # number of topics
        lda_model = train_lda_model(data, dictionary, num_topic=num_topic)  # train the LDA model
        # lda_model.save('lda_' + str(num_topic) + '.model')  # save the LDA model
        # Compute perplexity
        perp = perplexity_cal.perplexity(lda_model, data, dictionary)
        print("LDA perplexity:  topic:", str(num_topic) + " value: " + str(perp))
        # Get the topics of a test text
        topic = get_topic_from_model(lda_model, text="related illness that")
        print(topic)
    except Exception as e:
        print(e)

The perplexity_cal.py file:

import math
import gensim


def perplexity(ldamodel: gensim.models.LdaModel, data, dictionary: gensim.corpora.Dictionary):
    """Compute the perplexity of an LDA model.

    :param ldamodel: the trained LDA model
    :param data: tokenized documents (lists of words) on which to compute perplexity
    :param dictionary: the Dictionary obtained with corpora.Dictionary(my_data) from the data my_data used to train the gensim model
    :return: the perplexity
    """
    size_dictionary = len(dictionary.keys())
    testset = [dictionary.doc2bow(doc) for doc in data]
    num_topics = ldamodel.num_topics
    prob_doc_sum = 0.0
    topic_word_list = []  # one {word: p(w|z)} dict per topic
    for topic_id in range(num_topics):
        topic_word = ldamodel.show_topic(topic_id, size_dictionary)
        topic_word_list.append(dict(topic_word))
    doc_topics_list = []  # one {topic_id: p(z|d)} dict per document
    for doc in testset:
        # Stored as a dict because gensim can omit topics with near-zero probability
        doc_topics_list.append(dict(ldamodel.get_document_topics(doc, minimum_probability=0)))
    testset_word_num = 0
    for i in range(len(testset)):
        prob_doc = 0.0  # log-probability of the document
        doc = testset[i]
        doc_word_num = 0  # number of words in the document
        for word_id, num in doc:
            prob_word = 0.0  # probability of the word
            doc_word_num += num
            word = dictionary[word_id]
            for topic_id in range(num_topics):
                # p(w) = sum_z p(z) * p(w|z)
                prob_topic = doc_topics_list[i].get(topic_id, 0.0)
                prob_topic_word = topic_word_list[topic_id][word]
                prob_word += prob_topic * prob_topic_word
            # log p(d) = sum_w n_w * log p(w); weighting by the count keeps the
            # numerator consistent with the token total in the denominator
            prob_doc += num * math.log(prob_word)
        prob_doc_sum += prob_doc
        testset_word_num += doc_word_num
    prep = math.exp(-prob_doc_sum / testset_word_num)  # perplexity = exp(-sum_d log p(d) / sum_d N_d)
    # print("LDA perplexity: %s" % prep)
    return prep
