A Python Implementation of Dynamic Topic Models
- Introduction to Dynamic Topic Models (DTM)
- Implementing Dynamic Topic Models
- Data and preprocessing
- Python implementation
Introduction to Dynamic Topic Models (DTM)
Dynamic Topic Models (DTM) come from the paper "Dynamic Topic Models" by Blei and Lafferty, published at the 23rd International Conference on Machine Learning (ICML 2006). Unlike the earlier Latent Dirichlet Allocation (LDA) model, DTM introduces a time dimension and thereby captures how the topics of a corpus evolve over time.
In LDA, the documents of a corpus carry no temporal ordering, much as the words in a bag-of-words model have no ordering, and the K topics of the corpus are treated as fixed during modeling. In DTM, documents have a time attribute and therefore an order, and the model assumes topics evolve dynamically from one period to the next. For example, if the corpus contains a music topic, what that topic reflected in the 1980s will certainly differ from what it reflects today.
The probabilistic graphical model and the generative process of DTM are shown as figures in the original post (omitted here).
In DTM, the K topics of the corpus evolve continuously from one time slice to the next: as the first two steps of the generative process show, the doc-topic distribution and the topic-word distribution at slice t are evolved from those at slice t-1. Because the Dirichlet distribution widely used in other topic models (such as LDA) is not suited to modeling this kind of sequential evolution, the paper instead chains the parameters across slices with Gaussian noise.
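Concretely, the chaining described above can be written as follows (notation as in the DTM paper: $\beta_{t,k}$ are the natural parameters of topic $k$ at slice $t$, and $\alpha_t$ governs the topic proportions):

$$\beta_{t,k} \mid \beta_{t-1,k} \sim \mathcal{N}(\beta_{t-1,k}, \sigma^2 I), \qquad \alpha_t \mid \alpha_{t-1} \sim \mathcal{N}(\alpha_{t-1}, \delta^2 I)$$

Words are then drawn from the multinomial obtained by mapping $\beta_{t,k}$ through the softmax, $\pi(\beta)_w = \exp(\beta_w) / \sum_v \exp(\beta_v)$.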
Readers who want a deeper understanding of DTM should read Blei's paper "Dynamic Topic Models". The authors released the original dtm code, but it is rather hard to work with; if you are interested, download the source and study it. The focus of this post is what follows: implementing DTM in Python by calling modules from the NLP toolkit Gensim.
Implementing Dynamic Topic Models
Data and preprocessing
The dataset consists of 1,324 documents provided on GitHub, split across three months.
First, the documents need to be merged: the three months of documents go into a single txt file, one document per line. I did the merging, and stripped punctuation, with Java; the code is as follows:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UnsupportedEncodingException;

public class newsToDoc {
    public static void main(String[] args) {
        File file = new File("newsData\\sample");
        File newsOut = new File("newsData\\newOut.txt");
        File[] files = file.listFiles();
        String line;
        try {
            BufferedWriter bfw = new BufferedWriter(
                    new OutputStreamWriter(new FileOutputStream(newsOut), "UTF-8"));
            BufferedReader bfr = null;
            for (int i = 0; i < files.length; i++) {
                bfr = new BufferedReader(
                        new InputStreamReader(new FileInputStream(files[i]), "UTF-8"));
                while ((line = bfr.readLine()) != null) {
                    // strip punctuation (ASCII and full-width)
                    line = line.replaceAll("[`~!@#$%^&*()+=|{}':;',\\\\[\\\\].<>/?~!\"\"?@#¥%……&;*()——+|{}《》【】‘;:’。,、|-]", "");
                    bfw.append(line);
                }
                bfw.newLine();
                bfw.flush();
            }
            bfw.flush();
            bfw.close();
            bfr.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Readers can of course merge the documents and strip punctuation some other way. After this processing, the 1,324 news articles originally split across three months have been merged into myCorpus.txt.
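For reference, here is a minimal Python sketch of the same merge-and-strip step. It assumes the raw files live in the newsData/sample directory used by the Java code, and the punctuation class below is a simplified version of the one above:

```python
import re
from pathlib import Path

# Simplified punctuation class covering ASCII and common full-width marks.
PUNCT = re.compile(r"[`~!@#$%^&*()+=|{}':;,\[\].<>/?¥…—《》【】‘;:’。,、\"-]")

def merge_corpus(src_dir, out_file):
    """Concatenate every file in src_dir into out_file, one document per line."""
    with open(out_file, "w", encoding="utf-8") as out:
        for path in sorted(Path(src_dir).iterdir()):
            # one input file = one document: flatten its newlines to spaces
            text = path.read_text(encoding="utf-8").replace("\n", " ")
            out.write(PUNCT.sub("", text) + "\n")
```

Calling merge_corpus("newsData/sample", "datasets/myCorpus.txt") would then produce the input file used below.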
Python implementation
With the documents preprocessed, we implement DTM by calling Gensim from Python.
First, import the relevant modules:
import logging
from gensim import corpora
from six import iteritems
from gensim.models import ldaseqmodel
from gensim.corpora import Dictionary, bleicorpus
Next, we need to turn myCorpus.txt into the corpus the DTM model expects, and build the dictionary:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)  # log progress so we can follow what the program is doing
stoplist = set('a able about above according i accordingly "i across actually after afterwards again against ain’t - all allow allows almost alone along already also although always am among amongst an and another any anybody anyhow anyone anything anyway anyways anywhere apart appear appreciate appropriate are aren’t around as a’s aside ask asking associated at available away awfully be became because become becomes becoming been before beforehand behind being believe below beside besides best better between beyond both brief but by came can cannot cant can’t cause causes certain certainly changes clearly c’mon co com come comes concerning consequently consider considering contain containing contains corresponding could couldn’t course c’s currently definitely described despite did didn’t different do does doesn’t doing done don’t down downwards during each edu eg eight either else elsewhere enough entirely especially et etc even ever every everybody everyone everything everywhere ex exactly example except far few fifth first five followed following follows for former formerly forth four from further furthermore get gets getting given gives go goes going gone got gotten greetings had hadn’t happens hardly has hasn’t have haven’t having he hello help hence her here hereafter hereby herein here’s hereupon hers herself he’s hi him himself his hither hopefully how howbeit however i’d ie if ignored i’ll i’m immediate in inasmuch inc indeed indicate indicated indicates inner insofar instead into inward is isn’t it it’d it’ll its it’s itself i’ve just keep keeps kept know known knows last lately later latter latterly least less lest let let’s like liked likely little look looking looks ltd mainly many may maybe me mean meanwhile merely might more moreover most mostly much must my myself name namely nd near nearly necessary need needs neither never nevertheless new next nine no nobody non none noone nor normally not nothing novel now nowhere obviously of off often oh ok okay old on once one ones only onto or other others otherwise ought our ours ourselves out outside over overall own particular particularly per perhaps placed please plus possible presumably probably provides que quite qv rather rd re really reasonably regarding regardless regards relatively respectively right said same saw say saying says second secondly see seeing seem seemed seeming seems seen self selves sensible sent serious seriously seven several shall she should shouldn’t since six so some somebody somehow someone something sometime sometimes somewhat somewhere soon sorry specified specify specifying still sub such sup sure take taken tell tends th than thank thanks thanx that thats that’s the their theirs them themselves then thence there thereafter thereby therefore therein theres there’s thereupon these they they’d they’ll they’re they’ve think third this thorough thoroughly those though three through throughout thru thus to together too took toward towards tried tries truly try trying t’s twice two un under unfortunately unless unlikely until unto up upon us use used useful uses using usually value various very via viz vs want wants was wasn’t way we we’d welcome well we’ll went were we’re weren’t we’ve what whatever what’s when whence whenever where whereafter whereas whereby wherein where’s whereupon wherever whether which while whither who whoever whole whom who’s whose why will willing wish with within without wonder won’t would wouldn’t yes yet you you’d you’ll your you’re yours yourself yourselves you’ve zero zt ZT zz ZZ'.split())
# build the dictionary, then remove stopwords and words that occur only once in the corpus
dictionary = corpora.Dictionary(line.lower().split() for line in open('datasets/myCorpus.txt'))
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist if stopword in dictionary.token2id]
once_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1]
dictionary.filter_tokens(stop_ids + once_ids)  # remove the stopwords and the words that occur only once
dictionary.compactify()  # re-assign ids to close the gaps left by the removed tokens
dictionary.save('datasets/news_dictionary')  # save the dictionary
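To make the filtering above concrete without needing Gensim, here is a small self-contained sketch of the same idea on a made-up toy corpus and stoplist: drop stopwords and words whose document frequency is 1, mirroring what filter_tokens does with dictionary.dfs:

```python
from collections import Counter

docs = [
    "the music of the eighties".split(),
    "music topics evolve over time".split(),
    "the topics of a corpus".split(),
]
stoplist = {"the", "of", "a", "over"}

# document frequency of each token, mirroring dictionary.dfs
dfs = Counter(tok for doc in docs for tok in set(doc))
# keep tokens that are neither stopwords nor hapax legomena (df == 1)
vocab = sorted(t for t, df in dfs.items() if t not in stoplist and df > 1)
print(vocab)  # ['music', 'topics']
```

Everything else ("eighties", "evolve", "time", "corpus") appears in only one document and is dropped, just as once_ids are dropped above.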
# load the documents and build the corpus, streaming one document at a time
class MyCorpus(object):
    def __iter__(self):
        for line in open('datasets/myCorpus.txt'):
            yield dictionary.doc2bow(line.lower().split())
corpus_memory_friendly = MyCorpus()
corpus = [vector for vector in corpus_memory_friendly]  # materialize the streamed documents as a corpus
corpora.BleiCorpus.serialize('datasets/news_corpus', corpus)  # save the corpus in Blei's lda-c format
With that, the documents have been turned into the dictionary and corpus the DTM model needs; next, load them into the model:
try:
    dictionary = Dictionary.load('datasets/news_dictionary')
except FileNotFoundError:
    raise ValueError("SKIP: Please download the Corpus/news_dictionary dataset.")
corpus = bleicorpus.BleiCorpus('datasets/news_corpus')
time_slice = [438, 430, 456]  # time slices of the corpus: three periods with 438, 430 and 456 news articles (438 + 430 + 456 = 1324)
num_topics = 5  # number of topics, here 5
ldaseq = ldaseqmodel.LdaSeqModel(corpus=corpus, id2word=dictionary, time_slice=time_slice, num_topics=num_topics)  # feed the corpus, dictionary and parameters into the model and train it
corpusTopic = ldaseq.print_topics(time=0)  # topic distributions at a given time slice, here the first one
print(corpusTopic)
topicEvolution = ldaseq.print_topic_times(topic=0)  # evolution of a given topic across time slices, here the first topic
print(topicEvolution)
doc = ldaseq.doc_topics(0)  # topic distribution of a given document, here the first one
print(doc)
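doc_topics returns a probability vector of length num_topics. Using hypothetical values (made-up numbers, not actual model output), the dominant topic of a document can be read off like this:

```python
# hypothetical topic distribution for one document over 5 topics (made-up numbers)
doc_dist = [0.02, 0.71, 0.09, 0.13, 0.05]
dominant = max(range(len(doc_dist)), key=doc_dist.__getitem__)
print(dominant)  # 1, i.e. the second topic dominates this document
```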
That concludes my introduction to the Python implementation of DTM. Gensim's DTM model exposes further methods; readers who want to go deeper can read the Dynamic Topic Models Tutorial.