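Topic extraction with NMF (Non-negative Matrix Factorization) and LDA (Latent Dirichlet Allocation) in scikit-learn
Original (English): http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html
This is an example of applying NMF and LDA to a corpus of documents to extract topics.
The inputs are a tf-idf matrix (for NMF) and a tf matrix (for LDA), respectively.
The output is a list of topics, each represented as a list of words.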
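To make the two input types concrete, here is a minimal sketch on a made-up three-document corpus, written with the standard library only. The corpus and the helper names (term_counts, tfidf_row) are illustrative assumptions and not part of the original example. The idf formula ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation is meant to resemble the default behaviour of TfidfVectorizer, but treat it as an approximation rather than a reimplementation.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration).
docs = [
    "space nasa launch space",
    "god jesus bible",
    "nasa launch orbit",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})


def term_counts(tokens):
    """tf row: raw count of every vocabulary word in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


# tf matrix (what LDA receives): one row per document, one column per word.
tf = [term_counts(tokens) for tokens in tokenized]

# Document frequency: in how many documents each word occurs.
n_docs = len(docs)
df = [sum(1 for row in tf if row[j] > 0) for j in range(len(vocab))]
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]


def tfidf_row(row):
    """tf-idf row: counts weighted by idf, then scaled to unit L2 norm."""
    weighted = [count * weight for count, weight in zip(row, idf)]
    norm = math.sqrt(sum(value * value for value in weighted))
    return [value / norm for value in weighted]


# tf-idf matrix (what NMF receives).
tfidf = [tfidf_row(row) for row in tf]

print(vocab)
for counts, weights in zip(tf, tfidf):
    print(counts, [round(w, 3) for w in weights])
```

Both matrices are non-negative, which is what NMF requires; LDA is given the raw counts because it models word occurrences directly.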
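The default parameters (n_samples / n_features / n_topics) should make the example run in a few tens of seconds.
You can try changing the problem size, but be aware that the time complexity of NMF is polynomial, while that of LDA is proportional to (n_samples * iterations).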
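A few notes:
(1) The exit() call on line 61 of the code (right after the NMF fit) must be commented out before the extracted topics are printed.
(2) On the first run, the program downloads the newsgroup data from the internet and stores it in a cache directory; later runs do not download it again.
(3) For the parameter settings of NMF and LDA, see the scikit-learn website: [NMF official documentation] [LDA official documentation].
(4) The code corresponds to scikit-learn 0.17.1.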
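Because the listing targets scikit-learn 0.17.1, some names differ in later releases: to the best of my knowledge, n_topics was renamed n_components in 0.19, and get_feature_names() was superseded by get_feature_names_out() in 1.0. One hedged way to write the feature-name lookup so that it works with either accessor is sketched below. OldVectorizer and NewVectorizer are hypothetical stand-ins, used only so the sketch runs without scikit-learn installed.

```python
def feature_names(vectorizer):
    """Return the vocabulary as a list, whichever accessor the object has."""
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    return list(vectorizer.get_feature_names())


class OldVectorizer:
    """Hypothetical stand-in exposing only the older accessor."""

    def get_feature_names(self):
        return ["drive", "scsi"]


class NewVectorizer:
    """Hypothetical stand-in exposing only the newer accessor."""

    def get_feature_names_out(self):
        return ("drive", "scsi")


print(feature_names(OldVectorizer()))
print(feature_names(NewVectorizer()))
```

With a real vectorizer, the same helper could be called in place of the get_feature_names() calls in the listing.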
Code:
# Author: Olivier Grisel <olivier.grisel@ensta.org>
#         Lars Buitinck <L.J.Buitinck@uva.nl>
#         Chyi-Kwei Yau <chyikwei.yau@gmail.com>
# License: BSD 3 clause

from __future__ import print_function
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups

n_samples = 2000
n_features = 1000
n_topics = 10
n_top_words = 20


def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()


# Load the 20 newsgroups dataset and vectorize it. We use a few heuristics
# to filter out useless terms early on: the posts are stripped of headers,
# footers and quoted replies, and common English words, words occurring in
# only one document or in at least 95% of the documents are removed.

print("Loading dataset...")
t0 = time()
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
data_samples = dataset.data  # whole training set; n_samples only appears in the log messages
print("done in %0.3fs." % (time() - t0))

# Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,  # max_features=n_features,
                                   stop_words='english')
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))

# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))

# Fit the NMF model
print("Fitting the NMF model with tf-idf features,"
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
t0 = time()
nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)
exit()  # line 61 of the listing: comment this out to see the topics
print("done in %0.3fs." % (time() - t0))

print("\nTopics in NMF model:")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

print("Fitting LDA models with tf features, n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                learning_method='online', learning_offset=50.,
                                random_state=0)
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)
Results:
Loading dataset...
done in 2.222s.
Extracting tf-idf features for NMF...
done in 2.730s.
Extracting tf features for LDA...
done in 2.702s.
Fitting the NMF model with tf-idf features,n_samples=2000 and n_features=1000...
done in 1.904s.

Topics in NMF model:
Topic #0:
don just people think like know good time right ve say did make really way want going new year ll
Topic #1:
windows thanks file card does dos mail files know program use advance hi window help software looking ftp video pc
Topic #2:
drive scsi ide drives disk controller hard floppy bus hd cd boot mac cable card isa rom motherboard mb internal
Topic #3:
key chip encryption clipper keys escrow government algorithm security secure encrypted public nsa des enforcement law privacy bit use secret
Topic #4:
00 sale 50 shipping 20 10 price 15 new 25 30 dos offer condition 40 cover asking 75 01 interested
Topic #5:
armenian armenians turkish genocide armenia turks turkey soviet people muslim azerbaijan russian greek argic government serdar kurds population ottoman million
Topic #6:
god jesus bible christ faith believe christians christian heaven sin life hell church truth lord does say belief people existence
Topic #7:
mouse driver keyboard serial com1 port bus com3 irq button com sys microsoft ball problem modem adb drivers card com2
Topic #8:
space nasa shuttle launch station sci gov orbit moon earth lunar satellite program mission center cost research data solar mars
Topic #9:
msg food chinese flavor eat glutamate restaurant foods reaction taste restaurants salt effects carl brain people ingredients natural causes olney

Fitting LDA models with tf features, n_samples=2000 and n_features=1000...
done in 22.548s.

Topics in LDA model:
Topic #0:
government people mr law gun state president states public use right rights national new control american security encryption health united
Topic #1:
drive card disk bit scsi use mac memory thanks pc does video hard speed apple problem used data monitor software
Topic #2:
said people armenian armenians turkish did saw went came women killed children turkey told dead didn left started greek war
Topic #3:
year good just time game car team years like think don got new play games ago did season better ll
Topic #4:
10 00 15 25 12 11 20 14 17 16 db 13 18 24 30 19 27 50 21 40
Topic #5:
windows window program version file dos use files available display server using application set edu motif package code ms software
Topic #6:
edu file space com information mail data send available program ftp email entry info list output nasa address anonymous internet
Topic #7:
ax max b8f g9v a86 pl 145 1d9 0t 34u 1t 3t giz bhj wm 2di 75u 2tm bxn 7ey
Topic #8:
god people jesus believe does say think israel christian true life jews did bible don just know world way church
Topic #9:
don know like just think ve want does use good people key time way make problem really work say need
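Each topic line above is produced by print_top_words, where the slice topic.argsort()[:-n_top_words - 1:-1] picks the indices of the largest weights in a topic row, largest first. A minimal pure-Python sketch of the same idea follows. The weights and words are made up, and top_words is a hypothetical helper, not part of scikit-learn.

```python
def top_words(topic_weights, feature_names, n_top_words):
    """Return the n_top_words features with the largest weights, largest first."""
    order = sorted(range(len(topic_weights)),
                   key=lambda i: topic_weights[i],
                   reverse=True)
    return [feature_names[i] for i in order[:n_top_words]]


# Made-up weights standing in for one row of model.components_.
weights = [0.1, 2.5, 0.0, 1.7, 0.9]
words = ["cable", "drive", "god", "scsi", "disk"]

print(" ".join(top_words(weights, words, 3)))  # prints: drive scsi disk
```

When weights are tied, this sketch and the argsort slice may order the tied words differently.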
Reposted from: https://www.cnblogs.com/CheeseZH/p/5254082.html