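Topic extraction with NMF (Non-negative Matrix Factorization) and LDA (Latent Dirichlet Allocation) in scikit-learn

Original English page: http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html

This is an example of applying NMF and LDA to a corpus of documents to extract topics.

The input is a tf-idf matrix for NMF and a tf (raw term count) matrix for LDA.

The output is a list of topics, each represented as a list of words.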
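
To make the difference between the two inputs concrete, here is a minimal sketch on a hypothetical three-document toy corpus (not the newsgroups data used in the example). It only assumes the default settings of `CountVectorizer` and `TfidfVectorizer`; both matrices have the same shape and are non-negative, but tf holds raw counts while tf-idf holds reweighted, row-normalized values.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, made up purely for illustration.
docs = [
    "the disk drive failed",
    "the drive controller and the disk",
    "god and faith",
]

# tf: raw term counts, the kind of input LDA expects.
tf_vectorizer = CountVectorizer()
tf = tf_vectorizer.fit_transform(docs)

# tf-idf: counts reweighted by inverse document frequency, the input used for NMF.
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(docs)

# Compare the column for the word "the" in both matrices.
the_counts = tf.toarray()[:, tf_vectorizer.vocabulary_["the"]]
the_weights = tfidf.toarray()[:, tfidf_vectorizer.vocabulary_["the"]]
print(the_counts)   # [1 2 0]
print(the_weights)  # fractional weights; 0 for the last document
```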

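With the default parameters (n_samples / n_features / n_topics) the example runs in a few tens of seconds.

You can try changing the problem size, but note that the time complexity of NMF is polynomial, while that of LDA is proportional to (n_samples * iterations).

A few notes:

(1) The `exit()` call on line 61 of the listing must be commented out; otherwise the script stops right after fitting the NMF model and no topics are printed.

(2) On the first run the program downloads the newsgroups data from the internet and stores it in a cache directory; later runs reuse the cache and do not download it again.

(3) For the parameter settings of NMF and LDA, see the scikit-learn website: [NMF documentation] [LDA documentation].

(4) This code corresponds to scikit-learn 0.17.1.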

Code:

 1 # Author: Olivier Grisel <olivier.grisel@ensta.org>
 2 #         Lars Buitinck <L.J.Buitinck@uva.nl>
 3 #         Chyi-Kwei Yau <chyikwei.yau@gmail.com>
 4 # License: BSD 3 clause
 5
 6 from __future__ import print_function
 7 from time import time
 8
 9 from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
10 from sklearn.decomposition import NMF, LatentDirichletAllocation
11 from sklearn.datasets import fetch_20newsgroups
12
13 n_samples = 2000
14 n_features = 1000
15 n_topics = 10
16 n_top_words = 20
17
18
19 def print_top_words(model, feature_names, n_top_words):
20     for topic_idx, topic in enumerate(model.components_):
21         print("Topic #%d:" % topic_idx)
22         print(" ".join([feature_names[i]
23                         for i in topic.argsort()[:-n_top_words - 1:-1]]))
24     print()
25
26
27 # Load the 20 newsgroups dataset and vectorize it. We use a few heuristics
28 # to filter out useless terms early on: the posts are stripped of headers,
29 # footers and quoted replies, and common English words, words occurring in
30 # only one document or in at least 95% of the documents are removed.
31
32 print("Loading dataset...")
33 t0 = time()
34 dataset = fetch_20newsgroups(shuffle=True, random_state=1,
35                              remove=('headers', 'footers', 'quotes'))
36 data_samples = dataset.data
37 print("done in %0.3fs." % (time() - t0))
38
39 # Use tf-idf features for NMF.
40 print("Extracting tf-idf features for NMF...")
41 tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, #max_features=n_features,
42                                    stop_words='english')
43 t0 = time()
44 tfidf = tfidf_vectorizer.fit_transform(data_samples)
45 print("done in %0.3fs." % (time() - t0))
46
47 # Use tf (raw term count) features for LDA.
48 print("Extracting tf features for LDA...")
49 tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=n_features,
50                                 stop_words='english')
51 t0 = time()
52 tf = tf_vectorizer.fit_transform(data_samples)
53 print("done in %0.3fs." % (time() - t0))
54
55 # Fit the NMF model
56 print("Fitting the NMF model with tf-idf features,"
57       "n_samples=%d and n_features=%d..."
58       % (n_samples, n_features))
59 t0 = time()
60 nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)
61 exit()
62 print("done in %0.3fs." % (time() - t0))
63
64 print("\nTopics in NMF model:")
65 tfidf_feature_names = tfidf_vectorizer.get_feature_names()
66 print_top_words(nmf, tfidf_feature_names, n_top_words)
67
68 print("Fitting LDA models with tf features, n_samples=%d and n_features=%d..."
69       % (n_samples, n_features))
70 lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
71                                 learning_method='online', learning_offset=50.,
72                                 random_state=0)
73 t0 = time()
74 lda.fit(tf)
75 print("done in %0.3fs." % (time() - t0))
76
77 print("\nTopics in LDA model:")
78 tf_feature_names = tf_vectorizer.get_feature_names()
79 print_top_words(lda, tf_feature_names, n_top_words)
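
The slice `topic.argsort()[:-n_top_words - 1:-1]` in `print_top_words` is the least obvious part of the listing. The sketch below shows what it does on a small, made-up weight vector (the feature names and weights are hypothetical): `argsort()` orders indices from smallest to largest weight, and the negative-step slice walks back from the end to pick the `n_top_words` largest.

```python
import numpy as np

# Hypothetical vocabulary and one topic's weights, for illustration only.
feature_names = ["disk", "drive", "god", "faith", "scsi"]
topic = np.array([0.9, 0.7, 0.0, 0.1, 0.8])
n_top_words = 3

# Indices sorted by ascending weight: [2, 3, 1, 4, 0].
order = topic.argsort()

# Walk backwards from the last index and stop after n_top_words entries.
top_idx = order[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top_idx]
print(" ".join(top_words))  # disk scsi drive
```

When all weights are distinct, `np.argsort(-topic)[:n_top_words]` selects the same indices and may be easier to read.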

Results:

Loading dataset...
done in 2.222s.
Extracting tf-idf features for NMF...
done in 2.730s.
Extracting tf features for LDA...
done in 2.702s.
Fitting the NMF model with tf-idf features,n_samples=2000 and n_features=1000...
done in 1.904s.

Topics in NMF model:
Topic #0:
don just people think like know good time right ve say did make really way want going new year ll
Topic #1:
windows thanks file card does dos mail files know program use advance hi window help software looking ftp video pc
Topic #2:
drive scsi ide drives disk controller hard floppy bus hd cd boot mac cable card isa rom motherboard mb internal
Topic #3:
key chip encryption clipper keys escrow government algorithm security secure encrypted public nsa des enforcement law privacy bit use secret
Topic #4:
00 sale 50 shipping 20 10 price 15 new 25 30 dos offer condition 40 cover asking 75 01 interested
Topic #5:
armenian armenians turkish genocide armenia turks turkey soviet people muslim azerbaijan russian greek argic government serdar kurds population ottoman million
Topic #6:
god jesus bible christ faith believe christians christian heaven sin life hell church truth lord does say belief people existence
Topic #7:
mouse driver keyboard serial com1 port bus com3 irq button com sys microsoft ball problem modem adb drivers card com2
Topic #8:
space nasa shuttle launch station sci gov orbit moon earth lunar satellite program mission center cost research data solar mars
Topic #9:
msg food chinese flavor eat glutamate restaurant foods reaction taste restaurants salt effects carl brain people ingredients natural causes olney

Fitting LDA models with tf features, n_samples=2000 and n_features=1000...
done in 22.548s.

Topics in LDA model:
Topic #0:
government people mr law gun state president states public use right rights national new control american security encryption health united
Topic #1:
drive card disk bit scsi use mac memory thanks pc does video hard speed apple problem used data monitor software
Topic #2:
said people armenian armenians turkish did saw went came women killed children turkey told dead didn left started greek war
Topic #3:
year good just time game car team years like think don got new play games ago did season better ll
Topic #4:
10 00 15 25 12 11 20 14 17 16 db 13 18 24 30 19 27 50 21 40
Topic #5:
windows window program version file dos use files available display server using application set edu motif package code ms software
Topic #6:
edu file space com information mail data send available program ftp email entry info list output nasa address anonymous internet
Topic #7:
ax max b8f g9v a86 pl 145 1d9 0t 34u 1t 3t giz bhj wm 2di 75u 2tm bxn 7ey
Topic #8:
god people jesus believe does say think israel christian true life jews did bible don just know world way church
Topic #9:
don know like just think ve want does use good people key time way make problem really work say need

Reposted from: https://www.cnblogs.com/CheeseZH/p/5254082.html
