[python-sklearn] LDA Topic Model Analysis for Chinese Text

Dataset and materials:
Link: LDA topic model
Extraction code: rlns

Data overview

Code:

import os
import re

import jieba
import jieba.posseg as psg
import pandas as pd

# ---------- Preprocessing ----------
output_path = 'D:/lda/result'
file_path = 'D:/lda/data'
os.chdir(file_path)
data = pd.read_excel("data.xlsx")  # content type
os.chdir(output_path)
dic_file = "D:/lda/stop_dic/dict.txt"
stop_file = "D:/lda/stop_dic/stopwords.txt"

# Load the custom dictionary and the stopword list once, not on every call.
jieba.load_userdict(dic_file)
jieba.initialize()
try:
    with open(stop_file, encoding='utf-8') as f:
        stop_set = {re.sub(r'[\r\n]', '', line) for line in f}
except OSError:
    stop_set = set()
    print("error in stop_file")

flag_list = ['n', 'nz', 'vn']  # keep nouns, other proper nouns, and verbal nouns


def chinese_word_cut(mytext):
    word_list = []
    # Segment with jieba and tag each word with its part of speech.
    for seg_word in psg.cut(mytext):
        # word = re.sub(u'[^\u4e00-\u9fa5]', '', seg_word.word)
        word = seg_word.word
        # Skip stopwords and single-character words.
        if word in stop_set or len(word) < 2:
            continue
        if seg_word.flag in flag_list:
            word_list.append(word)
    return " ".join(word_list)


data["content_cutted"] = data.content.apply(chinese_word_cut)

# ---------- LDA analysis ----------
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation


def print_top_words(model, feature_names, n_top_words):
    tword = []
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        # argsort is ascending, so the reversed slice yields the heaviest words first.
        topic_w = " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])
        tword.append(topic_w)
        print(topic_w)
    return tword


n_features = 1000  # keep the 1000 most frequent terms as features
tf_vectorizer = CountVectorizer(strip_accents='unicode',
                                max_features=n_features,
                                stop_words='english',
                                max_df=0.5,
                                min_df=10)
tf = tf_vectorizer.fit_transform(data.content_cutted)

n_topics = 8
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=50,
                                learning_method='batch',
                                learning_offset=50,
                                # doc_topic_prior=0.1,
                                # topic_word_prior=0.01,
                                random_state=0)
lda.fit(tf)

# ---------- Top words for each topic ----------
n_top_words = 25
# get_feature_names() was removed in scikit-learn 1.2; on versions older
# than 1.0, which lack get_feature_names_out(), call get_feature_names() instead.
tf_feature_names = tf_vectorizer.get_feature_names_out()
topic_word = print_top_words(lda, tf_feature_names, n_top_words)

# ---------- Topic assigned to each document ----------
import numpy as np

topics = lda.transform(tf)
# The topic of a document is the index of its largest topic weight.
topic = [int(np.argmax(t)) for t in topics]
data['topic'] = topic
data.to_excel("data_topic.xlsx", index=False)
topics[0]  # 0 1 2

# ---------- Visualization ----------
import pyLDAvis
import pyLDAvis.sklearn  # newer pyLDAvis releases name this module pyLDAvis.lda_model

pyLDAvis.enable_notebook()
pic = pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)
pyLDAvis.display(pic)
pyLDAvis.save_html(pic, 'lda_pass' + str(n_topics) + '.html')
# Look for the saved HTML file in the working directory.
pyLDAvis.show(pic)

# ---------- Perplexity ----------
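import matplotlib.pyplot as plt

plexs = []
n_max_topics = 16
for i in range(1, n_max_topics):
    print(i)
    # Use a separate name so the fitted n_topics model above is not overwritten.
    lda_k = LatentDirichletAllocation(n_components=i, max_iter=50,
                                      learning_method='batch',
                                      learning_offset=50, random_state=0)
    lda_k.fit(tf)
    plexs.append(lda_k.perplexity(tf))

n_t = 15  # right end of the plotted range; must not exceed n_max_topics
x = list(range(1, n_t))
# plexs[0] belongs to 1 topic, so topic counts 1..n_t-1 map to plexs[0:n_t-1].
plt.plot(x, plexs[:n_t - 1])
plt.xlabel("number of topics")
plt.ylabel("perplexity")
plt.show()

Visualization result: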

As the perplexity curve above shows, setting the number of topics to 8 gives good results.
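Reading the topic count off the curve can also be done programmatically. The sketch below assumes the perplexities are stored in a list whose first entry belongs to the smallest topic count, as in the loop above, and simply returns the count with the lowest perplexity. The helper name `best_topic_count` and the sample values are hypothetical, made up for illustration; they are not results from this dataset.

```python
def best_topic_count(perplexities, first_k=1):
    """Return (k, perplexity) for the lowest perplexity in the list.

    perplexities[i] is assumed to come from a model with first_k + i topics.
    """
    if not perplexities:
        raise ValueError("perplexities must not be empty")
    best_index = min(range(len(perplexities)), key=perplexities.__getitem__)
    return first_k + best_index, perplexities[best_index]


# Hypothetical perplexity values for k = 1..10 (illustration only).
example_plexs = [950.0, 900.0, 870.0, 850.0, 838.0,
                 830.0, 826.0, 820.0, 824.0, 829.0]
best_k, best_plex = best_topic_count(example_plexs, first_k=1)
print(best_k, best_plex)
```

Keeping the offset explicit through `first_k` avoids reading a value against the wrong topic count when the list does not start at index equal to k.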

Processing result
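The topic column written to data_topic.xlsx holds, for each document, the index of the largest value in that document's row of `lda.transform(tf)`. A minimal sketch of that step on a made-up matrix follows; the array `doc_topic` and its values are assumptions for illustration, standing in for the real document-topic matrix.

```python
import numpy as np

# Hypothetical document-topic matrix: 3 documents x 4 topics, rows sum to 1.
doc_topic = np.array([
    [0.10, 0.70, 0.10, 0.10],
    [0.25, 0.25, 0.40, 0.10],
    [0.05, 0.05, 0.05, 0.85],
])

# Index of the largest weight in each row, i.e. the dominant topic per document.
dominant_topic = doc_topic.argmax(axis=1)
# The weight of that topic, useful for spotting documents with no clear topic.
dominant_weight = doc_topic.max(axis=1)
print(dominant_topic.tolist())
```

A low dominant weight means the document is spread over several topics, so its single label should be read with caution.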

Reference: uploader (up主) 五连单排一班
