主题建模评估：连贯性分数（Coherence Score）

1.主题连贯性分数

主题连贯性分数（Coherence Score）是一种客观的衡量标准，它基于语言学的分布假设：具有相似含义的词往往出现在相似的上下文中。如果所有或大部分单词都密切相关，则主题被认为是连贯的。

推荐阅读：Full-Text or Abstract ? Examining Topic Coherence Scores Using Latent Dirichlet Allocation

2.计算 LDA 模型的 Coherence Score

2.1 导入包

import pandas as pd
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

2.2 数据预处理

# cast tweets to numpy array
docs = df.tweet_text.to_numpy()# create dictionary of all words in all documents
dictionary = Dictionary(docs)# filter extreme cases out of dictionary
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)# create BOW dictionary
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

2.3 构建模型

LdaMulticore：使用所有 CPU 内核并行化并加速模型训练。

workers：用于并行化的工作进程数。
passes：训练期间通过语料库的次数。

# create LDA model using preferred hyperparameters
lda_model = LdaMulticore(bow_corpus, num_topics=5, id2word=dictionary, passes=4, workers=2, random_state=21)# Save LDA model to disk
path_to_model = ""
lda_model.save(path_to_model)# for each topic, print words occuring in that topic
for idx, topic in lda_model.print_topics(-1):print('Topic: {} \nWords: {}'.format(idx, topic))

2.4 计算 Coherence Score

我们可以使用 Gensim 库中的 CoherenceModel 轻松计算主题连贯性分数。对于 LDA，它的实现比较简单：

# import library from gensim
from gensim.models import CoherenceModel# instantiate topic coherence model
cm = CoherenceModel(model=lda_model, corpus=bow_corpus, texts=docs, coherence='c_v')# get topic coherence score
coherence_lda = cm.get_coherence()
print(coherence_lda)

3.计算 GSDMM 模型的 Coherence Score

GSDMM（Gibbs Sampling Dirichlet Multinomial Mixture）是一种基于狄利克雷多项式混合模型的收缩型吉布斯采样算法（a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model）的简称，它是发表在 2014 2014 2014 年 KDD 上的论文《A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering》的数学模型。

GSDMM 主要用于短文本聚类，短文本聚类是将大量的短文本（例如微博、评论等）根据计算某种相似度进行聚集，最终划分到几个类中的过程。GSDMM 主要具备以下优点：

可以自动推断聚类的个数，并且可以快速地收敛；
可以在完备性和一致性之间保持平衡；
可以很好的处理稀疏、高纬度的短文本，可以得到每一类的代表词汇；
较其它的聚类算法，在性能上表现更为突出。

3.1 安装并导入包

pip install git+https://github.com/rwalk/gsdmm.git

import pandas as pd
import numpy as np
from gensim.corpora import Dictionary
from gsdmm import MovieGroupProcess

3.2 数据预处理

# cast tweets to numpy array
docs = df.tweet_text.to_numpy()# create dictionary of all words in all documents
dictionary = Dictionary(docs)# filter extreme cases out of dictionary
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)# create variable containing length of dictionary/vocab
vocab_length = len(dictionary)# create BOW dictionary
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

3.3 构建模型

# initialize GSDMM
gsdmm = MovieGroupProcess(K=15, alpha=0.1, beta=0.3, n_iters=15)# fit GSDMM model
y = gsdmm.fit(docs, vocab_length)

# print number of documents per topic
doc_count = np.array(gsdmm.cluster_doc_count)
print('Number of documents per topic :', doc_count)# Topics sorted by the number of document they are allocated to
top_index = doc_count.argsort()[-15:][::-1]
print('Most important clusters (by number of docs inside):', top_index)# define function to get top words per topic
def top_words(cluster_word_distribution, top_cluster, values):for cluster in top_cluster:sort_dicts = sorted(cluster_word_distribution[cluster].items(), key=lambda k: k[1], reverse=True)[:values]print("\nCluster %s : %s"%(cluster, sort_dicts))# get top words in topics
top_words(gsdmm.cluster_word_distribution, top_index, 20)

3.4 结果可视化

# Import wordcloud library
from wordcloud import WordCloud# Get topic word distributions from gsdmm model
cluster_word_distribution = gsdmm.cluster_word_distribution# Select topic you want to output as dictionary (using topic_number)
topic_dict = sorted(cluster_word_distribution[topic_number].items(), key=lambda k: k[1], reverse=True)[:values]# Generate a word cloud image
wordcloud = WordCloud(background_color='#fcf2ed', width=1800,height=700,font_path=path_to_font,colormap='flag').generate_from_frequencies(topic_dict)# Print to screen
fig, ax = plt.subplots(figsize=[20,10])
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off");# Save to disk
wordcloud_24.to_file(path_to_file)

3.5 计算 Coherence Score

Gensim 官网对 LDA 之外的模型使用另一种方法计算 连贯性分数。

from gensim.test.utils import common_corpus, common_dictionary
from gensim.models.coherencemodel import CoherenceModeltopics = [['human', 'computer', 'system', 'interface'],['graph', 'minors', 'trees', 'eps']
]cm = CoherenceModel(topics=topics, corpus=common_corpus, dictionary=common_dictionary, coherence='u_mass')coherence = cm.get_coherence()  # get coherence value

GSDMM 的实现需要更多的工作，因为我们首先必须将 主题中的单词作为列表（变量主题）获取，然后将其输入 CoherenceModel。

# import library from gensim
from gensim.models import CoherenceModel# define function to get words in topics
def get_topics_lists(model, top_clusters, n_words):'''Gets lists of words in topics as a list of lists.model: gsdmm instancetop_clusters:  numpy array containing indices of top_clustersn_words: top n number of words to include'''# create empty list to contain topicstopics = []# iterate over top n clustersfor cluster in top_clusters:# create sorted dictionary of word distributionssorted_dict = sorted(model.cluster_word_distribution[cluster].items(), key=lambda k: k[1], reverse=True)[:n_words]# create empty list to contain wordstopic = []# iterate over top n words in topicfor k,v in sorted_dict:# append words to topic listtopic.append(k)# append topics to topics list    topics.append(topic)return topics# get topics to feed to coherence model
topics = get_topics_lists(gsdmm, top_index, 20) # evaluate model using Topic Coherence score
cm_gsdmm = CoherenceModel(topics=topics, dictionary=dictionary, corpus=bow_corpus, texts=docs, coherence='c_v')# get coherence value
coherence_gsdmm = cm_gsdmm.get_coherence()  print(coherence_gsdmm)

原文：Short-Text Topic Modelling: LDA vs GSDMM

【自然语言处理】主题建模评估：连贯性分数（Coherence Score）相关推荐

【自然语言处理】不同策略的主题建模方法比较
不同策略的主题建模方法比较本文将介绍利用 LSA.pLSA.LDA.NMF.BERTopic.Top2Vec 这六种策略进行主题建模之间的比较. 1.简介在自然语言处理(NLP)中,主题建模一词包 ...
【自然语言处理】主题建模：BERTopic（实战篇）
主题建模:BERTopic(实战篇) BERTopic 是基于深度学习的一种主题建模方法. 2018 2018 2018 年底, D e v l i n e t a l . Devlin\ et\ a ...
犀牛建模软件的英文语言包_使用tidytext和textmineR软件包在R中进行主题建模（
犀牛建模软件的英文语言包 In this article, we will learn to do Topic Model using tidytext and textmineR packages ...
独家 | 使用Python的LDA主题建模（附链接）
作者:Kamil Polak翻译:刘思婧校对:孙韬淳本文约2700字,建议阅读5分钟本文为大家介绍了主题建模的概念.LDA算法的原理,示例了如何使用Python建立一个基础的LDA主题模型,并使用p ...
独家 | 利用Python实现主题建模和LDA 算法（附链接）
作者:Susan Li翻译:陈之炎校对:陈汉青本文约1900字,建议阅读5分钟在这篇文章,我们将LDA应用于一组文档,并将文档按照主题分类. 标签:LDA 算法主题建模是一种用于找出文档集合中抽象& ...
一文读懂如何用LSA、PSLA、LDA和lda2vec进行主题建模
选自 Medium,作者:Joyce X,机器之心编译. 本文是一篇关于主题建模及其相关技术的综述.文中介绍了四种最流行的技术,用于探讨主题建模,它们分别是:LSA.pLSA.LDA,以及最新的.基于 ...
主题建模-corpora语料库-PCA进行降维
https://colab.research.google.com/drive/1F-1Ej7T2xnUKXSmDPjjOChNbBTvQlpnM?usp=sharing 考试 https://col ...
python建模大赛算法_利用Python实现主题建模和LDA 算法（附链接）
主题建模是一种用于找出文档集合中抽象"主题"的统计模型.LDA(Latent Dirichlet Allocation)是主题模型的一个示例,用于将文档中的文本分类为特定的主题.L ...
独家 | LDA主题建模和pyLDAvis可视化
作者:Xuan Qi 翻译:方星轩校对:欧阳锦本文约1700字,建议阅读6分钟本文的数据来自美国著名电视节目<老友记>.作者用python-Beautiful Soup抓取了224集 ...

【自然语言处理】主题建模评估：连贯性分数（Coherence Score）