Using LDA to classify the text in documents into topics

Dataset: 20 Newsgroups, loaded with scikit-learn's fetch_20newsgroups.

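LDA (Latent Dirichlet Allocation) is used to classify the text in a document into particular topics. It builds a topics-per-document model and a words-per-topic model, with Dirichlet priors placed on both.

Each document is modeled as a multinomial distribution over topics, and each topic is modeled as a multinomial distribution over words.
LDA assumes that every chunk of text we feed it contains words that are related in some way, so choosing the right corpus is crucial.
It also assumes that documents are produced from a mixture of topics. Those topics then generate words according to their probability distributions.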
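
The generative story described above can be sketched in a few lines of standard-library Python. This is only an illustrative toy, not how gensim trains a model: the vocabulary, the prior values (0.5), the number of topics, and the helper names `sample_dirichlet` and `generate_document` are all assumptions made for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_probs, vocab, alpha, n_words, rng):
    """Generate one toy document following LDA's generative assumptions."""
    # 1. Draw this document's topic mixture from a Dirichlet prior.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position from the document's mixture.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from that topic's distribution over the vocabulary.
        words.append(rng.choices(vocab, weights=topic_word_probs[topic])[0])
    return theta, words

rng = random.Random(400)  # fixed seed so the toy run is repeatable
vocab = ["disk", "drive", "game", "team", "space", "orbit"]  # hypothetical vocabulary
n_topics = 3

# Each topic is its own distribution over words, also drawn from a Dirichlet prior.
topic_word_probs = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(n_topics)]

theta, doc = generate_document(topic_word_probs, vocab, [0.5] * n_topics, 12, rng)
print("Topic mixture:", theta)
print("Generated words:", doc)
```

Training LDA runs this story in reverse: given only the observed words, it estimates the topic mixtures and the per-topic word distributions that most plausibly produced them.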

Code:

from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train', shuffle = True)
newsgroups_test = fetch_20newsgroups(subset='test', shuffle = True)
print(list(newsgroups_train.target_names))
# Let's look at some sample news
print(newsgroups_train.data[:2])
print(newsgroups_train.filenames.shape, newsgroups_train.target.shape)
'''
Loading Gensim and nltk libraries
'''
# pip install gensim
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(400)
import nltk
nltk.download('wordnet')
print(WordNetLemmatizer().lemmatize('went', pos = 'v')) # past tense to present tense
import pandas as pd
stemmer = SnowballStemmer("english")
original_words = ['caresses', 'flies', 'dies', 'mules', 'denied','died', 'agreed', 'owned', 'humbled', 'sized','meeting', 'stating', 'siezing', 'itemization','sensational', 'traditional', 'reference', 'colonizer','plotted']
singles = [stemmer.stem(plural) for plural in original_words]
print(pd.DataFrame(data={'original word': original_words, 'stemmed': singles}))
'''
Write a function to perform the preprocessing steps on the entire dataset
'''
def lemmatize_stemming(text):
    # Lemmatize the token as a verb, then stem the result
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))
# Tokenize and lemmatize
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        # Skip stopwords and tokens of three characters or fewer
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
'''
Preview a document after preprocessing
'''
document_num = 50
doc_sample = 'This disk has failed many times. I would like to get it replaced.'
print("Original document: ")
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print("\n\nTokenized and lemmatized document: ")
print(preprocess(doc_sample))
processed_docs = []
for doc in newsgroups_train.data:
    processed_docs.append(preprocess(doc))
'''
Preview 'processed_docs'
'''
print(processed_docs[:2])
'''
Create a dictionary from 'processed_docs' containing the number of times a word appears
in the training set using gensim.corpora.Dictionary and call it 'dictionary'
'''
dictionary = gensim.corpora.Dictionary(processed_docs)
'''
Checking dictionary created
'''
count = 0
for k, v in dictionary.items():
    print(k, v)
    count += 1
    if count > 10:
        break
'''
OPTIONAL STEP
Remove very rare and very common words:
- words appearing in fewer than 15 documents
- words appearing in more than 10% of all documents
Of the remaining words, keep only the 100000 most frequent
'''
dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n= 100000)
'''
Create the Bag-of-words model for each document, i.e. for each document we create a list of (word id, count) pairs
reporting which words appear and how many times each appears. Save this to 'bow_corpus'
'''
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
'''
Preview BOW for our sample preprocessed document
'''
document_num = 20
bow_doc_x = bow_corpus[document_num]
for token_id, token_count in bow_doc_x:
    print("Word {} (\"{}\") appears {} time.".format(token_id, dictionary[token_id], token_count))
# LDA mono-core -- fallback code in case LdaMulticore throws an error on your machine
# lda_model = gensim.models.LdaModel(bow_corpus,
#                                    num_topics = 10,
#                                    id2word = dictionary,
#                                    passes = 50)
# LDA multicore
'''
Train your lda model using gensim.models.LdaMulticore and save it to 'lda_model'
'''
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics = 8, id2word = dictionary, passes = 10, workers = 2)
'''
For each topic, we will explore the words occurring in that topic and their relative weights
'''
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")
num = 100
unseen_document = newsgroups_test.data[num]
print(unseen_document)
# Data preprocessing step for the unseen document
bow_vector = dictionary.doc2bow(preprocess(unseen_document))
# List the topics for the unseen document, highest score first
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))
print(newsgroups_test.target[num])
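
To make the dictionary filtering and bag-of-words steps above more concrete, here is a minimal standard-library sketch of the same idea on a toy corpus. It is not gensim's implementation: the helper names `build_vocab` and `to_bow` and the toy documents are hypothetical, and ids are simply assigned in alphabetical order, which gensim does not promise. It assumes the same meaning of `no_below` (minimum number of documents) and `no_above` (maximum fraction of documents) as the `filter_extremes` call in the code.

```python
from collections import Counter

def build_vocab(docs, no_below, no_above):
    """Map each kept token to an id, filtering by document frequency."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token at most once per document
    max_docs = int(no_above * len(docs))
    kept = sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)
    return {token: idx for idx, token in enumerate(kept)}

def to_bow(doc, token2id):
    """Turn a token list into sorted (token id, count) pairs, dropping unknown tokens."""
    counts = Counter(t for t in doc if t in token2id)
    return sorted((token2id[t], n) for t, n in counts.items())

# Hypothetical, already preprocessed documents
toy_docs = [
    ["disk", "fail", "disk", "replac"],
    ["disk", "drive", "fail"],
    ["team", "game", "fail"],
    ["team", "game", "game"],
]

# Keep tokens found in at least 2 documents and in at most 50% of them.
token2id = build_vocab(toy_docs, no_below=2, no_above=0.5)
print(token2id)                       # {'disk': 0, 'game': 1, 'team': 2}
print(to_bow(toy_docs[0], token2id))  # [(0, 2)]
```

Here "fail" is dropped for being too common (3 of 4 documents), while "replac" and "drive" are dropped for being too rare, so the first document reduces to two occurrences of "disk".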
