Topic modelling of Google Play Store app reviews using LDA

App ratings and reviews on the Google Play Store give a good picture of the overall customer experience, so it is important to understand what customers like and dislike about the product or service offered through the app. With machine learning we can analyse millions of such reviews and find out where the product, service, or app fails to provide a good customer experience.

This article shows how to extract the reviews of any app from the Google Play Store, identify the top customer concerns by extracting topics from the reviews with bad ratings, and analyse those concerns in different ways.

Text analytics, Python, NLP, LDA, Tableau, Data science

Let's start!

1. Library installation

We use google_play_scraper to extract app reviews from the Google Play Store and scikit-learn (sklearn) to handle the NLP part.

!pip install google_play_scraper
# The PyPI package is named scikit-learn; it is imported as sklearn
!pip install scikit-learn
import pandas as pd
from google_play_scraper import Sort, reviews_all, reviews
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

2. Reviews data extraction

Fortunately, the google_play_scraper library can extract all reviews and return them in a JSON-like format (a list of dictionaries). We only need to pass the app ID to the reviews_all function (see the code below).

The app ID can be found in the Play Store URL once you open a particular app's page: it is the string assigned to id (see the highlighted part of the URL in the image below).

result = reviews_all('com.bt.bms', sleep_milliseconds=0, lang='en', country='us')
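If you would rather not copy the ID by hand, it can be read from the URL's query string. The sketch below is illustrative only: the helper name app_id_from_url is made up for this example, and the sample URL is simply the usual Play Store details-page form with the app ID used in this article.

```python
from urllib.parse import urlparse, parse_qs


def app_id_from_url(url):
    """Return the value of the 'id' query parameter of a Play Store URL."""
    query = parse_qs(urlparse(url).query)
    ids = query.get("id")
    if not ids:
        raise ValueError("no 'id' parameter found in the URL")
    return ids[0]


# Sample URL in the usual details-page form (assumed for illustration).
url = "https://play.google.com/store/apps/details?id=com.bt.bms&hl=en"
app_id = app_id_from_url(url)
print(app_id)
```

The returned string is what you would pass as the first argument to reviews_all.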

3. Create a dataframe of the reviews

Here I convert the JSON output to a pandas DataFrame to get statistics such as total reviews, reviews by unknown users, and reviews by unique users. The total may not match the total shown on the Google Play Store page, because the store's count also includes blank reviews (most customers give only a numerical rating and write no text).

One important point about the review data: reviewId is a unique key, and the content column contains the review text. Later in this article, reviewId and content are used as the join keys.

df = pd.DataFrame(result)
unique_users = len(df['userName'].unique())
# Reviews posted under the placeholder name Google uses for anonymous accounts
unknown_users = len(df[df['userName'] == 'A Google user'])
total_reviews = len(df)
mean = df['score'].mean()
print(f'Total textual reviews: {len(result)} \n')
print(f'Total unique users : {unique_users}')
print(f'Total unknown users: {unknown_users}')
# Rough estimate: reviews left over after subtracting unique names and anonymous reviews
print(f'Total users who gave multiple reviews: {total_reviews - unique_users - unknown_users}\n')
print(f'Average rating for this app based on the textual reviews: {round(mean,2)} \n')

Output:

Total textual reviews: 233202
Total unique users   : 179630
Total unknown users  : 28231
Total users who gave multiple reviews: 25341
Average rating for this app based on the textual reviews: 3.99
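The "multiple reviews" figure comes from a subtraction, so it is only a rough estimate. If you want the names that actually appear more than once, counting them directly is one option. The sketch below uses a small hypothetical sample shaped like the scraper output (only the userName field), not the real data. Note that different people can share a display name, so this is still an approximation.

```python
from collections import Counter

# Hypothetical sample; the real list comes from reviews_all().
sample_reviews = [
    {"userName": "Asha"},
    {"userName": "Ravi"},
    {"userName": "Asha"},
    {"userName": "A Google user"},
    {"userName": "A Google user"},
    {"userName": "Ravi"},
    {"userName": "Meera"},
]

ANONYMOUS = "A Google user"

# Count reviews per named user, leaving anonymous reviews out.
counts = Counter(
    r["userName"] for r in sample_reviews if r["userName"] != ANONYMOUS
)

# Names that occur on more than one review.
repeat_reviewers = sorted(name for name, n in counts.items() if n > 1)
print(repeat_reviewers)
```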

4. Extract all reviews with a rating below 4

We consider only the reviews with bad ratings, i.e. textual reviews with a rating below 4, on the assumption that customers who rate the app below 4 write a negative review. The code also keeps only reviews that are at least 30 characters long.

df_tm = df[df['score'] <= 3]
df_tm = df_tm[df_tm.content.str.len() >= 30]
print(f'Remaining textual reviews: {len(df_tm)} \n')
# Output: Remaining textual reviews: 37996

5. Get the relevant columns for topic modelling

For topic modelling we only need the content column; reviewId is kept as the unique key.

df_tm = df_tm[['reviewId', 'content']].drop_duplicates()
df_tm.dropna(inplace=True)
# Reset the index so row positions run 0..n-1; step 8 merges on this index
df_tm = df_tm.reset_index().drop(columns='index')
print(f'Remaining textual reviews: {len(df_tm)} \n')
# Output: Remaining textual reviews: 37996

6. Create the document-term matrix of the reviews

We use CountVectorizer to create the document-term matrix. The two important parameters are:

i. max_df: discard words that occur in more than 95% of the documents
ii. min_df: include only words that occur in at least 2 documents

cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
dtm = cv.fit_transform(df_tm['content'])
dtm  # sparse matrix: 37996 documents (reviews) x 8839 terms
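To make the two thresholds concrete, here is a pure-Python sketch of the document-frequency filter they describe. It is not how scikit-learn implements CountVectorizer: the whitespace tokenisation is deliberately naive, and the function name filter_vocabulary and the four sample reviews are invented for this illustration.

```python
from collections import Counter

# Invented sample reviews for illustration.
docs = [
    "app crashes on payment",
    "payment failed app",
    "app refund pending",
    "app slow",
]


def filter_vocabulary(docs, max_df=0.95, min_df=2):
    """Keep terms found in at least min_df documents and in at most
    a max_df fraction of all documents."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # set() so that a term counts once per document
        doc_freq.update(set(doc.lower().split()))
    return sorted(
        term
        for term, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    )


vocabulary = filter_vocabulary(docs)
print(vocabulary)
```

Here "app" is dropped because it appears in every document, and words such as "slow" are dropped because they appear in only one, which leaves "payment".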

7. Using LDA for topic modelling

In LDA the main parameter is n_components, which decides how many topics are created. By default n_components is 10. The more topics we create, the more likely it is that the same kind of words appears in several topics. I settled on n_components = 5 after repeating this process and comparing the word lists across topics.

LDA = LatentDirichletAllocation(n_components=5, random_state=1)
LDA.fit(dtm)

Output:

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=5, n_jobs=None,
                          perp_tol=0.1, random_state=1, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)

Extract the topics and the corresponding top 20 (highest-weight) words

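# get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases no longer provide
feature_names = cv.get_feature_names_out()
for index, topic in enumerate(LDA.components_):
    print(f'topic #{index} : ')
    # argsort() sorts ascending, so the last 20 indices are the highest-weight words
    print(feature_names[topic.argsort()[-20:]].tolist())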

The output of the model is 5 topics, each with a list of its top 20 words. argsort() returns the index positions that would sort a topic's word weights in ascending order, so the last 20 positions point to the highest-weight words of that topic.
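A tiny NumPy example may help to see what the argsort slice does. The four words and weights below are invented for illustration and are not taken from the fitted model.

```python
import numpy as np

# Invented vocabulary and topic weights for illustration.
words = np.array(["fee", "refund", "slow", "ticket"])
weights = np.array([0.1, 2.5, 0.7, 1.9])

order = weights.argsort()      # indices that sort the weights ascending
top2 = order[-2:]              # the two highest weights, lowest of them first
top2_desc = top2[::-1]         # reverse to list the highest weight first

top_words = words[top2_desc].tolist()
print(top_words)
```

The lists printed by the model are in ascending order for the same reason, so the strongest word of each topic is the last one in its list.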

The topic names are assigned manually after interpreting the words within each topic; the LDA model itself does not name the topics.

topic #0 : Internet Charges related
['rating', 'convenience', 'charging', 'people', 'book', 'movie', 'extra', 'fees', 'fee', 'tickets', 'offers', 'good', 'charge', 'ticket', 'booking', 'high', 'app', 'handling', 'charges', 'internet']
topic #1 : Payment/Offers related
['able', 'card', 'fix', 'problem', 'offers', 'unable', 'shows', 'tried', 'doesn', 'work', 'times', 'book', 'payment', 'try', 'offer', 'open', 'working', 'error', 'time', 'app']
topic #2 : App issue related
['download', 'don', 'like', 'updated', 'slow', 'previous', 'bad', 'phone', 'old', 'user', 'hai', 'need', 'latest', 'good', 'better', 'worst', 'new', 'version', 'update', 'app']
topic #3 : Booking-Refund/Ticket
['care', 'deducted', 'didn', 'account', 'transaction', 'time', 'movie', 'service', 'payment', 'got', 'refund', 'worst', 'customer', 'booking', 'book', 'app', 'booked', 'ticket', 'money', 'tickets']
topic #4 : Booking-Location/language
['add', 'cancellation', 'cinema', 'able', 'theatre', 'good', 'seat', 'ticket', 'location', 'shows', 'movies', 'showing', 'booking', 'available', 'seats', 'option', 'movie', 'book', 'tickets', 'app']
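Step 7 mentions comparing word lists across topics when choosing n_components. One way to put a number on that comparison, offered here as a suggestion rather than as the method used in this article, is the Jaccard overlap between the top words of each pair of topics. The sketch uses the last five words of each list printed above.

```python
from itertools import combinations

# Top five words per topic, taken from the tail of each printed list.
top_words = {
    0: {"high", "app", "handling", "charges", "internet"},
    1: {"open", "working", "error", "time", "app"},
    2: {"worst", "new", "version", "update", "app"},
    3: {"app", "booked", "ticket", "money", "tickets"},
    4: {"option", "movie", "book", "tickets", "app"},
}


def jaccard(a, b):
    """Shared words divided by all distinct words in the two sets."""
    return len(a & b) / len(a | b)


overlaps = {
    (i, j): jaccard(top_words[i], top_words[j])
    for i, j in combinations(top_words, 2)
}

# Print topic pairs from most to least overlapping.
for pair, score in sorted(overlaps.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

A high overlap between many pairs would suggest that the number of topics is too large for the data. Here the two booking topics (3 and 4) overlap the most, which fits their names.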

8. Combine the topic modelling results with the base dataset

topic_results = LDA.transform(dtm)
df_topic_results = pd.DataFrame(topic_results, columns=['0_InternetCharges', '1_Payment/Offers', '2_App', '3_Booking-Refund/Ticket', '4_Booking-Location/language'])
# Row i of df_topic_results belongs to row i of df_tm (its index was reset in step 5)
df_result = pd.merge(df_tm, df_topic_results, how='inner', left_index=True, right_index=True)
# Left join keeps every original review; reviews left out of the model get empty topic columns
df_output = pd.merge(df, df_result, how='left', on=['reviewId', 'content'])
df_output.to_csv('app_reviews_bms.csv')
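LDA.transform returns one weight per topic for every review. For some analyses it can be handy to also store the single strongest topic of each review. The sketch below shows one way to do that with pandas. The three rows of weights are invented, only three of the five topic columns are used to keep it short, and the column names dominant_topic and dominant_weight are hypothetical additions that are not part of the original output file.

```python
import pandas as pd

topic_cols = ["0_InternetCharges", "1_Payment/Offers", "2_App"]

# Invented document-topic weights; each row sums to 1.
toy = pd.DataFrame(
    [
        [0.70, 0.20, 0.10],
        [0.05, 0.15, 0.80],
        [0.30, 0.60, 0.10],
    ],
    columns=topic_cols,
)

# Name of the column holding the largest weight in each row.
toy["dominant_topic"] = toy[topic_cols].idxmax(axis=1)
# The largest weight itself, useful for filtering out weak assignments.
toy["dominant_weight"] = toy[topic_cols].max(axis=1)

print(toy[["dominant_topic", "dominant_weight"]])
```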

Output dataset:

9. Visualising the output using a Tableau dashboard

I used Tableau to analyse and visualise the output dataset.

You can go to this link: https://public.tableau.com/views/AppReviews_v2/GooglePlayStoreappreviewanalysis?:language=en&:display_count=y&:origin=viz_share_link

There you can analyse the results, and you can download the Tableau workbook once you have created your own output dataset.

For a better experience of the Tableau visualisation, switch to desktop mode if you are viewing it on a phone.

Next steps: we can clean the content by removing non-English words, emojis, and common words. That might help to improve the accuracy of the model.
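As a starting point for that clean-up, here is a rough sketch using only the standard library. It rests on crude assumptions: dropping every non-ASCII character removes emojis and non-Latin scripts but also accented letters, and transliterated words such as 'hai' (which shows up in topic #2) can only be caught with a hand-made word list. The function name clean_review and the EXTRA_STOP_WORDS set are hypothetical.

```python
import re

# Anything outside the ASCII range: emojis, non-Latin scripts, accented letters.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")

# Hypothetical extra stop words: very common or transliterated terms.
EXTRA_STOP_WORDS = {"app", "hai"}


def clean_review(text):
    """Drop non-ASCII characters, lowercase, keep alphabetic tokens,
    and remove the extra stop words."""
    text = NON_ASCII.sub(" ", text)
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in EXTRA_STOP_WORDS)


cleaned = clean_review("Worst app 😡 payment failed हाई hai!!")
print(cleaned)
```

The cleaned text could then be passed to CountVectorizer in place of the raw content column.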

I would love to hear your comments on how any part of this process could be improved.

Translated from: https://medium.com/analytics-vidhya/play-store-app-reviews-textual-data-topic-modelling-using-lda-f24bdbd2910d
