
App ratings and reviews in the Google Play Store give a decent picture of the overall customer experience, and it is important to understand what customers like or dislike about the product or service offered through the app. Thanks to machine learning, we can analyse millions of such reviews and find out where the product, service or app fails to provide a good customer experience.


This article will help you extract the reviews of any app from the Google Play Store, identify the top customer concerns about the app by extracting the topics of reviews with bad ratings, and analyse these concerns in different ways.


Let’s start!!!


1. Library installation

We use google_play_scraper to extract an app's reviews from the Google Play Store, and scikit-learn (sklearn) to take care of the NLP part.


# Install the scraper and scikit-learn (the PyPI package is "scikit-learn", not "sklearn")
!pip install google-play-scraper
!pip install scikit-learn

import pandas as pd
from google_play_scraper import Sort, reviews_all, reviews
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

2. Reviews data extraction

Fortunately, the google_play_scraper library can extract all reviews and return them in JSON-like form. We just need to pass the app id to the reviews_all function (see the code below).


The app id can be found in the Play Store URL: open the page of the app you are interested in and take the string assigned to the id parameter, for example com.bt.bms in https://play.google.com/store/apps/details?id=com.bt.bms.
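If you already have the full URL, the id can also be read programmatically. This is a minimal sketch using only the standard library; the helper name extract_app_id is hypothetical and not part of google_play_scraper.

```python
from urllib.parse import urlparse, parse_qs


def extract_app_id(play_store_url: str) -> str:
    """Return the value of the `id` query parameter of a Play Store URL.

    Hypothetical helper for illustration only.
    """
    query = parse_qs(urlparse(play_store_url).query)
    if "id" not in query:
        raise ValueError(f"No 'id' parameter found in: {play_store_url}")
    return query["id"][0]


# The id used in this article, embedded in a typical details-page URL
url = "https://play.google.com/store/apps/details?id=com.bt.bms&hl=en"
app_id = extract_app_id(url)
print(app_id)
```

The resulting string is what reviews_all expects as its first argument.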


result = reviews_all('com.bt.bms', sleep_milliseconds=0, lang='en', country='us')

3. Create a dataframe of the reviews

Here I convert the JSON output to a pandas DataFrame to get statistics such as total reviews, reviews by unknown users and reviews by unique users. The total may not match the number shown on the app's Google Play Store page, because the Play Store count also includes blank reviews (most customers give only a numerical rating without writing a review).


One important point about the review data: reviewId is a unique key, and the content column contains the review text. Later in this article, the joins use 'reviewId' and 'content' as the joining keys.


df = pd.DataFrame(result)

unique_users = len(df['userName'].unique())
unknown_users = len(df[df['userName'] == 'A Google user'])
total_reviews = len(df)
mean = df['score'].mean()

print(f'Total textual reviews: {len(result)} \n')
print(f'Total unique users : {unique_users}')
print(f'Total unknown users: {unknown_users}')
print(f'Total users who gave multiple reviews: {total_reviews - unique_users - unknown_users}\n')
print(f'Average rating for this app based on the textual reviews: {round(mean,2)} \n')



Total textual reviews: 233202

Total unique users   : 179630
Total unknown users  : 28231
Total users who gave multiple reviews: 25341

Average rating for this app based on the textual reviews: 3.99
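The "multiple reviews" figure above is obtained by subtraction. One alternative, shown here on made-up data, is to count reviews per user name directly. This is only a sketch: the toy rows are invented, and userName is not a guaranteed unique identifier, since different people can share a name.

```python
import pandas as pd

# Invented rows that mimic the columns used in this article
toy = pd.DataFrame({
    "reviewId": ["r1", "r2", "r3", "r4", "r5"],
    "userName": ["Asha", "Asha", "Ravi", "A Google user", "A Google user"],
    "score": [2, 4, 5, 1, 3],
})

# Leave out anonymous reviewers, then count reviews per remaining name
named = toy[toy["userName"] != "A Google user"]
reviews_per_user = named["userName"].value_counts()

# Number of named users who appear more than once
repeat_reviewers = int((reviews_per_user > 1).sum())
print(f"Named users with more than one review: {repeat_reviewers}")
```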

4. Extract all reviews with rating below 4

We consider only reviews with bad ratings, i.e. textual reviews with a rating below 4 (3 stars or fewer). The assumption is that customers who rate the app below 4 write a negative review. The code also keeps only reviews that are at least 30 characters long.


# Keep reviews rated 3 or lower
df_tm = df[df['score'] <= 3]
# Keep reviews with at least 30 characters of text
df_tm = df_tm[df_tm.content.str.len() >= 30]
print(f'Remaining textual reviews: {len(df_tm)} \n')

Remaining textual reviews: 37996

5. Get the relevant columns for topic modelling

For topic modelling we just need the content column. reviewId is the unique key.


df_tm = df_tm[['reviewId', 'content']].drop_duplicates()
df_tm.dropna(inplace=True)
# Reset to a clean 0..n-1 index so rows line up with the topic results later
df_tm = df_tm.reset_index(drop=True)
print(f'Remaining textual reviews: {len(df_tm)} \n')

Remaining textual reviews: 37996

6. Create document term matrix of the reviews

We use CountVectorizer to create the document term matrix. The two important parameters are:


i. max_df : discard words that occur in more than 95% of the documents
ii. min_df : include only those words that occur in at least 2 documents


cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
dtm = cv.fit_transform(df_tm['content'])
dtm
# shows 8839 terms and 37996 articles
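To see what the two thresholds do, here is a toy example on four invented reviews. It is only a sketch: the sentences are made up, and stop_words is left out so that the effect of max_df and min_df alone is visible.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Four invented reviews; "app" appears in every one of them
docs = [
    "app crashes on payment",
    "app refund not received",
    "app payment failed twice",
    "app is slow after update",
]

toy_cv = CountVectorizer(max_df=0.95, min_df=2)
toy_dtm = toy_cv.fit_transform(docs)

# "app" is in 100% of documents, so max_df=0.95 drops it;
# every word seen in only one document is dropped by min_df=2
vocabulary = sorted(toy_cv.vocabulary_)
print(vocabulary)
print(toy_dtm.toarray())
```

Only "payment" survives here, because it is the one word that occurs in at least two documents but not in all of them.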

7. Using LDA for topic modelling

In LDA, the main parameter is n_components, which decides how many topics are created. By default n_components is 10. As the number of topics grows, the same kinds of words tend to appear across multiple topics. I settled on n_components = 5 after repeating this process several times and comparing the word lists across topics.


LDA = LatentDirichletAllocation(n_components=5, random_state=1)
LDA.fit(dtm)



LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=5, n_jobs=None,
                          perp_tol=0.1, random_state=1, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)
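The number of topics in this article was chosen by reading the word lists, not by a metric. As a complementary check, one could compare model perplexity for a few candidate values. The sketch below does that on an invented mini-corpus; the documents and the candidate values 2, 3 and 4 are assumptions for illustration, and on such a small corpus the numbers mean little.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented reviews, two per rough theme
docs = [
    "internet handling charges too high",
    "convenience fee and charges are high",
    "payment failed and money deducted",
    "refund not received for failed payment",
    "app slow after latest update",
    "new update made the app slow",
    "unable to select seat in theatre",
    "seat selection not showing for theatre",
]
toy_dtm = CountVectorizer().fit_transform(docs)

# Fit one model per candidate topic count and record its perplexity
perplexity_by_k = {}
for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=1)
    lda.fit(toy_dtm)
    perplexity_by_k[k] = lda.perplexity(toy_dtm)

for k, value in perplexity_by_k.items():
    print(f"n_components={k}: perplexity={value:.1f}")
```

Lower perplexity suggests a better statistical fit, but it does not guarantee topics that are easy to interpret, so reading the word lists remains useful.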

Extract the topics and corresponding top 20 (high frequency) words

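# get_feature_names_out() replaces get_feature_names() in recent scikit-learn versions
feature_names = cv.get_feature_names_out()
for index, topic in enumerate(LDA.components_):
    print(f'topic #{index} : ')
    # Last 20 positions of argsort() are the 20 highest-weighted words
    print(feature_names[topic.argsort()[-20:]].tolist())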

The model outputs 5 topics, each represented here by its top 20 words. argsort() returns index positions sorted by ascending weight, so the last 20 positions are the highest-weighted (high frequency) words for a topic.
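The slicing idiom is easier to follow on a small array. In this sketch the five words and their weights are invented, and the top 3 are taken instead of the top 20.

```python
import numpy as np

# Invented vocabulary and topic weights, aligned by position
words = np.array(["slow", "refund", "crash", "ticket", "fee"])
weights = np.array([0.2, 3.1, 0.7, 5.0, 1.4])

# argsort() orders indices from lowest to highest weight
order = weights.argsort()

# The last 3 indices point at the 3 highest weights (still in ascending order)
top3_ascending = words[order[-3:]].tolist()

# Reverse the slice to list the strongest word first
top3_descending = words[order[-3:][::-1]].tolist()

print(top3_ascending)
print(top3_descending)
```

This is why the word lists printed for each topic end with the most characteristic word rather than starting with it.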


Each topic is named manually after interpreting the words within it. The LDA model itself does not assign names to the topics.


topic #0 : Internet Charges related
['rating', 'convenience', 'charging', 'people', 'book', 'movie', 'extra', 'fees', 'fee', 'tickets', 'offers', 'good', 'charge', 'ticket', 'booking', 'high', 'app', 'handling', 'charges', 'internet']

topic #1 : Payment/Offers related
['able', 'card', 'fix', 'problem', 'offers', 'unable', 'shows', 'tried', 'doesn', 'work', 'times', 'book', 'payment', 'try', 'offer', 'open', 'working', 'error', 'time', 'app']

topic #2 : App issue related
['download', 'don', 'like', 'updated', 'slow', 'previous', 'bad', 'phone', 'old', 'user', 'hai', 'need', 'latest', 'good', 'better', 'worst', 'new', 'version', 'update', 'app']

topic #3 : Booking-Refund/Ticket
['care', 'deducted', 'didn', 'account', 'transaction', 'time', 'movie', 'service', 'payment', 'got', 'refund', 'worst', 'customer', 'booking', 'book', 'app', 'booked', 'ticket', 'money', 'tickets']

topic #4 : Booking-Location/language
['add', 'cancellation', 'cinema', 'able', 'theatre', 'good', 'seat', 'ticket', 'location', 'shows', 'movies', 'showing', 'booking', 'available', 'seats', 'option', 'movie', 'book', 'tickets', 'app']

8. Combine the topic modelling results with the base dataset

# One row per review, one column per topic (topic probabilities)
topic_results = LDA.transform(dtm)
df_topic_results = pd.DataFrame(topic_results, columns=[
    '0_InternetCharges',
    '1_Payment/Offers',
    '2_App',
    '3_Booking-Refund/Ticket',
    '4_Booking-Location/language',
])

# Attach topic probabilities to the filtered reviews by row position
df_result = pd.merge(df_tm, df_topic_results, how='inner', left_index=True, right_index=True)
# Bring the results back onto the full review dataset
df_output = pd.merge(df, df_result, how='left', on=['reviewId', 'content'])
df_output.to_csv('app_reviews_bms.csv')
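Each review now carries one probability per topic. For reporting, it can be handy to reduce these to a single dominant topic per review. This step is not part of the original workflow; the sketch below shows one way to do it on invented probabilities, reusing three of the column names from above. The column names dominant_topic and dominant_share are hypothetical.

```python
import pandas as pd

topic_cols = ["0_InternetCharges", "1_Payment/Offers", "2_App"]

# Invented topic probabilities for three reviews (each row sums to 1)
toy = pd.DataFrame(
    [
        [0.70, 0.20, 0.10],
        [0.05, 0.15, 0.80],
        [0.30, 0.60, 0.10],
    ],
    columns=topic_cols,
)

# Name of the column with the highest probability in each row
toy["dominant_topic"] = toy[topic_cols].idxmax(axis=1)
# The probability itself, useful for filtering out weak assignments
toy["dominant_share"] = toy[topic_cols].max(axis=1)

print(toy[["dominant_topic", "dominant_share"]])
```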

Output dataset:


9. Visualising the output using Tableau dashboard

I have used Tableau to analyse and visualise the output dataset.


You can open the dashboard at https://public.tableau.com/views/AppReviews_v2/GooglePlayStoreappreviewanalysis?:language=en&:display_count=y&:origin=viz_share_link to analyse the results, and download the Tableau workbook once you have created your output dataset.


For a better experience of the Tableau visualisation, switch to desktop mode if you are viewing on a phone.


Next steps: we can clean the content by removing non-English words, emojis and common words. That might help improve the accuracy of the model.
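A rough sketch of such a cleaning step, using only the standard library, might look as follows. The function name clean_review and the extra stop-word set are hypothetical. Note that stripping non-ASCII characters removes emojis and non-Latin scripts, but not non-English words written in Latin letters (such as "hai" in the topic output above); those would need a language-detection step that is not shown here.

```python
import re

# Hypothetical domain-specific words that appear in almost every review
EXTRA_STOP_WORDS = {"app", "bms"}


def clean_review(text: str) -> str:
    """Lowercase a review and drop emojis, non-ASCII text, punctuation and extra stop words."""
    # Dropping everything outside ASCII removes emojis and non-Latin scripts
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Keep only lowercase letters and whitespace
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = [token for token in text.split() if token not in EXTRA_STOP_WORDS]
    return " ".join(tokens)


cleaned = clean_review("Worst app 😡!! Refund nahi mila, paisa कट गया")
print(cleaned)
```

The cleaned text could then be passed to CountVectorizer in place of the raw content column.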


I would love to hear your comments on how any part of this process could be improved.


Translated from: https://medium.com/analytics-vidhya/play-store-app-reviews-textual-data-topic-modelling-using-lda-f24bdbd2910d




