
Web scraping, Sentiment analysis, LDA topic modeling


Project Overview

In this project, we are going to scrape the reviews of “Hotel Beresford”, located in San Francisco, CA, from booking.com. Then we will explore the data, generate WordClouds, perform sentiment analysis, and build an LDA topic model.

Problem Statement

The goal of the project is to use text analytics and Natural Language Processing (NLP) to extract actionable insights from the reviews and help the hotel improve guest satisfaction.


Methodologies

(1) Web Scraping


The hotel reviews will be scraped from booking.com using requests together with BeautifulSoup. The detailed steps are covered in a later section.

(2) Exploratory Data Analysis (EDA)


We will use a pie chart, a histogram, and a seaborn violin plot to get a better understanding of the reviews and ratings data.


(3) WordClouds


To generate more meaningful WordClouds, we will add some custom stop words and use lemmatization to merge closely related word forms.


(4) Sentiment Analysis


The sentiment analysis helps to classify the polarity and subjectivity of the overall reviews and determine whether the expressed opinion in the reviews is mostly positive, negative, or neutral.


(5) LDA Topic Model


In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model in which sets of observations are explained by unobserved groups, and those groups account for why some parts of the data are similar. For text, each document is modeled as a mixture of topics and each topic as a distribution over words. We will use scikit-learn's GridSearchCV to find the best topic model. The two tuning parameters are: (1) n_components, the number of topics, and (2) learning_decay, which controls the learning rate.

Metrics

To diagnose the model performance, we will take a look at the perplexity and log-likelihood scores of the LDA model.


Perplexity captures how surprised a model is by new data it has not seen before. It is derived from the normalized log-likelihood of a held-out test set: perplexity = exp(-1 * log-likelihood per word). Log-likelihood is a measure of how plausible the model parameters are given the data.

A model with higher log-likelihood and lower perplexity is considered a better model. However, perplexity might not be the best measure for evaluating topic models because it does not consider the context of words or the semantic associations between them.

How to Scrape the Reviews?

We are going to scrape the reviews of “Hotel Beresford”, located in San Francisco, CA. To scrape any website, we first need to find the pattern of the URL and then inspect the web page. However, the full link to the hotel's review page is extremely long.


After several tries, you will realize that a much shorter link of the form https://www.booking.com/reviews/us/hotel/beresford.html?page=1 generates the same page. If you want to scrape another hotel, simply replace “beresford” with the name booking.com uses for that hotel in its links. The page number at the end takes you to the review page you want to see.
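The URL pattern can be captured in a small helper. This is only a minimal sketch: the function name build_review_url is hypothetical, and the URL format is the one used by the scraping function in this article, which booking.com may change at any time.

```python
def build_review_url(hotel_linkname: str, page_number: int) -> str:
    """Return the reviews-page URL for a hotel link name and a page number."""
    base = "https://www.booking.com/reviews/us/hotel/"
    return f"{base}{hotel_linkname}.html?page={page_number}"


# URLs of the first three review pages for Hotel Beresford
urls = [build_review_url("beresford", page) for page in range(1, 4)]
for url in urls:
    print(url)
```

Swapping "beresford" for another hotel's link name is the only change needed to target a different hotel.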


Right-click anywhere on the web page and select “Inspect” to view the HTML and CSS of the page elements. Here we find that the review section we want to scrape is a ul element with the class review_list (“ul.review_list”).

Review Section that includes reviewers’ information and both positive and negative reviews

Under this tag, we want to scrape the following information:


1. Basic information of the reviewer and reviews:


  • Rating Score


  • Reviewer Name


  • Reviewer’s Nationality


  • Overall Review (contains both positive & negative reviews)


  • Number of Reviews Written by the Reviewer


  • Review Date


  • Review Tags (Trip type, such as business trip, leisure trip, etc.)


2. Positive reviews


3. Negative Reviews


Reviewer’s info | Ratings | Review date
Review Tags | Positive Reviews | Negative Reviews

Now that we have found all of the HTML tags, let’s start coding! First, we need to import all the Python packages we need.

## importing packages ##
import re
import string
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
%matplotlib inline

import requests
from bs4 import BeautifulSoup as bs

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords, wordnet
from wordcloud import WordCloud
# Uncomment on the first run to download the NLTK resources
# nltk.download('averaged_perceptron_tagger')
# nltk.download('vader_lexicon')
# nltk.download('wordnet')
from textblob import TextBlob

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import LatentDirichletAllocation

import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

Part I. Scrape Reviews from Booking.com & Clean Unstructured Text

Scraping the reviews involves the following steps:

  1. Specify the URL of the reviews page.
  2. Send an HTTP request to the URL and save the response from the server in a response object.
  3. Create a BeautifulSoup object by passing the raw HTML content from step 2 and specifying the HTML parser we want to use, such as html.parser or html5lib.
  4. Navigate and search the parse tree using BeautifulSoup’s tree-searching method .find_all().
  5. Try to scrape one page first. Print out each scraped text to identify patterns, and clean the text with the .strip() and .replace() methods.
  6. Create for loops to store the results in three lists.
  7. Use a while loop to scrape all the pages.
  8. Convert the lists into dataframes.
  9. Put everything into a function called “scrape_reviews”.
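Steps 3 to 5 can be tried offline before sending any request. The sketch below parses a small hand-written HTML snippet instead of a live page; the class names match those used in this article's scraping code, but the snippet itself and its review texts are made up for illustration and the real page layout may differ.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one reviews page (illustrative only)
SAMPLE_HTML = """
<ul class="review_list">
  <li>
    <span class="review-score-badge"> 8.3 </span>
    <p class="review_pos"> Great location </p>
    <p class="review_neg"> Noisy street </p>
  </li>
  <li>
    <span class="review-score-badge"> 6.0 </span>
    <p class="review_pos"> Friendly staff </p>
  </li>
</ul>
"""

# Step 3: build the parse tree with Python's built-in parser
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Step 4: locate the review section, then search inside it
review_box = soup.find("ul", {"class": "review_list"})

# Step 5: extract the text of each tag and strip surrounding whitespace
ratings = [tag.text.strip() for tag in review_box.find_all("span", {"class": "review-score-badge"})]
positive = [tag.text.strip() for tag in review_box.find_all("p", {"class": "review_pos"})]
negative = [tag.text.strip() for tag in review_box.find_all("p", {"class": "review_neg"})]

print(ratings, positive, negative)
```

Note that the second review has no negative part, so the three lists can have different lengths. This is why positive and negative reviews are kept in separate dataframes.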

The second function “show_data” will print out the length of a dataframe, total NAs, as well as the first five lines of a dataframe.


def scrape_reviews(hotel_linkname, total_pages):
    '''
    Function to scrape hotel reviews from booking.com
    INPUTS:
        hotel_linkname - hotel name in the booking.com weblink
        total_pages - the total number of reviews pages to scrape
    OUTPUTS:
        reviewer_info - a dataframe that includes reviewers' basic information
        pos_reviews - a dataframe that includes all the positive reviews
        neg_reviews - a dataframe that includes all the negative reviews
    '''
    # Create empty lists to put in reviewers' information as well as all of the positive & negative reviews
    info = []
    positive = []
    negative = []

    # booking.com reviews link
    url = 'https://www.booking.com/reviews/us/hotel/' + hotel_linkname + '.html?page='
    page_number = 1

    # Use a while loop to scrape all the pages
    while page_number <= total_pages:
        page = requests.get(url + str(page_number))  # retrieve data from the server
        soup = bs(page.text, "html.parser")  # initiate a BeautifulSoup object using the HTML source and Python's html.parser
        review_box = soup.find('ul', {'class': 'review_list'})

        # ratings
        ratings = [i.text.strip() for i in review_box.find_all('span', {'class': 'review-score-badge'})]

        # reviewer_info
        reviewer_info = [i.text.strip() for i in review_box.find_all('span', {'itemprop': 'name'})]
        reviewer_name = reviewer_info[0::3]
        reviewer_country = reviewer_info[1::3]
        general_review = reviewer_info[2::3]

        # reviewer_review_times
        review_times = [i.text.strip() for i in review_box.find_all('div', {'class': 'review_item_user_review_count'})]

        # review_date (replace() removes the literal prefix; strip('Reviewed: ') would strip a character set instead)
        review_date = [i.text.strip().replace('Reviewed: ', '') for i in review_box.find_all('p', {'class': 'review_item_date'})]

        # reviewer_tag
        reviewer_tag = [i.text.strip().replace('\n\n\n', '').replace('•', ',').lstrip(', ') for i in review_box.find_all('ul', {'class': 'review_item_info_tags'})]

        # positive_review
        positive_review = [i.text.strip('눇').strip() for i in review_box.find_all('p', {'class': 'review_pos'})]

        # negative_review
        negative_review = [i.text.strip('눉').strip() for i in review_box.find_all('p', {'class': 'review_neg'})]

        # append all reviewers' info into one list
        for i in range(len(reviewer_name)):
            info.append([ratings[i], reviewer_name[i], reviewer_country[i], general_review[i],
                         review_times[i], review_date[i], reviewer_tag[i]])

        # build positive review list
        for i in range(len(positive_review)):
            positive.append(positive_review[i])

        # build negative review list
        for i in range(len(negative_review)):
            negative.append(negative_review[i])

        # page change
        page_number += 1

    # Reviewer_info df
    reviewer_info = pd.DataFrame(info, columns=['Rating', 'Name', 'Country', 'Overall_review', 'Review_times', 'Review_date', 'Review_tags'])
    reviewer_info['Rating'] = pd.to_numeric(reviewer_info['Rating'])
    reviewer_info['Review_times'] = pd.to_numeric(reviewer_info['Review_times'].apply(lambda x: re.findall(r"\d+", x)[0]))
    reviewer_info['Review_date'] = pd.to_datetime(reviewer_info['Review_date'])

    # positive & negative reviews dfs
    pos_reviews = pd.DataFrame(positive, columns=['positive_reviews'])
    neg_reviews = pd.DataFrame(negative, columns=['negative_reviews'])

    return reviewer_info, pos_reviews, neg_reviews


def show_data(df):
    print("The length of the dataframe is: {}".format(len(df)))
    print("Total NAs: {}".format(df.isnull().sum().sum()))  # count NAs in the dataframe that was passed in
    return df.head()

There are 42 review pages for Hotel Beresford. Adjust the hotel’s name and the total number of review pages when you scrape other hotels.

reviewer_info, pos_reviews, neg_reviews = scrape_reviews('beresford',total_pages = 42)

Now, let’s check our scraped data by applying the “show_data” function we defined before.


# reviewers' basic information
show_data(reviewer_info)
# positive reviews
show_data(pos_reviews)
# negative reviews
show_data(neg_reviews)

“reviewer_info” dataframe
“pos_reviews” dataframe
“neg_reviews” dataframe

We have 1030 records for the dataframe that contains the basic information of the reviewers as well as rating scores, review dates, and review tags.


For positive reviews, we have scraped 651 records and 614 for the negatives.


Part II. Exploratory Data Analysis (EDA)

Before doing further analyses, let’s perform the exploratory data analysis (EDA) first to get a “feel” of the data we have!


1. Ratio of positive and negative reviews

fig = plt.figure()
ax = fig.add_axes([0, 0, 1, 1])
values = [len(pos_reviews), len(neg_reviews)]
ax.pie(values,
       labels=['Number of Positive Reviews', 'Number of Negative Reviews'],
       colors=['gold', 'lightcoral'],
       shadow=True,
       startangle=90,
       autopct='%1.2f%%')
plt.title('Positive Reviews Vs. Negative Reviews');

There are slightly more positive reviews than negative ones.

2. The Distribution of Ratings


# Histogram of Ratings
px.histogram(reviewer_info, x="Rating", nbins=30, title = 'Histogram of Ratings')


The histogram of ratings is left-skewed.

print("For all of the reviewers, they came from {} different countries.".format(reviewer_info.Country.nunique()))
print("\n")
top10_df = reviewer_info.groupby('Country').size().reset_index().sort_values(0, ascending = False).head(10)
top10_df.columns = ['Country', 'Counts']
print("Top 10 countries ranked by review counts")

People from different countries may have different standards when it comes to rating hotels and their services. The 1030 reviewers came from 69 different countries. Here, we only visualize the rating distributions of the top 10 countries ranked by number of reviews.

A violin plot visualizes the distribution of the data and its probability density. The plot is ordered by the number of reviews from each country and shows how ratings relate to the reviewers’ countries of origin. From the box plot elements, we see that the median ratings given by reviewers from the U.S. and Ireland are a bit higher than those given by reviewers from other countries, while the median rating given by reviewers from Italy is the lowest. Most of the distributions are skinny at each end and wide in the middle, which indicates that the ratings are highly concentrated around the median, at around 7 to 8. However, we probably need more data to get a better idea of the distributions.
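The medians read off the violin plot can also be checked numerically. The following is a rough, standard-library-only sketch of that check; the (country, rating) pairs are made up for illustration and are not the scraped data.

```python
from collections import defaultdict
from statistics import median

# Made-up (country, rating) pairs, for illustration only
sample = [
    ("United States", 8.3), ("United States", 7.5), ("United States", 9.2),
    ("Ireland", 8.0), ("Ireland", 8.8),
    ("Italy", 6.3), ("Italy", 7.1), ("Italy", 5.4),
]

# Group the ratings by country
ratings_by_country = defaultdict(list)
for country, rating in sample:
    ratings_by_country[country].append(rating)

# Order countries by review count (descending), as the violin plot does
ordered = sorted(ratings_by_country, key=lambda c: len(ratings_by_country[c]), reverse=True)

# Median rating per country, in plot order
medians = {country: median(ratings_by_country[country]) for country in ordered}
lowest = min(medians, key=medians.get)
print(medians, lowest)
```

With the real data, the same grouping could be done on the reviewer_info dataframe after restricting it to the top 10 countries.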

3. Review Tags Counts for each Trip Type

# Define tag list
tag_list = ['Business', 'Leisure', 'Group', 'Couple', 'Family', 'friends', 'Solo']

# Count for each review tag
tag_counts = []
for tag in tag_list:
    counts = reviewer_info['Review_tags'].str.count(tag).sum()
    tag_counts.append(counts)

# Convert to a dataframe
trip_type = pd.DataFrame({'Trip Type': tag_list, 'Counts': tag_counts}).sort_values('Counts', ascending=False)

# Visualize the trip type counts from Review_tags
fig = px.bar(trip_type, x='Trip Type', y='Counts', title='Review Tags Counts for each Trip Type')


Quite often, one review has multiple tags. The bar chart shows that most people came to San Francisco for leisure trips, either as couples or by themselves. Fewer people came with their family or with a group, and even fewer came with friends. Out of 1030 reviews, only 164 were tagged “Business”, which means only about 16% of the reviewers came on business trips. However, we should take into account that people on leisure trips are usually more likely to have the time or the willingness to write reviews, while those on business trips may be too busy or may simply not want to write any.

Part III. Text Analytics

  1. Lemmatize Tokens


Lemmatization reduces the inflected forms of a word to a single base form (lemma), so that closely related words are counted as one. WordNet and the Treebank POS tagger use different tagging systems, so we first define a mapping from Treebank POS tags to WordNet tags. Then we lemmatize the words using NLTK. After generating the WordClouds, I added extra custom stop words to the lemmatization process below.

# wordnet and treebank have different tagging systems
# Create a function to define a mapping between wordnet tags and POS tags
def get_wordnet_pos(pos_tag):
    if pos_tag.startswith('J'):
        return wordnet.ADJ
    elif pos_tag.startswith('V'):
        return wordnet.VERB
    elif pos_tag.startswith('N'):
        return wordnet.NOUN
    elif pos_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # default, return wordnet tag "NOUN"


# Create a function to lemmatize tokens in the reviews
def lemmatized_tokens(text):
    text = text.lower()
    pattern = r'\b[a-zA-Z]{3,}\b'  # words of at least three letters
    tokens = nltk.regexp_tokenize(text, pattern)  # tokenize the text
    tagged_tokens = nltk.pos_tag(tokens)  # a list of tuples (word, pos_tag)

    stop_words = stopwords.words('english')
    new_stopwords = ["hotel", "everything", "anything", "nothing", "thing", "need",
                     "good", "great", "excellent", "perfect", "much", "even", "really"]  # customize extra stop_words
    stop_words.extend(new_stopwords)
    stop_words = set(stop_words)

    wordnet_lemmatizer = WordNetLemmatizer()
    # get lemmatized tokens, calling "get_wordnet_pos" to map each POS tag;
    # stop words and punctuation are removed
    lemmatized_words = [wordnet_lemmatizer.lemmatize(word, get_wordnet_pos(tag))
                        for (word, tag) in tagged_tokens
                        if word not in stop_words and word not in string.punctuation]
    return lemmatized_words
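The tokenizing and stop-word filtering inside lemmatized_tokens can be tried in isolation. The sketch below uses only the standard library and skips the POS tagging and lemmatization steps; the function name simple_tokens and the tiny stop-word set are assumptions made for the example, whereas the article uses NLTK's English stop-word list plus custom words.

```python
import re

# Same token pattern as in the article: alphabetic words of at least three letters
TOKEN_PATTERN = re.compile(r"\b[a-zA-Z]{3,}\b")

# Tiny hand-picked stop-word set, for illustration only
STOP_WORDS = {"the", "were", "but", "hotel"}


def simple_tokens(text):
    """Lowercase the text, keep words of 3+ letters, and drop stop words."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


tokens = simple_tokens("The rooms were small, but the staff is OK!")
print(tokens)
```

Short words such as "is" and "OK" never reach the stop-word filter because the pattern already excludes anything under three letters.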

2. Generate WordClouds


# Create a function to generate wordcloud
def wordcloud(review_df, review_colname, color, title):
    '''
    INPUTS:
        review_df - dataframe, positive or negative reviews
        review_colname - column name, positive or negative review
        color - background color of the wordcloud
        title - title of the wordcloud
    OUTPUT:
        Wordcloud visualization
    '''
    text = review_df[review_colname].tolist()
    text_str = ' '.join(lemmatized_tokens(' '.join(text)))  # call function "lemmatized_tokens"
    wc = WordCloud(collocations=False,
                   background_color=color,
                   width=1600, height=800,
                   margin=2,
                   min_font_size=20).generate(text_str)
    plt.figure(figsize=(15, 10))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis("off")
    plt.figtext(.5, .8, title, fontsize=20, ha='center')
    plt.show()


# WordCloud for positive reviews
wordcloud(pos_reviews, 'positive_reviews', 'white', 'Positive Reviews: ')

# WordCloud for negative reviews
wordcloud(neg_reviews, 'negative_reviews', 'black', 'Negative Reviews:')

In the positive reviews, most people seem satisfied with the location (very convenient, close to Union Square and Chinatown, with restaurants and pubs easy to find nearby), the friendly and helpful staff, the clean rooms, the comfortable beds, and the good price.


The negative reviews also mention words like “breakfast”, “room” and “staff” quite often, but here people were probably complaining about rude staff, small rooms, and the coffee, cereal, and muffins provided at breakfast. The air conditioning or the shower system may need improvement, as we see words like “hot”, “cold”, “air”, “condition”, “bathroom” and “shower” in the WordCloud. The hotel may also need to solve issues related to soundproofing and parking.

3. Sentiment Analysis


Here, we are using the Overall_review column in the reviewer_info dataframe to perform the sentiment analysis.


# Create a function to get the subjectivity
def subjectivity(text):
    return TextBlob(text).sentiment.subjectivity

# Create a function to get the polarity
def polarity(text):
    return TextBlob(text).sentiment.polarity

# Create two new columns
reviewer_info['Subjectivity'] = reviewer_info['Overall_review'].apply(subjectivity)
reviewer_info['Polarity'] = reviewer_info['Overall_review'].apply(polarity)

# Create a function to compute the negative, neutral and positive analysis
def getAnalysis(score):
    if score < 0:
        return 'Negative'
    elif score == 0:
        return 'Neutral'
    else:
        return 'Positive'

reviewer_info['Analysis'] = reviewer_info['Polarity'].apply(getAnalysis)

# plot the polarity and subjectivity
fig = px.scatter(reviewer_info, x='Polarity', y='Subjectivity', color='Analysis', size='Subjectivity')

# add a vertical line at x=0 for neutral reviews
fig.update_layout(title='Sentiment Analysis',
                  shapes=[dict(type='line', yref='paper', y0=0, y1=1, xref='x', x0=0, x1=0)])


The x-axis shows polarity, and the y-axis shows subjectivity. Polarity tells how positive or negative the text is, and subjectivity tells how subjective or opinionated it is. The green dots that lie on the vertical line are the “neutral” reviews, the red dots on the left are the “negative” reviews, and the blue dots on the right are the “positive” reviews. Bigger dots indicate more subjectivity. We see that there are more positive reviews than negative ones.
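The three-way split by polarity sign can be tallied without any plotting. This is a small sketch with made-up polarity scores (TextBlob polarity values lie between -1 and 1); the function name label_polarity is a stand-in for the article's getAnalysis.

```python
from collections import Counter


def label_polarity(score):
    """Map a polarity score to a sentiment label based on its sign."""
    if score < 0:
        return "Negative"
    if score == 0:
        return "Neutral"
    return "Positive"


# Made-up polarity scores, for illustration only
scores = [0.8, -0.25, 0.0, 0.1, -0.6]
labels = [label_polarity(score) for score in scores]
counts = Counter(labels)
print(counts)
```

Because only an exact score of 0 counts as neutral, even a faintly positive review such as 0.1 lands in the “Positive” group, which is worth keeping in mind when comparing group sizes.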

4. LDA Topic Modelling


Now, let’s apply the LDA model to find the topic distribution of each document and the high-probability words in each topic. Here, we want to look specifically at the negative reviews to find out which aspects the hotel should focus on improving.

Below are the steps to find the optimal LDA model:


1. Convert the reviews to document-term matrix


TF (term frequency) counts the number of times a word appears in a text, and IDF (inverse document frequency) measures the relative importance of the word based on how many texts contain it. TF-IDF combines the two, which adjusts for the fact that some words, such as "we" and "the", appear frequently in general. We discard words that appear in more than 90% of the reviews and words that appear in fewer than 10 reviews: very frequent words are too common to be meaningful in topics, while rare words do not carry a strong enough signal and might even introduce noise into our model.
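To make the weighting concrete, here is a hand-rolled sketch on three toy documents. It uses the textbook formula idf = log(N / df), which is an assumption made for simplicity: scikit-learn's TfidfVectorizer applies a smoothed variant and normalizes each row, so its numbers will differ. The filtering thresholds are also scaled down to fit the toy corpus.

```python
import math
from collections import Counter

# Three toy "reviews", already tokenized (illustrative only)
docs = [
    ["room", "small", "noisy"],
    ["room", "cold", "shower"],
    ["room", "noisy", "street"],
]
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in docs for term in set(doc))

# Textbook inverse document frequency
idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tf_idf(doc):
    """Weight each term of a document by term count times IDF."""
    term_counts = Counter(doc)
    return {term: count * idf[term] for term, count in term_counts.items()}


weights = tf_idf(docs[0])

# Filtering in the spirit of max_df=0.9 and min_df (here 2 instead of 10)
kept = sorted(term for term, df in doc_freq.items() if df / n_docs <= 0.9 and df >= 2)
print(weights, kept)
```

The word "room" appears in every document, so its weight is zero and the max_df filter drops it, while words seen only once fall below the min_df threshold.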

2. GridSearch and parameter tuning to find the optimal LDA model


Grid search can consume a lot of time because it constructs an LDA model for every possible combination of parameter values in the param_grid dict. So, here we are only tuning two parameters: (1) n_components (number of topics) and (2) learning_decay (which controls the learning rate).
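A quick count shows why the search is slow. The sketch below enumerates the grid with itertools; the parameter values are those used later in this article, while the figure of 5 cross-validation folds is an assumption based on GridSearchCV's default in recent scikit-learn versions.

```python
from itertools import product

search_params = {
    "n_components": [5, 10, 15, 20, 25, 30],
    "learning_decay": [0.5, 0.7, 0.9],
}

# Every combination of parameter values that the grid search will try
names = list(search_params)
combinations = [dict(zip(names, values)) for values in product(*search_params.values())]

cv_folds = 5  # assumption: default number of folds in recent scikit-learn versions
total_fits = len(combinations) * cv_folds

print(len(combinations), "combinations,", total_fits, "model fits")
```

Adding a third parameter with, say, four values would multiply the number of fits by four, which is why the grid is kept to two parameters.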

3. Output the optimal lda model and its parameters


A good model should have a higher log-likelihood and a lower perplexity, where perplexity = exp(-1 * log-likelihood per word). However, perplexity might not be the best measure for evaluating topic models because it does not consider the context of words or the semantic associations between them.
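The relationship between the two scores is easy to verify by hand. In this sketch the log-likelihood values come from an idealized model that assigns the same probability to every word, which is an assumption chosen so that the result can be checked mentally.

```python
import math


def perplexity(total_log_likelihood, n_words):
    """Perplexity = exp(-1 * log-likelihood per word)."""
    return math.exp(-total_log_likelihood / n_words)


n_words = 10

# A model that gives each of the 10 words probability 1/4
ll_quarter = n_words * math.log(1 / 4)
ppl_quarter = perplexity(ll_quarter, n_words)

# A better model that gives each word probability 1/2
ll_half = n_words * math.log(1 / 2)
ppl_half = perplexity(ll_half, n_words)

print(ppl_quarter, ppl_half)
```

A perplexity of about 4 means the model is as uncertain as if it were choosing uniformly among 4 words at each position; the better model has a higher log-likelihood and a lower perplexity.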

4. Compare LDA Model Performance Scores


The line plot shows the LDA model performance scores with different params


# Create a function to build the optimal LDA model
def optimal_lda_model(df_review, review_colname):
    '''
    INPUTS:
        df_review - dataframe that contains the reviews
        review_colname - name of column that contains reviews
    OUTPUTS:
        best_lda_model - the best Latent Dirichlet Allocation (LDA) model found by the grid search
        dtm_tfidf - document-term matrix in the tfidf format
        tfidf_vectorizer - the fitted TF-IDF vectorizer
        A graph comparing LDA Model Performance Scores with different params
    '''
    docs_raw = df_review[review_colname].tolist()

    #************   Step 1: Convert to document-term matrix   ************#
    # Transform text to vector form using the vectorizer object
    tf_vectorizer = CountVectorizer(strip_accents='unicode',
                                    stop_words='english',
                                    lowercase=True,
                                    token_pattern=r'\b[a-zA-Z]{3,}\b',  # num chars >= 3 to avoid some meaningless words
                                    max_df=0.9,                         # discard words that appear in > 90% of the reviews
                                    min_df=10)                          # discard words that appear in < 10 reviews
    # apply transformation
    tfidf_vectorizer = TfidfVectorizer(**tf_vectorizer.get_params())
    # convert to document-term matrix
    dtm_tfidf = tfidf_vectorizer.fit_transform(docs_raw)
    print("The shape of the tfidf is {}, meaning that there are {} {} and {} tokens made through the filtering process.".format(
        dtm_tfidf.shape, dtm_tfidf.shape[0], review_colname, dtm_tfidf.shape[1]))

    #*******   Step 2: GridSearch & parameter tuning to find the optimal LDA model   *******#
    # Define Search Param
    search_params = {'n_components': [5, 10, 15, 20, 25, 30], 'learning_decay': [.5, .7, .9]}
    # Init the Model
    lda = LatentDirichletAllocation()
    # Init Grid Search Class
    model = GridSearchCV(lda, param_grid=search_params)
    # Do the Grid Search
    model.fit(dtm_tfidf)

    #*****  Step 3: Output the optimal lda model and its parameters  *****#
    # Best Model
    best_lda_model = model.best_estimator_
    # Model Parameters
    print("Best Model's Params: ", model.best_params_)
    # Log Likelihood Score: Higher the better
    print("Model Log Likelihood Score: ", model.best_score_)
    # Perplexity: Lower the better. Perplexity = exp(-1. * log-likelihood per word)
    print("Model Perplexity: ", best_lda_model.perplexity(dtm_tfidf))

    #***********   Step 4: Compare LDA Model Performance Scores   ***********#
    # Get Log Likelihoods from Grid Search Output (reuse the fitted search instead of fitting it a second time)
    gscore = model.cv_results_
    n_topics = [5, 10, 15, 20, 25, 30]
    log_likelyhoods_5 = [gscore['mean_test_score'][gscore['params'].index(v)] for v in gscore['params'] if v['learning_decay'] == 0.5]
    log_likelyhoods_7 = [gscore['mean_test_score'][gscore['params'].index(v)] for v in gscore['params'] if v['learning_decay'] == 0.7]
    log_likelyhoods_9 = [gscore['mean_test_score'][gscore['params'].index(v)] for v in gscore['params'] if v['learning_decay'] == 0.9]

    # Show graph
    plt.figure(figsize=(12, 8))
    plt.plot(n_topics, log_likelyhoods_5, label='0.5')
    plt.plot(n_topics, log_likelyhoods_7, label='0.7')
    plt.plot(n_topics, log_likelyhoods_9, label='0.9')
    plt.title("Choosing Optimal LDA Model")
    plt.xlabel("Num Topics")
    plt.ylabel("Log Likelihood Scores")
    plt.legend(title='Learning decay', loc='best')
    plt.show()

    return best_lda_model, dtm_tfidf, tfidf_vectorizer


best_lda_model, dtm_tfidf, tfidf_vectorizer = optimal_lda_model(neg_reviews, 'negative_reviews')

From the graph, we see that the choice of learning decay has little impact, whereas 5 topics produce the best model.


Now, let’s output the words in the topics we just created.


# Create a function to inspect the topics we created
def display_topics(model, feature_names, n_top_words):
    '''
    INPUTS:
        model - the model we created
        feature_names - tells us what word each column in the matrix represents
        n_top_words - number of top words to display
    OUTPUTS:
        a dataframe that contains the topics we created and the weights of each token
    '''
    topic_dict = {}
    for topic_idx, topic in enumerate(model.components_):
        topic_dict["Topic %d words" % (topic_idx+1)] = ['{}'.format(feature_names[i])
                                                        for i in topic.argsort()[:-n_top_words - 1:-1]]
        topic_dict["Topic %d weights" % (topic_idx+1)] = ['{:.1f}'.format(topic[i])
                                                          for i in topic.argsort()[:-n_top_words - 1:-1]]
    return pd.DataFrame(topic_dict)


# Note: newer scikit-learn releases replace get_feature_names() with get_feature_names_out()
display_topics(best_lda_model, tfidf_vectorizer.get_feature_names(), n_top_words=20)

Top 20 words in each topic and their corresponding weights
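The slice [:-n_top_words - 1:-1] used in display_topics is compact but easy to misread. The sketch below reproduces it with plain Python lists in place of NumPy's argsort; the words and weights are made up for illustration.

```python
# Made-up topic weights, one per vocabulary word (illustrative only)
feature_names = ["bed", "noise", "parking", "shower"]
topic_weights = [0.1, 2.5, 0.7, 1.9]
n_top_words = 2

# Indices sorted by ascending weight, like numpy's argsort
ascending = sorted(range(len(topic_weights)), key=lambda i: topic_weights[i])

# Walk backwards from the end and stop after n_top_words items,
# which yields the indices of the largest weights in descending order
top_indices = ascending[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top_indices]

print(ascending, top_indices, top_words)
```

Reading the slice from right to left: the step of -1 reverses the direction, and the stop value of -n_top_words - 1 ends the walk once n_top_words items have been taken.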

Now, let’s visualize the topics with pyLDAvis!

pyLDAvis is a great tool for interpreting individual topics and the relationships between them. A good topic model will have fairly big, non-overlapping bubbles scattered throughout the chart instead of clustered in one quadrant. On the left-hand side of the visualization, each topic is represented by a bubble. The larger the bubble, the more prevalent the topic. The index inside each circle gives the topic's rank by area, with number 1 being the most popular topic and number 5 the least popular. The distance between two bubbles represents the topic similarity. However, this is just an approximation of the original topic similarity matrix, because we are only using a two-dimensional scatter plot to represent the spatial distribution of all 5 topics.

The right-hand side shows the 30 most relevant terms for the topic you select on the left. The blue bar represents the overall term frequency, and the red bar indicates the estimated term frequency within the selected topic. So, if you see a bar with both red and blue, the term also appears in other topics. You can hover over the term to see which other topics include it.

You can adjust the relevance metric (λ) by sliding the bar in the top right corner. It strikes a balance between two rankings: with λ = 0, terms are ranked by how exclusive they are to the selected topic, and as you slide λ to the right, more weight goes to terms that are frequent in the topic even if they also appear in other topics.
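The effect of the slider can be reproduced with the relevance formula from the LDAvis paper by Sievert and Shirley: relevance = λ * log p(word | topic) + (1 - λ) * log(p(word | topic) / p(word)). The sketch below applies it to two hypothetical terms whose probabilities are made up for illustration.

```python
import math


def relevance(p_word_given_topic, p_word, lam):
    """Relevance of a term to a topic for a given lambda between 0 and 1."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)


# Made-up probabilities: (p(word | topic), p(word) over the whole corpus)
common = (0.05, 0.04)      # frequent in this topic, but frequent everywhere
exclusive = (0.02, 0.004)  # rarer overall, but concentrated in this topic

# lambda = 1 ranks purely by frequency within the topic
common_first = relevance(*common, lam=1.0) > relevance(*exclusive, lam=1.0)

# lambda = 0 ranks purely by lift, which favors exclusive terms
exclusive_first = relevance(*exclusive, lam=0.0) > relevance(*common, lam=0.0)

print(common_first, exclusive_first)
```

Intermediate values of λ blend the two rankings, which is what the slider controls.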

# Topic Modelling Visualization for the Negative Reviews
pyLDAvis.sklearn.prepare(best_lda_model, dtm_tfidf, tfidf_vectorizer)


From the Sentiment Analysis scatter plot, we see that positive reviews only slightly outnumber the negative ones, so Hotel Beresford definitely needs to improve guest satisfaction.


The WordCloud reveals some problems for the hotel manager to look into, such as the breakfast. However, it is probably necessary to read the detailed reviews about breakfast to figure out what exactly needs to be improved, perhaps the coffee or the pastries that appeared in the WordCloud. Also, the hotel manager should train the staff to provide friendlier and better service. The hotel may also need to address issues related to soundproofing, air conditioning, the shower system, and parking.

The EDA section gives the hotel manager a general idea of the reviews as well as the rating distribution. The pyLDAvis interactive visualization helps the hotel manager further understand what the most popular topics within the negative reviews are and make improvements accordingly.

Future Work

Many of the analyses are limited by the size of the scraped data. Here, we are only scraping reviews written in English. According to San Francisco Travel Reports, 2.9 million international visitors came to San Francisco in 2019. Visitors from non-English-speaking countries are most likely to leave reviews in their native language. Scraping reviews in other languages and translating them, or scraping them after translation, could help increase the data volume.

To provide more useful suggestions to Hotel Beresford, we may also analyze its competitors to gain insights into guest preferences as well as valuable information that Hotel Beresford may not get from its own reviews.


I hope this analysis also benefits people who are interested in text analytics. Please check out my GitHub link for the full code and analysis. Thank you!

Translated from: https://medium.com/swlh/sentiment-analysis-topic-modeling-for-hotel-reviews-6b83653f5b08



