目录

  • 1.项目源码
  • 2.数据部分
    • 2.1.数据说明
    • 2.2.数据预处理
    • 2.3.文本清理
  • 3.文本特征提取
  • 4.KNN分类器寻找相似产品
  • 5.基于聚类的词关联

1.项目源码

可在github下载:

https://github.com/chenshunpeng/Amazon-Product-Recommender-System

项目借鉴自github:传送门

2.数据部分

2.1.数据说明

数据集来源为亚马逊产品数据(网站数据集包含来自亚马逊的产品评论和元数据,包括 1996年5月至2014 年 7 月的 1.428 亿条评论)的一部分,具体为服装、鞋子和珠宝部分(一共278,677 条评论,45.1MB),网址为:亚马逊审查数据 (ucsd.edu),所用数据如下图:


数据集要下用于实验的"小"子集(本次实验为45.1MB),对于以G计算的较大数据,下载速度会很慢,还可能突然网速为0,另一方面,数据处理时间过长,不适合编译器的处理,所以我选取的样本是:“Small” subsets for experimentation

评论数据集的信息:

1产品编号asin - ID of the product, e.g. 0000013714
2评价认可度helpful - helpfulness rating of the review, e.g. 2/3
3产品评分overall - rating of the product
4产品评价reviewText - text of the review
5评论时间reviewTime - time of the review (raw)
6评论者编号reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
7评论者名字reviewerName - name of the reviewer
8评论的概要summary - summary of the review
9发表评论的时间unixReviewTime - time of the review (unix time)

数据集的一个评论样例:

{"reviewerID": "A2SUAM1J3GNN3B","asin": "0000013714","reviewerName": "J. McDonald","helpful": [2, 3],"reviewText": "I bought this for my husband who plays the piano.  He is having a wonderful time playing these old hymns.  The music  is at times hard to read because we think the book was published for singing from more than playing from.  Great purchase though!","overall": 5.0,"summary": "Heavenly Highway Hymns","unixReviewTime": 1252800000,"reviewTime": "09 13, 2009"
}

例如,我这里reviews_merged.json的前3条信息如下(共278677条信息):

{"reviewerID": "A1KLRMWW2FWPL4", "asin": "0000031887", "reviewerName": "Amazon Customer \"cameramom\"", "helpful": [0, 0], "reviewText": "This is a great tutu and at a really great price. It doesn't look cheap at all. I'm so glad I looked on Amazon and found such an affordable tutu that isn't made poorly. A++", "overall": 5.0, "summary": "Great tutu-  not cheaply made", "unixReviewTime": 1297468800, "reviewTime": "02 12, 2011"}
{"reviewerID": "A2G5TCU2WDFZ65", "asin": "0000031887", "reviewerName": "Amazon Customer", "helpful": [0, 0], "reviewText": "I bought this for my 4 yr old daughter for dance class, she wore it today for the first time and the teacher thought it was adorable. I bought this to go with a light blue long sleeve leotard and was happy the colors matched up great. Price was very good too since some of these go for over $15.00 dollars.", "overall": 5.0, "summary": "Very Cute!!", "unixReviewTime": 1358553600, "reviewTime": "01 19, 2013"}
{"reviewerID": "A1RLQXYNCMWRWN", "asin": "0000031887", "reviewerName": "Carola", "helpful": [0, 0], "reviewText": "What can I say... my daughters have it in orange, black, white and pink and I am thinking to buy for they the fuccia one. It is a very good way for exalt a dancer outfit: great colors, comfortable, looks great, easy to wear, durables and little girls love it. I think it is a great buy for costumer and play too.", "overall": 5.0, "summary": "I have buy more than one", "unixReviewTime": 1357257600, "reviewTime": "01 4, 2013"}

2.2.数据预处理

首先从json文件(reviews_merged.json)读取数据,数据集用相对路径方便一些,不然得每个.ipynb文件下面都复制一份

我在jupyter上跑的,首先导入一些数据分析的库:

%matplotlib inlineimport numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import DataFrame
import nltkfrom sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import LogisticRegression
from sklearn import neighbors
from scipy.spatial.distance import cosine
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectKBest
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformerimport re
import string
from wordcloud import WordCloud, STOPWORDS
from sklearn.metrics import mean_squared_error# 忽略了警告错误的输出
import warnings
warnings.filterwarnings("ignore")

因为直接在jupyter下用matplotlib画图会出现后台运行的exe文件,而网页上无法显示图片,所以需要在头文件上加上一行代码:%matplotlib inline,具体可看:

完美解决Python下matplotlib绘图中文乱码问题

先介绍一下json文件:

  • 任何支持的类型都可以通过 JSON 来表示,例如字符串、数字、对象、数组等。但是对象和数组是比较特殊且常用的两种类型
  • 对象:对象在 JS 中是使用花括号包裹 {} 起来的内容,数据结构为 {key1:value1, key2:value2, ...} 的键值对结构。在面向对象的语言中,key 为对象的属性,value 为对应的值。键名可以使用整数和字符串来表示。值的类型可以是任意类型
  • 数组:数组在 JS 中是方括号 [] 包裹起来的内容,数据结构为 ["java", "javascript", "vb", ...] 的索引结构。在 JS 中,数组是一种比较特殊的数据类型,它也可以像对象那样使用键值对,但还是索引使用得多。同样,值的类型可以是任意类型

之后尝试用Pandas的read_json()读取文件,但是报错:ValueError: Trailing data,后来发现是 json 格式问题,需将文件里面的字典作为元素保存在列表中,把文件每一行看做一个完整的字典,然后在函数中修改参数pd.read_json('data.json',lines=True),lines 默认为 False ,设为 True 后可以按行读取 json 对象,借鉴博客:

Pandas read_json()时报错ValueError: Trailing data

# df = pd.read_csv('reviews.csv')
df = pd.read_json('reviews_merged.json',lines=True)
# 输出初始时数据
df

结果:

在这里说一下,Pandas DataFrame是带有标签轴(行和列)的二维大小可变的,可能是异构的表格数据结构,算术运算在行和列标签上对齐。可以将其视为Series对象的dict-like容器。它也是Pandas的主要数据结构,其用法可看:

Pandas DataFrame的基本属性详解

pandas 入门:DataFrame的创建,读写,插入和删除

# 获取列索引
print(df.columns)# 执行df.shape会返回一个元组,该元组的第一个元素代表行数,第二个元素代表列数
# 这就是这个数据的基本形状,也是数据的大小
print(df.shape)

结果:

Index(['reviewerID', 'asin', 'reviewerName', 'helpful', 'reviewText','overall', 'summary', 'unixReviewTime', 'reviewTime'],dtype='object')
(278677, 9)

统计每一个产品(按照asin区分产品)的评论个数并加到后面:

count = df.groupby("asin", as_index=False).count()# mean求均值,但在这里没有用到
mean = df.groupby("asin", as_index=False).mean()# 将count连接在df后面
dfMerged = pd.merge(df, count, how='right', on=['asin'])
dfMerged

一些函数用法:

Python-Groupby函数应用

Pandas中groupby的参数as_index的True与False

详解pandas库的pd.merge函数

结果:

reviewerID_x asin    reviewerName_x  helpful_x   reviewText_x    overall_x   summary_x   unixReviewTime_x    reviewTime_x    reviewerID_y    reviewerName_y  helpful_y   reviewText_y    overall_y   summary_y   unixReviewTime_y    reviewTime_y
0   A1KLRMWW2FWPL4  0000031887  Amazon Customer "cameramom"   [0, 0]  This is a great tutu and at a really great pri...   5   Great tutu- not cheaply made    1297468800  02 12, 2011 23  23  23  23  23  23  23  23
1   A2G5TCU2WDFZ65  0000031887  Amazon Customer [0, 0]  I bought this for my 4 yr old daughter for dan...   5   Very Cute!! 1358553600  01 19, 2013 23  23  23  23  23  23  23  23
...

当然如果我们输出中间结果count表格,可以看到其每一列信息都是一样的数字(即评论个数):

dfMerged最后增加3列信息:

dfMerged["totalReviewers"] = dfMerged["reviewerID_y"]
dfMerged["overallScore"] = dfMerged["overall_x"]
dfMerged["summaryReview"] = dfMerged["summary_x"]

结果:

reviewerID_x asin    reviewerName_x  helpful_x   reviewText_x    overall_x   summary_x   unixReviewTime_x    reviewTime_x    reviewerID_y    reviewerName_y  helpful_y   reviewText_y    overall_y   summary_y   unixReviewTime_y    reviewTime_y    totalReviewers  overallScore    summaryReview
0   A1KLRMWW2FWPL4  0000031887  Amazon Customer "cameramom"   [0, 0]  This is a great tutu and at a really great pri...   5   Great tutu- not cheaply made    1297468800  02 12, 2011 23  23  23  23  23  23  23  23  23  5   Great tutu- not cheaply made
1   A2G5TCU2WDFZ65  0000031887  Amazon Customer [0, 0]  I bought this for my 4 yr old daughter for dan...   5   Very Cute!! 1358553600  01 19, 2013 23  23  23  23  23  23  23  23  23  5   Very Cute!!
...

把dfMerged按照totalReviewers排序(递减)并选择评论数超过100条的产品存入dfCount:

dfMerged = dfMerged.sort_values(by='totalReviewers', ascending=False)
dfCount = dfMerged[dfMerged.totalReviewers >= 100]
dfCount

结果:

reviewerID_x asin    reviewerName_x  helpful_x   reviewText_x    overall_x   summary_x   unixReviewTime_x    reviewTime_x    reviewerID_y    reviewerName_y  helpful_y   reviewText_y    overall_y   summary_y   unixReviewTime_y    reviewTime_y    totalReviewers  overallScore    summaryReview
161700  A205ZO9KZY2ZD2  B005LERHD8  Winnie  [0, 0]  I was expecting it to be more of a gold tint w...   4   It's ok    1357776000  01 10, 2013 441 441 441 441 441 441 441 441 441 4   It's ok
161269  A1HFSY6W8LJNJM  B005LERHD8  Alicia7tommy "Alicia Andrews" [0, 0]  The owl necklace is really cute but made real ...   4   Really Cute 1343001600  07 23, 2012 441 441 441 441 441 441 441 441 441 4   Really Cute
...

对每个产品求其评分overall的均值:

首先看一下df:

df

结果:

reviewerID   asin    reviewerName    helpful reviewText  overall summary unixReviewTime  reviewTime
0   A1KLRMWW2FWPL4  0000031887  Amazon Customer "cameramom"   [0, 0]  This is a great tutu and at a really great pri...   5   Great tutu- not cheaply made    1297468800  02 12, 2011
1   A2G5TCU2WDFZ65  0000031887  Amazon Customer [0, 0]  I bought this for my 4 yr old daughter for dan...   5   Very Cute!! 1358553600  01 19, 2013
...

之后求均值(仅保留可以取均值的那些列)到 dfProductReview:

dfProductReview = df.groupby("asin", as_index=False).mean()
dfProductReview

结果:

asin overall unixReviewTime
0   0000031887  4.608696    1.370064e+09
1   0123456479  4.166667    1.382947e+09
...

把评论的概要summary按照asin分组提取出来到ProductReviewSummary,并保存到ProductReviewSummary.csv文件中:

ProductReviewSummary = dfCount.groupby("asin")["summaryReview"].apply(list)
ProductReviewSummary = pd.DataFrame(ProductReviewSummary)
ProductReviewSummary.to_csv("ProductReviewSummary.csv")
ProductReviewSummary

结果:

asin     summaryReview
B000072UMJ  [Love it, Weird sizing on the tag..., Great Sh...
B0000ANHST  [It's a carhartt what more can you say, Nice, ...
...

从ProductReviewSummary读出summary到df3,后面接上含有均值信息的dfProductReview,之后把无关列unixReviewTime去掉(也就是保留'asin','summaryReview','overall'这3列)到df3:

df3 = pd.read_csv("ProductReviewSummary.csv")
df3 = pd.merge(df3, dfProductReview, on="asin", how='inner')
df3 = df3[['asin','summaryReview','overall']]
df3

结果:

asin summaryReview   overall
0   B000072UMJ  ['Love it', 'Weird sizing on the tag...', 'Gre...  4.594595
1   B0000ANHST  ["It's a carhartt what more can you say", 'Nic...   4.487179
...

2.3.文本清理

定义文本清理函数cleanReviews:

#用于文本清理的函数
#匹配以a-z开头的字符串
regEx = re.compile('[^a-z]+')
def cleanReviews(reviewText):reviewText = reviewText.lower()#删除空格reviewText = regEx.sub(' ', reviewText).strip()return reviewText

re.sub (pattern, replacement, string)将所有出现的 pattern 替换为提供的字符串中的 replacement。 这个方法的行为类似于 Python 字符串方法 str.sub,但是使用正则表达式来匹配模式,具体可看:

【python】Regex相关函数的使用

Python学习,python的re模块,正则表达式用法详解,正则表达式中括号的用法

重置索引并删除重复行(可看:(Python)Pandas reset_index()用法总结):

df3["summaryClean"] = df3["summaryReview"].apply(cleanReviews)
#Pandas-去除重复项函数drop_duplicates()
df3 = df3.drop_duplicates(['overall'], keep='last')
#重置索引时,将旧索引添加为列,并使用新的顺序索引
df3 = df3.reset_index()
df3

结果:

 index   asin    summaryReview   overall summaryClean
0   0   B0000ANHST  ["It's a carhartt what more can you say", 'Nic...   4.487179    it s a carhartt what more can you say nice hea...
1   1   B0000C321X  ['NIce fit, nice wash', 'nice', 'nada mejor', ... 4.263415    nice fit nice wash nice nada mejor levi s orig...
...

3.文本特征提取

从df3中提取清洗后的评论summaryClean放入reviews,之后用sklearn的CountVectorizer进行文本特征提取,对于每一个训练文本,其只考虑每种词汇在该训练文本中出现的频率:

reviews = df3["summaryClean"]
# max_features:对所有关键词的term frequency进行降序排序,只取前max_features个作为关键词集
# 停用词设为'english',这类词是可以完全忽略掉,不做统计的
countVector = CountVectorizer(max_features = 300, stop_words='english') transformedReviews = countVector.fit_transform(reviews) dfReviews = DataFrame(transformedReviews.A, columns=countVector.get_feature_names())
dfReviews = dfReviews.astype(int)
dfReviews

同样我们注意到有些单词对情感分类是毫无意义的,这类词有个名字,叫“Stop_Word”(停用词),这类词是可以完全忽略掉不做统计的,显然忽略掉这些词,词频记录的存储空间能够得到优化,而且构建速度也更快

在csdn上关于stop_words的介绍很模糊,可在stackoverflow找到一些见解(可看:scikit learn classifies stopwords)

还有一些很有用的博客:

sklearn—CountVectorizer详解

用Python开始机器学习(5:文本特征抽取与向量化)

【SKLEARN】使用CountVector类来提取词频特征,并计算其TF-IDF特征(含可执行代码)

结果:

把dfReviews保存至dfReviews.csv:

# 保存
dfReviews.to_csv("dfReviews.csv")

4.KNN分类器寻找相似产品

创建数据集和测试集:

# 创建一个名为X的数据集
X = np.array(dfReviews)
# 创建数据集和测试集
tpercent = 0.9
tsize = int(np.floor(tpercent * len(dfReviews)))
dfReviews_train = X[:tsize]
dfReviews_test = X[tsize:]
#数据集和测试集的长度
lentrain = len(dfReviews_train)
lentest = len(dfReviews_test)
print(lentrain)
print(lentest)

结果:

80
9

之后用k最近邻算法(据说是最简单的机器学习算法QAQ)查找最相关的产品

k最近邻算法可看博客:knn scikit_Scikit学习-KNN学习

当然阅读国外介绍也可以,比国内详细很多,在这里说一下,国外的自成体系,如果要详细研究,一定要亲自读外文(sklearn.impute.KNNImputer):

记得老师上课讲了一下,有中英文2个机器学习网站(scikit-learn)是常用的,一定要多看:

1:scikit-learn Machine Learning in Python

2:scikit-learn (sklearn) 官方文档中文版

代码:

neighbor = NearestNeighbors(n_neighbors=3, algorithm='ball_tree').fit(dfReviews_train)# 为了在对象X中找到每个点的k邻域,需要在对象X上调用kneighbors()函数
distances, indices = neighbor.kneighbors(dfReviews_train)# 查找最相关的产品
for i in range(lentest):a = neighbor.kneighbors([dfReviews_test[i]])related_product_list = a[1]first_related_product = [item[0] for item in related_product_list]first_related_product = str(first_related_product).strip('[]')first_related_product = int(first_related_product)second_related_product = [item[1] for item in related_product_list]second_related_product = str(second_related_product).strip('[]')second_related_product = int(second_related_product)print ("Based on product reviews, for ", df3["asin"][lentrain + i] ," average rating is ",df3["overall"][lentrain + i])print ("The first similar product is ", df3["asin"][first_related_product] ," average rating is ",df3["overall"][first_related_product])print ("The second similar product is ", df3["asin"][second_related_product] ," average rating is ",df3["overall"][second_related_product])print ("-----------------------------------------------------------")

结果:

Based on product reviews, for  B008RUOCJU  average rating is  3.973684210526316
The first similar product is  B007WAEBPQ  average rating is  4.333333333333333
The second similar product is  B004R1II48  average rating is  4.055555555555555
-----------------------------------------------------------
Based on product reviews, for  B008WYDP1C  average rating is  4.257028112449799
The first similar product is  B007WA3K4Y  average rating is  4.209424083769633
The second similar product is  B0083S18LQ  average rating is  3.9565217391304346
-----------------------------------------------------------
Based on product reviews, for  B008X0EW44  average rating is  3.874125874125874
The first similar product is  B007WAEBPQ  average rating is  4.333333333333333
The second similar product is  B0083S18LQ  average rating is  3.9565217391304346
-----------------------------------------------------------
Based on product reviews, for  B009DNWFD0  average rating is  3.8446601941747574
The first similar product is  B0053XF2U2  average rating is  3.8684210526315788
The second similar product is  B004R1II48  average rating is  4.055555555555555
-----------------------------------------------------------
Based on product reviews, for  B009ZDEXQK  average rating is  4.7254901960784315
The first similar product is  B000EIJG0I  average rating is  4.594594594594595
The second similar product is  B001Q5QLP6  average rating is  4.673913043478261
-----------------------------------------------------------
Based on product reviews, for  B00BNB3A0W  average rating is  3.4414414414414414
The first similar product is  B004Z1CZDK  average rating is  3.1923076923076925
The second similar product is  B0053XF2U2  average rating is  3.8684210526315788
-----------------------------------------------------------
Based on product reviews, for  B00CIBCJ62  average rating is  4.2164179104477615
The first similar product is  B004R1II48  average rating is  4.055555555555555
The second similar product is  B007WAEBPQ  average rating is  4.333333333333333
-----------------------------------------------------------
Based on product reviews, for  B00CKGB85I  average rating is  4.066666666666666
The first similar product is  B004R1II48  average rating is  4.055555555555555
The second similar product is  B0074T7TY0  average rating is  4.255474452554744
-----------------------------------------------------------
Based on product reviews, for  B00CN47GXA  average rating is  3.4634146341463414
The first similar product is  B007WAU1VY  average rating is  3.551470588235294
The second similar product is  B007WAEBPQ  average rating is  4.333333333333333
-----------------------------------------------------------
Based on product reviews, for  B00D1MR8YU  average rating is  3.83739837398374
The first similar product is  B004R1II48  average rating is  4.055555555555555
The second similar product is  B0053XF2U2  average rating is  3.8684210526315788
-----------------------------------------------------------
Based on product reviews, for  B00DMWQK0W  average rating is  4.298076923076923
The first similar product is  B0078FXHNM  average rating is  4.26056338028169
The second similar product is  B007WAEBPQ  average rating is  4.333333333333333
-----------------------------------------------------------
Based on product reviews, for  B00DMWQOYY  average rating is  4.119718309859155
The first similar product is  B0067GUM2W  average rating is  4.174863387978142
The second similar product is  B0078FXHNM  average rating is  4.26056338028169
-----------------------------------------------------------
Based on product reviews, for  B00DNQIIE8  average rating is  4.228758169934641
The first similar product is  B0078FXHNM  average rating is  4.26056338028169
The second similar product is  B0067GUM2W  average rating is  4.174863387978142
-----------------------------------------------------------
Based on product reviews, for  B00DQYNS3I  average rating is  4.526315789473684
The first similar product is  B003YBHF82  average rating is  4.21
The second similar product is  B000FH4JJQ  average rating is  4.536363636363636
-----------------------------------------------------------

按照格式打印数据:

# 按照格式打印数据
# print ("Based on product reviews, for ", df3["asin"][260] ," average rating is ",df3["overall"][260])
print ("The first similar product is ", df3["asin"][first_related_product] ," average rating is ",df3["overall"][first_related_product])
print ("The second similar product is ", df3["asin"][second_related_product] ," average rating is ",df3["overall"][second_related_product])
print ("-----------------------------------------------------------")

结果:

The first similar product is  B003YBHF82  average rating is  4.21
The second similar product is  B000FH4JJQ  average rating is  4.536363636363636
-----------------------------------------------------------

预测评分:

df5_train_target = df3["overall"][:lentrain]
df5_test_target = df3["overall"][lentrain:lentrain+lentest]
df5_train_target = df5_train_target.astype(int)
df5_test_target = df5_test_target.astype(int)n_neighbors = 3
knnclf = neighbors.KNeighborsClassifier(n_neighbors, weights='distance')
knnclf.fit(dfReviews_train, df5_train_target)
knnpreds_test = knnclf.predict(dfReviews_test)print(classification_report(df5_test_target, knnpreds_test))

结果:

           precision    recall  f1-score   support3       1.00      1.00      1.00         34       1.00      1.00      1.00         6avg / total       1.00      1.00      1.00         9

模型的准确性:

print (accuracy_score(df5_test_target, knnpreds_test))
print(mean_squared_error(df5_test_target, knnpreds_test))

结果:

1.0
0.0

5.基于聚类的词关联

先看一下df:

df

结果:

之后按照评分归类评论:

cluster = df.groupby("overall")["summary"].apply(list)
cluster

结果:

overall
1    [Never GOT IT...., DO NOT BUY IF YOU EVER WANT...
2    [too short, I'm glad i bought back up straps, ...
3    [Came apart in 2weeks!, Arrived with a defect,...
4    [It's ok, Good, Practically Perfect in every w...
5    [Great tutu-  not cheaply made, Very Cute!!, I...
Name: summary, dtype: object

把聚类后的数据转换为Dataframe型:

可参考:dataframe数据标准化处理_pandas用法及数据预处理实例

cluster = pd.DataFrame(cluster)
cluster

结果:

保存到cluster.csv,把数据从cluster.csv导入cluster1并清洗数据:

cluster.to_csv("cluster.csv")
cluster1 = pd.read_csv("cluster.csv")
cluster1["summaryClean"] = cluster1["summary"].apply(cleanReviews)
cluster1

结果:

可视化每个分数组的单词云:

stopwords = set(STOPWORDS)def show_wordcloud(data, title = None):wordcloud = WordCloud(background_color='white',stopwords=stopwords,max_words=500,max_font_size=30, scale=3,random_state=1 # chosen at random by flipping a coin; it was heads).generate(str(data))fig = plt.figure(1, figsize=(8, 8))plt.axis('off')if title: fig.suptitle(title, fontsize=20)fig.subplots_adjust(top=2.3)plt.imshow(wordcloud)plt.show()

注意:为了展示不同分数评论的云图,需要使汉字在图表中显示,需要加上#coding:utf-8并用matplotlib.use('qt4agg') 来指定默认字体,用matplotlib.rcParams['axes.unicode_minus'] = False来解决负号’-'显示为方块的问题,否则会报错

#coding:utf-8
import matplotlib
matplotlib.use('qt4agg')
#指定默认字体
matplotlib.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['font.family']='sans-serif'
#解决负号'-'显示为方块的问题
matplotlib.rcParams['axes.unicode_minus'] = False

之后展示不同分数评论的云图:

show_wordcloud(cluster1["summaryClean"][0], title = "1分的评论")

show_wordcloud(cluster1["summaryClean"][1] , title = "2分的评论")

show_wordcloud(cluster1["summaryClean"][2], title = "3分的评论")

show_wordcloud(cluster1["summaryClean"][3], title = "4分的评论")

show_wordcloud(cluster1["summaryClean"][4], title = "5分的评论")

show_wordcloud(cluster1["summaryClean"][:], title = "评分1-5的总评论")

亚马逊商城评论数据分析与可视化(KNN预测评分,绘制云图)相关推荐

  1. 【数据分析案例】2020年亚马逊商品评论数据分析

    亚马逊商品评论数据分析 <2020年美国数学建模竞赛C题:亚马逊商品评论数据分析>的解题思路与代码. 题目介绍 亚马逊网上商城为客户提供了对交易进行评分和评价的机会.客户可以提供三种评价内 ...

  2. python评论数据分析_Python亚马逊智能产品评论数据分析

    一.概述 本文主要使用Python对亚马逊智能产品评论数据集进行数据清洗,通过对不同产品的评分进行变换后得到产品的好评数.中评数.差评数以及对应的比率等,通过上述这些指标,找到好评率最高的产品.并尝试 ...

  3. 基于Python获取亚马逊的评论

    上次亚马逊的商品信息都获取到了,自然要看一下评论的部分.用户的评论能直观的反映当前商品值不值得购买,亚马逊的评分信息也能获取到做一个评分的权重. 亚马逊的评论区由用户ID,评分及评论标题,地区时间,评 ...

  4. python根据频率画出词云_利用pandas+python制作100G亚马逊用户评论数据词云

    原标题:利用pandas+python制作100G亚马逊用户评论数据词云 数据挖掘入门与实战 公众号: datadw 我们手里面有一个差不多100G的亚马逊用户在购买商品后留下的评论数据(数据格式为j ...

  5. 用EXCEL分析《亚马逊智能产品评论》

    一.业务分析 数据分析的前提是,对业务流程.业务指标和业务分析方法的熟悉和精通.因此,为了探究<亚马逊智能产品评论>这一数据集能提供怎样有价值的信息,需要对电商领域的业务流程.业务指标和业 ...

  6. 利用python对亚马逊智能产品评论进行分析

    一.分析目标 Amazon是网络上最早开始经营电子商务的公司之一,是全球电商行业的佼佼者.而人们在电商平台进行购物时,会根据评论数量以及好评和坏评作为判断依据来购买商品.故本文采取亚马逊智能产品评论作 ...

  7. selenium爬取亚马逊商品评论

    亚马逊商品评论有反爬虫,所以就用selenium爬了.网速一定要好,不然爬的真的是天昏地暗.配合多线程就会快很多,这个不写了,爬的时候手动复制了N个代码去爬.还有一个点,中文和英文的设置,可以在评论里 ...

  8. 亚马逊推荐python_Python之亚马逊智能产品评论分析

    在万物互联时代,智能产品正在推动着智能生活迅速普及.聚焦广阔的智能产品市场,亚马逊一直致力于扩大智能产品阵容,作为典型的跨境电商B2C平台,与消费者互动的最直接.高效的方法是分析消费者评论. 一.明确 ...

  9. 我在亚马逊商城卧底的日子

    英文原文:My week as an Amazon insider 亚马逊这世界上最大的网路公司,近来却不断有质疑他们对待员工方式不合理的声音传出,也有员工因为压力过大传出精神出问题的消息.到底在这家 ...

最新文章

  1. 四川大学计算机学院夏欣,自适应窗口的图像平滑去噪.pdf
  2. think php5是什么,thinkphp5是什么
  3. Pixhawk代码分析-姿态解算篇D
  4. 使用matplotlib做动态排名图
  5. JVMTM Tool Interface
  6. elasticsearch集群搭建-windows
  7. mysql事务所_mysql事务
  8. 信息学奥赛一本通(1245:不重复地输出数)
  9. 本博文记录需要解决的问题
  10. 写给非网工的CCNA教程(4)聊聊ping命令后的原理(续)
  11. 编码表(ASCII码GB2312gbkunicodeUTF-8)
  12. 天空盒 Skybox
  13. Java项目的一些文件及其功能、使用方式。
  14. 什么是 Refresh Token
  15. 一本通1373:鱼塘钓鱼(fishing)
  16. nmap架构图_Nmap扫描教程之Nmap基础知识
  17. android音频驱动工程师,4.Android音频驱动(底层1)
  18. nginx全天候监测大流量可疑ip的封禁脚本
  19. 一根木棍随机折成三段,能组成三角形的概率多大?
  20. 《算法导论》第三版第31章 数论算法 练习思考题 个人答案

热门文章

  1. 工具及方法 - 字幕下载网站
  2. 什么是量化交易-量化交易demo
  3. 局域网屏幕共享_计算机:如何通过局域网方式接入Internet
  4. (附源码)springboot校园疫情管理系统 毕业设计021506
  5. 【淘宝商品】获取淘宝商品ID、获取淘宝商品详情
  6. 网狐大联盟脚本还原数据库
  7. 使用Bcrypt进行密码加密
  8. 普通平键的主要尺寸有_普通平键的尺寸规范
  9. java 视频转码 视频压缩
  10. SQL实现一对多、多对多建表与查询