Data Analysis of Popular Reviews for The Wandering Earth (流浪地球)

Now that the data has been collected, we can start analyzing it.

(The article on scraping the data is here: https://blog.csdn.net/weixin_44508906/article/details/87904982)

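First, let's lay out the plan for the analysis.

Whatever the analysis, the first step is always data processing: get the data into the format we need and clean it
Inspect the data and compute descriptive statistics (skipped here, because there is only one score field and the dataset is too small), then decide on the metrics to analyze
Carry out the analysis
Draw conclusions and write the report

Here are the concrete steps of this analysis.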

1. Read the data and do some simple processing

comments.csv: the review data

cities.csv: the city each reviewer lives in

import pandas as pd
import matplotlib.pyplot as plt
import jieba
import re
import warnings
from pyecharts import Style, Geo, Map, Line, Pie  # top-level imports as in the pyecharts 0.5.x API
from chinese_province_city_area_mapper.transformer import CPCATransformer
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from snownlp import SnowNLP

# Read the data
df1 = pd.read_csv('comments.csv', names=['name', 'score', 'comment', 'date', 'href'])
df2 = pd.read_csv('cities.csv', names=['city'])
df = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')  # merge on the index
df.drop('href', axis=1, inplace=True)  # drop the href column
df.drop_duplicates(subset=None, keep='first', inplace=True)  # deduplicate (there are no duplicates here)
df.dropna(axis=0, inplace=True)  # drop missing values (there are none here); inplace so the result is kept

# Remove the <span> tags from comment and keep only Chinese characters
def comment_process(comment):
    # str.strip() removes a set of characters rather than a literal tag, so the
    # tags are left to the regex below, which drops every non-Chinese character
    comment = comment.replace('\n', '').replace('\r', '')
    p = re.compile('[^\u4e00-\u9fa5]')  # Chinese characters lie in the range \u4e00 to \u9fa5
    comment = re.sub(p, '', comment)
    return comment

df['comment'] = df['comment'].apply(comment_process)  # apply is faster than an explicit loop

# Convert the rating labels to numbers
df['score1'] = df['score']  # keep the original label
# Labels not in the mapping (such as '--' for unrated) become NaN
df['score'] = df['score'].map({'力荐': 5, '推荐': 4, '还行': 3, '较差': 2, '很差': 1})
# Parse the date column and normalize it to 'YYYY-MM-DD' strings
df['date'] = pd.to_datetime(df['date']).dt.strftime('%Y-%m-%d')

The result after processing
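
The comment cleaning keeps only Chinese characters, which also removes the surrounding <span> markup, punctuation, and line breaks. Here is a minimal sketch of that rule on its own; the sample string is made up for illustration and is not taken from the dataset.

```python
import re

# Matches any character outside the range \u4e00-\u9fa5
NON_CHINESE = re.compile('[^\u4e00-\u9fa5]')

def comment_process(comment):
    """Keep only Chinese characters; tags, punctuation, digits and whitespace are dropped."""
    return NON_CHINESE.sub('', comment)

# Hypothetical sample value, not a real row from comments.csv
sample = '<span class="short">特效不错，值得一看！\n</span>'
cleaned = comment_process(sample)
print(cleaned)  # 特效不错值得一看
```

Note that full-width punctuation such as "，" and "！" also falls outside this range, so sentence boundaries are lost after cleaning.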

2. Geographic location of reviewers

# Clean the city data: '讷河, 齐齐哈尔' becomes 齐齐哈尔, '江苏南京' becomes 南京, and foreign cities are removed
def city_process(line):
    city = re.compile('[^\u4e00-\u9fa5]')  # Chinese characters lie in the range \u4e00 to \u9fa5
    # Split on non-Chinese characters, which returns a list
    zh = re.split(city, line)
    # Take the last item, e.g. 齐齐哈尔 from '讷河, 齐齐哈尔'
    zh = zh[-1]
    return zh

df['city'] = df['city'].apply(city_process)
# Extract the city (市) from each value
cpca = CPCATransformer()
df['city'] = cpca.transform(df.city)['市']

df1 = df[df['city'] != '']  # drop rows with an empty city
counts = df1['city'].value_counts()
attr = counts.index.to_list()
value = counts.to_list()

# Map style
style = Style(
    title_color="#fff",
    title_pos="center",
    width=1200,
    height=600,
    background_color='#404a59'
)

# Title: reviewer locations for The Wandering Earth; subtitle: data source Douban
chart = Geo('<流浪地球>评论用户地理位置', '数据来源：豆瓣', **style.init_style)
# attributes, values
chart.add('', attr, value,
          # visual_range=[0, 200],
          visual_text_color="#fff", is_legend_show=False,
          symbol_size=15, is_visualmap=True,
          tooltip_formatter='{b}',
          label_emphasis_textsize=15,
          label_emphasis_pos='right')
# chart.render('邪不压正粉丝人群地理位置.html')  # save to a file
chart  # display inline

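As the map shows, most reviewers are in Beijing, Shanghai, and the coastal cities. A possible reason is that consumption levels are higher in these areas and people there have more varied leisure lives.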
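
The regex step of the city cleaning can be checked on its own. The sketch below covers only that step and leaves out the later cpca lookup; the input strings are illustrative examples, and 'New York' is an assumed stand-in for a foreign city.

```python
import re

# Matches any character outside the range \u4e00-\u9fa5
NON_CHINESE = re.compile('[^\u4e00-\u9fa5]')

def city_process(line):
    """Split on non-Chinese characters and keep the last piece."""
    return NON_CHINESE.split(line)[-1]

print(city_process('讷河, 齐齐哈尔'))   # 齐齐哈尔
print(city_process('江苏南京'))         # 江苏南京 (the city part is extracted later by cpca)
print(repr(city_process('New York')))   # '' (empty, so the row is filtered out afterwards)
```

Because only the last piece is kept, a value that ends in non-Chinese text also comes back empty, which is how foreign cities drop out of the map.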

3. Rating trend

# Drop unrated rows ('--' became NaN in the score mapping) and keep the date, score and score1 columns
df2 = df[df['score'].notna()][['date', 'score', 'score1']]
# Build one time series per rating
df_5 = df2[df2['score'] == 5][['date', 'score1']]
df_4 = df2[df2['score'] == 4][['date', 'score1']]
df_3 = df2[df2['score'] == 3][['date', 'score1']]
df_2 = df2[df2['score'] == 2][['date', 'score1']]
df_1 = df2[df2['score'] == 1][['date', 'score1']]

# Count the ratings per day
df_5 = df_5.groupby(['date']).count()
df_4 = df_4.groupby(['date']).count()
df_3 = df_3.groupby(['date']).count()
df_2 = df_2.groupby(['date']).count()
df_1 = df_1.groupby(['date']).count()

line = Line('评分趋势')  # chart title: rating trend
line.add('力荐', df_5.index.tolist(), df_5.score1.tolist())
line.add('推荐', df_4.index.tolist(), df_4.score1.tolist())
line.add('还行', df_3.index.tolist(), df_3.score1.tolist())
line.add('较差', df_2.index.tolist(), df_2.score1.tolist())
line.add('很差', df_1.index.tolist(), df_1.score1.tolist())
line  # display inline

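Because only the popular reviews were scraped, earlier reviews tend to have more likes, so this chart mostly reflects when the popular reviews were posted. A real rating trend would need the full set of reviews, either randomly sampled so that every day contributes the same number of reviews, or resampled by day to compare the daily average rating.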
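
To make the two alternatives concrete, here is a rough sketch on made-up data. The `toy` frame and its values are hypothetical and only mimic the shape of the processed data (a date string plus a numeric score); `groupby(...).sample` assumes pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical toy data in the same shape as the processed frame
toy = pd.DataFrame({
    'date': ['2019-02-05', '2019-02-05', '2019-02-06', '2019-02-06', '2019-02-06'],
    'score': [5, 3, 4, 4, 1],
})

# Option 1: average rating and number of reviews per day
daily = toy.dropna(subset=['score']).groupby('date')['score'].agg(['mean', 'count'])
print(daily)

# Option 2: draw the same number of reviews from every day before comparing
balanced = toy.groupby('date').sample(n=2, random_state=0)
print(balanced.groupby('date')['score'].mean())
```

With the real data, `n` would have to be no larger than the number of reviews on the quietest day, or those days would need to be excluded first.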

4. Share of each rating

# Count each rating label, leaving out unrated reviews ('--')
score_counts = df[df['score1'] != '--']['score1'].value_counts()
attr = score_counts.index.tolist()
value = score_counts[attr].tolist()

pie = Pie('各评分占比')  # chart title: share of each rating
pie.add('', attr, value, radius=[30, 75], rosetype='radius', is_legend_show=False, is_label_show=True)
pie  # display inline

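Most reviews gave 3 or 4 stars, and a fair number gave 5. Overall the reception is quite good, and only a small share of reviewers gave 1 star.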
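
If exact percentages are wanted alongside the pie chart, `value_counts(normalize=True)` gives them directly. The sketch below uses a small made-up list of rating labels, not the real data.

```python
import pandas as pd

# Hypothetical rating labels; '--' marks an unrated review
scores = pd.Series(['推荐', '还行', '推荐', '力荐', '还行', '推荐', '很差', '--'])

# Share of each rating among rated reviews, in percent
rated = scores[scores != '--']
shares = rated.value_counts(normalize=True).mul(100).round(1)
print(shares)
```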

5. Word cloud

comment_cut = ''
comments = df['comment'].tolist()
for comment in comments:
    comment = jieba.cut(comment)
    comment = ' '.join(comment)
    comment_cut += comment + ' '  # the trailing space keeps words from adjacent comments apart

# Add stopwords
stopwords = STOPWORDS.copy()
stopwords.update([
    '流浪', '地球', '这种', '完全', '最后', '但是', '这个', '还是',
    '有点', '电影', '希望', '没有', '就是', '什么', '觉得', '其实',
    '不是', '真的', '感觉', '因为', '这么', '很多', '已经', '一个',
    '这样', '一部', '非常', '那么', '作为', '个人', '基本', '只能',
    '真是', '应该', '不能', '尤其', '可能', '确实', '只是', '一点'
])  # many more words could be excluded; this list is not exhaustive

# Arguments: canvas size, background color, font, stopwords, maximum font size, random seed
wc = WordCloud(width=1024, height=768, background_color='white',
               font_path='Users/wangyutian/Library/Fonts/simhei.ttf',  # a font with Chinese glyphs; adjust to your machine
               stopwords=stopwords, max_font_size=400, random_state=50)
wc.generate_from_text(comment_cut)
plt.figure(figsize=(16, 8))
plt.imshow(wc)
plt.axis('off')  # hide the axes
plt.show()
# Save the result locally
# wc.to_file('wordcloud.png')

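Judging from the word cloud, reviewers rate this Chinese science fiction film fairly well and are generally supportive. I have seen the film too: the special effects are good, but the plot somewhat underestimates the audience's intelligence, and I found the speech near the end quite awkward (⊙﹏⊙)b. Overall it is still a decent film, and as a Chinese science fiction production it deserves encouragement.

6. Tableau dashboard

Finally, I built a dashboard in Tableau containing the four charts above. Some data processing is needed before plotting: the pie chart, line chart, and map share one data source, data.csv, and the word cloud uses another, count.csv.

Data processing code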

import pandas as pd
import jieba
import re
import warnings
from chinese_province_city_area_mapper.transformer import CPCATransformer

# Read the data
df1 = pd.read_csv('comments.csv', names=['name', 'score', 'comment', 'date', 'href'])
df2 = pd.read_csv('cities.csv', names=['city'])
df = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')  # merge on the index
df.drop('href', axis=1, inplace=True)  # drop the href column
df.drop_duplicates(subset=None, keep='first', inplace=True)  # deduplicate (there are no duplicates here)
df.dropna(axis=0, inplace=True)  # drop missing values (there are none here); inplace so the result is kept

# Remove the <span> tags from comment and keep only Chinese characters
def comment_process(comment):
    # The regex drops every non-Chinese character, which also removes the tags
    comment = comment.replace('\n', '').replace('\r', '')
    p = re.compile('[^\u4e00-\u9fa5]')  # Chinese characters lie in the range \u4e00 to \u9fa5
    comment = re.sub(p, '', comment)
    return comment

df['comment'] = df['comment'].apply(comment_process)  # apply is faster than an explicit loop

# Convert the rating labels to numbers
df['score1'] = df['score']  # keep the original label
df['score'] = df['score'].map({'力荐': 5, '推荐': 4, '还行': 3, '较差': 2, '很差': 1})
# Parse the date column and normalize it to 'YYYY-MM-DD' strings
df['date'] = pd.to_datetime(df['date']).dt.strftime('%Y-%m-%d')

# Clean the city data: '讷河, 齐齐哈尔' becomes 齐齐哈尔, '江苏南京' becomes 南京, and foreign cities are removed
def city_process(line):
    city = re.compile('[^\u4e00-\u9fa5]')  # Chinese characters lie in the range \u4e00 to \u9fa5
    # Split on non-Chinese characters, which returns a list
    zh = re.split(city, line)
    # Take the last item, e.g. 齐齐哈尔 from '讷河, 齐齐哈尔'
    zh = zh[-1]
    return zh

df['city'] = df['city'].apply(city_process)
# Extract the city (市) from each value
cpca = CPCATransformer()
df['city'] = cpca.transform(df.city)['市']
# df1 = df[df['city'] != '']  # drop rows with an empty city
df.replace('北京市', '北京', inplace=True)
df.replace('上海市', '上海', inplace=True)
df.to_csv('data.csv', index=False, encoding='utf-8-sig')

# Count how often each word occurs
words = []
for comment in df['comment'].tolist():
    words.extend(jieba.cut(comment))  # segment each comment separately
count = pd.Series(words).value_counts()
df_count = pd.DataFrame({'value': count.index.tolist(), 'count': count.tolist()})
df_count['len'] = df_count['value'].apply(len)
df_count = df_count[df_count['len'] > 1]  # keep words longer than one character
df_count = df_count.iloc[:30]  # keep the 30 most frequent words
df_count.to_csv('count.csv', encoding='utf-8-sig')
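
For readers who do not need a DataFrame, the same word-frequency idea can be sketched with the standard library alone. The token list below is hypothetical and stands in for the output of jieba.cut; this is an illustration of the counting step, not the code used to produce count.csv.

```python
from collections import Counter

# Hypothetical pre-segmented tokens standing in for the output of jieba.cut
tokens = ['特效', '不错', '的', '特效', '科幻', '不错', '特效', '了']

# Keep words longer than one character, then take the most frequent ones
counter = Counter(w for w in tokens if len(w) > 1)
top_words = counter.most_common(30)
print(top_words)  # [('特效', 3), ('不错', 2), ('科幻', 1)]
```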

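The dashboard (I only learned the basics and put this together quickly, so it is a bit ugly; I will polish it once I have time to study Tableau properly)

The full code is here: https://github.com/yourSprite/AnalysisExcercise/tree/master/流浪地球数据分析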