Python Data Analysis Case Study 2: Analyzing a Movie Ratings Dataset

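These are my personal study notes for a MOOC from Nanjing University of Finance and Economics. The course page is https://www.icourse163.org/course/NJUE-1458311167. The course is free, and the instructor teaches well and carefully, so I recommend it.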

Getting the dataset:
1. The official GroupLens website

or

2. Direct link: https://files.grouplens.org/datasets/movielens/ml-100k.zip

Three files from this archive are used: u.data, u.item, and u.user.
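Once the archive has been downloaded, the three files can be pulled out of it with the standard library. The following is a minimal sketch: extract_files is a hypothetical helper name, and the archive is assumed to be saved locally as ml-100k.zip. It matches members by base file name, so it does not depend on how the folders inside the archive are laid out.

```python
import zipfile
from pathlib import Path

# The three files used in this case study.
WANTED = ("u.data", "u.item", "u.user")


def extract_files(zip_path, out_dir=".", wanted=WANTED):
    """Copy the wanted archive members into out_dir, dropping any folder prefix."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            name = Path(member).name  # base name without the folder part
            if name in wanted:
                target = out_dir / name
                target.write_bytes(archive.read(member))
                extracted.append(target)
    return extracted


if __name__ == "__main__":
    # Assumption: the archive was saved next to this script as ml-100k.zip.
    if Path("ml-100k.zip").exists():
        for path in extract_files("ml-100k.zip"):
            print(path)
```

After this runs, u.data, u.item, and u.user sit in the working directory, which is where the read calls later in these notes look for them.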

A description of the three files, from the dataset's README:

u.data     -- The full u data set, 100000 ratings by 943 users on 1682 items.
              Each user has rated at least 20 movies. Users and items are
              numbered consecutively from 1. The data is randomly
              ordered. This is a tab separated list of
              user id | item id | rating | timestamp.
              The time stamps are unix seconds since 1/1/1970 UTC

u.item     -- Information about the items (movies); this is a tab separated
              list of
              movie id | movie title | release date | video release date |
              IMDb URL | unknown | Action | Adventure | Animation |
              Children's | Comedy | Crime | Documentary | Drama | Fantasy |
              Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
              Thriller | War | Western |
              The last 19 fields are the genres, a 1 indicates the movie
              is of that genre, a 0 indicates it is not; movies can be in
              several genres at once.
              The movie ids are the ones used in the u.data data set.

u.user     -- Demographic information about the users; this is a tab
              separated list of
              user id | age | gender | occupation | zip code
              The user ids are the ones used in the u.data data set.

Note: although the README calls u.item and u.user tab separated, the fields in those two files are actually separated by |. Only u.data is tab separated.

Reading the user data
From the description above, the user file describes the users and has the following format:

user id | age | gender | occupation | zip code

Read it with the read_table() function provided by pandas:

import pandas as pd

# Column names
unames = ['user id', 'age', 'gender', 'occupation', 'zip code']
# header=None means the file has no header row; names supplies the column names defined above
user = pd.read_table('u.user', sep='|', header=None, names=unames)
# Preview the data
print(user.head(5))
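One detail worth checking when reading a file like this is how pandas infers the type of the zip code column. The sketch below uses two made-up rows in the u.user layout (they are illustrative, not quoted from the dataset) and read_csv on an in-memory string; with sep passed explicitly, read_csv and read_table parse the text the same way. It shows that an all-digit column is inferred as integers, which drops leading zeros, and that passing dtype keeps the values as text.

```python
import io

import pandas as pd

# Hypothetical rows in the u.user layout, made up for illustration.
sample = "1|24|M|technician|85711\n2|53|F|other|05201\n"
unames = ['user id', 'age', 'gender', 'occupation', 'zip code']

# Default type inference: an all-digit column becomes an integer column.
inferred = pd.read_csv(io.StringIO(sample), sep='|', header=None, names=unames)

# Forcing str keeps the zip codes exactly as written.
as_text = pd.read_csv(io.StringIO(sample), sep='|', header=None, names=unames,
                      dtype={'zip code': str})

print(inferred['zip code'].tolist())
print(as_text['zip code'].tolist())
```

Whether this matters for the real file depends on its contents; passing dtype={'zip code': str} is simply a way to avoid relying on inference.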

Reading the data file
The data file contains 100,000 ratings. Fields are separated by tabs, in this format:

# "tab" stands for a tab character
user id tab item id tab rating tab timestamp

Likewise, read it with read_table(), taking care to set sep='\t':

rnames = ['user id', 'item id', 'rating', 'timestamp']
data = pd.read_table('u.data', sep='\t', header=None, names=rnames)
print(data.head(5))
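The timestamp column holds Unix seconds, which are hard to read as raw integers. A minimal sketch of converting them with pd.to_datetime follows; the rows are made up for illustration (the timestamps 0 and 86400 are chosen so the result is easy to check), and the column name 'rated at' is my own choice rather than something from the dataset.

```python
import pandas as pd

# Hypothetical rows in the u.data layout, made up for illustration.
data = pd.DataFrame({
    'user id': [1, 2],
    'item id': [10, 20],
    'rating': [3, 5],
    'timestamp': [0, 86400],
})

# unit='s' interprets the integers as seconds since 1970-01-01; utc=True marks them as UTC.
data['rated at'] = pd.to_datetime(data['timestamp'], unit='s', utc=True)

print(data[['timestamp', 'rated at']])
```

The same line can be applied to the real data frame after it has been read.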

Reading the item data
The item file describes the movies. Its format is:

# movie ID | movie title | release date | video release date | IMDb URL | (the 19 fields that follow are 0/1 genre flags)
movie id | movie title | release date | video release date |IMDb URL | unknown | Action | Adventure | Animation |Children's | Comedy | Crime | Documentary | Drama | Fantasy |Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |Thriller | War | Western |

In this file we mainly care about two columns: movie ID and movie title.
Reading it the same way as above fails with:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3: invalid continuation byte
The item file is not UTF-8 encoded. It contains accented characters (byte 0xe9 is é in ISO-8859-1), so decoding it with pandas' default UTF-8 fails. Pass the encoding parameter:

mnames = ['movie id', 'movie title', 'release date', 'video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure','Animation', "Children's", 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror','Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
item = pd.read_table('u.item', sep='|', header=None, names=mnames, encoding="ISO-8859-1")
print(item.head(5))
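The decoding error can be reproduced without pandas. The sketch below uses a made-up byte string (not taken from the dataset) in which é is stored as the single byte 0xe9, as ISO-8859-1 does. Decoding it as UTF-8 fails, while decoding it as ISO-8859-1 succeeds.

```python
# Hypothetical title bytes, made up for illustration: "é" stored as the single byte 0xe9.
raw = b"Caf\xe9 (1995)"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as err:
    utf8_ok = False
    print("utf-8 failed:", err)

# ISO-8859-1 maps every byte to exactly one character, so this decode cannot fail.
title = raw.decode("ISO-8859-1")
print(title)
```

Because ISO-8859-1 accepts any byte sequence, a successful decode does not by itself prove the encoding is right; it is still worth looking at a few titles with accented characters to confirm they read correctly.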

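Note: add the following code so that pandas does not truncate the printed output

# Show all columns
pd.set_option('display.max_columns', None)
# Show all rows
pd.set_option('display.max_rows', None)
# Set the display width to 1000 characters
pd.set_option('display.width', 1000)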
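These settings are global. With display.max_rows set to None, printing a large frame such as the full 100,000-row ratings table prints every row. If that is not wanted, one alternative is pd.option_context, which applies the options only inside a with block. This is a sketch with a made-up frame, not something the course itself uses.

```python
import pandas as pd

frame = pd.DataFrame({'n': range(200)})  # hypothetical frame for illustration

before = pd.get_option('display.max_columns')

# The options apply only inside the with block and are restored afterwards.
with pd.option_context('display.max_columns', None, 'display.width', 1000):
    inside = pd.get_option('display.max_columns')
    print(frame.head(5))

after = pd.get_option('display.max_columns')
```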

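Joining the data
All three files are now loaded. To analyze them together, a good approach is to join them into a single DataFrame.

# First merge user and data. user id is the only column they share, so merge() joins on it by default.
df = pd.merge(user, data)
# Then merge the result with item. The movie id in item corresponds to the item id in data; because the names differ, pass left_on and right_on explicitly.
df = pd.merge(df, item, left_on='item id', right_on='movie id')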
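Relying on merge() to pick the shared column works here, but it can be made more explicit. The sketch below uses miniature made-up tables with the same column names as the real files. It names the key with on= and adds validate='many_to_one', which makes pandas raise an error if a key is duplicated in the right-hand table. Note that the ratings table is placed on the left here so that the "one" side is on the right; this differs from the argument order used above.

```python
import pandas as pd

# Hypothetical miniature tables with the same column names as the real files.
user = pd.DataFrame({'user id': [1, 2], 'occupation': ['student', 'engineer']})
data = pd.DataFrame({'user id': [1, 1, 2], 'item id': [10, 20, 10], 'rating': [4, 5, 3]})
item = pd.DataFrame({'movie id': [10, 20], 'movie title': ['Movie A', 'Movie B']})

# on= names the key explicitly; validate= checks that each key appears once on the right.
df = pd.merge(data, user, on='user id', how='inner', validate='many_to_one')
df = pd.merge(df, item, left_on='item id', right_on='movie id', validate='many_to_one')

print(df.shape)
print(df)
```

A quick sanity check after joining is that the number of rows still equals the number of ratings, since every rating should match exactly one user and one movie.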

The data is now loaded and joined. Next comes the analysis.

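Average rating by occupation

# Group the rating column by occupation and aggregate with the mean.
# Store the result under a new name so that df is not overwritten; the later steps still need it.
occupation_mean = df['rating'].groupby(by=df['occupation']).agg(['mean'])
# Sort by mean rating, highest first
print(occupation_mean.sort_values(by='mean', ascending=False))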

Average rating by occupation and gender

# Same as above, but grouping by two keys
print(df['rating'].groupby(by=[df['occupation'], df['gender']]).agg(['mean']))
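The two-key groupby returns one long column indexed by (occupation, gender). For side-by-side comparison, the same means can be laid out as a table with one row per occupation and one column per gender. The sketch below does this with pivot_table on made-up ratings; it is an alternative presentation, not the method used in the course.

```python
import pandas as pd

# Hypothetical ratings, made up for illustration.
df = pd.DataFrame({
    'occupation': ['student', 'student', 'student', 'engineer', 'engineer', 'engineer'],
    'gender': ['F', 'F', 'M', 'F', 'M', 'M'],
    'rating': [4, 5, 2, 5, 3, 4],
})

# Rows are occupations, columns are genders, cells are mean ratings.
table = df.pivot_table(values='rating', index='occupation', columns='gender', aggfunc='mean')

print(table)
```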

Ranking movies by rating

# mean() returns a Series, so sort_values() needs no by argument
print(df['rating'].groupby(by=df['movie title']).mean().sort_values(ascending=False))

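Refining the movie ranking
The top of the list contains movies with an average rating of 5. That most likely means very few people rated them, or that the data is wrong, so we also need to count the number of ratings.
To apply mean and count together, use the following approach:

# Pass several functions to agg; the result is a DataFrame, so sort_values needs the by parameter
print(df['rating'].groupby(by=df['movie title']).agg(['mean', 'count']).sort_values(by='mean', ascending=False))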

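As expected, some movies have too few ratings. We therefore require more than 50 ratings for a movie to be included in the ranking.

# Save the DataFrame with the mean and count as df1
df1 = df['rating'].groupby(by=df['movie title']).agg(['mean', 'count'])
# Boolean mask: movies with more than 50 ratings
target = df1['count'] > 50
print(df1[target].sort_values(by='mean', ascending=False))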
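The threshold of 50 is a judgment call, so it can be convenient to wrap the steps above in a function and try different cutoffs. The sketch below is one way to do that; top_movies, min_count, and n are hypothetical names of my own, and the ratings are made up for illustration.

```python
import pandas as pd


def top_movies(df, min_count=50, n=10):
    """Rank movies by mean rating, keeping only those with more than min_count ratings."""
    stats = df.groupby('movie title')['rating'].agg(mean='mean', count='count')
    kept = stats[stats['count'] > min_count]
    return kept.sort_values(by='mean', ascending=False).head(n)


# Hypothetical ratings: Movie B has a perfect average but only one rating.
toy = pd.DataFrame({
    'movie title': ['Movie A', 'Movie A', 'Movie A', 'Movie B'],
    'rating': [4, 4, 5, 5],
})

print(top_movies(toy, min_count=2))
```

With min_count=2, Movie B is dropped despite its average of 5, which is the effect the threshold is meant to have on the real data.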
