This data-mining project tackles a movie recommendation problem: the goal is to find items that occur together, that is, to discover sets of movies that users tend to like at the same time.

We use the most basic form of the Apriori algorithm.
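Before diving in, a word on why Apriori is tractable at all: it hinges on the downward-closure property, namely that every subset of a frequent itemset must itself be frequent, so candidate k-itemsets only ever need to be built from frequent (k-1)-itemsets, which prunes the search space drastically. Here is a minimal, self-contained sketch of that idea; the tiny transactions list and min_support value are invented purely for illustration.

from itertools import combinations

# Four toy "baskets" of movie IDs, made up just for this sketch
transactions = [frozenset({1, 2, 3}), frozenset({1, 2}),
                frozenset({2, 3}), frozenset({1, 3})]
min_support = 2

def support(itemset):
    # Number of transactions that contain every item of the itemset
    return sum(1 for t in transactions if itemset <= t)

# Frequent 1-itemsets
items = {i for t in transactions for i in t}
frequent_1 = [frozenset({i}) for i in items if support(frozenset({i})) >= min_support]

# Downward closure: candidate pairs are built only from frequent singletons;
# if {1} were infrequent, no pair containing 1 could possibly be frequent
candidates_2 = {a | b for a, b in combinations(frequent_1, 2)}
frequent_2 = [c for c in candidates_2 if support(c) >= min_support]
print(frequent_2)  # in this toy data, all three pairs turn out to be frequent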

import sys
from collections import defaultdict
from operator import itemgetter

import pandas as pd

I. Loading and inspecting the data

# The file's extension is just .data; don't tack .csv onto the name, or the read will fail with an error
all_ratings = pd.read_csv("u.data", delimiter='\t', header=None, names=["UserID","MovieID","Rating","Datetime"])
# Let's take a peek at the tip of the iceberg
all_ratings.head()
UserID MovieID Rating Datetime
0 196 242 3 881250949
1 186 302 3 891717742
2 22 377 1 878887116
3 244 51 2 880606923
4 166 346 1 886397596
# I'm curious what data type each column holds, and whether any records are missing
all_ratings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   UserID    100000 non-null  int64
 1   MovieID   100000 non-null  int64
 2   Rating    100000 non-null  int64
 3   Datetime  100000 non-null  int64
dtypes: int64(4)
memory usage: 3.1 MB

Every column is an int type with no missing values. Great: preprocessing won't need any missing-value handling.

# Now check the size of the data. You could read it off info() too, but this is more direct: one glance and you have it
all_ratings.shape
(100000, 4)
# Deduplicate, just in case there are repeated rows
print("Size before dedup: {0}".format(all_ratings.shape))
all_ratings.drop_duplicates(keep="first", inplace=True)
print("Size after dedup: {0}".format(all_ratings.shape))
Size before dedup: (100000, 4)
Size after dedup: (100000, 4)

Not a single duplicate.

# Let's see which users love movies enough to rate them (I never bother rating movies myself)
all_ratings["UserID"].value_counts()
405    737
655    685
13     636
450    540
276    518
...
147     20
19      20
572     20
636     20
895     20
Name: UserID, Length: 943, dtype: int64

There are 943 users in total. User #405 somehow rated 737 movies, and even the least active users rated 20.

# Now let's see which movies were rated
all_ratings["MovieID"].value_counts()
50      583
258     509
100     508
181     507
294    485
...
1648      1
1571      1
1329      1
1457      1
1663      1
Name: MovieID, Length: 1682, dtype: int64

A total of 1682 movies were watched and rated. Movie #50 was rated most often, 583 times, while some movies had it rough and were rated only once, like movie #1663.

# Check which rating levels there are
all_ratings["Rating"].unique().tolist()
[3, 1, 2, 4, 5]
# Parse the timestamps; a date is far easier to read than a string of digits
all_ratings["Datetime"] = pd.to_datetime(all_ratings["Datetime"], unit='s')
all_ratings.head()
UserID MovieID Rating Datetime
0 196 242 3 1997-12-04 15:55:49
1 186 302 3 1998-04-04 19:22:22
2 22 377 1 1997-11-07 07:18:36
3 244 51 2 1997-11-27 05:02:03
4 166 346 1 1998-02-02 05:33:16

Looks like this data is pretty ancient.

II. Implementing the Apriori algorithm

# First decide whether a user actually likes a movie: a rating above 3 counts as liking it
all_ratings['Favorable'] = all_ratings['Rating'] > 3
# Take a slice of the dataset as the training set; this shrinks the search space and speeds the algorithm up
# I tried training on the first 200 UserIDs and my machine simply froze; start below 50 and raise the number only if it runs smoothly
ratings = all_ratings[all_ratings['UserID'].isin(range(20))]
# Keep only the rows where the user liked the movie
favorable_ratings = ratings[ratings['Favorable']]
# To know which movies each user likes, store v.values as a frozenset for fast membership tests (see the small sketch after this block)
favorable_reviews_by_users = dict((k, frozenset(v.values)) for k, v in favorable_ratings.groupby('UserID')['MovieID'])
# How many users like each movie
num_favorable_by_movie = ratings[['MovieID', 'Favorable']].groupby('MovieID').sum()
num_favorable_by_movie.sort_values(by='Favorable', axis=0, ascending=False)[:5]
Favorable
MovieID
50 14.0
100 12.0
174 11.0
127 10.0
56 10.0
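A quick aside on the frozenset choice above: frozensets are immutable and hashable, so they can later serve as dictionary keys for itemsets, and they support cheap subset tests. A tiny illustration, with IDs picked from the table above purely for demonstration:

# frozenset is hashable (usable as a dict key) and supports fast subset tests
user_likes = frozenset({50, 100, 174, 127})
print(frozenset({50, 100}).issubset(user_likes))  # True: both 50 and 100 are liked
print(frozenset({50, 56}) <= user_likes)          # False: 56 is not among the favorites
print({frozenset({50, 100}): 18})                 # itemsets can key a dict, as used below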

1. Implementation

frequent_itemsets = {}
# Minimum support
min_support = 10
# Treat each movie on its own as a one-item itemset and check whether it is frequent
frequent_itemsets[1] = dict((frozenset((MovieID,)), row['Favorable'])
                            for MovieID, row in num_favorable_by_movie.iterrows()
                            if row['Favorable'] > min_support)
frequent_itemsets[1]
{frozenset({50}): 14.0, frozenset({100}): 12.0, frozenset({174}): 11.0}
# Take the (k-1)-item frequent itemsets, build supersets from them, test how frequent
# those are, and return the k-item frequent itemsets
def find_frequent_itemsets(favorable_reviews_by_users, k_1_itemsets, min_support):
    counts = defaultdict(int)
    for user, reviews in favorable_reviews_by_users.items():
        for itemset in k_1_itemsets:
            if itemset.issubset(reviews):
                # Walk over the movies this user liked that are not yet in the itemset;
                # use each one to build a superset and bump that superset's count
                for other_reviewed_movie in reviews - itemset:
                    current_superset = itemset | frozenset((other_reviewed_movie,))
                    counts[current_superset] += 1
    return dict([(itemset, frequency) for itemset, frequency in counts.items()
                 if frequency >= min_support])
find_frequent_itemsets(favorable_reviews_by_users, frequent_itemsets[1], min_support)
{frozenset({50, 100}): 18, frozenset({50, 174}): 20, frozenset({100, 174}): 18}
# Store each new batch of frequent itemsets as the run discovers them
for k in range(2, 5):
    cur_frequent_itemsets = find_frequent_itemsets(favorable_reviews_by_users,
                                                   frequent_itemsets[k-1], min_support)
    frequent_itemsets[k] = cur_frequent_itemsets
    if len(cur_frequent_itemsets) == 0:
        print("did not find any frequent itemsets of length {0}".format(k))
        # Flush stdout so we can see the output while the code is still running;
        # don't overuse it, as flushing (and printing in general) slows the program down
        sys.stdout.flush()
        break
    else:
        print("i found {0} frequent itemsets of length {1}".format(len(cur_frequent_itemsets), k))
        sys.stdout.flush()
i found 3 frequent itemsets of length 2
i found 1 frequent itemsets of length 3
did not find any frequent itemsets of length 4
# One-movie itemsets cannot form rules with both a premise and a conclusion, so drop them
del frequent_itemsets[1]
frequent_itemsets
{2: {frozenset({50, 100}): 18,
     frozenset({50, 174}): 20,
     frozenset({100, 174}): 18},
 3: {frozenset({50, 100, 174}): 24},
 4: {}}

2. Extracting association rules

A frequent itemset is a set of items that reaches the minimum support. The association rules extracted from it take the form: if a user likes every movie in the premise, they will also like the movie in the conclusion.
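Each candidate rule will be scored by its confidence: among the users whose favorable movies include the whole premise, the fraction who also like the conclusion. A minimal sketch of that computation; the toy likes_by_user dict is invented for illustration, while the real loops over favorable_reviews_by_users follow below.

# Confidence of a rule (premise -> conclusion):
#   confidence = (#users liking premise and conclusion) / (#users liking premise)
likes_by_user = {1: frozenset({50, 100}),  # toy data, invented for this sketch
                 2: frozenset({50}),
                 3: frozenset({50, 100, 174})}
premise, conclusion = frozenset({50}), 100
applicable = sum(1 for likes in likes_by_user.values() if premise <= likes)
holds = sum(1 for likes in likes_by_user.values()
            if premise <= likes and conclusion in likes)
print(holds / applicable)  # 0.666...: two of the three users who like #50 also like #100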

candidate_rules = []
# Walk over the frequent itemsets of every length and generate rules from each one
for itemset_length, itemset_counts in frequent_itemsets.items():
    # Walk over each itemset; every movie in it takes a turn as the conclusion
    for itemset in itemset_counts.keys():
        for conclusion in itemset:
            premise = itemset - set((conclusion,))
            candidate_rules.append((premise, conclusion))
candidate_rules
[(frozenset({100}), 50),
 (frozenset({50}), 100),
 (frozenset({174}), 50),
 (frozenset({50}), 174),
 (frozenset({174}), 100),
 (frozenset({100}), 174),
 (frozenset({100, 174}), 50),
 (frozenset({50, 174}), 100),
 (frozenset({50, 100}), 174)]
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
# Walk over the users and their favorable ratings, tallying each rule: whenever the premise holds, check whether the user also likes the conclusion movie
for user, reviews in favorable_reviews_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1
correct_counts
defaultdict(int,
            {(frozenset({100}), 50): 9,
             (frozenset({50}), 100): 9,
             (frozenset({174}), 50): 10,
             (frozenset({50}), 174): 10,
             (frozenset({174}), 100): 9,
             (frozenset({100}), 174): 9,
             (frozenset({100, 174}), 50): 8,
             (frozenset({50, 174}), 100): 8,
             (frozenset({50, 100}), 174): 8})
incorrect_counts
defaultdict(int,
            {(frozenset({50}), 174): 4,
             (frozenset({100}), 174): 3,
             (frozenset({50, 100}), 174): 1,
             (frozenset({50}), 100): 5,
             (frozenset({174}), 100): 2,
             (frozenset({50, 174}), 100): 2,
             (frozenset({100}), 50): 3,
             (frozenset({174}), 50): 1,
             (frozenset({100, 174}), 50): 1})
# Compute each rule's confidence
rule_confidence = {candidate_rule: correct_counts[candidate_rule] /
                   (correct_counts[candidate_rule] + incorrect_counts[candidate_rule])
                   for candidate_rule in candidate_rules}
rule_confidence
{(frozenset({100}), 50): 0.75,
 (frozenset({50}), 100): 0.6428571428571429,
 (frozenset({174}), 50): 0.9090909090909091,
 (frozenset({50}), 174): 0.7142857142857143,
 (frozenset({174}), 100): 0.8181818181818182,
 (frozenset({100}), 174): 0.75,
 (frozenset({100, 174}), 50): 0.8888888888888888,
 (frozenset({50, 174}), 100): 0.8,
 (frozenset({50, 100}), 174): 0.8888888888888888}
# Sort the rules by confidence, descending
sorted_confidence = sorted(rule_confidence.items(), key=itemgetter(1), reverse=True)
sorted_confidence
[((frozenset({174}), 50), 0.9090909090909091),
 ((frozenset({100, 174}), 50), 0.8888888888888888),
 ((frozenset({50, 100}), 174), 0.8888888888888888),
 ((frozenset({174}), 100), 0.8181818181818182),
 ((frozenset({50, 174}), 100), 0.8),
 ((frozenset({100}), 50), 0.75),
 ((frozenset({100}), 174), 0.75),
 ((frozenset({50}), 174), 0.7142857142857143),
 ((frozenset({50}), 100), 0.6428571428571429)]
# Print the five rules with the highest confidence
for index in range(5):
    print('Rule #{0}'.format(index + 1))
    premise, conclusion = sorted_confidence[index][0]
    print('Rule: If a person recommends: {0}, they will also recommend: {1}'.format(premise, conclusion))
    print('- Confidence: {0:.3f}'.format(rule_confidence[(premise, conclusion)]))
    print("")
Rule #1
Rule: If a person recommends: frozenset({174}), they will also recommend: 50
- Confidence: 0.909

Rule #2
Rule: If a person recommends: frozenset({100, 174}), they will also recommend: 50
- Confidence: 0.889

Rule #3
Rule: If a person recommends: frozenset({50, 100}), they will also recommend: 174
- Confidence: 0.889

Rule #4
Rule: If a person recommends: frozenset({174}), they will also recommend: 100
- Confidence: 0.818

Rule #5
Rule: If a person recommends: frozenset({50, 174}), they will also recommend: 100
- Confidence: 0.800

The output shows only movie IDs, not titles, which is not very reader-friendly. Next, let's map each movie ID to its title.

movie_name_data = pd.read_csv("u.item", delimiter='|', header=None, encoding='mac_roman')
movie_name_data = movie_name_data.iloc[:, :2]
movie_name_data.columns = ['MovieID','Title']
def get_movie_name(movie_id):
    title_object = movie_name_data[movie_name_data['MovieID'] == movie_id]['Title']
    title = title_object.values[0]
    return title
# Print the top-5 rules with the highest confidence, this time with movie titles
for index in range(5):
    print('Rule #{0}'.format(index + 1))
    premise, conclusion = sorted_confidence[index][0]
    premise_name = ', '.join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print('Rule: If a person recommends: {0}, \nthey will also recommend: {1}'.format(premise_name, conclusion_name))
    print('- Confidence: {0:.3f}'.format(rule_confidence[(premise, conclusion)]))
    print("")
Rule #1
Rule: If a person recommends: Raiders of the Lost Ark (1981),
they will also recommend: Star Wars (1977)
- Confidence: 0.909

Rule #2
Rule: If a person recommends: Fargo (1996), Raiders of the Lost Ark (1981),
they will also recommend: Star Wars (1977)
- Confidence: 0.889

Rule #3
Rule: If a person recommends: Star Wars (1977), Fargo (1996),
they will also recommend: Raiders of the Lost Ark (1981)
- Confidence: 0.889

Rule #4
Rule: If a person recommends: Raiders of the Lost Ark (1981),
they will also recommend: Fargo (1996)
- Confidence: 0.818

Rule #5
Rule: If a person recommends: Star Wars (1977), Raiders of the Lost Ark (1981),
they will also recommend: Fargo (1996)
- Confidence: 0.800

III. Evaluation

# Use the ratings from users #100 through #109 as the test set
test_dataset = all_ratings[all_ratings['UserID'].isin(range(100,110,1))]
test_favorable = test_dataset[test_dataset['Favorable']]
test_favorable_by_users = dict((k, frozenset(v.values)) for k, v in test_favorable.groupby('UserID')['MovieID'])
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
# Count how often each rule holds on the test set
for user, reviews in test_favorable_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1
# Compute confidence on the test set
test_confidence = {candidate_rule: correct_counts[candidate_rule] /
                   (correct_counts[candidate_rule] + incorrect_counts[candidate_rule])
                   for candidate_rule in candidate_rules}
# Print the top-5 rules again, with both training and test confidence
for index in range(5):
    print('Rule #{0}'.format(index + 1))
    premise, conclusion = sorted_confidence[index][0]
    premise_name = ', '.join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print('Rule: If a person recommends: {0}, \nthey will also recommend: {1}'.format(premise_name, conclusion_name))
    print('- Train Confidence: {0:.3f}'.format(rule_confidence[(premise, conclusion)]))
    print('- Test Confidence: {0:.3f}'.format(test_confidence[(premise, conclusion)]))
    print("")
Rule #1
Rule: If a person recommends: Raiders of the Lost Ark (1981),
they will also recommend: Star Wars (1977)
- Train Confidence: 0.909
- Test Confidence: 1.000

Rule #2
Rule: If a person recommends: Fargo (1996), Raiders of the Lost Ark (1981),
they will also recommend: Star Wars (1977)
- Train Confidence: 0.889
- Test Confidence: 1.000

Rule #3
Rule: If a person recommends: Star Wars (1977), Fargo (1996),
they will also recommend: Raiders of the Lost Ark (1981)
- Train Confidence: 0.889
- Test Confidence: 0.333

Rule #4
Rule: If a person recommends: Raiders of the Lost Ark (1981),
they will also recommend: Fargo (1996)
- Train Confidence: 0.818
- Test Confidence: 0.500

Rule #5
Rule: If a person recommends: Star Wars (1977), Raiders of the Lost Ark (1981),
they will also recommend: Fargo (1996)
- Train Confidence: 0.800
- Test Confidence: 0.500

IV. Summary

We extracted association rules usable for movie recommendation from the rating data. The whole process breaks down into two stages: first use the Apriori algorithm to find the frequent itemsets in the data, then generate association rules from those frequent itemsets. We used part of the data as a training set to discover the rules and then tested them on a held-out test set.
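As a possible refinement, not part of the walkthrough above: the rule-generation stage could filter by a minimum confidence instead of only sorting, so that weak rules never reach the recommender. A minimal sketch, reusing the frequent_itemsets and favorable_reviews_by_users structures built earlier; the min_confidence threshold of 0.8 is an arbitrary choice for illustration.

def extract_rules(frequent_itemsets, favorable_reviews_by_users, min_confidence=0.8):
    # Stage 2 of the pipeline as one reusable function:
    # generate candidate rules, score them, keep only the confident ones
    correct = defaultdict(int)
    total = defaultdict(int)
    candidates = [(itemset - frozenset((conclusion,)), conclusion)
                  for itemsets in frequent_itemsets.values()
                  for itemset in itemsets
                  for conclusion in itemset]
    for reviews in favorable_reviews_by_users.values():
        for premise, conclusion in candidates:
            if premise.issubset(reviews):
                total[(premise, conclusion)] += 1
                if conclusion in reviews:
                    correct[(premise, conclusion)] += 1
    return sorted(((rule, correct[rule] / total[rule])
                   for rule in total if correct[rule] / total[rule] >= min_confidence),
                  key=itemgetter(1), reverse=True)

Run against the training structures above, extract_rules(frequent_itemsets, favorable_reviews_by_users) should reproduce exactly the five rules printed in section II, since those are the only ones at or above 0.8 confidence.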
