This data-mining project tackles a movie recommendation problem: the goal is to find items that occur together, that is, to discover sets of movies that users tend to like at the same time.

We use the most basic form of the Apriori algorithm.
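Before diving in, a word on why Apriori is tractable at all: it hinges on the downward-closure property, namely that every subset of a frequent itemset must itself be frequent, so candidate k-itemsets only ever need to be built from frequent (k-1)-itemsets, which prunes the search space drastically. Here is a minimal, self-contained sketch of that idea; the tiny transactions list and min_support value are invented purely for illustration.

from itertools import combinations

# Four toy "baskets" of movie IDs, made up just for this sketch
transactions = [frozenset({1, 2, 3}), frozenset({1, 2}),
                frozenset({2, 3}), frozenset({1, 3})]
min_support = 2

def support(itemset):
    # Number of transactions that contain every item of the itemset
    return sum(1 for t in transactions if itemset <= t)

# Frequent 1-itemsets
items = {i for t in transactions for i in t}
frequent_1 = [frozenset({i}) for i in items if support(frozenset({i})) >= min_support]

# Downward closure: candidate pairs are built only from frequent singletons;
# if {1} were infrequent, no pair containing 1 could possibly be frequent
candidates_2 = {a | b for a, b in combinations(frequent_1, 2)}
frequent_2 = [c for c in candidates_2 if support(c) >= min_support]
print(frequent_2)  # in this toy data, all three pairs turn out to be frequent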

import sys
from collections import defaultdict
from operator import itemgetter

import pandas as pd

I. Loading and inspecting the data

# The file's extension is just .data; don't tack .csv onto the name, or the read will fail with an error
all_ratings = pd.read_csv("u.data", delimiter='\t', header=None, names=["UserID","MovieID","Rating","Datetime"])
# Let's take a peek at the tip of the iceberg
all_ratings.head()
UserID MovieID Rating Datetime
0 196 242 3 881250949
1 186 302 3 891717742
2 22 377 1 878887116
3 244 51 2 880606923
4 166 346 1 886397596
# I'm curious what data type each column holds, and whether any records are missing
all_ratings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   UserID    100000 non-null  int64
 1   MovieID   100000 non-null  int64
 2   Rating    100000 non-null  int64
 3   Datetime  100000 non-null  int64
dtypes: int64(4)
memory usage: 3.1 MB

Every column is an int type with no missing values. Great: preprocessing won't need any missing-value handling.

# Now check the size of the data. You could read it off info() too, but this is more direct: one glance and you have it
all_ratings.shape
(100000, 4)
# Deduplicate, just in case there are repeated rows
print("Size before dedup: {0}".format(all_ratings.shape))
all_ratings.drop_duplicates(keep="first", inplace=True)
print("Size after dedup: {0}".format(all_ratings.shape))
Size before dedup: (100000, 4)
Size after dedup: (100000, 4)

Not a single duplicate.

# Let's see which users love movies enough to rate them (I never bother rating movies myself)
all_ratings["UserID"].value_counts()
405    737
655    685
13     636
450    540
276    518
...
147     20
19      20
572     20
636     20
895     20
Name: UserID, Length: 943, dtype: int64

There are 943 users in total. User #405 somehow rated 737 movies, and even the least active users rated 20.

# Now let's see which movies were rated
all_ratings["MovieID"].value_counts()
50      583
258     509
100     508
181     507
294    485
...
1648      1
1571      1
1329      1
1457      1
1663      1
Name: MovieID, Length: 1682, dtype: int64

A total of 1682 movies were watched and rated. Movie #50 was rated most often, 583 times, while some movies had it rough and were rated only once, like movie #1663.

# Check which rating levels there are
all_ratings["Rating"].unique().tolist()
[3, 1, 2, 4, 5]
# Parse the timestamps; a date is far easier to read than a string of digits
all_ratings["Datetime"] = pd.to_datetime(all_ratings["Datetime"], unit='s')
all_ratings.head()
UserID MovieID Rating Datetime
0 196 242 3 1997-12-04 15:55:49
1 186 302 3 1998-04-04 19:22:22
2 22 377 1 1997-11-07 07:18:36
3 244 51 2 1997-11-27 05:02:03
4 166 346 1 1998-02-02 05:33:16

Looks like this data is pretty ancient.

II. Implementing the Apriori algorithm

# First decide whether a user actually likes a movie: a rating above 3 counts as liking it
all_ratings['Favorable'] = all_ratings['Rating'] > 3
# Take a slice of the dataset as the training set; this shrinks the search space and speeds the algorithm up
# I tried training on the first 200 UserIDs and my machine simply froze; start below 50 and raise the number only if it runs smoothly
ratings = all_ratings[all_ratings['UserID'].isin(range(20))]
# Keep only the rows where the user liked the movie
favorable_ratings = ratings[ratings['Favorable']]
# To know which movies each user likes, store v.values as a frozenset for fast membership tests (see the small sketch after this block)
favorable_reviews_by_users = dict((k, frozenset(v.values)) for k, v in favorable_ratings.groupby('UserID')['MovieID'])
# How many users like each movie
num_favorable_by_movie = ratings[['MovieID', 'Favorable']].groupby('MovieID').sum()
num_favorable_by_movie.sort_values(by='Favorable', axis=0, ascending=False)[:5]
Favorable
MovieID
50 14.0
100 12.0
174 11.0
127 10.0
56 10.0
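A quick aside on the frozenset choice above: frozensets are immutable and hashable, so they can later serve as dictionary keys for itemsets, and they support cheap subset tests. A tiny illustration, with IDs picked from the table above purely for demonstration:

# frozenset is hashable (usable as a dict key) and supports fast subset tests
user_likes = frozenset({50, 100, 174, 127})
print(frozenset({50, 100}).issubset(user_likes))  # True: both 50 and 100 are liked
print(frozenset({50, 56}) <= user_likes)          # False: 56 is not among the favorites
print({frozenset({50, 100}): 18})                 # itemsets can key a dict, as used below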

1. Implementation

frequent_itemsets = {}
# Minimum support
min_support = 10
# Treat each movie on its own as a one-item itemset and check whether it is frequent
frequent_itemsets[1] = dict((frozenset((MovieID,)), row['Favorable'])
                            for MovieID, row in num_favorable_by_movie.iterrows()
                            if row['Favorable'] > min_support)
frequent_itemsets[1]
{frozenset({50}): 14.0, frozenset({100}): 12.0, frozenset({174}): 11.0}
# Take the (k-1)-item frequent itemsets, build supersets from them, test how frequent
# those are, and return the k-item frequent itemsets
def find_frequent_itemsets(favorable_reviews_by_users, k_1_itemsets, min_support):
    counts = defaultdict(int)
    for user, reviews in favorable_reviews_by_users.items():
        for itemset in k_1_itemsets:
            if itemset.issubset(reviews):
                # Walk over the movies this user liked that are not yet in the itemset;
                # use each one to build a superset and bump that superset's count
                for other_reviewed_movie in reviews - itemset:
                    current_superset = itemset | frozenset((other_reviewed_movie,))
                    counts[current_superset] += 1
    return dict([(itemset, frequency) for itemset, frequency in counts.items()
                 if frequency >= min_support])
find_frequent_itemsets(favorable_reviews_by_users, frequent_itemsets[1], min_support)
{frozenset({50, 100}): 18, frozenset({50, 174}): 20, frozenset({100, 174}): 18}
# Store each new batch of frequent itemsets as the run discovers them
for k in range(2, 5):
    cur_frequent_itemsets = find_frequent_itemsets(favorable_reviews_by_users,
                                                   frequent_itemsets[k-1], min_support)
    frequent_itemsets[k] = cur_frequent_itemsets
    if len(cur_frequent_itemsets) == 0:
        print("did not find any frequent itemsets of length {0}".format(k))
        # Flush stdout so we can see the output while the code is still running;
        # don't overuse it, as flushing (and printing in general) slows the program down
        sys.stdout.flush()
        break
    else:
        print("i found {0} frequent itemsets of length {1}".format(len(cur_frequent_itemsets), k))
        sys.stdout.flush()
i found 3 frequent itemsets of length 2
i found 1 frequent itemsets of length 3
did not find any frequent itemsets of length 4
# One-movie itemsets cannot form rules with both a premise and a conclusion, so drop them
del frequent_itemsets[1]
frequent_itemsets
{2: {frozenset({50, 100}): 18,
     frozenset({50, 174}): 20,
     frozenset({100, 174}): 18},
 3: {frozenset({50, 100, 174}): 24},
 4: {}}

2. Extracting association rules

A frequent itemset is a set of items that reaches the minimum support. The association rules extracted from it take the form: if a user likes every movie in the premise, they will also like the movie in the conclusion.
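Each candidate rule will be scored by its confidence: among the users whose favorable movies include the whole premise, the fraction who also like the conclusion. A minimal sketch of that computation; the toy likes_by_user dict is invented for illustration, while the real loops over favorable_reviews_by_users follow below.

# Confidence of a rule (premise -> conclusion):
#   confidence = (#users liking premise and conclusion) / (#users liking premise)
likes_by_user = {1: frozenset({50, 100}),  # toy data, invented for this sketch
                 2: frozenset({50}),
                 3: frozenset({50, 100, 174})}
premise, conclusion = frozenset({50}), 100
applicable = sum(1 for likes in likes_by_user.values() if premise <= likes)
holds = sum(1 for likes in likes_by_user.values()
            if premise <= likes and conclusion in likes)
print(holds / applicable)  # 0.666...: two of the three users who like #50 also like #100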

candidate_rules = []
# Walk over the frequent itemsets of every length and generate rules from each one
for itemset_length, itemset_counts in frequent_itemsets.items():
    # Walk over each itemset; every movie in it takes a turn as the conclusion
    for itemset in itemset_counts.keys():
        for conclusion in itemset:
            premise = itemset - set((conclusion,))
            candidate_rules.append((premise, conclusion))
candidate_rules
[(frozenset({100}), 50),
 (frozenset({50}), 100),
 (frozenset({174}), 50),
 (frozenset({50}), 174),
 (frozenset({174}), 100),
 (frozenset({100}), 174),
 (frozenset({100, 174}), 50),
 (frozenset({50, 174}), 100),
 (frozenset({50, 100}), 174)]
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
# Walk over the users and their favorable ratings, tallying each rule: whenever the premise holds, check whether the user also likes the conclusion movie
for user, reviews in favorable_reviews_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1
correct_counts
defaultdict(int,
            {(frozenset({100}), 50): 9,
             (frozenset({50}), 100): 9,
             (frozenset({174}), 50): 10,
             (frozenset({50}), 174): 10,
             (frozenset({174}), 100): 9,
             (frozenset({100}), 174): 9,
             (frozenset({100, 174}), 50): 8,
             (frozenset({50, 174}), 100): 8,
             (frozenset({50, 100}), 174): 8})
incorrect_counts
defaultdict(int,
            {(frozenset({50}), 174): 4,
             (frozenset({100}), 174): 3,
             (frozenset({50, 100}), 174): 1,
             (frozenset({50}), 100): 5,
             (frozenset({174}), 100): 2,
             (frozenset({50, 174}), 100): 2,
             (frozenset({100}), 50): 3,
             (frozenset({174}), 50): 1,
             (frozenset({100, 174}), 50): 1})
# Compute each rule's confidence
rule_confidence = {candidate_rule: correct_counts[candidate_rule] /
                   (correct_counts[candidate_rule] + incorrect_counts[candidate_rule])
                   for candidate_rule in candidate_rules}
rule_confidence
{(frozenset({100}), 50): 0.75,
 (frozenset({50}), 100): 0.6428571428571429,
 (frozenset({174}), 50): 0.9090909090909091,
 (frozenset({50}), 174): 0.7142857142857143,
 (frozenset({174}), 100): 0.8181818181818182,
 (frozenset({100}), 174): 0.75,
 (frozenset({100, 174}), 50): 0.8888888888888888,
 (frozenset({50, 174}), 100): 0.8,
 (frozenset({50, 100}), 174): 0.8888888888888888}
# Sort the rules by confidence, descending
sorted_confidence = sorted(rule_confidence.items(), key=itemgetter(1), reverse=True)
sorted_confidence
[((frozenset({174}), 50), 0.9090909090909091),
 ((frozenset({100, 174}), 50), 0.8888888888888888),
 ((frozenset({50, 100}), 174), 0.8888888888888888),
 ((frozenset({174}), 100), 0.8181818181818182),
 ((frozenset({50, 174}), 100), 0.8),
 ((frozenset({100}), 50), 0.75),
 ((frozenset({100}), 174), 0.75),
 ((frozenset({50}), 174), 0.7142857142857143),
 ((frozenset({50}), 100), 0.6428571428571429)]
# Print the five rules with the highest confidence
for index in range(5):
    print('Rule #{0}'.format(index + 1))
    premise, conclusion = sorted_confidence[index][0]
    print('Rule: If a person recommends: {0}, they will also recommend: {1}'.format(premise, conclusion))
    print('- Confidence: {0:.3f}'.format(rule_confidence[(premise, conclusion)]))
    print("")
Rule #1
Rule: If a person recommends: frozenset({174}), they will also recommend: 50
- Confidence: 0.909

Rule #2
Rule: If a person recommends: frozenset({100, 174}), they will also recommend: 50
- Confidence: 0.889

Rule #3
Rule: If a person recommends: frozenset({50, 100}), they will also recommend: 174
- Confidence: 0.889

Rule #4
Rule: If a person recommends: frozenset({174}), they will also recommend: 100
- Confidence: 0.818

Rule #5
Rule: If a person recommends: frozenset({50, 174}), they will also recommend: 100
- Confidence: 0.800

The output shows only movie IDs, not titles, which is not very reader-friendly. Next, let's map each movie ID to its title.

movie_name_data = pd.read_csv("u.item", delimiter='|', header=None, encoding='mac_roman')
movie_name_data = movie_name_data.iloc[:, :2]
movie_name_data.columns = ['MovieID','Title']
def get_movie_name(movie_id):
    title_object = movie_name_data[movie_name_data['MovieID'] == movie_id]['Title']
    title = title_object.values[0]
    return title
# Print the top-5 rules with the highest confidence, this time with movie titles
for index in range(5):
    print('Rule #{0}'.format(index + 1))
    premise, conclusion = sorted_confidence[index][0]
    premise_name = ', '.join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print('Rule: If a person recommends: {0}, \nthey will also recommend: {1}'.format(premise_name, conclusion_name))
    print('- Confidence: {0:.3f}'.format(rule_confidence[(premise, conclusion)]))
    print("")
Rule #1
Rule: If a person recommends: Raiders of the Lost Ark (1981),
they will also recommend: Star Wars (1977)
- Confidence: 0.909

Rule #2
Rule: If a person recommends: Fargo (1996), Raiders of the Lost Ark (1981),
they will also recommend: Star Wars (1977)
- Confidence: 0.889

Rule #3
Rule: If a person recommends: Star Wars (1977), Fargo (1996),
they will also recommend: Raiders of the Lost Ark (1981)
- Confidence: 0.889

Rule #4
Rule: If a person recommends: Raiders of the Lost Ark (1981),
they will also recommend: Fargo (1996)
- Confidence: 0.818

Rule #5
Rule: If a person recommends: Star Wars (1977), Raiders of the Lost Ark (1981),
they will also recommend: Fargo (1996)
- Confidence: 0.800

III. Evaluation

# Use the ratings from users #100 through #109 as the test set
test_dataset = all_ratings[all_ratings['UserID'].isin(range(100,110,1))]
test_favorable = test_dataset[test_dataset['Favorable']]
test_favorable_by_users = dict((k, frozenset(v.values)) for k, v in test_favorable.groupby('UserID')['MovieID'])
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
# Count how often each rule holds on the test set
for user, reviews in test_favorable_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1
# Compute confidence on the test set
test_confidence = {candidate_rule: correct_counts[candidate_rule] /
                   (correct_counts[candidate_rule] + incorrect_counts[candidate_rule])
                   for candidate_rule in candidate_rules}
# Print the top-5 rules again, with both training and test confidence
for index in range(5):
    print('Rule #{0}'.format(index + 1))
    premise, conclusion = sorted_confidence[index][0]
    premise_name = ', '.join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print('Rule: If a person recommends: {0}, \nthey will also recommend: {1}'.format(premise_name, conclusion_name))
    print('- Train Confidence: {0:.3f}'.format(rule_confidence[(premise, conclusion)]))
    print('- Test Confidence: {0:.3f}'.format(test_confidence[(premise, conclusion)]))
    print("")
Rule #1
Rule: If a person recommends: Raiders of the Lost Ark (1981),
they will also recommend: Star Wars (1977)
- Train Confidence: 0.909
- Test Confidence: 1.000

Rule #2
Rule: If a person recommends: Fargo (1996), Raiders of the Lost Ark (1981),
they will also recommend: Star Wars (1977)
- Train Confidence: 0.889
- Test Confidence: 1.000

Rule #3
Rule: If a person recommends: Star Wars (1977), Fargo (1996),
they will also recommend: Raiders of the Lost Ark (1981)
- Train Confidence: 0.889
- Test Confidence: 0.333

Rule #4
Rule: If a person recommends: Raiders of the Lost Ark (1981),
they will also recommend: Fargo (1996)
- Train Confidence: 0.818
- Test Confidence: 0.500

Rule #5
Rule: If a person recommends: Star Wars (1977), Raiders of the Lost Ark (1981),
they will also recommend: Fargo (1996)
- Train Confidence: 0.800
- Test Confidence: 0.500

IV. Summary

We extracted association rules usable for movie recommendation from the rating data. The whole process breaks down into two stages: first use the Apriori algorithm to find the frequent itemsets in the data, then generate association rules from those frequent itemsets. We used part of the data as a training set to discover the rules and then tested them on a held-out test set.
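As a possible refinement, not part of the walkthrough above: the rule-generation stage could filter by a minimum confidence instead of only sorting, so that weak rules never reach the recommender. A minimal sketch, reusing the frequent_itemsets and favorable_reviews_by_users structures built earlier; the min_confidence threshold of 0.8 is an arbitrary choice for illustration.

def extract_rules(frequent_itemsets, favorable_reviews_by_users, min_confidence=0.8):
    # Stage 2 of the pipeline as one reusable function:
    # generate candidate rules, score them, keep only the confident ones
    correct = defaultdict(int)
    total = defaultdict(int)
    candidates = [(itemset - frozenset((conclusion,)), conclusion)
                  for itemsets in frequent_itemsets.values()
                  for itemset in itemsets
                  for conclusion in itemset]
    for reviews in favorable_reviews_by_users.values():
        for premise, conclusion in candidates:
            if premise.issubset(reviews):
                total[(premise, conclusion)] += 1
                if conclusion in reviews:
                    correct[(premise, conclusion)] += 1
    return sorted(((rule, correct[rule] / total[rule])
                   for rule in total if correct[rule] / total[rule] >= min_confidence),
                  key=itemgetter(1), reverse=True)

Run against the training structures above, extract_rules(frequent_itemsets, favorable_reviews_by_users) should reproduce exactly the five rules printed in section II, since those are the only ones at or above 0.8 confidence.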
