

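The Apriori algorithm first makes sure that a rule has enough support in the dataset, so its most important parameter is the minimum support. For example, to produce the frequent itemset (A, B) for items A and B with a support of at least 30, A and B must each appear at least 30 times in the dataset, because every transaction that contains both items also contains each of them on its own. Larger frequent itemsets must obey the same constraint.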
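The support constraint can be illustrated with a minimal sketch on made-up data. The `transactions` list, the `support` helper, and the threshold of 3 are assumptions introduced only for this illustration; they are not part of the MovieLens session that follows.

```python
# Hypothetical toy transactions: each frozenset is one shopping basket.
transactions = [
    frozenset({"A", "B", "C"}),
    frozenset({"A", "B"}),
    frozenset({"A", "C"}),
    frozenset({"B", "C"}),
    frozenset({"A", "B", "D"}),
]


def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for basket in transactions if itemset <= basket)


min_support = 3  # illustrative threshold, not the value used later

support_a = support({"A"}, transactions)
support_b = support({"B"}, transactions)
support_ab = support({"A", "B"}, transactions)

# The support of (A, B) can never exceed the support of A or of B alone,
# so an itemset can only be frequent if all of its subsets are frequent.
assert support_ab <= min(support_a, support_b)

print(support_a, support_b, support_ab)  # prints: 4 4 3
print(support_ab >= min_support)         # prints: True
```

This is why Apriori only needs to extend itemsets that are already frequent: any superset of an infrequent itemset is guaranteed to be infrequent as well.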


In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: all_ratings = pd.read_csv('/Users/gn/scikit--learn/ml-100k/u.data',
   ...:                           delimiter="\t", header=None,
   ...:                           names=["UserID", "MovieID", "Rating", "Datetime"])

In [4]: all_ratings.head()
Out[4]:
   UserID  MovieID  Rating   Datetime
0     196      242       3  881250949
1     186      302       3  891717742
2      22      377       1  878887116
3     244       51       2  880606923
4     166      346       1  886397596

In [5]: all_ratings["Datetime"] = pd.to_datetime(all_ratings['Datetime'], unit='s')

In [6]: # A rating above 3 counts as a favorable review
   ...: all_ratings["Favorable"] = all_ratings['Rating'] > 3

In [7]: # Use the users with an ID below 200 as the training set
   ...: ratings = all_ratings[all_ratings['UserID'].isin(range(200))]

In [8]: ratings.head()
Out[8]:
   UserID  MovieID  Rating            Datetime  Favorable
0     196      242       3 1997-12-04 15:55:49      False
1     186      302       3 1998-04-04 19:22:22      False
2      22      377       1 1997-11-07 07:18:36      False
4     166      346       1 1998-02-02 05:33:16      False
6     115      265       2 1997-12-03 17:51:28      False

In [9]: favorable_ratings = ratings[ratings["Favorable"]]

In [10]: favorable_ratings.groupby("UserID")["MovieID"]
Out[10]: <pandas.core.groupby.generic.SeriesGroupBy object at 0x11583d9b0>

In [11]: # Map each user to the frozenset of movies they rated favorably
    ...: favorable_reviews_by_users = dict((k, frozenset(v.values)) for k, v in favorable_ratings.groupby("UserID")["MovieID"])

In [12]: num_favorable_by_movie = ratings[["MovieID", "Favorable"]].groupby("MovieID").sum()

In [13]: num_favorable_by_movie.sort_values('Favorable', ascending=False).head()
Out[13]:
         Favorable
50           100.0
100           89.0
258           83.0
181           79.0
174           74.0

In [14]: from collections import defaultdict

In [15]: frequent_itemsets = {}

In [16]: min_support = 50

In [17]: # Frequent itemsets of length 1: every movie with more than min_support favorable reviews
    ...: frequent_itemsets[1] = dict((frozenset((movie_id,)), row["Favorable"])
    ...:                             for movie_id, row in num_favorable_by_movie[num_favorable_by_movie['Favorable'] > min_support].iterrows())

In [18]: import sys

In [19]: print("{} movies have more than {} favorable reviews".format(len(frequent_itemsets[1]), min_support))
16 movies have more than 50 favorable reviews


In [20]: def find_frequent_itemsets(favorable_reviews_by_users, k_1_itemsets, min_support):
    ...:     counts = defaultdict(int)
    ...:     for user, reviews in favorable_reviews_by_users.items():
    ...:         for itemset in k_1_itemsets:
    ...:             if itemset.issubset(reviews):
    ...:                 # Extend the itemset with every other movie this user liked
    ...:                 for other_reviewed_movie in reviews - itemset:
    ...:                     current_superset = itemset | frozenset((other_reviewed_movie,))
    ...:                     counts[current_superset] += 1
    ...:     return dict([(itemset, frequency) for itemset, frequency in counts.items() if frequency >= min_support])
    ...:

In [21]: for k in range(2, 20):
    ...:     cur_frequent_itemsets = find_frequent_itemsets(favorable_reviews_by_users, frequent_itemsets[k-1], min_support)
    ...:     if len(cur_frequent_itemsets) == 0:
    ...:         print("Did not find any frequent itemsets of length {}".format(k))
    ...:         sys.stdout.flush()
    ...:         break
    ...:     else:
    ...:         print("Found {} frequent itemsets of length {}".format(len(cur_frequent_itemsets), k))
    ...:         sys.stdout.flush()
    ...:         frequent_itemsets[k] = cur_frequent_itemsets
    ...:
Found 93 frequent itemsets of length 2
Found 295 frequent itemsets of length 3
Found 593 frequent itemsets of length 4
Found 785 frequent itemsets of length 5
Found 677 frequent itemsets of length 6
Found 373 frequent itemsets of length 7
Found 126 frequent itemsets of length 8
Found 24 frequent itemsets of length 9
Found 2 frequent itemsets of length 10
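One detail of `find_frequent_itemsets` is worth checking on small data: for a given user, a superset of length k is incremented once for every frequent subset of length k-1 that the user's favorable reviews contain, so the stored frequency can be larger than the number of users who liked all k movies. The sketch below contrasts that behavior with a possible variant that counts each user at most once per superset. The toy data and the name `find_frequent_itemsets_dedup` are assumptions made for this illustration; the counts printed in the session come from the original function, not from this variant.

```python
from collections import defaultdict


def find_frequent_itemsets(favorable_reviews_by_users, k_1_itemsets, min_support):
    """Same logic as the function in the session above."""
    counts = defaultdict(int)
    for user, reviews in favorable_reviews_by_users.items():
        for itemset in k_1_itemsets:
            if itemset.issubset(reviews):
                for other_reviewed_movie in reviews - itemset:
                    current_superset = itemset | frozenset((other_reviewed_movie,))
                    counts[current_superset] += 1
    return {itemset: freq for itemset, freq in counts.items() if freq >= min_support}


def find_frequent_itemsets_dedup(favorable_reviews_by_users, k_1_itemsets, min_support):
    """Hypothetical variant: each user contributes at most once per superset."""
    counts = defaultdict(int)
    for user, reviews in favorable_reviews_by_users.items():
        # Collect this user's supersets in a set so duplicates collapse
        supersets = set()
        for itemset in k_1_itemsets:
            if itemset.issubset(reviews):
                for other_reviewed_movie in reviews - itemset:
                    supersets.add(itemset | frozenset((other_reviewed_movie,)))
        for superset in supersets:
            counts[superset] += 1
    return {itemset: freq for itemset, freq in counts.items() if freq >= min_support}


# Hypothetical toy data: users 1 and 2 like movies 10 and 20, user 3 only likes 10
toy_reviews = {1: frozenset({10, 20}), 2: frozenset({10, 20}), 3: frozenset({10})}
toy_singletons = {frozenset({10}): 3, frozenset({20}): 2}

original = find_frequent_itemsets(toy_reviews, toy_singletons, 2)
deduped = find_frequent_itemsets_dedup(toy_reviews, toy_singletons, 2)

print(original[frozenset({10, 20})])  # prints: 4
print(deduped[frozenset({10, 20})])   # prints: 2
```

Only two users like both movies, yet the original loop reports 4 because the pair is reached once from {10} and once from {20} for each of them. With a fixed `min_support`, the variant would therefore keep fewer itemsets than the original function.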



In [22]: del frequent_itemsets[1]

In [23]: candidate_rules = []

In [24]: for itemset_length, itemset_counts in frequent_itemsets.items():
    ...:     for itemset in itemset_counts.keys():
    ...:         # Each movie in the itemset takes a turn as the conclusion
    ...:         for conclusion in itemset:
    ...:             premise = itemset - set((conclusion,))
    ...:             candidate_rules.append((premise, conclusion))
    ...: print("Generated {} candidate rules".format(len(candidate_rules)))
Generated 15285 candidate rules

In [25]: correct_counts = defaultdict(int)

In [26]: incorrect_counts = defaultdict(int)

In [27]: for user, reviews in favorable_reviews_by_users.items():
    ...:     for candidate_rule in candidate_rules:
    ...:         premise, conclusion = candidate_rule
    ...:         if premise.issubset(reviews):
    ...:             if conclusion in reviews:
    ...:                 correct_counts[candidate_rule] += 1
    ...:             else:
    ...:                 incorrect_counts[candidate_rule] += 1
    ...: # Confidence = correct / (correct + incorrect) over the users who match the premise
    ...: rule_confidence = {candidate_rule: correct_counts[candidate_rule] / float(correct_counts[candidate_rule] + incorrect_counts[candidate_rule])
    ...:                    for candidate_rule in candidate_rules}

In [28]: min_confidence = 0.9

In [29]: rule_confidence = {rule: confidence for rule, confidence in rule_confidence.items() if confidence > min_confidence}
    ...: print(len(rule_confidence))
5152

In [30]: from operator import itemgetter

In [31]: sorted_confidence = sorted(rule_confidence.items(), key=itemgetter(1), reverse=True)

In [32]: for index in range(5):
    ...:     print("Rule #{0}".format(index + 1))
    ...:     (premise, conclusion) = sorted_confidence[index][0]
    ...:     print("Rule: if someone likes {0}, they will also like {1}".format(premise, conclusion))
    ...:     print(" - Confidence: {0:.3f}".format(rule_confidence[(premise, conclusion)]))
    ...:     print("")
    ...:
Rule #1
Rule: if someone likes frozenset({98, 181}), they will also like 50
 - Confidence: 1.000

Rule #2
Rule: if someone likes frozenset({172, 79}), they will also like 174
 - Confidence: 1.000

Rule #3
Rule: if someone likes frozenset({258, 172}), they will also like 174
 - Confidence: 1.000

Rule #4
Rule: if someone likes frozenset({1, 181, 7}), they will also like 50
 - Confidence: 1.000

Rule #5
Rule: if someone likes frozenset({1, 172, 7}), they will also like 174
 - Confidence: 1.000
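The confidence computed in the session is the share of users matching a rule's premise who also like its conclusion. A minimal sketch for a single rule, on made-up data, may make the arithmetic easier to follow. The `confidence` helper and the `toy_reviews` dictionary are hypothetical names introduced here, not part of the original session.

```python
def confidence(premise, conclusion, reviews_by_user):
    """Return correct / (correct + incorrect) for one rule.

    Returns None when no user matches the premise, because the ratio
    is undefined in that case.
    """
    correct = 0
    incorrect = 0
    for reviews in reviews_by_user.values():
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct += 1
            else:
                incorrect += 1
    applicable = correct + incorrect
    return correct / applicable if applicable else None


# Hypothetical toy data: user ID -> movies the user rated favorably
toy_reviews = {
    1: frozenset({1, 2, 3}),
    2: frozenset({1, 2}),
    3: frozenset({1, 2, 3}),
    4: frozenset({3}),
}

# Three users like movies 1 and 2; two of them also like movie 3
conf = confidence(frozenset({1, 2}), 3, toy_reviews)
print("{0:.3f}".format(conf))  # prints: 0.667
```

User 4 likes movie 3 but does not match the premise, so that user affects neither the numerator nor the denominator.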



In [41]: movie_name_data = pd.read_csv('/Users/gn/scikit--learn/ml-100k/u.item', delimiter='|', header=None, encoding='mac-roman')

In [42]: movie_name_data.head()
Out[42]:
    0                  1            2   3                                                  4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  21  22  23
0   1   Toy Story (1995)  01-Jan-1995 NaN  http://us.imdb.com/M/title-exact?Toy%20Story%2...   0   0   0   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0
1   2   GoldenEye (1995)  01-Jan-1995 NaN  http://us.imdb.com/M/title-exact?GoldenEye%20(...   0   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0
2   3  Four Rooms (1995)  01-Jan-1995 NaN  http://us.imdb.com/M/title-exact?Four%20Rooms%...   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0
3   4  Get Shorty (1995)  01-Jan-1995 NaN  http://us.imdb.com/M/title-exact?Get%20Shorty%...   0   1   0   0   0   1   0   0   1   0   0   0   0   0   0   0   0   0   0
4   5     Copycat (1995)  01-Jan-1995 NaN  http://us.imdb.com/M/title-exact?Copycat%20(1995)   0   0   0   0   0   0   1   0   1   0   0   0   0   0   0   0   1   0   0

In [43]: movie_name_data.columns = ["MovieID", "Title", "Release Date", "Video Release", "IMDB", "<UNK>", "Action", "Adventure",
    ...:                            "Animation", "Children's", "Comedy", "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir",
    ...:                            "Horror", "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"]

In [44]: movie_name_data.head()
Out[44]:
   MovieID              Title Release Date  Video Release                                               IMDB  <UNK>  Action  ...  Musical  Mystery  Romance  Sci-Fi  Thriller  War  Western
0        1   Toy Story (1995)  01-Jan-1995            NaN  http://us.imdb.com/M/title-exact?Toy%20Story%2...      0       0  ...        0        0        0       0         0    0        0
1        2   GoldenEye (1995)  01-Jan-1995            NaN  http://us.imdb.com/M/title-exact?GoldenEye%20(...      0       1  ...        0        0        0       0         1    0        0
2        3  Four Rooms (1995)  01-Jan-1995            NaN  http://us.imdb.com/M/title-exact?Four%20Rooms%...      0       0  ...        0        0        0       0         1    0        0
3        4  Get Shorty (1995)  01-Jan-1995            NaN  http://us.imdb.com/M/title-exact?Get%20Shorty%...      0       1  ...        0        0        0       0         0    0        0
4        5     Copycat (1995)  01-Jan-1995            NaN  http://us.imdb.com/M/title-exact?Copycat%20(1995)      0       0  ...        0        0        0       0         1    0        0

[5 rows x 24 columns]

In [45]: def get_movie_name(movie_id):
    ...:     # Look up the title of the row whose MovieID matches
    ...:     title_object = movie_name_data[movie_name_data["MovieID"] == movie_id]["Title"]
    ...:     title = title_object.values[0]
    ...:     return title
    ...:

In [46]: for index in range(5):
    ...:     print("Rule #{0}".format(index + 1))
    ...:     (premise, conclusion) = sorted_confidence[index][0]
    ...:     premise_names = ", ".join(get_movie_name(idx) for idx in premise)
    ...:     conclusion_name = get_movie_name(conclusion)
    ...:     print("Rule: if someone likes {0}, they will also like {1}".format(premise_names, conclusion_name))
    ...:     print(" - Confidence: {0:.3f}".format(rule_confidence[(premise, conclusion)]))
    ...:     print("")
    ...:
Rule #1
Rule: if someone likes Silence of the Lambs, The (1991), Return of the Jedi (1983), they will also like Star Wars (1977)
 - Confidence: 1.000

Rule #2
Rule: if someone likes Empire Strikes Back, The (1980), Fugitive, The (1993), they will also like Raiders of the Lost Ark (1981)
 - Confidence: 1.000

Rule #3
Rule: if someone likes Contact (1997), Empire Strikes Back, The (1980), they will also like Raiders of the Lost Ark (1981)
 - Confidence: 1.000

Rule #4
Rule: if someone likes Toy Story (1995), Return of the Jedi (1983), Twelve Monkeys (1995), they will also like Star Wars (1977)
 - Confidence: 1.000

Rule #5
Rule: if someone likes Toy Story (1995), Empire Strikes Back, The (1980), Twelve Monkeys (1995), they will also like Raiders of the Lost Ark (1981)
 - Confidence: 1.000


In [47]: # Every user outside the training range forms the test set
    ...: test_dataset = all_ratings[~all_ratings['UserID'].isin(range(200))]

In [48]: test_favorable = test_dataset[test_dataset["Favorable"]]

In [49]: test_favorable_by_users = dict((k, frozenset(v.values)) for k, v in test_favorable.groupby("UserID")["MovieID"])

In [50]: test_dataset.head()
Out[50]:
    UserID  MovieID  Rating            Datetime  Favorable
3      244       51       2 1997-11-27 05:02:03      False
5      298      474       4 1998-01-07 14:20:06       True
7      253      465       5 1998-04-03 18:34:27       True
8      305      451       3 1998-02-01 09:20:17      False
11     286     1014       5 1997-11-17 15:38:45       True

In [51]: correct_counts = defaultdict(int)
    ...: incorrect_counts = defaultdict(int)
    ...: for user, reviews in test_favorable_by_users.items():
    ...:     for candidate_rule in candidate_rules:
    ...:         premise, conclusion = candidate_rule
    ...:         if premise.issubset(reviews):
    ...:             if conclusion in reviews:
    ...:                 correct_counts[candidate_rule] += 1
    ...:             else:
    ...:                 incorrect_counts[candidate_rule] += 1
    ...:

In [52]: test_confidence = {candidate_rule: correct_counts[candidate_rule] / float(correct_counts[candidate_rule] + incorrect_counts[candidate_rule])
    ...:                    for candidate_rule in rule_confidence}

In [53]: sorted_test_confidence = sorted(test_confidence.items(), key=itemgetter(1), reverse=True)

In [54]: # List the top rules by training confidence and show how each one does on the test set
    ...: for index in range(10):
    ...:     print("Rule #{0}".format(index + 1))
    ...:     (premise, conclusion) = sorted_confidence[index][0]
    ...:     premise_names = ", ".join(get_movie_name(idx) for idx in premise)
    ...:     conclusion_name = get_movie_name(conclusion)
    ...:     print("Rule: if someone likes {0}, they will also like {1}".format(premise_names, conclusion_name))
    ...:     print(" - Train confidence: {0:.3f}".format(rule_confidence.get((premise, conclusion), -1)))
    ...:     print(" - Test confidence: {0:.3f}".format(test_confidence.get((premise, conclusion), -1)))
    ...:     print("")
    ...:
Rule #1
Rule: if someone likes Silence of the Lambs, The (1991), Return of the Jedi (1983), they will also like Star Wars (1977)
 - Train confidence: 1.000
 - Test confidence: 0.936

Rule #2
Rule: if someone likes Empire Strikes Back, The (1980), Fugitive, The (1993), they will also like Raiders of the Lost Ark (1981)
 - Train confidence: 1.000
 - Test confidence: 0.876

Rule #3
Rule: if someone likes Contact (1997), Empire Strikes Back, The (1980), they will also like Raiders of the Lost Ark (1981)
 - Train confidence: 1.000
 - Test confidence: 0.841

Rule #4
Rule: if someone likes Toy Story (1995), Return of the Jedi (1983), Twelve Monkeys (1995), they will also like Star Wars (1977)
 - Train confidence: 1.000
 - Test confidence: 0.932

Rule #5
Rule: if someone likes Toy Story (1995), Empire Strikes Back, The (1980), Twelve Monkeys (1995), they will also like Raiders of the Lost Ark (1981)
 - Train confidence: 1.000
 - Test confidence: 0.903

Rule #6
Rule: if someone likes Pulp Fiction (1994), Toy Story (1995), Star Wars (1977), they will also like Raiders of the Lost Ark (1981)
 - Train confidence: 1.000
 - Test confidence: 0.816

Rule #7
Rule: if someone likes Pulp Fiction (1994), Toy Story (1995), Return of the Jedi (1983), they will also like Star Wars (1977)
 - Train confidence: 1.000
 - Test confidence: 0.970

Rule #8
Rule: if someone likes Toy Story (1995), Silence of the Lambs, The (1991), Return of the Jedi (1983), they will also like Star Wars (1977)
 - Train confidence: 1.000
 - Test confidence: 0.933

Rule #9
Rule: if someone likes Toy Story (1995), Empire Strikes Back, The (1980), Return of the Jedi (1983), they will also like Star Wars (1977)
 - Train confidence: 1.000
 - Test confidence: 0.971

Rule #10
Rule: if someone likes Pulp Fiction (1994), Toy Story (1995), Shawshank Redemption, The (1994), they will also like Silence of the Lambs, The (1991)
 - Train confidence: 1.000
 - Test confidence: 0.794
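To put the gap between training and test confidence into numbers, the ten test values printed above can be summarized with a few lines of arithmetic. The list below is copied by hand from that output, and the variable names are introduced here purely for illustration.

```python
# Test confidences of rules #1 to #10, copied from the output above.
# Every one of these rules has a training confidence of 1.000.
test_confidences = [0.936, 0.876, 0.841, 0.932, 0.903,
                    0.816, 0.970, 0.933, 0.971, 0.794]

mean_test = sum(test_confidences) / len(test_confidences)
lowest = min(test_confidences)
highest = max(test_confidences)

print("mean test confidence: {0:.3f}".format(mean_test))   # prints: 0.897
print("range: {0:.3f} to {1:.3f}".format(lowest, highest))  # prints: 0.794 to 0.971
```

On these ten rules the confidence falls from 1.000 on the training users to roughly 0.90 on average for the held-out users, so the rules still hold for most test users but not as perfectly as the training figures suggest.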



