Association Analysis Between Pairs of Products in Shopping-Basket Data

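Let's start with a story, taken from Baidu Zhidao.

In one supermarket there was a curious sight: diapers and beer were displayed side by side. Odd as it looks, this arrangement increased the sales of both diapers and beer. This is not a joke but a real case from the Walmart supermarket chain in the United States, and retailers have been fond of retelling it ever since.

Walmart has the largest data warehouse system in the world. To understand precisely what customers buy in its stores, Walmart runs market-basket analysis on their shopping behavior to find out which products are often bought together. The data warehouse gathers the detailed raw transaction data from every store, and Walmart applies data mining methods to analyze that raw data.

One unexpected finding was that the product most often bought together with diapers was beer! Extensive field investigation and analysis revealed the behavior pattern of Americans hidden behind "diapers and beer": in the United States, some young fathers often stop at the supermarket after work to buy diapers for the baby, and 30% to 40% of them also buy some beer for themselves. The reason is that American wives often ask their husbands to buy diapers on the way home, and after picking up the diapers the husbands also grab the beer they like.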

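I trust this little story has given you an idea of what association analysis is, so here is a demo.

This article uses a small shopping dataset of roughly 200 records (201 rows once loaded) covering 7 products. The small size makes the algorithm easy to follow and keeps the computation fast.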

The shopping-basket data (购物篮数据) can be downloaded here:

Baidu Netdisk link: https://pan.baidu.com/s/1YIJ4TXV0iGw4AVXI5ikSvw
Extraction code: j9so

First, open the file test_support_confidence.txt and find out how many distinct products it contains.

fr = open(r'D:\Workspace\jupyterNotebook\datamining\Test1\test_support_confidence.txt','r',encoding='utf-8')
goods = []
for line in fr:
    # Each line is one basket, with product names separated by tabs
    temp = line.strip().split('\t')
    #print(type(temp),temp)
    for i in temp:
        if i not in goods:
            goods.append(i)
fr.close()
print(goods)

Output:

['啤酒', '苹果', '奶酪', '薯片', '面包', '牛奶', '香蕉']

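There are 7 products in total. Next we use the 7 products as fields: for each record, a field is written as 1 if the product is in the basket and 0 if it is not, with the columns in the same order as goods. The result is written to the file test_se.txt.

fr = open(r'D:\Workspace\jupyterNotebook\datamining\Test1\test_support_confidence.txt','r',encoding='utf-8')
fw = open(r'D:\Workspace\jupyterNotebook\datamining\Test1\test_se.txt','w',encoding='utf-8')
for line in fr:
    # Start from an all-zero row, one column per product
    num = []
    for i in goods:
        num.append('0')
    temp = line.strip().split('\t')
    # Set the column of every product in this basket to 1
    for t in temp:
        index = goods.index(t)
        num[index] = '1'
    for n in num:
        fw.write(n + '\t')
    fw.write('\n')
fr.close()
fw.close()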
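
The same 0/1 encoding can also be done in memory, without the intermediate file. The following is only a minimal sketch: the function name encode_baskets and the three toy baskets are made up for illustration and are not part of the original data.

```python
def encode_baskets(baskets, goods):
    """Return one 0/1 row per basket, with columns in the order of `goods`."""
    # Map each product name to its column index once, instead of calling list.index repeatedly
    column = {name: i for i, name in enumerate(goods)}
    rows = []
    for basket in baskets:
        row = [0] * len(goods)
        for item in basket:
            row[column[item]] = 1
        rows.append(row)
    return rows

# Made-up baskets for illustration only (not taken from the data file)
toy_goods = ['面包', '牛奶', '奶酪']
toy_baskets = [['面包', '牛奶'], ['奶酪'], ['牛奶', '奶酪', '面包']]
encoded = encode_baskets(toy_baskets, toy_goods)
print(encoded)  # [[1, 1, 0], [0, 0, 1], [1, 1, 1]]
```

The dictionary lookup is a design choice for larger product lists; with only 7 products, goods.index works just as well.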

Then let's use numpy to take a look at the resulting file.

import numpy as np
path = r'D:\Workspace\jupyterNotebook\datamining\Test1\test_se.txt'
x = np.loadtxt(path)
# Column order: '面包', '牛奶', '奶酪', '苹果', '香蕉','啤酒','薯片'
print(x[:5])

Output:

[[1. 1. 1. 0. 0. 0. 0.]
 [0. 0. 1. 1. 0. 0. 0.]
 [0. 1. 0. 1. 1. 1. 0.]
 [0. 0. 0. 1. 1. 1. 1.]
 [0. 0. 0. 0. 1. 0. 0.]]

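This means the first record contains bread (面包), milk (牛奶) and cheese (奶酪), the second contains cheese and apples (苹果), and so on: the data has been binarized. This reading, like the column indices used later (3 for apples, 4 for bananas), follows the column order listed in the code comment above, so check the order of goods on your own run. Now let's see how many records and how many features there are.

print(x.shape)
# samples, features
samples, features = x.shape
print(samples, features)

Output: 201 records, 7 products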
(201, 7)
201 7

Next, let's see how many people bought apples (column 3). Counting bananas works the same way with column 4.

count_apple = 0
for i in x:
    if i[3] == 1:
        count_apple += 1
print('{} people buy apples'.format(count_apple))

Output:

104 people buy apples
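
Because the matrix only contains 0s and 1s, the same count can be obtained by summing a column. The sketch below uses a made-up 3x4 matrix (not the real data) whose column 3 plays the role of the apple column; the names toy, APPLE, count_apple_vec and per_item_counts are illustrative.

```python
import numpy as np

# Made-up 0/1 matrix for illustration; column 3 stands in for the apple column
toy = np.array([
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
])
APPLE = 3

# Summing a 0/1 column counts the rows where it is 1
count_apple_vec = int(toy[:, APPLE].sum())
# Summing over axis 0 gives the count for every product at once
per_item_counts = toy.sum(axis=0)

print(count_apple_vec)  # 2
print(per_item_counts)  # [1 1 1 2]
```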

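Next, let's see whether people who buy apples also buy bananas, i.e. the rule apples -> bananas.

# bought apples and also bought bananas
valid = 0
# bought apples but did not buy bananas
invalid = 0
for item in x:
    if item[3] == 1:
        if item[4] == 1:
            valid += 1
        else:
            invalid += 1
print(valid, invalid)
confidence = valid / count_apple
print('Confidence:', confidence)
# Support counts the baskets containing both items, so the numerator is valid, not count_apple
support = valid / samples
print('Support:', support)

Output:

61 43
Confidence: 0.5865384615384616
Support: 0.3034825870646766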

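Here is what support and confidence mean.

Support: out of the 201 people, what share bought both apples and bananas (61 / 201 here).

Confidence: out of the people who bought apples, what share also bought bananas (61 / 104 here).

So why use support and confidence?

Support filters rules by how often they occur: a rule with very low support is not meaningful. For example, if only 1 of the 201 people bought both apples and bananas, the pattern is not frequent enough and the rule is most likely meaningless. Confidence tests how reliable a rule is: the higher the confidence, the more reliable the rule.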
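
The two definitions can be written as one small helper. This is a sketch under assumptions: the function name pair_stats and the 5-row toy matrix are made up for illustration, and support is returned as a fraction of all rows.

```python
import numpy as np

def pair_stats(data, premise, conclusion):
    """Return (support, confidence) for the rule premise -> conclusion.

    support    = rows containing both items / all rows
    confidence = rows containing both items / rows containing the premise
    """
    has_premise = data[:, premise] == 1
    has_both = has_premise & (data[:, conclusion] == 1)
    n_premise = int(has_premise.sum())
    n_both = int(has_both.sum())
    support = n_both / len(data)
    # Guard against a premise that never occurs
    confidence = n_both / n_premise if n_premise else 0.0
    return support, confidence

# Made-up data: column 0 is the premise, column 1 the conclusion
toy = np.array([
    [1, 1],
    [1, 0],
    [1, 1],
    [0, 1],
    [0, 0],
])
toy_support, toy_confidence = pair_stats(toy, 0, 1)
print(toy_support)     # 2 of 5 rows contain both items
print(toy_confidence)  # 2 of the 3 rows containing item 0 also contain item 1
```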

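Next we generalize the code above so that it computes the same statistics for every ordered pair of products in one pass. Only associations between two products are considered here; associations among more than two products will be covered later.

from collections import defaultdict
# valid: the premise was bought and the conclusion was bought too
val = defaultdict(int)
# invalid: the premise was bought but the conclusion was not
inval = defaultdict(int)
# number of times each product occurs
occ = defaultdict(int)
for sample in x:
    # features is 7
    for result in range(features):
        if sample[result] == 0:
            continue
        occ[result] += 1
        for j in range(features):
            if result == j:
                continue
            if sample[j] == 1:
                val[result,j] += 1
            else:
                inval[result,j] += 1
confidence = defaultdict(float)
for i,j in val.keys():
    confidence[(i,j)] = val[i,j] / occ[i]
feature_names = ['面包', '牛奶', '奶酪', '苹果', '香蕉','啤酒','薯片']
for p,c in confidence.keys():
    p_name = feature_names[p]
    c_name = feature_names[c]
    print('Rule: if one person buys {},he/she also buys {}'.format(p_name,c_name))
    print('-Confidence: {0:.3f}'.format(confidence[p,c]))
    # Support is reported here as a raw count of baskets, not as a fraction
    print('-Support: {}'.format(val[p,c]))
    print('******************************')

Output: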

Rule: if one person buys 面包,he/she also buys 牛奶
-Confidence: 0.532
-Support: 59
******************************
Rule: if one person buys 面包,he/she also buys 奶酪
-Confidence: 0.559
-Support: 62
******************************
Rule: if one person buys 牛奶,he/she also buys 面包
-Confidence: 0.590
-Support: 59
******************************
Rule: if one person buys 牛奶,he/she also buys 奶酪
-Confidence: 0.550
-Support: 55
******************************
Rule: if one person buys 奶酪,he/she also buys 面包
-Confidence: 0.626
-Support: 62
******************************
Rule: if one person buys 奶酪,he/she also buys 牛奶
-Confidence: 0.556
-Support: 55
******************************
Rule: if one person buys 奶酪,he/she also buys 苹果
-Confidence: 0.636
-Support: 63
******************************
Rule: if one person buys 苹果,he/she also buys 奶酪
-Confidence: 0.606
-Support: 63
******************************
Rule: if one person buys 牛奶,he/she also buys 苹果
-Confidence: 0.610
-Support: 61
******************************
Rule: if one person buys 牛奶,he/she also buys 香蕉
-Confidence: 0.580
-Support: 58
******************************
Rule: if one person buys 牛奶,he/she also buys 啤酒
-Confidence: 0.570
-Support: 57
******************************
Rule: if one person buys 苹果,he/she also buys 牛奶
-Confidence: 0.587
-Support: 61
******************************
Rule: if one person buys 苹果,he/she also buys 香蕉
-Confidence: 0.587
-Support: 61
******************************
Rule: if one person buys 苹果,he/she also buys 啤酒
-Confidence: 0.529
-Support: 55
******************************
Rule: if one person buys 香蕉,he/she also buys 牛奶
-Confidence: 0.574
-Support: 58
******************************
Rule: if one person buys 香蕉,he/she also buys 苹果
-Confidence: 0.604
-Support: 61
******************************
Rule: if one person buys 香蕉,he/she also buys 啤酒
-Confidence: 0.594
-Support: 60
******************************
Rule: if one person buys 啤酒,he/she also buys 牛奶
-Confidence: 0.553
-Support: 57
******************************
Rule: if one person buys 啤酒,he/she also buys 苹果
-Confidence: 0.534
-Support: 55
******************************
Rule: if one person buys 啤酒,he/she also buys 香蕉
-Confidence: 0.583
-Support: 60
******************************
Rule: if one person buys 苹果,he/she also buys 薯片
-Confidence: 0.548
-Support: 57
******************************
Rule: if one person buys 香蕉,he/she also buys 薯片
-Confidence: 0.604
-Support: 61
******************************
Rule: if one person buys 啤酒,he/she also buys 薯片
-Confidence: 0.592
-Support: 61
******************************
Rule: if one person buys 薯片,he/she also buys 苹果
-Confidence: 0.559
-Support: 57
******************************
Rule: if one person buys 薯片,he/she also buys 香蕉
-Confidence: 0.598
-Support: 61
******************************
Rule: if one person buys 薯片,he/she also buys 啤酒
-Confidence: 0.598
-Support: 61
******************************
Rule: if one person buys 奶酪,he/she also buys 啤酒
-Confidence: 0.566
-Support: 56
******************************
Rule: if one person buys 啤酒,he/she also buys 奶酪
-Confidence: 0.544
-Support: 56
******************************
Rule: if one person buys 面包,he/she also buys 香蕉
-Confidence: 0.495
-Support: 55
******************************
Rule: if one person buys 面包,he/she also buys 啤酒
-Confidence: 0.523
-Support: 58
******************************
Rule: if one person buys 面包,he/she also buys 薯片
-Confidence: 0.577
-Support: 64
******************************
Rule: if one person buys 奶酪,he/she also buys 香蕉
-Confidence: 0.596
-Support: 59
******************************
Rule: if one person buys 奶酪,he/she also buys 薯片
-Confidence: 0.586
-Support: 58
******************************
Rule: if one person buys 香蕉,he/she also buys 面包
-Confidence: 0.545
-Support: 55
******************************
Rule: if one person buys 香蕉,he/she also buys 奶酪
-Confidence: 0.584
-Support: 59
******************************
Rule: if one person buys 啤酒,he/she also buys 面包
-Confidence: 0.563
-Support: 58
******************************
Rule: if one person buys 薯片,he/she also buys 面包
-Confidence: 0.627
-Support: 64
******************************
Rule: if one person buys 薯片,he/she also buys 奶酪
-Confidence: 0.569
-Support: 58
******************************
Rule: if one person buys 牛奶,he/she also buys 薯片
-Confidence: 0.600
-Support: 60
******************************
Rule: if one person buys 薯片,he/she also buys 牛奶
-Confidence: 0.588
-Support: 60
******************************
Rule: if one person buys 面包,he/she also buys 苹果
-Confidence: 0.550
-Support: 61
******************************
Rule: if one person buys 苹果,he/she also buys 面包
-Confidence: 0.587
-Support: 61
******************************
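
The pairwise counts above can also be obtained with a single matrix product, since entry (i, j) of the transposed 0/1 matrix times itself counts the baskets containing both products. The following is a sketch on a made-up 4x3 matrix, not the article's data; the names co, occ_counts and conf_matrix are illustrative, and the division assumes every product occurs at least once.

```python
import numpy as np

# Made-up 0/1 matrix: 4 baskets, 3 products
toy = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
])

# co[i, j] = number of baskets containing both product i and product j
co = toy.T @ toy
# The diagonal holds how many baskets contain each product
occ_counts = np.diag(co)
# Row i, column j: confidence of the rule i -> j (the diagonal is trivially 1)
conf_matrix = co / occ_counts[:, None]

print(co)
print(conf_matrix.round(3))
```

Off the diagonal, co corresponds to the val counts and occ_counts to the occ counts of the loop version.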

Some readers may not see the difference between the two rules apples -> bananas and bananas -> apples. The code below sorts the rules by support and prints the top six, so that both directions of the same pair can be compared side by side.

from operator import itemgetter

def print_rule(premise, conclusion, support, confidence, features):
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    # Use the support argument rather than the global val
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")

# Sort the (premise, conclusion) pairs by support count, highest first
sorted_support = sorted(val.items(),key=itemgetter(1),reverse=True)
for index in range(6):
    print('Rule #{0}'.format(index + 1))
    p,c = sorted_support[index][0]
    print_rule(p,c,val,confidence,feature_names)

Output:

Rule #1
Rule: If a person buys 面包 they will also buy 薯片
 - Confidence: 0.577
 - Support: 64

Rule #2
Rule: If a person buys 薯片 they will also buy 面包
 - Confidence: 0.627
 - Support: 64

Rule #3
Rule: If a person buys 奶酪 they will also buy 苹果
 - Confidence: 0.636
 - Support: 63

Rule #4
Rule: If a person buys 苹果 they will also buy 奶酪
 - Confidence: 0.606
 - Support: 63

Rule #5
Rule: If a person buys 面包 they will also buy 奶酪
 - Confidence: 0.559
 - Support: 62

Rule #6
Rule: If a person buys 奶酪 they will also buy 面包
 - Confidence: 0.626
 - Support: 62
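
The two directions of a pair share the same support but divide by different counts, which is why their confidence differs. A tiny worked example with made-up counts (not taken from the data) shows the effect; the names n_a, n_b and n_both are illustrative.

```python
# Made-up counts, for illustration only
n_a = 4      # baskets containing product A
n_b = 2      # baskets containing product B
n_both = 2   # baskets containing both A and B

# Same numerator, different denominators
conf_a_to_b = n_both / n_a   # of the A buyers, the share that also bought B
conf_b_to_a = n_both / n_b   # of the B buyers, the share that also bought A

print(conf_a_to_b)  # 0.5
print(conf_b_to_a)  # 1.0
```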

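This is only a small demo written in Jupyter Notebook, so nothing has been wrapped into functions. Readers are welcome to tidy it up, or to keep only the rules that meet a chosen support and confidence threshold. Only associations between two products were covered here; associations among several products are also possible, for example whether people who buy milk and bread also buy apples, and we will come to that later.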
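
Selecting rules by threshold could look like the sketch below. The function name filter_rules, the threshold values and the toy counts are all assumptions made for illustration; pair_counts and item_counts have the same shape as the val and occ dictionaries used earlier.

```python
def filter_rules(pair_counts, item_counts, n_samples,
                 min_support=0.3, min_confidence=0.6):
    """Keep the rules (premise, conclusion) that meet both thresholds."""
    kept = []
    for (p, c), both in pair_counts.items():
        support = both / n_samples          # fraction of all baskets
        confidence = both / item_counts[p]  # fraction of baskets with the premise
        if support >= min_support and confidence >= min_confidence:
            kept.append((p, c, support, confidence))
    # Most reliable rules first
    kept.sort(key=lambda rule: rule[3], reverse=True)
    return kept

# Made-up counts for 10 baskets and 3 products (indices 0, 1, 2)
toy_pairs = {(0, 1): 6, (1, 0): 6, (0, 2): 2, (2, 0): 2}
toy_items = {0: 8, 1: 7, 2: 5}
rules = filter_rules(toy_pairs, toy_items, n_samples=10,
                     min_support=0.3, min_confidence=0.7)
for p, c, s, conf in rules:
    print('{} -> {}: support {:.2f}, confidence {:.3f}'.format(p, c, s, conf))
```

With the article's variables the call would be filter_rules(val, occ, samples), with thresholds chosen to suit the data.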
