Black Friday

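Project introduction:

Black Friday is the day after Thanksgiving in the United States and marks a major shopping event ahead of Christmas. On that day US stores typically launch large numbers of discounts and promotions. Because US stores traditionally record losses in red ink and profits in black ink, and the shopping frenzy on the Friday after Thanksgiving sends store profits sharply up, retailers came to call the day Black Friday. Retailers hope that the Christmas shopping season that starts on this day will bring in the largest share of the year's profit.

Purpose of the analysis:

The data for this analysis comes from the Black Friday sales records of an e-commerce retailer, provided on Kaggle. Drawing on analysis approaches found online, the discussion focuses on two main aspects, products and users, and offers analysis and recommendations to help the e-commerce platform set its strategy.

Main framework of this analysis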

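1. Overall spending
2. User profile analysis (finding the most valuable user types: gender, age, occupation, marital status)
3. City performance analysis (distribution by city, distribution by years of residence)
4. Product analysis (finding the most valuable products). Detailed analysis: Top 5 products by sales amount, Top 5 product categories by sales amount
5. Highest-contributing user value analysis: average spend per customer, list of the Top 1000 users by value, profile of the Top 1000 users
6. Conclusions and recommendations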

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese labels
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly
import seaborn as sns

df = pd.read_csv('D:\\BaiduNetdiskDownload\\practise\\Third Program\\BlackFriday.csv')

df.head()

	User_ID	Product_ID	Gender	Age	Occupation	City_Category	Stay_In_Current_City_Years	Product_Category_1	Product_Category_2	Product_Category_3	Purchase
0	1000001	P00069042	F	0-17	10	A	2	3	NaN	NaN	8370
1	1000001	P00248942	F	0-17	10	A	2	1	6.0	14.0	15200
2	1000001	P00087842	F	0-17	10	A	2	12	NaN	NaN	1422
3	1000001	P00085442	F	0-17	10	A	2	12	14.0	NaN	1057
4	1000002	P00285442	M	55+	16	C	4+	8	NaN	NaN	7969

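The raw data has 12 fields, explained below:

User_ID: user ID

Product_ID: product ID

Gender: gender

Age: age group

Occupation: occupation (numeric code)

City_Category: city category (A, B, C)

Stay_In_Current_City_Years: years of residence in the current city

Marital_Status: marital status

Product_Category_1: product category 1, the first-level category

Product_Category_2: product category 2, the second-level category

Product_Category_3: product category 3, the third-level category

Purchase: purchase amount (USD)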

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 537577 entries, 0 to 537576
Data columns (total 12 columns):
User_ID                       537577 non-null int64
Product_ID                    537577 non-null object
Gender                        537577 non-null object
Age                           537577 non-null object
Occupation                    537577 non-null int64
City_Category                 537577 non-null object
Stay_In_Current_City_Years    537577 non-null object
Marital_Status                537577 non-null int64
Product_Category_1            537577 non-null int64
Product_Category_2            370591 non-null float64
Product_Category_3            164278 non-null float64
Purchase                      537577 non-null int64
dtypes: float64(2), int64(5), object(5)
memory usage: 49.2+ MB

1. Overall spending

df.describe()

	User_ID	Occupation	Marital_Status	Product_Category_1	Product_Category_2	Product_Category_3	Purchase
count	5.375770e+05	537577.00000	537577.000000	537577.000000	370591.000000	164278.000000	537577.000000
mean	1.002992e+06	8.08271	0.408797	5.295546	9.842144	12.669840	9333.859853
std	1.714393e+03	6.52412	0.491612	3.750701	5.087259	4.124341	4981.022133
min	1.000001e+06	0.00000	0.000000	1.000000	2.000000	3.000000	185.000000
25%	1.001495e+06	2.00000	0.000000	1.000000	5.000000	9.000000	5866.000000
50%	1.003031e+06	7.00000	0.000000	5.000000	9.000000	14.000000	8062.000000
75%	1.004417e+06	14.00000	1.000000	8.000000	15.000000	16.000000	12073.000000
max	1.006040e+06	20.00000	1.000000	18.000000	18.000000	18.000000	23961.000000

df['Purchase'].sum() / df['User_ID'].nunique()  # average total spend per user: about 850,000 USD

851751.5494822611

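Judging from these records, the data mainly covers high-spending customers: average spend per user reaches about 850,000 USD, and together these 5,891 users contributed about 5 billion USD in sales. Retaining loyal users and encouraging them to spend more is a basic practice in e-commerce.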
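The per-user figure above is total sales divided by the number of distinct users, not the average value of a single purchase record. The following minimal sketch shows the difference on a toy frame; the rows are made up for illustration (they are not the Kaggle data), and the names `toy`, `avg_per_user`, and `avg_per_record` are assumptions.

```python
import pandas as pd

# Toy data with the same column names as the dataset (values are made up)
toy = pd.DataFrame({
    "User_ID": [1, 1, 2, 2, 2, 3],
    "Purchase": [100, 200, 50, 50, 100, 300],
})

total_sales = toy["Purchase"].sum()        # sum over all purchase records
n_users = toy["User_ID"].nunique()         # number of distinct users
avg_per_user = total_sales / n_users       # total spend per distinct user
avg_per_record = toy["Purchase"].mean()    # average value of one purchase record

print(total_sales, n_users, avg_per_user, avg_per_record)
```

On the real data the second quantity corresponds to the `mean` of `Purchase` reported by `df.describe()`, which is far smaller than the per-user total.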

2. Analysis from the user perspective

(1) Gender

# Total sales per gender
df_gender_purchase = df.groupby('Gender').agg({'Purchase': 'sum'}).reset_index().rename(columns={'Purchase': 'Purchase_amount'})
# Share of total sales
df_gender_purchase['gender_purchase_pro'] = df_gender_purchase['Purchase_amount'] / df['Purchase'].sum()
# Number of distinct users per gender
df_gender_purchase['gender_user_count'] = df_gender_purchase['Gender'].map(df.groupby('Gender')['User_ID'].nunique())
# Average spend per user
df_gender_purchase['gender_customer_price'] = df_gender_purchase['Purchase_amount'] / df_gender_purchase['gender_user_count']
# Share of all users
df_gender_purchase['gender_count_prop'] = df_gender_purchase['gender_user_count'] / df_gender_purchase['gender_user_count'].sum()
df_gender_purchase

	Gender	Purchase_amount	gender_purchase_pro	gender_user_count	gender_customer_price	gender_count_prop
0	F	1164624021	0.232105	1666	699054.034214	0.282804
1	M	3853044357	0.767895	4225	911963.161420	0.717196

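In the Black Friday event, men account for about 72% of users, roughly 2.5 times the number of women, and contribute about 77% of sales, 3.3 times the amount contributed by women. Clearly more men take part in the event, and their average spend per user is also higher than women's, so higher-priced products should be promoted to male users.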
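The same five quantities (sales, share of sales, distinct users, average spend per user, share of users) are computed for every user attribute in this section. One way to avoid repeating the code is a small helper; the sketch below is an illustration only, `segment_summary` and its output column names are hypothetical, and the toy rows are made up. It assumes each user belongs to exactly one segment, which is what makes the user shares sum to 1.

```python
import pandas as pd

def segment_summary(data: pd.DataFrame, by: str) -> pd.DataFrame:
    """Sales, distinct users, spend per user and shares for one segment column."""
    out = (
        data.groupby(by)
            .agg(Purchase_amount=("Purchase", "sum"),
                 user_count=("User_ID", "nunique"))
            .reset_index()
    )
    out["purchase_prop"] = out["Purchase_amount"] / out["Purchase_amount"].sum()
    out["customer_price"] = out["Purchase_amount"] / out["user_count"]
    out["user_prop"] = out["user_count"] / out["user_count"].sum()
    return out

# Toy data (made up) with the dataset's column names
toy = pd.DataFrame({
    "User_ID": [1, 1, 2, 3, 3, 4],
    "Gender": ["F", "F", "M", "M", "M", "M"],
    "Purchase": [100, 300, 200, 500, 100, 800],
})
summary = segment_summary(toy, "Gender")
print(summary)
```

Called as `segment_summary(df, "Age")` or `segment_summary(df, "Occupation")`, the helper would produce the same kind of table for the other attributes, with generic column names instead of the per-attribute ones used in this notebook.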

(2) Age

df_age_purchase = df.groupby('Age').agg({'Purchase': 'sum'}).reset_index().rename(columns={'Purchase': 'Purchase_amount'})
df_age_purchase['Purchase_amount_pro'] = df_age_purchase['Purchase_amount'] / df_age_purchase['Purchase_amount'].sum()
# Number of distinct users per age group
df_age_purchase['age_user_count'] = df_age_purchase['Age'].map(df.groupby('Age')['User_ID'].nunique())
df_age_purchase['age_user_count_pro'] = df_age_purchase['age_user_count'] / df['User_ID'].nunique()
df_age_purchase['age_customer_price'] = df_age_purchase['Purchase_amount'] / df_age_purchase['age_user_count']
df_age_purchase

	Age	Purchase_amount	Purchase_amount_pro	age_user_count	age_user_count_pro	age_customer_price
0	0-17	132659006	0.026438	218	0.037006	608527.550459
1	18-25	901669280	0.179699	1069	0.181463	843469.859682
2	26-35	1999749106	0.398542	2053	0.348498	974061.912323
3	36-45	1010649565	0.201418	1167	0.198099	866023.620394
4	46-50	413418223	0.082392	531	0.090137	778565.391714
5	51-55	361908356	0.072127	481	0.081650	752408.224532
6	55+	197614842	0.039384	372	0.063147	531222.693548

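Both the number of shoppers and the sales amount are concentrated in the 18-45 age range, which contributes nearly 80% of sales. The 26-35 group has both the most shoppers and the highest sales amount, so these are the users to focus promotions on.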
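The "nearly 80%" figure can be checked by adding up the three age groups from the table above. The snippet below is a quick arithmetic check with the shares copied by hand from that table; the variable names are illustrative.

```python
# Purchase_amount_pro and age_user_count_pro copied from the age table above
purchase_share = {"18-25": 0.179699, "26-35": 0.398542, "36-45": 0.201418}
user_share = {"18-25": 0.181463, "26-35": 0.348498, "36-45": 0.198099}

sales_18_45 = sum(purchase_share.values())  # share of sales from ages 18-45
users_18_45 = sum(user_share.values())      # share of users aged 18-45

print(f"sales: {sales_18_45:.1%}, users: {users_18_45:.1%}")
```

The 18-45 range comes to about 78% of sales and about 73% of users, so these groups also spend slightly more per user than the rest.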

(3) Marital status

df_Marital_purchase = df.groupby('Marital_Status').agg({'Purchase': 'sum'}).reset_index().rename(columns={'Purchase': 'Purchase_amount'})
df_Marital_purchase['Marital_purchase_prop'] = df_Marital_purchase['Purchase_amount'] / df['Purchase'].sum()
# Number of distinct users per marital status (0 = unmarried, 1 = married)
df_Marital_purchase['Marital_user_count'] = df_Marital_purchase['Marital_Status'].map(df.groupby('Marital_Status')['User_ID'].nunique())
df_Marital_purchase['Marital_customer_price'] = df_Marital_purchase['Purchase_amount'] / df_Marital_purchase['Marital_user_count']
df_Marital_purchase['Marital_count_prop'] = df_Marital_purchase['Marital_user_count'] / df['User_ID'].nunique()
df_Marital_purchase

	Marital_Status	Purchase_amount	Marital_purchase_prop	Marital_user_count	Marital_customer_price	Marital_count_prop
0	0	2966289500	0.591169	3417	868097.600234	0.580037
1	1	2051378878	0.408831	2474	829174.970897	0.419963

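Unmarried users exceed married users by roughly 40% in both sales amount (about 45% higher) and number of participants (about 38% higher).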
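As a quick check of the comparison above, the relative differences can be computed from the two rows of the marital-status table. The values below are copied by hand from that table; `sales_lift` and `user_lift` are illustrative names.

```python
# Rows copied from the marital-status table above (0 = unmarried, 1 = married)
unmarried = {"sales": 2966289500, "users": 3417}
married = {"sales": 2051378878, "users": 2474}

# Relative difference of the unmarried group over the married group
sales_lift = unmarried["sales"] / married["sales"] - 1
user_lift = unmarried["users"] / married["users"] - 1

print(f"sales: +{sales_lift:.1%}, users: +{user_lift:.1%}")
```

Because the sales difference is larger than the user difference, unmarried users also spend slightly more per user, which matches the `Marital_customer_price` column.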

(4) Combining gender and marital status to analyze sales across age groups

df["Gender_MaritalStatus"] = df["Gender"] + "_" + df["Marital_Status"].astype(str)

# Total sales and number of distinct users per (gender/marital status, age) group
df_Gender_MaritalStatus_purchase = df.groupby(["Gender_MaritalStatus", "Age"]).agg({"Purchase": "sum", "User_ID": "nunique"}).reset_index().rename(columns={"Purchase": "Purchase_amount", "User_ID": "Gender_MaritalStatus_user_count"})

# Average spend per user in each group
df_Gender_MaritalStatus_purchase["Gender_MaritalStatus_user_price"] = df_Gender_MaritalStatus_purchase["Purchase_amount"] / df_Gender_MaritalStatus_purchase["Gender_MaritalStatus_user_count"]
# Share of all users (the numerator is the user count, not the purchase amount)
df_Gender_MaritalStatus_purchase["Gender_MaritalStatus_count_prop"] = df_Gender_MaritalStatus_purchase["Gender_MaritalStatus_user_count"] / df["User_ID"].nunique()
df_Gender_MaritalStatus_purchase.head(5)

	Gender_MaritalStatus	Age	Purchase_amount	Gender_MaritalStatus_user_count	Gender_MaritalStatus_user_price	Gender_MaritalStatus_count_prop
0	F_0	0-17	41826615	78	536238.653846	0.013241
1	F_0	18-25	153305178	217	706475.474654	0.036836
2	F_0	26-35	254464648	320	795202.025000	0.054320
3	F_0	36-45	148392364	202	734615.663366	0.034290
4	F_0	46-50	27113309	49	553332.836735	0.008318

sns.barplot(x="Age",hue="Gender_MaritalStatus",y="Gender_MaritalStatus_user_count",data=df_Gender_MaritalStatus_purchase)

<matplotlib.axes._subplots.AxesSubplot at 0x1dbcba59748>

In the 26-35 age group, unmarried men make up the largest group of participants, and unmarried men in the 18-35 range also rank second by sales.

sns.barplot(x="Age",hue="Gender_MaritalStatus",y="Purchase_amount",data=df_Gender_MaritalStatus_purchase)

<matplotlib.axes._subplots.AxesSubplot at 0x1dbc5204978>

(5) Purchases by occupation

df_Occupation_purchase = df.groupby("Occupation").agg({"Purchase": "sum"}).reset_index().rename(columns={"Purchase": "Purchase_amount"})
df_Occupation_purchase["Occupation_purchase_prop"] = df_Occupation_purchase["Purchase_amount"] / df["Purchase"].sum()
# Number of distinct users per occupation
df_Occupation_purchase["Occupation_user_count"] = df_Occupation_purchase["Occupation"].map(df.groupby("Occupation")["User_ID"].nunique())
df_Occupation_purchase["Occupation_customer_price"] = df_Occupation_purchase["Purchase_amount"] / df_Occupation_purchase["Occupation_user_count"]
df_Occupation_purchase["Occupation_count_prop"] = df_Occupation_purchase["Occupation_user_count"] / df["User_ID"].nunique()
df_Occupation_purchase.sort_values(by="Occupation_user_count", ascending=False)

	Occupation	Purchase_amount	Occupation_purchase_prop	Occupation_user_count	Occupation_customer_price	Occupation_count_prop
4	4	657530393	0.131043	740	8.885546e+05	0.125615
0	0	625814811	0.124722	688	9.096146e+05	0.116788
7	7	549282744	0.109470	669	8.210504e+05	0.113563
1	1	414552829	0.082619	517	8.018430e+05	0.087761
17	17	387240355	0.077175	491	7.886769e+05	0.083347
12	12	300672105	0.059923	376	7.996599e+05	0.063826
14	14	255594745	0.050939	294	8.693699e+05	0.049907
20	20	292276985	0.058250	273	1.070612e+06	0.046342
2	2	233275393	0.046491	256	9.112320e+05	0.043456
16	16	234442330	0.046723	235	9.976269e+05	0.039891
6	6	185065697	0.036883	228	8.116917e+05	0.038703
10	10	114273954	0.022774	192	5.951768e+05	0.032592
3	3	160428450	0.031973	170	9.436968e+05	0.028858
13	13	71135744	0.014177	140	5.081125e+05	0.023765
15	15	116540026	0.023226	140	8.324288e+05	0.023765
11	11	105437359	0.021013	128	8.237294e+05	0.021728
5	5	112525355	0.022426	111	1.013742e+06	0.018842
9	9	53619309	0.010686	88	6.093103e+05	0.014938
19	19	73115489	0.014572	71	1.029796e+06	0.012052
18	18	60249706	0.012008	67	8.992493e+05	0.011373
8	8	14594599	0.002909	17	8.585058e+05	0.002886

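Occupations 4, 0, 7, and 1 together account for more than 40% of all users (about 44%), so these occupations are the ones we should focus on.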
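The combined share of the four largest occupations can be read off as a cumulative sum. The sketch below copies the `Occupation_count_prop` values of the top four rows from the table above into a small Series; the names `user_share` and `cumulative` are illustrative.

```python
import pandas as pd

# Occupation_count_prop of the four largest occupations, copied from the table above
user_share = pd.Series({4: 0.125615, 0: 0.116788, 7: 0.113563, 1: 0.087761})

# Running total of user share, largest occupation first
cumulative = user_share.sort_values(ascending=False).cumsum()
print(cumulative)
```

The last value of `cumulative` is the share of users covered by occupations 4, 0, 7, and 1 together, roughly 0.44.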

3. City contribution

df_City_Category_purchase = df.groupby("City_Category").agg({"Purchase": "sum"}).reset_index().rename(columns={"Purchase": "Purchase_amount"})
# Share of total sales per city category
df_City_Category_purchase["City_Category_purchase_prop"] = df_City_Category_purchase["Purchase_amount"] / df["Purchase"].sum()
# Number of distinct users per city category
df_City_Category_purchase["City_Category_user_count"] = df_City_Category_purchase["City_Category"].map(df.groupby("City_Category")["User_ID"].nunique())
df_City_Category_purchase["City_Category_customer_price"] = df_City_Category_purchase["Purchase_amount"] / df_City_Category_purchase["City_Category_user_count"]
df_City_Category_purchase["City_Category_count_prop"] = df_City_Category_purchase["City_Category_user_count"] / df["User_ID"].nunique()
df_City_Category_purchase

	City_Category	Purchase_amount	City_Category_purchase_prop	City_Category_user_count	City_Category_customer_price	City_Category_count_prop
0	A	1295668797	0.258221	1045	1.239874e+06	0.177389
1	B	2083431612	0.415219	1707	1.220522e+06	0.289764
2	C	1638567969	0.326560	3139	5.220032e+05	0.532847

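City category C accounts for 53% of participating users but contributes only about 33% of sales. By contrast, city category B accounts for 29% of users yet contributes about 42% of sales, and the average spend per user in categories A and B is more than twice that of category C. We can roughly infer that spending power is higher in A and B cities: for the next event, prices in A and B cities could be raised moderately, while prices in C cities could be lowered moderately to increase the sales amount through higher volume.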
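The "more than twice" comparison follows from the average spend per user in the city table. As a small worked check, with the values copied by hand from that table and illustrative variable names:

```python
# Average spend per user by city category, copied from the table above
customer_price = {"A": 1.239874e6, "B": 1.220522e6, "C": 5.220032e5}

# Ratio of A and B to C
ratio_a = customer_price["A"] / customer_price["C"]
ratio_b = customer_price["B"] / customer_price["C"]

print(round(ratio_a, 2), round(ratio_b, 2))
```

Both ratios are between 2.3 and 2.4, so a user in an A or B city spends well over twice as much as a user in a C city.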

4. Product analysis

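(1) Top 10 products by sales amount

# Note: despite its name, df_count10 holds the Top 10 products by total sales amount.
# User_count is the number of purchase records per product.
df_count10 = df.groupby("Product_ID").agg({"User_ID": "count", "Purchase": "sum"}).rename(columns={"Purchase": "Purchase_amount", "User_ID": "User_count"}).reset_index().sort_values(by=["Purchase_amount"], ascending=False)[["Product_ID", "Purchase_amount"]].head(10)
df_count10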

	Product_ID	Purchase_amount
249	P00025442	27532426
1014	P00110742	26382569
2441	P00255842	24652442
1743	P00184942	24060871
581	P00059442	23948299
1028	P00112142	23882624
1016	P00110942	23232538
2261	P00237542	23096487
565	P00057642	22493690
104	P00010742	21865042

(2) Top 10 products by sales volume

# Note: despite its name, df_amount10 holds the Top 10 products by number of purchase records.
df_amount10 = df.groupby("Product_ID").agg({"User_ID": "count", "Purchase": "sum"}).rename(columns={"Purchase": "Purchase_amount", "User_ID": "User_count"}).reset_index().sort_values(by=["User_count"], ascending=False)[["Product_ID", "User_count"]].head(10)
df_amount10

	Product_ID	User_count
2534	P00265242	1858
1014	P00110742	1591
249	P00025442	1586
1028	P00112142	1539
565	P00057642	1430
1743	P00184942	1424
458	P00046742	1417
568	P00058042	1396
1353	P00145042	1384
581	P00059442	1384

(3) Products in the Top 10 by both sales volume and sales amount

pd.merge(df_amount10,df_count10,left_on="Product_ID",right_on="Product_ID",how="inner")

	Product_ID	User_count	Purchase_amount
0	P00110742	1591	26382569
1	P00025442	1586	27532426
2	P00112142	1539	23882624
3	P00057642	1430	22493690
4	P00184942	1424	24060871
5	P00059442	1384	23948299
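The inner merge above keeps exactly the products that appear in both Top 10 lists. As a cross-check that does not need the full dataset, the sketch below copies the product IDs from the two Top 10 tables above into plain Python lists and intersects them; the names `top_by_amount`, `top_by_count`, and `in_both` are illustrative.

```python
# Product IDs copied from the Top 10 by sales amount table above
top_by_amount = [
    "P00025442", "P00110742", "P00255842", "P00184942", "P00059442",
    "P00112142", "P00110942", "P00237542", "P00057642", "P00010742",
]
# Product IDs copied from the Top 10 by sales volume table above
top_by_count = [
    "P00265242", "P00110742", "P00025442", "P00112142", "P00057642",
    "P00184942", "P00046742", "P00058042", "P00145042", "P00059442",
]

# Keep the volume ranking order, as the merge with df_amount10 on the left does
amount_set = set(top_by_amount)
in_both = [p for p in top_by_count if p in amount_set]
print(len(in_both), in_both)
```

Six products appear in both lists, in the same order as the merged table.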

# Sales amount and share of total sales per first-level product category
df_amount = df.groupby("Product_Category_1").agg({"User_ID": "count", "Purchase": "sum"}).rename(columns={"Purchase": "Purchase_amount", "User_ID": "User_count"}).reset_index().sort_values(by=["Purchase_amount"], ascending=False)[["Product_Category_1", "Purchase_amount"]]
df_amount["Category_Prop"] = df_amount["Purchase_amount"] / df["Purchase"].sum()
df_amount

	Product_Category_1	Purchase_amount	Category_Prop
0	1	1882666325	0.375207
4	5	926917497	0.184731
7	8	840693394	0.167547
5	6	319355286	0.063646
1	2	264497242	0.052713
2	3	200412211	0.039941
15	16	143168035	0.028533
10	11	112203088	0.022362
9	10	99029631	0.019736
14	15	91658147	0.018267
6	7	60059209	0.011970
3	4	26937957	0.005369
13	14	19718178	0.003930
17	18	9149071	0.001823
8	9	6277472	0.001251
16	17	5758702	0.001148
11	12	5235883	0.001043
12	13	3931050	0.000783
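The category shares in the table above can be summarized with `nlargest` and `nsmallest`. The sketch below copies the `Category_Prop` column by hand into a Series indexed by `Product_Category_1`; the names `category_share`, `top3_share`, and `bottom3` are illustrative.

```python
import pandas as pd

# Category_Prop by Product_Category_1, copied from the table above
category_share = pd.Series({
    1: 0.375207, 5: 0.184731, 8: 0.167547, 6: 0.063646, 2: 0.052713,
    3: 0.039941, 16: 0.028533, 11: 0.022362, 10: 0.019736, 15: 0.018267,
    7: 0.011970, 4: 0.005369, 14: 0.003930, 18: 0.001823, 9: 0.001251,
    17: 0.001148, 12: 0.001043, 13: 0.000783,
})

top3_share = category_share.nlargest(3).sum()         # combined share of the three largest categories
bottom3 = list(category_share.nsmallest(3).index)     # three smallest categories by sales amount

print(round(top3_share, 4), bottom3)
```

By sales amount, categories 1, 5, and 8 together contribute about 73% of sales, and the three smallest categories are 13, 12, and 17.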

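5. Summary

1. User perspective

Conclusions: unmarried male shoppers aged 26-35 with occupation codes "4", "0", "7", and "1" are the high-spending group and the platform's most loyal users;
Follow-up improvements: 1) focus on high-value users with more fine-grained marketing, and offer them more high-value products going forward;
2) for other users, mainly guide them to click and buy by recommending more best-selling products;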

2. Product perspective

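Conclusions: 1) during Black Friday, first-level categories 5, 1, and 8 rank in the top 3 by both sales volume and sales amount, the Top 10 most popular products also include products from these three categories, and together these three categories contribute about 73% of sales;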
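2) by sales amount (the only category ranking shown above), the three lowest-ranked categories are 13, 12, and 17, each contributing less than 0.3% of sales;
3) products that are in the Top 10 by both sales amount and sales volume are bestsellers, and their display positions can be used to drive traffic to other products.
Follow-up improvements: 1) bundle the Top 10 most popular products with related products to lift sales of the other products; on the pages of products in first-level categories 5, 1, and 8, recommend other products to guide users to click and buy;
2) look more closely at why the three lowest-ranked categories perform poorly; if categories 13, 12, and 17 are obsolete products or have lost the market to substitutes, consider delisting them and cutting advertising in the related channels;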

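3. City perspective

Conclusions: the best-selling first-level categories, ranked by sales amount, are 1, 5, and 8. Warehouse management should plan inventory according to the list and categories of best-selling products, stock up in advance for city category B, where spending is strongest, to reduce redistribution, and monitor inventory to prevent stock-outs.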