[Data Analysis and Mining] Tmall Repeat-Purchase Prediction in Practice (with Code and Dataset)
1. Background
Merchants sometimes run large promotions on particular dates (for example Boxing Day, Black Friday, or "Double 11", November 11) or hand out coupons to attract shoppers. However, many of the buyers drawn in this way are one-time bargain hunters, so these promotions may contribute little to long-term sales growth. To address this, merchants need to identify which shoppers can be converted into repeat buyers. By targeting these potentially loyal customers, merchants can greatly reduce promotion costs and improve return on investment (ROI). It is well known that precisely targeting customers in online advertising is difficult, especially new customers; however, the long-term user behavior logs accumulated by Tmall may help solve this problem.
We are given information about a set of merchants, together with the new customers who bought from them during "Double 11". Your task is to predict which of these new customers will become loyal to a given merchant, i.e., to predict the probability that each new customer buys from the same merchant again within six months.
Dataset size: 500 MB+
2. Data Description
The dataset contains anonymized users' shopping logs for the six months before "Double 11" and on "Double 11" itself, labeled with whether each user is a repeat buyer. To protect privacy the data is sampled with some bias, so statistics computed on it will deviate somewhat from Tmall's actual figures, but this does not affect the applicability of the solution. The training and test sets are in data_format1.zip; the fields, as used throughout the code below, are:
- user_log_format1.csv: user_id, item_id, cat_id, seller_id, brand_id, time_stamp (mmdd), action_type (0 = click, 1 = add-to-cart, 2 = purchase, 3 = add-to-favorite)
- user_info_format1.csv: user_id, age_range (0/NULL = unknown), gender (0 = female, 1 = male, 2/NULL = unknown)
- train_format1.csv / test_format1.csv: user_id, merchant_id, label (1 = repeat buyer, 0 = not; empty in the test set, to be predicted as prob)
3. Data Exploration
3.1 Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
3.2 Loading the Data
"""
Load the datasets
"""
test_data = pd.read_csv('./data_format1/test_format1.csv')
train_data = pd.read_csv('./data_format1/train_format1.csv')
user_info = pd.read_csv('./data_format1/user_info_format1.csv')
user_log = pd.read_csv('./data_format1/user_log_format1.csv')
#user_info = pd.read_csv('./data_format1/user_info_format1.csv').drop_duplicates()
#user_log = pd.read_csv('./data_format1/user_log_format1.csv').rename(columns={"seller_id":'merchant_id'})
Preview a few rows of each dataset
train_data.head(5)
test_data.head(5)
user_info.head(5)
user_log.head(5)
3.3 Univariate Analysis
3.3.1 Data Types and Sizes (info)
User profile data (user_info)
- 2 float64 columns and 1 int64 column
- about 9.7 MB in memory
- 424,170 rows
User behavior log (user_log)
- 6 int64 columns and 1 float64 column
- about 2.9 GB in memory (see the downcasting sketch after this list)
- 54,925,330 rows
Training data (train_data)
- all columns are int64
- about 6 MB in memory
- 260,864 rows
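At 2.9 GB, the behavior log dominates memory. A minimal mitigation sketch is to downcast column dtypes when reading the CSV; the widths chosen below are assumptions based on this dataset's typical value ranges, not part of the original.

# Hedged sketch: shrink the user log by downcasting dtypes at read time.
dtypes = {'user_id': 'int32', 'item_id': 'int32', 'cat_id': 'int16',
          'seller_id': 'int16', 'brand_id': 'float32',  # float32: brand_id contains NaN
          'time_stamp': 'int16', 'action_type': 'int8'}
user_log = pd.read_csv('./data_format1/user_log_format1.csv', dtype=dtypes)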
3.3.2 Missing Values
3.3.2.1 Missing Values in the User Profile Data
Missing ages
# proportion of missing ages
(user_info.shape[0]-user_info['age_range'].count())/user_info.shape[0]
# number of records where age is missing or equal to the default 0
user_info[user_info['age_range'].isna() | (user_info['age_range']==0)].count()
# counts per age group
user_info.groupby(['age_range'])['user_id'].count()
1. The null rate for age is 0.5%.
2. Counting records where age is null or equals the default value 0,
3. there are 95,131 such records in total.
Missing genders
1. The null rate for gender is 1.5%.
2. Counting records where gender is null or equals the default value 2 (unknown),
3. there are 16,862 such records in total.
(user_info.shape[0]-user_info['gender'].count())/user_info.shape[0]
user_info[user_info['gender'].isna() | (user_info['gender'] == 2)].count()
user_info.groupby(['gender'])[['user_id']].count()
3.3.2.2 Missing Values in the User Behavior Log
print(user_log.isnull().sum())
3.4 Data Distributions
3.4.1 Overall Summary Statistics
user_info.describe()
user_log.describe()
3.4.2 Distribution of Positive and Negative Samples
label_gp=train_data.groupby('label')['user_id'].count()
print('Counts of positive/negative samples:\n', label_gp)
fig = plt.figure(figsize=(12,6))
ax1=plt.subplot(1,2,1)
train_data['label'].value_counts().plot(kind='pie',autopct='%1.1f%%',shadow=True,explode=[0,0.1],ax=ax1)
ax2=plt.subplot(1,2,2)
sns.countplot('label',data=train_data,ax=ax2)
The plots show that the positive and negative samples are highly imbalanced, so some countermeasure is needed (a minimal sketch follows this list):
- undersampling-style ensembling: pair the full set of positive samples with several sampled subsets of negative samples to form multiple training sets, train a model on each, and average the predictions
- adjusting the class weights of the model
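A minimal sketch of both ideas, assuming the train_data loaded above (the choice of RandomForestClassifier and of 5 subsets is illustrative, not part of the original):

# 1. class weighting: most sklearn classifiers accept class_weight='balanced'
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, class_weight='balanced', n_jobs=-1)

# 2. undersampling ensemble: pair all positives with several random negative subsets;
#    train one model per subset and average their predicted probabilities
pos = train_data[train_data['label'] == 1]
neg = train_data[train_data['label'] == 0]
subsets = [pd.concat([pos, neg.sample(len(pos), random_state=i)]) for i in range(5)]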
3.5 Effects of Merchant, User, Gender, and Age on Repeat Purchase
3.5.1 Repeat Purchase by Merchant
print('Top 5 merchants\nmerchant\tpurchase count')
print(train_data['merchant_id'].value_counts().head(5))
train_data_merchant = train_data.copy()
train_data_merchant['TOP5'] = train_data_merchant['merchant_id'].map(lambda x: 1 if x in [4044,3828,4173,1102,4976] else 0)
train_data_merchant = train_data_merchant[train_data_merchant['TOP5']==1]
plt.figure(figsize=(8,6))
plt.title('Merchant VS Label')
ax = sns.countplot('merchant_id',hue='label',data=train_data_merchant)
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width() / 2, height, height, ha='center')  # annotate each bar with its count
The plot shows that repeat-purchase rates differ across merchants, likely related both to what each merchant sells and to how the shop is run.
3.5.2 Distribution of Merchants' Repeat-Purchase Probability
merchant_repeat_buy=[rate for rate in train_data.groupby('merchant_id')['label'].mean() if rate<=1 and rate>0]
plt.figure(figsize=(8,4))
ax=plt.subplot(121)
sns.distplot(merchant_repeat_buy,fit=stats.norm)
ax=plt.subplot(1,2,2)
res = stats.probplot(merchant_repeat_buy, plot=plt)
Repeat-purchase rates again vary across merchants, falling roughly between 0 and 0.3.
3.5.3 Distribution of Users' Repeat-Purchase Probability
user_repeat_buy = [rate for rate in train_data.groupby(['user_id'])['label'].mean() if rate <= 1 and rate > 0]
plt.figure(figsize=(8,6))
ax=plt.subplot(1,2,1)
sns.distplot(user_repeat_buy, fit=stats.norm)
ax=plt.subplot(1,2,2)
res = stats.probplot(user_repeat_buy, plot=plt)
3.5.4 Repeat Purchase by Gender
train_data_user_info = train_data.merge(user_info,on=['user_id'],how='left')
plt.figure(figsize=(8,8))
plt.title('Gender VS Label')
ax = sns.countplot('gender',hue='label',data=train_data_user_info)
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width() / 2, height, height, ha='center')  # annotate each bar with its count
3.5.5 Distribution of Repeat Purchase by Gender
repeat_buy=[rate for rate in train_data_user_info.groupby(['gender'])['label'].mean()]
ax=plt.subplot(1,2,1)
sns.distplot(repeat_buy,fit=stats.norm)
ax=plt.subplot(1,2,2)
res = stats.probplot(repeat_buy, plot=plt)
Male and female repeat-purchase rates clearly differ.
3.5.6 Repeat Purchase by Age
plt.figure(figsize=(8,8))
plt.title('Age VS Label')
ax = sns.countplot('age_range',hue='label',data=train_data_user_info)
3.5.7 Distribution of Repeat Purchase by Age
repeat_buy = [rate for rate in train_data_user_info.groupby(['age_range'])['label'].mean()]
plt.figure(figsize=(8,4))
ax=plt.subplot(1,2,1)
sns.distplot(repeat_buy, fit=stats.norm)
ax=plt.subplot(1,2,2)
res = stats.probplot(repeat_buy, plot=plt)
4. Feature Engineering
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import gc
from collections import Counter
import copy
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
4.1 Merging User Profiles
del test_data['prob']
all_data = train_data.append(test_data)  # in pandas >= 2.0 use pd.concat([train_data, test_data])
all_data = all_data.merge(user_info,on=['user_id'],how='left')
del train_data, test_data, user_info
gc.collect()
all_data.head()
4.2 Sorting the Behavior Log by Time
"""
Sort by time
"""
user_log = user_log.sort_values(['user_id','time_stamp'])
user_log.head()
4.3 Concatenating Each User's item_id, cat_id, seller_id, brand_id, time_stamp, and action_type Fields
"""
Concatenate the per-user log fields into path strings
"""
list_join_func = lambda x: " ".join([str(i) for i in x])

agg_dict = {'item_id': list_join_func,
            'cat_id': list_join_func,
            'seller_id': list_join_func,
            'brand_id': list_join_func,
            'time_stamp': list_join_func,
            'action_type': list_join_func}

rename_dict = {'item_id': 'item_path',
               'cat_id': 'cat_path',
               'seller_id': 'seller_path',
               'brand_id': 'brand_path',
               'time_stamp': 'time_stamp_path',
               'action_type': 'action_type_path'}
user_log_path = user_log.groupby('user_id').agg(agg_dict).reset_index().rename(columns=rename_dict)
user_log_path.head()
all_data_path = all_data.merge(user_log_path,on='user_id')
all_data_path.head()
4.4 Defining Statistics Helper Functions
4.4.1 Count of Values
def cnt_(x):
    try:
        return len(x.split(' '))
    except:
        return -1
4.4.2 Count of Unique Values
def nunique_(x):
    try:
        return len(set(x.split(' ')))
    except:
        return -1
4.4.3 Maximum
def max_(x):
    try:
        return np.max([int(i) for i in x.split(' ')])
    except:
        return -1
4.4.4 Minimum
def min_(x):
    try:
        return np.min([int(i) for i in x.split(' ')])
    except:
        return -1
4.4.5 Standard Deviation
def std_(x):
    try:
        return np.std([float(i) for i in x.split(' ')])
    except:
        return -1
4.4.6 Top-N Values
def most_n(x, n):
    # value of the n-th most common element
    try:
        return Counter(x.split(' ')).most_common(n)[n-1][0]
    except:
        return -1

def most_n_cnt(x, n):
    # count of the n-th most common element
    try:
        return Counter(x.split(' ')).most_common(n)[n-1][1]
    except:
        return -1
### wrappers that apply the statistics above to DataFrame columns
def user_cnt(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(cnt_)
    return df_data

def user_nunique(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(nunique_)
    return df_data

def user_max(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(max_)
    return df_data

def user_min(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(min_)
    return df_data

def user_std(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(std_)
    return df_data

def user_most_n(df_data, single_col, name, n=1):
    func = lambda x: most_n(x, n)
    df_data[name] = df_data[single_col].apply(func)
    return df_data

def user_most_n_cnt(df_data, single_col, name, n=1):
    func = lambda x: most_n_cnt(x, n)
    df_data[name] = df_data[single_col].apply(func)
    return df_data
4.5 Extracting Basic Statistical Features
"""提取基本统计特征
"""
all_data_test = all_data_path.head(2000)
#all_data_test = all_data_path
# statistics over the user's click / view / add-to-cart / purchase behavior
# total number of actions
all_data_test = user_cnt(all_data_test, 'seller_path', 'user_cnt')
# number of distinct merchants
all_data_test = user_nunique(all_data_test, 'seller_path', 'seller_nunique')
# number of distinct categories
all_data_test = user_nunique(all_data_test, 'cat_path', 'cat_nunique')
# number of distinct brands
all_data_test = user_nunique(all_data_test, 'brand_path', 'brand_nunique')
# number of distinct items
all_data_test = user_nunique(all_data_test, 'item_path', 'item_nunique')
# number of active days
all_data_test = user_nunique(all_data_test, 'time_stamp_path', 'time_stamp_nunique')
# number of distinct action types
all_data_test = user_nunique(all_data_test, 'action_type_path', 'action_type_nunique')
all_data_test.head()
# ....
# latest active time
all_data_test = user_max(all_data_test, 'time_stamp_path', 'time_stamp_max')
# earliest active time
all_data_test = user_min(all_data_test, 'time_stamp_path', 'time_stamp_min')
# standard deviation of active times
all_data_test = user_std(all_data_test, 'time_stamp_path', 'time_stamp_std')
# gap between earliest and latest activity
all_data_test['time_stamp_range'] = all_data_test['time_stamp_max'] - all_data_test['time_stamp_min']
# user's favorite merchant
all_data_test = user_most_n(all_data_test, 'seller_path', 'seller_most_1', n=1)
# favorite category
all_data_test = user_most_n(all_data_test, 'cat_path', 'cat_most_1', n=1)
# favorite brand
all_data_test = user_most_n(all_data_test, 'brand_path', 'brand_most_1', n=1)
# most common action type
all_data_test = user_most_n(all_data_test, 'action_type_path', 'action_type_1', n=1)
# .....
# number of actions at the user's favorite merchant
all_data_test = user_most_n_cnt(all_data_test, 'seller_path', 'seller_most_1_cnt', n=1)
# number of actions in the favorite category
all_data_test = user_most_n_cnt(all_data_test, 'cat_path', 'cat_most_1_cnt', n=1)
# number of actions on the favorite brand
all_data_test = user_most_n_cnt(all_data_test, 'brand_path', 'brand_most_1_cnt', n=1)
# count of the most common action type
all_data_test = user_most_n_cnt(all_data_test, 'action_type_path', 'action_type_1_cnt', n=1)
# .....
4.6 Separate Statistics for Clicks, Add-to-Cart, Purchases, and Favorites
# count clicks, add-to-cart, purchases, and favorites separately
"""
Basic feature-statistics functions
-- Knowledge point 2
-- business logic keyed on the different action types
-- extract separate features for each
"""
def col_cnt_(df_data, columns_list, action_type):
    # count the positions in the path whose action matches the given action type
    try:
        data_dict = {}
        col_list = copy.deepcopy(columns_list)
        if action_type != None:
            col_list += ['action_type_path']
        for col in col_list:
            data_dict[col] = df_data[col].split(' ')
            path_len = len(data_dict[col])
        data_out = []
        for i_ in range(path_len):
            data_txt = ''
            for col_ in columns_list:
                if data_dict['action_type_path'][i_] == action_type:
                    data_txt += '_' + data_dict[col_][i_]
            if data_txt != '':  # keep only positions whose action matches
                data_out.append(data_txt)
        return len(data_out)
    except:
        return -1

def col_nunique_(df_data, columns_list, action_type):
    # count distinct column-value combinations for the given action type
    try:
        data_dict = {}
        col_list = copy.deepcopy(columns_list)
        if action_type != None:
            col_list += ['action_type_path']
        for col in col_list:
            data_dict[col] = df_data[col].split(' ')
            path_len = len(data_dict[col])
        data_out = []
        for i_ in range(path_len):
            data_txt = ''
            for col_ in columns_list:
                if data_dict['action_type_path'][i_] == action_type:
                    data_txt += '_' + data_dict[col_][i_]
            if data_txt != '':
                data_out.append(data_txt)
        return len(set(data_out))
    except:
        return -1

def user_col_cnt(df_data, columns_list, action_type, name):
    df_data[name] = df_data.apply(lambda x: col_cnt_(x, columns_list, action_type), axis=1)
    return df_data

def user_col_nunique(df_data, columns_list, action_type, name):
    df_data[name] = df_data.apply(lambda x: col_nunique_(x, columns_list, action_type), axis=1)
    return df_data
4.7 Counting How Often Each User Clicked, Carted, Purchased, and Favorited Merchants
# click count
all_data_test = user_col_cnt(all_data_test, ['seller_path'], '0', 'user_cnt_0')
# add-to-cart count
all_data_test = user_col_cnt(all_data_test, ['seller_path'], '1', 'user_cnt_1')
# purchase count
all_data_test = user_col_cnt(all_data_test, ['seller_path'], '2', 'user_cnt_2')
# favorite count
all_data_test = user_col_cnt(all_data_test, ['seller_path'], '3', 'user_cnt_3')
# number of distinct merchants clicked
all_data_test = user_col_nunique(all_data_test, ['seller_path'], '0', 'seller_nunique_0')
# ....
4.8 Combined Features
# click count over merchant-item combinations
all_data_test = user_col_cnt(all_data_test, ['seller_path', 'item_path'], '0', 'user_cnt_0')
# number of distinct merchant-item combinations clicked
all_data_test = user_col_nunique(all_data_test, ['seller_path', 'item_path'], '0', 'seller_nunique_0')
all_data_test.columns
list(all_data_test.columns)
# ....
Extracting features with CountVectorizer and TF-IDF
"""
-- Knowledge point 4
-- extract features with CountVectorizer and TF-IDF
"""
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS
from scipy import sparse
# cntVec = CountVectorizer(stop_words=ENGLISH_STOP_WORDS, ngram_range=(1, 1), max_features=100)
tfidfVec = TfidfVectorizer(stop_words=ENGLISH_STOP_WORDS, ngram_range=(1, 1), max_features=100)
# columns_list = ['seller_path', 'cat_path', 'brand_path', 'action_type_path', 'item_path', 'time_stamp_path']
columns_list = ['seller_path']
for i, col in enumerate(columns_list):
    all_data_test[col] = all_data_test[col].astype(str)
    tfidfVec.fit(all_data_test[col])
    data_ = tfidfVec.transform(all_data_test[col])
    if i == 0:
        data_cat = data_
    else:
        data_cat = sparse.hstack((data_cat, data_))
4.9 Renaming and Merging Features
df_tfidf = pd.DataFrame(data_cat.toarray())
df_tfidf.columns = ['tfidf_' + str(i) for i in df_tfidf.columns]
all_data_test = pd.concat([all_data_test, df_tfidf],axis=1)
Embedding features
import gensim

# Train a Word2Vec model over users' merchant paths
# (gensim < 4.0 API; in gensim >= 4.0 the `size` argument is named `vector_size`)
model = gensim.models.Word2Vec(all_data_test['seller_path'].apply(lambda x: x.split(' ')),
                               size=100, window=5, min_count=5, workers=4)
# model.save("product2vec.model")
# model = gensim.models.Word2Vec.load("product2vec.model")

def mean_w2v_(x, model, size=100):
    # average the word2vec vectors of all tokens in a path string
    try:
        i = 0
        for word in x.split(' '):
            if word in model.wv.vocab:  # gensim >= 4.0: `word in model.wv`
                i += 1
                if i == 1:
                    vec = np.zeros(size)
                vec += model.wv[word]
        return vec / i
    except:
        return np.zeros(size)

def get_mean_w2v(df_data, columns, model, size):
    data_array = []
    for index, row in df_data.iterrows():
        w2v = mean_w2v_(row[columns], model, size)
        data_array.append(w2v)
    return pd.DataFrame(data_array)

df_embeeding = get_mean_w2v(all_data_test, 'seller_path', model, 100)
df_embeeding.columns = ['embeeding_' + str(i) for i in df_embeeding.columns]
all_data_test = pd.concat([all_data_test, df_embeeding], axis=1)
Stacking features
"""
-- Knowledge point 6
-- stacking features
"""
# from sklearn.cross_validation import KFold
from sklearn.model_selection import KFold
import pandas as pd
import numpy as np
from scipy import sparse
import xgboost
import lightgbm
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier,GradientBoostingClassifier,ExtraTreesClassifier
from sklearn.ensemble import RandomForestRegressor,AdaBoostRegressor,GradientBoostingRegressor,ExtraTreesRegressor
from sklearn.linear_model import LinearRegression,LogisticRegression
from sklearn.svm import LinearSVC,SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss,mean_absolute_error,mean_squared_error
from sklearn.naive_bayes import MultinomialNB,GaussianNB
"""
-- Regression
-- stacking regression features
"""
def stacking_reg(clf, train_x, train_y, test_x, clf_name, kf, label_split=None):
    # out-of-fold predictions for the training set; fold-averaged predictions for the test set
    train = np.zeros((train_x.shape[0], 1))
    test = np.zeros((test_x.shape[0], 1))
    test_pre = np.empty((folds, test_x.shape[0], 1))  # relies on the global `folds`
    cv_scores = []
    for i, (train_index, test_index) in enumerate(kf.split(train_x, label_split)):
        tr_x = train_x[train_index]
        tr_y = train_y[train_index]
        te_x = train_x[test_index]
        te_y = train_y[test_index]
        if clf_name in ["rf", "ada", "gb", "et", "lr"]:
            clf.fit(tr_x, tr_y)
            pre = clf.predict(te_x).reshape(-1, 1)
            train[test_index] = pre
            test_pre[i, :] = clf.predict(test_x).reshape(-1, 1)
            cv_scores.append(mean_squared_error(te_y, pre))
        elif clf_name in ["xgb"]:
            train_matrix = clf.DMatrix(tr_x, label=tr_y, missing=-1)
            test_matrix = clf.DMatrix(te_x, label=te_y, missing=-1)
            z = clf.DMatrix(test_x, missing=-1)
            params = {'booster': 'gbtree',
                      'eval_metric': 'rmse',
                      'gamma': 1,
                      'min_child_weight': 1.5,
                      'max_depth': 5,
                      'lambda': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'eta': 0.03,
                      'tree_method': 'exact',
                      'seed': 2017,
                      'nthread': 12}
            num_round = 10000
            early_stopping_rounds = 100
            watchlist = [(train_matrix, 'train'), (test_matrix, 'eval')]
            if test_matrix:
                model = clf.train(params, train_matrix, num_boost_round=num_round,
                                  evals=watchlist, early_stopping_rounds=early_stopping_rounds)
                pre = model.predict(test_matrix, ntree_limit=model.best_ntree_limit).reshape(-1, 1)
                train[test_index] = pre
                test_pre[i, :] = model.predict(z, ntree_limit=model.best_ntree_limit).reshape(-1, 1)
                cv_scores.append(mean_squared_error(te_y, pre))
        elif clf_name in ["lgb"]:
            train_matrix = clf.Dataset(tr_x, label=tr_y)
            test_matrix = clf.Dataset(te_x, label=te_y)
            params = {'boosting_type': 'gbdt',
                      'objective': 'regression_l2',
                      'metric': 'mse',
                      'min_child_weight': 1.5,
                      'num_leaves': 2**5,
                      'lambda_l2': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'learning_rate': 0.03,
                      'seed': 2017,
                      'nthread': 12,
                      'silent': True}
            num_round = 10000
            early_stopping_rounds = 100
            if test_matrix:
                model = clf.train(params, train_matrix, num_round, valid_sets=test_matrix,
                                  early_stopping_rounds=early_stopping_rounds)
                pre = model.predict(te_x, num_iteration=model.best_iteration).reshape(-1, 1)
                train[test_index] = pre
                test_pre[i, :] = model.predict(test_x, num_iteration=model.best_iteration).reshape(-1, 1)
                cv_scores.append(mean_squared_error(te_y, pre))
        else:
            raise IOError("Please add new clf.")
        print("%s now score is:" % clf_name, cv_scores)
    test[:] = test_pre.mean(axis=0)
    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    return train.reshape(-1, 1), test.reshape(-1, 1)

def rf_reg(x_train, y_train, x_valid, kf, label_split=None):
    randomforest = RandomForestRegressor(n_estimators=600, max_depth=20, n_jobs=-1,
                                         random_state=2017, max_features="auto", verbose=1)
    rf_train, rf_test = stacking_reg(randomforest, x_train, y_train, x_valid, "rf", kf, label_split=label_split)
    return rf_train, rf_test, "rf_reg"

def ada_reg(x_train, y_train, x_valid, kf, label_split=None):
    adaboost = AdaBoostRegressor(n_estimators=30, random_state=2017, learning_rate=0.01)
    ada_train, ada_test = stacking_reg(adaboost, x_train, y_train, x_valid, "ada", kf, label_split=label_split)
    return ada_train, ada_test, "ada_reg"

def gb_reg(x_train, y_train, x_valid, kf, label_split=None):
    gbdt = GradientBoostingRegressor(learning_rate=0.04, n_estimators=100, subsample=0.8,
                                     random_state=2017, max_depth=5, verbose=1)
    gbdt_train, gbdt_test = stacking_reg(gbdt, x_train, y_train, x_valid, "gb", kf, label_split=label_split)
    return gbdt_train, gbdt_test, "gb_reg"

def et_reg(x_train, y_train, x_valid, kf, label_split=None):
    extratree = ExtraTreesRegressor(n_estimators=600, max_depth=35, max_features="auto",
                                    n_jobs=-1, random_state=2017, verbose=1)
    et_train, et_test = stacking_reg(extratree, x_train, y_train, x_valid, "et", kf, label_split=label_split)
    return et_train, et_test, "et_reg"

def lr_reg(x_train, y_train, x_valid, kf, label_split=None):
    lr_reg = LinearRegression(n_jobs=-1)
    lr_train, lr_test = stacking_reg(lr_reg, x_train, y_train, x_valid, "lr", kf, label_split=label_split)
    return lr_train, lr_test, "lr_reg"

def xgb_reg(x_train, y_train, x_valid, kf, label_split=None):
    # the xgboost module itself is passed in so stacking_reg can call its DMatrix/train API
    xgb_train, xgb_test = stacking_reg(xgboost, x_train, y_train, x_valid, "xgb", kf, label_split=label_split)
    return xgb_train, xgb_test, "xgb_reg"

def lgb_reg(x_train, y_train, x_valid, kf, label_split=None):
    lgb_train, lgb_test = stacking_reg(lightgbm, x_train, y_train, x_valid, "lgb", kf, label_split=label_split)
    return lgb_train, lgb_test, "lgb_reg"
Stacking classification features
"""
-- Classification
-- stacking classification features
"""
def stacking_clf(clf, train_x, train_y, test_x, clf_name, kf, label_split=None):
    train = np.zeros((train_x.shape[0], 1))
    test = np.zeros((test_x.shape[0], 1))
    test_pre = np.empty((folds, test_x.shape[0], 1))  # relies on the global `folds`
    cv_scores = []
    for i, (train_index, test_index) in enumerate(kf.split(train_x, label_split)):
        tr_x = train_x[train_index]
        tr_y = train_y[train_index]
        te_x = train_x[test_index]
        te_y = train_y[test_index]
        if clf_name in ["rf", "ada", "gb", "et", "lr", "knn", "gnb"]:
            clf.fit(tr_x, tr_y)
            pre = clf.predict_proba(te_x)
            train[test_index] = pre[:, 0].reshape(-1, 1)
            test_pre[i, :] = clf.predict_proba(test_x)[:, 0].reshape(-1, 1)
            cv_scores.append(log_loss(te_y, pre[:, 0].reshape(-1, 1)))
        elif clf_name in ["xgb"]:
            train_matrix = clf.DMatrix(tr_x, label=tr_y, missing=-1)
            test_matrix = clf.DMatrix(te_x, label=te_y, missing=-1)
            z = clf.DMatrix(test_x)
            params = {'booster': 'gbtree',
                      'objective': 'multi:softprob',
                      'eval_metric': 'mlogloss',
                      'gamma': 1,
                      'min_child_weight': 1.5,
                      'max_depth': 5,
                      'lambda': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'eta': 0.03,
                      'tree_method': 'exact',
                      'seed': 2017,
                      'num_class': 2}
            num_round = 10000
            early_stopping_rounds = 100
            watchlist = [(train_matrix, 'train'), (test_matrix, 'eval')]
            if test_matrix:
                model = clf.train(params, train_matrix, num_boost_round=num_round,
                                  evals=watchlist, early_stopping_rounds=early_stopping_rounds)
                pre = model.predict(test_matrix, ntree_limit=model.best_ntree_limit)
                train[test_index] = pre[:, 0].reshape(-1, 1)
                test_pre[i, :] = model.predict(z, ntree_limit=model.best_ntree_limit)[:, 0].reshape(-1, 1)
                cv_scores.append(log_loss(te_y, pre[:, 0].reshape(-1, 1)))
        elif clf_name in ["lgb"]:
            train_matrix = clf.Dataset(tr_x, label=tr_y)
            test_matrix = clf.Dataset(te_x, label=te_y)
            params = {'boosting_type': 'gbdt',
                      # 'boosting_type': 'dart',
                      'objective': 'multiclass',
                      'metric': 'multi_logloss',
                      'min_child_weight': 1.5,
                      'num_leaves': 2**5,
                      'lambda_l2': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'learning_rate': 0.03,
                      'seed': 2017,
                      'num_class': 2,
                      'silent': True}
            num_round = 10000
            early_stopping_rounds = 100
            if test_matrix:
                model = clf.train(params, train_matrix, num_round, valid_sets=test_matrix,
                                  early_stopping_rounds=early_stopping_rounds)
                pre = model.predict(te_x, num_iteration=model.best_iteration)
                train[test_index] = pre[:, 0].reshape(-1, 1)
                test_pre[i, :] = model.predict(test_x, num_iteration=model.best_iteration)[:, 0].reshape(-1, 1)
                cv_scores.append(log_loss(te_y, pre[:, 0].reshape(-1, 1)))
        else:
            raise IOError("Please add new clf.")
        print("%s now score is:" % clf_name, cv_scores)
    test[:] = test_pre.mean(axis=0)
    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    return train.reshape(-1, 1), test.reshape(-1, 1)

def rf_clf(x_train, y_train, x_valid, kf, label_split=None):
    randomforest = RandomForestClassifier(n_estimators=1200, max_depth=20, n_jobs=-1,
                                          random_state=2017, max_features="auto", verbose=1)
    rf_train, rf_test = stacking_clf(randomforest, x_train, y_train, x_valid, "rf", kf, label_split=label_split)
    return rf_train, rf_test, "rf"

def ada_clf(x_train, y_train, x_valid, kf, label_split=None):
    adaboost = AdaBoostClassifier(n_estimators=50, random_state=2017, learning_rate=0.01)
    ada_train, ada_test = stacking_clf(adaboost, x_train, y_train, x_valid, "ada", kf, label_split=label_split)
    return ada_train, ada_test, "ada"

def gb_clf(x_train, y_train, x_valid, kf, label_split=None):
    gbdt = GradientBoostingClassifier(learning_rate=0.04, n_estimators=100, subsample=0.8,
                                      random_state=2017, max_depth=5, verbose=1)
    gbdt_train, gbdt_test = stacking_clf(gbdt, x_train, y_train, x_valid, "gb", kf, label_split=label_split)
    return gbdt_train, gbdt_test, "gb"

def et_clf(x_train, y_train, x_valid, kf, label_split=None):
    extratree = ExtraTreesClassifier(n_estimators=1200, max_depth=35, max_features="auto",
                                     n_jobs=-1, random_state=2017, verbose=1)
    et_train, et_test = stacking_clf(extratree, x_train, y_train, x_valid, "et", kf, label_split=label_split)
    return et_train, et_test, "et"

def xgb_clf(x_train, y_train, x_valid, kf, label_split=None):
    xgb_train, xgb_test = stacking_clf(xgboost, x_train, y_train, x_valid, "xgb", kf, label_split=label_split)
    return xgb_train, xgb_test, "xgb"

def lgb_clf(x_train, y_train, x_valid, kf, label_split=None):
    lgb_train, lgb_test = stacking_clf(lightgbm, x_train, y_train, x_valid, "lgb", kf, label_split=label_split)
    return lgb_train, lgb_test, "lgb"

def gnb_clf(x_train, y_train, x_valid, kf, label_split=None):
    gnb = GaussianNB()
    gnb_train, gnb_test = stacking_clf(gnb, x_train, y_train, x_valid, "gnb", kf, label_split=label_split)
    return gnb_train, gnb_test, "gnb"

def lr_clf(x_train, y_train, x_valid, kf, label_split=None):
    logisticregression = LogisticRegression(n_jobs=-1, random_state=2017, C=0.1, max_iter=200)
    lr_train, lr_test = stacking_clf(logisticregression, x_train, y_train, x_valid, "lr", kf, label_split=label_split)
    return lr_train, lr_test, "lr"

def knn_clf(x_train, y_train, x_valid, kf, label_split=None):
    kneighbors = KNeighborsClassifier(n_neighbors=200, n_jobs=-1)
    knn_train, knn_test = stacking_clf(kneighbors, x_train, y_train, x_valid, "knn", kf, label_split=label_split)
    return knn_train, knn_test, "knn"
Obtaining training and validation data (in preparation for the stacking features)
features_columns = [c for c in all_data_test.columns if c not in ['label', 'prob', 'seller_path', 'cat_path', 'brand_path', 'action_type_path', 'item_path', 'time_stamp_path']]
x_train = all_data_test[~all_data_test['label'].isna()][features_columns].values
y_train = all_data_test[~all_data_test['label'].isna()]['label'].values
x_valid = all_data_test[all_data_test['label'].isna()][features_columns].values
Handling inf and NaN values
def get_matrix(data):
    # replace NaN and inf entries with 0
    where_are_nan = np.isnan(data)
    where_are_inf = np.isinf(data)
    data[where_are_nan] = 0
    data[where_are_inf] = 0
    return data
x_train = np.float_(get_matrix(np.float_(x_train)))
y_train = np.int_(y_train)
x_valid = x_train  # demo shortcut: reuse the training matrix as the validation input
Importing the split utilities; use 5-fold splits for the stacking features
from sklearn.model_selection import StratifiedKFold, KFold
folds = 5
seed = 0
kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
Building stacking features with the lgb and xgb classifiers
# clf_list = [lgb_clf, xgb_clf, lgb_reg, xgb_reg]
# clf_list_col = ['lgb_clf', 'xgb_clf', 'lgb_reg', 'xgb_reg']
clf_list = [lgb_clf, xgb_clf]
clf_list_col = ['lgb_clf', 'xgb_clf']
Training the models to obtain the stacking features
column_list = []
train_data_list = []
test_data_list = []
for clf in clf_list:
    train_data, test_data, clf_name = clf(x_train, y_train, x_valid, kf, label_split=None)
    train_data_list.append(train_data)
    test_data_list.append(test_data)
train_stacking = np.concatenate(train_data_list, axis=1)
test_stacking = np.concatenate(test_data_list, axis=1)
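The out-of-fold columns can then be appended to the original features as inputs for a second-stage learner. A minimal sketch, with logistic regression as an assumed (not the original's) final model:

x_train_stack = np.hstack([x_train, train_stacking])
x_valid_stack = np.hstack([x_valid, test_stacking])

from sklearn.linear_model import LogisticRegression
meta = LogisticRegression(max_iter=200)
meta.fit(x_train_stack, y_train)
pred = meta.predict_proba(x_valid_stack)[:, 1]  # predicted repeat-purchase probability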
5. Model Training, Validation, and Evaluation
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

train_data = pd.read_csv('train_all.csv', nrows=10000)
test_data = pd.read_csv('test_all.csv', nrows=100)
train_data.head()
train_data.columns
Obtaining training and test data
features_columns = [col for col in train_data.columns if col not in ['user_id','label']]
train = train_data[features_columns].values
test = test_data[features_columns].values
target = train_data['label'].values

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0, n_jobs=-1)
X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=0.4, random_state=0)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)
Cross-validation: evaluating estimator performance
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0, n_jobs=-1)
scores = cross_val_score(clf, train, target, cv=5)
print(scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Hyperparameter tuning
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier

# Split the dataset into two equal parts
X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=0.5, random_state=0)

# model
clf = RandomForestClassifier(n_jobs=-1)

# Set the parameters searched by cross-validation
tuned_parameters = {'n_estimators': [50, 100, 200]
                    # ,'criterion': ['gini', 'entropy']
                    # ,'max_depth': [2, 5]
                    # ,'max_features': ['log2', 'sqrt', 'int']
                    # ,'bootstrap': [True, False]
                    # ,'warm_start': [True, False]
                    }

scores = ['precision']
for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()
    clf = GridSearchCV(clf, tuned_parameters, cv=5, scoring='%s_macro' % score)
    clf.fit(X_train, y_train)
    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
    print()
    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()
Confusion matrix
import itertools
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier

# label names
class_names = ['no-repeat', 'repeat']

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)

# Run the classifier
clf = RandomForestClassifier(n_jobs=-1)
y_pred = clf.fit(X_train, y_train).predict(X_test)

def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Blues):
    """
    Print and plot the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()

# Compute the confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)

# Plot the non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix, without normalization')

# Plot the normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True, title='Normalized confusion matrix')
plt.show()
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier

# label names
class_names = ['no-repeat', 'repeat']

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)

# Fit the classifier and print a per-class report
clf = RandomForestClassifier(n_jobs=-1)
y_pred = clf.fit(X_train, y_train).predict(X_test)
print(classification_report(y_test, y_pred, target_names=class_names))
Different classification models
# Logistic regression (with standardized features)
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)
X_train, X_test, y_train, y_test = train_test_split(X, target, random_state=0)
clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(X_train, y_train)
clf.score(X_test, y_test)

# K-nearest neighbors (with standardized features)
from sklearn.neighbors import KNeighborsClassifier

X = stdScaler.fit_transform(train)
X_train, X_test, y_train, y_test = train_test_split(X, target, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
clf.score(X_test, y_test)

# Gaussian naive Bayes (with standardized features)
from sklearn.naive_bayes import GaussianNB

X = stdScaler.fit_transform(train)
X_train, X_test, y_train, y_test = train_test_split(X, target, random_state=0)
clf = GaussianNB().fit(X_train, y_train)
clf.score(X_test, y_test)

# Decision tree
from sklearn import tree

X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

# Bagging of KNN classifiers
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)
clf = BaggingClassifier(KNeighborsClassifier(), max_samples=0.5, max_features=0.5)
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

# Random forest
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)
clf = RandomForestClassifier(n_estimators=10, max_depth=3, min_samples_split=12, random_state=0)
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

# Extremely randomized trees
from sklearn.ensemble import ExtraTreesClassifier

X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)
clf = ExtraTreesClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

# AdaBoost
from sklearn.ensemble import AdaBoostClassifier

X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)
clf = AdaBoostClassifier(n_estimators=10)
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

# Gradient boosting
from sklearn.ensemble import GradientBoostingClassifier

X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)
clf = GradientBoostingClassifier(n_estimators=10, learning_rate=1.0, max_depth=1, random_state=0)
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)
Model voting (VotingClassifier)
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.preprocessing import StandardScaler

stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)
y = target

clf1 = LogisticRegression(solver='lbfgs', multi_class='multinomial', random_state=1)
clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
clf3 = GaussianNB()
eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='hard')

for clf, label in zip([clf1, clf2, clf3, eclf],
                      ['Logistic Regression', 'Random Forest', 'naive Bayes', 'Ensemble']):
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
6. Feature Optimization and Feature Selection
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

train_data = pd.read_csv('train_all.csv', nrows=10000)
test_data = pd.read_csv('test_all.csv', nrows=100)

# obtain training and test data
features_columns = [col for col in train_data.columns if col not in ['user_id','label']]
train = train_data[features_columns].values
test = test_data[features_columns].values
target =train_data['label'].values
Missing-value imputation
There are many ways to handle missing values; the most common are (a minimal sketch follows this list):
1. Deletion. Usable when the dataset is large or the fraction of missing data is small.
2. Imputation. Typically filling with the mean or median; interpolation or model-based prediction can also be used.
3. No handling. Tree-based models are insensitive to missing values.
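A minimal sketch of the three options on the train_data DataFrame loaded above (the sklearn median imputer used in this article follows right below):

# 1. deletion: drop any row that contains a missing value
train_dropped = train_data.dropna(axis=0, how='any')
# 2. imputation: fill each column's NaNs with that column's median
train_filled = train_data.fillna(train_data.median(numeric_only=True))
# 3. no handling: XGBoost/LightGBM treat NaN natively, so leaving the gaps is also valid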
Fill with the median
# from sklearn.preprocessing import Imputer
# imputer = Imputer(strategy="median")
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='median')
imputer = imputer.fit(train)
train_imputer = imputer.transform(train)
test_imputer = imputer.transform(test)
Feature selection
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

def feature_selection(train, train_sel, target):
    # compare cross-validated accuracy before and after feature selection
    clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0, n_jobs=-1)
    scores = cross_val_score(clf, train, target, cv=5)
    scores_sel = cross_val_score(clf, train_sel, target, cv=5)
    print("No Select Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
    print("Features Select Accuracy: %0.2f (+/- %0.2f)" % (scores_sel.mean(), scores_sel.std() * 2))
Removing low-variance features
VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance does not meet a given threshold. By default, it removes all zero-variance features, i.e., features that have the same value in every sample.
from sklearn.feature_selection import VarianceThreshold

sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel = sel.fit(train)
train_sel = sel.transform(train)
test_sel = sel.transform(test)
print('Training-feature dimensions before selection:', train.shape)
print('Training-feature dimensions after selection:', train_sel.shape)
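The feature_selection helper defined above can now compare the two feature sets (this assumes train contains no NaNs at this point; otherwise pass the imputed matrices from the imputation step):

feature_selection(train, train_sel, target)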