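1. Background

Merchants sometimes run large promotions or hand out coupons on specific dates, such as Boxing Day, Black Friday, or Double 11 (November 11), to attract shoppers. Many of the buyers drawn in this way are one-time customers, so these promotions may do little for long-term sales growth. To address this, merchants need to identify which shoppers can be converted into repeat buyers. By targeting these potentially loyal customers, merchants can cut promotion costs substantially and improve their return on investment (ROI). Precisely targeting customers in online advertising is well known to be difficult, especially for new shoppers. With the user behavior logs that Tmall has accumulated over a long period, however, we may be able to solve this problem.

We provide information on a set of merchants, together with the new shoppers who bought from them during the Double 11 promotion. Your task is to predict which of these new shoppers will become loyal customers of the given merchants, that is, to predict the probability that each new shopper makes another purchase within 6 months.

Dataset size: 500MB+

2. Data Description

The dataset contains the shopping records of anonymized users for the 6 months before Double 11 and for Double 11 itself, with a label indicating whether each user is a repeat buyer. For privacy reasons the data was sampled with some bias, so statistics computed on this dataset deviate somewhat from the real situation on Tmall, but this does not affect the applicability of a solution. The training and test sets are in data_format1.zip; the field details are given in the table below.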



3. Data Exploration

3.1 Importing tools

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings

warnings.filterwarnings("ignore")
%matplotlib inline

3.2 Reading the data

# Load the datasets
test_data = pd.read_csv('./data_format1/test_format1.csv')
train_data = pd.read_csv('./data_format1/train_format1.csv')
user_info = pd.read_csv('./data_format1/user_info_format1.csv')
user_log = pd.read_csv('./data_format1/user_log_format1.csv')
# user_info = pd.read_csv('./data_format1/user_info_format1.csv').drop_duplicates()
# user_log = pd.read_csv('./data_format1/user_log_format1.csv').rename(columns={"seller_id": 'merchant_id'})

A look at sample rows from each dataset:

train_data.head(5)

test_data.head(5)

user_info.head(5)

user_log.head(5)

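3.3 Univariate data analysis

3.3.1 Data types and sizes (info)

User information data

The dataset has 2 float64 columns and 1 int64 column
Data size: 9.7MB
The dataset has 424170 rows

User behavior data

The dataset has 6 int64 columns and 1 float64 column
Data size: 2.9GB
The dataset has 54925330 rows

User purchase training data

All columns are int64
Data size: 6MB
The dataset has 260864 rows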
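Figures like the ones above are the kind of thing `DataFrame.info()` reports. The following is a minimal sketch on a toy frame (`toy_user_info` is a made-up stand-in, not the real `user_info` file) showing how the dtype counts, row count, and memory usage could be read off, either from the printed summary or programmatically.

```python
import io
import pandas as pd

# Hypothetical toy stand-in for user_info: one int64 column and two float64 columns
toy_user_info = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "age_range": [3.0, None, 0.0, 5.0],
    "gender": [0.0, 1.0, None, 2.0],
})

# info() prints the row count, the dtype of each column, and the memory usage
buf = io.StringIO()
toy_user_info.info(buf=buf, memory_usage="deep")
print(buf.getvalue())

# The same facts are available as plain values
dtype_counts = toy_user_info.dtypes.astype(str).value_counts().to_dict()
n_rows = len(toy_user_info)
n_bytes = int(toy_user_info.memory_usage(deep=True).sum())
print(dtype_counts, n_rows, n_bytes)
```

On the real files the same calls would be made on `user_info`, `user_log`, and `train_data`.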

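3.3.2 Checking missing values

3.3.2.1 Missing values in the user information data

Missing age

# Share of rows with a missing age
(user_info.shape[0] - user_info['age_range'].count()) / user_info.shape[0]
# Number of rows where age is missing or 0
user_info[user_info['age_range'].isna() | (user_info['age_range'] == 0)].count()
# Row counts per age group
user_info.groupby(['age_range'])['user_id'].count()

1. The missing rate for empty age values is 0.5%

2. Rows where age is missing or equals the default value 0: 95131 in total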

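Missing gender

1. The missing rate for empty gender values is 1.5%

2. Rows where gender is missing or equals the default value 2: 16862 in total

# Share of rows with a missing gender
(user_info.shape[0] - user_info['gender'].count()) / user_info.shape[0]
# Number of rows where gender is missing or 2
user_info[user_info['gender'].isna() | (user_info['gender'] == 2)].count()
# Row counts per gender
user_info.groupby(['gender'])[['user_id']].count()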
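The age and gender checks follow the same pattern: a value counts as unknown if it is NaN or equal to the column's default placeholder (0 for `age_range`, 2 for `gender`). A small sketch of that pattern on made-up data, where `unknown_rate` is a hypothetical helper and not part of the original notebook:

```python
import pandas as pd

# Made-up rows; 0 in age_range and 2 in gender act as "unknown" placeholders
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "age_range": [3.0, None, 0.0, 5.0, 0.0, 2.0],
    "gender": [0.0, 1.0, None, 2.0, 1.0, 2.0],
})

def unknown_rate(df, col, default_value):
    """Share of rows where `col` is NaN or equal to its default placeholder."""
    mask = df[col].isna() | (df[col] == default_value)
    return mask.mean()

age_nan_rate = toy["age_range"].isna().mean()         # NaN only
age_unknown_rate = unknown_rate(toy, "age_range", 0)  # NaN or 0
gender_unknown_rate = unknown_rate(toy, "gender", 2)  # NaN or 2
print(age_nan_rate, age_unknown_rate, gender_unknown_rate)
```

The gap between the NaN-only rate and the combined rate is what makes the placeholder values worth checking separately.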

3.3.2.2 Missing values in the user behavior log

print(user_log.isnull().sum())

3.4 Examining the data distribution

3.4.1 Overall summary statistics

user_info.describe()

user_log.describe()

3.4.2 Distribution of positive and negative samples

label_gp = train_data.groupby('label')['user_id'].count()
print('Number of positive and negative samples:\n', label_gp)
fig = plt.figure(figsize=(12, 6))
ax1 = plt.subplot(1, 2, 1)
train_data['label'].value_counts().plot(kind='pie', autopct='%1.1f%%', shadow=True, explode=[0, 0.1], ax=ax1)
ax2 = plt.subplot(1, 2, 2)
sns.countplot(x='label', data=train_data, ax=ax2)



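The figure above shows that the classes are imbalanced, so some measure is needed to deal with the imbalance:

  • Something like undersampling: pair the positive samples with several different subsets of the negative samples to form several training sets, train one model per set, and average their predictions
  • Adjust the class weights used by the model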
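Both options can be sketched with NumPy alone. The snippet below is an illustration under assumptions, not the author's implementation: `balanced_subsets` is a hypothetical helper that pairs every positive sample with a different chunk of negatives, and the weights use the common "balanced" heuristic n_samples / (n_classes * class_count).

```python
import numpy as np

def balanced_subsets(y, seed=0):
    """Index arrays that each hold all positives plus one distinct chunk of negatives."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = rng.permutation(np.flatnonzero(y == 0))
    # Roughly one chunk of negatives per "copy" of the positive set
    n_subsets = max(1, len(neg_idx) // len(pos_idx))
    return [np.concatenate([pos_idx, chunk])
            for chunk in np.array_split(neg_idx, n_subsets)]

# Toy labels: 6 positives, 94 negatives
y = np.array([1] * 6 + [0] * 94)
subsets = balanced_subsets(y)
print(len(subsets), [len(idx) for idx in subsets])

# Option 2: weight each class inversely to its frequency
class_weight = {c: len(y) / (2 * int(np.sum(y == c))) for c in (0, 1)}
print(class_weight)
```

Each index array would be used to train one model, with the final score taken as the mean of the models' predictions; a dictionary like `class_weight` is the form that scikit-learn classifiers accept through their `class_weight` parameter.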

3.5 Exploring how merchant, user, gender, and age affect repeat purchases

3.5.1 Relationship between merchants and repeat purchases

print('Top 5 merchants\nmerchant\tpurchase count')
print(train_data['merchant_id'].value_counts().head(5))
train_data_merchant = train_data.copy()
train_data_merchant['TOP5'] = train_data_merchant['merchant_id'].map(lambda x: 1 if x in [4044, 3828, 4173, 1102, 4976] else 0)
train_data_merchant = train_data_merchant[train_data_merchant['TOP5'] == 1]
plt.figure(figsize=(8, 6))
plt.title('Merchant VS Label')
ax = sns.countplot(x='merchant_id', hue='label', data=train_data_merchant)
for p in ax.patches:
    height = p.get_height()



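The figure shows that repeat-purchase rates differ across merchants, which may be related to the products each merchant sells and to how the shop is run.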

3.5.2 Distribution of merchant repeat-purchase rates

merchant_repeat_buy=[rate for rate in train_data.groupby('merchant_id')['label'].mean() if rate<=1 and rate>0]
plt.figure(figsize=(8,4))
ax=plt.subplot(121)
sns.distplot(merchant_repeat_buy,fit=stats.norm)
ax=plt.subplot(1,2,2)
res = stats.probplot(merchant_repeat_buy, plot=plt)


Repeat-purchase rates differ across merchants and mostly fall between 0 and 0.3.

3.5.3 Distribution of repeat-purchase rates for users who repurchased at least once

user_repeat_buy = [rate for rate in train_data.groupby(['user_id'])['label'].mean() if rate <= 1 and rate > 0]

plt.figure(figsize=(8,6))

ax=plt.subplot(1,2,1)
sns.distplot(user_repeat_buy, fit=stats.norm)
ax=plt.subplot(1,2,2)
res = stats.probplot(user_repeat_buy, plot=plt)

3.5.4 Relationship between user gender and repeat purchases

train_data_user_info = train_data.merge(user_info, on=['user_id'], how='left')
plt.figure(figsize=(8, 8))
plt.title('Gender VS Label')
ax = sns.countplot(x='gender', hue='label', data=train_data_user_info)
for p in ax.patches:
    height = p.get_height()

3.5.5 Distribution of repeat-purchase rates by gender

repeat_buy=[rate for rate in train_data_user_info.groupby(['gender'])['label'].mean()]
ax=plt.subplot(1,2,1)
sns.distplot(repeat_buy,fit=stats.norm)
ax=plt.subplot(1,2,2)
res = stats.probplot(repeat_buy, plot=plt)

The repeat-purchase rate differs between men and women.

3.5.6 Relationship between user age and repeat purchases

plt.figure(figsize=(8, 8))
plt.title('Age VS Label')
ax = sns.countplot(x='age_range', hue='label', data=train_data_user_info)

3.5.7 Distribution of repeat-purchase rates by age

repeat_buy = [rate for rate in train_data_user_info.groupby(['age_range'])['label'].mean()]
plt.figure(figsize=(8, 4))
ax = plt.subplot(1, 2, 1)
sns.distplot(repeat_buy, fit=stats.norm)
ax=plt.subplot(1,2,2)
res = stats.probplot(repeat_buy, plot=plt)

4. Feature Engineering

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import gc
from collections import Counter
import copy
import warnings

warnings.filterwarnings("ignore")
%matplotlib inline

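4.1 Merging user information

del test_data['prob']
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
all_data = pd.concat([train_data, test_data])
all_data = all_data.merge(user_info, on=['user_id'], how='left')
del train_data, test_data, user_info
gc.collect()
all_data.head()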

4.2 Sorting the user behavior log by time

# Sort by user and time stamp
user_log = user_log.sort_values(['user_id', 'time_stamp'])
user_log.head()

4.3 Concatenating each user's item_id, cat_id, seller_id, brand_id, time_stamp, and action_type values

# Join each user's values into one space-separated string per field
def list_join_func(x):
    return " ".join([str(i) for i in x])

agg_dict = {
    'item_id': list_join_func,
    'cat_id': list_join_func,
    'seller_id': list_join_func,
    'brand_id': list_join_func,
    'time_stamp': list_join_func,
    'action_type': list_join_func,
}
rename_dict = {
    'item_id': 'item_path',
    'cat_id': 'cat_path',
    'seller_id': 'seller_path',
    'brand_id': 'brand_path',
    'time_stamp': 'time_stamp_path',
    'action_type': 'action_type_path',
}
user_log_path = user_log.groupby('user_id').agg(agg_dict).reset_index().rename(columns=rename_dict)
user_log_path.head()

all_data_path = all_data.merge(user_log_path,on='user_id')
all_data_path.head()
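To make the "path" representation concrete, here is a minimal sketch on a made-up five-row log (the values and the `join_values` helper are illustrative assumptions). It applies the same steps, sort then group then join, so each user ends up with one space-separated string per field, in time order.

```python
import pandas as pd

# Made-up log rows, deliberately out of order
toy_log = pd.DataFrame({
    "user_id":     [2, 1, 1, 2, 1],
    "seller_id":   [30, 10, 20, 30, 10],
    "time_stamp":  [1111, 1110, 829, 1001, 1111],
    "action_type": [2, 0, 0, 0, 2],
})

def join_values(values):
    """Join a group's values into one space-separated string."""
    return " ".join(str(v) for v in values)

# Sorting first means every path string is ordered by time within a user
toy_log = toy_log.sort_values(["user_id", "time_stamp"])
toy_paths = (
    toy_log.groupby("user_id")
    .agg({"seller_id": join_values, "time_stamp": join_values, "action_type": join_values})
    .reset_index()
    .rename(columns={"seller_id": "seller_path",
                     "time_stamp": "time_stamp_path",
                     "action_type": "action_type_path"})
)
print(toy_paths)
```

Because all path columns come from the same sorted rows, position i in `seller_path` lines up with position i in `action_type_path`, which the later per-action features rely on.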

4.4 Defining the statistics functions

4.4.1 Total number of entries

def cnt_(x):
    try:
        return len(x.split(' '))
    except:
        return -1

4.4.2 Number of unique entries

def nunique_(x):
    try:
        return len(set(x.split(' ')))
    except:
        return -1

4.4.3 Maximum value

def max_(x):
    try:
        return np.max([int(i) for i in x.split(' ')])
    except:
        return -1

4.4.4 Minimum value

def min_(x):
    try:
        return np.min([int(i) for i in x.split(' ')])
    except:
        return -1

4.4.5 Standard deviation

def std_(x):
    try:
        return np.std([float(i) for i in x.split(' ')])
    except:
        return -1

4.4.6 Top-N entries

# The N-th most frequent value. user_most_n below calls most_n, but its definition
# was missing from the text, so this version is reconstructed from most_n_cnt.
def most_n(x, n):
    try:
        return Counter(x.split(' ')).most_common(n)[n-1][0]
    except:
        return -1

# Number of occurrences of the N-th most frequent value
def most_n_cnt(x, n):
    try:
        return Counter(x.split(' ')).most_common(n)[n-1][1]
    except:
        return -1

# Wrappers that apply one statistic to a path column and store it as a new feature
def user_cnt(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(cnt_)
    return df_data

def user_nunique(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(nunique_)
    return df_data

def user_max(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(max_)
    return df_data

def user_min(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(min_)
    return df_data

def user_std(df_data, single_col, name):
    df_data[name] = df_data[single_col].apply(std_)
    return df_data

def user_most_n(df_data, single_col, name, n=1):
    func = lambda x: most_n(x, n)
    df_data[name] = df_data[single_col].apply(func)
    return df_data

def user_most_n_cnt(df_data, single_col, name, n=1):
    func = lambda x: most_n_cnt(x, n)
    df_data[name] = df_data[single_col].apply(func)
    return df_data
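As a quick illustration of what these helpers compute, the sketch below works through one made-up `seller_path` string by hand, using only the standard library. The variable names are chosen for the example and are not taken from the notebook.

```python
from collections import Counter

# A made-up path: six actions spread over three merchants
seller_path = "10 20 10 30 10 20"
tokens = seller_path.split(" ")

total = len(tokens)                      # what cnt_ returns
n_unique = len(set(tokens))              # what nunique_ returns
ranked = Counter(tokens).most_common()   # values ordered by frequency
top_value, top_count = ranked[0]         # most_n(x, 1) and most_n_cnt(x, 1)
second_value, second_count = ranked[1]   # most_n(x, 2) and most_n_cnt(x, 2)
print(total, n_unique, top_value, top_count, second_value, second_count)
```

Note that the values stay strings after `split`, so `top_value` is `"10"`, not the integer 10.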

4.5 Extracting basic statistical features for merchants

# Extract basic statistical features
all_data_test = all_data_path.head(2000)
# all_data_test = all_data_path
# Count the user's click, browse, add-to-cart, and purchase actions
# Total number of actions
all_data_test = user_cnt(all_data_test, 'seller_path', 'user_cnt')
# Number of distinct merchants
all_data_test = user_nunique(all_data_test, 'seller_path', 'seller_nunique')
# Number of distinct categories
all_data_test = user_nunique(all_data_test, 'cat_path', 'cat_nunique')
# Number of distinct brands
all_data_test = user_nunique(all_data_test, 'brand_path', 'brand_nunique')
# Number of distinct items
all_data_test = user_nunique(all_data_test, 'item_path', 'item_nunique')
# Number of active days
all_data_test = user_nunique(all_data_test, 'time_stamp_path', 'time_stamp_nunique')
# Number of distinct action types
all_data_test = user_nunique(all_data_test, 'action_type_path', 'action_type_nunique')
all_data_test.head()
# ....

# Latest time stamp (read from time_stamp_path; the original passed action_type_path here)
all_data_test = user_max(all_data_test, 'time_stamp_path', 'time_stamp_max')
# Earliest time stamp
all_data_test = user_min(all_data_test, 'time_stamp_path', 'time_stamp_min')
# Standard deviation of the time stamps
all_data_test = user_std(all_data_test, 'time_stamp_path', 'time_stamp_std')
# Gap between the latest and earliest time stamps
all_data_test['time_stamp_range'] = all_data_test['time_stamp_max'] - all_data_test['time_stamp_min']
# The user's favorite merchant
all_data_test = user_most_n(all_data_test, 'seller_path', 'seller_most_1', n=1)
# Favorite category
all_data_test = user_most_n(all_data_test, 'cat_path', 'cat_most_1', n=1)
# Favorite brand
all_data_test = user_most_n(all_data_test, 'brand_path', 'brand_most_1', n=1)
# Most common action type
all_data_test = user_most_n(all_data_test, 'action_type_path', 'action_type_1', n=1)
# .....
# Number of actions at the user's favorite merchant
all_data_test = user_most_n_cnt(all_data_test, 'seller_path', 'seller_most_1_cnt', n=1)
# Number of actions in the favorite category
all_data_test = user_most_n_cnt(all_data_test, 'cat_path', 'cat_most_1_cnt', n=1)
# Number of actions for the favorite brand
all_data_test = user_most_n_cnt(all_data_test, 'brand_path', 'brand_most_1_cnt', n=1)
# Number of occurrences of the most common action type
all_data_test = user_most_n_cnt(all_data_test, 'action_type_path', 'action_type_1_cnt', n=1)
# .....

4.6 Counting clicks, add-to-cart actions, purchases, and favorites separately

# Count clicks, add-to-cart actions, purchases, and favorites separately
# Basic statistical feature functions
# -- Key point 2
# -- Extract different features
# -- for each type of user action
def col_values_(df_data, columns_list, action_type):
    # Split every path column of this row into its list of values
    col_list = copy.deepcopy(columns_list)
    if action_type is not None:
        col_list += ['action_type_path']
    data_dict = {col: df_data[col].split(' ') for col in col_list}
    path_len = len(data_dict[columns_list[0]])
    data_out = []
    for i_ in range(path_len):
        # Keep only entries whose action type matches. The original appended an
        # empty string for every non-matching entry, so every count came out equal
        # to the full path length.
        if action_type is not None and data_dict['action_type_path'][i_] != action_type:
            continue
        data_out.append(''.join('_' + data_dict[col_][i_] for col_ in columns_list))
    return data_out

def col_cnt_(df_data, columns_list, action_type):
    try:
        return len(col_values_(df_data, columns_list, action_type))
    except:
        return -1

def col_nuique_(df_data, columns_list, action_type):
    try:
        return len(set(col_values_(df_data, columns_list, action_type)))
    except:
        return -1

def user_col_cnt(df_data, columns_list, action_type, name):
    df_data[name] = df_data.apply(lambda x: col_cnt_(x, columns_list, action_type), axis=1)
    return df_data

def user_col_nunique(df_data, columns_list, action_type, name):
    df_data[name] = df_data.apply(lambda x: col_nuique_(x, columns_list, action_type), axis=1)
    return df_data
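The idea behind the per-action features can be shown in a few lines: walk the value path and the action path in parallel and keep a value only when its action code matches. The sketch below uses made-up paths and a hypothetical helper, `values_for_action`; the codes '0' (click) and '2' (purchase) follow the comments in section 4.7.

```python
# Made-up aligned paths for one user: position i in both lists is the same event
seller_path = "10 20 10 30".split(" ")
action_path = "0 0 2 0".split(" ")

def values_for_action(values, actions, action_type):
    """Keep the values whose aligned action code equals action_type."""
    return [v for v, a in zip(values, actions) if a == action_type]

clicks = values_for_action(seller_path, action_path, "0")
purchases = values_for_action(seller_path, action_path, "2")

click_count = len(clicks)              # number of click events
clicked_merchants = len(set(clicks))   # number of distinct merchants clicked
purchase_count = len(purchases)
print(clicks, purchases)
```

The action type is compared as a string because the path columns are built by joining values with `str`.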

4.7 Counting the clicks, add-to-cart actions, purchases, and favorites that merchants received from each user

# Number of clicks
all_data_test = user_col_cnt(all_data_test, ['seller_path'], '0', 'user_cnt_0')
# Number of add-to-cart actions
all_data_test = user_col_cnt(all_data_test, ['seller_path'], '1', 'user_cnt_1')
# Number of purchases
all_data_test = user_col_cnt(all_data_test, ['seller_path'], '2', 'user_cnt_2')
# Number of favorites
all_data_test = user_col_cnt(all_data_test, ['seller_path'], '3', 'user_cnt_3')
# Number of distinct merchants clicked
all_data_test = user_col_nunique(all_data_test, ['seller_path'], '0', 'seller_nunique_0')
# ....

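4.8 Combined features

# Number of clicks on (merchant, item) pairs
# Note: this reuses the column name from 4.7 and therefore overwrites that feature
all_data_test = user_col_cnt(all_data_test, ['seller_path', 'item_path'], '0', 'user_cnt_0')
# Number of distinct (merchant, item) pairs clicked
all_data_test = user_col_nunique(all_data_test, ['seller_path', 'item_path'], '0', 'seller_nunique_0')
list(all_data_test.columns)
# ....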

Extracting features with CountVectorizer and TF-IDF

# -- Key point 4
# -- Extract features with CountVectorizer and TF-IDF
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS
from scipy import sparse
# stop_words is passed as a list, which works across scikit-learn versions
# cntVec = CountVectorizer(stop_words=list(ENGLISH_STOP_WORDS), ngram_range=(1, 1), max_features=100)
tfidfVec = TfidfVectorizer(stop_words=list(ENGLISH_STOP_WORDS), ngram_range=(1, 1), max_features=100)
# columns_list = ['seller_path', 'cat_path', 'brand_path', 'action_type_path', 'item_path', 'time_stamp_path']
columns_list = ['seller_path']
for i, col in enumerate(columns_list):
    all_data_test[col] = all_data_test[col].astype(str)
    tfidfVec.fit(all_data_test[col])
    data_ = tfidfVec.transform(all_data_test[col])
    if i == 0:
        data_cat = data_
    else:
        data_cat = sparse.hstack((data_cat, data_))
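Treating each path string as a "document" whose "words" are IDs is what lets a text vectorizer produce features here. The following sketch runs `TfidfVectorizer` on three made-up seller paths. One detail worth knowing, shown at the end: with the default `token_pattern`, tokens shorter than two characters are dropped, so single-digit IDs would silently disappear.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up seller paths, one string per user
seller_paths = ["10 20 10 30", "20 20 40", "10 50"]

# Keep only the 2 most frequent IDs across all users
vec = TfidfVectorizer(max_features=2)
X = vec.fit_transform(seller_paths)
vocab = sorted(vec.vocabulary_)
print(vocab)
print(X.toarray())

# The default token pattern only matches tokens of two or more characters
short_vec = TfidfVectorizer().fit(["7 10", "7 20"])
short_vocab = sorted(short_vec.vocabulary_)
print(short_vocab)
```

If single-character IDs matter, passing a different `token_pattern` (for example `r"(?u)\b\w+\b"`) is one way to keep them; whether the original data needs this is not stated in the text.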

4.9 Renaming and merging the features

df_tfidf = pd.DataFrame(data_cat.toarray())
df_tfidf.columns = ['tfidf_' + str(i) for i in df_tfidf.columns]
all_data_test = pd.concat([all_data_test, df_tfidf],axis=1)

Embedding features

import gensim

# Train the Word2Vec model
# (gensim >= 4.0 uses vector_size and wv.key_to_index; older releases used size and wv.vocab)
model = gensim.models.Word2Vec(all_data_test['seller_path'].apply(lambda x: x.split(' ')), vector_size=100, window=5, min_count=5, workers=4)
# model.save("product2vec.model")
# model = gensim.models.Word2Vec.load("product2vec.model")

def mean_w2v_(x, model, size=100):
    # Average the vectors of all in-vocabulary tokens; all zeros if none are known
    try:
        i = 0
        vec = np.zeros(size)
        for word in x.split(' '):
            if word in model.wv.key_to_index:
                i += 1
                vec += model.wv[word]
        return vec / i if i > 0 else vec
    except:
        return np.zeros(size)

def get_mean_w2v(df_data, columns, model, size):
    data_array = []
    for index, row in df_data.iterrows():
        w2v = mean_w2v_(row[columns], model, size)
        data_array.append(w2v)
    return pd.DataFrame(data_array)

df_embeeding = get_mean_w2v(all_data_test, 'seller_path', model, 100)
df_embeeding.columns = ['embeeding_' + str(i) for i in df_embeeding.columns]
all_data_test = pd.concat([all_data_test, df_embeeding], axis=1)
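The per-user embedding is simply the mean of the vectors of the merchants on that user's path, ignoring IDs the model has not seen. A NumPy-only sketch of that averaging step, with a made-up two-dimensional lookup table (`toy_vectors`) standing in for the trained Word2Vec model:

```python
import numpy as np

# Hypothetical 2-d vectors for two merchant IDs (a trained model would supply these)
toy_vectors = {
    "10": np.array([1.0, 0.0]),
    "20": np.array([0.0, 1.0]),
}

def mean_vector(path, vectors, size):
    """Mean of the vectors of known tokens; a zero vector if no token is known."""
    known = [vectors[t] for t in path.split(" ") if t in vectors]
    if not known:
        return np.zeros(size)
    return np.mean(known, axis=0)

user_vec = mean_vector("10 20 10 99", toy_vectors, 2)   # "99" is unknown and skipped
empty_vec = mean_vector("99", toy_vectors, 2)
print(user_vec, empty_vec)
```

Repeated visits count repeatedly, so a merchant the user returns to often pulls the mean toward its vector.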

Stacking features

# -- Key point 6
# -- Stacking features
# from sklearn.cross_validation import KFold
from sklearn.model_selection import KFold
import pandas as pd
import numpy as np
from scipy import sparse
import xgboost
import lightgbm
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import LinearSVC, SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss, mean_absolute_error, mean_squared_error
from sklearn.naive_bayes import MultinomialNB, GaussianNB

# -- Regression
# -- Stacking regression features
# Note: the xgboost and lightgbm calls keep the older API used in the original
# (ntree_limit, early_stopping_rounds passed to train); newer releases may need
# iteration_range and callbacks instead.
def stacking_reg(clf, train_x, train_y, test_x, clf_name, kf, label_split=None):
    train = np.zeros((train_x.shape[0], 1))
    test = np.zeros((test_x.shape[0], 1))
    # One slice of test predictions per fold (the original read a global `folds`)
    test_pre = np.empty((kf.get_n_splits(), test_x.shape[0], 1))
    cv_scores = []
    for i, (train_index, test_index) in enumerate(kf.split(train_x, label_split)):
        tr_x = train_x[train_index]
        tr_y = train_y[train_index]
        te_x = train_x[test_index]
        te_y = train_y[test_index]
        if clf_name in ["rf", "ada", "gb", "et", "lr"]:
            clf.fit(tr_x, tr_y)
            pre = clf.predict(te_x).reshape(-1, 1)
            train[test_index] = pre
            test_pre[i, :] = clf.predict(test_x).reshape(-1, 1)
            cv_scores.append(mean_squared_error(te_y, pre))
        elif clf_name in ["xgb"]:
            train_matrix = clf.DMatrix(tr_x, label=tr_y, missing=-1)
            test_matrix = clf.DMatrix(te_x, label=te_y, missing=-1)
            # test_x has no labels (the original passed te_y, whose length need not match)
            z = clf.DMatrix(test_x, missing=-1)
            params = {'booster': 'gbtree',
                      'eval_metric': 'rmse',
                      'gamma': 1,
                      'min_child_weight': 1.5,
                      'max_depth': 5,
                      'lambda': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'eta': 0.03,
                      'tree_method': 'exact',
                      'seed': 2017,
                      'nthread': 12}
            num_round = 10000
            early_stopping_rounds = 100
            watchlist = [(train_matrix, 'train'), (test_matrix, 'eval')]
            if test_matrix:
                model = clf.train(params, train_matrix, num_boost_round=num_round, evals=watchlist,
                                  early_stopping_rounds=early_stopping_rounds)
                pre = model.predict(test_matrix, ntree_limit=model.best_ntree_limit).reshape(-1, 1)
                train[test_index] = pre
                test_pre[i, :] = model.predict(z, ntree_limit=model.best_ntree_limit).reshape(-1, 1)
                cv_scores.append(mean_squared_error(te_y, pre))
        elif clf_name in ["lgb"]:
            train_matrix = clf.Dataset(tr_x, label=tr_y)
            test_matrix = clf.Dataset(te_x, label=te_y)
            params = {'boosting_type': 'gbdt',
                      'objective': 'regression_l2',
                      'metric': 'mse',
                      'min_child_weight': 1.5,
                      'num_leaves': 2**5,
                      'lambda_l2': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'learning_rate': 0.03,
                      'tree_method': 'exact',
                      'seed': 2017,
                      'nthread': 12,
                      'silent': True,
                      }
            num_round = 10000
            early_stopping_rounds = 100
            if test_matrix:
                model = clf.train(params, train_matrix, num_round, valid_sets=test_matrix,
                                  early_stopping_rounds=early_stopping_rounds)
                pre = model.predict(te_x, num_iteration=model.best_iteration).reshape(-1, 1)
                train[test_index] = pre
                test_pre[i, :] = model.predict(test_x, num_iteration=model.best_iteration).reshape(-1, 1)
                cv_scores.append(mean_squared_error(te_y, pre))
        else:
            raise IOError("Please add new clf.")
        print("%s now score is:" % clf_name, cv_scores)
    test[:] = test_pre.mean(axis=0)
    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    return train.reshape(-1, 1), test.reshape(-1, 1)

def rf_reg(x_train, y_train, x_valid, kf, label_split=None):
    # max_features=1.0 is what "auto" meant for regressors before it was removed from scikit-learn
    randomforest = RandomForestRegressor(n_estimators=600, max_depth=20, n_jobs=-1, random_state=2017, max_features=1.0, verbose=1)
    rf_train, rf_test = stacking_reg(randomforest, x_train, y_train, x_valid, "rf", kf, label_split=label_split)
    return rf_train, rf_test, "rf_reg"

def ada_reg(x_train, y_train, x_valid, kf, label_split=None):
    adaboost = AdaBoostRegressor(n_estimators=30, random_state=2017, learning_rate=0.01)
    ada_train, ada_test = stacking_reg(adaboost, x_train, y_train, x_valid, "ada", kf, label_split=label_split)
    return ada_train, ada_test, "ada_reg"

def gb_reg(x_train, y_train, x_valid, kf, label_split=None):
    gbdt = GradientBoostingRegressor(learning_rate=0.04, n_estimators=100, subsample=0.8, random_state=2017, max_depth=5, verbose=1)
    gbdt_train, gbdt_test = stacking_reg(gbdt, x_train, y_train, x_valid, "gb", kf, label_split=label_split)
    return gbdt_train, gbdt_test, "gb_reg"

def et_reg(x_train, y_train, x_valid, kf, label_split=None):
    extratree = ExtraTreesRegressor(n_estimators=600, max_depth=35, max_features=1.0, n_jobs=-1, random_state=2017, verbose=1)
    et_train, et_test = stacking_reg(extratree, x_train, y_train, x_valid, "et", kf, label_split=label_split)
    return et_train, et_test, "et_reg"

def lr_reg(x_train, y_train, x_valid, kf, label_split=None):
    linear = LinearRegression(n_jobs=-1)
    lr_train, lr_test = stacking_reg(linear, x_train, y_train, x_valid, "lr", kf, label_split=label_split)
    return lr_train, lr_test, "lr_reg"

def xgb_reg(x_train, y_train, x_valid, kf, label_split=None):
    xgb_train, xgb_test = stacking_reg(xgboost, x_train, y_train, x_valid, "xgb", kf, label_split=label_split)
    return xgb_train, xgb_test, "xgb_reg"

def lgb_reg(x_train, y_train, x_valid, kf, label_split=None):
    lgb_train, lgb_test = stacking_reg(lightgbm, x_train, y_train, x_valid, "lgb", kf, label_split=label_split)
    return lgb_train, lgb_test, "lgb_reg"

Stacking classification features

# -- Classification
# -- Stacking classification features
# Note: the xgboost and lightgbm calls keep the older API used in the original
# (ntree_limit, early_stopping_rounds passed to train); newer releases may need
# iteration_range and callbacks instead.
def stacking_clf(clf, train_x, train_y, test_x, clf_name, kf, label_split=None):
    train = np.zeros((train_x.shape[0], 1))
    test = np.zeros((test_x.shape[0], 1))
    # One slice of test predictions per fold (the original read a global `folds`)
    test_pre = np.empty((kf.get_n_splits(), test_x.shape[0], 1))
    cv_scores = []
    for i, (train_index, test_index) in enumerate(kf.split(train_x, label_split)):
        tr_x = train_x[train_index]
        tr_y = train_y[train_index]
        te_x = train_x[test_index]
        te_y = train_y[test_index]
        if clf_name in ["rf", "ada", "gb", "et", "lr", "knn", "gnb"]:
            clf.fit(tr_x, tr_y)
            pre = clf.predict_proba(te_x)
            # Column 0 is the predicted probability of label 0
            train[test_index] = pre[:, 0].reshape(-1, 1)
            test_pre[i, :] = clf.predict_proba(test_x)[:, 0].reshape(-1, 1)
            # Score with both probability columns; the original scored column 0 alone,
            # which log_loss reads as the probability of label 1
            cv_scores.append(log_loss(te_y, pre))
        elif clf_name in ["xgb"]:
            train_matrix = clf.DMatrix(tr_x, label=tr_y, missing=-1)
            test_matrix = clf.DMatrix(te_x, label=te_y, missing=-1)
            z = clf.DMatrix(test_x)
            params = {'booster': 'gbtree',
                      'objective': 'multi:softprob',
                      'eval_metric': 'mlogloss',
                      'gamma': 1,
                      'min_child_weight': 1.5,
                      'max_depth': 5,
                      'lambda': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'eta': 0.03,
                      'tree_method': 'exact',
                      'seed': 2017,
                      "num_class": 2}
            num_round = 10000
            early_stopping_rounds = 100
            watchlist = [(train_matrix, 'train'), (test_matrix, 'eval')]
            if test_matrix:
                model = clf.train(params, train_matrix, num_boost_round=num_round, evals=watchlist,
                                  early_stopping_rounds=early_stopping_rounds)
                pre = model.predict(test_matrix, ntree_limit=model.best_ntree_limit)
                train[test_index] = pre[:, 0].reshape(-1, 1)
                test_pre[i, :] = model.predict(z, ntree_limit=model.best_ntree_limit)[:, 0].reshape(-1, 1)
                cv_scores.append(log_loss(te_y, pre))
        elif clf_name in ["lgb"]:
            train_matrix = clf.Dataset(tr_x, label=tr_y)
            test_matrix = clf.Dataset(te_x, label=te_y)
            params = {'boosting_type': 'gbdt',
                      # 'boosting_type': 'dart',
                      'objective': 'multiclass',
                      'metric': 'multi_logloss',
                      'min_child_weight': 1.5,
                      'num_leaves': 2**5,
                      'lambda_l2': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'learning_rate': 0.03,
                      'tree_method': 'exact',
                      'seed': 2017,
                      "num_class": 2,
                      'silent': True,
                      }
            num_round = 10000
            early_stopping_rounds = 100
            if test_matrix:
                model = clf.train(params, train_matrix, num_round, valid_sets=test_matrix,
                                  early_stopping_rounds=early_stopping_rounds)
                pre = model.predict(te_x, num_iteration=model.best_iteration)
                train[test_index] = pre[:, 0].reshape(-1, 1)
                test_pre[i, :] = model.predict(test_x, num_iteration=model.best_iteration)[:, 0].reshape(-1, 1)
                cv_scores.append(log_loss(te_y, pre))
        else:
            raise IOError("Please add new clf.")
        print("%s now score is:" % clf_name, cv_scores)
    test[:] = test_pre.mean(axis=0)
    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    return train.reshape(-1, 1), test.reshape(-1, 1)

def rf_clf(x_train, y_train, x_valid, kf, label_split=None):
    # max_features="sqrt" is what "auto" meant for classifiers before it was removed from scikit-learn
    randomforest = RandomForestClassifier(n_estimators=1200, max_depth=20, n_jobs=-1, random_state=2017, max_features="sqrt", verbose=1)
    rf_train, rf_test = stacking_clf(randomforest, x_train, y_train, x_valid, "rf", kf, label_split=label_split)
    return rf_train, rf_test, "rf"

def ada_clf(x_train, y_train, x_valid, kf, label_split=None):
    adaboost = AdaBoostClassifier(n_estimators=50, random_state=2017, learning_rate=0.01)
    ada_train, ada_test = stacking_clf(adaboost, x_train, y_train, x_valid, "ada", kf, label_split=label_split)
    return ada_train, ada_test, "ada"

def gb_clf(x_train, y_train, x_valid, kf, label_split=None):
    gbdt = GradientBoostingClassifier(learning_rate=0.04, n_estimators=100, subsample=0.8, random_state=2017, max_depth=5, verbose=1)
    gbdt_train, gbdt_test = stacking_clf(gbdt, x_train, y_train, x_valid, "gb", kf, label_split=label_split)
    return gbdt_train, gbdt_test, "gb"

def et_clf(x_train, y_train, x_valid, kf, label_split=None):
    extratree = ExtraTreesClassifier(n_estimators=1200, max_depth=35, max_features="sqrt", n_jobs=-1, random_state=2017, verbose=1)
    et_train, et_test = stacking_clf(extratree, x_train, y_train, x_valid, "et", kf, label_split=label_split)
    return et_train, et_test, "et"

def xgb_clf(x_train, y_train, x_valid, kf, label_split=None):
    xgb_train, xgb_test = stacking_clf(xgboost, x_train, y_train, x_valid, "xgb", kf, label_split=label_split)
    return xgb_train, xgb_test, "xgb"

def lgb_clf(x_train, y_train, x_valid, kf, label_split=None):
    lgb_train, lgb_test = stacking_clf(lightgbm, x_train, y_train, x_valid, "lgb", kf, label_split=label_split)
    return lgb_train, lgb_test, "lgb"

def gnb_clf(x_train, y_train, x_valid, kf, label_split=None):
    gnb = GaussianNB()
    gnb_train, gnb_test = stacking_clf(gnb, x_train, y_train, x_valid, "gnb", kf, label_split=label_split)
    return gnb_train, gnb_test, "gnb"

def lr_clf(x_train, y_train, x_valid, kf, label_split=None):
    logisticregression = LogisticRegression(n_jobs=-1, random_state=2017, C=0.1, max_iter=200)
    lr_train, lr_test = stacking_clf(logisticregression, x_train, y_train, x_valid, "lr", kf, label_split=label_split)
    return lr_train, lr_test, "lr"

def knn_clf(x_train, y_train, x_valid, kf, label_split=None):
    kneighbors = KNeighborsClassifier(n_neighbors=200, n_jobs=-1)
    knn_train, knn_test = stacking_clf(kneighbors, x_train, y_train, x_valid, "knn", kf, label_split=label_split)
    return knn_train, knn_test, "knn"

Getting the training and validation data (in preparation for the stacking features)

features_columns = [c for c in all_data_test.columns if c not in ['label', 'prob', 'seller_path', 'cat_path', 'brand_path', 'action_type_path', 'item_path', 'time_stamp_path']]
x_train = all_data_test[~all_data_test['label'].isna()][features_columns].values
y_train = all_data_test[~all_data_test['label'].isna()]['label'].values
x_valid = all_data_test[all_data_test['label'].isna()][features_columns].values

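Handling inf and NaN values

def get_matrix(data):
    where_are_nan = np.isnan(data)
    where_are_inf = np.isinf(data)
    data[where_are_nan] = 0
    data[where_are_inf] = 0
    return data

# np.float_ was removed in NumPy 2.0, so cast with astype instead
x_train = get_matrix(x_train.astype(np.float64))
y_train = y_train.astype(np.int64)
# The training matrix is reused as the validation matrix here, probably because the
# 2000-row sample taken in 4.5 comes from the head of the data and holds only labelled rows
x_valid = x_train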

Importing the data-splitting utilities and using 5 folds for the stacking features

from sklearn.model_selection import StratifiedKFold, KFold
folds = 5
seed = 1
kf = KFold(n_splits=5, shuffle=True, random_state=0)

Building the stacking features with the lgb and xgb classification models

# clf_list = [lgb_clf, xgb_clf, lgb_reg, xgb_reg]
# clf_list_col = ['lgb_clf', 'xgb_clf', 'lgb_reg', 'xgb_reg']
clf_list = [lgb_clf, xgb_clf]
clf_list_col = ['lgb_clf', 'xgb_clf']

Training the models and collecting the stacking features

column_list = []
train_data_list = []
test_data_list = []
for clf in clf_list:
    train_data, test_data, clf_name = clf(x_train, y_train, x_valid, kf, label_split=None)
    train_data_list.append(train_data)
    test_data_list.append(test_data)
train_stacking = np.concatenate(train_data_list, axis=1)
test_stacking = np.concatenate(test_data_list, axis=1)
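The core of a stacking feature is the out-of-fold prediction: each training row is scored by a model that never saw it, while rows without labels get the average of the fold models' predictions. The sketch below shows only that mechanism with NumPy; the least-squares base model, the synthetic data, and the name `oof_predictions` are assumptions made for the example and stand in for the lgb and xgb models used above.

```python
import numpy as np

def oof_predictions(x, y, x_new, n_folds=5, seed=0):
    """Out-of-fold predictions from a least-squares base model (illustration only)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    oof = np.zeros(len(x))
    new_preds = np.zeros((n_folds, len(x_new)))
    for k, valid_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        # Fit on the other folds only (a bias column is appended)
        a = np.c_[x[train_idx], np.ones(len(train_idx))]
        coef, *_ = np.linalg.lstsq(a, y[train_idx], rcond=None)
        # Score the held-out fold and the unlabelled rows with this fold's model
        oof[valid_idx] = np.c_[x[valid_idx], np.ones(len(valid_idx))] @ coef
        new_preds[k] = np.c_[x_new, np.ones(len(x_new))] @ coef
    return oof, new_preds.mean(axis=0)

# Synthetic data: the label depends mostly on the first feature
rng = np.random.default_rng(1)
x = rng.normal(size=(50, 3))
y = (x[:, 0] + 0.1 * rng.normal(size=50) > 0).astype(float)
x_new = rng.normal(size=(10, 3))

train_feature, new_feature = oof_predictions(x, y, x_new)
print(train_feature.shape, new_feature.shape)
```

`train_feature` and `new_feature` play the roles of one column of `train_stacking` and `test_stacking`, respectively.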

5. Model Training, Validation, and Evaluation

import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings("ignore")

train_data = pd.read_csv('train_all.csv', nrows=10000)
test_data = pd.read_csv('test_all.csv', nrows=100)
train_data.head()

train_data.columns

Getting the training and test data

features_columns = [col for col in train_data.columns if col not in ['user_id', 'label']]
train = train_data[features_columns].values
test = test_data[features_columns].values
target = train_data['label'].values

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0, n_jobs=-1)
X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=0.4, random_state=0)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

Cross-validation: evaluating estimator performance

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0, n_jobs=-1)
scores = cross_val_score(clf, train, target, cv=5)
print(scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Hyperparameter tuning

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier

# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=0.5, random_state=0)

# model
clf = RandomForestClassifier(n_jobs=-1)

# Set the parameters by cross-validation
tuned_parameters = {
    'n_estimators': [50, 100, 200]
    # , 'criterion': ['gini', 'entropy']
    # , 'max_depth': [2, 5]
    # , 'max_features': ['log2', 'sqrt', 'int']
    # , 'bootstrap': [True, False]
    # , 'warm_start': [True, False]
}

scores = ['precision']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()
    clf = GridSearchCV(clf, tuned_parameters, cv=5, scoring='%s_macro' % score)
    clf.fit(X_train, y_train)
    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
    print()
    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()

Confusion matrix

import itertools
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier

# label name
class_names = ['no-repeat', 'repeat']

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)

# Run the classifier
clf = RandomForestClassifier(n_jobs=-1)
y_pred = clf.fit(X_train, y_train).predict(X_test)

def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True, title='Normalized confusion matrix')
plt.show()
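Row normalization is what makes the confusion matrix informative on imbalanced labels like these. The hand-sized example below (made-up labels, NumPy only, with a hypothetical `confusion_counts` helper) shows a case where overall accuracy looks acceptable while only half of the repeat buyers are found.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes=2):
    """Rows are true labels, columns are predicted labels."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Made-up labels: 8 non-repeat buyers (0) and 2 repeat buyers (1)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1])

cm = confusion_counts(y_true, y_pred)
# Divide each row by its total so every row sums to 1
cm_norm = cm / cm.sum(axis=1, keepdims=True)

accuracy = np.trace(cm) / cm.sum()
recall_repeat = cm_norm[1, 1]
print(cm)
print(cm_norm)
print(accuracy, recall_repeat)
```

The diagonal of the normalized matrix holds the per-class recall, which is why the normalized plot is the more useful of the two when one class dominates.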

from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier

# label name
class_names = ['no-repeat', 'repeat']

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)

# Run the classifier
clf = RandomForestClassifier(n_jobs=-1)
y_pred = clf.fit(X_train, y_train).predict(X_test)
print(classification_report(y_test, y_pred, target_names=class_names))


Different classification models

# Logistic regression
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)
# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, target, random_state=0)
# multi_class is omitted: the task is binary and the argument is deprecated in recent scikit-learn
clf = LogisticRegression(random_state=0, solver='lbfgs').fit(X_train, y_train)
clf.score(X_test, y_test)

# K-nearest neighbors
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)
# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, target, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
clf.score(X_test, y_test)

# Gaussian naive Bayes
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)
# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, target, random_state=0)
clf = GaussianNB().fit(X_train, y_train)
clf.score(X_test, y_test)

# Decision tree
from sklearn import tree

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

# Bagging
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)
clf = BaggingClassifier(KNeighborsClassifier(), max_samples=0.5, max_features=0.5)
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

# Random forest
from sklearn.ensemble import RandomForestClassifier

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)
clf = RandomForestClassifier(n_estimators=10, max_depth=3, min_samples_split=12, random_state=0)
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

# Extra trees
from sklearn.ensemble import ExtraTreesClassifier

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)
clf = ExtraTreesClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

# AdaBoost
from sklearn.ensemble import AdaBoostClassifier

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)
clf = AdaBoostClassifier(n_estimators=10)
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

# Gradient boosting
from sklearn.ensemble import GradientBoostingClassifier

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)
clf = GradientBoostingClassifier(n_estimators=10, learning_rate=1.0, max_depth=1, random_state=0)
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

Model voting (VotingClassifier)

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.preprocessing import StandardScaler

stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)
y = target

clf1 = LogisticRegression(solver='lbfgs', random_state=1)
clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
clf3 = GaussianNB()

# Hard voting: the ensemble predicts the majority class label of the three models
eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='hard')

# 5-fold cross-validated accuracy for each base model and for the ensemble
for clf, label in zip([clf1, clf2, clf3, eclf], ['Logistic Regression', 'Random Forest', 'naive Bayes', 'Ensemble']):
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
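The task asks for a repurchase probability, while a hard-voting ensemble only outputs class labels. The following is a minimal sketch of soft voting, which averages the predicted probabilities of the base models. It assumes synthetic data from make_classification in place of the real feature table; the names X_demo, y_demo and soft_vote are illustrative and do not come from the original code. ROC AUC is used here only as one example of a metric that scores probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced binary data (assumption: stands in for the real features)
X_demo, y_demo = make_classification(n_samples=400, n_features=10,
                                     weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo,
                                          random_state=0)

# Soft voting averages predict_proba across the base models
soft_vote = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000, random_state=1)),
                ('rf', RandomForestClassifier(n_estimators=50, random_state=1)),
                ('gnb', GaussianNB())],
    voting='soft')
soft_vote.fit(X_tr, y_tr)

# Probability of the positive class for each test sample
proba = soft_vote.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, proba)
print("ROC AUC: %0.3f" % auc)
```

With voting='hard' the ensemble has no predict_proba, so a probability output needs the soft variant or a single probabilistic model.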

6. Feature optimization and feature selection

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

# Read a sample of the merged feature tables
train_data = pd.read_csv('train_all.csv', nrows=10000)
test_data = pd.read_csv('test_all.csv', nrows=100)

# Build the training and test feature matrices and the label vector
features_columns = [col for col in train_data.columns if col not in ['user_id', 'label']]
train = train_data[features_columns].values
test = test_data[features_columns].values
target = train_data['label'].values

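Filling in missing values
There are many ways to handle missing values. The most common are:

1. Delete. Drop the affected rows or columns. This works when the dataset is large or when only a small share of the data is missing.

2. Fill. The usual approach is to fill with the mean or the median; interpolation or model-based prediction can also be used to fill the gaps.

3. Leave as is. Tree-based models are relatively insensitive to missing values, although whether NaN is accepted as input depends on the implementation.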
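As a small illustration of the first two options, here is a sketch on a toy array. The array X_toy and the other names are made up for this example and are not part of the competition data. The third option needs no preprocessing code, so it is not shown.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with two missing entries (illustrative data only)
X_toy = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, np.nan],
                  [5.0, 40.0]])

# Option 1: delete every row that contains at least one missing value
X_dropped = X_toy[~np.isnan(X_toy).any(axis=1)]

# Option 2: fill each column with its median, computed over the observed values
median_imputer = SimpleImputer(missing_values=np.nan, strategy='median')
X_filled = median_imputer.fit_transform(X_toy)

print(X_dropped)
print(X_filled)
```

Deleting keeps only the two complete rows, while filling keeps all four rows and replaces each gap with the column median (3.0 and 20.0 here).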

Fill with the median

# Older scikit-learn versions:
# from sklearn.preprocessing import Imputer
# imputer = Imputer(strategy="median")
from sklearn.impute import SimpleImputer

# Learn the per-column medians on the training set and apply them to both sets
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
imputer = imputer.fit(train)
train_imputer = imputer.transform(train)
test_imputer = imputer.transform(test)

Feature selection

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

def feature_selection(train, train_sel, target):
    # Compare 5-fold cross-validated accuracy before and after feature selection
    clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0, n_jobs=-1)
    scores = cross_val_score(clf, train, target, cv=5)
    scores_sel = cross_val_score(clf, train_sel, target, cv=5)
    print("No Select Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
    print("Features Select Accuracy: %0.2f (+/- %0.2f)" % (scores_sel.mean(), scores_sel.std() * 2))

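Removing low-variance features
VarianceThreshold is a simple baseline approach to feature selection. It removes every feature whose variance does not exceed a given threshold. By default it removes all zero-variance features, that is, features that take the same value in every sample.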
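The threshold .8 * (1 - .8) used in this section comes from the variance of a Boolean (Bernoulli) feature, Var[X] = p(1 - p): with this value, a 0/1 feature that takes the same value in more than 80% of the samples is dropped. A minimal sketch on a toy Boolean matrix follows. X_bool is illustrative data, and the features in train_all.csv are not necessarily Boolean, so the same threshold has a less direct interpretation there.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy Boolean features (illustrative data only)
# Column 0 is 0 in 5 of 6 samples: variance = (1/6) * (5/6), about 0.139
X_bool = np.array([[0, 0, 1],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 1, 1],
                   [0, 1, 0],
                   [0, 1, 1]])

# Keep only the features whose variance is above 0.8 * (1 - 0.8) = 0.16
selector = VarianceThreshold(threshold=(.8 * (1 - .8)))
X_bool_sel = selector.fit_transform(X_bool)

print(selector.variances_)    # per-feature variance
print(selector.get_support()) # mask of the features that were kept
print(X_bool_sel.shape)
```

Column 0 falls below the threshold and is removed; the other two columns, with variances of about 0.222 and 0.25, are kept.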

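from sklearn.feature_selection import VarianceThreshold

# Keep only the features whose variance is above 0.8 * (1 - 0.8) = 0.16
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel = sel.fit(train)
train_sel = sel.transform(train)
test_sel = sel.transform(test)
print('Training data shape before feature selection', train.shape)
print('Training data shape after feature selection', train_sel.shape)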
