日萌社

人工智能AI:Keras PyTorch MXNet TensorFlow PaddlePaddle 深度学习实战(不定时更新)


阿里云官网:天池新人实战赛o2o优惠券使用预测

数据集下载链接:https://pan.baidu.com/s/13OtaUv6j4x8dD7cgD4sL5g 
提取码:7tze 


Sklearn:天池新人实战赛o2o优惠券使用预测 part1

Sklearn:天池新人实战赛o2o优惠券使用预测 part2

Sklearn:天池新人实战赛o2o优惠券使用预测 part3


特征处理

In [1]:

import pandas as pd
import numpy as np
from datetime import date
import datetime as dt

获取数据

In [2]:

off_train = pd.read_csv('./data/ccf_offline_stage1_train.csv', keep_default_na=False, header=0)
off_test = pd.read_csv('./data/ccf_offline_stage1_test_revised.csv',keep_default_na=False, header=0)
on_train = pd.read_csv('./data/ccf_online_stage1_train.csv', keep_default_na=False, header=0)

In [3]:

# 更换特征名称
off_train.columns=['user_id','merchant_id','coupon_id','discount_rate','distance','date_received','date']
off_test.columns = ['user_id','merchant_id','coupon_id','discount_rate','distance','date_received']
on_train.columns = ['user_id','merchant_id','action','coupon_id','discount_rate','date_received','date']

数据分批处理

#将数据分为3个数据集 利用滑窗法

#将2016年1月1日到4月13日的数据提取特征,利用4月14日的到5月14日的作为测试集

#将2月1日到5月14日的作为数据集提取特征,利用5月15日6月15日的作为测试集

#将3月15日到6月30日作为数据集提取特征,再测试7月1日到7月31日的数据

In [4]:

# 将2016年1月1日到4月13日的数据提取特征
feature1 = off_train[(off_train.date >= '20160101') & (off_train.date <= '20160413') | (
    (off_train.date == 'null') & (off_train.date_received >= '20160101') & (off_train.date_received <= '20160413'))]
# 利用4月14日的到5月14日的作为测试集
dataset1 = off_train[(off_train.date_received >= '201604014')
                     & (off_train.date_received <= '20160514')]

In [5]:

# 在2月1日到5月14日之间使用了券,只要领取时间在2月1日到5月14日之间,并包括没有数据中没有领取券的
feature2 = off_train[(off_train.date >= '20160201') & (off_train.date <= '20160514') | (
    (off_train.date == 'null') & (off_train.date_received >= '20160201') & (off_train.date_received <= '20160514'))]
# 提取数据集2的测试集
dataset2 = off_train[(off_train.date_received >= '20160515')
                     & (off_train.date_received <= '20160615')]

In [6]:

# 数据集3的特征为 取数据中领券和用券日期大于3月15日和小于6月30日的
feature3 = off_train[((off_train.date >= '20160315') & (off_train.date <= '20160630')) | (
    (off_train.date == 'null') & (off_train.date_received >= '20160315') & (off_train.date_received <= '20160630'))]
# 使数据集3等于test集 没有label标签
dataset3 = off_test

特征工程

用户特征处理

In [7]:

def is_firstlastone(x):
    if x == 0:
        return 1
    elif x > 0:
        return 0
    else:
        return -1  # 表明这个优惠券只接受了一次
def get_day_gap_before(s):
    date_received, dates = s.split('-')
    dates = dates.split(':')
    gaps = []
    for d in dates:
        # 将时间差转化为天数
        this_gap = (dt.date(int(date_received[0:4]), int(date_received[4:6]), int(
            date_received[6:8]))-dt.date(int(d[0:4]), int(d[4:6]), int(d[6:8]))).days
        if this_gap > 0:
            gaps.append(this_gap)
    if len(gaps) == 0:
        return -1
    else:
        return min(gaps)
def get_day_gap_after(s):
    date_received, dates = s.split('-')
    dates = dates.split(':')
    gaps = []
    for d in dates:
        this_gap = (dt.datetime(int(d[0:4]), int(d[4:6]), int(d[6:8]))-dt.datetime(
            int(date_received[0:4]), int(date_received[4:6]), int(date_received[6:8]))).days
        if this_gap > 0:
            gaps.append(this_gap)
    if len(gaps) == 0:
        return -1
    else:
        return min(gaps)

In [8]:

def GetOtherFeature(dataset):
    # 对于测试集,提取用户的Id
    dataset3 = dataset
    t = dataset3[['user_id']]
    # 相当于给原有数据加上一列,这个月用户收取的所有优惠券数目,并初始化为1
    t['this_month_user_receive_all_coupon_count'] = 1
    # 将t按照用户id进行分组,然后统计所有用户收取的优惠券数目,并初始化一个索引值
    t = t.groupby('user_id').agg('sum').reset_index()
    # 提取数据集的优惠券Id和用户Id
    t1 = dataset3[['user_id', 'coupon_id']]
    # 提取这个月用户收到的相同的优惠券的数量
    t1['this_month_user_receive_same_coupn_count'] = 1
    t1 = t1.groupby(['user_id', 'coupon_id']).agg('sum').reset_index()
    # 提取数据集的用户id,优惠券id以及优惠券接收的时间
    t2 = dataset3[['user_id', 'coupon_id', 'date_received']]
    # 将数据转换为str类型
    t2.date_received = t2.date_received.astype('str')
    # 如果出现相同的用户接收相同的优惠券在接收时间上用‘:’连接上第n次接受优惠券的时间
    t2 = t2.groupby(['user_id', 'coupon_id'])['date_received'].agg(
        lambda x: ':'.join(x)).reset_index()
    # 将接收时间的一组按着':'分开,这样就可以计算接受了优惠券的数量,apply是合并
    t2['receive_number'] = t2.date_received.apply(lambda s: len(s.split(':')))
    t2 = t2[t2.receive_number > 1]
    # 最大接受的日期
    t2['max_date_received'] = t2.date_received.apply(
        lambda s: max([int(d) for d in s.split(':')]))
    # 最小的接收日期
    t2['min_date_received'] = t2.date_received.apply(
        lambda s: min([int(d) for d in s.split(':')]))
    t2 = t2[['user_id', 'coupon_id', 'max_date_received', 'min_date_received']]
    t3 = dataset3[['user_id', 'coupon_id', 'date_received']]
    # 将两表融合只保留左表数据,这样得到的表,相当于保留了最近接收时间和最远接受时间
    t3 = pd.merge(t3, t2, on=['user_id', 'coupon_id'], how='left')
    # 这个优惠券最近接受时间
    t3['this_month_user_receive_same_coupon_lastone'] = t3.max_date_received - \
        t3.date_received.astype(int)
    # 这个优惠券最远接受时间
    t3['this_month_user_receive_same_coupon_firstone'] = t3.date_received.astype(
        int)-t3.min_date_received
    t3.this_month_user_receive_same_coupon_lastone = t3.this_month_user_receive_same_coupon_lastone.apply(
        is_firstlastone)
    t3.this_month_user_receive_same_coupon_firstone = t3.this_month_user_receive_same_coupon_firstone.apply(
        is_firstlastone)
    t3 = t3[['user_id', 'coupon_id', 'date_received', 'this_month_user_receive_same_coupon_lastone',
             'this_month_user_receive_same_coupon_firstone']]
    # 将表格中接收优惠券日期中为最近和最远的日期时置为1其余为0,若只接受了一次优惠券为-1
    # 提取第四个特征,一个用户所接收到的所有优惠券的数量
    t4 = dataset3[['user_id', 'date_received']]
    t4['this_day_receive_all_coupon_count'] = 1
    t4 = t4.groupby(['user_id', 'date_received']).agg('sum').reset_index()
    # 提取第五个特征,一个用户不同时间所接收到不同优惠券的数量
    t5 = dataset3[['user_id', 'coupon_id', 'date_received']]
    t5['this_day_user_receive_same_coupon_count'] = 1
    t5 = t5.groupby(['user_id', 'coupon_id', 'date_received']
                    ).agg('sum').reset_index()
    # 一个用户不同优惠券 的接受时间
    t6 = dataset3[['user_id', 'coupon_id', 'date_received']]
    t6.date_received = t6.date_received.astype('str')
    t6 = t6.groupby(['user_id', 'coupon_id'])['date_received'].agg(
        lambda x: ':'.join(x)).reset_index()
    # 重命名inplace代表深拷贝
    t6.rename(columns={'date_received': 'dates'}, inplace=True)
    t7 = dataset3[['user_id', 'coupon_id', 'date_received']]
    # 将t6和t7融合
    t7 = pd.merge(t7, t6, on=['user_id', 'coupon_id'], how='left')
    # 注意这里所有的时间格式都已经是'str'格式
    t7['date_received_date'] = t7.date_received.astype('str')+'-'+t7.dates
    # print(t7)
    t7['day_gap_before'] = t7.date_received_date.apply(get_day_gap_before)
    t7['day_gap_after'] = t7.date_received_date.apply(get_day_gap_after)
    t7 = t7[['user_id', 'coupon_id', 'date_received',
             'day_gap_before', 'day_gap_after']]
    # 将所有特征融合在一张表中
    other_feature3 = pd.merge(t1, t, on='user_id')
    other_feature3 = pd.merge(other_feature3, t3, on=['user_id', 'coupon_id'])
    other_feature3 = pd.merge(other_feature3, t4, on=['user_id', 'date_received'])
    other_feature3 = pd.merge(other_feature3, t5, on=[
                              'user_id', 'coupon_id', 'date_received'])
    other_feature3 = pd.merge(other_feature3, t7, on=[
                              'user_id', 'coupon_id', 'date_received'])
    return other_feature3

提取优惠券的相关特征

In [9]:

def calc_discount_rate(s):
    s = str(s)
    s = s.split(':')
    if len(s) == 1:
        return float(s[0])
    else:
        return 1.0-float(s[1])/float(s[0])
def get_discount_man(s):
    s = str(s)
    s = s.split(':')
    if len(s)==1:
        return 'null'
    else:
        return int(s[0])
def get_discount_jian(s):
    s = str(s)
    s = s.split(':')
    if len(s) == 1:
        return 'null'
    else:
        return int(s[1])
def is_man_jian(s):
    s = str(s)
    s = s.split(':')
    if len(s)==1:
        return 0
    else:
        return 1 

In [10]:

def GetCouponFeature(dataset, feature):
    # 对于数据集
    # 将时间转化为第几周
    # 显示时间是第几周
    # tt是获取到的特征中消费的最大时间
    dataset3 = dataset
    tt = feature[feature.date != 'null'].date.unique().max()
    dataset3['day_of_week'] = dataset3.date_received.astype('str').apply(
        lambda x: date(int(x[0:4]), int(x[4:6]), int(x[6:8])).weekday()+1)
    # 显示时间是几月
    dataset3['day_of_month'] = dataset3.date_received.astype(
        'str').apply(lambda x: int(x[6:8]))
    # 显示时期和截止日之间的天数
    dataset3['days_distance'] = dataset3.date_received.astype('str').apply(
        lambda x: (date(int(x[0:4]), int(x[4:6]), int(x[6:8]))-date(int(tt[0:4]), int(tt[4:6]), int(tt[6:8]))).days)
    # 显示满了多少钱后开始减
    dataset3['discount_man'] = dataset3.discount_rate.apply(get_discount_man)
    # 显示满减的减少的钱
    dataset3['discount_jian'] = dataset3.discount_rate.apply(get_discount_jian)
    # 返回优惠券是否是满减券
    dataset3['is_man_jian'] = dataset3.discount_rate.apply(is_man_jian)
    # 显示打折力度
    dataset3['discount_rate'] = dataset3.discount_rate.apply(
        calc_discount_rate)
    d = dataset3[['coupon_id']]
    d['coupon_count'] = 1
    # 显示每一种优惠券的数量
    d = d.groupby('coupon_id').agg('sum').reset_index()
    dataset3 = pd.merge(dataset3, d, on='coupon_id', how='left')
    return dataset3

提取商品的特征

In [11]:

def GetMerchantFeature(feature):
    #提取商品的特征
    #对于数据集
    feature3 = feature
    merchant3 = feature3[['merchant_id','coupon_id','distance','date_received','date']]
    t = merchant3[['merchant_id']]
    #删除重复行数据
    t.drop_duplicates(inplace=True)
    #显示卖出的商品
    t1 = merchant3[merchant3.date!='null'][['merchant_id']]
    t1['total_sales'] = 1
    #显示每个商品的销售数量
    t1 = t1.groupby('merchant_id').agg('sum').reset_index()
    #显示使用了优惠券消费的商品,正样本
    t2 = merchant3[(merchant3.date!='null')&(merchant3.coupon_id!='null')][['merchant_id']]
    t2['sales_use_coupon'] = 1
    t2 = t2.groupby('merchant_id').agg('sum').reset_index()
    #显示了商品的优惠券的总数量
    t3 = merchant3[merchant3.coupon_id != 'null'][['merchant_id']]
    t3 ['total_coupon'] = 1
    t3 = t3.groupby('merchant_id').agg('sum').reset_index()
    #显示商品销量和距离的关系
    t4 = merchant3[(merchant3.date != 'null')&(merchant3.coupon_id != 'null')][['merchant_id','distance']]
    #把数据中的null值全部替换为-1
    t4.replace('null',-1,inplace=True)
    t4.distance = t4.distance.astype('int')
    #再把数据中的-1全部替换为NaN
    t4.replace(-1,np.nan,inplace=True)
    #返回用户离商品的距离最小值
    t5 = t4.groupby('merchant_id').agg('min').reset_index()
    t5.rename(columns={'distance':'merchant_min_distance'},inplace = True)
    #返回用户离商品的距离最大值
    t6 = t4.groupby('merchant_id').agg('max').reset_index()
    t6.rename(columns={'distance':'merchant_max_distance'},inplace = True)
    #print(t6)
    #返回距离的平均值
    t7 = t4.groupby('merchant_id').agg('mean').reset_index()
    t7.rename(columns = {'distance':'merchant_mean_distance'},inplace= True)
    #返回距离的中位值
    t8 = t4.groupby('merchant_id').agg('median').reset_index()
    t8.rename(columns={'distance':'merchant_median_distance'},inplace = True)
    merchant3_feature = pd.merge(t,t1,on='merchant_id',how='left')
    merchant3_feature = pd.merge(merchant3_feature,t2,on='merchant_id',how='left')
    merchant3_feature = pd.merge(merchant3_feature,t3,on='merchant_id',how='left')
    merchant3_feature = pd.merge(merchant3_feature,t5,on='merchant_id',how='left')
    merchant3_feature = pd.merge(merchant3_feature,t6,on='merchant_id',how='left')
    merchant3_feature = pd.merge(merchant3_feature,t7,on='merchant_id',how='left')
    merchant3_feature = pd.merge(merchant3_feature,t8,on='merchant_id',how='left')
    #将数据中的NaN用0来替换
    merchant3_feature.sales_use_coupon = merchant3_feature.sales_use_coupon.replace(np.nan,0)
    #即优惠券的使用率
    merchant3_feature['merchant_coupon_transfer_rate'] = merchant3_feature.sales_use_coupon.astype('float')/merchant3_feature.total_coupon
    #即卖出商品中使用优惠券的占比
    merchant3_feature['coupon_rate'] = merchant3_feature.sales_use_coupon.astype('float') / merchant3_feature.total_sales
    #将数据中的NaN用0来替换
    merchant3_feature.total_coupon = merchant3_feature.total_coupon.replace(np.nan,0)
    return merchant3_feature

用户的相关信息

In [12]:

def get_user_date_datereceived_gap(s):
    s = s.split(':')
    return (date(int(s[0][0:4]), int(s[0][4:6]), int(s[0][6:8])) - date(int(s[1][0:4]), int(s[1][4:6]), int(s[1][6:8]))).days

In [13]:

def GetUserRelateInfo(feature):
    feature3 = feature
    user3 = feature3[['user_id','merchant_id','coupon_id','discount_rate','distance','date_received','date']]
    t = user3[['user_id']]
    #去掉数据中重复的用户Id
    t.drop_duplicates(inplace=True)
    #用户购买商品的种类数
    t1 = user3[user3.date!='null'][['user_id','merchant_id']]
    #同样去掉重复用的用户id和商品id
    t1.drop_duplicates(inplace=True)
    t1.merchant_id = 1
    t1 = t1.groupby('user_id').agg('sum').reset_index()
    t1.rename(columns={'merchant_id':'count_merchant'},inplace=True)
    #使用了优惠券购买商品的用户id和距离
    t2 = user3[(user3.date!='null')&(user3.coupon_id!='null')][['user_id','distance']]
    #将null值替换为-1
    t2.replace('null',-1,inplace=True)
    t2.distance = t2.distance.astype('int')#转换数据类型为int
    t2.replace(-1,np.nan,inplace=True)
    #得到使用优惠券购买商品的用户离店铺的最短距离
    t3 = t2.groupby('user_id').agg('min').reset_index()
    t3.rename(columns={'distance':'user_min_distance'},inplace=True)
    #得到最大距离
    t4 = t2.groupby('user_id').agg('max').reset_index()
    t4.rename(columns={'distance':'user_max_distance'},inplace=True)
    #得到平均距离
    t5 = t2.groupby('user_id').agg('mean').reset_index()
    t5.rename(columns={'distance':'user_mean_distance'},inplace=True)
    #得到中间距离
    t6 = t2.groupby('user_id').agg('median').reset_index()
    t6.rename(columns={'distance':'user_median_distance'},inplace=True)
    #每个用户使用优惠券购买的物品数量
    t7 = user3[(user3.date != 'null')&(user3.coupon_id != 'null')][['user_id']]
    t7['buy_use_coupon'] = 1
    t7 = t7.groupby('user_id').agg('sum').reset_index()
    #购买物品的总数
    t8 = user3[user3.date != 'null'][['user_id']]
    t8['buy_total'] = 1
    t8 = t8.groupby('user_id').agg('sum').reset_index()
    #接受的优惠券的总数
    t9 = user3[user3.coupon_id != 'null'][['user_id']]
    t9['coupon_received'] = 1
    t9 = t9.groupby('user_id').agg('sum').reset_index()
    #接受到优惠券的日期和使用之间的间隔
    t10 = user3[(user3.date_received != 'null')&(user3.date != 'null')][['user_id','date_received','date']]
    t10['user_date_datereceived_gap'] = t10.date + ':'+ t10.date_received
    t10.user_date_datereceived_gap = t10.user_date_datereceived_gap.apply(get_user_date_datereceived_gap)
    t10 = t10[['user_id','user_date_datereceived_gap']]
    #将用户优惠券使用时间的间隔取平均数
    t11 = t10.groupby('user_id').agg('mean').reset_index()
    t11.rename(columns={'user_date_datereceived_gap':'avg_user_date_datereceived_gap'},inplace=True)
    #间隔天数的最小值
    t12 = t10.groupby('user_id').agg('min').reset_index()
    t12.rename(columns={'user_date_datereceived_gap':'min_user_date_datereceived_gap'},inplace=True)
    #间隔天数的最大值
    t13 = t10.groupby('user_id').agg('max').reset_index()
    t13.rename(columns={'user_date_datareceived_gap':'max_user_date_datereceived_gap'},inplace=True)
    #将提取的特征合并
    user3_feature = pd.merge(t,t1,on='user_id',how='left')
    user3_feature = pd.merge(user3_feature,t3,on='user_id',how='left')
    user3_feature = pd.merge(user3_feature,t4,on='user_id',how='left')
    user3_feature = pd.merge(user3_feature,t5,on='user_id',how='left')
    user3_feature = pd.merge(user3_feature,t6,on='user_id',how='left')
    user3_feature = pd.merge(user3_feature,t7,on='user_id',how='left')
    user3_feature = pd.merge(user3_feature,t8,on='user_id',how='left')
    user3_feature = pd.merge(user3_feature,t9,on='user_id',how='left')
    user3_feature = pd.merge(user3_feature,t11,on='user_id',how='left')
    user3_feature = pd.merge(user3_feature,t12,on='user_id',how='left')
    user3_feature = pd.merge(user3_feature,t13,on='user_id',how='left')
    user3_feature.count_merchant = user3_feature.count_merchant.replace(np.nan,0)
    user3_feature.buy_user_coupon = user3_feature.buy_use_coupon.replace(np.nan,0)
    user3_feature['buy_use_coupon_rate'] = user3_feature.buy_use_coupon.astype('float') / user3_feature.buy_total.astype('float')#使用优惠券购买的商品占总数的多少
    user3_feature['user_coupon_transfer_rate'] = user3_feature.buy_use_coupon.astype('float') / user3_feature.coupon_received.astype('float')
    user3_feature.buy_total = user3_feature.buy_total.replace(np.nan,0)#将数据中的NaN值转为0
    user3_feature.coupon_received = user3_feature.coupon_received.replace(np.nan,0)
    return user3_feature

用户和商家之间的特征关系

In [14]:

def GetUserMerchantRelateInfo(feature):
    #4.user_merchant:
    #times_user_buy_merchant_before. 
    feature3 = feature
    all_user_merchant = feature3[['user_id','merchant_id']]
    all_user_merchant.drop_duplicates(inplace=True)
    #只保留销售了商品的商户id
    t = feature3[['user_id','merchant_id','date']]
    t = t[t.date!='null'][['user_id','merchant_id']]
    #用户一共买了这家商户的多少商品
    t['user_merchant_buy_total'] = 1
    t = t.groupby(['user_id','merchant_id']).agg('sum').reset_index()
    t.drop_duplicates(inplace=True)
    t1 = feature3[['user_id','merchant_id','coupon_id']]
    t1 = t1[t1.coupon_id!='null'][['user_id','merchant_id']]
    #用户一共收到一个商户的多少优惠券
    t1['user_merchant_received'] = 1
    t1 = t1.groupby(['user_id','merchant_id']).agg('sum').reset_index()
    t1.drop_duplicates(inplace = True)
    t2 = feature3[['user_id','merchant_id','date','date_received']]
    t2 = t2[(t2.date!='null')&(t2.date_received!='null')][['user_id','merchant_id']]
    #用户在一家商户中使用优惠券购买的商品的数目
    t2['user_merchant_buy_use_coupon'] = 1
    t2 = t2.groupby(['user_id','merchant_id']).agg('sum').reset_index()
    t2.drop_duplicates(inplace = True)
    #用户在一家商家的所有记录总数
    t3 = feature3[['user_id','merchant_id']]
    t3['user_merchant_any'] = 1
    t3 = t3.groupby(['user_id','merchant_id']).agg('sum').reset_index()
    t3.drop_duplicates(inplace = True)
    t4 = feature3[['user_id','merchant_id','date','coupon_id']]
    t4 = t4[(t4.date!='null')&(t4.coupon_id=='null')][['user_id','merchant_id']]
    #用户没有使用优惠券购买的商品的数目
    t4['user_merchant_buy_common'] = 1
    t4 = t4.groupby(['user_id','merchant_id']).agg('sum').reset_index()
    t4.drop_duplicates(inplace = True)
    user_merchant3 = pd.merge(all_user_merchant,t,on=['user_id','merchant_id'],how='left')
    user_merchant3 = pd.merge(user_merchant3,t1,on=['user_id','merchant_id'],how='left')
    user_merchant3 = pd.merge(user_merchant3,t2,on=['user_id','merchant_id'],how='left')
    user_merchant3 = pd.merge(user_merchant3,t3,on=['user_id','merchant_id'],how='left')
    user_merchant3 = pd.merge(user_merchant3,t4,on=['user_id','merchant_id'],how='left')
    #都是针对一家商户和一个用户
    user_merchant3.user_merchant_buy_use_coupon = user_merchant3.user_merchant_buy_use_coupon.replace(np.nan,0)
    user_merchant3.user_merchant_buy_common = user_merchant3.user_merchant_buy_common.replace(np.nan,0)
    #y优惠券的转换率,用户使用了的优惠券/一共收到的优惠券
    user_merchant3['user_merchant_coupon_transfer_rate'] = user_merchant3.user_merchant_buy_use_coupon.astype('float') / user_merchant3.user_merchant_received.astype('float')
    #用户使用优惠券的概率,在一家商户使用优惠券购买的商品/在一家商户购买商品的总数
    user_merchant3['user_merchant_coupon_buy_rate'] = user_merchant3.user_merchant_buy_use_coupon.astype('float') / user_merchant3.user_merchant_buy_total.astype('float')
    #用户在商户消费的概率 用户在商户购买的总数/在一家商户浏览的总次数
    user_merchant3['user_merchant_rate'] = user_merchant3.user_merchant_buy_total.astype('float') / user_merchant3.user_merchant_any.astype('float')
    #用户在一家商户不适用优惠券购买的概率 普通购买的商品数/购买商品的总数
    user_merchant3['user_merchant_common_buy_rate'] = user_merchant3.user_merchant_buy_common.astype('float') / user_merchant3.user_merchant_buy_total.astype('float')
    return user_merchant3

构建训练集和测试集

In [15]:

def get_label(s):
    s = s.split(':')
    if s[0]=='null':
        return 0
    elif (date(int(s[0][0:4]),int(s[0][4:6]),int(s[0][6:8]))-date(int(s[1][0:4]),int(s[1][4:6]),int(s[1][6:8]))).days<=15:
        return 1
    else:
        return -1 

In [16]:

def GenerateData(dataset, feature, label=True):
    # 获取各个特征处理后的结果
    coupon_feature = GetCouponFeature(dataset, feature)
    merchant_feature = GetMerchantFeature(feature)
    user_feature = GetUserRelateInfo(feature)
    user_merchant = GetUserMerchantRelateInfo(feature)
    other_feature = GetOtherFeature(dataset)
    dataset = pd.merge(coupon_feature, merchant_feature,
                       on='merchant_id', how='left')
    dataset = pd.merge(dataset, user_feature, on='user_id', how='left')
    dataset = pd.merge(dataset, user_merchant, on=[
                       'user_id', 'merchant_id'], how='left')
    dataset = pd.merge(dataset, other_feature, on=[
                       'user_id', 'coupon_id', 'date_received'], how='left')
    dataset.drop_duplicates(inplace=True)
    dataset.user_merchant_buy_total = dataset.user_merchant_buy_total.replace(
        np.nan, 0)
    dataset.user_merchant_any = dataset.user_merchant_any.replace(np.nan, 0)
    dataset.user_merchant_received = dataset.user_merchant_received.replace(
        np.nan, 0)
    dataset['is_weekend'] = dataset.day_of_week.apply(
        lambda x: 1 if x in (6, 7) else 0)
    weekday_dummies = pd.get_dummies(dataset.day_of_week)
    weekday_dummies.columns = [
        'weekday'+str(i+1) for i in range(weekday_dummies.shape[1])]
    dataset = pd.concat([dataset, weekday_dummies], axis=1)
    # 如果是训练集要记得处理label标签值  但是在测试集中不用处理label标签 注意off_train和off_test字段
    if label:
        dataset['label'] = dataset.date.astype(
            'str') + ':' + dataset.date_received.astype('str')
        dataset.label = dataset.label.apply(get_label)
        dataset.drop(['merchant_id', 'day_of_week', 'date', 'date_received',
                     'coupon_count'], axis=1, inplace=True)
    else:
        dataset.drop(['merchant_id', 'day_of_week', 'coupon_count'],
                 axis=1, inplace=True)
    # 所有的表都要一起处理null
    dataset = dataset.replace('null', np.nan)
    return dataset

In [17]:

GenerateData1 = GenerateData(dataset1, feature1)
GenerateData2 = GenerateData(dataset2, feature2)
GenerateData3 = GenerateData(dataset3, feature3, False)

保存处理好的特征值, 以便后续使用

In [18]:

GenerateData1.to_csv('./GenerateData1.csv', index=None)
GenerateData2.to_csv('./GenerateData2.csv', index=None)
GenerateData3.to_csv('./GenerateData3.csv', index=None)

模型融合

In [1]:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import datetime as dt
from datetime import date
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss, roc_auc_score, auc, roc_curve
from sklearn.linear_model import SGDClassifier, LogisticRegression
from xgboost import XGBClassifier

In [2]

def get_processed_data():
    dataset1 = pd.read_csv('./GenerateData1.csv')
    dataset2 = pd.read_csv('./GenerateData2.csv')
    dataset3 = pd.read_csv('./GenerateData3.csv')
    dataset1.label.replace(-1, 0, inplace=True)
    dataset2.label.replace(-1, 0, inplace=True)
    dataset1.drop_duplicates(inplace=True)
    dataset2.drop_duplicates(inplace=True)
    dataset3.drop_duplicates(inplace=True) 
    # 按照行或列进行合并,axis=0为列索引,axis=1为行索引 因为特征处理都一样, 所以按照列索引
    dataset12 = pd.concat([dataset1, dataset2], axis=0)
    dataset12.fillna(-1, inplace=True)
#     dataset3.fillna(0, inplace=True)
    return dataset12, dataset3

In [3]:

dataset12, dataset3 = get_processed_data()

In [4]:

predict_dataset = dataset3[['user_id', 'coupon_id', 'date_received']].copy()
dataset12_label = dataset12.label
# 降低维度, 把没有必要的字段删除
dataset12_x = dataset12.drop(['user_id','label','coupon_id','day_gap_before','day_gap_after'],axis=1)
dataset3.fillna(-1, inplace=True)
dataset3_x = dataset3.drop(['user_id','coupon_id','date_received','day_gap_before','day_gap_after'],axis=1)

In [5]:

x_train, x_test, y_train, y_test = train_test_split(dataset12_x, dataset12_label, test_size=0.25, random_state=88)
X_train, X_train_lr, y_train, y_train_lr = train_test_split(x_train, y_train, test_size=0.5)

GBDT

In [6]:

gbdt_model =GradientBoostingClassifier(learning_rate=0.1,
                                 n_estimators=190,
                                 min_samples_split=5,
                                 min_samples_leaf=5,
                                 max_depth=15,
                                 random_state=0,
                                 max_features=24,)

XGB

In [7]:

XGB_model = XGBClassifier(max_depth=15, learning_rate=0.01,eta=1, gamma=0, n_jobs=-1)
# XGB_model.fit(x_train, y_train)

In [8]:

from sklearn.preprocessing import  OneHotEncoder

In [9]:

grd_enc = OneHotEncoder()

In [10]:

gbdt_model.fit(X_train, y_train)

Out[10]:

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,learning_rate=0.1, loss='deviance', max_depth=15,max_features=24, max_leaf_nodes=None,min_impurity_decrease=0.0, min_impurity_split=None,min_samples_leaf=5, min_samples_split=5,min_weight_fraction_leaf=0.0, n_estimators=190,n_iter_no_change=None, presort='deprecated',random_state=0, subsample=1.0, tol=0.0001,validation_fraction=0.1, verbose=0,warm_start=False)

In [11]:

# 训练one-hot编码器
grd_enc.fit(gbdt_model.apply(X_train)[:, :, 0])

Out[11]:

OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,handle_unknown='error', sparse=True)

In [27]:

# gbdt_model.apply(X_train)

In [24]:

gbdt_model.apply(X_train)[:, :, 0] # 将三维的以二维的形式表示

Out[24]:

array([[1407., 1448., 1478., ..., 1232., 1428., 2841.],[1495., 1500., 1807., ..., 1232., 1428., 2841.],[ 284.,  377.,  614., ..., 1232.,  756.,  647.],...,[ 275.,  362.,  599., ..., 1232.,  772., 1505.],[1409., 1456., 1690., ..., 1251., 1450., 2885.],[ 139.,  160.,  371., ..., 1126.,  253., 2742.]])

In [12]:

XGB_model.fit(grd_enc.transform(gbdt_model.apply(X_train_lr)[:, :, 0]), y_train_lr)

Out[12]:

XGBClassifier(base_score=0.5, booster=None, colsample_bylevel=1,colsample_bynode=1, colsample_bytree=1, eta=1, gamma=0, gpu_id=-1,importance_type='gain', interaction_constraints=None,learning_rate=0.01, max_delta_step=0, max_depth=15,min_child_weight=1, missing=nan, monotone_constraints=None,n_estimators=100, n_jobs=-1, num_parallel_tree=1,objective='binary:logistic', random_state=0, reg_alpha=0,reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method=None,validate_parameters=False, verbosity=None)

In [13]:

# 用训练好的XGB模型多X_test做预测
y_pred_grd_lm = XGB_model.predict_proba(grd_enc.transform(gbdt_model.apply(x_test)[:, :, 0]))[:, 1]

In [14]:

# 根据预测结果输出
fpr_grd_lm, tpr_grd_lm, _ = roc_curve(y_test, y_pred_grd_lm)

In [15]:

fpr_grd_lm

Out[15]:

array([0.00000000e+00, 1.97818067e-05, 3.95636133e-05, ...,7.50620654e-01, 7.51045963e-01, 1.00000000e+00])

In [16]:

tpr_grd_lm

Out[16]:

array([0.        , 0.0058958 , 0.01106967, ..., 0.98856937, 0.98856937,1.        ])

In [17]:

_ 

Out[17]:

array([1.7995733 , 0.7995733 , 0.7982012 , ..., 0.18624917, 0.1855642 ,0.18549208], dtype=float32)

In [18]:

print("AUC准确率:", roc_auc_score(y_test,y_pred_grd_lm))
AUC准确率: 0.867398106049461

模型融合 0.7441

In [19]:

dataset_preds = dataset3[['user_id','coupon_id','date_received']]
y_pred_grd_lm = XGB_model.predict_proba(grd_enc.transform(gbdt_model.apply(dataset3_x)[:, :, 0]))[:, 1]
dataset_preds['label'] = y_pred_grd_lm
dataset_preds.sort_values(by=['coupon_id','label'],inplace=True)
dataset_preds.to_csv("GBDT_XGB_preds1.csv",index=None,header=None)

线上线下特征合并

In [ ]:

import datetime
import os
from concurrent.futures import ProcessPoolExecutor
from math import ceil
import pandas as pd

读入源数据

In [ ]:

# In[] 读入源数据
def get_source_data():
    # 源数据路径
    DataPath = 'dataset'
    # 读入源数据
    off_train = pd.read_csv(os.path.join(DataPath, 'ccf_offline_stage1_train.csv'),
                            parse_dates=['Date_received', 'Date'])
    off_train.columns = ['User_id', 'Merchant_id', 'Coupon_id', 'Discount_rate', 'Distance', 'Date_received', 'Date']
    on_train = pd.read_csv(os.path.join(DataPath, 'ccf_online_stage1_train.csv'), parse_dates=['Date_received', 'Date'])
    on_train.columns = ['User_id', 'Merchant_id', 'Action', 'Coupon_id', 'Discount_rate', 'Date_received', 'Date']
    off_test = pd.read_csv(os.path.join(DataPath, 'ccf_offline_stage1_test_revised.csv'), parse_dates=['Date_received'])
    off_test.columns = ['User_id', 'Merchant_id', 'Coupon_id', 'Discount_rate', 'Distance', 'Date_received']
    print(off_train.info())
    print(off_train.head(5))
    return off_train, on_train, off_test

数据基本处理

null,na 特殊处理

In [ ]:

# In[] null,na 特殊处理
def null_process_offline(dataset, predict=False):
    dataset.Distance.fillna(11, inplace=True)
    dataset.Distance = dataset.Distance.astype(int)
    dataset.Coupon_id.fillna(0, inplace=True)
    dataset.Coupon_id = dataset.Coupon_id.astype(int)
    dataset.Date_received.fillna(date_null, inplace=True)
    dataset[['discount_rate_x', 'discount_rate_y']] = dataset[dataset.Discount_rate.str.contains(':') == True][
        'Discount_rate'].str.split(':', expand=True).astype(int)
    dataset['discount_rate'] = 1 - dataset.discount_rate_y / dataset.discount_rate_x
    dataset.discount_rate = dataset.discount_rate.fillna(dataset.Discount_rate).astype(float)
    if predict:
        return dataset
    else:
        dataset.Date.fillna(date_null, inplace=True)
        return dataset
def null_process_online(dataset):
    dataset.Coupon_id.fillna(0, inplace=True)
    # online.Coupon_id = online.Coupon_id.astype(int)
    dataset.Date_received.fillna(date_null, inplace=True)
    dataset.Date.fillna(date_null, inplace=True)
    return dataset

生成交叉训练集

In [ ]:

def data_process(off_train, on_train, off_test):
    # train feature split
    # 交叉训练集一:收到券的日期大于4月14日和小于5月14日
    time_range = ['2016-04-16', '2016-05-15']
    dataset1 = off_train[(off_train.Date_received >= time_range[0]) & (off_train.Date_received <= time_range[1])].copy()
    dataset1['label'] = 0
    dataset1.loc[
        (dataset1.Date != date_null) & (dataset1.Date - dataset1.Date_received <= datetime.timedelta(15)), 'label'] = 1
    # 交叉训练集一特征offline:线下数据中领券和用券日期大于1月1日和小于4月13日
    time_range_date_received = ['2016-01-01', '2016-03-31']
    time_range_date = ['2016-01-01', '2016-04-15']
    feature1_off = off_train[(off_train.Date >= time_range_date[0]) & (off_train.Date <= time_range_date[1]) | (
            (off_train.Coupon_id == 0) & (off_train.Date_received >= time_range_date_received[0]) & (
            off_train.Date_received <= time_range_date_received[1]))]
    # 交叉训练集一特征online:线上数据中领券和用券日期大于1月1日和小于4月13日[on_train.date == 'null' to on_train.coupon_id == 0]
    feature1_on = on_train[(on_train.Date >= time_range_date[0]) & (on_train.Date <= time_range_date[1]) | (
            (on_train.Coupon_id == 0) & (on_train.Date_received >= time_range_date_received[0]) & (
            on_train.Date_received <= time_range_date_received[1]))]
    # 交叉训练集二:收到券的日期大于5月15日和小于6月15日
    time_range = ['2016-05-16', '2016-06-15']
    dataset2 = off_train[(off_train.Date_received >= time_range[0]) & (off_train.Date_received <= time_range[1])]
    dataset2['label'] = 0
    dataset2.loc[
        (dataset2.Date != date_null) & (dataset2.Date - dataset2.Date_received <= datetime.timedelta(15)), 'label'] = 1
    # 交叉训练集二特征offline:线下数据中领券和用券日期大于2月1日和小于5月14日
    time_range_date_received = ['2016-02-01', '2016-04-30']
    time_range_date = ['2016-02-01', '2016-05-15']
    feature2_off = off_train[(off_train.Date >= time_range_date[0]) & (off_train.Date <= time_range_date[1]) | (
            (off_train.Coupon_id == 0) & (off_train.Date_received >= time_range_date_received[0]) & (
            off_train.Date_received <= time_range_date_received[1]))]
    # 交叉训练集二特征online:线上数据中领券和用券日期大于2月1日和小于5月14日
    feature2_on = on_train[(on_train.Date >= time_range_date[0]) & (on_train.Date <= time_range_date[1]) | (
            (on_train.Coupon_id == 0) & (on_train.Date_received >= time_range_date_received[0]) & (
            on_train.Date_received <= time_range_date_received[1]))]
    # 测试集
    dataset3 = off_test
    # 测试集特征offline :线下数据中领券和用券日期大于3月15日和小于6月30日的
    time_range = ['2016-03-16', '2016-06-30']
    feature3_off = off_train[((off_train.Date >= time_range[0]) & (off_train.Date <= time_range[1])) | (
            (off_train.Coupon_id == 0) & (off_train.Date_received >= time_range[0]) & (
            off_train.Date_received <= time_range[1]))]
    # 测试集特征online :线上数据中领券和用券日期大于3月15日和小于6月30日的
    feature3_on = on_train[((on_train.Date >= time_range[0]) & (on_train.Date <= time_range[1])) | (
            (on_train.Coupon_id == 0) & (on_train.Date_received >= time_range[0]) & (
            on_train.Date_received <= time_range[1]))]
    # get train feature
    ProcessDataSet1 = get_features(dataset1, feature1_off, feature1_on)
    ProcessDataSet2 = get_features(dataset2, feature2_off, feature2_on)
    ProcessDataSet3 = get_features(dataset3, feature3_off, feature3_on)
    return ProcessDataSet1, ProcessDataSet2, ProcessDataSet3
def get_features(dataset, feature_off, feature_on):
    dataset = get_offline_features(dataset, feature_off)
    return get_online_features(feature_on, dataset)

特征工程

对线下数据处理

In [ ]:

def get_offline_features(X, offline):
    # X = X[:1000]
    print(len(X), len(X.columns))
    temp = offline[offline.Coupon_id != 0]
    coupon_consume = temp[temp.Date != date_null]
    coupon_no_consume = temp[temp.Date == date_null]
    user_coupon_consume = coupon_consume.groupby('User_id')
    X['weekday'] = X.Date_received.dt.weekday
    X['day'] = X.Date_received.dt.day
    '''user features'''
    # 优惠券消费次数
    temp = user_coupon_consume.size().reset_index(name='u2')
    X = pd.merge(X, temp, how='left', on='User_id')
    # X.u2.fillna(0, inplace=True)
    # X.u2 = X.u2.astype(int)
    # 优惠券不消费次数
    temp = coupon_no_consume.groupby('User_id').size().reset_index(name='u3')
    X = pd.merge(X, temp, how='left', on='User_id')
    # 使用优惠券次数与没使用优惠券次数比值
    X['u19'] = X.u2 / X.u3
    # 领取优惠券次数
    X['u1'] = X.u2.fillna(0) + X.u3.fillna(0)
    # 优惠券核销率
    X['u4'] = X.u2 / X.u1
    # 普通消费次数
    temp = offline[(offline.Coupon_id == 0) & (offline.Date != date_null)]
    temp1 = temp.groupby('User_id').size().reset_index(name='u5')
    X = pd.merge(X, temp1, how='left', on='User_id')
    # 一共消费多少次
    X['u25'] = X.u2 + X.u5
    # 用户使用优惠券消费占比
    X['u20'] = X.u2 / X.u25
    # 正常消费平均间隔
    temp = pd.merge(temp, temp.groupby('User_id').Date.max().reset_index(name='max'))
    temp = pd.merge(temp, temp.groupby('User_id').Date.min().reset_index(name='min'))
    temp = pd.merge(temp, temp.groupby('User_id').size().reset_index(name='len'))
    temp['u6'] = ((temp['max'] - temp['min']).dt.days / (temp['len'] - 1))
    temp = temp.drop_duplicates('User_id')
    X = pd.merge(X, temp[['User_id', 'u6']], how='left', on='User_id')
    # 优惠券消费平均间隔
    temp = pd.merge(coupon_consume, user_coupon_consume.Date.max().reset_index(name='max'))
    temp = pd.merge(temp, temp.groupby('User_id').Date.min().reset_index(name='min'))
    temp = pd.merge(temp, temp.groupby('User_id').size().reset_index(name='len'))
    temp['u7'] = ((temp['max'] - temp['min']).dt.days / (temp['len'] - 1))
    temp = temp.drop_duplicates('User_id')
    X = pd.merge(X, temp[['User_id', 'u7']], how='left', on='User_id')
    # 15天内平均会普通消费几次
    X['u8'] = X.u6 / 15
    # 15天内平均会优惠券消费几次
    X['u9'] = X.u7 / 15
    # 领取优惠券到使用优惠券的平均间隔时间
    temp = coupon_consume.copy()
    temp['days'] = (temp.Date - temp.Date_received).dt.days
    temp = (temp.groupby('User_id').days.sum() / temp.groupby('User_id').size()).reset_index(name='u10')
    X = pd.merge(X, temp, how='left', on='User_id')
    # 在15天内使用掉优惠券的值大小
    X['u11'] = X.u10 / 15
    # 领取优惠券到使用优惠券间隔小于15天的次数
    temp = coupon_consume.copy()
    temp['days'] = (temp.Date - temp.Date_received).dt.days
    temp = temp[temp.days <= 15]
    temp = temp.groupby('User_id').size().reset_index(name='u21')
    X = pd.merge(X, temp, how='left', on='User_id')
    # 用户15天使用掉优惠券的次数除以使用优惠券的次数
    X['u22'] = X.u21 / X.u2
    # 用户15天使用掉优惠券的次数除以领取优惠券未消费的次数
    X['u23'] = X.u21 / X.u3
    # 用户15天使用掉优惠券的次数除以领取优惠券的总次数
    X['u24'] = X.u21 / X.u1
    # 消费优惠券的平均折率
    temp = user_coupon_consume.discount_rate.mean().reset_index(name='u45')
    X = pd.merge(X, temp, how='left', on='User_id')
    # 用户核销优惠券的最低消费折率
    temp = user_coupon_consume.discount_rate.min().reset_index(name='u27')
    X = pd.merge(X, temp, how='left', on='User_id')
    # 用户核销优惠券的最高消费折率
    temp = user_coupon_consume.discount_rate.max().reset_index(name='u28')
    X = pd.merge(X, temp, how='left', on='User_id')
    # 用户核销过的不同优惠券数量
    temp = coupon_consume.groupby(['User_id', 'Coupon_id']).size()
    temp = temp.groupby('User_id').size().reset_index(name='u32')
    X = pd.merge(X, temp, how='left', on='User_id')
    # 用户领取所有不同优惠券数量
    temp = offline[offline.Date_received != date_null]
    temp = temp.groupby(['User_id', 'Coupon_id']).size().reset_index(name='u47')
    X = pd.merge(X, temp, how='left', on=['User_id', 'Coupon_id'])
    # 用户核销过的不同优惠券数量占所有不同优惠券的比重
    X['u33'] = X.u32 / X.u47
    # 用户平均每种优惠券核销多少张
    X['u34'] = X.u2 / X.u47
    # 核销优惠券用户-商家平均距离
    temp = offline[(offline.Coupon_id != 0) & (offline.Date != date_null) & (offline.Distance != 11)]
    temp = temp.groupby('User_id').Distance
    temp = pd.merge(temp.count().reset_index(name='x'), temp.sum().reset_index(name='y'), on='User_id')
    temp['u35'] = temp.y / temp.x
    temp = temp[['User_id', 'u35']]
    X = pd.merge(X, temp, how='left', on='User_id')
    # 用户核销优惠券中的最小用户-商家距离
    temp = coupon_consume[coupon_consume.Distance != 11]
    temp = temp.groupby('User_id').Distance.min().reset_index(name='u36')
    X = pd.merge(X, temp, how='left', on='User_id')
    # 用户核销优惠券中的最大用户-商家距离
    temp = coupon_consume[coupon_consume.Distance != 11]
    temp = temp.groupby('User_id').Distance.max().reset_index(name='u37')
    X = pd.merge(X, temp, how='left', on='User_id')
    # 优惠券类型
    discount_types = [
        '0.2', '0.5', '0.6', '0.7', '0.75', '0.8', '0.85', '0.9', '0.95', '30:20', '50:30', '10:5',
        '20:10', '100:50', '200:100', '50:20', '30:10', '150:50', '100:30', '20:5', '200:50', '5:1',
        '50:10', '100:20', '150:30', '30:5', '300:50', '200:30', '150:20', '10:1', '50:5', '100:10',
        '200:20', '300:30', '150:10', '300:20', '500:30', '20:1', '100:5', '200:10', '30:1', '150:5',
        '300:10', '200:5', '50:1', '100:1',
    ]
    X['discount_type'] = -1
    for k, v in enumerate(discount_types):
        X.loc[X.Discount_rate == v, 'discount_type'] = k
    # 不同优惠券领取次数
    temp = offline.groupby(['User_id', 'Discount_rate']).size().reset_index(name='u41')
    X = pd.merge(X, temp, how='left', on=['User_id', 'Discount_rate'])
    # 不同优惠券使用次数
    temp = coupon_consume.groupby(['User_id', 'Discount_rate']).size().reset_index(name='u42')
    X = pd.merge(X, temp, how='left', on=['User_id', 'Discount_rate'])
    # 不同优惠券不使用次数
    temp = coupon_no_consume.groupby(['User_id', 'Discount_rate']).size().reset_index(name='u43')
    X = pd.merge(X, temp, how='left', on=['User_id', 'Discount_rate'])
    # 不同打折优惠券使用率
    X['u44'] = X.u42 / X.u41
    # 满减类型优惠券领取次数
    temp = offline[offline.Discount_rate.str.contains(':') == True]
    temp = temp.groupby('User_id').size().reset_index(name='u48')
    X = pd.merge(X, temp, how='left', on='User_id')
    # 打折类型优惠券领取次数
    temp = offline[offline.Discount_rate.str.contains('\.') == True]
    temp = temp.groupby('User_id').size().reset_index(name='u49')
    X = pd.merge(X, temp, how='left', on='User_id')
    '''offline merchant features'''
    # 商户消费次数
    temp = offline[offline.Date != date_null].groupby('Merchant_id').size().reset_index(name='m0')
    X = pd.merge(X, temp, how='left', on='Merchant_id')
    # 商家优惠券被领取后核销次数
    temp = coupon_consume.groupby('Merchant_id').size().reset_index(name='m1')
    X = pd.merge(X, temp, how='left', on='Merchant_id')
    # 商户正常消费笔数
    X['m2'] = X.m0.fillna(0) - X.m1.fillna(0)
    # 商家优惠券被领取次数
    temp = offline[offline.Date_received != date_null].groupby('Merchant_id').size().reset_index(name='m3')
    X = pd.merge(X, temp, how='left', on='Merchant_id')
    # 商家优惠券被领取后核销率
    X['m4'] = X.m1 / X.m3
    # 商家优惠券被领取后不核销次数
    temp = coupon_no_consume.groupby('Merchant_id').size().reset_index(name='m7')
    X = pd.merge(X, temp, how='left', on='Merchant_id')
    # 商户当天优惠券领取次数
    temp = X[X.Date_received != date_null]
    temp = temp.groupby(['Merchant_id', 'Date_received']).size().reset_index(name='m5')
    X = pd.merge(X, temp, how='left', on=['Merchant_id', 'Date_received'])
    # 商户当天优惠券领取人数
    temp = X[X.Date_received != date_null]
    temp = temp.groupby(['User_id', 'Merchant_id', 'Date_received']).size().reset_index()
    temp = temp.groupby(['Merchant_id', 'Date_received']).size().reset_index(name='m6')
    X = pd.merge(X, temp, how='left', on=['Merchant_id', 'Date_received'])
    # 商家优惠券核销的平均消费折率
    temp = coupon_consume.groupby('Merchant_id').discount_rate.mean().reset_index(name='m8')
    X = pd.merge(X, temp, how='left', on='Merchant_id')
    # 商家优惠券核销的最小消费折率
    temp = coupon_consume.groupby('Merchant_id').discount_rate.max().reset_index(name='m9')
    X = pd.merge(X, temp, how='left', on='Merchant_id')
    # 商家优惠券核销的最大消费折率
    temp = coupon_consume.groupby('Merchant_id').discount_rate.min().reset_index(name='m10')
    X = pd.merge(X, temp, how='left', on='Merchant_id')
    # 商家优惠券核销不同的用户数量
    temp = coupon_consume.groupby(['Merchant_id', 'User_id']).size()
    temp = temp.groupby('Merchant_id').size().reset_index(name='m11')
    X = pd.merge(X, temp, how='left', on='Merchant_id')
    # 商家优惠券领取不同的用户数量
    temp = offline[offline.Date_received != date_null].groupby(['Merchant_id', 'User_id']).size()
    temp = temp.groupby('Merchant_id').size().reset_index(name='m12')
    X = pd.merge(X, temp, how='left', on='Merchant_id')
    # 核销商家优惠券的不同用户数量其占领取不同的用户比重
    X['m13'] = X.m11 / X.m12
    # 商家优惠券平均每个用户核销多少张
    X['m14'] = X.m1 / X.m12
    # 商家被核销过的不同优惠券数量
    temp = coupon_consume.groupby(['Merchant_id', 'Coupon_id']).size()
    temp = temp.groupby('Merchant_id').size().reset_index(name='m15')
    X = pd.merge(X, temp, how='left', on='Merchant_id')
    # 商家领取过的不同优惠券数量的比重
    temp = offline[offline.Date_received != date_null].groupby(['Merchant_id', 'Coupon_id']).size()
    temp = temp.groupby('Merchant_id').count().reset_index(name='m18')
    X = pd.merge(X, temp, how='left', on='Merchant_id')
    # 商家被核销过的不同优惠券数量占所有领取过的不同优惠券数量的比重
    X['m19'] = X.m15 / X.m18
    # 商家被核销优惠券的平均时间
    temp = pd.merge(coupon_consume, coupon_consume.groupby('Merchant_id').Date.max().reset_index(name='max'))
    temp = pd.merge(temp, temp.groupby('Merchant_id').Date.min().reset_index(name='min'))
    temp = pd.merge(temp, temp.groupby('Merchant_id').size().reset_index(name='len'))
    temp['m20'] = ((temp['max'] - temp['min']).dt.days / (temp['len'] - 1))
    temp = temp.drop_duplicates('Merchant_id')
    X = pd.merge(X, temp[['Merchant_id', 'm20']], how='left', on='Merchant_id')
    # 商家被核销优惠券中的用户-商家平均距离
    temp = coupon_consume[coupon_consume.Distance != 11].groupby('Merchant_id').Distance
    temp = pd.merge(temp.count().reset_index(name='x'), temp.sum().reset_index(name='y'), on='Merchant_id')
    temp['m21'] = temp.y / temp.x
    temp = temp[['Merchant_id', 'm21']]
    X = pd.merge(X, temp, how='left', on='Merchant_id')
    # 商家被核销优惠券中的用户-商家最小距离
    temp = coupon_consume[coupon_consume.Distance != 11]
    temp = temp.groupby('Merchant_id').Distance.min().reset_index(name='m22')
    X = pd.merge(X, temp, how='left', on='Merchant_id')
    # 商家被核销优惠券中的用户-商家最大距离
    temp = coupon_consume[coupon_consume.Distance != 11]
    temp = temp.groupby('Merchant_id').Distance.max().reset_index(name='m23')
    X = pd.merge(X, temp, how='left', on='Merchant_id')
    """offline coupon features"""
    # 此优惠券一共发行多少张
    temp = offline[offline.Coupon_id != 0].groupby('Coupon_id').size().reset_index(name='c1')
    X = pd.merge(X, temp, how='left', on='Coupon_id')
    # 此优惠券一共被使用多少张
    temp = coupon_consume.groupby('Coupon_id').size().reset_index(name='c2')
    X = pd.merge(X, temp, how='left', on='Coupon_id')
    # 优惠券使用率
    X['c3'] = X.c2 / X.c1
    # 没有使用的数目
    X['c4'] = X.c1 - X.c2
    # 此优惠券在当天发行了多少张
    temp = X.groupby(['Coupon_id', 'Date_received']).size().reset_index(name='c5')
    X = pd.merge(X, temp, how='left', on=['Coupon_id', 'Date_received'])
    # 优惠券类型(直接优惠为0, 满减为1)
    X['c6'] = 0
    X.loc[X.Discount_rate.str.contains(':') == True, 'c6'] = 1
    # 不同打折优惠券领取次数
    temp = offline.groupby('Discount_rate').size().reset_index(name='c8')
    X = pd.merge(X, temp, how='left', on='Discount_rate')
    # 不同打折优惠券使用次数
    temp = coupon_consume.groupby('Discount_rate').size().reset_index(name='c9')
    X = pd.merge(X, temp, how='left', on='Discount_rate')
    # 不同打折优惠券不使用次数
    temp = coupon_no_consume.groupby('Discount_rate').size().reset_index(name='c10')
    X = pd.merge(X, temp, how='left', on='Discount_rate')
    # 不同打折优惠券使用率
    X['c11'] = X.c9 / X.c8
    # 优惠券核销平均时间
    temp = pd.merge(coupon_consume, coupon_consume.groupby('Coupon_id').Date.max().reset_index(name='max'))
    temp = pd.merge(temp, temp.groupby('Coupon_id').Date.min().reset_index(name='min'))
    temp = pd.merge(temp, temp.groupby('Coupon_id').size().reset_index(name='count'))
    temp['c12'] = ((temp['max'] - temp['min']).dt.days / (temp['count'] - 1))
    temp = temp.drop_duplicates('Coupon_id')
    X = pd.merge(X, temp[['Coupon_id', 'c12']], how='left', on='Coupon_id')
    '''user merchant feature'''
    # 用户领取商家的优惠券次数
    temp = offline[offline.Coupon_id != 0]
    temp = temp.groupby(['User_id', 'Merchant_id']).size().reset_index(name='um1')
    X = pd.merge(X, temp, how='left', on=['User_id', 'Merchant_id'])
    # 用户领取商家的优惠券后不核销次数
    temp = coupon_no_consume.groupby(['User_id', 'Merchant_id']).size().reset_index(name='um2')
    X = pd.merge(X, temp, how='left', on=['User_id', 'Merchant_id'])
    # 用户领取商家的优惠券后核销次数
    temp = coupon_consume.groupby(['User_id', 'Merchant_id']).size().reset_index(name='um3')
    X = pd.merge(X, temp, how='left', on=['User_id', 'Merchant_id'])
    # 用户领取商家的优惠券后核销率
    X['um4'] = X.um3 / X.um1
    # 用户对每个商家的不核销次数占用户总的不核销次数的比重
    temp = coupon_no_consume.groupby('User_id').size().reset_index(name='temp')
    X = pd.merge(X, temp, how='left', on='User_id')
    X['um5'] = X.um2 / X.temp
    X.drop(columns='temp', inplace=True)
    # 用户在商店总共消费过几次
    temp = offline[offline.Date != date_null].groupby(['User_id', 'Merchant_id']).size().reset_index(name='um6')
    X = pd.merge(X, temp, how='left', on=['User_id', 'Merchant_id'])
    # 用户在商店普通消费次数
    temp = offline[(offline.Coupon_id == 0) & (offline.Date != date_null)]
    temp = temp.groupby(['User_id', 'Merchant_id']).size().reset_index(name='um7')
    X = pd.merge(X, temp, how='left', on=['User_id', 'Merchant_id'])
    # 用户当天在此商店领取的优惠券数目
    temp = offline[offline.Date_received != date_null]
    temp = temp.groupby(['User_id', 'Merchant_id', 'Date_received']).size().reset_index(name='um8')
    X = pd.merge(X, temp, how='left', on=['User_id', 'Merchant_id', 'Date_received'])
    # 用户领取优惠券不同商家数量
    temp = offline[offline.Coupon_id == offline.Coupon_id]
    temp = temp.groupby(['User_id', 'Merchant_id']).size().reset_index()
    temp = temp.groupby('User_id').size().reset_index(name='um9')
    X = pd.merge(X, temp, how='left', on='User_id')
    # 用户核销优惠券不同商家数量
    temp = coupon_consume.groupby(['User_id', 'Merchant_id']).size()
    temp = temp.groupby('User_id').size().reset_index(name='um10')
    X = pd.merge(X, temp, how='left', on='User_id')
    # 用户核销过优惠券的不同商家数量占所有不同商家的比重
    X['um11'] = X.um10 / X.um9
    # 用户平均核销每个商家多少张优惠券
    X['um12'] = X.u2 / X.um9
    '''other feature'''
    # 用户领取的所有优惠券数目
    temp = X.groupby('User_id').size().reset_index(name='o1')
    X = pd.merge(X, temp, how='left', on='User_id')
    # 用户领取的特定优惠券数目
    temp = X.groupby(['User_id', 'Coupon_id']).size().reset_index(name='o2')
    X = pd.merge(X, temp, how='left', on=['User_id', 'Coupon_id'])
    # multiple threads
    # data split
    stop = len(X)
    step = int(ceil(stop / cpu_jobs))
    X_chunks = [X[i:i + step] for i in range(0, stop, step)]
    X_list = [X] * cpu_jobs
    counters = [i for i in range(cpu_jobs)]
    start = datetime.datetime.now()
    with ProcessPoolExecutor() as e:
        X = pd.concat(e.map(task, X_chunks, X_list, counters))
        print('time:', str(datetime.datetime.now() - start).split('.')[0])
    # multiple threads
    # 用户领取优惠券平均时间间隔
    temp = pd.merge(X, X.groupby('User_id').Date_received.max().reset_index(name='max'))
    temp = pd.merge(temp, temp.groupby('User_id').Date_received.min().reset_index(name='min'))
    temp = pd.merge(temp, temp.groupby('User_id').size().reset_index(name='len'))
    temp['o7'] = ((temp['max'] - temp['min']).dt.days / (temp['len'] - 1))
    temp = temp.drop_duplicates('User_id')
    X = pd.merge(X, temp[['User_id', 'o7']], how='left', on='User_id')
    # 用户领取特定商家的优惠券数目
    temp = X.groupby(['User_id', 'Merchant_id']).size().reset_index(name='o8')
    X = pd.merge(X, temp, how='left', on=['User_id', 'Merchant_id'])
    # 用户领取的不同商家数目
    temp = X.groupby(['User_id', 'Merchant_id']).size()
    temp = temp.groupby('User_id').size().reset_index(name='o9')
    X = pd.merge(X, temp, how='left', on='User_id')
    # 用户当天领取的优惠券数目
    temp = X.groupby(['User_id', 'Date_received']).size().reset_index(name='o10')
    X = pd.merge(X, temp, how='left', on=['User_id', 'Date_received'])
    # 用户当天领取的特定优惠券数目
    temp = X.groupby(['User_id', 'Coupon_id', 'Date_received']).size().reset_index(name='o11')
    X = pd.merge(X, temp, how='left', on=['User_id', 'Coupon_id', 'Date_received'])
    # 用户领取的所有优惠券种类数目
    temp = X.groupby(['User_id', 'Coupon_id']).size()
    temp = temp.groupby('User_id').size().reset_index(name='o12')
    X = pd.merge(X, temp, how='left', on='User_id')
    # 商家被领取的优惠券数目
    temp = X.groupby('Merchant_id').size().reset_index(name='o13')
    X = pd.merge(X, temp, how='left', on='Merchant_id')
    # 商家被领取的特定优惠券数目
    temp = X.groupby(['Merchant_id', 'Coupon_id']).size().reset_index(name='o14')
    X = pd.merge(X, temp, how='left', on=['Merchant_id', 'Coupon_id'])
    # 商家被多少不同用户领取的数目
    temp = X.groupby(['Merchant_id', 'User_id']).size()
    temp = temp.groupby('Merchant_id').size().reset_index(name='o15')
    X = pd.merge(X, temp, how='left', on='Merchant_id')
    # 商家发行的所有优惠券种类数目
    temp = X.groupby(['Merchant_id', 'Coupon_id']).size()
    temp = temp.groupby('Merchant_id').size().reset_index(name='o16')
    X = pd.merge(X, temp, how='left', on='Merchant_id')
    print(len(X), len(X.columns))
    return X

对线上数据处理

In [ ]:

def get_online_features(online, X):
    # temp = online[online.Coupon_id == online.Coupon_id]
    # coupon_consume = temp[temp.Date == temp.Date]
    # coupon_no_consume = temp[temp.Date != temp.Date]
    # 用户线上操作次数
    temp = online.groupby('User_id').size().reset_index(name='on_u1')
    X = pd.merge(X, temp, how='left', on='User_id')
    # 用户线上点击次数
    temp = online[online.Action == 0].groupby('User_id').size().reset_index(name='on_u2')
    X = pd.merge(X, temp, how='left', on='User_id')
    # 用户线上点击率
    X['on_u3'] = X.on_u2 / X.on_u1
    # 用户线上购买次数
    temp = online[online.Action == 1].groupby('User_id').size().reset_index(name='on_u4')
    X = pd.merge(X, temp, how='left', on='User_id')
    # 用户线上购买率
    X['on_u5'] = X.on_u4 / X.on_u1
    # 用户线上领取次数
    temp = online[online.Coupon_id != 0].groupby('User_id').size().reset_index(name='on_u6')
    X = pd.merge(X, temp, how='left', on='User_id')
    # 用户线上领取率
    X['on_u7'] = X.on_u6 / X.on_u1
    # 用户线上不消费次数
    temp = online[(online.Date == date_null) & (online.Coupon_id != 0)]
    temp = temp.groupby('User_id').size().reset_index(name='on_u8')
    X = pd.merge(X, temp, how='left', on='User_id')
    # 用户线上优惠券核销次数
    temp = online[(online.Date != date_null) & (online.Coupon_id != 0)]
    temp = temp.groupby('User_id').size().reset_index(name='on_u9')
    X = pd.merge(X, temp, how='left', on='User_id')
    # 用户线上优惠券核销率
    X['on_u10'] = X.on_u9 / X.on_u6
    # 用户线下不消费次数占线上线下总的不消费次数的比重
    X['on_u11'] = X.u3 / (X.on_u8 + X.u3)
    # 用户线下的优惠券核销次数占线上线下总的优惠券核销次数的比重
    X['on_u12'] = X.u2 / (X.on_u9 + X.u2)
    # 用户线下领取的记录数量占总的记录数量的比重
    X['on_u13'] = X.u1 / (X.on_u6 + X.u1)
    print(len(X), len(X.columns))
    print('----------')
    return X

In [ ]:

def task(X_chunk, X, counter):
    print(counter, end=',', flush=True)
    X_chunk = X_chunk.copy()
    X_chunk['o17'] = -1
    X_chunk['o18'] = -1
    for i, user in X_chunk.iterrows():
        temp = X[X.User_id == user.User_id]
        temp1 = temp[temp.Date_received < user.Date_received]
        temp2 = temp[temp.Date_received > user.Date_received]
        # 用户此次之后/前领取的所有优惠券数目
        X_chunk.loc[i, 'o3'] = len(temp1)
        X_chunk.loc[i, 'o4'] = len(temp2)
        # 用户此次之后/前领取的特定优惠券数目
        X_chunk.loc[i, 'o5'] = len(temp1[temp1.Coupon_id == user.Coupon_id])
        X_chunk.loc[i, 'o6'] = len(temp2[temp2.Coupon_id == user.Coupon_id])
        # 用户上/下一次领取的时间间隔
        temp1 = temp1.sort_values(by='Date_received', ascending=False)
        if len(temp1):
            X_chunk.loc[i, 'o17'] = (user.Date_received - temp1.iloc[0].Date_received).days
        temp2 = temp2.sort_values(by='Date_received')
        if len(temp2):
            X_chunk.loc[i, 'o18'] = (temp2.iloc[0].Date_received - user.Date_received).days
    return X_chunk

数据处理保存

In [ ]:

# 程序开始时间打印
    start = datetime.datetime.now()
    print(start.strftime('%Y-%m-%d %H:%M:%S'))
    cpu_jobs = os.cpu_count() - 1
    date_null = pd.to_datetime('1970-01-01', format='%Y-%m-%d')
    pd.set_option('expand_frame_repr', False)
    pd.set_option('display.max_rows', 200)
    pd.set_option('display.max_columns', 200)
    # 预处理后数据存放路径
    FeaturePath = 'data_preprocessed_2'
    # 读入源数据
    off_train, on_train, off_test = get_source_data()
    # 源数据null处理
    off_train = null_process_offline(off_train, predict=False)
    on_train = null_process_online(on_train)
    off_test = null_process_offline(off_test, predict=True)
    # 获取训练特征集,测试特征集
    ProcessDataSet1, ProcessDataSet2, ProcessDataSet3 = data_process(off_train, on_train, off_test)
    # 源数据处理后的数据保存为文件
    ProcessDataSet1.to_csv(os.path.join(FeaturePath, 'ProcessDataSet1.csv'), index=None)
    ProcessDataSet2.to_csv(os.path.join(FeaturePath, 'ProcessDataSet2.csv'), index=None)
    ProcessDataSet3.to_csv(os.path.join(FeaturePath, 'ProcessDataSet3.csv'), index=None)
    # 花费时间打印
    print((datetime.datetime.now() - start).seconds)

Sklearn:天池新人实战赛o2o优惠券使用预测 part2相关推荐

  1. Sklearn:天池新人实战赛o2o优惠券使用预测 part1

    日萌社 人工智能AI:Keras PyTorch MXNet TensorFlow PaddlePaddle 深度学习实战(不定时更新) 阿里云官网:天池新人实战赛o2o优惠券使用预测 数据集下载链接 ...

  2. 天池新人实战赛o2o优惠券使用预测四(别家代码分析1)

    弄了2天,发现还是对python不够熟悉,完全不能实现我的各种设想,所以,还是先参考别人的代码,先明白别人的思路和python的写法再说吧. 源代码地址请直接在CSDN上搜索,原文说未经许可不得转载, ...

  3. 数据科学家之路 实战1 --- O2O优惠券使用预测(part2)

    本篇文章内容接着part1的模型部分 4.上模型 a.回归树 由于GBDT中用到的树是回归树,而不是我们一般用到的分类树,所以这里举个例子复习一下: 我们利用x1.x2这两个特征可以很容易地把数据分为 ...

  4. 天池新人实战赛——阿里移动推荐算法大赛(离线赛与平台赛)

    竞赛题目 (离线赛与平台赛题目一样,只是数据量不一样,离线2000w+数据,平台11亿+数据) 在真实的业务场景下,我们往往需要对所有商品的一个子集构建个性化推荐模型.在完成这件任务的过程中,我们不仅 ...

  5. 天池 O2O 优惠券使用预测思路解析与代码实战

    个人网站:redstonewill.com 前阵子因为机器学习训练营的任务安排,需要打一场 AI 比赛.然后就了解到最近热度很高且非常适合新人入门的一场比赛:天池新人实战赛o2o优惠券使用预测.今天, ...

  6. 天池比赛:o2o优惠券使用预测

    一.比赛背景 O2O:全称Online To Offline,线上线下电子商务,是把线上的消费者带到现实的商店中去:在线支付线下商品.服务,再到线下去享受服务.通过打折(例如团购).提供信息.服务(例 ...

  7. 天池o2o优惠券使用预测(入门)

    一.前言 近期学习了一下天池中o2o优惠券使用预测的学习赛,主要任务是通过分析建模,精准预测用户是否会在规定时间内使用相应优惠券.这次的参与主要是学习为主,牛刀小试. 二.解决方案 数据分析:对于给定 ...

  8. 【机器学习】天池O2O优惠券使用预测_系统性总结与心得

    Preface:上半年参加天池的O2O优惠券预测赛排名第二,同时参加了学校数据仓库老师的课程改革建设团队,于是把参赛经验总结成文,准备分享给该课程的学弟学妹.现在我把总结的参赛教程文章也发到CSDN上 ...

  9. O2O优惠券使用预测项目总结

    O2O优惠券使用预测笔记 前言 项目介绍 数据 评价方式 赛题分析 基本思路 数据集划分 特征工程 模型选取 过程及代码 导入python库 导入与划分数据集 特征工程 模型训练与调参 预测测试集 总 ...

最新文章

  1. 控制-超前校正-C语言实现
  2. mysql备份脚本+关_mysql数据库自动定期备份的脚本
  3. Odoo中最小库存规则和按订单生成规则的区别
  4. 单片机断电后不保存程序_为什么单片机语音芯片既有flash又有EEPROM
  5. (十九)java多线程之ForkJoinPool
  6. laravel 中request
  7. HTML5 3D爱心动画及其制作过程
  8. 6002.mavlink通过xml生成自定义消息
  9. Oracle中shrink space命令
  10. ecshop始终显示全部分类
  11. C#制作、打包、签名、发布Activex全过程【转】
  12. Android APP
  13. 图书管理系统的设计与实现
  14. python二元函数拟合_Python拟合二元一次函数
  15. android icon颜色转换工具,Android代码修改图标颜色
  16. 理性讨论: 《崩坏3》是不是目前打击感最好的动作手游?
  17. 视频剪辑教程,视频批量分割,分割后的新视频怎么保存
  18. 最简单的验证码(利用JSP生成验证码)
  19. 【微信小程序】微信小程序--倒放音频的实现
  20. 抖音和tiktok是什么关系?TikTok和抖音差别大吗?

热门文章

  1. linux 查看当前用户密码
  2. 高解析 喷码机解决方案
  3. cmd执行服务器sql文件,sqlserver执行sql文件命令(sqlcmd)
  4. jetson nano配置过程中踩过的坑
  5. Centos安装Redis
  6. (51)TCL脚本命令【join】
  7. JavaScript之观察者模式
  8. RMAN异机恢复步骤
  9. 特征提取和特征选择方法
  10. 基于PHP的ACM集训队员管理系统