1. Background

About this dataset: in this challenge, you are given a list of users along with their demographics, web session records, and some summary statistics. You are asked to predict which country a new user’s first booking destination will be. All the users in this dataset are from the USA.

There are 12 possible outcomes for the destination country: ‘US’, ‘FR’, ‘CA’, ‘GB’, ‘ES’, ‘IT’, ‘PT’, ‘NL’, ‘DE’, ‘AU’, ‘NDF’ (no destination found), and ‘other’. Note that ‘NDF’ is different from ‘other’: ‘other’ means there was a booking, but to a country not included in the list, while ‘NDF’ means there was no booking.

2. Data description

train_users_2.csv - the training set of users
* id: user id
* date_account_created: the date of account creation
* timestamp_first_active: timestamp of the first activity; note that it can be earlier than date_account_created or date_first_booking because a user can search before signing up
* date_first_booking: date of first booking
* gender: the user's gender
* age: the user's age
* signup_method: how the user signed up
* signup_flow: the page a user came to sign up from
* language: international language preference
* affiliate_channel: what kind of paid marketing
* affiliate_provider: where the marketing is, e.g. google, craigslist, other
* first_affiliate_tracked: the first marketing the user interacted with before signing up
* signup_app: the app used to sign up
* first_device_type: type of the first device used
* first_browser: type of the first browser used
* country_destination: the booking destination country; this is the target variable you are to predict

test_users.csv - the test set of users

sessions.csv - web sessions log for users
* user_id: to be joined with the column ‘id’ in the users table
* action: the user's action
* action_type: the type of the action
* action_detail: the detail of the action
* device_type: device type
* secs_elapsed: time spent, in seconds
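
The `user_id` column of sessions.csv is the join key against `id` in the users table. A minimal sketch of that join, using tiny hand-made frames (the values are illustrative and not taken from the dataset):

```python
import pandas as pd

# Toy stand-ins for the users and sessions tables (illustrative values only).
users = pd.DataFrame({"id": ["u1", "u2"], "age": [38.0, None]})
sessions = pd.DataFrame({
    "user_id": ["u1", "u1", "u3"],
    "action": ["lookup", "search_results", "lookup"],
    "secs_elapsed": [319.0, 67753.0, 301.0],
})

# A left join keeps every user, including those without any session rows.
joined = users.merge(sessions, how="left", left_on="id", right_on="user_id")
print(joined)
```

A left join is only one option; an inner join would instead drop users that have no session records.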

3. Exploratory analysis and feature engineering

3.1 The train_users_2 and test_users files


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import pickle
import datetime
import os


train = pd.read_csv("../input/train_users_2.csv")
test = pd.read_csv("../input/test_users.csv")
id date_account_created timestamp_first_active date_first_booking gender age signup_method signup_flow language affiliate_channel affiliate_provider first_affiliate_tracked signup_app first_device_type first_browser country_destination
0 gxn3p5htnn 2010-06-28 20090319043255 NaN -unknown- NaN facebook 0 en direct direct untracked Web Mac Desktop Chrome NDF
1 820tgsjxq7 2011-05-25 20090523174809 NaN MALE 38.0 facebook 0 en seo google untracked Web Mac Desktop Chrome NDF
2 4ft3gnwmtx 2010-09-28 20090609231247 2010-08-02 FEMALE 56.0 basic 3 en direct direct untracked Web Windows Desktop IE US
3 bjjt8pjhuk 2011-12-05 20091031060129 2012-09-08 FEMALE 42.0 facebook 0 en direct direct untracked Web Mac Desktop Firefox other
4 87mebub9p4 2010-09-14 20091208061105 2010-02-18 -unknown- 41.0 basic 0 en direct direct untracked Web Mac Desktop Chrome US


print("Column names for training dataset : ")
for column in train.columns:
    print("-", column)

    Column names for training dataset : 
    - id
    - date_account_created
    - timestamp_first_active
    - date_first_booking
    - gender
    - age
    - signup_method
    - signup_flow
    - language
    - affiliate_channel
    - affiliate_provider
    - first_affiliate_tracked
    - signup_app
    - first_device_type
    - first_browser
    - country_destination


    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 213451 entries, 0 to 213450
    Data columns (total 16 columns):
    id                         213451 non-null object
    date_account_created       213451 non-null object
    timestamp_first_active     213451 non-null int64
    date_first_booking         88908 non-null object
    gender                     213451 non-null object
    age                        125461 non-null float64
    signup_method              213451 non-null object
    signup_flow                213451 non-null int64
    language                   213451 non-null object
    affiliate_channel          213451 non-null object
    affiliate_provider         213451 non-null object
    first_affiliate_tracked    207386 non-null object
    signup_app                 213451 non-null object
    first_device_type          213451 non-null object
    first_browser              213451 non-null object
    country_destination        213451 non-null object
    dtypes: float64(1), int64(2), object(13)
    memory usage: 26.1+ MB

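1. The train file contains 213451 rows and 16 columns.
2. The output above lists each column's data type and non-null count.
3. age has many null values; during feature extraction, consider treating null as a category of its own.
4. date_first_booking has many null values; consider dropping it during feature extraction.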
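
From the non-null counts above, age is missing in 87990 of 213451 rows (about 41%) and date_first_booking in 124543 rows (about 58%). A small sketch of how such shares could be computed, on a toy frame standing in for `train` (the values and the `age_known` flag are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train` (illustrative values only).
df = pd.DataFrame({
    "age": [38.0, np.nan, 56.0, np.nan],
    "date_first_booking": [np.nan, np.nan, "2010-08-02", np.nan],
    "gender": ["MALE", "-unknown-", "FEMALE", "FEMALE"],
})

# Fraction of missing values per column, largest first.
null_share = df.isnull().mean().sort_values(ascending=False)
print(null_share)

# One possible way to keep "age missing" as information of its own.
age_known = df["age"].notnull().astype(int)
```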



dac_train = pd.to_datetime(train.date_account_created).value_counts()
dac_test = pd.to_datetime(test.date_account_created).value_counts()
# Days elapsed since the earliest account creation date in the training set
dac_train_days = dac_train.index - dac_train.index.min()
dac_test_days = dac_test.index - dac_train.index.min()
plt.figure(figsize=[10,6])
plt.scatter(dac_train_days.days, dac_train.values, color='b', label="train set")
plt.scatter(dac_test_days.days, dac_test.values, color='r', label="test set")
plt.title("Accounts Created vs. Days")
plt.ylabel('Accounts Created')
plt.legend()



    0    20090319043255
    1    20090523174809
    2    20090609231247
    3    20091031060129
    4    20091208061105
    Name: timestamp_first_active, dtype: int64
# Convert to datetime
tfa_train = train.timestamp_first_active.astype(str).apply(lambda x: datetime.datetime(int(x[:4]), int(x[4:6]), int(x[6:8]), int(x[8:10]), int(x[10:12]), int(x[12:])))
tfa_test = test.timestamp_first_active.astype(str).apply(lambda x: datetime.datetime(int(x[:4]), int(x[4:6]), int(x[6:8]), int(x[8:10]), int(x[10:12]), int(x[12:])))
# Days elapsed since the earliest first-active timestamp in the training set
tfa_train_days = (tfa_train - tfa_train.min()).apply(lambda x: x.days).value_counts()
tfa_test_days = (tfa_test - tfa_train.min()).apply(lambda x: x.days).value_counts()
plt.figure(figsize=[10,6])
plt.scatter(tfa_train_days.index, tfa_train_days.values, color='b', label="train set")
plt.scatter(tfa_test_days.index, tfa_test_days.values, color='r', label="test set")
plt.title("User First Active vs. Days")
plt.ylabel('User First Active')
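
The timestamps are integers in YYYYMMDDHHMMSS form, so the per-row slicing could also be replaced by a single vectorised parse. A minimal sketch of that alternative, using three of the sample values shown earlier (the variable names are illustrative):

```python
import pandas as pd

# Sample values copied from the head of timestamp_first_active.
ts = pd.Series([20090319043255, 20090523174809, 20090609231247])

# Parse the YYYYMMDDHHMMSS integers in one vectorised call.
tfa = pd.to_datetime(ts.astype(str), format="%Y%m%d%H%M%S")

# Whole days elapsed since the earliest timestamp in the series.
days_since_first = (tfa - tfa.min()).dt.days
print(days_since_first.tolist())  # [0, 65, 82]
```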





30.0    6124
31.0    6016
29.0    5963
28.0    5939
32.0    5855
Name: age, dtype: int64


age_step = 10

def ageProcess(age):
    '''Preprocess and discretise the age field.
    1. Some values look like 19xx; presumably the user entered a birth year instead of an age.
    2. Discretise age into groups of width age_step.
    '''
    if age >= 1900 and age < 2014:
        age = 2014 - age
    if age < 0 or pd.isnull(age):
        return "NA"
    if age < (1 * age_step):
        return "< %d" % age_step
    elif age < (2 * age_step):
        return "%d ~ %d" % (1*age_step, 2*age_step)
    elif age < (3 * age_step):
        return "%d ~ %d" % (2*age_step, 3*age_step)
    elif age < (4 * age_step):
        return "%d ~ %d" % (3*age_step, 4*age_step)
    elif age < (5 * age_step):
        return "%d ~ %d" % (4*age_step, 5*age_step)
    elif age < (6 * age_step):
        return "%d ~ %d" % (5*age_step, 6*age_step)
    elif age < (7 * age_step):
        return "%d ~ %d" % (6*age_step, 7*age_step)
    elif age < (8 * age_step):
        return "%d ~ %d" % (7*age_step, 8*age_step)
    elif age < (9 * age_step):
        return "%d ~ %d" % (8*age_step, 9*age_step)
    elif age < 110:
        return "%d ~ 110" % (9 * age_step)
    else:
        return "non-physical"

age_train = train.age.apply(ageProcess)
age_test = test.age.apply(ageProcess)
# One bar chart per data set, side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 6))
ax1.bar(age_train.value_counts().index, age_train.value_counts().values)
ax2.bar(age_test.value_counts().index, age_test.value_counts().values)
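
The same bucketing can also be expressed with `pd.cut`, which avoids the long `elif` chain. The sketch below is an alternative, not the code used in this post; `bucket_age` and `ref_year` are hypothetical names, and the bin width is fixed at 10 to match `age_step`:

```python
import numpy as np
import pandas as pd

def bucket_age(ages, ref_year=2014):
    """Hypothetical helper: same buckets as ageProcess, built with pd.cut."""
    age = pd.Series(ages, dtype="float64")
    # Values that look like a birth year are converted to an age.
    is_year = (age >= 1900) & (age < ref_year)
    age = age.where(~is_year, ref_year - age)
    # Left-closed bins: [0, 10), [10, 20), ..., [90, 110), [110, inf).
    edges = list(range(0, 100, 10)) + [110, np.inf]
    labels = ["< 10"] + ["%d ~ %d" % (lo, lo + 10) for lo in range(10, 90, 10)]
    labels += ["90 ~ 110", "non-physical"]
    binned = pd.cut(age, bins=edges, labels=labels, right=False)
    # Missing and negative ages fall outside every bin and become "NA".
    valid = age.notnull() & (age >= 0)
    return binned.astype(object).where(valid, "NA")

print(bucket_age([25, 1980, np.nan, 2014]).tolist())
```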


def feature_barplot(df_train, df_test, feature, figsize=(12,6), rot=90, saveimg=False):
    # Count how often each value of the feature occurs in each data set
    feature_train_counts = df_train[feature].value_counts()
    feature_test_counts = df_test[feature].value_counts()
    fig, (ax1, ax2) = plt.subplots(1, 2, sharex=True, sharey=False, figsize=figsize)
    sns.barplot(x=feature_train_counts.index, y=feature_train_counts.values, ax=ax1)
    sns.barplot(x=feature_test_counts.index, y=feature_test_counts.values, ax=ax2)
    ax1.set_xticklabels(ax1.xaxis.get_majorticklabels(), rotation=rot)
    ax2.set_xticklabels(ax1.xaxis.get_majorticklabels(), rotation=rot)
    ax1.set_title(feature + ' of training set')
    ax2.set_title(feature + ' of test set')
    ax1.set_ylabel('Counts')
    plt.tight_layout()
    if saveimg:
        figname = feature + ".png"
        fig.savefig(figname, dpi=75)


feature_pool = ['gender', 'signup_method', 'signup_flow', 'language', 'affiliate_channel', 'affiliate_provider', 'first_affiliate_tracked', 'signup_app', 'first_device_type', 'first_browser']
for feature in feature_pool:
    feature_barplot(train, test, feature)

### `Feature extraction`

# Split the training set into features and label
X_train = train.loc[:, train.columns != 'country_destination'].copy()
y_train = train.loc[:, train.columns == 'country_destination'].copy()
X_test = test.copy()
# Concatenate the training and test features so later processing is applied to both
X_train["train/test"] = "train"
X_test["train/test"] = "test"
X_total = pd.concat([X_train, X_test], axis=0, ignore_index=True)


Y = 2000
season = [(0, (datetime.date(Y,1,1), datetime.date(Y,3,20))),
          (1, (datetime.date(Y,3,21), datetime.date(Y,6,20))),
          (2, (datetime.date(Y,6,21), datetime.date(Y,9,20))),
          (3, (datetime.date(Y,9,21), datetime.date(Y,12,20))),
          (0, (datetime.date(Y,12,21), datetime.date(Y,12,31)))]

def get_season(dt):
    dt = datetime.date(Y, dt.month, dt.day)
    for season_id, (start_date, end_date) in season:
        if start_date <= dt and dt <= end_date:
            return season_id

def create_dt_feature(df, feature):
    year = "%s_year" % feature
    month = "%s_month" % feature
    day_of_month = "%s_day_of_month" % feature
    weekday = "%s_weekday" % feature
    season = "%s_season" % feature
    df[year] = df[feature].apply(lambda x: x.year)
    df[month] = df[feature].apply(lambda x: x.month)
    df[day_of_month] = df[feature].apply(lambda x: x.day)
    df[weekday] = df[feature].apply(lambda x: x.isoweekday())
    df[season] = df[feature].apply(get_season)
    return df

X_total["dac"] = pd.to_datetime(X_total["date_account_created"])
X_total = create_dt_feature(X_total, "dac")
X_total.drop(["date_account_created"], axis=1, inplace=True)
id timestamp_first_active date_first_booking gender age signup_method signup_flow language affiliate_channel affiliate_provider signup_app first_device_type first_browser train/test dac dac_year dac_month dac_day_of_month dac_weekday dac_season
0 gxn3p5htnn 20090319043255 NaN -unknown- NaN facebook 0 en direct direct Web Mac Desktop Chrome train 2010-06-28 2010 6 28 1 2
1 820tgsjxq7 20090523174809 NaN MALE 38.0 facebook 0 en seo google Web Mac Desktop Chrome train 2011-05-25 2011 5 25 3 1
2 4ft3gnwmtx 20090609231247 2010-08-02 FEMALE 56.0 basic 3 en direct direct Web Windows Desktop IE train 2010-09-28 2010 9 28 2 3
3 bjjt8pjhuk 20091031060129 2012-09-08 FEMALE 42.0 facebook 0 en direct direct Web Mac Desktop Firefox train 2011-12-05 2011 12 5 1 3
4 87mebub9p4 20091208061105 2010-02-18 -unknown- 41.0 basic 0 en direct direct Web Mac Desktop Chrome train 2010-09-14 2010 9 14 2 2

5 rows × 21 columns


X_total["tfa"] = X_total["timestamp_first_active"].astype("str").apply(lambda x: datetime.datetime(int(x[:4]),int(x[4:6]),int(x[6:8]),int(x[8:10]),int(x[10:12]),int(x[12:])))
X_total = create_dt_feature(X_total, "tfa")
X_total.drop(["timestamp_first_active"], axis=1, inplace=True)
id date_first_booking gender age signup_method signup_flow language affiliate_channel affiliate_provider first_affiliate_tracked dac_month dac_day_of_month dac_weekday dac_season tfa tfa_year tfa_month tfa_day_of_month tfa_weekday tfa_season
0 gxn3p5htnn NaN -unknown- NaN facebook 0 en direct direct untracked 6 28 1 2 2009-03-19 04:32:55 2009 3 19 4 0
1 820tgsjxq7 NaN MALE 38.0 facebook 0 en seo google untracked 5 25 3 1 2009-05-23 17:48:09 2009 5 23 6 1
2 4ft3gnwmtx 2010-08-02 FEMALE 56.0 basic 3 en direct direct untracked 9 28 2 3 2009-06-09 23:12:47 2009 6 9 2 1
3 bjjt8pjhuk 2012-09-08 FEMALE 42.0 facebook 0 en direct direct untracked 12 5 1 3 2009-10-31 06:01:29 2009 10 31 6 3
4 87mebub9p4 2010-02-18 -unknown- 41.0 basic 0 en direct direct untracked 9 14 2 2 2009-12-08 06:11:05 2009 12 8 2 3

5 rows × 26 columns


X_total["dt_span"] = X_total['dac'].subtract(X_total["tfa"]).dt.days
X_total["dt_span"].value_counts().head()
-1    275369
 0         7
 6         4
 5         4
 1         4
Name: dt_span, dtype: int64
count    275547.000000
mean         -0.820423
std          10.516688
min          -1.000000
25%          -1.000000
50%          -1.000000
75%          -1.000000
max        1455.000000
Name: dt_span, dtype: float64
def get_span_class(dt_span):
    if dt_span == -1:
        return "One day"
    elif dt_span < 7:
        return "One week"
    elif dt_span < 30:
        return "One month"
    elif dt_span < 365:
        return "One year"
    else:
        return "Over one year"
X_total["dt_span"] = X_total["dt_span"].apply(get_span_class)
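
Almost every dt_span is -1 because dac is parsed as midnight of the creation date while tfa keeps its time of day: for a same-day signup the difference is a negative fraction of a day, and `.days` floors it to -1. A small worked example (the timestamps are illustrative):

```python
import pandas as pd

# Same calendar day: the account date is parsed as midnight,
# while the first-active timestamp keeps its time of day.
dac = pd.Timestamp("2014-01-01")
tfa = pd.Timestamp("2014-01-01 17:48:09")

# The difference is negative and shorter than a day, so .days is -1.
span = dac - tfa
print(span.days)

# Comparing calendar dates instead gives 0 for the same day.
same_day_span = (dac.normalize() - tfa.normalize()).days
print(same_day_span)
```

This is why `get_span_class` treats -1, rather than 0, as the same-day class.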


X_total["age"] = X_total["age"].apply(ageProcess)
NA              116866
30 ~ 40          58598
20 ~ 30          48776
40 ~ 50          24623
50 ~ 60          12774
60 ~ 70           6191
10 ~ 20           2984
90 ~ 110          1879
70 ~ 80           1472
non-physical       960
80 ~ 90            325
< 10                99
Name: age, dtype: int64

One-hot encoding of categorical features

# Select the features to one-hot encode, including the dac, tfa, dt_span and age features generated earlier
feature_pool = ['gender', 'signup_method', 'signup_flow', 'language', 'affiliate_channel', 'affiliate_provider', 'first_affiliate_tracked',
'signup_app', 'first_device_type', 'first_browser']
feature_pool = feature_pool + ["dac_weekday", "dac_season", "tfa_weekday", "tfa_season", "dt_span", "age"]
for feature in feature_pool:
    df_temp = pd.get_dummies(X_total[feature], prefix=feature)
    X_total = pd.concat([X_total, df_temp], axis=1)


X_total.drop(["dac", "tfa", "date_first_booking"], axis=1, inplace=True)
X_total.drop(feature_pool, axis=1, inplace=True)
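
The encode, concatenate and drop steps can also be done in one call through the `columns=` argument of `pd.get_dummies`. A minimal sketch of that alternative on a toy frame (illustrative values only):

```python
import pandas as pd

# Toy frame (illustrative values only).
df = pd.DataFrame({
    "id": ["a", "b", "c"],
    "gender": ["MALE", "FEMALE", "-unknown-"],
    "signup_flow": [0, 3, 0],
})

# `columns=` encodes the listed columns and drops the originals in one step;
# numeric codes such as signup_flow are treated as categories too.
encoded = pd.get_dummies(df, columns=["gender", "signup_flow"])
print(encoded.columns.tolist())
```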

3.2 The sessions file

session = pd.read_csv("../input/sessions.csv")
user_id action action_type action_detail device_type secs_elapsed
0 d1mm9tcy42 lookup NaN NaN Windows Desktop 319.0
1 d1mm9tcy42 search_results click view_search_results Windows Desktop 67753.0
2 d1mm9tcy42 lookup NaN NaN Windows Desktop 301.0
3 d1mm9tcy42 search_results click view_search_results Windows Desktop 22141.0
4 d1mm9tcy42 lookup NaN NaN Windows Desktop 435.0


session["id"] = session["user_id"]
session.drop(["user_id"], inplace=True, axis=1)
action action_type action_detail device_type secs_elapsed id
0 lookup NaN NaN Windows Desktop 319.0 d1mm9tcy42
1 search_results click view_search_results Windows Desktop 67753.0 d1mm9tcy42
2 lookup NaN NaN Windows Desktop 301.0 d1mm9tcy42
3 search_results click view_search_results Windows Desktop 22141.0 d1mm9tcy42
4 lookup NaN NaN Windows Desktop 435.0 d1mm9tcy42
action             79626
action_type      1126204
action_detail    1126204
device_type            0
secs_elapsed      136031
id                 34496
dtype: int64


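# Assign the result back: an inplace fillna on a selected column is not reliable in newer pandas versions
session["action"] = session["action"].fillna("NAN")
session["action_type"] = session["action_type"].fillna("NAN")
session["action_detail"] = session["action_detail"].fillna("NAN")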
action                0
action_type           0
action_detail         0
device_type           0
secs_elapsed     136031
id                34496
dtype: int64
session1 = session.copy()

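- First, group the session records by user id.
- action: for each user, count the total number of actions and the number of occurrences of each action value, plus the mean and standard deviation of those per-value counts.
- action_detail: for each user, count the number of occurrences of each action_detail value, plus the mean and standard deviation of those per-value counts.
- action_type: for each user, count the number of occurrences of each action_type value, the mean and standard deviation of those per-value counts, and the total time spent per action_type (log-transformed).
- device_type: for each user, count the number of occurrences of each device_type value, plus the mean and standard deviation of those per-value counts.
- secs_elapsed: fill missing values with 0, then compute for each user the sum, mean, standard deviation and median of secs_elapsed (all log-transformed), the log-sum divided by the user's total number of actions, and a histogram of the log-transformed secs_elapsed values.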
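
Some of these per-user statistics can be sketched with pandas' built-in group aggregation. The snippet below is a simplified illustration on a toy log and does not reproduce the full feature set built later in this section; the frame and the column names `n_actions` and `n_distinct` are assumptions:

```python
import numpy as np
import pandas as pd

# Toy session log (illustrative values only).
sess = pd.DataFrame({
    "id": ["u1", "u1", "u1", "u2"],
    "action": ["lookup", "search_results", "lookup", "show"],
    "secs_elapsed": [319.0, np.nan, 301.0, 435.0],
})
sess["secs_elapsed"] = sess["secs_elapsed"].fillna(0)

grouped = sess.groupby("id")

# Per-user total number of actions and number of distinct actions.
counts = grouped["action"].agg(n_actions="size", n_distinct="nunique")

# Per-user time statistics, log-transformed as log(1 + x).
secs = np.log1p(grouped["secs_elapsed"].agg(["sum", "mean", "median"]))

features = counts.join(secs)
print(features)
```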

for feature in ["action", "action_detail", "action_type", "device_type"]:
    print(len(session1[feature].value_counts().index))
action_list = dict(session1.action.value_counts())
action_detail_list = dict(session1.action_detail.value_counts())
session1["action"] = session1["action"].apply(lambda x: x if action_list[x] > 1000 else "other")
session1["action_detail"] = session1["action_detail"].apply(lambda x: x if action_detail_list[x] > 1000 else "other")
f_act = session1.action.value_counts().argsort()
f_act_detail = session1.action_detail.value_counts().argsort()
f_act_type = session1.action_type.value_counts().argsort()
f_dev_type = session1.device_type.value_counts().argsort()
show              148
index             147
search_results    146
personalize       145
search            144
Name: action, dtype: int64
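
The result of `value_counts().argsort()` is used as a lookup table from each category label to a distinct integer, which then serves as that category's position in the per-user count vector. A small sketch of the idea on toy data (illustrative values only):

```python
import pandas as pd

# Toy action column (illustrative values only).
actions = pd.Series(["show", "show", "show", "index", "index", "search"])

# value_counts() sorts by frequency; argsort() then yields one distinct
# integer per label, used here as that label's column position.
position = actions.value_counts().argsort()
print(position)

# Count vector for one user's actions, indexed by those positions.
user_actions = ["show", "search", "show"]
counts = [0] * len(position)
for a in user_actions:
    counts[position[a]] += 1
print(counts)
```

Because the counts are sorted in descending order before `argsort()`, the most frequent label ends up with the largest position, which matches the output above where `show` maps to 148.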
dgr_session = session1.groupby("id")
samples = []
ln = len(dgr_session)
for g in dgr_session:
    gr = g[1]  # DataFrame holding all session rows for this id
    l = []  # temporary feature list for this user

    # id
    l.append(g[0])

    # number of total actions
    l.append(len(gr))

    # Fill missing secs_elapsed with 0, then take the values
    sev = gr.secs_elapsed.fillna(0).values

    # action
    # Count of each action value for this user
    c_act = [0] * len(f_act)
    for i, v in enumerate(gr.action.values):
        c_act[f_act[v]] += 1
    _, c_act_uqc = np.unique(gr.action.values, return_counts=True)
    # Number of distinct actions, plus mean and std of their counts
    c_act += [len(c_act_uqc), np.mean(c_act_uqc), np.std(c_act_uqc)]
    l = l + c_act

    # action_type (counts and total time per type, sized by the number of action_type values)
    c_act_type = [0] * len(f_act_type)
    l_act_type = [0] * len(f_act_type)
    for i, v in enumerate(gr.action_type.values):
        c_act_type[f_act_type[v]] += 1
        l_act_type[f_act_type[v]] += sev[i]
    _, c_act_type_uqc = np.unique(gr.action_type.values, return_counts=True)
    l_act_type = np.log(1 + np.array(l_act_type)).tolist()
    c_act_type += [len(c_act_type_uqc), np.mean(c_act_type_uqc), np.std(c_act_type_uqc)]
    l = l + c_act_type + l_act_type

    # action_detail
    c_act_detail = [0] * len(f_act_detail)
    for i, v in enumerate(gr.action_detail.values):
        c_act_detail[f_act_detail[v]] += 1
    _, c_act_detail_uqc = np.unique(gr.action_detail.values, return_counts=True)
    c_act_detail += [len(c_act_detail_uqc), np.mean(c_act_detail_uqc), np.std(c_act_detail_uqc)]
    l = l + c_act_detail

    # device type
    c_dev_type = [0] * len(f_dev_type)
    for i, v in enumerate(gr.device_type.values):
        c_dev_type[f_dev_type[v]] += 1
    c_dev_type.append(len(np.unique(gr.device_type.values)))
    _, c_dev_type_uqc = np.unique(gr.device_type.values, return_counts=True)
    c_dev_type += [len(c_dev_type_uqc), np.mean(c_dev_type_uqc), np.std(c_dev_type_uqc)]
    l = l + c_dev_type

    # secs_elapsed features (time spent)
    l_secs = [0] * 5
    l_log = [0] * 15
    if len(sev) > 0:
        # Simple statistics about the secs_elapsed values.
        l_secs[0] = np.log(1 + np.sum(sev))
        l_secs[1] = np.log(1 + np.mean(sev))
        l_secs[2] = np.log(1 + np.std(sev))
        l_secs[3] = np.log(1 + np.median(sev))
        l_secs[4] = l_secs[0] / float(l[1])
        # Histogram of the integer part of log(1 + secs_elapsed)
        log_sev = np.log(1 + sev).astype(int)
        l_log = np.bincount(log_sev, minlength=15).tolist()
    l = l + l_secs + l_log
    samples.append(l)

samples = np.array(samples)
samp_ar = samples[:, 1:].astype(np.float16)
samp_id = samples[:, 0]
col_names = []
for i in range(len(samples[0])-1):
    col_names.append('c_' + str(i))
df_agg_sess = pd.DataFrame(samp_ar, columns=col_names)
df_agg_sess['id'] = samp_id
df_agg_sess.index = df_agg_sess.id
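
The last 15 features of each row come from a histogram of log-transformed secs_elapsed values. A small standalone sketch of that step, using a handful of values of the kind seen in the sessions table (the array itself is illustrative):

```python
import numpy as np

# Times in seconds for one user (illustrative values only).
sev = np.array([0.0, 319.0, 301.0, 22141.0, 67753.0])

# The integer part of log(1 + secs) assigns each value to a bucket.
log_sev = np.log(1 + sev).astype(int)

# Count values per bucket; minlength pads the result to at least 15 entries.
hist = np.bincount(log_sev, minlength=15)
print(hist.tolist())
```

Note that `minlength` only pads: a value with 1 + secs at or above e^15 would fall into bucket 15 or higher and make the vector longer than 15.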

3.3 Merging all extracted features

# To be continued


