Adversarial validation

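A while ago I took part in Kaggle's Toxic competition and finished in the top 1% (I was mostly winging it; being busy with project work is my excuse for not climbing higher, or at least that is how I console myself). Going back through the kernels other people had shared, I came across adversarial validation. This post borrows directly from fastml and from one Kaggle kernel.

In a competition, the test set may be distributed noticeably differently from the training set. When that happens, the usual k-fold cross-validation may not work as intended: the cross-validation score on the training set can be far from the score on the test set.

This can happen for the following reasons:

1. The model fits the training set too closely, so it may have learned the noise as well.
2. The training set and the test set differ significantly, i.e. they come from different distributions.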

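The problem

For the first cause, regularization can be used to address overfitting. But what can we do about the second? If you have taken part in a competition like those on Kaggle, you know the format: you download a train set and a test set, train your model on the train data, predict on the test data, and upload the predictions to Kaggle to get your ranking.

The usual approach is to split the full train set into train and valid sets. The valid data is used to evaluate the model. Ideally the valid set is representative of the test set, i.e. identically distributed, so that the score on valid approximates the score the model will get on test. If the train and test distributions differ, the model will score differently on the valid set and on the test set.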

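The solution

The solution is adversarial validation, a technique I first saw on fastml. It works as follows:

1. Build a classifier to distinguish the train set from the test set

Merge the train and test data into one dataset and add a label column isTest, set to 0 for train rows and 1 for test rows. Then build a binary classifier (any classifier will do, e.g. RF, xgboost or LR) to tell train rows from test rows.

2. Sort the predicted probabilities of the train rows in descending order

Once the model is trained, use it to predict on the train data. This gives each train row a fitted probability. Sort the rows by probability in descending order; the rows at the top are the ones the model most confidently predicts to be test data.

3. Take the top n% of the train rows as the valid set

For example, take 30% as the validation set and train the model on the remaining rows.

With this approach, the metric on the valid set should be similar to the metric on the test set. If the model predicts the valid data well, it should also predict the test data well.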
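
The three steps can be sketched roughly as follows. This is only an illustration on synthetic data: the arrays X_train and X_test, the 0.5 shift between them and the choice of logistic regression are assumptions made for the example, not part of the fastml post.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical data: test features are shifted relative to train features.
X_train = rng.normal(loc=0.0, scale=1.0, size=(1000, 5))
X_test = rng.normal(loc=0.5, scale=1.0, size=(1000, 5))

# Step 1: merge, label (isTest = 0 for train, 1 for test) and fit a classifier.
X_all = np.vstack([X_train, X_test])
is_test = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
clf = LogisticRegression(max_iter=1000).fit(X_all, is_test)

# Step 2: predict P(isTest = 1) for the train rows and sort in descending order.
p_test = clf.predict_proba(X_train)[:, 1]
order = np.argsort(-p_test)

# Step 3: the top 30% most test-like train rows become the valid set.
n_valid = int(0.30 * len(X_train))
valid_idx = order[:n_valid]
train_idx = order[n_valid:]

print(len(train_idx), len(valid_idx))
```

Here the probabilities are in-sample, as in the description above; out-of-fold predictions are a common alternative when the classifier is flexible enough to memorize the rows it was fitted on.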

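Examples

Ideally, of course, train and test come from the same distribution, so the valid score transfers well to the test data. In that case, a classifier trained to tell train from test will do no better than random guessing, which corresponds to an ROC AUC of 0.5.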
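
To make the 0.5 figure concrete, here is a small sketch that computes the AUC directly as the probability that a randomly chosen test row gets a higher score than a randomly chosen train row. The one-dimensional synthetic samples and the helper auc_from_scores are hypothetical and exist only for this illustration; the raw feature value plays the role of the classifier score.

```python
import numpy as np

def auc_from_scores(scores_pos, scores_neg):
    """ROC AUC as P(random positive score > random negative score), ties counted as half."""
    diff = scores_pos[:, None] - scores_neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

rng = np.random.default_rng(0)

train_scores = rng.normal(0.0, 1.0, 2000)
same_test_scores = rng.normal(0.0, 1.0, 2000)     # same distribution as train
shifted_test_scores = rng.normal(1.0, 1.0, 2000)  # shifted distribution

auc_same = auc_from_scores(same_test_scores, train_scores)
auc_shifted = auc_from_scores(shifted_test_scores, train_scores)

print(round(auc_same, 3), round(auc_shifted, 3))
```

When the two samples come from the same distribution the value stays close to 0.5; once the test sample is shifted it moves clearly above 0.5.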

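This example comes from fastml and uses the Santander data.

Label the train and test sets. Note that this code labels train as 1 and test as 0, the reverse of the isTest convention above:

import numpy as np
import pandas as pd

train = pd.read_csv( 'data/train.csv' )
test = pd.read_csv( 'data/test.csv' )
train['TARGET'] = 1
test['TARGET'] = 0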

Merge the train and test data into a single dataset and shuffle it:

data = pd.concat(( train, test ))
data = data.iloc[ np.random.permutation(len( data )) ]
data.reset_index( drop = True, inplace = True )
x = data.drop( [ 'TARGET', 'ID' ], axis = 1 )
y = data.TARGET

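Split the combined data into a new train set and test set:

# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
# train_examples (the number of rows to train on) is assumed to be defined earlier
x_train, x_test, y_train, y_test = train_test_split( x, y, train_size = train_examples )

Train logistic regression and random forest classifiers to tell train from test. The resulting AUCs are all close to 50%, i.e. the classifiers cannot tell the two sets apart: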

# logistic regression / AUC: 49.82%
# random forest, 10 trees / AUC: 50.05%
# random forest, 100 trees / AUC: 49.95%
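
The training code itself is not shown here. A minimal sketch of what fitting and scoring such classifiers might look like is given below; it uses synthetic data in which both sets are drawn from the same distribution (a hypothetical stand-in, not the Santander data), and the split size and model settings are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-in: "train" and "test" rows drawn from the same distribution.
x = rng.normal(size=(4000, 10))
y = np.repeat([1, 0], 2000)  # 1 = train row, 0 = test row

x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.75, random_state=0, stratify=y
)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest, 100 trees": RandomForestClassifier(n_estimators=100, random_state=0),
}

aucs = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    p = model.predict_proba(x_test)[:, 1]
    aucs[name] = roc_auc_score(y_test, p)
    print(f"{name} / AUC: {aucs[name]:.2%}")
```

Because the two groups are indistinguishable by construction, both AUCs should land near 50%.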

Results on the valid set:

# logistic regression / AUC: 58.30%
# random forest / AUC: 75.32%

Results on the test set:

# logistic regression / AUC: 61.47%
# random forest / AUC: 74.37%

It looks like the valid results are close to the test results.

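Next, let's look at an example where the valid data and the test data differ. This effect shows up with the Numerai data, which comes from financial time series.

Logistic regression was tried first, giving the following validation scores:

LR
AUC: 52.67%, accuracy: 52.74%

MinMaxScaler + LR
AUC: 53.52%, accuracy: 52.48%

Leaderboard scores (AUC):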

# AUC 0.51706 / LR
# AUC 0.52781 / MinMaxScaler + LR
# AUC 0.51784 / PolynomialFeatures + MinMaxScaler + LR

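These results show a gap between the valid scores and the test scores.

We would like a validation set that is representative of the Numerai test set. To get one, we pick out the train rows that are most similar to the test data.

Specifically, we train a classifier with cross-validation so that we get a predicted probability for every train row. We then look at which train rows are wrongly predicted to be test rows and use them as the valid data. These rows look like test data but are actually train data.

First, let's train a classifier to tell train from test, as we did for the Santander data. This time we get an AUC of 0.87 rather than 0.5, which means the model can separate train from test quite well. A randomly drawn valid set is therefore clearly unreliable, so we predict on all train rows via cross-validation and sort the rows by predicted probability: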

i = predictions.argsort()
train['p'] = predictions
train_sorted = train.iloc[i]
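
The snippet above assumes a predictions array already exists. One way to obtain out-of-fold probabilities for every train row is scikit-learn's cross_val_predict. The sketch below uses synthetic frames; the column names, the 0.7 shift and the is_test labeling (0 for train, 1 for test) are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.default_rng(0)

# Hypothetical data: the test rows are shifted relative to the train rows.
train = pd.DataFrame(rng.normal(0.0, 1.0, size=(1500, 4)), columns=list("abcd"))
test = pd.DataFrame(rng.normal(0.7, 1.0, size=(1500, 4)), columns=list("abcd"))

x = pd.concat([train, test], ignore_index=True)
is_test = np.concatenate([np.zeros(len(train), dtype=int), np.ones(len(test), dtype=int)])

# Each row is scored by a model that did not see it during fitting.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
proba = cross_val_predict(
    LogisticRegression(max_iter=1000), x, is_test, cv=cv, method="predict_proba"
)[:, 1]

# Keep only the train rows' probabilities of being a test row.
predictions = proba[: len(train)]

i = predictions.argsort()  # ascending: the most test-like rows end up last
train_sorted = train.iloc[i].assign(p=predictions[i])
print(train_sorted["p"].iloc[[0, -1]].round(3).tolist())
```

With this labeling the most test-like train rows sit at the end of train_sorted, which is why taking the last rows as the validation set makes sense.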

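valid and test

Using the sorted order, split off a fixed number of rows by position: the last val_size rows become the valid set and the rest stay in train. Here data is assumed to be the sorted train data (e.g. train_sorted above):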

val_size = 5000
train = data.iloc[:-val_size]
val = data.iloc[-val_size:]

The results are:

LR
AUC: 52.54%, accuracy: 51.96%, log loss: 69.22%

Pipeline(steps=[('poly', PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)), ('scaler', MinMaxScaler(copy=True, feature_range=(0, 1)))])
AUC: 52.57%, accuracy: 51.76%, log loss: 69.58%

The difference between the two models on the valid data is very small. Next, we compare these with the test results.

Valid results (log loss):

# 0.6922 / LR
# 0.6958 / PolynomialFeatures + MinMaxScaler + LR

Public leaderboard results:

# 0.6910 / LR
# 0.6923 / PolynomialFeatures + MinMaxScaler + LR

Private leaderboard results:

# 0.6916 / LR
# 0.6954 / PolynomialFeatures + MinMaxScaler + LR

The results on the new valid set are close to the test results.

Toxic

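Adversarial validation is a way to check whether the training and test sets differ significantly. The idea is to use the dataset's features to try to separate training samples from test samples.

So we build a binary classification problem over the train and test sets, with target 1 for train and 0 for test, and train a classifier on the given features to predict whether a sample belongs to the train set or the test set.

Here we use a logistic regression model on TF-IDF features to check whether the text feature distributions differ, i.e. whether the samples can be separated.

Reference: kernel

Load the datasets: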

import numpy as np
import pandas as pd
trn = pd.read_csv("../input/train.csv", encoding="utf-8")
sub = pd.read_csv("../input/test.csv", encoding="utf-8")

Build the TF-IDF features:

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import regex
vectorizer = TfidfVectorizer(sublinear_tf=True,strip_accents='unicode',tokenizer=lambda x: regex.findall(r'[^\p{P}\W]+', x),analyzer='word',token_pattern=None,stop_words='english',ngram_range=(1, 1), max_features=20000
)
trn_idf = vectorizer.fit_transform(trn.comment_text)
trn_vocab = vectorizer.vocabulary_
sub_idf = vectorizer.fit_transform(sub.comment_text)
sub_vocab = vectorizer.vocabulary_
all_idf = vectorizer.fit_transform(pd.concat([trn.comment_text, sub.comment_text], axis=0))
all_vocab = vectorizer.vocabulary_

Convert the vocabulary dictionaries into word lists:

trn_words = [word for word in trn_vocab.keys()]
sub_words = [word for word in sub_vocab.keys()]
all_words = [word for word in all_vocab.keys()]

Look at how the words are distributed:

number of words in both train and test : 16182
number of words in all_words not in train : 1931
number of words in all_words not in test : 1896
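
The code that printed these counts is not shown. Counts of this kind could be produced with plain set operations on the word lists; the sketch below uses tiny toy vocabularies (hypothetical, standing in for trn_words, sub_words and all_words) so the arithmetic is easy to follow.

```python
# Toy vocabularies standing in for trn_words, sub_words and all_words.
trn_words = ["good", "bad", "wiki", "edit", "talk"]
sub_words = ["good", "bad", "wiki", "page", "user"]
all_words = ["good", "bad", "wiki", "edit", "page", "spam"]

trn_set, sub_set, all_set = set(trn_words), set(sub_words), set(all_words)

in_both = trn_set & sub_set          # words present in both vocabularies
all_not_in_train = all_set - trn_set  # words in the combined vocabulary missing from train
all_not_in_test = all_set - sub_set   # words in the combined vocabulary missing from test

print("number of words in both train and test :", len(in_both))
print("number of words in all_words not in train :", len(all_not_in_train))
print("number of words in all_words not in test :", len(all_not_in_test))
```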

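This means the train and test vocabularies differ noticeably. Let's see whether logistic regression can separate train from test, using the TF-IDF features built on the train and test data together.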

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
# Create target where all train samples are ones and all test samples are zeros
target = np.hstack((np.ones(trn.shape[0]), np.zeros(sub.shape[0])))
# Shuffle samples to mix zeros and ones
idx = np.arange(all_idf.shape[0])
np.random.seed(1)
np.random.shuffle(idx)
all_idf = all_idf[idx]
target = target[idx]
# Train a Logistic Regression
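# shuffle and random_state are passed by keyword, as recent scikit-learn versions require
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for trn_idx, val_idx in folds.split(all_idf, target):
    lr = LogisticRegression()
    lr.fit(all_idf[trn_idx], target[trn_idx])
    print(roc_auc_score(target[val_idx], lr.predict_proba(all_idf[val_idx])[:, 1]))

The results are: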

0.6847478442985545
0.6838642860252299
0.6862951075131871
0.6883996519299357
0.6882208658576139

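These results show that, with nothing more than TF-IDF features built on the train and test sets, there is already a significant difference between train and test. Perhaps we can build a valid set whose distribution is close to that of the test data.

Next, try using only the words shared by the train and test data as features (which should reduce the effect of the differing word distributions), with the minimum document frequency set to 3: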

vectorizer = TfidfVectorizer(sublinear_tf=True,strip_accents='unicode',tokenizer=lambda x: regex.findall(r'[^\p{P}\W]+', x),analyzer='word',token_pattern=None,stop_words='english',ngram_range=(1, 1), max_features=20000,min_df=3
)
trn_idf = vectorizer.fit_transform(trn.comment_text)
trn_vocab = vectorizer.vocabulary_
sub_idf = vectorizer.fit_transform(sub.comment_text)
sub_vocab = vectorizer.vocabulary_

Build the TF-IDF features on the shared vocabulary:

trn_words = [word for word in trn_vocab.keys()]
sub_words = [word for word in sub_vocab.keys()]
print("Number of words in common : ", len(set(trn_words).intersection(set(sub_words))))
vectorizer = TfidfVectorizer(sublinear_tf=True,strip_accents='unicode',tokenizer=lambda x: regex.findall(r'[^\p{P}\W]+', x),analyzer='word',token_pattern=None,stop_words='english',ngram_range=(1, 1), vocabulary=list(set(trn_words).intersection(set(sub_words)))
)
all_idf = vectorizer.fit_transform(pd.concat([trn.comment_text, sub.comment_text], axis=0))
# Create target where all train samples are ones and all test samples are zeros
target = np.hstack((np.ones(trn.shape[0]), np.zeros(sub.shape[0])))
# Shuffle samples to mix zeros and ones
idx = np.arange(all_idf.shape[0])
np.random.seed(1)
np.random.shuffle(idx)
all_idf = all_idf[idx]
target = target[idx]
# Train a Logistic Regression
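# shuffle and random_state are passed by keyword, as recent scikit-learn versions require
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for trn_idx, val_idx in folds.split(all_idf, target):
    lr = LogisticRegression()
    lr.fit(all_idf[trn_idx], target[trn_idx])
    print(roc_auc_score(target[val_idx], lr.predict_proba(all_idf[val_idx])[:, 1]))

The results are shown below: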

Number of words in common :  16416
0.6758812171438944
0.6755599693902824
0.6787566884700114
0.6796040649316202
0.6789255076072573

Try a smaller vocabulary:

vectorizer = TfidfVectorizer(sublinear_tf=True,strip_accents='unicode',tokenizer=lambda x: regex.findall(r'[^\p{P}\W]+', x),analyzer='word',token_pattern=None,stop_words='english',ngram_range=(1, 1), max_features=500,min_df=3
)
trn_idf = vectorizer.fit_transform(trn.comment_text)
trn_vocab = vectorizer.vocabulary_
sub_idf = vectorizer.fit_transform(sub.comment_text)
sub_vocab = vectorizer.vocabulary_
trn_words = [word for word in trn_vocab.keys()]
sub_words = [word for word in sub_vocab.keys()]
print("Number of words in common : ", len(set(trn_words).intersection(set(sub_words))))
vectorizer = TfidfVectorizer(sublinear_tf=True,strip_accents='unicode',tokenizer=lambda x: regex.findall(r'[^\p{P}\W]+', x),analyzer='word',token_pattern=None,stop_words='english',ngram_range=(1, 1), vocabulary=list(set(trn_words).intersection(set(sub_words)))
)
all_idf = vectorizer.fit_transform(pd.concat([trn.comment_text, sub.comment_text], axis=0))
# Create target where all train samples are ones and all test samples are zeros
target = np.hstack((np.ones(trn.shape[0]), np.zeros(sub.shape[0])))
# Shuffle samples to mix zeros and ones
idx = np.arange(all_idf.shape[0])
np.random.seed(1)
np.random.shuffle(idx)
all_idf = all_idf[idx]
target = target[idx]
# Train a Logistic Regression
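# shuffle and random_state are passed by keyword, as recent scikit-learn versions require
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for trn_idx, val_idx in folds.split(all_idf, target):
    lr = LogisticRegression()
    lr.fit(all_idf[trn_idx], target[trn_idx])
    print(roc_auc_score(target[val_idx], lr.predict_proba(all_idf[val_idx])[:, 1]))

The results are: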

Number of words in common :  444
0.6295718202729551
0.6268219112785893
0.6270581079920985
0.6280726585488302
0.6244650004722636

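Even with a much smaller vocabulary, there is still a significant difference between train and test. I think this is because the datasets cover too wide a range of topics for only about 300,000 samples in total.

Keep in mind that tokenization matters a great deal. To demonstrate, I will use CountVectorizer with two different tokenization methods: a very simple tokenizer and TweetTokenizer from the nltk package.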
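
As a rough illustration of why the tokenizer changes the features, the sketch below tokenizes one made-up comment in two ways using only the standard library. The whitespace split is merely a stand-in for a tokenizer that keeps emoticons, hashtags and mentions; it is not TweetTokenizer, and the sample text is hypothetical.

```python
import re

text = "LOL :) you're #1, @admin!!"

# Simple tokenizer: keep only runs of word characters, dropping all punctuation.
simple_tokens = re.findall(r"\w+", text)

# Whitespace tokenizer: punctuation, emoticons, hashtags and mentions stay attached.
whitespace_tokens = text.split()

print(simple_tokens)
print(whitespace_tokens)
```

The same comment yields quite different tokens, so the resulting vocabularies, and with them the train/test separability, can differ as well.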

simple Tokenization

from nltk.tokenize import TweetTokenizer
vectorizer = CountVectorizer(strip_accents='unicode',tokenizer=lambda x: regex.findall(r'[^\p{P}\W]+', x),analyzer='word',token_pattern=None,stop_words='english',ngram_range=(1, 2), max_features=500,min_df=3
)
trn_idf = vectorizer.fit_transform(trn.comment_text)
trn_vocab = vectorizer.vocabulary_
sub_idf = vectorizer.fit_transform(sub.comment_text)
sub_vocab = vectorizer.vocabulary_
trn_words = [word for word in trn_vocab.keys()]
sub_words = [word for word in sub_vocab.keys()]
print("Number of words in common : ", len(set(trn_words).intersection(set(sub_words))))
vectorizer = CountVectorizer(strip_accents='unicode',tokenizer=lambda x: regex.findall(r'[^\p{P}\W]+', x),analyzer='word',token_pattern=None,stop_words='english',ngram_range=(1, 2), vocabulary=list(set(trn_words).intersection(set(sub_words)))
)
all_idf = vectorizer.fit_transform(pd.concat([trn.comment_text, sub.comment_text], axis=0))
# Create target where all train samples are ones and all test samples are zeros
target = np.hstack((np.ones(trn.shape[0]), np.zeros(sub.shape[0])))
# Shuffle samples to mix zeros and ones
idx = np.arange(all_idf.shape[0])
np.random.seed(1)
np.random.shuffle(idx)
all_idf = all_idf[idx]
target = target[idx]
# Train a Logistic Regression
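# shuffle and random_state are passed by keyword, as recent scikit-learn versions require
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for trn_idx, val_idx in folds.split(all_idf, target):
    lr = LogisticRegression()
    lr.fit(all_idf[trn_idx], target[trn_idx])
    print(roc_auc_score(target[val_idx], lr.predict_proba(all_idf[val_idx])[:, 1]))

The results are: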

Number of words in common :  440
0.6063327137520516
0.5999916796025004
0.6011318222132256
0.5996101413728843
0.5993641245063593

TweetTokenizer

from nltk.tokenize import TweetTokenizer
vectorizer = CountVectorizer(strip_accents='unicode',tokenizer=TweetTokenizer().tokenize,analyzer='word',token_pattern=None,stop_words='english',ngram_range=(1, 2), max_features=500,min_df=3
)
trn_idf = vectorizer.fit_transform(trn.comment_text)
trn_vocab = vectorizer.vocabulary_
sub_idf = vectorizer.fit_transform(sub.comment_text)
sub_vocab = vectorizer.vocabulary_
trn_words = [word for word in trn_vocab.keys()]
sub_words = [word for word in sub_vocab.keys()]
print("Number of words in common : ", len(set(trn_words).intersection(set(sub_words))))
vectorizer = CountVectorizer(strip_accents='unicode',tokenizer=TweetTokenizer().tokenize,analyzer='word',token_pattern=None,stop_words='english',ngram_range=(1, 2), vocabulary=list(set(trn_words).intersection(set(sub_words)))
)
all_idf = vectorizer.fit_transform(pd.concat([trn.comment_text, sub.comment_text], axis=0))
# Create target where all train samples are ones and all test samples are zeros
target = np.hstack((np.ones(trn.shape[0]), np.zeros(sub.shape[0])))
# Shuffle samples to mix zeros and ones
idx = np.arange(all_idf.shape[0])
np.random.seed(1)
np.random.shuffle(idx)
all_idf = all_idf[idx]
target = target[idx]
# Train a Logistic Regression
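# shuffle and random_state are passed by keyword, as recent scikit-learn versions require
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for trn_idx, val_idx in folds.split(all_idf, target):
    lr = LogisticRegression()
    lr.fit(all_idf[trn_idx], target[trn_idx])
    print(roc_auc_score(target[val_idx], lr.predict_proba(all_idf[val_idx])[:, 1]))

The results are: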

Number of words in common :  425
0.808150062507659
0.8092440866192762
/opt/conda/lib/python3.6/site-packages/sklearn/linear_model/base.py:340: RuntimeWarning: overflow encountered in exp
  np.exp(prob, prob)
0.8100851554254078
0.8080836017812789
0.8085904163543269

We can now try to build a valid set. The snippet below assumes that train holds the combined rows with an is_test column (0 for train rows, 1 for test rows) and that predictions holds each row's predicted probability of being a test row:

# separate train rows which have been misclassified as test and use them as validation
train["predictions"] = predictions
predictions_argsort = predictions.argsort()
train_sorted = train.iloc[predictions_argsort]
# select only the train set because we need to find train rows which have been misclassified as test set and use them for validation
train_sorted = train_sorted.loc[train_sorted.is_test == 0]
# Why did I choose 0.7 as threshold? just a hunch, but you should try different thresholds i.e 0.6, 0.8 and see the difference in validation score and please report back. :)
train_as_test = train_sorted.loc[train_sorted.predictions > 0.7]
#save the indices of the misclassified train rows to use as validation set
adversarial_set_ids = train_as_test.index.values
adversarial_set = pd.DataFrame(adversarial_set_ids, columns=['adversial_set_ids'])
#save adversarial set index
adversarial_set.to_csv('adversarial_set_ids.csv', index=False)
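
Once the ids are saved, they can be used to split the original train data into a training part and an adversarial validation part. A minimal sketch, assuming the ids are index labels of the train frame; the small DataFrame and the id values are made up, and the in-memory adversarial_set stands in for reading adversarial_set_ids.csv back from disk.

```python
import pandas as pd

# Hypothetical train frame; in practice this would be the original train data.
train = pd.DataFrame({
    "comment_text": list("abcdefgh"),
    "toxic": [0, 1, 0, 0, 1, 0, 1, 0],
})

# Stand-in for pd.read_csv('adversarial_set_ids.csv'), keeping the column name used above.
adversarial_set = pd.DataFrame({"adversial_set_ids": [1, 4, 6]})
valid_ids = adversarial_set["adversial_set_ids"].values

valid_part = train.loc[valid_ids]         # rows that look most like test data
train_part = train.drop(index=valid_ids)  # everything else is used for fitting

print(len(train_part), len(valid_part))
```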