Pseudo-Labeling: A Simple Semi-Supervised Learning Method

Source: https://datawhatnow.com/pseudo-labeling-semi-supervised-learning/

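Data is the foundation of every machine learning project; it is the one thing you cannot do without. In this post, I will show a simple semi-supervised learning method called pseudo-labeling, which can use unlabeled data to improve the performance of your favorite machine learning models.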

Pseudo-labeling

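To train a machine learning model with supervised learning, the data has to be labeled. Does that mean unlabeled data is useless for supervised tasks such as classification and regression? Certainly not! Aside from using the extra data for analytic purposes, we can even use it to help train our model with semi-supervised learning, which combines unlabeled and labeled data for model training.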

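The main idea is simple. First, train the model on the labeled data, then use the trained model to predict labels for the unlabeled data, which creates the pseudo-labels. Next, combine the labeled data and the newly pseudo-labeled data into a new dataset and use it to train the model.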
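The three steps above can be sketched in a few lines. This is a minimal illustration, not the code used later in the post: the synthetic data, the RandomForestRegressor model, and all variable names are assumptions chosen only to keep the example self-contained.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)

# Synthetic regression data (illustrative only): y = 3*x0 - 2*x1 + noise
X_all = rng.normal(size=(300, 2))
y_all = 3.0 * X_all[:, 0] - 2.0 * X_all[:, 1] + rng.normal(scale=0.1, size=300)

# Pretend only the first 100 rows have labels
X_labeled, y_labeled = X_all[:100], y_all[:100]
X_unlabeled = X_all[100:]

# Step 1: train on the labeled data only
model = RandomForestRegressor(n_estimators=20, random_state=0)
model.fit(X_labeled, y_labeled)

# Step 2: predict on the unlabeled data to create pseudo-labels
pseudo_labels = model.predict(X_unlabeled)

# Step 3: merge labeled and pseudo-labeled data and retrain on the result
X_combined = np.vstack([X_labeled, X_unlabeled])
y_combined = np.concatenate([y_labeled, pseudo_labels])
final_model = RandomForestRegressor(n_estimators=20, random_state=0)
final_model.fit(X_combined, y_combined)

print(X_combined.shape, y_combined.shape)  # (300, 2) (300,)
```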

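I was inspired to try this method after it was mentioned in the fast.ai MOOC (original paper). Although it was mentioned in the context of deep learning (online algorithms), I tried it on traditional machine learning models and got some improvements.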

Data preprocessing and exploration

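In competitions such as those on Kaggle, competitors receive a training set (labeled data) and a test set (unlabeled data). This is a good place to try out pseudo-labeling. The dataset we will use comes from the Mercedes-Benz Greener Manufacturing competition, where the goal is to predict the duration of testing based on the car's features (a regression task). As always, all the code with additional descriptions can be found in this notebook.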

import pandas as pd
# Load the data
train = pd.read_csv('input/train.csv')
test = pd.read_csv('input/test.csv')
print(train.shape, test.shape)
# (4209, 378) (4209, 377)

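As the shapes show, the training dataset is not ideal: it has few data points (4209) and many features (376). To improve the dataset, we should reduce the number of features and, if possible, increase the number of data points. I covered feature importance (feature reduction) in a previous blog post, so that topic is skipped here; the main focus of this post is increasing the number of data points with pseudo-labeling. The dataset is small and its ratio of labeled to unlabeled data is 1:1, which makes it a good fit for pseudo-labeling.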

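The table below shows a subset of the entire training dataset. Features X0-X8 are categorical variables, and we have to transform them into a form the model can use: numerical values.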

This is done with scikit-learn's LabelEncoder class.

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
features = train.columns[2:]
for column_name in features:
    label_encoder = LabelEncoder()
    # Get the column values
    train_column_values = list(train[column_name].values)
    test_column_values = list(test[column_name].values)
    # Fit the label encoder
    label_encoder.fit(train_column_values + test_column_values)
    # Transform the feature
    train[column_name] = label_encoder.transform(train_column_values)
    test[column_name] = label_encoder.transform(test_column_values)
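To see why the encoder is fitted on the training and test values together, here is a small sketch on made-up data. The toy frames and their values are hypothetical; the category "c" appears only in the test frame, and an encoder fitted on the training column alone would raise a ValueError when asked to transform it.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical): "c" occurs only in the test frame
toy_train = pd.DataFrame({"X0": ["a", "b", "a"]})
toy_test = pd.DataFrame({"X0": ["c", "a", "b"]})

# Fit on the union of both columns so every category gets a code
encoder = LabelEncoder()
encoder.fit(list(toy_train["X0"]) + list(toy_test["X0"]))

toy_train["X0"] = encoder.transform(toy_train["X0"])
toy_test["X0"] = encoder.transform(toy_test["X0"])

print(encoder.classes_.tolist())  # ['a', 'b', 'c']
print(toy_train["X0"].tolist())   # [0, 1, 0]
print(toy_test["X0"].tolist())    # [2, 0, 1]
```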

With the categorical features encoded, the data is ready for our machine learning model.

Implementing pseudo-labeling with Python and scikit-learn

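Let's write a function that creates an "augmented training set" consisting of pseudo-labeled and labeled data. Its arguments are the model, the training and test set information (data and features), and the parameter sample_rate. sample_rate controls the percentage of pseudo-labeled data that is mixed with the true labeled data. Setting sample_rate to 0.0 means the model uses only the true labeled data, while a sample_rate of 0.5 means the model uses all of the true labeled data plus half of the pseudo-labeled data. In either case, the model uses all of the true labeled data.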

from sklearn.utils import shuffle

def create_augmented_train(X, y, model, test, features, target, sample_rate):
    '''
    Create and return the augmented_train set that consists
    of pseudo-labeled and labeled data.
    '''
    num_of_samples = int(len(test) * sample_rate)
    # Train the model and create the pseudo-labels
    model.fit(X, y)
    pseudo_labels = model.predict(test[features])
    # Add the pseudo-labels to the test set
    augmented_test = test.copy(deep=True)
    augmented_test[target] = pseudo_labels
    # Take a subset of the test set with pseudo-labels and append it onto
    # the training set
    sampled_test = augmented_test.sample(n=num_of_samples)
    temp_train = pd.concat([X, y], axis=1)
    augmented_train = pd.concat([sampled_test, temp_train])
    # Shuffle the augmented dataset and return it
    return shuffle(augmented_train)
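As a quick check of what sample_rate does to the dataset size, the arithmetic in the function can be reproduced on its own. The helper name augmented_size is hypothetical; the row counts are the 4209 labeled and 4209 unlabeled rows of this dataset.

```python
def augmented_size(n_labeled, n_unlabeled, sample_rate):
    """Number of rows in the augmented training set (hypothetical helper)."""
    # Same truncation as int(len(test) * sample_rate) in the function above
    n_pseudo = int(n_unlabeled * sample_rate)
    # All labeled rows are always kept
    return n_labeled + n_pseudo

for rate in (0.0, 0.2, 0.5, 1.0):
    print(rate, augmented_size(4209, 4209, rate))
```

With a sample_rate of 0.2, for instance, 841 pseudo-labeled rows are added to the 4209 labeled ones, for 5050 rows in total.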

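We also need a fit method, a method that trains the model, which takes the augmented training set and trains the model on it. That is another function, and the one we wrote earlier already has a lot of arguments. This is a good opportunity to create a class, which improves cohesion and makes the code cleaner, and to put the methods into that class. The class will be called PseudoLabeler. It takes a scikit-learn model and trains it on the augmented training set. scikit-learn lets us create our own regressors, but we have to follow its library conventions.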

from sklearn.utils import shuffle
from sklearn.base import BaseEstimator, RegressorMixin
class PseudoLabeler(BaseEstimator, RegressorMixin):

    def __init__(self, model, test, features, target, sample_rate=0.2, seed=42):
        self.sample_rate = sample_rate
        self.seed = seed
        self.model = model
        self.model.seed = seed
        self.test = test
        self.features = features
        self.target = target

    def get_params(self, deep=True):
        return {
            "sample_rate": self.sample_rate,
            "seed": self.seed,
            "model": self.model,
            "test": self.test,
            "features": self.features,
            "target": self.target
        }

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self

    def fit(self, X, y):
        if self.sample_rate > 0.0:
            augmented_train = self.__create_augmented_train(X, y)
            self.model.fit(
                augmented_train[self.features],
                augmented_train[self.target]
            )
        else:
            self.model.fit(X, y)
        return self

    def __create_augmented_train(self, X, y):
        # Use the test set stored on the instance, not a global variable
        num_of_samples = int(len(self.test) * self.sample_rate)
        # Train the model and create the pseudo-labels
        self.model.fit(X, y)
        pseudo_labels = self.model.predict(self.test[self.features])
        # Add the pseudo-labels to the test set
        augmented_test = self.test.copy(deep=True)
        augmented_test[self.target] = pseudo_labels
        # Take a subset of the test set with pseudo-labels and append it onto
        # the training set
        sampled_test = augmented_test.sample(n=num_of_samples)
        temp_train = pd.concat([X, y], axis=1)
        augmented_train = pd.concat([sampled_test, temp_train])
        return shuffle(augmented_train)

    def predict(self, X):
        return self.model.predict(X)

    def get_model_name(self):
        return self.model.__class__.__name__

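Besides the fit and __create_augmented_train methods, scikit-learn needs a few smaller methods in order to use this class as a regressor (you can read more about this in the official documentation). Now that we have a scikit-learn class for pseudo-labeling, let's look at an example.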
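As a side note on those smaller methods, the sketch below shows how scikit-learn's parameter contract behaves on a deliberately trivial regressor. MeanRegressor and its offset parameter are hypothetical and unrelated to the post's code. Here get_params and set_params are inherited from BaseEstimator, which derives them from the __init__ signature as long as each argument is stored under an attribute of the same name.

```python
from sklearn.base import BaseEstimator, RegressorMixin, clone

class MeanRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical toy regressor: predicts the training mean plus an offset."""

    def __init__(self, offset=0.0):
        # Store constructor arguments unchanged, under the same names
        self.offset = offset

    def fit(self, X, y):
        self.mean_ = float(sum(y)) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ + self.offset for _ in X]

reg = MeanRegressor(offset=1.0)
params = reg.get_params()     # inherited from BaseEstimator
reg.set_params(offset=2.0)    # also inherited
reg_copy = clone(reg)         # unfitted copy with the same parameters

print(params)                 # {'offset': 1.0}
print(reg_copy.get_params())  # {'offset': 2.0}
```

Utilities such as clone and cross-validation rely on this contract, which is why the estimator has to expose its constructor arguments through get_params.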

from xgboost import XGBRegressor

target = 'y'
# Preprocess the data
X_train, X_test = train[features], test[features]
y_train = train[target]
# Create the PseudoLabeler with XGBRegressor as the base regressor
model = PseudoLabeler(
    XGBRegressor(nthread=1),
    test,
    features,
    target
)
# Train the model and use it to predict
model.fit(X_train, y_train)
model.predict(X_train)

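In this example, the PseudoLabeler class uses XGBRegressor to do the regression for pseudo-labeling. The default value of sample_rate is 0.2, which means the PseudoLabeler uses 20% of the unlabeled dataset.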

Results

To test the PseudoLabeler, I used XGBoost (when the competition was live, I was getting my best results with XGBoost). To evaluate the model, we compare raw XGBoost against pseudo-labeled XGBoost using eight-fold cross-validation; with about 4k data points, each fold is small, at around 500 data points. The evaluation metric is the R2 score, the official metric of the competition.
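The shape of such a comparison can be sketched as follows. This is not the notebook's evaluation code: to stay self-contained it uses synthetic data, swaps XGBoost for scikit-learn's Ridge, and defines a hypothetical, compact TinyPseudoLabeler that works on NumPy arrays. The scores it prints say nothing about the competition data.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

class TinyPseudoLabeler(RegressorMixin, BaseEstimator):
    """Hypothetical compact pseudo-labeler for NumPy arrays."""

    def __init__(self, model, unlabeled, sample_rate=0.2, seed=42):
        self.model = model
        self.unlabeled = unlabeled
        self.sample_rate = sample_rate
        self.seed = seed

    def fit(self, X, y):
        # First pass: train on the labeled data only
        self.model_ = clone(self.model).fit(X, y)
        n_pseudo = int(len(self.unlabeled) * self.sample_rate)
        if n_pseudo > 0:
            # Sample unlabeled rows and pseudo-label them
            rng = np.random.RandomState(self.seed)
            idx = rng.choice(len(self.unlabeled), size=n_pseudo, replace=False)
            X_pseudo = self.unlabeled[idx]
            y_pseudo = self.model_.predict(X_pseudo)
            # Second pass: retrain on labeled plus pseudo-labeled rows
            self.model_ = clone(self.model).fit(
                np.vstack([X, X_pseudo]), np.concatenate([y, y_pseudo])
            )
        return self

    def predict(self, X):
        return self.model_.predict(X)

# Synthetic stand-in for the competition data (assumption)
X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=0)
X_labeled, y_labeled = X[:200], y[:200]
X_unlabeled = X[200:]

# Eight-fold cross-validation with the R2 score for both variants
raw_scores = cross_val_score(Ridge(), X_labeled, y_labeled, cv=8, scoring="r2")
pseudo_scores = cross_val_score(
    TinyPseudoLabeler(Ridge(), X_unlabeled), X_labeled, y_labeled, cv=8, scoring="r2"
)

print("raw:    mean=%.4f std=%.4f" % (raw_scores.mean(), raw_scores.std()))
print("pseudo: mean=%.4f std=%.4f" % (pseudo_scores.mean(), pseudo_scores.std()))
```

Comparing both the mean and the standard deviation across folds mirrors the way the results are read in this section.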

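The PseudoLabeler has a slightly higher mean score and a lower deviation, which makes it (slightly) better than the raw model. I did a more detailed analysis in the notebook. The performance gain may look small, but remember that this is a Kaggle competition, where every increase in score can move you higher on the leaderboard. The complexity introduced here is not large (~70 LOC), but the problem and the model in this example are very simple; keep that in mind when trying to apply this to more complex problems or domains.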

Conclusion

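Pseudo-labeling lets us use unlabeled data when training machine learning models. It sounds like a powerful technique, and yes, it often improves model performance. However, it can be hard to tune and get working, and even when it works it brings only a slight performance gain. In competitions like Kaggle, I believe this technique is useful, because even a slight increase in score can usually give you a boost on the leaderboard. Still, I would think twice before using it in a production environment, because it seems to bring extra complexity without a significant performance gain, which may not be what you want.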