基于随机优化算法的特征选择

【翻译自： Feature Selection with Stochastic Optimization Algorithms】

【说明：Jason Brownlee PhD大神的文章个人很喜欢，所以闲暇时间里会做一点翻译和学习实践的工作，这里是相应工作的实践记录，希望能帮到有需要的人！】

通常，可以通过从训练数据集中删除输入特征（列）来开发更简单，性能更好的机器学习模型。这称为特征选择，可以使用许多不同类型的算法。可以将特征选择问题框架为优化问题。在输入要素很少的情况下，可以评估输入要素的所有可能组合，并确定地找到最佳子集。在输入特征数量众多的情况下，可以使用随机优化算法来探索搜索空间并找到特征的有效子集。

在本教程中，您将发现如何在机器学习中使用优化算法进行特征选择。完成本教程后，您将知道：

特征选择的问题可以广义地定义为优化问题。
如何枚举数据集输入要素的所有可能子集。
如何应用随机优化来选择输入要素的最佳子集。

教程概述

本教程分为三个部分：他们是：

优化特征选择
枚举所有功能子集
优化功能子集
优化特征选择

特征选择是在开发预测模型时减少输入变量数量的过程。

希望减少输入变量的数量，以减少建模的计算成本，并且在某些情况下，还需要改善模型的性能。尽管可以将特征选择算法大致分为两种主要类型，但它们有很多不同的类型：包装器和过滤器方法。包装器特征选择方法会创建许多具有不同输入特征子集的模型，并根据性能指标选择那些导致最佳性能模型的特征。这些方法与变量类型无关，尽管它们在计算上可能很昂贵。 RFE是包装功能选择方法的一个很好的例子。过滤器特征选择方法使用统计技术来评估每个输入变量和目标变量之间的关系，这些得分将用作选择（过滤）将在模型中使用的那些输入变量的基础。

包装特征选择：搜索性能良好的特征子集。
过滤特征选择：根据特征子集与目标的关系选择特征子集。

流行的包装方法是递归特征消除算法（RFE）。RFE的工作方式是：从训练数据集中的所有要素开始搜索要素的子集，然后成功删除要素，直到保留所需数量为止。这可以通过拟合模型核心中使用的给定机器学习算法，按重要性对特征进行排序，丢弃最不重要的特征以及重新拟合模型来实现。重复此过程，直到保留指定数量的功能。

下面是两篇相关的博文资料，感兴趣的可以一并阅读：

如何使用机器学习选择特征

递归特征消除（RFE）用于Python中的特征选择

包装器特征选择的问题可被视为优化问题。也就是说，找到可带来最佳模型性能的输入要素子集。RFE是一种系统解决此问题的方法，尽管它可能会受到众多功能的限制。当特征的数量很大时，另一种方法是使用随机优化算法，例如随机爬山算法。当特征的数量相对较小时，可能会枚举所有可能的特征子集。

少量输入变量：枚举要素的所有可能子集。
许多输入要素：随机优化算法，用于查找要素的良好子集。

既然我们熟悉了将特征选择作为一个优化问题来探讨的想法，那么让我们看一下如何枚举所有可能的特征子集。

枚举所有功能子集

当输入变量的数量相对较小且模型评估相对较快时，则可能会枚举输入变量的所有可能子集。这意味着在给定每个可能的唯一输入变量组的情况下，使用测试工具评估模型的性能。我们将通过一个可行的示例探索如何做到这一点。首先，让我们定义一个小的二进制分类数据集，其中包含很少的输入功能。我们可以使用make_classification（）函数定义一个具有五个输入变量的数据集，其中两个是信息变量，并且有1,000行。下面的示例定义了数据集并总结了其形状。

# define a small classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=3, random_state=1)
# summarize the shape of the dataset
print(X.shape, y.shape)

运行示例将创建数据集并确认其具有所需的形状。

(1000, 5) (1000,)

接下来，我们可以使用对整个数据集评估的模型来建立性能基准。我们将使用DecisionTreeClassifier作为模型，因为它的性能对输入变量的选择非常敏感。我们将使用良好的实践来评估模型，例如具有三个重复和10折的重复分层k折交叉验证。下面列出了完整的示例。

# evaluate a decision tree on the entire small dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=3, n_informative=2, n_redundant=1, random_state=1)
# define model
model = DecisionTreeClassifier()
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report result
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

运行示例将评估整个数据集上的决策树，并报告均值和标准差分类准确性。

注意：由于算法或评估程序的随机性，或者数值精度不同，您的结果可能会有所不同。考虑运行该示例几次并比较平均结果。在这种情况下，我们可以看到该模型实现了约80.5％的精度。

Mean Accuracy: 0.805 (0.030)

接下来，我们可以尝试通过使用输入功能的子集来改善模型性能。首先，我们必须选择一种表示方式进行枚举。在这种情况下，我们将枚举一组布尔值，每个输入要素都有一个值：如果要使用该要素，则为True；如果不将该要素用作输入，则为False。例如，对于五个输入要素，序列[True，True，True，True，True]将使用所有输入要素，而[True，False，False，False，False，False]仅将第一个输入要素用作输入。我们可以使用product（）Python函数枚举length = 5的所有布尔值序列。我们必须指定有效值[True，False]和序列中的步数，该步数等于输入变量的数量。该函数返回一个可迭代的函数，我们可以直接为每个序列枚举。

# determine the number of columns
n_cols = X.shape[1]
best_subset, best_score = None, 0.0
# enumerate all combinations of input features
for subset in product([True, False], repeat=n_cols):

对于给定的布尔值序列，我们可以对其进行枚举并将其转换为该序列中每个True的列索引序列。

# convert into column indexes
ix = [i for i, x in enumerate(subset) if x]

如果序列没有列索引（对于所有False值），那么我们可以跳过该序列。

# check for now column (all False)
if len(ix) == 0:continue

然后，我们可以使用列索引来选择数据集中的列。

# select columns
X_new = X[:, ix]

然后可以像以前一样评估数据集的此子集。

# define model
model = DecisionTreeClassifier()
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X_new, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize scores
result = mean(scores)

如果模型的准确性优于到目前为止找到的最佳序列，则可以存储它。

# check if it is better than the best so far
if best_score is None or result >= best_score:# better resultbest_subset, best_score = ix, result

就是这样。结合在一起，下面列出了通过枚举所有可能的特征子集进行特征选择的完整示例。

# feature selection by enumerating all possible subsets of features
from itertools import product
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=3, random_state=1)
# determine the number of columns
n_cols = X.shape[1]
best_subset, best_score = None, 0.0
# enumerate all combinations of input features
for subset in product([True, False], repeat=n_cols):# convert into column indexesix = [i for i, x in enumerate(subset) if x]# check for now column (all False)if len(ix) == 0:continue# select columnsX_new = X[:, ix]# define modelmodel = DecisionTreeClassifier()# define evaluation procedurecv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)# evaluate modelscores = cross_val_score(model, X_new, y, scoring='accuracy', cv=cv, n_jobs=-1)# summarize scoresresult = mean(scores)# report progressprint('>f(%s) = %f ' % (ix, result))# check if it is better than the best so farif best_score is None or result >= best_score:# better resultbest_subset, best_score = ix, result
# report best
print('Done!')
print('f(%s) = %f' % (best_subset, best_score))

运行示例将报告所考虑特征的每个子集的模型的平均分类精度。然后在运行结束时报告最佳子集。

注意：由于算法或评估程序的随机性，或者数值精度不同，您的结果可能会有所不同。考虑运行该示例几次并比较平均结果。在这种情况下，我们可以看到要素的最佳子集涉及索引[2、3、4]处的要素，这些要素的平均分类精度约为83.0％，这比以前使用所有输入要素报告的结果要好。

>f([0, 1, 2, 3, 4]) = 0.813667
>f([0, 1, 2, 3]) = 0.827667
>f([0, 1, 2, 4]) = 0.815333
>f([0, 1, 2]) = 0.824000
>f([0, 1, 3, 4]) = 0.821333
>f([0, 1, 3]) = 0.825667
>f([0, 1, 4]) = 0.807333
>f([0, 1]) = 0.817667
>f([0, 2, 3, 4]) = 0.830333
>f([0, 2, 3]) = 0.819000
>f([0, 2, 4]) = 0.828000
>f([0, 2]) = 0.818333
>f([0, 3, 4]) = 0.830333
>f([0, 3]) = 0.821333
>f([0, 4]) = 0.816000
>f([0]) = 0.639333
>f([1, 2, 3, 4]) = 0.823667
>f([1, 2, 3]) = 0.821667
>f([1, 2, 4]) = 0.823333
>f([1, 2]) = 0.818667
>f([1, 3, 4]) = 0.818000
>f([1, 3]) = 0.820667
>f([1, 4]) = 0.809000
>f([1]) = 0.797000
>f([2, 3, 4]) = 0.827667
>f([2, 3]) = 0.755000
>f([2, 4]) = 0.827000
>f([2]) = 0.516667
>f([3, 4]) = 0.824000
>f([3]) = 0.514333
>f([4]) = 0.777667
Done!
f([0, 3, 4]) = 0.830333

现在，我们知道了如何枚举所有可能的特征子集，让我们看一下如何使用随机优化算法选择特征子集。

优化特征子集

我们可以将随机优化算法应用于输入特征子集的搜索空间。首先，让我们定义一个更大的问题，该问题具有更多功能，这会使模型评估速度太慢，并且搜索空间太大，无法枚举所有子集。我们将定义一个具有10,000行和500个输入要素的分类问题，其中10个是相关的，其余490个是多余的。

# define a large classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_redundant=490, random_state=1)
# summarize the shape of the dataset
print(X.shape, y.shape)

运行示例将创建数据集并确认其具有所需的形状。

(10000, 500) (10000,)

我们可以通过评估具有所有输入特征的数据集上的模型来建立性能基准。由于数据集很大且模型评估缓慢，因此我们将修改模型的评估，以使用3倍交叉验证，例如更少的褶皱，没有重复。下面列出了完整的示例。

# evaluate a decision tree on the entire larger dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
# define dataset
X, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_redundant=490, random_state=1)
# define model
model = DecisionTreeClassifier()
# define evaluation procedure
cv = StratifiedKFold(n_splits=3)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report result
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

运行示例将评估整个数据集上的决策树，并报告均值和标准差分类准确性。

注意：由于算法或评估程序的随机性，或者数值精度不同，您的结果可能会有所不同。考虑运行该示例几次并比较平均结果。在这种情况下，我们可以看到该模型的准确度约为91.3％。这提供了使用功能选择我们预期会胜过的基准。

Mean Accuracy: 0.913 (0.001)

我们将使用简单的随机爬山算法作为优化算法。首先，我们必须定义目标函数。它将数据集和要素子集用作输入，并返回估计的模型准确度，范围从0（最差）到1（最佳）。这是一个最大化的优化问题。这个目标函数只是对上一节中序列和模型评估步骤的解码。下面的Objective（）函数实现了此目的，并返回了得分和用于帮助报告的列的已解码子集。

# objective function
def objective(X, y, subset):# convert into column indexesix = [i for i, x in enumerate(subset) if x]# check for now column (all False)if len(ix) == 0:return 0.0# select columnsX_new = X[:, ix]# define modelmodel = DecisionTreeClassifier()# evaluate modelscores = cross_val_score(model, X_new, y, scoring='accuracy', cv=3, n_jobs=-1)# summarize scoresresult = mean(scores)return result, ix

我们还需要一个可以在搜索空间中迈出一步的函数。给定一个现有的解决方案，它必须对其进行修改并返回一个新的解决方案。在这种情况下，我们将通过随机翻转子序列中列的包含/排除来实现此目的。序列中的每个位置都将被独立考虑，并且在翻转的概率为超参数的情况下，概率将被翻转。下面的mutate（）函数在给定的候选解决方案（布尔序列）和突变超参数的情况下实现了这一点，创建并返回了修改后的解决方案（搜索空间中的步骤）。p_mutate值越大（在0到1的范围内），搜索空间中的步长越大。

# mutation operator
def mutate(solution, p_mutate):# make a copychild = solution.copy()for i in range(len(child)):# check for a mutationif rand() < p_mutate:# flip the inclusionchild[i] = not child[i]return child

现在，我们可以实现爬山算法。初始解决方案是随机生成的序列，然后对其进行评估。

# generate an initial point
solution = choice([True, False], size=X.shape[1])
# evaluate the initial point
solution_eval, ix = objective(X, y, solution)

然后，我们循环进行固定次数的迭代，创建当前解决方案的变异版本，对其进行评估，并在分数更高时保存它们。

# run the hill climb
for i in range(n_iter):# take a stepcandidate = mutate(solution, p_mutate)# evaluate candidate pointcandidate_eval, ix = objective(X, y, candidate)# check if we should keep the new pointif candidate_eval >= solution_eval:# store the new pointsolution, solution_eval = candidate, candidate_eval# report progressprint('>%d f(%s) = %f' % (i+1, len(ix), solution_eval))

下面的hillclimbing（）函数以数据集，目标函数和超参数作为参数来实现此目的，并返回数据集列的最佳子集和模型的估计性能。

# hill climbing local search algorithm
def hillclimbing(X, y, objective, n_iter, p_mutate):# generate an initial pointsolution = choice([True, False], size=X.shape[1])# evaluate the initial pointsolution_eval, ix = objective(X, y, solution)# run the hill climbfor i in range(n_iter):# take a stepcandidate = mutate(solution, p_mutate)# evaluate candidate pointcandidate_eval, ix = objective(X, y, candidate)# check if we should keep the new pointif candidate_eval >= solution_eval:# store the new pointsolution, solution_eval = candidate, candidate_eval# report progressprint('>%d f(%s) = %f' % (i+1, len(ix), solution_eval))return solution, solution_eval

然后，我们可以调用此函数并传入我们的综合数据集以对特征选择进行优化。在这种情况下，我们将对算法进行100次迭代，并对给定突变的序列进行约五次翻转，这是非常保守的。

# define dataset
X, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_redundant=490, random_state=1)
# define the total iterations
n_iter = 100
# probability of including/excluding a column
p_mut = 10.0 / 500.0
# perform the hill climbing search
subset, score = hillclimbing(X, y, objective, n_iter, p_mut)

在运行结束时，我们将布尔序列转换为列索引（因此，如果需要，我们可以拟合最终模型），并报告最佳子序列的性能。

# convert into column indexes
ix = [i for i, x in enumerate(subset) if x]
print('Done!')
print('Best: f(%d) = %f' % (len(ix), score))

结合在一起，下面列出了完整的示例。

# stochastic optimization for feature selection
from numpy import mean
from numpy.random import rand
from numpy.random import choice
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier# objective function
def objective(X, y, subset):# convert into column indexesix = [i for i, x in enumerate(subset) if x]# check for now column (all False)if len(ix) == 0:return 0.0# select columnsX_new = X[:, ix]# define modelmodel = DecisionTreeClassifier()# evaluate modelscores = cross_val_score(model, X_new, y, scoring='accuracy', cv=3, n_jobs=-1)# summarize scoresresult = mean(scores)return result, ix# mutation operator
def mutate(solution, p_mutate):# make a copychild = solution.copy()for i in range(len(child)):# check for a mutationif rand() < p_mutate:# flip the inclusionchild[i] = not child[i]return child# hill climbing local search algorithm
def hillclimbing(X, y, objective, n_iter, p_mutate):# generate an initial pointsolution = choice([True, False], size=X.shape[1])# evaluate the initial pointsolution_eval, ix = objective(X, y, solution)# run the hill climbfor i in range(n_iter):# take a stepcandidate = mutate(solution, p_mutate)# evaluate candidate pointcandidate_eval, ix = objective(X, y, candidate)# check if we should keep the new pointif candidate_eval >= solution_eval:# store the new pointsolution, solution_eval = candidate, candidate_eval# report progressprint('>%d f(%s) = %f' % (i+1, len(ix), solution_eval))return solution, solution_eval# define dataset
X, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_redundant=490, random_state=1)
# define the total iterations
n_iter = 100
# probability of including/excluding a column
p_mut = 10.0 / 500.0
# perform the hill climbing search
subset, score = hillclimbing(X, y, objective, n_iter, p_mut)
# convert into column indexes
ix = [i for i, x in enumerate(subset) if x]
print('Done!')
print('Best: f(%d) = %f' % (len(ix), score))

运行示例将报告所考虑特征的每个子集的模型的平均分类精度。然后在运行结束时报告最佳子集。

注意：由于算法或评估程序的随机性，或者数值精度不同，您的结果可能会有所不同。考虑运行该示例几次并比较平均结果。在这种情况下，我们可以看到，通过239个特征的子集和大约91.8％的分类精度，可以实现最佳性能。这比在所有输入要素上评估的模型要好。尽管结果更好，但我们知道我们可以做得更好，可能是优化优化算法的超参数，或者是使用替代优化算法。

>80 f(240) = 0.918099
>81 f(236) = 0.918099
>82 f(238) = 0.918099
>83 f(236) = 0.918099
>84 f(239) = 0.918099
>85 f(240) = 0.918099
>86 f(239) = 0.918099
>87 f(245) = 0.918099
>88 f(241) = 0.918099
>89 f(239) = 0.918099
>90 f(239) = 0.918099
>91 f(241) = 0.918099
>92 f(243) = 0.918099
>93 f(245) = 0.918099
>94 f(239) = 0.918099
>95 f(245) = 0.918099
>96 f(244) = 0.918099
>97 f(242) = 0.918099
>98 f(238) = 0.918099
>99 f(248) = 0.918099
>100 f(238) = 0.918099
Done!
Best: f(239) = 0.918099

最后给出来一些相关的教程和可供参考学习的API链接。

接口

sklearn.datasets.make_classification API.
itertools.product API.