1. Feature Selection ------ sklearn Code

1.1 Feature Selection ------ Variance Method

VarianceThreshold removes every feature whose variance falls below a threshold; with the default threshold of 0, it drops only constant (zero-variance) features.

Ignore warnings

import warnings
warnings.filterwarnings("ignore")

# Variance method
from sklearn.feature_selection import VarianceThreshold

X = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]]
selector = VarianceThreshold()
X_new = selector.fit_transform(X)

print('X_new:\n', X_new)
print('get_params:\n', selector.get_params())
print('get_support:\n', selector.get_support())
print('inverse_transform:\n', selector.inverse_transform(X_new))

Output:

X_new:
 [[2 0]
 [1 4]
 [1 1]]
get_params:
 {'threshold': 0.0}
get_support:
 [False  True  True False]
inverse_transform:
 [[0 2 0 0]
 [0 1 4 0]
 [0 1 1 0]]
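As a quick extension (a small sketch added here, not from the original post): passing a nonzero threshold drops low-variance features as well, and the fitted selector exposes the per-feature variances it computed.

from sklearn.feature_selection import VarianceThreshold

X = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]]
selector = VarianceThreshold(threshold=0.6)
print(selector.fit_transform(X))  # keeps only the third column (variance ~ 2.89)
print(selector.variances_)        # per-feature variances computed during fit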

1.2 Feature Selection ------ Univariate Feature Selection (Chi-Square, F-Distribution, Mutual Information)

1.2.1 What the Chi-Square Test Does

Compute chi-squared stats between each non-negative feature and class. This score can be used to select the n_features features with the highest values for the test chi-squared statistic from X, which must contain only non-negative features.

It tests the dependence between each feature (independent variable) and the class (dependent variable); use this function to "weed out" the features most likely to be independent of the class. It is generally more informative than the variance test.

1.2.2 Time Complexity

O(n_classes * n_features)
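To see where this cost comes from: the chi2 statistic is built from an n_classes × n_features table of per-class feature sums. A minimal sketch that reproduces sklearn's chi2 on iris (an illustration of the internals, assuming nonnegative features):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2
from sklearn.preprocessing import LabelBinarizer

X, y = load_iris(return_X_y=True)

# Per-class feature sums form an (n_classes, n_features) "observed" table,
# which is where the O(n_classes * n_features) cost comes from
Y = LabelBinarizer().fit_transform(y)               # (n_samples, n_classes)
observed = Y.T @ X                                  # (n_classes, n_features)
expected = np.outer(Y.mean(axis=0), X.sum(axis=0))  # expected counts under independence
stat = ((observed - expected) ** 2 / expected).sum(axis=0)

print(np.allclose(stat, chi2(X, y)[0]))             # should print True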

1.2.3 Two Chi-Square Selector Classes in sklearn

* SelectKBest(score_func, k=...): keeps the K highest-scoring features;

* SelectPercentile(score_func, percentile=...): keeps the highest-scoring percentage of features;

1.2.4 Choosing score_func

Both classes in 1.2.3 default to the f_classif function. To use the chi-square method here, pass the chi2 function.

For regression:

f_regression: correlation; computes the correlation between each feature and the target, then derives an F value and a p-value;

mutual_info_regression: mutual information;

For classification:

chi2: chi-square test;

f_classif: analysis of variance; computes the ANOVA F value (between-group mean square / within-group mean square);

mutual_info_classif: mutual information;

Note: chi2, mutual_info_classif, and mutual_info_regression can preserve data sparsity; see the sketch below.

Note: see the scikit-learn documentation for more background on f_classif and f_regression.
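A minimal sketch of the sparsity note above (illustrative, not from the original): chi2 accepts scipy sparse input, and the selected output stays sparse.

from scipy.sparse import csr_matrix
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
X_new = SelectKBest(chi2, k=2).fit_transform(csr_matrix(X), y)
print(type(X_new), X_new.shape)   # still a sparse matrix, (150, 2)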

1.2.5 Two Attributes

scores_ : array-like, shape=(n_features,). Scores of features.

pvalues_ : array-like, shape=(n_features,). p-values of feature scores; None if score_func returned only scores.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, SelectPercentile
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif

iris = load_iris()
X, y = iris.data, iris.target
print('old_data_shape:\n', X.shape)

# SelectKBest keeps the K highest-scoring features
X_SelectKBest = SelectKBest(chi2, k=2)
X_new_SelectKBest = X_SelectKBest.fit_transform(X, y)
print('new_SelectKBest_data_shape:\n', X_new_SelectKBest.shape)

# SelectPercentile keeps the highest-scoring percentage of features
# Default score_func: f_classif
# For regression: f_regression, mutual_info_regression
# For classification: chi2, f_classif, mutual_info_classif
X_SelectPercentile = SelectPercentile(chi2, percentile=50)
X_new_SelectPercentile = X_SelectPercentile.fit_transform(X, y)
print('new_SelectPercentile_data_shape:\n', X_new_SelectPercentile.shape)

# Attributes
# scores_ : array-like, shape=(n_features,) -- scores of features
# pvalues_ : array-like, shape=(n_features,) -- p-values; None if score_func returned only scores
print('KBest scores: \n', X_SelectKBest.scores_)
print('Percentile scores:\n', X_SelectPercentile.scores_)
print('KBest pvalues_: \n', X_SelectKBest.pvalues_)
print('Percentile pvalues_:\n', X_SelectPercentile.pvalues_)

# Inspect which two columns were selected
# Comparing the outputs shows SelectKBest and SelectPercentile both picked the last two columns
print('old_data_10:\n', X[0:10, :])
print('X_new_SelectKBest:\n', X_new_SelectKBest[0:10, :])
print('X_new_SelectPercentile:\n', X_new_SelectPercentile[0:10, :])

Output:

old_data_shape:
 (150, 4)
new_SelectKBest_data_shape:
 (150, 2)
new_SelectPercentile_data_shape:
 (150, 2)
KBest scores:
 [ 10.81782088   3.59449902 116.16984746  67.24482759]
Percentile scores:
 [ 10.81782088   3.59449902 116.16984746  67.24482759]
KBest pvalues_:
 [4.47651499e-03 1.65754167e-01 5.94344354e-26 2.50017968e-15]
Percentile pvalues_:
 [4.47651499e-03 1.65754167e-01 5.94344354e-26 2.50017968e-15]
old_data_10:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]
X_new_SelectKBest:
 [[1.4 0.2]
 [1.4 0.2]
 [1.3 0.2]
 [1.5 0.2]
 [1.4 0.2]
 [1.7 0.4]
 [1.4 0.3]
 [1.5 0.2]
 [1.4 0.2]
 [1.5 0.1]]
X_new_SelectPercentile:
 [[1.4 0.2]
 [1.4 0.2]
 [1.3 0.2]
 [1.5 0.2]
 [1.4 0.2]
 [1.7 0.4]
 [1.4 0.3]
 [1.5 0.2]
 [1.4 0.2]
 [1.5 0.1]]

# Compare different score_func choices: do different functions select the same features?
# Note: this is a small sample, so the result may be coincidental; compare on your own project data.
# In principle, different score_func choices can select different features.
# Classification functions compared: chi2, f_classif, mutual_info_classif
X_SelectPercentile_chi2 = SelectPercentile(chi2, percentile=50)
X_SelectPercentile_f_classif = SelectPercentile(f_classif, percentile=50)
X_SelectPercentile_mutual_info_classif = SelectPercentile(mutual_info_classif, percentile=50)
X_new_SelectKBest_chi2 = X_SelectPercentile_chi2.fit_transform(X, y)
X_new_SelectKBest_f_classif = X_SelectPercentile_f_classif.fit_transform(X, y)
X_new_SelectKBest_mutual_info_classif = X_SelectPercentile_mutual_info_classif.fit_transform(X, y)
print('chi2 scores: \n', X_SelectPercentile_chi2.scores_)
print('f_classif scores: \n', X_SelectPercentile_f_classif.scores_)
print('mutual_info_classif scores:\n', X_SelectPercentile_mutual_info_classif.scores_)
print('chi2 pvalues_: \n', X_SelectPercentile_chi2.pvalues_)
print('f_classif pvalues_: \n', X_SelectPercentile_f_classif.pvalues_)
print('mutual_info_classif pvalues_:\n', X_SelectPercentile_mutual_info_classif.pvalues_)

Output:

chi2 scores:
 [ 10.81782088   3.59449902 116.16984746  67.24482759]
f_classif scores:
 [ 119.26450218   47.3644614  1179.0343277   959.32440573]
mutual_info_classif scores:
 [0.47980703 0.26374852 0.98914392 0.97626377]
chi2 pvalues_:
 [4.47651499e-03 1.65754167e-01 5.94344354e-26 2.50017968e-15]
f_classif pvalues_:
 [1.66966919e-31 1.32791652e-16 3.05197580e-91 4.37695696e-85]
mutual_info_classif pvalues_:
 None

# Mutual information (MIC, maximal information coefficient)
from minepy import MINE
import pandas as pd

def my_mine(x_data, y):
    # Collect the MIC value of each feature in a list
    mic_score = []
    m = MINE()
    # Iterate over each column of x
    for column in x_data.columns:
        m.compute_score(x_data[column], y)
        mic_score.append(m.mic())
    # Convert to a DataFrame aligned with the column names
    mic_score = pd.DataFrame(mic_score)
    # Set the index
    mic_score.index = x_data.columns
    # Rename the column
    mic_score = mic_score.rename(columns={0: 'MICSCORE'})
    return mic_score

from sklearn.datasets import load_iris
iris = load_iris()
x, y = pd.DataFrame(iris.data).rename(columns={0: 'a', 1: 'b', 2: 'c', 3: 'd'}), iris.target
my_mine(x, y)

# Note: the mutual information wrapped in sklearn.feature_selection differs a little
# from the MIC computed here, but not by much; the official sklearn.feature_selection
# version is recommended.
# sklearn values for comparison: [0.50056966 0.24913013 0.99109155 0.99213811]

Output:

   MICSCORE
a  0.642196
b  0.401504
c  0.918296
d  0.918296
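For comparison, a short sketch (an addition, not part of the original run) of sklearn's estimator-based mutual information, which produced the values quoted in the comment above; the exact numbers vary slightly with random_state:

from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)
print(mutual_info_classif(X, y, random_state=0))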

1.3 Feature Selection ------ Pearson Correlation

The Pearson correlation coefficient is one of the simplest ways to understand the relationship between a feature and the response variable. It measures the linear correlation between variables, with values in [-1, 1]: -1 means perfect negative correlation, +1 perfect positive correlation, and 0 no linear correlation. Its obvious drawback as a feature-ranking mechanism is that it is only sensitive to linear relationships: if the relationship is nonlinear, the Pearson correlation can be close to 0 even when the two variables are in one-to-one correspondence.

# SciPy's pearsonr
import pandas as pd
from scipy.stats import pearsonr

def my_pearsonr(x_data, y_data):
    # Iterate over each column of x_data
    score_p_value = []
    for column in x_data.columns:
        score = pearsonr(x_data[column], y_data)
        score_p_value.append(list(score))
    # Convert to a DataFrame
    score_p_value = pd.DataFrame(score_p_value)
    # Set the index
    score_p_value.index = x_data.columns
    # Rename the columns
    score_p_value = score_p_value.rename(columns={0: 'score', 1: 'p-value'})
    return score_p_value

from sklearn.datasets import load_iris
iris = load_iris()
x, y = pd.DataFrame(iris.data).rename(columns={0: 'a', 1: 'b', 2: 'c', 3: 'd'}), iris.target
score_p_value = my_pearsonr(x, y)
score_p_value

Output:

      score       p-value
a  0.782561  2.890478e-32
b -0.419446  9.159985e-08
c  0.949043  4.155478e-76
d  0.956464  4.775002e-81

# The correlation-coefficient method is insensitive to nonlinear relationships
import numpy as np
x = np.random.uniform(-1, 1, 100000)
print(pearsonr(x, x**2))

Output:

(-0.006741987892572058, 0.03300673581074622)

1.4 Feature Selection ------ Distance Correlation
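Distance correlation addresses Pearson's main weakness: for variables with finite first moments it is zero only when the variables are independent, so it also detects nonlinear dependence. It is computed from doubly centered pairwise-distance matrices A and B as dCor(X, Y) = sqrt(dCov2(X, Y)) / sqrt(sqrt(dVar2(X)) * sqrt(dVar2(Y))), which is what the my_distcorr function below implements.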

# Background
# Dimension promotion:
# np.atleast_xd treats the input as at least x-dimensional, where x is 1, 2, or 3.
from scipy.spatial.distance import pdist, squareform
import numpy as np

print('np.atleast_1d:', np.atleast_1d([1]))
print('np.atleast_2d:', np.atleast_2d([1]))
print('np.atleast_3d:', np.atleast_3d([1]))

# Euclidean distance
# pdist computes pairwise distances between pairs of objects (rows)
X = [[1, 2, 3, 4, 5],
     [2, 4, 6, 8, 10],
     [3, 6, 9, 12, 15],
     [1, 2, 3, 4, 5]]
x_pdist = pdist(X)
# pdist(X) computes the Euclidean distances between the rows of an m*n data matrix.
# The result is a condensed vector of length m(m-1)/2 holding the pairs (i, j) with
# i < j in order: (0,1), (0,2), ..., (0,m-1), (1,2), ..., (m-2,m-1).
print('Euclidean distances:\n', x_pdist)
# In the square form, entry (i, j) is the Euclidean distance between rows i and j of the original data.
print('Square form:\n', squareform(x_pdist))

# Products of array elements
print(np.prod([1, 2]))                    # 1*2
print(np.prod([[1, 2], [3, 4]]))          # (1*2) * (3*4)
print(np.prod([[1, 2], [3, 4]], axis=0))  # [1*3, 2*4]
print(np.prod([[1, 2], [3, 4]], axis=1))  # [1*2, 3*4]

Output:

np.atleast_1d: [1]
np.atleast_2d: [[1]]
np.atleast_3d: [[[1]]]
Euclidean distances:
 [ 7.41619849 14.83239697  0.          7.41619849  7.41619849 14.83239697]
Square form:
 [[ 0.          7.41619849 14.83239697  0.        ]
 [ 7.41619849  0.          7.41619849  7.41619849]
 [14.83239697  7.41619849  0.         14.83239697]
 [ 0.          7.41619849 14.83239697  0.        ]]
2
24
[3 8]
[ 2 12]

from scipy.spatial.distance import pdist, squareform
import numpy as np

def my_distcorr(X, Y):
    """Compute the distance correlation between X and Y.

    >>> a = [1, 2, 3, 4, 5]
    >>> b = np.array([2, 4, 6, 8, 10])
    >>> my_distcorr(a, b)
    1.0
    """
    X = np.atleast_1d(X)
    Y = np.atleast_1d(Y)
    if np.prod(X.shape) == len(X):
        X = X[:, None]
    if np.prod(Y.shape) == len(Y):
        Y = Y[:, None]
    X = np.atleast_2d(X)
    Y = np.atleast_2d(Y)
    n = X.shape[0]
    if Y.shape[0] != X.shape[0]:
        raise ValueError('Number of samples must match')
    a = squareform(pdist(X))
    b = squareform(pdist(Y))
    # Double centering of the distance matrices
    A = a - a.mean(axis=0)[None, :] - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0)[None, :] - b.mean(axis=1)[:, None] + b.mean()
    dcov2_xy = (A * B).sum() / float(n * n)
    dcov2_xx = (A * A).sum() / float(n * n)
    dcov2_yy = (B * B).sum() / float(n * n)
    dcor = np.sqrt(dcov2_xy) / np.sqrt(np.sqrt(dcov2_xx) * np.sqrt(dcov2_yy))
    return dcor

a = [1, 2, 3, 4, 5]
b = np.array([2, 4, 6, 8, 10])
print(my_distcorr(a, b))

# For a quadratic relationship, where the Pearson correlation can be near zero,
# the distance correlation remains high
print(my_distcorr(a, [i * i for i in a]))

Output:

1.0

0.9869160440537483

1.5 Feature Selection ------ Model-Based Ranking

# Note: sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection instead
from sklearn.model_selection import cross_val_score, ShuffleSplit
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; substitute another regression dataset there
from sklearn.ensemble import RandomForestRegressor
import numpy as np

# Load the boston housing dataset as an example
boston = load_boston()
X = boston["data"]
Y = boston["target"]
names = boston["feature_names"]
rf = RandomForestRegressor(n_estimators=20, max_depth=4)
scores = []
# Fit a model on each feature individually and cross-validate it
for i in range(X.shape[1]):
    score = cross_val_score(rf, X[:, i:i+1], Y, scoring="r2",  # note the difference between X[:, i] and X[:, i:i+1]
                            cv=ShuffleSplit(n_splits=5, test_size=0.3),
                            # cv=5,
                            n_jobs=-1)
    scores.append((format(np.mean(score), '.3f'), names[i]))
print(sorted(scores, reverse=True))

Output:

[('0.665', 'LSTAT'), ('0.545', 'RM'), ('0.394', 'NOX'), ('0.327', 'INDUS'), ('0.320', 'TAX'), ('0.318', 'PTRATIO'), ('0.201', 'CRIM'), ('0.179', 'ZN'), ('0.161', 'RAD'), ('0.124', 'DIS'), ('0.114', 'B'), ('0.078', 'AGE'), ('0.031', 'CHAS')]

1.6 Feature Selection ------ Recursive Feature Elimination (RFE)

Attributes:

support_: boolean mask of the selected features.

ranking_: ranking position of each feature. Selected features are assigned rank 1; the worse a feature, the higher its rank number.

# Without cross-validation
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
# Different base models can be used
estimator = SVR(kernel="linear")
# estimator is the base model; n_features_to_select is the number of features to keep
selector = RFE(estimator, n_features_to_select=5)
selector = selector.fit(X, y)
print('selector.support_:\n', selector.support_)
print('selector.ranking_:\n', selector.ranking_)
print(selector)

Output:

selector.support_:
 [ True  True  True  True  True False False False False False]
selector.ranking_:
 [1 1 1 1 1 6 4 3 2 5]
RFE(estimator=SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
    gamma='auto', kernel='linear', max_iter=-1, shrinking=True, tol=0.001,
    verbose=False),
  n_features_to_select=5, step=1, verbose=0)

# With cross-validation
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")
# Parameters:
# step: if greater than or equal to 1, the (integer) number of features to remove at each iteration;
#       if within (0.0, 1.0), the fraction of features (rounded down) to remove at each iteration.
selector = RFECV(estimator, step=1, cv=6)
selector = selector.fit(X, y)
print('selector.support_:\n', selector.support_)
print('selector.ranking_:\n', selector.ranking_)

Output:

selector.support_:
 [ True  True  True  True  True False False False False False]
selector.ranking_:
 [1 1 1 1 1 6 4 3 2 5]

import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold  # sklearn.cross_validation was removed in 0.20
from sklearn.feature_selection import RFECV
from sklearn.datasets import make_classification

# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
                           n_redundant=2, n_repeated=0, n_classes=8,
                           n_clusters_per_class=1, random_state=0)

# Create the RFE object and compute a cross-validated score.
svc = SVC(kernel="linear")
# The "accuracy" scoring is proportional to the number of correct classifications
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(2),
              scoring='accuracy')
rfecv.fit(X, y)
print("Optimal number of features : %d" % rfecv.n_features_)

# Plot number of features VS. cross-validation scores
# (on scikit-learn >= 1.2, read the scores from rfecv.cv_results_ instead of grid_scores_)
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()

Output:

Optimal number of features : 3

1.7 Feature Selection ------ L1-Based Feature Selection

Linear models penalized with the L1 norm have sparse solutions: most of their estimated coefficients are zero. When you want to reduce the dimensionality of the data for use with another classifier, feature_selection.SelectFromModel can pick out the features whose coefficients are nonzero.

In particular, sparse estimators commonly used for this purpose are linear_model.Lasso for regression, and linear_model.LogisticRegression and svm.LinearSVC for classification.

SelectFromModel is a meta-transformer that can be used with any fitted estimator that exposes a coef_ or feature_importances_ attribute. Features whose coef_ or feature_importances_ values fall below the given threshold are removed. Besides setting a numeric threshold manually, built-in heuristics can be selected with string arguments: the mean ("mean"), the median ("median"), and float multiples of these such as "0.1*mean".
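As a small illustration of the string-threshold heuristics (a sketch added here, not part of the original):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
# Keep features whose coefficient magnitude is at least the median magnitude
sel = SelectFromModel(LogisticRegression(max_iter=1000), threshold="median")
print(sel.fit_transform(X, y).shape)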

# svm.LinearSVC
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel

iris = load_iris()
X, y = iris.data, iris.target
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
print('X_shape:', X.shape)
print('X_new.shape:', X_new.shape)

Output:

X_shape: (150, 4)

X_new.shape: (150, 3)

## Feature selection with L1- or L2-penalized logistic regression as the base model
# For SVMs and logistic regression, the parameter C controls sparsity: the smaller C is, the fewer features are selected.
# For Lasso, the larger alpha is, the fewer features are selected.
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression as LR

X_new = SelectFromModel(LR(C=0.1)).fit_transform(iris.data, iris.target)
print('X_new.shape:', X_new.shape)

Output:

X_new.shape: (150, 2)

# Use the SelectFromModel meta-transformer along with Lasso to select the best
# couple of features from the Boston dataset.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

# Load the boston dataset.
boston = load_boston()
X, y = boston['data'], boston['target']

# We use the base estimator LassoCV since the L1 norm promotes sparsity of features.
clf = LassoCV()

# Set a minimum threshold of 0.25
sfm = SelectFromModel(clf, threshold=0.25)
sfm.fit(X, y)
n_features = sfm.transform(X).shape[1]
print('X_shape:', X.shape)
print('new_X_shape:', sfm.transform(X).shape)

# Raise the threshold until the number of features equals two.
# Note that the attribute can be set directly instead of repeatedly
# fitting the metatransformer.
while n_features > 2:
    sfm.threshold += 0.1
    X_transform = sfm.transform(X)
    n_features = X_transform.shape[1]

# Plot the selected two features from X.
plt.title(
    "Features selected from Boston using SelectFromModel with "
    "threshold %0.3f." % sfm.threshold)
feature1 = X_transform[:, 0]
feature2 = X_transform[:, 1]
plt.plot(feature1, feature2, 'r.')
plt.xlabel("Feature number 1")
plt.ylabel("Feature number 2")
plt.ylim([np.min(feature2), np.max(feature2)])
plt.show()

Output:

X_shape: (506, 13)

new_X_shape: (506, 5)

1.8 Feature Selection ------ Randomized Sparse Models

A limitation of L1-based sparse models is that, when faced with a group of correlated features, they tend to select only one of them. To mitigate this, randomization techniques can be used: re-estimate the sparse model many times while perturbing the design matrix, or subsample the data repeatedly and count how often a given regressor is selected.

RandomizedLasso implements this strategy with Lasso; RandomizedLogisticRegression uses logistic regression and is suitable for classification tasks. To obtain stability scores over the full regularization path, use lasso_stability_path.

Note that for randomized sparse models to be more effective than the standard F statistic at detecting nonzero features, the ground-truth model must be sparse; in other words, the nonzero features should make up only a small fraction.
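Note: RandomizedLasso, RandomizedLogisticRegression, and lasso_stability_path were deprecated in scikit-learn 0.19 and removed in 0.21, so the example below only runs on old versions. A hand-rolled stability-selection sketch in the same spirit (an illustration; the dataset and alpha are arbitrary choices of mine):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)
rng = np.random.RandomState(42)
n_resamples, selected = 100, np.zeros(X.shape[1])
for _ in range(n_resamples):
    idx = rng.choice(len(X), size=len(X) // 2, replace=False)  # subsample half the rows
    coef = Lasso(alpha=0.1).fit(X[idx], y[idx]).coef_
    selected += coef != 0                 # count how often each feature enters the model
print(selected / n_resamples)             # selection frequency per feature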

import warnings

import matplotlib.pyplot as plt
import numpy as np
from scipy import linalg

from sklearn.linear_model import (RandomizedLasso, lasso_stability_path,
                                  LassoLarsCV)
from sklearn.feature_selection import f_regression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import auc, precision_recall_curve
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.utils.extmath import pinvh


def mutual_incoherence(X_relevant, X_irelevant):
    """Mutual incoherence, as defined by formula (26a) of [Wainwright2006]."""
    projector = np.dot(np.dot(X_irelevant.T, X_relevant),
                       pinvh(np.dot(X_relevant.T, X_relevant)))
    return np.max(np.abs(projector).sum(axis=1))


for conditioning in (1, 1e-4):
    ###########################################################################
    # Simulate regression data with a correlated design
    n_features = 501
    n_relevant_features = 3
    noise_level = .2
    coef_min = .2
    # The Donoho-Tanner phase transition is around n_samples=25: below we
    # will completely fail to recover in the well-conditioned case
    n_samples = 25
    block_size = n_relevant_features
    rng = np.random.RandomState(42)
    # The coefficients of our model
    coef = np.zeros(n_features)
    coef[:n_relevant_features] = coef_min + rng.rand(n_relevant_features)
    # The correlation of our design: variables correlated by blocks of 3
    corr = np.zeros((n_features, n_features))
    for i in range(0, n_features, block_size):
        corr[i:i + block_size, i:i + block_size] = 1 - conditioning
    corr.flat[::n_features + 1] = 1
    corr = linalg.cholesky(corr)
    # Our design
    X = rng.normal(size=(n_samples, n_features))
    X = np.dot(X, corr)
    # Keep [Wainwright2006] (26c) constant
    X[:n_relevant_features] /= np.abs(
        linalg.svdvals(X[:n_relevant_features])).max()
    X = StandardScaler().fit_transform(X.copy())
    # The output variable
    y = np.dot(X, coef)
    y /= np.std(y)
    # We scale the added noise as a function of the average correlation
    # between the design and the output variable
    y += noise_level * rng.normal(size=n_samples)
    mi = mutual_incoherence(X[:, :n_relevant_features],
                            X[:, n_relevant_features:])

    ###########################################################################
    # Plot stability selection path, using a high eps for early stopping
    # of the path, to save computation time
    alpha_grid, scores_path = lasso_stability_path(X, y, random_state=42,
                                                   eps=0.05)
    plt.figure()
    # We plot the path as a function of alpha/alpha_max to the power 1/3: the
    # power 1/3 scales the path less brutally than the log, and enables to
    # see the progression along the path
    hg = plt.plot(alpha_grid[1:] ** .333, scores_path[coef != 0].T[1:], 'r')
    hb = plt.plot(alpha_grid[1:] ** .333, scores_path[coef == 0].T[1:], 'k')
    ymin, ymax = plt.ylim()
    plt.xlabel(r'$(\alpha / \alpha_{max})^{1/3}$')
    plt.ylabel('Stability score: proportion of times selected')
    plt.title('Stability Scores Path - Mutual incoherence: %.1f' % mi)
    plt.axis('tight')
    plt.legend((hg[0], hb[0]), ('relevant features', 'irrelevant features'),
               loc='best')

    ###########################################################################
    # Plot the estimated stability scores for a given alpha
    # Use 6-fold cross-validation rather than the default 3-fold: it leads to
    # a better choice of alpha.
    # Suppress the user warnings - they are not necessary for the example
    # as it is specifically set up to be challenging.
    with warnings.catch_warnings():
        warnings.simplefilter('ignore', UserWarning)
        lars_cv = LassoLarsCV(cv=6).fit(X, y)
    # Run the RandomizedLasso: we use a path going down to .1*alpha_max
    # to avoid exploring the regime in which very noisy variables enter
    # the model
    alphas = np.linspace(lars_cv.alphas_[0], .1 * lars_cv.alphas_[0], 6)
    clf = RandomizedLasso(alpha=alphas, random_state=42).fit(X, y)
    trees = ExtraTreesRegressor(100).fit(X, y)
    # Compare with F-score
    F, _ = f_regression(X, y)

    plt.figure()
    for name, score in [('F-test', F),
                        ('Stability selection', clf.scores_),
                        ('Lasso coefs', np.abs(lars_cv.coef_)),
                        ('Trees', trees.feature_importances_),
                        ]:
        precision, recall, thresholds = precision_recall_curve(coef != 0,
                                                               score)
        plt.semilogy(np.maximum(score / np.max(score), 1e-4),
                     label="%s. AUC: %.3f" % (name, auc(recall, precision)))
    plt.plot(np.where(coef != 0)[0], [2e-4] * n_relevant_features, 'mo',
             label="Ground truth")
    plt.xlabel("Features")
    plt.ylabel("Score")
    # Plot only the 100 first coefficients
    plt.xlim(0, 100)
    plt.legend(loc='best')
    plt.title('Feature selection scores - Mutual incoherence: %.1f'
              % mi)

plt.show()

1.9 Feature Selection ------ Tree-Based Feature Selection

Tree-based estimators can be used to compute feature importances, which in turn can be used to discard irrelevant features. A quick selection sketch follows; the full importance-ranking example comes after it.
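A short sketch (an addition, not from the original) of using the importances for selection by wrapping a fitted forest in SelectFromModel:

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           random_state=0)
forest = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)
# With the default threshold, features below the mean importance are dropped
X_new = SelectFromModel(forest, prefit=True).transform(X)
print(X.shape, '->', X_new.shape)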

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000,
                           n_features=10,
                           n_informative=3,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=2,
                           random_state=0,
                           shuffle=False)

# Build a forest and compute the feature importances
forest = ExtraTreesClassifier(n_estimators=250,
                              random_state=0)
forest.fit(X, y)
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")
for f in range(X.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
        color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()

Output:

Feature ranking:
1. feature 1 (0.295902)
2. feature 2 (0.208351)
3. feature 0 (0.177632)
4. feature 3 (0.047121)
5. feature 6 (0.046303)
6. feature 8 (0.046013)
7. feature 7 (0.045575)
8. feature 4 (0.044614)
9. feature 9 (0.044577)
10. feature 5 (0.043912)

[Figure: bar chart of the forest's feature importances, with error bars for inter-tree variability]

# Applied to face recognition data: pixel importances on the Olivetti faces dataset
from time import time
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_olivetti_faces
from sklearn.ensemble import ExtraTreesClassifier

# Number of cores to use to perform parallel fitting of the forest model
n_jobs = 1

# Load the faces dataset
data = fetch_olivetti_faces()
X = data.images.reshape((len(data.images), -1))
y = data.target

mask = y < 5  # Limit to 5 classes
X = X[mask]
y = y[mask]

# Build a forest and compute the pixel importances
print("Fitting ExtraTreesClassifier on faces data with %d cores..." % n_jobs)
t0 = time()
forest = ExtraTreesClassifier(n_estimators=1000,
                              max_features=128,
                              n_jobs=n_jobs,
                              random_state=0)
forest.fit(X, y)
print("done in %0.3fs" % (time() - t0))
importances = forest.feature_importances_
importances = importances.reshape(data.images[0].shape)

# Plot pixel importances
plt.matshow(importances, cmap=plt.cm.hot)
plt.title("Pixel importances with forests of trees")
plt.show()
