Feature extraction and feature selection in sklearn

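Data analysis usually involves many kinds of data, both numeric and non-numeric, and feature engineering is central to regression and classification problems alike.
Eigenvalues and eigenvectors (literally "feature values" and "feature vectors" in Chinese) are central to linear algebra, but feature engineering is a different, more practical idea: picking the meaningful or useful fields out of raw data of all kinds and converting them into a mathematical representation that machines and algorithms can work with. Take the students in a class: each has a gender, age, height, weight, grades, personality traits, and so on. Age, height, weight, and grades are numeric features and can be used directly; gender and personality traits are non-numeric features and must be converted before an algorithm can use them.
Python's sklearn package includes the common feature-engineering tools and APIs, which fall into the following areas: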

Feature extraction

sklearn.feature_extraction: the commonly used feature-extraction APIs

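DictVectorizer(dtype=<class 'numpy.float64'>, separator='=', sparse=True, sort=True)
Extracts features from a list of dicts that map feature names to values. Numeric values are passed through unchanged, while string values (labels and other non-numeric data) are one-hot encoded into columns named name=value, joined with the separator. The sparse parameter controls whether the result is a scipy sparse matrix (the default) or a dense array.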
```
>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer(sparse=False)
>>> D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
>>> X = v.fit_transform(D)
>>> X
array([[2., 0., 1.],
       [0., 1., 3.]])
```
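
The example above only uses numeric values. To connect it to the student example from the introduction, here is a minimal sketch of how string values get one-hot encoded. The student records are made up for illustration, and the column order assumes the default sort=True.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical student records: "gender" is categorical, the rest are numeric.
students = [
    {"gender": "F", "age": 12, "height": 150.0},
    {"gender": "M", "age": 13, "height": 155.5},
]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(students)

# feature_names_ lists the output columns in order; the string-valued
# "gender" key becomes one column per observed value.
feature_names = vec.feature_names_
print(feature_names)
print(X)
```

Numeric fields keep their values, and each distinct string value gets its own 0/1 column.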

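FeatureHasher(n_features=1048576, input_type='dict', dtype=<class 'numpy.float64'>, alternate_sign=True)
Turns the input into a sparse matrix using the hashing trick: each feature name is hashed to a column index, so no vocabulary is stored. n_features is the number of columns, that is, the output dimensionality; input_type is the kind of input ('dict', 'pair', or 'string'). It is typically used when the feature space is very large.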

```
>>> from sklearn.feature_extraction import FeatureHasher
>>> h = FeatureHasher(n_features=10)
>>> D = [{'dog': 1, 'cat':2, 'elephant':4},{'dog': 2, 'run': 5}]
>>> f = h.transform(D)
>>> f.toarray()
array([[ 0.,  0., -4., -1.,  0.,  0.,  0.,  0.,  0.,  2.],
       [ 0.,  0.,  0., -2., -5.,  0.,  0.,  0.,  0.,  0.]])
```
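
As a complement, here is a small sketch of the 'string' input type, where each sample is a list of tokens and every token counts as 1. The token lists and the choice of n_features=8 are arbitrary; alternate_sign=False is set only so that the hashed values stay non-negative and easy to read.

```python
from sklearn.feature_extraction import FeatureHasher

# Each sample is a list of string tokens; every token contributes a value of 1.
samples = [["dog", "cat"], ["cat"]]

# alternate_sign=False keeps all hashed values non-negative.
hasher = FeatureHasher(n_features=8, input_type="string", alternate_sign=False)

# FeatureHasher is stateless, so transform works without a prior fit.
hashed = hasher.transform(samples)
dense = hashed.toarray()
print(dense.shape)
print(dense)
```

Which column a token lands in depends on the hash function, so only the shape and the row totals are predictable here. Two tokens may collide in the same column, which is the price paid for not storing a vocabulary.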

image
Extracts features directly from images

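image.extract_patches_2d(image, patch_size, max_patches=None, random_state=None)
Extracts patches from a 2D image. patch_size is a 2-tuple (patch_height, patch_width) giving the size of each patch. max_patches is the maximum number of patches to extract (an int, or a float between 0 and 1 read as a fraction of all patches); None extracts every patch. random_state seeds the random sampling of patches and only matters when max_patches is not None. For an RGB image each patch keeps its three channels, so the output has shape (n_patches, patch_height, patch_width, 3).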

```
>>> from sklearn.datasets import load_sample_image
>>> from sklearn.feature_extraction import image
>>> # Use the array data from the first image in this dataset:
>>> one_image = load_sample_image("china.jpg")
>>> print('Image shape: {}'.format(one_image.shape))
Image shape: (427, 640, 3)
>>> patches = image.extract_patches_2d(one_image, (2, 2))
>>> print('Patches shape: {}'.format(patches.shape))
Patches shape: (272214, 2, 2, 3)
>>> # Here are just two of these patches:
>>> print(patches[1])
[[[174 201 231]
  [174 201 231]]
 [[173 200 230]
  [173 200 230]]]
>>> print(patches[800])
[[[187 214 243]
  [188 215 244]]
 [[187 214 243]
  [188 215 244]]]
```
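
To make the roles of max_patches and random_state concrete, here is a rough sketch on a small synthetic image. The 6 x 8 random array is only a stand-in for a real photo, and the seed values are arbitrary.

```python
import numpy as np
from sklearn.feature_extraction import image

# A small synthetic 6 x 8 RGB image stands in for a real photo.
rng = np.random.RandomState(0)
img = rng.randint(0, 256, size=(6, 8, 3))

# Without max_patches, every 2 x 2 patch is returned: (6 - 1) * (8 - 1) = 35.
all_patches = image.extract_patches_2d(img, (2, 2))

# With max_patches, that many patches are sampled at random;
# fixing random_state makes the draw repeatable.
sample_a = image.extract_patches_2d(img, (2, 2), max_patches=10, random_state=42)
sample_b = image.extract_patches_2d(img, (2, 2), max_patches=10, random_state=42)

print(all_patches.shape, sample_a.shape)
print(np.array_equal(sample_a, sample_b))
```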

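image.grid_to_graph(n_x, n_y, n_z=1, mask=None, return_as=<class 'scipy.sparse.coo.coo_matrix'>, dtype=<class 'int'>)
image.img_to_graph(img, mask=None, return_as=<class 'scipy.sparse.coo.coo_matrix'>, dtype=None)
Both build the pixel-to-pixel connectivity graph of an image as a sparse adjacency matrix. grid_to_graph needs only the grid shape and marks neighbouring pixels as connected; img_to_graph takes the image itself and weights each edge by the gradient between the two pixels.
image.reconstruct_from_patches_2d(patches, image_size)
Reconstructs an image from all of its patches, averaging the regions where patches overlap.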
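
A quick way to see what these helpers do is a round trip on a tiny grayscale array. This is only a sketch: the 4 x 4 array of consecutive numbers is an arbitrary stand-in for an image.

```python
import numpy as np
from sklearn.feature_extraction import image

# A tiny 4 x 4 grayscale "image" with distinct pixel values.
img = np.arange(16, dtype=float).reshape(4, 4)

# Extract every 2 x 2 patch, then rebuild the image from the patches.
patches = image.extract_patches_2d(img, (2, 2))
rebuilt = image.reconstruct_from_patches_2d(patches, (4, 4))

# Connectivity graph of a 4 x 4 pixel grid: one row and one column per pixel.
graph = image.grid_to_graph(n_x=4, n_y=4)

print(patches.shape)
print(np.allclose(rebuilt, img))
print(graph.shape)
```

Because the overlapping patches all come from the same image, averaging them returns the original pixel values.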

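image.PatchExtractor(patch_size=None, max_patches=None, random_state=None)
Does the same job as extract_patches_2d, but as an estimator with fit and transform that works on a collection of images, so it can be used in a pipeline.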

```
>>> from sklearn.datasets import load_sample_images
>>> from sklearn.feature_extraction import image
>>> # Use the array data from the second image in this dataset:
>>> X = load_sample_images().images[1]
>>> # PatchExtractor expects a batch of images, so add a leading n_samples axis;
>>> # without it the (427, 640, 3) array is read as 427 grayscale images of size 640 x 3.
>>> X = X[None, ...]
>>> print('Image shape: {}'.format(X.shape))
Image shape: (1, 427, 640, 3)
>>> pe = image.PatchExtractor(patch_size=(2, 2))
>>> pe_fit = pe.fit(X)
>>> pe_trans = pe.transform(X)
>>> print('Patches shape: {}'.format(pe_trans.shape))
Patches shape: (272214, 2, 2, 3)
```

text
- text.CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
  Converts a collection of text documents into a matrix of token counts. The vocabulary is learned from the corpus during fit unless one is supplied through the vocabulary parameter.
```
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> # get_feature_names() was replaced by get_feature_names_out() in scikit-learn 1.0 and removed in 1.2.
>>> print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
```
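
Many of the parameters in the signature change what counts as a token. As one illustration, here is a minimal sketch of ngram_range on a two-sentence toy corpus (the sentences are made up), counting word pairs instead of single words.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat ran"]

# ngram_range=(2, 2) counts pairs of adjacent words instead of single words.
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
counts = bigram_vectorizer.fit_transform(docs).toarray()

# vocabulary_ maps each learned term to its column index.
vocabulary = bigram_vectorizer.vocabulary_
print(vocabulary)
print(counts)
```
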
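- text.HashingVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', n_features=1048576, binary=False, norm='l2', alternate_sign=True, dtype=<class 'numpy.float64'>)
  Also produces a token-occurrence matrix, but maps each token to a column with a hash function instead of a learned vocabulary. No vocabulary is stored, so memory use stays low, but column indices cannot be mapped back to tokens. With the default norm='l2', each row is normalized.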
```
>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = HashingVectorizer(n_features=2**4)
>>> X = vectorizer.fit_transform(corpus)
>>> print(X.shape)
(4, 16)
```
- text.TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
  Converts a count matrix, such as the output of CountVectorizer, into a normalized tf or tf-idf representation.
- text.TfidfVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float64'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
  Converts raw documents directly into a tf-idf matrix; equivalent to CountVectorizer followed by TfidfTransformer.
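
The two tf-idf classes have no example above, so here is a short sketch of how they relate, reusing the four-sentence corpus from the CountVectorizer example. It assumes default parameters throughout.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Two steps: raw counts first, then tf-idf weighting.
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer does both at once.
one_step = TfidfVectorizer().fit_transform(corpus)

print(one_step.shape)
print(np.allclose(one_step.toarray(), two_step.toarray()))
```

With the default norm='l2', every row of the result has unit Euclidean length.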

Feature selection

sklearn.feature_selection: the APIs used for feature selection

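GenericUnivariateSelect(score_func=<function f_classif>, mode='percentile', param=1e-05)
Univariate feature selector with a configurable strategy. score_func is the scoring function; mode is the selection strategy, one of {'percentile', 'k_best', 'fpr', 'fdr', 'fwe'}, each of which corresponds to one of the dedicated selectors described below; param is the parameter of the chosen strategy.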
```
>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.feature_selection import GenericUnivariateSelect, chi2
>>> X, y = load_breast_cancer(return_X_y=True)
>>> X.shape
(569, 30)
>>> transformer = GenericUnivariateSelect(chi2, 'k_best', param=20)
>>> X_new = transformer.fit_transform(X, y)
>>> X_new.shape
(569, 20)
```
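
Each mode mirrors one of the dedicated selector classes. As a sanity check, this sketch compares mode='k_best' with SelectKBest on the same breast cancer data; it assumes that both are given the same scoring function and the same k.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import GenericUnivariateSelect, SelectKBest, chi2

X, y = load_breast_cancer(return_X_y=True)

# The generic selector configured for the k_best strategy ...
generic = GenericUnivariateSelect(score_func=chi2, mode="k_best", param=20).fit(X, y)
# ... and the dedicated class for the same strategy.
dedicated = SelectKBest(score_func=chi2, k=20).fit(X, y)

# get_support() returns a boolean mask over the input columns.
same_columns = np.array_equal(generic.get_support(), dedicated.get_support())
print(same_columns)
```

The generic class is mostly convenient when the strategy itself is a hyperparameter to be searched over.
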
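The scoring functions are:
- f_classif
  The ANOVA F-value between each feature and the class label, which is the usual F-test. It compares the variance between class means with the variance within classes to judge whether a feature's mean differs significantly across classes. Used for classification tasks.
- mutual_info_classif
  The mutual information between each feature and a discrete target, for classification tasks. For discrete variables, I(X; Y) = Σ_x Σ_y p(x, y) log( p(x, y) / (p(x) p(y)) ). Mutual information is a useful measure of information: it can be read as the amount of information one random variable contains about another, or as the reduction in uncertainty about one random variable once the other is known.
- chi2
  The chi-squared test, used mainly for classification tasks. It requires non-negative feature values such as counts or frequencies.
- f_regression
  The F-value between each feature and a continuous target, computed from a univariate linear regression. Used for regression tasks.
- mutual_info_regression
  Mutual information for a continuous target (not a label or class), that is, for regression tasks.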
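
The scoring functions can also be called directly, which helps build intuition for what the selectors rank. Below is a rough sketch on synthetic data: one column is shifted by the class label and the other is pure noise. The data, the shift of 3.0, and the seed are all made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.RandomState(0)
y = np.repeat([0, 1], 100)

# Column 0 depends on the label; column 1 is unrelated noise.
informative = 3.0 * y + rng.normal(size=200)
noise = rng.normal(size=200)
X = np.column_stack([informative, noise])

# f_classif returns one F-value and one p-value per feature.
f_values, p_values = f_classif(X, y)

# mutual_info_classif returns one estimated mutual information per feature.
mi = mutual_info_classif(X, y, random_state=0)

print(f_values, p_values)
print(mi)
```

The informative column should get the larger F-value, the smaller p-value, and the larger mutual information.
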
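SelectPercentile(score_func=<function f_classif>, percentile=10)
Selects features by the percentile of their scores. score_func is the scoring function and percentile is the percentage of features to keep. For example, with the chi-squared test and percentile=10, the output keeps the 10% of features with the highest chi-squared scores.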
```
>>> from sklearn.datasets import load_digits
>>> from sklearn.feature_selection import SelectPercentile, chi2
>>> X, y = load_digits(return_X_y=True)
>>> X.shape
(1797, 64)
>>> X_new = SelectPercentile(chi2, percentile=10).fit_transform(X, y)
>>> X_new.shape
(1797, 7)
```
SelectKBest(score_func=<function f_classif>, k=10)
Keeps the k features with the highest scores from the scoring function.
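
SelectKBest has no example above, so here is a minimal sketch. It uses synthetic data in which only the last of three columns depends on the label; the data and the seed are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.RandomState(0)
y = np.repeat([0, 1], 100)

# Three noise columns; only column 2 is then shifted by the label.
X = rng.normal(size=(200, 3))
X[:, 2] += 3.0 * y

selector = SelectKBest(f_classif, k=1).fit(X, y)

# Indices of the kept columns, and the reduced data.
kept = selector.get_support(indices=True)
X_new = selector.transform(X)

print(selector.scores_)
print(kept, X_new.shape)
```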

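SelectFpr(score_func=<function f_classif>, alpha=0.05)
Selects features with a false positive rate test: a feature is kept when the p-value of its score is below alpha, which bounds the expected share of irrelevant features that get selected by mistake. score_func is the scoring function (f_classif by default) and alpha is the upper bound on the p-value. The false positive rate is defined as FP / (FP + TN). In the classifier setting tabulated below, it is the share of actual negatives that the model predicts as positive; in feature selection, a "positive" is a feature declared relevant, so the selector is not limited to binary classification.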

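| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | True positive (TP) | False negative (FN) |
| Actual negative | False positive (FP) | True negative (TN) |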

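- True Positive Rate (TPR), also called sensitivity
  TPR = TP / (TP + FN)
  actual positives predicted as positive / all actual positives
- True Negative Rate (TNR), also called specificity
  TNR = TN / (TN + FP)
  actual negatives predicted as negative / all actual negatives
- False Positive Rate (FPR)
  FPR = FP / (FP + TN)
  actual negatives predicted as positive / all actual negatives
- False Negative Rate (FNR)
  FNR = FN / (TP + FN)
  actual positives predicted as negative / all actual positives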
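
The four rates are easy to compute from a confusion matrix. Here is a small sketch using sklearn.metrics.confusion_matrix; the ten labels and predictions are invented for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions for ten samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = [int(v) for v in confusion_matrix(y_true, y_pred).ravel()]

tpr = tp / (tp + fn)  # sensitivity
tnr = tn / (tn + fp)  # specificity
fpr = fp / (fp + tn)
fnr = fn / (tp + fn)

print(tn, fp, fn, tp)
print(tpr, tnr, fpr, fnr)
```

TPR and FNR always sum to 1, and so do TNR and FPR, because each pair splits the same group of actual positives or actual negatives.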

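SelectFdr(score_func=<function f_classif>, alpha=0.05)
Selects features based on the false discovery rate (FDR): the expected proportion of false rejections (rejections of a true null hypothesis) among all rejected null hypotheses. Put simply, it is the share of the features declared relevant that are in fact irrelevant, FP / (FP + TP). alpha is the upper bound on the expected FDR, which is enforced on the p-values with the Benjamini-Hochberg procedure. Like SelectFpr, it is not limited to binary classification.
SelectFwe(score_func=<function f_classif>, alpha=0.05)
Selects features based on the family-wise error rate (FWE), the probability of making at least one Type I error across all the tests. For more on FDR and FWE, see the reference link.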
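
To compare the three error-controlling selectors side by side, here is a rough sketch on synthetic data with 2 informative columns and 48 noise columns. The data, the shift of 2.0, and the seed are assumptions made for illustration, so the exact counts will differ on other data.

```python
import numpy as np
from sklearn.feature_selection import SelectFdr, SelectFpr, SelectFwe, f_classif

rng = np.random.RandomState(0)
y = np.repeat([0, 1], 100)

# 50 noise columns; the first two are then shifted by the label.
X = rng.normal(size=(200, 50))
X[:, :2] += 2.0 * y[:, None]

supports = {}
n_selected = {}
for name, cls in [("fpr", SelectFpr), ("fdr", SelectFdr), ("fwe", SelectFwe)]:
    selector = cls(f_classif, alpha=0.05).fit(X, y)
    supports[name] = selector.get_support()
    n_selected[name] = int(supports[name].sum())

print(n_selected)
```

At the same alpha, the family-wise criterion is the strictest and the plain p-value cutoff the most permissive, so the number of kept features should not decrease when going from fwe to fdr to fpr.
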
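SelectFromModel(estimator, threshold=None, prefit=False, norm_order=1, max_features=None)
Selects features based on a model. The estimator is fitted once on all the features, and the selection then uses the importance weights it learned rather than per-feature error rates. estimator is the underlying regression or classification model, which must expose feature_importances_ or coef_ after fitting; threshold is the cutoff: features whose importance is greater than or equal to it are kept and the rest are dropped; prefit says whether the estimator passed in is already fitted, in which case transform is called directly; norm_order is the order of the norm used to reduce coef_ to one importance per feature when coef_ is 2-dimensional; max_features is the maximum number of features to select.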
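
Here is a minimal sketch of both ways to use SelectFromModel, with LinearRegression as the estimator. The synthetic target depends only on columns 0 and 2; the data, the coefficients, and the seed are made up. With threshold=None and this estimator, the cutoff falls back to the mean importance.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 4))
# The target depends only on columns 0 and 2, plus a little noise.
y = 5.0 * X[:, 0] - 4.0 * X[:, 2] + 0.01 * rng.normal(size=100)

# Option 1: let the selector fit the estimator itself.
selector = SelectFromModel(LinearRegression()).fit(X, y)
kept = selector.get_support(indices=True)

# Option 2: pass an already fitted estimator and call transform directly.
fitted = LinearRegression().fit(X, y)
X_new = SelectFromModel(fitted, prefit=True).transform(X)

print(kept, X_new.shape)
```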

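RFE(estimator, n_features_to_select=None, step=1, verbose=0)
Recursive feature elimination. Given an estimator that exposes weights or coefficients, RFE first fits it on all the features and ranks them by weight, then iterates: each round removes the step features with the smallest weights and refits on the rest, until the number of remaining features equals n_features_to_select. verbose controls how much of the process is printed.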

```
>>> from sklearn.datasets import make_friedman1
>>> from sklearn.feature_selection import RFE
>>> from sklearn.svm import SVR
>>> X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
>>> estimator = SVR(kernel="linear")
>>> # Pass n_features_to_select by keyword; newer scikit-learn releases reject it as a positional argument.
>>> selector = RFE(estimator, n_features_to_select=5, step=1)
>>> selector = selector.fit(X, y)
>>> selector.support_
array([ True,  True,  True,  True,  True, False, False, False, False,
       False])
>>> selector.ranking_
array([1, 1, 1, 1, 1, 6, 4, 3, 2, 5])
```

RFECV(estimator, step=1, min_features_to_select=1, cv='warn', scoring=None, verbose=0, n_jobs=None)
Same as RFE, but uses cross-validation to choose the number of features to keep.
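
RFECV can be sketched on the same Friedman data as the RFE example. Setting cv=5 is a choice made here for illustration; the number of features it ends up keeping depends on the cross-validation scores, so it is printed rather than stated.

```python
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

# Unlike RFE, the number of features to keep is not given up front:
# it is chosen from the cross-validated scores.
selector = RFECV(SVR(kernel="linear"), step=1, cv=5).fit(X, y)

n_kept = selector.n_features_
print(n_kept)
print(selector.support_)
print(selector.ranking_)
```

As with RFE, the selected features are the ones with rank 1 in ranking_.
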
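VarianceThreshold(threshold=0.0)
Selects features by a variance threshold: features whose variance does not exceed threshold are removed. With the default of 0.0, only features that have the same value in every sample are dropped.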
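
A minimal sketch with the default threshold, on a tiny hand-written matrix in which the first and last columns are constant:

```python
from sklearn.feature_selection import VarianceThreshold

# Columns 0 and 3 hold the same value in every row, so their variance is 0.
X = [[0, 2, 0, 3],
     [0, 1, 4, 3],
     [0, 1, 1, 3]]

selector = VarianceThreshold()  # threshold=0.0 by default
X_new = selector.fit_transform(X)
kept = selector.get_support(indices=True)

print(kept)
print(X_new)
```

Unlike the other selectors in this section, VarianceThreshold looks only at X and never at y, so it also works for unsupervised problems.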