机器学习编程sklearn常用语句

想随时随地写出一段机器学习代码吗？
想不靠上网找教程就完成对一个csv数据表格的处理吗？
一直感觉理论也不太会，就算知道一点实际上手一句代码也写不出来，还是要靠搜教程复制粘贴度日吗？
一直知道数据拿到手要清洗，处理，分割。那你可以秒写出如何分割吗？

关于ML。很多人说不需要掌握数学知识你也可以学得很好，对此我不反驳，但我也难以同意。同时这也不是本文要讨论的重点。
本文就假设我们数学基础小学毕业，这样能做到码出漂亮的代码吗？
Of course!
来看看那些你必会的ML,DL常用语句。
陆续更新…
须知：本文仅适用于有一定python编程经验，与小规模数据(sklearn自带数据集)处理的经验的初学者。
以下内容均为必会必背：

1.关于各种库的导入

import numpy as np
import matplotlib.pyplot as plt
#import pylab as plt等同于上句
import pandas as pd

2.关于数据分割

2.1最基本的方法，无验证集。

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.25, random_state = 0)

3.关于数据预处理

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

4.模型导入及训练

4.1逻辑回归：

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

---------------------------------------------我是分割线---------------------------------------
停一下。一个二分类问题在上面就被你解决了！这就是sklearn的魅力！所以很多人会说ML很简单，其实其他类似的多分类，回归问题与上类似~OK，我这里暂停下就是告诉以下上面就是一个完整的模板。下面继续背我们的常用语句：

0.常见小数据集的加载

本小节参考：https://www.cnblogs.com/nolonely/p/6980160.html

乳腺癌数据集load-barest-cancer（）：简单经典的用于二分类任务的数据集
糖尿病数据集：load-diabetes（）：经典的用于回归认为的数据集，值得注意的是，这10个特征中的每个特征都已经被处理成0均值，方差归一化的特征值，
波士顿房价数据集：load-boston（）：经典的用于回归任务的数据集
体能训练数据集：load_linnerud（）：经典的用于多变量回归任务的数据集，其内部包含两个小数据集：Excise是对3个训练变量的20次观测（体重，腰围，脉搏），physiological是对3个生理学变量的20次观测（引体向上，仰卧起坐，立定跳远）

from sklearn.datasets import load_iris
#加载数据集
iris=load_iris()
iris.keys()　　#dict_keys(['target', 'DESCR', 'data', 'target_names', 'feature_names'])
#数据的条数和维数
n_samples,n_features=iris.data.shape
print("Number of sample:",n_samples)  #Number of sample: 150
print("Number of feature",n_features)　　#Number of feature 4
#第一个样例
print(iris.data[0])　　　　　　#[ 5.1  3.5  1.4  0.2]
print(iris.data.shape)　　　　#(150, 4)
print(iris.target.shape)　　#(150,)
print(iris.target)

1.关于各种库的导入

1.1fundamental

import numpy as np
import matplotlib.pyplot as plt
#import pylab as plt等同于上句
import pandas as pd
import seaborn as sns

1.2model_selection常用库

from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import GridSearchCV

1.3其他常用库：

import warnings
warnings.filterwarnings("ignore")
from sklearn import metrics
from sklearn.externals import joblib

1.4指标评价库：
https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules

from sklearn.metrics import f1_score
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
f1_score(y_true, y_pred, average='macro')
f1_score(y_true, y_pred, average='micro')
f1_score(y_true, y_pred, average='weighted')
f1_score(y_true, y_pred, average=None)

2.关于数据分割

2.1最基本的方法，无验证集。

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.25, random_state = 0)

3.关于数据预处理

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import  PolynomialFeatures
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

4.模型导入及训练

4.1逻辑回归：

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

4.2 LinearRegression

from sklearn.linear_model import LinearRegression

4.3多项式回归

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import  PolynomialFeatures
poly_reg =PolynomialFeatures(degree=2)
X_ploy =poly_reg.fit_transform(X_train)
lin_reg_2=LinearRegression()
lin_reg_2.fit(X_ploy,y_train)
print("截距：",lin_reg_2.intercept_)
print("回归系数：",lin_reg_2.coef_)
y_pred1 = lin_reg_2.predict(poly_reg.fit_transform(X_test))

4.4 岭回归

from sklearn.linear_model import Ridge
linreg = Ridge()
linreg.fit(X_train, y_train)
print("----------------------------")
print("截距：",linreg.intercept_)
print("回归系数：",linreg.coef_)
y_pred1 = linreg.predict(X_test)
print("平均相对误差：",np.mean(np.abs(y_test-y_pred1)/y_test))
print("最大相对误差:",np.max(np.abs(y_test-y_pred1)/y_test))
print("MAE:",metrics.mean_absolute_error(y_test, y_pred1))
print("MSE:",metrics.mean_squared_error(y_test, y_pred1))
print("RMSE:",np.sqrt(metrics.mean_squared_error(y_test, y_pred1)))
print("R2:",metrics.r2_score(y_test, y_pred1))

4.4.2 RidgeCV
https://blog.csdn.net/ssswill/article/details/86411009

from sklearn import linear_model
reg = linear_model.RidgeCV(alphas=[0.1, 1.0, 10.0], cv=3)
reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
print(reg.alpha_)

若需要改变cv参数：

from slkearn.model_selection import KFold,ShuffleSplit
kfold = KFold(n_splits=3)
shuffle_split = ShuffleSplit(test_size=.5,n_splits=10)
ridge1 = RidfeCV(alphas = [.1,1,10,100],cv = kfold)
ridge2 = RidfeCV(alphas = [.1,1,10,100],cv = shuffle_split)

4.5Lasso回归

from sklearn.linear_model import LassoCV
linreg = LassoCV()
linreg.fit(X_train, y_train)
print('最佳的alpha：',linreg.alpha_)
print("----------------------------")
print("截距：",linreg.intercept_)
print("回归系数：",linreg.coef_)
y_pred1 = linreg.predict(X_test)

画lasso回归权重图：

import matplotlib.pyplot as plt
from sklearn.preprocessing import StandrdScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoCV
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0,test_size=0.15)
scaler2 = StandardScaler()
X_train = scaler2.fit_transform(X_train)
X_test = scaler2.transform(X_test)
print("ss处理之后的数据",X_test)
linreg = LassoCV()
linreg.fit(X_train, y_train)
print('最佳的alpha：',linreg.alpha_)
print("----------------------------")
print("截距：",linreg.intercept_)
print("回归系数：",linreg.coef_)
y_pred4 = linreg.predict(X_test)
coef = pd.Series(linreg.coef_, index = X_train.columns)
imp_coef = pd.concat([coef.sort_values().head(10), coef.sort_values().tail(10)])
#选头尾各10条，.sort_values() 可以将某一列的值进行排序。
#matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
plt.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Coefficients in the Lasso Model")
plt.show()

4.6 SVR

from sklearn.svm import SVR

4.7 随机森林回归

from sklearn.ensemble import RandomForestRegressor

4.8随机森林分类

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, max_depth=2,random_state=0)
clf.fit(X, y)
print(clf.feature_importances_)
print(clf.predict([[0, 0, 0, 0]]))

5.模型保存及导入

from sklearn.externals import joblib
joblib.dump(grid_search.best_estimator_, '/Users/will/Desktop/sklearn_model/k2_svm.pickle')
model = joblib.load('/Users/will/Desktop/sklearn_modle/k2modle.pickle')
y_pred_svr1 = model.predict(X_test)

6.网格搜索寻找超参数

para_grid = {'C':[0.0001,0.001,0.01,0.1,1,10,100],'gamma':[0.001,0.01,0.1,1,10,100,1000]}
t0 = time.time()
# grid_search = GridSearchCV(SVR(kernel="rbf"),para_grid,cv=5,scoring="neg_mean_absolute_error")
grid_search = GridSearchCV(SVR(kernel="rbf"),para_grid,cv=5)
grid_search.fit(X_train,y_train)
print("网格搜索用时：%.3f 秒"%(time.time()-t0))
print("最佳参数：{}".format(grid_search.best_params_))

关于网格搜索的更多内容请前往：
https://blog.csdn.net/ssswill/article/details/86373659

6.关于可视化

plt.rcParams['figure.figsize'] = (8.0, 4.0) # 设置figure_size尺寸
plt.rcParams['image.interpolation'] = 'nearest' # 设置 interpolation style
plt.rcParams['image.cmap'] = 'gray' # 设置 颜色 style

#figsize(12.5, 4) # 设置 figsize
plt.rcParams['savefig.dpi'] = 300 #图片像素
plt.rcParams['figure.dpi'] = 300 #分辨率
# 默认的像素：[6.0,4.0]，分辨率为100，图片尺寸为 600&400
# 指定dpi=200，图片尺寸为 1200*800
# 指定dpi=300，图片尺寸为 1800*1200
# 设置figsize可以在不改变分辨率情况下改变比例

具体参考：
https://blog.csdn.net/ssswill/article/details/86411009

7.关于交叉验证

7.1关于cross_val_score函数
sklearn是利用model_selection模块中的cross_val_score函数来实现交叉验证的。

The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the dataset.
The following example demonstrates how to estimate the accuracy of a linear kernel support vector machine on the iris dataset by splitting the data, fitting a model and computing the score 5 consecutive times (with different splits each time):

from sklearn.model_selection import cross_val_score
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
scores

array([0.96..., 1.  ..., 0.96..., 0.96..., 1.        ])

By default, the score computed at each CV iteration is the score method of the estimator. It is possible to change this by using the scoring parameter:
修改评分函数：

from sklearn import metrics
scores = cross_val_score(clf, iris.data, iris.target, cv=5, scoring='f1_macro')

修改cv:

from sklearn.model_selection import ShuffleSplit
n_samples = iris.data.shape[0]
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
cross_val_score(clf, iris.data, iris.target, cv=cv)

7.2KFold

import numpy as np
from sklearn.model_selection import KFoldX = ["a", "b", "c", "d"]
kf = KFold(n_splits=2)
for train, test in kf.split(X):print("%s %s" % (train, test))[2 3] [0 1]
[0 1] [2 3]

7.3Leave One Out (LOO)

from sklearn.model_selection import LeaveOneOutX = [1, 2, 3, 4]
loo = LeaveOneOut()
for train, test in loo.split(X):print("%s %s" % (train, test))
[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]

7.4 Random permutations cross-validation a.k.a. Shuffle & Split

from sklearn.model_selection import ShuffleSplit
X = np.arange(10)
ss = ShuffleSplit(n_splits=5, test_size=0.25,random_state=0)
for train_index, test_index in ss.split(X):print("%s %s" % (train_index, test_index))
[9 1 6 7 3 0 5] [2 8 4]
[2 9 8 0 6 7 4] [3 5 1]
[4 5 1 0 6 9 7] [2 3 8]
[2 7 5 8 0 3 4] [6 1 9]
[4 1 0 6 8 9 3] [5 2 7]

7.5 Stratified k-fold

from sklearn.model_selection import StratifiedKFoldX = np.ones(10)
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
skf = StratifiedKFold(n_splits=3)
for train, test in skf.split(X, y):print("%s %s" % (train, test))
[2 3 6 7 8 9] [0 1 4 5]
[0 1 3 4 5 8 9] [2 6 7]
[0 1 2 4 5 6 7] [3 8 9]

7.6 Stratified Shuffle Split

from sklearn.model_selection import StratifiedShuffleSplit
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 0, 1, 1, 1])
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
sss.get_n_splits(X, y)print(sss)       for train_index, test_index in sss.split(X, y):print("TRAIN:", train_index, "TEST:", test_index)X_train, X_test = X[train_index], X[test_index]y_train, y_test = y[train_index], y[test_index]
TRAIN: [5 2 3] TEST: [4 1 0]
TRAIN: [5 1 4] TEST: [0 2 3]
TRAIN: [5 0 2] TEST: [4 3 1]
TRAIN: [4 1 0] TEST: [2 3 5]
TRAIN: [0 5 1] TEST: [3 4 2]

7.7Group k-fold

from sklearn.model_selection import GroupKFoldX = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]gkf = GroupKFold(n_splits=3)
for train, test in gkf.split(X, y, groups=groups):print("%s %s" % (train, test))
[0 1 2 3 4 5] [6 7 8 9]
[0 1 2 6 7 8 9] [3 4 5]
[3 4 5 6 7 8 9] [0 1 2]

对于分类：sklearn默认使用分层k折cv。
回归：标准k折cv
常用cv:kfold,stratified,groupkfold

8.降维

8.1PCA降维
参考：https://www.cnblogs.com/pinard/p/6243025.html
https://blog.csdn.net/u013597931/article/details/80066641

from sklearn.decomposition import PCA
pca = PCA(n_components=3)
pca.fit(X)
print pca.explained_variance_ratio_
print pca.explained_variance_
X_new = pca.transform(X)
#X_new = pca.fit_transform(X)

9.贝叶斯优化

参考：
https://www.cnblogs.com/yangruiGB2312/p/9374377.html

#pip install bayesian-optimization
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_score
from bayes_opt import BayesianOptimization# 产生随机分类数据集，10个特征， 2个类别
x, y = make_classification(n_samples=1000,n_features=10,n_classes=2)

若不调参：

rf = RandomForestClassifier()
print(np.mean(cross_val_score(rf, x, y, cv=20, scoring='roc_auc')))
>>> 0.9355889743589744

尝试调参：
我们先定义一个目标函数，里面放入我们希望优化的函数。比如此时，函数输入为随机森林的所有参数，输出为模型交叉验证5次的AUC均值，作为我们的目标函数。因为bayes_opt库只支持最大值，所以最后的输出如果是越小越好，那么需要在前面加上负号，以转为最大值。由于bayes优化只能优化连续超参数，因此要加上int()转为离散超参数。

def rf_cv(n_estimators, min_samples_split, max_features, max_depth):val = cross_val_score(RandomForestClassifier(n_estimators=int(n_estimators),min_samples_split=int(min_samples_split),max_features=min(max_features, 0.999), # floatmax_depth=int(max_depth),random_state=2),x, y, 'roc_auc', cv=5).mean()return val

然后我们就可以实例化一个bayes优化对象了：

 rf_bo = BayesianOptimization(rf_cv,{'n_estimators': (10, 250),'min_samples_split': (2, 25),'max_features': (0.1, 0.999),'max_depth': (5, 15)})

里面的第一个参数是我们的优化目标函数，第二个参数是我们所需要输入的超参数名称，以及其范围。超参数名称必须和目标函数的输入名称一一对应。

完成上面两步之后，我们就可以运行bayes优化了！

rf_bo.maximize()

完成的时候会不断地输出结果，如下图所示：

等到程序结束，我们可以查看当前最优的参数和结果：

rf_bo.max

{'target': 0.9896799999999999,'params': {'max_depth': 12.022326832956438,'max_features': 0.42437136034968226,'min_samples_split': 17.51437357464919,'n_estimators': 116.69549115408005}}