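Main Steps of a Machine Learning Project

  1. Frame the problem and look at the big picture.
  2. Get the data and explore it to gain insights.
  3. Prepare the data to better expose the underlying patterns to machine learning algorithms.
  4. Explore many different models and shortlist the best ones.
  5. Fine-tune the models and combine them into a strong solution.
  6. Present your solution.
  7. Launch, monitor, and maintain your system.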

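Framing the Problem and Looking at the Big Picture

The dataset is based on the 1990 California census. For each block group it contains metrics such as population, median income, and median house value.
A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000). We call them "districts" for short.
Your model should learn from this data and then predict the median house value of any district from the other metrics.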

Evaluation Metric: RMSE

$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)} - \hat{y}^{(i)}\right)^{2}}$$
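As a quick check of the formula, here is a minimal sketch that computes RMSE with only the standard library, where m is the number of instances, y the true values, and ŷ the predictions. The function name `rmse` and the sample values are illustrative assumptions, not part of the housing data.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared difference."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    m = len(y_true)
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / m)

# Hypothetical house values and predictions (in dollars)
y_true = [200000.0, 150000.0, 300000.0]
y_pred = [210000.0, 140000.0, 330000.0]
print(rmse(y_true, y_pred))
```

Because the differences are squared before averaging, the single 30,000 error weighs more than the two 10,000 errors combined, which is why RMSE is sensitive to large mistakes.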

Data Exploration

Importing the Libraries and Loading the Data

%matplotlib inline
# Import third-party libraries
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Make Chinese characters and minus signs render correctly in plots
mpl.rcParams['font.sans-serif'] = 'SimHei'
mpl.rcParams['axes.unicode_minus'] = False
# Load the data
housing_data = pd.read_csv('housing.csv')
housing_data.head(3)
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY

A First Look at the Data

# Summarize the columns, dtypes, and non-null counts
housing_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
# Summary statistics of the numeric attributes
housing_data.describe()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 20640.000000 20640.000000 20640.000000 20640.000000 20433.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean -119.569704 35.631861 28.639486 2635.763081 537.870553 1425.476744 499.539680 3.870671 206855.816909
std 2.003532 2.135952 12.585558 2181.615252 421.385070 1132.462122 382.329753 1.899822 115395.615874
min -124.350000 32.540000 1.000000 2.000000 1.000000 3.000000 1.000000 0.499900 14999.000000
25% -121.800000 33.930000 18.000000 1447.750000 296.000000 787.000000 280.000000 2.563400 119600.000000
50% -118.490000 34.260000 29.000000 2127.000000 435.000000 1166.000000 409.000000 3.534800 179700.000000
75% -118.010000 37.710000 37.000000 3148.000000 647.000000 1725.000000 605.000000 4.743250 264725.000000
max -114.310000 41.950000 52.000000 39320.000000 6445.000000 35682.000000 6082.000000 15.000100 500001.000000
# Plot the pairwise relationships between attributes
sns.pairplot(housing_data, kind='scatter', diag_kind='kde')
<seaborn.axisgrid.PairGrid at 0x1af0af9e608>

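Creating a Test Set

A poorly chosen split lets information about the test set influence model selection, so the model ends up tuned toward that particular test set and the measured error looks better than it really is.

1. Random split: train_test_split
Splits the data at random. This is adequate when the dataset is large.
2. Stratified sampling: StratifiedShuffleSplit
Splits the data so that each category (stratum) keeps the same proportion in the test set as in the full dataset.

To keep the test set stable across runs, even after new data is added (the new data should contribute the same proportion to the test set, which is fairer and more reliable), compute a hash of each instance's unique identifier and put the instance in the test set when the last byte of the hash is <= 51 (52 of the 256 possible byte values, about 20%).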
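The hash-based rule can be sketched as follows. This is a minimal illustration, assuming the row index serves as the unique identifier and MD5 as the hash function; the helper names `in_test_set` and `split_by_id` are hypothetical and do not come from any library.

```python
import hashlib

def in_test_set(identifier, threshold=51):
    """Return True when the last byte of the identifier's MD5 hash is <= threshold."""
    digest = hashlib.md5(str(identifier).encode("utf-8")).digest()
    return digest[-1] <= threshold

def split_by_id(identifiers):
    """Split identifiers into (train_ids, test_ids) using the hash rule."""
    train_ids, test_ids = [], []
    for identifier in identifiers:
        (test_ids if in_test_set(identifier) else train_ids).append(identifier)
    return train_ids, test_ids

# Assumed identifiers: the row indices 0..20639 of the housing data
train_ids, test_ids = split_by_id(range(20640))
print(len(test_ids) / 20640)  # share of rows assigned to the test set

# Appending new rows later never moves an existing row between the two sets
_, test_ids_after_update = split_by_id(range(25000))
print(set(test_ids) <= set(test_ids_after_update))
```

The assignment depends only on the identifier, so it stays the same no matter how many rows are added or in which order they are processed. Using the row index as identifier is only safe if new rows are always appended and no row is ever deleted.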

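Suppose one attribute strongly influences the target (say the median income, 'median_income'). Split the data both ways and compare the category proportions in the resulting sets.

# Distribution of the raw median income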
housing_data['median_income'].plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x1af0e828108>

# First create an income category attribute
housing_data['income_cat'] = np.ceil(housing_data['median_income'] / 1.5)
housing_data['income_cat'].value_counts(ascending=True).plot.bar()
# Merge every category above 5 into category 5
housing_data['income_cat'] = housing_data['income_cat'].where(housing_data['income_cat'] < 5, 5)
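To make the binning rule concrete (divide by 1.5, round up, cap at 5), here is a small standard-library sketch applied to three median_income values that appear in this dataset's outputs. The function name `income_category` is an illustrative assumption.

```python
import math

def income_category(median_income, bin_width=1.5, max_category=5):
    """Bin a median income into a category and cap the top category."""
    category = math.ceil(median_income / bin_width)
    return min(category, max_category)

for income in (0.4999, 2.8616, 8.3252):
    print(income, "->", income_category(income))
```

Capping the top category keeps every stratum large enough to be sampled reliably; without it, the highest income bins would contain very few districts.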

# Category proportions in the full dataset
housing_data['income_cat'].value_counts(normalize=True)
3.0    0.350581
2.0    0.318847
4.0    0.176308
5.0    0.114438
1.0    0.039826
Name: income_cat, dtype: float64
# Category proportions after a purely random split
from sklearn.model_selection import train_test_split
train, test = train_test_split(housing_data, test_size=0.2, random_state=2020)
for data in (train, test):
    print('Category proportions:', data['income_cat'].value_counts(normalize=True))
Category proportions: 3.0    0.351320
2.0    0.317042
4.0    0.178295
5.0    0.113372
1.0    0.039971
Name: income_cat, dtype: float64
Category proportions: 3.0    0.347626
2.0    0.326066
4.0    0.168362
5.0    0.118702
1.0    0.039244
Name: income_cat, dtype: float64
# Category proportions after stratified sampling: the training and test sets are almost identical
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=2020)
for train_index, test_index in split.split(housing_data, housing_data['income_cat']):
    train = housing_data.loc[train_index]
    test = housing_data.loc[test_index]
for data in (train, test):
    print('Category proportions:', data['income_cat'].value_counts(normalize=True))
Category proportions: 3.0    0.350594
2.0    0.318859
4.0    0.176296
5.0    0.114402
1.0    0.039850
Name: income_cat, dtype: float64
Category proportions: 3.0    0.350533
2.0    0.318798
4.0    0.176357
5.0    0.114583
1.0    0.039729
Name: income_cat, dtype: float64
# Remove the helper column so the data returns to its original state
for dataset in (train, test, housing_data):
    dataset.drop('income_cat', axis=1, inplace=True)
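The difference between the two splits is easier to judge as a relative error. The sketch below reuses the test-set proportions printed above and compares each with the full-dataset proportion; the helper name `percent_error` is an illustrative assumption.

```python
# Proportions copied from the outputs above (income category -> share of rows)
overall = {1: 0.039826, 2: 0.318847, 3: 0.350581, 4: 0.176308, 5: 0.114438}
random_test = {1: 0.039244, 2: 0.326066, 3: 0.347626, 4: 0.168362, 5: 0.118702}
strat_test = {1: 0.039729, 2: 0.318798, 3: 0.350533, 4: 0.176357, 5: 0.114583}

def percent_error(sample, reference):
    """Relative deviation (in percent) of each category share from the reference."""
    return {cat: 100.0 * (sample[cat] / reference[cat] - 1.0) for cat in reference}

random_error = percent_error(random_test, overall)
strat_error = percent_error(strat_test, overall)

print("cat  random %err  stratified %err")
for cat in sorted(overall):
    print(f"{cat:>3}  {random_error[cat]:>11.2f}  {strat_error[cat]:>15.2f}")
```

In this run the stratified test set stays within a fraction of a percent of the full dataset for every category, while the random test set is off by several percent for some categories.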

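Visualizing Geographical Data

# Work on a copy of the data (DataFrame.copy() makes a deep copy by default)
housing = housing_data.copy()
# Scatter plot: marker size shows the population, color shows the median house value
housing.plot.scatter(x='longitude', y='latitude', alpha=0.1,
                     s=housing['population']/100, label='population',
                     c='median_house_value', cmap=plt.get_cmap('jet'), colorbar=True)
plt.legend()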
<matplotlib.legend.Legend at 0x1af11da8e08>

# Overlay the scatter plot on a map of California
import matplotlib.image as mpimg
california_img = mpimg.imread('california.png')
ax = housing.plot(kind='scatter', x='longitude', y='latitude', alpha=0.1,
                  s=housing['population']/100, label='population',
                  c='median_house_value', cmap=plt.get_cmap('jet'), colorbar=False)
plt.imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5,
           cmap=plt.get_cmap("jet"))
plt.xlabel('longitude', fontsize=14)
plt.ylabel('latitude', fontsize=14)

# Label the color bar with house values in thousands of dollars
price = housing['median_house_value']
tick_values = np.linspace(price.min(), price.max(), 11)
cbar = plt.colorbar()
cbar.ax.set_yticklabels(['$%dk' % (round(t/1000)) for t in tick_values], fontsize=16)
cbar.set_label('Median House Value', fontsize=16)
plt.legend(fontsize=16)
plt.show()

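# Correlation analysis: the larger the absolute value, the stronger the linear correlation
# (a strongly correlated attribute is a good candidate for stratified sampling)
# On pandas >= 2.0 use housing.corr(numeric_only=True), because ocean_proximity is a string column
corr = housing.corr()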
corr['median_house_value'].sort_values(ascending=False)
median_house_value    1.000000
median_income         0.688075
total_rooms           0.134153
housing_median_age    0.105623
households            0.065843
total_bedrooms        0.049686
population           -0.024650
longitude            -0.045967
latitude             -0.144160
Name: median_house_value, dtype: float64
# sns.pairplot (used earlier) or pandas.plotting.scatter_matrix can draw these plots
# Here we plot only the attributes whose absolute correlation with the target exceeds 0.1
# Method 1
target_corr = corr['median_house_value']
corrfea = list(target_corr[(abs(target_corr) > 0.1) & (target_corr != -1)].index)
sns.pairplot(housing[corrfea], kind='scatter', diag_kind='kde')
<seaborn.axisgrid.PairGrid at 0x1af125f6188>
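The coefficient returned by corr() is, by default, the Pearson correlation, which measures only linear relationships. The sketch below computes it by hand on made-up data to show that a coefficient near zero does not rule out a strong nonlinear dependence; `pearson_r` is an illustrative helper, not a pandas function.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(pearson_r(xs, [2 * x + 1 for x in xs]))  # linear and increasing
print(pearson_r(xs, [-3 * x for x in xs]))     # linear and decreasing
print(pearson_r(xs, [x ** 2 for x in xs]))     # fully dependent, but not linear
```

This is one reason to look at the scatter plots as well as the coefficients: a pattern such as the quadratic one in the last line would be invisible in the correlation table.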

Attribute Combinations

# Combine attributes into ratios that may be more informative than the raw totals
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]
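# Check the correlations again, including the new attributes
# (on pandas >= 2.0 use housing.corr(numeric_only=True))
corr = housing.corr()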
corr['median_house_value'].sort_values(ascending=False)
median_house_value          1.000000
median_income               0.688075
rooms_per_household         0.151948
total_rooms                 0.134153
housing_median_age          0.105623
households                  0.065843
total_bedrooms              0.049686
population_per_household   -0.023737
population                 -0.024650
longitude                  -0.045967
latitude                   -0.144160
bedrooms_per_room          -0.255880
Name: median_house_value, dtype: float64

Data Cleaning

housing = train.drop('median_house_value', axis=1)
housing_label = train['median_house_value'].copy()

Handling Missing Values

  1. Drop the rows that contain missing values
  2. Drop the whole attribute
  3. Fill in a value (0, the mean, or the median)
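The three options map directly onto pandas calls. A minimal sketch on a made-up four-row frame follows; the frame `df` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the housing data with one missing total_bedrooms value
df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Option 1: drop the rows that contain a missing value
rows_dropped = df.dropna(subset=["total_bedrooms"])

# Option 2: drop the whole attribute
column_dropped = df.drop("total_bedrooms", axis=1)

# Option 3: fill in a value, here the median of the observed values
median = df["total_bedrooms"].median()
filled = df.assign(total_bedrooms=df["total_bedrooms"].fillna(median))

print(rows_dropped.shape, column_dropped.shape, median)
```

With option 3, the median should be computed on the training set and stored, so that the same value can later fill gaps in the test set and in new data.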

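Missing Values in Numeric Attributes

# Import the imputer for missing values
from sklearn.impute import SimpleImputer
# The median can only be computed on numeric attributes, so the text column is dropped first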
imputer = SimpleImputer(strategy='median')
housing_num = housing.drop('ocean_proximity', axis=1)
imputer.fit(housing_num)
SimpleImputer(add_indicator=False, copy=True, fill_value=None,missing_values=nan, strategy='median', verbose=0)
# imputer.transform returns a 2-D NumPy array; wrap it back into a DataFrame
X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns)
housing_tr.head(2)
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income
0 -121.54 38.29 47.0 1396.0 254.0 630.0 218.0 2.8616
1 -118.60 34.15 28.0 4570.0 744.0 1693.0 695.0 6.1400

Handling Text and Categorical Attributes

# LabelEncoder is meant for a single label column; OrdinalEncoder encodes feature columns
from sklearn.preprocessing import OrdinalEncoder
housing_cat = housing[['ocean_proximity']]
encoder = OrdinalEncoder()
housing_cat_encoder = encoder.fit_transform(housing_cat)
housing_cat_encoder
array([[1.],[0.],[3.],...,[0.],[1.],[0.]])
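# One-hot encoding turns each category into its own binary (0/1) column
from sklearn.preprocessing import OneHotEncoder
# By default the encoder returns a sparse matrix; sparse=False returns a dense array
# (scikit-learn 1.2 renamed this parameter to sparse_output)
encoder = OneHotEncoder()
housing_cat_encoder = encoder.fit_transform(housing_cat)
housing_cat_encoder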
<16512x5 sparse matrix of type '<class 'numpy.float64'>'with 16512 stored elements in Compressed Sparse Row format>
# Use sparse=False or toarray() to convert the sparse matrix into a dense array
housing_cat_encoder.toarray()
array([[0., 1., 0., 0., 0.],[1., 0., 0., 0., 0.],[0., 0., 0., 1., 0.],...,[1., 0., 0., 0., 0.],[0., 1., 0., 0., 0.],[1., 0., 0., 0., 0.]])
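The relationship between the two encodings can be reproduced by hand. The sketch below assumes the five ocean_proximity categories are held in sorted order, which is consistent with the codes 1, 0, 3 printed above for the rows INLAND, <1H OCEAN, NEAR BAY; the helper names are illustrative.

```python
# Categories in sorted order (assumed to match the order the encoders learned)
categories = ['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']

def ordinal_code(value):
    """Index of the category, as an ordinal encoding would assign it."""
    return categories.index(value)

def one_hot(value):
    """Binary vector with a single 1 at the category's index."""
    code = ordinal_code(value)
    return [1 if i == code else 0 for i in range(len(categories))]

for value in ('INLAND', '<1H OCEAN', 'NEAR BAY'):
    print(value, ordinal_code(value), one_hot(value))
```

Ordinal codes imply an order and a distance between categories (ISLAND would sit "between" INLAND and NEAR BAY), which has no meaning here. One-hot vectors avoid that, at the cost of one column per category.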

Custom Transformers

A Transformer That Adds Combined Attributes

# TransformerMixin as a base class provides fit_transform();
# BaseEstimator as a base class provides get_params() and set_params()
from sklearn.base import BaseEstimator, TransformerMixin

# Column positions of the attributes to combine
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=True)
housing_extra_feature = attr_adder.fit_transform(housing.values)
housing_extra_attributes = pd.DataFrame(
    housing_extra_feature,
    columns=list(housing.columns) + ['rooms_per_household', 'population_per_household', 'add_bedrooms_per_room'],
    index=housing.index)
housing_extra_attributes.head(3)
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income ocean_proximity rooms_per_household population_per_household add_bedrooms_per_room
13143 -121.54 38.29 47 1396 254 630 218 2.8616 INLAND 6.40367 2.88991 0.181948
4017 -118.6 34.15 28 4570 744 1693 695 6.14 <1H OCEAN 6.57554 2.43597 0.162801
1408 -122.06 37.94 19 4005 972 1896 893 2.5268 NEAR BAY 4.48488 2.12318 0.242697
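To see what each base class contributes, here is a stripped-down sketch with a hypothetical transformer `RatioAdder` (not part of scikit-learn or of the pipeline above) that appends the ratio of its first two columns.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: appends column 0 divided by column 1."""

    def __init__(self, add_ratio=True):
        self.add_ratio = add_ratio

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        if not self.add_ratio:
            return X
        return np.c_[X, X[:, 0] / X[:, 1]]

X = np.array([[6.0, 2.0], [9.0, 3.0]])
adder = RatioAdder()
print(adder.get_params())      # provided by BaseEstimator
print(adder.fit_transform(X))  # provided by TransformerMixin
adder.set_params(add_ratio=False)
print(adder.fit_transform(X).shape)
```

Exposing the switch as a constructor argument is what lets a hyperparameter search turn the extra attribute on and off like any other parameter.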

Feature Scaling

  • Min-max scaling (normalization)
  • Standardization
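Both methods are one-line formulas. The sketch below applies them with NumPy to five values taken from the total_rooms column of the describe() output (min, quartiles, max); treating those five numbers as a sample is an assumption made purely for illustration.

```python
import numpy as np

# min, 25%, 50%, 75%, and max of total_rooms from the describe() output
x = np.array([2.0, 1447.75, 2127.0, 3148.0, 39320.0])

# Min-max scaling: (x - min) / (max - min), so results lie in [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std, so results have zero mean and unit variance
standardized = (x - x.mean()) / x.std()

print(min_max)
print(standardized)
```

Min-max scaling is sensitive to outliers: the single value 39320 pushes all other values close to 0. Standardization has no fixed output range but is much less affected, which is one reason the pipeline in this article uses StandardScaler.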
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('combine', CombinedAttributesAdder(add_bedrooms_per_room=True)),
    ('std_scaler', StandardScaler()),
])
housing_num_tr = num_pipeline.fit_transform(housing_num)

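Transformation Pipelines

Method 1

# ColumnTransformer applies each transformer to its own columns and concatenates the results,
# which covers the use case FeatureUnion served before
from sklearn.compose import ColumnTransformer

# Names of the numeric and categorical attributes
num_attribs = list(housing_num.columns)
cat_attribs = ['ocean_proximity']
full_pipeline = ColumnTransformer([
    ('num', num_pipeline, num_attribs),
    ('cat', OneHotEncoder(), cat_attribs),
])
housing_prepared = full_pipeline.fit_transform(housing)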

Method 2

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion

# Transformer that selects the numeric or categorical columns of a DataFrame
class OldDataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

old_num_pipeline = Pipeline([
    ('selector', OldDataFrameSelector(num_attribs)),
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder(add_bedrooms_per_room=True)),
    ('std_scaler', StandardScaler()),
])

old_cat_pipeline = Pipeline([
    ('selector', OldDataFrameSelector(cat_attribs)),
    # On scikit-learn >= 1.2 this parameter is named sparse_output
    ('cat_encoder', OneHotEncoder(sparse=False)),
])

old_full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", old_num_pipeline),
    ("cat_pipeline", old_cat_pipeline),
])
old_housing_prepared = old_full_pipeline.fit_transform(housing)
old_housing_prepared
array([[-0.98322883,  1.24682673,  1.45453937, ...,  0.        ,0.        ,  0.        ],[ 0.48470644, -0.69523534, -0.05360975, ...,  0.        ,0.        ,  0.        ],[-1.24286364,  1.08264274, -0.76799617, ...,  0.        ,1.        ,  0.        ],...,[ 0.87915163, -0.82189156, -0.76799617, ...,  0.        ,0.        ,  0.        ],[-0.86839036,  1.42039267, -0.21236229, ...,  0.        ,0.        ,  0.        ],[ 0.66445361, -0.78905476, -0.60924363, ...,  0.        ,0.        ,  0.        ]])
# Check that the two methods produce the same result
np.allclose(housing_prepared, old_housing_prepared)
True

Selecting and Training a Model

Machine Learning Models

Linear Regression

# Import and train a linear regression model
from sklearn.linear_model import LinearRegression
linear = LinearRegression()
linear.fit(housing_prepared, housing_label)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
some_data = housing.iloc[:5]
some_labels = housing_label[:5]
some_data_prepared = full_pipeline.transform(some_data)
print('predict_data', linear.predict(some_data_prepared))
print('true_label',some_labels)
predict_data [137956.75615791 329050.91590395 195393.9574616   78921.3150396312160.31037651]
true_label 13143     92500.0
4017     361900.0
1408     235700.0
20076    128100.0
8724     343400.0
Name: median_house_value, dtype: float64
# Evaluate with RMSE: take the square root of the MSE computed by scikit-learn
from sklearn.metrics import mean_squared_error
lin_mse = mean_squared_error(housing_label, linear.predict(housing_prepared))
lin_rmse = np.sqrt(lin_mse)
print(lin_rmse)
68517.45734745667

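Think: the 68-95-99.7 Rule

If the prediction errors follow a zero-mean Gaussian distribution with standard deviation $\sigma = \mathrm{RMSE} \approx 68517$, then about 68% of the predictions lie within $\sigma$ of the true value, 95% within $2\sigma$, and 99.7% within $3\sigma$.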
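The three percentages follow from the Gaussian distribution: the probability of falling within k standard deviations of the mean is erf(k/√2). A short standard-library check, using the rounded training RMSE as σ (the Gaussian shape of the errors is itself an assumption):

```python
import math

def prob_within(k):
    """P(|X - mu| <= k * sigma) for a Gaussian random variable X."""
    return math.erf(k / math.sqrt(2.0))

sigma = 68517.0  # the training RMSE of the linear model, rounded
for k in (1, 2, 3):
    print(f"within {k} sigma (+/- {k * sigma:,.0f}): {prob_within(k):.4f}")
```

Read this way, roughly one prediction in three would be off by more than about $68,500, which is a large error for house values of this magnitude.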

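# The error is large compared with the spread of the house values
print('Upper quartile of house values:', np.quantile(housing_label, 0.75))
print('Lower quartile of house values:', np.quantile(housing_label, 0.25))
print('Interquartile range (IQR) of house values:', np.quantile(housing_label, 0.75) - np.quantile(housing_label, 0.25))
print('Standard deviation of the errors (RMSE):', lin_rmse)
Upper quartile of house values: 265700.0
Lower quartile of house values: 119300.0
Interquartile range (IQR) of house values: 146400.0
Standard deviation of the errors (RMSE): 68517.45734745667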

Decision Tree

# Decision tree model
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(random_state=2020)
tree_reg.fit(housing_prepared, housing_label)
tree_mse = mean_squared_error(housing_label, tree_reg.predict(housing_prepared))
tree_rmse = np.sqrt(tree_mse)
print(tree_rmse)
0.0
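A training RMSE of 0.0 is measured on the same rows the tree was fitted on, so it says nothing about new data; a fully grown tree can simply memorize the training set. K-fold cross-validation avoids this by always scoring on rows that were held out of training. The sketch below shows only the index bookkeeping, with a hypothetical helper `k_fold_indices`; it is not scikit-learn's implementation and does not shuffle.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # The first n_samples % k folds get one extra row so every row is used once
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

for train_idx, validation_idx in k_fold_indices(n_samples=10, k=5):
    print("train:", train_idx, "validation:", validation_idx)
```

Each row serves as validation data exactly once, so k models are trained and k scores are obtained; their mean and standard deviation give both an estimate of the error and a sense of how precise that estimate is.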

Evaluating with Cross-Validation

from sklearn.model_selection import cross_val_score
# scikit-learn scorers follow "greater is better", so the MSE is returned negated
score = cross_val_score(tree_reg, housing_prepared, housing_label, scoring='neg_mean_squared_error', cv=10)
rmse = np.sqrt(-score)
print(rmse)
[71520.32970547 72827.43168352 68234.54680726 70701.70452467
 67644.34497746 71638.00303369 69866.07182955 73222.25254171
 74694.53250647 73373.62264942]
def display(score):
    print('score', score)
    print('mean', score.mean())
    print('standard deviation', score.std())

display(rmse)
score [71520.32970547 72827.43168352 68234.54680726 70701.70452467
 67644.34497746 71638.00303369 69866.07182955 73222.25254171
 74694.53250647 73373.62264942]
mean 71372.28402592
standard deviation 2171.0902798366255
lin_scores = cross_val_score(linear, housing_prepared, housing_label, scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display(lin_rmse_scores)
score [67895.14487686 67868.97863166 70417.07160255 71775.29869894
 67181.16805776 65564.87996987 70874.45548392 72023.23973997
 67563.89684166 68150.63762242]
mean 68931.47715256
standard deviation 2066.613293278968

Ensemble Learning: Random Forest

from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor(n_estimators=100, random_state=2020)
forest_reg.fit(housing_prepared, housing_label)
forest_score = cross_val_score(forest_reg, housing_prepared, housing_label, scoring='neg_mean_squared_error', cv=10)
forest_rmse = np.sqrt(-forest_score)
display(forest_rmse)
score [49202.3705388  49998.75167494 48988.83338295 50864.11708883
 49898.55604849 50435.6058723  51839.78978468 52355.12488721
 49617.31268373 50219.44424988]
mean 50341.99062118
standard deviation 1027.710228366977

Support Vector Regression

from sklearn.svm import SVR
svr = SVR(kernel='linear')
svr.fit(housing_prepared, housing_label)
svr_scores = cross_val_score(svr, housing_prepared, housing_label, scoring='neg_mean_squared_error', cv=10)
svr_rmse = np.sqrt(-svr_scores)
display(svr_rmse)
score [109298.1493646  111786.20238453 111197.32638411 111688.50872847
 109926.45675119 108017.66775691 115781.19573315 118183.39788143
 112057.81921931 108774.52041572]
mean 111671.12446194
standard deviation 3001.71435465132
standard deviation 3001.71435465132

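Summary

A quick comparison of linear regression, decision tree, random forest, and support vector regression shows that the random forest performs best: every one of its ten cross-validation RMSE scores (roughly 49,000 to 52,400) is lower than any score of the other three models.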
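The ranking can be recomputed from the fold-level scores printed in the cross-validation outputs. The sketch below copies those ten RMSE values per model and summarizes them with the standard library; the dictionary `cv_rmse` is just a container introduced for this illustration.

```python
from statistics import mean, pstdev

# Fold-level RMSE values copied from the cross-validation outputs
cv_rmse = {
    "linear regression": [
        67895.14487686, 67868.97863166, 70417.07160255, 71775.29869894,
        67181.16805776, 65564.87996987, 70874.45548392, 72023.23973997,
        67563.89684166, 68150.63762242],
    "decision tree": [
        71520.32970547, 72827.43168352, 68234.54680726, 70701.70452467,
        67644.34497746, 71638.00303369, 69866.07182955, 73222.25254171,
        74694.53250647, 73373.62264942],
    "random forest": [
        49202.3705388, 49998.75167494, 48988.83338295, 50864.11708883,
        49898.55604849, 50435.6058723, 51839.78978468, 52355.12488721,
        49617.31268373, 50219.44424988],
    "support vector regression (linear kernel)": [
        109298.1493646, 111786.20238453, 111197.32638411, 111688.50872847,
        109926.45675119, 108017.66775691, 115781.19573315, 118183.39788143,
        112057.81921931, 108774.52041572],
}

# Sort models from lowest to highest mean RMSE
for name, scores in sorted(cv_rmse.items(), key=lambda item: mean(item[1])):
    print(f"{name:<42} mean={mean(scores):>10.1f}  std={pstdev(scores):>7.1f}")
```

`pstdev` is the population standard deviation, which matches NumPy's default `std()`. Note that the decision tree, despite its training RMSE of 0.0, scores slightly worse than linear regression once it is evaluated on held-out folds.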

Hyperparameter Tuning

Grid Search

# GridSearchCV runs an exhaustive grid search; RandomizedSearchCV runs a randomized search
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Search the hyperparameters of the random forest model
param_grid = [
    {'n_estimators': [80, 100, 120], 'max_features': [2, 6, 10]},
    {'bootstrap': [False], 'n_estimators': [80, 100, 120], 'max_features': [2, 6, 10]},
]
forest_reg = RandomForestRegressor()
grid_search_forest = GridSearchCV(forest_reg, param_grid=param_grid,
                                  scoring='neg_mean_squared_error', cv=5)
grid_search_forest.fit(housing_prepared, housing_label)
GridSearchCV(cv=5, error_score='raise-deprecating',estimator=RandomForestRegressor(bootstrap=True, criterion='mse',max_depth=None,max_features='auto',max_leaf_nodes=None,min_impurity_decrease=0.0,min_impurity_split=None,min_samples_leaf=1,min_samples_split=2,min_weight_fraction_leaf=0.0,n_estimators='warn', n_jobs=None,oob_score=False, random_state=None,verbose=0, warm_start=False),iid='warn', n_jobs=None,param_grid=[{'max_features': [2, 6, 10],'n_estimators': [80, 100, 120]},{'bootstrap': [False], 'max_features': [2, 6, 10],'n_estimators': [80, 100, 120]}],pre_dispatch='2*n_jobs', refit=True, return_train_score=False,scoring='neg_mean_squared_error', verbose=0)
print('Best parameters:', grid_search_forest.best_params_)
print('Best estimator:', grid_search_forest.best_estimator_)
Best parameters: {'bootstrap': False, 'max_features': 6, 'n_estimators': 120}
Best estimator: RandomForestRegressor(bootstrap=False, criterion='mse', max_depth=None,max_features=6, max_leaf_nodes=None,min_impurity_decrease=0.0, min_impurity_split=None,min_samples_leaf=1, min_samples_split=2,min_weight_fraction_leaf=0.0, n_estimators=120,n_jobs=None, oob_score=False, random_state=None,verbose=0, warm_start=False)
# RMSE for each hyperparameter combination
cvres = grid_search_forest.cv_results_
for mean_score, params in zip(cvres['mean_test_score'], cvres['params']):
    print(np.sqrt(-mean_score), params)
51798.81184450847 {'max_features': 2, 'n_estimators': 80}
51744.38798410818 {'max_features': 2, 'n_estimators': 100}
51643.26253212813 {'max_features': 2, 'n_estimators': 120}
49296.762680659456 {'max_features': 6, 'n_estimators': 80}
49222.094688003315 {'max_features': 6, 'n_estimators': 100}
49279.62135217163 {'max_features': 6, 'n_estimators': 120}
49646.800572027285 {'max_features': 10, 'n_estimators': 80}
49724.62489432864 {'max_features': 10, 'n_estimators': 100}
49629.428116323 {'max_features': 10, 'n_estimators': 120}
50899.56199594138 {'bootstrap': False, 'max_features': 2, 'n_estimators': 80}
50721.52475773612 {'bootstrap': False, 'max_features': 2, 'n_estimators': 100}
50687.53087264414 {'bootstrap': False, 'max_features': 2, 'n_estimators': 120}
48528.74523129022 {'bootstrap': False, 'max_features': 6, 'n_estimators': 80}
48455.76571280133 {'bootstrap': False, 'max_features': 6, 'n_estimators': 100}
48338.671805411715 {'bootstrap': False, 'max_features': 6, 'n_estimators': 120}
49319.85916510398 {'bootstrap': False, 'max_features': 10, 'n_estimators': 80}
49543.27916885783 {'bootstrap': False, 'max_features': 10, 'n_estimators': 100}
49273.546187372995 {'bootstrap': False, 'max_features': 10, 'n_estimators': 120}
pd.DataFrame(cvres)
mean_fit_time std_fit_time mean_score_time std_score_time param_max_features param_n_estimators param_bootstrap params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
0 2.011621 0.024415 0.079205 0.000474 2 80 NaN {'max_features': 2, 'n_estimators': 80} -2.601253e+09 -2.574979e+09 -2.697463e+09 -2.964578e+09 -2.577370e+09 -2.683117e+09 1.476458e+08 18
1 2.511683 0.011244 0.108528 0.019055 2 100 NaN {'max_features': 2, 'n_estimators': 100} -2.572565e+09 -2.619017e+09 -2.630042e+09 -2.962020e+09 -2.603813e+09 -2.677482e+09 1.435668e+08 17
2 3.036288 0.023821 0.119672 0.003513 2 120 NaN {'max_features': 2, 'n_estimators': 120} -2.559635e+09 -2.579147e+09 -2.657559e+09 -2.978728e+09 -2.560123e+09 -2.667027e+09 1.599511e+08 16
3 4.339004 0.036364 0.078590 0.001162 6 80 NaN {'max_features': 6, 'n_estimators': 80} -2.360057e+09 -2.316780e+09 -2.431691e+09 -2.652666e+09 -2.389715e+09 -2.430171e+09 1.173999e+08 7
4 5.479363 0.046203 0.108302 0.019676 6 100 NaN {'max_features': 6, 'n_estimators': 100} -2.335677e+09 -2.329762e+09 -2.421731e+09 -2.650995e+09 -2.375963e+09 -2.422815e+09 1.187522e+08 4
5 6.527553 0.051533 0.118490 0.000739 6 120 NaN {'max_features': 6, 'n_estimators': 120} -2.335004e+09 -2.359826e+09 -2.418344e+09 -2.621551e+09 -2.407730e+09 -2.428481e+09 1.012509e+08 6
6 6.715856 0.039870 0.079579 0.000742 10 80 NaN {'max_features': 10, 'n_estimators': 80} -2.393691e+09 -2.402065e+09 -2.466236e+09 -2.640531e+09 -2.421542e+09 -2.464805e+09 9.137220e+07 11
7 8.390792 0.058251 0.099930 0.001738 10 100 NaN {'max_features': 10, 'n_estimators': 100} -2.414548e+09 -2.390019e+09 -2.481327e+09 -2.647084e+09 -2.429757e+09 -2.472538e+09 9.224318e+07 12
8 10.125356 0.050684 0.117858 0.000731 10 120 NaN {'max_features': 10, 'n_estimators': 120} -2.390161e+09 -2.407785e+09 -2.447772e+09 -2.636661e+09 -2.433061e+09 -2.463080e+09 8.903726e+07 10
9 3.242529 0.043426 0.090175 0.001504 2 80 False {'bootstrap': False, 'max_features': 2, 'n_est... -2.469680e+09 -2.522976e+09 -2.590519e+09 -2.840138e+09 -2.530573e+09 -2.590765e+09 1.304320e+08 15
10 4.027049 0.028559 0.113289 0.003377 2 100 False {'bootstrap': False, 'max_features': 2, 'n_est... -2.449716e+09 -2.523860e+09 -2.551753e+09 -2.822577e+09 -2.515511e+09 -2.572673e+09 1.293469e+08 14
11 4.803779 0.044944 0.134239 0.001020 2 120 False {'bootstrap': False, 'max_features': 2, 'n_est... -2.451552e+09 -2.502831e+09 -2.565545e+09 -2.841947e+09 -2.484309e+09 -2.569226e+09 1.413152e+08 13
12 7.119553 0.043610 0.088762 0.001090 6 80 False {'bootstrap': False, 'max_features': 6, 'n_est... -2.219990e+09 -2.273329e+09 -2.335468e+09 -2.594170e+09 -2.352305e+09 -2.355039e+09 1.284418e+08 3
13 8.998152 0.205699 0.115074 0.003711 6 100 False {'bootstrap': False, 'max_features': 6, 'n_est... -2.245288e+09 -2.234267e+09 -2.349629e+09 -2.598824e+09 -2.311863e+09 -2.347961e+09 1.324405e+08 2
14 10.671701 0.076533 0.135019 0.003926 6 120 False {'bootstrap': False, 'max_features': 6, 'n_est... -2.245318e+09 -2.256408e+09 -2.338156e+09 -2.551095e+09 -2.292210e+09 -2.336627e+09 1.120187e+08 1
15 11.020343 0.034541 0.090557 0.000769 10 80 False {'bootstrap': False, 'max_features': 10, 'n_es... -2.378587e+09 -2.367559e+09 -2.374740e+09 -2.615608e+09 -2.425784e+09 -2.432449e+09 9.384092e+07 8
16 13.830014 0.054733 0.113120 0.001484 10 100 False {'bootstrap': False, 'max_features': 10, 'n_es... -2.384446e+09 -2.357297e+09 -2.454243e+09 -2.641478e+09 -2.435269e+09 -2.454537e+09 9.968472e+07 9
17 16.573276 0.052878 0.136029 0.002560 10 120 False {'bootstrap': False, 'max_features': 10, 'n_es... -2.360931e+09 -2.374058e+09 -2.422504e+09 -2.593451e+09 -2.388504e+09 -2.427882e+09 8.528778e+07 5
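The 18 rows above correspond to the 18 combinations that the two dictionaries in param_grid expand to. The sketch below reproduces that expansion with itertools; the helper `expand` is illustrative and is not how GridSearchCV is implemented internally.

```python
from itertools import product

param_grid = [
    {'n_estimators': [80, 100, 120], 'max_features': [2, 6, 10]},
    {'bootstrap': [False], 'n_estimators': [80, 100, 120], 'max_features': [2, 6, 10]},
]

def expand(grid):
    """Yield every hyperparameter combination described by a list of grids."""
    for sub_grid in grid:
        names = sorted(sub_grid)
        for values in product(*(sub_grid[name] for name in names)):
            yield dict(zip(names, values))

candidates = list(expand(param_grid))
cv = 5
print(len(candidates), "candidates,", len(candidates) * cv, "cross-validation fits")
```

With 5-fold cross-validation each candidate is trained five times, so the cost of a grid search grows multiplicatively with every value added to any list. This is the main motivation for the randomized search that follows.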

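Randomized Search

from scipy.stats import randint
random_params = {'n_estimators': randint(low=80, high=200), 'max_features': randint(low=2, high=8)}
# The original cell misspelled this variable as forset_reg, so the search ran on the earlier
# forest_reg without a random_state; the recorded output therefore shows random_state=None
forest_reg = RandomForestRegressor(random_state=2020)
random_search_forest = RandomizedSearchCV(forest_reg, param_distributions=random_params,
                                          n_iter=20, scoring='neg_mean_squared_error',
                                          cv=5, random_state=2020)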
random_search_forest.fit(housing_prepared, housing_label)
RandomizedSearchCV(cv=5, error_score='raise-deprecating',estimator=RandomForestRegressor(bootstrap=True,criterion='mse',max_depth=None,max_features='auto',max_leaf_nodes=None,min_impurity_decrease=0.0,min_impurity_split=None,min_samples_leaf=1,min_samples_split=2,min_weight_fraction_leaf=0.0,n_estimators='warn',n_jobs=None, oob_score=False,random_sta...iid='warn', n_iter=20, n_jobs=None,param_distributions={'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001AF1BF222C8>,'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001AF1C8E2188>},pre_dispatch='2*n_jobs', random_state=2020, refit=True,return_train_score=False, scoring='neg_mean_squared_error',verbose=0)
cvrus = random_search_forest.cv_results_
for mean_score, params in zip(cvrus['mean_test_score'], cvrus['params']):
    print(np.sqrt(-mean_score), params)
51733.91786298956 {'max_features': 2, 'n_estimators': 88}
49175.402733327974 {'max_features': 5, 'n_estimators': 198}
49072.56124039024 {'max_features': 5, 'n_estimators': 171}
49329.69260340587 {'max_features': 7, 'n_estimators': 83}
51668.64796676789 {'max_features': 2, 'n_estimators': 109}
51626.38924817916 {'max_features': 2, 'n_estimators': 112}
51653.31449218177 {'max_features': 2, 'n_estimators': 154}
50051.18447717637 {'max_features': 3, 'n_estimators': 131}
49002.28716520522 {'max_features': 5, 'n_estimators': 135}
49377.57265696732 {'max_features': 4, 'n_estimators': 142}
49116.19624155855 {'max_features': 5, 'n_estimators': 182}
49261.489886775096 {'max_features': 7, 'n_estimators': 128}
49119.876267968786 {'max_features': 6, 'n_estimators': 100}
51629.26353540449 {'max_features': 2, 'n_estimators': 118}
49126.056364226824 {'max_features': 6, 'n_estimators': 145}
50091.25341409802 {'max_features': 3, 'n_estimators': 159}
49145.173437757796 {'max_features': 7, 'n_estimators': 154}
50167.605677175954 {'max_features': 3, 'n_estimators': 142}
49231.738064127596 {'max_features': 7, 'n_estimators': 109}
49618.93885695293 {'max_features': 4, 'n_estimators': 86}
print('Best parameters:', random_search_forest.best_params_)
print('Best estimator:', random_search_forest.best_estimator_)
Best parameters: {'max_features': 5, 'n_estimators': 135}
Best estimator: RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,max_features=5, max_leaf_nodes=None,min_impurity_decrease=0.0, min_impurity_split=None,min_samples_leaf=1, min_samples_split=2,min_weight_fraction_leaf=0.0, n_estimators=135,n_jobs=None, oob_score=False, random_state=None,verbose=0, warm_start=False)
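One detail of the search space is easy to miss: scipy.stats.randint samples integers from low up to but not including high. That is why max_features never exceeds 7 in the results above. A small check, assuming SciPy is installed (the sample size of 1000 is arbitrary):

```python
from scipy.stats import randint

# Same distribution as used for max_features in the randomized search
max_features_dist = randint(low=2, high=8)

samples = max_features_dist.rvs(size=1000, random_state=2020)
print(samples.min(), samples.max())
```

To let the search consider max_features=8 as well, the distribution would need high=9.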

Analyzing the Best Model

feature_importance = grid_search_forest.best_estimator_.feature_importances_
feature_importance
array([8.24743772e-02, 7.38500878e-02, 4.15492725e-02, 1.72282968e-02,
       1.57304570e-02, 1.63996319e-02, 1.49267635e-02, 3.26874395e-01,
       5.50194623e-02, 1.05546398e-01, 7.65439486e-02, 9.65806956e-03,
       1.55891908e-01, 2.14700630e-05, 3.55266323e-03, 4.73279777e-03])
# Pair each importance score with its attribute name, in the pipeline's output order
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_encoder = full_pipeline.named_transformers_['cat']
cat_onehot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_onehot_attribs
sorted(zip(feature_importance, attributes), reverse=True)
[(0.3268743953207875, 'median_income'),
 (0.15589190813712248, 'INLAND'),
 (0.10554639838437366, 'pop_per_hhold'),
 (0.08247437715821127, 'longitude'),
 (0.07654394856495599, 'bedrooms_per_room'),
 (0.073850087778326, 'latitude'),
 (0.055019462325508015, 'rooms_per_hhold'),
 (0.041549272502069176, 'housing_median_age'),
 (0.017228296826047918, 'total_rooms'),
 (0.016399631883433085, 'population'),
 (0.015730457042190598, 'total_bedrooms'),
 (0.014926763451934423, 'households'),
 (0.0096580695585693, '<1H OCEAN'),
 (0.004732797773001451, 'NEAR OCEAN'),
 (0.003552663230477094, 'NEAR BAY'),
 (2.147006299203099e-05, 'ISLAND')]

Evaluating on the Test Set

final_model = grid_search_forest.best_estimator_
X_test = test.drop('median_house_value', axis=1)
y_test = test['median_house_value'].copy()
# Only transform the test set; the pipeline was already fitted on the training set
X_test_prepared = full_pipeline.transform(X_test)
y_predict = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, y_predict)
final_rmse = np.sqrt(final_mse)
print(final_rmse)
