Preface

This hands-on project series follows an online tutorial, mainly borrowing the approach and code from 《跟着迪哥学习机器学习》. The book apparently targets Python 2, and some of its code has problems: some listings omit key steps, others have syntax errors; in short, you cannot simply type it in and run it end to end. This post collects code that actually runs, plus the dataset (links tend to break, so see the pinned comment).

Series Navigation

  • Credit card fraud detection with logistic regression
  • Temperature prediction with random forests
  • News classification with naive Bayes
  • A music recommendation platform built on recommender systems

Approach

Random forest modeling -> feature selection -> efficiency comparison -> hyperparameter tuning

import pandas as pd
features = pd.read_csv("temps.csv")
features.head(5)
year month day week temp_2 temp_1 average actual friend
0 2019 1 1 Fri 45 45 45.6 45 29
1 2019 1 2 Sat 44 45 45.7 44 61
2 2019 1 3 Sun 45 44 45.8 41 56
3 2019 1 4 Mon 44 41 45.9 40 53
4 2019 1 5 Tues 41 40 46.0 44 41

Tasks

  1. Basic modeling:

    • data preprocessing
    • feature display
    • modeling + visualization
  2. Effect of sample size on the results: add more samples, observe how the results change, then revisit the feature engineering, add new features, and observe again
  3. Tune the algorithm to find the best hyperparameters, using two classic tuning methods

Feature Visualization and Preprocessing

Check the data size

print("数据维度:",features.shape)
数据维度: (348, 9)
features.describe()
year month day temp_2 temp_1 average actual friend
count 348.0 348.000000 348.000000 348.000000 348.000000 348.000000 348.000000 348.000000
mean 2019.0 6.477011 15.514368 62.652299 62.701149 59.760632 62.543103 60.034483
std 0.0 3.498380 8.772982 12.165398 12.120542 10.527306 11.794146 15.626179
min 2019.0 1.000000 1.000000 35.000000 35.000000 45.100000 35.000000 28.000000
25% 2019.0 3.000000 8.000000 54.000000 54.000000 49.975000 54.000000 47.750000
50% 2019.0 6.000000 15.000000 62.500000 62.500000 58.200000 62.500000 60.000000
75% 2019.0 10.000000 23.000000 71.000000 71.000000 69.025000 71.000000 71.000000
max 2019.0 12.000000 31.000000 117.000000 117.000000 77.400000 92.000000 95.000000

About the output:

  1. count shows whether any values are missing
  2. mean and std describe the distribution of each column
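For example, a quick explicit check (a minimal sketch; features is the DataFrame loaded above) confirms what count suggests:

# Each column's count equals the row total, so there are no missing values;
# isnull().sum() makes the same check explicit
print(features.isnull().sum())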
# Convert to standard datetime format for later processing
import datetime
years = features['year']
months = features['month']
days = features['day']
dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years,months,days)]
dates = [datetime.datetime.strptime(date,'%Y-%m-%d') for date in dates]
dates[:5]
[datetime.datetime(2019, 1, 1, 0, 0),datetime.datetime(2019, 1, 2, 0, 0),datetime.datetime(2019, 1, 3, 0, 0),datetime.datetime(2019, 1, 4, 0, 0),datetime.datetime(2019, 1, 5, 0, 0)]
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight') # plotting style

Visualization

Lay out the figure. The initial analysis needs four panels: the actual max temperature (the label), yesterday's max, the max two days prior, and a friend's prediction.

fig,((ax1,ax2),(ax3,ax4)) = plt.subplots(nrows=2,ncols=2,figsize=(10,10))
fig.autofmt_xdate(rotation=45)
# Actual max temperature (the label)
ax1.plot(dates,features['actual'])
ax1.set_xlabel('');ax1.set_ylabel('Temperature');ax1.set_title('Max Temp')
# Yesterday's max temperature
ax2.plot(dates,features['temp_1'])
ax2.set_xlabel('');ax2.set_ylabel('Temperature');ax2.set_title('Yesterday Max Temp')
# Max temperature two days prior
ax3.plot(dates,features['temp_2'])
ax3.set_xlabel('');ax3.set_ylabel('Temperature');ax3.set_title('Two Days Prior Max Temp')
# Friend's forecast of the max temperature
ax4.plot(dates,features['friend'])
ax4.set_xlabel('');ax4.set_ylabel('Temperature');ax4.set_title('Friend Forecast')
plt.tight_layout(pad=2)

Preprocessing

The day-of-week column is not numeric, so it has to be encoded as features.

# One-hot encoding
features = pd.get_dummies(features) # automatic conversion, suffixes added automatically
features.head(5)
year month day temp_2 temp_1 average actual friend week_Fri week_Mon week_Sat week_Sun week_Thurs week_Tues week_Wed
0 2019 1 1 45 45 45.6 45 29 1 0 0 0 0 0 0
1 2019 1 2 44 45 45.7 44 61 0 0 1 0 0 0 0
2 2019 1 3 45 44 45.8 41 56 0 0 0 1 0 0 0
3 2019 1 4 44 41 45.9 40 53 0 1 0 0 0 0 0
4 2019 1 5 41 40 46.0 44 41 0 0 0 0 0 1 0
help(pd.get_dummies)
Help on function get_dummies in module pandas.core.reshape.reshape:

get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None) -> 'DataFrame'
    Convert categorical variable into dummy/indicator variables.
    (parameter descriptions and examples omitted; see the pandas docs)
# Data and labels
import numpy as np
# Labels
labels = np.array(features['actual'])
# Drop the label from the features
features = features.drop('actual',axis=1) # drop by column
# Keep the column names
feature_list = list(features.columns)
# Convert to arrays
features = np.array(features)
# Train/test split
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(features,labels,test_size=0.25,random_state=42)
print('Training features:',train_features.shape)
print('Training labels:',train_labels.shape)
print('Test features:',test_features.shape)
print('Test labels:',test_labels.shape)
Training features: (261, 14)
Training labels: (261,)
Test features: (87, 14)
Test labels: (87,)

Random Forest Regression

Start by fitting 1000 trees with default parameters.

Evaluate with MAPE (mean absolute percentage error) first; since the sample is small, training is fast.
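For reference, MAPE = (100 / n) * Σ|actual_i - predicted_i| / actual_i, i.e. the absolute error averaged as a percentage of each true value; lower is better.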

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=1000,random_state=42)
rf.fit(train_features,train_labels)
predictions = rf.predict(test_features)
errors = abs(predictions - test_labels)
mape = 100 * (errors / test_labels)
print('MAPE:',np.mean(mape))
MAPE: 6.016378550202468

Tree Visualization

from sklearn.tree import export_graphviz
import pydot
import os
# os.environ["PATH"] += os.pathsep + 'S:/Graphviz/bin/'
tree = rf.estimators_[5]
export_graphviz(tree,out_file="tree.dot",feature_names=feature_list,rounded=True,precision=1)
(graph,) = pydot.graph_from_dot_file('./tree.dot')
graph.write_png('./tree.png')
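Note: pydot can only render the .dot file if the Graphviz binaries are installed and on the PATH (the commented-out line above hints at this). If write_png fails, a sketch of the usual fix, with the path adjusted to your own installation:

import os
# Example path from the commented-out line above; adjust to your Graphviz install
os.environ["PATH"] += os.pathsep + 'S:/Graphviz/bin/'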

tree.png

# A smaller, depth-limited forest
rf_small = RandomForestRegressor(n_estimators=10,max_depth=3,random_state=42)
rf_small.fit(train_features,train_labels)
tree_small = rf_small.estimators_[5]
export_graphviz(tree_small,out_file='small_tree.dot',feature_names=feature_list,rounded=True,precision=1)
(graph,) = pydot.graph_from_dot_file('small_tree.dot')
graph.write_png('small_tree.png')

small_tree.png

# Feature importances from the forest
importances = list(rf.feature_importances_)
# Pair each feature name with its importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list,importances)]
feature_importances = sorted(feature_importances,key=lambda x:x[1],reverse=True)
# Print
[print('Variable:{:20} importance: {}'.format(*pair)) for pair in feature_importances]
Variable:temp_1               importance: 0.69
Variable:average              importance: 0.2
Variable:day                  importance: 0.03
Variable:friend               importance: 0.03
Variable:temp_2               importance: 0.02
Variable:month                importance: 0.01
Variable:year                 importance: 0.0
Variable:week_Fri             importance: 0.0
Variable:week_Mon             importance: 0.0
Variable:week_Sat             importance: 0.0
Variable:week_Sun             importance: 0.0
Variable:week_Thurs           importance: 0.0
Variable:week_Tues            importance: 0.0
Variable:week_Wed             importance: 0.0
# Bar chart of the importances
x_values = list(range(len(importances)))
plt.bar(x_values,importances,orientation='vertical')
plt.xticks(x_values,feature_list,rotation='vertical')
plt.ylabel('Importance');plt.xlabel('Variable');plt.title('Variable Importances')

Building the model with only the most important features might give comparable results; success isn't guaranteed, but training will certainly be faster.

# Try a model with only the two most important features
rf_most_important = RandomForestRegressor(n_estimators=1000,random_state=42)
# Indices of the most important features
important_indices = [feature_list.index('temp_1'),feature_list.index('average')]
train_important = train_features[:,important_indices]
test_important = test_features[:,important_indices]
# Retrain
rf_most_important.fit(train_important,train_labels)
# Predict
predictions = rf_most_important.predict(test_important)
errors = abs(predictions-test_labels)
# Evaluate, rounded to two decimals
print('Mean Absolute Error:',round(np.mean(errors),2),'°')
mape = np.mean(100*(errors/test_labels))
print('mape:',mape)
Mean Absolute Error: 3.92 °
mape: 6.243108595734665

Here MAPE rose from 6.0 to 6.2 instead of falling, so keeping only the most important features is not enough.

First Predictions

# Dates for the full dataset
months = features[:,feature_list.index('month')]
days = features[:,feature_list.index('day')]
years = features[:,feature_list.index('year')]
# Convert to datetime
dates = [str(int(year))+'-'+str(int(month))+'-'+str(int(day)) for year, month, day in zip(years,months,days)]
dates = [datetime.datetime.strptime(date,'%Y-%m-%d') for date in dates]
# DataFrame of dates and the true labels
true_data = pd.DataFrame(data={'date':dates,'actual':labels})
# Another DataFrame of test dates and predictions
months = test_features[:,feature_list.index('month')]
days = test_features[:,feature_list.index('day')]
years = test_features[:,feature_list.index('year')]
test_dates = [str(int(year))+'-'+str(int(month))+'-'+str(int(day)) for year,month,day in zip(years,months,days)]
test_dates = [datetime.datetime.strptime(date,'%Y-%m-%d') for date in test_dates]
predictions_data = pd.DataFrame(data = {'date':test_dates,'prediction':predictions})
# True values
plt.plot(true_data['date'],true_data['actual'],'b-',label='actual')
# Predictions
plt.plot(predictions_data['date'],predictions_data['prediction'],'ro',label='prediction')
plt.xticks(rotation='60')
plt.legend()  # the label= arguments above only show up once a legend is drawn

Deeper Data Analysis

  1. What happens to the results if more data becomes available?
  2. Do newly added features improve the model, and what does that do to time efficiency?

Prediction on the Larger Dataset

import pandas as pd
features = pd.read_csv('temps_extended.csv')
features.head(5)
year month day weekday ws_1 prcp_1 snwd_1 temp_2 temp_1 average actual friend
0 2011 1 1 Sat 4.92 0.00 0 36 37 45.6 40 40
1 2011 1 2 Sun 5.37 0.00 0 37 40 45.7 39 50
2 2011 1 3 Mon 6.26 0.00 0 40 39 45.8 42 42
3 2011 1 4 Tues 5.59 0.00 0 39 42 45.9 38 59
4 2011 1 5 Wed 3.80 0.03 0 42 38 46.0 45 39
print('Data shape:',features.shape)
Data shape: (2191, 12)

Date Handling

# Convert to standard datetime format for later processing
import datetime
years = features['year']
months = features['month']
days = features['day']
dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years,months,days)]
dates = [datetime.datetime.strptime(date,'%Y-%m-%d') for date in dates]
dates[:5]
[datetime.datetime(2011, 1, 1, 0, 0),datetime.datetime(2011, 1, 2, 0, 0),datetime.datetime(2011, 1, 3, 0, 0),datetime.datetime(2011, 1, 4, 0, 0),datetime.datetime(2011, 1, 5, 0, 0)]

Feature Visualization

# Visualize the new features
fig,((ax1,ax2),(ax3,ax4)) = plt.subplots(nrows=2,ncols=2,figsize=(15,10))
fig.autofmt_xdate(rotation=45)
# Historical average max temperature
ax1.plot(dates,features['average'])
ax1.set_xlabel('');ax1.set_ylabel('Temperature (F)');ax1.set_title('Historical Avg Max Temp')
# Wind speed
ax2.plot(dates,features['ws_1'],'r-')
ax2.set_xlabel('');ax2.set_ylabel('Wind Speed (mph)');ax2.set_title('Prior Wind Speed')
# Precipitation
ax3.plot(dates,features['prcp_1'],'r-')
ax3.set_xlabel('Date');ax3.set_ylabel('Precipitation (in)');ax3.set_title('Prior Precipitation')
# Snow depth
ax4.plot(dates,features['snwd_1'],'ro')
ax4.set_xlabel('Date');ax4.set_ylabel('Snow Depth (in)');ax4.set_title('Prior Snow Depth')
plt.tight_layout(pad=2)

Feature Engineering

  1. Select as many potentially valuable features as possible; the more information at the start, the more the model has to work with
  2. New usable features often surface mid-analysis; going back to add them is tedious, and every addition needs a comparison experiment
  3. So make the preprocessing and feature-extraction work as complete as possible at the initial stage

About this dataset:

  1. Weather varies with the season, but the dataset has no season feature, so we create one ourselves
# Season variable
seasons = []
for month in features['month']:
    if month in [1, 2, 12]:
        seasons.append('winter')
    elif month in [3, 4, 5]:
        seasons.append('spring')
    elif month in [6, 7, 8]:
        seasons.append('summer')
    elif month in [9, 10, 11]:
        seasons.append('fall')
# Keep only the features of interest for plotting
reduced_features = features[['temp_1','prcp_1','average','actual']]
reduced_features['season'] = seasons
<ipython-input-23-b690764ce900>:13: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  reduced_features['season'] = seasons
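The warning is harmless here, but it can be avoided by taking an explicit copy of the slice, e.g.:

# An explicit .copy() makes reduced_features an independent DataFrame,
# so adding the season column no longer writes into a view of `features`
reduced_features = features[['temp_1','prcp_1','average','actual']].copy()
reduced_features['season'] = seasons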
import seaborn as sns
sns.set(style='ticks',color_codes=True)
# Color palette
palette = sns.xkcd_palette(['dark blue','dark green','gold','orange'])
# pairplot
sns.pairplot(reduced_features,hue='season',diag_kind='kde',palette=palette,plot_kws=dict(alpha=0.7),diag_kws=dict(shade=True))

How to read the plot: colors encode the season (hue); the diagonal panels show each variable's distribution across seasons; all the off-diagonal panels are scatter plots showing the pairwise relationships between features.

Effect of Data Volume on the Results

# One-hot encoding
features = pd.get_dummies(features)
# Features and labels
labels = features['actual']
features = features.drop('actual',axis=1)
# Keep the feature names for later
feature_list = list(features.columns)
# Convert to arrays
import numpy as np
features = np.array(features)
labels = np.array(labels)
# Train/test split
from sklearn.model_selection import train_test_split
train_features,test_features,train_labels,test_labels = train_test_split(features,labels,test_size=0.25,random_state=0)
print("Training features:",train_features.shape)
print("Training labels:",train_labels.shape)
print("Test features:",test_features.shape)
print("Test labels:",test_labels.shape)
Training features: (1643, 17)
Training labels: (1643,)
Test features: (548, 17)
Test labels: (548,)

Although these are new samples, the same test set must be used so the results are comparable.

# Apply the same processing to the old data so the two can be compared
import pandas as pd
import numpy as np
# Align the features: drop the columns the old data does not have
original_feature_indices = [feature_list.index(feature) for feature in feature_list if feature not in ['ws_1','prcp_1','snwd_1']]
# Reload the old dataset
original_features = pd.read_csv('temps.csv')
original_features = pd.get_dummies(original_features)
# Labels and arrays
original_labels = np.array(original_features['actual'])
original_features = original_features.drop('actual',axis=1)
original_feature_list = list(original_features.columns)
original_features = np.array(original_features)

Splitting the Old Data

# Train/test split
from sklearn.model_selection import train_test_split
original_train_features,original_test_features,original_train_labels,original_test_labels = train_test_split(original_features,original_labels,test_size=0.25,random_state=42)
# Modeling
from sklearn.ensemble import RandomForestRegressor
# Same hyperparameters and random seed
rf = RandomForestRegressor(n_estimators=100,random_state=0)
# Fit on the old dataset
rf.fit(original_train_features,original_train_labels)
# Evaluate on the shared test set for a fair comparison
predictions = rf.predict(test_features[:,original_feature_indices])
errors = abs(predictions-test_labels)
print('Mean temperature error:',round(np.mean(errors),2),'°')
mape = 100 *(errors/test_labels)
# Define an accuracy for easier comparison
accuracy = 100 -np.mean(mape)
print('Accuracy:',round(accuracy,2),'%')
Mean temperature error: 4.68 °
Accuracy: 92.19 %

That was the result with the smaller sample. Does adding more samples help?

from sklearn.ensemble import RandomForestRegressor
# Keep the features consistent: restrict the new data to the old feature set
original_train_changed_features = train_features[:,original_feature_indices]
original_test_changed_features = test_features[:,original_feature_indices]
rf = RandomForestRegressor(n_estimators=100,random_state=0)
rf.fit(original_train_changed_features,train_labels)
# Predict
baseline_predictions = rf.predict(original_test_changed_features)
# Evaluate
baseline_errors = abs(baseline_predictions-test_labels)
print('Mean temperature error:',round(np.mean(baseline_errors),2),'°')
baseline_mape = 100 * np.mean(baseline_errors/test_labels)
# Accuracy
baseline_accuracy = 100 - baseline_mape
print('Accuracy:',round(baseline_accuracy,2),'%')
Mean temperature error: 4.2 °
Accuracy: 93.12 %

Effect of the Number of Features

Now add the new weather features that the two previous comparisons left out.

from sklearn.ensemble import RandomForestRegressor
rf_exp = RandomForestRegressor(n_estimators=100,random_state=0)
rf_exp.fit(train_features,train_labels)
# Same test set
predictions = rf_exp.predict(test_features)
# Evaluate
errors = abs(predictions - test_labels)
print('Mean temperature error:',round(np.mean(errors),2),'°')
mape = np.mean(100*(errors/test_labels))
improvement_baseline = 100 * abs(mape-baseline_mape) / baseline_mape
print('Change in model performance with the added features:',round(improvement_baseline,2),'%')
# Accuracy
accuracy = 100 - mape
print('Accuracy:',round(accuracy,2),'%')
Mean temperature error: 4.05 °
Change in model performance with the added features: 3.34 %
Accuracy: 93.35 %
importances = list(rf_exp.feature_importances_)
# Pair each feature name with its importance
feature_importances = [(feature,round(importance,2)) for feature,importance in zip(feature_list,importances)]
# Sort
feature_importances = sorted(feature_importances,key=lambda x:x[1],reverse=True)
# Print
[print('Variable:{:20} Importance: {}'.format(*pair)) for pair in feature_importances]
Variable:temp_1               Importance: 0.85
Variable:average              Importance: 0.05
Variable:ws_1                 Importance: 0.02
Variable:friend               Importance: 0.02
Variable:year                 Importance: 0.01
Variable:month                Importance: 0.01
Variable:day                  Importance: 0.01
Variable:prcp_1               Importance: 0.01
Variable:temp_2               Importance: 0.01
Variable:snwd_1               Importance: 0.0
Variable:weekday_Fri          Importance: 0.0
Variable:weekday_Mon          Importance: 0.0
Variable:weekday_Sat          Importance: 0.0
Variable:weekday_Sun          Importance: 0.0
Variable:weekday_Thurs        Importance: 0.0
Variable:weekday_Tues         Importance: 0.0
Variable:weekday_Wed          Importance: 0.0
# Plot the importances
plt.style.use('fivethirtyeight')
x_values = list(range(len(importances)))
plt.bar(x_values,importances,orientation="vertical",color="r",edgecolor="k",linewidth=1.2)
plt.xticks(x_values,feature_list,rotation='vertical')
plt.ylabel('Importance')
plt.xlabel('Variable')
plt.title('Variable Importances')

How many features to keep is still fuzzy. Sort the features by importance, compute the cumulative sum, set a threshold, and see how many features must be accumulated before the cumulative importance crosses it.

sorted_importances = [importance[1] for importance in feature_importances]
sorted_features = [importance[0] for importance in feature_importances]
# Cumulative importance
cumulative_importances = np.cumsum(sorted_importances)
# Line plot with a 0.95 threshold
plt.plot(x_values,cumulative_importances,'g-')
plt.hlines(y=0.95,xmin=0,xmax=len(sorted_importances),color='r',linestyles='dashed')
plt.xticks(x_values,sorted_features,rotation='vertical')
plt.xlabel('Variable');plt.ylabel('Cumulative Importance')
plt.title('Cumulative Importances')

With the top five features the cumulative importance passes 0.95. What if the model is built with only these five?

important_feature_names = [feature[0] for feature in feature_importances[0:5]]
# Column indices of those features
important_indices = [feature_list.index(feature) for feature in important_feature_names]
# Slice the train and test sets
important_train_features = train_features[:,important_indices]
important_test_features = test_features[:,important_indices]
# Shapes
print("important train features shape:",important_train_features.shape)
print("important test features shape:",important_test_features.shape)
# Train
rf_exp.fit(important_train_features,train_labels)
# Same test set
predictions = rf_exp.predict(important_test_features)
# Evaluate
errors = abs(predictions-test_labels)
print('Mean temperature error:',round(np.mean(errors),2),'°')
mape = 100*(errors/test_labels)
accuracy = 100 - np.mean(mape)
print('Accuracy:',round(accuracy,2),'%')
important train features shape: (1643, 5)
important test features shape: (548, 5)
Mean temperature error: 4.11 °
Accuracy: 93.28 %

Accuracy did not improve, so check whether time efficiency did.

import time

# Time 10 train+predict rounds with all features
all_features_time = []
for _ in range(10):
    start_time = time.time()
    rf_exp.fit(train_features,train_labels)
    all_features_predictions = rf_exp.predict(test_features)
    end_time = time.time()
    all_features_time.append(end_time-start_time)
all_features_time = np.mean(all_features_time)
print("Average train+predict time with all features:",round(all_features_time,2),'s')
Average train+predict time with all features: 0.69 s
# Time the runs with only the important features
reduced_features_time = []
for _ in range(10):
    start_time = time.time()
    rf_exp.fit(important_train_features,train_labels)
    reduced_features_predictions = rf_exp.predict(important_test_features)
    end_time = time.time()
    reduced_features_time.append(end_time-start_time)
reduced_features_time = np.mean(reduced_features_time)
print("Average train+predict time with the important features:",round(reduced_features_time,2),'s')
Average train+predict time with the important features: 0.42 s
# Time the original model
original_features_time = []
for _ in range(10):
    start_time = time.time()
    rf.fit(original_train_features,original_train_labels)
    original_features_predictions = rf.predict(test_features[:,original_feature_indices])
    end_time = time.time()
    original_features_time.append(end_time-start_time)
original_features_time = np.mean(original_features_time)
print("Average train+predict time of the original model:",round(original_features_time,2),'s')
Average train+predict time of the original model: 0.18 s
# Side-by-side comparison
all_accuracy = 100 * (1-np.mean(abs(all_features_predictions-test_labels)/test_labels))
reduced_accuracy = 100 * (1-np.mean(abs(reduced_features_predictions-test_labels)/test_labels))
# Save the results and display them
comparison = pd.DataFrame({'features':['all(17)','reduced(5)'],'runtime':[round(all_features_time,2),round(reduced_features_time,2)],'accuracy':[round(all_accuracy,2),round(reduced_accuracy,2)]})
comparison[['features','accuracy','runtime']]
features accuracy runtime
0 all(17) 93.35 0.69
1 reduced(5) 93.28 0.42

Comparing the two results directly

The accuracy figure is self-defined purely for convenient comparison. As the table shows, accuracy is essentially unchanged while the runtime improves considerably.

# Runtime may matter more than a tiny accuracy difference
relative_accuracy_decrease = 100 * (all_accuracy - reduced_accuracy) / all_accuracy
print('Relative accuracy drop:',round(relative_accuracy_decrease,3),"%")
relative_runtime_decrease = 100 * (all_features_time - reduced_features_time) / all_features_time
print("Relative runtime reduction:",round(relative_runtime_decrease,3),"%")
Relative accuracy drop: 0.071 %
Relative runtime reduction: 39.17 %

Summary of All Experiments

# MAE of the original model
original_mae = np.mean(abs(original_features_predictions - test_labels))
# MAE with all features
exp_all_mae = np.mean(abs(all_features_predictions - test_labels))
# MAE with the important features only
exp_reduced_mae = np.mean(abs(reduced_features_predictions - test_labels))
# Accuracy of the original model
original_accuracy = 100 * (1 - np.mean(abs(original_features_predictions - test_labels) / test_labels))
model_comparison = pd.DataFrame({
    'model': ['original', 'exp_all', 'exp_reduced'],
    'error (degrees)': [original_mae, exp_all_mae, exp_reduced_mae],
    'accuracy': [original_accuracy, all_accuracy, reduced_accuracy],
    'run_time (s)': [original_features_time, all_features_time, reduced_features_time]})
# Plot all experiment results together
fig, (ax1,ax2,ax3) = plt.subplots(nrows=1,ncols=3,figsize=(16,5),sharex=True)
# X axis
x_values = [0,1,2]
labels = list(model_comparison['model'])
plt.xticks(x_values,labels)
# Font sizes
fontdict = {'fontsize':18}
fontdict_yaxis = {'fontsize':14}
# Error between predicted and true temperature
ax1.bar(x_values,model_comparison['error (degrees)'], color=['b','r','g'],edgecolor='k',linewidth=1.5)
ax1.set_ylim(bottom=3.5, top=4.5)
ax1.set_ylabel('Error (degree) (F)',fontdict=fontdict_yaxis)
ax1.set_title('Model Error Comparison',fontdict=fontdict)
# Accuracy comparison
ax2.bar(x_values,model_comparison['accuracy'],color=['b','r','g'],edgecolor='k',linewidth=1.5)
ax2.set_ylim(bottom=92, top=94)
ax2.set_ylabel('Accuracy (%)',fontdict=fontdict_yaxis)
ax2.set_title('Model Accuracy Comparison',fontdict=fontdict)
# Runtime comparison
ax3.bar(x_values,model_comparison['run_time (s)'], color=['b','r','g'],edgecolor='k',linewidth=1.5)
ax3.set_ylim(bottom=0,top=1)
ax3.set_ylabel('run_time (s)',fontdict=fontdict_yaxis)
ax3.set_title('Model Run-Time Comparison',fontdict=fontdict)
plt.show()

Hyperparameter Tuning

from sklearn.ensemble import RandomForestRegressor
from pprint import pprint
rf = RandomForestRegressor(random_state=42)
pprint(rf.get_params())
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

Randomized Search

Choose the candidate values for each parameter according to the actual problem; every parameter's range needs to be controlled carefully.

from sklearn.model_selection import RandomizedSearchCV
# Number of trees
n_estimators = [int(x) for x in np.linspace(start=200,stop=2000,num=10)]
# How max_features is chosen
max_features = ['auto','sqrt']
# Maximum tree depth
max_depth = [int(x) for x in np.linspace(10,20,num=2)]
max_depth.append(None)
# Minimum samples required to split a node
min_samples_split = [2,5,10]
# Minimum samples at a leaf node
min_samples_leaf = [1,2,4]
# Whether to bootstrap-sample
bootstrap = [True,False]
# Random search space
random_grid = {'n_estimators':n_estimators,
               'max_features':max_features,
               'max_depth':max_depth,
               'min_samples_split':min_samples_split,
               'min_samples_leaf':min_samples_leaf,
               'bootstrap':bootstrap}

Running the Random Search

rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator=rf,                      # the model to tune
                               param_distributions=random_grid,   # candidate parameters
                               n_iter=100,                        # try 100 random combinations and keep the best
                               scoring='neg_mean_absolute_error', # evaluation metric
                               cv=3,                              # cross-validation folds
                               verbose=2,                         # amount of logging
                               random_state=42,                   # random seed
                               n_jobs=-1)                         # -1 uses all cores
# Start the search
rf_random.fit(train_features,train_labels)
rf_random.best_params_
# Evaluation helper
def evaluate(model,test_features,test_labels):
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    mape = 100 * np.mean(errors / test_labels)
    accuracy = 100 - mape
    print('Mean temperature error:',np.mean(errors))
    print('Accuracy = {:0.2f}%'.format(accuracy))
# Results with the default parameters
base_model = RandomForestRegressor(random_state=42)
base_model.fit(train_features,train_labels)
evaluate(base_model,test_features,test_labels)
# The new recipe
best_random = rf_random.best_estimator_
evaluate(best_random,test_features,test_labels)

Grid Search

This sweeps the parameter combinations exhaustively; note that the grid here is built around the rough range found by the random search above, whose best parameters were:

{'n_estimators': 1800,'min_samples_split': 10,'min_samples_leaf': 4,'max_features': 'auto','max_depth': None,'bootstrap': True}
from sklearn.model_selection import GridSearchCV
# Candidate parameter grid
param_grid = {'n_estimators':[1600,1700,1800,1900,2000],
              'max_features':['auto'],
              'max_depth':[8,10,12],
              'min_samples_split':[3,5,7],
              'min_samples_leaf':[2,3,4,5,6],
              'bootstrap':[True]}
# Base model
rf = RandomForestRegressor()
# Grid search
grid_search = GridSearchCV(estimator=rf,param_grid=param_grid,scoring='neg_mean_absolute_error',cv=3,n_jobs=-1,verbose=2)
# Start the search
grid_search.fit(train_features,train_labels)

If the grid is too large and exhaustive search would take too many iterations, split the candidates into smaller groups, search each group separately, and compare the winners of each group, as sketched below.
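A minimal sketch of that idea (the two sub-grids below are illustrative, reusing the scoring and cv settings from above):

# Search each sub-grid separately and keep the overall winner
sub_grids = [{'n_estimators':[1600,1700,1800],'min_samples_split':[3,5,7],'bootstrap':[True]},
             {'n_estimators':[1900,2000],'min_samples_split':[3,5,7],'bootstrap':[True]}]
best_score, best_estimator = float('-inf'), None
for grid in sub_grids:
    search = GridSearchCV(estimator=RandomForestRegressor(),param_grid=grid,
                          scoring='neg_mean_absolute_error',cv=3,n_jobs=-1)
    search.fit(train_features,train_labels)
    if search.best_score_ > best_score:
        best_score, best_estimator = search.best_score_, search.best_estimator_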

Bayesian Optimization

from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn.model_selection import cross_val_score

def hyperopt_train_test(params):
    clf = RandomForestRegressor(**params)
    return cross_val_score(clf,train_features,train_labels).mean()

max_depth = [i for i in range(10,20)]
# max_depth.append(None)
space4rf = {'max_depth': hp.choice('max_depth', max_depth),
            'max_features': hp.choice('max_features', ['auto','sqrt']),
            'min_samples_split': hp.choice('min_samples_split', range(5,20)),
            'min_samples_leaf': hp.choice('min_samples_leaf', range(2,10)),
            'n_estimators': hp.choice('n_estimators', range(1000,2000)),
            'bootstrap': hp.choice('bootstrap', [True,False])}

best = 0

def f(params):
    global best
    acc = hyperopt_train_test(params)
    if acc > best:
        best = acc
        print('new best:', best, params)
    return {'loss': -acc, 'status': STATUS_OK}

trials = Trials()
best = fmin(f, space4rf, algo=tpe.suggest, max_evals=100, trials=trials)
print("best:",best)
new best: 0.8670547390932652 {'bootstrap': True, 'max_depth': 18, 'max_features': 'auto', 'min_samples_leaf': 4, 'min_samples_split': 11, 'n_estimators': 1942}
new best: 0.8679298658104889 {'bootstrap': True, 'max_depth': 16, 'max_features': 'auto', 'min_samples_leaf': 7, 'min_samples_split': 10, 'n_estimators': 1734}
new best: 0.8684034111523674 {'bootstrap': True, 'max_depth': 14, 'max_features': 'auto', 'min_samples_leaf': 9, 'min_samples_split': 19, 'n_estimators': 1766}
new best: 0.8685636302610934 {'bootstrap': True, 'max_depth': 16, 'max_features': 'auto', 'min_samples_leaf': 9, 'min_samples_split': 5, 'n_estimators': 1439}
new best: 0.8685767383919801 {'bootstrap': True, 'max_depth': 16, 'max_features': 'auto', 'min_samples_leaf': 9, 'min_samples_split': 5, 'n_estimators': 1404}
new best: 0.8685919671759731 {'bootstrap': True, 'max_depth': 19, 'max_features': 'auto', 'min_samples_leaf': 9, 'min_samples_split': 15, 'n_estimators': 1830}
new best: 0.8686049353034605 {'bootstrap': True, 'max_depth': 19, 'max_features': 'auto', 'min_samples_leaf': 9, 'min_samples_split': 16, 'n_estimators': 1099}
new best: 0.8686240725941452 {'bootstrap': True, 'max_depth': 19, 'max_features': 'auto', 'min_samples_leaf': 9, 'min_samples_split': 16, 'n_estimators': 1088}
100%|█████████████████████████████████████████████| 100/100 [39:13<00:00, 26.47s/trial, best loss: -0.8686240725941452]
best: {'bootstrap': 0, 'max_depth': 9, 'max_features': 0, 'min_samples_leaf': 7, 'min_samples_split': 11, 'n_estimators': 88}
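Note that hp.choice returns the index of the chosen option rather than the value itself, which is why this final dict looks inconsistent with the printed trials. hyperopt's space_eval maps the indices back to the actual values (a small sketch using the space4rf defined above):

from hyperopt import space_eval

# best holds indices into each hp.choice list; space_eval resolves them to values
best_params = space_eval(space4rf, best)
print(best_params)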
