Big Data Prediction in Practice - Random Forest Prediction in Practice (3) - Analysis of the Effect of Data Volume on Results

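Next, let's compare how the number of features affects the results. The previous two comparisons did not include the new weather features. This time we add three of them, precipitation, wind speed, and snow depth, to the dataset and see how the model does: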

# Train on the data set that includes the new features
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rf_exp = RandomForestRegressor(n_estimators=100, random_state=0)
rf_exp.fit(train_features, train_labels)

# Predict on the same test set
predictions = rf_exp.predict(test_features)

# Evaluate: mean absolute error
errors = abs(predictions - test_labels)
print('Mean temperature error:', round(np.mean(errors), 2), 'degrees.')

# Mean absolute percentage error (MAPE)
mape = np.mean(100 * (errors / test_labels))

# How much did it improve over the baseline?
improvement_baseline = 100 * abs(mape - baseline_mape) / baseline_mape
print('Improvement with more features:', round(improvement_baseline, 2), '%.')

# Accuracy
accuracy = 100 - mape
print('Accuracy:', round(accuracy, 2), '%.')

Mean temperature error: 4.05 degrees.
Improvement with more features: 3.34 %.
Accuracy: 93.35 %.
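As an aside, the metrics used above (mean absolute error, MAPE, accuracy defined as 100 minus MAPE, and the change relative to a baseline) can be gathered into one helper. The following is only a minimal sketch: the function name `evaluate` and the demo numbers are made up for illustration and are not part of the original code. Unlike the code above, which takes the absolute value of the change, this sketch keeps the sign so that a worse model shows up as a negative improvement.

```python
import numpy as np

def evaluate(predictions, labels, baseline_mape=None):
    """Return MAE, MAPE (%), accuracy (100 - MAPE) and, optionally,
    the signed relative change in MAPE versus a baseline."""
    predictions = np.asarray(predictions, dtype=float)
    labels = np.asarray(labels, dtype=float)
    errors = np.abs(predictions - labels)
    mae = errors.mean()
    # MAPE divides by the labels, so it assumes no label is zero
    mape = 100 * np.mean(errors / labels)
    result = {'mae': mae, 'mape': mape, 'accuracy': 100 - mape}
    if baseline_mape is not None:
        # Positive value: MAPE went down compared with the baseline
        result['improvement'] = 100 * (baseline_mape - mape) / baseline_mape
    return result

# Made-up numbers, only to show the call
demo = evaluate(predictions=[52.0, 63.0, 70.0],
                labels=[50.0, 60.0, 75.0],
                baseline_mape=8.0)
print({name: round(value, 2) for name, value in demo.items()})
```

With the variables from this article, the call would be `evaluate(predictions, test_labels, baseline_mape)`.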

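The overall model performance improved slightly. Here we also added one extra metric, the size of the improvement relative to the baseline model, to make the comparison easier. Now that there are more features, we can take a proper look at feature importance. It is only a reference value, but there are some unwritten rules of thumb in industry, so let's take a look: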

# Feature importances from the trained model
importances = list(rf_exp.feature_importances_)

# Pair each feature name with its importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]

# Sort in descending order of importance
feature_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)

# Print them
for pair in feature_importances:
    print('Feature: {:20} Importance: {}'.format(*pair))
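The same ranking can also be produced with a pandas Series, which keeps names and values together and sorts in one call. This is just an alternative sketch, not the approach used in this article, and the importance values below are invented for illustration (only the feature names appear in the article).

```python
import pandas as pd

# Invented importances for illustration; the names are features from this article
feature_list_demo = ['ws_1', 'temp_1', 'prcp_1', 'average', 'snwd_1']
importances_demo = [0.02, 0.70, 0.03, 0.24, 0.01]

# Index the values by feature name and sort from most to least important
ranked = pd.Series(importances_demo, index=feature_list_demo).sort_values(ascending=False)
print(ranked.round(2))
```

With the real model, `pd.Series(rf_exp.feature_importances_, index=feature_list)` would take the place of the demo lists.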

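After sorting the features by importance and printing the results, temp_1 and average are still at the top. The wind speed feature ws_1 also makes the list, but its influence is small. A long list of numbers is hard to read, so a chart makes things clearer.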

import matplotlib.pyplot as plt

# Set the plot style
plt.style.use('fivethirtyeight')

# Bar positions
x_values = list(range(len(importances)))

# Draw the bar chart
plt.bar(x_values, importances, orientation='vertical', color='r', edgecolor='k', linewidth=1.2)

# Feature names on the x-axis have to be written vertically
plt.xticks(x_values, feature_list, rotation='vertical')

# Axis labels and title
plt.ylabel('Importance'); plt.xlabel('Variable'); plt.title('Variable Importances');

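So far we have only looked at which features matter more. This time we consider the cumulative importance of the features: first sort the features by importance, then compute the running total. This uses the cumsum() function. For example, cumsum([1, 2, 3, 4]) returns the running totals (1, 3, 6, 10). A common choice is a threshold of 95%: find how many of the top features are needed for the cumulative importance to exceed the threshold, and keep those as the selected features: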

# Importances and names, in sorted order
sorted_importances = [importance[1] for importance in feature_importances]
sorted_features = [importance[0] for importance in feature_importances]

# Cumulative importance
cumulative_importances = np.cumsum(sorted_importances)

# Line plot of the cumulative importance
plt.plot(x_values, cumulative_importances, 'g-')

# Dashed red line at 0.95
plt.hlines(y=0.95, xmin=0, xmax=len(sorted_importances), color='r', linestyles='dashed')

# X-axis tick labels
plt.xticks(x_values, sorted_features, rotation='vertical')

# Axis labels and title
plt.xlabel('Variable'); plt.ylabel('Cumulative Importance'); plt.title('Cumulative Importances');
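Besides reading the cut-off from the plot, it can be computed directly. The sketch below first shows what cumsum does and then finds how many features are needed to reach the threshold. The importance values are invented for illustration and are not the values from this model.

```python
import numpy as np

# cumsum returns the running totals: 1, 3, 6, 10
print(np.cumsum([1, 2, 3, 4]))

# Invented importances, already sorted in descending order
sorted_importances_demo = np.array([0.60, 0.20, 0.10, 0.06, 0.03, 0.01])
cumulative = np.cumsum(sorted_importances_demo)

# First position where the running total reaches the threshold.
# argmax on a boolean array returns the index of the first True.
threshold = 0.95
n_keep = int(np.argmax(cumulative >= threshold)) + 1
print('Features needed to reach 95%:', n_keep)
```

One caveat: if no running total reaches the threshold, `argmax` returns 0 and `n_keep` would wrongly be 1. That can happen when the importances are rounded before summing, as they are in this article, so it is worth checking `cumulative[-1] >= threshold` first.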

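Here the cumulative importance exceeds 95% once the 5th feature is included. That sets up our next comparison: what happens if we use only these 5 features? And how does the run time change?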

# Names of the selected features
important_feature_names = [feature[0] for feature in feature_importances[0:5]]
# Find their column indices
important_indices = [feature_list.index(feature) for feature in important_feature_names]

# Rebuild the training and test sets with only those columns
important_train_features = train_features[:, important_indices]
important_test_features = test_features[:, important_indices]

# Data dimensions
print('Important train features shape:', important_train_features.shape)
print('Important test features shape:', important_test_features.shape)

# Retrain the model
rf_exp.fit(important_train_features, train_labels)

# Predict on the same test set
predictions = rf_exp.predict(important_test_features)

# Evaluate
errors = abs(predictions - test_labels)
print('Mean temperature error:', round(np.mean(errors), 2), 'degrees.')
mape = 100 * (errors / test_labels)

# Accuracy
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')

Mean temperature error: 4.11 degrees.
Accuracy: 93.28 %.
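The column selection used above relies on NumPy indexing with a list of column indices. A tiny sketch of the same pattern on a made-up matrix (the numbers are invented) shows that the result keeps every row and returns the columns in the order the names were requested:

```python
import numpy as np

# Made-up 3-row feature matrix with four named columns
feature_list_demo = ['temp_1', 'average', 'ws_1', 'prcp_1']
features_demo = np.array([[45.0, 46.0, 4.9, 0.0],
                          [44.0, 45.0, 5.1, 0.1],
                          [41.0, 45.0, 6.0, 0.0]])

# Look up the column index of each wanted feature name
keep_names = ['average', 'temp_1']
keep_indices = [feature_list_demo.index(name) for name in keep_names]

# All rows, only the selected columns, in the order of keep_names
reduced_demo = features_demo[:, keep_indices]
print(reduced_demo.shape)
```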

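It looks like no miracle happened. We might have expected the result to get better, but it actually dropped a little. This may be because tree models already perform a kind of feature selection on their own, or because the remaining 5% of importance does carry some useful signal. The model did not improve, but we can still check whether there is any gain in run time: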

# Now measure the run time
import time

# This time use all features
all_features_time = []

# A single run may not be accurate, so run 10 times and take the average
for _ in range(10):
    start_time = time.time()
    rf_exp.fit(train_features, train_labels)
    all_features_predictions = rf_exp.predict(test_features)
    end_time = time.time()
    all_features_time.append(end_time - start_time)

all_features_time = np.mean(all_features_time)
print('Average time to train and test with all features:', round(all_features_time, 2), 'seconds.')

Average time to train and test with all features: 0.66 seconds.

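With all features, training plus testing takes about 0.66 seconds in total. The speed depends on the machine, so your numbers will differ, and on a laptop it will probably take a little longer than it did for me. Now let's look at the timing when only the high-importance features are used: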

# This time use only the important features
reduced_features_time = []

# A single run may not be accurate, so run 10 times and take the average
for _ in range(10):
    start_time = time.time()
    rf_exp.fit(important_train_features, train_labels)
    reduced_features_predictions = rf_exp.predict(important_test_features)
    end_time = time.time()
    reduced_features_time.append(end_time - start_time)

reduced_features_time = np.mean(reduced_features_time)
print('Average time to train and test with the reduced features:', round(reduced_features_time, 2), 'seconds.')

Average time to train and test with the reduced features: 0.29 seconds.
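The two timing loops share the same pattern, so it could be factored into a helper. The sketch below is one possible way to do that, not the article's code: `average_runtime` is a hypothetical name, and it uses `time.perf_counter()`, which is generally better suited to measuring short intervals than `time.time()`. A small least-squares fit on synthetic data stands in for the model so that the example runs on its own.

```python
import time
import numpy as np

def average_runtime(fn, n_runs=10):
    """Call fn() n_runs times; return (mean seconds per call, last result)."""
    durations = []
    result = None
    for _ in range(n_runs):
        start = time.perf_counter()
        result = fn()
        durations.append(time.perf_counter() - start)
    return float(np.mean(durations)), result

# Synthetic stand-in workload: a least-squares fit on random data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = X[:, 0] * 3 + rng.normal(size=500)

mean_seconds, coef = average_runtime(lambda: np.linalg.lstsq(X, y, rcond=None)[0], n_runs=5)
print('Mean seconds per run:', round(mean_seconds, 6))

# With the article's variables, the call could look like this
# (fit returns the estimator itself, so predict can be chained):
# average_runtime(lambda: rf_exp.fit(train_features, train_labels).predict(test_features))
```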

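The only thing that changed is the size of the input data. With the reduced feature set the experiment runs noticeably faster, because the decision trees have far fewer features to go through. Putting the two results side by side makes them easier to compare: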

# Compute the accuracy from each set of predictions
all_accuracy = 100 * (1 - np.mean(abs(all_features_predictions - test_labels) / test_labels))
reduced_accuracy = 100 * (1 - np.mean(abs(reduced_features_predictions - test_labels) / test_labels))

# Create a DataFrame to hold the results
comparison = pd.DataFrame({'features': ['all (17)', 'reduced (5)'],
                           'run_time': [round(all_features_time, 2), round(reduced_features_time, 2)],
                           'accuracy': [round(all_accuracy, 2), round(reduced_accuracy, 2)]})
comparison[['features', 'accuracy', 'run_time']]

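The accuracy here is a metric we defined ourselves purely for convenience in this comparison. The results show that accuracy barely changes, while run time differs clearly. So when choosing algorithms and data, you also need to weigh them against the actual business requirements. For example, many tasks need a real-time response, and in that case run time may take priority over accuracy. The exact numbers show how much each one changed: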

relative_accuracy_decrease = 100 * (all_accuracy - reduced_accuracy) / all_accuracy
print('Relative decrease in accuracy:', round(relative_accuracy_decrease, 3), '%.')

relative_runtime_decrease = 100 * (all_features_time - reduced_features_time) / all_features_time
print('Relative decrease in run time:', round(relative_runtime_decrease, 3), '%.')

Relative decrease in accuracy: 0.071 %.
Relative decrease in run time: 40.739 %.
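Both figures use the same formula, 100 × (before − after) / before. As a quick worked check, the sketch below applies it to the rounded values printed earlier in this article. The results differ somewhat from the output above: the accuracy figure differs because that output was computed from unrounded values, and the run-time figure differs more, presumably because timings vary from run to run. The helper name `relative_decrease` is introduced here only for illustration.

```python
def relative_decrease(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100 * (before - after) / before

# Rounded accuracies printed earlier: 93.35 (all features) and 93.28 (reduced)
print(round(relative_decrease(93.35, 93.28), 3))

# Rounded run times printed earlier: 0.66 s (all features) and 0.29 s (reduced)
print(round(relative_decrease(0.66, 0.29), 1))
```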

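The experiment shows that the gain in run time is much larger than the loss in accuracy, so model quality is essentially preserved. Finally, let's bring all the experimental results together for comparison: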

# Pandas is used for data manipulation
import pandas as pd

# Read in data as a pandas dataframe and one-hot encode the categorical columns
original_features = pd.read_csv('data/temps.csv')
original_features = pd.get_dummies(original_features)

# Use numpy to convert to arrays
import numpy as np

# Labels are the values we want to predict
original_labels = np.array(original_features['actual'])

# Remove the labels from the features
# axis 1 refers to the columns
original_features = original_features.drop('actual', axis=1)

# Saving feature names for later use
original_feature_list = list(original_features.columns)

# Convert to numpy array
original_features = np.array(original_features)

# Using Scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
original_train_features, original_test_features, original_train_labels, original_test_labels = train_test_split(original_features, original_labels, test_size=0.25, random_state=42)

# Find the original feature indices (everything except the three new weather features)
original_feature_indices = [feature_list.index(feature) for feature in feature_list
                            if feature not in ['ws_1', 'prcp_1', 'snwd_1']]

# Create a test set of the original features, taken from the same test set as above
original_test_features = test_features[:, original_feature_indices]

# Time to train on original data set (1 year)
original_features_time = []

# Do 10 iterations and take the average
# (rf is not defined in this section; it is assumed to be the model from the earlier parts)
for _ in range(10):
    start_time = time.time()
    rf.fit(original_train_features, original_train_labels)
    original_features_predictions = rf.predict(original_test_features)
    end_time = time.time()
    original_features_time.append(end_time - start_time)

original_features_time = np.mean(original_features_time)

# Calculate mean absolute error for each model
original_mae = np.mean(abs(original_features_predictions - test_labels))
exp_all_mae = np.mean(abs(all_features_predictions - test_labels))
exp_reduced_mae = np.mean(abs(reduced_features_predictions - test_labels))

# Calculate accuracy for model trained on 1 year of data
original_accuracy = 100 * (1 - np.mean(abs(original_features_predictions - test_labels) / test_labels))

# Create a dataframe for comparison
model_comparison = pd.DataFrame({'model': ['original', 'exp_all', 'exp_reduced'],
                                 'error (degrees)': [original_mae, exp_all_mae, exp_reduced_mae],
                                 'accuracy': [original_accuracy, all_accuracy, reduced_accuracy],
                                 'run_time (s)': [original_features_time, all_features_time, reduced_features_time]})

# Order the dataframe
model_comparison = model_comparison[['model', 'error (degrees)', 'accuracy', 'run_time (s)']]

model_comparison

# Summarize with plots
# Overall layout: a single row reads better
fig, (ax1, ax2, ax3) = plt.subplots(nrows=1, ncols=3, figsize=(16, 5), sharex=True)

# X-axis
x_values = [0, 1, 2]
labels = list(model_comparison['model'])
plt.xticks(x_values, labels)

# Font sizes
fontdict = {'fontsize': 18}
fontdict_yaxis = {'fontsize': 14}

# Difference between predicted and actual temperature
ax1.bar(x_values, model_comparison['error (degrees)'], color=['b', 'r', 'g'], edgecolor='k', linewidth=1.5)
ax1.set_ylim(bottom=3.5, top=4.5)
ax1.set_ylabel('Error (degrees) (F)', fontdict=fontdict_yaxis);
ax1.set_title('Model Error Comparison', fontdict=fontdict)

# Accuracy comparison
ax2.bar(x_values, model_comparison['accuracy'], color=['b', 'r', 'g'], edgecolor='k', linewidth=1.5)
ax2.set_ylim(bottom=92, top=94)
ax2.set_ylabel('Accuracy (%)', fontdict=fontdict_yaxis);
ax2.set_title('Model Accuracy Comparison', fontdict=fontdict)

# Run-time comparison
ax3.bar(x_values, model_comparison['run_time (s)'], color=['b', 'r', 'g'], edgecolor='k', linewidth=1.5)
ax3.set_ylim(bottom=0, top=1)
ax3.set_ylabel('Run Time (sec)', fontdict=fontdict_yaxis);
ax3.set_title('Model Run-Time Comparison', fontdict=fontdict);

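original is our old data, the set with fewer samples and fewer features. exp_all is the complete new data. exp_reduced is the subset of important features selected with the 95% threshold. The result is clear: more data and more features improve the model somewhat, but they also cost run time.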

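The final choice of model has to be judged against the actual business application, but the analysis must be done thoroughly either way.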
