Random Forest Regression on the Diabetes Dataset

Contents

**1. Experiment Overview**
**2. Algorithm Analysis**
**3. Implementation**
**4. Code**
**5. Result Analysis**

1. Experiment Overview

In this experiment we implement a random forest model and use it for regression on the diabetes dataset.

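2. Algorithm Analysis

A random forest is an ensemble of N simple decision trees. For classification, the forest's prediction can be decided by a simple majority vote over the trees; for regression, it is the average of the outputs of the N regression trees.
Regression with a random forest can be implemented in two parts: first a regression decision tree, then a regression random forest built on top of that tree.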
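As a quick illustration of the two aggregation rules, here is a minimal sketch. The function names `aggregate_classification` and `aggregate_regression` and the sample values are made up for this example; they are not part of the experiment code.

```python
from collections import Counter

import numpy as np


def aggregate_classification(tree_votes):
    """Majority vote over the class labels predicted by the individual trees."""
    return Counter(tree_votes).most_common(1)[0][0]


def aggregate_regression(tree_outputs):
    """Mean of the numeric outputs predicted by the individual trees."""
    return float(np.mean(tree_outputs))


# Hypothetical outputs of three trees for a single sample
votes = ["a", "b", "a"]
outputs = [100.0, 120.0, 140.0]

vote_result = aggregate_classification(votes)
mean_result = aggregate_regression(outputs)
print(vote_result, mean_result)
```

The rest of this experiment only uses the regression rule, i.e. the mean.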

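3. Implementation

3.1 Regression decision tree
Starting from the classification decision tree of the previous experiment, the regression decision tree changes the following:

The split gain is measured by the reduction in variance of the target values.
The prediction at a leaf node is the mean of the target values instead of the majority class.
When searching for the best feature and its threshold, the observed feature values are used directly as candidate thresholds; there is no need to sort them and take the midpoints of adjacent values.
A feature that has already been used for a split can be used again in later splits.
Because features are no longer used up, the depth of the tree has to be limited by a max_depth parameter. A min_samples parameter is also needed: a node with fewer than min_samples samples becomes a leaf.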
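The variance-based gain and the use of observed values as candidate thresholds can be sketched on a single feature as follows. This is a simplified stand-alone version, not the code of section 4; the names `variance_gain` and `best_split` and the four-sample toy data are assumptions made for the illustration.

```python
import numpy as np


def variance_gain(y_parent, y_left, y_right):
    """Parent variance minus the size-weighted variances of the two children."""
    n = len(y_parent)
    return (np.var(y_parent)
            - len(y_left) / n * np.var(y_left)
            - len(y_right) / n * np.var(y_right))


def best_split(x, y):
    """Try every observed value of x as a threshold; samples with x < threshold go left."""
    best_gain, best_threshold = -np.inf, None
    for threshold in np.unique(x):
        left = x < threshold
        if not left.any():  # the smallest value would leave the left child empty
            continue
        gain = variance_gain(y, y[left], y[~left])
        if gain > best_gain:
            best_gain, best_threshold = gain, threshold
    return best_threshold, best_gain


# Toy data: the target jumps from 10 to 20 between x = 2 and x = 3
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([10.0, 10.0, 20.0, 20.0])
threshold, gain = best_split(x, y)
print(threshold, gain)
```

On this toy data the parent variance is 25, and the threshold 3.0 separates the two target levels perfectly, so both children have zero variance and the gain equals the full parent variance.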

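3.2 Regression random forest
The regression forest uses N regression decision trees. Two points need attention:

Randomness of samples
Each tree must be trained on different data. If all N trees were given the same data, they would produce essentially the same result and the forest would be pointless. So the data for each tree is drawn from the training set by random sampling with replacement.
Randomness of features
When searching for the best split feature, a random subset of the features is selected first, and the feature with the largest gain is chosen from within that subset.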
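Both sources of randomness fit in a few lines of NumPy. The sketch below is only an illustration: the helper names `bootstrap_sample` and `random_feature_subset`, the fixed seed, and the toy arrays are assumptions, and it uses NumPy's `Generator` API rather than the `np.random.choice` calls of section 4.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed only to make the example repeatable


def bootstrap_sample(X, y, rng):
    """Draw n rows with replacement from an n-row training set."""
    n = X.shape[0]
    idx = rng.integers(0, n, size=n)  # indices may repeat
    return X[idx], y[idx]


def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices to try at one split."""
    return rng.choice(n_features, size=k, replace=False)


# Toy data: 10 samples, 2 features, with X[i, 0] == 2 * y[i]
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)

X_boot, y_boot = bootstrap_sample(X, y, rng)
features = random_feature_subset(n_features=10, k=4, rng=rng)
print(y_boot, features)
```

The values `n_features=10` and `k=4` mirror the diabetes data used here: it has 10 features, and the code in section 4 tries `int(log2(m)) + 1` features per split, where m = 11 is the number of columns including the target.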

4. Code

import math
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestRegressor  # sklearn's random forest, used for comparison
from sklearn.metrics import r2_score  # goodness of fit (R^2) used to evaluate the results
from sklearn.model_selection import train_test_split
from sklearn import datasets


class DecisionNode(object):
    def __init__(self, f_idx, threshold, value=None, L=None, R=None):
        self.f_idx = f_idx          # index of the split feature (None for a leaf)
        self.threshold = threshold  # split threshold (None for a leaf)
        self.value = value          # prediction stored in a leaf (None for an inner node)
        self.L = L                  # left subtree: feature value < threshold
        self.R = R                  # right subtree: feature value >= threshold


# Change: no sorting, the observed values are used directly as split points
def find_best_threshold(dataset: np.ndarray, f_idx: int):
    # dataset: numpy.ndarray of shape (n, m+1), columns are [x, y]; f_idx: feature index
    best_gain = -math.inf  # start from negative infinity
    best_threshold = None
    candidate = list(set(dataset[:, f_idx].reshape(-1)))
    for threshold in candidate:
        # rows below the threshold go left, the rest go right
        L, R = split_dataset(dataset, f_idx, threshold)
        if len(L) == 0:
            # the smallest observed value leaves the left side empty; its gain is undefined
            continue
        gain = calculate_var_gain(dataset, L, R)  # variance gain of this split
        if gain > best_gain:  # keep the largest gain and its threshold
            best_gain = gain
            best_threshold = threshold
    return best_threshold, best_gain


def calculate_var(dataset: np.ndarray):
    y_ = dataset[:, -1].reshape(-1)
    var = np.var(y_)
    return var


def calculate_var_gain(dataset, l, r):
    var_y = calculate_var(dataset)
    var_gain = var_y - len(l) / len(dataset) * calculate_var(l) - len(r) / len(dataset) * calculate_var(r)
    return var_gain


def split_dataset(X: np.ndarray, f_idx: int, threshold: float):
    L = X[:, f_idx] < threshold
    R = ~L
    return X[L], X[R]


def mean_y(dataset):
    y_ = dataset[:, -1]
    return np.mean(y_)


def build_tree(dataset: np.ndarray, f_idx_list: list, depth, max_depth, min_samples):
    # Recursively builds the tree and returns a DecisionNode
    class_list = [data[-1] for data in dataset]  # target values
    n, m = dataset.shape
    k = min(int(math.log(m, 2)) + 1, m - 1)  # number of features tried at each split
    if n < min_samples:
        return DecisionNode(None, None, value=mean_y(dataset))
    elif depth > max_depth:
        return DecisionNode(None, None, value=mean_y(dataset))
    # all samples share the same target value
    elif class_list.count(class_list[0]) == len(class_list):
        return DecisionNode(None, None, value=mean_y(dataset))
    else:
        # find the feature with the largest gain
        best_gain = -math.inf
        best_threshold = None
        best_f_idx = None
        # only a random subset of the features is searched
        f_idx_list_random = list(np.random.choice(m - 1, size=k, replace=False))
        for i in f_idx_list_random:
            threshold, gain = find_best_threshold(dataset, i)
            if gain > best_gain:  # keep the largest gain, its threshold and its feature
                best_gain = gain
                best_threshold = threshold
                best_f_idx = i
        if best_f_idx is None:
            # every sampled feature was constant, so no valid split exists
            return DecisionNode(None, None, value=mean_y(dataset))
        # create the branches
        L, R = split_dataset(dataset, best_f_idx, best_threshold)
        depth += 1
        if len(L) == 0:
            L_tree = DecisionNode(None, None, mean_y(dataset))  # leaf node
        else:
            L_tree = build_tree(L, f_idx_list, depth, max_depth, min_samples)
        if len(R) == 0:
            R_tree = DecisionNode(None, None, mean_y(dataset))  # leaf node
        else:
            R_tree = build_tree(R, f_idx_list, depth, max_depth, min_samples)
        return DecisionNode(best_f_idx, best_threshold, value=None, L=L_tree, R=R_tree)


def predict_one(model: DecisionNode, data):
    if model.value is not None:
        return model.value
    else:
        feature_one = data[model.f_idx]
        branch = None
        if feature_one >= model.threshold:
            branch = model.R  # go right
        else:
            branch = model.L  # go left
        return predict_one(branch, data)


# Random sampling with replacement
def random_sample(dataset):
    n, _ = dataset.shape
    sub_data = np.copy(dataset)
    random_data_idx = np.random.choice(n, size=n, replace=True)  # n indices from 0..n-1, with replacement
    sub_data = sub_data[random_data_idx]
    return sub_data[:, 0:-1], sub_data[:, -1]


# Note: each instance holds a single randomized tree; the forest is formed in
# the main block by training T instances and averaging their predictions.
class Random_forest(object):
    def __init__(self, min_samples, max_depth):
        self.min_samples = min_samples  # a node with fewer samples than this becomes a leaf
        self.max_depth = max_depth  # maximum depth

    def fit(self, X: np.ndarray, y: np.ndarray) -> None:
        dataset_in = np.c_[X, y]
        f_idx_list = [i for i in range(X.shape[1])]
        depth = 0
        self.my_tree = build_tree(dataset_in, f_idx_list, depth, self.max_depth, self.min_samples)

    def predict(self, X: np.ndarray) -> np.ndarray:
        predict_list = []
        for data in X:
            predict_list.append(predict_one(self.my_tree, data))
        return np.array(predict_list)


if __name__ == "__main__":
    X, y = datasets.load_diabetes(return_X_y=True)
    y_predict_list = []
    r2_score_list = []
    tree_number = []
    MAE_list = []
    MAPE_list = []
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    print(y_test.shape)  # (89,)
    dataset = np.c_[X_train, y_train]
    np.seterr(divide='ignore', invalid='ignore')
    T = 100
    for i in range(T):
        X_train_samples, y_train_samples = random_sample(dataset)
        m = Random_forest(min_samples=5, max_depth=20)
        m.fit(X_train_samples, y_train_samples)
        y_predict = m.predict(X_test)
        y_predict_list.append(y_predict)  # list of per-tree prediction arrays
        print("epoch", i + 1, " done")
        y_ = np.mean(y_predict_list, axis=0)  # prediction of the forest built so far
        score = r2_score(y_test, y_)
        r2_score_list.append(score)
        tree_number.append((i + 1))
        errors = abs(y_ - y_test)
        MAE_list.append(np.mean(errors))  # mean absolute error
        mape = 100 * (errors / y_test)
        MAPE_list.append(np.mean(mape))  # mean absolute percentage error

    # print("r2_score_list", r2_score_list)
    plt.plot(tree_number[5:-1], r2_score_list[5:-1])
    plt.title('r2_score')
    plt.xlabel('tree number')
    plt.ylabel('r2_score')
    plt.show()

    # print("MAE_list", MAE_list)
    # print("MAPE_list", MAPE_list)
    plt.plot(tree_number, MAPE_list)
    plt.xlabel('tree number')
    plt.ylabel('MAPE %')
    plt.title("MAPE: Mean Absolute Percentage Error")
    plt.show()

    y_result = np.mean(y_predict_list, axis=0)  # final result
    print("r2_score:", r2_score(y_test, y_result))
    errors1 = abs(y_result - y_test)  # absolute errors
    print('Mean Absolute Error:', np.round(np.mean(errors1), 2))
    mape = 100 * (errors1 / y_test)  # absolute percentage errors
    print('MAPE:', np.round(np.mean(mape), 2), '%.')
    # accuracy = 100 - np.mean(mape)
    # print('Accuracy:', round(accuracy, 2), '%.')

    # --------------------------- plot ------------------------------
    plt.figure(figsize=(20, 5))
    plt.plot([i for i in range(y_test.shape[0])], y_test, color='red', alpha=0.8, label="y_test")
    plt.plot([i for i in range(y_test.shape[0])], y_result, color='blue', alpha=0.8, label="y_result")
    plt.legend(loc="upper right")
    plt.title("My Random forest")
    plt.show()

    # ---------------------------------- sklearn --------------------------------
    regressor = RandomForestRegressor(n_estimators=100, min_samples_leaf=5)
    regressor.fit(X_train, y_train)  # fit the model
    y_pred = regressor.predict(X_test)
    print('sklearn score:{}'.format(r2_score(y_test, y_pred)))  # R^2 of the predictions on the test set
    errors = abs(y_pred - y_test)
    # Print out the mean absolute error (mae)
    print('Mean Absolute Error:', np.round(np.mean(errors), 2))
    mape = 100 * (errors / y_test)
    accuracy = 100 - np.mean(mape)
    print('Accuracy:', round(accuracy, 2), '%.')

    # --------------------------- plot ------------------------------
    plt.figure(figsize=(20, 5))
    plt.plot([i for i in range(y_test.shape[0])], y_test, color='red', alpha=0.8, label="y_test")
    plt.plot([i for i in range(y_test.shape[0])], y_pred, color='blue', alpha=0.8, label="y_pred")
    plt.legend(loc="upper right")
    plt.title("sklearn RandomForestRegressor")
    plt.show()

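5. Result Analysis

5.1 Comparison with sklearn's built-in random forest
Two line charts were plotted here, showing the difference between the true values and the predicted values. They show that:

For both methods, the predicted values follow roughly the same trend as the true values.
The predicted curves in the upper and lower charts are largely the same, so the two methods give similar predictions.

The table below also shows that the two methods give similar results.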

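5.2 Effect of the number of trees on the random forest
The two figures below show r2_score and MAPE as functions of the number of decision trees. Both curves change rapidly as the number of trees goes from 1 to 20 and converge quickly. By about 40 trees both have largely converged, and adding more base learners (decision trees) brings essentially no further improvement.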
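Curves of this kind can be computed from the per-tree predictions with a running mean, as in the sketch below. It is a stand-alone illustration under stated assumptions: `r2` and `running_scores` are hypothetical helpers, and the "trees" are synthetic stand-ins (true value plus independent noise), not the trees trained in section 4, so the resulting numbers say nothing about the actual experiment.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


def running_scores(per_tree_predictions, y_true):
    """R^2 and MAPE (%) of the forest made of the first 1, 2, ..., T trees."""
    preds = np.asarray(per_tree_predictions, dtype=float)   # shape (T, n_test)
    counts = np.arange(1, preds.shape[0] + 1).reshape(-1, 1)
    running_mean = np.cumsum(preds, axis=0) / counts         # row t = mean of trees 0..t
    r2_curve = [r2(y_true, p) for p in running_mean]
    mape_curve = [100 * np.mean(np.abs(p - y_true) / y_true) for p in running_mean]
    return r2_curve, mape_curve


# Synthetic stand-in data (an assumption of this sketch, not real results)
rng = np.random.default_rng(0)
y_true = rng.uniform(50, 300, size=89)
per_tree = y_true + rng.normal(0, 60, size=(100, 89))

r2_curve, mape_curve = running_scores(per_tree, y_true)
print(len(r2_curve), len(mape_curve))
```

Using a cumulative sum avoids recomputing the mean over all stored predictions after every new tree, which is what the loop in section 4 does.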
