Running a Random Forest in Jupyter to Predict Titanic Passenger Survival

Dataset link (Baidu Netdisk): https://pan.baidu.com/s/1_pQ-3iG4dr0hrvU_5hYUtg
Extraction code: 9520
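The tool we will use is Jupyter Notebook, chosen because it runs a program cell by cell. Beginners make frequent coding mistakes, and in PyCharm every failed run means rerunning the whole script, which wastes a lot of time when the dataset is large. Jupyter runs code interactively, one cell at a time, which makes debugging much friendlier. To start Jupyter, see the figure below: installing Anaconda3 also installs Jupyter by default, so you only need to click it to launch.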

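A web page then opens (see the figure below). The directory it lists is your user folder under Users (or 用户) on the C: drive. Some entries are displayed in English, but this does not affect anything. Click New and choose Python 3.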

This opens the interface shown below, where we can run code.
Shortcut to run a cell: Shift+Enter

That covers the basics of Jupyter, so let's use it to run our code. The code below is saved cell by cell, so you can run it one cell at a time as I did.

Solution steps

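import pandas as pd
from sklearn.feature_extraction import DictVectorizer  # extracts features from dicts: string-valued features become one-hot columns; with sparse=False it returns a dense array instead of a sparse matrix
from sklearn.ensemble import RandomForestClassifier  # random forest
from sklearn.tree import DecisionTreeClassifier  # decision tree (imported here but not used below)
from sklearn.model_selection import GridSearchCV  # grid search (for hyperparameter tuning)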

(1) Load the required data

train_data = pd.read_csv('./data/train.csv')
test_data = pd.read_csv('./data/test.csv')
y_test = pd.read_csv('./data/gender_submission.csv')  # Kaggle's sample submission, used here as stand-in test labels (it is not the ground truth)

(2) Data cleaning

# Fill NaN values in Age with the mean age
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].mean())
test_data['Age'] = test_data['Age'].fillna(test_data['Age'].mean())
# Fill NaN values in Fare with the mean fare
train_data['Fare'] = train_data['Fare'].fillna(train_data['Fare'].mean())
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].mean())
print(train_data['Embarked'].value_counts())
# Fill NaN values in Embarked with the most common port of embarkation ('S')
train_data['Embarked'] = train_data['Embarked'].fillna('S')
test_data['Embarked'] = test_data['Embarked'].fillna('S')

S    644
C    168
Q     77
Name: Embarked, dtype: int64
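The cleaning step can be tried on a tiny frame before touching the real data. The sketch below uses made-up values (the `toy` frame is hypothetical, not part of the Titanic files) and looks up the most common port with `mode()` rather than hard-coding 'S':

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for two Titanic columns
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0],
    "Embarked": ["S", None, "C", "S"],
})

# Numeric column: replace missing values with the column mean
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

# Categorical column: replace missing values with the most frequent value
most_common_port = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_common_port)

print(toy)
```

Assigning the result back to the column, instead of calling `fillna(..., inplace=True)` on a selected column, avoids the chained-assignment warnings that recent pandas versions raise.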

(3) Feature selection

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
x_train = train_data[features]
y_train = train_data['Survived']
x_test = test_data[features]
y_test = y_test['Survived']  # the loaded file has two columns; take 'Survived' as the test labels
dvec = DictVectorizer(sparse=False)
# orient must be 'records'; the short name 'record' is deprecated
x_train = dvec.fit_transform(x_train.to_dict(orient='records'))
x_test = dvec.transform(x_test.to_dict(orient='records'))

Note: the original run used orient='record', which produced the following warning:

E:\Program Files\Anaconda3\lib\site-packages\pandas\core\frame.py:1549: FutureWarning: Using short name for 'orient' is deprecated. Only the options: ('dict', list, 'series', 'split', 'records', 'index') will be used in a future version. Use one of the above to silence this warning.
  warnings.warn(
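To see what DictVectorizer does to the feature dictionaries, here is a minimal sketch on two made-up passenger records (the `records` list and the `dvec_demo` name are illustrative, not from the tutorial's data). String-valued features are expanded into one-hot columns, while numeric features pass through unchanged:

```python
from sklearn.feature_extraction import DictVectorizer

# Two hypothetical passenger records
records = [
    {"Sex": "male", "Age": 22.0, "Embarked": "S"},
    {"Sex": "female", "Age": 38.0, "Embarked": "C"},
]

dvec_demo = DictVectorizer(sparse=False)
encoded = dvec_demo.fit_transform(records)

# Column order of the encoded array
print(dvec_demo.feature_names_)
print(encoded)
```

The columns come out as Age, Embarked=C, Embarked=S, Sex=female, Sex=male, so the first record is encoded as [22, 0, 1, 0, 1]. Calling `transform` (not `fit_transform`) on the test set reuses the same column layout, which is why the tutorial fits only on the training data.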

y_test

0      0
1      1
2      0
3      0
4      1
      ..
413    0
414    1
415    0
416    0
417    0
Name: Survived, Length: 418, dtype: int64

(4) Build the random forest

estimator = RandomForestClassifier()

(5) Add grid search and cross-validation

param_dict = {"n_estimators": [120, 200, 300, 500, 800, 1200], "max_depth": [5, 8, 15, 25, 30]}
estimator = GridSearchCV(estimator, param_grid=param_dict, cv=3)  # cv=3 means 3-fold cross-validation
estimator.fit(x_train, y_train)

GridSearchCV(cv=3, estimator=RandomForestClassifier(),
             param_grid={'max_depth': [5, 8, 15, 25, 30],
                         'n_estimators': [120, 200, 300, 500, 800, 1200]})
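It helps to know how much work this grid implies before running it. The sketch below counts the parameter combinations with scikit-learn's `ParameterGrid`; the `cv_folds`, `n_candidates`, and `n_fits` names are only illustrative, and the final "+1" assumes GridSearchCV's default `refit=True`:

```python
from sklearn.model_selection import ParameterGrid

param_dict = {
    "n_estimators": [120, 200, 300, 500, 800, 1200],
    "max_depth": [5, 8, 15, 25, 30],
}
cv_folds = 3

# Every combination of the listed values is one candidate model
n_candidates = len(ParameterGrid(param_dict))

# Each candidate is trained once per fold, plus one final refit
# of the best candidate on the whole training set (refit=True)
n_fits = n_candidates * cv_folds + 1

print(n_candidates, n_fits)
```

With 6 values of n_estimators and 5 of max_depth there are 30 candidates and 91 model fits, which is why the search takes noticeably longer than training a single forest.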

(6) Evaluate the model

# Method 1: compare the true and predicted values directly
y_predict = estimator.predict(x_test)
print('y_predict:\n', y_predict)
print('Direct comparison of true and predicted values:\n', y_test == y_predict)

y_predict:
 [0 0 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 0 1 1 1 1 0 1 0 1 0 0 0 0 0 1 0 1 0 0
 0 0 1 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 1 1 1 0 0 1 1 0 0 0
 1 0 0 1 0 1 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0
 0 1 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
 0 0 1 0 0 1 0 0 1 0 0 1 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 1 1 0 0 1 0 1
 0 1 0 0 0 0 0 1 0 1 0 1 1 0 0 1 1 0 1 0 0 0 0 1 0 0 0 0 1 1 0 1 0 1 0 1 0
 1 0 1 0 0 0 0 0 0 1 0 0 1 0 1 0 1 1 1 1 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1
 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 0
 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 1 0 0 0 1 0
 0 1 0 0 1 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0
 0 0 1 0 1 0 0 1 0 0 1]
Direct comparison of true and predicted values:
0       True
1      False
2       True
3       True
4       True
       ...
413     True
414     True
415     True
416     True
417    False
Name: Survived, Length: 418, dtype: bool
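The element-wise comparison above is hard to read at a glance, but it already contains the accuracy: the fraction of True entries is the share of correct predictions. A small sketch with made-up labels (the `y_true_demo` and `y_pred_demo` arrays are hypothetical) shows that this matches scikit-learn's `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 4 of the 5 predictions agree with the truth
y_true_demo = np.array([0, 1, 0, 0, 1])
y_pred_demo = np.array([0, 0, 0, 0, 1])

# Boolean array, True where the prediction is correct
matches = y_true_demo == y_pred_demo

# The mean of a boolean array is the fraction of True values
manual_accuracy = matches.mean()
library_accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(manual_accuracy, library_accuracy)
```

For a classifier, `estimator.score(x_test, y_test)` reports this same mean accuracy.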

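# Method 2: compute the accuracy
score = estimator.score(x_test, y_test)
print("Accuracy:\n", score)
# Analysis of the search results
print('Best parameters:\n', estimator.best_params_)
# best_score_: the best mean cross-validation score
print('Best score:\n', estimator.best_score_)
# best_estimator_: the model refitted with the best parameters
print('Best estimator:\n', estimator.best_estimator_)
# cv_results_: timings and validation scores of every candidate on every fold
print('Cross-validation results:\n', estimator.cv_results_)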

Accuracy:
 0.861244019138756
Best parameters:
 {'max_depth': 8, 'n_estimators': 800}
Best score:
 0.8260381593714928
Best estimator:
 RandomForestClassifier(max_depth=8, n_estimators=800)
Cross-validation results:
 {'mean_fit_time': array([0.14993318, 0.24965994, 0.38132485, 0.61536312, 1.04751142, 1.57679296,
       0.15622576, 0.25996598, 0.38662124, 0.64927038, 1.03956922, 1.54551975,
       0.18119264, 0.32046803, 0.48171075, 0.72539409, 1.18317087, 1.83011953,
       0.1668752 , 0.27692644, 0.41985854, 0.71610451, 1.1459322 , 1.72572255,
       0.16987888, 0.30083632, 0.41321762, 0.75131655, 1.13660574, 1.64195848]),
 'std_fit_time': array([0.00287001, 0.00737044, 0.00308405, 0.00643914, 0.07266875, 0.10127349,
       0.0020616 , 0.00248251, 0.00046787, 0.00295435, 0.00633619, 0.00846855,
       0.01510928, 0.01061129, 0.04667695, 0.03663693, 0.01839278, 0.02784425,
       0.00329037, 0.0065818 , 0.00488676, 0.02092371, 0.01521066, 0.07668334,
       0.00418458, 0.03867806, 0.00377439, 0.02325702, 0.00683441, 0.01049551]),
 'mean_score_time': array([0.0126303 , 0.02026908, 0.03123816, 0.06316447, 0.08343434, 0.11901553,
       0.01332045, 0.02326671, 0.03324294, 0.05385939, 0.0877467 , 0.1286722 ,
       0.01628272, 0.02326957, 0.03890379, 0.06084267, 0.10037716, 0.13929447,
       0.0143044 , 0.02393691, 0.0355804 , 0.06049252, 0.09140317, 0.14261834,
       0.01329788, 0.02727048, 0.03690338, 0.06350621, 0.10572728, 0.13629254]),
 'std_score_time': array([0.00045032, 0.00093432, 0.00093255, 0.01600419, 0.00410913, 0.00248874,
       0.00045671, 0.00124138, 0.00124407, 0.00081108, 0.0008147 , 0.00162775,
       0.00093646, 0.00170441, 0.0024408 , 0.00571806, 0.00692931, 0.00384891,
       0.00046342, 0.00141029, 0.00045754, 0.00048385, 0.00047451, 0.0029412 ,
       0.00047064, 0.00576717, 0.00080651, 0.00400502, 0.01742332, 0.00125819]),
 'param_max_depth': masked_array(data=[5, 5, 5, 5, 5, 5, 8, 8, 8, 8, 8, 8, 15, 15, 15, 15, 15,
                   15, 25, 25, 25, 25, 25, 25, 30, 30, 30, 30, 30, 30],
             mask=[False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False],
       fill_value='?',
            dtype=object),
 'param_n_estimators': masked_array(data=[120, 200, 300, 500, 800, 1200, 120, 200, 300, 500, 800,
                   1200, 120, 200, 300, 500, 800, 1200, 120, 200, 300,
                   500, 800, 1200, 120, 200, 300, 500, 800, 1200],
             mask=[False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False],
       fill_value='?',
            dtype=object),
 'params': [{'max_depth': 5, 'n_estimators': 120}, {'max_depth': 5, 'n_estimators': 200}, {'max_depth': 5, 'n_estimators': 300},
            {'max_depth': 5, 'n_estimators': 500}, {'max_depth': 5, 'n_estimators': 800}, {'max_depth': 5, 'n_estimators': 1200},
            {'max_depth': 8, 'n_estimators': 120}, {'max_depth': 8, 'n_estimators': 200}, {'max_depth': 8, 'n_estimators': 300},
            {'max_depth': 8, 'n_estimators': 500}, {'max_depth': 8, 'n_estimators': 800}, {'max_depth': 8, 'n_estimators': 1200},
            {'max_depth': 15, 'n_estimators': 120}, {'max_depth': 15, 'n_estimators': 200}, {'max_depth': 15, 'n_estimators': 300},
            {'max_depth': 15, 'n_estimators': 500}, {'max_depth': 15, 'n_estimators': 800}, {'max_depth': 15, 'n_estimators': 1200},
            {'max_depth': 25, 'n_estimators': 120}, {'max_depth': 25, 'n_estimators': 200}, {'max_depth': 25, 'n_estimators': 300},
            {'max_depth': 25, 'n_estimators': 500}, {'max_depth': 25, 'n_estimators': 800}, {'max_depth': 25, 'n_estimators': 1200},
            {'max_depth': 30, 'n_estimators': 120}, {'max_depth': 30, 'n_estimators': 200}, {'max_depth': 30, 'n_estimators': 300},
            {'max_depth': 30, 'n_estimators': 500}, {'max_depth': 30, 'n_estimators': 800}, {'max_depth': 30, 'n_estimators': 1200}],
 'split0_test_score': array([0.8047138 , 0.8047138 , 0.81481481, 0.81144781, 0.81144781, 0.81818182,
       0.80808081, 0.79461279, 0.79124579, 0.7979798 , 0.8047138 , 0.8013468 ,
       0.77777778, 0.77777778, 0.77104377, 0.77441077, 0.77104377, 0.77104377,
       0.78114478, 0.76767677, 0.77104377, 0.77777778, 0.77777778, 0.76767677,
       0.78114478, 0.77104377, 0.76767677, 0.77104377, 0.77104377, 0.77104377]),
 'split1_test_score': array([0.83501684, 0.83501684, 0.83501684, 0.82828283, 0.83501684, 0.82491582,
       0.84848485, 0.84511785, 0.83838384, 0.83501684, 0.85521886, 0.83838384,
       0.81144781, 0.81144781, 0.8047138 , 0.81818182, 0.82491582, 0.81818182,
       0.82154882, 0.82154882, 0.82491582, 0.82491582, 0.82154882, 0.82154882,
       0.82491582, 0.82491582, 0.82154882, 0.82154882, 0.82154882, 0.82491582]),
 'split2_test_score': array([0.81144781, 0.8047138 , 0.81481481, 0.80808081, 0.80808081, 0.81144781,
       0.81481481, 0.81481481, 0.82491582, 0.82154882, 0.81818182, 0.82491582,
       0.8047138 , 0.81144781, 0.80808081, 0.81481481, 0.81481481, 0.81144781,
       0.7979798 , 0.78451178, 0.79124579, 0.79461279, 0.7979798 , 0.8047138 ,
       0.8047138 , 0.8013468 , 0.79461279, 0.79124579, 0.79461279, 0.8013468 ]),
 'mean_test_score': array([0.81705948, 0.81481481, 0.82154882, 0.81593715, 0.81818182, 0.81818182,
       0.82379349, 0.81818182, 0.81818182, 0.81818182, 0.82603816, 0.82154882,
       0.7979798 , 0.80022447, 0.79461279, 0.80246914, 0.80359147, 0.80022447,
       0.80022447, 0.79124579, 0.79573513, 0.79910213, 0.79910213, 0.7979798 ,
       0.80359147, 0.79910213, 0.79461279, 0.79461279, 0.79573513, 0.79910213]),
 'std_test_score': array([0.01299196, 0.01428499, 0.00952332, 0.00883727, 0.01198325, 0.00549829,
       0.01767454, 0.0207556 , 0.01982438, 0.0153066 , 0.02135387, 0.0153066 ,
       0.01454712, 0.01587221, 0.01672241, 0.01988782, 0.02338122, 0.0208162 ,
       0.01657107, 0.02250274, 0.02222109, 0.01950409, 0.01788707, 0.02250274,
       0.01788707, 0.02205037, 0.02199317, 0.0207556 , 0.02063387, 0.02205037]),
 'rank_test_score': array([10, 12,  3, 11,  5,  5,
        2,  5,  5,  5,  1,  3,
       23, 16, 27, 15, 13, 16,
       16, 30, 25, 19, 19, 23,
       13, 19, 27, 27, 25, 19])}
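The raw cv_results_ dictionary is difficult to read when printed. One common way to inspect it is to load it into a pandas DataFrame and sort by rank. The sketch below does this on synthetic data with a deliberately tiny grid so that it runs quickly; the data from `make_classification`, the small grid, and the names `search`, `results`, and `summary` are assumptions for illustration, not the tutorial's actual run:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the Titanic set)
X_demo, y_demo = make_classification(n_samples=120, n_features=6, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 20], "max_depth": [2, 4]},
    cv=3,
)
search.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(search.cv_results_)
columns = ["param_max_depth", "param_n_estimators",
           "mean_test_score", "std_test_score", "rank_test_score"]
summary = results[columns].sort_values("rank_test_score")

print(summary.to_string(index=False))
```

The same two lines (`pd.DataFrame(estimator.cv_results_)` followed by a sort on `rank_test_score`) should work on the fitted `estimator` from step (5), where the top row would correspond to max_depth=8, n_estimators=800.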
