I did a course project on this back in school but only half understood the workflow. Having now finished the book 《机器学习实战》, I'm redoing the Titanic exercise with my own understanding.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Data Import

Looking at the data in detail, Age and Cabin have missing values, and Name, Sex, Ticket, Cabin, and Embarked are of object type; these will need to be handled in the preprocessing steps below.

data_train = pd.read_csv(r'C:/Users/train.csv')
data_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Now look at the test set.

data_test= pd.read_csv(r'test.csv')
data_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  418 non-null    int64
 1   Pclass       418 non-null    int64
 2   Name         418 non-null    object
 3   Sex          418 non-null    object
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64
 6   Parch        418 non-null    int64
 7   Ticket       418 non-null    object
 8   Fare         418 non-null    float64
 9   Cabin        91 non-null     object
 10  Embarked     418 non-null    object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
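
Before deciding how to handle the gaps, it can be useful to count them explicitly. A minimal sketch (not in the original code), using the two DataFrames loaded above:

# Count missing values per column in the training and test sets.
print(data_train.isnull().sum())
print(data_test.isnull().sum())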

Set the index to the passenger ID (this line is run after test_process has been created from data_test in the preprocessing below).

test_process = test_process.set_index(['PassengerId'])
test_process

The test set then looks like this (the snapshot below was taken after the name-feature engineering described later, which is why Called, Name_length, and First_name already appear):

Pclass Name Sex Age SibSp Parch Ticket Fare Embarked Called Name_length First_name
PassengerId
892 3 Kelly, Mr. James male 34 0 0 330911 7.8292 Q Mr 16 Kelly
893 3 Wilkes, Mrs. James (Ellen Needs) female 47 1 0 363272 7.0000 S Mr 32 Wilkes
894 2 Myles, Mr. Thomas Francis male 62 0 0 240276 9.6875 Q Mr 25 Myles
895 3 Wirz, Mr. Albert male 27 0 0 315154 8.6625 S Mr 16 Wirz
896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22 1 1 3101298 12.2875 S Mr 44 Hirvonen
... ... ... ... ... ... ... ... ... ... ... ... ...
1305 3 Spector, Mr. Woolf male 25 0 0 A.5. 3236 8.0500 S Mr 18 Spector
1306 1 Oliva y Ocana, Dona. Fermina female 39 0 0 PC 17758 108.9000 C NaN 28 Oliva y Ocana
1307 3 Saether, Mr. Simon Sivertsen male 38 0 0 SOTON/O.Q. 3101262 7.2500 S Mr 28 Saether
1308 3 Ware, Mr. Frederick male 25 0 0 359309 8.0500 S Mr 19 Ware
1309 3 Peter, Master. Michael J male 22 1 1 2668 22.3583 C NaN 24 Peter

418 rows × 12 columns

Data Preprocessing

Missing Value Handling

The missing values here are assumed to be missing completely at random, i.e. not dependent on the other variables, so both deletion and imputation are options. Cabin is missing for most rows, so that feature is dropped outright; if in doubt, you can compute some correlations or plot the data first to check.
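
One way to act on that check, as an illustrative sketch (the grouping below is not part of the original code): compare survival rates for passengers with and without a recorded Cabin.

# Survival rate grouped by whether Cabin is missing (illustrative check).
cabin_missing = data_train['Cabin'].isnull()
print(data_train.groupby(cabin_missing)['Survived'].mean())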

# Drop Cabin
train_process = data_train.drop(['Cabin'], axis=1)
# Impute missing Age values with a random forest regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
Age_df = train_process[['Age','Survived','Pclass','SibSp','Parch','Fare']].copy()
UnknowAge = Age_df[Age_df.Age.isnull()].values
KnowAge = Age_df[Age_df.Age.notnull()].values
# y is the target (Age); x holds the known attributes
y_train = KnowAge[:, 0]
x_train = KnowAge[:, 1:]
rfr = RandomForestRegressor(n_estimators=500, random_state=42)
rfr.fit(x_train, y_train)
predictedAges = rfr.predict(UnknowAge[:, 1:])
Age_df.loc[Age_df.Age.isnull(), 'Age'] = predictedAges
train_process['Age'] = Age_df['Age'].astype(int)

Missing ages are imputed with a random forest: a regression model is fit on the rows where Age is known and used to predict the missing values.

The test set also needs the Cabin column dropped and missing ages imputed; since the steps are the same apart from the Survived column, they could also be wrapped in a helper, as sketched below.
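
A possible refactor, as a hedged sketch (fill_missing_age is a hypothetical helper, not part of the original code), performing the same random-forest imputation on any DataFrame:

# Hypothetical helper: impute Age in place with a RandomForestRegressor
# fit on the rows where Age is known.
from sklearn.ensemble import RandomForestRegressor

def fill_missing_age(df, features):
    age_df = df[['Age'] + features].copy()
    known = age_df[age_df.Age.notnull()]
    unknown = age_df[age_df.Age.isnull()]
    rfr = RandomForestRegressor(n_estimators=500, random_state=42)
    rfr.fit(known[features], known['Age'])
    df.loc[df.Age.isnull(), 'Age'] = rfr.predict(unknown[features])
    return df

# Usage, e.g.:
# fill_missing_age(train_process, ['Survived','Pclass','SibSp','Parch','Fare'])
# fill_missing_age(test_process, ['Pclass','SibSp','Parch','Fare'])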

# Test set: drop Cabin as well
test_process = data_test.drop(['Cabin'],axis=1)
test_process.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  418 non-null    int64
 1   Pclass       418 non-null    int64
 2   Name         418 non-null    object
 3   Sex          418 non-null    object
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64
 6   Parch        418 non-null    int64
 7   Ticket       418 non-null    object
 8   Fare         418 non-null    float64
 9   Embarked     418 non-null    object
dtypes: float64(2), int64(4), object(4)
memory usage: 32.8+ KB
Age_df = test_process[['Age','Pclass','SibSp','Parch','Fare']].copy()
UnknowAge = Age_df[Age_df.Age.isnull()].values
KnowAge = Age_df[Age_df.Age.notnull()].values
# y is the target (Age); x holds the known attributes
y_train = KnowAge[:, 0]
x_train = KnowAge[:, 1:]
rfr = RandomForestRegressor(n_estimators=500, random_state=42)
rfr.fit(x_train, y_train)
predictedAges = rfr.predict(UnknowAge[:, 1:])
Age_df.loc[Age_df.Age.isnull(), 'Age'] = predictedAges
test_process['Age'] = Age_df['Age'].astype(int)

Text Feature Processing

Process the Name text: extract the title, the name length, and the surname as new features, then discard the Name column.

def change(df):
    # 'Mr' also matches inside 'Mrs', which is why Mrs rows end up labelled 'Mr'.
    df['Called'] = df['Name'].str.findall('Miss|Mr|Ms').str[0]
    df['Name_length'] = df['Name'].apply(lambda x: len(x))
    df['First_name'] = df['Name'].str.split(',').str[0]
    # Dropping 'Name' on a local copy inside the function would have no effect
    # for the caller, so the column is dropped explicitly after encoding.
    return df

change(train_process)
change(test_process)
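
A quick way to verify the extraction (illustrative, not in the original code) is to look at the distribution of the new Called column; this also makes the Mr/Mrs collision visible:

# Distribution of extracted titles; NaN means none of Miss/Mr/Ms matched.
print(train_process['Called'].value_counts(dropna=False))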

TargetEncoder

Encode the remaining object-type variables. There are many encoding schemes available (the category_encoders package is used here): target encoding suits categorical features with no intrinsic order and more than about 4 categories, while one-hot encoding suits unordered features with 4 or fewer categories.
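
For comparison, a minimal one-hot sketch for a low-cardinality column such as Embarked (illustrative only; the rest of this post sticks with target encoding):

# One-hot encode a low-cardinality column with pandas.
embarked_dummies = pd.get_dummies(train_process['Embarked'], prefix='Embarked')
print(embarked_dummies.head())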

import category_encoders
from category_encoders import TargetEncoder
X_train = train_process.iloc[:,2:]
y_train = train_process.iloc[:,1]
tar_encoder1 = TargetEncoder(
    cols=['Sex','Ticket','Embarked','Called','Name_length','First_name'],
    handle_missing='value',
    handle_unknown='value',
)
tar_encoder1.fit(X_train,y_train)
TargetEncoder(cols=['Sex', 'Ticket', 'Embarked', 'Called', 'Name_length','First_name'])
X_train_encoded = tar_encoder1.transform(X_train)
# Drop Name now that the engineered features exist
# (the assignment is needed for the drop to take effect).
X_train_encoded = X_train_encoded.drop(['Name'], axis=1)
X_train_encoded
Pclass Sex Age SibSp Parch Ticket Fare Embarked Called Name_length First_name
0 3 0.188908 22.0 1 0 0.383838 7.2500 0.336957 0.283721 0.282051 0.103230
1 1 0.742038 38.0 1 0 0.383838 71.2833 0.553571 0.283721 0.998476 0.383838
2 3 0.742038 26.0 0 0 0.383838 7.9250 0.336957 0.697802 0.315789 0.383838
3 1 0.742038 35.0 1 0 0.468759 53.1000 0.336957 0.283721 0.999439 0.468759
4 3 0.188908 35.0 0 0 0.383838 8.0500 0.336957 0.283721 0.372093 0.468759
... ... ... ... ... ... ... ... ... ... ... ...
886 2 0.188908 27.0 0 0 0.383838 13.0000 0.336957 0.492063 0.325000 0.383838
887 1 0.742038 19.0 0 0 0.383838 30.0000 0.336957 0.697802 0.372093 0.632953
888 3 0.742038 NaN 1 2 0.103230 23.4500 0.336957 0.697802 0.428461 0.103230
889 1 0.188908 26.0 0 0 0.383838 30.0000 0.553571 0.283721 0.325000 0.383838
890 3 0.188908 32.0 0 0 0.383838 7.7500 0.389610 0.283721 0.234375 0.383838

891 rows × 11 columns

X_test = test_process
# Drop Name here as well (again, assign the result so the drop takes effect).
X_test = X_test.drop(['Name'], axis=1)
X_test
Pclass Sex Age SibSp Parch Ticket Fare Embarked Called Name_length First_name
PassengerId
892 3 male 34 0 0 330911 7.8292 Q Mr 16 Kelly
893 3 female 47 1 0 363272 7.0000 S Mr 32 Wilkes
894 2 male 62 0 0 240276 9.6875 Q Mr 25 Myles
895 3 male 27 0 0 315154 8.6625 S Mr 16 Wirz
896 3 female 22 1 1 3101298 12.2875 S Mr 44 Hirvonen
... ... ... ... ... ... ... ... ... ... ... ...
1305 3 male 25 0 0 A.5. 3236 8.0500 S Mr 18 Spector
1306 1 female 39 0 0 PC 17758 108.9000 C NaN 28 Oliva y Ocana
1307 3 male 38 0 0 SOTON/O.Q. 3101262 7.2500 S Mr 28 Saether
1308 3 male 25 0 0 359309 8.0500 S Mr 19 Ware
1309 3 male 22 1 1 2668 22.3583 C NaN 24 Peter

418 rows × 11 columns

X_test_encoded = tar_encoder1.transform(X_test)

Standardization

Several models will be compared later, so the numeric features Age and Fare are standardized to zero mean and unit variance first.

import sklearn.preprocessing as preprocessing
scaler = preprocessing.StandardScaler()
# Fit on the training set only, then apply the same transform to both sets;
# re-fitting on the test set would overwrite the training statistics and leak information.
scaler.fit(X_train_encoded[['Age','Fare']])
StandardScaler()
X_train_encoded[['Age','Fare']] = scaler.transform(X_train_encoded[['Age','Fare']])
X_test_encoded[['Age','Fare']] = scaler.transform(X_test_encoded[['Age','Fare']])
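
An alternative sketch (not part of the original post): wrapping the scaling and a model into a single Pipeline makes it impossible to mix up which set the scaler was fitted on. The column choice and the LogisticRegression placeholder are illustrative assumptions.

# Hypothetical pipeline: scale Age/Fare, pass the rest through, then classify.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

preprocess = ColumnTransformer(
    [('scale', StandardScaler(), ['Age','Fare'])],
    remainder='passthrough',
)
pipe = Pipeline([('prep', preprocess), ('clf', LogisticRegression(max_iter=20000))])
# pipe.fit(X_train_encoded, y_train); pipe.predict(X_test_encoded)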

Model Prediction

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
X_train_encoded
X_test_encoded
Pclass Sex Age SibSp Parch Ticket Fare Embarked Called Name_length First_name
PassengerId
892 3 0.188908 0.325138 0 0 0.383838 -0.497063 0.389610 0.283721 0.230769 0.732634
893 3 0.742038 1.326156 1 0 0.383838 -0.511926 0.336957 0.283721 0.565217 0.383838
894 2 0.188908 2.481178 0 0 0.383838 -0.463754 0.389610 0.283721 0.327273 0.383838
895 3 0.188908 -0.213872 0 0 0.383838 -0.482127 0.336957 0.283721 0.230769 0.383838
896 3 0.742038 -0.598880 1 1 0.383838 -0.417151 0.336957 0.283721 0.999439 0.383838
... ... ... ... ... ... ... ... ... ... ... ...
1305 3 0.188908 -0.367875 0 0 0.383838 -0.493105 0.336957 0.283721 0.200000 0.383838
1306 1 0.742038 0.710145 0 0 0.468759 1.314557 0.553571 0.492063 0.372093 0.383838
1307 3 0.188908 0.633143 0 0 0.383838 -0.507445 0.336957 0.283721 0.372093 0.383838
1308 3 0.188908 -0.367875 0 0 0.383838 -0.493105 0.336957 0.283721 0.234375 0.383838
1309 3 0.188908 -0.598880 1 1 0.834289 -0.236640 0.553571 0.492063 0.372093 0.834289

418 rows × 11 columns

X_train_encoded.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Pclass       891 non-null    int64
 1   Sex          891 non-null    float64
 2   Age          891 non-null    int32
 3   SibSp        891 non-null    int64
 4   Parch        891 non-null    int64
 5   Ticket       891 non-null    float64
 6   Fare         891 non-null    float64
 7   Embarked     891 non-null    float64
 8   Called       891 non-null    float64
 9   Name_length  891 non-null    float64
 10  First_name   891 non-null    float64
dtypes: float64(7), int32(1), int64(3)
memory usage: 73.2 KB

Voting

First, try a soft-voting ensemble.

lr_clf = LogisticRegression(penalty='l1',solver='saga',n_jobs=-1,max_iter=20000)
rnd_clf = RandomForestClassifier(n_estimators=300,max_depth=8,min_samples_leaf=1,min_samples_split=5,random_state=42)
svm_clf = SVC(C=2,kernel='poly',random_state=42,probability=True)
voting_clf = VotingClassifier(estimators=[('lr',lr_clf),('rf',rnd_clf),('scv',svm_clf)],voting='soft')
voting_clf.fit(X_train_encoded,y_train)
  VotingClassifier(estimators=[('lr',LogisticRegression(max_iter=20000, n_jobs=-1,penalty='l1', solver='saga')),('rf',RandomForestClassifier(max_depth=8,min_samples_split=5,n_estimators=300,random_state=42)),('scv',SVC(C=2, kernel='poly', probability=True,random_state=42))],voting='soft')
y_test = pd.read_csv(r'C:/Users/gender_submission.csv')
y_test = y_test['Survived']
from sklearn.metrics import accuracy_score

for clf in (lr_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train_encoded, y_train)
    y_pred = clf.predict(X_test_encoded)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
LogisticRegression 0.6961722488038278
RandomForestClassifier 0.80622009569378
SVC 0.6363636363636364
VotingClassifier 0.8110047846889952
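
One caveat: gender_submission.csv is Kaggle's sample submission (a sex-based baseline), not true test labels, so the accuracies above are only indicative. A hedged sketch using the already-imported cross_val_score and GridSearchCV on the training set gives a leakage-free estimate; the grid values are illustrative assumptions.

# Illustrative only: cross-validated accuracy plus a small random forest grid search.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV

scores = cross_val_score(voting_clf, X_train_encoded, y_train, cv=5, scoring='accuracy')
print('Voting CV accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))

param_grid = {'n_estimators': [100, 300], 'max_depth': [6, 8, 10]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train_encoded, y_train)
print(grid.best_params_, grid.best_score_)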

Next, try XGBoost, which is often said to perform well on this task; note that the result below is reported as a regression MSE on the predicted survival values, so it is not directly comparable with the accuracies above.

XGBoost

import xgboost
from sklearn.metrics import mean_squared_error
xgb_reg = xgboost.XGBRFRegressor(random_state=42)
xgb_reg.fit(X_train_encoded,y_train)
y_pred = xgb_reg.predict(X_test_encoded)
val_error=mean_squared_error(y_test,y_pred)
print("Validation MSE:", val_error)
Validation MSE: 0.5023153196818051
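
For a number comparable with the classifiers above, a hedged sketch using XGBClassifier and accuracy instead (hyperparameters left mostly at defaults; this is illustrative, not the original author's setup):

# Illustrative alternative: treat Survived as a classification target and report accuracy.
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

xgb_clf = XGBClassifier(random_state=42, eval_metric='logloss')
xgb_clf.fit(X_train_encoded, y_train)
print('XGBClassifier accuracy:', accuracy_score(y_test, xgb_clf.predict(X_test_encoded)))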
