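Data Analysis: Titanic Survival Prediction
I did this as a course project at school, but only half understood the workflow. Having now finished the book 机器学习实战, I am redoing it with my own understanding.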
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
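Data import
Inspecting the data shows that Age and Cabin have missing values (Embarked is also missing two entries in the training set), and that Name, Sex, Ticket, Cabin, and Embarked are of object type, so they need to be converted during later processing.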
data_train = pd.read_csv(r'C:/Users/train.csv')
data_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
Now look at the test set.
data_test= pd.read_csv(r'test.csv')
data_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  418 non-null    int64
 1   Pclass       418 non-null    int64
 2   Name         418 non-null    object
 3   Sex          418 non-null    object
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64
 6   Parch        418 non-null    int64
 7   Ticket       418 non-null    object
 8   Fare         418 non-null    float64
 9   Cabin        91 non-null     object
 10  Embarked     418 non-null    object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
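Set the passenger ID as the index. (This cell was evidently run after the cleaning and name-feature steps further down, which is why `test_process` already exists here and already has the `Called`, `Name_length`, and `First_name` columns.)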
test_process = test_process.set_index(['PassengerId'])
test_process
The test set now looks like this:
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | Called | Name_length | First_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|
892 | 3 | Kelly, Mr. James | male | 34 | 0 | 0 | 330911 | 7.8292 | Q | Mr | 16 | Kelly |
893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47 | 1 | 0 | 363272 | 7.0000 | S | Mr | 32 | Wilkes |
894 | 2 | Myles, Mr. Thomas Francis | male | 62 | 0 | 0 | 240276 | 9.6875 | Q | Mr | 25 | Myles |
895 | 3 | Wirz, Mr. Albert | male | 27 | 0 | 0 | 315154 | 8.6625 | S | Mr | 16 | Wirz |
896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22 | 1 | 1 | 3101298 | 12.2875 | S | Mr | 44 | Hirvonen |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1305 | 3 | Spector, Mr. Woolf | male | 25 | 0 | 0 | A.5. 3236 | 8.0500 | S | Mr | 18 | Spector |
1306 | 1 | Oliva y Ocana, Dona. Fermina | female | 39 | 0 | 0 | PC 17758 | 108.9000 | C | NaN | 28 | Oliva y Ocana |
1307 | 3 | Saether, Mr. Simon Sivertsen | male | 38 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | S | Mr | 28 | Saether |
1308 | 3 | Ware, Mr. Frederick | male | 25 | 0 | 0 | 359309 | 8.0500 | S | Mr | 19 | Ware |
1309 | 3 | Peter, Master. Michael J | male | 22 | 1 | 1 | 2668 | 22.3583 | C | NaN | 24 | Peter |
418 rows × 12 columns
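Data processing
Missing-value handling
The missing values here are assumed to be missing completely at random, that is, not dependent on the other fully observed variables, so both deletion and imputation are acceptable. Cabin has too many missing values (only 204 of 891 training rows are filled), so the feature is dropped outright. If that feels too aggressive, compute some correlations or draw a plot first to check.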
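One way to do the check mentioned above is to tabulate the missing fraction per column and compare survival rates between rows with and without a recorded Cabin. The sketch below is only an illustration: the helper name `missing_report` and the toy values are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and fraction of missing values per column, most-missing first."""
    report = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "frac_missing": df.isna().mean(),
    })
    return report.sort_values("frac_missing", ascending=False)

# Toy stand-in for the training data (values invented for illustration).
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1],
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan, "E46"],
})

report = missing_report(toy)
print(report)

# Survival rate split by whether Cabin is recorded. A large gap would suggest
# that the missingness itself carries signal and is not completely random.
cabin_status = toy["Cabin"].notna().map({True: "recorded", False: "missing"})
survival_by_cabin = toy.groupby(cabin_status)["Survived"].mean()
print(survival_by_cabin)
```

On the real data, the same two calls would be applied to `data_train` instead of the toy frame.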
# Drop Cabin
train_process = data_train.drop(['Cabin'],axis=1)
# Impute the missing ages
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
# .copy() makes Age_df an independent frame, so the .loc assignment below
# does not trigger pandas' SettingWithCopyWarning
Age_df = train_process[['Age','Survived','Pclass','SibSp','Parch','Fare']].copy()
UnknowAge = Age_df[Age_df.Age.isnull()].values
KnowAge = Age_df[Age_df.Age.notnull()].values
# y is the target age, x holds the known attributes
y_train = KnowAge[:,0]
x_train = KnowAge[:,1:]
rfr = RandomForestRegressor(n_estimators=500,random_state=42)
rfr.fit(x_train,y_train)
# Predict ages for the rows where Age is missing and write them back
predictedAges = rfr.predict(UnknowAge[:,1:])
Age_df.loc[Age_df.Age.isnull(), 'Age'] = predictedAges
train_process['Age'] = Age_df['Age'].astype(int)
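Missing ages are imputed with a random forest: a regression model is fitted on the rows where Age is known and then used to predict the missing values.
The test set also needs the Cabin variable dropped and its missing ages imputed.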
# Test set
test_process = data_test.drop(['Cabin'],axis=1)
test_process.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  418 non-null    int64
 1   Pclass       418 non-null    int64
 2   Name         418 non-null    object
 3   Sex          418 non-null    object
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64
 6   Parch        418 non-null    int64
 7   Ticket       418 non-null    object
 8   Fare         418 non-null    float64
 9   Embarked     418 non-null    object
dtypes: float64(2), int64(4), object(4)
memory usage: 32.8+ KB
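# Survived is not available in the test set, so it is left out of the features.
# .copy() avoids pandas' SettingWithCopyWarning on the .loc assignment below.
Age_df = test_process[['Age','Pclass','SibSp','Parch','Fare']].copy()
UnknowAge = Age_df[Age_df.Age.isnull()].values
KnowAge = Age_df[Age_df.Age.notnull()].values
# y is the target age, x holds the known attributes
y_train = KnowAge[:,0]
x_train = KnowAge[:,1:]
# A separate regressor is fitted on the test rows whose Age is known
rfr = RandomForestRegressor(n_estimators=500,random_state=42)
rfr.fit(x_train,y_train)
predictedAges = rfr.predict(UnknowAge[:,1:])
Age_df.loc[Age_df.Age.isnull(), 'Age'] = predictedAges
test_process['Age'] = Age_df['Age'].astype(int)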
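The two imputation cells above repeat the same steps with separately fitted models. As a possible alternative (not what the notebook does), the sketch below fits one regressor on the training rows with a known Age and reuses it for both sets. The function name `impute_age`, the smaller `n_estimators`, and the synthetic data are assumptions for illustration, and `Survived` is left out of the features so the same model can be applied to the test set.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def impute_age(train, test, features, random_state=42):
    """Fit an Age regressor on known training ages and fill gaps in both frames."""
    train, test = train.copy(), test.copy()
    known = train[train["Age"].notna()]
    model = RandomForestRegressor(n_estimators=100, random_state=random_state)
    model.fit(known[features], known["Age"])
    for frame in (train, test):
        missing = frame["Age"].isna()
        if missing.any():
            frame.loc[missing, "Age"] = model.predict(frame.loc[missing, features])
    return train, test

# Synthetic stand-in data (invented values, not the Kaggle files).
rng = np.random.default_rng(42)

def make_frame(n):
    frame = pd.DataFrame({
        "Pclass": rng.integers(1, 4, size=n),
        "SibSp": rng.integers(0, 3, size=n),
        "Parch": rng.integers(0, 3, size=n),
        "Fare": rng.uniform(5, 100, size=n).round(2),
        "Age": rng.uniform(1, 70, size=n).round(),
    })
    # Blank out 20% of the ages to imitate the gaps in the real data
    frame.loc[frame.sample(frac=0.2, random_state=0).index, "Age"] = np.nan
    return frame

toy_train, toy_test = make_frame(50), make_frame(20)
features = ["Pclass", "SibSp", "Parch", "Fare"]
filled_train, filled_test = impute_age(toy_train, toy_test, features)
print(filled_train["Age"].isna().sum(), filled_test["Age"].isna().sum())
```

Fitting only on training rows keeps the test set out of every fitted step, at the cost of ignoring the ages that are known in the test set.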
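Text data processing
Process the Name text field: extract the title, the name length, and the part before the comma (stored as `First_name`, although in this dataset it is the family name), with the intention of discarding the Name variable afterwards.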
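def change(df):
    # Title: first match of Miss/Mr/Ms. Caveat: "Mrs" is matched as "Mr",
    # and other titles such as "Master" or "Dona" come out as NaN.
    df['Called'] = df['Name'].str.findall('Miss|Mr|Ms').str[0]
    # Length of the full name string
    df['Name_length'] = df['Name'].str.len()
    # Text before the comma
    df['First_name'] = df['Name'].str.split(',').str[0]
    # Note: this only rebinds the local variable, so the caller's DataFrame
    # keeps the 'Name' column; it is dropped explicitly later on.
    df = df.drop(['Name'],axis=1)

change(train_process)
change(test_process)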
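The pattern `'Miss|Mr|Ms'` used above labels "Mrs" as "Mr" and leaves titles such as "Master" empty, which is visible in the `Called` column of the tables in this post. A hedged alternative is to capture whatever sits between the comma and the next period. The regular expression below is an assumption about the name format (`Surname, Title. Given names`), checked only against the sample names shown in this post.

```python
import pandas as pd

# Sample names taken from the test-set preview in this post
names = pd.Series([
    "Kelly, Mr. James",
    "Wilkes, Mrs. James (Ellen Needs)",
    "Peter, Master. Michael J",
    "Oliva y Ocana, Dona. Fermina",
])

# Original approach: first match among three fixed alternatives
original_titles = names.str.findall("Miss|Mr|Ms").str[0]

# Alternative: the title is the text between ", " and the following "."
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
surnames = names.str.split(",").str[0]

print(original_titles.tolist())
print(titles.tolist())
print(surnames.tolist())
```

Rare titles could then be grouped into a single "Other" bucket before encoding, which is a modelling choice this post does not cover.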
TargetEncoder
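Encode the remaining object-type variables. There are many encoding schemes to choose from (the one used here comes from the `category_encoders` package, which follows the scikit-learn API). As a rule of thumb, target encoding suits features with no inherent order and more than 4 categories,
while one-hot encoding suits features with no inherent order and fewer than 4 categories.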
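To make the two schemes concrete, here is a minimal pandas sketch. The `target_encode` helper is hypothetical and uses simple additive smoothing toward the global mean. It is not the exact formula that `category_encoders.TargetEncoder` applies, and the toy values are invented.

```python
import pandas as pd

def target_encode(train_col, target, new_col, smoothing=10.0):
    """Map each category to a smoothed mean of the target, learned from training data."""
    prior = target.mean()
    stats = target.groupby(train_col).agg(["mean", "count"])
    # Categories with few rows are pulled more strongly toward the global mean
    weight = stats["count"] / (stats["count"] + smoothing)
    mapping = weight * stats["mean"] + (1 - weight) * prior
    # Categories never seen in training fall back to the global mean
    return new_col.map(mapping).fillna(prior)

toy = pd.DataFrame({
    "Embarked": ["S", "S", "C", "C", "Q"],
    "Survived": [0, 1, 1, 1, 0],
})
unseen = pd.Series(["S", "C", "X"])
encoded = target_encode(toy["Embarked"], toy["Survived"], unseen, smoothing=2.0)
print(encoded.tolist())

# One-hot encoding creates one indicator column per category instead
one_hot = pd.get_dummies(toy["Embarked"], prefix="Embarked")
print(one_hot.columns.tolist())
```

Target encoding keeps the feature in a single column, which helps with high-cardinality fields like Ticket, but because it uses the labels it can overfit categories that appear only once or twice.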
import category_encoders
from category_encoders import TargetEncoder
X_train = train_process.iloc[:,2:]
y_train = train_process.iloc[:,1]
tar_encoder1 = TargetEncoder(cols=['Sex','Ticket','Embarked','Called','Name_length','First_name'],handle_missing='value',handle_unknown='value')
tar_encoder1.fit(X_train,y_train)
TargetEncoder(cols=['Sex', 'Ticket', 'Embarked', 'Called', 'Name_length','First_name'])
X_train_encoded = tar_encoder1.transform(X_train)
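# Assign the result back; drop() returns a new DataFrame and does not modify in place
X_train_encoded = X_train_encoded.drop(['Name'],axis=1)
X_train_encoded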
index | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | Called | Name_length | First_name |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | 0.188908 | 22.0 | 1 | 0 | 0.383838 | 7.2500 | 0.336957 | 0.283721 | 0.282051 | 0.103230 |
1 | 1 | 0.742038 | 38.0 | 1 | 0 | 0.383838 | 71.2833 | 0.553571 | 0.283721 | 0.998476 | 0.383838 |
2 | 3 | 0.742038 | 26.0 | 0 | 0 | 0.383838 | 7.9250 | 0.336957 | 0.697802 | 0.315789 | 0.383838 |
3 | 1 | 0.742038 | 35.0 | 1 | 0 | 0.468759 | 53.1000 | 0.336957 | 0.283721 | 0.999439 | 0.468759 |
4 | 3 | 0.188908 | 35.0 | 0 | 0 | 0.383838 | 8.0500 | 0.336957 | 0.283721 | 0.372093 | 0.468759 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 2 | 0.188908 | 27.0 | 0 | 0 | 0.383838 | 13.0000 | 0.336957 | 0.492063 | 0.325000 | 0.383838 |
887 | 1 | 0.742038 | 19.0 | 0 | 0 | 0.383838 | 30.0000 | 0.336957 | 0.697802 | 0.372093 | 0.632953 |
888 | 3 | 0.742038 | NaN | 1 | 2 | 0.103230 | 23.4500 | 0.336957 | 0.697802 | 0.428461 | 0.103230 |
889 | 1 | 0.188908 | 26.0 | 0 | 0 | 0.383838 | 30.0000 | 0.553571 | 0.283721 | 0.325000 | 0.383838 |
890 | 3 | 0.188908 | 32.0 | 0 | 0 | 0.383838 | 7.7500 | 0.389610 | 0.283721 | 0.234375 | 0.383838 |
891 rows × 11 columns
X_test = test_process
X_test.drop(['Name'],axis=1)
PassengerId | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | Called | Name_length | First_name |
---|---|---|---|---|---|---|---|---|---|---|---|
892 | 3 | male | 34 | 0 | 0 | 330911 | 7.8292 | Q | Mr | 16 | Kelly |
893 | 3 | female | 47 | 1 | 0 | 363272 | 7.0000 | S | Mr | 32 | Wilkes |
894 | 2 | male | 62 | 0 | 0 | 240276 | 9.6875 | Q | Mr | 25 | Myles |
895 | 3 | male | 27 | 0 | 0 | 315154 | 8.6625 | S | Mr | 16 | Wirz |
896 | 3 | female | 22 | 1 | 1 | 3101298 | 12.2875 | S | Mr | 44 | Hirvonen |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1305 | 3 | male | 25 | 0 | 0 | A.5. 3236 | 8.0500 | S | Mr | 18 | Spector |
1306 | 1 | female | 39 | 0 | 0 | PC 17758 | 108.9000 | C | NaN | 28 | Oliva y Ocana |
1307 | 3 | male | 38 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | S | Mr | 28 | Saether |
1308 | 3 | male | 25 | 0 | 0 | 359309 | 8.0500 | S | Mr | 19 | Ware |
1309 | 3 | male | 22 | 1 | 1 | 2668 | 22.3583 | C | NaN | 24 | Peter |
418 rows × 11 columns
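# The encoder was fitted with the 'Name' column present, so transform first and drop 'Name' afterwards
X_test_encoded = tar_encoder1.transform(X_test).drop(['Name'],axis=1)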
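Feature scaling
Several models are compared later, and some of them (logistic regression, SVM) are sensitive to feature scale, so Age and Fare are standardized.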
import sklearn.preprocessing as preprocessing
scaler = preprocessing.StandardScaler()
# Fit on the training set only; a second fit on the test set would overwrite
# these statistics and scale both sets with test-set means and variances
scaler.fit(X_train_encoded[['Age','Fare']])
StandardScaler()
X_train_encoded[['Age','Fare']] = scaler.transform(X_train_encoded[['Age','Fare']])
X_test_encoded[['Age','Fare']] = scaler.transform(X_test_encoded[['Age','Fare']])
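The usual convention with `StandardScaler` is to learn the mean and standard deviation from the training data and apply those same statistics to the test data. The sketch below shows that pattern on synthetic arrays; the variable names and the random data are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic two-column data standing in for Age and Fare
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=30, scale=10, size=(100, 2))
X_test_demo = rng.normal(loc=35, scale=12, size=(40, 2))

# Learn mean and standard deviation from the training data only
scaler_demo = StandardScaler().fit(X_train_demo)

# Apply the same training statistics to both sets
train_scaled = scaler_demo.transform(X_train_demo)
test_scaled = scaler_demo.transform(X_test_demo)

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled.mean(axis=0), test_scaled.std(axis=0))
```

The training columns come out with mean 0 and standard deviation 1, while the test columns generally do not, which is expected: they are measured on the training set's scale.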
Model prediction
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
X_train_encoded
X_test_encoded
PassengerId | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | Called | Name_length | First_name |
---|---|---|---|---|---|---|---|---|---|---|---|
892 | 3 | 0.188908 | 0.325138 | 0 | 0 | 0.383838 | -0.497063 | 0.389610 | 0.283721 | 0.230769 | 0.732634 |
893 | 3 | 0.742038 | 1.326156 | 1 | 0 | 0.383838 | -0.511926 | 0.336957 | 0.283721 | 0.565217 | 0.383838 |
894 | 2 | 0.188908 | 2.481178 | 0 | 0 | 0.383838 | -0.463754 | 0.389610 | 0.283721 | 0.327273 | 0.383838 |
895 | 3 | 0.188908 | -0.213872 | 0 | 0 | 0.383838 | -0.482127 | 0.336957 | 0.283721 | 0.230769 | 0.383838 |
896 | 3 | 0.742038 | -0.598880 | 1 | 1 | 0.383838 | -0.417151 | 0.336957 | 0.283721 | 0.999439 | 0.383838 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1305 | 3 | 0.188908 | -0.367875 | 0 | 0 | 0.383838 | -0.493105 | 0.336957 | 0.283721 | 0.200000 | 0.383838 |
1306 | 1 | 0.742038 | 0.710145 | 0 | 0 | 0.468759 | 1.314557 | 0.553571 | 0.492063 | 0.372093 | 0.383838 |
1307 | 3 | 0.188908 | 0.633143 | 0 | 0 | 0.383838 | -0.507445 | 0.336957 | 0.283721 | 0.372093 | 0.383838 |
1308 | 3 | 0.188908 | -0.367875 | 0 | 0 | 0.383838 | -0.493105 | 0.336957 | 0.283721 | 0.234375 | 0.383838 |
1309 | 3 | 0.188908 | -0.598880 | 1 | 1 | 0.834289 | -0.236640 | 0.553571 | 0.492063 | 0.372093 | 0.834289 |
418 rows × 11 columns
X_train_encoded.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Pclass       891 non-null    int64
 1   Sex          891 non-null    float64
 2   Age          891 non-null    int32
 3   SibSp        891 non-null    int64
 4   Parch        891 non-null    int64
 5   Ticket       891 non-null    float64
 6   Fare         891 non-null    float64
 7   Embarked     891 non-null    float64
 8   Called       891 non-null    float64
 9   Name_length  891 non-null    float64
 10  First_name   891 non-null    float64
dtypes: float64(7), int32(1), int64(3)
memory usage: 73.2 KB
Voting
Start with a voting classifier.
lr_clf = LogisticRegression(penalty='l1',solver='saga',n_jobs=-1,max_iter=20000)
rnd_clf = RandomForestClassifier(n_estimators=300,max_depth=8,min_samples_leaf=1,min_samples_split=5,random_state=42)
svm_clf = SVC(C=2,kernel='poly',random_state=42,probability=True)
voting_clf = VotingClassifier(estimators=[('lr',lr_clf),('rf',rnd_clf),('scv',svm_clf)],voting='soft')
voting_clf.fit(X_train_encoded,y_train)
VotingClassifier(estimators=[('lr',LogisticRegression(max_iter=20000, n_jobs=-1,penalty='l1', solver='saga')),('rf',RandomForestClassifier(max_depth=8,min_samples_split=5,n_estimators=300,random_state=42)),('scv',SVC(C=2, kernel='poly', probability=True,random_state=42))],voting='soft')
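# Caveat: gender_submission.csv is Kaggle's sample submission (it marks every female
# passenger as a survivor), not the true test labels, so the scores below measure
# agreement with that baseline rather than real accuracy.
y_test = pd.read_csv(r'C:/Users/gender_submission.csv')
y_test = y_test['Survived']
from sklearn.metrics import accuracy_score
# Fit each model on the training set and score its predictions on the test set
for clf in (lr_clf,rnd_clf,svm_clf,voting_clf):
    clf.fit(X_train_encoded,y_train)
    y_pred = clf.predict(X_test_encoded)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))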
LogisticRegression 0.6961722488038278
RandomForestClassifier 0.80622009569378
SVC 0.6363636363636364
VotingClassifier 0.8110047846889952
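Since `cross_val_score` is imported above but never used, one hedged way to get an estimate that relies only on labelled training data is k-fold cross-validation. The sketch below runs it on synthetic data from `make_classification`; the dataset, the default hyperparameters, and the variable names are assumptions and do not reproduce the scores printed above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic binary classification data standing in for the encoded features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)

# Soft voting averages predicted class probabilities, so SVC needs probability=True
voting_demo = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("svc", SVC(probability=True, random_state=42)),
    ],
    voting="soft",
)

# 5-fold cross-validated accuracy: each fold is scored on data the model did not see
scores = cross_val_score(voting_demo, X_demo, y_demo, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```

With the notebook's data, the call would take `X_train_encoded` and `y_train` in place of the synthetic arrays.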
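Next, try XGBoost. The author found the result comparatively good, although the run below fits a regressor and reports mean squared error, which is not directly comparable with the accuracy scores above.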
XGBoost
import xgboost
from sklearn.metrics import mean_squared_error
xgb_reg = xgboost.XGBRFRegressor(random_state=42)
xgb_reg.fit(X_train_encoded,y_train)
y_pred = xgb_reg.predict(X_test_encoded)
val_error=mean_squared_error(y_test,y_pred)
print("Validation MSE:", val_error)
Validation MSE: 0.5023153196818051
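An MSE on 0/1 labels cannot be read on the same scale as the accuracies reported for the voting models. For reference, a constant prediction of 0.5 already gives an MSE of 0.25 on 0/1 labels. One way to compare, sketched below on made-up numbers, is to threshold the regression output at 0.5 and compute accuracy on the resulting labels. The arrays and the 0.5 cut-off are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Made-up labels and continuous scores, standing in for y_test and a regressor's output
y_true_demo = np.array([0, 1, 1, 0, 1])
y_score_demo = np.array([0.2, 0.7, 0.4, 0.1, 0.9])

# MSE treats the scores as real-valued predictions
mse_demo = mean_squared_error(y_true_demo, y_score_demo)

# Thresholding turns scores into class labels so accuracy can be computed
y_label_demo = (y_score_demo >= 0.5).astype(int)
acc_demo = accuracy_score(y_true_demo, y_label_demo)

print("MSE:", mse_demo)
print("Accuracy:", acc_demo)
```

Using a classifier class from the same library and scoring it with `accuracy_score` would avoid the thresholding step altogether.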