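I did this as a course project at school, but only half understood the workflow. Now that I have finished the book 《机器学习实战》, I am redoing it with my own understanding.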

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

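Data import

Inspecting the data shows that Age and Cabin have missing values (Embarked is also missing two entries), and that Name, Sex, Ticket, Cabin and Embarked are of object dtype. These need to be adjusted in the later processing steps.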

data_train = pd.read_csv(r'C:/Users/train.csv')
data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
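The same information can be pulled out programmatically rather than read off the `info()` listing. The following is a minimal sketch on a made-up toy frame (the values are not from the Titanic data); with the real data, `toy` would be replaced by `data_train`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Sex": ["male", "female", "female", "male"],
})

# Number of missing values per column, largest first.
missing = toy.isnull().sum().sort_values(ascending=False)
print(missing)

# Columns that are not numeric, i.e. the ones that need encoding later.
non_numeric = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]
print(non_numeric)
```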

Now look at the test set.

data_test= pd.read_csv(r'test.csv')
data_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  418 non-null    int64
 1   Pclass       418 non-null    int64
 2   Name         418 non-null    object
 3   Sex          418 non-null    object
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64
 6   Parch        418 non-null    int64
 7   Ticket       418 non-null    object
 8   Fare         418 non-null    float64
 9   Cabin        91 non-null     object
 10  Embarked     418 non-null    object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

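Set the index to the passenger ID. (This cell was run after the processing steps below, where `test_process` is created and the extra columns are added.)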

test_process = test_process.set_index(['PassengerId'])
test_process

The test set now looks like this:

	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Embarked	Called	Name_length	First_name
PassengerId
892	3	Kelly, Mr. James	male	34	0	0	330911	7.8292	Q	Mr	16	Kelly
893	3	Wilkes, Mrs. James (Ellen Needs)	female	47	1	0	363272	7.0000	S	Mr	32	Wilkes
894	2	Myles, Mr. Thomas Francis	male	62	0	0	240276	9.6875	Q	Mr	25	Myles
895	3	Wirz, Mr. Albert	male	27	0	0	315154	8.6625	S	Mr	16	Wirz
896	3	Hirvonen, Mrs. Alexander (Helga E Lindqvist)	female	22	1	1	3101298	12.2875	S	Mr	44	Hirvonen
...	...	...	...	...	...	...	...	...	...	...	...	...
1305	3	Spector, Mr. Woolf	male	25	0	0	A.5. 3236	8.0500	S	Mr	18	Spector
1306	1	Oliva y Ocana, Dona. Fermina	female	39	0	0	PC 17758	108.9000	C	NaN	28	Oliva y Ocana
1307	3	Saether, Mr. Simon Sivertsen	male	38	0	0	SOTON/O.Q. 3101262	7.2500	S	Mr	28	Saether
1308	3	Ware, Mr. Frederick	male	25	0	0	359309	8.0500	S	Mr	19	Ware
1309	3	Peter, Master. Michael J	male	22	1	1	2668	22.3583	C	NaN	24	Peter

418 rows × 12 columns

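Data processing

Handling missing values

The missing values here are assumed to be missing completely at random, that is, not dependent on the other, fully observed variables, so both deletion and imputation are acceptable. Cabin has too many missing values (only 204 of 891 are present), so the feature is dropped outright. If in doubt, compute some correlations or draw a plot to check first.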
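One way to run the suggested check is to compare the survival rate of passengers with and without a recorded cabin. This is a minimal sketch on made-up rows, and the `Has_cabin` column name is hypothetical; with the real data, `toy` would be replaced by `data_train`. A large gap between the two rates would suggest the values are not missing completely at random.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; with the real data, use data_train instead.
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0, 1],
    "Cabin": ["C85", np.nan, "E46", np.nan, np.nan, np.nan],
})

# Flag whether Cabin is recorded, then compare survival rates between the groups.
toy["Has_cabin"] = toy["Cabin"].notnull().astype(int)
rate_by_cabin = toy.groupby("Has_cabin")["Survived"].mean()
print(rate_by_cabin)
```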

# Drop Cabin
train_process = data_train.drop(['Cabin'], axis=1)

# Impute the missing Age values
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
# .copy() avoids pandas' SettingWithCopyWarning when Age_df is modified below
Age_df = train_process[['Age', 'Survived', 'Pclass', 'SibSp', 'Parch', 'Fare']].copy()
UnknowAge = Age_df[Age_df.Age.isnull()].values
KnowAge = Age_df[Age_df.Age.notnull()].values
# y is the target (Age), x holds the known attributes
y_train = KnowAge[:, 0]
x_train = KnowAge[:, 1:]
rfr = RandomForestRegressor(n_estimators=500, random_state=42)
rfr.fit(x_train, y_train)
predictedAges = rfr.predict(UnknowAge[:, 1:])
Age_df.loc[Age_df.Age.isnull(), 'Age'] = predictedAges
# Ages are truncated to whole years
train_process['Age'] = Age_df.Age.astype(int)

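Missing ages are imputed with a random forest: a regression model is fitted on the rows where Age is known and then used to predict the missing values.

The test set also needs the Cabin variable dropped and its missing ages imputed.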

# Test set
test_process = data_test.drop(['Cabin'],axis=1)
test_process.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  418 non-null    int64
 1   Pclass       418 non-null    int64
 2   Name         418 non-null    object
 3   Sex          418 non-null    object
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64
 6   Parch        418 non-null    int64
 7   Ticket       418 non-null    object
 8   Fare         418 non-null    float64
 9   Embarked     418 non-null    object
dtypes: float64(2), int64(4), object(4)
memory usage: 32.8+ KB

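# Survived is not available in the test set, so it is left out of the predictors.
# .copy() avoids pandas' SettingWithCopyWarning when Age_df is modified below
Age_df = test_process[['Age', 'Pclass', 'SibSp', 'Parch', 'Fare']].copy()
UnknowAge = Age_df[Age_df.Age.isnull()].values
KnowAge = Age_df[Age_df.Age.notnull()].values
# y is the target (Age), x holds the known attributes
y_train = KnowAge[:, 0]
x_train = KnowAge[:, 1:]
# Note: a separate model is fitted on the test rows with known Age
rfr = RandomForestRegressor(n_estimators=500, random_state=42)
rfr.fit(x_train, y_train)
predictedAges = rfr.predict(UnknowAge[:, 1:])
Age_df.loc[Age_df.Age.isnull(), 'Age'] = predictedAges
test_process['Age'] = Age_df.Age.astype(int)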
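The two imputation cells above repeat the same steps and fit two separate models. One alternative, sketched here under assumptions, is to fit the age model once on the training rows and reuse it for both sets. The helper names `fit_age_model` and `fill_age`, the smaller `n_estimators`, and the tiny frames are all hypothetical and only meant to show the pattern.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

FEATURES = ["Pclass", "SibSp", "Parch", "Fare"]  # predictors available in both sets

def fit_age_model(train_df):
    """Fit a random forest on the training rows whose Age is known."""
    known = train_df[train_df["Age"].notnull()]
    model = RandomForestRegressor(n_estimators=50, random_state=42)
    model.fit(known[FEATURES], known["Age"])
    return model

def fill_age(df, model):
    """Return a copy of df with missing Age values replaced by predictions."""
    out = df.copy()
    missing = out["Age"].isnull()
    if missing.any():
        out.loc[missing, "Age"] = model.predict(out.loc[missing, FEATURES])
    return out

# Tiny made-up frames, for illustration only.
train = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0, np.nan],
    "Pclass": [3, 1, 3, 1, 1, 3],
    "SibSp": [1, 1, 0, 1, 0, 0],
    "Parch": [0, 0, 0, 0, 0, 0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 8.46],
})
test = pd.DataFrame({
    "Age": [np.nan, 47.0],
    "Pclass": [3, 3],
    "SibSp": [0, 1],
    "Parch": [0, 0],
    "Fare": [7.83, 7.00],
})

age_model = fit_age_model(train)
train_filled = fill_age(train, age_model)
test_filled = fill_age(test, age_model)
print(train_filled["Age"].isnull().sum(), test_filled["Age"].isnull().sum())
```

Because `fill_age` works on a copy, the input frames are left unchanged, which sidesteps the chained-assignment warnings that slicing a DataFrame can trigger.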

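Text data processing

Process the Name text field: extract the title, the length of the name and the part before the comma (stored as `First_name`), then discard the Name variable.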

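def change(df):
    # Title: first match of Miss / Mr / Ms in the name. Note that "Mrs" is matched as "Mr",
    # and titles outside the pattern (e.g. Master, Dona) become NaN.
    df['Called'] = df['Name'].str.findall('Miss|Mr|Ms').str[0]
    df['Name_length'] = df['Name'].apply(len)
    # Text before the comma
    df['First_name'] = df['Name'].str.split(',').str[0]
    # This only rebinds the local variable, so the caller's frame keeps Name;
    # the column is dropped explicitly later on.
    df = df.drop(['Name'], axis=1)

change(train_process)
change(test_process)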
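In the test set output shown earlier, "Mrs." rows end up with `Mr` and "Master." or "Dona." rows end up with `NaN`, because the pattern `Miss|Mr|Ms` matches the first alternative it finds. A possible alternative, sketched here and not part of the original pipeline, is to capture whatever sits between the comma and the first period. The sample names are taken from the table above.

```python
import pandas as pd

names = pd.Series([
    "Kelly, Mr. James",
    "Wilkes, Mrs. James (Ellen Needs)",
    "Peter, Master. Michael J",
    "Oliva y Ocana, Dona. Fermina",
])

# Capture the text between the comma and the first period, e.g. "Mrs".
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

Rare titles such as `Dona` could then be grouped into a single "other" category before encoding, if desired.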

TargetEncoder

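Encode the remaining object-type variables. scikit-learn and companion libraries such as category_encoders offer many encoding schemes. Target encoding suits features with no intrinsic order and more than 4 categories; one-hot encoding suits features with no intrinsic order and fewer than 4 categories.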
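To make the idea concrete, here is a minimal sketch of plain mean target encoding written by hand on made-up rows. It is an illustration of the principle only: as far as I know, `category_encoders.TargetEncoder` additionally blends each category mean with the overall mean (smoothing), so its numbers will generally differ from this simple version.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Embarked": ["S", "C", "S", "Q", "C", "S"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Plain mean target encoding: each category becomes the mean target of its rows.
category_means = toy.groupby("Embarked")["Survived"].mean()
toy["Embarked_encoded"] = toy["Embarked"].map(category_means)
print(toy)
```

Categories seen only once or twice get extreme values with this plain version, which is the main reason smoothing is normally applied.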

import category_encoders
from category_encoders import TargetEncoder
X_train = train_process.iloc[:,2:]
y_train = train_process.iloc[:,1]
tar_encoder1 = TargetEncoder(cols=['Sex','Ticket','Embarked','Called','Name_length','First_name'],handle_missing='value',handle_unknown='value')

tar_encoder1.fit(X_train,y_train)

TargetEncoder(cols=['Sex', 'Ticket', 'Embarked', 'Called', 'Name_length','First_name'])

X_train_encoded = tar_encoder1.transform(X_train)

# drop() returns a new frame, so assign the result back
X_train_encoded = X_train_encoded.drop(['Name'], axis=1)
X_train_encoded

	Pclass	Sex	Age	SibSp	Parch	Ticket	Fare	Embarked	Called	Name_length	First_name
0	3	0.188908	22.0	1	0	0.383838	7.2500	0.336957	0.283721	0.282051	0.103230
1	1	0.742038	38.0	1	0	0.383838	71.2833	0.553571	0.283721	0.998476	0.383838
2	3	0.742038	26.0	0	0	0.383838	7.9250	0.336957	0.697802	0.315789	0.383838
3	1	0.742038	35.0	1	0	0.468759	53.1000	0.336957	0.283721	0.999439	0.468759
4	3	0.188908	35.0	0	0	0.383838	8.0500	0.336957	0.283721	0.372093	0.468759
...	...	...	...	...	...	...	...	...	...	...	...
886	2	0.188908	27.0	0	0	0.383838	13.0000	0.336957	0.492063	0.325000	0.383838
887	1	0.742038	19.0	0	0	0.383838	30.0000	0.336957	0.697802	0.372093	0.632953
888	3	0.742038	NaN	1	2	0.103230	23.4500	0.336957	0.697802	0.428461	0.103230
889	1	0.188908	26.0	0	0	0.383838	30.0000	0.553571	0.283721	0.325000	0.383838
890	3	0.188908	32.0	0	0	0.383838	7.7500	0.389610	0.283721	0.234375	0.383838

891 rows × 11 columns

X_test = test_process

X_test.drop(['Name'],axis=1)

	Pclass	Sex	Age	SibSp	Parch	Ticket	Fare	Embarked	Called	Name_length	First_name
PassengerId
892	3	male	34	0	0	330911	7.8292	Q	Mr	16	Kelly
893	3	female	47	1	0	363272	7.0000	S	Mr	32	Wilkes
894	2	male	62	0	0	240276	9.6875	Q	Mr	25	Myles
895	3	male	27	0	0	315154	8.6625	S	Mr	16	Wirz
896	3	female	22	1	1	3101298	12.2875	S	Mr	44	Hirvonen
...	...	...	...	...	...	...	...	...	...	...	...
1305	3	male	25	0	0	A.5. 3236	8.0500	S	Mr	18	Spector
1306	1	female	39	0	0	PC 17758	108.9000	C	NaN	28	Oliva y Ocana
1307	3	male	38	0	0	SOTON/O.Q. 3101262	7.2500	S	Mr	28	Saether
1308	3	male	25	0	0	359309	8.0500	S	Mr	19	Ware
1309	3	male	22	1	1	2668	22.3583	C	NaN	24	Peter

418 rows × 11 columns

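# X_test still contains Name (the encoder was fitted with it), so drop it after encoding
X_test_encoded = tar_encoder1.transform(X_test).drop(['Name'], axis=1)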

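Feature scaling

Several models are compared later, so Age and Fare are standardized (zero mean, unit variance) to put them on a common scale.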

import sklearn.preprocessing as preprocessing
scaler = preprocessing.StandardScaler()
# Fit on the training set only; a second fit on the test set would overwrite these statistics
scaler.fit(X_train_encoded[['Age','Fare']])

StandardScaler()

X_train_encoded[['Age','Fare']] = scaler.transform(X_train_encoded[['Age','Fare']])
X_test_encoded[['Age','Fare']] = scaler.transform(X_test_encoded[['Age','Fare']])
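A small sketch of why the scaler should be fitted on the training data only: every call to `fit` discards the statistics learned before it. The arrays are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature arrays, for illustration only.
train_vals = np.array([[10.0], [20.0], [30.0]])
test_vals = np.array([[100.0], [200.0]])

scaler = StandardScaler()
scaler.fit(train_vals)
mean_after_train_fit = scaler.mean_[0]   # learned from the training values

scaler.fit(test_vals)                    # a second fit replaces the earlier statistics
mean_after_test_fit = scaler.mean_[0]    # now reflects the test values only

# Usual pattern: fit once on the training data, then transform both sets.
scaler = StandardScaler().fit(train_vals)
train_scaled = scaler.transform(train_vals)
test_scaled = scaler.transform(test_vals)
print(mean_after_train_fit, mean_after_test_fit)
```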

Model prediction

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC

X_train_encoded
X_test_encoded

	Pclass	Sex	Age	SibSp	Parch	Ticket	Fare	Embarked	Called	Name_length	First_name
PassengerId
892	3	0.188908	0.325138	0	0	0.383838	-0.497063	0.389610	0.283721	0.230769	0.732634
893	3	0.742038	1.326156	1	0	0.383838	-0.511926	0.336957	0.283721	0.565217	0.383838
894	2	0.188908	2.481178	0	0	0.383838	-0.463754	0.389610	0.283721	0.327273	0.383838
895	3	0.188908	-0.213872	0	0	0.383838	-0.482127	0.336957	0.283721	0.230769	0.383838
896	3	0.742038	-0.598880	1	1	0.383838	-0.417151	0.336957	0.283721	0.999439	0.383838
...	...	...	...	...	...	...	...	...	...	...	...
1305	3	0.188908	-0.367875	0	0	0.383838	-0.493105	0.336957	0.283721	0.200000	0.383838
1306	1	0.742038	0.710145	0	0	0.468759	1.314557	0.553571	0.492063	0.372093	0.383838
1307	3	0.188908	0.633143	0	0	0.383838	-0.507445	0.336957	0.283721	0.372093	0.383838
1308	3	0.188908	-0.367875	0	0	0.383838	-0.493105	0.336957	0.283721	0.234375	0.383838
1309	3	0.188908	-0.598880	1	1	0.834289	-0.236640	0.553571	0.492063	0.372093	0.834289

418 rows × 11 columns

X_train_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Pclass       891 non-null    int64
 1   Sex          891 non-null    float64
 2   Age          891 non-null    int32
 3   SibSp        891 non-null    int64
 4   Parch        891 non-null    int64
 5   Ticket       891 non-null    float64
 6   Fare         891 non-null    float64
 7   Embarked     891 non-null    float64
 8   Called       891 non-null    float64
 9   Name_length  891 non-null    float64
 10  First_name   891 non-null    float64
dtypes: float64(7), int32(1), int64(3)
memory usage: 73.2 KB

Voting

Start with a voting ensemble.

lr_clf = LogisticRegression(penalty='l1',solver='saga',n_jobs=-1,max_iter=20000)
rnd_clf = RandomForestClassifier(n_estimators=300,max_depth=8,min_samples_leaf=1,min_samples_split=5,random_state=42)
svm_clf = SVC(C=2,kernel='poly',random_state=42,probability=True)
voting_clf = VotingClassifier(estimators=[('lr',lr_clf),('rf',rnd_clf),('scv',svm_clf)],voting='soft')
voting_clf.fit(X_train_encoded,y_train)

  VotingClassifier(estimators=[('lr',LogisticRegression(max_iter=20000, n_jobs=-1,penalty='l1', solver='saga')),('rf',RandomForestClassifier(max_depth=8,min_samples_split=5,n_estimators=300,random_state=42)),('scv',SVC(C=2, kernel='poly', probability=True,random_state=42))],voting='soft')
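With `voting='soft'`, the ensemble averages the class probabilities of its members and picks the class with the highest average, which is why `probability=True` is needed for `SVC`. The sketch below reproduces that by hand on synthetic data (not the Titanic features) with a smaller two-member ensemble.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only to illustrate the mechanics.
X, y = make_classification(n_samples=200, n_features=6, random_state=42)

voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
    ],
    voting="soft",
)
voter.fit(X, y)

# Soft voting averages the class probabilities of the fitted members...
member_probas = [est.predict_proba(X) for est in voter.estimators_]
mean_proba = np.mean(member_probas, axis=0)

# ...and predicts the class with the highest average probability.
manual_pred = voter.classes_[np.argmax(mean_proba, axis=1)]
print((manual_pred == voter.predict(X)).all())
```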

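# Note: Kaggle's gender_submission.csv is a sample submission (it predicts that all women survive),
# not the true test labels, so the scores below measure agreement with that baseline.
y_test = pd.read_csv(r'C:/Users/gender_submission.csv')

y_test = y_test['Survived']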

from sklearn.metrics import accuracy_score
for clf in (lr_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train_encoded, y_train)
    y_pred = clf.predict(X_test_encoded)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.6961722488038278
RandomForestClassifier 0.80622009569378
SVC 0.6363636363636364
VotingClassifier 0.8110047846889952
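Kaggle does not publish the true labels for the test set, and `gender_submission.csv` is only an example submission, so a more reliable estimate comes from cross-validation on the labelled training data. `cross_val_score` is already imported above but not used. A minimal sketch on synthetic data follows; with the real pipeline, `X` and `y` would be `X_train_encoded` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train_encoded / y_train, for illustration only.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

clf = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=42)

# 5-fold cross-validated accuracy computed from labelled training data only.
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```

Note that for a strict estimate the target encoder and scaler would also have to be refitted inside each fold, for example through a scikit-learn `Pipeline`.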

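Next, try XGBoost, which was expected to do better. Note that the code below fits a regressor (`XGBRFRegressor`) and reports mean squared error, so the number cannot be compared directly with the accuracy scores above; an MSE of about 0.50 on 0/1 labels is in fact a poor result.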

XGBoost

import xgboost
from sklearn.metrics import mean_squared_error
xgb_reg = xgboost.XGBRFRegressor(random_state=42)
xgb_reg.fit(X_train_encoded,y_train)
y_pred = xgb_reg.predict(X_test_encoded)
val_error=mean_squared_error(y_test,y_pred)
print("Validation MSE:", val_error)

Validation MSE: 0.5023153196818051
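To put the MSE figure in context, the sketch below uses made-up labels and scores (not the actual model output) to show two things: always predicting 0.5 already gives an MSE of 0.25 on 0/1 labels, and continuous scores can be thresholded at 0.5 to obtain an accuracy that is comparable with the classifiers above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Made-up labels and regressor outputs, for illustration only.
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_score = np.array([0.2, 0.8, 0.6, 0.4, 0.9, 0.1, 0.7, 0.3])

# Baseline: always predicting 0.5 gives an MSE of 0.25 on 0/1 labels.
baseline_mse = mean_squared_error(y_true, np.full(len(y_true), 0.5))

# MSE of the continuous scores, and accuracy after thresholding at 0.5.
score_mse = mean_squared_error(y_true, y_score)
y_label = (y_score >= 0.5).astype(int)
acc = accuracy_score(y_true, y_label)
print(baseline_mse, score_mse, acc)
```

For a classification task, a classifier such as `xgboost.XGBClassifier` scored with `accuracy_score` would be the more natural comparison.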
