Download the data from the Kaggle website: train and test.


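  • The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One reason the wreck cost so many lives was that there were not enough lifeboats for the passengers and crew. Although surviving the sinking involved some luck, some groups of people were more likely to survive than others, such as women, children, and the upper class.
  • In this competition, participants are asked to analyze what kinds of people were more likely to survive. In particular, you are asked to use machine learning tools to predict which passengers survived the disaster.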


  • Pose the question
  • Understand the data
  • Data processing (preprocessing and feature engineering)
  • Model building and evaluation
  • Summary




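The task is to predict a categorical outcome (binary classification) from a set of predictor variables. Supervised machine learning offers many methods for classification: logistic regression, KNN, decision trees, random forests, support vector machines, neural networks, and others. This article uses logistic regression and KNN for the prediction.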



import numpy as np
import pandas as pd
import re
# Plotting
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns


train = pd.read_csv(r"G:\Kaggle\Titanic\train.csv")
test = pd.read_csv(r"G:\Kaggle\Titanic\test.csv")
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q


PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
417 1309 3 Peter, Master. Michael J male NaN 1 1 2668 22.3583 NaN C
224 1116 1 Candee, Mrs. Edward (Helen Churchill Hungerford) female 53.0 0 0 PC 17606 27.4458 NaN C
99 991 3 Nancarrow, Mr. William Henry male 33.0 0 0 A./5. 3338 8.0500 NaN S
410 1302 3 Naughton, Miss. Hannah female NaN 0 0 365237 7.7500 NaN Q
41 933 1 Franklin, Mr. Thomas Parham male NaN 0 0 113778 26.5500 D34 S
70 962 3 Mulvihill, Miss. Bertha E female 24.0 0 0 382653 7.7500 NaN Q


print("==" * 50)
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
Name Sex Ticket Cabin Embarked
count 891 891 891 204 889
unique 891 2 681 147 3
top Kink-Heilmann, Miss. Luise Gretchen male 1601 C23 C25 C27 S
freq 1 577 7 4 644


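  • Categorical variables: Survived, Pclass (ordinal), Sex, Embarked. Numerical variables: Age, SibSp (discrete), Parch (discrete), Fare.

  • Four fields have missing values, to different degrees (Age and Cabin are missing many values; Fare and Embarked are missing few).

  • In the training set:

    • (1) There are 891 passengers, with a survival rate of 38%.
    • (2) The youngest passenger is 0.42 years old and the oldest is 80. Excluding missing values, the mean age is 29.7, and there are few elderly passengers.
    • (3) About 25% of passengers travelled with one or more siblings or spouses, and more than 75% travelled without parents or children.
    • (4) The mean fare is about 32 and the maximum about 512, a wide spread.
    • (5) Every name is unique.
    • (6) There are 577 male passengers, more than the number of female passengers.
    • (7) Ticket has 681 distinct values.
    • (8) Cabin is missing many values: only 204 of the 891 passengers have a record.
    • (9) The port of embarkation has missing values; 644 passengers boarded at port S, the largest share.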
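The observations above come from a few standard pandas calls. Here is a minimal sketch on a tiny made-up sample (the `sample` frame and its values are illustrative assumptions, not the real train.csv):

```python
import numpy as np
import pandas as pd

# Tiny hypothetical sample that mimics a few Titanic columns.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan],
})

# Number of missing values in each column.
missing_counts = sample.isnull().sum()

# The mean of a 0/1 column is the share of ones, i.e. the survival rate.
survival_rate = sample["Survived"].mean()

# Summary statistics: numeric columns first, then the string columns.
numeric_summary = sample.describe()
string_summary = sample[["Cabin", "Embarked"]].describe()

print(missing_counts)
print(survival_rate)
```

On the real data, the same calls applied to train should give the counts, unique values, and quartiles quoted in the list.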










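Ways to handle missing values (in scikit-learn, most estimators raise an error if the data contains missing values when building a model):

  • Deletion (simple and blunt, dropna)

    • Delete complete cases, i.e. drop rows (use when the sample is large and few rows have missing values)
    • Delete the feature that has missing values (use when the column is mostly missing and the feature matters little to the model)
  • Imputation (infer the missing values from the known part of the data; the estimates are not guaranteed to be correct, but this usually models better than dropping the column)

    • Estimate with the feature's mean, median, mode, and so on (basic)
    • Estimate the missing values from the other known numerical data (advanced)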
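The two strategies can be sketched side by side. The frame `df` below is a small made-up example, and the median is just one of the basic imputation choices listed above:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in Age and Cabin.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
})

# Deletion, variant 1: drop every row whose Age is missing.
rows_dropped = df.dropna(subset=["Age"])

# Deletion, variant 2: drop a column that is mostly missing.
column_dropped = df.drop(columns=["Cabin"])

# Imputation (basic): fill Age with the median of the observed ages.
age_median = df["Age"].median()
imputed = df.assign(Age=df["Age"].fillna(age_median))

print(len(rows_dropped), list(column_dropped.columns), imputed["Age"].tolist())
```

Row deletion shrinks the sample from five rows to three here, which is why it is only attractive when few rows are affected.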



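# 3. Data processing (preprocessing and feature engineering)

First, put train and test into one list so that later code can process both datasets in the same loop: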

combination_data = [train,test]

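**The variables are discussed below by their current type, numerical or string. Along the way, missing values are handled, variables are selected according to their relationship with the survival rate, and where necessary variables are dropped or new ones created to help build the model. In the end, every column is converted to a numerical type.**

## Numerical variables

- PassengerId: the passenger identifier, used only to tell passengers apart. It has no predictive value, so it is dropped from train (test keeps it for the submission file).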

del train["PassengerId"]

- Pclass: the ticket class, with three levels. To some extent it reflects a passenger's status and social standing. Its effect is explored below:

Pclass Survived
0 1 0.629630
1 2 0.472826
2 3 0.242363
  • SibSp
SibSp Survived
1 1 0.535885
2 2 0.464286
0 0 0.345395
3 3 0.250000
4 4 0.166667
5 5 0.000000
6 8 0.000000


  • Parch
Parch Survived
3 3 0.600000
1 1 0.550847
2 2 0.500000
0 0 0.343658
5 5 0.200000
4 4 0.000000
6 6 0.000000


for dataset in combination_data:
    # Family = siblings/spouses + parents/children + the passenger
    dataset["Family"] = dataset["SibSp"] + dataset["Parch"] + 1
Family Survived
3 4 0.724138
2 3 0.578431
1 2 0.552795
6 7 0.333333
0 1 0.303538
4 5 0.200000
5 6 0.136364
7 8 0.000000
8 11 0.000000
for dataset in combination_data:
    dataset["Family_size"] = 0    # create the new column
    dataset.loc[dataset["Family"] == 1, "Family_size"] = 1                               # small family (travelling alone)
    dataset.loc[(dataset["Family"] > 1) & (dataset["Family"] <= 4), "Family_size"] = 2   # medium family (2-4)
    dataset.loc[dataset["Family"] > 4, "Family_size"] = 3                                # large family (5-11)
    dataset["Family_size"] = dataset["Family_size"].astype(int)
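As an alternative to three separate assignments, the same banding can be written with pd.cut. This is only a sketch of an equivalent approach on a made-up `family` series; the bin edges mirror the thresholds above (1, 2-4, 5-11):

```python
import pandas as pd

# Hypothetical family counts covering each band.
family = pd.Series([1, 2, 4, 5, 11], name="Family")

# Intervals are right-inclusive: (0, 1], (1, 4], (4, 11].
family_size = pd.cut(family, bins=[0, 1, 4, 11], labels=[1, 2, 3]).astype(int)
print(family_size.tolist())
```

One caveat of this form: a family count above 11 would fall outside the last bin and become missing, so the upper edge has to cover the data.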


for dataset in combination_data:
    dataset["Alone"] = dataset["Family"].map(lambda x: 1 if x == 1 else 0)
Alone Survived
0 0 0.505650
1 1 0.303538
for dataset in combination_data:
    dataset.drop(["SibSp", "Parch", "Family"], axis=1, inplace=True)



count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64

train["Age_group"] = pd.cut(train.Age,5)
Age_group Survived
0 (0.34, 16.336] 0.550000
3 (48.168, 64.084] 0.434783
2 (32.252, 48.168] 0.404255
1 (16.336, 32.252] 0.369942
4 (64.084, 80.0] 0.090909
del train["Age_group"]



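177

Of the 891 passengers in the train dataset, 177 (close to 20%) have no age recorded. The mean age is 29.7, the standard deviation 14.5, and the median 28. For now, the missing ages are filled with random integers drawn from between mean - std and mean + std, which introduces some noise. This can be revisited once a better imputation method is available.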

for dataset in combination_data:
    Age_avg = dataset["Age"].mean()
    Age_std = dataset["Age"].std()
    missing_number = dataset["Age"].isnull().sum()
    # Random integer ages in [mean - std, mean + std); .loc avoids the chained-assignment SettingWithCopyWarning
    dataset.loc[dataset["Age"].isnull(), "Age"] = np.random.randint(int(Age_avg - Age_std), int(Age_avg + Age_std), size=missing_number)
    dataset["Age"] = dataset["Age"].astype(int)
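The random fill above gives different ages on every run, so downstream scores can shift slightly between runs. A seeded variant makes it reproducible. This is a sketch on a made-up series; the function name `fill_age_randomly` and the `seed` parameter are assumptions for illustration:

```python
import numpy as np
import pandas as pd

def fill_age_randomly(age, seed=0):
    """Fill missing ages with random integers from [mean - std, mean + std)."""
    rng = np.random.default_rng(seed)      # seeded generator for repeatable draws
    low = int(age.mean() - age.std())
    high = int(age.mean() + age.std())
    filled = age.copy()                    # work on a copy, leave the input untouched
    missing = filled.isnull()
    filled[missing] = rng.integers(low, high, size=missing.sum())
    return filled.astype(int)

# Hypothetical ages with two gaps.
ages = pd.Series([22.0, np.nan, 38.0, 26.0, np.nan, 35.0])
filled_a = fill_age_randomly(ages, seed=42)
filled_b = fill_age_randomly(ages, seed=42)
print(filled_a.tolist())
```

Calling the function twice with the same seed returns identical results, while the observed ages are left unchanged.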

for dataset in combination_data:
    dataset["Age_group"] = pd.cut(dataset.Age, 5)
for dataset in combination_data:
    dataset.loc[dataset["Age"] <= 16, "Age"] = 0
    dataset.loc[(dataset["Age"] > 16) & (dataset["Age"] <= 32), "Age"] = 1
    dataset.loc[(dataset["Age"] > 32) & (dataset["Age"] <= 48), "Age"] = 2
    dataset.loc[(dataset["Age"] > 48) & (dataset["Age"] <= 64), "Age"] = 3
    dataset.loc[dataset["Age"] > 64, "Age"] = 4
for dataset in combination_data:
    dataset.drop("Age_group", axis=1, inplace=True)

- Fare


count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64

train["Fare_group"] = pd.qcut(train["Fare"], 4)  # split into four quantile bins
Fare_group Survived
0 (-0.001, 7.91] 0.197309
1 (7.91, 14.454] 0.303571
2 (14.454, 31.0] 0.454955
3 (31.0, 512.329] 0.581081



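# Assumption: the step that fills the missing Fare is not shown, so the train median is used here.
# It is computed before the loop because the loop overwrites train["Fare"] with band codes.
fare_median = train["Fare"].median()
for dataset in combination_data:
    dataset["Fare"] = dataset["Fare"].fillna(fare_median)
    dataset.loc[dataset["Fare"] <= 7.91, "Fare"] = 0
    dataset.loc[(dataset["Fare"] > 7.91) & (dataset["Fare"] <= 14.454), "Fare"] = 1
    dataset.loc[(dataset["Fare"] > 14.454) & (dataset["Fare"] <= 31.0), "Fare"] = 2
    dataset.loc[dataset["Fare"] > 31.0, "Fare"] = 3
    dataset["Fare"] = dataset["Fare"].astype(int)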
del train["Fare_group"]
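The quartile edges above are typed in by hand from the qcut output. A sketch of how they could instead be learned from the training fares and reused on the test fares follows; `train_fare` and `test_fare` are made-up values, and opening the outer edges to infinity is an assumption added so that out-of-range test fares still get a band:

```python
import numpy as np
import pandas as pd

# Hypothetical fares; the real columns would be train["Fare"] and test["Fare"].
train_fare = pd.Series([0.0, 7.25, 7.9, 8.05, 13.0, 14.45, 26.0, 31.0, 53.1, 512.33])
test_fare = pd.Series([7.0, 9.5, 30.0, 600.0])

# Learn the quartile edges from the training data only.
_, edges = pd.qcut(train_fare, 4, retbins=True)
edges = np.array(edges, dtype=float)

# Open the outer edges so values outside the training range still fall in a band.
edges[0], edges[-1] = -np.inf, np.inf

# labels=False returns the integer band code (0-3) instead of an interval.
train_band = pd.cut(train_fare, bins=edges, labels=False)
test_band = pd.cut(test_fare, bins=edges, labels=False)
print(test_band.tolist())
```

Both datasets end up coded against the same cut points, which keeps the bands comparable.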

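## String variables

### Name

Passenger names have no duplicates, so the column could simply be dropped. According to other write-ups, however, the length of a name and the title in it can also reflect a person's status, so the effect of these two factors on the survival rate is explored here:

(1) Name length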

for dataset in combination_data:
    # Name length = number of space-separated words in the name
    dataset["The_length_of_name"] = dataset["Name"].map(lambda x: len(re.split(" ", x)))
The_length_of_name Survived
6 9 1.000000
7 14 1.000000
4 7 0.842105
3 6 0.773585
5 8 0.555556
2 5 0.427083
1 4 0.340206
0 3 0.291803
from sklearn.preprocessing import StandardScaler
Stdsca = StandardScaler()
name_length1 = Stdsca.fit_transform(train[["The_length_of_name"]])
name_length1 = pd.DataFrame(name_length1,columns=["name_length"])
train = pd.concat([train,name_length1],axis=1)
name_length2 = Stdsca.fit_transform(test[["The_length_of_name"]])
name_length2 = pd.DataFrame(name_length2,columns=["name_length"])
test = pd.concat([test,name_length2],axis=1)
combination_data = [train,test]
for dataset in combination_data:
    del dataset["The_length_of_name"]
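The code above calls fit_transform separately on train and test, so each set is scaled with its own mean and standard deviation. A common variant, sketched here on made-up word counts, fits the scaler on train only and reuses it for test; this is a different choice from the one in the notebook, not a description of it:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical name lengths (word counts).
train_len = pd.DataFrame({"The_length_of_name": [3, 4, 4, 5, 9]})
test_len = pd.DataFrame({"The_length_of_name": [3, 6]})

scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from train ...
train_scaled = scaler.fit_transform(train_len)
# ... and transform applies those same statistics to test.
test_scaled = scaler.transform(test_len)

print(scaler.mean_, train_scaled.ravel(), test_scaled.ravel())
```

With this variant, a name of three words maps to the same scaled value in both datasets.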



0    Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th…
2    Heikkinen, Miss. Laina
3    Futrelle, Mrs. Jacques Heath (Lily May Peel)
4    Allen, Mr. William Henry
5    Moran, Mr. James
6    McCarthy, Mr. Timothy J
Name: Name, dtype: object

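for dataset in combination_data:
    # Capture the run of letters right before the first period, e.g. "Mr" in "Braund, Mr. Owen Harris"
    dataset["Title"] = dataset["Name"].str.extract(r"([A-Za-z]+)\.", expand=False)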
Survived Pclass Name Sex Age Ticket Fare Cabin Embarked Family_size Alone name_length Title
271 1 3 Tornquist, Mr. William Henry male 1 LINE 0 NaN S 1 1 -0.059474 Mr
389 1 2 Lehmann, Miss. Bertha female 1 SC 1748 1 NaN C 1 1 -0.914177 Miss
40 0 3 Ahlin, Mrs. Johan (Johanna Persdotter Larsson) female 2 7546 1 NaN S 2 0 1.649930 Mrs
709 1 3 Moubarek, Master. Halim Gonios (“William George”) male 1 2661 2 NaN C 2 0 1.649930 Master
Sex female male
Capt 0 1
Col 0 2
Countess 1 0
Don 0 1
Dr 1 6
Jonkheer 0 1
Lady 1 0
Major 0 2
Master 0 40
Miss 182 0
Mlle 2 0
Mme 1 0
Mr 0 517
Mrs 125 0
Ms 1 0
Rev 0 6
Sir 0 1
for dataset in combination_data:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
Title Survived
3 Mrs 0.793651
1 Miss 0.702703
0 Master 0.575000
4 Rare 0.347826
2 Mr 0.156673
for dataset in combination_data:
    dataset["Title"] = dataset["Title"].map({"Mr": 1, "Mrs": 2, "Miss": 3, "Master": 4, "Rare": 5})
    dataset["Title"] = dataset["Title"].fillna(0)
for dataset in combination_data:
    del dataset["Name"]
Survived Pclass Sex Age Ticket Fare Cabin Embarked Family_size Alone name_length Title
0 0 3 male 1 A/5 21171 0 NaN S 2 0 -0.059474 1
1 1 1 female 2 PC 17599 3 C85 C 2 0 2.504633 2
2 1 3 female 1 STON/O2. 3101282 1 NaN S 1 1 -0.914177 3
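The title pipeline (extract, merge synonyms, group the rest, encode) can be tried end to end on a handful of names. In this sketch the last two names are made up, and anything outside the four common titles is sent to Rare with `where`, which is a shortcut rather than the explicit list used above:

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Moubarek, Master. Halim Gonios",
    "Doe, Mlle. Jane",   # made-up name to exercise the Mlle -> Miss merge
    "Doe, Col. John",    # made-up name to exercise the Rare group
])

# Letters immediately before the first period form the title.
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)

# Merge spelling variants, then group every uncommon title as "Rare".
titles = titles.replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})
common = ["Mr", "Mrs", "Miss", "Master"]
titles = titles.where(titles.isin(common), "Rare")

# Encode as integers, using the same mapping as the code above.
title_code = titles.map({"Mr": 1, "Mrs": 2, "Miss": 3, "Master": 4, "Rare": 5})
print(titles.tolist(), title_code.tolist())
```

Note that the integer codes are arbitrary labels, so a model that treats them as ordered quantities reads an ordering that is not really there.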
  • Sex


Sex Survived
0 female 0.742038
1 male 0.188908
Pclass Sex Survived
0 1 female 0.968085
2 2 female 0.921053
4 3 female 0.500000
1 1 male 0.368852
3 2 male 0.157407
5 3 male 0.135447
for dataset in combination_data:
    dataset["Sex"] = dataset["Sex"].map({"male": 0, "female": 1})
Survived Pclass Sex Age Ticket Fare Cabin Embarked Family_size Alone name_length Title
0 0 3 0 1 A/5 21171 0 NaN S 2 0 -0.059474 1
1 1 1 1 2 PC 17599 3 C85 C 2 0 2.504633 2
2 1 3 1 1 STON/O2. 3101282 1 NaN S 1 1 -0.914177 3
3 1 1 1 2 113803 3 C123 S 2 0 2.504633 2
  • Cabin
a = train.Cabin.isnull().sum()
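print("Number of missing values: %d" % a)

Number of missing values: 687

More than 75% of the values are missing, so no attempt is made to fill them. Instead, a new feature is built from whether Cabin is present, to see whether it affects survival. If it has no effect, the column is dropped.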

train["Cabin_exist"] = train.Cabin.map(lambda x : "Yes" if type(x)==str else "No")
train[["Cabin_exist", "Survived"]].groupby("Cabin_exist",as_index=False).mean()
Cabin_exist Survived
0 No 0.299854
1 Yes 0.666667
del train["Cabin_exist"]
for dataset in combination_data:
    dataset["Cabin_exist"] = dataset["Cabin"].map(lambda x: 1 if type(x) == str else 0)
for dataset in combination_data:
    del dataset["Cabin"]
Survived Pclass Sex Age Ticket Fare Embarked Family_size Alone name_length Title Cabin_exist
0 0 3 0 1 A/5 21171 0 S 2 0 -0.059474 1 0
1 1 1 1 2 PC 17599 3 C 2 0 2.504633 2 1
2 1 3 1 1 STON/O2. 3101282 1 S 1 1 -0.914177 3 0
  • Embarked


Embarked Survived
2 S 644
0 C 168
1 Q 77
Embarked Survived
0 C 0.553571
1 Q 0.389610
2 S 0.336957
Sex Embarked Survived
2 0 S 441
5 1 S 203
0 0 C 95
3 1 C 73
1 0 Q 41
4 1 Q 36



train["Embarked"] = train.Embarked.fillna("S")
for dataset in combination_data:
    dataset["Embarked"] = dataset["Embarked"].map({"C": 0, "Q": 1, "S": 2}).astype(int)
Survived Pclass Sex Age Ticket Fare Embarked Family_size Alone name_length Title Cabin_exist
0 0 3 0 1 A/5 21171 0 2 2 0 -0.059474 1 0
1 1 1 1 2 PC 17599 3 0 2 0 2.504633 2 1
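Filling Embarked with "S" amounts to filling with the most frequent value. A sketch that looks the mode up instead of hard-coding it, on a made-up `embarked` series:

```python
import numpy as np
import pandas as pd

# Hypothetical ports with one missing entry.
embarked = pd.Series(["S", "C", np.nan, "S", "Q", "S"])

# mode() ignores missing values and returns the most frequent entries.
most_common = embarked.mode()[0]

# Fill first, then encode; the integer cast would fail if any NaN remained after map.
codes = embarked.fillna(most_common).map({"C": 0, "Q": 1, "S": 2}).astype(int)
print(most_common, codes.tolist())
```

The order matters: mapping before filling would leave NaN in the column and the integer conversion would raise.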
  • Ticket

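749             335097
87     SOTON/OQ 392086
179               LINE
682               6563
629             334912
586             237565
159           CA. 2343
466             239853
539              13568
419             345773
Name: Ticket, dtype: object

This column has no missing values, but its contents are messy, with 681 distinct values, so it is dropped and not considered further.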

for dataset in combination_data:
    del dataset["Ticket"]

## Feature selection

Data processing is complete. Here are the features as they now stand:

Survived Pclass Sex Age Fare Embarked Family_size Alone name_length Title Cabin_exist
887 1 1 1 1 2 2 1 1 -0.059474 3 1
888 0 3 1 1 2 2 2 0 0.795228 3 0
889 1 1 0 1 2 0 1 1 -0.059474 1 1
890 0 3 0 1 0 1 1 1 -0.914177 1 0
PassengerId Pclass Sex Age Fare Embarked Family_size Alone name_length Title Cabin_exist
0 892 3 0 2 0 1 1 1 -0.933840 1 0
1 893 3 1 2 0 2 2 0 0.716668 2 0
2 894 2 0 3 1 1 1 1 -0.108586 1 0
3 895 3 0 1 1 2 1 1 -0.933840 1 0


corr_df = train.corr()
Survived Pclass Sex Age Fare Embarked Family_size Alone name_length Title Cabin_exist
Survived 1.000000 -0.338481 0.543351 -0.049290 0.295875 -0.167675 0.108631 -0.203367 0.278520 0.405921 0.316912
Pclass -0.338481 1.000000 -0.131900 -0.308842 -0.628459 0.162098 -0.043973 0.135207 -0.222866 -0.120491 -0.725541
Sex 0.543351 -0.131900 1.000000 -0.087157 0.248940 -0.108262 0.280570 -0.303646 0.375797 0.564438 0.140391
Age -0.049290 -0.308842 -0.087157 1.000000 0.066096 -0.039259 -0.187662 0.144766 0.052876 -0.194844 0.225237
Fare 0.295875 -0.628459 0.248940 0.066096 1.000000 -0.112248 0.559259 -0.568942 0.320767 0.265495 0.497108
Embarked -0.167675 0.162098 -0.108262 -0.039259 -0.112248 1.000000 -0.004951 0.063532 0.032424 -0.082845 -0.160196
Family_size 0.108631 -0.043973 0.280570 -0.187662 0.559259 -0.004951 1.000000 -0.923090 0.311132 0.328943 0.088993
Alone -0.203367 0.135207 -0.303646 0.144766 -0.568942 0.063532 -0.923090 1.000000 -0.369259 -0.289292 -0.158029
name_length 0.278520 -0.222866 0.375797 0.052876 0.320767 0.032424 0.311132 -0.369259 1.000000 0.124584 0.184484
Title 0.405921 -0.120491 0.564438 -0.194844 0.265495 -0.082845 0.328943 -0.289292 0.124584 1.000000 0.104024
Cabin_exist 0.316912 -0.725541 0.140391 0.225237 0.497108 -0.160196 0.088993 -0.158029 0.184484 0.104024 1.000000

Survived       1.000000
Sex            0.543351
Title          0.405921
Cabin_exist    0.316912
Fare           0.295875
name_length    0.278520
Family_size    0.108631
Age           -0.049290
Embarked      -0.167675
Alone         -0.203367
Pclass        -0.338481
Name: Survived, dtype: float64

The three strongest positive linear correlations with Survived are Sex, Title, and Cabin_exist; the three strongest negative ones are Pclass, Alone, and Embarked.

plt.title("Pearson Correlation of Features")
for dataset in combination_data:
    del dataset["Family_size"]
PassengerId Pclass Sex Age Fare Embarked Alone name_length Title Cabin_exist
0 892 3 0 2 0 1 1 -0.933840 1 0
1 893 3 1 2 0 2 0 0.716668 2 0
2 894 2 0 3 1 1 1 -0.108586 1 0
corr_df2 = train.corr()

Survived       1.000000
Sex            0.543351
Title          0.405921
Cabin_exist    0.316912
Fare           0.295875
name_length    0.278520
Age           -0.049290
Embarked      -0.167675
Alone         -0.203367
Pclass        -0.338481
Name: Survived, dtype: float64

plt.title("Pearson Correlation of Features2")
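In the first correlation matrix, Family_size and Alone correlate at about -0.92, which is presumably why Family_size was dropped. A sketch for flagging such near-duplicate pairs automatically follows; the data is made up, and the helper name `highly_correlated_pairs` and the 0.9 threshold are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical family counts; Family_size and Alone are both derived from them.
family = np.array([1, 1, 1, 1, 1, 1, 2, 3, 2, 4, 5])
features = pd.DataFrame({
    "Family_size": np.where(family == 1, 1, np.where(family <= 4, 2, 3)),
    "Alone": (family == 1).astype(int),
    "Pclass": [1, 3, 2, 3, 3, 1, 2, 3, 1, 3, 2],
})

corr = features.corr()

def highly_correlated_pairs(corr, threshold=0.9):
    """Return (feature_a, feature_b, r) for every pair with |r| above threshold."""
    pairs = []
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:          # upper triangle only, skips the diagonal
            r = corr.loc[a, b]
            if abs(r) > threshold:
                pairs.append((a, b, float(r)))
    return pairs

pairs = highly_correlated_pairs(corr)
print(pairs)
```

Of each flagged pair, one feature can usually be dropped with little loss, since both carry nearly the same information.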

# 4. Model building and evaluation

x_train = train.drop("Survived",axis=1)
y_train =train["Survived"]
x_test = test.drop("PassengerId",axis=1)

## 1. Logistic regression

from sklearn.linear_model import LogisticRegression
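Classifier1 = LogisticRegression()
Classifier1.fit(x_train, y_train)                    # the model must be fitted before predicting
Y1_prediction = Classifier1.predict(x_test)
score_Logit = Classifier1.score(x_train, y_train)    # accuracy on the training data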



array([[-0.77898168, 2.00093191, -0.33760786, -0.08497359, -0.30653537, 0.20655901, 0.28367358, 0.37006791, 0.76227031]])
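The raw coefficient array is easier to read when each value is paired with its column name, and a cross-validated score is a less optimistic check than accuracy on the training data. Below is a sketch on synthetic data; the features, the target rule, and the 5-fold setting are all assumptions for illustration, not results from the Titanic data:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "Sex": rng.integers(0, 2, size=n),      # 1 = female, as in the encoding above
    "Pclass": rng.integers(1, 4, size=n),
    "Noise": rng.normal(size=n),            # unrelated to the target
})
# Synthetic target: women and upper classes survive more often.
logit = 2.0 * X["Sex"] - 0.8 * X["Pclass"] + 0.5
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

model = LogisticRegression()
model.fit(X, y)

# Pair each coefficient with its feature name and sort for readability.
coefficients = pd.Series(model.coef_[0], index=X.columns).sort_values()

train_accuracy = model.score(X, y)
cv_accuracy = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
print(coefficients)
print(train_accuracy, cv_accuracy)
```

Applied to the real model, the same pairing would be pd.Series(Classifier1.coef_[0], index=x_train.columns), which makes it clear which entry of the array belongs to which feature.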

Final = pd.DataFrame({"PassengerId":test["PassengerId"],"Survived":Y1_prediction})
PassengerId Survived
0 892 0
1 893 0
2 894 0
3 895 0
4 896 1
5 897 0
6 898 1
7 899 0
8 900 1
9 901 0

Kaggle score: 0.77990.

The Fare coefficient is very small (about -0.085), so Fare is removed to see the effect:

x1_train = train.drop(["Survived","Fare"],axis=1)
y1_train =train["Survived"]
x1_test = test.drop(["PassengerId","Fare"],axis=1)
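Classifier2 = LogisticRegression()
Classifier2.fit(x1_train, y1_train)                      # the model must be fitted before predicting
Y2_prediction = Classifier2.predict(x1_test)
score_Logit_2 = Classifier2.score(x1_train, y1_train)    # accuracy on the training data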



array([[-0.73467593, 2.00683788, -0.34049416, -0.31016514, 0.29292148, 0.27713963, 0.36168232, 0.73342319]])

Final_2 = pd.DataFrame({"PassengerId":test["PassengerId"],"Survived":Y2_prediction})
PassengerId Survived
0 892 0
1 893 0
2 894 0
3 895 0
4 896 1
5 897 0
6 898 1
7 899 0
8 900 1
9 901 0

After submitting to Kaggle, the score dropped, so Fare is kept.

## 2. KNN

from sklearn.neighbors import KNeighborsClassifier
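Classifier3 = KNeighborsClassifier(n_neighbors=5)
Classifier3.fit(x_train, y_train)                  # the model must be fitted before predicting
Y3_prediction = Classifier3.predict(x_test)
score_Knn = Classifier3.score(x_train, y_train)    # accuracy on the training data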


Final_3 = pd.DataFrame({"PassengerId":test["PassengerId"],"Survived":Y3_prediction})
PassengerId Survived
0 892 0
1 893 0
2 894 0
3 895 0
4 896 1
5 897 0
6 898 1
7 899 0
8 900 1
9 901 0
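The value n_neighbors=5 is the scikit-learn default. A sketch of how k could be compared with cross-validation, using synthetic data from make_classification rather than the Titanic features (the candidate list of k values and the 5-fold setting are assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification problem.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4, random_state=0)

# Mean 5-fold cross-validated accuracy for each candidate k.
scores = {}
for k in (1, 3, 5, 7, 9, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)

# With k=1 every training point is its own nearest neighbour,
# so accuracy on the training data says little about generalisation.
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)
train_acc_k1 = knn1.score(X, y)
print(scores, best_k, train_acc_k1)
```

This is also why a score computed on x_train, as in the code above, tends to overstate how a KNN model will do on the Kaggle test set.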




# Decision tree
#from sklearn.tree import DecisionTreeClassifier
#Classifier4 = DecisionTreeClassifier()
#Classifier4.fit(x_train, y_train)
#Y4_prediction = Classifier4.predict(x_test)
#score_Dtc = Classifier4.score(x_train,y_train)
# Random forest
#from sklearn.ensemble import RandomForestClassifier
#Classifier5 = RandomForestClassifier(n_estimators=100)
#Classifier5.fit(x_train, y_train)
#Y5_prediction = Classifier5.predict(x_test)
#score_Rfc = Classifier5.score(x_train,y_train)





