Competition Description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and upper-class passengers.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Practice Skills

  • Binary classification
  • Python and R basics
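
The description above says that some groups were more likely to survive than others. A minimal sketch of how that could be checked with pandas is shown here; the `toy` frame holds made-up rows for illustration only (they are not records from train.csv), and only the column names match the competition data.

```python
import pandas as pd

# Made-up rows for illustration only; these are NOT records from train.csv.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Survived": [0, 1, 1, 1, 1, 0],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```

Running the same two `groupby` calls on the real training frame would give the per-group survival rates for the actual passengers.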



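Feature      Description
PassengerId  Passenger ID
Survived     Survival flag (0 = died, 1 = survived)
Pclass       Socioeconomic class (1 = high, 2 = middle, 3 = low)
Name         Passenger name
Sex          Sex
Age          Age
SibSp        Number of siblings/spouses aboard
Parch        Number of parents/children aboard
Ticket       Ticket number
Fare         Fare paid
Cabin        Cabin number
Embarked     Port of embarkation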
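
Several of these columns contain gaps in practice, so it helps to count missing entries per column before modelling. The sketch below is one possible way to do that; the `toy` frame is made up for illustration, and filling the categorical column with its most common value is an assumption of this sketch rather than something the competition prescribes.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; NaN/None marks a missing entry.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", None, "S", "Q"],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.46],
})

# Count missing entries in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Fill numeric gaps with the median and categorical gaps with the most common value.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])
print(toy)
```

Assigning the result of `fillna` back to the column, rather than using `inplace=True` on a column selection, avoids the chained-assignment warnings that newer pandas versions raise.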


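import os
import pandas as pd

# Locate the training and test CSV files in the local "titanic" directory
file_root = os.path.realpath('titanic')
file_name_test = os.path.join(file_root, "test.csv")
file_name_train = os.path.join(file_root, "train.csv")

pd.set_option('display.max_columns', None)  # show every column when printing
titanic = pd.read_csv(file_name_train)

# The "count" row of describe() shows which numeric columns have missing values
data = titanic.describe()

# Fill missing ages with the median age, then recompute the summary
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
data = titanic.describe()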

# Encode Sex as integers: male -> 0, female -> 1
print("Sex values before encoding:", titanic['Sex'].unique())
titanic['Sex'] = titanic['Sex'].map({'male': 0, 'female': 1})
print("Sex values after encoding:", titanic['Sex'].unique())

# Fill missing Embarked values with 'S', then encode: S -> 0, C -> 1, Q -> 2
print("Embarked values before encoding:", titanic['Embarked'].unique())
titanic['Embarked'] = titanic['Embarked'].fillna('S').map({'S': 0, 'C': 1, 'Q': 2})
print("Embarked values after encoding:", titanic['Embarked'].unique())

# Predict survival with a linear regression model
import numpy as np
from sklearn import model_selection
from sklearn.linear_model import LinearRegression

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
alg = LinearRegression()
print("titanic.shape[0]:", titanic.shape[0])
# Older scikit-learn call, kept for reference:
# kf = model_selection.KFold(titanic.shape[0], n_folds=3, random_state=1)
# random_state only has an effect when shuffle=True, so it is omitted here
kf = model_selection.KFold(n_splits=3, shuffle=False)
predictions = []
for train, test in kf.split(titanic['Survived']):
    # Select the training rows for this fold
    train_predictors = titanic[predictors].iloc[train, :]
    # Target values for the training rows
    train_target = titanic['Survived'].iloc[train]
    # Fit the linear regression model
    alg.fit(train_predictors, train_target)
    # Predict on the held-out fold
    test_predictions = alg.predict(titanic[predictors].iloc[test, :])
    predictions.append(test_predictions)

# Evaluate on the held-out folds; the regression outputs are continuous, roughly in [0, 1]
# np.concatenate((a1, a2, ...), axis=0) joins several arrays in a single call
predictions = np.concatenate(predictions, axis=0)
predictions[predictions > .5] = 1
predictions[predictions <= .5] = 0
# Accuracy is the fraction of predictions that match the true labels
accuracy = (predictions == titanic['Survived']).mean()
print("Linear regression: ", accuracy)

from sklearn.linear_model import LogisticRegression

# max_iter is raised to reduce the chance of convergence warnings on these unscaled features
alg = LogisticRegression(random_state=1, max_iter=1000)
scores = model_selection.cross_val_score(alg, titanic[predictors], titanic['Survived'], cv=3)
print("Logistic regression: ", scores.mean())

# Random forest: build many decision trees that decide together and average their results
from sklearn.ensemble import RandomForestClassifier

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
# min_impurity_split was removed from scikit-learn; min_samples_split=4 is assumed to be the intended setting
alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=4, min_samples_leaf=2)
# Pass the KFold splitter itself as cv; random_state is omitted because shuffle=False
kf = model_selection.KFold(n_splits=3, shuffle=False)
scores = model_selection.cross_val_score(alg, titanic[predictors], titanic['Survived'], cv=kf)
print("Random forest: ", scores.mean())
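
The linear regression step above turns continuous outputs into class labels with a 0.5 threshold and then scores them against the true labels. The sketch below shows that scoring step on its own; the arrays are made up for illustration, and the names `raw_predictions`, `labels`, and `undercount` are hypothetical. It also shows a pitfall: summing the predicted values at the matching positions counts only the correct 1s, so it understates accuracy.

```python
import numpy as np

# Made-up regression outputs and true labels, for illustration only.
raw_predictions = np.array([0.91, 0.12, 0.55, 0.40, 0.73, 0.08])
labels = np.array([1, 0, 0, 0, 1, 1])

# Threshold at 0.5 to turn continuous outputs into 0/1 class predictions.
predicted_class = (raw_predictions > 0.5).astype(int)

# Accuracy is the share of positions where prediction and label agree.
accuracy = (predicted_class == labels).mean()

# Pitfall: summing the predicted VALUES at matching positions drops every
# correct 0, because a correct 0 adds nothing to the sum.
undercount = predicted_class[predicted_class == labels].sum() / len(labels)

print("accuracy:", accuracy)
print("undercount:", undercount)
```

Comparing the two printed values on any set of predictions makes the gap visible whenever at least one 0 is predicted correctly.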


