Continuing my study of data mining, I tried the Titanic survival prediction competition on Kaggle.

Titanic for Machine Learning

Imports and loading the data

# data processing
import numpy as np
import pandas as pd
import re
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
train = pd.read_csv('D:/data/titanic/train.csv')
test = pd.read_csv('D:/data/titanic/test.csv')
train.head()

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

The features are:
PassengerId: no particular meaning.
Pclass: cabin class; does it affect survival, and did higher classes have a better chance?
Name: can help us infer sex and approximate age.
Sex: did women have a higher survival rate?
Age: do different age groups survive at different rates?
SibSp and Parch: siblings/spouses and parents/children aboard; does having relatives aboard raise or lower the survival rate?
Fare: did passengers who paid higher fares have a better chance?
Cabin and Embarked: cabin number and port of embarkation; intuitively these should not affect survival.

train.describe()

PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
train.describe(include=['O'])  # include=['O'] selects the categorical (object) columns

Name Sex Ticket Cabin Embarked
count 891 891 891 204 889
unique 891 2 681 147 3
top Hippach, Mrs. Louis Albert (Ida Sophia Fischer) male 1601 C23 C25 C27 S
freq 1 577 7 4 644

The target feature: Survived

survive_num = train.Survived.value_counts()
survive_num.plot.pie(explode=[0,0.1],autopct='%1.1f%%',labels=['died','survived'],shadow=True)
plt.show()

x=[0,1]
plt.bar(x,survive_num,width=0.35)
plt.xticks(x,('died','survived'))
plt.show()

Feature analysis

num_f = [f for f in train.columns if train.dtypes[f] != 'object']
cat_f = [f for f in train.columns if train.dtypes[f]=='object']
print('there are %d numerical features:'%len(num_f),num_f)
print('there are %d category features:'%len(cat_f),cat_f)

there are 7 numerical features: ['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
there are 5 category features: ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']

Feature types:
- numerical
- categorical: ordinal or nominal
- nominal (unordered) categorical here: Sex, Embarked (a short encoding sketch follows below)
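For orientation, a minimal sketch of how the two kinds of categorical features are typically handled (illustration only; the actual encoding is done later in the data-processing section):

# nominal feature -> one-hot dummies; ordinal feature -> keep an ordered integer code
pd.get_dummies(train['Embarked'], prefix='Embarked').head()   # nominal
train['Pclass'].head()   # Pclass is already an ordered 1/2/3 code and can stay as integers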

Categorical features

Sex

train.groupby(['Sex'])['Survived'].count()

Sex
female    314
male      577
Name: Survived, dtype: int64

f,ax = plt.subplots(figsize=(8,6))
fig = sns.countplot(x='Sex',hue='Survived',data=train)
fig.set_title('Sex:Survived vs Dead')
plt.show()

train.groupby(['Sex'])['Survived'].sum()/train.groupby(['Sex'])['Survived'].count()

Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64

There were far more men than women aboard, yet women's survival rate is around 75%, far above men's 18%-19%. Sex is clearly an important feature.
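Since Survived is a 0/1 column, the same ratio can be computed in one step:

train.groupby('Sex')['Survived'].mean()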

Embarked

sns.factorplot('Embarked','Survived',data=train)
plt.show()

f,ax = plt.subplots(1,3,figsize=(24,6))
sns.countplot('Embarked',data=train,ax=ax[0])
ax[0].set_title('No. Of Passengers Boarded')
sns.countplot(x='Embarked',hue='Survived',data=train,ax=ax[1])
ax[1].set_title('Embarked vs Survived')
sns.countplot('Embarked',hue='Pclass',data=train,ax=ax[2])
ax[2].set_title('Embarked vs Pclass')
#plt.subplots_adjust(wspace=0.2,hspace=0.5)
plt.show()

#pd.pivot_table(train,index='Embarked',columns='Pclass',values='Fare')
sns.boxplot(x='Embarked',y='Fare',hue='Pclass',data=train)
plt.show()

The plots show that most passengers embarked at port S, and most of them were class 3, although S also has the largest number of class 1 passengers of the three ports. Port C has the highest survival rate, about 0.55, because the proportion of class 1 passengers boarding at C is higher; almost all passengers from port Q are class 3. The mean fares of class 1 and 2 passengers from port C are also higher, which may suggest those passengers had higher social status. Logically, though, the port of embarkation itself should not affect survival, so it can be converted into dummy variables or dropped.
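The class mix per port and the per-port survival rate behind these statements can be checked directly; a small sketch (not part of the original notebook output):

pd.crosstab(train['Embarked'], train['Pclass'])       # class composition per port
train.groupby('Embarked')['Survived'].mean()          # survival rate per port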

Pclass

train.groupby('Pclass')['Survived'].value_counts()

Pclass  Survived
1       1           136
        0            80
2       0            97
        1            87
3       0           372
        1           119
Name: Survived, dtype: int64

plt.subplots(figsize=(8,6))
f = sns.countplot('Pclass',hue='Survived',data=train)

sns.factorplot('Pclass','Survived',hue='Sex',data=train)
plt.show()

Classes 1 and 2 have clearly higher survival rates: more than half of class 1 survived and class 2 is roughly even, while women in classes 1 and 2 survived at a rate close to 1, so cabin class has a strong effect on survival.

SibSp

train[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)

SibSp Survived
1 1 0.535885
2 2 0.464286
0 0 0.345395
3 3 0.250000
4 4 0.166667
5 5 0.000000
6 8 0.000000
sns.factorplot('SibSp','Survived',data=train)
plt.show()

#pd.pivot_table(train,values='Survived',index='SibSp',columns='Pclass')
sns.countplot(x='SibSp',hue='Pclass',data=train)
plt.show()

With no companions aboard, the survival rate is around 0.3; passengers with one companion have the highest survival rate, above 0.5, possibly because a larger share of them are in classes 1 and 2. After that the survival rate falls as the number of companions grows, mainly because passengers with more than 3 companions are mostly in class 3, where such passengers rarely survived.

Parch

#pd.pivot_table(train,values='Survived',index='Parch',columns='Pclass')
sns.countplot(x='Parch',hue='Pclass',data=train)
plt.show()

sns.factorplot('Parch','Survived',data=train)
plt.show()

The trend is similar to SibSp: passengers travelling alone have a lower survival rate, those with 1-3 parents/children aboard have a higher one, and beyond that it drops quickly, because most of those passengers are in class 3.

Age

train.groupby('Survived')['Age'].describe()

count mean std min 25% 50% 75% max
Survived
0 424.0 30.626179 14.172110 1.00 21.0 28.0 39.0 74.0
1 290.0 28.343690 14.950952 0.42 19.0 28.0 36.0 80.0
f,ax = plt.subplots(1,2,figsize=(16,6))
sns.violinplot('Pclass','Age',hue='Survived',data=train,split=True,ax=ax[0])
ax[0].set_title('Pclass Age & Survived')
sns.violinplot('Sex','Age',hue='Survived',data=train,split=True,ax=ax[1])
ax[1].set_title('Sex Age & Survived')
plt.show()

In first class the survivors tend to be younger overall, but survival spans a wide age range, with relatively high survival between about 20 and 50, possibly because first-class passengers are older on average. Children around 10 show a clear jump in survival in classes 2 and 3, and the same holds for boys. Surviving women are concentrated in the young-to-middle-aged range, while passengers aged roughly 20-40 account for the most deaths.

Name

The Name field is mainly used to infer sex and to fill missing ages using passengers who share the same title.

# use a regular expression to extract the title from each name
def getTitle(data):
    name_sal = []
    for i in range(len(data['Name'])):
        name_sal.append(re.findall(r'.\w*\.', data.Name[i]))
    Salut = []
    for i in range(len(name_sal)):
        name = str(name_sal[i])
        name = name[1:-1].replace("'", "")
        name = name.replace(".", "").strip()
        name = name.replace(" ", "")
        Salut.append(name)
    data['Title'] = Salut

getTitle(train)
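As an aside, a more compact way to pull out the title is pandas' str.extract; a sketch that behaves close to, though not exactly the same as, the regex above:

# capture the word that precedes the first '.' in each name (inspection only)
train['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False).value_counts()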
train.head(2)

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S Mr
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C Mrs
pd.crosstab(train['Title'],train['Sex'])

Sex female male
Title
Capt 0 1
Col 0 2
Countess 1 0
Don 0 1
Dr 1 6
Jonkheer 0 1
Lady 1 0
Major 0 2
Master 0 40
Miss 182 0
Mlle 2 0
Mme 1 0
Mr 0 517
Mrs 124 0
Mrs,L 1 0
Ms 1 0
Rev 0 6
Sir 0 1

A quick vocabulary review: Mme: a married or professional woman from a non-English-speaking country, roughly equivalent to Mrs; Jonkheer: a Dutch honorific for the lower nobility; Capt: captain; Lady: noblewoman; Don: a Spanish honorific for nobles and men of standing; the Countess: countess; Ms (or Mz): a woman whose marital status is unspecified; Col: colonel; Major: major; Mlle: mademoiselle (miss); Rev: reverend.

Fare

train.groupby('Pclass')['Fare'].mean()

Pclass
1    84.154687
2    20.662183
3    13.675550
Name: Fare, dtype: float64

sns.distplot(train['Fare'].dropna())
plt.xlim((0,200))
plt.xticks(np.arange(0,200,10))
plt.show()

Preliminary conclusions:
- Women survive at a markedly higher rate than men.
- First class has a very high survival rate and third class a very low one; women in classes 1 and 2 survive at a rate close to 1.
- Children around 10 years old show a clear jump in survival.
- SibSp and Parch behave similarly: travelling alone means a lower survival rate, 1-2 siblings/spouses or 1-3 parents/children means a higher one, and beyond that the survival rate drops sharply.
- Name and Age can be processed across the whole data set: extract the title from Name and use the per-title mean age to fill missing ages.

Data processing

# combine the training and test sets
passID = test['PassengerId']
all_data = pd.concat([train,test],keys=["train","test"])
all_data.shape
#all_data.head()

(1309, 13)

# count missing values
NAs = pd.concat([train.isnull().sum(),train.isnull().sum()/train.isnull().count(),test.isnull().sum(),test.isnull().sum()/test.isnull().count()],axis=1,keys=["train","percent_train","test","percent"])
NAs[NAs.sum(axis=1)>1].sort_values(by="percent",ascending=False)

train percent_train test percent
Cabin 687 0.771044 327.0 0.782297
Age 177 0.198653 86.0 0.205742
Fare 0 0.000000 1.0 0.002392
Embarked 2 0.002245 0.0 0.000000
# drop features with no predictive value
all_data.drop(['PassengerId','Cabin'],axis=1,inplace=True)

all_data.head(2)

Age Embarked Fare Name Parch Pclass Sex SibSp Survived Ticket Title
train 0 22.0 S 7.2500 Braund, Mr. Owen Harris 0 3 male 1 0.0 A/5 21171 Mr
1 38.0 C 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th… 0 1 female 1 1.0 PC 17599 Mrs

Handling Age

# first extract the title from Name
getTitle(all_data)
pd.crosstab(all_data['Title'], all_data['Sex'])

Sex female male
Title
Capt 0 1
Col 0 4
Countess 1 0
Don 0 1
Dona 1 0
Dr 1 7
Jonkheer 0 1
Lady 1 0
Major 0 2
Master 0 61
Miss 260 0
Mlle 2 0
Mme 1 0
Mr 0 757
Mrs 196 0
Mrs,L 1 0
Ms 2 0
Rev 0 8
Sir 0 1

all_data['Title'] = all_data['Title'].replace(['Lady','Dr','Dona','Mme','Countess'],'Mrs')
all_data['Title'] =all_data['Title'].replace('Mlle','Miss')
all_data['Title'] =all_data['Title'].replace('Mrs,L','Mrs')
all_data['Title'] = all_data['Title'].replace('Ms', 'Miss')
#all_data['Title'] = all_data['Title'].replace('Mme', 'Mrs')
all_data['Title'] = all_data['Title'].replace(['Capt','Col','Don','Major','Rev','Jonkheer','Sir'],'Mr')
'''
all_data['Title'] = all_data.Title.replace({'Mlle':'Miss','Mme':'Mrs','Ms':'Miss','Dr':'Mrs','Major':'Mr','Lady':'Mrs','Countess':'Mrs','Jonkheer':'Mr','Col':'Mr','Rev':'Mr','Capt':'Mr','Sir':'Mr','Don':'Mr','Mrs,L':'Mrs'})'''
all_data.Title.isnull().sum()

0

all_data[:train.shape[0]].groupby('Title')['Age'].mean()

Title
Master     4.574167
Miss      21.845638
Mr        32.891990
Mrs       36.188034
Name: Age, dtype: float64

# fill missing ages with the per-title mean age from the training set
all_data.loc[(all_data.Age.isnull()) & (all_data.Title=='Mr'),'Age']=32
all_data.loc[(all_data.Age.isnull())&(all_data.Title=='Mrs'),'Age']=36
all_data.loc[(all_data.Age.isnull())&(all_data.Title=='Master'),'Age']=5
all_data.loc[(all_data.Age.isnull())&(all_data.Title=='Miss'),'Age']=22
#all_data.loc[(all_data.Age.isnull())&(all_data.Title=='other'),'Age']=46
all_data.Age.isnull().sum()

0
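The hard-coded values above are just the rounded per-title means; the same fill could be written in one step, a sketch assuming the Title column built earlier:

title_means = all_data[:train.shape[0]].groupby('Title')['Age'].mean()   # means from the training rows only
all_data['Age'] = all_data['Age'].fillna(all_data['Title'].map(title_means))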

all_data[:train.shape[0]][['Title', 'Survived']].groupby(['Title'], as_index=False).mean()

Title Survived
0 Master 0.575000
1 Miss 0.702703
2 Mr 0.158192
3 Mrs 0.777778
f,ax = plt.subplots(1,2,figsize=(16,6))
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Sex=='female','Age'],color='red',ax=ax[0])
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Sex=='male','Age'],color='blue',ax=ax[0])
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Survived==0,'Age'],color='red', label='Not Survived', ax=ax[1])
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Survived==1,'Age' ],color='blue', label='Survived', ax=ax[1])
plt.legend(loc='best')
plt.show()

  • Children (roughly 16 and under) have a higher survival rate, and the oldest passenger (age 80) survived
  • Many passengers aged 16-40 did not survive
  • Most passengers are between 16 and 40
  • To help classification, bin Age into bands as a new feature and add a child feature

add isChild

def male_female_child(passenger):
    # unpack age and sex
    age, sex = passenger
    # flag children as their own category
    if age < 16:
        return 'child'
    else:
        return sex
# create the new feature
all_data['person'] = all_data[['Age','Sex']].apply(male_female_child,axis=1)
# ages run from 0 to 80; split into 3 bands: young, middle-aged, older
all_data['Age_band']=0
all_data.loc[all_data['Age']<=16,'Age_band']=0
all_data.loc[(all_data['Age']>16)&(all_data['Age']<=40),'Age_band']=1
all_data.loc[all_data['Age']>40,'Age_band']=2
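An equivalent, more compact way to build Age_band would be pd.cut; a sketch assuming the same three bands (and that Age has already been fully filled at this point):

all_data['Age_band'] = pd.cut(all_data['Age'], bins=[0, 16, 40, 81], labels=[0, 1, 2]).astype(int)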

Handling Name

df = pd.get_dummies(all_data['Title'],prefix='Title')
all_data = pd.concat([all_data,df],axis=1)
all_data.drop('Title',axis=1,inplace=True)
#drop name
all_data.drop('Name',axis=1,inplace=True)

fillna Embarked

all_data.loc[all_data.Embarked.isnull()]

Age Embarked Fare Parch Pclass Sex SibSp Survived Ticket Title person Age_band
train 61 38.0 NaN 80.0 0 1 female 0 1.0 113572 2 female 1
829 62.0 NaN 80.0 0 1 female 0 1.0 113572 3 female 2

A fare of 80 in first class most likely means these two passengers embarked at port C.
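A quick way to sanity-check that guess (a sketch, not in the original run): look at the typical first-class fare per port.

all_data[all_data['Pclass']==1].groupby('Embarked')['Fare'].median()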

all_data['Embarked'].fillna('C',inplace=True)
all_data.Embarked.isnull().any()

False

embark_dummy = pd.get_dummies(all_data.Embarked)
all_data = pd.concat([all_data,embark_dummy],axis=1)
all_data.head(2)

Age Embarked Fare Parch Pclass Sex SibSp Survived Ticket person Age_band Title_Master Title_Miss Title_Mr Title_Mrs C Q S
train 0 22.0 S 7.2500 0 3 male 1 0.0 A/5 21171 male 1 0 0 1 0 0 0 1
1 38.0 C 71.2833 0 1 female 1 1.0 PC 17599 female 1 0 0 0 1 1 0 0

add SibSp and Parch

# create two new features: Family_size and alone
all_data['Family_size'] = all_data['SibSp']+all_data['Parch']  # total number of relatives aboard
all_data['alone'] = 0  # default: not travelling alone
all_data.loc[all_data.Family_size==0,'alone']=1  # 1 means travelling alone
f,ax=plt.subplots(1,2,figsize=(16,6))
sns.factorplot('Family_size','Survived',data=all_data[:train.shape[0]],ax=ax[0])
ax[0].set_title('Family_size vs Survived')
sns.factorplot('alone','Survived',data=all_data[:train.shape[0]],ax=ax[1])
ax[1].set_title('alone vs Survived')
plt.close(2)
plt.close(3)
plt.show()

When a passenger is alone the survival rate is low, around 0.3; with 1-3 family members it rises, but with more than 4 it drops sharply again.

# then bin Family_size into groups
all_data['Family_size'] = np.where(all_data['Family_size']==0, 'solo',np.where(all_data['Family_size']<=3, 'normal', 'big'))
sns.factorplot('alone','Survived',hue='Sex',data=all_data[:train.shape[0]],col='Pclass')
plt.show()

For women in classes 1 and 2, travelling alone makes little difference to survival, but for third-class women the survival rate is actually higher when they travel alone.

all_data['poor_girl'] = 0
all_data.loc[(all_data['Sex']=='female')&(all_data['Pclass']==3)&(all_data['alone']==1),'poor_girl']=1

Filling and binning the continuous Fare variable

# fill the remaining missing Fare value with the mean fare of its class
all_data.loc[(all_data.Fare.isnull()) & (all_data.Pclass==1),'Fare']=84
all_data.loc[(all_data.Fare.isnull()) & (all_data.Pclass==2),'Fare']=21
all_data.loc[(all_data.Fare.isnull()) & (all_data.Pclass==3),'Fare']=14
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Survived==0,'Fare' ],color='red', label='Not Survived')
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Survived==1,'Fare' ],color='blue', label='Survived')
plt.xlim((0,100))
(0, 100)

sns.lmplot('Fare','Survived',data=all_data[:train.shape[0]])
plt.show()

# split Fare into 3 equal-frequency bins and look at the mean survival in each
all_data['Fare_band'] = pd.qcut(all_data['Fare'],3)
all_data[:train.shape[0]].groupby('Fare_band')['Survived'].mean()

Fare_band
(-0.001, 8.662]    0.198052
(8.662, 26.0]      0.402778
(26.0, 512.329]    0.559322
Name: Survived, dtype: float64

# discretize the continuous Fare variable
all_data['Fare_cut'] = 0
all_data.loc[all_data['Fare']<=8.662,'Fare_cut'] = 0
all_data.loc[((all_data['Fare']>8.662) & (all_data['Fare']<=26)),'Fare_cut'] = 1
#all_data.loc[((all_data['Fare']>14.454) & (all_data['Fare']<=31.275)),'Fare_cut'] = 2
all_data.loc[((all_data['Fare']>26) & (all_data['Fare']<513)),'Fare_cut'] = 2
sns.factorplot('Fare_cut','Survived',hue='Sex',data=all_data[:train.shape[0]])
plt.show()
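The hard-coded cut points are just the qcut boundaries found above, so the same binning could be obtained directly; a sketch:

all_data['Fare_cut'] = pd.qcut(all_data['Fare'], 3, labels=False)   # integer codes 0/1/2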

Survival rises with fare, and the effect is especially clear for men.

# create a feature flagging wealthy male passengers
all_data['rich_man'] = 0
all_data.loc[((all_data['Fare']>=80) & (all_data['Sex']=='male')),'rich_man'] = 1

Encoding categorical features numerically

all_data.head()

Age Embarked Fare Parch Pclass Sex SibSp Survived Ticket person Title_Mrs C Q S Family_size alone poor_girl Fare_band Fare_cut rich_man
train 0 22.0 S 7.2500 0 3 male 1 0.0 A/5 21171 male 0 0 0 1 normal 0 0 (-0.001, 8.662] 0 0
1 38.0 C 71.2833 0 1 female 1 1.0 PC 17599 female 1 1 0 0 normal 0 0 (26.0, 512.329] 2 0
2 26.0 S 7.9250 0 3 female 0 1.0 STON/O2. 3101282 female 0 0 0 1 solo 1 1 (-0.001, 8.662] 0 0
3 35.0 S 53.1000 0 1 female 1 1.0 113803 female 1 0 0 1 normal 0 0 (26.0, 512.329] 2 0
4 35.0 S 8.0500 0 3 male 0 0.0 373450 male 0 0 0 1 solo 1 0 (-0.001, 8.662] 0 0

5 rows × 24 columns

Features to drop: Embarked (already converted to dummies), Fare and Fare_band (replaced by Fare_cut), Sex (replaced by person), Age (replaced by Age_band), Ticket, one redundant Embarked dummy (C in the code below), SibSp and Parch (replaced by Family_size/alone).

'''
Drop features that are no longer needed: Age (replaced by the binned Age_band),
Fare and Fare_band (replaced by the binned Fare_cut),
Ticket (carries no useful information)
'''
#all_data.drop(['Age','Fare','Fare_band','Ticket'],axis=1,inplace=True)
#all_data.drop(['Age','Fare','Fare_band','Ticket','Embarked','C'],axis=1,inplace=True)
all_data.drop(['Age','Fare','Ticket','Embarked','C','Fare_band','SibSp','Parch'],axis=1,inplace=True)
all_data.head(2)

Pclass Sex Survived person Age_band Title_Master Title_Miss Title_Mr Title_Mrs Q S Family_size alone poor_girl Fare_cut rich_man
train 0 3 male 0.0 male 1 0 0 1 0 0 1 normal 0 0 0 0
1 1 female 1.0 female 1 0 0 0 1 0 0 normal 0 0 2 0
df1 = pd.get_dummies(all_data['Family_size'],prefix='Family_size')
df2 = pd.get_dummies(all_data['person'],prefix='person')
df3 = pd.get_dummies(all_data['Age_band'],prefix='age')
all_data = pd.concat([all_data,df1,df2,df3],axis=1)
all_data.head()

Pclass Sex Survived person Age_band Title_Master Title_Miss Title_Mr Title_Mrs Q rich_man Family_size_big Family_size_normal Family_size_solo person_child person_female person_male age_0 age_1 age_2
train 0 3 male 0.0 male 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0
1 1 female 1.0 female 1 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0
2 3 female 1.0 female 1 0 1 0 0 0 0 0 0 1 0 1 0 0 1 0
3 1 female 1.0 female 1 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0
4 3 male 0.0 male 1 0 0 1 0 0 0 0 0 1 0 0 1 0 1 0

5 rows × 25 columns

all_data.drop(['Sex','person','Age_band','Family_size'],axis=1,inplace=True)
all_data.head()

Pclass Survived Title_Master Title_Miss Title_Mr Title_Mrs Q S alone poor_girl rich_man Family_size_big Family_size_normal Family_size_solo person_child person_female person_male age_0 age_1 age_2
train 0 3 0.0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0
1 1 1.0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 1 0
2 3 1.0 0 1 0 0 0 1 1 1 0 0 0 1 0 1 0 0 1 0
3 1 1.0 0 0 0 1 0 1 0 0 0 0 1 0 0 1 0 0 1 0
4 3 0.0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 0 1 0

5 rows × 21 columns

Building models

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import confusion_matrix  # returns the matrix of predictions vs. targets
from sklearn.model_selection import cross_val_predict  # returns cross-validated predictions
from sklearn.model_selection import GridSearchCV
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
train_data = all_data[:train.shape[0]]
test_data = all_data[train.shape[0]:]
print('train data:'+str(train_data.shape))
print('test data:'+str(test_data.shape))

train data:(668, 21)
test data:(641, 21)


train,test = train_test_split(train_data,test_size = 0.25, random_state=0,stratify=train_data['Survived'])
train_x = train.drop('Survived',axis=1)
train_y = train['Survived']
test_x = test.drop('Survived',axis=1)
test_y = test['Survived']
print(train_x.shape)
print(test_x.shape)

(668, 20)
(223, 20)

# define score on train and test data
def cv_score(model):
    cv_result = cross_val_score(model,train_x,train_y,cv=10,scoring="accuracy")
    return cv_result

def cv_score_test(model):
    cv_result_test = cross_val_score(model,test_x,test_y,cv=10,scoring="accuracy")
    return cv_result_test

rbf SVM

# RBF SVM model
param_grid = {'C': [1e3, 5e3, 1e4, 5e4, 1e5],
              'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1]}
clf_svc = GridSearchCV(svm.SVC(kernel='rbf', class_weight='balanced'), param_grid)
clf_svc = clf_svc.fit(train_x, train_y)
print("Best estimator found by grid search:")
print(clf_svc.best_estimator_)
acc_svc_train = cv_score(clf_svc.best_estimator_).mean()
acc_svc_test = cv_score_test(clf_svc.best_estimator_).mean()
print(acc_svc_train)
print(acc_svc_test)

Best estimator found by grid search:
SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.0001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
0.826306967835
0.816196122718

Decision Tree

# a simple decision tree
clf_tree = DecisionTreeClassifier()
clf_tree.fit(train_x,train_y)
acc_tree_train = cv_score(clf_tree).mean()
acc_tree_test = cv_score_test(clf_tree).mean()
print(acc_tree_train)
print(acc_tree_test)

0.808216271583
0.811631846414

KNN

# test different values of n_neighbors
pred = []
for i in range(1,11):
    model = KNeighborsClassifier(n_neighbors=i)
    model.fit(train_x,train_y)
    pred.append(cv_score(model).mean())
n = list(range(1,11))
plt.plot(n,pred)
plt.xticks(range(1,11))
plt.show()  

clf_knn = KNeighborsClassifier(n_neighbors=4)
clf_knn.fit(train_x,train_y)
acc_knn_train = cv_score(clf_knn).mean()
acc_knn_test = cv_score_test(clf_knn).mean()
print(acc_knn_train)
print(acc_knn_test)

0.826239790353
0.829653679654

Logistic Regression

# logistic regression
clf_LR = LogisticRegression()
clf_LR.fit(train_x,train_y)
acc_LR_train = cv_score(clf_LR).mean()
acc_LR_test = cv_score_test(clf_LR).mean()
print(acc_LR_train)
print(acc_LR_test)

0.838226647511
0.811848296631

Gaussian Naive Bayes

clf_gb = GaussianNB()
clf_gb.fit(train_x,train_y)
acc_gb_train = cv_score(clf_gb).mean()
acc_gb_test = cv_score_test(clf_gb).mean()
print(acc_gb_train)
print(acc_gb_test)

0.794959693511
0.789695087521

Random Forest

n_estimators = range(100,1000,100)
grid = {'n_estimators':n_estimators}
clf_forest = GridSearchCV(RandomForestClassifier(random_state=0),param_grid=grid,verbose=True)
clf_forest.fit(train_x,train_y)
print(clf_forest.best_estimator_)
print(clf_forest.best_score_)
#print(cv_score(clf_forest).mean())
#print(cv_score_test(clf_forest).mean())

Fitting 3 folds for each of 9 candidates, totalling 27 fits
[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:   32.2s finished
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=200, n_jobs=1, oob_score=False, random_state=0,
            verbose=0, warm_start=False)
0.817365269461

clf_forest = RandomForestClassifier(n_estimators=200)
clf_forest.fit(train_x,train_y)
acc_forest_train = cv_score(clf_forest).mean()
acc_forest_test = cv_score_test(clf_forest).mean()
print(acc_forest_train)
print(acc_forest_test)

0.811178066885
0.811434217956

pd.Series(clf_forest.feature_importances_,train_x.columns).sort_values(ascending=True).plot.barh(width=0.8)
plt.show()


models = pd.DataFrame({'model':['SVM','Decision Tree','KNN','Logistic regression','Gaussian Bayes','Random Forest'],
                       'score on train':[acc_svc_train,acc_tree_train,acc_knn_train,acc_LR_train,acc_gb_train,acc_forest_train],
                       'score on test':[acc_svc_test,acc_tree_test,acc_knn_test,acc_LR_test,acc_gb_test,acc_forest_test]})
models.sort_values(by='score on test', ascending=False)
'''
models = pd.DataFrame({'model':['SVM','Decision Tree','KNN','Logistic regression','Gaussion Bayes','Random Forest'],'score on train':[acc_svc_train,acc_tree_train,acc_knn_train,acc_LR_train,acc_gb_train,acc_forest_train]
})
'''
models.sort_values(by='score on test', ascending=False)

model score on test score on train
2 KNN 0.829654 0.826240
0 SVM 0.816196 0.826307
3 Logistic regression 0.811848 0.838227
1 Decision Tree 0.811632 0.808216
5 Random Forest 0.811434 0.811178
4 Gaussian Bayes 0.789695 0.794960

Ensemble

from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import GradientBoostingClassifier

# bagging with the best SVM from the grid search as the base estimator
from sklearn.ensemble import BaggingClassifier
bag_tree = BaggingClassifier(base_estimator=clf_svc.best_estimator_,n_estimators=200,random_state=0)
bag_tree.fit(train_x,train_y)
acc_bagtree_train = cv_score(bag_tree).mean()
acc_bagtree_test =cv_score_test(bag_tree).mean()
print(acc_bagtree_train)
print(acc_bagtree_test)
0.82782211935
0.816196122718

AdaBoost

n_estimators = range(100,1000,100)
a = [0.05,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
grid = {'n_estimators':n_estimators,'learning_rate':a}
ada = GridSearchCV(AdaBoostClassifier(),param_grid=grid,verbose=True)
ada.fit(train_x,train_y)
print(ada.best_estimator_)
print(ada.best_score_)
#acc_ada_train = cv_score(ada).mean()
#acc_ada_test = cv_score_test(ada).mean()
#print(acc_ada_train)
#print(acc_ada_test)
Fitting 3 folds for each of 90 candidates, totalling 270 fits
[Parallel(n_jobs=1)]: Done 270 out of 270 | elapsed:  5.4min finished
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.05, n_estimators=200, random_state=None)
0.835329341317
ada = AdaBoostClassifier(n_estimators=200,random_state=0,learning_rate=0.2)
ada.fit(train_x,train_y)
acc_ada_train = cv_score(ada).mean()
acc_ada_test = cv_score_test(ada).mean()
print(acc_ada_train)
print(acc_ada_test)
0.829248144305
0.825719932242
# confusion matrix to inspect the predictions
y_pred = cross_val_predict(ada,test_x,test_y,cv=10)
sns.heatmap(confusion_matrix(test_y,y_pred),cmap='winter',annot=True,fmt='2.0f')
plt.show()

GradientBoosting


n_estimators = range(100,1000,100)
a = [0.05,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
grid = {'n_estimators':n_estimators,'learning_rate':a}
grad = GridSearchCV(GradientBoostingClassifier(),param_grid=grid,verbose=True)
grad.fit(train_x,train_y)
print(grad.best_estimator_)
print(grad.best_score_)
Fitting 3 folds for each of 90 candidates, totalling 270 fits
[Parallel(n_jobs=1)]: Done 270 out of 270 | elapsed:  2.4min finished
GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.05, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=200, presort='auto', random_state=None,
              subsample=1.0, verbose=0, warm_start=False)
0.824850299401
# use the best estimator found by the grid search
clf_grad = GradientBoostingClassifier(n_estimators=200,random_state=0,learning_rate=0.05)
clf_grad.fit(train_x,train_y)
acc_grad_train = cv_score(clf_grad).mean()
acc_grad_test = cv_score_test(clf_grad).mean()
print(acc_grad_train)
print(acc_grad_test)
0.818709926304
0.807500470544
from sklearn.metrics import precision_score
class Ensemble(object):
    def __init__(self, estimators):
        self.estimator_names = []
        self.estimators = []
        for i in estimators:
            self.estimator_names.append(i[0])
            self.estimators.append(i[1])
        self.clf = LogisticRegression()

    def fit(self, train_x, train_y):
        for i in self.estimators:
            i.fit(train_x, train_y)
        x = np.array([i.predict(train_x) for i in self.estimators]).T
        y = train_y
        self.clf.fit(x, y)

    def predict(self, x):
        x = np.array([i.predict(x) for i in self.estimators]).T
        #print(x)
        return self.clf.predict(x)

    def score(self, x, y):
        s = precision_score(y, self.predict(x))
        return s
ensem = Ensemble([('Ada',ada),('Bag',bag_tree),('SVM',clf_svc.best_estimator_),('LR',clf_LR),('gbdt',clf_grad)])
score = 0
for i in range(0,10):
    ensem.fit(train_x, train_y)
    sco = round(ensem.score(test_x,test_y) * 100, 2)
    score += sco
print(score/10)
89.83
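VotingClassifier is imported above but never used; for comparison, a minimal hard-voting sketch over the same base models (an assumption, not part of the original pipeline):

voting = VotingClassifier(estimators=[('ada', ada), ('bag', bag_tree),
                                      ('svm', clf_svc.best_estimator_),
                                      ('lr', clf_LR), ('gbdt', clf_grad)],
                          voting='hard')
voting.fit(train_x, train_y)
print(cv_score_test(voting).mean())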

Submission

# predict on the Kaggle test set; its Survived column is all NaN, so drop it first
pre = ensem.predict(test_data.drop('Survived',axis=1)).astype(int)
submission = pd.DataFrame({'PassengerId':passID,'Survived':pre})

Judging from the submitted score, the ensemble model is not clearly better than the single models. Possible reasons: the base models are strongly correlated, the training data are limited, or the one-hot encoding may introduce collinearity. Although the training and test scores are close, the leaderboard score drops noticeably, probably because the data are limited, training is not sufficient, and the features are few and strongly correlated; introducing more features is worth considering.
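One rough way to probe the collinearity hypothesis would be to list the most strongly correlated feature pairs in the encoded training data; a sketch (the choice of showing the top 10 pairs is arbitrary):

corr = train_data.drop('Survived', axis=1).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))   # keep each pair once
print(upper.stack().sort_values(ascending=False).head(10))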
