Original problem:

Start here if...

You're new to data science and machine learning, or looking for a simple intro to the Kaggle prediction competitions.

Competition Description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Practice Skills

  • Binary classification
  • Python and R basics

Training data:

Features in the training data:

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
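Feature      Explanation
PassengerId  Passenger ID
Survived     0 = died, 1 = survived
Pclass       Socioeconomic class (1 = high, 2 = middle, 3 = low)
Name         Passenger name
Sex          Sex
Age          Age
SibSp        Number of siblings/spouses aboard
Parch        Number of parents/children aboard
Ticket       Ticket number
Fare         Fare paid
Cabin        Cabin number
Embarked     Port of embarkation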

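Approach: load the samples -> compute summary statistics (count, totals, mean, variance) -> fill missing values with an average (the code below uses the median for Age) -> ... -> cross-validation (evaluate on the training data itself: split it into parts 1, 2, and 3, train the model on two of them and test on the remaining one, rotate through all three, and average the validation scores; the original test set is not used) -> linear regression -> logistic regression -> random forest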
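The cross-validation step in the approach above can be sketched in plain Python. This is only an illustrative sketch, not the scikit-learn implementation: the helper names `k_fold_indices` and `cross_val_mean` are hypothetical, and the folds are contiguous and unshuffled.

```python
# Minimal sketch of 3-fold cross-validation (hypothetical helper names).
def k_fold_indices(n_samples, n_splits=3):
    """Yield (train_idx, test_idx) lists for contiguous, unshuffled folds."""
    # The first n_samples % n_splits folds get one extra sample.
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


def cross_val_mean(score_fn, n_samples, n_splits=3):
    """Average score_fn(train_idx, test_idx) over all folds."""
    scores = [score_fn(tr, te) for tr, te in k_fold_indices(n_samples, n_splits)]
    return sum(scores) / len(scores)


# Each sample lands in the test part exactly once across the three folds.
folds = list(k_fold_indices(7, n_splits=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

In a real run, `score_fn` would fit a model on the training indices and return its accuracy on the test indices; the mean over the folds is the reported score.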

# coding=utf-8
import os
import pandas as pd

file_root = os.path.realpath('titanic')
file_name_test = os.path.join(file_root, "test.csv")
file_name_train = os.path.join(file_root, "train.csv")
# Show all columns when printing
pd.set_option('display.max_columns', None)
titanic = pd.read_csv(file_name_train)
data = titanic.describe()  # the count row reveals which columns have missing values
titanic.info()
# Fill missing Age values with the median
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
data = titanic.describe()
print(data)
# Inspect the values of Sex and replace them with numbers
print("Original Sex values:", titanic['Sex'].unique())
titanic['Sex'] = titanic['Sex'].map({"male": 0, "female": 1})
print("Sex values after replacement:", titanic['Sex'].unique())
# Inspect the values of Embarked and replace them with numbers
print("Original Embarked values:", titanic['Embarked'].unique())
# Fill missing ports with 'S' before encoding
titanic["Embarked"] = titanic["Embarked"].fillna('S')
titanic["Embarked"] = titanic["Embarked"].map({'S': 0, 'C': 1, 'Q': 2})
print("Embarked values after replacement:", titanic['Embarked'].unique())
# Predict with a linear regression model
from sklearn.linear_model import LinearRegression
# Cross-validation
from sklearn import model_selection
# Features
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
# Initialize the model
alg = LinearRegression()
# titanic.shape is an (m, n) tuple, so titanic.shape[0] is the number of samples; n_splits is the number of cross-validation folds
print("titanic.shape[0]:", titanic.shape[0])
# Older scikit-learn API: kf = model_selection.KFold(titanic.shape[0], n_folds=3, random_state=1)
# random_state is omitted because it cannot be combined with shuffle=False
kf = model_selection.KFold(n_splits=3, shuffle=False)
predictions = []
# n_splits=3, so the loop runs over three folds
for train, test in kf.split(titanic['Survived']):
    # Training features for this fold
    train_predictors = titanic[predictors].iloc[train, :]
    # Training targets for this fold
    train_target = titanic['Survived'].iloc[train]
    # Fit the linear regression model
    alg.fit(train_predictors, train_target)
    # Predict on the held-out fold
    test_predictions = alg.predict(titanic[predictors].iloc[test, :])
    predictions.append(test_predictions)
# Evaluate on the held-out folds; the regression outputs are continuous values, mostly in [0, 1]
import numpy as np
# numpy.concatenate((a1, a2, ...), axis=0) joins several arrays in one call; a1, a2, ... are array-like arguments
predictions = np.concatenate(predictions, axis=0)
# Threshold the regression outputs at 0.5 to get class labels
predictions[predictions > .5] = 1
predictions[predictions <= .5] = 0
# Accuracy is the fraction of predictions that match the true labels
accuracy = (predictions == titanic['Survived']).mean()
print("Linear regression model: ", accuracy)
# Output: 0.78...
# Logistic regression
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings("ignore")
# Initialize the model
alg = LogisticRegression(random_state=1)
# Score the model with 3-fold cross-validation
scores = model_selection.cross_val_score(alg, titanic[predictors], titanic['Survived'], cv=3)
print("Logistic regression model: ", scores.mean())
# Random forest: build many decision trees that decide together, and combine their results.
# The random forest picks random subsets of these seven features
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
# Parameters: random seed, number of trees, minimum samples required to split a node, minimum samples per leaf
alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=4, min_samples_leaf=2)
kf = model_selection.KFold(n_splits=3, shuffle=False)
scores = model_selection.cross_val_score(alg, titanic[predictors], titanic['Survived'], cv=kf)
print("Random forest: ", scores.mean())
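The linear regression section above turns continuous regression outputs into 0/1 labels and then measures accuracy. A toy sketch of that step, assuming a 0.5 cutoff as in the code above; the function names and all numbers below are made up for illustration and are not taken from the Titanic data.

```python
# Toy illustration: turn continuous regression outputs into class labels
# and measure accuracy. The values below are hypothetical.
def threshold_predictions(raw_scores, cutoff=0.5):
    """Map scores above the cutoff to 1 (survived), the rest to 0 (died)."""
    return [1 if score > cutoff else 0 for score in raw_scores]


def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)


raw_scores = [0.12, 0.81, 0.50, 0.67, -0.05, 1.10]  # hypothetical regression outputs
actual = [0, 1, 1, 1, 0, 0]                          # hypothetical true labels

labels = threshold_predictions(raw_scores)
print(labels)                    # [0, 1, 0, 1, 0, 1]
print(accuracy(labels, actual))  # 4 of 6 labels match
```

Note that a linear regression can produce values below 0 or above 1, which is one reason the later sections switch to classifiers.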

Video: https://study.163.com/course/courseLearn.htm?courseId=1003551009#/learn/video?lessonId=1004052091&courseId=1003551009
