Kaggle Fundamentals: The Titanic Competition

Kaggle is a site where people create algorithms and compete against machine learning practitioners around the world. Your algorithm wins the competition if it’s the most accurate on a particular data set. Kaggle is a fun way to practice your machine learning skills.

This tutorial is based on part of our free, four-part course: Kaggle Fundamentals. This interactive course is the most comprehensive introduction to Kaggle’s Titanic competition ever made. The course includes a certificate on completion. Use the button below to start the course:

Start the free Kaggle Fundamentals course

In this tutorial we’ll learn how to:

  • Approach a Kaggle competition
  • Explore the competition data and learn about the competition topic
  • Prepare data for machine learning
  • Train a model
  • Measure the accuracy of your model
  • Prepare and make your first Kaggle submission

This tutorial presumes you have an understanding of Python and the pandas library. If you need to learn about these, we recommend our pandas tutorial blog post.

The Titanic competition

Kaggle has created a number of competitions designed for beginners. The most popular of these competitions, and the one we’ll be looking at, is about predicting which passengers survived the sinking of the Titanic.

In this competition, we have a data set of different information about passengers onboard the Titanic, and we see if we can use that information to predict whether those people survived or not. Before we start looking at this specific competition, let’s take a moment to understand how Kaggle competitions work.

Each Kaggle competition has two key data files that you will work with – a training set and a testing set.

The training set contains data we can use to train our model. It has a number of feature columns which contain various descriptive data, as well as a column of the target values we are trying to predict: in this case, Survival.

The testing set contains all of the same feature columns, but is missing the target value column. Additionally, the testing set usually has fewer observations (rows) than the training set.

This is useful because we want as much data as we can to train our model on. Once we have trained our model on the training set, we will use that model to make predictions on the data from the testing set, and submit those predictions to Kaggle.

In this competition, the two files are named test.csv and train.csv. We’ll start by using the pandas.read_csv() function to read both files, and then inspect their size.

import pandas as pd

test = pd.read_csv("test.csv")
train = pd.read_csv("train.csv")

print("Dimensions of train: {}".format(train.shape))
print("Dimensions of test: {}".format(test.shape))


Dimensions of train: (891, 12)
Dimensions of test: (418, 11)

Exploring the data

The files we just opened are available on the data page for the Titanic competition on Kaggle. That page also has a data dictionary, which explains the various columns that make up the data set. Below are the descriptions contained in that data dictionary:

  • PassengerId — A column added by Kaggle to identify each row and make submissions easier
  • Survived — Whether the passenger survived or not, and the value we are predicting (0=No, 1=Yes)
  • Pclass — The class of the ticket the passenger purchased (1=1st, 2=2nd, 3=3rd)
  • Sex — The passenger’s sex
  • Age — The passenger’s age in years
  • SibSp — The number of siblings or spouses the passenger had aboard the Titanic
  • Parch — The number of parents or children the passenger had aboard the Titanic
  • Ticket — The passenger’s ticket number
  • Fare — The fare the passenger paid
  • Cabin — The passenger’s cabin number
  • Embarked — The port where the passenger embarked (C=Cherbourg, Q=Queenstown, S=Southampton)

The data page on Kaggle has some additional notes about some of the columns. It’s always worth exploring this in detail to get a full understanding of the data.

Let’s take a look at the first few rows of the train dataframe.

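In pandas we can do this with DataFrame.head(); the exact call isn’t shown in this version of the post, but the output below is consistent with it:

train.head()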

   PassengerId  Survived  Pclass                                              Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                           Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th…  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                            Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1      Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                          Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S

The type of machine learning we will be doing is called classification, because when we make predictions we are classifying each passenger as ‘survived’ or not. More specifically, we are performing binary classification, which means that there are only two different states we are classifying.

In any machine learning exercise, thinking about the topic you are predicting is very important. We call this step acquiring domain knowledge, and it’s one of the most important determinants for success in machine learning.

In this case, understanding the Titanic disaster, and specifically what variables might affect the outcome of survival, is important. Anyone who has watched the movie Titanic will remember that women and children were given priority for the lifeboats (as they were in real life). You would also remember the vast class disparity of the passengers.

This indicates that Age, Sex, and Pclass may be good predictors of survival. We’ll start by exploring Sex and Pclass by visualizing the data.

Because the Survived column contains 0 if the passenger did not survive and 1 if they did, we can segment our data by sex and calculate the mean of this column. We can use DataFrame.pivot_table() to easily do this:

import matplotlib.pyplot as plt
%matplotlib inline

sex_pivot = train.pivot_table(index="Sex", values="Survived")
sex_pivot.plot.bar()
plt.show()

We can immediately see that females survived in much higher proportions than males did. Let’s do the same with the Pclass column.

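A sketch of the equivalent pivot for Pclass, following the same pattern as the Sex plot above (the variable name is an assumption):

pclass_pivot = train.pivot_table(index="Pclass", values="Survived")
pclass_pivot.plot.bar()
plt.show()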

Exploring and converting the age column

The Sex and Pclass columns are what we call categorical features. That means that the values represent a small number of separate options (for instance, whether the passenger was male or female).

SexPClass列是我们所谓的分类特征。 这意味着这些值代表几个单独的选项(例如,乘客是男性还是女性)。

Let’s take a look at the Age column using Series.describe().

train["Age"].describe()


count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64

The Age column contains numbers ranging from 0.42 to 80.0 (if you look at Kaggle’s data page, it informs us that Age is fractional if the passenger is less than one year old). The other thing to note here is that there are 714 values in this column, fewer than the 891 rows we saw the train data set has earlier in this tutorial, which indicates we have some missing values.

All of this means that the Age column needs to be treated slightly differently, as this is a continuous numerical column. One way to look at the distribution of values in a continuous numerical set is to use histograms. We can create two histograms to visually compare those who survived against those who died, across different age ranges:

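A sketch of how those two overlapping histograms can be drawn with pandas plotting (the variable names, colors, and bin count are assumptions, chosen to match the red-survived/blue-died description below):

survived = train[train["Survived"] == 1]
died = train[train["Survived"] == 0]
survived["Age"].plot.hist(alpha=0.5, color='red', bins=50)   # red: survived
died["Age"].plot.hist(alpha=0.5, color='blue', bins=50)      # blue: died
plt.legend(['Survived', 'Died'])
plt.show()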

The relationship here is not simple, but we can see that in some age ranges more passengers survived – where the red bars are higher than the blue bars.

In order for this to be useful to our machine learning model, we can convert this continuous feature into a categorical feature by dividing it into ranges. We can use the pandas.cut() function to help us out.

The pandas.cut() function has two required parameters – the column we wish to cut, and a list of numbers which define the boundaries of our cuts. We are also going to use the optional parameter labels, which takes a list of labels for the resultant bins. This will make it easier for us to understand our results.

Before we modify this column, we have to be aware of two things. First, any change we make to the train data we also need to make to the test data; otherwise, we will be unable to use our model to make predictions for our submissions. Second, we need to remember to handle the missing values we observed above.

We’ll create a function that:

  • Uses the fillna() method to fill all of the missing values with -0.5
  • Cuts the Age column into seven segments:
    • Missing, from -1 to 0
    • Infant, from 0 to 5
    • Child, from 5 to 12
    • Teenager, from 12 to 18
    • Young Adult, from 18 to 35
    • Adult, from 35 to 60
    • Senior, from 60 to 100

We’ll then use that function on both the train and test dataframes.

The diagram below shows how the function converts the data:

Note that the cut_points list has one more element than the label_names list, since it needs to define the upper boundary for the last segment.

def process_age(df, cut_points, label_names):
    df["Age"] = df["Age"].fillna(-0.5)
    df["Age_categories"] = pd.cut(df["Age"], cut_points, labels=label_names)
    return df

cut_points = [-1, 0, 5, 12, 18, 35, 60, 100]
label_names = ["Missing", "Infant", "Child", "Teenager",
               "Young Adult", "Adult", "Senior"]

train = process_age(train, cut_points, label_names)
test = process_age(test, cut_points, label_names)

pivot = train.pivot_table(index="Age_categories", values="Survived")
pivot.plot.bar()
plt.show()

Preparing our data for machine learning

So far we have identified three columns that may be useful for predicting survival:

  • Sex
  • Pclass
  • Age, or more specifically our newly created Age_categories

Before we build our model, we need to prepare these columns for machine learning. Most machine learning algorithms can’t understand text labels, so we have to convert our values into numbers.

Additionally, we need to be careful that we don’t imply any numeric relationship where there isn’t one. The data dictionary tells us that the values in the Pclass column are 1, 2, and 3. We can confirm this with pandas:

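The call itself is missing from this version of the post, but the output below is what Series.value_counts() produces for this column, so this is the likely reconstruction:

train["Pclass"].value_counts()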


3    491
1    216
2    184
Name: Pclass, dtype: int64

While the class of each passenger certainly has some sort of ordered relationship, the relationship between each class is not the same as the relationship between the numbers 1, 2, and 3. For instance, class 2 isn’t “worth” double what class 1 is, and class 3 isn’t “worth” triple what class 1 is.

In order to remove this relationship, we can create dummy columns for each unique value in Pclass:

Rather than doing this manually, we can use the pandas.get_dummies() function, which will generate the columns shown in the diagram above.

We’ll create a function to create the dummy columns for the Pclass column and add it back to the original dataframe. We’ll then apply that function on the train and test dataframes for each of the Pclass, Sex, and Age_categories columns.

def create_dummies(df, column_name):
    dummies = pd.get_dummies(df[column_name], prefix=column_name)
    df = pd.concat([df, dummies], axis=1)
    return df

for column in ["Pclass", "Sex", "Age_categories"]:
    train = create_dummies(train, column)
    test = create_dummies(test, column)

Creating our first machine learning model

Now that our data has been prepared, we are ready to train our first model. The first model we will use is called Logistic Regression, which is often the first model you will train when performing classification.

We will be using the scikit-learn library as it has many tools that make performing machine learning easier. The scikit-learn workflow consists of four main steps:

  • Instantiate (or create) the specific machine learning model you want to use
  • Fit the model to the training data
  • Use the model to make predictions
  • Evaluate the accuracy of the predictions

Each model in scikit-learn is implemented as a separate class and the first step is to identify the class we want to create an instance of. In our case, we want to use the LogisticRegression class.

We’ll start by looking at the first two steps. First, we need to import the class:

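That import (which also appears in the combined snippet further down) is:

from sklearn.linear_model import LogisticRegression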

Next, we create a LogisticRegression object:

lr = LogisticRegression()

Lastly, we use the LogisticRegression.fit() method to train our model. The .fit() method accepts two arguments: X and y. X must be a two dimensional array (like a dataframe) of the features that we wish to train our model on, and y must be a one-dimensional array (like a series) of our target, or the column we wish to predict.

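A sketch of that fit call on three columns (the original snippet isn’t reproduced in this version of the post; the column names come from the sentence below):

lr.fit(train[["Pclass_2", "Pclass_3", "Sex_male"]], train["Survived"])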

The code above fits (or trains) our LogisticRegression model using three columns: Pclass_2, Pclass_3, and Sex_male.

Let’s train our model using all of the columns we created with our create_dummies() function.

from sklearn.linear_model import LogisticRegression

columns = ['Pclass_1', 'Pclass_2', 'Pclass_3', 'Sex_female', 'Sex_male',
           'Age_categories_Missing', 'Age_categories_Infant',
           'Age_categories_Child', 'Age_categories_Teenager',
           'Age_categories_Young Adult', 'Age_categories_Adult',
           'Age_categories_Senior']

lr = LogisticRegression()
lr.fit(train[columns], train["Survived"])


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Splitting our training data

Congratulations, you’ve trained your first machine learning model! Our next step is to find out how accurate our model is, and to do that, we’ll have to make some predictions.

If you recall from earlier, we do have a test dataframe that we could use to make predictions. We could make predictions on that data set, but because it doesn’t have the Survived column we would have to submit it to Kaggle to find out our accuracy. This would quickly become a pain if we had to submit to find out the accuracy every time we optimized our model.

We could also fit and predict on our train dataframe, however if we do this there is a high likelihood that our model will overfit, which means it will perform well because we’re testing on the same data we’ve trained on, but then perform much worse on new, unseen data.

Instead we can split our train dataframe into two:

  • One part to train our model on (often 80% of the observations)
  • One part to make predictions with and test our model (often 20% of the observations)

The convention in machine learning is to call these two parts train and test. This can become confusing, since we already have our test dataframe that we will eventually use to make predictions to submit to Kaggle. To avoid confusion, from here on, we’re going to call this Kaggle ‘test’ data holdout data, which is the technical name given to this type of data used for final predictions.

The scikit-learn library has a handy model_selection.train_test_split() function that we can use to split our data. train_test_split() accepts two parameters, X and y, which contain all the data we want to train and test on, and returns four objects: train_X, test_X, train_y, test_y:

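Below is a sketch of that call; the 80/20 split matches the text, while the random_state value and the holdout rename are assumptions consistent with the rest of the tutorial:

from sklearn.model_selection import train_test_split

holdout = test  # the Kaggle 'test' set, which we now call holdout data

all_X = train[columns]
all_y = train['Survived']

train_X, test_X, train_y, test_y = train_test_split(
    all_X, all_y, test_size=0.2, random_state=0)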

You’ll notice that we use some extra parameters: test_size, which lets us control what proportions our data are split into, and random_state. The train_test_split() function randomizes observations before dividing them, and setting a random seed means that our results will be reproducible, so you can follow along and get the same result as we did.

Making predictions and measuring their accuracy

Now that we have our data split into train and test sets, we can fit our model again on our training set, and then use that model to make predictions on our test set.

Once we have fit our model, we can use the LogisticRegression.predict() method to make predictions.

The predict() method takes a single parameter X, a two dimensional array of features for the observations we wish to predict. X must have the exact same features as the array we used to fit our model. The method returns a single-dimensional array of predictions.

lr = LogisticRegression()
lr.fit(train_X, train_y)
predictions = lr.predict(test_X)

There are a number of ways to measure the accuracy of machine learning models, but when competing in Kaggle competitions you want to make sure you use the same method that Kaggle uses to calculate accuracy for that specific competition.

In this case, the evaluation section for the Titanic competition on Kaggle tells us that our score is calculated as “the percentage of passengers correctly predicted”. This is by far the most common form of accuracy for binary classification.

As an example, imagine we were predicting a small data set of five observations.

例如,假设我们正在预测一个包含五个观测值的小型数据集。

Our model’s prediction   The actual value   Correct
0                        0                  Yes
1                        0                  No
0                        1                  No
1                        1                  Yes
1                        1                  Yes

In this case, our model correctly predicted three out of five values, so the accuracy based on this prediction set would be 60%.

Again, scikit-learn has a handy function we can use to calculate accuracy: metrics.accuracy_score(). The function accepts two parameters, y_true and y_pred, which are the actual values and our predicted values respectively, and returns our accuracy score.

Let’s put all of these steps together, and get our first accuracy score.

from sklearn.metrics import accuracy_score

lr = LogisticRegression()
lr.fit(train_X, train_y)
predictions = lr.predict(test_X)
accuracy = accuracy_score(test_y, predictions)
print(accuracy)


0.810055865922

Using cross validation for more accurate error measurement

Our model has an accuracy score of 81.0% when tested against our 20% test set. Given that this data set is quite small, there is a good chance that our model is overfitting, and will not perform as well on totally unseen data.

To give us a better understanding of the real performance of our model, we can use a technique called cross validation to train and test our model on different splits of our data, and then average the accuracy scores.

The most common form of cross validation, and the one we will be using, is called k-fold cross validation. ‘Fold’ refers to each different iteration that we train our model on, and ‘k’ just refers to the number of folds. In the diagram above, we have illustrated k-fold validation where k is 5.

We will use scikit-learn’s model_selection.cross_val_score() function to automate the process. The basic syntax for cross_val_score() is:

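Schematically (keyword defaults omitted), the call looks like:

cross_val_score(estimator, X, y, cv=None)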

  • estimator is a scikit-learn estimator object, like the LogisticRegression() objects we have been creating.
  • X is all features from our data set.
  • y is the target variable.
  • cv specifies the number of folds.

The function returns a numpy ndarray of the accuracy scores of each fold. It’s worth noting that the cross_val_score() function can use a variety of cross validation techniques and scoring types, but it defaults to k-fold validation and accuracy scores for our input types.

We’ll use model_selection.cross_val_score() to perform cross-validation on our data, before calculating the mean of the scores produced:

from sklearn.model_selection import cross_val_score

lr = LogisticRegression()
scores = cross_val_score(lr, all_X, all_y, cv=10)
scores.sort()
accuracy = scores.mean()

print(scores)
print(accuracy)


[ 0.76404494  0.76404494  0.7752809   0.78651685  0.8         0.80681818
  0.80898876  0.81111111  0.83146067  0.87640449]
0.802467086596

Making predictions on unseen data

From the results of our k-fold validation, you can see that the accuracy number varies with each fold – ranging between 76.4% and 87.6%. This demonstrates why cross validation is important.

As it happens, our average accuracy score was 80.2%, which is not far from the 81.0% we got from our simple train/test split, however this will not always be the case, and you should always use cross-validation to make sure the error metrics you are getting from your model are accurate.

We are now ready to use the model we have built to train our final model and then make predictions on our unseen holdout data, or what Kaggle calls the ‘test’ data set.

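A sketch of that final step, reusing columns, all_X, all_y, and holdout from earlier (the exact snippet isn’t shown in this version of the post):

lr = LogisticRegression()
lr.fit(all_X, all_y)                              # train on all labelled data
holdout_predictions = lr.predict(holdout[columns])  # predict on the Kaggle 'test' set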

Creating a submission file

The last thing we need to do is create a submission file. Each Kaggle competition can have slightly different requirements for the submission file. Here’s what is specified on the Titanic competition evaluation page:

You should submit a csv file with exactly 418 entries plus a header row. Your submission will show an error if you have extra columns (beyond PassengerId and Survived) or rows.

The file should have exactly 2 columns:

  • PassengerId (sorted in any order)
  • Survived (contains your binary predictions: 1 for survived, 0 for deceased)

The table below shows this in a slightly easier to understand format, so we can visualize what we are aiming for.

PassengerId   Survived
892           0
893           1
894           0

We will need to create a new dataframe that contains the holdout_predictions we created in the previous step and the PassengerId column from the holdout dataframe. We don’t need to worry about matching the data up, as both of these remain in their original order.

To do this, we can pass a dictionary to the pandas.DataFrame() function:

holdout_ids = holdout["PassengerId"]
submission_df = {"PassengerId": holdout_ids,
                 "Survived": holdout_predictions}
submission = pd.DataFrame(submission_df)

Finally, we’ll use the DataFrame.to_csv() method to save the dataframe to a CSV file. We need to make sure the index parameter is set to False, otherwise we will add an extra column to our CSV.

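A sketch of the save (the filename is an assumption; index=False is described above):

submission.to_csv("submission.csv", index=False)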

Making our first submission to Kaggle

You can download the submission file created above from within our free Kaggle Fundamentals course. When working on your own computer, it will be in the same directory as your notebook.

Now that we have our submission file, we can start our submission to Kaggle by clicking the blue ‘Submit Predictions’ button on the competition page.

You will then be prompted to upload your CSV file, and add a brief description of your submission. When you make your submission, Kaggle will process your predictions and give you your accuracy for the holdout data and your ranking. When it has finished processing, you will see that our first submission gets an accuracy score of 0.75598, or 75.6%.

The fact that our accuracy on the holdout data is 75.6% compared with the 80.2% accuracy we got with cross-validation indicates that our model is overfitting slightly to our training data.

At the time of writing, accuracy of 75.6% gives a rank of 6,663 out of 7,954. It’s easy to look at Kaggle leaderboards after your first submission and get discouraged, but keep in mind that this is just a starting point.

It’s also very common to see a small number of scores of 100% at the top of the Titanic leaderboard and think that you have a long way to go. In reality, anyone scoring about 90% on this competition is likely cheating (it’s easy to look up the names of the passengers in the holdout set online and see if they survived).

There is a great analysis on Kaggle, How am I doing with my score, which uses a few different strategies and suggests a minimum score for this competition is 62.7% (achieved by presuming that every passenger died) and a maximum of around 82%. We are a little over halfway between the minimum and maximum, which is a great starting point.

Continue learning about Kaggle

There are many things we can do to improve the accuracy of our model. Here are some of the things you’ll learn in the rest of our Kaggle fundamentals course:

  • Feature Preparation, Selection, and Engineering

    • How to determine which features in your model are the most relevant to your predictions
    • Ways to reduce the number of features used to train your model and avoid overfitting
    • Techniques to create new features to improve the accuracy of your model
  • Model Selection and Tuning
    • How the k-nearest neighbors and random forests algorithms work
    • About hyperparameters, and how to select the hyperparameters that give the best predictions
    • How to compare different algorithms to improve the accuracy of your predictions
  • Creating A Kaggle Workflow
    • How to use Jupyter notebook while working with Kaggle competitions
    • Why workflows are important for machine learning, and how to create a Kaggle workflow
    • How to use functions to automate and simplify repetitive machine learning tasks

Translated from: https://www.pybloggers.com/2017/10/kaggle-fundamentals-the-titanic-competition/
