Did you know that a novel predicted the sinking of the Titanic 14 years before the actual disaster?


In 1898 (14 years before the Titanic sank), American author Morgan Robertson wrote a novel titled ‘The Wreck of the Titan.’


The book was about a fictional ocean liner that sinks after colliding with an iceberg. In the book, the ship is described as “unsinkable” and does not carry enough lifeboats for everyone on board. Sound familiar? You’re right: it is essentially the epic story of the Titanic, written years before it happened.

The Wreck of the Titan

We cannot say whether the author had any technical basis for his prediction, but as responsible data science enthusiasts we can use the data set to estimate the likely outcomes of the disaster, and we can even try to envision alternative versions of the climax.


I am sure that all of us know what happened to Rose and Jack in the movie Titanic. We all wished that the story had a different ending, didn’t we? Let’s try to make our wish come true by re-creating the climax of the story through a simple analysis of the data.

By the end of the analysis we will have created three climaxes and answered three questions:


• Was there a possibility that Jack survived and Rose did not?


• Was there a chance for Jack and Rose to narrate their adventurous story to their grandchildren together?


• Did Cal Hockley (Rose’s fiancé) have a higher chance of survival because he belonged to the upper class, and what would it have taken for the villain to die?


We carry out our analysis using the Matplotlib, NumPy, pandas, and seaborn libraries.

Let us see what each library does:


Matplotlib is a Python library used for visualizing data sets with a wide variety of plots, such as bar plots, line plots, and histograms.

NumPy is also a Python library; it provides a high-performance multidimensional array object and basic tools for computing with and manipulating these arrays.


Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.


Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.


Let us start our journey…


Data Exploration:

import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

We import all the necessary libraries and use the pandas library to read the data set, which holds the passenger statistics of the Titanic disaster.

titanic_df = pd.read_csv('titanic.csv')

Then we display the first five entries using the head command to get a glimpse of the nature of the data and the columns we are going to explore.


Titanic.csv First 5 entries
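
A minimal sketch of this step. Since titanic.csv itself is not reproduced in this post, the snippet reads a few made-up rows from an in-memory string instead; the rows and the names sample_csv and sample_df are assumptions for illustration, and only the column names follow the data set described here.

```python
import io
import pandas as pd

# Hypothetical rows for illustration only; they are not taken from titanic.csv.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare\n"
    "1,0,3,male,21,0,0,8.00\n"
    "2,1,1,female,40,1,0,70.00\n"
    "3,1,2,female,19,0,1,15.50\n"
    "4,0,3,male,,0,0,7.75\n"
    "5,1,1,female,33,1,2,90.00\n"
    "6,0,2,male,52,0,0,13.00\n"
)

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(sample_csv)

# head() returns the first five rows by default
first_rows = sample_df.head()
print(first_rows)
```

With the real file, the same call would simply be titanic_df.head().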

Feature Analysis:

Looking at titanic_df.describe(), we gain a lot of useful insights and can identify the columns we can ignore.


Titanic.csv describe command to see the column characteristics

• PassengerId: Unique for each passenger, so it has no relation to the survival label and need not be considered in the analysis.


• Survived: Survival is binary, 0 if the passenger died and 1 if the passenger survived, so this will be the only ‘Y’ variable in XY plotting.

• Pclass: Integer equal to 1, 2, or 3 indicating the class of each passenger (upper, middle, or lower, respectively). It can be included in the analysis, as its three categories may contribute to the survival of passengers.


• Age: Number representing the age of each passenger, though as we can see in titanic_df.tail(), some passengers have NaN for their age. It can also be considered, since younger passengers may have been able to act swiftly and escape, so it may contribute to the survival label.


• SibSp: Number of siblings and spouses also on board. We should not completely ignore it, as it may or may not support the survival label.


• Parch: Number of parents and children also on board. This is a similar case to SibSp.


• Fare: Amount paid for the ticket by each passenger. It may add to the Passenger Class label, as a higher fare generally means a higher class of ticket.
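
Because the Age column contains NaN entries, it helps to count them before relying on any age-based figure. The following is a minimal sketch on toy values; the frame toy_df and its numbers are made up for illustration and are not drawn from titanic.csv.

```python
import numpy as np
import pandas as pd

# Toy data (assumed for illustration) with two missing ages
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 4.0, 58.0, np.nan],
    'Fare': [7.5, 8.0, 30.0, 80.0, 7.9],
})

# Number of missing values in each column
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# pandas skips NaN by default, so the mean uses only the three known ages
mean_age = toy_df['Age'].mean()
print(mean_age)
```

On the real data, titanic_df.isna().sum() would give the same kind of per-column count.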


For a quick comparison, we’ll use NumPy functions to verify the mean, standard deviation, min, and max of the numerical columns.


columns = list(titanic_df[['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']])

def describe_data(data, col):
    print('\n\n', col)
    print('_' * 40)
    print('Mean:', np.mean(data))  # NumPy Mean
    print('STD:', np.std(data))    # NumPy STD
    print('Min:', np.min(data))    # NumPy Min
    print('Max:', np.max(data))    # NumPy Max

for c in columns:
    describe_data(titanic_df[c], c)
Mean, Standard Deviation, Minimum and Maximum values for each Label
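
One detail to keep in mind when comparing these numbers with describe(): pandas reports the sample standard deviation (ddof=1), while np.std defaults to the population standard deviation (ddof=0), so the two can differ slightly. A small sketch on toy values (the series fares is made up for illustration):

```python
import numpy as np
import pandas as pd

fares = pd.Series([10.0, 20.0, 30.0, 40.0])  # toy values, not real fares

pandas_std = fares.std()                              # sample std, ddof=1
numpy_std = np.std(fares.to_numpy())                  # population std, ddof=0
numpy_std_sample = np.std(fares.to_numpy(), ddof=1)   # matches pandas

print(pandas_std, numpy_std, numpy_std_sample)
```

Passing ddof=1 to np.std brings the two into agreement.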

Insights from these are:

• Survived is a categorical label with 0 or 1 values.


• Around 38% of the passengers in the sample survived, compared with an actual survival rate of about 32%.


• Most passengers (> 75%) did not travel with parents or children.


• Nearly 30% of the passengers had siblings and/or a spouse aboard.


• Fares varied significantly, with few passengers (<1%) paying as much as $512.


• Few passengers (<1%) were elderly, within the age range 65–80.
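
Percentages like the ones above can be read off directly as the mean of a 0/1 column or of a boolean condition. A minimal sketch, using a made-up five-row frame (toy_df and its values are assumptions for illustration, so the resulting numbers are not the Titanic figures):

```python
import pandas as pd

# Toy data for illustration only
toy_df = pd.DataFrame({
    'Survived': [1, 0, 0, 1, 0],
    'Parch':    [0, 0, 2, 0, 0],
    'SibSp':    [1, 0, 0, 0, 1],
})

# Mean of a 0/1 column is the share of ones
survival_rate = toy_df['Survived'].mean()

# Mean of a boolean Series is the share of rows where the condition holds
share_without_parch = (toy_df['Parch'] == 0).mean()
share_with_sibsp = (toy_df['SibSp'] > 0).mean()

print(survival_rate, share_without_parch, share_with_sibsp)
```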


Great numbers! Let us move on to realize our dream climaxes…


Climax 1: Jack Lived, Rose Died!!!

If Jack Survived…

Consider the reverse case, where Jack narrates his love story to his grandchildren and Rose sadly dies. Is there a possibility for this? Let us examine it, keeping in mind that Jack is male and belongs to the lower class, while Rose is female and belongs to the upper class.

We first need to break the analysis into several parts. First, we will look at the impact sex had on survival by pivoting the data frame.


titanic.csv description
titanic_df[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)
Male Female survival calculation

This table shows us the percentage of females that survived and the percentage of males that survived. The female survival rate was 74.2%, and the male survival rate was 18.9%. The huge gap between these numbers is an immediate indication that the female survival rate on the Titanic was significantly higher than the male survival rate, and that being a woman did in fact increase Rose’s chances of survival.
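
The gap between two survival rates can be expressed either as a difference in percentage points or as a ratio, and the two should not be confused. A short sketch on toy data (the frame toy_df is made up for illustration; its rates are not the Titanic rates):

```python
import pandas as pd

# Toy data for illustration only
toy_df = pd.DataFrame({
    'Sex': ['female', 'female', 'female', 'female',
            'male', 'male', 'male', 'male'],
    'Survived': [1, 1, 1, 0, 0, 0, 1, 0],
})

# Survival rate per sex: mean of the 0/1 Survived column within each group
rate_by_sex = toy_df.groupby('Sex')['Survived'].mean()

# Difference in percentage points, and ratio of the two rates
gap_points = (rate_by_sex['female'] - rate_by_sex['male']) * 100
ratio = rate_by_sex['female'] / rate_by_sex['male']

print(rate_by_sex)
print(gap_points, ratio)
```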


titanic_df['AgeRange'] = pd.cut(titanic_df['Age'], 16)

# Calculate proportion of survivors for each AgeRange
titanic_df[['AgeRange', 'Survived']].groupby(['AgeRange'], as_index=False).mean().sort_values(by='AgeRange', ascending=True)
Age Range for survivors
age_sex_hist = sns.FacetGrid(titanic_df, col='Survived', row='Sex', hue='Survived')
age_sex_hist.map(plt.hist, 'Age', bins=20)
Age Gender Comparisons

This breakdown gives us an interesting and informative view of the survival rates of women and children. If we look at the data for children aged 5 and under, we can see that sex did not have much of an impact on survival: by and large, most of the children in this sample survived, including the boys. Another insight is just how many of the males on board died (aside from boys aged 5 and under); men were more likely to have died than to have survived. When we make the same comparison for females, we see that females in almost every age range were more likely to have survived than to have died. To validate these inferences, we can look at the numbers in a table once again, though a table becomes harder to read as more variables are added. Viewing this type of table also emphasizes how helpful histograms can be for visualizing data.

titanic_df[['Sex', 'AgeRange', 'Survived']].groupby(['Sex', 'AgeRange'], as_index=False).mean().sort_values(by='AgeRange', ascending=True)
Gender Age Range and Survival Comparisons
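
A long grouped table can be made easier to scan by pivoting one variable into columns. The sketch below is one possible way to do that, on made-up rows; toy_df, the AgeBand column, and the band edges are all assumptions for illustration rather than part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; one age is missing on purpose
toy_df = pd.DataFrame({
    'Sex': ['male', 'male', 'male', 'female', 'female', 'female'],
    'Age': [3.0, 30.0, 45.0, 4.0, 28.0, np.nan],
    'Survived': [1, 0, 0, 1, 1, 1],
})

# Explicit, labelled age bands (the edges are an arbitrary choice)
toy_df['AgeBand'] = pd.cut(
    toy_df['Age'],
    bins=[0, 5, 18, 60, 100],
    labels=['0-5', '6-18', '19-60', '61+'],
)

# Rows: sex, columns: age band, cells: mean survival in that cell.
# The row with a missing age falls in no band and is left out.
table = toy_df.pivot_table(
    index='Sex', columns='AgeBand', values='Survived',
    aggfunc='mean', observed=True,
)
print(table)
```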

Based on these observations and numbers, we can conclude that both women and children had a higher chance of survival, so our Climax 1 has a low probability of being realized. We also cannot change factors like gender and age to suit our wish. So there is little chance of Rose dying and Jack staying alive. Poor Jack!!!

Poor Jack…

Climax 2: Jack and Rose Escaped and Lived Happily!!!

Imagine that, just as Rose is about to finish the story of her adventurous trip, Jack joins her and ends it with, “That’s how your grandma and grandpa fell in love with each other and ended up together.”


That is the most heart-warming climax one could ever want. So how could it have been realized? Let us analyze the passenger class, as it is one of the important columns in the Titanic data set.


Were upper-class passengers more likely to have made it onto a lifeboat than middle and lower class passengers? Let us make it interesting by examining using bar and point plots.


titanic_df['Passenger'] = 'Passenger'

# Create Class column with string values for class
titanic_df['Class'] = titanic_df['Pclass'].map({1: 'Upper', 2: 'Middle', 3: 'Lower'})

# Create PointPlot for Passengers by Class
bp = sns.pointplot(x='Passenger', y='Survived', hue='Class', data=titanic_df, hue_order=['Lower', 'Middle', 'Upper'])
bp.set(ylabel='% Survivors', xlabel='Passenger Class')
Point Plot

This plot shows the average survival and confidence interval of passengers by class. Looking at the breakdown of average survival rates by class shows a correlation between class and rate of survival. Lower class passenger survival ranged somewhere between 20–30%, while upper-class survival ranged somewhere between 55–70%, with middle class ranging somewhere between 35–55%.
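
To see roughly where such ranges come from, the mean and an approximate 95% interval can also be computed by hand. Note the swap in technique: seaborn, to my understanding, estimates its interval by bootstrapping by default, whereas the sketch below uses the simpler normal approximation for a proportion, so the numbers would not match the plot exactly. The frame toy_df is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy_df = pd.DataFrame({
    'Pclass':   [1, 1, 1, 1, 3, 3, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 0, 0, 1, 0, 0, 0],
})

# Survival rate and group size per class
summary = toy_df.groupby('Pclass')['Survived'].agg(['mean', 'count'])

# Standard error of a proportion: sqrt(p * (1 - p) / n)
summary['se'] = np.sqrt(summary['mean'] * (1 - summary['mean']) / summary['count'])

# Normal-approximation 95% interval: mean +/- 1.96 standard errors
summary['ci_low'] = summary['mean'] - 1.96 * summary['se']
summary['ci_high'] = summary['mean'] + 1.96 * summary['se']

print(summary)
```

With groups this small the normal approximation is crude (the interval can even leave the 0 to 1 range), which is one reason a bootstrap is often preferred.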


So, to make our Climax 2 come true, instead of joining Jack for the party in his lower-class quarters, Rose could have taken Jack along with her to the upper class and enjoyed the day there, which might have kept them both alive for a happy ending.

Jack and Rose

Climax 3: Cal Hockley, Want Him Dead or Alive???

Cal Hockley, the villain who took advantage of Rose’s situation and pressured her into accepting his marriage proposal, also seemed to have escaped the sinking. What could be the reason?

Let us explore the passenger class and gender columns in detail using seaborn plots, as they contribute most significantly to survival and include the parameters that vary between Jack and Cal.


bps = sns.barplot(x='Sex', y='Survived', hue='Class', data=titanic_df, hue_order=['Lower', 'Middle', 'Upper'])
bps.set(ylabel='% Survivors', xlabel='Passenger Sex by Class')
Bar Plot

We can see that middle-class female passengers had almost the same rate of survival as upper-class females, but middle-class men had about the same rate of survival as lower-class men, which further illustrates the greater likelihood of women having survived. Overall, we can observe that upper-class passengers did indeed have a higher chance of survival than lower-class passengers, regardless of sex.
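
The heights of bars like these are simply group means, so the same comparison can be read from a small table. A minimal sketch on made-up rows (toy_df and its values are assumptions for illustration, so the rates are not the Titanic rates):

```python
import pandas as pd

# Toy data for illustration only
toy_df = pd.DataFrame({
    'Sex':      ['male', 'male', 'male', 'male',
                 'female', 'female', 'female', 'female'],
    'Pclass':   [1, 1, 3, 3, 1, 1, 3, 3],
    'Survived': [1, 0, 0, 0, 1, 1, 1, 0],
})

# Same string labels for class as used in the plotting code
toy_df['Class'] = toy_df['Pclass'].map({1: 'Upper', 2: 'Middle', 3: 'Lower'})

# Mean survival per (Sex, Class), with Class spread across the columns
rates = toy_df.groupby(['Sex', 'Class'])['Survived'].mean().unstack('Class')
print(rates)
```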


np.corrcoef(x=titanic_df['Pclass'], y=titanic_df['Survived'])
Correlation Matrix

A negative correlation tells us that as the class number increases (1 → 2 → 3), survival decreases. Since the lower class is represented as 3, the lower class is correlated with lower survival.
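
For readers who want to see what np.corrcoef returns, here is a small sketch on toy arrays (pclass and survived are made-up values for illustration). It returns a 2×2 matrix whose off-diagonal entry is the Pearson correlation, which can be checked against the definition: the sum of products of centered values divided by the square root of the product of their sums of squares.

```python
import numpy as np

# Toy values for illustration only
pclass = np.array([1, 1, 2, 2, 3, 3, 3, 3])
survived = np.array([1, 1, 1, 0, 0, 0, 1, 0])

# 2x2 correlation matrix; entry [0, 1] is the correlation between the two inputs
r_matrix = np.corrcoef(pclass, survived)
r = r_matrix[0, 1]

# Same value from the definition of Pearson's r
x_c = pclass - pclass.mean()
y_c = survived - survived.mean()
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

print(r, r_manual)
```

In this toy sample the sign is negative because the higher class numbers mostly pair with Survived = 0.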


Unfortunately, Cal Hockley had the better chance of survival!

Cal Hockley

Apart from our assumptions about the climaxes, there are certain limitations:

  • As some of these inferences were drawn based on correlation, it is important to remember that correlation does not imply causation.
  • Since we know that some passengers did not have a recorded age, entries with ‘NaN’ (null) were not taken into account when running these numbers.
  • Conclusions were drawn from descriptive statistics and charts; we opted not to run t-tests on the sample.

Interesting Findings:

What proportion of passengers in the sample survived?


  • 38% of total passengers in the sample survived.

Did women and children have a higher survival rate?


  • The female survival rate in this sample was 55.3 percentage points higher than the survival rate for males (74.2% versus 18.9%).
  • Women had a much higher rate of survival than men.
  • Children aged 5 and under, regardless of sex, had a much higher rate of survival.

Did upper-class passengers in the sample have an advantage that translated into a higher survival rate than lower-class passengers?


  • Class is clearly correlated with survival, with upper-class passengers having a much higher rate of survival than lower-class passengers, regardless of sex.
  • Upper-class passengers were more likely to survive than lower-class passengers.

So, as we approach the climax of our post, let’s quickly summarize: we gained some insights into the Matplotlib, NumPy, pandas, and seaborn libraries, which are essential for data science.


Also, instead of mourning the loss of Jack and the separation of true love, we explored possibilities for changing the climax. That is exactly the duty of a data scientist: to analyze the data and come up with useful possibilities for attaining desired outcomes.


Now it’s your turn, pals, to create your own customized climaxes and conclusions with this kind of simple analysis of a data set, and to come up with creative and innovative endings for your favorite historical epics. Kudos to the learners!


The End (Happy Ending)!


You can find the code and dataset at:






Anjana M P — https://anjana21it.wixsite.com/mysite


Pradeepa K — https://ptljkpd.wixsite.com/pradeepa


Translated from: https://medium.com/diva-coders/envision-the-titanic-climax-with-matplotlib-numpy-pandas-d5568cc6d0cc



