Data Science

CitiBike is New York City’s famous bike rental company and the largest in the USA. CitiBike launched in May 2013 and has become an essential part of the city’s transportation network. It makes commuting fun, efficient, and affordable, not to mention healthy and good for the environment.

I got the CitiBike trip data for June 2013 from Kaggle. I will walk you through a complete exploratory data analysis that answers questions such as:

  1. Where do CitiBikers ride?
  2. When do they ride?
  3. How far do they go?
  4. Which stations are most popular?
  5. What days of the week are most rides taken on?
  6. And many more

Key learning:


I have used many parameters to tweak the plotting functions of Matplotlib and Seaborn, so this article is a good way to learn them through practice.



This article is best viewed on a larger screen, such as a tablet or desktop. If anything is hard to follow, you can leave your queries in the comment section of my Kaggle notebook, which is linked at the end of this article.

Let’s get started

Importing necessary libraries and reading data.


    # importing necessary libraries
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # setting plot style to seaborn
    # (on Matplotlib 3.6+ this style is named 'seaborn-v0_8')
    plt.style.use('seaborn')

    # reading data
    df = pd.read_csv('../input/citibike-system-data/201306-citibike-tripdata.csv')
    df.head()

Let’s get some more information on the data.


    # sum of missing values in each column
    df.isna().sum()

We have a whopping 577,703 rows to crunch and 15 columns. There are also quite a few missing values. Let’s deal with the missing values first.

Handling missing values

Let’s first see the percentage of missing values, which will help us decide whether to drop them or not.


    # calculating the percentage of missing values:
    # number of missing values in the column divided by the total number of rows, times 100
    data_loss1 = round((df['end station id'].isna().sum() / df.shape[0]) * 100)
    data_loss2 = round((df['birth year'].isna().sum() / df.shape[0]) * 100)
    print(data_loss1, '% of data loss if NaN rows of end station id, \nend station name, end station latitude and end station longitude dropped.\n')
    print(data_loss2, '% of data loss if NaN rows of birth year dropped.')
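The same percentage can be computed for every column at once. The following is a minimal sketch on a made-up toy frame (the values are illustrative, not taken from the dataset); it relies on the fact that the mean of a boolean column is the fraction of True values.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the trip data; all values are made up.
toy = pd.DataFrame({
    'end station id': [72.0, np.nan, 79.0, 82.0],
    'birth year': [1983.0, np.nan, np.nan, 1990.0],
    'tripduration': [634, 1547, 178, 1580],
})

# isna() gives a boolean frame; the column mean is the share of missing values.
missing_pct = toy.isna().mean() * 100
print(missing_pct.round(1))
```

On the real data, replacing `toy` with `df` would list the missing share of all 15 columns in one call.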

We cannot afford to drop the rows where ‘birth year’ is missing. Hence, we drop the entire ‘birth year’ column and drop only the rows with missing values in ‘end station id’, ‘end station name’, ‘end station latitude’, and ‘end station longitude’. Fortunately, the missing values in these four columns all fall on exactly the same rows, so dropping the NaN rows across all four columns still results in only 3% data loss.

    # dropping NaN values
    rows_before_dropping = df.shape[0]

    # drop the entire birth year column
    df.drop('birth year', axis=1, inplace=True)

    # the remaining NaN values are in end station id, end station name,
    # end station latitude and end station longitude; they share the same rows,
    # so dropping NaN rows still results in only 3% data loss
    df.dropna(axis=0, inplace=True)
    rows_after_dropping = df.shape[0]

    # total data loss
    print('% of data lost: ', ((rows_before_dropping - rows_after_dropping) / rows_before_dropping) * 100)

    # checking for NaN
    df.isna().sum()
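The claim that the four end-station columns are missing on the same rows can be checked directly rather than assumed. Below is a small sketch on a made-up toy frame (station names and coordinates are placeholders); it also uses `dropna(subset=...)`, a swapped-in variant of the plain `dropna` above that makes explicit which columns drive the row removal.

```python
import numpy as np
import pandas as pd

end_cols = ['end station id', 'end station name',
            'end station latitude', 'end station longitude']

# Toy frame; all values are placeholders for illustration.
toy = pd.DataFrame({
    'end station id': [72.0, np.nan, 79.0],
    'end station name': ['Station A', None, 'Station B'],
    'end station latitude': [40.76, np.nan, 40.71],
    'end station longitude': [-73.99, np.nan, -74.00],
    'tripduration': [634, 1547, 178],
})

nan_mask = toy[end_cols].isna()
# True only if each row is either complete or missing in all four columns.
same_rows = bool((nan_mask.any(axis=1) == nan_mask.all(axis=1)).all())

# Dropping on an explicit subset ignores NaN values in unrelated columns.
cleaned = toy.dropna(subset=end_cols)
print(same_rows, len(cleaned))
```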

Let’s see what gender tells us about our data

    # plotting total number of males and females
    splot = sns.countplot(x='gender', data=df)

    # adding value above each bar: annotation
    for p in splot.patches:
        # bar value is the height of the bar
        an = splot.annotate(format(p.get_height(), '.2f'),
                            (p.get_x() + p.get_width() / 2., p.get_height()),
                            ha='center',
                            va='center',
                            xytext=(0, 10),
                            textcoords='offset points')
        an.set_size(20)  # text size

    splot.axes.set_title("Gender distribution", fontsize=30)
    splot.axes.set_xlabel("Gender", fontsize=20)
    splot.axes.set_ylabel("Count", fontsize=20)

    # adding x tick values
    splot.axes.set_xticklabels(['Unknown', 'Male', 'Female'])
    plt.show()

We can see more male riders than female riders in New York City, but because so many riders have an unknown gender, we cannot draw any firm conclusion. Filling in the unknown gender values is possible, but we are not going to do it, since those riders chose not to disclose their gender.
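An alternative to relabelling the ticks after plotting is to map the numeric codes to names first. This is only a sketch: the mapping 0 = Unknown, 1 = Male, 2 = Female is an assumption inferred from the tick-label order used above, and the toy values are made up.

```python
import pandas as pd

# Toy gender codes; values are made up for illustration.
toy = pd.DataFrame({'gender': [1, 2, 0, 1, 1, 0]})

# Assumed code-to-label mapping, inferred from the tick labels in the plot.
gender_labels = {0: 'Unknown', 1: 'Male', 2: 'Female'}
toy['gender_label'] = toy['gender'].map(gender_labels)

gender_counts = toy['gender_label'].value_counts()
print(gender_counts)
```

Plotting the labelled column keeps the labels attached to the data, so they cannot drift out of order if a category is missing.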

Subscribers vs Customers

Subscribers are the users who bought the annual pass, and customers are the ones who bought either a 24-hour pass or a 3-day pass. Let’s see which option riders choose the most.

    user_type_count = df['usertype'].value_counts()
    plt.pie(user_type_count.values,
            labels=user_type_count.index,
            autopct='%1.2f%%',
            textprops={'fontsize': 15})
    plt.title('Subscribers vs Customers', fontsize=20)
    plt.show()

We can see that there are more yearly subscribers than short-term (24-hour or 3-day) customers, but the difference is not large. The company should focus on converting customers into subscribers with offers or sales.

How long do riders typically use the bike

We have a column called ‘tripduration’, which holds the duration of each trip in seconds. First, we will convert it to minutes, then create bins to group the trips into ride times of 0-30 min, 30-60 min, 60-120 min, and 120 min and above. Then, let’s plot a graph to see how long riders typically use the bike.

    # converting trip duration from seconds to minutes
    df['tripduration'] = df['tripduration'] / 60

    # creating bins (0-30min, 30-60min, 60-120min, 120 and above)
    max_limit = df['tripduration'].max()
    df['tripduration_bins'] = pd.cut(df['tripduration'], [0, 30, 60, 120, max_limit])

    # estimator=np.size makes each bar show the number of trips in the bin
    sns.barplot(x='tripduration_bins', y='tripduration', data=df, estimator=np.size)
    plt.title('Usual riding time', fontsize=30)
    plt.xlabel('Trip duration group', fontsize=20)
    plt.ylabel('Number of trips', fontsize=20)
    plt.show()

A large number of riders ride for less than half an hour per trip, and most trips last less than one hour.
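The binning step can also be sketched with named labels and an open-ended last bin. This is a variant, not the code used above: the label strings and the use of `np.inf` as the top edge are my own choices, and the durations are made-up values in minutes.

```python
import numpy as np
import pandas as pd

# Made-up trip durations in minutes.
minutes = pd.Series([5.0, 29.5, 30.0, 45.0, 61.0, 119.0, 300.0], name='tripduration')

# Right-closed bins: (0, 30], (30, 60], (60, 120], (120, inf).
edges = [0, 30, 60, 120, np.inf]
labels = ['0-30 min', '30-60 min', '60-120 min', '120+ min']
bins = pd.cut(minutes, bins=edges, labels=labels)

# Count trips per bin, ordered by bin rather than by frequency.
trips_per_bin = bins.value_counts().sort_index()
print(trips_per_bin)
```

Because the bins are right-closed, a trip of exactly 30 minutes lands in the first bin, and a trip of exactly 0 minutes would fall outside every bin.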


Same start and end location vs different start and end location

We see in the data there are some trips that start and end at the same location. Let’s see how many.


    # number of trips that started and ended at the same station
    start_end_same = df[df['start station name'] == df['end station name']].shape[0]

    # number of trips that started and ended at different stations
    start_end_diff = df.shape[0] - start_end_same

    plt.pie([start_end_same, start_end_diff],
            labels=['Same start and end location',
                    'Different start and end location'],
            autopct='%1.2f%%',
            textprops={'fontsize': 15})
    plt.title('Same start and end location vs Different start and end location', fontsize=20)
    plt.show()
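If only the share of round trips is needed, the comparison can be reduced to a boolean mask. A minimal sketch on made-up station names:

```python
import pandas as pd

# Toy trips; station names are placeholders.
toy = pd.DataFrame({
    'start station name': ['A', 'B', 'C', 'A'],
    'end station name': ['A', 'C', 'C', 'B'],
})

# True where a trip starts and ends at the same station.
round_trip = toy['start station name'] == toy['end station name']

# The mean of a boolean mask is the fraction of True values.
round_trip_pct = round_trip.mean() * 100
print(round_trip.sum(), round_trip_pct)
```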

Riding pattern of the month

This is the part where I spent the most time and effort. The graph below says a lot, and technically there is a lot of code behind it. Before looking at the code, here is an overview of what we are doing: we plot a time series to see the trend in the number of rides taken per day and the trend in the total time the bikes were in use per day. Let’s look at the code first, then I will break it down for you.

    # converting string to datetime object
    df['starttime'] = pd.to_datetime(df['starttime'])

    # since we are dealing with a single month, we group by day,
    # using count aggregation to get the number of occurrences, i.e. total trips per day
    start_time_count = df.set_index('starttime').groupby(pd.Grouper(freq='D')).count()

    # there is a single day of July data in the last row; drop it
    start_time_count.drop(start_time_count.tail(1).index, axis=0, inplace=True)

    # grouping by day again and aggregating with sum to get total trip duration per day,
    # which will be used while plotting (only tripduration is summed, because summing
    # the non-numeric columns is not meaningful and can fail on newer pandas versions)
    trip_duration_count = df.set_index('starttime')[['tripduration']].groupby(pd.Grouper(freq='D')).sum()

    # again dropping the last row for the same reason
    trip_duration_count.drop(trip_duration_count.tail(1).index, axis=0, inplace=True)

    # plotting total rides per day,
    # using start station id to get the count
    fig, ax = plt.subplots(figsize=(25, 10))
    ax.bar(start_time_count.index, 'start station id', data=start_time_count, label='Total riders')

    # bbox_to_anchor positions the legend box
    ax.legend(loc="lower left", bbox_to_anchor=(0.01, 0.89), fontsize=20)
    ax.set_xlabel('Days of the month June 2013', fontsize=30)
    ax.set_ylabel('Riders', fontsize=40)
    ax.set_title('Bikers trend for the month June', fontsize=50)

    # creating a twin x axis to plot the line chart in the same figure
    ax2 = ax.twinx()

    # plotting total trip duration of all users per day
    ax2.plot('tripduration', data=trip_duration_count, color='y', label='Total trip duration',
             marker='o', linewidth=5, markersize=12)
    ax2.set_ylabel('Time duration', fontsize=40)
    ax2.legend(loc="upper left", bbox_to_anchor=(0.01, 0.9), fontsize=20)

    ax.set_xticks(trip_duration_count.index)
    ax.set_xticklabels([i for i in range(1, 31)])

    # tweaking x and y tick labels of axes 1
    ax.tick_params(labelsize=30, labelcolor='#eb4034')

    # tweaking x and y tick labels of axes 2
    ax2.tick_params(labelsize=30, labelcolor='#eb4034')
    plt.show()

You might have understood the basic idea from the comments, but let me explain the process step by step:


  1. The date-time values are stored as strings, so we convert them into datetime objects.
  2. Group the data by day of the month and count the number of occurrences to get rides per day.
  3. We have only one row with information for the month of July. This is an outlier, so we drop it.
  4. Repeat steps 2 and 3, but this time sum the data instead of counting it to get the total trip duration per day.
  5. Plot both series on a single graph using the twin axis method.
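Steps 1 to 4 above can be sketched compactly with `resample`, which is a swapped-in alternative to grouping with `pd.Grouper`. The toy timestamps and durations are made up, the column names `rides` and `total_minutes` are my own, and the July row is removed by selecting June rather than by dropping the last row.

```python
import pandas as pd

# Toy trips: three in June and one stray trip on 1 July (values are made up).
toy = pd.DataFrame({
    'starttime': ['2013-06-01 00:00:01', '2013-06-01 09:15:00',
                  '2013-06-02 18:30:00', '2013-07-01 00:00:02'],
    'tripduration': [10.0, 20.0, 5.0, 7.0],
})

# Step 1: parse the strings into datetime objects.
toy['starttime'] = pd.to_datetime(toy['starttime'])

# Steps 2 and 4: one pass that both counts and sums trips per calendar day.
daily = (toy.set_index('starttime')['tripduration']
            .resample('D')
            .agg(['count', 'sum'])
            .rename(columns={'count': 'rides', 'sum': 'total_minutes'}))

# Step 3: keep only June instead of dropping the last row by position.
june = daily.loc['2013-06']
print(june.head(3))
```

Selecting by month does not depend on the stray row being last, which is a little more robust if the data is ever reordered.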

I have used a lot of Matplotlib tweaks here, so make sure to go through each of them. If you have any doubts, leave a comment on the Kaggle notebook linked at the end of this article.

The number of riders increases considerably toward the end of the month. There are very few riders on the first Sunday of the month. The total time riders spend on the bikes decreases toward the end of the month.

Top 10 start stations

This is pretty straightforward: we count the occurrences of each start station using value_counts(), slice the first 10 values, and plot them.


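    # top 10 start stations
    # (these opening lines are assumed, mirroring the end-station code;
    # the bar color is left at the default)
    top_start_station = df['start station name'].value_counts()[:10]
    fig, ax = plt.subplots(figsize=(20, 8))
    ax.bar(x=top_start_station.index, height=top_start_station.values, width=0.5)

    # adding value above each bar: annotation
    for p in ax.patches:
        an = ax.annotate(format(p.get_height(), '.2f'),
                         (p.get_x() + p.get_width() / 2., p.get_height()),
                         ha='center',
                         va='center',
                         xytext=(0, 10),
                         textcoords='offset points')
        an.set_size(20)

    ax.set_title("Top 10 start locations in NY", fontsize=30)
    ax.set_xlabel("Station name", fontsize=20)

    # rotating the x tick labels by 45 degrees
    ax.set_xticklabels(top_start_station.index, rotation=45, ha="right")
    ax.set_ylabel("Count", fontsize=20)

    # tweaking x and y tick labels
    ax.tick_params(labelsize=15)
    plt.show()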

Top 10 end stations

    # top 10 end stations
    top_end_station = df['end station name'].value_counts()[:10]
    fig, ax = plt.subplots(figsize=(20, 8))
    ax.bar(x=top_end_station.index, height=top_end_station.values, color='#edde68', width=0.5)

    # adding value above each bar: annotation
    for p in ax.patches:
        an = ax.annotate(format(p.get_height(), '.2f'),
                         (p.get_x() + p.get_width() / 2., p.get_height()),
                         ha='center',
                         va='center',
                         xytext=(0, 10),
                         textcoords='offset points')
        an.set_size(20)

    ax.set_title("Top 10 end locations in NY", fontsize=30)
    ax.set_xlabel("Station name", fontsize=20)

    # rotating the x tick labels by 45 degrees
    ax.set_xticklabels(top_end_station.index, rotation=45, ha="right")
    ax.set_ylabel("Count", fontsize=20)

    # tweaking x and y tick labels
    ax.tick_params(labelsize=15)
    plt.show()
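The start-station and end-station charts share the same counting step, so it can be factored into a small helper. The function name `top_stations` is hypothetical (it does not appear in the original notebook), and the toy station names are placeholders.

```python
import pandas as pd

def top_stations(trips, column, n=10):
    """Return the n most frequent values of `column`, most frequent first."""
    # value_counts() sorts by frequency in descending order by default.
    return trips[column].value_counts().head(n)

# Toy trips; station names are placeholders.
toy = pd.DataFrame({
    'start station name': ['A', 'B', 'A', 'C', 'A', 'B'],
    'end station name': ['B', 'B', 'C', 'A', 'B', 'C'],
})

top_start = top_stations(toy, 'start station name', n=2)
top_end = top_stations(toy, 'end station name', n=2)
print(top_start)
print(top_end)
```

The returned Series can be passed to `ax.bar` the same way as `top_end_station` above, using its index for the x positions and its values for the heights.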

The full analysis is in my Kaggle notebook. Feel free to leave your queries in its comment section.

Translated from: https://medium.com/towards-artificial-intelligence/analyzing-citibike-data-eda-e657409f007a



