Analysis and modeling of customer trip duration, based on New York City taxi operating data.

import os
import pandas as pd
import numpy as np
from matplotlib.patches import Arrow  # arrow patches for the traffic-flow animation below
import matplotlib.pyplot as plt
from matplotlib import animation
from matplotlib import cm
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from dateutil import parser
import io
import base64
df = pd.read_csv('train.csv')
df.head()
FileNotFoundError: File b'train.csv' does not exist

First, drop trips that start or end too far out of town.

xlim = [-74.03, -73.77]
ylim = [40.63, 40.85]
df = df[(df.pickup_longitude > xlim[0]) & (df.pickup_longitude < xlim[1])]
df = df[(df.dropoff_longitude > xlim[0]) & (df.dropoff_longitude < xlim[1])]
df = df[(df.pickup_latitude > ylim[0]) & (df.pickup_latitude < ylim[1])]
df = df[(df.dropoff_latitude > ylim[0]) & (df.dropoff_latitude < ylim[1])]

Where pickups and dropoffs concentrate

longitude = list(df.pickup_longitude) + list(df.dropoff_longitude)
latitude = list(df.pickup_latitude) + list(df.dropoff_latitude)
plt.figure(figsize = (10,10))
plt.plot(longitude,latitude,'.', alpha = 0.4, markersize = 0.05)
plt.show()

[Figure output_6_0.png: scatter of pickup and dropoff locations]

Partition the city by pickup/dropoff location; let's try clustering.

loc_df = pd.DataFrame()
loc_df['longitude'] = longitude
loc_df['latitude'] = latitude
kmeans = KMeans(n_clusters=15, random_state=2, n_init = 10).fit(loc_df)
loc_df['label'] = kmeans.labels_
loc_df = loc_df.sample(200000)
plt.figure(figsize = (10,10))
for label in loc_df.label.unique():
    plt.plot(loc_df.longitude[loc_df.label == label], loc_df.latitude[loc_df.label == label],
             '.', alpha=0.3, markersize=0.3)
plt.title('Clusters of New York')
plt.show()

[Figure output_9_0.png: KMeans clusters of pickup/dropoff locations]

Now mark each region's center.

fig,ax = plt.subplots(figsize = (10,10))
for label in loc_df.label.unique():
    ax.plot(loc_df.longitude[loc_df.label == label], loc_df.latitude[loc_df.label == label],
            '.', alpha=0.4, markersize=0.1, color='gray')
    ax.plot(kmeans.cluster_centers_[label, 0], kmeans.cluster_centers_[label, 1], 'o', color='r')
    ax.annotate(label, (kmeans.cluster_centers_[label, 0], kmeans.cluster_centers_[label, 1]),
                color='b', fontsize=20)
ax.set_title('Cluster Centers')
plt.show()

[Figure output_11_0.png: cluster centers]

df['pickup_cluster'] = kmeans.predict(df[['pickup_longitude','pickup_latitude']])
df['dropoff_cluster'] = kmeans.predict(df[['dropoff_longitude','dropoff_latitude']])
df['pickup_hour'] = df.pickup_datetime.apply(lambda x: parser.parse(x).hour )
clusters = pd.DataFrame()
clusters['x'] = kmeans.cluster_centers_[:,0]
clusters['y'] = kmeans.cluster_centers_[:,1]
clusters['label'] = range(len(clusters))
loc_df = loc_df.sample(5000)

The animation below shows flow direction between clusters; each arrow's width is proportional to traffic volume.

fig, ax = plt.subplots(1, 1, figsize=(10, 10))

def animate(hour):
    ax.clear()
    ax.set_title('Relative Traffic - Hour ' + str(int(hour)) + ':00')
    for label in loc_df.label.unique():
        ax.plot(loc_df.longitude[loc_df.label == label], loc_df.latitude[loc_df.label == label],
                '.', alpha=1, markersize=2, color='gray')
        ax.plot(kmeans.cluster_centers_[label, 0], kmeans.cluster_centers_[label, 1], 'o', color='r')
    # one arrow per (origin, destination) cluster pair, scaled by that pair's share of rides
    for label in clusters.label:
        for dest_label in clusters.label:
            num_of_rides = len(df[(df.pickup_cluster == label) & (df.dropoff_cluster == dest_label)
                                  & (df.pickup_hour == hour)])
            dist_x = clusters.x[clusters.label == label].values[0] - clusters.x[clusters.label == dest_label].values[0]
            dist_y = clusters.y[clusters.label == label].values[0] - clusters.y[clusters.label == dest_label].values[0]
            pct = np.true_divide(num_of_rides, len(df[df.pickup_hour == hour]))
            arr = Arrow(clusters.x[clusters.label == label].values, clusters.y[clusters.label == label].values,
                        -dist_x, -dist_y, edgecolor='white', width=pct)
            ax.add_patch(arr)
            arr.set_facecolor('g')

ani = animation.FuncAnimation(fig, animate, sorted(df.pickup_hour.unique()), interval=1000)
plt.close()
ani.save('animation2.html', writer='imagemagick', fps=2)
e:\ProgramData\Anaconda3\lib\site-packages\matplotlib\animation.py:1218: UserWarning: MovieWriter imagemagick unavailable
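The save failed because ImageMagick is not installed on this machine. One workaround (an assumption: a matplotlib version recent enough to ship PillowWriter) is to write a GIF through Pillow instead:

from matplotlib.animation import PillowWriter

# Pillow comes with Anaconda, so no external ImageMagick dependency is needed
ani.save('animation2.gif', writer=PillowWriter(fps=2))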

Neighborhood analysis

neighborhood = {-74.0019368351: 'Chelsea',
                -73.837549761: 'Queens',
                -73.7854240738: 'JFK',
                -73.9810421975: 'Midtown-North-West',
                -73.9862336241: 'East Village',
                -73.971273324: 'Midtown-North-East',
                -73.9866739677: 'Brooklyn-parkslope',
                -73.8690098118: 'LaGuardia',
                -73.9890572967: 'Midtown',
                -74.0081765545: 'Downtown',
                -73.9213024854: 'Queens-Astoria',
                -73.9470256923: 'Harlem',
                -73.9555565018: 'Uppe East Side',
                -73.9453487097: 'Brooklyn-Williamsburgt',
                -73.9745967889: 'Upper West Side'}
rides_df = pd.DataFrame(columns=list(neighborhood.values()))
rides_df['name'] = list(neighborhood.values())
neigh = KNeighborsClassifier(n_neighbors=1)
neigh.fit(np.array(list(neighborhood.keys())).reshape(-1, 1), list(neighborhood.values()))
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',metric_params=None, n_jobs=1, n_neighbors=1, p=2,weights='uniform')
df['pickup_neighborhood'] = neigh.predict(df.pickup_longitude.values.reshape(-1, 1))
df['dropoff_neighborhood'] = neigh.predict(df.dropoff_longitude.values.reshape(-1, 1))
for col in rides_df.columns[:-1]:
    rides_df[col] = rides_df.name.apply(
        lambda x: len(df[(df.pickup_neighborhood == x) & (df.dropoff_neighborhood == col)]))
rides_df.head()
Chelsea Queens JFK Midtown-North-West East Village Midtown-North-East Brooklyn-parkslope LaGuardia Midtown Downtown Queens-Astoria Harlem Uppe East Side Brooklyn-Williamsburgt Upper West Side name
0 28526 228 950 18989 10622 7657 6141 1497 35963 22084 2119 2209 7998 2317 8742 Chelsea
1 27 375 93 55 37 20 5 120 33 24 43 15 43 15 41 Queens
2 1887 1221 2779 3578 2116 2351 743 1463 3207 1576 1749 993 2847 1244 2208 JFK
3 17496 416 2183 30833 13214 27005 6747 4206 35307 10196 2940 4654 22343 3898 23537 Midtown-North-West
4 10616 186 1168 13532 5619 8030 2622 2073 16225 5980 1625 1793 7099 1704 9138 East Village
import plotly.plotly
import plotly.offline as py
import plotly.graph_objs as go
py.init_notebook_mode(connected=True)
trace = go.Heatmap(z=rides_df.values,
                   x=rides_df.columns[:-1],
                   y=rides_df.columns)
layout = dict(title='<b>Neighborhoods Interaction</b>',
              titlefont=dict(size=30, color='rgb(100,100,100)'),
              margin=dict(t=100, r=100, b=100, l=150),
              yaxis=dict(title='<b> From </b>'),
              xaxis=dict(title='<b> To </b>'))
data = [trace]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='labelled-heatmap')
ModuleNotFoundError: No module named 'plotly'
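The module simply is not installed in this environment; inside the notebook the quickest fix (assuming pip is available on the path) is to install it and restart the kernel:

!pip install plotly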

Inbound vs. outbound analysis

fig,ax = plt.subplots(figsize = (12,12))
for i in range(len(rides_df)):
    ax.plot(rides_df.sum(axis=1)[i], rides_df.sum(axis=0)[i], 'o', color='b')
    ax.annotate(rides_df.index.tolist()[i], (rides_df.sum(axis=1)[i], rides_df.sum(axis=0)[i]),
                color='b', fontsize=12)
ax.plot([0, 250000], [0, 250000], color='r', linewidth=1)
ax.grid(False)
ax.set_xlim([0,250000])
ax.set_ylim([0,250000])
ax.set_xlabel('Outbound Taxis')
ax.set_ylabel('Inbound Taxis')
ax.set_title('Inbound and Outbound rides for each cluster')
plt.show()

[Figure output_23_0.png: inbound vs. outbound rides per neighborhood]

Each area's inbound and outbound flows are roughly balanced.

import pandas as pd  #pandas for using dataframe and reading csv
import numpy as np   #numpy for vector operations and basic maths
#import simplejson    #getting JSON in simplified format
import urllib        #for url stuff
#import gmaps       #for using google maps to visualize places on maps
import re            #for processing regular expressions
import datetime      #for datetime operations
import calendar      #for calendar for datetime operations
import time          #to get the system time
import scipy         #for other dependencies
from sklearn.cluster import KMeans # for doing K-means clustering
from haversine import haversine # for calculating haversine distance
import math          #for basic maths operations
import seaborn as sns #for making plots
import matplotlib.pyplot as plt # for plotting
import os  # for os commands
from scipy.misc import imread, imresize, imsave  # for plots
import plotly.plotly as py
import plotly.graph_objs as go
import plotly
from bokeh.palettes import Spectral4
from bokeh.plotting import figure, output_notebook, show
from IPython.display import HTML
from matplotlib.pyplot import *
from matplotlib import cm
from matplotlib import animation
import io
import base64
import warnings
warnings.filterwarnings("ignore")
output_notebook()
plotly.offline.init_notebook_mode() # run at the start of every ipython notebook

Data loading and feature selection

s = time.time()
train_fr_1 = pd.read_csv('./data/fastest_routes_train_part_1.csv')
train_fr_2 = pd.read_csv('./data/fastest_routes_train_part_2.csv')
train_fr = pd.concat([train_fr_1, train_fr_2])
train_fr_new = train_fr[['id', 'total_distance', 'total_travel_time', 'number_of_steps']]
train_df = pd.read_csv('./data/train.csv')
train = pd.merge(train_df, train_fr_new, on = 'id', how = 'left')
train_df = train.copy()
end = time.time()
print("Time taken by above cell is {}.".format((end-s)))
train_df.head()
Time taken by above cell is 14.2900869846344.
id vendor_id pickup_datetime dropoff_datetime passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag trip_duration total_distance total_travel_time number_of_steps
0 id2875421 2 2016-03-14 17:24:55 2016-03-14 17:32:30 1 -73.982155 40.767937 -73.964630 40.765602 N 455 2009.1 164.9 5.0
1 id2377394 1 2016-06-12 00:43:35 2016-06-12 00:54:38 1 -73.980415 40.738564 -73.999481 40.731152 N 663 2513.2 332.0 6.0
2 id3858529 2 2016-01-19 11:35:24 2016-01-19 12:10:48 1 -73.979027 40.763939 -74.005333 40.710087 N 2124 11060.8 767.6 16.0
3 id3504673 2 2016-04-06 19:32:31 2016-04-06 19:39:40 1 -74.010040 40.719971 -74.012268 40.706718 N 429 1779.4 235.8 4.0
4 id2181028 2 2016-03-26 13:30:55 2016-03-26 13:38:10 1 -73.973053 40.793209 -73.972923 40.782520 N 435 1614.9 140.1 5.0

Data checks

# checking if Ids are unique,
train_data = train_df.copy()
print("Number of columns and rows and columns are {} and {} respectively.".format(train_data.shape[1], train_data.shape[0]))
if train_data.id.nunique() == train_data.shape[0]:print("Train ids are unique")
print("Number of Nulls - {}.".format(train_data.isnull().sum().sum()))
Number of columns and rows are 14 and 1458644 respectively.
Train ids are unique
Number of Nulls - 3.
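A quick follow-up check (not in the original) to see which columns hold those three nulls; since the OSRM features were attached with a left join on id, they are the likely culprits:

print(train_data.isnull().sum().sort_values(ascending=False).head())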

Distribution of log(trip_duration)

%matplotlib inline
start = time.time()
sns.set(style="white", palette="muted", color_codes=True)
f, axes = plt.subplots(1, 1, figsize=(11, 7), sharex=True)
sns.despine(left=True)
sns.distplot(np.log(train_df['trip_duration'].values+1), axlabel = 'Log(trip_duration)', label = 'log(trip_duration)', bins = 50, color="r")
plt.setp(axes, yticks=[])
plt.tight_layout()
end = time.time()
print("Time taken by above cell is {}.".format((end-start)))
plt.show()
Time taken by above cell is 0.2782478332519531.

[Figure output_31_1.png: histogram of log(trip_duration)]

Roughly log-normal, with the bulk of log(trip_duration) between about 4 and 8. A few trips are absurdly long... and a few were suspiciously fast.

print('Most trips take between (minutes):', np.exp(4)/60, np.exp(8)/60)
print('The extremes (minutes):', np.exp(2)/60, np.exp(12)/60)
Most trips take between (minutes): 0.909969167219 49.6826331174
The extremes (minutes): 0.123150934982 2712.57985698

Where the data's locations fall

start = time.time()
sns.set(style="white", palette="muted", color_codes=True)
f, axes = plt.subplots(2,2,figsize=(10, 10), sharex=False, sharey = False)
sns.despine(left=True)
sns.distplot(train_df['pickup_latitude'].values, label = 'pickup_latitude',color="m",bins = 100, ax=axes[0,0])
sns.distplot(train_df['pickup_longitude'].values, label = 'pickup_longitude',color="m",bins =100, ax=axes[0,1])
sns.distplot(train_df['dropoff_latitude'].values, label = 'dropoff_latitude',color="m",bins =100, ax=axes[1, 0])
sns.distplot(train_df['dropoff_longitude'].values, label = 'dropoff_longitude',color="m",bins =100, ax=axes[1, 1])
plt.setp(axes, yticks=[])
plt.tight_layout()
end = time.time()
print("Time taken by above cell is {}.".format((end-start)))
plt.show()
Time taken by above cell is 1.2390995025634766.

[Figure output_35_1.png: distributions of pickup/dropoff latitude and longitude]

Some locations look far too remote; whether outliers or recording errors, drop the extreme ones:

latitude restricted to 40.6 to 40.9
longitude restricted to -74.05 to -73.70

start = time.time()
df = train_df.loc[(train_df.pickup_latitude > 40.6) & (train_df.pickup_latitude < 40.9)]
df = df.loc[(df.dropoff_latitude>40.6) & (df.dropoff_latitude < 40.9)]
df = df.loc[(df.dropoff_longitude > -74.05) & (df.dropoff_longitude < -73.7)]
df = df.loc[(df.pickup_longitude > -74.05) & (df.pickup_longitude < -73.7)]
train_data_new = df.copy()
sns.set(style="white", palette="muted", color_codes=True)
f, axes = plt.subplots(2, 2, figsize=(12, 12), sharex=False, sharey=False)
sns.despine(left=True)
sns.distplot(train_data_new['pickup_latitude'].values, label = 'pickup_latitude',color="m",bins = 100, ax=axes[0,0])
sns.distplot(train_data_new['pickup_longitude'].values, label = 'pickup_longitude',color="g",bins =100, ax=axes[0,1])
sns.distplot(train_data_new['dropoff_latitude'].values, label = 'dropoff_latitude',color="m",bins =100, ax=axes[1, 0])
sns.distplot(train_data_new['dropoff_longitude'].values, label = 'dropoff_longitude',color="g",bins =100, ax=axes[1, 1])
plt.setp(axes, yticks=[])
plt.tight_layout()
end = time.time()
print("Time taken by above cell is {}.".format((end-start)))
print(df.shape[0], train_data.shape[0])
plt.show()
Time taken by above cell is 1.8928685188293457.
1452385 1458644

[Figure output_38_1.png: location distributions after the bounding-box filter]

Render the pickups on a black background

temp = train_data.copy()
start = time.time()
rgb = np.zeros((3000, 3500, 3), dtype=np.uint8)
rgb[..., 0] = 0
rgb[..., 1] = 0
rgb[..., 2] = 0
train_data_new['pick_lat_new'] = list(map(int, (train_data_new['pickup_latitude'] - (40.6000))*10000))
train_data_new['drop_lat_new'] = list(map(int, (train_data_new['dropoff_latitude'] - (40.6000))*10000))
train_data_new['pick_lon_new'] = list(map(int, (train_data_new['pickup_longitude'] - (-74.050))*10000))
train_data_new['drop_lon_new'] = list(map(int, (train_data_new['dropoff_longitude'] - (-74.050))*10000))
train_data_new.head()
id vendor_id pickup_datetime dropoff_datetime passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag trip_duration total_distance total_travel_time number_of_steps pick_lat_new drop_lat_new pick_lon_new drop_lon_new
0 id2875421 2 2016-03-14 17:24:55 2016-03-14 17:32:30 1 -73.982155 40.767937 -73.964630 40.765602 N 455 2009.1 164.9 5.0 1679 1656 678 853
1 id2377394 1 2016-06-12 00:43:35 2016-06-12 00:54:38 1 -73.980415 40.738564 -73.999481 40.731152 N 663 2513.2 332.0 6.0 1385 1311 695 505
2 id3858529 2 2016-01-19 11:35:24 2016-01-19 12:10:48 1 -73.979027 40.763939 -74.005333 40.710087 N 2124 11060.8 767.6 16.0 1639 1100 709 446
3 id3504673 2 2016-04-06 19:32:31 2016-04-06 19:39:40 1 -74.010040 40.719971 -74.012268 40.706718 N 429 1779.4 235.8 4.0 1199 1067 399 377
4 id2181028 2 2016-03-26 13:30:55 2016-03-26 13:38:10 1 -73.973053 40.793209 -73.972923 40.782520 N 435 1614.9 140.1 5.0 1932 1825 769 770
summary_plot = pd.DataFrame(train_data_new.groupby(['pick_lat_new', 'pick_lon_new'])['id'].count())
summary_plot.reset_index(inplace=True)
summary_plot.head()
pick_lat_new pick_lon_new id
0 2 544 1
1 6 840 1
2 8 454 1
3 9 706 1
4 17 1030 1
lat_list = summary_plot['pick_lat_new'].unique()
for i in lat_list:
    lon_list = summary_plot.loc[summary_plot['pick_lat_new'] == i]['pick_lon_new'].tolist()
    unit = summary_plot.loc[summary_plot['pick_lat_new'] == i]['id'].tolist()
    for j in lon_list:
        a = unit[lon_list.index(j)]   # pickup count at this (lat, lon) pixel
        if (a // 50) > 0:             # 50+ pickups
            rgb[i, j, 0] = 255
            rgb[i, j, 1] = 0
            rgb[i, j, 2] = 255
        elif (a // 10) > 0:           # 10-49 pickups
            rgb[i, j, 0] = 0
            rgb[i, j, 1] = 255
            rgb[i, j, 2] = 0
        else:                         # 1-9 pickups
            rgb[i, j, 0] = 255
            rgb[i, j, 1] = 0
            rgb[i, j, 2] = 0
fig, ax = plt.subplots(nrows=1,ncols=1,figsize=(14,20))
end = time.time()
print("Time taken by above cell is {}.".format((end-start)))
ax.imshow(rgb, cmap = 'hot')
ax.set_axis_off()
Time taken by above cell is 4.935481071472168.

[Figure output_42_1.png: pickup-density map on black background]

  • Red dots: points that are the pickup location for 1-10 trips in the data
  • Green dots: points that are the pickup location for 10-50 trips
  • Yellow dots: points that are the pickup location for more than 50 trips
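The nested loop above is slow. A vectorized sketch with NumPy fancy indexing produces the same image (same thresholds; note the code actually paints the 50+ bucket magenta, not yellow):

lat = summary_plot['pick_lat_new'].values
lon = summary_plot['pick_lon_new'].values
cnt = summary_plot['id'].values

rgb[lat, lon] = [255, 0, 0]                  # 1-9 pickups: red
mid = (cnt >= 10) & (cnt < 50)
rgb[lat[mid], lon[mid]] = [0, 255, 0]        # 10-49 pickups: green
hot = cnt >= 50
rgb[lat[hot], lon[hot]] = [255, 0, 255]      # 50+ pickups: magenta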

Feature engineering:
pick out the factors that influence trip duration

# geodesic distance features
start = time.time()

def haversine_(lat1, lng1, lat2, lng2):
    """Haversine (great-circle) distance between two coordinates, in km."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    AVG_EARTH_RADIUS = 6371  # in km
    lat = lat2 - lat1
    lng = lng2 - lng1
    d = np.sin(lat * 0.5) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(lng * 0.5) ** 2
    h = 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))
    return h

def manhattan_distance_pd(lat1, lng1, lat2, lng2):
    """Manhattan distance between pickup and dropoff: sum of the two haversine legs."""
    a = haversine_(lat1, lng1, lat1, lng2)
    b = haversine_(lat1, lng1, lat2, lng1)
    return a + b

def bearing_array(lat1, lng1, lat2, lng2):
    """Initial compass bearing from point 1 to point 2, in degrees."""
    lng_delta_rad = np.radians(lng2 - lng1)
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    y = np.sin(lng_delta_rad) * np.cos(lat2)
    x = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(lng_delta_rad)
    return np.degrees(np.arctan2(y, x))

end = time.time()
print("Time taken by above cell is {}.".format((end-start)))
Time taken by above cell is 0.0.
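A quick sanity check of the three helpers on a single pair of points (illustrative coordinates for Times Square and JFK, not taken from the dataset):

# Times Square (40.7580, -73.9855) to JFK (40.6413, -73.7781)
print(haversine_(40.7580, -73.9855, 40.6413, -73.7781))            # ~21.8 km great-circle
print(manhattan_distance_pd(40.7580, -73.9855, 40.6413, -73.7781)) # ~30.4 km, always >= haversine
print(bearing_array(40.7580, -73.9855, 40.6413, -73.7781))         # ~127 degrees, i.e. southeast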
start = time.time()
train_data = temp.copy()
train_data['pickup_datetime'] = pd.to_datetime(train_data.pickup_datetime)
train_data.loc[:, 'pick_month'] = train_data['pickup_datetime'].dt.month
train_data.loc[:, 'hour'] = train_data['pickup_datetime'].dt.hour
train_data.loc[:, 'week_of_year'] = train_data['pickup_datetime'].dt.weekofyear
train_data.loc[:, 'day_of_year'] = train_data['pickup_datetime'].dt.dayofyear
train_data.loc[:, 'day_of_week'] = train_data['pickup_datetime'].dt.dayofweek
train_data.loc[:,'hvsine_pick_drop'] = haversine_(train_data['pickup_latitude'].values, train_data['pickup_longitude'].values, train_data['dropoff_latitude'].values, train_data['dropoff_longitude'].values)
train_data.loc[:,'manhtn_pick_drop'] = manhattan_distance_pd(train_data['pickup_latitude'].values, train_data['pickup_longitude'].values, train_data['dropoff_latitude'].values, train_data['dropoff_longitude'].values)
train_data.loc[:,'bearing'] = bearing_array(train_data['pickup_latitude'].values, train_data['pickup_longitude'].values, train_data['dropoff_latitude'].values, train_data['dropoff_longitude'].values)
end = time.time()
print("Time taken by above cell is {}.".format(end-start))
Time taken by above cell is 2.7582061290740967.
start = time.time()
def color(hour):
    """Shift the palette with the hour for the animation."""
    return 10 * hour

def Animation(hour, temp, rgb):
    """Render one hour's pickup density as an RGB image."""
    train_data_new = temp.loc[temp['hour'] == hour]
    start = time.time()
    rgb = np.zeros((3000, 3500, 3), dtype=np.uint8)
    train_data_new['pick_lat_new'] = list(map(int, (train_data_new['pickup_latitude'] - (40.6000))*10000))
    train_data_new['drop_lat_new'] = list(map(int, (train_data_new['dropoff_latitude'] - (40.6000))*10000))
    train_data_new['pick_lon_new'] = list(map(int, (train_data_new['pickup_longitude'] - (-74.050))*10000))
    train_data_new['drop_lon_new'] = list(map(int, (train_data_new['dropoff_longitude'] - (-74.050))*10000))
    summary_plot = pd.DataFrame(train_data_new.groupby(['pick_lat_new', 'pick_lon_new'])['id'].count())
    summary_plot.reset_index(inplace=True)
    lat_list = summary_plot['pick_lat_new'].unique()
    for i in lat_list:
        lon_list = summary_plot.loc[summary_plot['pick_lat_new'] == i]['pick_lon_new'].tolist()
        unit = summary_plot.loc[summary_plot['pick_lat_new'] == i]['id'].tolist()
        for j in lon_list:
            a = unit[lon_list.index(j)]
            if (a // 50) > 0:
                rgb[i, j, 0] = 255 - color(hour)
                rgb[i, j, 1] = 255 - color(hour)
                rgb[i, j, 2] = 0 + color(hour)
            elif (a // 10) > 0:
                rgb[i, j, 0] = 0 + color(hour)
                rgb[i, j, 1] = 255 - color(hour)
                rgb[i, j, 2] = 0 + color(hour)
            else:
                rgb[i, j, 0] = 255 - color(hour)
                rgb[i, j, 1] = 0 + color(hour)
                rgb[i, j, 2] = 0 + color(hour)
    end = time.time()
    print("Time taken by above cell is {} for {}.".format((end - start), hour))
    return rgb
end = time.time()
print("Time taken by above cell is {}.".format(end -start))
Time taken by above cell is 0.0.
start = time.time()
images_list=[]
train_data_new['pickup_datetime'] = pd.to_datetime(train_data_new.pickup_datetime)
train_data_new.loc[:, 'hour'] = train_data_new['pickup_datetime'].dt.hour
for i in list(range(0, 24)):
    im = Animation(i, train_data_new, rgb.copy())
    images_list.append(im)
end = time.time()
print("Time taken by above cell is {}.".format(end -start))
Time taken by above cell is 1.6389679908752441 for 0.
Time taken by above cell is 1.4990801811218262 for 1.
Time taken by above cell is 1.3021628856658936 for 2.
Time taken by above cell is 1.275221347808838 for 3.
Time taken by above cell is 1.2565860748291016 for 4.
Time taken by above cell is 1.2167680263519287 for 5.
Time taken by above cell is 1.4150054454803467 for 6.
Time taken by above cell is 1.555870771408081 for 7.
Time taken by above cell is 1.5852866172790527 for 8.
Time taken by above cell is 1.4900872707366943 for 9.
Time taken by above cell is 1.457101583480835 for 10.
Time taken by above cell is 1.4791648387908936 for 11.
Time taken by above cell is 1.4905309677124023 for 12.
Time taken by above cell is 1.471121072769165 for 13.
Time taken by above cell is 1.5651226043701172 for 14.
Time taken by above cell is 1.582446575164795 for 15.
Time taken by above cell is 1.5430455207824707 for 16.
Time taken by above cell is 1.8448097705841064 for 17.
Time taken by above cell is 1.6787598133087158 for 18.
Time taken by above cell is 1.6811716556549072 for 19.
Time taken by above cell is 1.634563684463501 for 20.
Time taken by above cell is 1.6993639469146729 for 21.
Time taken by above cell is 1.7138051986694336 for 22.
Time taken by above cell is 1.6760783195495605 for 23.
Time taken by above cell is 38.207443714141846.
start = time.time()
def build_gif(imgs=images_list, show_gif=False, save_gif=True, title=''):
    """Stitch the hourly heatmaps into an animation."""
    fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10, 10))
    ax.set_axis_off()
    hr_range = list(range(0, 24))

    def show_im(pairs):
        ax.clear()
        ax.set_title('Absolute Traffic - Hour ' + str(int(pairs[0])) + ':00')
        ax.imshow(pairs[1])
        ax.set_axis_off()

    pairs = list(zip(hr_range, imgs))
    im_ani = animation.FuncAnimation(fig, show_im, pairs, interval=500, repeat_delay=0, blit=False)
    plt.cla()
    if save_gif:
        im_ani.save('animation.html', writer='imagemagick')
    if show_gif:
        plt.show()
    return
end = time.time()
print("Time taken by above cell is {}".format(end-start))
Time taken by above cell is 0.0
start = time.time()
build_gif()
end = time.time()
print(end-start)
7.758885860443115

[Figure output_50_1.png: frame from the hourly traffic animation]

Interpreting the features

start = time.time()
summary_wdays_avg_duration = pd.DataFrame(train_data.groupby(['vendor_id','day_of_week'])['trip_duration'].mean())
summary_wdays_avg_duration.reset_index(inplace = True)
summary_wdays_avg_duration['unit']=1
sns.set(style="white", palette="muted", color_codes=True)
sns.set_context("poster")
sns.tsplot(data=summary_wdays_avg_duration, time="day_of_week", unit = "unit", condition="vendor_id", value="trip_duration")
sns.despine(bottom = False)
end = time.time()
print(end - start)
0.24365997314453125

[Figure output_52_1.png: mean trip_duration by day of week, per vendor]
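sns.tsplot was deprecated and later removed from seaborn; on a recent version an equivalent plot (a sketch, assuming seaborn >= 0.9) would be:

sns.lineplot(data=summary_wdays_avg_duration, x='day_of_week',
             y='trip_duration', hue='vendor_id')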

Clearly, vendor 1's trips take longer than vendor 2's on every day of the week, by roughly 250 seconds on average.

Violin plots

import seaborn as sns
sns.set(style="whitegrid", palette="pastel", color_codes=True)
sns.set_context("poster")
train_data2 = train_data.copy()
train_data2['trip_duration']= np.log(train_data['trip_duration'])
sns.violinplot(x="passenger_count", y="trip_duration", hue="vendor_id", data=train_data2,
               split=True, inner="quart", palette={1: "g", 2: "r"})
sns.despine(left=True)
print(df.shape[0])
1452385

[Figure output_55_1.png: violin plots of log(trip_duration) by passenger_count and vendor_id]

  • Zero-passenger rides: drivers gaming the system with fake trips? (see the quick count below)
  • The duration distributions look much the same across passenger counts
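A quick count of those zero-passenger rides (not in the original notebook):

# distribution of rides per passenger_count; the 0 bucket holds the suspicious empty rides
print(train_data.passenger_count.value_counts().sort_index())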

Box plots

start = time.time()
sns.set(style="ticks")
sns.set_context("poster")
sns.boxplot(x="day_of_week", y="trip_duration", hue="vendor_id", data=train_data, palette="PRGn")
plt.ylim(0, 6000)
plt.legend(loc = 'upper right')
sns.despine(offset=10, trim=True)
print(train_data.trip_duration.max())
end = time.time()
print("Time taken by above cell is {}.".format(end-start))
3526282
Time taken by above cell is 0.3499119281768799.

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-0uC9HxJz-1633489386755)(output_58_1.png)]

  • Weekend trips (Saturday/Sunday) are somewhat shorter

Line plots

summary_hour_duration = pd.DataFrame(train_data.groupby(['day_of_week','hour'])['trip_duration'].mean())
summary_hour_duration.reset_index(inplace = True)
summary_hour_duration['unit']=1
sns.set(style="white", palette="muted", color_codes=False)
sns.set_context("poster")
sns.tsplot(data=summary_hour_duration, time="hour", unit = "unit", condition="day_of_week", value="trip_duration")
sns.despine(bottom = False)

[Figure output_61_0.png: mean trip_duration by hour, per day of week]

  • On weekends, trips between 5:00 and 15:00 are comparatively fast

Clustering

start = time.time()
def assign_cluster(df, k):
    """Assign pickup/dropoff clusters via KMeans."""
    df_pick = df[['pickup_longitude', 'pickup_latitude']]
    df_drop = df[['dropoff_longitude', 'dropoff_latitude']]
    # I am using initialization from the output of k-means on my local machine
    # to save time in this kernel
    init = np.array([[-73.98737616, 40.72981533],
                     [-121.93328857, 37.38933945],
                     [-73.78423222, 40.64711269],
                     [-73.9546417, 40.77377538],
                     [-66.84140269, 36.64537175],
                     [-73.87040541, 40.77016484],
                     [-73.97316185, 40.75814346],
                     [-73.98861094, 40.7527791],
                     [-72.80966949, 51.88108444],
                     [-76.99779701, 38.47370625],
                     [-73.96975298, 40.69089596],
                     [-74.00816622, 40.71414939],
                     [-66.97216034, 44.37194443],
                     [-61.33552933, 37.85105133],
                     [-73.98001393, 40.7783577],
                     [-72.00626526, 43.20296402],
                     [-73.07618713, 35.03469086],
                     [-73.95759366, 40.80316361],
                     [-79.20167796, 41.04752096],
                     [-74.00106031, 40.73867723]])
    k_means_pick = KMeans(n_clusters=k, init=init, n_init=1)
    k_means_pick.fit(df_pick)
    clust_pick = k_means_pick.labels_
    df['label_pick'] = clust_pick.tolist()
    df['label_drop'] = k_means_pick.predict(df_drop)
    return df, k_means_pick

end = time.time()
print("time taken by thie script by now is {}.".format(end-start))
time taken by thie script by now is 0.0005013942718505859.
start = time.time()
train_cl, k_means = assign_cluster(train_data, 20)  # make it 100 when extracting features
centroid_pickups = pd.DataFrame(k_means.cluster_centers_, columns = ['centroid_pick_long', 'centroid_pick_lat'])
centroid_dropoff = pd.DataFrame(k_means.cluster_centers_, columns = ['centroid_drop_long', 'centroid_drop_lat'])
centroid_pickups['label_pick'] = centroid_pickups.index
centroid_dropoff['label_drop'] = centroid_dropoff.index
#centroid_pickups.head()
train_cl = pd.merge(train_cl, centroid_pickups, how='left', on=['label_pick'])
train_cl = pd.merge(train_cl, centroid_dropoff, how='left', on=['label_drop'])
#train_cl.head()
end = time.time()
print("Time taken in clustering is {}.".format(end - start))
Time taken in clustering is 2.5313637256622314.

Cluster-derived features

  • Distance from each pickup/dropoff point to its cluster centroid
  • Bearing features: the direction between cluster centroids

start = time.time()
train_cl.loc[:,'hvsine_pick_cent_p'] = haversine_(train_cl['pickup_latitude'].values, train_cl['pickup_longitude'].values, train_cl['centroid_pick_lat'].values, train_cl['centroid_pick_long'].values)
train_cl.loc[:,'hvsine_drop_cent_d'] = haversine_(train_cl['dropoff_latitude'].values, train_cl['dropoff_longitude'].values, train_cl['centroid_drop_lat'].values, train_cl['centroid_drop_long'].values)
train_cl.loc[:,'hvsine_cent_p_cent_d'] = haversine_(train_cl['centroid_pick_lat'].values, train_cl['centroid_pick_long'].values, train_cl['centroid_drop_lat'].values, train_cl['centroid_drop_long'].values)
train_cl.loc[:,'manhtn_pick_cent_p'] = manhattan_distance_pd(train_cl['pickup_latitude'].values, train_cl['pickup_longitude'].values, train_cl['centroid_pick_lat'].values, train_cl['centroid_pick_long'].values)
train_cl.loc[:,'manhtn_drop_cent_d'] = manhattan_distance_pd(train_cl['dropoff_latitude'].values, train_cl['dropoff_longitude'].values, train_cl['centroid_drop_lat'].values, train_cl['centroid_drop_long'].values)
train_cl.loc[:,'manhtn_cent_p_cent_d'] = manhattan_distance_pd(train_cl['centroid_pick_lat'].values, train_cl['centroid_pick_long'].values, train_cl['centroid_drop_lat'].values, train_cl['centroid_drop_long'].values)
train_cl.loc[:,'bearing_pick_cent_p'] = bearing_array(train_cl['pickup_latitude'].values, train_cl['pickup_longitude'].values, train_cl['centroid_pick_lat'].values, train_cl['centroid_pick_long'].values)
train_cl.loc[:,'bearing_drop_cent_p'] = bearing_array(train_cl['dropoff_latitude'].values, train_cl['dropoff_longitude'].values, train_cl['centroid_drop_lat'].values, train_cl['centroid_drop_long'].values)
train_cl.loc[:,'bearing_cent_p_cent_d'] = bearing_array(train_cl['centroid_pick_lat'].values, train_cl['centroid_pick_long'].values, train_cl['centroid_drop_lat'].values, train_cl['centroid_drop_long'].values)
train_cl['speed_hvsn'] = train_cl.hvsine_pick_drop/train_cl.total_travel_time
train_cl['speed_manhtn'] = train_cl.manhtn_pick_drop/train_cl.total_travel_time
end = time.time()
print("Time Taken by above cell is {}.".format(end-start))
train_cl.head()
Time Taken by above cell is 3.551389694213867.
id vendor_id pickup_datetime dropoff_datetime passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag ... hvsine_drop_cent_d hvsine_cent_p_cent_d manhtn_pick_cent_p manhtn_drop_cent_d manhtn_cent_p_cent_d bearing_pick_cent_p bearing_drop_cent_p bearing_cent_p_cent_d speed_hvsn speed_manhtn
0 id2875421 2 2016-03-14 17:24:55 2016-03-14 17:32:30 1 -73.982155 40.767937 -73.964630 40.765602 N ... 1.098585 2.319857 1.338601 1.549840 2.822553 8.812218 -138.980503 165.640915 0.009087 0.010524
1 id2377394 1 2016-06-12 00:43:35 2016-06-12 00:54:38 1 -73.980415 40.738564 -73.999481 40.731152 N ... 0.845448 1.520191 1.573052 0.968702 2.144236 -149.031278 -9.113659 -49.174617 0.005438 0.007321
2 id3858529 2 2016-01-19 11:35:24 2016-01-19 12:10:48 1 -73.979027 40.763939 -74.005333 40.710087 N ... 0.508922 5.718571 1.135490 0.690697 7.848844 142.642889 -28.669171 -148.907292 0.008318 0.010687
3 id3504673 2 2016-04-06 19:32:31 2016-04-06 19:39:40 1 -74.010040 40.719971 -74.012268 40.706718 N ... 0.888828 0.000000 0.805089 1.161466 0.000000 166.837718 22.515049 0.000000 0.006300 0.007046
4 id2181028 2 2016-03-26 13:30:55 2016-03-26 13:38:10 1 -73.973053 40.793209 -73.972923 40.782520 N ... 0.755820 0.000000 2.237846 1.060322 0.000000 -160.438403 -127.746230 0.000000 0.008484 0.008561

5 rows × 39 columns

Visualizing the clusters

start = time.time()
def cluster_summary(sum_df):
    """Summarize each pickup cluster: average trip time and most-visited destination cluster."""
    # agg_func = {'trip_duration':'mean','label_drop':'count','bearing':'mean','id':'count'}  # that's how you use agg with groupby
    summary_avg_time = pd.DataFrame(sum_df.groupby('label_pick')['trip_duration'].mean())
    summary_avg_time.reset_index(inplace=True)
    summary_pref_clus = pd.DataFrame(sum_df.groupby(['label_pick', 'label_drop'])['id'].count())
    summary_pref_clus = summary_pref_clus.reset_index()
    summary_pref_clus = summary_pref_clus.loc[summary_pref_clus.groupby('label_pick')['id'].idxmax()]
    summary = pd.merge(summary_avg_time, summary_pref_clus, how='left', on='label_pick')
    summary = summary.rename(columns={'trip_duration': 'avg_triptime'})
    return summary
end = time.time()
print("Time Taken by above cell is {}.".format(end-start))
Time Taken by above cell is 0.0005021095275878906.
import folium

def show_fmaps(train_data, path=1):
    """Sample 50 rides from the busiest pickup clusters and drop pickup and
    dropoff markers for each onto a folium map."""
    full_data = train_data
    summary_full_data = pd.DataFrame(full_data.groupby('label_pick')['id'].count())
    summary_full_data.reset_index(inplace=True)
    summary_full_data = summary_full_data.loc[summary_full_data['id'] > 70000]
    map_1 = folium.Map(location=[40.767937, -73.982155], zoom_start=10,
                       tiles='Stamen Toner')  # manually chosen centre
    new_df = train_data.loc[train_data['label_pick'].isin(summary_full_data.label_pick.tolist())].sample(50)
    new_df.reset_index(inplace=True, drop=True)
    for i in range(new_df.shape[0]):
        pick_long = new_df.loc[new_df.index == i]['pickup_longitude'].values[0]
        pick_lat = new_df.loc[new_df.index == i]['pickup_latitude'].values[0]
        dest_long = new_df.loc[new_df.index == i]['dropoff_longitude'].values[0]
        dest_lat = new_df.loc[new_df.index == i]['dropoff_latitude'].values[0]
        folium.Marker([pick_lat, pick_long]).add_to(map_1)
        folium.Marker([dest_lat, dest_long]).add_to(map_1)
    return map_1

The significant clusters: those with more than 70,000 records

def clusters_map(clus_data, full_data, tile='OpenStreetMap', sig=0, zoom=12, circle=0, radius_=30):
    """Plot cluster centroids on a folium map, with a popup summary per cluster."""
    map_1 = folium.Map(location=[40.767937, -73.982155], zoom_start=zoom, tiles=tile)  # 'Mapbox' 'Stamen Toner'
    summary_full_data = pd.DataFrame(full_data.groupby('label_pick')['id'].count())
    summary_full_data.reset_index(inplace=True)
    if sig == 1:
        summary_full_data = summary_full_data.loc[summary_full_data['id'] > 70000]
    sig_cluster = summary_full_data['label_pick'].tolist()
    clus_summary = cluster_summary(full_data)
    for i in sig_cluster:
        pick_long = clus_data.loc[clus_data.index == i]['centroid_pick_long'].values[0]
        pick_lat = clus_data.loc[clus_data.index == i]['centroid_pick_lat'].values[0]
        clus_no = clus_data.loc[clus_data.index == i]['label_pick'].values[0]
        most_visited_clus = clus_summary.loc[clus_summary['label_pick'] == i]['label_drop'].values[0]
        avg_triptime = clus_summary.loc[clus_summary['label_pick'] == i]['avg_triptime'].values[0]
        pop = ('cluster = ' + str(clus_no) + ' & most visited cluster = ' + str(most_visited_clus)
               + ' & avg triptime from this cluster = ' + str(avg_triptime))
        if circle == 1:
            folium.CircleMarker(location=[pick_lat, pick_long], radius=radius_,
                                color='#F08080', fill_color='#3186cc', popup=pop).add_to(map_1)
        folium.Marker([pick_lat, pick_long], popup=pop).add_to(map_1)
    return map_1
osm = show_fmaps(train_data, path=1)
osm
clus_map = clusters_map(centroid_pickups, train_cl, sig =0, zoom =3.2, circle =1, tile = 'Stamen Terrain')
clus_map
clus_map_sig = clusters_map(centroid_pickups, train_cl, sig =1, circle =1)
clus_map_sig

Apply the same processing to the test set

test_df = pd.read_csv('./data/test.csv')
test_fr = pd.read_csv('./data/fastest_routes_test.csv')
test_fr_new = test_fr[['id', 'total_distance', 'total_travel_time', 'number_of_steps']]
test_df = pd.merge(test_df, test_fr_new, on = 'id', how = 'left')
test_df.head()
id vendor_id pickup_datetime passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag total_distance total_travel_time number_of_steps
0 id3004672 1 2016-06-30 23:59:58 1 -73.988129 40.732029 -73.990173 40.756680 N 3795.9 424.6 4
1 id3505355 1 2016-06-30 23:59:53 1 -73.964203 40.679993 -73.959808 40.655403 N 2904.5 200.0 4
2 id1217141 1 2016-06-30 23:59:47 1 -73.997437 40.737583 -73.986160 40.729523 N 1499.5 193.2 4
3 id2150126 2 2016-06-30 23:59:41 1 -73.956070 40.771900 -73.986427 40.730469 N 7023.9 494.8 11
4 id1598245 1 2016-06-30 23:59:33 1 -73.970215 40.761475 -73.961510 40.755890 N 1108.2 103.2 4

Time features

start = time.time()
test_data = test_df.copy()
test_data['pickup_datetime'] = pd.to_datetime(test_data.pickup_datetime)
test_data.loc[:, 'pick_month'] = test_data['pickup_datetime'].dt.month
test_data.loc[:, 'hour'] = test_data['pickup_datetime'].dt.hour
test_data.loc[:, 'week_of_year'] = test_data['pickup_datetime'].dt.weekofyear
test_data.loc[:, 'day_of_year'] = test_data['pickup_datetime'].dt.dayofyear
test_data.loc[:, 'day_of_week'] = test_data['pickup_datetime'].dt.dayofweek
end = time.time()
print("Time taken by above cell is {}.".format(end-start))
Time taken by above cell is 0.8934004306793213.

Distance features

start = time.time()
test_data.loc[:,'hvsine_pick_drop'] = haversine_(test_data['pickup_latitude'].values, test_data['pickup_longitude'].values, test_data['dropoff_latitude'].values, test_data['dropoff_longitude'].values)
test_data.loc[:,'manhtn_pick_drop'] = manhattan_distance_pd(test_data['pickup_latitude'].values, test_data['pickup_longitude'].values, test_data['dropoff_latitude'].values, test_data['dropoff_longitude'].values)
test_data.loc[:,'bearing'] = bearing_array(test_data['pickup_latitude'].values, test_data['pickup_longitude'].values, test_data['dropoff_latitude'].values, test_data['dropoff_longitude'].values)
end = time.time()
print("Time taken by above cell is {}.".format(end-strat))
Time taken by above cell is 0.3820157051086426.

Cluster features

start = time.time()
test_data['label_pick'] = k_means.predict(test_data[['pickup_longitude','pickup_latitude']])
test_data['label_drop'] = k_means.predict(test_data[['dropoff_longitude','dropoff_latitude']])
test_cl = pd.merge(test_data, centroid_pickups, how='left', on=['label_pick'])
test_cl = pd.merge(test_cl, centroid_dropoff, how='left', on=['label_drop'])
#test_cl.head()
end = time.time()
print("Time Taken by above cell is {}.".format(end-start))
Time Taken by above cell is 0.714956521987915.
start = time.time()
test_cl.loc[:,'hvsine_pick_cent_p'] = haversine_(test_cl['pickup_latitude'].values, test_cl['pickup_longitude'].values, test_cl['centroid_pick_lat'].values, test_cl['centroid_pick_long'].values)
test_cl.loc[:,'hvsine_drop_cent_d'] = haversine_(test_cl['dropoff_latitude'].values, test_cl['dropoff_longitude'].values, test_cl['centroid_drop_lat'].values, test_cl['centroid_drop_long'].values)
test_cl.loc[:,'hvsine_cent_p_cent_d'] = haversine_(test_cl['centroid_pick_lat'].values, test_cl['centroid_pick_long'].values, test_cl['centroid_drop_lat'].values, test_cl['centroid_drop_long'].values)
test_cl.loc[:,'manhtn_pick_cent_p'] = manhattan_distance_pd(test_cl['pickup_latitude'].values, test_cl['pickup_longitude'].values, test_cl['centroid_pick_lat'].values, test_cl['centroid_pick_long'].values)
test_cl.loc[:,'manhtn_drop_cent_d'] = manhattan_distance_pd(test_cl['dropoff_latitude'].values, test_cl['dropoff_longitude'].values, test_cl['centroid_drop_lat'].values, test_cl['centroid_drop_long'].values)
test_cl.loc[:,'manhtn_cent_p_cent_d'] = manhattan_distance_pd(test_cl['centroid_pick_lat'].values, test_cl['centroid_pick_long'].values, test_cl['centroid_drop_lat'].values, test_cl['centroid_drop_long'].values)
test_cl.loc[:,'bearing_pick_cent_p'] = bearing_array(test_cl['pickup_latitude'].values, test_cl['pickup_longitude'].values, test_cl['centroid_pick_lat'].values, test_cl['centroid_pick_long'].values)
test_cl.loc[:,'bearing_drop_cent_p'] = bearing_array(test_cl['dropoff_latitude'].values, test_cl['dropoff_longitude'].values, test_cl['centroid_drop_lat'].values, test_cl['centroid_drop_long'].values)
test_cl.loc[:,'bearing_cent_p_cent_d'] = bearing_array(test_cl['centroid_pick_lat'].values, test_cl['centroid_pick_long'].values, test_cl['centroid_drop_lat'].values, test_cl['centroid_drop_long'].values)
test_cl['speed_hvsn'] = test_cl.hvsine_pick_drop/test_cl.total_travel_time
test_cl['speed_manhtn'] = test_cl.manhtn_pick_drop/test_cl.total_travel_time
end = time.time()
print("Time Taken by above cell is {}.".format(end-start))
Time Taken by above cell is 1.4610087871551514.
test_cl.head()
id vendor_id pickup_datetime passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag total_distance ... hvsine_drop_cent_d hvsine_cent_p_cent_d manhtn_pick_cent_p manhtn_drop_cent_d manhtn_cent_p_cent_d bearing_pick_cent_p bearing_drop_cent_p bearing_cent_p_cent_d speed_hvsn speed_manhtn
0 id3004672 1 2016-06-30 23:59:58 1 -73.988129 40.732029 -73.990173 40.756680 N 3795.9 ... 0.460746 2.557813 0.316489 0.572701 2.657047 166.844127 163.485813 -2.267563 0.006468 0.006861
1 id3505355 1 2016-06-30 23:59:53 1 -73.964203 40.679993 -73.959808 40.655403 N 2904.5 ... 4.035735 0.000000 1.680995 4.786109 0.000000 -21.064979 -11.983230 0.000000 0.013796 0.015524
2 id1217141 1 2016-06-30 23:59:47 1 -73.997437 40.737583 -73.986160 40.729523 N 1499.5 ... 0.108194 1.520191 0.425801 0.128054 2.144081 -68.660063 -78.185156 130.816473 0.006761 0.009557
3 id2150126 2 2016-06-30 23:59:41 1 -73.956070 40.771900 -73.986427 40.730469 N 7023.9 ... 0.117694 5.622357 0.327942 0.166444 7.657602 29.967965 -134.876766 -150.583803 0.010649 0.014477
4 id1598245 1 2016-06-30 23:59:33 1 -73.970215 40.761475 -73.961510 40.755890 N 1108.2 ... 1.015384 0.000000 0.620567 1.234923 0.000000 -145.882420 -75.681725 0.000000 0.009310 0.013122

5 rows × 37 columns

XGBoost model

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.cluster import MiniBatchKMeans
import warnings

Try adding PCA features. The PCA here is essentially a rotation of the coordinates, which can help the tree model make axis-aligned splits that line up better with Manhattan's tilted street grid.

# Lets Add PCA features in the model, reference Beluga's PCA
train = train_cl
test = test_cl
start = time.time()
coords = np.vstack((train[['pickup_latitude', 'pickup_longitude']].values,
                    train[['dropoff_latitude', 'dropoff_longitude']].values,
                    test[['pickup_latitude', 'pickup_longitude']].values,
                    test[['dropoff_latitude', 'dropoff_longitude']].values))
pca = PCA().fit(coords)
train['pickup_pca0'] = pca.transform(train[['pickup_latitude', 'pickup_longitude']])[:, 0]
train['pickup_pca1'] = pca.transform(train[['pickup_latitude', 'pickup_longitude']])[:, 1]
train['dropoff_pca0'] = pca.transform(train[['dropoff_latitude', 'dropoff_longitude']])[:, 0]
train['dropoff_pca1'] = pca.transform(train[['dropoff_latitude', 'dropoff_longitude']])[:, 1]
test['pickup_pca0'] = pca.transform(test[['pickup_latitude', 'pickup_longitude']])[:, 0]
test['pickup_pca1'] = pca.transform(test[['pickup_latitude', 'pickup_longitude']])[:, 1]
test['dropoff_pca0'] = pca.transform(test[['dropoff_latitude', 'dropoff_longitude']])[:, 0]
test['dropoff_pca1'] = pca.transform(test[['dropoff_latitude', 'dropoff_longitude']])[:, 1]
end = time.time()
print("Time Taken by above cell is {}.".format(end - start))
Time Taken by above cell is 1.553161382675171.
train['store_and_fwd_flag_int'] = np.where(train['store_and_fwd_flag']=='N', 0, 1)
test['store_and_fwd_flag_int'] = np.where(test['store_and_fwd_flag']=='N', 0, 1)
feature_names = list(train.columns)
print("Difference of features in train and test are {}".format(np.setdiff1d(train.columns, test.columns)))
print("")
do_not_use_for_training = ['pick_date','id', 'pickup_datetime', 'dropoff_datetime', 'trip_duration', 'store_and_fwd_flag']
feature_names = [f for f in train.columns if f not in do_not_use_for_training]
print("We will be using following features for training {}.".format(feature_names))
print("")
print("Total number of features are {}.".format(len(feature_names)))
Difference of features in train and test are ['dropoff_datetime' 'trip_duration']

We will be using following features for training ['vendor_id', 'passenger_count', 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'total_distance', 'total_travel_time', 'number_of_steps', 'pick_month', 'hour', 'week_of_year', 'day_of_year', 'day_of_week', 'hvsine_pick_drop', 'manhtn_pick_drop', 'bearing', 'label_pick', 'label_drop', 'centroid_pick_long', 'centroid_pick_lat', 'centroid_drop_long', 'centroid_drop_lat', 'hvsine_pick_cent_p', 'hvsine_drop_cent_d', 'hvsine_cent_p_cent_d', 'manhtn_pick_cent_p', 'manhtn_drop_cent_d', 'manhtn_cent_p_cent_d', 'bearing_pick_cent_p', 'bearing_drop_cent_p', 'bearing_cent_p_cent_d', 'speed_hvsn', 'speed_manhtn', 'pickup_pca0', 'pickup_pca1', 'dropoff_pca0', 'dropoff_pca1', 'store_and_fwd_flag_int'].

Total number of features are 39.
y = np.log(train['trip_duration'].values + 1)
start = time.time()
Xtr, Xv, ytr, yv = train_test_split(train[feature_names].values, y, test_size=0.2, random_state=1987)
dtrain = xgb.DMatrix(Xtr, label=ytr)
dvalid = xgb.DMatrix(Xv, label=yv)
dtest = xgb.DMatrix(test[feature_names].values)
watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
xgb_pars = {'min_child_weight': 50, 'eta': 0.3, 'colsample_bytree': 0.3, 'max_depth': 10,
            'subsample': 0.8, 'lambda': 1., 'nthread': -1, 'booster': 'gbtree', 'silent': 1,
            'eval_metric': 'rmse', 'objective': 'reg:linear'}
# You could try to train for more epochs
model = xgb.train(xgb_pars, dtrain, 15, watchlist, early_stopping_rounds=2,maximize=False, verbose_eval=1)
end = time.time()
print("Time taken by above cell is {}.".format(end - start))
print('Modeling RMSLE %.5f' % model.best_score)
[0]  train-rmse:4.22726  valid-rmse:4.22841
Multiple eval metrics have been passed: 'valid-rmse' will be used for early stopping.
Will train until valid-rmse hasn't improved in 2 rounds.
[1] train-rmse:2.98083  valid-rmse:2.98244
[2] train-rmse:2.11167  valid-rmse:2.11381
[3] train-rmse:1.51307  valid-rmse:1.51598
[4] train-rmse:1.10813  valid-rmse:1.11194
[5] train-rmse:0.836374 valid-rmse:0.841546
[6] train-rmse:0.663311 valid-rmse:0.669983
[7] train-rmse:0.558202 valid-rmse:0.566389
[8] train-rmse:0.485001 valid-rmse:0.494619
[9] train-rmse:0.451296 valid-rmse:0.462016
[10]    train-rmse:0.431356 valid-rmse:0.443106
[11]    train-rmse:0.420363 valid-rmse:0.432821
[12]    train-rmse:0.415032 valid-rmse:0.427993
[13]    train-rmse:0.410913 valid-rmse:0.424339
[14]    train-rmse:0.409381 valid-rmse:0.423168
Time taken by above cell is 17.472981691360474.
Modeling RMSLE 0.42317
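dtest was built above but never used. A minimal sketch of turning the fitted model into a submission file, inverting the log(y + 1) transform applied to the target:

ytest = model.predict(dtest)
submission = pd.DataFrame({'id': test.id, 'trip_duration': np.exp(ytest) - 1})
submission.to_csv('submission.csv', index=False)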

Adding more features

Weather features

weather = pd.read_csv('./data/weather_data_nyc_centralpark_2016.csv')
weather.head()
date maximum temperature minimum temperature average temperature precipitation snow fall snow depth
0 1-1-2016 42 34 38.0 0.00 0.0 0
1 2-1-2016 40 32 36.0 0.00 0.0 0
2 3-1-2016 45 35 40.0 0.00 0.0 0
3 4-1-2016 36 14 25.0 0.00 0.0 0
4 5-1-2016 29 11 20.0 0.00 0.0 0
from ggplot import *
weather.date = pd.to_datetime(weather.date)
weather['day_of_year']= weather.date.dt.dayofyear
p = ggplot(aes(x='date'),data=weather) + geom_line(aes(y='minimum temperature', colour = "blue")) + geom_line(aes(y='maximum temperature', colour = "red"))
p + geom_point(aes(y='minimum temperature',colour = "blue")) #+ stat_smooth(colour='yellow', span=0.2)

[Figure output_98_0.png: daily minimum and maximum temperature]


Precipitation, snowfall, and snow depth

import matplotlib.pyplot as plt
%matplotlib inline
weather['precipitation'].unique()
# 'T' in the weather file denotes a trace amount; treat it as zero
weather['precipitation'] = np.where(weather['precipitation']=='T', '0.00', weather['precipitation'])
weather['precipitation'] = list(map(float, weather['precipitation']))
weather['snow fall'] = np.where(weather['snow fall']=='T', '0.00', weather['snow fall'])
weather['snow fall'] = list(map(float, weather['snow fall']))
weather['snow depth'] = np.where(weather['snow depth']=='T', '0.00', weather['snow depth'])
weather['snow depth'] = list(map(float, weather['snow depth']))
import plotly.plotly as py
import plotly.graph_objs as go
import plotly

random_x = weather['date'].values
random_y0 = weather['precipitation']
random_y1 = weather['snow fall']
random_y2 = weather['snow depth']

# Create traces
trace0 = go.Scatter(x=random_x, y=random_y0, mode='markers', name='precipitation')
trace1 = go.Scatter(x=random_x, y=random_y1, mode='markers', name='snow fall')
trace2 = go.Scatter(x=random_x, y=random_y2, mode='markers', name='snow depth')
data = [trace0, trace1, trace2]
plotly.offline.iplot(data, filename='scatter-mode')

Turn directions

from collections import Counter

def freq_turn(step_dir):
    """Count the turn types in a '|'-separated step_direction string;
    returns (straight, left, right) counts."""
    step_dir_new = step_dir.split("|")
    a_list = Counter(step_dir_new).most_common()
    path = dict(a_list)
    a = 0
    b = 0
    c = 0
    if 'straight' in path.keys():
        a = path['straight']
    if 'left' in path.keys():
        b = path['left']
    if 'right' in path.keys():
        c = path['right']
    return a, b, c
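For example, on a hypothetical step_direction string:

print(freq_turn('left|straight|right|straight'))   # -> (2, 1, 1): two straights, one left, one right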
start = time.time()
train_fr['straight']= 0
train_fr['left'] =0
train_fr['right'] = 0
train_fr['straight'], train_fr['left'], train_fr['right'] = zip(*train_fr['step_direction'].map(freq_turn))
end = time.time()
print("Time Taken by above cell is {}.".format(end - start))
Time Taken by above cell is 12.961659669876099.
train_fr_new = train_fr[['id','straight','left','right']]
train = pd.merge(train, train_fr_new, on = 'id', how = 'left')
#train = pd.merge(train, weather, on= 'date', how = 'left')
print(len(train.columns))
#train.columns
47
train.head()
id vendor_id pickup_datetime dropoff_datetime passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag ... speed_hvsn speed_manhtn pickup_pca0 pickup_pca1 dropoff_pca0 dropoff_pca1 store_and_fwd_flag_int straight left right
0 id2875421 2 2016-03-14 17:24:55 2016-03-14 17:32:30 1 -73.982155 40.767937 -73.964630 40.765602 N ... 0.009087 0.010524 0.007691 0.017053 -0.009666 0.013695 0 2.0 1.0 1.0
1 id2377394 1 2016-06-12 00:43:35 2016-06-12 00:54:38 1 -73.980415 40.738564 -73.999481 40.731152 N ... 0.005438 0.007321 0.007677 -0.012371 0.027145 -0.018652 0 0.0 2.0 2.0
2 id3858529 2 2016-01-19 11:35:24 2016-01-19 12:10:48 1 -73.979027 40.763939 -74.005333 40.710087 N ... 0.008318 0.010687 0.004803 0.012879 0.034222 -0.039337 0 3.0 4.0 5.0
3 id3504673 2 2016-04-06 19:32:31 2016-04-06 19:39:40 1 -74.010040 40.719971 -74.012268 40.706718 N ... 0.006300 0.007046 0.038342 -0.029194 0.041343 -0.042293 0 0.0 2.0 1.0
4 id2181028 2 2016-03-26 13:30:55 2016-03-26 13:38:10 1 -73.973053 40.793209 -73.972923 40.782520 N ... 0.008484 0.008561 -0.002877 0.041749 -0.002380 0.031071 0 0.0 2.0 2.0

5 rows × 47 columns

Merge in the weather features

train['pickup_datetime'] = pd.to_datetime(train['pickup_datetime'])
train['date'] = train['pickup_datetime'].dt.date
train.head()
train['date'] = pd.to_datetime(train['date'])
train = pd.merge(train, weather[['date','minimum temperature', 'precipitation', 'snow fall', 'snow depth']], on= 'date', how = 'left')
train.shape[0]
1458644
train.loc[:,'hvsine_pick_cent_d'] = haversine_(train['pickup_latitude'].values, train['pickup_longitude'].values, train['centroid_drop_lat'].values, train['centroid_drop_long'].values)
train.loc[:,'hvsine_drop_cent_p'] = haversine_(train['dropoff_latitude'].values, train['dropoff_longitude'].values, train['centroid_pick_lat'].values, train['centroid_pick_long'].values)
test.loc[:,'hvsine_pick_cent_d'] = haversine_(test['pickup_latitude'].values, test['pickup_longitude'].values, test['centroid_drop_lat'].values, test['centroid_drop_long'].values)
test.loc[:,'hvsine_drop_cent_p'] = haversine_(test['dropoff_latitude'].values, test['dropoff_longitude'].values, test['centroid_pick_lat'].values, test['centroid_pick_long'].values)
print("shape of train_features is {}.".format(len(train.columns)))
shape of train_features is 54.

Build the same features for the test set

start = time.time()
test_fr['straight']= 0
test_fr['left'] =0
test_fr['right'] = 0
test_fr['straight'], test_fr['left'], test_fr['right'] = zip(*test_fr['step_direction'].map(freq_turn))
end = time.time()
print("Time Taken by above cell is {}.".format(end - start))
#test_fr.head()
Time Taken by above cell is 5.300434827804565.
test_fr_new = test_fr[['id','straight','left','right']]
test = pd.merge(test, test_fr_new, on = 'id', how = 'left')
print(len(test.columns))
#test.columns
47
test['pickup_datetime'] = pd.to_datetime(test['pickup_datetime'])
test['date'] = test['pickup_datetime'].dt.date
test['date'] = pd.to_datetime(test['date'])
test= pd.merge(test, weather[['date','minimum temperature', 'precipitation', 'snow fall', 'snow depth']], on= 'date', how = 'left')
feature_names = list(train.columns)
print("Difference of features in train and test are {}".format(np.setdiff1d(train.columns, test.columns)))
print("")
do_not_use_for_training = ['pick_date','id', 'pickup_datetime', 'dropoff_datetime', 'trip_duration', 'store_and_fwd_flag', 'date']
feature_names = [f for f in train.columns if f not in do_not_use_for_training]
print("We will be using following features for training {}.".format(feature_names))
print("")
print("Total number of features are {}.".format(len(feature_names)))
Difference of features in train and test are ['dropoff_datetime' 'trip_duration']

We will be using following features for training ['vendor_id', 'passenger_count', 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'total_distance', 'total_travel_time', 'number_of_steps', 'pick_month', 'hour', 'week_of_year', 'day_of_year', 'day_of_week', 'hvsine_pick_drop', 'manhtn_pick_drop', 'bearing', 'label_pick', 'label_drop', 'centroid_pick_long', 'centroid_pick_lat', 'centroid_drop_long', 'centroid_drop_lat', 'hvsine_pick_cent_p', 'hvsine_drop_cent_d', 'hvsine_cent_p_cent_d', 'manhtn_pick_cent_p', 'manhtn_drop_cent_d', 'manhtn_cent_p_cent_d', 'bearing_pick_cent_p', 'bearing_drop_cent_p', 'bearing_cent_p_cent_d', 'speed_hvsn', 'speed_manhtn', 'pickup_pca0', 'pickup_pca1', 'dropoff_pca0', 'dropoff_pca1', 'store_and_fwd_flag_int', 'straight', 'left', 'right', 'minimum temperature', 'precipitation', 'snow fall', 'snow depth', 'hvsine_pick_cent_d', 'hvsine_drop_cent_p'].

Total number of features are 48.
y = np.log(train['trip_duration'].values + 1)
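The target is log-transformed so that RMSE on the transformed target matches the competition's RMSLE metric. The line above is equivalent to the numerically stabler `np.log1p`, with `np.expm1` as its inverse:

y = np.log1p(train['trip_duration'].values)  # same as np.log(x + 1)
# and to map predictions back to seconds:
# duration = np.expm1(y_pred)                # same as np.exp(y_pred) - 1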

Retrain the model

Xtr, Xv, ytr, yv = train_test_split(train[feature_names].values, y, test_size=0.2, random_state=1987)
dtrain = xgb.DMatrix(Xtr, label=ytr)
dvalid = xgb.DMatrix(Xv, label=yv)
dtest = xgb.DMatrix(test[feature_names].values)
watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
start = time.time()
xgb_pars = {'min_child_weight': 50, 'eta': 0.3, 'colsample_bytree': 0.3, 'max_depth': 10,
            'subsample': 0.8, 'lambda': 1., 'nthread': -1, 'booster': 'gbtree', 'silent': 1,
            'eval_metric': 'rmse', 'objective': 'reg:linear'}
model_1 = xgb.train(xgb_pars, dtrain, 100, watchlist, early_stopping_rounds=4, maximize=False, verbose_eval=1)
print('Modeling RMSLE %.5f' % model_1.best_score)
end = time.time()
print("Time taken in training is {}.".format(end - start))
[0]  train-rmse:5.72042  valid-rmse:5.72132
Multiple eval metrics have been passed: 'valid-rmse' will be used for early stopping.
Will train until valid-rmse hasn't improved in 4 rounds.
[1] train-rmse:5.43622  valid-rmse:5.43719
[2] train-rmse:5.16677  valid-rmse:5.16779
[3] train-rmse:4.91052  valid-rmse:4.91162
[4] train-rmse:4.6672   valid-rmse:4.66837
[5] train-rmse:4.43612  valid-rmse:4.43735
[6] train-rmse:4.21655  valid-rmse:4.21782
[7] train-rmse:4.00819  valid-rmse:4.00953
[8] train-rmse:3.81012  valid-rmse:3.81151
[9] train-rmse:3.62209  valid-rmse:3.62354
[10]    train-rmse:3.44373  valid-rmse:3.44527
[11]    train-rmse:3.2744   valid-rmse:3.27601
[12]    train-rmse:3.11365  valid-rmse:3.11535
[13]    train-rmse:2.9613   valid-rmse:2.96308
[14]    train-rmse:2.81629  valid-rmse:2.81817
[15]    train-rmse:2.67887  valid-rmse:2.68086
[16]    train-rmse:2.54841  valid-rmse:2.55053
[17]    train-rmse:2.42448  valid-rmse:2.42672
[18]    train-rmse:2.30707  valid-rmse:2.30944
[19]    train-rmse:2.19537  valid-rmse:2.19787
[20]    train-rmse:2.08975  valid-rmse:2.09242
[21]    train-rmse:1.98954  valid-rmse:1.99237
[22]    train-rmse:1.89417  valid-rmse:1.89717
[23]    train-rmse:1.80406  valid-rmse:1.80727
[24]    train-rmse:1.71832  valid-rmse:1.72173
[25]    train-rmse:1.63703  valid-rmse:1.64065
[26]    train-rmse:1.56005  valid-rmse:1.56394
[27]    train-rmse:1.48717  valid-rmse:1.49133
[28]    train-rmse:1.41816  valid-rmse:1.42261
[29]    train-rmse:1.35304  valid-rmse:1.35777
[30]    train-rmse:1.29104  valid-rmse:1.29611
[31]    train-rmse:1.23239  valid-rmse:1.2378
[32]    train-rmse:1.17727  valid-rmse:1.18303
[33]    train-rmse:1.12499  valid-rmse:1.13116
[34]    train-rmse:1.0754   valid-rmse:1.08201
[35]    train-rmse:1.02851  valid-rmse:1.03561
[36]    train-rmse:0.984373 valid-rmse:0.991948
[37]    train-rmse:0.942633 valid-rmse:0.950732
[38]    train-rmse:0.903371 valid-rmse:0.912026
[39]    train-rmse:0.866419 valid-rmse:0.875636
[40]    train-rmse:0.831615 valid-rmse:0.841465
[41]    train-rmse:0.798508 valid-rmse:0.809044
[42]    train-rmse:0.76716  valid-rmse:0.778418
[43]    train-rmse:0.738213 valid-rmse:0.750092
[44]    train-rmse:0.710718 valid-rmse:0.723319
[45]    train-rmse:0.684879 valid-rmse:0.698242
[46]    train-rmse:0.660684 valid-rmse:0.674864
[47]    train-rmse:0.637616 valid-rmse:0.6527
[48]    train-rmse:0.615922 valid-rmse:0.63198
[49]    train-rmse:0.595885 valid-rmse:0.612811
[50]    train-rmse:0.577099 valid-rmse:0.59497
[51]    train-rmse:0.559619 valid-rmse:0.57841
[52]    train-rmse:0.54312  valid-rmse:0.562796
[53]    train-rmse:0.527632 valid-rmse:0.548306
[54]    train-rmse:0.513364 valid-rmse:0.534999
[55]    train-rmse:0.499734 valid-rmse:0.522428
[56]    train-rmse:0.487051 valid-rmse:0.510792
[57]    train-rmse:0.475515 valid-rmse:0.500199
[58]    train-rmse:0.465022 valid-rmse:0.490585
[59]    train-rmse:0.454814 valid-rmse:0.481421
[60]    train-rmse:0.445312 valid-rmse:0.472921
[61]    train-rmse:0.436683 valid-rmse:0.465274
[62]    train-rmse:0.428942 valid-rmse:0.458362
[63]    train-rmse:0.421499 valid-rmse:0.451837
[64]    train-rmse:0.414361 valid-rmse:0.445674
[65]    train-rmse:0.407798 valid-rmse:0.440093
[66]    train-rmse:0.401684 valid-rmse:0.43488
[67]    train-rmse:0.396445 valid-rmse:0.430372
[68]    train-rmse:0.391588 valid-rmse:0.426227
[69]    train-rmse:0.386804 valid-rmse:0.422285
[70]    train-rmse:0.382344 valid-rmse:0.418661
[71]    train-rmse:0.378198 valid-rmse:0.415358
[72]    train-rmse:0.374537 valid-rmse:0.412416
[73]    train-rmse:0.371061 valid-rmse:0.409669
[74]    train-rmse:0.367815 valid-rmse:0.407142
[75]    train-rmse:0.365014 valid-rmse:0.404898
[76]    train-rmse:0.362352 valid-rmse:0.402853
[77]    train-rmse:0.359678 valid-rmse:0.400882
[78]    train-rmse:0.357404 valid-rmse:0.399173
[79]    train-rmse:0.355237 valid-rmse:0.397581
[80]    train-rmse:0.353313 valid-rmse:0.396152
[81]    train-rmse:0.351466 valid-rmse:0.394852
[82]    train-rmse:0.349827 valid-rmse:0.393688
[83]    train-rmse:0.348238 valid-rmse:0.392624
[84]    train-rmse:0.346586 valid-rmse:0.391577
[85]    train-rmse:0.344865 valid-rmse:0.390459
[86]    train-rmse:0.343565 valid-rmse:0.38962
[87]    train-rmse:0.342047 valid-rmse:0.388682
[88]    train-rmse:0.340773 valid-rmse:0.387944
[89]    train-rmse:0.339611 valid-rmse:0.387237
[90]    train-rmse:0.338232 valid-rmse:0.386392
[91]    train-rmse:0.337017 valid-rmse:0.38571
[92]    train-rmse:0.33599  valid-rmse:0.385164
[93]    train-rmse:0.334952 valid-rmse:0.384605
[94]    train-rmse:0.333857 valid-rmse:0.384042
[95]    train-rmse:0.332787 valid-rmse:0.383526
[96]    train-rmse:0.332035 valid-rmse:0.383221
[97]    train-rmse:0.331577 valid-rmse:0.38295
[98]    train-rmse:0.330563 valid-rmse:0.382527
[99]    train-rmse:0.329945 valid-rmse:0.382268
Modeling RMSLE 0.42317
Time taken in training is 184.5209550857544.
print('Modeling RMSLE %.5f' % model_1.best_score)
end = time.time()
print("Time taken in training is {}.".format(end - start))
start = time.time()
yvalid = model_1.predict(dvalid)
ytest = model_1.predict(dtest)
end = time.time()
print("Time taken in prediction is {}.".format(end - start))
Modeling RMSLE 3.62354
Time taken in training is 16.804673671722412.
Time taken in prediction is 0.07018685340881348.
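Since training used early stopping, predictions can also be pinned to the best iteration rather than all 100 trees (a sketch for the xgboost versions of this era; `best_ntree_limit` is set when `early_stopping_rounds` is passed to `xgb.train`):

yvalid = model_1.predict(dvalid, ntree_limit=model_1.best_ntree_limit)
ytest = model_1.predict(dtest, ntree_limit=model_1.best_ntree_limit)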
start = time.time()
if test.shape[0] == ytest.shape[0]:
    print('Test shape OK.')
test['trip_duration'] = np.exp(ytest) - 1
test[['id', 'trip_duration']].to_csv('mahesh_xgb_submission.csv', index=False)
end = time.time()
print("Time taken in writing the submission is {}.".format(end - start))
Test shape OK.
Time taken in writing the submission is 1.3792648315429688.
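A quick sanity check on the file just written (a minimal sketch; the file name matches the cell above):

sub = pd.read_csv('mahesh_xgb_submission.csv')
assert sub.shape == (test.shape[0], 2)   # one (id, trip_duration) row per test trip
assert (sub['trip_duration'] > 0).all()  # predicted durations should be positive
print(sub.head())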
