
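Case Overview

Bike sharing is a form of bicycle rental. After registering as a member with a provider, users can rent and return bikes on their own over the network. With the provider's app, riders scan a code to unlock a bike at their starting point and return it freely at their destination. There are currently more than 500 bike-sharing providers worldwide.

Bike-sharing operators record each rider's trip duration, departure location, and arrival location. A bike-sharing system can therefore serve as a sensor network for studying how people move around a city. This case study combines historical usage records from the bike-sharing system in Washington, D.C. with local weather data to forecast rental demand. All code is written in Python.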

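Data Overview

The data come from Capital Bikeshare and are hosted in the UCI data repository.
The dataset records two years of bike rentals at hourly resolution. The training set consists of the records from the first 19 days of each month; the test set consists of the records from the 20th through the end of each month. The task is to predict the number of bikes rented in each hour covered by the test set.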
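The split described above can be sketched as follows. This is a minimal illustration on a synthetic two-month hourly index, assuming a `datetime` column of timestamp strings; `split_by_day_of_month` and `cutoff_day` are hypothetical names introduced here, not part of the original data pipeline.

```python
import pandas as pd

def split_by_day_of_month(df, cutoff_day=19):
    """Split rows into (train, test) by the day of the month of df['datetime']."""
    days = pd.to_datetime(df["datetime"]).dt.day
    train = df[days <= cutoff_day]   # days 1..19 of every month
    test = df[days > cutoff_day]     # day 20 through the end of every month
    return train, test

# Synthetic hourly timestamps for January and February 2011 (59 days).
hours = pd.date_range("2011-01-01", periods=59 * 24, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"datetime": hours.strftime("%Y-%m-%d %H:%M:%S")})

train_part, test_part = split_by_day_of_month(toy)
print(len(train_part), len(test_part))
```

The real competition files are already split this way, so a helper like this is only needed when reproducing the split from the full hourly data.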

Variable Description

datetime: hourly date + timestamp
season: 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday: whether the day is considered a holiday
workingday: whether the day is neither a weekend nor holiday
weather:
1 = Clear, Few clouds, Partly cloudy
2 = Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3 = Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4 = Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

temp: temperature in Celsius
atemp: “feels like” temperature in Celsius
humidity: relative humidity
windspeed: wind speed
casual: number of non-registered user rentals initiated
registered: number of registered user rentals initiated
count: number of total rentals (Dependent Variable)
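The last three variables are related: the descriptions suggest that count is the sum of casual and registered. A quick consistency check, sketched here on made-up rows (the values below are illustrative, not taken from the dataset), might look like this:

```python
import pandas as pd

# Illustrative rows only; the real values come from train.csv.
toy = pd.DataFrame({
    "casual": [3, 8, 5],
    "registered": [13, 32, 27],
    "count": [16, 40, 32],
})

# True when every row satisfies casual + registered == count.
is_consistent = bool((toy["casual"] + toy["registered"] == toy["count"]).all())
print(is_consistent)
```

Because casual and registered add up to the target, they are not available for the test period and cannot be used as predictors of count.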

Loading the Dataset

Load the required libraries and the training set.

import pylab
import calendar
import numpy as np
import pandas as pd
import seaborn as sn
from scipy import stats
import missingno as msno # Missing data visualization
from datetime import datetime
import matplotlib.pyplot as plt
import warnings
pd.options.mode.chained_assignment = None
warnings.filterwarnings("ignore", category=DeprecationWarning)
%matplotlib inline

dailyData = pd.read_csv("../input/train.csv")

Data Summary

Dataset size

print(dailyData.shape)

(10886, 12)

Sample rows

print(dailyData.head(2))

Variable types

print(dailyData.dtypes)

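Feature Engineering

The columns season, holiday, workingday, and weather should be categorical, but they are currently stored as integers, so they need a type conversion. We transform the dataset as follows.

Create the new features (variables) date, hour, weekday, and month from datetime, and map the season and weather codes to descriptive labels.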

dailyData["date"] = dailyData.datetime.apply(lambda x : x.split()[0])
dailyData["hour"] = dailyData.datetime.apply(lambda x : x.split()[1].split(":")[0])
dailyData["weekday"] = dailyData.date.apply(lambda dateString : calendar.day_name[datetime.strptime(dateString,"%Y-%m-%d").weekday()])
dailyData["month"] = dailyData.date.apply(lambda dateString : calendar.month_name[datetime.strptime(dateString,"%Y-%m-%d").month])
dailyData["season"] = dailyData.season.map({1: "Spring", 2 : "Summer", 3 : "Fall", 4 :"Winter" })
dailyData["weather"] = dailyData.weather.map({
    1: "Clear + Few clouds + Partly cloudy",
    2: "Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist",
    3: "Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds",
    4: "Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog"
})

Cast "season", "holiday", "workingday", and "weather", together with the new "hour", "weekday", and "month" features, to the category type.

categoryVariableList = ["hour","weekday","month","season","weather","holiday","workingday"]
for var in categoryVariableList:
    dailyData[var] = dailyData[var].astype("category")

Drop "datetime".

dailyData = dailyData.drop(["datetime"],axis=1)
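The datetime-derived features can also be produced without per-row lambdas. The sketch below is an alternative to the string-splitting approach, not the method used in this walkthrough; it relies on the pandas `.dt` accessor and runs on two made-up timestamps. Note one difference: `hour` comes out as an integer here, whereas splitting the string yields zero-padded text such as "00".

```python
import pandas as pd

# Two made-up timestamps in the same format as the datetime column.
toy = pd.DataFrame({"datetime": ["2011-01-01 00:00:00", "2012-07-04 17:00:00"]})

parsed = pd.to_datetime(toy["datetime"])       # parse once, reuse below
toy["date"] = parsed.dt.strftime("%Y-%m-%d")   # calendar date as text
toy["hour"] = parsed.dt.hour                   # integer hour, 0..23
toy["weekday"] = parsed.dt.day_name()          # e.g. "Saturday"
toy["month"] = parsed.dt.month_name()          # e.g. "January"

print(toy)
```

Parsing the column once is usually faster on large frames than calling `datetime.strptime` inside `apply` for every row.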

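After this first look at the data, we check whether the dataset contains missing values. The missingno module lets us visualize the missingness in the data.

# Missing values would appear as white gaps in the matrix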
msno.matrix(dailyData,figsize=(20,10))
plt.show()

print(dailyData.isnull().sum())

The output shows that the data contain no missing values.

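Outlier Analysis

We draw box plots of count, overall and by season, hour of the day, and working day, to see whether outliers are present.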

fig, axes = plt.subplots(nrows=2,ncols=2)
fig.set_size_inches(12, 10)
sn.boxplot(data=dailyData,y="count",orient="v",ax=axes[0][0])
sn.boxplot(data=dailyData,y="count",x="season",orient="v",ax=axes[0][1])
sn.boxplot(data=dailyData,y="count",x="hour",orient="v",ax=axes[1][0])
sn.boxplot(data=dailyData,y="count",x="workingday",orient="v",ax=axes[1][1])

axes[0][0].set(ylabel='Count',title="Box Plot On Count")
axes[0][1].set(xlabel='Season', ylabel='Count',title="Box Plot On Count Across Season")
axes[1][0].set(xlabel='Hour Of The Day', ylabel='Count',title="Box Plot On Count Across Hour Of The Day")
axes[1][1].set(xlabel='Working Day', ylabel='Count',title="Box Plot On Count Across Working Day")
plt.show()

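The count variable contains many outliers above the upper whisker. The plots also suggest the following:

The median lines show that rentals are lower in spring.
The "Hour Of The Day" box plot is interesting. The medians are higher at 7-8 a.m. and 5-6 p.m., which coincide with the commute to and from school and work.
The fourth panel shows that most of the outliers come from working days rather than non-working days.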
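The "upper whisker" in a box plot is conventionally placed at the third quartile plus 1.5 times the interquartile range, and points beyond it are drawn as outliers. A minimal sketch of that rule, assuming the default factor of 1.5 and using made-up counts (`upper_whisker_limit` is a hypothetical helper, not part of seaborn):

```python
import numpy as np

def upper_whisker_limit(values, k=1.5):
    """Return Q3 + k * IQR, the usual upper limit for box plot whiskers."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 + k * (q3 - q1)

# Made-up hourly counts with one obviously extreme value.
toy_counts = np.array([10, 12, 14, 15, 16, 18, 20, 22, 400])

limit = upper_whisker_limit(toy_counts)
outliers = toy_counts[toy_counts > limit]
print(limit, outliers)
```

Applying the same function to the count values of each group (for example, working days versus non-working days) gives a numeric counterpart to what the box plots show visually.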

Removing Outliers

We keep only the rows whose count lies within three standard deviations of the mean.

dailyDataWithoutOutliers = dailyData[np.abs(dailyData["count"]-dailyData["count"].mean())<=(3*dailyData["count"].std())]
print("Shape before removing outliers:",dailyData.shape)
print("Shape after removing outliers:",dailyDataWithoutOutliers.shape)

Shape before removing outliers: (10886, 15)
Shape after removing outliers: (10739, 15)
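The filter above is a three-sigma rule. Wrapped as a reusable function, it might look like the sketch below; `drop_three_sigma_outliers` and `n_sigma` are hypothetical names, and the toy frame is made up to keep the example self-contained.

```python
import pandas as pd

def drop_three_sigma_outliers(df, column, n_sigma=3.0):
    """Keep rows whose value lies within n_sigma standard deviations of the mean."""
    deviation = (df[column] - df[column].mean()).abs()
    return df[deviation <= n_sigma * df[column].std()]

# Fifty ordinary values and one extreme value.
toy = pd.DataFrame({"count": [10] * 50 + [1000]})

filtered = drop_three_sigma_outliers(toy, "count")
print(toy.shape, filtered.shape)
```

Note that the mean and standard deviation are themselves inflated by the extreme values, so this rule is fairly conservative on heavily skewed data such as count.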

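Data Distribution

Many regression techniques work better when the dependent variable is close to normally distributed, and the histogram of count is clearly right-skewed. One possible remedy is to remove the outliers and then apply a log transformation to count. The plots below show that the log-transformed count looks closer to normal, though still far from perfectly so.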

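fig,axes = plt.subplots(ncols=2,nrows=2)
fig.set_size_inches(12, 10)
# Top row: raw count. Bottom row: log1p(count) after outlier removal.
# Note: sn.distplot is deprecated in recent seaborn releases; sn.histplot(..., kde=True) is the newer equivalent.
sn.distplot(dailyData["count"],ax=axes[0][0])
stats.probplot(dailyData["count"], dist='norm', fit=True, plot=axes[0][1])
sn.distplot(np.log1p(dailyDataWithoutOutliers["count"]),ax=axes[1][0])
stats.probplot(np.log1p(dailyDataWithoutOutliers["count"]), dist='norm', fit=True, plot=axes[1][1])
plt.show()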
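The effect of the log transformation can also be summarized numerically with the sample skewness. The sketch below uses synthetic right-skewed data drawn from a lognormal distribution as a stand-in for count; the distribution parameters and the seed are arbitrary assumptions, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed "counts" (an assumption, not the real data).
rng = np.random.default_rng(0)
toy_counts = np.round(rng.lognormal(mean=4.0, sigma=1.0, size=5000)) + 1

skew_raw = stats.skew(toy_counts)            # strongly positive for right-skewed data
skew_log = stats.skew(np.log1p(toy_counts))  # much closer to zero after log1p

print(f"skewness before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

A skewness near zero indicates a roughly symmetric distribution, which is what the probability plot of the transformed variable checks visually.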

Count vs. Month, Season, Hour, Weekday, and User Type

fig,(ax1,ax2,ax3,ax4)= plt.subplots(nrows=4)
fig.set_size_inches(12,20)
sortOrder = ["January","February","March","April","May","June","July","August","September","October","November","December"]
hueOrder = ["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"]

# Average count by month
monthAggregated = pd.DataFrame(dailyData.groupby("month")["count"].mean()).reset_index()
monthSorted = monthAggregated.sort_values(by="count",ascending=False)
sn.barplot(data=monthSorted,x="month",y="count",ax=ax1,order=sortOrder)
ax1.set(xlabel='Month', ylabel='Average Count',title="Average Count By Month")

# Average count by hour, one line per season
hourAggregated = pd.DataFrame(dailyData.groupby(["hour","season"],sort=True)["count"].mean()).reset_index()
sn.pointplot(x=hourAggregated["hour"], y=hourAggregated["count"],hue=hourAggregated["season"], data=hourAggregated, join=True,ax=ax2)
ax2.set(xlabel='Hour Of The Day', ylabel='Users Count',title="Average Users Count By Hour Of The Day Across Season",label='big')

# Average count by hour, one line per weekday
hourAggregated = pd.DataFrame(dailyData.groupby(["hour","weekday"],sort=True)["count"].mean()).reset_index()
sn.pointplot(x=hourAggregated["hour"], y=hourAggregated["count"],hue=hourAggregated["weekday"],hue_order=hueOrder, data=hourAggregated, join=True,ax=ax3)
ax3.set(xlabel='Hour Of The Day', ylabel='Users Count',title="Average Users Count By Hour Of The Day Across Weekdays",label='big')

# Average count by hour, one line per user type (casual vs. registered)
hourTransformed = pd.melt(dailyData[["hour","casual","registered"]], id_vars=['hour'], value_vars=['casual', 'registered'])
hourAggregated = pd.DataFrame(hourTransformed.groupby(["hour","variable"],sort=True)["value"].mean()).reset_index()
sn.pointplot(x=hourAggregated["hour"], y=hourAggregated["value"],hue=hourAggregated["variable"],hue_order=["casual","registered"], data=hourAggregated, join=True,ax=ax4)
ax4.set(xlabel='Hour Of The Day', ylabel='Users Count',title="Average Users Count By Hour Of The Day Across User Type",label='big')
plt.show()

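People clearly prefer to rent bikes in summer: June, July, and August have the highest rental counts.
Registered users' rentals typically peak at 7-8 a.m. and 5-6 p.m.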
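Peak hours like these can be read off the aggregated table directly instead of from the plot. The following sketch groups by hour and finds the busiest hour for each user type; the rows are made up for illustration, so the resulting hours are not findings about the real data.

```python
import pandas as pd

# Made-up observations for four hours of the day, two rows per hour.
toy = pd.DataFrame({
    "hour":       [7,   8,   13,  17,  7,   8,   13,  17],
    "casual":     [5,   10,  40,  30,  7,   12,  44,  28],
    "registered": [150, 300, 120, 350, 170, 320, 110, 370],
})

# Mean rentals per hour for each user type.
hourly_mean = toy.groupby("hour")[["casual", "registered"]].mean()

# Hour with the highest mean, per user type.
peak_hour = hourly_mean.idxmax()

# The two busiest hours for registered users.
top_registered = hourly_mean["registered"].nlargest(2).index.tolist()

print(peak_hour)
print(top_registered)
```

On the real data, the same pattern applied to dailyData (before hour is cast to a category) would give the numeric counterpart of the user-type plot.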

This completes the pre-modeling preparation and the exploratory data analysis.
