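These are Jupyter notebooks covering the basics of data collection with Python, including crawl_and_parse, different_format_data_processing, feature_engineering_example, and python_regular_expression. The material was provided in an earlier course, and I ported it to Python 3 on Windows. The code has been uploaded to CSDN resources: ABC of data_collection
To make it easier to read, the code is split across 4 blog posts.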

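Below is the md file exported from the Jupyter notebook.
3.feature_engineering_example

A small feature engineering example

Kaggle hosts a competition on the usage of a city bike sharing system.

The data provided are hourly bike rental records covering 2 years. The training set consists of the first 19 days of each month, and the test set consists of the days from the 20th onward.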
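The split rule described above can be sketched with the standard library. This is only an illustration of the rule as stated (day of month up to 19 is training data); the helper name `is_training_timestamp` and the constant `TRAIN_LAST_DAY` are hypothetical and do not come from the competition files.

```python
from datetime import datetime

# Assumption: the cutoff day (19) is taken from the description above.
TRAIN_LAST_DAY = 19

def is_training_timestamp(ts: datetime) -> bool:
    """Return True if the timestamp falls in the first 19 days of its month."""
    return ts.day <= TRAIN_LAST_DAY

samples = [
    datetime(2011, 1, 1, 0),     # first hour of the data set
    datetime(2012, 12, 19, 23),  # last training hour shown in this post
    datetime(2012, 12, 20, 0),   # first hour that belongs to the test period
]
labels = ["train" if is_training_timestamp(ts) else "test" for ts in samples]
print(labels)
```

Because the test days always come after the training days within a month, any feature built from the calendar date has to make sense for days 20 and later as well.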

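# Read the data in first
import pandas as pd
# on_bad_lines='skip' needs pandas >= 1.3; older versions used error_bad_lines=False
data = pd.read_csv('kaggle_bike_competition_train.csv', header=0, on_bad_lines='skip')
# Take a look at what the data looks like
data.head()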
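| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 |
| 3 | 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3 | 10 | 13 |
| 4 | 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0 | 1 | 1 |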

### Split the datetime field into two parts: a date and a time.

# Process the time field
temp = pd.DatetimeIndex(data['datetime'])
data['date'] = temp.date
data['time'] = temp.time
data.head()
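| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count | date | time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 | 2011-01-01 | 00:00:00 |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 | 2011-01-01 | 01:00:00 |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 | 2011-01-01 | 02:00:00 |
| 3 | 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3 | 10 | 13 | 2011-01-01 | 03:00:00 |
| 4 | 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0 | 1 | 1 | 2011-01-01 | 04:00:00 |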

### The finest granularity in the time part appears to be the hour, so we simply pull the hour out as a more compact feature.

# Set up the hour field, read directly from the parsed datetime column
data['hour'] = pd.DatetimeIndex(data['datetime']).hour
data
datetime season holiday workingday weather temp atemp humidity windspeed casual registered count date time hour
0 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0.0000 3 13 16 2011-01-01 00:00:00 0
1 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0.0000 8 32 40 2011-01-01 01:00:00 1
2 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0.0000 5 27 32 2011-01-01 02:00:00 2
3 2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75 0.0000 3 10 13 2011-01-01 03:00:00 3
4 2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75 0.0000 0 1 1 2011-01-01 04:00:00 4
5 2011-01-01 05:00:00 1 0 0 2 9.84 12.880 75 6.0032 0 1 1 2011-01-01 05:00:00 5
6 2011-01-01 06:00:00 1 0 0 1 9.02 13.635 80 0.0000 2 0 2 2011-01-01 06:00:00 6
7 2011-01-01 07:00:00 1 0 0 1 8.20 12.880 86 0.0000 1 2 3 2011-01-01 07:00:00 7
8 2011-01-01 08:00:00 1 0 0 1 9.84 14.395 75 0.0000 1 7 8 2011-01-01 08:00:00 8
9 2011-01-01 09:00:00 1 0 0 1 13.12 17.425 76 0.0000 8 6 14 2011-01-01 09:00:00 9
10 2011-01-01 10:00:00 1 0 0 1 15.58 19.695 76 16.9979 12 24 36 2011-01-01 10:00:00 10
11 2011-01-01 11:00:00 1 0 0 1 14.76 16.665 81 19.0012 26 30 56 2011-01-01 11:00:00 11
12 2011-01-01 12:00:00 1 0 0 1 17.22 21.210 77 19.0012 29 55 84 2011-01-01 12:00:00 12
13 2011-01-01 13:00:00 1 0 0 2 18.86 22.725 72 19.9995 47 47 94 2011-01-01 13:00:00 13
14 2011-01-01 14:00:00 1 0 0 2 18.86 22.725 72 19.0012 35 71 106 2011-01-01 14:00:00 14
15 2011-01-01 15:00:00 1 0 0 2 18.04 21.970 77 19.9995 40 70 110 2011-01-01 15:00:00 15
16 2011-01-01 16:00:00 1 0 0 2 17.22 21.210 82 19.9995 41 52 93 2011-01-01 16:00:00 16
17 2011-01-01 17:00:00 1 0 0 2 18.04 21.970 82 19.0012 15 52 67 2011-01-01 17:00:00 17
18 2011-01-01 18:00:00 1 0 0 3 17.22 21.210 88 16.9979 9 26 35 2011-01-01 18:00:00 18
19 2011-01-01 19:00:00 1 0 0 3 17.22 21.210 88 16.9979 6 31 37 2011-01-01 19:00:00 19
20 2011-01-01 20:00:00 1 0 0 2 16.40 20.455 87 16.9979 11 25 36 2011-01-01 20:00:00 20
21 2011-01-01 21:00:00 1 0 0 2 16.40 20.455 87 12.9980 3 31 34 2011-01-01 21:00:00 21
22 2011-01-01 22:00:00 1 0 0 2 16.40 20.455 94 15.0013 11 17 28 2011-01-01 22:00:00 22
23 2011-01-01 23:00:00 1 0 0 2 18.86 22.725 88 19.9995 15 24 39 2011-01-01 23:00:00 23
24 2011-01-02 00:00:00 1 0 0 2 18.86 22.725 88 19.9995 4 13 17 2011-01-02 00:00:00 0
25 2011-01-02 01:00:00 1 0 0 2 18.04 21.970 94 16.9979 1 16 17 2011-01-02 01:00:00 1
26 2011-01-02 02:00:00 1 0 0 2 17.22 21.210 100 19.0012 1 8 9 2011-01-02 02:00:00 2
27 2011-01-02 03:00:00 1 0 0 2 18.86 22.725 94 12.9980 2 4 6 2011-01-02 03:00:00 3
28 2011-01-02 04:00:00 1 0 0 2 18.86 22.725 94 12.9980 2 1 3 2011-01-02 04:00:00 4
29 2011-01-02 06:00:00 1 0 0 3 17.22 21.210 77 19.9995 0 2 2 2011-01-02 06:00:00 6
...
10856 2012-12-18 18:00:00 4 0 1 1 15.58 19.695 46 22.0028 13 512 525 2012-12-18 18:00:00 18
10857 2012-12-18 19:00:00 4 0 1 1 15.58 19.695 46 26.0027 19 334 353 2012-12-18 19:00:00 19
10858 2012-12-18 20:00:00 4 0 1 1 14.76 16.665 50 16.9979 4 264 268 2012-12-18 20:00:00 20
10859 2012-12-18 21:00:00 4 0 1 1 14.76 17.425 50 15.0013 9 159 168 2012-12-18 21:00:00 21
10860 2012-12-18 22:00:00 4 0 1 1 13.94 16.665 49 0.0000 5 127 132 2012-12-18 22:00:00 22
10861 2012-12-18 23:00:00 4 0 1 1 13.94 17.425 49 6.0032 1 80 81 2012-12-18 23:00:00 23
10862 2012-12-19 00:00:00 4 0 1 1 12.30 15.910 61 0.0000 6 35 41 2012-12-19 00:00:00 0
10863 2012-12-19 01:00:00 4 0 1 1 12.30 15.910 65 6.0032 1 14 15 2012-12-19 01:00:00 1
10864 2012-12-19 02:00:00 4 0 1 1 11.48 15.150 65 6.0032 1 2 3 2012-12-19 02:00:00 2
10865 2012-12-19 03:00:00 4 0 1 1 10.66 13.635 75 8.9981 0 5 5 2012-12-19 03:00:00 3
10866 2012-12-19 04:00:00 4 0 1 1 9.84 12.120 75 8.9981 1 6 7 2012-12-19 04:00:00 4
10867 2012-12-19 05:00:00 4 0 1 1 10.66 14.395 75 6.0032 2 29 31 2012-12-19 05:00:00 5
10868 2012-12-19 06:00:00 4 0 1 1 9.84 12.880 75 6.0032 3 109 112 2012-12-19 06:00:00 6
10869 2012-12-19 07:00:00 4 0 1 1 10.66 13.635 75 8.9981 3 360 363 2012-12-19 07:00:00 7
10870 2012-12-19 08:00:00 4 0 1 1 9.84 12.880 87 7.0015 13 665 678 2012-12-19 08:00:00 8
10871 2012-12-19 09:00:00 4 0 1 1 11.48 14.395 75 7.0015 8 309 317 2012-12-19 09:00:00 9
10872 2012-12-19 10:00:00 4 0 1 1 13.12 16.665 70 7.0015 17 147 164 2012-12-19 10:00:00 10
10873 2012-12-19 11:00:00 4 0 1 1 16.40 20.455 54 15.0013 31 169 200 2012-12-19 11:00:00 11
10874 2012-12-19 12:00:00 4 0 1 1 16.40 20.455 54 19.0012 33 203 236 2012-12-19 12:00:00 12
10875 2012-12-19 13:00:00 4 0 1 1 17.22 21.210 50 12.9980 30 183 213 2012-12-19 13:00:00 13
10876 2012-12-19 14:00:00 4 0 1 1 17.22 21.210 50 12.9980 33 185 218 2012-12-19 14:00:00 14
10877 2012-12-19 15:00:00 4 0 1 1 17.22 21.210 50 19.0012 28 209 237 2012-12-19 15:00:00 15
10878 2012-12-19 16:00:00 4 0 1 1 17.22 21.210 50 23.9994 37 297 334 2012-12-19 16:00:00 16
10879 2012-12-19 17:00:00 4 0 1 1 16.40 20.455 50 26.0027 26 536 562 2012-12-19 17:00:00 17
10880 2012-12-19 18:00:00 4 0 1 1 15.58 19.695 50 23.9994 23 546 569 2012-12-19 18:00:00 18
10881 2012-12-19 19:00:00 4 0 1 1 15.58 19.695 50 26.0027 7 329 336 2012-12-19 19:00:00 19
10882 2012-12-19 20:00:00 4 0 1 1 14.76 17.425 57 15.0013 10 231 241 2012-12-19 20:00:00 20
10883 2012-12-19 21:00:00 4 0 1 1 13.94 15.910 61 15.0013 4 164 168 2012-12-19 21:00:00 21
10884 2012-12-19 22:00:00 4 0 1 1 13.94 17.425 61 6.0032 12 117 129 2012-12-19 22:00:00 22
10885 2012-12-19 23:00:00 4 0 1 1 13.12 16.665 66 8.9981 4 84 88 2012-12-19 23:00:00 23

10886 rows × 15 columns

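### On reflection, the data only tells us which calendar day it is. Common sense says the number of riders should differ between weekends and working days, so we add a new field, dayofweek, for the day of the week. We also add a field, dateDays, for the number of days since the first day of rentals (my guess is that in Western countries this green way of getting around would spread quickly).

# Process the date features: produce a categorical day-of-week variable (Monday=0, ..., Sunday=6)
data['dayofweek'] = pd.DatetimeIndex(data.date).dayofweek
# Process the date features: produce a duration variable (days since the first record)
dates = pd.to_datetime(data['date'])
data['dateDays'] = (dates - dates.iloc[0]).dt.days.astype(float)
data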
datetime season holiday workingday weather temp atemp humidity windspeed casual registered count date time hour dayofweek dateDays
0 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0.0000 3 13 16 2011-01-01 00:00:00 0 5 0.0
1 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0.0000 8 32 40 2011-01-01 01:00:00 1 5 0.0
2 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0.0000 5 27 32 2011-01-01 02:00:00 2 5 0.0
3 2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75 0.0000 3 10 13 2011-01-01 03:00:00 3 5 0.0
4 2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75 0.0000 0 1 1 2011-01-01 04:00:00 4 5 0.0
5 2011-01-01 05:00:00 1 0 0 2 9.84 12.880 75 6.0032 0 1 1 2011-01-01 05:00:00 5 5 0.0
6 2011-01-01 06:00:00 1 0 0 1 9.02 13.635 80 0.0000 2 0 2 2011-01-01 06:00:00 6 5 0.0
7 2011-01-01 07:00:00 1 0 0 1 8.20 12.880 86 0.0000 1 2 3 2011-01-01 07:00:00 7 5 0.0
8 2011-01-01 08:00:00 1 0 0 1 9.84 14.395 75 0.0000 1 7 8 2011-01-01 08:00:00 8 5 0.0
9 2011-01-01 09:00:00 1 0 0 1 13.12 17.425 76 0.0000 8 6 14 2011-01-01 09:00:00 9 5 0.0
10 2011-01-01 10:00:00 1 0 0 1 15.58 19.695 76 16.9979 12 24 36 2011-01-01 10:00:00 10 5 0.0
11 2011-01-01 11:00:00 1 0 0 1 14.76 16.665 81 19.0012 26 30 56 2011-01-01 11:00:00 11 5 0.0
12 2011-01-01 12:00:00 1 0 0 1 17.22 21.210 77 19.0012 29 55 84 2011-01-01 12:00:00 12 5 0.0
13 2011-01-01 13:00:00 1 0 0 2 18.86 22.725 72 19.9995 47 47 94 2011-01-01 13:00:00 13 5 0.0
14 2011-01-01 14:00:00 1 0 0 2 18.86 22.725 72 19.0012 35 71 106 2011-01-01 14:00:00 14 5 0.0
15 2011-01-01 15:00:00 1 0 0 2 18.04 21.970 77 19.9995 40 70 110 2011-01-01 15:00:00 15 5 0.0
16 2011-01-01 16:00:00 1 0 0 2 17.22 21.210 82 19.9995 41 52 93 2011-01-01 16:00:00 16 5 0.0
17 2011-01-01 17:00:00 1 0 0 2 18.04 21.970 82 19.0012 15 52 67 2011-01-01 17:00:00 17 5 0.0
18 2011-01-01 18:00:00 1 0 0 3 17.22 21.210 88 16.9979 9 26 35 2011-01-01 18:00:00 18 5 0.0
19 2011-01-01 19:00:00 1 0 0 3 17.22 21.210 88 16.9979 6 31 37 2011-01-01 19:00:00 19 5 0.0
20 2011-01-01 20:00:00 1 0 0 2 16.40 20.455 87 16.9979 11 25 36 2011-01-01 20:00:00 20 5 0.0
21 2011-01-01 21:00:00 1 0 0 2 16.40 20.455 87 12.9980 3 31 34 2011-01-01 21:00:00 21 5 0.0
22 2011-01-01 22:00:00 1 0 0 2 16.40 20.455 94 15.0013 11 17 28 2011-01-01 22:00:00 22 5 0.0
23 2011-01-01 23:00:00 1 0 0 2 18.86 22.725 88 19.9995 15 24 39 2011-01-01 23:00:00 23 5 0.0
24 2011-01-02 00:00:00 1 0 0 2 18.86 22.725 88 19.9995 4 13 17 2011-01-02 00:00:00 0 6 1.0
25 2011-01-02 01:00:00 1 0 0 2 18.04 21.970 94 16.9979 1 16 17 2011-01-02 01:00:00 1 6 1.0
26 2011-01-02 02:00:00 1 0 0 2 17.22 21.210 100 19.0012 1 8 9 2011-01-02 02:00:00 2 6 1.0
27 2011-01-02 03:00:00 1 0 0 2 18.86 22.725 94 12.9980 2 4 6 2011-01-02 03:00:00 3 6 1.0
28 2011-01-02 04:00:00 1 0 0 2 18.86 22.725 94 12.9980 2 1 3 2011-01-02 04:00:00 4 6 1.0
29 2011-01-02 06:00:00 1 0 0 3 17.22 21.210 77 19.9995 0 2 2 2011-01-02 06:00:00 6 6 1.0
...
10856 2012-12-18 18:00:00 4 0 1 1 15.58 19.695 46 22.0028 13 512 525 2012-12-18 18:00:00 18 1 717.0
10857 2012-12-18 19:00:00 4 0 1 1 15.58 19.695 46 26.0027 19 334 353 2012-12-18 19:00:00 19 1 717.0
10858 2012-12-18 20:00:00 4 0 1 1 14.76 16.665 50 16.9979 4 264 268 2012-12-18 20:00:00 20 1 717.0
10859 2012-12-18 21:00:00 4 0 1 1 14.76 17.425 50 15.0013 9 159 168 2012-12-18 21:00:00 21 1 717.0
10860 2012-12-18 22:00:00 4 0 1 1 13.94 16.665 49 0.0000 5 127 132 2012-12-18 22:00:00 22 1 717.0
10861 2012-12-18 23:00:00 4 0 1 1 13.94 17.425 49 6.0032 1 80 81 2012-12-18 23:00:00 23 1 717.0
10862 2012-12-19 00:00:00 4 0 1 1 12.30 15.910 61 0.0000 6 35 41 2012-12-19 00:00:00 0 2 718.0
10863 2012-12-19 01:00:00 4 0 1 1 12.30 15.910 65 6.0032 1 14 15 2012-12-19 01:00:00 1 2 718.0
10864 2012-12-19 02:00:00 4 0 1 1 11.48 15.150 65 6.0032 1 2 3 2012-12-19 02:00:00 2 2 718.0
10865 2012-12-19 03:00:00 4 0 1 1 10.66 13.635 75 8.9981 0 5 5 2012-12-19 03:00:00 3 2 718.0
10866 2012-12-19 04:00:00 4 0 1 1 9.84 12.120 75 8.9981 1 6 7 2012-12-19 04:00:00 4 2 718.0
10867 2012-12-19 05:00:00 4 0 1 1 10.66 14.395 75 6.0032 2 29 31 2012-12-19 05:00:00 5 2 718.0
10868 2012-12-19 06:00:00 4 0 1 1 9.84 12.880 75 6.0032 3 109 112 2012-12-19 06:00:00 6 2 718.0
10869 2012-12-19 07:00:00 4 0 1 1 10.66 13.635 75 8.9981 3 360 363 2012-12-19 07:00:00 7 2 718.0
10870 2012-12-19 08:00:00 4 0 1 1 9.84 12.880 87 7.0015 13 665 678 2012-12-19 08:00:00 8 2 718.0
10871 2012-12-19 09:00:00 4 0 1 1 11.48 14.395 75 7.0015 8 309 317 2012-12-19 09:00:00 9 2 718.0
10872 2012-12-19 10:00:00 4 0 1 1 13.12 16.665 70 7.0015 17 147 164 2012-12-19 10:00:00 10 2 718.0
10873 2012-12-19 11:00:00 4 0 1 1 16.40 20.455 54 15.0013 31 169 200 2012-12-19 11:00:00 11 2 718.0
10874 2012-12-19 12:00:00 4 0 1 1 16.40 20.455 54 19.0012 33 203 236 2012-12-19 12:00:00 12 2 718.0
10875 2012-12-19 13:00:00 4 0 1 1 17.22 21.210 50 12.9980 30 183 213 2012-12-19 13:00:00 13 2 718.0
10876 2012-12-19 14:00:00 4 0 1 1 17.22 21.210 50 12.9980 33 185 218 2012-12-19 14:00:00 14 2 718.0
10877 2012-12-19 15:00:00 4 0 1 1 17.22 21.210 50 19.0012 28 209 237 2012-12-19 15:00:00 15 2 718.0
10878 2012-12-19 16:00:00 4 0 1 1 17.22 21.210 50 23.9994 37 297 334 2012-12-19 16:00:00 16 2 718.0
10879 2012-12-19 17:00:00 4 0 1 1 16.40 20.455 50 26.0027 26 536 562 2012-12-19 17:00:00 17 2 718.0
10880 2012-12-19 18:00:00 4 0 1 1 15.58 19.695 50 23.9994 23 546 569 2012-12-19 18:00:00 18 2 718.0
10881 2012-12-19 19:00:00 4 0 1 1 15.58 19.695 50 26.0027 7 329 336 2012-12-19 19:00:00 19 2 718.0
10882 2012-12-19 20:00:00 4 0 1 1 14.76 17.425 57 15.0013 10 231 241 2012-12-19 20:00:00 20 2 718.0
10883 2012-12-19 21:00:00 4 0 1 1 13.94 15.910 61 15.0013 4 164 168 2012-12-19 21:00:00 21 2 718.0
10884 2012-12-19 22:00:00 4 0 1 1 13.94 17.425 61 6.0032 12 117 129 2012-12-19 22:00:00 22 2 718.0
10885 2012-12-19 23:00:00 4 0 1 1 13.12 16.665 66 8.9981 4 84 88 2012-12-19 23:00:00 23 2 718.0

10886 rows × 17 columns
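To make the two new columns easier to check by hand, here is a minimal sketch on three timestamps taken from the table above. It assumes only pandas; the variable names `stamps`, `dayofweek` and `date_days` are illustrative and are not part of the notebook.

```python
import pandas as pd

# Timestamps taken from the first and last rows shown above.
stamps = pd.to_datetime(pd.Series([
    "2011-01-01 00:00:00",
    "2011-01-02 00:00:00",
    "2012-12-19 23:00:00",
]))

# pandas numbers weekdays Monday=0 ... Sunday=6.
dayofweek = stamps.dt.dayofweek

# Whole days elapsed since the first date, ignoring the time of day.
dates = stamps.dt.normalize()
date_days = (dates - dates.iloc[0]).dt.days

print(dayofweek.tolist())
print(date_days.tolist())
```

With this numbering, 2011-01-01 (a Saturday) gets 5 and 2011-01-02 (a Sunday) gets 6, which agrees with the dayofweek column in the table.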

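### So far we have only been guessing; we do not actually know how the data is distributed over the dates. So let's compute a small statistic to see the real distribution: bike rentals for each day of the week, split into registered and unregistered users.

byday = data.groupby('dayofweek')
# Total rentals by unregistered (casual) users
byday['casual'].sum().reset_index()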
| | dayofweek | casual |
|---|---|---|
| 0 | 0 | 46288 |
| 1 | 1 | 35365 |
| 2 | 2 | 34931 |
| 3 | 3 | 37283 |
| 4 | 4 | 47402 |
| 5 | 5 | 100782 |
| 6 | 6 | 90084 |
# Total rentals by registered users
byday['registered'].sum().reset_index()
| | dayofweek | registered |
|---|---|---|
| 0 | 0 | 249008 |
| 1 | 1 | 256620 |
| 2 | 2 | 257295 |
| 3 | 3 | 269118 |
| 4 | 4 | 255102 |
| 5 | 5 | 210736 |
| 6 | 6 | 195462 |
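A quick way to read the two tables is to ask what share of each group's weekly total falls on the weekend. The sketch below copies the totals from the tables above; the helper `weekend_share` is a hypothetical name introduced only for this illustration.

```python
# Per-weekday totals copied from the two tables above (index 0 = Monday ... 6 = Sunday).
casual = [46288, 35365, 34931, 37283, 47402, 100782, 90084]
registered = [249008, 256620, 257295, 269118, 255102, 210736, 195462]

def weekend_share(totals):
    """Fraction of the weekly total that falls on Saturday (5) and Sunday (6)."""
    return (totals[5] + totals[6]) / sum(totals)

casual_share = weekend_share(casual)
registered_share = weekend_share(registered)
print(f"casual: {casual_share:.1%}, registered: {registered_share:.1%}")
```

Two days out of seven would account for roughly 29 percent if rentals were spread evenly. Casual users sit well above that (about 49 percent) and registered users below it (about 24 percent), which suggests the two groups behave differently on weekends.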

### Since weekends behave differently, give Saturday its own column and Sunday its own column

# Assign the flags directly from a boolean mask. The original chained form,
# data.Saturday[data.dayofweek==5]=1, raised pandas' SettingWithCopyWarning.
data['Saturday'] = (data['dayofweek'] == 5).astype(int)
data['Sunday'] = (data['dayofweek'] == 6).astype(int)
data

datetime season holiday workingday weather temp atemp humidity windspeed casual registered count date time hour dayofweek dateDays Saturday Sunday
0 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0.0000 3 13 16 2011-01-01 00:00:00 0 5 0.0 1 0
1 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0.0000 8 32 40 2011-01-01 01:00:00 1 5 0.0 1 0
2 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0.0000 5 27 32 2011-01-01 02:00:00 2 5 0.0 1 0
3 2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75 0.0000 3 10 13 2011-01-01 03:00:00 3 5 0.0 1 0
4 2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75 0.0000 0 1 1 2011-01-01 04:00:00 4 5 0.0 1 0
5 2011-01-01 05:00:00 1 0 0 2 9.84 12.880 75 6.0032 0 1 1 2011-01-01 05:00:00 5 5 0.0 1 0
6 2011-01-01 06:00:00 1 0 0 1 9.02 13.635 80 0.0000 2 0 2 2011-01-01 06:00:00 6 5 0.0 1 0
7 2011-01-01 07:00:00 1 0 0 1 8.20 12.880 86 0.0000 1 2 3 2011-01-01 07:00:00 7 5 0.0 1 0
8 2011-01-01 08:00:00 1 0 0 1 9.84 14.395 75 0.0000 1 7 8 2011-01-01 08:00:00 8 5 0.0 1 0
9 2011-01-01 09:00:00 1 0 0 1 13.12 17.425 76 0.0000 8 6 14 2011-01-01 09:00:00 9 5 0.0 1 0
10 2011-01-01 10:00:00 1 0 0 1 15.58 19.695 76 16.9979 12 24 36 2011-01-01 10:00:00 10 5 0.0 1 0
11 2011-01-01 11:00:00 1 0 0 1 14.76 16.665 81 19.0012 26 30 56 2011-01-01 11:00:00 11 5 0.0 1 0
12 2011-01-01 12:00:00 1 0 0 1 17.22 21.210 77 19.0012 29 55 84 2011-01-01 12:00:00 12 5 0.0 1 0
13 2011-01-01 13:00:00 1 0 0 2 18.86 22.725 72 19.9995 47 47 94 2011-01-01 13:00:00 13 5 0.0 1 0
14 2011-01-01 14:00:00 1 0 0 2 18.86 22.725 72 19.0012 35 71 106 2011-01-01 14:00:00 14 5 0.0 1 0
15 2011-01-01 15:00:00 1 0 0 2 18.04 21.970 77 19.9995 40 70 110 2011-01-01 15:00:00 15 5 0.0 1 0
16 2011-01-01 16:00:00 1 0 0 2 17.22 21.210 82 19.9995 41 52 93 2011-01-01 16:00:00 16 5 0.0 1 0
17 2011-01-01 17:00:00 1 0 0 2 18.04 21.970 82 19.0012 15 52 67 2011-01-01 17:00:00 17 5 0.0 1 0
18 2011-01-01 18:00:00 1 0 0 3 17.22 21.210 88 16.9979 9 26 35 2011-01-01 18:00:00 18 5 0.0 1 0
19 2011-01-01 19:00:00 1 0 0 3 17.22 21.210 88 16.9979 6 31 37 2011-01-01 19:00:00 19 5 0.0 1 0
20 2011-01-01 20:00:00 1 0 0 2 16.40 20.455 87 16.9979 11 25 36 2011-01-01 20:00:00 20 5 0.0 1 0
21 2011-01-01 21:00:00 1 0 0 2 16.40 20.455 87 12.9980 3 31 34 2011-01-01 21:00:00 21 5 0.0 1 0
22 2011-01-01 22:00:00 1 0 0 2 16.40 20.455 94 15.0013 11 17 28 2011-01-01 22:00:00 22 5 0.0 1 0
23 2011-01-01 23:00:00 1 0 0 2 18.86 22.725 88 19.9995 15 24 39 2011-01-01 23:00:00 23 5 0.0 1 0
24 2011-01-02 00:00:00 1 0 0 2 18.86 22.725 88 19.9995 4 13 17 2011-01-02 00:00:00 0 6 1.0 0 1
25 2011-01-02 01:00:00 1 0 0 2 18.04 21.970 94 16.9979 1 16 17 2011-01-02 01:00:00 1 6 1.0 0 1
26 2011-01-02 02:00:00 1 0 0 2 17.22 21.210 100 19.0012 1 8 9 2011-01-02 02:00:00 2 6 1.0 0 1
27 2011-01-02 03:00:00 1 0 0 2 18.86 22.725 94 12.9980 2 4 6 2011-01-02 03:00:00 3 6 1.0 0 1
28 2011-01-02 04:00:00 1 0 0 2 18.86 22.725 94 12.9980 2 1 3 2011-01-02 04:00:00 4 6 1.0 0 1
29 2011-01-02 06:00:00 1 0 0 3 17.22 21.210 77 19.9995 0 2 2 2011-01-02 06:00:00 6 6 1.0 0 1
...
10856 2012-12-18 18:00:00 4 0 1 1 15.58 19.695 46 22.0028 13 512 525 2012-12-18 18:00:00 18 1 717.0 0 0
10857 2012-12-18 19:00:00 4 0 1 1 15.58 19.695 46 26.0027 19 334 353 2012-12-18 19:00:00 19 1 717.0 0 0
10858 2012-12-18 20:00:00 4 0 1 1 14.76 16.665 50 16.9979 4 264 268 2012-12-18 20:00:00 20 1 717.0 0 0
10859 2012-12-18 21:00:00 4 0 1 1 14.76 17.425 50 15.0013 9 159 168 2012-12-18 21:00:00 21 1 717.0 0 0
10860 2012-12-18 22:00:00 4 0 1 1 13.94 16.665 49 0.0000 5 127 132 2012-12-18 22:00:00 22 1 717.0 0 0
10861 2012-12-18 23:00:00 4 0 1 1 13.94 17.425 49 6.0032 1 80 81 2012-12-18 23:00:00 23 1 717.0 0 0
10862 2012-12-19 00:00:00 4 0 1 1 12.30 15.910 61 0.0000 6 35 41 2012-12-19 00:00:00 0 2 718.0 0 0
10863 2012-12-19 01:00:00 4 0 1 1 12.30 15.910 65 6.0032 1 14 15 2012-12-19 01:00:00 1 2 718.0 0 0
10864 2012-12-19 02:00:00 4 0 1 1 11.48 15.150 65 6.0032 1 2 3 2012-12-19 02:00:00 2 2 718.0 0 0
10865 2012-12-19 03:00:00 4 0 1 1 10.66 13.635 75 8.9981 0 5 5 2012-12-19 03:00:00 3 2 718.0 0 0
10866 2012-12-19 04:00:00 4 0 1 1 9.84 12.120 75 8.9981 1 6 7 2012-12-19 04:00:00 4 2 718.0 0 0
10867 2012-12-19 05:00:00 4 0 1 1 10.66 14.395 75 6.0032 2 29 31 2012-12-19 05:00:00 5 2 718.0 0 0
10868 2012-12-19 06:00:00 4 0 1 1 9.84 12.880 75 6.0032 3 109 112 2012-12-19 06:00:00 6 2 718.0 0 0
10869 2012-12-19 07:00:00 4 0 1 1 10.66 13.635 75 8.9981 3 360 363 2012-12-19 07:00:00 7 2 718.0 0 0
10870 2012-12-19 08:00:00 4 0 1 1 9.84 12.880 87 7.0015 13 665 678 2012-12-19 08:00:00 8 2 718.0 0 0
10871 2012-12-19 09:00:00 4 0 1 1 11.48 14.395 75 7.0015 8 309 317 2012-12-19 09:00:00 9 2 718.0 0 0
10872 2012-12-19 10:00:00 4 0 1 1 13.12 16.665 70 7.0015 17 147 164 2012-12-19 10:00:00 10 2 718.0 0 0
10873 2012-12-19 11:00:00 4 0 1 1 16.40 20.455 54 15.0013 31 169 200 2012-12-19 11:00:00 11 2 718.0 0 0
10874 2012-12-19 12:00:00 4 0 1 1 16.40 20.455 54 19.0012 33 203 236 2012-12-19 12:00:00 12 2 718.0 0 0
10875 2012-12-19 13:00:00 4 0 1 1 17.22 21.210 50 12.9980 30 183 213 2012-12-19 13:00:00 13 2 718.0 0 0
10876 2012-12-19 14:00:00 4 0 1 1 17.22 21.210 50 12.9980 33 185 218 2012-12-19 14:00:00 14 2 718.0 0 0
10877 2012-12-19 15:00:00 4 0 1 1 17.22 21.210 50 19.0012 28 209 237 2012-12-19 15:00:00 15 2 718.0 0 0
10878 2012-12-19 16:00:00 4 0 1 1 17.22 21.210 50 23.9994 37 297 334 2012-12-19 16:00:00 16 2 718.0 0 0
10879 2012-12-19 17:00:00 4 0 1 1 16.40 20.455 50 26.0027 26 536 562 2012-12-19 17:00:00 17 2 718.0 0 0
10880 2012-12-19 18:00:00 4 0 1 1 15.58 19.695 50 23.9994 23 546 569 2012-12-19 18:00:00 18 2 718.0 0 0
10881 2012-12-19 19:00:00 4 0 1 1 15.58 19.695 50 26.0027 7 329 336 2012-12-19 19:00:00 19 2 718.0 0 0
10882 2012-12-19 20:00:00 4 0 1 1 14.76 17.425 57 15.0013 10 231 241 2012-12-19 20:00:00 20 2 718.0 0 0
10883 2012-12-19 21:00:00 4 0 1 1 13.94 15.910 61 15.0013 4 164 168 2012-12-19 21:00:00 21 2 718.0 0 0
10884 2012-12-19 22:00:00 4 0 1 1 13.94 17.425 61 6.0032 12 117 129 2012-12-19 22:00:00 22 2 718.0 0 0
10885 2012-12-19 23:00:00 4 0 1 1 13.12 16.665 66 8.9981 4 84 88 2012-12-19 23:00:00 23 2 718.0 0 0

10886 rows × 19 columns
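Chained indexing of the form `data.Saturday[mask] = 1` is the pattern that pandas flags with SettingWithCopyWarning, because the assignment may land on a temporary copy. A minimal sketch of two alternatives on a toy frame follows; the frame `df` and its values are made up for illustration.

```python
import pandas as pd

# Toy frame; the dayofweek values follow the Monday=0 ... Sunday=6 convention.
df = pd.DataFrame({"dayofweek": [4, 5, 6, 0]})

# Create the flag column, then set it with a single .loc call
# (row mask and column label in one indexing step).
df["Saturday"] = 0
df.loc[df["dayofweek"] == 5, "Saturday"] = 1

# Equivalent one-liner: cast the boolean mask to integers.
df["Sunday"] = (df["dayofweek"] == 6).astype(int)

print(df)
```

Both forms write to the frame itself, so they do not depend on whether an intermediate selection is a view or a copy.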

### Drop the original time fields and similar columns from the data

# remove old data features
dataRel = data.drop(['datetime', 'count','date','time','dayofweek'], axis=1)
dataRel.head()
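| | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | hour | dateDays | Saturday | Sunday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 0 | 0.0 | 1 | 0 |
| 1 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 1 | 0.0 | 1 | 0 |
| 2 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 2 | 0.0 | 1 | 0 |
| 3 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3 | 10 | 3 | 0.0 | 1 | 0 |
| 4 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0 | 1 | 4 | 0.0 | 1 | 0 |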

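### Feature vectorization

We plan to use scikit-learn for modeling. A pandas DataFrame has a method that converts it directly into Python dicts. We also need to separate the discrete features from the continuous ones at this point, so that each group can be processed differently later.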

from sklearn.feature_extraction import DictVectorizer
# Put the continuous-valued attributes into a list of dicts (one dict per row)
featureConCols = ['temp','atemp','humidity','windspeed','dateDays','hour']
dataFeatureCon = dataRel[featureConCols]
dataFeatureCon = dataFeatureCon.fillna( 'NA' ) #in case I missed any
X_dictCon = dataFeatureCon.to_dict(orient='records')
# print(X_dictCon)
# Put the discrete-valued attributes into another list of dicts
featureCatCols = ['season','holiday','workingday','weather','Saturday', 'Sunday']
dataFeatureCat = dataRel[featureCatCols]
dataFeatureCat = dataFeatureCat.fillna( 'NA' ) #in case I missed any
X_dictCat = dataFeatureCat.to_dict(orient='records')
# Vectorize the features (DictVectorizer orders the output columns by sorted feature name)
vec = DictVectorizer(sparse = False)
X_vec_cat = vec.fit_transform(X_dictCat)
X_vec_con = vec.fit_transform(X_dictCon)
dataFeatureCon.head()
| | temp | atemp | humidity | windspeed | dateDays | hour |
|---|---|---|---|---|---|---|
| 0 | 9.84 | 14.395 | 81 | 0.0 | 0.0 | 0 |
| 1 | 9.02 | 13.635 | 80 | 0.0 | 0.0 | 1 |
| 2 | 9.02 | 13.635 | 80 | 0.0 | 0.0 | 2 |
| 3 | 9.84 | 14.395 | 75 | 0.0 | 0.0 | 3 |
| 4 | 9.84 | 14.395 | 75 | 0.0 | 0.0 | 4 |

X_vec_con

array([[  14.395 ,    0.    ,    0.    ,   81.    ,    9.84  ,    0.    ],
       [  13.635 ,    0.    ,    1.    ,   80.    ,    9.02  ,    0.    ],
       [  13.635 ,    0.    ,    2.    ,   80.    ,    9.02  ,    0.    ],
       ...,
       [  15.91  ,  718.    ,   21.    ,   61.    ,   13.94  ,   15.0013],
       [  17.425 ,  718.    ,   22.    ,   61.    ,   13.94  ,    6.0032],
       [  16.665 ,  718.    ,   23.    ,   66.    ,   13.12  ,    8.9981]])

dataFeatureCat.head()
| | season | holiday | workingday | weather | Saturday | Sunday |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| 1 | 1 | 0 | 0 | 1 | 1 | 0 |
| 2 | 1 | 0 | 0 | 1 | 1 | 0 |
| 3 | 1 | 0 | 0 | 1 | 1 | 0 |
| 4 | 1 | 0 | 0 | 1 | 1 | 0 |

X_vec_cat

array([[ 1.,  0.,  0.,  1.,  1.,  0.],
       [ 1.,  0.,  0.,  1.,  1.,  0.],
       [ 1.,  0.,  0.,  1.,  1.,  0.],
       ...,
       [ 0.,  0.,  0.,  4.,  1.,  1.],
       [ 0.,  0.,  0.,  4.,  1.,  1.],
       [ 0.,  0.,  0.,  4.,  1.,  1.]])
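The columns of the two arrays above are not in the order of `featureConCols` and `featureCatCols`: DictVectorizer sorts feature names, so `atemp` comes before `temp` and the capitalized `Saturday` comes before the lowercase names. A minimal sketch on two toy rows, assuming scikit-learn is installed, shows how to read the order back from the fitted vectorizer.

```python
from sklearn.feature_extraction import DictVectorizer

# Two toy rows using some of the column names from this post.
rows = [
    {"temp": 9.84, "atemp": 14.395, "hour": 0, "Saturday": 1},
    {"temp": 9.02, "atemp": 13.635, "hour": 1, "Saturday": 1},
]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(rows)

# feature_names_ lists the column order of X; names are sorted,
# and uppercase letters sort before lowercase ones.
print(vec.feature_names_)
print(X)
```

Keeping a separate vectorizer per feature group (rather than refitting one `vec`) would preserve both name lists for later inspection.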

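### Standardize the continuous features

We need to do some processing on the continuous attributes. The most basic step is standardization, so that each continuous attribute has mean 0 and variance 1 afterwards.
Data in this form generally helps model training converge and can improve model accuracy.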

from sklearn import preprocessing
# Standardize the continuous-valued data
scaler = preprocessing.StandardScaler().fit(X_vec_con)
X_vec_con = scaler.transform(X_vec_con)
X_vec_con
array([[-1.09273697, -1.70912256, -1.66894356,  0.99321305, -1.33366069, -1.56775367],
       [-1.18242083, -1.70912256, -1.52434128,  0.94124921, -1.43890721, -1.56775367],
       [-1.18242083, -1.70912256, -1.379739  ,  0.94124921, -1.43890721, -1.56775367],
       ...,
       [-0.91395927,  1.70183906,  1.36770431, -0.04606385, -0.80742813,  0.26970368],
       [-0.73518157,  1.70183906,  1.51230659, -0.04606385, -0.80742813, -0.83244247],
       [-0.82486544,  1.70183906,  1.65690887,  0.21375537, -0.91267464, -0.46560752]])
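For intuition, standardization can be written out by hand with NumPy. The sketch below uses a toy matrix whose values are borrowed from the temp and humidity columns; it is an illustration of the z-score formula and is not a replacement for the scaler call above.

```python
import numpy as np

# Toy matrix: rows are samples, columns are two continuous features
# (values borrowed from the temp and humidity columns above).
X = np.array([[9.84, 81.0],
              [9.02, 80.0],
              [9.02, 80.0],
              [9.84, 75.0]])

# Column-wise z-score: subtract the mean, divide by the population
# standard deviation (ddof=0), which is what StandardScaler uses.
mean = X.mean(axis=0)
std = X.std(axis=0)
X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0))  # close to 0 in every column
print(X_scaled.std(axis=0))   # close to 1 in every column
```

In a real pipeline the mean and standard deviation would be computed on the training data only and then reused on the test data, which is what keeping the fitted `scaler` object allows.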

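### Categorical feature encoding

The most common choice is one-hot encoding. For example, the colors red, blue, and yellow are encoded as [1, 0, 0], [0, 1, 0], and [0, 0, 1].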

from sklearn import preprocessing
# One-hot encoding
enc = preprocessing.OneHotEncoder()
enc.fit(X_vec_cat)
X_vec_cat = enc.transform(X_vec_cat).toarray()
X_vec_cat
array([[ 0.,  1.,  1., ...,  0.,  1.,  0.],
       [ 0.,  1.,  1., ...,  0.,  1.,  0.],
       [ 0.,  1.,  1., ...,  0.,  1.,  0.],
       ...,
       [ 1.,  0.,  1., ...,  0.,  0.,  1.],
       [ 1.,  0.,  1., ...,  0.,  0.,  1.],
       [ 1.,  0.,  1., ...,  0.,  0.,  1.]])
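The red, blue, yellow example from the text can be reproduced with a few lines of NumPy. This is a hand-rolled sketch for illustration; the names `categories`, `index` and `one_hot` are hypothetical, and the category order is fixed by hand to match the text.

```python
import numpy as np

# Assumption: the category order is fixed by hand as red, blue, yellow,
# matching the example in the text.
categories = ["red", "blue", "yellow"]
index = {name: i for i, name in enumerate(categories)}

def one_hot(values):
    """Return one row per value with a single 1 in that value's column."""
    encoded = np.zeros((len(values), len(categories)), dtype=int)
    for row, value in enumerate(values):
        encoded[row, index[value]] = 1
    return encoded

encoded = one_hot(["red", "blue", "yellow", "blue"])
print(encoded)
```

scikit-learn's OneHotEncoder derives the categories from the data and sorts them, so its column order can differ from a hand-picked order like the one used here.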

### Put the features together

Combine the discrete and the continuous features into one matrix.

import numpy as np
# combine cat & con features
X_vec = np.concatenate((X_vec_con,X_vec_cat), axis=1)
X_vec
array([[-1.09273697, -1.70912256, -1.66894356, ...,  0.        ,  1.        ,  0.        ],
       [-1.18242083, -1.70912256, -1.52434128, ...,  0.        ,  1.        ,  0.        ],
       [-1.18242083, -1.70912256, -1.379739  , ...,  0.        ,  1.        ,  0.        ],
       ...,
       [-0.91395927,  1.70183906,  1.36770431, ...,  0.        ,  0.        ,  1.        ],
       [-0.73518157,  1.70183906,  1.51230659, ...,  0.        ,  0.        ,  1.        ],
       [-0.82486544,  1.70183906,  1.65690887, ...,  0.        ,  0.        ,  1.        ]])

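In the final feature matrix, the first 6 columns are the standardized continuous features and the remaining columns are the encoded discrete features.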
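The column layout can be checked on a small example. The arrays below are toy stand-ins (2 continuous columns and 3 one-hot columns instead of the real widths), and the names `X_con`, `X_cat` and `X_all` are illustrative.

```python
import numpy as np

# Toy stand-ins: 3 rows, 2 standardized continuous columns and 3 one-hot columns.
X_con = np.array([[-1.0, 0.5],
                  [0.0, -1.2],
                  [1.0, 0.7]])
X_cat = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])

# axis=1 joins column-wise, so the row counts must match.
X_all = np.concatenate((X_con, X_cat), axis=1)

n_con = X_con.shape[1]
print(X_all.shape)
# The first n_con columns are the continuous block, the rest the one-hot block.
print(X_all[:, :n_con])
print(X_all[:, n_con:])
```

Slicing by the width of the continuous block recovers each group, which is handy when a later step needs to treat the two groups differently.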

### Process the target values as well

Get the targets as floating-point values

# Vectorize Y
Y_vec_reg = dataRel['registered'].values.astype(float)
Y_vec_cas = dataRel['casual'].values.astype(float)
# Look at the processed target values
Y_vec_reg
array([  13.,   32.,   27., ...,  164.,  117.,   84.])
Y_vec_cas
array([  3.,   8.,   5., ...,   4.,  12.,   4.])
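One possible reading of why `count` was dropped and two target vectors were built: in the rows displayed earlier, `count` equals `casual + registered`, so predictions for the two user groups could be added to recover the total. The post does not state this, so treat the sketch below as an interpretation; it only checks the relationship on the first five rows shown.

```python
import numpy as np

# First five rows of the casual, registered and count columns shown earlier.
casual = np.array([3, 8, 5, 3, 0], dtype=float)
registered = np.array([13, 32, 27, 10, 1], dtype=float)
count = np.array([16, 40, 32, 13, 1], dtype=float)

# In these rows the total is exactly the sum of the two user groups.
total = casual + registered
print(total)
print(np.array_equal(total, count))
```

If the relationship holds for the whole data set, keeping `count` as a feature while predicting `registered` or `casual` would leak the answer, which would be another reason to drop it.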
