GitHub repo: https://github.com/guipsamora/pandas_exercises
Dataset: https://www.aliyundrive.com/s/Wk6L6R67rrn
1. Getting to know the data
1.1 Import the dataset and take a first look
import pandas as pd
chipotle = pd.read_csv('001_chipotle.tsv',sep='\t')
chipotle.head()
   order_id  quantity                              item_name                                     choice_description item_price
0         1         1           Chips and Fresh Tomato Salsa                                                    NaN      $2.39
1         1         1                                   Izze                                           [Clementine]      $3.39
2         1         1                       Nantucket Nectar                                                [Apple]      $3.39
3         1         1  Chips and Tomatillo-Green Chili Salsa                                                    NaN      $2.39
4         2         2                           Chicken Bowl  [Tomatillo-Red Chili Salsa (Hot), [Black Beans...     $16.98
chipotle.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   order_id            4622 non-null   int64
 1   quantity            4622 non-null   int64
 2   item_name           4622 non-null   object
 3   choice_description  3376 non-null   object
 4   item_price          4622 non-null   object
dtypes: int64(2), object(3)
memory usage: 180.7+ KB
chipotle.columns
Index(['order_id', 'quantity', 'item_name', 'choice_description', 'item_price'], dtype='object')
chipotle.index
RangeIndex(start=0, stop=4622, step=1)
1.2 Which item was ordered most often?
- as_index=False keeps the default integer index,
- so in the display the column names stay on a single row (no "merged cells"), which is easier to read.
item_buy_count = chipotle[['item_name','quantity']].groupby('item_name').sum()['quantity'].nlargest(1)
item_buy_count
item_name
Chicken Bowl 761
Name: quantity, dtype: int64
item_buy_count_1 = chipotle[['item_name','quantity']].groupby('item_name', as_index=False).agg({'quantity':'sum'})
item_buy_count_1.sort_values('quantity', ascending=False, inplace=True)
item_buy_count_1.head()
              item_name  quantity
17         Chicken Bowl       761
18      Chicken Burrito       591
25  Chips and Guacamole       506
39        Steak Burrito       386
10    Canned Soft Drink       351
1.3 How many different items were ordered?
chipotle['item_name'].nunique()
50
# Quick bar chart
s = chipotle['item_name'].value_counts()
from matplotlib import pyplot as plt
plt.figure(figsize=(20, 4), dpi=100)
plt.xticks(fontsize=13, rotation='vertical')
plt.bar(s.index, s.values)
<BarContainer object of 50 artists>
1.4 Ranking of choice_description among the ordered items
chipotle['choice_description'].value_counts().head()
[Diet Coke] 134
[Coke] 123
[Sprite] 77
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream, Lettuce]] 42
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream, Guacamole, Lettuce]] 40
Name: choice_description, dtype: int64
1.5 How many items were ordered in total?
chipotle['quantity'].sum()
4972
1.6 Convert item_price to float
chipotle['item_price'] = chipotle['item_price'].str.lstrip('$').astype('float')
chipotle.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   order_id            4622 non-null   int64
 1   quantity            4622 non-null   int64
 2   item_name           4622 non-null   object
 3   choice_description  3376 non-null   object
 4   item_price          4622 non-null   float64
dtypes: float64(1), int64(2), object(2)
memory usage: 180.7+ KB
# Alternative answer: the original cell referenced the undefined name `chipo` and raised
# "NameError: name 'chipo' is not defined". With the right name (run before the conversion in 1.6):
chipotle['item_price'] = chipotle['item_price'].apply(lambda x: float(x[1:-1]))
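An equivalent, slightly more defensive conversion uses `pd.to_numeric` after stripping the dollar sign; a minimal sketch on toy prices (the two-row frame below stands in for the chipotle data):

```python
import pandas as pd

# Toy frame standing in for the chipotle data (assumption: prices look like '$2.39').
df = pd.DataFrame({'item_price': ['$2.39', '$16.98']})

# Strip the leading '$' and coerce to float in one pass;
# errors='raise' surfaces any malformed price instead of silently producing NaN.
df['item_price'] = pd.to_numeric(df['item_price'].str.lstrip('$'), errors='raise')
```

This raises immediately on a price that doesn't parse, which is easier to debug than a stray NaN.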
1.7 Total revenue
chipotle['sub_total'] = (chipotle['item_price'] * chipotle['quantity'])
chipotle['sub_total'].sum()
39237.02
1.8 Total number of orders
chipotle['order_id'].nunique()
1834
1.9 Average order value
chipotle['sub_total'].sum()/chipotle['order_id'].nunique()
21.39423118865867
chipotle[['order_id','sub_total']].groupby(by=['order_id']).agg({'sub_total':'sum'})['sub_total'].mean()
21.39423118865867
2. Filtering and sorting
2.1 Data import and first look
import pandas as pd
euro = pd.read_csv('Euro2012_stats.csv')
euro.head()
             Team  Goals  Shots on target  Shots off target Shooting Accuracy % Goals-to-shots  Total shots (inc. Blocked)  Hit Woodwork  Penalty goals  Penalties not scored  ...  Saves made Saves-to-shots ratio  Fouls Won  Fouls Conceded  Offsides  Yellow Cards  Red Cards  Subs on  Subs off  Players Used
0         Croatia      4               13                12             51.9%            16.0%                          32             0              0                     0  ...          13                81.3%         41              62         2             9          0        9         9            16
1  Czech Republic      4               13                18             41.9%            12.9%                          39             0              0                     0  ...           9                60.1%         53              73         8             7          0       11        11            19
2         Denmark      4               10                10             50.0%            20.0%                          27             1              0                     0  ...          10                66.7%         25              38         8             4          0        7         7            15
3         England      5               11                18             50.0%            17.2%                          40             0              0                     0  ...          22                88.1%         43              45         6             5          0       11        11            16
4          France      3               22                24             37.9%             6.5%                          65             1              0                     0  ...           6                54.6%         36              51         5             6          0       11        11            19

5 rows × 35 columns
euro.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 35 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   Team                        16 non-null     object
 1   Goals                       16 non-null     int64
 2   Shots on target             16 non-null     int64
 3   Shots off target            16 non-null     int64
 4   Shooting Accuracy           16 non-null     object
 5   % Goals-to-shots            16 non-null     object
 6   Total shots (inc. Blocked)  16 non-null     int64
 7   Hit Woodwork                16 non-null     int64
 8   Penalty goals               16 non-null     int64
 9   Penalties not scored        16 non-null     int64
 10  Headed goals                16 non-null     int64
 11  Passes                      16 non-null     int64
 12  Passes completed            16 non-null     int64
 13  Passing Accuracy            16 non-null     object
 14  Touches                     16 non-null     int64
 15  Crosses                     16 non-null     int64
 16  Dribbles                    16 non-null     int64
 17  Corners Taken               16 non-null     int64
 18  Tackles                     16 non-null     int64
 19  Clearances                  16 non-null     int64
 20  Interceptions               16 non-null     int64
 21  Clearances off line         15 non-null     float64
 22  Clean Sheets                16 non-null     int64
 23  Blocks                      16 non-null     int64
 24  Goals conceded              16 non-null     int64
 25  Saves made                  16 non-null     int64
 26  Saves-to-shots ratio        16 non-null     object
 27  Fouls Won                   16 non-null     int64
 28  Fouls Conceded              16 non-null     int64
 29  Offsides                    16 non-null     int64
 30  Yellow Cards                16 non-null     int64
 31  Red Cards                   16 non-null     int64
 32  Subs on                     16 non-null     int64
 33  Subs off                    16 non-null     int64
 34  Players Used                16 non-null     int64
dtypes: float64(1), int64(29), object(5)
memory usage: 4.5+ KB
2.1 How many teams took part in Euro 2012?
euro['Team'].nunique()
16
2.2 Store the columns Team, Yellow Cards and Red Cards in a DataFrame named discripline
discripline = euro[['Team','Yellow Cards', 'Red Cards']]
discripline
                   Team  Yellow Cards  Red Cards
0               Croatia             9          0
1        Czech Republic             7          0
2               Denmark             4          0
3               England             5          0
4                France             6          0
5               Germany             4          0
6                Greece             9          1
7                 Italy            16          0
8           Netherlands             5          0
9                Poland             7          1
10             Portugal            12          0
11  Republic of Ireland             6          1
12               Russia             6          0
13                Spain            11          0
14               Sweden             7          0
15              Ukraine             5          0
2.3 Sort discripline
discripline.sort_values(['Yellow Cards', 'Red Cards'], ascending=[False, False], inplace=True)
discripline
C:\Users\LiCN\AppData\Local\Temp/ipykernel_19644/3135130171.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  discripline.sort_values(['Yellow Cards', 'Red Cards'], ascending=[False, False], inplace=True)
                   Team  Yellow Cards  Red Cards
7                 Italy            16          0
10             Portugal            12          0
13                Spain            11          0
6                Greece             9          1
0               Croatia             9          0
9                Poland             7          1
1        Czech Republic             7          0
14               Sweden             7          0
11  Republic of Ireland             6          1
4                France             6          0
12               Russia             6          0
3               England             5          0
8           Netherlands             5          0
15              Ukraine             5          0
2               Denmark             4          0
5               Germany             4          0
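The SettingWithCopyWarning above can be avoided by taking an explicit copy before sorting in place; a minimal sketch with a two-row toy frame standing in for the Euro 2012 data:

```python
import pandas as pd

# Toy stand-in for the euro DataFrame (assumption: same three columns as discripline).
euro_toy = pd.DataFrame({'Team': ['Spain', 'Italy'],
                         'Yellow Cards': [11, 16],
                         'Red Cards': [0, 0]})

# .copy() makes discipline its own DataFrame rather than a view of a slice,
# so the in-place sort no longer triggers SettingWithCopyWarning.
discipline = euro_toy[['Team', 'Yellow Cards', 'Red Cards']].copy()
discipline.sort_values(['Yellow Cards', 'Red Cards'], ascending=False, inplace=True)
```

The same `.copy()` applied when building discripline in 2.2 would silence the warning in 2.3.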
2.4 Mean number of yellow cards per team
discripline['Yellow Cards'].mean()
7.4375
2.5 Teams with more than 6 yellow cards
discripline[discripline['Yellow Cards']>6]
# discripline.Yellow Cards > 6 would not work: attribute access fails for column names containing spaces
                   Team  Yellow Cards  Red Cards
7                 Italy            16          0
10             Portugal            12          0
13                Spain            11          0
6                Greece             9          1
0               Croatia             9          0
9                Poland             7          1
1        Czech Republic             7          0
14               Sweden             7          0
2.6 Select the teams whose name starts with G
euro[euro['Team'].str.startswith('G')]
      Team  Goals  Shots on target  Shots off target Shooting Accuracy % Goals-to-shots  Total shots (inc. Blocked)  Hit Woodwork  Penalty goals  Penalties not scored  ...  Saves made Saves-to-shots ratio  Fouls Won  Fouls Conceded  Offsides  Yellow Cards  Red Cards  Subs on  Subs off  Players Used
5  Germany     10               32                32             47.8%            15.6%                          80             2              1                     0  ...          10                62.6%         63              49        12             4          0       15        15            17
6   Greece      5                8                18             30.7%            19.2%                          32             1              1                     1  ...          13                65.1%         67              48        12             9          1       12        12            20

2 rows × 35 columns
2.7 Select the first 7 columns
euro.iloc[:,0:7]
                   Team  Goals  Shots on target  Shots off target Shooting Accuracy % Goals-to-shots  Total shots (inc. Blocked)
0               Croatia      4               13                12             51.9%            16.0%                          32
1        Czech Republic      4               13                18             41.9%            12.9%                          39
2               Denmark      4               10                10             50.0%            20.0%                          27
3               England      5               11                18             50.0%            17.2%                          40
4                France      3               22                24             37.9%             6.5%                          65
5               Germany     10               32                32             47.8%            15.6%                          80
6                Greece      5                8                18             30.7%            19.2%                          32
7                 Italy      6               34                45             43.0%             7.5%                         110
8           Netherlands      2               12                36             25.0%             4.1%                          60
9                Poland      2               15                23             39.4%             5.2%                          48
10             Portugal      6               22                42             34.3%             9.3%                          82
11  Republic of Ireland      1                7                12             36.8%             5.2%                          28
12               Russia      5                9                31             22.5%            12.5%                          59
13                Spain     12               42                33             55.9%            16.0%                         100
14               Sweden      5               17                19             47.2%            13.8%                          39
15              Ukraine      2                7                26             21.2%             6.0%                          38
2.8 Select every column except the last 3
euro.iloc[:,:-3]
                   Team  Goals  Shots on target  Shots off target Shooting Accuracy % Goals-to-shots  Total shots (inc. Blocked)  Hit Woodwork  Penalty goals  Penalties not scored  ...  Clean Sheets  Blocks  Goals conceded  Saves made Saves-to-shots ratio  Fouls Won  Fouls Conceded  Offsides  Yellow Cards  Red Cards
0               Croatia      4               13                12             51.9%            16.0%                          32             0              0                     0  ...             0      10               3          13                81.3%         41              62         2             9          0
1        Czech Republic      4               13                18             41.9%            12.9%                          39             0              0                     0  ...             1      10               6           9                60.1%         53              73         8             7          0
2               Denmark      4               10                10             50.0%            20.0%                          27             1              0                     0  ...             1      10               5          10                66.7%         25              38         8             4          0
3               England      5               11                18             50.0%            17.2%                          40             0              0                     0  ...             2      29               3          22                88.1%         43              45         6             5          0
4                France      3               22                24             37.9%             6.5%                          65             1              0                     0  ...             1       7               5           6                54.6%         36              51         5             6          0
5               Germany     10               32                32             47.8%            15.6%                          80             2              1                     0  ...             1      11               6          10                62.6%         63              49        12             4          0
6                Greece      5                8                18             30.7%            19.2%                          32             1              1                     1  ...             1      23               7          13                65.1%         67              48        12             9          1
7                 Italy      6               34                45             43.0%             7.5%                         110             2              0                     0  ...             2      18               7          20                74.1%        101              89        16            16          0
8           Netherlands      2               12                36             25.0%             4.1%                          60             2              0                     0  ...             0       9               5          12                70.6%         35              30         3             5          0
9                Poland      2               15                23             39.4%             5.2%                          48             0              0                     0  ...             0       8               3           6                66.7%         48              56         3             7          1
10             Portugal      6               22                42             34.3%             9.3%                          82             6              0                     0  ...             2      11               4          10                71.5%         73              90        10            12          0
11  Republic of Ireland      1                7                12             36.8%             5.2%                          28             0              0                     0  ...             0      23               9          17                65.4%         43              51        11             6          1
12               Russia      5                9                31             22.5%            12.5%                          59             2              0                     0  ...             0       8               3          10                77.0%         34              43         4             6          0
13                Spain     12               42                33             55.9%            16.0%                         100             0              1                     0  ...             5       8               1          15                93.8%        102              83        19            11          0
14               Sweden      5               17                19             47.2%            13.8%                          39             3              0                     0  ...             1      12               5           8                61.6%         35              51         7             7          0
15              Ukraine      2                7                26             21.2%             6.0%                          38             0              0                     0  ...             0       4               4          13                76.5%         48              31         4             5          0

16 rows × 32 columns
2.9 Find the Shooting Accuracy of England, Italy and Russia
euro[euro['Team'].isin(['England', 'Italy', 'Russia'])][['Team','Shooting Accuracy']]
       Team Shooting Accuracy
3   England             50.0%
7     Italy             43.0%
12   Russia             22.5%
euro.loc[euro['Team'].isin(['England', 'Italy', 'Russia']), ['Team','Shooting Accuracy']]
       Team Shooting Accuracy
3   England             50.0%
7     Italy             43.0%
12   Russia             22.5%
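The same selection can also be written with `query()`, which some find more readable; a sketch on a toy frame standing in for the euro data:

```python
import pandas as pd

# Toy stand-in for the euro DataFrame (only the two columns we need).
euro_toy = pd.DataFrame({'Team': ['England', 'France', 'Italy', 'Russia'],
                         'Shooting Accuracy': ['50.0%', '37.9%', '43.0%', '22.5%']})

# query() expresses the membership filter as a readable string.
subset = euro_toy.query("Team in ['England', 'Italy', 'Russia']")[['Team', 'Shooting Accuracy']]
```

Note that column names containing spaces must be wrapped in backticks inside a `query()` string.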
3. Grouping: exploring alcohol consumption data
import pandas as pd
drinks = pd.read_csv('drinks.csv')
drinks.head()
       country  beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol continent
0  Afghanistan              0                0              0                           0.0        AS
1      Albania             89              132             54                           4.9        EU
2      Algeria             25                0             14                           0.7        AF
3      Andorra            245              138            312                          12.4        EU
4       Angola            217               57             45                           5.9        AF
drinks.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   country                       193 non-null    object
 1   beer_servings                 193 non-null    int64
 2   spirit_servings               193 non-null    int64
 3   wine_servings                 193 non-null    int64
 4   total_litres_of_pure_alcohol  193 non-null    float64
 5   continent                     170 non-null    object
dtypes: float64(1), int64(3), object(2)
memory usage: 9.2+ KB
3.1 Which continent consumes the most beer on average?
drinks.groupby('continent')['beer_servings'].mean().sort_values(ascending=False).head(1)
continent
EU    193.777778
Name: beer_servings, dtype: float64
3.2 Print descriptive statistics of wine consumption for each continent
drinks.groupby('continent')['wine_servings'].describe()
           count        mean        std  min   25%    50%     75%    max
continent
AF          53.0   16.264151  38.846419  0.0   1.0    2.0   13.00  233.0
AS          44.0    9.068182  21.667034  0.0   0.0    1.0    8.00  123.0
EU          45.0  142.222222  97.421738  0.0  59.0  128.0  195.00  370.0
OC          16.0   35.625000  64.555790  0.0   1.0    8.5   23.25  212.0
SA          12.0   62.416667  88.620189  1.0   3.0   12.0   98.50  221.0
3.3 Print the mean consumption of each drink type for each continent
drinks.groupby('continent').mean(numeric_only=True)  # numeric_only skips the country strings (required on pandas >= 2.0)
           beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol
continent
AF             61.471698        16.339623      16.264151                      3.007547
AS             37.045455        60.840909       9.068182                      2.170455
EU            193.777778       132.555556     142.222222                      8.617778
OC             89.687500        58.437500      35.625000                      3.381250
SA            175.083333       114.750000      62.416667                      6.308333
3.4 Print the median consumption of each drink type for each continent
drinks.groupby('continent').median(numeric_only=True)  # numeric_only skips the country strings (required on pandas >= 2.0)
           beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol
continent
AF                  32.0              3.0            2.0                          2.30
AS                  17.5             16.0            1.0                          1.20
EU                 219.0            122.0          128.0                         10.00
OC                  52.5             37.0            8.5                          1.75
SA                 162.5            108.5           12.0                          6.85
3.5 Print the mean, max and min spirit consumption for each continent
drinks.groupby('continent')['spirit_servings'].agg(['mean','max','min'])
                 mean  max  min
continent
AF          16.339623  152    0
AS          60.840909  326    0
EU         132.555556  373    0
OC          58.437500  254    0
SA         114.750000  302   25
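Named aggregation (available since pandas 0.25) lets you label the output columns in the same call; a sketch on toy data standing in for the drinks dataset:

```python
import pandas as pd

# Toy stand-in for the drinks DataFrame.
drinks_toy = pd.DataFrame({'continent': ['EU', 'EU', 'AS'],
                           'spirit_servings': [132, 138, 0]})

# Named aggregation: keyword names become the output column names.
stats = drinks_toy.groupby('continent')['spirit_servings'].agg(
    mean_servings='mean', max_servings='max', min_servings='min')
```

This avoids the generic 'mean'/'max'/'min' headers when the result feeds into a report.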
4. The apply function (exploring US crime data, 1960-2014)
4.1 Data import and first look
import pandas as pd
crime = pd.read_csv('US_Crime_Rates_1960_2014.csv')
crime.head()
   Year  Population    Total  Violent  Property  Murder  Forcible_Rape  Robbery  Aggravated_assault  Burglary  Larceny_Theft  Vehicle_Theft
0  1960   179323175  3384200   288460   3095700    9110          17190   107840              154320    912100        1855400         328200
1  1961   182992000  3488000   289390   3198600    8740          17220   106670              156760    949600        1913000         336000
2  1962   185771000  3752200   301510   3450700    8530          17550   110860              164570    994300        2089600         366800
3  1963   188483000  4109500   316970   3792500    8640          17650   116470              174210   1086400        2297800         408300
4  1964   191141000  4564600   364220   4200400    9360          21420   130390              203050   1213200        2514400         472800
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Year                55 non-null     int64
 1   Population          55 non-null     int64
 2   Total               55 non-null     int64
 3   Violent             55 non-null     int64
 4   Property            55 non-null     int64
 5   Murder              55 non-null     int64
 6   Forcible_Rape       55 non-null     int64
 7   Robbery             55 non-null     int64
 8   Aggravated_assault  55 non-null     int64
 9   Burglary            55 non-null     int64
 10  Larceny_Theft       55 non-null     int64
 11  Vehicle_Theft       55 non-null     int64
dtypes: int64(12)
memory usage: 5.3 KB
4.2 Convert Year to datetime
crime['Year'] = pd.to_datetime(crime['Year'], format='%Y')
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Year                55 non-null     datetime64[ns]
 1   Population          55 non-null     int64
 2   Total               55 non-null     int64
 3   Violent             55 non-null     int64
 4   Property            55 non-null     int64
 5   Murder              55 non-null     int64
 6   Forcible_Rape       55 non-null     int64
 7   Robbery             55 non-null     int64
 8   Aggravated_assault  55 non-null     int64
 9   Burglary            55 non-null     int64
 10  Larceny_Theft       55 non-null     int64
 11  Vehicle_Theft       55 non-null     int64
dtypes: datetime64[ns](1), int64(11)
memory usage: 5.3 KB
4.3 Set Year as the index
# It seems that even without drop=True the old index does not become a new column here
crime.set_index('Year', inplace=True)
crime.head()
            Population    Total  Violent  Property  Murder  Forcible_Rape  Robbery  Aggravated_assault  Burglary  Larceny_Theft  Vehicle_Theft
Year
1960-01-01   179323175  3384200   288460   3095700    9110          17190   107840              154320    912100        1855400         328200
1961-01-01   182992000  3488000   289390   3198600    8740          17220   106670              156760    949600        1913000         336000
1962-01-01   185771000  3752200   301510   3450700    8530          17550   110860              164570    994300        2089600         366800
1963-01-01   188483000  4109500   316970   3792500    8640          17650   116470              174210   1086400        2297800         408300
1964-01-01   191141000  4564600   364220   4200400    9360          21420   130390              203050   1213200        2514400         472800
4.4 Delete the column named Total
del crime['Total']
4.5 Resample the data by decade (resample is very important)
- resample is pandas' convenient method for resampling and frequency conversion of regular time-series data.
- Its argument is the sampling rule; common frequency aliases: A year, M month, W week, D day, H hour, T minute, S second.
- Here we aggregate the data in ten-year bins.
crime_10Ys = crime.resample('10AS').sum()
crime_10Ys
            Population   Violent   Property  Murder  Forcible_Rape  Robbery  Aggravated_assault  Burglary  Larceny_Theft  Vehicle_Theft
Year
1960-01-01  1915053175   4134930   45160900  106180         236720  1633510             2158520  13321100       26547700        5292100
1970-01-01  2121193298   9607930   91383800  192230         554570  4159020             4702120  28486000       53157800        9739900
1980-01-01  2371370069  14074328  117048900  206439         865639  5383109             7619130  33073494       72040253       11935411
1990-01-01  2612825258  17527048  119053499  211664         998827  5748930            10568963  26750015       77679366       14624418
2000-01-01  2947969117  13968056  100944369  163068         922499  4230366             8652124  21565176       67970291       11412834
2010-01-01  1570146307   6072017   44095950   72867         421059  1749809             3764142  10125170       30401698        3569080
- Replace the resampled Population with each decade's maximum: summing population over ten years is meaningless,
- so substitute the within-decade maximum for the summed value.
crime_10Ys['Population'] = crime['Population'].resample('10AS').max()
crime_10Ys.head()
            Population   Violent   Property  Murder  Forcible_Rape  Robbery  Aggravated_assault  Burglary  Larceny_Theft  Vehicle_Theft
Year
1960-01-01   201385000   4134930   45160900  106180         236720  1633510             2158520  13321100       26547700        5292100
1970-01-01   220099000   9607930   91383800  192230         554570  4159020             4702120  28486000       53157800        9739900
1980-01-01   248239000  14074328  117048900  206439         865639  5383109             7619130  33073494       72040253       11935411
1990-01-01   272690813  17527048  119053499  211664         998827  5748930            10568963  26750015       77679366       14624418
2000-01-01   307006550  13968056  100944369  163068         922499  4230366             8652124  21565176       67970291       11412834
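The two steps (sum everything, then overwrite Population with the decade maximum) can also be collapsed into a single `agg()` call on the resampler; a sketch on toy data (two columns and four years standing in for the crime data; '10YS' is the year-start alias):

```python
import pandas as pd

# Toy stand-in for the crime DataFrame with a datetime index.
idx = pd.to_datetime(['1960', '1961', '1970', '1971'], format='%Y')
crime_toy = pd.DataFrame({'Population': [100, 110, 200, 210],
                          'Violent': [5, 6, 7, 8]}, index=idx)

# One agg() per resample: keep the decade's peak population, sum the counts.
decade = crime_toy.resample('10YS').agg({'Population': 'max', 'Violent': 'sum'})
```

This avoids computing a throwaway Population sum only to overwrite it afterwards.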
4.6 When was the most dangerous era in US history?
crime
            Population  Violent  Property  Murder  Forcible_Rape  Robbery  Aggravated_assault  Burglary  Larceny_Theft  Vehicle_Theft
Year
1960-01-01   179323175   288460   3095700    9110          17190   107840              154320    912100        1855400         328200
1961-01-01   182992000   289390   3198600    8740          17220   106670              156760    949600        1913000         336000
1962-01-01   185771000   301510   3450700    8530          17550   110860              164570    994300        2089600         366800
1963-01-01   188483000   316970   3792500    8640          17650   116470              174210   1086400        2297800         408300
1964-01-01   191141000   364220   4200400    9360          21420   130390              203050   1213200        2514400         472800
...                ...      ...       ...     ...            ...      ...                 ...       ...            ...            ...
2010-01-01   309330219  1251248   9112625   14772          85593   369089              781844   2168457        6204601         739565
2011-01-01   311587816  1206031   9052743   14661          84175   354772              752423   2185140        6151095         716508
2012-01-01   313873685  1217067   9001992   14866          85141   355051              762009   2109932        6168874         723186
2013-01-01   316497531  1199684   8650761   14319          82109   345095              726575   1931835        6018632         700294
2014-01-01   318857056  1197987   8277829   14249          84041   325802              741291   1729806        5858496         689527

[55 rows x 10 columns]
crime.idxmax()
Population 2014-01-01
Violent 1992-01-01
Property 1991-01-01
Murder 1991-01-01
Forcible_Rape 1992-01-01
Robbery 1991-01-01
Aggravated_assault 1993-01-01
Burglary 1980-01-01
Larceny_Theft 1991-01-01
Vehicle_Theft 1991-01-01
dtype: datetime64[ns]
- The maxima of the various crime counts cluster in 1991-1993, while population grows steadily across the whole dataset (from ~180 million in 1960 to ~320 million in 2014), so crime rates in that period were higher than in other periods (though this alone does not prove the whole 1990s were the most dangerous decade).
5. Merging (exploring fictitious name data)
5.1 Build the dummy data
import numpy as np
import pandas as pd
raw_data_1 = {'subject_id': ['1', '2', '3', '4', '5'],
              'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
              'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
raw_data_2 = {'subject_id': ['4', '5', '6', '7', '8'],
              'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
              'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
raw_data_3 = {'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
              'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}
data_1 = pd.DataFrame(raw_data_1)
data_2 = pd.DataFrame(raw_data_2)
data_3 = pd.DataFrame(raw_data_3)
print(data_1)
print(data_2)
print(data_3)
  subject_id first_name last_name
0          1       Alex  Anderson
1          2        Amy  Ackerman
2          3      Allen       Ali
3          4      Alice      Aoni
4          5     Ayoung   Atiches
  subject_id first_name last_name
0          4      Billy    Bonder
1          5      Brian     Black
2          6       Bran   Balwner
3          7      Bryce     Brice
4          8      Betty    Btisan
  subject_id  test_id
0          1       51
1          2       15
2          3       15
3          4       61
4          5       16
5          7       14
6          8       15
7          9        1
8         10       61
9         11       16
5.2 Concatenate vertically, axis=0 (think of the direction a dropped egg falls)
all_data = pd.concat([data_1,data_2], ignore_index=True)
all_data
  subject_id first_name last_name
0          1       Alex  Anderson
1          2        Amy  Ackerman
2          3      Allen       Ali
3          4      Alice      Aoni
4          5     Ayoung   Atiches
5          4      Billy    Bonder
6          5      Brian     Black
7          6       Bran   Balwner
8          7      Bryce     Brice
9          8      Betty    Btisan
5.3 Concatenate side by side, axis=1
by_col = pd.concat([data_1,data_2], axis=1)
by_col
  subject_id first_name last_name subject_id first_name last_name
0          1       Alex  Anderson          4      Billy    Bonder
1          2        Amy  Ackerman          5      Brian     Black
2          3      Allen       Ali          6       Bran   Balwner
3          4      Alice      Aoni          7      Bryce     Brice
4          5     Ayoung   Atiches          8      Betty    Btisan
5.4 Merge all_data and data_3 on subject_id
all_data = pd.merge(all_data,data_3,on='subject_id')
all_data
  subject_id first_name last_name  test_id
0          1       Alex  Anderson       51
1          2        Amy  Ackerman       15
2          3      Allen       Ali       15
3          4      Alice      Aoni       61
4          4      Billy    Bonder       61
5          5     Ayoung   Atiches       16
6          5      Brian     Black       16
7          7      Bryce     Brice       14
8          8      Betty    Btisan       15
5.5 Inner join data_1 and data_2 on subject_id
inner_join = pd.merge(data_1,data_2, on='subject_id', how='inner')
inner_join
  subject_id first_name_x last_name_x first_name_y last_name_y
0          4        Alice        Aoni        Billy      Bonder
1          5       Ayoung     Atiches        Brian       Black
5.6 Outer join data_1 and data_2 on subject_id
outer_join = pd.merge(data_1,data_2, on='subject_id', how='outer')
outer_join
  subject_id first_name_x last_name_x first_name_y last_name_y
0          1         Alex    Anderson          NaN         NaN
1          2          Amy    Ackerman          NaN         NaN
2          3        Allen         Ali          NaN         NaN
3          4        Alice        Aoni        Billy      Bonder
4          5       Ayoung     Atiches        Brian       Black
5          6          NaN         NaN         Bran     Balwner
6          7          NaN         NaN        Bryce       Brice
7          8          NaN         NaN        Betty      Btisan
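When checking an outer join, merge's `indicator=True` option adds a `_merge` column showing where each row came from; a sketch using a subset of the toy name data:

```python
import pandas as pd

# Two small frames that only partially overlap on subject_id.
d1 = pd.DataFrame({'subject_id': ['1', '4'], 'first_name': ['Alex', 'Alice']})
d2 = pd.DataFrame({'subject_id': ['4', '6'], 'first_name': ['Billy', 'Bran']})

# indicator=True adds a _merge column with values left_only / both / right_only.
out = pd.merge(d1, d2, on='subject_id', how='outer', indicator=True)
```

This makes it easy to count or filter the unmatched rows on either side after a join.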
6. Statistics: exploring wind speed data
6.1 Data import
import pandas as pd
import datetime
wind = pd.read_table('wind.data', sep='\s+', parse_dates=[[0,1,2]])
wind.head()
    Yr_Mo_Dy    RPT    VAL    ROS    KIL    SHA   BIR    DUB    CLA    MUL    CLO    BEL    MAL
0 2061-01-01  15.04  14.96  13.17   9.29    NaN  9.87  13.67  10.25  10.83  12.58  18.50  15.04
1 2061-01-02  14.71    NaN  10.83   6.50  12.62  7.67  11.50  10.04   9.79   9.67  17.54  13.83
2 2061-01-03  18.50  16.88  12.33  10.13  11.17  6.17  11.25    NaN   8.50   7.67  12.75  12.71
3 2061-01-04  10.58   6.63  11.75   4.58   4.54  2.88   8.63   1.79   5.83   5.88   5.46  10.88
4 2061-01-05  13.33  13.25  11.42   6.17  10.71  8.21  11.92   6.54  10.92  10.34  12.92  11.83
wind.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6574 entries, 0 to 6573
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Yr_Mo_Dy  6574 non-null   datetime64[ns]
 1   RPT       6568 non-null   float64
 2   VAL       6571 non-null   float64
 3   ROS       6572 non-null   float64
 4   KIL       6569 non-null   float64
 5   SHA       6572 non-null   float64
 6   BIR       6574 non-null   float64
 7   DUB       6571 non-null   float64
 8   CLA       6572 non-null   float64
 9   MUL       6571 non-null   float64
 10  CLO       6573 non-null   float64
 11  BEL       6574 non-null   float64
 12  MAL       6570 non-null   float64
dtypes: datetime64[ns](1), float64(12)
memory usage: 667.8 KB
6.2 Years like 2061 are bad data; try to fix them
- The original answer omitted the pd.to_datetime() wrapper; without it, wind['Yr_Mo_Dy'] would no longer be datetime after applying the function.
def fix_year(x):
    year = x.year - 100 if x.year > 1978 else x.year
    return pd.to_datetime(datetime.date(year, x.month, x.day), format='%Y/%m/%d')
wind['Yr_Mo_Dy'] = wind['Yr_Mo_Dy'].apply(fix_year)
wind['Yr_Mo_Dy'].dt.year.unique()
array([1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971,1972, 1973, 1974, 1975, 1976, 1977, 1978], dtype=int64)
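The row-wise apply above can also be replaced by a vectorized fix; a sketch assuming the same 1978 cutoff, on three toy dates:

```python
import pandas as pd

# Toy dates: two mis-parsed 20xx years and one that is already correct.
dates = pd.Series(pd.to_datetime(['2061-01-01', '1978-06-15', '2078-12-31']))

# where() keeps rows satisfying the condition and replaces the rest:
# subtract 100 years only where the parsed year is past 1978.
fixed = dates.where(dates.dt.year <= 1978, dates - pd.DateOffset(years=100))
```

This avoids calling a Python function once per row, which matters on 6,574 rows and up.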
6.3 Set the date as the index; note the dtype should be datetime64
wind.set_index('Yr_Mo_Dy', inplace=True)
wind.head()
              RPT    VAL    ROS    KIL    SHA   BIR    DUB    CLA    MUL    CLO    BEL    MAL
Yr_Mo_Dy
1961-01-01  15.04  14.96  13.17   9.29    NaN  9.87  13.67  10.25  10.83  12.58  18.50  15.04
1961-01-02  14.71    NaN  10.83   6.50  12.62  7.67  11.50  10.04   9.79   9.67  17.54  13.83
1961-01-03  18.50  16.88  12.33  10.13  11.17  6.17  11.25    NaN   8.50   7.67  12.75  12.71
1961-01-04  10.58   6.63  11.75   4.58   4.54  2.88   8.63   1.79   5.83   5.88   5.46  10.88
1961-01-05  13.33  13.25  11.42   6.17  10.71  8.21  11.92   6.54  10.92  10.34  12.92  11.83
6.4 How many values are missing for each location?
wind.isna().sum()
RPT 6
VAL 3
ROS 2
KIL 5
SHA 2
BIR 0
DUB 3
CLA 2
MUL 3
CLO 1
BEL 0
MAL 4
dtype: int64
6.5 For each location, how many complete (non-missing) values are there
- Both notna() and info() report the non-null counts
wind.notna().sum()
RPT 6568
VAL 6571
ROS 6572
KIL 6569
SHA 6572
BIR 6574
DUB 6571
CLA 6572
MUL 6571
CLO 6573
BEL 6574
MAL 6570
dtype: int64
# Original answer
wind.shape[0] - wind.isnull().sum()
RPT 6568
VAL 6571
ROS 6572
KIL 6569
SHA 6572
BIR 6574
DUB 6571
CLA 6572
MUL 6571
CLO 6573
BEL 6574
MAL 6570
dtype: int64
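count() reports the same per-column non-null totals directly; a small sketch on a toy frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'RPT': [1.0, np.nan, 3.0], 'VAL': [4.0, 5.0, 6.0]})

# count() returns the number of non-null entries per column,
# equivalent to df.notna().sum() or df.shape[0] - df.isnull().sum().
print(df.count().tolist())  # [2, 3]
```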
6.6 Compute the mean wind speed over the whole dataset
wind.mean().mean()
10.227982360836938
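Worth noting: wind.mean().mean() is the mean of the column means, which equals the mean over all values only when every column has the same number of non-null entries. A toy frame where the two differ:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, np.nan], 'b': [3.0, 5.0]})

col_mean_of_means = df.mean().mean()   # (1 + 4) / 2 = 2.5
overall_mean = df.stack().mean()       # (1 + 3 + 5) / 3 = 3.0
print(col_mean_of_means, overall_mean)
```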
6.7 Create a DataFrame storing the minimum, maximum, mean and standard deviation of wind speed for each location
wind_des = wind.describe().T.iloc[:,[3,7,1,2]]
wind_des
| | min | max | mean | std |
|---|---|---|---|---|
| RPT | 0.67 | 35.80 | 12.362987 | 5.618413 |
| VAL | 0.21 | 33.37 | 10.644314 | 5.267356 |
| ROS | 1.50 | 33.84 | 11.660526 | 5.008450 |
| KIL | 0.00 | 28.46 | 6.306468 | 3.605811 |
| SHA | 0.13 | 37.54 | 10.455834 | 4.936125 |
| BIR | 0.00 | 26.16 | 7.092254 | 3.968683 |
| DUB | 0.00 | 30.37 | 9.797343 | 4.977555 |
| CLA | 0.00 | 31.08 | 8.495053 | 4.499449 |
| MUL | 0.00 | 25.88 | 8.493590 | 4.166872 |
| CLO | 0.04 | 28.21 | 8.707332 | 4.503954 |
| BEL | 0.13 | 42.38 | 13.121007 | 5.835037 |
| MAL | 0.67 | 42.54 | 15.599079 | 6.699794 |
# Common alternative; note axis=1 aggregates across locations, giving per-day stats rather than per-location stats
day_stats = pd.DataFrame()
day_stats['min'] = wind.min(axis=1)
day_stats['max'] = wind.max(axis=1)
day_stats['mean'] = wind.mean(axis=1)
day_stats['std'] = wind.std(axis=1)
day_stats.head()
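The per-location table can also be built with agg over the columns plus a transpose, mirroring the describe() slicing above; a minimal sketch on a cut-down frame:

```python
import pandas as pd

df = pd.DataFrame({'RPT': [1.0, 3.0], 'VAL': [2.0, 6.0]})

# Aggregate each column, then transpose so locations become rows,
# matching the min/max/mean/std layout used above.
loc_stats = df.agg(['min', 'max', 'mean', 'std']).T
print(loc_stats)
```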
6.8 For each location, compute the mean wind speed in January
wind['date'] = wind.index
wind.head()
| Yr_Mo_Dy | RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL | date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1961-01-01 | 15.04 | 14.96 | 13.17 | 9.29 | NaN | 9.87 | 13.67 | 10.25 | 10.83 | 12.58 | 18.50 | 15.04 | 1961-01-01 |
| 1961-01-02 | 14.71 | NaN | 10.83 | 6.50 | 12.62 | 7.67 | 11.50 | 10.04 | 9.79 | 9.67 | 17.54 | 13.83 | 1961-01-02 |
| 1961-01-03 | 18.50 | 16.88 | 12.33 | 10.13 | 11.17 | 6.17 | 11.25 | NaN | 8.50 | 7.67 | 12.75 | 12.71 | 1961-01-03 |
| 1961-01-04 | 10.58 | 6.63 | 11.75 | 4.58 | 4.54 | 2.88 | 8.63 | 1.79 | 5.83 | 5.88 | 5.46 | 10.88 | 1961-01-04 |
| 1961-01-05 | 13.33 | 13.25 | 11.42 | 6.17 | 10.71 | 8.21 | 11.92 | 6.54 | 10.92 | 10.34 | 12.92 | 11.83 | 1961-01-05 |
wind['year']= wind['date'].dt.year
wind['month']= wind['date'].dt.month
wind['day']= wind['date'].dt.day
wind.head()
| Yr_Mo_Dy | RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL | date | year | month | day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1961-01-01 | 15.04 | 14.96 | 13.17 | 9.29 | NaN | 9.87 | 13.67 | 10.25 | 10.83 | 12.58 | 18.50 | 15.04 | 1961-01-01 | 1961 | 1 | 1 |
| 1961-01-02 | 14.71 | NaN | 10.83 | 6.50 | 12.62 | 7.67 | 11.50 | 10.04 | 9.79 | 9.67 | 17.54 | 13.83 | 1961-01-02 | 1961 | 1 | 2 |
| 1961-01-03 | 18.50 | 16.88 | 12.33 | 10.13 | 11.17 | 6.17 | 11.25 | NaN | 8.50 | 7.67 | 12.75 | 12.71 | 1961-01-03 | 1961 | 1 | 3 |
| 1961-01-04 | 10.58 | 6.63 | 11.75 | 4.58 | 4.54 | 2.88 | 8.63 | 1.79 | 5.83 | 5.88 | 5.46 | 10.88 | 1961-01-04 | 1961 | 1 | 4 |
| 1961-01-05 | 13.33 | 13.25 | 11.42 | 6.17 | 10.71 | 8.21 | 11.92 | 6.54 | 10.92 | 10.34 | 12.92 | 11.83 | 1961-01-05 | 1961 | 1 | 5 |
# Original answer (using apply)
wind['year'] = wind['date'].apply(lambda x: x.year)
wind['month'] = wind['date'].apply(lambda x: x.month)
wind['day'] = wind['date'].apply(lambda x: x.day)
wind[wind['month']==1].iloc[:,0:12].mean()
RPT 14.847325
VAL 12.914560
ROS 13.299624
KIL 7.199498
SHA 11.667734
BIR 8.054839
DUB 11.819355
CLA 9.512047
MUL 9.543208
CLO 10.053566
BEL 14.550520
MAL 18.028763
dtype: float64
wind.loc[wind.index.month==1].mean()
RPT 14.847325
VAL 12.914560
ROS 13.299624
KIL 7.199498
SHA 11.667734
BIR 8.054839
DUB 11.819355
CLA 9.512047
MUL 9.543208
CLO 10.053566
BEL 14.550520
MAL 18.028763
dtype: float64
6.9 Downsample the records to annual frequency for each location
- DatetimeIndex: a collection of individual timestamps
- PeriodIndex: a DatetimeIndex with a fixed frequency, obtained by calling DatetimeIndex.to_period(freq)
wind.groupby(wind.index.to_period('A')).mean()
| Yr_Mo_Dy | RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1961 | 12.299583 | 10.351796 | 11.362369 | 6.958227 | 10.881763 | 7.729726 | 9.733923 | 8.858788 | 8.647652 | 9.835577 | 13.502795 | 13.680773 |
| 1962 | 12.246923 | 10.110438 | 11.732712 | 6.960440 | 10.657918 | 7.393068 | 11.020712 | 8.793753 | 8.316822 | 9.676247 | 12.930685 | 14.323956 |
| 1963 | 12.813452 | 10.836986 | 12.541151 | 7.330055 | 11.724110 | 8.434712 | 11.075699 | 10.336548 | 8.903589 | 10.224438 | 13.638877 | 14.999014 |
| 1964 | 12.363661 | 10.920164 | 12.104372 | 6.787787 | 11.454481 | 7.570874 | 10.259153 | 9.467350 | 7.789016 | 10.207951 | 13.740546 | 14.910301 |
| 1965 | 12.451370 | 11.075534 | 11.848767 | 6.858466 | 11.024795 | 7.478110 | 10.618712 | 8.879918 | 7.907425 | 9.918082 | 12.964247 | 15.591644 |
| 1966 | 13.461973 | 11.557205 | 12.020630 | 7.345726 | 11.805041 | 7.793671 | 10.579808 | 8.835096 | 8.514438 | 9.768959 | 14.265836 | 16.307260 |
| 1967 | 12.737151 | 10.990986 | 11.739397 | 7.143425 | 11.630740 | 7.368164 | 10.652027 | 9.325616 | 8.645014 | 9.547425 | 14.774548 | 17.135945 |
| 1968 | 11.835628 | 10.468197 | 11.409754 | 6.477678 | 10.760765 | 6.067322 | 8.859180 | 8.255519 | 7.224945 | 7.832978 | 12.808634 | 15.017486 |
| 1969 | 11.166356 | 9.723699 | 10.902000 | 5.767973 | 9.873918 | 6.189973 | 8.564493 | 7.711397 | 7.924521 | 7.754384 | 12.621233 | 15.762904 |
| 1970 | 12.600329 | 10.726932 | 11.730247 | 6.217178 | 10.567370 | 7.609452 | 9.609890 | 8.334630 | 9.297616 | 8.289808 | 13.183644 | 16.456027 |
| 1971 | 11.273123 | 9.095178 | 11.088329 | 5.241507 | 9.440329 | 6.097151 | 8.385890 | 6.757315 | 7.915370 | 7.229753 | 12.208932 | 15.025233 |
| 1972 | 12.463962 | 10.561311 | 12.058333 | 5.929699 | 9.430410 | 6.358825 | 9.704508 | 7.680792 | 8.357295 | 7.515273 | 12.727377 | 15.028716 |
| 1973 | 11.828466 | 10.680493 | 10.680493 | 5.547863 | 9.640877 | 6.548740 | 8.482110 | 7.614274 | 8.245534 | 7.812411 | 12.169699 | 15.441096 |
| 1974 | 13.643096 | 11.811781 | 12.336356 | 6.427041 | 11.110986 | 6.809781 | 10.084603 | 9.896986 | 9.331753 | 8.736356 | 13.252959 | 16.947671 |
| 1975 | 12.008575 | 10.293836 | 11.564712 | 5.269096 | 9.190082 | 5.668521 | 8.562603 | 7.843836 | 8.797945 | 7.382822 | 12.631671 | 15.307863 |
| 1976 | 11.737842 | 10.203115 | 10.761230 | 5.109426 | 8.846339 | 6.311038 | 9.149126 | 7.146202 | 8.883716 | 7.883087 | 12.332377 | 15.471448 |
| 1977 | 13.099616 | 11.144493 | 12.627836 | 6.073945 | 10.003836 | 8.586438 | 11.523205 | 8.378384 | 9.098192 | 8.821616 | 13.459068 | 16.590849 |
| 1978 | 12.504356 | 11.044274 | 11.380000 | 6.082356 | 10.167233 | 7.650658 | 9.489342 | 8.800466 | 9.089753 | 8.301699 | 12.967397 | 16.771370 |
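On a DatetimeIndex, resample gives the same annual means as the groupby/to_period approach, differing only in the index type (year-end timestamps vs. periods); a sketch on toy data:

```python
import pandas as pd

idx = pd.to_datetime(['1961-01-01', '1961-06-01', '1962-01-01'])
s = pd.Series([1.0, 3.0, 5.0], index=idx)

# Same annual means, two spellings.
by_period = s.groupby(s.index.to_period('A')).mean()
by_resample = s.resample('A').mean()
print(by_period.tolist(), by_resample.tolist())  # [2.0, 5.0] [2.0, 5.0]
```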
6.10 Downsample the records to monthly frequency for each location
wind.groupby(wind.index.to_period('M')).mean()
| Yr_Mo_Dy | RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1961-01 | 14.841333 | 11.988333 | 13.431613 | 7.736774 | 11.072759 | 8.588065 | 11.184839 | 9.245333 | 9.085806 | 10.107419 | 13.880968 | 14.703226 |
| 1961-02 | 16.269286 | 14.975357 | 14.441481 | 9.230741 | 13.852143 | 10.937500 | 11.890714 | 11.846071 | 11.821429 | 12.714286 | 18.583214 | 15.411786 |
| 1961-03 | 10.890000 | 11.296452 | 10.752903 | 7.284000 | 10.509355 | 8.866774 | 9.644194 | 9.829677 | 10.294138 | 11.251935 | 16.410968 | 15.720000 |
| 1961-04 | 10.722667 | 9.427667 | 9.998000 | 5.830667 | 8.435000 | 6.495000 | 6.925333 | 7.094667 | 7.342333 | 7.237000 | 11.147333 | 10.278333 |
| 1961-05 | 9.860968 | 8.850000 | 10.818065 | 5.905333 | 9.490323 | 6.574839 | 7.604000 | 8.177097 | 8.039355 | 8.499355 | 11.900323 | 12.011613 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1978-08 | 9.645161 | 8.259355 | 9.032258 | 4.502903 | 7.368065 | 5.935161 | 5.650323 | 5.417742 | 7.241290 | 5.536774 | 10.466774 | 12.054194 |
| 1978-09 | 10.913667 | 10.895000 | 10.635000 | 5.725000 | 10.372000 | 9.278333 | 10.790333 | 9.583000 | 10.069333 | 8.939000 | 15.680333 | 19.391333 |
| 1978-10 | 9.897742 | 8.670968 | 9.295806 | 4.721290 | 8.525161 | 6.774194 | 8.115484 | 7.337742 | 8.297742 | 8.243871 | 13.776774 | 17.150000 |
| 1978-11 | 16.151667 | 14.802667 | 13.508000 | 7.317333 | 11.475000 | 8.743000 | 11.492333 | 9.657333 | 10.701333 | 10.676000 | 17.404667 | 20.723000 |
| 1978-12 | 16.175484 | 13.748065 | 15.635161 | 7.094839 | 11.398710 | 9.241613 | 12.077419 | 10.194839 | 10.616774 | 11.028710 | 13.859677 | 21.371613 |

216 rows × 12 columns
6.11 Downsample the records to weekly frequency for each location
wind.groupby(wind.index.to_period('W')).mean()
| Yr_Mo_Dy | RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1960-12-26/1961-01-01 | 15.040000 | 14.960000 | 13.170000 | 9.290000 | NaN | 9.870000 | 13.670000 | 10.250000 | 10.830000 | 12.580000 | 18.500000 | 15.040000 |
| 1961-01-02/1961-01-08 | 13.541429 | 11.486667 | 10.487143 | 6.417143 | 9.474286 | 6.435714 | 11.061429 | 6.616667 | 8.434286 | 8.497143 | 12.481429 | 13.238571 |
| 1961-01-09/1961-01-15 | 12.468571 | 8.967143 | 11.958571 | 4.630000 | 7.351429 | 5.072857 | 7.535714 | 6.820000 | 5.712857 | 7.571429 | 11.125714 | 11.024286 |
| 1961-01-16/1961-01-22 | 13.204286 | 9.862857 | 12.982857 | 6.328571 | 8.966667 | 7.417143 | 9.257143 | 7.875714 | 7.145714 | 8.124286 | 9.821429 | 11.434286 |
| 1961-01-23/1961-01-29 | 19.880000 | 16.141429 | 18.225714 | 12.720000 | 17.432857 | 14.828571 | 15.528571 | 15.160000 | 14.480000 | 15.640000 | 20.930000 | 22.530000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1978-11-27/1978-12-03 | 14.934286 | 11.232857 | 13.941429 | 5.565714 | 10.215714 | 8.618571 | 9.642857 | 7.685714 | 9.011429 | 9.547143 | 11.835714 | 18.728571 |
| 1978-12-04/1978-12-10 | 20.740000 | 19.190000 | 17.034286 | 9.777143 | 15.287143 | 12.774286 | 14.437143 | 12.488571 | 13.870000 | 14.082857 | 18.517143 | 23.061429 |
| 1978-12-11/1978-12-17 | 16.758571 | 14.692857 | 14.987143 | 6.917143 | 11.397143 | 7.272857 | 10.208571 | 7.967143 | 9.168571 | 8.565714 | 11.102857 | 15.562857 |
| 1978-12-18/1978-12-24 | 11.155714 | 8.008571 | 13.172857 | 4.004286 | 7.825714 | 6.290000 | 7.798571 | 8.667143 | 7.151429 | 8.072857 | 11.845714 | 18.977143 |
| 1978-12-25/1978-12-31 | 14.951429 | 11.801429 | 16.035714 | 6.507143 | 9.660000 | 8.620000 | 13.708571 | 10.477143 | 10.868571 | 11.471429 | 12.947143 | 26.844286 |

940 rows × 12 columns
6.12 Compute the minimum, maximum, mean and standard deviation of wind speed for all locations over the first 52 weeks (assume the first week starts on 2 January 1961).
# resample and aggregate
weekly = wind.resample('W').agg(['min','max','mean','std'])
weekly
| Yr_Mo_Dy | RPT min | RPT max | RPT mean | RPT std | VAL min | VAL max | VAL mean | VAL std | ROS min | ROS max | ... | CLO mean | CLO std | BEL min | BEL max | BEL mean | BEL std | MAL min | MAL max | MAL mean | MAL std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1961-01-01 | 15.04 | 15.04 | 15.040000 | NaN | 14.96 | 14.96 | 14.960000 | NaN | 13.17 | 13.17 | ... | 12.580000 | NaN | 18.50 | 18.50 | 18.500000 | NaN | 15.04 | 15.04 | 15.040000 | NaN |
| 1961-01-08 | 10.58 | 18.50 | 13.541429 | 2.631321 | 6.63 | 16.88 | 11.486667 | 3.949525 | 7.62 | 12.33 | ... | 8.497143 | 1.704941 | 5.46 | 17.54 | 12.481429 | 4.349139 | 10.88 | 16.46 | 13.238571 | 1.773062 |
| 1961-01-15 | 9.04 | 19.75 | 12.468571 | 3.555392 | 3.54 | 12.08 | 8.967143 | 3.148945 | 7.08 | 19.50 | ... | 7.571429 | 4.084293 | 5.25 | 20.71 | 11.125714 | 5.552215 | 5.17 | 16.92 | 11.024286 | 4.692355 |
| 1961-01-22 | 4.92 | 19.83 | 13.204286 | 5.337402 | 3.42 | 14.37 | 9.862857 | 3.837785 | 7.29 | 20.79 | ... | 8.124286 | 4.783952 | 6.50 | 15.92 | 9.821429 | 3.626584 | 6.79 | 17.96 | 11.434286 | 4.237239 |
| 1961-01-29 | 13.62 | 25.04 | 19.880000 | 4.619061 | 9.96 | 23.91 | 16.141429 | 5.170224 | 12.67 | 25.84 | ... | 15.640000 | 3.713368 | 14.04 | 27.71 | 20.930000 | 5.210726 | 17.50 | 27.63 | 22.530000 | 3.874721 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1978-12-03 | 9.08 | 21.29 | 14.934286 | 4.931754 | 4.54 | 21.34 | 11.232857 | 5.978968 | 8.21 | 24.04 | ... | 9.547143 | 6.284973 | 4.92 | 21.42 | 11.835714 | 5.950112 | 11.50 | 25.75 | 18.728571 | 6.393188 |
| 1978-12-10 | 9.92 | 29.33 | 20.740000 | 7.215012 | 12.54 | 24.79 | 19.190000 | 4.953060 | 7.21 | 25.37 | ... | 14.082857 | 5.516405 | 9.54 | 26.08 | 18.517143 | 5.600389 | 15.34 | 34.59 | 23.061429 | 8.093976 |
| 1978-12-17 | 9.87 | 23.13 | 16.758571 | 4.499431 | 3.21 | 24.04 | 14.692857 | 7.578665 | 8.04 | 18.05 | ... | 8.565714 | 5.487801 | 5.00 | 21.50 | 11.102857 | 6.631925 | 6.92 | 22.83 | 15.562857 | 6.005594 |
| 1978-12-24 | 6.21 | 16.62 | 11.155714 | 3.522759 | 3.63 | 13.29 | 8.008571 | 3.882900 | 8.50 | 22.21 | ... | 8.072857 | 3.023131 | 3.21 | 19.79 | 11.845714 | 5.750301 | 10.29 | 31.71 | 18.977143 | 7.194108 |
| 1978-12-31 | 7.21 | 20.33 | 14.951429 | 4.350400 | 5.46 | 17.41 | 11.801429 | 4.705392 | 7.83 | 27.29 | ... | 11.471429 | 5.533397 | 1.21 | 21.79 | 12.947143 | 7.523148 | 11.96 | 41.46 | 26.844286 | 11.621233 |

940 rows × 48 columns
weekly.loc[weekly.index[1:53],:].head()
| Yr_Mo_Dy | RPT min | RPT max | RPT mean | RPT std | VAL min | VAL max | VAL mean | VAL std | ROS min | ROS max | ... | CLO mean | CLO std | BEL min | BEL max | BEL mean | BEL std | MAL min | MAL max | MAL mean | MAL std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1961-01-08 | 10.58 | 18.50 | 13.541429 | 2.631321 | 6.63 | 16.88 | 11.486667 | 3.949525 | 7.62 | 12.33 | ... | 8.497143 | 1.704941 | 5.46 | 17.54 | 12.481429 | 4.349139 | 10.88 | 16.46 | 13.238571 | 1.773062 |
| 1961-01-15 | 9.04 | 19.75 | 12.468571 | 3.555392 | 3.54 | 12.08 | 8.967143 | 3.148945 | 7.08 | 19.50 | ... | 7.571429 | 4.084293 | 5.25 | 20.71 | 11.125714 | 5.552215 | 5.17 | 16.92 | 11.024286 | 4.692355 |
| 1961-01-22 | 4.92 | 19.83 | 13.204286 | 5.337402 | 3.42 | 14.37 | 9.862857 | 3.837785 | 7.29 | 20.79 | ... | 8.124286 | 4.783952 | 6.50 | 15.92 | 9.821429 | 3.626584 | 6.79 | 17.96 | 11.434286 | 4.237239 |
| 1961-01-29 | 13.62 | 25.04 | 19.880000 | 4.619061 | 9.96 | 23.91 | 16.141429 | 5.170224 | 12.67 | 25.84 | ... | 15.640000 | 3.713368 | 14.04 | 27.71 | 20.930000 | 5.210726 | 17.50 | 27.63 | 22.530000 | 3.874721 |
| 1961-02-05 | 10.58 | 24.21 | 16.827143 | 5.251408 | 9.46 | 24.21 | 15.460000 | 5.187395 | 9.04 | 19.70 | ... | 9.460000 | 2.839501 | 9.17 | 19.33 | 14.012857 | 4.210858 | 7.17 | 19.25 | 11.935714 | 4.336104 |

5 rows × 48 columns
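Why the first weekly bin is skipped: resample('W') labels each bin by the Sunday that ends it, so the first bin here covers only 1961-01-01 and is partial, and iloc[1:53] (equivalently the weekly.index[1:53] selection above) picks the 52 full weeks starting Monday 1961-01-02. A sketch on synthetic daily data:

```python
import pandas as pd
import numpy as np

# Daily values for the first 15 days of 1961 (1961-01-01 was a Sunday).
idx = pd.date_range('1961-01-01', periods=15, freq='D')
s = pd.Series(np.arange(15.0), index=idx)

weekly = s.resample('W').mean()
# First bin ends Sunday 1961-01-01 and contains a single day.
print(weekly.index[0].strftime('%Y-%m-%d'))  # 1961-01-01
full_weeks = weekly.iloc[1:]  # drop the partial first week
```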
7. Explore the Titanic disaster data (visualization)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

plt.rcParams["font.sans-serif"] = ['SimHei']  # display Chinese characters correctly
plt.rcParams["axes.unicode_minus"] = False    # display minus signs correctly
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')
7.1 Import the data
titanic = pd.read_csv('train.csv')
titanic.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
7.1 Draw a pie chart showing the proportion of passengers by sex
males = (titanic['Sex'] == 'male').sum()
females = (titanic['Sex'] == 'female').sum()
proportions = [males, females]

# plt.figure(figsize=(8,8), dpi=100)
# plt.subplot(111, facecolor='#F0F0F0')
plt.pie(proportions,
        labels=['Males', 'Females'],
        shadow=True,
        colors=['#E76278', '#FAC699'],
        explode=(0.15, 0),
        startangle=90,
        autopct='%1.1f%%')
plt.axis('equal')
plt.title('Sex Proportion')
plt.tight_layout()
plt.show()
7.2 Draw a scatter plot of ticket fare against passenger age
lm = sns.lmplot(x='Age', y='Fare', data=titanic, hue='Sex', fit_reg=False)
lm.set(title='Fare_Age Scatter')
plt.tight_layout()
plt.show()
7.3 How many passengers survived
titanic['Survived'].sum()
342
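Because Survived is coded 0/1, its mean is a survival rate, so a groupby gives per-group rates directly; a toy sketch (column names as in the dataset, data invented for illustration):

```python
import pandas as pd

# Toy stand-in for the Titanic frame.
t = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male'],
                  'Survived': [0, 1, 1, 1]})

print(t['Survived'].sum())                  # 3 passengers survived
print(t.groupby('Sex')['Survived'].mean())  # survival rate per sex
```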
7.4 Draw a histogram of ticket fares
plt.figure(figsize=(10,4), dpi=100)
plt.subplot(111, facecolor='#F0F0F0')

plt.title('Fare distribution')
plt.xlabel('Fare')
plt.ylabel('Passenger count')
plt.hist(titanic['Fare'], bins=50, color='#447C69')
plt.xticks(np.arange(0, 300, 20), fontsize=10)
plt.yticks(fontsize=10)
plt.tight_layout()
8. Explore the Pokemon data
8.1 Import the data
import pandas as pd

raw_data = {"name": ['Bulbasaur', 'Charmander', 'Squirtle', 'Caterpie'],
            "evolution": ['Ivysaur', 'Charmeleon', 'Wartortle', 'Metapod'],
            "type": ['grass', 'fire', 'water', 'bug'],
            "hp": [45, 39, 44, 45],
            "pokedex": ['yes', 'no', 'yes', 'no']}
8.2 Create the DataFrame
pokemon = pd.DataFrame(raw_data)
pokemon.head()
| | name | evolution | type | hp | pokedex |
|---|---|---|---|---|---|
| 0 | Bulbasaur | Ivysaur | grass | 45 | yes |
| 1 | Charmander | Charmeleon | fire | 39 | no |
| 2 | Squirtle | Wartortle | water | 44 | yes |
| 3 | Caterpie | Metapod | bug | 45 | no |
8.3 Reorder the columns
pokemon = pokemon[['name','type','hp','evolution','pokedex']]
pokemon.head()
| | name | type | hp | evolution | pokedex |
|---|---|---|---|---|---|
| 0 | Bulbasaur | grass | 45 | Ivysaur | yes |
| 1 | Charmander | fire | 39 | Charmeleon | no |
| 2 | Squirtle | water | 44 | Wartortle | yes |
| 3 | Caterpie | bug | 45 | Metapod | no |
8.4 Add a place column
pokemon['place'] = ['park','street','lake','forest']
pokemon
| | name | type | hp | evolution | pokedex | place |
|---|---|---|---|---|---|---|
| 0 | Bulbasaur | grass | 45 | Ivysaur | yes | park |
| 1 | Charmander | fire | 39 | Charmeleon | no | street |
| 2 | Squirtle | water | 44 | Wartortle | yes | lake |
| 3 | Caterpie | bug | 45 | Metapod | no | forest |
8.5 Check the data type of each column
pokemon.dtypes
name object
type object
hp int64
evolution object
pokedex object
place object
dtype: object
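The yes/no strings in pokedex can be mapped to booleans for easier filtering; a sketch on a cut-down frame:

```python
import pandas as pd

pokemon = pd.DataFrame({'name': ['Bulbasaur', 'Charmander'],
                        'pokedex': ['yes', 'no']})

# Map the yes/no strings to booleans, then use the column as a filter mask.
pokemon['pokedex'] = pokemon['pokedex'].map({'yes': True, 'no': False})
print(pokemon.loc[pokemon['pokedex'], 'name'].tolist())  # ['Bulbasaur']
```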
9. Explore Apple stock price data
9.1 Import the data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
apple = pd.read_csv('Apple_stock.csv')
apple.head()
| | Date | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|---|
| 0 | 2014-07-08 | 96.27 | 96.80 | 93.92 | 95.35 | 65130000 | 95.35 |
| 1 | 2014-07-07 | 94.14 | 95.99 | 94.10 | 95.97 | 56305400 | 95.97 |
| 2 | 2014-07-03 | 93.67 | 94.10 | 93.20 | 94.03 | 22891800 | 94.03 |
| 3 | 2014-07-02 | 93.87 | 94.06 | 93.09 | 93.48 | 28420900 | 93.48 |
| 4 | 2014-07-01 | 93.52 | 94.07 | 93.13 | 93.52 | 38170200 | 93.52 |
apple.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8465 entries, 1980-12-12 to 2014-07-08
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Open       8465 non-null   float64
 1   High       8465 non-null   float64
 2   Low        8465 non-null   float64
 3   Close      8465 non-null   float64
 4   Volume     8465 non-null   int64
 5   Adj Close  8465 non-null   float64
dtypes: float64(5), int64(1)
memory usage: 721.0 KB
9.1 Are there duplicate dates?
apple['Date'].duplicated().sum()
# no duplicate dates
0
apple['Date'].is_unique
True
# is_unique also works on the index
apple.index.is_unique
True
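The relationship between duplicated() and is_unique, on a tiny Series:

```python
import pandas as pd

dates = pd.Series(['2014-07-08', '2014-07-07', '2014-07-08'])

# duplicated() flags repeats after the first occurrence;
# is_unique is the complementary scalar check.
print(dates.duplicated().sum())  # 1
print(dates.is_unique)           # False
```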
apple['Date'] = pd.to_datetime(apple['Date'])
apple.set_index('Date', inplace=True)
apple.head()
| Date | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 2014-07-08 | 96.27 | 96.80 | 93.92 | 95.35 | 65130000 | 95.35 |
| 2014-07-07 | 94.14 | 95.99 | 94.10 | 95.97 | 56305400 | 95.97 |
| 2014-07-03 | 93.67 | 94.10 | 93.20 | 94.03 | 22891800 | 94.03 |
| 2014-07-02 | 93.87 | 94.06 | 93.09 | 93.48 | 28420900 | 93.48 |
| 2014-07-01 | 93.52 | 94.07 | 93.13 | 93.52 | 38170200 | 93.52 |
9.2 Sort the index in ascending order
apple.sort_index(ascending=True, inplace=True)
apple.head()
| Date | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 1980-12-12 | 28.75 | 28.87 | 28.75 | 28.75 | 117258400 | 0.45 |
| 1980-12-15 | 27.38 | 27.38 | 27.25 | 27.25 | 43971200 | 0.42 |
| 1980-12-16 | 25.37 | 25.37 | 25.25 | 25.25 | 26432000 | 0.39 |
| 1980-12-17 | 25.87 | 26.00 | 25.87 | 25.87 | 21610400 | 0.40 |
| 1980-12-18 | 26.63 | 26.75 | 26.63 | 26.63 | 18362400 | 0.41 |
9.3 Resample to the last business day of each month
# 'BM' business month end frequency
apple_month = apple.resample('BM').mean()
apple_month.head()
| Date | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 1980-12-31 | 30.481538 | 30.567692 | 30.443077 | 30.443077 | 2.586252e+07 | 0.473077 |
| 1981-01-30 | 31.754762 | 31.826667 | 31.654762 | 31.654762 | 7.249867e+06 | 0.493810 |
| 1981-02-27 | 26.480000 | 26.572105 | 26.407895 | 26.407895 | 4.231832e+06 | 0.411053 |
| 1981-03-31 | 24.937727 | 25.016818 | 24.836364 | 24.836364 | 7.962691e+06 | 0.387727 |
| 1981-04-30 | 27.286667 | 27.368095 | 27.227143 | 27.227143 | 6.392000e+06 | 0.423333 |
9.4 How many days apart are the earliest and latest dates in the dataset?
(apple.index.max()-apple.index.min()).days
12261
9.5 How many months does the dataset span?
len(apple.resample('BM').mean())
404
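An alternative that counts only the months actually containing data is to_period('M').nunique(); for this dataset the two agree because no month is empty. A toy sketch:

```python
import pandas as pd

idx = pd.to_datetime(['1980-12-12', '1981-01-05', '1981-01-20', '1981-02-02'])
df = pd.DataFrame({'Close': [28.75, 27.0, 26.5, 26.0]}, index=idx)

# Number of distinct calendar months that contain at least one row.
print(df.index.to_period('M').nunique())  # 3
```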
9.6 Plot the Adj Close values in chronological order
plt.figure(figsize=(8,3), dpi=100)
apple['Adj Close'].plot(title='Apple Stock')
<AxesSubplot:title={'center':'Apple Stock'}, xlabel='Date'>
10. Explore the Iris flower data
10.1 Import the data
- Note: iris.csv has no header row here, so read_csv consumes the first record as column names (149 entries remain); reading with header=None would preserve all 150 rows
import pandas as pd
iris = pd.read_csv('iris.csv')
iris.head()
|
5.1
|
3.5
|
1.4
|
0.2
|
Iris-setosa
|
0
|
4.9
|
3.0
|
1.4
|
0.2
|
Iris-setosa
|
1
|
4.7
|
3.2
|
1.3
|
0.2
|
Iris-setosa
|
2
|
4.6
|
3.1
|
1.5
|
0.2
|
Iris-setosa
|
3
|
5.0
|
3.6
|
1.4
|
0.2
|
Iris-setosa
|
4
|
5.4
|
3.9
|
1.7
|
0.4
|
Iris-setosa
|
iris.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   5.1          149 non-null    float64
 1   3.5          149 non-null    float64
 2   1.4          149 non-null    float64
 3   0.2          149 non-null    float64
 4   Iris-setosa  149 non-null    object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB
10.2 Add column names
iris.columns = ['sepal_length','sepal_width', 'petal_length','petal_width', 'class']
iris.head()
| | sepal_length | sepal_width | petal_length | petal_width | class |
|---|---|---|---|---|---|
| 0 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 2 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 3 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 4 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
10.3 Does the dataset have any missing values
iris.isna().sum()
sepal_length 0
sepal_width 0
petal_length 0
petal_width 0
class 0
dtype: int64
10.4 Set rows 10 to 19 of petal_length to missing values
- Python's range() is half-open: with stop=20 below, 20 itself is excluded
import numpy as np

for i in range(10, 20):
    iris['petal_length'].at[i] = np.nan

iris.head(21)
| | sepal_length | sepal_width | petal_length | petal_width | class |
|---|---|---|---|---|---|
| 0 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 2 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 3 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 4 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 5 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 6 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 7 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 8 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
| 9 | 5.4 | 3.7 | 1.5 | 0.2 | Iris-setosa |
| 10 | 4.8 | 3.4 | NaN | 0.2 | Iris-setosa |
| 11 | 4.8 | 3.0 | NaN | 0.1 | Iris-setosa |
| 12 | 4.3 | 3.0 | NaN | 0.1 | Iris-setosa |
| 13 | 5.8 | 4.0 | NaN | 0.2 | Iris-setosa |
| 14 | 5.7 | 4.4 | NaN | 0.4 | Iris-setosa |
| 15 | 5.4 | 3.9 | NaN | 0.4 | Iris-setosa |
| 16 | 5.1 | 3.5 | NaN | 0.3 | Iris-setosa |
| 17 | 5.7 | 3.8 | NaN | 0.3 | Iris-setosa |
| 18 | 5.1 | 3.8 | NaN | 0.3 | Iris-setosa |
| 19 | 5.4 | 3.4 | NaN | 0.2 | Iris-setosa |
| 20 | 5.1 | 3.7 | 1.5 | 0.4 | Iris-setosa |
# Positional alternative (run on a freshly re-read frame, judging by the values below);
# note that iloc column position 1 is sepal_width here, not petal_length
iris.iloc[10:20, 1:2] = np.nan
iris.head(21)
| | sepal_length | sepal_width | petal_length | petal_width | class |
|---|---|---|---|---|---|
| 0 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 2 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 3 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 4 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 5 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 6 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 7 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 8 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
| 9 | 5.4 | 3.7 | 1.5 | 0.2 | Iris-setosa |
| 10 | 4.8 | NaN | 1.6 | 0.2 | Iris-setosa |
| 11 | 4.8 | NaN | 1.4 | 0.1 | Iris-setosa |
| 12 | 4.3 | NaN | 1.1 | 0.1 | Iris-setosa |
| 13 | 5.8 | NaN | 1.2 | 0.2 | Iris-setosa |
| 14 | 5.7 | NaN | 1.5 | 0.4 | Iris-setosa |
| 15 | 5.4 | NaN | 1.3 | 0.4 | Iris-setosa |
| 16 | 5.1 | NaN | 1.4 | 0.3 | Iris-setosa |
| 17 | 5.7 | NaN | 1.7 | 0.3 | Iris-setosa |
| 18 | 5.1 | NaN | 1.5 | 0.3 | Iris-setosa |
| 19 | 5.4 | NaN | 1.7 | 0.2 | Iris-setosa |
| 20 | 5.1 | 3.7 | 1.5 | 0.4 | Iris-setosa |
|
10.5 Replace all missing values with 1.0
iris.fillna(1, inplace=True)
iris.head(21)
| | sepal_length | sepal_width | petal_length | petal_width | class |
|---|---|---|---|---|---|
| 0 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 2 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 3 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 4 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 5 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 6 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 7 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 8 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
| 9 | 5.4 | 3.7 | 1.5 | 0.2 | Iris-setosa |
| 10 | 4.8 | 3.4 | 1.0 | 0.2 | Iris-setosa |
| 11 | 4.8 | 3.0 | 1.0 | 0.1 | Iris-setosa |
| 12 | 4.3 | 3.0 | 1.0 | 0.1 | Iris-setosa |
| 13 | 5.8 | 4.0 | 1.0 | 0.2 | Iris-setosa |
| 14 | 5.7 | 4.4 | 1.0 | 0.4 | Iris-setosa |
| 15 | 5.4 | 3.9 | 1.0 | 0.4 | Iris-setosa |
| 16 | 5.1 | 3.5 | 1.0 | 0.3 | Iris-setosa |
| 17 | 5.7 | 3.8 | 1.0 | 0.3 | Iris-setosa |
| 18 | 5.1 | 3.8 | 1.0 | 0.3 | Iris-setosa |
| 19 | 5.4 | 3.4 | 1.0 | 0.2 | Iris-setosa |
| 20 | 5.1 | 3.7 | 1.5 | 0.4 | Iris-setosa |
10.6 Delete the class column
del iris['class']
iris.head()
| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| 0 | 4.9 | 3.0 | 1.4 | 0.2 |
| 1 | 4.7 | 3.2 | 1.3 | 0.2 |
| 2 | 4.6 | 3.1 | 1.5 | 0.2 |
| 3 | 5.0 | 3.6 | 1.4 | 0.2 |
| 4 | 5.4 | 3.9 | 1.7 | 0.4 |
10.7 Set the first three rows to missing values
iris.iloc[0:3,:] = np.nan
iris.head()
| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN |
| 3 | 5.0 | 3.6 | 1.4 | 0.2 |
| 4 | 5.4 | 3.9 | 1.7 | 0.4 |
10.8 Drop the rows that have missing values
# Note: drop_duplicates only collapses the three identical all-NaN rows into one
# (see the remaining NaN row below); to actually drop rows with missing values,
# use iris.dropna(inplace=True)
iris.drop_duplicates(inplace=True)
iris.head()
| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN |
| 3 | 5.0 | 3.6 | 1.4 | 0.2 |
| 4 | 5.4 | 3.9 | 1.7 | 0.4 |
| 5 | 4.6 | 3.4 | 1.4 | 0.3 |
| 6 | 5.0 | 3.4 | 1.5 | 0.2 |
10.9 Reset the index
iris.reset_index(inplace=True, drop=True)
iris.head()
| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN |
| 1 | 5.0 | 3.6 | 1.4 | 0.2 |
| 2 | 5.4 | 3.9 | 1.7 | 0.4 |
| 3 | 4.6 | 3.4 | 1.4 | 0.3 |
| 4 | 5.0 | 3.4 | 1.5 | 0.2 |