1. Import the Pandas library under the alias pd and print its version.

In [2]:

import pandas as pd
pd.__version__

Out[2]:

'1.4.4'

2. Create a Series from a list

In [3]:

data = [1,2,3,4,5]
frame = pd.Series(data, index = ['a','b','c','d','e'])
frame

Out[3]:

a    1
b    2
c    3
d    4
e    5
dtype: int64

3. Create a Series from a dictionary

In [4]:

data = {'a':1, 'b':2, 'c':3, 'd':4,'e':5}
frame=pd.Series(data)
frame

Out[4]:

a    1
b    2
c    3
d    4
e    5
dtype: int64

4. Create a DataFrame from a random NumPy array, using a time series as the row index and letters as the column labels.

In [21]:

import numpy as np
dt1 = pd.date_range(start="today", periods=6, freq="D")
dt1

Out[21]:

DatetimeIndex(['2023-06-05 11:19:00.732923', '2023-06-06 11:19:00.732923',
               '2023-06-07 11:19:00.732923', '2023-06-08 11:19:00.732923',
               '2023-06-09 11:19:00.732923', '2023-06-10 11:19:00.732923'],
              dtype='datetime64[ns]', freq='D')

In [25]:

num_arr=np.random.randn(6,4)
columns=['A','B','C','D']
df=pd.DataFrame(num_arr,index=dt1,columns=columns)
df

Out[25]:

A B C D
2023-06-05 11:19:00.732923 0.853507 0.461207 -0.698314 1.271267
2023-06-06 11:19:00.732923 0.621321 -0.032685 0.334610 0.536929
2023-06-07 11:19:00.732923 0.774693 -1.199595 1.263980 -0.769168
2023-06-08 11:19:00.732923 -1.041118 0.610756 0.880698 0.474968
2023-06-09 11:19:00.732923 -1.963501 0.655607 -0.185408 -2.162950
2023-06-10 11:19:00.732923 -0.190997 1.608209 -1.175479 0.692370
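Because `start="today"` and `np.random.randn` give different results on every run, the output above cannot be reproduced exactly. A minimal sketch of a reproducible variant follows; the fixed start date, the seed value, and the name `df_repro` are illustrative assumptions, not part of the original exercise.

```python
import numpy as np
import pandas as pd

# Fixed start date (assumption) so the index is identical on every run.
dates = pd.date_range(start="2023-06-05", periods=6, freq="D")

# Seeded generator (seed 0 is an arbitrary choice) for repeatable values.
rng = np.random.default_rng(0)
df_repro = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))
print(df_repro.shape)
```

A date-only start also yields midnight timestamps, which keeps the row labels short.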

5. Create a Series object with the structure shown in the figure, then get its index, its data, and the value at positional index 2.

In [27]:

import pandas as pd
ser_obj=pd.Series([1,2,3,4,5],index=['No.0','No.1','No.2','No.3','No.4'])
ser_obj

Out[27]:

No.0    1
No.1    2
No.2    3
No.3    4
No.4    5
dtype: int64
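The cell above only builds the Series. The remaining parts of the task (index, data, and the value at positional index 2) could be sketched as follows, rebuilding the same `ser_obj` so the snippet runs on its own:

```python
import pandas as pd

ser_obj = pd.Series([1, 2, 3, 4, 5],
                    index=['No.0', 'No.1', 'No.2', 'No.3', 'No.4'])

print(ser_obj.index)    # the index labels
print(ser_obj.values)   # the underlying NumPy array of data
print(ser_obj.iloc[2])  # the value at positional index 2
```

`iloc` is used rather than `ser_obj[2]` because the index holds string labels, and `iloc` makes the positional intent explicit.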

6. Given the tabular data shown in the figure, perform the following operations:

In [29]:

# (1) Create a DataFrame object with the structure shown in the figure above.
import numpy as np
import pandas as pd
df_data = np.array([[1, 5, 8, 8], [2, 2, 4, 9], [7, 4, 2, 3], [3, 0, 5, 2]])  # array of cell values
col_data = np.array(['A', 'B', 'C', 'D'])  # array of column labels
# Build the DataFrame from the arrays
df_obj = pd.DataFrame(columns=col_data, data=df_data)
df_obj

Out[29]:

A B C D
0 1 5 8 8
1 2 2 4 9
2 7 4 2 3
3 3 0 5 2

In [30]:

# (2) Sort the rows by column B in descending order.
sort_values_data = df_obj.sort_values(by=['B'], ascending=False)
sort_values_data

Out[30]:

A B C D
0 1 5 8 8
2 7 4 2 3
1 2 2 4 9
3 3 0 5 2

In [32]:

# (3) Write the sorted data to a CSV file named write_data.csv.
sort_values_data.to_csv(r'F:\实训\数据分析实训\项目二 Pandas基础练习\write_data.csv')
'Write complete'

Out[32]:

'Write complete'
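To check that a file was written as expected, it can be read back. The sketch below writes to a temporary directory (a hypothetical location chosen only so the example runs anywhere) and uses `index_col=0` so the saved row labels are restored rather than a fresh 0..n-1 index being added.

```python
import os
import tempfile

import pandas as pd

df_obj = pd.DataFrame([[1, 5, 8, 8], [2, 2, 4, 9], [7, 4, 2, 3], [3, 0, 5, 2]],
                      columns=list("ABCD"))
sorted_df = df_obj.sort_values(by="B", ascending=False)

# Temporary location, used only for this sketch.
csv_path = os.path.join(tempfile.mkdtemp(), "write_data.csv")
sorted_df.to_csv(csv_path)

# index_col=0 turns the first CSV column back into the row index.
restored = pd.read_csv(csv_path, index_col=0)
print(restored)
```

Passing `index=False` to `to_csv` is the alternative when the row labels are not worth keeping.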

7. Given the tabular data shown in the figure, perform the following operations:

In [39]:

# (1) Create the Series with a two-level index shown in the figure.
mulitindex_series = pd.Series(
    [15848, 13472, 12073.8, 7813, 7446, 6444, 15230, 8269],
    index=[['河北省', '河北省', '河北省', '河北省', '河南省', '河南省', '河南省', '河南省'],
           ['石家庄市', '唐山市', '邯郸市', '秦皇岛市', '郑州市', '开封市', '洛阳市', '新乡市']])
mulitindex_series

Out[39]:

河北省  石家庄市    15848.0
     唐山市     13472.0
     邯郸市     12073.8
     秦皇岛市     7813.0
河南省  郑州市      7446.0
     开封市      6444.0
     洛阳市     15230.0
     新乡市      8269.0
dtype: float64

In [40]:

# (2) Get the subset whose outer index is '河北省'.
mulitindex_series['河北省']

Out[40]:

石家庄市    15848.0
唐山市     13472.0
邯郸市     12073.8
秦皇岛市     7813.0
dtype: float64

In [44]:

# (3) Get the subset whose inner index is '洛阳市'.
mulitindex_series[:,'洛阳市']

Out[44]:

河南省    15230.0
dtype: float64

In [46]:

# (4) Swap the positions of the outer and inner index levels.
mulitindex_series.swaplevel()

Out[46]:

石家庄市  河北省    15848.0
唐山市   河北省    13472.0
邯郸市   河北省    12073.8
秦皇岛市  河北省     7813.0
郑州市   河南省     7446.0
开封市   河南省     6444.0
洛阳市   河南省    15230.0
新乡市   河南省     8269.0
dtype: float64
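Two related operations are worth knowing for hierarchical indexes: `xs` selects on a named or numbered level, and `sort_index` regroups the labels after a swap. A small sketch on a shortened version of the data (four cities instead of eight, purely to keep the example brief):

```python
import pandas as pd

s = pd.Series(
    [15848, 13472, 7446, 15230],
    index=[['河北省', '河北省', '河南省', '河南省'],
           ['石家庄市', '唐山市', '郑州市', '洛阳市']],
)

# xs selects on a chosen level; level=1 is the inner (city) level.
luoyang = s.xs('洛阳市', level=1)

# swaplevel only reorders the levels; sort_index then sorts by the new outer level.
swapped = s.swaplevel().sort_index()
print(luoyang)
print(swapped)
```

`s.xs('洛阳市', level=1)` is an alternative spelling of the slice form `s[:, '洛阳市']`.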

8. Given the tabular data shown in the figure, perform the following operations:

In [47]:

# (1) Sort the data by column C in ascending order.
import numpy as np
import pandas as pd
df_data = np.array([[1, 5, 8, 8], [2, 2, 4, 9], [7, 4, 2, 3], [3, 0, 5, 2]])  # array of cell values
col_data = np.array(['A', 'B', 'C', 'D'])  # array of column labels
# Build the DataFrame from the arrays
df_obj = pd.DataFrame(columns=col_data, data=df_data)
df_obj

Out[47]:

A B C D
0 1 5 8 8
1 2 2 4 9
2 7 4 2 3
3 3 0 5 2

In [48]:

sort_values_data = df_obj.sort_values(by=['C'])
sort_values_data

Out[48]:

A B C D
2 7 4 2 3
1 2 2 4 9
3 3 0 5 2
0 1 5 8 8

In [49]:

# (2) Compute the sum, the maximum, and the descriptive statistics of each column.
print(df_obj.sum())
print(df_obj.max())
print(df_obj.describe())
A    13
B    11
C    19
D    22
dtype: int64
A    7
B    5
C    8
D    9
dtype: int32
              A         B     C         D
count  4.000000  4.000000  4.00  4.000000
mean   3.250000  2.750000  4.75  5.500000
std    2.629956  2.217356  2.50  3.511885
min    1.000000  0.000000  2.00  2.000000
25%    1.750000  1.500000  3.50  2.750000
50%    2.500000  3.000000  4.50  5.500000
75%    4.000000  4.250000  5.75  8.250000
max    7.000000  5.000000  8.00  9.000000
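The `std` row reported by `describe()` is the sample standard deviation (divisor n - 1), which differs from NumPy's default population version (divisor n). A short sketch on column A of the same data illustrates the difference:

```python
import numpy as np
import pandas as pd

df_obj = pd.DataFrame([[1, 5, 8, 8], [2, 2, 4, 9], [7, 4, 2, 3], [3, 0, 5, 2]],
                      columns=list("ABCD"))

sample_std = df_obj["A"].std()                   # pandas default: ddof=1
population_std = np.std(df_obj["A"].to_numpy())  # NumPy default: ddof=0
print(round(sample_std, 6), round(population_std, 6))
```

Passing `ddof=0` to `Series.std` or `ddof=1` to `np.std` makes the two agree.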

9. Create a DataFrame object as specified and complete the following operations:

In [50]:

# (1) Create the following DataFrame from a dictionary, using labels as the index.
import numpy as np
import pandas as pd
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)
df

Out[50]:

animal age visits priority
a cat 2.5 1 yes
b cat 3.0 3 yes
c snake 0.5 2 no
d dog NaN 3 yes
e dog 5.0 2 no
f cat 2.0 3 no
g snake 4.5 1 no
h cat NaN 1 yes
i dog 7.0 2 no
j dog 3.0 1 no

In [60]:

# (2) Show basic information about the DataFrame: number of rows, column names, non-null counts, and dtypes.
#df.describe()
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
animal      10 non-null object
age         8 non-null float64
visits      10 non-null int64
priority    10 non-null object
dtypes: float64(1), int64(1), object(2)
memory usage: 400.0+ bytes

In [61]:

# (3) Show the first three rows (two ways).
#df.iloc[:3]
df.head(3)

Out[61]:

animal age visits priority
a cat 2.5 1 yes
b cat 3.0 3 yes
c snake 0.5 2 no

In [62]:

# (4) Select the animal and age columns of the DataFrame.
df.loc[:, ['animal', 'age']]
# df[['animal', 'age']]

Out[62]:

animal age
a cat 2.5
b cat 3.0
c snake 0.5
d dog NaN
e dog 5.0
f cat 2.0
g snake 4.5
h cat NaN
i dog 7.0
j dog 3.0

In [63]:

# (5) Select the animal and age columns for the rows at positions [3, 4, 8].
df.loc[df.index[[3, 4, 8]], ['animal', 'age']]

Out[63]:

animal age
d dog NaN
e dog 5.0
i dog 7.0

In [78]:

# (6) Select the rows where age is greater than 3.
df[df['age'] > 3]

Out[78]:

animal age visits priority
e dog 5.0 2 no
g snake 4.5 1 no
i dog 7.0 2 no

In [79]:

# (7) Select the rows where age is missing.
df[df['age'].isnull()]

Out[79]:

animal age visits priority
d dog NaN 3 yes
h cat NaN 1 yes

In [82]:

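# (8) Select the rows where age is strictly between 2 and 4 (endpoints excluded).
# Series.between(2, 4) includes both endpoints by default, so use explicit comparisons.
df[(df['age'] > 2) & (df['age'] < 4)]

Out[82]:

animal age visits priority
a cat 2.5 1 yes
b cat 3.0 3 yes
j dog 3.0 1 no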
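`Series.between` includes both endpoints by default, so it matters which form is used when a task says "exclusive". The sketch below contrasts the two on the age column; it assumes pandas 1.3 or later, where the `inclusive` parameter accepts the strings `"both"`, `"neither"`, `"left"`, and `"right"`.

```python
import numpy as np
import pandas as pd

age = pd.Series([2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
                index=list("abcdefghij"))

closed = age[age.between(2, 4)]                       # 2 <= age <= 4 (default)
open_ = age[age.between(2, 4, inclusive="neither")]   # 2 < age < 4
print(list(closed.index), list(open_.index))
```

Row f (age 2.0) appears only in the closed interval; missing ages are excluded in both cases.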

In [85]:

# (9) Change the age in row f to 1.5.
df.loc['f', 'age'] = 1.5
df

Out[85]:

animal age visits priority
a cat 2.5 1 yes
b cat 3.0 3 yes
c snake 0.5 2 no
d dog NaN 3 yes
e dog 5.0 2 no
f cat 1.5 3 no
g snake 4.5 1 no
h cat NaN 1 yes
i dog 7.0 2 no
j dog 3.0 1 no

In [84]:

# (10) Compute the sum of visits.
df['visits'].sum()

Out[84]:

19

In [86]:

# (11) Compute the mean age for each kind of animal.
df.groupby('animal')['age'].mean()

Out[86]:

animal
cat      2.333333
dog      5.000000
snake    2.500000
Name: age, dtype: float64

In [87]:

# (12) Count the number of animals of each kind in df.
df.groupby('animal').size()
# df['animal'].value_counts()

Out[87]:

animal
cat      4
dog      4
snake    2
dtype: int64

In [88]:

# (13) Sort by age in descending order, then by visits in ascending order.
df.sort_values(by=['age', 'visits'], ascending=[False, True])

Out[88]:

animal age visits priority
i dog 7 2 no
e dog 5 2 no
g snake 4.5 1 no
j dog 3 1 no
b cat 3 3 yes
a cat 2.5 1 yes
f cat 1.5 3 no
c snake 0.5 2 no
h cat NaN 1 yes
d dog NaN 3 yes

In [89]:

# (14) Replace yes and no in the priority column with the booleans True and False.
df['priority'] = df['priority'].map({'yes': True, 'no': False})
df

Out[89]:

animal age visits priority
a cat 2.5 1 True
b cat 3 3 True
c snake 0.5 2 False
d dog NaN 3 True
e dog 5 2 False
f cat 1.5 3 False
g snake 4.5 1 False
h cat NaN 1 True
i dog 7 2 False
j dog 3 1 False

In [90]:

# (15) Replace snake with python in the animal column.
df['animal'] = df['animal'].replace('snake', 'python')
df

Out[90]:

animal age visits priority
a cat 2.5 1 True
b cat 3 3 True
c python 0.5 2 False
d dog NaN 3 True
e dog 5 2 False
f cat 1.5 3 False
g python 4.5 1 False
h cat NaN 1 True
i dog 7 2 False
j dog 3 1 False

In [100]:

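# (16) For each animal and each distinct number of visits, compute the mean age. That is, return a table whose rows are animal kinds, whose columns are visit counts, and whose values are the mean age for that animal and visit count.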
df.age=df.age.astype(float)
df.dtypes
df.pivot_table(index='animal', columns='visits', values='age', aggfunc='mean')

Out[100]:

visits 1 2 3
animal
cat 2.5 NaN 2.25
dog 3.0 6.0 NaN
python 4.5 0.5 NaN
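A pivot table with a single aggregation can also be written as a two-key `groupby` followed by `unstack`, which makes the mechanics more visible. The sketch below rebuilds the relevant columns (with snake already renamed to python and row f already set to 1.5) so it runs on its own; `pivot` and `alt` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'animal': ['cat', 'cat', 'python', 'dog', 'dog', 'cat', 'python', 'cat', 'dog', 'dog'],
    'age': [2.5, 3, 0.5, np.nan, 5, 1.5, 4.5, np.nan, 7, 3],
    'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
}, index=list("abcdefghij"))

pivot = df.pivot_table(index='animal', columns='visits', values='age', aggfunc='mean')

# Group by both keys, average, then move the inner key (visits) into the columns.
alt = df.groupby(['animal', 'visits'])['age'].mean().unstack()
print(alt)
```

Combinations that never occur (cat with 2 visits) and groups whose ages are all missing (dog with 3 visits) both end up as NaN.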

In [93]:

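# (17) Insert a new row k with the values ['cat', 5, 2, 'no'] into the DataFrame, then delete that row.
# Insert (values follow the column order: animal, age, visits, priority)
df.loc['k'] = ['cat', 5, 2, 'no']
# Delete
df = df.drop('k')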
df

Out[93]:

animal age visits priority
a cat 2.5 1 1
b cat 3 3 1
c python 0.5 2 0
d dog NaN 3 1
e dog 5 2 0
f cat 1.5 3 0
g python 4.5 1 0
h cat NaN 1 1
i dog 7 2 0
j dog 3 1 0
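Assigning a plain list through `df.loc[label]` relies on the values being in exactly the column order, and a mismatch silently changes column dtypes. One way to avoid that, sketched here on a two-row stand-in for the DataFrame, is to build the new row by column name and append it with `pd.concat`; `new_row`, `with_k`, and `without_k` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {'animal': ['cat', 'dog'], 'age': [2.5, 5.0],
     'visits': [1, 2], 'priority': ['yes', 'no']},
    index=['a', 'e'],
)

# Build the new row by column name so the order of values cannot be mixed up.
new_row = pd.DataFrame(
    {'animal': ['cat'], 'age': [5.0], 'visits': [2], 'priority': ['no']},
    index=['k'],
)
with_k = pd.concat([df, new_row])

# drop returns a new DataFrame without row k.
without_k = with_k.drop('k')
print(with_k.dtypes)
```

Because each value lands in a column of matching type, the numeric columns keep their dtypes after the insert.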

10. Read and inspect basic information about the main table of the P2P online lending data

In [104]:

# (1) Read the data file Training_Master.csv.
import pandas as pd
# Open the file in a with-block so the handle is closed after reading
with open('F:/实训/数据分析实训/项目二 Pandas基础练习/Training_Master.csv') as f:
    data = pd.read_csv(f)
data

Out[104]:

Idx UserInfo_1 UserInfo_2 UserInfo_3 UserInfo_4 WeblogInfo_1 WeblogInfo_2 WeblogInfo_3 WeblogInfo_4 WeblogInfo_5 ... SocialNetwork_10 SocialNetwork_11 SocialNetwork_12 SocialNetwork_13 SocialNetwork_14 SocialNetwork_15 SocialNetwork_16 SocialNetwork_17 target ListingInfo
0 10001 1.0 深圳 4.0 深圳 NaN 1.0 NaN 1.0 1.0 ... 222 -1 0 0 0 0 0 1 0 2014/3/5
1 10002 1.0 温州 4.0 温州 NaN 0.0 NaN 1.0 1.0 ... 1 -1 0 0 0 0 0 2 0 2014/2/26
2 10003 1.0 宜昌 3.0 宜昌 NaN 0.0 NaN 2.0 2.0 ... -1 -1 -1 1 0 0 0 0 0 2014/2/28
3 10006 4.0 南平 1.0 南平 NaN NaN NaN NaN NaN ... -1 -1 -1 0 0 0 0 0 0 2014/2/25
4 10007 5.0 辽阳 1.0 辽阳 NaN 0.0 NaN 1.0 1.0 ... -1 -1 -1 0 0 0 0 0 0 2014/2/27
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
29995 9991 3.0 南阳 4.0 南阳 NaN 1.0 NaN 3.0 2.0 ... 0 -1 0 1 0 0 0 1 0 2014/2/22
29996 9992 3.0 宁德 4.0 泉州 NaN 0.0 NaN 6.0 1.0 ... 407 -1 0 0 0 0 0 1 0 2014/2/28
29997 9995 1.0 天津 2.0 天津 NaN 0.0 NaN 2.0 2.0 ... -1 -1 -1 0 0 0 0 0 0 2014/2/24
29998 9997 3.0 运城 3.0 运城 NaN 0.0 NaN 1.0 1.0 ... 612 -1 0 1 0 0 0 1 0 2014/2/28
29999 9998 4.0 金华 5.0 无锡 NaN 0.0 NaN 1.0 1.0 ... -1 -1 -1 0 0 0 0 0 0 2014/3/5

30000 rows × 228 columns

In [105]:

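# (2) Use the ndim and shape attributes and the memory_usage method to view the number of dimensions, the shape, and the memory usage.
# Number of dimensions of the main table
print("Dimensions of the main table:", data.ndim)
# Shape of the main table
print("Shape of the main table:", data.shape)
# Memory usage of the main table
print("Memory usage of the main table:\n", data.memory_usage())
Dimensions of the main table: 2
Shape of the main table: (30000, 228)
Memory usage of the main table:
 Index                  128
Idx                 240000
UserInfo_1          240000
UserInfo_2          240000
UserInfo_3          240000
                     ...
SocialNetwork_15    240000
SocialNetwork_16    240000
SocialNetwork_17    240000
target              240000
ListingInfo         240000
Length: 229, dtype: int64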
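Every column in the output above reports 240000 bytes, which is 30000 rows times 8 bytes. For object (string) columns that default figure typically counts only the references, not the strings themselves. A minimal sketch on a three-row stand-in frame (made-up rows, with `small`, `shallow`, and `deep` as illustrative names) shows how `deep=True` requests the fuller figure:

```python
import pandas as pd

# Tiny stand-in for the main table, for illustration only.
small = pd.DataFrame({'city': ['深圳', '温州', '宜昌'], 'target': [0, 0, 1]})

shallow = small.memory_usage().sum()        # default accounting
deep = small.memory_usage(deep=True).sum()  # also measures the string contents
print(shallow, deep)
```

On a table with many text columns the deep total can be several times the shallow one.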

In [106]:

# (3) Compute descriptive statistics with the describe method.
a_describe = data.describe()
print("Descriptive statistics from describe():", a_describe)
Descriptive statistics from describe():                 Idx    UserInfo_1    UserInfo_3  WeblogInfo_1  WeblogInfo_2  \
count  30000.000000  29994.000000  29993.000000    970.000000  28342.000000
mean   46318.673267      3.219911      4.694329      2.201031      0.131466
std    26640.397805      1.827684      1.321458      7.831679      0.358486
min        3.000000      0.000000      0.000000      1.000000      0.000000
25%    22924.250000      1.000000      4.000000      1.000000      0.000000
50%    46849.500000      3.000000      5.000000      1.000000      0.000000
75%    69447.250000      5.000000      5.000000      1.000000      0.000000
max    91703.000000      7.000000      7.000000    133.000000      4.000000

       WeblogInfo_3  WeblogInfo_4  WeblogInfo_5  WeblogInfo_6  WeblogInfo_7  \
count    970.000000  28349.000000  28349.000000  28349.000000  30000.000000
mean       1.308247      3.025962      1.816960      2.948711     10.632800
std        7.866457      3.772421      1.701177      3.770300     16.097588
min        0.000000      1.000000      1.000000      1.000000      0.000000
25%        0.000000      1.000000      1.000000      1.000000      2.000000
50%        0.000000      2.000000      1.000000      2.000000      6.000000
75%        1.000000      3.000000      2.000000      3.000000     13.000000
max      133.000000    165.000000     73.000000    165.000000    722.000000

       ...  SocialNetwork_9  SocialNetwork_10  SocialNetwork_11  \
count  ...     30000.000000      30000.000000      30000.000000
mean   ...        35.516167         75.211233         -0.999267
std    ...       135.954587        742.978305          0.052911
min    ...        -1.000000         -1.000000         -1.000000
25%    ...        -1.000000         -1.000000         -1.000000
50%    ...        -1.000000         -1.000000         -1.000000
75%    ...        -1.000000         -1.000000         -1.000000
max    ...      3242.000000      71253.000000          6.000000

       SocialNetwork_12  SocialNetwork_13  SocialNetwork_14  SocialNetwork_15  \
count      30000.000000      30000.000000      30000.000000      30000.000000
mean          -0.745033          0.221167          0.062033          0.027967
std            0.441473          0.420545          0.242598          0.164880
min           -1.000000          0.000000          0.000000          0.000000
25%           -1.000000          0.000000          0.000000          0.000000
50%           -1.000000          0.000000          0.000000          0.000000
75%            0.000000          0.000000          0.000000          0.000000
max            1.000000          2.000000          3.000000          1.000000

       SocialNetwork_16  SocialNetwork_17        target
count      30000.000000      30000.000000  30000.000000
mean           0.016633          0.253467      0.073267
std            0.127895          0.437296      0.260578
min            0.000000          0.000000      0.000000
25%            0.000000          0.000000      0.000000
50%            0.000000          0.000000      0.000000
75%            0.000000          1.000000      0.000000
max            1.000000          3.000000      1.000000

[8 rows x 208 columns]

11. Explore the Euro 2012 data

In [121]:

import pandas as pd
# Open the file in a with-block so the handle is closed after reading
with open('F:/实训/数据分析实训/项目二 Pandas基础练习/Euro2012.csv') as f:
    data = pd.read_csv(f)
data

Out[121]:

Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored ... Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards Subs on Subs off Players Used
0 Croatia 4 13 12 51.9% 16.0% 32 0 0 0 ... 13 81.3% 41 62 2 9 0 9 9 16
1 Czech Republic 4 13 18 41.9% 12.9% 39 0 0 0 ... 9 60.1% 53 73 8 7 0 11 11 19
2 Denmark 4 10 10 50.0% 20.0% 27 1 0 0 ... 10 66.7% 25 38 8 4 0 7 7 15
3 England 5 11 18 50.0% 17.2% 40 0 0 0 ... 22 88.1% 43 45 6 5 0 11 11 16
4 France 3 22 24 37.9% 6.5% 65 1 0 0 ... 6 54.6% 36 51 5 6 0 11 11 19
5 Germany 10 32 32 47.8% 15.6% 80 2 1 0 ... 10 62.6% 63 49 12 4 0 15 15 17
6 Greece 5 8 18 30.7% 19.2% 32 1 1 1 ... 13 65.1% 67 48 12 9 1 12 12 20
7 Italy 6 34 45 43.0% 7.5% 110 2 0 0 ... 20 74.1% 101 89 16 16 0 18 18 19
8 Netherlands 2 12 36 25.0% 4.1% 60 2 0 0 ... 12 70.6% 35 30 3 5 0 7 7 15
9 Poland 2 15 23 39.4% 5.2% 48 0 0 0 ... 6 66.7% 48 56 3 7 1 7 7 17
10 Portugal 6 22 42 34.3% 9.3% 82 6 0 0 ... 10 71.5% 73 90 10 12 0 14 14 16
11 Republic of Ireland 1 7 12 36.8% 5.2% 28 0 0 0 ... 17 65.4% 43 51 11 6 1 10 10 17
12 Russia 5 9 31 22.5% 12.5% 59 2 0 0 ... 10 77.0% 34 43 4 6 0 7 7 16
13 Spain 12 42 33 55.9% 16.0% 100 0 1 0 ... 15 93.8% 102 83 19 11 0 17 17 18
14 Sweden 5 17 19 47.2% 13.8% 39 3 0 0 ... 8 61.6% 35 51 7 7 0 9 9 18
15 Ukraine 2 7 26 21.2% 6.0% 38 0 0 0 ... 13 76.5% 48 31 4 5 0 9 9 18

16 rows × 35 columns

In [124]:

# (1) Name the dataset euro12.
# Load the dataset from the target path
path2 = "F:/实训/数据分析实训/项目二 Pandas基础练习/Euro2012.csv"
# Euro2012_stats.csv
euro12 = pd.read_csv(path2)
euro12

Out[124]:

Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored ... Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards Subs on Subs off Players Used
0 Croatia 4 13 12 51.9% 16.0% 32 0 0 0 ... 13 81.3% 41 62 2 9 0 9 9 16
1 Czech Republic 4 13 18 41.9% 12.9% 39 0 0 0 ... 9 60.1% 53 73 8 7 0 11 11 19
2 Denmark 4 10 10 50.0% 20.0% 27 1 0 0 ... 10 66.7% 25 38 8 4 0 7 7 15
3 England 5 11 18 50.0% 17.2% 40 0 0 0 ... 22 88.1% 43 45 6 5 0 11 11 16
4 France 3 22 24 37.9% 6.5% 65 1 0 0 ... 6 54.6% 36 51 5 6 0 11 11 19
5 Germany 10 32 32 47.8% 15.6% 80 2 1 0 ... 10 62.6% 63 49 12 4 0 15 15 17
6 Greece 5 8 18 30.7% 19.2% 32 1 1 1 ... 13 65.1% 67 48 12 9 1 12 12 20
7 Italy 6 34 45 43.0% 7.5% 110 2 0 0 ... 20 74.1% 101 89 16 16 0 18 18 19
8 Netherlands 2 12 36 25.0% 4.1% 60 2 0 0 ... 12 70.6% 35 30 3 5 0 7 7 15
9 Poland 2 15 23 39.4% 5.2% 48 0 0 0 ... 6 66.7% 48 56 3 7 1 7 7 17
10 Portugal 6 22 42 34.3% 9.3% 82 6 0 0 ... 10 71.5% 73 90 10 12 0 14 14 16
11 Republic of Ireland 1 7 12 36.8% 5.2% 28 0 0 0 ... 17 65.4% 43 51 11 6 1 10 10 17
12 Russia 5 9 31 22.5% 12.5% 59 2 0 0 ... 10 77.0% 34 43 4 6 0 7 7 16
13 Spain 12 42 33 55.9% 16.0% 100 0 1 0 ... 15 93.8% 102 83 19 11 0 17 17 18
14 Sweden 5 17 19 47.2% 13.8% 39 3 0 0 ... 8 61.6% 35 51 7 7 0 9 9 18
15 Ukraine 2 7 26 21.2% 6.0% 38 0 0 0 ... 13 76.5% 48 31 4 5 0 9 9 18

16 rows × 35 columns

In [125]:

# (2) Select only the Goals column.
euro12.Goals

Out[125]:

0      4
1      4
2      4
3      5
4      3
5     10
6      5
7      6
8      2
9      2
10     6
11     1
12     5
13    12
14     5
15     2
Name: Goals, dtype: int64

In [126]:

# (3) How many teams took part in Euro 2012?
euro12.shape[0]

Out[126]:

16

In [127]:

# (4) How many columns does the dataset have?
euro12.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 35 columns):
Team                          16 non-null object
Goals                         16 non-null int64
Shots on target               16 non-null int64
Shots off target              16 non-null int64
Shooting Accuracy             16 non-null object
% Goals-to-shots              16 non-null object
Total shots (inc. Blocked)    16 non-null int64
Hit Woodwork                  16 non-null int64
Penalty goals                 16 non-null int64
Penalties not scored          16 non-null int64
Headed goals                  16 non-null int64
Passes                        16 non-null int64
Passes completed              16 non-null int64
Passing Accuracy              16 non-null object
Touches                       16 non-null int64
Crosses                       16 non-null int64
Dribbles                      16 non-null int64
Corners Taken                 16 non-null int64
Tackles                       16 non-null int64
Clearances                    16 non-null int64
Interceptions                 16 non-null int64
Clearances off line           15 non-null float64
Clean Sheets                  16 non-null int64
Blocks                        16 non-null int64
Goals conceded                16 non-null int64
Saves made                    16 non-null int64
Saves-to-shots ratio          16 non-null object
Fouls Won                     16 non-null int64
Fouls Conceded                16 non-null int64
Offsides                      16 non-null int64
Yellow Cards                  16 non-null int64
Red Cards                     16 non-null int64
Subs on                       16 non-null int64
Subs off                      16 non-null int64
Players Used                  16 non-null int64
dtypes: float64(1), int64(29), object(5)
memory usage: 4.5+ KB

In [128]:

# (5) Store the columns Team, Yellow Cards, and Red Cards in a separate DataFrame named discipline.
discipline = euro12[['Team','Yellow Cards','Red Cards']]
discipline

Out[128]:

Team Yellow Cards Red Cards
0 Croatia 9 0
1 Czech Republic 7 0
2 Denmark 4 0
3 England 5 0
4 France 6 0
5 Germany 4 0
6 Greece 9 1
7 Italy 16 0
8 Netherlands 5 0
9 Poland 7 1
10 Portugal 12 0
11 Republic of Ireland 6 1
12 Russia 6 0
13 Spain 11 0
14 Sweden 7 0
15 Ukraine 5 0

In [129]:

# (6) Sort discipline by Red Cards first, then by Yellow Cards (both descending).
discipline.sort_values(['Red Cards','Yellow Cards'],ascending = False)

Out[129]:

Team Yellow Cards Red Cards
6 Greece 9 1
9 Poland 7 1
11 Republic of Ireland 6 1
7 Italy 16 0
10 Portugal 12 0
13 Spain 11 0
0 Croatia 9 0
1 Czech Republic 7 0
14 Sweden 7 0
4 France 6 0
12 Russia 6 0
3 England 5 0
8 Netherlands 5 0
15 Ukraine 5 0
2 Denmark 4 0
5 Germany 4 0

In [130]:

# (7) Compute the mean number of yellow cards per team (rounded to the nearest integer).
round(discipline['Yellow Cards'].mean())

Out[130]:

7

In [131]:

# (8) Find the teams that scored more than 6 goals.
euro12[euro12.Goals > 6]

Out[131]:

Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored ... Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards Subs on Subs off Players Used
5 Germany 10 32 32 47.8% 15.6% 80 2 1 0 ... 10 62.6% 63 49 12 4 0 15 15 17
13 Spain 12 42 33 55.9% 16.0% 100 0 1 0 ... 15 93.8% 102 83 19 11 0 17 17 18

2 rows × 35 columns

In [132]:

# (9) Select the teams whose names start with the letter G.
euro12[euro12.Team.str.startswith('G')]

Out[132]:

Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored ... Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards Subs on Subs off Players Used
5 Germany 10 32 32 47.8% 15.6% 80 2 1 0 ... 10 62.6% 63 49 12 4 0 15 15 17
6 Greece 5 8 18 30.7% 19.2% 32 1 1 1 ... 13 65.1% 67 48 12 9 1 12 12 20

2 rows × 35 columns

In [133]:

# (10) Select the first 7 columns.
euro12.iloc[:,0:7]

Out[133]:

Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked)
0 Croatia 4 13 12 51.9% 16.0% 32
1 Czech Republic 4 13 18 41.9% 12.9% 39
2 Denmark 4 10 10 50.0% 20.0% 27
3 England 5 11 18 50.0% 17.2% 40
4 France 3 22 24 37.9% 6.5% 65
5 Germany 10 32 32 47.8% 15.6% 80
6 Greece 5 8 18 30.7% 19.2% 32
7 Italy 6 34 45 43.0% 7.5% 110
8 Netherlands 2 12 36 25.0% 4.1% 60
9 Poland 2 15 23 39.4% 5.2% 48
10 Portugal 6 22 42 34.3% 9.3% 82
11 Republic of Ireland 1 7 12 36.8% 5.2% 28
12 Russia 5 9 31 22.5% 12.5% 59
13 Spain 12 42 33 55.9% 16.0% 100
14 Sweden 5 17 19 47.2% 13.8% 39
15 Ukraine 2 7 26 21.2% 6.0% 38

In [134]:

# (11) Select all columns except the last 3.
euro12.iloc[:,:-3]

Out[134]:

Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored ... Clean Sheets Blocks Goals conceded Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards
0 Croatia 4 13 12 51.9% 16.0% 32 0 0 0 ... 0 10 3 13 81.3% 41 62 2 9 0
1 Czech Republic 4 13 18 41.9% 12.9% 39 0 0 0 ... 1 10 6 9 60.1% 53 73 8 7 0
2 Denmark 4 10 10 50.0% 20.0% 27 1 0 0 ... 1 10 5 10 66.7% 25 38 8 4 0
3 England 5 11 18 50.0% 17.2% 40 0 0 0 ... 2 29 3 22 88.1% 43 45 6 5 0
4 France 3 22 24 37.9% 6.5% 65 1 0 0 ... 1 7 5 6 54.6% 36 51 5 6 0
5 Germany 10 32 32 47.8% 15.6% 80 2 1 0 ... 1 11 6 10 62.6% 63 49 12 4 0
6 Greece 5 8 18 30.7% 19.2% 32 1 1 1 ... 1 23 7 13 65.1% 67 48 12 9 1
7 Italy 6 34 45 43.0% 7.5% 110 2 0 0 ... 2 18 7 20 74.1% 101 89 16 16 0
8 Netherlands 2 12 36 25.0% 4.1% 60 2 0 0 ... 0 9 5 12 70.6% 35 30 3 5 0
9 Poland 2 15 23 39.4% 5.2% 48 0 0 0 ... 0 8 3 6 66.7% 48 56 3 7 1
10 Portugal 6 22 42 34.3% 9.3% 82 6 0 0 ... 2 11 4 10 71.5% 73 90 10 12 0
11 Republic of Ireland 1 7 12 36.8% 5.2% 28 0 0 0 ... 0 23 9 17 65.4% 43 51 11 6 1
12 Russia 5 9 31 22.5% 12.5% 59 2 0 0 ... 0 8 3 10 77.0% 34 43 4 6 0
13 Spain 12 42 33 55.9% 16.0% 100 0 1 0 ... 5 8 1 15 93.8% 102 83 19 11 0
14 Sweden 5 17 19 47.2% 13.8% 39 3 0 0 ... 1 12 5 8 61.6% 35 51 7 7 0
15 Ukraine 2 7 26 21.2% 6.0% 38 0 0 0 ... 0 4 4 13 76.5% 48 31 4 5 0

16 rows × 32 columns

In [135]:

# (12) Find the Shooting Accuracy of England, Italy, and Russia.
euro12.loc[euro12.Team.isin(['England','Italy','Russia']),['Team','Shooting Accuracy']]

Out[135]:

Team Shooting Accuracy
3 England 50.0%
7 Italy 43.0%
12 Russia 22.5%
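The Shooting Accuracy values are stored as strings such as `'50.0%'` (the column has dtype object in the `info()` output), so they cannot be compared or averaged numerically as they stand. A minimal sketch of converting them, using the three rows shown above and a hypothetical new column name `accuracy_pct`:

```python
import pandas as pd

acc = pd.DataFrame({
    'Team': ['England', 'Italy', 'Russia'],
    'Shooting Accuracy': ['50.0%', '43.0%', '22.5%'],
})

# Strip the trailing percent sign, then cast the remaining text to float.
acc['accuracy_pct'] = acc['Shooting Accuracy'].str.rstrip('%').astype(float)
print(acc.sort_values('accuracy_pct', ascending=False))
```

The same pattern applies to the other percentage columns in the dataset.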

12. Explore the Chipotle fast-food data

In [137]:

# (1) Load the dataset into a DataFrame named chipo.
import pandas as pd
chipo = pd.read_csv('F:/实训/数据分析实训/项目二 Pandas基础练习/chipotle.csv',sep='\t')
'Done'

Out[137]:

'Done'

In [138]:

# (2) View the first 10 rows.
chipo.head(10)

Out[138]:

order_id quantity item_name choice_description item_price
0 1 1 Chips and Fresh Tomato Salsa NaN $2.39
1 1 1 Izze [Clementine] $3.39
2 1 1 Nantucket Nectar [Apple] $3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN $2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $16.98
5 3 1 Chicken Bowl [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... $10.98
6 3 1 Side of Chips NaN $1.69
7 4 1 Steak Burrito [Tomatillo Red Chili Salsa, [Fajita Vegetables... $11.75
8 4 1 Steak Soft Tacos [Tomatillo Green Chili Salsa, [Pinto Beans, Ch... $9.25
9 5 1 Steak Burrito [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... $9.25

In [139]:

# (3) View the last 10 rows.
chipo.tail(10)

Out[139]:

order_id quantity item_name choice_description item_price
4612 1831 1 Carnitas Bowl [Fresh Tomato Salsa, [Fajita Vegetables, Rice,... $9.25
4613 1831 1 Chips NaN $2.15
4614 1831 1 Bottled Water NaN $1.50
4615 1832 1 Chicken Soft Tacos [Fresh Tomato Salsa, [Rice, Cheese, Sour Cream]] $8.75
4616 1832 1 Chips and Guacamole NaN $4.45
4617 1833 1 Steak Burrito [Fresh Tomato Salsa, [Rice, Black Beans, Sour ... $11.75
4618 1833 1 Steak Burrito [Fresh Tomato Salsa, [Rice, Sour Cream, Cheese... $11.75
4619 1834 1 Chicken Salad Bowl [Fresh Tomato Salsa, [Fajita Vegetables, Pinto... $11.25
4620 1834 1 Chicken Salad Bowl [Fresh Tomato Salsa, [Fajita Vegetables, Lettu... $8.75
4621 1834 1 Chicken Salad Bowl [Fresh Tomato Salsa, [Fajita Vegetables, Pinto... $8.75

In [140]:

# (4) View the shape of the data as (number of rows, number of columns).
chipo.shape

Out[140]:

(4622, 5)

In [141]:

# (5) How many columns does the dataset have?
chipo.columns.size
#chipo.shape[1]

Out[141]:

5

In [161]:

# (6) Print all the column names.
chipo.columns
#chipo.keys()

Out[161]:

Index(['order_id', 'quantity', 'item_name', 'choice_description','item_price'],dtype='object')

In [162]:

# (7) How is the dataset indexed?
chipo.index

Out[162]:

RangeIndex(start=0, stop=4622, step=1)

In [163]:

# (8) View summary statistics for the numeric columns.
chipo.describe()

Out[163]:

order_id quantity
count 4622.000000 4622.000000
mean 927.254868 1.075725
std 528.890796 0.410186
min 1.000000 1.000000
25% 477.250000 1.000000
50% 926.000000 1.000000
75% 1393.000000 1.000000
max 1834.000000 15.000000

In [164]:

# (9) View the column names, dtypes, non-null counts, and memory usage.
chipo.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
order_id              4622 non-null int64
quantity              4622 non-null int64
item_name             4622 non-null object
choice_description    3376 non-null object
item_price            4622 non-null object
dtypes: int64(2), object(3)
memory usage: 180.7+ KB

In [165]:

# (10) View the item name column (two equivalent ways).
chipo.item_name
chipo['item_name']

Out[165]:

0                Chips and Fresh Tomato Salsa
1                                        Izze
2                            Nantucket Nectar
3       Chips and Tomatillo-Green Chili Salsa
4                                Chicken Bowl
                        ...
4617                            Steak Burrito
4618                            Steak Burrito
4619                       Chicken Salad Bowl
4620                       Chicken Salad Bowl
4621                       Chicken Salad Bowl
Name: item_name, Length: 4622, dtype: object

In [169]:

# (11) View the item name and quantity columns, returned as a DataFrame.
chipo[['item_name','quantity']]

Out[169]:

item_name quantity
0 Chips and Fresh Tomato Salsa 1
1 Izze 1
2 Nantucket Nectar 1
3 Chips and Tomatillo-Green Chili Salsa 1
4 Chicken Bowl 2
... ... ...
4617 Steak Burrito 1
4618 Steak Burrito 1
4619 Chicken Salad Bowl 1
4620 Chicken Salad Bowl 1
4621 Chicken Salad Bowl 1

4622 rows × 2 columns

In [170]:

# (12) View the rows from index 3 up to, but not including, index 10.
chipo[3:10]

Out[170]:

order_id quantity item_name choice_description item_price
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN $2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $16.98
5 3 1 Chicken Bowl [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... $10.98
6 3 1 Side of Chips NaN $1.69
7 4 1 Steak Burrito [Tomatillo Red Chili Salsa, [Fajita Vegetables... $11.75
8 4 1 Steak Soft Tacos [Tomatillo Green Chili Salsa, [Pinto Beans, Ch... $9.25
9 5 1 Steak Burrito [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... $9.25

In [172]:

# (13) View the order lines where the quantity sold is greater than 5.
cond = chipo.quantity > 5
# cond is a boolean Series
chipo[cond]
# returns the order lines with quantity > 5

Out[172]:

order_id quantity item_name choice_description item_price
3598 1443 15 Chips and Fresh Tomato Salsa NaN $44.25
3599 1443 7 Bottled Water NaN $10.50
3887 1559 8 Side of Chips NaN $13.52
4152 1660 10 Bottled Water NaN $15.00

In [173]:

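# (14) View the order lines where the quantity sold is greater than 5 and the item name is 'Bottled Water'.
cond = (chipo.quantity > 5) & (chipo.item_name == 'Bottled Water')  # element-wise AND, returns a boolean Series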
chipo[cond]

Out[173]:

order_id quantity item_name choice_description item_price
3599 1443 7 Bottled Water NaN $10.50
4152 1660 10 Bottled Water NaN $15.00

In [180]:

# (15) Which item was ordered most often?
#chipo[['item_name','quantity']].groupby(by=['item_name']).sum().sort_values(by=['quantity'],ascending=False)
chipo["item_name"].value_counts().head(1)
# The most frequently ordered item is Chicken Bowl

Out[180]:

Chicken Bowl    726
Name: item_name, dtype: int64
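`value_counts` counts order lines, while the commented-out `groupby(...).sum()` variant counts units sold, and the two can rank items differently. A small sketch on made-up rows (not taken from the Chipotle file) shows the distinction; `orders`, `by_lines`, and `by_units` are illustrative names.

```python
import pandas as pd

# Made-up rows, for illustration only.
orders = pd.DataFrame({
    'item_name': ['Chicken Bowl', 'Chicken Bowl', 'Chicken Bowl', 'Bottled Water'],
    'quantity': [1, 1, 1, 10],
})

by_lines = orders['item_name'].value_counts()  # number of order lines per item
by_units = (orders.groupby('item_name')['quantity']
                  .sum()
                  .sort_values(ascending=False))  # number of units per item
print(by_lines)
print(by_units)
```

Here Chicken Bowl leads by lines and Bottled Water leads by units, so the answer depends on which reading of "ordered most" is intended.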

In [166]:

# (16) How many distinct items were ordered, according to the item_name column?
len(chipo["item_name"].unique())
#chipo["item_name"].nunique()

Out[166]:

50

In [183]:

# (17) Which choice_description was ordered most often?
#chipo[['choice_description','quantity']].groupby(by=['choice_description']).sum().sort_values(by=['quantity'],ascending=False)
chipo['choice_description'].value_counts().head(1)

Out[183]:

[Diet Coke]    134
Name: choice_description, dtype: int64

In [167]:

# (18) How many items were ordered in total?
chipo["quantity"].sum()

Out[167]:

4972

In [175]:

# (19) Convert item_price to float.
print("dtype before conversion:", chipo["item_price"].dtypes)
# Strip the leading '$' with a vectorized string operation, then cast to float
chipo["item_price"] = chipo["item_price"].str.replace('$', '', regex=False).astype('float')
print("dtype after conversion:", chipo["item_price"].dtypes)
dtype before conversion: object
dtype after conversion: float64

In [176]:

# (20) What was the revenue over the period covered by the dataset?
chipo['sub_total'] = round(chipo['item_price'] * chipo['quantity'],2)  # price x quantity
chipo['sub_total'].sum()

Out[176]:

39237.02

In [174]:

# (21) How many orders were placed over the period covered by the dataset?
len(chipo["order_id"].unique())
#chipo["order_id"].nunique()

Out[174]:

1834

In [177]:

# (22) What is the average total per order?
(chipo['quantity']*chipo['item_price']).sum()/chipo["order_id"].nunique()

Out[177]:

21.39423118865867
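Dividing the grand total by the number of orders gives the mean, but grouping by `order_id` first yields a per-order total that can be reused for other statistics. A sketch on made-up rows, using the same price-times-quantity formula as the cell above (`orders`, `line_total`, and `per_order` are illustrative names):

```python
import pandas as pd

# Made-up rows, for illustration only.
orders = pd.DataFrame({
    'order_id': [1, 1, 2],
    'quantity': [1, 2, 1],
    'item_price': [2.0, 3.0, 10.0],
})

line_total = orders['quantity'] * orders['item_price']  # same formula as the cell above
per_order = line_total.groupby(orders['order_id']).sum()  # one total per order
print(per_order.mean())
```

With `per_order` in hand, `per_order.median()` or `per_order.describe()` follow directly.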
