Python Data Analysis Exercises (2): The Pandas Data Analysis Tool
1. Import the Pandas library under the alias pd and print its version number.
In [2]:
import pandas as pd
pd.__version__
Out[2]:
'1.4.4'
2. Create a Series from a list
In [3]:
data = [1, 2, 3, 4, 5]
frame = pd.Series(data, index=['a', 'b', 'c', 'd', 'e'])
frame
Out[3]:
a    1
b    2
c    3
d    4
e    5
dtype: int64
3. Create a Series from a dictionary
In [4]:
data = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
frame = pd.Series(data)
frame
Out[4]:
a    1
b    2
c    3
d    4
e    5
dtype: int64
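The two constructions above should produce the same Series. A minimal sketch that checks this with `Series.equals`; the names `from_list` and `from_dict` are illustrative and not part of the exercise:

```python
import pandas as pd

# Same values built two ways: a list plus an explicit index,
# and a dict whose keys become the index.
from_list = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
from_dict = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5})

# equals() compares values, index labels, and dtype.
same = from_list.equals(from_dict)
print(same)
```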
4. Create a DataFrame from a random NumPy array, using a time series as the row index and letters as the column labels.
In [21]:
import numpy as np
dt1 = pd.date_range(start="today", periods=6, freq="D")
dt1
Out[21]:
DatetimeIndex(['2023-06-05 11:19:00.732923', '2023-06-06 11:19:00.732923',
               '2023-06-07 11:19:00.732923', '2023-06-08 11:19:00.732923',
               '2023-06-09 11:19:00.732923', '2023-06-10 11:19:00.732923'],
              dtype='datetime64[ns]', freq='D')
In [25]:
num_arr = np.random.randn(6, 4)
columns = ['A', 'B', 'C', 'D']
df = pd.DataFrame(num_arr, index=dt1, columns=columns)
df
Out[25]:
                                   A         B         C         D
2023-06-05 11:19:00.732923  0.853507  0.461207 -0.698314  1.271267
2023-06-06 11:19:00.732923  0.621321 -0.032685  0.334610  0.536929
2023-06-07 11:19:00.732923  0.774693 -1.199595  1.263980 -0.769168
2023-06-08 11:19:00.732923 -1.041118  0.610756  0.880698  0.474968
2023-06-09 11:19:00.732923 -1.963501  0.655607 -0.185408 -2.162950
2023-06-10 11:19:00.732923 -0.190997  1.608209 -1.175479  0.692370
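Because the cell above uses start="today" and an unseeded generator, its output changes on every run. A reproducible variant might look like the sketch below; the fixed start date, the seed 0, and the name `df_demo` are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Seeded generator so the random values are the same on every run (assumed seed).
rng = np.random.default_rng(0)

# A fixed start date gives a stable index without a time-of-day component.
dates = pd.date_range(start="2023-06-05", periods=6, freq="D")

df_demo = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))
print(df_demo.shape)
```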
5. Create the Series object shown in the figure, then get its index, its data, and the value at positional index 2.
In [27]:
import pandas as pd
ser_obj = pd.Series([1, 2, 3, 4, 5], index=['No.0', 'No.1', 'No.2', 'No.3', 'No.4'])
ser_obj
Out[27]:
No.0    1
No.1    2
No.2    3
No.3    4
No.4    5
dtype: int64
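The cell above only builds the Series; the three lookups the task asks for could be done as in this sketch (the names `labels`, `values`, and `third` are illustrative):

```python
import pandas as pd

ser_obj = pd.Series([1, 2, 3, 4, 5], index=['No.0', 'No.1', 'No.2', 'No.3', 'No.4'])

labels = ser_obj.index    # the index labels
values = ser_obj.values   # the data as a NumPy array
third = ser_obj.iloc[2]   # the value at positional index 2 (label 'No.2')

print(list(labels), values.tolist(), third)
```

`iloc` is used rather than `ser_obj[2]` so that the lookup is unambiguously positional even though the index holds string labels.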
6. Given the tabular data shown in the figure below, perform the following operations:
In [29]:
#(1) Create a DataFrame object with the structure shown in the figure above
import numpy as np
import pandas as pd
df_data = np.array([[1, 5, 8, 8], [2, 2, 4, 9], [7, 4, 2, 3], [3, 0, 5, 2]])  # data array
col_data = np.array(['A', 'B', 'C', 'D'])  # column labels
# Build the DataFrame object from the arrays
df_obj = pd.DataFrame(columns=col_data, data=df_data)
df_obj
Out[29]:
   A  B  C  D
0  1  5  8  8
1  2  2  4  9
2  7  4  2  3
3  3  0  5  2
In [30]:
#(2) Sort the data by column B in descending order.
sort_values_data = df_obj.sort_values(by=['B'], ascending=False)
sort_values_data
Out[30]:
   A  B  C  D
0  1  5  8  8
2  7  4  2  3
1  2  2  4  9
3  3  0  5  2
In [32]:
#(3) Write the sorted data to a CSV file named write_data.csv.
sort_values_data.to_csv(r'F:\实训\数据分析实训\项目二 Pandas基础练习\write_data.csv')
'Write complete'
Out[32]:
'Write complete'
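By default `to_csv` also writes the row index as the first column, which matters when the file is read back. A small round-trip sketch, assuming a temporary directory in place of the F:\ path used above (`csv_path` and `restored` are illustrative names):

```python
import os
import tempfile

import pandas as pd

df_obj = pd.DataFrame(
    [[1, 5, 8, 8], [2, 2, 4, 9], [7, 4, 2, 3], [3, 0, 5, 2]],
    columns=list("ABCD"),
)
sorted_df = df_obj.sort_values(by="B", ascending=False)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "write_data.csv")  # temporary location, not the original path
    sorted_df.to_csv(csv_path)                           # the row index becomes the first CSV column
    restored = pd.read_csv(csv_path, index_col=0)        # read that column back as the index

print(restored)
```

Passing `index=False` to `to_csv` would drop the index column instead, at the cost of losing the original row labels.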
7. Given the tabular data shown in the figure below, perform the following operations:
In [39]:
#(1) Create the Series with a two-level (hierarchical) index.
mulitindex_series = pd.Series(
    [15848, 13472, 12073.8, 7813, 7446, 6444, 15230, 8269],
    index=[['河北省', '河北省', '河北省', '河北省', '河南省', '河南省', '河南省', '河南省'],
           ['石家庄市', '唐山市', '邯郸市', '秦皇岛市', '郑州市', '开封市', '洛阳市', '新乡市']])
mulitindex_series
Out[39]:
河北省  石家庄市    15848.0
        唐山市      13472.0
        邯郸市      12073.8
        秦皇岛市     7813.0
河南省  郑州市       7446.0
        开封市       6444.0
        洛阳市      15230.0
        新乡市       8269.0
dtype: float64
In [40]:
#(2) Get the subset whose outer index is '河北省'.
mulitindex_series['河北省']
Out[40]:
石家庄市    15848.0
唐山市      13472.0
邯郸市      12073.8
秦皇岛市     7813.0
dtype: float64
In [44]:
#(3) Get the subset whose inner index is '洛阳市'.
mulitindex_series[:, '洛阳市']
Out[44]:
河南省    15230.0
dtype: float64
In [46]:
#(4) Swap the outer and inner index levels.
mulitindex_series.swaplevel()
Out[46]:
石家庄市  河北省    15848.0
唐山市    河北省    13472.0
邯郸市    河北省    12073.8
秦皇岛市  河北省     7813.0
郑州市    河南省     7446.0
开封市    河南省     6444.0
洛阳市    河南省    15230.0
新乡市    河南省     8269.0
dtype: float64
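The bracket lookups above can also be written with `Series.xs`, which names the index level explicitly. A sketch on a four-row subset of the same data (the variable names are illustrative):

```python
import pandas as pd

s = pd.Series(
    [15848, 13472, 7446, 15230],
    index=[['河北省', '河北省', '河南省', '河南省'],
           ['石家庄市', '唐山市', '郑州市', '洛阳市']],
)

outer = s.xs('河北省', level=0)        # same rows as s['河北省']
inner = s.xs('洛阳市', level=1)        # same rows as s[:, '洛阳市']
swapped = s.swaplevel().sort_index()   # swap the levels, then sort by the new outer level

print(outer, inner, swapped, sep="\n")
```

`swaplevel` only reorders the levels; sorting afterwards is optional but keeps label-based slicing on the new outer level efficient.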
8. Given the tabular data shown in the figure below, perform the following operations:
In [47]:
#(1) Sort the data by column C in ascending order.
import numpy as np
import pandas as pd
df_data = np.array([[1, 5, 8, 8], [2, 2, 4, 9], [7, 4, 2, 3], [3, 0, 5, 2]])  # data array
col_data = np.array(['A', 'B', 'C', 'D'])  # column labels
# Build the DataFrame object from the arrays
df_obj = pd.DataFrame(columns=col_data, data=df_data)
df_obj
Out[47]:
   A  B  C  D
0  1  5  8  8
1  2  2  4  9
2  7  4  2  3
3  3  0  5  2
In [48]:
sort_values_data = df_obj.sort_values(by=['C'])
sort_values_data
Out[48]:
   A  B  C  D
2  7  4  2  3
1  2  2  4  9
3  3  0  5  2
0  1  5  8  8
In [49]:
#(2) Compute the sum, the maximum, and the descriptive statistics of each column.
print(df_obj.sum())
print(df_obj.max())
print(df_obj.describe())
A    13
B    11
C    19
D    22
dtype: int64
A    7
B    5
C    8
D    9
dtype: int32
              A         B     C         D
count  4.000000  4.000000  4.00  4.000000
mean   3.250000  2.750000  4.75  5.500000
std    2.629956  2.217356  2.50  3.511885
min    1.000000  0.000000  2.00  2.000000
25%    1.750000  1.500000  3.50  2.750000
50%    2.500000  3.000000  4.50  5.500000
75%    4.000000  4.250000  5.75  8.250000
max    7.000000  5.000000  8.00  9.000000
9. Create a DataFrame object as specified and complete the following operations:
In [50]:
#(1) Create the DataFrame below from a dictionary, using labels as the index.
import numpy as np
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)
df
Out[50]:
  animal  age  visits priority
a    cat  2.5       1      yes
b    cat  3.0       3      yes
c  snake  0.5       2       no
d    dog  NaN       3      yes
e    dog  5.0       2       no
f    cat  2.0       3       no
g  snake  4.5       1       no
h    cat  NaN       1      yes
i    dog  7.0       2       no
j    dog  3.0       1       no
In [60]:
#(2) Show the DataFrame's basic information: row count, column names, non-null counts, and dtypes.
# df.describe() gives summary statistics instead
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
animal      10 non-null object
age         8 non-null float64
visits      10 non-null int64
priority    10 non-null object
dtypes: float64(1), int64(1), object(2)
memory usage: 400.0+ bytes
In [61]:
#(3) Show the first three rows (two ways).
# df.iloc[:3] is the positional equivalent
df.head(3)
Out[61]:
  animal  age  visits priority
a    cat  2.5       1      yes
b    cat  3.0       3      yes
c  snake  0.5       2       no
In [62]:
#(4) Select the animal and age columns of the DataFrame.
df.loc[:, ['animal', 'age']]  # equivalent to df[['animal', 'age']]
Out[62]:
  animal  age
a    cat  2.5
b    cat  3.0
c  snake  0.5
d    dog  NaN
e    dog  5.0
f    cat  2.0
g  snake  4.5
h    cat  NaN
i    dog  7.0
j    dog  3.0
In [63]:
#(5) Select the animal and age columns for the rows at positions [3, 4, 8].
df.loc[df.index[[3, 4, 8]], ['animal', 'age']]
Out[63]:
  animal  age
d    dog  NaN
e    dog  5.0
i    dog  7.0
In [78]:
#(6) Select the rows where age is greater than 3.
df[df['age'] > 3]
Out[78]:
  animal  age  visits priority
e    dog  5.0       2       no
g  snake  4.5       1       no
i    dog  7.0       2       no
In [79]:
#(7) Select the rows where age is missing.
df[df['age'].isnull()]
Out[79]:
  animal  age  visits priority
d    dog  NaN       3      yes
h    cat  NaN       1      yes
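In [82]:
#(8) Select the rows where age is between 2 and 4 (exclusive).
# Series.between() is inclusive by default, so the age == 2.0 row (f) is kept below;
# pass inclusive='neither' (pandas >= 1.3) to exclude both endpoints.
# Equivalent exclusive mask: df[(df['age'] > 2) & (df['age'] < 4)]
df[df['age'].between(2, 4)]
Out[82]:
  animal  age  visits priority
a    cat  2.5       1      yes
b    cat  3.0       3      yes
f    cat  2.0       3       no
j    dog  3.0       1       no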
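The difference between an inclusive and an exclusive range filter can be seen on the age column alone. A minimal sketch, assuming pandas 1.3 or later for the string form of the `inclusive` parameter (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

age = pd.Series([2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3], index=list("abcdefghij"))

inclusive_rows = age[age.between(2, 4)]                       # keeps endpoints: 2 <= age <= 4
exclusive_rows = age[age.between(2, 4, inclusive="neither")]  # drops endpoints: 2 < age < 4
mask_rows = age[(age > 2) & (age < 4)]                        # explicit form of the exclusive filter

print(inclusive_rows.index.tolist())
print(exclusive_rows.index.tolist())
```

Missing ages compare as False in every variant, so rows d and h never appear.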
In [85]:
#(9) Change the age in row f to 1.5.
df.loc['f', 'age'] = 1.5
df
Out[85]:
  animal  age  visits priority
a    cat  2.5       1      yes
b    cat  3.0       3      yes
c  snake  0.5       2       no
d    dog  NaN       3      yes
e    dog  5.0       2       no
f    cat  1.5       3       no
g  snake  4.5       1       no
h    cat  NaN       1      yes
i    dog  7.0       2       no
j    dog  3.0       1       no
In [84]:
#(10) Compute the sum of visits.
df['visits'].sum()
Out[84]:
19
In [86]:
#(11) Compute the mean age for each kind of animal.
df.groupby('animal')['age'].mean()
Out[86]:
animal
cat      2.333333
dog      5.000000
snake    2.500000
Name: age, dtype: float64
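In [87]:
#(12) Count the rows for each kind of animal in df.
# Note: this cell runs the insert/delete from step (17) rather than a count;
# the count itself would be df['animal'].value_counts().
# The inserted values are not in column order (animal, age, visits, priority),
# so the columns become object dtype and age prints as 3 rather than 3.0 from here on.
# Insert
df.loc['k'] = [5.5, 'dog', 'no', 2]
# Delete
df = df.drop('k')
df
Out[87]:
  animal  age  visits priority
a    cat  2.5       1      yes
b    cat    3       3      yes
c  snake  0.5       2       no
d    dog  NaN       3      yes
e    dog    5       2       no
f    cat  1.5       3       no
g  snake  4.5       1       no
h    cat  NaN       1      yes
i    dog    7       2       no
j    dog    3       1       no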
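Counting the rows per animal, as step (12) asks, can be sketched on the animal column alone (the names `counts` and `by_group` are illustrative):

```python
import pandas as pd

animals = pd.Series(
    ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
    index=list("abcdefghij"),
)

counts = animals.value_counts()              # counts per distinct value, largest first
by_group = animals.groupby(animals).size()   # same counts, ordered by label

print(counts)
print(by_group)
```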
In [88]:
#(13) Sort by age in descending order, then by visits in ascending order.
df.sort_values(by=['age', 'visits'], ascending=[False, True])
Out[88]:
  animal  age  visits priority
i    dog    7       2       no
e    dog    5       2       no
g  snake  4.5       1       no
j    dog    3       1       no
b    cat    3       3      yes
a    cat  2.5       1      yes
f    cat  1.5       3       no
c  snake  0.5       2       no
h    cat  NaN       1      yes
d    dog  NaN       3      yes
In [89]:
#(14) Replace yes/no in the priority column with the booleans True/False.
df['priority'] = df['priority'].map({'yes': True, 'no': False})
df
Out[89]:
  animal  age  visits  priority
a    cat  2.5       1      True
b    cat    3       3      True
c  snake  0.5       2     False
d    dog  NaN       3      True
e    dog    5       2     False
f    cat  1.5       3     False
g  snake  4.5       1     False
h    cat  NaN       1      True
i    dog    7       2     False
j    dog    3       1     False
In [90]:
#(15) Replace snake with python in the animal column.
df['animal'] = df['animal'].replace('snake', 'python')
df
Out[90]:
   animal  age  visits  priority
a     cat  2.5       1      True
b     cat    3       3      True
c  python  0.5       2     False
d     dog  NaN       3      True
e     dog    5       2     False
f     cat  1.5       3     False
g  python  4.5       1     False
h     cat  NaN       1      True
i     dog    7       2     False
j     dog    3       1     False
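In [100]:
#(16) For each animal and each distinct number of visits, compute the mean age. That is, return a table
# whose rows are animal kinds, whose columns are visit counts, and whose values are the mean age
# for that combination of animal and visit count.
df['age'] = df['age'].astype(float)  # age is object dtype at this point, so cast it back to float
df.pivot_table(index='animal', columns='visits', values='age', aggfunc='mean')
Out[100]:
visits    1    2     3
animal
cat     2.5  NaN  2.25
dog     3.0  6.0   NaN
python  4.5  0.5   NaN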
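A pivot table with aggfunc='mean' is closely related to a two-key groupby followed by `unstack`. The sketch below rebuilds the relevant columns (with f already set to 1.5 and snake renamed to python) and computes the table both ways; `df_demo`, `pivot`, and `grouped` are illustrative names:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'animal': ['cat', 'cat', 'python', 'dog', 'dog', 'cat', 'python', 'cat', 'dog', 'dog'],
    'age': [2.5, 3, 0.5, np.nan, 5, 1.5, 4.5, np.nan, 7, 3],
    'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
})

# One row per animal, one column per visit count, mean age in each cell.
pivot = df_demo.pivot_table(index='animal', columns='visits', values='age', aggfunc='mean')

# Group by both keys, average, then move the visits level into the columns.
grouped = df_demo.groupby(['animal', 'visits'])['age'].mean().unstack()

print(pivot)
print(grouped)
```

Combinations that never occur, or that contain only missing ages, show up as NaN in both tables.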
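In [93]:
#(17) Insert a new row k, ['cat', 5, 2, 'no'], into the DataFrame, then delete that row.
# Note: the list below differs from the row named in the task and is not in column order
# (animal, age, visits, priority), so the assignment changes column dtypes;
# this is why priority prints as 1/0 in the output.
# Insert
df.loc['k'] = [5.5, 'dog', 'no', 2]
# Delete
df = df.drop('k')
df
Out[93]:
   animal  age  visits  priority
a     cat  2.5       1         1
b     cat    3       3         1
c  python  0.5       2         0
d     dog  NaN       3         1
e     dog    5       2         0
f     cat  1.5       3         0
g  python  4.5       1         0
h     cat  NaN       1         1
i     dog    7       2         0
j     dog    3       1         0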
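When a row is added with `df.loc[label] = [...]`, the list is matched to the columns by position, so the values have to follow the column order. A sketch with the row from the task statement on a two-row stand-in frame (`df_demo` and the helper variables are illustrative):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'animal': ['cat', 'dog'], 'age': [2.5, 5.0], 'visits': [1, 2], 'priority': ['yes', 'no']},
    index=['a', 'e'],
)

# Values follow the column order: animal, age, visits, priority.
df_demo.loc['k'] = ['cat', 5, 2, 'no']
inserted_animal = df_demo.loc['k', 'animal']
has_k = 'k' in df_demo.index

# drop() returns a new frame without the row; reassign to keep the result.
df_demo = df_demo.drop('k')
print(df_demo)
```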
10. Read the master table of the P2P online-lending data and inspect its basic information
In [104]:
#(1) Read the data file Training_Master.csv.
import pandas as pd
# Open the file in a with-block so the handle is closed after pandas has read from it.
with open('F:/实训/数据分析实训/项目二 Pandas基础练习/Training_Master.csv') as f:
    data = pd.read_csv(f)
data
Out[104]:
Idx  UserInfo_1  UserInfo_2  UserInfo_3  UserInfo_4  WeblogInfo_1  WeblogInfo_2  WeblogInfo_3  WeblogInfo_4  WeblogInfo_5  ...  SocialNetwork_10  SocialNetwork_11  SocialNetwork_12  SocialNetwork_13  SocialNetwork_14  SocialNetwork_15  SocialNetwork_16  SocialNetwork_17  target  ListingInfo
0  10001  1.0  深圳  4.0  深圳  NaN  1.0  NaN  1.0  1.0  ...  222  -1  0  0  0  0  0  1  0  2014/3/5
1  10002  1.0  温州  4.0  温州  NaN  0.0  NaN  1.0  1.0  ...  1  -1  0  0  0  0  0  2  0  2014/2/26
2  10003  1.0  宜昌  3.0  宜昌  NaN  0.0  NaN  2.0  2.0  ...  -1  -1  -1  1  0  0  0  0  0  2014/2/28
3  10006  4.0  南平  1.0  南平  NaN  NaN  NaN  NaN  NaN  ...  -1  -1  -1  0  0  0  0  0  0  2014/2/25
4  10007  5.0  辽阳  1.0  辽阳  NaN  0.0  NaN  1.0  1.0  ...  -1  -1  -1  0  0  0  0  0  0  2014/2/27
...
29995  9991  3.0  南阳  4.0  南阳  NaN  1.0  NaN  3.0  2.0  ...  0  -1  0  1  0  0  0  1  0  2014/2/22
29996  9992  3.0  宁德  4.0  泉州  NaN  0.0  NaN  6.0  1.0  ...  407  -1  0  0  0  0  0  1  0  2014/2/28
29997  9995  1.0  天津  2.0  天津  NaN  0.0  NaN  2.0  2.0  ...  -1  -1  -1  0  0  0  0  0  0  2014/2/24
29998  9997  3.0  运城  3.0  运城  NaN  0.0  NaN  1.0  1.0  ...  612  -1  0  1  0  0  0  1  0  2014/2/28
29999  9998  4.0  金华  5.0  无锡  NaN  0.0  NaN  1.0  1.0  ...  -1  -1  -1  0  0  0  0  0  0  2014/3/5
30000 rows × 228 columns
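In [105]:
#(2) Use the ndim and shape attributes and the memory_usage() method to inspect the
# number of dimensions, the size, and the memory footprint.
# Number of dimensions of the master table
print("Dimensions of the master table:", data.ndim)
# Shape of the master table
print("Shape of the master table:", data.shape)
# Memory usage of the master table, in bytes per column
print("Memory usage of the master table:\n", data.memory_usage())
Dimensions of the master table: 2
Shape of the master table: (30000, 228)
Memory usage of the master table:
Index                  128
Idx                 240000
UserInfo_1          240000
UserInfo_2          240000
UserInfo_3          240000
                     ...
SocialNetwork_15    240000
SocialNetwork_16    240000
SocialNetwork_17    240000
target              240000
ListingInfo         240000
Length: 229, dtype: int64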
In [106]:
#(3) Use the describe method to compute descriptive statistics.
a_describe = data.describe()
print("Descriptive statistics from describe():", a_describe)
Descriptive statistics from describe():
                Idx    UserInfo_1    UserInfo_3  WeblogInfo_1  WeblogInfo_2  \
count  30000.000000  29994.000000  29993.000000    970.000000  28342.000000
mean   46318.673267      3.219911      4.694329      2.201031      0.131466
std    26640.397805      1.827684      1.321458      7.831679      0.358486
min        3.000000      0.000000      0.000000      1.000000      0.000000
25%    22924.250000      1.000000      4.000000      1.000000      0.000000
50%    46849.500000      3.000000      5.000000      1.000000      0.000000
75%    69447.250000      5.000000      5.000000      1.000000      0.000000
max    91703.000000      7.000000      7.000000    133.000000      4.000000

       WeblogInfo_3  WeblogInfo_4  WeblogInfo_5  WeblogInfo_6  WeblogInfo_7  \
count    970.000000  28349.000000  28349.000000  28349.000000  30000.000000
mean       1.308247      3.025962      1.816960      2.948711     10.632800
std        7.866457      3.772421      1.701177      3.770300     16.097588
min        0.000000      1.000000      1.000000      1.000000      0.000000
25%        0.000000      1.000000      1.000000      1.000000      2.000000
50%        0.000000      2.000000      1.000000      2.000000      6.000000
75%        1.000000      3.000000      2.000000      3.000000     13.000000
max      133.000000    165.000000     73.000000    165.000000    722.000000

       ...  SocialNetwork_9  SocialNetwork_10  SocialNetwork_11  \
count  ...     30000.000000      30000.000000      30000.000000
mean   ...        35.516167         75.211233         -0.999267
std    ...       135.954587        742.978305          0.052911
min    ...        -1.000000         -1.000000         -1.000000
25%    ...        -1.000000         -1.000000         -1.000000
50%    ...        -1.000000         -1.000000         -1.000000
75%    ...        -1.000000         -1.000000         -1.000000
max    ...      3242.000000      71253.000000          6.000000

       SocialNetwork_12  SocialNetwork_13  SocialNetwork_14  SocialNetwork_15  \
count      30000.000000      30000.000000      30000.000000      30000.000000
mean          -0.745033          0.221167          0.062033          0.027967
std            0.441473          0.420545          0.242598          0.164880
min           -1.000000          0.000000          0.000000          0.000000
25%           -1.000000          0.000000          0.000000          0.000000
50%           -1.000000          0.000000          0.000000          0.000000
75%            0.000000          0.000000          0.000000          0.000000
max            1.000000          2.000000          3.000000          1.000000

       SocialNetwork_16  SocialNetwork_17        target
count      30000.000000      30000.000000  30000.000000
mean           0.016633          0.253467      0.073267
std            0.127895          0.437296      0.260578
min            0.000000          0.000000      0.000000
25%            0.000000          0.000000      0.000000
50%            0.000000          0.000000      0.000000
75%            0.000000          1.000000      0.000000
max            1.000000          3.000000      1.000000

[8 rows x 208 columns]
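The same inspection calls can be tried without the Training_Master.csv file. The sketch below uses a made-up three-row frame (`toy`, with values borrowed from the preview above) and adds `deep=True`, which also counts the bytes held by string values:

```python
import pandas as pd

# Tiny stand-in for the master table; the column choice is illustrative.
toy = pd.DataFrame({
    'Idx': [10001, 10002, 10003],
    'city': ['深圳', '温州', '宜昌'],
    'target': [0, 0, 1],
})

n_dims = toy.ndim                          # a DataFrame is always 2-dimensional
n_rows, n_cols = toy.shape                 # (rows, columns)
per_column = toy.memory_usage(deep=True)   # bytes per column, plus an 'Index' entry
numeric_summary = toy.describe()           # numeric columns only by default

print(n_dims, n_rows, n_cols)
print(per_column)
print(numeric_summary)
```

This also shows why describe() reported 208 of the 228 columns above: non-numeric columns are left out unless include='all' is passed.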
11. Explore the Euro 2012 data
In [121]:
import pandas as pd
# Open the file in a with-block so the handle is closed after pandas has read from it.
with open('F:/实训/数据分析实训/项目二 Pandas基础练习/Euro2012.csv') as f:
    data = pd.read_csv(f)
data
Out[121]:
Team  Goals  Shots on target  Shots off target  Shooting Accuracy  % Goals-to-shots  Total shots (inc. Blocked)  Hit Woodwork  Penalty goals  Penalties not scored  ...  Saves made  Saves-to-shots ratio  Fouls Won  Fouls Conceded  Offsides  Yellow Cards  Red Cards  Subs on  Subs off  Players Used
0  Croatia  4  13  12  51.9%  16.0%  32  0  0  0  ...  13  81.3%  41  62  2  9  0  9  9  16
1  Czech Republic  4  13  18  41.9%  12.9%  39  0  0  0  ...  9  60.1%  53  73  8  7  0  11  11  19
2  Denmark  4  10  10  50.0%  20.0%  27  1  0  0  ...  10  66.7%  25  38  8  4  0  7  7  15
3  England  5  11  18  50.0%  17.2%  40  0  0  0  ...  22  88.1%  43  45  6  5  0  11  11  16
4  France  3  22  24  37.9%  6.5%  65  1  0  0  ...  6  54.6%  36  51  5  6  0  11  11  19
5  Germany  10  32  32  47.8%  15.6%  80  2  1  0  ...  10  62.6%  63  49  12  4  0  15  15  17
6  Greece  5  8  18  30.7%  19.2%  32  1  1  1  ...  13  65.1%  67  48  12  9  1  12  12  20
7  Italy  6  34  45  43.0%  7.5%  110  2  0  0  ...  20  74.1%  101  89  16  16  0  18  18  19
8  Netherlands  2  12  36  25.0%  4.1%  60  2  0  0  ...  12  70.6%  35  30  3  5  0  7  7  15
9  Poland  2  15  23  39.4%  5.2%  48  0  0  0  ...  6  66.7%  48  56  3  7  1  7  7  17
10  Portugal  6  22  42  34.3%  9.3%  82  6  0  0  ...  10  71.5%  73  90  10  12  0  14  14  16
11  Republic of Ireland  1  7  12  36.8%  5.2%  28  0  0  0  ...  17  65.4%  43  51  11  6  1  10  10  17
12  Russia  5  9  31  22.5%  12.5%  59  2  0  0  ...  10  77.0%  34  43  4  6  0  7  7  16
13  Spain  12  42  33  55.9%  16.0%  100  0  1  0  ...  15  93.8%  102  83  19  11  0  17  17  18
14  Sweden  5  17  19  47.2%  13.8%  39  3  0  0  ...  8  61.6%  35  51  7  7  0  9  9  18
15  Ukraine  2  7  26  21.2%  6.0%  38  0  0  0  ...  13  76.5%  48  31  4  5  0  9  9  18
16 rows × 35 columns
In [124]:
#(1) Name the dataset euro12
# Load the dataset from the target path
path2 = "F:/实训/数据分析实训/项目二 Pandas基础练习/Euro2012.csv"  # Euro2012_stats.csv
euro12 = pd.read_csv(path2)
euro12
Out[124]:
(Same 16 rows × 35 columns table as Out[121].)
In [125]:
#(2) Select only the Goals column
euro12.Goals
Out[125]:
0      4
1      4
2      4
3      5
4      3
5     10
6      5
7      6
8      2
9      2
10     6
11     1
12     5
13    12
14     5
15     2
Name: Goals, dtype: int64
In [126]:
#(3) How many teams took part in Euro 2012?
euro12.shape[0]
Out[126]:
16
In [127]:
#(4) How many columns are in the dataset?
euro12.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 35 columns):
Team  16 non-null object
Goals  16 non-null int64
Shots on target  16 non-null int64
Shots off target  16 non-null int64
Shooting Accuracy  16 non-null object
% Goals-to-shots  16 non-null object
Total shots (inc. Blocked)  16 non-null int64
Hit Woodwork  16 non-null int64
Penalty goals  16 non-null int64
Penalties not scored  16 non-null int64
Headed goals  16 non-null int64
Passes  16 non-null int64
Passes completed  16 non-null int64
Passing Accuracy  16 non-null object
Touches  16 non-null int64
Crosses  16 non-null int64
Dribbles  16 non-null int64
Corners Taken  16 non-null int64
Tackles  16 non-null int64
Clearances  16 non-null int64
Interceptions  16 non-null int64
Clearances off line  15 non-null float64
Clean Sheets  16 non-null int64
Blocks  16 non-null int64
Goals conceded  16 non-null int64
Saves made  16 non-null int64
Saves-to-shots ratio  16 non-null object
Fouls Won  16 non-null int64
Fouls Conceded  16 non-null int64
Offsides  16 non-null int64
Yellow Cards  16 non-null int64
Red Cards  16 non-null int64
Subs on  16 non-null int64
Subs off  16 non-null int64
Players Used  16 non-null int64
dtypes: float64(1), int64(29), object(5)
memory usage: 4.5+ KB
In [128]:
#(5) Store the Team, Yellow Cards, and Red Cards columns in a separate DataFrame named discipline
discipline = euro12[['Team', 'Yellow Cards', 'Red Cards']]
discipline
Out[128]:
                   Team  Yellow Cards  Red Cards
0               Croatia             9          0
1        Czech Republic             7          0
2               Denmark             4          0
3               England             5          0
4                France             6          0
5               Germany             4          0
6                Greece             9          1
7                 Italy            16          0
8           Netherlands             5          0
9                Poland             7          1
10             Portugal            12          0
11  Republic of Ireland             6          1
12               Russia             6          0
13                Spain            11          0
14               Sweden             7          0
15              Ukraine             5          0
In [129]:
#(6) Sort discipline by Red Cards first and then by Yellow Cards (both descending)
discipline.sort_values(['Red Cards', 'Yellow Cards'], ascending=False)
Out[129]:
                   Team  Yellow Cards  Red Cards
6                Greece             9          1
9                Poland             7          1
11  Republic of Ireland             6          1
7                 Italy            16          0
10             Portugal            12          0
13                Spain            11          0
0               Croatia             9          0
1        Czech Republic             7          0
14               Sweden             7          0
4                France             6          0
12               Russia             6          0
3               England             5          0
8           Netherlands             5          0
15              Ukraine             5          0
2               Denmark             4          0
5               Germany             4          0
In [130]:
#(7) Compute the mean number of yellow cards per team (rounded to the nearest integer)
round(discipline['Yellow Cards'].mean())
Out[130]:
7
In [131]:
#(8) Find the teams that scored more than 6 goals
euro12[euro12.Goals > 6]
Out[131]:
Team  Goals  Shots on target  Shots off target  Shooting Accuracy  % Goals-to-shots  Total shots (inc. Blocked)  Hit Woodwork  Penalty goals  Penalties not scored  ...  Saves made  Saves-to-shots ratio  Fouls Won  Fouls Conceded  Offsides  Yellow Cards  Red Cards  Subs on  Subs off  Players Used
5  Germany  10  32  32  47.8%  15.6%  80  2  1  0  ...  10  62.6%  63  49  12  4  0  15  15  17
13  Spain  12  42  33  55.9%  16.0%  100  0  1  0  ...  15  93.8%  102  83  19  11  0  17  17  18
2 rows × 35 columns
In [132]:
#(9) Select the teams whose name starts with the letter G
euro12[euro12.Team.str.startswith('G')]
Out[132]:
Team  Goals  Shots on target  Shots off target  Shooting Accuracy  % Goals-to-shots  Total shots (inc. Blocked)  Hit Woodwork  Penalty goals  Penalties not scored  ...  Saves made  Saves-to-shots ratio  Fouls Won  Fouls Conceded  Offsides  Yellow Cards  Red Cards  Subs on  Subs off  Players Used
5  Germany  10  32  32  47.8%  15.6%  80  2  1  0  ...  10  62.6%  63  49  12  4  0  15  15  17
6  Greece  5  8  18  30.7%  19.2%  32  1  1  1  ...  13  65.1%  67  48  12  9  1  12  12  20
2 rows × 35 columns
In [133]:
#(10) Select the first 7 columns
euro12.iloc[:, 0:7]
Out[133]:
Team  Goals  Shots on target  Shots off target  Shooting Accuracy  % Goals-to-shots  Total shots (inc. Blocked)
0  Croatia  4  13  12  51.9%  16.0%  32
1  Czech Republic  4  13  18  41.9%  12.9%  39
2  Denmark  4  10  10  50.0%  20.0%  27
3  England  5  11  18  50.0%  17.2%  40
4  France  3  22  24  37.9%  6.5%  65
5  Germany  10  32  32  47.8%  15.6%  80
6  Greece  5  8  18  30.7%  19.2%  32
7  Italy  6  34  45  43.0%  7.5%  110
8  Netherlands  2  12  36  25.0%  4.1%  60
9  Poland  2  15  23  39.4%  5.2%  48
10  Portugal  6  22  42  34.3%  9.3%  82
11  Republic of Ireland  1  7  12  36.8%  5.2%  28
12  Russia  5  9  31  22.5%  12.5%  59
13  Spain  12  42  33  55.9%  16.0%  100
14  Sweden  5  17  19  47.2%  13.8%  39
15  Ukraine  2  7  26  21.2%  6.0%  38
In [134]:
#(11) Select all columns except the last 3
euro12.iloc[:, :-3]
Out[134]:
Team  Goals  Shots on target  Shots off target  Shooting Accuracy  % Goals-to-shots  Total shots (inc. Blocked)  Hit Woodwork  Penalty goals  Penalties not scored  ...  Clean Sheets  Blocks  Goals conceded  Saves made  Saves-to-shots ratio  Fouls Won  Fouls Conceded  Offsides  Yellow Cards  Red Cards
0  Croatia  4  13  12  51.9%  16.0%  32  0  0  0  ...  0  10  3  13  81.3%  41  62  2  9  0
1  Czech Republic  4  13  18  41.9%  12.9%  39  0  0  0  ...  1  10  6  9  60.1%  53  73  8  7  0
2  Denmark  4  10  10  50.0%  20.0%  27  1  0  0  ...  1  10  5  10  66.7%  25  38  8  4  0
3  England  5  11  18  50.0%  17.2%  40  0  0  0  ...  2  29  3  22  88.1%  43  45  6  5  0
4  France  3  22  24  37.9%  6.5%  65  1  0  0  ...  1  7  5  6  54.6%  36  51  5  6  0
5  Germany  10  32  32  47.8%  15.6%  80  2  1  0  ...  1  11  6  10  62.6%  63  49  12  4  0
6  Greece  5  8  18  30.7%  19.2%  32  1  1  1  ...  1  23  7  13  65.1%  67  48  12  9  1
7  Italy  6  34  45  43.0%  7.5%  110  2  0  0  ...  2  18  7  20  74.1%  101  89  16  16  0
8  Netherlands  2  12  36  25.0%  4.1%  60  2  0  0  ...  0  9  5  12  70.6%  35  30  3  5  0
9  Poland  2  15  23  39.4%  5.2%  48  0  0  0  ...  0  8  3  6  66.7%  48  56  3  7  1
10  Portugal  6  22  42  34.3%  9.3%  82  6  0  0  ...  2  11  4  10  71.5%  73  90  10  12  0
11  Republic of Ireland  1  7  12  36.8%  5.2%  28  0  0  0  ...  0  23  9  17  65.4%  43  51  11  6  1
12  Russia  5  9  31  22.5%  12.5%  59  2  0  0  ...  0  8  3  10  77.0%  34  43  4  6  0
13  Spain  12  42  33  55.9%  16.0%  100  0  1  0  ...  5  8  1  15  93.8%  102  83  19  11  0
14  Sweden  5  17  19  47.2%  13.8%  39  3  0  0  ...  1  12  5  8  61.6%  35  51  7  7  0
15  Ukraine  2  7  26  21.2%  6.0%  38  0  0  0  ...  0  4  4  13  76.5%  48  31  4  5  0
16 rows × 32 columns
In [135]:
#(12) Find the shooting accuracy (Shooting Accuracy) of England, Italy, and Russia
euro12.loc[euro12.Team.isin(['England', 'Italy', 'Russia']), ['Team', 'Shooting Accuracy']]
Out[135]:
       Team Shooting Accuracy
3   England             50.0%
7     Italy             43.0%
12   Russia             22.5%
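Steps (8), (9), and (12) all rely on boolean masks. The sketch below repeats them on a six-team excerpt of the table (the frame `teams` and the result names are illustrative), so it runs without Euro2012.csv:

```python
import pandas as pd

# Six rows copied from the Euro 2012 table above.
teams = pd.DataFrame({
    'Team': ['England', 'Germany', 'Greece', 'Italy', 'Russia', 'Spain'],
    'Goals': [5, 10, 5, 6, 5, 12],
    'Shooting Accuracy': ['50.0%', '47.8%', '30.7%', '43.0%', '22.5%', '55.9%'],
})

high_scorers = teams[teams['Goals'] > 6]               # comparison mask
g_teams = teams[teams['Team'].str.startswith('G')]     # string-method mask
picked = teams.loc[                                    # membership mask plus column selection
    teams['Team'].isin(['England', 'Italy', 'Russia']),
    ['Team', 'Shooting Accuracy'],
]

print(high_scorers, g_teams, picked, sep="\n")
```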
12. Explore the Chipotle fast-food data
In [137]:
#(1) Load the dataset into a DataFrame named chipo
import pandas as pd
# The file is tab-separated despite its .csv extension.
chipo = pd.read_csv('F:/实训/数据分析实训/项目二 Pandas基础练习/chipotle.csv', sep='\t')
'Done'
Out[137]:
'Done'
In [138]:
#(2) Show the first 10 rows
chipo.head(10)
Out[138]:
order_id  quantity  item_name  choice_description  item_price
0  1  1  Chips and Fresh Tomato Salsa  NaN  $2.39
1  1  1  Izze  [Clementine]  $3.39
2  1  1  Nantucket Nectar  [Apple]  $3.39
3  1  1  Chips and Tomatillo-Green Chili Salsa  NaN  $2.39
4  2  2  Chicken Bowl  [Tomatillo-Red Chili Salsa (Hot), [Black Beans...  $16.98
5  3  1  Chicken Bowl  [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...  $10.98
6  3  1  Side of Chips  NaN  $1.69
7  4  1  Steak Burrito  [Tomatillo Red Chili Salsa, [Fajita Vegetables...  $11.75
8  4  1  Steak Soft Tacos  [Tomatillo Green Chili Salsa, [Pinto Beans, Ch...  $9.25
9  5  1  Steak Burrito  [Fresh Tomato Salsa, [Rice, Black Beans, Pinto...  $9.25
In [139]:
#(3) Show the last 10 rows
chipo.tail(10)
Out[139]:
order_id  quantity  item_name  choice_description  item_price
4612  1831  1  Carnitas Bowl  [Fresh Tomato Salsa, [Fajita Vegetables, Rice,...  $9.25
4613  1831  1  Chips  NaN  $2.15
4614  1831  1  Bottled Water  NaN  $1.50
4615  1832  1  Chicken Soft Tacos  [Fresh Tomato Salsa, [Rice, Cheese, Sour Cream]]  $8.75
4616  1832  1  Chips and Guacamole  NaN  $4.45
4617  1833  1  Steak Burrito  [Fresh Tomato Salsa, [Rice, Black Beans, Sour ...  $11.75
4618  1833  1  Steak Burrito  [Fresh Tomato Salsa, [Rice, Sour Cream, Cheese...  $11.75
4619  1834  1  Chicken Salad Bowl  [Fresh Tomato Salsa, [Fajita Vegetables, Pinto...  $11.25
4620  1834  1  Chicken Salad Bowl  [Fresh Tomato Salsa, [Fajita Vegetables, Lettu...  $8.75
4621  1834  1  Chicken Salad Bowl  [Fresh Tomato Salsa, [Fajita Vegetables, Pinto...  $8.75
In [140]:
#(4) Show the shape of the data as (rows, columns)
chipo.shape
Out[140]:
(4622, 5)
In [141]:
#(5) How many columns are in the dataset?
chipo.columns.size  # or chipo.shape[1]
Out[141]:
5
In [161]:
#(6) Print all column names
chipo.columns  # or chipo.keys()
Out[161]:
Index(['order_id', 'quantity', 'item_name', 'choice_description',
       'item_price'],
      dtype='object')
In [162]:
#(7) How is the dataset indexed?
chipo.index
Out[162]:
RangeIndex(start=0, stop=4622, step=1)
In [163]:
#(8) Show summary statistics for the numeric columns
chipo.describe()
Out[163]:
          order_id     quantity
count  4622.000000  4622.000000
mean    927.254868     1.075725
std     528.890796     0.410186
min       1.000000     1.000000
25%     477.250000     1.000000
50%     926.000000     1.000000
75%    1393.000000     1.000000
max    1834.000000    15.000000
In [164]:
#(9) Show the column labels, dtypes, non-null counts, and memory usage
chipo.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
order_id              4622 non-null int64
quantity              4622 non-null int64
item_name             4622 non-null object
choice_description    3376 non-null object
item_price            4622 non-null object
dtypes: int64(2), object(3)
memory usage: 180.7+ KB
In [165]:
#(10) Show the item name column
chipo['item_name']  # equivalent to chipo.item_name
Out[165]:
0                Chips and Fresh Tomato Salsa
1                                        Izze
2                            Nantucket Nectar
3       Chips and Tomatillo-Green Chili Salsa
4                                Chicken Bowl
                        ...
4617                            Steak Burrito
4618                            Steak Burrito
4619                       Chicken Salad Bowl
4620                       Chicken Salad Bowl
4621                       Chicken Salad Bowl
Name: item_name, Length: 4622, dtype: object
In [169]:
#(11) Show the item name and quantity columns; the result is a DataFrame
chipo[['item_name', 'quantity']]
Out[169]:
                                  item_name  quantity
0              Chips and Fresh Tomato Salsa         1
1                                      Izze         1
2                          Nantucket Nectar         1
3     Chips and Tomatillo-Green Chili Salsa         1
4                              Chicken Bowl         2
...                                     ...       ...
4617                          Steak Burrito         1
4618                          Steak Burrito         1
4619                     Chicken Salad Bowl         1
4620                     Chicken Salad Bowl         1
4621                     Chicken Salad Bowl         1
4622 rows × 2 columns
In [170]:
#(12) Show the rows from index 3 up to, but not including, index 10
chipo[3:10]
Out[170]:
order_id  quantity  item_name  choice_description  item_price
3  1  1  Chips and Tomatillo-Green Chili Salsa  NaN  $2.39
4  2  2  Chicken Bowl  [Tomatillo-Red Chili Salsa (Hot), [Black Beans...  $16.98
5  3  1  Chicken Bowl  [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...  $10.98
6  3  1  Side of Chips  NaN  $1.69
7  4  1  Steak Burrito  [Tomatillo Red Chili Salsa, [Fajita Vegetables...  $11.75
8  4  1  Steak Soft Tacos  [Tomatillo Green Chili Salsa, [Pinto Beans, Ch...  $9.25
9  5  1  Steak Burrito  [Fresh Tomato Salsa, [Rice, Black Beans, Pinto...  $9.25
In [172]:
#(13) Show the order lines whose quantity is greater than 5
cond = chipo.quantity > 5  # a boolean Series
chipo[cond]  # the order lines with quantity > 5
Out[172]:
order_id  quantity  item_name  choice_description  item_price
3598  1443  15  Chips and Fresh Tomato Salsa  NaN  $44.25
3599  1443  7  Bottled Water  NaN  $10.50
3887  1559  8  Side of Chips  NaN  $13.52
4152  1660  10  Bottled Water  NaN  $15.00
In [173]:
#(14) Show the order lines whose quantity is greater than 5 and whose item name is 'Bottled Water'
cond = (chipo.quantity > 5) & (chipo.item_name == 'Bottled Water')  # element-wise AND of two boolean Series
chipo[cond]
Out[173]:
order_id  quantity  item_name  choice_description  item_price
3599  1443  7  Bottled Water  NaN  $10.50
4152  1660  10  Bottled Water  NaN  $15.00
In [180]:
#(15) Which item was ordered most often?
# By total quantity instead of order lines:
# chipo[['item_name', 'quantity']].groupby(by=['item_name']).sum().sort_values(by=['quantity'], ascending=False)
chipo["item_name"].value_counts().head(1)  # the item with the most order lines is Chicken Bowl
Out[180]:
Chicken Bowl    726
Name: item_name, dtype: int64
In [166]:
#(16) How many distinct items were ordered, according to the item_name column?
len(chipo["item_name"].unique())  # or chipo["item_name"].nunique()
Out[166]:
50
In [183]:
#(17) Which choice_description was ordered most often?
# Total quantity per description (not displayed, since only the last expression is shown):
chipo[['choice_description', 'quantity']].groupby(by=['choice_description']).sum().sort_values(by=['quantity'], ascending=False)
chipo['choice_description'].value_counts().head(1)
Out[183]:
[Diet Coke]    134
Name: choice_description, dtype: int64
In [167]:
#(18) How many items were ordered in total?
chipo["quantity"].sum()
Out[167]:
4972
In [175]:
#(19) Convert item_price to float
print("dtype before conversion:", chipo["item_price"].dtypes)
# Strip the dollar sign with a vectorized string method; the element-wise chained
# assignment chipo["item_price"][i] = ... triggers SettingWithCopyWarning.
chipo["item_price"] = chipo["item_price"].str.replace('$', '', regex=False).astype('float')
print("dtype after conversion:", chipo["item_price"].dtypes)
dtype before conversion: object
dtype after conversion: float64
In [176]:
#(20) What was the revenue over the period covered by the dataset?
chipo['sub_total'] = (chipo['item_price'] * chipo['quantity']).round(2)  # price × quantity
chipo['sub_total'].sum()
Out[176]:
39237.02
In [174]:
#(21) How many orders were placed over the period covered by the dataset?
len(chipo["order_id"].unique())  # or chipo["order_id"].nunique()
Out[174]:
1834
In [177]:
#(22) What is the average total per order?
(chipo['quantity'] * chipo['item_price']).sum() / chipo["order_id"].nunique()
Out[177]:
21.39423118865867
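Steps (15) and (19) to (22) can be replayed end to end on a handful of rows. The sketch below uses five rows copied from the head of the table (the frame `orders` and the result names are illustrative) and follows the formula used above, item_price × quantity; whether item_price is a unit price or already a line total is an assumption worth checking against the data:

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [1, 1, 2, 3, 3],
    'quantity': [1, 1, 2, 1, 1],
    'item_name': ['Chips and Fresh Tomato Salsa', 'Izze', 'Chicken Bowl', 'Chicken Bowl', 'Side of Chips'],
    'item_price': ['$2.39', '$3.39', '$16.98', '$10.98', '$1.69'],
})

# Literal (non-regex) removal of the dollar sign, then cast to float.
orders['item_price'] = orders['item_price'].str.replace('$', '', regex=False).astype(float)

line_totals = orders['item_price'] * orders['quantity']   # same formula as step (20)
revenue = round(line_totals.sum(), 2)
n_orders = orders['order_id'].nunique()
mean_per_order = revenue / n_orders

lines_per_item = orders['item_name'].value_counts()              # number of order lines per item
units_per_item = orders.groupby('item_name')['quantity'].sum()   # number of units per item

print(revenue, n_orders, mean_per_order)
```

The last two results differ whenever a line has a quantity above 1, which is why "most ordered" in step (15) depends on whether lines or units are counted.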