Python: DataFrame Tutorial
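Contents
Creating, Reading/Writing, and Displaying
Creating a DataFrame and a Series
Reading and Saving Data Files
Setting Display Options
Aligning Column Data
Indexing and Selecting
Selecting Rows
Selecting Columns
Selecting Rows and Columns Together
Setting a Column as the Row Index
Conditional Selection
Summary Functions and Mapping
Summary Functions: describe(), unique(), value_counts()
Mapping: map() and apply()
Concatenating Two Text Columns
Grouping and Sorting
Group-wise Analysis
Multi-level Indexes
Sorting
Data Types and Missing Values
Data Types and Conversion
Handling Missing Values
Renaming, Reordering Columns, Adding New Rows and Columns
Renaming Columns and Rows
Reordering Columns
Adding New Columns
Combining DataFrames
Vertically Concatenating DataFrames with the Same Column Names
Horizontally Concatenating or Joining DataFrames
Generating a Data Analysis Report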
Importing the Library
import pandas as pd
Creating, Reading/Writing, and Displaying
Creating a DataFrame and a Series
>>> pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})
   Yes   No
0   50  131
1   21    2
>>> # DataFrame values can also be strings
>>> pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})
             Bob           Sue
0    I liked it.  Pretty good.
1  It was awful.        Bland.
>>> # Set the row index
>>> pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']}, index=['Product A', 'Product B'])
                     Bob           Sue
Product A    I liked it.  Pretty good.
Product B  It was awful.        Bland.
>>> # Create a Series
>>> pd.Series([1, 2, 3, 4, 5])
0    1
1    2
2    3
3    4
4    5
dtype: int64
>>> # A Series has no column names, only a single overall name
>>> pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')
2015 Sales    30
2016 Sales    35
2017 Sales    40
Name: Product A, dtype: int64
Reading and Saving Data Files
>>> # A raw string (r'...') keeps the backslashes of a Windows path from being read as escape sequences
>>> wine_reviews = pd.read_csv(r'D:\DOCUMENT\PRO\PYTHON\DataFrameTurtorial\winemag-data-130k-v2.csv')
>>> wine_reviews
Unnamed: 0 country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 0 Italy Aromas... Vulkà ... 87 NaN Sicily... Etna NaN Kerin ... @kerin... Nicosi... White ... Nicosia
1 1 Portugal This i... Avidagos 87 15.0 Douro NaN NaN Roger ... @vossr... Quinta... Portug... Quinta...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 129969 France A dry ... NaN 90 32.0 Alsace Alsace NaN Roger ... @vossr... Domain... Pinot ... Domain...
129970 129970 France Big, r... Lieu-d... 90 21.0 Alsace Alsace NaN Roger ... @vossr... Domain... Gewürz... Domain...

[129971 rows x 14 columns]
>>> # Show the number of rows and columns
>>> wine_reviews.shape
(129971, 14)
>>> # Show the first or last few rows
>>> wine_reviews.head()  # first 5 rows by default; head(3) shows the first 3
Unnamed: 0 country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 0 Italy Aromas... Vulkà ... 87 NaN Sicily... Etna NaN Kerin ... @kerin... Nicosi... White ... Nicosia
1 1 Portugal This i... Avidagos 87 15.0 Douro NaN NaN Roger ... @vossr... Quinta... Portug... Quinta...
2 2 US Tart a... NaN 87 14.0 Oregon Willam... Willam... Paul G... @paulg... Rainst... Pinot ... Rainstorm
3 3 US Pineap... Reserv... 87 13.0 Michigan Lake M... NaN Alexan... NaN St. Ju... Riesling St. Ju...
4 4 US Much l... Vintne... 87 65.0 Oregon Willam... Willam... Paul G... @paulg... Sweet ... Pinot ... Sweet ...
>>> wine_reviews.tail()  # last 5 rows by default; tail(3) shows the last 3
Unnamed: 0 country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
129966 129966 Germany Notes ... Braune... 90 28.0 Mosel NaN NaN Anna L... NaN Dr. H.... Riesling Dr. H....
129967 129967 US Citati... NaN 90 75.0 Oregon Oregon Oregon... Paul G... @paulg... Citati... Pinot ... Citation
129968 129968 France Well-d... Kritt 90 30.0 Alsace Alsace NaN Roger ... @vossr... Domain... Gewürz... Domain...
129969 129969 France A dry ... NaN 90 32.0 Alsace Alsace NaN Roger ... @vossr... Domain... Pinot ... Domain...
129970 129970 France Big, r... Lieu-d... 90 21.0 Alsace Alsace NaN Roger ... @vossr... Domain... Gewürz... Domain...
>>> # The index column can be specified when reading the file
>>> wine_reviews = pd.read_csv(r'D:\DOCUMENT\PRO\PYTHON\DataFrameTurtorial\winemag-data-130k-v2.csv', index_col=0)
>>> wine_reviews
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas... Vulkà ... 87 NaN Sicily... Etna NaN Kerin ... @kerin... Nicosi... White ... Nicosia
1 Portugal This i... Avidagos 87 15.0 Douro NaN NaN Roger ... @vossr... Quinta... Portug... Quinta...
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 France A dry ... NaN 90 32.0 Alsace Alsace NaN Roger ... @vossr... Domain... Pinot ... Domain...
129970 France Big, r... Lieu-d... 90 21.0 Alsace Alsace NaN Roger ... @vossr... Domain... Gewürz... Domain...

[129971 rows x 13 columns]
>>> # Save the DataFrame as a CSV file (path is the destination file path)
>>> wine_reviews.to_csv(path)
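The effect of index_col=0 is easier to see on a tiny round trip. The following is a minimal sketch, assuming a hypothetical two-row frame in place of the wine reviews file and an in-memory buffer in place of a real path:

```python
import io

import pandas as pd

# Hypothetical miniature dataset standing in for the wine reviews file.
small = pd.DataFrame({"country": ["Italy", "Portugal"], "points": [87, 87]})

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
small.to_csv(buffer)  # the row index is written as an unnamed first column
csv_text = buffer.getvalue()

# Without index_col the saved index comes back as an "Unnamed: 0" column.
naive = pd.read_csv(io.StringIO(csv_text))
# With index_col=0 the first column is used as the row index again.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(naive.columns.tolist())
print(restored.columns.tolist())
```

This is why the first listing above shows 14 columns and the second shows 13.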
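Setting Display Options
# Show all columns (None means no limit; an integer can be given instead)
pd.set_option('display.max_columns', None)
# Show all rows
pd.set_option('display.max_rows', None)
# Set the maximum displayed width of a column's contents (default 50)
pd.set_option('display.max_colwidth', 200)
# Disable wrapping of wide frames across lines (False: do not wrap; True: wrap)
pd.set_option('display.expand_frame_repr', False)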
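pd.set_option changes the setting for the rest of the session. When the change is only needed for one printout, pd.option_context can scope it to a with-block. A minimal sketch, assuming a hypothetical 30-column frame:

```python
import pandas as pd

# Hypothetical wide frame: 30 columns named col0 ... col29.
wide = pd.DataFrame({f"col{i}": range(3) for i in range(30)})

before = pd.get_option("display.max_columns")

# The options given here apply only inside the with-block.
with pd.option_context("display.max_columns", None, "display.width", 200):
    inside = pd.get_option("display.max_columns")
    text = repr(wide)  # rendered with every column visible

# On leaving the block the previous value is restored automatically.
after = pd.get_option("display.max_columns")
print(inside, before == after)
```

Spelling out the full option name (display.max_columns rather than max_columns) avoids ambiguity when a short pattern matches more than one option.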
Aligning Column Data
# Align column data
>>> df  # before alignment
f10 f12 f14 f2 f23 f3 f8 f9
0 79 000001 平安银行 2150 142 -129 30 1442
1 61 000002 万 科A 3000 155 57 39 840
2 0 000003 PT金田A 0 0 0 0 0
>>> pd.set_option('display.unicode.ambiguous_as_wide', True)
>>> df
f10 f12 f14 f2 f23 f3 f8 f9
0 79 000001 平安银行 2150 142 -129 30 1442
1 61 000002 万 科A 3000 155 57 39 840
2 0 000003 PT金田A 0 0 0 0 0
>>> pd.set_option('display.unicode.east_asian_width', True)  # this line alone also seems to be enough
>>> df  # after alignment
f10 f12 f14 f2 f23 f3 f8 f9
0 79 000001 平安银行 2150 142 -129 30 1442
1 61 000002 万 科A 3000 155 57 39 840
2 0 000003 PT金田A 0 0 0 0 0
>>> pd.set_option('display.width', 180)  # set the print width (important)
>>> df
f10 f12 f14 f2 f23 f3 f8 f9
0 79 000001 平安银行 2150 142 -129 30 1442
1 61 000002 万 科A 3000 155 57 39 840
2 0 000003 PT金田A 0 0 0 0 0
Indexing and Selecting
>>> # Load the data and limit the display to at most 5 rows
>>> import pandas as pd
>>> reviews = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)
>>> pd.set_option('display.max_rows', 5)
>>> reviews
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas... Vulkà ... 87 NaN Sicily... Etna NaN Kerin ... @kerin... Nicosi... White ... Nicosia
1 Portugal This i... Avidagos 87 15.0 Douro NaN NaN Roger ... @vossr... Quinta... Portug... Quinta...
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 France A dry ... NaN 90 32.0 Alsace Alsace NaN Roger ... @vossr... Domain... Pinot ... Domain...
129970 France Big, r... Lieu-d... 90 21.0 Alsace Alsace NaN Roger ... @vossr... Domain... Gewürz... Domain...

[129971 rows x 13 columns]
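Selecting Rows
>>> reviews.iloc[0]      # first row, by position; returns a Series
>>> reviews.iloc[[0]]    # first row, by position; returns a DataFrame
>>> reviews.iloc[-5:]    # last 5 rows; returns a DataFrame
>>> reviews.iloc[2:4]    # third and fourth rows (positions 2 and 3)
>>> reviews[2:4]         # also the third and fourth rows
>>> reviews.loc[[2, 4]]  # several non-adjacent rows, selected by index label
  country  ...        winery
2      US  ...     Rainstorm
4      US  ...  Sweet Cheeks

[2 rows x 13 columns]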
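In the reviews data the index labels happen to equal the row positions, which hides the difference between iloc and loc. A minimal sketch, assuming a hypothetical frame whose labels start at 10, makes the difference visible:

```python
import pandas as pd

df = pd.DataFrame(
    {"country": ["Italy", "Portugal", "US", "US", "France"]},
    index=[10, 11, 12, 13, 14],  # labels deliberately differ from positions
)

by_position = df.iloc[1:3]  # positions 1 and 2; the end position is excluded
by_label = df.loc[11:13]    # labels 11, 12 and 13; the end label is included

print(by_position.index.tolist())
print(by_label.index.tolist())
```

The same rule applies to column slices: 'country':'points' with loc includes points, while 0:4 with iloc stops before position 4.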
Selecting Columns
>>> # Select one column
>>> reviews['country']            # by column name; returns a Series
>>> reviews.country               # attribute access; does not work when the name contains spaces; returns a Series
>>> reviews.loc[:, 'country']     # same as reviews['country']; returns a Series
>>> reviews.iloc[:, 0]            # by column position (the first column); returns a Series
>>> reviews[['country']]          # the country column as a DataFrame
>>> reviews.loc[:, ['country']]   # same as reviews[['country']]
>>> reviews.iloc[:, [0]]          # the column at position 0, as a DataFrame
         country
0          Italy
1       Portugal
...          ...
129969    France
129970    France

[129971 rows x 1 columns]
>>> # Select several columns; returns a DataFrame
>>> reviews.loc[:, 'country':'points']      # every column from country through points (points included); country must come before points
>>> reviews.iloc[:, 0:4]                    # the columns at positions 0 to 3 (position 4 is excluded)
>>> reviews[['country', 'points']]          # the non-adjacent country and points columns
>>> reviews.loc[:, ['country', 'points']]   # same as reviews[['country', 'points']]
>>> reviews.iloc[:, [0, 3]]                 # the non-adjacent columns at positions 0 and 3
         country  points
0          Italy      87
1       Portugal      87
...          ...     ...
129969    France      90
129970    France      90

[129971 rows x 2 columns]
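Selecting Rows and Columns Together
>>> # Select a single value; the result has the type of that value
>>> reviews['country'][0]       # first value of the country column
>>> reviews.loc[0, 'country']   # value at row label 0 in the country column
>>> reviews.iloc[1, 0]          # value in the second row (position 1) of the first column (position 0)
>>> # Select several values
>>> reviews.iloc[1:4, 0]                         # rows 2 to 4 of the first column (contiguous); returns a Series
>>> reviews.iloc[[1, 3], 0]                      # rows 2 and 4 of the first column (non-adjacent); returns a Series
>>> reviews.iloc[[1, 3], [0]]                    # rows 2 and 4 of the first column; returns a DataFrame
>>> reviews.iloc[[1, 3], 2:5]                    # rows 2 and 4, columns 3 to 5 (contiguous); returns a DataFrame
>>> reviews.iloc[[1, 3], [2, 5]]                 # rows 2 and 4, columns 3 and 6 (non-adjacent); returns a DataFrame
>>> reviews.loc[1, ['country', 'points']]        # country and points of the row labelled 1; returns a Series
>>> reviews.loc[[1], ['country', 'points']]      # country and points of the row labelled 1; returns a DataFrame
>>> reviews.loc[[1, 3], ['country', 'points']]   # country and points of the rows labelled 1 and 3; returns a DataFrame
>>> reviews.loc[[1, 3], 'country':'points']      # country through points of the rows labelled 1 and 3; returns a DataFrame
    country  ...  points
1  Portugal  ...      87
3        US  ...      87

[2 rows x 4 columns]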
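For single values pandas also offers the scalar accessors at (by label) and iat (by position). The sketch below, on a hypothetical two-row frame, also shows assignment through a single loc call, which is generally safer than the chained form reviews['country'][0] = ... because it acts on the original frame in one step:

```python
import pandas as pd

df = pd.DataFrame({"country": ["Italy", "Portugal"], "points": [87, 87]})

# Scalar access by label and by position.
label_value = df.at[0, "country"]
position_value = df.iat[1, 0]

# Assign through one .loc call rather than chained indexing.
df.loc[0, "points"] = 90

print(label_value, position_value, df["points"].tolist())
```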
Setting a Column as the Row Index
>>> reviews.set_index('title')  # returns a new DataFrame indexed by title
                                                     country  ...                winery
title                                                         ...
Nicosia 2013 Vulkà Bianco (Etna)                       Italy  ...               Nicosia
Quinta dos Avidagos 2011 Avidagos Red (Douro)       Portugal  ...   Quinta dos Avidagos
...                                                      ...  ...                   ...
Domaine Marcel Deiss 2012 Pinot Gris (Alsace)         France  ...  Domaine Marcel Deiss
Domaine Schoffit 2012 Lieu-dit Harth Cuvée Caro...    France  ...      Domaine Schoffit

[129971 rows x 12 columns]
Conditional Selection
>>> # Test whether country is Italy in each row; returns a Series
>>> reviews.country == 'Italy'
0          True
1         False
          ...
129969    False
129970    False
Name: country, Length: 129971, dtype: bool
>>> reviews[reviews.country == 'Italy']       # rows where country is Italy
>>> reviews.loc[reviews.country == 'Italy']   # same as above; loc is optional here
>>> reviews[reviews['points'] == 90]          # rows where points equals 90
>>> reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]  # rows where country is Italy and points >= 90
>>> reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]  # rows where country is Italy or points >= 90
>>> reviews.loc[reviews.country.isin(['Italy', 'France'])]  # rows where country is Italy or France
>>> reviews.loc[reviews.price.isnull()]    # rows where price is missing
>>> reviews.loc[reviews.price.notnull()]   # rows where price is not missing
>>> reviews[reviews['description'].str.contains('great')]  # rows whose description contains "great"; if missing values cause an error, add na=False, i.e. reviews[reviews['description'].str.contains('great', na=False)]
>>> reviews[~reviews['description'].str.contains('great')]  # rows whose description does not contain "great"
         country  ...                winery
0          Italy  ...               Nicosia
1       Portugal  ...   Quinta dos Avidagos
...          ...  ...                   ...
129969    France  ...  Domaine Marcel Deiss
129970    France  ...      Domaine Schoffit

[125196 rows x 13 columns]
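Each condition above is just a boolean Series, so it can be stored in a variable and combined. A minimal sketch on a hypothetical four-row frame that includes missing values:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "country": ["Italy", "France", "US", None],
        "points": [91, 88, 90, 85],
        "description": ["A great wine", None, "Tart and crisp", "great value"],
    }
)

is_italy_or_france = df["country"].isin(["Italy", "France"])
high_score = df["points"] >= 90

both = df[is_italy_or_france & high_score]    # AND: both conditions hold
either = df[is_italy_or_france | high_score]  # OR: at least one holds

# na=False treats a missing description as "no match" instead of NaN,
# so the mask is purely boolean and can be negated with ~.
has_great = df["description"].str.contains("great", na=False)
without_great = df[~has_great]

print(both.index.tolist(), either.index.tolist(), without_great.index.tolist())
```

The parentheses around each comparison in the listing above are required because & and | bind more tightly than == and >=.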
Summary Functions and Mapping
Loading the Data
>>> # Load the data and limit the display to at most 5 rows
>>> import pandas as pd
>>> reviews = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)
>>> pd.set_option('display.max_rows', 5)
Summary Functions: describe(), unique(), value_counts()
>>> reviews.points.describe()  # mean, minimum, maximum and other statistics of a numeric column
count    129971...
mean     88.447138
         ...
75%      91.000000
max      100.00...
Name: points, Length: 8, dtype: float64
>>> reviews.taster_name.describe()  # summary properties of a text column
count     103727
unique        19
top       Roger ...
freq       25514
Name: taster_name, dtype: object
>>> reviews.points.mean()  # mean of the points column
>>> reviews.taster_name.unique()  # the distinct values in taster_name; returns an array
array(['Kerin O’Keefe', 'Roger Voss', 'Paul Gregutt',
       'Alexander Peartree', 'Michael Schachner', 'Anna Lee C. Iijima',
       'Virginie Boone', 'Matt Kettmann', nan, 'Sean P. Sullivan',
       'Jim Gordon', 'Joe Czerwinski', 'Anne Krebiehl\xa0MW',
       'Lauren Buzzeo', 'Mike DeSimone', 'Jeff Jenssen',
       'Susan Kostrzewa', 'Carrie Dykes', 'Fiona Adams',
       'Christina Pickard'], dtype=object)
>>> reviews.taster_name.value_counts()  # how many times each value occurs in taster_name
Roger Voss           25514
Michael Schachner    15134
                     ...
Fiona Adams             27
Christina Pickard        6
Name: taster_name, Length: 19, dtype: int64
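Note that unique() lists nan among the 20 array entries, while describe() and value_counts() report 19 tasters: the latter two skip missing values by default. A minimal sketch on a hypothetical six-element Series shows the same behaviour:

```python
import pandas as pd

tasters = pd.Series(
    ["Roger", "Kerin", "Roger", None, "Paul", "Roger"], name="taster_name"
)

summary = tasters.describe()      # count, unique, top, freq for text data
distinct = tasters.unique()       # the missing value is included here
counts = tasters.value_counts()   # missing values are dropped by default
counts_with_na = tasters.value_counts(dropna=False)  # keep them instead

print(summary["count"], len(distinct), counts.sum(), counts_with_na.sum())
```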
Mapping: map() and apply()
>>> # Subtract the mean from a column
>>> review_points_mean = reviews.points.mean()
>>> reviews.points - review_points_mean  # returns a Series; the fastest way to subtract the mean, same result as below
>>> reviews.points.map(lambda p: p - review_points_mean)  # returns a Series; the original DataFrame is unchanged
0        -1.447138
1        -1.447138
            ...
129969    1.552862
129970    1.552862
Name: points, Length: 129971, dtype: float64
>>> # apply() processes the data row by row and is slower
>>> def remean_points(row):
...     row.points = row.points - review_points_mean
...     return row
...
>>> reviews.apply(remean_points, axis='columns')  # returns a new DataFrame; the original DataFrame is unchanged
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas... Vulkà ... -1.447138 NaN Sicily... Etna NaN Kerin ... @kerin... Nicosi... White ... Nicosia
1 Portugal This i... Avidagos -1.447138 15.0 Douro NaN NaN Roger ... @vossr... Quinta... Portug... Quinta...
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 France A dry ... NaN 1.552862 32.0 Alsace Alsace NaN Roger ... @vossr... Domain... Pinot ... Domain...
129970 France Big, r... Lieu-d... 1.552862 21.0 Alsace Alsace NaN Roger ... @vossr... Domain... Gewürz... Domain...

[129971 rows x 13 columns]
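The three approaches give the same numbers, which can be checked on a small frame. A minimal sketch, assuming hypothetical data; the row is copied inside the function here so that the example never touches the original frame:

```python
import pandas as pd

df = pd.DataFrame({"points": [87, 90, 93], "price": [15.0, 32.0, 21.0]})
mean_points = df["points"].mean()

# 1. Vectorized arithmetic on the whole column.
vectorized = df["points"] - mean_points

# 2. map(): call a function once per element.
mapped = df["points"].map(lambda p: p - mean_points)

# 3. apply() with axis="columns": call a function once per row.
def remean_points(row):
    row = row.copy()  # work on a copy of the row
    row["points"] = row["points"] - mean_points
    return row

applied = df.apply(remean_points, axis="columns")["points"]

print(vectorized.tolist(), mapped.tolist(), applied.tolist())
```

The vectorized form avoids a Python-level function call per element, which is why it is the fastest of the three on large data.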
Concatenating Two Text Columns
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'c1': ['d11', 'd21'], 'c2': [np.nan, 'd22']})
>>> df
    c1   c2
0  d11  NaN
1  d21  d22
>>> df['c3'] = df.c1 + ' ' + df.c2             # where a value is missing, concatenating with + also yields a missing value
>>> df['c4'] = df.c1 + ' ' + df.c2.fillna('')  # fillna can fill the missing values first
>>> df
    c1   c2       c3       c4
0  d11  NaN      NaN     d11 
1  d21  d22  d21 d22  d21 d22
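An alternative to + with fillna is Series.str.cat, whose na_rep parameter substitutes a replacement for missing values while joining. A minimal sketch on the same two-row data; the column name c5 is made up for this example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"c1": ["d11", "d21"], "c2": [np.nan, "d22"]})

# na_rep replaces missing values with the given text before joining.
df["c5"] = df["c1"].str.cat(df["c2"], sep=" ", na_rep="")
# Remove the trailing separator left behind where c2 was missing.
df["c5"] = df["c5"].str.strip()

print(df["c5"].tolist())
```

Without the final strip(), the first row would end with a space, as it does in column c4 above.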
Grouping and Sorting
Loading the Data
>>> # Load the data and limit the display to at most 5 rows
>>> import pandas as pd
>>> reviews = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)
>>> pd.set_option('display.max_rows', 5)
Group-wise Analysis
>>> # Group by points and count the rows in each group; returns a Series
>>> reviews.groupby('points').size()           # method 1
>>> reviews.groupby('points').points.count()   # method 2
>>> reviews.points.value_counts()              # method 3; the order may differ, and the index of the result has no name
88     17207
87     16933
       ...
99        33
100       19
Name: points, Length: 21, dtype: int64
>>> # Group by points and find the minimum price in each group
>>> reviews.groupby('points').price.min()
points
80      5.0
81      5.0
       ...
99     44.0
100    80.0
Name: price, Length: 21, dtype: float64
>>> # Group by winery and show the title of the first row in each group
>>> reviews.groupby('winery').apply(lambda df: df.title.iloc[0])
winery
1+1=3       1+1=3 ...
10 Knots    10 Kno...
            ...
àMaurice    àMauri...
Štoka       Štoka ...
Length: 16757, dtype: object
>>> # Group by several columns (country, province) and find the row with the highest points in each group
>>> reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.idxmax()])
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
country province
Argentina Mendoza... Argentina If the... Nicasi... 97 120.0 Mendoz... Mendoza NaN Michae... @wines... Bodega... Malbec Bodega...
          Other Argentina Take n... Reserva 95 90.0 Other Salta NaN Michae... @wines... Colomé... Malbec Colomé
... ... ... ... ... ... ... ... ... ... ... ... ... ...
Uruguay San Jose Uruguay Baked,... El Pre... 87 50.0 San Jose NaN NaN Michae... @wines... Castil... Red Blend Castil...
        Uruguay Uruguay Cherry... Blend ... 91 22.0 Uruguay NaN NaN Michae... @wines... Narbon... Tannat... Narbona

[425 rows x 13 columns]
>>> # Group by country and compute the row count and the minimum and maximum price of each group
>>> reviews.groupby(['country']).price.agg([len, min, max])
              len   min    max
country
Argentina  3800.0   4.0  230.0
Armenia       2.0  14.0   15.0
...           ...   ...    ...
Ukraine      14.0   6.0   13.0
Uruguay     109.0  10.0  130.0

[43 rows x 3 columns]
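agg() also accepts aggregation names as strings, which keeps integer counts as integers and makes the difference between counting rows and counting non-missing values explicit. A minimal sketch on hypothetical data with one missing price:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "country": ["Italy", "Italy", "France", "France", "France"],
        "price": [15.0, None, 32.0, 21.0, 30.0],
    }
)

# "size" counts all rows in the group; "count" counts non-missing prices only.
stats = df.groupby("country")["price"].agg(["size", "count", "min", "max"])

print(stats)
```

In this sketch Italy has two rows but only one usable price, so size and count differ for that group.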
Multi-level Indexes
>>> # Group by country and province and count the rows in each group
>>> countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
>>> countries_reviewed
                       len
country   province
Argentina Mendoza...  3264
          Other        536
...                    ...
Uruguay   San Jose       3
          Uruguay       24

[425 rows x 1 columns]
>>> # Check the type of the index
>>> mi = countries_reviewed.index
>>> type(mi)
<class 'pandas.core.indexes.multi.MultiIndex'>
>>> # Reset the row index
>>> countries_reviewed.reset_index()
       country   province   len
0    Argentina  Mendoz...  3264
1    Argentina      Other   536
..         ...        ...   ...
423    Uruguay   San Jose     3
424    Uruguay    Uruguay    24

[425 rows x 3 columns]
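A MultiIndex can also be used directly for lookups, without resetting it first. A minimal sketch on a hypothetical four-row frame; the column name n is made up for this example:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "country": ["Argentina", "Argentina", "Uruguay", "Argentina"],
        "province": ["Mendoza", "Other", "Canelones", "Mendoza"],
    }
)
counts = df.groupby(["country", "province"]).size()

one_group = counts.loc[("Argentina", "Mendoza")]  # full key: a single value
one_country = counts.loc["Argentina"]             # outer level only: a Series
flat = counts.reset_index(name="n")               # back to ordinary columns

print(one_group, one_country.to_dict(), flat.columns.tolist())
```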
Sorting
>>> # Sort the grouped result by count (by value)
>>> countries_reviewed = countries_reviewed.reset_index()
>>> countries_reviewed.sort_values(by='len')                    # ascending (the default)
>>> countries_reviewed.sort_values(by='len', ascending=False)   # descending
    country   province    len
392      US  Califo...  36247
415      US  Washin...   8639
..      ...        ...    ...
63    Chile    Coelemu      1
149  Greece     Beotia      1

[425 rows x 3 columns]
>>> # Sort the grouped result by index, ascending
>>> countries_reviewed.sort_index()
       country   province   len
0    Argentina  Mendoz...  3264
1    Argentina      Other   536
..         ...        ...   ...
423    Uruguay   San Jose     3
424    Uruguay    Uruguay    24

[425 rows x 3 columns]
>>> # Sort the grouped result by several keys
>>> countries_reviewed.sort_values(by=['country', 'len'])
       country   province   len
1    Argentina      Other   536
0    Argentina  Mendoz...  3264
..         ...        ...   ...
424    Uruguay    Uruguay    24
419    Uruguay  Canelones    43

[425 rows x 3 columns]
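When sorting by several keys, ascending may be a list with one entry per key, so each key can have its own direction. A minimal sketch on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "country": ["Uruguay", "Argentina", "Argentina", "Uruguay"],
        "len": [24, 536, 3264, 43],
    }
)

# country ascending; within each country, the largest count first.
ordered = df.sort_values(by=["country", "len"], ascending=[True, False])

print(ordered)
```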
Data Types and Missing Values
Loading the Data
>>> # Load the data and limit the display to at most 5 rows
>>> import pandas as pd
>>> reviews = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)
>>> pd.set_option('display.max_rows', 5)
Data Types and Conversion
>>> # Check column data types
>>> reviews.index.dtype   # dtype of the index: dtype('int64')
>>> reviews.price.dtype   # dtype of the price column: dtype('float64')
>>> reviews.dtypes
country        object
description    object
                ...
variety        object
winery         object
Length: 13, dtype: object
>>> # Convert the data type
>>> reviews.points.astype('float64')
0         87.0
1         87.0
          ...
129969    90.0
129970    90.0
Name: points, Length: 129971, dtype: float64
>>> # Convert a column to a list
>>> reviews['country'].tolist()
>>> # or
>>> list(reviews['country'])
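astype() raises an error if any value cannot be converted. For text columns that contain stray entries, pd.to_numeric with errors="coerce" is a common alternative. A minimal sketch, assuming a hypothetical Series in which "n/a" marks a missing score:

```python
import pandas as pd

raw = pd.Series(["87", "90", "n/a"])

# errors="coerce" turns unparseable entries into NaN instead of raising.
numbers = pd.to_numeric(raw, errors="coerce")

# astype() on the same data fails because "n/a" is not a valid integer.
try:
    raw.astype("int64")
    astype_failed = False
except ValueError:
    astype_failed = True

print(numbers.tolist(), astype_failed)
```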
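Handling Missing Values
Missing values are marked as NaN (Not a Number), which is always a float64 value. A missing value is not the same as an empty string (""): isnull() does not flag empty strings.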
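The distinction can be checked directly. A minimal sketch on a hypothetical Series that contains one empty string and two missing values:

```python
import numpy as np
import pandas as pd

s = pd.Series(["Napa", "", None, np.nan])

# The empty string is an ordinary value, so isnull() reports False for it.
print(s.isnull().tolist())

# Convert empty strings to NaN first if they should count as missing.
cleaned = s.replace("", np.nan)
print(cleaned.isnull().tolist())
```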
>>> # Select the rows where country is missing
>>> reviews[pd.isnull(reviews.country)]  # the opposite test is notnull
       country  ...              winery
913        NaN  ...  Gotsa Family Wines
3131       NaN  ...   Barton & Guestier
...        ...  ...                 ...
129590     NaN  ...           Büyülübağ
129900     NaN  ...              Psagot

[63 rows x 13 columns]
>>> # Fill the missing values of region_2 with "Unknown"
>>> reviews.region_2.fillna("Unknown")
0         Unknown
1         Unknown
           ...
129969    Unknown
129970    Unknown
Name: region_2, Length: 129971, dtype: object
>>> # Replace values
>>> reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")  # replace @kerinokeefe with @kerino in taster_twitter_handle
0            @kerino
1         @vossroger
             ...
129969    @vossroger
129970    @vossroger
Name: taster_twitter_handle, Length: 129971, dtype: object
Replacing missing values with the mean of the non-missing values:
>>> import numpy as np
>>> df = pd.DataFrame([[1, 2, 3], [4, np.nan, 5], [6, 7, np.nan], [np.nan, np.nan, np.nan]], columns=['A', 'B', 'C'])
>>> df
     A    B    C
0  1.0  2.0  3.0
1  4.0  NaN  5.0
2  6.0  7.0  NaN
3  NaN  NaN  NaN
# Method 1: fillna()
>>> df.fillna(df.mean())
          A    B    C
0  1.000000  2.0  3.0
1  4.000000  4.5  5.0
2  6.000000  7.0  4.0
3  3.666667  4.5  4.0
# Method 2: replace()
>>> df.notnull().sum()  # number of non-missing values per column
A    3
B    2
C    2
dtype: int64
>>> df.sum() / df.notnull().sum()  # column means computed over non-missing values
A    3.666667
B    4.500000
C    4.000000
dtype: float64
>>> df.replace(np.nan, df.sum() / df.notnull().sum(), inplace=True)
>>> df
          A    B    C
0  1.000000  2.0  3.0
1  4.000000  4.5  5.0
2  6.000000  7.0  4.0
3  3.666667  4.5  4.0
Forward and backward fill:
>>> df = pd.DataFrame([[1, 2, np.nan], [4, np.nan, 5], [6, 7, np.nan], [np.nan, np.nan, np.nan]], columns=['A', 'B', 'C'])
>>> df
     A    B    C
0  1.0  2.0  NaN
1  4.0  NaN  5.0
2  6.0  7.0  NaN
3  NaN  NaN  NaN
>>> df.ffill()  # forward fill: each gap takes the last valid value above it
     A    B    C
0  1.0  2.0  NaN
1  4.0  2.0  5.0
2  6.0  7.0  5.0
3  6.0  7.0  5.0
>>> df.bfill()  # backward fill: each gap takes the next valid value below it
     A    B    C
0  1.0  2.0  5.0
1  4.0  7.0  5.0
2  6.0  7.0  NaN
3  NaN  NaN  NaN
Renaming, Reordering Columns, Adding New Rows and Columns
Loading the Data
>>> # Load the data and limit the display to at most 5 rows
>>> import pandas as pd
>>> reviews = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)
>>> pd.set_option('display.max_rows', 5)
Renaming Columns and Rows
>>> # Rename columns
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> # Method 1: pass a dict that maps old names to new names
>>> df.rename(columns={"A": "a", "B": "c"}, inplace=True)
>>> # or, equivalently (use one of the two forms)
>>> df.rename(columns=dict(A='a', B='c'), inplace=True)
>>> # Method 2: assign to columns, listing every name in the original order
>>> df.columns = ['a', 'c']
>>> df
   a  c
0  1  4
1  2  5
2  3  6
>>> # Rename the row index (the two calls below are equivalent; use one of them)
>>> df.rename(index={0: "x", 1: "y", 2: "z"}, inplace=True)
>>> df.rename({0: "x", 1: "y", 2: "z"}, axis='index', inplace=True)
>>> df
   a  c
x  1  4
y  2  5
z  3  6
>>> # Give the index a name
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df.rename_axis('number', axis='rows')
        A  B
number
0       1  4
1       2  5
2       3  6
Reordering Columns
>>> # Method 1: build a list of column names in the new order and index with it
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> newColumnOrderList = ['B', 'A']
>>> df = df[newColumnOrderList]
>>> # Method 2: take the column out, drop it, then insert it at the new position
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df_B = df.B
>>> df.drop('B', axis=1, inplace=True)
>>> df.insert(0, 'new_B', df_B)  # the column can also be renamed while inserting it
>>> df
   new_B  A
0      4  1
1      5  2
2      6  3
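Method 2 can be shortened with DataFrame.pop, which removes a column and returns it in one step. A minimal sketch, assuming a hypothetical three-column frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

# pop() removes column C and returns it; insert() puts it back at position 0.
df.insert(0, "C", df.pop("C"))

print(df.columns.tolist())
```

Like the drop-and-insert version above, this modifies the frame in place.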
Adding New Columns
>>> reviews['critic'] = 'everyone'  # create a critic column filled with "everyone"; works like assigning to a dict key
>>> reviews['critic']
0         everyone
1         everyone
            ...
129969    everyone
129970    everyone
Name: critic, Length: 129971, dtype: object
# Assign from an iterable (either of the two forms below works)
>>> reviews['index_backwards'] = range(len(reviews), 0, -1)
>>> reviews['index_backwards'] = list(range(len(reviews), 0, -1))
>>> reviews['index_backwards']
0         129971
1         129970
           ...
129969         2
129970         1
Name: index_backwards, Length: 129971, dtype: int64
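Assigning with reviews[...] = ... changes the frame in place. When the original should stay untouched, DataFrame.assign returns a new frame with the extra column. A minimal sketch on hypothetical data; the column name points_per_dollar is made up for this example:

```python
import pandas as pd

df = pd.DataFrame({"points": [90, 60], "price": [15.0, 30.0]})

# assign() returns a new DataFrame; the lambda receives the frame itself.
with_ratio = df.assign(points_per_dollar=lambda d: d["points"] / d["price"])

print(with_ratio["points_per_dollar"].tolist())
print(df.columns.tolist())
```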
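Combining DataFrames
Vertically Concatenating DataFrames with the Same Column Names
>>> # Vertically concatenate DataFrames with the same column names
>>> df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('BA'))
>>> # Method 1: the append() method. It was deprecated in pandas 1.4 and removed in pandas 2.0,
>>> # so on current versions use the equivalent pd.concat() call shown here.
>>> # Concatenate two DataFrames (formerly df1.append(df2, ignore_index=True))
>>> pd.concat([df1, df2], ignore_index=True)  # returns a new DataFrame and leaves the originals unchanged; ignore_index=True resets the row index
   A  B
0  1  2
1  3  4
2  6  5
3  8  7
>>> # Concatenate several Series
>>> s1 = pd.Series(['a', 'b'])
>>> s2 = pd.Series(['c', 'd'])
>>> pd.concat([s1, s2], ignore_index=True)
0    a
1    b
2    c
3    d
dtype: object
>>> # Method 2: the concat() function, which can concatenate any number of DataFrames with the same column names
>>> df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
>>> df2 = pd.DataFrame([['c', 3], ['d', 4]], columns=['letter', 'number'])
>>> pd.concat([df1, df2])
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4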
Horizontally Concatenating or Joining DataFrames
>>> df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
>>> df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']], columns=['animal', 'name'])
>>> # Method 1: pd.concat()
>>> pd.concat([df1, df4], axis=1)
  letter  number  animal    name
0      a       1    bird   polly
1      b       2  monkey  george
>>> # Method 2: join()
>>> df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'], 'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
>>> df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'B': ['B0', 'B1', 'B2']})
>>> # Join the two DataFrames side by side, keeping every column
>>> df.join(df2, lsuffix='_df1', rsuffix='_df2')
  key_df1   A key_df2    B
0      K0  A0      K0   B0
1      K1  A1      K1   B1
2      K2  A2      K2   B2
3      K3  A3     NaN  NaN
4      K4  A4     NaN  NaN
5      K5  A5     NaN  NaN
>>> # To join the two DataFrames on the key column, set key as the index of both
>>> df.set_index('key').join(df2.set_index('key'))
      A    B
key
K0   A0   B0
K1   A1   B1
K2   A2   B2
K3   A3  NaN
K4   A4  NaN
K5   A5  NaN
>>> # Alternatively, pass the on parameter
>>> df.join(df2.set_index('key'), on='key')
  key   A    B
0  K0  A0   B0
1  K1  A1   B1
2  K2  A2   B2
3  K3  A3  NaN
4  K4  A4  NaN
5  K5  A5  NaN
Other ways to combine DataFrames include df.merge().
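As a brief illustration of merge(), the sketch below joins two hypothetical frames on a shared key column without setting any index. The how parameter controls which keys are kept:

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2", "K3"], "A": ["A0", "A1", "A2", "A3"]})
right = pd.DataFrame({"key": ["K0", "K1", "K2"], "B": ["B0", "B1", "B2"]})

# how="left" keeps every row of `left`; unmatched rows get NaN in column B.
left_joined = left.merge(right, on="key", how="left")

# how="inner" keeps only the keys present in both frames.
inner_joined = left.merge(right, on="key", how="inner")

print(left_joined)
print(inner_joined)
```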
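Generating a Data Analysis Report
# Note: newer releases of this package are published as ydata-profiling (import ydata_profiling)
import pandas_profiling as pp
report = pp.ProfileReport(df)
report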