Pandas Learning Notes
Contents
- Series
- DataFrame
- Index, reindex and hierarchical indexing
- Merge, Join, Concatenate, Groupby and Aggregate
- Read from csv
Series
A Series is a one-dimensional labeled data structure.
import numpy as np
import pandas as pd

# Construct and initialize a Series from a list
s = pd.Series([7, 'Beijing', 2.17, -1232, 'Happy birthday!'])
print(s)

0                  7
1            Beijing
2               2.17
3              -1232
4    Happy birthday!
dtype: object  # numbers and strings mixed together, so the dtype is object

# By default pandas uses 0 to n-1 as the index, but we can also specify our own.
s = pd.Series([7, 'Beijing', 2.17, -1232, 'Happy birthday!'],
              index=['A', 'B', 'C', 'D', 'E'])

# A Series can also be constructed from a dictionary, since a Series is
# essentially a set of key-value pairs.
cities = {'Beijing': 55000, 'Shanghai': 60000, 'Shenzhen': 50000, 'Hangzhou': 20000,
          'Guangzhou': 25000, 'Suzhou': None}
apts = pd.Series(cities)
print(apts)
print(type(apts))
Beijing 55000.0
Guangzhou 25000.0
Hangzhou 20000.0
Shanghai 60000.0
Shenzhen 50000.0
Suzhou NaN
dtype: float64
<class 'pandas.core.series.Series'>

# Selecting data
apts["Hangzhou"]
apts[["Hangzhou", "Beijing", "Shenzhen"]]  # selecting several labels needs an extra pair of []
20000.0
Hangzhou 20000.0
Beijing 55000.0
Shenzhen 50000.0
dtype: float64

apts < 50000
Beijing False
Shanghai False
Shenzhen False
Hangzhou True
Guangzhou True
Suzhou False
dtype: bool

apts[apts < 50000]
Guangzhou 25000.0
Hangzhou 20000.0
dtype: float64

# Assignment
apts['Shenzhen'] = 55000
print(apts['Shenzhen'])
55000.0

apts[apts < 50000] = 40000
print(apts)
Beijing 55000.0
Guangzhou 40000.0
Hangzhou 40000.0
Shanghai 60000.0
Shenzhen 55000.0
Suzhou NaN
dtype: float64

# Arithmetic
Division: apts / 2
Multiplication: apts * 2
Square: (1) np.square(apts)  (2) apts ** 2
Addition:
cars = pd.Series({'Beijing': 300000, 'Shanghai': 400000, 'Shenzhen': 300000,
                  'Tianjin': 200000, 'Guangzhou': 200000, 'Chongqing': 150000})
print(cars + apts * 100)
Beijing 5800000.0
Chongqing NaN
Guangzhou 4200000.0
Hangzhou NaN
Shanghai 6400000.0
Shenzhen 5800000.0
Suzhou NaN
Tianjin NaN
dtype: float64
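The NaNs come from index alignment: a label present in only one operand has nothing to pair with. If a missing side should count as zero instead, `Series.add` accepts a `fill_value`. A minimal sketch reusing the city numbers from above (Suzhou is left out to keep it short):

```python
import pandas as pd

# apts and cars as they stand at this point in the notes
apts = pd.Series({'Beijing': 55000.0, 'Guangzhou': 40000.0, 'Hangzhou': 40000.0,
                  'Shanghai': 60000.0, 'Shenzhen': 55000.0})
cars = pd.Series({'Beijing': 300000, 'Shanghai': 400000, 'Shenzhen': 300000,
                  'Tianjin': 200000, 'Guangzhou': 200000, 'Chongqing': 150000})

# add() with fill_value treats a label that is missing on one side as that
# value, so cities outside the intersection no longer come out as NaN
total = cars.add(apts * 100, fill_value=0)
```

Only a label missing from both operands would still produce NaN.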
# cities not present in both operands come out as NaN

# Missing data
'Hangzhou' in apts
True
'Hangzhou' in cars
False

apts.notnull()
Beijing True
Guangzhou True
Hangzhou True
Shanghai True
Shenzhen True
Suzhou False
dtype: bool

apts.isnull()
Beijing False
Guangzhou False
Hangzhou False
Shanghai False
Shenzhen False
Suzhou True
dtype: bool

# boolean filtering, as above (equivalent to apts[apts.notnull()])
apts[apts.isnull() == False]
Beijing 55000.0
Guangzhou 40000.0
Hangzhou 40000.0
Shanghai 60000.0
Shenzhen 55000.0
dtype: float64
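Besides filtering with a boolean mask, pandas has dedicated methods for missing data; a short sketch with the same apts Series:

```python
import pandas as pd

apts = pd.Series({'Beijing': 55000.0, 'Guangzhou': 40000.0, 'Hangzhou': 40000.0,
                  'Shanghai': 60000.0, 'Shenzhen': 55000.0, 'Suzhou': None})

cleaned = apts.dropna()            # same result as apts[apts.notnull()]
filled = apts.fillna(apts.mean())  # or substitute a value, e.g. the mean of the rest
```

`mean()` skips NaN by default, so the fill value here is the average of the five known cities.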
DataFrame
A DataFrame is a table: where a Series represents a one-dimensional array, a DataFrame is a two-dimensional one, comparable to an Excel spreadsheet. You can also think of a DataFrame as a collection of Series.
# Construct a DataFrame from lists
data = {'city': ['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen', 'Hangzhou', 'Chongqing'],
        'year': [2016, 2017, 2016, 2017, 2016, 2016],
        'population': [2100, 2300, 1000, 700, 500, 500]}
pd.DataFrame(data)

        city  year  population
0 Beijing 2016 2100
1 Shanghai 2017 2300
2 Guangzhou 2016 1000
3 Shenzhen 2017 700
4 Hangzhou 2016 500
5  Chongqing  2016         500

# the names and order of the columns can be specified
pd.DataFrame(data, columns=['year', 'city', 'population'])

   year       city  population

# adding a column
pd.DataFrame(data, columns=['year', 'city', 'population', 'debt'])

   year       city  population  debt
0 2016 Beijing 2100 NaN
1 2017 Shanghai 2300 NaN
2 2016 Guangzhou 1000 NaN
3 2017 Shenzhen 700 NaN
4 2016 Hangzhou 500 NaN
5  2016  Chongqing         500   NaN

# the index can likewise be specified
frame2 = pd.DataFrame(data, columns=['year', 'city', 'population', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])

# Selecting data
# Selecting a column
(1) frame2['city']
(2) print(frame2.city)
one Beijing
two Shanghai
three Guangzhou
four Shenzhen
five Hangzhou
six Chongqing
Name: city, dtype: object

print(type(frame2['city']))
pandas.core.series.Series  # note: a column is a Series

# Selecting a row (.ix in older pandas; use .loc / .iloc now)
(1) frame2.loc['three']
(2) frame2.iloc[2]
year 2016
city Guangzhou
population 1000
debt NaN
Name: three, dtype: object

print(type(frame2.loc['three']))
pandas.core.series.Series  # note: a row is also a Series

# Assignment
frame2.debt = np.arange(6)
frame2['debt'] = 100000000              # assign a whole column
frame2.loc['six'] = 0                   # assign a whole row
frame2.loc['one', 'population'] = 2200  # chained indexing like frame2["population"]["one"] triggers a SettingWithCopyWarning; loc avoids it

# A Series can also be used to set just the given index labels; the rest default to NaN.
val = pd.Series([200, 300, 500], index=['two', 'three', 'five'])
frame2['debt'] = val
print(frame2)

       year       city  population   debt
one 2016 Beijing 2100 NaN
two 2017 Shanghai 2300 200.0
three 2016 Guangzhou 1000 300.0
four 2017 Shenzhen 700 NaN
five 2016 Hangzhou 500 500.0
six    2016  Chongqing         500    NaN

# adding a boolean column
frame2['western'] = (frame2.city == 'Chongqing')
print(frame2)

       year       city  population   debt  western
one 2016 Beijing 2100 NaN False
two 2017 Shanghai 2300 200.0 False
three 2016 Guangzhou 1000 300.0 False
four 2017 Shenzhen 700 NaN False
five 2016 Hangzhou 500 500.0 False
six    2016  Chongqing         500    NaN     True

frame2.columns
Index(['year', 'city', 'population', 'debt', 'western'], dtype='object')
frame2.index
Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')
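Columns can be removed as well as added. A minimal sketch, rebuilding a small frame2 so it is self-contained, showing `del`:

```python
import pandas as pd

data = {'city': ['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen', 'Hangzhou', 'Chongqing'],
        'year': [2016, 2017, 2016, 2017, 2016, 2016],
        'population': [2100, 2300, 1000, 700, 500, 500]}
frame2 = pd.DataFrame(data, index=['one', 'two', 'three', 'four', 'five', 'six'])
frame2['western'] = (frame2.city == 'Chongqing')

# del removes a column in place; frame2.drop(columns='western') would
# instead return a new DataFrame and leave frame2 untouched
del frame2['western']
```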
# Construct a DataFrame from a dict of dicts
pop = {'Beijing': {2016: 2100, 2017: 2200},
       'Shanghai': {2015: 2400, 2016: 2500, 2017: 2600}}
frame3 = pd.DataFrame(pop)
print(frame3)

      Beijing  Shanghai
2015 NaN 2400
2016 2100.0 2500
2017   2200.0      2600

# transpose
print(frame3.T)

            2015    2016    2017
Beijing      NaN  2100.0  2200.0
Shanghai  2400.0  2500.0  2600.0

# Slicing
frame3['Beijing'][1:3]
2016 2100.0
2017 2200.0
Name: Beijing, dtype: float64
# Building a new DataFrame from slices of another
pdata = {'Beijing': frame3['Beijing'][:-1], 'Shanghai': frame3['Shanghai'][:-1]}
print(pd.DataFrame(pdata))

      Beijing  Shanghai
2015      NaN      2400
2016   2100.0      2500

# Naming the index and the columns
frame3.index.name = 'year'
frame3.columns.name = 'city'
city Beijing Shanghai
year
2015 NaN 2400
2016 2100.0 2500
2017   2200.0      2600

# Extract all the values as an array
frame3.values
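What `.values` hands back is a plain NumPy array: the labels are gone, and mixed dtypes get upcast to a common one (here float64, because of the NaN). A quick sketch:

```python
import numpy as np
import pandas as pd

pop = {'Beijing': {2016: 2100, 2017: 2200},
       'Shanghai': {2015: 2400, 2016: 2500, 2017: 2600}}
frame3 = pd.DataFrame(pop)

vals = frame3.values  # plain 2-D ndarray; the row/column labels are gone
```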
Index, reindex and hierarchical indexing
obj = pd.Series(range(3), index=['a', 'b', 'c'])  # the default integer positions remain usable alongside the labels
index = obj.index
print(index)
print(index[1:])

Index(['a', 'b', 'c'], dtype='object')
Index(['b', 'c'], dtype='object')

index[1] = 'd'  # illegal: Index objects are immutable, so this raises an error

index = pd.Index(np.arange(3))
obj2 = pd.Series([2,5,7], index=index)
print(obj2)
0 2
1 5
2 7
dtype: int64

# Indexing and slicing on the index
obj = pd.Series(np.arange(4), index=['a','b','c','d'])
print(obj)
a 0
b 1
c 2
d 3
dtype: int32

obj[['b', 'a']]
b 1
a 0
dtype: int32

obj[[0, 2]]
a 0
c 2
dtype: int32

obj[1:3]
b 1
c 2
dtype: int32

obj['b':'d'] = 5
print(obj)
a 0
b 5
c 5
d 5
dtype: int32
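One detail worth calling out in the slices above: label slices include their endpoint, while positional slices do not. Using the explicit .loc/.iloc indexers makes the distinction unambiguous:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'d']  # label slicing: the endpoint 'd' IS included
by_pos = obj.iloc[1:3]       # positional slicing: endpoint excluded, like Python lists
```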
# Indexing a DataFrame works much the same as a Series
frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['a', 'c', 'd'],
                     columns=['Hangzhou', 'Shenzhen', 'Nanjing'])
print(frame)

   Hangzhou  Shenzhen  Nanjing
a 0 1 2
c 3 4 5
d 6 7 8
frame['Hangzhou']

frame[:2]

   Hangzhou  Shenzhen  Nanjing
a         0         1        2
c         3         4        5

frame.loc[:, 'Shenzhen':'Nanjing']  # .ix in old pandas; .loc today

   Shenzhen  Nanjing
a         1        2
c         4        5
d         7        8

# reindex
# rearrange a Series or DataFrame to follow a new index order
obj = pd.Series([4.5, 7.2, -5.3, 3.2], index=['d', 'b', 'a', 'c'])
print(obj)
d 4.5
b 7.2
a -5.3
c 3.2
dtype: float64

obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
a -5.3
b 7.2
c 3.2
d 4.5
e     0.0  # with fill_value=0 the new label gets 0.0 instead of NaN
dtype: float64

# Deleting entries from the index of a Series or DataFrame with drop
frame.drop('a')

   Hangzhou  Shenzhen  Nanjing
c 3 4 5
d         6         7        8

# dropping a column needs axis=1
frame.drop('Shenzhen', axis=1)

   Hangzhou  Nanjing
a 0 2
c 3 5
d 6 8
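Note that drop is not in-place: it returns a new object and leaves the original untouched, and a list drops several labels at once. A sketch with the same frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['a', 'c', 'd'],
                     columns=['Hangzhou', 'Shenzhen', 'Nanjing'])

# drop returns a new object; the original frame keeps its shape
smaller = frame.drop(['a', 'c'])
fewer_cols = frame.drop(['Shenzhen', 'Nanjing'], axis=1)
```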
hierarchical index
data = pd.Series(np.random.randn(10),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 2, 1, 2, 3, 1, 2]])
print(data)

a  1   -0.535940
   2    0.190444
   3    0.137448
b  1    2.442986
   2    0.211423
c  1    1.439100
   2   -1.690219
   3    1.338686
d  1    1.274225
   2   -0.238439
dtype: float64

print(data.index)
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 2, 2, 2, 3, 3],
                   [0, 1, 2, 0, 1, 0, 1, 2, 0, 1]])

print(data.b)
1 2.442986
2 0.211423
dtype: float64

data[1:4]  # note: this slices positionally, not by the inner level

a  2    0.190444
   3    0.137448
b  1    2.442986
dtype: float64

# unstack and stack convert between a hierarchically indexed Series and a DataFrame.
data.unstack()

          1         2         3
a -0.535940  0.190444  0.137448
b  2.442986  0.211423       NaN
c  1.439100 -1.690219  1.338686
d  1.274225 -0.238439       NaN

data.unstack().stack()  # converts back again

# hierarchical indexing on a DataFrame
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Beijing', 'Beijing', 'Shanghai'],
                              ['apts', 'cars', 'apts']])
print(frame)

    Beijing      Shanghai
       apts cars     apts
a 1       0    1        2
  2       3    4        5
b 1       6    7        8
  2       9   10       11

frame.index.names = ['alpha', 'number']
frame.columns.names = ['city', 'type']
print(frame)

city          Beijing      Shanghai
type             apts cars     apts
alpha number
a     1             0    1        2
      2             3    4        5
b     1             6    7        8
      2             9   10       11

frame.loc[('a', 2), ('Beijing', 'apts')]  # .ix['a', 2]['Beijing']['apts'] in old pandas
3
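The tuple-based .loc lookup above generalizes: a tuple addresses one position per index level, and groupby(level=...) aggregates over a level. A runnable sketch with the same frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Beijing', 'Beijing', 'Shanghai'],
                              ['apts', 'cars', 'apts']])
frame.index.names = ['alpha', 'number']
frame.columns.names = ['city', 'type']

# tuples address one cell across both index levels
cell = frame.loc[('a', 2), ('Beijing', 'apts')]

# aggregating over one level collapses the other ('alpha' here)
by_number = frame.groupby(level='number').sum()
```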
Merge, Join, Concatenate, Groupby and Aggregate
concatenate
df1 = pd.DataFrame({'apts': [55000, 60000],
                    'cars': [200000, 300000]},
                   index=['Shanghai', 'Beijing'])
df2 = pd.DataFrame({'apts': [25000, 20000],
                    'cars': [150000, 120000]},
                   index=['Hangzhou', 'Najing'])
df3 = pd.DataFrame({'apts': [30000, 10000],
                    'cars': [180000, 100000]},
                   index=['Guangzhou', 'Chongqing'])
result = pd.concat([df1, df2, df3])
print(result)

            apts    cars
Shanghai 55000 200000
Beijing 60000 300000
Hangzhou 25000 150000
Najing 20000 120000
Guangzhou 30000 180000
Chongqing  10000  100000

result2 = pd.concat([df1, df2, df3], keys=['x', 'y', 'z'])
print(result2)

              apts    cars
x Shanghai   55000  200000
  Beijing    60000  300000
y Hangzhou   25000  150000
  Najing     20000  120000
z Guangzhou  30000  180000
  Chongqing  10000  100000

df4 = pd.DataFrame({'salaries': [10000, 30000, 30000, 20000, 15000]},
                   index=['Suzhou', 'Beijing', 'Shanghai', 'Guangzhou', 'Tianjin'])
result3 = pd.concat([result, df4], axis=1)  # concat stacks vertically by default; axis=1 joins side by side
print(result3)

              apts      cars  salaries
Beijing 60000.0 300000.0 30000.0
Chongqing 10000.0 100000.0 NaN
Guangzhou 30000.0 180000.0 20000.0
Hangzhou 25000.0 150000.0 NaN
Najing 20000.0 120000.0 NaN
Shanghai 55000.0 200000.0 30000.0
Suzhou NaN NaN 10000.0
Tianjin        NaN       NaN   15000.0

# join='inner' keeps only the rows present in every piece, dropping the NaN rows
result3 = pd.concat([result, df4], axis=1, join='inner')
print(result3)

            apts    cars  salaries
Shanghai   55000  200000     30000
Beijing    60000  300000     30000
Guangzhou  30000  180000     20000

# append as shorthand concatenation
df1.append(df2)  # stacks df2 under df1 unconditionally, duplicates and all (removed in pandas 2.0 in favour of pd.concat)

# A Series can be concatenated with a DataFrame too: the Series is first converted
# to a DataFrame and then joined, since a Series is essentially a one-column DataFrame.
s1 = pd.Series([60, 50], index=['Shanghai', 'Beijing'], name='meal')
pd.concat([df1, s1], axis=1)

           apts    cars  meal
Shanghai  55000  200000    60
Beijing   60000  300000    50

s2 = pd.Series([18000, 120000], index=['apts', 'cars'], name='Xiamen')
print(df1.append(s2))

           apts    cars
Shanghai 55000 200000
Beijing 60000 300000
Xiamen 18000 120000
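Since DataFrame.append was removed in pandas 2.0, the modern equivalent of the snippet above is to turn the Series into a one-row frame and concat:

```python
import pandas as pd

df1 = pd.DataFrame({'apts': [55000, 60000], 'cars': [200000, 300000]},
                   index=['Shanghai', 'Beijing'])
s2 = pd.Series([18000, 120000], index=['apts', 'cars'], name='Xiamen')

# the Series name becomes the new row label after transposing
combined = pd.concat([df1, s2.to_frame().T])
```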
Merge(Join)
# merge matches rows on column values
# join matches rows on the index
df1 = pd.DataFrame({'apts': [55000, 60000, 58000],
                    'cars': [200000, 300000, 250000],
                    'cities': ['Shanghai', 'Beijing', 'Shenzhen']})
df4 = pd.DataFrame({'salaries': [10000, 30000, 30000, 20000, 15000],
                    'cities': ['Suzhou', 'Beijing', 'Shanghai', 'Guangzhou', 'Tianjin']})
result = pd.merge(df1, df4, on='cities')  # rows that fail to match are dropped
print(result)

    apts    cars    cities  salaries
0  55000  200000  Shanghai     30000
1  60000  300000   Beijing     30000

result2 = pd.merge(df1, df4, on='cities', how='outer')  # unmatched rows are kept; how='left' / how='right' also exist

      apts      cars     cities  salaries
0 55000.0 200000.0 Shanghai 30000.0
1 60000.0 300000.0 Beijing 30000.0
2 58000.0 250000.0 Shenzhen NaN
3 NaN NaN Suzhou 10000.0
4 NaN NaN Guangzhou 20000.0
5      NaN       NaN    Tianjin   15000.0

df1.join(df4, how='outer')  # joins on the index; overlapping column names like 'cities' need lsuffix/rsuffix
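When debugging an outer merge it helps to know where each row came from; merge's `indicator` parameter records exactly that. A sketch with the same two frames:

```python
import pandas as pd

df1 = pd.DataFrame({'apts': [55000, 60000, 58000],
                    'cars': [200000, 300000, 250000],
                    'cities': ['Shanghai', 'Beijing', 'Shenzhen']})
df4 = pd.DataFrame({'salaries': [10000, 30000, 30000, 20000, 15000],
                    'cities': ['Suzhou', 'Beijing', 'Shanghai', 'Guangzhou', 'Tianjin']})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
merged = pd.merge(df1, df4, on='cities', how='outer', indicator=True)
```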
Group By
Grouping
# Suppose we have a record of every employee's income at a company.
salaries = pd.DataFrame({
    'Name': ['July', 'Chu', 'Chu', 'Lin', 'July', 'July', 'Chu', 'July'],
    'Year': [2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017],
    'Salary': [10000, 2000, 4000, 5000, 18000, 25000, 3000, 4000],
    'Bonus': [3000, 1000, 1000, 1200, 4000, 2300, 500, 1000]})
group_by_name = salaries.groupby('Name')

# groupby is often used together with aggregate
(1) group_by_name.aggregate(sum)
(2) group_by_name.sum()

      Year  Salary  Bonus
Name
Chu 6049 9000 2500
July 8067 57000 10300
Lin   2016    5000   1200

group_by_name_year = salaries.groupby(['Name', 'Year'])
group_by_name_year.sum()

           Salary  Bonus
Name Year
Chu  2016    6000   2000
     2017    3000    500
July 2016   10000   3000
     2017   47000   7300
Lin  2016    5000   1200

# describe shows a battery of useful summary statistics
group_by_name_year.describe()

          Bonus                           ...   Salary
          count         mean         std  ...      50%      75%      max
Name Year                                 ...
Chu  2016   2.0  1000.000000     0.00000  ...   3000.0   3500.0   4000.0
     2017   1.0   500.000000         NaN  ...   3000.0   3000.0   3000.0
July 2016   1.0  3000.000000         NaN  ...  10000.0  10000.0  10000.0
     2017   3.0  2433.333333  1504.43788  ...  18000.0  21500.0  25000.0
Lin  2016   1.0  1200.000000         NaN  ...   5000.0   5000.0   5000.0

# .size(), .max(), .min() and similar methods are also available
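Beyond a single function, agg accepts a dict mapping each column to its own aggregation(s); a sketch with the same salaries table:

```python
import pandas as pd

salaries = pd.DataFrame({
    'Name': ['July', 'Chu', 'Chu', 'Lin', 'July', 'July', 'Chu', 'July'],
    'Year': [2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017],
    'Salary': [10000, 2000, 4000, 5000, 18000, 25000, 3000, 4000],
    'Bonus': [3000, 1000, 1000, 1200, 4000, 2300, 500, 1000]})

# different aggregations per column in one pass; the result gets
# hierarchical columns like ('Salary', 'sum')
stats = salaries.groupby('Name').agg({'Salary': ['sum', 'mean'], 'Bonus': 'max'})
```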
Read from csv
Let's start by reading some data from a CSV file.
bikes.csv records Montreal bicycle-path data: 7 paths, with a daily rider count for each path.

bikes = pd.read_csv('bikes.csv', encoding='latin1', sep=';', parse_dates=['Date'],
                    dayfirst=True, index_col='Date')
# the file contains French, so utf-8 would fail to decode; fields are separated by ';'
print(bikes.head())

            Berri 1  Brébeuf (données non disponibles)  Côte-Sainte-Catherine  \
Date
2012-01-01 35 NaN 0
2012-01-02 83 NaN 1
2012-01-03 135 NaN 2
2012-01-04 144 NaN 1
2012-01-05      197                                NaN                      2

# dropna removes rows or columns containing NA
bikes.dropna(how='all', axis=1).head()  # axis=1 with how='all' drops columns that are entirely NA; without axis=1 it would drop such rows

berri_bikes = bikes[['Berri 1']].copy()
# berri_bikes.index.weekday gives the day of the week for each date
berri_bikes.loc[:, 'weekday'] = berri_bikes.index.weekday

            Berri 1  weekday
Date
2012-01-01 35 6
2012-01-02 83 0
2012-01-03 135 1
2012-01-04 144 2
2012-01-05      197        3

weekday_counts = berri_bikes.groupby('weekday').sum()  # total riders for each day of the week

         Berri 1
weekday
0 134298
1 135305
2 152972
3 160131
4 141771
5 101578
6         99310

# How to fill in the missing data
bikes_sum = bikes.sum(axis=1).to_frame()  # sum returns a Series by default; to_frame makes it a DataFrame
bikes_sum.loc[:, 'weekday'] = bikes_sum.index.weekday

                0  weekday
Date
2012-01-01 176.0 6
2012-01-02 407.0 0
2012-01-03 639.0 1
2012-01-04 759.0 2
2012-01-05  858.0        3

# Add every route together to get the number of people cycling each day, then total by weekday.
weekday_counts = bikes_sum.groupby('weekday').aggregate(sum)
weekday_counts.index = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                        'Friday', 'Saturday', 'Sunday']

                    0
Monday 714963.0
Tuesday 698582.0
Wednesday 789722.0
Thursday 829069.0
Friday 738772.0
Saturday 516701.0
Sunday 518047.0
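Since bikes.csv isn't bundled here, the whole weekday pipeline can be reproduced on synthetic data (the date range and counts below are made up):

```python
import pandas as pd

# a two-week stand-in for one route of bikes.csv
idx = pd.date_range('2012-01-01', periods=14, freq='D')
bikes = pd.DataFrame({'Berri 1': range(100, 114)}, index=idx)

bikes['weekday'] = bikes.index.weekday  # Monday=0 ... Sunday=6
weekday_counts = bikes.groupby('weekday')['Berri 1'].sum()
weekday_counts.index = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                        'Friday', 'Saturday', 'Sunday']
```

Renaming the index this way relies on every weekday being present in the data and on groupby sorting the keys 0 through 6, which both hold here.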