(XWZ)'s Python Study Notes: pandas

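pandas data structures: the two core structures are Series (one-dimensional) and DataFrame (two-dimensional). Older releases also shipped Panel (three-dimensional), Panel4D (four-dimensional), and PanelND (higher-dimensional), but these have since been removed from pandas.
A Series is a one-dimensional labeled array that can hold any data type: integers, strings, floats, Python objects, and so on. Its elements can be located by label.
A DataFrame is a two-dimensional labeled data structure. Data can be located by label, which plain NumPy arrays do not offer.
In pandas, a Series can be viewed as a dataset consisting of a single column.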

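Use pd.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False) to create a Series (fastpath is an internal argument and is normally left alone):

import numpy as np
import pandas as pd
pd.Series(np.random.randint(10, size=7))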
'''
0    2
1    1
2    2
3    0
4    4
5    5
6    2
dtype: int64
'''
# Create a Series from a dict
d = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
pd.Series(d)
'''
a    1
b    2
c    3
d    4
e    5
dtype: int64
'''

The first column holds the labels and the second column holds the values.
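As a small sketch of the two construction styles above, the snippet below builds one Series from a list with explicit labels and one from a dict. The variable names and the labels "x", "y", "z" are made up for illustration.

```python
import pandas as pd

# From a list, with explicit labels and a name (all values are illustrative).
s_list = pd.Series([10, 20, 30], index=["x", "y", "z"], name="demo")

# From a dict: keys become the labels, values become the data.
s_dict = pd.Series({"a": 1, "b": 2, "c": 3})

print(s_list.index.tolist())  # the labels
print(s_dict.to_numpy())      # the values as a NumPy array
```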

Use xx.drop() to delete the element at a given label. It returns a new Series and leaves the original unchanged.

a
'''
0    3
1    0
2    8
3    5
4    1
5    7
6    6
dtype: int64
'''
a.drop(2)
'''
0    3
1    0
3    5
4    1
5    7
6    6
dtype: int64
'''

Use xx[index] = value to modify the element at a given label.
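A minimal sketch of dropping and assigning by label, using made-up string labels. It also shows that assigning to a label that does not exist yet appends a new element.

```python
import pandas as pd

s = pd.Series([3, 0, 8], index=["a", "b", "c"])

without_b = s.drop("b")  # returns a new Series; s itself is untouched
s["b"] = 99              # overwrite the value at an existing label
s["d"] = 5               # assigning to a new label appends an element

print(s)
```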

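Use a.add(b) to add the elements of a and b that share the same label; a label present in only one of the two yields NaN (a missing value). a.sub(b), a.mul(b), and a.div(b) work the same way.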

a
'''
0    3
1    0
2    8
3    5
4    1
5    7
6    6
dtype: object
'''
b
'''
1    2
0    3
dtype: int64
'''
a.add(b)
'''
0      6
1      2
2    NaN
3    NaN
4    NaN
5    NaN
6    NaN
dtype: object
'''
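The example above can be reproduced with the sketch below. It also shows the fill_value argument of add(), which treats a label missing on one side as that value instead of producing NaN.

```python
import pandas as pd

a = pd.Series([3, 0, 8, 5, 1, 7, 6])
b = pd.Series([2, 3], index=[1, 0])

plain = a.add(b)                 # NaN wherever only one side has the label
filled = a.add(b, fill_value=0)  # the missing side is treated as 0

print(plain)
print(filled)
```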

a.median() returns the median of a, a.sum() the sum of its elements, a.max() the maximum, and a.min() the minimum.
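A quick sketch of these reductions on the seven-element Series used earlier in these notes:

```python
import pandas as pd

a = pd.Series([3, 0, 8, 5, 1, 7, 6])

median = a.median()  # middle value of the sorted data
total = a.sum()
largest = a.max()
smallest = a.min()

print(median, total, largest, smallest)
```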

Values can be retrieved either by label or by integer position.

d = {'a': 1, 'b': 2, 'c':3, 'd':4}
a = pd.Series(d)
a[:'b'] # slice by label
'''
a    1
b    2
dtype: int64
'''
In [33]: b[1:-1] # slice by integer position
Out[33]:
b    2
c    3
dtype: int64
In [60]: b[2]
Out[60]: 3
In [62]: b[0]
Out[62]: 1
In [63]: b[:-1]
Out[63]:
a    1
b    2
c    3
dtype: int64
In [64]: b[['a', 'c']]
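The same lookups can be written with the explicit .loc (label) and .iloc (position) accessors, which avoids any ambiguity between the two. This is a sketch with the four-element Series from the transcript; using the accessors instead of plain brackets is my suggestion, not something the transcript does.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

by_label = s.loc["a":"c"]  # label slice: both endpoints included
by_pos = s.iloc[1:-1]      # positional slice: stop is excluded
third = s.iloc[2]          # single element by position
picked = s[["a", "c"]]     # a list of labels selects several elements

print(by_label, by_pos, third, picked, sep="\n")
```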

Unlike a Series, a DataFrame can hold multiple columns of data, and in practice it is the more commonly used of the two.

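pd.date_range(start=None, end=None, periods=None, freq='D', tz=None, normalize=False, name=None, closed=None, **kwargs) generates a fixed-frequency time index (a DatetimeIndex). The index is determined by exactly three of the four parameters start, end, periods, and freq, so when freq is given (or left at its daily default) you specify exactly two of start, end, and periods; specifying freq together with all three raises an error.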

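start: the start time; 'today' means starting from the current moment
end: the end time
periods: the number of labels to generate
freq: the interval between adjacent labels, given as a str or a DateOffset. For example '10s' is 10 seconds, '10h' is 10 hours, and '10d' is 10 days; '10m' steps by 10 month ends and '10y' by 10 year ends, so those labels fall on the last day of a month or year rather than on the start date. The default is 'd'. (Recent pandas releases spell the month-end and year-end aliases 'ME' and 'YE'.)
name: the name of the resulting index, a string or None
closed: controls whether the endpoints are kept. With closed=None both start and end are kept; closed='left' keeps start and drops end (left-closed, right-open); closed='right' drops start and keeps end (left-open, right-closed). Recent pandas releases replace this parameter with inclusive.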

pd.date_range(start='20201217', end='20210101') # labels from 2020-12-17 to 2021-01-01; the interval freq defaults to 'd'
'''
DatetimeIndex(['2020-12-17', '2020-12-18', '2020-12-19', '2020-12-20','2020-12-21', '2020-12-22', '2020-12-23', '2020-12-24','2020-12-25', '2020-12-26', '2020-12-27', '2020-12-28','2020-12-29', '2020-12-30', '2020-12-31', '2021-01-01'],dtype='datetime64[ns]', freq='D')
'''
pd.date_range(start='20201217', freq='2m', end='20211217') # labels between 2020-12-17 and 2021-12-17 at every second month end ('2m')
'''
DatetimeIndex(['2020-12-31', '2021-02-28', '2021-04-30', '2021-06-30','2021-08-31', '2021-10-31'],dtype='datetime64[ns]', freq='2M')
'''
pd.date_range(start='20201217', periods=10, freq='2y') # 10 labels starting from 2020-12-17, one at every second year end
'''
DatetimeIndex(['2020-12-31', '2022-12-31', '2024-12-31', '2026-12-31','2028-12-31', '2030-12-31', '2032-12-31', '2034-12-31','2036-12-31', '2038-12-31'],dtype='datetime64[ns]', freq='2A-DEC')
'''
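The sketch below repeats the idea with frequency aliases that I expect to behave the same across pandas versions: 'D' for days and 'MS' for month starts. The choice of 'MS' instead of the month-end alias used above is mine, made only to sidestep the alias renaming in recent releases.

```python
import pandas as pd

# start + end, daily by default: 2020-12-17 through 2021-01-01 inclusive.
daily = pd.date_range(start="2020-12-17", end="2021-01-01")

# start + periods + freq: ten labels spaced two days apart.
every_two_days = pd.date_range(start="2020-12-17", periods=10, freq="2D")

# Anchored frequency: labels snap to month starts, not to the start date.
month_starts = pd.date_range(start="2020-12-17", end="2021-12-17", freq="2MS")

print(len(daily), every_two_days[-1], month_starts[0])
```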

Labels in a Series may repeat.
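A short sketch of what repeated labels mean in practice, with made-up data: looking up a duplicated label returns every matching element as a Series, while a unique label still returns a scalar.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "a", "b"])

dup = s["a"]     # two matches, so the result is itself a Series
single = s["b"]  # one match, so the result is a scalar

print(s.index.is_unique)  # False, because "a" appears twice
print(dup)
```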

pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

index: the row labels (row names)
columns: the column names

In [6]: idx =  ['a', 'b', 'c', 'd', 'e']
In [7]: clmn = ['A', 'B', 'C', 'D']
In [10]: pd.DataFrame(np.random.randint(10, size=(5, 4)), index=idx, columns=clmn)
Out[10]:
   A  B  C  D
a  4  1  6  7
b  8  9  2  8
c  0  9  0  1
d  6  2  5  8
e  2  1  9  2
# Create a DataFrame from a dict: each key is a column name and each value holds that column's elements
In [34]: data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
    ...:         'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
    ...:         'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
    ...:         'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
In [36]: pd.DataFrame(data)
Out[36]:
  animal  age  visits priority
0    cat  2.5       1      yes
1    cat  3.0       3      yes
2  snake  0.5       2       no
3    dog  NaN       3      yes
4    dog  5.0       2       no
5    cat  2.0       3       no
6  snake  4.5       1       no
7    cat  NaN       1      yes
8    dog  7.0       2       no
9    dog  3.0       1       no
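Later transcripts in these notes show the same table with row labels 'a' through 'j' instead of 0 through 9. The notes do not show how that frame was built; one plausible way, sketched here as an assumption, is to pass index= when constructing the DataFrame.

```python
import numpy as np
import pandas as pd

data = {
    "animal": ["cat", "cat", "snake", "dog", "dog", "cat", "snake", "cat", "dog", "dog"],
    "age": [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
    "visits": [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
    "priority": ["yes", "yes", "no", "yes", "no", "no", "no", "yes", "no", "no"],
}

# Assumed reconstruction: one single-letter row label per record.
labels = list("abcdefghij")
df = pd.DataFrame(data, index=labels)

print(df.head(3))
```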

Inspect the data types of a DataFrame (a is assumed to hold the DataFrame built above)

In [41]: a.dtypes
Out[41]:
animal       object
age         float64
visits        int64
priority     object
dtype: object

Basic DataFrame operations

In [42]: a.head(3) # preview the first 3 rows
Out[42]:
  animal  age  visits priority
0    cat  2.5       1      yes
1    cat  3.0       3      yes
2  snake  0.5       2       no
In [43]: a.tail(3) # preview the last 3 rows
Out[43]:
  animal  age  visits priority
7    cat  NaN       1      yes
8    dog  7.0       2       no
9    dog  3.0       1       no
In [44]: a.index # view the row labels
Out[44]: RangeIndex(start=0, stop=10, step=1)
In [45]: a.columns # view the column names
Out[45]: Index(['animal', 'age', 'visits', 'priority'], dtype='object')
In [46]: a.values # view the values
Out[46]:
array([['cat', 2.5, 1, 'yes'],['cat', 3.0, 3, 'yes'],['snake', 0.5, 2, 'no'],['dog', nan, 3, 'yes'],['dog', 5.0, 2, 'no'],['cat', 2.0, 3, 'no'],['snake', 4.5, 1, 'no'],['cat', nan, 1, 'yes'],['dog', 7.0, 2, 'no'],['dog', 3.0, 1, 'no']], dtype=object)
In [47]: a.describe() # summary statistics
Out[47]:
            age     visits
count  8.000000  10.000000
mean   3.437500   1.900000
std    2.007797   0.875595
min    0.500000   1.000000
25%    2.375000   1.000000
50%    3.000000   2.000000
75%    4.625000   2.750000
max    7.000000   3.000000
In [48]: a.T # transpose
Out[48]:
            0    1      2    3    4    5      6    7    8    9
animal    cat  cat  snake  dog  dog  cat  snake  cat  dog  dog
age       2.5    3    0.5  NaN    5    2    4.5  NaN    7    3
visits      1    3      2    3    2    3      1    1    2    1
priority  yes  yes     no  yes   no   no     no  yes   no   no
In [49]: a.sort_values('age') # sort by the age column
Out[49]:
  animal  age  visits priority
2  snake  0.5       2       no
5    cat  2.0       3       no
0    cat  2.5       1      yes
1    cat  3.0       3      yes
9    dog  3.0       1       no
6  snake  4.5       1       no
4    dog  5.0       2       no
8    dog  7.0       2       no
3    dog  NaN       3      yes
7    cat  NaN       1      yes
In [52]: a[4:-1] # slice rows by position
Out[52]:
  animal  age  visits priority
4    dog  5.0       2       no
5    cat  2.0       3       no
6  snake  4.5       1       no
7    cat  NaN       1      yes
8    dog  7.0       2       no
# From here on, a carries the row labels 'a' through 'j' instead of 0 through 9
In [16]: a[['animal', 'age']] # select several columns
Out[16]:
  animal  age
a    cat  2.5
b    cat  3.0
c  snake  0.5
d    dog  NaN
e    dog  5.0
f    cat  2.0
g  snake  4.5
h    cat  NaN
i    dog  7.0
j    dog  3.0
# Slicing a Series
In [19]: b = pd.Series([1, 2, 3, 4], ['a', 'b', 'c', 'd'])
In [20]: b
Out[20]:
a    1
b    2
c    3
d    4
dtype: int64
In [22]: b[1:2] # positional slices are half-open: start included, stop excluded
Out[22]:
b    2
dtype: int64
In [23]: b[:-1] # positional slices are half-open: start included, stop excluded
Out[23]:
a    1
b    2
c    3
dtype: int64
In [28]: b['a':'c'] # label slices are closed: both endpoints are included
Out[28]:
a    1
b    2
c    3
dtype: int64
# Slicing a DataFrame works much like slicing a Series, except that a DataFrame has several columns
In [36]: a[-3:-1]
Out[36]:
  animal  age  visits priority
h    cat  NaN       1      yes
i    dog  7.0       2       no
In [37]: a['h':'j']
Out[37]:
  animal  age  visits priority
h    cat  NaN       1      yes
i    dog  7.0       2       no
j    dog  3.0       1       no
# The forms above cannot index rows and columns at the same time; use iloc[] and loc[] for that. iloc indexes by integer position, loc by label
In [46]: a.iloc[[1, 3, 4], [2, 3]]
Out[46]:
   visits priority
b       3      yes
d       3      yes
e       2       no
In [47]: a.iloc[:4, 2:3]
Out[47]:
   visits
a       1
b       3
c       2
d       3
In [50]: a.loc['a':'f', 'animal':'age']
Out[50]:
  animal  age
a    cat  2.5
b    cat  3.0
c  snake  0.5
d    dog  NaN
e    dog  5.0
f    cat  2.0
In [54]: a.loc[['a', 'e', 'f'], ['animal', 'visits']]
Out[54]:
  animal  visits
a    cat       1
e    dog       2
f    cat       3
In [72]: a.copy() # make a copy
Out[72]:
  animal  age  visits priority
a    cat  2.5       1      yes
b    cat  3.0       3      yes
c  snake  0.5       2       no
d    dog  NaN       3      yes
e    dog  5.0       2       no
f    cat  2.0       3       no
g  snake  4.5       1       no
h    cat  NaN       1      yes
i    dog  7.0       2       no
j    dog  3.0       1       no
In [73]: a.isnull() # test each value for being missing
Out[73]:
   animal    age  visits  priority
a   False  False   False     False
b   False  False   False     False
c   False  False   False     False
d   False   True   False     False
e   False  False   False     False
f   False  False   False     False
g   False  False   False     False
h   False   True   False     False
i   False  False   False     False
j   False  False   False     False
# Add a new column
In [77]: newcol = pd.Series(np.arange(a.shape[0]), a.index)
In [78]: newcol
Out[78]:
a    0
b    1
c    2
d    3
e    4
f    5
g    6
h    7
i    8
j    9
dtype: int32
In [79]: a['number'] = newcol
In [80]: a
Out[80]:
  animal  age  visits priority  number
a    cat  2.5       1      yes       0
b    cat  3.0       3      yes       1
c  snake  0.5       2       no       2
d    dog  NaN       3      yes       3
e    dog  5.0       2       no       4
f    cat  2.0       3       no       5
g  snake  4.5       1       no       6
h    cat  NaN       1      yes       7
i    dog  7.0       2       no       8
j    dog  3.0       1       no       9
# iat[] accesses a single element by position, whereas iloc[] can select multiple elements
In [87]: a.iat[2, 3]
Out[87]: 'no'
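In [101]: a.mean() # averages the numeric columns by default (recent pandas versions need a.mean(numeric_only=True) when non-numeric columns are present)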
Out[101]:
age       3.4375
visits    1.9000
number    4.5000
dtype: float64
In [103]: a['visits'].sum() # sum a single column
Out[103]: 19
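The transcript's a.mean() relied on older pandas silently skipping non-numeric columns. A sketch that should also work on current versions passes numeric_only=True explicitly; the three-row frame is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "animal": ["cat", "dog", "snake"],
    "age": [2.5, np.nan, 0.5],
    "visits": [1, 3, 2],
})

means = df.mean(numeric_only=True)  # skips "animal"; NaN in "age" is ignored
total_visits = df["visits"].sum()   # reduce a single column

print(means)
print(total_visits)
```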

Slices such as a['a':'f'] or a[2:4] operate on rows, whereas an index such as a['animal'] operates on columns.
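The rule can be checked on a tiny frame with made-up data: a slice inside the brackets selects rows, while a single name selects a column.

```python
import pandas as pd

df = pd.DataFrame(
    {"animal": ["cat", "dog", "snake"], "visits": [1, 3, 2]},
    index=["a", "b", "c"],
)

rows_by_label = df["a":"b"]  # slice of labels: rows a and b (closed interval)
rows_by_pos = df[0:2]        # slice of positions: rows 0 and 1 (half-open)
column = df["animal"]        # a single name: the column, returned as a Series

print(rows_by_label)
print(column)
```

Writing df.loc["a":"b"] and df.iloc[0:2] gives the same rows and makes the intent explicit.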

Handling missing values

In [104]: a = pd.Series([1, 2, 4, np.nan])
In [105]: a
Out[105]:
0    1.0
1    2.0
2    4.0
3    NaN
dtype: float64
In [106]: a.fillna(value='y') # fill in the missing values
Out[106]:
0    1
1    2
2    4
3    y
dtype: object
In [108]: a.dropna() # drop the missing values
Out[108]:
0    1.0
1    2.0
2    4.0
dtype: float64
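Filling with a string, as above, turns the Series into dtype object. A sketch of two fills that keep the data numeric (a constant and the mean of the non-missing values) follows; both choices are illustrations of mine, not recommendations from the notes.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 4, np.nan])

zero_filled = s.fillna(0)         # constant fill, dtype stays float64
mean_filled = s.fillna(s.mean())  # mean() skips NaN, so this is 7/3
dropped = s.dropna()              # removes the missing entry instead

print(zero_filled.dtype, mean_filled[3], len(dropped))
```

All three calls return new objects; s itself still contains the NaN.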

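Use merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False) to join DataFrames. By default merge joins on the columns the two DataFrames have in common, much like a natural join in a database, so when no key arguments are given the two frames must share at least one column name. The on parameter names the column(s) to join on. how='left' performs a left outer join, how='right' a right outer join, and how='outer' a full outer join; the default how='inner' does not keep dangling tuples (rows with no match on the other side).

In [117]: df1
Out[117]:
   x  y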
0  1  2
1  2  3
2  3  4
3  2  5
In [118]: df2
Out[118]:
   y  z
0  3  1
1  5  2
2  5  3
3  7  4
4  8  5
In [119]: pd.merge(df1, df2)
Out[119]:
   x  y  z
0  2  3  1
1  2  5  2
2  2  5  3
In [121]: pd.merge(df1, df2, on='y', how='left')
Out[121]:
   x  y    z
0  1  2  NaN
1  2  3  1.0
2  3  4  NaN
3  2  5  2.0
4  2  5  3.0
In [122]: pd.merge(df1, df2, how='right')
Out[122]:
     x  y  z
0  2.0  3  1
1  2.0  5  2
2  2.0  5  3
3  NaN  7  4
4  NaN  8  5
In [123]: pd.merge(df1, df2, how='outer')
Out[123]:
     x  y    z
0  1.0  2  NaN
1  2.0  3  1.0
2  3.0  4  NaN
3  2.0  5  2.0
4  2.0  5  3.0
5  NaN  7  4.0
6  NaN  8  5.0
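The transcripts above can be reproduced with the sketch below. It additionally sets indicator=True, which adds a _merge column recording whether each row came from the left frame, the right frame, or both.

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2, 3, 2], "y": [2, 3, 4, 5]})
df2 = pd.DataFrame({"y": [3, 5, 5, 7, 8], "z": [1, 2, 3, 4, 5]})

inner = pd.merge(df1, df2, on="y")  # only y values present on both sides
outer = pd.merge(df1, df2, on="y", how="outer", indicator=True)

print(inner)
print(outer)
```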

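Use xx.to_csv() to write a CSV file; with index=False the index is not written. Use pandas.read_csv() to read a CSV file.
Use xx.copy() to get a copy of the original data.
Use xx.to_excel() to write an Excel file; with index=False the index is not written. Use pandas.read_excel() to read an Excel file. See https://blog.csdn.net/tongxinzhazha/article/details/78796952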
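A minimal CSV round trip, sketched with an in-memory io.StringIO buffer in place of a real file so that nothing is written to disk; with a file you would pass a path instead. The Excel functions follow the same pattern but need an extra engine package installed, so they are left out here.

```python
import io

import pandas as pd

df = pd.DataFrame({"x": [1, 2], "y": ["a", "b"]})

buf = io.StringIO()          # stands in for a file on disk
df.to_csv(buf, index=False)  # index=False: no extra index column is written
buf.seek(0)                  # rewind before reading back
restored = pd.read_csv(buf)

print(buf.getvalue())
print(restored)
```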

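DataFrame.resample(rule, axis=0, closed=None, label=None, convention='start', kind=None, loffset=None, base=None, on=None, level=None, origin='start_day', offset=None) resamples time series data. It is a convenient way to convert a regular time series to a different frequency: the original samples are grouped into time buckets and each bucket is then aggregated.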

In [166]: ss
Out[166]:
2020-01-01 00:00:00    2
2020-01-01 00:01:00    1
2020-01-01 00:02:00    9
2020-01-01 00:03:00    5
2020-01-01 00:04:00    1
2020-01-01 00:05:00    7
2020-01-01 00:06:00    4
2020-01-01 00:07:00    0
2020-01-01 00:08:00    4
2020-01-01 00:09:00    5
Freq: T, dtype: int32
In [168]: ss.resample('3t').sum() # put every three minutes of data into one bucket and sum each bucket to produce the new series
Out[168]:
2020-01-01 00:00:00    12
2020-01-01 00:03:00    13
2020-01-01 00:06:00     8
2020-01-01 00:09:00     5
Freq: 3T, dtype: int32
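# Each bucket is labeled with its left edge by default; label='right' labels it with the right edge instead. The value at that right-edge timestamp is still not part of the bucket, though. To include it, close the bucket on the right with closed='right'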
In [169]: ss.resample('3t', label='right').sum()
Out[169]:
2020-01-01 00:03:00    12
2020-01-01 00:06:00    13
2020-01-01 00:09:00     8
2020-01-01 00:12:00     5
Freq: 3T, dtype: int32
In [171]: ss.resample('3t', label='right', closed='right').sum()
Out[171]:
2020-01-01 00:00:00     2
2020-01-01 00:03:00    15
2020-01-01 00:06:00    12
2020-01-01 00:09:00     9
Freq: 3T, dtype: int32
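The notes do not show how ss was created; the sketch below rebuilds it from the printed values and repeats two of the resampling calls. It uses the 'min' alias for minutes, which I expect to be accepted by both older and newer pandas, instead of 't'.

```python
import pandas as pd

idx = pd.date_range("2020-01-01", periods=10, freq="min")
ss = pd.Series([2, 1, 9, 5, 1, 7, 4, 0, 4, 5], index=idx)

# Default: buckets are [t, t + 3min) and are labeled with the left edge.
left = ss.resample("3min").sum()

# Buckets are (t - 3min, t] and are labeled with the right edge.
right = ss.resample("3min", label="right", closed="right").sum()

print(left)
print(right)
```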

Time zone conversion

s = pd.date_range('today', periods=1, freq='d') # the current local time, as a timezone-naive index
'''
DatetimeIndex(['2020-12-31 11:05:35.767090'], dtype='datetime64[ns]', freq='D')
'''
ts_utc = s.tz_localize('UTC') # attach the UTC time zone; the clock time is kept and simply labeled as UTC
'''
DatetimeIndex(['2020-12-31 11:05:35.767090+00:00'], dtype='datetime64[ns, UTC]', freq='D')
'''
ts_utc.tz_convert('Asia/Shanghai') # convert to Shanghai time; a timezone-naive index cannot be converted directly and must be localized first
'''
DatetimeIndex(['2020-12-31 19:05:35.767090+08:00'], dtype='datetime64[ns, Asia/Shanghai]', freq='D')
'''
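The same steps with a fixed timestamp in place of 'today', so the result is reproducible. The sketch also demonstrates the restriction mentioned above: calling tz_convert on a timezone-naive index raises a TypeError.

```python
import pandas as pd

naive = pd.date_range("2020-12-31 11:05:35", periods=1, freq="D")

utc = naive.tz_localize("UTC")              # label the naive time as UTC
shanghai = utc.tz_convert("Asia/Shanghai")  # same instant, shown at UTC+8

# A naive index has no zone to convert from, so this fails.
try:
    naive.tz_convert("Asia/Shanghai")
    convert_failed = False
except TypeError:
    convert_failed = True

print(shanghai[0], convert_failed)
```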

Timestamp represents a point in time, Period a span of time, and Timedelta a duration (the difference between two times). See https://blog.csdn.net/qq_15230053/article/details/82556958
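A brief sketch of how the three types relate, with made-up values: adding a Timedelta to a Timestamp gives another Timestamp, and a Period exposes the first and last instants it covers.

```python
import pandas as pd

ts = pd.Timestamp("2020-12-31 11:05:35")  # a point in time
period = pd.Period("2020-12", freq="M")   # the whole month of December 2020
delta = pd.Timedelta(days=1, hours=2)     # a duration

later = ts + delta                        # point + duration = point
inside = period.start_time <= ts <= period.end_time

print(later, period.start_time, inside)
```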

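loc[] indexes by label and iloc[] by integer position, while ix[] mixes the two. ix[] was deprecated and later removed from pandas (it is gone as of pandas 1.0), so current code should use loc[] or iloc[] instead.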

a
'''
   c  n  d  g
b  7  0  2  7
e  4  5  3  1
c  5  8  5  0
d  7  2  6  1
a  3  8  7  8
'''
a.loc['e':'d', 'n':'g']
'''
   n  d  g
e  5  3  1
c  8  5  0
d  2  6  1
'''
a.iloc[3:, :2]
'''
   c  n
d  7  2
a  3  8
'''
a.ix['b':'a', 1:]
'''
   n  d  g
b  0  2  7
e  5  3  1
c  8  5  0
d  2  6  1
a  8  7  8
'''
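Since ix[] is no longer available, here is a sketch of the same three selections using only loc[] and iloc[]. The frame is rebuilt from the printed values, and the mixed case translates the positional column slice into labels through df.columns; that translation is one possible replacement, not the only one.

```python
import pandas as pd

df = pd.DataFrame(
    {"c": [7, 4, 5, 7, 3], "n": [0, 5, 8, 2, 8],
     "d": [2, 3, 5, 6, 7], "g": [7, 1, 0, 1, 8]},
    index=["b", "e", "c", "d", "a"],
)

by_label = df.loc["e":"d", "n":"g"]  # label slices include both endpoints
by_pos = df.iloc[3:, :2]             # positional slices exclude the stop

# Mixed indexing without ix: label rows, positional columns via df.columns.
mixed = df.loc["b":"a", df.columns[1:]]

print(by_label, by_pos, mixed, sep="\n")
```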

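In pandas, boolean indexing selects whole tuples (rows), whereas in NumPy it selects a flat collection of elements.

a
'''
   c  n  d  g
b  7  0  2  7
e  4  5  3  1
c  5  8  5  0
d  7  2  6  1
a  3  8  7  8
'''
a.loc[a['d'] % 2 == 1, 'n'] = np.nan # for every row whose d value is odd, set its n component to NaN (one .loc call avoids chained assignment)
a
'''
   c    n  d  g
b  7  0.0  2  7
e  4  NaN  3  1
c  5  NaN  5  0
d  7  2.0  6  1
a  3  NaN  7  8
'''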
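The contrast between the two libraries can be seen in the sketch below. The frame reuses the values printed above, except that column n is created as float from the start; that is my choice, so that writing NaN into it does not require changing the column's dtype.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"c": [7, 4, 5, 7, 3], "n": [0.0, 5.0, 8.0, 2.0, 8.0],
     "d": [2, 3, 5, 6, 7], "g": [7, 1, 0, 1, 8]},
    index=["b", "e", "c", "d", "a"],
)

mask = df["d"] % 2 == 1  # one boolean per row
odd_rows = df[mask]      # pandas: whole rows where d is odd

arr = df[["c", "d", "g"]].to_numpy()
big = arr[arr > 5]       # NumPy: a flat 1-D array of matching elements

df.loc[mask, "n"] = np.nan  # row mask and column label in one .loc call

print(odd_rows, big, df, sep="\n")
```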

Use dropna() to drop rows or columns that contain NaN. axis=0 drops rows and axis=1 drops columns. how='any' drops a row (column) as soon as it contains a single NaN, while how='all' drops it only when every value in it is NaN.

a
'''
   c    n  d  g
b  7  0.0  2  7
e  4  NaN  3  1
c  5  NaN  5  0
d  7  2.0  6  1
a  3  NaN  7  8
'''
a.dropna(axis=0, how='any')
'''
   c    n  d  g
b  7  0.0  2  7
d  7  2.0  6  1
'''
a.dropna(axis=1, how='any')
'''
   c  d  g
b  7  2  7
e  4  3  1
c  5  5  0
d  7  6  1
a  3  7  8
'''
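The examples above only exercise how='any'. The sketch below uses a small made-up frame containing one row that is entirely NaN, to show how how='all' differs and how the two can be chained.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"c": [7.0, 4.0, np.nan], "n": [0.0, np.nan, np.nan], "d": [2.0, 3.0, np.nan]},
    index=["b", "e", "x"],
)

rows_any = df.dropna(axis=0, how="any")  # keeps only rows with no NaN at all
rows_all = df.dropna(axis=0, how="all")  # drops only the all-NaN row "x"

# Drop the empty row first, then every column that still contains a NaN.
cols_clean = df.dropna(axis=0, how="all").dropna(axis=1, how="any")

print(rows_any, rows_all, cols_clean, sep="\n")
```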
