Learning pandas: pandas Basics

Reflections

I. Window Objects

pandas has three kinds of window objects: the sliding window rolling, the expanding window expanding, and the exponentially weighted window ewm.
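Of the three window types, ewm is the only one these notes do not demonstrate. A minimal sketch, assuming a smoothing factor alpha=0.5 chosen only for illustration and the default adjust=True:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Exponentially weighted mean with alpha=0.5 (illustrative choice).
# With the default adjust=True, each value is a weighted average of all
# observations so far, with weights 1, 0.5, 0.25, ... going back in time.
ewm_mean = s.ewm(alpha=0.5).mean()

# The second value computed by hand: (2*1 + 1*0.5) / (1 + 0.5)
manual_second = (2 * 1 + 1 * 0.5) / (1 + 0.5)

print(ewm_mean)
print(manual_second)
```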
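1. rolling(): sliding window; the most commonly used parameter is window

import pandas as pd

ab = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
ab.rolling(window=2).sum()
Result: NaN 3 5 7 9 11 13 15 17 19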

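The window parameter acts like a moving slice: at each position, the window covers that element and the window - 1 elements before it. The first window - 1 positions do not have enough data, so their result is NaN.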
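The "moving slice" reading can be checked by rebuilding the rolling sum from explicit slices. This is only a sketch for illustration; the variable names are hypothetical and this is not how pandas computes the result internally.

```python
import pandas as pd

ab = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
window = 2

rolled = ab.rolling(window=window).sum()

# Rebuild the result by hand: position i covers ab[i-window+1 : i+1];
# positions with fewer than `window` values available are NaN.
manual = pd.Series(
    [
        ab.iloc[i - window + 1 : i + 1].sum() if i >= window - 1 else float("nan")
        for i in range(len(ab))
    ]
)

print(rolled.tolist())
print(manual.tolist())
```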

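Exercise: by default, a rolling window covers the current element and the elements before it. In some cases you need a window that covers the current element and the elements after it instead. For example, applying sum with such a window of size 2 to 1, 2, 3 should give 3, 5, NaN. How can this be implemented?

Answer:

c = pd.Series([1, 2, 3])
c.rolling(window=2).sum().shift(-1)  # NaN, 3, 5 shifted up by one position gives 3, 5, NaN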
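Another way to get the same result, shown here as a sketch rather than as the intended answer, is to reverse the Series, apply the ordinary rolling sum, and reverse back. Unlike shift(-1), which is tied to window=2, the reversal works for any window size.

```python
import pandas as pd

c = pd.Series([1, 2, 3])

# Approach from the answer: ordinary rolling sum, then shift up by window - 1
via_shift = c.rolling(window=2).sum().shift(-1)

# Alternative: reverse, roll, reverse back
via_reverse = c[::-1].rolling(window=2).sum()[::-1]

print(via_shift.tolist())
print(via_reverse.tolist())
```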

2. expanding(): cumulative window; it can reproduce cumsum, cummax, and cummin. Each is demonstrated below.
① cumsum

ab = pd.Series([2, 4, 6, 8, 10])
ab.expanding().sum()
Result: 2 6 12 20 30

② cummax

ab = pd.Series([5, 4, 3, 8, 10])
ab.expanding().max()
Result: 5 5 5 8 10

③ cummin

ab = pd.Series([5, 4, 3, 8, 10])
ab.expanding().min()
Result: 5 4 3 3 3
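The correspondence between expanding() and the cumulative methods can be checked directly. A minimal sketch; note that expanding() returns floats while cumsum, cummax, and cummin keep the integer dtype, so the comparison is on values only.

```python
import pandas as pd

ab = pd.Series([5, 4, 3, 8, 10])

# Element-wise comparison of values (5.0 == 5 is True)
same_sum = (ab.expanding().sum() == ab.cumsum()).all()
same_max = (ab.expanding().max() == ab.cummax()).all()
same_min = (ab.expanding().min() == ab.cummin()).all()

print(same_sum, same_max, same_min)
```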

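II. Reading Data

1. Reading files
Python can read many file formats, such as XLSX, JSON, CSV, TXT, HTML, JPEG, and ZIP. For more detail, see https://blog.csdn.net/kevinelstri/article/details/61921812
Note: a file path consists of the directory path, the file name, and the file extension. On Windows, either use a raw string, as in r'E:\Data analysis\pandaslearning\joyful-pandas-master\data\my_excel_saved.xlsx', or double the backslashes, as in 'E:\\Data analysis\\pandaslearning\\joyful-pandas-master\\data\\my_excel_saved.xlsx'.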
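① Reading XLSX files
General form: pd.read_excel(io, sheet_name=0, header=0, index_col=None, usecols=None, nrows=None, parse_dates=False, date_parser=None)
The parameters changed most often are io and parse_dates; the others are usually left at their defaults.

Parameter	Description
io	File path
sheet_name	Which sheet of the workbook to read; default 0, the first sheet
header	Row of the Excel sheet used for the column names; default 0
index_col	Column of the Excel sheet used as the index; default None
parse_cols	Old alias of usecols that current pandas versions no longer accept; to assign new column names, use names instead
usecols	Columns to read; default None, meaning all columns
nrows	Number of rows to read; default None, meaning all rows
parse_dates	Columns to convert to datetimes; default False. pandas stores datetimes as timestamps, but dates read from Excel sometimes arrive as serial numbers and must be converted, for example with apply(lambda x: datetime.datetime.strptime('1899-12-30', '%Y-%m-%d') + datetime.timedelta(days=x))
date_parser	Function used to parse the date columns; default None (deprecated since pandas 2.0 in favor of date_format)
encoding	Not a documented read_excel parameter in current pandas versions: an .xlsx file records its own encoding, so Chinese text needs no encoding argument here. Encodings such as GBK matter for text files read with read_csv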
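
The serial-number conversion mentioned under parse_dates can be sketched as follows. The sample serial numbers are made up for illustration, and the pd.to_datetime call with origin is an alternative that the notes themselves do not use.

```python
import datetime

import pandas as pd

# Hypothetical Excel serial dates (days since 1899-12-30)
serials = pd.Series([44197, 44198, 44562])

# Approach from the notes: add the day count to the Excel epoch
excel_epoch = datetime.datetime.strptime("1899-12-30", "%Y-%m-%d")
via_apply = serials.apply(lambda x: excel_epoch + datetime.timedelta(days=int(x)))

# Vectorised alternative: interpret the numbers as days from the same origin
via_to_datetime = pd.to_datetime(serials, unit="D", origin=pd.Timestamp("1899-12-30"))

print(via_apply)
print(via_to_datetime)
```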

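② Reading CSV and TXT files
CSV and TXT files are both text files, and both are read with pd.read_csv().
General form: pd.read_csv(filepath_or_buffer, sep=',', names=None, index_col=None, usecols=None, nrows=None, parse_dates=False, date_parser=None, encoding=None)
The parameters changed most often are filepath_or_buffer, encoding, names, sep, and parse_dates; the others are usually left at their defaults.

Parameter	Description
filepath_or_buffer	File path
sep	Column delimiter; default ','. CSV files normally use ','; for a TXT file, check which delimiter it uses first
names	Column names to assign; default None, in which case the first row is used as the column names
index_col	Column of the file used as the index; default None
usecols	Columns to read; default None, meaning all columns
nrows	Number of rows to read; default None, meaning all rows
parse_dates	Columns to convert to datetimes; default False. pandas stores datetimes as timestamps, but dates read from a file sometimes arrive as serial numbers and must be converted, for example with apply(lambda x: datetime.datetime.strptime('1899-12-30', '%Y-%m-%d') + datetime.timedelta(days=x))
date_parser	Function used to parse the date columns; default None (deprecated since pandas 2.0 in favor of date_format)
encoding	File encoding; default None, which means UTF-8. If the file contains Chinese characters and is stored in another encoding, either pass that encoding (for example 'gbk') or re-save the file as UTF-8; otherwise the Chinese text comes out garbled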
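
The parameters in the table can be tried without any file on disk by reading from in-memory buffers. A minimal sketch; the sample rows, the '|' delimiter, and the column names are all made up for illustration.

```python
import io

import pandas as pd

# A delimiter-separated text "file" without a header row (hypothetical data)
raw_txt = (
    "2021-01-01|Bulbasaur|45\n"
    "2021-01-02|Charmander|39\n"
    "2021-01-03|Squirtle|44\n"
)
df_txt = pd.read_csv(
    io.StringIO(raw_txt),
    sep="|",                        # delimiter used by this particular file
    names=["date", "name", "hp"],   # no header row, so supply column names
    parse_dates=["date"],           # convert this column to datetimes
    nrows=2,                        # read only the first two rows
)

# A GBK-encoded CSV with Chinese text: pass the file's actual encoding
raw_gbk = "名称,体力\n妙蛙种子,45\n".encode("gbk")
df_gbk = pd.read_csv(io.BytesIO(raw_gbk), encoding="gbk")

print(df_txt)
print(df_gbk)
```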

2. Writing files
Writing a file to the local disk is straightforward; the following calls can serve as a template:

result.to_csv('E:\\Data analysis\\pandaslearning\\joyful-pandas-master\\data\\新文本数据.txt', encoding='utf-8', index=False)
result.to_csv('E:\\Data analysis\\pandaslearning\\joyful-pandas-master\\data\\新CSV数据.csv', encoding='utf-8', index=False)
# recent pandas versions no longer accept an encoding argument in to_excel
result.to_excel('E:\\Data analysis\\pandaslearning\\joyful-pandas-master\\data\\新Excel文件.xlsx', index=False)
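
A write can be verified by reading the file back. A minimal sketch, assuming a small made-up DataFrame named result and a temporary directory in place of the local paths used in these notes:

```python
import os
import tempfile

import pandas as pd

# Hypothetical data standing in for `result`
result = pd.DataFrame({"name": ["妙蛙种子", "小火龙"], "hp": [45, 39]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "new_data.csv")
    # index=False keeps the row index out of the file
    result.to_csv(csv_path, encoding="utf-8", index=False)
    # Read back with the same encoding that was used for writing
    reloaded = pd.read_csv(csv_path, encoding="utf-8")

print(reloaded)
```

Using the same encoding for writing and reading is what keeps the Chinese text intact.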

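3. Reading data from a database
How do you read data from a database?

For reading MySQL data, see https://blog.csdn.net/wcg541/article/details/100749975

For Oracle, a connection can be opened with cx_Oracle:

import cx_Oracle
# username, password, and database_name are placeholders for your own values
orl = cx_Oracle.connect(username, password, database_name, encoding='UTF-8')
orl_cur = orl.cursor()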
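
The notes stop once the cursor is created. As a sketch of the remaining step, loading a query result into a DataFrame, the example below uses Python's built-in sqlite3 with an in-memory database as a stand-in, since an Oracle or MySQL server cannot be assumed here. The table and its rows are made up for illustration.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database as a stand-in for a real database server
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pokemon (name TEXT, hp INTEGER)")
conn.executemany(
    "INSERT INTO pokemon VALUES (?, ?)",
    [("Bulbasaur", 45), ("Charmander", 39), ("Squirtle", 44)],
)
conn.commit()

# Run a query and load the result directly into a DataFrame
df_db = pd.read_sql_query("SELECT name, hp FROM pokemon WHERE hp > 40", conn)
conn.close()

print(df_db)
```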

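III. Exercises

Pokémon dataset
Given a Pokémon dataset, with the following background:

# is the National Pokédex number; rows that share the same number are different forms of the same Pokémon
Each Pokémon has either one type or two; for single-type Pokémon, Type 2 is missing
Total, HP, Attack, Defense, Sp. Atk, Sp. Def, Speed are the base stat total, hit points, physical attack, defense, special attack, special defense, and speed; the base stat total is the sum of the other six

Questions:

1. Sum HP, Attack, Defense, Sp. Atk, Sp. Def, Speed and verify that the result equals Total.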

import numpy as np
import pandas as pd

# Read the data
df = pd.read_csv('E:\\Data analysis\\pandaslearning\\joyful-pandas-master\\data\\pokemon.csv')
df.head()
# 1. Sum HP, Attack, Defense, Sp. Atk, Sp. Def, Speed and verify that the result equals Total.
# Method 1:
df['Total_1'] = df[['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']].sum(axis=1)
df['is_total'] = df['Total'] - df['Total_1']
(df['is_total'] != 0).sum()  # a nonzero count means some Total values are inconsistent
# Method 2: in one step
(df[['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']].sum(axis=1) != df['Total']).sum()
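
The check can be tried on a small made-up table in which one Total is deliberately wrong. The sketch also illustrates an operator-precedence detail worth keeping in mind when writing such comparisons: `~s == 0` is parsed as `(~s) == 0`, a bitwise NOT on the integers followed by a comparison, which is not the same as `s != 0`.

```python
import pandas as pd

stat_cols = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]

# Hypothetical rows; the third Total (999) is wrong on purpose
toy = pd.DataFrame({
    "Total": [318, 405, 999],
    "HP": [45, 60, 80],
    "Attack": [49, 62, 82],
    "Defense": [49, 63, 83],
    "Sp. Atk": [65, 80, 100],
    "Sp. Def": [65, 80, 100],
    "Speed": [45, 60, 80],
})

diff = toy["Total"] - toy[stat_cols].sum(axis=1)

# Number of rows whose Total does not match the sum of the six stats
mismatches = (diff != 0).sum()

# Precedence pitfall: bitwise NOT is applied to the integers first,
# so this counts rows where diff == -1 and misses the real mismatch
pitfall_count = (~diff == 0).sum()

print(mismatches, pitfall_count)
```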

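2. Keeping only the first record for each duplicated #, answer the following:

a. Find the number of distinct first types and the three most common ones
b. Find the combinations of first and second type
c. Find the type combinations that have not appeared yet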

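# 2. Keeping only the first record for each duplicated #, answer the following:
# a. Number of distinct first types and the three most common ones
type1 = df.drop_duplicates('#', keep='first')
type1['Type 1'].nunique()  # number of distinct first types
type1['Type 1'].value_counts()[:3].index  # the three most common first types
# or
type1['Type 1'].value_counts().index[:3]
# b. Combinations of first and second type (on the deduplicated data)
type1_2 = type1[['Type 1', 'Type 2']].drop_duplicates()
# c. Type combinations that have not appeared yet
types = type1['Type 1'].unique().tolist()
# all candidate combinations; '' stands for "no second type", and a type is never paired with itself
type_f = [i + ' ' + j for i in types for j in types + [''] if i != j]
# combinations actually present in the deduplicated data
type_p = [i + ' ' + j for i, j in zip(type1['Type 1'], type1['Type 2'].fillna(''))]
result = set(type_f).difference(set(type_p))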
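
The set-difference idea in part c can be followed on a tiny made-up table. This sketch uses tuples instead of joined strings and, like the solution above, draws candidate second types only from the values seen in Type 1, which is a simplifying assumption.

```python
import pandas as pd

# Hypothetical rows; '#' 2 appears twice, so its second row is dropped
toy = pd.DataFrame({
    "#": [1, 2, 2, 3, 4],
    "Type 1": ["Grass", "Fire", "Fire", "Water", "Fire"],
    "Type 2": ["Poison", None, "Dragon", None, "Flying"],
})
first = toy.drop_duplicates("#", keep="first")

# b. number of distinct (Type 1, Type 2) combinations
n_combos = len(first[["Type 1", "Type 2"]].drop_duplicates())

# c. candidate combinations minus the ones that occur
types = first["Type 1"].unique().tolist()
all_combos = {(a, b) for a in types for b in types + [""] if a != b}
seen = set(zip(first["Type 1"], first["Type 2"].fillna("")))
unseen = all_combos - seen

print(n_combos)
print(sorted(unseen))
```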

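3. Construct Series according to the following requirements:

a. Take Attack; replace values above 120 with high, values below 50 with low, and all others with mid
b. Take the first type and convert all letters to upper case, once with replace and once with apply
c. For each Pokémon, compute the deviation of its six stats, defined as the largest absolute distance of any stat from the median of the six; add it to df and sort in descending order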

# 3. Construct Series according to the following requirements:
# a. Attack: above 120 -> high, below 50 -> low, otherwise mid
type1['Attack'].apply(lambda x: 'high' if x > 120 else 'low' if x < 50 else 'mid')
# b. First type in upper case, once with replace and once with apply
# Method 1: replace, mapping each distinct type to its upper-case form
df['Type 1'].replace({x: x.upper() for x in df['Type 1'].unique()})
# Method 2: apply, the direct way
type1['Type 1'].apply(lambda x: x.upper())
# c. Deviation of the six stats: largest absolute distance from the row median,
#    added to df and sorted in descending order
df['Deviation'] = df[['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']].apply(lambda x: np.max((x - x.median()).abs()), axis=1)
df.sort_values('Deviation', ascending=False)
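
The three parts can also be written in vectorised form. The sketch below runs on a small made-up table; np.select, .str.upper(), and the row-wise sub/abs/max chain are alternatives to the apply-based solutions above, not the method the exercise asks for.

```python
import numpy as np
import pandas as pd

stat_cols = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]

# Hypothetical rows for illustration
toy = pd.DataFrame({
    "Type 1": ["Grass", "Fire", "Water"],
    "HP": [45, 39, 44],
    "Attack": [49, 130, 120],
    "Defense": [49, 43, 65],
    "Sp. Atk": [65, 60, 50],
    "Sp. Def": [65, 50, 64],
    "Speed": [45, 65, 43],
})

# a. above 120 -> high, below 50 -> low, otherwise mid (so 120 and 50 are mid)
attack = toy["Attack"]
level = pd.Series(
    np.select([attack > 120, attack < 50], ["high", "low"], default="mid"),
    index=attack.index,
)

# b. upper-case type names via the string accessor
upper = toy["Type 1"].str.upper()

# c. largest absolute distance from the row median, without apply
stats = toy[stat_cols]
deviation_vec = stats.sub(stats.median(axis=1), axis=0).abs().max(axis=1)
deviation_apply = stats.apply(lambda row: (row - row.median()).abs().max(), axis=1)

toy["Deviation"] = deviation_vec
print(toy.sort_values("Deviation", ascending=False))
```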

Other references:

pandas basics: https://datawhalechina.github.io/joyful-pandas/build/html/%E7%9B%AE%E5%BD%95/ch2.html#id1