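Common DataFrame Operations: Pandas Basics and Frequently Used Methods
Pandas is a data processing and analysis library built on NumPy. Its two most important types are Series and DataFrame. A Series resembles a one-dimensional NumPy array, and a DataFrame resembles a two-dimensional array whose columns can hold different data types. Both types support custom labels, and values can be retrieved either by label (index) or by position.
I. Series Basics and Common Methods
1. Creating a Series
The pandas Series constructor converts a list or a dictionary directly into a Series.
A Series accepts custom labels or falls back to default labels, and data can later be retrieved by label or by position. In the examples below, the printed Series appears to have two columns: the first is the label and the second is the value.
#%% Create a Series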
import pandas as pd
# No labels given, so the default integer labels are used
first_series = pd.Series([1,"str",True,4])
print(first_series)
print("-"*20)
# Specify custom labels (index)
first_series = pd.Series([1,"str",True,4], index = list('abcd'))
print(first_series)
# Output
# 0 1
# 1 str
# 2 True
# 3 4
# dtype: object
# --------------------
# a 1
# b str
# c True
# d 4
# dtype: object
print("-"*20)
first_series = pd.Series({"name":"zhangsan", "age":"18", "height":"178", "hobby":"code"})
print(first_series)
# Output
# --------------------
# name zhangsan
# age 18
# height 178
# hobby code
# dtype: object
# Reassign the index labels
first_series.index = list("abcd")
print(first_series)
# Output
# a zhangsan
# b 18
# c 178
# d code
# dtype: object
2. Retrieving data
first_series = pd.Series([100,200,300,400], index = list('abcd'))
print(first_series)
# Output
# a 100
# b 200
# c 300
# d 400
# dtype: int64
# 200 has the label 'b' in the Series
print("Value by label: ")
print(first_series['b'])
print(first_series['b':'d'])  # a label slice includes its end label
# Output
# Value by label:
# 200
# b 200
# c 300
# d 400
# dtype: int64
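# 200 is at position 1 in the Series
print("Value by position: ")
print(first_series.iloc[1])  # use iloc for positions; integer keys in [] are deprecated as positions in recent pandas
print(first_series.iloc[1:6])  # slicing past the end does not raise an error
# Output
# Value by position: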
# 200
# b 200
# c 300
# d 400
# dtype: int64
# Select values by condition
print("Value by condition: ")
print(first_series[first_series>=200])
# Output
# Value by condition:
# b 200
# c 300
# d 400
# dtype: int64
# Extract the labels of the Series
result = first_series.index
for value in result:
    print(value)
print(first_series.index.tolist())
# Output
# a
# b
# c
# d
# ['a', 'b', 'c', 'd']
# Extract the values of the Series
print(first_series.values.tolist())
# Output
# [100, 200, 300, 400]
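The difference between label access and position access is easiest to see when the labels are themselves integers. The following is a minimal sketch; the Series `scores` and its labels are made up for illustration and do not come from the examples above.

```python
import pandas as pd

# Hypothetical Series whose integer labels do not match the positions.
scores = pd.Series([100, 200, 300], index=[10, 20, 30])

by_label = scores.loc[20]      # label 20 -> 200
by_position = scores.iloc[0]   # first element -> 100

# A label slice includes its end label; a positional slice excludes its end.
label_slice = scores.loc[10:20].tolist()     # [100, 200]
position_slice = scores.iloc[0:1].tolist()   # [100]

print(by_label, by_position, label_slice, position_slice)
```

Writing `.loc` or `.iloc` explicitly avoids having to guess how a bare `[]` will interpret an integer key.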
3. Adding elements to a Series with pd.concat
first_series = pd.Series([100,200,300,400], index = list('abcd'))
print(first_series)
second_series = pd.Series([500,600], index = list('ef'))
print(second_series)
# Series.append was removed in pandas 2.0; pd.concat gives the same result
third_series = pd.concat([first_series, second_series])
print(third_series)
# Output
# a 100
# b 200
# c 300
# d 400
# dtype: int64
# e 500
# f 600
# dtype: int64
# a 100
# b 200
# c 300
# d 400
# e 500
# f 600
# dtype: int64
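When two Series are combined, the original labels are kept, so a label that occurs in both inputs ends up duplicated. The sketch below uses `pd.concat` and two made-up Series, `left` and `right`, to show this and one way to renumber the result.

```python
import pandas as pd

# Hypothetical inputs; the label "b" appears in both.
left = pd.Series([100, 200], index=list("ab"))
right = pd.Series([300, 400], index=list("bc"))

# Labels are kept as they are, so "b" appears twice.
kept = pd.concat([left, right])

# ignore_index=True discards the labels and renumbers from 0.
renumbered = pd.concat([left, right], ignore_index=True)

print(kept.index.tolist())        # ['a', 'b', 'b', 'c']
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```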
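4. Deleting elements from a Series
# Delete elements
first_series = pd.Series([100,200,300,400,500,600], index = list('abcdef'))
print(first_series)
# Delete by label with del
del first_series['b']
# With inplace=True, drop modifies the original Series and returns None
first_series.drop("d", inplace=True)
print(first_series)
# Without inplace=True, drop returns a new Series and leaves the original unchanged,
# so the result must be assigned to a variable.
# Here the result is discarded, so the original data does not change
first_series.drop("f")
print(first_series)
# Assign the result back to the original name to make the change take effect
first_series = first_series.drop("f")
print(first_series)
# Delete by position
# Passing a position directly raises a KeyError, because drop expects labels
# first_series.drop(1)
first_series = first_series.drop(first_series.index[1])
print(first_series)
# Output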
# a 100
# b 200
# c 300
# d 400
# e 500
# f 600
# dtype: int64
#
# a 100
# c 300
# e 500
# f 600
# dtype: int64
#
# a 100
# c 300
# e 500
# f 600
# dtype: int64
#
# a 100
# c 300
# e 500
# dtype: int64
#
# a 100
# e 500
# dtype: int64
5. Modifying elements of a Series
# Modify elements
first_series = pd.Series([100,200,300,400], index = list('abcd'))
print(first_series)
first_series.iloc[1] = 2000  # assign by position
print(first_series)
first_series["a"] = 1000  # assign by label
print(first_series)
# Output
# a 100
# b 200
# c 300
# d 400
# dtype: int64
# a 100
# b 2000
# c 300
# d 400
# dtype: int64
# a 1000
# b 2000
# c 300
# d 400
# dtype: int64
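6. Arithmetic on Series
When two Series are combined arithmetically, values are aligned by label; a label present in only one of them yields NaN.
# Only values whose labels match in both Series are computed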
first_series = pd.Series([10, 20, 30, 40], index = list('abcd'))
second_series = pd.Series([300, 400, 500, 600], index = list('cdef'))
print(first_series)
print(second_series)
third_series = second_series - first_series
print(third_series)
third_series = second_series / first_series
print(third_series)
# Output
# a 10
# b 20
# c 30
# d 40
# dtype: int64
# c 300
# d 400
# e 500
# f 600
# dtype: int64
# a NaN
# b NaN
# c 270.0
# d 360.0
# e NaN
# f NaN
# dtype: float64
# a NaN
# b NaN
# c 10.0
# d 10.0
# e NaN
# f NaN
# dtype: float64
# Multiplication by a scalar
third_series = first_series * 2
print(third_series)
# Output
# a 20
# b 40
# c 60
# d 80
# dtype: int64
# NumPy functions can also be applied directly to a Series
import numpy as np
third_series = np.square(first_series)
print(third_series)
# Output
# a 100
# b 400
# c 900
# d 1600
# dtype: int64
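Label alignment produces NaN wherever a label exists in only one Series. If a missing value should count as 0 instead, the method form of the operator accepts a `fill_value`. A minimal sketch, reusing the two Series from this section under the shorter names `first` and `second`:

```python
import pandas as pd

first = pd.Series([10, 20, 30, 40], index=list("abcd"))
second = pd.Series([300, 400, 500, 600], index=list("cdef"))

# Plain + leaves NaN for labels found in only one Series.
plain = first + second

# add(..., fill_value=0) treats a missing label as 0 before adding.
filled = first.add(second, fill_value=0)

# Alternatively, keep only the labels present in both.
shared = plain.dropna()

print(filled)
print(shared.index.tolist())  # ['c', 'd']
```

The methods `sub`, `mul` and `div` accept `fill_value` in the same way.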
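II. DataFrame Basics and Common Functions
A DataFrame is a two-dimensional array. In addition to custom row labels, as in a Series, it also supports custom column labels. It resembles the data frame in R and is one of the most commonly used data structures in data analysis and modeling.
1. Creating a DataFrame
# DataFrame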
first_df = pd.DataFrame(np.random.randint(0,10,(5,5)))
print(first_df)
print('-'*20)
# Custom row and column labels
first_df = pd.DataFrame(np.random.randint(0,10,(5,5)), index=list('abcde'), columns=list('abcde'))
print(first_df)
print('-'*20)
# Convert a dictionary to a DataFrame
dict_test = {"name":["zhangsan","lisi","wangwu"],"score":[80,90,95],"gender":["F","M","F"]}
first_df = pd.DataFrame(dict_test)
print(first_df)
print('-'*20)
# If the dictionary values are Series, the DataFrame is assembled by aligning their row labels, so NaN appears wherever a label is missing
dict_test = {"name":pd.Series(["zhangsan","lisi","wangwu"],index=['a','b','c']),"score":pd.Series([80,90,95],index=['b','c','d']),"gender":pd.Series(["F","M","F"],index=['c','d','e'])}
first_df = pd.DataFrame(dict_test)
print(first_df)
# Output
# 0 1 2 3 4
# 0 4 0 3 1 1
# 1 8 1 4 1 7
# 2 2 5 2 1 6
# 3 9 3 1 9 8
# 4 7 8 0 0 4
# --------------------
# a b c d e
# a 9 6 4 5 7
# b 6 8 3 5 1
# c 1 2 3 0 7
# d 3 7 5 7 9
# e 0 0 8 8 8
# --------------------
# name score gender
# 0 zhangsan 80 F
# 1 lisi 90 M
# 2 wangwu 95 F
# --------------------
# name score gender
# a zhangsan NaN NaN
# b lisi 80.0 NaN
# c wangwu 90.0 F
# d NaN 95.0 M
# e NaN NaN F
2. Common attributes
# Common DataFrame attributes
dict_test = {"name":["zhangsan","lisi","wangwu","songliu"],"score":[80,90,95,75],"gender":["F","M","F","M"]}
first_df = pd.DataFrame(dict_test)
print(first_df)
print('-'*20)
print(first_df.shape)  # number of rows and columns
print('-'*20)
print(first_df.index.tolist())  # row labels
print('-'*20)
print(first_df.columns.tolist())  # column labels
print('-'*20)
print(first_df.dtypes)  # data type of each column
print('-'*20)
print(first_df.values)  # the data as an ndarray
print('-'*20)
print(first_df.info())  # basic information; info() prints directly and returns None
print('-'*20)
print(first_df.head(2))  # first two rows
print('-'*20)
print(first_df.tail(2))  # last two rows
# Output
# name score gender
# 0 zhangsan 80 F
# 1 lisi 90 M
# 2 wangwu 95 F
# 3 songliu 75 M
# --------------------
# (4, 3)
# --------------------
# [0, 1, 2, 3]
# --------------------
# ['name', 'score', 'gender']
# --------------------
# name object
# score int64
# gender object
# dtype: object
# --------------------
# [['zhangsan' 80 'F']
# ['lisi' 90 'M']
# ['wangwu' 95 'F']
# ['songliu' 75 'M']]
# --------------------
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 4 entries, 0 to 3
# Data columns (total 3 columns):
# name 4 non-null object
# score 4 non-null int64
# gender 4 non-null object
# dtypes: int64(1), object(2)
# memory usage: 176.0+ bytes
# None
# --------------------
# name score gender
# 0 zhangsan 80 F
# 1 lisi 90 M
# --------------------
# name score gender
# 2 wangwu 95 F
# 3 songliu 75 M
3. Selecting values and slicing a DataFrame by index
(1) Selecting with [] directly
dict_test = {"name":["zhangsan","lisi","wangwu","songliu"],"score":[80,90,95,75],"gender":["F","M","F","M"]}
first_df = pd.DataFrame(dict_test)
print(first_df)
print('#1','-'*30)
#2 Select columns
print(first_df["name"])
print('#2','-'*30)
print(first_df[["name","score"]])
print('#2','-'*30)
#3 Select rows
print(first_df[1:3])
print('#3','-'*30)
#4 Select some columns of some rows
print(first_df[1:3][["name","score"]])
print('#4','-'*30)
# Output
# name score gender
# 0 zhangsan 80 F
# 1 lisi 90 M
# 2 wangwu 95 F
# 3 songliu 75 M
# #1 ------------------------------
# 0 zhangsan
# 1 lisi
# 2 wangwu
# 3 songliu
# Name: name, dtype: object
# #2 ------------------------------
# name score
# 0 zhangsan 80
# 1 lisi 90
# 2 wangwu 95
# 3 songliu 75
# #2 ------------------------------
# name score gender
# 1 lisi 90 M
# 2 wangwu 95 F
# #3 ------------------------------
# name score
# 1 lisi 90
# 2 wangwu 95
# #4 ------------------------------
(2) Selecting by label with loc[]
#5 loc selects by label, iloc selects by position
print(first_df.loc[:,'name'])
print('#5','-'*30)
#6 The default row labels 0, 1, 2, 3 are integers, and a loc slice includes its end label, so 1:3 returns rows 1, 2 and 3
print(first_df.loc[1:3,['name','score']])
print('#6','-'*30)
#7 Once the row labels are changed to strings, the integer slice above raises an error
first_df.index = ['0','1','2','3']
print(first_df)
print('#7','-'*30)
# This would raise an error: print(first_df.loc[1:3,['name','score']])
print(first_df.loc[['1','2'],['name','score']])
print('#7','-'*30)
# Output
# 0 zhangsan
# 1 lisi
# 2 wangwu
# 3 songliu
# Name: name, dtype: object
# #5 ------------------------------
# name score
# 1 lisi 90
# 2 wangwu 95
# 3 songliu 75
# #6 ------------------------------
# name score gender
# 0 zhangsan 80 F
# 1 lisi 90 M
# 2 wangwu 95 F
# 3 songliu 75 M
# #7 ------------------------------
# name score
# 1 lisi 90
# 2 wangwu 95
# #7 ------------------------------
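(3) Selecting by position with iloc[]
#8 Select data by position with iloc
print(first_df.iloc[1])  # the second row; a single integer always refers to a row
print('#8-1','-'*30)
print(first_df.iloc[:,:])  # all values
print('#8-2','-'*30)
print(first_df.iloc[0:3,1:3])  # row positions 0, 1, 2 and column positions 1, 2, i.e. the second and third columns of the first three rows
print('#8-3','-'*30)
# Output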
# name lisi
# score 90
# gender M
# Name: 1, dtype: object
# #8-1 ------------------------------
# name score gender
# 0 zhangsan 80 F
# 1 lisi 90 M
# 2 wangwu 95 F
# 3 songliu 75 M
# #8-2 ------------------------------
# score gender
# 0 80 F
# 1 90 M
# 2 95 F
# #8-3 ------------------------------
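Label and position selection are often combined with a boolean condition on the rows. The sketch below is one way to do that with `loc`, using the same sample data under the name `df`; the variable names `mask`, `selected`, `cell_by_label` and `cell_by_position` are introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["zhangsan", "lisi", "wangwu", "songliu"],
    "score": [80, 90, 95, 75],
    "gender": ["F", "M", "F", "M"],
})

# Rows where score >= 80 and gender is "F".
# Each condition needs its own parentheses because & binds tighter than >=.
mask = (df["score"] >= 80) & (df["gender"] == "F")
selected = df.loc[mask, ["name", "score"]]
print(selected)

# at and iat read a single cell by label and by position respectively.
cell_by_label = df.at[1, "name"]
cell_by_position = df.iat[1, 0]
print(cell_by_label, cell_by_position)  # lisi lisi
```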
4. Ways to modify the index
#1
dict_test = {"name":["zhangsan","lisi","wangwu","songliu"],"score":[80,90,95,75],"gender":["F","M","F","M"]}
first_df = pd.DataFrame(dict_test)
print(first_df)
print('#1','-'*30)
# Assign new column labels directly
first_df.columns = ["Name","Score","Gender"]
print(first_df)
print('#2','-'*30)
# Rename row labels 0 and 1 to 'a' and 'b'. As before, inplace=True modifies the original data instead of returning a new object.
# Column labels can be renamed the same way through the columns parameter
first_df.rename(index={0:"a",1:"b"},inplace=True)
print(first_df)
print('#3','-'*30)
# Rename with a custom function; this one upper-cases the label and appends '*'.
# Passing the function to the columns and index parameters applies it to every label
def change_col(col):
    new_col = str(col).upper() + '*'
    return new_col
first_df.rename(columns = change_col, index = change_col, inplace=True)
print(first_df)
print('#4','-'*30)
# Use a column as the index. drop=False keeps that column in the data (by default it is removed); inplace=True again modifies the original data
first_df.set_index('NAME*',drop=False,inplace=True)
print(first_df)
print('#5','-'*30)
# Output
# name score gender
# 0 zhangsan 80 F
# 1 lisi 90 M
# 2 wangwu 95 F
# 3 songliu 75 M
# #1 ------------------------------
# Name Score Gender
# 0 zhangsan 80 F
# 1 lisi 90 M
# 2 wangwu 95 F
# 3 songliu 75 M
# #2 ------------------------------
# Name Score Gender
# a zhangsan 80 F
# b lisi 90 M
# 2 wangwu 95 F
# 3 songliu 75 M
# #3 ------------------------------
# NAME* SCORE* GENDER*
# A* zhangsan 80 F
# B* lisi 90 M
# 2* wangwu 95 F
# 3* songliu 75 M
# #4 ------------------------------
# NAME* SCORE* GENDER*
# NAME*
# zhangsan zhangsan 80 F
# lisi lisi 90 M
# wangwu wangwu 95 F
# songliu songliu 75 M
# #5 ------------------------------
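Setting a column as the index can be undone with `reset_index`, which moves the index back into an ordinary column. A minimal sketch with a small made-up frame `df`:

```python
import pandas as pd

df = pd.DataFrame({"name": ["zhangsan", "lisi", "wangwu"], "score": [80, 90, 95]})

# With the default drop=True, "name" becomes the row index and leaves the columns.
indexed = df.set_index("name")

# reset_index turns the index back into a regular column and restores 0, 1, 2.
restored = indexed.reset_index()

print(indexed.loc["lisi", "score"])  # 90
print(restored.columns.tolist())     # ['name', 'score']
```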
5. Adding and deleting data in a DataFrame
dict_test = {"name":["zhangsan","lisi","wangwu","songliu"],"score":[80,90,95,75],"gender":["F","M","F","M"]}
first_df = pd.DataFrame(dict_test)
print(first_df)
print('#0','-'*30)
#1 Add a new column
first_df["weight"] = [110,90,90,130]
print(first_df)
print('#1','-'*30)
#1-1 Insert a new column at a given position; position 2 makes it the third column
first_df.insert(2,"height",[170,165,180,173])
print(first_df)
print('#1-1','-'*30)
# Output
# name score gender
# 0 zhangsan 80 F
# 1 lisi 90 M
# 2 wangwu 95 F
# 3 songliu 75 M
# #0 ------------------------------
# name score gender weight
# 0 zhangsan 80 F 110
# 1 lisi 90 M 90
# 2 wangwu 95 F 90
# 3 songliu 75 M 130
# #1 ------------------------------
# name score height gender weight
# 0 zhangsan 80 170 F 110
# 1 lisi 90 165 M 90
# 2 wangwu 95 180 F 90
# 3 songliu 75 173 M 130
# #1-1 ------------------------------
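The section title also covers deleting data, which the example above does not show. The following sketch illustrates three common ways to delete, using the same sample data under the name `df`; it is an illustration, not the only approach.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["zhangsan", "lisi", "wangwu", "songliu"],
    "score": [80, 90, 95, 75],
    "gender": ["F", "M", "F", "M"],
})

# drop returns a new DataFrame and leaves df unchanged.
without_gender = df.drop(columns=["gender"])
without_rows = df.drop(index=[0, 2])   # rows are dropped by label

# pop removes a column from df in place and returns it as a Series.
popped = df.pop("score")

print(without_gender.columns.tolist())  # ['name', 'score']
print(without_rows.index.tolist())      # [1, 3]
print(df.columns.tolist())              # ['name', 'gender']
```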
6. Combining DataFrames: concat, merge and join
dict_test = {"name":["zhangsan","lisi","wangwu","songliu"],"score":[80,90,95,75],"gender":["F","M","F","M"]}
first_df = pd.DataFrame(dict_test)
dict_test = {"name":["wanger","lisi"],"score":[85,90],"gender":["F","M"],"height":[170,180]}
second_df = pd.DataFrame(dict_test)
print(first_df)
print(second_df)
print('#0','-'*30)
#1 DataFrame.append was removed in pandas 2.0 and pd.concat replaces it. ignore_index=True discards the original index and builds a new one starting from 0
third_df = pd.concat([first_df,second_df],ignore_index=True)
print(third_df)
print('#1','-'*30)
#2 concat
third_df = pd.concat([first_df,second_df], axis=0)
print(third_df)
print('#2-1','-'*30)
third_df = pd.concat([first_df,second_df], axis=1)
print(third_df)
print('#2-2','-'*30)
# join='inner' keeps only the row labels present in both DataFrames
third_df = pd.concat([first_df,second_df], axis=1, join='inner')
print(third_df)
print('#2-3','-'*30)
#3 merge()
# The default is an inner join (how="inner"), keyed on the columns whose names appear in both DataFrames
third_df = pd.merge(first_df,second_df)
print(third_df)
print('#3-1','-'*30)
# how='left' is a left join; on='score' uses score as the key and joins rows with equal score
third_df = pd.merge(first_df,second_df,how='left',on='score')
print(third_df)
print('#3-2','-'*30)
# Outer (full) join
third_df = pd.merge(first_df,second_df,how='outer')
print(third_df)
print('#3-3','-'*30)
#4 join() works much like merge()
# Output
# name score gender
# 0 zhangsan 80 F
# 1 lisi 90 M
# 2 wangwu 95 F
# 3 songliu 75 M
# name score gender height
# 0 wanger 85 F 170
# 1 lisi 90 M 180
# #0 ------------------------------
# name score gender height
# 0 zhangsan 80 F NaN
# 1 lisi 90 M NaN
# 2 wangwu 95 F NaN
# 3 songliu 75 M NaN
# 4 wanger 85 F 170.0
# 5 lisi 90 M 180.0
# #1 ------------------------------
# name score gender height
# 0 zhangsan 80 F NaN
# 1 lisi 90 M NaN
# 2 wangwu 95 F NaN
# 3 songliu 75 M NaN
# 0 wanger 85 F 170.0
# 1 lisi 90 M 180.0
# #2-1 ------------------------------
# name score gender name score gender height
# 0 zhangsan 80 F wanger 85.0 F 170.0
# 1 lisi 90 M lisi 90.0 M 180.0
# 2 wangwu 95 F NaN NaN NaN NaN
# 3 songliu 75 M NaN NaN NaN NaN
# #2-2 ------------------------------
# name score gender name score gender height
# 0 zhangsan 80 F wanger 85 F 170
# 1 lisi 90 M lisi 90 M 180
# #2-3 ------------------------------
# name score gender height
# 0 lisi 90 M 180
# #3-1 ------------------------------
# name_x score gender_x name_y gender_y height
# 0 zhangsan 80 F NaN NaN NaN
# 1 lisi 90 M lisi M 180.0
# 2 wangwu 95 F NaN NaN NaN
# 3 songliu 75 M NaN NaN NaN
# #3-2 ------------------------------
# name score gender height
# 0 zhangsan 80 F NaN
# 1 lisi 90 M 180.0
# 2 wangwu 95 F NaN
# 3 songliu 75 M NaN
# 4 wanger 85 F 170.0
# #3-3 ------------------------------
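join() is mentioned above without an example. The sketch below shows one way to use it, together with the `indicator` option of merge for checking where each row came from. The frames `people`, `heights` and `height_table` are simplified, made-up variants of the data in this section.

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["zhangsan", "lisi", "wangwu", "songliu"],
    "score": [80, 90, 95, 75],
})

# join matches against the index of the other frame, so the key is stored there.
heights = pd.DataFrame({"height": [170, 180]}, index=["wanger", "lisi"])

# on="name" uses the "name" column of people as the key; join is a left join by default.
joined = people.join(heights, on="name")
print(joined)

# The same data with the key as a column, merged with indicator=True:
# the extra _merge column reports left_only, right_only or both for every row.
height_table = pd.DataFrame({"name": ["wanger", "lisi"], "height": [170, 180]})
merged = pd.merge(people, height_table, on="name", how="outer", indicator=True)
print(merged)
```

The main practical difference is that join looks up keys in the index of the other frame, while merge works on columns by default.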
7. Common DataFrame data-cleaning methods
dict_test = {"name":["zhangsan","lisi","wangwu","songliu",np.nan],"score":[80,np.nan,95,75,np.nan],"gender":["F",np.nan,"F","M",np.nan],"height":[np.nan, np.nan,196, 176,np.nan]}
first_df = pd.DataFrame(dict_test)
print(first_df)
print('#0','-'*30)
#1 Dropping missing values
#1-1 Drop rows that contain any NaN
second_df = first_df.dropna()
print(second_df)
print('#1-1','-'*30)
#1-2 Drop rows in which every value is NaN
second_df = first_df.dropna(how = 'all')
print(second_df)
print('#1-2','-'*30)
#1-3 Drop columns that contain any NaN
second_df = first_df.dropna(axis=1)
print(second_df)
print('#1-3','-'*30)
#1-4 Keep rows with at least two non-NaN values
second_df = first_df.dropna(thresh=2)
print(second_df)
print('#1-4','-'*30)
#2 Filling missing values
#2-1 Fill missing values with 0
second_df = first_df.fillna(0)
print(second_df)
print('#2-1','-'*30)
#2-2 Fill missing values in the "name" column with the string "uknown" and in the other columns with 0
second_df = first_df.fillna({"name":"uknown","score":0, "gender":0, "height":0})
print(second_df)
print('#2-2','-'*30)
#2-3 Fill with each column's mean; numeric_only=True skips the text columns, which have no mean
second_df = first_df.fillna(first_df.mean(numeric_only=True))
print(second_df)
print('#2-3','-'*30)
#2-4 Fill with the previous value in each column (forward fill); fillna(method='ffill') is deprecated in recent pandas
second_df = first_df.ffill()
print(second_df)
print('#2-4','-'*30)
#3 Removing duplicate rows
#3-1 Drop rows that duplicate an entire earlier row
second_df = first_df.drop_duplicates()
print(second_df)
print('#3-1','-'*30)
#3-2 Drop rows that duplicate an earlier value in the given columns
second_df = first_df.drop_duplicates(['score'])
print(second_df)
print('#3-2','-'*30)
#3-3 Drop fully duplicated rows, keeping the last occurrence
second_df = first_df.drop_duplicates(keep='last')
print(second_df)
print('#3-3','-'*30)
# Output
# name score gender height
# 0 zhangsan 80.0 F NaN
# 1 lisi NaN NaN NaN
# 2 wangwu 95.0 F 196.0
# 3 songliu 75.0 M 176.0
# 4 NaN NaN NaN NaN
# #0 ------------------------------
# name score gender height
# 2 wangwu 95.0 F 196.0
# 3 songliu 75.0 M 176.0
# #1-1 ------------------------------
# name score gender height
# 0 zhangsan 80.0 F NaN
# 1 lisi NaN NaN NaN
# 2 wangwu 95.0 F 196.0
# 3 songliu 75.0 M 176.0
# #1-2 ------------------------------
# Empty DataFrame
# Columns: []
# Index: [0, 1, 2, 3, 4]
# #1-3 ------------------------------
# name score gender height
# 0 zhangsan 80.0 F NaN
# 2 wangwu 95.0 F 196.0
# 3 songliu 75.0 M 176.0
# #1-4 ------------------------------
# name score gender height
# 0 zhangsan 80.0 F 0.0
# 1 lisi 0.0 0 0.0
# 2 wangwu 95.0 F 196.0
# 3 songliu 75.0 M 176.0
# 4 0 0.0 0 0.0
# #2-1 ------------------------------
# name score gender height
# 0 zhangsan 80.0 F 0.0
# 1 lisi 0.0 0 0.0
# 2 wangwu 95.0 F 196.0
# 3 songliu 75.0 M 176.0
# 4 uknown 0.0 0 0.0
# #2-2 ------------------------------
# name score gender height
# 0 zhangsan 80.000000 F 186.0
# 1 lisi 83.333333 NaN 186.0
# 2 wangwu 95.000000 F 196.0
# 3 songliu 75.000000 M 176.0
# 4 NaN 83.333333 NaN 186.0
# #2-3 ------------------------------
# name score gender height
# 0 zhangsan 80.0 F NaN
# 1 lisi 80.0 F NaN
# 2 wangwu 95.0 F 196.0
# 3 songliu 75.0 M 176.0
# 4 songliu 75.0 M 176.0
# #2-4 ------------------------------
# name score gender height
# 0 zhangsan 80.0 F NaN
# 1 lisi NaN NaN NaN
# 2 wangwu 95.0 F 196.0
# 3 songliu 75.0 M 176.0
# 4 NaN NaN NaN NaN
# #3-1 ------------------------------
# name score gender height
# 0 zhangsan 80.0 F NaN
# 1 lisi NaN NaN NaN
# 2 wangwu 95.0 F 196.0
# 3 songliu 75.0 M 176.0
# #3-2 ------------------------------
# name score gender height
# 0 zhangsan 80.0 F NaN
# 1 lisi NaN NaN NaN
# 2 wangwu 95.0 F 196.0
# 3 songliu 75.0 M 176.0
# 4 NaN NaN NaN NaN
# #3-3 ------------------------------
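Before dropping or filling, it usually helps to count where the missing values are. A minimal sketch with `isna`, using a reduced version of the data above (the `gender` column is left out for brevity):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["zhangsan", "lisi", "wangwu", "songliu", np.nan],
    "score": [80, np.nan, 95, 75, np.nan],
    "height": [np.nan, np.nan, 196, 176, np.nan],
})

# isna() gives a boolean frame; summing counts the True values.
missing_per_column = df.isna().sum()
missing_per_row = df.isna().sum(axis=1)

# Labels of the rows in which every value is missing.
fully_empty_rows = df.index[df.isna().all(axis=1)].tolist()

print(missing_per_column.to_dict())  # {'name': 1, 'score': 2, 'height': 3}
print(missing_per_row.tolist())      # [1, 2, 0, 0, 3]
print(fully_empty_rows)              # [4]
```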
8. Grouping and aggregating a DataFrame
dict_test = {"name":["zhangsan","lisi","wangwu","songliu","wamger","zhaoqi"],"score":[80,85,95,75,89,79],"gender":["F","M","F","M","M","F"],"height":[165, 196, 176,168,178,175]}
first_df = pd.DataFrame(dict_test)
print(first_df)
print('#0','-'*30)
# groupby() returns a grouped object; iterate over it to see each group
result = first_df.groupby("gender")
for groupName,data in result:
    print(groupName, data, sep='\n')
print(result)
print('#1','-'*30)
# Sum of score and height per gender; selecting several columns requires a list, hence the double brackets
print(first_df.groupby("gender")[["score","height"]].sum())
print('#2-1','-'*30)
# Count per gender
print(first_df.groupby("gender")[["score","height"]].count())
print('#2-2','-'*30)
# Mean score and mean height per gender
print(first_df.groupby("gender")[["score","height"]].mean())
print('#2-3','-'*30)
# Sum, count and mean of score and height per gender
print(first_df.groupby("gender")[["score","height"]].agg(['sum','count','mean']))
print('#2-4','-'*30)
# Sum, count, mean and a custom function (range) per gender
def peak_value(dat):
    return max(dat) - min(dat)
print(first_df.groupby("gender")[["score","height"]].agg(['sum','count','mean',peak_value]))
print('#2-5','-'*30)
#3 apply()
def top_num(dat,col,num=2):
    return dat.sort_values(by=col,ascending=False)[0:num]
# Top two scores in each group
print(first_df.groupby("gender").apply(top_num,'score',2))
print('#3-1','-'*30)
# Top two heights in each group
print(first_df.groupby("gender").apply(top_num,'height',2))
print('#3-2','-'*30)
# Use apply to add a diff column equal to score2 - score1 to the DataFrame below
dict_test = {"name":["zhangsan","lisi","wangwu","songliu","wamger","zhaoqi"],"score1":[80,85,95,75,89,79],"score2":[85,89,96,78,92,83],"gender":["F","M","F","M","M","F"],"height":[165, 196, 176,168,178,175]}
first_df = pd.DataFrame(dict_test)
print(first_df)
print('#3-3','-'*30)
def get_diff(dat):
    return dat["score2"]-dat["score1"]
# The axis parameter of apply() decides whether the function runs per row or per column; axis=1 means per row
first_df["diff"] = first_df.apply(get_diff,axis=1)
print(first_df)
print('#3-4','-'*30)
# Output
# name score gender height
# 0 zhangsan 80 F 165
# 1 lisi 85 M 196
# 2 wangwu 95 F 176
# 3 songliu 75 M 168
# 4 wamger 89 M 178
# 5 zhaoqi 79 F 175
# #0 ------------------------------
# F
# name score gender height
# 0 zhangsan 80 F 165
# 2 wangwu 95 F 176
# 5 zhaoqi 79 F 175
# M
# name score gender height
# 1 lisi 85 M 196
# 3 songliu 75 M 168
# 4 wamger 89 M 178
# <pandas.core.groupby.groupby.DataFrameGroupBy object at 0x000001E4693A8390>
# #1 ------------------------------
# score height
# gender
# F 254 516
# M 249 542
# #2-1 ------------------------------
# score height
# gender
# F 3 3
# M 3 3
# #2-2 ------------------------------
# score height
# gender
# F 84.666667 172.000000
# M 83.000000 180.666667
# #2-3 ------------------------------
# score height
# sum count mean sum count mean
# gender
# F 254 3 84.666667 516 3 172.000000
# M 249 3 83.000000 542 3 180.666667
# #2-4 ------------------------------
# score height
# sum count mean peak_value sum count mean peak_value
# gender
# F 254 3 84.666667 16 516 3 172.000000 11
# M 249 3 83.000000 14 542 3 180.666667 28
# #2-5 ------------------------------
# name score gender height
# gender
# F 2 wangwu 95 F 176
# 0 zhangsan 80 F 165
# M 4 wamger 89 M 178
# 1 lisi 85 M 196
# #3-1 ------------------------------
# name score gender height
# gender
# F 2 wangwu 95 F 176
# 5 zhaoqi 79 F 175
# M 1 lisi 85 M 196
# 4 wamger 89 M 178
# #3-2 ------------------------------
# name score1 score2 gender height
# 0 zhangsan 80 85 F 165
# 1 lisi 85 89 M 196
# 2 wangwu 95 96 F 176
# 3 songliu 75 78 M 168
# 4 wamger 89 92 M 178
# 5 zhaoqi 79 83 F 175
# #3-3 ------------------------------
# name score1 score2 gender height diff
# 0 zhangsan 80 85 F 165 5
# 1 lisi 85 89 M 196 4
# 2 wangwu 95 96 F 176 1
# 3 songliu 75 78 M 168 3
# 4 wamger 89 92 M 178 3
# 5 zhaoqi 79 83 F 175 4
# #3-4 ------------------------------
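When different columns need different statistics, named aggregation gives each result column an explicit name and avoids the two-level column header seen above. A minimal sketch on the same data; the output column names `mean_score`, `max_height` and `head_count` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["zhangsan", "lisi", "wangwu", "songliu", "wamger", "zhaoqi"],
    "score": [80, 85, 95, 75, 89, 79],
    "gender": ["F", "M", "F", "M", "M", "F"],
    "height": [165, 196, 176, 168, 178, 175],
})

# Each keyword is an output column: (input column, aggregation function).
summary = df.groupby("gender").agg(
    mean_score=("score", "mean"),
    max_height=("height", "max"),
    head_count=("name", "count"),
)
print(summary)
```

The result has one row per gender and flat column names, which is easier to use in later steps than a multi-level header.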
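9. Time series in pandas
pandas can use timestamps as an index, which makes it convenient to select data by time.
# Time series
# Generate a sequence of dates: starting from 20200101, take one date every 50 days, 10 dates in total
# pd.date_range(start="20200101",end="20201201",freq="20D") takes dates at 20-day intervals between 20200101 and 20201201, as many as fit
first_date = pd.date_range(start="20200101",periods=10,freq="50D")
print(first_date)
print('#0','-'*30)
first_df = pd.DataFrame(np.random.randint(0 ,10, (10,2)),index=first_date)
print(first_df)
print('#1','-'*30)
# Data for 2021; selecting rows by a date string requires .loc in pandas 2.0 and later
print(first_df.loc["2021"])
print('#2','-'*30)
# Data for February 2020
print(first_df.loc["2020-02"])
print('#3','-'*30)
# Data from February 2020 through August 2020
print(first_df.loc["2020-02":"2020-08"])
# Output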
# DatetimeIndex(['2020-01-01', '2020-02-20', '2020-04-10', '2020-05-30',
# '2020-07-19', '2020-09-07', '2020-10-27', '2020-12-16',
# '2021-02-04', '2021-03-26'],
# dtype='datetime64[ns]', freq='50D')
# #0 ------------------------------
# 0 1
# 2020-01-01 4 1
# 2020-02-20 2 4
# 2020-04-10 3 0
# 2020-05-30 5 5
# 2020-07-19 9 5
# 2020-09-07 5 9
# 2020-10-27 4 5
# 2020-12-16 1 6
# 2021-02-04 7 1
# 2021-03-26 9 0
# #1 ------------------------------
# 0 1
# 2021-02-04 7 1
# 2021-03-26 9 0
# #2 ------------------------------
# 0 1
# 2020-02-20 2 4
# #3 ------------------------------
# 0 1
# 2020-02-20 2 4
# 2020-04-10 3 0
# 2020-05-30 5 5
# 2020-07-19 9 5
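A time index also allows aggregation by period. The sketch below sums values per calendar month with `resample` and per year with a `groupby` on the index. It uses deterministic values (`np.arange`) in place of the random numbers above so that the results are reproducible; the column name `value` is an assumption made for this example.

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start="20200101", periods=10, freq="50D")
df = pd.DataFrame({"value": np.arange(10)}, index=dates)

# "MS" groups rows by calendar month, labelled with the first day of the month.
# Months without any row are included with a sum of 0.
monthly = df.resample("MS").sum()

# Grouping on the year of the index gives one total per year.
yearly = df.groupby(df.index.year)["value"].sum()

print(monthly.head())
print(yearly.to_dict())  # {2020: 28, 2021: 17}
```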