Viewing attributes and data in pandas

10.2 Viewing attributes and an overview

1. Attributes

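df.shape # shape: (number of rows, number of columns)
df.dtypes # data type of each column
df.index # row labels
df.columns # column labels
df.values # the underlying values as a 2-D ndarray
df.size # number of elements in the DataFrame
df.ndim # number of axes, i.e. the number of dimensions
df.empty # True if the DataFrame holds no elements, i.e. any axis has length 0
df.axes # a list containing only the row axis labels and the column axis labels
df.T # transpose rows and columns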

2. Overview

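df.head(10) # show the first 10 rows; the default is 5
df.tail(10) # show the last 10 rows; the default is 5
df.describe() # summary statistics for numeric columns: count, mean, standard deviation, minimum, quartiles, maximum; pass include="object" for string columns or include="all" for every column
df.info() # show the index, column names, data types, non-null counts, and memory usage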

import numpy as np
import pandas as pd

data = {
    'name': ['John', 'Mike', 'Mozla', 'Rose', 'David', 'Marry', 'Wansi', 'Sidy', 'Jack', 'Alic'],
    'age': [20, 32, 29, np.nan, 15, 28, 21, 30, 37, 25],
    'gender': [0, 0, 1, 1, 0, 1, 0, 0, 1, 1],
    'isMarried': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
}
label = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
data = pd.DataFrame(data, index=label)
print(data.dtypes)
print(data.index)
print(data.columns)
print(data.values)
print(data.size)
print(data.ndim)
print(data.empty)
print(data.axes)
print('------head----------')
print(data.head())
print('------tail----------')
print(data.tail())
print('------describe----------')
print(data.describe())
print('------info----------')
print(data.info())  # info() prints its report itself and returns None, so print() adds a final "None"

out:
name          object
age          float64
gender         int64
isMarried     object
dtype: object
Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')
Index(['name', 'age', 'gender', 'isMarried'], dtype='object')
[['John' 20.0 0 'yes']
 ['Mike' 32.0 0 'yes']
 ['Mozla' 29.0 1 'no']
 ['Rose' nan 1 'yes']
 ['David' 15.0 0 'no']
 ['Marry' 28.0 1 'no']
 ['Wansi' 21.0 0 'no']
 ['Sidy' 30.0 0 'yes']
 ['Jack' 37.0 1 'no']
 ['Alic' 25.0 1 'no']]
40
2
False
[Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object'), Index(['name', 'age', 'gender', 'isMarried'], dtype='object')]
------head----------
    name   age  gender isMarried
a   John  20.0       0       yes
b   Mike  32.0       0       yes
c  Mozla  29.0       1        no
d   Rose   NaN       1       yes
e  David  15.0       0        no
------tail----------
    name   age  gender isMarried
f  Marry  28.0       1        no
g  Wansi  21.0       0        no
h   Sidy  30.0       0       yes
i   Jack  37.0       1        no
j   Alic  25.0       1        no
------describe----------
             age     gender
count   9.000000  10.000000
mean   26.333333   0.500000
std     6.782330   0.527046
min    15.000000   0.000000
25%    21.000000   0.000000
50%    28.000000   0.500000
75%    30.000000   1.000000
max    37.000000   1.000000
------info----------
<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   name       10 non-null     object
 1   age        9 non-null      float64
 2   gender     10 non-null     int64
 3   isMarried  10 non-null     object
dtypes: float64(1), int64(1), object(2)
memory usage: 400.0+ bytes
None
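
The include options of describe() listed in the overview are not exercised by the run above. The following is a minimal sketch on a smaller, made-up frame; the variable names are illustrative, and the string columns are cast to object explicitly because newer pandas versions may infer a dedicated string dtype instead.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (not the data set used above)
df = pd.DataFrame(
    {
        "name": ["John", "Mike", "Rose", "David"],
        "age": [20.0, 32.0, np.nan, 15.0],
        "isMarried": ["yes", "yes", "yes", "no"],
    },
    index=["a", "b", "c", "d"],
)
# Assumption: force object dtype so include="object" matches the string columns
df = df.astype({"name": object, "isMarried": object})

print(df.shape)    # (rows, columns)
print(df.T.shape)  # transposing swaps the two numbers

num_summary = df.describe()                  # numeric columns only
obj_summary = df.describe(include="object")  # object (string) columns only
all_summary = df.describe(include="all")     # every column; cells that do not apply are NaN
print(obj_summary)
```

For object columns the summary rows are count, unique, top, and freq rather than mean and quartiles.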

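10.3 Reading and modifying data

10.3.1 Reading and modifying data by row/column labels, column conditions, or row positions
Official documentation: 10 minutes to pandas (pandas 1.4.3 documentation)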

Five ways to query data: 1. the index operator [] 2. the loc indexer 3. the iloc indexer 4. df.where 5. df.query
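
The last two entries, df.where and df.query, are not shown elsewhere in these notes. As a rough sketch on a made-up frame (column names A and B are placeholders): query filters rows with a string expression, while where keeps the shape and blanks out the cells that fail the condition.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 5, 9], "B": [8, 2, 7]}, index=["a", "b", "c"])

# query: keep the rows for which the string expression is true
q = df.query("A > 3 and B > 5")

# where: keep the original shape; cells that fail the condition become NaN
w = df.where(df > 4)

print(q)
print(w)
```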

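Column access (examples with the index operator [] and loc)
① With a single label: df[["A"]] returns a one-column DataFrame, while df["A"] or the shorthand df.A returns a Series (the attribute form works for columns only) ---- df.loc[["a"],["B"]] (without the list brackets the result is reduced to a Series or a scalar)
② With a list of labels: df[["A","C"]] ---- df.loc[["a","b","d"],"B"]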

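Row access (index operator [] and loc)
① With a slice: df["a":"b"] ---- df.loc["a":"b",["A","B"]]
② With a boolean mask: df[df["B"]>=8] ---- df.loc[df["B"]>0] (nothing after the comma means all columns; wrap each condition in () and join conditions with &)
③ With a function: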

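Summary: the index operator can fetch only columns or only rows in a single call, not a chosen combination of rows and columns, whereas loc is more flexible
1. Use loc by default, for rows and columns alike (label, list of labels, label slice, boolean mask)
2. Use the index operator only when fetching columns (for columns it accepts a label or a list of labels; for rows it accepts a slice or a boolean mask)

Note: in these examples row labels are lowercase letters and column labels are uppercase letters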
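
The access patterns above can be tried on a small frame that follows the same naming convention (lowercase row labels, uppercase column labels). This is only a sketch with made-up values; the callable passed to loc is one possible reading of "with a function", not necessarily the form the notes had in mind.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [9, 0, 8, 5], "C": [7, 7, 7, 7]},
    index=["a", "b", "c", "d"],
)

col_series = df["A"]          # single label -> Series
col_frame = df[["A"]]         # list of labels -> DataFrame
rows_slice = df["a":"b"]      # a slice inside [] selects rows, both ends included
rows_mask = df[df["B"] >= 8]  # a boolean mask inside [] selects rows

# loc picks rows and columns in one call
cell_frame = df.loc[["a"], ["B"]]                         # 1x1 DataFrame
cell_scalar = df.loc["a", "B"]                            # scalar
both = df.loc[(df["B"] > 0) & (df["A"] < 4), ["A", "B"]]  # several conditions joined with &
by_func = df.loc[lambda d: d["B"] > 0]                    # a callable receives the DataFrame
```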

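10.3.2 Selecting with loc/iloc

In data analysis you often need to extract part of a table, and to do that you first have to index that part. Python's index operator "[]" and attribute operator "." can access the data in a Series or DataFrame, but they cover only simple selections: "." reaches a single column, and "[]" cannot pick rows and columns in the same call. To address this, pandas provides two indexers for accessing data.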

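df.loc[] selects by label (or by a boolean condition), not by integer position. When you slice by label, both ends are included, that is, the start label and the end label are both part of the result.
loc: label-based indexing (condition, label, or slice)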

df.loc[] accepts several kinds of input:

Scalar label ["a"] - df.loc["a","age"]
List of labels [["a","b"]] - df.loc[["a","b"],["age","gender"]]
Label slice (closed interval) - df.loc["a":"b","age":"gender"]
Boolean array condition - df.loc[df.gender==0,["name","age"]]

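df.iloc[] selects by integer position only, not by label. When you slice by position, the start is included and the end is excluded. As in Python and NumPy, positions start at 0.
iloc: position-based indexing (integer position, slice)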

df.iloc[] supports the following ways of selecting data:

Integer position df.iloc[0,1]
List of integers df.iloc[[0,3],[1,3]]
Range of positions (slice) df.iloc[:,:]
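
The difference between the closed label slice of loc and the half-open position slice of iloc is easy to check on a small frame. The data below is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["John", "Mike", "Mozla", "Rose"],
        "age": [20, 32, 29, 41],
        "gender": [0, 0, 1, 1],
    },
    index=["a", "b", "c", "d"],
)

by_label = df.loc["a":"c", "name":"age"]  # includes row "c" and column "age"
by_pos = df.iloc[0:2, 0:1]                # excludes row 2 and column 1
picked = df.iloc[[0, 3], [1, 2]]          # rows 0 and 3, columns 1 and 2

print(by_label.shape)  # (3, 2)
print(by_pos.shape)    # (2, 1)
```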

# Create a sample data set
data = {
    'name': ['John', 'Mike', 'Mozla', 'Rose', 'David', 'Marry', 'Wansi', 'Sidy', 'Jack', 'Alic'],
    'age': [20, 32, 29, np.nan, 15, 28, 21, 30, 37, 25],
    'gender': [0, 0, 1, 1, 0, 1, 0, 0, 1, 1],
    'isMarried': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
}
label = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=label)
df

    name   age  gender isMarried
a   John  20.0       0       yes
b   Mike  32.0       0       yes
c  Mozla  29.0       1        no
d   Rose   NaN       1       yes
e  David  15.0       0        no
f  Marry  28.0       1        no
g  Wansi  21.0       0        no
h   Sidy  30.0       0       yes
i   Jack  37.0       1        no
j   Alic  25.0       1        no

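# 1. gender of the rows whose name is longer than 4 characters
df.gender[df.name.str.len()>4]
df.loc[df.name.str.len()>4,"gender"] # rows: df.name.str.len()>4, column: gender
# 2. The three rows labeled e, f, g
df["e":"g"]
df.loc["e":"g",:]
# 3. Rows with age greater than 30
df[df.age>30]
df.loc[df.age>30,:]
# 4. First column of the first row
df.loc["a","name"]
df.iloc[0,0]

# Exercises
# 1. age and gender columns of the odd rows (1st, 3rd, ... counting from 1)
df.loc[::2,["age","gender"]]
df.iloc[::2,1:3]
# 2. Even columns (2nd, 4th, counting from 1) of the first 3 rows
df.iloc[:3,1::2]
# 3. Set the missing age values to 25
df.loc[df.age.isna(),"age"]=25
# 4. All rows whose name starts with M
df.loc[df.name.str.contains("^M"),:]
# 5. Rows whose name contains "a" (case-insensitive)
df.loc[df.name.str.contains("a",case=False),:]
# 6. Rows with age > 20 and gender equal to 0
df.loc[(df.age>20)&(df.gender==0),:]
# 7. age and gender columns of the last three rows
# df.loc[-3:,["age","gender"]] does not work here: loc slices by label, and -3 is not a row label
df.iloc[-3:,[1,2]]
df.iloc[-3:,1:3]
# 8. Last two columns of the last two rows
df.iloc[-2:,-2:]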

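10.3.3 Adding and deleting rows or columns

Adding
1 Add rows: df1.append(df2). Unlike list.append, this returns a new DataFrame and leaves df1 unchanged. It was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat([df1,df2]) does the same job
2 Add a column: df1["new_column"] = a constant or an array of values; if the column name already exists, the column is overwritten
3 Add a column at a given position: df.insert(column_position,"new_column",constant or values)

Deleting
1 Delete a column
del df["column_name"] # in place
df.pop("column_name") # in place; returns the removed column
df.drop("column_name",axis=1) # returns a new DataFrame

2 Delete a row
df.drop("row_label",axis=0) # returns a new DataFrame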
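
As a rough end-to-end sketch of the operations above, using made-up frames and column names, and pd.concat in place of append so that it also runs on pandas 2.x:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["a", "b"])
df2 = pd.DataFrame({"A": [5], "B": [6]}, index=["c"])

out = pd.concat([df1, df2])       # add rows; df1 itself is unchanged
out["C"] = 0                      # new column from a constant
out["C"] = [7, 8, 9]              # same name again -> the column is overwritten
out.insert(1, "X", [10, 20, 30])  # new column at position 1, in place

popped = out.pop("X")             # removes "X" in place and returns it
del out["C"]                      # removes "C" in place
no_b = out.drop("B", axis=1)      # returns a copy without column "B"
no_row_a = out.drop("a", axis=0)  # returns a copy without row "a"
```

Note that drop leaves out untouched unless the result is assigned back.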

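10.4 Data integration

10.4.1 Concatenating data with concat (mostly stacking top to bottom)

append adds rows from data with the same column structure (removed in pandas 2.0; pd.concat replaces it)
df1.append(df2)

concat: axis=0 stacks the rows and aligns on column labels; axis=1 places the columns side by side and aligns on row labels
pd.concat([df1,df2],axis=0)
pd.concat([df1,df3],axis=1)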

df1 = pd.DataFrame(data = np.random.randint(0,150,size = [10,3]), # exam scores for computing subjects
                   index = list('ABCDEFGHIJ'), # row labels (users)
                   columns=['Python','Tensorflow','Keras']) # exam subjects
df2 = pd.DataFrame(data = np.random.randint(0,150,size = [10,3]), # exam scores for computing subjects
                   index = list('KLMNOPQRST'), # row labels (users)
                   columns=['Python','Tensorflow','Keras']) # exam subjects
df3 = pd.DataFrame(data = np.random.randint(0,150,size = (9,2)),
                   index = list('ABCDEFGHI'),
                   columns=['PyTorch','Paddle'])
pd.concat((df1,df2),axis=0) # concatenate df1 and df2 by rows: df2's rows are appended after df1's rows
pd.concat((df1,df3),axis=1) # concatenate df1 and df3 by columns: df3's columns are appended after df1's columns
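
Because the frames above hold random numbers, the alignment behaviour is easier to see with fixed values. The following sketch uses tiny made-up frames to show where NaN appears when labels do not match.

```python
import pandas as pd

df1 = pd.DataFrame({"Python": [90, 80], "Keras": [70, 60]}, index=["A", "B"])
df2 = pd.DataFrame({"Python": [50], "Keras": [40]}, index=["C"])
df3 = pd.DataFrame({"PyTorch": [30]}, index=["A"])

stacked = pd.concat([df1, df2], axis=0)       # 3 rows, the same 2 columns
side_by_side = pd.concat([df1, df3], axis=1)  # aligned on row labels; "B" has no PyTorch value
mismatched = pd.concat([df1, df3], axis=0)    # columns are unioned; the gaps become NaN

print(side_by_side)
```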

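10.4.2 Merging data sets with merge (mostly side by side)

Merge (or join) operations combine data sets by linking rows on one or more keys. These operations are at the core of relational databases. The pandas merge function is the main entry point for join operations on data sets.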

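left.merge(right, how="inner", on="key_column")

Parameters
how : 'left', 'right', 'outer', 'inner', 'cross'
on : the join key, used when the key column has the same name in both tables; if the names differ, use left_on and right_on to name each side's key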

# The first table records name and weight
df1 = pd.DataFrame(data = {'name':['softpo','Daniel','Brandon','Ella'],'weight':[70,55,75,65]})
# The second table records name and height
df2 = pd.DataFrame(data = {'name':['softpo','Daniel','Brandon','Cindy'],'height':[172,170,170,166]})
df3 = pd.DataFrame(data = {'名字':['softpo','Daniel','Brandon','Cindy'],'height':[172,170,170,166]})
# Merge the two tables on the shared name column
# left.merge(right, how="inner/left/right/outer (union)", on=key_column)
df1.merge(df2,how="inner",on="name") # key columns have the same name
df1.merge(df3,how="right",left_on="name", right_on="名字") # key columns have different names
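
To see how the how parameter changes the result, the sketch below reruns the same two tables with each join type. The variable names and the renamed key column person are illustrative only.

```python
import pandas as pd

left = pd.DataFrame({"name": ["softpo", "Daniel", "Brandon", "Ella"],
                     "weight": [70, 55, 75, 65]})
right = pd.DataFrame({"name": ["softpo", "Daniel", "Brandon", "Cindy"],
                      "height": [172, 170, 170, 166]})

inner = left.merge(right, how="inner", on="name")      # names present in both tables
left_join = left.merge(right, how="left", on="name")   # every row of left; Ella has no height
right_join = left.merge(right, how="right", on="name") # every row of right; Cindy has no weight
outer = left.merge(right, how="outer", on="name")      # union of the names

# Different key names on each side: both key columns are kept in the result
right_renamed = right.rename(columns={"name": "person"})
diff_keys = left.merge(right_renamed, how="inner", left_on="name", right_on="person")
```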

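10.8 Sorting data

Sorting by index or column labels
df.sort_index(axis=0, ascending=True) # sort by labels; axis is 0 (row labels, the default) or 1 (column labels); ascending is True or False
Sorting by values
df.sort_values(by = ['column'])
df.sort_values(by = ['column1','column2',...])

Returning the n largest or n smallest values
df.nlargest(10,columns='column') # the 10 rows with the largest values in the given column
df.nsmallest(5,columns='column') # the 5 rows with the smallest values in the given column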

df = pd.DataFrame(data = np.random.randint(0,15,size = (15,3)),
                  index = list('qwertyuioijhgfc'),
                  columns = ['Python','Keras','Pytorch'])
df.head() # the values are random, so your output will differ

   Python  Keras  Pytorch
q       2      7       10
w       6      8        9
e      11     11       13
r       4     10        3
t       3     10        6

# 1. Sort by index or column labels
df.sort_index(axis = 0,ascending=True) # sort by row labels, ascending
df.sort_index(axis = 1,ascending=False) # sort by column names, descending
# 2. Sort by values
df.sort_values(by = ['Python']) # sort by the Python column
df.sort_values(by = ['Python','Keras']) # sort by Python first, then by Keras
# 3. Return the n largest or n smallest values
df.nlargest(10,columns='Keras') # the 10 rows with the largest Keras values
df.nsmallest(5,columns='Python') # the 5 rows with the smallest Python values
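
With fixed values the effect of each sort can be verified by eye. This is a small sketch on made-up data; the per-column ascending list is an addition not used in the notes above.

```python
import pandas as pd

df = pd.DataFrame(
    {"Python": [3, 1, 3, 2], "Keras": [5, 9, 4, 7]},
    index=["d", "b", "a", "c"],
)

by_index = df.sort_index()                          # rows ordered a, b, c, d
cols_desc = df.sort_index(axis=1, ascending=False)  # columns ordered Python, Keras
# Python descending, ties broken by Keras ascending
by_two = df.sort_values(by=["Python", "Keras"], ascending=[False, True])
top2 = df.nlargest(2, columns="Keras")              # rows b and c
low2 = df.nsmallest(2, columns="Python")            # rows b and c

print(by_two)
```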

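10.9 Grouping and aggregation with groupby

Group by the given columns, then compute one statistic on the chosen columns
df.groupby(by = ['group_col1','group_col2',...],as_index=whether the group columns become row labels)[['stat_col1','stat_col2',...]].stat_function()
Group by the given columns, then compute several statistics on the chosen columns
df.groupby(by = ['group_col1','group_col2',...],as_index=whether the group columns become row labels)[['stat_col1','stat_col2',...]].aggregate([stat_function1, stat_function2]); the result has multi-level columns, which can be accessed with pd.IndexSlice
Group by the given columns, then compute a different statistic per column: df.groupby(by=["sex"]).aggregate({"column1":stat_function1,"column2":stat_function2})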

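Summary:
as_index=False keeps the grouping fields as ordinary columns instead of row labels; make sure the aggregation does not produce duplicate column names
df.groupby([group_fields])[stat_fields].stat_function()
df.groupby([group_fields])[stat_fields].agg([stat_functions])
df.groupby([group_fields]).agg({stat_field: stat_function, stat_field: stat_function, ...})
df.groupby([group_fields])[stat_fields].agg([stat_functions]).rename(columns={"sum": "foo", "mean": "bar", "std": "baz"})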

df = pd.DataFrame(data = {'sex':np.random.randint(0,2,size = 300), # 0 = male, 1 = female
                          'Class':np.random.randint(1,9,size = 300), # eight classes, 1 to 8
                          'Python':np.random.randint(0,151,size = 300), # Python scores
                          'Keras':np.random.randint(0,151,size =300), # Keras scores
                          'Tensorflow':np.random.randint(0,151,size=300),
                          'Java':np.random.randint(0,151,size = 300),
                          'C++':np.random.randint(0,151,size = 300)})
# Group by sex and take the maximum C++ score
df.groupby("sex")["C++"].max()
# Count the students of each sex in each class
df.groupby(["Class","sex"],as_index=False)["C++"].count() # as_index=False keeps the group keys as columns instead of the index
# Group by Class and compute the mean and the sum of Keras and Java
df.groupby("Class")[["Keras","Java"]].agg(["mean","sum"])
# Rename the result columns with named aggregation (passing a dict of new names to a single column's agg is no longer supported)
df.groupby("Class",as_index=False).agg(aa=("Java","max"),bb=("Java","min"))
df.groupby("Class")[["Keras","Java"]].agg(lambda x: x.astype(int).sum())
# Mean Keras and Java score of each class
df.groupby("Class").agg({"Keras":"mean","Java":"mean"})
# Find the class with the best Keras score
df.groupby("Class")[["Keras","Java"]].mean().sort_values(by=["Keras","Java"],ascending=False).index[0]
# Find the top 3 Python scores in each class
df["Python_rank"]=df.groupby("Class")["Python"].rank(method="dense",ascending=False)
df[df.Python_rank<=3].sort_values(by=["Class","Python"],ascending=[True,False])
df
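
Since the scores above are random, here is a small deterministic sketch of the same groupby patterns. The data and the result names (java_max, java_min) are made up, and the pd.IndexSlice line shows one way to reach into the multi-level columns that agg with a list produces.

```python
import pandas as pd

df = pd.DataFrame({
    "Class": [1, 1, 2, 2, 2],
    "sex": [0, 1, 0, 0, 1],
    "Python": [90, 60, 70, 80, 50],
    "Java": [30, 50, 20, 40, 60],
})

max_by_sex = df.groupby("sex")["Python"].max()
counts = df.groupby(["Class", "sex"], as_index=False)["Java"].count()

# Several statistics at once -> columns become (column, statistic) pairs
multi = df.groupby("Class")[["Python", "Java"]].agg(["mean", "sum"])
means_only = multi.loc[:, pd.IndexSlice[:, "mean"]]

# Named aggregation: new_name=(source column, statistic)
named = df.groupby("Class", as_index=False).agg(
    java_max=("Java", "max"),
    java_min=("Java", "min"),
)
print(named)
```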

10.10 Pivot tables with pivot_table

Equivalent to a pivot table in Excel

When no columns argument is given, it is equivalent to groupby

df = pd.DataFrame(data = {'sex':np.random.randint(0,2,size = 300), # 0 = male, 1 = female
                          'class':np.random.randint(1,9,size = 300), # eight classes, 1 to 8
                          'Python':np.random.randint(0,151,size = 300), # Python scores
                          'Keras':np.random.randint(0,151,size =300), # Keras scores
                          'Tensorflow':np.random.randint(0,151,size=300),
                          'Java':np.random.randint(0,151,size = 300),
                          'C++':np.random.randint(0,151,size = 300)})
df.head()
# An Excel pivot table has four areas: values, rows, columns, filters
# df.pivot_table(index=rows, columns=columns, values=values, aggfunc=aggregation function)
# Note: avoid duplicate column names
# Highest C++ score for each sex
df.groupby("sex")["C++"].max()
df.pivot_table(index="sex",values="C++",aggfunc="max")
# Mean Keras and Java score of each class
df.pivot_table(index="class",values=["Keras","Java"],aggfunc="mean")
# Number of students of each sex in each class
df.pivot_table(index="class",columns="sex",values="Java",aggfunc="count")
# Share of students in each class
# Mean Python score for each sex in each class
#df.plot(x="Python",y="Java",kind="scatter")
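
The last two exercises are left without code in the notes. One possible approach, sketched on a small made-up frame, uses value_counts(normalize=True) for the class shares and a pivot_table with both index and columns for the per-class, per-sex mean. This is an assumption about what the exercises intend, not the author's own solution.

```python
import pandas as pd

df = pd.DataFrame({
    "class": [1, 1, 2, 2, 2],
    "sex": [0, 1, 0, 0, 1],
    "Python": [90, 60, 70, 80, 50],
})

# Share of students in each class (fractions that sum to 1)
share = df["class"].value_counts(normalize=True).sort_index()

# Mean Python score for each sex in each class: rows = class, columns = sex
mean_py = df.pivot_table(index="class", columns="sex", values="Python", aggfunc="mean")

# The same table built with groupby and unstack
same = df.groupby(["class", "sex"])["Python"].mean().unstack()

print(share)
print(mean_py)
```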