python3-pandas data structures: Series and DataFrame basics

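Using Pandas
The main Pandas data structures are Series (one-dimensional data) and DataFrame (two-dimensional data). Together they cover most typical use cases in finance, statistics, social science, engineering, and other fields.
Data structures
A Series is an object similar to a one-dimensional array. It consists of a set of data (of any NumPy data type) and an associated set of data labels (the index).
A DataFrame is a tabular data structure containing an ordered set of columns, where each column can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and can be viewed as a dictionary of Series that share one index.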
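
The "dictionary of Series" view can be checked directly. This is a minimal sketch; the frame `scores` and its column names and values are made up for illustration and do not come from the original notes.

```python
import pandas as pd

# Hypothetical data: two columns sharing the row labels "a", "b", "c".
scores = pd.DataFrame(
    {"math": [90, 80, 70], "art": ["A", "B", "C"]},
    index=list("abc"),
)

math_col = scores["math"]                    # selecting one column yields a Series
print(type(math_col).__name__)               # Series
print(math_col.index.equals(scores.index))   # True: the column reuses the frame's row index
print(list(scores.columns))                  # ['math', 'art']
```

Selecting a column by name works much like looking up a key in a dictionary, which is why the analogy is useful.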

1. Pandas data structure - Series

A Series is a one-dimensional array with labels.
pandas.Series( data, index, dtype, name, copy)

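Parameters:

data: the data (an ndarray, list, dict, or scalar value).
index: the index labels; if omitted, defaults to integers starting from 0.
dtype: the data type; inferred from the data if omitted.
name: the name of the Series.
copy: whether to copy the input data; defaults to False.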
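
The parameters above can be combined in one call. The sketch below uses illustrative names and values (`prices`, `flags`) that are not part of the original notes.

```python
import pandas as pd

# Illustrative values only: explicit labels, a forced float dtype, and a name.
prices = pd.Series([1, 2, 3], index=["x", "y", "z"], dtype="float64", name="price")
print(prices.dtype)    # float64
print(prices.name)     # price

# A scalar is repeated for every label in the index.
flags = pd.Series(0, index=["x", "y", "z"])
print(flags.tolist())  # [0, 0, 0]
```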

If no index is specified, the index values start from 0:

import pandas as pd

t = pd.Series([4,5,6])
print(t)
print(type(t))  # <class 'pandas.core.series.Series'>
print(t[1])  # 5
"""
0    4
1    5
2    6
dtype: int64
<class 'pandas.core.series.Series'>
5
"""

Specifying index labels, changing the data type, and filtering with a boolean condition:

t2 = pd.Series([2,4,6,8], index=list("abcd"))
print(t2)
print(t2["c"])  # 6
print(t2.astype(float))
print(t2[t2>5])
"""
a    2
b    4
c    6
d    8
dtype: int64
6
a    2.0
b    4.0
c    6.0
d    8.0
dtype: float64
c    6
d    8
dtype: int64
"""

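Two details of the example above are easy to miss: astype returns a new Series instead of changing t2, and t2>5 is itself a boolean Series that is then used as a filter. A small sketch, where `as_float` and `mask` are names introduced here for illustration:

```python
import pandas as pd

t2 = pd.Series([2, 4, 6, 8], index=list("abcd"))

as_float = t2.astype(float)      # returns a new Series
print(t2.dtype)                  # still an integer dtype: the original is unchanged
print(as_float.dtype)            # float64

mask = t2 > 5                    # element-wise comparison gives a boolean Series
print(mask.tolist())             # [False, False, True, True]
print(t2[mask].index.tolist())   # ['c', 'd']
```
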
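Creating a Series from a key/value object such as a dictionary:

temp_dict = {"name": "wang1", "age": 18, "tel": 10010}
t3 = pd.Series(temp_dict)
print(t3)
print(t3["age"])  # 18, selected by label
print(t3.iloc[1])  # 18, selected by position (plain t3[1] is deprecated when the index holds labels)
print(t3.iloc[:2])
print(t3.iloc[[1, 2]])
print(t3[["name","tel"]])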
"""
name    wang1
age        18
tel     10010
dtype: object
18
18
name    wang1
age        18
dtype: object
age       18
tel    10010
dtype: object
name    wang1
tel     10010
dtype: object
"""

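When the index holds labels, .loc and .iloc make the intent explicit: .loc selects by label and .iloc selects by position. A short sketch that reuses the same dictionary data:

```python
import pandas as pd

t3 = pd.Series({"name": "wang1", "age": 18, "tel": 10010})

print(t3.loc["age"])                   # 18, selected by label
print(t3.iloc[1])                      # 18, selected by position
print(t3.loc["name":"age"].tolist())   # ['wang1', 18]: label slices include the end label
print(t3.iloc[0:1].tolist())           # ['wang1']: position slices exclude the end
```
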
Getting the values and index of a Series:

print(t3.index)  # Index(['name', 'age', 'tel'], dtype='object')
print(type(t3.index))  # <class 'pandas.core.indexes.base.Index'>
print(t3.values)  # ['wang1' 18 10010]
print(type(t3.values))  # <class 'numpy.ndarray'>
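
The index and values can also be converted to plain Python or NumPy objects, or walked as pairs. A minimal sketch; the variable names `labels` and `values` are illustrative:

```python
import pandas as pd

t3 = pd.Series({"name": "wang1", "age": 18, "tel": 10010})

labels = t3.index.tolist()   # plain Python list of labels
values = t3.to_numpy()       # NumPy array of the values
print(labels)                # ['name', 'age', 'tel']
print(values.shape)          # (3,)

# Iterating over items() yields (label, value) pairs.
for label, value in t3.items():
    print(label, value)
```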

2. Pandas data structure - DataFrame

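The DataFrame constructor is:
pandas.DataFrame( data, index, columns, dtype, copy)
A DataFrame is two-dimensional and acts as a container of Series.
Parameters:

data: the data (ndarray, Series, mapping, list, dict, and similar types).
index: the index values, also called row labels.
columns: the column labels; defaults to RangeIndex (0, 1, 2, …, n).
dtype: the data type.
copy: whether to copy the input data. The default is False in older pandas releases; since pandas 1.3 it is None, which copies dict input but not DataFrame or 2D ndarray input.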
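
As a quick illustration of these parameters, the sketch below sets index, columns, and dtype in one call. The labels "r1", "r2", "c1" to "c3" and the name `frame` are invented for this example.

```python
import numpy as np
import pandas as pd

# Illustrative labels: rows "r1"/"r2", columns "c1".."c3", values forced to float.
frame = pd.DataFrame(
    np.arange(6).reshape(2, 3),
    index=["r1", "r2"],
    columns=["c1", "c2", "c3"],
    dtype="float64",
)
print(frame.shape)             # (2, 3)
print(frame.loc["r2", "c3"])   # 5.0
print(frame.dtypes)            # every column is float64
```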

import pandas as pd
import numpy as np
t = pd.DataFrame(np.arange(12).reshape(3,4))
print(t)
"""
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
"""

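A DataFrame has both a row index and a column index.
The row index identifies the rows. It is called index and corresponds to axis 0 (axis=0).
The column index identifies the columns. It is called columns and corresponds to axis 1 (axis=1).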
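
The axis argument is easiest to see with a reduction such as sum. A minimal sketch on the same 3x4 frame used above:

```python
import numpy as np
import pandas as pd

t = pd.DataFrame(np.arange(12).reshape(3, 4))

# axis=0 collapses the rows: one result per column.
print(t.sum(axis=0).tolist())  # [12, 15, 18, 21]
# axis=1 collapses the columns: one result per row.
print(t.sum(axis=1).tolist())  # [6, 22, 38]
```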

2.1. Using index and columns:

t1 = pd.DataFrame(np.arange(12).reshape(3,4), index=list("abc"), columns=list("wxyz"))
print(t1)
"""
   w  x   y   z
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11
"""

2.2. Creating a DataFrame from lists (missing values are replaced with NaN)

data = [['Google',10],['Runoob',12],['Wiki',13]]
df = pd.DataFrame(data,columns=['Site','Age'])
print(df)
"""
     Site  Age
0  Google   10
1  Runoob   12
2    Wiki   13
"""

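The example above has no gaps, so no NaN appears. The sketch below assumes a hypothetical input in which the second row is shorter than the first; the names `rows` and `df_short` are illustrative.

```python
import pandas as pd

# Hypothetical input: the second row has no value for "Age".
rows = [["Google", 10], ["Runoob"]]
df_short = pd.DataFrame(rows, columns=["Site", "Age"])
print(df_short)
print(df_short["Age"].isna().tolist())  # [False, True]
```
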
2.3. Creating a DataFrame from a dictionary (missing values are replaced with NaN)

data = {'Site':['Google', 'Runoob', 'Wiki'], 'Age':[10, 12, 13]}
df = pd.DataFrame(data)
print(df)
"""
     Site  Age
0  Google   10
1  Runoob   12
2    Wiki   13
"""
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)
"""
   a   b     c
0  1   2   NaN
1  5  10  20.0
"""

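NaN also appears when a dictionary of Series is used and the Series do not share the same labels: the row index becomes the union of the labels, and gaps are filled with NaN. A minimal sketch with made-up names (`left`, `right`, `aligned`):

```python
import pandas as pd

# Hypothetical columns with different label sets.
left = pd.Series([1, 2], index=["a", "b"])
right = pd.Series([3], index=["a"])
aligned = pd.DataFrame({"left": left, "right": right})
print(aligned)
print(aligned.index.tolist())            # ['a', 'b']
print(aligned["right"].isna().tolist())  # [False, True]
```
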
2.4. Basic DataFrame attributes

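DataFrame.shape  # number of rows and columns
DataFrame.dtypes  # data type of each column
DataFrame.ndim  # number of dimensions
DataFrame.index  # row index
DataFrame.columns  # column index
DataFrame.values  # the values as a NumPy array
DataFrame.head(3)  # first rows; 5 by default
DataFrame.tail(3)  # last rows; 5 by default
DataFrame.info()  # overview: row count, column count, column index, non-null count per column, column types, memory usage
DataFrame.describe()  # summary statistics: count, mean, standard deviation, minimum, quartiles, maximum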

data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)
"""
   a   b     c
0  1   2   NaN
1  5  10  20.0
"""
print(df.index)  # RangeIndex(start=0, stop=2, step=1)
print(df.columns)  # Index(['a', 'b', 'c'], dtype='object')
print(df.values)  # [[ 1.  2. nan] [ 5. 10. 20.]]
print(df.shape)  # (2, 3)
print(df.ndim)  # number of dimensions: 2
print(df.dtypes)  # data type of each column
"""
a      int64
b      int64
c    float64
dtype: object
"""
print("*"*80)
print(df.info())
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   a       2 non-null      int64
 1   b       2 non-null      int64
 2   c       1 non-null      float64
dtypes: float64(1), int64(2)
memory usage: 176.0 bytes
None
"""
print(df.describe())
"""
              a          b     c
count  2.000000   2.000000   1.0
mean   3.000000   6.000000  20.0
std    2.828427   5.656854   NaN
min    1.000000   2.000000  20.0
25%    2.000000   4.000000  20.0
50%    3.000000   6.000000  20.0
75%    4.000000   8.000000  20.0
max    5.000000  10.000000  20.0
"""

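head and tail are listed above but not shown in the example. A minimal sketch on a larger, made-up frame (`big` is an illustrative name):

```python
import numpy as np
import pandas as pd

big = pd.DataFrame(np.arange(20).reshape(10, 2), columns=["x", "y"])
print(big.head(3).index.tolist())  # [0, 1, 2]
print(big.tail(3).index.tolist())  # [7, 8, 9]
print(len(big.head()))             # 5, the default number of rows
```
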
2.5. Sorting a DataFrame

data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)
"""
   a   b     c
0  1   2   NaN
1  5  10  20.0
"""
# ascending=True sorts in ascending order
# ascending=False sorts in descending order
df = df.sort_values("c", ascending=False)
print(df)
"""
   a   b     c
1  5  10  20.0
0  1   2   NaN
"""

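A few related options of sort_values are sketched below on the same data: na_position controls where NaN rows go, a list of columns sorts by several keys, and reset_index(drop=True) renumbers the rows afterwards. The names `nan_first` and `multi` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame([{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}])

# NaN goes last by default, whatever the sort direction; na_position moves it.
nan_first = df.sort_values("c", na_position="first")
print(nan_first.index.tolist())   # [0, 1]

# Sort by several columns, each with its own direction.
multi = df.sort_values(["a", "b"], ascending=[False, True])
print(multi.index.tolist())       # [1, 0]

# The original row labels are kept; reset_index(drop=True) renumbers them.
print(multi.reset_index(drop=True).index.tolist())  # [0, 1]
```
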
https://www.runoob.com/pandas/pandas-series.html
https://www.bilibili.com/video/BV1hx411d7jb?p=23
https://www.bilibili.com/video/BV1hx411d7jb?p=24
https://www.bilibili.com/video/BV1hx411d7jb?p=25
https://www.bilibili.com/video/BV1hx411d7jb?p=26