Python Toolkit: Pandas
Contents
- Reading data
- The DataFrame structure
- The Series structure
- Basic statistics and the value_counts function
- pivot operations
- groupby operations
- Merge operations
- Example 1
- Example 2
- Sorting
- Handling missing values
- Custom functions with apply
- Working with time data
- Plotting
- Working with large datasets
Official API reference: https://pandas.pydata.org/docs/reference/index.html
A general feel for pandas is enough to start with; you can always consult the documentation when you need something specific.
import pandas as pd
Reading data
The API reference lists, among others, three functions for reading plain-text tables into a DataFrame:
read_table(filepath_or_buffer[, sep, …])
Read general delimited file into DataFrame.
read_csv(filepath_or_buffer[, sep, …])
Read a comma-separated values (csv) file into DataFrame.
read_fwf(filepath_or_buffer[, colspecs, …])
Read a table of fixed-width formatted lines into DataFrame.
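Before reading a file from disk, it can help to see how the `sep` parameter handles delimiters other than commas. A minimal sketch, using hypothetical inline data fed through `io.StringIO` so no file is needed:

```python
import io
import pandas as pd

# Hypothetical inline data: a tab-separated table read with an explicit sep.
tsv = "name\tage\nAlice\t30\nBob\t25\n"
frame = pd.read_csv(io.StringIO(tsv), sep="\t")
print(frame.shape)           # (2, 2)
print(list(frame.columns))   # ['name', 'age']
```

The same `sep` argument works when reading a real file path instead of a buffer.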
# Note: read_csv expects delimited text such as CSV. Use forward slashes in the path; a path
# copied straight from Windows Explorer uses backslashes, which must be escaped first.
df = pd.read_csv('./data/titanic.csv')
# Read the first 5 rows
df.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
# Read the last 5 rows
df.tail()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.00 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.00 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.00 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |
The DataFrame structure
The object returned by the read call above, named df here, is a DataFrame, the core data structure in pandas.
# Print summary information about the data: number of samples, type and count of each column, overall memory usage, etc.
# Rows are data samples and columns are features; most read functions return a DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
Other DataFrame attributes
# Return the index
df.index
RangeIndex(start=0, stop=891, step=1)
# Get the name of every feature (column)
df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
# The dtype of each column; object is how pandas (and NumPy) represents strings
df.dtypes
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
# Get the underlying value matrix directly
df.values
array([[1, 0, 3, ..., 7.25, nan, 'S'],
       [2, 1, 1, ..., 71.2833, 'C85', 'C'],
       [3, 1, 3, ..., 7.925, nan, 'S'],
       ...,
       [889, 0, 3, ..., 23.45, nan, 'S'],
       [890, 1, 1, ..., 30.0, 'C148', 'C'],
       [891, 0, 3, ..., 7.75, nan, 'Q']], dtype=object)
# Column indexing
age = df['Age']
# Get a single value
print("single:\n",age[0])
# Get the first few values
print("more:\n",age[:5])
# Pull out just the underlying values
age.values[:5]
single:
 22.0
more:
 0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

array([22., 38., 26., 35., 35.])
Usually the index is numeric, but any column can serve as the index, for example the Name column.
# Numeric index by default
df.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
# Set Name as the index; reset_index() restores the default
df = df.set_index('Name')
df.head()
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
Name | |||||||||||
Braund, Mr. Owen Harris | 1 | 0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 2 | 1 | 1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
Heikkinen, Miss. Laina | 3 | 1 | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
Futrelle, Mrs. Jacques Heath (Lily May Peel) | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
Allen, Mr. William Henry | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
# Now ages can be looked up by name
age = df['Age']
print("Braund, Mr. Owen Harris age:",age['Braund, Mr. Owen Harris'])
print("Futrelle, Mrs. Jacques Heath (Lily May Peel) age:",age['Futrelle, Mrs. Jacques Heath (Lily May Peel)'])
Braund, Mr. Owen Harris age: 22.0
Futrelle, Mrs. Jacques Heath (Lily May Peel) age: 35.0
# Restore the default index
df = df.reset_index()
# Select several columns at once
Clichong = df[['Name','Age','Fare']]
Clichong[:5]
Name | Age | Fare | |
---|---|---|---|
0 | Braund, Mr. Owen Harris | 22.0 | 7.2500 |
1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 71.2833 |
2 | Heikkinen, Miss. Laina | 26.0 | 7.9250 |
3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 53.1000 |
4 | Allen, Mr. William Henry | 35.0 | 8.0500 |
Rows can also be selected with the iloc and loc accessors. iloc first:
# iloc examples
# Get a single row
print("df.iloc[2]:\n",df.iloc[2])
# Get a slice of rows
print("\ndf.iloc[1:4]:\n",df.iloc[1:4])
# Rows and columns can be selected together
print("\ndf.iloc[1:4,1:3]:\n",df.iloc[1:4,1:3])
df.iloc[2]:
index                                2
Name            Heikkinen, Miss. Laina
PassengerId                          3
Survived                             1
Pclass                               3
Sex                             female
Age                                 26
SibSp                                0
Parch                                0
Ticket                STON/O2. 3101282
Fare                             7.925
Cabin                              NaN
Embarked                             S
Name: 2, dtype: object

df.iloc[1:4]:
   index                                               Name  PassengerId  \
1      1  Cumings, Mrs. John Bradley (Florence Briggs Th...            2
2      2                             Heikkinen, Miss. Laina            3
3      3       Futrelle, Mrs. Jacques Heath (Lily May Peel)            4

   Survived  Pclass     Sex   Age  SibSp  Parch            Ticket     Fare  \
1         1       1  female  38.0      1      0          PC 17599  71.2833
2         1       3  female  26.0      0      0  STON/O2. 3101282   7.9250
3         1       1  female  35.0      1      0            113803  53.1000

  Cabin Embarked
1   C85        C
2   NaN        S
3  C123        S

df.iloc[1:4,1:3]:
                                                Name  PassengerId
1  Cumings, Mrs. John Bradley (Florence Briggs Th...            2
2                             Heikkinen, Miss. Laina            3
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)            4
# loc examples
# Get all information for Heikkinen, Miss. Laina
df.loc[2]
Name Heikkinen, Miss. Laina
index 2
PassengerId 3
Survived 1
Pclass 3
Sex female
Age 26
SibSp 0
Parch 0
Ticket STON/O2. 3101282
Fare 7.925
Cabin NaN
Embarked S
Name: 2, dtype: object
# Get the sex of Heikkinen, Miss. Laina
df.loc[2,'Sex']
'female'
# Slice several rows, all columns
df.loc[0:2]
Name | index | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Braund, Mr. Owen Harris | 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | Heikkinen, Miss. Laina | 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
# Slice several rows, a single column
df.loc[0:2,'Name']
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 Heikkinen, Miss. Laina
Name: Name, dtype: object
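The difference between the two accessors is easy to miss: loc is label-based while iloc is position-based, and their slicing rules differ. A small sketch on a toy frame (variable names here are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({'x': [10, 20, 30]}, index=['a', 'b', 'c'])
print(toy.loc['b', 'x'])        # 20 -- loc selects by label
print(toy.iloc[1, 0])           # 20 -- iloc selects by integer position
# loc slices include the end label; iloc slices exclude the end position
print(len(toy.loc['a':'b']))    # 2
print(len(toy.iloc[0:1]))       # 1
```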
# Set the index again to filter out all male passengers
df = df.set_index('Name') # setting the same index twice raises an error
df['Sex'] == 'male'
Name
Braund, Mr. Owen Harris True
Cumings, Mrs. John Bradley (Florence Briggs Thayer) False
Heikkinen, Miss. Laina False
Futrelle, Mrs. Jacques Heath (Lily May Peel) False
Allen, Mr. William Henry                                True
                                                       ...
Montvila, Rev. Juozas True
Graham, Miss. Margaret Edith False
Johnston, Miss. Catherine Helen "Carrie" False
Behr, Mr. Karl Howell True
Dooley, Mr. Patrick True
Name: Sex, Length: 891, dtype: bool
# The first five male passengers
df[df['Sex'] == 'male'][0:5]
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
Name | |||||||||||
Braund, Mr. Owen Harris | 1 | 0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
Allen, Mr. William Henry | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Moran, Mr. James | 6 | 0 | 3 | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
McCarthy, Mr. Timothy J | 7 | 0 | 1 | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
Palsson, Master. Gosta Leonard | 8 | 0 | 3 | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
# Compute the mean age of all male passengers
df.loc[df['Sex'] == 'male','Age'].mean()
30.72664459161148
df['Sex'] == 'male' : builds a boolean mask selecting male passengers
df.loc[df['Sex'] == 'male','Age'] : selects the ages of all male passengers
df.loc[df['Sex'] == 'male','Age'].mean() : computes their mean age
# Passengers aged 70 or older
df[df['Age'] >= 70]
Name | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
96 | Goldschmidt, Mr. George B | 97 | 0 | 1 | male | 71.0 | 0 | 0 | PC 17754 | 34.6542 | A5 | C |
116 | Connors, Mr. Patrick | 117 | 0 | 3 | male | 70.5 | 0 | 0 | 370369 | 7.7500 | NaN | Q |
493 | Artagaveytia, Mr. Ramon | 494 | 0 | 1 | male | 71.0 | 0 | 0 | PC 17609 | 49.5042 | NaN | C |
630 | Barkworth, Mr. Algernon Henry Wilson | 631 | 1 | 1 | male | 80.0 | 0 | 0 | 27042 | 30.0000 | A23 | S |
672 | Mitchell, Mr. Henry Michael | 673 | 0 | 2 | male | 70.0 | 0 | 0 | C.A. 24580 | 10.5000 | NaN | S |
745 | Crosby, Capt. Edward Gifford | 746 | 0 | 1 | male | 70.0 | 1 | 1 | WE/P 5735 | 71.0000 | B22 | S |
851 | Svensson, Mr. Johan | 852 | 0 | 3 | male | 74.0 | 0 | 0 | 347060 | 7.7750 | NaN | S |
# Count the passengers aged 70 or older
(df['Age'] >= 70).sum()
7
# Sum the ages of the passengers aged 70 or older
df.loc[df['Age'] >= 70,'Age'].sum()
506.5
Creating a DataFrame
data = {'country':['China','America','India'],'population':[14,5,25]}
df = pd.DataFrame(data)
df
country | population | |
---|---|---|
0 | China | 14 |
1 | America | 5 |
2 | India | 25 |
df[df['population'] >= 14]
country | population | |
---|---|---|
0 | China | 14 |
2 | India | 25 |
Options: set_option
# Set the maximum number of rows and columns to display
# pd.set_option('display.max_columns',20)
pd.set_option('display.max_rows',6)
df
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
# Check the current display limits
print("pd.get_option('display.max_rows'):",pd.get_option('display.max_rows'))
print("pd.get_option('display.max_columns'):",pd.get_option('display.max_columns'))
pd.get_option('display.max_rows'): 6
pd.get_option('display.max_columns'): 20
The Series structure
Data read from a file is two-dimensional, i.e. a DataFrame; taking a single column out of it yields a Series.
In other words, a DataFrame is a collection of Series sharing one index, one level up in the hierarchy.
# Create a Series
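The relationship described above can be checked directly, a minimal sketch (names are illustrative): pulling a column out of a DataFrame gives a Series, and a DataFrame can be reassembled from Series sharing an index.

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})
col = frame['a']
print(type(col).__name__)      # Series -- each column is a Series
# Conversely, a DataFrame can be rebuilt from Series sharing an index
rebuilt = pd.DataFrame({'a': col, 'b': frame['b']})
print(rebuilt.equals(frame))   # True
```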
data = [1,2,3]
index = ['a','b','c']
Clichong = pd.Series(data = data,index = index)
Clichong
a 1
b 2
c 3
dtype: int64
Clichong.loc['b']
2
Clichong.iloc[1]
2
Clichong_cp = Clichong.copy()
Clichong_cp['b']
2
# With inplace=False (the default) the method returns a new object and leaves the original untouched;
# with inplace=True the data is modified in place
Clichong.replace(to_replace=1,value=100,inplace=True)
Clichong
a 100
b 2
c 3
dtype: int64
Clichong.index
Index(['a', 'b', 'c'], dtype='object')
# Rename index labels
Clichong.rename(index={'a':'A'},inplace=True)
Clichong.index
Index(['A', 'b', 'c'], dtype='object')
Two ways to add data are shown below.
# Method 1: assign to a new label directly
Clichong['d'] = 4
Clichong
A 100
b 2
c 3
d 4
dtype: int64
# Method 2: append another Series; ignore_index=True renumbers the index
# (Series.append was removed in pandas 2.0; use pd.concat there)
data = [5,6]
index = ['e','f']
Clichong2 = pd.Series(data=data,index=index)
Clichong = Clichong.append(Clichong2,ignore_index=True)
Clichong
0    100
1      2
2      3
      ...
5      6
6      5
7      6
Length: 8, dtype: int64
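Since Series.append was deprecated in pandas 1.4 and removed in 2.0, the same result is obtained with pd.concat; a short sketch (variable names are illustrative):

```python
import pandas as pd

s1 = pd.Series([1, 2], index=['a', 'b'])
s2 = pd.Series([3, 4], index=['c', 'd'])
combined = pd.concat([s1, s2])                       # keeps the original labels
renumbered = pd.concat([s1, s2], ignore_index=True)  # renumbers 0..n-1, like append(..., ignore_index=True)
print(list(combined.index))     # ['a', 'b', 'c', 'd']
print(list(renumbered.index))   # [0, 1, 2, 3]
```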
Deleting data: del and drop
# Method 1: delete with del
del Clichong[1]
Clichong
2 3
3 4
4 5
5 6
6 5
7 6
dtype: int64
# Method 2: delete with drop
Clichong.drop([7],inplace=True)
Clichong
4 5
5 6
6 5
dtype: int64
Basic statistics and the value_counts function
# The usual statistics all work the same way: max, min, median, sum, etc.
df['Age'].mean()
29.69911764705882
# Summary statistics for each numeric column
pd.set_option('display.max_rows',8) # set the max display rows to 8
df.describe()
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
# Covariance matrix (numeric columns; recent pandas needs numeric_only=True here)
df.cov()
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|
PassengerId | 66231.000000 | -0.626966 | -7.561798 | 138.696504 | -16.325843 | -0.342697 | 161.883369 |
Survived | -0.626966 | 0.236772 | -0.137703 | -0.551296 | -0.018954 | 0.032017 | 6.221787 |
Pclass | -7.561798 | -0.137703 | 0.699015 | -4.496004 | 0.076599 | 0.012429 | -22.830196 |
Age | 138.696504 | -0.551296 | -4.496004 | 211.019125 | -4.163334 | -2.344191 | 73.849030 |
SibSp | -16.325843 | -0.018954 | 0.076599 | -4.163334 | 1.216043 | 0.368739 | 8.748734 |
Parch | -0.342697 | 0.032017 | 0.012429 | -2.344191 | 0.368739 | 0.649728 | 8.661052 |
Fare | 161.883369 | 6.221787 | -22.830196 | 73.849030 | 8.748734 | 8.661052 | 2469.436846 |
# Correlation matrix
df.corr()
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|
PassengerId | 1.000000 | -0.005007 | -0.035144 | 0.036847 | -0.057527 | -0.001652 | 0.012658 |
Survived | -0.005007 | 1.000000 | -0.338481 | -0.077221 | -0.035322 | 0.081629 | 0.257307 |
Pclass | -0.035144 | -0.338481 | 1.000000 | -0.369226 | 0.083081 | 0.018443 | -0.549500 |
Age | 0.036847 | -0.077221 | -0.369226 | 1.000000 | -0.308247 | -0.189119 | 0.096067 |
SibSp | -0.057527 | -0.035322 | 0.083081 | -0.308247 | 1.000000 | 0.414838 | 0.159651 |
Parch | -0.001652 | 0.081629 | 0.018443 | -0.189119 | 0.414838 | 1.000000 | 0.216225 |
Fare | 0.012658 | 0.257307 | -0.549500 | 0.096067 | 0.159651 | 0.216225 | 1.000000 |
# Count how often each value of a column occurs
df['Sex'].value_counts() # most frequent first by default
# ascending=True puts the least frequent first
df['Sex'].value_counts(ascending = True)
female 314
male 577
Name: Sex, dtype: int64
This works well for a column with a few discrete values, but not for a continuous one; there, define intervals and discretize the values first.
# Split the ages into groups; bins gives the number of groups
df['Age'].value_counts(ascending = True,bins = 5)
(64.084, 80.0] 11
(48.168, 64.084] 69
(0.339, 16.336] 100
(32.252, 48.168] 188
(16.336, 32.252] 346
Name: Age, dtype: int64
The bins parameter of value_counts performs this grouping; the cut function also does binning, where a list such as [10,30,50,80] defines custom bin edges and the labels parameter names the groups
data = [1,2,3,4,5,6,7,8,9,0]
bins = [0,3,7,10]
Clichong = pd.cut(data,bins)
Clichong
[(0.0, 3.0], (0.0, 3.0], (0.0, 3.0], (3.0, 7.0], (3.0, 7.0], (3.0, 7.0], (3.0, 7.0], (7.0, 10.0], (7.0, 10.0], NaN]
Categories (3, interval[int64]): [(0, 3] < (3, 7] < (7, 10]]
pd.value_counts(Clichong)
(3, 7] 4
(0, 3] 3
(7, 10] 2
dtype: int64
# Custom bin edges for a column, combined with value_counts; make sure you understand what each piece does
pd.value_counts(pd.cut(df['Age'],[10,30,50,80]))
(10, 30] 345
(30, 50] 241
(50, 80] 64
Name: Age, dtype: int64
# With the labels parameter, each value is tagged with its group name
group_name = ['Young','Mille','Old']
pd.cut(df['Age'],[10,30,50,80],labels = group_name)
0 Young
1 Mille
2 Young
3      Mille
       ...
887 Young
888 NaN
889 Young
890 Mille
Name: Age, Length: 891, dtype: category
Categories (3, object): [Young < Mille < Old]
# The labelled bins can then be counted with value_counts
group_name = ['Young','Mille','Old']
pd.value_counts(pd.cut(df['Age'],[10,30,50,80],labels = group_name))
Young 345
Mille 241
Old 64
Name: Age, dtype: int64
pivot operations
example = pd.DataFrame({
    'Month': ["January", "January", "January", "January",
              "February", "February", "February", "February",
              "March", "March", "March", "March"],
    'Category': ["Transportation", "Grocery", "Household", "Entertainment",
                 "Transportation", "Grocery", "Household", "Entertainment",
                 "Transportation", "Grocery", "Household", "Entertainment"],
    'Amount': [74., 235., 175., 100., 115., 240., 225., 125., 90., 260., 200., 120.]})
example
Month | Category | Amount | |
---|---|---|---|
0 | January | Transportation | 74.0 |
1 | January | Grocery | 235.0 |
2 | January | Household | 175.0 |
3 | January | Entertainment | 100.0 |
... | ... | ... | ... |
8 | March | Transportation | 90.0 |
9 | March | Grocery | 260.0 |
10 | March | Household | 200.0 |
11 | March | Entertainment | 120.0 |
12 rows × 3 columns
# pivot builds a pivot table, letting you reorganize the data any way you like
example_pivot = example.pivot(index = 'Category',columns= 'Month',values = 'Amount')
example_pivot
Month | February | January | March |
---|---|---|---|
Category | |||
Entertainment | 125.0 | 100.0 | 120.0 |
Grocery | 240.0 | 235.0 | 260.0 |
Household | 225.0 | 175.0 | 200.0 |
Transportation | 115.0 | 74.0 | 90.0 |
Month is the month being summarized and Amount the actual spending; the table shows how much was spent on each category in each month.
# Row totals
example_pivot.sum(axis=1)
Category
Entertainment 345.0
Grocery 735.0
Household 600.0
Transportation 279.0
dtype: float64
# Column totals
example_pivot.sum(axis=0)
Month
February 705.0
January 584.0
March 670.0
dtype: float64
df.pivot_table(index='Sex',columns='Pclass',values='Fare')
Pclass | 1 | 2 | 3 |
---|---|---|---|
Sex | |||
female | 106.125798 | 21.970121 | 16.118810 |
male | 67.226127 | 19.741782 | 12.661633 |
Here Pclass is the cabin class and Fare the ticket price.
index sets the attribute to group rows by, columns the attribute spread across columns, and values the quantity being aggregated; the default aggregation is the mean.
To use the maximum or minimum instead, set the aggfunc parameter, which chooses the aggregation function.
# Take the maximum instead of the mean
df.pivot_table(index='Sex',columns='Pclass',values='Fare',aggfunc='max') # or min
Pclass | 1 | 2 | 3 |
---|---|---|---|
Sex | |||
female | 512.3292 | 65.0 | 69.55 |
male | 512.3292 | 73.5 | 69.55 |
# Count the passengers in each cabin class
df.pivot_table(index='Sex',columns='Pclass',values='Fare',aggfunc='count')
Pclass | 1 | 2 | 3 |
---|---|---|---|
Sex | |||
female | 94 | 76 | 144 |
male | 122 | 108 | 347 |
# Total per class, regardless of sex
(df.pivot_table(index='Sex',columns='Pclass',values='Fare',aggfunc='count')).sum(axis=0)
Pclass
1 216
2 184
3 491
dtype: int64
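The manual `sum(axis=0)` above can also be delegated to pivot_table itself via the `margins` parameter, which appends "All" totals. A minimal sketch on synthetic data (the small frame here stands in for the Titanic columns):

```python
import pandas as pd

demo = pd.DataFrame({'Sex': ['f', 'f', 'm', 'm'],
                     'Pclass': [1, 2, 1, 2],
                     'Fare': [100.0, 20.0, 60.0, 10.0]})
# margins=True adds an 'All' row and column holding the totals
counts = demo.pivot_table(index='Sex', columns='Pclass', values='Fare',
                          aggfunc='count', margins=True)
print(counts.loc['All', 'All'])   # 4 -- the overall count
```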
# Split passengers into adults and minors, then compute the mean survival rate
# for each sex within each group ('mean' is also the default aggfunc)
df['Underaged'] = df['Age'] <= 18
df.pivot_table(index = 'Underaged',columns='Sex',values='Survived',aggfunc='mean')
Sex | female | male |
---|---|---|
Underaged | ||
False | 0.760163 | 0.167984 |
True | 0.676471 | 0.338028 |
groupby operations
df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],'data':[0,5,10,5,10,15,10,15,20]})
df
key | data | |
---|---|---|
0 | A | 0 |
1 | B | 5 |
2 | C | 10 |
3 | A | 5 |
... | ... | ... |
5 | C | 15 |
6 | A | 10 |
7 | B | 15 |
8 | C | 20 |
9 rows × 2 columns
# The usual loop for per-key totals
for key in ['A','B','C']:
    print(key,df[df['key'] == key].sum())
A key     AAA
data       15
dtype: object
B key     BBB
data       30
dtype: object
C key     CCC
data       45
dtype: object
# groupby replaces the loop above; here aggregating with sum
df.groupby('key').sum()
data | |
---|---|
key | |
A | 15 |
B | 30 |
C | 45 |
# Any aggregation can be used, e.g. the mean
import numpy as np
df.groupby('key').aggregate(np.mean)
data | |
---|---|
key | |
A | 5 |
B | 10 |
C | 15 |
# Mean age per sex (this uses the Titanic data; re-read titanic.csv if df was overwritten above)
df.groupby('Sex')['Age'].mean()
Sex
female 27.915709
male 30.726645
Name: Age, dtype: float64
# Survival rate per sex
df.groupby('Sex')['Survived'].mean()
Sex
female 0.742038
male 0.188908
Name: Survived, dtype: float64
Some other groupby operations
# Sample data
import numpy as np
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df
A | B | C | D | |
---|---|---|---|---|
0 | foo | one | 0.611947 | 1.221803 |
1 | bar | one | 0.688298 | 0.077745 |
2 | foo | two | -0.392495 | -0.856792 |
3 | bar | three | 0.094306 | 1.281799 |
4 | foo | two | -0.524519 | -1.288789 |
5 | bar | two | -1.507810 | 1.012624 |
6 | foo | one | -0.061693 | -1.741863 |
7 | foo | three | -0.745409 | 0.332852 |
# Aggregate with the mean and count methods
print("mean:\n",df.groupby('A').mean())
print("\ncount1:\n",df.groupby('A').count())
print("\ncount2:\n",df.groupby(['A','B']).count())
mean:
             C         D
A
bar  -0.241736  0.790723
foo  -0.222434 -0.466558

count1:
     B  C  D
A
bar  3  3  3
foo  5  5  5

count2:
           C  D
A   B
bar one    1  1
    three  1  1
    two    1  1
foo one    2  2
    three  1  1
    two    2  2
# Combine with numpy to choose the aggregation method
# df.groupby(['A','B']).aggregate(np.sum)
grouped = df.groupby(['A','B'])
grouped.aggregate(np.sum)
C | D | ||
---|---|---|---|
A | B | ||
bar | one | 0.688298 | 0.077745 |
three | 0.094306 | 1.281799 | |
two | -1.507810 | 1.012624 | |
foo | one | 0.550255 | -0.520061 |
three | -0.745409 | 0.332852 | |
two | -0.917015 | -2.145581 |
# With as_index=False the group keys stay as ordinary columns instead of becoming the index
# df.groupby(['A','B'],as_index = False).aggregate(np.sum)
grouped = df.groupby(['A','B'],as_index = False)
grouped.aggregate(np.sum)
A | B | C | D | |
---|---|---|---|---|
0 | bar | one | 0.688298 | 0.077745 |
1 | bar | three | 0.094306 | 1.281799 |
2 | bar | two | -1.507810 | 1.012624 |
3 | foo | one | 0.550255 | -0.520061 |
4 | foo | three | -0.745409 | 0.332852 |
5 | foo | two | -0.917015 | -2.145581 |
# describe shows all the summary statistics per group
grouped.describe().head()
C | D | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | |
0 | 1.0 | 0.688298 | NaN | 0.688298 | 0.688298 | 0.688298 | 0.688298 | 0.688298 | 1.0 | 0.077745 | NaN | 0.077745 | 0.077745 | 0.077745 | 0.077745 | 0.077745 |
1 | 1.0 | 0.094306 | NaN | 0.094306 | 0.094306 | 0.094306 | 0.094306 | 0.094306 | 1.0 | 1.281799 | NaN | 1.281799 | 1.281799 | 1.281799 | 1.281799 | 1.281799 |
2 | 1.0 | -1.507810 | NaN | -1.507810 | -1.507810 | -1.507810 | -1.507810 | -1.507810 | 1.0 | 1.012624 | NaN | 1.012624 | 1.012624 | 1.012624 | 1.012624 | 1.012624 |
3 | 2.0 | 0.275127 | 0.476335 | -0.061693 | 0.106717 | 0.275127 | 0.443537 | 0.611947 | 2.0 | -0.260030 | 2.095628 | -1.741863 | -1.000947 | -0.260030 | 0.480886 | 1.221803 |
4 | 1.0 | -0.745409 | NaN | -0.745409 | -0.745409 | -0.745409 | -0.745409 | -0.745409 | 1.0 | 0.332852 | NaN | 0.332852 | 0.332852 | 0.332852 | 0.332852 | 0.332852 |
# Pick exactly the statistics you need
grouped = df.groupby('A')
grouped['C'].agg([np.sum,np.mean,np.std])
sum | mean | std | |
---|---|---|---|
A | |||
bar | -0.725207 | -0.241736 | 1.135965 |
foo | -1.112169 | -0.222434 | 0.528136 |
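A more recent idiom for the same task is named aggregation (available since pandas 0.25), which lets you name the output columns directly. A brief sketch, with illustrative column names:

```python
import pandas as pd

frame = pd.DataFrame({'A': ['x', 'x', 'y'], 'C': [1.0, 3.0, 5.0]})
# Each keyword names an output column as (source column, aggregation)
out = frame.groupby('A').agg(c_sum=('C', 'sum'), c_mean=('C', 'mean'))
print(out.loc['x', 'c_sum'])    # 4.0
print(out.loc['y', 'c_mean'])   # 5.0
```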
Merge operations
Example 1
left = pd.DataFrame({'key':['K0','K1','K2','K3'],
                     'A':['A0','A1','A2','A3'],
                     'B':['B0','B1','B2','B3']})
right = pd.DataFrame({'key':['K0','K1','K2','K3'],
                      'C':['C0','C1','C2','C3'],
                      'D':['D0','D1','D2','D3']})
left
key | A | B | |
---|---|---|---|
0 | K0 | A0 | B0 |
1 | K1 | A1 | B1 |
2 | K2 | A2 | B2 |
3 | K3 | A3 | B3 |
right
key | C | D | |
---|---|---|---|
0 | K0 | C0 | D0 |
1 | K1 | C1 | D1 |
2 | K2 | C2 | D2 |
3 | K3 | C3 | D3 |
pd.merge(left,right)
key | A | B | C | D | |
---|---|---|---|---|---|
0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | A1 | B1 | C1 | D1 |
2 | K2 | A2 | B2 | C2 | D2 |
3 | K3 | A3 | B3 | C3 | D3 |
Example 2
left = pd.DataFrame({'key1':['K0','K1','K2','K3'],
                     'key2':['K0','K1','K2','K3'],
                     'A':['A0','A1','A2','A3'],
                     'B':['B0','B1','B2','B3']})
right = pd.DataFrame({'key1':['K0','K1','K2','K3'],
                      'key2':['K0','K1','K2','K4'],
                      'C':['C0','C1','C2','C3'],
                      'D':['D0','D1','D2','D3']})
left
key1 | key2 | A | B | |
---|---|---|---|---|
0 | K0 | K0 | A0 | B0 |
1 | K1 | K1 | A1 | B1 |
2 | K2 | K2 | A2 | B2 |
3 | K3 | K3 | A3 | B3 |
right
key1 | key2 | C | D | |
---|---|---|---|---|
0 | K0 | K0 | C0 | D0 |
1 | K1 | K1 | C1 | D1 |
2 | K2 | K2 | C2 | D2 |
3 | K3 | K4 | C3 | D3 |
# Note that some rows are dropped: the default is an inner join on the shared keys
pd.merge(left,right,on=['key1','key2'])
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | K1 | A1 | B1 | C1 | D1 |
2 | K2 | K2 | A2 | B2 | C2 | D2 |
# how='outer' keeps every row from both sides
pd.merge(left,right,on=['key1','key2'],how='outer')
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | K1 | A1 | B1 | C1 | D1 |
2 | K2 | K2 | A2 | B2 | C2 | D2 |
3 | K3 | K3 | A3 | B3 | NaN | NaN |
4 | K3 | K4 | NaN | NaN | C3 | D3 |
# indicator=True adds a _merge column describing each row's origin
pd.merge(left,right,on=['key1','key2'],how='outer',indicator = True)
key1 | key2 | A | B | C | D | _merge | |
---|---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 | both |
1 | K1 | K1 | A1 | B1 | C1 | D1 | both |
2 | K2 | K2 | A2 | B2 | C2 | D2 | both |
3 | K3 | K3 | A3 | B3 | NaN | NaN | left_only |
4 | K3 | K4 | NaN | NaN | C3 | D3 | right_only |
# Or keep only one side's rows, e.g. the left
pd.merge(left,right,on=['key1','key2'],how='left')
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | K1 | A1 | B1 | C1 | D1 |
2 | K2 | K2 | A2 | B2 | C2 | D2 |
3 | K3 | K3 | A3 | B3 | NaN | NaN |
# Or only the right
pd.merge(left,right,on=['key1','key2'],how='right')
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | K1 | A1 | B1 | C1 | D1 |
2 | K2 | K2 | A2 | B2 | C2 | D2 |
3 | K3 | K4 | NaN | NaN | C3 | D3 |
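One more merge detail worth knowing: when the two frames share non-key column names, pandas disambiguates them with suffixes, which you can set yourself. A small sketch (frame contents are illustrative):

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1'], 'val': [1, 2]})
right = pd.DataFrame({'key': ['K0', 'K1'], 'val': [10, 20]})
# Overlapping non-key columns get suffixes instead of colliding
merged = pd.merge(left, right, on='key', suffixes=('_l', '_r'))
print(list(merged.columns))   # ['key', 'val_l', 'val_r']
```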
Sorting
data = pd.DataFrame({'group':['a','a','a','b','b','b','c','c','c'],'data':[4,3,2,1,12,3,4,5,7]})
data
group | data | |
---|---|---|
0 | a | 4 |
1 | a | 3 |
2 | a | 2 |
3 | b | 1 |
... | ... | ... |
5 | b | 3 |
6 | c | 4 |
7 | c | 5 |
8 | c | 7 |
9 rows × 2 columns
First sort the group column in descending order, then sort data in ascending order within each group; by selects the columns to sort on, and ascending sets the direction per column: False for descending, True for ascending.
data.sort_values(by=['group','data'],ascending = [False,True],inplace=True)
data
group | data | |
---|---|---|
6 | c | 4 |
7 | c | 5 |
8 | c | 7 |
3 | b | 1 |
... | ... | ... |
4 | b | 12 |
2 | a | 2 |
1 | a | 3 |
0 | a | 4 |
9 rows × 2 columns
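After a sort like the one above, the original row order can be recovered with sort_index, since the old index labels travel with the rows. A minimal sketch (the small frame here is illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'group': ['c', 'a', 'b'], 'data': [3, 1, 2]})
shuffled = frame.sort_values(by='data')   # row order becomes 1, 2, 0
restored = shuffled.sort_index()          # back to the original order
print(list(shuffled.index))   # [1, 2, 0]
print(list(restored.index))   # [0, 1, 2]
```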
Handling missing values
data = pd.DataFrame({'k1':['one']*3+['two']*4,'k2':[3,2,1,3,3,4,4]})
data
k1 | k2 | |
---|---|---|
0 | one | 3 |
1 | one | 2 |
2 | one | 1 |
3 | two | 3 |
4 | two | 3 |
5 | two | 4 |
6 | two | 4 |
# Sort: descending
data.sort_values(by='k2',ascending = False)
k1 | k2 | |
---|---|---|
5 | two | 4 |
6 | two | 4 |
0 | one | 3 |
3 | two | 3 |
4 | two | 3 |
1 | one | 2 |
2 | one | 1 |
# Drop duplicate rows
data.drop_duplicates()
k1 | k2 | |
---|---|---|
0 | one | 3 |
1 | one | 2 |
2 | one | 1 |
3 | two | 3 |
5 | two | 4 |
# Consider only one column when checking for duplicates; the first occurrence per k1 is kept
data.drop_duplicates(subset='k1')
k1 | k2 | |
---|---|---|
0 | one | 3 |
3 | two | 3 |
# Add a new column with the assign function
df = pd.DataFrame({'data1':np.random.randn(5),'data2':np.random.randn(5)})
df2 = df.assign(ration = df['data1']/df['data2'])
df2
data1 | data2 | ration | |
---|---|---|---|
0 | 1.046754 | -0.305594 | -3.425314 |
1 | 1.060983 | -0.751471 | -1.411874 |
2 | -0.633877 | -1.355632 | 0.467588 |
3 | -1.036392 | 1.615449 | -0.641551 |
4 | 0.481765 | -0.981919 | -0.490636 |
# Detect missing values with isnull
df = pd.DataFrame([range(3),[0, np.nan,0],[0,0,np.nan],range(3)])
df.isnull()
0 | 1 | 2 | |
---|---|---|---|
0 | False | False | False |
1 | False | True | False |
2 | False | False | True |
3 | False | False | False |
# With a lot of data, check per column or per row whether any value is missing
# Per column
df.isnull().any()
0 False
1 True
2 True
dtype: bool
# Per row
df.isnull().any(axis = 1)
0 False
1 True
2 True
3 False
dtype: bool
# Fill in the missing values
df.fillna(100)
0 | 1 | 2 | |
---|---|---|---|
0 | 0 | 1.0 | 2.0 |
1 | 0 | 100.0 | 0.0 |
2 | 0 | 0.0 | 100.0 |
3 | 0 | 1.0 | 2.0 |
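Beyond filling with a constant, two other common patterns are filling each column with its own mean and dropping incomplete rows. A short sketch under illustrative data:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.0, np.nan, 3.0]})
filled = frame.fillna(frame.mean())   # fill each column with that column's mean
print(filled['a'].tolist())   # [1.0, 2.0, 3.0]
print(len(frame.dropna()))    # 2 -- or drop rows containing any missing value
```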
Custom functions with apply
Write the function first, then pass it to apply; it is then applied to every row (or column) of the data.
data = pd.DataFrame({'food':['A1','A2','B1','B2','B3','C1','C2'],'data':[1,2,3,4,5,6,7]})
data
food | data | |
---|---|---|
0 | A1 | 1 |
1 | A2 | 2 |
2 | B1 | 3 |
3 | B2 | 4 |
4 | B3 | 5 |
5 | C1 | 6 |
6 | C2 | 7 |
# Use apply to map values row by row
def food_map(series):
    if series['food'] == 'A1':
        return 'A'
    elif series['food'] == 'A2':
        return 'A'
    elif series['food'] == 'B1':
        return 'B'
    elif series['food'] == 'B2':
        return 'B'
    elif series['food'] == 'B3':
        return 'B'
    elif series['food'] == 'C1':
        return 'C'
    elif series['food'] == 'C2':
        return 'C'
data['food_map'] = data.apply(food_map,axis = 'columns')
data
food | data | food_map | |
---|---|---|---|
0 | A1 | 1 | A |
1 | A2 | 2 | A |
2 | B1 | 3 | B |
3 | B2 | 4 | B |
4 | B3 | 5 | B |
5 | C1 | 6 | C |
6 | C2 | 7 | C |
# Use map with a dict to do the same kind of mapping
food2Upper = {'A1':'a1','A2':'a2','B1':'b1','B2':'b2','B3':'b3','C1':'c1','C2':'c2'}
data['upper'] = data['food'].map(food2Upper)
data
food | data | food_map | upper | |
---|---|---|---|---|
0 | A1 | 1 | A | a1 |
1 | A2 | 2 | A | a2 |
2 | B1 | 3 | B | b1 |
3 | B2 | 4 | B | b2 |
4 | B3 | 5 | B | b3 |
5 | C1 | 6 | C | c1 |
6 | C2 | 7 | C | c2 |
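map also accepts a plain function, which can replace the long if/elif chain above; here the letter prefix of each food code reproduces food_map's behavior (this shorthand assumes the group is always the first character, as it is in this data):

```python
import pandas as pd

data = pd.DataFrame({'food': ['A1', 'A2', 'B1', 'C2']})
# Taking the letter prefix reproduces the if/elif chain in food_map
data['food_map'] = data['food'].map(lambda s: s[0])
print(data['food_map'].tolist())   # ['A', 'A', 'B', 'C']
```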
Working with time data
# Create a timestamp
ts = pd.Timestamp('2021-1-21')
ts
Timestamp('2021-01-21 00:00:00')
# Print the month
print("ts.month:",ts.month)
print("ts.month_name",ts.month_name())
ts.month: 1
ts.month_name January
# Print the day
print("ts.day:",ts.day)
print("ts.day_name",ts.day_name())
ts.day: 21
ts.day_name Thursday
s = pd.Series(['2017-11-24 00:00:00','2017-11-25 00:00:00','2017-11-26 00:00:00'])
s
0 2017-11-24 00:00:00
1 2017-11-25 00:00:00
2 2017-11-26 00:00:00
dtype: object
# Convert to the standard datetime format
ts = pd.to_datetime(s)
ts
0 2017-11-24
1 2017-11-25
2 2017-11-26
dtype: datetime64[ns]
ts.dt.hour
0 0
1 0
2 0
dtype: int64
ts.dt.weekday
0 4
1 5
2 6
dtype: int64
# Build a time feature of your own: a fixed-frequency range of timestamps
pd.Series(pd.date_range(start='2017-11-24',periods = 10,freq = '12H'))
0 2017-11-24 00:00:00
1 2017-11-24 12:00:00
2 2017-11-25 00:00:00
3   2017-11-25 12:00:00
            ...
6 2017-11-27 00:00:00
7 2017-11-27 12:00:00
8 2017-11-28 00:00:00
9 2017-11-28 12:00:00
Length: 10, dtype: datetime64[ns]
# To use the time column as the index, set parse_dates=True (index_col picks the column)
df = pd.read_csv('./data/flowdata.csv',index_col=0,parse_dates=True)
df
L06_347 | LS06_347 | LS06_348 | |
---|---|---|---|
Time | |||
2009-01-01 00:00:00 | 0.137417 | 0.097500 | 0.016833 |
2009-01-01 03:00:00 | 0.131250 | 0.088833 | 0.016417 |
2009-01-01 06:00:00 | 0.113500 | 0.091250 | 0.016750 |
2009-01-01 09:00:00 | 0.135750 | 0.091500 | 0.016250 |
... | ... | ... | ... |
2013-01-01 15:00:00 | 1.420000 | 1.420000 | 0.096333 |
2013-01-01 18:00:00 | 1.178583 | 1.178583 | 0.083083 |
2013-01-01 21:00:00 | 0.898250 | 0.898250 | 0.077167 |
2013-01-02 00:00:00 | 0.860000 | 0.860000 | 0.075000 |
11697 rows × 3 columns
# With a datetime index the data can be sliced by time
df[pd.Timestamp('2012-01-01 09:00'):pd.Timestamp('2012-01-11 19:00')]
L06_347 | LS06_347 | LS06_348 | |
---|---|---|---|
Time | |||
2012-01-01 09:00:00 | 0.330750 | 0.293583 | 0.029750 |
2012-01-01 12:00:00 | 0.295000 | 0.285167 | 0.031750 |
2012-01-01 15:00:00 | 0.301417 | 0.287750 | 0.031417 |
2012-01-01 18:00:00 | 0.322083 | 0.304167 | 0.038083 |
... | ... | ... | ... |
2012-01-11 09:00:00 | 0.190833 | 0.208833 | 0.022000 |
2012-01-11 12:00:00 | 0.195417 | 0.206167 | 0.022750 |
2012-01-11 15:00:00 | 0.182083 | 0.204083 | 0.022417 |
2012-01-11 18:00:00 | 0.170583 | 0.202750 | 0.022083 |
84 rows × 3 columns
# Select the year 2012 (this __getitem__ form is removed in recent pandas; use df.loc['2012'] there)
df['2012']
L06_347 | LS06_347 | LS06_348 | |
---|---|---|---|
Time | |||
2012-01-01 00:00:00 | 0.307167 | 0.273917 | 0.028000 |
2012-01-01 03:00:00 | 0.302917 | 0.270833 | 0.030583 |
2012-01-01 06:00:00 | 0.331500 | 0.284750 | 0.030917 |
2012-01-01 09:00:00 | 0.330750 | 0.293583 | 0.029750 |
... | ... | ... | ... |
2012-12-31 12:00:00 | 0.651250 | 0.651250 | 0.063833 |
2012-12-31 15:00:00 | 0.629000 | 0.629000 | 0.061833 |
2012-12-31 18:00:00 | 0.617333 | 0.617333 | 0.060583 |
2012-12-31 21:00:00 | 0.846500 | 0.846500 | 0.170167 |
2928 rows × 3 columns
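Partial-string indexing like this can be checked on a small synthetic series (the values below are illustrative, not the flowdata.csv readings). Note that `.loc` is used for the year slice, since plain `df['2012']` row selection has been removed in recent pandas versions:

```python
import numpy as np
import pandas as pd

# 3-hourly synthetic readings spanning the 2011/2012 boundary
idx = pd.date_range('2011-12-30', '2012-01-03', freq='3h')
ts = pd.DataFrame({'flow': np.arange(len(idx), dtype=float)}, index=idx)

# Partial-string indexing: the string '2012' matches every timestamp in that year
year_2012 = ts.loc['2012']
print(year_2012.index.min())  # first timestamp that falls in 2012
```

Any prefix of an ISO timestamp works the same way, e.g. `ts.loc['2012-01']` for a month or `ts.loc['2012-01-02']` for a single day.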
# Slice a specific month
df['2013-01':'2013-01']
L06_347 | LS06_347 | LS06_348 | |
---|---|---|---|
Time | |||
2013-01-01 00:00:00 | 1.688333 | 1.688333 | 0.207333 |
2013-01-01 03:00:00 | 2.693333 | 2.693333 | 0.201500 |
2013-01-01 06:00:00 | 2.220833 | 2.220833 | 0.166917 |
2013-01-01 09:00:00 | 2.055000 | 2.055000 | 0.175667 |
... | ... | ... | ... |
2013-01-01 15:00:00 | 1.420000 | 1.420000 | 0.096333 |
2013-01-01 18:00:00 | 1.178583 | 1.178583 | 0.083083 |
2013-01-01 21:00:00 | 0.898250 | 0.898250 | 0.077167 |
2013-01-02 00:00:00 | 0.860000 | 0.860000 | 0.075000 |
9 rows × 3 columns
# Finer-grained filtering using components of the index
df[(df.index.hour > 8) & (df.index.hour < 12) & (df.index.day == 1) & (df.index.month == 1)]
L06_347 | LS06_347 | LS06_348 | |
---|---|---|---|
Time | |||
2009-01-01 09:00:00 | 0.135750 | 0.091500 | 0.016250 |
2010-01-01 09:00:00 | 0.448167 | 0.524583 | 0.052000 |
2011-01-01 09:00:00 | 0.493167 | 0.502667 | NaN |
2012-01-01 09:00:00 | 0.330750 | 0.293583 | 0.029750 |
2013-01-01 09:00:00 | 2.055000 | 2.055000 | 0.175667 |
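The same attribute-based filtering can be sketched on a small synthetic frame (invented values, not the flowdata readings), which makes it easy to confirm which hours survive the mask:

```python
import numpy as np
import pandas as pd

# Hourly synthetic readings covering the first two days of January 2012
idx = pd.date_range('2012-01-01', periods=48, freq='h')
df = pd.DataFrame({'flow': np.ones(len(idx))}, index=idx)

# Keep rows between 09:00 and 11:00 on January 1st, as in the example above
mask = (df.index.hour > 8) & (df.index.hour < 12) & \
       (df.index.day == 1) & (df.index.month == 1)
subset = df[mask]
print(subset.index.tolist())  # the 09:00, 10:00 and 11:00 rows of Jan 1
```

Each `df.index.<attr>` expression yields a boolean array, so the conditions combine with `&` just like ordinary column filters.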
The resample function
# resample('D') regroups the 3-hourly readings into daily bins, mean() averages each bin, and head() shows the first five days
df.resample('D').mean().head()
L06_347 | LS06_347 | LS06_348 | |
---|---|---|---|
Time | |||
2009-01-01 | 0.125010 | 0.092281 | 0.016635 |
2009-01-02 | 0.124146 | 0.095781 | 0.016406 |
2009-01-03 | 0.113562 | 0.085542 | 0.016094 |
2009-01-04 | 0.140198 | 0.102708 | 0.017323 |
2009-01-05 | 0.128812 | 0.104490 | 0.018167 |
# Resample with a 10-day period instead
df.resample('10D').mean().head()
L06_347 | LS06_347 | LS06_348 | |
---|---|---|---|
Time | |||
2009-01-01 | 0.113854 | 0.085479 | 0.015616 |
2009-01-11 | 0.411610 | 0.453808 | 0.042764 |
2009-01-21 | 1.058358 | 1.107352 | 0.080539 |
2009-01-31 | 0.320679 | 0.295744 | 0.038614 |
2009-02-10 | 0.906675 | 0.988781 | 0.071100 |
# Aggregate by month ('M' labels each bin with the month end)
df.resample('M').mean().head()
L06_347 | LS06_347 | LS06_348 | |
---|---|---|---|
Time | |||
2009-01-31 | 0.517864 | 0.536660 | 0.045597 |
2009-02-28 | 0.516847 | 0.529987 | 0.047238 |
2009-03-31 | 0.373157 | 0.383172 | 0.037508 |
2009-04-30 | 0.163182 | 0.129354 | 0.021356 |
2009-05-31 | 0.178588 | 0.160616 | 0.020744 |
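The daily resampling above can be verified on a tiny synthetic series (the numbers are made up for the sketch): each calendar day collects its eight 3-hourly readings and `mean()` averages them.

```python
import numpy as np
import pandas as pd

# Two days of 3-hourly synthetic readings: 8 values per day
idx = pd.date_range('2009-01-01', periods=16, freq='3h')
ts = pd.Series(np.arange(16, dtype=float), index=idx)

# resample('D') produces one bin per calendar day; mean() averages each bin
daily = ts.resample('D').mean()
print(daily)  # day 1 averages 0..7, day 2 averages 8..15
```

The same pattern works with other aggregations (`sum()`, `max()`, `agg([...])`) and other rules such as `'10D'` or monthly bins.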
Plotting
These are just quick plots; for polished, publication-quality figures you would normally use the matplotlib package directly.
%matplotlib inline
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4).cumsum(0), index=np.arange(0, 100, 10), columns=['A', 'B', 'C', 'D'])
df.plot()
df
A | B | C | D | |
---|---|---|---|---|
0 | -2.089443 | 1.089140 | 1.248430 | 0.082829 |
10 | -3.649157 | -0.520943 | 1.173340 | -1.307401 |
20 | -3.387452 | -1.713291 | 0.212164 | -1.434823 |
30 | -1.511931 | -3.020731 | -0.334770 | -2.991320 |
... | ... | ... | ... | ... |
60 | -0.027079 | -4.617245 | -0.540804 | -4.774767 |
70 | -2.743985 | -4.534934 | -1.393246 | -5.132394 |
80 | -3.336041 | -5.283589 | -2.202856 | -6.250555 |
90 | -1.680583 | -4.269342 | -1.783487 | -5.914622 |
10 rows × 4 columns
# You can choose the kind of plot: scatter, bar, and so on
df = pd.DataFrame(np.random.rand(6, 4), index = ['one', 'two', 'three', 'four', 'five', 'six'], columns = pd.Index(['A', 'B', 'C', 'D'], name = 'Genus'))
df.plot(kind='bar')
df
Genus | A | B | C | D |
---|---|---|---|---|
one | 0.706566 | 0.326933 | 0.169319 | 0.950354 |
two | 0.507864 | 0.763742 | 0.271822 | 0.750882 |
three | 0.274819 | 0.916360 | 0.040567 | 0.688893 |
four | 0.498586 | 0.390705 | 0.545891 | 0.636310 |
five | 0.145746 | 0.681933 | 0.411809 | 0.314657 |
six | 0.721254 | 0.289420 | 0.362825 | 0.218710 |
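A minimal, self-contained version of the bar plot can be sketched as follows (random values; the `Agg` backend is used here so the script also runs without a display):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend, no window required
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(3, 2),
                  index=['one', 'two', 'three'],
                  columns=['A', 'B'])

# plot() returns a matplotlib Axes, so the figure can be customized afterwards
ax = df.plot(kind='bar')
ax.set_ylabel('value')
```

One bar is drawn per cell, so a 3-row, 2-column frame produces six bars grouped by index label.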
tips = pd.read_csv('./tips.csv')
tips.total_bill.plot(kind='hist',bins=50)
tips
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
... | ... | ... | ... | ... | ... | ... | ... |
240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
244 rows × 7 columns
df = pd.read_csv('./macrodata.csv')
data = df[['quarter','realgdp','realcons']]
data.plot.scatter('quarter','realgdp')
data
quarter | realgdp | realcons | |
---|---|---|---|
0 | 1.0 | 2710.349 | 1707.4 |
1 | 2.0 | 2778.801 | 1733.7 |
2 | 3.0 | 2775.488 | 1751.8 |
3 | 4.0 | 2785.204 | 1753.7 |
... | ... | ... | ... |
199 | 4.0 | 13141.920 | 9195.3 |
200 | 1.0 | 12925.410 | 9209.2 |
201 | 2.0 | 12901.504 | 9189.0 |
202 | 3.0 | 12990.341 | 9256.0 |
203 rows × 3 columns
Working with large datasets
# The file is fairly large, so even loading it takes a moment
gl = pd.read_csv('./game_logs.csv')
gl.head()
E:\anacanda\lib\site-packages\IPython\core\interactiveshell.py:3063: DtypeWarning: Columns (12,13,14,15,19,20,81,83,85,87,93,94,95,96,97,98,99,100,105,106,108,109,111,112,114,115,117,118,120,121,123,124,126,127,129,130,132,133,135,136,138,139,141,142,144,145,147,148,150,151,153,154,156,157,160) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
date | number_of_game | day_of_week | v_name | v_league | v_game_number | h_name | h_league | h_game_number | v_score | ... | h_player_7_name | h_player_7_def_pos | h_player_8_id | h_player_8_name | h_player_8_def_pos | h_player_9_id | h_player_9_name | h_player_9_def_pos | additional_info | acquisition_info | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 18710504 | 0 | Thu | CL1 | na | 1 | FW1 | na | 1 | 0 | ... | Ed Mincher | 7.0 | mcdej101 | James McDermott | 8.0 | kellb105 | Bill Kelly | 9.0 | NaN | Y |
1 | 18710505 | 0 | Fri | BS1 | na | 1 | WS3 | na | 1 | 20 | ... | Asa Brainard | 1.0 | burrh101 | Henry Burroughs | 9.0 | berth101 | Henry Berthrong | 8.0 | HTBF | Y |
2 | 18710506 | 0 | Sat | CL1 | na | 2 | RC1 | na | 1 | 12 | ... | Pony Sager | 6.0 | birdg101 | George Bird | 7.0 | stirg101 | Gat Stires | 9.0 | NaN | Y |
3 | 18710508 | 0 | Mon | CL1 | na | 3 | CH1 | na | 1 | 12 | ... | Ed Duffy | 6.0 | pinke101 | Ed Pinkham | 5.0 | zettg101 | George Zettlein | 1.0 | NaN | Y |
4 | 18710509 | 0 | Tue | BS1 | na | 2 | TRO | na | 1 | 9 | ... | Steve Bellan | 5.0 | pikel101 | Lip Pike | 3.0 | cravb101 | Bill Craver | 6.0 | HTBF | Y |
5 rows × 161 columns
# Dimensions of the dataset
gl.shape
(171907, 161)
# memory_usage='deep' makes info() report the true memory footprint, including object contents (this ran noticeably slowly)
gl.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171907 entries, 0 to 171906
Columns: 161 entries, date to acquisition_info
dtypes: float64(77), int64(6), object(78)
memory usage: 860.5 MB
The DataFrame occupies 860.5 MB of memory once loaded.
for dtype in ['float64', 'object', 'int64']:
    selected_dtype = gl.select_dtypes(include=[dtype])
    mean_usage_b = selected_dtype.memory_usage(deep=True).mean()
    mean_usage_mb = mean_usage_b / 1024 ** 2
    print("Average memory usage for {} columns: {:03.2f} MB".format(dtype, mean_usage_mb))
Average memory usage for float64 columns: 1.29 MB
Average memory usage for object columns: 9.51 MB
Average memory usage for int64 columns: 1.12 MB
The loop visits the three dtypes and uses select_dtypes to pick out the columns of each one. Object columns clearly use the most memory, while the int64 and float64 columns are comparable. By converting these three dtypes we can make the data load and process noticeably faster.
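The per-dtype inspection can be reproduced on a small synthetic frame (the column names and sizes here are made up for the sketch). Object columns hold full Python strings, so `memory_usage(deep=True)` reports them as far more expensive than fixed-width numeric columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': np.arange(1000, dtype='int64'),
    'b': np.arange(1000, dtype='float64'),
})
# dtype=object is forced explicitly so the column stays object
# even on pandas versions that default strings to a string dtype
df['c'] = pd.Series(['row' + str(i) for i in range(1000)], dtype=object)

obj_mb = df.select_dtypes(include=['object']).memory_usage(deep=True).mean() / 1024 ** 2
int_mb = df.select_dtypes(include=['int64']).memory_usage(deep=True).mean() / 1024 ** 2
print(obj_mb, int_mb)  # the object column costs far more per value
```

Without `deep=True`, pandas only counts the 8-byte pointers in an object column and badly underestimates its real footprint.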
First, let's look at the value ranges of the integer types (int64, int32, and so on).
int_types = ["uint8", "int8", "int16","int32","int64"]
for it in int_types:
    print(np.iinfo(it))
Machine parameters for uint8
---------------------------------------------------------------
min = 0
max = 255
---------------------------------------------------------------
Machine parameters for int8
---------------------------------------------------------------
min = -128
max = 127
---------------------------------------------------------------
Machine parameters for int16
---------------------------------------------------------------
min = -32768
max = 32767
---------------------------------------------------------------
Machine parameters for int32
---------------------------------------------------------------
min = -2147483648
max = 2147483647
---------------------------------------------------------------
Machine parameters for int64
---------------------------------------------------------------
min = -9223372036854775808
max = 9223372036854775807
---------------------------------------------------------------
def mem_usage(pandas_obj):
    if isinstance(pandas_obj, pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else:  # we assume if not a df it's a series
        usage_b = pandas_obj.memory_usage(deep=True)
    usage_mb = usage_b / 1024 ** 2  # convert bytes to megabytes
    return "{:03.2f} MB".format(usage_mb)

gl_int = gl.select_dtypes(include=['int64'])
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_numeric.html
converted_int = gl_int.apply(pd.to_numeric, downcast='unsigned')
print(mem_usage(gl_int))
print(mem_usage(converted_int))
7.87 MB
1.48 MB
7.87 MB is the footprint when all the integer columns are stored as int64.
1.48 MB is the footprint after downcasting them to the smallest sufficient unsigned type.
The mem_usage() helper simply reports the memory footprint of whatever DataFrame or Series it is given.
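The downcast behaviour is easy to see on a single small Series (invented values): `pd.to_numeric` with `downcast='unsigned'` picks the narrowest unsigned type that can hold every value.

```python
import pandas as pd

# Values that fit comfortably in uint8 but are stored as int64 by default
s = pd.Series([0, 5, 120, 255], dtype='int64')

small = pd.to_numeric(s, downcast='unsigned')
print(s.dtype, '->', small.dtype)  # int64 -> uint8
```

The values are unchanged; only the storage width shrinks, from 8 bytes per element to 1. If any value were negative, `downcast='unsigned'` would leave the dtype alone, and `downcast='integer'` would be the option to try instead.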
gl_float = gl.select_dtypes(include=['float64'])
converted_float = gl_float.apply(pd.to_numeric, downcast='float')
print(mem_usage(gl_float))
print(mem_usage(converted_float))
100.99 MB
50.49 MB
Downcasting the float columns roughly halves their memory usage.
gl_obj = gl.select_dtypes(include=['object']).copy()
converted_obj = pd.DataFrame()
for col in gl_obj.columns:
    num_unique_values = len(gl_obj[col].unique())
    num_total_values = len(gl_obj[col])
    if num_unique_values / num_total_values < 0.5:
        converted_obj.loc[:, col] = gl_obj[col].astype('category')
    else:
        converted_obj.loc[:, col] = gl_obj[col]
print(mem_usage(gl_obj))
print(mem_usage(converted_obj))
751.64 MB
51.67 MB
For each object column we compute the ratio of unique values to total values: if fewer than half the values are unique, the column is converted to the category dtype; otherwise it is left unchanged.
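Why this helps can be shown on a single low-cardinality column (synthetic data): category stores each distinct string once plus a small integer code per row, instead of one full Python string per row.

```python
import pandas as pd

# 30,000 rows but only 3 distinct values -- well under the 0.5 ratio
days = pd.Series(['Mon', 'Tue', 'Wed'] * 10000, dtype=object)

as_cat = days.astype('category')

# The categorical version keeps 3 strings plus compact integer codes,
# so it is dramatically smaller than the object version
print(days.memory_usage(deep=True), as_cat.memory_usage(deep=True))
```

For high-cardinality columns (many unique strings) the codes-plus-categories layout saves little or nothing, which is exactly why the loop above skips columns whose unique ratio is 0.5 or more.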