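25 pandas Tricks

Contents

Check the versions of pandas and its dependencies
Create a DataFrame
Rename columns
Reverse row order
Reverse column order
Select columns by data type
Convert strings to numbers
Reduce DataFrame size
Build a DataFrame from multiple files (row-wise)
Build a DataFrame from multiple files (column-wise)
Create a DataFrame from the clipboard
Split a DataFrame into two random subsets
Filter a DataFrame by multiple categories
Filter a DataFrame by largest categories
Handle missing values
Split a string into multiple columns
Expand a Series of lists into a DataFrame
Aggregate by multiple functions
Combine the output of an aggregation with a DataFrame
Select a slice of rows and columns
Reshape a MultiIndexed Series
Create a pivot table
Convert continuous data into categorical data
Change display options
Style a DataFrame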

https://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/top_25_pandas_tricks.ipynb

import pandas as pd
import numpy as np

The original notebook loads its datasets from bit.ly short links. Those were unreliable for me, so I downloaded the files and put them all in the data directory.

drinks = pd.read_csv('data/drinks.csv',encoding="utf-8")
movies = pd.read_csv('data/imdb_1000.csv')
orders = pd.read_csv('data/chipotle.tsv', sep='\t')
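# regex=False treats '$' as a literal dollar sign rather than a regular-expression anchor
orders['item_price'] = orders.item_price.str.replace('$', '', regex=False).astype('float')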
stocks = pd.read_csv('data/stocks.csv', parse_dates=['Date'])
titanic = pd.read_csv('data/titanic_train.csv')
ufo = pd.read_csv('data/ufo.csv', parse_dates=['Time'])

1. Check the versions of pandas and its dependencies

Use pd.__version__ to check the pandas version.

pd.__version__

'0.23.0'

To see the versions of all of pandas' dependencies, use the show_versions() function. It reports the versions of Python, pandas, NumPy, matplotlib, and so on.

pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.0
pytest: 3.5.1
pip: 19.2.3
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

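2. Create a DataFrame

There are many ways to create a DataFrame. One is to pass a dictionary to the DataFrame constructor: the dictionary keys become the column names, and the values are lists holding each column's data.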

df = pd.DataFrame({'列 1':[100,200],'列 2':[300,400]})
df

	列 1	列 2
0	100	300
1	200	400

For a larger DataFrame the dictionary approach means too much typing. Instead, call NumPy's random.rand() function with the number of rows and columns, and pass the result to the DataFrame constructor.

# np.random.rand() takes two arguments here: rows first, then columns
pd.DataFrame(np.random.rand(5, 8))

	0	1	2	3	4	5	6	7
0	0.574532	0.848912	0.553054	0.287474	0.736865	0.735210	0.228698	0.178829
1	0.409149	0.444497	0.993602	0.996009	0.808754	0.879729	0.739526	0.957843
2	0.056604	0.254460	0.212828	0.539109	0.086704	0.300498	0.065166	0.306231
3	0.316446	0.721824	0.476582	0.680902	0.297369	0.823183	0.801095	0.580427
4	0.795505	0.329867	0.169955	0.436137	0.064809	0.959068	0.022464	0.704460

This produces a DataFrame, but if you want non-numeric column names, convert a string into a list of characters and pass that list to the columns parameter.

pd.DataFrame(np.random.rand(5, 8), columns=list('一二三四五六七八'))

	一	二	三	四	五	六	七	八
0	0.093988	0.559703	0.391193	0.024522	0.781299	0.691624	0.858546	0.916399
1	0.459453	0.460284	0.204827	0.411720	0.737781	0.290906	0.448626	0.431645
2	0.067880	0.730378	0.265344	0.185587	0.390899	0.172026	0.112606	0.484074
3	0.158781	0.469606	0.913319	0.682071	0.819861	0.135211	0.434149	0.737850
4	0.419152	0.595088	0.758856	0.731311	0.725668	0.680243	0.650883	0.424595

Note that the number of characters in the string must match the number of columns in the DataFrame.

3. Rename columns

df

	列 1	列 2
0	100	300
1	200	400

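Selecting a pandas column with dot notation (.) is convenient, but it does not work when the column name contains a space.

The rename() method is the most flexible way to change column names. It takes a dictionary whose keys are the old names and whose values are the new names, and you can also specify the axis.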

df = df.rename({'列 1':'列1','列 2':'列2'}, axis='columns')
df

	列1	列2
0	100	300
1	200	400

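The advantage of this approach is that it can rename any number of columns: one, several, or all of them.

A simpler way to rename all columns at once is to assign directly to the columns attribute.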

df.columns = ['列 1','列 2']
df

	列 1	列 2
0	100	300
1	200	400

If you only want to replace the spaces in the column names, there is an even simpler option: call str.replace() on the columns, which saves typing out every name.

df.columns = df.columns.str.replace(' ', '_')
df

	列_1	列_2
0	100	300
1	200	400

All three approaches change the column names.

Use add_prefix() and add_suffix() to add a prefix or suffix to every column name.

df.add_prefix('X_')

	X_列_1	X_列_2
0	100	300
1	200	400

df.add_suffix('_Y')

	列_1_Y	列_2_Y
0	100	300
1	200	400

4. Reverse row order

Reverse the row order of the drinks table.

drinks.head()

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
0	Afghanistan	0	0	0	0.0	Asia
1	Albania	89	132	54	4.9	Europe
2	Algeria	25	0	14	0.7	Africa
3	Andorra	245	138	312	12.4	Europe
4	Angola	217	57	45	5.9	Africa

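This dataset lists average alcohol consumption by country. What if you want to reverse the order of the rows?

The most direct way is to pass ::-1 to the loc accessor, the same slice notation used to reverse a Python list.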

drinks.loc[::-1].head()

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
192	Zimbabwe	64	18	4	4.7	Africa
191	Zambia	32	19	4	2.5	Africa
190	Yemen	6	0	0	0.1	Asia
189	Vietnam	111	2	1	2.0	Asia
188	Venezuela	333	100	3	7.7	South America

To renumber the index starting from 0, use reset_index() with drop=True to discard the old index.

drinks.loc[::-1].reset_index(drop=True).head()

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
0	Zimbabwe	64	18	4	4.7	Africa
1	Zambia	32	19	4	2.5	Africa
2	Yemen	6	0	0	0.1	Asia
3	Vietnam	111	2	1	2.0	Asia
4	Venezuela	333	100	3	7.7	South America

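The row order is now reversed and the index has been reset to the default.

5. Reverse column order

As with reversing rows, loc can also reverse the columns from left to right.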

drinks.loc[:,::-1].head()

	continent	total_litres_of_pure_alcohol	wine_servings	spirit_servings	beer_servings	country
0	Asia	0.0	0	0	0	Afghanistan
1	Europe	4.9	54	132	89	Albania
2	Africa	0.7	14	0	25	Algeria
3	Europe	12.4	312	138	245	Andorra
4	Africa	5.9	45	57	217	Angola

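The colon before the comma selects all rows, and the ::-1 after the comma reverses the columns, so country ends up on the far right.

6. Select columns by data type

First, look at the data types of drinks: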
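The two reversals above can also be written by position with iloc. The following is a minimal sketch on a hypothetical toy table (the names toy, rows_reversed, and cols_reversed are made up for illustration); it assumes you want the reversal to be independent of the index labels.

```python
import pandas as pd

# Hypothetical toy table standing in for drinks, with a non-default index
toy = pd.DataFrame(
    {'country': ['A', 'B', 'C'], 'beer': [1, 2, 3]},
    index=['x', 'y', 'z'],
)

# iloc slices by position, so the reversal does not depend on index labels
rows_reversed = toy.iloc[::-1].reset_index(drop=True)
cols_reversed = toy.iloc[:, ::-1]

print(rows_reversed)
print(cols_reversed)
```

Both calls return new objects; toy itself keeps its original order.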

drinks.dtypes

country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object

To select all numeric columns, use the select_dtypes() method.

drinks.select_dtypes(include='number').head()

	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol
0	0	0	0	0.0
1	89	132	54	4.9
2	25	0	14	0.7
3	245	138	312	12.4
4	217	57	45	5.9

The same method can select all string (object) columns.

drinks.select_dtypes(include='object').head()

	country	continent
0	Afghanistan	Asia
1	Albania	Europe
2	Algeria	Africa
3	Andorra	Europe
4	Angola	Africa

Likewise, 'datetime' selects datetime columns.

Pass a list to select columns of several types at once.

drinks.select_dtypes(include=['number','object','category','datetime']).head()

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
0	Afghanistan	0	0	0	0.0	Asia
1	Albania	89	132	54	4.9	Europe
2	Algeria	25	0	14	0.7	Africa
3	Andorra	245	138	312	12.4	Europe
4	Angola	217	57	45	5.9	Africa

You can also use the exclude keyword to leave out specific data types.

drinks.select_dtypes(exclude='number').head()

	country	continent
0	Afghanistan	Asia
1	Albania	Europe
2	Algeria	Africa
3	Andorra	Europe
4	Angola	Africa
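The drinks table has no datetime column, so here is a small sketch on a hypothetical toy DataFrame (the column names are invented for illustration) that shows include and exclude on a frame with mixed types.

```python
import pandas as pd

# Hypothetical frame with string, integer, float, and datetime columns
toy = pd.DataFrame({
    'name': ['a', 'b'],
    'count': [1, 2],
    'ratio': [0.5, 1.5],
    'when': pd.to_datetime(['2020-01-01', '2020-01-02']),
})

numeric_cols = toy.select_dtypes(include='number').columns.tolist()
datetime_cols = toy.select_dtypes(include='datetime').columns.tolist()
non_numeric_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(datetime_cols)
print(non_numeric_cols)
```

The returned columns keep their original left-to-right order.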

7. Convert strings to numbers

Create another example DataFrame.

df = pd.DataFrame({'列1':['1.1','2.2','3.3'],'列2':['4.4','5.5','6.6'],'列3':['7.7','8.8','-']})
df

	列1	列2	列3
0	1.1	4.4	7.7
1	2.2	5.5	8.8
2	3.3	6.6	-

The numbers in this DataFrame are actually stored as strings, so the column type is object.

df.dtypes

列1    object
列2    object
列3    object
dtype: object

To do arithmetic on these columns, first convert them to a numeric type. The code below uses astype() to convert the first two columns to float.

df.astype({'列1':'float','列2':'float'}).dtypes

列1    float64
列2    float64
列3     object
dtype: object

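Converting the third column this way raises an error, because it contains a dash (-) that stands for 0 and pandas does not know how to interpret it.

To handle this, use the to_numeric() function on the third column and have pandas convert any invalid input to NaN with errors='coerce'.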

pd.to_numeric(df.列3, errors='coerce')

0    7.7
1    8.8
2    NaN
Name: 列3, dtype: float64

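Since the missing value here stands for 0, fill the NaN with fillna(). The code below applies to_numeric() to the whole DataFrame and then fills in the zeros.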

df = df.apply(pd.to_numeric, errors='coerce').fillna(0)
df

	列1	列2	列3
0	1.1	4.4	7.7
1	2.2	5.5	8.8
2	3.3	6.6	0.0

One line of code takes care of it, and every column is now float.

df.dtypes

列1    float64
列2    float64
列3    float64
dtype: object
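One thing to keep in mind with errors='coerce' followed by fillna(0) is that every unparseable value becomes 0, not just the dash. As a hedged alternative, the sketch below replaces only the known placeholder first, so that other bad values stay visible as NaN. The Series raw and its 'oops' entry are made up for illustration.

```python
import pandas as pd

# Hypothetical column: '-' is a known placeholder for 0, 'oops' is a real data error
raw = pd.Series(['7.7', '8.8', '-', 'oops'])

# Replace only exact '-' entries, then coerce whatever is still invalid to NaN
cleaned = pd.to_numeric(raw.replace('-', '0'), errors='coerce')

print(cleaned)
```

After this, cleaned.isna() points at the rows that still need attention.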

8. Reduce DataFrame size

A pandas DataFrame is designed to be held in memory, so sometimes you need to shrink a DataFrame to reduce its memory footprint.

Here is the memory used by drinks.

drinks.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
country                         193 non-null object
beer_servings                   193 non-null int64
spirit_servings                 193 non-null int64
wine_servings                   193 non-null int64
total_litres_of_pure_alcohol    193 non-null float64
continent                       193 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 30.5 KB

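drinks uses 30.5 KB of memory.

A large DataFrame can hurt performance or even fail to load into memory. A few simple steps reduce memory use while reading the data.

The first step is to read only the columns you actually need, by specifying the usecols parameter.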

cols = ['beer_servings','continent']
small_drinks = pd.read_csv('data/drinks.csv', usecols=cols)
small_drinks.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 2 columns):
beer_servings    193 non-null int64
continent        193 non-null object
dtypes: int64(1), object(1)
memory usage: 13.7 KB

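With only two columns selected, the DataFrame's memory use drops to 13.7 KB.

The second step is to convert object columns that hold categorical data to the category data type, by specifying the dtype parameter.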

dtypes ={'continent':'category'}
smaller_drinks = pd.read_csv('data/drinks.csv',usecols=cols, dtype=dtypes)
smaller_drinks.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 2 columns):
beer_servings    193 non-null int64
continent        193 non-null category
dtypes: category(1), int64(1)
memory usage: 2.4 KB

With the continent column stored as category, memory use drops further to 2.4 KB.

Note: the category data type only reduces memory use when the number of categories is small relative to the number of rows.
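To make the note about category concrete, the sketch below compares the saving for a column with few distinct values against a column where every value is unique. The Series and the helper deep_bytes are hypothetical; exact byte counts depend on the pandas version, so none are quoted.

```python
import pandas as pd

# Hypothetical columns: 3 distinct values vs. 3000 distinct values, 3000 rows each
few = pd.Series(['Asia', 'Europe', 'Africa'] * 1000)
many = pd.Series([f'id_{i}' for i in range(3000)])

def deep_bytes(series):
    """Memory used by a Series, including the string payloads."""
    return series.memory_usage(deep=True)

few_saving = deep_bytes(few) - deep_bytes(few.astype('category'))
many_saving = deep_bytes(many) - deep_bytes(many.astype('category'))

print('saving with few categories: ', few_saving)
print('saving with many categories:', many_saving)
```

A category column stores each distinct value once plus a small integer code per row, which is why the benefit shrinks as the number of distinct values approaches the number of rows.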

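9. Build a DataFrame from multiple files (row-wise)

This section shows how to read a dataset that is spread across several files into a single DataFrame.

For example, suppose there are several stock files, each CSV holding one day of data.

Here are three days of stock data: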

pd.read_csv('data/stocks1.csv')

	Date	Close	Volume	Symbol
0	2016-10-03	31.50	14070500	CSCO
1	2016-10-03	112.52	21701800	AAPL
2	2016-10-03	57.42	19189500	MSFT

pd.read_csv('data/stocks2.csv')

	Date	Close	Volume	Symbol
0	2016-10-04	113.00	29736800	AAPL
1	2016-10-04	57.24	20085900	MSFT
2	2016-10-04	31.35	18460400	CSCO

pd.read_csv('data/stocks3.csv')

	Date	Close	Volume	Symbol
0	2016-10-05	57.64	16726400	MSFT
1	2016-10-05	31.59	11808600	CSCO
2	2016-10-05	113.05	21453100	AAPL

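You could read each CSV file into its own DataFrame, combine them, and then delete the original DataFrames, but that uses a lot of memory and takes a lot of code.

Python's built-in glob module is more convenient.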

from glob import glob

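Pass a filename pattern, which may include wildcards, to glob(), and it returns a list of all matching filenames.

In this example, glob looks in the data subdirectory for CSV files whose names are stocks followed by exactly one character.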

stock_files = sorted(glob('data/stocks?.csv'))
stock_files

['data\\stocks1.csv', 'data\\stocks2.csv', 'data\\stocks3.csv']

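glob returns filenames in arbitrary order, so sort the list with Python's built-in sorted() function.

Call read_csv() on each file inside a generator expression and pass the results to concat(), which combines them into a single DataFrame.

Note: the original uses stock_files = sorted(glob('data/stocks*.csv')). This version uses stocks? rather than stocks* because the data directory also contains a file named stocks.csv; with *, four files would be read instead of the three in the original.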

pd.concat((pd.read_csv(file) for file in stock_files))

	Date	Close	Volume	Symbol
0	2016-10-03	31.50	14070500	CSCO
1	2016-10-03	112.52	21701800	AAPL
2	2016-10-03	57.42	19189500	MSFT
0	2016-10-04	113.00	29736800	AAPL
1	2016-10-04	57.24	20085900	MSFT
2	2016-10-04	31.35	18460400	CSCO
0	2016-10-05	57.64	16726400	MSFT
1	2016-10-05	31.59	11808600	CSCO
2	2016-10-05	113.05	21453100	AAPL

The resulting DataFrame has duplicate index values (0, 1, 2 repeated). To avoid this, pass ignore_index=True to concat(), which discards the old indexes and builds a new default one.

pd.concat((pd.read_csv(file) for file in stock_files), ignore_index=True)

	Date	Close	Volume	Symbol
0	2016-10-03	31.50	14070500	CSCO
1	2016-10-03	112.52	21701800	AAPL
2	2016-10-03	57.42	19189500	MSFT
3	2016-10-04	113.00	29736800	AAPL
4	2016-10-04	57.24	20085900	MSFT
5	2016-10-04	31.35	18460400	CSCO
6	2016-10-05	57.64	16726400	MSFT
7	2016-10-05	31.59	11808600	CSCO
8	2016-10-05	113.05	21453100	AAPL
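If you want to try the pattern without the data directory, the following self-contained sketch writes three small CSV files to a temporary folder and combines them. The file contents and the extra source column are assumptions added for illustration; they are not part of the original example.

```python
import tempfile
from glob import glob
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Write three hypothetical one-row daily files: stocks1.csv ... stocks3.csv
    for day in (1, 2, 3):
        frame = pd.DataFrame({'Date': [f'2016-10-0{day + 2}'], 'Close': [float(day)]})
        frame.to_csv(Path(tmp) / f'stocks{day}.csv', index=False)

    files = sorted(glob(str(Path(tmp) / 'stocks?.csv')))

    # Tag each row with its source file, then stack the files row-wise
    combined = pd.concat(
        (pd.read_csv(f).assign(source=Path(f).name) for f in files),
        ignore_index=True,
    )

print(combined)
```

Recording the source filename is optional, but it makes it easy to trace a row back to the file it came from.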

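10. Build a DataFrame from multiple files (column-wise)

The previous trick combines datasets row-wise, but what if the files contain different columns?

In this example the drinks dataset has been split into two CSV files of 3 columns each.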

pd.read_csv('data/drinks1.csv').head()

	country	beer_servings	spirit_servings
0	Afghanistan	0	0
1	Albania	89	132
2	Algeria	25	0
3	Andorra	245	138
4	Angola	217	57

pd.read_csv('data/drinks2.csv').head()

	wine_servings	total_litres_of_pure_alcohol	continent
0	0	0.0	Asia
1	54	4.9	Europe
2	14	0.7	Africa
3	312	12.4	Europe
4	45	5.9	Africa

As in the previous example, use glob().

drink_files = sorted(glob('data/drinks?.csv'))

This time, tell concat() to combine along the columns with axis='columns'.

pd.concat((pd.read_csv(file) for file in drink_files), axis='columns').head()

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
0	Afghanistan	0	0	0	0.0	Asia
1	Albania	89	132	54	4.9	Europe
2	Algeria	25	0	14	0.7	Africa
3	Andorra	245	138	312	12.4	Europe
4	Angola	217	57	45	5.9	Africa

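Now drinks has all 6 columns.

11. Create a DataFrame from the clipboard

To quickly read data from Excel or another spreadsheet application into a DataFrame, use the read_clipboard() function.

Open the Excel file, select the cells you want, and copy them.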

df = pd.read_clipboard()
df

	city_id	city_name
0	1	北京
1	2	上海
2	3	广州
3	4	深圳

Like read_csv(), read_clipboard() automatically detects the column names and each column's data type.

df = pd.read_clipboard()
df

	weight	price
苹果	160	10
苹果	166	16
葡萄	311	31
苹果	165	15
梨	215	25
梨	218	28

Nice! pandas automatically used the first column as the index.

df.index

Index(['苹果', '苹果', '葡萄', '苹果', '梨', '梨'], dtype='object')

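Note: because the result cannot be reused or reproduced, read_clipboard() is not recommended in production code.

12. Split a DataFrame into two random subsets

Split a DataFrame into two random subsets, one with 75% of the rows and the other with the remaining 25%.

Take movies as an example; it has 979 rows.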

len(movies)

Use the sample() method to randomly select 75% of the rows and assign them to movies_1.

movies_1 = movies.sample(frac=0.75, random_state=1234)

Use the drop() method to remove every row of movies that is in movies_1, and assign the result to movies_2.

movies_2 = movies.drop(movies_1.index)

The row counts of the two DataFrames add up to the row count of movies.

len(movies_1) + len(movies_2)

Every index value in movies_1 and movies_2 comes from movies, and no value appears in both.

movies_1.index.sort_values()

Int64Index([  0,   2,   5,   6,   7,   8,   9,  11,  13,  16,...966, 967, 969, 971, 972, 974, 975, 976, 977, 978],dtype='int64', length=734)

movies_2.index.sort_values()

Int64Index([  1,   3,   4,  10,  12,  14,  15,  18,  26,  30,...931, 934, 937, 941, 950, 954, 960, 968, 970, 973],dtype='int64', length=245)

Note: this approach fails if the index values are not unique.

13. Filter a DataFrame by multiple categories

Preview movies.
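The reason is that drop() removes rows by label, so a repeated label takes out every row that carries it. One possible workaround, sketched below on a hypothetical frame with a duplicated index, is to reset the index before sampling.

```python
import pandas as pd

# Hypothetical frame whose index labels repeat
dup = pd.DataFrame({'value': range(8)}, index=[0, 0, 1, 1, 2, 2, 3, 3])

# Give every row a unique positional label before splitting
unique = dup.reset_index(drop=True)

part_1 = unique.sample(frac=0.75, random_state=1234)
part_2 = unique.drop(part_1.index)

print(len(part_1), len(part_2))
```

If the original labels matter, reset_index() without drop=True keeps them as an ordinary column.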

movies.head()

	star_rating	title	content_rating	genre	duration	actors_list
0	9.3	The Shawshank Redemption	R	Crime	142	[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1	9.2	The Godfather	R	Crime	175	[u'Marlon Brando', u'Al Pacino', u'James Caan']
2	9.1	The Godfather: Part II	R	Crime	200	[u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3	9.0	The Dark Knight	PG-13	Action	152	[u'Christian Bale', u'Heath Ledger', u'Aaron E...
4	8.9	Pulp Fiction	R	Crime	154	[u'John Travolta', u'Uma Thurman', u'Samuel L....

Look at the genre column.

movies.genre.unique()

array(['Crime', 'Action', 'Drama', 'Western', 'Adventure', 'Biography','Comedy', 'Animation', 'Mystery', 'Horror', 'Film-Noir', 'Sci-Fi','History', 'Thriller', 'Family', 'Fantasy'], dtype=object)

To keep only Action, Drama, and Western films, you can combine several conditions with the or operator (|).

movies[(movies.genre == 'Action') |(movies.genre == 'Drama') |(movies.genre == 'Western')].head()

	star_rating	title	content_rating	genre	duration	actors_list
3	9.0	The Dark Knight	PG-13	Action	152	[u'Christian Bale', u'Heath Ledger', u'Aaron E...
5	8.9	12 Angry Men	NOT RATED	Drama	96	[u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals...
6	8.9	The Good, the Bad and the Ugly	NOT RATED	Western	161	[u'Clint Eastwood', u'Eli Wallach', u'Lee Van ...
9	8.9	Fight Club	R	Drama	139	[u'Brad Pitt', u'Edward Norton', u'Helena Bonh...
11	8.8	Inception	PG-13	Action	148	[u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'...

However, the isin() method is clearer: just pass it a list of genres.

movies[movies.genre.isin(['Action','Drama','Western'])].head()

	star_rating	title	content_rating	genre	duration	actors_list
3	9.0	The Dark Knight	PG-13	Action	152	[u'Christian Bale', u'Heath Ledger', u'Aaron E...
5	8.9	12 Angry Men	NOT RATED	Drama	96	[u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals...
6	8.9	The Good, the Bad and the Ugly	NOT RATED	Western	161	[u'Clint Eastwood', u'Eli Wallach', u'Lee Van ...
9	8.9	Fight Club	R	Drama	139	[u'Brad Pitt', u'Edward Norton', u'Helena Bonh...
11	8.8	Inception	PG-13	Action	148	[u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'...

To invert the selection, put a tilde (~) in front of the condition.

movies[~movies.genre.isin(['Action','Drama','Western'])].head()

	star_rating	title	content_rating	genre	duration	actors_list
0	9.3	The Shawshank Redemption	R	Crime	142	[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1	9.2	The Godfather	R	Crime	175	[u'Marlon Brando', u'Al Pacino', u'James Caan']
2	9.1	The Godfather: Part II	R	Crime	200	[u'Al Pacino', u'Robert De Niro', u'Robert Duv...
4	8.9	Pulp Fiction	R	Crime	154	[u'John Travolta', u'Uma Thurman', u'Samuel L....
7	8.9	The Lord of the Rings: The Return of the King	PG-13	Adventure	201	[u'Elijah Wood', u'Viggo Mortensen', u'Ian McK...

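14. Filter a DataFrame by largest categories

Filter movies down to the three genres with the most films.

First count the films in each genre with value_counts() and assign the result, which is a Series, to counts.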

counts = movies.genre.value_counts()
counts

Drama        278
Comedy       156
Action       136
Crime        124
Biography     77
Adventure     75
Animation     62
Horror        29
Mystery       16
Western        9
Thriller       5
Sci-Fi         5
Film-Noir      3
Family         2
History        1
Fantasy        1
Name: genre, dtype: int64

The Series nlargest() method makes it easy to pick the three largest values.

counts.nlargest(3)

Drama     278
Comedy    156
Action    136
Name: genre, dtype: int64

All that is needed here is the index of this Series.

counts.nlargest(3).index

Index(['Drama', 'Comedy', 'Action'], dtype='object')

Pass this index to isin().

movies[movies.genre.isin(counts.nlargest(3).index)].head()

	star_rating	title	content_rating	genre	duration	actors_list
3	9.0	The Dark Knight	PG-13	Action	152	[u'Christian Bale', u'Heath Ledger', u'Aaron E...
5	8.9	12 Angry Men	NOT RATED	Drama	96	[u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals...
9	8.9	Fight Club	R	Drama	139	[u'Brad Pitt', u'Edward Norton', u'Helena Bonh...
11	8.8	Inception	PG-13	Action	148	[u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'...
12	8.8	Star Wars: Episode V - The Empire Strikes Back	PG	Action	124	[u'Mark Hamill', u'Harrison Ford', u'Carrie Fi...

The resulting DataFrame contains only Drama, Comedy, and Action films.

15. Handle missing values

This example uses a dataset of UFO sightings.
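The steps above can be wrapped in a small helper. This is only a sketch: the function keep_top_categories and the toy data are hypothetical names introduced here, not part of pandas or of the original notebook.

```python
import pandas as pd

def keep_top_categories(df, column, n):
    """Keep only the rows whose value in `column` is among the n most frequent."""
    top = df[column].value_counts().nlargest(n).index
    return df[df[column].isin(top)]

# Hypothetical data: 3 Drama, 2 Comedy, 1 Western
toy = pd.DataFrame({
    'genre': ['Drama'] * 3 + ['Comedy'] * 2 + ['Western'],
    'film_id': range(6),
})

top2 = keep_top_categories(toy, 'genre', 2)
print(top2)
```

Ties at the cut-off are resolved by nlargest(), which keeps the first entries it encounters by default.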

ufo.head()

	City	Colors Reported	Shape Reported	State	Time
0	Ithaca	NaN	TRIANGLE	NY	1930-06-01 22:00:00
1	Willingboro	NaN	OTHER	NJ	1930-06-30 20:00:00
2	Holyoke	NaN	OVAL	CO	1931-02-15 14:00:00
3	Abilene	NaN	DISK	KS	1931-06-01 13:00:00
4	New York Worlds Fair	NaN	LIGHT	NY	1933-04-18 19:00:00

As you can see, this dataset has missing values.

To see how many values are missing in each column, call isna() and then sum().

ufo.isna().sum()

City                  25
Colors Reported    15359
Shape Reported      2644
State                  0
Time                   0
dtype: int64

isna() produces a DataFrame of True and False values, and sum() treats True as 1 and False as 0.

You can also use mean() to compute the share of missing values in each column.

ufo.isna().mean()

City               0.001371
Colors Reported    0.842004
Shape Reported     0.144948
State              0.000000
Time               0.000000
dtype: float64

Use dropna() with axis='columns' to drop every column that contains any missing value.

ufo.dropna(axis='columns').head()

	State	Time
0	NY	1930-06-01 22:00:00
1	NJ	1930-06-30 20:00:00
2	CO	1931-02-15 14:00:00
3	KS	1931-06-01 13:00:00
4	NY	1933-04-18 19:00:00

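To drop only the columns in which more than 10% of the values are missing, set the thresh parameter of dropna(). thresh is the minimum number of non-missing values a column needs in order to be kept.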

ufo.dropna(thresh=len(ufo)*0.9, axis='columns').head()

	City	State	Time
0	Ithaca	NY	1930-06-01 22:00:00
1	Willingboro	NJ	1930-06-30 20:00:00
2	Holyoke	CO	1931-02-15 14:00:00
3	Abilene	KS	1931-06-01 13:00:00
4	New York Worlds Fair	NY	1933-04-18 19:00:00
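Because thresh counts non-missing values, it helps to compute it as an integer. The sketch below uses a hypothetical 10-row frame and integer arithmetic for the 90% cut-off; the column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 10% missing, 50% missing, and nothing missing
toy = pd.DataFrame({
    'mostly_full': [1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan],
    'half_empty': [1, np.nan] * 5,
    'full': range(10),
})

# Smallest integer that is at least 90% of the row count (9 for 10 rows)
min_non_null = (len(toy) * 9 + 9) // 10

kept = toy.dropna(thresh=min_non_null, axis='columns')
print(kept.columns.tolist())
```

A column with exactly 10% missing values is kept here, since it still has 9 non-missing entries.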

16. Split a string into multiple columns

Create an example DataFrame.

df = pd.DataFrame({'姓名':['张 三','李 四','王 五'],'所在地':['北京-东城区','上海-黄浦区','广州-白云区']})
df

	姓名	所在地
0	张 三	北京-东城区
1	李 四	上海-黄浦区
2	王 五	广州-白云区

To split the name column (姓名) into surname and given name, use str.split() to split on the space, with expand=True so that the result is a new DataFrame.

df.姓名.str.split(' ', expand=True)

	0	1
0	张	三
1	李	四
2	王	五

Add the two columns to the original DataFrame with an assignment.

df[['姓','名']] = df.姓名.str.split(' ', expand=True)
df

	姓名	所在地	姓	名
0	张 三	北京-东城区	张	三
1	李 四	上海-黄浦区	李	四
2	王 五	广州-白云区	王	五

What if you want to split a string but keep only one of the resulting columns?

df.所在地.str.split('-', expand=True)

	0	1
0	北京	东城区
1	上海	黄浦区
2	广州	白云区

To keep only the city, select column 0 of the result and add just that to the DataFrame.

df['城市'] = df.所在地.str.split('-', expand=True)[0]
df

	姓名	所在地	姓	名	城市
0	张 三	北京-东城区	张	三	北京
1	李 四	上海-黄浦区	李	四	上海
2	王 五	广州-白云区	王	五	广州
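If a value might contain the separator more than once, the number of resulting columns can vary. A hedged sketch: the n parameter of str.split() limits the number of splits, so only the first separator is used. The second place name below is a made-up example with two dashes.

```python
import pandas as pd

# Hypothetical values; the second one contains two separators
places = pd.Series(['北京-东城区', '上海-黄浦区-外滩'])

# n=1 splits on the first '-' only, so there are always two columns
parts = places.str.split('-', n=1, expand=True)
print(parts)
```

Without n=1 the second row would produce three pieces and the first row would be padded with a missing value.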

17. Expand a Series of lists into a DataFrame

Create an example DataFrame.

df = pd.DataFrame({'列1':['a','b','c'],'列2':[[10,20], [20,30], [30,40]]})
df

	列1	列2
0	a	[10, 20]
1	b	[20, 30]
2	c	[30, 40]

It has two columns, and the second holds Python lists of integers.

To turn the second column into a DataFrame, call apply() on it and pass the Series constructor.

df_new = df.列2.apply(pd.Series)
df_new

	0	1
0	10	20
1	20	30
2	30	40

Use concat() to combine the original DataFrame with the new one.

pd.concat([df,df_new], axis='columns')

	列1	列2	0	1
0	a	[10, 20]	10	20
1	b	[20, 30]	20	30
2	c	[30, 40]	30	40
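apply(pd.Series) builds one Series per row, which can be slow on large data. An alternative sketch, assuming every list has the same length, is to hand the lists to the DataFrame constructor directly. The column names col1 and col2 are placeholders for this example.

```python
import pandas as pd

# Hypothetical frame mirroring the example above, with ASCII column names
df_lists = pd.DataFrame({'col1': ['a', 'b', 'c'],
                         'col2': [[10, 20], [20, 30], [30, 40]]})

# Build the expanded columns in one step and keep the original index
expanded = pd.DataFrame(df_lists['col2'].tolist(), index=df_lists.index)

result = pd.concat([df_lists, expanded], axis='columns')
print(result)
```

Passing index=df_lists.index keeps the rows aligned even when the original index is not the default 0, 1, 2.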

18. Aggregate by multiple functions

Start with a DataFrame of orders from the Chipotle restaurant chain.

orders.head(10)

	order_id	quantity	item_name	choice_description	item_price
0	1	1	Chips and Fresh Tomato Salsa	NaN	2.39
1	1	1	Izze	[Clementine]	3.39
2	1	1	Nantucket Nectar	[Apple]	3.39
3	1	1	Chips and Tomatillo-Green Chili Salsa	NaN	2.39
4	2	2	Chicken Bowl	[Tomatillo-Red Chili Salsa (Hot), [Black Beans...	16.98
5	3	1	Chicken Bowl	[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...	10.98
6	3	1	Side of Chips	NaN	1.69
7	4	1	Steak Burrito	[Tomatillo Red Chili Salsa, [Fajita Vegetables...	11.75
8	4	1	Steak Soft Tacos	[Tomatillo Green Chili Salsa, [Pinto Beans, Ch...	9.25
9	5	1	Steak Burrito	[Fresh Tomato Salsa, [Rice, Black Beans, Pinto...	9.25

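Each order has an order number (order_id) and spans multiple rows. To get an order's total, sum the item_price values of the rows that share its order_id. The example below computes the total for order 1.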

orders[orders.order_id == 1].item_price.sum()

11.56

To compute the total of every order, group by order_id with groupby() and sum item_price within each group.

orders.groupby('order_id').item_price.sum().head()

order_id
1    11.56
2    16.98
3    12.67
4    21.00
5    13.70
Name: item_price, dtype: float64

Sometimes you need more than one aggregation function, not just sum(). In that case, use the agg() method and pass it a list of aggregation functions.

orders.groupby('order_id').item_price.agg(['sum','count']).head()

	sum	count
order_id
1	11.56	4
2	16.98	1
3	12.67	2
4	21.00	2
5	13.70	2

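The example above gives each order's total price and the number of items in it.

19. Combine the output of an aggregation with a DataFrame

This example uses orders again.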
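If you would rather choose the output column names yourself, newer pandas releases (0.25 and later, which is an assumption about your environment since the notebook above ran on 0.23) support named aggregation. A minimal sketch on hypothetical toy orders:

```python
import pandas as pd

# Hypothetical orders: order 1 has two items, order 2 has one
toy_orders = pd.DataFrame({
    'order_id': [1, 1, 2],
    'item_price': [2.39, 3.39, 16.98],
})

# Each keyword is an output column: (source column, aggregation function)
summary = toy_orders.groupby('order_id').agg(
    total=('item_price', 'sum'),
    n_items=('item_price', 'count'),
)
print(summary)
```

The result has the same shape as the list-based agg() call, only with descriptive column names instead of sum and count.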

orders.head(10)

	order_id	quantity	item_name	choice_description	item_price
0	1	1	Chips and Fresh Tomato Salsa	NaN	2.39
1	1	1	Izze	[Clementine]	3.39
2	1	1	Nantucket Nectar	[Apple]	3.39
3	1	1	Chips and Tomatillo-Green Chili Salsa	NaN	2.39
4	2	2	Chicken Bowl	[Tomatillo-Red Chili Salsa (Hot), [Black Beans...	16.98
5	3	1	Chicken Bowl	[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...	10.98
6	3	1	Side of Chips	NaN	1.69
7	4	1	Steak Burrito	[Tomatillo Red Chili Salsa, [Fajita Vegetables...	11.75
8	4	1	Steak Soft Tacos	[Tomatillo Green Chili Salsa, [Pinto Beans, Ch...	9.25
9	5	1	Steak Burrito	[Fresh Tomato Salsa, [Rice, Black Beans, Pinto...	9.25

What if you want a new column that lists the order total on every row? As shown above, sum() computes the totals.

orders.groupby('order_id').item_price.sum().head()

order_id
1    11.56
2    16.98
3    12.67
4    21.00
5    13.70
Name: item_price, dtype: float64

sum() is an aggregation function, so its result has fewer rows (1834) than the original data (4622).

len(orders.groupby('order_id').item_price.sum())

len(orders.item_price)

The solution is the transform() method, which performs the same calculation but returns output with the same number of rows as the original data, 4622 in this case.

total_price = orders.groupby('order_id').item_price.transform('sum')
len(total_price)

Next, add the result to the DataFrame as a new column, total_price.

orders['total_price'] = total_price
orders.head()

	order_id	quantity	item_name	choice_description	item_price	total_price
0	1	1	Chips and Fresh Tomato Salsa	NaN	2.39	11.56
1	1	1	Izze	[Clementine]	3.39	11.56
2	1	1	Nantucket Nectar	[Apple]	3.39	11.56
3	1	1	Chips and Tomatillo-Green Chili Salsa	NaN	2.39	11.56
4	2	2	Chicken Bowl	[Tomatillo-Red Chili Salsa (Hot), [Black Beans...	16.98	16.98

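As shown, every row now lists the total of the order it belongs to.

This makes it easy to compute each row's share of its order total, expressed as a fraction.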

orders['percent_of_total'] = orders.item_price / orders.total_price
orders.head(10)

	order_id	quantity	item_name	choice_description	item_price	total_price	percent_of_total
0	1	1	Chips and Fresh Tomato Salsa	NaN	2.39	11.56	0.206747
1	1	1	Izze	[Clementine]	3.39	11.56	0.293253
2	1	1	Nantucket Nectar	[Apple]	3.39	11.56	0.293253
3	1	1	Chips and Tomatillo-Green Chili Salsa	NaN	2.39	11.56	0.206747
4	2	2	Chicken Bowl	[Tomatillo-Red Chili Salsa (Hot), [Black Beans...	16.98	16.98	1.000000
5	3	1	Chicken Bowl	[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...	10.98	12.67	0.866614
6	3	1	Side of Chips	NaN	1.69	12.67	0.133386
7	4	1	Steak Burrito	[Tomatillo Red Chili Salsa, [Fajita Vegetables...	11.75	21.00	0.559524
8	4	1	Steak Soft Tacos	[Tomatillo Green Chili Salsa, [Pinto Beans, Ch...	9.25	21.00	0.440476
9	5	1	Steak Burrito	[Fresh Tomato Salsa, [Rice, Black Beans, Pinto...	9.25	13.70	0.675182
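A quick way to convince yourself that transform() lines up with the original rows is to check that the shares within each order add up to 1. The sketch below does that on hypothetical toy orders with round numbers.

```python
import pandas as pd

# Hypothetical orders: order 1 costs 2 + 6, order 2 costs 5
toy_orders = pd.DataFrame({
    'order_id': [1, 1, 2],
    'item_price': [2.0, 6.0, 5.0],
})

# transform('sum') returns one value per original row, not one per group
toy_orders['total_price'] = toy_orders.groupby('order_id')['item_price'].transform('sum')
toy_orders['share'] = toy_orders['item_price'] / toy_orders['total_price']

# Sanity check: the shares of each order should sum to 1
share_sums = toy_orders.groupby('order_id')['share'].sum()
print(toy_orders)
print(share_sums)
```

The same check works on the real orders table, allowing for floating-point rounding.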

20. Select a slice of rows and columns

This example uses the Titanic dataset, which everyone has seen plenty of times.

titanic.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

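The dataset holds basic information about the Titanic's passengers and whether they survived.

The describe() method produces basic summary statistics for the dataset.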

titanic.describe()

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

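The result shows a lot of numbers, but you may need only a few rows of it. A label slice with loc selects them, and both endpoints of the slice are included.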

titanic.describe().loc['min':'max']

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
min	1.0	0.0	1.0	0.420	0.0	0.0	0.0000
25%	223.5	0.0	2.0	20.125	0.0	0.0	7.9104
50%	446.0	0.0	3.0	28.000	0.0	0.0	14.4542
75%	668.5	1.0	3.0	38.000	1.0	0.0	31.0000
max	891.0	1.0	3.0	80.000	8.0	6.0	512.3292

You can also select only some of the columns.

titanic.describe().loc['min':'max','Pclass':'Parch']

	Pclass	Age	SibSp	Parch
min	1.0	0.420	0.0	0.0
25%	2.0	20.125	0.0	0.0
50%	3.0	28.000	0.0	0.0
75%	3.0	38.000	1.0	0.0
max	3.0	80.000	8.0	6.0

21. Reshape a MultiIndexed Series

The Titanic dataset has a Survived column whose values are 0 and 1. The mean of that column is the overall survival rate.

titanic.Survived.mean()

0.3838383838383838

To get the survival rate by sex (Sex), use groupby().

titanic.groupby('Sex').Survived.mean()

Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64

To get the survival rate by both sex and passenger class (Pclass), group by both columns.

titanic.groupby(['Sex','Pclass']).Survived.mean()

Sex     Pclass
female  1         0.968085
        2         0.921053
        3         0.500000
male    1         0.368852
        2         0.157407
        3         0.135447
Name: Survived, dtype: float64

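This shows the survival rate for each combination of sex and passenger class. The output is a Series with a MultiIndex, meaning that there are several index levels to the left of the actual data.

This form is hard to read and awkward to work with, so it is often more convenient to convert the MultiIndexed Series into a DataFrame with unstack().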

titanic.groupby(['Sex','Pclass']).Survived.mean().unstack()

Pclass	1	2	3
Sex
female	0.968085	0.921053	0.500000
male	0.368852	0.157407	0.135447

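This DataFrame contains exactly the same data as the MultiIndexed Series, but you can work with it using the more familiar DataFrame methods.

22. Create a pivot table

If you often produce DataFrames like the one above, the pivot_table() method is more convenient.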

titanic.pivot_table(index='Sex', columns='Pclass',values='Survived', aggfunc='mean')

Pclass	1	2	3
Sex
female	0.968085	0.921053	0.500000
male	0.368852	0.157407	0.135447

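With a pivot table, you directly specify the index, the columns, the values, and the aggregation function.

Set margins=True to add row and column totals to the pivot table.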

titanic.pivot_table(index='Sex', columns='Pclass',values='Survived', aggfunc='mean', margins=True)

Pclass	1	2	3	All
Sex
female	0.968085	0.921053	0.500000	0.742038
male	0.368852	0.157407	0.135447	0.188908
All	0.629630	0.472826	0.242363	0.383838

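This table shows the overall survival rate as well as the rates by sex and passenger class.

Change the aggregation function from mean to count to get a cross-tabulation.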

titanic.pivot_table(index='Sex', columns='Pclass',values='Survived', aggfunc='count', margins=True)

Pclass	1	2	3	All
Sex
female	94	76	144	314
male	122	108	347	577
All	216	184	491	891

This shows the number of records in each category.

23. Convert continuous data into categorical data

Take a look at the Age column of the Titanic dataset.
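pandas also has a dedicated function for cross-tabulations, pd.crosstab(). The sketch below, on a hypothetical five-row stand-in for the Titanic data, shows counts with margins and then survival rates via values and aggfunc.

```python
import pandas as pd

# Hypothetical mini version of the Titanic columns used above
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male', 'male'],
    'Pclass': [1, 2, 1, 1, 2],
    'Survived': [1, 1, 0, 1, 0],
})

# Record counts per Sex/Pclass combination, with row and column totals
counts = pd.crosstab(toy['Sex'], toy['Pclass'], margins=True)

# Mean of Survived per combination, comparable to pivot_table with aggfunc='mean'
rates = pd.crosstab(toy['Sex'], toy['Pclass'], values=toy['Survived'], aggfunc='mean')

print(counts)
print(rates)
```

crosstab() takes the columns themselves rather than their names, which is handy when the inputs are separate Series or arrays.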

titanic.Age.head()

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

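This column holds continuous data. What if you want to convert it into categorical data?

Use the cut() function to divide the ages into three groups: 儿童 (child), 青年 (youth), and 成人 (adult).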

pd.cut(titanic.Age, bins=[0, 18, 25, 99], labels=['儿童', '青年', '成人']).head(10)

0     青年
1     成人
2     成人
3     成人
4     成人
5    NaN
6     成人
7     儿童
8     成人
9     儿童
Name: Age, dtype: category
Categories (3, object): [儿童 < 青年 < 成人]

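This code assigns a label to each bin: ages in (0, 18] are labelled 儿童 (child), ages in (18, 25] are 青年 (youth), and ages in (25, 99] are 成人 (adult).

Note: the data is now categorical, and the categories are automatically ordered (儿童 < 青年 < 成人).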
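By default pd.cut() includes the right edge of each bin and excludes the left edge, so boundary values can land somewhere unexpected. The sketch below checks a few hypothetical edge cases, using English labels for readability.

```python
import pandas as pd

# Hypothetical ages chosen to sit on or outside the bin edges
ages = pd.Series([0, 18, 18.5, 25, 26, 120])

# Bins are (0, 18], (18, 25], (25, 99]; values outside every bin become NaN
groups = pd.cut(ages, bins=[0, 18, 25, 99], labels=['child', 'youth', 'adult'])
print(groups)
```

An age of exactly 0 falls outside the first bin, and 120 is beyond the last one, so both come back as missing. Passing include_lowest=True would bring 0 into the first bin.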

24. Change display options

Look at the Titanic dataset once more.

titanic.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

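The Age column shows 1 decimal place and the Fare column shows 4. How can the displayed precision be standardized?

The following code makes floats display with 2 decimal places.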

pd.set_option('display.float_format', '{:.2f}'.format)

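The first argument is the name of the option to set. The second is the format method of a Python format string, a callable that pandas applies to each float it displays.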

titanic.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.00	1	A/5 21171	7.25	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.00	1	PC 17599	71.28	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.00	0	STON/O2. 3101282	7.92	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.00	1	113803	53.10	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.00	0	373450	8.05	NaN	S

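Age and Fare now both show 2 decimal places.

Note: this does not change the underlying data, only how it is displayed.

You can reset the option to its default with the following code.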

pd.reset_option('display.float_format')

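Note: many other display options can be set in the same way.

25. Style a DataFrame

The previous trick is useful when you want to change the display for the whole Jupyter Notebook.

To style one particular DataFrame, pandas offers a more flexible approach.

Take a look at stocks.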
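When a setting is only needed for one piece of output, pd.option_context() applies it temporarily and restores the previous value afterwards. A minimal sketch with a hypothetical one-column frame:

```python
import pandas as pd

# Hypothetical fares with differing numbers of decimals
toy = pd.DataFrame({'fare': [7.25, 71.2833]})

# The option is active only inside the with block
with pd.option_context('display.float_format', '{:.2f}'.format):
    inside = toy.to_string()

outside = toy.to_string()

print(inside)
print(outside)
```

This avoids having to remember a matching reset_option() call.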

stocks

	Date	Close	Volume	Symbol
0	2016-10-03	31.50	14070500	CSCO
1	2016-10-03	112.52	21701800	AAPL
2	2016-10-03	57.42	19189500	MSFT
3	2016-10-04	113.00	29736800	AAPL
4	2016-10-04	57.24	20085900	MSFT
5	2016-10-04	31.35	18460400	CSCO
6	2016-10-05	57.64	16726400	MSFT
7	2016-10-05	31.59	11808600	CSCO
8	2016-10-05	113.05	21453100	AAPL

Create a dictionary of format strings that specifies the format to use for each column.

format_dict = {'Date':'{:%m/%d/%y}', 'Close':'${:.2f}', 'Volume':'{:,}'}

Pass this dictionary to the DataFrame's style.format() method.

stocks.style.format(format_dict)

	Date	Close	Volume	Symbol
0	10/03/16	$31.50	14,070,500	CSCO
1	10/03/16	$112.52	21,701,800	AAPL
2	10/03/16	$57.42	19,189,500	MSFT
3	10/04/16	$113.00	29,736,800	AAPL
4	10/04/16	$57.24	20,085,900	MSFT
5	10/04/16	$31.35	18,460,400	CSCO
6	10/05/16	$57.64	16,726,400	MSFT
7	10/05/16	$31.59	11,808,600	CSCO
8	10/05/16	$113.05	21,453,100	AAPL

Note: dates are shown in month/day/year order, closing prices carry a dollar sign, and volumes use thousands separators.
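The formatting described in this note comes from ordinary Python format strings applied column by column. As a minimal sketch, assuming `format_dict` (defined earlier in this post) holds format strings similar to the ones below, the same strings can be tried on a single row with plain `str.format`:

```python
from datetime import datetime

# Assumed format strings; the actual format_dict is defined earlier in the post
# and may differ from these.
format_dict = {'Date': '{:%m/%d/%y}', 'Close': '${:.2f}', 'Volume': '{:,}'}

# One raw row, with values taken from the first row of the table
row = {'Date': datetime(2016, 10, 3), 'Close': 31.5, 'Volume': 14070500}

# Apply each column's format string to the matching value
formatted = {col: format_dict[col].format(value) for col, value in row.items()}
print(formatted)
# {'Date': '10/03/16', 'Close': '$31.50', 'Volume': '14,070,500'}
```

Only the displayed text changes; the underlying values stay a datetime, a float and an int.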

Next, chain further methods to apply more styling.

(stocks.style.format(format_dict)
.hide_index()
.highlight_min('Close', color='red')
.highlight_max('Close', color='lightgreen')
)

Date	Close	Volume	Symbol
10/03/16	$31.50	14,070,500	CSCO
10/03/16	$112.52	21,701,800	AAPL
10/03/16	$57.42	19,189,500	MSFT
10/04/16	$113.00	29,736,800	AAPL
10/04/16	$57.24	20,085,900	MSFT
10/04/16	$31.35	18,460,400	CSCO
10/05/16	$57.64	16,726,400	MSFT
10/05/16	$31.59	11,808,600	CSCO
10/05/16	$113.05	21,453,100	AAPL

As you can see, the index is hidden, the lowest closing price is highlighted in red, and the highest in light green.
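`hide_index()` works on the pandas version used in this post (0.23.0), but it was deprecated in pandas 1.4 and removed in pandas 2.0 in favour of `Styler.hide(axis='index')`. The helper below is a hypothetical convenience (the name `hide_index_compat` is not part of pandas) that picks whichever method the installed version provides:

```python
def hide_index_compat(styler):
    """Hide the index of a Styler on both old and new pandas versions.

    Hypothetical helper for illustration; it is not a pandas API.
    """
    if hasattr(styler, "hide"):
        # pandas >= 1.4: hide() replaces hide_index()
        return styler.hide(axis="index")
    # Older pandas, such as the 0.23.0 used in this post
    return styler.hide_index()
```

It could be used as `hide_index_compat(stocks.style.format(format_dict)).highlight_min('Close', color='red')`, since both methods return the Styler for further chaining.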

Now let's look at a background-gradient style.

(stocks.style.format(format_dict)
.hide_index()
.background_gradient(subset='Volume', cmap='Blues')
)

Date	Close	Volume	Symbol
10/03/16	$31.50	14,070,500	CSCO
10/03/16	$112.52	21,701,800	AAPL
10/03/16	$57.42	19,189,500	MSFT
10/04/16	$113.00	29,736,800	AAPL
10/04/16	$57.24	20,085,900	MSFT
10/04/16	$31.35	18,460,400	CSCO
10/05/16	$57.64	16,726,400	MSFT
10/05/16	$31.59	11,808,600	CSCO
10/05/16	$113.05	21,453,100	AAPL

The Volume column is now shaded in varying depths of blue, so the relative size of each value is visible at a glance.
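To get a feel for the shading, the sketch below rescales the Volume values from the table to the range 0 to 1, which is roughly the kind of min-max normalization a colormap needs before it can pick a shade. It is an illustration only: the real `background_gradient` delegates normalization and colour lookup to Matplotlib, and the variable names here are made up for the example.

```python
# Volume values copied from the table above, in row order
volumes = [14070500, 21701800, 19189500, 29736800, 20085900,
           18460400, 16726400, 11808600, 21453100]

low, high = min(volumes), max(volumes)

# With cmap='Blues', 0.0 maps to the lightest shade and 1.0 to the darkest
intensity = [(v - low) / (high - low) for v in volumes]

lightest_row = intensity.index(0.0)  # row with the smallest volume
darkest_row = intensity.index(1.0)   # row with the largest volume
print(lightest_row, darkest_row)  # 7 3
```

Row 7 (CSCO on 10/05/16) has the smallest volume and row 3 (AAPL on 10/04/16) the largest, so they sit at the two ends of the gradient.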

Here is one last example.

(stocks.style.format(format_dict)
.hide_index()
.bar('Volume', color='lightblue', align='zero')
.set_caption('Stock Prices from October 2016')
)

Stock Prices from October 2016
Date	Close	Volume	Symbol
10/03/16	$31.50	14,070,500	CSCO
10/03/16	$112.52	21,701,800	AAPL
10/03/16	$57.42	19,189,500	MSFT
10/04/16	$113.00	29,736,800	AAPL
10/04/16	$57.24	20,085,900	MSFT
10/04/16	$31.35	18,460,400	CSCO
10/05/16	$57.64	16,726,400	MSFT
10/05/16	$31.59	11,808,600	CSCO
10/05/16	$113.05	21,453,100	AAPL

This DataFrame now has a caption, and the Volume column is drawn with in-cell mini bar charts.

Note: pandas supports many more DataFrame styling options; see the official pandas documentation for details.
