These notes follow an online pandas course taught by an instructor on YouTube. The playlist is at https://www.youtube.com/watch?v=yzIMircGU5I&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y

Python pandas Q&A video series by Data School

YouTube playlist and GitHub repository

What is pandas?
How do I read a tabular data file into pandas?
How do I select a pandas Series from a DataFrame?
Why do some pandas commands end with parentheses (and others don't)?
How do I rename columns in a pandas DataFrame?
How do I remove columns from a pandas DataFrame?
How do I sort a pandas DataFrame or a Series?
How do I filter rows of a pandas DataFrame by column value?
How do I apply multiple filter criteria to a pandas DataFrame?
Your pandas questions answered!
How do I use the "axis" parameter in pandas?
How do I use string methods in pandas?
How do I change the data type of a pandas Series?
When should I use a "groupby" in pandas?
How do I explore a pandas Series?
How do I handle missing values in pandas?
What do I need to know about the pandas index? (Part 1)
What do I need to know about the pandas index? (Part 2)
How do I select multiple rows and columns from a pandas DataFrame?
When should I use the "inplace" parameter in pandas?
How do I make my pandas DataFrame smaller and faster?
How do I use pandas with scikit-learn to create Kaggle submissions?
More of your pandas questions answered!
How do I create dummy variables in pandas?
How do I work with dates and times in pandas?
How do I find and remove duplicate rows in pandas?
How do I avoid a SettingWithCopyWarning in pandas?
How do I change display options in pandas?
How do I create a pandas DataFrame from another object?
How do I apply a function to a pandas Series or DataFrame?

In [1]:

# conventional way to import pandas

import pandas as pd

1. What is pandas? (video)

pandas main page
pandas installation instructions
Anaconda distribution of Python (includes pandas)
How to use the IPython/Jupyter notebook (video)

2. How do I read a tabular data file into pandas? (video)

In [2]:

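# read a dataset of Chipotle orders directly from a URL and store the result in a DataFrame

# define the URL

url1 = "https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/chipotle.tsv"

# open it with read_table(), which assumes tab-separated values by default

orders = pd.read_table(url1)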

In [3]:

# examine the first 5 rows

orders.head()

Out[3]:

	order_id	quantity	item_name	choice_description	item_price
0	1	1	Chips and Fresh Tomato Salsa	NaN	$2.39
1	1	1	Izze	[Clementine]	$3.39
2	1	1	Nantucket Nectar	[Apple]	$3.39
3	1	1	Chips and Tomatillo-Green Chili Salsa	NaN	$2.39
4	2	2	Chicken Bowl	[Tomatillo-Red Chili Salsa (Hot), [Black Beans...	$16.98

Documentation for read_table

In [4]:

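# read a dataset of movie reviewers (overriding the default parameter values of read_table)

# define the column names

user_cols = ['user_id','age','gender','occupation','zipcode']

# define the URL

url2 = "https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/u.user"

# optional variant: skiprows and skipfooter drop rows from the top and bottom of the file

#users = pd.read_table(url2, sep='|', header=None, names=user_cols, skiprows=2, skipfooter=3)

# sep sets the delimiter, header=None says the file has no header row, names supplies the column names

users = pd.read_table(url2, sep='|', header=None, names=user_cols)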

In [5]:

# examine the first 5 rows

users.head()

Out[5]:

	user_id	age	gender	occupation	zipcode
0	1	24	M	technician	85711
1	2	53	F	other	94043
2	3	23	M	writer	32067
3	4	24	M	technician	43537
4	5	33	F	other	15213
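
The effect of sep, header and names can be tried without a network connection. The sketch below is only an illustration: the three pipe-delimited rows are a hypothetical stand-in for the u.user file, and the name sample_users is not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same pipe-delimited layout as u.user
raw = "1|24|M|technician|85711\n2|53|F|other|94043\n3|23|M|writer|32067\n"

user_cols = ['user_id', 'age', 'gender', 'occupation', 'zipcode']

# sep='|' sets the delimiter, header=None says there is no header row,
# and names supplies the column names
sample_users = pd.read_table(io.StringIO(raw), sep='|', header=None, names=user_cols)

print(sample_users)
```

Without header=None, the first data row would be consumed as the header, so the file would appear to have one row fewer.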

[Back to top]

3. How do I select a pandas Series from a DataFrame? (video)

In [6]:

# read a dataset of UFO reports into a DataFrame

# define the URL

url3 = "https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv"

ufo = pd.read_table(url3, sep=',')

In [7]:

# read_csv is equivalent to read_table, except that it assumes a comma separator by default

ufo = pd.read_csv(url3)

In [8]:

# examine the first 5 rows

ufo.head()

Out[8]:

	City	Colors Reported	Shape Reported	State	Time
0	Ithaca	NaN	TRIANGLE	NY	6/1/1930 22:00
1	Willingboro	NaN	OTHER	NJ	6/30/1930 20:00
2	Holyoke	NaN	OVAL	CO	2/15/1931 14:00
3	Abilene	NaN	DISK	KS	6/1/1931 13:00
4	New York Worlds Fair	NaN	LIGHT	NY	4/18/1933 19:00

In [9]:

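# select the 'City' Series using bracket notation

ufo['City']

# or equivalently, use dot notation (quicker to type, but it fails when the name contains a space or clashes with a Python keyword or a DataFrame attribute)

ufo.City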

Out[9]:

0                      Ithaca
1                 Willingboro
2                     Holyoke
3                     Abilene
4        New York Worlds Fair
5                 Valley City
6                 Crater Lake
7                        Alma
8                     Eklutna
9                     Hubbard
10                    Fontana
11                   Waterloo
12                     Belton
13                     Keokuk
14                  Ludington
15                Forest Home
16                Los Angeles
17                  Hapeville
18                     Oneida
19                 Bering Sea
20                   Nebraska
21                        NaN
22                        NaN
23                  Owensboro
24                 Wilderness
25                  San Diego
26                 Wilderness
27                     Clovis
28                 Los Alamos
29               Ft. Duschene...
18211                 Holyoke
18212                  Carson
18213                Pasadena
18214                  Austin
18215                El Campo
18216            Garden Grove
18217           Berthoud Pass
18218              Sisterdale
18219            Garden Grove
18220             Shasta Lake
18221                Franklin
18222          Albrightsville
18223              Greenville
18224                 Eufaula
18225             Simi Valley
18226           San Francisco
18227           San Francisco
18228              Kingsville
18229                 Chicago
18230             Pismo Beach
18231             Pismo Beach
18232                    Lodi
18233               Anchorage
18234                Capitola
18235          Fountain Hills
18236              Grant Park
18237             Spirit Lake
18238             Eagle River
18239             Eagle River
18240                    Ybor
Name: City, Length: 18241, dtype: object

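Bracket notation always works, whereas dot notation has limitations:

Dot notation doesn't work if there are spaces in the Series name
Dot notation doesn't work if the Series has the same name as a DataFrame method or attribute (like 'head' or 'shape')
Dot notation can't be used to define the name of a new Series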
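
A minimal sketch of these three limitations, assuming a hypothetical two-row DataFrame. The column named 'shape' is invented here to collide with the DataFrame.shape attribute and does not exist in the UFO dataset.

```python
import pandas as pd

# Hypothetical miniature table; 'shape' is chosen to clash with DataFrame.shape
df = pd.DataFrame({
    'City': ['Ithaca', 'Holyoke'],
    'Colors Reported': ['RED', 'BLUE'],
    'shape': ['TRIANGLE', 'OVAL'],
})

# For a simple column name, dot and bracket notation return the same Series
same = df.City.equals(df['City'])

# A name containing a space is only reachable with brackets
colors = df['Colors Reported']

# df.shape is the (rows, columns) attribute, not the 'shape' column
attr = df.shape
col = df['shape']

# A new column has to be created with bracket notation
df['Location'] = df.City + ', ' + df['shape']
```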

In [10]:

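# create a new 'Location' Series by concatenation (the new column must be assigned with bracket notation, not dot notation)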

ufo['Location'] = ufo.City + ', ' + ufo.State

ufo.head()

Out[10]:

	City	Colors Reported	Shape Reported	State	Time	Location
0	Ithaca	NaN	TRIANGLE	NY	6/1/1930 22:00	Ithaca, NY
1	Willingboro	NaN	OTHER	NJ	6/30/1930 20:00	Willingboro, NJ
2	Holyoke	NaN	OVAL	CO	2/15/1931 14:00	Holyoke, CO
3	Abilene	NaN	DISK	KS	6/1/1931 13:00	Abilene, KS
4	New York Worlds Fair	NaN	LIGHT	NY	4/18/1933 19:00	New York Worlds Fair, NY

[Back to top]

4. Why do some pandas commands end with parentheses (and others don't)? (video)

In [11]:

# read a dataset of top-rated IMDb movies into a DataFrame

url4="https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/imdb_1000.csv"

movies = pd.read_csv(url4)

Methods end with parentheses, while attributes don't:

In [12]:

# example method: show the first 5 rows

movies.head()

Out[12]:

	star_rating	title	content_rating	genre	duration	actors_list
0	9.3	The Shawshank Redemption	R	Crime	142	[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1	9.2	The Godfather	R	Crime	175	[u'Marlon Brando', u'Al Pacino', u'James Caan']
2	9.1	The Godfather: Part II	R	Crime	200	[u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3	9.0	The Dark Knight	PG-13	Action	152	[u'Christian Bale', u'Heath Ledger', u'Aaron E...
4	8.9	Pulp Fiction	R	Crime	154	[u'John Travolta', u'Uma Thurman', u'Samuel L....

In [13]:

# example method: calculate summary statistics

movies.describe()

Out[13]:

	star_rating	duration
count	979.000000	979.000000
mean	7.889785	120.979571
std	0.336069	26.218010
min	7.400000	64.000000
25%	7.600000	102.000000
50%	7.800000	117.000000
75%	8.100000	134.000000
max	9.300000	242.000000

In [14]:

movies.describe(include=['object'])

Out[14]:

	title	content_rating	genre	actors_list
count	979	976	979	979
unique	975	12	16	969
top	Dracula	R	Drama	[u'Daniel Radcliffe', u'Emma Watson', u'Rupert...
freq	2	460	278	6

In [15]:

# example attribute: number of rows and columns

movies.shape

Out[15]:

(979, 6)

In [16]:

# example attribute: data type of each column

movies.dtypes

Out[16]:

star_rating       float64
title              object
content_rating     object
genre              object
duration            int64
actors_list        object
dtype: object

In [17]:

# use an optional parameter of the describe method to summarize only the 'object' columns

movies.describe(include=['object'])

Out[17]:

	title	content_rating	genre	actors_list
count	979	976	979	979
unique	975	12	16	969
top	Dracula	R	Drama	[u'Daniel Radcliffe', u'Emma Watson', u'Rupert...
freq	2	460	278	6

Documentation for describe
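
One way to check whether a name is a method or an attribute is Python's built-in callable(). The sketch below assumes a hypothetical three-row DataFrame rather than the IMDb data.

```python
import pandas as pd

# Hypothetical miniature table standing in for the movies DataFrame
df = pd.DataFrame({'title': ['A', 'B', 'C'], 'duration': [142, 175, 200]})

# A method is callable and only runs when followed by parentheses
head_is_callable = callable(df.head)
first_two = df.head(2)

# An attribute is a stored value and takes no parentheses
shape_is_callable = callable(df.shape)
n_rows, n_cols = df.shape
```

Writing df.head without parentheses does not raise an error. It returns the method object itself, which is a common source of confusing output.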

[Back to top]

5. How do I rename columns in a pandas DataFrame? (video)

In [18]:

# examine the column names

ufo.columns

Out[18]:

Index(['City', 'Colors Reported', 'Shape Reported', 'State', 'Time','Location'],dtype='object')

In [19]:

# rename two of the columns by using the 'rename' method

ufo.rename(columns={'Colors Reported':'Colors_Reported', 'Shape Reported':'Shape_Reported'}, inplace=True)

ufo.columns

Out[19]:

Index(['City', 'Colors_Reported', 'Shape_Reported', 'State', 'Time','Location'],dtype='object')

Documentation for rename

In [20]:

# replace all of the column names by overwriting the 'columns' attribute

ufo = pd.read_table(url3, sep=',')

ufo_cols = ['city', 'colors reported', 'shape reported', 'state', 'time']

ufo.columns = ufo_cols

ufo.columns

Out[20]:

Index(['city', 'colors reported', 'shape reported', 'state', 'time'], dtype='object')

In [21]:

# replace the column names during the file reading process by using the 'names' parameter

ufo = pd.read_csv(url3, header=0, names=ufo_cols)

ufo.columns

Out[21]:

Index(['city', 'colors reported', 'shape reported', 'state', 'time'], dtype='object')

Documentation for read_csv

In [22]:

# replace all spaces with underscores in the column names by using the 'str.replace' method

ufo.columns = ufo.columns.str.replace(' ', '_')

ufo.columns

Out[22]:

Index(['city', 'colors_reported', 'shape_reported', 'state', 'time'], dtype='object')

Documentation for str.replace
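
The renaming approaches in this section can be compared on a small in-memory table. This is a sketch under the assumption of a hypothetical one-row DataFrame. The lower-casing step is an extra illustration and is not taken from the original notebook.

```python
import pandas as pd

# Hypothetical one-row table with the same style of column names as the UFO data
df = pd.DataFrame({'City': ['Ithaca'], 'Colors Reported': ['RED'], 'Shape Reported': ['OVAL']})

# rename returns a new DataFrame unless inplace=True is passed
renamed = df.rename(columns={'Colors Reported': 'Colors_Reported'})

# Normalise every name in one pass: lower-case, then spaces to underscores
df.columns = df.columns.str.lower().str.replace(' ', '_', regex=False)
```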

[Back to top]

6. How do I remove columns from a pandas DataFrame? (video)

In [35]:

ufo = pd.read_table(url3, sep=',')

ufo.head()

Out[35]:

	City	Colors Reported	Shape Reported	State	Time
0	Ithaca	NaN	TRIANGLE	NY	6/1/1930 22:00
1	Willingboro	NaN	OTHER	NJ	6/30/1930 20:00
2	Holyoke	NaN	OVAL	CO	2/15/1931 14:00
3	Abilene	NaN	DISK	KS	6/1/1931 13:00
4	New York Worlds Fair	NaN	LIGHT	NY	4/18/1933 19:00

In [37]:

# remove a single column (axis=1 refers to columns); inplace=True modifies the original object instead of creating a new one

ufo.drop('Colors Reported', axis=1, inplace=True)

ufo.head()

Out[37]:

	City	Shape Reported	State	Time
0	Ithaca	TRIANGLE	NY	6/1/1930 22:00
1	Willingboro	OTHER	NJ	6/30/1930 20:00
2	Holyoke	OVAL	CO	2/15/1931 14:00
3	Abilene	DISK	KS	6/1/1931 13:00
4	New York Worlds Fair	LIGHT	NY	4/18/1933 19:00

Documentation for drop

In [38]:

# remove multiple columns at once

ufo.drop(['City', 'State'], axis=1, inplace=True)

ufo.head()

Out[38]:

	Shape Reported	Time
0	TRIANGLE	6/1/1930 22:00
1	OTHER	6/30/1930 20:00
2	OVAL	2/15/1931 14:00
3	DISK	6/1/1931 13:00
4	LIGHT	4/18/1933 19:00

In [39]:

# remove multiple rows at once (axis=0 refers to rows)

ufo.drop([0, 1], axis=0, inplace=True)

ufo.head()

# the two rows are removed by label; axis=0 (rows) is the default, but it is clearer to write it out

Out[39]:

	Shape Reported	Time
2	OVAL	2/15/1931 14:00
3	DISK	6/1/1931 13:00
4	LIGHT	4/18/1933 19:00
5	DISK	9/15/1934 15:30
6	CIRCLE	6/15/1935 0:00
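
As an alternative to the axis argument, drop also accepts the columns and index keywords, which some readers find easier to read. The sketch below assumes a hypothetical three-row table and leaves the original DataFrame untouched because inplace is not used.

```python
import pandas as pd

# Hypothetical miniature of the UFO data
df = pd.DataFrame({
    'City': ['Ithaca', 'Holyoke', 'Abilene'],
    'State': ['NY', 'CO', 'KS'],
    'Time': ['6/1/1930 22:00', '2/15/1931 14:00', '6/1/1931 13:00'],
})

# Equivalent to df.drop('State', axis=1)
no_state = df.drop(columns=['State'])

# Equivalent to df.drop([0, 1], axis=0); rows are identified by index label
no_first_two = df.drop(index=[0, 1])
```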

[Back to top]

7. How do I sort a pandas DataFrame or a Series? (video)

In [40]:

movies.head()

Out[40]:

	star_rating	title	content_rating	genre	duration	actors_list
0	9.3	The Shawshank Redemption	R	Crime	142	[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1	9.2	The Godfather	R	Crime	175	[u'Marlon Brando', u'Al Pacino', u'James Caan']
2	9.1	The Godfather: Part II	R	Crime	200	[u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3	9.0	The Dark Knight	PG-13	Action	152	[u'Christian Bale', u'Heath Ledger', u'Aaron E...
4	8.9	Pulp Fiction	R	Crime	154	[u'John Travolta', u'Uma Thurman', u'Samuel L....

Note: None of the sorting methods below affect the underlying data. (In other words, the sorting is temporary.)

In [41]:

# sort the 'title' Series in ascending order (returns a Series)

movies.title.sort_values().head()

Out[41]:

542     (500) Days of Summer
5               12 Angry Men
201         12 Years a Slave
698                127 Hours
110    2001: A Space Odyssey
Name: title, dtype: object

In [42]:

# sort the same Series in descending order instead

movies.title.sort_values(ascending=False).head()

Out[42]:

864               [Rec]
526                Zulu
615          Zombieland
677              Zodiac
955    Zero Dark Thirty
Name: title, dtype: object

Documentation for sort_values for a Series. (Prior to version 0.17, use order instead.)

In [43]:

# sort the entire DataFrame by the 'title' Series (returns a DataFrame)

movies.sort_values('title').head()

Out[43]:

	star_rating	title	content_rating	genre	duration	actors_list
542	7.8	(500) Days of Summer	PG-13	Comedy	95	[u'Zooey Deschanel', u'Joseph Gordon-Levitt', ...
5	8.9	12 Angry Men	NOT RATED	Drama	96	[u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals...
201	8.1	12 Years a Slave	R	Biography	134	[u'Chiwetel Ejiofor', u'Michael Kenneth Willia...
698	7.6	127 Hours	R	Adventure	94	[u'James Franco', u'Amber Tamblyn', u'Kate Mara']
110	8.3	2001: A Space Odyssey	G	Mystery	160	[u'Keir Dullea', u'Gary Lockwood', u'William S...

In [44]:

# sort in descending order instead

movies.sort_values('title', ascending=False).head()

Out[44]:

	star_rating	title	content_rating	genre	duration	actors_list
864	7.5	[Rec]	R	Horror	78	[u'Manuela Velasco', u'Ferran Terraza', u'Jorg...
526	7.8	Zulu	UNRATED	Drama	138	[u'Stanley Baker', u'Jack Hawkins', u'Ulla Jac...
615	7.7	Zombieland	R	Comedy	88	[u'Jesse Eisenberg', u'Emma Stone', u'Woody Ha...
677	7.7	Zodiac	R	Crime	157	[u'Jake Gyllenhaal', u'Robert Downey Jr.', u'M...
955	7.4	Zero Dark Thirty	R	Drama	157	[u'Jessica Chastain', u'Joel Edgerton', u'Chri...

Documentation for sort_values for a DataFrame. (Prior to version 0.17, use sort instead.)

In [45]:

# sort the DataFrame first by 'content_rating', then by 'duration'

movies.sort_values(['content_rating', 'duration']).head()

Out[45]:

	star_rating	title	content_rating	genre	duration	actors_list
713	7.6	The Jungle Book	APPROVED	Animation	78	[u'Phil Harris', u'Sebastian Cabot', u'Louis P...
513	7.8	Invasion of the Body Snatchers	APPROVED	Horror	80	[u'Kevin McCarthy', u'Dana Wynter', u'Larry Ga...
272	8.1	The Killing	APPROVED	Crime	85	[u'Sterling Hayden', u'Coleen Gray', u'Vince E...
703	7.6	Dracula	APPROVED	Horror	85	[u'Bela Lugosi', u'Helen Chandler', u'David Ma...
612	7.7	A Hard Day's Night	APPROVED	Comedy	87	[u'John Lennon', u'Paul McCartney', u'George H...

Summary of changes to the sorting API in pandas 0.17
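
The point that sorting is temporary, and that several sort keys can each have their own direction, can be checked on a hypothetical three-row table. The mixed ascending list is an extra illustration beyond what the notebook shows.

```python
import pandas as pd

# Hypothetical miniature of the movies data
df = pd.DataFrame({
    'title': ['Zulu', 'Alien', 'Brazil'],
    'content_rating': ['R', 'R', 'PG'],
    'duration': [138, 117, 132],
})

# sort_values returns a sorted copy; df itself keeps its original order
by_title = df.sort_values('title')

# Sort by rating ascending, then by duration descending within each rating
mixed = df.sort_values(['content_rating', 'duration'], ascending=[True, False])
```

To keep a sorted result, assign it back to a name, for example df = df.sort_values('title').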

[Back to top]

8. How do I filter rows of a pandas DataFrame by column value? (video)

In [46]:

movies.head()

Out[46]:

	star_rating	title	content_rating	genre	duration	actors_list
0	9.3	The Shawshank Redemption	R	Crime	142	[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1	9.2	The Godfather	R	Crime	175	[u'Marlon Brando', u'Al Pacino', u'James Caan']
2	9.1	The Godfather: Part II	R	Crime	200	[u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3	9.0	The Dark Knight	PG-13	Action	152	[u'Christian Bale', u'Heath Ledger', u'Aaron E...
4	8.9	Pulp Fiction	R	Crime	154	[u'John Travolta', u'Uma Thurman', u'Samuel L....

In [47]:

# 检查行数和列数

movies.shape

Out[47]:

(979, 6)

Goal: Filter the DataFrame rows to only show movies with a 'duration' of at least 200 minutes.

In [48]:

# first, a verbose approach: use a for loop to build a list of booleans with one entry per DataFrame row

# each element is True if the corresponding row meets the condition, and False otherwise

booleans = []

for length in movies.duration:
    if length >= 200:
        booleans.append(True)
    else:
        booleans.append(False)

In [49]:

# confirm that the list has the same length as the DataFrame

len(booleans)

Out[49]:

979

In [50]:

# examine the first five list elements

booleans[0:5]

Out[50]:

[False, False, True, False, False]

In [51]:

# convert the list to a Series

is_long = pd.Series(booleans)

is_long.head()

Out[51]:

0    False
1    False
2     True
3    False
4    False
dtype: bool

In [52]:

# use bracket notation with the boolean Series to tell the DataFrame which rows to display

movies[is_long]

Out[52]:

	star_rating	title	content_rating	genre	duration	actors_list
2	9.1	The Godfather: Part II	R	Crime	200	[u'Al Pacino', u'Robert De Niro', u'Robert Duv...
7	8.9	The Lord of the Rings: The Return of the King	PG-13	Adventure	201	[u'Elijah Wood', u'Viggo Mortensen', u'Ian McK...
17	8.7	Seven Samurai	UNRATED	Drama	207	[u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K...
78	8.4	Once Upon a Time in America	R	Crime	229	[u'Robert De Niro', u'James Woods', u'Elizabet...
85	8.4	Lawrence of Arabia	PG	Adventure	216	[u"Peter O'Toole", u'Alec Guinness', u'Anthony...
142	8.3	Lagaan: Once Upon a Time in India	PG	Adventure	224	[u'Aamir Khan', u'Gracy Singh', u'Rachel Shell...
157	8.2	Gone with the Wind	G	Drama	238	[u'Clark Gable', u'Vivien Leigh', u'Thomas Mit...
204	8.1	Ben-Hur	G	Adventure	212	[u'Charlton Heston', u'Jack Hawkins', u'Stephe...
445	7.9	The Ten Commandments	APPROVED	Adventure	220	[u'Charlton Heston', u'Yul Brynner', u'Anne Ba...
476	7.8	Hamlet	PG-13	Drama	242	[u'Kenneth Branagh', u'Julie Christie', u'Dere...
630	7.7	Malcolm X	PG-13	Biography	202	[u'Denzel Washington', u'Angela Bassett', u'De...
767	7.6	It's a Mad, Mad, Mad, Mad World	APPROVED	Action	205	[u'Spencer Tracy', u'Milton Berle', u'Ethel Me...

In [53]:

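# simplify the steps above: no need to write a for loop to create 'is_long'

is_long = movies.duration >= 200

# pandas uses the boolean Series to decide which rows to keep

movies[is_long]

# or equivalently, write it in one line (no need to create the 'is_long' object)

movies[movies.duration >= 200]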

Out[53]:

	star_rating	title	content_rating	genre	duration	actors_list
2	9.1	The Godfather: Part II	R	Crime	200	[u'Al Pacino', u'Robert De Niro', u'Robert Duv...
7	8.9	The Lord of the Rings: The Return of the King	PG-13	Adventure	201	[u'Elijah Wood', u'Viggo Mortensen', u'Ian McK...
17	8.7	Seven Samurai	UNRATED	Drama	207	[u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K...
78	8.4	Once Upon a Time in America	R	Crime	229	[u'Robert De Niro', u'James Woods', u'Elizabet...
85	8.4	Lawrence of Arabia	PG	Adventure	216	[u"Peter O'Toole", u'Alec Guinness', u'Anthony...
142	8.3	Lagaan: Once Upon a Time in India	PG	Adventure	224	[u'Aamir Khan', u'Gracy Singh', u'Rachel Shell...
157	8.2	Gone with the Wind	G	Drama	238	[u'Clark Gable', u'Vivien Leigh', u'Thomas Mit...
204	8.1	Ben-Hur	G	Adventure	212	[u'Charlton Heston', u'Jack Hawkins', u'Stephe...
445	7.9	The Ten Commandments	APPROVED	Adventure	220	[u'Charlton Heston', u'Yul Brynner', u'Anne Ba...
476	7.8	Hamlet	PG-13	Drama	242	[u'Kenneth Branagh', u'Julie Christie', u'Dere...
630	7.7	Malcolm X	PG-13	Biography	202	[u'Denzel Washington', u'Angela Bassett', u'De...
767	7.6	It's a Mad, Mad, Mad, Mad World	APPROVED	Action	205	[u'Spencer Tracy', u'Milton Berle', u'Ethel Me...

In [54]:

# select the 'genre' Series from the filtered DataFrame

movies[movies.duration >= 200].genre

# or equivalently, use the 'loc' accessor

movies.loc[movies.duration >= 200, 'genre']

Out[54]:

2          Crime
7      Adventure
17         Drama
78         Crime
85     Adventure
142    Adventure
157        Drama
204    Adventure
445    Adventure
476        Drama
630    Biography
767       Action
Name: genre, dtype: object

Documentation for loc
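
The loop-based and vectorized filters in this section can be compared side by side. The sketch below assumes a hypothetical four-row table. It only illustrates that both approaches produce the same boolean mask.

```python
import pandas as pd

# Hypothetical miniature of the movies data
df = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'genre': ['Crime', 'Drama', 'Crime', 'Action'],
    'duration': [142, 207, 200, 152],
})

# Loop version: one boolean per row
booleans = [bool(length >= 200) for length in df.duration]

# Vectorized version: the comparison is applied to the whole Series at once
is_long = df.duration >= 200

# Filter rows, then select rows and a column together with loc
long_movies = df[is_long]
long_genres = df.loc[is_long, 'genre']
```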

[Back to top]

9. How do I apply multiple filter criteria to a pandas DataFrame? (video)

In [55]:

# read a dataset of top-rated IMDb movies into a DataFrame

movies = pd.read_csv('http://bit.ly/imdbratings')

movies.head()

Out[55]:

	star_rating	title	content_rating	genre	duration	actors_list
0	9.3	The Shawshank Redemption	R	Crime	142	[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1	9.2	The Godfather	R	Crime	175	[u'Marlon Brando', u'Al Pacino', u'James Caan']
2	9.1	The Godfather: Part II	R	Crime	200	[u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3	9.0	The Dark Knight	PG-13	Action	152	[u'Christian Bale', u'Heath Ledger', u'Aaron E...
4	8.9	Pulp Fiction	R	Crime	154	[u'John Travolta', u'Uma Thurman', u'Samuel L....

In [56]:

# filter the DataFrame to only show movies with a 'duration' of at least 200 minutes

movies[movies.duration >= 200]

Out[56]:

	star_rating	title	content_rating	genre	duration	actors_list
2	9.1	The Godfather: Part II	R	Crime	200	[u'Al Pacino', u'Robert De Niro', u'Robert Duv...
7	8.9	The Lord of the Rings: The Return of the King	PG-13	Adventure	201	[u'Elijah Wood', u'Viggo Mortensen', u'Ian McK...
17	8.7	Seven Samurai	UNRATED	Drama	207	[u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K...
78	8.4	Once Upon a Time in America	R	Crime	229	[u'Robert De Niro', u'James Woods', u'Elizabet...
85	8.4	Lawrence of Arabia	PG	Adventure	216	[u"Peter O'Toole", u'Alec Guinness', u'Anthony...
142	8.3	Lagaan: Once Upon a Time in India	PG	Adventure	224	[u'Aamir Khan', u'Gracy Singh', u'Rachel Shell...
157	8.2	Gone with the Wind	G	Drama	238	[u'Clark Gable', u'Vivien Leigh', u'Thomas Mit...
204	8.1	Ben-Hur	G	Adventure	212	[u'Charlton Heston', u'Jack Hawkins', u'Stephe...
445	7.9	The Ten Commandments	APPROVED	Adventure	220	[u'Charlton Heston', u'Yul Brynner', u'Anne Ba...
476	7.8	Hamlet	PG-13	Drama	242	[u'Kenneth Branagh', u'Julie Christie', u'Dere...
630	7.7	Malcolm X	PG-13	Biography	202	[u'Denzel Washington', u'Angela Bassett', u'De...
767	7.6	It's a Mad, Mad, Mad, Mad World	APPROVED	Action	205	[u'Spencer Tracy', u'Milton Berle', u'Ethel Me...

Understanding logical operators:

and: True only if both sides of the operator are True
or: True if either side of the operator is True

In [57]:

print(True and True)

print(True and False)

print(False and False)

True
False
False

In [58]:

print(True or True)

print(True or False)

print(False or False)

True
True
False

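Rules for specifying multiple filter criteria in pandas:

use & instead of and
use | instead of or
add parentheses around each condition to specify evaluation order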
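
A short sketch of these rules, assuming a hypothetical four-row table. It also shows what happens when the Python keyword and is used on two Series by mistake.

```python
import pandas as pd

# Hypothetical miniature of the movies data
df = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'genre': ['Crime', 'Drama', 'Crime', 'Drama'],
    'duration': [142, 207, 200, 96],
})

# Both conditions must hold; each condition is wrapped in parentheses
long_dramas = df[(df.duration >= 200) & (df.genre == 'Drama')]

# Either condition may hold
long_or_drama = df[(df.duration >= 200) | (df.genre == 'Drama')]

# The keyword 'and' asks for a single truth value of a whole Series,
# which pandas refuses to guess, so it raises ValueError
try:
    df[(df.duration >= 200) and (df.genre == 'Drama')]
    raised = False
except ValueError:
    raised = True
```

The parentheses matter because & and | bind more tightly than comparison operators such as >= and ==.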

Goal: Further filter the DataFrame of long movies (duration >= 200) to only show movies which also have a 'genre' of 'Drama'

In [59]:

# use the '&' operator to specify that both conditions are required

movies[(movies.duration >=200) & (movies.genre == 'Drama')]

Out[59]:

	star_rating	title	content_rating	genre	duration	actors_list
17	8.7	Seven Samurai	UNRATED	Drama	207	[u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K...
157	8.2	Gone with the Wind	G	Drama	238	[u'Clark Gable', u'Vivien Leigh', u'Thomas Mit...
476	7.8	Hamlet	PG-13	Drama	242	[u'Kenneth Branagh', u'Julie Christie', u'Dere...

In [60]:

# INCORRECT: using the '|' operator would show movies that are either long or dramas (or both)

movies[(movies.duration >=200) | (movies.genre == 'Drama')].head()

Out[60]:

	star_rating	title	content_rating	genre	duration	actors_list
2	9.1	The Godfather: Part II	R	Crime	200	[u'Al Pacino', u'Robert De Niro', u'Robert Duv...
5	8.9	12 Angry Men	NOT RATED	Drama	96	[u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals...
7	8.9	The Lord of the Rings: The Return of the King	PG-13	Adventure	201	[u'Elijah Wood', u'Viggo Mortensen', u'Ian McK...
9	8.9	Fight Club	R	Drama	139	[u'Brad Pitt', u'Edward Norton', u'Helena Bonh...
13	8.8	Forrest Gump	PG-13	Drama	142	[u'Tom Hanks', u'Robin Wright', u'Gary Sinise']

Goal: Filter the original DataFrame to show movies with a 'genre' of 'Crime' or 'Drama' or 'Action'

In [61]:

# use the '|' operator to specify that a row can match any of the three criteria

movies[(movies.genre == 'Crime') | (movies.genre == 'Drama') | (movies.genre == 'Action')].head(10)

# or equivalently, use the 'isin' method

movies[movies.genre.isin(['Crime', 'Drama', 'Action'])].head(10)

Out[61]:

	star_rating	title	content_rating	genre	duration	actors_list
0	9.3	The Shawshank Redemption	R	Crime	142	[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1	9.2	The Godfather	R	Crime	175	[u'Marlon Brando', u'Al Pacino', u'James Caan']
2	9.1	The Godfather: Part II	R	Crime	200	[u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3	9.0	The Dark Knight	PG-13	Action	152	[u'Christian Bale', u'Heath Ledger', u'Aaron E...
4	8.9	Pulp Fiction	R	Crime	154	[u'John Travolta', u'Uma Thurman', u'Samuel L....
5	8.9	12 Angry Men	NOT RATED	Drama	96	[u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals...
9	8.9	Fight Club	R	Drama	139	[u'Brad Pitt', u'Edward Norton', u'Helena Bonh...
11	8.8	Inception	PG-13	Action	148	[u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'...
12	8.8	Star Wars: Episode V - The Empire Strikes Back	PG	Action	124	[u'Mark Hamill', u'Harrison Ford', u'Carrie Fi...
13	8.8	Forrest Gump	PG-13	Drama	142	[u'Tom Hanks', u'Robin Wright', u'Gary Sinise']

Documentation for isin
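
The equivalence between chained | conditions and isin can be checked directly. The sketch assumes a hypothetical four-row table. The ~ negation is an extra illustration not shown in the notebook.

```python
import pandas as pd

# Hypothetical miniature of the movies data
df = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'genre': ['Crime', 'Drama', 'Action', 'Comedy'],
})

wanted = ['Crime', 'Drama', 'Action']

# Chained comparisons joined with '|'
chained = df[(df.genre == 'Crime') | (df.genre == 'Drama') | (df.genre == 'Action')]

# The same filter written with isin
with_isin = df[df.genre.isin(wanted)]

# '~' inverts the boolean Series, keeping rows that are NOT in the list
others = df[~df.genre.isin(wanted)]
```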

[Back to top]

10. Your pandas questions answered! (video)

Question: When reading from a file, how do I read in only a subset of the columns?

In [62]:

# define the URL

url3 = "https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv"

# open the CSV file with read_csv

ufo = pd.read_csv(url3)

ufo.columns

Out[62]:

Index(['City', 'Colors Reported', 'Shape Reported', 'State', 'Time'], dtype='object')

In [63]:

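# specify which columns to include by name

ufo = pd.read_csv(url3, usecols=['City', 'State'])

# or specify columns by position (note: positions 0 and 4 select 'City' and 'Time', so this call is not identical to the one above)

ufo = pd.read_csv(url3, usecols=[0, 4])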

ufo.columns

Out[63]:

Index(['City', 'Time'], dtype='object')

Question: When reading from a file, how do I read in only a subset of the rows?

In [64]:

# specify how many rows to read

ufo = pd.read_csv(url3, nrows=3)

ufo

Out[64]:

	City	Colors Reported	Shape Reported	State	Time
0	Ithaca	NaN	TRIANGLE	NY	6/1/1930 22:00
1	Willingboro	NaN	OTHER	NJ	6/30/1930 20:00
2	Holyoke	NaN	OVAL	CO	2/15/1931 14:00

Documentation for read_csv
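
The usecols and nrows parameters can be tried on an in-memory file. The three rows below are a hypothetical copy of the first rows of the UFO data. Position 3 is used here because that is where 'State' sits in this layout.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same layout as ufo.csv
raw = (
    "City,Colors Reported,Shape Reported,State,Time\n"
    "Ithaca,,TRIANGLE,NY,6/1/1930 22:00\n"
    "Willingboro,,OTHER,NJ,6/30/1930 20:00\n"
    "Holyoke,,OVAL,CO,2/15/1931 14:00\n"
)

# Select columns by name
by_name = pd.read_csv(io.StringIO(raw), usecols=['City', 'State'])

# Select the same columns by zero-based position
by_position = pd.read_csv(io.StringIO(raw), usecols=[0, 3])

# Read only the first two data rows
first_two = pd.read_csv(io.StringIO(raw), nrows=2)
```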

Question: How do I iterate through a Series?

In [65]:

# a Series is directly iterable (like a list)

for c in ufo.City:

    print(c)

Ithaca
Willingboro
Holyoke

Question: How do I iterate through a DataFrame?

In [66]:

# various methods are available to iterate through a DataFrame

for index, row in ufo.iterrows():

    print(index, row.City, row.State)

0 Ithaca NY
1 Willingboro NJ
2 Holyoke CO

Documentation for iterrows
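
Besides iterrows, a DataFrame can also be walked with itertuples, which yields one namedtuple per row. The sketch assumes a hypothetical two-row table and only shows that both loops visit the same values.

```python
import pandas as pd

# Hypothetical miniature of the UFO data
df = pd.DataFrame({'City': ['Ithaca', 'Holyoke'], 'State': ['NY', 'CO']})

# iterrows yields (index label, row as a Series)
from_iterrows = []
for index, row in df.iterrows():
    from_iterrows.append(f"{index} {row.City} {row.State}")

# itertuples yields namedtuples; the index label is stored in the 'Index' field
from_itertuples = [f"{t.Index} {t.City} {t.State}" for t in df.itertuples()]
```

For column-wide calculations, vectorized operations are usually preferable to either loop.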

Question: How do I drop all non-numeric columns from a DataFrame?

In [67]:

# read a dataset of alcohol consumption into a DataFrame, and check the data types

url7= 'https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/drinks.csv'

drinks = pd.read_csv(url7)

drinks.dtypes

Out[67]:

country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object

In [68]:

# only include the numeric columns in the DataFrame

import numpy as np

drinks.select_dtypes(include=[np.number]).dtypes

Out[68]:

beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
dtype: object

Documentation for select_dtypes
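
select_dtypes works in both directions: include keeps matching columns and exclude drops them. The sketch below assumes a hypothetical two-row table with shortened column names.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the drinks data
df = pd.DataFrame({
    'country': ['Albania', 'Angola'],
    'beer_servings': [89, 217],
    'total_litres': [4.9, 5.9],
})

# Keep only numeric columns (integers and floats)
numeric = df.select_dtypes(include=[np.number])

# Keep everything except numeric columns
non_numeric = df.select_dtypes(exclude=[np.number])
```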

Question: How do I know whether I should pass an argument as a string or a list?

In [69]:

# describe all of the numeric columns

drinks.describe()

Out[69]:

	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol
count	193.000000	193.000000	193.000000	193.000000
mean	106.160622	80.994819	49.450777	4.717098
std	101.143103	88.284312	79.697598	3.773298
min	0.000000	0.000000	0.000000	0.000000
25%	20.000000	4.000000	1.000000	1.300000
50%	76.000000	56.000000	8.000000	4.200000
75%	188.000000	128.000000	59.000000	7.200000
max	376.000000	438.000000	370.000000	14.400000

In [70]:

# pass the string 'all' to describe all columns

drinks.describe(include='all')

Out[70]:

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
count	193	193.000000	193.000000	193.000000	193.000000	193
unique	193	NaN	NaN	NaN	NaN	6
top	Marshall Islands	NaN	NaN	NaN	NaN	Africa
freq	1	NaN	NaN	NaN	NaN	53
mean	NaN	106.160622	80.994819	49.450777	4.717098	NaN
std	NaN	101.143103	88.284312	79.697598	3.773298	NaN
min	NaN	0.000000	0.000000	0.000000	0.000000	NaN
25%	NaN	20.000000	4.000000	1.000000	1.300000	NaN
50%	NaN	76.000000	56.000000	8.000000	4.200000	NaN
75%	NaN	188.000000	128.000000	59.000000	7.200000	NaN
max	NaN	376.000000	438.000000	370.000000	14.400000	NaN

In [71]:

# pass a list of data types to describe only those types

drinks.describe(include=['object', 'float64'])

Out[71]:

	country	total_litres_of_pure_alcohol	continent
count	193	193.000000	193
unique	193	NaN	6
top	Marshall Islands	NaN	Africa
freq	1	NaN	53
mean	NaN	4.717098	NaN
std	NaN	3.773298	NaN
min	NaN	0.000000	NaN
25%	NaN	1.300000	NaN
50%	NaN	4.200000	NaN
75%	NaN	7.200000	NaN
max	NaN	14.400000	NaN

In [72]:

# pass a list even if you only want to describe a single data type

drinks.describe(include=['object'])

Out[72]:

	country	continent
count	193	193
unique	193	6
top	Marshall Islands	Africa
freq	1	53

Documentation for describe

[Back to top]

11. How do I use the "axis" parameter in pandas? (video)

In [73]:

url7= 'https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/drinks.csv'

drinks = pd.read_csv(url7)

drinks.head()

Out[73]:

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
0	Afghanistan	0	0	0	0.0	Asia
1	Albania	89	132	54	4.9	Europe
2	Algeria	25	0	14	0.7	Africa
3	Andorra	245	138	312	12.4	Europe
4	Angola	217	57	45	5.9	Africa

In [74]:

# drop a column (temporarily)

drinks.drop('continent', axis=1).head()

Out[74]:

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol
0	Afghanistan	0	0	0	0.0
1	Albania	89	132	54	4.9
2	Algeria	25	0	14	0.7
3	Andorra	245	138	312	12.4
4	Angola	217	57	45	5.9

Documentation for drop

In [75]:

# drop a row (temporarily)

drinks.drop(2, axis=0).head()

Out[75]:

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
0	Afghanistan	0	0	0	0.0	Asia
1	Albania	89	132	54	4.9	Europe
3	Andorra	245	138	312	12.4	Europe
4	Angola	217	57	45	5.9	Africa
5	Antigua & Barbuda	102	128	45	4.9	North America

When referring to rows or columns with the axis parameter:

axis 0 refers to rows
axis 1 refers to columns

In [76]:

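# calculate the mean of each numeric column (numeric_only=True skips the text columns, which pandas 2.0 and later no longer drop silently)

drinks.mean(numeric_only=True)

# or equivalently, specify the axis explicitly

drinks.mean(axis=0, numeric_only=True)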

Out[76]:

beer_servings                   106.160622
spirit_servings                  80.994819
wine_servings                    49.450777
total_litres_of_pure_alcohol      4.717098
dtype: float64

Documentation for mean

In [77]:

# calculate the mean of each row

drinks.mean(axis=1, numeric_only=True).head()

Out[77]:

0      0.000
1     69.975
2      9.925
3    176.850
4     81.225
dtype: float64

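When performing a mathematical operation with the axis parameter:

axis 0 means the operation should "move down" the row axis
axis 1 means the operation should "move across" the column axis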
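
The two directions are easiest to see on a table small enough to check by hand. The sketch below assumes a hypothetical all-numeric table with the shortened column names 'beer' and 'wine'.

```python
import pandas as pd

# Hypothetical all-numeric miniature of the drinks data
df = pd.DataFrame({'beer': [0, 89, 25], 'wine': [0, 54, 14]})

# axis=0 moves down the rows, giving one result per column
col_means = df.mean(axis=0)

# axis=1 moves across the columns, giving one result per row
row_means = df.mean(axis=1)

# The string aliases give the same results
same_cols = df.mean(axis='index').equals(col_means)
same_rows = df.mean(axis='columns').equals(row_means)
```

Here col_means has two entries (one per column) and row_means has three (one per row), which is a quick way to confirm which direction an operation took.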

In [78]:

# 'index' is an alias for axis 0

drinks.mean(axis='index', numeric_only=True)

Out[78]:

beer_servings                   106.160622
spirit_servings                  80.994819
wine_servings                    49.450777
total_litres_of_pure_alcohol      4.717098
dtype: float64

In [79]:

# 'columns' is an alias for axis 1

drinks.mean(axis='columns', numeric_only=True).head()

Out[79]:

0      0.000
1     69.975
2      9.925
3    176.850
4     81.225
dtype: float64

[Back to top]

12. How do I use string methods in pandas? (video)

In [80]:

# define the URL

url1 = "https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/chipotle.tsv"

# open it with read_table()

orders = pd.read_table(url1)

orders.head()

Out[80]:

	order_id	quantity	item_name	choice_description	item_price
0	1	1	Chips and Fresh Tomato Salsa	NaN	$2.39
1	1	1	Izze	[Clementine]	$3.39
2	1	1	Nantucket Nectar	[Apple]	$3.39
3	1	1	Chips and Tomatillo-Green Chili Salsa	NaN	$2.39
4	2	2	Chicken Bowl	[Tomatillo-Red Chili Salsa (Hot), [Black Beans...	$16.98

In [81]:

# the usual way to access string methods in Python

'hello'.upper()

Out[81]:

'HELLO'

In [82]:

# string methods for a pandas Series are accessed via 'str'

orders.item_name.str.upper().head()

Out[82]:

0             CHIPS AND FRESH TOMATO SALSA
1                                     IZZE
2                         NANTUCKET NECTAR
3    CHIPS AND TOMATILLO-GREEN CHILI SALSA
4                             CHICKEN BOWL
Name: item_name, dtype: object

In [83]:

# the string method 'contains' checks for a substring and returns a boolean Series

orders.item_name.str.contains('Chicken').head()

Out[83]:

0    False
1    False
2    False
3    False
4     True
Name: item_name, dtype: bool

In [84]:

# use the boolean Series to filter the DataFrame

orders[orders.item_name.str.contains('Chicken')].head()

Out[84]:

	order_id	quantity	item_name	choice_description	item_price
4	2	2	Chicken Bowl	[Tomatillo-Red Chili Salsa (Hot), [Black Beans...	$16.98
5	3	1	Chicken Bowl	[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...	$10.98
11	6	1	Chicken Crispy Tacos	[Roasted Chili Corn Salsa, [Fajita Vegetables,...	$8.75
12	6	1	Chicken Soft Tacos	[Roasted Chili Corn Salsa, [Rice, Black Beans,...	$8.75
13	7	1	Chicken Bowl	[Fresh Tomato Salsa, [Fajita Vegetables, Rice,...	$11.25

In [85]:

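# string methods can be chained together (regex=False treats '[' and ']' as literal characters)

orders.choice_description.str.replace('[', '', regex=False).str.replace(']', '', regex=False).head()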

Out[85]:

0                                                  NaN
1                                           Clementine
2                                                Apple
3                                                  NaN
4    Tomatillo-Red Chili Salsa (Hot), Black Beans, ...
Name: choice_description, dtype: object

In [86]:

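# many pandas string methods support regular expressions (pass regex=True, since pandas 2.0 and later treat the pattern as a literal string by default)

orders.choice_description.str.replace(r'[\[\]]', '', regex=True).head()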

Out[86]:

0                                                  NaN
1                                           Clementine
2                                                Apple
3                                                  NaN
4    Tomatillo-Red Chili Salsa (Hot), Black Beans, ...
Name: choice_description, dtype: object

String handling section of the pandas API reference
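
The chained and regex forms of str.replace can be compared on a short Series that includes a missing value. The three entries below are hypothetical and only mimic the shape of the choice_description column.

```python
import pandas as pd

# Hypothetical sample mimicking choice_description, including a missing value
s = pd.Series(['[Clementine]', None, '[Tomatillo-Red Chili Salsa (Hot), [Black Beans]]'])

# Two literal replacements chained together
chained = s.str.replace('[', '', regex=False).str.replace(']', '', regex=False)

# One regular expression matching either bracket
with_regex = s.str.replace(r'[\[\]]', '', regex=True)

# na=False makes the missing value count as "no match" instead of NaN
has_chili = s.str.contains('Chili', regex=False, na=False)
```

Missing values pass through the string methods as missing, so the second entry stays empty in both results.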

[Back to top]

13. How do I change the data type of a pandas Series? (video)

In [87]:

url7= 'https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/drinks.csv'

drinks = pd.read_csv(url7)

drinks.head()

Out[87]:

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
0	Afghanistan	0	0	0	0.0	Asia
1	Albania	89	132	54	4.9	Europe
2	Algeria	25	0	14	0.7	Africa
3	Andorra	245	138	312	12.4	Europe
4	Angola	217	57	45	5.9	Africa

In [88]:

# examine the data type of each Series

drinks.dtypes

Out[88]:

country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object

In [89]:

# change the data type of an existing Series

drinks['beer_servings'] = drinks.beer_servings.astype(float)

drinks.dtypes

Out[89]:

country                          object
beer_servings                   float64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object

Documentation for astype

In [90]:

# alternatively, change the data type of a Series while reading in the file

drinks = pd.read_csv(url7, dtype={'beer_servings':float})

drinks.dtypes

Out[90]:

country                          object
beer_servings                   float64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object
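
The two approaches (convert after reading, or declare the type while reading) should give identical results. A small sketch to check that without a network connection, assuming a made-up inline CSV (`csv_text` is hypothetical and only mimics two columns of drinks.csv):

```python
import io
import pandas as pd

# Hypothetical inline CSV standing in for drinks.csv
csv_text = "country,beer_servings\nAlbania,89\nAlgeria,25\n"

# Option 1: read first, then convert with astype
after = pd.read_csv(io.StringIO(csv_text))
after['beer_servings'] = after['beer_servings'].astype(float)

# Option 2: declare the type while reading
during = pd.read_csv(io.StringIO(csv_text), dtype={'beer_servings': float})

print(after['beer_servings'].dtype, during['beer_servings'].dtype)
print(after.equals(during))
```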

In [91]:

orders = pd.read_table(url1)

orders.head()

Out[91]:

	order_id	quantity	item_name	choice_description	item_price
0	1	1	Chips and Fresh Tomato Salsa	NaN	$2.39
1	1	1	Izze	[Clementine]	$3.39
2	1	1	Nantucket Nectar	[Apple]	$3.39
3	1	1	Chips and Tomatillo-Green Chili Salsa	NaN	$2.39
4	2	2	Chicken Bowl	[Tomatillo-Red Chili Salsa (Hot), [Black Beans...	$16.98

In [92]:

# examine the data type of each Series

orders.dtypes

Out[92]:

order_id               int64
quantity               int64
item_name             object
choice_description    object
item_price            object
dtype: object

In [93]:

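# convert a string to a number in order to do math (regex=False so that '$' is a literal dollar sign, not a regex anchor)

orders.item_price.str.replace('$', '', regex=False).astype(float).mean()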

Out[93]:

7.464335785374397
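
The conversion above has two steps: strip the currency symbol, then cast to float. A minimal sketch on a made-up Series (the `prices` values are hypothetical) that shows the dtype before and after:

```python
import pandas as pd

# Hypothetical prices stored as text, like the 'item_price' column
prices = pd.Series(['$2.39', '$3.39', '$16.98'])

# Step 1: remove the literal dollar sign; step 2: cast the text to float
as_float = prices.str.replace('$', '', regex=False).astype(float)

print(prices.dtype, '->', as_float.dtype)
print(as_float.sum())
```

Calling `astype(float)` on the original strings would raise a `ValueError`, because `'$2.39'` cannot be parsed as a number until the symbol is removed.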

In [94]:

# the string method 'contains' checks for a substring and returns a boolean Series

orders.item_name.str.contains('Chicken').head()

Out[94]:

0    False
1    False
2    False
3    False
4     True
Name: item_name, dtype: bool

In [95]:

# convert a boolean Series to integers (False = 0, True = 1)

orders.item_name.str.contains('Chicken').astype(int).head()

Out[95]:

0    0
1    0
2    0
3    0
4    1
Name: item_name, dtype: int32

[Back to top]

14. When should I use a "groupby" in pandas? (video)

In [96]:

drinks = pd.read_csv(url7)

drinks.head()

Out[96]:

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
0	Afghanistan	0	0	0	0.0	Asia
1	Albania	89	132	54	4.9	Europe
2	Algeria	25	0	14	0.7	Africa
3	Andorra	245	138	312	12.4	Europe
4	Angola	217	57	45	5.9	Africa

In [97]:

# calculate the mean beer servings across the entire dataset

drinks.beer_servings.mean()

Out[97]:

106.16062176165804

In [98]:

# calculate the mean beer servings just for countries in Africa

drinks[drinks.continent=='Africa'].beer_servings.mean()

Out[98]:

61.471698113207545

In [99]:

# calculate the mean beer servings for each continent

drinks.groupby('continent').beer_servings.mean()

Out[99]:

continent
Africa            61.471698
Asia              37.045455
Europe           193.777778
North America    145.434783
Oceania           89.687500
South America    175.083333
Name: beer_servings, dtype: float64

Documentation for groupby
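
A groupby does for every group what the filter in the previous cell did for one group. A minimal sketch of that equivalence, assuming a small made-up DataFrame (`df` is hypothetical and only borrows the column names from the drinks data):

```python
import pandas as pd

# Hypothetical miniature of the drinks data
df = pd.DataFrame({
    'continent': ['Africa', 'Africa', 'Europe', 'Europe', 'Asia'],
    'beer_servings': [25, 217, 89, 245, 0],
})

# One call: split by continent, take the mean of each group, combine the results
grouped = df.groupby('continent').beer_servings.mean()

# The manual route: filter once per continent and take the mean each time
manual = {c: df[df.continent == c].beer_servings.mean()
          for c in df.continent.unique()}

print(grouped)
print(manual)
```

The result of the groupby is a Series indexed by the group labels, sorted alphabetically by default.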

In [100]:

# other aggregation functions (such as 'max') can also be used with groupby

drinks.groupby('continent').beer_servings.max()

Out[100]:

continent
Africa           376
Asia             247
Europe           361
North America    285
Oceania          306
South America    333
Name: beer_servings, dtype: int64

In [101]:

# multiple aggregation functions can be applied simultaneously

drinks.groupby('continent').beer_servings.agg(['count', 'mean', 'min', 'max'])

Out[101]:

	count	mean	min	max
continent
Africa	53	61.471698	0	376
Asia	44	37.045455	0	247
Europe	45	193.777778	0	361
North America	23	145.434783	1	285
Oceania	16	89.687500	0	306
South America	12	175.083333	93	333

Documentation for agg
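
Besides a list of function names, `agg` also accepts keyword arguments that name the output columns (named aggregation, which to my knowledge is available from pandas 0.25 onward). A sketch on a made-up DataFrame; `df`, `lowest`, and `highest` are hypothetical names chosen for illustration:

```python
import pandas as pd

# Hypothetical miniature of the drinks data
df = pd.DataFrame({
    'continent': ['Africa', 'Africa', 'Europe', 'Europe', 'Asia'],
    'beer_servings': [25, 217, 89, 245, 0],
})

# List form: column names are the function names
summary = df.groupby('continent').beer_servings.agg(['count', 'mean'])

# Keyword form: column names are chosen by the caller
named = df.groupby('continent').beer_servings.agg(lowest='min', highest='max')

print(summary)
print(named)
```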

In [102]:

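# if no column is specified, the aggregation is applied to all numeric columns (numeric_only=True is required in pandas 2.0+ because 'country' is text)

drinks.groupby('continent').mean(numeric_only=True)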

Out[102]:

	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol
continent
Africa	61.471698	16.339623	16.264151	3.007547
Asia	37.045455	60.840909	9.068182	2.170455
Europe	193.777778	132.555556	142.222222	8.617778
North America	145.434783	165.739130	24.521739	5.995652
Oceania	89.687500	58.437500	35.625000	3.381250
South America	175.083333	114.750000	62.416667	6.308333
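
In pandas 2.0 and later, taking the mean of a grouped DataFrame that still contains text columns raises a `TypeError` unless the text columns are excluded. Two ways to exclude them, sketched on a made-up DataFrame (`df` is hypothetical):

```python
import pandas as pd

# Hypothetical miniature of the drinks data, including a text column
df = pd.DataFrame({
    'country': ['Algeria', 'Angola', 'Albania', 'Andorra'],
    'continent': ['Africa', 'Africa', 'Europe', 'Europe'],
    'beer_servings': [25, 217, 89, 245],
    'wine_servings': [14, 45, 54, 312],
})

# Option 1: let pandas skip the non-numeric columns
by_flag = df.groupby('continent').mean(numeric_only=True)

# Option 2: select the numeric columns explicitly before aggregating
by_selection = df.groupby('continent')[['beer_servings', 'wine_servings']].mean()

print(by_flag)
print(by_flag.equals(by_selection))
```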

In [103]:

# allow plots to appear in the Jupyter notebook

%matplotlib inline

In [104]:

# side-by-side bar plot of the DataFrame directly above

drinks.groupby('continent').mean(numeric_only=True).plot(kind='bar')

Out[104]:

<matplotlib.axes._subplots.AxesSubplot at 0x296edf23b70>

Documentation for plot

[Back to top]

15. How do I explore a pandas Series? (video)

In [105]:

# read a dataset of top-rated IMDb movies into a DataFrame

url4="https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/imdb_1000.csv"

movies = pd.read_csv(url4)

movies.head()

Out[105]:

	star_rating	title	content_rating	genre	duration	actors_list
0	9.3	The Shawshank Redemption	R	Crime	142	[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1	9.2	The Godfather	R	Crime	175	[u'Marlon Brando', u'Al Pacino', u'James Caan']
2	9.1	The Godfather: Part II	R	Crime	200	[u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3	9.0	The Dark Knight	PG-13	Action	152	[u'Christian Bale', u'Heath Ledger', u'Aaron E...
4	8.9	Pulp Fiction	R	Crime	154	[u'John Travolta', u'Uma Thurman', u'Samuel L....

In [106]:

# examine the data type of each Series

movies.dtypes

Out[106]:

star_rating       float64
title              object
content_rating     object
genre              object
duration            int64
actors_list        object
dtype: object

Exploring a non-numeric Series:

In [107]:

# count the non-null values, the unique values, and the frequency of the most common value

movies.genre.describe()

Out[107]:

count       979
unique       16
top       Drama
freq        278
Name: genre, dtype: object

Documentation for describe

In [108]:

# count how many times each value in the Series occurs

movies.genre.value_counts()

Out[108]:

Drama        278
Comedy       156
Action       136
Crime        124
Biography     77
Adventure     75
Animation     62
Horror        29
Mystery       16
Western        9
Sci-Fi         5
Thriller       5
Film-Noir      3
Family         2
Fantasy        1
History        1
Name: genre, dtype: int64

Documentation for value_counts

In [109]:

# display proportions instead of raw counts

movies.genre.value_counts(normalize=True)

Out[109]:

Drama        0.283963
Comedy       0.159346
Action       0.138917
Crime        0.126660
Biography    0.078652
Adventure    0.076609
Animation    0.063330
Horror       0.029622
Mystery      0.016343
Western      0.009193
Sci-Fi       0.005107
Thriller     0.005107
Film-Noir    0.003064
Family       0.002043
Fantasy      0.001021
History      0.001021
Name: genre, dtype: float64
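
With `normalize=True`, each count is divided by the number of non-null values, so the results sum to 1. A quick check on a made-up Series (`genres` is hypothetical):

```python
import pandas as pd

# Hypothetical genre labels
genres = pd.Series(['Drama', 'Drama', 'Comedy', 'Crime'])

counts = genres.value_counts()
shares = genres.value_counts(normalize=True)

print(counts['Drama'], shares['Drama'])
print(shares.sum())
```

Multiply by 100 if you want percentages rather than fractions.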

In [110]:

# 'value_counts' outputs a Series

type(movies.genre.value_counts())

Out[110]:

pandas.core.series.Series

In [111]:

# so Series methods can be chained onto the result

movies.genre.value_counts().head()

Out[111]:

Drama        278
Comedy       156
Action       136
Crime        124
Biography     77
Name: genre, dtype: int64

In [112]:

# display the unique values in the Series

movies.genre.unique()

Out[112]:

array(['Crime', 'Action', 'Drama', 'Western', 'Adventure', 'Biography','Comedy', 'Animation', 'Mystery', 'Horror', 'Film-Noir', 'Sci-Fi','History', 'Thriller', 'Family', 'Fantasy'], dtype=object)

In [113]:

# count the number of unique values in the Series

movies.genre.nunique()

Out[113]:

16

Documentation for unique and nunique
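
One detail worth knowing: `unique` and `nunique` treat missing values differently. A small sketch on a made-up Series containing one missing value (`genres` is hypothetical; the movies data has no missing genres):

```python
import pandas as pd

# Hypothetical genre labels with one missing entry
genres = pd.Series(['Crime', 'Drama', 'Crime', None, 'Action'])

distinct = genres.unique()                   # keeps the missing value as its own entry
n_distinct = genres.nunique()                # ignores missing values by default
n_with_na = genres.nunique(dropna=False)     # counts the missing value as well

print(len(distinct), n_distinct, n_with_na)
print(len(genres.value_counts()))            # value_counts also drops missing values by default
```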

In [114]:

# compute a cross-tabulation of two Series

pd.crosstab(movies.genre, movies.content_rating)

Out[114]:

content_rating	APPROVED	G	GP	NC-17	NOT RATED	PASSED	PG	PG-13	R	TV-MA	UNRATED	X
genre
Action	3	1	1	0	4	1	11	44	67	0	3	0
Adventure	3	2	0	0	5	1	21	23	17	0	2	0
Animation	3	20	0	0	3	0	25	5	5	0	1	0
Biography	1	2	1	0	1	0	6	29	36	0	0	0
Comedy	9	2	1	1	16	3	23	23	73	0	4	1
Crime	6	0	0	1	7	1	6	4	87	0	11	1
Drama	12	3	0	4	24	1	25	55	143	1	9	1
Family	0	1	0	0	0	0	1	0	0	0	0	0
Fantasy	0	0	0	0	0	0	0	0	1	0	0	0
Film-Noir	1	0	0	0	1	0	0	0	0	0	1	0
History	0	0	0	0	0	0	0	0	0	0	1	0
Horror	2	0	0	1	1	0	1	2	16	0	5	1
Mystery	4	1	0	0	1	0	1	2	6	0	1	0
Sci-Fi	1	0	0	0	0	0	0	1	3	0	0	0
Thriller	1	0	0	0	0	0	1	0	3	0	0	0
Western	1	0	0	0	2	0	2	1	3	0	0	0

Documentation for crosstab
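
Each cell of a cross-tabulation counts the rows that share that pair of values. A minimal sketch on a made-up DataFrame (`small` is hypothetical), including the `margins=True` option that adds row and column totals under the label `All`:

```python
import pandas as pd

# Hypothetical miniature of the movies data
small = pd.DataFrame({
    'genre': ['Crime', 'Crime', 'Action', 'Drama'],
    'content_rating': ['R', 'R', 'PG-13', 'R'],
})

# Rows are genres, columns are ratings, cells are counts
table = pd.crosstab(small.genre, small.content_rating)

# Same table with an extra 'All' row and column holding the totals
with_totals = pd.crosstab(small.genre, small.content_rating, margins=True)

print(table)
print(with_totals)
```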

Exploring a numeric Series:

In [115]:

# calculate various summary statistics

movies.duration.describe()

Out[115]:

count    979.000000
mean     120.979571
std       26.218010
min       64.000000
25%      102.000000
50%      117.000000
75%      134.000000
max      242.000000
Name: duration, dtype: float64

In [116]:

# many statistics are implemented as Series methods

movies.duration.mean()

Out[116]:

120.97957099080695

Documentation for mean

In [117]:

# 'value_counts' is primarily useful for categorical data, not numerical data

movies.duration.value_counts().head()

Out[117]:

112    23
113    22
102    20
101    20
129    19
Name: duration, dtype: int64
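
For numeric data, exact-value counts are rarely informative because most values occur only a few times. One option is the `bins` argument of `value_counts`, which counts values per interval instead. A sketch on a made-up Series (`durations` is hypothetical, and the choice of 3 bins is arbitrary):

```python
import pandas as pd

# Hypothetical durations in minutes
durations = pd.Series([64, 95, 102, 117, 118, 134, 150, 242])

# Exact-value counts: every value is distinct here, so each count is 1
exact = durations.value_counts()

# Binned counts: split the value range into 3 equal-width intervals
binned = durations.value_counts(bins=3)

print(exact.max())
print(binned)
```

This is essentially a tabular version of the histogram shown in the plotting cell.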

In [118]:

# allow plots to appear in the Jupyter notebook

%matplotlib inline

In [119]:

# histogram of the 'duration' Series (shows the distribution of a numerical variable)

movies.duration.plot(kind='hist')

Out[119]:

<matplotlib.axes._subplots.AxesSubplot at 0x296ee26ba58>

In [120]:

# bar plot of the 'value_counts' for the 'genre' Series

movies.genre.value_counts().plot(kind='bar')

Out[120]:

<matplotlib.axes._subplots.AxesSubplot at 0x296ee2ccba8>

Documentation for plot

[Back to top]

Reposted from: https://www.cnblogs.com/romannista/p/10659805.html
