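Python Data Analysis and Data Mining for Beginners, Chapter 2: pandas, Section 1: pandas Basics Q&A 1-15
These notes follow a video course by a presenter on YouTube, available at: https://www.youtube.com/watch?v=yzIMircGU5I&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y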
Python pandas Q&A video series by Data School
YouTube playlist and GitHub repository
Table of contents
- What is pandas?
- How do I read a tabular data file into pandas?
- How do I select a pandas Series from a DataFrame?
- Why do some pandas commands end with parentheses (and others don't)?
- How do I rename columns in a pandas DataFrame?
- How do I remove columns from a pandas DataFrame?
- How do I sort a pandas DataFrame or a Series?
- How do I filter rows of a pandas DataFrame by column value?
- How do I apply multiple filter criteria to a pandas DataFrame?
- Your pandas questions answered!
- How do I use the "axis" parameter in pandas?
- How do I use string methods in pandas?
- How do I change the data type of a pandas Series?
- When should I use a "groupby" in pandas?
- How do I explore a pandas Series?
- How do I handle missing values in pandas?
- What do I need to know about the pandas index? (Part 1)
- What do I need to know about the pandas index? (Part 2)
- How do I select multiple rows and columns from a pandas DataFrame?
- When should I use the "inplace" parameter in pandas?
- How do I make my pandas DataFrame smaller and faster?
- How do I use pandas with scikit-learn to create Kaggle submissions?
- More of your pandas questions answered!
- How do I create dummy variables in pandas?
- How do I work with dates and times in pandas?
- How do I find and remove duplicate rows in pandas?
- How do I avoid a SettingWithCopyWarning in pandas?
- How do I change display options in pandas?
- How do I create a pandas DataFrame from another object?
- How do I apply a function to a pandas Series or DataFrame?
# conventional way to import pandas
import pandas as pd
1. What is pandas? (video)
- pandas main page
- pandas installation instructions
- Anaconda distribution of Python (includes pandas)
- How to use the IPython/Jupyter notebook (video)
2. How do I read a tabular data file into pandas? (video)
# read a dataset of Chipotle orders directly from a URL and store the results in a DataFrame
# define the URL
url1 = "https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/chipotle.tsv"
orders = pd.read_table(url1)  # open with read_table() (the default separator is a tab)
# examine the first 5 rows
orders.head()
| | order_id | quantity | item_name | choice_description | item_price |
---|---|---|---|---|---|
0 | 1 | 1 | Chips and Fresh Tomato Salsa | NaN | $2.39 |
1 | 1 | 1 | Izze | [Clementine] | $3.39 |
2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 |
3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NaN | $2.39 |
4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 |
Documentation for read_table
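# read a dataset of movie reviewers (modifying the default parameter values for read_table)
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zipcode']  # define the column names
# define the URL
url2 = "https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/u.user"
# optional variant that also skips rows at the start and end of the file (skipfooter needs the Python engine):
# users = pd.read_table(url2, sep='|', header=None, names=user_cols, skiprows=2, skipfooter=3, engine='python')
# sep sets the delimiter, header=None means the file has no header row, names supplies the column names
users = pd.read_table('http://bit.ly/movieusers', sep='|', header=None, names=user_cols)
# examine the first 5 rows
users.head()
| | user_id | age | gender | occupation | zipcode |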
---|---|---|---|---|---|
0 | 1 | 24 | M | technician | 85711 |
1 | 2 | 53 | F | other | 94043 |
2 | 3 | 23 | M | writer | 32067 |
3 | 4 | 24 | M | technician | 43537 |
4 | 5 | 33 | F | other | 15213 |
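The effect of `sep`, `header` and `names` can be tried without a network connection. This is only a sketch: the two data rows are copied from the output above, and `raw` and `sample_users` are illustrative names that do not appear in the course material.

```python
import io
import pandas as pd

# In-memory stand-in for a pipe-delimited file that has no header row
raw = io.StringIO("1|24|M|technician|85711\n2|53|F|other|94043\n")

user_cols = ['user_id', 'age', 'gender', 'occupation', 'zipcode']

# sep='|'      -> fields are separated by a pipe instead of a tab
# header=None  -> the first line is data, not column names
# names=...    -> supply the column names ourselves
sample_users = pd.read_table(raw, sep='|', header=None, names=user_cols)

print(sample_users.shape)   # (2, 5)
print(sample_users.head())
```

Without `header=None`, the first data row would be consumed as the header, so it would disappear from the data.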
3. How do I select a pandas Series from a DataFrame? (video)
# read a dataset of UFO reports into a DataFrame
url3 = "https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv"  # define the URL
# read_table can open a CSV file when sep=',' is passed
ufo = pd.read_table(url3, sep=',')
# read_csv is equivalent, except that a comma is its default separator
ufo = pd.read_csv(url3)
# examine the first 5 rows
ufo.head()
| | City | Colors Reported | Shape Reported | State | Time |
---|---|---|---|---|---|
0 | Ithaca | NaN | TRIANGLE | NY | 6/1/1930 22:00 |
1 | Willingboro | NaN | OTHER | NJ | 6/30/1930 20:00 |
2 | Holyoke | NaN | OVAL | CO | 2/15/1931 14:00 |
3 | Abilene | NaN | DISK | KS | 6/1/1931 13:00 |
4 | New York Worlds Fair | NaN | LIGHT | NY | 4/18/1933 19:00 |
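# select the 'City' Series using bracket notation
ufo['City']
# or select it using dot notation: quicker to type, but it fails when the name contains spaces or is not a valid Python identifier
ufo.City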
0        Ithaca
1        Willingboro
2        Holyoke
3        Abilene
4        New York Worlds Fair
5        Valley City
6        Crater Lake
7        Alma
8        Eklutna
9        Hubbard
10       Fontana
11       Waterloo
12       Belton
13       Keokuk
14       Ludington
15       Forest Home
16       Los Angeles
17       Hapeville
18       Oneida
19       Bering Sea
20       Nebraska
21       NaN
22       NaN
23       Owensboro
24       Wilderness
25       San Diego
26       Wilderness
27       Clovis
28       Los Alamos
29       Ft. Duschene
...
18211    Holyoke
18212    Carson
18213    Pasadena
18214    Austin
18215    El Campo
18216    Garden Grove
18217    Berthoud Pass
18218    Sisterdale
18219    Garden Grove
18220    Shasta Lake
18221    Franklin
18222    Albrightsville
18223    Greenville
18224    Eufaula
18225    Simi Valley
18226    San Francisco
18227    San Francisco
18228    Kingsville
18229    Chicago
18230    Pismo Beach
18231    Pismo Beach
18232    Lodi
18233    Anchorage
18234    Capitola
18235    Fountain Hills
18236    Grant Park
18237    Spirit Lake
18238    Eagle River
18239    Eagle River
18240    Ybor
Name: City, Length: 18241, dtype: object
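Bracket notation always works, whereas dot notation has limitations:
- Dot notation doesn't work if there are spaces in the Series name
- Dot notation doesn't work if the Series has the same name as a DataFrame method or attribute (such as 'head' or 'shape')
- Dot notation can't be used to define the name of a new Series
# create a new 'Location' Series (bracket notation is required on the left-hand side when defining a new Series)
ufo['Location'] = ufo.City + ', ' + ufo.State
ufo.head()
| | City | Colors Reported | Shape Reported | State | Time | Location |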
---|---|---|---|---|---|---|
0 | Ithaca | NaN | TRIANGLE | NY | 6/1/1930 22:00 | Ithaca, NY |
1 | Willingboro | NaN | OTHER | NJ | 6/30/1930 20:00 | Willingboro, NJ |
2 | Holyoke | NaN | OVAL | CO | 2/15/1931 14:00 | Holyoke, CO |
3 | Abilene | NaN | DISK | KS | 6/1/1931 13:00 | Abilene, KS |
4 | New York Worlds Fair | NaN | LIGHT | NY | 4/18/1933 19:00 | New York Worlds Fair, NY |
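The limits of dot notation can be checked on a small made-up DataFrame. This is a sketch, not the course's own code: `df` and its `'shape'` column are hypothetical and exist only to trigger the name clash.

```python
import pandas as pd

# Hypothetical miniature of the UFO data, plus a column whose name clashes with an attribute
df = pd.DataFrame({'City': ['Ithaca', 'Holyoke'],
                   'Shape Reported': ['TRIANGLE', 'OVAL'],
                   'shape': ['a', 'b']})

same = df['City'].equals(df.City)   # both notations return the same Series
shapes = df['Shape Reported']       # a name containing a space needs brackets
attr = df.shape                     # the DataFrame attribute wins: a (rows, columns) tuple
col = df['shape']                   # brackets still reach the column named 'shape'

# a new column has to be created with bracket notation
df['Location'] = df['City'] + ', ' + df['shape']

print(same, attr, list(col))
```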
4. Why do some pandas commands end with parentheses (and others don't)? (video)
# read a dataset of top-rated IMDb movies into a DataFrame
url4 = "https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/imdb_1000.csv"
movies = pd.read_csv(url4)
# methods end with parentheses, while attributes don't
# example method: show the first 5 rows
movies.head()
| | star_rating | title | content_rating | genre | duration | actors_list |
---|---|---|---|---|---|---|
0 | 9.3 | The Shawshank Redemption | R | Crime | 142 | [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... |
1 | 9.2 | The Godfather | R | Crime | 175 | [u'Marlon Brando', u'Al Pacino', u'James Caan'] |
2 | 9.1 | The Godfather: Part II | R | Crime | 200 | [u'Al Pacino', u'Robert De Niro', u'Robert Duv... |
3 | 9.0 | The Dark Knight | PG-13 | Action | 152 | [u'Christian Bale', u'Heath Ledger', u'Aaron E... |
4 | 8.9 | Pulp Fiction | R | Crime | 154 | [u'John Travolta', u'Uma Thurman', u'Samuel L.... |
# example method: calculate summary statistics
movies.describe()
| | star_rating | duration |
---|---|---|
count | 979.000000 | 979.000000 |
mean | 7.889785 | 120.979571 |
std | 0.336069 | 26.218010 |
min | 7.400000 | 64.000000 |
25% | 7.600000 | 102.000000 |
50% | 7.800000 | 117.000000 |
75% | 8.100000 | 134.000000 |
max | 9.300000 | 242.000000 |
# example attribute: number of rows and columns
movies.shape
(979, 6)
# example attribute: data type of each column
movies.dtypes
star_rating       float64
title              object
content_rating     object
genre              object
duration            int64
actors_list        object
dtype: object
# use an optional parameter of the describe method to summarize only the 'object' columns
movies.describe(include=['object'])
| | title | content_rating | genre | actors_list |
---|---|---|---|---|
count | 979 | 976 | 979 | 979 |
unique | 975 | 12 | 16 | 969 |
top | Dracula | R | Drama | [u'Daniel Radcliffe', u'Emma Watson', u'Rupert... |
freq | 2 | 460 | 278 | 6 |
Documentation for describe
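One rough way to tell the two apart is to ask Python whether the name is callable. A small sketch on made-up data (`df` is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'star_rating': [9.3, 9.2], 'title': ['A', 'B']})  # made-up rows

first_row = df.head(1)      # method: an action, called with parentheses (and optional arguments)
n_rows, n_cols = df.shape   # attribute: a stored description, read without parentheses

print(callable(df.head), callable(df.shape))   # True False
```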
5. How do I rename columns in a pandas DataFrame? (video)
# examine the column names
ufo.columns
Index(['City', 'Colors Reported', 'Shape Reported', 'State', 'Time', 'Location'], dtype='object')
# rename two of the columns by using the 'rename' method
ufo.rename(columns={'Colors Reported':'Colors_Reported', 'Shape Reported':'Shape_Reported'}, inplace=True)
ufo.columns
Index(['City', 'Colors_Reported', 'Shape_Reported', 'State', 'Time', 'Location'], dtype='object')
Documentation for rename
# replace all of the column names by overwriting the 'columns' attribute
ufo = pd.read_table(url3, sep=',')
ufo_cols = ['city', 'colors reported', 'shape reported', 'state', 'time']
ufo.columns = ufo_cols
ufo.columns
Index(['city', 'colors reported', 'shape reported', 'state', 'time'], dtype='object')
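# replace the column names while reading the file by using the 'names' parameter (header=0 marks the existing header row as the one to replace)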
ufo = pd.read_csv(url3, header=0, names=ufo_cols)
ufo.columns
Index(['city', 'colors reported', 'shape reported', 'state', 'time'], dtype='object')
Documentation for read_csv
# replace every space in the column names with an underscore, all at once
ufo.columns = ufo.columns.str.replace(' ', '_')
ufo.columns
Index(['city', 'colors_reported', 'shape_reported', 'state', 'time'], dtype='object')
Documentation for str.replace
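The renaming approaches can be compared side by side on a tiny example. This sketch uses a made-up one-row DataFrame; `renamed` and `cleaned` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'City': ['Ithaca'],
                   'Colors Reported': [None],
                   'Shape Reported': ['OVAL']})

# rename selected columns; without inplace=True a new DataFrame is returned
renamed = df.rename(columns={'Colors Reported': 'Colors_Reported'})

# normalise every column name in one pass: lowercase, then spaces to underscores
cleaned = df.copy()
cleaned.columns = cleaned.columns.str.lower().str.replace(' ', '_')

print(list(renamed.columns))
print(list(cleaned.columns))
```

Because `rename` was called without `inplace=True`, `df` itself keeps its original column names.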
6. How do I remove columns from a pandas DataFrame? (video)
ufo = pd.read_table(url3, sep=',')
ufo.head()
| | City | Colors Reported | Shape Reported | State | Time |
---|---|---|---|---|---|
0 | Ithaca | NaN | TRIANGLE | NY | 6/1/1930 22:00 |
1 | Willingboro | NaN | OTHER | NJ | 6/30/1930 20:00 |
2 | Holyoke | NaN | OVAL | CO | 2/15/1931 14:00 |
3 | Abilene | NaN | DISK | KS | 6/1/1931 13:00 |
4 | New York Worlds Fair | NaN | LIGHT | NY | 4/18/1933 19:00 |
# remove a single column (axis=1 refers to columns); inplace=True modifies the original object directly instead of returning a new one
ufo.drop('Colors Reported', axis=1, inplace=True)
ufo.head()
| | City | Shape Reported | State | Time |
---|---|---|---|---|
0 | Ithaca | TRIANGLE | NY | 6/1/1930 22:00 |
1 | Willingboro | OTHER | NJ | 6/30/1930 20:00 |
2 | Holyoke | OVAL | CO | 2/15/1931 14:00 |
3 | Abilene | DISK | KS | 6/1/1931 13:00 |
4 | New York Worlds Fair | LIGHT | NY | 4/18/1933 19:00 |
Documentation for drop
# remove multiple columns at once
ufo.drop(['City', 'State'], axis=1, inplace=True)
ufo.head()
| | Shape Reported | Time |
---|---|---|
0 | TRIANGLE | 6/1/1930 22:00 |
1 | OTHER | 6/30/1930 20:00 |
2 | OVAL | 2/15/1931 14:00 |
3 | DISK | 6/1/1931 13:00 |
4 | LIGHT | 4/18/1933 19:00 |
# remove multiple rows at once by label (axis=0 refers to rows; it is the default, but writing it out is clearer)
ufo.drop([0, 1], axis=0, inplace=True)
ufo.head()
| | Shape Reported | Time |
---|---|---|
2 | OVAL | 2/15/1931 14:00 |
3 | DISK | 6/1/1931 13:00 |
4 | LIGHT | 4/18/1933 19:00 |
5 | DISK | 9/15/1934 15:30 |
6 | CIRCLE | 6/15/1935 0:00 |
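Without `inplace=True`, `drop` leaves the original untouched and returns a new DataFrame. A brief sketch on a three-row sample (rows taken from the UFO output above; `no_state` and `no_first` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'City': ['Ithaca', 'Holyoke', 'Abilene'],
                   'State': ['NY', 'CO', 'KS'],
                   'Time': ['6/1/1930 22:00', '2/15/1931 14:00', '6/1/1931 13:00']})

no_state = df.drop('State', axis=1)    # same result as df.drop(columns='State')
no_first = df.drop([0, 1], axis=0)     # same result as df.drop(index=[0, 1])

print(list(no_state.columns))   # ['City', 'Time']
print(list(no_first.index))     # [2]
print(df.shape)                 # (3, 3): the original is unchanged
```

The `columns=` and `index=` keywords are an alternative to `axis` that some readers find easier to remember.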
7. How do I sort a pandas DataFrame or a Series? (video)
movies.head()
| | star_rating | title | content_rating | genre | duration | actors_list |
---|---|---|---|---|---|---|
0 | 9.3 | The Shawshank Redemption | R | Crime | 142 | [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... |
1 | 9.2 | The Godfather | R | Crime | 175 | [u'Marlon Brando', u'Al Pacino', u'James Caan'] |
2 | 9.1 | The Godfather: Part II | R | Crime | 200 | [u'Al Pacino', u'Robert De Niro', u'Robert Duv... |
3 | 9.0 | The Dark Knight | PG-13 | Action | 152 | [u'Christian Bale', u'Heath Ledger', u'Aaron E... |
4 | 8.9 | Pulp Fiction | R | Crime | 154 | [u'John Travolta', u'Uma Thurman', u'Samuel L.... |
# Note: none of the sorting methods below affects the underlying data (in other words, the sorting is temporary)
# sort a single Series
movies.title.sort_values().head()
542    (500) Days of Summer
5      12 Angry Men
201    12 Years a Slave
698    127 Hours
110    2001: A Space Odyssey
Name: title, dtype: object
# sort a single Series in descending order
movies.title.sort_values(ascending=False).head()
864    [Rec]
526    Zulu
615    Zombieland
677    Zodiac
955    Zero Dark Thirty
Name: title, dtype: object
Documentation for sort_values for a Series. (Prior to version 0.17, use order instead.)
# sort the entire DataFrame by the 'title' Series
movies.sort_values('title').head()
| | star_rating | title | content_rating | genre | duration | actors_list |
---|---|---|---|---|---|---|
542 | 7.8 | (500) Days of Summer | PG-13 | Comedy | 95 | [u'Zooey Deschanel', u'Joseph Gordon-Levitt', ... |
5 | 8.9 | 12 Angry Men | NOT RATED | Drama | 96 | [u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals... |
201 | 8.1 | 12 Years a Slave | R | Biography | 134 | [u'Chiwetel Ejiofor', u'Michael Kenneth Willia... |
698 | 7.6 | 127 Hours | R | Adventure | 94 | [u'James Franco', u'Amber Tamblyn', u'Kate Mara'] |
110 | 8.3 | 2001: A Space Odyssey | G | Mystery | 160 | [u'Keir Dullea', u'Gary Lockwood', u'William S... |
# sort in descending order instead
movies.sort_values('title', ascending=False).head()
| | star_rating | title | content_rating | genre | duration | actors_list |
---|---|---|---|---|---|---|
864 | 7.5 | [Rec] | R | Horror | 78 | [u'Manuela Velasco', u'Ferran Terraza', u'Jorg... |
526 | 7.8 | Zulu | UNRATED | Drama | 138 | [u'Stanley Baker', u'Jack Hawkins', u'Ulla Jac... |
615 | 7.7 | Zombieland | R | Comedy | 88 | [u'Jesse Eisenberg', u'Emma Stone', u'Woody Ha... |
677 | 7.7 | Zodiac | R | Crime | 157 | [u'Jake Gyllenhaal', u'Robert Downey Jr.', u'M... |
955 | 7.4 | Zero Dark Thirty | R | Drama | 157 | [u'Jessica Chastain', u'Joel Edgerton', u'Chri... |
Documentation for sort_values for a DataFrame. (Prior to version 0.17, use sort instead.)
# sort the DataFrame first by 'content_rating', then by 'duration'
movies.sort_values(['content_rating', 'duration']).head()
| | star_rating | title | content_rating | genre | duration | actors_list |
---|---|---|---|---|---|---|
713 | 7.6 | The Jungle Book | APPROVED | Animation | 78 | [u'Phil Harris', u'Sebastian Cabot', u'Louis P... |
513 | 7.8 | Invasion of the Body Snatchers | APPROVED | Horror | 80 | [u'Kevin McCarthy', u'Dana Wynter', u'Larry Ga... |
272 | 8.1 | The Killing | APPROVED | Crime | 85 | [u'Sterling Hayden', u'Coleen Gray', u'Vince E... |
703 | 7.6 | Dracula | APPROVED | Horror | 85 | [u'Bela Lugosi', u'Helen Chandler', u'David Ma... |
612 | 7.7 | A Hard Day's Night | APPROVED | Comedy | 87 | [u'John Lennon', u'Paul McCartney', u'George H... |
Summary of changes to the sorting API in pandas 0.17
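When sorting by several columns, `ascending` also accepts a list, one flag per column. A small sketch with made-up titles (`df` and `ordered` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'title': ['A', 'B', 'C', 'D'],
                   'content_rating': ['R', 'PG', 'R', 'PG'],
                   'duration': [142, 96, 200, 124]})

# content_rating ascending, then duration descending within each rating
ordered = df.sort_values(['content_rating', 'duration'], ascending=[True, False])

print(ordered.title.tolist())   # ['D', 'B', 'C', 'A']
print(df.title.tolist())        # ['A', 'B', 'C', 'D']: the original order is kept
```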
8. How do I filter rows of a pandas DataFrame by column value? (video)
movies.head()
| | star_rating | title | content_rating | genre | duration | actors_list |
---|---|---|---|---|---|---|
0 | 9.3 | The Shawshank Redemption | R | Crime | 142 | [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... |
1 | 9.2 | The Godfather | R | Crime | 175 | [u'Marlon Brando', u'Al Pacino', u'James Caan'] |
2 | 9.1 | The Godfather: Part II | R | Crime | 200 | [u'Al Pacino', u'Robert De Niro', u'Robert Duv... |
3 | 9.0 | The Dark Knight | PG-13 | Action | 152 | [u'Christian Bale', u'Heath Ledger', u'Aaron E... |
4 | 8.9 | Pulp Fiction | R | Crime | 154 | [u'John Travolta', u'Uma Thurman', u'Samuel L.... |
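# examine the number of rows and columns
movies.shape
(979, 6)
# Goal: filter the DataFrame rows to only show movies with a 'duration' of at least 200 minutes
# First, the long way: use a for loop to build a list with one entry per DataFrame row,
# where each element is True if the corresponding row satisfies the condition and False otherwise
booleans = []
for length in movies.duration:
    if length >= 200:
        booleans.append(True)
    else:
        booleans.append(False)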
# confirm that the list has the same length as the DataFrame
len(booleans)
979
# examine the first five list elements
booleans[0:5]
[False, False, True, False, False]
# convert the list to a Series
is_long = pd.Series(booleans)
is_long.head()
0    False
1    False
2     True
3    False
4    False
dtype: bool
# use bracket notation with the boolean Series to tell the DataFrame which rows to display
movies[is_long]
| | star_rating | title | content_rating | genre | duration | actors_list |
---|---|---|---|---|---|---|
2 | 9.1 | The Godfather: Part II | R | Crime | 200 | [u'Al Pacino', u'Robert De Niro', u'Robert Duv... |
7 | 8.9 | The Lord of the Rings: The Return of the King | PG-13 | Adventure | 201 | [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK... |
17 | 8.7 | Seven Samurai | UNRATED | Drama | 207 | [u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K... |
78 | 8.4 | Once Upon a Time in America | R | Crime | 229 | [u'Robert De Niro', u'James Woods', u'Elizabet... |
85 | 8.4 | Lawrence of Arabia | PG | Adventure | 216 | [u"Peter O'Toole", u'Alec Guinness', u'Anthony... |
142 | 8.3 | Lagaan: Once Upon a Time in India | PG | Adventure | 224 | [u'Aamir Khan', u'Gracy Singh', u'Rachel Shell... |
157 | 8.2 | Gone with the Wind | G | Drama | 238 | [u'Clark Gable', u'Vivien Leigh', u'Thomas Mit... |
204 | 8.1 | Ben-Hur | G | Adventure | 212 | [u'Charlton Heston', u'Jack Hawkins', u'Stephe... |
445 | 7.9 | The Ten Commandments | APPROVED | Adventure | 220 | [u'Charlton Heston', u'Yul Brynner', u'Anne Ba... |
476 | 7.8 | Hamlet | PG-13 | Drama | 242 | [u'Kenneth Branagh', u'Julie Christie', u'Dere... |
630 | 7.7 | Malcolm X | PG-13 | Biography | 202 | [u'Denzel Washington', u'Angela Bassett', u'De... |
767 | 7.6 | It's a Mad, Mad, Mad, Mad World | APPROVED | Action | 205 | [u'Spencer Tracy', u'Milton Berle', u'Ethel Me... |
# simplify the steps above: no need to write a for loop to create 'is_long'
is_long = movies.duration >= 200
movies[is_long]  # pandas filters the rows using this boolean Series
# or equivalently, write it in one line (no need to create the 'is_long' object)
movies[movies.duration >= 200]
| | star_rating | title | content_rating | genre | duration | actors_list |
---|---|---|---|---|---|---|
2 | 9.1 | The Godfather: Part II | R | Crime | 200 | [u'Al Pacino', u'Robert De Niro', u'Robert Duv... |
7 | 8.9 | The Lord of the Rings: The Return of the King | PG-13 | Adventure | 201 | [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK... |
17 | 8.7 | Seven Samurai | UNRATED | Drama | 207 | [u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K... |
78 | 8.4 | Once Upon a Time in America | R | Crime | 229 | [u'Robert De Niro', u'James Woods', u'Elizabet... |
85 | 8.4 | Lawrence of Arabia | PG | Adventure | 216 | [u"Peter O'Toole", u'Alec Guinness', u'Anthony... |
142 | 8.3 | Lagaan: Once Upon a Time in India | PG | Adventure | 224 | [u'Aamir Khan', u'Gracy Singh', u'Rachel Shell... |
157 | 8.2 | Gone with the Wind | G | Drama | 238 | [u'Clark Gable', u'Vivien Leigh', u'Thomas Mit... |
204 | 8.1 | Ben-Hur | G | Adventure | 212 | [u'Charlton Heston', u'Jack Hawkins', u'Stephe... |
445 | 7.9 | The Ten Commandments | APPROVED | Adventure | 220 | [u'Charlton Heston', u'Yul Brynner', u'Anne Ba... |
476 | 7.8 | Hamlet | PG-13 | Drama | 242 | [u'Kenneth Branagh', u'Julie Christie', u'Dere... |
630 | 7.7 | Malcolm X | PG-13 | Biography | 202 | [u'Denzel Washington', u'Angela Bassett', u'De... |
767 | 7.6 | It's a Mad, Mad, Mad, Mad World | APPROVED | Action | 205 | [u'Spencer Tracy', u'Milton Berle', u'Ethel Me... |
# select the 'genre' Series from the filtered DataFrame
movies[movies.duration >= 200].genre
# or equivalently, use the 'loc' method
movies.loc[movies.duration >= 200, 'genre']
2      Crime
7      Adventure
17     Drama
78     Crime
85     Adventure
142    Adventure
157    Drama
204    Adventure
445    Adventure
476    Drama
630    Biography
767    Action
Name: genre, dtype: object
Documentation for loc
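The whole filtering pattern fits in a few lines on a made-up three-row DataFrame. A sketch only; `df`, `long_movies` and `long_genres` are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({'title': ['A', 'B', 'C'],
                   'genre': ['Crime', 'Drama', 'Action'],
                   'duration': [200, 96, 242]})

is_long = df.duration >= 200              # boolean Series: True, False, True
long_movies = df[is_long]                 # keeps the rows where is_long is True
long_genres = df.loc[is_long, 'genre']    # filter rows and pick a column in one step

print(is_long.tolist())       # [True, False, True]
print(long_genres.tolist())   # ['Crime', 'Action']
```

The comparison runs on the entire Series at once, which is why no explicit loop is needed.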
9. How do I apply multiple filter criteria to a pandas DataFrame? (video)
# read a dataset of top-rated IMDb movies into a DataFrame
movies = pd.read_csv('http://bit.ly/imdbratings')
movies.head()
| | star_rating | title | content_rating | genre | duration | actors_list |
---|---|---|---|---|---|---|
0 | 9.3 | The Shawshank Redemption | R | Crime | 142 | [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... |
1 | 9.2 | The Godfather | R | Crime | 175 | [u'Marlon Brando', u'Al Pacino', u'James Caan'] |
2 | 9.1 | The Godfather: Part II | R | Crime | 200 | [u'Al Pacino', u'Robert De Niro', u'Robert Duv... |
3 | 9.0 | The Dark Knight | PG-13 | Action | 152 | [u'Christian Bale', u'Heath Ledger', u'Aaron E... |
4 | 8.9 | Pulp Fiction | R | Crime | 154 | [u'John Travolta', u'Uma Thurman', u'Samuel L.... |
# filter the DataFrame to only show movies with a 'duration' of at least 200 minutes
movies[movies.duration >= 200]
| | star_rating | title | content_rating | genre | duration | actors_list |
---|---|---|---|---|---|---|
2 | 9.1 | The Godfather: Part II | R | Crime | 200 | [u'Al Pacino', u'Robert De Niro', u'Robert Duv... |
7 | 8.9 | The Lord of the Rings: The Return of the King | PG-13 | Adventure | 201 | [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK... |
17 | 8.7 | Seven Samurai | UNRATED | Drama | 207 | [u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K... |
78 | 8.4 | Once Upon a Time in America | R | Crime | 229 | [u'Robert De Niro', u'James Woods', u'Elizabet... |
85 | 8.4 | Lawrence of Arabia | PG | Adventure | 216 | [u"Peter O'Toole", u'Alec Guinness', u'Anthony... |
142 | 8.3 | Lagaan: Once Upon a Time in India | PG | Adventure | 224 | [u'Aamir Khan', u'Gracy Singh', u'Rachel Shell... |
157 | 8.2 | Gone with the Wind | G | Drama | 238 | [u'Clark Gable', u'Vivien Leigh', u'Thomas Mit... |
204 | 8.1 | Ben-Hur | G | Adventure | 212 | [u'Charlton Heston', u'Jack Hawkins', u'Stephe... |
445 | 7.9 | The Ten Commandments | APPROVED | Adventure | 220 | [u'Charlton Heston', u'Yul Brynner', u'Anne Ba... |
476 | 7.8 | Hamlet | PG-13 | Drama | 242 | [u'Kenneth Branagh', u'Julie Christie', u'Dere... |
630 | 7.7 | Malcolm X | PG-13 | Biography | 202 | [u'Denzel Washington', u'Angela Bassett', u'De... |
767 | 7.6 | It's a Mad, Mad, Mad, Mad World | APPROVED | Action | 205 | [u'Spencer Tracy', u'Milton Berle', u'Ethel Me... |
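Understanding logical operators:
- and: True only if both sides of the operator are True
- or: True if either side of the operator is True
print(True and True)
print(True and False)
print(False and False)
True
False
False
print(True or True)
print(True or False)
print(False or False)
True
True
False
Rules for specifying multiple filter criteria in pandas:
- use & instead of and
- use | instead of or
- add parentheses around each condition to specify evaluation order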
Goal: Further filter the DataFrame of long movies (duration >= 200) to only show movies which also have a 'genre' of 'Drama'
# use the '&' operator to specify that both conditions are required
movies[(movies.duration >= 200) & (movies.genre == 'Drama')]
| | star_rating | title | content_rating | genre | duration | actors_list |
---|---|---|---|---|---|---|
17 | 8.7 | Seven Samurai | UNRATED | Drama | 207 | [u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K... |
157 | 8.2 | Gone with the Wind | G | Drama | 238 | [u'Clark Gable', u'Vivien Leigh', u'Thomas Mit... |
476 | 7.8 | Hamlet | PG-13 | Drama | 242 | [u'Kenneth Branagh', u'Julie Christie', u'Dere... |
# INCORRECT: using the '|' operator would show movies that are either long or dramas (or both)
movies[(movies.duration >= 200) | (movies.genre == 'Drama')].head()
| | star_rating | title | content_rating | genre | duration | actors_list |
---|---|---|---|---|---|---|
2 | 9.1 | The Godfather: Part II | R | Crime | 200 | [u'Al Pacino', u'Robert De Niro', u'Robert Duv... |
5 | 8.9 | 12 Angry Men | NOT RATED | Drama | 96 | [u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals... |
7 | 8.9 | The Lord of the Rings: The Return of the King | PG-13 | Adventure | 201 | [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK... |
9 | 8.9 | Fight Club | R | Drama | 139 | [u'Brad Pitt', u'Edward Norton', u'Helena Bonh... |
13 | 8.8 | Forrest Gump | PG-13 | Drama | 142 | [u'Tom Hanks', u'Robin Wright', u'Gary Sinise'] |
# Goal: filter the original DataFrame to show movies with a 'genre' of 'Crime' or 'Drama' or 'Action'
# use the '|' operator to specify that a row can match any of the three criteria
movies[(movies.genre == 'Crime') | (movies.genre == 'Drama') | (movies.genre == 'Action')].head(10)
# or equivalently, use the 'isin' method
movies[movies.genre.isin(['Crime', 'Drama', 'Action'])].head(10)
| | star_rating | title | content_rating | genre | duration | actors_list |
---|---|---|---|---|---|---|
0 | 9.3 | The Shawshank Redemption | R | Crime | 142 | [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... |
1 | 9.2 | The Godfather | R | Crime | 175 | [u'Marlon Brando', u'Al Pacino', u'James Caan'] |
2 | 9.1 | The Godfather: Part II | R | Crime | 200 | [u'Al Pacino', u'Robert De Niro', u'Robert Duv... |
3 | 9.0 | The Dark Knight | PG-13 | Action | 152 | [u'Christian Bale', u'Heath Ledger', u'Aaron E... |
4 | 8.9 | Pulp Fiction | R | Crime | 154 | [u'John Travolta', u'Uma Thurman', u'Samuel L.... |
5 | 8.9 | 12 Angry Men | NOT RATED | Drama | 96 | [u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals... |
9 | 8.9 | Fight Club | R | Drama | 139 | [u'Brad Pitt', u'Edward Norton', u'Helena Bonh... |
11 | 8.8 | Inception | PG-13 | Action | 148 | [u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'... |
12 | 8.8 | Star Wars: Episode V - The Empire Strikes Back | PG | Action | 124 | [u'Mark Hamill', u'Harrison Ford', u'Carrie Fi... |
13 | 8.8 | Forrest Gump | PG-13 | Drama | 142 | [u'Tom Hanks', u'Robin Wright', u'Gary Sinise'] |
Documentation for isin
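The rules above can be checked on a small made-up DataFrame, including what happens when Python's `and` is used by mistake. This is a sketch; all variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'genre': ['Crime', 'Drama', 'Drama', 'Action'],
                   'duration': [200, 96, 238, 152]})

long_dramas = df[(df.duration >= 200) & (df.genre == 'Drama')]     # both conditions
long_or_drama = df[(df.duration >= 200) | (df.genre == 'Drama')]   # either condition
crime_or_action = df[df.genre.isin(['Crime', 'Action'])]           # membership test

# Python's 'and' asks for a single True/False from a whole Series, which pandas refuses
try:
    df[(df.duration >= 200) and (df.genre == 'Drama')]
    used_and = 'no error'
except ValueError:
    used_and = 'ValueError'

print(long_dramas.index.tolist())       # [2]
print(long_or_drama.index.tolist())     # [0, 1, 2]
print(crime_or_action.index.tolist())   # [0, 3]
print(used_and)                         # ValueError
```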
10. Your pandas questions answered! (video)
Question: When reading from a file, how do I read in only a subset of the columns?
url3 = "https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv"  # define the URL
ufo = pd.read_csv(url3)  # open the CSV file with read_csv
ufo.columns
Index(['City', 'Colors Reported', 'Shape Reported', 'State', 'Time'], dtype='object')
# specify which columns to include by name
ufo = pd.read_csv(url3, usecols=['City', 'State'])
# or specify columns by integer position (note that positions 0 and 4 select 'City' and 'Time', not 'City' and 'State')
ufo = pd.read_csv(url3, usecols=[0, 4])
ufo.columns
Index(['City', 'Time'], dtype='object')
Question: When reading from a file, how do I read in only a subset of the rows?
# specify how many rows to read (here, only the first 3)
ufo = pd.read_csv(url3, nrows=3)
ufo
| | City | Colors Reported | Shape Reported | State | Time |
---|---|---|---|---|---|
0 | Ithaca | NaN | TRIANGLE | NY | 6/1/1930 22:00 |
1 | Willingboro | NaN | OTHER | NJ | 6/30/1930 20:00 |
2 | Holyoke | NaN | OVAL | CO | 2/15/1931 14:00 |
Documentation for read_csv
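`usecols` and `nrows` can be tried offline on an in-memory copy of the first three rows shown above. A sketch; `csv_text` and the result names are illustrative.

```python
import io
import pandas as pd

csv_text = ("City,Colors Reported,Shape Reported,State,Time\n"
            "Ithaca,,TRIANGLE,NY,6/1/1930 22:00\n"
            "Willingboro,,OTHER,NJ,6/30/1930 20:00\n"
            "Holyoke,,OVAL,CO,2/15/1931 14:00\n")

by_name = pd.read_csv(io.StringIO(csv_text), usecols=['City', 'State'])
by_position = pd.read_csv(io.StringIO(csv_text), usecols=[0, 3])   # 'State' is at position 3
first_two = pd.read_csv(io.StringIO(csv_text), nrows=2)

print(list(by_name.columns))         # ['City', 'State']
print(by_name.equals(by_position))   # True
print(first_two.shape)               # (2, 5)
```

Positions are counted from 0 in file order, so selecting by position only matches selecting by name when the positions point at the same columns.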
Question: How do I iterate through a Series?
# a Series is directly iterable (like a list)
for c in ufo.City:
    print(c)
Ithaca
Willingboro
Holyoke
Question: How do I iterate through a DataFrame?
# various methods are available to iterate through a DataFrame
for index, row in ufo.iterrows():
    print(index, row.City, row.State)
0 Ithaca NY
1 Willingboro NJ
2 Holyoke CO
Documentation for iterrows
Question: How do I drop all non-numeric columns from a DataFrame?
# read a dataset of alcohol consumption into a DataFrame, and check the data types
url7 = 'https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/drinks.csv'
drinks = pd.read_csv(url7)
drinks.dtypes
country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object
# only include numeric columns in the DataFrame
import numpy as np
drinks.select_dtypes(include=[np.number]).dtypes
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
dtype: object
Documentation for select_dtypes
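`select_dtypes` also takes an `exclude` argument, and the string `'number'` can stand in for `np.number`. A short sketch on made-up rows (`df`, `numeric` and `non_numeric` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'country': ['Albania', 'Algeria'],
                   'beer_servings': [89, 25],
                   'total_litres': [4.9, 0.7]})

numeric = df.select_dtypes(include='number')       # integer and float columns
non_numeric = df.select_dtypes(exclude='number')   # everything else

print(list(numeric.columns))       # ['beer_servings', 'total_litres']
print(list(non_numeric.columns))   # ['country']
```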
Question: How do I know whether I should pass an argument as a string or a list?
# describe all of the numeric columns
drinks.describe()
| | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol |
---|---|---|---|---|
count | 193.000000 | 193.000000 | 193.000000 | 193.000000 |
mean | 106.160622 | 80.994819 | 49.450777 | 4.717098 |
std | 101.143103 | 88.284312 | 79.697598 | 3.773298 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 20.000000 | 4.000000 | 1.000000 | 1.300000 |
50% | 76.000000 | 56.000000 | 8.000000 | 4.200000 |
75% | 188.000000 | 128.000000 | 59.000000 | 7.200000 |
max | 376.000000 | 438.000000 | 370.000000 | 14.400000 |
# pass the string 'all' to describe all columns
drinks.describe(include='all')
| | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
---|---|---|---|---|---|---|
count | 193 | 193.000000 | 193.000000 | 193.000000 | 193.000000 | 193 |
unique | 193 | NaN | NaN | NaN | NaN | 6 |
top | Marshall Islands | NaN | NaN | NaN | NaN | Africa |
freq | 1 | NaN | NaN | NaN | NaN | 53 |
mean | NaN | 106.160622 | 80.994819 | 49.450777 | 4.717098 | NaN |
std | NaN | 101.143103 | 88.284312 | 79.697598 | 3.773298 | NaN |
min | NaN | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN |
25% | NaN | 20.000000 | 4.000000 | 1.000000 | 1.300000 | NaN |
50% | NaN | 76.000000 | 56.000000 | 8.000000 | 4.200000 | NaN |
75% | NaN | 188.000000 | 128.000000 | 59.000000 | 7.200000 | NaN |
max | NaN | 376.000000 | 438.000000 | 370.000000 | 14.400000 | NaN |
# pass a list of data types to describe only columns of those types
drinks.describe(include=['object', 'float64'])
| | country | total_litres_of_pure_alcohol | continent |
---|---|---|---|
count | 193 | 193.000000 | 193 |
unique | 193 | NaN | 6 |
top | Marshall Islands | NaN | Africa |
freq | 1 | NaN | 53 |
mean | NaN | 4.717098 | NaN |
std | NaN | 3.773298 | NaN |
min | NaN | 0.000000 | NaN |
25% | NaN | 1.300000 | NaN |
50% | NaN | 4.200000 | NaN |
75% | NaN | 7.200000 | NaN |
max | NaN | 14.400000 | NaN |
# pass a list even if you only want to describe a single data type
drinks.describe(include=['object'])
| | country | continent |
---|---|---|
count | 193 | 193 |
unique | 193 | 6 |
top | Marshall Islands | Africa |
freq | 1 | 53 |
Documentation for describe
11. How do I use the "axis" parameter in pandas? (video)
url7= 'https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/drinks.csv'
drinks = pd.read_csv(url7)
drinks.head()
| | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
---|---|---|---|---|---|---|
0 | Afghanistan | 0 | 0 | 0 | 0.0 | Asia |
1 | Albania | 89 | 132 | 54 | 4.9 | Europe |
2 | Algeria | 25 | 0 | 14 | 0.7 | Africa |
3 | Andorra | 245 | 138 | 312 | 12.4 | Europe |
4 | Angola | 217 | 57 | 45 | 5.9 | Africa |
# drop a column (temporarily)
drinks.drop('continent', axis=1).head()
| | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol |
---|---|---|---|---|---|
0 | Afghanistan | 0 | 0 | 0 | 0.0 |
1 | Albania | 89 | 132 | 54 | 4.9 |
2 | Algeria | 25 | 0 | 14 | 0.7 |
3 | Andorra | 245 | 138 | 312 | 12.4 |
4 | Angola | 217 | 57 | 45 | 5.9 |
Documentation for drop
# drop a row (temporarily)
drinks.drop(2, axis=0).head()
| | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
---|---|---|---|---|---|---|
0 | Afghanistan | 0 | 0 | 0 | 0.0 | Asia |
1 | Albania | 89 | 132 | 54 | 4.9 | Europe |
3 | Andorra | 245 | 138 | 312 | 12.4 | Europe |
4 | Angola | 217 | 57 | 45 | 5.9 | Africa |
5 | Antigua & Barbuda | 102 | 128 | 45 | 4.9 | North America |
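When referring to rows or columns with the axis parameter:
- axis 0 refers to rows
- axis 1 refers to columns
# calculate the mean of each numeric column
# (numeric_only=True skips the text columns; in pandas 2.0 and later, omitting it raises a TypeError when text columns are present)
drinks.mean(numeric_only=True)
# or equivalently, specify the axis explicitly
drinks.mean(axis=0, numeric_only=True)
beer_servings                   106.160622
spirit_servings                  80.994819
wine_servings                    49.450777
total_litres_of_pure_alcohol      4.717098
dtype: float64
Documentation for mean
# calculate the mean of each row
drinks.mean(axis=1, numeric_only=True).head()
0      0.000
1     69.975
2      9.925
3    176.850
4     81.225
dtype: float64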
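When performing a mathematical operation with the axis parameter:
- axis 0 means the operation should "move down" the row axis
- axis 1 means the operation should "move across" the column axis
# 'index' is an alias for axis 0
drinks.mean(axis='index', numeric_only=True)
beer_servings                   106.160622
spirit_servings                  80.994819
wine_servings                    49.450777
total_litres_of_pure_alcohol      4.717098
dtype: float64
# 'columns' is an alias for axis 1
drinks.mean(axis='columns', numeric_only=True).head()
0      0.000
1     69.975
2      9.925
3    176.850
4     81.225
dtype: float64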
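The direction of an operation is easier to see on a tiny all-numeric table. A sketch using three beer and wine values from the drinks output above, with shortened, illustrative column names:

```python
import pandas as pd

df = pd.DataFrame({'beer': [0, 89, 25], 'wine': [0, 54, 14]})

col_means = df.mean(axis=0)           # moves down the rows: one value per column
row_means = df.mean(axis='columns')   # moves across the columns: one value per row

print(col_means['beer'])    # 38.0
print(row_means.tolist())   # [0.0, 71.5, 19.5]
```

The result of `axis=0` is labelled by column name, and the result of `axis=1` is labelled by row index.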
12. How do I use string methods in pandas? (video)
url1 = "https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/chipotle.tsv"  # define the URL
orders = pd.read_table(url1)  # open with read_table()
orders.head()
| | order_id | quantity | item_name | choice_description | item_price |
---|---|---|---|---|---|
0 | 1 | 1 | Chips and Fresh Tomato Salsa | NaN | $2.39 |
1 | 1 | 1 | Izze | [Clementine] | $3.39 |
2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 |
3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NaN | $2.39 |
4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 |
# the normal way to access string methods in Python
'hello'.upper()
'HELLO'
# string methods for a pandas Series are accessed via 'str'
orders.item_name.str.upper().head()
0    CHIPS AND FRESH TOMATO SALSA
1    IZZE
2    NANTUCKET NECTAR
3    CHIPS AND TOMATILLO-GREEN CHILI SALSA
4    CHICKEN BOWL
Name: item_name, dtype: object
# the string method 'contains' checks for a substring and returns a boolean Series
orders.item_name.str.contains('Chicken').head()
0    False
1    False
2    False
3    False
4     True
Name: item_name, dtype: bool
# use the boolean Series to filter the DataFrame
orders[orders.item_name.str.contains('Chicken')].head()
| | order_id | quantity | item_name | choice_description | item_price |
---|---|---|---|---|---|
4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 |
5 | 3 | 1 | Chicken Bowl | [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... | $10.98 |
11 | 6 | 1 | Chicken Crispy Tacos | [Roasted Chili Corn Salsa, [Fajita Vegetables,... | $8.75 |
12 | 6 | 1 | Chicken Soft Tacos | [Roasted Chili Corn Salsa, [Rice, Black Beans,... | $8.75 |
13 | 7 | 1 | Chicken Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Rice,... | $11.25 |
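# string methods can be chained together (regex=False treats '[' and ']' as literal characters)
orders.choice_description.str.replace('[', '', regex=False).str.replace(']', '', regex=False).head()
0    NaN
1    Clementine
2    Apple
3    NaN
4    Tomatillo-Red Chili Salsa (Hot), Black Beans, ...
Name: choice_description, dtype: object
# many pandas string methods support regular expressions (pass regex=True explicitly; it is no longer the default in pandas 2.0 and later)
orders.choice_description.str.replace(r'[\[\]]', '', regex=True).head()
0    NaN
1    Clementine
2    Apple
3    NaN
4    Tomatillo-Red Chili Salsa (Hot), Black Beans, ...
Name: choice_description, dtype: object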
String handling section of the pandas API reference
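Passing `regex` explicitly makes the intent clear regardless of pandas version. A sketch on a three-element Series built from values in the output above (`s` and the result names are illustrative):

```python
import pandas as pd

s = pd.Series(['[Clementine]', None, '[Apple]'])

# literal replacement: '[' and ']' are treated as plain characters
literal = s.str.replace('[', '', regex=False).str.replace(']', '', regex=False)

# regular expression: the character class matches either bracket in one pass
pattern = s.str.replace(r'[\[\]]', '', regex=True)

# na=False makes missing values count as "no match" instead of propagating as missing
has_apple = s.str.contains('Apple', na=False)

print(literal.tolist())
print(has_apple.tolist())   # [False, False, True]
```

Missing values pass through `str.replace` unchanged, which is why the second element stays missing in both results.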
13. How do I change the data type of a pandas Series? (video)
url7= 'https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/drinks.csv'
drinks = pd.read_csv(url7)
drinks.head()
| | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
---|---|---|---|---|---|---|
0 | Afghanistan | 0 | 0 | 0 | 0.0 | Asia |
1 | Albania | 89 | 132 | 54 | 4.9 | Europe |
2 | Algeria | 25 | 0 | 14 | 0.7 | Africa |
3 | Andorra | 245 | 138 | 312 | 12.4 | Europe |
4 | Angola | 217 | 57 | 45 | 5.9 | Africa |
# Examine the data type of each Series
drinks.dtypes
country                         object
beer_servings                   int64
spirit_servings                 int64
wine_servings                   int64
total_litres_of_pure_alcohol    float64
continent                       object
dtype: object
# Change the data type of an existing Series
drinks['beer_servings'] = drinks.beer_servings.astype(float)
drinks.dtypes
country                         object
beer_servings                   float64
spirit_servings                 int64
wine_servings                   int64
total_litres_of_pure_alcohol    float64
continent                       object
dtype: object
Documentation for astype
# Alternatively, change the data type of a Series while reading in the file
drinks = pd.read_csv(url7, dtype={'beer_servings': float})
drinks.dtypes
country                         object
beer_servings                   float64
spirit_servings                 int64
wine_servings                   int64
total_litres_of_pure_alcohol    float64
continent                       object
dtype: object
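To try the `dtype` argument without downloading anything, here is a minimal offline sketch. The inline CSV text is an assumption that copies two rows from the drinks table; `io.StringIO` only stands in for the real file:

```python
import io
import pandas as pd

# Two rows copied from the drinks table, held in memory instead of a file
csv_text = """country,beer_servings,continent
Albania,89,Europe
Algeria,25,Africa
"""

# Default behaviour: pandas infers an integer type for beer_servings
default = pd.read_csv(io.StringIO(csv_text))

# With dtype=..., the column is read as float from the start
as_float = pd.read_csv(io.StringIO(csv_text), dtype={'beer_servings': float})

print(default.beer_servings.dtype)
print(as_float.beer_servings.dtype)
```

Setting the type at read time avoids a separate `astype` step and is convenient when the same file is loaded repeatedly.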
orders = pd.read_table(url1)
orders.head()
| | order_id | quantity | item_name | choice_description | item_price |
---|---|---|---|---|---|
0 | 1 | 1 | Chips and Fresh Tomato Salsa | NaN | $2.39 |
1 | 1 | 1 | Izze | [Clementine] | $3.39 |
2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 |
3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NaN | $2.39 |
4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 |
# Examine the data type of each Series
orders.dtypes
order_id              int64
quantity              int64
item_name             object
choice_description    object
item_price            object
dtype: object
# Convert a string to a number in order to do math (regex=False removes a literal '$')
orders.item_price.str.replace('$', '', regex=False).astype(float).mean()
7.464335785374397
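The price conversion above can be broken into its two steps. A minimal sketch, assuming a hand-made `prices` Series holding three values from the orders table:

```python
import pandas as pd

# Three price strings taken from the orders table above
prices = pd.Series(['$2.39', '$3.39', '$16.98'])

# Step 1: strip the literal dollar sign (the values are still strings)
stripped = prices.str.replace('$', '', regex=False)

# Step 2: convert the cleaned strings to floats so arithmetic works
as_numbers = stripped.astype(float)

print(stripped.tolist())
print(as_numbers.sum())
```

Calling `astype(float)` on the original strings would fail, because `'$2.39'` is not a valid number until the symbol is removed.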
# The string method 'contains' checks for a substring and returns a boolean Series
orders.item_name.str.contains('Chicken').head()
0    False
1    False
2    False
3    False
4    True
Name: item_name, dtype: bool
# Convert a boolean Series to integers (False = 0, True = 1)
orders.item_name.str.contains('Chicken').astype(int).head()
0    0
1    0
2    0
3    0
4    1
Name: item_name, dtype: int32
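One practical use of the boolean-to-integer conversion is counting matches. The sketch below uses a made-up `names` Series (the item "Steak Burrito" is a hypothetical entry added for contrast):

```python
import pandas as pd

# Hypothetical item names; two of the four contain 'Chicken'
names = pd.Series(['Chicken Bowl', 'Izze', 'Chicken Soft Tacos', 'Steak Burrito'])

is_chicken = names.str.contains('Chicken')

# Summing the 0/1 values counts the matching rows
n_chicken = is_chicken.astype(int).sum()

# The mean of a boolean Series is the fraction of True values
share_chicken = is_chicken.mean()

print(n_chicken, share_chicken)
```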
[Back to top]
14. When should I use a "groupby" in pandas? (video)
drinks = pd.read_csv(url7)
drinks.head()
| | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
---|---|---|---|---|---|---|
0 | Afghanistan | 0 | 0 | 0 | 0.0 | Asia |
1 | Albania | 89 | 132 | 54 | 4.9 | Europe |
2 | Algeria | 25 | 0 | 14 | 0.7 | Africa |
3 | Andorra | 245 | 138 | 312 | 12.4 | Europe |
4 | Angola | 217 | 57 | 45 | 5.9 | Africa |
# Calculate the mean beer_servings across the entire dataset
drinks.beer_servings.mean()
106.16062176165804
# Calculate the mean beer_servings just for countries in Africa
drinks[drinks.continent=='Africa'].beer_servings.mean()
61.471698113207545
# Calculate the mean beer_servings for each continent
drinks.groupby('continent').beer_servings.mean()
continent
Africa           61.471698
Asia             37.045455
Europe           193.777778
North America    145.434783
Oceania          89.687500
South America    175.083333
Name: beer_servings, dtype: float64
Documentation for groupby
# Other aggregation functions (such as 'max') can also be used with groupby
drinks.groupby('continent').beer_servings.max()
continent
Africa           376
Asia             247
Europe           361
North America    285
Oceania          306
South America    333
Name: beer_servings, dtype: int64
# Multiple aggregation functions can be applied at the same time
drinks.groupby('continent').beer_servings.agg(['count', 'mean', 'min', 'max'])
continent | count | mean | min | max |
---|---|---|---|---|
Africa | 53 | 61.471698 | 0 | 376 |
Asia | 44 | 37.045455 | 0 | 247 |
Europe | 45 | 193.777778 | 0 | 361 |
North America | 23 | 145.434783 | 1 | 285 |
Oceania | 16 | 89.687500 | 0 | 306 |
South America | 12 | 175.083333 | 93 | 333 |
Documentation for agg
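For a version of the `agg` call that is small enough to check by hand, here is a sketch on five rows copied from the head of the drinks table (the name `drinks_sample` is introduced here for illustration only):

```python
import pandas as pd

# The first five rows of the drinks table, reduced to the columns needed here
drinks_sample = pd.DataFrame({
    'country': ['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola'],
    'beer_servings': [0, 89, 25, 245, 217],
    'continent': ['Asia', 'Europe', 'Africa', 'Europe', 'Africa'],
})

# One row per continent, one column per aggregation function
summary = drinks_sample.groupby('continent').beer_servings.agg(
    ['count', 'mean', 'min', 'max'])

print(summary)
```

With only two African rows (25 and 217), the mean of 121.0 is easy to verify manually, which makes this a convenient way to confirm what each column of the result means.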
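# Without specifying a column, the mean is calculated for all numeric columns
# (numeric_only=True is required in recent pandas versions, which no longer skip text columns silently)
drinks.groupby('continent').mean(numeric_only=True)
continent | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol |
---|---|---|---|---|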
Africa | 61.471698 | 16.339623 | 16.264151 | 3.007547 |
Asia | 37.045455 | 60.840909 | 9.068182 | 2.170455 |
Europe | 193.777778 | 132.555556 | 142.222222 | 8.617778 |
North America | 145.434783 | 165.739130 | 24.521739 | 5.995652 |
Oceania | 89.687500 | 58.437500 | 35.625000 | 3.381250 |
South America | 175.083333 | 114.750000 | 62.416667 | 6.308333 |
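The table above can be reproduced in miniature. This sketch again uses the first five rows of the drinks data (as a hypothetical `drinks_sample`), shows that `numeric_only=True` leaves the text column `country` out of the result, and checks that one group's mean equals the filter-then-mean calculation shown earlier in this section:

```python
import pandas as pd

# A few rows copied from the drinks table; the real file has many more
drinks_sample = pd.DataFrame({
    'country': ['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola'],
    'beer_servings': [0, 89, 25, 245, 217],
    'wine_servings': [0, 54, 14, 312, 45],
    'continent': ['Asia', 'Europe', 'Africa', 'Europe', 'Africa'],
})

# numeric_only=True skips 'country' instead of trying to average text
by_continent = drinks_sample.groupby('continent').mean(numeric_only=True)

# The same number for one group, computed by filtering first
africa_beer = drinks_sample[drinks_sample.continent == 'Africa'].beer_servings.mean()

print(by_continent)
print(africa_beer)
```

In other words, a groupby is the filter-and-aggregate pattern applied to every category at once.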
# Allow plots to appear in the Jupyter notebook
%matplotlib inline
# Side-by-side bar plot of the DataFrame directly above
drinks.groupby('continent').mean(numeric_only=True).plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x296edf23b70>
Documentation for plot
[Back to top]
15. How do I explore a pandas Series? (video)
# read a dataset of top-rated IMDb movies into a DataFrame
url4="https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/imdb_1000.csv"
movies = pd.read_csv(url4)
movies.head()
| | star_rating | title | content_rating | genre | duration | actors_list |
---|---|---|---|---|---|---|
0 | 9.3 | The Shawshank Redemption | R | Crime | 142 | [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... |
1 | 9.2 | The Godfather | R | Crime | 175 | [u'Marlon Brando', u'Al Pacino', u'James Caan'] |
2 | 9.1 | The Godfather: Part II | R | Crime | 200 | [u'Al Pacino', u'Robert De Niro', u'Robert Duv... |
3 | 9.0 | The Dark Knight | PG-13 | Action | 152 | [u'Christian Bale', u'Heath Ledger', u'Aaron E... |
4 | 8.9 | Pulp Fiction | R | Crime | 154 | [u'John Travolta', u'Uma Thurman', u'Samuel L.... |
# Examine the data type of each Series
movies.dtypes
star_rating       float64
title             object
content_rating    object
genre             object
duration          int64
actors_list       object
dtype: object
Exploring a non-numeric Series:
# Count the non-null values, the unique values, and the frequency of the most common value
movies.genre.describe()
count     979
unique    16
top       Drama
freq      278
Name: genre, dtype: object
Documentation for describe
# Count how many times each value in the Series occurs
movies.genre.value_counts()
Drama        278
Comedy       156
Action       136
Crime        124
Biography    77
Adventure    75
Animation    62
Horror       29
Mystery      16
Western      9
Sci-Fi       5
Thriller     5
Film-Noir    3
Family       2
Fantasy      1
History      1
Name: genre, dtype: int64
Documentation for value_counts
# Display proportions instead of raw counts
movies.genre.value_counts(normalize=True)
Drama        0.283963
Comedy       0.159346
Action       0.138917
Crime        0.126660
Biography    0.078652
Adventure    0.076609
Animation    0.063330
Horror       0.029622
Mystery      0.016343
Western      0.009193
Sci-Fi       0.005107
Thriller     0.005107
Film-Noir    0.003064
Family       0.002043
Fantasy      0.001021
History      0.001021
Name: genre, dtype: float64
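The relationship between the raw counts and the normalized output is simply division by the total. A minimal sketch on a made-up `genres` Series (eight hypothetical entries, chosen so the fractions are exact):

```python
import pandas as pd

# Hypothetical genre labels: 4 Drama, 2 Crime, 1 Action, 1 Comedy
genres = pd.Series(['Drama', 'Crime', 'Drama', 'Action',
                    'Drama', 'Crime', 'Drama', 'Comedy'])

counts = genres.value_counts()                # raw counts per value
shares = genres.value_counts(normalize=True)  # counts divided by the total

print(counts)
print(shares)
```

The normalized values always sum to 1, so multiplying by 100 gives percentages.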
# The output of 'value_counts' is itself a Series
type(movies.genre.value_counts())
pandas.core.series.Series
# So Series methods can be used on it
movies.genre.value_counts().head()
Drama        278
Comedy       156
Action       136
Crime        124
Biography    77
Name: genre, dtype: int64
# Display the unique values in the Series
movies.genre.unique()
array(['Crime', 'Action', 'Drama', 'Western', 'Adventure', 'Biography','Comedy', 'Animation', 'Mystery', 'Horror', 'Film-Noir', 'Sci-Fi','History', 'Thriller', 'Family', 'Fantasy'], dtype=object)
# Count the number of unique values in the Series
movies.genre.nunique()
16
Documentation for unique and nunique
# Compute a cross-tabulation of two Series
pd.crosstab(movies.genre, movies.content_rating)
content_rating | APPROVED | G | GP | NC-17 | NOT RATED | PASSED | PG | PG-13 | R | TV-MA | UNRATED | X |
---|---|---|---|---|---|---|---|---|---|---|---|---|
genre | | | | | | | | | | | | |
Action | 3 | 1 | 1 | 0 | 4 | 1 | 11 | 44 | 67 | 0 | 3 | 0 |
Adventure | 3 | 2 | 0 | 0 | 5 | 1 | 21 | 23 | 17 | 0 | 2 | 0 |
Animation | 3 | 20 | 0 | 0 | 3 | 0 | 25 | 5 | 5 | 0 | 1 | 0 |
Biography | 1 | 2 | 1 | 0 | 1 | 0 | 6 | 29 | 36 | 0 | 0 | 0 |
Comedy | 9 | 2 | 1 | 1 | 16 | 3 | 23 | 23 | 73 | 0 | 4 | 1 |
Crime | 6 | 0 | 0 | 1 | 7 | 1 | 6 | 4 | 87 | 0 | 11 | 1 |
Drama | 12 | 3 | 0 | 4 | 24 | 1 | 25 | 55 | 143 | 1 | 9 | 1 |
Family | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
Fantasy | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
Film-Noir | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
History | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
Horror | 2 | 0 | 0 | 1 | 1 | 0 | 1 | 2 | 16 | 0 | 5 | 1 |
Mystery | 4 | 1 | 0 | 0 | 1 | 0 | 1 | 2 | 6 | 0 | 1 | 0 |
Sci-Fi | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 0 | 0 | 0 |
Thriller | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 0 | 0 |
Western | 1 | 0 | 0 | 0 | 2 | 0 | 2 | 1 | 3 | 0 | 0 | 0 |
Documentation for crosstab
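A cross-tabulation counts how often each pair of values occurs together. The sketch below builds one from the five movie rows shown at the top of this section, with one change: the content rating of the last row is altered so that more than one column appears (the `sample` DataFrame is therefore partly hypothetical):

```python
import pandas as pd

# Genre and rating pairs; the final 'PG-13' for Crime is a made-up variation
sample = pd.DataFrame({
    'genre': ['Crime', 'Crime', 'Crime', 'Action', 'Crime'],
    'content_rating': ['R', 'R', 'R', 'PG-13', 'PG-13'],
})

# Rows come from the first Series, columns from the second, cells are counts
table = pd.crosstab(sample.genre, sample.content_rating)

print(table)
```

Combinations that never occur (here, an Action movie rated R) show up as 0 rather than being left out.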
Exploring a numeric Series:
# Calculate various summary statistics
movies.duration.describe()
count    979.000000
mean     120.979571
std      26.218010
min      64.000000
25%      102.000000
50%      117.000000
75%      134.000000
max      242.000000
Name: duration, dtype: float64
# Many of these statistics are also implemented as Series methods
movies.duration.mean()
120.97957099080695
Documentation for mean
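The entries of `describe` correspond to individual Series methods. A minimal sketch using the durations of the five movies listed in the table at the top of this section (the name `durations` is introduced here for illustration):

```python
import pandas as pd

# Durations (in minutes) of the first five movies in the table above
durations = pd.Series([142, 175, 200, 152, 154])

stats = durations.describe()

# Each entry can also be computed on its own
mean_duration = durations.mean()
median_duration = durations.median()  # same value as the '50%' entry

print(stats)
print(mean_duration, median_duration)
```

Calling the individual method is handy when only one statistic is needed, for example inside a larger calculation.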
# 'value_counts' is primarily useful for categorical data, not numerical data
movies.duration.value_counts().head()
112    23
113    22
102    20
101    20
129    19
Name: duration, dtype: int64
# Allow plots to appear in the Jupyter notebook
%matplotlib inline
# Histogram of the 'duration' Series (shows the distribution of a numerical variable)
movies.duration.plot(kind='hist')
<matplotlib.axes._subplots.AxesSubplot at 0x296ee26ba58>
# Bar plot of the 'value_counts' for the 'genre' Series
movies.genre.value_counts().plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x296ee2ccba8>
Documentation for plot
[Back to top]
Reposted from: https://www.cnblogs.com/romannista/p/10659805.html