Data Analysis Engineer - Lecture 02: Pandas Tutorial (Part 1)

  • Data Analysis Engineer - Lecture 02: Pandas Tutorial (Part 1)
    • Introduction to Pandas
    • Contents
    • The Series data structure
      • Constructing and initializing a Series
        • Boolean indexing / conditional indexing
      • Assigning values to a Series
      • Missing data
    • The DataFrame data structure

Data Analysis Engineer - Lecture 02: Pandas Tutorial (Part 1)

pandas is a Python library built specifically for data analysis.

Introduction to Pandas

  • a Python package for data analysis and processing
  • built on NumPy (scientific computing on "matrices")
  • feels a bit like driving Excel/SQL from Python

Contents

  • Series
  • DataFrame
  • Index
  • reading and writing CSV files

The Series data structure

import numpy as np
import pandas as pd
# json.loads() decodes a JSON string into Python objects
import json

jsonStr = '{"name":"aspiring", "age": 17, "hobby": ["money","power", "read"],"parames":{"a":1,"b":2}}'
jsonData = json.loads(jsonStr)
print(jsonData)
print(type(jsonData))
print(jsonData['hobby'])
{'name': 'aspiring', 'age': 17, 'hobby': ['money', 'power', 'read'], 'parames': {'a': 1, 'b': 2}}
<class 'dict'>
['money', 'power', 'read']
# reading a JSON file
# json.load() loads a JSON-format file; here we peek at one line manually
path1 = 'data/example.json'
open(path1,'r',encoding='utf-8').readline()
'{ "a": "Mozilla\\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\\/535.11 (KHTML, like Gecko) Chrome\\/17.0.963.78 Safari\\/535.11", "c": "US", "nk": 1, "tz": "America\\/New_York", "gr": "MA", "g": "A6qOVH", "h": "wfLQtf", "l": "orofrog", "al": "en-US,en;q=0.8", "hh": "1.usa.gov", "r": "http:\\/\\/www.facebook.com\\/l\\/7AQEFzjSi\\/1.usa.gov\\/wfLQtf", "u": "http:\\/\\/www.ncbi.nlm.nih.gov\\/pubmed\\/22415991", "t": 1331923247, "hc": 1331822918, "cy": "Danvers", "ll": [ 42.576698, -70.954903 ] }\n'
# example from a Python data analysis book
import json

path2 = 'data/example.json'
records = [json.loads(line) for line in open(path2, 'r', encoding='utf-8')]
records[0]
records[0]['tz']
'America/New_York'
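As a small preview of what pandas adds on top of plain Python, the parsed records can be dropped into a Series and summarized in one call. A minimal sketch, assuming records is the list built above:

# count the most common time zones among the records
# .get avoids a KeyError for records without a 'tz' field
tz_series = pd.Series([r.get('tz') for r in records])
tz_series.value_counts().head()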

Constructing and initializing a Series

s = pd.Series([7, 'Beijing', 3.14, -12345, 'HanXiaoyang'])
s
0              7
1        Beijing
2           3.14
3         -12345
4    HanXiaoyang
dtype: object
s.values
array([7, 'Beijing', 3.14, -12345, 'HanXiaoyang'], dtype=object)
s.index
RangeIndex(start=0, stop=5, step=1)
s[1]
'Beijing'
s
0              7
1        Beijing
2           3.14
3         -12345
4    HanXiaoyang
dtype: object

By default pandas uses the integers 0 through n-1 as the index of a Series, but we can also specify our own index. The index can be understood by analogy with the keys of a dict.

s = pd.Series([7, 'Beijing', 3.14, -12345, 'HanXiaoyang'], index=['A', 'B', 'C', 'D', 'E'])
s
A              7
B        Beijing
C           3.14
D         -12345
E    HanXiaoyang
dtype: object
s['A']
7
s[ ['A','D','B'] ]
A          7
D     -12345
B    Beijing
dtype: object

We can build a Series from a list and specify the index at the same time. In fact we can also initialize a Series from a dict, since a Series is itself a key-value structure.

cities = {'Beijing':55000, 'ShangHai':60000, 'Shenzhen':50000, 'Hangzhou':30000, 'Guangzhou':40000, 'Suzhou':None}
cities
{'Beijing': 55000, 'Guangzhou': 40000, 'Hangzhou': 30000, 'ShangHai': 60000, 'Shenzhen': 50000, 'Suzhou': None}
apt = pd.Series(cities, name='income')
apt
Beijing      55000.0
Guangzhou    40000.0
Hangzhou     30000.0
ShangHai     60000.0
Shenzhen     50000.0
Suzhou           NaN
Name: income, dtype: float64
# indexing
apt['Guangzhou']
40000.0
apt[1]
40000.0
apt[1:]
Guangzhou    40000.0
Hangzhou     30000.0
ShangHai     60000.0
Shenzhen     50000.0
Suzhou           NaN
Name: income, dtype: float64
apt[:-1]
Beijing      55000.0
Guangzhou    40000.0
Hangzhou     30000.0
ShangHai     60000.0
Shenzhen     50000.0
Name: income, dtype: float64
apt[[3,4,1]]
ShangHai     60000.0
Shenzhen     50000.0
Guangzhou    40000.0
Name: income, dtype: float64
apt[ ['ShangHai', 'Shenzhen', 'Guangzhou'] ]
ShangHai     60000.0
Shenzhen     50000.0
Guangzhou    40000.0
Name: income, dtype: float64
# simple arithmetic
# broadcasting
3*apt
Beijing      165000.0
Guangzhou    120000.0
Hangzhou      90000.0
ShangHai     180000.0
Shenzhen     150000.0
Suzhou            NaN
Name: income, dtype: float64
apt/2.5
Beijing      22000.0
Guangzhou    16000.0
Hangzhou     12000.0
ShangHai     24000.0
Shenzhen     20000.0
Suzhou           NaN
Name: income, dtype: float64
# plain Python lists do not support elementwise math
my_list = [2,4,6,8,10]
my_list/2
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-29-39aba40a404f> in <module>()
----> 1 my_list/2

TypeError: unsupported operand type(s) for /: 'list' and 'int'
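For contrast, a NumPy array (the structure a Series is built on) does broadcast, so wrapping the list makes the same operation work:

# a NumPy array supports elementwise math
np.array(my_list) / 2
# -> array([1., 2., 3., 4., 5.])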
apt[1:]
Guangzhou    40000.0
Hangzhou     30000.0
ShangHai     60000.0
Shenzhen     50000.0
Suzhou           NaN
Name: income, dtype: float64
apt[:-1]
Beijing      55000.0
Guangzhou    40000.0
Hangzhou     30000.0
ShangHai     60000.0
Shenzhen     50000.0
Name: income, dtype: float64
# arithmetic is aligned on the index
apt[1:] + apt[:-1]
Beijing           NaN
Guangzhou     80000.0
Hangzhou      60000.0
ShangHai     120000.0
Shenzhen     100000.0
Suzhou            NaN
Name: income, dtype: float64
# 'in' tests whether an index label exists
'Hangzhou' in apt
True
'Chongqing' in apt
False
# apt['Chongqing'] would raise a KeyError
print(apt.get('Chongqing'))
None
print(apt.get('Guangzhou'))
40000.0
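Like dict.get, Series.get also accepts a default value to return instead of None. A small sketch:

# supply a fallback for a missing index label
apt.get('Chongqing', 0)
# -> 0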

Boolean indexing / conditional indexing

apt>=40000
Beijing       True
Guangzhou     True
Hangzhou     False
ShangHai      True
Shenzhen      True
Suzhou       False
Name: income, dtype: bool
# conditional indexing
apt[apt>=40000]
Beijing      55000.0
Guangzhou    40000.0
ShangHai     60000.0
Shenzhen     50000.0
Name: income, dtype: float64
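Conditions can be combined with & (and), | (or) and ~ (not), as long as each comparison is wrapped in parentheses; for example:

# income at least 40000 but below 60000
apt[(apt >= 40000) & (apt < 60000)]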
# summary statistics
apt.mean()
47000.0
apt.median()
50000.0
apt.max()
60000.0
apt.min()
30000.0
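A Series exposes many more aggregations (sum, std, quantile, ...); describe() bundles the common ones and skips NaN when counting:

# summary statistics in one call
apt.describe()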

Assigning values to a Series

apt
Beijing      55000.0
Guangzhou    40000.0
Hangzhou     30000.0
ShangHai     60000.0
Shenzhen     50000.0
Suzhou           NaN
Name: income, dtype: float64
apt['Shenzhen'] = 70000
apt
Beijing      55000.0
Guangzhou    40000.0
Hangzhou     30000.0
ShangHai     60000.0
Shenzhen     70000.0
Suzhou           NaN
Name: income, dtype: float64
# conditional assignment
apt[apt<=40000] = 45000
apt
Beijing      55000.0
Guangzhou    45000.0
Hangzhou     45000.0
ShangHai     60000.0
Shenzhen     70000.0
Suzhou           NaN
Name: income, dtype: float64
type(apt)
pandas.core.series.Series
# more advanced math, via NumPy ufuncs
np.log(apt)
Beijing      10.915088
Guangzhou    10.714418
Hangzhou     10.714418
ShangHai     11.002100
Shenzhen     11.156251
Suzhou             NaN
Name: income, dtype: float64
cars = pd.Series({'Beijing':350000, 'ShangHai':400000, 'Shenzhen':300000, 'Tianjin':200000, 'Guangzhou':250000, 'Chongqing':150000})
cars
Beijing      350000
Chongqing    150000
Guangzhou    250000
ShangHai     400000
Shenzhen     300000
Tianjin      200000
dtype: int64
expense = cars + 10*apt
expense
Beijing       900000.0
Chongqing          NaN
Guangzhou     700000.0
Hangzhou           NaN
ShangHai     1000000.0
Shenzhen     1000000.0
Suzhou             NaN
Tianjin            NaN
dtype: float64
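Index alignment means any label missing on either side yields NaN. When treating an absent value as 0 is acceptable (a judgment that depends on the data), the method form of the operator takes a fill_value; a sketch:

# labels present on only one side are treated as 0 instead of producing NaN
# (Suzhou still comes out NaN, since it is missing on both sides)
cars.add(10*apt, fill_value=0)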

Missing data

'Hangzhou' in apt
True
'Hangzhou' in cars
False
apt
Beijing      55000.0
Guangzhou    45000.0
Hangzhou     45000.0
ShangHai     60000.0
Shenzhen     70000.0
Suzhou           NaN
Name: income, dtype: float64
# returns a boolean Series
apt.notnull()
Beijing       True
Guangzhou     True
Hangzhou      True
ShangHai      True
Shenzhen      True
Suzhou       False
Name: income, dtype: bool
apt.isnull()
Beijing      False
Guangzhou    False
Hangzhou     False
ShangHai     False
Shenzhen     False
Suzhou        True
Name: income, dtype: bool
expense
Beijing       900000.0
Chongqing          NaN
Guangzhou     700000.0
Hangzhou           NaN
ShangHai     1000000.0
Shenzhen     1000000.0
Suzhou             NaN
Tianjin            NaN
dtype: float64
expense[expense.isnull()] = expense.mean()
expense
Beijing       900000.0
Chongqing     900000.0
Guangzhou     700000.0
Hangzhou      900000.0
ShangHai     1000000.0
Shenzhen     1000000.0
Suzhou        900000.0
Tianjin       900000.0
dtype: float64
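The same repair is usually written with fillna, which reads more directly than boolean assignment; an equivalent sketch:

# fill NaN with the mean in one step
expense = expense.fillna(expense.mean())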

The DataFrame data structure

A DataFrame is a table: a Series is a one-dimensional array, while a DataFrame is two-dimensional. You can think of it as a spreadsheet in Excel, or as a collection of Series that share an index.

data = {'City':['Beijing','ShangHai','Guangzhou','Shenzhen','Hangzhou','Chongqing'],
        'year':[2017,2018,2017,2018,2017,2017],
        'population':[2100,2300,1000,700,500,500]}
pd.DataFrame(data)
City population year
0 Beijing 2100 2017
1 ShangHai 2300 2018
2 Guangzhou 1000 2017
3 Shenzhen 700 2018
4 Hangzhou 500 2017
5 Chongqing 500 2017
pd.DataFrame(data, columns=['year','City','population'])
year City population
0 2017 Beijing 2100
1 2018 ShangHai 2300
2 2017 Guangzhou 1000
3 2018 Shenzhen 700
4 2017 Hangzhou 500
5 2017 Chongqing 500
# index
pd.DataFrame(data, columns=['year','City','population'], index=['one','two','three','four','five','six'])
year City population
one 2017 Beijing 2100
two 2018 ShangHai 2300
three 2017 Guangzhou 1000
four 2018 Shenzhen 700
five 2017 Hangzhou 500
six 2017 Chongqing 500
# a DataFrame can be viewed as a collection of Series
apt
Beijing      55000.0
Guangzhou    45000.0
Hangzhou     45000.0
ShangHai     60000.0
Shenzhen     70000.0
Suzhou           NaN
Name: income, dtype: float64
cars
Beijing      350000
Chongqing    150000
Guangzhou    250000
ShangHai     400000
Shenzhen     300000
Tianjin      200000
dtype: int64
df = pd.DataFrame({'apt':apt, 'cars':cars})
df
apt cars
Beijing 55000.0 350000.0
Chongqing NaN 150000.0
Guangzhou 45000.0 250000.0
Hangzhou 45000.0 NaN
ShangHai 60000.0 400000.0
Shenzhen 70000.0 300000.0
Suzhou NaN NaN
Tianjin NaN 200000.0
# extract one column (a Series)
df['apt']
Beijing      55000.0
Chongqing        NaN
Guangzhou    45000.0
Hangzhou     45000.0
ShangHai     60000.0
Shenzhen     70000.0
Suzhou           NaN
Tianjin          NaN
Name: apt, dtype: float64
type(df['apt'])
pandas.core.series.Series
df[['apt']]
apt
Beijing 55000.0
Chongqing NaN
Guangzhou 45000.0
Hangzhou 45000.0
ShangHai 60000.0
Shenzhen 70000.0
Suzhou NaN
Tianjin NaN
type(df[['apt']])
pandas.core.frame.DataFrame
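When a column name is a valid Python identifier, attribute access is a handy shorthand for reading a column, though bracket syntax is the safe choice for assignment and for names that clash with DataFrame methods:

# same Series as df['apt']
df.apt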
# assignment
df
apt cars
Beijing 55000.0 350000.0
Chongqing NaN 150000.0
Guangzhou 45000.0 250000.0
Hangzhou 45000.0 NaN
ShangHai 60000.0 400000.0
Shenzhen 70000.0 300000.0
Suzhou NaN NaN
Tianjin NaN 200000.0
df['bonus'] = 40000
df
apt cars bonus
Beijing 55000.0 350000.0 40000
Chongqing NaN 150000.0 40000
Guangzhou 45000.0 250000.0 40000
Hangzhou 45000.0 NaN 40000
ShangHai 60000.0 400000.0 40000
Shenzhen 70000.0 300000.0 40000
Suzhou NaN NaN 40000
Tianjin NaN 200000.0 40000
# arithmetic between two columns
df['expense'] = df['apt'] + df['bonus']
df
apt cars bonus expense
Beijing 55000.0 350000.0 40000 95000.0
Chongqing NaN 150000.0 40000 NaN
Guangzhou 45000.0 250000.0 40000 85000.0
Hangzhou 45000.0 NaN 40000 85000.0
ShangHai 60000.0 400000.0 40000 100000.0
Shenzhen 70000.0 300000.0 40000 110000.0
Suzhou NaN NaN 40000 NaN
Tianjin NaN 200000.0 40000 NaN
df.index
Index(['Beijing', 'Chongqing', 'Guangzhou', 'Hangzhou', 'ShangHai', 'Shenzhen',
       'Suzhou', 'Tianjin'],
      dtype='object')
# use index labels to fetch the rows (or columns) you want
df.loc['Beijing']
apt         55000.0
cars       350000.0
bonus       40000.0
expense     95000.0
Name: Beijing, dtype: float64
type(df.loc['Beijing'])
pandas.core.series.Series
df.loc[['Beijing', 'ShangHai', 'Guangzhou']]
apt cars bonus expense
Beijing 55000.0 350000.0 40000 95000.0
ShangHai 60000.0 400000.0 40000 100000.0
Guangzhou 45000.0 250000.0 40000 85000.0
df
apt cars bonus expense
Beijing 55000.0 350000.0 40000 95000.0
Chongqing NaN 150000.0 40000 NaN
Guangzhou 45000.0 250000.0 40000 85000.0
Hangzhou 45000.0 NaN 40000 85000.0
ShangHai 60000.0 400000.0 40000 100000.0
Shenzhen 70000.0 300000.0 40000 110000.0
Suzhou NaN NaN 40000 NaN
Tianjin NaN 200000.0 40000 NaN
# the loc accessor
# select rows and columns by their index labels
df.loc['Beijing':'Suzhou', ['apt','bonus']]
apt bonus
Beijing 55000.0 40000
Chongqing NaN 40000
Guangzhou 45000.0 40000
Hangzhou 45000.0 40000
ShangHai 60000.0 40000
Shenzhen 70000.0 40000
Suzhou NaN 40000
# slice-like usage (label slices include both endpoints)
df.loc['Beijing':'Suzhou', 'apt':'bonus']
apt cars bonus
Beijing 55000.0 350000.0 40000
Chongqing NaN 150000.0 40000
Guangzhou 45000.0 250000.0 40000
Hangzhou 45000.0 NaN 40000
ShangHai 60000.0 400000.0 40000
Shenzhen 70000.0 300000.0 40000
Suzhou NaN NaN 40000
# passing lists of labels
df.loc[['Beijing','Suzhou'], ['apt','bonus']]
apt bonus
Beijing 55000.0 40000
Suzhou NaN 40000
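loc selects by label; its positional counterpart iloc takes integer positions and slices like plain Python (end-exclusive). A small sketch:

# first two rows; columns 0 and 2 (apt and bonus) by position
df.iloc[0:2, [0, 2]]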
df
apt cars bonus expense
Beijing 55000.0 350000.0 40000 95000.0
Chongqing NaN 150000.0 40000 NaN
Guangzhou 45000.0 250000.0 40000 85000.0
Hangzhou 45000.0 NaN 40000 85000.0
ShangHai 60000.0 400000.0 40000 100000.0
Shenzhen 70000.0 300000.0 40000 110000.0
Suzhou NaN NaN 40000 NaN
Tianjin NaN 200000.0 40000 NaN
df.loc['Beijing','bonus'] = 50000
df
apt cars bonus expense
Beijing 55000.0 350000.0 50000 95000.0
Chongqing NaN 150000.0 40000 NaN
Guangzhou 45000.0 250000.0 40000 85000.0
Hangzhou 45000.0 NaN 40000 85000.0
ShangHai 60000.0 400000.0 40000 100000.0
Shenzhen 70000.0 300000.0 40000 110000.0
Suzhou NaN NaN 40000 NaN
Tianjin NaN 200000.0 40000 NaN
# assign to an entire column
df.loc[:,'expense'] = 100000
df
apt cars bonus expense
Beijing 55000.0 350000.0 50000 100000
Chongqing NaN 150000.0 40000 100000
Guangzhou 45000.0 250000.0 40000 100000
Hangzhou 45000.0 NaN 40000 100000
ShangHai 60000.0 400000.0 40000 100000
Shenzhen 70000.0 300000.0 40000 100000
Suzhou NaN NaN 40000 100000
Tianjin NaN 200000.0 40000 100000
# a tuple giving the DataFrame's dimensions
df.shape
(8, 4)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 8 entries, Beijing to Tianjin
Data columns (total 4 columns):
apt        5 non-null float64
cars       6 non-null float64
bonus      8 non-null int64
expense    8 non-null int64
dtypes: float64(2), int64(2)
memory usage: 320.0+ bytes
df.T
Beijing Chongqing Guangzhou Hangzhou ShangHai Shenzhen Suzhou Tianjin
apt 55000.0 NaN 45000.0 45000.0 60000.0 70000.0 NaN NaN
cars 350000.0 150000.0 250000.0 NaN 400000.0 300000.0 NaN 200000.0
bonus 50000.0 40000.0 40000.0 40000.0 40000.0 40000.0 40000.0 40000.0
expense 100000.0 100000.0 100000.0 100000.0 100000.0 100000.0 100000.0 100000.0
df
apt cars bonus expense
Beijing 55000.0 350000.0 50000 100000
Chongqing NaN 150000.0 40000 100000
Guangzhou 45000.0 250000.0 40000 100000
Hangzhou 45000.0 NaN 40000 100000
ShangHai 60000.0 400000.0 40000 100000
Shenzhen 70000.0 300000.0 40000 100000
Suzhou NaN NaN 40000 100000
Tianjin NaN 200000.0 40000 100000
df.describe()
apt cars bonus expense
count 5.000000 6.000000 8.000000 8.0
mean 55000.000000 275000.000000 41250.000000 100000.0
std 10606.601718 93541.434669 3535.533906 0.0
min 45000.000000 150000.000000 40000.000000 100000.0
25% 45000.000000 212500.000000 40000.000000 100000.0
50% 55000.000000 275000.000000 40000.000000 100000.0
75% 60000.000000 337500.000000 40000.000000 100000.0
max 70000.000000 400000.000000 50000.000000 100000.0
df['cars']
Beijing      350000.0
Chongqing    150000.0
Guangzhou    250000.0
Hangzhou          NaN
ShangHai     400000.0
Shenzhen     300000.0
Suzhou            NaN
Tianjin      200000.0
Name: cars, dtype: float64
df['cars'] < 310000
Beijing      False
Chongqing     True
Guangzhou     True
Hangzhou     False
ShangHai     False
Shenzhen      True
Suzhou       False
Tianjin       True
Name: cars, dtype: bool
df.loc[:,'color'] = ['红','黄','紫','蓝','红','绿','棕','橙']
df
apt cars bonus expense color
Beijing 55000.0 350000.0 50000 100000 红
Chongqing NaN 150000.0 40000 100000 黄
Guangzhou 45000.0 250000.0 40000 100000 紫
Hangzhou 45000.0 NaN 40000 100000 蓝
ShangHai 60000.0 400000.0 40000 100000 红
Shenzhen 70000.0 300000.0 40000 100000 绿
Suzhou NaN NaN 40000 100000 棕
Tianjin NaN 200000.0 40000 100000 橙
df['color'].isin(['红','绿'])
Beijing       True
Chongqing    False
Guangzhou    False
Hangzhou     False
ShangHai      True
Shenzhen      True
Suzhou       False
Tianjin      False
Name: color, dtype: bool
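The boolean Series returned by isin can be used directly to filter the rows of the DataFrame:

# keep only rows whose color is 红 or 绿
df[df['color'].isin(['红','绿'])]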
df
apt cars bonus expense color
Beijing 55000.0 350000.0 50000 100000 红
Chongqing NaN 150000.0 40000 100000 黄
Guangzhou 45000.0 250000.0 40000 100000 紫
Hangzhou 45000.0 NaN 40000 100000 蓝
ShangHai 60000.0 400000.0 40000 100000 红
Shenzhen 70000.0 300000.0 40000 100000 绿
Suzhou NaN NaN 40000 100000 棕
Tianjin NaN 200000.0 40000 100000 橙
# fill missing values
df.fillna(value=50000)
#df.fillna(value=50000, inplace=True)
apt cars bonus expense color
Beijing 55000.0 350000.0 50000 100000 红
Chongqing 50000.0 150000.0 40000 100000 黄
Guangzhou 45000.0 250000.0 40000 100000 紫
Hangzhou 45000.0 50000.0 40000 100000 蓝
ShangHai 60000.0 400000.0 40000 100000 红
Shenzhen 70000.0 300000.0 40000 100000 绿
Suzhou 50000.0 50000.0 40000 100000 棕
Tianjin 50000.0 200000.0 40000 100000 橙
df
apt cars bonus expense color
Beijing 55000.0 350000.0 50000 100000 红
Chongqing NaN 150000.0 40000 100000 黄
Guangzhou 45000.0 250000.0 40000 100000 紫
Hangzhou 45000.0 NaN 40000 100000 蓝
ShangHai 60000.0 400000.0 40000 100000 红
Shenzhen 70000.0 300000.0 40000 100000 绿
Suzhou NaN NaN 40000 100000 棕
Tianjin NaN 200000.0 40000 100000 橙
# forward fill (propagate the previous value)
df.fillna(method='ffill')
apt cars bonus expense color
Beijing 55000.0 350000.0 50000 100000 红
Chongqing 55000.0 150000.0 40000 100000 黄
Guangzhou 45000.0 250000.0 40000 100000 紫
Hangzhou 45000.0 250000.0 40000 100000 蓝
ShangHai 60000.0 400000.0 40000 100000 红
Shenzhen 70000.0 300000.0 40000 100000 绿
Suzhou 70000.0 300000.0 40000 100000 棕
Tianjin 70000.0 200000.0 40000 100000 橙
# backward fill (use the next value)
df.fillna(method='bfill')
apt cars bonus expense color
Beijing 55000.0 350000.0 50000 100000 红
Chongqing 45000.0 150000.0 40000 100000 黄
Guangzhou 45000.0 250000.0 40000 100000 紫
Hangzhou 45000.0 400000.0 40000 100000 蓝
ShangHai 60000.0 400000.0 40000 100000 红
Shenzhen 70000.0 300000.0 40000 100000 绿
Suzhou NaN 200000.0 40000 100000 棕
Tianjin NaN 200000.0 40000 100000 橙
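Instead of filling, rows containing NaN can also be discarded with dropna; a sketch:

# drop every row that still contains a NaN
df.dropna()
# or only look at specific columns when deciding
df.dropna(subset=['apt'])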
!head -10 data/GOOG.csv
'head' is not recognized as an internal or external command,
operable program or batch file.
goog = pd.read_csv('data/GOOG.csv', index_col=0, parse_dates=['Date'])
# first 5 rows
goog.head(5)
Open High Low Close Adj Close Volume
Date
2004-08-19 49.813286 51.835709 47.800831 49.982655 49.982655 44871300
2004-08-20 50.316402 54.336334 50.062355 53.952770 53.952770 22942800
2004-08-23 55.168217 56.528118 54.321388 54.495735 54.495735 18342800
2004-08-24 55.412300 55.591629 51.591621 52.239193 52.239193 15319700
2004-08-25 52.284027 53.798351 51.746044 52.802086 52.802086 9232100
goog
Open High Low Close Adj Close Volume
Date
2004-08-19 49.813286 51.835709 47.800831 49.982655 49.982655 44871300
2004-08-20 50.316402 54.336334 50.062355 53.952770 53.952770 22942800
2004-08-23 55.168217 56.528118 54.321388 54.495735 54.495735 18342800
2004-08-24 55.412300 55.591629 51.591621 52.239193 52.239193 15319700
2004-08-25 52.284027 53.798351 51.746044 52.802086 52.802086 9232100
2004-08-26 52.279045 53.773445 52.134586 53.753517 53.753517 7128600
2004-08-27 53.848164 54.107193 52.647663 52.876804 52.876804 6241200
2004-08-30 52.443428 52.548038 50.814533 50.814533 50.814533 5221400
2004-08-31 50.958992 51.661362 50.889256 50.993862 50.993862 4941200
2004-09-01 51.158245 51.292744 49.648903 49.937820 49.937820 9181600
2004-09-02 49.409801 50.993862 49.285267 50.565468 50.565468 15190400
2004-09-03 50.286514 50.680038 49.474556 49.818268 49.818268 5176800
2004-09-07 50.316402 50.809555 49.619015 50.600338 50.600338 5875200
2004-09-08 50.181908 51.322632 50.062355 50.958992 50.958992 5009200
2004-09-09 51.073563 51.163227 50.311420 50.963974 50.963974 4080900
2004-09-10 50.610302 53.081039 50.460861 52.468334 52.468334 8740200
2004-09-13 53.115910 54.002586 53.031227 53.549286 53.549286 7881300
2004-09-14 53.524376 55.790882 53.195610 55.536835 55.536835 10880300
2004-09-15 55.073570 56.901718 54.894241 55.790882 55.790882 10763900
2004-09-16 55.960247 57.683788 55.616535 56.772205 56.772205 9310200
2004-09-17 56.996365 58.525631 56.562988 58.525631 58.525631 9517400
2004-09-20 58.256641 60.572956 58.166977 59.457142 59.457142 10679200
2004-09-21 59.681301 59.985161 58.535595 58.699978 58.699978 7263000
2004-09-22 58.480801 59.611561 58.186901 58.968971 58.968971 7617100
2004-09-23 59.198112 61.086033 58.291508 60.184414 60.184414 8576100
2004-09-24 60.244190 61.818291 59.656395 59.691261 59.691261 9166700
2004-09-27 59.556767 60.214302 58.680054 58.909195 58.909195 7099600
2004-09-28 60.423519 63.462128 59.880554 63.193138 63.193138 17009400
2004-09-29 63.113434 67.257904 62.879314 65.295258 65.295258 30661400
2004-09-30 64.707458 65.902977 64.259140 64.558022 64.558022 13823300
... ... ... ... ... ... ...
2017-06-08 982.349976 984.570007 977.200012 983.409973 983.409973 1481900
2017-06-09 984.500000 984.500000 935.630005 949.830017 949.830017 3309400
2017-06-12 939.559998 949.354980 915.232971 942.900024 942.900024 3763500
2017-06-13 951.909973 959.979980 944.090027 953.400024 953.400024 2013300
2017-06-14 959.919983 961.150024 942.250000 950.760010 950.760010 1489700
2017-06-15 933.969971 943.338989 924.440002 942.309998 942.309998 2133100
2017-06-16 940.000000 942.039978 931.594971 939.780029 939.780029 3094700
2017-06-19 949.960022 959.989990 949.049988 957.369995 957.369995 1533300
2017-06-20 957.520020 961.619995 950.010010 950.630005 950.630005 1126000
2017-06-21 953.640015 960.099976 950.760010 959.450012 959.450012 1202200
2017-06-22 958.700012 960.719971 954.549988 957.090027 957.090027 941400
2017-06-23 956.830017 966.000000 954.200012 965.590027 965.590027 1527900
2017-06-26 969.900024 973.309998 950.789978 952.270020 952.270020 1598400
2017-06-27 942.460022 948.289978 926.849976 927.330017 927.330017 2579900
2017-06-28 929.000000 942.750000 916.000000 940.489990 940.489990 2721400
2017-06-29 929.919983 931.260010 910.619995 917.789978 917.789978 3299200
2017-06-30 926.049988 926.049988 908.309998 908.729980 908.729980 2065500
2017-07-03 912.179993 913.940002 894.789978 898.700012 898.700012 1709800
2017-07-05 901.760010 914.510010 898.500000 911.710022 911.710022 1813900
2017-07-06 904.119995 914.943970 899.700012 906.690002 906.690002 1424500
2017-07-07 908.849976 921.539978 908.849976 918.590027 918.590027 1637800
2017-07-10 921.770020 930.380005 919.590027 928.799988 928.799988 1192800
2017-07-11 929.539978 931.429993 922.000000 930.090027 930.090027 1113200
2017-07-12 938.679993 946.299988 934.469971 943.830017 943.830017 1532100
2017-07-13 946.289978 954.450012 943.010010 947.159973 947.159973 1294700
2017-07-14 952.000000 956.909973 948.005005 955.989990 955.989990 1053800
2017-07-17 957.000000 960.739990 949.241028 953.419983 953.419983 1165500
2017-07-18 953.000000 968.039978 950.599976 965.400024 965.400024 1154000
2017-07-19 967.840027 973.039978 964.030029 970.890015 970.890015 1224500
2017-07-20 975.000000 975.900024 961.510010 968.150024 968.150024 1616500

3253 rows × 6 columns

goog.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3253 entries, 2004-08-19 to 2017-07-20
Data columns (total 6 columns):
Open         3253 non-null float64
High         3253 non-null float64
Low          3253 non-null float64
Close        3253 non-null float64
Adj Close    3253 non-null float64
Volume       3253 non-null int64
dtypes: float64(5), int64(1)
memory usage: 177.9 KB
goog.describe()
Open High Low Close Adj Close Volume
count 3253.000000 3253.000000 3253.000000 3253.000000 3253.000000 3.253000e+03
mean 370.588678 373.854568 366.959060 370.463274 370.463274 8.139070e+06
std 212.537536 213.645163 211.213609 212.542226 212.542226 8.403870e+06
min 49.409801 50.680038 47.800831 49.818268 49.818268 7.900000e+03
25% 225.928162 228.050217 222.984207 224.986694 224.986694 2.743600e+06
50% 292.030396 293.898407 288.538483 291.318054 291.318054 5.374600e+06
75% 531.599976 535.729126 527.810913 532.299988 532.299988 1.081150e+07
max 984.500000 988.250000 977.200012 983.679993 983.679993 8.254150e+07
goog.tail()
Open High Low Close Adj Close Volume
Date
2017-07-14 952.000000 956.909973 948.005005 955.989990 955.989990 1053800
2017-07-17 957.000000 960.739990 949.241028 953.419983 953.419983 1165500
2017-07-18 953.000000 968.039978 950.599976 965.400024 965.400024 1154000
2017-07-19 967.840027 973.039978 964.030029 970.890015 970.890015 1224500
2017-07-20 975.000000 975.900024 961.510010 968.150024 968.150024 1616500
goog.index
DatetimeIndex(['2004-08-19', '2004-08-20', '2004-08-23', '2004-08-24',
               '2004-08-25', '2004-08-26', '2004-08-27', '2004-08-30',
               '2004-08-31', '2004-09-01',
               ...
               '2017-07-07', '2017-07-10', '2017-07-11', '2017-07-12',
               '2017-07-13', '2017-07-14', '2017-07-17', '2017-07-18',
               '2017-07-19', '2017-07-20'],
              dtype='datetime64[ns]', name='Date', length=3253, freq=None)
# day of the week for each date (0 = Monday)
goog.loc[:,'dow'] = goog.index.dayofweek
# day of the year for each date
# goog.loc[:,'doy'] = goog.index.dayofyear
goog
Open High Low Close Adj Close Volume dow
Date
2004-08-19 49.813286 51.835709 47.800831 49.982655 49.982655 44871300 3
2004-08-20 50.316402 54.336334 50.062355 53.952770 53.952770 22942800 4
2004-08-23 55.168217 56.528118 54.321388 54.495735 54.495735 18342800 0
2004-08-24 55.412300 55.591629 51.591621 52.239193 52.239193 15319700 1
2004-08-25 52.284027 53.798351 51.746044 52.802086 52.802086 9232100 2
2004-08-26 52.279045 53.773445 52.134586 53.753517 53.753517 7128600 3
2004-08-27 53.848164 54.107193 52.647663 52.876804 52.876804 6241200 4
2004-08-30 52.443428 52.548038 50.814533 50.814533 50.814533 5221400 0
2004-08-31 50.958992 51.661362 50.889256 50.993862 50.993862 4941200 1
2004-09-01 51.158245 51.292744 49.648903 49.937820 49.937820 9181600 2
2004-09-02 49.409801 50.993862 49.285267 50.565468 50.565468 15190400 3
2004-09-03 50.286514 50.680038 49.474556 49.818268 49.818268 5176800 4
2004-09-07 50.316402 50.809555 49.619015 50.600338 50.600338 5875200 1
2004-09-08 50.181908 51.322632 50.062355 50.958992 50.958992 5009200 2
2004-09-09 51.073563 51.163227 50.311420 50.963974 50.963974 4080900 3
2004-09-10 50.610302 53.081039 50.460861 52.468334 52.468334 8740200 4
2004-09-13 53.115910 54.002586 53.031227 53.549286 53.549286 7881300 0
2004-09-14 53.524376 55.790882 53.195610 55.536835 55.536835 10880300 1
2004-09-15 55.073570 56.901718 54.894241 55.790882 55.790882 10763900 2
2004-09-16 55.960247 57.683788 55.616535 56.772205 56.772205 9310200 3
2004-09-17 56.996365 58.525631 56.562988 58.525631 58.525631 9517400 4
2004-09-20 58.256641 60.572956 58.166977 59.457142 59.457142 10679200 0
2004-09-21 59.681301 59.985161 58.535595 58.699978 58.699978 7263000 1
2004-09-22 58.480801 59.611561 58.186901 58.968971 58.968971 7617100 2
2004-09-23 59.198112 61.086033 58.291508 60.184414 60.184414 8576100 3
2004-09-24 60.244190 61.818291 59.656395 59.691261 59.691261 9166700 4
2004-09-27 59.556767 60.214302 58.680054 58.909195 58.909195 7099600 0
2004-09-28 60.423519 63.462128 59.880554 63.193138 63.193138 17009400 1
2004-09-29 63.113434 67.257904 62.879314 65.295258 65.295258 30661400 2
2004-09-30 64.707458 65.902977 64.259140 64.558022 64.558022 13823300 3
... ... ... ... ... ... ... ...
2017-06-08 982.349976 984.570007 977.200012 983.409973 983.409973 1481900 3
2017-06-09 984.500000 984.500000 935.630005 949.830017 949.830017 3309400 4
2017-06-12 939.559998 949.354980 915.232971 942.900024 942.900024 3763500 0
2017-06-13 951.909973 959.979980 944.090027 953.400024 953.400024 2013300 1
2017-06-14 959.919983 961.150024 942.250000 950.760010 950.760010 1489700 2
2017-06-15 933.969971 943.338989 924.440002 942.309998 942.309998 2133100 3
2017-06-16 940.000000 942.039978 931.594971 939.780029 939.780029 3094700 4
2017-06-19 949.960022 959.989990 949.049988 957.369995 957.369995 1533300 0
2017-06-20 957.520020 961.619995 950.010010 950.630005 950.630005 1126000 1
2017-06-21 953.640015 960.099976 950.760010 959.450012 959.450012 1202200 2
2017-06-22 958.700012 960.719971 954.549988 957.090027 957.090027 941400 3
2017-06-23 956.830017 966.000000 954.200012 965.590027 965.590027 1527900 4
2017-06-26 969.900024 973.309998 950.789978 952.270020 952.270020 1598400 0
2017-06-27 942.460022 948.289978 926.849976 927.330017 927.330017 2579900 1
2017-06-28 929.000000 942.750000 916.000000 940.489990 940.489990 2721400 2
2017-06-29 929.919983 931.260010 910.619995 917.789978 917.789978 3299200 3
2017-06-30 926.049988 926.049988 908.309998 908.729980 908.729980 2065500 4
2017-07-03 912.179993 913.940002 894.789978 898.700012 898.700012 1709800 0
2017-07-05 901.760010 914.510010 898.500000 911.710022 911.710022 1813900 2
2017-07-06 904.119995 914.943970 899.700012 906.690002 906.690002 1424500 3
2017-07-07 908.849976 921.539978 908.849976 918.590027 918.590027 1637800 4
2017-07-10 921.770020 930.380005 919.590027 928.799988 928.799988 1192800 0
2017-07-11 929.539978 931.429993 922.000000 930.090027 930.090027 1113200 1
2017-07-12 938.679993 946.299988 934.469971 943.830017 943.830017 1532100 2
2017-07-13 946.289978 954.450012 943.010010 947.159973 947.159973 1294700 3
2017-07-14 952.000000 956.909973 948.005005 955.989990 955.989990 1053800 4
2017-07-17 957.000000 960.739990 949.241028 953.419983 953.419983 1165500 0
2017-07-18 953.000000 968.039978 950.599976 965.400024 965.400024 1154000 1
2017-07-19 967.840027 973.039978 964.030029 970.890015 970.890015 1224500 2
2017-07-20 975.000000 975.900024 961.510010 968.150024 968.150024 1616500 3

3253 rows × 7 columns

goog.loc[:,'doy'] = goog.index.dayofyear
goog.head()
Open High Low Close Adj Close Volume dow doy
Date
2004-08-19 49.813286 51.835709 47.800831 49.982655 49.982655 44871300 3 232
2004-08-20 50.316402 54.336334 50.062355 53.952770 53.952770 22942800 4 233
2004-08-23 55.168217 56.528118 54.321388 54.495735 54.495735 18342800 0 236
2004-08-24 55.412300 55.591629 51.591621 52.239193 52.239193 15319700 1 237
2004-08-25 52.284027 53.798351 51.746044 52.802086 52.802086 9232100 2 238
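Columns like dow make grouped statistics easy. As a taste of groupby (covered in depth later), a sketch of average trading volume by weekday:

# mean volume per day of the week (0 = Monday)
goog.groupby('dow')['Volume'].mean()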
%matplotlib inline
goog['Open'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x244022eb908>

nvda = pd.read_csv('data/NVDA.csv', index_col=0, parse_dates=['Date'])
nvda.head(10)
Open High Low Close Adj Close Volume
Date
1999-01-22 1.750000 1.953125 1.552083 1.640625 1.523430 67867200
1999-01-25 1.770833 1.833333 1.640625 1.812500 1.683028 12762000
1999-01-26 1.833333 1.869792 1.645833 1.671875 1.552448 8580000
1999-01-27 1.677083 1.718750 1.583333 1.666667 1.547611 6109200
1999-01-28 1.666667 1.677083 1.651042 1.661458 1.542776 5688000
1999-01-29 1.661458 1.666667 1.583333 1.583333 1.470231 6100800
1999-02-01 1.583333 1.625000 1.583333 1.614583 1.499249 3867600
1999-02-02 1.583333 1.625000 1.442708 1.489583 1.383178 6602400
1999-02-03 1.468750 1.541667 1.458333 1.520833 1.412196 1878000
1999-02-04 1.541667 1.645833 1.520833 1.604167 1.489577 4548000
nvda.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4654 entries, 1999-01-22 to 2017-07-20
Data columns (total 6 columns):
Open         4654 non-null float64
High         4654 non-null float64
Low          4654 non-null float64
Close        4654 non-null float64
Adj Close    4654 non-null float64
Volume       4654 non-null int64
dtypes: float64(5), int64(1)
memory usage: 254.5 KB
nvda.describe()
Open High Low Close Adj Close Volume
count 4654.000000 4654.000000 4654.000000 4654.000000 4654.000000 4.654000e+03
mean 18.872888 19.222090 18.513574 18.879564 18.091126 1.632563e+07
std 22.025278 22.346668 21.662627 22.048935 22.093697 1.204002e+07
min 1.395833 1.421875 1.333333 1.364583 1.267107 4.920000e+05
25% 8.510000 8.755000 8.245261 8.505000 7.897462 8.721475e+06
50% 13.810000 14.090000 13.500000 13.814167 12.832797 1.373830e+07
75% 19.770000 20.129999 19.505000 19.789167 18.774976 2.041408e+07
max 166.330002 168.500000 164.610001 167.500000 167.500000 2.307714e+08
%matplotlib inline
nvda['Open'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x244025689e8>

nvda['Open'].plot(grid=True)
# bar-chart version:
# nvda['Open'].plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x24402568908>

nvda.index > '2016-01-01'
array([False, False, False, ...,  True,  True,  True])
nvda.index < '2016-04-01'
array([ True,  True,  True, ..., False, False, False])
# combining conditions with boolean masks
# | means or
# & means and
# ~ means not (not ! as in some other languages)
(nvda.index > '2016-01-01') & (nvda.index < '2016-02-01')
array([False, False, False, ..., False, False, False])
nvda[(nvda.index > '2016-01-01') & (nvda.index < '2016-02-01')]['Open'].plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x24401f3cd68>

nvda[(nvda.index > '2016-01-01') & (nvda.index < '2016-02-01')]['Open'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x244026c24a8>

nvda.head()
Open High Low Close Adj Close Volume
Date
1999-01-22 1.750000 1.953125 1.552083 1.640625 1.523430 67867200
1999-01-25 1.770833 1.833333 1.640625 1.812500 1.683028 12762000
1999-01-26 1.833333 1.869792 1.645833 1.671875 1.552448 8580000
1999-01-27 1.677083 1.718750 1.583333 1.666667 1.547611 6109200
1999-01-28 1.666667 1.677083 1.651042 1.661458 1.542776 5688000
nvda['Open'].mean()
18.87288806854321
nvda[['Open','High']].mean()
Open    18.872888
High    19.222090
dtype: float64
nvda[(nvda.index >= '2016-01-01') & (nvda.index <= '2016-06-30')].describe()
Open High Low Close Adj Close Volume
count 125.000000 125.000000 125.000000 125.000000 125.000000 1.250000e+02
mean 35.931360 36.416000 35.498880 36.006720 35.705421 9.802855e+06
std 6.722128 6.771410 6.768863 6.798316 6.816078 5.527803e+06
min 24.780001 25.559999 24.750000 25.219999 24.922132 4.382600e+06
25% 31.270000 31.870001 30.820000 31.520000 31.147726 6.919400e+06
50% 35.299999 35.570000 34.840000 35.389999 35.099426 8.707300e+06
75% 42.000000 42.799999 41.459999 42.279999 41.932854 1.122720e+07
max 47.759998 48.540001 47.650002 48.490002 48.216755 5.275640e+07
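Since the index is a DatetimeIndex, the same window can also be written as a label slice, which is often more readable than combining boolean masks (note that loc slices include both endpoints):

# equivalent selection via date-string slicing
nvda.loc['2016-01-01':'2016-06-30'].describe()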
df
apt cars bonus expense color
Beijing 55000.0 350000.0 50000 100000 红
Chongqing NaN 150000.0 40000 100000 黄
Guangzhou 45000.0 250000.0 40000 100000 紫
Hangzhou 45000.0 NaN 40000 100000 蓝
ShangHai 60000.0 400000.0 40000 100000 红
Shenzhen 70000.0 300000.0 40000 100000 绿
Suzhou NaN NaN 40000 100000 棕
Tianjin NaN 200000.0 40000 100000 橙
df.to_csv('data/my_df.csv')
!head -10 data/my_df.csv
'head' is not recognized as an internal or external command,
operable program or batch file.
df.to_csv('data/my_df.csv', index=False)
!head -10 data/my_df.csv
'head' is not recognized as an internal or external command,
operable program or batch file.
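Reading the file back confirms what was written. A sketch; note that with index=False the city names are lost, because they lived in the index rather than in a column:

# round-trip: load the CSV we just wrote
pd.read_csv('data/my_df.csv')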
help(pd.read_csv)
Help on function read_csv in module pandas.io.parsers:

read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=None, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=None, compact_ints=None, use_unsigned=None, low_memory=True, buffer_lines=None, memory_map=False, float_precision=None)
    Read CSV (comma-separated) file into DataFrame

    Also supports optionally iterating or breaking of the file
    into chunks.

    Additional help can be found in the `online docs for IO Tools
    <http://pandas.pydata.org/pandas-docs/stable/io.html>`_.

    Parameters
    ----------
    filepath_or_buffer : str, pathlib.Path, py._path.local.LocalPath or any object with a read() method (such as a file handle or StringIO)
        The string could be a URL. Valid URL schemes include http, ftp, s3, and file.
    sep : str, default ','
        Delimiter to use.
    header : int or list of ints, default 'infer'
        Row number(s) to use as the column names, and the start of the data.
    names : array-like, default None
        List of column names to use.
    index_col : int or sequence or False, default None
        Column to use as the row labels of the DataFrame.
    usecols : array-like or callable, default None
        Return a subset of the columns.
    dtype : Type name or dict of column -> type, default None
        Data type for data or columns.
    skiprows : list-like or integer or callable, default None
        Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.
    nrows : int, default None
        Number of rows of file to read. Useful for reading pieces of large files.
    na_values : scalar, str, list-like, or dict, default None
        Additional strings to recognize as NA/NaN.
    parse_dates : boolean or list of ints or names or list of lists or dict, default False
        Controls datetime parsing of the columns and/or the index.
    encoding : str, default None
        Encoding to use for UTF when reading/writing (ex. 'utf-8').
    ...

    Returns
    -------
    result : DataFrame or TextParser
