Data Analysis Engineer - Lecture 02: Pandas Tutorial (Part 1)
- Data Analysis Engineer - Lecture 02: Pandas Tutorial (Part 1)
- Introduction to Pandas
- Contents
- The Series data structure
- Constructing and initializing a Series
- Assigning values to a Series
- Missing data
- The DataFrame data structure
Data Analysis Engineer - Lecture 02: Pandas Tutorial (Part 1)
pandas is a Python library built specifically for data analysis
Introduction to Pandas
- a Python package for data analysis and processing
- built on top of numpy (scientific computing on "matrices")
- feels like driving Excel/SQL from Python
Contents
- Series
- DataFrame
- Index
- CSV file reading and writing
The Series data structure
import numpy as np
import pandas as pd
# json.loads() decodes a JSON string into Python objects
import json

jsonStr = '{"name":"aspiring", "age": 17, "hobby": ["money","power", "read"],"parames":{"a":1,"b":2}}'
jsonData = json.loads(jsonStr)
print(jsonData)
print(type(jsonData))
print(jsonData['hobby'])
{'name': 'aspiring', 'age': 17, 'hobby': ['money', 'power', 'read'], 'parames': {'a': 1, 'b': 2}}
<class 'dict'>
['money', 'power', 'read']
# Read a JSON file
# json.load() parses JSON from a file object; here we peek at one raw line first
path1 = 'data/example.json'
open(path1, 'r', encoding='utf-8').readline()
'{ "a": "Mozilla\\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\\/535.11 (KHTML, like Gecko) Chrome\\/17.0.963.78 Safari\\/535.11", "c": "US", "nk": 1, "tz": "America\\/New_York", "gr": "MA", "g": "A6qOVH", "h": "wfLQtf", "l": "orofrog", "al": "en-US,en;q=0.8", "hh": "1.usa.gov", "r": "http:\\/\\/www.facebook.com\\/l\\/7AQEFzjSi\\/1.usa.gov\\/wfLQtf", "u": "http:\\/\\/www.ncbi.nlm.nih.gov\\/pubmed\\/22415991", "t": 1331923247, "hc": 1331822918, "cy": "Danvers", "ll": [ 42.576698, -70.954903 ] }\n'
# Example from a Python data-analysis book
import json

path2 = 'data/example.json'
records = [json.loads(line) for line in open(path2, 'r', encoding='utf-8')]
records[0]
records[0]['tz']
'America/New_York'
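Once every line has been parsed into a dict, the list of records can be handed straight to pd.DataFrame, which builds one row per record and one column per key. A minimal sketch, with two inline records standing in for the contents of example.json:

```python
import pandas as pd

# Two records standing in for lines parsed out of example.json
records = [
    {'tz': 'America/New_York', 'c': 'US', 'nk': 1},
    {'tz': 'America/Denver', 'c': 'US', 'nk': 0},
]
frame = pd.DataFrame(records)  # one row per record, one column per key
print(frame['tz'].tolist())
```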
Constructing and initializing a Series
s = pd.Series([7, 'Beijing', 3.14, -12345, 'HanXiaoyang'])
s
0 7
1 Beijing
2 3.14
3 -12345
4 HanXiaoyang
dtype: object
s.values
array([7, 'Beijing', 3.14, -12345, 'HanXiaoyang'], dtype=object)
s.index
RangeIndex(start=0, stop=5, step=1)
s[1]
'Beijing'
s
0 7
1 Beijing
2 3.14
3 -12345
4 HanXiaoyang
dtype: object
By default pandas uses 0 to n-1 as the Series index, but we can also specify our own. The index can be thought of as the keys of a dict.
s = pd.Series([7, 'Beijing', 3.14, -12345, 'HanXiaoyang'], index=['A', 'B', 'C', 'D', 'E'])
s
A 7
B Beijing
C 3.14
D -12345
E HanXiaoyang
dtype: object
s['A']
7
s[ ['A','D','B'] ]
A 7
D -12345
B Beijing
dtype: object
We can build a Series from a list and specify an index at the same time. We can also initialize a Series from a dict, since a Series is itself a key-value structure.
cities = {'Beijing':55000, 'ShangHai':60000, 'Shenzhen':50000, 'Hangzhou':30000, 'Guangzhou':40000, 'Suzhou':None}
cities
{'Beijing': 55000, 'Guangzhou': 40000, 'Hangzhou': 30000, 'ShangHai': 60000, 'Shenzhen': 50000, 'Suzhou': None}
apt = pd.Series(cities, name='income')
apt
Beijing 55000.0
Guangzhou 40000.0
Hangzhou 30000.0
ShangHai 60000.0
Shenzhen 50000.0
Suzhou NaN
Name: income, dtype: float64
# Indexing
apt['Guangzhou']
40000.0
apt[1]
40000.0
apt[1:]
Guangzhou 40000.0
Hangzhou 30000.0
ShangHai 60000.0
Shenzhen 50000.0
Suzhou NaN
Name: income, dtype: float64
apt[:-1]
Beijing 55000.0
Guangzhou 40000.0
Hangzhou 30000.0
ShangHai 60000.0
Shenzhen 50000.0
Name: income, dtype: float64
apt[[3,4,1]]
ShangHai 60000.0
Shenzhen 50000.0
Guangzhou 40000.0
Name: income, dtype: float64
apt[ ['ShangHai', 'Shenzhen', 'Guangzhou'] ]
ShangHai 60000.0
Shenzhen 50000.0
Guangzhou 40000.0
Name: income, dtype: float64
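With string labels, plain [] indexing accepts both labels and integer positions, which can be ambiguous. pandas also provides .loc (label-based) and .iloc (position-based) to make the intent explicit. A small sketch on a stand-in Series:

```python
import pandas as pd

s = pd.Series([55000, 40000, 30000], index=['Beijing', 'Guangzhou', 'Hangzhou'])
print(s.loc['Guangzhou'])  # label-based lookup
print(s.iloc[1])           # position-based lookup; same element here
```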
# Simple arithmetic
# Broadcasting
3*apt
Beijing 165000.0
Guangzhou 120000.0
Hangzhou 90000.0
ShangHai 180000.0
Shenzhen 150000.0
Suzhou NaN
Name: income, dtype: float64
apt/2.5
Beijing 22000.0
Guangzhou 16000.0
Hangzhou 12000.0
ShangHai 24000.0
Shenzhen 20000.0
Suzhou NaN
Name: income, dtype: float64
# A plain Python list does not support element-wise math
my_list = [2,4,6,8,10]
my_list/2
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-29-39aba40a404f> in <module>()
----> 1 my_list/2

TypeError: unsupported operand type(s) for /: 'list' and 'int'
apt[1:]
Guangzhou 40000.0
Hangzhou 30000.0
ShangHai 60000.0
Shenzhen 50000.0
Suzhou NaN
Name: income, dtype: float64
apt[:-1]
Beijing 55000.0
Guangzhou 40000.0
Hangzhou 30000.0
ShangHai 60000.0
Shenzhen 50000.0
Name: income, dtype: float64
# Arithmetic is aligned on the index
apt[1:] + apt[:-1]
Beijing NaN
Guangzhou 80000.0
Hangzhou 60000.0
ShangHai 120000.0
Shenzhen 100000.0
Suzhou NaN
Name: income, dtype: float64
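Because arithmetic aligns on the index, labels present on only one side come out as NaN (Beijing and Suzhou above). If you would rather treat the missing side as 0, the .add method takes a fill_value argument. A sketch on two small stand-in Series:

```python
import pandas as pd

a = pd.Series({'Beijing': 55000, 'Guangzhou': 40000})
b = pd.Series({'Guangzhou': 40000, 'Hangzhou': 30000})
print(a + b)                   # Beijing and Hangzhou become NaN
print(a.add(b, fill_value=0))  # the missing side is treated as 0 instead
```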
# Use `in` to test whether an index label exists
'Hangzhou' in apt
True
'Chongqing' in apt
False
# apt['Chongqing'] would raise a KeyError
print(apt.get('Chongqing'))
None
print(apt.get('Guangzhou'))
40000.0
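Like dict.get, Series.get also accepts a second argument that is returned when the label is missing, instead of None. A sketch on a stand-in Series:

```python
import pandas as pd

apt = pd.Series({'Beijing': 55000.0, 'Guangzhou': 40000.0})
print(apt.get('Chongqing'))     # missing label -> None instead of KeyError
print(apt.get('Chongqing', 0))  # supply a default value for missing labels
```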
Boolean indexing / conditional selection
apt>=40000
Beijing True
Guangzhou True
Hangzhou False
ShangHai True
Shenzhen True
Suzhou False
Name: income, dtype: bool
# Conditional indexing
apt[apt>=40000]
Beijing 55000.0
Guangzhou 40000.0
ShangHai 60000.0
Shenzhen 50000.0
Name: income, dtype: float64
# Summary statistics
apt.mean()
47000.0
apt.median()
50000.0
apt.max()
60000.0
apt.min()
30000.0
Assigning values to a Series
apt
Beijing 55000.0
Guangzhou 40000.0
Hangzhou 30000.0
ShangHai 60000.0
Shenzhen 50000.0
Suzhou NaN
Name: income, dtype: float64
apt['Shenzhen'] = 70000
apt
Beijing 55000.0
Guangzhou 40000.0
Hangzhou 30000.0
ShangHai 60000.0
Shenzhen 70000.0
Suzhou NaN
Name: income, dtype: float64
# Conditional assignment
apt[apt<=40000] = 45000
apt
Beijing 55000.0
Guangzhou 45000.0
Hangzhou 45000.0
ShangHai 60000.0
Shenzhen 70000.0
Suzhou NaN
Name: income, dtype: float64
type(apt)
pandas.core.series.Series
# More advanced math: NumPy ufuncs work element-wise on a Series
np.log(apt)
Beijing 10.915088
Guangzhou 10.714418
Hangzhou 10.714418
ShangHai 11.002100
Shenzhen 11.156251
Suzhou NaN
Name: income, dtype: float64
cars = pd.Series({'Beijing': 350000, 'ShangHai': 400000, 'Shenzhen': 300000,
                  'Tianjin': 200000, 'Guangzhou': 250000, 'Chongqing': 150000})
cars
Beijing 350000
Chongqing 150000
Guangzhou 250000
ShangHai 400000
Shenzhen 300000
Tianjin 200000
dtype: int64
expense = cars + 10*apt
expense
Beijing 900000.0
Chongqing NaN
Guangzhou 700000.0
Hangzhou NaN
ShangHai 1000000.0
Shenzhen 1000000.0
Suzhou NaN
Tianjin NaN
dtype: float64
Missing data
'Hangzhou' in apt
True
'Hangzhou' in cars
False
apt
Beijing 55000.0
Guangzhou 45000.0
Hangzhou 45000.0
ShangHai 60000.0
Shenzhen 70000.0
Suzhou NaN
Name: income, dtype: float64
# Returns a boolean Series
apt.notnull()
Beijing True
Guangzhou True
Hangzhou True
ShangHai True
Shenzhen True
Suzhou False
Name: income, dtype: bool
apt.isnull()
Beijing False
Guangzhou False
Hangzhou False
ShangHai False
Shenzhen False
Suzhou True
Name: income, dtype: bool
expense
Beijing 900000.0
Chongqing NaN
Guangzhou 700000.0
Hangzhou NaN
ShangHai 1000000.0
Shenzhen 1000000.0
Suzhou NaN
Tianjin NaN
dtype: float64
expense[expense.isnull()] = expense.mean()
expense
Beijing 900000.0
Chongqing 900000.0
Guangzhou 700000.0
Hangzhou 900000.0
ShangHai 1000000.0
Shenzhen 1000000.0
Suzhou 900000.0
Tianjin 900000.0
dtype: float64
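The boolean assignment above works, but the same fill is usually written with fillna, which replaces every NaN in one call. A sketch with a small stand-in Series:

```python
import pandas as pd

expense = pd.Series({'Beijing': 900000.0, 'Chongqing': None, 'Guangzhou': 700000.0})
filled = expense.fillna(expense.mean())  # mean() skips NaN, so the fill is 800000.0
print(filled)
```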
The DataFrame data structure
A DataFrame is a table: where a Series is a one-dimensional array, a DataFrame is two-dimensional. You can think of it as an Excel spreadsheet, or as a collection of Series.
data = {'City':['Beijing','ShangHai','Guangzhou','Shenzhen','Hangzhou','Chongqing'],'year':[2017,2018,2017,2018,2017,2017],'population':[2100,2300,1000,700,500,500]}
pd.DataFrame(data)
        City  population  year
0    Beijing        2100  2017
1   ShangHai        2300  2018
2  Guangzhou        1000  2017
3   Shenzhen         700  2018
4   Hangzhou         500  2017
5  Chongqing         500  2017
pd.DataFrame(data, columns=['year','City','population'])
   year       City  population
0  2017    Beijing        2100
1  2018   ShangHai        2300
2  2017  Guangzhou        1000
3  2018   Shenzhen         700
4  2017   Hangzhou         500
5  2017  Chongqing         500
# index
pd.DataFrame(data, columns=['year','City','population'], index=['one','two','three','four','five','six'])
       year       City  population
one    2017    Beijing        2100
two    2018   ShangHai        2300
three  2017  Guangzhou        1000
four   2018   Shenzhen         700
five   2017   Hangzhou         500
six    2017  Chongqing         500
# A DataFrame can be viewed as a collection of Series
apt
Beijing 55000.0
Guangzhou 45000.0
Hangzhou 45000.0
ShangHai 60000.0
Shenzhen 70000.0
Suzhou NaN
Name: income, dtype: float64
cars
Beijing 350000
Chongqing 150000
Guangzhou 250000
ShangHai 400000
Shenzhen 300000
Tianjin 200000
dtype: int64
df = pd.DataFrame({'apt':apt, 'cars':cars})
df
               apt      cars
Beijing    55000.0  350000.0
Chongqing      NaN  150000.0
Guangzhou  45000.0  250000.0
Hangzhou   45000.0       NaN
ShangHai   60000.0  400000.0
Shenzhen   70000.0  300000.0
Suzhou         NaN       NaN
Tianjin        NaN  200000.0
# Extract one column (a Series)
df['apt']
Beijing 55000.0
Chongqing NaN
Guangzhou 45000.0
Hangzhou 45000.0
ShangHai 60000.0
Shenzhen 70000.0
Suzhou NaN
Tianjin NaN
Name: apt, dtype: float64
type(df['apt'])
pandas.core.series.Series
df[['apt']]
               apt
Beijing    55000.0
Chongqing      NaN
Guangzhou  45000.0
Hangzhou   45000.0
ShangHai   60000.0
Shenzhen   70000.0
Suzhou         NaN
Tianjin        NaN
type(df[['apt']])
pandas.core.frame.DataFrame
# Assignment
df
               apt      cars
Beijing    55000.0  350000.0
Chongqing      NaN  150000.0
Guangzhou  45000.0  250000.0
Hangzhou   45000.0       NaN
ShangHai   60000.0  400000.0
Shenzhen   70000.0  300000.0
Suzhou         NaN       NaN
Tianjin        NaN  200000.0
df['bonus'] = 40000
df
               apt      cars  bonus
Beijing    55000.0  350000.0  40000
Chongqing      NaN  150000.0  40000
Guangzhou  45000.0  250000.0  40000
Hangzhou   45000.0       NaN  40000
ShangHai   60000.0  400000.0  40000
Shenzhen   70000.0  300000.0  40000
Suzhou         NaN       NaN  40000
Tianjin        NaN  200000.0  40000
# Compute with two columns
df['expense'] = df['apt'] + df['bonus']
df
               apt      cars  bonus   expense
Beijing    55000.0  350000.0  40000   95000.0
Chongqing      NaN  150000.0  40000       NaN
Guangzhou  45000.0  250000.0  40000   85000.0
Hangzhou   45000.0       NaN  40000   85000.0
ShangHai   60000.0  400000.0  40000  100000.0
Shenzhen   70000.0  300000.0  40000  110000.0
Suzhou         NaN       NaN  40000       NaN
Tianjin        NaN  200000.0  40000       NaN
df.index
Index(['Beijing', 'Chongqing', 'Guangzhou', 'Hangzhou', 'ShangHai', 'Shenzhen',
       'Suzhou', 'Tianjin'],
      dtype='object')
# Use index labels to retrieve the rows (or columns) you want
df.loc['Beijing']
apt 55000.0
cars 350000.0
bonus 40000.0
expense 95000.0
Name: Beijing, dtype: float64
type(df.loc['Beijing'])
pandas.core.series.Series
df.loc[['Beijing', 'ShangHai', 'Guangzhou']]
               apt      cars  bonus   expense
Beijing    55000.0  350000.0  40000   95000.0
ShangHai   60000.0  400000.0  40000  100000.0
Guangzhou  45000.0  250000.0  40000   85000.0
df
               apt      cars  bonus   expense
Beijing    55000.0  350000.0  40000   95000.0
Chongqing      NaN  150000.0  40000       NaN
Guangzhou  45000.0  250000.0  40000   85000.0
Hangzhou   45000.0       NaN  40000   85000.0
ShangHai   60000.0  400000.0  40000  100000.0
Shenzhen   70000.0  300000.0  40000  110000.0
Suzhou         NaN       NaN  40000       NaN
Tianjin        NaN  200000.0  40000       NaN
# The loc accessor
# Use index labels to retrieve the rows (or columns) you want
df.loc['Beijing':'Suzhou', ['apt','bonus']]
               apt  bonus
Beijing    55000.0  40000
Chongqing      NaN  40000
Guangzhou  45000.0  40000
Hangzhou   45000.0  40000
ShangHai   60000.0  40000
Shenzhen   70000.0  40000
Suzhou         NaN  40000
# Slice-like usage; note that loc slices include both endpoints
df.loc['Beijing':'Suzhou', 'apt':'bonus']
               apt      cars  bonus
Beijing    55000.0  350000.0  40000
Chongqing      NaN  150000.0  40000
Guangzhou  45000.0  250000.0  40000
Hangzhou   45000.0       NaN  40000
ShangHai   60000.0  400000.0  40000
Shenzhen   70000.0  300000.0  40000
Suzhou         NaN       NaN  40000
# Passing lists of labels
df.loc[['Beijing','Suzhou'], ['apt','bonus']]
             apt  bonus
Beijing  55000.0  40000
Suzhou       NaN  40000
df
               apt      cars  bonus   expense
Beijing    55000.0  350000.0  40000   95000.0
Chongqing      NaN  150000.0  40000       NaN
Guangzhou  45000.0  250000.0  40000   85000.0
Hangzhou   45000.0       NaN  40000   85000.0
ShangHai   60000.0  400000.0  40000  100000.0
Shenzhen   70000.0  300000.0  40000  110000.0
Suzhou         NaN       NaN  40000       NaN
Tianjin        NaN  200000.0  40000       NaN
df.loc['Beijing','bonus'] = 50000
df
               apt      cars  bonus   expense
Beijing    55000.0  350000.0  50000   95000.0
Chongqing      NaN  150000.0  40000       NaN
Guangzhou  45000.0  250000.0  40000   85000.0
Hangzhou   45000.0       NaN  40000   85000.0
ShangHai   60000.0  400000.0  40000  100000.0
Shenzhen   70000.0  300000.0  40000  110000.0
Suzhou         NaN       NaN  40000       NaN
Tianjin        NaN  200000.0  40000       NaN
# Assign to an entire column
df.loc[:,'expense'] = 100000
df
               apt      cars  bonus  expense
Beijing    55000.0  350000.0  50000   100000
Chongqing      NaN  150000.0  40000   100000
Guangzhou  45000.0  250000.0  40000   100000
Hangzhou   45000.0       NaN  40000   100000
ShangHai   60000.0  400000.0  40000   100000
Shenzhen   70000.0  300000.0  40000   100000
Suzhou         NaN       NaN  40000   100000
Tianjin        NaN  200000.0  40000   100000
# shape returns a tuple with the DataFrame's dimensions
df.shape
(8, 4)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 8 entries, Beijing to Tianjin
Data columns (total 4 columns):
apt        5 non-null float64
cars       6 non-null float64
bonus      8 non-null int64
expense    8 non-null int64
dtypes: float64(2), int64(2)
memory usage: 320.0+ bytes
df.T
          Beijing  Chongqing  Guangzhou  Hangzhou  ShangHai  Shenzhen    Suzhou   Tianjin
apt       55000.0        NaN    45000.0   45000.0   60000.0   70000.0       NaN       NaN
cars     350000.0   150000.0   250000.0       NaN  400000.0  300000.0       NaN  200000.0
bonus     50000.0    40000.0    40000.0   40000.0   40000.0   40000.0   40000.0   40000.0
expense  100000.0   100000.0   100000.0  100000.0  100000.0  100000.0  100000.0  100000.0
df
               apt      cars  bonus  expense
Beijing    55000.0  350000.0  50000   100000
Chongqing      NaN  150000.0  40000   100000
Guangzhou  45000.0  250000.0  40000   100000
Hangzhou   45000.0       NaN  40000   100000
ShangHai   60000.0  400000.0  40000   100000
Shenzhen   70000.0  300000.0  40000   100000
Suzhou         NaN       NaN  40000   100000
Tianjin        NaN  200000.0  40000   100000
df.describe()
                apt           cars         bonus   expense
count      5.000000       6.000000      8.000000       8.0
mean   55000.000000  275000.000000  41250.000000  100000.0
std    10606.601718   93541.434669   3535.533906       0.0
min    45000.000000  150000.000000  40000.000000  100000.0
25%    45000.000000  212500.000000  40000.000000  100000.0
50%    55000.000000  275000.000000  40000.000000  100000.0
75%    60000.000000  337500.000000  40000.000000  100000.0
max    70000.000000  400000.000000  50000.000000  100000.0
df['cars']
Beijing 350000.0
Chongqing 150000.0
Guangzhou 250000.0
Hangzhou NaN
ShangHai 400000.0
Shenzhen 300000.0
Suzhou NaN
Tianjin 200000.0
Name: cars, dtype: float64
df['cars'] < 310000
Beijing False
Chongqing True
Guangzhou True
Hangzhou False
ShangHai False
Shenzhen True
Suzhou False
Tianjin True
Name: cars, dtype: bool
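A boolean Series like the one above can be passed straight into df[...] to keep only the rows where it is True. A sketch on a small stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({'cars': [350000.0, 150000.0, 250000.0]},
                  index=['Beijing', 'Chongqing', 'Guangzhou'])
cheap = df[df['cars'] < 310000]  # keep rows whose mask value is True
print(cheap)
```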
df.loc[:,'color'] = ['红','黄','紫','蓝','红','绿','棕','橙']
df
               apt      cars  bonus  expense color
Beijing    55000.0  350000.0  50000   100000     红
Chongqing      NaN  150000.0  40000   100000     黄
Guangzhou  45000.0  250000.0  40000   100000     紫
Hangzhou   45000.0       NaN  40000   100000     蓝
ShangHai   60000.0  400000.0  40000   100000     红
Shenzhen   70000.0  300000.0  40000   100000     绿
Suzhou         NaN       NaN  40000   100000     棕
Tianjin        NaN  200000.0  40000   100000     橙
df['color'].isin(['红','绿'])
Beijing True
Chongqing False
Guangzhou False
Hangzhou False
ShangHai True
Shenzhen True
Suzhou False
Tianjin False
Name: color, dtype: bool
df
               apt      cars  bonus  expense color
Beijing    55000.0  350000.0  50000   100000     红
Chongqing      NaN  150000.0  40000   100000     黄
Guangzhou  45000.0  250000.0  40000   100000     紫
Hangzhou   45000.0       NaN  40000   100000     蓝
ShangHai   60000.0  400000.0  40000   100000     红
Shenzhen   70000.0  300000.0  40000   100000     绿
Suzhou         NaN       NaN  40000   100000     棕
Tianjin        NaN  200000.0  40000   100000     橙
# Fill missing values
df.fillna(value=50000)
#df.fillna(value=50000, inplace=True)
               apt      cars  bonus  expense color
Beijing    55000.0  350000.0  50000   100000     红
Chongqing  50000.0  150000.0  40000   100000     黄
Guangzhou  45000.0  250000.0  40000   100000     紫
Hangzhou   45000.0   50000.0  40000   100000     蓝
ShangHai   60000.0  400000.0  40000   100000     红
Shenzhen   70000.0  300000.0  40000   100000     绿
Suzhou     50000.0   50000.0  40000   100000     棕
Tianjin    50000.0  200000.0  40000   100000     橙
df
               apt      cars  bonus  expense color
Beijing    55000.0  350000.0  50000   100000     红
Chongqing      NaN  150000.0  40000   100000     黄
Guangzhou  45000.0  250000.0  40000   100000     紫
Hangzhou   45000.0       NaN  40000   100000     蓝
ShangHai   60000.0  400000.0  40000   100000     红
Shenzhen   70000.0  300000.0  40000   100000     绿
Suzhou         NaN       NaN  40000   100000     棕
Tianjin        NaN  200000.0  40000   100000     橙
# Forward fill: propagate the last valid value downward
df.fillna(method='ffill')
               apt      cars  bonus  expense color
Beijing    55000.0  350000.0  50000   100000     红
Chongqing  55000.0  150000.0  40000   100000     黄
Guangzhou  45000.0  250000.0  40000   100000     紫
Hangzhou   45000.0  250000.0  40000   100000     蓝
ShangHai   60000.0  400000.0  40000   100000     红
Shenzhen   70000.0  300000.0  40000   100000     绿
Suzhou     70000.0  300000.0  40000   100000     棕
Tianjin    70000.0  200000.0  40000   100000     橙
# Backward fill: propagate the next valid value upward
df.fillna(method='bfill')
               apt      cars  bonus  expense color
Beijing    55000.0  350000.0  50000   100000     红
Chongqing  45000.0  150000.0  40000   100000     黄
Guangzhou  45000.0  250000.0  40000   100000     紫
Hangzhou   45000.0  400000.0  40000   100000     蓝
ShangHai   60000.0  400000.0  40000   100000     红
Shenzhen   70000.0  300000.0  40000   100000     绿
Suzhou         NaN  200000.0  40000   100000     棕
Tianjin        NaN  200000.0  40000   100000     橙
!head -10 data/GOOG.csv
'head' is not recognized as an internal or external command,
operable program or batch file.
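head is a Unix tool, so it is not available in a default Windows shell. The same peek at the first lines of a file can be done in pure Python. A sketch that writes a throwaway tiny.csv so it is self-contained; the notebook would open data/GOOG.csv instead:

```python
from itertools import islice

# Stand-in file; the notebook would use 'data/GOOG.csv' here
path = 'tiny.csv'
with open(path, 'w', encoding='utf-8') as f:
    f.write('Date,Open\n2004-08-19,49.81\n2004-08-20,50.32\n')

with open(path, encoding='utf-8') as f:
    for line in islice(f, 2):  # first 2 lines, like `head -2`
        print(line.rstrip('\n'))
```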
goog = pd.read_csv('data/GOOG.csv', index_col=0, parse_dates=['Date'])
# First 5 rows
goog.head(5)
                 Open       High        Low      Close  Adj Close    Volume
Date
2004-08-19  49.813286  51.835709  47.800831  49.982655  49.982655  44871300
2004-08-20  50.316402  54.336334  50.062355  53.952770  53.952770  22942800
2004-08-23  55.168217  56.528118  54.321388  54.495735  54.495735  18342800
2004-08-24  55.412300  55.591629  51.591621  52.239193  52.239193  15319700
2004-08-25  52.284027  53.798351  51.746044  52.802086  52.802086   9232100
goog
                  Open        High         Low       Close   Adj Close    Volume
Date
2004-08-19   49.813286   51.835709   47.800831   49.982655   49.982655  44871300
2004-08-20   50.316402   54.336334   50.062355   53.952770   53.952770  22942800
2004-08-23   55.168217   56.528118   54.321388   54.495735   54.495735  18342800
2004-08-24   55.412300   55.591629   51.591621   52.239193   52.239193  15319700
2004-08-25   52.284027   53.798351   51.746044   52.802086   52.802086   9232100
2004-08-26   52.279045   53.773445   52.134586   53.753517   53.753517   7128600
2004-08-27   53.848164   54.107193   52.647663   52.876804   52.876804   6241200
2004-08-30   52.443428   52.548038   50.814533   50.814533   50.814533   5221400
2004-08-31   50.958992   51.661362   50.889256   50.993862   50.993862   4941200
2004-09-01   51.158245   51.292744   49.648903   49.937820   49.937820   9181600
2004-09-02   49.409801   50.993862   49.285267   50.565468   50.565468  15190400
2004-09-03   50.286514   50.680038   49.474556   49.818268   49.818268   5176800
2004-09-07   50.316402   50.809555   49.619015   50.600338   50.600338   5875200
2004-09-08   50.181908   51.322632   50.062355   50.958992   50.958992   5009200
2004-09-09   51.073563   51.163227   50.311420   50.963974   50.963974   4080900
2004-09-10   50.610302   53.081039   50.460861   52.468334   52.468334   8740200
2004-09-13   53.115910   54.002586   53.031227   53.549286   53.549286   7881300
2004-09-14   53.524376   55.790882   53.195610   55.536835   55.536835  10880300
2004-09-15   55.073570   56.901718   54.894241   55.790882   55.790882  10763900
2004-09-16   55.960247   57.683788   55.616535   56.772205   56.772205   9310200
2004-09-17   56.996365   58.525631   56.562988   58.525631   58.525631   9517400
2004-09-20   58.256641   60.572956   58.166977   59.457142   59.457142  10679200
2004-09-21   59.681301   59.985161   58.535595   58.699978   58.699978   7263000
2004-09-22   58.480801   59.611561   58.186901   58.968971   58.968971   7617100
2004-09-23   59.198112   61.086033   58.291508   60.184414   60.184414   8576100
2004-09-24   60.244190   61.818291   59.656395   59.691261   59.691261   9166700
2004-09-27   59.556767   60.214302   58.680054   58.909195   58.909195   7099600
2004-09-28   60.423519   63.462128   59.880554   63.193138   63.193138  17009400
2004-09-29   63.113434   67.257904   62.879314   65.295258   65.295258  30661400
2004-09-30   64.707458   65.902977   64.259140   64.558022   64.558022  13823300
...                ...         ...         ...         ...         ...       ...
2017-06-08  982.349976  984.570007  977.200012  983.409973  983.409973   1481900
2017-06-09  984.500000  984.500000  935.630005  949.830017  949.830017   3309400
2017-06-12  939.559998  949.354980  915.232971  942.900024  942.900024   3763500
2017-06-13  951.909973  959.979980  944.090027  953.400024  953.400024   2013300
2017-06-14  959.919983  961.150024  942.250000  950.760010  950.760010   1489700
2017-06-15  933.969971  943.338989  924.440002  942.309998  942.309998   2133100
2017-06-16  940.000000  942.039978  931.594971  939.780029  939.780029   3094700
2017-06-19  949.960022  959.989990  949.049988  957.369995  957.369995   1533300
2017-06-20  957.520020  961.619995  950.010010  950.630005  950.630005   1126000
2017-06-21  953.640015  960.099976  950.760010  959.450012  959.450012   1202200
2017-06-22  958.700012  960.719971  954.549988  957.090027  957.090027    941400
2017-06-23  956.830017  966.000000  954.200012  965.590027  965.590027   1527900
2017-06-26  969.900024  973.309998  950.789978  952.270020  952.270020   1598400
2017-06-27  942.460022  948.289978  926.849976  927.330017  927.330017   2579900
2017-06-28  929.000000  942.750000  916.000000  940.489990  940.489990   2721400
2017-06-29  929.919983  931.260010  910.619995  917.789978  917.789978   3299200
2017-06-30  926.049988  926.049988  908.309998  908.729980  908.729980   2065500
2017-07-03  912.179993  913.940002  894.789978  898.700012  898.700012   1709800
2017-07-05  901.760010  914.510010  898.500000  911.710022  911.710022   1813900
2017-07-06  904.119995  914.943970  899.700012  906.690002  906.690002   1424500
2017-07-07  908.849976  921.539978  908.849976  918.590027  918.590027   1637800
2017-07-10  921.770020  930.380005  919.590027  928.799988  928.799988   1192800
2017-07-11  929.539978  931.429993  922.000000  930.090027  930.090027   1113200
2017-07-12  938.679993  946.299988  934.469971  943.830017  943.830017   1532100
2017-07-13  946.289978  954.450012  943.010010  947.159973  947.159973   1294700
2017-07-14  952.000000  956.909973  948.005005  955.989990  955.989990   1053800
2017-07-17  957.000000  960.739990  949.241028  953.419983  953.419983   1165500
2017-07-18  953.000000  968.039978  950.599976  965.400024  965.400024   1154000
2017-07-19  967.840027  973.039978  964.030029  970.890015  970.890015   1224500
2017-07-20  975.000000  975.900024  961.510010  968.150024  968.150024   1616500

[3253 rows x 6 columns]
goog.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3253 entries, 2004-08-19 to 2017-07-20
Data columns (total 6 columns):
Open 3253 non-null float64
High 3253 non-null float64
Low 3253 non-null float64
Close 3253 non-null float64
Adj Close 3253 non-null float64
Volume 3253 non-null int64
dtypes: float64(5), int64(1)
memory usage: 177.9 KB
goog.describe()
              Open         High          Low        Close    Adj Close        Volume
count  3253.000000  3253.000000  3253.000000  3253.000000  3253.000000  3.253000e+03
mean    370.588678   373.854568   366.959060   370.463274   370.463274  8.139070e+06
std     212.537536   213.645163   211.213609   212.542226   212.542226  8.403870e+06
min      49.409801    50.680038    47.800831    49.818268    49.818268  7.900000e+03
25%     225.928162   228.050217   222.984207   224.986694   224.986694  2.743600e+06
50%     292.030396   293.898407   288.538483   291.318054   291.318054  5.374600e+06
75%     531.599976   535.729126   527.810913   532.299988   532.299988  1.081150e+07
max     984.500000   988.250000   977.200012   983.679993   983.679993  8.254150e+07
goog.tail()
                  Open        High         Low       Close   Adj Close   Volume
Date
2017-07-14  952.000000  956.909973  948.005005  955.989990  955.989990  1053800
2017-07-17  957.000000  960.739990  949.241028  953.419983  953.419983  1165500
2017-07-18  953.000000  968.039978  950.599976  965.400024  965.400024  1154000
2017-07-19  967.840027  973.039978  964.030029  970.890015  970.890015  1224500
2017-07-20  975.000000  975.900024  961.510010  968.150024  968.150024  1616500
goog.index
DatetimeIndex(['2004-08-19', '2004-08-20', '2004-08-23', '2004-08-24',
               '2004-08-25', '2004-08-26', '2004-08-27', '2004-08-30',
               '2004-08-31', '2004-09-01',
               ...
               '2017-07-07', '2017-07-10', '2017-07-11', '2017-07-12',
               '2017-07-13', '2017-07-14', '2017-07-17', '2017-07-18',
               '2017-07-19', '2017-07-20'],
              dtype='datetime64[ns]', name='Date', length=3253, freq=None)
# Day of the week for each date (Monday=0, Sunday=6)
goog.loc[:,'dow'] = goog.index.dayofweek
# Day of the year for each date
# goog.loc[:,'doy'] = goog.index.dayofyear
goog
                  Open        High         Low       Close   Adj Close    Volume  dow
Date
2004-08-19   49.813286   51.835709   47.800831   49.982655   49.982655  44871300    3
2004-08-20   50.316402   54.336334   50.062355   53.952770   53.952770  22942800    4
2004-08-23   55.168217   56.528118   54.321388   54.495735   54.495735  18342800    0
2004-08-24   55.412300   55.591629   51.591621   52.239193   52.239193  15319700    1
2004-08-25   52.284027   53.798351   51.746044   52.802086   52.802086   9232100    2
2004-08-26   52.279045   53.773445   52.134586   53.753517   53.753517   7128600    3
2004-08-27   53.848164   54.107193   52.647663   52.876804   52.876804   6241200    4
2004-08-30   52.443428   52.548038   50.814533   50.814533   50.814533   5221400    0
2004-08-31   50.958992   51.661362   50.889256   50.993862   50.993862   4941200    1
2004-09-01   51.158245   51.292744   49.648903   49.937820   49.937820   9181600    2
2004-09-02   49.409801   50.993862   49.285267   50.565468   50.565468  15190400    3
2004-09-03   50.286514   50.680038   49.474556   49.818268   49.818268   5176800    4
2004-09-07   50.316402   50.809555   49.619015   50.600338   50.600338   5875200    1
2004-09-08   50.181908   51.322632   50.062355   50.958992   50.958992   5009200    2
2004-09-09   51.073563   51.163227   50.311420   50.963974   50.963974   4080900    3
2004-09-10   50.610302   53.081039   50.460861   52.468334   52.468334   8740200    4
2004-09-13   53.115910   54.002586   53.031227   53.549286   53.549286   7881300    0
2004-09-14   53.524376   55.790882   53.195610   55.536835   55.536835  10880300    1
2004-09-15   55.073570   56.901718   54.894241   55.790882   55.790882  10763900    2
2004-09-16   55.960247   57.683788   55.616535   56.772205   56.772205   9310200    3
2004-09-17   56.996365   58.525631   56.562988   58.525631   58.525631   9517400    4
2004-09-20   58.256641   60.572956   58.166977   59.457142   59.457142  10679200    0
2004-09-21   59.681301   59.985161   58.535595   58.699978   58.699978   7263000    1
2004-09-22   58.480801   59.611561   58.186901   58.968971   58.968971   7617100    2
2004-09-23   59.198112   61.086033   58.291508   60.184414   60.184414   8576100    3
2004-09-24   60.244190   61.818291   59.656395   59.691261   59.691261   9166700    4
2004-09-27   59.556767   60.214302   58.680054   58.909195   58.909195   7099600    0
2004-09-28   60.423519   63.462128   59.880554   63.193138   63.193138  17009400    1
2004-09-29   63.113434   67.257904   62.879314   65.295258   65.295258  30661400    2
2004-09-30   64.707458   65.902977   64.259140   64.558022   64.558022  13823300    3
...                ...         ...         ...         ...         ...       ...  ...
2017-06-08  982.349976  984.570007  977.200012  983.409973  983.409973   1481900    3
2017-06-09  984.500000  984.500000  935.630005  949.830017  949.830017   3309400    4
2017-06-12  939.559998  949.354980  915.232971  942.900024  942.900024   3763500    0
2017-06-13  951.909973  959.979980  944.090027  953.400024  953.400024   2013300    1
2017-06-14  959.919983  961.150024  942.250000  950.760010  950.760010   1489700    2
2017-06-15  933.969971  943.338989  924.440002  942.309998  942.309998   2133100    3
2017-06-16  940.000000  942.039978  931.594971  939.780029  939.780029   3094700    4
2017-06-19  949.960022  959.989990  949.049988  957.369995  957.369995   1533300    0
2017-06-20  957.520020  961.619995  950.010010  950.630005  950.630005   1126000    1
2017-06-21  953.640015  960.099976  950.760010  959.450012  959.450012   1202200    2
2017-06-22  958.700012  960.719971  954.549988  957.090027  957.090027    941400    3
2017-06-23  956.830017  966.000000  954.200012  965.590027  965.590027   1527900    4
2017-06-26  969.900024  973.309998  950.789978  952.270020  952.270020   1598400    0
2017-06-27  942.460022  948.289978  926.849976  927.330017  927.330017   2579900    1
2017-06-28  929.000000  942.750000  916.000000  940.489990  940.489990   2721400    2
2017-06-29  929.919983  931.260010  910.619995  917.789978  917.789978   3299200    3
2017-06-30  926.049988  926.049988  908.309998  908.729980  908.729980   2065500    4
2017-07-03  912.179993  913.940002  894.789978  898.700012  898.700012   1709800    0
2017-07-05  901.760010  914.510010  898.500000  911.710022  911.710022   1813900    2
2017-07-06  904.119995  914.943970  899.700012  906.690002  906.690002   1424500    3
2017-07-07  908.849976  921.539978  908.849976  918.590027  918.590027   1637800    4
2017-07-10  921.770020  930.380005  919.590027  928.799988  928.799988   1192800    0
2017-07-11  929.539978  931.429993  922.000000  930.090027  930.090027   1113200    1
2017-07-12  938.679993  946.299988  934.469971  943.830017  943.830017   1532100    2
2017-07-13  946.289978  954.450012  943.010010  947.159973  947.159973   1294700    3
2017-07-14  952.000000  956.909973  948.005005  955.989990  955.989990   1053800    4
2017-07-17  957.000000  960.739990  949.241028  953.419983  953.419983   1165500    0
2017-07-18  953.000000  968.039978  950.599976  965.400024  965.400024   1154000    1
2017-07-19  967.840027  973.039978  964.030029  970.890015  970.890015   1224500    2
2017-07-20  975.000000  975.900024  961.510010  968.150024  968.150024   1616500    3

[3253 rows x 7 columns]
goog.loc[:,'doy'] = goog.index.dayofyear
goog.head()
                 Open       High        Low      Close  Adj Close    Volume  dow  doy
Date
2004-08-19  49.813286  51.835709  47.800831  49.982655  49.982655  44871300    3  232
2004-08-20  50.316402  54.336334  50.062355  53.952770  53.952770  22942800    4  233
2004-08-23  55.168217  56.528118  54.321388  54.495735  54.495735  18342800    0  236
2004-08-24  55.412300  55.591629  51.591621  52.239193  52.239193  15319700    1  237
2004-08-25  52.284027  53.798351  51.746044  52.802086  52.802086   9232100    2  238
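Once a dow column exists, a natural next step is to aggregate by it, for example the average Open per weekday via groupby. A sketch on a small synthetic frame (the notebook would run the same groupby on goog):

```python
import pandas as pd

idx = pd.date_range('2004-08-19', periods=6, freq='D')  # Thu 08-19 .. Tue 08-24
df = pd.DataFrame({'Open': [49.8, 50.3, 51.0, 52.0, 55.2, 55.4]}, index=idx)
df['dow'] = df.index.dayofweek                # Monday=0 ... Sunday=6
mean_open = df.groupby('dow')['Open'].mean()  # average Open per weekday
print(mean_open)
```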
%matplotlib inline
goog['Open'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x244022eb908>
nvda = pd.read_csv('data/NVDA.csv', index_col=0, parse_dates=['Date'])
nvda.head(10)
| Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 1999-01-22 | 1.750000 | 1.953125 | 1.552083 | 1.640625 | 1.523430 | 67867200 |
| 1999-01-25 | 1.770833 | 1.833333 | 1.640625 | 1.812500 | 1.683028 | 12762000 |
| 1999-01-26 | 1.833333 | 1.869792 | 1.645833 | 1.671875 | 1.552448 | 8580000 |
| 1999-01-27 | 1.677083 | 1.718750 | 1.583333 | 1.666667 | 1.547611 | 6109200 |
| 1999-01-28 | 1.666667 | 1.677083 | 1.651042 | 1.661458 | 1.542776 | 5688000 |
| 1999-01-29 | 1.661458 | 1.666667 | 1.583333 | 1.583333 | 1.470231 | 6100800 |
| 1999-02-01 | 1.583333 | 1.625000 | 1.583333 | 1.614583 | 1.499249 | 3867600 |
| 1999-02-02 | 1.583333 | 1.625000 | 1.442708 | 1.489583 | 1.383178 | 6602400 |
| 1999-02-03 | 1.468750 | 1.541667 | 1.458333 | 1.520833 | 1.412196 | 1878000 |
| 1999-02-04 | 1.541667 | 1.645833 | 1.520833 | 1.604167 | 1.489577 | 4548000 |
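The `read_csv` call above combines `index_col` and `parse_dates`. A sketch of what each parameter does, using a tiny inline CSV (via `StringIO`, so no file on disk is needed; the two rows are copied from the head above):

```python
import pandas as pd
from io import StringIO

# A tiny inline CSV standing in for data/NVDA.csv
csv_text = """Date,Open,Close
1999-01-22,1.75,1.640625
1999-01-25,1.770833,1.8125
"""

# index_col=0 turns the Date column into the row index;
# parse_dates=['Date'] converts the strings into real datetimes
df = pd.read_csv(StringIO(csv_text), index_col=0, parse_dates=['Date'])
print(df.index.dtype)
```

Without `parse_dates` the index would stay as plain strings, and the date comparisons used below (e.g. `nvda.index > '2016-01-01'`) would not work as datetime comparisons.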
nvda.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4654 entries, 1999-01-22 to 2017-07-20
Data columns (total 6 columns):
Open 4654 non-null float64
High 4654 non-null float64
Low 4654 non-null float64
Close 4654 non-null float64
Adj Close 4654 non-null float64
Volume 4654 non-null int64
dtypes: float64(5), int64(1)
memory usage: 254.5 KB
nvda.describe()
| | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| count | 4654.000000 | 4654.000000 | 4654.000000 | 4654.000000 | 4654.000000 | 4.654000e+03 |
| mean | 18.872888 | 19.222090 | 18.513574 | 18.879564 | 18.091126 | 1.632563e+07 |
| std | 22.025278 | 22.346668 | 21.662627 | 22.048935 | 22.093697 | 1.204002e+07 |
| min | 1.395833 | 1.421875 | 1.333333 | 1.364583 | 1.267107 | 4.920000e+05 |
| 25% | 8.510000 | 8.755000 | 8.245261 | 8.505000 | 7.897462 | 8.721475e+06 |
| 50% | 13.810000 | 14.090000 | 13.500000 | 13.814167 | 12.832797 | 1.373830e+07 |
| 75% | 19.770000 | 20.129999 | 19.505000 | 19.789167 | 18.774976 | 2.041408e+07 |
| max | 166.330002 | 168.500000 | 164.610001 | 167.500000 | 167.500000 | 2.307714e+08 |
%matplotlib inline
nvda['Open'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x244025689e8>
nvda['Open'].plot(grid=True)  # line plot with grid lines
# nvda['Open'].plot(kind='bar')  # uncomment for a bar chart instead
<matplotlib.axes._subplots.AxesSubplot at 0x24402568908>
nvda.index > '2016-01-01'
array([False, False, False, ..., True, True, True])
nvda.index < '2016-04-01'
array([ True, True, True, ..., False, False, False])
# Combining boolean conditions:
# | means OR
# & means AND
# ~ means NOT (not ! as in plain Python)
(nvda.index > '2016-01-01') & (nvda.index < '2016-02-01')
array([False, False, False, ..., False, False, False])
nvda[(nvda.index > '2016-01-01') & (nvda.index < '2016-02-01')]['Open'].plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x24401f3cd68>
nvda[(nvda.index > '2016-01-01') & (nvda.index < '2016-02-01')]['Open'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x244026c24a8>
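The mask-and-select pattern above can be sketched on a toy Series (the dates and values here are made up for illustration):

```python
import pandas as pd

# Toy Series with a DatetimeIndex (illustrative values)
s = pd.Series([1, 2, 3, 4],
              index=pd.to_datetime(['2016-01-15', '2016-02-15',
                                    '2016-03-15', '2016-04-15']))

in_q1 = (s.index >= '2016-01-01') & (s.index < '2016-04-01')  # AND of two masks
print(s[in_q1])    # only the Jan-Mar entries
print(s[~in_q1])   # ~ negates the mask: everything else
```

Note the parentheses around each comparison: `&` binds tighter than `>=`, so omitting them raises an error.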
nvda.head()
| Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 1999-01-22 | 1.750000 | 1.953125 | 1.552083 | 1.640625 | 1.523430 | 67867200 |
| 1999-01-25 | 1.770833 | 1.833333 | 1.640625 | 1.812500 | 1.683028 | 12762000 |
| 1999-01-26 | 1.833333 | 1.869792 | 1.645833 | 1.671875 | 1.552448 | 8580000 |
| 1999-01-27 | 1.677083 | 1.718750 | 1.583333 | 1.666667 | 1.547611 | 6109200 |
| 1999-01-28 | 1.666667 | 1.677083 | 1.651042 | 1.661458 | 1.542776 | 5688000 |
nvda['Open'].mean()
18.87288806854321
nvda[['Open','High']].mean()
Open 18.872888
High 19.222090
dtype: float64
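As the two calls above show, `.mean()` on a single column returns a scalar, while on a list of columns it returns a per-column Series. A minimal sketch with made-up numbers, also showing `.agg()` for several statistics at once:

```python
import pandas as pd

df = pd.DataFrame({'Open': [1.0, 2.0, 3.0],
                   'High': [2.0, 3.0, 4.0]})

print(df['Open'].mean())            # scalar mean of one column
print(df[['Open', 'High']].mean())  # Series of per-column means
print(df[['Open', 'High']].agg(['mean', 'max']))  # DataFrame of several stats
```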
nvda[(nvda.index >= '2016-01-01') & (nvda.index <= '2016-06-30')].describe()
| | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| count | 125.000000 | 125.000000 | 125.000000 | 125.000000 | 125.000000 | 1.250000e+02 |
| mean | 35.931360 | 36.416000 | 35.498880 | 36.006720 | 35.705421 | 9.802855e+06 |
| std | 6.722128 | 6.771410 | 6.768863 | 6.798316 | 6.816078 | 5.527803e+06 |
| min | 24.780001 | 25.559999 | 24.750000 | 25.219999 | 24.922132 | 4.382600e+06 |
| 25% | 31.270000 | 31.870001 | 30.820000 | 31.520000 | 31.147726 | 6.919400e+06 |
| 50% | 35.299999 | 35.570000 | 34.840000 | 35.389999 | 35.099426 | 8.707300e+06 |
| 75% | 42.000000 | 42.799999 | 41.459999 | 42.279999 | 41.932854 | 1.122720e+07 |
| max | 47.759998 | 48.540001 | 47.650002 | 48.490002 | 48.216755 | 5.275640e+07 |
df
| | apt | cars | bonus | expense | color |
|---|---|---|---|---|---|
| Beijing | 55000.0 | 350000.0 | 50000 | 100000 | 红 |
| Chongqing | NaN | 150000.0 | 40000 | 100000 | 黄 |
| Guangzhou | 45000.0 | 250000.0 | 40000 | 100000 | 紫 |
| Hangzhou | 45000.0 | NaN | 40000 | 100000 | 蓝 |
| ShangHai | 60000.0 | 400000.0 | 40000 | 100000 | 红 |
| Shenzhen | 70000.0 | 300000.0 | 40000 | 100000 | 绿 |
| Suzhou | NaN | NaN | 40000 | 100000 | 棕 |
| Tianjin | NaN | 200000.0 | 40000 | 100000 | 橙 |
df.to_csv('data/my_df.csv')
!head -10 data/my_df.csv
'head' is not recognized as an internal or external command, operable program or batch file.
df.to_csv('data/my_df.csv', index=False)
!head -10 data/my_df.csv
'head' is not recognized as an internal or external command, operable program or batch file.
(`head` is a Unix command and is not available in the Windows shell; on Windows, use `!type data\my_df.csv` to inspect the file instead.)
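A round trip through `to_csv`/`read_csv` shows exactly what `index=False` changes. This sketch keeps the CSV in memory with `StringIO` instead of writing a file, using two made-up rows in the style of the `df` above:

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({'apt': [55000.0, 45000.0]},
                  index=['Beijing', 'Guangzhou'])

# Default: the index is written as the first, unnamed column
csv_with_index = df.to_csv()
df2 = pd.read_csv(StringIO(csv_with_index), index_col=0)
print(df2)

# index=False drops the row labels entirely
print(df.to_csv(index=False))
```

With `index=False` the city labels are lost, so only use it when the index carries no information (e.g. the default 0..n RangeIndex).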
help(pd.read_csv)
Help on function read_csv in module pandas.io.parsers:

read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=None, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=None, compact_ints=None, use_unsigned=None, low_memory=True, buffer_lines=None, memory_map=False, float_precision=None)
    Read CSV (comma-separated) file into DataFrame.

    Also supports optionally iterating or breaking of the file into chunks.
    Additional help can be found in the online docs for IO Tools:
    http://pandas.pydata.org/pandas-docs/stable/io.html

    (The full parameter documentation is omitted here; the parameters used in
    this tutorial are sep, header, index_col, and parse_dates.)