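Python Data Analysis: Introduction and Practice (Notes)
Chapter 1 Setting Up the Lab Environment
This chapter introduces Anaconda and Jupyter Notebook: how to install Anaconda on Windows, Mac, and Linux, and how to start and use Jupyter Notebook.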
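1-1 Course Overview Video
Data science and machine learning
The data science workflow
Course outline:
- Chapter 1: Setting up the lab environment
- Chapter 2: Getting started with Numpy
- Chapter 3: Getting started with Pandas
- Chapter 4: Working with data in Pandas
- Chapter 5: Plotting and visualization with Matplotlib
- Chapter 6: Plotting and visualization with Seaborn
- Chapter 7: Hands-on data analysis project
- Chapter 8: Summary
Intended audience:
- Able to learn independently and work hands-on
- Has basic Python knowledge
- Plans to work in data analysis or machine learning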
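1-2 Introduction to Anaconda and Jupyter Notebook
Anaconda/Jupyter Notebook: Open Data Science Platform
What is Anaconda?
- The best-known Python data science platform
- 750+ popular Python & R packages
- Cross-platform: Windows, Mac, Linux
- conda: an extensible package manager
- Free to distribute
- A very active community
Installing Anaconda
Download links
- Current: https://www.anaconda.com/products/individual
- Previous: https://www.anaconda.com/download/
Check that the installation is correct:
cd ~/anaconda
bin/conda --version
conda 4.3.21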
Conda: Package and Environment management
- Install packages
- Update packages
- Create sandboxes: Conda environments
Environment management with Conda
Create a new environment
conda create --name python34 python=3.4
Activate an environment (conda 4.6+ also accepts "conda activate python34" on every platform)
activate python34 # for Windows
source activate python34 # for Linux & Mac
Leave the current environment
deactivate # for Windows
source deactivate # for Linux & Mac
Remove an environment
conda remove --name python34 --all
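Package management with Conda
Conda's package management is similar to pip
Install a Python package
conda install numpy
List the installed Python packages
conda list
conda list -n python34 # list the packages installed in a specific environment
Remove a Python package
conda remove --name python34 numpy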
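Data Science IDE vs Developer IDE
Data Science IDEs in Anaconda
From IPython to Jupyter
What is IPython?
- A powerful interactive shell
- The kernel behind Jupyter
- Supports interactive data analysis and visualization
IPython kernel
- Mainly responsible for running user code
- Talks to the IPython shell through stdin/stdout
- Talks to the notebook through JSON messages over ZeroMQ
What is Jupyter Notebook?
- Formerly IPython notebook
- An open-source web application
- Creates and shares documents that contain code, visualizations, and notes
- Useful for statistics, analysis, modeling, machine learning, and related fields
Interaction between the notebook and the kernel
- The Notebook server sits at the center
- The Notebook server loads and saves notebooks
Notebook file format (.ipynb)
- A JSON-based format defined by IPython Notebook
- Can read online data and CSV/XLS files
- Can be converted to other formats (py, html, pdf, md, etc.)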
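Because an .ipynb file is plain JSON, it can be inspected with nothing more than the standard library. The sketch below builds a small notebook-like document by hand and pulls out the source of its code cells. The field names follow the nbformat 4 layout as an assumption, and the document is a simplified illustration that is not validated against the official schema.

```python
import json

# A hand-written, simplified notebook document (assumed nbformat 4 layout).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Demo"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["print('hello')"],
        },
    ],
}

# Serialize to the text that would be stored in a .ipynb file, then parse it back.
text = json.dumps(notebook, indent=1)
loaded = json.loads(text)

# Collect the source of every code cell, roughly what a .py export keeps.
code_sources = [
    "".join(cell["source"])
    for cell in loaded["cells"]
    if cell["cell_type"] == "code"
]
print(code_sources)
```

Reading a real notebook works the same way: open the .ipynb file, pass it to json.load, and walk the "cells" list.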
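NBViewer
- An online viewer for notebooks in ipynb format
- Notebooks can be shared by URL
- GitHub integrates NBViewer
- Converters make it easy to embed notebooks in blogs, emails, wikis, and books
Lab environment for this course
- Install Anaconda on Windows/Mac/Linux
- Use Python 3.6 as the base environment
- Use Jupyter Notebook as the programming IDE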
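1-3 Installing Anaconda on Mac (demo)
Download the macOS installer, Python 3.6+, 64-bit (Python 3.9 as of 2022/2/15)
Anaconda3-2021.11-MacOSX-x86_64.pkg
Choose "Install for me only" and keep the defaults for everything else
Changing the install directory is not recommended (the installation needs 1.44GB)
~] ls
~] pwd
~] cd anaconda/
anaconda] ls
anaconda] cd bin
bin] ./conda --version
conda 4.3.21
bin] ./conda list
bin] ./jupyter notebook # opens the browser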
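1-4 Installing Anaconda on Windows (demo)
Download the Windows installer, Python 3.6+, 64-bit (Python 3.9 as of 2022/2/15)
Anaconda3-2021.11-Windows-x86_64.exe
Choose "Just Me (recommended)" and keep the defaults for everything else
The installed Anaconda3 appears in the Start menu
Open Jupyter Notebook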
1-5 Installing Anaconda on Linux (demo)
Download the Linux installer, Python 3.6+, 64-bit (Python 3.9 as of 2022/2/15)
Copy the installer link
~] wget https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh
~] ls
Anaconda3-2021.11-Linux-x86_64.sh
~] ls -lh
~] sh Anaconda3-2021.11-Linux-x86_64.sh # accept the default options
~] pwd
/home/centos
~] cd anaconda3
anaconda3] ls
anaconda3] cd bin
bin] ./conda --version
conda 4.3.21
bin] ./jupyter notebook --no-browser # copy the link it prints
On the local terminal
~ ssh -N -f -L localhost:888:localhost:8888 gitlab-demo-ci
~ ssh -N -f -L localhost:888:localhost:8888 root@gitlab-demo-ci
Open a browser and paste the copied link
!ifconfig # in a notebook cell, runs the Linux ifconfig command
1-6 Using Jupyter Notebook (demo)
cd anaconda3
cd jupyter-notebook/python-data-science
python-data-science git:(master) ls
README.md demo.ipynb
python-data-science git:(master) xx/bin/jupyter notebook # starts the notebook
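Chapter 2 Getting Started with Numpy
This chapter introduces Numpy, the most fundamental library in the Python data science stack. It reviews the basics of matrix operations, introduces the key data structure Array, and shows how to perform array and matrix operations with Numpy.
2-1 Five Commonly Used Python Libraries for Data Science
- Numpy
- Scipy
- Pandas
- Matplotlib
- Scikit-learn
Numpy
- N-dimensional arrays (matrices): fast, efficient, with vectorized operations
- Efficient indexing without explicit loops
- Open source, free, and cross-platform, with performance comparable to C/Matlab
- NumPy documentation in Chinese
Scipy
- Depends on Numpy
- Designed for science and engineering
- Implements many common scientific computations: linear algebra, Fourier transforms, signal and image processing
Pandas
- A powerful tool for structured data analysis (depends on Numpy)
- Provides several high-level data structures: Time-Series, DataFrame, Panel (Panel has been removed from recent pandas releases)
- Strong data indexing and processing capabilities
Matplotlib
- The most widely used 2D plotting package in Python
- Can largely replace Matlab's plotting features (scatter, line, bar, etc.)
- Can draw polished 3D plots through mplot3d
Scikit-learn
- A Python module for machine learning
- Built on Scipy; provides common machine learning algorithms such as clustering and regression
- A simple, easy-to-learn API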
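2-2 Math Review: Matrix Operations
Basic concepts
- Matrix: rectangular data, i.e. a two-dimensional array. Vectors and scalars are special cases of matrices
- Vector: a 1xn or nx1 matrix
- Scalar: a 1x1 matrix
- Array: an N-dimensional array, a generalization of the matrix
Special matrices
- All-zeros and all-ones matrices
- Identity matrix
Matrix addition and subtraction
- The two matrices being added or subtracted must have the same number of rows and columns
- Elements at matching row and column positions are added or subtracted
Array multiplication (element-wise)
- Array multiplication (element-wise) multiplies corresponding elements
Matrix multiplication
Let A be an mxp matrix and B a pxn matrix. Their product is the mxn matrix C, written C=AB, where the element in row i and column j of C is:
$$c_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj}$$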
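The definition of C=AB, in which each element of C is the sum over k of a_ik times b_kj, can be checked numerically. The sketch below spells out the triple loop and compares it with NumPy's built-in matrix product. The helper name matmul_by_definition and the sample matrices are made up for illustration.

```python
import numpy as np


def matmul_by_definition(A, B):
    """Compute C = AB with c_ij = sum over k of a_ik * b_kj (illustrative helper)."""
    m, p = A.shape
    p2, n = B.shape
    if p != p2:
        raise ValueError("the column count of A must equal the row count of B")
    C = np.zeros((m, n), dtype=np.result_type(A, B))
    for i in range(m):
        for j in range(n):
            for k in range(p):
                C[i, j] += A[i, k] * B[k, j]
    return C


A = np.array([[1, 2, 3], [4, 5, 6]])    # 2x3
B = np.array([[1, 0], [0, 1], [2, 2]])  # 3x2
C = matmul_by_definition(A, B)

print(C)                          # the 2x2 product
print(np.array_equal(C, A @ B))   # agrees with NumPy's matrix product
print(A * A)                      # element-wise product, for contrast
```

The loop is only there to mirror the formula; in practice A @ B (or np.dot) is the idiomatic and much faster choice.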
Further linear algebra material
- Linear Algebra, published by Tsinghua University Press
- http://bs.szu.edu.cn/sljr/Up/day_110824/201108240409437708.pdf
2-3 Creating and Accessing Arrays
In Jupyter notebook, create a new file Array.ipynb
# Creating and accessing arrays
import numpy as np
# create from python list
list_1 = [1, 2, 3, 4]
list_1
# [1, 2, 3, 4]
array_1 = np.array(list_1)
array_1
# array([1, 2, 3, 4])
list_2 = [5, 6, 7, 8]
array_2 = np.array([list_1,list_2])
array_2
# array([[1, 2, 3, 4],[5, 6, 7, 8]])
array_2.shape # shape of the array: (n, m) means n rows and m columns
# (2, 4)
# Further attributes
# array_2.ndim - the number of axes (dimensions) of the array. In the Python world, the number of dimensions is referred to as rank.
# array_2.itemsize - the size in bytes of each element of the array
# array_2.data - the buffer containing the actual elements of the array
array_2.size # total number of elements, equal to the product of the entries of shape
# 8
array_2.dtype # an object describing the type of the elements
# dtype('int32') depending on the machine; may also be dtype('int64')
array_3 = np.array([[1.0,2,3],[4.0,5,6]])
array_3.dtype
# dtype('float64')
array_4 = np.arange(1,10)
array_4
# array([1, 2, 3, 4, 5, 6, 7, 8, 9])
array_4 = np.arange(1, 10, 2)
array_4
# array([1, 3, 5, 7, 9])
np.zeros(5)
# array([0., 0., 0., 0., 0.]) # zero array
np.zeros([2,3]) # two-dimensional zero matrix with 2 rows and 3 columns
# array([[0., 0., 0.],[0., 0., 0.]])
np.eye(5) # identity matrix with n=5
# array([[1., 0., 0., 0., 0.],[0., 1., 0., 0., 0.],[0., 0., 1., 0., 0.],[0., 0., 0., 1., 0.],[0., 0., 0., 0., 1.]])
np.eye(5).dtype
# dtype('float64')
a = np.arange(1,10)
a
# array([1, 2, 3, 4, 5, 6, 7, 8, 9])
a[1]
# 2 (the 2nd element of the array)
a[1:5]
# array([2, 3, 4, 5]) the 2nd through 5th elements
b = np.array([[1,2,3],[4,5,6]])
b
# array([[1, 2, 3],[4, 5, 6]])
b[1][0]
# 4
b[1,0]
# 4
c = np.array([[1,2,3],[4,5,6],[7,8,9]])
c
# array([[1, 2, 3],[4, 5, 6],[7, 8, 9]])
c[:2,1:] # first two rows, columns from index 1 onward
# array([[2, 3],[5, 6]])
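The attributes and indexing forms listed in this section can be collected into one small check. This is a minimal sketch; the dtype is pinned to int64 here (an assumption made so that itemsize is the same on every platform).

```python
import numpy as np

# Pin the dtype so that itemsize does not depend on the platform default.
b = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64)

print(b.ndim)      # number of axes
print(b.shape)     # (rows, columns)
print(b.size)      # total number of elements
print(b.itemsize)  # bytes per element

# Chained indexing and tuple indexing address the same element.
same = b[1][0] == b[1, 0]
print(same)

# A 2D slice: first two rows, columns from index 1 onward.
c = np.arange(1, 10).reshape(3, 3)
block = c[:2, 1:]
print(block)
```

The tuple form b[1, 0] is usually preferred, since it indexes in one step instead of first building the intermediate row b[1].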
2-4 Array and Matrix Operations
In Jupyter notebook, create a new file 数组与矩阵运算.ipynb
# Creating arrays quickly
import numpy as np
np.random.randn(10) # returns a one-dimensional array of 10 floats drawn from the standard normal distribution
# array([ 0.26674666, -0.91111093, 0.30684449, -0.80206634, -0.89176532,0.7950014 , -0.53259808, -0.09981816, 1.2960139 , -0.9668373 ])
np.random.randint(10) # returns a single random integer in [0,9], e.g. 0
np.random.randint(10,size=(2,3)) # a 2x3 two-dimensional array with elements in [0,9]
# array([[7, 5, 8],[1, 5, 8]])
np.random.randint(10,size=20) # a one-dimensional array of 20 elements in [0,9]
# array([5, 6, 4, 8, 0, 9, 6, 2, 2, 9, 2, 1, 4, 6, 1, 5, 8, 2, 3, 4])
np.random.randint(10,size=20).reshape(4,5) # reshape the 20-element one-dimensional array into a 4x5 two-dimensional array, elements in [0,9]
# array([[7, 1, 0, 5, 7],[8, 0, 3, 7, 9],[9, 0, 7, 3, 2],[9, 1, 5, 8, 7]])
# Array operations
a = np.random.randint(10,size=20).reshape(4,5)
b = np.random.randint(10,size=20).reshape(4,5)
a
# array([[2, 3, 8, 4, 8],[0, 7, 9, 9, 9],[1, 8, 1, 8, 6],[3, 4, 7, 5, 1]])
b
# array([[8, 4, 3, 1, 6],[4, 4, 6, 2, 9],[9, 4, 8, 5, 8],[6, 2, 5, 5, 8]])
a + b
# array([[10, 7, 11, 5, 14],[ 4, 11, 15, 11, 18],[10, 12, 9, 13, 14],[ 9, 6, 12, 10, 9]])
a - b
# array([[-6, -1, 5, 3, 2],[-4, 3, 3, 7, 0],[-8, 4, -7, 3, -2],[-3, 2, 2, 0, -7]])
a * b # element-wise product
# array([[16, 12, 24, 4, 48],[ 0, 28, 54, 18, 81],[ 9, 32, 8, 40, 48],[18, 8, 35, 25, 8]])
a / b
# If b contains 0, NumPy does not raise; it emits a RuntimeWarning and the result holds inf or nan at those positions
# array([[0.25 , 0.75 , 2.66666667, 4. , 1.33333333],[0. , 1.75 , 1.5 , 4.5 , 1. ],[0.11111111, 2. , 0.125 , 1.6 , 0.75 ],[0.5 , 2. , 1.4 , 1. , 0.125 ]])
np.mat([[1,2,3],[4,5,6]])
# matrix([[1, 2, 3],[4, 5, 6]])
a
# array([[2, 3, 8, 4, 8],[0, 7, 9, 9, 9],[1, 8, 1, 8, 6],[3, 4, 7, 5, 1]])
np.mat(a)
# matrix([[2, 3, 8, 4, 8],[0, 7, 9, 9, 9],[1, 8, 1, 8, 6],[3, 4, 7, 5, 1]])
# Matrix operations
A = np.mat(a)
B = np.mat(b)
A
# matrix([[2, 3, 8, 4, 8],[0, 7, 9, 9, 9],[1, 8, 1, 8, 6],[3, 4, 7, 5, 1]])
B
# matrix([[8, 4, 3, 1, 6],[4, 4, 6, 2, 9],[9, 4, 8, 5, 8],[6, 2, 5, 5, 8]])
A + B
# matrix([[10, 7, 11, 5, 14],[ 4, 11, 15, 11, 18],[10, 12, 9, 13, 14],[ 9, 6, 12, 10, 9]])
A - B
# matrix([[-6, -1, 5, 3, 2],[-4, 3, 3, 7, 0],[-8, 4, -7, 3, -2],[-3, 2, 2, 0, -7]])
A * B # raises an error: both are 4x5, so the column count of A does not match the row count of B
a = np.mat(np.random.randint(10,size=20).reshape(4,5))
b = np.mat(np.random.randint(10,size=20).reshape(5,4))
a
# matrix([[9, 9, 3, 0, 5],[9, 4, 6, 4, 5],[9, 0, 7, 0, 9],[7, 2, 6, 0, 6]])
b
# matrix([[2, 2, 6, 4],[8, 9, 8, 0],[2, 1, 3, 9],[3, 1, 0, 2],[9, 3, 1, 4]])
a * b # for np.matrix objects, * is the matrix product
# matrix([[141, 117, 140, 83],[119, 79, 109, 118],[113, 52, 84, 135],[ 96, 56, 82, 106]])
# Common array functions
a = np.random.randint(10,size=20).reshape(4,5)
np.unique(a) # the distinct values among all elements of a
# array([0, 1, 2, 3, 4, 5, 6, 8, 9])
a
# array([[4, 2, 8, 4, 2],[6, 9, 6, 4, 0],[9, 2, 6, 9, 0],[1, 3, 8, 5, 9]])
sum(a) # the built-in sum adds the rows together, giving the sum of each column
# array([20, 16, 28, 22, 11])
sum(a[0]) # sum of the first row of a
# 20
sum(a[:,0]) # sum of the first column of a
# 20
a.max() # largest value in a
# 9
max(a[0]) # largest value in the first row of a
# 8
max(a[:,0]) # largest value in the first column of a
# 9
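The built-in sum and max used above work on a 2D array only by iterating over its rows. NumPy's own methods take an axis argument, which states the intent directly. The sketch below reuses the 4x5 array printed above and shows the equivalent axis-based calls.

```python
import numpy as np

# The same 4x5 array that is printed in the notes above.
a = np.array([[4, 2, 8, 4, 2],
              [6, 9, 6, 4, 0],
              [9, 2, 6, 9, 0],
              [1, 3, 8, 5, 9]])

col_sums = a.sum(axis=0)   # one value per column, same as built-in sum(a)
row_sums = a.sum(axis=1)   # one value per row
total = a.sum()            # sum over every element
col_max = a.max(axis=0)    # largest value in each column

print(col_sums)
print(row_sums)
print(total)
print(col_max)
```

With axis=0 the reduction runs down the rows (one result per column); with axis=1 it runs across the columns (one result per row).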
2-5 Array Input and Output
In Jupyter notebook, create a new file Array的input和output.ipynb
# Serialize a numpy array with pickle
import pickle
import numpy as np
x = np.arange(10)
x
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
f = open('x.pk1','wb')
pickle.dump(x, f)
f.close() # close the file so that the data is written to disk
!ls # on Windows use !dir
# Array.ipynb  Array的input和output.ipynb  x.pk1  数组与矩阵运算.ipynb
f = open('x.pk1','rb')
pickle.load(f)
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
f.close()
np.save('one_array', x) # writes one_array.npy
!ls
# Array.ipynb  Array的input和output.ipynb  x.pk1  one_array.npy  数组与矩阵运算.ipynb
np.load('one_array.npy')
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.arange(20)
y
# array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,17, 18, 19])
np.savez('two_array.npz', a=x, b=y) # store several arrays in one file, each under its keyword name
!ls
# Array.ipynb  two_array.npz  Array的input和output.ipynb  x.pk1  one_array.npy  数组与矩阵运算.ipynb
np.load('two_array.npz')
# <numpy.lib.npyio.NpzFile at 0x17033c77df0>
c = np.load('two_array.npz')
c['a']
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
c['b']
# array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,17, 18, 19])
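The save and load steps above can be repeated without leaving files behind. The sketch below writes into a temporary directory (an assumption made only to keep the example tidy) and opens the .npz archive in a with block so that the underlying file is closed afterwards.

```python
import os
import tempfile

import numpy as np

x = np.arange(10)
y = np.arange(20)

with tempfile.TemporaryDirectory() as tmp:
    npy_path = os.path.join(tmp, "one_array.npy")
    npz_path = os.path.join(tmp, "two_array.npz")

    # Single array: np.save / np.load.
    np.save(npy_path, x)
    x_back = np.load(npy_path)

    # Several arrays in one archive: np.savez stores each under its keyword name.
    np.savez(npz_path, a=x, b=y)
    with np.load(npz_path) as archive:
        names = sorted(archive.files)
        a_back = archive["a"]
        b_back = archive["b"]

print(names)
print(np.array_equal(x, x_back))
print(np.array_equal(x, a_back), np.array_equal(y, b_back))
```

The archive behaves like a read-only dictionary: archive.files lists the stored names, and indexing by name loads that array.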
SciPy documentation
- Current: https://docs.scipy.org/doc/scipy/getting_started.html
- Previous: https://docs.scipy.org/doc/numpy-dev/user/quickstart.html
Chapter 3 Getting Started with Pandas
This chapter introduces Pandas, the most important Python library for data analysis. It starts from the two central Pandas data structures, Series and DataFrame, covers how to create them and their basic operations, and uses hands-on examples to show how Series and DataFrame relate to each other.
3-1 Pandas Series
In Jupyter notebook, create a new file Series.ipynb
import numpy as np
import pandas as pd
s1 = pd.Series([1,2,3,4])
s1
# Output
0 1
1 2
2 3
3 4
dtype: int64
s1.values
# array([1, 2, 3, 4], dtype=int64)
s1.index
# RangeIndex(start=0, stop=4, step=1)
s2 = pd.Series(np.arange(10))
s2 # on some machines the dtype is int64
# Output
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
dtype: int32
s3 = pd.Series({'1':1, '2':2, '3':3})
s3
# Output
1 1
2 2
3 3
dtype: int64
s3.values
# array([1, 2, 3], dtype=int64)
s3.index
# Index(['1', '2', '3'], dtype='object')
s4 = pd.Series([1,2,3,4],index=['A','B','C','D'])
s4
# Output
A 1
B 2
C 3
D 4
dtype: int64
s4.values
# array([1, 2, 3, 4], dtype=int64)
s4.index
# Index(['A', 'B', 'C', 'D'], dtype='object')
s4['A']
# 1
s4[s4>2]
# Output
C 3
D 4
dtype: int64
s4
# Output
A 1
B 2
C 3
D 4
dtype: int64
s4.to_dict()
# {'A': 1, 'B': 2, 'C': 3, 'D': 4}
s5 = pd.Series(s4.to_dict())
s5
# Output
A 1
B 2
C 3
D 4
dtype: int64
index_1 = ['A', 'B', 'C', 'D','E']
s6 = pd.Series(s5,index=index_1) # 'E' has no value in s5, so it becomes NaN and the dtype changes to float64
s6
# Output
A 1.0
B 2.0
C 3.0
D 4.0
E NaN
dtype: float64
pd.isnull(s6)
# Output
A False
B False
C False
D False
E True
dtype: bool
pd.notnull(s6)
# Output
A True
B True
C True
D True
E False
dtype: bool
s6
# Output
A 1.0
B 2.0
C 3.0
D 4.0
E NaN
dtype: float64
s6.name = 'demo'
s6
# Output
A 1.0
B 2.0
C 3.0
D 4.0
E NaN
Name: demo, dtype: float64
s6.index.name = 'demo index'
s6
# Output
demo index
A 1.0
B 2.0
C 3.0
D 4.0
E NaN
Name: demo, dtype: float64
s6.index
# Index(['A', 'B', 'C', 'D', 'E'], dtype='object', name='demo index')
s6.values
# array([ 1., 2., 3., 4., nan])
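Building a Series on a larger index, as done for s6 above, introduces NaN for labels that have no value, and the dtype switches to float. The sketch below repeats that step with made-up variable names, finds the missing labels, and fills them. The use of fillna is an addition for illustration and is not part of the original notes.

```python
import pandas as pd

base = pd.Series({"A": 1, "B": 2, "C": 3, "D": 4})

# Re-labelling on a longer index leaves 'E' without a value, so it becomes NaN.
extended = pd.Series(base, index=["A", "B", "C", "D", "E"])
print(extended.dtype)

# Labels whose value is missing.
missing_labels = extended.index[extended.isnull()].tolist()
print(missing_labels)

# Replace the missing value and go back to an integer dtype.
filled = extended.fillna(0).astype("int64")
print(filled.tolist())
```

The switch to float64 happens because NaN is a floating-point value, so an integer Series cannot hold it without changing type.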
3-2 Pandas DataFrame
In Jupyter notebook, create a new file DataFrame.ipynb
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import webbrowser
link = 'https://www.tiobe.com/tiobe-index/'
webbrowser.open(link) # open the link in the browser
# True
df = pd.read_clipboard() # copy 10 rows of the table on the page, including the header
df
# Output
Position Programming Language Ratings
0 21 SAS 0.66% None
1 22 Scratch 0.64% None
2 23 Fortran 0.58% None
3 24 Rust 0.54% None
4 25 (Visual) FoxPro 0.52%
5 26 COBOL 0.42% None
6 27 Dart 0.42% None
7 28 Kotlin 0.41% None
8 29 Lua 0.40% None
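9 30 Julia 0.40% None
type(df)
# pandas.core.frame.DataFrame
df.columns # read_clipboard splits on whitespace by default, so the header "Programming Language" became two columns
# Index(['Position', 'Programming', 'Language', 'Ratings'], dtype='object')
df.Ratings
# Output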
0 None
1 None
2 None
3 None
4 0.52%
5 None
6 None
7 None
8 None
9 None
Name: Ratings, dtype: object
df_new = DataFrame(df,columns=['Programming','Language'])
df_new
# Output
Programming Language
0 SAS 0.66%
1 Scratch 0.64%
2 Fortran 0.58%
3 Rust 0.54%
4 (Visual) FoxPro
5 COBOL 0.42%
6 Dart 0.42%
7 Kotlin 0.41%
8 Lua 0.40%
9 Julia 0.40%
df['Position']
# Output
0 21
1 22
2 23
3 24
4 25
5 26
6 27
7 28
8 29
9 30
Name: Position, dtype: int64
type(df['Position'])
# pandas.core.series.Series
df_new = DataFrame(df,columns=['Programming','Language','Language1'])
df_new
# Output
Programming Language Language1
0 SAS 0.66% NaN
1 Scratch 0.64% NaN
2 Fortran 0.58% NaN
3 Rust 0.54% NaN
4 (Visual) FoxPro NaN
5 COBOL 0.42% NaN
6 Dart 0.42% NaN
7 Kotlin 0.41% NaN
8 Lua 0.40% NaN
9 Julia 0.40% NaN
# Three ways to fill the new column
df_new['Language1'] = range(0,10)
# df_new['Language1'] = np.arange(0,10)
# df_new['Language1'] = pd.Series(np.arange(0,10))
df_new
# Output
Programming Language Language1
0 SAS 0.66% 0
1 Scratch 0.64% 1
2 Fortran 0.58% 2
3 Rust 0.54% 3
4 (Visual) FoxPro 4
5 COBOL 0.42% 5
6 Dart 0.42% 6
7 Kotlin 0.41% 7
8 Lua 0.40% 8
9 Julia 0.40% 9
df_new['Language1'] = pd.Series([100,200], index=[1,2]) # values are aligned by index label; rows without a matching label become NaN
df_new
# Output
Programming Language Language1
0 SAS 0.66% NaN
1 Scratch 0.64% 100.0
2 Fortran 0.58% 200.0
3 Rust 0.54% NaN
4 (Visual) FoxPro NaN
5 COBOL 0.42% NaN
6 Dart 0.42% NaN
7 Kotlin 0.41% NaN
8 Lua 0.40% NaN
9 Julia 0.40% NaN
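The last assignment above shows that a Series assigned to a column is matched by index label, not by position. A minimal sketch of the difference follows, using a small made-up frame with hypothetical column names Rank and Score.

```python
import pandas as pd

df = pd.DataFrame({"Language": ["SAS", "Scratch", "Fortran", "Rust"]})

# A plain sequence is assigned by position and must match the row count.
df["Rank"] = range(len(df))

# A Series is aligned on index labels: only rows 1 and 2 receive a value.
df["Score"] = pd.Series([100, 200], index=[1, 2])

print(df)
print(df["Score"].isna().tolist())
```

Rows 0 and 3 have no matching label in the assigned Series, so they hold NaN and the column becomes float64.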
3-3 A Closer Look at Series and DataFrame
In Jupyter notebook, create a new file 深入理解Series和Dataframe.ipynb
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
data = {'Country':['Belgium', 'India', 'Brazil'],'Capital':['Brussels','New Delhi', 'Brasilia'],'Population':[11190846, 1303171035, 207847528]}
# Series
s1 = pd.Series(data['Country'])
s1
# Output
0 Belgium
1 India
2 Brazil
dtype: object
s1.values
# array(['Belgium', 'India', 'Brazil'], dtype=object)
s1.index
# RangeIndex(start=0, stop=3, step=1)
s1 = pd.Series(data['Country'],index=['A','B','C'])
s1
# Output
A Belgium
B India
C Brazil
dtype: object
s1.values
# array(['Belgium', 'India', 'Brazil'], dtype=object)
s1.index
# Index(['A', 'B', 'C'], dtype='object')
# DataFrame
df1 = pd.DataFrame(data)
df1
# Output
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasilia 207847528
df1['Country']
# Output
0 Belgium
1 India
2 Brazil
Name: Country, dtype: object
cou = df1['Country']
type(cou)
# pandas.core.series.Series
df1.iterrows()
# <generator object DataFrame.iterrows at 0x0000018DD44C59E0>
for row in df1.iterrows(): # each row is a tuple of (index label, Series of that row)
    print(row)
    print(type(row))
    print(len(row))
# Output
(0, Country Belgium
Capital Brussels
Population 11190846
Name: 0, dtype: object)
<class 'tuple'>
2
(1, Country India
Capital New Delhi
Population 1303171035
Name: 1, dtype: object)
<class 'tuple'>
2
(2, Country Brazil
Capital Brasilia
Population 207847528
Name: 2, dtype: object)
<class 'tuple'>
2
for row in df1.iterrows():
    print(type(row[0]),row[0],row[1])
    break
# Output
<class 'int'> 0 Country Belgium
Capital Brussels
Population 11190846
Name: 0, dtype: object
# Note: depending on the environment, the type of the index label may print as <class 'int'> or <class 'numpy.int64'>
for row in df1.iterrows():
    print(type(row[0]),type(row[1]))
    break
# Output
<class 'int'> <class 'pandas.core.series.Series'>
# Note: as above, the first type may print as <class 'numpy.int64'> instead
df1
# Output
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasilia 207847528
data
# Output
{'Country': ['Belgium', 'India', 'Brazil'],'Capital': ['Brussels', 'New Delhi', 'Brasilia'],'Population': [11190846, 1303171035, 207847528]}
s1 = pd.Series(data['Country'])
s2 = pd.Series(data['Capital'])
s3 = pd.Series(data['Population'])
df_new = pd.DataFrame([s1,s2,s3]) # each Series becomes one row
df_new
# Output
0 1 2
0 Belgium India Brazil
1 Brussels New Delhi Brasilia
2 11190846 1303171035 207847528
df1
# Output
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasilia 207847528
df_new = df_new.T # transpose so that each Series becomes a column
df_new
# Output
0 1 2
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasilia 207847528
df_new = pd.DataFrame([s1,s2,s3], index=['Country','Capital','Population'])
df_new
# Output
0 1 2
Country Belgium India Brazil
Capital Brussels New Delhi Brasilia
Population 11190846 1303171035 207847528
df_new = df_new.T
df_new
# Output
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasilia 207847528
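Stacking Series as rows and then transposing reproduces the layout of df1, but every column ends up with the generic object dtype, because each row of the intermediate frame mixed strings and integers. The sketch below contrasts that with pd.concat along axis=1, an alternative that is not used in the original notes, which places each Series directly as a column and keeps its dtype.

```python
import pandas as pd

data = {
    "Country": ["Belgium", "India", "Brazil"],
    "Capital": ["Brussels", "New Delhi", "Brasilia"],
    "Population": [11190846, 1303171035, 207847528],
}

# Name each Series so that the name can serve as the row or column label.
s1 = pd.Series(data["Country"], name="Country")
s2 = pd.Series(data["Capital"], name="Capital")
s3 = pd.Series(data["Population"], name="Population")

# Approach from the notes: Series as rows, then transpose.
by_rows = pd.DataFrame([s1, s2, s3]).T

# Alternative: place each Series side by side as a column.
by_cols = pd.concat([s1, s2, s3], axis=1)

print(list(by_rows.columns))
print(by_rows["Population"].dtype)
print(by_cols["Population"].dtype)
```

Both frames show the same values, but only by_cols keeps Population as an integer column, which matters as soon as arithmetic or aggregation is applied to it.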
3-4 Pandas DataFrame IO Operations
In Jupyter notebook, create a new file DataFrame IO.ipynb
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
import webbrowser
link = 'http://pandas.pydata.org/pandas-docs/version/0.20/io.html'
webbrowser.open(link) # opens the browser and returns True; copy the table from the web page
# True
df1 = pd.read_clipboard()
df1
# Output
Format Type Data Description Reader Writer
0 text CSV read_csv to_csv
1 text JSON read_json to_json
2 text HTML read_html to_html
3 text Local clipboard read_clipboard to_clipboard
4 binary MS Excel read_excel to_excel
5 binary HDF5 Format read_hdf to_hdf
6 binary Feather Format read_feather to_feather
7 binary Msgpack read_msgpack to_msgpack
8 binary Stata read_stata to_stata
9 binary SAS read_sas
10 binary Python Pickle Format read_pickle to_pickle
11 SQL SQL read_sql to_sql
12 SQL Google Big Query read_gbq to_gbq
df1.to_clipboard()
df1
# Output
Format Type Data Description Reader Writer
0 text CSV read_csv to_csv
1 text JSON read_json to_json
2 text HTML read_html to_html
3 text Local clipboard read_clipboard to_clipboard
4 binary MS Excel read_excel to_excel
5 binary HDF5 Format read_hdf to_hdf
6 binary Feather Format read_feather to_feather
7 binary Msgpack read_msgpack to_msgpack
8 binary Stata read_stata to_stata
9 binary SAS read_sas
10 binary Python Pickle Format read_pickle to_pickle
11 SQL SQL read_sql to_sql
12 SQL Google Big Query read_gbq to_gbq
df1.to_csv('df1.csv')
!ls # on Windows use !dir
# DataFrame IO.ipynb df1.csv
!more df1.csv
# Output
,Format Type,Data Description,Reader,Writer
0,text,CSV,read_csv,to_csv
1,text,JSON,read_json,to_json
2,text,HTML,read_html,to_html
3,text,Local clipboard,read_clipboard,to_clipboard
4,binary,MS Excel,read_excel,to_excel
5,binary,HDF5 Format,read_hdf,to_hdf
6,binary,Feather Format,read_feather,to_feather
7,binary,Msgpack,read_msgpack,to_msgpack
8,binary,Stata,read_stata,to_stata
9,binary,SAS,read_sas,
10,binary,Python Pickle Format,read_pickle,to_pickle
11,SQL,SQL,read_sql,to_sql
12,SQL,Google Big Query,read_gbq,to_gbq
df1.to_csv('df1.csv',index=False) # leave out the index column
!ls
# DataFrame IO.ipynb df1.csv
!more df1.csv
# Output
Format Type,Data Description,Reader,Writer
text,CSV,read_csv,to_csv
text,JSON,read_json,to_json
text,HTML,read_html,to_html
text,Local clipboard,read_clipboard,to_clipboard
binary,MS Excel,read_excel,to_excel
binary,HDF5 Format,read_hdf,to_hdf
binary,Feather Format,read_feather,to_feather
binary,Msgpack,read_msgpack,to_msgpack
binary,Stata,read_stata,to_stata
binary,SAS,read_sas,
binary,Python Pickle Format,read_pickle,to_pickle
SQL,SQL,read_sql,to_sql
SQL,Google Big Query,read_gbq,to_gbq
df2 = pd.read_csv('df1.csv')
df2
# Output
Format Type Data Description Reader Writer
0 text CSV read_csv to_csv
1 text JSON read_json to_json
2 text HTML read_html to_html
3 text Local clipboard read_clipboard to_clipboard
4 binary MS Excel read_excel to_excel
5 binary HDF5 Format read_hdf to_hdf
6 binary Feather Format read_feather to_feather
7 binary Msgpack read_msgpack to_msgpack
8 binary Stata read_stata to_stata
9 binary SAS read_sas
10 binary Python Pickle Format read_pickle to_pickle
11 SQL SQL read_sql to_sql
12 SQL Google Big Query read_gbq to_gbq
df1.to_json()
# Output
'{"Format Type":{"0":"text","1":"text","2":"text","3":"text","4":"binary","5":"binary","6":"binary","7":"binary","8":"binary","9":"binary","10":"binary","11":"SQL","12":"SQL"},"Data Description":{"0":"CSV","1":"JSON","2":"HTML","3":"Local clipboard","4":"MS Excel","5":"HDF5 Format","6":"Feather Format","7":"Msgpack","8":"Stata","9":"SAS","10":"Python Pickle Format","11":"SQL","12":"Google Big Query"},"Reader":{"0":"read_csv","1":"read_json","2":"read_html","3":"read_clipboard","4":"read_excel","5":"read_hdf","6":"read_feather","7":"read_msgpack","8":"read_stata","9":"read_sas","10":"read_pickle","11":"read_sql","12":"read_gbq"},"Writer":{"0":"to_csv","1":"to_json","2":"to_html","3":"to_clipboard","4":"to_excel","5":"to_hdf","6":"to_feather","7":"to_msgpack","8":"to_stata","9":" ","10":"to_pickle","11":"to_sql","12":"to_gbq"}}'
pd.read_json(df1.to_json()) # recent pandas releases prefer wrapping the string: pd.read_json(io.StringIO(df1.to_json()))
# Output
Format Type Data Description Reader Writer
0 text CSV read_csv to_csv
1 text JSON read_json to_json
2 text HTML read_html to_html
3 text Local clipboard read_clipboard to_clipboard
4 binary MS Excel read_excel to_excel
5 binary HDF5 Format read_hdf to_hdf
6 binary Feather Format read_feather to_feather
7 binary Msgpack read_msgpack to_msgpack
8 binary Stata read_stata to_stata
9 binary SAS read_sas
10 binary Python Pickle Format read_pickle to_pickle
11 SQL SQL read_sql to_sql
12 SQL Google Big Query read_gbq to_gbq
df1.to_html() # with no argument, returns the HTML as a string
# Output
'<table border="1" class="dataframe">\n <thead>\n <tr style="text-align: right;">\n <th></th>\n <th>Format Type</th>\n <th>Data Description</th>\n <th>Reader</th>\n <th>Writer</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>text</td>\n <td>CSV</td>\n <td>read_csv</td>\n <td>to_csv</td>\n </tr>\n <tr>\n <th>1</th>\n <td>text</td>\n <td>JSON</td>\n <td>read_json</td>\n <td>to_json</td>\n </tr>\n <tr>\n <th>2</th>\n <td>text</td>\n <td>HTML</td>\n <td>read_html</td>\n <td>to_html</td>\n </tr>\n <tr>\n <th>3</th>\n <td>text</td>\n <td>Local clipboard</td>\n <td>read_clipboard</td>\n <td>to_clipboard</td>\n </tr>\n <tr>\n <th>4</th>\n <td>binary</td>\n <td>MS Excel</td>\n <td>read_excel</td>\n <td>to_excel</td>\n </tr>\n <tr>\n <th>5</th>\n <td>binary</td>\n <td>HDF5 Format</td>\n <td>read_hdf</td>\n <td>to_hdf</td>\n </tr>\n <tr>\n <th>6</th>\n <td>binary</td>\n <td>Feather Format</td>\n <td>read_feather</td>\n <td>to_feather</td>\n </tr>\n <tr>\n <th>7</th>\n <td>binary</td>\n <td>Msgpack</td>\n <td>read_msgpack</td>\n <td>to_msgpack</td>\n </tr>\n <tr>\n <th>8</th>\n <td>binary</td>\n <td>Stata</td>\n <td>read_stata</td>\n <td>to_stata</td>\n </tr>\n <tr>\n <th>9</th>\n <td>binary</td>\n <td>SAS</td>\n <td>read_sas</td>\n <td></td>\n </tr>\n <tr>\n <th>10</th>\n <td>binary</td>\n <td>Python Pickle Format</td>\n <td>read_pickle</td>\n <td>to_pickle</td>\n </tr>\n <tr>\n <th>11</th>\n <td>SQL</td>\n <td>SQL</td>\n <td>read_sql</td>\n <td>to_sql</td>\n </tr>\n <tr>\n <th>12</th>\n <td>SQL</td>\n <td>Google Big Query</td>\n <td>read_gbq</td>\n <td>to_gbq</td>\n </tr>\n </tbody>\n</table>'
df1.to_html('df1.html') # with a path, writes the HTML to a file
!ls
# DataFrame IO.ipynb df1.csv df1.html
df1.to_excel('df1.xlsx')
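The effect of index=False on to_csv can be seen without touching the disk. The sketch below uses a two-row stand-in for the table above (made-up for brevity), writes it to strings instead of files, and reads one of them back through io.StringIO.

```python
import io

import pandas as pd

# A small stand-in for the reader/writer table used in this section.
df = pd.DataFrame({
    "Format Type": ["text", "binary"],
    "Reader": ["read_csv", "read_excel"],
    "Writer": ["to_csv", "to_excel"],
})

with_index = df.to_csv()                # called without a path, returns a string
without_index = df.to_csv(index=False)  # same data, index column omitted

header_with = with_index.splitlines()[0]
header_without = without_index.splitlines()[0]
print(header_with)      # starts with a comma: the unnamed index column
print(header_without)

# Round trip: parse the CSV text back into a DataFrame.
restored = pd.read_csv(io.StringIO(without_index))
print(restored)
```

Writing the index and then reading the file back with plain read_csv would add an extra unnamed column, which is why index=False is the usual choice for a default integer index.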
3-5 Selecting and Indexing in a DataFrame
In Jupyter notebook, create a new file Selecting and indexing.ipynb
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
!pwd # on Windows use !chdir
# /Users/xxx/xx
!ls /Users/xxx/xx/homework # on Windows use !dir
# movie_metadata.csv
imdb = pd.read_csv('/Users/xxx/xx/homework/movie_metadata.csv')
imdb
# Output
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... 3054.0 English USA PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy ... 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller ... 994.0 English UK PG-13 245000000.0 2015.0 393.0 6.8 2.35 85000
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Action|Thriller ... 2701.0 English USA PG-13 250000000.0 2012.0 23000.0 8.5 2.35 164000
4 NaN Doug Walker NaN NaN 131.0 NaN Rob Walker 131.0 NaN Documentary ... NaN NaN NaN NaN NaN NaN 12.0 7.1 NaN 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5038 Color Scott Smith 1.0 87.0 2.0 318.0 Daphne Zuniga 637.0 NaN Comedy|Drama ... 6.0 English Canada NaN NaN 2013.0 470.0 7.7 NaN 84
5039 Color NaN 43.0 43.0 NaN 319.0 Valorie Curry 841.0 NaN Crime|Drama|Mystery|Thriller ... 359.0 English USA TV-14 NaN NaN 593.0 7.5 16.00 32000
5040 Color Benjamin Roberds 13.0 76.0 0.0 0.0 Maxwell Moody 0.0 NaN Drama|Horror|Thriller ... 3.0 English USA NaN 1400.0 2013.0 0.0 6.3 NaN 16
5041 Color Daniel Hsia 14.0 100.0 0.0 489.0 Daniel Henney 946.0 10443.0 Comedy|Drama|Romance ... 9.0 English USA PG-13 NaN 2012.0 719.0 6.3 2.35 660
5042 Color Jon Gunn 43.0 90.0 16.0 16.0 Brian Herzlinger 86.0 85222.0 Documentary ... 84.0 English USA PG 1100.0 2004.0 23.0 6.6 1.85 456
5043 rows × 28 columns
imdb.shape
# (5043, 28)
imdb.head()
# Output
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... 3054.0 English USA PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy ... 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller ... 994.0 English UK PG-13 245000000.0 2015.0 393.0 6.8 2.35 85000
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Action|Thriller ... 2701.0 English USA PG-13 250000000.0 2012.0 23000.0 8.5 2.35 164000
4 NaN Doug Walker NaN NaN 131.0 NaN Rob Walker 131.0 NaN Documentary ... NaN NaN NaN NaN NaN NaN 12.0 7.1 NaN 0
5 rows × 28 columns
imdb.tail(10)
# Output
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
5033 Color Shane Carruth 143.0 77.0 291.0 8.0 David Sullivan 291.0 424760.0 Drama|Sci-Fi|Thriller ... 371.0 English USA PG-13 7000.0 2004.0 45.0 7.0 1.85 19000
5034 Color Neill Dela Llana 35.0 80.0 0.0 0.0 Edgar Tancangco 0.0 70071.0 Thriller ... 35.0 English Philippines Not Rated 7000.0 2005.0 0.0 6.3 NaN 74
5035 Color Robert Rodriguez 56.0 81.0 0.0 6.0 Peter Marquardt 121.0 2040920.0 Action|Crime|Drama|Romance|Thriller ... 130.0 Spanish USA R 7000.0 1992.0 20.0 6.9 1.37 0
5036 Color Anthony Vallone NaN 84.0 2.0 2.0 John Considine 45.0 NaN Crime|Drama ... 1.0 English USA PG-13 3250.0 2005.0 44.0 7.8 NaN 4
5037 Color Edward Burns 14.0 95.0 0.0 133.0 Caitlin FitzGerald 296.0 4584.0 Comedy|Drama ... 14.0 English USA Not Rated 9000.0 2011.0 205.0 6.4 NaN 413
5038 Color Scott Smith 1.0 87.0 2.0 318.0 Daphne Zuniga 637.0 NaN Comedy|Drama ... 6.0 English Canada NaN NaN 2013.0 470.0 7.7 NaN 84
5039 Color NaN 43.0 43.0 NaN 319.0 Valorie Curry 841.0 NaN Crime|Drama|Mystery|Thriller ... 359.0 English USA TV-14 NaN NaN 593.0 7.5 16.00 32000
5040 Color Benjamin Roberds 13.0 76.0 0.0 0.0 Maxwell Moody 0.0 NaN Drama|Horror|Thriller ... 3.0 English USA NaN 1400.0 2013.0 0.0 6.3 NaN 16
5041 Color Daniel Hsia 14.0 100.0 0.0 489.0 Daniel Henney 946.0 10443.0 Comedy|Drama|Romance ... 9.0 English USA PG-13 NaN 2012.0 719.0 6.3 2.35 660
5042 Color Jon Gunn 43.0 90.0 16.0 16.0 Brian Herzlinger 86.0 85222.0 Documentary ... 84.0 English USA PG 1100.0 2004.0 23.0 6.6 1.85 456
10 rows × 28 columns
imdb['color']
# Output
0 Color
1 Color
2 Color
3 Color
4 NaN
...
5038 Color
5039 Color
5040 Color
5041 Color
5042 Color
Name: color, Length: 5043, dtype: object
imdb['color'][0]
# 'Color'
imdb['color'][1]
# 'Color'
imdb[['color','director_name']]
# Output
color director_name
0 Color James Cameron
1 Color Gore Verbinski
2 Color Sam Mendes
3 Color Christopher Nolan
4 NaN Doug Walker
... ... ...
5038 Color Scott Smith
5039 Color NaN
5040 Color Benjamin Roberds
5041 Color Daniel Hsia
5042 Color Jon Gunn
5043 rows × 2 columns
sub_df = imdb[['director_name','movie_title','imdb_score']]
sub_df
# Output
director_name movie_title imdb_score
0 James Cameron Avatar 7.9
1 Gore Verbinski Pirates of the Caribbean: At World's End 7.1
2 Sam Mendes Spectre 6.8
3 Christopher Nolan The Dark Knight Rises 8.5
4 Doug Walker Star Wars: Episode VII - The Force Awakens ... 7.1
... ... ... ...
5038 Scott Smith Signed Sealed Delivered 7.7
5039 NaN The Following 7.5
5040 Benjamin Roberds A Plague So Pleasant 6.3
5041 Daniel Hsia Shanghai Calling 6.3
5042 Jon Gunn My Date with Drew 6.6
5043 rows × 3 columns
sub_df.head()
# Output
director_name movie_title imdb_score
0 James Cameron Avatar 7.9
1 Gore Verbinski Pirates of the Caribbean: At World's End 7.1
2 Sam Mendes Spectre 6.8
3 Christopher Nolan The Dark Knight Rises 8.5
4 Doug Walker Star Wars: Episode VII - The Force Awakens ... 7.1
sub_df.head(5)
# Output
director_name movie_title imdb_score
0 James Cameron Avatar 7.9
1 Gore Verbinski Pirates of the Caribbean: At World's End 7.1
2 Sam Mendes Spectre 6.8
3 Christopher Nolan The Dark Knight Rises 8.5
4 Doug Walker Star Wars: Episode VII - The Force Awakens ... 7.1
sub_df.iloc[10:20,:] # iloc selects by position: rows 10 to 19, all columns
# Output
director_name movie_title imdb_score
10 Zack Snyder Batman v Superman: Dawn of Justice 6.9
11 Bryan Singer Superman Returns 6.1
12 Marc Forster Quantum of Solace 6.7
13 Gore Verbinski Pirates of the Caribbean: Dead Man's Chest 7.3
14 Gore Verbinski The Lone Ranger 6.5
15 Zack Snyder Man of Steel 7.2
16 Andrew Adamson The Chronicles of Narnia: Prince Caspian 6.6
17 Joss Whedon The Avengers 8.1
18 Rob Marshall Pirates of the Caribbean: On Stranger Tides 6.7
19 Barry Sonnenfeld Men in Black 3 6.8
sub_df.iloc[10:20,0:2]
# Output
director_name movie_title
10 Zack Snyder Batman v Superman: Dawn of Justice
11 Bryan Singer Superman Returns
12 Marc Forster Quantum of Solace
13 Gore Verbinski Pirates of the Caribbean: Dead Man's Chest
14 Gore Verbinski The Lone Ranger
15 Zack Snyder Man of Steel
16 Andrew Adamson The Chronicles of Narnia: Prince Caspian
17 Joss Whedon The Avengers
18 Rob Marshall Pirates of the Caribbean: On Stranger Tides
19 Barry Sonnenfeld Men in Black 3
tmp_df = sub_df.iloc[10:20,0:2]
tmp_df
# Output
director_name movie_title
10 Zack Snyder Batman v Superman: Dawn of Justice
11 Bryan Singer Superman Returns
12 Marc Forster Quantum of Solace
13 Gore Verbinski Pirates of the Caribbean: Dead Man's Chest
14 Gore Verbinski The Lone Ranger
15 Zack Snyder Man of Steel
16 Andrew Adamson The Chronicles of Narnia: Prince Caspian
17 Joss Whedon The Avengers
18 Rob Marshall Pirates of the Caribbean: On Stranger Tides
19 Barry Sonnenfeld Men in Black 3
tmp_df.iloc[2:4,:] # positions 2 and 3 of tmp_df, whose labels are 12 and 13
# Output
director_name movie_title
12 Marc Forster Quantum of Solace
13 Gore Verbinski Pirates of the Caribbean: Dead Man's Chest
tmp_df.loc[15:17,:] # loc selects by label, and the end label is included
# Output
director_name movie_title
15 Zack Snyder Man of Steel
16 Andrew Adamson The Chronicles of Narnia: Prince Caspian
17 Joss Whedon The Avengers
tmp_df.loc[15:17,:'movie_title']
# Output
director_name movie_title
15 Zack Snyder Man of Steel
16 Andrew Adamson The Chronicles of Narnia: Prince Caspian
17 Joss Whedon The Avengers
tmp_df.loc[15:17,:'director_name']
# Output
director_name
15 Zack Snyder
16 Andrew Adamson
17 Joss Whedon
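The difference between iloc and loc is easiest to see on a frame whose index labels do not start at 0, like tmp_df above. The sketch below builds a small made-up frame labelled 10 to 17 (placeholder values, not the movie data) and applies both selectors.

```python
import pandas as pd

# Placeholder data; the index labels 10..17 mimic a slice taken from a larger frame.
df = pd.DataFrame(
    {
        "director_name": list("abcdefgh"),
        "movie_title": [f"title_{i}" for i in range(8)],
    },
    index=range(10, 18),
)

# iloc: positions, end excluded -> rows at positions 2 and 3.
by_position = df.iloc[2:4, :]

# loc: labels, end included -> rows labelled 15, 16 and 17.
by_label = df.loc[15:17, :]

# Label slices work on columns too; the end column is included.
first_col_only = df.loc[15:17, :"director_name"]

print(by_position.index.tolist())
print(by_label.index.tolist())
print(first_col_only.columns.tolist())
```

A slice such as 2:4 therefore returns two rows with iloc, while a label slice 15:17 returns three rows with loc.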
3-6 Reindexing Series and DataFrame
In Jupyter notebook, create a new file Reindexing Series and DataFrame.ipynb
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
# series reindex
s1 = Series([1,2,3,4], index=['A','B','C','D'])
s1
# Output
A 1
B 2
C 3
D 4
dtype: int64
# s1.reindex() # with the cursor on the method, press Shift+Tab to pop up its documentation; press repeatedly for more detail
s1.reindex(index=['A','B','C','D','E'])
# Output
A 1.0
B 2.0
C 3.0
D 4.0
E NaN
dtype: float64
s1.reindex(index=['A','B','C','D','E'],fill_value=0)
# Output
A 1
B 2
C 3
D 4
E 0
dtype: int64
s1.reindex(index=['A','B','C','D','E'],fill_value=10)
# Output
A 1
B 2
C 3
D 4
E 10
dtype: int64
s2 = Series(['A','B','C'], index=[1,5,10])
s2
# Output
1 A
5 B
10 C
dtype: object
s2.reindex(index=range(15))
# Output
0 NaN
1 A
2 NaN
3 NaN
4 NaN
5 B
6 NaN
7 NaN
8 NaN
9 NaN
10 C
11 NaN
12 NaN
13 NaN
14 NaN
dtype: object
s2.reindex(index=range(15),method='ffill')
# Output
0 NaN
1 A
2 A
3 A
4 A
5 B
6 B
7 B
8 B
9 B
10 C
11 C
12 C
13 C
14 C
dtype: object
# reindex dataframe
df1 = DataFrame(np.random.rand(25).reshape([5,5]))
df1
# Output
0 1 2 3 4
0 0.255424 0.315708 0.951327 0.423676 0.975377
1 0.087594 0.192460 0.502268 0.534926 0.423024
2 0.817002 0.113410 0.468270 0.410297 0.278942
3 0.315239 0.018933 0.133764 0.240001 0.910754
4 0.267342 0.451077 0.282865 0.170235 0.898429
df1 = DataFrame(np.random.rand(25).reshape([5,5]), index=['A','B','D','E','F'], columns=['c1','c2','c3','c4','c5'])
df1
# Output
c1 c2 c3 c4 c5
A 0.278063 0.894546 0.932129 0.178442 0.303684
B 0.186239 0.260677 0.708358 0.275914 0.369878
D 0.786987 0.125907 0.191987 0.338194 0.009877
E 0.192269 0.909661 0.227301 0.343989 0.610203
F 0.503267 0.306472 0.197467 0.063800 0.813786
df1.reindex(index=['A','B','C','D','E','F'])
# Output
c1 c2 c3 c4 c5
A 0.278063 0.894546 0.932129 0.178442 0.303684
B 0.186239 0.260677 0.708358 0.275914 0.369878
C NaN NaN NaN NaN NaN
D 0.786987 0.125907 0.191987 0.338194 0.009877
E 0.192269 0.909661 0.227301 0.343989 0.610203
F 0.503267 0.306472 0.197467 0.063800 0.813786
df1.reindex(columns=['c1','c2','c3','c4','c5','c6'])
# Output
c1 c2 c3 c4 c5 c6
A 0.278063 0.894546 0.932129 0.178442 0.303684 NaN
B 0.186239 0.260677 0.708358 0.275914 0.369878 NaN
D 0.786987 0.125907 0.191987 0.338194 0.009877 NaN
E 0.192269 0.909661 0.227301 0.343989 0.610203 NaN
F 0.503267 0.306472 0.197467 0.063800 0.813786 NaN
df1.reindex(index=['A','B','C','D','E','F'],columns=['c1','c2','c3','c4','c5','c6'])
# Output
c1 c2 c3 c4 c5 c6
A 0.278063 0.894546 0.932129 0.178442 0.303684 NaN
B 0.186239 0.260677 0.708358 0.275914 0.369878 NaN
C NaN NaN NaN NaN NaN NaN
D 0.786987 0.125907 0.191987 0.338194 0.009877 NaN
E 0.192269 0.909661 0.227301 0.343989 0.610203 NaN
F 0.503267 0.306472 0.197467 0.063800 0.813786 NaN
s1
# Output
A 1
B 2
C 3
D 4
dtype: int64
s1.reindex(index=['A','B'])
# Output
A 1
B 2
dtype: int64
df1
# Output
c1 c2 c3 c4 c5
A 0.278063 0.894546 0.932129 0.178442 0.303684
B 0.186239 0.260677 0.708358 0.275914 0.369878
D 0.786987 0.125907 0.191987 0.338194 0.009877
E 0.192269 0.909661 0.227301 0.343989 0.610203
F 0.503267 0.306472 0.197467 0.063800 0.813786
df1.reindex(index=['A','B'])
# Output
c1 c2 c3 c4 c5
A 0.278063 0.894546 0.932129 0.178442 0.303684
B 0.186239 0.260677 0.708358 0.275914 0.369878
s1
# Output
A 1
B 2
C 3
D 4
dtype: int64
s1.drop('A')
# Output
B 2
C 3
D 4
dtype: int64
df1
# Output
c1 c2 c3 c4 c5
A 0.278063 0.894546 0.932129 0.178442 0.303684
B 0.186239 0.260677 0.708358 0.275914 0.369878
D 0.786987 0.125907 0.191987 0.338194 0.009877
E 0.192269 0.909661 0.227301 0.343989 0.610203
F 0.503267 0.306472 0.197467 0.063800 0.813786
df1.drop('A',axis=0)
# Output
c1 c2 c3 c4 c5
B 0.186239 0.260677 0.708358 0.275914 0.369878
D 0.786987 0.125907 0.191987 0.338194 0.009877
E 0.192269 0.909661 0.227301 0.343989 0.610203
F 0.503267 0.306472 0.197467 0.063800 0.813786
df1.drop('c1',axis=0)
# Raises KeyError: 'c1' is a column label, not a row label
df1.drop('c1',axis=1)
# Output
c2 c3 c4 c5
A 0.894546 0.932129 0.178442 0.303684
B 0.260677 0.708358 0.275914 0.369878
D 0.125907 0.191987 0.338194 0.009877
E 0.909661 0.227301 0.343989 0.610203
F 0.306472 0.197467 0.063800 0.813786
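The reindexing behaviour walked through in this section can be condensed into one runnable check. This is a minimal sketch that reuses the small Series from the notes; the variable names (`widened`, `filled`, `narrowed`, `ffilled`, `dropped`) are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["A", "B", "C", "D"])

widened = s.reindex(["A", "B", "C", "D", "E"])               # new label E becomes NaN
filled = s.reindex(["A", "B", "C", "D", "E"], fill_value=0)  # new label E becomes 0
narrowed = s.reindex(["A", "B"])                             # keeps only A and B

# Forward fill copies the last known value onto the new labels that follow it.
sparse = pd.Series(["A", "B", "C"], index=[1, 5, 10])
ffilled = sparse.reindex(range(15), method="ffill")

dropped = s.drop("A")  # returns a new Series; s itself is unchanged

print(widened.dtype, filled.dtype)
```

Introducing a NaN forces the integer Series to float64, whereas supplying `fill_value` keeps the integer dtype, which matches the outputs shown above.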
3-7 About NaN
In Jupyter Notebook, create a new file: 谈一谈NaN.ipynb
# NaN - means Not a Number
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
n = np.nan
type(n)
# float
m = 1
m + n
# nan
# NaN in Series
s1 = Series([1, 2, np.nan, 3, 4], index=['A','B','C','D','E'])
s1
# Output
A 1.0
B 2.0
C NaN
D 3.0
E 4.0
dtype: float64
s1.isnull()
# Output
A False
B False
C True
D False
E False
dtype: bool
s1.notnull()
# Output
A True
B True
C False
D True
E True
dtype: bool
s1
# Output
A 1.0
B 2.0
C NaN
D 3.0
E 4.0
dtype: float64
s1.dropna()
# Output
A 1.0
B 2.0
D 3.0
E 4.0
dtype: float64
# NaN in DataFrame
dframe = DataFrame([[1,2,3],[np.nan,5,6],[7,np.nan,9],[np.nan,np.nan,np.nan]])
dframe
# Output
0 1 2
0 1.0 2.0 3.0
1 NaN 5.0 6.0
2 7.0 NaN 9.0
3 NaN NaN NaN
dframe.isnull()
# Output
0 1 2
0 False False False
1 True False False
2 False True False
3 True True True
dframe.notnull()
# Output
0 1 2
0 True True True
1 False True True
2 True False True
3 False False False
df1 = dframe.dropna(axis=0)
df1
# Output
0 1 2
0 1.0 2.0 3.0
df1 = dframe.dropna(axis=1)
df1
# Output (every column contains a NaN, so only the row index remains)
0
1
2
3
df1 = dframe.dropna(axis=1,how='any')
df1
# Output
0
1
2
3
df1 = dframe.dropna(axis=0,how='any')
df1
# Output
0 1 2
0 1.0 2.0 3.0
df1 = dframe.dropna(axis=0,how='all')
df1
# Output
0 1 2
0 1.0 2.0 3.0
1 NaN 5.0 6.0
2 7.0 NaN 9.0
dframe2 = DataFrame([[1,2,3,np.nan],[2,np.nan,5,6],[np.nan,7,np.nan,9],[1,np.nan,np.nan,np.nan]])
dframe2
# Output
0 1 2 3
0 1.0 2.0 3.0 NaN
1 2.0 NaN 5.0 6.0
2 NaN 7.0 NaN 9.0
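3 1.0 NaN NaN NaN
df2 = dframe2.dropna(thresh=None)
df2
# Output (empty: every row contains at least one NaN)
0 1 2 3
df2 = dframe2.dropna(thresh=2)
df2
# Output (rows with at least 2 non-NaN values are kept)
0 1 2 3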
0 1.0 2.0 3.0 NaN
1 2.0 NaN 5.0 6.0
2 NaN 7.0 NaN 9.0
dframe2
# Output
0 1 2 3
0 1.0 2.0 3.0 NaN
1 2.0 NaN 5.0 6.0
2 NaN 7.0 NaN 9.0
3 1.0 NaN NaN NaN
dframe2.fillna(value=1)
# Output
0 1 2 3
0 1.0 2.0 3.0 1.0
1 2.0 1.0 5.0 6.0
2 1.0 7.0 1.0 9.0
3 1.0 1.0 1.0 1.0
dframe2.fillna(value={0:0,1:1,2:2,3:3}) # fill each column with its own value
# Output
0 1 2 3
0 1.0 2.0 3.0 3.0
1 2.0 1.0 5.0 6.0
2 0.0 7.0 2.0 9.0
3 1.0 1.0 2.0 3.0
df1
# Output
0 1 2
0 1.0 2.0 3.0
1 NaN 5.0 6.0
2 7.0 NaN 9.0
df2
# Output
0 1 2 3
0 1.0 2.0 3.0 NaN
1 2.0 NaN 5.0 6.0
2 NaN 7.0 NaN 9.0
df1.dropna()
# Output
0 1 2
0 1.0 2.0 3.0
df1.fillna(1)
# Output
0 1 2
0 1.0 2.0 3.0
1 1.0 5.0 6.0
2 7.0 1.0 9.0
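To make the `thresh` and per-column `fillna` behaviour above easier to verify, here is a minimal sketch on the same 4x4 data as `dframe2`. The names `frame`, `non_null_per_row`, `kept` and `per_column` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1, 2, 3, np.nan],
                      [2, np.nan, 5, 6],
                      [np.nan, 7, np.nan, 9],
                      [1, np.nan, np.nan, np.nan]])

# Count the non-NaN values in each row: 3, 3, 2, 1.
non_null_per_row = frame.notnull().sum(axis=1)

# thresh=2 keeps rows that have at least 2 non-NaN values, so row 3 is dropped.
kept = frame.dropna(thresh=2)

# A dict maps each column label to the value used to fill that column.
per_column = frame.fillna(value={0: 0, 1: 1, 2: 2, 3: 3})

# NaN is a float and never equals itself, so test with isnull() or np.isnan().
print(type(np.nan), np.nan == np.nan)
```

Counting non-null values first is a quick way to predict which rows a given `thresh` will keep.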
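Chapter 5 Plotting and Visualization with Matplotlib
Data visualization is a very important part of data analysis. This chapter covers the basics of Matplotlib, including how to plot Pandas Series and DataFrame objects and how to set figure styles and display modes.
5-1 Introduction to Matplotlib
Why plot with Python?
- GUI tools are too complicated
- Excel is too painful
- Python is simple and free (sorry, Matlab)
What is matplotlib?
- A Python package
- Used for 2D plotting
- Very powerful and popular
- Has many extensions
Hello world in matplotlib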
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
x = np.linspace(0,2*np.pi,100)
y = np.sin(x)
plt.plot(x,y)
Matplotlib
- Backend: handles where the figure is drawn and where it is displayed
- Artist: determines what the figure looks like
- Scripting: pyplot, Python syntax and API
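The three layers listed above can be touched in a few lines. This is a minimal sketch under one assumption: it selects the non-interactive Agg backend so that it runs outside a notebook, whereas the notes themselves rely on `%matplotlib inline` in Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: a backend that renders off-screen, without a window
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

lines = plt.plot(x, np.sin(x))   # scripting layer: pyplot creates the figure and axes
line = lines[0]                  # artist layer: plot() returns Line2D objects
line.set_linestyle("--")         # artists carry the visual properties
line.set_color("red")

backend_name = matplotlib.get_backend()  # backend layer: where the figure is rendered
print(backend_name, type(line).__name__)
```

The format string `'r--'` used later in these notes is a shorthand for the two `set_*` calls made on the artist here.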
References
5-2 Simple Plotting with matplotlib: plot
In Jupyter Notebook, create a new file: matplotlib的简单绘图-plot.ipynb
import numpy as np
import matplotlib.pyplot as plt
a = [1, 2, 3]
plt.plot(a)
# [<matplotlib.lines.Line2D at 0x20619607f40>]
# plt.show()  # not needed in newer versions; the figure is displayed directly
a = [1, 2, 3]
b = [4, 5, 6]
plt.plot(a, b)
a = [1, 2, 3]
b = [4, 5, 6, 7]
plt.plot(a, b)
# ValueError: x and y must have same first dimension, but have shapes (3,) and (4,)
a = [1, 2, 3]
b = [4, 5, 6]
# %matplotlib inline  # in older versions this line makes show() unnecessary; newer versions do not need it
plt.plot(a, b)
%timeit np.arange(10)
# 464 ns ± 25.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
plt.plot(a,b,'*')
plt.plot(a,b,'--')
plt.plot(a,b,'r--')
plt.plot(a,b,'w--') # the line is white, so it cannot be seen
c = [10, 8, 6]
d = [1, 8, 3]
plt.plot(a, b, c, d)
c = [10, 8, 6]
d = [1, 8, 3]
plt.plot(a, b, 'r--', c, d, 'b*')
t = np.arange(0.0, 2.0, 0.1)
t
# array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2,1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9])
t.size
# 20
s = np.sin(t*np.pi)
s
# Output
array([ 0.00000000e+00, 3.09016994e-01, 5.87785252e-01, 8.09016994e-01,9.51056516e-01, 1.00000000e+00, 9.51056516e-01, 8.09016994e-01,5.87785252e-01, 3.09016994e-01, 1.22464680e-16, -3.09016994e-01,-5.87785252e-01, -8.09016994e-01, -9.51056516e-01, -1.00000000e+00,-9.51056516e-01, -8.09016994e-01, -5.87785252e-01, -3.09016994e-01])
plt.plot(t, s)
plt.plot(t, s, 'r--')
plt.plot(t, s, 'r--', t*2, s)
plt.plot(t, s, 'r--', t*2, s, '--')
plt.plot(t, s, 'r--', t*2, s, '--')
plt.xlabel('this is x')
plt.ylabel('this is y')
plt.title('this is a demo')
plt.plot(t, s, 'r--', label='aaaa')
plt.plot(t*2, s, 'b--', label='bbbb')
plt.xlabel('this is x')
plt.ylabel('this is y')
plt.title('this is a demo')
plt.legend()
5-3 Simple Plotting with matplotlib: subplot
In Jupyter Notebook, create a new file: matplotlib简单绘图之subplot.ipynb
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0.0, 5.0)
y1 = np.sin(np.pi*x)
y2 = np.sin(np.pi*x*2)
plt.plot(x, y1, 'b--', label='sin(pi*x)')
plt.ylabel('y1 value')
plt.plot(x, y2, 'r--', label='sin(pi*2x)')
plt.ylabel('y2 value')
plt.xlabel('x value')
plt.title('this is x-y value')
plt.legend()
plt.subplot(2, 1, 1)
plt.plot(x, y1, 'b--')
plt.ylabel('y1')
plt.subplot(2, 1, 2)
plt.plot(x, y2, 'r--')
plt.ylabel('y2')
plt.xlabel('x')
plt.subplot(2, 2, 1)
plt.plot(x, y1, 'b--')
plt.ylabel('y1')
plt.subplot(2, 2, 2)
plt.plot(x, y2, 'r--')
plt.ylabel('y2')
plt.xlabel('x')
plt.subplot(2, 2, 3)
plt.plot(x, y1, 'r*')
plt.subplot(2, 2, 4)
plt.plot(x, y1, 'b*')
plt.subplot(221)
plt.plot(x, y1, 'b--')
plt.ylabel('y1')
plt.subplot(222)
plt.plot(x, y2, 'r--')
plt.ylabel('y2')
plt.xlabel('x')
plt.subplot(223)
plt.plot(x, y1, 'r*')
plt.subplot(224)
plt.plot(x, y1, 'b*')
a = plt.subplots()
a
type(a)
# tuple
a[0]
a[1]
# <AxesSubplot:>
figure, ax = plt.subplots()
ax.plot([1, 2, 3, 4, 5])
figure, ax = plt.subplots(2, 2)
ax.plot([1, 2, 3, 4, 5])
# AttributeError: 'numpy.ndarray' object has no attribute 'plot'
figure, ax = plt.subplots(2, 2)
ax
ax[0][0].plot(x, y1)
ax[0][1].plot(x, y2)
# [<matplotlib.lines.Line2D at 0x23c99d1df70>]
figure # older versions: plt.show()
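The AttributeError above arises because `plt.subplots(2, 2)` returns a NumPy array of Axes rather than a single Axes. The following is a minimal sketch of one way to work with that array; it assumes the Agg backend so it can run outside a notebook, and the panel titles are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen instead of inline in Jupyter
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0.0, 5.0)
fig, axes = plt.subplots(2, 2)        # axes is a 2x2 NumPy array of Axes objects

axes[0, 0].plot(x, np.sin(np.pi * x), "b--")
axes[0, 1].plot(x, np.sin(2 * np.pi * x), "r--")

# ravel() flattens the grid so every panel can be handled in one loop.
for i, ax in enumerate(axes.ravel()):
    ax.set_title(f"panel {i + 1}")

fig.tight_layout()                    # keep titles and tick labels from overlapping
```

Indexing with `axes[0, 0]` is equivalent to the `ax[0][0]` form used in the notes.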
5-4 Plotting with Pandas: Series
In Jupyter Notebook, create a new file: Pandas绘图之Series.ipynb
import numpy as np
import pandas as pd
from pandas import Series
import matplotlib.pyplot as plt
s = Series([1, 2, 3, 4, 5])
s
# Output
0 1
1 2
2 3
3 4
4 5
dtype: int64
s.cumsum()
# Output
0 1
1 3
2 6
3 10
4 15
dtype: int64
s1 = Series(np.random.randn(100)).cumsum()
s1.plot()
s1.plot(kind='bar')
s1 = Series(np.random.randn(1000)).cumsum()
s1.plot(kind='bar')
s1.plot(kind='line')
s1.plot(kind='line', grid=True)
s1.plot(kind='line', grid=True, label='S1', title='This is Series')
plt.legend()
s1.plot(kind='line', grid=True, label='S1', title='This is Series', style='--')
plt.legend()
s1 = Series(np.random.randn(1000)).cumsum()
s2 = Series(np.random.randn(1000)).cumsum()
s1.plot(kind='line', grid=True, label='S1', title='This is Series')
s2.plot(label='S2')
fig, ax = plt.subplots(2,1)
ax
ax[0].plot(s1)
ax[1].plot(s2)
s1.plot(ax=ax[0], label='S1')
s2.plot(ax=ax[1], label='S2')
fig
s1[0:10].plot(ax=ax[0], label='S1', kind='bar')
fig
s1 = Series(np.random.randn(10)).cumsum()
s2 = Series(np.random.randn(10)).cumsum()
fig, ax = plt.subplots(2,1)
s1[0:10].plot(ax=ax[0], label='S1', kind='bar')
s2.plot(ax=ax[1], label='S2')
plt.show()
5-5 Plotting with Pandas: DataFrame
In Jupyter Notebook, create a new file: Pandas绘图之DataFrame.ipynb
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import Series, DataFrame
df = DataFrame(np.random.randint(1, 10, 40).reshape(10,4), columns=['A','B','C','D'])
df
# Output
A B C D
0 6 9 5 4
1 6 3 3 3
2 9 3 8 1
3 4 3 9 9
4 4 5 3 7
5 2 2 4 4
6 5 8 9 9
7 6 8 3 1
8 1 5 6 6
9 5 1 5 1
df.plot()
df.plot(kind='bar')
df.plot(kind='barh')
df.plot(kind='bar', stacked=True)
df.plot(kind='area')
df.iloc[5]
# Output
A 2
B 2
C 4
D 4
Name: 5, dtype: int32
a = df.iloc[5]
type(a)
# pandas.core.series.Series
df.iloc[5].plot()
for i in df.index:
    df.iloc[i].plot(label=str(i))
plt.legend()
plt.show()
df['A']
# Output
0 6
1 6
2 9
3 4
4 4
5 2
6 5
7 6
8 1
9 5
Name: A, dtype: int32
df['A'].plot()
df.plot()
df
# Output
A B C D
0 6 9 5 4
1 6 3 3 3
2 9 3 8 1
3 4 3 9 9
4 4 5 3 7
5 2 2 4 4
6 5 8 9 9
7 6 8 3 1
8 1 5 6 6
9 5 1 5 1
df.T
# Output
0 1 2 3 4 5 6 7 8 9
A 6 6 9 4 4 2 5 6 1 5
B 9 3 3 3 5 2 8 8 5 1
C 5 3 8 9 3 4 9 3 6 5
D 4 3 1 9 7 4 9 1 6 1
df.T.plot()
5-6 Histograms and Density Plots
In Jupyter Notebook, create a new file: matplotlib里的直方图和密度图.ipynb
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import Series, DataFrame
# Histogram
s = Series(np.random.randn(1000))
plt.hist(s)
plt.hist(s, rwidth=0.9)
a = np.arange(10)
a
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
plt.hist(a, rwidth=0.9)
s
# Output
0 1.234429
1 -0.157434
2 -0.587175
3 0.583882
4 -0.612201
...
995 -0.155292
996 -1.126121
997 -0.173816
998 0.608625
999 -0.862248
Length: 1000, dtype: float64
re = plt.hist(s, rwidth=0.9)
re
# Output
(array([ 6., 22., 55., 127., 210., 234., 171., 115., 43., 17.]),
array([-3.24995597, -2.65054559, -2.0511352 , -1.45172482, -0.85231444,-0.25290405, 0.34650633, 0.94591672, 1.5453271 , 2.14473748,2.74414787]),
<BarContainer object of 10 artists>)
type(re)
# tuple
len(re)
# 3
re[0]
# array([ 6., 22., 55., 127., 210., 234., 171., 115., 43., 17.])
re[1]
# array([-3.24995597, -2.65054559, -2.0511352 , -1.45172482, -0.85231444,-0.25290405, 0.34650633, 0.94591672, 1.5453271 , 2.14473748,2.74414787])
re[2]
# <BarContainer object of 10 artists>
plt.hist(s)
plt.hist(s, rwidth=0.9)
plt.hist(s, rwidth=0.9, bins=20)
plt.hist(s, rwidth=0.9, bins=200)
plt.hist(s, rwidth=0.9, bins=20, color='r')
s.plot()
s.plot(kind='kde')
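The three-element tuple returned by `plt.hist` above holds the bin counts, the bin edges and the drawn bars. The following is a minimal sketch that unpacks it; the fixed seed and the Agg backend are assumptions made only so that the example is repeatable outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen instead of inline in Jupyter
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)        # fixed seed, chosen only for repeatability
values = rng.standard_normal(1000)

# hist returns (counts per bin, bin edges, container of bar patches).
counts, edges, patches = plt.hist(values, bins=20, rwidth=0.9)

widths = np.diff(edges)               # bins have equal width by default
print(len(counts), len(edges), counts.sum())
```

There is always one more edge than there are bins, and the counts add up to the number of samples, which is a handy sanity check when changing `bins`.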
https://matplotlib.org/stable/tutorials/index
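Chapter 6 Plotting and Visualization with Seaborn
Seaborn is a higher-level wrapper around Matplotlib. Its powerful color palettes and wide variety of built-in plot types have made it one of the most popular plotting tools in data science. This chapter covers the basics of Seaborn and compares its features with those of matplotlib.
6-1 Introduction to seaborn
Seaborn - Powerful Matplotlib Extension
Statistical data visualization
What are Seaborn's advantages?
In Jupyter Notebook, create a new file: Seaborn和matplotlib对比.ipynb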
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# %matplotlib inline is optional in newer versions
%matplotlib inline
iris = pd.read_csv('../homework/iris.csv')
iris.head()
# Output
SepalLength SepalWidth PetalLength PetalWidth Name
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
# Goal: draw a scatter plot of petal length against sepal length, with the point color distinguishing the iris species
iris.Name.unique()
# array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)
color_map = dict(zip(iris.Name.unique(), ['blue', 'green', 'red']))
for species, group in iris.groupby('Name'):
    plt.scatter(group['PetalLength'], group['SepalLength'],
                color=color_map[species],
                alpha=0.3, edgecolor=None,
                label=species)
plt.legend(frameon=True, title='Name')
plt.xlabel('petalLength')
plt.ylabel('sepalLength')
# The same plot in one seaborn call; keyword arguments are used because newer seaborn versions no longer accept x and y positionally
sns.lmplot(x='PetalLength', y='SepalLength', data=iris, hue='Name', fit_reg=False)
https://www.youtube.com/watch?v=FytuB8nFHPQ
6-2 Histograms and Density Plots with seaborn
In Jupyter Notebook, create a new file: seaborn实现直方图和密度图.ipynb
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import Series, DataFrame
%matplotlib inline
import seaborn as sns
s1 = Series(np.random.randn(100))
plt.hist(s1)
s1.plot(kind='kde')
sns.distplot(s1, hist=True, kde=True)
sns.distplot(s1, hist=True, kde=False)
sns.distplot(s1, hist=False, kde=True)
sns.distplot(s1, hist=False, kde=True, rug=True)
sns.distplot(s1, bins=20, hist=False, kde=True, rug=True)
sns.distplot(s1, bins=20, hist=True, kde=False, rug=True)
sns.kdeplot(s1)
sns.kdeplot(s1, shade=True)
sns.kdeplot(s1, shade=True, color='r')
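sns.plt.plot(s1) # equivalent to plt.plot(s1); worked in old seaborn versions but does not seem to work in newer ones
sns.plt.hist(s1) # equivalent to plt.hist(s1); worked in old seaborn versions but does not seem to work in newer ones
# Homework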
sns.rugplot(s1)
6-3 Bar Plots and Heatmaps with seaborn
In Jupyter Notebook, create a new file: seaborn实现柱状图和热力图.ipynb
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import Series, DataFrame
%matplotlib inline
df = sns.load_dataset('flights') # the source of the flights dataset is given below
df.head()
# Output
year month passengers
0 1949 Jan 112
1 1949 Feb 118
2 1949 Mar 132
3 1949 Apr 129
4 1949 May 121
df.shape
# (144, 3)
https://github.com/mwaskom/seaborn-data
df = df.pivot(index='month', columns='year', values='passengers')
df
# Output
year 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960
month
Jan 112 115 145 171 196 204 242 284 315 340 360 417
Feb 118 126 150 180 196 188 233 277 301 318 342 391
Mar 132 141 178 193 236 235 267 317 356 362 406 419
Apr 129 135 163 181 235 227 269 313 348 348 396 461
May 121 125 172 183 229 234 270 318 355 363 420 472
Jun 135 149 178 218 243 264 315 374 422 435 472 535
Jul 148 170 199 230 264 302 364 413 465 491 548 622
Aug 148 170 199 242 272 293 347 405 467 505 559 606
Sep 136 158 184 209 237 259 312 355 404 404 463 508
Oct 119 133 162 191 211 229 274 306 347 359 407 461
Nov 104 114 146 172 180 203 237 271 305 310 362 390
Dec 118 140 166 194 201 229 278 306 336 337 405 432
sns.heatmap(df)
df.plot()
sns.heatmap(df, annot=True)
sns.heatmap(df, annot=True, fmt='d')
df.sum()
# Output
year
1949 1520
1950 1676
1951 2042
1952 2364
1953 2700
1954 2867
1955 3408
1956 3939
1957 4421
1958 4572
1959 5140
1960 5714
dtype: int64
s = df.sum()
s.index
# Int64Index([1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959,1960],dtype='int64', name='year')
s.values
# array([1520, 1676, 2042, 2364, 2700, 2867, 3408, 3939, 4421, 4572, 5140,5714], dtype=int64)
sns.barplot(x=s.index, y=s.values)
type(s)
# pandas.core.series.Series
s.plot(kind='bar')
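The reshaping step that feeds both the heatmap and the bar plot can be tried without downloading the flights dataset. This is a minimal sketch on a hypothetical four-row table with made-up passenger numbers; only the column names are borrowed from the notes.

```python
import pandas as pd

# Hypothetical miniature of the flights table (made-up numbers, not the real dataset).
long_df = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [10, 20, 30, 40],
})

# Long format -> wide format: one row per month, one column per year.
wide = long_df.pivot(index="month", columns="year", values="passengers")

yearly_total = wide.sum()          # column-wise sum: one total per year
monthly_total = wide.sum(axis=1)   # row-wise sum: one total per month

print(wide)
print(yearly_total)
```

`wide` has the shape that `sns.heatmap` expects, and `yearly_total` is a Series whose index and values correspond to the `x` and `y` arguments passed to `sns.barplot` above.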
6-4 Setting the Display Style of seaborn Figures
In Jupyter Notebook, create a new file: seaborn设置图形显示效果.ipynb
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# 1. axes_style and set_style
%matplotlib inline
x = np.linspace(0, 14, 100)
y1 = np.sin(x)
y2 = np.sin(x+2)*1.25
plt.plot(x, y1)
plt.plot(x, y2)
def sinplot():
    plt.plot(x, y1)
    plt.plot(x, y2)
sinplot()
import seaborn as sns
sinplot()
style = ['darkgrid', 'dark', 'white', 'whitegrid', 'ticks']
sns.set_style(style[0]) # try the other styles in the list as well
sinplot()
sns.axes_style()
sns.set_style(style[0], {'grid.color':'red'})
sinplot()
sns.axes_style()
sns.set()
sinplot()
sns.set_style('white')
sinplot()
sns.set()
sinplot()
# 2. plotting_context() and set_context()
context = ['paper', 'notebook', 'talk', 'poster']
sns.set_context(context[3])
sinplot()
sns.plotting_context()
sns.set_context(context[1], rc={'grid.linewidth':1.0})
sinplot()
sns.set_context(context[1], rc={'grid.linewidth':3.0})
sinplot()
sns.plotting_context()
sns.set()
sns.plotting_context()
6-5 seaborn's Powerful Color Palettes
In Jupyter Notebook, create a new file: seaborn强大的调色功能.ipynb
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
def sinplot():
    x = np.linspace(0, 14, 100)
    plt.figure(figsize=(8, 6))
    for i in range(4):
        plt.plot(x, np.sin(x+i)*(i+0.75), label="sin(x+%s)*(%s+0.75)" % (i,i))
    plt.legend()
sinplot()
import seaborn as sns
sinplot()
import seaborn as sns
sns.set()
sinplot()
sns.color_palette() # RGB
sns.palplot(sns.color_palette())
pal_style = ['deep', 'muted', 'pastel', 'bright', 'dark', 'colorblind']
sns.palplot(sns.color_palette('dark'))
sns.set_palette(sns.color_palette('dark'))
sns.color_palette()
sinplot()
sns.set()
sinplot()
with sns.color_palette('dark'):
    sinplot()
sinplot()
sns.color_palette()
pal1 = sns.color_palette([(0.5, 0.1, 0.7),(0.3, 0.1, 0.9)])
sns.palplot(pal1)
sns.color_palette('hls', 8)
http://seaborn.pydata.org/tutorial
Choosing color palettes
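Chapter 7 Data Analysis Project in Practice
After the first six chapters, we have a working grasp of the main tools used in data analysis. This chapter walks through a hands-on stock market analysis project, applying what we have learned to analyze data and extract useful information.
7-1 Preparation
Data science workflow
Bilibili site-wide video information crawler: https://github.com/chenjiandongx/bili-spider
What public data sources are available for data analysis and mining?
- https://www.zhihu.com/question/19969760
- https://www.quora.com/Data/Where-can-I-find-large-datasets-open-to-the-public
- https://www.kaggle.com/datasets
Google Public Data Explorer: https://www.google.com/publicdata/directory
AWS Public Datasets: https://aws.amazon.com/public-datasets/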
7-2 Stock Market Analysis in Practice: Getting the Data
In Jupyter Notebook, create a new file: 股票市场分析实战--数据获取.ipynb
http://finance.yahoo.com/
Search for Alibaba: BABA
https://pandas-datareader.readthedocs.io/en/latest/
conda install pandas-datareader # install pandas-datareader
import pandas_datareader as pdr
alibaba = pdr.get_data_yahoo('BABA')
alibaba.head()
# Output
High Low Open Close Volume Adj Close
Date
2017-02-22 105.199997 102.419998 102.480003 104.199997 15779400 104.199997
2017-02-23 104.860001 101.820000 104.720001 102.459999 10095600 102.459999
2017-02-24 103.000000 101.300003 101.389999 102.949997 7356600 102.949997
2017-02-27 103.824997 102.220001 102.500000 103.599998 6865900 103.599998
2017-02-28 103.989998 102.029999 103.889999 102.900002 8017500 102.900002
alibaba.shape
# (1259, 6)
alibaba.tail()
# Output
High Low Open Close Volume Adj Close
Date
2022-02-14 122.384003 119.370003 120.559998 121.919998 13154900 121.919998
2022-02-15 126.805000 123.377998 123.769997 126.239998 14634700 126.239998
2022-02-16 127.580002 124.510002 125.610001 125.559998 17993300 125.559998
2022-02-17 129.399994 124.056000 125.000000 124.430000 15906000 124.430000
2022-02-18 120.879997 117.199997 120.879997 118.989998 21165800 118.989998
alibaba.describe()
# Output
High Low Open Close Volume Adj Close
count 1259.000000 1259.000000 1259.000000 1259.000000 1.259000e+03 1259.000000
mean 189.265396 184.532867 187.110629 186.942844 1.848556e+07 186.942844
std 43.814601 42.927590 43.471982 43.413098 1.035043e+07 43.413098
min 103.000000 101.300003 101.389999 102.309998 4.120700e+06 102.309998
25% 163.034500 158.815002 161.314995 160.599998 1.227665e+07 160.599998
50% 183.100006 178.520004 180.800003 180.839996 1.617560e+07 180.839996
75% 213.695000 209.366501 211.195000 211.480003 2.129500e+07 211.480003
max 319.320007 308.910004 313.500000 317.140015 1.418300e+08 317.140015
alibaba.info
# Output (without parentheses, the bound method itself is displayed)
<bound method DataFrame.info of High Low Open Close Volume \
Date
2017-02-22 105.199997 102.419998 102.480003 104.199997 15779400
2017-02-23 104.860001 101.820000 104.720001 102.459999 10095600
2017-02-24 103.000000 101.300003 101.389999 102.949997 7356600
2017-02-27 103.824997 102.220001 102.500000 103.599998 6865900
2017-02-28 103.989998 102.029999 103.889999 102.900002 8017500
... ... ... ... ... ...
2022-02-14 122.384003 119.370003 120.559998 121.919998 13154900
2022-02-15 126.805000 123.377998 123.769997 126.239998 14634700
2022-02-16 127.580002 124.510002 125.610001 125.559998 17993300
2022-02-17 129.399994 124.056000 125.000000 124.430000 15906000
2022-02-18 120.879997 117.199997 120.879997 118.989998 21165800
Adj Close
Date
2017-02-22 104.199997
2017-02-23 102.459999
2017-02-24 102.949997
2017-02-27 103.599998
2017-02-28 102.900002
... ...
2022-02-14 121.919998
2022-02-15 126.239998
2022-02-16 125.559998
2022-02-17 124.430000
2022-02-18 118.989998
[1259 rows x 6 columns]>
alibaba.info()
# Output
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1259 entries, 2017-02-22 to 2022-02-18
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 High 1259 non-null float64
1 Low 1259 non-null float64
2 Open 1259 non-null float64
3 Close 1259 non-null float64
4 Volume 1259 non-null int64
5 Adj Close 1259 non-null float64
dtypes: float64(5), int64(1)
memory usage: 68.9 KB
7-3 Stock Market Analysis in Practice: Historical Trend Analysis
In Jupyter Notebook, create a new file: 股票市场分析实战--历史趋势分析.ipynb
# Basics
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
# Reading stock data
import pandas_datareader as pdr
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# time
from datetime import datetime
start = datetime(2015,9,20)
alibaba = pdr.get_data_yahoo('BABA',start=start)
amazon = pdr.get_data_yahoo('AMZN',start=start)
# alibaba.to_csv('../homework/BABA.csv')
# amazon.to_csv('../homework/AMZN.csv')
alibaba.head()
# Output
High Low Open Close Volume Adj Close
Date
2015-09-21 66.400002 62.959999 65.379997 63.900002 22355100 63.900002
2015-09-22 63.270000 61.580002 62.939999 61.900002 14897900 61.900002
2015-09-23 62.299999 59.680000 61.959999 60.000000 22684600 60.000000
2015-09-24 60.340000 58.209999 59.419998 59.919998 20645700 59.919998
2015-09-25 60.840000 58.919998 60.630001 59.240002 17009100 59.240002
alibaba['Adj Close'].plot()
alibaba['Adj Close'].plot(legend=True)
alibaba['Volume'].plot(legend=True)
alibaba['Adj Close'].plot()
amazon['Adj Close'].plot()
alibaba.head()
# Output
High Low Open Close Volume Adj Close
Date
2015-09-21 66.400002 62.959999 65.379997 63.900002 22355100 63.900002
2015-09-22 63.270000 61.580002 62.939999 61.900002 14897900 61.900002
2015-09-23 62.299999 59.680000 61.959999 60.000000 22684600 60.000000
2015-09-24 60.340000 58.209999 59.419998 59.919998 20645700 59.919998
2015-09-25 60.840000 58.919998 60.630001 59.240002 17009100 59.240002
alibaba['high-low'] = alibaba['High'] - alibaba['Low']
alibaba['high-low'].plot()
alibaba.head()
# Output
High Low Open Close Volume Adj Close high-low
Date
2015-09-21 66.400002 62.959999 65.379997 63.900002 22355100 63.900002 3.440002
2015-09-22 63.270000 61.580002 62.939999 61.900002 14897900 61.900002 1.689999
2015-09-23 62.299999 59.680000 61.959999 60.000000 22684600 60.000000 2.619999
2015-09-24 60.340000 58.209999 59.419998 59.919998 20645700 59.919998 2.130001
2015-09-25 60.840000 58.919998 60.630001 59.240002 17009100 59.240002 1.920002
# daily return
alibaba['daily-return'] = alibaba['Adj Close'].pct_change()
alibaba['daily-return']
# Output
Date
2015-09-21 NaN
2015-09-22 -0.031299
2015-09-23 -0.030695
2015-09-24 -0.001333
2015-09-25 -0.011348
...
2022-02-14 -0.002699
2022-02-15 0.035433
2022-02-16 -0.005387
2022-02-17 -0.009000
2022-02-18 -0.043719
Name: daily-return, Length: 1617, dtype: float64
alibaba['daily-return'].plot()
alibaba['daily-return'].plot(figsize=(10,4), linestyle='--', marker='o')
alibaba['daily-return'].plot(kind='hist')
sns.distplot(alibaba['daily-return'].dropna(), bins=100, color='purple')
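The daily return used above is the relative change from one trading day to the next, which is what `pct_change()` computes. The following is a minimal sketch on made-up closing prices (not real BABA data), comparing `pct_change()` with the same formula written out by hand.

```python
import numpy as np
import pandas as pd

# Made-up closing prices for illustration only, not real market data.
close = pd.Series([100.0, 102.0, 99.96, 104.958],
                  index=pd.date_range("2020-01-01", periods=4))

daily_return = close.pct_change()       # (today - yesterday) / yesterday
manual = close / close.shift(1) - 1     # the same definition written out

print(daily_return)
```

The first value is NaN because the first day has no previous price, which is why the notes call `dropna()` before plotting the distribution.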
7-4 Stock Market Analysis in Practice: Risk Analysis
In Jupyter Notebook, create a new file: 股票市场分析实战--风险分析.ipynb
# Basics
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
# Reading stock data
import pandas_datareader as pdr
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# time
from datetime import datetime
start = datetime(2015,1,1)
company = ['AAPL', 'GOOG', 'MSFT', 'AMZN', 'FB']
top_tech_df = pdr.get_data_yahoo(company, start=start)['Adj Close']
# top_tech_df.to_csv('../homework/top5.csv')
top_tech_df.head()
# Output
Symbols AAPL GOOG MSFT AMZN FB
Date
2014-12-31 24.951860 524.958740 40.836304 310.350006 78.019997
2015-01-02 24.714508 523.373108 41.108841 308.519989 78.449997
2015-01-05 24.018263 512.463013 40.730816 302.190002 77.190002
2015-01-06 24.020521 500.585632 40.132992 295.290009 76.150002
2015-01-07 24.357340 499.727997 40.642887 298.420013 76.150002
top_tech_dr = top_tech_df.pct_change()
top_tech_dr.head()
# Output
Symbols AAPL GOOG MSFT AMZN FB
Date
2014-12-31 NaN NaN NaN NaN NaN
2015-01-02 -0.009512 -0.003020 0.006674 -0.005897 0.005511
2015-01-05 -0.028172 -0.020846 -0.009196 -0.020517 -0.016061
2015-01-06 0.000094 -0.023177 -0.014677 -0.022833 -0.013473
2015-01-07 0.014022 -0.001713 0.012705 0.010600 0.000000
top_tech_df.plot()
top_tech_df[['AAPL', 'FB', 'MSFT']].plot()
# Keyword arguments are used because newer seaborn versions no longer accept x and y positionally
sns.jointplot(x='GOOG', y='GOOG', data=top_tech_dr, kind='scatter')
sns.jointplot(x='AMZN', y='GOOG', data=top_tech_dr, kind='scatter')
sns.jointplot(x='MSFT', y='FB', data=top_tech_dr, kind='scatter')
sns.pairplot(top_tech_dr.dropna())
https://en.wikipedia.org/wiki/Quantile
top_tech_dr['AAPL'].quantile(0.52)
# 0.001484101307547557
top_tech_dr['MSFT'].quantile(0.05)
# -0.025949460040733018
vips = pdr.get_data_yahoo('VIPS', start=start)['Adj Close'] # Vipshop
vips.plot()
vips.pct_change().quantile(0.2)
# -0.02404166247703312
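The quantile calls above read as follows: `quantile(0.05)` is the daily return that only about 5% of days fell at or below, so it gives a rough sense of a bad day's loss. The following is a minimal sketch on synthetic, normally distributed returns; the seed, mean and standard deviation are arbitrary assumptions, not market data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)   # synthetic returns with an arbitrary seed
returns = pd.Series(rng.normal(loc=0.0005, scale=0.02, size=2000))

q05 = returns.quantile(0.05)              # return level at the 5% quantile
share_below = (returns <= q05).mean()     # fraction of days at or below that level

print(round(q05, 4), share_below)
```

Read this way, the MSFT result above means that on roughly 95% of days in the sample the daily loss was smaller than about 2.6%.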
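Chapter 8 Course Summary
This chapter is not a recap of the preceding chapters; instead, it points out a path for continued learning and practice. Keep practicing, and the results will come.
8-1 Summary
Four PDFs (worth printing out and reviewing often):
Numpy_Python_cheat_Sheet.pdf
Pandas_cheat_sheet.pdf
Python_Matplotlib_Cheat_Sheet.pdf
Seaborn_cheat_sheet.pdf
The Kaggle website:
1. Read plenty of tutorial examples
https://www.kaggle.com/datasets: provides some public datasets
Bitcoin Historical Data: historical Bitcoin data
kernels: for doing Data Science
Previously: https://www.kaggle.com/kernels
Now: https://www.kaggle.com/code
Data Science Tutorial for Beginners
Written as a Jupyter Notebook, with a fork feature
You can fork a Notebook directly, and Kaggle will open an online Jupyter Notebook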