Chapter 1: Setting Up the Lab Environment

This chapter introduces Anaconda and Jupyter Notebook: how to install Anaconda on Windows, Mac, and Linux, and the basics of starting and using Jupyter Notebook.

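1-1 Introductory Video

Data science and machine learning

The data science workflow

Course outline:

  • Chapter 1: Setting up the lab environment
  • Chapter 2: Getting started with Numpy
  • Chapter 3: Getting started with Pandas
  • Chapter 4: Working with data in Pandas
  • Chapter 5: Plotting and visualization with Matplotlib
  • Chapter 6: Plotting and visualization with Seaborn
  • Chapter 7: Hands-on data analysis project
  • Chapter 8: Summary

Who this course is for:

  • Learners who can study on their own and work hands-on
  • Learners with basic Python knowledge
  • Learners who want to work in data analysis or machine learning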

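1-2 Introduction to Anaconda and Jupyter Notebook

Anaconda/Jupyter Notebook: Open Data Science Platform

What is Anaconda?

  • The best-known Python data science platform
  • 750+ popular Python and R packages
  • Cross-platform: Windows, Mac, Linux
  • conda: an extensible package manager
  • Free to distribute
  • A very active community

Installing Anaconda

Download links

  • Current: https://www.anaconda.com/products/individual
  • Previously: https://www.anaconda.com/download/

Check that the installation is correct: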

cd ~/anaconda
bin/conda --version
conda 4.3.21

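Conda: Package and Environment Management

  • Install packages
  • Update packages
  • Create sandboxes: conda environments

Managing environments with conda
Create a new environment

conda create --name python34 python=3.4

Activate an environment (conda 4.4 and later also accept conda activate python34 on every platform)

activate python34 # for Windows
source activate python34 # for Linux & Mac

Leave an environment (conda 4.4 and later: conda deactivate)

deactivate # for Windows
source deactivate # for Linux & Mac

Delete an environment

conda remove --name python34 --all

Managing packages with conda
Conda's package management is similar to pip.
Install a Python package

conda install numpy

List the installed Python packages

conda list
conda list -n python34 # list the packages installed in a specific environment

Remove a Python package

conda remove --name python34 numpy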
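
After installing or removing packages, it can be handy to confirm from inside Python what the active environment actually contains. The following is a minimal sketch using only the standard library (Python 3.8+); the helper name installed_version and the package names in the loop are illustrative assumptions, not part of the course material.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# The last name is deliberately bogus to show the "not installed" case.
for name in ("numpy", "pandas", "surely-not-a-real-package-xyz"):
    print(name, "->", installed_version(name))
```

Running this in two different conda environments should show different results whenever the environments hold different packages.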

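Data Science IDE vs Developer IDE

Data Science IDEs in Anaconda

From IPython to Jupyter

What is IPython?

  • A powerful interactive shell
  • The kernel behind Jupyter
  • Supports interactive data analysis and visualization

IPython Kernel

  • Mainly responsible for running user code
  • Talks to the IPython shell through stdin/stdout
  • Talks to the notebook through JSON messages over ZeroMQ

What is Jupyter Notebook?

  • Formerly known as IPython Notebook
  • An open-source web application
  • Lets you create and share documents that contain code, visualizations, and notes
  • Useful for statistics, data analysis, modeling, machine learning, and related fields

How the notebook and the kernel interact

  • The notebook server is at the center
  • The notebook server loads and saves notebooks

The notebook file format (.ipynb)

  • A JSON-based format defined by IPython Notebook
  • Notebooks can read online data and CSV/XLS files
  • Notebooks can be converted to other formats (py, html, pdf, md, and so on)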
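
Because an .ipynb file is plain JSON, it can be inspected with nothing but the standard json module. The sketch below builds a tiny notebook document in memory (an illustrative example following the version 4 layout, not a file from the course) and pulls the code cells back out.

```python
import json

# A minimal notebook document in the nbformat 4 layout (illustrative content).
notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Demo notebook"]},
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["import numpy as np\n", "np.arange(3)"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

# Serialize to the text that would be stored in a .ipynb file, then parse it again.
text = json.dumps(notebook, indent=1)
loaded = json.loads(text)

# Each cell stores its source as a list of lines; join them to recover the code.
code_cells = [cell for cell in loaded["cells"] if cell["cell_type"] == "code"]
code_source = "".join(code_cells[0]["source"])
print(len(code_cells))
print(code_source)
```

The same cell structure is what converters walk over when a notebook is exported to py, html, or md.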

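NBViewer

  • An online viewer for notebooks in ipynb format
  • Notebooks can be shared by URL
  • GitHub integrates NBViewer
  • Converters make it easy to embed notebooks in blogs, emails, wikis, and books

Lab environment for this course

  • Install Anaconda on Windows/Mac/Linux
  • Use Python 3.6 as the base environment
  • Use Jupyter Notebook as the programming IDE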

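1-3 Installing Anaconda on Mac (Demo)

Download the macOS installer: 64-bit, Python 3.6+ (Python 3.9 as of 2022/2/15)
Anaconda3-2021.11-MacOSX-x86_64.pkg
Choose "Install for me only" and keep the defaults for everything else
Changing the install directory is not recommended (the installation needs 1.44 GB)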

~] ls
~] pwd
~] cd anaconda/
anaconda] ls
anaconda] cd bin
bin] ./conda --version
conda 4.3.21
bin] ./conda list
bin] ./jupyter notebook # opens the browser

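1-4 Installing Anaconda on Windows (Demo)

Download the Windows installer: 64-bit, Python 3.6+ (Python 3.9 as of 2022/2/15)
Anaconda3-2021.11-Windows-x86_64.exe
Choose "Just Me (recommended)" and keep the defaults for everything else
The installed Anaconda3 appears in the Start menu
Open Jupyter Notebook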

1-5 Installing Anaconda on Linux (Demo)

Download the Linux installer: 64-bit, Python 3.6+ (Python 3.9 as of 2022/2/15)
Copy the installer's download link

~] wget https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh
~] ls
Anaconda3-2021.11-Linux-x86_64.sh
~] ls -lh
~] sh Anaconda3-2021.11-Linux-x86_64.sh # accept the default options
~] pwd
/home/centos
~] cd anaconda3
anaconda3] ls
anaconda3] cd bin
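bin] ./conda --version
conda 4.3.21
bin] ./jupyter notebook --no-browser # copy the URL it prints

On the local terminal

~ ssh -N -f -L localhost:888:localhost:8888 gitlab-demo-ci
~ ssh -N -f -L localhost:888:localhost:8888 root@gitlab-demo-ci

Open a browser and paste the copied URL, replacing port 8888 with the forwarded local port 888

!ifconfig  # in a notebook cell, runs the Linux shell command ifconfig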

1-6 Using Jupyter Notebook (Demo)

cd anaconda3
cd jupyter-notebook/python-data-science
python-data-science git:(master) ls
README.md    demo.ipynb
python-data-science git:(master) xx/bin/jupyter notebook # opens the notebook in the browser

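Chapter 2: Getting Started with Numpy

This chapter introduces Numpy, the most fundamental library in Python data science. It reviews the basics of matrix arithmetic, presents the key data structure, the Array, and shows how to perform array and matrix operations with Numpy.

2-1 Five Commonly Used Python Libraries in Data Science

  • Numpy
  • Scipy
  • Pandas
  • Matplotlib
  • Scikit-learn

Numpy

  • N-dimensional arrays (matrices); fast and efficient; vectorized arithmetic
  • Efficient indexing, with no need for explicit loops
  • Open source, free, and cross-platform; performance comparable to C/Matlab
  • NumPy Chinese-language site

Scipy

  • Depends on Numpy
  • Designed for science and engineering
  • Implements many common scientific computations: linear algebra, Fourier transforms, signal and image processing

Pandas

  • A powerful tool for structured data analysis (depends on Numpy)
  • Provides several high-level data structures: Time-Series, DataFrame, Panel (Panel was removed in pandas 0.25)
  • Strong data indexing and processing capabilities

Matplotlib

  • The most widely used 2D plotting package for Python
  • Can largely replace Matlab's plotting features (scatter plots, line plots, bar charts, and so on)
  • Can draw polished 3D plots through mplot3d

Scikit-learn

  • A Python module for machine learning
  • Built on top of Scipy; provides common machine learning algorithms such as clustering and regression
  • A simple, easy-to-learn API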

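2-2 Math Review: Matrix Operations

Basic concepts

  • Matrix: a rectangular arrangement of data, i.e. a two-dimensional array. Vectors and scalars are special cases of matrices
  • Vector: a 1xn or nx1 matrix
  • Scalar: a 1x1 matrix
  • Array: an N-dimensional array, a generalization of the matrix

Special matrices

  • All-zeros and all-ones matrices

  • Identity matrix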
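
The special matrices listed above map directly onto NumPy constructors. A minimal sketch (the shapes and the sample matrix m are arbitrary illustrative choices):

```python
import numpy as np

zeros = np.zeros((2, 3))   # 2x3 matrix filled with 0.0
ones = np.ones((2, 3))     # 2x3 matrix filled with 1.0
identity = np.eye(3)       # 3x3 identity matrix: 1.0 on the diagonal, 0.0 elsewhere

# Multiplying by the identity matrix leaves a matrix unchanged.
m = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])
unchanged = identity @ m

print(zeros)
print(ones)
print(identity)
print(np.array_equal(unchanged, m))
```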

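Matrix addition and subtraction

  • The two matrices being added or subtracted must have the same number of rows and columns
  • Elements at the same row and column are added or subtracted

Array multiplication (element-wise product)

  • Array multiplication (the element-wise product) multiplies corresponding elements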
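
A short sketch of these element-wise rules with NumPy arrays; the values are arbitrary examples. Note that NumPy also permits broadcasting between certain compatible shapes, which goes beyond the strict same-shape rule stated above; the (2, 3) and (3, 2) shapes used in the failure case are not compatible.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])
b = np.array([[10, 20, 30],
              [40, 50, 60]])

total = a + b   # element-wise sum
diff = b - a    # element-wise difference
prod = a * b    # element-wise product, not the matrix product

# Shapes (2, 3) and (3, 2) cannot be combined element-wise.
try:
    a + np.ones((3, 2))
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

print(total)
print(diff)
print(prod)
print(mismatch_rejected)
```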

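Matrix multiplication

Let A be an m x p matrix and B a p x n matrix. The m x n matrix C is the product of A and B, written C = AB, where the element in row i and column j of C is:

c_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj}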
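
To make the definition of C = AB concrete, the sketch below computes each entry of C as a sum over the shared dimension p and compares the result with NumPy's built-in @ operator. The function name matmul_by_definition and the sample matrices are illustrative assumptions; in practice the @ operator is the tool to use.

```python
import numpy as np


def matmul_by_definition(A, B):
    """Multiply an m x p matrix by a p x n matrix using explicit loops."""
    m, p = A.shape
    p_b, n = B.shape
    if p != p_b:
        raise ValueError("columns of A must equal rows of B")
    C = np.zeros((m, n), dtype=np.result_type(A, B))
    for i in range(m):
        for j in range(n):
            for k in range(p):
                # Accumulate row i of A times column j of B.
                C[i, j] += A[i, k] * B[k, j]
    return C


A = np.array([[1, 2, 3],
              [4, 5, 6]])          # 2 x 3
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])           # 3 x 2

C = matmul_by_definition(A, B)     # 2 x 2
print(C)
print(np.array_equal(C, A @ B))
```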

Further linear algebra material

  • The linear algebra textbook published by Tsinghua University
  • http://bs.szu.edu.cn/sljr/Up/day_110824/201108240409437708.pdf

2-3 Creating and Accessing Arrays

In Jupyter Notebook, create a new file Array.ipynb

# Creating and accessing arrays
import numpy as np
# create from python list
list_1 = [1, 2, 3, 4]
list_1
# [1, 2, 3, 4]
array_1 = np.array(list_1)
array_1
# array([1, 2, 3, 4])
list_2 = [5, 6, 7, 8]
array_2 = np.array([list_1,list_2])
array_2
# array([[1, 2, 3, 4],[5, 6, 7, 8]])
array_2.shape  # shape of the array: (n, m) means n rows and m columns
# (2, 4)
# Further attributes
# array_2.ndim - the number of axes (dimensions) of the array. In the Python world, the number of dimensions is referred to as rank.
# array_2.itemsize - the size in bytes of each element
# array_2.data - the buffer that holds the actual elements of the array
array_2.size  # total number of elements, equal to the product of the entries of shape
# 8
array_2.dtype  # an object describing the type of the elements
# dtype('int32') or dtype('int64'), depending on the machine
array_3 = np.array([[1.0,2,3],[4.0,5,6]])
array_3.dtype
# dtype('float64')
array_4 = np.arange(1,10)
array_4
# array([1, 2, 3, 4, 5, 6, 7, 8, 9])
array_4 = np.arange(1, 10, 2)
array_4
# array([1, 3, 5, 7, 9])
np.zeros(5)    # zero vector
# array([0., 0., 0., 0., 0.])
np.zeros([2,3])  # 2x3 zero matrix
# array([[0., 0., 0.],[0., 0., 0.]])
np.eye(5)    # identity matrix with n=5
# array([[1., 0., 0., 0., 0.],[0., 1., 0., 0., 0.],[0., 0., 1., 0., 0.],[0., 0., 0., 1., 0.],[0., 0., 0., 0., 1.]])
np.eye(5).dtype
# dtype('float64')
a = np.arange(1,10)
a
# array([1, 2, 3, 4, 5, 6, 7, 8, 9])
a[1]
# 2 (the second element of the array)
a[1:5]
# array([2, 3, 4, 5]) (elements 2 through 5)
b = np.array([[1,2,3],[4,5,6]])
b
# array([[1, 2, 3],[4, 5, 6]])
b[1][0]
# 4
b[1,0]
# 4
c = np.array([[1,2,3],[4,5,6],[7,8,9]])
c
# array([[1, 2, 3],[4, 5, 6],[7, 8, 9]])
c[:2,1:]
# array([[2, 3],[5, 6]])
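
The notes above observe that the default integer dtype differs between machines (int32 on some, int64 on others). One way to sidestep that, sketched here with illustrative variable names, is to pass dtype explicitly when creating the array.

```python
import numpy as np

default_ints = np.array([1, 2, 3, 4])                 # integer dtype chosen by the platform
fixed_ints = np.array([1, 2, 3, 4], dtype=np.int64)   # always 64-bit integers
mixed = np.array([[1.0, 2, 3], [4.0, 5, 6]])          # a single float promotes the whole array

print(default_ints.dtype)                    # platform dependent
print(fixed_ints.dtype, fixed_ints.itemsize)
print(mixed.dtype)
```

Fixing the dtype keeps results such as itemsize and overflow behavior the same on every machine that runs the notebook.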

2-4 Array and Matrix Operations

In Jupyter Notebook, create a new file 数组与矩阵运算.ipynb

# Creating arrays quickly
import numpy as np
np.random.randn(10)        # 1-D array of 10 floats sampled from the standard normal distribution
# array([ 0.26674666, -0.91111093,  0.30684449, -0.80206634, -0.89176532, 0.7950014 , -0.53259808, -0.09981816,  1.2960139 , -0.9668373 ])
np.random.randint(10)    # a single random integer in [0, 9]
# 0
np.random.randint(10,size=(2,3))    # 2x3 array with elements in [0, 9]
# array([[7, 5, 8],[1, 5, 8]])
np.random.randint(10,size=20)        # 1-D array of 20 elements in [0, 9]
# array([5, 6, 4, 8, 0, 9, 6, 2, 2, 9, 2, 1, 4, 6, 1, 5, 8, 2, 3, 4])
np.random.randint(10,size=20).reshape(4,5)    # reshape the 20-element 1-D array into a 4x5 array with elements in [0, 9]
# array([[7, 1, 0, 5, 7],[8, 0, 3, 7, 9],[9, 0, 7, 3, 2],[9, 1, 5, 8, 7]])
# Array operations
a = np.random.randint(10,size=20).reshape(4,5)
b = np.random.randint(10,size=20).reshape(4,5)
a
# array([[2, 3, 8, 4, 8],[0, 7, 9, 9, 9],[1, 8, 1, 8, 6],[3, 4, 7, 5, 1]])
b
# array([[8, 4, 3, 1, 6],[4, 4, 6, 2, 9],[9, 4, 8, 5, 8],[6, 2, 5, 5, 8]])
a + b
# array([[10,  7, 11,  5, 14],[ 4, 11, 15, 11, 18],[10, 12,  9, 13, 14],[ 9,  6, 12, 10,  9]])
a - b
# array([[-6, -1,  5,  3,  2],[-4,  3,  3,  7,  0],[-8,  4, -7,  3, -2],[-3,  2,  2,  0, -7]])
a * b
# array([[16, 12, 24,  4, 48],[ 0, 28, 54, 18, 81],[ 9, 32,  8, 40, 48],[18,  8, 35, 25,  8]])
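a / b
# If b contained a 0, NumPy would emit a RuntimeWarning and put inf or nan in that position rather than raise an exception
# array([[0.25      , 0.75      , 2.66666667, 4.        , 1.33333333],[0.        , 1.75      , 1.5       , 4.5       , 1.        ],[0.11111111, 2.        , 0.125     , 1.6       , 0.75      ],[0.5       , 2.        , 1.4       , 1.        , 0.125     ]])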
np.mat([[1,2,3],[4,5,6]])    # note: np.mat was removed in NumPy 2.0, where np.asmatrix is the equivalent
# matrix([[1, 2, 3],[4, 5, 6]])
a
# array([[2, 3, 8, 4, 8],[0, 7, 9, 9, 9],[1, 8, 1, 8, 6],[3, 4, 7, 5, 1]])
np.mat(a)
# matrix([[2, 3, 8, 4, 8],[0, 7, 9, 9, 9],[1, 8, 1, 8, 6],[3, 4, 7, 5, 1]])
# Matrix operations
A = np.mat(a)
B = np.mat(b)
A
# matrix([[2, 3, 8, 4, 8],[0, 7, 9, 9, 9],[1, 8, 1, 8, 6],[3, 4, 7, 5, 1]])
B
# matrix([[8, 4, 3, 1, 6],[4, 4, 6, 2, 9],[9, 4, 8, 5, 8],[6, 2, 5, 5, 8]])
A + B
# matrix([[10,  7, 11,  5, 14],[ 4, 11, 15, 11, 18],[10, 12,  9, 13, 14],[ 9,  6, 12, 10,  9]])
A - B
# matrix([[-6, -1,  5,  3,  2],[-4,  3,  3,  7,  0],[-8,  4, -7,  3, -2],[-3,  2,  2,  0, -7]])
A * B    # raises ValueError: the number of columns of A (5) does not match the number of rows of B (4)
a = np.mat(np.random.randint(10,size=20).reshape(4,5))
b = np.mat(np.random.randint(10,size=20).reshape(5,4))
a
# matrix([[9, 9, 3, 0, 5],[9, 4, 6, 4, 5],[9, 0, 7, 0, 9],[7, 2, 6, 0, 6]])
b
# matrix([[2, 2, 6, 4],[8, 9, 8, 0],[2, 1, 3, 9],[3, 1, 0, 2],[9, 3, 1, 4]])
a * b
# matrix([[141, 117, 140,  83],[119,  79, 109, 118],[113,  52,  84, 135],[ 96,  56,  82, 106]])
# Common array functions
a = np.random.randint(10,size=20).reshape(4,5)
np.unique(a)    # the distinct values among all elements of a
# array([0, 1, 2, 3, 4, 5, 6, 8, 9])
a
# array([[4, 2, 8, 4, 2],[6, 9, 6, 4, 0],[9, 2, 6, 9, 0],[1, 3, 8, 5, 9]])
sum(a)        # the built-in sum adds the rows together, giving one sum per column
# array([20, 16, 28, 22, 11])
sum(a[0])    # sum of the first row of a
# 20
sum(a[:,0])    # sum of the first column of a
# 20
a.max()        # largest value in a
# 9
max(a[0])    # largest value in the first row of a
# 8
max(a[:,0])    # largest value in the first column of a
# 9
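
Python's built-in sum and max iterate over the rows of a 2-D array, which is why sum(a) above returns per-column totals. NumPy's own methods take an axis argument that states the direction explicitly. A minimal sketch reusing the 4x5 array shown above:

```python
import numpy as np

a = np.array([[4, 2, 8, 4, 2],
              [6, 9, 6, 4, 0],
              [9, 2, 6, 9, 0],
              [1, 3, 8, 5, 9]])

col_sums = a.sum(axis=0)   # collapse the rows: one sum per column
row_sums = a.sum(axis=1)   # collapse the columns: one sum per row
total = a.sum()            # sum of every element

builtin_result = sum(a)    # the built-in adds row arrays together, same as axis=0

col_max = a.max(axis=0)    # largest value in each column
row_max = a.max(axis=1)    # largest value in each row

print(col_sums, row_sums, total)
print(col_max, row_max)
```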

2-5 Array Input and Output

In Jupyter Notebook, create a new file Array的input和output.ipynb

# Serialize a numpy array with pickle
import pickle
import numpy as np
x = np.arange(10)
x
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
f = open('x.pk1','wb')
pickle.dump(x, f)
f.close()    # close the file so the data is flushed to disk before it is read back
!ls        # on Windows, use !dir
# Array.ipynb    Array的input和output.ipynb    x.pk1    数组与矩阵运算.ipynb
f = open('x.pk1','rb')
pickle.load(f)
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
f.close()
np.save('one_array', x)
!ls
# Array.ipynb    Array的input和output.ipynb    x.pk1    one_array.npy    数组与矩阵运算.ipynb
np.load('one_array.npy')
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.arange(20)
y
# array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
np.savez('two_array.npz', a=x, b=y)
!ls
# Array.ipynb    two_array.npz    Array的input和output.ipynb    x.pk1    one_array.npy    数组与矩阵运算.ipynb
np.load('two_array.npz')
# <numpy.lib.npyio.NpzFile at 0x17033c77df0>
c = np.load('two_array.npz')
c['a']
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
c['b']
# array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
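
The same round trips can be written so that every file handle is closed automatically. The sketch below uses with blocks and a temporary directory (an illustrative choice so the example leaves no files behind); the file names inside it are assumptions.

```python
import os
import pickle
import tempfile

import numpy as np

x = np.arange(10)
y = np.arange(20)

with tempfile.TemporaryDirectory() as tmp:
    # pickle round trip: the with blocks close the file even if an error occurs
    pkl_path = os.path.join(tmp, "x.pkl")
    with open(pkl_path, "wb") as f:
        pickle.dump(x, f)
    with open(pkl_path, "rb") as f:
        x_from_pickle = pickle.load(f)

    # .npz round trip: several named arrays stored in one archive
    npz_path = os.path.join(tmp, "two_array.npz")
    np.savez(npz_path, a=x, b=y)
    with np.load(npz_path) as archive:
        names = sorted(archive.files)   # the keyword names used in np.savez
        a_loaded = archive["a"]
        b_loaded = archive["b"]

print(names)
print(x_from_pickle)
print(b_loaded)
```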

Scipy documentation

  • Current: https://docs.scipy.org/doc/scipy/getting_started.html
  • Previously: https://docs.scipy.org/doc/numpy-dev/user/quickstart.html

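Chapter 3: Getting Started with Pandas

This chapter introduces Pandas, the most important Python library for data analysis. It starts with the two central data structures in pandas, Series and DataFrame, covers how to create them and work with them, and uses hands-on examples to show how Series and DataFrame relate to each other.

3-1 Pandas Series

In Jupyter Notebook, create a new file Series.ipynb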

import numpy as np
import pandas as pd
s1 = pd.Series([1,2,3,4])
s1
# 0    1
# 1    2
# 2    3
# 3    4
# dtype: int64
s1.values
# array([1, 2, 3, 4], dtype=int64)
s1.index
# RangeIndex(start=0, stop=4, step=1)
s2 = pd.Series(np.arange(10))
s2            # the dtype is int32 or int64 depending on the machine
# 0    0
# 1    1
# 2    2
# 3    3
# 4    4
# 5    5
# 6    6
# 7    7
# 8    8
# 9    9
# dtype: int32
s3 = pd.Series({'1':1, '2':2, '3':3})
s3
# 1    1
# 2    2
# 3    3
# dtype: int64
s3.values
# array([1, 2, 3], dtype=int64)
s3.index
# Index(['1', '2', '3'], dtype='object')
s4 = pd.Series([1,2,3,4],index=['A','B','C','D'])
s4
# A    1
# B    2
# C    3
# D    4
# dtype: int64
s4.values
# array([1, 2, 3, 4], dtype=int64)
s4.index
# Index(['A', 'B', 'C', 'D'], dtype='object')
s4['A']
# 1
s4[s4>2]
# C    3
# D    4
# dtype: int64
s4
# A    1
# B    2
# C    3
# D    4
# dtype: int64
s4.to_dict()
# {'A': 1, 'B': 2, 'C': 3, 'D': 4}
s5 = pd.Series(s4.to_dict())
s5
# A    1
# B    2
# C    3
# D    4
# dtype: int64
index_1 = ['A', 'B', 'C', 'D','E']
s6 = pd.Series(s5,index=index_1)
s6
# A    1.0
# B    2.0
# C    3.0
# D    4.0
# E    NaN
# dtype: float64
pd.isnull(s6)
# A    False
# B    False
# C    False
# D    False
# E     True
# dtype: bool
pd.notnull(s6)
# A     True
# B     True
# C     True
# D     True
# E    False
# dtype: bool
s6
# A    1.0
# B    2.0
# C    3.0
# D    4.0
# E    NaN
# dtype: float64
s6.name = 'demo'
s6
# A    1.0
# B    2.0
# C    3.0
# D    4.0
# E    NaN
# Name: demo, dtype: float64
s6.index.name = 'demo index'
s6
# demo index
# A    1.0
# B    2.0
# C    3.0
# D    4.0
# E    NaN
# Name: demo, dtype: float64
s6.index
# Index(['A', 'B', 'C', 'D', 'E'], dtype='object', name='demo index')
s6.values
# array([ 1.,  2.,  3.,  4., nan])
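
The NaN for label 'E' above comes from index alignment: pandas matches values by label and fills in NaN where a label has no value, which also turns the integer column into float64. A minimal sketch of the same mechanism using reindex, fillna, and label-aligned addition (the second Series, other, is an illustrative assumption):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["A", "B", "C", "D"])

# Reindexing to a longer label list introduces NaN for the new label "E".
r = s.reindex(["A", "B", "C", "D", "E"])

# Replace the missing value and go back to an integer dtype.
filled = r.fillna(0).astype("int64")

# Arithmetic aligns on labels too; fill_value=0 stands in for a missing side.
other = pd.Series([10, 20], index=["B", "E"])
added = s.add(other, fill_value=0)

print(r)
print(filled)
print(added)
```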

3-2 Pandas DataFrame

In Jupyter Notebook, create a new file DataFrame.ipynb

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import webbrowser
link = 'https://www.tiobe.com/tiobe-index/'
webbrowser.open(link)        # open the link in the browser
# True
df = pd.read_clipboard()    # first copy 10 rows of the table on the page, including the header
df
# Output
Position    Programming    Language    Ratings
0    21    SAS    0.66%    None
1    22    Scratch    0.64%    None
2    23    Fortran    0.58%    None
3    24    Rust    0.54%    None
4    25    (Visual)    FoxPro    0.52%
5    26    COBOL    0.42%    None
6    27    Dart    0.42%    None
7    28    Kotlin    0.41%    None
8    29    Lua    0.40%    None
9    30    Julia    0.40%    None
type(df)
# pandas.core.frame.DataFrame
df.columns
# Index(['Position', 'Programming', 'Language', 'Ratings'], dtype='object')
df.Ratings
#
0     None
1     None
2     None
3     None
4    0.52%
5     None
6     None
7     None
8     None
9     None
Name: Ratings, dtype: object
df_new = DataFrame(df,columns=['Programming','Language'])
df_new
# Output
Programming    Language
0    SAS    0.66%
1    Scratch    0.64%
2    Fortran    0.58%
3    Rust    0.54%
4    (Visual)    FoxPro
5    COBOL    0.42%
6    Dart    0.42%
7    Kotlin    0.41%
8    Lua    0.40%
9    Julia    0.40%
df['Position']
#
0    21
1    22
2    23
3    24
4    25
5    26
6    27
7    28
8    29
9    30
Name: Position, dtype: int64
type(df['Position'])
# pandas.core.series.Series
df_new = DataFrame(df,columns=['Programming','Language','Language1'])
df_new
# Output
Programming    Language    Language1
0    SAS    0.66%    NaN
1    Scratch    0.64%    NaN
2    Fortran    0.58%    NaN
3    Rust    0.54%    NaN
4    (Visual)    FoxPro    NaN
5    COBOL    0.42%    NaN
6    Dart    0.42%    NaN
7    Kotlin    0.41%    NaN
8    Lua    0.40%    NaN
9    Julia    0.40%    NaN
# Three ways to fill the new column
df_new['Language1'] = range(0,10)
# df_new['Language1'] = np.arange(0,10)
# df_new['Language1'] = pd.Series(np.arange(0,10))
df_new
# Output
Programming    Language    Language1
0    SAS    0.66%    0
1    Scratch    0.64%    1
2    Fortran    0.58%    2
3    Rust    0.54%    3
4    (Visual)    FoxPro    4
5    COBOL    0.42%    5
6    Dart    0.42%    6
7    Kotlin    0.41%    7
8    Lua    0.40%    8
9    Julia    0.40%    9
df_new['Language1'] = pd.Series([100,200], index=[1,2])
df_new
# Output
Programming    Language    Language1
0    SAS    0.66%    NaN
1    Scratch    0.64%    100.0
2    Fortran    0.58%    200.0
3    Rust    0.54%    NaN
4    (Visual)    FoxPro    NaN
5    COBOL    0.42%    NaN
6    Dart    0.42%    NaN
7    Kotlin    0.41%    NaN
8    Lua    0.40%    NaN
9    Julia    0.40%    NaN
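
The last assignment above fills only rows 1 and 2 because a Series assigned to a column is aligned on the DataFrame's index, while a plain sequence such as range is assigned by position. A small self-contained sketch of both behaviors; the four-row frame and the column names Rank and Flag are illustrative assumptions, since the clipboard data is not available here.

```python
import pandas as pd

df = pd.DataFrame({
    "Language": ["SAS", "Scratch", "Fortran", "Rust"],
    "Ratings": ["0.66%", "0.64%", "0.58%", "0.54%"],
})

# A plain sequence is assigned by position and must match the number of rows.
df["Rank"] = range(21, 25)

# A Series is aligned on the index: rows 1 and 2 get values, the rest become NaN.
df["Flag"] = pd.Series([100, 200], index=[1, 2])

print(df)
print(df["Flag"].dtype)   # float64, because NaN cannot be stored in an integer column
```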

3-3 Understanding Series and DataFrame in Depth

In Jupyter Notebook, create a new file 深入理解Series和Dataframe.ipynb

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
data = {'Country':['Belgium', 'India', 'Brazil'],
        'Capital':['Brussels','New Delhi', 'Brasilia'],
        'Population':[11190846, 1303171035, 207847528]}
# Series
s1 = pd.Series(data['Country'])
s1
# Output
0    Belgium
1      India
2     Brazil
dtype: object
s1.values
# array(['Belgium', 'India', 'Brazil'], dtype=object)
s1.index
# RangeIndex(start=0, stop=3, step=1)
s1 = pd.Series(data['Country'],index=['A','B','C'])
s1
# Output
A    Belgium
B      India
C     Brazil
dtype: object
s1.values
# array(['Belgium', 'India', 'Brazil'], dtype=object)
s1.index
# Index(['A', 'B', 'C'], dtype='object')
# DataFrame
df1 = pd.DataFrame(data)
df1
# Output
Country    Capital    Population
0    Belgium    Brussels    11190846
1    India    New Delhi    1303171035
2    Brazil    Brasilia    207847528
df1['Country']
# Output
0    Belgium
1      India
2     Brazil
Name: Country, dtype: object
cou = df1['Country']
type(cou)
# pandas.core.series.Series
df1.iterrows()
# <generator object DataFrame.iterrows at 0x0000018DD44C59E0>
for row in df1.iterrows():
    print(row)
    print(type(row))
    print(len(row))
# Output
(0, Country        Belgium
Capital       Brussels
Population    11190846
Name: 0, dtype: object)
<class 'tuple'>
2
(1, Country            India
Capital        New Delhi
Population    1303171035
Name: 1, dtype: object)
<class 'tuple'>
2
(2, Country          Brazil
Capital        Brasilia
Population    207847528
Name: 2, dtype: object)
<class 'tuple'>
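2
for row in df1.iterrows():
    print(type(row[0]), row[0], row[1])
    break
# Output
<class 'int'> 0 Country        Belgium
Capital       Brussels
Population    11190846
Name: 0, dtype: object
# Note: the index label prints as <class 'int'> here, where <class 'numpy.int64'> was expected; this is presumably a difference between pandas versions (an unverified assumption)
for row in df1.iterrows():
    print(type(row[0]), type(row[1]))
    break
# Output
<class 'int'> <class 'pandas.core.series.Series'>
# Note: again <class 'int'> rather than the expected <class 'numpy.int64'>
df1
# Output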
Country    Capital    Population
0    Belgium    Brussels    11190846
1    India    New Delhi    1303171035
2    Brazil    Brasilia    207847528
data
# Output
{'Country': ['Belgium', 'India', 'Brazil'],'Capital': ['Brussels', 'New Delhi', 'Brasilia'],'Population': [11190846, 1303171035, 207847528]}
s1 = pd.Series(data['Country'])
s2 = pd.Series(data['Capital'])
s3 = pd.Series(data['Population'])
df_new = pd.DataFrame([s1,s2,s3])
df_new
# Output
0    1    2
0    Belgium    India    Brazil
1    Brussels    New Delhi    Brasilia
2    11190846    1303171035    207847528
df1
# Output
Country    Capital    Population
0    Belgium    Brussels    11190846
1    India    New Delhi    1303171035
2    Brazil    Brasilia    207847528
df_new = df_new.T
df_new
# Output
0    1    2
0    Belgium    Brussels    11190846
1    India    New Delhi    1303171035
2    Brazil    Brasilia    207847528
df_new = pd.DataFrame([s1,s2,s3], index=['Country','Capital','Population'])
df_new
# Output
0    1    2
Country    Belgium    India    Brazil
Capital    Brussels    New Delhi    Brasilia
Population    11190846    1303171035    207847528
df_new = df_new.T
df_new
# Output
Country    Capital    Population
0    Belgium    Brussels    11190846
1    India    New Delhi    1303171035
2    Brazil    Brasilia    207847528
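
The walkthrough above shows that a DataFrame behaves like a collection of Series that share one index: a list of Series becomes rows, so a transpose is needed to turn them into columns. As a hedged alternative, the sketch below names each Series and joins them column-wise with pd.concat, which avoids the transpose; the variable names are illustrative.

```python
import pandas as pd

data = {
    "Country": ["Belgium", "India", "Brazil"],
    "Capital": ["Brussels", "New Delhi", "Brasilia"],
    "Population": [11190846, 1303171035, 207847528],
}

# One named Series per column.
series_list = [pd.Series(values, name=name) for name, values in data.items()]

# axis=1 places each Series side by side as a column, keeping its own dtype.
df_concat = pd.concat(series_list, axis=1)
df_direct = pd.DataFrame(data)

# A list of Series becomes rows; transposing turns them back into columns.
df_transposed = pd.DataFrame(series_list).T

# iterrows yields (index label, row as a Series) pairs.
first_label, first_row = next(df_direct.iterrows())

print(df_concat.equals(df_direct))
print(df_transposed)
print(first_label, type(first_row).__name__, first_row["Capital"])
```

One design note: after the transpose route the numeric column typically ends up with a generic object dtype, because each intermediate column mixed strings and integers, whereas the concat route keeps Population as integers.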

3-4 Pandas DataFrame I/O

In Jupyter Notebook, create a new file DataFrame IO.ipynb

import numpy as np
import pandas as pd
from pandas import Series,DataFrame
import webbrowser
link = 'http://pandas.pydata.org/pandas-docs/version/0.20/io.html'
webbrowser.open(link)    # opens the browser and returns True; then copy the table from the page
# True
df1 = pd.read_clipboard()
df1
# Output
Format Type    Data Description    Reader    Writer
0    text    CSV    read_csv    to_csv
1    text    JSON    read_json    to_json
2    text    HTML    read_html    to_html
3    text    Local clipboard    read_clipboard    to_clipboard
4    binary    MS Excel    read_excel    to_excel
5    binary    HDF5 Format    read_hdf    to_hdf
6    binary    Feather Format    read_feather    to_feather
7    binary    Msgpack    read_msgpack    to_msgpack
8    binary    Stata    read_stata    to_stata
9    binary    SAS    read_sas
10    binary    Python Pickle Format    read_pickle    to_pickle
11    SQL    SQL    read_sql    to_sql
12    SQL    Google Big Query    read_gbq    to_gbq
df1.to_clipboard()
df1
# Output
Format Type    Data Description    Reader    Writer
0    text    CSV    read_csv    to_csv
1    text    JSON    read_json    to_json
2    text    HTML    read_html    to_html
3    text    Local clipboard    read_clipboard    to_clipboard
4    binary    MS Excel    read_excel    to_excel
5    binary    HDF5 Format    read_hdf    to_hdf
6    binary    Feather Format    read_feather    to_feather
7    binary    Msgpack    read_msgpack    to_msgpack
8    binary    Stata    read_stata    to_stata
9    binary    SAS    read_sas
10    binary    Python Pickle Format    read_pickle    to_pickle
11    SQL    SQL    read_sql    to_sql
12    SQL    Google Big Query    read_gbq    to_gbq
df1.to_csv('df1.csv')
!ls   # on Windows, use !dir
# DataFrame IO.ipynb    df1.csv
!more df1.csv
# Output
,Format Type,Data Description,Reader,Writer
0,text,CSV,read_csv,to_csv
1,text,JSON,read_json,to_json
2,text,HTML,read_html,to_html
3,text,Local clipboard,read_clipboard,to_clipboard
4,binary,MS Excel,read_excel,to_excel
5,binary,HDF5 Format,read_hdf,to_hdf
6,binary,Feather Format,read_feather,to_feather
7,binary,Msgpack,read_msgpack,to_msgpack
8,binary,Stata,read_stata,to_stata
9,binary,SAS,read_sas,
10,binary,Python Pickle Format,read_pickle,to_pickle
11,SQL,SQL,read_sql,to_sql
12,SQL,Google Big Query,read_gbq,to_gbq
df1.to_csv('df1.csv',index=False)    # leave out the index
!ls
# DataFrame IO.ipynb    df1.csv
!more df1.csv
# Output
Format Type,Data Description,Reader,Writer
text,CSV,read_csv,to_csv
text,JSON,read_json,to_json
text,HTML,read_html,to_html
text,Local clipboard,read_clipboard,to_clipboard
binary,MS Excel,read_excel,to_excel
binary,HDF5 Format,read_hdf,to_hdf
binary,Feather Format,read_feather,to_feather
binary,Msgpack,read_msgpack,to_msgpack
binary,Stata,read_stata,to_stata
binary,SAS,read_sas,
binary,Python Pickle Format,read_pickle,to_pickle
SQL,SQL,read_sql,to_sql
SQL,Google Big Query,read_gbq,to_gbq
df2 = pd.read_csv('df1.csv')
df2
# Output
Format Type    Data Description    Reader    Writer
0    text    CSV    read_csv    to_csv
1    text    JSON    read_json    to_json
2    text    HTML    read_html    to_html
3    text    Local clipboard    read_clipboard    to_clipboard
4    binary    MS Excel    read_excel    to_excel
5    binary    HDF5 Format    read_hdf    to_hdf
6    binary    Feather Format    read_feather    to_feather
7    binary    Msgpack    read_msgpack    to_msgpack
8    binary    Stata    read_stata    to_stata
9    binary    SAS    read_sas
10    binary    Python Pickle Format    read_pickle    to_pickle
11    SQL    SQL    read_sql    to_sql
12    SQL    Google Big Query    read_gbq    to_gbq
df1.to_json()
# Output
'{"Format Type":{"0":"text","1":"text","2":"text","3":"text","4":"binary","5":"binary","6":"binary","7":"binary","8":"binary","9":"binary","10":"binary","11":"SQL","12":"SQL"},"Data Description":{"0":"CSV","1":"JSON","2":"HTML","3":"Local clipboard","4":"MS Excel","5":"HDF5 Format","6":"Feather Format","7":"Msgpack","8":"Stata","9":"SAS","10":"Python Pickle Format","11":"SQL","12":"Google Big Query"},"Reader":{"0":"read_csv","1":"read_json","2":"read_html","3":"read_clipboard","4":"read_excel","5":"read_hdf","6":"read_feather","7":"read_msgpack","8":"read_stata","9":"read_sas","10":"read_pickle","11":"read_sql","12":"read_gbq"},"Writer":{"0":"to_csv","1":"to_json","2":"to_html","3":"to_clipboard","4":"to_excel","5":"to_hdf","6":"to_feather","7":"to_msgpack","8":"to_stata","9":" ","10":"to_pickle","11":"to_sql","12":"to_gbq"}}'
pd.read_json(df1.to_json())    # recent pandas versions expect the JSON string wrapped in io.StringIO
# Output
Format Type    Data Description    Reader    Writer
0    text    CSV    read_csv    to_csv
1    text    JSON    read_json    to_json
2    text    HTML    read_html    to_html
3    text    Local clipboard    read_clipboard    to_clipboard
4    binary    MS Excel    read_excel    to_excel
5    binary    HDF5 Format    read_hdf    to_hdf
6    binary    Feather Format    read_feather    to_feather
7    binary    Msgpack    read_msgpack    to_msgpack
8    binary    Stata    read_stata    to_stata
9    binary    SAS    read_sas
10    binary    Python Pickle Format    read_pickle    to_pickle
11    SQL    SQL    read_sql    to_sql
12    SQL    Google Big Query    read_gbq    to_gbq
df1.to_html()
# Output
'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th></th>\n      <th>Format Type</th>\n      <th>Data Description</th>\n      <th>Reader</th>\n      <th>Writer</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>text</td>\n      <td>CSV</td>\n      <td>read_csv</td>\n      <td>to_csv</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>text</td>\n      <td>JSON</td>\n      <td>read_json</td>\n      <td>to_json</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>text</td>\n      <td>HTML</td>\n      <td>read_html</td>\n      <td>to_html</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>text</td>\n      <td>Local clipboard</td>\n      <td>read_clipboard</td>\n      <td>to_clipboard</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>binary</td>\n      <td>MS Excel</td>\n      <td>read_excel</td>\n      <td>to_excel</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>binary</td>\n      <td>HDF5 Format</td>\n      <td>read_hdf</td>\n      <td>to_hdf</td>\n    </tr>\n    <tr>\n      <th>6</th>\n      <td>binary</td>\n      <td>Feather Format</td>\n      <td>read_feather</td>\n      <td>to_feather</td>\n    </tr>\n    <tr>\n      <th>7</th>\n      <td>binary</td>\n      <td>Msgpack</td>\n      <td>read_msgpack</td>\n      <td>to_msgpack</td>\n    </tr>\n    <tr>\n      <th>8</th>\n      <td>binary</td>\n      <td>Stata</td>\n      <td>read_stata</td>\n      <td>to_stata</td>\n    </tr>\n    <tr>\n      <th>9</th>\n      <td>binary</td>\n      <td>SAS</td>\n      <td>read_sas</td>\n      <td></td>\n    </tr>\n    <tr>\n      <th>10</th>\n      <td>binary</td>\n      <td>Python Pickle Format</td>\n      <td>read_pickle</td>\n      <td>to_pickle</td>\n    </tr>\n    <tr>\n      <th>11</th>\n      <td>SQL</td>\n      <td>SQL</td>\n      <td>read_sql</td>\n      <td>to_sql</td>\n    </tr>\n    <tr>\n      <th>12</th>\n      <td>SQL</td>\n      <td>Google Big Query</td>\n      
<td>read_gbq</td>\n      <td>to_gbq</td>\n    </tr>\n  </tbody>\n</table>'
df1.to_html('df1.html')
!ls
# DataFrame IO.ipynb    df1.csv        df1.html
df1.to_excel('df1.xlsx')
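
The effect of index=False in to_csv is easiest to see on a tiny frame. The sketch below writes to in-memory strings instead of files (an illustrative choice, using io.StringIO) and reads both variants back; the two-row frame is a cut-down assumption modeled on the table above.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "Format Type": ["text", "binary"],
    "Reader": ["read_csv", "read_excel"],
    "Writer": ["to_csv", "to_excel"],
})

with_index = df.to_csv()                # first column holds the index, with an empty header
without_index = df.to_csv(index=False)  # data columns only

# Reading back the index-free text restores the original frame.
back = pd.read_csv(io.StringIO(without_index))

# When the index was written, index_col=0 tells read_csv to treat it as the index again.
back_with_index = pd.read_csv(io.StringIO(with_index), index_col=0)

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
print(back.equals(df))
```

Without index_col=0, the saved index would come back as an extra unnamed data column, which is a common surprise when a CSV is written with the default settings.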

3-5 Selecting and Indexing in a DataFrame

In Jupyter Notebook, create a new file Selecting and indexing.ipynb

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
!pwd    # on Windows, use !chdir
# /Users/xxx/xx
!ls /Users/xxx/xx/homework    # on Windows, use !dir
# movie_metadata.csv
imdb = pd.read_csv('/Users/xxx/xx/homework/movie_metadata.csv')
imdb
# Output
color    director_name    num_critic_for_reviews    duration    director_facebook_likes    actor_3_facebook_likes    actor_2_name    actor_1_facebook_likes    gross    genres    ...    num_user_for_reviews    language    country    content_rating    budget    title_year    actor_2_facebook_likes    imdb_score    aspect_ratio    movie_facebook_likes
0    Color    James Cameron    723.0    178.0    0.0    855.0    Joel David Moore    1000.0    760505847.0    Action|Adventure|Fantasy|Sci-Fi    ...    3054.0    English    USA    PG-13    237000000.0    2009.0    936.0    7.9    1.78    33000
1    Color    Gore Verbinski    302.0    169.0    563.0    1000.0    Orlando Bloom    40000.0    309404152.0    Action|Adventure|Fantasy    ...    1238.0    English    USA    PG-13    300000000.0    2007.0    5000.0    7.1    2.35    0
2    Color    Sam Mendes    602.0    148.0    0.0    161.0    Rory Kinnear    11000.0    200074175.0    Action|Adventure|Thriller    ...    994.0    English    UK    PG-13    245000000.0    2015.0    393.0    6.8    2.35    85000
3    Color    Christopher Nolan    813.0    164.0    22000.0    23000.0    Christian Bale    27000.0    448130642.0    Action|Thriller    ...    2701.0    English    USA    PG-13    250000000.0    2012.0    23000.0    8.5    2.35    164000
4    NaN    Doug Walker    NaN    NaN    131.0    NaN    Rob Walker    131.0    NaN    Documentary    ...    NaN    NaN    NaN    NaN    NaN    NaN    12.0    7.1    NaN    0
...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...
5038    Color    Scott Smith    1.0    87.0    2.0    318.0    Daphne Zuniga    637.0    NaN    Comedy|Drama    ...    6.0    English    Canada    NaN    NaN    2013.0    470.0    7.7    NaN    84
5039    Color    NaN    43.0    43.0    NaN    319.0    Valorie Curry    841.0    NaN    Crime|Drama|Mystery|Thriller    ...    359.0    English    USA    TV-14    NaN    NaN    593.0    7.5    16.00    32000
5040    Color    Benjamin Roberds    13.0    76.0    0.0    0.0    Maxwell Moody    0.0    NaN    Drama|Horror|Thriller    ...    3.0    English    USA    NaN    1400.0    2013.0    0.0    6.3    NaN    16
5041    Color    Daniel Hsia    14.0    100.0    0.0    489.0    Daniel Henney    946.0    10443.0    Comedy|Drama|Romance    ...    9.0    English    USA    PG-13    NaN    2012.0    719.0    6.3    2.35    660
5042    Color    Jon Gunn    43.0    90.0    16.0    16.0    Brian Herzlinger    86.0    85222.0    Documentary    ...    84.0    English    USA    PG    1100.0    2004.0    23.0    6.6    1.85    456
5043 rows × 28 columns
imdb.shape
# (5043, 28)
imdb.head()
# Output
color    director_name    num_critic_for_reviews    duration    director_facebook_likes    actor_3_facebook_likes    actor_2_name    actor_1_facebook_likes    gross    genres    ...    num_user_for_reviews    language    country    content_rating    budget    title_year    actor_2_facebook_likes    imdb_score    aspect_ratio    movie_facebook_likes
0    Color    James Cameron    723.0    178.0    0.0    855.0    Joel David Moore    1000.0    760505847.0    Action|Adventure|Fantasy|Sci-Fi    ...    3054.0    English    USA    PG-13    237000000.0    2009.0    936.0    7.9    1.78    33000
1    Color    Gore Verbinski    302.0    169.0    563.0    1000.0    Orlando Bloom    40000.0    309404152.0    Action|Adventure|Fantasy    ...    1238.0    English    USA    PG-13    300000000.0    2007.0    5000.0    7.1    2.35    0
2    Color    Sam Mendes    602.0    148.0    0.0    161.0    Rory Kinnear    11000.0    200074175.0    Action|Adventure|Thriller    ...    994.0    English    UK    PG-13    245000000.0    2015.0    393.0    6.8    2.35    85000
3    Color    Christopher Nolan    813.0    164.0    22000.0    23000.0    Christian Bale    27000.0    448130642.0    Action|Thriller    ...    2701.0    English    USA    PG-13    250000000.0    2012.0    23000.0    8.5    2.35    164000
4    NaN    Doug Walker    NaN    NaN    131.0    NaN    Rob Walker    131.0    NaN    Documentary    ...    NaN    NaN    NaN    NaN    NaN    NaN    12.0    7.1    NaN    0
5 rows × 28 columns
imdb.tail(10)
# Output
color    director_name    num_critic_for_reviews    duration    director_facebook_likes    actor_3_facebook_likes    actor_2_name    actor_1_facebook_likes    gross    genres    ...    num_user_for_reviews    language    country    content_rating    budget    title_year    actor_2_facebook_likes    imdb_score    aspect_ratio    movie_facebook_likes
5033    Color    Shane Carruth    143.0    77.0    291.0    8.0    David Sullivan    291.0    424760.0    Drama|Sci-Fi|Thriller    ...    371.0    English    USA    PG-13    7000.0    2004.0    45.0    7.0    1.85    19000
5034    Color    Neill Dela Llana    35.0    80.0    0.0    0.0    Edgar Tancangco    0.0    70071.0    Thriller    ...    35.0    English    Philippines    Not Rated    7000.0    2005.0    0.0    6.3    NaN    74
5035    Color    Robert Rodriguez    56.0    81.0    0.0    6.0    Peter Marquardt    121.0    2040920.0    Action|Crime|Drama|Romance|Thriller    ...    130.0    Spanish    USA    R    7000.0    1992.0    20.0    6.9    1.37    0
5036    Color    Anthony Vallone    NaN    84.0    2.0    2.0    John Considine    45.0    NaN    Crime|Drama    ...    1.0    English    USA    PG-13    3250.0    2005.0    44.0    7.8    NaN    4
5037    Color    Edward Burns    14.0    95.0    0.0    133.0    Caitlin FitzGerald    296.0    4584.0    Comedy|Drama    ...    14.0    English    USA    Not Rated    9000.0    2011.0    205.0    6.4    NaN    413
5038    Color    Scott Smith    1.0    87.0    2.0    318.0    Daphne Zuniga    637.0    NaN    Comedy|Drama    ...    6.0    English    Canada    NaN    NaN    2013.0    470.0    7.7    NaN    84
5039    Color    NaN    43.0    43.0    NaN    319.0    Valorie Curry    841.0    NaN    Crime|Drama|Mystery|Thriller    ...    359.0    English    USA    TV-14    NaN    NaN    593.0    7.5    16.00    32000
5040    Color    Benjamin Roberds    13.0    76.0    0.0    0.0    Maxwell Moody    0.0    NaN    Drama|Horror|Thriller    ...    3.0    English    USA    NaN    1400.0    2013.0    0.0    6.3    NaN    16
5041    Color    Daniel Hsia    14.0    100.0    0.0    489.0    Daniel Henney    946.0    10443.0    Comedy|Drama|Romance    ...    9.0    English    USA    PG-13    NaN    2012.0    719.0    6.3    2.35    660
5042    Color    Jon Gunn    43.0    90.0    16.0    16.0    Brian Herzlinger    86.0    85222.0    Documentary    ...    84.0    English    USA    PG    1100.0    2004.0    23.0    6.6    1.85    456
10 rows × 28 columns

imdb['color']
# Output
0       Color
1       Color
2       Color
3       Color
4         NaN
        ...
5038    Color
5039    Color
5040    Color
5041    Color
5042    Color
Name: color, Length: 5043, dtype: object

imdb['color'][0]
# 'Color'
imdb['color'][1]
# 'Color'

imdb[['color','director_name']]
# Output
color    director_name
0    Color    James Cameron
1    Color    Gore Verbinski
2    Color    Sam Mendes
3    Color    Christopher Nolan
4    NaN    Doug Walker
...    ...    ...
5038    Color    Scott Smith
5039    Color    NaN
5040    Color    Benjamin Roberds
5041    Color    Daniel Hsia
5042    Color    Jon Gunn
5043 rows × 2 columns

sub_df = imdb[['director_name','movie_title','imdb_score']]
sub_df
# Output
director_name    movie_title    imdb_score
0    James Cameron    Avatar    7.9
1    Gore Verbinski    Pirates of the Caribbean: At World's End    7.1
2    Sam Mendes    Spectre    6.8
3    Christopher Nolan    The Dark Knight Rises    8.5
4    Doug Walker    Star Wars: Episode VII - The Force Awakens  ...    7.1
...    ...    ...    ...
5038    Scott Smith    Signed Sealed Delivered    7.7
5039    NaN    The Following    7.5
5040    Benjamin Roberds    A Plague So Pleasant    6.3
5041    Daniel Hsia    Shanghai Calling    6.3
5042    Jon Gunn    My Date with Drew    6.6
5043 rows × 3 columns

sub_df.head()
# Output
director_name    movie_title    imdb_score
0    James Cameron    Avatar    7.9
1    Gore Verbinski    Pirates of the Caribbean: At World's End    7.1
2    Sam Mendes    Spectre    6.8
3    Christopher Nolan    The Dark Knight Rises    8.5
4    Doug Walker    Star Wars: Episode VII - The Force Awakens  ...    7.1

sub_df.head(5)
# Output
director_name    movie_title    imdb_score
0    James Cameron    Avatar    7.9
1    Gore Verbinski    Pirates of the Caribbean: At World's End    7.1
2    Sam Mendes    Spectre    6.8
3    Christopher Nolan    The Dark Knight Rises    8.5
4    Doug Walker    Star Wars: Episode VII - The Force Awakens  ...    7.1

sub_df.iloc[10:20,:]
# Output
director_name    movie_title    imdb_score
10    Zack Snyder    Batman v Superman: Dawn of Justice    6.9
11    Bryan Singer    Superman Returns    6.1
12    Marc Forster    Quantum of Solace    6.7
13    Gore Verbinski    Pirates of the Caribbean: Dead Man's Chest    7.3
14    Gore Verbinski    The Lone Ranger    6.5
15    Zack Snyder    Man of Steel    7.2
16    Andrew Adamson    The Chronicles of Narnia: Prince Caspian    6.6
17    Joss Whedon    The Avengers    8.1
18    Rob Marshall    Pirates of the Caribbean: On Stranger Tides    6.7
19    Barry Sonnenfeld    Men in Black 3    6.8

sub_df.iloc[10:20,0:2]
# Output
director_name    movie_title
10    Zack Snyder    Batman v Superman: Dawn of Justice
11    Bryan Singer    Superman Returns
12    Marc Forster    Quantum of Solace
13    Gore Verbinski    Pirates of the Caribbean: Dead Man's Chest
14    Gore Verbinski    The Lone Ranger
15    Zack Snyder    Man of Steel
16    Andrew Adamson    The Chronicles of Narnia: Prince Caspian
17    Joss Whedon    The Avengers
18    Rob Marshall    Pirates of the Caribbean: On Stranger Tides
19    Barry Sonnenfeld    Men in Black 3

tmp_df = sub_df.iloc[10:20,0:2]
tmp_df
# Output
director_name    movie_title
10    Zack Snyder    Batman v Superman: Dawn of Justice
11    Bryan Singer    Superman Returns
12    Marc Forster    Quantum of Solace
13    Gore Verbinski    Pirates of the Caribbean: Dead Man's Chest
14    Gore Verbinski    The Lone Ranger
15    Zack Snyder    Man of Steel
16    Andrew Adamson    The Chronicles of Narnia: Prince Caspian
17    Joss Whedon    The Avengers
18    Rob Marshall    Pirates of the Caribbean: On Stranger Tides
19    Barry Sonnenfeld    Men in Black 3

tmp_df.iloc[2:4,:]
# Output
director_name    movie_title
12    Marc Forster    Quantum of Solace
13    Gore Verbinski    Pirates of the Caribbean: Dead Man's Chest

tmp_df.loc[15:17,:]
# Output
director_name    movie_title
15    Zack Snyder    Man of Steel
16    Andrew Adamson    The Chronicles of Narnia: Prince Caspian
17    Joss Whedon    The Avengers

tmp_df.loc[15:17,:'movie_title']
# Output
director_name    movie_title
15    Zack Snyder    Man of Steel
16    Andrew Adamson    The Chronicles of Narnia: Prince Caspian
17    Joss Whedon    The Avengers

tmp_df.loc[15:17,:'director_name']
# Output
director_name
15    Zack Snyder
16    Andrew Adamson
17    Joss Whedon
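
The cells above rely on a difference that is easy to miss: iloc selects by integer position and excludes the end of the slice, while loc selects by label and includes it. The following is a minimal sketch on a small made-up frame; the name demo and its placeholder values are illustrative assumptions, not data from the IMDB file.

```python
import pandas as pd

# Hypothetical stand-in for tmp_df: row labels 10..19, as left over from an earlier slice
demo = pd.DataFrame(
    {"director_name": [f"director_{i}" for i in range(10, 20)],
     "movie_title": [f"title_{i}" for i in range(10, 20)]},
    index=range(10, 20),
)

by_position = demo.iloc[2:4, :]                  # positions 2 and 3, i.e. labels 12 and 13
by_label = demo.loc[15:17, :]                    # labels 15, 16 and 17 (end label included)
one_column = demo.loc[15:17, :"director_name"]   # label slices on columns are inclusive too

print(list(by_position.index))
print(list(by_label.index))
print(list(one_column.columns))
```

Because the row labels of a sliced frame no longer start at 0, position 2 and label 2 refer to different things here, which is why the two indexers are worth keeping apart.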

3-6 Reindexing Series and DataFrame

In Jupyter Notebook, create a new file: Reindexing Series and DataFrame.ipynb

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# series reindex
s1 = Series([1,2,3,4], index=['A','B','C','D'])
s1
# Output
A    1
B    2
C    3
D    4
dtype: int64

# s1.reindex()    # Place the cursor on the method and press Shift+Tab to pop up its documentation; press repeatedly for more detail
s1.reindex(index=['A','B','C','D','E'])
# Output
A    1.0
B    2.0
C    3.0
D    4.0
E    NaN
dtype: float64

s1.reindex(index=['A','B','C','D','E'],fill_value=0)
# Output
A    1
B    2
C    3
D    4
E    0
dtype: int64

s1.reindex(index=['A','B','C','D','E'],fill_value=10)
# Output
A     1
B     2
C     3
D     4
E    10
dtype: int64

s2 = Series(['A','B','C'], index=[1,5,10])
s2
# Output
1     A
5     B
10    C
dtype: object

s2.reindex(index=range(15))
# Output
0     NaN
1       A
2     NaN
3     NaN
4     NaN
5       B
6     NaN
7     NaN
8     NaN
9     NaN
10      C
11    NaN
12    NaN
13    NaN
14    NaN
dtype: object

s2.reindex(index=range(15),method='ffill')
# Output
0     NaN
1       A
2       A
3       A
4       A
5       B
6       B
7       B
8       B
9       B
10      C
11      C
12      C
13      C
14      C
dtype: object

# reindex dataframe
df1 = DataFrame(np.random.rand(25).reshape([5,5]))
df1
# Output
0    1    2    3    4
0    0.255424    0.315708    0.951327    0.423676    0.975377
1    0.087594    0.192460    0.502268    0.534926    0.423024
2    0.817002    0.113410    0.468270    0.410297    0.278942
3    0.315239    0.018933    0.133764    0.240001    0.910754
4    0.267342    0.451077    0.282865    0.170235    0.898429

df1 = DataFrame(np.random.rand(25).reshape([5,5]), index=['A','B','D','E','F'], columns=['c1','c2','c3','c4','c5'])
df1
# Output
c1    c2    c3    c4    c5
A    0.278063    0.894546    0.932129    0.178442    0.303684
B    0.186239    0.260677    0.708358    0.275914    0.369878
D    0.786987    0.125907    0.191987    0.338194    0.009877
E    0.192269    0.909661    0.227301    0.343989    0.610203
F    0.503267    0.306472    0.197467    0.063800    0.813786

df1.reindex(index=['A','B','C','D','E','F'])
# Output
c1    c2    c3    c4    c5
A    0.278063    0.894546    0.932129    0.178442    0.303684
B    0.186239    0.260677    0.708358    0.275914    0.369878
C    NaN    NaN    NaN    NaN    NaN
D    0.786987    0.125907    0.191987    0.338194    0.009877
E    0.192269    0.909661    0.227301    0.343989    0.610203
F    0.503267    0.306472    0.197467    0.063800    0.813786

df1.reindex(columns=['c1','c2','c3','c4','c5','c6'])
# Output
c1    c2    c3    c4    c5    c6
A    0.278063    0.894546    0.932129    0.178442    0.303684    NaN
B    0.186239    0.260677    0.708358    0.275914    0.369878    NaN
D    0.786987    0.125907    0.191987    0.338194    0.009877    NaN
E    0.192269    0.909661    0.227301    0.343989    0.610203    NaN
F    0.503267    0.306472    0.197467    0.063800    0.813786    NaN

df1.reindex(index=['A','B','C','D','E','F'],columns=['c1','c2','c3','c4','c5','c6'])
# Output
c1    c2    c3    c4    c5    c6
A    0.278063    0.894546    0.932129    0.178442    0.303684    NaN
B    0.186239    0.260677    0.708358    0.275914    0.369878    NaN
C    NaN    NaN    NaN    NaN    NaN    NaN
D    0.786987    0.125907    0.191987    0.338194    0.009877    NaN
E    0.192269    0.909661    0.227301    0.343989    0.610203    NaN
F    0.503267    0.306472    0.197467    0.063800    0.813786    NaN

s1
# Output
A    1
B    2
C    3
D    4
dtype: int64

s1.reindex(index=['A','B'])
# Output
A    1
B    2
dtype: int64

df1
# Output
c1    c2    c3    c4    c5
A    0.278063    0.894546    0.932129    0.178442    0.303684
B    0.186239    0.260677    0.708358    0.275914    0.369878
D    0.786987    0.125907    0.191987    0.338194    0.009877
E    0.192269    0.909661    0.227301    0.343989    0.610203
F    0.503267    0.306472    0.197467    0.063800    0.813786

df1.reindex(index=['A','B'])
# Output
c1    c2    c3    c4    c5
A    0.278063    0.894546    0.932129    0.178442    0.303684
B    0.186239    0.260677    0.708358    0.275914    0.369878

s1
# Output
A    1
B    2
C    3
D    4
dtype: int64

s1.drop('A')
# Output
B    2
C    3
D    4
dtype: int64

df1
# Output
c1    c2    c3    c4    c5
A    0.278063    0.894546    0.932129    0.178442    0.303684
B    0.186239    0.260677    0.708358    0.275914    0.369878
D    0.786987    0.125907    0.191987    0.338194    0.009877
E    0.192269    0.909661    0.227301    0.343989    0.610203
F    0.503267    0.306472    0.197467    0.063800    0.813786

df1.drop('A',axis=0)
# Output
c1    c2    c3    c4    c5
B    0.186239    0.260677    0.708358    0.275914    0.369878
D    0.786987    0.125907    0.191987    0.338194    0.009877
E    0.192269    0.909661    0.227301    0.343989    0.610203
F    0.503267    0.306472    0.197467    0.063800    0.813786

df1.drop('c1',axis=0)
# Raises KeyError: there is no row labeled 'c1'

df1.drop('c1',axis=1)
# Output
c2    c3    c4    c5
A    0.894546    0.932129    0.178442    0.303684
B    0.260677    0.708358    0.275914    0.369878
D    0.125907    0.191987    0.338194    0.009877
E    0.909661    0.227301    0.343989    0.610203
F    0.306472    0.197467    0.063800    0.813786
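
The reindexing rules shown in this section can be condensed into one short script. This is a minimal sketch that reuses the Series values from the notebook; the variable names (widened, filled, narrowed, forward, dropped) are illustrative choices, not part of the course material.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["A", "B", "C", "D"])

# A label missing from the original becomes NaN, which forces a float dtype
widened = s.reindex(["A", "B", "C", "D", "E"])
# Supplying fill_value avoids NaN, so the integer dtype is kept
filled = s.reindex(["A", "B", "C", "D", "E"], fill_value=0)
# Listing fewer labels keeps only those labels
narrowed = s.reindex(["A", "B"])

# method="ffill" carries the last earlier value forward into the gaps
sparse = pd.Series(["A", "B", "C"], index=[1, 5, 10])
forward = sparse.reindex(range(15), method="ffill")

# drop returns a new object; the original Series is unchanged
dropped = s.drop("A")

print(widened.dtype, filled.dtype)
print(forward.tolist())
print(list(s.index), list(dropped.index))
```

Label 0 stays NaN under ffill because there is no earlier value to carry forward, which matches the first row of the notebook output above.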

3-7 A Word About NaN

In Jupyter Notebook, create a new file: 谈一谈NaN.ipynb

# NaN - means Not a Number
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

n = np.nan
type(n)
# float

m = 1
m + n
# nan

# NaN in Series
s1 = Series([1, 2, np.nan, 3, 4], index=['A','B','C','D','E'])
s1
# Output
A    1.0
B    2.0
C    NaN
D    3.0
E    4.0
dtype: float64

s1.isnull()
# Output
A    False
B    False
C     True
D    False
E    False
dtype: bool

s1.notnull()
# Output
A     True
B     True
C    False
D     True
E     True
dtype: bool

s1
# Output
A    1.0
B    2.0
C    NaN
D    3.0
E    4.0
dtype: float64

s1.dropna()
# Output
A    1.0
B    2.0
D    3.0
E    4.0
dtype: float64

# NaN in DataFrame
dframe = DataFrame([[1,2,3],[np.nan,5,6],[7,np.nan,9],[np.nan,np.nan,np.nan]])
dframe
# Output
0    1    2
0    1.0    2.0    3.0
1    NaN    5.0    6.0
2    7.0    NaN    9.0
3    NaN    NaN    NaN

dframe.isnull()
# Output
0    1    2
0    False    False    False
1    True    False    False
2    False    True    False
3    True    True    True

dframe.notnull()
# Output
0    1    2
0    True    True    True
1    False    True    True
2    True    False    True
3    False    False    False

df1 = dframe.dropna(axis=0)
df1
# Output
0    1    2
0    1.0    2.0    3.0

df1 = dframe.dropna(axis=1)
df1
# Output (every column contains a NaN, so no columns remain; only the index is left)
0
1
2
3

df1 = dframe.dropna(axis=1,how='any')
df1
# Output
0
1
2
3

df1 = dframe.dropna(axis=0,how='any')
df1
# Output
0    1    2
0    1.0    2.0    3.0

df1 = dframe.dropna(axis=0,how='all')
df1
# Output
0    1    2
0    1.0    2.0    3.0
1    NaN    5.0    6.0
2    7.0    NaN    9.0

dframe2 = DataFrame([[1,2,3,np.nan],[2,np.nan,5,6],[np.nan,7,np.nan,9],[1,np.nan,np.nan,np.nan]])
dframe2
# Output
0    1    2    3
0    1.0    2.0    3.0    NaN
1    2.0    NaN    5.0    6.0
2    NaN    7.0    NaN    9.0
3    1.0    NaN    NaN    NaN

df2 = dframe2.dropna(thresh=None)
df2
# Output (no row is free of NaN, so only the column labels remain)
0    1    2    3

df2 = dframe2.dropna(thresh=2)
df2
# Output
0    1    2    3
0    1.0    2.0    3.0    NaN
1    2.0    NaN    5.0    6.0
2    NaN    7.0    NaN    9.0

dframe2
# Output
0    1    2    3
0    1.0    2.0    3.0    NaN
1    2.0    NaN    5.0    6.0
2    NaN    7.0    NaN    9.0
3    1.0    NaN    NaN    NaN

dframe2.fillna(value=1)
# Output
0    1    2    3
0    1.0    2.0    3.0    1.0
1    2.0    1.0    5.0    6.0
2    1.0    7.0    1.0    9.0
3    1.0    1.0    1.0    1.0

dframe2.fillna(value={0:0,1:1,2:2,3:3})    # fill each column with its own value
# Output
0    1    2    3
0    1.0    2.0    3.0    3.0
1    2.0    1.0    5.0    6.0
2    0.0    7.0    2.0    9.0
3    1.0    1.0    2.0    3.0

df1
# Output
0    1    2
0    1.0    2.0    3.0
1    NaN    5.0    6.0
2    7.0    NaN    9.0

df2
# Output
0    1    2    3
0    1.0    2.0    3.0    NaN
1    2.0    NaN    5.0    6.0
2    NaN    7.0    NaN    9.0

df1.dropna()
# Output
0    1    2
0    1.0    2.0    3.0

df1.fillna(1)
# Output
0    1    2
0    1.0    2.0    3.0
1    1.0    5.0    6.0
2    7.0    1.0    9.0
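
The thresh argument is the least obvious of the dropna options used above: a row is kept when it has at least that many non-NaN values. The sketch below rebuilds the dframe2 data and writes the same selection out by hand; the names frame, non_null_per_row, kept and same are illustrative assumptions.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([
    [1, 2, 3, np.nan],
    [2, np.nan, 5, 6],
    [np.nan, 7, np.nan, 9],
    [1, np.nan, np.nan, np.nan],
])

# Number of non-NaN values in each row: this is what thresh is compared against
non_null_per_row = frame.notnull().sum(axis=1)

kept = frame.dropna(thresh=2)          # keep rows with at least 2 non-NaN values
same = frame[non_null_per_row >= 2]    # the same selection written out by hand

# A dict passed to fillna maps each column label to its own fill value
filled = frame.fillna(value={0: 0, 1: 1, 2: 2, 3: 3})

print(non_null_per_row.tolist())
print(kept.index.tolist())
print(filled)
```

Only the last row has a single non-NaN value, so it is the only row removed with thresh=2.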

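Chapter 5 Plotting and Visualization with Matplotlib

Data visualization is a very important part of data analysis. This chapter covers the basics of Matplotlib, including how to plot pandas Series and DataFrame objects and how to configure figure styles and display modes.

5-1 Introduction to Matplotlib

Why plot with Python?

  • GUI tools are too complicated
  • Excel is too painful
  • Python is simple and free (sorry, Matlab)

What is matplotlib?

  • A Python package
  • Used for 2D plotting
  • Very powerful and popular
  • Has many extensions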

Hello world in matplotlib

import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
x = np.linspace(0,2*np.pi,100)
y = np.sin(x)
plt.plot(x,y)

Matplotlib

  • Backend: handles where the figure is rendered and displayed
  • Artist: determines what the figure looks like
  • Scripting: pyplot, the Python syntax and API

References

5-2 Simple plotting with matplotlib: plot

In Jupyter Notebook, create a new file: matplotlib的简单绘图-plot.ipynb

import numpy as np
import matplotlib.pyplot as plt

a = [1, 2, 3]
plt.plot(a)
# [<matplotlib.lines.Line2D at 0x20619607f40>]
# plt.show() # newer versions display the figure directly

a = [1, 2, 3]
b = [4, 5, 6]
plt.plot(a, b)

a = [1, 2, 3]
b = [4, 5, 6, 7]
plt.plot(a, b)
# ValueError: x and y must have same first dimension, but have shapes (3,) and (4,)
a = [1, 2, 3]
b = [4, 5, 6]
# %matplotlib inline # with older versions, adding this line makes show() unnecessary; newer versions do not need it
plt.plot(a, b)
%timeit np.arange(10)
# 464 ns ± 25.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
plt.plot(a,b,'*')

plt.plot(a,b,'--')

plt.plot(a,b,'r--')

plt.plot(a,b,'w--')  # the line is white, so it cannot be seen

c = [10, 8, 6]
d = [1, 8, 3]
plt.plot(a, b, c, d)

c = [10, 8, 6]
d = [1, 8, 3]
plt.plot(a, b, 'r--', c, d, 'b*')

t = np.arange(0.0, 2.0, 0.1)
t
# array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9])
t.size
# 20
s = np.sin(t*np.pi)
s
# Output
array([ 0.00000000e+00,  3.09016994e-01,  5.87785252e-01,  8.09016994e-01,
        9.51056516e-01,  1.00000000e+00,  9.51056516e-01,  8.09016994e-01,
        5.87785252e-01,  3.09016994e-01,  1.22464680e-16, -3.09016994e-01,
       -5.87785252e-01, -8.09016994e-01, -9.51056516e-01, -1.00000000e+00,
       -9.51056516e-01, -8.09016994e-01, -5.87785252e-01, -3.09016994e-01])

plt.plot(t, s)

plt.plot(t, s, 'r--')

plt.plot(t, s, 'r--', t*2, s)

plt.plot(t, s, 'r--', t*2, s, '--')

plt.plot(t, s, 'r--', t*2, s, '--')
plt.xlabel('this is x')
plt.ylabel('this is y')
plt.title('this is a demo')

plt.plot(t, s, 'r--', label='aaaa')
plt.plot(t*2, s, 'b--', label='bbbb')
plt.xlabel('this is x')
plt.ylabel('this is y')
plt.title('this is a demo')
plt.legend()

5-3 Simple plotting with matplotlib: subplot

In Jupyter Notebook, create a new file: matplotlib简单绘图之subplot.ipynb

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0.0, 5.0)
y1 = np.sin(np.pi*x)
y2 = np.sin(np.pi*x*2)

plt.plot(x, y1, 'b--', label='sin(pi*x)')
plt.ylabel('y1 value')
plt.plot(x, y2, 'r--', label='sin(pi*2x)')
plt.ylabel('y2 value')
plt.xlabel('x value')
plt.title('this is x-y value')
plt.legend()

plt.subplot(2, 1, 1)
plt.plot(x, y1, 'b--')
plt.ylabel('y1')
plt.subplot(2, 1, 2)
plt.plot(x, y2, 'r--')
plt.ylabel('y2')
plt.xlabel('x')

plt.subplot(2, 2, 1)
plt.plot(x, y1, 'b--')
plt.ylabel('y1')
plt.subplot(2, 2, 2)
plt.plot(x, y2, 'r--')
plt.ylabel('y2')
plt.xlabel('x')
plt.subplot(2, 2, 3)
plt.plot(x, y1, 'r*')
plt.subplot(2, 2, 4)
plt.plot(x, y1, 'b*')

plt.subplot(221)
plt.plot(x, y1, 'b--')
plt.ylabel('y1')
plt.subplot(222)
plt.plot(x, y2, 'r--')
plt.ylabel('y2')
plt.xlabel('x')
plt.subplot(223)
plt.plot(x, y1, 'r*')
plt.subplot(224)
plt.plot(x, y1, 'b*')

a = plt.subplots()
a

type(a)
# tuple
a[0]

a[1]
# <AxesSubplot:>
figure, ax = plt.subplots()
ax.plot([1, 2, 3, 4, 5])

figure, ax = plt.subplots(2, 2)
ax.plot([1, 2, 3, 4, 5])
# AttributeError: 'numpy.ndarray' object has no attribute 'plot'
figure, ax = plt.subplots(2, 2)
ax

ax[0][0].plot(x, y1)
ax[0][1].plot(x, y2)
# [<matplotlib.lines.Line2D at 0x23c99d1df70>]
figure    # older versions: plt.show()
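
The AttributeError above arises because plt.subplots(2, 2) returns a NumPy array of Axes rather than a single Axes. As a minimal sketch of one way to work with that array, the loop below uses ax.flat to fill every panel; the Agg backend line and the panel titles are assumptions added so the script also runs outside a notebook, and the backend line can be dropped inside Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for running as a plain script
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0.0, 5.0)
curves = [np.sin(np.pi * x), np.sin(2 * np.pi * x)]

figure, ax = plt.subplots(2, 2)   # ax is a 2x2 numpy array of Axes objects
print(type(ax), ax.shape)

# ax.flat walks the grid row by row, so one loop can fill every panel
for i, axes in enumerate(ax.flat):
    axes.plot(x, curves[i % 2])
    axes.set_title(f"panel {i}")

figure.tight_layout()
```

Indexing with ax[0][0] or ax[0, 0] reaches the same objects; the loop is only a convenience when every panel is drawn the same way.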

5-4 Plotting a pandas Series

In Jupyter Notebook, create a new file: Pandas绘图之Series.ipynb

import numpy as np
import pandas as pd
from pandas import Series
import matplotlib.pyplot as plt

s = Series([1, 2, 3, 4, 5])
s
# Output
0    1
1    2
2    3
3    4
4    5
dtype: int64

s.cumsum()
# Output
0     1
1     3
2     6
3    10
4    15
dtype: int64

s1 = Series(np.random.randn(100)).cumsum()
s1.plot()

s1.plot(kind='bar')

s1 = Series(np.random.randn(1000)).cumsum()
s1.plot(kind='bar')

s1.plot(kind='line')

s1.plot(kind='line', grid=True)

s1.plot(kind='line', grid=True, label='S1', title='This is Series')
plt.legend()

s1.plot(kind='line', grid=True, label='S1', title='This is Series', style='--')
plt.legend()

s1 = Series(np.random.randn(1000)).cumsum()
s2 = Series(np.random.randn(1000)).cumsum()
s1.plot(kind='line', grid=True, label='S1', title='This is Series')
s2.plot(label='S2')

fig, ax = plt.subplots(2,1)
ax
ax[0].plot(s1)
ax[1].plot(s2)

s1.plot(ax=ax[0], label='S1')
s2.plot(ax=ax[1], label='S2')
fig

s1[0:10].plot(ax=ax[0], label='S1', kind='bar')
fig

s1 = Series(np.random.randn(10)).cumsum()
s2 = Series(np.random.randn(10)).cumsum()
fig, ax = plt.subplots(2,1)
s1[0:10].plot(ax=ax[0], label='S1', kind='bar')
s2.plot(ax=ax[1], label='S2')
plt.show()

5-5 Plotting a pandas DataFrame

In Jupyter Notebook, create a new file: Pandas绘图之DataFrame.ipynb

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import Series, DataFrame

df = DataFrame(np.random.randint(1, 10, 40).reshape(10,4), columns=['A','B','C','D'])
df
# Output
    A   B   C   D
0   6   9   5   4
1   6   3   3   3
2   9   3   8   1
3   4   3   9   9
4   4   5   3   7
5   2   2   4   4
6   5   8   9   9
7   6   8   3   1
8   1   5   6   6
9   5   1   5   1

df.plot()

df.plot(kind='bar')

df.plot(kind='barh')

df.plot(kind='bar', stacked=True)

df.plot(kind='area')

df.iloc[5]
# Output
A    2
B    2
C    4
D    4
Name: 5, dtype: int32

a = df.iloc[5]
type(a)
# pandas.core.series.Series

df.iloc[5].plot()

# Plot every row as its own line, labeled by its index
for i in df.index:
    df.iloc[i].plot(label=str(i))
plt.legend()
plt.show()

df['A']
# Output
0    6
1    6
2    9
3    4
4    4
5    2
6    5
7    6
8    1
9    5
Name: A, dtype: int32

df['A'].plot()

df.plot()

df
# Output
    A   B   C   D
0   6   9   5   4
1   6   3   3   3
2   9   3   8   1
3   4   3   9   9
4   4   5   3   7
5   2   2   4   4
6   5   8   9   9
7   6   8   3   1
8   1   5   6   6
9   5   1   5   1

df.T
# Output
    0   1   2   3   4   5   6   7   8   9
A   6   6   9   4   4   2   5   6   1   5
B   9   3   3   3   5   2   8   8   5   1
C   5   3   8   9   3   4   9   3   6   5
D   4   3   1   9   7   4   9   1   6   1

df.T.plot()

5-6 Histograms and density plots

In Jupyter Notebook, create a new file: matplotlib里的直方图和密度图.ipynb

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import Series, DataFrame

# Histogram
s = Series(np.random.randn(1000))
plt.hist(s)

plt.hist(s, rwidth=0.9)

a = np.arange(10)
a
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

plt.hist(a, rwidth=0.9)

s
# Output
0      1.234429
1     -0.157434
2     -0.587175
3      0.583882
4     -0.612201
         ...
995   -0.155292
996   -1.126121
997   -0.173816
998    0.608625
999   -0.862248
Length: 1000, dtype: float64

re = plt.hist(s, rwidth=0.9)

re
# Output
(array([  6.,  22.,  55., 127., 210., 234., 171., 115.,  43.,  17.]),
 array([-3.24995597, -2.65054559, -2.0511352 , -1.45172482, -0.85231444,
        -0.25290405,  0.34650633,  0.94591672,  1.5453271 ,  2.14473748,
         2.74414787]),
 <BarContainer object of 10 artists>)

type(re)
# tuple

len(re)
# 3

re[0]
# array([  6.,  22.,  55., 127., 210., 234., 171., 115.,  43.,  17.])

re[1]
# array([-3.24995597, -2.65054559, -2.0511352 , -1.45172482, -0.85231444, -0.25290405,  0.34650633,  0.94591672,  1.5453271 ,  2.14473748, 2.74414787])

re[2]
# <BarContainer object of 10 artists>

plt.hist(s)

plt.hist(s, rwidth=0.9)

plt.hist(s, rwidth=0.9, bins=20)

plt.hist(s, rwidth=0.9, bins=200)

plt.hist(s, rwidth=0.9, bins=20, color='r')
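
The first two elements of the tuple returned by plt.hist are the per-bin counts and the bin edges. A minimal sketch, assuming only NumPy, shows the same two arrays computed without drawing anything; the seed and the name values are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)      # seeded only to make the sketch repeatable
values = rng.standard_normal(1000)

# np.histogram performs the binning on its own, without creating a figure
counts, edges = np.histogram(values, bins=10)

print(len(counts), len(edges))   # 10 bins are delimited by 11 edges
print(counts.sum())              # every sample falls into exactly one bin
print(edges[0] == values.min(), edges[-1] == values.max())
```

This explains why re[1] in the notebook has one more entry than re[0], and why increasing bins makes each bar narrower rather than changing the range covered.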

s.plot()

s.plot(kind='kde')

https://matplotlib.org/stable/tutorials/index

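Chapter 6 Plotting and Visualization with Seaborn

Seaborn is a higher-level wrapper around Matplotlib. Its powerful color handling and wide range of built-in plot types have made it one of the most popular plotting tools in data science. This chapter introduces the basics of Seaborn and compares its features with matplotlib.

6-1 Introduction to seaborn

Seaborn - Powerful Matplotlib Extension
Statistical data visualization

What are Seaborn's advantages?

In Jupyter Notebook, create a new file: Seaborn和matplotlib对比.ipynb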

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# the line above can be omitted in newer versions

iris = pd.read_csv('../homework/iris.csv')
iris.head()
# Output
SepalLength SepalWidth  PetalLength PetalWidth  Name
0   5.1 3.5 1.4 0.2 Iris-setosa
1   4.9 3.0 1.4 0.2 Iris-setosa
2   4.7 3.2 1.3 0.2 Iris-setosa
3   4.6 3.1 1.5 0.2 Iris-setosa
4   5.0 3.6 1.4 0.2 Iris-setosa

# Task: draw a scatter plot of petal length against sepal length, with point colors distinguishing the iris species
iris.Name.unique()
# array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

color_map = dict(zip(iris.Name.unique(), ['blue', 'green', 'red']))
for species, group in iris.groupby('Name'):
    plt.scatter(group['PetalLength'], group['SepalLength'],
                color=color_map[species],
                alpha=0.3, edgecolor=None,
                label=species)
plt.legend(frameon=True, title='Name')
plt.xlabel('petalLength')
plt.ylabel('sepalLength')

# Keyword arguments work in both older and newer seaborn versions
sns.lmplot(x='PetalLength', y='SepalLength', data=iris, hue='Name', fit_reg=False)

https://www.youtube.com/watch?v=FytuB8nFHPQ

6-2 Histograms and density plots with seaborn

In Jupyter Notebook, create a new file: seaborn实现直方图和密度图.ipynb

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import Series, DataFrame
%matplotlib inline
import seaborn as sns

s1 = Series(np.random.randn(100))
plt.hist(s1)

s1.plot(kind='kde')

sns.distplot(s1, hist=True, kde=True)

sns.distplot(s1, hist=True, kde=False)

sns.distplot(s1, hist=False, kde=True)

sns.distplot(s1, hist=False, kde=True, rug=True)

sns.distplot(s1, bins=20, hist=False, kde=True, rug=True)

sns.distplot(s1, bins=20, hist=True, kde=False, rug=True)

sns.kdeplot(s1)

sns.kdeplot(s1, shade=True)

sns.kdeplot(s1, shade=True, color='r')

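sns.plt.plot(s1)  # equivalent to plt.plot(s1); worked in older versions, does not seem to work in newer ones
sns.plt.hist(s1)  # equivalent to plt.hist(s1); worked in older versions, does not seem to work in newer ones

# Exercise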
sns.rugplot(s1)

6-3 Bar plots and heatmaps with seaborn

In Jupyter Notebook, create a new file: seaborn实现柱状图和热力图.ipynb

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import Series, DataFrame
%matplotlib inline

df = sns.load_dataset('flights')  # the source of the flights dataset is given below
df.head()
# Output
year    month   passengers
0   1949    Jan 112
1   1949    Feb 118
2   1949    Mar 132
3   1949    Apr 129
4   1949    May 121

df.shape
# (144, 3)
https://github.com/mwaskom/seaborn-data

df = df.pivot(index='month', columns='year', values='passengers')
df
# Output
year    1949    1950    1951    1952    1953    1954    1955    1956    1957    1958    1959    1960
month
Jan 112 115 145 171 196 204 242 284 315 340 360 417
Feb 118 126 150 180 196 188 233 277 301 318 342 391
Mar 132 141 178 193 236 235 267 317 356 362 406 419
Apr 129 135 163 181 235 227 269 313 348 348 396 461
May 121 125 172 183 229 234 270 318 355 363 420 472
Jun 135 149 178 218 243 264 315 374 422 435 472 535
Jul 148 170 199 230 264 302 364 413 465 491 548 622
Aug 148 170 199 242 272 293 347 405 467 505 559 606
Sep 136 158 184 209 237 259 312 355 404 404 463 508
Oct 119 133 162 191 211 229 274 306 347 359 407 461
Nov 104 114 146 172 180 203 237 271 305 310 362 390
Dec 118 140 166 194 201 229 278 306 336 337 405 432

sns.heatmap(df)

df.plot()

sns.heatmap(df, annot=True)

sns.heatmap(df, annot=True, fmt='d')
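
The heatmap depends on the pivot step, which turns one record per (year, month) into a month-by-year grid. A minimal sketch on a four-row excerpt of the flights values shown above illustrates the reshaping; the names long_df and wide_df are illustrative assumptions.

```python
import pandas as pd

# Four records in the same long format as the flights data
long_df = pd.DataFrame({
    "year":       [1949, 1949, 1950, 1950],
    "month":      ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [112, 118, 115, 126],
})

# One row per month, one column per year, passengers as the cell values
wide_df = long_df.pivot(index="month", columns="year", values="passengers")

print(wide_df)
print(wide_df.loc["Jan", 1950])
print(wide_df.sum().tolist())   # column sums, i.e. the total per year
```

Summing the columns of the wide table gives the yearly totals, which is the Series that the bar plot later in this section is built from.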

df.sum()
# Output
year
1949    1520
1950    1676
1951    2042
1952    2364
1953    2700
1954    2867
1955    3408
1956    3939
1957    4421
1958    4572
1959    5140
1960    5714
dtype: int64

s = df.sum()
s.index
# Int64Index([1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960], dtype='int64', name='year')

s.values
# array([1520, 1676, 2042, 2364, 2700, 2867, 3408, 3939, 4421, 4572, 5140, 5714], dtype=int64)

sns.barplot(x=s.index, y=s.values)

type(s)
# pandas.core.series.Series

s.plot(kind='bar')

6-4 Configuring how seaborn figures look

In Jupyter Notebook, create a new file: seaborn设置图形显示效果.ipynb

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# 1. axes_style and set_style
%matplotlib inline

x = np.linspace(0, 14, 100)
y1 = np.sin(x)
y2 = np.sin(x+2)*1.25
plt.plot(x, y1)
plt.plot(x, y2)

def sinplot():
    plt.plot(x, y1)
    plt.plot(x, y2)

sinplot()

import seaborn as sns
sinplot()

style = ['darkgrid', 'dark', 'white', 'whitegrid', 'ticks']
sns.set_style(style[0])    # try the other styles in the list as well
sinplot()
sns.axes_style()

sns.set_style(style[0], {'grid.color':'red'})
sinplot()

sns.axes_style()

sns.set()
sinplot()

sns.set_style('white')
sinplot()

sns.set()
sinplot()

# 2. plotting_context() and set_context()
context = ['paper', 'notebook', 'talk', 'poster']
sns.set_context(context[3])
sinplot()

sns.plotting_context()

sns.set_context(context[1], rc={'grid.linewidth':1.0})
sinplot()

sns.set_context(context[1], rc={'grid.linewidth':3.0})
sinplot()

sns.plotting_context()
sns.set()
sns.plotting_context()

6-5 Seaborn's powerful color palettes

In Jupyter Notebook, create a new file: seaborn强大的调色功能.ipynb

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

def sinplot():
    x = np.linspace(0, 14, 100)
    plt.figure(figsize=(8, 6))
    for i in range(4):
        plt.plot(x, np.sin(x+i)*(i+0.75), label="sin(x+%s)*(%s+0.75)" % (i,i))
    plt.legend()

sinplot()

import seaborn as sns
sinplot()

import seaborn as sns
sns.set()
sinplot()

sns.color_palette()  # RGB

sns.palplot(sns.color_palette())

pal_style = ['deep', 'muted', 'pastel', 'bright', 'dark', 'colorblind']
sns.palplot(sns.color_palette('dark'))

sns.set_palette(sns.color_palette('dark'))
sns.color_palette()

sinplot()

sns.set()
sinplot()

with sns.color_palette('dark'):
    sinplot()

sinplot()

sns.color_palette()

pal1 = sns.color_palette([(0.5, 0.1, 0.7),(0.3, 0.1, 0.9)])
sns.palplot(pal1)

sns.color_palette('hls', 8)

http://seaborn.pydata.org/tutorial
Choosing color palettes

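Chapter 7 Data Analysis Project

Having worked through the first six chapters, we now have a working grasp of the main tools used in data analysis. In this chapter we work through a stock-market analysis project together, applying what we have learned to analyze data and extract useful information.

7-1 Preparation

The data science workflow

Bilibili site-wide video information crawler: https://github.com/chenjiandongx/bili-spider
What public data sources are there for data analysis and mining?
  https://www.zhihu.com/question/19969760
  https://www.quora.com/Data/Where-can-I-find-large-datasets-open-to-the-public
  https://www.kaggle.com/datasets
Google Public Data Explorer: https://www.google.com/publicdata/directory
AWS Public Datasets: https://aws-amazon.com/public-datasets/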

7-2 Stock market analysis: getting the data

In Jupyter Notebook, create a new file: 股票市场分析实战--数据获取.ipynb

http://finance.yahoo.com/
Search for Alibaba: BABA
https://pandas-datareader.readthedocs.io/en/latest/
conda install pandas-datareader  # install pandas-datareader
import pandas_datareader as pdr
alibaba = pdr.get_data_yahoo('BABA')
alibaba.head()
# Output
High    Low Open    Close   Volume  Adj Close
Date
2017-02-22  105.199997  102.419998  102.480003  104.199997  15779400    104.199997
2017-02-23  104.860001  101.820000  104.720001  102.459999  10095600    102.459999
2017-02-24  103.000000  101.300003  101.389999  102.949997  7356600 102.949997
2017-02-27  103.824997  102.220001  102.500000  103.599998  6865900 103.599998
2017-02-28  103.989998  102.029999  103.889999  102.900002  8017500 102.900002

alibaba.shape
# (1259, 6)

alibaba.tail()
# Output
High    Low Open    Close   Volume  Adj Close
Date
2022-02-14  122.384003  119.370003  120.559998  121.919998  13154900    121.919998
2022-02-15  126.805000  123.377998  123.769997  126.239998  14634700    126.239998
2022-02-16  127.580002  124.510002  125.610001  125.559998  17993300    125.559998
2022-02-17  129.399994  124.056000  125.000000  124.430000  15906000    124.430000
2022-02-18  120.879997  117.199997  120.879997  118.989998  21165800    118.989998

alibaba.describe()
# Output
High    Low Open    Close   Volume  Adj Close
count   1259.000000 1259.000000 1259.000000 1259.000000 1.259000e+03   1259.000000
mean    189.265396  184.532867  187.110629  186.942844  1.848556e+07   186.942844
std 43.814601   42.927590   43.471982   43.413098   1.035043e+07   43.413098
min 103.000000  101.300003  101.389999  102.309998  4.120700e+06   102.309998
25% 163.034500  158.815002  161.314995  160.599998  1.227665e+07   160.599998
50% 183.100006  178.520004  180.800003  180.839996  1.617560e+07   180.839996
75% 213.695000  209.366501  211.195000  211.480003  2.129500e+07   211.480003
max 319.320007  308.910004  313.500000  317.140015  1.418300e+08   317.140015

alibaba.info
# Output (without parentheses this shows the bound method instead of calling it)
<bound method DataFrame.info of                   High         Low        Open       Close    Volume  \
Date
2017-02-22  105.199997  102.419998  102.480003  104.199997  15779400
2017-02-23  104.860001  101.820000  104.720001  102.459999  10095600
2017-02-24  103.000000  101.300003  101.389999  102.949997   7356600
2017-02-27  103.824997  102.220001  102.500000  103.599998   6865900
2017-02-28  103.989998  102.029999  103.889999  102.900002   8017500
...                ...         ...         ...         ...       ...
2022-02-14  122.384003  119.370003  120.559998  121.919998  13154900
2022-02-15  126.805000  123.377998  123.769997  126.239998  14634700
2022-02-16  127.580002  124.510002  125.610001  125.559998  17993300
2022-02-17  129.399994  124.056000  125.000000  124.430000  15906000
2022-02-18  120.879997  117.199997  120.879997  118.989998  21165800

             Adj Close
Date
2017-02-22  104.199997
2017-02-23  102.459999
2017-02-24  102.949997
2017-02-27  103.599998
2017-02-28  102.900002
...                ...
2022-02-14  121.919998
2022-02-15  126.239998
2022-02-16  125.559998
2022-02-17  124.430000
2022-02-18  118.989998

[1259 rows x 6 columns]>

alibaba.info()
# Output
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1259 entries, 2017-02-22 to 2022-02-18
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   High       1259 non-null   float64
 1   Low        1259 non-null   float64
 2   Open       1259 non-null   float64
 3   Close      1259 non-null   float64
 4   Volume     1259 non-null   int64
 5   Adj Close  1259 non-null   float64
dtypes: float64(5), int64(1)
memory usage: 68.9 KB

7-3 Stock market analysis: historical trend analysis

In Jupyter Notebook, create a new file: 股票市场分析实战--历史趋势分析.ipynb

# Basics
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# Reading stock data
import pandas_datareader as pdr

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# time
from datetime import datetime

start = datetime(2015,9,20)
alibaba = pdr.get_data_yahoo('BABA',start=start)
amazon = pdr.get_data_yahoo('AMZN',start=start)

# alibaba.to_csv('../homework/BABA.csv')
# amazon.to_csv('../homework/AMZN.csv')

alibaba.head()
# Output
High    Low Open    Close   Volume  Adj Close
Date
2015-09-21  66.400002   62.959999   65.379997   63.900002   22355100    63.900002
2015-09-22  63.270000   61.580002   62.939999   61.900002   14897900    61.900002
2015-09-23  62.299999   59.680000   61.959999   60.000000   22684600    60.000000
2015-09-24  60.340000   58.209999   59.419998   59.919998   20645700    59.919998
2015-09-25  60.840000   58.919998   60.630001   59.240002   17009100    59.240002

alibaba['Adj Close'].plot()

alibaba['Adj Close'].plot(legend=True)

alibaba['Volume'].plot(legend=True)

alibaba['Adj Close'].plot()
amazon['Adj Close'].plot()

alibaba.head()
# Output
High    Low Open    Close   Volume  Adj Close
Date
2015-09-21  66.400002   62.959999   65.379997   63.900002   22355100    63.900002
2015-09-22  63.270000   61.580002   62.939999   61.900002   14897900    61.900002
2015-09-23  62.299999   59.680000   61.959999   60.000000   22684600    60.000000
2015-09-24  60.340000   58.209999   59.419998   59.919998   20645700    59.919998
2015-09-25  60.840000   58.919998   60.630001   59.240002   17009100    59.240002

alibaba['high-low'] = alibaba['High'] - alibaba['Low']
alibaba['high-low'].plot()

alibaba.head()
# Output
High    Low Open    Close   Volume  Adj Close   high-low
Date
2015-09-21  66.400002   62.959999   65.379997   63.900002   22355100    63.900002   3.440002
2015-09-22  63.270000   61.580002   62.939999   61.900002   14897900    61.900002   1.689999
2015-09-23  62.299999   59.680000   61.959999   60.000000   22684600    60.000000   2.619999
2015-09-24  60.340000   58.209999   59.419998   59.919998   20645700    59.919998   2.130001
2015-09-25  60.840000   58.919998   60.630001   59.240002   17009100    59.240002   1.920002# daily return
alibaba['daily-return'] = alibaba['Adj Close'].pct_change()
alibaba['daily-return']
# Output
Date
2015-09-21         NaN
2015-09-22   -0.031299
2015-09-23   -0.030695
2015-09-24   -0.001333
2015-09-25   -0.011348
                ...
2022-02-14   -0.002699
2022-02-15    0.035433
2022-02-16   -0.005387
2022-02-17   -0.009000
2022-02-18   -0.043719
Name: daily-return, Length: 1617, dtype: float64

alibaba['daily-return'].plot()
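
To see what pct_change() computes, here is a small sketch that reproduces the first daily returns by hand. It assumes only the three adjusted closing prices shown in the earlier output. The names `adj_close` and `manual` are made up for this illustration.

```python
import pandas as pd

# Adjusted closing prices for the first three trading days in the output above.
adj_close = pd.Series(
    [63.900002, 61.900002, 60.000000],
    index=pd.to_datetime(["2015-09-21", "2015-09-22", "2015-09-23"]),
    name="Adj Close",
)

# pct_change() compares each value with the previous one: p[t] / p[t-1] - 1.
daily_return = adj_close.pct_change()

# The same calculation written out by hand with shift().
manual = adj_close / adj_close.shift(1) - 1

print(daily_return.round(6))
```

The first entry is NaN because the first day has no previous price to compare against, which matches the NaN in the first row of the output above.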

alibaba['daily-return'].plot(figsize=(10,4), linestyle='--', marker='o')

alibaba['daily-return'].plot(kind='hist')

sns.distplot(alibaba['daily-return'].dropna(), bins=100, color='purple')

7-4 Stock Market Analysis in Practice: Risk Analysis

In Jupyter Notebook, create a new file named 股票市场分析实战--风险分析.ipynb

# Basics
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
# Reading stock data
import pandas_datareader as pdr
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# time
from datetime import datetime

start = datetime(2015,1,1)
company = ['AAPL', 'GOOG', 'MSFT', 'AMZN', 'FB']
top_tech_df = pdr.get_data_yahoo(company, start=start)['Adj Close']
# top_tech_df.to_csv('../homework/top5.csv')
top_tech_df.head()
# Output
Symbols     AAPL        GOOG        MSFT        AMZN        FB
Date
2014-12-31  24.951860   524.958740  40.836304   310.350006  78.019997
2015-01-02  24.714508   523.373108  41.108841   308.519989  78.449997
2015-01-05  24.018263   512.463013  40.730816   302.190002  77.190002
2015-01-06  24.020521   500.585632  40.132992   295.290009  76.150002
2015-01-07  24.357340   499.727997  40.642887   298.420013  76.150002

top_tech_dr = top_tech_df.pct_change()
top_tech_dr.head()
# Output
Symbols     AAPL        GOOG        MSFT        AMZN        FB
Date
2014-12-31  NaN         NaN         NaN         NaN         NaN
2015-01-02  -0.009512   -0.003020   0.006674    -0.005897   0.005511
2015-01-05  -0.028172   -0.020846   -0.009196   -0.020517   -0.016061
2015-01-06  0.000094    -0.023177   -0.014677   -0.022833   -0.013473
2015-01-07  0.014022    -0.001713   0.012705    0.010600    0.000000

top_tech_df.plot()

top_tech_df[['AAPL', 'FB', 'MSFT']].plot()

sns.jointplot(x='GOOG', y='GOOG', data=top_tech_dr, kind='scatter')

sns.jointplot(x='AMZN', y='GOOG', data=top_tech_dr, kind='scatter')

sns.jointplot(x='MSFT', y='FB', data=top_tech_dr, kind='scatter')

sns.pairplot(top_tech_dr.dropna())
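
The scatter plots above show visually how the daily returns of two stocks move together. One numeric counterpart is the Pearson correlation from DataFrame.corr(). The sketch below is only an illustration under stated assumptions: it uses the four non-NaN rows copied from the top_tech_dr output, which is far too few for a meaningful estimate, and the name `returns` is made up here.

```python
import pandas as pd

# Daily returns copied from the top_tech_dr output above (NaN first row omitted).
returns = pd.DataFrame(
    {
        "AAPL": [-0.009512, -0.028172, 0.000094, 0.014022],
        "GOOG": [-0.003020, -0.020846, -0.023177, -0.001713],
        "AMZN": [-0.005897, -0.020517, -0.022833, 0.010600],
    },
    index=pd.to_datetime(["2015-01-02", "2015-01-05", "2015-01-06", "2015-01-07"]),
)

# Pearson correlation between every pair of columns; the diagonal is always 1.
corr = returns.corr()
print(corr.round(3))
```

A stock plotted against itself, as in the GOOG versus GOOG jointplot, falls on a straight line, which corresponds to the 1 on the diagonal of this matrix.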

Quantile reference: https://en.wikipedia.org/wiki/Quantile

top_tech_dr['AAPL'].quantile(0.52)
# 0.001484101307547557

top_tech_dr['MSFT'].quantile(0.05)
# -0.025949460040733018

vips = pdr.get_data_yahoo('VIPS', start=start)['Adj Close']  # Vipshop
vips.plot()

vips.pct_change().quantile(0.2)
# -0.02404166247703312
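
One way to read these numbers: the 0.05 quantile of MSFT daily returns (about -0.026) means that on roughly 95% of the trading days in the sample, the daily return was no worse than about -2.6%. The sketch below shows how quantile() picks its value, assuming a tiny series of invented returns (not real market data) chosen so that the arithmetic is easy to follow.

```python
import pandas as pd

# Hypothetical daily returns, invented for illustration (not real market data).
returns = pd.Series([-0.05, -0.03, -0.01, 0.00, 0.02])

# quantile(q) locates position (n - 1) * q in the sorted values.
# With 5 values, q=0.25 lands exactly on the second smallest one.
q25 = returns.quantile(0.25)

# q=0.5 is the median (the middle value of the five).
q50 = returns.quantile(0.5)

# q=0.125 gives position 0.5, halfway between the two smallest values,
# so pandas interpolates linearly between -0.05 and -0.03 by default.
q125 = returns.quantile(0.125)

print(q25, q50, q125)
```

With real data, a lower quantile of the daily returns gives a historical estimate of how bad a day can be at the chosen confidence level.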

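Chapter 8 Course Summary

This chapter is not a recap of the preceding seven chapters. Instead, it points out a path for continued study and practice. Keep practicing, and the results will come.

8-1 Summary

Four PDF cheat sheets (worth printing out and consulting often):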
    Numpy_Python_cheat_Sheet.pdf
    Pandas_cheat_sheet.pdf
    Python_Matplotlib_Cheat_Sheet.pdf
    Seaborn_cheat_sheet.pdf

The Kaggle website:

1. Read plenty of tutorial examples
https://www.kaggle.com/datasets: provides public datasets
Bitcoin Historical Data: historical Bitcoin price data
kernels: where the data science work is done
Previously: https://www.kaggle.com/kernels
Now: https://www.kaggle.com/code
Data Science Tutorial for Beginners
Notebooks are written in Jupyter Notebook and can be forked
Forking a notebook directly makes Kaggle open an online Jupyter Notebook
