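Calling C/C++ from Python: Cython and pybind11

Reposted from: https://zhuanlan.zhihu.com/p/442935082

Python is very convenient to write, but it gets painfully slow when faced with a large number of for loops. The reason is that Python is a dynamically typed language: type checks happen only at run time, which is inefficient (especially in large for loops).

By contrast, in C/C++ the type of every variable is fixed ahead of time and the code is compiled into a binary executable. Compared with Python, C/C++ is much more efficient, and large for loops run fast.

Since Python's weak spot is speed, can we speed it up by calling C/C++ code from Python?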

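The Python interpreter

When we write Python code, what we get is a text file with the .py extension containing Python code. To run it, a Python interpreter has to execute the .py file.

(Go on, translate it for me: what exactly is "Python code"?)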

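CPython

When we download and install Python from the official website, we get the official interpreter directly: CPython. It is written in C, hence the name. Running python on the command line starts the CPython interpreter. CPython is the most widely used Python interpreter.

CPython itself is slow, but it works very well when it calls into C/C++ code. Many numerical libraries such as numpy are largely written in C/C++, so we get both Python's concise syntax and C/C++'s execution speed. In some cases numpy is even faster than hand-written C/C++, because it uses CPU instruction-set (SIMD) optimizations and, for some operations, multi-core parallelism.

All the ways of calling C/C++ from Python discussed here are based on the CPython interpreter.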
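Since everything below assumes CPython, it can be worth checking which interpreter is actually running. A minimal sketch using only the standard library (the function name describe_interpreter is made up for illustration):

```python
import platform

def describe_interpreter():
    """Return (implementation name, version string) of the running interpreter."""
    # platform.python_implementation() reports e.g. 'CPython', 'PyPy',
    # 'Jython' or 'IronPython'.
    impl = platform.python_implementation()
    version = platform.python_version()
    return impl, version

impl, version = describe_interpreter()
print(f"Running on {impl} {version}")
if impl != "CPython":
    print("Cython/pybind11 extension modules are built against CPython's C API.")
```

If this reports something other than CPython, the build steps described in this article may not apply as written.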

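IronPython

IronPython is similar to Jython (described next), except that it runs on Microsoft's .NET platform and compiles Python code directly to .NET bytecode. The drawback is that common libraries such as numpy are compiled C/C++ extensions, so calling them from IronPython is very inconvenient. (Microsoft has since stopped sponsoring IronPython; it is now community-maintained.)

Jython

Jython is a Python interpreter that runs on the Java platform and compiles Python code directly to Java bytecode. Its advantage is that it can call Java libraries; its drawback is the same as IronPython's.

PyPy

PyPy is an interpreter implemented in Python (more precisely in RPython, a restricted subset of Python), that is, Python interpreting .py files. Its goal is execution speed. PyPy uses JIT technology to compile Python code dynamically (note: compile, not interpret), so it can significantly speed up Python code.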

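Why dynamic interpretation is slow

Suppose we have a simple Python function:

def add(x, y):
    return x + y

When CPython executes it, what happens looks roughly like this (pseudocode):

if instance_has_method(x, '__add__') {
    // x.__add__ itself contains a pile of further checks on the type of y
    return call(x, '__add__', y);
} else if instance_has_method(super_class(x), '__add__') {
    return call(super_class(x), '__add__', y);
} else if isinstance(x, str) and isinstance(y, str) {
    return concat_str(x, y);
} else if isinstance(x, float) and isinstance(y, float) {
    return add_float(x, y);
} else if isinstance(x, int) and isinstance(y, int) {
    return add_int(x, y);
} else ...

Because Python is dynamically typed, even a simple function has to go through many type checks. And that is not all: do you think the function that adds two integers is just x + y as in C? No.

Everything in Python is an object. A Python int is actually a struct that looks roughly like this (pseudocode):

struct {
    prev_gc_obj *obj
    next_gc_obj *obj
    type IntType
    value IntValue
    ... other fields
}

Every int is such a struct, dynamically allocated on the heap, and its value cannot be changed. So adding the struct for 1000 to the struct for 1000 means malloc-ing a new struct for 2000 on the heap. Once the result is no longer needed, its memory has to be reclaimed as well. (With this many operations, it cannot be fast.)

So static compilation plus explicitly declared variable types would greatly improve execution speed.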
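The two costs described above, dispatch by operand type and boxed integer objects, can be observed from plain Python. This is only an illustrative sketch: the Meters class is a hypothetical type made up for the example, and the exact size reported by sys.getsizeof depends on the interpreter build.

```python
import sys

def add(x, y):
    return x + y

class Meters:
    """Hypothetical user-defined type, only here to show __add__ dispatch."""
    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        return Meters(self.value + other.value)

# The same function body takes a different code path for each operand type.
results = [
    add(1000, 1000),                  # int addition
    add(1.5, 2.5),                    # float addition
    add("py", "thon"),                # str concatenation
    add(Meters(2), Meters(3)).value,  # user-defined __add__
]
print(results)

# An int is a full object with a header, not a bare machine word.
int_size = sys.getsizeof(1000)
print(f"sys.getsizeof(1000) = {int_size} bytes")

# ints are immutable: a + 1000 produces a new object, b still sees the old one.
a = 1000
b = a
a = a + 1000
print(a, b)
```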

Cython

What is Cython

Cython is a programming language whose syntax is based on Python but which mixes in some C/C++ syntax. For example, in Cython you can declare variable types, use parts of the C++ STL (such as std::vector), or call C/C++ functions you wrote yourself.

Note: Cython is not CPython!

Plain Python

Suppose we have RawPython.py:

from math import sqrt
import time

def func(n):
    res = 0
    for i in range(1, n):
        res = res + 1.0 / sqrt(i)
    return res

def main():
    start = time.time()
    res = func(30000000)
    print(f"res = {res}, use time {time.time() - start:.5}")

if __name__ == '__main__':
    main()

First we run it the plain Python way to see how long it takes; on my machine it takes about 4 seconds.
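A single time.time() measurement is noisy, so when comparing variants it can help to take the best of several runs. The sketch below uses the standard timeit module; the value of n is an assumption chosen to keep the run short and is much smaller than the 30000000 used in the article.

```python
import timeit
from math import sqrt

def func(n):
    res = 0.0
    for i in range(1, n):
        res = res + 1.0 / sqrt(i)
    return res

n = 200_000  # assumption: reduced from the article's 30000000 for a quick run
# Run the function three times, once per measurement, and keep the fastest.
timings = timeit.repeat(lambda: func(n), number=1, repeat=3)
best = min(timings)
print(f"func({n}) best of {len(timings)}: {best:.4f} s")
```

The same harness can be pointed at the compiled variants later on (for example RawPython1.func) to compare like with like.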

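Compiling and running a Cython program

A Cython program is first translated into a .c/.cpp file, which is then compiled into a binary by a C/C++ compiler. On Windows we need a toolchain such as Visual Studio or MinGW; on Linux or Mac we need gcc, clang or similar.

Install Cython via pip:

pip install cython

Rename RawPython.py to RawPython1.pyx.

There are two ways to compile it:

(1) Compile with setup.py

Add a setup.py with the following content. language_level=3 here means Python 3 semantics are used. (setuptools is used because distutils was removed from the standard library in Python 3.12.)

from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize('RawPython1.pyx', language_level=3)
)

Compile the Python code into a binary:

python setup.py build_ext --inplace

Afterwards the current directory contains RawPython1.c (generated from the .pyx) and a binary compiled from the .c: a RawPython1 .pyd file on Windows, or a .so file on Linux/Mac.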

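(2) Compile directly on the command line (using gcc as an example)

cython RawPython1.pyx
gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing -I/usr/include/python3.x -o RawPython1.so RawPython1.c

The first command translates the .pyx into a .c file; the second compiles and links it with gcc.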
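The include path passed with -I and the file suffix of the resulting extension module depend on the local Python installation. As a sketch, both can be queried from the standard sysconfig module rather than typed by hand (the printed strings are just for illustration):

```python
import sysconfig

# Directory that contains Python.h for the running interpreter.
include_dir = sysconfig.get_paths()["include"]

# Suffix this interpreter expects for extension modules
# (ends in .so on Linux/Mac, .pyd on Windows).
ext_suffix = sysconfig.get_config_var("EXT_SUFFIX")

print(f"-I{include_dir}")
print(f"RawPython1{ext_suffix}")
```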

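python -c "import RawPython1; RawPython1.main()"

We can import the compiled RawPython1 module and call it from Python.

Judging from the results so far, the gain is modest: roughly a 2x speedup. Being interpreted is only one reason Python is slow; the most important one is that Python is dynamically typed, so the type of each variable is unknown before run time, and compiling to a binary alone does not make the code fast. This is where we need to use Cython more deeply, namely to declare the data types.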

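Faster! Faster!

Declaring variable types

The advantage of Cython is that, as in C, we can explicitly declare variable types. So as a first step we give types to the loop variable and the accumulator in the Cython function; replacing sqrt with the C version comes next.

def func(int n):
    cdef double res = 0
    cdef int i, num = n
    for i in range(1, num):
        res = res + 1.0 / sqrt(i)
    return res

However, Python's math.sqrt returns a Python float object, which is still fairly inefficient.

To go further, can we use the C sqrt function instead? Of course we can.

Cython ships wrappers for a number of common C functions and C++ classes, which can be called directly from Cython.

We replace the opening line

from math import sqrt

with

from libc.math cimport sqrt

Then we compile and run as before and find it a good deal faster.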

The complete code after these changes:

import time
from libc.math cimport sqrt

def func(int n):
    cdef double res = 0
    cdef int i, num = n
    for i in range(1, num):
        res = res + 1.0 / sqrt(i)
    return res

def main():
    start = time.time()
    res = func(30000000)
    print(f"res = {res}, use time {time.time() - start:.5}")

if __name__ == '__main__':
    main()

Calling C/C++ from Cython

Since C/C++ is efficient, can we call C/C++ directly from Cython? That is, rewrite the function in C and then call it from Cython.

First, write the corresponding C version:

usecfunc.h

#pragma once
#include <math.h>

double c_func(int n)
{
    int i;
    double result = 0.0;
    /* same sum as the Python version: 1/sqrt(1) + ... + 1/sqrt(n-1) */
    for (i = 1; i < n; i++)
        result = result + 1.0 / sqrt(i);
    return result;
}

Then, in Cython, we include this header and call the function:

cdef extern from "usecfunc.h":
    double c_func(int n)

import time

def func(int n):
    return c_func(n)

def main():
    start = time.time()
    res = func(30000000)
    print(f"res = {res}, use time {time.time() - start:.5}")

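Using numpy in Cython

We can call numpy from Cython. But if we simply index into the array, the element type of the numpy array still has to be determined dynamically, which is fairly inefficient.

import numpy as np
cimport numpy as np
from libc.math cimport sqrt
import time

def func(int n):
    # note: np.empty does not initialize memory, and the loop starts at 1,
    # so arr[0] is left unset
    cdef np.ndarray arr = np.empty(n, dtype=np.float64)
    cdef int i, num = n
    for i in range(1, num):
        arr[i] = 1.0 / sqrt(i)
    return arr

def main():
    start = time.time()
    res = func(30000000)
    print(f"len(res) = {len(res)}, use time {time.time() - start:.5}")

Explanation:

cimport numpy as np

This line gives us access to numpy's C/C++ interface (declaring data types, array dimensions, and so on).

import numpy as np

This line gives us access to numpy's Python interface as well (np.array, np.linspace, and so on). Cython resolves the ambiguity internally, so the user does not need two different names.

When compiling, we also need to modify setup.py to add numpy's header directory.

from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize(
        [Extension("RawPython4", ["RawPython4.pyx"], include_dirs=[np.get_include()])],
        language_level=3,
    )
)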

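Faster! Faster!

The code above can be sped up further:

We can declare the data type and number of dimensions of the numpy array, so that the type no longer has to be determined dynamically. The generated code then indexes the array just as C would.
Accessing a numpy array also performs a bounds check each time. If we are sure the program never goes out of bounds, we can turn bounds checking off.
Python also supports negative indices, meaning the i-th element counted from the end. Supporting them requires an extra if...else check. If we do not use this feature, we can turn it off too.
Python also checks for division by zero. We never divide by zero here, so we turn that off.
Other related checks (such as whether a memoryview has been initialized) can be turned off as well.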
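To make concrete what these checks do, here is how the corresponding behaviors look in plain Python. This only illustrates the run-time checks themselves, not Cython's generated code; the variable names are made up for the example.

```python
data = [10.0, 20.0, 30.0]

# Bounds check: an out-of-range index raises instead of reading stray memory.
try:
    data[3]
    bounds_checked = False
except IndexError:
    bounds_checked = True

# Negative indexing ("wraparound"): -1 means the last element.
last = data[-1]

# Division-by-zero check: Python raises instead of producing inf or crashing.
try:
    1.0 / 0.0
    zero_checked = False
except ZeroDivisionError:
    zero_checked = True

print(bounds_checked, last, zero_checked)
```

Each of these guarantees costs a branch per operation, which is why switching them off helps in a tight loop, at the price of undefined behavior if the assumption is ever violated.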

The final, accelerated program:

import numpy as np
cimport numpy as np
from libc.math cimport sqrt
import time
cimport cython

@cython.boundscheck(False)         # disable array bounds checking
@cython.wraparound(False)          # disable negative indexing
@cython.cdivision(True)            # disable the division-by-zero check
@cython.initializedcheck(False)    # disable the check that memoryviews are initialized
def func(int n):
    cdef np.ndarray[np.float64_t, ndim=1] arr = np.empty(n, dtype=np.float64)
    cdef int i, num = n
    for i in range(1, num):
        arr[i] = 1.0 / sqrt(i)
    return arr

def main():
    start = time.time()
    res = func(30000000)
    print(f"len(res) = {len(res)}, use time {time.time() - start:.5}")

cdef np.ndarray[np.float64_t, ndim=1] arr = np.empty(n, dtype=np.float64)

This line declares the element type and number of dimensions by hand when creating the numpy array.

The decorators above turn off bounds checking, negative indexing, the division-by-zero check and the memoryview-initialization check for this one function. We can also set them globally by adding directive comments at the top of the .pyx file:

# cython: boundscheck=False
# cython: wraparound=False
# cython: cdivision=True
# cython: initializedcheck=False

This form also works:

with cython.cdivision(True):
    # do something here

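Other features

Cython borrows a lot of C/C++ syntax, including pointers and references. A C++ struct/class can also be passed through to Cython.

Cython summary

Cython's syntax is similar to Python's, with some C/C++ features added, such as variable type declarations. Cython can also call C/C++ functions.

A characteristic of Cython is that without type declarations, execution speed is not far from plain Python's. Only once types are declared does it become fast.

For more, see the official Cython documentation:

Welcome to Cython's Documentation: https://docs.cython.org/en/latest/index.html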

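pybind11

Cython is a Python-like language, whereas pybind11 is C++-based. We include pybind11 in a .cpp file, define the entry point for Python, then compile and run.

The official site lists a few characteristics of pybind11:

A lightweight header-only library
Goals and syntax similar to the excellent Boost.Python library
Used to create Python bindings for C++ code

Installation

pybind11 can be installed with pip install pybind11 (the almighty pip).

It can also be installed with Visual Studio + vcpkg + CMake.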

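A simple example

#include <pybind11/pybind11.h>

namespace py = pybind11;

int add_func(int i, int j) {
    return i + j;
}

PYBIND11_MODULE(example, m) {
    m.doc() = "pybind11 example plugin";  // optional: describes what this module does
    m.def("add_func", &add_func, "A function which adds two numbers");
}

We first include the pybind11 header, then declare the module with PYBIND11_MODULE.

example: the module name; note that it takes no quotes. Afterwards we can run import example in Python
m: can be thought of as the module object, used to expose the interface to Python
m.doc(): the help text
m.def: registers a function, bridging C++ and Python

m.def("name visible to Python", &actual_function, "description of the function");  // the description is optional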

Compiling & running

pybind11 is header-only, so to use it we only need to include the relevant header in our code.

#include <pybind11/pybind11.h>

On Linux, it can be compiled with a command like this:

c++ -O3 -Wall -shared -std=c++11 -fPIC $(python3 -m pybind11 --includes) example.cpp -o example$(python3-config --extension-suffix)

We can also compile with setup.py (on Windows this needs a toolchain such as Visual Studio or MinGW; on Linux or Mac, gcc or clang)

from setuptools import setup, Extension
import pybind11

functions_module = Extension(
    name='example',
    sources=['example.cpp'],
    include_dirs=[pybind11.get_include()],
)

setup(ext_modules=[functions_module])

Then run the following command to compile:

python setup.py build_ext --inplace

Call it from Python:

python -c "import example; print(example.add_func(200, 33))"

Specifying function arguments in pybind11

With a small code change, we can tell Python the argument names:

m.def("add", &add, "A function which adds two numbers", py::arg("i"), py::arg("j"));

Default arguments can be specified too:

int add(int i = 1, int j = 2) {
    return i + j;
}

Specify the defaults in PYBIND11_MODULE as well:

m.def("add", &add, "A function which adds two numbers", py::arg("i") = 1, py::arg("j") = 2);
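From the Python side, a function bound with named arguments and defaults behaves much like an ordinary Python function with keyword defaults. The pure-Python stand-in below is only a model of that calling convention; it does not use pybind11 or the compiled module.

```python
def add(i=1, j=2):
    """Pure-Python stand-in for the bound C++ function add(int i = 1, int j = 2)."""
    return i + j

# The kinds of calls that named arguments with defaults make possible.
calls = {
    "add()": add(),
    "add(5)": add(5),
    "add(j=10)": add(j=10),
    "add(i=3, j=4)": add(i=3, j=4),
}
for expr, value in calls.items():
    print(f"{expr} -> {value}")
```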

Adding variables to the Python module

PYBIND11_MODULE(example, m) {
    m.attr("the_answer") = 23333;
    py::object world = py::cast("World");
    m.attr("what") = world;
}

Strings need to be converted into Python objects with py::cast.

Then, in Python, we can access the the_answer and what objects:

>>> import example
>>> example.the_answer
23333
>>> example.what
'World'

Calling Python functions from the cpp file

Since everything in Python is an object, we can use py::object to hold Python variables, functions, modules and so on.

py::object os = py::module_::import("os");
py::object makedirs = os.attr("makedirs");
makedirs("/tmp/path/to/somewhere");

This is equivalent to running the following in Python:

import os
makedirs = os.makedirs
makedirs("/tmp/path/to/somewhere")

Using Python lists with pybind11

We can pass in a Python list directly:

void print_list(py::list my_list) {
    for (auto item : my_list)
        py::print(item);
}

PYBIND11_MODULE(example, m) {
    m.def("print_list", &print_list, "function to print list", py::arg("my_list"));
}

Run it from Python:

>>> import example
>>> result = example.print_list([2, 23, 233])
2
23
233
>>> print(result)
None

The function can also take a std::vector<int> as its parameter. Why does this work? pybind11 automatically copy-constructs a std::vector<int> from the Python list object, and on return automatically converts the std::vector back into a Python list. The code:

#include <vector>
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>

std::vector<int> print_list2(std::vector<int> & my_list) {
    auto x = std::vector<int>();
    for (auto item : my_list) {
        x.push_back(item + 233);
    }
    return x;
}

PYBIND11_MODULE(example, m) {
    m.def("print_list2", &print_list2, "help message", py::arg("my_list"));
}

Using numpy with pybind11

numpy is very handy, so being able to pass a numpy array as an argument to pybind11 would be really nice. The code (a long block):

#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>

namespace py = pybind11;

py::array_t<double> add_arrays(py::array_t<double> input1, py::array_t<double> input2) {
    py::buffer_info buf1 = input1.request(), buf2 = input2.request();

    if (buf1.ndim != 1 || buf2.ndim != 1)
        throw std::runtime_error("Number of dimensions must be one");

    if (buf1.size != buf2.size)
        throw std::runtime_error("Input shapes must match");

    /* No pointer is passed, so NumPy will allocate the buffer */
    auto result = py::array_t<double>(buf1.size);

    py::buffer_info buf3 = result.request();

    double *ptr1 = (double *) buf1.ptr,
           *ptr2 = (double *) buf2.ptr,
           *ptr3 = (double *) buf3.ptr;

    for (size_t idx = 0; idx < buf1.shape[0]; idx++)
        ptr3[idx] = ptr1[idx] + ptr2[idx];

    return result;
}

// inside PYBIND11_MODULE(example, m) { ... }
m.def("add_arrays", &add_arrays, "Add two NumPy arrays");

We first take the raw pointers out of the numpy arrays, then operate on those pointers.
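The fields read from py::buffer_info (ndim, shape, the underlying memory) come from Python's buffer protocol, which Python itself exposes through memoryview. As a rough analogy only, the sketch below mirrors the same checks and element-wise loop using the standard array module in place of numpy; it is a pure-Python model, not the pybind11 function.

```python
from array import array

def add_arrays(input1, input2):
    """Pure-Python model of the C++ add_arrays, built on the buffer protocol."""
    buf1, buf2 = memoryview(input1), memoryview(input2)

    if buf1.ndim != 1 or buf2.ndim != 1:
        raise RuntimeError("Number of dimensions must be one")
    if buf1.shape != buf2.shape:
        raise RuntimeError("Input shapes must match")

    # Allocate a zero-filled output buffer of doubles with the same length.
    result = array("d", [0.0]) * buf1.shape[0]
    buf3 = memoryview(result)

    # Element-wise sum, written through the output buffer's view.
    for idx in range(buf1.shape[0]):
        buf3[idx] = buf1[idx] + buf2[idx]
    return result

x = array("d", [1.0, 1.0, 1.0])
y = array("d", [1.0, 1.0, 1.0])
z = add_arrays(x, y)
view = memoryview(x)
print(z.tolist())
print(view.format, view.itemsize, view.ndim, view.shape)
```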

We test it in Python as follows:

>>> import example
>>> import numpy as np
>>> x = np.ones(3)
>>> y = np.ones(3)
>>> z = example.add_arrays(x, y)
>>> print(type(z))
<class 'numpy.ndarray'>
>>> print(z)
[2. 2. 2.]

Here is the complete code:

#include <vector>
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
#include <pybind11/numpy.h>

namespace py = pybind11;

int add_func(int i, int j) {
    return i + j;
}

void print_list(py::list my_list) {
    for (auto item : my_list)
        py::print(item);
}

std::vector<int> print_list2(std::vector<int> & my_list) {
    auto x = std::vector<int>();
    for (auto item : my_list) {
        x.push_back(item + 233);
    }
    return x;
}

py::array_t<double> add_arrays(py::array_t<double> input1, py::array_t<double> input2) {
    py::buffer_info buf1 = input1.request(), buf2 = input2.request();

    if (buf1.ndim != 1 || buf2.ndim != 1)
        throw std::runtime_error("Number of dimensions must be one");

    if (buf1.size != buf2.size)
        throw std::runtime_error("Input shapes must match");

    /* No pointer is passed, so NumPy will allocate the buffer */
    auto result = py::array_t<double>(buf1.size);

    py::buffer_info buf3 = result.request();

    double *ptr1 = (double *) buf1.ptr,
           *ptr2 = (double *) buf2.ptr,
           *ptr3 = (double *) buf3.ptr;

    for (size_t idx = 0; idx < buf1.shape[0]; idx++)
        ptr3[idx] = ptr1[idx] + ptr2[idx];

    return result;
}

PYBIND11_MODULE(example, m) {
    m.doc() = "pybind11 example plugin";  // optional: describes what this module does
    m.def("add_func", &add_func, "A function which adds two numbers");

    m.attr("the_answer") = 23333;
    py::object world = py::cast("World");
    m.attr("what") = world;

    m.def("print_list", &print_list, "function to print list", py::arg("my_list"));
    m.def("print_list2", &print_list2, "help message", py::arg("my_list"));
    m.def("add_arrays", &add_arrays, "Add two NumPy arrays");
}

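pybind11 summary

pybind11 is used from C++ and can expose C++ interfaces to Python programs. It also supports passing in Python lists, numpy arrays and other objects.

For more, see the official pybind11 documentation:

https://pybind11.readthedocs.io/en/stable/

Other ways to call C++ from Python

CPython ships with Python.h; we can include this header in C/C++ and compile a shared library. But using Python.h directly is a bit cumbersome to write.
Boost is a collection of C++ libraries whose Boost.Python component wraps Python.h, but Boost as a whole is rather large and the relevant documentation is not very friendly.
SWIG (Simplified Wrapper and Interface Generator) uses a dedicated syntax to declare C/C++ functions/variables. (TensorFlow used to use it, but has since switched to pybind11.)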

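Summary: when should we optimize?

Developing in Python is concise, while developing in C++ is more laborious.

When writing Python, we can use a profiler or other timing tools to find the slow blocks of code and optimize just those in C++. There is no need to optimize everything.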
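As a sketch of that workflow, the standard cProfile and pstats modules can show where the time goes. The functions hot_loop and cheap_setup are hypothetical stand-ins for a real program, and the loop size is an assumption chosen to keep the run short.

```python
import cProfile
import io
import pstats
from math import sqrt

def hot_loop(n):
    """Stand-in for the expensive part of a program."""
    res = 0.0
    for i in range(1, n):
        res += 1.0 / sqrt(i)
    return res

def cheap_setup():
    """Stand-in for code that is not worth optimizing."""
    return list(range(100))

def main():
    cheap_setup()
    return hot_loop(200_000)

# Profile one call of main() and collect the statistics.
profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Render the five most expensive entries, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

In the report, hot_loop should dominate the cumulative time, which marks it as the candidate for a Cython or pybind11 rewrite, while cheap_setup can stay in Python.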

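Cython and pybind11 do just three things: speed, speed, and more speed. For compute-heavy, time-consuming parts, we can implement them in C/C++, which helps the execution speed of the whole Python program.

There are other ways to speed up Python as well, such as replacing for loops with numpy vectorized operations, or using JIT (just-in-time) compilation.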