Python Source Code Analysis (《Python源码剖析》), Part 1. Author: Chen Ru (陈儒)

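The first thing to be clear about: in the Python world, everything is an object.

I. PyObject

PyObject is the cornerstone of Python's object mechanism: every object begins with the same PyObject part.

PyObject is defined as follows: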

// source file: [object.h]
/* Nothing is actually declared to be a PyObject, but every pointer to
 * a Python object can be cast to a PyObject*.  This is inheritance built
 * by hand.  Similarly every pointer to a variable-size Python object can,
 * in addition, be cast to PyVarObject*.
 */
typedef struct _object {
    PyObject_HEAD
} PyObject;

The most basic contents of a Python object are all contained in the PyObject_HEAD macro:

// source file: [object.h]
#ifdef Py_TRACE_REFS
/* Define pointers to support a doubly-linked list of all live heap objects. */
#define _PyObject_HEAD_EXTRA            \
    struct _object *_ob_next;           \
    struct _object *_ob_prev;

#define _PyObject_EXTRA_INIT 0, 0,

#else
#define _PyObject_HEAD_EXTRA
#define _PyObject_EXTRA_INIT
#endif

/* PyObject_HEAD defines the initial segment of every PyObject. */
#define PyObject_HEAD                   \
    _PyObject_HEAD_EXTRA                \
    Py_ssize_t ob_refcnt;               \
    struct _typeobject *ob_type;

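When Python is compiled in Visual Studio's release mode, the symbol Py_TRACE_REFS is not defined, so in the Python that actually ships, the definition of PyObject is very simple:
typedef struct _object {
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
} PyObject;
Py_ssize_t is a signed integer type with the same width as size_t: 64 bits on a 64-bit build and 32 bits on a 32-bit build.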

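ob_refcnt is the object's reference count, and it is tied to Python's memory management: when ob_refcnt drops to 0, the object is destroyed and its memory is released. ob_type is a pointer to a _typeobject struct. The _typeobject struct corresponds to a type object in Python, so ob_type points to the object's type, and the properties of that type are defined by the _typeobject struct.

So the core that PyObject defines for every object comes down to two things: a reference count and type information.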
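To make the "reference count plus type pointer" idea concrete, here is a minimal sketch in plain C. All names (MiniType, MiniObject, MiniInt, mini_incref, mini_decref) are hypothetical and exist only for this illustration. Unlike the PyObject_HEAD macro, which pastes the fields into each struct, the sketch embeds the head as the first member, which is the form standard C guarantees can be reached through a cast.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature types; not CPython's real headers. */
typedef struct MiniType {
    const char *name;
} MiniType;

/* The common head: a reference count and a pointer to the type. */
typedef struct MiniObject {
    ptrdiff_t refcnt;
    MiniType *type;
} MiniObject;

/* A concrete object starts with the head, so a MiniInt* may be cast to
   MiniObject* (a struct pointer converts to a pointer to its first member). */
typedef struct MiniInt {
    MiniObject head;
    long value;
} MiniInt;

MiniType mini_int_type = { "mini_int" };

/* Take one more reference. */
void mini_incref(MiniObject *o) {
    o->refcnt++;
}

/* Drop one reference. Returns 1 when the count reaches zero, which is
   where a real runtime would invoke the type's deallocator. */
int mini_decref(MiniObject *o) {
    assert(o->refcnt > 0);
    o->refcnt--;
    return o->refcnt == 0;
}
```

Code that only knows about MiniObject* can still find out what it is holding by following the type pointer, which is the role ob_type plays in the real structure.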

Here is part of the definition of _typeobject:

// source file: [object.h]
typedef struct _typeobject {
    PyObject_VAR_HEAD
    const char *tp_name; /* For printing, in format "<module>.<name>" */
    Py_ssize_t tp_basicsize, tp_itemsize; /* For allocation */

    /* Methods to implement standard operations */

    destructor tp_dealloc;
    printfunc tp_print;
    getattrfunc tp_getattr;
    setattrfunc tp_setattr;
    cmpfunc tp_compare;
    reprfunc tp_repr;

    /* Method suites for standard classes */

    PyNumberMethods *tp_as_number;
    PySequenceMethods *tp_as_sequence;
    PyMappingMethods *tp_as_mapping;

    /* .......... */
} PyTypeObject;

II. PyIntObject

PyIntObject, the integer object, is arguably the simplest object of all.

Without further ado, here is the definition:

// source file: [intobject.h]
typedef struct {
    PyObject_HEAD
    long ob_ival;
} PyIntObject;

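It could hardly be simpler: a PyObject with one extra member of type long. This tells us that Python's int type is really just a thin wrapper around C's native long. The type object of PyIntObject is PyInt_Type.
A PyIntObject can be created with PyInt_FromLong(). There are also PyInt_FromString() and PyInt_FromUnicode(), but in the end they rely on PyInt_FromLong() as well. Before looking at PyInt_FromLong(), a word on how Python manages memory for small and large integer objects.

Small integer objects

Small integers are likely to be used very frequently in a program, for example in for loops. Python therefore keeps a small-integer pool, so that small integers are not allocated on the heap with malloc and released with free over and over, which improves run-time efficiency.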

// source file: [intobject.c]
#ifndef NSMALLPOSINTS
#define NSMALLPOSINTS           257
#endif
#ifndef NSMALLNEGINTS
#define NSMALLNEGINTS           5
#endif
#if NSMALLNEGINTS + NSMALLPOSINTS > 0
/* References to small integers are saved in this array so that they
   can be shared.
   The integers that are saved are those in the range
   -NSMALLNEGINTS (inclusive) to NSMALLPOSINTS (not inclusive).
*/
static PyIntObject *small_ints[NSMALLNEGINTS + NSMALLPOSINTS];
#endif

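small_ints holds pointers to every integer object in [-5, 257), that is, -5 through 256 inclusive. When a small integer object is needed, Python simply hands out the pointer to the existing object and increments that object's reference count. Because small_ints itself always keeps a pointer to each small integer, their reference counts never drop to 0. You can change the range of the pool by changing NSMALLNEGINTS and NSMALLPOSINTS, but that requires recompiling Python.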
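The lookup can be sketched as follows. This is a simplified model under stated assumptions, not the CPython code: SmallInt, pool and small_int_get are hypothetical names, and the pool stores the objects inline rather than storing pointers as small_ints does.

```c
#include <assert.h>
#include <stddef.h>

#define N_SMALL_NEG 5     /* caches -5 .. -1 */
#define N_SMALL_POS 257   /* caches 0 .. 256 */

/* Hypothetical stand-in for an integer object. */
typedef struct {
    long refcnt;
    long value;
} SmallInt;

static SmallInt pool[N_SMALL_NEG + N_SMALL_POS];
static int pool_ready = 0;

/* Fill the pool once; each slot starts with one reference owned by the pool. */
static void pool_init(void) {
    for (long v = -N_SMALL_NEG; v < N_SMALL_POS; v++) {
        pool[v + N_SMALL_NEG].refcnt = 1;
        pool[v + N_SMALL_NEG].value = v;
    }
    pool_ready = 1;
}

/* Return the shared object for v, or NULL when v lies outside [-5, 257). */
SmallInt *small_int_get(long v) {
    if (!pool_ready)
        pool_init();
    if (v < -N_SMALL_NEG || v >= N_SMALL_POS)
        return NULL;
    SmallInt *o = &pool[v + N_SMALL_NEG];   /* shift so that -5 maps to index 0 */
    assert(o->value == v);
    o->refcnt++;                            /* the caller now holds a reference too */
    return o;
}
```

Two requests for the same value return the same address, and since the pool's own reference is never dropped, the count cannot fall back to zero.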

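Large integer objects

For large integer objects, Python provides a block of memory whose slots these integers take turns using; when a block is full, another block is allocated. This way not every PyIntObject needs its own request to the system allocator, which helps efficiency. The blocks are maintained by the PyIntBlock struct.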

// source file: [intobject.c]
/* Integers are quite normal objects, to make object handling uniform.
   (Using odd pointers to represent integers would save much space
   but require extra checks for this special case throughout the code.)
   Since a typical Python program spends much of its time allocating
   and deallocating integers, these operations should be very fast.
   Therefore we use a dedicated allocation scheme with a much lower
   overhead (in space and time) than straight malloc(): a simple
   dedicated free list, filled when necessary with memory from malloc().

   block_list is a singly-linked list of all PyIntBlocks ever allocated,
   linked via their next members.  PyIntBlocks are never returned to the
   system before shutdown (PyInt_Fini).

   free_list is a singly-linked list of available PyIntObjects, linked
   via abuse of their ob_type members.
*/

#define BLOCK_SIZE      1000    /* 1K less typical malloc overhead */
#define BHEAD_SIZE      8       /* Enough for a 64-bit pointer */
#define N_INTOBJECTS    ((BLOCK_SIZE - BHEAD_SIZE) / sizeof(PyIntObject))

struct _intblock {
    struct _intblock *next;
    PyIntObject objects[N_INTOBJECTS];
};

typedef struct _intblock PyIntBlock;

static PyIntBlock *block_list = NULL;
static PyIntObject *free_list = NULL;

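In practice, Python manages memory for large integer objects through two singly linked lists, block_list and free_list. block_list keeps track of the memory blocks, and free_list keeps track of the PyIntObject slots that are currently unused.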
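As a worked example of the N_INTOBJECTS arithmetic, the helper below (objects_per_block is a hypothetical name) repeats the division for a given object size. The sizes are assumptions about typical platforms: a PyIntObject is commonly 24 bytes on a 64-bit build and 12 bytes on a 32-bit build, which gives 41 and 82 objects per block respectively.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 1000   /* same constants as the excerpt above */
#define BHEAD_SIZE 8

/* Number of objects of obj_size bytes that fit behind the block header. */
size_t objects_per_block(size_t obj_size) {
    assert(obj_size > 0);
    return (BLOCK_SIZE - BHEAD_SIZE) / obj_size;
}
```

The integer division rounds down, so a few bytes at the end of each block may simply stay unused.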

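The nodes of block_list are of type PyIntBlock and use the next pointer to reach the following node. The nodes of free_list are of type PyIntObject, which has only three members: ob_refcnt, ob_type and ob_ival. Which of them plays the role of next? Hold on; let's first see how Python creates integer objects.

Creating and destroying integer objects

As mentioned earlier, Python creates integer objects through the C API PyInt_FromLong().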

// source file: [intobject.c]
PyObject* PyInt_FromLong(long ival)
{
    register PyIntObject *v;
#if NSMALLNEGINTS + NSMALLPOSINTS > 0
    // [1] Try the small-integer object pool
    if (-NSMALLNEGINTS <= ival && ival < NSMALLPOSINTS) {
        v = small_ints[ival + NSMALLNEGINTS];
        Py_INCREF(v);
#ifdef COUNT_ALLOCS
        if (ival >= 0)
            quick_int_allocs++;
        else
            quick_neg_int_allocs++;
#endif
        return (PyObject *) v;
    }
#endif
    // [2] Use the general integer pool
    if (free_list == NULL) {
        if ((free_list = fill_free_list()) == NULL)
            return NULL;
    }
    /* Inline PyObject_New */
    v = free_list;
    free_list = (PyIntObject *)Py_TYPE(v);
    (void)PyObject_INIT(v, &PyInt_Type);
    v->ob_ival = ival;
    return (PyObject *) v;
}

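The flow of PyInt_FromLong() is simple:

If the small-integer pool is enabled and ival falls within its range, use the small-integer pool.
Otherwise, use the general integer pool.

Note the statement free_list = (PyIntObject *)Py_TYPE(v); near the end of the function, where Py_TYPE() is defined as #define Py_TYPE(ob) (((PyObject*)(ob))->ob_type). In other words, free_list uses the ob_type field to store the address of the next node.

fill_free_list() is called in two situations: the first time an integer slot is needed, when block_list and free_list are both still NULL; and when every slot in the blocks on block_list is in use, so that free_list is NULL again.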

Here is the definition of fill_free_list():

// source file: [intobject.c]
static PyIntObject *
fill_free_list(void)
{
    PyIntObject *p, *q;
    /* Python's object allocator isn't appropriate for large blocks. */
    p = (PyIntObject *) PyMem_MALLOC(sizeof(PyIntBlock));
    if (p == NULL)
        return (PyIntObject *) PyErr_NoMemory();
    ((PyIntBlock *)p)->next = block_list;
    block_list = (PyIntBlock *)p;
    /* Link the int objects together, from rear to front, then return
       the address of the last int object in the block. */
    p = &((PyIntBlock *)p)->objects[0];
    q = p + N_INTOBJECTS;
    while (--q > p)
        Py_TYPE(q) = (struct _typeobject *)(q-1);
    Py_TYPE(q) = NULL;
    return p + N_INTOBJECTS - 1;
}

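Different kinds of Python objects do different things when they are destroyed. The destruction action is defined in the object's type object, in the key slot tp_dealloc. (Remember: an object's type object is reached through ob_type.)
The tp_dealloc operation of PyIntObject is as follows: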

// source file: [intobject.c]
static void int_dealloc(PyIntObject *v)
{
    if (PyInt_CheckExact(v)) {
        Py_TYPE(v) = (struct _typeobject *)free_list;
        free_list = v;
    }
    else
        Py_TYPE(v)->tp_free((PyObject *)v);
}

The normal destruction path, for a PyIntObject v that is exactly an int and not an instance of a subclass:

v->ob_type = free_list
free_list = v
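The allocate and release cycle can be modelled in a few lines of standalone C. This is a sketch under assumptions, not the interpreter's code: Node, Block, node_new, node_free and block_count are hypothetical names, the block holds only 4 slots so that the refill is easy to observe, and the link is stored in a void* field instead of a real type pointer.

```c
#include <assert.h>
#include <stdlib.h>

#define N_PER_BLOCK 4            /* tiny on purpose; hypothetical value */

typedef struct Node {
    long refcnt;
    void *type;                  /* doubles as the "next free slot" link */
    long value;
} Node;

typedef struct Block {
    struct Block *next;
    Node objects[N_PER_BLOCK];
} Block;

static Block *block_list = NULL;
static Node *free_list = NULL;
static int int_type_tag;         /* stand-in for a real type object */

/* Allocate one block and chain its slots from rear to front. */
static Node *fill_free_list(void) {
    Block *b = malloc(sizeof(Block));
    if (b == NULL)
        return NULL;
    b->next = block_list;
    block_list = b;
    Node *p = &b->objects[0];
    Node *q = p + N_PER_BLOCK;
    while (--q > p)
        q->type = q - 1;
    q->type = NULL;              /* slot 0 ends the chain */
    return p + N_PER_BLOCK - 1;
}

/* Pop a slot off the free list and initialise it. */
Node *node_new(long value) {
    if (free_list == NULL && (free_list = fill_free_list()) == NULL)
        return NULL;
    Node *v = free_list;
    free_list = (Node *)v->type; /* follow the link hidden in the type field */
    v->refcnt = 1;
    v->type = &int_type_tag;
    v->value = value;
    return v;
}

/* Push the slot back; the next node_new() call reuses it. */
void node_free(Node *v) {
    assert(v != NULL);
    v->type = free_list;
    free_list = v;
}

/* Count the blocks, to observe when a new one gets allocated. */
int block_count(void) {
    int n = 0;
    for (Block *b = block_list; b != NULL; b = b->next)
        n++;
    return n;
}
```

Because the free list is a stack, the slot released most recently is the first one handed out again, and blocks are only ever added, never returned.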

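We now understand how integers are created and destroyed: creation first tries the small-integer pool and then the general integer pool. The general pool gets its memory blocks from fill_free_list(). So when is the small-integer pool initialized? The function that creates and initializes it is _PyInt_Init():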

// source file: [intobject.c]
int _PyInt_Init(void)
{
    PyIntObject *v;
    int ival;
#if NSMALLNEGINTS + NSMALLPOSINTS > 0
    for (ival = -NSMALLNEGINTS; ival < NSMALLPOSINTS; ival++) {
        if (!free_list && (free_list = fill_free_list()) == NULL)
            return 0;
        /* PyObject_New is inlined */
        v = free_list;
        free_list = (PyIntObject *)Py_TYPE(v);
        (void)PyObject_INIT(v, &PyInt_Type);
        v->ob_ival = ival;
        small_ints[ival + NSMALLNEGINTS] = v;
    }
#endif
    return 1;
}

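_PyInt_Init() shows that small integer objects also live in the memory blocks maintained by the two singly linked lists block_list and free_list. Once created and initialized they are immortal: small_ints always refers to them, so they are never destroyed and can never be put back on free_list.

III. PyStringObject

Fixed-size and variable-size objects

For a PyIntObject, whatever value the object holds always fits in ob_ival (a PyIntObject only ever holds values in the range of a C long). A string object is not that simple: it obviously has to hold n values of type char. This "n of something" pattern looks like a shared trait of a whole class of Python objects, so besides PyObject, Python has a struct that represents variable-size objects: PyVarObject.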

/* PyObject_VAR_HEAD defines the initial segment of all variable-size
 * container objects.  These end with a declaration of an array with 1
 * element, but enough space is malloc'ed so that the array actually
 * has room for ob_size elements.  Note that ob_size is an element count,
 * not necessarily a byte count.
 */
#define PyObject_VAR_HEAD               \
    PyObject_HEAD                       \
    Py_ssize_t ob_size; /* Number of items in variable part */

typedef struct {
    PyObject_VAR_HEAD
} PyVarObject;

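An object that contains no variable-length data is called a "fixed-size object"; an object that does, such as a string, is called a "variable-size object". The difference is that all objects of a fixed-size type occupy the same amount of memory, while different objects of a variable-size type may occupy different amounts. ob_size is the number of elements in a variable-size object, not its length in bytes.

Objects can also be divided into mutable and immutable. int, str and tuple are immutable types; list is a mutable type.

Fixed-size versus variable-size describes whether different objects of the same type can differ in size; mutable versus immutable describes whether an individual object can change after it has been created.

Defining and creating strings

Now we can look at how PyStringObject is defined in Python: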

// source file: [stringobject.h]
typedef struct {
    PyObject_VAR_HEAD
    long ob_shash;
    int ob_sstate;
    char ob_sval[1];
} PyStringObject;

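In the definition of PyStringObject, ob_sval is declared as a character array of length one. Why not char *ob_sval? Because the characters are stored inline, directly after the header and in the same allocation, so no second allocation and no extra pointer are needed. In practice ob_sval marks the start of a buffer whose size is determined by ob_size and the unit size of the element type. For PyStringObject the element is char, so the unit size is 1. As with C strings, the buffer maintained inside a PyStringObject must end with '\0', so ob_sval is actually the start of a region of (ob_size + 1) * sizeof(char) bytes.

Because the length of a PyStringObject is kept in ob_size, a '\0' may appear in the middle of the object's data. This differs from C, where the first '\0' is taken as the end of the string.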
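The layout can be tried out with a small standalone sketch. MiniStr and mini_str_new are hypothetical names, and the sketch uses a C99 flexible array member (char sval[]) in place of the older one-element array idiom shown in the excerpt; the idea of a single allocation holding header and characters is the same.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature of a variable-size string object. */
typedef struct MiniStr {
    size_t size;      /* number of characters, excluding the trailing '\0' */
    char sval[];      /* characters live inline, right after the header */
} MiniStr;

/* Copy n bytes (which may include '\0') into one freshly allocated object. */
MiniStr *mini_str_new(const char *data, size_t n) {
    assert(data != NULL);
    /* one allocation: header + n characters + terminating '\0' */
    MiniStr *s = malloc(offsetof(MiniStr, sval) + n + 1);
    if (s == NULL)
        return NULL;
    s->size = n;
    memcpy(s->sval, data, n);
    s->sval[n] = '\0';
    return s;
}
```

With the three bytes 'a', '\0', 'b' as input, the stored size is 3 while strlen() on the buffer stops at the embedded terminator, which is exactly the difference from plain C strings described above.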

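ob_shash caches the hash value of the PyStringObject; it plays an important role in dict objects. If the hash of a PyStringObject has not been computed yet, ob_shash still holds its initial value, -1.
ob_sstate records whether the PyStringObject has been through the intern mechanism.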
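The "-1 means not computed yet" convention is easy to demonstrate in isolation. In the sketch below, HashedStr, hashed_str_hash and hash_computations are hypothetical names, and the hash function is a simple multiplicative one chosen for brevity; it is an assumption of this example and not the algorithm CPython uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical string record with a cached hash. */
typedef struct {
    const char *text;
    size_t size;
    long hash;                      /* -1 means "not computed yet" */
} HashedStr;

static int hash_computations = 0;   /* counts real computations, for illustration */

/* Return the hash of s, computing it only on the first call. */
long hashed_str_hash(HashedStr *s) {
    assert(s != NULL);
    if (s->hash != -1)
        return s->hash;             /* cached from an earlier call */
    unsigned long h = 5381;
    for (size_t i = 0; i < s->size; i++)
        h = h * 33 + (unsigned char)s->text[i];
    /* mask to a non-negative value so a real hash can never equal the -1 sentinel */
    long result = (long)(h & 0x7fffffffUL);
    hash_computations++;
    s->hash = result;
    return result;
}
```

A sentinel only works if no real hash can collide with it, which is why the sketch keeps the result non-negative.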

There are several ways to create a PyStringObject. Let's start with the most basic one, PyString_FromString():

// source file: [stringobject.c]
PyObject *
PyString_FromString(const char *str)
{
    register size_t size;
    register PyStringObject *op;

    assert(str != NULL);
    size = strlen(str);
    /* ...... */

    /* Inline PyObject_NewVar */
    op = (PyStringObject *)PyObject_MALLOC(PyStringObject_SIZE + size);
    if (op == NULL)
        return PyErr_NoMemory();
    (void)PyObject_INIT_VAR(op, &PyString_Type, size);
    op->ob_shash = -1;
    op->ob_sstate = SSTATE_NOT_INTERNED;
    Py_MEMCPY(op->ob_sval, str, size+1);
    /* ...... */
    return (PyObject *) op;
}

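Once created, a PyStringObject has the following state in memory: ob_type points to PyString_Type, ob_size is the string length, ob_shash is -1, ob_sstate is SSTATE_NOT_INTERNED, and ob_sval holds the characters followed by a terminating '\0'.

String interning

The tail of PyString_FromString() contains this piece of code: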

// source file: [stringobject.c]
PyObject* PyString_FromString(const char *str)
{
    /* ...... */
    /* share short strings */
    if (size == 0) {
        PyObject *t = (PyObject *)op;
        PyString_InternInPlace(&t);
        op = (PyStringObject *)t;
        nullstring = op;
        Py_INCREF(op);
    } else if (size == 1) {
        PyObject *t = (PyObject *)op;
        PyString_InternInPlace(&t);
        op = (PyStringObject *)t;
        characters[*str & UCHAR_MAX] = op;
        Py_INCREF(op);
    }
    return (PyObject *) op;
}

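Strings of length 0 and 1 are both interned here, that is, passed to PyString_InternInPlace().

The purpose of string interning is this: once a string such as "Ruby" has been interned, there is exactly one PyStringObject corresponding to "Ruby" in the system for the entire run of the Python interpreter. So when checking whether two PyStringObject objects are equal, if both have been interned it is enough to check whether their PyObject* pointers are the same. The mechanism saves space and also simplifies comparing PyStringObject objects.

PyString_InternInPlace() is defined as follows: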

// source file: [stringobject.c]
void PyString_InternInPlace(PyObject **p)
{
    register PyStringObject *s = (PyStringObject *)(*p);
    PyObject *t;
    if (s == NULL || !PyString_Check(s))
        Py_FatalError("PyString_InternInPlace: strings only please!");
    /* If it's a string subclass, we don't really know what putting
       it in the interned dict might do. */
    if (!PyString_CheckExact(s))
        return;
    if (PyString_CHECK_INTERNED(s))
        return;
    if (interned == NULL) {
        interned = PyDict_New();
        if (interned == NULL) {
            PyErr_Clear(); /* Don't leave an exception */
            return;
        }
    }
    t = PyDict_GetItem(interned, (PyObject *)s);
    if (t) {
        Py_INCREF(t);
        Py_SETREF(*p, t);
        return;
    }

    if (PyDict_SetItem(interned, (PyObject *)s, (PyObject *)s) < 0) {
        PyErr_Clear();
        return;
    }
    /* The two references in interned are not counted by refcnt.
       The string deallocator will take care of this */
    Py_REFCNT(s) -= 2;
    PyString_CHECK_INTERNED(s) = SSTATE_INTERNED_MORTAL;
}

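As you can see, interning relies mainly on a dict named interned, which maps PyObject* to PyObject*. To intern a string object a, Python first checks whether interned already contains an object b whose raw string contents equal those of a. If it does, the pointer that referred to a is redirected to b, b's reference count goes up by 1 and a's goes down by 1; otherwise a is added to interned.

When a is entered into interned, a <PyObject*, PyObject*> key/value pair is created, and each of the two PyObject* increments a's reference count. These two pointers are special, though, and should not count toward a's reference count; otherwise the count could never drop to 0. That is why the code contains Py_REFCNT(s) -= 2.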
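The core idea, one canonical copy per distinct string so that equality becomes a pointer comparison, can be sketched without any of the interpreter machinery. The sketch below is a deliberate simplification: intern, interned and MAX_INTERNED are names local to this example, the table is a small fixed array searched linearly instead of a dict, and reference counting is left out entirely.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_INTERNED 64          /* hypothetical fixed capacity */

static char *interned[MAX_INTERNED];
static size_t n_interned = 0;

/* Return the canonical copy of s, adding one on first sight.
   Returns NULL if the table is full or memory runs out. */
const char *intern(const char *s) {
    assert(s != NULL);
    for (size_t i = 0; i < n_interned; i++)
        if (strcmp(interned[i], s) == 0)
            return interned[i];          /* already known: share it */
    if (n_interned == MAX_INTERNED)
        return NULL;
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);
    if (copy == NULL)
        return NULL;
    memcpy(copy, s, len);
    interned[n_interned++] = copy;
    return copy;
}
```

After interning, two strings with equal contents yield the same pointer, so later comparisons between interned strings need no character-by-character work.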

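As far as I know (I'm not sure), Python interns strings made up of letters, digits and underscores, but in PyString_FromString() I only see interning for lengths 0 and 1. Where do the other intern operations happen?

Character cache

The small-integer pool is created when Python initializes, whereas every slot of the character cache is NULL at initialization; a character is added to the cache only when it is first created.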

// source file: [stringobject.c]
static PyStringObject *characters[UCHAR_MAX + 1];
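The lazy filling of such a cache can be sketched as follows. CharStr, char_str_get and char_str_cached are hypothetical names used only here; the array name and its size follow the declaration above, and everything else about the real string object is left out.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical one-character string object. */
typedef struct {
    char sval[2];
} CharStr;

static CharStr *characters[UCHAR_MAX + 1];   /* every slot is NULL until first use */

/* Return the shared one-character string for c, creating it on first use. */
CharStr *char_str_get(char c) {
    unsigned char idx = (unsigned char)c;
    if (characters[idx] == NULL) {
        CharStr *s = malloc(sizeof(CharStr));
        if (s == NULL)
            return NULL;
        s->sval[0] = c;
        s->sval[1] = '\0';
        characters[idx] = s;                 /* remember it for later requests */
    }
    return characters[idx];
}

/* 1 if c already has a cached object, else 0. */
int char_str_cached(char c) {
    return characters[(unsigned char)c] != NULL;
}
```

In contrast to the small-integer pool, nothing is allocated up front: a slot is filled only when its character is requested for the first time.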