


The syntax of PDF comprises four main elements:

• Objects. These are the basic building blocks in PDF.

• File structure. It specifies how objects are laid out and modified in a PDF file.

• Document structure. It determines how objects are logically organized to represent the contents of a PDF file (text, graphics, etc.).

• Content streams. They provide a means for efficient storage of various parts of the document content.




There are 9 basic object types in PDF. Simple object types are Boolean, Numeric, String and Null. PDF strings have bounded length and are enclosed in parentheses, '(' and ')'. The type Name is used as an identifier in the description of the PDF document structure. Names are introduced using the character '/' and can contain arbitrary characters except null (0x00). The aforementioned 5 object types will be referred to as primitive types in this paper.

An Array is a one-dimensional ordered collection of PDF objects enclosed in square brackets, '[' and ']'. Arrays may contain PDF objects of different types, including nested arrays. A Dictionary is an unordered set of key-value pairs enclosed between the symbols '<<' and '>>'. The keys must be name objects and must be unique within a dictionary. The values may be of any PDF object type, including nested dictionaries. A Stream object is a PDF dictionary followed by a sequence of bytes. The bytes represent information that may be compressed or encrypted, and the associated dictionary contains information on whether and how to decode the bytes. These bytes usually contain content to be rendered, but may also contain a set of other objects[3].

Finally, an Indirect object is any of the previously defined objects supplied with a unique object identifier and enclosed in the keywords obj and endobj. Due to their unique identifiers, indirect objects can be referenced from other objects via indirect references.

Example of dictionary objects:
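To make the object syntax concrete, here is a minimal sketch in Python that serializes ordinary Python values into PDF object syntax, including a dictionary with a nested dictionary. The `Name` class and the `to_pdf` function are hypothetical helpers introduced only for this illustration. The sketch also simplifies real PDF rules: it ignores name escaping, string encodings, and the ban on exponent notation for real numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Name:
    """Hypothetical wrapper that marks a Python string as a PDF name."""
    text: str


def to_pdf(value):
    """Serialize a Python value into (simplified) PDF object syntax."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # test bool before int: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, Name):
        return "/" + value.text
    if isinstance(value, str):
        # Escape the backslash and the string delimiters.
        escaped = value.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        return "(" + escaped + ")"
    if isinstance(value, list):
        return "[" + " ".join(to_pdf(v) for v in value) + "]"
    if isinstance(value, dict):
        items = " ".join(to_pdf(k) + " " + to_pdf(v) for k, v in value.items())
        return "<< " + items + " >>"
    raise TypeError(f"unsupported type: {type(value).__name__}")


# A catalog-like dictionary with a nested action dictionary.
catalog = {
    Name("Type"): Name("Catalog"),
    Name("OpenAction"): {Name("S"): Name("JavaScript"), Name("JS"): "alert('Hello!');"},
}
print(to_pdf(catalog))
```

The keys are always `Name` values, matching the rule that dictionary keys must be name objects, while the values mix names, strings, and a nested dictionary.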


Example of indirect objects:
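As a rough illustration, the sketch below embeds two small indirect objects in a Python byte string and pulls out their identifiers and the indirect references they contain. The sample bytes and the helper names (`indirect_objects`, `references`) are assumptions made for this example. A regular expression is only an approximation, since it would also match look-alike text inside strings or streams.

```python
import re

# Two hypothetical indirect objects; the first one refers to the second.
SAMPLE = b"""1 0 obj
<< /Type /Catalog /Pages 3 0 R >>
endobj
3 0 obj
<< /Type /Pages /Count 0 /Kids [] >>
endobj
"""

# "<object number> <generation number> obj ... endobj"
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
# "<object number> <generation number> R" is an indirect reference.
REF_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def indirect_objects(data):
    """Map (object number, generation number) to the raw object body."""
    return {
        (int(m.group(1)), int(m.group(2))): m.group(3).strip()
        for m in OBJ_RE.finditer(data)
    }


def references(body):
    """List the (object number, generation number) pairs referenced in a body."""
    return [(int(num), int(gen)) for num, gen in REF_RE.findall(body)]


objs = indirect_objects(SAMPLE)
print(sorted(objs))              # identifiers of the objects found
print(references(objs[(1, 0)]))  # objects that object 1 0 refers to
```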


File structure

This section describes how objects are organized in a PDF file for efficient random access and incremental update. A basic conforming PDF file shall be constructed of the following four elements (see Figure 2):
• A one-line header identifying the version of the PDF specification to which the file conforms
• A body containing the objects that make up the document contained in the file
• A cross-reference table containing information about the indirect objects in the file
• A trailer giving the location of the cross-reference table and of certain special objects within the body of the file
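The four elements can be seen together in a small sketch that assembles a PDF file in memory. The function `build_pdf` and the two object bodies are assumptions chosen for illustration. The result is a structurally minimal file with an empty page tree, not a document meant for real use.

```python
def build_pdf(objects):
    """Assemble header, body, cross-reference table, and trailer.

    `objects` is a list of raw object bodies; item i becomes object i+1
    with generation number 0. Object 1 is assumed to be the catalog.
    """
    out = bytearray(b"%PDF-1.4\n")                  # 1. header
    offsets = []
    for num, body in enumerate(objects, start=1):   # 2. body
        offsets.append(len(out))                    # byte offset of this object
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)                             # 3. cross-reference table
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"                 # entry for object 0 (free)
    for off in offsets:
        out += b"%010d 00000 n \n" % off            # each entry is 20 bytes long
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)  # 4. trailer
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


pdf = build_pdf([
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [] /Count 0 >>",
])
print(pdf.decode("ascii"))
```

Recording each offset while the body is written is what lets the cross-reference table point straight at every object.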

The body of a PDF file shall consist of a sequence of indirect objects representing the contents of a document. The objects, which are of the basic types described in 7.3, "Objects," represent components of the document such as fonts, pages, and sampled images. Beginning with PDF 1.5, the body can also contain object streams, each of which contains a sequence of indirect objects; see 7.5.7, "Object Streams."



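The syntax of PDF objects is illustrated by the simplified example PDF file shown on the left of Figure 1. It contains four indirect objects, each denoted by a two-part object identifier (for example, 1 0 for the first object) and the keywords obj and endobj. These objects are dictionaries, since they are enclosed in the symbols '<<' and '>>'. The first is the Catalog dictionary, indicated by its Type entry, which contains a PDF name with the value Catalog. The catalog has two additional dictionary entries: Pages and OpenAction. OpenAction is an example of a nested dictionary. It has two entries: S, a PDF name indicating that this is a JavaScript action dictionary, and JS, a PDF string containing the actual JavaScript to be executed: alert('Hello!');. Pages is an indirect reference to the object with the identifier 3 0: the Pages dictionary that immediately follows the catalog. It has an integer, Count, indicating that the document has 2 pages, and an array, identified by square brackets, holding two references to page objects. The same object types are used to build the remaining page objects. Note that each page object contains a backward reference to the Pages object in its Parent entry. In total, three references point to the same indirect object, 3 0, the Pages object.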
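The reference count described above can be checked with a short sketch. The byte string below is a reconstruction based only on the description, not a copy of the figure; in particular, the object numbers 4 0 and 5 0 for the two page objects and the Kids key name are assumptions.

```python
import re
from collections import Counter

# Hypothetical reconstruction of the four objects described in the text.
FIGURE_LIKE = b"""1 0 obj
<< /Type /Catalog /Pages 3 0 R
   /OpenAction << /S /JavaScript /JS (alert('Hello!');) >> >>
endobj
3 0 obj
<< /Type /Pages /Count 2 /Kids [4 0 R 5 0 R] >>
endobj
4 0 obj
<< /Type /Page /Parent 3 0 R >>
endobj
5 0 obj
<< /Type /Page /Parent 3 0 R >>
endobj
"""

# An indirect reference has the form "<number> <generation> R".
REF_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")

# Count how many references point at each object identifier.
incoming = Counter((int(num), int(gen)) for num, gen in REF_RE.findall(FIGURE_LIKE))
print(incoming[(3, 0)])  # references to the Pages object
```

One reference comes from the catalog's Pages entry and one from the Parent entry of each page object.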



7.5.4 Cross-Reference Table

The cross-reference table contains information that permits random access to indirect objects within the file so that the entire file need not be read to locate any particular object. The table shall contain a one-line entry for each indirect object, specifying the byte offset of that object within the body of the file.

Beginning with PDF 1.5, some or all of the cross-reference information may alternatively be contained in cross-reference streams; see 7.5.8, "Cross-Reference Streams."


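xref marks the start of the cross-reference table.

0 is the starting object number.

26 is the number of indirect-object entries in this table.

n means in-use; f means free.


Note: ggggg is the generation number, which can be thought of as a version number.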
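The fields listed above can be read with a small parser. This is a minimal sketch: the function `parse_xref_section` and the three-entry sample are assumptions for illustration, and a real reader would work on bytes and validate the fixed 20-byte entry format.

```python
def parse_xref_section(text):
    """Parse a classic cross-reference section.

    Returns {object number: (first field, generation number, in_use)}.
    For in-use entries the first field is the byte offset of the object;
    for free entries it is the object number of the next free object.
    """
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "xref":
        raise ValueError("section must start with the xref keyword")
    entries = {}
    i = 1
    while i < len(lines):
        # Subsection header: starting object number and number of entries.
        first, count = (int(part) for part in lines[i].split())
        i += 1
        for num in range(first, first + count):
            field, gen, kind = lines[i].split()
            if kind not in ("n", "f"):
                raise ValueError(f"unknown entry type: {kind!r}")
            entries[num] = (int(field), int(gen), kind == "n")
            i += 1
    return entries


# Hypothetical section with one free and two in-use entries.
SAMPLE = "\n".join([
    "xref",
    "0 3",
    "0000000000 65535 f ",
    "0000000009 00000 n ",
    "0000000058 00000 n ",
])
table = parse_xref_section(SAMPLE)
print(table)
```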

7.5.5 File Trailer

The trailer of a PDF file enables a conforming reader to quickly find the cross-reference table and certain special objects. Conforming readers should read a PDF file from its end. The last line of the file shall contain only the end-of-file marker, %%EOF. The two preceding lines shall contain, one per line and in order, the keyword startxref and the byte offset in the decoded stream from the beginning of the file to the beginning of the xref keyword in the last cross-reference section. The startxref line shall be preceded by the trailer dictionary, consisting of the keyword trailer followed by a series of key-value pairs enclosed in double angle brackets (<< … >>) (using LESS-THAN SIGNs (3Ch) and GREATER-THAN SIGNs (3Eh)). Thus, the trailer has the following overall structure:

    trailer
    << key1 value1
       key2 value2
       …
       keyn valuen
    >>
    startxref
    Byte_offset_of_last_cross-reference_section
    %%EOF
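Reading a file from its end, as a conforming reader does, might look like the following sketch. The function `read_trailer` and the truncated sample file are assumptions for illustration. The sketch handles only a classic trailer, not cross-reference streams, and leaves the trailer dictionary as raw bytes.

```python
def read_trailer(data):
    """Return (byte offset of the last xref section, raw trailer dictionary)."""
    eof = data.rfind(b"%%EOF")
    if eof == -1:
        raise ValueError("missing %%EOF marker")
    sx = data.rfind(b"startxref", 0, eof)
    if sx == -1:
        raise ValueError("missing startxref keyword")
    # The number between startxref and %%EOF is the offset of the xref keyword.
    xref_offset = int(data[sx + len(b"startxref"):eof].split()[0])
    # The trailer dictionary sits between the trailer keyword and startxref.
    tr = data.rfind(b"trailer", 0, sx)
    trailer_dict = data[tr + len(b"trailer"):sx].strip() if tr != -1 else None
    return xref_offset, trailer_dict


# Hypothetical file with the body objects left out to keep the sample short.
HEADER = b"%PDF-1.4\n"
SAMPLE = (
    HEADER
    + b"xref\n0 1\n0000000000 65535 f \n"
    + b"trailer\n<< /Size 1 >>\n"
    + b"startxref\n%d\n%%%%EOF\n" % len(HEADER)
)
offset, trailer = read_trailer(SAMPLE)
print(offset, trailer)
```

Searching backward with `rfind` picks up the last trailer in the file, which is the one that matters after an incremental update.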

Document structure

A PDF document can be regarded as a hierarchy of objects contained in the body section of a PDF file. At the root of the hierarchy is the document's catalog dictionary.


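Each page is represented by a page object, a dictionary that includes references to the page's contents and other attributes, such as its thumbnail image (12.3.4, "Thumbnail Images") and any annotations (12.5, "Annotations") associated with it. The individual page objects are tied together in a structure called the page tree (described in 7.7.3, "Page Tree"), which in turn is specified by an indirect reference in the document catalog. Parent, child, and sibling relationships within the hierarchy are defined by dictionary entries whose values are indirect references to other dictionaries.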

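The root of a document's object hierarchy is the catalog dictionary, located by means of the Root entry in the trailer of the PDF file (see 7.5.5, "File Trailer"). The catalog contains references to other objects defining the document's contents, outline, article threads, named destinations, and other attributes. In addition, it contains information about how the document shall be displayed on the screen, such as whether its outline and thumbnail page images shall be displayed automatically and whether some location other than the first page shall be shown when the document is opened. Table 28 shows the entries in the catalog dictionary.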
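The path from the trailer to the individual pages can be sketched with plain Python dictionaries standing in for parsed PDF objects. The `Ref` type, the object numbers, and the `iter_pages` helper are assumptions made for this illustration. A real reader would also guard against cycles and malformed trees.

```python
from typing import NamedTuple


class Ref(NamedTuple):
    """Hypothetical stand-in for an indirect reference (object, generation)."""
    num: int
    gen: int


# Already-parsed objects, keyed by their identifiers (numbers are made up).
OBJECTS = {
    Ref(1, 0): {"Type": "Catalog", "Pages": Ref(3, 0)},
    Ref(3, 0): {"Type": "Pages", "Count": 2, "Kids": [Ref(4, 0), Ref(5, 0)]},
    Ref(4, 0): {"Type": "Page", "Parent": Ref(3, 0)},
    Ref(5, 0): {"Type": "Page", "Parent": Ref(3, 0)},
}
TRAILER = {"Root": Ref(1, 0)}


def iter_pages(objects, node_ref):
    """Yield references to the leaf page objects below a page tree node, in order."""
    node = objects[node_ref]
    if node["Type"] == "Page":
        yield node_ref
    else:  # an intermediate Pages node: descend into its children
        for kid in node["Kids"]:
            yield from iter_pages(objects, kid)


catalog = OBJECTS[TRAILER["Root"]]                    # trailer -> catalog
pages = list(iter_pages(OBJECTS, catalog["Pages"]))   # catalog -> page tree
print(pages)
```

The recursion mirrors the hierarchy: the trailer's Root entry leads to the catalog, the catalog's Pages entry leads to the root of the page tree, and Kids entries lead down to the pages.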
Page Objects

Inheritance of page attributes



