• Original source: http://www.codeproject.com/csharp/giospdfsplittermerger.asp
  • GIOS PDF Splitter and Merger
  • Download C# .NET 1.1 project with source code - 36.8 Kb

This is a screenshot of the GIOS PDF splitter and merger v1.0, the first open source PDF splitter and merger tool written in C# .NET.

Introduction

After the success of the GIOS PDF .NET library, released in April 2005, I decided to invest more of my time in the community. Extending and improving the PDF library was one option, but which new features should be added?

Well, I have to thank my friend Charles. Last month we were discussing the new features to be added to the PDF library. He said: "If you need another challenge, how about developing a PDF merger program?" His words rocked me: there is no free Windows application that does this. Moreover, there is no open source project written in C#. So I picked up Adobe's giant PDF reference to evaluate whether it could be done.

Background

Reading Adobe's Portable Document Format (PDF) Specification, Third Edition, Version 1.4, Section 3.4, you will find that a PDF is made of:

  • Header, which gives information about the kind of file it is (typically %PDF-1.4).
  • Body, which contains the data of the objects. It accounts for about 99 percent of the file size.
  • Cross reference table, which lets a reader locate objects without parsing the entire file. This is the secret behind fast navigation in a large document. A corrupted cross reference table does not prevent the document from being read, but Acrobat then has to rebuild it on the fly, which takes time.
  • Trailer, which contains the necessary information for opening the document, like the ID of the root object named Catalog.
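The four parts listed above can be located with very little code. The following is a minimal sketch, not taken from the GIOS source: the class and method names are hypothetical, and it assumes the whole file fits in memory and ends with the usual `startxref` keyword followed by the offset of the cross reference table.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of the GIOS project.
public static class PdfLayoutSketch
{
    // Latin-1 maps every byte to exactly one char, so string indexes equal byte offsets.
    private static string Decode(byte[] pdf)
    {
        return Encoding.GetEncoding("ISO-8859-1").GetString(pdf);
    }

    // Returns the header line, for example "%PDF-1.4".
    public static string ReadHeader(byte[] pdf)
    {
        string text = Decode(pdf);
        int end = text.IndexOfAny(new[] { '\r', '\n' });
        return end < 0 ? text : text.Substring(0, end);
    }

    // Returns the byte offset of the cross reference table, read from the
    // last "startxref" keyword in the file, or -1 if it cannot be found.
    public static long ReadStartXref(byte[] pdf)
    {
        string text = Decode(pdf);
        int pos = text.LastIndexOf("startxref", StringComparison.Ordinal);
        if (pos < 0) return -1;

        string tail = text.Substring(pos + "startxref".Length).TrimStart();
        int len = 0;
        while (len < tail.Length && char.IsDigit(tail[len])) len++;
        return len == 0 ? -1 : long.Parse(tail.Substring(0, len));
    }
}
```

A reader would typically jump to the returned offset, read the cross reference table and the trailer there, and only then load the objects it needs from the body.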

The Body is a tree of generic objects. The Root, or Catalog, is a container of containers of pages (Pages objects).

How to do it (basic concepts)

We have to identify what needs to change in order to split (or merge) a PDF:

  • The header remains the same, and it's the same for almost all PDFs.
  • We have to reorganize the body discarding the objects that are not needed by the new document.
  • Rebuild the cross reference table, but during the testing phase Acrobat will do it for us on the fly. So this is not a big problem.
  • Override the settings in the trailer, but this object is so simple that it takes very little time to rewrite it entirely.

This is the schema of splitting a document of three pages into a new PDF made (in order) from the third and the first page of the original document:

  1. Objects 1, 2, 4 and 5 will be discarded because they describe the old document structure.
  2. Objects 7, 12, 13 and 14 will be discarded because they are the parent and the children of the pages we want to discard.
  3. Objects 17 and 18 will be created to describe the new structure.
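To make the last step concrete, here is a rough sketch of what the two new structure objects could look like as text: a Pages object listing the kept pages in their new order, and a Catalog pointing to it. The helper names and the page object numbers used in the usage note are hypothetical, and the sketch does not claim which of the two objects is number 17 and which is 18.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; not part of the GIOS project.
public static class NewStructureSketch
{
    // Builds a Pages object whose /Kids array references the kept pages, in order.
    public static string BuildPages(int pagesNumber, IList<int> pageNumbers)
    {
        var kids = new StringBuilder();
        foreach (int n in pageNumbers)
        {
            if (kids.Length > 0) kids.Append(' ');
            kids.Append(n).Append(" 0 R");
        }
        return pagesNumber + " 0 obj\n<< /Type /Pages /Kids [" + kids + "] /Count "
            + pageNumbers.Count + " >>\nendobj\n";
    }

    // Builds a Catalog object that points to the new Pages object.
    public static string BuildCatalog(int catalogNumber, int pagesNumber)
    {
        return catalogNumber + " 0 obj\n<< /Type /Catalog /Pages "
            + pagesNumber + " 0 R >>\nendobj\n";
    }
}
```

For example, `BuildPages(17, new List<int> { 15, 6 })` followed by `BuildCatalog(18, 17)` would describe a two-page document, assuming the kept pages were objects 15 and 6. Each kept page's /Parent entry must also be rewritten to reference the new Pages object, and any attributes the page inherited from a discarded parent (such as /MediaBox or /Resources) have to be carried over.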

How to do it (through coding)

The application works with these engines:

  1. The objects parser for the original documents (PdfFile.cs and PdfFileObject.cs)
  2. The splitter (PdfSplitter.cs)
  3. The merger (PdfSplitterMerger.cs)

The objects parser

The objects parser parses the lines of the PDF and stores the objects in memory recognizing their types.

I'm really not proud of my object parser. It's not the best, but it works. I've seen better parsers here, for example in the article "A pdf Forms parser", so if you are a purist coder, don't look inside! ;-) Here is an extract of my code in which the object searches its own content for matches in order to determine its type.

The use of Regex here is not necessary, but it's surely a more elegant way of searching for string matches:

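// A Page object contains "/Page" but not "/Pages" (which marks a page container).
if (Regex.IsMatch(s, @"/Page") && !Regex.IsMatch(s, @"/Pages"))
{
    this.type = PdfObjectType.Page;
    return this.type;
}
// Any object that carries a stream (page content, fonts, images, ...).
if (Regex.IsMatch(s, @"stream"))
{
    this.type = PdfObjectType.Stream;
    return this.type;
}
// The document information dictionary is recognized by its typical keys.
if (Regex.IsMatch(s, @"(/Creator)|(/Author)|(/Producer)"))
{
    this.type = PdfObjectType.Info;
    return this.type;
}
this.type = PdfObjectType.Other;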

The splitter

The splitter takes a collection of objects (input) and returns a collection of objects (output).

The input is provided by the objects parser, and the output is basically a filtered list of the original objects. This is how it works:

  1. Takes the original objects of the document (provided by the object parser).
  2. Takes the indexes of the selected pages.
  3. Uses a sort of spider for populating a list of objects needed by the selected pages.
  4. Erases from the original collection the objects not visited by the spider.
  5. Renumbers the objects (a feature needed by the merger).
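The steps above amount to a mark-and-sweep pass followed by a renumbering. The sketch below shows that idea on a plain reference graph; it is an illustration under stated assumptions, not the code from PdfSplitter.cs. The class name, the `references` dictionary (object number to the numbers it refers to, with parent links already left out) and the ascending renumbering order are all hypothetical choices.

```csharp
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of the GIOS project.
public static class SplitterSketch
{
    // Returns a map from old object number to new object number for every
    // object reachable from the selected pages. Objects missing from the
    // map are the ones to erase.
    public static Dictionary<int, int> KeepAndRenumber(
        IDictionary<int, List<int>> references,
        IEnumerable<int> selectedPages,
        int firstNewNumber)
    {
        // Mark: walk the graph starting from the selected pages.
        var visited = new SortedSet<int>();
        var pending = new Stack<int>(selectedPages);
        while (pending.Count > 0)
        {
            int current = pending.Pop();
            if (!visited.Add(current)) continue;   // already seen

            List<int> children;
            if (references.TryGetValue(current, out children))
            {
                foreach (int child in children) pending.Push(child);
            }
        }

        // Renumber: assign consecutive numbers in ascending order of the old ones.
        var map = new Dictionary<int, int>();
        int next = firstNewNumber;
        foreach (int oldNumber in visited) map[oldNumber] = next++;
        return map;
    }
}
```

Starting the numbering at a caller-supplied value is what lets a merger append the output of several splitters without number collisions: each document simply starts where the previous one ended.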

This is a recursive method in PdfFileObject.cs used for exploring its children:

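// Requires: using System.Collections; using System.Text.RegularExpressions;
internal void PopulateRelatedObjects(PdfFile PdfFile,
    Hashtable container)
{
    // Find indirect references such as "12 0 R"; [^G] skips the "RG" color operator.
    Match m = Regex.Match(this.OriginalText, @"\d+ 0 R[^G]");
    while (m.Success)
    {
        // The object number is the text before the first space.
        int num = int.Parse(
            m.Value.Substring(0, m.Value.IndexOf(" ")));
        // Never follow the link back to the parent, or every page would be pulled in.
        bool notparent = !Regex.IsMatch(this.OriginalText,
            @"/Parent\s+" + num + " 0 R");
        if (notparent && !container.Contains(num))
        {
            PdfFileObject pfo = PdfFile.LoadObject(num);
            // && short-circuits, so pfo.number is not read when pfo is null.
            if (pfo != null && !container.Contains(pfo.number))
            {
                container.Add(num, null);
                pfo.PopulateRelatedObjects(PdfFile, container);
            }
        }
        m = m.NextMatch();
    }
}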
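The pattern that finds `N 0 R` references can be tried in isolation. The following sketch is a variation, not the author's code: it uses a capture group for the object number and a negative lookahead `(?!G)` in place of the trailing `[^G]`, so that a reference at the very end of the text is also found. The class name is hypothetical, and generation numbers other than 0 are ignored, as in the method above.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the GIOS project.
public static class ReferenceScanSketch
{
    // Returns the object numbers of all "N 0 R" references in the given text.
    public static List<int> FindReferences(string objectText)
    {
        var result = new List<int>();
        // (?!G) rejects "N 0 RG", which is a color operator and not a reference.
        foreach (Match m in Regex.Matches(objectText, @"(\d+) 0 R(?!G)"))
        {
            result.Add(int.Parse(m.Groups[1].Value));
        }
        return result;
    }
}
```

Calling `FindReferences` on a page dictionary returns the numbers of the objects it points to, while a content stream fragment such as `1 0 0 RG` contributes nothing.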

The merger

The merger is a simple class that appends the output of each splitter and writes the necessary new objects (in our example, objects 17 and 18). It also writes the header, the cross reference table and the trailer. Take a look at PdfSplitterMerger.cs; it's very simple.
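As an illustration of the last part of that job, the sketch below formats a cross reference table and a trailer from a list of byte offsets. It is not the code from PdfSplitterMerger.cs: the class name is hypothetical, and it assumes objects are numbered 1 to N without gaps, all with generation 0, and that the caller recorded each object's byte offset while writing the body.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical helper for illustration; not part of the GIOS project.
public static class XrefSketch
{
    // offsets[i] is the byte offset of object number i + 1.
    // xrefOffset is the byte offset at which the "xref" keyword is written.
    public static string BuildXrefAndTrailer(
        IList<long> offsets, int rootNumber, long xrefOffset)
    {
        var sb = new StringBuilder();
        sb.Append("xref\n");
        sb.Append("0 ").Append(offsets.Count + 1).Append('\n');

        // Entry 0 is always the head of the free list. Every entry is exactly 20 bytes.
        sb.Append("0000000000 65535 f \n");
        foreach (long offset in offsets)
        {
            sb.Append(offset.ToString("D10", CultureInfo.InvariantCulture))
              .Append(" 00000 n \n");
        }

        // /Size is the number of entries; /Root is the Catalog object.
        sb.Append("trailer\n<< /Size ").Append(offsets.Count + 1)
          .Append(" /Root ").Append(rootNumber).Append(" 0 R >>\n");
        sb.Append("startxref\n").Append(xrefOffset).Append("\n%%EOF\n");
        return sb.ToString();
    }
}
```

Because the offsets are byte positions, the body has to be written with an encoding that produces one byte per character (or the offsets must be taken from the output stream position) for the table to be valid.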

Conclusion

I hope this project is useful for non-coders. Splitting and merging documents should be free. Let's hope that projects like this one, which demystify the PDF format, will bear fruit in the near future.

History

  • 21st December, 2005 - v1.0 release.
  • 4th January, 2006 - v1.1
    • Some minor bugs fixed.
    • Noticeable performance gain from Regex optimization.
  • 24th November, 2006 - v1.12
    • Regex fix for supporting SQL Reporting Services.


About Paolo Gios

After working as a freelance software developer and C# trainer,
I've proudly joined BeyondTrust Corporation
as Senior Developer.

I live in Turin, Italy.

My homepage is: http://www.paologios.com
