Merging multiple PDF files and adding page numbers with Python

Merging multiple PDF files

Taken from a blog post whose address I have forgotten =_=!

# -*- coding: utf-8 -*-
# Merge all PDF files under one folder with the PyPDF2 module.
# Only two variables need changing: file_dir (the folder holding the PDFs)
# and outfile (the output file name).
# Note: PdfFileReader/PdfFileWriter are the legacy PyPDF2 names (PyPDF2 < 3.0).
import os
import time

from PyPDF2 import PdfFileReader, PdfFileWriter


# Use os.walk to collect the full paths of all PDF files under the given
# directory (subfolders included).
def getFileName(filedir):
    file_list = [os.path.join(root, filespath)
                 for root, dirs, files in os.walk(filedir)
                 for filespath in files
                 if filespath.lower().endswith('.pdf')]
    return file_list

# Merge all PDF files found under filepath into outfile.
def MergePDF(filepath, outfile):
    output = PdfFileWriter()
    outputPages = 0
    pdf_fileName = getFileName(filepath)
    if pdf_fileName:
        streams = []  # source files must stay open until the output is written
        try:
            for pdf_file in pdf_fileName:
                print("Path: %s" % pdf_file)
                # Read the source PDF file
                stream = open(pdf_file, "rb")
                streams.append(stream)
                reader = PdfFileReader(stream)
                # Number of pages in the source PDF
                pageCount = reader.getNumPages()
                outputPages += pageCount
                print("Pages: %d" % pageCount)
                # Add each page to the output
                for iPage in range(pageCount):
                    output.addPage(reader.getPage(iPage))
            print("Total pages after merging: %d." % outputPages)
            # Write to the target PDF file
            with open(os.path.join(filepath, outfile), "wb") as outputStream:
                output.write(outputStream)
        finally:
            for stream in streams:
                stream.close()
        print("PDF files merged!")
    else:
        print("No PDF files to merge!")

# Main function
def main():
    time1 = time.time()
    file_dir = r'E:\test\ac3'  # folder holding the source PDFs
    outfile = "Cheat_Sheets.pdf"  # name of the output PDF file
    MergePDF(file_dir, outfile)
    time2 = time.time()
    print('Total time: %s s.' % (time2 - time1))


if __name__ == "__main__":
    main()
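
One thing to watch with the merge script above: files are merged in whatever order os.walk yields them, and because the output is written into the same folder, a second run would also pick up the previously merged file. Below is a minimal sketch of a file collector that sorts names naturally (so 2.pdf comes before 10.pdf) and skips the output file. The names natural_key, collect_pdfs and exclude_name are made up for this illustration and are not part of the original script or of PyPDF2.

```python
import os
import re


def natural_key(path):
    """Sort key that compares digit runs numerically, so '2.pdf' < '10.pdf'."""
    parts = re.split(r'(\d+)', path.lower())
    # With one capturing group, re.split alternates text, digits, text, ...
    # so every odd index holds a digit run that int() can parse.
    return [int(tok) if i % 2 else tok for i, tok in enumerate(parts)]


def collect_pdfs(folder, exclude_name=None):
    """Return PDF paths under `folder` in natural order.

    Any file whose name equals `exclude_name` (case-insensitive) is skipped,
    in every subfolder, so the merged output is not fed back in as input.
    """
    found = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            if not name.lower().endswith('.pdf'):
                continue
            if exclude_name and name.lower() == exclude_name.lower():
                continue
            found.append(os.path.join(root, name))
    return sorted(found, key=natural_key)
```

It could stand in for getFileName, for example collect_pdfs(file_dir, exclude_name=outfile).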

If the script raises an error, the workaround is to comment out the offending code in site-packages/PyPDF2/generic.py.

Adding page numbers

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
helpDoc = '''
Add page numbers to a PDF file with Python

usage:
    python addPageNumberToPDF.py [PDF path]

require:
    pip install reportlab pypdf2
    Supports both Python 2 and 3, but Python 3 is recommended

tips:
    * the output file is saved at pdfWithNumbers/[PDF path]_page.pdf
    * only A4-size PDFs are supported
    * tested on Python2/Python3@ubuntu
    * larger PDFs require more RAM
    * if you hit a segmentation fault, please try Python 3
    * if the generated PDF is damaged, please try Python 3

Author:
    Lei Yang (ylxx@live.com)

GitHub:
    https://gist.github.com/DIYer22/b9ede6b5b96109788a47973649645c1f
'''
print(helpDoc)

from reportlab.lib.units import mm
from reportlab.pdfgen import canvas
from PyPDF2 import PdfFileWriter, PdfFileReader


def createPagePdf(num, tmp):
    # Build a temporary PDF with `num` pages, each holding only its page
    # number near the bottom edge (x = 105 mm, y = 4 mm, i.e. sized for A4).
    c = canvas.Canvas(tmp)
    for i in range(1, num + 1):
        c.drawString((210 // 2) * mm, 4 * mm, str(i))
        c.showPage()
    c.save()

if __name__ == "__main__":
    import sys, os

    # PDF file that needs page numbers
    path = 'E:\\test\\ac2\\3.pdf'
    if len(sys.argv) == 1:
        if not os.path.isfile(path):
            sys.exit(1)
    else:
        path = sys.argv[1]
    base = os.path.basename(path)
    tmp = "__tmp.pdf"
    # Pages per output file; 0 writes everything into a single file.
    batch = 0
    # All output goes to a pdfWithNumbers folder next to the source file.
    outdir = os.path.join(os.path.dirname(path), 'pdfWithNumbers')
    if not os.path.isdir(outdir):
        os.mkdir(outdir)
    output = PdfFileWriter()
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f, strict=False)
        n = pdf.getNumPages()
        if batch == 0:
            # A negative batch makes p % batch non-zero for 0 < p < n,
            # so no intermediate file is written.
            batch = -n
        createPagePdf(n, tmp)
        with open(tmp, 'rb') as ftmp:
            numberPdf = PdfFileReader(ftmp)
            for p in range(n):
                if not p % batch and p:
                    # A full batch has been collected: write it out and start a new one
                    newpath = os.path.join(outdir, base[:-4] + '_page_%d' % (p // batch) + path[-4:])
                    with open(newpath, 'wb') as fout:
                        output.write(fout)
                    output = PdfFileWriter()
                print('page: %d of %d' % (p, n))
                page = pdf.getPage(p)
                numberLayer = numberPdf.getPage(p)
                # Overlay the page-number layer on the original page
                page.mergePage(numberLayer)
                output.addPage(page)
            # Write the remaining pages
            if output.getNumPages():
                newpath = os.path.join(outdir, base[:-4] + '_page_%d' % (p // batch + 1) + path[-4:])
                with open(newpath, 'wb') as fout:
                    output.write(fout)
    os.remove(tmp)
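
The script above places every number at x = (210 // 2) mm, y = 4 mm, which is why it only suits A4 pages. A rough sketch of how the position could instead be derived from a page's own width and height (in PDF points) follows. The function page_number_position and its parameters are hypothetical names for this illustration; reading the real page size from the source PDF and sizing the reportlab canvas to match are left out here.

```python
MM = 72 / 25.4  # one millimetre in PDF points (72 points per inch)


def page_number_position(width_pt, height_pt, margin_mm=4, at_top=False):
    """Return (x, y) in points for a page number on a page of the given size.

    x is the horizontal midpoint of the page; y sits `margin_mm` above the
    bottom edge, or the same distance below the top edge when at_top is True.
    """
    if width_pt <= 0 or height_pt <= 0:
        raise ValueError("page size must be positive")
    x = width_pt / 2
    y = height_pt - margin_mm * MM if at_top else margin_mm * MM
    return x, y


# A4 is 210 mm x 297 mm; this reproduces the script's hard-coded position.
a4_x, a4_y = page_number_position(210 * MM, 297 * MM)

# US Letter is 612 x 792 points, so the midpoint moves to x = 306.
letter_x, letter_y = page_number_position(612, 792)
```

Note that drawString starts the text at x, so the number begins at the midpoint instead of being centred on it; reportlab's drawCentredString takes the same arguments and centres the text on x.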
