Finding the page number of PDF content in Python: retrieving page numbers from a document with pyPDF

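The existing answer is good. However, since a working code example was requested later (by dreamer), and I ran into the same problem today, I would like to add some notes. PDF structure is not uniform; there is very little you can rely on, so any working code example is unlikely to suit everyone. A good explanation can be found in this answer.

As kindall explains, you will most likely need to explore the PDF file you are working with.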

Something like this:

import sys

import PyPDF2 as pyPdf

# Open your PDF.
# Note: PdfFileReader and getObject() are the PyPDF2 1.x/2.x names;
# PyPDF2 3.x and pypdf call them PdfReader and get_object().
pdf = pyPdf.PdfFileReader(open(sys.argv[1], "rb"))

# Explore /PageLabels (if it exists).
try:
    page_label_type = pdf.trailer["/Root"]["/PageLabels"]
    print(page_label_type)
except Exception:
    print("No /PageLabel object")

# Select the item most likely to contain the information you want; e.g. for
#     {'/Nums': [0, IndirectObject(42, 0)]}
# the only key is "/Nums".
try:
    page_label_type = pdf.trailer["/Root"]["/PageLabels"]["/Nums"]
    print(page_label_type)
except Exception:
    print("No /PageLabel object")

# If you see a list, like
#     [0, IndirectObject(42, 0)]
# get the correct item from it.
try:
    page_label_type = pdf.trailer["/Root"]["/PageLabels"]["/Nums"][1]
    print(page_label_type)
except Exception:
    print("No /PageLabel object")

# If you then have an indirect object, like
#     IndirectObject(42, 0)
# use getObject() to resolve it.
try:
    page_label_type = pdf.trailer["/Root"]["/PageLabels"]["/Nums"][1].getObject()
    print(page_label_type)
except Exception:
    print("No /PageLabel object")

# Now we have e.g.
#     {'/S': '/r', '/St': 21}
# meaning lowercase roman numerals, with the first label numbered 21, i.e. xxi.
# We can now also read the two values directly.
try:
    page_label_type = pdf.trailer["/Root"]["/PageLabels"]["/Nums"][1].getObject()["/S"]
    print(page_label_type)
    start_page = pdf.trailer["/Root"]["/PageLabels"]["/Nums"][1].getObject()["/St"]
    print(start_page)
except Exception:
    print("No /PageLabel object")

As the ISO PDF 1.7 specification shows (relevant section here), there are many possible ways to label pages. As a simple working example, consider this script, which handles at least decimal (Arabic) and Roman numerals:

Script:

import sys

import PyPDF2 as pyPdf

def arabic_to_roman(arabic):
    """Convert a positive integer to lowercase roman numerals."""
    roman = ''
    # Thousands are handled first, by repetition.
    while arabic >= 1000:
        roman += 'm'
        arabic -= 1000
    # One pass suffices: at most one entry per decade can match what remains.
    diffs = [900, 500, 400, 300, 200, 100, 90, 50, 40, 30, 20, 10, 9, 5, 4, 3, 2, 1]
    digits = ['cm', 'd', 'cd', 'ccc', 'cc', 'c', 'xc', 'l', 'xl', 'xxx', 'xx', 'x', 'ix', 'v', 'iv', 'iii', 'ii', 'i']
    for diff, digit in zip(diffs, digits):
        if arabic >= diff:
            roman += digit
            arabic -= diff
    return roman

def get_page_labels(pdf):
    # Only the first label range (/Nums[1]) is inspected.
    try:
        page_label_type = pdf.trailer["/Root"]["/PageLabels"]["/Nums"][1].getObject()["/S"]
    except Exception:
        page_label_type = "/D"  # fall back to decimal numbering
    try:
        page_start = pdf.trailer["/Root"]["/PageLabels"]["/Nums"][1].getObject()["/St"]
    except Exception:
        page_start = 1
    page_count = pdf.getNumPages()
    # or, if you feel fancy, do:
    # page_count = pdf.trailer["/Root"]["/Pages"]["/Count"]
    page_stop = page_start + page_count
    if page_label_type == "/D":
        page_numbers = [str(n) for n in range(page_start, page_stop)]
    elif page_label_type == '/r':
        page_numbers = [arabic_to_roman(n) for n in range(page_start, page_stop)]
    else:
        # Other styles (/R, /A, /a) are not handled by this simple script.
        page_numbers = []
    print(page_label_type)
    print(page_start)
    print(page_count)
    print(page_numbers)

pdf = pyPdf.PdfFileReader(open(sys.argv[1], "rb"))

get_page_labels(pdf)
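The script above looks only at the first entry of /Nums and at two label styles. As a rough sketch of how several label ranges could be expanded, the following works on plain Python data rather than on PyPDF2 objects. The function names (`to_roman`, `to_letters`, `format_label`, `expand_page_labels`) are made up for illustration. It assumes the /Nums array has already been read into a flat list with every indirect object resolved to a dict (with PyPDF2 that would mean calling getObject() on each entry), and it does not handle label trees that use /Kids instead of /Nums. The style handling follows my reading of the page label section of the PDF 1.7 specification.

```python
def to_roman(n):
    """Convert a positive integer to lowercase roman numerals."""
    pairs = [(1000, 'm'), (900, 'cm'), (500, 'd'), (400, 'cd'), (100, 'c'),
             (90, 'xc'), (50, 'l'), (40, 'xl'), (10, 'x'), (9, 'ix'),
             (5, 'v'), (4, 'iv'), (1, 'i')]
    parts = []
    for value, digit in pairs:
        count, n = divmod(n, value)
        parts.append(digit * count)
    return ''.join(parts)


def to_letters(n):
    """a..z for 1..26, then aa..zz for 27..52, and so on."""
    repeats, index = divmod(n - 1, 26)
    return chr(ord('a') + index) * (repeats + 1)


def format_label(style, prefix, number):
    """Build one label from a style (/S), a prefix (/P) and a number."""
    if style == '/D':
        numeric = str(number)
    elif style == '/r':
        numeric = to_roman(number)
    elif style == '/R':
        numeric = to_roman(number).upper()
    elif style == '/a':
        numeric = to_letters(number)
    elif style == '/A':
        numeric = to_letters(number).upper()
    else:
        numeric = ''  # no /S entry: the label is the prefix alone
    return prefix + numeric


def expand_page_labels(nums, page_count):
    """Expand a flat [index, dict, index, dict, ...] list into one label per page.

    Each index is the zero-based page where a label range starts; the range
    runs until the next index, or until page_count for the last range.
    """
    ranges = list(zip(nums[0::2], nums[1::2]))
    labels = []
    for i, (first, entry) in enumerate(ranges):
        last = ranges[i + 1][0] if i + 1 < len(ranges) else page_count
        style = entry.get('/S')
        prefix = entry.get('/P', '')
        start = entry.get('/St', 1)
        for offset in range(last - first):
            labels.append(format_label(style, prefix, start + offset))
    return labels


# Hypothetical document: 4 front-matter pages, 3 body pages, 2 appendix pages.
nums = [0, {'/S': '/r'},
        4, {'/S': '/D'},
        7, {'/S': '/D', '/P': 'A-', '/St': 8}]
print(expand_page_labels(nums, 9))
```

With labels in a list like this, the label for a zero-based page index i is simply labels[i], which is what you need once a text search has told you which physical page a match is on.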
