Parsing PDF files with Python

Load the PDF file and get a page object for each page:

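import pdfplumber

path = ""  # path to the PDF file
with pdfplumber.open(path) as pdf_obj:
    pages = pdf_obj.pages
    page = pages[0]  # pick the page to work with (the first one here)
    # Get the page width and height
    p_width = page.width
    p_height = page.height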

Extract all the text on the current page:

text = page.extract_text()

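If no text is extracted, text is None. This method also picks up the text inside tables, but it reads the page line by line, so the contents of table cells can end up on the wrong lines.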
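
Since the extracted text may be None, it helps to normalize it before any further processing. A minimal sketch, in which the helper names page_text_or_empty and split_lines are made up for this illustration and are not part of pdfplumber:

```python
def page_text_or_empty(text):
    """Return the text unchanged, or '' when extraction produced nothing."""
    return text if text else ""


def split_lines(text):
    """Split extracted page text into non-empty, stripped lines."""
    return [
        line.strip()
        for line in page_text_or_empty(text).splitlines()
        if line.strip()
    ]


print(split_lines(None))            # a page without text yields an empty list
print(split_lines("a \n\n b"))
```

In practice the argument would be the result of page.extract_text().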

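If a page contains both text and tables, one option is to extract every text object on the page together with its coordinates, and then put the text back in its original position based on those coordinates.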

import decimal

words = page.extract_words()
new_words = []
for word in words:
    # Convert Decimal coordinates to float so they are easier to work with
    for k, v in word.items():
        if isinstance(v, decimal.Decimal):
            word[k] = float(v)
    new_words.append(word)
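
One possible way to restore positions from these word dictionaries is to group words whose top values are close together into one line, then order each line by x0. This is only a sketch: the function name words_to_lines and the y_tolerance parameter are assumptions for illustration, and the sample dictionaries imitate just the text, x0 and top keys that extract_words() returns.

```python
def words_to_lines(words, y_tolerance=3.0):
    """Group word dicts into text lines by vertical position."""
    lines = []  # each entry is [reference_top, [words on that line]]
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(word)
        else:
            lines.append([word["top"], [word]])
    # Within a line, order the words from left to right
    return [
        " ".join(w["text"] for w in sorted(line_words, key=lambda w: w["x0"]))
        for _, line_words in lines
    ]


sample_words = [
    {"text": "world", "x0": 60.0, "top": 10.5},
    {"text": "Hello", "x0": 10.0, "top": 10.0},
    {"text": "Second", "x0": 10.0, "top": 30.0},
    {"text": "line", "x0": 70.0, "top": 30.2},
]
print(words_to_lines(sample_words))  # ['Hello world', 'Second line']
```

A fixed tolerance is a simplification; pages that mix font sizes would probably need a tolerance derived from the word heights.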

Tables can also be parsed on their own:

from pdfplumber.table import Table as ptb


def get_new_cells(cells):
    # Replace missing cells with an empty bounding box
    new_cells = []
    for item in cells:
        if item is None:
            item = (None, None, None, None)
        new_cells.append(item)
    return new_cells


ts = page.find_tables()
ts1 = ts[0]
rows = ts1.rows
for row in rows:
    cells = row.cells
    new_cells = get_new_cells(cells)
    # Build a one-row table from the cells and extract its text
    nt = ptb(page, new_cells)
    print(nt.extract())
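
The extracted rows are nested lists in which an empty cell is None, so they usually need some cleanup before being merged with the rest of the page text. A small sketch, assuming rows has the list-of-lists shape that extract() returns; the name table_to_text and the sample data are made up for illustration:

```python
def table_to_text(rows, sep="\t"):
    """Render table rows as text, one row per line."""
    lines = []
    for row in rows:
        # Empty cells become '', and line breaks inside a cell are flattened
        cells = [(cell or "").replace("\n", " ").strip() for cell in row]
        lines.append(sep.join(cells))
    return "\n".join(lines)


sample_rows = [["Name", "Qty"], ["Apple", None], ["Pear\n(green)", "3"]]
print(table_to_text(sample_rows))
```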

Get all the images on the page and save them:

import logging

from PIL import Image


def get_images(page_obj, pdf_path):
    """Get the images on this page and save them to disk."""
    image_list = []
    imgs = page_obj.images
    pdf_name = pdf_path.split('/')[-1].replace('.pdf', '')
    main_path = 'E:/temp/imgs/%s' % pdf_name
    for img in imgs:
        try:
            name = img.get('name', 'abc')
            new_img_path = '%s_%s' % (main_path, name)
            ism = img.get('stream')
            attrs = ism.__dict__.get('attrs')
            color_space = attrs.get('ColorSpace')
            if color_space.name == 'DeviceRGB':
                mode = "RGB"
            else:
                mode = "P"
            img_row_data = ism.get_data()
            img_filter = attrs.get('Filter')
            img_filter_name = img_filter.name
            if img_filter_name == 'FlateDecode':
                # Raw pixel data: rebuild the image with Pillow
                width, height = attrs.get('Width'), attrs.get('Height')
                if not width or not height:
                    continue
                new_img_path = new_img_path + '.png'
                size = (width, height)
                new_img = Image.frombytes(mode, size, img_row_data)
                new_img.save(new_img_path)
            elif img_filter_name == 'DCTDecode':
                # JPEG data: write the bytes as they are
                new_img_path = new_img_path + '.jpg'
                with open(new_img_path, 'wb') as new_img:
                    new_img.write(img_row_data)
            elif img_filter_name == 'JPXDecode':
                # JPEG 2000 data: write the bytes as they are
                new_img_path = new_img_path + '.jp2'
                with open(new_img_path, 'wb') as new_img:
                    new_img.write(img_row_data)
            elif img_filter_name == 'CCITTFaxDecode':
                new_img_path = new_img_path + '.tiff'
                with open(new_img_path, 'wb') as new_img:
                    new_img.write(img_row_data)
            else:
                logging.error('wrong img_filter_name: %s' % img_filter_name)
                continue
            image_list.append({'name': name, 'path': new_img_path})
        except Exception as e:
            logging.error('get_images failed, pdf_path: %s, error: %s' % (pdf_path, e))
    return image_list
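
The branch on the stream filter mostly decides which file extension to use, so that part can be expressed as a lookup table. A sketch of the idea: the mapping repeats the four filters handled above, while FILTER_EXTENSIONS and image_output_path are names invented for this example.

```python
# Stream filter name -> extension of the saved file (same pairs as above)
FILTER_EXTENSIONS = {
    "FlateDecode": ".png",
    "DCTDecode": ".jpg",
    "JPXDecode": ".jp2",
    "CCITTFaxDecode": ".tiff",
}


def image_output_path(base_path, name, filter_name):
    """Return the output path for an image, or None for an unknown filter."""
    ext = FILTER_EXTENSIONS.get(filter_name)
    if ext is None:
        return None
    return "%s_%s%s" % (base_path, name, ext)


print(image_output_path("E:/temp/imgs/report", "Im0", "DCTDecode"))
print(image_output_path("E:/temp/imgs/report", "Im1", "LZWDecode"))
```

The FlateDecode case still needs its own decoding step with Pillow; the table only covers the naming.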

I also tried parsing PDF documents with pdfminer:

def pdfminer_test1():
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LTTextBoxHorizontal, LAParams, LTImage, LTFigure, LTCurve, LTTextBox
    from pdfminer.pdfpage import PDFPage

    path = ''
    fp = open(path, 'rb')
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    parser.set_document(doc)
    if not doc.is_extractable:
        fp.close()
        return None
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    count = 1
    for page in PDFPage.create_pages(doc):
        if count == 5:  # this test only looks at page 5
            texts = []
            images = []
            interpreter.process_page(page)
            layout = device.get_result()
            for x in layout:
                if isinstance(x, LTTextBox):
                    # The text is available through x.get_text()
                    texts.append(x)
                if isinstance(x, LTImage):
                    print(x)
                    images.append(x)
                if isinstance(x, LTFigure):
                    # Figures can nest, so walk them with an explicit stack
                    figurestack = [x]
                    while figurestack:
                        figure = figurestack.pop()
                        for f in figure:
                            if isinstance(f, LTTextBox):
                                texts.append(f)
                            if isinstance(f, LTImage):
                                print(f)
                                images.append(f)
                            if isinstance(f, LTFigure):
                                figurestack.append(f)
        count += 1
    fp.close()
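
The stack-based walk over nested figures is the core of that function. The sketch below isolates it using stand-in classes, so it runs without pdfminer: Text, Image and Figure are hypothetical substitutes for LTTextBox, LTImage and LTFigure, and collect is a name chosen for this example.

```python
class Text:
    def __init__(self, value):
        self.value = value


class Image:
    def __init__(self, name):
        self.name = name


class Figure(list):
    """A container that may hold Text, Image and nested Figure items."""


def collect(layout):
    """Collect text values and image names from a nested layout."""
    texts, images = [], []
    stack = [layout]
    while stack:
        container = stack.pop()
        for item in container:
            if isinstance(item, Figure):
                stack.append(item)  # visit nested containers later
            elif isinstance(item, Text):
                texts.append(item.value)
            elif isinstance(item, Image):
                images.append(item.name)
    return texts, images


layout = [
    Text("intro"),
    Figure([Image("Im0"), Figure([Text("caption")])]),
    Image("Im1"),
]
print(collect(layout))
```

Because nested containers are visited after their siblings, the results are not in document order; the items would have to be sorted by their coordinates if order matters.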

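The original goal was this: given a PDF file, extract all of its text (including the text inside tables) and all of its images exactly as they appear in the original, positions included. I never found a good way to do it, though.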

The code above has been collected here: pdfParser

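Later I tried another approach: first convert the original file into a Word document (.docx, the format python-docx reads), then read that document.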

import docx
from docx.document import Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph


def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    if isinstance(parent, Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)
        else:
            print('other type: %s' % type(child))


def docx_test():
    path = ''
    document = docx.Document(path)
    content = ''
    for block in iter_block_items(document):
        # Styles such as 'Table Grid' or 'Heading 1' could be handled here
        print(block.style.name)
        if isinstance(block, Table):
            # Join the cells of each row with spaces, one row per line
            table_content = ''
            for row in block.rows:
                row_content = [cell.text for cell in row.cells]
                table_content += ' '.join(row_content) + '\n'
            content += table_content
        if isinstance(block, Paragraph):
            content += block.text + '\n'
    print(content)

The code above only extracts the text and tables from the Word document; images are not included.
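
For the images, one option is to skip python-docx and read the .docx file as a ZIP archive, since embedded images are conventionally stored under word/media/ inside the package. A minimal sketch with the standard library only: list_docx_images is a name made up for this example, and the archive built in memory is a stand-in for a real document.

```python
import io
import zipfile


def list_docx_images(docx_file):
    """Return {member name: bytes} for every file under word/media/."""
    images = {}
    with zipfile.ZipFile(docx_file) as archive:
        for member in archive.namelist():
            if member.startswith("word/media/") and not member.endswith("/"):
                images[member] = archive.read(member)
    return images


# Build a tiny stand-in archive in memory instead of using a real .docx
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<w:document/>")
    archive.writestr("word/media/image1.png", b"fake-png-bytes")
buffer.seek(0)

print(sorted(list_docx_images(buffer)))  # ['word/media/image1.png']
```

This recovers the image files but not where they sit in the text; mapping them back to positions would require following the relationships declared in the document XML.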

