PyMuPDF How To

1. Converting a PDF to images

Any supported document type, such as PDF or XPS, can be rendered to images.

import sys, fitz  # import the binding
fname = sys.argv[1]  # get filename from command line
doc = fitz.open(fname)  # open document
for page in doc:  # iterate through the pages
    pix = page.get_pixmap()  # render page to an image
    pix.save("page-%i.png" % page.number)  # store image as a PNG

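2. Increasing the image resolution

Page.get_pixmap() creates a Pixmap object. Its Matrix parameter is the important one: it can scale, rotate, and mirror the rendered image, and its default value (the identity matrix) leaves the image unchanged. The following example enlarges the image by a factor of 2 in both the x and y directions.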

zoom_x = 2.0  # horizontal zoom
zoom_y = 2.0  # vertical zoom
mat = fitz.Matrix(zoom_x, zoom_y)  # zoom factor 2 in each dimension
pix = page.get_pixmap(matrix=mat)  # use 'mat' instead of the identity matrix
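
The zoom factor maps directly to a resolution: a PDF page is measured in points of 1/72 inch, so rendering with the identity matrix corresponds to 72 dpi. The sketch below is plain arithmetic with no PyMuPDF calls. The helper names zoom_for_dpi and pixel_size are made up for illustration, and the real pixmap may differ by a pixel because of rounding.

```python
POINTS_PER_INCH = 72  # PDF user space unit: 1 point = 1/72 inch

def zoom_for_dpi(dpi):
    """Zoom factor that yields the requested resolution."""
    return dpi / POINTS_PER_INCH

def pixel_size(page_width, page_height, zoom_x, zoom_y):
    """Approximate pixmap size for a page whose size is given in points."""
    return round(page_width * zoom_x), round(page_height * zoom_y)

zoom = zoom_for_dpi(144)  # 144 dpi corresponds to the factor 2 used above
print(zoom, pixel_size(595, 842, zoom, zoom))  # assumed A4 page, 595 x 842 points
```

The resulting value can be passed to fitz.Matrix(zoom, zoom) in the same way as zoom_x and zoom_y above.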

3. Rendering a zoomed part of a page by defining a clip rectangle

mat = fitz.Matrix(2, 2)  # zoom factor 2 in each direction
rect = page.rect  # the page rectangle
mp = (rect.tl + rect.br) / 2  # its middle point, becomes top-left of clip
clip = fitz.Rect(mp, rect.br)  # the area we want
pix = page.get_pixmap(matrix=mat, clip=clip)
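
To see what the clip does to the output size, the coordinates can be worked through by hand. This sketch uses plain tuples instead of fitz.Rect and assumes an A4 page of 595 x 842 points; lower_right_quarter is a hypothetical helper, not part of PyMuPDF.

```python
def lower_right_quarter(x0, y0, x1, y1):
    """Return the lower-right quarter of a rectangle as (x0, y0, x1, y1)."""
    mid_x = (x0 + x1) / 2  # the middle point becomes the new top-left
    mid_y = (y0 + y1) / 2
    return (mid_x, mid_y, x1, y1)

page_rect = (0, 0, 595, 842)  # assumed A4 page, in points
clip = lower_right_quarter(*page_rect)
zoom = 2
clip_w = (clip[2] - clip[0]) * zoom  # rendered width in pixels (approx.)
clip_h = (clip[3] - clip[1]) * zoom  # rendered height in pixels (approx.)
print(clip, clip_w, clip_h)
```

With a zoom of 2, a quarter of the page produces an image about as large as the full page rendered at zoom 1, so the detail doubles without the pixmap growing.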

4. Sizing the page image to fit a GUI window

# WIDTH: width of the GUI window
# HEIGHT: height of the GUI window
# clip: a subrectangle of the document page
# compare width/height ratios of image and window
if clip.width / clip.height < WIDTH / HEIGHT:
    # clip is narrower: zoom to window height
    zoom = HEIGHT / clip.height
else:  # else zoom to window width
    zoom = WIDTH / clip.width
mat = fitz.Matrix(zoom, zoom)
pix = page.get_pixmap(matrix=mat, clip=clip)
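
The ratio comparison above can be checked without a GUI. The following is a minimal sketch with made-up window and clip sizes; fit_zoom is a hypothetical helper that wraps the same if/else logic.

```python
def fit_zoom(clip_width, clip_height, win_width, win_height):
    """Largest uniform zoom at which the clip still fits the window."""
    if clip_width / clip_height < win_width / win_height:
        return win_height / clip_height  # clip is narrower: height is the limit
    return win_width / clip_width  # clip is wider: width is the limit

# Portrait clip in a landscape window: the window height decides.
print(fit_zoom(595, 842, 1024, 768))
# Landscape clip in a portrait window: the window width decides.
print(fit_zoom(842, 595, 768, 1024))
```

The result is the smaller of WIDTH / clip.width and HEIGHT / clip.height, so the rendered image never exceeds the window in either direction.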

Conversely, given a zoom factor, compute the clip that fits the window:

# tl: top-left point of the clip (a fitz.Point), chosen by the caller
width = WIDTH / zoom
height = HEIGHT / zoom
clip = fitz.Rect(tl, tl.x + width, tl.y + height)
# ensure we still are inside the page
clip &= page.rect
mat = fitz.Matrix(zoom, zoom)
pix = page.get_pixmap(matrix=mat, clip=clip)  # render via the page, as before
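
The clip computation can be traced with plain numbers. This sketch replaces fitz.Rect with tuples and spells out the intersection that clip &= page.rect performs. The function name clip_for_zoom, the A4 page size, and the window size are assumptions for illustration, and tl is assumed to lie inside the page.

```python
def clip_for_zoom(tl, win_width, win_height, zoom, page_rect):
    """Clip (x0, y0, x1, y1) that fills the window at 'zoom', kept on the page."""
    x0, y0 = tl
    x1 = x0 + win_width / zoom
    y1 = y0 + win_height / zoom
    px0, py0, px1, py1 = page_rect
    # intersect with the page rectangle (what 'clip &= page.rect' does)
    return (max(x0, px0), max(y0, py0), min(x1, px1), min(y1, py1))

page_rect = (0, 0, 595, 842)  # assumed A4 page, in points
print(clip_for_zoom((100, 100), 800, 600, 2.0, page_rect))  # fits entirely
print(clip_for_zoom((400, 700), 800, 600, 2.0, page_rect))  # cut at the page edge
```

When the clip is cut at the page edge, the rendered image is smaller than the window, so a viewer may prefer to move tl back towards the page interior instead.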

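5. Extracting images from a PDF

Every image in a PDF is identified by an integer, its xref (cross reference number). Given this number, there are two ways to get the image:

pix = fitz.Pixmap(doc, xref)  # create a Pixmap from the image; this is fast

pix.tobytes()  # the image as a bytes object (PNG by default)

img = doc.extract_image(xref)  # returns a dictionary: img["image"] holds the image data, img["ext"] the file extension

To find the xrefs, either iterate over Page.get_images(), which returns a list whose items have the form [xref, smask, ...] (the same image may be listed more than once), or loop over all xref numbers of the document and call Document.extract_image(xref), skipping any xref for which the returned dictionary is empty. Be aware that a document may contain "pseudo-images" that only define the transparency of other images.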
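
Because the same image can be listed on several pages, and mask images can show up as entries of their own, it helps to deduplicate the xrefs before extracting. The sketch below works on made-up sample data instead of a real document. The helper unique_image_xrefs is hypothetical, and it assumes that an smask value of 0 means "no mask".

```python
def unique_image_xrefs(image_lists):
    """Collect each image xref once, leaving out xrefs used as masks.

    'image_lists' holds one list per page; every item starts with
    (xref, smask, ...), as described above.
    """
    xrefs = set()
    smasks = set()
    for images in image_lists:
        for item in images:
            xrefs.add(item[0])
            if item[1] > 0:  # assumption: 0 means the image has no mask
                smasks.add(item[1])
    return sorted(xrefs - smasks)

# Made-up sample: image 12 appears on two pages, 15 is the mask of 14.
sample = [
    [(12, 0), (14, 15)],
    [(12, 0), (15, 0)],
]
print(unique_image_xrefs(sample))
```

With a real document, the input could be built as (page.get_images() for page in doc), and each remaining xref passed to doc.extract_image(xref).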

6. Putting all pictures into one PDF, one picture per page

import os, fitz
import PySimpleGUI as psg  # for showing a progress bar
doc = fitz.open()  # PDF with the pictures
imgdir = "D:/2012_10_05"  # where the pics are
imglist = os.listdir(imgdir)  # list of them
imgcount = len(imglist)  # pic count
for i, f in enumerate(imglist):
    img = fitz.open(os.path.join(imgdir, f))  # open pic as document
    rect = img[0].rect  # pic dimension
    pdfbytes = img.convert_to_pdf()  # make a PDF stream
    img.close()  # no longer needed
    imgPDF = fitz.open("pdf", pdfbytes)  # open stream as PDF
    page = doc.new_page(width = rect.width,  # new page with ...
                        height = rect.height)  # pic dimension
    page.show_pdf_page(rect, imgPDF, 0)  # image fills the page
    psg.EasyProgressMeter("Import Images",  # show our progress
                          i+1, imgcount)
doc.save("all-my-pics.pdf")
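
os.listdir() returns names in arbitrary order and includes every file in the folder, so a stray non-image file would make fitz.open() fail and the pages may not come out in the expected order. One possible guard, sketched with the standard library only, is to filter and sort the names first. The extension set and the helper name select_pictures are assumptions; adjust them to the formats actually present.

```python
import os

# Assumed set of picture extensions; not an exhaustive list of supported formats.
PICTURE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".gif", ".tif", ".tiff"}

def select_pictures(names):
    """Return the picture file names in sorted order, ignoring everything else."""
    return sorted(
        name for name in names
        if os.path.splitext(name)[1].lower() in PICTURE_EXTENSIONS
    )

print(select_pictures(["b.JPG", "notes.txt", "a.png", "Thumbs.db"]))
```

In the script above, this would mean building imglist as select_pictures(os.listdir(imgdir)).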

7. Creating vector images (SVG)

svg = page.get_svg_image(matrix=fitz.Identity)

8. Converting image formats

pix = fitz.Pixmap("input.xxx")  # any supported input format
pix.save("output.yyy")  # any supported output format

9. Composing an image from small tiles

import fitz
src = fitz.Pixmap("img-7edges.png")      # create pixmap from a picture
col = 3                                  # tiles per row
lin = 4                                  # tiles per column
tar_w = src.width * col                  # width of target
tar_h = src.height * lin                 # height of target

# create target pixmap
tar_pix = fitz.Pixmap(src.colorspace, (0, 0, tar_w, tar_h), src.alpha)

# now fill target with the tiles
for i in range(col):
    for j in range(lin):
        src.set_origin(src.width * i, src.height * j)
        tar_pix.copy(src, src.irect)  # copy input to new loc

tar_pix.save("tar.png")
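
The nested loop moves the source tile to a new origin for every grid cell. The positions it visits can be listed with plain arithmetic; in this sketch the 100 x 50 pixel tile size is an assumption and tile_origins is a hypothetical helper.

```python
def tile_origins(tile_width, tile_height, cols, rows):
    """Top-left corner of every tile, column by column."""
    return [
        (tile_width * i, tile_height * j)
        for i in range(cols)
        for j in range(rows)
    ]

origins = tile_origins(100, 50, 3, 4)  # assumed 100 x 50 pixel tile
print(len(origins), origins[0], origins[-1])
```

The last origin plus the tile size equals the target size (tar_w, tar_h), which is a quick check that the tiles cover the target exactly.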

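10. Adding an image to a PDF page

Either Page.insert_image() or Page.show_pdf_page() can place an image on a page.

(1) Page.insert_image() accepts an image file, an image in memory, or a pixmap as its source. The display resolution can be controlled, the image can be scaled, and rotation is limited to 0, 90, 180, or 270 degrees.

(2) Page.show_pdf_page() can rotate by any angle. Its source must be a PDF, so other supported document types (images included) have to be converted with convert_to_pdf() first, as in section 6.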

page.insert_image(
    rect,                  # where to place the image (rect-like)
    filename=None,         # image in a file
    stream=None,           # image in memory (bytes)
    pixmap=None,           # image from pixmap
    mask=None,             # specify alpha channel separately
    rotate=0,              # rotate (int, multiple of 90)
    xref=0,                # re-use existing image
    oc=0,                  # control visibility via OCG / OCMD
    keep_proportion=True,  # keep aspect ratio
    overlay=True,          # put in foreground
)

page.show_pdf_page(
    rect,                  # where to place the image (rect-like)
    src,                   # source PDF
    pno=0,                 # page number in source PDF
    clip=None,             # only display this area (rect-like)
    rotate=0,              # rotate (float, any value)
    oc=0,                  # control visibility via OCG / OCMD
    keep_proportion=True,  # keep aspect ratio
    overlay=True,          # put in foreground
)

11. Controlling the size and transparency of inserted images

# example: 'stream' contains a transparent PNG image:
pix = fitz.Pixmap(stream)  # intermediate pixmap
base = fitz.Pixmap(pix, 0)  # extract base image without alpha
mask = fitz.Pixmap(None, pix)  # extract alpha channel for the mask image
basestream = base.pil_tobytes("JPEG")
maskstream = mask.pil_tobytes("JPEG")
page.insert_image(rect, stream=basestream, mask=maskstream)

# Adding transparency (opacity) to an image:
stream = open("example.jpg", "rb").read()
basepix = fitz.Pixmap(stream)
opacity = 0.3  # 30% opacity, choose a value 0 < opacity < 1
value = int(255 * opacity)  # we need an integer between 0 and 255
alphas = [value] * (basepix.width * basepix.height)
alphas = bytearray(alphas)  # convert to a bytearray
pixmask = fitz.Pixmap(fitz.csGRAY, basepix.width, basepix.height, alphas, 0)
page.insert_image(rect, stream=stream, mask=pixmask.tobytes())
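
The mask above is one gray byte per pixel, all carrying the same value. A possible variant, sketched without PyMuPDF, builds that buffer by multiplying a single byte, which avoids the intermediate list. The helper name alpha_plane and the range check are additions for illustration; the result should be usable in place of the bytearray passed to the Pixmap constructor, but that has not been verified here.

```python
def alpha_plane(width, height, opacity):
    """One alpha byte per pixel, all set to the same opacity (0..1)."""
    if not 0 <= opacity <= 1:
        raise ValueError("opacity must be between 0 and 1")
    value = int(255 * opacity)  # integer between 0 and 255
    return bytes([value]) * (width * height)

alphas = alpha_plane(4, 3, 0.3)  # tiny 4 x 3 image at 30% opacity
print(len(alphas), alphas[0])
```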

12. Extracting all text from a document

import sys, fitz
fname = sys.argv[1]  # get document filename
doc = fitz.open(fname)  # open document
out = open(fname + ".txt", "wb")  # open text output
for page in doc:  # iterate the document pages
    text = page.get_text().encode("utf8")  # get plain text (is in UTF-8)
    out.write(text)  # write text of page
    out.write(bytes((12,)))  # write page delimiter (form feed 0x0C)
out.close()
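
The form feed written after each page makes it possible to split the output file back into pages later. A minimal sketch of that reverse step, using only the standard library and made-up sample bytes, could look as follows; split_pages is a hypothetical helper and assumes the page text itself contains no form feed.

```python
FORM_FEED = b"\x0c"  # same byte as bytes((12,))

def split_pages(data):
    """Split the written bytes back into one text string per page."""
    chunks = data.split(FORM_FEED)
    if chunks and chunks[-1] == b"":
        chunks.pop()  # every page ends with a delimiter, so the last chunk is empty
    return [chunk.decode("utf8") for chunk in chunks]

sample = (
    "first page\n".encode("utf8") + FORM_FEED
    + "zweite Seite\n".encode("utf8") + FORM_FEED
)
print(split_pages(sample))
```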

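13. Marking text found on a page

Page.search_for() is case-insensitive and returns a list of Rect objects, one for each hit. The first example below goes through the page's word list and underlines every word that contains the search string.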

import sys
import fitz

def mark_word(page, text):
    """Underline each word that contains 'text'."""
    found = 0
    wlist = page.get_text("words")  # make the word list
    for w in wlist:  # scan through all words on page
        if text in w[4]:  # w[4] is the word's string
            found += 1  # count
            r = fitz.Rect(w[:4])  # make rect from word bbox
            page.add_underline_annot(r)  # underline
    return found

fname = sys.argv[1]  # filename
text = sys.argv[2]  # search string
doc = fitz.open(fname)

print("underlining words containing '%s' in document '%s'" % (text, doc.name))

new_doc = False  # indicator if anything found at all

for page in doc:  # scan through the pages
    found = mark_word(page, text)  # mark the page's words
    if found:  # if anything found ...
        new_doc = True
        print("found '%s' %i times on page %i" % (text, found, page.number + 1))

if new_doc:
    doc.save("marked-" + doc.name)
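
Note that the test text in w[4] is case-sensitive, unlike Page.search_for(). If a case-insensitive match on the word list is wanted, the selection step could be sketched as below. The sample tuples are made up but follow the word layout (x0, y0, x1, y1, word, block_no, line_no, word_no); matching_bboxes is a hypothetical helper.

```python
def matching_bboxes(words, text, ignore_case=True):
    """Bounding boxes (x0, y0, x1, y1) of all words containing 'text'."""
    needle = text.lower() if ignore_case else text
    boxes = []
    for w in words:
        word = w[4].lower() if ignore_case else w[4]
        if needle in word:
            boxes.append(tuple(w[:4]))
    return boxes

# Made-up word tuples: (x0, y0, x1, y1, word, block_no, line_no, word_no)
sample = [
    (72, 72, 100, 84, "PyMuPDF", 0, 0, 0),
    (104, 72, 130, 84, "renders", 0, 0, 1),
    (134, 72, 160, 84, "pdf", 0, 0, 2),
]
print(matching_bboxes(sample, "pdf"))
print(matching_bboxes(sample, "pdf", ignore_case=False))
```

Each returned box can be turned into a fitz.Rect and underlined as in mark_word.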

The second example calls Page.search_for() and marks every hit with a squiggly underline:

# -*- coding: utf-8 -*-
import fitz

# the document to annotate
doc = fitz.open("tilted-text.pdf")
# the text to be marked
t = "¡La práctica hace el campeón!"
# work with first page only
page = doc[0]
# get list of text locations
# we use "quads", not rectangles because text may be tilted!
rl = page.search_for(t, quads = True)
# mark all found quads with one annotation
page.add_squiggly_annot(rl)
# save to a new PDF
doc.save("a-squiggly.pdf")
