*Recognizing keywords in a PDF with Java and getting their coordinates*

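Today I got a requirement: find the coordinates of a keyword in a PDF and insert an electronic signature image (PNG or JPG) at that position. I went through many examples online and found that the recognition result differs from one PDF to another: some PDFs come back one character at a time, others one run of text at a time. It turned out that the difference comes from how the PDF was produced (converted from Word, converted from Excel, or converted from an image).

So I put together my own example, and it works quite well.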


Without further ado.
Dependency: itextpdf-5.5.13.jar

1. Create a DTO entity to hold the returned coordinate information

public class MatchItem {

    /** Page number */
    private Integer pageNum;
    /** X coordinate */
    private Float x;
    /** Y coordinate */
    private Float y;
    /** Page width */
    private Float pageWidth;
    /** Page height */
    private Float pageHeight;
    /** Matched text */
    private String content;

    public Integer getPageNum() {
        return pageNum;
    }

    public void setPageNum(Integer pageNum) {
        this.pageNum = pageNum;
    }

    public Float getX() {
        return x;
    }

    public void setX(Float x) {
        this.x = x;
    }

    public Float getY() {
        return y;
    }

    public void setY(Float y) {
        this.y = y;
    }

    public Float getPageWidth() {
        return pageWidth;
    }

    public void setPageWidth(Float pageWidth) {
        this.pageWidth = pageWidth;
    }

    public Float getPageHeight() {
        return pageHeight;
    }

    public void setPageHeight(Float pageHeight) {
        this.pageHeight = pageHeight;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

    @Override
    public String toString() {
        return "MatchItem [pageNum=" + pageNum + ", x=" + x + ", y=" + y
                + ", pageWidth=" + pageWidth + ", pageHeight=" + pageHeight
                + ", content=" + content + "]";
    }
}

2. Create a utility class that locates the text coordinates

import java.util.ArrayList;
import java.util.List;

import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;

public class AutoMatch {

    /** Searches every page of the PDF for the keyword. */
    public static List<MatchItem> matchPage(String fileName, String keyword) throws Exception {
        List<MatchItem> items = new ArrayList<>();
        PdfReader reader = new PdfReader(fileName);
        try {
            int pageCount = reader.getNumberOfPages();
            for (int page = 1; page <= pageCount; page++) {
                items.addAll(matchPage(reader, page, keyword));
            }
        } finally {
            // Release the file handle even if parsing fails
            reader.close();
        }
        return items;
    }

    /** Searches a single page for the keyword. */
    public static List<MatchItem> matchPage(PdfReader reader, Integer pageNumber, String keyword) throws Exception {
        KeyWordPositionListener renderListener = new KeyWordPositionListener();
        renderListener.setKeyword(keyword);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        Rectangle rectangle = reader.getPageSize(pageNumber);
        renderListener.setPageNumber(pageNumber);
        renderListener.setCurPageSize(rectangle);
        parser.processContent(pageNumber, renderListener);
        return findKeywordItems(renderListener, keyword);
    }

    public static List<MatchItem> findKeywordItems(KeyWordPositionListener renderListener, String keyword) {
        // All text chunks rendered on this page
        List<MatchItem> allItems = renderListener.getAllItems();
        // Join every chunk on the page into one string to check whether the keyword occurs at all
        StringBuilder sbtemp = new StringBuilder();
        for (MatchItem item : allItems) {
            sbtemp.append(item.getContent());
        }
        // The page text does not contain the keyword: return what the listener found
        if (sbtemp.indexOf(keyword) == -1) {
            return renderListener.getMatches();
        }
        // Case 1: chunks whose content equals the keyword (already collected by the listener)
        List<MatchItem> matches = renderListener.getMatches();
        // Case 2: several consecutive chunks together form the keyword, so match them one by one
        sbtemp = new StringBuilder();
        List<MatchItem> tempItems = new ArrayList<>();
        for (MatchItem item : allItems) {
            // (1) the chunk is part of the keyword, (2) consecutive chunks concatenate to the keyword,
            // (3) chunks that equal the keyword are skipped because case 1 covers them.
            // Handled: keyword 中国移动 with chunks 中, 国, 移动.
            // Not handled (and not allowed): keyword 中华人民 with chunks 中, 华人民共和国.
            if (keyword.contains(item.getContent()) && !keyword.equals(item.getContent())) {
                tempItems.add(item);
                sbtemp.append(item.getContent());
                // The buffered string is no longer part of the keyword: restart from the current chunk
                if (!keyword.contains(sbtemp.toString())) {
                    sbtemp = new StringBuilder(item.getContent());
                    tempItems.clear();
                    tempItems.add(item);
                }
                // The buffered string now equals the keyword
                if (sbtemp.toString().equalsIgnoreCase(keyword)) {
                    MatchItem tmpitem = getRightItem(tempItems, keyword);
                    if (tmpitem != null) {
                        matches.add(tmpitem);
                    }
                    // Reset the buffers and keep searching
                    sbtemp = new StringBuilder();
                    tempItems.clear();
                }
            } else {
                // The chunk is not part of the keyword: reset the buffers
                sbtemp = new StringBuilder();
                tempItems.clear();
            }
        }
        // Case 3: the keyword is contained inside a single larger chunk
        for (MatchItem item : allItems) {
            if (item.getContent().contains(keyword) && !keyword.equals(item.getContent())) {
                matches.add(item);
            }
        }
        return matches;
    }

    /** Returns the first buffered chunk, which is where the keyword starts. */
    public static MatchItem getRightItem(List<MatchItem> tempItems, String keyword) {
        for (MatchItem item : tempItems) {
            if (keyword.contains(item.getContent()) && !keyword.equals(item.getContent())) {
                return item;
            }
        }
        return null;
    }
}
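
The comments in the second case note that a keyword such as 中华人民 split across the chunks 中 and 华人民共和国 cannot be matched. One possible alternative, sketched here with plain strings instead of iText objects, is to join all chunk texts, search the joined string, and map each hit back to the chunk in which the keyword starts. `ChunkSearch` and `findStartChunks` are hypothetical names introduced for illustration; they are not part of the code above or of iText, and the sketch uses ASCII sample text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the original code or of iText.
final class ChunkSearch {

    // Returns, for every occurrence of keyword in the joined chunk text,
    // the index of the chunk in which that occurrence begins.
    static List<Integer> findStartChunks(List<String> chunks, String keyword) {
        List<Integer> result = new ArrayList<>();
        if (keyword == null || keyword.isEmpty()) {
            return result;
        }
        StringBuilder joined = new StringBuilder();
        int[] starts = new int[chunks.size()]; // offset of each chunk inside the joined text
        for (int i = 0; i < chunks.size(); i++) {
            starts[i] = joined.length();
            joined.append(chunks.get(i));
        }
        int from = 0;
        int hit;
        while ((hit = joined.indexOf(keyword, from)) != -1) {
            // Walk forward to the chunk whose character range covers the hit offset
            int chunk = 0;
            while (starts[chunk] + chunks.get(chunk).length() <= hit) {
                chunk++;
            }
            result.add(chunk);
            from = hit + keyword.length(); // continue after this occurrence
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList("Sig", "nature", ":", "Da", "te");
        System.out.println(findStartChunks(chunks, "Signature"));
        System.out.println(findStartChunks(chunks, "Date"));
    }
}
```

In the utility class, the returned index would be used to look up the corresponding MatchItem in allItems, whose X and Y mark the start of the chunk rather than the exact start of the keyword inside it.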

3. Create a PDF render listener class that implements the library's callback methods

import java.util.ArrayList;
import java.util.List;

// In iText 5.5.x the bounding rectangle type lives in com.itextpdf.awt.geom, not java.awt.geom
import com.itextpdf.awt.geom.Rectangle2D;
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

public class KeyWordPositionListener implements RenderListener {

    private List<MatchItem> matches = new ArrayList<>();
    private List<MatchItem> allItems = new ArrayList<>();
    private Rectangle curPageSize;
    /** Keyword to match */
    private String keyword;
    /** Page currently being parsed */
    private Integer pageNumber;

    @Override
    public void beginTextBlock() {
        // do nothing
    }

    @Override
    public void renderText(TextRenderInfo renderInfo) {
        // Strip brackets, quotes, punctuation and spaces from the chunk before comparing
        String content = renderInfo.getText().replaceAll("[<《(（\"'>》)）、.：: ]", "");
        Rectangle2D.Float textRectangle = renderInfo.getDescentLine().getBoundingRectange();
        MatchItem item = new MatchItem();
        item.setContent(content);
        item.setPageNum(pageNumber);
        item.setPageWidth(curPageSize.getWidth());
        item.setPageHeight(curPageSize.getHeight());
        item.setX((float) textRectangle.getX());
        item.setY((float) textRectangle.getY());
        if (!content.isEmpty()) {
            if (content.equalsIgnoreCase(keyword)) {
                matches.add(item);
            }
        } else {
            // Placeholder text ("empty string") for chunks that have nothing left after stripping
            item.setContent("空字符串");
        }
        // Keep every chunk so that multi-chunk matching can run afterwards
        allItems.add(item);
    }

    @Override
    public void endTextBlock() {
        // do nothing
    }

    @Override
    public void renderImage(ImageRenderInfo renderInfo) {
        // do nothing
    }

    /** Sets the page currently being parsed. */
    public void setPageNumber(Integer pageNumber) {
        this.pageNumber = pageNumber;
    }

    /** Sets the keyword to match (case-insensitive). */
    public void setKeyword(String keyword) {
        this.keyword = keyword;
    }

    /** Returns the list of matches. */
    public List<MatchItem> getMatches() {
        return matches;
    }

    void setCurPageSize(Rectangle rect) {
        this.curPageSize = rect;
    }

    public List<MatchItem> getAllItems() {
        return allItems;
    }

    public void setAllItems(List<MatchItem> allItems) {
        this.allItems = allItems;
    }
}
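
Because renderText strips brackets, quotes, punctuation and spaces from every chunk before comparing, a keyword that itself contains one of those characters (for example `Signature:`) can never match. A minimal sketch of one way around this, assuming you are free to preprocess the keyword, is to run it through the same cleanup before calling setKeyword. `KeywordNormalizer` and `normalize` are hypothetical names introduced here; the character set is meant to mirror the characters removed in renderText, written with Unicode escapes so the source stays ASCII.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the original listener.
final class KeywordNormalizer {

    // Characters assumed to match the ones removed in renderText:
    // < > ( ) " ' . : and space, plus the Chinese book-title marks (U+300A, U+300B),
    // full-width parentheses (U+FF08, U+FF09), ideographic comma (U+3001)
    // and full-width colon (U+FF1A).
    private static final Pattern STRIPPED =
            Pattern.compile("[<\u300A(\uFF08\"'>\u300B)\uFF09\u3001.\uFF1A: ]");

    // Removes every stripped character so the keyword is comparable to chunk content.
    static String normalize(String keyword) {
        return STRIPPED.matcher(keyword).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(normalize("Signature: "));
        System.out.println(normalize("(Party A)"));
    }
}
```

With this in place, the keyword passed to the listener would be `KeywordNormalizer.normalize(keyword)` rather than the raw input.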

4. Write a test case

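public static void main(String[] args) throws Exception {
    // Replace the placeholders with the PDF path and the text to match
    List<MatchItem> matchItemList = AutoMatch.matchPage("path/to/file.pdf", "text to match");
    matchItemList.forEach(System.out::println);
}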

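Once the coordinates are returned, the electronic signature can be inserted at the position of the matched text.

Tip: when inserting a signature you will usually need an offset in one or more directions to land it exactly where you want; simply add to or subtract from the X or Y coordinate.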
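
The offset adjustment can be sketched as a small calculation. PDF pages normally put the origin at the bottom-left corner, so adding to Y moves the image up and subtracting moves it down. The class below is a hypothetical helper, not part of the code above: it applies an offset to the matched coordinate and clamps the result so that an image of the given size stays on the page. The numbers in `main` are made-up sample values.

```java
// Hypothetical helper for illustration only; names and sample values are assumptions.
final class SignaturePlacement {

    // Returns {x, y} for the lower-left corner of the image after applying the offset.
    static float[] place(float x, float y, float dx, float dy,
                         float imgW, float imgH, float pageW, float pageH) {
        float px = clamp(x + dx, 0f, pageW - imgW);
        float py = clamp(y + dy, 0f, pageH - imgH);
        return new float[] {px, py};
    }

    // Limits v to [min, max]; falls back to min when the image is larger than the page.
    private static float clamp(float v, float min, float max) {
        return Math.max(min, Math.min(v, Math.max(min, max)));
    }

    public static void main(String[] args) {
        // Keyword found at (100, 200) on a 595 x 842 page; shift right by 40 and down by 10
        float[] pos = place(100f, 200f, 40f, -10f, 80f, 40f, 595f, 842f);
        System.out.println(pos[0] + ", " + pos[1]);
    }
}
```

The resulting pair can then be handed to whatever call positions the image, for example `Image.setAbsolutePosition(x, y)` in iText 5.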

Inserting an image into a PDF is covered in a separate post.
