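The invisible text in the OP's sample PDF is usually made invisible by defining clip paths (lying outside the bounds of the text) and by filling paths (which hide the text underneath). We therefore have to take path-related instructions into account during text extraction in order to ignore that invisible text.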
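The two hiding mechanisms can be illustrated with plain `java.awt.geom` shapes, independent of PDFBox. The following is only a minimal sketch: the class `VisibilitySketch` and its methods are hypothetical names chosen for illustration, and a single point stands in for a text position.

```java
import java.awt.geom.Area;
import java.awt.geom.GeneralPath;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox: models the two ways text gets hidden.
final class VisibilitySketch {

    // A point survives clipping if there is no clip or the clip contains it.
    static boolean insideClip(Area clip, Point2D p) {
        return clip == null || clip.contains(p);
    }

    // A point is hidden if a path filled afterwards contains it.
    static boolean coveredByFill(GeneralPath fill, Point2D p) {
        return fill != null && fill.contains(p);
    }

    // Visible = not clipped away and not painted over.
    static boolean isVisible(Area clip, GeneralPath fill, Point2D p) {
        return insideClip(clip, p) && !coveredByFill(fill, p);
    }

    public static void main(String[] args) {
        Area clip = new Area(new Rectangle2D.Double(0, 0, 100, 100));
        GeneralPath fill = new GeneralPath(new Rectangle2D.Double(10, 10, 30, 30));

        System.out.println(isVisible(clip, fill, new Point2D.Double(50, 50)));  // true
        System.out.println(isVisible(clip, fill, new Point2D.Double(20, 20)));  // false: painted over
        System.out.println(isVisible(clip, fill, new Point2D.Double(150, 50))); // false: clipped away
    }
}
```

A real text stripper has to make both decisions while the content stream is being processed, because the clip path and the filled paths change from one instruction to the next.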

Unfortunately, the callbacks designed for these instructions are declared neither in PDFTextStripper nor in its parent classes LegacyPDFStreamEngine and PDFStreamEngine.

They are declared, however, in the other major PDFStreamEngine subclass, PDFGraphicsStreamEngine, and they are sensibly implemented in PageDrawer.

To make use of this, we can copy, paste, and adapt the PageDrawer implementation into a subclass of PDFTextStripper, for example like this:

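// Imports assume the PDFBox 2.0.x package layout
import java.awt.geom.Area;
import java.awt.geom.GeneralPath;
import java.awt.geom.Point2D;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.contentstream.operator.MissingOperandException;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.OperatorProcessor;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSNumber;
import org.apache.pdfbox.pdmodel.graphics.state.PDGraphicsState;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.Vector;

public class PDFVisibleTextStripper extends PDFTextStripper {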

public PDFVisibleTextStripper() throws IOException {

addOperator(new AppendRectangleToPath());

addOperator(new ClipEvenOddRule());

addOperator(new ClipNonZeroRule());

addOperator(new ClosePath());

addOperator(new CurveTo());

addOperator(new CurveToReplicateFinalPoint());

addOperator(new CurveToReplicateInitialPoint());

addOperator(new EndPath());

addOperator(new FillEvenOddAndStrokePath());

addOperator(new FillEvenOddRule());

addOperator(new FillNonZeroAndStrokePath());

addOperator(new FillNonZeroRule());

addOperator(new LineTo());

addOperator(new MoveTo());

addOperator(new StrokePath());

}

@Override

protected void processTextPosition(TextPosition text) {

Matrix textMatrix = text.getTextMatrix();

Vector start = textMatrix.transform(new Vector(0, 0));

Vector end = new Vector(start.getX() + text.getWidth(), start.getY());

PDGraphicsState gs = getGraphicsState();

Area area = gs.getCurrentClippingPath();

if (area == null || (area.contains(start.getX(), start.getY()) && area.contains(end.getX(), end.getY())))

super.processTextPosition(text);

}

private GeneralPath linePath = new GeneralPath();

void deleteCharsInPath() {

for (List<TextPosition> list : charactersByArticle) {

List<TextPosition> toRemove = new ArrayList<>();

for (TextPosition text : list) {

Matrix textMatrix = text.getTextMatrix();

Vector start = textMatrix.transform(new Vector(0, 0));

Vector end = new Vector(start.getX() + text.getWidth(), start.getY());

if (linePath.contains(start.getX(), start.getY()) || linePath.contains(end.getX(), end.getY())) {

toRemove.add(text);

}

}

if (toRemove.size() != 0) {

System.out.println(toRemove.size());

list.removeAll(toRemove);

}

}

}

public final class AppendRectangleToPath extends OperatorProcessor {

@Override

public void process(Operator operator, List<COSBase> operands) throws IOException {

if (operands.size() < 4) {

throw new MissingOperandException(operator, operands);

}

if (!checkArrayTypesClass(operands, COSNumber.class)) {

return;

}

COSNumber x = (COSNumber) operands.get(0);

COSNumber y = (COSNumber) operands.get(1);

COSNumber w = (COSNumber) operands.get(2);

COSNumber h = (COSNumber) operands.get(3);

float x1 = x.floatValue();

float y1 = y.floatValue();

// create a pair of coordinates for the transformation

float x2 = w.floatValue() + x1;

float y2 = h.floatValue() + y1;

Point2D p0 = context.transformedPoint(x1, y1);

Point2D p1 = context.transformedPoint(x2, y1);

Point2D p2 = context.transformedPoint(x2, y2);

Point2D p3 = context.transformedPoint(x1, y2);

// to ensure that the path is created in the right direction, we have to create

// it by combining single lines instead of creating a simple rectangle

linePath.moveTo((float) p0.getX(), (float) p0.getY());

linePath.lineTo((float) p1.getX(), (float) p1.getY());

linePath.lineTo((float) p2.getX(), (float) p2.getY());

linePath.lineTo((float) p3.getX(), (float) p3.getY());

// close the subpath instead of adding the last line so that a possible set line

// cap style isn't taken into account at the "beginning" of the rectangle

linePath.closePath();

}

@Override

public String getName() {

return "re";

}

}

public final class StrokePath extends OperatorProcessor {

@Override

public void process(Operator operator, List<COSBase> operands) throws IOException {

linePath.reset();

}

@Override

public String getName() {

return "S";

}

}

public final class FillEvenOddRule extends OperatorProcessor {

@Override

public void process(Operator operator, List<COSBase> operands) throws IOException {

linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);

deleteCharsInPath();

linePath.reset();

}

@Override

public String getName() {

return "f*";

}

}

public class FillNonZeroRule extends OperatorProcessor {

@Override

public final void process(Operator operator, List<COSBase> operands) throws IOException {

linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);

deleteCharsInPath();

linePath.reset();

}

@Override

public String getName() {

return "f";

}

}

public final class FillEvenOddAndStrokePath extends OperatorProcessor {

@Override

public void process(Operator operator, List<COSBase> operands) throws IOException {

linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);

deleteCharsInPath();

linePath.reset();

}

@Override

public String getName() {

return "B*";

}

}

public class FillNonZeroAndStrokePath extends OperatorProcessor {

@Override

public void process(Operator operator, List<COSBase> operands) throws IOException {

linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);

deleteCharsInPath();

linePath.reset();

}

@Override

public String getName() {

return "B";

}

}

public final class ClipEvenOddRule extends OperatorProcessor {

@Override

public void process(Operator operator, List<COSBase> operands) throws IOException {

linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);

getGraphicsState().intersectClippingPath(linePath);

}

@Override

public String getName() {

return "W*";

}

}

public class ClipNonZeroRule extends OperatorProcessor {

@Override

public void process(Operator operator, List<COSBase> operands) throws IOException {

linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);

getGraphicsState().intersectClippingPath(linePath);

}

@Override

public String getName() {

return "W";

}

}

public final class MoveTo extends OperatorProcessor {

@Override

public void process(Operator operator, List<COSBase> operands) throws IOException {

if (operands.size() < 2) {

throw new MissingOperandException(operator, operands);

}

COSBase base0 = operands.get(0);

if (!(base0 instanceof COSNumber)) {

return;

}

COSBase base1 = operands.get(1);

if (!(base1 instanceof COSNumber)) {

return;

}

COSNumber x = (COSNumber) base0;

COSNumber y = (COSNumber) base1;

Point2D.Float pos = context.transformedPoint(x.floatValue(), y.floatValue());

linePath.moveTo(pos.x, pos.y);

}

@Override

public String getName() {

return "m";

}

}

public class LineTo extends OperatorProcessor {

@Override

public void process(Operator operator, List<COSBase> operands) throws IOException {

if (operands.size() < 2) {

throw new MissingOperandException(operator, operands);

}

COSBase base0 = operands.get(0);

if (!(base0 instanceof COSNumber)) {

return;

}

COSBase base1 = operands.get(1);

if (!(base1 instanceof COSNumber)) {

return;

}

// append straight line segment from the current point to the point

COSNumber x = (COSNumber) base0;

COSNumber y = (COSNumber) base1;

Point2D.Float pos = context.transformedPoint(x.floatValue(), y.floatValue());

linePath.lineTo(pos.x, pos.y);

}

@Override

public String getName() {

return "l";

}

}

public class CurveTo extends OperatorProcessor {

@Override

public void process(Operator operator, List<COSBase> operands) throws IOException {

if (operands.size() < 6) {

throw new MissingOperandException(operator, operands);

}

if (!checkArrayTypesClass(operands, COSNumber.class)) {

return;

}

COSNumber x1 = (COSNumber) operands.get(0);

COSNumber y1 = (COSNumber) operands.get(1);

COSNumber x2 = (COSNumber) operands.get(2);

COSNumber y2 = (COSNumber) operands.get(3);

COSNumber x3 = (COSNumber) operands.get(4);

COSNumber y3 = (COSNumber) operands.get(5);

Point2D.Float point1 = context.transformedPoint(x1.floatValue(), y1.floatValue());

Point2D.Float point2 = context.transformedPoint(x2.floatValue(), y2.floatValue());

Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());

linePath.curveTo(point1.x, point1.y, point2.x, point2.y, point3.x, point3.y);

}

@Override

public String getName() {

return "c";

}

}

public final class CurveToReplicateFinalPoint extends OperatorProcessor {

@Override

public void process(Operator operator, List<COSBase> operands) throws IOException {

if (operands.size() < 4) {

throw new MissingOperandException(operator, operands);

}

if (!checkArrayTypesClass(operands, COSNumber.class)) {

return;

}

COSNumber x1 = (COSNumber) operands.get(0);

COSNumber y1 = (COSNumber) operands.get(1);

COSNumber x3 = (COSNumber) operands.get(2);

COSNumber y3 = (COSNumber) operands.get(3);

Point2D.Float point1 = context.transformedPoint(x1.floatValue(), y1.floatValue());

Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());

linePath.curveTo(point1.x, point1.y, point3.x, point3.y, point3.x, point3.y);

}

@Override

public String getName() {

return "y";

}

}

public class CurveToReplicateInitialPoint extends OperatorProcessor {

@Override

public void process(Operator operator, List<COSBase> operands) throws IOException {

if (operands.size() < 4) {

throw new MissingOperandException(operator, operands);

}

if (!checkArrayTypesClass(operands, COSNumber.class)) {

return;

}

COSNumber x2 = (COSNumber) operands.get(0);

COSNumber y2 = (COSNumber) operands.get(1);

COSNumber x3 = (COSNumber) operands.get(2);

COSNumber y3 = (COSNumber) operands.get(3);

Point2D currentPoint = linePath.getCurrentPoint();

Point2D.Float point2 = context.transformedPoint(x2.floatValue(), y2.floatValue());

Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());

linePath.curveTo((float) currentPoint.getX(), (float) currentPoint.getY(), point2.x, point2.y, point3.x, point3.y);

}

@Override

public String getName() {

return "v";

}

}

public final class ClosePath extends OperatorProcessor {

@Override

public void process(Operator operator, List<COSBase> operands) throws IOException {

linePath.closePath();

}

@Override

public String getName() {

return "h";

}

}

public final class EndPath extends OperatorProcessor {

@Override

public void process(Operator operator, List<COSBase> operands) throws IOException {

linePath.reset();

}

@Override

public String getName() {

return "n";

}

}

}

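Make sure the PDFVisibleTextStripper constructor uses the inner operator classes defined above, not the identically named classes used by PageDrawer. To be sure, simply follow the link below the code.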

This reduces the output to:

REVERSE tEaSER caRd

500

elections

er of Teams

t Bet

1,000

MARK BOX AS SHOWN 

DENOTES HOME TEAM

PRO FOOTBALL - THURSDAY, SEPTEMBER 8, 2016

1 PANTHERS nbc - 10½ 8:30p 2 BRONCOS  - 3½

PRO FOOTBALL - SUNDAY, SEPTEMBER 11, 2016

3 FALCONS  - 9½ 1:00p 4 BUCCANEERS - 4½

5 VIKINGS - 9½ 1:00p 6 TITANS  - 4½

7 EAGLES  - 10½ 1:00p 8 BROWNS - 3½

9 BENGALS - 9½ 1:00p 10 JETS  - 4½

11 SAINTS  - 7½ 1:00p 12 RAIDERS - 6½

13 CHIEFS  - 14½ 1:00p 14 CHARGERS + ½

15 RAVENS  - 10½ 1:00p 16 BILLS - 3½

17 TEXANS  - 14½ 1:00p 18 BEARS + ½

19 PACKERS - 12½ 1:00p 20 JAGUARS  - 1½

21 SEAHAWKS  - 17½ 4:05p 22 DOLPHINS + 3½

23 COWBOYS  - 7½ 4:25p 24 GIANTS - 6½

25 COLTS  - 10½ 4:25p 26 LIONS - 3½

27 CARDINALS  nbc - 14½ 8:30p 28 PATRIOTS + ½

PRO FOOTBALL - MONDAY, SEPTEMBER 12, 2016

29 STEELERS espn - 10½ 7:10p 30 REDSKINS  - 3½

31 RAMS espn - 9½ 10:20p 32 49ERS  - 4½

This drops most of the unwanted data.

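In the context of a related question it became apparent that the way processTextPosition and deleteCharsInPath compute the end of the character baseline implicitly assumes horizontal text without page rotation. If one relaxes the "visibility" criterion, however, a character can be considered visible as soon as the start of its baseline is visible. In that case the computed Vector end is no longer needed, and the code also works for rotated pages.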
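The effect of the horizontal-text assumption can be reproduced with `java.awt.geom` alone. This is a sketch under stated assumptions, not the code of the stripper itself: `RelaxedVisibilitySketch`, `strictVisible`, and `relaxedVisible` are hypothetical names. `strictVisible` mirrors the start-and-end check from processTextPosition, and `relaxedVisible` checks only the baseline start.

```java
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;

// Hypothetical illustration of the strict versus the relaxed visibility criterion.
final class RelaxedVisibilitySketch {

    // Strict check: the end point is computed as start.x + width, which is
    // only correct for horizontal text on an unrotated page.
    static boolean strictVisible(Area clip, double startX, double startY, double width) {
        double endX = startX + width;
        return clip == null
                || (clip.contains(startX, startY) && clip.contains(endX, startY));
    }

    // Relaxed check: only the start of the baseline has to lie inside the clip.
    static boolean relaxedVisible(Area clip, double startX, double startY) {
        return clip == null || clip.contains(startX, startY);
    }

    public static void main(String[] args) {
        // A tall, narrow clip area, as it might surround text running vertically.
        Area clip = new Area(new Rectangle2D.Double(0, 0, 20, 200));

        // Vertical text starting at (10, 10) with width 100 really ends near (10, 110),
        // but the strict check tests (110, 10), which lies outside the clip.
        System.out.println(strictVisible(clip, 10, 10, 100)); // false: wrongly dropped
        System.out.println(relaxedVisible(clip, 10, 10));     // true: kept
    }
}
```

Applied to the stripper, the relaxed criterion would mean testing only `start` in both processTextPosition and deleteCharsInPath. The trade-off is that a character whose baseline starts in a visible spot but is otherwise clipped or covered is still extracted.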
