Extracting Text from a PDF in Spring Boot

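Text extracted from a PDF may come out garbled, depending on the fonts the file uses. (Tip: for higher accuracy, consider converting the PDF to images first and then calling the Baidu API to recognize the text. Link: https://blog.csdn.net/weixin_45652692/article/details/118190220)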
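One way to decide when to fall back to the image-plus-OCR route is to check how much of the extracted text looks damaged. The sketch below is an illustration only: the class name `GarbleCheck`, its method names, and the 0.2 threshold are all assumptions, not part of the original utility. It counts U+FFFD replacement characters, private-use code points (which often appear when a font has no usable Unicode mapping), and unassigned code points.

```java
/** Hypothetical helper (an assumption for illustration, not part of the original code). */
final class GarbleCheck {

    private GarbleCheck() {
    }

    /**
     * Returns the fraction of code points that look like extraction damage:
     * U+FFFD, private-use code points, and unassigned code points.
     */
    static double suspiciousRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            int type = Character.getType(cp);
            if (cp == 0xFFFD
                    || type == Character.PRIVATE_USE
                    || type == Character.UNASSIGNED) {
                suspicious++;
            }
        }
        return (double) suspicious / total;
    }

    /** The 0.2 threshold is an arbitrary assumption; tune it on real documents. */
    static boolean looksGarbled(String text) {
        return suspiciousRatio(text) > 0.2;
    }
}
```

A caller could run `GarbleCheck.looksGarbled(content)` on the extracted string and switch to OCR only when it returns true. The heuristic cannot detect text that is wrong but made of ordinary characters, so it is a rough filter rather than a guarantee.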

pom.xml

        <!-- iText 5: extracts text and reads the page count -->
        <dependency>
            <groupId>com.itextpdf</groupId>
            <artifactId>itextpdf</artifactId>
            <version>5.5.6</version>
        </dependency>
        <!-- PDFBox 1.8.x: the utility class imports org.apache.pdfbox.util.PDFTextStripper,
             which moved to org.apache.pdfbox.text in 2.x, so the second pdfbox
             declaration (2.0.20) is dropped to avoid a version conflict -->
        <dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>pdfbox</artifactId>
            <version>1.8.11</version>
        </dependency>

PDFToWordUtils.java

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
import org.springframework.stereotype.Component;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * @description: Extracts text from PDF files
 * @author: LCHYUE
 * @time: 2021/6/15 9:29
 */
@Component
public class PDFToWordUtils {

    /**
     * @Description: Extract text from a PDF, method 1 (iText)
     * @Param: fileUrlList: path of the PDF file
     * @return: content: the extracted text
     * @Author: LCHYUE
     * @Date: 2021/6/15
     */
    public String PDFToWord(String fileUrlList) throws IOException {
        // Normalize Windows-style separators so the same path also works on Linux.
        // String.replace is literal; replaceAll would reject a lone "\" replacement on Windows.
        String fileName = fileUrlList.replace('\\', File.separatorChar);
        PdfReader reader = new PdfReader(fileName); // PDF reader for the source file
        try {
            int pages = reader.getNumberOfPages();
            StringBuilder content = new StringBuilder();
            for (int i = 1; i <= pages; i++) {
                // Read page i (page numbers are 1-based) and append its text
                content.append(PdfTextExtractor.getTextFromPage(reader, i));
            }
            String text = stripLineBreaksAndTabs(content.toString());
            System.out.println(text); // print the text of all pages to the console
            return text;
        } finally {
            reader.close();
        }
    }

    /**
     * @Description: Get the number of pages in a PDF
     * @Param: fileUrlList: path of the PDF file
     * @return: pages: the page count
     * @Author: LCHYUE
     * @Date: 2021/6/15
     */
    public int PDFToPage(String fileUrlList) throws IOException {
        String fileName = fileUrlList.replace('\\', File.separatorChar);
        // Let the IOException propagate instead of swallowing it and hitting a NullPointerException
        PdfReader pdfReader = new PdfReader(fileName);
        try {
            int pages = pdfReader.getNumberOfPages();
            System.out.println("Total pages in the PDF: " + pages);
            return pages;
        } finally {
            pdfReader.close();
        }
    }

    /**
     * @Description: Extract text from a PDF, method 2 (PDFBox 1.8.x)
     * @Param: pdfPath: path of the PDF file
     * @return: content: the extracted text
     * @Author: LCHYUE
     * @Date: 2021/6/15
     */
    public static String getTextFromPdf(String pdfPath) throws Exception {
        boolean sort = false; // whether to sort the text by position
        int startPage = 1;    // first page to extract

        File pdfFile = new File(pdfPath.replace('\\', File.separatorChar));
        InputStream input = null;
        PDDocument document = null;
        String content;
        try {
            input = new FileInputStream(pdfFile);
            // Load the PDF document
            PDFParser parser = new PDFParser(input);
            parser.parse();
            document = parser.getPDDocument();

            // Extract the text
            PDFTextStripper pts = new PDFTextStripper();
            pts.setSortByPosition(sort);
            int endPage = document.getNumberOfPages(); // last page to extract
            System.out.println("Total Page: " + endPage);
            pts.setStartPage(startPage);
            pts.setEndPage(endPage);
            content = pts.getText(document);
            System.out.println("Get PDF Content ...");
        } finally {
            if (null != document) document.close();
            if (null != input) input.close();
        }
        content = stripLineBreaksAndTabs(content);
        System.out.println(content);
        return content;
    }

    /** Removes line feeds, carriage returns, and tabs from the extracted text. */
    private static String stripLineBreaksAndTabs(String text) {
        return text.replace("\n", "").replace("\r", "").replace("\t", "");
    }

    public static void main(String[] args) throws Exception {
        // PDFToWordUtils utils = new PDFToWordUtils();
        // utils.PDFToPage("D:\\Desktop\\图书文件夹\\4.pdf");
        // utils.PDFToWord("D:\\Desktop\\图书文件夹\\4.pdf");
        getTextFromPdf("D:\\Desktop\\图书文件夹\\4.pdf");
    }
}
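Two details in this kind of helper are easy to get wrong. `String.replaceAll` treats its second argument as a replacement pattern, so passing `File.separator` fails on Windows, where the separator is a single backslash and the call throws an `IllegalArgumentException`. Likewise, the Java literal `"\\\\r"` matches two backslashes followed by the letter `r`, not a carriage return. The following stdlib-only sketch shows literal replacements that avoid both problems; the class and method names are hypothetical and chosen only for this illustration.

```java
import java.io.File;

/** Hypothetical helpers (names are assumptions made for this illustration). */
final class PdfTextHelpers {

    private PdfTextHelpers() {
    }

    /** Converts backslashes to the platform separator using a literal, non-regex replace. */
    static String normalizePath(String path) {
        return path.replace('\\', File.separatorChar);
    }

    /** Removes real line feeds, carriage returns, and tab characters. */
    static String stripLineBreaksAndTabs(String text) {
        return text.replace("\n", "").replace("\r", "").replace("\t", "");
    }
}
```

On Linux, `normalizePath("D:\\Desktop\\4.pdf")` turns each backslash into `/`; on Windows the string is returned unchanged. Note that stripping every line break also merges the last word of one line with the first word of the next, which may or may not be what a caller wants.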
