Converting HTML to RichTextString for Apache POI


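1. Overview

In this tutorial, we will build an application that takes HTML as input and creates a Microsoft Excel workbook containing a RichText representation of that HTML. To generate the Microsoft Excel workbook, we will use Apache POI. To analyze the HTML, we will use Jericho.

The full source code for this tutorial is available on Github.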

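2. What is Jericho?

Jericho is a Java library that allows analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any unrecognized or invalid HTML. It also provides high-level HTML form manipulation functions. It is an open source library released under the following licenses: Eclipse Public License (EPL), GNU Lesser General Public License (LGPL), and Apache License.

I found Jericho very easy to use for the goal of converting HTML to RichText.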

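3. pom.xml

Here are the dependencies required by the application we are building. Note that this application needs Java 9 or later, because it calls the Matcher.appendReplacement overload in java.util.regex that accepts a StringBuilder, and that overload has only been available since Java 9.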

<parent>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-parent</artifactId>
    <version>1.5.9.RELEASE</version>
    <relativePath /> <!-- lookup parent from repository -->
</parent>

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
    <java.version>9</java.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-batch</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-thymeleaf</artifactId>
    </dependency>
    <dependency>
        <groupId>com.h2database</groupId>
        <artifactId>h2</artifactId>
        <scope>runtime</scope>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.commons/commons-lang3 -->
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>3.7</version>
    </dependency>
    <dependency>
        <groupId>org.springframework.batch</groupId>
        <artifactId>spring-batch-test</artifactId>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi</artifactId>
        <version>3.15</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>3.15</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/net.htmlparser.jericho/jericho-html -->
    <dependency>
        <groupId>net.htmlparser.jericho</groupId>
        <artifactId>jericho-html</artifactId>
        <version>3.4</version>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-configuration-processor</artifactId>
        <optional>true</optional>
    </dependency>
    <!-- legacy html allow -->
    <dependency>
        <groupId>net.sourceforge.nekohtml</groupId>
        <artifactId>nekohtml</artifactId>
    </dependency>
</dependencies>

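4. Web Page - Thymeleaf

We use Thymeleaf to create a basic webpage that has a form with a textarea. The source code for the Thymeleaf page is available on Github. If we like, this textarea could be replaced with a RichText editor such as CKEditor. We just need to make sure the data sent over AJAX is correct by using an appropriate setData method. A previous tutorial, titled AJAX with CKEditor in Spring Boot, covers CKEditor.

5. Controller

In our controller, we autowire JobLauncher and a Spring Batch job that we will create, called GenerateExcel. Autowiring these two classes lets us run the Spring Batch job GenerateExcel on demand whenever a POST request is sent to "/export".

Another thing to note is that, to ensure the Spring Batch job can run more than once, we include a unique parameter with this code: addLong("uniqueness", System.nanoTime()).toJobParameters(). An error may occur if we do not include a unique parameter, because only unique JobInstances can be created and executed; without it, Spring Batch has no way of distinguishing the first JobInstance from the second.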

@Controller
public class WebController {

    private String currentContent;

    @Autowired
    JobLauncher jobLauncher;

    @Autowired
    GenerateExcel exceljob;

    @GetMapping("/")
    public ModelAndView getHome() {
        ModelAndView modelAndView = new ModelAndView("index");
        return modelAndView;
    }

    @PostMapping("/export")
    public String postTheFile(@RequestBody String body, RedirectAttributes redirectAttributes, Model model)
            throws IOException, JobExecutionAlreadyRunningException, JobRestartException,
            JobInstanceAlreadyCompleteException, JobParametersInvalidException {
        setCurrentContent(body);
        Job job = exceljob.ExcelGenerator();
        jobLauncher.run(job,
                new JobParametersBuilder().addLong("uniqueness", System.nanoTime()).toJobParameters());
        return "redirect:/";
    }

    // standard getters and setters
}

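6. Batch Job

In step 1 of our batch job, we call the getCurrentContent() method to get the content that was passed into the Thymeleaf form, create a new XSSFWorkbook, specify an arbitrary Microsoft Excel sheet tab name, and then pass all three variables into the createWorkSheet method, which we cover in the next step of this tutorial: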

@Configuration
@EnableBatchProcessing
@Lazy
public class GenerateExcel {

    List<String> docIds = new ArrayList<String>();

    @Autowired
    private JobBuilderFactory jobBuilderFactory;

    @Autowired
    private StepBuilderFactory stepBuilderFactory;

    @Autowired
    WebController webcontroller;

    @Autowired
    CreateWorksheet createexcel;

    @Bean
    public Step step1() {
        return stepBuilderFactory.get("step1").tasklet(new Tasklet() {
            @Override
            public RepeatStatus execute(StepContribution stepContribution, ChunkContext chunkContext)
                    throws Exception, JSONException {
                String content = webcontroller.getCurrentContent();
                System.out.println("content is ::" + content);
                Workbook wb = new XSSFWorkbook();
                String tabName = "some";
                createexcel.createWorkSheet(wb, content, tabName);
                return RepeatStatus.FINISHED;
            }
        }).build();
    }

    @Bean
    public Job ExcelGenerator() {
        return jobBuilderFactory.get("ExcelGenerator").start(step1()).build();
    }
}

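We have covered Spring Batch in other tutorials, such as Converting XML to JSON + Spring Batch and Spring Batch CSV Processing.

7. Excel Creation Service

We use a variety of classes to create our Microsoft Excel file. Order matters when converting HTML to RichText, so that will be the focus.

7.1 RichTextDetails

A class with two fields: a String whose contents will become RichText, and a font map.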

public class RichTextDetails {

    private String richText;
    private Map<Integer, Font> fontMap;

    // standard getters and setters

    @Override
    public int hashCode() {
        // The goal is to have a more efficient hashcode than standard one.
        return richText.hashCode();
    }
}

7.2 RichTextInfo

A POJO that keeps track of the location of the RichText, among other things:

public class RichTextInfo {

    private int startIndex;
    private int endIndex;
    private STYLES fontStyle;
    private String fontValue;

    // standard getters and setters, and the like
}

7.3 Styles

An enum containing the HTML tags that we want to process. We can add to it as needed:

public enum STYLES {
    BOLD("b"), EM("em"), STRONG("strong"), COLOR("color"), UNDERLINE("u"),
    SPAN("span"), ITALLICS("i"), UNKNOWN("unknown"), PRE("pre");

    // standard getters and setters
}

7.4 TagInfo

A POJO to keep track of tag info:

public class TagInfo {

    private String tagName;
    private String style;
    private int tagType;

    // standard getters and setters
}

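7.5 HTML to RichText

This is not a small class, so let's break it down by method.

Essentially, we surround any arbitrary HTML with a div tag, so we know what we are looking for. We then look for all elements within the div tag, add each one to an ArrayList of RichTextDetails, and pass the whole ArrayList to the mergeTextDetails method. mergeTextDetails returns a RichTextString, which is what we need to set a cell value: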

public RichTextString fromHtmlToCellValue(String html, Workbook workBook) {
    Config.IsHTMLEmptyElementTagRecognised = true;

    Matcher m = HEAVY_REGEX.matcher(html);
    String replacedhtml = m.replaceAll("");
    StringBuilder sb = new StringBuilder();
    sb.insert(0, "<div>");
    sb.append(replacedhtml);
    sb.append("</div>");
    String newhtml = sb.toString();
    Source source = new Source(newhtml);
    List<RichTextDetails> cellValues = new ArrayList<RichTextDetails>();
    for (Element el : source.getAllElements("div")) {
        cellValues.add(createCellValue(el.toString(), workBook));
    }
    RichTextString cellValue = mergeTextDetails(cellValues);

    return cellValue;
}

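As we saw above, we pass an ArrayList of RichTextDetails into this method. Jericho has a setting that takes a boolean value to recognize empty element tags such as <br/>: Config.IsHTMLEmptyElementTagRecognised. This can be important when dealing with online rich text editors, so we set it to true. Because we need to keep track of the order of the elements, we use a LinkedHashMap instead of a HashMap.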

private static RichTextString mergeTextDetails(List<RichTextDetails> cellValues) {
    Config.IsHTMLEmptyElementTagRecognised = true;
    StringBuilder textBuffer = new StringBuilder();
    Map<Integer, Font> mergedMap = new LinkedHashMap<Integer, Font>(550, .95f);
    int currentIndex = 0;
    for (RichTextDetails richTextDetail : cellValues) {
        //textBuffer.append(BULLET_CHARACTER + " ");
        currentIndex = textBuffer.length();
        for (Entry<Integer, Font> entry : richTextDetail.getFontMap().entrySet()) {
            mergedMap.put(entry.getKey() + currentIndex, entry.getValue());
        }
        textBuffer.append(richTextDetail.getRichText()).append(NEW_LINE);
    }

    RichTextString richText = new XSSFRichTextString(textBuffer.toString());
    for (int i = 0; i < textBuffer.length(); i++) {
        Font currentFont = mergedMap.get(i);
        if (currentFont != null) {
            richText.applyFont(i, i + 1, currentFont);
        }
    }
    return richText;
}
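The index shifting in mergeTextDetails can be tried without POI. The following is a minimal sketch, not the tutorial's code: Segment is a hypothetical stand-in for RichTextDetails, and a String label takes the place of a POI Font. It shows how each segment's character indices are offset by the length of the text already appended, and how LinkedHashMap keeps the entries in insertion order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MergeSketch {

    // Hypothetical stand-in for RichTextDetails: plain text plus a map from
    // character index to a style label (a String here instead of a POI Font).
    static class Segment {
        final String text;
        final Map<Integer, String> styles;

        Segment(String text, Map<Integer, String> styles) {
            this.text = text;
            this.styles = styles;
        }
    }

    // Appends every segment to 'out' and shifts each segment's indices by
    // the length of the text that was appended before it.
    static Map<Integer, String> mergeStyles(List<Segment> segments, StringBuilder out) {
        Map<Integer, String> merged = new LinkedHashMap<>();
        for (Segment s : segments) {
            int offset = out.length();
            for (Map.Entry<Integer, String> e : s.styles.entrySet()) {
                merged.put(e.getKey() + offset, e.getValue());
            }
            out.append(s.text).append('\n');
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<Integer, String> first = new LinkedHashMap<>();
        first.put(0, "bold");
        Map<Integer, String> second = new LinkedHashMap<>();
        second.put(1, "italic");

        List<Segment> segments = new ArrayList<>();
        segments.add(new Segment("ab", first));
        segments.add(new Segment("cd", second));

        StringBuilder text = new StringBuilder();
        System.out.println(mergeStyles(segments, text)); // {0=bold, 4=italic}
    }
}
```

Index 1 of the second segment becomes index 4 because "ab" plus the newline separator already occupy positions 0 to 2.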

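As mentioned above, we use Java 9 so that we can pass a StringBuilder to java.util.regex.Matcher.appendReplacement. Why? Because StringBuffer is slower than StringBuilder: its methods are synchronized for thread safety, and that synchronization costs time.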
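As a small illustration of that API, the sketch below strips tags from a string with appendReplacement and appendTail writing into a StringBuilder. The TAG pattern is a simplified assumption made for this example; the tutorial instead builds its pattern from the tags that Jericho finds.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StripTagsSketch {

    // Simplified tag pattern, for illustration only.
    private static final Pattern TAG = Pattern.compile("</?[a-zA-Z][^>]*>");

    static String stripTags(String html) {
        Matcher m = TAG.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            // Copies the text before the match, then the (empty) replacement.
            // The StringBuilder overload requires Java 9 or later.
            m.appendReplacement(out, "");
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("Hello <b>bold</b> world")); // Hello bold world
    }
}
```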

We use a Deque instead of a Stack because the Deque interface provides a more complete and consistent set of LIFO stack operations:

static RichTextDetails createCellValue(String html, Workbook workBook) {
    Config.IsHTMLEmptyElementTagRecognised = true;
    Source source = new Source(html);
    Map<String, TagInfo> tagMap = new LinkedHashMap<String, TagInfo>(550, .95f);
    for (Element e : source.getChildElements()) {
        getInfo(e, tagMap);
    }

    StringBuilder sbPatt = new StringBuilder();
    sbPatt.append("(").append(StringUtils.join(tagMap.keySet(), "|")).append(")");
    String patternString = sbPatt.toString();
    Pattern pattern = Pattern.compile(patternString);
    Matcher matcher = pattern.matcher(html);

    StringBuilder textBuffer = new StringBuilder();
    List<RichTextInfo> textInfos = new ArrayList<RichTextInfo>();
    ArrayDeque<RichTextInfo> richTextBuffer = new ArrayDeque<RichTextInfo>();
    while (matcher.find()) {
        matcher.appendReplacement(textBuffer, "");
        TagInfo currentTag = tagMap.get(matcher.group(1));
        if (START_TAG == currentTag.getTagType()) {
            richTextBuffer.push(getRichTextInfo(currentTag, textBuffer.length(), workBook));
        } else {
            if (!richTextBuffer.isEmpty()) {
                RichTextInfo info = richTextBuffer.pop();
                if (info != null) {
                    info.setEndIndex(textBuffer.length());
                    textInfos.add(info);
                }
            }
        }
    }
    matcher.appendTail(textBuffer);
    Map<Integer, Font> fontMap = buildFontMap(textInfos, workBook);

    return new RichTextDetails(textBuffer.toString(), fontMap);
}
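The pairing of start and end tags in createCellValue can be reduced to a library-free sketch. It assumes simple tags without attributes and uses a hypothetical Open holder in place of RichTextInfo. A start tag pushes its position in the tag-free text, and an end tag pops the most recent entry and closes its range. As in the method above, the end tag's name is not checked against the popped entry.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagSpanSketch {

    // Matches simple start and end tags without attributes, e.g. <b> or </em>.
    private static final Pattern TAG = Pattern.compile("<(/?)([a-zA-Z]+)>");

    // Hypothetical holder for a start tag that is still open.
    static class Open {
        final String tag;
        final int start;

        Open(String tag, int start) {
            this.tag = tag;
            this.start = start;
        }
    }

    // Returns entries of the form "tag:start-end", with indices measured in
    // the text that remains once the tags are removed.
    static List<String> spans(String html) {
        Matcher m = TAG.matcher(html);
        StringBuilder text = new StringBuilder();
        Deque<Open> open = new ArrayDeque<>();
        List<String> result = new ArrayList<>();
        while (m.find()) {
            m.appendReplacement(text, ""); // Java 9+ overload
            boolean isEndTag = !m.group(1).isEmpty();
            if (!isEndTag) {
                open.push(new Open(m.group(2), text.length()));
            } else if (!open.isEmpty()) {
                Open o = open.pop(); // LIFO: the innermost open tag closes first
                result.add(o.tag + ":" + o.start + "-" + text.length());
            }
        }
        m.appendTail(text);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(spans("a<b>bc<i>d</i></b>e")); // [i:3-4, b:1-4]
    }
}
```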

Here we can see where RichTextInfo comes into use:

private static Map<Integer, Font> buildFontMap(List<RichTextInfo> textInfos, Workbook workBook) {
    Map<Integer, Font> fontMap = new LinkedHashMap<Integer, Font>(550, .95f);

    for (RichTextInfo richTextInfo : textInfos) {
        if (richTextInfo.isValid()) {
            for (int i = richTextInfo.getStartIndex(); i < richTextInfo.getEndIndex(); i++) {
                fontMap.put(i, mergeFont(fontMap.get(i), richTextInfo.getFontStyle(),
                        richTextInfo.getFontValue(), workBook));
            }
        }
    }

    return fontMap;
}
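The per-character merging done by buildFontMap can be sketched with plain collections. In this illustration a hypothetical Style enum and an EnumSet stand in for the POI Font that mergeFont mutates. Every index in a range picks up the range's style, and overlapping ranges accumulate styles on the shared indices.

```java
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class FontMapSketch {

    // Hypothetical style set used in place of a POI Font.
    enum Style { BOLD, ITALIC, UNDERLINE }

    // Adds the style to every index in [start, end), merging with any
    // styles already recorded for that index.
    static void apply(Map<Integer, Set<Style>> map, int start, int end, Style style) {
        for (int i = start; i < end; i++) {
            map.computeIfAbsent(i, k -> EnumSet.noneOf(Style.class)).add(style);
        }
    }

    public static void main(String[] args) {
        Map<Integer, Set<Style>> map = new LinkedHashMap<>();
        apply(map, 0, 3, Style.BOLD);
        apply(map, 2, 4, Style.ITALIC);
        System.out.println(map);
        // {0=[BOLD], 1=[BOLD], 2=[BOLD, ITALIC], 3=[ITALIC]}
    }
}
```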

This is where we use the STYLES enum:

private static Font mergeFont(Font font, STYLES fontStyle, String fontValue, Workbook workBook) {
    if (font == null) {
        font = workBook.createFont();
    }

    switch (fontStyle) {
    case BOLD:
    case EM:
    case STRONG:
        // setBoldweight is deprecated in POI 3.15 in favor of setBold(true)
        font.setBoldweight(Font.BOLDWEIGHT_BOLD);
        break;
    case UNDERLINE:
        font.setUnderline(Font.U_SINGLE);
        break;
    case ITALLICS:
        font.setItalic(true);
        break;
    case PRE:
        font.setFontName("Courier New");
        break; // without this break, PRE fell through into the COLOR case
    case COLOR:
        if (!isEmpty(fontValue)) {
            font.setColor(IndexedColors.BLACK.getIndex());
        }
        break;
    default:
        break;
    }

    return font;
}

We use the TagInfo class to keep track of the current tag:

private static RichTextInfo getRichTextInfo(TagInfo currentTag, int startIndex, Workbook workBook) {
    RichTextInfo info = null;
    switch (STYLES.fromValue(currentTag.getTagName())) {
    case SPAN:
        if (!isEmpty(currentTag.getStyle())) {
            for (String style : currentTag.getStyle().split(";")) {
                String[] styleDetails = style.split(":");
                if (styleDetails != null && styleDetails.length > 1) {
                    if ("COLOR".equalsIgnoreCase(styleDetails[0].trim())) {
                        info = new RichTextInfo(startIndex, -1, STYLES.COLOR, styleDetails[1]);
                    }
                }
            }
        }
        break;
    default:
        info = new RichTextInfo(startIndex, -1, STYLES.fromValue(currentTag.getTagName()));
        break;
    }
    return info;
}
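The inline style parsing in the SPAN branch can be isolated as follows. This is a sketch rather than the tutorial's code: colorOf is a hypothetical helper, it returns the first color declaration it finds, and it trims the value, whereas the method above keeps the last match and leaves the value untrimmed.

```java
import java.util.Optional;

class StyleAttributeSketch {

    // Returns the value of the "color" declaration in an inline style
    // attribute, if one is present.
    static Optional<String> colorOf(String style) {
        if (style == null || style.isEmpty()) {
            return Optional.empty();
        }
        for (String declaration : style.split(";")) {
            // Limit of 2 keeps any further ':' characters inside the value.
            String[] parts = declaration.split(":", 2);
            if (parts.length == 2 && "color".equalsIgnoreCase(parts[0].trim())) {
                return Optional.of(parts[1].trim());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(colorOf("font-size: 12px; color: #ff0000")); // Optional[#ff0000]
        System.out.println(colorOf("background-color: red"));           // Optional.empty
    }
}
```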

We process the HTML tags:

private static void getInfo(Element e, Map<String, TagInfo> tagMap) {
    tagMap.put(e.getStartTag().toString(),
            new TagInfo(e.getStartTag().getName(), e.getAttributeValue("style"), START_TAG));
    if (e.getChildElements().size() > 0) {
        List<Element> children = e.getChildElements();
        for (Element child : children) {
            getInfo(child, tagMap);
        }
    }
    if (e.getEndTag() != null) {
        tagMap.put(e.getEndTag().toString(),
                new TagInfo(e.getEndTag().getName(), END_TAG));
    } else {
        // Handling self closing tags
        tagMap.put(e.getStartTag().toString(),
                new TagInfo(e.getStartTag().getName(), END_TAG));
    }
}

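7.6 Create Worksheet

Using StringBuilder, I create a String that will be written to the FileOutputStream. In a real application this should be user defined. I append the folder path and the file name on two separate lines. Please change the file path to your own.

sheet.createRow(0) creates a row on the very first line, and dataRow.createCell(0) creates a cell in column A of that row.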

public void createWorkSheet(Workbook wb, String content, String tabName) {
    StringBuilder sbFileName = new StringBuilder();
    sbFileName.append("/Users/mike/javaSTS/michaelcgood-apache-poi-richtext/");
    sbFileName.append("myfile.xlsx");
    String fileMacTest = sbFileName.toString();
    try {
        this.fileOut = new FileOutputStream(fileMacTest);
    } catch (FileNotFoundException ex) {
        Logger.getLogger(CreateWorksheet.class.getName()).log(Level.SEVERE, null, ex);
    }

    Sheet sheet = wb.createSheet(tabName); // Create new sheet w/ Tab name

    sheet.setZoom(85); // Set sheet zoom: 85%

    // content rich text
    RichTextString contentRich = null;
    if (content != null) {
        contentRich = htmlToExcel.fromHtmlToCellValue(content, wb);
    }

    // begin insertion of values into cells
    Row dataRow = sheet.createRow(0);
    Cell A = dataRow.createCell(0); // Row Number
    A.setCellValue(contentRich);
    sheet.autoSizeColumn(0);

    try {
        // Write the output to a file
        wb.write(fileOut);
        fileOut.close();
    } catch (IOException ex) {
        Logger.getLogger(CreateWorksheet.class.getName()).log(Level.SEVERE, null, ex);
    }
}
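Since the output location should be user defined in a real application, one possible shape for the file handling is sketched below. It is an assumption-laden illustration, not the tutorial's code: StreamWriter is a hypothetical interface standing in for anything that writes itself to a stream (a POI Workbook does so through write(OutputStream)), and the directory is passed in rather than hard-coded. try-with-resources closes the stream even when writing fails.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class OutputSketch {

    // Hypothetical stand-in for an object that can write itself to a stream.
    interface StreamWriter {
        void write(OutputStream out) throws IOException;
    }

    // Resolves the file inside a caller-supplied directory and writes to it;
    // the stream is closed automatically, even if write() throws.
    static Path save(Path directory, String fileName, StreamWriter writer) throws IOException {
        Path target = directory.resolve(fileName);
        try (OutputStream out = Files.newOutputStream(target)) {
            writer.write(out);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("poi-sketch");
        Path file = save(dir, "myfile.xlsx",
                out -> out.write("placeholder".getBytes(StandardCharsets.UTF_8)));
        System.out.println("Wrote " + Files.size(file) + " bytes to " + file);
    }
}
```

With a real workbook, the lambda could be replaced by a method reference such as wb::write.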

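8. Demo

We visit localhost:8080.

We enter some text containing HTML markup.

We open the Excel file and see the RichText we created.

9. Conclusion

We can see that converting HTML to Apache POI's RichTextString class is not trivial. However, for business applications, converting HTML to RichTextString can be essential, because readability matters in Microsoft Excel files. There is likely room to improve the performance of the application we built, but we have covered the foundation of building such an application.

The full source code is available on Github.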

Translated from: https://www.javacodegeeks.com/2018/01/converting-html-richtextstring-apache-poi.html
