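Using the jsoup crawler library in Java to scrape web page text and images and write them to a Word document

Background:

A friend recently asked me to help write a small feature. The requirement, roughly: given 10,000 links, scrape a particular passage of the article behind each link along with one image, and write every five links into one Word document.

The basic idea is to first pick a crawler library and get the text and images of a linked page written to a Word file, then split the 10,000 links into groups by remainder (index modulo the number of threads) and start a few threads to do the writing.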
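The grouping step can be sketched with the standard library alone. This is a minimal sketch, not the code actually used for the project: the class name `LinkGrouping`, both method names, the `example.com` URLs, and the worker count of 4 are assumptions made for illustration. It first cuts the links into batches of five (one batch per Word document) and then hands batch `k` to worker `k % workerCount`.

```java
import java.util.ArrayList;
import java.util.List;

public class LinkGrouping {

    /** Splits links into consecutive batches of batchSize; each batch becomes one document. */
    public static List<List<String>> toBatches(List<String> links, int batchSize) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < links.size(); i += batchSize) {
            int end = Math.min(i + batchSize, links.size());
            batches.add(new ArrayList<>(links.subList(i, end)));
        }
        return batches;
    }

    /** Assigns item k to worker k % workerCount, so each worker gets a near-equal share. */
    public static <T> List<List<T>> assignByModulo(List<T> items, int workerCount) {
        List<List<T>> perWorker = new ArrayList<>();
        for (int w = 0; w < workerCount; w++) {
            perWorker.add(new ArrayList<>());
        }
        for (int k = 0; k < items.size(); k++) {
            perWorker.get(k % workerCount).add(items.get(k));
        }
        return perWorker;
    }

    public static void main(String[] args) {
        List<String> links = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            links.add("https://example.com/article/" + i); // placeholder URLs, not real data
        }
        List<List<String>> batches = toBatches(links, 5);
        List<List<List<String>>> plan = assignByModulo(batches, 4); // 4 workers is an assumption
        System.out.println(batches.size() + " documents, "
                + plan.get(0).size() + " for the first worker");
    }
}
```

Batching before assigning keeps the five links of one document on the same thread, so no two threads ever write to the same file.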

1. Add the Maven dependency

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.12.1</version>
</dependency>

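2. Write a unit test

Take a simple Baidu page and write a small demo program to try out what the library can do (the test link is the URL used in the code).

The test code is as follows: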

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

@org.junit.Test
public void testJsoup() {
    try {
        String allUrl = "https://mbd.baidu.com/newspage/data/landingshare?context=%7B%22nid%22%3A%22news_9881067036128581241%22%2C%22sourceFrom%22%3A%22bjh%22%7D";
        Document docAll = Jsoup.connect(allUrl)
                .data("query", "Java")
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36")
                .get();
        // Title
        Elements title = docAll.getElementsByTag("title");
        System.out.println(title.text());
        // Paragraph text
        Elements contexts = docAll.getElementsByTag("p");
        StringBuilder context = new StringBuilder();
        for (Element element : contexts) {
            context.append(element.text());
        }
        System.out.println(context);
        // Images
        Elements img = docAll.getElementsByTag("img");
        for (Element image : img) {
            // Image URL
            String src = image.attr("src");
            System.out.println(src);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}

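Running it shows that the returned Document is the content of the whole page. For details on parsing that content, the jsoup website has thorough documentation; here I only do some simple parsing.

3. Reading and downloading images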

// Images
Elements img = docAll.getElementsByTag("img");
int i = 1;
for (Element image : img) {
    // Image URL; absUrl resolves a relative src against the page URL
    String src = image.absUrl("src");
    if (src.isEmpty()) {
        continue; // no usable src on this tag
    }
    File file = new File("D://" + i + ".JPEG");
    // try-with-resources closes both streams even if the download fails
    try (BufferedInputStream in = Jsoup.connect(src).ignoreContentType(true).execute().bodyStream();
         FileOutputStream fo = new FileOutputStream(file)) {
        byte[] buf = new byte[1024];
        int length;
        while ((length = in.read(buf, 0, buf.length)) != -1) {
            fo.write(buf, 0, length);
        }
    }
    System.out.println(src + " downloaded");
    i++;
}

Downloading the images in the test works without problems as well.
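The read/write loop used for the download can be factored into a small helper so that the same code serves both for saving to a file and for collecting the bytes in memory. This is a minimal sketch using only the standard library; the class name `StreamCopy` and its method names are assumptions for illustration, and a `ByteArrayInputStream` stands in for the response body so the example runs without network access.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class StreamCopy {

    /** Copies everything from in to out through a fixed buffer; returns the byte count. */
    public static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[1024];
        long total = 0;
        int length;
        while ((length = in.read(buf, 0, buf.length)) != -1) {
            out.write(buf, 0, length);
            total += length;
        }
        return total;
    }

    /** Reads the whole stream into a byte array, e.g. to pass image bytes to another API. */
    public static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        copy(in, out);
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] fakeImage = new byte[3000]; // stand-in for a downloaded image body
        try (InputStream in = new ByteArrayInputStream(fakeImage)) {
            System.out.println(readAll(in).length + " bytes read");
        }
    }
}
```

With such a helper, the download becomes `copy(in, fo)`, and the in-memory variant returns a byte array without a second hand-written loop.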

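4. Finally, write the scraped content to a Word document

Writing text into the document is fairly easy; the tricky part is inserting images. After looking up some material and running various tests, I found a very simple tool.

(Reference link attached.)

5. Add the tool's dependencies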

<!-- https://mvnrepository.com/artifact/com.lowagie/itext -->
<dependency>
    <groupId>com.lowagie</groupId>
    <artifactId>itext</artifactId>
    <version>2.1.7</version>
</dependency>
<!-- https://mvnrepository.com/artifact/com.lowagie/itext-rtf -->
<dependency>
    <groupId>com.lowagie</groupId>
    <artifactId>itext-rtf</artifactId>
    <version>2.1.7</version>
</dependency>
<!-- https://mvnrepository.com/artifact/com.itextpdf/itext-asian -->
<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itext-asian</artifactId>
    <version>5.2.0</version>
</dependency>

The test code is as follows:

import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;

import com.lowagie.text.Font;
import com.lowagie.text.Image;
import com.lowagie.text.PageSize;
import com.lowagie.text.Paragraph;
import com.lowagie.text.pdf.BaseFont;
import com.lowagie.text.rtf.RtfWriter2;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

@org.junit.Test
public void test() throws Exception {
    String filePath = "D:\\a.doc";
    try {
        String allUrl = "https://mbd.baidu.com/newspage/data/landingshare?context=%7B%22nid%22%3A%22news_9881067036128581241%22%2C%22sourceFrom%22%3A%22bjh%22%7D";
        Document docAll = Jsoup.connect(allUrl)
                .data("query", "Java")
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36")
                .cookie("auth", "token")
                .timeout(3000)
                .get();

        // Set the page size
        com.lowagie.text.Document document = new com.lowagie.text.Document(PageSize.A4);
        // Attach a writer to the document; the writer saves the document to disk
        File file = new File(filePath);
        RtfWriter2.getInstance(document, new FileOutputStream(file));
        document.open();

        // Font setup; note that this creates Helvetica with WinAnsi encoding
        BaseFont bfChinese = BaseFont.createFont(BaseFont.HELVETICA, BaseFont.WINANSI, BaseFont.NOT_EMBEDDED);
        // Title font style
        Font titleFont = new Font(bfChinese, 12, Font.BOLD);
        // Body font style
        // Font contextFont = new Font(bfChinese, 10, Font.NORMAL);

        // Title
        String webTitle = docAll.getElementsByTag("title").text();
        // Body text
        Elements contexts = docAll.getElementsByTag("p");
        StringBuilder contextString = new StringBuilder();
        for (Element context : contexts) {
            contextString.append(context.text());
        }

        Paragraph title = new Paragraph(webTitle);
        // Center the title
        title.setAlignment(com.lowagie.text.Element.ALIGN_CENTER);
        // title.setFont(titleFont);
        document.add(title);

        // Body paragraph
        Paragraph context = new Paragraph(contextString.toString());
        // Left-align the body
        context.setAlignment(com.lowagie.text.Element.ALIGN_LEFT);
        // context.setFont(contextFont);
        // Spacing before the body, i.e. after the title paragraph
        context.setSpacingBefore(5);
        // First-line indent
        context.setFirstLineIndent(20);
        document.add(context);

        // FontFactory combined with Font and Color can produce many other styles:
        // Paragraph underline = new Paragraph("Underline example", FontFactory.getFont(
        //         FontFactory.HELVETICA_BOLDOBLIQUE, 18, Font.UNDERLINE, new Color(0, 0, 255)));
        // document.add(underline);

        // Images: Image.getInstance accepts either a path or a byte array
        Elements imgs = docAll.getElementsByTag("img");
        for (Element image : imgs) {
            // Image URL; absUrl resolves a relative src against the page URL
            String src = image.absUrl("src");
            if (src.isEmpty()) {
                continue; // no usable src on this tag
            }
            BufferedInputStream in = Jsoup.connect(src).ignoreContentType(true).execute().bodyStream();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            int length;
            while ((length = in.read(buf, 0, buf.length)) != -1) {
                out.write(buf, 0, length);
            }
            Image img = Image.getInstance(out.toByteArray());
            img.setAbsolutePosition(0, 0);
            img.setAlignment(Image.LEFT); // image placement
            // img.scaleAbsolute(60, 60); // set the display size directly
            // img.scalePercent(50);      // display at 50% of the original size
            // img.scalePercent(25, 12);  // scale the two dimensions separately
            // img.setRotation(30);       // rotate the image
            document.add(img);
            in.close();
            out.close();
        }
        document.close();

        // Alternative way to fetch an image, sending a referrer:
        // Connection referrer = Jsoup.connect(src).referrer(src);
        // referrer.ignoreContentType(true);
        // Connection.Response execute = referrer.execute();
        // BufferedInputStream in = execute.bodyStream();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

With the basic tests done, implementing the rest of the feature is straightforward.
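For the remaining step, one possible shape of the batch driver is sketched below. It is an assumption, not the code that was actually written: it uses a fixed thread pool from `java.util.concurrent` instead of hand-assigning groups to threads by remainder, and the class name `BatchRunner`, the `doc_N.doc` naming scheme, and the `writer` callback are all hypothetical. In real use the callback would contain the jsoup and iText code from the tests above; here it only prints, so the sketch runs on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BiConsumer;

public class BatchRunner {

    /**
     * Splits links into groups of linksPerDoc and calls writer once per group on a fixed pool.
     * writer receives the output file name and the links of that group.
     * Returns the number of groups that finished without an exception.
     */
    public static int run(List<String> links, int linksPerDoc, int threads,
                          BiConsumer<String, List<String>> writer) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger done = new AtomicInteger();
        int docIndex = 0;
        for (int i = 0; i < links.size(); i += linksPerDoc) {
            int end = Math.min(i + linksPerDoc, links.size());
            List<String> group = new ArrayList<>(links.subList(i, end));
            String fileName = "doc_" + (docIndex++) + ".doc"; // hypothetical naming scheme
            pool.submit(() -> {
                try {
                    writer.accept(fileName, group);
                    done.incrementAndGet();
                } catch (RuntimeException e) {
                    // One failing page should not stop the other documents.
                    System.err.println(fileName + " failed: " + e.getMessage());
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS); // generous upper bound, chosen arbitrarily
        return done.get();
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> links = new ArrayList<>();
        for (int i = 0; i < 23; i++) {
            links.add("https://example.com/article/" + i); // placeholder URLs
        }
        int written = run(links, 5, 4, (file, group) ->
                System.out.println(file + " <- " + group.size() + " links"));
        System.out.println(written + " documents written");
    }
}
```

Because each task owns exactly one output file, the worker threads need no locking around the document writer.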
