jsoup is an open source Java HTML parser that we can use to parse HTML and extract useful information. You can also think of jsoup as a web page scraping tool for the Java programming language.

jsoup

The jsoup API can fetch HTML from a URL, or parse it from an HTML string or an HTML file.

Some of the cool features of the jsoup API are:

  1. Scrape HTML from a URL, or read it from a String or a file.
  2. Extract data from HTML through DOM-based traversal or CSS-like selectors.
  3. The jsoup API can be used to edit HTML too.
  4. The jsoup API is self-contained; we don't need any other jars to use it.

You can download the jsoup jar from its website or, if you are using Maven, add the dependency below.

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.8.1</version>
</dependency>

Let's look into the different jsoup features one by one.

jsoup example to load HTML document from URL

We can do it with a one-liner, as shown below.

org.jsoup.nodes.Document doc = org.jsoup.Jsoup.connect("https://www.journaldev.com").get();
System.out.println(doc.html()); // prints HTML data

jsoup example to parse HTML document from String

If we have the HTML data as a String, we can use the code below to parse it.

String source = "<html><head><title>Jsoup Example</title></head>"
        + "<body><h1>Welcome to JournalDev!!</h1><br />"
        + "</body></html>";
Document doc = Jsoup.parse(source);
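Once the string is parsed, the resulting Document can be queried right away. The following is a minimal sketch, assuming the same sample markup as above; the class name JsoupParseString and the helper parseSample() are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupParseString {

    // Parses the sample markup used in this section and returns the Document.
    static Document parseSample() {
        String source = "<html><head><title>Jsoup Example</title></head>"
                + "<body><h1>Welcome to JournalDev!!</h1><br />"
                + "</body></html>";
        return Jsoup.parse(source);
    }

    public static void main(String[] args) {
        Document doc = parseSample();
        System.out.println("Title: " + doc.title());             // text of the <title> element
        System.out.println("H1: " + doc.select("h1").text());    // text of the <h1> element
    }
}
```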

jsoup example to load a document from file

If the HTML data is saved in a file, we can load it using the code below.

Document doc = Jsoup.parse(new File("data.html"), "UTF-8");
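To try this without preparing data.html by hand, the sketch below writes a small page to a temporary file and parses it back. The temporary file, its contents, and the class name JsoupParseFile are assumptions for illustration only; the second argument of Jsoup.parse is the charset used to decode the file.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupParseFile {

    // Writes a tiny HTML page to a temporary file, parses it, and returns the title.
    static String titleFromTempFile() {
        try {
            File file = File.createTempFile("jsoup-demo", ".html"); // hypothetical file
            file.deleteOnExit();
            String html = "<html><head><title>From File</title></head>"
                    + "<body><p>Hello</p></body></html>";
            Files.write(file.toPath(), html.getBytes(StandardCharsets.UTF_8));

            // Decode the file as UTF-8 and build the Document.
            Document doc = Jsoup.parse(file, "UTF-8");
            return doc.title();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(titleFromTempFile());
    }
}
```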

Parsing HTML Body Fragment

One of the best features of jsoup is that if we supply only a fragment of an HTML body, it tries hard to generate valid HTML for us, as shown in the example below.

String html = "<div><p>Test Data</p>";
Document doc1 = Jsoup.parseBodyFragment(html);
System.out.println(doc1.html());

The above code prints the following HTML.

<html><head></head><body><div><p>Test Data</p></div></body>
</html>
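One way to confirm that the unclosed div was repaired is to query the parsed fragment with a selector that only matches when the p element is nested inside the div. This is a small sketch; the class and method names are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupBodyFragment {

    // Parses a body fragment and returns the text of any <p> that is a
    // direct child of a <div> in the repaired document.
    static String nestedParagraphText(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        return doc.body().select("div > p").text();
    }

    public static void main(String[] args) {
        // The closing </div> is missing on purpose.
        System.out.println(nestedParagraphText("<div><p>Test Data</p>"));
    }
}
```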

Let's now look at the different methods to extract data from HTML.

Jsoup DOM Methods

jsoup parses HTML into a Document, much like the HTML DOM. A document consists of different elements, and there are many useful methods that we can use to find them. Some of these methods in Document are:

  1. getElementById(String id)
  2. getElementsByTag(String tag)
  3. getElementsByClass(String className)
  4. getElementsByAttribute(String key)
  5. siblingElements(), firstElementSibling(), lastElementSibling() etc.
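The lookup methods listed above can be tried on a small in-memory page. The markup, ids, and class names below are made up for the sketch; only the jsoup method names come from the list.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupDomLookup {

    // Hypothetical sample markup: one paragraph and two links inside a div.
    static final String HTML = "<div id=\"content\">"
            + "<p class=\"intro\">Hi</p>"
            + "<a href=\"/one\">One</a>"
            + "<a href=\"/two\" title=\"second\">Two</a>"
            + "</div>";

    // Looks up the container element by its id.
    static Element content() {
        return Jsoup.parse(HTML).getElementById("content");
    }

    public static void main(String[] args) {
        Element content = content();

        Elements links = content.getElementsByTag("a");
        System.out.println("links: " + links.size());

        System.out.println("intro: " + content.getElementsByClass("intro").text());
        System.out.println("with title attribute: "
                + content.getElementsByAttribute("title").size());

        // Sibling navigation starts from an element, here the first link.
        Element firstLink = links.first();
        System.out.println("first sibling tag: " + firstLink.firstElementSibling().tagName());
        System.out.println("last sibling text: " + firstLink.lastElementSibling().text());
    }
}
```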

An element has different attributes, so we have some methods for element data too.

  1. attr(String key) to get and attr(String key, String value) to set attributes
  2. id(), className() and classNames()
  3. text() to get and text(String value) to set the text content
  4. html() to get and html(String value) to set the inner HTML content
  5. tag() and tagName()
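A short sketch of these accessors on a single link follows. The link markup and the class name JsoupElementData are assumptions chosen for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class JsoupElementData {

    // Builds a hypothetical link element to read data from.
    static Element sampleLink() {
        String html = "<a id=\"home\" class=\"nav main\" href=\"/home\">Go <b>home</b></a>";
        return Jsoup.parseBodyFragment(html).getElementById("home");
    }

    public static void main(String[] args) {
        Element link = sampleLink();

        // Getters
        System.out.println("href: " + link.attr("href"));
        System.out.println("id: " + link.id());
        System.out.println("class attribute: " + link.className());
        System.out.println("class names: " + link.classNames());
        System.out.println("text: " + link.text());
        System.out.println("inner html: " + link.html());
        System.out.println("tag name: " + link.tagName());

        // Setters replace the attribute value and the text content.
        link.attr("href", "/start");
        link.text("Start");
        System.out.println("after update: " + link.outerHtml());
    }
}
```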

There are some methods for manipulating HTML data as well.

  1. append(String html), prepend(String html)
  2. appendText(String text), prependText(String text)
  3. appendElement(String tagName), prependElement(String tagName)
  4. html(String value)
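The manipulation methods above can be chained on one element to see how they place new content. This is a sketch on made-up markup; the class name JsoupManipulate and the helper build() are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class JsoupManipulate {

    // Starts from <div id="box">middle</div> and adds content around the text.
    static Element build() {
        Element div = Jsoup.parseBodyFragment("<div id=\"box\">middle</div>")
                .getElementById("box");

        div.prependText("start ");            // plain text before the existing content
        div.appendText(" end");               // plain text after the existing content
        div.appendElement("span").text("tail"); // new <span> as the last child
        div.prepend("<em>head</em>");         // parsed HTML as the first child
        return div;
    }

    public static void main(String[] args) {
        System.out.println(build().outerHtml());
    }
}
```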

Below is a simple example where I use the jsoup DOM methods to parse my website's home page and list all the links.

package com.journaldev.jsoup;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupExtractLinks {

    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("https://www.journaldev.com").get();
        Element content = doc.getElementById("content");
        Elements links = content.getElementsByTag("a");
        for (Element link : links) {
            String linkHref = link.attr("href");
            String linkText = link.text();
            System.out.println("Text::" + linkText + ", URL::" + linkHref);
        }
    }
}

The above program produces the following output.

Text::jQuery Popup and Tooltip Window Animation Effects, URL::https://www.journaldev.com/6998/jquery-popup-and-tooltip-window-animation-effects
Text::Jobin Bennett, URL::https://www.journaldev.com/author/jobin
Text::March 7, 2015, URL::https://www.journaldev.com/6998/jquery-popup-and-tooltip-window-animation-effects
Text::jQuery, URL::https://www.journaldev.com/dev/jquery
Text::jQuery Plugins, URL::https://www.journaldev.com/dev/jquery/jquery-plugins
Text::Permalink, URL::https://www.journaldev.com/6998/jquery-popup-and-tooltip-window-animation-effects
Text::Apache HttpClient Example to send GET/POST HTTP Requests, URL::https://www.journaldev.com/7146/apache-httpclient-example-to-send-get-post-http-requests
Text::Pankaj, URL::https://www.journaldev.com/author/pankaj
Text::March 6, 2015, URL::https://www.journaldev.com/7146/apache-httpclient-example-to-send-get-post-http-requests
Text::Java, URL::https://www.journaldev.com/dev/java
Text::Permalink, URL::https://www.journaldev.com/7146/apache-httpclient-example-to-send-get-post-http-requests
Text::Java HttpURLConnection Example to send HTTP GET/POST Requests, URL::https://www.journaldev.com/7148/java-httpurlconnection-example-to-send-http-getpost-requests
Text::Pankaj, URL::https://www.journaldev.com/author/pankaj
Text::March 6, 2015, URL::https://www.journaldev.com/7148/java-httpurlconnection-example-to-send-http-getpost-requests
Text::Java, URL::https://www.journaldev.com/dev/java
Text::Permalink, URL::https://www.journaldev.com/7148/java-httpurlconnection-example-to-send-http-getpost-requests
Text::How to integrate Google reCAPTCHA in Java Web Application, URL::https://www.journaldev.com/7133/how-to-integrate-google-recaptcha-in-java-web-application
Text::Pankaj, URL::https://www.journaldev.com/author/pankaj
Text::March 4, 2015, URL::https://www.journaldev.com/7133/how-to-integrate-google-recaptcha-in-java-web-application
Text::Java EE, URL::https://www.journaldev.com/dev/java/j2ee
Text::Permalink, URL::https://www.journaldev.com/7133/how-to-integrate-google-recaptcha-in-java-web-application
Text::JSF Spring Hibernate Integration Example Tutorial, URL::https://www.journaldev.com/7122/jsf-spring-hibernate-integration-example-tutorial
Text::Pankaj, URL::https://www.journaldev.com/author/pankaj
Text::March 3, 2015, URL::https://www.journaldev.com/7122/jsf-spring-hibernate-integration-example-tutorial
Text::Hibernate, URL::https://www.journaldev.com/dev/hibernate
Text::JSF, URL::https://www.journaldev.com/dev/jsf
Text::Spring, URL::https://www.journaldev.com/dev/spring
Text::Permalink, URL::https://www.journaldev.com/7122/jsf-spring-hibernate-integration-example-tutorial
Text::JSF Spring Integration Example Tutorial, URL::https://www.journaldev.com/7112/spring-jsf-integration
Text::Oracle Webcenter Portal Framework Application – Modifying Home Page And Login/Logout Target Pages & Deploying Your Application Into Custom Portal Managed Server Instance, URL::https://www.journaldev.com/6938/oracle-webcenter-portal-framework-application-modifying-home-page-and-loginlogout-target-pages-deploying-your-application-into-custom-portal-managed-server-instance
Text::JSF and JDBC Integration Example Tutorial, URL::https://www.journaldev.com/7068/jsf-database-example-mysql-jdbc
Text::Count the Number of Triangles in Given Picture – Programmatic Solution, URL::https://www.journaldev.com/7064/count-the-number-of-triangles-in-given-picture-programmatic-solution
Text::JSF Expression Language (EL) Example Tutorial, URL::https://www.journaldev.com/7058/jsf-expression-language-jsf-el
Text::Read all Articles →, URL::https://www.journaldev.com/page/2

Jsoup selector syntax

We can also use CSS- or jQuery-like syntax to find and manipulate HTML elements. Both Document and Element provide a select(String cssQuery) method that we can use for this.

Some quick examples are:

  1. doc.select("a"): returns all "a" elements in the HTML.
  2. doc.select("c|if"): finds <c:if> elements.
  3. doc.select("#id1"): returns all elements with id="id1".
  4. doc.select(".cl1"): returns all elements with class="cl1".
  5. doc.select("[href]"): returns all elements that have an href attribute.

We can combine selectors too; you can find more details in the Selector API documentation.
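As a rough illustration of combining selectors, the sketch below counts matches for a few combined queries against made-up markup. The HTML, the class names post and ext, and the helper count() are assumptions for the example.

```java
import org.jsoup.Jsoup;

public class JsoupCombinedSelectors {

    // Hypothetical markup: three anchors inside div.post and one inside a paragraph.
    static final String HTML = "<div class=\"post\">"
            + "<a href=\"https://example.com/a\" class=\"ext\">A</a>"
            + "<a href=\"/local\">B</a>"
            + "<a name=\"anchor\">C</a>"
            + "</div>"
            + "<p><a href=\"https://example.com/d\">D</a></p>";

    // Returns how many elements match the given CSS query.
    static int count(String cssQuery) {
        return Jsoup.parseBodyFragment(HTML).select(cssQuery).size();
    }

    public static void main(String[] args) {
        System.out.println(count("div.post a[href]"));  // links with href inside div.post
        System.out.println(count("a[href^=https]"));    // links whose href starts with "https"
        System.out.println(count("div.post > a.ext"));  // direct child links with class "ext"
        System.out.println(count("a:not([href])"));     // anchors without an href attribute
    }
}
```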

Let's now look at an example where I fetch my Google+ author URL from my website using both the DOM and the Selector API.

package com.journaldev.jsoup;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupFindAuthor {

    public static void main(String[] args) throws IOException {
        // journaldev.com posts have author set as below
        // <div class="g-person" data-width="350" data-href="//plus.google.com/u/0/118104420597648001532"
        // data-layout="landscape" data-rel="author"></div>
        findAuthorUsingDOM();
        findAuthorUsingSelector();
    }

    private static void findAuthorUsingSelector() throws IOException {
        Document doc = Jsoup.connect("https://www.journaldev.com").get();
        Elements authors = doc.select("div.g-person"); // selector combination
        for (Element author : authors) {
            System.out.println("Selector:: Author Google+ URL::" + author.attr("data-href"));
        }
    }

    private static void findAuthorUsingDOM() throws IOException {
        Document doc = Jsoup.connect("https://www.journaldev.com").get();
        Elements authors = doc.getElementsByClass("g-person");
        for (Element author : authors) {
            System.out.println("DOM:: Author Google+ URL::" + author.attr("data-href"));
        }
    }
}

The above program prints the following output.

DOM:: Author Google+ URL:://plus.google.com/u/0/118104420597648001532
Selector:: Author Google+ URL:://plus.google.com/u/0/118104420597648001532

jsoup example to modify HTML

Let's now look at a jsoup example where I parse the input HTML and manipulate it.

package com.journaldev.jsoup;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupModifyHTML {

    public static final String SOURCE_HTML = "<html><head><title>Jsoup Example</title></head>"
            + "<body><h1>Welcome to JournalDev!!</h1><br />"
            + "<div id=\"id1\">Hello</div>"
            + "<div class=\"class1\">Pankaj</div>"
            + "<a href=\"https://journaldev.com\">Home</a>"
            + "<a href=\"https://wikipedia.org\">Wikipedia</a>"
            + "</body></html>";

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SOURCE_HTML);
        System.out.println("Title=" + doc.title());

        // let's add attribute target="_blank" to all the links
        doc.select("a[href]").attr("target", "_blank");
        // System.out.println(doc.html());

        // change div class="class1" to class2
        doc.select("div.class1").attr("class", "class2");
        // System.out.println(doc.html());

        // change the HTML value of first h1 element
        doc.select("h1").first().html("Welcome to JournalDev.com");
        doc.select("h1").first().append("!!");
        // System.out.println(doc.html());

        // let's make Home link bold
        doc.select("a[href]").first().html("<strong>Home</strong>");

        System.out.println(doc.html());
    }
}

Please look at the above program carefully to understand what modifications are made to the input HTML string, and compare it with the final document shown in the output below.

Title=Jsoup Example
<html><head><title>Jsoup Example</title></head><body><h1>Welcome to JournalDev.com!!</h1><br><div id="id1">Hello</div><div class="class2">Pankaj</div><a href="https://journaldev.com" target="_blank"><strong>Home</strong></a><a href="https://wikipedia.org" target="_blank">Wikipedia</a></body>
</html>

jsoup example to parse Google Search Page and Find out Results

Before I conclude this post, here is an example where I parse the first page of Google search results and fetch all the links.

package com.journaldev.jsoup;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ParsingGoogleSearch {

    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("https://www.google.com/search?q=java")
                .userAgent("Mozilla/5.0")
                .get();
        // System.out.println(doc.html());

        Elements resultsH3 = doc.select("h3.r > a");
        for (Element result : resultsH3) {
            String linkHref = result.attr("href");
            String linkText = result.text();
            // Skip the first 6 characters of the href and cut at the first "&".
            // The printed URLs keep a leading "="; an offset of 7 would drop it too.
            System.out.println("Text::" + linkText + ", URL::"
                    + linkHref.substring(6, linkHref.indexOf("&")));
        }
    }
}

It prints the following output.

Text::Download Free Java Software, URL::=https://java.com/download
Text::java.com: Java + You, URL::=https://www.java.com/
Text::Oracle Technology Network for Java Developers | Oracle ..., URL::=https://www.oracle.com/technetwork/java/
Text::Java (software platform) - Wikipedia, the free encyclopedia, URL::=https://en.wikipedia.org/wiki/Java_(software_platform)
Text::Java (programming language) - Wikipedia, the free encyclopedia, URL::=https://en.wikipedia.org/wiki/Java_(programming_language)
Text::Java Tutorial - TutorialsPoint, URL::=https://www.tutorialspoint.com/java/
Text::Welcome to JavaWorld.com, URL::=https://www.javaworld.com/
Text::Java.net: Welcome, URL::=https://www.java.net/
Text::News for java, URL::h?q=java
Text::Javalobby | The heart of the Java developer community, URL::=https://java.dzone.com/

Note that Google search results are currently wrapped in an h3 tag with class "r", and an "a" tag inside it holds the link. If this changes in future, for example if the h3 class name is changed, the program won't work properly and we will have to adjust the selector slightly by looking at the source HTML structure.
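The URL extraction itself can also be made less brittle. The sketch below is one possible approach, not the original program's method: it assumes result hrefs have the shape /url?q=<target>&..., which is inferred only from the output printed above, and it leaves any other href untouched. The class name SearchHrefs and the helper extractTarget() are hypothetical, and URL decoding of the target is not handled.

```java
public class SearchHrefs {

    // Assumed prefix of a result link; inferred from the sample output, not from Google documentation.
    private static final String PREFIX = "/url?q=";

    // Returns the target URL embedded in a "/url?q=<target>&..." href,
    // or the href unchanged when it does not have that shape.
    static String extractTarget(String href) {
        if (!href.startsWith(PREFIX)) {
            return href;
        }
        int end = href.indexOf('&');
        return end < 0
                ? href.substring(PREFIX.length())
                : href.substring(PREFIX.length(), end);
    }

    public static void main(String[] args) {
        System.out.println(extractTarget("/url?q=https://java.com/download&sa=U"));
        System.out.println(extractTarget("/search?q=java&tbm=nws"));
    }
}
```

Checking the prefix and the position of "&" before calling substring avoids both the stray "=" and a StringIndexOutOfBoundsException when an href has no "&" at all.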

That's all for the jsoup example tutorial. I hope it helps you parse HTML data easily when required.

Reference: Official Website

Translated from: https://www.journaldev.com/7144/jsoup-java-html-parser
