Topic: Using jsoup

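jsoup is a Java HTML parser that can parse HTML directly from a URL or from HTML text. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods. See: http://jsoup.org/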

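The main features of jsoup are:

Parse HTML from a URL, a file, or a string;

Find and extract data using DOM traversal or CSS selectors;

Manipulate HTML elements, attributes, and text.

jsoup is released under the MIT license, so it can be used in commercial projects.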

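Download and installation:

Installing with Maven:

Add the following to the dependencies section of pom.xml:

<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<!-- add a <version> element with the jsoup release you want to use -->
</dependency>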

HTML can be parsed with jsoup in the following ways.

Parsing HTML from a URL:

Document doc = Jsoup.connect("http://example.com")
        .data("query", "Java")   // request parameter
        .userAgent("Mozilla")
        .cookie("auth", "token")
        .timeout(3000)           // timeout in milliseconds
        .post();

Parsing from a file:

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
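The feature list also mentions parsing from a string, which the examples above do not show. A minimal sketch, assuming the HTML is already in memory; the class name `ParseStringDemo`, the helper `titleOf`, and the sample markup are made up for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ParseStringDemo {
    // Parses an in-memory HTML string and returns the page title.
    static String titleOf(String html) {
        // The second argument is the base URI used to resolve relative links.
        Document doc = Jsoup.parse(html, "http://example.com/");
        return doc.title();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Hello jsoup</title></head>"
                + "<body><p>First paragraph</p></body></html>";
        System.out.println(titleOf(html));
    }
}
```

Passing a base URI is optional when parsing a string, but without one jsoup has nothing to resolve relative links against.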

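Similar to the JavaScript DOM API, jsoup provides the following methods:

getElementById(String id): find an element by id
getElementsByTag(String tag): find elements by tag name
getElementsByClass(String className): find elements by class name
getElementsByAttribute(String key): find elements that have the given attribute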
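The four lookup methods can be tried on a small in-memory document. This is only a sketch; the class name `DomLookupDemo` and the sample HTML are assumptions made for the example:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class DomLookupDemo {
    // Hypothetical sample markup used only for this illustration.
    static final String HTML =
            "<div id=\"main\">"
          + "<p class=\"intro\">Welcome</p>"
          + "<p>See <a href=\"/docs\" title=\"Docs\">the docs</a></p>"
          + "</div>";

    static Document load() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = load();

        Element main = doc.getElementById("main");           // single element or null
        Elements paragraphs = doc.getElementsByTag("p");     // all <p> elements
        Elements intro = doc.getElementsByClass("intro");    // elements with class "intro"
        Elements titled = doc.getElementsByAttribute("title"); // elements that have a title attribute

        System.out.println(main.tagName());
        System.out.println(paragraphs.size());
        System.out.println(intro.first().text());
        System.out.println(titled.first().attr("href"));
    }
}
```

Note that getElementById returns a single Element (or null when nothing matches), while the other three return an Elements collection.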

It also provides the following methods for getting sibling elements:

siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
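As a rough illustration of sibling navigation, the sketch below starts from the middle item of a three-item list. The class name `SiblingDemo` and the markup are invented for the example:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class SiblingDemo {
    // Returns the second <li> of a small, made-up list.
    static Element middleItem() {
        String html = "<ul><li>one</li><li>two</li><li>three</li></ul>";
        return Jsoup.parse(html).getElementsByTag("li").get(1);
    }

    public static void main(String[] args) {
        Element two = middleItem();
        System.out.println(two.previousElementSibling().text());
        System.out.println(two.nextElementSibling().text());
        System.out.println(two.firstElementSibling().text());
        System.out.println(two.lastElementSibling().text());
        // siblingElements() excludes the element itself.
        System.out.println(two.siblingElements().size());
    }
}
```

nextElementSibling() and previousElementSibling() return null at the ends of the list, so real code should check for that.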

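Use the following methods to read and write an element's data:

attr(String key): get the value of an attribute
attr(String key, String value): set the value of an attribute
attributes(): get all attributes
id(), className(), classNames(): get the id and class values
text(): get the text content
text(String value): set the text content
html(): get the inner HTML
html(String value): set the inner HTML
outerHtml(): get the outer HTML (including the element itself)
data(): get the data content (for example, of a script or style element)
tag(): get the Tag object; tagName(): get the tag name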
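The accessors above might be exercised like this. It is a sketch on invented markup; `ElementDataDemo`, `box()` and `link()` are names made up for the illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class ElementDataDemo {
    // Hypothetical sample markup used only for this illustration.
    static final String HTML =
            "<div id=\"box\" class=\"card wide\">"
          + "<a href=\"/about\" title=\"About us\">About <b>us</b></a>"
          + "</div>";

    static Element box() {
        return Jsoup.parse(HTML).getElementById("box");
    }

    static Element link() {
        return box().getElementsByTag("a").first();
    }

    public static void main(String[] args) {
        Element box = box();
        Element link = link();

        System.out.println(box.id());          // id attribute
        System.out.println(box.className());   // raw class attribute
        System.out.println(box.classNames());  // class names as a set
        System.out.println(link.attr("href")); // attribute value as written
        System.out.println(link.text());       // text without markup
        System.out.println(link.html());       // inner HTML
        System.out.println(link.outerHtml());  // HTML including the <a> tag itself
        System.out.println(link.tagName());

        // The setters modify the element in place.
        link.attr("title", "Company info");
        link.text("Company");
        System.out.println(link.outerHtml());
    }
}
```

The difference between html() and outerHtml() is that the latter includes the element's own start and end tags.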

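jsoup provides the following methods for manipulating HTML:

append(String html), prepend(String html): add parsed HTML at the end or start of the element's children
appendText(String text), prependText(String text): add a text node at the end or start
appendElement(String tagName), prependElement(String tagName): create a new child element at the end or start
html(String value): replace the element's inner HTML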
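A short sketch of these manipulation methods applied to one element. The class name `ManipulationDemo`, the helper `childSummary`, and the markup are assumptions for the example:

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class ManipulationDemo {
    // Builds a small element, modifies it, and returns the modified element.
    static Element build() {
        Element box = Jsoup.parse("<div id=\"box\"><p>middle</p></div>").getElementById("box");
        box.prepend("<p>first</p>");            // parsed HTML, inserted before existing children
        box.append("<p>last</p>");              // parsed HTML, inserted after existing children
        box.appendElement("span").text("tail"); // appendElement returns the new child
        box.prependText("intro ");              // plain text node, not parsed as HTML
        return box;
    }

    // Summarizes the child elements as "tag:text" strings.
    static List<String> childSummary() {
        List<String> out = new ArrayList<>();
        for (Element child : build().children()) {
            out.add(child.tagName() + ":" + child.text());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(childSummary());
        System.out.println(build().outerHtml());
    }
}
```

children() lists only element children, so the text node added by prependText does not appear in the summary.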

Working with HTML using jQuery-like selectors:

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements links = doc.select("a[href]"); // a elements with an href attribute
Elements pngs = doc.select("img[src$=.png]"); // img elements whose src ends with .png
Element masthead = doc.select("div.masthead").first(); // first div with class=masthead
Elements resultLinks = doc.select("h3.r > a"); // a elements that are direct children of h3.r

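The supported selectors include the following:

```
tagname: find elements by tag name
ns|tag: find elements by tag name within a namespace
#id: find the element with the given id
.class: find elements with the given class
[attribute]: elements that have the attribute
[^attr]: elements with an attribute name starting with attr
[attr=value]: elements whose attribute equals value
[attr^=value], [attr$=value], [attr*=value]: elements whose attribute value starts with, ends with, or contains value
[attr~=regex]: elements whose attribute value matches the regular expression
*: all elements
```

Selector combinations:

```
el#id: element el with the given id
el.class: element el with the given class
el[attr]: element el with the given attribute
ancestor child: child elements that descend from ancestor
```

and so on.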
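To make the selector syntax more concrete, the sketch below counts matches for several selectors against a small document. The class name `SelectorDemo`, the helper `count`, and the markup are invented for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SelectorDemo {
    // Hypothetical sample markup used only for this illustration.
    static final Document DOC = Jsoup.parse(
            "<div id=\"content\">"
          + "<a class=\"ext\" href=\"http://example.com/a.png\" data-id=\"1\">A</a>"
          + "<a href=\"/b.html\">B</a>"
          + "<img src=\"logo.png\">"
          + "<p class=\"note\">n1</p>"
          + "</div>"
          + "<p class=\"note\">n2</p>");

    // Number of elements matched by a CSS query.
    static int count(String cssQuery) {
        return DOC.select(cssQuery).size();
    }

    public static void main(String[] args) {
        String[] queries = {
            "a[href]",               // a elements with an href attribute
            "[^data-]",              // any attribute name starting with data-
            "a[href$=.png]",         // href ending with .png
            "img[src~=(?i)\\.png$]", // src matching a regular expression
            "a.ext",                 // tag plus class
            "div#content p.note",    // descendant of div#content
            "p.note",                // anywhere in the document
            "div > a"                // direct children of div
        };
        for (String q : queries) {
            System.out.println(q + " -> " + count(q));
        }
    }
}
```

Comparing `div#content p.note` with `p.note` shows how the ancestor part narrows the match.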

An example that fetches a page's title, description, and images:

```
// Needs org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element,
// org.jsoup.select.Elements and java.io.IOException.
// StringUtils is presumably Apache Commons Lang. log, IMAGE_TYPE_ARRAY (image file
// suffixes) and imgSrcs (collected image URLs) are fields of the enclosing class
// that the original listing does not show.
public void parse(String urlStr) {
    Document doc;
    try {
        doc = Jsoup.connect(urlStr)
                .userAgent("Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.15)") // set the User-Agent
                .timeout(5000) // connection timeout in milliseconds
                .get();
    } catch (IOException e) {
        // Also covers MalformedURLException, SocketTimeoutException and
        // UnknownHostException, which are IOException subclasses and were
        // all handled the same way: log and give up.
        log.error(e);
        return;
    }

    System.out.println(doc.title());

    Elements metas = doc.head().select("meta");
    for (Element meta : metas) {
        String content = meta.attr("content");
        // Skip pages whose declared content type is not HTML.
        if ("content-type".equalsIgnoreCase(meta.attr("http-equiv"))
                && !StringUtils.startsWith(content, "text/html")) {
            log.debug(urlStr);
            return;
        }
        if ("description".equalsIgnoreCase(meta.attr("name"))) {
            System.out.println(content);
        }
    }

    for (Element img : doc.body().getElementsByTag("img")) {
        String imageUrl = img.attr("abs:src"); // absolute URL of the image
        // Strip the query string before checking the file suffix.
        int queryStart = imageUrl.indexOf('?');
        if (queryStart > 0) {
            imageUrl = imageUrl.substring(0, queryStart);
        }
        for (String suffix : IMAGE_TYPE_ARRAY) {
            if (StringUtils.endsWithIgnoreCase(imageUrl, suffix)) {
                imgSrcs.add(imageUrl);
                break;
            }
        }
    }
}
```


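The key point here is how to obtain the absolute URL of an image or link:

As shown above, String imageUrl = img.attr("abs:src"); returns the absolute URL. Prefixing the attribute name with abs: tells jsoup to resolve the value against the base URI.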
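The effect of the abs: prefix is easiest to see by comparing it with the plain attribute. A minimal sketch, assuming a document parsed from a string with an explicit base URI; the class name `AbsAttrDemo`, the helper `read`, and the URLs are made up for the example:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class AbsAttrDemo {
    static final String HTML = "<a href=\"/docs/index.html\">Docs</a>";

    // Reads the given attribute key from the link, parsed with the given base URI.
    static String read(String baseUri, String attributeKey) {
        Element link = Jsoup.parse(HTML, baseUri).select("a").first();
        return link.attr(attributeKey);
    }

    public static void main(String[] args) {
        String base = "http://example.com/blog/";
        System.out.println(read(base, "href"));     // value exactly as written in the HTML
        System.out.println(read(base, "abs:href")); // value resolved against the base URI
        // With an empty base URI there is nothing to resolve against,
        // so the abs: lookup yields an empty string.
        System.out.println("[" + read("", "abs:href") + "]");
    }
}
```

When a page is fetched with Jsoup.connect(...), the base URI is taken from the request URL, which is why the crawler example can use abs:src without setting a base URI itself.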

To see why, look at the source code:

```
// This method first looks up the attribute key in the attribute map and returns
// the value directly if it is present. Otherwise it checks whether the key
// starts with "abs:".
public String attr(String attributeKey) {
    Validate.notNull(attributeKey);

    if (hasAttr(attributeKey))
        return attributes.get(attributeKey);
    else if (attributeKey.toLowerCase().startsWith("abs:"))
        return absUrl(attributeKey.substring("abs:".length()));
    else return "";
}
```


Next, look at the absUrl method:

```
/**
 * Get an absolute URL from a URL attribute that may be relative (i.e. an <code>&lt;a href></code> or
 * <code>&lt;img src></code>).
 * <p/>
 * E.g.: <code>String absUrl = linkEl.absUrl("href");</code>
 * <p/>
 * If the attribute value is already absolute (i.e. it starts with a protocol, like
 * <code>http://</code> or <code>https://</code> etc), and it successfully parses as a URL, the attribute is
 * returned directly. Otherwise, it is treated as a URL relative to the element's {@link #baseUri}, and made
 * absolute using that.
 * <p/>
 * As an alternate, you can use the {@link #attr} method with the <code>abs:</code> prefix, e.g.:
 * <code>String absUrl = linkEl.attr("abs:href");</code>
 *
 * @param attributeKey The attribute key
 * @return An absolute URL if one could be made, or an empty string (not null) if the attribute was missing or
 * could not be made successfully into a URL.
 * @see #attr
 * @see java.net.URL#URL(java.net.URL, String)
 */
public String absUrl(String attributeKey) {
    Validate.notEmpty(attributeKey);

    String relUrl = attr(attributeKey);
    if (!hasAttr(attributeKey)) {
        return ""; // nothing to make absolute with
    } else {
        URL base;
        try {
            try {
                base = new URL(baseUri);
            } catch (MalformedURLException e) {
                // the base is unsuitable, but the attribute may be abs on its own, so try that
                URL abs = new URL(relUrl);
                return abs.toExternalForm();
            }
            // workaround: java resolves '//path/file + ?foo' to '//path/?foo', not '//path/file?foo' as desired
            if (relUrl.startsWith("?"))
                relUrl = base.getPath() + relUrl;
            URL abs = new URL(base, relUrl);
            return abs.toExternalForm();
        } catch (MalformedURLException e) {
            return "";
        }
    }
}
```

At this point it should be clear how the absolute URL is obtained: the attribute value is resolved against the base URI using java.net.URL.
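The resolution step at the heart of absUrl relies only on java.net.URL, so it can be tried without jsoup. The sketch below is a simplified stand-in, not jsoup's code: it omits the query-string workaround and the fallback for an unusable base, and the class name `UrlResolveDemo` and the URLs are made up for the example:

```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlResolveDemo {
    // Resolves relUrl against baseUri; returns "" when no URL can be built.
    static String resolve(String baseUri, String relUrl) {
        try {
            URL base = new URL(baseUri);
            return new URL(base, relUrl).toExternalForm();
        } catch (MalformedURLException e) {
            return "";
        }
    }

    public static void main(String[] args) {
        String base = "http://example.com/a/b.html";
        System.out.println(resolve(base, "img/c.png"));               // relative to the base's directory
        System.out.println(resolve(base, "/x.png"));                  // relative to the host root
        System.out.println(resolve(base, "https://other.org/y.png")); // already absolute
        System.out.println("[" + resolve("not a url", "x.png") + "]"); // unusable base
    }
}
```

An already absolute value passes through unchanged, which matches the behaviour described in the Javadoc above.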
