[Java crawler] Scraping web page data with jsoup: search algorithm evaluation / competitor evaluation

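Business background: how do we know whether the search algorithm is good or bad? That is what competitor evaluation is for. For our own app, we call its API and capture the key fields of the first 6 result cards. For competitors, whose APIs we cannot call, we use jsoup to scrape the fields from the PC web front end and store them in the fields we need, such as video duration, play count, like count, and type. Using a batch of queries provided by the PM, we capture search results from multiple apps. Finally everything is stored on OSS and handed to the PM's outsourced annotators for labeling (relevance, satisfaction, scoring).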

jsoup reference:
https://www.jianshu.com/p/fd5caaaa950d

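Major pitfall:

The page source fetched by the crawler differs from the page source shown when pressing F12. Why?

The page as finally displayed has been processed by the browser, whereas the source obtained by a GET or POST request is what the server returned directly. It is normal for the two to differ.

Inspect Element (or developer tools such as Firebug) shows the live content after JavaScript has modified it, whereas View Page Source shows the HTTP response content as the browser first received it.

In other words, when the page loads, the browser runs the page's scripts and fills in the content of the corresponding classes, but jsoup only parses the returned HTML and does not execute JavaScript.

I did not know this at first, and found while scraping that some fields came back as null.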
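A minimal sketch of this pitfall, assuming jsoup is on the classpath. The HTML string, the play-count class name, and the StaticVsRendered class are hypothetical and only stand in for a real server response whose value is filled in by a script at load time.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StaticVsRendered {
    // Hypothetical server response: the span is empty in the raw HTML and
    // would only be filled in by the script when a browser renders the page.
    static final String RAW_HTML =
        "<html><body>"
        + "<span class=\"play-count\"></span>"
        + "<script>document.querySelector('.play-count').textContent = '12345';</script>"
        + "</body></html>";

    // Returns the text jsoup sees for the given CSS selector.
    static String textSeenByJsoup(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        return doc.select(cssQuery).text();
    }

    public static void main(String[] args) {
        // jsoup parses the markup but never runs the script,
        // so the text between the brackets is empty.
        System.out.println("[" + textSeenByJsoup(RAW_HTML, ".play-count") + "]");
    }
}
```

A browser would show 12345 in that span, while the crawler sees an empty string. When a field comes back empty, it is worth checking View Page Source rather than the Elements panel to see what the server actually returned.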

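For example, when scraping the iQIYI page, I tried a JS/HTML formatter (http://tool.chinaz.com/Tools/jsformat.aspx).

I tried a JSON formatter, but the content itself is HTML (https://www.json.cn/#).

I tried VS Code...

In the end I found that viewing Elements directly in Chrome's developer tools works best: the formatting is clear at a glance. Because searching inside developer tools is rather slow, you can also open View Page Source and search for the elements there.

The nesting is <div><h3><span>.

Before writing code, learn jsoup. It is simple, and you will write much more efficiently once you understand it.

This was my first crawler, so I debugged the existing competitor-scraping code and wrote mine by imitation.

Selectors: to select by class, call select(".classname") directly.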
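The two selector styles can be compared on a small fragment. This is a sketch that assumes jsoup is on the classpath; the markup, the example.com links, and the SelectorDemo class are made up for illustration and are not the real page structure.

```java
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

public class SelectorDemo {
    // Hypothetical markup loosely modeled on a search result list.
    static final String HTML =
        "<ul>"
        + "<li class=\"video-item matrix\">"
        + "<a class=\"title\" title=\"First video\" href=\"//example.com/v/1\"></a></li>"
        + "<li class=\"video-item\">"
        + "<a class=\"title\" title=\"Second video\" href=\"//example.com/v/2\"></a></li>"
        + "</ul>";

    // Parses the sample markup and runs one CSS query against it.
    static Elements select(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery);
    }

    public static void main(String[] args) {
        // Class selector: matches every element that carries the class.
        System.out.println(select(".video-item").size());
        // Attribute selector: matches only when the whole class attribute equals the value.
        System.out.println(select("li[class=video-item matrix]").size());
        // attr() on Elements reads the attribute from the first matching element.
        System.out.println(select(".video-item a.title").attr("title"));
    }
}
```

The class selector matches both list items, while the attribute form matches only the item whose class attribute is exactly "video-item matrix". The attribute form is stricter, which helps when a page reuses a class name across different card types.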

If you run into the following:

Resolving the error: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException

Reference: https://blog.csdn.net/u010248330/article/details/70161899

javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
   at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
   at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1949)
   at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:302)
   at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296)
   at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1509)
   at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216)
   at sun.security.ssl.Handshaker.processLoop(Handshaker.java:979)
   at sun.security.ssl.Handshaker.process_record(Handshaker.java:914)
   at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1062)
   at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
   at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
   at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
   at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559)
   at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185)
   at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:153)
   at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:746)
   at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:722)
   at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:306)
   at org.jsoup.helper.HttpConnection.get(HttpConnection.java:295)
   at com.alibaba.pingce.jingpin.BliHandler.getBliPcResult(BliHandler.java:44)
   at com.alibaba.pingce.jingpin.BliHandler.main(BliHandler.java:199)
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
   at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:387)
   at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:292)
   at sun.security.validator.Validator.validate(Validator.java:260)
   at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:324)
   at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:229)
   at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:124)
   at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1491)
   ... 16 more
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
   at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141)
   at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126)
   at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280)
   at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:382)
   ... 22 more
Exception in thread "main" java.lang.NullPointerException
   at com.alibaba.pingce.jingpin.BliHandler.getBliPcResult(BliHandler.java:189)
   at com.alibaba.pingce.jingpin.BliHandler.main(BliHandler.java:199)
Disconnected from the target VM, address: '127.0.0.1:56813', transport: 'socket'

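According to what I found online, this is a certificate problem, and you can add some logic to the code to ignore certificates:

The code below was downloaded from the web: http://www.sojson.com/blog/195.html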

import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;

import javax.net.ssl.HostnameVerifier;
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSession;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

public class SslUtils {

    // Installs a trust manager that accepts every certificate. This turns off
    // certificate validation for all HttpsURLConnection traffic in the JVM,
    // so it only suits test or crawling scenarios like this one.
    public static void trustAllHttpsCertificates() throws Exception {
        TrustManager[] trustAllCerts = new TrustManager[1];
        TrustManager tm = new miTM();
        trustAllCerts[0] = tm;
        SSLContext sc = SSLContext.getInstance("SSL");
        sc.init(null, trustAllCerts, null);
        HttpsURLConnection.setDefaultSSLSocketFactory(sc.getSocketFactory());
    }

    static class miTM implements TrustManager, X509TrustManager {
        public X509Certificate[] getAcceptedIssuers() {
            // The X509TrustManager contract expects a non-null (possibly empty) array.
            return new X509Certificate[0];
        }

        public boolean isServerTrusted(X509Certificate[] certs) {
            return true;
        }

        public boolean isClientTrusted(X509Certificate[] certs) {
            return true;
        }

        // Accepts any server certificate chain without checking it.
        public void checkServerTrusted(X509Certificate[] certs, String authType)
                throws CertificateException {
            return;
        }

        // Accepts any client certificate chain without checking it.
        public void checkClientTrusted(X509Certificate[] certs, String authType)
                throws CertificateException {
            return;
        }
    }

    /**
     * Ignores SSL certificate checks for HTTPS requests. Must be called before openConnection.
     * @throws Exception
     */
    public static void ignoreSsl() throws Exception {
        // Accept any host name, regardless of what the certificate says.
        HostnameVerifier hv = new HostnameVerifier() {
            public boolean verify(String urlHostName, SSLSession session) {
                return true;
            }
        };
        trustAllHttpsCertificates();
        HttpsURLConnection.setDefaultHostnameVerifier(hv);
    }
}

// Just call it before URLConnection con = url.openConnection()

   public static void main(String[] args) {
       //String url="http://wx1.sinaimg.cn/mw690/006sl6kBgy1fel3aq0nyej30i20hxq7i.jpg";
       String url="https://05.imgmini.eastday.com/mobile/20170413/20170413053046_4a5e70ed0b39c824517630e6954861f2_1.jpeg";
       String downToFilePath="d:/download/image/";
       String fileName="test";
       try {
           SslUtils.ignoreSsl();
       } catch (Exception e) {
           e.printStackTrace();
       }
       imageDownLoad(url, downToFilePath,fileName);

   }

In your own code, just call the utility method above and catch the exception it may throw.

BliHandler

package com.alibaba.pingce.jingpin;

import com.alibaba.algo.dao.SokuTopQueryCompareSnapshotInfoDao;
import com.alibaba.fastjson.JSONObject;
import com.alibaba.pingce.component.Constants;
import com.alibaba.pingce.model.JingPinModle;
import com.alibaba.util.http.handler.SslUtil;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

@Service
public class BliHandler {

    @Autowired
    SokuTopQueryCompareSnapshotInfoDao sokuTopQueryCompareSnapshotInfoDao;

    public List<JingPinModle> getBliPcResult(String query, int num) {
        List<JingPinModle> jingPinModles = new ArrayList<>();
        try {
            // Disable certificate checks before the first HTTPS connection is opened.
            try {
                SslUtil.ignoreSsl();
            } catch (Exception e) {
                e.printStackTrace();
            }
//            String url="http://so.iqiyi.com/so/q_"+ URLEncoder.encode ( query,"UTF-8" )+"?source=input&sr=1476998987782";
//            String url = "https://search.bilibili.com/all?keyword=" + URLEncoder.encode(query, "UTF-8") + "&from_source=nav_suggest_new";
            String url = "https://search.bilibili.com/all?keyword=" + query + "&from_source=nav_suggest_new";
//            logger.info ( url );
            Document doc = Jsoup.connect(url).userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31").get();

            // Maps the class attribute of a card to its source type.
            HashMap<String, Integer> docSourceMap = new HashMap<>();
            docSourceMap.put("bangumi-item-wrap", 1); // program
//            docSourceMap.put("", 10); // program, head query
            docSourceMap.put("video-item matrix", 2); // ugc
//            docSourceMap.put("", 12); // person
//            docSourceMap.put("live-room-item", 98); // live
//            docSourceMap.put("mixin-list",1111);

            List<String> classes = new ArrayList<>();
            classes.add("bangumi-item-wrap");
            classes.add("video-item matrix");
            classes.add("live-room-item");
//            classes.add("mixin-list");

//            Elements docList = doc.select ( "div[class=layout-main] > div" );
            // All card lists (program, ugc, ...) in the search results for the current query
            Elements docList = doc.select(".mixin-list");

            // Program list within the card lists
//            Elements bangumi_list = docList.select("." + classes.get(i));
//            Elements bangumi_list = docList.select(".bangumi-list");
            Elements bangumi_list = docList.select(".bangumi-item-wrap");

            // UGC list within the card lists
//            Elements videoListClearfix = docList.select(".video-item");
            // tag[class=...] form
            Elements videoListClearfix = docList.select("li[class=video-item matrix]");

            // Live list within the card lists
            Elements liveList = docList.select("ul[class=live-room-wrap clearfix]").select("li[class=live-room-item]");

            for (int i = 0; i < classes.size(); i++) {
                String title = "null";
                String pic = "null";
                String site = "null";
                String time = "null";
                String anchor = "null";
                String timelength = "null";
                String videoUrl = "null";
                String type = "null";
                String playCount = "null";
                String headIcon = "null";
                int rank = 1;
//            for (Element element : docList) {

                // Program cards: bangumi_list
                if (!bangumi_list.isEmpty()) {
                    for (Element element : bangumi_list) {
                        if (jingPinModles.size() >= 5) {
                            break;
                        }
                        JSONObject curDoc = new JSONObject();
                        String figure = element.attr("class").trim();
                        if (!classes.contains(figure)) {
                            continue;
                        }
                        Integer docSource = docSourceMap.get(figure);
                        JingPinModle jingPinModle = new JingPinModle();

                        // Program (bangumi)
                        // Both forms select by class: div[class=right-info] or .right-info
//                        String category = element.select("div[class=right-info]").select("span[class=bangumi-label]").text().trim();
                        String category = element.select(".right-info").select("span[class=bangumi-label]").text().trim();
                        if (!category.isEmpty()) {
                            type = "节目（番剧）";
                        } else {
                            type = "专题";
                        }
                        title = element.select(".right-info").select("a[href]").attr("title").trim();
                        site = "B站";
                        String pic1 = "http" + element.select(".lazy-img");
                        // Note: attr() expects an attribute name such as "src". "img[src]" is a
                        // selector, not an attribute name, so this call returns an empty string.
                        pic = "http" + element.select(".lazy-img").attr("img[src]");
//                         Elements elements = element.select("a[class=left-img]");
//                         for(Element element1:elements){
//                             System.out.println(JSONObject.toJSONString(element1.select("a").attr("href")));
//                         }
                        videoUrl = "http:" + element.select("a").attr("href").trim();

                        jingPinModle.setRank(rank++);
                        jingPinModle.setQuery(query);
                        jingPinModle.setVdo_title(title);
                        jingPinModle.setPic(pic);
                        jingPinModle.setSite(site);
                        jingPinModle.setCreate_time(time);
                        jingPinModle.setRel_people(anchor);
                        jingPinModle.setSeconds(timelength);
                        jingPinModle.setUrl(videoUrl);
                        jingPinModle.setType(type);
                        jingPinModles.add(jingPinModle);
//                        break;
                    }
                }

                // UGC cards: videoListClearfix
                if (!videoListClearfix.isEmpty()) {
                    Element element = videoListClearfix.get(i);
                    if (jingPinModles.size() >= 5) {
                        break;
                    }
                    JSONObject curDoc = new JSONObject();
                    String figure = element.attr("class").trim();
                    if (!classes.contains(figure)) {
                        continue;
                    }
                    Integer docSource = docSourceMap.get(figure);
                    JingPinModle jingPinModle = new JingPinModle();

                    // Title
                    title = element.select(".info").select(".headline").select("a[class=title]").attr("title").trim();
                    // Upload time
                    time = element.select(".info").select(".tags").select("span[class=so-icon time]").text();
                    System.out.println("time==" + time);
//                        select("div[desc=发布时间]").select("span[class=so-icon time]").text().trim();
                    // Play count
                    playCount = element.select(".info").select(".tags").select("span[class=so-icon watch-num]").text();
                    // Uploader
                    anchor = element.select(".info").select(".tags").select("span[class=so-icon]").select("a[class=up-name]").text();
                    if (anchor.isEmpty()) {
                        anchor = element.select("div[class=result-right]").select("div[desc=上传者]").select("a[class=uploader-name]").attr("title").trim();
                    }
//                        anchor = element.select ( "div[class=result-right]" ).select ( "div[class=qy-search-result-info uploader-ico]" ).
//                                select ( "span[class=info-uploader]" ).text().replace("+关注","").trim();
                    // Video duration
                    timelength = element.select(".img").select("span[class=so-imgTag_rb]").text();
                    // Video cover
                    pic = "http:" + element.select("div[class=result-figure]").select("img[class=qy-mod-cover]").attr("src").trim();
                    videoUrl = "http:" + element.select("div[class=result-right]").select("a[class=main-tit]").attr("href").trim();
                    type = "ugc";
                    site = "B站";

                    jingPinModle.setRank(rank++);
                    jingPinModle.setQuery(query);
                    jingPinModle.setVdo_title(title);
                    jingPinModle.setPic(pic);
                    jingPinModle.setSite(site);
                    jingPinModle.setCreate_time(time);
                    jingPinModle.setRel_people(anchor);
                    jingPinModle.setSeconds(timelength);
                    jingPinModle.setUrl(videoUrl);
                    jingPinModle.setType(type);
                    // Play count
                    jingPinModle.setPlay_count(playCount);
                    jingPinModles.add(jingPinModle);
//                    for (Element element : videoListClearfix) {
//                    }
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }

        // Append the stored page snapshot for this query as the last entry.
        JingPinModle capture_model = new JingPinModle();
        capture_model.setPic(sokuTopQueryCompareSnapshotInfoDao.selectUrlBySiteAndQuery(query, Constants.BliBli));
        capture_model.setQuery(query);
        capture_model.setRank(jingPinModles.size() + 1);
        jingPinModles.add(capture_model);
        return jingPinModles;
    }

    public static void main(String[] args) {
        BliHandler handler = new BliHandler();
        List<JingPinModle> modles = handler.getBliPcResult("辉夜大小姐", 5);
        System.out.println(modles.size());
    }
}
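The handler concatenates the raw query into the URL and leaves the URLEncoder variant commented out. A possible way to restore the encoding with only the JDK is sketched below. The SearchUrlBuilder class and buildSearchUrl method are hypothetical names, and whether the site needs an encoded keyword in practice is an assumption, not something the original code confirms.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SearchUrlBuilder {
    // Builds the search URL with the query percent-encoded as UTF-8, so that
    // spaces, '&', and non-ASCII characters cannot break the query string.
    static String buildSearchUrl(String query) {
        try {
            return "https://search.bilibili.com/all?keyword="
                + URLEncoder.encode(query, "UTF-8")
                + "&from_source=nav_suggest_new";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The '&' inside the query is escaped instead of starting a new parameter.
        System.out.println(buildSearchUrl("a b&c"));
    }
}
```

Without encoding, a query that contains & or = would be split into extra parameters, and non-ASCII queries depend on how the HTTP client happens to handle them.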

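Problems encountered

While debugging, I found that the image could not be retrieved; it came back as null.

Below is the img field as captured in developer tools.

I switched approach and parsed JSON instead of using jsoup selectors: from the raw page source (View Page Source), cut out the JSON that starts after window.__INITIAL_STATE__= and ends just before ;(function(){var s;. The pic value is taken from that JSON as follows (prepend https).

The video URL is built as follows: append the id from the JSON to https://www.bilibili.com/video/av

The scraping result is as follows: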
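The cut-and-concatenate steps described above can be sketched with plain string operations. The two markers and the video URL prefix come from the text. The class name, method names, sample page, and the assumption that pic is a protocol-relative value starting with // are illustrative only.

```java
public class InitialStateExtractor {
    private static final String START = "window.__INITIAL_STATE__=";
    private static final String END = ";(function(){var s;";

    // Returns the JSON text between the two markers, or null if either is missing.
    static String extractInitialState(String pageSource) {
        int start = pageSource.indexOf(START);
        if (start < 0) {
            return null;
        }
        start += START.length();
        int end = pageSource.indexOf(END, start);
        if (end < 0) {
            return null;
        }
        return pageSource.substring(start, end);
    }

    // Assumption: pic is protocol-relative ("//host/path"), so the scheme is prepended.
    static String toPicUrl(String pic) {
        return pic.startsWith("//") ? "https:" + pic : pic;
    }

    // Appends the id from the JSON to the video URL prefix.
    static String toVideoUrl(long id) {
        return "https://www.bilibili.com/video/av" + id;
    }

    public static void main(String[] args) {
        // Hypothetical page source with made-up values.
        String page = "<script>window.__INITIAL_STATE__="
            + "{\"id\":123,\"pic\":\"//example.com/cover.jpg\"}"
            + ";(function(){var s;})();</script>";
        System.out.println(extractInitialState(page));
        System.out.println(toPicUrl("//example.com/cover.jpg"));
        System.out.println(toVideoUrl(123));
    }
}
```

The extracted string can then be handed to a JSON library such as fastjson, which the handler already imports, to read the pic and id fields. Returning null when a marker is missing makes it easy to detect that the page layout has changed.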

jsoup source code (the Node class, decompiled):

//
// Source code recreated from a .class file by IntelliJ IDEA
// (powered by Fernflower decompiler)
//

package org.jsoup.nodes;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import org.jsoup.SerializationException;
import org.jsoup.helper.StringUtil;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document.OutputSettings;
import org.jsoup.parser.Parser;
import org.jsoup.select.NodeFilter;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

public abstract class Node implements Cloneable {
    static final String EmptyString = "";
    Node parentNode;
    int siblingIndex;

    protected Node() {
    }

    public abstract String nodeName();

    protected abstract boolean hasAttributes();

    public boolean hasParent() {
        return this.parentNode != null;
    }

    public String attr(String attributeKey) {
        Validate.notNull(attributeKey);
        if (!this.hasAttributes()) {
            return "";
        } else {
            String val = this.attributes().getIgnoreCase(attributeKey);
            if (val.length() > 0) {
                return val;
            } else {
                return attributeKey.startsWith("abs:") ? this.absUrl(attributeKey.substring("abs:".length())) : "";
            }
        }
    }

    public abstract Attributes attributes();

    public Node attr(String attributeKey, String attributeValue) {
        this.attributes().putIgnoreCase(attributeKey, attributeValue);
        return this;
    }

    public boolean hasAttr(String attributeKey) {
        Validate.notNull(attributeKey);
        if (attributeKey.startsWith("abs:")) {
            String key = attributeKey.substring("abs:".length());
            if (this.attributes().hasKeyIgnoreCase(key) && !this.absUrl(key).equals("")) {
                return true;
            }
        }

        return this.attributes().hasKeyIgnoreCase(attributeKey);
    }

    public Node removeAttr(String attributeKey) {
        Validate.notNull(attributeKey);
        this.attributes().removeIgnoreCase(attributeKey);
        return this;
    }

    public Node clearAttributes() {
        Iterator it = this.attributes().iterator();

        while(it.hasNext()) {
            it.next();
            it.remove();
        }

        return this;
    }

    public abstract String baseUri();

    protected abstract void doSetBaseUri(String var1);

    public void setBaseUri(final String baseUri) {
        Validate.notNull(baseUri);
        this.traverse(new NodeVisitor() {
            public void head(Node node, int depth) {
                node.doSetBaseUri(baseUri);
            }

            public void tail(Node node, int depth) {
            }
        });
    }

    public String absUrl(String attributeKey) {
        Validate.notEmpty(attributeKey);
        return !this.hasAttr(attributeKey) ? "" : StringUtil.resolve(this.baseUri(), this.attr(attributeKey));
    }

    protected abstract List<Node> ensureChildNodes();

    public Node childNode(int index) {
        return (Node)this.ensureChildNodes().get(index);
    }

    public List<Node> childNodes() {
        return Collections.unmodifiableList(this.ensureChildNodes());
    }

    public List<Node> childNodesCopy() {
        List<Node> nodes = this.ensureChildNodes();
        ArrayList<Node> children = new ArrayList(nodes.size());
        Iterator var3 = nodes.iterator();

        while(var3.hasNext()) {
            Node node = (Node)var3.next();
            children.add(node.clone());
        }

        return children;
    }

    public abstract int childNodeSize();

    protected Node[] childNodesAsArray() {
        return (Node[])this.ensureChildNodes().toArray(new Node[this.childNodeSize()]);
    }

    public Node parent() {
        return this.parentNode;
    }

    public final Node parentNode() {
        return this.parentNode;
    }

    public Node root() {
        Node node;
        for(node = this; node.parentNode != null; node = node.parentNode) {
            ;
        }

        return node;
    }

    public Document ownerDocument() {
        Node root = this.root();
        return root instanceof Document ? (Document)root : null;
    }

    public void remove() {
        Validate.notNull(this.parentNode);
        this.parentNode.removeChild(this);
    }

    public Node before(String html) {
        this.addSiblingHtml(this.siblingIndex, html);
        return this;
    }

    public Node before(Node node) {
        Validate.notNull(node);
        Validate.notNull(this.parentNode);
        this.parentNode.addChildren(this.siblingIndex, node);
        return this;
    }

    public Node after(String html) {
        this.addSiblingHtml(this.siblingIndex + 1, html);
        return this;
    }

    public Node after(Node node) {
        Validate.notNull(node);
        Validate.notNull(this.parentNode);
        this.parentNode.addChildren(this.siblingIndex + 1, node);
        return this;
    }

    private void addSiblingHtml(int index, String html) {
        Validate.notNull(html);
        Validate.notNull(this.parentNode);
        Element context = this.parent() instanceof Element ? (Element)this.parent() : null;
        List<Node> nodes = Parser.parseFragment(html, context, this.baseUri());
        this.parentNode.addChildren(index, (Node[])nodes.toArray(new Node[nodes.size()]));
    }

    public Node wrap(String html) {
        Validate.notEmpty(html);
        Element context = this.parent() instanceof Element ? (Element)this.parent() : null;
        List<Node> wrapChildren = Parser.parseFragment(html, context, this.baseUri());
        Node wrapNode = (Node)wrapChildren.get(0);
        if (wrapNode != null && wrapNode instanceof Element) {
            Element wrap = (Element)wrapNode;
            Element deepest = this.getDeepChild(wrap);
            this.parentNode.replaceChild(this, wrap);
            deepest.addChildren(new Node[]{this});
            if (wrapChildren.size() > 0) {
                for(int i = 0; i < wrapChildren.size(); ++i) {
                    Node remainder = (Node)wrapChildren.get(i);
                    remainder.parentNode.removeChild(remainder);
                    wrap.appendChild(remainder);
                }
            }

            return this;
        } else {
            return null;
        }
    }

    public Node unwrap() {
        Validate.notNull(this.parentNode);
        List<Node> childNodes = this.ensureChildNodes();
        Node firstChild = childNodes.size() > 0 ? (Node)childNodes.get(0) : null;
        this.parentNode.addChildren(this.siblingIndex, this.childNodesAsArray());
        this.remove();
        return firstChild;
    }

    private Element getDeepChild(Element el) {
        List<Element> children = el.children();
        return children.size() > 0 ? this.getDeepChild((Element)children.get(0)) : el;
    }

    void nodelistChanged() {
    }

    public void replaceWith(Node in) {
        Validate.notNull(in);
        Validate.notNull(this.parentNode);
        this.parentNode.replaceChild(this, in);
    }

    protected void setParentNode(Node parentNode) {
        Validate.notNull(parentNode);
        if (this.parentNode != null) {
            this.parentNode.removeChild(this);
        }

        this.parentNode = parentNode;
    }

    protected void replaceChild(Node out, Node in) {
        Validate.isTrue(out.parentNode == this);
        Validate.notNull(in);
        if (in.parentNode != null) {
            in.parentNode.removeChild(in);
        }

        int index = out.siblingIndex;
        this.ensureChildNodes().set(index, in);
        in.parentNode = this;
        in.setSiblingIndex(index);
        out.parentNode = null;
    }

    protected void removeChild(Node out) {
        Validate.isTrue(out.parentNode == this);
        int index = out.siblingIndex;
        this.ensureChildNodes().remove(index);
        this.reindexChildren(index);
        out.parentNode = null;
    }

    protected void addChildren(Node... children) {
        List<Node> nodes = this.ensureChildNodes();
        Node[] var3 = children;
        int var4 = children.length;

        for(int var5 = 0; var5 < var4; ++var5) {
            Node child = var3[var5];
            this.reparentChild(child);
            nodes.add(child);
            child.setSiblingIndex(nodes.size() - 1);
        }
    }

    protected void addChildren(int index, Node... children) {
        Validate.noNullElements(children);
        List<Node> nodes = this.ensureChildNodes();
        Node[] var4 = children;
        int var5 = children.length;

        for(int var6 = 0; var6 < var5; ++var6) {
            Node child = var4[var6];
            this.reparentChild(child);
        }

        nodes.addAll(index, Arrays.asList(children));
        this.reindexChildren(index);
    }

    protected void reparentChild(Node child) {
        child.setParentNode(this);
    }

    private void reindexChildren(int start) {
        List<Node> childNodes = this.ensureChildNodes();

        for(int i = start; i < childNodes.size(); ++i) {
            ((Node)childNodes.get(i)).setSiblingIndex(i);
        }
    }

    public List<Node> siblingNodes() {
        if (this.parentNode == null) {
            return Collections.emptyList();
        } else {
            List<Node> nodes = this.parentNode.ensureChildNodes();
            List<Node> siblings = new ArrayList(nodes.size() - 1);
            Iterator var3 = nodes.iterator();

            while(var3.hasNext()) {
                Node node = (Node)var3.next();
                if (node != this) {
                    siblings.add(node);
                }
            }

            return siblings;
        }
    }

    public Node nextSibling() {
        if (this.parentNode == null) {
            return null;
        } else {
            List<Node> siblings = this.parentNode.ensureChildNodes();
            int index = this.siblingIndex + 1;
            return siblings.size() > index ? (Node)siblings.get(index) : null;
        }
    }

    public Node previousSibling() {
        if (this.parentNode == null) {
            return null;
        } else {
            return this.siblingIndex > 0 ? (Node)this.parentNode.ensureChildNodes().get(this.siblingIndex - 1) : null;
        }
    }

    public int siblingIndex() {
        return this.siblingIndex;
    }

    protected void setSiblingIndex(int siblingIndex) {
        this.siblingIndex = siblingIndex;
    }

    public Node traverse(NodeVisitor nodeVisitor) {
        Validate.notNull(nodeVisitor);
        NodeTraversor.traverse(nodeVisitor, this);
        return this;
    }

    public Node filter(NodeFilter nodeFilter) {
        Validate.notNull(nodeFilter);
        NodeTraversor.filter(nodeFilter, this);
        return this;
    }

    public String outerHtml() {
        StringBuilder accum = new StringBuilder(128);
        this.outerHtml(accum);
        return accum.toString();
    }

    protected void outerHtml(Appendable accum) {
        NodeTraversor.traverse(new Node.OuterHtmlVisitor(accum, this.getOutputSettings()), this);
    }

    OutputSettings getOutputSettings() {
        Document owner = this.ownerDocument();
        return owner != null ? owner.outputSettings() : (new Document("")).outputSettings();
    }

    abstract void outerHtmlHead(Appendable var1, int var2, OutputSettings var3) throws IOException;

    abstract void outerHtmlTail(Appendable var1, int var2, OutputSettings var3) throws IOException;

    public <T extends Appendable> T html(T appendable) {
        this.outerHtml(appendable);
        return appendable;
    }

    public String toString() {
        return this.outerHtml();
    }

    protected void indent(Appendable accum, int depth, OutputSettings out) throws IOException {
        accum.append('\n').append(StringUtil.padding(depth * out.indentAmount()));
    }

    public boolean equals(Object o) {
        return this == o;
    }

    public boolean hasSameValue(Object o) {
        if (this == o) {
            return true;
        } else {
            return o != null && this.getClass() == o.getClass() ? this.outerHtml().equals(((Node)o).outerHtml()) : false;
        }
    }

    public Node clone() {
        Node thisClone = this.doClone((Node)null);
        LinkedList<Node> nodesToProcess = new LinkedList();
        nodesToProcess.add(thisClone);

        while(!nodesToProcess.isEmpty()) {
            Node currParent = (Node)nodesToProcess.remove();
            int size = currParent.childNodeSize();

            for(int i = 0; i < size; ++i) {
                List<Node> childNodes = currParent.ensureChildNodes();
                Node childClone = ((Node)childNodes.get(i)).doClone(currParent);
                childNodes.set(i, childClone);
                nodesToProcess.add(childClone);
            }
        }

        return thisClone;
    }

    public Node shallowClone() {
        return this.doClone((Node)null);
    }

    protected Node doClone(Node parent) {
        Node clone;
        try {
            clone = (Node)super.clone();
        } catch (CloneNotSupportedException var4) {
            throw new RuntimeException(var4);
        }

        clone.parentNode = parent;
        clone.siblingIndex = parent == null ? 0 : this.siblingIndex;
        return clone;
    }

    private static class OuterHtmlVisitor implements NodeVisitor {
        private Appendable accum;
        private OutputSettings out;

        OuterHtmlVisitor(Appendable accum, OutputSettings out) {
            this.accum = accum;
            this.out = out;
            out.prepareEncoder();
        }

        public void head(Node node, int depth) {
            try {
                node.outerHtmlHead(this.accum, depth, this.out);
            } catch (IOException var4) {
                throw new SerializationException(var4);
            }
        }

        public void tail(Node node, int depth) {
            if (!node.nodeName().equals("#text")) {
                try {
                    node.outerHtmlTail(this.accum, depth, this.out);
                } catch (IOException var4) {
                    throw new SerializationException(var4);
                }
            }
        }
    }
}
