Basic Usage of the Jsoup Crawler

What is Jsoup?

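jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It provides a convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods, which makes it a common choice for writing simple crawlers.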
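As a minimal sketch of the "parse HTML text, then select with CSS" idea described above, the following parses an in-memory string instead of a live page. The HTML fragment, the class name ParseStringExample, and the helper firstText are made up for illustration; only Jsoup.parse, select, first, text, and attr are jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ParseStringExample {
    // Hypothetical HTML fragment, invented for this example
    static final String HTML =
        "<html><body>"
        + "<div class=\"post\"><a class=\"title\" href=\"/a.html\">First post</a></div>"
        + "<div class=\"post\"><a class=\"title\" href=\"/b.html\">Second post</a></div>"
        + "</body></html>";

    // Returns the text of the first element matching the CSS query,
    // or an empty string when nothing matches
    static String firstText(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element first = doc.select(cssQuery).first();
        return first == null ? "" : first.text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        // jQuery-like selection: every a.title that is a direct child of div.post
        for (Element link : doc.select("div.post > a.title")) {
            System.out.println(link.text() + " -> " + link.attr("href"));
        }
    }
}
```

Because nothing is fetched over the network, this is a convenient way to try out selectors before pointing the crawler at a real site.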

Basic usage

Create a new Maven project

<dependencies>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.0.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpcore</artifactId>
        <version>4.0.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpmime</artifactId>
        <version>4.0.1</version>
    </dependency>
    <dependency>
        <groupId>commons-codec</groupId>
        <artifactId>commons-codec</artifactId>
        <version>1.4</version>
    </dependency>
    <dependency>
        <groupId>commons-logging</groupId>
        <artifactId>commons-logging</artifactId>
        <version>1.1.1</version>
    </dependency>
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>1.4</version>
    </dependency>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.11.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>3.1</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.11</version>
        <scope>compile</scope>
    </dependency>
</dependencies>

Test class

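import org.jsoup.Connection;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test;

// The enclosing class name is illustrative; the original shows only the method
public class CrawlerTest {

    @Test
    public void test111() throws Exception {
        // 1. The URL to crawl
        String targetUrl = "https://zhipeng0908.gitee.io";
        // 2. Get a Connection; the CrawlerUtil utility class is shown below
        //    (step 3, setting the request headers, happens inside it)
        Connection connect = CrawlerUtil.getConnection(targetUrl);
        // 4. Execute the request
        Connection.Response response = connect.method(Connection.Method.GET).execute();
        // 5. Process the result: parse the response into a DOM
        Document document = response.parse();
        // The <body></body> element
        Element bodyElement = document.body();
        // .post-header is the class name of a div in this page.
        // Elements extends ArrayList, so it can be iterated directly.
        Elements cardElement = bodyElement.select(".post-header");
        // Extract the text content of each post
        for (Element blog : cardElement) {
            Elements titleElement = blog.select(".post-title");
            String title = titleElement.text();
            Elements timeElement = blog.select(".post-meta > span.post-time > time");
            String time = timeElement.text();
            Elements linkElement = blog.select(".post-title-link");
            String link = linkElement.attr("href");
            System.out.println("Blog title: " + title + "\t" + "url: " + (targetUrl + link) + "\t" + "Published: " + time);
        }
    }
}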
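The test builds each post URL by concatenating targetUrl and the href, which works only when the href is a root-relative path such as /some/post/. As a hedged alternative, the sketch below resolves the href against the page URL with the JDK's java.net.URI, which also handles relative and already-absolute links. The class name LinkResolver and the sample paths are hypothetical and not taken from the site.

```java
import java.net.URI;

class LinkResolver {
    // Resolves an href found in a page against that page's URL.
    // The base should end with "/" when it has no path; java.net.URI
    // mishandles relative references against an empty base path.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "https://zhipeng0908.gitee.io/";
        // Root-relative href (hypothetical path)
        System.out.println(resolve(base, "/2020/05/01/hello/"));
        // Relative href (hypothetical path)
        System.out.println(resolve(base, "archives/"));
        // Already-absolute href is returned unchanged
        System.out.println(resolve(base, "https://example.com/x"));
    }
}
```

jsoup offers a similar facility of its own: Element.absUrl("href"), or the "abs:href" attribute key, resolves a link against the document's base URI.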

Utility class

import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class CrawlerUtil {

    public static Connection getConnection(String targetUrl) {
        Connection connect = Jsoup.connect(targetUrl);
        // 3. Set browser-like request headers
        connect.header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9");
        // Note: jsoup does not decode Brotli; remove "br" if a server
        // answers with Content-Encoding: br
        connect.header("Accept-Encoding", "gzip, deflate, br");
        connect.header("Accept-Language", "zh-CN,zh;q=0.9");
        connect.header("Cache-Control", "no-cache");
        connect.header("Connection", "keep-alive");
        connect.header("Cookie", "_ga=GA1.2.2130438396.1588431092; Hm_lvt_ec661610f14acf2457496da3a87d804d=1588840665,1589378478; Hm_lpvt_ec661610f14acf2457496da3a87d804d=1589378528");
        connect.header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36");
        return connect;
    }
}
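The utility class sets each header with a separate header call. One possible variation, shown here only as a sketch, collects the headers in a Map and also sets a timeout, which the original does not configure. The class name HeaderConfig, the header subset, and the 10-second timeout are assumptions made for illustration; headers(Map), userAgent, timeout, and request() are existing jsoup Connection methods.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Jsoup;

class HeaderConfig {
    // Illustrative subset of the headers used in the utility class
    static Map<String, String> defaultHeaders() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Accept-Language", "zh-CN,zh;q=0.9");
        headers.put("Cache-Control", "no-cache");
        return headers;
    }

    // Builds a configured Connection; no request is sent here
    static Connection configure(String targetUrl, int timeoutMillis) {
        return Jsoup.connect(targetUrl)
                .headers(defaultHeaders())
                .userAgent("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36")
                .timeout(timeoutMillis); // assumed value, in milliseconds
    }

    public static void main(String[] args) {
        Connection connect = configure("https://zhipeng0908.gitee.io", 10_000);
        // Inspect the prepared request without touching the network
        System.out.println(connect.request().header("Cache-Control"));
        System.out.println(connect.request().timeout());
    }
}
```

Keeping the headers in one map makes it easier to reuse or swap them when crawling several sites.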

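Result

Running the test prints one line per matched post, containing its title, full URL, and publication time.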
