Crawling Baidu Search Results with HttpClient (Automatic Pagination)

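If you are not yet familiar with HttpClient, I suggest reading my other post, "HttpClient4.x之请求示例" (HttpClient 4.x request examples), before this one. The project is built with Maven, so you should have some familiarity with the JDK and Maven before reading on. I use Spring Tool Suite (STS) as the IDE, but any other IDE will work. The environment and versions are as follows: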

IDE: Spring Tool Suite (STS) 3.9.6.RELEASE

Maven version: 3.2.5

JDK version: 1.8.0_144

The first step is to add the dependencies used by the sample project:

     <dependency>
         <groupId>org.apache.httpcomponents</groupId>
         <artifactId>httpclient</artifactId>
         <version>4.5.3</version>
     </dependency>
     <dependency>
         <groupId>commons-logging</groupId>
         <artifactId>commons-logging</artifactId>
         <version>1.1.1</version>
     </dependency>
     <dependency>
         <groupId>log4j</groupId>
         <artifactId>log4j</artifactId>
         <version>1.2.17</version>
     </dependency>
     <dependency>
         <groupId>commons-lang</groupId>
         <artifactId>commons-lang</artifactId>
         <version>2.6</version>
     </dependency>
     <dependency>
         <groupId>org.jsoup</groupId>
         <artifactId>jsoup</artifactId>
         <version>1.8.3</version>
     </dependency>
     <dependency>
         <groupId>junit</groupId>
         <artifactId>junit</artifactId>
         <version>4.12</version>
         <scope>test</scope>
     </dependency>

Next is the entity class for a search result. It is used to store the title and URL of each crawled article.

package cn.zhuoqianmingyue.getoperation.baidusearch;

/**
 * A single search result.
 *
 * @author zhuoqianmingyue
 */
public class SearchData {

    private String title; // article title
    private String url;   // article URL

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }
}

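Approach:

Before walking through the implementation, let's look at Baidu's search parameters and the patterns in the results page.

Run a search on Baidu and copy the URL of the results page: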

https://www.baidu.com/s?wd=httpclinet&pn=20&oq=httpclinet&ie=utf-8&rsv_idx=1&rsv_pq=f04f5a140000a07c&rsv_t=0bfcfngHhMH3Vk5SnTN81kLVbKKYYKMY9rqyBKn64MnYAQRQ%2FzWD48knXc

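wd: the search keyword

pn: the offset of the results page (page 1 is 0, page 2 is 10, page 3 is 20)

oq: the previous search keyword

ie: the character encoding of the keyword

rsv_idx: purpose unknown; it does not seem to change and is always 1

rsv_pq: purpose unknown; it changes on every search

rsv_t: purpose unknown; it changes on every search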
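The page-to-offset rule for pn can be checked with a few lines of plain Java. This is a minimal sketch that uses only the JDK; the class name `BaiduQueryUrl` and its two methods are hypothetical names introduced for illustration, and only the parameters whose meaning is known (wd, pn, oq, ie, rsv_idx) are included.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for illustration; not part of the original project.
class BaiduQueryUrl {

    // Page 1 -> 0, page 2 -> 10, page 3 -> 20, as observed in the address bar.
    static int pnForPage(int pageNumber) {
        if (pageNumber < 1) {
            throw new IllegalArgumentException("pageNumber must be >= 1");
        }
        return (pageNumber - 1) * 10;
    }

    // Builds a search URL from the keyword and the 1-based page number.
    static String buildUrl(String keyword, int pageNumber) {
        String encoded;
        try {
            encoded = URLEncoder.encode(keyword, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a standard JVM.
            throw new IllegalStateException(e);
        }
        return "https://www.baidu.com/s?wd=" + encoded
                + "&pn=" + pnForPage(pageNumber)
                + "&oq=" + encoded
                + "&ie=utf-8&rsv_idx=1";
    }

    public static void main(String[] args) {
        // Prints https://www.baidu.com/s?wd=httpclinet&pn=20&oq=httpclinet&ie=utf-8&rsv_idx=1
        System.out.println(buildUrl("httpclinet", 3));
    }
}
```

The per-search values rsv_pq and rsv_t are left out on purpose, since their meaning is unknown.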

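Patterns in the HTML of the result data

Every result is an a tag that carries a data-click attribute. The article URL is in its href, and the article title is the text content of the a tag.

The paging area is the div whose id attribute is page. We can tell whether we have reached the last page by checking whether it contains "下一页" (next page).

One more thing to note: the "百度快照" (Baidu snapshot) link is also an a tag with a data-click attribute, so we need to filter it out.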
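The filtering rule can be sketched as a small predicate. This is an illustrative sketch only: `ResultLinkFilter` is a hypothetical name, the titles and URLs are made-up sample data standing in for a[data-click] elements, and the `javascript:;` check mirrors the one used in the utility class in this post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; titles and URLs below are made-up sample data.
class ResultLinkFilter {

    static final String SNAPSHOT_TITLE = "百度快照";

    // Keeps a link unless it has no usable href or is the snapshot link.
    static boolean isArticleLink(String title, String href) {
        if (href == null || href.isEmpty() || "javascript:;".equals(href)) {
            return false;
        }
        return title != null && !SNAPSHOT_TITLE.equals(title.trim());
    }

    public static void main(String[] args) {
        // Each entry is {link text, href}.
        List<String[]> links = Arrays.asList(
                new String[] {"HttpClient tutorial", "https://example.com/a"},
                new String[] {"百度快照", "https://example.com/snapshot"},
                new String[] {"more", "javascript:;"});

        List<String> kept = new ArrayList<>();
        for (String[] link : links) {
            if (isArticleLink(link[0], link[1])) {
                kept.add(link[0] + " -> " + link[1]);
            }
        }
        System.out.println(kept); // [HttpClient tutorial -> https://example.com/a]
    }
}
```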

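The overall logic is as follows. We loop over the page numbers, incrementing by one each time, and use HttpClient to fetch each results page; crawling stops once the paging area no longer contains "下一页". After fetching each page, we parse the HTML with jsoup, select every a tag that has a data-click attribute, and check whether its text is "百度快照". If it is not, we read the tag's text and href and store them in a SearchData object.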
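The loop itself can be tried offline before any HTTP code is written. The sketch below is a simplified stand-in, not the project's implementation: `PagingLoopSketch` and `collectPages` are hypothetical names, the fetcher is passed in as a function so that fake pages can be used, the "下一页" marker is searched for in the whole page rather than only inside div#page, and the `maxPages` cap is an added safety assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical sketch of the paging loop; the fetcher is injected so it runs offline.
class PagingLoopSketch {

    static final String NEXT_PAGE = "下一页";

    // Collects each page's HTML (including the last page) and stops when a page
    // has no next-page marker, is empty, or the maxPages cap is reached.
    static List<String> collectPages(IntFunction<String> fetchPage, int maxPages) {
        List<String> pages = new ArrayList<>();
        for (int page = 1; page <= maxPages; page++) {
            String html = fetchPage.apply(page);
            if (html == null || html.isEmpty()) {
                break; // nothing came back, e.g. the request failed
            }
            pages.add(html);
            if (!html.contains(NEXT_PAGE)) {
                break; // last page reached
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // Three fake pages: the first two link to a next page, the third does not.
        String[] fake = {
            "<div id=\"page\">下一页</div>",
            "<div id=\"page\">下一页</div>",
            "<div id=\"page\"></div>"
        };
        List<String> pages = collectPages(p -> fake[p - 1], 50);
        System.out.println(pages.size()); // 3
    }
}
```

Keeping the fetcher behind a function makes the stop condition easy to test without sending requests to Baidu.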

Implementation of the Baidu crawler utility class:

package cn.zhuoqianmingyue.getoperation.baidusearch;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.utils.URIBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.apache.log4j.Logger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class BaiDuSearchUtil {

    private static final Logger log = Logger.getLogger(BaiDuSearchUtil.class);

    /**
     * Crawls the articles related to searchKey, page by page.
     *
     * @param searchKey the search keyword
     * @return the results collected from all pages
     * @throws URISyntaxException if the search URI cannot be built
     */
    public static List<SearchData> search(String searchKey) throws URISyntaxException {
        List<SearchData> allSearchData = new ArrayList<SearchData>();
        int pageNumber = 1;
        boolean hasNextPage = true;
        while (hasNextPage) {
            log.info("Crawling page " + pageNumber);
            String searchInfoHtml = getSearchInfoHtml(searchKey, pageNumber);
            // Parse before checking for the end so the last page's results are kept.
            allSearchData.addAll(parseDataHtml(searchInfoHtml));
            hasNextPage = !isEndPage(searchInfoHtml);
            pageNumber++;
        }
        return allSearchData;
    }

    /**
     * Checks whether the given results page is the last one.
     *
     * @param searchInfoHtml HTML of a results page
     * @return true if the paging area is missing or has no "下一页" link
     */
    private static boolean isEndPage(String searchInfoHtml) {
        Document doc = Jsoup.parse(searchInfoHtml);
        Elements pageElements = doc.select("div#page");
        if (pageElements.isEmpty()) {
            // No paging area, e.g. the request failed or returned an unexpected page.
            return true;
        }
        return !pageElements.first().html().contains("下一页");
    }

    /**
     * Extracts the article titles and URLs from a results page.
     *
     * @param searchInfoHtml HTML of a results page
     * @return the results found on that page
     */
    private static List<SearchData> parseDataHtml(String searchInfoHtml) {
        List<SearchData> searchDataList = new ArrayList<SearchData>();
        Document doc = Jsoup.parse(searchInfoHtml);
        Elements links = doc.select("a[data-click]");
        for (Element element : links) {
            String url = element.attr("href");
            if (!"javascript:;".equals(url)) {
                // text() drops nested tags such as <em> used for keyword highlighting.
                String title = element.text();
                if (!"百度快照".equals(title)) {
                    SearchData searchData = new SearchData();
                    searchData.setTitle(title);
                    searchData.setUrl(url);
                    searchDataList.add(searchData);
                }
            }
        }
        return searchDataList;
    }

    /**
     * Fetches one Baidu results page.
     *
     * @param searchKey  the search keyword
     * @param pageNumber the page to fetch, starting at 1
     * @return the page HTML, or an empty string if the request failed
     * @throws URISyntaxException if the search URI cannot be built
     */
    private static String getSearchInfoHtml(String searchKey, int pageNumber) throws URISyntaxException {
        String searchInfoHtml = "";
        URI uri = dualWithParameter(searchKey, pageNumber);
        HttpGet httpGet = new HttpGet(uri);
        addHeaders(httpGet);
        // try-with-resources closes both the client and the response.
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
                CloseableHttpResponse response = httpClient.execute(httpGet)) {
            int statusCode = response.getStatusLine().getStatusCode();
            if (statusCode == 200) {
                searchInfoHtml = EntityUtils.toString(response.getEntity(), "UTF-8");
            }
        } catch (IOException e) {
            log.error("Failed to fetch page " + pageNumber, e);
        }
        return searchInfoHtml;
    }

    /**
     * Sets the request headers on httpGet.
     *
     * @param httpGet the request to modify
     */
    private static void addHeaders(HttpGet httpGet) {
        httpGet.addHeader("Host", "www.baidu.com");
        httpGet.addHeader("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36");
    }

    /**
     * Builds the URI with the query parameters that a Baidu search needs.
     *
     * @param searchKey  the search keyword
     * @param pageNumber the page to fetch, starting at 1
     * @return the search URI
     * @throws URISyntaxException if the URI cannot be built
     */
    private static URI dualWithParameter(String searchKey, int pageNumber) throws URISyntaxException {
        URIBuilder uriBuilder = new URIBuilder(new URI("https://www.baidu.com/s"));
        uriBuilder.setParameter("wd", searchKey);                              // search keyword
        uriBuilder.setParameter("pn", String.valueOf((pageNumber - 1) * 10)); // page offset: 0, 10, 20, ...
        uriBuilder.setParameter("oq", searchKey);                              // previous search keyword
        uriBuilder.setParameter("ie", "utf-8");                                // keyword encoding
        uriBuilder.setParameter("rsv_idx", "1");
        uriBuilder.setParameter("f", "8");                                     // user-initiated search (the user clicked the "百度一下" button)
        uriBuilder.setParameter("rsv_bq", "1");
        uriBuilder.setParameter("tn", "baidu");
        return uriBuilder.build();
    }

    public static void main(String[] args) throws URISyntaxException {
        List<SearchData> allSearchData = BaiDuSearchUtil.search("httpclinet");
        System.out.println(allSearchData.size());
    }
}

Reference: https://blog.csdn.net/qq_26816591/article/details/53335987

Source code: https://github.com/zhuoqianmingyue/httpclientexamples
