Java Web Crawlers

Contents

Java Web Crawlers
- 1. HttpClient
  - 1.1 GET Requests
  - 1.2 POST Requests
  - 1.3 Connection Pool
  - 1.4 Parameter Settings
- 2. Jsoup

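This article walks step by step through three technologies every crawler developer should know:

1. HttpClient (requests pages and fetches their content)
2. Jsoup (parses pages and extracts elements)
3. WebMagic (a Java crawler framework that wraps up the tedious steps from 1 and 2)

The WebMagic framework is left for the next installment.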

1. HttpClient

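Writing a web crawler really means using a Java program to request HTML pages and parse them. HttpClient handles the first half well: it fetches pages so that their data can be scraped. Without further ado, let's start with HttpClient.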

1.1 GET Requests

Tip: the snippets below cover only the objects and methods involved. Choose whatever exception handling suits your own project.

// An HttpClient object must be created before anything else
CloseableHttpClient httpClient = HttpClients.createDefault();
// Create the GET request
HttpGet httpGet = new HttpGet("https://www.baidu.com/");
// Send the request with HttpClient
CloseableHttpResponse response = httpClient.execute(httpGet);
// Get the response and check whether its status code is 200
if (response.getStatusLine().getStatusCode() == 200) {
    // 200 means the request succeeded, so read the returned data
    String content = EntityUtils.toString(response.getEntity(), "UTF-8");
    System.out.println(content);
}

The following is a complete, runnable program:

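import java.io.IOException;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

// The class name is illustrative
public class GetExample {
    public static void main(String[] args) throws IOException {
        // Create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // Create the GET request
        HttpGet httpGet = new HttpGet("https://www.baidu.com/");
        CloseableHttpResponse response = null;
        try {
            // Send the request with HttpClient
            response = httpClient.execute(httpGet);
            // Check whether the response status code is 200
            if (response.getStatusLine().getStatusCode() == 200) {
                // 200 means the request succeeded, so read the returned data
                String content = EntityUtils.toString(response.getEntity(), "UTF-8");
                // Print the page content
                System.out.println(content);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Release the connection; close the response only if one was returned
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            httpClient.close();
        }
    }
}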

Requests to the Baidu page sometimes fail. I have not found a reliable fix yet, but retrying a few times gets through. If you know the cause, please share it.
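Since retrying is the workaround here, the following is a minimal sketch of a retry helper that uses only the JDK. The class name `Retry`, the method `withRetries`, and the attempt count and delay are all made up for illustration. The task in `main` only simulates a request that fails twice; in a real crawler the `httpClient.execute(...)` call would go inside the task instead.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of HttpClient: runs a task up to maxAttempts times.
class Retry {
    static <T> T withRetries(Callable<T> task, int maxAttempts, long delayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();            // success: return immediately
            } catch (Exception e) {
                last = e;                      // remember the failure and try again
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(delayMillis); // short pause between attempts
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("interrupted while waiting to retry", ie);
                    }
                }
            }
        }
        throw new IllegalStateException("all " + maxAttempts + " attempts failed", last);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated flaky request: fails on the first two calls, then succeeds
        String page = withRetries(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new IOException("simulated failure");
            }
            return "<html>ok</html>";
        }, 5, 100L);
        System.out.println("Got " + page + " after " + calls[0] + " attempts");
    }
}
```

A fixed delay keeps the sketch short. Whether a fixed delay, a growing delay, or no retry at all is appropriate depends on why the request fails, which the article leaves open.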

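GET requests with parameters

Take Baidu as an example and suppose we want to search for the Samsung S20 phone. First, look at the Baidu search URL: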

https://www.baidu.com/s?wd=三星S20

Now for the code. The main methods are the same as above, so here is the runnable code directly:

public static void main(String[] args) throws IOException {
    // Create the HttpClient object
    CloseableHttpClient httpClient = HttpClients.createDefault();
    // Create the GET request; the address carries its parameter in the query string,
    // in the form https://www.baidu.com/s?wd=HttpClient
    String uri = "https://www.baidu.com/s?wd=三星S20";
    HttpGet httpGet = new HttpGet(uri);
    CloseableHttpResponse response = null;
    try {
        // Send the request with HttpClient
        response = httpClient.execute(httpGet);
        // Check whether the response status code is 200
        if (response.getStatusLine().getStatusCode() == 200) {
            // 200 means the request succeeded, so read the returned data
            String content = EntityUtils.toString(response.getEntity(), "UTF-8");
            // Print the page content
            System.out.println(content);
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // Release the connection; close the response only if one was returned
        if (response != null) {
            try {
                response.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        httpClient.close();
    }
}
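The search URL above contains Chinese characters. It may be accepted as written, but percent-encoding the parameter value is the safer habit once values contain spaces or symbols such as `&`. Here is a minimal sketch using the JDK's `java.net.URLEncoder`; the class `QueryUrl` and the method `searchUrl` are hypothetical names, and the Chinese text is written with Unicode escapes so the example does not depend on the source file encoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds a URL with one percent-encoded query parameter.
class QueryUrl {
    static String searchUrl(String base, String name, String value) {
        try {
            // URLEncoder applies form encoding: UTF-8 bytes become %XX, spaces become '+'
            return base + "?" + URLEncoder.encode(name, "UTF-8")
                    + "=" + URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8
            throw new IllegalStateException("UTF-8 not available", e);
        }
    }

    public static void main(String[] args) {
        // "\u4E09\u661F" is the word for Samsung used in the URL above
        String url = searchUrl("https://www.baidu.com/s", "wd", "\u4E09\u661FS20");
        System.out.println(url); // prints https://www.baidu.com/s?wd=%E4%B8%89%E6%98%9FS20
    }
}
```

The resulting string can be passed to `new HttpGet(...)` in place of the hand-written address.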

1.2 POST Requests

A POST request without parameters is used the same way as a GET request, except that this time you create an HttpPost object.

// Additional import: org.apache.http.client.methods.HttpPost
public static void main(String[] args) throws IOException {
    // Create the HttpClient object
    CloseableHttpClient httpClient = HttpClients.createDefault();
    // Create the POST request
    HttpPost httpPost = new HttpPost("http://www.itcast.cn/");
    CloseableHttpResponse response = null;
    try {
        // Send the request with HttpClient
        response = httpClient.execute(httpPost);
        // Check whether the response status code is 200
        if (response.getStatusLine().getStatusCode() == 200) {
            // 200 means the request succeeded, so read the returned data
            String content = EntityUtils.toString(response.getEntity(), "UTF-8");
            // Print the page content
            System.out.println(content);
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // Release the connection; close the response only if one was returned
        if (response != null) {
            try {
                response.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        httpClient.close();
    }
}

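POST requests with parameters

To send parameters with a POST request, you need a few helper objects that simulate a form submission.

The objects involved are: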

// Declare the List that holds the parameters
List<NameValuePair> params = new ArrayList<NameValuePair>();
params.add(new BasicNameValuePair("keys", "java"));
// Create the form entity from the parameters
UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(params, "UTF-8");
// Set the form entity on the httpPost request object
httpPost.setEntity(formEntity);

The runnable code is as follows:

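// Additional imports: java.util.ArrayList, java.util.List,
// org.apache.http.NameValuePair, org.apache.http.message.BasicNameValuePair,
// org.apache.http.client.entity.UrlEncodedFormEntity,
// org.apache.http.client.methods.HttpPost
public static void main(String[] args) throws IOException {
    // Create the HttpClient object
    CloseableHttpClient httpClient = HttpClients.createDefault();
    // Create the POST request
    HttpPost httpPost = new HttpPost("http://www.itcast.cn/");
    // Declare the List that holds the parameters
    List<NameValuePair> params = new ArrayList<NameValuePair>();
    params.add(new BasicNameValuePair("keys", "java"));
    // Create the form entity from the parameters
    UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(params, "UTF-8");
    // Set the form entity on the httpPost request object
    httpPost.setEntity(formEntity);
    CloseableHttpResponse response = null;
    try {
        // Send the request with HttpClient
        response = httpClient.execute(httpPost);
        // Check whether the response status code is 200
        if (response.getStatusLine().getStatusCode() == 200) {
            // 200 means the request succeeded, so read the returned data
            String content = EntityUtils.toString(response.getEntity(), "UTF-8");
            // Print the page content
            System.out.println(content);
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // Release the connection; close the response only if one was returned
        if (response != null) {
            try {
                response.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        httpClient.close();
    }
}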
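To make it clearer what a simulated form submission puts on the wire, here is a rough JDK-only sketch of the `application/x-www-form-urlencoded` body format. It is an illustration of the format, not HttpClient's own implementation, and the class `FormBody` and its `encode` method are hypothetical names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: turns name/value pairs into a form-encoded request body.
class FormBody {
    static String encode(Map<String, String> params) {
        StringJoiner body = new StringJoiner("&");   // pairs are separated by '&'
        for (Map.Entry<String, String> entry : params.entrySet()) {
            body.add(enc(entry.getKey()) + "=" + enc(entry.getValue()));
        }
        return body.toString();
    }

    private static String enc(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8
            throw new IllegalStateException("UTF-8 not available", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>(); // keeps insertion order
        params.put("keys", "java");
        System.out.println(encode(params)); // prints keys=java
    }
}
```

With the single parameter used in this section, the body is just `keys=java`; the encoding only becomes visible when a value contains spaces, `&`, `=`, or non-ASCII text.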

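1.3 Connection Pool

As the examples above show, every fetch creates a new connection and has to close it again afterwards. A connection pool avoids this repeated creation and teardown and makes crawling more efficient. Without further ado, on to the code.

Create the connection pool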

 PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();

Build an HttpClient that takes its connections from the pool

CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();
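To show why pooling saves work, here is a toy object pool built only on the JDK. It is a sketch of the general idea (hand back an idle object instead of creating a new one), not how `PoolingHttpClientConnectionManager` is actually implemented; the class `SimplePool` and all of its methods are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Toy pool for illustration only: reuses released objects instead of creating new ones.
class SimplePool<T> {
    private final BlockingQueue<T> idle;       // objects waiting to be reused
    private final Supplier<T> factory;         // creates a new object when none is idle
    private final AtomicInteger created = new AtomicInteger();

    SimplePool(int maxIdle, Supplier<T> factory) {
        this.idle = new ArrayBlockingQueue<>(maxIdle);
        this.factory = factory;
    }

    T borrow() {
        T item = idle.poll();                  // reuse an idle object if there is one
        if (item == null) {
            created.incrementAndGet();
            item = factory.get();              // otherwise pay the creation cost
        }
        return item;
    }

    void release(T item) {
        idle.offer(item);                      // silently dropped if the pool is already full
    }

    int createdCount() {
        return created.get();
    }

    public static void main(String[] args) {
        SimplePool<StringBuilder> pool = new SimplePool<>(2, StringBuilder::new);
        for (int i = 0; i < 5; i++) {
            StringBuilder sb = pool.borrow();
            pool.release(sb);
        }
        System.out.println("objects created: " + pool.createdCount()); // prints objects created: 1
    }
}
```

Five borrow/release rounds create only one object, which is the saving a connection pool aims for with network connections.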

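1.4 Parameter Settings

// Build the request configuration (all timeouts are in milliseconds)
RequestConfig requestConfig = RequestConfig.custom()
        .setConnectTimeout(1000)           // maximum time to establish a connection
        .setConnectionRequestTimeout(500)  // maximum time to obtain a connection
        .setSocketTimeout(10 * 1000)       // maximum time for data transfer
        .build();
// Apply the configuration to a request, for example the httpGet from section 1.1
httpGet.setConfig(requestConfig);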

2. Jsoup

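With HttpClient we can fetch web pages easily. Once we have a page, how do we parse it? This is where Jsoup comes in.

jsoup is a Java HTML parser that can parse a URL or a piece of HTML text directly. It provides a very convenient API for extracting and manipulating data through the DOM, CSS selectors, and jQuery-like methods. Its main features are:
1. Parse HTML from a URL, a file, or a string;
2. Find and extract data using DOM traversal or CSS selectors;
3. Manipulate HTML elements, attributes, and text.

Import the Jsoup dependency in a Maven project: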

<!-- Jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.3</version>
</dependency>
<!-- Testing -->
<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
</dependency>
<!-- Utilities -->
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.7</version>
</dependency>
<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.6</version>
</dependency>

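I won't go through Jsoup in detail here: I already know it well, and the Jsoup API documentation comes with a good tutorial. You can study it on the Chinese-language site: https://www.open-open.com/jsoup/
If there is demand, I will fill in this part of the tutorial later when I have time.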

Next, we move straight on to the key part: the WebMagic crawler framework.