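Web Crawler Tutorial for Beginners (HttpClient)

1. Study plan

Getting-started program
Introduction to web crawlers
Fetching data with HttpClient
Parsing data with Jsoup
Crawler case study

2. Web crawlers

A web crawler is a program or script that automatically fetches information from the World Wide Web according to a set of rules.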
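
To make "according to a set of rules" concrete, here is a minimal sketch of the loop most crawlers are built around: take a URL from a queue, process the page, and queue any links that have not been seen yet. It makes no network requests. The in-memory map `fakeWeb` and the class name `CrawlLoopSketch` are made up for illustration and stand in for real pages and their links.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class CrawlLoopSketch {
    // Hypothetical in-memory "web": page URL -> links found on that page.
    static Map<String, List<String>> fakeWeb() {
        Map<String, List<String>> web = new HashMap<>();
        web.put("/index", Arrays.asList("/a", "/b"));
        web.put("/a", Arrays.asList("/b", "/index"));
        web.put("/b", Arrays.asList("/c"));
        web.put("/c", Collections.<String>emptyList());
        return web;
    }

    // Visit pages breadth-first from the seed, skipping pages already seen.
    static List<String> crawl(Map<String, List<String>> web, String seed) {
        List<String> visitedInOrder = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Queue<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            // A real crawler would download and store the page here.
            visitedInOrder.add(url);
            List<String> links = web.get(url);
            if (links == null) {
                continue;
            }
            for (String link : links) {
                // add() returns false when the link was already seen.
                if (seen.add(link)) {
                    frontier.add(link);
                }
            }
        }
        return visitedInOrder;
    }

    public static void main(String[] args) {
        System.out.println(crawl(fakeWeb(), "/index")); // prints [/index, /a, /b, /c]
    }
}
```

The `seen` set is what keeps the loop from revisiting `/index` and `/b` even though several pages link to them.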

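2.1 A first crawler program

2.1.1 Environment setup

JDK 1.8
IntelliJ IDEA
The Maven bundled with IDEA

2.1.2 Create the project

Create a Maven project named itcast-crawler-first and add the following dependencies to pom.xml: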

<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.13</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.slf4j/slf4j-log4j12 -->
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.7.25</version>
        <scope>test</scope>
    </dependency>
</dependencies>

2.1.3 Add log4j.properties

log4j.rootLogger=DEBUG,A1
log4j.logger.cn.itcast = DEBUG
log4j.appender.A1=org.apache.log4j.ConsoleAppender
log4j.appender.A1.layout=org.apache.log4j.PatternLayout
log4j.appender.A1.layout.ConversionPattern=%-d{yyyy-MM-dd HH:mm:ss,SSS} [%t] [%c]-[%p] %m%n

### Output to the console ###
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target = System.out
log4j.appender.stdout.layout = org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern = [%-5p] %d{yyyy-MM-dd HH:mm:ss,SSS} method:%l%n%m%n

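2.1.4 Write the code

Write the simplest possible crawler, which fetches the itcast home page: http://www.itcast.cn/

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class CrawlerFirst {
    public static void main(String[] args) throws IOException {
        // 1. "Open the browser": create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // 2. "Type the address": create an HttpGet object for the GET request
        HttpGet httpGet = new HttpGet("http://www.itcast.cn");
        // 3. "Press Enter": execute the request with the HttpClient and get the response
        CloseableHttpResponse response = httpClient.execute(httpGet);
        // 4. Parse the response; read the body only if the status code is 200
        if (response.getStatusLine().getStatusCode() == 200) {
            HttpEntity httpEntity = response.getEntity();
            String content = EntityUtils.toString(httpEntity, "utf-8");
            System.out.println(content);
        }
        // Release the response and the client when done
        response.close();
        httpClient.close();
    }
}

Test result: the page data is fetched successfully.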

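3. Web crawlers

3.1 Introduction to web crawlers

3.2 Why learn web crawling

We now have a first impression of web crawlers, but why learn to write them? Knowing the purpose makes the topic easier to study. Here are four common reasons for learning web crawling: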

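1. You can build a search engine.

Once we can write crawlers, we can use them to collect information from the internet automatically and then store or process it. When we need to look something up, we only have to search the collected data, which amounts to a private search engine.

2. In the big data era, crawlers give us access to more data sources.

Big data analysis and data mining need source data. It can come from sites that publish statistics, or from literature and internal documents, but these sources often do not cover what we need, and searching the internet by hand takes too much effort. A crawler can automatically fetch the content we are interested in and bring it back as a data source for deeper analysis, which in turn yields more valuable information.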

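3. You can do better search engine optimization (SEO).

To do their job well, SEO practitioners need a clear understanding of how search engines work, including how search engine crawlers work.
Learning to write crawlers gives a deeper understanding of how those crawlers behave, so that optimization work is based on knowing the other side as well as your own.

4. It helps with employment.

From a career point of view, crawler engineering is a good direction. Demand for crawler engineers is growing while relatively few people are qualified for these positions, so the skill is in short supply. With the arrival of big data and artificial intelligence, crawling technology will be applied more and more widely and has good prospects.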

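4. HttpClient

A web crawler is a program that accesses resources on the network for us. We have always used the HTTP protocol to visit web pages in a browser, and a crawler uses the same HTTP protocol from code.
Here we use HttpClient, an HTTP client library for Java, to fetch web page data.

4.1 GET requests

Request the itcast home page. The request URL is:

http://www.itcast.cn/

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class HttpGetTest {
    public static void main(String[] args) {
        // Create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // Create the HttpGet object and set the URL to request
        HttpGet httpGet = new HttpGet("http://www.itcast.cn");
        CloseableHttpResponse response = null;
        try {
            // Execute the request with the HttpClient and get the response
            response = httpClient.execute(httpGet);
            // Parse the response
            if (response.getStatusLine().getStatusCode() == 200) {
                // Get the response entity
                HttpEntity entity = response.getEntity();
                // Convert the entity to a String
                String content = EntityUtils.toString(entity, "utf-8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                // Close the response (it is null if execute() failed), then the client
                if (response != null) {
                    response.close();
                }
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}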

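Note: the scope element must be commented out before the logs are printed, because a test-scoped dependency is not on the classpath when the main class runs.

<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-log4j12</artifactId>
    <version>1.7.25</version>
    <!--<scope>test</scope>-->
</dependency>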
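
The try/finally cleanup in the GET example can also be written with try-with-resources, which closes resources automatically in reverse order of declaration. CloseableHttpClient and CloseableHttpResponse both implement java.io.Closeable, so they can be declared the same way. The sketch below uses a made-up `FakeResource` class instead of the real client so that it runs without the HttpClient dependency and only demonstrates the closing order.

```java
import java.util.ArrayList;
import java.util.List;

class TryWithResourcesSketch {
    // Records the order in which resources are closed.
    static final List<String> closed = new ArrayList<>();

    // Hypothetical stand-in for a closeable resource such as a client or a response.
    static class FakeResource implements AutoCloseable {
        private final String name;

        FakeResource(String name) {
            this.name = name;
        }

        @Override
        public void close() {
            closed.add(name);
        }
    }

    static List<String> run() {
        closed.clear();
        try (FakeResource client = new FakeResource("client");
             FakeResource response = new FakeResource("response")) {
            // Use client and response here; no finally block is needed.
        }
        return new ArrayList<>(closed);
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [response, client]
    }
}
```

The response is closed before the client, which matches the order used in the finally block of the GET example.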


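4.2 GET requests with parameters

Search for learning videos on the itcast site. The URL is:

http://yun.itheima.com/search?keys=Java

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.utils.URIBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class HttpGetParamTest {
    public static void main(String[] args) throws Exception {
        // Create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // The target address is http://yun.itheima.com/search?keys=Java
        // Create the URIBuilder with the base address
        URIBuilder uriBuilder = new URIBuilder("http://yun.itheima.com/search");
        // Set the query parameter
        uriBuilder.setParameter("keys", "Java");
        // Create the HttpGet object from the built URI
        HttpGet httpGet = new HttpGet(uriBuilder.build());
        System.out.println("Request being sent: " + httpGet);
        CloseableHttpResponse response = null;
        try {
            // Execute the request with the HttpClient and get the response
            response = httpClient.execute(httpGet);
            // Parse the response
            if (response.getStatusLine().getStatusCode() == 200) {
                // Get the response entity
                HttpEntity entity = response.getEntity();
                // Convert the entity to a String
                String content = EntityUtils.toString(entity, "utf-8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                // Close the response (it is null if execute() failed), then the client
                if (response != null) {
                    response.close();
                }
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}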

To see all of the log output, increase the console buffer size in IDEA.
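
To show why the parameter is set through a builder rather than concatenated by hand, here is a rough JDK-only sketch of turning name/value pairs into an encoded query string. It is not how URIBuilder is implemented: `QueryStringSketch` and `buildUrl` are made-up names, and `URLEncoder` applies form encoding (a space becomes `+`), which may differ in detail from what URIBuilder produces.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class QueryStringSketch {
    // Percent-encode one name or value as UTF-8 (form encoding: space becomes '+').
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Append the parameters to the base address as ?name=value&name=value.
    static String buildUrl(String base, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(base);
        char separator = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(separator)
              .append(encode(e.getKey()))
              .append('=')
              .append(encode(e.getValue()));
            separator = '&';
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("keys", "Java");
        System.out.println(buildUrl("http://yun.itheima.com/search", params));
        // prints http://yun.itheima.com/search?keys=Java
    }
}
```

Characters with a special meaning in a URL, such as `&` or a space inside a search keyword, are escaped, so they cannot be mistaken for parameter separators.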


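4.3 POST requests

Request the itcast home page with POST. The request URL is:

http://www.itcast.cn/

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class HttpPostTest {
    public static void main(String[] args) {
        // Create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // Create the HttpPost object and set the URL to request
        HttpPost httpPost = new HttpPost("http://www.itcast.cn");
        CloseableHttpResponse response = null;
        try {
            // Execute the request with the HttpClient and get the response
            response = httpClient.execute(httpPost);
            // Parse the response
            if (response.getStatusLine().getStatusCode() == 200) {
                // Get the response entity
                HttpEntity entity = response.getEntity();
                // Convert the entity to a String
                String content = EntityUtils.toString(entity, "utf-8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                // Close the response (it is null if execute() failed), then the client
                if (response != null) {
                    response.close();
                }
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}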


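4.4 POST requests with parameters

Search for learning videos on the itcast site using a POST request. The URL is:

http://yun.itheima.com/search

The URL itself carries no parameters; the parameter keys=Java is submitted in the form body.

import org.apache.http.HttpEntity;
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class HttpPostParamTest {
    public static void main(String[] args) throws Exception {
        // Create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // Create the HttpPost object and set the URL to request
        HttpPost httpPost = new HttpPost("http://yun.itheima.com/search");
        // Declare a list that holds the form parameters
        List<NameValuePair> params = new ArrayList<NameValuePair>();
        // The form field keys=Java replaces the ?keys=Java query string of the GET version
        params.add(new BasicNameValuePair("keys", "Java"));
        // Create the form entity: the first argument is the form data, the second is the charset
        UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(params, "UTF-8");
        // Attach the form entity to the POST request
        httpPost.setEntity(formEntity);
        CloseableHttpResponse response = null;
        try {
            // Execute the request with the HttpClient and get the response
            response = httpClient.execute(httpPost);
            // Parse the response
            if (response.getStatusLine().getStatusCode() == 200) {
                // Get the response entity
                HttpEntity entity = response.getEntity();
                // Convert the entity to a String
                String content = EntityUtils.toString(entity, "utf-8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                // Close the response (it is null if execute() failed), then the client
                if (response != null) {
                    response.close();
                }
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}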


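4.5 Connection pool

Creating a new HttpClient for every request means connections are constantly created and destroyed. A connection pool avoids this.

Run the code below with a breakpoint in doGet: the HttpClient object obtained on each call is a different instance, while the underlying connections come from the shared pool.

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class HttpClientPoolTest {
    public static void main(String[] args) {
        // Create the connection pool manager
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        // Set the maximum total number of connections
        cm.setMaxTotal(100);
        // Set the maximum number of connections per host (for example Baidu, Sina, Sohu)
        cm.setDefaultMaxPerRoute(10);
        // Send two requests through the same connection pool manager
        doGet(cm);
        doGet(cm);
    }

    private static void doGet(PoolingHttpClientConnectionManager cm) {
        // Build a client backed by the shared pool instead of one with its own connections
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();
        HttpGet httpGet = new HttpGet("http://www.itcast.cn");
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpGet);
            if (response.getStatusLine().getStatusCode() == 200) {
                HttpEntity entity = response.getEntity();
                String content = EntityUtils.toString(entity, "UTF-8");
                System.out.println(content);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
                // Do not close httpClient here: the connections are managed by the shared pool
                // httpClient.close();
            }
        }
    }
}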
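
As a conceptual illustration of what pooling buys, here is a small JDK-only sketch: a released object is handed out again instead of a new one being created, and the pool refuses to grow past a fixed limit. It is not how PoolingHttpClientConnectionManager is implemented. `SimplePoolSketch` and `FakeConnection` are made-up names, and the limit only loosely mirrors the idea behind setMaxTotal.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class SimplePoolSketch {
    // Hypothetical stand-in for a connection that is expensive to create.
    static class FakeConnection {
        final int id;

        FakeConnection(int id) {
            this.id = id;
        }
    }

    private final Deque<FakeConnection> idle = new ArrayDeque<>();
    private final int maxTotal;
    private int created = 0;

    SimplePoolSketch(int maxTotal) {
        this.maxTotal = maxTotal;
    }

    // Reuse an idle connection if one exists; otherwise create one while under the limit.
    synchronized FakeConnection acquire() {
        FakeConnection c = idle.poll();
        if (c != null) {
            return c;
        }
        if (created >= maxTotal) {
            throw new IllegalStateException("pool exhausted");
        }
        created++;
        return new FakeConnection(created);
    }

    // Hand the connection back so a later acquire() can reuse it.
    synchronized void release(FakeConnection c) {
        idle.push(c);
    }

    synchronized int createdCount() {
        return created;
    }

    public static void main(String[] args) {
        SimplePoolSketch pool = new SimplePoolSketch(2);
        FakeConnection first = pool.acquire();
        pool.release(first);
        FakeConnection second = pool.acquire();
        System.out.println(first == second);     // prints true
        System.out.println(pool.createdCount()); // prints 1
    }
}
```

A real pool would typically make callers wait for a free connection rather than throw, which is what the connection request timeout in the next section is about.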

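4.6 Request configuration

Because of network conditions or the target server, a request may take longer to complete, so the relevant timeouts need to be customized.

import org.apache.http.HttpEntity;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class HttpConfigTest {
    public static void main(String[] args) {
        // Create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // Create the HttpGet object and set the URL to request
        HttpGet httpGet = new HttpGet("http://www.itcast.cn");
        // Configure the request; all values are in milliseconds
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(1000)          // maximum time to establish the connection
                .setConnectionRequestTimeout(500) // maximum time to obtain a connection from the connection manager
                .setSocketTimeout(10 * 1000)      // maximum time to wait for data while reading the response
                .build();
        // Apply the configuration to the request
        httpGet.setConfig(config);
        CloseableHttpResponse response = null;
        try {
            // Execute the request with the HttpClient and get the response
            response = httpClient.execute(httpGet);
            // Parse the response
            if (response.getStatusLine().getStatusCode() == 200) {
                // Get the response entity
                HttpEntity entity = response.getEntity();
                // Convert the entity to a String
                String content = EntityUtils.toString(entity, "utf-8");
                System.out.println(content.length());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                // Close the response (it is null if execute() failed), then the client
                if (response != null) {
                    response.close();
                }
                httpClient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}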
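
To see what a socket (read) timeout does in practice, the following JDK-only sketch connects to a local server that never sends any data and checks that the read gives up after the configured time. It uses plain java.net sockets rather than HttpClient, and `ReadTimeoutSketch` and `readTimesOut` are made-up names. The assumption is that this is the same kind of limit that setSocketTimeout applies to a request.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class ReadTimeoutSketch {
    // Returns true if reading from a silent local server times out after timeoutMillis.
    static boolean readTimesOut(int timeoutMillis) {
        // Port 0 asks the OS for any free port; the server listens but never writes.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
            // Maximum time a single read may block waiting for data.
            client.setSoTimeout(timeoutMillis);
            // No data ever arrives, so this blocks until the timeout fires.
            client.getInputStream().read();
            return false;
        } catch (SocketTimeoutException e) {
            return true;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readTimesOut(200)); // prints true
    }
}
```

Without a timeout the read would block indefinitely, which is why a crawler that visits many unknown servers usually sets one explicitly.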
