A Simple HTTP-Based Web Crawler

HTTP Overview

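HTTP is the foundational protocol used by most web applications today. For example, a browser visiting a website and a mobile app talking to its backend server both communicate over HTTP.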

HTTP stands for HyperText Transfer Protocol. It is a request-response protocol built on top of TCP.

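The format of an HTTP request is fixed: it consists of two parts, the HTTP Header section and the HTTP Body. The first line is always made up of the request method, the path, and the HTTP version, separated by spaces. For example, GET / HTTP/1.1 means a GET request for the path / using HTTP/1.1.
Every subsequent line has the fixed form Header: Value, and these lines are called HTTP headers. The server relies on certain headers to interpret the client's request, for example:
Host: the domain name being requested. A single server may host several websites, so Host is needed to tell which site the request is meant for;
User-Agent: identifies the client itself. Different browsers send different identifiers, and the server uses User-Agent to judge whether the client is IE, Chrome, Firefox, or a Python crawler;
Accept: the response formats the client can handle. */* means any format, text/* means any text format, and image/png means a PNG image;
Accept-Language: the languages the client accepts, listed in order of preference. The server uses this field to return the page in a suitable language.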

A GET request normally consists of HTTP headers only and has no HTTP Body.

A POST request carries a Body, which is separated from the headers by a blank line.

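A POST request usually sets Content-Type to describe the type of the Body and Content-Length to give the length of the Body in bytes, so that the server can respond correctly based on the request's headers and Body.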
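To make the request layout described above concrete, here is a minimal sketch that assembles the text of a request by hand. The class name RawHttpRequest, the host example.com, the path /login, and the form body are all made up for illustration; real code would normally let an HTTP client build this text.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: builds the text of an HTTP/1.1 request by hand. */
final class RawHttpRequest {

    /**
     * Assembles the request line, the headers, the blank line and an optional body.
     * Pass body = null for a request without a body (for example GET).
     */
    static String build(String method, String path, Map<String, String> headers, String body) {
        StringBuilder sb = new StringBuilder();
        // Request line: method, path, version
        sb.append(method).append(' ').append(path).append(" HTTP/1.1\r\n");

        // Copy the headers so the caller's map is not modified
        Map<String, String> all = new LinkedHashMap<>(headers);
        if (body != null) {
            // Content-Length counts bytes, not characters
            int length = body.getBytes(StandardCharsets.UTF_8).length;
            all.put("Content-Length", String.valueOf(length));
        }
        for (Map.Entry<String, String> h : all.entrySet()) {
            sb.append(h.getKey()).append(": ").append(h.getValue()).append("\r\n");
        }

        sb.append("\r\n"); // the blank line ends the header section
        if (body != null) {
            sb.append(body);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical GET request: headers only
        Map<String, String> get = new LinkedHashMap<>();
        get.put("Host", "example.com");
        get.put("Accept", "*/*");
        System.out.print(build("GET", "/", get, null));

        // Hypothetical POST request: headers, blank line, then the body
        Map<String, String> post = new LinkedHashMap<>();
        post.put("Host", "example.com");
        post.put("Content-Type", "application/x-www-form-urlencoded");
        System.out.println(build("POST", "/login", post, "user=bob&lang=zh"));
    }
}
```

Note that the GET output ends right after the blank line, while the POST output continues with the body and carries a Content-Length computed from the body's bytes.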

An HTTP response likewise consists of a Header section and a Body. A typical HTTP response looks like this:

HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 133251

<!DOCTYPE html>
<html><body>
<h1>Hello</h1>
...

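The first line of a response is always made up of the HTTP version, the status code, and the reason phrase, separated by spaces.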

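For example, HTTP/1.1 200 OK means the version is HTTP/1.1, the status code is 200, and the reason phrase is OK. Clients rely only on the status code to decide whether an HTTP response succeeded. HTTP status codes fall into fixed classes:
1xx: an informational response; for example, 101 means the protocol is about to be switched, which is common when establishing a WebSocket connection;
2xx: a successful response; for example, 200 means success and 206 means only part of the content was sent;
3xx: a redirection; for example, 301 is a permanent redirect and 303 tells the client to send a new request to the specified location;
4xx: an error caused by the client; for example, 400 means an invalid request (caused by a bad Content-Type, among other reasons) and 404 means the specified path does not exist;
5xx: an error caused by the server; for example, 500 means an internal server error and 503 means the server is temporarily unable to respond.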
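As a small illustration of the status line and the code classes listed above, the following sketch splits a status line into its three parts and maps the code to its class by the first digit. The class name StatusLine and the category labels are illustrative choices, not part of any library.

```java
/** Illustrative only: parses a status line and classifies the status code. */
final class StatusLine {
    final String version;
    final int code;
    final String reason;

    private StatusLine(String version, int code, String reason) {
        this.version = version;
        this.code = code;
        this.reason = reason;
    }

    /** Parses a line such as "HTTP/1.1 200 OK" into version, code and reason phrase. */
    static StatusLine parse(String line) {
        // Split into at most three parts so a reason phrase with spaces stays intact
        String[] parts = line.trim().split(" ", 3);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Malformed status line: " + line);
        }
        int code = Integer.parseInt(parts[1]);
        String reason = parts.length == 3 ? parts[2] : "";
        return new StatusLine(parts[0], code, reason);
    }

    /** Maps a status code to its class using the hundreds digit. */
    static String category(int code) {
        switch (code / 100) {
            case 1: return "informational";
            case 2: return "success";
            case 3: return "redirection";
            case 4: return "client error";
            case 5: return "server error";
            default: return "unknown";
        }
    }

    public static void main(String[] args) {
        StatusLine s = parse("HTTP/1.1 404 Not Found");
        System.out.println(s.code + " -> " + category(s.code)); // prints "404 -> client error"
    }
}
```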

HTTP Programming

URL: Uniform Resource Locator
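As a quick sketch of what a URL contains, the snippet below takes a URL apart with the getters of java.net.URL. No network access happens here. The class name UrlParts and the output format are made up for illustration. The URL is built through URI.create(...).toURL() because the URL(String) constructor is deprecated in recent JDKs (since Java 20); on older JDKs, new URL(...) works just as well.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

/** Illustrative only: shows the parts that make up a URL. */
final class UrlParts {

    /** Returns "protocol | host | port | path" for the given URL string. */
    static String describe(String spec) {
        try {
            URL url = URI.create(spec).toURL();
            // getPort() returns -1 when the URL does not name a port explicitly
            int port = url.getPort() == -1 ? url.getDefaultPort() : url.getPort();
            return url.getProtocol() + " | " + url.getHost() + " | " + port + " | " + url.getPath();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + spec, e);
        }
    }

    public static void main(String[] args) {
        // prints "https | img2.doubanio.com | 443 | /view/photo/m/public/p2875247682.webp"
        System.out.println(describe("https://img2.doubanio.com/view/photo/m/public/p2875247682.webp"));
    }
}
```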

The steps for downloading a single image are as follows:
Find the image's path on the website and create a URL instance from that path

URL imageur1=new URL("https://img2.doubanio.com/view/photo/m/public/p2875247682.webp");

Open a connection from the URL instance

HttpURLConnection connect = (HttpURLConnection)imageur1.openConnection();

Set the request method to GET

connect.setRequestMethod("GET");

Set a request header

connect.setRequestProperty("User-Agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.49");

Finally, read the image through an input stream and write it to a file through an output stream
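The preparation steps above can also be collected into one helper. The sketch below is one possible arrangement, not the only one: the class name ConnectionSetup, the helper names, and the timeout values are assumptions added for illustration. It also adds two things the steps do not mention: timeouts, so a stalled server cannot block the program forever, and a status check before reading the stream.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Illustrative only: prepares a GET request without sending it yet. */
final class ConnectionSetup {

    /** Hypothetical helper: creates and configures the connection object. */
    static HttpURLConnection prepare(String spec, String userAgent) {
        try {
            // openConnection() only creates the object; no request is sent here
            HttpURLConnection connect =
                    (HttpURLConnection) URI.create(spec).toURL().openConnection();
            connect.setRequestMethod("GET");
            connect.setRequestProperty("User-Agent", userAgent);
            connect.setConnectTimeout(5_000);  // assumed value, in milliseconds
            connect.setReadTimeout(10_000);    // assumed value, in milliseconds
            return connect;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True only for 200 OK. Calling this sends the request to the server. */
    static boolean isOk(HttpURLConnection connect) throws IOException {
        return connect.getResponseCode() == HttpURLConnection.HTTP_OK;
    }
}
```

A caller would typically invoke isOk(connect) first and only call connect.getInputStream() when it returns true.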

The complete code is as follows:

package com.gjh.demo01;

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.ProtocolException;
import java.net.URL;

public class Text02 {
    public static void main(String[] args) {
        try {
            // A movie poster image (this is the image's uniform resource locator)
            URL imageur1 = new URL("https://img2.doubanio.com/view/photo/m/public/p2875247682.webp");
            // Open the connection
            HttpURLConnection connect = (HttpURLConnection) imageur1.openConnection();
            // Set the request method to GET
            connect.setRequestMethod("GET");
            // Set a request header
            connect.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.49");
            try (
                // Read the image
                BufferedInputStream bis = new BufferedInputStream(connect.getInputStream());
                // Store the image (write its bytes to the output file).
                // The target directory must already exist. Note that the source URL ends in .webp
                // while the file is saved with a .jpg extension.
                BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream("D:\\Text\\douban file\\" + System.currentTimeMillis() + ".jpg"));
            ) {
                // Read and write chunk by chunk
                byte[] buff = new byte[1024];
                int len = -1;
                while ((len = bis.read(buff)) != -1) {
                    bos.write(buff, 0, len);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        } catch (MalformedURLException e1) {
            e1.printStackTrace();
        } catch (ProtocolException e1) {
            e1.printStackTrace();
        } catch (IOException e1) {
            e1.printStackTrace();
        }
    }
}

The result of running the program is as follows:

Downloading all poster images from the site's home page

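This time, pass the path of the site's home page when creating the URL instance. The HTML returned by the site then has to be processed to pull out the image information (its format is shown in the figure below). We can either extract it line by line with string operations inside a loop, or parse it with jsoup.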

1. Extracting with string operations

package com.gjh.demo01;

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.ProtocolException;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class Text03 {
    public static void main(String[] args) {
        // Fetch the poster images on the Douban home page and save them to the given directory
        try {
            URL imageur1 = new URL("https://movie.douban.com/");
            HttpURLConnection connect = (HttpURLConnection) imageur1.openConnection();
            connect.setRequestMethod("GET");
            connect.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.49");
            // try-with-resources closes the reader once the page has been read
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(connect.getInputStream(), StandardCharsets.UTF_8))) {
                String line = null;
                while ((line = reader.readLine()) != null) {
                    line = line.trim();
                    if (line.startsWith("<img") && line.contains("https://img") && line.contains(".jpg")) {
                        // System.out.println(line);
                        // Use string slicing to extract the image path and the movie name
                        int startPath = line.indexOf("https:");
                        int endPath = line.indexOf(".jpg") + 4;
                        String Path = line.substring(startPath, endPath);
                        int startName = line.indexOf("alt=") + 5;
                        int endName = line.indexOf("\"", startName);
                        String Name = line.substring(startName, endName);
                        // System.out.println(Path);
                        // System.out.println(Name);
                        URL imageUr1 = new URL(Path);
                        HttpURLConnection imageUr1connect = (HttpURLConnection) imageUr1.openConnection();
                        try (BufferedInputStream in = new BufferedInputStream(imageUr1connect.getInputStream());
                             BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream("D:\\Text\\douban file\\" + Name + ".jpg"));) {
                            byte[] buff = new byte[1024];
                            int len = -1;
                            while ((len = in.read(buff)) != -1) {
                                out.write(buff, 0, len);
                            }
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                }
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (ProtocolException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
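The string extraction in the program above can be pulled out into small helpers so that it can be tried without a network connection. The following is a sketch under assumptions: the class name ImgTag, its method names, and the sample img line are made up for illustration, and the helper only understands attributes written as name="value" after a space. It also returns null instead of failing when an attribute is missing, and replaces characters that Windows does not allow in file names, since the movie name is used as the file name.

```java
/** Illustrative only: extracts src and alt from a single <img ...> line. */
final class ImgTag {

    /** Returns {src, alt}, or null if either attribute is missing. */
    static String[] extract(String line) {
        String src = attr(line, "src");
        String alt = attr(line, "alt");
        return (src == null || alt == null) ? null : new String[] { src, alt };
    }

    /** Reads a double-quoted attribute value; returns null when it is absent. */
    static String attr(String line, String name) {
        // The leading space avoids matching longer names such as data-src
        String key = " " + name + "=\"";
        int start = line.indexOf(key);
        if (start < 0) {
            return null;
        }
        start += key.length();
        int end = line.indexOf('"', start);
        return end < 0 ? null : line.substring(start, end);
    }

    /** Replaces characters that are not allowed in Windows file names with '_'. */
    static String safeName(String name) {
        return name.replaceAll("[\\\\/:*?\"<>|]", "_").trim();
    }

    public static void main(String[] args) {
        // Made-up sample line in the shape the crawler looks for
        String line = "<img src=\"https://img1.doubanio.com/a/b.jpg\" alt=\"Some Film: Part 2\" class=\"\">";
        String[] parts = extract(line);
        System.out.println(parts[0]);           // the image path
        System.out.println(safeName(parts[1])); // the name, safe to use in a file path
    }
}
```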

2. Parsing with jsoup

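Jsoup class: the entry point that performs the initial parsing
Document class: the web page document (contains all the parsed tags)
Elements class: a collection of Element objects (extends ArrayList)
Element class: a single HTML element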

package com.gjh.demo01;

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.ProtocolException;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Text04 {
    public static void main(String[] args) {
        /*
         * In each loop iteration, extract the attributes from the current line
         */
        try {
            URL imageur1 = new URL("https://movie.douban.com/");
            HttpURLConnection connect = (HttpURLConnection) imageur1.openConnection();
            connect.setRequestMethod("GET");
            connect.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.49");
            // try-with-resources closes the reader once the page has been read
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(connect.getInputStream(), StandardCharsets.UTF_8))) {
                String line = null;
                while ((line = reader.readLine()) != null) {
                    line = line.trim();
                    if (line.startsWith("<img") && line.contains("https://img") && line.contains(".jpg")) {
                        // System.out.println(line);
                        // Parse the line into a Document object
                        Document doc = Jsoup.parse(line);
                        // Get all elements whose tag name is img from the Document (an Elements collection),
                        // then take the first one from that collection
                        Element imagelement = doc.getElementsByTag("img").first();
                        // Read the src and alt attributes of the img element
                        String src = imagelement.attr("src"); // the image path
                        String alt = imagelement.attr("alt"); // the movie name
                        URL imageUr1 = new URL(src);
                        HttpURLConnection imageUr1connect = (HttpURLConnection) imageUr1.openConnection();
                        try (BufferedInputStream in = new BufferedInputStream(imageUr1connect.getInputStream());
                             BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream("D:\\Text\\douban file\\" + alt + ".jpg"));) {
                            byte[] buff = new byte[1024];
                            int len = -1;
                            while ((len = in.read(buff)) != -1) {
                                out.write(buff, 0, len);
                            }
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                }
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (ProtocolException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Both ways of obtaining the image paths produce the following result: