
Source: blog.csdn.net/qq_35402412/article/details/113627625

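Goal

Crawl over a thousand images from Sogou Images for the keyword 美女 ("beautiful women") and download them to local disk.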

Preparation

Target URL: https://pic.sogou.com/pics?query=%E7%BE%8E%E5%A5%B3

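Analysis

Open the URL above, press F12 to open the developer tools, and go to Network > XHR. Scroll down the page and a request like the following appears in the XHR panel:

Request URL: https://pic.sogou.com/napi/pc/searchList?mode=1&start=48&xml_len=48&query=%E7%BE%8E%E5%A5%B3

The main parameters of this request URL are:

start=48: retrieval starts from the 48th image

xml_len=48: 48 images are returned, counting from that position

query=: the search keyword (here 美女; the browser percent-encodes it automatically, which does not affect our use)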
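As a small illustration of these three parameters, the sketch below assembles the same request URL in Java. The class name SogouUrlBuilder and its build method are hypothetical and are not part of the article's code; the endpoint and parameter names are the ones observed in the request above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for illustration; not part of the article's classes.
final class SogouUrlBuilder {
    // Endpoint and parameter names as observed in the XHR request.
    private static final String TEMPLATE =
            "https://pic.sogou.com/napi/pc/searchList?mode=1&start=%d&xml_len=%d&query=%s";

    static String build(int start, int pageSize, String keyword) {
        try {
            // Percent-encode the keyword as UTF-8, as the browser does.
            String encoded = URLEncoder.encode(keyword, "UTF-8");
            return String.format(TEMPLATE, start, pageSize, encoded);
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "\u7f8e\u5973" is the keyword used in the article.
        System.out.println(build(48, 48, "\u7f8e\u5973"));
    }
}
```

Note that URLEncoder applies form encoding, so a space in the keyword becomes "+" rather than "%20"; for single-word keywords like the one here the result matches the browser's URL.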

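Click the Response tab and paste the body into a JSON formatter to make it easier to read.

JSON formatter: https://www.bejson.com/

Inspecting the response shows that the image URLs we want are stored in the picUrl field.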
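To make the role of picUrl concrete, here is a dependency-free sketch that pulls every picUrl value out of a response body with a regular expression. The sample JSON is invented for illustration and only mimics the data > items > picUrl nesting that the article's code reads; the class name PicUrlExtractor is hypothetical. A real JSON parser is the more robust choice; the regex is used only to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; a JSON parser is preferable in real code.
final class PicUrlExtractor {
    // Matches "picUrl":"<value>" and captures the value.
    private static final Pattern PIC_URL =
            Pattern.compile("\"picUrl\"\\s*:\\s*\"([^\"]*)\"");

    static List<String> extract(String json) {
        List<String> urls = new ArrayList<>();
        Matcher m = PIC_URL.matcher(json);
        while (m.find()) {
            // JSON may escape slashes as \/ ; undo that for a usable URL.
            urls.add(m.group(1).replace("\\/", "/"));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Invented sample that mimics the data > items > picUrl nesting.
        String sample = "{\"data\":{\"items\":["
                + "{\"picUrl\":\"https://example.com/a.jpg\"},"
                + "{\"picUrl\":\"https:\\/\\/example.com\\/b.png\"}]}}";
        System.out.println(extract(sample));
    }
}
```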

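Approach

With the analysis above, the download routine is straightforward:

  1. Set the URL request parameters

  2. Send the request and read the image URLs from the response

  3. Store the image URLs in a List

  4. Iterate over the List and download the images to disk using a thread pool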
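Step 4 can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the article's implementation: the downloader is a stand-in Predicate so the example runs without network access, the names PoolDownloadSketch and runAll are hypothetical, and a fixed-size pool is used here for simplicity where the article's code uses a cached pool.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

// Hypothetical sketch of "iterate the list and download with a thread pool".
final class PoolDownloadSketch {

    /** Runs the stand-in downloader for every URL; returns {successes, failures}. */
    static int[] runAll(List<String> urls, Predicate<String> downloader, int threads) {
        AtomicInteger ok = new AtomicInteger();
        AtomicInteger failed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String url : urls) {
            pool.execute(() -> {
                try {
                    if (downloader.test(url)) {
                        ok.incrementAndGet();
                    } else {
                        failed.incrementAndGet();
                    }
                } catch (RuntimeException e) {
                    failed.incrementAndGet(); // a thrown error counts as a failure
                }
            });
        }
        pool.shutdown(); // accept no new tasks, let queued ones finish
        try {
            if (!pool.awaitTermination(60, TimeUnit.SECONDS)) {
                pool.shutdownNow(); // give up on tasks still running after the timeout
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
        return new int[] {ok.get(), failed.get()};
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("a.jpg", "b.jpg", "broken");
        // Stand-in downloader: "succeeds" only for names ending in .jpg.
        int[] result = runAll(urls, u -> u.endsWith(".jpg"), 4);
        System.out.println("ok=" + result[0] + ", failed=" + result[1]);
    }
}
```

The counters are AtomicInteger because several worker threads update them concurrently.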

Code

SougouImgProcessor.java: image crawling class

import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import us.codecraft.webmagic.utils.HttpClientUtils;
import victor.chang.crawler.pipeline.SougouImgPipeline;

import java.util.ArrayList;
import java.util.List;

/**
 * A simple PageProcessor.
 * @author code4crafter@gmail.com <br>
 * @since 0.1.0
 */
public class SougouImgProcessor {

    private String url;
    private SougouImgPipeline pipeline;
    private List<JSONObject> dataList;
    private List<String> urlList;
    private String word;

    public SougouImgProcessor(String url, String word) {
        this.url = url;
        this.word = word;
        this.pipeline = new SougouImgPipeline();
        this.dataList = new ArrayList<>();
        this.urlList = new ArrayList<>();
    }

    // Fetch one page of results and collect the picUrl of every item
    public void process(int idx, int size) {
        String res = HttpClientUtils.get(String.format(this.url, idx, size, this.word));
        JSONObject object = JSONObject.parseObject(res);
        JSONArray items = object.getJSONObject("data").getJSONArray("items");
        for (int i = 0; i < items.size(); i++) {
            JSONObject item = items.getJSONObject(i);
            this.urlList.add(item.getString("picUrl"));
            this.dataList.add(item);
        }
    }

    // Download
    public void pipelineData() {
        // Multithreaded
        pipeline.processSync(this.urlList, this.word);
    }

    public static void main(String[] args) {
        String url = "https://pic.sogou.com/napi/pc/searchList?mode=1&start=%s&xml_len=%s&query=%s";
        SougouImgProcessor processor = new SougouImgProcessor(url, "美女");

        // Start index, number of images per request, total number of images to crawl
        int start = 0, size = 50, limit = 1000;

        for (int i = start; i < start + limit; i += size) {
            processor.process(i, size);
        }

        processor.pipelineData();
    }
}

SougouImgPipeline.java: image download class

import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.List;
import java.util.Objects;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Store results in files.<br>
 * @author code4crafter@gmail.com <br>
 * @since 0.1.0
 */
public class SougouImgPipeline {

    private String extension = ".jpg";
    private String path;

    private final AtomicInteger suc = new AtomicInteger();
    private final AtomicInteger fails = new AtomicInteger();

    public SougouImgPipeline() {
        setPath("E:/pipeline/sougou");
    }

    public SougouImgPipeline(String path) {
        setPath(path);
    }

    public SougouImgPipeline(String path, String extension) {
        setPath(path);
        this.extension = extension;
    }

    public void setPath(String path) {
        this.path = path;
    }

    /**
     * Download one image.
     * @param url  image URL
     * @param cate category, used as the sub-directory name
     * @param name file name without extension
     * @throws Exception if the download fails
     */
    private void downloadImg(String url, String cate, String name) throws Exception {
        String path = this.path + "/" + cate + "/";
        File dir = new File(path);
        if (!dir.exists()) {    // create the directory if it does not exist
            dir.mkdirs();
        }
        // Take the extension from the URL; fall back to the default when the URL has none
        int dot = url.lastIndexOf('.');
        String realExt = dot >= 0 ? url.substring(dot) : extension;
        if (realExt.contains("/") || realExt.contains("?")) {
            realExt = extension;
        }
        String fileName = name + realExt;
        fileName = fileName.replace("-", "");
        String filePath = path + fileName;
        File img = new File(filePath);
        if (img.exists()) {   // skip files that were already downloaded
            System.out.println(String.format("File %s already exists locally", fileName));
            return;
        }

        URLConnection con = new URL(url).openConnection();
        con.setConnectTimeout(5000);
        con.setReadTimeout(5000);
        byte[] bs = new byte[1024];
        // try-with-resources closes both streams even when the copy fails
        try (InputStream inputStream = con.getInputStream();
             FileOutputStream os = new FileOutputStream(img)) {
            int len;
            while ((len = inputStream.read(bs)) != -1) {
                os.write(bs, 0, len);
            }
        } catch (Exception e) {
            img.delete();   // remove the partial file so a later run retries it
            throw e;
        }
        System.out.println("picUrl: " + url);
        System.out.println(String.format("Downloaded image #%s", suc.incrementAndGet()));
    }

    /**
     * Single-threaded processing.
     *
     * @param data image URLs
     * @param word search keyword
     */
    public void process(List<String> data, String word) {
        long start = System.currentTimeMillis();
        for (String picUrl : data) {
            if (picUrl == null)
                continue;
            try {
                // Name the file after the URL's hash; the URL itself is not a valid file name
                downloadImg(picUrl, word, Integer.toHexString(picUrl.hashCode()));
            } catch (Exception e) {
                fails.incrementAndGet();
            }
        }
        System.out.println("Succeeded: " + suc.get());
        System.out.println("Failed: " + fails.get());
        long end = System.currentTimeMillis();
        System.out.println("Elapsed: " + (end - start) / 1000 + " s");
    }

    /**
     * Multithreaded processing.
     *
     * @param data image URLs
     * @param word search keyword
     */
    public void processSync(List<String> data, String word) {
        long start = System.currentTimeMillis();
        ExecutorService executorService = Executors.newCachedThreadPool(); // cached thread pool
        for (int i = 0; i < data.size(); i++) {
            String picUrl = data.get(i);
            if (picUrl == null)
                continue;
            // Zero-padded index, e.g. 0007; indexes of 1000 and above keep all their digits
            String finalName = String.format("%04d", i);
            executorService.execute(() -> {
                try {
                    downloadImg(picUrl, word, finalName);
                } catch (Exception e) {
                    fails.incrementAndGet();
                }
            });
        }
        executorService.shutdown();
        try {
            if (!executorService.awaitTermination(60, TimeUnit.SECONDS)) {
                // On timeout, interrupt all threads in the pool.
                // executorService.shutdownNow();
            }
            System.out.println("AwaitTermination Finished");
            System.out.println("Total URLs: " + data.size());
            System.out.println("Succeeded: " + suc);
            System.out.println("Failed: " + fails);

            File dir = new File(this.path + "/" + word + "/");
            int len = Objects.requireNonNull(dir.list()).length;
            System.out.println("Files on disk: " + len);

            long end = System.currentTimeMillis();
            System.out.println("Elapsed: " + (end - start) / 1000.0 + " s");
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    /**
     * Multithreaded processing in segments.
     *
     * @param data      image URLs
     * @param word      search keyword
     * @param threadNum number of segments, one task per segment
     */
    public void processSync2(List<String> data, final String word, int threadNum) {
        if (data.size() < threadNum) {
            process(data, word);
        } else {
            ExecutorService executorService = Executors.newCachedThreadPool();
            int num = data.size() / threadNum;    // number of items per segment
            for (int i = 0; i < threadNum; i++) {
                int start = i * num;
                int end = (i + 1) * num;
                if (i == threadNum - 1) {
                    end = data.size();   // the last segment takes the remainder
                }
                final List<String> cutList = data.subList(start, end);
                executorService.execute(() -> process(cutList, word));
            }
            executorService.shutdown();
        }
    }
}

HttpClientUtils.java: HTTP request utility class

import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.methods.HttpUriRequest;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.conn.ssl.TrustStrategy;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.ssl.SSLContextBuilder;
import org.apache.http.util.EntityUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import javax.net.ssl.HostnameVerifier;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSession;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.security.GeneralSecurityException;
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * @author code4crafter@gmail.com
 * Date: 17/3/27
 */
public abstract class HttpClientUtils {

    public static final Logger logger = LoggerFactory.getLogger(HttpClientUtils.class);

    public static Map<String, List<String>> convertHeaders(Header[] headers) {
        Map<String, List<String>> results = new HashMap<String, List<String>>();
        for (Header header : headers) {
            List<String> list = results.get(header.getName());
            if (list == null) {
                list = new ArrayList<String>();
                results.put(header.getName(), list);
            }
            list.add(header.getValue());
        }
        return results;
    }

    /**
     * HTTP GET request.
     * @param url
     */
    public static String get(String url) {
        return get(url, "UTF-8");
    }

    /**
     * HTTP GET request.
     * @param url
     */
    public static String get(String url, String charset) {
        HttpGet httpGet = new HttpGet(url);
        return executeRequest(httpGet, charset);
    }

    /**
     * HTTP GET request with the AJAX request header added.
     * @param url
     */
    public static String ajaxGet(String url) {
        return ajaxGet(url, "UTF-8");
    }

    /**
     * HTTP GET request with the AJAX request header added.
     *
     * @param url
     */
    public static String ajaxGet(String url, String charset) {
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("X-Requested-With", "XMLHttpRequest");
        return executeRequest(httpGet, charset);
    }

    /**
     * @param url
     * @return
     */
    public static String ajaxGet(CloseableHttpClient httpclient, String url) {
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("X-Requested-With", "XMLHttpRequest");
        return executeRequest(httpclient, httpGet, "UTF-8");
    }

    /**
     * HTTP POST request with parameters passed as a map.
     */
    public static String post(String url, Map<String, String> dataMap) {
        return post(url, dataMap, "UTF-8");
    }

    /**
     * HTTP POST request with parameters passed as a map.
     */
    public static String post(String url, Map<String, String> dataMap, String charset) {
        HttpPost httpPost = new HttpPost(url);
        try {
            if (dataMap != null) {
                List<NameValuePair> nvps = new ArrayList<NameValuePair>();
                for (Map.Entry<String, String> entry : dataMap.entrySet()) {
                    nvps.add(new BasicNameValuePair(entry.getKey(), entry.getValue()));
                }
                UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(nvps, charset);
                formEntity.setContentEncoding(charset);
                httpPost.setEntity(formEntity);
            }
        } catch (UnsupportedEncodingException e) {
            logger.error("Unsupported charset: " + charset, e);
        }
        return executeRequest(httpPost, charset);
    }

    /**
     * HTTP POST request with the AJAX request header added and parameters passed as a map.
     */
    public static String ajaxPost(String url, Map<String, String> dataMap) {
        return ajaxPost(url, dataMap, "UTF-8");
    }

    /**
     * HTTP POST request with the AJAX request header added and parameters passed as a map.
     */
    public static String ajaxPost(String url, Map<String, String> dataMap, String charset) {
        HttpPost httpPost = new HttpPost(url);
        httpPost.setHeader("X-Requested-With", "XMLHttpRequest");
        try {
            if (dataMap != null) {
                List<NameValuePair> nvps = new ArrayList<NameValuePair>();
                for (Map.Entry<String, String> entry : dataMap.entrySet()) {
                    nvps.add(new BasicNameValuePair(entry.getKey(), entry.getValue()));
                }
                UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(nvps, charset);
                formEntity.setContentEncoding(charset);
                httpPost.setEntity(formEntity);
            }
        } catch (UnsupportedEncodingException e) {
            logger.error("Unsupported charset: " + charset, e);
        }
        return executeRequest(httpPost, charset);
    }

    /**
     * HTTP POST request with the AJAX request header added and a JSON body.
     */
    public static String ajaxPostJson(String url, String jsonString) {
        return ajaxPostJson(url, jsonString, "UTF-8");
    }

    /**
     * HTTP POST request with the AJAX request header added and a JSON body.
     */
    public static String ajaxPostJson(String url, String jsonString, String charset) {
        HttpPost httpPost = new HttpPost(url);
        httpPost.setHeader("X-Requested-With", "XMLHttpRequest");
        // Passing the charset here avoids garbled Chinese characters
        StringEntity stringEntity = new StringEntity(jsonString, charset);
        stringEntity.setContentEncoding(charset);
        stringEntity.setContentType("application/json");
        httpPost.setEntity(stringEntity);
        return executeRequest(httpPost, charset);
    }

    /**
     * Execute an HTTP request; accepts an HttpGet or HttpPost.
     */
    public static String executeRequest(HttpUriRequest httpRequest) {
        return executeRequest(httpRequest, "UTF-8");
    }

    /**
     * Execute an HTTP request; accepts an HttpGet or HttpPost.
     */
    public static String executeRequest(HttpUriRequest httpRequest, String charset) {
        CloseableHttpClient httpclient;
        if ("https".equals(httpRequest.getURI().getScheme())) {
            httpclient = createSSLInsecureClient();
        } else {
            httpclient = HttpClients.createDefault();
        }
        return executeRequest(httpclient, httpRequest, charset);
    }

    /**
     * Execute an HTTP request with the given client.
     * Note: the client is closed once the request completes.
     */
    public static String executeRequest(CloseableHttpClient httpclient, HttpUriRequest httpRequest, String charset) {
        String result = "";
        try {
            try {
                CloseableHttpResponse response = httpclient.execute(httpRequest);
                HttpEntity entity = null;
                try {
                    entity = response.getEntity();
                    result = EntityUtils.toString(entity, charset);
                } finally {
                    EntityUtils.consume(entity);
                    response.close();
                }
            } finally {
                httpclient.close();
            }
        } catch (IOException ex) {
            logger.error("Request failed: " + httpRequest.getURI(), ex);
        }
        return result;
    }

    /**
     * Create an SSL client that trusts every certificate and host name.
     * This disables certificate validation and is only suitable for crawling experiments.
     */
    public static CloseableHttpClient createSSLInsecureClient() {
        try {
            SSLContext sslContext = new SSLContextBuilder().loadTrustMaterial(new TrustStrategy() {
                @Override
                public boolean isTrusted(X509Certificate[] chain, String authType) throws CertificateException {
                    return true;
                }
            }).build();
            SSLConnectionSocketFactory sslsf = new SSLConnectionSocketFactory(sslContext, new HostnameVerifier() {
                @Override
                public boolean verify(String hostname, SSLSession session) {
                    return true;
                }
            });
            return HttpClients.custom().setSSLSocketFactory(sslsf).build();
        } catch (GeneralSecurityException ex) {
            throw new RuntimeException(ex);
        }
    }
}

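Running

Because of network problems and similar causes, not every image downloads successfully. Running the program several times raises the overall success rate, since files already on disk are skipped.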
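Instead of rerunning the whole program by hand, each download could be retried a bounded number of times. The sketch below is an assumption about how that might look and is not something the original code does; RetrySketch and attemptsUntilSuccess are hypothetical names, and the attempt is modeled as a BooleanSupplier so the example runs without network access.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical retry helper; the article's code does not retry failed downloads.
final class RetrySketch {

    /**
     * Calls attempt up to maxAttempts times.
     * Returns the 1-based number of the attempt that succeeded, or -1 if none did.
     */
    static int attemptsUntilSuccess(BooleanSupplier attempt, int maxAttempts) {
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                if (attempt.getAsBoolean()) {
                    return i;
                }
            } catch (RuntimeException e) {
                // Treat an exception like a failed attempt and try again.
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Stand-in for a flaky download that only succeeds on the third call.
        int used = attemptsUntilSuccess(() -> calls.incrementAndGet() >= 3, 5);
        System.out.println("succeeded on attempt " + used);
    }
}
```

In a real crawler a short pause between attempts would usually be added so that a struggling server is not hit again immediately.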

