Web Crawlers (Part 1): An IP Proxy Pool

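Anyone who writes crawlers knows that IP proxies are a necessity. Why? You all know already, so I'll skip that part. I recently wrote an IP proxy pool and I'm putting it up here for everyone to look at. Let's start.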

First, find a free IP proxy site and read its data. Watch your request frequency, though, so you don't bring the site down.
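One way to keep the request frequency under control is to enforce a minimum gap between two requests to the same site. The sketch below is only an illustration of that idea: the class name `PoliteThrottle`, the gap value and the injectable clock are assumptions of this example, not part of the service described in this post.

```java
import java.util.function.LongSupplier;

/** Hypothetical helper: enforces a minimum gap between requests to one site. */
class PoliteThrottle {
    private final long minGapMillis;
    private final LongSupplier clock;   // injectable, so the logic can be tested without sleeping
    private long nextAllowedAt = 0L;    // earliest time (ms) at which the next request may start

    PoliteThrottle(long minGapMillis, LongSupplier clock) {
        this.minGapMillis = minGapMillis;
        this.clock = clock;
    }

    /** Reserves the next request slot and returns how long (ms) the caller must wait for it. */
    synchronized long reserve() {
        long now = clock.getAsLong();
        long wait = Math.max(0L, nextAllowedAt - now);
        nextAllowedAt = now + wait + minGapMillis;
        return wait;
    }

    /** Blocks until the reserved slot is reached. */
    void acquire() {
        long wait = reserve();
        if (wait > 0) {
            try {
                Thread.sleep(wait);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();   // preserve the interrupt for the caller
            }
        }
    }
}
```

A crawler would create one instance per site, for example `new PoliteThrottle(2000, System::currentTimeMillis)`, and call `acquire()` before each page fetch.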

Dependencies used by this service: Spring Boot, Apache utilities, jsoup, fastjson, Redis, and so on.

First: a thread pool, so that several threads run the checks

package com.*.util.thread;

import org.apache.log4j.Logger;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Class name:   CustomExecutorService
 * Description:  thread pool configuration
 * Created:      2014-11-21 4:45:26 PM
 * Modified:     2016-10-21 4:45:26 PM
 * @version 1.1
 */
public class CustomExecutorService {

    private static final Logger log = Logger.getLogger(CustomExecutorService.class);

    // Worker threads per CPU core (hexadecimal 0x0a = 10)
    private static final int POOL_MULTIPLE = 0x0000000a;

    private static ExecutorService executorService;   // the thread pool
    private static CustomExecutorService instance;

    private CustomExecutorService() {
        // Pool size = available processors * POOL_MULTIPLE * 5
        executorService = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors() * POOL_MULTIPLE * 5);
    }

    /** Returns the thread pool, creating it on first use. */
    public static ExecutorService getExecutorService() {
        getInstance();
        return executorService;
    }

    /** Creates the pool once. Calling it at system startup is enough. */
    public static synchronized CustomExecutorService getInstance() {
        if (instance == null) {
            instance = new CustomExecutorService();
            log.info("Thread pool instance success");
        }
        return instance;
    }

    /** Call once, at system shutdown. */
    public static synchronized void destory() {
        log.info("Thread pool shutdown ...");
        if (executorService != null) {
            executorService.shutdown();
        }
    }

    /** Runs a task on the pool, initializing the pool first if necessary. */
    public static void execute(Runnable task) {
        if (executorService == null) {
            log.warn("Thread pool was not initialized; initializing it now.");
        }
        getInstance();
        executorService.execute(task);
    }

    public static void main(String[] args) {
        CustomExecutorService.getInstance();   // initialize once at system startup
        CustomExecutorService.execute(() -> System.out.println("task ran"));
        CustomExecutorService.destory();
    }
}
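A note on the sizing: `POOL_MULTIPLE` is `0x0a`, i.e. 10, and the constructor multiplies by a further 5, so the pool ends up with 50 threads per core. That is a plausible choice for proxy checks, which spend most of their time waiting on the network. The sketch below only illustrates the sizing rule and the submit, shutdown, await cycle with the plain JDK; `PoolDemo` and its method names are made up for this example.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical demo of sizing a fixed pool and draining a batch of tasks. */
class PoolDemo {

    /** Sizing rule used in the post: processors * multiple, never less than one thread. */
    static int poolSize(int processors, int multiple) {
        return Math.max(1, processors * multiple);
    }

    /** Runs taskCount trivial tasks on a fixed pool and returns how many completed. */
    static int runBatch(int taskCount, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < taskCount; i++) {
            pool.execute(done::incrementAndGet);   // stand-in for one proxy check
        }
        pool.shutdown();                            // accept no new tasks, finish queued ones
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                pool.shutdownNow();                 // give up on tasks that are still running
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return done.get();
    }
}
```

On a 4-core machine `poolSize(4, 50)` gives 200 threads, which matches what the constructor above would create there.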

Second: a scheduled job that periodically removes invalid IPs from Redis

package *.*.*.ipproxy;

import com.alibaba.fastjson.JSONObject;
import com.*.util.thread.CustomExecutorService;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Lazy;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import java.util.List;

/**
 * @Date created: 2021-01-18 15:58
 * @Author WuNengDeShiXiong
 * @Version 1.0
 * @Copyright Copyright by
 * @Direction Class description: scheduled cleanup of invalid proxy IPs in Redis
 */
@Component
@Lazy(value = false)
@EnableScheduling
public class IpSourceValid {

    private final static Logger logger = LoggerFactory.getLogger(IpSourceValid.class);

    @Autowired
    private RedisTemplate redisTemplate;

    /**
     * Re-checks the existing proxy pool once a minute (the cron fires at second 0
     * of every minute) and discards every proxy that no longer works.
     */
    @Scheduled(cron = "0 * * * * ?")
    void update() {
        logger.info(" redis IP proxy pool check started .....................");
        List<String> range = redisTemplate.opsForList().range("ip", 0, -1);
        if (range == null) {
            return;
        }
        for (String ipInfoJson : range) {
            // Check each pooled proxy on the thread pool
            CustomExecutorService.execute(() -> {
                IPInfo ipInfo = JSONObject.parseObject(ipInfoJson, IPInfo.class);
                // Proxies already in the pool get the lenient check; remove them from Redis when unusable
                boolean useLess = IpInvalidUtils.useless(
                        ipInfo.getIp(), Integer.parseInt(ipInfo.getPort()), false);
                if (useLess) {
                    logger.info(" {} removed from redis", ipInfoJson);
                    redisTemplate.opsForList().remove("ip", 0, ipInfoJson);
                }
            });
        }
        // The checks run asynchronously; at this point they have only been submitted
        logger.info(" redis IP proxy pool checks submitted .....................");
    }
}
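The cleanup job boils down to "check every pooled entry concurrently, then drop the ones that fail". The sketch below shows that pattern without Spring or Redis, using an in-memory list and a pluggable check. `PoolPruner`, `findUseless` and the predicate are hypothetical names for this illustration; unlike the job above, it collects the failed entries and returns them instead of deleting inside each task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

/** Hypothetical in-memory version of the "re-check the pool" job. */
class PoolPruner {

    /** Checks every entry concurrently and returns the entries that failed the check. */
    static List<String> findUseless(List<String> pool, Predicate<String> useless, int threads) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            // Submit one check per entry; results stay in the same order as the pool
            List<Future<Boolean>> results = new ArrayList<>();
            for (String entry : pool) {
                results.add(executor.submit(() -> useless.test(entry)));
            }
            List<String> dead = new ArrayList<>();
            for (int i = 0; i < pool.size(); i++) {
                try {
                    if (results.get(i).get()) {
                        dead.add(pool.get(i));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                } catch (ExecutionException e) {
                    dead.add(pool.get(i));   // a check that threw is treated as unusable
                }
            }
            return dead;
        } finally {
            executor.shutdown();
        }
    }
}
```

Collecting first and deleting afterwards makes it easy to log or count the removals in one place; deleting inside each task, as the job above does, frees dead entries a little sooner.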

Third: convert the IPs read from the free proxy site into objects

package *.*.*.ipproxy;

import org.jsoup.select.Elements;

import java.io.Serializable;

/**
 * @Date created: 2021-01-14 11:56
 * @Author WuNengDeShiXiong
 * @Version 1.0
 * @Copyright Copyright by
 * @Direction Class description: one proxy entry read from a source site
 */
public class IPInfo implements Serializable {

    private String ip;
    private String port;
    private String city;
    private String type;
    private String validTime;
    private String source;

    public IPInfo() {
    }

    /** Builds the entry from the td cells of one table row: ip, port, city, type, valid time. */
    public IPInfo(Elements tdChilds) {
        // Cells 0 to 4 are read below, so the row must have at least five cells
        if (tdChilds != null && tdChilds.size() > 4) {
            this.ip = tdChilds.get(0).text();
            this.port = tdChilds.get(1).text();
            this.city = "中国" + tdChilds.get(2).text();   // prefix the city with "中国" (China)
            this.type = tdChilds.get(3).text();
            this.validTime = tdChilds.get(4).text();
        }
    }

    // getters and setters omitted ......
}
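Rows scraped from a table are not always well-formed: the header row has text in the port column, and some rows may have fewer cells than expected. The sketch below shows one way to validate a row before building an object from it, using plain strings instead of jsoup elements. `ProxyRow` and `parse` are hypothetical names, and the column order (ip, port, city, type, valid time) is taken from the constructor above.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical value object for one scraped table row. */
class ProxyRow {
    final String ip;
    final int port;
    final String city;
    final String type;
    final String validTime;

    private ProxyRow(String ip, int port, String city, String type, String validTime) {
        this.ip = ip;
        this.port = port;
        this.city = city;
        this.type = type;
        this.validTime = validTime;
    }

    /** Returns empty when the row has fewer than five cells, a blank IP, or a port outside 1..65535. */
    static Optional<ProxyRow> parse(List<String> cells) {
        if (cells == null || cells.size() < 5) {
            return Optional.empty();
        }
        String ip = cells.get(0).trim();
        if (ip.isEmpty()) {
            return Optional.empty();
        }
        int port;
        try {
            port = Integer.parseInt(cells.get(1).trim());
        } catch (NumberFormatException e) {
            return Optional.empty();            // e.g. the table's header row
        }
        if (port < 1 || port > 65535) {
            return Optional.empty();
        }
        return Optional.of(new ProxyRow(ip, port, cells.get(2), cells.get(3), cells.get(4)));
    }
}
```

Parsing the port once, at construction time, means later steps can rely on it being a number instead of calling `Integer.parseInt` again.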

Fourth: use a plain Java connection to reach a big-name site through the proxy. A QQ address is used here because it responds quickly.

package *.*.*.ipproxy;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * @Date created: 2021-01-14 10:22
 * @Author WuNengDeShiXiong
 * @Version 1.0
 * @Copyright Copyright by
 * @Direction Class description: validates a proxy IP and port
 */
public class IpInvalidUtils {

    private final static Logger logger = LoggerFactory.getLogger(IpInvalidUtils.class);

    // Any reachable site will do (Baidu, Tencent, Alibaba); the smaller the response, the better
    private final static String valid_url = "https://graph.qq.com/oauth2.0";
    private final static String User_Agent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4240.183 Safari/537.36";
    private final static int connent_timeout_normal = 2000;
    private final static int connent_timeout_slow = 5000;

    /** Convenience overload that applies the strict (2 second) check. */
    public static boolean useless(String ip, Integer port) {
        return useless(ip, port, true);
    }

    /**
     * Checks whether the IP and port can be used as a proxy.
     * @param ip   proxy host
     * @param port proxy port
     * @param high true for the strict 2 second limit, false for the lenient 5 second limit
     * @return true for an unusable proxy, false for a usable one
     */
    public static boolean useless(String ip, Integer port, boolean high) {
        boolean requestIsFail = true;
        try {
            // Build the request
            Connection jsoupConn = Jsoup.connect(valid_url).userAgent(User_Agent);
            jsoupConn.proxy(ip, port);
            // Timeout in milliseconds: 2 seconds when strict, 5 seconds when lenient
            jsoupConn.timeout(high ? connent_timeout_normal : connent_timeout_slow);
            // execute() throws on a timeout, a connection failure or an HTTP error status,
            // so any response that gets this far counts as a successful request
            Connection.Response resp = jsoupConn.execute();
            if (resp.statusCode() > 0) {
                requestIsFail = false;
            }
        } catch (Exception e) {
            logger.error("Check failed for IP: {} port: {} reason: {}", ip, port, e.getMessage());
        }
        return requestIsFail;
    }
}
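For readers who want to try the same check without jsoup, it can be approximated with the JDK's own `HttpURLConnection` and `java.net.Proxy`. This is a swapped-in technique, not the code this service uses: `ProxyProbe`, its parameters and the choice of a `HEAD` request are assumptions of this sketch.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;

/** Hypothetical JDK-only proxy check; returns true when the proxy is unusable. */
class ProxyProbe {

    static boolean useless(String ip, int port, String targetUrl, int timeoutMillis) {
        HttpURLConnection conn = null;
        try {
            // Route this one connection through the candidate proxy
            Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(ip, port));
            conn = (HttpURLConnection) URI.create(targetUrl).toURL().openConnection(proxy);
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            conn.setRequestMethod("HEAD");      // only the status line is needed
            int status = conn.getResponseCode();
            return status < 200 || status >= 400;
        } catch (IOException | RuntimeException e) {
            // Refused or timed-out connection, malformed URL, invalid port, and so on
            return true;
        }
        finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }
}
```

A call such as `ProxyProbe.useless(ip, port, "https://graph.qq.com/oauth2.0", 2000)` would mirror the strict check; whether a given target accepts `HEAD` requests is something to verify before relying on it.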

Step five: fetch data from the site on a schedule

package *.*.*.ipproxy;

import com.alibaba.fastjson.JSON;
import com.*.util.thread.CustomExecutorService;
import org.apache.commons.lang3.StringUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Lazy;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import java.io.IOException;
import java.util.List;

/**
 * @Date created: 2021-01-14 10:21
 * @Author WuNengDeShiXiong
 * @Version 1.0
 * @Copyright Copyright by
 * @Direction Class description: periodically reads one free source, then validates
 *            the IPs it returns (fetched, then checked on multiple threads)
 */
@Component
@Lazy(value = false)
@EnableScheduling
public class UpdateIpSource1 {

    private final static Logger logger = LoggerFactory.getLogger(UpdateIpSource1.class);

    @Autowired
    private RedisTemplate redisTemplate;

    /** Collects usable proxies for the pool; runs every 10 minutes. */
    @Scheduled(cron = "0 */10 * * * ?")
    public void ips() {
        String prefix = "http://www.66ip.cn/areaindex_";
        String suxfix = "/1.html";
        // Only the first 5 pages are read
        for (int i = 1; i <= 5; i++) {
            String url = new StringBuilder(prefix).append(i).append(suxfix).toString();
            parseUrl(url);
        }
    }

    /** Parses the given address and extracts each IP and port from its table. */
    public void parseUrl(String url) {
        try {
            // Example: http://www.66ip.cn/areaindex_1/1.html
            Document document = Jsoup.connect(url).timeout(3000).get();
            Elements tags = document.select("table > tbody > tr");
            for (Element element : tags) {
                // The td cells hold the IP address, the port and the other columns
                Elements tdChilds = element.select("td");
                logger.info(" about to validate {}    ------source1------", tdChilds.text());
                IPInfo ipInfo = new IPInfo(tdChilds);
                ipInfo.setSource("1");
                redisHandler(ipInfo);
            }
        } catch (IOException e) {
            logger.error("Failed to read {} : {}", url, e.getMessage());
        }
    }

    /** Validates the IP that was read and, if it works, writes it to Redis. */
    public void redisHandler(IPInfo ipInfo) {
        // Step 1: basic sanity checks; only well-formed entries are probed
        if (ipInfo != null && StringUtils.isNotBlank(ipInfo.getIp()) && StringUtils.isNotBlank(ipInfo.getPort())) {
            try {
                Integer.parseInt(ipInfo.getPort());
            } catch (Exception e) {
                logger.info("Port is not a number, entry discarded.................------source1------");
                return;
            }
            // Step 2: probe the IP and port through the thread pool
            CustomExecutorService.execute(() -> {
                boolean useLess = IpInvalidUtils.useless(ipInfo.getIp(), Integer.parseInt(ipInfo.getPort()));
                if (!useLess) {
                    String ipInfoJson = JSON.toJSONString(ipInfo);
                    // New valid entries go to the leftmost position of the list
                    List<String> range = redisTemplate.opsForList().range("ip", 0, -1);
                    if (range == null || !range.contains(ipInfoJson)) {
                        logger.info(" {} is valid, storing it in redis    ------source1------", ipInfoJson);
                        Long size = redisTemplate.opsForList().size("ip");
                        if (size != null && size > 1000) {
                            // The pool is full: drop the oldest entry from the right end first.
                            // (rightPopAndLeftPush takes a destination key, not a value.)
                            redisTemplate.opsForList().rightPop("ip");
                        }
                        redisTemplate.opsForList().leftPush("ip", ipInfoJson);
                    }
                }
            });
        }
    }
}
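The write step is meant to do three things: skip entries that are already pooled, put new entries at the left end, and keep the list capped at roughly 1000 entries by dropping from the right end. The in-memory sketch below illustrates that policy only; `BoundedProxyList` is a hypothetical class, and it caps at exactly `capacity` where the Redis code above uses a `> 1000` test.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical in-memory model of the capped, de-duplicated proxy list. */
class BoundedProxyList {
    private final Deque<String> entries = new ArrayDeque<>();
    private final int capacity;

    BoundedProxyList(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be at least 1");
        }
        this.capacity = capacity;
    }

    /** Adds the entry at the head unless it is already present; evicts the tail when full. */
    synchronized boolean offer(String entry) {
        if (entries.contains(entry)) {
            return false;                 // duplicate: leave the list unchanged
        }
        if (entries.size() >= capacity) {
            entries.pollLast();           // oldest entry sits at the right end
        }
        entries.addFirst(entry);          // newest entry goes to the left end
        return true;
    }

    /** Returns a copy of the current contents, newest first. */
    synchronized List<String> snapshot() {
        return new ArrayList<>(entries);
    }
}
```

Here `synchronized` makes the contains, evict, push sequence atomic. In Redis the equivalent sequence is several separate commands issued from several worker threads, so two threads can occasionally insert the same entry; a Redis set would de-duplicate on its own, at the cost of losing the left-to-right ordering.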

That completes today's proxy pool. The Maven dependencies it uses are listed below; the basic Spring Boot packages are left out.

<!-- https://mvnrepository.com/artifact/com.alibaba/fastjson -->
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.47</version>
</dependency>

<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
</dependency>

The Apache utility packages are not pasted here, since everyone already uses them.
