一、Why Build a Crawler Proxy Pool

Among the many anti-crawling measures websites use, a common one is rate-limiting by IP: if a single IP issues more than a certain threshold of requests within a time window, that IP is blacklisted and blocked for a period of time.

There are two common ways to deal with this:

1. Slow the crawler down so the IP never trips the limit. The drawback is obvious: crawl throughput drops sharply.

2. Build an IP proxy pool and rotate through different IPs while crawling, as sketched below.
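For the second option, rotation itself is straightforward with requests: each request is simply routed through a different proxy picked from the pool. A minimal sketch, using placeholder proxy addresses rather than real ones:

import random
import requests

# Placeholder pool; in practice these entries come from the proxy pool built in this article.
PROXY_POOL = [
    'http://10.0.0.1:8080',
    'http://10.0.0.2:3128',
]

def fetch(url):
    """Fetch a URL through a randomly chosen proxy from the pool."""
    proxy = random.choice(PROXY_POOL)
    return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=5)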

二、Design Overview

1. Crawl proxy IPs from free proxy sites (e.g. 西刺代理, 快代理, 云代理, 无忧代理);

2. Validate each proxy IP (send a request to a known URL through the proxy and check whether a valid response comes back);

3. Save the working proxy IPs to a database.

Proxy sites used here: 西刺代理, 云代理, IP海, 无忧代理, 飞蚁代理, 快代理. A minimal sketch of the overall flow follows.
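Putting the three steps together, maintaining the pool is just a loop: crawl candidates, validate them, store the survivors. A rough sketch of that loop (the three helpers are placeholders that the modules in section 三 implement in full):

def refresh_pool(crawl_proxies, is_available, save):
    """One maintenance round: crawl candidate proxies, validate them, store the working ones."""
    for proxy in crawl_proxies():   # step 1: scrape candidates from the proxy sites
        if is_available(proxy):     # step 2: request a check URL through the proxy
            save(proxy)             # step 3: persist the working proxy (Redis in this article)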

三、Implementation

The project consists of the following modules: ipproxy.py, settings.py, proxy_util.py, proxy_queue.py, proxy_crawlers.py and run.py.

ipproxy.py

The IPProxy class defines the fields stored for each crawled proxy and a few basic helper methods.

# -*- coding: utf-8 -*-
import re
import time

from settings import PROXY_URL_FORMATTER

schema_pattern = re.compile(r'^(http|https)$', re.I)
ip_pattern = re.compile(r'^([0-9]{1,3}\.){3}[0-9]{1,3}$', re.I)
port_pattern = re.compile(r'^[0-9]{2,5}$', re.I)


class IPProxy:
    '''
    {
        "schema": "http",            # proxy scheme
        "ip": "127.0.0.1",           # proxy IP address
        "port": "8050",              # proxy port
        "used_total": 11,            # how many times the proxy has been used
        "success_times": 5,          # how many requests through the proxy succeeded
        "continuous_failed": 3,      # how many consecutive requests through the proxy failed
        "created_time": "2018-05-02" # the date the proxy was crawled
    }
    '''

    def __init__(self, schema, ip, port, used_total=0, success_times=0, continuous_failed=0,
                 created_time=None):
        """Initialize the proxy instance"""
        if schema == "" or schema is None:
            schema = "http"
        self.schema = schema.lower()
        self.ip = ip
        self.port = port
        self.used_total = used_total
        self.success_times = success_times
        self.continuous_failed = continuous_failed
        if created_time is None:
            created_time = time.strftime('%Y-%m-%d', time.localtime(time.time()))
        self.created_time = created_time

    def _get_url(self):
        '''Return the proxy url'''
        return PROXY_URL_FORMATTER % {'schema': self.schema, 'ip': self.ip, 'port': self.port}

    def _check_format(self):
        '''Return True if the proxy fields are well-formed, otherwise return False'''
        if self.schema is not None and self.ip is not None and self.port is not None:
            if schema_pattern.match(self.schema) and ip_pattern.match(self.ip) and port_pattern.match(self.port):
                return True
        return False

    def _is_https(self):
        '''Return True if the proxy is https, otherwise return False'''
        return self.schema == 'https'

    def _update(self, successed=False):
        '''Update the proxy's counters based on the result of a request'''
        self.used_total = self.used_total + 1
        if successed:
            self.continuous_failed = 0
            self.success_times = self.success_times + 1
        else:
            self.continuous_failed = self.continuous_failed + 1


if __name__ == '__main__':
    proxy = IPProxy('HTTPS', '192.168.2.25', "8080")
    print(proxy._get_url())
    print(proxy._check_format())
    print(proxy._is_https())
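Running ipproxy.py on its own exercises these methods; with the settings module from the next section on the path, it should print the proxy URL https://192.168.2.25:8080 followed by True and True.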

settings.py

settings.py gathers the configuration used throughout the project.

# Redis host and port
REDIS_HOST = 'localhost'
REDIS_PORT = 6379

# Format string for the Redis keys under which proxies are stored
PROXIES_REDIS_FORMATTER = 'proxies::{}'
# Redis set holding the HTTP and HTTPS proxies that are already queued
PROXIES_REDIS_EXISTED = 'proxies::existed'

# Maximum number of consecutive failures allowed before a proxy is dropped
MAX_CONTINUOUS_TIMES = 3

# Format string for building a proxy URL
PROXY_URL_FORMATTER = '%(schema)s://%(ip)s:%(port)s'

USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]

# Whether to check that a crawled proxy works before saving it, default True
PROXY_CHECK_BEFOREADD = True

# URLs used to verify proxies; multiple URLs per schema are supported
PROXY_CHECK_URLS = {'https': ['https://icanhazip.com'], 'http': ['http://icanhazip.com']}
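The two format strings are what tie the modules together: PROXY_URL_FORMATTER builds the URL handed to requests, and PROXIES_REDIS_FORMATTER builds the per-schema Redis key. For example:

>>> PROXY_URL_FORMATTER % {'schema': 'http', 'ip': '127.0.0.1', 'port': '8050'}
'http://127.0.0.1:8050'
>>> PROXIES_REDIS_FORMATTER.format('http')
'proxies::http'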

proxy_util.py

proxy_util.py defines a set of utility functions: proxy_to_dict(proxy) converts an IPProxy instance into a dict; proxy_from_dict(d) converts a dict back into an IPProxy instance; request_page() sends an HTTP request; _is_proxy_available() checks whether a proxy IP actually works.

# -*- coding: utf-8 -*-
import random
import logging

import requests

from ipproxy import IPProxy
from settings import USER_AGENT_LIST, PROXY_CHECK_URLS

# Set the logger output format
logging.basicConfig(level=logging.INFO,
                    format='[%(asctime)-15s] [%(levelname)8s] [%(name)10s ] - %(message)s (%(filename)s:%(lineno)s)',
                    datefmt='%Y-%m-%d %T'
                    )
logger = logging.getLogger(__name__)


def proxy_to_dict(proxy):
    """Convert an IPProxy instance into a dict"""
    d = {
        "schema": proxy.schema,
        "ip": proxy.ip,
        "port": proxy.port,
        "used_total": proxy.used_total,
        "success_times": proxy.success_times,
        "continuous_failed": proxy.continuous_failed,
        "created_time": proxy.created_time
    }
    return d


def proxy_from_dict(d):
    """Build an IPProxy instance from a dict"""
    return IPProxy(schema=d['schema'], ip=d['ip'], port=d['port'], used_total=d['used_total'],
                   success_times=d['success_times'], continuous_failed=d['continuous_failed'],
                   created_time=d['created_time'])


# Strip leading and trailing whitespace
def strip(data):
    if data is not None:
        return data.strip()
    return data


base_headers = {
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7'
}


def request_page(url, options={}, encoding='utf-8'):
    """Send a request and return the decoded response body, or None on failure"""
    headers = dict(base_headers, **options)
    if 'User-Agent' not in headers.keys():
        headers['User-Agent'] = random.choice(USER_AGENT_LIST)
    logger.info('Fetching: ' + url)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            logger.info('Fetched successfully: ' + url)
            return response.content.decode(encoding=encoding)
    except requests.exceptions.RequestException:
        logger.error('Failed to fetch ' + url)
    return None


def _is_proxy_available(proxy, options={}):
    """Check whether the proxy is available or not"""
    headers = dict(base_headers, **options)
    if 'User-Agent' not in headers.keys():
        headers['User-Agent'] = random.choice(USER_AGENT_LIST)
    proxies = {proxy.schema: proxy._get_url()}
    check_urls = PROXY_CHECK_URLS[proxy.schema]
    for url in check_urls:
        try:
            response = requests.get(url=url, proxies=proxies, headers=headers, timeout=5)
        except BaseException:
            logger.info("< " + url + " > checked proxy < " + proxy._get_url() + " > result: unavailable")
        else:
            if response.status_code == 200:
                logger.info("< " + url + " > checked proxy < " + proxy._get_url() + " > result: available")
                return True
            else:
                logger.info("< " + url + " > checked proxy < " + proxy._get_url() + " > result: unavailable")
    return False


if __name__ == '__main__':
    headers = dict(base_headers)
    if 'User-Agent' not in headers.keys():
        headers['User-Agent'] = random.choice(USER_AGENT_LIST)
    proxies = {"https": "https://163.125.255.154:9797"}
    response = requests.get("https://www.baidu.com", headers=headers, proxies=proxies, timeout=3)
    print(response.content)
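A quick way to exercise these helpers together, assuming a candidate address is already at hand (the address below is a placeholder, not a live proxy):

from ipproxy import IPProxy
from proxy_util import request_page, _is_proxy_available

page = request_page('http://icanhazip.com')        # plain fetch with a random User-Agent
candidate = IPProxy('http', '10.0.0.1', '8080')     # placeholder proxy address
if candidate._check_format() and _is_proxy_available(candidate):
    print('usable proxy:', candidate._get_url())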

proxy_queue.py

A proxy queue stores proxies and hands them out to callers; different queue implementations may use different storage and retrieval strategies. BaseQueue is the base class of all proxy queues and declares the interface every queue must implement: pushing a proxy, popping a proxy, reporting the number of stored proxies, and so on. The example FifoQueue is a first-in-first-out queue backed by a Redis list; to make sure each proxy is enqueued only once, a Redis set named proxies::existed is checked before a proxy is pushed.

# -*- coding: utf-8 -*-
import json

import redis

from ipproxy import IPProxy
from proxy_util import proxy_to_dict, proxy_from_dict, _is_proxy_available
from settings import PROXIES_REDIS_EXISTED, PROXIES_REDIS_FORMATTER, MAX_CONTINUOUS_TIMES, PROXY_CHECK_BEFOREADD

"""
Proxy Queue Base Class
"""


class BaseQueue(object):

    def __init__(self, server):
        """Initialize the proxy queue instance

        Parameters
        ----------
        server : StrictRedis
            Redis client instance
        """
        self.server = server

    def _serialize_proxy(self, proxy):
        """Serialize a proxy instance into a dict"""
        return proxy_to_dict(proxy)

    def _deserialize_proxy(self, serialized_proxy):
        """Deserialize a stored proxy back into an IPProxy instance"""
        return proxy_from_dict(json.loads(serialized_proxy))

    def __len__(self, schema='http'):
        """Return the length of the queue"""
        raise NotImplementedError

    def push(self, proxy, need_check):
        """Push a proxy"""
        raise NotImplementedError

    def pop(self, schema='http', timeout=0):
        """Pop a proxy"""
        raise NotImplementedError


class FifoQueue(BaseQueue):
    """First in, first out queue"""

    def __len__(self, schema='http'):
        """Return the length of the queue"""
        return self.server.llen(PROXIES_REDIS_FORMATTER.format(schema))

    def push(self, proxy, need_check=PROXY_CHECK_BEFOREADD):
        """Push a proxy onto the queue"""
        if need_check and not _is_proxy_available(proxy):
            return
        elif proxy.continuous_failed < MAX_CONTINUOUS_TIMES and not self._is_existed(proxy):
            key = PROXIES_REDIS_FORMATTER.format(proxy.schema)
            self.server.rpush(key, json.dumps(self._serialize_proxy(proxy), ensure_ascii=False))

    def pop(self, schema='http', timeout=0):
        """Pop a proxy off the queue"""
        if timeout > 0:
            p = self.server.blpop(PROXIES_REDIS_FORMATTER.format(schema.lower()), timeout)
            if isinstance(p, tuple):
                p = p[1]
        else:
            p = self.server.lpop(PROXIES_REDIS_FORMATTER.format(schema.lower()))
        if p:
            p = self._deserialize_proxy(p)
            # Remove the proxy from the "existed" set so it can be queued again later
            self.server.srem(PROXIES_REDIS_EXISTED, p._get_url())
            return p

    def _is_existed(self, proxy):
        # SADD returns the number of members actually added, so 0 means the proxy was already in the set
        added = self.server.sadd(PROXIES_REDIS_EXISTED, proxy._get_url())
        return added == 0


if __name__ == '__main__':
    r = redis.StrictRedis(host='localhost', port=6379)
    queue = FifoQueue(r)
    proxy = IPProxy('http', '218.66.253.144', '80')
    queue.push(proxy)
    proxy = queue.pop(schema='http')
    print(proxy._get_url())

proxy_crawlers.py

ProxyBaseCrawler is the base class of all the proxy-site crawlers; it only declares a _start_crawl() method, which each subclass implements to scrape proxy IPs from its target site.

# -*- coding: utf-8 -*-
from lxml import etree

from ipproxy import IPProxy
from proxy_util import strip, request_page, logger


class ProxyBaseCrawler(object):

    def __init__(self, queue=None, website=None, urls=[]):
        self.queue = queue
        self.website = website
        self.urls = urls

    def _start_crawl(self):
        raise NotImplementedError


class KuaiDailiCrawler(ProxyBaseCrawler):  # 快代理
    def _start_crawl(self):
        for url_dict in self.urls:
            logger.info("Start crawling [ " + self.website + " ] :::> [ " + url_dict['type'] + " ]")
            has_more = True
            url = None
            while has_more:
                if 'page' in url_dict.keys() and '{}' in url_dict['url']:
                    url = url_dict['url'].format(str(url_dict['page']))
                    url_dict['page'] = url_dict['page'] + 1
                else:
                    url = url_dict['url']
                    has_more = False
                html = etree.HTML(request_page(url))
                tr_list = html.xpath("//table[@class='table table-bordered table-striped']/tbody/tr")
                for tr in tr_list:
                    ip = tr.xpath("./td[@data-title='IP']/text()")[0] if len(
                        tr.xpath("./td[@data-title='IP']/text()")) else None
                    port = tr.xpath("./td[@data-title='PORT']/text()")[0] if len(
                        tr.xpath("./td[@data-title='PORT']/text()")) else None
                    schema = tr.xpath("./td[@data-title='类型']/text()")[0] if len(
                        tr.xpath("./td[@data-title='类型']/text()")) else None
                    proxy = IPProxy(schema=strip(schema), ip=strip(ip), port=strip(port))
                    if proxy._check_format():
                        self.queue.push(proxy)
                if not tr_list:  # no rows on this page, stop paging
                    has_more = False


class FeiyiDailiCrawler(ProxyBaseCrawler):  # 飞蚁代理
    def _start_crawl(self):
        for url_dict in self.urls:
            logger.info("Start crawling [ " + self.website + " ] :::> [ " + url_dict['type'] + " ]")
            has_more = True
            url = None
            while has_more:
                if 'page' in url_dict.keys() and '{}' in url_dict['url']:
                    url = url_dict['url'].format(str(url_dict['page']))
                    url_dict['page'] = url_dict['page'] + 1
                else:
                    url = url_dict['url']
                    has_more = False
                html = etree.HTML(request_page(url))
                tr_list = html.xpath("//div[@id='main-content']//table/tr[position()>1]")
                for tr in tr_list:
                    ip = tr.xpath("./td[1]/text()")[0] if len(tr.xpath("./td[1]/text()")) else None
                    port = tr.xpath("./td[2]/text()")[0] if len(tr.xpath("./td[2]/text()")) else None
                    schema = tr.xpath("./td[4]/text()")[0] if len(tr.xpath("./td[4]/text()")) else None
                    proxy = IPProxy(schema=strip(schema), ip=strip(ip), port=strip(port))
                    if proxy._check_format():
                        self.queue.push(proxy)
                if not tr_list:
                    has_more = False


class WuyouDailiCrawler(ProxyBaseCrawler):  # 无忧代理
    def _start_crawl(self):
        for url_dict in self.urls:
            logger.info("Start crawling [ " + self.website + " ] :::> [ " + url_dict['type'] + " ]")
            has_more = True
            url = None
            while has_more:
                if 'page' in url_dict.keys() and '{}' in url_dict['url']:
                    url = url_dict['url'].format(str(url_dict['page']))
                    url_dict['page'] = url_dict['page'] + 1
                else:
                    url = url_dict['url']
                    has_more = False
                html = etree.HTML(request_page(url))
                ul_list = html.xpath("//div[@class='wlist'][2]//ul[@class='l2']")
                for ul in ul_list:
                    ip = ul.xpath("./span[1]/li/text()")[0] if len(ul.xpath("./span[1]/li/text()")) else None
                    port = ul.xpath("./span[2]/li/text()")[0] if len(ul.xpath("./span[2]/li/text()")) else None
                    schema = ul.xpath("./span[4]/li/text()")[0] if len(ul.xpath("./span[4]/li/text()")) else None
                    proxy = IPProxy(schema=strip(schema), ip=strip(ip), port=strip(port))
                    if proxy._check_format():
                        self.queue.push(proxy)
                if not ul_list:
                    has_more = False


class IPhaiDailiCrawler(ProxyBaseCrawler):  # IP海代理
    def _start_crawl(self):
        for url_dict in self.urls:
            logger.info("Start crawling [ " + self.website + " ] :::> [ " + url_dict['type'] + " ]")
            has_more = True
            url = None
            while has_more:
                if 'page' in url_dict.keys() and '{}' in url_dict['url']:
                    url = url_dict['url'].format(str(url_dict['page']))
                    url_dict['page'] = url_dict['page'] + 1
                else:
                    url = url_dict['url']
                    has_more = False
                html = etree.HTML(request_page(url))
                tr_list = html.xpath("//table//tr[position()>1]")
                for tr in tr_list:
                    ip = tr.xpath("./td[1]/text()")[0] if len(tr.xpath("./td[1]/text()")) else None
                    port = tr.xpath("./td[2]/text()")[0] if len(tr.xpath("./td[2]/text()")) else None
                    schema = tr.xpath("./td[4]/text()")[0] if len(tr.xpath("./td[4]/text()")) else None
                    proxy = IPProxy(schema=strip(schema), ip=strip(ip), port=strip(port))
                    if proxy._check_format():
                        self.queue.push(proxy)
                if not tr_list:
                    has_more = False


class YunDailiCrawler(ProxyBaseCrawler):  # 云代理
    def _start_crawl(self):
        for url_dict in self.urls:
            logger.info("Start crawling [ " + self.website + " ] :::> [ " + url_dict['type'] + " ]")
            has_more = True
            url = None
            while has_more:
                if 'page' in url_dict.keys() and '{}' in url_dict['url']:
                    url = url_dict['url'].format(str(url_dict['page']))
                    url_dict['page'] = url_dict['page'] + 1
                else:
                    url = url_dict['url']
                    has_more = False
                html = etree.HTML(request_page(url, encoding='gbk'))  # the site is GBK-encoded
                tr_list = html.xpath("//table/tbody/tr")
                for tr in tr_list:
                    ip = tr.xpath("./td[1]/text()")[0] if len(tr.xpath("./td[1]/text()")) else None
                    port = tr.xpath("./td[2]/text()")[0] if len(tr.xpath("./td[2]/text()")) else None
                    schema = tr.xpath("./td[4]/text()")[0] if len(tr.xpath("./td[4]/text()")) else None
                    proxy = IPProxy(schema=strip(schema), ip=strip(ip), port=strip(port))
                    if proxy._check_format():
                        self.queue.push(proxy)
                if not tr_list:
                    has_more = False


class XiCiDailiCrawler(ProxyBaseCrawler):  # 西刺代理
    def _start_crawl(self):
        for url_dict in self.urls:
            logger.info("Start crawling [ " + self.website + " ] :::> [ " + url_dict['type'] + " ]")
            has_more = True
            url = None
            while has_more:
                if 'page' in url_dict.keys() and '{}' in url_dict['url']:
                    url = url_dict['url'].format(str(url_dict['page']))
                    url_dict['page'] = url_dict['page'] + 1
                else:
                    url = url_dict['url']
                    has_more = False
                html = etree.HTML(request_page(url))
                tr_list = html.xpath("//table[@id='ip_list']//tr[@class!='subtitle']")
                for tr in tr_list:
                    ip = tr.xpath("./td[2]/text()")[0] if len(tr.xpath("./td[2]/text()")) else None
                    port = tr.xpath("./td[3]/text()")[0] if len(tr.xpath("./td[3]/text()")) else None
                    schema = tr.xpath("./td[6]/text()")[0] if len(tr.xpath("./td[6]/text()")) else None
                    if schema is not None and schema.lower() in ("http", "https"):
                        proxy = IPProxy(schema=strip(schema), ip=strip(ip), port=strip(port))
                        if proxy._check_format():
                            self.queue.push(proxy)
                if not tr_list:
                    has_more = False

run.py

run.py starts the crawler for each proxy site.

# -*- coding: utf-8 -*-
import redis

from proxy_queue import FifoQueue
from settings import REDIS_HOST, REDIS_PORT
from proxy_crawlers import WuyouDailiCrawler, FeiyiDailiCrawler, KuaiDailiCrawler, IPhaiDailiCrawler, YunDailiCrawler, \
    XiCiDailiCrawler

r = redis.StrictRedis(host=REDIS_HOST, port=REDIS_PORT)
fifo_queue = FifoQueue(r)


def run_kuai():
    kuaidailiCrawler = KuaiDailiCrawler(queue=fifo_queue, website='快代理[国内高匿]',
                                        urls=[{'url': 'https://www.kuaidaili.com/free/inha/{}/', 'type': '国内高匿',
                                               'page': 1},
                                              {'url': 'https://www.kuaidaili.com/free/intr/{}/', 'type': '国内普通',
                                               'page': 1}])
    kuaidailiCrawler._start_crawl()


def run_feiyi():
    feiyidailiCrawler = FeiyiDailiCrawler(queue=fifo_queue, website='飞蚁代理',
                                          urls=[{'url': 'http://www.feiyiproxy.com/?page_id=1457', 'type': '首页推荐'}])
    feiyidailiCrawler._start_crawl()


def run_wuyou():
    wuyoudailiCrawler = WuyouDailiCrawler(queue=fifo_queue, website='无忧代理',
                                          urls=[{'url': 'http://www.data5u.com/free/index.html', 'type': '首页推荐'},
                                                {'url': 'http://www.data5u.com/free/gngn/index.shtml', 'type': '国内高匿'},
                                                {'url': 'http://www.data5u.com/free/gnpt/index.shtml', 'type': '国内普通'}])
    wuyoudailiCrawler._start_crawl()


def run_iphai():
    crawler = IPhaiDailiCrawler(queue=fifo_queue, website='IP海代理',
                                urls=[{'url': 'http://www.iphai.com/free/ng', 'type': '国内高匿'},
                                      {'url': 'http://www.iphai.com/free/np', 'type': '国内普通'},
                                      {'url': 'http://www.iphai.com/free/wg', 'type': '国外高匿'},
                                      {'url': 'http://www.iphai.com/free/wp', 'type': '国外普通'}])
    crawler._start_crawl()


def run_yun():
    crawler = YunDailiCrawler(queue=fifo_queue, website='云代理',
                              urls=[{'url': 'http://www.ip3366.net/free/?stype=1&page={}', 'type': '国内高匿', 'page': 1},
                                    {'url': 'http://www.ip3366.net/free/?stype=2&page={}', 'type': '国内普通', 'page': 1},
                                    {'url': 'http://www.ip3366.net/free/?stype=3&page={}', 'type': '国外高匿', 'page': 1},
                                    {'url': 'http://www.ip3366.net/free/?stype=4&page={}', 'type': '国外普通', 'page': 1}])
    crawler._start_crawl()


def run_xici():
    crawler = XiCiDailiCrawler(queue=fifo_queue, website='西刺代理',
                               urls=[{'url': 'https://www.xicidaili.com/', 'type': '首页推荐'},
                                     {'url': 'https://www.xicidaili.com/nn/{}', 'type': '国内高匿', 'page': 1},
                                     {'url': 'https://www.xicidaili.com/nt/{}', 'type': '国内普通', 'page': 1},
                                     {'url': 'https://www.xicidaili.com/wn/{}', 'type': '国外高匿', 'page': 1},
                                     {'url': 'https://www.xicidaili.com/wt/{}', 'type': '国外普通', 'page': 1}])
    crawler._start_crawl()


if __name__ == '__main__':
    run_xici()
    run_iphai()
    run_kuai()
    run_feiyi()
    run_yun()
    run_wuyou()

Sample log output while crawling 西刺代理:

The proxy records stored in Redis look like this:

四、Testing the Proxies

Next, use one of the crawled proxies to request http://icanhazip.com as a test:

# -*- coding: utf-8 -*-
import random

import requests

from proxy_util import logger, base_headers
from run import fifo_queue
from settings import USER_AGENT_LIST

# Test URL
url = 'http://icanhazip.com'

# Get a proxy from the queue
proxy = fifo_queue.pop(schema='http')
proxies = {proxy.schema: proxy._get_url()}

# Build the request headers
headers = dict(base_headers)
if 'User-Agent' not in headers.keys():
    headers['User-Agent'] = random.choice(USER_AGENT_LIST)

response = None
successed = False
try:
    response = requests.get(url, headers=headers, proxies=proxies, timeout=5)
except BaseException:
    logger.error("Proxy < " + proxy._get_url() + " > request < " + url + " > result: failed")
else:
    if response.status_code == 200:
        logger.info(response.content.decode())
        successed = True
        logger.info("Proxy < " + proxy._get_url() + " > request < " + url + " > result: success")
    else:
        logger.info(response.content.decode())
        logger.info("Proxy < " + proxy._get_url() + " > request < " + url + " > result: failed")

# Update the proxy's counters based on the outcome of the request
proxy._update(successed)

# Return the proxy to the queue; skip the availability check when pushing it back
fifo_queue.push(proxy, need_check=False)

After the request through proxy http://218.66.253.144:80 succeeds, the proxy is pushed back onto the queue, and its used_total, success_times and continuous_failed fields in Redis are updated accordingly.
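To inspect the updated record, the proxies::http list can be read straight out of Redis; each element is the JSON document produced by proxy_to_dict(). A small sketch:

import json
import redis

r = redis.StrictRedis(host='localhost', port=6379)
for raw in r.lrange('proxies::http', 0, -1):
    # Each element is a JSON dict with schema, ip, port, used_total, success_times, etc.
    print(json.loads(raw))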

This concludes the walkthrough of building a crawler proxy pool in Python.
