Python Web Scraping: Building an IP Proxy Pool

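Sites with anti-scraping measures come up often when writing Python crawlers. Spoofing the request headers gets past some of them, but the site can still see your IP address and block it to stop the scraping.
The request methods in requests take a proxies parameter that routes a request through a proxy, which hides our own IP. Some sites publish free proxy IPs; we can scrape those lists, test each entry, and keep the working ones as an IP proxy pool.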
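
As a rough sketch of how the proxies parameter is fed from a pool: the snippet below picks one entry and turns it into the mapping that requests.get accepts. The helper name pick_proxies and the sample addresses are made up for illustration (the addresses come from documentation-only ranges and are not real proxies), and the actual request is left commented out so the snippet runs without network access.

```python
import random


def pick_proxies(pool, rng=random):
    """Pick one proxy URL from the pool and build a requests-style mapping.

    Each pool entry looks like 'http://1.2.3.4:8080'; the mapping key is
    the scheme of the entry, as in the test scripts in this post.
    """
    proxy_url = rng.choice(pool)
    scheme = proxy_url.split(":")[0]
    return {scheme: proxy_url}


# Hypothetical pool entries (documentation-only addresses, not real proxies).
sample_pool = ["http://192.0.2.10:8080", "https://198.51.100.7:3128"]
proxies = pick_proxies(sample_pool)
print(proxies)

# With a working proxy, the mapping is passed straight to requests:
# import requests
# response = requests.get("http://icanhazip.com", proxies=proxies, timeout=3)
```

Choosing a random entry per request spreads traffic across the pool, so a single blocked proxy does not stop the whole crawl.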

IP proxy sites:
https://www.xicidaili.com/nt/
https://www.kuaidaili.com/free/intr/

Scraping the IPs (IPPool.py)

import requests
from lxml import etree
from fake_useragent import UserAgent

# Disguise the request with a random User-Agent
ua = UserAgent()
headers = {'User-Agent': ua.random}


def get_ip():
    ip_list = []
    # Proxies are short-lived, so only the first page is scraped
    url = 'https://www.xicidaili.com/nt/'
    # Send the request
    response = requests.get(url=url, headers=headers)
    # Set the encoding
    response.encoding = response.apparent_encoding
    html = etree.HTML(response.text)
    tr_list = html.xpath('//tr[@class="odd"]')
    for tr in tr_list:
        # IP address
        ip = tr.xpath('./td[2]/text()')[0]
        # Port
        port = tr.xpath('./td[3]/text()')[0]
        # Protocol (lower-cased so it can be used as a URL scheme)
        agreement = tr.xpath('./td[6]/text()')[0].lower()
        # Assemble the full proxy URL
        ip_list.append(agreement + '://' + ip + ':' + port)
    return ip_list


if __name__ == '__main__':
    ip_list = get_ip()
    print(ip_list)

Testing the IPs

Test method 1 (from multiprocessing.dummy import Pool)

import requests
from multiprocessing.dummy import Pool
# Get the list of scraped IPs
from IPPool import get_ip

test_list = get_ip()
# Global list that holds the working IPs
ip_list = []
# Site used to test the IPs
url = 'http://icanhazip.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0'
}


def ip_test(ip):
    # requests picks a proxy by the scheme of the target URL, so an entry
    # stored under 'https' is not used for the http:// test URL above
    if ip.split(":")[0] == 'http':
        proxies = {'http': ip}
    else:
        proxies = {'https': ip}
    try:
        requests.get(url=url, headers=headers, proxies=proxies, timeout=3)
        ip_list.append(ip)
        print(ip + " is usable")
    except requests.RequestException:
        print(ip + " is not usable")


if __name__ == '__main__':
    # multiprocessing.dummy.Pool is a pool of threads, not processes
    pool = Pool(4)
    pool.map(ip_test, test_list)
    print(ip_list)
    print("Scraped %s IPs in total: %s usable, %s not usable"
          % (len(test_list), len(ip_list), len(test_list) - len(ip_list)))
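
A possible variation on method 1, sketched under assumptions: instead of appending to a global list from inside the worker, the worker returns True or False and the caller filters on the results of pool.map. The names filter_working and fake_check are hypothetical, and fake_check is only a stand-in for a real network test so that the snippet runs offline.

```python
from multiprocessing.dummy import Pool  # thread-based pool


def filter_working(candidates, check, workers=4):
    """Run check(proxy) on a thread pool and keep the proxies it accepts."""
    with Pool(workers) as pool:
        # map() preserves input order, so results line up with candidates
        results = pool.map(check, candidates)
    return [proxy for proxy, ok in zip(candidates, results) if ok]


def fake_check(proxy):
    """Stand-in for a real test: pretends that port 8080 proxies work."""
    return proxy.endswith(":8080")


# Documentation-only addresses, not real proxies.
candidates = ["http://192.0.2.10:8080", "http://192.0.2.11:3128"]
print(filter_working(candidates, fake_check))
```

Returning a value from the worker keeps the check function free of shared state, which makes it easier to reuse with a different pool type.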


Test method 2 (threading with a queue)

import threading
import requests
import queue
from fake_useragent import UserAgent
# Get the list of scraped IPs
from IPPool import get_ip

test_list = get_ip()
# Global list that holds the working IPs
ip_pool = []
# Disguise the request with a random User-Agent
ua = UserAgent()
headers = {'User-Agent': ua.random}
url = 'https://www.csdn.net/'
# url = 'http://icanhazip.com/'


def test_ip(queue_list):
    while True:
        try:
            # get_nowait() avoids blocking if another thread took the last item
            ip = queue_list.get_nowait()
        except queue.Empty:
            break
        if ip.split(":")[0] == 'http':
            proxies = {'http': ip}
        else:
            proxies = {'https': ip}
        try:
            response = requests.get(url=url, headers=headers, proxies=proxies, timeout=3)
            if response.status_code == 200:
                print("[%s] tested %s, result: usable" % (threading.current_thread().name, proxies))
                ip_pool.append(ip)
        except requests.RequestException:
            print("[%s] tested %s, result: not usable" % (threading.current_thread().name, proxies))


if __name__ == '__main__':
    # Create the queue and put the scraped IPs into it
    queue_list = queue.Queue()
    for i in test_list:
        queue_list.put(i)
    # Create the worker threads
    out_thread = [threading.Thread(target=test_ip, args=(queue_list,), name="Thread-%s" % item)
                  for item in range(5)]
    for thread in out_thread:
        thread.start()
    for thread in out_thread:
        thread.join()
    print('Testing finished')
    print(ip_pool)
    print("Scraped %s IPs in total: %s usable, %s not usable"
          % (len(test_list), len(ip_pool), len(test_list) - len(ip_pool)))


The test URL does not need to be anything elaborate; a site such as www.baidu.com works. One blogger recommended a dedicated test site: http://icanhazip.com/
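
icanhazip.com answers with the caller's public IP as plain text, which allows a stricter test than "the request did not fail". A minimal sketch, assuming the proxy exits from the same address it listens on (not true of every proxy, so a False result here is only a hint): compare the echoed IP with the host part of the proxy URL. The function name proxy_hides_client and the sample address are illustrative, and the network call is shown only as a comment.

```python
from urllib.parse import urlsplit


def proxy_hides_client(proxy_url, reported_ip):
    """Return True if the IP echoed by the test site is the proxy's own IP."""
    proxy_host = urlsplit(proxy_url).hostname
    # The echoed text usually ends with a newline, so strip it first
    return reported_ip.strip() == proxy_host


# Offline demonstration with a documentation-only address.
print(proxy_hides_client("http://192.0.2.10:8080", "192.0.2.10\n"))

# With network access, reported_ip would come from the test site, e.g.:
# import requests
# proxy = "http://192.0.2.10:8080"
# text = requests.get("http://icanhazip.com", proxies={"http": proxy}, timeout=3).text
# print(proxy_hides_client(proxy, text))
```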

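I hit one pitfall during testing: I did not pay attention to whether each proxy was http or https and used http for all of them, and then every IP appeared to work, which cannot be right. After fixing this, about twenty-five IPs passed the test.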
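
One likely source of this kind of false positive is how the proxies mapping is matched: requests looks up the proxy by the scheme of the target URL, so when no key matches, the request goes out directly and succeeds regardless of the proxy. The function below is a simplified stand-in for that lookup, not the library's own code; it ignores per-host keys, the 'all' key, and proxy settings from environment variables.

```python
from urllib.parse import urlsplit


def proxy_used_for(url, proxies):
    """Simplified lookup: return the proxy registered for the URL's scheme."""
    return proxies.get(urlsplit(url).scheme)


# Documentation-only address, not a real proxy.
http_only = {"http": "http://192.0.2.10:8080"}

# Scheme matches: the request would go through the proxy.
print(proxy_used_for("http://icanhazip.com", http_only))
# Scheme does not match: None, i.e. the request would bypass the proxy.
print(proxy_used_for("https://www.csdn.net/", http_only))
```

A test only says something about a proxy when the test URL and the mapping key share a scheme.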

I also wrote a scraper for https://www.kuaidaili.com/free/intr/ (the IPs are not yet post-processed), but a single page of that site lists few IPs, so I did not test them.

IPPool2.py

import requests
from lxml import etree
from fake_useragent import UserAgent

# Disguise the request with a random User-Agent
ua = UserAgent()
headers = {'User-Agent': ua.random}


def get_ip():
    ip_list = []
    url = 'https://www.kuaidaili.com/free/intr/'
    # Send the request
    response = requests.get(url=url, headers=headers)
    # Set the encoding
    response.encoding = response.apparent_encoding
    html = etree.HTML(response.text)
    tr_list = html.xpath('//*[@id="list"]/table/tbody/tr')
    for tr in tr_list:
        # Only the bare IP is collected; port and protocol are not added yet
        ip = tr.xpath('./td[1]/text()')[0]
        ip_list.append(ip)
    return ip_list


if __name__ == '__main__':
    ip_list = get_ip()
    # print(ip_list)
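
IPPool2.py returns bare IPs, while the test scripts expect full URLs such as http://ip:port. A minimal sketch of the missing post-processing step follows; the helper name build_proxy_url is hypothetical, and which table columns hold the port and protocol on that site is not shown in this post, so the function takes them as plain arguments rather than assuming an XPath.

```python
def build_proxy_url(ip, port, scheme="http"):
    """Assemble 'scheme://ip:port' from fields scraped out of a table row."""
    scheme = scheme.strip().lower()
    if scheme not in ("http", "https"):
        raise ValueError("unsupported proxy scheme: %r" % scheme)
    return "%s://%s:%s" % (scheme, ip.strip(), str(port).strip())


# Documentation-only address; scraped cells often carry stray whitespace.
print(build_proxy_url(" 192.0.2.10 ", "8080", "HTTP"))
```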
