Getting Free Proxy IPs and Checking Their Validity

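A crawler requests pages very quickly and can easily exceed the request limit a website sets. When that happens, access is blocked or the IP is banned. With a set of proxy IPs to switch between, the website sees the requests as coming from different machines, so your own IP behind the proxies is not affected. Even with proxy IPs, do not hit the site too frequently: be considerate of the load you put on it.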

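1. Get free proxy IPs from http://www.xicidaili.com/nn/1. Open the page, view its source, analyze the structure, locate the data you need, and extract it with a regular expression. The regular expression is r'<td>(([1-9]\.|[1-9][0-9]\.|1[0-9]{2}\.|2[0-4][0-9]\.|25[0-5]\.){3}([1-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))</td>\s+<td>(\d{2,5})</td>'.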

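2. Save the proxy IPs for later use. Proxy IPs change quickly, though, and one may stop working within minutes, so it is enough to fetch a fresh batch whenever you need it. They can be stored in a file or in a database.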

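3. Check that the proxy IPs work. This check can run before each page fetch: if a proxy fails, switch to another one and remove the failed proxy from the list.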
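Step 3 can be sketched roughly as follows. This is a minimal illustration, not the code used later in this post: it folds the check into the fetch itself, treating any network error as "proxy dead". The names `open_via_proxy` and `fetch_with_rotation` are hypothetical, and the proxy list is assumed to hold plain `ip:port` strings.

```python
import http.client
import random
import urllib.request


def open_via_proxy(url, proxy, timeout=3):
    """Fetch url through one proxy; raises on any network failure."""
    handler = urllib.request.ProxyHandler({'http': proxy, 'https': proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as resp:
        return resp.read()


def fetch_with_rotation(url, proxies, fetch=open_via_proxy):
    """Try proxies in random order and return (proxy, body).

    Every proxy that fails is removed from the `proxies` list in place,
    so later calls do not waste time on it again.
    """
    while proxies:
        proxy = random.choice(proxies)
        try:
            return proxy, fetch(url, proxy)
        except (OSError, http.client.HTTPException):
            # URLError, timeouts and connection resets are all OSError
            # subclasses; drop the proxy and try another one.
            proxies.remove(proxy)
    raise RuntimeError('no working proxy left')
```

One caveat of this design: an HTTP error returned by the target site itself (for example a 403) also counts as a failure here, so a working proxy can occasionally be dropped. The `fetch` parameter exists so the rotation logic can be exercised with a stub instead of real network access.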

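The code is split into two files: one fetches the proxy IPs and the other checks their validity (a multi-process version of the check exists separately).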

# -*- coding: utf-8 -*-
'''
Fetch proxy IPs from www.xicidaili.com and save them to a file.
'''
import urllib.request as req
import time
import re
import random

text_html = r'd:/tmp/xici_html.txt'  # raw HTML of every fetched page
text_ips = r'd:/tmp/xici_ips.txt'    # extracted proxies, one ip:port per line


class Getxi():
    def __init__(self, page):
        self.page = page  # number of list pages to crawl
        self.url = r'http://www.xicidaili.com/nn/{}'

    def request_method(self, p):
        '''Build the request for list page p with browser-like headers.'''
        curr_time = time.time()
        sec = int(curr_time)
        micsec = int(round(curr_time * 1000))  # milliseconds, despite the name
        print(sec, ' == ', micsec)
        headers = {
            'Cache-Control': 'max-age=0',
            'Connection': 'Keep-Alive',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            # No Accept-Encoding header: urllib does not decompress gzip
            # responses, so get_html() could not decode them.
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
            'Host': 'www.xicidaili.com',
            'Referer': 'http://www.xicidaili.com/',
            'Pragma': 'no-cache',
            'Upgrade-Insecure-Requests': '1',
        }
        url_com = self.url.format(p)
        reqs = req.Request(url_com, headers=headers)
        return reqs

    def get_html(self, p):
        '''Download list page p and return it as text.'''
        reqss = self.request_method(p)
        conn = req.urlopen(reqss)
        html = conn.read().decode('utf-8')
        return html

    def save_html(self, ip_html):
        # The with block closes the file; no explicit close() is needed.
        with open(text_html, 'a', encoding='utf-8') as f:
            f.write(ip_html)

    def save_ips(self, ips):
        with open(text_ips, 'a', encoding='utf-8') as f:
            f.write(ips)

    def parse_html(self, ip_html):
        '''Extract ip:port pairs from the page and append them to text_ips.'''
        # NOTE: as written, this pattern does not match octets equal to 0
        # (for example 10.0.0.1).
        pattern = re.compile(r'<td>(([1-9]\.|[1-9][0-9]\.|1[0-9]{2}\.|2[0-4][0-9]\.|25[0-5]\.){3}([1-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))</td>\s+<td>(\d{2,5})</td>', re.S)
        tds = pattern.findall(ip_html)
        str1 = ''
        for td in tds:
            # td[0] is the full IP address, td[3] is the port
            str1 += '{}:{}\n'.format(td[0].strip(), td[3].strip())
        # print(str1)
        self.save_ips(str1)

    def crawler(self):
        for i in range(self.page):
            html = self.get_html(i + 1)
            self.save_html(html)
            self.parse_html(html)
            time.sleep(random.randint(5, 15))  # pause between pages


def xixi():
    page = 2
    xi = Getxi(page)
    xi.crawler()


if __name__ == '__main__':
    xixi()
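Steps 1 and 2 can be tried offline with a small sketch like the one below. It makes several assumptions that are not in the original post: the sample HTML is invented (using addresses from the reserved documentation ranges), the octet pattern is a variant that also accepts 0, and the helper names `parse_proxies` and `save_proxies` as well as the `proxies` table are hypothetical. The post only says a database can be used; SQLite is one possible choice.

```python
import re
import sqlite3

# Variant of the pattern used in the script: each octet is 0-255, including 0.
OCTET = r'(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])'
PROXY_RE = re.compile(
    r'<td>(' + OCTET + r'(?:\.' + OCTET + r'){3})</td>\s+<td>(\d{2,5})</td>'
)

# Invented sample rows that mimic an IP cell followed by a port cell.
SAMPLE_HTML = """
<tr>
  <td>192.0.2.10</td>
  <td>8080</td>
  <td>HTTP</td>
</tr>
<tr>
  <td>203.0.113.7</td>
  <td>80</td>
</tr>
<tr>
  <td>999.1.1.1</td>
  <td>3128</td>
</tr>
"""


def parse_proxies(html):
    """Return every ip:port pair found in html, in page order."""
    return ['{}:{}'.format(ip, port) for ip, port in PROXY_RE.findall(html)]


def save_proxies(conn, proxies):
    """Store proxies in SQLite; addresses already present are skipped."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS proxies ('
        'addr TEXT PRIMARY KEY, '
        'fetched_at TEXT DEFAULT CURRENT_TIMESTAMP)'
    )
    conn.executemany(
        'INSERT OR IGNORE INTO proxies (addr) VALUES (?)',
        [(p,) for p in proxies],
    )
    conn.commit()


conn = sqlite3.connect(':memory:')  # use a file path to keep the data
found = parse_proxies(SAMPLE_HTML)
save_proxies(conn, found)
save_proxies(conn, found)  # the second call inserts nothing new
rows = [r[0] for r in conn.execute('SELECT addr FROM proxies ORDER BY addr')]
print(rows)
```

The row with `999.1.1.1` is rejected because no octet alternative matches it. Keeping a `fetched_at` column makes it easy to discard old entries, which matters because free proxies go stale quickly.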

Validity check: the page requested through each proxy is http://2018.ip138.com/ic.asp

# -*- coding: utf-8 -*-
'''
Verify that the proxy IPs work.
'''
from urllib import request
import urllib.error
import time
import random
import socket
import http.client

ips_ok_file = r'd:/tmp/xici_1_ok.txt'  # working proxies are written here after the check
ips_file = r'd:/tmp/xici_ips.txt'  # list of proxies to check
url = 'http://2018.ip138.com/ic.asp'  # page used to test access through a proxy
User_Agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'

ok_ips = ''  # accumulates the working proxies, one per line


class CheckProxyIp():
    def __init__(self):
        pass

    def read_ips_file(self):
        '''Read the proxy list and check each entry in turn.'''
        with open(ips_file, 'r', encoding='utf-8') as f:
            ips = f.readlines()
        for ip in ips:
            i = ip.strip()
            if not i:
                continue  # skip blank lines
            self.check_ips(i)
            time.sleep(random.randint(1, 5))

    def check_ips(self, ip):
        '''Request the test page through the proxy; record it if the status is 200.'''
        global ok_ips
        proxy = {'http': ip, 'https': ip}
        print(proxy)
        proxy_handler = request.ProxyHandler(proxy)
        opener = request.build_opener(proxy_handler)
        opener.addheaders = [('User-Agent', User_Agent)]
        request.install_opener(opener)
        try:
            response = request.urlopen(url, timeout=3)  # uses the installed opener
            if response.getcode() == 200:
                html = response.read().decode('gbk')
                print(len(html))
                ok_ips += ip + '\n'
            else:
                print('no')
        except UnicodeDecodeError as e:
            print(e)
        except urllib.error.HTTPError as e:
            print(e)
        except urllib.error.URLError as e:
            print(e)
        except socket.timeout as e:
            print(e)
        except http.client.RemoteDisconnected as e:
            print(e)
        except ConnectionResetError as e:
            print(e)

    def save_ok_ip(self):
        '''Write the working proxies to ips_ok_file, replacing earlier results.'''
        global ok_ips
        print('save ....')
        print(ok_ips)
        with open(ips_ok_file, 'w', encoding='utf-8') as f:
            f.write(ok_ips)


def check():
    chcip = CheckProxyIp()
    chcip.read_ips_file()
    chcip.save_ok_ip()


if __name__ == '__main__':
    check()
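Checking proxies one at a time with a 3-second timeout is slow for long lists. The post mentions a multi-process version without showing it; the sketch below is not that code. It swaps in a thread pool from `concurrent.futures`, on the assumption that the work is network-bound so threads are sufficient. The names `is_alive` and `filter_alive` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import http.client
import urllib.request

TEST_URL = 'http://2018.ip138.com/ic.asp'  # same test page as the checker


def is_alive(proxy, timeout=3):
    """Return True if TEST_URL answers with status 200 through the proxy."""
    handler = urllib.request.ProxyHandler({'http': proxy, 'https': proxy})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(TEST_URL, timeout=timeout) as resp:
            return resp.getcode() == 200
    except (OSError, http.client.HTTPException):
        # Covers URLError, timeouts, resets and malformed responses.
        return False


def filter_alive(proxies, checker=is_alive, workers=10):
    """Check proxies in parallel; return those that passed, in input order."""
    proxies = list(proxies)
    with ThreadPoolExecutor(max_workers=workers) as executor:
        results = list(executor.map(checker, proxies))
    return [p for p, ok in zip(proxies, results) if ok]
```

A call such as `filter_alive(open(ips_file).read().split())` would return only the proxies that responded. Each call to `is_alive` builds its own opener instead of calling `install_opener`, because a globally installed opener would be shared between threads.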
