
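Once a crawler exceeds a certain request rate or request count, its public IP usually gets banned. To crawl large volumes of data reliably, we typically buy proxy IPs in bulk on Taobao, usually at about 10 yuan for 100,000 IPs per day. A large share of those IPs are invalid, though, so you have to keep retrying or validating them yourself. In fact, the sellers simply scrape these IPs from proxy-listing websites.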

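In that case, why not scrape them ourselves? The basic idea is quite simple:

(1) Find a dedicated proxy IP website and parse the IPs out of its pages

(2) Verify that each IP actually works

(3) Store the valid IPs, or expose them as a service

A demo follows: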

import logging
import re
import socket

import requests
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.DEBUG)

HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}


def proxy_spider(page_num):
    """Scrape the first page_num listing pages and check every proxy found."""
    for i in range(page_num):
        url = 'http://www.xicidaili.com/wt/' + str(i + 1)
        r = requests.get(url=url, headers=HEADERS)
        soup = BeautifulSoup(r.text, "html.parser")
        # The pattern matches any class value, so this selects the table rows
        # that carry a class attribute.
        datas = soup.find_all(name='tr', attrs={'class': re.compile('|[^odd]')})
        for data in datas:
            proxy_contents = data.find_all(name='td')
            # Columns 1, 2 and 5 hold the IP, the port and the protocol.
            ip = str(proxy_contents[1].string)
            port = str(proxy_contents[2].string)
            protocol = str(proxy_contents[5].string)
            wan_proxy_check(ip, port, protocol)


def local_proxy_check(ip, port, protocol):
    """Return True if a TCP connection to ip:port succeeds within 1 second."""
    try:
        with socket.create_connection((ip, int(port)), timeout=1):
            pass
        logging.debug("{} {}".format(ip, port))
        return True
    except (OSError, ValueError):
        logging.debug("-------- {} {}".format(ip, port))
        return False


# Several ways to look up your public IP on Linux:
# https://my.oschina.net/epstar/blog/513186
# IP database API endpoints offered by the major internet companies (Sina, Sohu, Alibaba):
# http://zhaoshijie.iteye.com/blog/2205033
def wan_proxy_check(ip, port, protocol):
    """Request an IP-echo page through the proxy and compare the reported IP."""
    proxy = {protocol.lower(): '%s:%s' % (ip, port)}
    try:
        result = requests.get("http://pv.sohu.com/cityjson", headers=HEADERS,
                              proxies=proxy, timeout=1).text.strip("\n")
        wan_ip = re.findall(r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b", result)[0]
        if wan_ip == ip:
            # The echo page saw the proxy's address, so the proxy is usable.
            logging.info("{} {} {}".format(protocol, wan_ip, port))
            logging.debug("========================")
        else:
            logging.debug(" Proxy bad: {} {}".format(wan_ip, port))
    except Exception as e:
        logging.debug("#### Exception: {}".format(str(e)))


if __name__ == '__main__':
    proxy_spider(1)
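The demo stops at step (2): it only logs the proxies that pass the check. Step (3), storing the valid IPs, could be sketched roughly as below. This is a minimal illustration and not part of the original demo; the `ProxyStore` class, its method names, and the JSON file layout are all assumptions made for the example.

```python
import json
import random
from pathlib import Path


class ProxyStore:
    """Hypothetical helper: keep validated proxies in memory and in a JSON file."""

    def __init__(self, path):
        self.path = Path(path)
        self.proxies = set()
        if self.path.exists():
            # Each entry on disk is a [protocol, ip, port] triple.
            with self.path.open(encoding="utf-8") as f:
                self.proxies = {tuple(item) for item in json.load(f)}

    def add(self, protocol, ip, port):
        """Record one validated proxy; the set silently drops duplicates."""
        self.proxies.add((protocol.lower(), ip, str(port)))

    def save(self):
        """Write all known proxies to disk in a stable (sorted) order."""
        with self.path.open("w", encoding="utf-8") as f:
            json.dump(sorted(self.proxies), f, indent=2)

    def pick(self):
        """Return a random proxy as a requests-style proxies dict, or None."""
        if not self.proxies:
            return None
        protocol, ip, port = random.choice(sorted(self.proxies))
        return {protocol: "%s:%s" % (ip, port)}
```

One way to wire it in, under the same assumptions: create a store once, call `store.add(protocol, ip, port)` at the point where `wan_proxy_check` logs a matching IP, and call `store.save()` after `proxy_spider` returns. A later crawl can then pass `store.pick()` as the `proxies` argument of `requests.get`.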

References:

[1] Python crawler proxy IP pool (proxy_pool)

https://github.com/jhao104/proxy_pool

[2] Python crawler proxy IP pool

http://www.spiderpy.cn/blog/detail/13

[3] IPProxyTool: a Python Scrapy-based tool that crawls large numbers of free proxy IPs and extracts the working ones

https://github.com/awolfly9/IPProxyTool

Reposted from: https://my.oschina.net/leejun2005/blog/67349
