How to Get Free Proxy IPs with a Python Crawler


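Web crawling has always been one of the major uses of Python, and many websites have deployed anti-crawling measures in response. One of them is to ban the IP address of any client that sends requests too frequently. It is a blunt approach that also hurts the site itself ("kill a thousand enemies, lose eight hundred of your own"), yet many sites use it. Proxy IPs let a crawler avoid such bans.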

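Crawling and testing validity

With the analysis done, we can start crawling the IPs. The third-party libraries requests and BeautifulSoup4 make the scraping very convenient. The code is as follows: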

import requests
from bs4 import BeautifulSoup as Soup

from iptools import header, dict2proxy


def parse_items(items):
    # List of dicts holding the info for each IP
    ips = []
    for item in items:
        tds = item.find_all('td')
        # Read the IP, port, and type from their column positions
        ip, port, _type = tds[1].text, int(tds[2].text), tds[5].text
        ips.append({'ip': ip, 'port': port, 'type': _type})
    return ips


def check_ip(ip):
    try:
        proxy = dict2proxy(ip)
        url = 'https://www.ipip.net/'
        r = requests.get(url, headers=header, proxies=proxy, timeout=5)
        r.raise_for_status()
    except Exception:
        return False
    else:
        return True


def get_proxies(index):
    # The page-path pattern is truncated in the source;
    # the trailing '%d' is only a placeholder for the page number.
    url = 'http://zhimaruanjian.com//%d' % index
    r = requests.get(url, headers=header)
    r.encoding = r.apparent_encoding
    r.raise_for_status()
    soup = Soup(r.text, 'lxml')
    # The first row is the table header and must be dropped
    items = soup.find_all('tr')[1:]
    ips = parse_items(items)
    good_proxies = []
    for ip in ips:
        if check_ip(ip):
            good_proxies.append(ip)
    return good_proxies

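As the code above shows, I test validity by requesting an IP-lookup site through each proxy, so the IPs that pass can generally be used right away.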
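Scraped table rows can contain malformed addresses or ports, and every network check can cost up to the 5-second timeout. The following is a minimal sketch of a cheap local filter that could run before `check_ip`, assuming the same `{'ip', 'port', 'type'}` dictionaries that `parse_items` produces. The helper name `is_well_formed` and the restriction to HTTP/HTTPS types are assumptions, not part of the original code.

```python
import ipaddress


def is_well_formed(entry):
    """Return True if entry has a valid IP address, port, and proxy type.

    `entry` is assumed to look like {'ip': '1.2.3.4', 'port': 8080, 'type': 'HTTP'}.
    This is only a local sanity check; it does not test whether the proxy works.
    """
    try:
        ipaddress.ip_address(entry['ip'])  # raises ValueError on bad input
    except (KeyError, ValueError):
        return False
    port = entry.get('port')
    if not isinstance(port, int) or not 1 <= port <= 65535:
        return False
    # Assumption: only HTTP and HTTPS proxies are of interest here
    return str(entry.get('type', '')).lower() in ('http', 'https')


# Sample data using documentation-only addresses (203.0.113.0/24)
candidates = [
    {'ip': '203.0.113.7', 'port': 8080, 'type': 'HTTP'},
    {'ip': 'not-an-ip', 'port': 80, 'type': 'HTTP'},
    {'ip': '203.0.113.8', 'port': 70000, 'type': 'HTTPS'},
]
well_formed = [c for c in candidates if is_well_formed(c)]
print(well_formed)
```

Only entries that pass this filter would then be handed to the slower network check.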

Writing to a JSON file

The collected IPs can be stored in a JSON file. The json module is easy to use: open a file and write to it with the dump method.

import json


def write_to_json(ips):
    with open('proxies.json', 'w', encoding='utf-8') as f:
        json.dump(ips, f, indent=4)
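Reading the file back is the mirror image. Here is a minimal sketch, assuming the file holds the list of `{'ip', 'port', 'type'}` dictionaries written above; the helper names `load_proxies` and `pick_proxy` are hypothetical. The proxy mapping is built in the same `type://ip:port` form that `dict2proxy` uses. The demonstration writes to a temporary directory so that it does not touch a real `proxies.json`.

```python
import json
import os
import random
import tempfile


def load_proxies(path='proxies.json'):
    """Read back the list of proxy dicts written with json.dump."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)


def pick_proxy(ips, rng=random):
    """Choose one entry at random and format it as a requests-style proxies mapping."""
    entry = rng.choice(ips)
    address = '%s://%s:%d' % (entry['type'].lower(), entry['ip'], entry['port'])
    return {'http': address, 'https': address}


# Round trip with sample data (documentation-only address)
sample = [{'ip': '203.0.113.7', 'port': 8080, 'type': 'http'}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'proxies.json')
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(sample, f, indent=4)
    loaded = load_proxies(path)

print(pick_proxy(loaded))
```

The returned mapping can be passed as the `proxies` argument of `requests.get`.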

Writing to MongoDB

Once the data is in a database, retrieving and manipulating it becomes very convenient.

from pymongo import MongoClient as Client


def write_to_mongo(ips):
    client = Client(host='localhost', port=27017)
    db = client['proxies_db']
    coll = db['proxies']
    for ip in ips:
        # Cursor.count() was removed in PyMongo 4; count_documents() replaces it
        if coll.count_documents({'ip': ip['ip']}) == 0:
            coll.insert_one(ip)
    client.close()

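After writing, you can inspect the data with RoboMongo.

Using multithreading

Import the threading package and wrap Thread in a small subclass to get the final code: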

get_proxies.py

import json
import threading
import time

import requests
from bs4 import BeautifulSoup as Soup
from pymongo import MongoClient as Client

from proxies_get.iptools import header, dict2proxy


def parse_items(items):
    # List of dicts holding the info for each IP
    ips = []
    for item in items:
        tds = item.find_all('td')
        # Read the IP, port, and type from their column positions
        ip, port, _type = tds[1].text, int(tds[2].text), tds[5].text.lower()
        ips.append({'ip': ip, 'port': port, 'type': _type})
    return ips


def check_ip(ip, good_proxies):
    try:
        pro = dict2proxy(ip)
        url = 'https://www.ipip.net/'
        r = requests.get(url, headers=header, proxies=pro, timeout=5)
        r.raise_for_status()
        print(r.status_code, ip['ip'])
    except Exception:
        # Any failure means the proxy is unusable, so skip it
        pass
    else:
        good_proxies.append(ip)


def write_to_json(ips):
    with open('proxies.json', 'w', encoding='utf-8') as f:
        json.dump(ips, f, indent=4)


def write_to_mongo(ips):
    '''Write the data to MongoDB'''
    client = Client(host='localhost', port=27017)
    db = client['proxies_db']
    coll = db['proxies']
    # Check first, then insert, to avoid duplicates
    for ip in ips:
        # Cursor.count() was removed in PyMongo 4; count_documents() replaces it
        if coll.count_documents({'ip': ip['ip']}) == 0:
            coll.insert_one(ip)
    client.close()


class GetThread(threading.Thread):
    '''Wrapper around Thread that collects the working proxies of one page'''

    def __init__(self, args):
        threading.Thread.__init__(self, args=args)
        self.good_proxies = []

    def run(self):
        # The page-path pattern is truncated in the source;
        # the trailing '%d' is only a placeholder for the page number.
        url = 'http://zhimaruanjian.com/%d' % self._args[0]
        # Send the request
        r = requests.get(url, headers=header)
        r.encoding = r.apparent_encoding
        r.raise_for_status()
        soup = Soup(r.text, 'lxml')
        # The first row is the table header and must be dropped
        items = soup.find_all('tr')[1:]
        ips = parse_items(items)
        threads = []
        for ip in ips:
            # Check each proxy in its own thread
            t = threading.Thread(target=check_ip, args=[ip, self.good_proxies])
            t.start()
            time.sleep(0.1)
            threads.append(t)
        for t in threads:
            t.join()

    def get_result(self):
        return self.good_proxies


if __name__ == '__main__':
    # The main routine uses one thread per page
    threads = []
    for i in range(1, 30):
        t = GetThread(args=[i])
        t.start()
        time.sleep(10)
        threads.append(t)
    for t in threads:
        t.join()
    for t in threads:
        proxies = t.get_result()
        write_to_mongo(proxies)

iptools.py

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                        'AppleWebKit/537.36 (KHTML, like Gecko) '
                        'Chrome/64.0.3282.186 Safari/537.36'}


def dict2proxy(dic):
    s = dic['type'] + '://' + dic['ip'] + ':' + str(dic['port'])
    return {'http': s, 'https': s}
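As an alternative to starting one thread per proxy by hand, the same fan-out can be sketched with the standard library's `concurrent.futures.ThreadPoolExecutor`, which caps the number of concurrent checks. This is not the approach used in the article. The function name `filter_working` is hypothetical, and `check` stands for any callable that takes one proxy dict and returns a bool, such as the first, boolean-returning version of `check_ip`. The demonstration uses a stand-in checker so that it runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_working(ips, check, max_workers=20):
    """Run check(ip) for every entry in parallel and keep those that return True."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with ips
        results = list(pool.map(check, ips))
    return [ip for ip, ok in zip(ips, results) if ok]


def fake_check(ip):
    """Stand-in for a real network check: pretend only port 8080 works."""
    return ip['port'] == 8080


sample = [
    {'ip': '203.0.113.7', 'port': 8080, 'type': 'http'},
    {'ip': '203.0.113.8', 'port': 3128, 'type': 'http'},
    {'ip': '203.0.113.9', 'port': 8080, 'type': 'https'},
]
print(filter_working(sample, fake_check))
```

Because each check returns its result instead of appending to a shared list, no state is shared between threads, and `max_workers` bounds how many requests are in flight at once.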

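Summary

There is nothing especially difficult about this free-proxy crawler. The only catch is that the server is rather weak and starts returning 503 if you are not careful, so the request rate needs to be limited. Free proxies can also degrade the results you get, so you may prefer a commercial proxy provider's IP service, which is more stable and secure.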
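One way to cope with the 503 responses is to retry with an increasing delay. The sketch below illustrates that idea and is not code from the article. The name `fetch_with_backoff` and its parameters are hypothetical, and `fetch` is assumed to return an object with a `status_code` attribute, as `requests.get` does. The `sleep` function is injectable so that the demonstration runs instantly and offline.

```python
import time


def fetch_with_backoff(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) and retry with exponential backoff while it returns 503."""
    for attempt in range(retries + 1):
        response = fetch(url)
        if response.status_code != 503:
            return response
        if attempt < retries:
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
    return response  # still 503 after all retries; the caller decides what to do


class FakeResponse:
    """Minimal stand-in for a response object."""

    def __init__(self, status_code):
        self.status_code = status_code


def make_flaky_fetch(failures):
    """Return a fetch function that answers 503 `failures` times, then 200."""
    state = {'calls': 0}

    def fetch(url):
        state['calls'] += 1
        return FakeResponse(503 if state['calls'] <= failures else 200)

    return fetch


delays = []  # record the requested delays instead of actually sleeping
response = fetch_with_backoff(make_flaky_fetch(2), 'http://example.com/',
                              sleep=delays.append)
print(response.status_code, delays)
```

With requests, `fetch` could be something like `lambda u: requests.get(u, headers=header, timeout=5)`.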
