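Series

[Practical Tools Series: Crawlers] Crawling proxy IPs with Python (countering anti-crawler measures)
[Practical Tools Series: Crawlers] Quickly crawling financial news with Python (countering anti-crawler measures)

This article shows how to crawl proxy IPs with Python while working around the target site's anti-crawler measures.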

Environment

Ubuntu 16.04
Python 3

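Crawling method

Proxy IP site: https://www.xicidaili.com

Steps
1. Crawl the listing pages in order of page id.
2. Parse the IP and port out of each page with a regular expression.
3. Save the IP and port information.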
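
The parsing step can be tried offline before touching the network. The sketch below is only an illustration: SAMPLE_HTML is a made-up fragment shaped like the rows the article's own code comment describes (an IP cell directly followed by a port cell), and parse_ip_ports is a hypothetical helper name, not part of the script shown later.

```python
import re

# Hypothetical fragment, assumed to resemble the site's table rows:
# one <td> holding the IP, immediately followed by one <td> holding the port.
SAMPLE_HTML = """
<tr>
  <td>222.89.32.150</td>
  <td>9999</td>
</tr>
<tr>
  <td>10.0.0.1</td>
  <td>8080</td>
</tr>
"""

# Two capture groups: the dotted IP and the port number.
IP_PORT_PATTERN = re.compile(
    r"<td>(\d{1,3}(?:\.\d{1,3}){3})</td><td>(\d{1,5})</td>"
)


def parse_ip_ports(html):
    """Return (ip, port) string pairs found in adjacent <td> cells."""
    # Remove all whitespace so the two cells sit next to each other.
    compact = re.sub(r"\s+", "", html)
    return IP_PORT_PATTERN.findall(compact)


pairs = parse_ip_ports(SAMPLE_HTML)
print(pairs)
```

The pattern only checks the shape of the text, so it would also accept an out-of-range address such as 999.1.1.1; validating each octet is left out to keep the sketch short.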
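Countering anti-crawler measures
Because https://www.xicidaili.com has anti-crawler protection, the steps above are refined as follows:
1. Crawl page 1 first and extract its IPs and ports.
2. Use the IPs and ports from step 1 as proxies.
3. Crawl the IPs and ports on the remaining pages through those proxies.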
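
Step 2 amounts to turning each (ip, port) pair into the mapping that urllib.request.ProxyHandler accepts and picking one pair per request. The sketch below builds the opener without sending anything. The helper names to_proxy_dict and build_proxy_opener are hypothetical, the address in first_page is a placeholder, and registering the proxy for https as well as http assumes the proxy can tunnel HTTPS, which free proxies often cannot.

```python
import random
import urllib.request


def to_proxy_dict(ip, port):
    """Map one (ip, port) pair to the scheme -> proxy URL mapping ProxyHandler expects."""
    address = "http://%s:%s" % (ip, port)
    # Assumption: the same proxy is used for both schemes. Whether a given
    # free proxy really handles HTTPS has to be checked in practice.
    return {"http": address, "https": address}


def build_proxy_opener(ip_port_list, rng=random):
    """Pick one proxy at random and return (opener, proxy_dict)."""
    ip, port = rng.choice(ip_port_list)
    proxy_dict = to_proxy_dict(ip, port)
    handler = urllib.request.ProxyHandler(proxy_dict)
    return urllib.request.build_opener(handler), proxy_dict


# Placeholder data standing in for the pairs extracted from page 1.
first_page = [("222.89.32.150", "9999")]
opener, chosen = build_proxy_opener(first_page)
print(chosen)
```

A later call such as opener.open(url, timeout=60) would then route the request through the chosen proxy. Picking a new proxy for each page spreads the requests across many source addresses, which is the point of the refinement.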
Code walkthrough
crawl_proxy_ip.py

import os
import pickle
import random
import re
from urllib import request


def crawl_proxy_ip(url, proxy_ip_dict=None):
    # Download one listing page (directly, or through a proxy when one is
    # given) and return the (ip, port) pairs found on it.
    if proxy_ip_dict is None:
        # Add a User-Agent header to imitate a browser.
        headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
        # Check whether the site expects GET or POST; try both if unsure.
        req = request.Request(url=url, data=None, headers=headers, method='GET')
        response = request.urlopen(req, timeout=60)
        html = response.read().decode('utf-8')
    else:
        html = download_by_proxy(url, proxy_ip_dict)
    ip_port_list = extract_ip(html)
    print(url, len(ip_port_list))
    return ip_port_list


def download_by_proxy(url, proxy_ip_dict):
    headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
               'Connection': 'close'}
    proxy_handler = request.ProxyHandler(proxy_ip_dict)
    opener = request.build_opener(proxy_handler)
    req = request.Request(url, headers=headers)
    response = opener.open(req, timeout=60)
    return response.read().decode('utf-8')


def extract_ip(html):
    # Strip whitespace so adjacent cells look like
    # <td>222.89.32.150</td><td>9999</td>
    html = html.replace(' ', '').replace('\r', '').replace('\n', '')
    pattern = r'<td>(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})</td><td>(\d{1,5})</td>'
    # findall returns (ip, port) tuples; dict.fromkeys drops repeated pairs
    # while keeping first-seen order.
    return list(dict.fromkeys(re.findall(pattern, html)))


def crawl_web(first_url, max_number, proxy_ip_list, max_retries=10):
    # max_retries is an added safeguard: the original retried a failing page
    # forever, which never ends if every proxy is dead.
    data = []
    for i in range(1, max_number + 1):
        url = first_url + str(i)
        for _ in range(max_retries):
            try:
                # Retry each failure with another randomly chosen proxy.
                proxy_ip_dict = random.choice(proxy_ip_list)
                data += crawl_proxy_ip(url, proxy_ip_dict)
                break
            except Exception as e:
                print('error:', url, e)
    with open('proxy_ip.pkl', 'wb') as f:
        pickle.dump(data, f)
    print('done!')


def load_proxy_ip(path):
    with open(path, 'rb') as f:
        data = pickle.load(f)
    proxy_ip_list = []
    for ip, port in data:
        address = 'http://%s:%s' % (ip, port)
        # The target URLs are https, and ProxyHandler only applies a proxy
        # whose key matches the URL scheme, so register both schemes.
        proxy_ip_list.append({'http': address, 'https': address})
    return proxy_ip_list


if __name__ == '__main__':
    # Crawl page 1 directly; its entries serve as the proxy pool.
    if not os.path.exists('proxy_ip-1.pkl'):
        ip_port_list = crawl_proxy_ip('https://www.xicidaili.com/nn/1')
        with open('proxy_ip-1.pkl', 'wb') as f:
            pickle.dump(ip_port_list, f)
    # Use the page-1 proxies to crawl the remaining pages.
    proxy_ip_list = load_proxy_ip('proxy_ip-1.pkl')
    crawl_web('https://www.xicidaili.com/nn/', 50, proxy_ip_list)  # first 50 pages

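Code notes
1. The code above crawls the first 50 listing pages.
2. It first crawls the IPs and ports on page 1 and uses them as proxies.
3. The final result is saved in proxy_ip.pkl.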
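
To reuse the saved result, the pickle file can be read back as a list of (ip, port) pairs. The sketch below is a round trip on a temporary demo file rather than the real proxy_ip.pkl. The helper names save_pairs and load_unique_pairs and the sample addresses are hypothetical, and the de-duplication step is an addition for the case where the same proxy appears on several pages.

```python
import os
import pickle
import tempfile


def save_pairs(pairs, path):
    """Write a list of (ip, port) pairs in the same pickle form the script uses."""
    with open(path, "wb") as f:
        pickle.dump(pairs, f)


def load_unique_pairs(path):
    """Load (ip, port) pairs and drop duplicates, keeping first-seen order."""
    with open(path, "rb") as f:
        pairs = pickle.load(f)
    return list(dict.fromkeys(tuple(p) for p in pairs))


# Round trip on a throwaway file with made-up addresses.
demo_path = os.path.join(tempfile.mkdtemp(), "proxy_ip_demo.pkl")
save_pairs(
    [("10.0.0.1", "8080"), ("10.0.0.2", "3128"), ("10.0.0.1", "8080")],
    demo_path,
)
unique = load_unique_pairs(demo_path)
for ip, port in unique:
    print("%s:%s" % (ip, port))
```

Only unpickle files you created yourself, since pickle.load can execute arbitrary code from an untrusted file.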

If this method helped you, please give it a like. Thanks!
