Python Notes: Scraping Data with Rotating Proxy IPs

Screenshot of the scraper running against a site:

The proxy switching here is done with a product from Abuyun (阿布云).

The documentation for this product is quite complete!

I recommend it to everyone:

The key code is as follows:
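As a minimal sketch of the proxy setup, assuming the requests library and the tunnel endpoint that appears in the full source (http-dyn.abuyun.com, port 9020): requests accepts a `proxies` mapping from URL scheme to proxy URL, and the credentials go into that URL. The helper name `build_proxies` and the placeholder credentials are hypothetical, introduced only for illustration.

```python
from urllib.parse import quote


def build_proxies(user, password, host="http-dyn.abuyun.com", port=9020):
    """Return a requests-style proxies mapping for an authenticated HTTP proxy."""
    # Percent-encode the credentials so characters such as '@' or ':' cannot break the URL.
    proxy_url = "http://%s:%s@%s:%d" % (
        quote(user, safe=""),
        quote(password, safe=""),
        host,
        port,
    )
    # The same proxy URL is used for both plain HTTP and HTTPS targets.
    return {"http": proxy_url, "https": proxy_url}


# Placeholder credentials (hypothetical); substitute the ones issued for your account.
proxies = build_proxies("YOUR_USER", "YOUR_PASS")
print(proxies["http"])
```

The resulting mapping would then be passed on each call, for example `session.get(url, proxies=proxies, timeout=6)`.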

Keep in mind that the proxy may fail, so remember to add exception handling.
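One way to handle a flaky proxy is a bounded retry loop. The sketch below uses only the standard library so it runs without network access; `call_with_retry` and `flaky_fetch` are hypothetical names, and the built-in `ConnectionError` stands in for a proxy failure. With requests, the exception tuple would typically be `(requests.RequestException,)`, the base class of its proxy and timeout errors.

```python
import time


def call_with_retry(func, exceptions=(Exception,), attempts=3, delay=0.0):
    """Call func() until it succeeds or the attempts run out, then re-raise the last error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as err:
            last_error = err
            # Wait before the next try, but not after the final failure.
            if attempt < attempts:
                time.sleep(delay)
    raise last_error


# Simulated request that fails twice (as a dropped proxy might) and then succeeds.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("proxy dropped the connection")
    return "ok"


result = call_with_retry(flaky_fetch, exceptions=(ConnectionError,), attempts=5)
print(result, calls["n"])
```

Capping the number of attempts is a design choice: unlike an unbounded `while True` loop, the caller eventually sees the error if the proxy stays down.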

The full source is as follows:

import re
import time

import requests


class HandleLaGou(object):
    # Listing page for one city; requesting it also sets the cookies used by the Ajax API
    LIST_URL = "https://www.lagou.com/jobs/list_python?city=%s&cl=false&fromSearch=true&labelWords=&suginput="

    def __init__(self):
        self.laGou_session = requests.session()
        self.header = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
        }
        self.city_list = []

    # Fetch the list of all cities
    def handle_city(self):
        city_search = re.compile(r'zhaopin/">(.*?)</a>')
        city_url = "https://www.lagou.com/jobs/allCity.html"
        city_result = self.handle_request(method="GET", url=city_url)
        self.city_list = city_search.findall(city_result)
        self.laGou_session.cookies.clear()

    def handle_city_job(self, city):
        first_request_url = self.LIST_URL % city
        first_response = self.handle_request(method="GET", url=first_request_url)
        total_page_search = re.compile(r'class="span\stotalNum">(\d+)</span>')
        match = total_page_search.search(first_response)
        if match is None:
            # No page count on the listing page, so there is nothing to crawl for this city
            return
        total_page = int(match.group(1))
        page_url = "https://www.lagou.com/jobs/positionAjax.json?city=%s&needAddtionalResult=false" % city
        # City names are Chinese: send the Referer as UTF-8 bytes,
        # because str header values must be latin-1 encodable
        self.header['Referer'] = (self.LIST_URL % city).encode()
        for i in range(1, total_page + 1):
            data = {
                "pn": i,
                "kd": "python"
            }
            response = self.handle_request(method="POST", url=page_url, data=data, info=city)
            print(response)

    def refresh_cookies(self, info):
        # Clear the cookies first, then request the listing page again to obtain fresh ones
        self.laGou_session.cookies.clear()
        if info is not None:
            self.handle_request(method="GET", url=self.LIST_URL % info)
        time.sleep(10)

    def handle_request(self, method, url, data=None, info=None):
        # Abuyun dynamic-IP tunnel; replace user/pass with your own credentials
        proxyinfo = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
            "host": "http-dyn.abuyun.com",
            "port": 9020,
            "user": "V21C9SWA4CQ3FSHD",
            "pass": "1DF3191F6103Q34",
        }
        proxy = {
            "http": proxyinfo,
            "https": proxyinfo
        }
        while True:
            try:
                if method == "GET":
                    response = self.laGou_session.get(url=url, headers=self.header, proxies=proxy, timeout=6)
                elif method == "POST":
                    response = self.laGou_session.post(url=url, headers=self.header, data=data, proxies=proxy, timeout=6)
                else:
                    raise ValueError("unsupported method: %s" % method)
            except requests.RequestException:
                # The proxy can fail or time out: refresh the cookies, wait, and retry
                self.refresh_cookies(info)
                continue
            response.encoding = 'utf-8'
            # '频繁' ("too frequent") shows up in the Ajax response when the site rate-limits us
            if method == "POST" and '频繁' in response.text:
                self.refresh_cookies(info)
                continue
            return response.text


if __name__ == '__main__':
    laGou = HandleLaGou()
    laGou.handle_city()
    for city in laGou.city_list:
        laGou.handle_city_job(city)
        break  # stops after the first city; remove this line to crawl every city
