Scraping Lagou Keeps Failing? Try This Trick

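If you have ever scraped Lagou (拉勾网), you know it is not an easy site to scrape.

As you would expect from a job site built specifically for people working in the internet industry……

So what anti-scraping mechanisms does Lagou use? One is a cookie check, and the other is a limit on how often a single IP may send requests.

In this scrape, the obstacle I hit was not the cookie check but the per-IP rate limit.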

Getting past the anti-scraping measures

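I picked the “数据运营” (data operations) category from Lagou's built-in job categories.

On my first attempt, I ran into the following problem……

Looking at the response page that came back, I found……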

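Even with request headers added, a verification page appeared every time the scraper reached the 6th record. Lagou had detected that the same IP was sending requests too quickly and triggered its verification mechanism, so the page we want is only returned after the verification information is entered.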

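For a per-IP rate limit, using IP proxies is the ideal countermeasure. Alternatively, you can use the time module to lower the request frequency. The drawback is that this is very slow, but it is a reasonable choice when there is not much data to scrape.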
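As a rough illustration of the proxy approach, which this post does not actually use, the sketch below picks a random proxy from a pool and builds the mapping that the requests library accepts through its `proxies=` argument. The addresses in `PROXY_POOL` and the helper name `pick_proxy` are hypothetical placeholders, not working proxies.

```python
import random

# Hypothetical proxy addresses, for illustration only; replace with proxies you control.
PROXY_POOL = [
    "http://10.10.1.10:3128",
    "http://10.10.1.11:3128",
    "http://10.10.1.12:3128",
]


def pick_proxy(pool, rng=random):
    """Pick one proxy at random and return a mapping usable as requests' `proxies=`."""
    address = rng.choice(pool)
    # The same proxy is used for both plain HTTP and HTTPS traffic.
    return {"http": address, "https": address}


proxies = pick_proxy(PROXY_POOL)
# Usage with the requests library (not executed here):
#   requests.get(url, headers=headers, proxies=proxies, timeout=3)
```

Choosing a new proxy for each request spreads the traffic across several IPs, so no single address reaches the frequency that triggers verification. Whether this is enough for Lagou depends on the quality of the proxies, which is not tested here.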

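Below, I lengthened the interval between requests to try to fetch every “数据运营” posting, and it ran without errors!

The method: use the randint() function from the random module to pick a random number of seconds, then use the sleep() function from the time module to pause the program, placing the call right after the request to the site.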

import random,time
time.sleep(random.randint(10,15))

A random pause of 10 to 15 seconds worked best for me; when I tried 5 to 10 seconds, the scraper was still detected.
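If the pause is needed in more than one place, it can be wrapped in a small helper. This is only a sketch: the name `polite_pause` and its `sleep` parameter are my own additions, and the 10 to 15 second defaults simply mirror the range that worked above.

```python
import random
import time


def polite_pause(low=10, high=15, sleep=time.sleep):
    """Pause for a random whole number of seconds in [low, high] and return that number."""
    seconds = random.randint(low, high)  # randint includes both endpoints
    sleep(seconds)
    return seconds


# Usage (not executed here, since it would block for 10 to 15 seconds):
#   rep = requests.get(url, headers=headers, timeout=3)
#   polite_pause()
```

Passing `sleep` as a parameter makes it possible to swap in a stand-in function when trying the helper out, so nothing actually has to wait.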

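Scraping Lagou

For each posting, the script scrapes the following fields: company, industry, job title, experience requirement, education requirement, salary, responsibilities and requirements, and the link to the detail page.

The full code is as follows: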

import requests,random,time,re
from bs4 import BeautifulSoup
import pandas as pd

# Define an empty dict and empty lists to hold the scraped fields
job_all={}
company_content=[]
industry_content=[]
job_content=[]
experience_content=[]
education_content=[]
salary_content=[]
detail_content=[]
url_content=[]

# Scrape listing pages 1 through 8
for i in range(1,9):
    url='https://www.lagou.com/guangzhou-zhaopin/shujuyunying/{}/'
    url=url.format(i)
    headers={'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
    res=requests.get(url,headers=headers,timeout=3)
    res.encoding=res.apparent_encoding
    soup=BeautifulSoup(res.text,'html.parser')
    urls=soup.find_all('div',class_='p_top') # detail-page links for the postings on this page
    # Visit each detail link and extract the company, industry, job title,
    # experience requirement, education requirement, salary, and responsibilities and requirements
    for url in urls:
        url=url.find('a')['href']
        rep=requests.get(url,headers=headers,timeout=3)
        time.sleep(random.randint(10,15)) # pause the program to stay under the rate limit
        rep.encoding=rep.apparent_encoding
        soup=BeautifulSoup(rep.text,'html.parser')
        company=soup.find('em',class_='fl-cn').text.strip() # company
        industry=soup.find('h4',class_='c_feature_name').text.strip() # industry
        job=soup.find('h1',class_='name').text.strip() # job title
        detail=soup.find('div',class_='job-detail').text.strip() # responsibilities and requirements
        request=soup.find('dd',class_='job_request') # experience, education and salary
        # Extract the individual fields with a regular expression
        request_match=re.match('^<dd .*?<span class=.*?>(.*?) </span>.*?span>/(.*?) /</span.*?span>(.*?) /</span.*?span>(.*?) /</span.*?span>(.*?)</span.*?h3>',str(request),re.S)
        experience=request_match.group(3) # experience requirement
        education=request_match.group(4) # education requirement
        salary=request_match.group(1) # salary
        # Append this posting's fields to the lists
        company_content.append(company)
        industry_content.append(industry)
        job_content.append(job)
        experience_content.append(experience)
        education_content.append(education)
        salary_content.append(salary)
        detail_content.append(detail)
        url_content.append(url)

# Column names are kept in Chinese, as in the original output file
job_all['公司']=company_content # company
job_all['行业']=industry_content # industry
job_all['岗位']=job_content # job title
job_all['经验']=experience_content # experience
job_all['学历']=education_content # education
job_all['工资']=salary_content # salary
job_all['职责和要求']=detail_content # responsibilities and requirements
job_all['详情链接']=url_content # detail link

df=pd.DataFrame(job_all,columns=['公司','行业','岗位','经验','学历','工资','职责和要求','详情链接'])
df.to_excel('拉勾网数据运营岗.xlsx')
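One fragile spot in the script is the regular expression. If a detail page comes back as the verification page, the lookups return None and calling `.group()` on a failed match raises AttributeError, which stops the whole run. As a possible alternative, not the method used above, the sketch below collects the `<span>` contents and returns None when the expected fields are missing. The function name `parse_job_request` and the sample markup are hypothetical; the markup is only shaped like what the original regular expression expects.

```python
import re

SPAN_RE = re.compile(r"<span[^>]*>(.*?)</span>", re.S)


def parse_job_request(html):
    """Return salary, experience and education from a job_request block, or None."""
    # Strip surrounding whitespace and the "/" separators around each span's text.
    spans = [s.strip().strip("/").strip() for s in SPAN_RE.findall(html)]
    if len(spans) < 4:
        return None  # e.g. a verification page with no job_request block
    # Same positions as groups 1, 3 and 4 in the original regular expression.
    return {"salary": spans[0], "experience": spans[2], "education": spans[3]}


# Hypothetical markup for illustration; the real page may differ.
sample = (
    '<dd class="job_request"><h3>'
    '<span class="salary">10k-15k </span>'
    '<span>/广州 /</span>'
    '<span>经验1-3年 /</span>'
    '<span>本科及以上 /</span>'
    '<span>全职</span>'
    '</h3></dd>'
)
print(parse_job_request(sample))
print(parse_job_request("None"))  # str(None), as produced when soup.find() finds nothing
```

In the scraping loop, a None result could be used to skip the posting or to pause longer before retrying, instead of letting the run crash partway through.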

WeChat official account: 「Python编程小记」. It regularly shares study notes; feel free to follow!