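Analyzing Lagou's API in the previous section showed that the site has quite a few anti-scraping measures, which makes crawling it through direct requests cumbersome. This section builds the crawler with selenium instead.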

1. Initialization

The crawler is implemented as a class. It first initializes its own state; the code is as follows:

    chrome_driver = r"F:\python\python_environment\chromedriver.exe"

    def __init__(self):
        # Selenium 3 style; recent Selenium 4 releases take a Service object instead of executable_path
        self.driver = webdriver.Chrome(executable_path=self.chrome_driver)
        self.url = 'https://www.lagou.com/jobs/list_python/p-city_0?&cl=false&fromSearch=true&labelWords=&suginput='
        self.positions = []

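selenium requests the page and retrieves its full content. To crawl multiple pages, we also need to locate the next-page element and simulate a click on it. Once the last page is reached the site no longer advances, so that condition can be used to stop the loop.
Inspecting the page shows that on the last page the button's class becomes pager_next pager_next_disabled, one class more than before, which can serve as the termination condition for paging. The code is as follows: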

    def run(self):
        self.driver.get(self.url)
        while True:
            # Explicit wait: block until the next-page button is present
            WebDriverWait(driver=self.driver, timeout=10).until(
                EC.presence_of_element_located(
                    (By.XPATH, '//div[@class="pager_container"]/span[last()]')
                )
            )
            # driver.page_source returns the whole rendered page, including
            # AJAX-loaded data; it is read after the wait so the list has loaded
            source = self.driver.page_source
            # print(source)
            self.prase_list_page(source)
            try:
                next_btn = self.driver.find_element(
                    By.XPATH, '//div[@class="pager_container"]/span[last()]'
                )
                if "pager_next pager_next_disabled" in next_btn.get_attribute("class"):
                    break
                else:
                    next_btn.click()
            except Exception:
                print(source)
            time.sleep(2)
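The stop condition above relies on a substring match against the button's class attribute. A slightly more defensive variant, sketched here, splits the attribute into tokens so the check does not depend on the order or spacing of the class names. The helper name `is_last_page` is hypothetical and not part of the original spider; the class names are the ones observed on the page.

```python
def is_last_page(class_attr):
    """Return True when the next-page button carries the disabled class.

    `class_attr` is the raw value of the element's class attribute,
    e.g. what next_btn.get_attribute("class") returns. It may be None
    when the attribute is missing.
    """
    tokens = (class_attr or "").split()
    return "pager_next_disabled" in tokens


# Example values mirroring the two states described in the text
print(is_last_page("pager_next pager_next_disabled"))
print(is_last_page("pager_next"))
```

In `run`, the condition could then read `if is_last_page(next_btn.get_attribute("class")): break`.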

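2. Getting the job URLs on each page

Because selenium returns the whole page, including AJAX-loaded data (which does not appear in the static page source), the page can be parsed directly to extract the link for each job posting. The code is as follows: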

    def prase_list_page(self, source):
        html = etree.HTML(source)
        urls = html.xpath('//div[@class="position"]/div/a[@class="position_link"]/@href')
        # print(urls)
        for url in urls:
            self.request_detail_page(url)
            time.sleep(1)
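To experiment with the link extraction without a browser or lxml, the same idea can be sketched with the standard library's `html.parser`. This is a swapped-in technique, not the author's XPath approach, and it is looser: it only checks that an `<a>` tag has the `position_link` class, not that it sits inside `div.position`. The class `PositionLinkParser`, the function `extract_position_urls`, and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser


class PositionLinkParser(HTMLParser):
    """Collect href values of <a> tags whose class includes position_link."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # Attribute values can be None for valueless attributes
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if "position_link" in classes and href:
            self.urls.append(href)


def extract_position_urls(source):
    """Return the list of job links found in an HTML string."""
    parser = PositionLinkParser()
    parser.feed(source)
    return parser.urls


# Made-up fragment that mimics the structure targeted by the XPath
sample = '''
<div class="position"><div>
  <a class="position_link" href="https://example.com/jobs/1.html">Job 1</a>
</div></div>
<a class="other" href="https://example.com/about">About</a>
'''
print(extract_position_urls(sample))
```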

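Next, request each job's detail page. Since paging has to continue from the list page, the list page must stay where it is, so each detail page is opened in a new window. The job information is parsed there, the window is closed, and the driver switches back to the list page. The code is as follows: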

    def request_detail_page(self, url):
        # self.driver.get(url)
        # Open the detail page in a new window and switch to it
        self.driver.execute_script("window.open('%s')" % url)
        self.driver.switch_to.window(self.driver.window_handles[1])
        # The XPath given to WebDriverWait must locate an element; it cannot match a text node
        WebDriverWait(driver=self.driver, timeout=10).until(
            EC.presence_of_element_located(
                (By.XPATH, '//div[@class="job-name"]/h1')
            )
        )
        source = self.driver.page_source
        self.prase_detail_page(source)
        # Close the current (detail) window
        self.driver.close()
        # Switch back to the original list window
        self.driver.switch_to.window(self.driver.window_handles[0])
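One weakness of the open/parse/close sequence is that a timeout or parsing error leaves the detail window open and the driver pointed at the wrong window. A possible remedy, sketched below under the assumption that `driver` is a selenium WebDriver, is a context manager that always closes the detail window and returns to the list window. The name `detail_tab` is hypothetical. Passing the URL through `arguments[0]` also avoids quoting problems when a URL contains a quote character.

```python
from contextlib import contextmanager


@contextmanager
def detail_tab(driver, url):
    """Open `url` in a new window, yield the driver, then close that
    window and switch back to the window that was active before."""
    list_handle = driver.current_window_handle
    # The URL is passed as a script argument rather than formatted into the string
    driver.execute_script("window.open(arguments[0])", url)
    # The newly opened window is the last handle in the list
    driver.switch_to.window(driver.window_handles[-1])
    try:
        yield driver
    finally:
        # Runs even if the body raises, so the list window is always restored
        driver.close()
        driver.switch_to.window(list_handle)
```

Inside `request_detail_page`, the explicit wait and the call to `prase_detail_page` would then go in the body of `with detail_tab(self.driver, url):`.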

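3. Parsing the detail page

Parsing needs a few small cleanup steps: strip() removes leading and trailing whitespace, and re.sub() with a regular expression removes forward slashes and whitespace from the fields. The code is as follows: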

    def prase_detail_page(self, source):
        html = etree.HTML(source)
        position_name = html.xpath('//div[@class="job-name"]/h1/text()')[0]
        job_request = html.xpath('//dd[@class="job_request"]//span')
        # Salary
        salary = job_request[0].xpath('.//text()')[0].strip()
        # City
        city = job_request[1].xpath('.//text()')[0]
        city = re.sub(r'[\s/]', '', city)
        # Years of experience
        work_years = job_request[2].xpath('.//text()')[0]
        work_years = re.sub(r'[\s/]', '', work_years)
        # Education
        eduction = job_request[3].xpath('.//text()')[0]
        eduction = re.sub(r'[\s/]', '', eduction)
        # Company name
        company_name = html.xpath('//h3[@class="fl"]/em/text()')[0].strip()
        # Job description
        desc = "".join(html.xpath('//dd[@class="job_bt"]//text()')).strip()
        position = {
            'name': position_name,
            'company_name': company_name,
            'salary': salary,
            'city': city,
            'work_years': work_years,
            'eduction': eduction,
            'desc': desc
        }
        print(position)
        self.positions.append(position)
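The cleanup pattern repeats for several fields, and every `[0]` lookup raises IndexError if the page layout changes and an XPath returns nothing. A small sketch of two helpers that factor this out is shown below. The names `clean_field` and `first_text` are hypothetical and not part of the original spider; the regular expression is the one used above.

```python
import re


def clean_field(text):
    """Remove all whitespace and forward slashes from a scraped field."""
    return re.sub(r"[\s/]", "", text)


def first_text(nodes, default=""):
    """Return the first item of an XPath result list, or `default` if it is empty."""
    return nodes[0] if nodes else default


# Example inputs shaped like the raw span text on the detail page
print(clean_field(" 上海 /"))
print(first_text([], default="N/A"))
```

With these, a field could be read as `clean_field(first_text(job_request[1].xpath('.//text()')))`. Note that `\s` removes whitespace inside the string as well, not only at the ends.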

4. The complete code is as follows:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from lxml import etree
import re
import time


class LagouSpider(object):
    chrome_driver = r"F:\python\python_environment\chromedriver.exe"

    def __init__(self):
        # Selenium 3 style; recent Selenium 4 releases take a Service object instead of executable_path
        self.driver = webdriver.Chrome(executable_path=self.chrome_driver)
        self.url = 'https://www.lagou.com/jobs/list_python/p-city_0?&cl=false&fromSearch=true&labelWords=&suginput='
        self.positions = []

    def run(self):
        self.driver.get(self.url)
        while True:
            # Explicit wait: block until the next-page button is present
            WebDriverWait(driver=self.driver, timeout=10).until(
                EC.presence_of_element_located(
                    (By.XPATH, '//div[@class="pager_container"]/span[last()]')
                )
            )
            # driver.page_source returns the whole rendered page, including
            # AJAX-loaded data; it is read after the wait so the list has loaded
            source = self.driver.page_source
            # print(source)
            self.prase_list_page(source)
            try:
                next_btn = self.driver.find_element(
                    By.XPATH, '//div[@class="pager_container"]/span[last()]'
                )
                if "pager_next pager_next_disabled" in next_btn.get_attribute("class"):
                    break
                else:
                    next_btn.click()
            except Exception:
                print(source)
            time.sleep(2)

    def prase_list_page(self, source):
        html = etree.HTML(source)
        urls = html.xpath('//div[@class="position"]/div/a[@class="position_link"]/@href')
        # print(urls)
        for url in urls:
            self.request_detail_page(url)
            time.sleep(1)

    def request_detail_page(self, url):
        # self.driver.get(url)
        # Open the detail page in a new window and switch to it
        self.driver.execute_script("window.open('%s')" % url)
        self.driver.switch_to.window(self.driver.window_handles[1])
        # The XPath given to WebDriverWait must locate an element; it cannot match a text node
        WebDriverWait(driver=self.driver, timeout=10).until(
            EC.presence_of_element_located(
                (By.XPATH, '//div[@class="job-name"]/h1')
            )
        )
        source = self.driver.page_source
        self.prase_detail_page(source)
        # Close the current (detail) window
        self.driver.close()
        # Switch back to the original list window
        self.driver.switch_to.window(self.driver.window_handles[0])

    def prase_detail_page(self, source):
        html = etree.HTML(source)
        position_name = html.xpath('//div[@class="job-name"]/h1/text()')[0]
        job_request = html.xpath('//dd[@class="job_request"]//span')
        # Salary
        salary = job_request[0].xpath('.//text()')[0].strip()
        # City
        city = job_request[1].xpath('.//text()')[0]
        city = re.sub(r'[\s/]', '', city)
        # Years of experience
        work_years = job_request[2].xpath('.//text()')[0]
        work_years = re.sub(r'[\s/]', '', work_years)
        # Education
        eduction = job_request[3].xpath('.//text()')[0]
        eduction = re.sub(r'[\s/]', '', eduction)
        # Company name
        company_name = html.xpath('//h3[@class="fl"]/em/text()')[0].strip()
        # Job description
        desc = "".join(html.xpath('//dd[@class="job_bt"]//text()')).strip()
        position = {
            'name': position_name,
            'company_name': company_name,
            'salary': salary,
            'city': city,
            'work_years': work_years,
            'eduction': eduction,
            'desc': desc
        }
        print(position)
        self.positions.append(position)


if __name__ == '__main__':
    lagou = LagouSpider()
    lagou.run()
    # print(lagou.positions)
