Crawling 51job with Scrapy and a Simulated Browser (Scraping Dynamically Rendered Pages)


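Web pages are not all static. Many are dynamic pages whose content is rendered by JavaScript, and crawlers run into such pages all the time.

Scraping a dynamically rendered page means simulating how a browser runs, so that the source you scrape matches what you see in the browser: what you can see, you can scrape.

With this approach the crawler opens a browser, loads the page, drives the browser from page to page automatically, and scrapes the HTML of each loaded page. Put simply, browser rendering turns the problem of scraping a dynamic page into the problem of scraping a static one.

We can use Python's Selenium library to drive the browser. Selenium is a tool for testing web applications. Selenium tests run directly in the browser, which clicks, types, opens pages, and verifies results according to the script, just as a real user would.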

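Installation and Setup

Selenium is an automated testing tool. It can drive a browser to perform specific actions such as clicking and scrolling, and it can return the source of the page the browser is currently showing, so what you can see, you can scrape.

Install the Selenium library with:

pip install selenium

After installation, test it in the Python interpreter with:

import selenium

If this runs without an error, Selenium was installed successfully, as shown in the figure below.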
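
The same check can be scripted. The sketch below is one possible way to do it, using only the standard library; the helper name is_installed is made up for this example and is not part of Selenium.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be found for import."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # Prints True once "pip install selenium" has succeeded in this environment.
    print("selenium installed:", is_installed("selenium"))
```

Unlike a bare import, this reports a missing package as False instead of raising ImportError, which is handy in setup scripts.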

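Downloading and Installing the Browser Driver

A browser driver is a separate program supplied by the browser vendor, and each browser needs its own driver. Chrome and Firefox, for example, each have a different driver.

When the browser driver receives a UI operation request from our automation program, it forwards the request to the browser, which performs the operation. The browser then returns the result to the driver, and the driver sends it back to the client library of our automation program in an HTTP response. The client library converts the response into data objects and returns them to our code.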
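
To make the request flow concrete, here is a rough sketch of the HTTP request that a call like browser.get(url) turns into under the W3C WebDriver protocol. It only builds the request and does not send it. The port 9515 is chromedriver's default, and the session id is a placeholder invented for this example; a real one comes back from the driver when the session is created.

```python
import json
from urllib.request import Request

DRIVER_URL = "http://localhost:9515"    # assumption: chromedriver on its default port
SESSION_ID = "hypothetical-session-id"  # assumption: a real id is returned by POST /session


def build_navigate_request(page_url):
    """Build (but do not send) the HTTP request behind browser.get(page_url)."""
    body = json.dumps({"url": page_url}).encode("utf-8")
    return Request(
        url=f"{DRIVER_URL}/session/{SESSION_ID}/url",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_navigate_request("https://www.taobao.com/")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

The Selenium client library does this for every command, which is why our own code never has to deal with HTTP directly.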

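Before downloading the Chrome driver, check your Chrome version. Click the "Customize and control Google Chrome" button, choose "Help", then "About Google Chrome (G)", and read the actual version number.

The Chrome driver can be downloaded from https://chromedriver.storage.googleapis.com/index.html. Pick the build that matches your Chrome version and operating system. That index only covers drivers up to version 114; newer ChromeDriver builds are published through Chrome for Testing.

After downloading, unzip the archive and copy chromedriver.exe to a directory of your choice. The code written later has to point at that location.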
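
Matching the driver to the browser mostly comes down to comparing version numbers. The sketch below is a rough check that assumes agreement on the major version (the number before the first dot) is what matters; the helper names and the two version strings are examples, not values taken from this article.

```python
def major_version(version):
    """Return the leading (major) number of a dotted version string."""
    return int(version.strip().split(".")[0])


def driver_matches(chrome_version, driver_version):
    """True when browser and driver share the same major version."""
    return major_version(chrome_version) == major_version(driver_version)


# Example version strings (illustrative only)
print(driver_matches("114.0.5735.199", "114.0.5735.90"))
print(driver_matches("114.0.5735.199", "113.0.5672.63"))
```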

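Declaring the Browser Object

Selenium supports many browsers, such as Chrome, Firefox, and Edge, as well as mobile browsers on Android and BlackBerry. It also supports the headless browser PhantomJS.

A browser object is initialized in one of the following ways:

from selenium import webdriver
# path: location of the driver executable for the chosen browser.
# Use only the line that matches your browser (Selenium 3 style API).
browser = webdriver.Chrome(executable_path=path)
browser = webdriver.Firefox(executable_path=path)
browser = webdriver.Edge(executable_path=path)
browser = webdriver.PhantomJS(executable_path=path)  # PhantomJS support was removed in Selenium 4
browser = webdriver.Safari(executable_path=path)

Here executable_path is the location of the browser driver.
These steps initialize the browser object and assign it to browser.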
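
The executable_path keyword belongs to the Selenium 3 API. On Selenium 4 the driver location is passed through a Service object instead. The following is a minimal sketch under that assumption; make_chrome is a name invented for this example, and the path check in front is an extra convenience rather than anything Selenium requires.

```python
from pathlib import Path


def make_chrome(driver_path):
    """Start Chrome through the Selenium 4 Service API."""
    if not Path(driver_path).is_file():
        raise FileNotFoundError(f"driver not found: {driver_path}")
    # Imported lazily so the path check above runs even if Selenium is missing.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    return webdriver.Chrome(service=Service(executable_path=driver_path))
```

Usage would look like browser = make_chrome("E:/chromedriver.exe"), after which browser behaves like the object created above.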

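Visiting a Page

Selenium requests a page with the get() method:

browser.get(url)

A page visit looks like this:

from selenium import webdriver
path = "E:/chromedriver.exe"
browser = webdriver.Chrome(executable_path=path)  # get a Chrome driver instance
browser.get('https://www.taobao.com/')  # open Taobao
print(browser.page_source)  # print the page source
browser.close()  # close the browser

Running the program pops up Chrome, visits Taobao automatically, prints the source of the Taobao page, and then closes the browser.

webdriver.Chrome() creates a Chrome driver instance. The name after webdriver is the browser name, so webdriver.Firefox() creates a Firefox driver instance. The executable_path argument, E:/chromedriver.exe here, is the path of the driver. It can be omitted if the directory containing chromedriver.exe is on the system PATH. browser.get(url) opens the given page, and browser.close() closes the browser window that Selenium opened.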
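
Whether the path argument can really be left out depends on the driver being on PATH. One way to check this before starting the browser is the standard library's shutil.which; find_driver is a helper name made up for this sketch.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver if it is on PATH, else None."""
    return shutil.which(name)


location = find_driver()
if location is None:
    print("chromedriver is not on PATH; pass the driver path explicitly")
else:
    print("chromedriver found at", location)
```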

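Common errors when using the selenium module:

Error message: "Exception AttributeError: Service object has no attribute process in…". The environment variable for the driver (geckodriver or chromedriver) is probably misconfigured. Add the directory that contains the driver to PATH again, or give the path directly in code: webdriver.Chrome('full path to ChromeDriver').
Error message: selenium.common.exceptions.WebDriverException: Message: Unsupported Marionette protocol version 2, required
Marionette is the automation protocol used by Firefox, so the likely cause is a Firefox version that is too old for the installed geckodriver, not the Chrome version.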

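Element Selectors

To operate on a page, you first have to select page elements. The selection methods are listed in the table below.

In terms of naming, methods that locate one element use the word element, and methods that locate several use elements. In terms of use, locating one element returns an element object; if several elements match, the first one is returned, and if none matches, an exception is raised and the code stops there. Locating several elements returns a list that can be looped over or indexed; if nothing matches, no exception is raised and the list is simply empty, which suits more complex checks.

The examples below use the Baidu home page. Readers should be comfortable with basic CSS selectors, so here is a brief review. An id selector uses #: "#u1" selects the element whose id is u1. A class selector uses ".": ".mnav" selects all elements whose class is mnav. An element selector uses the tag name: "div" selects all div elements. Combined selectors mix these forms and are the most frequently used kind. For example, "#u1 .pf" selects every element with class pf anywhere under the element with id u1, while "#u1>.pf" selects elements with class pf that are direct children of u1.

To show that the intended element was located, the examples use get_attribute to read an attribute of the element. The argument can be any valid HTML attribute such as class or name; outerHTML returns the HTML of the located element, including the element itself. element1.text returns the element's text, including the text of its descendants.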
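
The difference between the element and elements methods can be used to avoid exceptions for optional elements. The sketch below is one way to wrap it; first_or_none and describe are names invented here, and browser is assumed to be a Selenium driver object. The string "css selector" is the value of Selenium's By.CSS_SELECTOR constant, written out so the snippet has no imports.

```python
CSS = "css selector"  # the string value of selenium's By.CSS_SELECTOR


def first_or_none(browser, css):
    """Return the first element matching css, or None when nothing matches."""
    matches = browser.find_elements(CSS, css)  # a list; empty if there is no match
    return matches[0] if matches else None


def describe(element):
    """Return the element's own HTML and its text (including descendants)."""
    return element.get_attribute("outerHTML"), element.text
```

With a live driver, first_or_none(browser, "#u1 .pf") would return either the first matching element or None, so the caller can branch on the result instead of catching an exception.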

The table below compares common CSS selectors with the other selectors.

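Manipulating Elements

Manipulating an element usually means clicking it, typing a string into an input box, or reading the information it contains.

Selenium can drive the browser to perform these actions. The common operations and their methods are:
Type text: send_keys()
Clear text: clear()
Click a button: click()

from selenium import webdriver
import time
path = "E:/chromedriver.exe"
browser = webdriver.Chrome(executable_path=path)
browser.get('https://music.163.com/')
# Get the search box.
# find_element_by_* is the Selenium 3 API; Selenium 4 uses find_element(By.ID, 'srch').
search_box = browser.find_element_by_id('srch')
# Type "Andy Lau"; nothing is searched yet because the button has not been clicked
search_box.send_keys('Andy Lau')
time.sleep(1)
# Clear the search box
search_box.clear()
search_box.send_keys('刘德华')
# Get the search button
button = browser.find_element_by_name('srch')
# Click the button to run the search
button.click()
# Close the browser
browser.close()

The program works as follows:
1. Drive the browser to open NetEase Cloud Music;
2. Use find_element_by_id() to get the search box;
3. Use send_keys() to type: Andy Lau;
4. Wait one second, then use clear() to empty the search box;
5. Call send_keys() again to type: 刘德华;
6. Use find_element_by_name() to get the search button;
7. Call click() to run the search.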

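Action Chains

Selenium can also drive operations that have no specific target object, such as dragging with the mouse or pressing keys. These operations are called action chains.

The Selenium library provides the ActionChains class for this purpose. It handles mouse movement, mouse button actions, key presses, and context menu (right click) interactions.

click(on_element=None): click the left mouse button
click_and_hold(on_element=None): press the left mouse button without releasing it
context_click(on_element=None): click the right mouse button
double_click(on_element=None): double-click the left mouse button
drag_and_drop(source, target): drag to another element and release
drag_and_drop_by_offset(source, xoffset, yoffset): drag by the given offset and release
key_down(value, element=None): press a key on the keyboard
key_up(value, element=None): release a key
move_by_offset(xoffset, yoffset): move the mouse by an offset from its current position
move_to_element(to_element): move the mouse to an element
move_to_element_with_offset(to_element, xoffset, yoffset): move to a position at the given offset from an element (measured from its top-left corner)
perform(): execute all actions in the chain
release(on_element=None): release the left mouse button on an element
send_keys(*keys_to_send): send keys to the element that currently has focus
send_keys_to_element(element, *keys_to_send): send keys to the given element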

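Now for the Main Topic

I use Firefox; you can pick whichever browser you like.

While crawling we sometimes run into a slider captcha, and we then need to know how far the slider has to travel. To imitate a human, we drag the slider quickly in the first phase, giving it a positive acceleration, and slow down as it approaches the target.

def get_track(distance, t):
    track = []
    current = 0  # current position
    # mid = distance * t / (t + 1)
    mid = distance * 3 / 4  # switch from accelerating to decelerating at 3/4 of the way
    # print(mid)
    v = 6.8  # initial velocity
    while current < distance:
        if current < mid:
            a = 2   # accelerate
        else:
            a = -3  # decelerate
        v0 = v
        v = v0 + a * t
        # Distance covered in this step, from high-school physics: s = v0*t + a*t^2/2
        move = v0 * t + 1 / 2 * a * t * t
        current += move
        # print(current)
        track.append(round(move))
    return track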
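
Because the loop in get_track only stops after the position has passed distance, the steps can add up to more than requested; get_track(258, 2), for instance, yields steps that sum to 284. If the total has to match exactly, one possible variant clamps the last step. This is a sketch and not the author's method: the function name, the default parameters, and the rule that every step advances by at least one pixel are all choices made for this example.

```python
def get_track_clamped(distance, t=0.2, v=0.0, accel=2.0, decel=-3.0):
    """Integer steps that speed up, then slow down, and sum exactly to distance."""
    track = []
    moved = 0        # integer distance already emitted
    current = 0.0    # exact simulated position
    mid = distance * 3 / 4
    while moved < distance:
        a = accel if current < mid else decel
        v0 = v
        v = max(v0 + a * t, 0.0)                  # never move backwards
        move = max(v0 * t + 0.5 * a * t * t, 1.0)  # advance at least 1 px per step
        current += move
        # Emit the integer part not yet emitted, clamped so the total never overshoots.
        step = min(round(current) - moved, distance - moved)
        if step > 0:
            track.append(step)
            moved += step
    return track


track = get_track_clamped(258)
print(len(track), sum(track))
```

Tracking the exact position separately from the emitted integer steps also keeps rounding errors from accumulating.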

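How do we work out how far the slider has to travel? Here is an example.

Using the developer tools, or Photoshop, find the width and height of the round knob, say 40x30, and then the width and height of the whole slider bar, say 340x30. The knob then has to travel 340 - 40 = 300. So when the small program above computes the track, the total distance should come to about 300, neither much more nor much less.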
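
The same measurement can be taken from the page itself, since Selenium elements expose a size dictionary with width and height keys. The sketch below assumes bar and knob are the already located slider bar and knob elements; slide_distance is a name invented for this example.

```python
def slide_distance(bar, knob):
    """Distance the knob must travel: bar width minus knob width, in pixels."""
    return bar.size["width"] - knob.size["width"]
```

For the 340x30 bar and 40x30 knob in the example above this returns 300, which is the value to pass to the track function.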

Operating the slider:
① Press the left mouse button and hold it

ActionChains(self.browser).click_and_hold(slider).perform()

② Drag to the right

ActionChains(self.browser).move_by_offset(x, 0).perform()  # slide right

③ Release the mouse

ActionChains(self.browser).release().perform()  # release
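
The three steps can also be queued on a single chain and executed with one perform() call. The sketch below shows that alternative; drag_along is a name invented here, and actions is assumed to be an ActionChains object created from the browser.

```python
def drag_along(actions, slider, track):
    """Queue hold, one move per offset, and release, then run the whole chain."""
    actions.click_and_hold(slider)
    for x in track:
        actions.move_by_offset(x, 0)  # horizontal move only
    actions.release()
    actions.perform()                 # nothing happens in the browser until here
```

A call would look like drag_along(ActionChains(self.browser), slider, get_track(258, 2)). Calling perform() after every step, as in the snippets above, sends each move to the browser separately, whereas a single chain sends them in one batch; which of the two looks more human to a given captcha is something to test rather than assume.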

Full code:

middlewares.py

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

import time

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium.common.exceptions import TimeoutException
from selenium.webdriver import Firefox
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.firefox.options import Options


def get_track(distance, t):
    track = []
    current = 0
    # mid = distance * t / (t + 1)
    mid = distance * 3 / 4
    # print(mid)
    v = 6.8
    while current < distance:
        if current < mid:
            a = 2
        else:
            a = -3
        v0 = v
        v = v0 + a * t
        move = v0 * t + 1 / 2 * a * t * t
        current += move
        # print(current)
        track.append(round(move))
    return track


class SeleniumMiddleware:
    def __init__(self):
        # 1. Create the Firefox options object
        opt = Options()
        # 2. Start Firefox through geckodriver (raw string so the backslash is kept literally)
        self.browser = Firefox(executable_path=r'D:\geckodriver.exe', options=opt)
        self.browser.maximize_window()  # maximize the browser window

    @classmethod
    def from_crawler(cls, crawler):
        # Register spider_closed so the browser is closed when the spider finishes
        s = cls()
        crawler.signals.connect(s.spider_closed, signal=signals.spider_closed)
        return s

    def move_to_gap(self, slider, tracks):
        # Drag the slider to the gap along the track.
        # slider: the slider element, tracks: list of step sizes
        ActionChains(self.browser).click_and_hold(slider).perform()
        print(tracks)
        for x in tracks:
            print(x)
            ActionChains(self.browser).move_by_offset(x, 0).perform()  # slide right
        ActionChains(self.browser).release().perform()  # release
        # perform() executes all actions in the chain;
        # release(on_element=None) releases the left mouse button on an element

    def process_request(self, request, spider):
        try:
            # 3. Open the requested page in the browser
            self.browser.get(request.url)
            # Slider handling
            if request.url.find("https://jobs.51job.com/") != -1:
                try:
                    # find_element_by_xpath raises if the slider is absent,
                    # so the "no slider" case ends up in the except branch below
                    yzm = self.browser.find_element_by_xpath("//span[@id='nc_1_n1z']")
                    print(yzm)
                    if yzm:
                        print("==== slider found ====")
                        self.move_to_gap(yzm, get_track(258, 2))  # drag the slider
                        time.sleep(10)
                        print("====lllllll====")
                    else:
                        print("=== no slider ===")
                except Exception as e:
                    print("===" + str(e))
            else:
                print("===feeder====")
            time.sleep(2)
            return HtmlResponse(url=request.url, body=self.browser.page_source,
                                request=request, encoding='utf-8', status=200)
        except TimeoutException:
            return HtmlResponse(url=request.url, status=500, request=request)

    def spider_closed(self):
        self.browser.quit()

job.py

# -*- coding: utf-8 -*-
import copy

import scrapy
# from scrapy.utils.response import open_in_browser


class JobSpider(scrapy.Spider):
    name = 'job'
    allowed_domains = ['51job.com']
    # The f-string fills in the page number; without it "{i}" would stay in the URL literally.
    start_urls = [
        f'https://search.51job.com/list/060000,000000,0000,00,9,99,%25E5%25A4%25A7%25E6%2595%25B0%25E6%258D%25AE,2,{i}.html'
        for i in range(1, 2)
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url)
            # yield scrapy.Request("https://httpbin.org/ip")

    def parse(self, response):
        item = {}
        print("======")
        entries = response.xpath("//div[@class='j_joblist']/div[@class='e']")
        print(len(entries))
        for entry in entries:
            url = entry.xpath(".//p[@class='t']/../@href").get()
            item['url'] = url
            item['job'] = entry.xpath(".//p[@class='t']/span[1]/text()").get()
            item['price'] = entry.xpath(".//span[@class='sal']/text()").get()
            # The second span holds "location  |  experience  |  education"
            info = entry.xpath(".//p[@class='info']/span[2]/text()").get().split('  |  ')
            item['where'] = info[0]
            item['jingyan'] = info[1]
            item['xueli'] = info[2]
            item['gongsi'] = entry.xpath(".//div[@class='er']/a/text()").get()
            item['daiyu'] = entry.xpath(".//p[@class='tags']/@title").get()
            yield scrapy.Request(url, callback=self.parse_detail,
                                 meta={'item': copy.deepcopy(item)}, dont_filter=True)

    def parse_detail(self, response):
        item = response.meta['item']
        content = response.xpath("//div[contains(@class,'job_msg')]").xpath("substring-before(.,'职能类别：')").xpath('string(.)').extract()
        desc = ""
        for i in content:
            desc += "".join(i.split())  # append each chunk with all whitespace removed
        item['desc'] = desc
        yield item
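
The spider indexes the three parts of the info string directly, which fails when a listing has fewer than three fields or no info text at all. A more forgiving split could look like the sketch below; split_info is a helper name invented for this example, and the sample strings in the usage lines are made up rather than taken from 51job.

```python
def split_info(text, sep="|", fields=3):
    """Split 'location | experience | education' into exactly `fields` parts.

    Missing or empty parts come back as None instead of raising IndexError.
    """
    parts = [p.strip() for p in (text or "").split(sep)]
    parts += [None] * (fields - len(parts))  # pad short results
    return tuple(p or None for p in parts[:fields])


print(split_info("重庆-渝北区  |  3-4年经验  |  本科"))
print(split_info("重庆  |  本科"))
```

In the spider this would replace the three indexed lookups with where, jingyan, xueli = split_info(info_text).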

Storing the Data in MongoDB

If you are not sure how to do this, have a look at my earlier articles.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# import pymysql
from urllib import parse

import pymongo


class Job51Pipeline(object):
    def process_item(self, item, spider):
        return item


class NewPipeline_mongo:
    def __init__(self, mongo_uri, mongo_db, account, passwd):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.account = account
        self.passwd = passwd

    @classmethod
    def from_crawler(cls, crawler):
        # print(crawler.settings.get('USERNAME'))
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI', 'localhost'),
            mongo_db=crawler.settings.get('MONGO_DB', 'cq'),
            account=crawler.settings.get('USERNAME', 'root'),
            passwd=crawler.settings.get('PWD', '123456')
        )

    def open_spider(self, spider):
        uri = 'mongodb://%s:%s@%s:27017/?authSource=admin' % (
            self.account, parse.quote_plus(self.passwd), self.mongo_uri)
        # print(uri)
        self.client = pymongo.MongoClient(uri)
        self.db = self.client[self.mongo_db]
        print(self.mongo_db)

    def process_item(self, item, spider):
        print(item)
        collection = 'job51'
        self.db[collection].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
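
The quote_plus call in open_spider matters because characters such as @, : and / in a password would otherwise break the URI. The standalone sketch below shows the effect; build_mongo_uri is a name invented here, the password is a made-up example, and escaping the user name as well is an addition that the pipeline above does not make.

```python
from urllib.parse import quote_plus


def build_mongo_uri(account, passwd, host, port=27017, auth_source="admin"):
    """Assemble a MongoDB URI, percent-escaping the user name and password."""
    return "mongodb://%s:%s@%s:%d/?authSource=%s" % (
        quote_plus(account), quote_plus(passwd), host, port, auth_source)


print(build_mongo_uri("root", "p@ss:w/rd", "localhost"))
# prints mongodb://root:p%40ss%3Aw%2Frd@localhost:27017/?authSource=admin
```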

settings.py

ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 1

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1

DOWNLOADER_MIDDLEWARES = {
    'job51.middlewares.SeleniumMiddleware': 543,
}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'job51.pipelines.NewPipeline_mongo': 200,
}
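
The MongoDB pipeline also reads four settings that are not shown above: MONGO_URI, MONGO_DB, USERNAME and PWD. If they are absent it falls back to its built-in defaults. A possible addition to settings.py is sketched below; the values are simply those defaults, shown as placeholders to be replaced with your own.

```python
# settings.py (continued): connection settings read by NewPipeline_mongo.
# The values are the pipeline's own fallbacks, used here as placeholders.
MONGO_URI = 'localhost'  # host name only; the pipeline appends port 27017 itself
MONGO_DB = 'cq'          # database name
USERNAME = 'root'        # MongoDB account
PWD = '123456'           # placeholder password; replace with the real one
```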