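Introduction

Selenium is at heart a browser automation framework for testing, but it is especially handy for crawling. Many sites, such as Taobao and Weibo, only serve their pages to logged-in users, and reverse-engineering their JavaScript in order to scrape data can turn into a dead end. Selenium instead drives a real browser the way a person would, which makes clicking buttons, locating elements, and typing text straightforward. Compared with a crawler that reproduces the encrypted request parameters directly, however, Selenium is slow. The usual approach is therefore to use Selenium only to obtain the cookie or token, and then hand it to a regular crawler that fetches the pages, which gets far more done for the same effort.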

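Alimama in Practice

Taking the Alimama dashboard as an example, analyzing the page's traffic shows that the JSON data comes from https://pub.alimama.com/campaign/joinedSpecialCampaigns.json?toPage=1&status=2&perPageSize=40

Requesting that URL on its own, however, redirects to the login page. For a site like this we have to log in before we can request and scrape the data.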

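Simulated Login

The login page uses Taobao's unified login frame. Right-click the frame and reload it while capturing traffic to get the frame URL, then strip the unneeded parameters to obtain the base address https://login.taobao.com/member/login.jhtml?style=mini&newMini2=true&from=alimama, so that other requests do not interfere with the analysis.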
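
Stripping the unneeded parameters can also be done in code. The following is a minimal sketch using only the standard library; the helper name keep_query_params and the extra parameters in the captured URL (redirectURL, t) are made up for illustration and are not taken from a real capture.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def keep_query_params(url, keep):
    """Return url with only the query parameters named in keep, in original order."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name in keep
    ]
    # Drop the fragment as well; it is never sent to the server
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


# Hypothetical captured URL: "redirectURL" and "t" are illustrative extras
captured = (
    "https://login.taobao.com/member/login.jhtml"
    "?style=mini&newMini2=true&from=alimama"
    "&redirectURL=https%3A%2F%2Fexample.com&t=123"
)
clean_url = keep_query_params(captured, {"style", "newMini2", "from"})
print(clean_url)
```

Which parameters are actually required has to be found by trial: remove one, reload, and check that the login frame still renders.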

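The steps are:

Locate the account, password, slider, and button elements
Enter the account and password
Check whether the slider is present and, if so, drag it
Click the login button
Save the cookie and use it to send requests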

chromedriver Initialization

Download the chromedriver build that matches the version of Chrome installed on the local machine.
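
A quick way to check that the two versions line up is to compare their leading components. This sketch assumes the convention that the major, minor, and build numbers (the first three components) should agree while the last one may differ; the helper names are illustrative, and the chromedriver download page remains the authority for a given release.

```python
def build_prefix(version):
    """Return the major.minor.build part of a version such as '78.0.3904.70'."""
    return tuple(int(part) for part in version.split(".")[:3])


def driver_matches_browser(chrome_version, driver_version):
    """True when Chrome and chromedriver share the same major.minor.build prefix."""
    return build_prefix(chrome_version) == build_prefix(driver_version)


print(driver_matches_browser("78.0.3904.70", "78.0.3904.105"))
print(driver_matches_browser("78.0.3904.70", "79.0.3945.36"))
```

The Chrome version is shown on the chrome://version page, and the driver version is printed by running chromedriver --version.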

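Hiding Automation Fingerprints

Some sites load their data through AJAX requests that carry encrypted parameters which are hard to reproduce. Because Selenium drives a real browser, it sidesteps that kind of anti-crawling measure. Even so, some sites use JavaScript to detect the dozens of telltale properties that a Selenium-launched browser exposes, and block the crawler on that basis. https://bot.sannysoft.com/ shows some of these fingerprint values for the current browser, and the result in a normal browser serves as the baseline.

When the same site is opened through Selenium, several fingerprints are flagged. Security teams use these as key signals to block data requests from that browser.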

For example, one platform checks for the Selenium property $cdc_asdjflasutopfhvcZLmcfl_. The workaround is to open chromedriver.exe in HexEdit 4.2 and replace $cdc_asdjflasutopfhvcZLmcfl_ with a string of the same length, such as $ccccccccccccccccccccccccccc.
For Chrome's "Disable developer mode extensions" popup, put Chrome.dll-patch75and76.exe into the directory under the Chrome folder that contains chrome.dll and run it as administrator.
For the "Chrome is being controlled by automated test software" infobar, chrome_options.add_experimental_option('excludeSwitches', ['enable-automation']) hides the notice.
To keep Chrome's built-in password saving from interfering with the crawler, disable it with chrome_options.add_experimental_option("prefs", prefs).
If the IP gets banned, enable a proxy with chrome_options.add_argument("--proxy-server=http://58.243.205.102:4543").
To set the User-Agent request header, use browser.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'})
For the webdriver property on navigator, running browser.execute_script('Object.defineProperty(navigator,"webdriver",{get:() => false,});') after the new page has loaded does not remove the fingerprint. Use the CDP command instead, which registers the script to run on every new document: browser.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {"source": """Object.defineProperty(navigator, 'webdriver', {get: () => undefined})""", })
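
The hex-editor step from the first item can be sketched in Python as a same-length byte replacement. This is only an illustration under assumptions: patch_marker is a hypothetical helper, the sample bytes stand in for the real binary, and whether the marker appears at all depends on the chromedriver version.

```python
def patch_marker(data: bytes, old: bytes, new: bytes) -> bytes:
    """Replace every occurrence of old with new, refusing to change the length."""
    if len(old) != len(new):
        raise ValueError("replacement must have the same length as the original marker")
    return data.replace(old, new)


marker = b"$cdc_asdjflasutopfhvcZLmcfl_"
# Built from the marker's length so the two are the same size by construction
replacement = b"$" + b"c" * (len(marker) - 1)

# Stand-in bytes for illustration; a real run would read a copy of chromedriver.exe
# in binary mode and write the patched bytes back out
sample = b"\x00\x01var key = '$cdc_asdjflasutopfhvcZLmcfl_';\x00"
patched = patch_marker(sample, marker, replacement)
print(len(sample) == len(patched), marker in patched)
```

Keeping the length identical matters because the string sits inside a compiled binary, where shifting the bytes that follow would corrupt it. Work on a backup copy of the driver.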

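Hiding just a few fingerprints accomplishes little, though. Someone has already done thorough work on hiding the many known fingerprints in one go: stealth.min.js.

As defenses improve, so do the workarounds. The complete fingerprint-hiding code is as follows: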

# Chrome version 78.0.3904.70, chromedriver version 78.0.3904.70
# Written against the Selenium 3 API that matches these versions
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
# Set a proxy
# chrome_options.add_argument("--proxy-server=http://58.243.205.102:4543")
# Attach to a Chrome started locally with: chrome.exe --remote-debugging-port=7222
# chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:7222")
# Run headless
chrome_options.add_argument("--headless")
chrome_options.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36')
# Hide the "Chrome is being controlled by automated test software" infobar
chrome_options.add_experimental_option('excludeSwitches', ['enable-automation'])
# Disable the save-password prompt
prefs = {
    "credentials_enable_service": False,
    "profile.password_manager_enabled": False,
}
chrome_options.add_experimental_option("prefs", prefs)
driver = Chrome('./chromedriver', options=chrome_options)
# driver.execute_script('Object.defineProperty(navigator,"webdriver",{get:() => false,});')
# driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {"source": """Object.defineProperty(navigator, 'webdriver', {get: () => undefined})""", })
# driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'})
driver.set_page_load_timeout(10)
# Register stealth.min.js so it runs before the scripts of every new document
with open('./stealth.min.js') as f:
    js = f.read()
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {"source": js})

Saving the Cookie

import time
from selenium.webdriver.common.action_chains import ActionChains

def save_cookies(self):
    # Implicit wait: the maximum time to wait when locating elements
    self.browser.implicitly_wait(10)
    # Maximize the window
    self.browser.maximize_window()
    # Type the account and password into the text boxes
    self.browser.find_element_by_xpath('//input[@name="fm-login-id"]').send_keys('***')
    self.browser.find_element_by_xpath('//input[@name="fm-login-password"]').send_keys('***')
    # Handle the slider
    slide_block = self.browser.find_element_by_xpath('//*[@id="nc_1_n1z"]')
    if slide_block.is_displayed():
        # Press the slider, drag it, and release
        action = ActionChains(self.browser)
        action.click_and_hold(on_element=slide_block)
        action.move_by_offset(xoffset=258, yoffset=0)
        action.pause(0.5).release().perform()  # perform() runs the queued action chain
    self.browser.find_element_by_xpath('//button[@class="fm-button fm-submit password-login"]').click()
    time.sleep(5)
    if "login_unusual" in self.browser.current_url:
        print("Login was flagged as unusual; an SMS verification code is required")
        input("Enter the SMS verification code: ")
    # Join the cookies into a single "name=value; name=value" string
    self.cookies = '; '.join(item["name"] + "=" + item["value"] for item in self.browser.get_cookies())
    with open(COOKIES_FILE_PATH, 'w', encoding='utf-8') as file:
        file.write(self.cookies)
    print("Cookie saved:", self.cookies)
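
The cookie string written above is the usual Cookie header format. As a small standalone sketch of that conversion and its inverse, with made-up cookie entries in the shape returned by get_cookies() and hypothetical helper names:

```python
def cookies_to_header(cookies):
    """Join Selenium-style cookie dicts into a Cookie header value."""
    return "; ".join(f'{c["name"]}={c["value"]}' for c in cookies)


def header_to_dict(header):
    """Split a Cookie header value back into a name -> value mapping."""
    pairs = (item.split("=", 1) for item in header.split("; ") if "=" in item)
    return {name: value for name, value in pairs}


# Made-up entries for illustration; real ones carry more keys (path, expiry, ...)
sample_cookies = [
    {"name": "t", "value": "abc", "domain": ".example.com"},
    {"name": "token", "value": "x=y", "domain": ".example.com"},
]
header = cookies_to_header(sample_cookies)
print(header)
print(header_to_dict(header))
```

Splitting on the first "=" only keeps values that themselves contain "=" intact. The resulting dict can be passed to requests through its cookies= argument instead of setting the header by hand.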

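Logging In with the Cookie

import json

def taobao_login(self):
    print("Logging in...")
    ok = False
    while not ok:
        # Load the saved cookie string into the request headers
        with open(COOKIES_FILE_PATH, 'r', encoding='utf-8') as file:
            self.headers["cookie"] = file.read()
        response = self.session.get(self.shop_plan_url, headers=self.headers, verify=False)
        try:
            # A valid login returns JSON; the login page does not parse
            json.loads(response.text)
            ok = True
        except ValueError:
            # Cookie missing or expired: log in again through the browser
            self.browser.get(self.alimama_login_url)
            self.browser.delete_all_cookies()
            self.save_cookies()
    self.browser.close()
    self.browser.quit()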
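
The loop above keeps retrying for as long as the response is not JSON. A bounded variant is sketched below with the network and browser parts passed in as plain callables, so the control flow can be read and tested on its own; fetch_json_with_relogin, fetch, and relogin are hypothetical names, not part of the original class.

```python
import json


def fetch_json_with_relogin(fetch, relogin, max_attempts=3):
    """Call fetch() until its text parses as JSON, logging in again between failures."""
    for attempt in range(max_attempts):
        text = fetch()
        try:
            return json.loads(text)
        except ValueError:
            # Not JSON: most likely the login page, so refresh the cookie and retry
            if attempt < max_attempts - 1:
                relogin()
    raise RuntimeError(f"no JSON response after {max_attempts} attempts")


# Simulated responses: first the login page, then real data
responses = iter(["<html>login</html>", '{"ok": true}'])
relogin_calls = []
result = fetch_json_with_relogin(lambda: next(responses), lambda: relogin_calls.append(1))
print(result, len(relogin_calls))
```

In the crawler, fetch would wrap the session.get call and return response.text, and relogin would wrap the browser steps that end in save_cookies. Capping the attempts avoids hammering the login page when the account is locked or the page layout has changed.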

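Tencent in Practice

The reports in Tencent's Youlianghui (优量汇) ad platform are not exposed through an API, so the goal here is to scrape the ad revenue data from those reports.

Packet capture shows that the key cookie is adnet_sso: a request that carries it can fetch the data. That cookie is produced through a long chain of cookie updates that is tedious to reproduce, so it is simpler to let Selenium handle it. Log in, save the cookies to a file, and add them to the request headers when calling the API.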
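
Since everything hinges on adnet_sso, it can be worth checking that the saved cookie string actually contains it before sending any request. A minimal sketch, with hypothetical helper names and made-up cookie values:

```python
def cookie_names(cookie_header):
    """Return the set of cookie names in a 'name=value; name2=value2' string."""
    names = set()
    for item in cookie_header.split(";"):
        name, sep, _ = item.strip().partition("=")
        if sep and name:
            names.add(name)
    return names


def has_required_cookie(cookie_header, required="adnet_sso"):
    """True when the cookie string carries the cookie the API depends on."""
    return required in cookie_names(cookie_header)


# Made-up values for illustration
print(has_required_cookie("a=1; adnet_sso=xyz; b=2"))
print(has_required_cookie("a=1; b=2"))
```

Comparing whole names rather than searching for a substring avoids a false positive from a cookie whose name merely ends in adnet_sso.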

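Simulated Login

https://sso.e.qq.com/login/hub?sso_redirect_uri=https%3A%2F%2Fe.qq.com%2Fdev%2Flogin&service_tag=14

We want to avoid QR-code login. The flow is: when the QQ account login screen appears, click the account-and-password login option, find the text boxes and enter the QQ number and password, then click the authorize-and-login button. After that, read the cookies from Selenium and save them to a file, and load that file whenever the API is called. If a request fails, delete the Selenium cookies, log in again, and save the new cookies.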

def adnet_login(self):
    print("Logging in...")
    ok = False
    while not ok:
        # Load the saved cookie string into the request headers
        with open(COOKIES_FILE_PATH, 'r', encoding='utf-8') as file:
            self.headers["cookie"] = file.read()
        response = self.session.post(self.get_date_url, data=json.dumps(self.data), headers=self.headers, verify=False)
        try:
            res = json.loads(response.text)
            ok = True
        except ValueError:
            # Cookie missing or expired: log in again through the browser
            self.browser.get(self.adnet_login_url)
            self.browser.delete_all_cookies()
            self.save_cookies()
    self.browser.close()
    self.browser.quit()

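Selenium is initialized the same way as for Alimama. The Tencent Ads login form is nested two frames deep: a frame with id="ptlogin_iframe" inside a frame with id="qqLoginFrame". Use switch_to.frame to enter each frame in turn, then locate the elements, fill in the account and password, log in, and save the cookies.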
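
Entering nested frames is order-sensitive, because each switch_to.frame call is resolved relative to the frame the driver is currently in. The helper below is a small sketch of that idea; switch_into_frames is a hypothetical name, and it relies only on the driver's switch_to.default_content() and switch_to.frame() calls.

```python
def switch_into_frames(driver, frame_path):
    """Reset to the top-level document, then enter each frame in frame_path in order."""
    driver.switch_to.default_content()
    for frame in frame_path:
        driver.switch_to.frame(frame)


# Usage with a real driver (not executed here):
# switch_into_frames(browser, ["qqLoginFrame", "ptlogin_iframe"])
```

Resetting to the top-level document first makes the helper safe to call again after a failed attempt, when the driver may still be inside one of the frames.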

def save_cookies(self):
    self.browser.implicitly_wait(10)
    self.browser.maximize_window()
    self.browser.find_element_by_xpath('//a[@id="qqLogin"]').click()
    # el_frame = self.browser.find_element_by_xpath('//*[@id="qqLoginFrame"]')
    # print(self.browser.page_source)
    # Enter the outer frame, then the inner frame that holds the login form
    self.browser.switch_to.frame('qqLoginFrame')
    self.browser.switch_to.frame('ptlogin_iframe')
    time.sleep(5)
    # Switch to account-and-password login, then fill in the form
    self.browser.find_element_by_xpath('//a[contains(text(),"帐号密码登录")]').click()
    self.browser.find_element_by_xpath('//*[@id="u"]').send_keys('*')
    self.browser.find_element_by_xpath('//*[@id="p"]').send_keys('*')
    self.browser.find_element_by_xpath('//*[@id="loginform"]/div[@class="submit"]/a').click()
    time.sleep(5)
    # Join the cookies into a single "name=value; name=value" string
    self.cookies = '; '.join(item["name"] + "=" + item["value"] for item in self.browser.get_cookies())
    with open(COOKIES_FILE_PATH, 'w', encoding='utf-8') as file:
        file.write(self.cookies)
    print("Cookie saved:", self.cookies)

Common Operations

Different Operating Systems

import platform
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
if platform.system() == "Windows":
    driver = webdriver.Chrome('chromedriver.exe', options=chrome_options)
elif platform.system() == "Linux":
    # Servers usually have no display, so run headless
    chrome_options.add_argument("--headless")
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--no-sandbox')
    driver = webdriver.Chrome(executable_path="/usr/bin/chromedriver", options=chrome_options)
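
The branching above leaves driver undefined on any other system, such as macOS. One way to make that explicit is to separate the per-platform settings from the driver construction, as in this sketch; chrome_launch_settings is a hypothetical helper, and the paths are the ones used above.

```python
import platform


def chrome_launch_settings(system=None):
    """Return (driver_path, arguments) for the given platform.system() value."""
    system = system or platform.system()
    if system == "Windows":
        return "chromedriver.exe", []
    if system == "Linux":
        return "/usr/bin/chromedriver", ["--headless", "--disable-gpu", "--no-sandbox"]
    # Fail early instead of hitting an undefined driver later
    raise RuntimeError(f"unsupported platform: {system}")


print(chrome_launch_settings("Linux"))
```

The returned arguments can then be applied with chrome_options.add_argument in a loop before the driver is created.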

Getting Element Information

import csv

def get_data():
    divs = driver.find_elements_by_xpath('//div[@class="items"]/div[@class="item J_MouserOnverReq  "]')
    for div in divs:
        info = div.find_element_by_xpath('.//div[@class="row row-2 title"]/a').text
        price = div.find_element_by_xpath('.//strong').text
        deal = div.find_element_by_xpath('.//div[@class="deal-cnt"]').text
        shop = div.find_element_by_xpath('.//div[@class="shop"]/a').text
        print(info, price, deal, shop, sep="|")
        # Append one row per item
        with open('taobao.csv', mode='a', newline="") as csvfile:
            csvwrite = csv.writer(csvfile, delimiter=',')
            csvwrite.writerow([info, price, deal, shop])

# Find descendant elements (returns a list)
browser.find_elements_by_xpath("//div[@id='J_DivItemDesc']/descendant::*/img")
# Get the tag name of a single element
browser.find_element_by_xpath("//div[@id='J_DivItemDesc']/descendant::*/img").tag_name
# Get the value attribute, or the content of a text box
browser.find_element_by_xpath("//div[@id='J_DivItemDesc']/descendant::*/img").get_attribute('value')
# Read an element's attribute value through JavaScript
js = 'return document.getElementById("su").getAttribute("value")'
res = driver.execute_script(js)

Mouse Operations

def get_data():
    # Move the mouse to an offset relative to the element
    title = browser.find_element_by_xpath("//div[@class='title-bar']")
    ActionChains(browser).move_to_element_with_offset(title, 100, 600).perform()
    # Keyboard shortcut
    # browser.find_element_by_tag_name('body').send_keys(Keys.CONTROL + Keys.SHIFT + 'J')
    # Hover over a given element
    # ActionChains(browser).move_to_element(browser.find_elements_by_xpath('//tbody[@mx-ie="mouseover"]/tr')[1]).perform()
    # The page has to be double-clicked before the list can be read
    ActionChains(browser).double_click(browser.find_element_by_xpath("//body")).perform()
    tr_list = browser.find_elements_by_xpath('//tbody[contains(@mx-ie,"mouseover")]/tr')
    if len(tr_list) == 0:
        # Reload the page
        browser.execute_script("location.reload()")
        title = browser.find_element_by_xpath("//div[@class='title-bar']")
        # Move the mouse
        ActionChains(browser).move_to_element_with_offset(title, 100, 600).perform()
        # Double-click
        ActionChains(browser).double_click(browser.find_element_by_xpath("//body")).perform()
        # Match elements whose attribute contains the given text
        tr_list = browser.find_elements_by_xpath('//tbody[contains(@mx-ie,"mouseover")]/tr')
    # Scroll straight to the bottom of the page
    browser.execute_script("window.scrollTo(0,document.body.scrollHeight);")

Finding Elements

# Look up a page element using one of several locator strategies
def findElement(by, value):
    if by == "id":
        return browser.find_element_by_id(value)
    elif by == "name":
        return browser.find_element_by_name(value)
    elif by == "xpath":
        return browser.find_element_by_xpath(value)
    elif by == "classname":
        return browser.find_element_by_class_name(value)
    elif by == "css":
        return browser.find_element_by_css_selector(value)
    elif by == "link_text":
        return browser.find_element_by_link_text(value)
    else:
        print("No matching lookup method; please check the 'by' argument")
        return None
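
The if/elif chain above can also be written as a lookup table. In this sketch the strategy strings are assumed to match the values of Selenium's By constants (By.ID, By.NAME, By.XPATH, By.CLASS_NAME, By.CSS_SELECTOR, By.LINK_TEXT); they are spelled out as plain strings so the mapping can be read without Selenium installed, and resolve_locator is a hypothetical name.

```python
# Short names used in this article -> locator strategy strings
LOCATOR_STRATEGIES = {
    "id": "id",
    "name": "name",
    "xpath": "xpath",
    "classname": "class name",
    "css": "css selector",
    "link_text": "link text",
}


def resolve_locator(by, value):
    """Translate a short name into a (strategy, value) locator tuple."""
    try:
        return LOCATOR_STRATEGIES[by], value
    except KeyError:
        raise ValueError(f"unknown locator type: {by!r}") from None


print(resolve_locator("css", "div.item"))
```

With a live driver, the tuple can be unpacked into the generic lookup call, for example browser.find_element(*resolve_locator("xpath", "//body")). Raising an error for an unknown type surfaces a typo immediately, whereas returning None tends to fail later at the first attribute access.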

Checking Whether an Element Exists

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def is_element_present(locator):
    wait = WebDriverWait(browser, 2)
    try:
        # Explicit wait: block until the element is visible or 2 seconds pass
        wait.until(EC.visibility_of_element_located(locator))
    except TimeoutException:
        return False
    return True

is_element_present((By.XPATH, '//*[@id="sufei-dialog-content"]'))
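
Conceptually, an explicit wait polls a condition until it holds or a deadline passes. The following is a simplified stand-in written with the standard library to show that idea; it is not Selenium's implementation, and unlike WebDriverWait it returns False on timeout rather than raising TimeoutException.

```python
import time


def wait_until(condition, timeout=2.0, interval=0.1):
    """Poll condition() until it returns a truthy value or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Illustrative condition that becomes true on the third check
state = {"checks": 0}

def ready():
    state["checks"] += 1
    return state["checks"] >= 3

print(wait_until(ready, timeout=1.0, interval=0.01))
```

The condition is always checked at least once, even with a timeout of zero, so an element that is already present is reported without any delay.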

Clicking an Element

import random
import time

def move_element_click(xpath):
    if is_element_present((By.XPATH, xpath)):
        ele_loc = browser.find_element_by_xpath(xpath)
        # Scroll the element into view before clicking it
        browser.execute_script("arguments[0].scrollIntoView();", ele_loc)
        ActionChains(browser).move_to_element(ele_loc).click().perform()
        # Pause for a random 1 to 3 seconds
        time.sleep(random.randint(1, 3))

move_element_click("//div[@class='dialog-contentbox']/vframe/div/div/button")

Hovering over an Element

def hover(by, value):
    element = findElement(by, value)
    ActionChains(browser).move_to_element(element).perform()

# tr is one row taken from tr_list; XPath positions are 1-based
hover("xpath", '//tbody[contains(@mx-ie,"mouseover")]/tr[' + str(tr_list.index(tr) + 1) + ']')

For the complete source code, follow the WeChat official account ReverseCode and reply with: 爬虫基础
