Preface

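I have been learning Python web scraping recently. To consolidate what I have learned, I occasionally write small scripts to get more fluent with the syntax.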

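I. Preparing to simulate a Zhihu login

Zhihu's anti-scraping mechanism detects Selenium, so we cannot simply instantiate a new browser object. Instead, we can take over a browser that is already open and enter the account details there.

System environment variable: add the Chrome installation directory to PATH.

Start the browser by entering the following command in cmd: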

chrome.exe --remote-debugging-port=9222 --user-data-dir="E:\data_info\selenium_data"

--remote-debugging-port: the port number
--user-data-dir: the directory where the browser data is stored
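Before attaching Selenium, it can help to confirm that the browser really is listening on the debugging port. The following is a minimal sketch using only the standard library; the function name `debug_port_is_open` is hypothetical, and the default host and port simply mirror the command above.

```python
import socket


def debug_port_is_open(host="127.0.0.1", port=9222, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError if the connection is refused or times out
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if debug_port_is_open():
        print("Chrome debugging port is reachable")
    else:
        print("Nothing is listening on 9222; start chrome.exe with the command above first")
```

This only checks that a TCP listener exists; it does not verify that the listener is Chrome.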

Packages to install

pillow
pyHook
PyUserInput
selenium == 3.8.0
Keras==2.0.1
numpy==1.12.1
scikit-learn==0.18.1
h5py==2.6.0
tensorflow==1.13.1

Python third-party libraries: download link
If a library cannot be found during installation, download and install it from the link above.

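II. Handling the login captcha

1. English captcha

Normally, entering a captcha requires staying in the same session, and with Selenium that means keeping the same cookies. Zhihu works differently, because the captcha image is embedded in the page as a base64 string:

Find the attribute that holds the image in the page source
Save the image
Enter the captcha; here it is typed in manually

The code is as follows (example):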

import base64
from PIL import Image

# The src attribute holds a base64 string; decode it and save the image
english_captcha = driver.find_element_by_class_name('Captcha-englishImg')
base64_text = english_captcha.get_attribute('src')
code = base64_text.replace('data:image/jpg;base64,', '').replace('%0A', '')
with open('img.jpeg', 'wb') as f:
    f.write(base64.b64decode(code))
# Open the image with the Image module and type the captcha in manually
try:
    im = Image.open('img.jpeg')
    im.show()
    im.close()
except Exception:
    pass
yzm = input("Enter the captcha: ")
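The decoding step can also be written as a small helper that does not depend on the exact MIME type in the prefix. This is a sketch only: `decode_data_url` is a hypothetical name, and it assumes the `src` attribute follows the usual `data:<mime>;base64,<payload>` layout, possibly with URL-escaped newlines such as `%0A` inside the payload.

```python
import base64
from urllib.parse import unquote


def decode_data_url(src):
    """Decode a 'data:<mime>;base64,<payload>' string into raw bytes."""
    header, sep, payload = src.partition(",")
    if not sep or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URL")
    # unquote turns escapes such as %0A back into characters; split() then
    # drops all whitespace, including the newlines that %0A stood for
    cleaned = "".join(unquote(payload).split())
    return base64.b64decode(cleaned)
```

With such a helper, saving the image would become `f.write(decode_data_url(base64_text))`.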

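2. Upside-down Chinese character captcha

After several login attempts, Zhihu occasionally asks you to click the upside-down Chinese characters in a captcha image.

Save the image of the upside-down character captcha
Compute the coordinates of the upside-down characters
Move the mouse and click those coordinates automatically

The coordinates of the upside-down characters can be computed with a module available on GitHub: link

How to use zheye: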

from zheye import zheye
z = zheye()
positions = z.Recognize('path/to/captcha.gif')

The code is as follows (example):

# Save the image
chinese_captcha = driver.find_element_by_class_name('Captcha-chineseImg')
base64_text = chinese_captcha.get_attribute("src")
code = base64_text.replace('data:image/jpg;base64,', '').replace('%0A', '')
with open('img.jpeg', 'wb') as f:
    f.write(base64.b64decode(code))
# Position of the element in the browser
ele_position = chinese_captcha.location
x_relative = ele_position["x"]
y_relative = ele_position["y"]
# Run JS: outer window height minus page viewport height = height of the browser's top panel
browser_navigation_panel_height = driver.execute_script('return window.outerHeight-window.innerHeight;')
# zheye locates the upside-down characters; it assumes a 400*88 image by default, while Zhihu's is 400*22
from ArticleSpider.zheye import zheye
z = zheye()
positions = z.Recognize('img.jpeg')
pos_arr = []
if len(positions) == 2:
    # Two upside-down characters: order them by x coordinate
    if positions[0][1] > positions[1][1]:
        pos_arr.append([positions[1][1] / 2, positions[1][0] / 2])
        pos_arr.append([positions[0][1] / 2, positions[0][0] / 2])
    else:
        pos_arr.append([positions[0][1] / 2, positions[0][0] / 2])
        pos_arr.append([positions[1][1] / 2, positions[1][0] / 2])
else:
    pos_arr.append([positions[0][1] / 2, positions[0][0] / 2])
# Click each upside-down character in turn
for pos in pos_arr:
    print('x:{0}'.format(x_relative + pos[0]))
    print('y_relative:{0}'.format(y_relative))
    print('browser_navigation_panel_height:{}'.format(browser_navigation_panel_height))
    print('pos[1]:{}'.format(pos[1]))
    m = PyMouse()
    x = int(x_relative + pos[0])
    y = int(y_relative + browser_navigation_panel_height + pos[1])
    m.click(x, y)
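The coordinate arithmetic above can be isolated into a pure function, which makes it easy to check without a browser. This is a sketch under the same assumptions the article's code makes: each recognized position is a `(y, x)` pair scaled by a factor of 2, and clicks should go from left to right. The name `to_screen_points` and its parameters are hypothetical.

```python
def to_screen_points(positions, element_x, element_y, panel_height, scale=2):
    """Convert (y, x) pairs from the recognizer into screen (x, y) click points.

    Points are ordered left to right. The element offset and the height of the
    browser's top panel are added so the result is in screen coordinates.
    """
    points = []
    for y, x in sorted(positions, key=lambda p: p[1]):
        screen_x = int(element_x + x / scale)
        screen_y = int(element_y + panel_height + y / scale)
        points.append((screen_x, screen_y))
    return points


# Example: two characters, the second one further left than the first
print(to_screen_points([(40, 300), (44, 100)], 500, 200, 80))
```

Sorting by x replaces the explicit two-element comparison and also handles one or more than two characters.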

III. Complete code

This is learning code and may be somewhat redundant; it is for reference only.

import os
import time
import base64

from PIL import Image
from pymouse import PyMouse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Attach to the browser that was started with --remote-debugging-port=9222
driver_options = Options()
driver_options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
driver = webdriver.Chrome(executable_path='chromedriver.exe', options=driver_options)
driver.maximize_window()
driver.get(url='https://www.zhihu.com/signin')
time.sleep(1)
# Switch to password login and fill in the credentials
driver.find_element_by_xpath('//div[@class="SignFlow-tabs"]//div[@class="SignFlow-tab"]').click()
driver.find_element_by_xpath('//input[@name="username"]').send_keys('xxxxxx')
driver.find_element_by_xpath('//input[@name="password"]').send_keys('xxxxxx')
time.sleep(1)
# Try clicking the login button
try:
    driver.find_element_by_xpath('//button[@type="submit"]').click()
except Exception:
    pass
time.sleep(5)
# Look for a captcha
try:
    english_captcha = driver.find_element_by_class_name('Captcha-englishImg')
except Exception:
    english_captcha = None
try:
    chinese_captcha = driver.find_element_by_class_name('Captcha-chineseImg')
except Exception:
    chinese_captcha = None
file_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'yzm_c.jpeg')
if chinese_captcha:
    # Upside-down character captcha
    ele_position = chinese_captcha.location  # position of the element in the browser
    x_relative = ele_position["x"]
    y_relative = ele_position["y"]
    # Run JS: outer window height minus page viewport height = height of the top panel
    browser_navigation_panel_height = driver.execute_script('return window.outerHeight-window.innerHeight;')
    base64_text = chinese_captcha.get_attribute("src")  # base64-encoded image content
    code = base64_text.replace('data:image/jpg;base64,', '').replace('%0A', '')
    if os.path.exists(file_path):
        os.remove(file_path)
    with open(file_path, 'wb') as f:
        f.write(base64.b64decode(code))
    # zheye locates the upside-down characters
    from ArticleSpider.zheye import zheye
    z = zheye()
    # Returns a list of (y, x) coordinates, multiplied by 2
    positions = z.Recognize(file_path)
    pos_arr = []  # adjusted coordinates
    if len(positions) == 2:
        # Two upside-down characters: order them by x coordinate
        if positions[0][1] > positions[1][1]:
            pos_arr.append([positions[1][1] / 2, positions[1][0] / 2])
            pos_arr.append([positions[0][1] / 2, positions[0][0] / 2])
        else:
            pos_arr.append([positions[0][1] / 2, positions[0][0] / 2])
            pos_arr.append([positions[1][1] / 2, positions[1][0] / 2])
    else:
        pos_arr.append([positions[0][1] / 2, positions[0][0] / 2])
    # Click each upside-down character in turn
    for pos in pos_arr:
        print('x:{0}'.format(x_relative + pos[0]))
        print('y_relative:{0}'.format(y_relative))
        print('browser_navigation_panel_height:{}'.format(browser_navigation_panel_height))
        print('pos[1]:{}'.format(pos[1]))
        m = PyMouse()
        x = int(x_relative + pos[0])
        y = int(y_relative + browser_navigation_panel_height + pos[1])
        m.click(x, y)
        time.sleep(3)
elif english_captcha:
    # English captcha
    base64_text = english_captcha.get_attribute('src')
    code = base64_text.replace('data:image/jpg;base64,', '').replace('%0A', '')
    if os.path.exists(file_path):
        os.remove(file_path)
    with open(file_path, 'wb') as f:
        f.write(base64.b64decode(code))
    try:
        im = Image.open(file_path)
        im.show()
        im.close()
    except Exception:
        pass
    yzm = input("Enter the captcha: ")
    driver.find_element_by_xpath('//input[@name="captcha"]').send_keys(yzm)
# Click login
driver.find_element_by_xpath('//button[@type="submit"]').click()
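The script relies on fixed `time.sleep` calls, which either wait too long or not long enough. One possible alternative is to poll until an element shows up. The helper below is a generic standard-library sketch with a hypothetical name, `wait_until`; Selenium also ships its own `WebDriverWait` class in `selenium.webdriver.support.ui`, which serves the same purpose.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Call predicate() repeatedly until it returns a truthy value or timeout expires.

    Exceptions raised by predicate() are treated as "not ready yet".
    Returns the truthy value, or None if the timeout is reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = predicate()
        except Exception:
            result = None
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Possible use with the script above (driver is the attached browser):
# english_captcha = wait_until(
#     lambda: driver.find_element_by_class_name('Captcha-englishImg'), timeout=5)
```

Because a failed lookup raises an exception, the helper returns `None` when no captcha appears, which matches how the script treats a missing captcha.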

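IV. Summary

1. Get familiar with the Selenium login flow
2. Offer an approach for logging in when a captcha is involved
3. If an image has to be downloaded with a separate request, send the cookies along with that request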
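For point 3, here is a minimal sketch of carrying the browser's cookies into a separate download request. It assumes the cookies come from Selenium's `driver.get_cookies()`, which returns a list of dicts with `name` and `value` keys; the helper names, the URL, and the User-Agent value are hypothetical.

```python
import urllib.request


def cookie_header(selenium_cookies):
    """Join Selenium-style cookie dicts into a single Cookie header value."""
    return "; ".join("{}={}".format(c["name"], c["value"]) for c in selenium_cookies)


def build_image_request(url, selenium_cookies, user_agent="Mozilla/5.0"):
    """Create a urllib Request that carries the browser's cookies."""
    request = urllib.request.Request(url)
    request.add_header("Cookie", cookie_header(selenium_cookies))
    request.add_header("User-Agent", user_agent)
    return request


# Possible use (img_url is a placeholder for the image address):
# req = build_image_request(img_url, driver.get_cookies())
# with urllib.request.urlopen(req) as resp, open('img.jpeg', 'wb') as f:
#     f.write(resp.read())
```

Sending the same cookies keeps the download in the same login session as the browser, so the server returns the captcha that belongs to that session.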

