Simulating a Douban Login with Selenium and Scraping Short Reviews of 武林外传 (My Own Swordsman)


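When we first started writing crawlers, scraping Douban comments was straightforward: the data endpoint could be found directly in the browser's developer tools (F12). After a recent Douban update, however, the data is loaded asynchronously by JavaScript, and we could not find a suitable way to fetch it directly, so we used Selenium to drive a real browser instead.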

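Douban's login page has also been restyled. The login form now sits inside a separate frame (an iframe), so the driver has to switch into that frame before it can interact with the form.

The code is as follows: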

# -*- coding:utf-8 -*-
# Imports
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

# Create the Chrome options object
opt = webdriver.ChromeOptions()
# Run Chrome in headless mode (no visible window); this works on both Windows and Linux
opt.add_argument("--headless")
# Use Chrome as the browser
driver = webdriver.Chrome(options=opt)
# Open the Douban home page
driver.get("http://www.douban.com/")
# Switch into the login frame
driver.switch_to.frame(driver.find_elements(By.TAG_NAME, "iframe")[0])
# Click "密码登录" (password login)
bottom1 = driver.find_element(By.XPATH, '/html/body/div[1]/div[1]/ul[1]/li[2]')
bottom1.click()
# Enter the account name and password
input1 = driver.find_element(By.XPATH, '//*[@id="username"]')
input1.clear()
input1.send_keys("xxxxx")
input2 = driver.find_element(By.XPATH, '//*[@id="password"]')
input2.clear()
input2.send_keys("xxxxx")
# Log in
bottom = driver.find_element(By.CLASS_NAME, 'account-form-field-submit')
bottom.click()
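
Because the login form lives in an iframe, every lookup has to happen after switching into that frame, and the driver stays in that frame until it is told otherwise. The following is a minimal sketch of one way to keep the switch scoped; `inside_frame` is a hypothetical helper name, not part of Selenium, and it relies only on the `driver.switch_to.frame(...)` and `driver.switch_to.default_content()` calls that Selenium provides.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame):
    """Switch `driver` into `frame`, then return to the top-level page.

    `inside_frame` is a hypothetical helper for illustration.
    `frame` can be anything driver.switch_to.frame() accepts,
    for example a frame element found on the page.
    """
    driver.switch_to.frame(frame)
    try:
        yield driver
    finally:
        # Always leave the frame, even if a lookup inside it failed
        driver.switch_to.default_content()
```

With a real driver this could be used as `with inside_frame(driver, login_frame): ...`, where `login_frame` is the first iframe element on the page and the body fills in the form and clicks the login button.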

Next, go to the comments page: https://movie.douban.com/subject/3882715/comments?sort=new_score

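Clicking "next page" changes the URL to https://movie.douban.com/subject/3882715/comments?start=20&limit=20&sort=new_score. Only the start parameter changes between pages (each page holds 20 comments, so it grows by 20 per page), which means we can simply write a loop over it.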
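
The pagination pattern can be sketched as a small generator. This is only an illustration under the assumption that `start` always advances by `limit`; the function name `comment_page_urls` and its parameters are made up for this example.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/subject/3882715/comments"


def comment_page_urls(pages, page_size=20, sort="new_score"):
    """Yield the comment-list URL for each of the first `pages` pages."""
    for page in range(pages):
        # start = 0, 20, 40, ... selects the first comment of each page
        query = urlencode({"start": page * page_size,
                           "limit": page_size,
                           "sort": sort})
        yield "{}?{}".format(BASE_URL, query)


for url in comment_page_urls(3):
    print(url)
```

The second URL printed has the same form as the one seen after clicking "next page" (start=20&limit=20&sort=new_score), and each URL can be passed to `driver.get(...)` in turn.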

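Get the user's name (here `By` is `selenium.webdriver.common.by.By`, and `i` is the 1-based position of the comment on the page):

driver.find_element(By.XPATH, '//*[@id="comments"]/div[{}]/div[2]/h3/span[2]/a'.format(i)).text

Get the user's comment:

driver.find_element(By.XPATH, '//*[@id="comments"]/div[{}]/div[2]/p/span'.format(i)).text

Next, we want to know where each user lives: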

# Get the user's profile URL, then open it to read the location
userInfo = driver.find_element(By.XPATH, '//*[@id="comments"]/div[{}]/div[2]/h3/span[2]/a'.format(i)).get_attribute('href')
driver.get(userInfo)
try:
    userLocation = driver.find_element(By.XPATH, '//*[@id="profile"]/div/div[2]/div[1]/div/a').text
    print("The user's location is:")
    print(userLocation)
except Exception as e:
    print(e)

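Note that some users have not filled in a location. For those profiles the element lookup fails and raises an exception, so the exception must be caught.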
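
As an alternative to catching the exception, Selenium's plural `find_elements` returns an empty list when nothing matches instead of raising. The sketch below uses that behaviour; `get_location` is a hypothetical helper name, and the string `"xpath"` is the value of Selenium's `By.XPATH` constant, written out so the snippet has no imports of its own.

```python
LOCATION_XPATH = '//*[@id="profile"]/div/div[2]/div[1]/div/a'


def get_location(driver):
    """Return the location shown on a profile page, or None if it is absent.

    `driver` is assumed to be a Selenium WebDriver that has already
    opened the user's profile page.
    """
    # find_elements (plural) yields [] for a missing field instead of raising
    matches = driver.find_elements("xpath", LOCATION_XPATH)
    return matches[0].text if matches else None
```

A caller can then treat `None` as "no location given" and skip the print, which keeps genuine errors such as a lost browser session from being swallowed by a broad `except`.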

Complete code

# -*- coding:utf-8 -*-
# Imports
import time
from selenium import webdriver
from selenium.webdriver.common.by import By


class doubanwlwz_spider():
    def __init__(self, pageNum):
        # Create the Chrome options object
        opt = webdriver.ChromeOptions()
        # Run Chrome in headless mode (no visible window); this works on both Windows and Linux
        opt.add_argument("--headless")
        # Use Chrome as the browser
        driver = webdriver.Chrome(options=opt)
        self.getInfo(driver, pageNum)

    def getInfo(self, driver, pageNum):
        # Open Douban and switch into the login frame
        driver.get("http://www.douban.com/")
        driver.switch_to.frame(driver.find_elements(By.TAG_NAME, "iframe")[0])
        # Click "密码登录" (password login)
        bottom1 = driver.find_element(By.XPATH, '/html/body/div[1]/div[1]/ul[1]/li[2]')
        bottom1.click()
        # Enter the account name and password
        input1 = driver.find_element(By.XPATH, '//*[@id="username"]')
        input1.clear()
        input1.send_keys("ZZZ2")
        input2 = driver.find_element(By.XPATH, '//*[@id="password"]')
        input2.clear()
        input2.send_keys("ZZZ")
        # Log in
        bottom = driver.find_element(By.CLASS_NAME, 'account-form-field-submit')
        bottom.click()
        time.sleep(1)
        # Each page holds 20 comments, so `start` advances by 20 per page
        for page in range(pageNum):
            pageUrl = 'https://movie.douban.com/subject/3882715/comments?start={}&limit=20&sort=new_score'.format(page * 20)
            driver.get(pageUrl)
            for i in range(1, 21):
                # Get the user's name
                print("The user's name is:")
                print(driver.find_element(By.XPATH, '//*[@id="comments"]/div[{}]/div[2]/h3/span[2]/a'.format(i)).text)
                # Get the user's comment
                print("The user's comment is:")
                print(driver.find_element(By.XPATH, '//*[@id="comments"]/div[{}]/div[2]/p/span'.format(i)).text)
                # Get the user's profile URL, then open it to read the location
                userInfo = driver.find_element(By.XPATH, '//*[@id="comments"]/div[{}]/div[2]/h3/span[2]/a'.format(i)).get_attribute('href')
                driver.get(userInfo)
                try:
                    userLocation = driver.find_element(By.XPATH, '//*[@id="profile"]/div/div[2]/div[1]/div/a').text
                    print("The user's location is:")
                    print(userLocation)
                except Exception as e:
                    print(e)
                # Return to the comment list before reading the next comment
                driver.back()
        # Close the browser once all pages are done
        driver.quit()


pageNum = int(input("Enter the number of comment pages to scrape: "))
AAA = doubanwlwz_spider(pageNum)

Reposted from: https://www.cnblogs.com/ZFBG/p/10992970.html
