A Python Selenium crawler tool

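Today a colleague on the SEO team asked for a simple crawler: given a URL, collect the <a> links on that page, then visit each linked page and collect its <a> links as well.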
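The two-pass traversal described above can be sketched without a browser. This is only an illustration under stated assumptions: `crawl_two_levels` and the `fetch_links` callback are hypothetical names, and the callback stands in for whatever actually reads the links from a page (Selenium, in this post).

```python
def crawl_two_levels(start_url, fetch_links):
    """Collect links on start_url, then links on each of those pages.

    fetch_links(url) is a hypothetical callback returning the href values
    found on the page at url. Returns (parent_url, link_url) pairs, with
    each link reported only the first time it is seen.
    """
    seen = set()
    pairs = []

    def visit(page_url):
        for link in fetch_links(page_url):
            # Skip mailto:, javascript: and similar non-HTTP targets.
            if link.startswith("http") and link not in seen:
                seen.add(link)
                pairs.append((page_url, link))

    visit(start_url)
    # Snapshot the first-level links so that links discovered during the
    # second pass are recorded but not visited themselves.
    first_level = [link for _, link in pairs]
    for page_url in first_level:
        visit(page_url)
    return pairs
```

Because the crawl stops after the second pass, the number of pages opened is one plus the number of distinct links on the start page.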

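1. A global set (set([])) holds every URL collected so far, so each address is recorded only once.

2. Single-page applications are increasingly common, so Selenium is used to load each page in a real browser. Pages can be long and may load more content as you scroll, so the program keeps scrolling to the bottom and reads the page height after each scroll with driver.execute_script("return document.body.scrollHeight;"). Once the height stops changing, the page is treated as fully loaded.

3. The hyperlinks on the page are found with driver.find_elements_by_xpath("//a[@href]"). (This is the Selenium 3 API; in Selenium 4 the equivalent call is driver.find_elements(By.XPATH, "//a[@href]").)

4. To make the results easy to inspect, each URL is written to the log as soon as it is collected. (I first tried writing the log after the whole crawl had finished, but a crawl takes a long time, and a single exception meant that nothing was recorded at all.)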
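Steps 1 and 4 can be sketched on their own, without a browser. This is a minimal sketch rather than the code used in this post: the helper name `record_link` is hypothetical, and it swaps in the standard `csv` module so that a URL containing a comma still occupies a single column.

```python
import csv
import os

LOG_HEADER = ["parentUrl", "currenturl"]

def record_link(seen, log_path, parent_url, link_url):
    """Log link_url once; return True if it was new, False if already seen."""
    if link_url in seen:
        return False
    seen.add(link_url)
    is_new_file = not os.path.exists(log_path)
    # Open, append and close for every record, so rows written earlier
    # survive if the crawl crashes later.
    with open(log_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new_file:
            writer.writerow(LOG_HEADER)
        writer.writerow([parent_url, link_url])
    return True
```

Reopening the file for every row is slower than keeping it open, but a crawl spends most of its time waiting on pages, so the cost is unlikely to matter here.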

The complete code:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.common.exceptions import TimeoutException
import time
import datetime
from urllib import parse
import os

# Every URL collected so far; used to avoid logging the same link twice.
urls = set([])


def getUrl(url, host):
    """Open url in the browser, scroll to the bottom, and log each new http(s) link."""
    driver = webdriver.Ie()
    try:
        #driver = webdriver.Firefox()
        driver.set_page_load_timeout(10)
        driver.get(url)
        #time.sleep(2)
        # Keep scrolling until the page height stops growing.
        all_window_height = []
        all_window_height.append(driver.execute_script("return document.body.scrollHeight;"))
        while True:
            driver.execute_script("scroll(0,100000)")
            time.sleep(1)
            check_height = driver.execute_script("return document.body.scrollHeight;")
            if check_height == all_window_height[-1]:
                print("Finished scrolling")
                break
            else:
                all_window_height.append(check_height)
                print("Still scrolling")
        #for link in driver.find_elements_by_xpath("//*[@href]"):
        #for link in driver.find_elements_by_tag_name("a"):
        # Selenium 3 API; in Selenium 4 use driver.find_elements(By.XPATH, "//a[@href]").
        for link in driver.find_elements_by_xpath("//a[@href]"):
            try:
                tempurl1 = link.get_attribute('href')
                if tempurl1.startswith("http"):
                    if tempurl1 not in urls:
                        urls.add(tempurl1)
                        log(host, url + ',' + tempurl1)
                        print(tempurl1)
            except Exception:
                # The element may have gone stale or have no usable href.
                print(link)
    except Exception as e:
        print(e)
    finally:
        driver.quit()


def log(name, msg):
    """Append one line to D://<name>.csv, creating the file with a header first."""
    filename = 'D://' + name + '.csv'
    if not os.path.exists(filename):
        with open(filename, 'w') as f:
            print('create file:' + filename)
            f.write('parentUrl,currenturl' + '\n')
    with open(filename, 'a') as f:
        f.write(msg + '\n')


url = input("Enter a url: ")
try:
    urls.clear()
    url = url.strip()
    if len(url) > 0:
        host = parse.urlparse(url).netloc
        print(url + " links:")
        t1 = datetime.datetime.now()
        getUrl(url, host)
        # Copy the first-level links before the second pass adds more to the set.
        l = list(urls)
        for item in l:
            print(item + " links:")
            getUrl(item, host)
        t2 = datetime.datetime.now()
        tt = int((t2 - t1).total_seconds())
        minutes = tt // 60
        seconds = tt % 60
        print("total cost %d minutes %d seconds" % (minutes, seconds))
except Exception as e:
    print(e)
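One caveat about the scroll loop in getUrl: it has no upper bound, so a page that keeps loading content indefinitely would never leave it. A bounded variant might look like the sketch below. The function name `scroll_until_stable` and the `max_rounds` parameter are hypothetical, and the two callbacks stand in for the driver.execute_script calls so that the loop can be exercised without a browser.

```python
import time

def scroll_until_stable(get_height, scroll_down, pause=1.0, max_rounds=30):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the list of distinct heights observed, starting with the
    height before the first scroll.
    """
    heights = [get_height()]
    for _ in range(max_rounds):
        scroll_down()
        time.sleep(pause)  # give lazily loaded content time to render
        height = get_height()
        if height == heights[-1]:
            break  # nothing new was loaded
        heights.append(height)
    return heights

# With a Selenium driver, the callbacks could be (assumption, not from the post):
#   get_height  = lambda: driver.execute_script("return document.body.scrollHeight;")
#   scroll_down = lambda: driver.execute_script("scroll(0,100000)")
```

The cap trades completeness for a guaranteed exit: on a very long feed the crawler collects only the links loaded within max_rounds scrolls.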

Then run pyinstaller -F a.py to package the script as a single executable.

For using Selenium with IE, see https://blog.csdn.net/ma_jiang/article/details/96022775

Reposted from: https://www.cnblogs.com/majiang/p/11196132.html
