Python + Selenium crawler: looping over a name list to scrape each author's CNKI statistics (download counts and more)

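The script scrapes the fields of the CNKI scholar-search result table, such as the number of publications and the number of citations.
It uses a Selenium-driven browser to do the crawling.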
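The crawl is driven by a list of author names stored in a CSV file. As a minimal sketch of that step in isolation, the helper below reads the first column of each row and skips blank lines. The function name `read_names` and the sample content are illustrative assumptions, not part of the original script; 张三 is a placeholder name.

```python
import csv
import io


def read_names(handle):
    """Return the non-empty first-column values from a CSV name list."""
    names = []
    for row in csv.reader(handle):
        # Skip blank lines and rows whose first cell is empty
        if row and row[0].strip():
            names.append(row[0].strip())
    return names


# Hypothetical list content; in the real script this comes from sd.csv
sample = io.StringIO("赵媛\n\n张三,some note\n")
print(read_names(sample))
```

Using `csv.reader` rather than splitting on newlines and commas by hand means a trailing empty line does not turn into an empty search term.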

from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By
import csv
import time

# find_element(By.X, ...) is used throughout because the
# find_element_by_* helpers were removed in Selenium 4.3.
browser = webdriver.Chrome()
browser.minimize_window()  # minimize the browser window
url = 'http://epub.cnki.net/grid2008/brief/result_src.aspx?comptype=scho&stype=1&&auscho_1_sel=%E5%AD%A6%E8%80%85&auscho_1_value1=%E8%B5%B5%E5%AA%9B&auscho_1_special1=%3D&auscho2_1_sel=%E7%A0%94%E7%A9%B6%E9%A2%86%E5%9F%9F%2C%E7%A0%94%E7%A9%B6%E6%96%B9%E5%90%91%2C%E5%AD%A6%E8%80%85%E7%9F%A5%E8%AF%86%E5%90%91%E9%87%8F&auscho2_1_value1=%E5%9B%BE%E4%B9%A6%E6%83%85%E6%8A%A5&auscho2_1_special1=%25&navicode=&showtitle=%u5B66%u8005%u68C0%u7D22&dbCatalog=%u4E2D%u56FD%u5B66%u672F%u6587%u732E%u7F51%u7EDC%u51FA%u7248%u603B%u5E93'

# Raw strings keep the backslashes in Windows paths from being read as escapes
DATA_FILE = r'D:\Python_DATA\data.csv'  # output: scraped table rows
NAME_FILE = r'D:\Python_DATA\sd.csv'    # input: list of author names


def save_result_table():
    # The result table lives inside the iframe with id "iframeResult"
    browser.switch_to.default_content()
    browser.switch_to.frame('iframeResult')
    try:
        table = browser.find_element(By.CLASS_NAME, 's_table')
        with open(DATA_FILE, 'a', encoding='utf-8', newline='') as csvfile:
            writer = csv.writer(csvfile)
            # One CSV row per <tr>, one field per <td>
            for tr in table.find_elements(By.TAG_NAME, 'tr'):
                row = [td.text for td in tr.find_elements(By.TAG_NAME, 'td')]
                writer.writerow(row)
                print(row)  # echo the table content
    finally:
        # Always return to the top-level page so the search box stays reachable
        browser.switch_to.default_content()


def start_spider():
    # Load the search page and save the result for the author preset in the URL
    browser.get(url)
    time.sleep(5)  # wait for the page and the iframe to load
    save_result_table()

    # Convert the name list to a Python list.
    # Assumption: the author name is the first column of each row.
    with open(NAME_FILE, 'r') as f:
        names = [row[0].strip() for row in csv.reader(f) if row and row[0].strip()]

    for name in names:
        print(name)
        try:
            box = browser.find_element(By.ID, 'auscho_1_value1')
            box.clear()
            box.send_keys(name)
            browser.find_element(By.CLASS_NAME, 'butt04').click()
            time.sleep(2)  # wait for the results to refresh
            save_result_table()
        except WebDriverException:
            # Some authors return no result; report it and move on
            print('No record found for this author')
            continue


if __name__ == '__main__':
    start_spider()
    # browser.close()
    print('Scraping finished; check the output folder for the results.')
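The start URL in the script is percent-encoded, which hides the search it presets. The sketch below decodes two of its query parameters with the standard library. The variable `start_url` is a shortened copy made for illustration (only the two search-value parameters are kept), so it is not a URL to request as-is.

```python
from urllib.parse import urlsplit, parse_qs, quote

# Shortened copy of the script's start URL: only the two
# search-value parameters are kept (illustration only)
start_url = (
    'http://epub.cnki.net/grid2008/brief/result_src.aspx'
    '?auscho_1_value1=%E8%B5%B5%E5%AA%9B'
    '&auscho2_1_value1=%E5%9B%BE%E4%B9%A6%E6%83%85%E6%8A%A5'
)

# parse_qs decodes percent-escapes as UTF-8 by default
params = parse_qs(urlsplit(start_url).query)
author = params['auscho_1_value1'][0]
field = params['auscho2_1_value1'][0]
print(author, field)  # 赵媛 图书情报

# Encoding the name again reproduces the form used in the URL
print(quote(author))  # %E8%B5%B5%E5%AA%9B
```

So the page initially opens with a search for the scholar 赵媛 restricted to 图书情报 (library and information science), and the loop then replaces the name through the `auscho_1_value1` input box. Whether the site would accept a hand-edited URL instead of typing into the box is not checked here.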

Result screenshot:
