Web of Science Crawler in Practice (POST Method)
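1. Overview
This crawler looks up a paper by its title and scrapes three numbers for it: the citation count, the downloads over the last 180 days, and the total downloads. It searches the Web of Science Core Collection and sends the query with the post method of Python's requests library. To speed things up, version 2.0 runs several requests in parallel (described here as multithreading, although the code below actually uses multiprocessing.Process).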
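2. Site and crawling strategy analysis
First, open http://apps.webofknowledge.com/UA_GeneralSearch_input.do?product=UA&search_mode=GeneralSearch&SID=5BKzcH5qIlQiOPN9Wjv&preferencesSaved=
Once the page has loaded, press F12 and switch to the Network tab on the right, as shown below:
Figure 1: Search page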
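The SID value in the URL above looks like a session identifier, and the full script in section 3 reads a fresh one from the URL that http://www.webofknowledge.com/ redirects to. The following is a minimal sketch of that extraction, assuming the SID is always a run of word characters after `SID=`. The helper name `extract_sid` is made up for illustration and does not appear in the original script.

```python
import re


def extract_sid(url):
    """Return the SID query value from a Web of Science URL, or None if absent."""
    # Assumption: the SID consists only of letters, digits and underscores.
    match = re.search(r'SID=(\w+)', url)
    return match.group(1) if match else None


url = ('http://apps.webofknowledge.com/UA_GeneralSearch_input.do'
       '?product=UA&search_mode=GeneralSearch&SID=5BKzcH5qIlQiOPN9Wjv&preferencesSaved=')
print(extract_sid(url))
```

Using a capture group avoids the chained `.replace('SID=', '').replace('&', '')` calls and also works when the SID is the last parameter in the URL.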
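Next, in the red box on the right, select "Web of Science Core Collection" and the "Title" field, then enter a title. I use "Black carbon in soils from different land use areas of Shanghai, China: Level, sources and relationship with polycyclic aromatic hydrocarbons" as the example. The content of the red box on the right in the figure below is the data that has to be submitted with the POST request.
Figure 2: Source of the POST form data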
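To make the form data more concrete, here is a small sketch that builds the title-search fields and shows how they are URL-encoded. The field names and values are taken from the full script in section 3. The function name `build_search_form` is hypothetical, and the sketch keeps only a subset of the fields; whether the server accepts this reduced form has not been verified.

```python
from urllib.parse import urlencode


def build_search_form(sid, title):
    """Return a subset of the form fields used for a title search."""
    return {
        'fieldCount': 1,
        'action': 'search',
        'product': 'WOS',              # Web of Science Core Collection
        'search_mode': 'GeneralSearch',
        'SID': sid,
        'value(input1)': title,        # the search text
        'value(select1)': 'TI',        # TI = search in the title field
        'period': 'Range Selection',
        'range': 'ALL',
    }


form = build_search_form('5BKzcH5qIlQiOPN9Wjv', 'Black carbon in soils')
# This is the body that requests sends when the dict is passed as data=...
print(urlencode(form))
```

Passing the dict as `data=` to `requests.post` produces the same `application/x-www-form-urlencoded` body, so the encoding step never has to be written by hand.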
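Click Search to reach the page we want to scrape, shown below:
Figure 3: Page to scrape
Hover over the area marked with the red box on the right, right-click and choose Inspect to open the developer tools, then study the HTML there in detail.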
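The script in section 3 pulls the numbers out of that HTML with two lxml XPath expressions: `//div[@class='search-results-data-cite']/a/text()` and `.//div[@class='alum_text']/span/text()`. The sketch below applies the same selection logic to a hand-written HTML fragment. It uses only the standard library's `html.parser` so that it runs without lxml, which is a substitution and not what the original script does. The sample markup is an assumption about the page structure; only the two class names come from the original code.

```python
from html.parser import HTMLParser

VOID_TAGS = {'br', 'img', 'input', 'meta', 'link', 'hr'}


class CountParser(HTMLParser):
    """Collect text of <a> directly inside div.search-results-data-cite
    and of <span> directly inside div.alum_text."""

    def __init__(self):
        super().__init__()
        self.stack = []      # (tag, class attribute) of currently open elements
        self.cited = []
        self.download = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:  # void elements never get an end tag
            self.stack.append((tag, dict(attrs).get('class', '')))

    def handle_endtag(self, tag):
        # Close the innermost open element with this tag name.
        for idx in range(len(self.stack) - 1, -1, -1):
            if self.stack[idx][0] == tag:
                del self.stack[idx:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or len(self.stack) < 2:
            return
        (ptag, pclass), (tag, _) = self.stack[-2], self.stack[-1]
        if ptag == 'div' and pclass == 'search-results-data-cite' and tag == 'a':
            self.cited.append(text)
        elif ptag == 'div' and pclass == 'alum_text' and tag == 'span':
            self.download.append(text)


def parse_counts(html):
    parser = CountParser()
    parser.feed(html)
    return parser.cited, parser.download


# Hypothetical fragment; the real page contains much more markup.
SAMPLE = """
<div class="search-results-data-cite">Times Cited: <a href="#">12</a></div>
<div class="alum_text"><span>3</span></div>
<div class="alum_text"><span>45</span></div>
"""
print(parse_counts(SAMPLE))
```

With lxml installed, the two XPath expressions quoted above express the same selection in one line each, which is why the original script uses them.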
3. Crawler code
The complete Python code follows. It is also available on my GitHub: https://github.com/jgzquanquan/Spyder_wos
title_wos version 1.0
import re

import requests
import xlrd  # note: xlrd 2.0+ no longer reads .xlsx files, so this needs xlrd < 2.0
from lxml import etree


class SpiderMain(object):
    def __init__(self, sid, kanming):
        self.headers = {
            'Origin': 'https://apps.webofknowledge.com',
            'Referer': 'https://apps.webofknowledge.com/UA_GeneralSearch_input.do?product=UA&search_mode=GeneralSearch&SID=R1ZsJrXOFAcTqsL6uqh&preferencesSaved=',
            'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36",
            'Content-Type': 'application/x-www-form-urlencoded'
        }
        # Form submitted for a title search (see Figure 2)
        self.form_data = {
            'fieldCount': 1,
            'action': 'search',
            'product': 'WOS',
            'search_mode': 'GeneralSearch',
            'SID': sid,
            'max_field_count': 25,
            'formUpdated': 'true',
            'value(input1)': kanming,
            'value(select1)': 'TI',
            'value(hidInput1)': '',
            'limitStatus': 'collapsed',
            'ss_lemmatization': 'On',
            'ss_spellchecking': 'Suggest',
            'SinceLastVisit_UTC': '',
            'SinceLastVisit_DATE': '',
            'period': 'Range Selection',
            'range': 'ALL',
            'startYear': '1982',
            'endYear': '2017',
            'update_back2search_link_param': 'yes',
            'ssStatus': 'display:none',
            'ss_showsuggestions': 'ON',
            'ss_query_language': 'auto',
            'ss_numDefaultGeneralSearchFields': 1,
            'rs_sort_by': 'PY.D;LD.D;SO.A;VL.D;PG.A;AU.A'
        }
        # Form submitted to remove an entry from the search history
        self.form_data2 = {
            'product': 'WOS',
            'prev_search_mode': 'CombineSearches',
            'search_mode': 'CombineSearches',
            'SID': sid,
            'action': 'remove',
            'goToPageLoc': 'SearchHistoryTableBanner',
            'currUrl': 'https://apps.webofknowledge.com/WOS_CombineSearches_input.do?SID=' + sid + '&product=WOS&search_mode=CombineSearches',
            'x': 48,
            'y': 9,
            'dSet': 1
        }

    def craw(self, root_url, i):
        # Defaults, so the except branch can always return something
        cited, download = [], []
        try:
            s = requests.Session()
            r = s.post(root_url, data=self.form_data, headers=self.headers)
            r.encoding = r.apparent_encoding
            tree = etree.HTML(r.text)
            cited = tree.xpath("//div[@class='search-results-data-cite']/a/text()")
            download = tree.xpath(".//div[@class='alum_text']/span/text()")
            flag = 0
            print(i, cited, download, r.url)
        except Exception as e:
            print(i, e)
            flag = 1  # mark this row as failed
        return cited, download, flag

    def delete_history(self):
        murl = 'https://apps.webofknowledge.com/WOS_CombineSearches.do'
        s = requests.Session()
        s.post(murl, data=self.form_data2, headers=self.headers)


root_url = 'https://apps.webofknowledge.com/WOS_GeneralSearch.do'

if __name__ == "__main__":
    root = 'http://www.webofknowledge.com/'
    s = requests.get(root)
    # The redirected URL carries the SID
    sid = re.findall(r'SID=\w+&', s.url)[0].replace('SID=', '').replace('&', '')
    data = xlrd.open_workbook('2015年研究生发表论文.xlsx')
    table = data.sheets()[2]  # which sheet to read
    nrows = table.nrows
    for i in range(2, nrows):
        if i % 100 == 0:
            # Refresh the SID every 100 rows
            s = requests.get(root)
            sid = re.findall(r'SID=\w+&', s.url)[0].replace('SID=', '').replace('&', '')
        kanming = table.cell(i, 5).value  # title: row i, column 6
        obj_spider = SpiderMain(sid, kanming)
        cited, download, flag = obj_spider.craw(root_url, i)
        # Reopen in append mode for every row so finished rows survive a crash
        with open('2015_3.csv', 'a') as csv, open('fail.txt', 'a') as fail:
            if flag == 1:
                fail.write(str(i) + '\n')
            else:
                if len(cited) == 0:
                    cited.append(0)
                if len(download) == 0:
                    download.append(0)
                    download.append(0)
                csv.write(str(i) + ',' + str(cited[0]) + ',' + str(download[0]) + ',' + str(download[1]) + '\n')
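The last step of version 1.0 fills in zeros when the page shows no citation or download numbers and then writes one CSV line per paper. That logic can be isolated in a small helper, sketched here. The name `format_row` is hypothetical. The sketch assumes the two download values arrive in the order "last 180 days, total", following the overview, and it pads a one-element list as well, a case the original script does not handle.

```python
def format_row(row, cited, download):
    """Build one CSV line: row index, citations, 180-day downloads, total downloads."""
    cited_value = cited[0] if cited else 0
    # Pad with zeros so there are always exactly two download values.
    padded = (list(download) + [0, 0])[:2]
    return f"{row},{cited_value},{padded[0]},{padded[1]}\n"


print(format_row(7, ['12'], ['3', '45']), end='')
print(format_row(8, [], []), end='')
```

Keeping the formatting in one function means versions 1.0 and 2.0 can share it and write identical output.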
title_wos version 2.0
import re
from multiprocessing import Process
from multiprocessing import Manager

import requests
import xlrd  # note: xlrd 2.0+ no longer reads .xlsx files, so this needs xlrd < 2.0
from lxml import etree


class SpiderMain(object):
    def __init__(self, sid, kanming):
        self.headers = {
            'Origin': 'https://apps.webofknowledge.com',
            'Referer': 'https://apps.webofknowledge.com/UA_GeneralSearch_input.do?product=UA&search_mode=GeneralSearch&SID=R1ZsJrXOFAcTqsL6uqh&preferencesSaved=',
            'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36",
            'Content-Type': 'application/x-www-form-urlencoded'
        }
        # Form submitted for a title search (see Figure 2)
        self.form_data = {
            'fieldCount': 1,
            'action': 'search',
            'product': 'WOS',
            'search_mode': 'GeneralSearch',
            'SID': sid,
            'max_field_count': 25,
            'formUpdated': 'true',
            'value(input1)': kanming,
            'value(select1)': 'TI',
            'value(hidInput1)': '',
            'limitStatus': 'collapsed',
            'ss_lemmatization': 'On',
            'ss_spellchecking': 'Suggest',
            'SinceLastVisit_UTC': '',
            'SinceLastVisit_DATE': '',
            'period': 'Range Selection',
            'range': 'ALL',
            'startYear': '1982',
            'endYear': '2017',
            'update_back2search_link_param': 'yes',
            'ssStatus': 'display:none',
            'ss_showsuggestions': 'ON',
            'ss_query_language': 'auto',
            'ss_numDefaultGeneralSearchFields': 1,
            'rs_sort_by': 'PY.D;LD.D;SO.A;VL.D;PG.A;AU.A'
        }
        # Form submitted to remove an entry from the search history
        self.form_data2 = {
            'product': 'WOS',
            'prev_search_mode': 'CombineSearches',
            'search_mode': 'CombineSearches',
            'SID': sid,
            'action': 'remove',
            'goToPageLoc': 'SearchHistoryTableBanner',
            'currUrl': 'https://apps.webofknowledge.com/WOS_CombineSearches_input.do?SID=' + sid + '&product=WOS&search_mode=CombineSearches',
            'x': 48,
            'y': 9,
            'dSet': 1
        }

    def craw(self, root_url, i):
        # Defaults, so the except branch can always return something
        cited, download = [], []
        try:
            s = requests.Session()
            r = s.post(root_url, data=self.form_data, headers=self.headers)
            r.encoding = r.apparent_encoding
            tree = etree.HTML(r.text)
            # Convert to plain str before storing the results in the shared dict
            cited = [str(x) for x in tree.xpath("//div[@class='search-results-data-cite']/a/text()")]
            download = [str(x) for x in tree.xpath(".//div[@class='alum_text']/span/text()")]
            flag = 0
            print(cited, download, r.url)
        except Exception as e:
            # On error, mark the row as failed so it can be retried later
            print(i, e)
            flag = 1
        return cited, download, flag

    def delete_history(self):
        murl = 'https://apps.webofknowledge.com/WOS_CombineSearches.do'
        s = requests.Session()
        s.post(murl, data=self.form_data2, headers=self.headers)


class MyThread(Process):
    # Despite the name, each worker is a separate process
    def __init__(self, sid, kanming, i, dic):
        Process.__init__(self)
        self.row = i
        self.sid = sid
        self.kanming = kanming
        self.dic = dic

    def run(self):
        cited, download, fl = SpiderMain(self.sid, self.kanming).craw(root_url, self.row)
        # Hand the result back to the parent through the shared dict
        self.dic[str(self.row)] = Result(download, cited, fl, self.kanming, self.row)


class Result():
    def __init__(self, download, cited, fl, kanming, row):
        super().__init__()
        self.row = row
        self.kanming = kanming
        self.fl = fl
        self.cited = cited
        self.download = download


def runn(sid, kanming, i, d):
    # Function-style worker, not used by the main block below
    cited, download, fl = SpiderMain(sid, kanming).craw(root_url, i)
    d[str(i)] = Result(download, cited, fl, kanming, i)
    print(d)


root_url = 'https://apps.webofknowledge.com/WOS_GeneralSearch.do'

if __name__ == "__main__":
    root = 'http://www.webofknowledge.com/'
    s = requests.get(root)
    # The redirected URL carries the SID
    sid = re.findall(r'SID=\w+&', s.url)[0].replace('SID=', '').replace('&', '')
    data = xlrd.open_workbook('2015年研究生发表论文.xlsx')
    table = data.sheets()[0]
    nrows = table.nrows
    threads = []
    threadnum = 5  # number of rows fetched in parallel
    d = Manager().dict()
    csv = open('2015_3.csv', 'a')
    fail = open('fail2015.txt', 'a')
    for i in range(2, nrows):
        if i % 100 == 0:
            # Refresh the SID every 100 rows
            s = requests.get(root)
            sid = re.findall(r'SID=\w+&', s.url)[0].replace('SID=', '').replace('&', '')
        kanming = table.cell(i, 5).value  # title: row i, column 6
        t = MyThread(sid, kanming, i, d)
        threads.append(t)
        if i % threadnum == 0 or i == nrows - 1:
            # Start the batch, wait for all workers, then write their results
            for t in threads:
                t.daemon = True
                t.start()
            for t in threads:
                t.join()
            for t in threads:
                rst = d[str(t.row)]
                cited, download, flag = rst.cited, rst.download, rst.fl
                # Use t.row (not i) so each result is written with its own row index
                if flag == 1:
                    fail.write(str(t.row) + '\n')
                else:
                    if len(cited) == 0:
                        cited.append(0)
                    if len(download) == 0:
                        download.append(0)
                        download.append(0)
                    csv.write(str(t.row) + ',' + str(cited[0]) + ',' + str(download[0]) + ',' + str(download[1]) + '\n')
            threads = []
    csv.close()
    fail.close()
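Version 2.0 manages its batches by hand: it collects five workers, starts them, joins them, and reads the results from a shared dict. Because the work is mostly waiting on HTTP responses, a thread pool from the standard library is one possible alternative. The sketch below is not the author's method. `fetch_counts` is a hypothetical stand-in for `SpiderMain(sid, title).craw(...)` that makes no network request, so the example can run anywhere.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_counts(title):
    """Hypothetical stand-in for the real request; returns fake numbers."""
    return {'title': title, 'cited': [str(len(title))], 'download': ['0', '0'], 'flag': 0}


def crawl_all(titles, workers=5):
    # pool.map keeps the input order, so results line up with the spreadsheet rows
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_counts, titles))


results = crawl_all(['a', 'bb', 'ccc'])
for r in results:
    print(r['title'], r['cited'][0], r['download'][0], r['download'][1])
```

The pool keeps at most `workers` requests in flight and returns ordinary Python objects, so no shared dict or manual start and join loop is needed.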