Scraping Derwent patent data (Selenium, XPath, and contains() solve the problem of the input box's ID changing dynamically)

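I. Goal

For a list of institution names, obtain the number of patent applications filed each year from 2016 to 2021 along with the total, the number of patent families, the number of granted patents, and the number of times the patents were cited.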

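II. Method

1. Use WebDriver to simulate a person operating the browser

The steps are:
(1) Define the search:

(2) Locate and total the citation counts
Locate each patent's citation count in the search results, sort the results in descending order, then add up the counts of all patents to get the total citation count.

(3) Get the number of applications per year from the filter panel:

(4) Check the granted/application status from the filter panel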
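Steps (2) and (3) come down to two small calculations, sketched below in isolation from the browser. This is only an illustration under assumptions: the label shape "2016 (1,234)" is inferred from how the script slices the filter text, and the function names count_by_year and total_citations are hypothetical.

```python
import re

# Assumed label shape: a four-digit year, then a count in parentheses,
# e.g. "2016 (1,234)". Anything else (status labels, etc.) is ignored.
LABEL_RE = re.compile(r"^(\d{4})\s*\(([\d,]+)\)$")


def count_by_year(labels, years):
    """Add up the counts in filter labels for each year of interest."""
    counts = {year: 0 for year in years}
    for label in labels:
        match = LABEL_RE.match(label.strip())
        if match and int(match.group(1)) in counts:
            counts[int(match.group(1))] += int(match.group(2).replace(",", ""))
    return counts


def total_citations(sorted_counts):
    """Sum citation counts listed in descending order.

    Once a zero appears, every later row is zero too, so stop there.
    """
    total = 0
    for count in sorted_counts:
        if count == 0:
            break
        total += count
    return total
```

Sorting in descending order first is what makes the early stop safe: no cited patent can appear after the first uncited one.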

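2. Parse the page with BeautifulSoup

One pitfall here: parse the page only after every button that needs clicking has been clicked. Otherwise, content revealed by the later clicks will be missing from the parsed result.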
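The ordering can be made explicit with a small helper that performs all clicks and reads the page source exactly once at the end. This is a sketch, not the author's code: click_then_snapshot and its pause parameter are hypothetical, and the driver is assumed to be a Selenium 4 WebDriver (the string "xpath" is the value of Selenium's By.XPATH).

```python
import time


def click_then_snapshot(driver, xpaths, pause=6.0):
    """Click every element located by `xpaths`, then return the page source."""
    for xpath in xpaths:
        # "xpath" is the locator strategy string behind selenium's By.XPATH.
        driver.find_element("xpath", xpath).click()
        # Give the page time to render whatever the click revealed.
        time.sleep(pause)
    # Read page_source only after the last click, so everything opened by
    # the clicks is part of the HTML handed to the parser.
    return driver.page_source
```

The returned string can then be passed to BeautifulSoup in one go, instead of parsing between clicks.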

III. Code implementation

import csv
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By

# The search box ID (e.g. "mat-input-61") changes between sessions,
# so match on the placeholder text instead.
SEARCH_BOX = '//textarea[contains(@placeholder,"Fanuc")]'
YEARS = ['2015', '2016', '2017', '2018', '2019', '2020']

driver = webdriver.Chrome()


def find(xpath):
    # Look the element up again on every call so a re-rendered page
    # never leaves us holding a stale reference.
    return driver.find_element(By.XPATH, xpath)


# Log in; fill in your own account details.
driver.get(url='https://derwentinnovation.clarivate.com.cn/login/')
time.sleep(2)
find('//*[@id="tr-login-username"]').click()
find('//*[@id="tr-login-username"]').clear()
find('//*[@id="tr-login-username"]').send_keys('YOUR_USERNAME')
find('//*[@id="tr-login-password"]').click()
find('//*[@id="tr-login-password"]').clear()
find('//*[@id="tr-login-password"]').send_keys('YOUR_PASSWORD')
time.sleep(2)
find('//*[@id="tr-email-form"]/div/div/div/input').click()
time.sleep(10)

# Institution names
lists = [...
]

f = open('result.csv', 'w', encoding='utf-8', newline='')
wr_f = csv.writer(f)

for li in lists:
    driver.get(url='https://derwentinnovation.clarivate.com.cn/ui/zh/#/home')
    time.sleep(6)
    find('/html/body/di-app/div[1]/div[2]/di-app-home/main/section[1]/div/di-app-smart-search/div/div/di-app-search-type/div[1]/mat-card/a').click()
    time.sleep(6)
    print(li)
    app = 0
    grant = 0

    # Enter the query
    find(SEARCH_BOX).click()
    find(SEARCH_BOX).clear()
    find(SEARCH_BOX).send_keys(li)
    time.sleep(5)

    # Click the search button
    find('//button[contains(@class,"but-xl ps-button-3 mat-flat-button mat-button-base mat-primary")]').click()
    time.sleep(10)

    # Sort the result table by clicking its 12th column header
    try:
        find('//*[@id="dataTable_scrollHead_Id"]/div/table/thead/tr/th[12]/div').click()
    except WebDriverException:
        pass
    time.sleep(16)

    # Open the filter panel and expand the year filter
    find('//*[@id="filter-sticky-bar"]/div/button').click()
    time.sleep(6)
    try:
        driver.find_elements(By.XPATH, "//mat-icon[.='arrow_right']")[6].click()
    except (IndexError, WebDriverException):
        # No filter to expand: record a row of zeros and move on
        wr_f.writerow([li] + [0] * 11)
        continue
    time.sleep(10)

    # Parse the page only now that all clicks are done
    soup = BeautifulSoup(driver.page_source, 'lxml')

    # Get the patent application counts from the radio-button labels
    content_div = soup.find_all("div", {"class": "mat-radio-label-content"})
    aa = int(content_div[0].find_all('span')[1].getText().strip().replace(',', ''))
    bb = content_div[2].find_all('span')[1].getText().strip().replace(',', '')
    if bb != '':
        bb = int(bb)
    print(aa)
    print(bb)
    if aa == 0:
        continue

    # Get the number of applications for each year, plus the
    # application/grant counts, from the checkbox labels
    content1 = soup.find_all("span", {"class": "mat-checkbox-label"})
    year_counts = {year: 0 for year in YEARS}
    for con1 in content1[-14:]:
        temp = con1.getText().strip()
        if not temp:
            continue
        if temp[0] == '2' and 6 < len(temp) <= 10:
            if temp[0:4] in year_counts:
                year_counts[temp[0:4]] += int(temp[5:-1])
        elif temp[0] == 'A':
            app = temp[5:-1]
            print(app)
        elif temp[0] == 'G':
            grant = temp[7:-1]
            print(grant)
    print(year_counts)

    # Sum the citation counts; the table is sorted in descending order,
    # so stop at the first zero
    total_cited = 0
    content2 = soup.find_all("div", {"class": "dataTables_scrollBody"})[0]
    for tr in content2.table.tbody.find_all('tr'):
        num = int(tr.find("td", {"class": "sorting_1"}).getText().strip())
        total_cited += num
        if num == 0:
            break
    print(total_cited)

    wr_f.writerow([li, aa, bb, *year_counts.values(), app, grant, total_cited])

f.close()
driver.quit()
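The script above paces itself with fixed time.sleep() calls, which waste time when the page is fast and can still be too short when it is slow. As a hedged alternative, here is a minimal polling sketch; wait_until and its parameters are hypothetical names, not part of Selenium (Selenium ships its own WebDriverWait for the same purpose).

```python
import time


def wait_until(predicate, timeout=10.0, poll=0.5):
    """Call `predicate` repeatedly until it returns something truthy.

    Returns that truthy value, or raises TimeoutError once `timeout`
    seconds have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)
```

For example, wait_until(lambda: driver.find_elements("xpath", '//textarea[contains(@placeholder,"Fanuc")]')) returns as soon as the search box exists, because find_elements returns an empty (falsy) list until then.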

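IV. Lessons learned

1. When the ID of an input, a button, or any other element keeps changing, locate the element with XPath's contains() function on a stable attribute instead. Tested, effective, and easy to use!
driver.find_element(By.XPATH, '//textarea[contains(@placeholder,"Fanuc")]').click()
2. When parsing the page with BeautifulSoup, finish all the clicking first, so that panels that start out hidden ("style: display=none") are expanded and their content is in the page source. Only then can the complete content be parsed.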
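The locator from lesson 1 follows a fixed pattern, so it can be generated for any tag and attribute. The helper below is a sketch; xpath_contains is a hypothetical name, and it assumes the fragment does not contain both kinds of quote at once (XPath 1.0 string literals have no escape syntax).

```python
def xpath_contains(tag, attr, fragment):
    """Build an XPath matching <tag> elements whose attribute contains `fragment`."""
    # XPath 1.0 literals cannot escape quotes, so pick a quote character
    # that does not occur in the fragment.
    quote = "'" if '"' in fragment else '"'
    if quote in fragment:
        raise ValueError("fragment contains both single and double quotes")
    return "//{}[contains(@{},{}{}{})]".format(tag, attr, quote, fragment, quote)


# Reproduces the locator used for the search box in this post.
search_box = xpath_contains("textarea", "placeholder", "Fanuc")
```

The resulting string can be passed to driver.find_element(By.XPATH, search_box) in place of an ID-based locator.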
