Python crawler implementation

This article uses Python 3 to fetch search results from Google Scholar.

Simulating a browser request

Web access follows a request-response model: the client sends a request, and the server responds to it.
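To make the request-response model concrete, here is a minimal sketch that runs entirely on the local machine. The EchoHandler class and the round_trip helper are hypothetical names made up for this illustration, and the throwaway server only stands in for a real site such as Google Scholar.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical server side: answers every GET with the requested path."""

    def do_GET(self):
        body = ('you asked for ' + self.path).encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def round_trip(path):
    """Start a local server, send one GET request, return (status, text)."""
    server = HTTPServer(('127.0.0.1', 0), EchoHandler)  # port 0: any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        # Client side: an opener without proxies, so the request stays local.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        url = 'http://127.0.0.1:%d%s' % (server.server_port, path)
        with opener.open(url, timeout=10) as response:
            return response.status, response.read().decode('utf-8')
    finally:
        server.shutdown()
        server.server_close()


status, text = round_trip('/scholar?q=python')
print(status, text)
```

The client only ever sees what the server chooses to send back, which is why the crawler has to make its request look like one the server is willing to answer.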

Using Chrome to find out how the request is made

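In the F12 developer tools, inspect the request headers: the search is sent with the GET method. Copying the request as a URL gives the content of the request.
To imitate the browser, send the same headers along with the request.
The cookies can be removed from the headers; testing shows that this does not affect the result.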
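As a small illustration of the point about headers, the sketch below builds a request object with browser-like headers and no cookies, then inspects it without sending anything. The build_request helper is a hypothetical name, and the header values are only examples of what the developer tools might show.

```python
import urllib.request


def build_request(url):
    """Return a GET request that carries browser-like headers and no cookies."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:47.0) '
                      'Gecko/20100101 Firefox/47.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
    }
    return urllib.request.Request(url=url, headers=headers)


req = build_request('https://scholar.google.com/scholar?hl=en&q=python')
print(req.get_method())              # a Request without a body is sent as GET
print(req.get_header('User-agent'))  # urllib stores header names capitalized
print(req.has_header('Cookie'))
```

Nothing is sent over the network here; the request only goes out once it is passed to urllib.request.urlopen.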

Implementation in Python

Use the modules in urllib (urllib.request and urllib.parse).
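One thing urllib.parse is handy for is building the search URL safely, so that spaces and other special characters in the keyword are escaped. The following is a sketch under assumptions: build_search_url is a hypothetical helper, and it keeps only the start, hl and q parameters that appear in the URLs used in this article.

```python
import urllib.parse


def build_search_url(keyword, start=0):
    """Build a Google Scholar search URL for the given keyword and offset."""
    params = {'start': start, 'hl': 'en', 'q': keyword}
    # urlencode escapes each value; a space becomes '+'.
    return 'https://scholar.google.com/scholar?' + urllib.parse.urlencode(params)


url = build_search_url('deep learning', start=10)
print(url)
```

The resulting string can then be passed to urllib.request.Request together with the headers.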

Data analysis

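Regular expressions are used for this step.
Analyze the returned HTML and extract each field by matching it with a regular expression.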
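The matching step can be tried offline on a small piece of HTML. In the sketch below, SAMPLE is made-up markup that only imitates the gs_ri, gs_a and gs_rs class names used by the patterns in this article; the real page may be structured differently. The parse_results helper, the tag-stripping pattern and the re.S flag are assumptions added for this illustration.

```python
import re

# One result block, then the fields inside it. re.S lets '.' span newlines.
RESULT = re.compile(r'<div class="gs_ri">.*?</div></div></div>', re.S)
LINK = re.compile(r'<a href="(.*?)"')
AUTHOR = re.compile(r'<div class="gs_a">(.*?)</div>', re.S)
ABSTRACT = re.compile(r'<div class="gs_rs">(.*?)</div>', re.S)
TAG = re.compile(r'<[^>]+>')  # any HTML tag

SAMPLE = (
    '<div class="gs_r"><div class="gs_ri">'
    '<h3 class="gs_rt"><a href="https://example.org/paper1">First paper</a></h3>'
    '<div class="gs_a"><a href="/citations?user=x">A Author</a>, '
    'B Writer - Some Journal, 2016</div>'
    '<div class="gs_rs">A <b>short</b> abstract.</div>'
    '<div class="gs_fl">Cited by 3</div></div></div>'
    '<div class="gs_r"><div class="gs_ri">'
    '<h3 class="gs_rt"><a href="https://example.org/paper2">Second paper</a></h3>'
    '<div class="gs_a">C Scholar - Some Conference, 2015</div>'
    '<div class="gs_fl">Cited by 1</div></div></div>'
)


def clean(fragment):
    """Replace tags with spaces and collapse runs of whitespace."""
    return ' '.join(TAG.sub(' ', fragment).split())


def parse_results(html):
    """Return one dict per result; a missing field is stored as None."""
    results = []
    for block in RESULT.findall(html):
        link = LINK.search(block)
        author = AUTHOR.search(block)
        abstract = ABSTRACT.search(block)
        results.append({
            'url': link.group(1) if link else None,
            'author': clean(author.group(1)) if author else None,
            'abstract': clean(abstract.group(1)) if abstract else None,
        })
    return results


results = parse_results(SAMPLE)
for item in results:
    print(item)
```

Capture groups return the text between the delimiters directly, so no fixed slice offsets are needed. Regular expressions are fragile against markup changes, so the patterns may need adjusting whenever the page layout changes.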

Code

import re
import urllib.parse
import urllib.request

# Request headers copied from the browser's developer tools (cookies removed).
header_dict = {
    'Host': 'scholar.google.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
    'Referer': 'https://scholar.google.com/schhp?hl=zh-CN',
    'Connection': 'keep-alive',
}

# One search result, and the fields inside it.
pattern = re.compile(r'<div class="gs_ri">.*?</div></div></div>')
address = re.compile(r'<a href="(.*?)"')
author = re.compile(r'<div class="gs_a">(.*?)</div>')
abstract = re.compile(r'<div class="gs_rs">(.*?)</div>')
a1 = re.compile(r'</?a\b[^>]*>')  # opening and closing <a> tags


def fetch_page(keyword, start):
    """Request one page of results; start is the index of the first result."""
    url = ('https://scholar.google.com/scholar?start=' + str(start)
           + '&hl=en&q=' + urllib.parse.quote_plus(keyword) + '&btnG=&lr=')
    req = urllib.request.Request(url=url, headers=header_dict)
    with urllib.request.urlopen(req, timeout=120) as response:
        return response.read().decode('utf-8')


def print_results(data):
    """Print the URL, author line and abstract of every result in data."""
    m = re.findall(pattern, data)
    print("data get")
    print(len(m))
    for s in m:
        # re.search returns None when nothing matches.
        net = re.search(address, s)
        print("url:")
        print(net.group(1) if net else "no url")
        net = re.search(author, s)
        print("author:")
        print(re.sub(a1, '', net.group(1)) if net else "no author")
        net = re.search(abstract, s)
        if net:
            print("abstract:")
            temp = net.group(1)
            print(temp.replace("<b>", " ").replace("<br>", " ").replace("</b>", " "))
        else:
            print("no abstract")
        print('')


keyword = input("keywords is?\n")
print(keyword)
# Fetch the first three pages; the start offset advances by 10 per page.
for start in range(0, 30, 10):
    print_results(fetch_page(keyword, start))
