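[Python] Crawling CVPR Papers from the Web

Motivation

Use Python to download CVPR papers automatically.

Workflow

1. Fetch the page content
2. Find all paper links
3. Download the papers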

1. Fetch the page content

Module used: requests

Key function: requests.get

Output: web_context (the page's HTML as text)

Reference:
http://blog.csdn.net/fly_yr/article/details/51525435

import requests

#get web context
def get_context(url):
    """
    params:
        url: link
    return:
        web_context
    """
    web_context = requests.get(url)
    return web_context.text
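requests.get, as used above, has no default timeout and does not raise on HTTP error statuses unless raise_for_status() is called. A minimal sketch of the same step using only the standard library follows, assuming Python 3. Here urllib.request.urlopen is swapped in for requests.get, and the timeout parameter and the UTF-8 fallback are assumptions of this sketch, not part of the original script.

```python
from urllib.request import urlopen


def get_context(url, timeout=10):
    """Return the body of `url` as text.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses,
    so a failed request does not silently return an error page.
    """
    with urlopen(url, timeout=timeout) as response:
        # Use the charset declared by the server; fall back to UTF-8 (assumption).
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

It would be called the same way as the original function, for example get_context('http://openaccess.thecvf.com//CVPR2016.py').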

2. Find the paper links

Module used: re

Key function: re.findall()

Output: a list of download links for the CVPR papers

PDF links on the page have the form:
href="content_cvpr_2016/papers/Hendricks_Deep_Compositional_Captioning_CVPR_2016_paper.pdf">pdf

Use a regular expression to find every link that matches this form.

References:
https://www.cnblogs.com/MrFiona/p/5954084.html
http://blog.csdn.net/u014467169/article/details/51345657

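import re

#find paper files
'''
(?<=href=\"): lookbehind; the match must be preceded by href="
.+?: one or more characters (except newline), matched lazily, i.e. as few as possible
pdf: the literal text pdf
(?=\">pdf): lookahead; the match must be followed by ">pdf
|: alternation (or); the second branch accepts href=' as the opening
'''
#link pattern: href="***_CVPR_2016_paper.pdf">pdf
link_list = re.findall(r"(?<=href=\").+?pdf(?=\">pdf)|(?<=href=\').+?pdf(?=\">pdf)",web_context)
#name pattern: <a href="***_CVPR_2016_paper.html">***</a>
#the dot is escaped and the title is matched lazily so it stops at the first </a>
name_list = re.findall(r"(?<=2016_paper\.html\">).+?(?=</a>)",web_context)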
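To see why the lazy quantifier `.+?` matters in the link pattern, the sketch below runs a lazy and a greedy version on a small hand-written fragment. The fragment and the variable names are made up for illustration and only imitate the listing page. The code assumes Python 3.

```python
import re

# Hypothetical one-line fragment with two PDF links (not copied from the real page).
html = ('[<a href="papers/A_CVPR_2016_paper.pdf">pdf</a>] '
        '[<a href="papers/B_CVPR_2016_paper.pdf">pdf</a>]')

# Lazy: each match stops at the first "pdf" that is followed by ">pdf.
lazy = re.findall(r'(?<=href=").+?pdf(?=">pdf)', html)

# Greedy: the match runs on to the last "pdf" on the line that is followed by ">pdf,
# so both links are swallowed into a single match.
greedy = re.findall(r'(?<=href=").+pdf(?=">pdf)', html)

print(lazy)    # two separate relative links
print(greedy)  # one over-long match spanning both links
```

Because `.` does not match a newline, a match can never cross a line. If the `.html` title link and the `.pdf` link of one paper share a line, even the lazy pattern would start at the first `href="` and run into the second link.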

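3. Download the papers

Steps:

Clean up the paper links and titles
Download with urllib

Modules used: os, urllib

Key functions: os.path.exists(), re.sub(), urllib.urlretrieve() (Python 2; in Python 3 this function is urllib.request.urlretrieve())

References:
https://zhidao.baidu.com/question/369467791671548644.html
https://zhidao.baidu.com/question/1830964875242219220.html
https://www.cnblogs.com/jiu0821/p/6275685.html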

#download (Python 2 syntax: print statement, urllib.urlretrieve)
# create local filefolder
local_dir = 'E:\\CVPR16\\'
if not os.path.exists(local_dir):
    os.makedirs(local_dir)

cnt = 0
while cnt < len(link_list):
    file_name = name_list[cnt]
    download_url = link_list[cnt]
    # replace punctuation and spaces with '_' so the title can be used as a file name
    file_name = re.sub(r'[:\?/]+',"_",file_name).replace(' ','_')
    file_path = local_dir + file_name + '.pdf'
    #download
    print '['+str(cnt)+'/'+str(len(link_list))+'] Downloading ' + file_path
    try:
        urllib.urlretrieve("http://openaccess.thecvf.com/" + download_url, file_path)
    except Exception as e:
        print 'Download failed: ' + file_path + ' (' + str(e) + ')'
    cnt += 1
print 'Finished'
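The download step can also be sketched for Python 3, where urlretrieve lives in urllib.request. The helper names safe_file_name, build_tasks and download_all are hypothetical and not part of the original script. The sketch replaces a wider set of characters than the original (everything Windows forbids in file names, plus whitespace), pairs links and titles with zip() so a length mismatch cannot raise an IndexError, and skips files that already exist. All three are assumptions about what is useful, not the author's method.

```python
import os
import re
from urllib.parse import urljoin
from urllib.request import urlretrieve

BASE_URL = "http://openaccess.thecvf.com/"  # host used in the original script


def safe_file_name(title):
    """Replace runs of characters Windows forbids in file names, and whitespace, with '_'."""
    return re.sub(r'[\\/:*?"<>|\s]+', "_", title)


def build_tasks(link_list, name_list, local_dir):
    """Pair each relative link with its title; zip() stops at the shorter list."""
    return [
        (urljoin(BASE_URL, link),
         os.path.join(local_dir, safe_file_name(name) + ".pdf"))
        for link, name in zip(link_list, name_list)
    ]


def download_all(tasks):
    """Download every (url, path) pair, skipping files that are already present."""
    for i, (url, path) in enumerate(tasks, start=1):
        if os.path.exists(path):
            continue
        print("[%d/%d] Downloading %s" % (i, len(tasks), path))
        try:
            urlretrieve(url, path)
        except OSError as err:  # URLError and HTTPError are subclasses of OSError
            print("Download failed: %s (%s)" % (path, err))
```

With link_list and name_list from step 2, a run would look like: create the target folder with os.makedirs(local_dir, exist_ok=True), then call download_all(build_tasks(link_list, name_list, local_dir)).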

Complete code:

# -*- coding: utf-8 -*-
"""
Hand-written code, step one: 2018/3/7
Function: crawl CVPR papers from the web
@author: vincent
"""
#package used (the script uses Python 2 syntax)
import os
import re
import urllib
import requests

#get web context
def get_context(url):
    """
    params:
        url: link
    return:
        web_context
    """
    web_context = requests.get(url)
    return web_context.text

url = 'http://openaccess.thecvf.com//CVPR2016.py'
web_context = get_context(url)

#find paper files
'''
(?<=href=\"): lookbehind; the match must be preceded by href="
.+?: one or more characters (except newline), matched lazily, i.e. as few as possible
pdf: the literal text pdf
(?=\">pdf): lookahead; the match must be followed by ">pdf
|: alternation (or); the second branch accepts href=' as the opening
'''
#link pattern: href="***_CVPR_2016_paper.pdf">pdf
link_list = re.findall(r"(?<=href=\").+?pdf(?=\">pdf)|(?<=href=\').+?pdf(?=\">pdf)",web_context)
#name pattern: <a href="***_CVPR_2016_paper.html">***</a>
#the dot is escaped and the title is matched lazily so it stops at the first </a>
name_list = re.findall(r"(?<=2016_paper\.html\">).+?(?=</a>)",web_context)

#download
# create local filefolder
local_dir = 'E:\\CVPR16\\'
if not os.path.exists(local_dir):
    os.makedirs(local_dir)

cnt = 0
while cnt < len(link_list):
    file_name = name_list[cnt]
    download_url = link_list[cnt]
    # replace punctuation and spaces with '_' so the title can be used as a file name
    file_name = re.sub(r'[:\?/]+',"_",file_name).replace(' ','_')
    file_path = local_dir + file_name + '.pdf'
    #download
    print '['+str(cnt)+'/'+str(len(link_list))+'] Downloading ' + file_path
    try:
        urllib.urlretrieve("http://openaccess.thecvf.com/" + download_url, file_path)
    except Exception as e:
        print 'Download failed: ' + file_path + ' (' + str(e) + ')'
    cnt += 1
print 'Finished'
