Automatically Downloading the OpenAI Key Papers with a Python Crawler

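Spinning Up is OpenAI's open-source introduction to deep reinforcement learning for beginners. It lists 105 classic papers in the field of deep reinforcement learning; see Spinning Up: https://spinningup.openai.com/en/latest/spinningup/keypapers.html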

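I used a Python crawler to download all of these papers automatically. The downloaded files are also sorted into folders that follow the categories on the web page.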
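Sorting by category means that each section heading on the page becomes a directory name, so characters that are not valid in file names have to be replaced first. The following is a minimal sketch of that step, not the author's code: the helper `category_dir` and the example heading are assumptions made for illustration, and unlike the script below it replaces every character that Windows rejects rather than only `:` and `?`.

```python
import os
import re
import tempfile

# Characters that Windows does not allow in file or directory names.
_INVALID_CHARS = re.compile(r'[<>:"/\\|?*]')


def category_dir(root_dir, heading):
    """Hypothetical helper: map a section heading to a directory under root_dir."""
    safe_name = _INVALID_CHARS.sub('_', heading).strip()
    path = os.path.join(root_dir, safe_name)
    os.makedirs(path, exist_ok=True)  # no error if the directory already exists
    return path


# The heading below is illustrative; real headings come from the parsed page.
demo_root = tempfile.mkdtemp()
demo_path = category_dir(demo_root, "1. Model-Free RL: Deep Q-Learning")
print(os.path.basename(demo_path))
```

Using `os.makedirs(..., exist_ok=True)` avoids the separate `os.path.exists` check before each directory is created.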

See the download resource: Spinning Up Key Papers

The source code is as follows:

'''Automatically download all the key papers recommended by OpenAI Spinning Up.
See more info on: https://spinningup.openai.com/en/latest/spinningup/keypapers.html

Dependency:
    bs4, lxml
'''
import os
import time
import urllib.request as url_re

import requests as rq
from bs4 import BeautifulSoup as bf

# Browser-like User-Agent sent with the request for the Spinning Up page
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
}

spinningup_url = 'https://spinningup.openai.com/en/latest/spinningup/keypapers.html'

paper_id = 1  # running index used as a prefix in the saved file names


def download_pdf(pdf_url, pdf_path):
    """Automatically download PDF file from Internet

    Args:
        pdf_url (str): url of the PDF file to be downloaded
        pdf_path (str): save path of the downloaded PDF file
    """
    if os.path.exists(pdf_path):  # skip files that were already downloaded
        return
    try:
        with url_re.urlopen(pdf_url) as url:
            pdf_data = url.read()
            with open(pdf_path, "wb") as f:
                f.write(pdf_data)
    except Exception:  # fix link at [102]; any failed download falls back to this URL
        pdf_url = r"https://is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/Neural-Netw-2008-21-682_4867%5b0%5d.pdf"
        with url_re.urlopen(pdf_url) as url:
            pdf_data = url.read()
            with open(pdf_path, "wb") as f:
                f.write(pdf_data)
    time.sleep(10)  # sleep 10 seconds to download next


def download_from_bs4(papers, category_path):
    """Download papers from Spinning Up

    Args:
        papers (bs4.element.ResultSet): 'a' tags with paper link
        category_path (str): root dir of the paper to be downloaded
    """
    global paper_id
    print("Start to download papers from category {}...".format(category_path))
    for paper in papers:
        paper_link = paper['href']
        if not paper_link.endswith('.pdf'):
            if paper_link[8:13] == 'arxiv':  # arxiv link
                # paper_link = "https://arxiv.org/abs/1811.02553"
                paper_link = paper_link[:18] + 'pdf' + paper_link[21:] + '.pdf'
            elif paper_link[8:18] == 'openreview':  # openreview link
                # paper_link = "https://openreview.net/forum?id=ByG_3s09KX"
                paper_link = paper_link[:23] + 'pdf' + paper_link[28:]
            elif paper_link[14:18] == 'nips':  # neurips link
                paper_link = "https://proceedings.neurips.cc/paper/2017/file/a1d7311f2a312426d710e1c617fcbc8c-Paper.pdf"
            else:  # no known way to get a PDF from this link
                continue
        paper_name = '[{}] '.format(paper_id) + paper.string + '.pdf'
        # remove characters that are not allowed in file names
        if ':' in paper_name:
            paper_name = paper_name.replace(':', '_')
        if '?' in paper_name:
            paper_name = paper_name.replace('?', '')
        paper_path = os.path.join(category_path, paper_name)
        download_pdf(paper_link, paper_path)
        print("Successfully downloaded {}!".format(paper_name))
        paper_id += 1
    print("Successfully downloaded all the papers from category {}!".format(category_path))


def _save_html(html_url, html_path):
    """Save requested HTML files

    Args:
        html_url (str): url of the HTML page to be saved
        html_path (str): save path of HTML file
    """
    html_file = rq.get(html_url, headers=headers)
    with open(html_path, "w", encoding='utf-8') as h:
        h.write(html_file.text)


def download_key_papers(root_dir):
    """Download all the key papers, consistent with the categories listed on the website

    Args:
        root_dir (str): save path of all the downloaded papers
    """
    # 1. Get the html of Spinning Up
    spinningup_html = rq.get(spinningup_url, headers=headers)

    # 2. Parse the html and get the main category ids
    soup = bf(spinningup_html.content, 'lxml')

    # _save_html(spinningup_url, 'spinningup.html')
    # spinningup_file = open('spinningup.html', 'r', encoding="UTF-8")
    # spinningup_handle = spinningup_file.read()
    # soup = bf(spinningup_handle, features='lxml')

    category_ids = []
    categories = soup.find(name='div', attrs={'class': 'section', 'id': 'key-papers-in-deep-rl'}).\
        find_all(name='div', attrs={'class': 'section'}, recursive=False)
    for category in categories:
        category_ids.append(category['id'])

    # 3. Get all the categories and make corresponding dirs
    category_dirs = []
    if not os.path.exists(root_dir):
        os.makedirs(root_dir)
    for category in soup.find_all(name='h2'):
        category_name = list(category.children)[0].string
        if ':' in category_name:  # replace ':' with '_' to get valid dir name
            category_name = category_name.replace(':', '_')
        category_path = os.path.join(root_dir, category_name)
        category_dirs.append(category_path)
        if not os.path.exists(category_path):
            os.makedirs(category_path)

    # 4. Start to download all the papers
    print("Start to download key papers...")
    for i in range(len(category_ids)):
        category_path = category_dirs[i]
        category_id = category_ids[i]
        content = soup.find(name='div', attrs={'class': 'section', 'id': category_id})
        inner_categories = content.find_all('div')
        if inner_categories != []:
            # the category has sub-categories: one sub-directory for each
            for category in inner_categories:
                category_id = category['id']
                inner_category = category.h3.text[:-1]
                inner_category_path = os.path.join(category_path, inner_category)
                if not os.path.exists(inner_category_path):
                    os.makedirs(inner_category_path)
                content = soup.find(name='div', attrs={'class': 'section', 'id': category_id})
                papers = content.find_all(name='a', attrs={'class': 'reference external'})
                download_from_bs4(papers, inner_category_path)
        else:
            papers = content.find_all(name='a', attrs={'class': 'reference external'})
            download_from_bs4(papers, category_path)
    print("Download Complete!")


if __name__ == "__main__":
    root_dir = "key-papers"
    download_key_papers(root_dir)
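The link rewriting in `download_from_bs4` relies on fixed character positions, so it only works while each URL has exactly the expected scheme and host prefix. As an alternative, here is a rough sketch of the same arXiv and OpenReview rewrites built on `urllib.parse`. The helper name `to_pdf_url` is hypothetical and not part of the original script, and the special-cased NeurIPS link is left out.

```python
from urllib.parse import urlparse, urlunparse


def to_pdf_url(link):
    """Hypothetical helper: turn a landing-page link into a direct PDF link.

    Returns None when no rewrite rule is known for the host.
    """
    if link.endswith('.pdf'):
        return link
    parts = urlparse(link)
    host = parts.netloc.lower()
    if host.endswith('arxiv.org') and parts.path.startswith('/abs/'):
        # https://arxiv.org/abs/<id> -> https://arxiv.org/pdf/<id>.pdf
        new_path = '/pdf/' + parts.path[len('/abs/'):] + '.pdf'
        return urlunparse(parts._replace(path=new_path))
    if host.endswith('openreview.net') and parts.path == '/forum':
        # https://openreview.net/forum?id=<id> -> https://openreview.net/pdf?id=<id>
        return urlunparse(parts._replace(path='/pdf'))
    return None


print(to_pdf_url("https://arxiv.org/abs/1811.02553"))
print(to_pdf_url("https://openreview.net/forum?id=ByG_3s09KX"))
```

Because the rules match on the parsed host and path, the same function handles both `http://` and `https://` links, and a caller can skip a paper whenever it returns `None`.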
