Scraping the problem sets of three ACM sites (neuqoj, pku, hdu)
Environment: macOS + Python 3.9 (on Windows, only the directory paths need to change)
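Since the only platform-specific part is the output directory, one way to avoid editing paths by hand is to build them with pathlib. The following is a minimal sketch, not the code used in this post: the helper name problem_dir and its parameters are hypothetical, and it assumes the same "replace / with 比" rule that the scripts apply to titles.

```python
from pathlib import Path


def problem_dir(base: Path, site: str, pid: int, title: str) -> Path:
    """Create and return the folder for one problem, e.g. base/neuqacm/1002Title.

    Hypothetical helper: the scripts in this post hard-code
    '/Users/cyh/Desktop/acm/<site>/' instead.
    """
    # "/" and "\" cannot appear inside a single path component.
    safe_title = title.strip().replace("/", "比").replace("\\", "比")
    folder = base / site / f"{pid}{safe_title}"
    folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return folder
```

Calling it with a base such as Path.home() / "Desktop" / "acm" should give the same layout on macOS and Windows.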
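Code:
The scripts are single-threaded. To cover more problems, change the range() bounds as needed, or run several copies of a script at the same time with different ranges.
1. neuqoj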
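If running several files side by side becomes tedious, a small thread pool is one alternative. This is only a sketch under assumptions: crawl_range and fetch_one are hypothetical names, and fetch_one stands in for the body of one loop iteration (download, parse, write to disk) from the scripts.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl_range(ids, fetch_one, max_workers=4):
    """Run fetch_one(problem_id) for every id on a small thread pool.

    fetch_one is a hypothetical callable, not part of the original scripts.
    Returns (succeeded, failed): two sorted lists of problem ids.
    """
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_one, pid): pid for pid in ids}
        for future in as_completed(futures):
            pid = futures[future]
            try:
                future.result()  # re-raises any exception from the worker
                succeeded.append(pid)
            except Exception:
                failed.append(pid)
    return sorted(succeeded), sorted(failed)
```

A low max_workers value keeps the load on the judge servers modest; the failed list can be fed back in for a second pass.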
import os

import requests

BASE_DIR = '/Users/cyh/Desktop/acm/neuqacm/'


def write_in_file(f, string):
    # Append one block of text to <BASE_DIR>/<f>/<f>.txt
    with open(BASE_DIR + f + '/' + f + ".txt", "a+", encoding='utf-8') as fi:
        fi.write(string)
        fi.write("\n")


link = "http://140.143.222.61:8088/problem/"
link2 = "http://newoj.acmclub.cn/problems/"  # defined in the original script but not used below
headers = {
    'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1',
    'accept-language': 'zh-CN,zh',
}

# (old, new) pairs applied in order to strip HTML from the description
HTML_REPLACEMENTS = [
    ('<div align="left"><span style="font-size: medium">', ' '),
    ('<font color="#000000">', ''),
    ('<span style="font-size: medium">', ''),
    ('<span style="font-size: small">', ''),
    ('<span>', ' '),
    ('</span>', ' '),
    ('<p><style type="text/css">p { margin-bottom: 0.21cm; }</style>', ' '),
    ('<p style="margin-bottom: 0cm;"><font color="#000000">', ' '),
    ('<p>', ' '),
    ('</p>', ' '),
    ('<font>', ' '),
    ('</font>', ' '),
    ('<br />', ' '),
    ('<style type="text/css">p { margin-bottom: 0.21cm; }</style>', ' '),
    (' ', ''),
]

for i in range(1002, 1003):
    try:
        print("start", i)
        r = requests.get(link + str(i), headers=headers, timeout=100)
        data = r.json()['data']
        # "/" is not allowed in a folder name, so it is replaced with "比"
        problem_title = data['title'].replace("/", "比")
        name = str(i) + problem_title
        os.makedirs(BASE_DIR + name, exist_ok=True)
        write_in_file(name, "question: " + problem_title + "\n")
        problem_des = [data['difficulty'], data['input'], data['output'],
                       data['sample_input'], data['sample_output']]
        the_title = ['Difficulty', 'Input', 'Output', 'Sample Input', 'Sample Output']
        print("writing " + str(i) + " file")
        description = data['description']
        for old, new in HTML_REPLACEMENTS:
            description = description.replace(old, new)
        print(description)
        write_in_file(name, 'Description' + ":\n" + description + "\n")
        for label, value in zip(the_title, problem_des):
            write_in_file(name, label + ":\n" + str(value) + "\n")
        print("done")
    except Exception:
        # Any failure (network error, missing key, ...) skips this problem
        print("skipped")
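The neuqoj script strips markup with a fixed chain of replace() calls, so any tag or attribute variant that is not listed stays in the output. A more general option is to let an HTML parser drop every tag. The sketch below uses only the standard library; strip_html and _TextExtractor are hypothetical names, not part of the original script, and the whitespace handling (collapsing runs into single spaces) is an assumption about the desired output.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring tags and the contents of <style>/<script>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join text nodes with spaces, then collapse whitespace runs
        # (including non-breaking spaces from &nbsp;) into single spaces.
        return " ".join(" ".join(self._chunks).split())


def strip_html(fragment: str) -> str:
    """Return the visible text of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any buffered text
    return parser.text()
```

In the script, this would replace the whole replace() chain with a single call on j['data']['description'].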
2. hduacm
import os

import requests
from bs4 import BeautifulSoup

BASE_DIR = '/Users/cyh/Desktop/acm/hduacm/'


def write_in_file(f, string):
    # Append one block of text to <BASE_DIR>/<f>/<f>.txt
    with open(BASE_DIR + f + '/' + f + ".txt", "a+", encoding='utf-8') as fi:
        fi.write(string)


link = "http://acm.hdu.edu.cn/showproblem.php?pid="
headers = {
    'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1',
}

for i in range(6937, 6939):
    print("start", i)
    r = requests.get(link + str(i), headers=headers, timeout=100)
    print("OK")
    soup = BeautifulSoup(r.text, "lxml")
    # The problem title is the page's <h1>; "/" is replaced with "比"
    # because it is not allowed in a folder name
    problem_title = soup.find("h1").text.replace("/", "比")
    name = str(i) + problem_title
    os.makedirs(BASE_DIR + name, exist_ok=True)
    write_in_file(name, "question: " + problem_title + "\n")
    # Section bodies and section headings, matched by position
    problem_des = soup.find_all("div", class_="panel_content")
    the_title = soup.find_all("div", class_="panel_title")
    print("writing " + str(i) + " file")
    for m in range(len(the_title)):
        write_in_file(name, the_title[m].text + ": " + problem_des[m].text + "\n")
    print("done")
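The hduacm script matches each panel_title with the panel_content at the same index, which raises an IndexError if a page has fewer bodies than headings. A small sketch of the same pairing with an explicit length check follows. The function name format_sections is hypothetical, and it takes plain strings (in the script these would be the .text of each element).

```python
def format_sections(titles, bodies):
    """Pair each section title with its body and render them as text.

    Hypothetical helper: titles and bodies are lists of plain strings.
    """
    if len(titles) != len(bodies):
        # zip() would silently drop the extras, so report the mismatch instead.
        raise ValueError(f"{len(titles)} titles but {len(bodies)} bodies")
    return "".join(
        f"{title.strip()}:\n{body.strip()}\n\n"
        for title, body in zip(titles, bodies)
    )
```

The returned string can then be written to the problem file in one call.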
3. pkuacm
import os

import requests
from lxml import etree

BASE_DIR = '/Users/cyh/Desktop/acm/pkuacm/'


def write_in_file(f, string):
    # Append the HTML to <BASE_DIR>/<f>/<f>.html
    with open(BASE_DIR + f + '/' + f + ".html", "a+", encoding='utf-8') as fi:
        fi.write(string)


link = "http://poj.org/problem?id="
headers = {
    'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1',
}
count = [0, 0]  # [saved, skipped]

for i in range(2577, 3000):
    try:
        print("start", i)
        t = '&lang=zh-CN&change=true'
        r = requests.get(link + str(i) + t, headers=headers, timeout=100)
        html_bytes = r.content
        print("OK")
        c = etree.HTML(html_bytes, parser=etree.HTMLParser())
        # The second table holds the whole problem; the title is
        # <div class="ptt" lang="zh-CN"> at /html/body/table[2]/tr/td/div[2]
        d = c.xpath("/html/body/table[2]")
        e = c.xpath('/html/body/table[2]/tr/td/div[2]')
        problem_title = (etree.tostring(e[0], encoding='utf-8').decode('utf-8')
                         .replace("</div>\n", '')
                         .replace('<div class="ptt" lang="zh-CN">', ''))
        print(problem_title)
        content = etree.tostring(d[0], encoding='utf-8').decode('utf-8')
        # "/" is not allowed in a folder name, so it is replaced with "比"
        problem_title = problem_title.replace("/", "比").strip('\n')
        name = str(i) + problem_title
        os.makedirs(BASE_DIR + name, exist_ok=True)
        write_in_file(name, content)
        count[0] += 1
    except Exception:
        count[1] += 1
        print("skipped:", count[1])
print("finished", count[0])
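The pkuacm loop only counts how many problems were skipped, so it is hard to tell afterwards which ids failed or why. One possible refinement, sketched below under assumptions, records the failing ids with a short reason. The names crawl_sequential and handle_one are hypothetical; handle_one stands in for the request, parse and save steps of one iteration.

```python
def crawl_sequential(ids, handle_one):
    """Call handle_one(pid) for each id; return (done_count, failures).

    handle_one is a hypothetical callable, not part of the original script.
    failures maps each failed id to a short error description.
    """
    done = 0
    failures = {}
    for pid in ids:
        try:
            handle_one(pid)
            done += 1
        except Exception as exc:
            # Unlike a bare "except:", this does not swallow KeyboardInterrupt,
            # so Ctrl-C still stops the run.
            failures[pid] = f"{type(exc).__name__}: {exc}"
    return done, failures
```

The keys of failures can be passed back in as ids to retry only the problems that were skipped.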