Scraping the problem sets of three ACM sites (neuqoj, pku, hdu)

Environment: macOS + Python 3.9 (on Windows, only the directory paths need to change)

Result (screenshot):

Code:

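The scripts are not multithreaded. Adjust the range() arguments as needed, or run several copies of a script at the same time on different ranges.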
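As a rough sketch of what a threaded version could look like, assuming the body of one loop iteration is wrapped in a function, the standard-library thread pool can work through several problem IDs at once. The names `fetch_one` and `crawl_range` are hypothetical and not part of the original scripts; `fetch_one` is only a stand-in here so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_one(problem_id):
    """Hypothetical stand-in for the body of one loop iteration.

    A real version would download and save one problem; this one only
    returns a label so the sketch stays self-contained.
    """
    return f"problem {problem_id} saved"


def crawl_range(start, stop, worker=fetch_one, max_workers=4):
    """Run worker(problem_id) for every id in range(start, stop) on a thread pool.

    Returns two dicts: results keyed by id, and exceptions keyed by id.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, pid): pid for pid in range(start, stop)}
        for future in as_completed(futures):
            pid = futures[future]
            try:
                results[pid] = future.result()
            except Exception as exc:  # one failed problem should not stop the rest
                errors[pid] = exc
    return results, errors


if __name__ == "__main__":
    done, failed = crawl_range(1000, 1010)
    print(len(done), "saved,", len(failed), "failed")
```

A small `max_workers` value is probably the safer choice, since these are small university judge servers.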

1.neuqoj

import os

import requests

BASE_DIR = '/Users/cyh/Desktop/acm/neuqacm/'


def write_in_file(f, string):
    # Output function: append one line to <BASE_DIR>/<f>/<f>.txt
    with open(BASE_DIR + f + '/' + f + ".txt", "a+", encoding='utf-8') as fi:
        fi.write(string)
        fi.write("\n")


link = "http://140.143.222.61:8088/problem/"
link2 = "http://newoj.acmclub.cn/problems/"  # not used below
headers = {
    'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1',
    'accept-language': 'zh-CN,zh',
}
# Markup stripped from the description, applied in this order: (old, new)
replacements = [
    ('<div align="left"><span style="font-size: medium">', ' '),
    ('<font color="#000000">', ''),
    ('<span style="font-size: medium">', ''),
    ('<span style="font-size: small">', ''),
    ('<span>', ' '),
    ('</span>', ' '),
    ('<p><style type="text/css">p { margin-bottom: 0.21cm; }</style>', ' '),
    ('<p style="margin-bottom: 0cm;"><font color="#000000">', ' '),
    ('<p>', ' '),
    ('</p>', ' '),
    ('<font>', ' '),
    ('</font>', ' '),
    ('<br />', ' '),
    ('<style type="text/css">p { margin-bottom: 0.21cm; }</style>', ' '),
    ('&nbsp;', ''),
]
for i in range(1002, 1003):
    try:
        print("start", i)
        r = requests.get(link + str(i), headers=headers, timeout=100)
        j = r.json()
        # print(j)
        problem_title = j['data']['title']
        # "/" cannot appear in a directory name, so it is replaced with "比" ("to")
        if "/" in problem_title:
            problem_title = problem_title.replace("/", "比")
        os.makedirs(BASE_DIR + str(i) + problem_title, exist_ok=True)
        write_in_file(str(i) + problem_title, "question: " + problem_title + "\n")
        problem_des = [j['data']['difficulty'], j['data']['input'], j['data']['output'],
                       j['data']['sample_input'], j['data']['sample_output']]
        the_title = ['Difficulty', 'Input', 'Output', 'Sample input', 'Sample output']
        print("writing " + str(i) + " file")
        description = j['data']['description']
        for old, new in replacements:
            description = description.replace(old, new)
        print(description)
        write_in_file(str(i) + problem_title, 'Description' + ":\n" + description + "\n")
        for m in range(len(the_title)):
            write_in_file(str(i) + problem_title, the_title[m] + ":\n" + str(problem_des[m]) + "\n")
        print("done")
    except Exception as exc:
        # Any failure (network error, missing problem, unexpected JSON) skips this id
        print("skipped", i, exc)
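The long chain of replace() calls only covers the markup that happened to show up in the problems tested. As a possible alternative, here is a minimal sketch using only the standard library's `html.parser`. The helper name `html_to_text` is hypothetical, and its behaviour differs slightly from the script above: `&nbsp;` becomes an ordinary space instead of being deleted, and all whitespace runs are collapsed.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring tags and the contents of <style>/<script>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space, then collapse every whitespace run
        # (including non-breaking spaces) into a single space.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(fragment):
    """Return the visible text of an HTML fragment as one line."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = '<p><span style="font-size: medium">Given A and B,&nbsp;print A+B.</span><br /></p>'
    print(html_to_text(sample))
```

Since line breaks are collapsed too, this suits the description field better than the sample input and output, where the layout matters.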

2.hduacm

import os

import requests
from bs4 import BeautifulSoup

BASE_DIR = '/Users/cyh/Desktop/acm/hduacm/'


def write_in_file(f, string):
    # Output function: append text to <BASE_DIR>/<f>/<f>.txt
    with open(BASE_DIR + f + '/' + f + ".txt", "a+", encoding='utf-8') as fi:
        fi.write(string)


link = "http://acm.hdu.edu.cn/showproblem.php?pid="
headers = {
    'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1',
}
for i in range(6937, 6939):
    print("start", i)
    r = requests.get(link + str(i), headers=headers, timeout=100)
    print("OK")
    soup = BeautifulSoup(r.text, "lxml")
    problem_title = soup.find("h1").text  # get the title
    # "/" cannot appear in a directory name, so it is replaced with "比" ("to")
    if "/" in problem_title:
        problem_title = problem_title.replace("/", "比")
    os.makedirs(BASE_DIR + str(i) + problem_title, exist_ok=True)
    write_in_file(str(i) + problem_title, "question: " + problem_title + "\n")
    # Each panel_title div (section heading) is paired with a panel_content div
    problem_des = soup.find_all("div", class_="panel_content")
    the_title = soup.find_all("div", class_="panel_title")
    # print(the_title)
    print("writing " + str(i) + " file")
    for title, des in zip(the_title, problem_des):
        write_in_file(str(i) + problem_title, title.text + ": " + des.text + "\n")
    print("done")
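All three scripts build the output directory by string concatenation and only guard against "/" in the title. For Windows, where more characters are forbidden in file names, something like the sketch below might be more robust. The helper `problem_dir` is hypothetical, and it substitutes "_" for unsafe characters where the scripts above use "比".

```python
import re
import tempfile
from pathlib import Path

# Characters not allowed in Windows file names; "/" is also unsafe on macOS/Linux.
_UNSAFE = re.compile(r'[\\/:*?"<>|]')


def problem_dir(base, problem_id, title):
    """Create <base>/<id><title>/ and return the path of <id><title>.txt inside it."""
    name = f"{problem_id}{_UNSAFE.sub('_', title).strip()}"
    folder = Path(base) / name
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"{name}.txt"


if __name__ == "__main__":
    # A temporary directory stands in for the Desktop folder used in the scripts.
    with tempfile.TemporaryDirectory() as tmp:
        target = problem_dir(tmp, 1000, "A + B Problem")
        target.write_text("question: A + B Problem\n", encoding="utf-8")
        print(target.name)
```

Because the path comes from `pathlib`, the same code should work unchanged on macOS and Windows once `base` points at the right folder.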

3.pkuacm

import os

import requests
from lxml import etree

BASE_DIR = '/Users/cyh/Desktop/acm/pkuacm/'


def write_in_file(f, string):
    # Output function: append the problem HTML to <BASE_DIR>/<f>/<f>.html
    with open(BASE_DIR + f + '/' + f + ".html", "a+", encoding='utf-8') as fi:
        fi.write(string)


link = "http://poj.org/problem?id="
headers = {
    'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1',
}
count = [0, 0]  # [saved, skipped]
for i in range(2577, 3000):
    try:
        print("start", i)
        t = '&lang=zh-CN&change=true'
        r = requests.get(link + str(i) + t, headers=headers, timeout=100)
        r = r.content
        print("OK")
        c = etree.HTML(r, parser=etree.HTMLParser())
        # The problem body is the second <table> under <body>; the title is the
        # <div class="ptt" lang="zh-CN"> element inside it.
        d = c.xpath("/html/body/table[2]")
        e = c.xpath('/html/body/table[2]/tr/td/div[2]')
        problem_title = etree.tostring(e[0], encoding='utf-8').decode('utf-8').replace("</div>\n", '').replace('<div class="ptt" lang="zh-CN">', '')
        print(problem_title)
        content = etree.tostring(d[0], encoding='utf-8').decode('utf-8')
        # print(etree.tostring(c, pretty_print=True).decode("utf-8"))
        # "/" cannot appear in a directory name, so it is replaced with "比" ("to")
        if "/" in problem_title:
            problem_title = problem_title.replace("/", "比")
        # Strip newlines once so the directory and file names always match
        problem_title = problem_title.strip('\n')
        os.makedirs(BASE_DIR + str(i) + problem_title, exist_ok=True)
        write_in_file(str(i) + problem_title, content)
        count[0] += 1
    except Exception:
        # Any failure (network error, missing problem, unexpected layout) skips this id
        count[1] += 1
        print("pass:", count[1])
print("done", count[0])
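Over a range of several hundred IDs, a single timeout is enough for a problem to be counted as skipped. One possible refinement is to retry a failed request a few times before giving up. The sketch below is not part of the original script: the helper `fetch_with_retry` is hypothetical, and it takes the fetching function as a parameter so it can be tried without network access.

```python
import time


def fetch_with_retry(fetch, problem_id, attempts=3, delay=1.0, sleep=time.sleep):
    """Call fetch(problem_id), retrying on any exception up to `attempts` times.

    Waits delay, 2*delay, 4*delay, ... between tries and re-raises the last error.
    """
    for attempt in range(attempts):
        try:
            return fetch(problem_id)
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(delay * 2 ** attempt)


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_fetch(problem_id):
        # Simulated download that fails on the first call only
        state["calls"] += 1
        if state["calls"] == 1:
            raise TimeoutError("simulated timeout")
        return f"<html>problem {problem_id}</html>"

    print(fetch_with_retry(flaky_fetch, 2577, delay=0.1))
```

In the script, `fetch` could be a small function wrapping the `requests.get` call, with the existing `except` branch still counting the problems that fail after the last attempt.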
