Step 1: Look at the site

Start by looking at how the problems are organized on the NOI site (noi.openjudge.cn).

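That is roughly what it looks like. The home page lists only section titles such as 1.1, 1.2, 1.3, ..., each followed by a handful of problems. Crawling just those problems is clearly not enough, because there are too few of them. What we actually want is the link behind each section title, so that we can follow it to the next page.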

Step 2: Analyze the first page

Open the developer tools in Google Chrome and inspect the problem links.

Our task is to collect every section-title link on the home page. Those links lead to the second-level pages that list all the problems.
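The idea of pulling the section-title links out of the home page can be sketched with the standard library alone. This is only an illustration: the full script further down uses BeautifulSoup with the selector `.practice-info h3 a`, while the `TitleLinkParser` class, the `extract_title_links` helper, and the sample HTML here are made up for the example and simply collect every `<a>` that sits inside an `<h3>`.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect the href of every <a> tag nested inside an <h3> tag."""

    def __init__(self):
        super().__init__()
        self.h3_depth = 0   # > 0 while we are inside an <h3>
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h3':
            self.h3_depth += 1
        elif tag == 'a' and self.h3_depth > 0:
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == 'h3' and self.h3_depth > 0:
            self.h3_depth -= 1


# Hypothetical markup, loosely modelled on a home page with section titles
# in <h3> and a few sample problems listed underneath each one.
SAMPLE_HOME = '''
<div class="practice-info">
  <h3><a href="/ch0101/">1.1 Section one</a></h3>
  <ul><li><a href="/ch0101/01/">Problem 01</a></li></ul>
  <h3><a href="/ch0102/">1.2 Section two</a></h3>
</div>
'''


def extract_title_links(html):
    """Return the section-title links found in html, in document order."""
    parser = TitleLinkParser()
    parser.feed(html)
    return parser.links


print(extract_title_links(SAMPLE_HOME))
```

Note that the problem link inside the `<ul>` is skipped, which is the point: only the title links are needed at this stage.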

Enter one of these links in the browser, and the guess is clearly correct.
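The links found on the page are relative paths, so they have to be combined with the site address before they can be opened. A minimal sketch of that step, assuming hrefs of the form `/ch0101/` (an illustrative value) and using the hypothetical helper name `absolute_url`:

```python
from urllib.parse import urljoin

BASE_URL = 'http://noi.openjudge.cn/'


def absolute_url(href):
    """Turn a relative href from the page into a full URL on the site."""
    return urljoin(BASE_URL, href)


print(absolute_url('/ch0101/'))
```

`urljoin` handles hrefs with or without a leading slash, which plain string concatenation does not.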

Step 3: Analyze the problem links on the second page

On the second page, collect every problem link. These links take us to the third page.
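The overall shape of the crawl is a three-level walk: home page, then section pages, then problem pages. The sketch below shows only that traversal, with network access and HTML parsing replaced by a made-up in-memory table (`FAKE_SITE`); the paths and the helper names `links_on` and `collect_problem_paths` are assumptions for illustration.

```python
# Hypothetical stand-in for the site: each path maps to the links found on it.
FAKE_SITE = {
    '/': ['/ch0101/', '/ch0102/'],
    '/ch0101/': ['/ch0101/01/', '/ch0101/02/'],
    '/ch0102/': ['/ch0102/01/'],
}


def links_on(path):
    """Pretend to fetch a page and return the links it contains."""
    return FAKE_SITE.get(path, [])


def collect_problem_paths(start='/'):
    """Follow home page -> section pages and gather all problem links."""
    problems = []
    for section in links_on(start):
        problems.extend(links_on(section))
    return problems


print(collect_problem_paths())
```

In the real script each `links_on` call corresponds to one HTTP request followed by one CSS-selector query.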

Step 4: Analyze the problem page

With the analysis done, the next step is to paste in some code.

Step 5: Crawl all NOI problems

The code:

import requests
from bs4 import BeautifulSoup

BASE_URL = 'http://noi.openjudge.cn'

# Request headers copied from a browser session. The PHPSESSID value is the one
# from the original post and has most likely expired; replace it with your own.
HEADERS = {
    'Cookie': 'PHPSESSID=9k52q5kv00l4m29nvbf55m08j7',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/70.0.3538.77 Safari/537.36',
    'Host': 'noi.openjudge.cn',
    'Connection': 'keep-alive',
}


def fetch(url, extra_headers=None):
    """Return the decoded HTML of url, or None if the request fails."""
    headers = {**HEADERS, **(extra_headers or {})}
    try:
        # The request itself is inside the try block so that network
        # errors are caught as well.
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            response.encoding = response.apparent_encoding
            return response.text
        return None
    except Exception as e:
        print(e)
        return None


def get_page_one():
    """Fetch the home page."""
    return fetch(BASE_URL)


def get_page_two(href):
    """Fetch a section page (second level)."""
    return fetch(BASE_URL + href)


def get_page_three(href):
    """Fetch a problem page (third level)."""
    return fetch(BASE_URL + href, {'Upgrade-Insecure-Requests': '1'})


def parse_html_href_one(html):
    """Return the section-title links found on the home page."""
    soup = BeautifulSoup(html, 'lxml')
    return [a['href'] for a in soup.select('.practice-info h3 a')]


def parse_html_href_two(section_hrefs):
    """Visit every section page; return problem links and problem titles."""
    problem_hrefs = []
    problem_titles = []
    for href in section_hrefs:
        html = get_page_two(href)
        if html is None:  # skip pages that could not be fetched
            continue
        soup = BeautifulSoup(html, 'lxml')
        for a in soup.select('.title a'):
            problem_hrefs.append(a['href'])
            problem_titles.append(a.get_text())
    print('Total problems: {}'.format(len(problem_hrefs)))
    return problem_hrefs, problem_titles


def parse_html_href_three(problem_hrefs, problem_titles):
    """Visit every problem page and save its text under the problem title."""
    for href, title in zip(problem_hrefs, problem_titles):
        html = get_page_three(href)
        if html is None:
            continue
        soup = BeautifulSoup(html, 'lxml')
        for node in soup.select('.problem-content'):
            write_to_file(node.get_text(), str(title))
            print(node.get_text())
            print('-----------------------')


def write_to_file(content, number):
    """Write content to '<number>.txt' in the current directory."""
    try:
        with open('{}.txt'.format(number), 'w', encoding='utf-8') as file:
            file.write(content)
    except Exception as e:
        print(e)


def start():
    href = parse_html_href_one(get_page_one())
    href_two, href_three = parse_html_href_two(href)
    parse_html_href_three(href_two, href_three)


if __name__ == '__main__':
    start()

That is the script I used to crawl NOI.
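One caveat: the script uses each problem title directly as a file name, so a title containing a character such as `/` or `:` would make the write fail on some systems. A possible workaround, sketched here with a hypothetical helper `safe_filename` that is not part of the script above, is to replace such characters before building the name:

```python
import re


def safe_filename(title, max_length=80):
    """Build a '.txt' file name from a problem title.

    Runs of whitespace and of characters that are commonly invalid in
    file names are replaced by a single underscore.
    """
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', '_', title.strip())
    cleaned = cleaned.strip('_')
    return (cleaned[:max_length] or 'untitled') + '.txt'


print(safe_filename('A+B Problem'))
```

The result could then be passed to `open()` in place of the raw title.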

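Feel free to download it if you need it. This was a small exercise from when I was learning web scraping; if you repost it, please credit me by name. If you have a different approach, let's compare notes. I also have a script that crawls all the data for every brand on a certain large e-commerce site ("某东"), and I may share it in a later blog post.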

