Scraping images from 东方求闻史纪 and 东方求闻口授

I'll optimize the code when I have time.

# -*- coding: UTF-8 -*-
import os
import random
import re
import time

import chardet
import requests
from bs4 import BeautifulSoup

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/52.0.2743.116 Safari/537.36',  # browser user agent
    'Connection': 'close',  # close the connection once the page has been fetched
}
root = 'https://thwiki.cc'
rooturl1 = 'https://thwiki.cc/东方求闻史纪'
rooturl2 = 'https://thwiki.cc/东方求闻口授'
saveturl = './thout/ThouhouImage'


def fetch_soup(url):
    # Download a page, let chardet guess its encoding, then parse it
    page = requests.get(url, headers=header)
    page.encoding = chardet.detect(page.content)['encoding']
    return BeautifulSoup(page.text, "html.parser")


def save_image(picurl):
    img_name = picurl.split('/')[-1]
    img_path = os.path.join(saveturl, img_name)
    try:
        # Create the output directory if it does not exist yet
        if not os.path.exists(saveturl):
            os.makedirs(saveturl)
        if not os.path.exists(img_path):
            r = requests.get(picurl)  # download and save the image
            with open(img_path, 'wb') as f:
                f.write(r.content)
            print("File saved")
        else:
            print("File already exists")
    except Exception:
        print("Error while saving")


def crawl(rooturl, ext):
    soup = fetch_soup(rooturl)
    # Collect the links to the sub-pages
    data = soup.find_all(name='a', attrs={"href": re.compile(r'/(%\w{2})+/(%\w{2})+')})
    pattern = re.compile(r'c(/\w{1,2}){2}/(%\w{2})+\.' + ext)  # simplified regex
    for k in data:
        souphtml = fetch_soup(root + k.get('href'))
        datahtml = souphtml.find_all(name='img', attrs={"src": pattern})
        if datahtml:
            picurls = [i.get('src') for i in datahtml]
        else:
            # No match on src, so fall back to the srcset attribute
            datahtmlset = souphtml.find_all(name='img', attrs={"srcset": pattern})
            if not datahtmlset:
                continue
            # Build the URL from the second-to-last space-separated token
            picurls = [i.get('srcset').split(' ')[-2] for i in datahtmlset]
        time.sleep(random.randint(0, 2))  # random pause to imitate a real user
        for picurl in picurls:
            save_image(picurl)


crawl(rooturl1, 'jpg')  # 东方求闻史纪 uses JPG files
crawl(rooturl2, 'png')  # 东方求闻口授 uses PNG files
