Scraping the Gufeng Manhua site (古风漫画网) with Python

#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import time
from concurrent.futures import ThreadPoolExecutor
from urllib import request

import requests
from bs4 import BeautifulSoup

# Initialize the environment
rootPath = r"D:\Comic"
header = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
setname = "http://m.gufengmh8.com"
# startChapter = 0

# URL of the comic's table-of-contents page to scrape
# (the input() calls are disabled, so the hard-coded values below are used)
url = "http://m.gufengmh8.com/manhua/shenyongjianglin/"
print("Enter the URL of the comic's table-of-contents page:")
# url = input()
print("Enter the starting chapter (enter 1 to download the whole comic):")
# startChapter = int(input()) - 1
startChapter = 366

# Rewrite a desktop (www) URL as the mobile (m) URL
if url.startswith("http://www."):
    url = "http://m." + url[len("http://www."):]
print(url)

# Fetch the table-of-contents page and read the comic title
req = requests.get(url, headers={"User-Agent": header})
result = req.content.decode("utf-8")
soup = BeautifulSoup(result, 'html5lib')
titleName = soup.title.string

# Find the chapter list element
chapter = soup.find_all(id="chapter-list-1")[0]
chapterUrlList = chapter.find_all("a")
chapterNameList = chapter.find_all("span")

# Collect the chapter URLs
chapterUrl = [line.get("href") for line in chapterUrlList]

# Collect the chapter names; an ASCII colon is not allowed in a Windows path,
# so it is replaced with a full-width colon
chapterName = [line.string.replace(":", "：") for line in chapterNameList]

print(chapterUrl)
print(chapterName)


def download_chapter(url, CN):
    print("Downloading chapter: " + CN)
    # Create the chapter directory on disk if it does not exist yet
    path = rootPath + "\\" + titleName + "\\" + CN + "\\"
    os.makedirs(path, exist_ok=True)
    # Fetch the chapter page, read its page count, then wait 0.3 s
    url = setname + url
    req = requests.get(url, headers={"User-Agent": header})
    req = req.content.decode('utf-8', 'ignore')
    soup = BeautifulSoup(req, "html5lib")
    # The page count lives under one of two ids, depending on the page layout
    chapterPageNumber = soup.find_all(id="k_total")
    if len(chapterPageNumber) == 0:
        chapterPageNumber = soup.find_all(id="total-page")[0].string
    else:
        chapterPageNumber = chapterPageNumber[0].string
    chapterPageNumber = int(chapterPageNumber)
    time.sleep(0.3)
    # Strip the trailing ".html" from the chapter URL
    urlPage = url[:-5]
    # Current page number within the chapter
    nowChapterPage = 1
    while True:
        # Build and fetch the URL of the current page
        url = urlPage + "-%d" % nowChapterPage + ".html"
        req = requests.get(url, headers={"User-Agent": header})
        result = req.content.decode("utf-8")
        soup = BeautifulSoup(result, "html5lib")
        # Get the image URL; the tag is <mip-img> or <img>, depending on the page layout
        imgUrl = soup.find_all("mip-img")
        if len(imgUrl) == 0:
            imgUrl = soup.find_all("img")[0]
        else:
            imgUrl = imgUrl[0]
        imgUrl = imgUrl.get("src")
        # print(path)
        # Save the image; on failure, report it and move on to the next page
        try:
            request.urlretrieve(imgUrl, path + "%d.jpg" % nowChapterPage)
        except Exception:
            print()
            print("Comic: " + CN + " page %d failed to download" % nowChapterPage)
        # Print the download progress
        # print("\rChapter downloaded: %.1f%%" % (nowChapterPage * 100 / chapterPageNumber), end="", flush=True)
        # Stop after the last page
        if nowChapterPage == chapterPageNumber:
            print()
            print(CN + "  download complete")
            break
        # Move to the next page after a 0.2 s delay
        nowChapterPage = nowChapterPage + 1
        time.sleep(0.2)


# Create a thread pool with 10 workers
pool = ThreadPoolExecutor(10)
# Submit every chapter from startChapter onward to the pool
for step in range(startChapter, len(chapterUrl)):
    CN = chapterName[step]
    url = chapterUrl[step]
    # A pool thread runs download_chapter for this chapter
    pool.submit(download_chapter, url, CN)
    time.sleep(0.2)
# print(titleName + "  download complete")
# print("Comic saved under: " + rootPath)
# print("Press Enter to exit....")
# exit = input()

''' code update log
2018/11/30  1. Build every soup from the mobile site URLs
            2. Handle the different page layouts for the chapter page count and the image URL tag
            3. Add error handling for failed image downloads
2019/1/14   1. Add a live download progress display
            2. Fix a bug where an ASCII colon in a chapter name produced an invalid storage path
2019/2/22   1. Rewrite the code around a 10-worker thread pool
'''
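The update log mentions the colon fix for chapter names, and the script builds each page URL by stripping ".html" and appending "-N.html". Both steps are easier to check when pulled out into small pure functions. The sketch below is one possible way to do that: `safe_chapter_name` and `page_url` are hypothetical names that do not appear in the original script, replacing the other Windows-reserved characters with "_" is an assumption that goes beyond the original colon fix, and the chapter URL in the demo is a made-up example.

```python
import re

# Characters (other than ":") that Windows rejects in file and directory names.
_INVALID = re.compile(r'[\\/*?"<>|]')


def safe_chapter_name(name):
    """Return a chapter name that can be used as a Windows directory name."""
    # Same substitution as the script: ASCII colon -> full-width colon.
    name = name.replace(":", "：")
    # Assumption beyond the original script: replace other reserved characters.
    return _INVALID.sub("_", name).strip()


def page_url(chapter_url, page):
    """Build the URL of one page from a chapter URL that ends in '.html'."""
    if not chapter_url.endswith(".html"):
        raise ValueError("expected a chapter URL ending in .html")
    return chapter_url[:-len(".html")] + "-%d.html" % page


if __name__ == "__main__":
    print(safe_chapter_name("Chapter 1: Start?"))
    # Made-up chapter URL, used only to show the page suffix.
    print(page_url("http://m.gufengmh8.com/manhua/example/123.html", 3))
```

Checking the ".html" suffix explicitly avoids silently cutting five arbitrary characters from a URL with a different shape, which is what the plain `url[:-5]` slice would do.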

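The script hands each chapter to the pool with `pool.submit` and never looks at the returned futures, so an exception raised inside a worker (for example, a missing page-count element) goes unreported. A minimal sketch of one way to surface such failures is shown here. `run_all` and `fake_download` are hypothetical names, and `fake_download` only stands in for `download_chapter` so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(worker, jobs, max_workers=10):
    """Run worker(*job) for every job and return (succeeded, failed) lists."""
    succeeded, failed = [], []
    # The with-block waits for all submitted work before it exits.
    with ThreadPoolExecutor(max_workers) as pool:
        futures = {pool.submit(worker, *job): job for job in jobs}
        for future in as_completed(futures):
            job = futures[future]
            try:
                future.result()  # re-raises any exception from the worker
            except Exception as exc:
                failed.append((job, exc))
            else:
                succeeded.append(job)
    return succeeded, failed


def fake_download(url, name):
    # Hypothetical stand-in for download_chapter(url, CN).
    if not url:
        raise ValueError("empty chapter URL")


if __name__ == "__main__":
    ok, bad = run_all(fake_download, [("/a.html", "ch1"), ("", "ch2")])
    print(len(ok), "succeeded;", len(bad), "failed")
```

In the real script the job list would be the (chapter URL, chapter name) pairs from `startChapter` onward, and the failed list could be printed or retried once all chapters have been attempted.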