A simple Python crawler exercise: scraping galleries from E站 (e-hentai.org)

Prerequisites:

A machine that can reach sites that are otherwise blocked on your network..

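Process:

I am still new to crawlers and not yet fluent with the techniques, so I leaned on a lot of documentation and blog posts for syntax and usage while writing this. My approach was to inspect the current search page with F12, find the URL of each gallery, and then drill down layer by layer until I reached the images to download. The script walks through the tags level by level and saves the files locally. It can crawl every gallery on one full search results page and stores them in the same directory as the script. It is fine for playing around, although E站 did ban my IP once partway through. Looking at it again now, a lot could still be improved (being bad just means there is room to grow), and I can't rule out that it has stopped working.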
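The IP ban mentioned above is the usual result of sending requests back to back. One way to reduce the risk is to enforce a minimum gap between requests. The following is only a minimal sketch: the `Throttle` class and its parameters are hypothetical names introduced here, not part of the original script, and the right interval for any given site is an assumption you would have to tune.

```python
import time


class Throttle:
    """Enforce a minimum gap between consecutive calls to wait().

    Hypothetical helper, not part of the original script. The clock and
    sleep functions are injectable so the behaviour can be checked
    without actually waiting.
    """

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon after the previous request: sleep off the rest
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In the script, one would create a single `Throttle(2.0)` (the two-second value is an arbitrary example) and call `wait()` immediately before every `requests.get(...)` and every image download.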

This is just a test page, don't read too much into it

Code:

from bs4 import BeautifulSoup
import re
import requests
import os
import urllib.request

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Upgrade-Insecure-Requests': '1'}

r = requests.get("https://e-hentai.org/", headers=headers)
soup = BeautifulSoup(r.text, 'lxml')
divs = soup.find_all(class_='gl3c glname')

# Collect the gallery URLs on the current page
for div in divs:
    url = div.a.get('href')
    r2 = requests.get(url, headers=headers)
    soup2 = BeautifulSoup(r2.text, 'lxml')
    manga = soup2.find_all(class_='gdtm')
    title = soup2.title.get_text()  # Get the gallery title
    # Image counter for this gallery; initialised here rather than inside
    # the loop below, where it was reset to 0 for every image
    page = 0
    # Walk through each page of the gallery
    for div2 in manga:
        picurl = div2.a.get('href')
        picr = requests.get(picurl, headers=headers)
        soup3 = BeautifulSoup(picr.text, 'lxml')
        downurl = soup3.find_all(id='img')
        for dur in downurl:
            # print(dur.get('src'))
            purl = dur.get('src')
            # Check whether the folder already exists
            fold_path = './' + title
            if not os.path.exists(fold_path):
                print("Creating folder...")
                os.makedirs(fold_path)
            print("Trying to download image....:{}".format(purl))
            # Keep the file extension
            filename = title + str(page) + purl.split('/')[-1]
            filepath = fold_path + '/' + filename
            page = page + 1
            if os.path.exists(filepath):
                print("File already exists, skipping")
            else:
                try:
                    urllib.request.urlretrieve(purl, filename=filepath)
                except Exception as e:
                    print("An error occurred:")
                    print(e)

After that I also used pyinstaller to package it into an exe file.

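Update 1:

I realized I had forgotten to handle the sub-pages inside a gallery, so each gallery could yield at most forty images. On top of that, a single search page holds too many galleries of very uneven quality. I happened to be testing the crawler in the middle of the night, living alone, and a horror gallery slipped into the batch and cost me a night's sleep. So the script now crawls one gallery at a time, starting from the link you get after opening that gallery. The way it is written, a link to any sub-page of the gallery should also fetch the whole gallery (I have since tried this and it does work). This version also fixes the bug where certain titles caused the crawl to fail.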
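The pagination step can be looked at in isolation. The sketch below pulls the logic used in the script into a hypothetical helper, `build_page_urls`, that takes the `href` values found in the pager and returns every sub-page URL. It assumes, as the script does, that the first sub-page has no query string and that later ones are numbered `?p=1`, `?p=2`, and so on. The `example.org` URLs are placeholders, not real gallery links.

```python
def build_page_urls(hrefs):
    """Return the URLs of all sub-pages of a gallery.

    hrefs: the href values of the links in the pager.
    Assumes the first sub-page has no '?p=' suffix and that the rest
    are numbered consecutively starting from ?p=1.
    """
    base_url = ""
    max_page = 0
    for href in hrefs:
        parts = href.split("?p=")
        if len(parts) > 1:
            # A numbered sub-page: remember the highest index seen
            max_page = max(max_page, int(parts[1]))
        else:
            # No '?p=' suffix: this is the first sub-page
            base_url = href
    # The pager may skip indices in the middle, so rebuild the full range
    return [base_url] + [
        "{}?p={}".format(base_url, i) for i in range(1, max_page + 1)
    ]


pager_links = [
    "https://example.org/g/123/abcdef/",
    "https://example.org/g/123/abcdef/?p=1",
    "https://example.org/g/123/abcdef/?p=3",
]
for page_url in build_page_urls(pager_links):
    print(page_url)
```

Because the range is rebuilt from the highest index, sub-pages the pager does not link directly (here `?p=2`) are still visited.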

# coding:utf-8
# author:graykido
# date=2020.5.3

from bs4 import BeautifulSoup
import re
import requests
import os
import urllib.request
import threading
import time

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Upgrade-Insecure-Requests': '1'}

# Batch processing
urls = []
temp = input('Enter a gallery page to crawl (leave blank to finish): ')
while temp != "":
    urls.append(temp)
    temp = input('Added. Enter another link, or leave blank to finish: ')

for url in urls:
    start = time.perf_counter()
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    manga = soup.find_all(class_='gdtm')
    title = soup.title.get_text()  # Get the gallery title
    # Strip illegal characters
    for ch in '!"#$%&()*+,-./:;<=>?@\\^_‘{|}~ ':
        title = title.replace(ch, "")
    # Avoid save failures caused by an overly long title
    if len(title) > 50:
        title = title[:50]
    pagetag = soup.find(class_='ptt').find_all('a')
    mxpage = 0
    baseurl = ""
    for page in pagetag:
        temstr = str(page.get('href'))
        temspl = temstr.split('?p=')
        if len(temspl) > 1:
            mxpage = max(mxpage, int(temspl[1]))
        else:
            baseurl = page.get('href')
    pages = [baseurl]
    for i in range(1, mxpage + 1):
        pages.append(baseurl + '?p=' + str(i))
    # The first sub-page has no ?p= index, so the total is len(pages),
    # not mxpage
    print("Fetching gallery: {0:}, {1:} sub-page(s) in total".format(soup.title.get_text(), len(pages)))
    count = 0
    # Walk through each sub-page
    for page in pages:
        r = requests.get(page, headers=headers)
        soup = BeautifulSoup(r.text, 'lxml')
        manga = soup.find_all(class_='gdtm')
        # Walk through each image on the sub-page
        for div in manga:
            picurl = div.a.get('href')
            picr = requests.get(picurl, headers=headers)
            soup2 = BeautifulSoup(picr.text, 'lxml')
            downurl = soup2.find_all(id='img')
            for dur in downurl:
                purl = dur.get('src')
                fold_path = './' + title
                # Check whether the folder already exists
                if not os.path.exists(fold_path):
                    print("Creating folder...")
                    os.makedirs(fold_path)
                print("Trying to download image....:{}".format(purl))
                # Keep the file extension
                filename = title + str(count) + '.' + purl.split('.')[-1]
                filepath = fold_path + '/' + filename
                count = count + 1
                if os.path.exists(filepath):
                    print("File already exists, skipping")
                else:
                    try:
                        urllib.request.urlretrieve(purl, filename=filepath)
                        print("Done")
                    except Exception as e:
                        print("An error occurred:")
                        print(e)
    print('---- All done ----')
    end = time.perf_counter()
    print("Total download time: {} seconds".format(end - start))
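The title clean-up in the script above (strip a fixed set of characters, then cap the length at 50) can also be written as a small reusable function. This is only a sketch: `clean_title` and `max_len` are names introduced here, and the character set is copied from the script as it appears, not taken from any official list of forbidden file-name characters.

```python
# Characters removed from titles; copied from the script above
ILLEGAL_CHARS = '!"#$%&()*+,-./:;<=>?@\\^_‘{|}~ '


def clean_title(title, max_len=50):
    """Remove the listed characters from a title and cap its length.

    Hypothetical helper equivalent to the replace loop plus the
    length check in the script.
    """
    # str.maketrans with a third argument maps each listed character to None
    table = str.maketrans("", "", ILLEGAL_CHARS)
    return title.translate(table)[:max_len]


print(clean_title("A: B/C? (x)"))
```

Using `str.translate` makes a single pass over the string instead of one `replace` call per character, which matters little at this size but keeps the intent in one place.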

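Afterwards:

I found out that E站 actually has its own API. It is not particularly pleasant to use either, but it is at least a bit more convenient than a purely hand-written crawler.

Documentation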
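As a rough illustration of what using the API could look like, the sketch below only builds the JSON body for a gallery metadata request and sends nothing over the network. The endpoint, the `gdata` method name, and the `gidlist` / `namespace` fields are all assumptions based on a recollection of the public API documentation and should be checked against it before use. The gallery URL is a made-up placeholder, and `gallery_request_body` is a hypothetical helper name.

```python
import json
import re

# Assumed endpoint; verify against the API documentation
API_URL = "https://api.e-hentai.org/api.php"


def gallery_request_body(gallery_url):
    """Build the JSON body for a gallery metadata request.

    Assumes gallery URLs have the form .../g/<gallery id>/<token>/ and
    that the API expects a 'gdata' request with a list of [id, token]
    pairs. Both points are assumptions, not confirmed by the article.
    """
    match = re.search(r"/g/(\d+)/([0-9a-f]+)", gallery_url)
    if match is None:
        raise ValueError("not a gallery URL: {}".format(gallery_url))
    gid, token = int(match.group(1)), match.group(2)
    return json.dumps({
        "method": "gdata",
        "gidlist": [[gid, token]],
        "namespace": 1,
    })


body = gallery_request_body("https://example.org/g/123456/0123abcdef/")
print(body)
```

If those assumptions hold, the body would be sent with something like `requests.post(API_URL, data=body)`, and the response would carry the title and other metadata without any HTML parsing.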
