Scraping Bcy (半次元) Images with Python [Part 1]

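Modules used: requests, BeautifulSoup4, lxml (BeautifulSoup uses it as the parser here, which is reportedly much faster), and re (regular expressions; P.S. I only used its compile function).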

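The approach:

Create an Img folder, parse the title out of the HTML and use it as the name of a subfolder created under Img, and analyze the page with the Firefox extension Firebug (this part is manual inspection, not code).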

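Next, a quick rundown of the functions used.

re:

re.compile("%s" % (pattern))  Just fill in the string you want to match.

BeautifulSoup:

BeautifulSoup()

find_all("a", attrs={" ": re.compile("")})  Fill in the attribute to match, for example soup.find_all("a", attrs={"class": re.compile("fz16")})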
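As a minimal sketch of the find_all call above, the snippet below runs it against a small hand-written HTML fragment. The fragment, the helper name find_links_by_class, and the use of the built-in "html.parser" (instead of lxml, to keep dependencies down) are assumptions for illustration, not the real Bcy markup.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML fragment; the real page markup will differ.
SAMPLE_HTML = """
<div>
  <a class="fz16 l-left mr5 blue1" href="/u/123">coser A</a>
  <a class="fz14" href="/about">about</a>
  <a class="fz16 red" href="/u/456">coser B</a>
</div>
"""


def find_links_by_class(html, class_pattern):
    """Return the href of every <a> whose class attribute matches the regex."""
    soup = BeautifulSoup(html, "html.parser")
    anchors = soup.find_all("a", attrs={"class": re.compile(class_pattern)})
    return [a["href"] for a in anchors]


print(find_links_by_class(SAMPLE_HTML, "fz16"))
```

Because class is a multi-valued attribute, BeautifulSoup tests the pattern against each class name in turn, so a short pattern such as "fz16" is enough to select both matching links.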

os:

os.path.exists("")  Takes a directory or file path

os.makedirs("")  Takes a directory path
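These two os calls are usually combined into a "create it if it is missing" check. A small sketch; the helper name ensure_dir and the temporary directory used in the demo are illustrative assumptions.

```python
import os
import tempfile


def ensure_dir(path):
    """Create the directory (and any missing parents) unless it already exists."""
    if not os.path.exists(path):
        os.makedirs(path)
    return path


# Demo inside a throwaway temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as base:
    target = ensure_dir(os.path.join(base, "Img", "some_title"))
    print(os.path.isdir(target))
    ensure_dir(target)  # calling it a second time is harmless
```

On Python 3.2 and later, os.makedirs(path, exist_ok=True) does the same thing in a single call.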

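requests:

requests.get(url)    Works with both https and http. I couldn't get the built-in urllib to fetch https pages; if any expert reads this, please advise, because the pile of answers I found on Baidu didn't help.

request in urllib:    Note that this is request, not requests; don't mix them up.

request.urlretrieve(url, filename, ...)    Three parameters matter here: the first is the target URL, the second is where to save the file and under what name (the directory name needs cleaning up first), and the optional third is a progress callback. Look up urlretrieve for the details.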
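To show the three urlretrieve arguments together, here is a sketch with a progress callback. So that it runs without network access, the demo "downloads" from a local file:// URL built from a temporary file; the names download_with_progress and show_progress are made up for this example.

```python
import os
import tempfile
from pathlib import Path
from urllib import request


def show_progress(block_num, block_size, total_size):
    """reporthook: called once at the start and again after each block is read."""
    if total_size > 0:
        done = min(block_num * block_size, total_size)
        print("downloaded %d of %d bytes" % (done, total_size))


def download_with_progress(url, filename):
    """Save url to filename, reporting progress, and return the saved path."""
    saved_path, _headers = request.urlretrieve(url, filename, show_progress)
    return saved_path


# Offline demo: "download" a local file through a file:// URL.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.jpg"
    source.write_bytes(b"fake image bytes")
    target = os.path.join(tmp, "1.jpg")
    download_with_progress(source.as_uri(), target)
    print(os.path.getsize(target))
```

For a real image, pass the http or https link as the first argument and a path inside the title folder as the second.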

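I originally wanted to grab every coser straight from the home page and then download through the sub-links, but the target site is a dynamic page. I read that this calls for WebKit, so I didn't look into it any further. A programmer should take responsibility for their code and their users, but I have school tomorrow and really can't grind any more.

Below is my code. It has plenty of shortcomings and the initialization isn't written well. I've been grinding at this for over a day and I'm worn out, so I'll write this blog post and then play games for a while.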

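Get the href attribute from the link on the example page:

hrefs = soup.find_all("a",attrs = {"class":re.compile("fz16 l-left mr5 blue1")})

href = hrefs[0]["href"]

find, by contrast, only grabs a single match.

After that, build the complete URL by string concatenation and pass it to urlretrieve.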
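Plain string concatenation works as long as the href is always a site-relative path. As an alternative sketch (not what my demo does), urllib.parse.urljoin also copes with hrefs that are already absolute. The base URL comes from the demo's default link; the helper name full_link and the sample href values are invented.

```python
from urllib.parse import urljoin

BASE_URL = "https://bcy.net"  # site root, as in the demo's default link


def full_link(href, base=BASE_URL):
    """Turn a relative or absolute href into a complete URL."""
    return urljoin(base, href)


print(full_link("/coser/detail/13612/338282"))
print(full_link("https://bcy.net/u/1"))
```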

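Let me stress this once more: the file name must be cleaned up. Otherwise you'll end up like me. This should have been working last night, but I only got it done today.

Title = Title.replace(" ","")  replaces the spaces

Title = Title.strip() and Title = Title.rstrip()  remove the newlines and spaces at both ends. P.S. I don't know whether Windows or PyCharm was to blame, but I could never get rid of the tab character this way, so I ended up slicing the string instead. P.S. Adjust this to your actual situation.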
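One way to make this cleanup more robust is sketched below. str.strip() with no arguments already removes tabs and newlines at the ends, so a tab that survives it is likely in the middle of the string; splitting on any whitespace and re-joining removes those as well. The helper name clean_title, the sample title, and the choice to also drop characters that Windows forbids in file names are assumptions for illustration.

```python
import re

# Characters that are not allowed in Windows file names.
FORBIDDEN = r'[\\/:*?"<>|]'


def clean_title(title):
    """Remove all whitespace and file-name-unsafe characters from a title."""
    no_space = "".join(title.split())  # split() drops spaces, tabs and newlines
    return re.sub(FORBIDDEN, "", no_space)


print(clean_title("\n\t My Cos : Title?\t part 1 \n"))
```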

Get the text of the title like this:

Title = soup.find("h1",attrs = {"class":"js-post-title"}).text
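The difference between find and find_all matters here: find returns a single tag (or None), while find_all returns a list-like result that has to be indexed before .text can be read. A sketch on a hand-written fragment; the HTML, the function name get_title, and the "html.parser" choice are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the real page.
SAMPLE_HTML = '<h1 class="js-post-title">\n\t Sample cos title \n</h1>'


def get_title(html):
    """Return the stripped text of the first h1 with class js-post-title, or None."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("h1", attrs={"class": "js-post-title"})
    if tag is None:
        return None
    return tag.get_text(strip=True)


# find_all gives the same tag once the result is indexed.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
same_tag = soup.find_all("h1", attrs={"class": "js-post-title"})[0]
print(get_title(SAMPLE_HTML) == same_tag.get_text(strip=True))
```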

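Here Title is the title, which still needs cleaning up. Below is my demo. The initialization isn't done well; I'll improve it next weekend and provide a Baidu Pan link to the file.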

# Fetch images from the coser site; limited to \u\..
from bs4 import BeautifulSoup
import requests
from urllib import request
import re
import os
from random import randint


def Make_file():
    # Create the error log file on the first run
    if not os.path.exists("Daily_information.txt"):
        f = open("Daily_information.txt", "w")
        f.write("GET\n")
        f.close()


def Check_File():
    # Make sure the Img output folder exists
    if not os.path.isdir("Img"):
        os.makedirs("Img")


def Url_Write(url):  # URL log
    if not os.path.exists("Url_text.txt"):
        f = open("Url_text.txt", "w")
        f.write("\n%s\n" % url)
        f.close()
    else:
        f = open("Url_text.txt", "a")
        f.write("%s\n" % url)
        f.close()


def Url_geting(url='http://www.baidu.com', pat={"Mother": "fucker"}):  # fetch the page, return a BeautifulSoup object
    html = None  # stays None if parsing fails
    buf = requests.get(url=url, params=pat)  # read the page
    try:
        html = BeautifulSoup(buf.text, "lxml")  # parse with BeautifulSoup
    except Exception as e:  # log the error instead of crashing
        f = open("Daily_information.txt", "a")
        f.write("%s:%s\n" % (url, e))
        f.close()
    return html


def Title_Get(html):
    Big_Title = html.find_all("h1", attrs={"class": re.compile("js-post-title")})
    Title = Big_Title[0].text
    return Title


def Title_file_create(Title):  # create the folder named after the title
    True_way = "%s" % Title
    os.makedirs(True_way)


def Title_Dispose(Title):  # clean up the title
    Title = Title[1:]
    Title = Title.split(":")
    Title = Title[-1]
    Title = Title.replace(" ", "")
    return Title


def Img_Link_get(html):  # find the image links (probably only works on Bcy)
    Img_link = []
    Img_Face = html.find_all("img", attrs={"class": re.compile("detail_std")})
    for i in Img_Face:
        Img_link.append(i["src"])
    return Img_link


# Given the logged-in user and a link: fetch the html, parse it for the src of
# each img, then get the title and clean it up for use as the folder name
def Get_information(url="https://bcy.net/coser/detail/13612/338282", pat={"Test": "@1"}):
    html = Url_geting(url, pat=pat)
    The_link = Img_Link_get(html=html)
    Title = Title_Get(html)
    Title = Title_Dispose(Title)
    attrs = [url, html, The_link, Title]
    return attrs


def Get_Download(Img_links, path):
    # Remember to validate arguments in future; the bug this time was the
    # unhandled spaces in the Title that gets passed out
    if not os.path.exists(path):
        os.makedirs(path)
    step = 0
    for i in Img_links:
        step += 1
        request.urlretrieve(i, os.path.join(path, "%d.jpg" % step))


pat = [{"门前大桥下": "游过一只鸭"}, {"我爱北京天安门": "天安门上太阳升"}, {"爱像": "一阵风"}, {"吹完他就走": "~~"}, {"辣妹儿": "法克儿"}]

if __name__ == "__main__":
    Check_File()
    Make_file()
    print("Only Bcy coser page links are supported. Type q and press Enter to quit.")
    print("Enter a link:")
    while True:
        url = input()
        if url == "q":
            break
        print("Downloading....")
        weigth = len(pat)
        params = pat[randint(0, weigth - 1)]  # pick one dummy parameter set at random
        attrs = Get_information(url, params)
        path = os.path.join("Img", attrs[3])
        # print(path + '1')
        Get_Download(attrs[2], path=path)
        Url_Write(url=attrs[0])
        print("Download finished... enter another link to continue... q + Enter to quit")

The girls really are lovely. Too bad I've somehow stayed single all these years.

Life goes on, until we die.

Reposted from: https://www.cnblogs.com/the-moon-so-beautiful/p/7536096.html
