Scraping Web Page Images with Python

Images on a web page are mostly embedded with the img tag and referenced by a relative path, for example:

<img src="image/bg.jpg"/>

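Matching the tag with a regular expression yields image/bg.jpg, and combining that path with the page address gives the full address of the image.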
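As a minimal sketch of this matching step, the src value can be captured with a regular expression and resolved against the page address with urljoin. The page address and the sample markup below are made up for illustration and are not taken from the site used in this article.

```python
import re
from urllib.parse import urljoin

# Hypothetical page address and markup, used only for illustration.
page_url = "http://example.com/site/index.html"
html = '<img src="image/bg.jpg"/> <img src="image/logo.png" alt="logo"/>'

# Capture whatever sits between the quotes of each src attribute.
relative_paths = re.findall(r'src\s*=\s*"([^"]+)"', html, re.IGNORECASE)

# Resolve each relative path against the page address.
image_urls = [urljoin(page_url, path) for path in relative_paths]

print(relative_paths)
print(image_urls)
```

urljoin is used here instead of plain string concatenation because it also copes with page addresses that end in a file name rather than a slash.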

Besides images referenced directly by the page, images pulled in through CSS files or linked HTML pages also need to be handled.
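For the CSS case, a rough sketch might look like the following. It assumes the stylesheet text is already in memory; the stylesheet address, its contents, and the helper name css_image_urls are all hypothetical. Relative paths inside url(...) are resolved against the stylesheet's own address, since that is how browsers interpret them.

```python
import re
from urllib.parse import urljoin

# Hypothetical stylesheet address and contents, used only for illustration.
css_url = "http://example.com/site/css/style.css"
css_text = """
body { background: url(../image/bg.jpg) repeat-x; }
#logo { background: url("logo.png") no-repeat; }
"""


def css_image_urls(text, base_url):
    """Return absolute URLs for every url(...) reference in a stylesheet."""
    found = []
    for raw in re.findall(r'url\(\s*([^)]+?)\s*\)', text, re.IGNORECASE):
        path = raw.strip('\'"')            # drop optional quotes
        absolute = urljoin(base_url, path)
        if absolute not in found:          # keep the first occurrence only
            found.append(absolute)
    return found


print(css_image_urls(css_text, css_url))
```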

# -*- coding: utf-8 -*-
import http.client
import os
import re
import urllib.request
from urllib.parse import urljoin, urlsplit


def httpExists(url):
    """Return True if a HEAD request for url answers 200 (302 redirects are followed)."""
    parts = urlsplit(url)
    host = parts.netloc
    path = parts.path or '/'
    if ':' in host:
        # port specified, try to use it
        host, port = host.split(':', 1)
        try:
            port = int(port)
        except ValueError:
            print('invalid port number %r' % (port,))
            return False
    else:
        # no port specified, use default port
        port = None
    if parts.scheme == 'https':
        connection_class = http.client.HTTPSConnection
    else:
        connection_class = http.client.HTTPConnection
    try:
        connection = connection_class(host, port=port)
        connection.request("HEAD", path)
        resp = connection.getresponse()
        if resp.status == 200:       # normal 'found' status
            found = True
        elif resp.status == 302:     # recurse on temporary redirect
            found = httpExists(urljoin(url, resp.getheader('location', '')))
        else:                        # everything else -> not found
            print("Status %d %s : %s" % (resp.status, resp.reason, url))
            found = False
    except Exception as e:
        print(e.__class__, e, url)
        found = False
    return found


def gGetFileName(url):
    """Get the file name from a url."""
    if url is None:
        return None
    if url == "":
        return ""
    return url.split("/")[-1]


def gDownloadWithFilename(url, savePath, fileName):
    """Download the file at url and save it under savePath as fileName."""
    # parameter checks are skipped for now
    try:
        with urllib.request.urlopen(url) as fp:
            data = fp.read()
        print('download file url :', url)
        with open(os.path.join(savePath, fileName), 'wb') as out:
            out.write(data)
    except OSError:
        # urllib's URLError and HTTPError are subclasses of OSError
        print("download error! " + url)


def gDownload(url, savePath):
    fileName = gGetFileName(url)
    gDownloadWithFilename(url, savePath, fileName)


def getRexgList(lines, regx, searchRegx):
    """Collect the unique groups captured by searchRegx on lines that match regx."""
    lists = []
    if lines is None:
        return lists
    for line in lines:
        if re.search(regx, line, re.IGNORECASE):
            matchs = re.search(searchRegx, line, re.IGNORECASE)
            if matchs is not None:
                for item in matchs.groups():
                    if item not in lists:
                        lists.append(item)
    return lists


def checkLine(lines):
    for line in lines:
        matchs = re.search(r'url\((\S+)\)', line, re.IGNORECASE)
        if matchs is not None:
            print(matchs.groups())


def getPageLines(url):
    """Return the page at url as a list of text lines, or None on failure."""
    if url is None:
        return None
    if not httpExists(url):
        return None
    try:
        with urllib.request.urlopen(url) as page:
            # assume UTF-8; undecodable bytes are replaced rather than raising
            return page.read().decode('utf-8', errors='replace').splitlines()
    except Exception:
        print("getPageLines() error!")
        return None


def getCurrentPageImage(url, savePath):
    lines = getPageLines(url)
    if lines is None:
        return
    print('lines.length', len(lines))
    regxlists = getRexgList(lines, r'src\s*="images(\S+)"', r'src\s*="(\S+)"')
    if not regxlists:
        return
    print('getCurrentPageImage() images.length', len(regxlists))
    for jpg in regxlists:
        gDownload(urljoin(url, jpg), savePath)


def getCSSImages(link, savePath):
    lines = getPageLines(link)
    if lines is None:
        return
    print('lines.length', len(lines))
    regxlists = getRexgList(lines, r'url\((\S+)\)', r'url\((\S+)\)')
    if not regxlists:
        return
    print('getCSSImages() images.length', len(regxlists))
    for jpg in regxlists:
        # strip optional quotes; url(...) paths are relative to the CSS file
        gDownload(urljoin(link, jpg.strip('\'"')), savePath)


def gGetHtmlLink(url):
    """Get the .htm/.html links found on the page at url, returned as a list."""
    # parameter checks are skipped for now
    rtnList = []
    lines = getPageLines(url)
    regx = r'href="?(\S+)\.htm'
    for link in getRexgList(lines, regx, r'href="(\S+)"'):
        link = urljoin(url, link)
        if link not in rtnList:
            rtnList.append(link)
            print(link)
    return rtnList


def gGetCSSLink(url):
    """Get the .css links found on the page at url, returned as a list."""
    # parameter checks are skipped for now
    rtnList = []
    lines = getPageLines(url)
    regx = r'href="?(\S+)\.css'
    for link in getRexgList(lines, regx, r'href="(\S+)"'):
        link = urljoin(url, link)
        if link not in rtnList:
            rtnList.append(link)
    return rtnList


def getPageImage(url, savePath):
    # Images referenced directly by the page (currently disabled):
    # getCurrentPageImage(url, savePath)

    # Images on linked HTML pages (currently disabled):
    # for link in gGetHtmlLink(url):
    #     print('get images on linked html page:', link)
    #     getCurrentPageImage(link, savePath)

    # Images referenced from the linked CSS files
    for link in gGetCSSLink(url):
        print('get images on link:', link)
        getCSSImages(link, savePath)


if __name__ == '__main__':
    url = 'http://www.templatemo.com/templates/templatemo_281_chrome/'
    savePath = 'd:/tmp/'
    print('download pic from [' + url + ']')
    print('save to [' + savePath + '] ...')
    getPageImage(url, savePath)
    print("download finished")
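The script above matches tags line by line with regular expressions, so it can miss tags that span several lines or use single quotes. As an alternative technique (not the approach used in this article), the standard library's html.parser module can collect the src attributes instead. The class name ImageSrcParser, the sample markup, and the base address below are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "img":
            return
        for name, value in attrs:
            if name == "src" and value:
                absolute = urljoin(self.base_url, value)
                if absolute not in self.image_urls:
                    self.image_urls.append(absolute)


# Hypothetical markup and page address, used only for illustration.
sample = """<div>
<img
   src='images/a.jpg'>
<IMG SRC="images/b.png"/>
</div>"""
parser = ImageSrcParser("http://example.com/site/")
parser.feed(sample)
print(parser.image_urls)
```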

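In practice, how you derive the image address depends on how the target site writes its URLs, so analyze each case before choosing a matching rule.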
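To illustrate why the URL form matters, the sketch below resolves several common ways of writing an image path against the same page address. The page address and the candidate paths are invented for the example.

```python
from urllib.parse import urljoin

# Hypothetical page address, used only for illustration.
page_url = "http://example.com/site/page/index.html"

candidates = [
    "image/bg.jpg",                      # relative to the page's directory
    "../image/bg.jpg",                   # one directory up
    "/image/bg.jpg",                     # relative to the site root
    "//cdn.example.com/bg.jpg",          # scheme-relative, reuses http:
    "http://other.example.com/bg.jpg",   # already absolute, left unchanged
]

resolved = {src: urljoin(page_url, src) for src in candidates}
for src, full in resolved.items():
    print(src, "->", full)
```

Simple concatenation of the page address and the path only gives the right answer for the first form, and only when the page address ends in a slash.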

Reposted from: https://www.cnblogs.com/yangchengInfo/p/3279374.html
