References for this post: Introductory Python crawler tutorial: http://blog.csdn.net/wxg694175346/article/category/1418998

Downloading web images with a Python crawler: http://www.cnblogs.com/abelsu/p/4540711.html

1. Project Analysis
To bulk-load product data into the online store I built for experiments, I need to automatically collect a large number of product names, prices, and images from the web, save them locally, and then import them into my own web application for later use.
The reference posts above are enough to get started. One caveat: many examples online are written for Python 2.x, while today's environments generally run Python 3.x, so some code needs adjusting and some of the imported packages differ.
The project uses neither scrapy nor bs4 — it stays close to the raw standard library. The hardest part is analyzing the page source and extracting URLs with regular expressions, where two failure modes commonly diverge from expectations: the pattern matches nothing, or it matches too much, so each expression needs careful checking.
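As a quick sanity check against both failure modes, a pattern can be tried on a saved fragment of the page before it goes into the crawler. A minimal sketch (the fragment is lifted from the navigation snippet shown below; the exact patterns are my illustration, not yet the crawler's):

import re

# Fragment from the category navigation bar below; note there is no space
# between attributes, e.g. ...name="categCode"href="..."
sample = '<td><a name="categCode"href="/pc/column/products-1-0.html#refresh"value="1">大家电</a></td>'

# Matches nothing: the pattern insists on a space before href=, which this markup lacks.
print(re.findall(r' href="(/pc/column/products.+?\.html)#refresh"', sample))   # []

# Matches one URL per link: the non-greedy .+? plus the #refresh anchor keeps each
# match tight, where a greedy .+ would swallow several adjacent links at once.
print(re.findall(r'href="(/pc/column/products.+?\.html)#refresh"', sample))    # ['/pc/column/products-1-0.html']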
I chose to crawl 大聚惠 on Suning.com (a flash-sale channel similar to Taobao's 聚划算). The entry point is https://ju.suning.com/, and the URL of each category is easy to find in the page source:
<!--商品列表一级导航栏 [[ -->
<div class="ju-nav-wrapper"><div class="ju-nav"><table><tr><td class="active"><a name="columnId" id="0" value="1"href="/pc/new/home.html" name1="mps_index_qbsp_qb">全部商品</a></td><td><a name="categCode"href="/pc/column/products-1-0.html#refresh"value="1"name1="mps_index_qbsp_spml1">大家电</a></td><td><a name="categCode"href="/pc/column/products-2-0.html#refresh"value="2"name1="mps_index_qbsp_spml2">电脑数码</a></td><td><a name="categCode"href="/pc/column/products-17-0.html#refresh"value="17"name1="mps_index_qbsp_spml3">生活家电</a></td><td><a name="categCode"href="/pc/column/products-733-0.html#refresh"value="733"name1="mps_index_qbsp_spml4">手机</a></td><td><a name="categCode"href="/pc/column/products-81-0.html#refresh"value="81"name1="mps_index_qbsp_spml5">车品</a></td><td><a name="categCode"href="/pc/column/products-11-0.html#refresh"value="11"name1="mps_index_qbsp_spml6">居家日用</a></td><td><a name="categCode"href="/pc/column/products-10-0.html#refresh"value="10"name1="mps_index_qbsp_spml7">食品</a></td><td><a name="categCode"href="/pc/column/products-8-0.html#refresh"value="8"name1="mps_index_qbsp_spml8">美妆</a></td><td><a name="categCode"href="/pc/column/products-9-0.html#refresh"value="9"name1="mps_index_qbsp_spml9">母婴</a></td><td><a name="categCode"href="/pc/column/products-464-0.html#refresh"value="464"name1="mps_index_qbsp_spml10">服饰鞋包</a></td><td><a name="categCode"href="/pc/column/products-468-0.html#refresh"value="468"name1="mps_index_qbsp_spml11">纸品洗护</a></td><td><a name="categCode"href="/pc/column/products-125-0.html#refresh"value="125"name1="mps_index_qbsp_spml12">家装</a></td></tr></table></div>
</div>

Turning /pc/column/products-1-0.html into https://ju.suning.com/pc/column/products-1-0.html yields the listing page for the 大家电 (major appliances) category, which is then analyzed the same way (a small helper for this URL conversion follows the next snippet):

<a href="/pc/column/products-1-0.html#refresh" value="0" class="active" name1="mps_1_qbsp_ejqb">全 部</a>
<input type="hidden" value="P" id="secCategCodeBrand"/>
<a href="/pc/column/products-1-.html#P" value="P" class="floor" name1="mps_1_qbsp_ejml1">精选品牌</a>
<a href="/pc/column/products-1-139.html#139" value="139"  class="floor" name1="mps_1_qbsp_ejml1">厨卫</a>
<a href="/pc/column/products-1-137.html#137" value="137"  class="floor" name1="mps_1_qbsp_ejml2">冰箱</a>
<a href="/pc/column/products-1-191.html#191" value="191"  class="floor" name1="mps_1_qbsp_ejml3">彩电影音</a>
<a href="/pc/column/products-1-138.html#138" value="138"  class="floor" name1="mps_1_qbsp_ejml4">空调</a>
<a href="/pc/column/products-1-410.html#410" value="410"  class="floor" name1="mps_1_qbsp_ejml5">热水器</a>
<a href="/pc/column/products-1-409.html#409" value="409"  class="floor" name1="mps_1_qbsp_ejml6">洗衣机</a>
<a href="/pc/column/products-1-552.html#552" value="552"  class="floor" name1="mps_1_qbsp_ejml7">净水设备</a>
<a href="/pc/column/products-1-617.html#617" value="617"  class="floor" name1="mps_1_qbsp_ejml8">爆款预订</a>

This gives the second-level category URLs. Turning /pc/column/products-1-.html into https://ju.suning.com/pc/column/products-1-.html yields the 精选品牌 (featured brands) page, which is analyzed next:
<!-- 精选品牌列表 -->
<h5 id ="P" class="ju-prodlist-head"><span>精选品牌</span></h5>
<ul class="ju-prodlist-floor1 ju-prodlist-lazyBrand clearfix">
<li class="ju-brandlist-item" name="brandCollect" value="100036641"><a href="/pc/brandComm-100036641-1.html" title="帅康(sacon)" expotype="2" expo="mps_1_qbsp_jxpp1:帅康(sacon)" name1="mps_1_qbsp_jxpp1" target="_blank" shape="" class="brand-link"></a><img orig-src-type="1-4" orig-src="//image3.suning.cn/uimg/nmps/PPZT/1000592621751_2_390x195.jpg" width="390" height="195" class="brand-pic lazy-loading" alt="帅康(sacon)"><div class="sale clearfix"><span class="brand-countdown ju-timer" data-time-now="" name="dateNow" data-time-end="2017-09-06 23:59:57.0"></span><span class="brand-buynum" id="100036641"></span></div><div class="border"></div>
</li>
<li class="ju-brandlist-item" name="brandCollect" value="100036910"><a href="/pc/brandComm-100036910-1.html" title="富士通(FUJITSU)" expotype="2" expo="mps_1_qbsp_jxpp2:富士通(FUJITSU)" name1="mps_1_qbsp_jxpp2" target="_blank" shape="" class="brand-link"></a><img orig-src-type="1-4" orig-src="//image1.suning.cn/uimg/nmps/PPZT/1000601630663_2_390x195.jpg" width="390" height="195" class="brand-pic lazy-loading" alt="富士通(FUJITSU)"><div class="sale clearfix"><span class="brand-countdown ju-timer" data-time-now="" name="dateNow" data-time-end="2017-09-06 23:59:55.0"></span><span class="brand-buynum" id="100036910"></span></div><div class="border"></div>
</li>

Turning /pc/brandComm-100036641-1.html into https://ju.suning.com/pc/brandComm-100036641-1.html yields the page listing every product under the 帅康(sacon) brand, which is analyzed in turn:

<li class="ju-prodlist-item" id="6494577"><div class="item-wrap"><a title="帅康(sacon)烟灶套餐TE6789W+35C欧式不锈钢油烟机灶具套餐" expotype="1" expo="mpsblist_100036641_ppsp_mrsp1:0070068619|126962539" name1="mpsblist_100036641_ppsp_mrsp1" href="/pc/jusp/product-00010641eb93529d.html" target="_blank" shape="" class="prd-link"></a><img class="prd-pic lazy-loading" orig-src-type="0-1" orig-src="//image4.suning.cn/uimg/nmps/ZJYDP/100059262126962539picA_1_392x294.jpg" width="390" height="292"><div class="detail"><p class="prd-name fixed-height-name">帅康(sacon)烟灶套餐TE6789W+35C欧式不锈钢油烟机灶具套餐</p><p class="prd-desp-items fixed-height-desp"><span>17大吸力</span><span>销量TOP</span><span>一级能效</span><span>限时抢烤箱!</span></p></div><div class="sale clearfix"><div class="prd-price clearfix"><div class="sn-price"></div><div class="discount"><p class="full-price"></p></div></div><div class="prd-sale"><p class="prd-quan" id="000000000126962539-0070068619"></p><p class="sale-amount"></p></div></div></div><div class="border"></div>
</li>

Turning /pc/jusp/product-00010641eb93529d.html into https://ju.suning.com/pc/jusp/product-00010641eb93529d.html yields the display page for the product 帅康(sacon)烟灶套餐TE6789W+35C欧式不锈钢油烟机灶具套餐, and analyzing that page's source extracts the product details. The implementation follows below.
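To recap the whole drill-down in miniature before the full implementation, here is a sketch that fetches one brand page and lists the absolute product URLs on it — assuming the markup shapes shown above still hold (the site can change them at any time):

import re
import urllib.request
from urllib.parse import urljoin

BRAND_PAGE = "https://ju.suning.com/pc/brandComm-100036641-1.html"

with urllib.request.urlopen(BRAND_PAGE) as resp:
    html = resp.read().decode("utf-8")

# Each product item links to /pc/jusp/product-<id>.html (see the <li> snippet above).
for path in re.findall(r'href="(/pc/jusp/product-[^"]+?\.html)"', html):
    print(urljoin(BRAND_PAGE, path))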
2. Project Implementation

My first version follows the page analysis above literally, with one function per level calling down into the next:
import urllib
import urllib.parse
import urllib.request
import re
import threading
import queue
import time

q = queue.Queue()
r = re.compile(r'href="(http://ju\.suning\.com/pc/jusp/product.+?)"')
urls = []

# Level 4: individual product pages
def save_products_from_url(contents):
    category_products = re.findall('href="/pc/jusp/product.+?.html"', contents, re.S)
    print('all level-4 (product) links')
    print(category_products)
    for url_product in category_products:
        url_product = url_product.replace("\"", "")
        url_product = url_product.replace("href=", "")
        url_product = url_product.replace("/pc", "http://ju.suning.com/pc")
        if url_product in urls:
            continue
        else:
            urls.append(url_product)  # remember it so the same product is not fetched twice
            html = download_page(url_product)
            get_image(html)
            # sleep between requests, otherwise the site treats this as abusive traffic and cuts us off
            time.sleep(1)
    return

# Level 3: brand pages
def save_brand_from_url(contents):
    category_brand = re.findall('href="/pc/brandComm.+?.html"', contents, re.S)
    print('all level-3 (brand) links')
    print(category_brand)
    for url_brand in category_brand:
        url_brand = url_brand.replace("\"", "")
        url_brand = url_brand.replace("href=", "")
        url_brand = url_brand.replace("/pc", "http://ju.suning.com/pc")
        if url_brand in urls:
            continue
        else:
            urls.append(url_brand)
            q.put(url_brand)
            print('level-3 (brand) link:')
            print(url_brand)
            opener = urllib.request.urlopen(url_brand)
            contents = opener.read()
            contents = contents.decode("utf-8")
            opener.close()
            time.sleep(1)
            save_products_from_url(contents)

# Level 2: sub-categories (e.g. 空调)
def save_contents_from_url(contents):
    regx = r'href="/pc/column/products-[\d]{1,3}-[\d][\d][\d].html'
    pattern = re.compile(regx)
    category_two = re.findall(pattern, repr(contents))
    print('all level-2 links')
    print(category_two)
    for url_two in category_two:
        url_two = url_two.replace("\"", "")
        url_two = url_two.replace("href=", "")
        url_two = url_two.replace("/pc", "http://ju.suning.com/pc")
        if url_two in urls:
            continue
        else:
            urls.append(url_two)
            q.put(url_two)
            print('level-2 link:')
            print(url_two)
            opener = urllib.request.urlopen(url_two)
            contents = opener.read()
            contents = contents.decode("utf-8")
            opener.close()
            time.sleep(1)
            save_brand_from_url(contents)

# Level 1: top categories (e.g. 大家电)
def set_urls_from_contents(contents):
    g = re.findall('href="/pc/column/products.+?.html#refresh"', contents, re.S)
    print('all level-1 links')
    print(g)
    for url in g:
        print('level-1 link:')
        print(url)
        url = url.replace("\"", "")
        url = url.replace("#refresh", "")
        url = url.replace("href=", "")
        url = url.replace("/pc", "http://ju.suning.com/pc")
        print(url)
        if url in urls:
            continue
        else:
            urls.append(url)
            q.put(url)
            opener = urllib.request.urlopen(url)
            contents = opener.read()
            contents = contents.decode("utf-8")
            opener.close()
            time.sleep(1)
            save_contents_from_url(contents)

def save_contents():
    url = "https://ju.suning.com/"
    opener = urllib.request.urlopen(url)
    contents = opener.read()
    contents = contents.decode("utf-8")
    opener.close()
    print('home page')
    print(url)
    set_urls_from_contents(contents)

# Fetch the raw bytes of one page (also used to download images)
def download_page(url):
    request = urllib.request.Request(url)
    response = urllib.request.urlopen(request)
    data = response.read()
    return data

# Extract price, title and images from one product page
def get_image(html):
    print('price')
    regx = r'sn.gbPrice ="\d*?.\d*?";'
    pattern = re.compile(regx)
    get_price = re.findall(pattern, repr(html))
    print(get_price)
    for title in get_price:
        myindex = title.index('"')
        newprice = title[myindex+1:len(title)-2]
        print(newprice)
    print('title')
    regx = r'<title>.*?苏宁大聚惠</title>'
    pattern = re.compile(regx)
    html = html.decode('utf-8')
    get_title = re.findall(pattern, repr(html))
    for title in get_title:
        myindex = title.index('【')
        newtitle = title[7:myindex]
        print(newtitle)
    regx = r'orig-src="//image[\d].suning.cn/uimg/nmps/ZJYDP/[\S]*\.jpg'
    pattern = re.compile(regx)
    get_img = re.findall(pattern, repr(html))
    num = 1
    for img in get_img:
        img = img.replace("\"", "")
        img = img.replace("orig-src=", "http:")
        print(img)
        index = img.index('picA')
        item_id = img[index-18:index]
        name = img[index-18:index] + '.jpg'
        print(name)
        image = download_page(img)
        with open(name, 'wb') as fp:
            fp.write(image)
        print('downloading image %s' % num)
        num += 1
        # append price | title | item id to a text file
        with open('items.txt', 'ab') as files:
            items = '|' + newprice + '|' + newtitle + '|' + item_id + '\r\n'
            items = items.encode('utf-8')
            files.write(items)
        time.sleep(1)
    return

q.put("https://ju.suning.com/")
ts = []
t = threading.Thread(target=save_contents)
t.start()
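A design note on this version: it throttles only with time.sleep and sends urllib's default User-Agent, which some sites reject outright. A hardened fetch might look like the sketch below — my own addition, not part of the original run; the UA string is just a typical desktop value and the 10-second timeout is an assumption:

import urllib.request

def fetch(url, timeout=10):
    # Present a browser-like User-Agent and bound each request with a timeout,
    # so one stalled connection cannot hang the whole crawl.
    req = urllib.request.Request(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # illustrative value
    })
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()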

Written this way the flow is clear and easy to follow, but it is clumsy — it never uses the recursion that crawlers typically rely on. So I wrote a second version; the tricky part is that every level needs a different regular expression, but the rewrite finally works recursively.

import urllib
import urllib.parse
import urllib.request
import re
import threading
import queue
import time

q = queue.Queue()
mylock = threading.RLock()
urls = []
level = 0
category = 0
categorysed = 0

# (unused) maps a level number to its regex; regx_1..regx_4 are locals of
# set_urls_from_contents, so calling this at module level would fail
def numbers_to_strings(argument):
    switcher = {
        1: regx_1,
        2: regx_2,
        3: regx_3,
        4: regx_4,
    }
    return switcher.get(argument, "nothing")

def set_urls_from_contents(contents):
    global level
    global category
    global categorysed
    # level 1: top categories
    regx_1 = r'href="/pc/column/products.+?.html#refresh"'
    # level 2: sub-categories
    regx_2 = r'href="/pc/column/products-[\d]{1,3}-[\d][\d][\d].html'
    # level 3: brands
    regx_3 = r'href="/pc/brandComm.+?.html"'
    # level 4: products
    regx_4 = r'href="/pc/jusp/product-.+?.html"'
    # try the most specific pattern first and fall back level by level
    pattern = re.compile(regx_4)
    g = re.findall(pattern, repr(contents))
    if len(g) > 0:
        level = 4
    else:
        level = 0
        print('no product links matched')
    print(str(level)+':1')
    if level == 0:
        pattern = re.compile(regx_3)
        g = re.findall(pattern, repr(contents))
        if len(g) > 0:
            level = 3
        else:
            level = 0
            print('no brand links matched')
    else:
        print('brand level skipped')
    print(str(level)+':2')
    if level == 0:
        pattern = re.compile(regx_2)
        g = re.findall(pattern, repr(contents))
        if len(g) > 0:
            level = 2
        else:
            level = 0
            print('no level-2 links matched')
    else:
        print('level 2 skipped')
    print(str(level)+':3')
    if level == 0:
        pattern = re.compile(regx_1)
        g = re.findall(pattern, repr(contents))
        if len(g) > 0:
            level = 1
        else:
            level = 0
            print('no level-1 links matched')
    else:
        print('level 1 skipped')
    print(str(level)+':4')
    print('all matched links')
    print(g)
    for url in g:
        #url = url.groups()[0]
        print('level '+str(level)+' link:')
        print(url)
        if url.find('#refresh') > 0:
            eindex = url.index('.html')
            print(eindex)
            sindex = url.index('s-')
            category = url[sindex+2:eindex-2]
            print('level-1 category id')
            print(category)
        elif url.find('products-') > 0:
            eindex = url.index('.html')
            #sindex = url.index('s-')
            categorysed = url[eindex-3:eindex]
            print('level-2 category id')
            print(categorysed)
        url = url.replace("\"", "")
        url = url.replace("#refresh", "")
        url = url.replace("href=", "")
        url = url.replace("/pc", "http://ju.suning.com/pc")
        print(url)
        if url.find('product-') > 0:
            level = 4
        else:
            # non-product links descend one level; the recursive call re-detects the level anyway
            level = level - 1
        if url in urls:
            continue
        else:
            urls.append(url)
            q.put(url)
            if level == 4:
                html = download_page(url)
                get_image(html, category, categorysed)
            else:
                opener = urllib.request.urlopen(url)
                contents = opener.read()
                contents = contents.decode("utf-8")
                opener.close()
                time.sleep(0.1)
                set_urls_from_contents(contents)

def save_contents():
    url = "https://ju.suning.com/"
    opener = urllib.request.urlopen(url)
    contents = opener.read()
    contents = contents.decode("utf-8")
    opener.close()
    print('home page')
    print(url)
    set_urls_from_contents(contents)

# fetch the raw bytes of one page (also used for images)
def download_page(url):
    request = urllib.request.Request(url)
    response = urllib.request.urlopen(request)
    data = response.read()
    return data

# extract price, title and images from a product page
def get_image(html, category, categorysed):
    print('price')
    regx = r'sn.gbPrice ="\d*?.\d*?";'
    pattern = re.compile(regx)
    get_price = re.findall(pattern, repr(html))
    print(get_price)
    for title in get_price:
        myindex = title.index('"')
        newprice = title[myindex+1:len(title)-2]
        print(newprice)
    print('title')
    regx = r'<title>.*?苏宁大聚惠</title>'
    pattern = re.compile(regx)
    html = html.decode('utf-8')
    get_title = re.findall(pattern, repr(html))
    for title in get_title:
        myindex = title.index('【')
        newtitle = title[7:myindex]
        print(newtitle)
    regx = r'orig-src="//image[\d].suning.cn/uimg/nmps/ZJYDP/[\S]*\.jpg'
    pattern = re.compile(regx)
    get_img = re.findall(pattern, repr(html))
    num = 1
    for img in get_img:
        img = img.replace("\"", "")
        img = img.replace("orig-src=", "http:")
        print(img)
        index = img.index('pic')
        item_id = img[index-18:index]
        name = img[index-18:index] + '.jpg'
        print(name)
        image = download_page(img)
        with open(name, 'wb') as fp:
            fp.write(image)
        print('downloading image %s' % num)
        num += 1
        # append category ids, price, title and item id to a text file
        with open('items.txt', 'ab') as files:
            items = str(category)+'|'+str(categorysed)+'|'+newprice+'|'+newtitle+'|'+item_id+'\r\n'
            items = items.encode('utf-8')
            files.write(items)
        time.sleep(1)
    return

# entry point: home page
q.put("https://ju.suning.com/")
ts = []
t = threading.Thread(target=save_contents)
t.start()
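One more design observation: both versions push every URL into q = queue.Queue(), but nothing ever consumes the queue, and only a single thread runs. The recursion could instead be replaced by worker threads draining that queue, fetching several pages in parallel. A rough sketch of that shape — my variation, not the post's code; parse_links is a hypothetical stand-in for the per-level regex matching above:

import queue
import threading
import urllib.request

q = queue.Queue()
seen = set()
seen_lock = threading.Lock()

def parse_links(html):
    # Placeholder: apply the level-appropriate regexes from set_urls_from_contents
    # and return the absolute URLs found on this page.
    return []

def worker():
    while True:
        url = q.get()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8")
            for link in parse_links(html):
                with seen_lock:
                    if link in seen:
                        continue
                    seen.add(link)
                q.put(link)
        except Exception as exc:
            print('failed:', url, exc)
        finally:
            q.task_done()

q.put("https://ju.suning.com/")
for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()
q.join()  # blocks until every queued page has been processed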

3. Running the Crawler

I started the program before going to bed and found nearly three thousand records the next morning. The program reported no errors, so the computer had presumably gone to sleep and dropped the network connection — but the data collected was already more than enough.
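For a run that has to survive an unattended night, each fetch could also be wrapped in a retry with backoff, so a brief network drop does not end the crawl — a sketch, with the attempt count and pauses as assumptions:

import time
import urllib.error
import urllib.request

def fetch_with_retry(url, attempts=3, backoff=5):
    # Retry transient network failures with a growing pause between attempts.
    for i in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read()
        except urllib.error.URLError as exc:
            print('attempt %d failed for %s: %s' % (i + 1, url, exc))
            time.sleep(backoff * (i + 1))
    raise RuntimeError('giving up on ' + url)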
