Scraping Images from Autohome (汽车之家)

Goal: scrape the photos of one particular car model from Autohome.

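I. Plain Scrapy spider

Step 1: Page analysis

Target URLs:
https://car.autohome.com.cn/photolist/series/265/p1/ (page 1)
https://car.autohome.com.cn/photolist/series/265/p2/ (page 2)
https://car.autohome.com.cn/photolist/series/265/p3/ (page 3)
- Looking at the URLs, 265 is clearly the ID of the car series.
- p1, p2, ... encode the page number.
Image URLs:
- Large image: https://car2.autoimg.cn/cardfs/product/g25/M0B/29/A8/800x0_1_q95_autohomecar__wKgHIlrwJHaAK02EAAsUwWrTmXY510.jpg
- Thumbnail:
  https://car2.autoimg.cn/cardfs/product/g25/M0B/29/A8/240x180_0_q95_c42_autohomecar__wKgHIlrwJHaAK02EAAsUwWrTmXY510.jpg
- The two URLs differ only in the size segment: 240x180_0_q95_c42 for the thumbnail, 800x0_1_q95 for the large image.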
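
The URL patterns observed above can be captured in two small helpers. This is only a sketch: the function names page_url and to_large_url are made up for illustration, and it assumes every thumbnail uses the same 240x180_0_q95_c42 size segment as the sample shown here.

```python
# Hypothetical helpers; the names are illustrative and not part of the project.
THUMB = ("https://car2.autoimg.cn/cardfs/product/g25/M0B/29/A8/"
         "240x180_0_q95_c42_autohomecar__wKgHIlrwJHaAK02EAAsUwWrTmXY510.jpg")
LARGE = ("https://car2.autoimg.cn/cardfs/product/g25/M0B/29/A8/"
         "800x0_1_q95_autohomecar__wKgHIlrwJHaAK02EAAsUwWrTmXY510.jpg")


def page_url(series_id: int, page: int) -> str:
    """Build the photo-list URL for a series ID and a 1-based page number."""
    return f"https://car.autohome.com.cn/photolist/series/{series_id}/p{page}/"


def to_large_url(thumb_url: str) -> str:
    """Swap the thumbnail size segment for the large-image one.

    Assumption: the size segment is always the one seen in the sample URL.
    """
    return thumb_url.replace("240x180_0_q95_c42", "800x0_1_q95")


if __name__ == "__main__":
    print(page_url(265, 2))
    print(to_large_url(THUMB) == LARGE)
```

If a thumbnail uses a different size segment, to_large_url returns the URL unchanged, so the download would silently fetch the small image.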

Step 2: Implementation steps

1. Create the Scrapy project:
scrapy startproject lsls
2. Create the spider:
scrapy genspider hy car.autohome.com.cn
3. Implement the logic.

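(1) Preparation

Run the following in a terminal:

scrapy startproject lsls
# The spider name should not be the same as the project name
cd lsls
scrapy genspider hy car.autohome.com.cn

Create a start.py file in the same directory as scrapy.cfg:

# Running this file runs the whole crawl
from scrapy import cmdline

# Equivalent: cmdline.execute('scrapy crawl hy'.split())
cmdline.execute(['scrapy', 'crawl', 'hy'])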

(2) settings.py

The usual boilerplate settings:

LOG_LEVEL = 'WARNING'

ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable the item pipeline
ITEM_PIPELINES = {
    'lsls.pipelines.LslsPipeline': 300,
}

# Enable the custom downloader middleware that sets a random User-Agent
DOWNLOADER_MIDDLEWARES = {
    # 'lsls.middlewares.LslsDownloaderMiddleware': 543,
    'lsls.middlewares.UserAgentDownloaderMiddleware': 543,
}

(3) hy.py

import scrapy
from lsls.items import LslsItem


class HySpider(scrapy.Spider):
    name = 'hy'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/photolist/series/265/p1/']
    print('Crawling page 1')
    n = 1  # current page number

    def parse(self, response):
        img_list = response.xpath('//ul[@id="imgList"]/li')
        for img in img_list:
            src = img.xpath('./a/img/@src').get()
            # If src is missing or does not end in 'g' (.jpg/.png), use src2 instead
            if not src or src[-1] != 'g':
                src = img.xpath('./a/img/@src2').get()
            if not src:
                continue  # no usable image URL on this entry
            # Prepend the scheme and switch to the large image
            url = 'https:' + src.replace('240x180_0_q95_c42', '800x0_1_q95')
            title = img.xpath('./div/a/text()').get()
            item = LslsItem(title=title, url=url)
            yield item

        # Pagination
        next_btn = response.xpath('//div[@class="page"]/a[@class="page-item-next"]')
        if next_btn:
            self.n += 1
            print(f'Crawling page {self.n}')
            url = f'https://car.autohome.com.cn/photolist/series/265/p{self.n}/'
            yield scrapy.Request(url=url)
        else:
            print('All pages crawled')
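
The spider reads the src attribute and falls back to src2 when src does not look like an image path, then prepends the scheme by hand. A slightly more defensive version of that step could look like the sketch below. The helper name resolve_img_src is hypothetical, the extension check is an assumption about what a real image URL looks like on this page, and urljoin from the standard library is used in place of string concatenation so that both protocol-relative and absolute URLs are handled.

```python
from typing import Optional
from urllib.parse import urljoin

# Page the attributes were read from; used as the base for relative URLs.
PAGE_URL = "https://car.autohome.com.cn/photolist/series/265/p1/"

# Assumption: real image URLs end in one of these extensions.
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png")


def resolve_img_src(src: Optional[str], src2: Optional[str],
                    base: str = PAGE_URL) -> Optional[str]:
    """Pick src if it looks like an image, else src2, and make it absolute."""
    if src and src.lower().endswith(IMAGE_SUFFIXES):
        candidate = src
    else:
        candidate = src2
    if not candidate:
        return None  # neither attribute holds a usable URL
    # urljoin turns '//host/path' into 'https://host/path' using the base scheme
    # and leaves already-absolute URLs untouched.
    return urljoin(base, candidate)


if __name__ == "__main__":
    print(resolve_img_src("//car2.autoimg.cn/a/240x180_x.jpg", None))
    print(resolve_img_src("placeholder", "//car2.autoimg.cn/b/240x180_y.jpg"))
```

Returning None instead of raising lets the caller skip an entry without aborting the whole page.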

(4) items.py

import scrapy


class LslsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()

(5) middlewares.py

The generated template code stays unchanged; the following class is added:

from scrapy import signals
from fake_useragent import UserAgent
import random


class UserAgentDownloaderMiddleware:
    USER_AGENTS = [
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
        "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    ]

    # Option 1: pick from the list above (the selection strategy can be changed here)
    # def process_request(self, request, spider):
    #     user_agent = random.choice(self.USER_AGENTS)
    #     request.headers['User-Agent'] = user_agent

    # Option 2: let fake_useragent supply a random User-Agent
    def process_request(self, request, spider):
        ua = UserAgent()
        user_agent = ua.random
        request.headers['User-Agent'] = user_agent

(6) pipelines.py

import os
import urllib.request


class LslsPipeline:
    def open_spider(self, spider):
        # Counts how often each title has been seen, so file names stay unique
        self.title_list = {}

    def process_item(self, item, spider):
        # The spider already prepends 'https:', so the URL is used as-is
        url = item['url']
        title = item['title']
        if title in self.title_list:
            self.title_list[title] += 1
        else:
            self.title_list[title] = 1
        path = r'D:\python_lec\全栈开发\爬虫项目\爬虫小练习\qczj\图片下载'
        filename = os.path.join(path, f'{title} {self.title_list[title]}.jpg')
        urllib.request.urlretrieve(url=url, filename=filename)
        return item

The saved images are the large (800-pixel-wide) versions.
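
The pipeline appends a running number to each title so that photos sharing a title do not overwrite one another. The same bookkeeping can be isolated in a small class, sketched below. FilenameAllocator is a hypothetical name, and the sanitising step (replacing characters that Windows does not allow in file names, and substituting "untitled" for a missing title) is an addition that the original pipeline does not perform.

```python
import os
import re
from collections import defaultdict
from typing import Optional


class FilenameAllocator:
    """Hand out unique '<title> <n>.jpg' paths inside one directory."""

    # Characters that are not allowed in Windows file names.
    _FORBIDDEN = re.compile(r'[\\/:*?"<>|]')

    def __init__(self, directory: str) -> None:
        self.directory = directory
        self.counts = defaultdict(int)  # title -> number of files issued so far

    def next_path(self, title: Optional[str]) -> str:
        # Assumption: a missing title is stored under the name 'untitled'.
        safe = self._FORBIDDEN.sub("_", (title or "untitled").strip())
        self.counts[safe] += 1
        return os.path.join(self.directory, f"{safe} {self.counts[safe]}.jpg")


if __name__ == "__main__":
    alloc = FilenameAllocator("images")
    print(alloc.next_path("front view"))
    print(alloc.next_path("front view"))
    print(alloc.next_path("a/b"))
```

Using os.path.join instead of concatenating a backslash keeps the path valid on other operating systems as well.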

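II. CrawlSpider

Pagination is simpler with CrawlSpider, because link-following rules replace the manual next-page logic.

(1) Preparation

scrapy startproject qczj
# The spider name should not be the same as the project name
cd qczj
scrapy genspider lsls car.autohome.com.cn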

(2) lsls.py

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from qczj.items import QczjItem  # assumed to define the fields img and name


class LslsSpider(CrawlSpider):
    name = 'lsls'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/photolist/series/265/p1/']

    rules = (
        # Listing pages: \d+ matches any page number (the class [1-17] would match only "1" or "7")
        Rule(LinkExtractor(allow=r'https://car\.autohome\.com\.cn/photolist/series/265/p\d+/'), follow=True),
        # Detail pages
        Rule(LinkExtractor(allow=r'https://car\.autohome\.com\.cn/photo/series/31145/\d+/\d+\.html'), callback='parse_item'),
    )

    def parse_item(self, response):
        item = QczjItem()
        img = response.xpath('//*[@id="img"]/@src').get()
        name = response.xpath('//*[@id="czts"]/div/div/p[1]/a/text()').get()
        item['img'] = img
        item['name'] = name
        return item
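
The listing-page rule depends on a regular expression that matches the page-number segment of the URL. A character class such as [1-17] is easy to misread as "1 to 17"; it actually means "the range 1 to 1, or the character 7", so it matches only a single 1 or 7. The short check below, which uses only the standard re module, compares that class with \d+. The URL list is made up for illustration.

```python
import re

BASE = "https://car.autohome.com.cn/photolist/series/265/"

# [1-17] is a character class containing just '1' and '7'.
narrow = re.compile(r"photolist/series/265/p[1-17]/")
# \d+ matches one or more digits, i.e. any page number.
broad = re.compile(r"photolist/series/265/p\d+/")

# Illustrative page URLs, not taken from a real crawl.
urls = [f"{BASE}p{n}/" for n in (1, 2, 7, 12, 17)]

narrow_hits = [u for u in urls if narrow.search(u)]
broad_hits = [u for u in urls if broad.search(u)]

if __name__ == "__main__":
    print(len(narrow_hits), "of", len(urls), "match [1-17]")
    print(len(broad_hits), "of", len(urls), "match \\d+")
```

Because CrawlSpider deduplicates requests, matching every page link with \d+ does not cause pages to be fetched twice.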

(3) pipelines.py

import os
import urllib.request


class QczjPipeline:
    def open_spider(self, spider):
        # Counts how often each name has been seen, so file names stay unique
        self.title_list = {}

    def process_item(self, item, spider):
        # The spider stores the src attribute as-is, so the scheme is prepended here
        url = 'https:' + item['img']
        name = item['name']
        if name in self.title_list:
            self.title_list[name] += 1
        else:
            self.title_list[name] = 1
        path = r'D:\python_lec\全栈开发\爬虫项目\爬虫小练习\qczj\图片下载'
        filename = os.path.join(path, f'{name} {self.title_list[name]}.jpg')
        urllib.request.urlretrieve(url=url, filename=filename)
        return item
