Python3 [Crawler in Practice]: Using Scrapy to crawl every link on Autohome (汽车之家) and save them to a JSON file

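Last night I ended up reading the blog of Cui Qingcai (something of a guru) and wanted to try crawling an entire website. The Fuliba (福利吧) site can no longer be found, so I wandered over to Autohome instead (http://www.autohome.com.cn/beijing/).

I really like this site. Even women like cars, to say nothing of men. (facepalm)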

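Here is the approach:

1. Use the CrawlSpider spider class.

2. Use Rule.

Used together, these two make it possible to crawl a whole site.

3. Use LinkExtractor together with Rule to match URLs against a pattern.

4. FormRequest is the Scrapy request class for submitting forms, which is what you would use to log in.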
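As a rough illustration of item 3: as far as I know, LinkExtractor treats each `allow` entry as a regular expression and searches for it anywhere in the absolute URL. The sketch below imitates only that matching step with the standard `re` module. The helper name `is_allowed` and the sample URLs are made up for the example and are not part of Scrapy.

```python
import re

# The same pattern the spider passes to LinkExtractor(allow=...).
ALLOW = re.compile(r"\.html")


def is_allowed(url: str) -> bool:
    """Hypothetical helper: True if the pattern occurs anywhere in the URL."""
    return ALLOW.search(url) is not None


# Made-up sample URLs, for illustration only.
sample_urls = [
    "http://www.autohome.com.cn/news/201708/906270.html",
    "http://www.autohome.com.cn/news/906270.html?page=2",
    "http://www.autohome.com.cn/shanghai/",
]

matched = [u for u in sample_urls if is_allowed(u)]
for url in matched:
    print(url)
```

Because the pattern is not anchored, a URL with a query string after `.html` still matches. To accept only URLs that end in `.html`, the pattern could be written as `\.html$` instead.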

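Note: the whole-site crawl here simply prints every URL that matches .html and saves it to a JSON file.

From there we could go further and extract the content behind each URL.

The idea for extraction: open any one of those URLs, find the content you want, and pull it out with XPath.

The usual way to get an XPath: select the line that holds the content you want, right-click, then Copy --> Copy XPath. Experienced people say it is best to use Chrome's XPath, because Firefox sometimes fails to pick up the element you are after.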

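A few simple, commonly used XPath rules:

//*[@id="post_content"]/p[1]

This means: anywhere in the document, find the element whose id is post_content, then take its first p child (p[1]; XPath counts from 1).

If what you need is the text of that element, append a little more so it looks like this:

//*[@id="post_content"]/p[1]/text()

The trailing text() selects the text inside the element.

To get an attribute of the element instead, replace text() with @ followed by the attribute name, for example:

//*[@id="post_content"]/p[1]/@src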

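So easy! That is the XPath done. Using it is even simpler:

response.xpath('the XPath you copied').extract()[index of the value you want]

Note that extract() returns a list by default, even when there is only one match.

Those are the basic extraction rules. Easy to follow, right? I think so too; it is much easier than what I studied before. Then again, maybe I am still just a beginner. Ha.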
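To make the "always a list" point concrete without a running spider, here is a small sketch that uses the standard library's `xml.etree.ElementTree` as a stand-in for `response.xpath`. This is a substitution on my part: ElementTree supports only a subset of XPath (no `text()` or `@attr` steps, and paths need a leading `.`), so the text is read from the element's `.text` attribute instead. The HTML snippet is invented for the example.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for a real page (made up for this example).
PAGE = """
<html><body>
  <div id="post_content">
    <p>First paragraph</p>
    <p>Second paragraph</p>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE.strip())

# The result is a list even though only one element matches.
first_p = root.findall(".//*[@id='post_content']/p[1]")
print(len(first_p))

# Indexing picks one element; .text plays the role of the /text() step.
print(first_p[0].text)
```

In a Scrapy callback, the counterpart of the last two lines would be `response.xpath('//*[@id="post_content"]/p[1]/text()').extract()[0]`.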

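An appendix:

About the XPath for the image URLs:

Pick any image on the page and use Copy XPath; you get something like this:

//*[@id="post_content"]/p[2]/img

Look at the page and you will see that each post has several images, and each image URL sits in the src attribute of an img tag inside a p tag.

Drop the [2] so it becomes:

//*[@id="post_content"]/p/img

Now it matches the img tags under every p tag. Append /@src and you get all the image URLs. (There is no [0] at the end because we want every URL; adding it would return only one.)

For more on XPath usage and features, w3cschool is a good place to look.

Clearly I have not spent much time on the w3c material. I should make time for it; it is the basics, after all.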
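The difference between `p[2]/img` and `p/img` can be checked on a toy document. The sketch below again uses the standard library's `xml.etree.ElementTree` in place of `response.xpath` (my substitution; it lacks the `/@src` step, so the attribute is read with `.get("src")`). The markup and the example.com URLs are invented.

```python
import xml.etree.ElementTree as ET

# Made-up markup: one text paragraph followed by two image paragraphs.
PAGE = """
<body>
  <div id="post_content">
    <p>intro text</p>
    <p><img src="http://example.com/a.jpg" /></p>
    <p><img src="http://example.com/b.jpg" /></p>
  </div>
</body>
"""

root = ET.fromstring(PAGE.strip())

# With the index: only the image inside the second p element.
one = [img.get("src") for img in root.findall(".//*[@id='post_content']/p[2]/img")]

# Without the index: the images inside every p element.
all_srcs = [img.get("src") for img in root.findall(".//*[@id='post_content']/p/img")]

print(one)
print(all_srcs)
```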

Right, that is enough rambling. I really do talk too much.

Here is the code; the comments inside explain things in detail.

Step 1:

The spider file

# -*- coding: utf-8 -*-
# @Time : 2017/8/27 0:43
# @Author : 蛇崽
# @Email : 17193337679@163.com (mainly practice for whole-site crawling)
# @File : LongXunDaoHangSpider.py

# CrawlSpider and Rule together walk the whole site
from scrapy.spiders import CrawlSpider, Rule
# LinkExtractor works with Rule to match URLs against a pattern
from scrapy.linkextractors import LinkExtractor
# Request is the basic request class; FormRequest submits forms (e.g. a login).
# Neither is used in this spider yet.
from scrapy import FormRequest, Request

from allNet.items import LongXunDaoHang


class longxunDaoHang(CrawlSpider):
    name = 'longxun'
    allowed_domains = ['autohome.com.cn']
    start_urls = ['http://www.autohome.com.cn/shanghai/']

    rules = (
        # Follow every link whose URL matches \.html and pass each page to parse_item
        Rule(LinkExtractor(allow=(r'\.html',)), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(response.url)
        daohang = LongXunDaoHang()
        daohang['categoryLink'] = response.url
        yield daohang

Step 2:

Contents of settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for allNet project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'allNet'

SPIDER_MODULES = ['allNet.spiders']
NEWSPIDER_MODULE = 'allNet.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'allNet (+http://www.yourdomain.com)'

# Obey robots.txt rules (turned off here)
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Cookies stay enabled (this is also Scrapy's default)
COOKIES_ENABLED = True

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'allNet.middlewares.AllnetSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'allNet.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines (pipelines with lower numbers run first)
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'allNet.pipelines.AllnetPipeline': 300,
    'allNet.pipelines.JsonWritePipline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Step 3:

Contents of pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class AllnetPipeline(object):
    def process_item(self, item, spider):
        return item


# Write each item to a file, one JSON object per line
class JsonWritePipline(object):
    def __init__(self):
        self.file = open('汽车之家全站url.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider on a pipeline when the crawl ends;
        # a method named spider_closed is not called automatically.
        self.file.close()
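The pipeline writes one JSON object per line, so the output file is in JSON Lines layout and is not a single JSON document. The sketch below is a stand-alone imitation of that pipeline using only the standard library, to show the lifecycle and how the file can be read back. The class name `JsonLinesWriter`, the temporary path, and the two sample URLs are made up. In a real crawl Scrapy calls the `open_spider`, `process_item` and `close_spider` hooks itself; here they are called by hand.

```python
import json
import os
import tempfile


class JsonLinesWriter:
    """Stand-in for the pipeline: writes one JSON object per line."""

    def __init__(self, path):
        self.path = path
        self.file = None

    def open_spider(self, spider=None):
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider=None):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider=None):
        # Closing flushes any buffered lines to disk.
        self.file.close()


# Drive the hooks by hand with two made-up items.
path = os.path.join(tempfile.mkdtemp(), "urls.json")
writer = JsonLinesWriter(path)
writer.open_spider()
writer.process_item({"categoryLink": "http://www.autohome.com.cn/a.html"})
writer.process_item({"categoryLink": "http://www.autohome.com.cn/b.html"})
writer.close_spider()

# Read it back line by line rather than with a single json.load().
with open(path, encoding="utf-8") as fh:
    links = [json.loads(line)["categoryLink"] for line in fh]
print(links)
```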

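Oddly, I set no cookies or anything of the sort for Autohome, yet the whole-site crawl never raised an error. I wrote the code around midnight and assumed Scrapy would finish quickly, but it was still crawling this morning. The computer went to sleep several times overnight and I woke it up each time. The strange part is that the spider is still running, but nothing more is being written to the JSON file. Finally, a screenshot of the results:

(Screenshot of the crawl results)

Homework for myself: extract and store data from the crawled URLs. Put simply: save the content behind each URL. (facepalm.jpg)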
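The crawl described above has no stopping condition, so it keeps following links for as long as it finds new ones. One possible way to bound it is through a few standard Scrapy settings in settings.py. The fragment below is only a sketch: the setting names are Scrapy's, but every value is an arbitrary assumption of mine and has not been tuned against Autohome.

```python
# settings.py (fragment): values are illustrative assumptions, not tuned.

# Do not follow links more than 3 hops away from start_urls (0 means no limit).
DEPTH_LIMIT = 3

# Close the spider once this many responses have been crawled.
CLOSESPIDER_PAGECOUNT = 5000

# Wait this many seconds between requests to the same site.
DOWNLOAD_DELAY = 0.5

# Let Scrapy adjust the delay according to how quickly the server responds.
AUTOTHROTTLE_ENABLED = True
```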
