0. Preface

I watched Heima's Scrapy course on Bilibili; the instructor explains everything in great detail. Highly recommended!
This post only uses Scrapy's basic operations to complete a crawl, so it is suitable for beginners.

1. Scrapy

Scrapy offers quite a few commands; type scrapy in the terminal to list them.

Here we mainly use startproject to create the project, genspider to generate the spider, and crawl to run it once the code is written.
First: scrapy startproject XXXX (the project in this post is named ITcast)
This creates the project folder locally.

Second: scrapy genspider hupu "https://bbs.hupu.com"
This step creates a spider. Note: cd into the project (demo) folder before running the command; it generates hupu.py, whose initial skeleton is shown below.
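For reference, the hupu.py that genspider creates starts out roughly like the sketch below (the exact template varies a little between Scrapy versions):

# -*- coding: utf-8 -*-
import scrapy


class HupuSpider(scrapy.Spider):
    name = 'hupu'
    allowed_domains = ['bbs.hupu.com']
    start_urls = ['https://bbs.hupu.com/']

    def parse(self, response):
        pass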

Step 3: set up items
Here we define the field names we need, similar to the keys of a dict.

import scrapy


class ItcastItem(scrapy.Item):
    # define the fields for your item here like:
    # the fields below describe one post
    author = scrapy.Field()
    reply = scrapy.Field()
    article_href = scrapy.Field()
    reply_number = scrapy.Field()
    scan_number = scrapy.Field()
    light_number = scrapy.Field()

Now we can implement the crawling logic in hupu.py:

# -*- coding: utf-8 -*-
import scrapy
from ITcast.items import ItcastItem
import re


class HupuSpider(scrapy.Spider):
    # name is required: it identifies the spider
    name = 'hupu'
    # allowed_domains is optional
    allowed_domains = ['bbs.hupu.com']
    base_url = "https://bbs.hupu.com/bxj-"
    offset = 1
    # start_urls is required
    start_urls = [base_url + str(offset)]

    def parse(self, response):
        node_list = response.xpath('//*[@id="ajaxtable"]/div[1]/ul/li')
        for node in node_list:
            item = ItcastItem()
            author_name = node.xpath("./div[2]/a[1]/text()").extract()
            reply_name = node.xpath("./div[3]/span/text()").extract()
            article_href = node.xpath('./div[1]/a/@href').extract()
            reply_number = node.xpath('./span/text()').extract()
            item['author'] = author_name[0]
            item['reply'] = reply_name[0]
            item['article_href'] = "https://bbs.hupu.com" + article_href[0]
            item['reply_number'] = re.findall(r'\w+', reply_number[0])[0]
            item['scan_number'] = re.findall(r'\w+', reply_number[0])[1]
            yield item

        # Pagination by URL concatenation:
        if self.offset < 50:
            self.offset += 1
            url = self.base_url + str(self.offset)
            yield scrapy.Request(url, callback=self.parse)

        '''
        # Alternative: follow the "next page" link instead
        if len(response.xpath("//a[@class='nextPage']")) != 0:
            url = "https://bbs.hupu.com/" + str(response.xpath('//*[@id="container"]/div/div[2]/div[4]/div[1]/div/a[6]/@href').extract()[0])
            yield scrapy.Request(url, callback=self.parse)
        '''
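Before running the full crawl, the XPath expressions above can be checked interactively with scrapy shell (a standard Scrapy tool); a quick sketch of such a session:

scrapy shell "https://bbs.hupu.com/bxj-1"
# then, inside the shell:
>>> node_list = response.xpath('//*[@id="ajaxtable"]/div[1]/ul/li')
>>> len(node_list)
>>> node_list[0].xpath('./div[2]/a[1]/text()').extract()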

Note: the yields matter here. The first yield saves a lot of memory in practice: instead of collecting every item into a list and returning them all at once, the generator hands the engine one item per loop iteration. It cannot be replaced with return, because return would exit the function immediately and the loop would never finish. The second yield hands the next-page Request back to the engine with callback=self.parse; strictly speaking it cannot simply be swapped for return either, since parse already contains a yield and is therefore a generator, so a returned Request would be ignored rather than scheduled.
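A minimal standalone sketch (plain Python, not Scrapy; the names are made up) of the difference between returning a full list and yielding items one at a time:

def collect_all(nodes):
    # builds the complete list in memory before anything can be consumed
    results = []
    for node in nodes:
        results.append({"author": node})
    return results


def generate_one_by_one(nodes):
    # hands each item to the caller as soon as it is ready,
    # so only one item needs to exist at a time
    for node in nodes:
        yield {"author": node}
        # a plain `return` here would end the function after the
        # first item and silently drop the rest of the loop


for item in generate_one_by_one(["userA", "userB"]):
    print(item)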

Next, modify the pipeline file to handle the scraped data:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class ItcastPipeline(object):
    def __init__(self):
        self.f = open("hupu.json", "wb")

    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.f.write(content.encode("utf-8"))
        return item

    def close_spider(self, spider):
        self.f.close()
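As written, the output file is a series of JSON objects separated by commas rather than one valid JSON document. A small variant, sketched below with Scrapy's open_spider/close_spider hooks (the class name is hypothetical, not part of the original project), writes one object per line, i.e. JSON Lines, which most tools can load directly:

import json


class HupuJsonLinesPipeline(object):
    # hypothetical alternative pipeline, not the one used in this post
    def open_spider(self, spider):
        # called once when the spider starts
        self.f = open("hupu.jsonl", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # one JSON object per line, no trailing commas
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # called once when the spider finishes
        self.f.close()

To use it instead, point ITEM_PIPELINES at this class rather than ItcastPipeline.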

2. Preparing the cookie

Because of Hupu's anti-crawling measures, an account that is not logged in can only browse up to about the tenth page; beyond that a login is required. So log in first, copy the cookie out of the browser, and add it to settings.py along with the request headers:

# -*- coding: utf-8 -*-

# Scrapy settings for ITcast project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
import random

BOT_NAME = 'ITcast'

SPIDER_MODULES = ['ITcast.spiders']
NEWSPIDER_MODULE = 'ITcast.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
UserAgentlist = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60",
    "Opera/8.0 (Windows NT 5.1; U; en)",
    "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",
    "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
]
USER_AGENT = random.choice(UserAgentlist)

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'Cookie': '_dacevid3=ecbdfb25.f434.fcd0.b788.210a5a48205e; acw_tc=76b20f4615842422838757569eae17e24efbdada53bc13808df6832673fc52; _cnzz_CV30020080=buzi_cookie%7Cecbdfb25.f434.fcd0.b788.210a5a48205e%7C-1; __gads=ID=effc8694c9ec3cf3:T=1584242284:S=ALNI_Ma2_M1WPUBJBiImj4flObqASJy-6Q; _HUPUSSOID=bb358b26-1d78-47a4-a64b-9ff4352047c3; _CLT=00376064be821b71351c003dda774e37; u=26819498|56We56eY55qE5aOV5a6i|f43c|7dc0fa8daecf65f4a401db993c6b9bc5|aecf65f4a401db99|56We56eY55qE5aOV5a6i; us=f44ad91429ba5c9f73cc0569e7329724eb5af2af1b29f656a26a585ac9131fd9a5005b586a05b308edde3d715a144bffd33809c47a6db6c524c344f427d2963d; Hm_lvt_39fc58a7ab8a311f2f6ca4dc1222a96e=1582974309,1582974335,1584242285,1584242368; sajssdk_2015_cross_new_user=1; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%22170dc343b82964-07f8c1efc1f5d8-38657501-1296000-170dc343b83966%22%2C%22%24device_id%22%3A%22170dc343b82964-07f8c1efc1f5d8-38657501-1296000-170dc343b83966%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%7D; PHPSESSID=ed77c421aec93f286770c371f7feea73; lastvisit=0%091584249115%09%2Ferror%2F%40_%40.php%3F; _fmdata=Pju1H%2Fc7V2gjACkMPPdbImwLnQgbrdHhcmAZ4k1PqdajpeKlYZf8Z4OHLu5h1KPRN%2FteHhyK%2FbPb4wPfPcssRiRUm%2FtCdtRzs%2Bx5ioTmRJg%3D; ua=16002521; Hm_lpvt_39fc58a7ab8a311f2f6ca4dc1222a96e=1584249554; __dacevst=8685e086.9bbdbe1e|1584258197004',
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'ITcast.middlewares.ItcastSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'ITcast.middlewares.ItcastDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'ITcast.pipelines.ItcastPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
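An alternative to hard-coding the Cookie header in settings.py, sketched below, is to attach a cookie dict to the request from the spider via scrapy.Request(cookies=...); for that approach COOKIES_ENABLED should stay at its default of True so Scrapy's cookie middleware forwards and tracks them. The cookie values are placeholders; copy the real ones from your logged-in browser session.

# Hypothetical alternative inside HupuSpider (not what this project uses):
def start_requests(self):
    cookies = {
        "u": "<value from the logged-in browser>",            # placeholder
        "us": "<value from the logged-in browser>",           # placeholder
        "_HUPUSSOID": "<value from the logged-in browser>",   # placeholder
    }
    # attach the cookies to the first request; with COOKIES_ENABLED = True
    # Scrapy keeps them for subsequent requests in the same session
    yield scrapy.Request(self.base_url + str(self.offset),
                         cookies=cookies,
                         callback=self.parse)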

Finally, run scrapy crawl hupu in the terminal. Fifty pages were crawled, 5,888 records in total (the pipeline above writes them to hupu.json).
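If a CSV file is what you actually need, Scrapy's built-in feed export can write one directly (standard Scrapy behaviour, independent of the JSON pipeline above):

scrapy crawl hupu -o hupu.csv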
