0. Preface

I watched Heima's Scrapy course on Bilibili; the instructor explains everything in great detail. Highly recommended!
This post only uses Scrapy's basic operations to complete a crawl, so it is suitable for beginners.

1. Scrapy

Scrapy offers quite a few commands; type scrapy in the terminal to list them.

Here we mainly use startproject to create the project, genspider to generate the spider, and crawl to run it once the code is written.
First: scrapy startproject XXXX (the project in this post is named ITcast)
This creates the project folder locally.

Second: scrapy genspider hupu "https://bbs.hupu.com"
This step creates a spider. Note: cd into the project (demo) folder before running the command; it generates hupu.py, whose initial skeleton is shown below.
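For reference, the hupu.py that genspider creates starts out roughly like the sketch below (the exact template varies a little between Scrapy versions):

# -*- coding: utf-8 -*-
import scrapy


class HupuSpider(scrapy.Spider):
    name = 'hupu'
    allowed_domains = ['bbs.hupu.com']
    start_urls = ['https://bbs.hupu.com/']

    def parse(self, response):
        pass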

Step 3: set up items
Here we define the field names we need, similar to the keys of a dict.

import scrapy


class ItcastItem(scrapy.Item):
    # define the fields for your item here like:
    # the fields below describe one post
    author = scrapy.Field()
    reply = scrapy.Field()
    article_href = scrapy.Field()
    reply_number = scrapy.Field()
    scan_number = scrapy.Field()
    light_number = scrapy.Field()

Now we can implement the crawling logic in hupu.py:

# -*- coding: utf-8 -*-
import scrapy
from ITcast.items import ItcastItem
import re


class HupuSpider(scrapy.Spider):
    # name is required: it identifies the spider
    name = 'hupu'
    # allowed_domains is optional
    allowed_domains = ['bbs.hupu.com']
    base_url = "https://bbs.hupu.com/bxj-"
    offset = 1
    # start_urls is required
    start_urls = [base_url + str(offset)]

    def parse(self, response):
        node_list = response.xpath('//*[@id="ajaxtable"]/div[1]/ul/li')
        for node in node_list:
            item = ItcastItem()
            author_name = node.xpath("./div[2]/a[1]/text()").extract()
            reply_name = node.xpath("./div[3]/span/text()").extract()
            article_href = node.xpath('./div[1]/a/@href').extract()
            reply_number = node.xpath('./span/text()').extract()
            item['author'] = author_name[0]
            item['reply'] = reply_name[0]
            item['article_href'] = "https://bbs.hupu.com" + article_href[0]
            item['reply_number'] = re.findall(r'\w+', reply_number[0])[0]
            item['scan_number'] = re.findall(r'\w+', reply_number[0])[1]
            yield item

        # Pagination by URL concatenation:
        if self.offset < 50:
            self.offset += 1
            url = self.base_url + str(self.offset)
            yield scrapy.Request(url, callback=self.parse)

        '''
        # Alternative: follow the "next page" link instead
        if len(response.xpath("//a[@class='nextPage']")) != 0:
            url = "https://bbs.hupu.com/" + str(response.xpath('//*[@id="container"]/div/div[2]/div[4]/div[1]/div/a[6]/@href').extract()[0])
            yield scrapy.Request(url, callback=self.parse)
        '''
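Before running the full crawl, the XPath expressions above can be checked interactively with scrapy shell (a standard Scrapy tool); a quick sketch of such a session:

scrapy shell "https://bbs.hupu.com/bxj-1"
# then, inside the shell:
>>> node_list = response.xpath('//*[@id="ajaxtable"]/div[1]/ul/li')
>>> len(node_list)
>>> node_list[0].xpath('./div[2]/a[1]/text()').extract()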

Note: the yields matter here. The first yield saves a lot of memory in practice: instead of collecting every item into a list and returning them all at once, the generator hands the engine one item per loop iteration. It cannot be replaced with return, because return would exit the function immediately and the loop would never finish. The second yield hands the next-page Request back to the engine with callback=self.parse; strictly speaking it cannot simply be swapped for return either, since parse already contains a yield and is therefore a generator, so a returned Request would be ignored rather than scheduled.
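A minimal standalone sketch (plain Python, not Scrapy; the names are made up) of the difference between returning a full list and yielding items one at a time:

def collect_all(nodes):
    # builds the complete list in memory before anything can be consumed
    results = []
    for node in nodes:
        results.append({"author": node})
    return results


def generate_one_by_one(nodes):
    # hands each item to the caller as soon as it is ready,
    # so only one item needs to exist at a time
    for node in nodes:
        yield {"author": node}
        # a plain `return` here would end the function after the
        # first item and silently drop the rest of the loop


for item in generate_one_by_one(["userA", "userB"]):
    print(item)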

Next, modify the pipeline file to handle the scraped data:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class ItcastPipeline(object):
    def __init__(self):
        self.f = open("hupu.json", "wb")

    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.f.write(content.encode("utf-8"))
        return item

    def close_spider(self, spider):
        self.f.close()
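As written, the output file is a series of JSON objects separated by commas rather than one valid JSON document. A small variant, sketched below with Scrapy's open_spider/close_spider hooks (the class name is hypothetical, not part of the original project), writes one object per line, i.e. JSON Lines, which most tools can load directly:

import json


class HupuJsonLinesPipeline(object):
    # hypothetical alternative pipeline, not the one used in this post
    def open_spider(self, spider):
        # called once when the spider starts
        self.f = open("hupu.jsonl", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # one JSON object per line, no trailing commas
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # called once when the spider finishes
        self.f.close()

To use it instead, point ITEM_PIPELINES at this class rather than ItcastPipeline.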

2. Preparing the cookie

Because of Hupu's anti-crawling measures, an account that is not logged in can only browse up to about the tenth page; beyond that a login is required. So log in first, copy the cookie out of the browser, and add it to settings.py along with the request headers:

# -*- coding: utf-8 -*-

# Scrapy settings for ITcast project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
import random

BOT_NAME = 'ITcast'

SPIDER_MODULES = ['ITcast.spiders']
NEWSPIDER_MODULE = 'ITcast.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
UserAgentlist = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60",
    "Opera/8.0 (Windows NT 5.1; U; en)",
    "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",
    "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
]
USER_AGENT = random.choice(UserAgentlist)

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'Cookie': '_dacevid3=ecbdfb25.f434.fcd0.b788.210a5a48205e; acw_tc=76b20f4615842422838757569eae17e24efbdada53bc13808df6832673fc52; _cnzz_CV30020080=buzi_cookie%7Cecbdfb25.f434.fcd0.b788.210a5a48205e%7C-1; __gads=ID=effc8694c9ec3cf3:T=1584242284:S=ALNI_Ma2_M1WPUBJBiImj4flObqASJy-6Q; _HUPUSSOID=bb358b26-1d78-47a4-a64b-9ff4352047c3; _CLT=00376064be821b71351c003dda774e37; u=26819498|56We56eY55qE5aOV5a6i|f43c|7dc0fa8daecf65f4a401db993c6b9bc5|aecf65f4a401db99|56We56eY55qE5aOV5a6i; us=f44ad91429ba5c9f73cc0569e7329724eb5af2af1b29f656a26a585ac9131fd9a5005b586a05b308edde3d715a144bffd33809c47a6db6c524c344f427d2963d; Hm_lvt_39fc58a7ab8a311f2f6ca4dc1222a96e=1582974309,1582974335,1584242285,1584242368; sajssdk_2015_cross_new_user=1; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%22170dc343b82964-07f8c1efc1f5d8-38657501-1296000-170dc343b83966%22%2C%22%24device_id%22%3A%22170dc343b82964-07f8c1efc1f5d8-38657501-1296000-170dc343b83966%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%7D; PHPSESSID=ed77c421aec93f286770c371f7feea73; lastvisit=0%091584249115%09%2Ferror%2F%40_%40.php%3F; _fmdata=Pju1H%2Fc7V2gjACkMPPdbImwLnQgbrdHhcmAZ4k1PqdajpeKlYZf8Z4OHLu5h1KPRN%2FteHhyK%2FbPb4wPfPcssRiRUm%2FtCdtRzs%2Bx5ioTmRJg%3D; ua=16002521; Hm_lpvt_39fc58a7ab8a311f2f6ca4dc1222a96e=1584249554; __dacevst=8685e086.9bbdbe1e|1584258197004',
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'ITcast.middlewares.ItcastSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'ITcast.middlewares.ItcastDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'ITcast.pipelines.ItcastPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
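An alternative to hard-coding the Cookie header in settings.py, sketched below, is to attach a cookie dict to the request from the spider via scrapy.Request(cookies=...); for that approach COOKIES_ENABLED should stay at its default of True so Scrapy's cookie middleware forwards and tracks them. The cookie values are placeholders; copy the real ones from your logged-in browser session.

# Hypothetical alternative inside HupuSpider (not what this project uses):
def start_requests(self):
    cookies = {
        "u": "<value from the logged-in browser>",            # placeholder
        "us": "<value from the logged-in browser>",           # placeholder
        "_HUPUSSOID": "<value from the logged-in browser>",   # placeholder
    }
    # attach the cookies to the first request; with COOKIES_ENABLED = True
    # Scrapy keeps them for subsequent requests in the same session
    yield scrapy.Request(self.base_url + str(self.offset),
                         cookies=cookies,
                         callback=self.parse)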

Finally, run scrapy crawl hupu in the terminal. Fifty pages were crawled, 5,888 records in total (the pipeline above writes them to hupu.json).
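If a CSV file is what you actually need, Scrapy's built-in feed export can write one directly (standard Scrapy behaviour, independent of the JSON pipeline above):

scrapy crawl hupu -o hupu.csv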
