• Distributed crawling of Tencent job postings with scrapy-redis

    • What is scrapy-redis
    • Target task
    • Install the crawler
    • Create the crawler
    • Write items.py
    • Write spiders/tencent.py
    • Write pipelines.py
    • Write middlewares.py
    • Write settings.py
    • Set up redis
    • Run the crawler

What is scrapy-redis

Scrapy itself is asynchronous and handles many requests concurrently, but it runs on a single host, so crawl throughput is ultimately capped. scrapy-redis is a library built on top of scrapy that provides a distributed queue, scheduler, deduplication and so on, and the existing single-machine scrapy spider code only needs minor changes. With it, multiple hosts can be combined to work on one crawl job together, raising throughput further. Paired with Scrapyd and Gerapy, it also makes distributed deployment and operation of spiders convenient.
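The changes are small enough to summarize up front. Below is a minimal sketch of the delta against a stock single-machine project, using only the settings and spider attributes that appear in full later in this walkthrough:

# settings.py: the additions that turn a stock Scrapy project into a scrapy-redis one
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # dedup via redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # shared request scheduler
SCHEDULER_PERSIST = True                                    # keep queues across runs
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379

# spider: inherit from RedisCrawlSpider instead of CrawlSpider, and
# replace start_urls with a redis key that all hosts read from
redis_key = 'tencent:start_urls'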

Target task

Use scrapy-redis to crawl the job postings at https://hr.tencent.com/position.php?&start= , extracting: position name, detail link, position category, number of openings, work location, publish time, and the detailed requirements.

Install the crawler

pip install scrapy
pip install scrapy-redis
  • Python 3.7, Scrapy 1.6.0, scrapy-redis 0.6.8 (a quick version check follows)
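Note that scrapy-redis pulls in the redis-py client as a dependency, which the helper script later in this post relies on. To confirm which versions actually got installed, a quick check via pkg_resources (bundled with setuptools):

# print the installed versions of the packages this walkthrough assumes
import pkg_resources

for pkg in ('scrapy', 'scrapy-redis', 'redis'):
    print(pkg, pkg_resources.get_distribution(pkg).version)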

Create the crawler

# create the project
scrapy startproject TencentSpider
# create the spider
cd TencentSpider
scrapy genspider -t crawl tencent tencent.com
  • Spider name: tencent; allowed domain: tencent.com; spider template: crawl (a sketch of the generated skeleton follows)
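For reference, the crawl template generates roughly the following skeleton; the exact boilerplate varies across Scrapy versions, so treat this as an approximation. The next sections rewrite it piece by piece:

# approximate skeleton produced by `scrapy genspider -t crawl tencent tencent.com`
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TencentSpider(CrawlSpider):
    name = 'tencent'
    allowed_domains = ['tencent.com']
    start_urls = ['http://tencent.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        return item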

Write items.py

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class TencentspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # position name
    positionname = scrapy.Field()
    # detail link
    positionlink = scrapy.Field()
    # position category
    positionType = scrapy.Field()
    # number of openings
    peopleNum = scrapy.Field()
    # work location
    workLocation = scrapy.Field()
    # publish time
    publishTime = scrapy.Field()
    # position detail
    positiondetail = scrapy.Field()
  • Define the item fields to be scraped (a short dict-conversion demo follows)
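Scrapy items behave like dicts: fields are assigned by key, and an item converts cleanly with dict(), which is exactly what the pipeline below relies on. A quick illustration (field values are made up):

# items are dict-like: assign by key, convert with dict()
from TencentSpider.items import TencentspiderItem

item = TencentspiderItem()
item['positionname'] = '29302-服务采购商务岗'
item['workLocation'] = '深圳'
print(dict(item))  # {'positionname': '29302-服务采购商务岗', 'workLocation': '深圳'}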

Write spiders/tencent.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy_redis.spiders import RedisCrawlSpider
# import the CrawlSpider class and Rule
from scrapy.spiders import CrawlSpider, Rule
# import the link extractor used to pull links matching a pattern
from scrapy.linkextractors import LinkExtractor
from TencentSpider.items import TencentspiderItem


class TencentSpider(RedisCrawlSpider):  # a plain scrapy crawler would inherit from CrawlSpider
    name = 'tencent'
    # allowed_domains = ['tencent.com']
    allowed_domains = ['hr.tencent.com']
    # a plain scrapy crawler would define start_urls here and would have no redis_key
    # start_urls = ['https://hr.tencent.com/position.php?&start=0#a']
    redis_key = 'tencent:start_urls'

    # extraction rule for links inside each response; yields the matching links
    pagelink = LinkExtractor(allow=r"start=\d+")
    rules = (
        # request every extracted link, keep following further matches,
        # and handle each response with the named callback
        Rule(pagelink, callback='parse_item', follow=True),
    )

    # CrawlSpider extracts urls straight from the response text using `rules`
    # and creates new requests automatically; unlike Spider, CrawlSpider has
    # already overridden parse().
    # `scrapy crawl spidername` builds Requests from start_urls, parses each
    # response, pulls matching links out of the html (or xml) via `rules`,
    # turns those links into new Requests, and loops until no more links match
    # or the scheduler runs out of Request objects.
    # If the first url needs different handling, you may override
    # parse_start_url(self, response) to parse the first Response, but that
    # is optional.
    def parse_item(self, response):
        # print(response.request.headers)
        items = []
        url1 = "https://hr.tencent.com/"
        for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):
            # initialise the item
            item = TencentspiderItem()
            # position name
            try:
                item['positionname'] = each.xpath("./td[1]/a/text()").extract()[0].strip()
            except BaseException:
                item['positionname'] = ""
            # detail link
            try:
                item['positionlink'] = "{0}{1}".format(url1, each.xpath("./td[1]/a/@href").extract()[0].strip())
            except BaseException:
                item['positionlink'] = ""
            # position category
            try:
                item['positionType'] = each.xpath("./td[2]/text()").extract()[0].strip()
            except BaseException:
                item['positionType'] = ""
            # number of openings
            try:
                item['peopleNum'] = each.xpath("./td[3]/text()").extract()[0].strip()
            except BaseException:
                item['peopleNum'] = ""
            # work location
            try:
                item['workLocation'] = each.xpath("./td[4]/text()").extract()[0].strip()
            except BaseException:
                item['workLocation'] = ""
            # publish time
            try:
                item['publishTime'] = each.xpath("./td[5]/text()").extract()[0].strip()
            except BaseException:
                item['publishTime'] = ""
            items.append(item)
            # yield item
        for item in items:
            yield scrapy.Request(url=item['positionlink'], meta={'meta_1': item},
                                 callback=self.second_parseTencent)

    def second_parseTencent(self, response):
        item = TencentspiderItem()
        meta_1 = response.meta['meta_1']
        item['positionname'] = meta_1['positionname']
        item['positionlink'] = meta_1['positionlink']
        item['positionType'] = meta_1['positionType']
        item['peopleNum'] = meta_1['peopleNum']
        item['workLocation'] = meta_1['workLocation']
        item['publishTime'] = meta_1['publishTime']
        tmp = []
        tmp.append(response.xpath("//tr[@class='c']")[0])
        tmp.append(response.xpath("//tr[@class='c']")[1])
        positiondetail = ''
        for i in tmp:
            # section heading, then the bullet list under it
            positiondetail_title = i.xpath("./td[1]/div[@class='lightblue']/text()").extract()[0].strip()
            positiondetail = positiondetail + positiondetail_title
            positiondetail_detail = i.xpath("./td[1]/ul[@class='squareli']/li/text()").extract()
            positiondetail = positiondetail + ' '.join(positiondetail_detail) + ' '
        # alternative: grab the two blocks directly from the whole page
        # positiondetail_title = response.xpath("//div[@class='lightblue']").extract()
        # positiondetail_detail = response.xpath("//ul[@class='squareli']").extract()
        # positiondetail = positiondetail_title[0] + '\n' + positiondetail_detail[0] + '\n' + positiondetail_title[1] + '\n' + positiondetail_detail[1]
        item['positiondetail'] = positiondetail.strip()
        yield item
  • The main crawling logic (an offline XPath check follows)
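The row-level XPath logic in parse_item can be exercised offline with scrapy's Selector against a fragment shaped like the listing table; the HTML below is a hypothetical stand-in for the real page, not a copy of it:

# offline check of the even/odd row extraction using a made-up table fragment
from scrapy.selector import Selector

html = """
<table>
  <tr class="even">
    <td><a href="position_detail.php?id=1">职位A</a></td>
    <td>技术类</td><td>2</td><td>深圳</td><td>2019-04-12</td>
  </tr>
</table>
"""
sel = Selector(text=html)
for row in sel.xpath("//tr[@class='even'] | //tr[@class='odd']"):
    print(row.xpath("./td[1]/a/text()").extract()[0],   # position name
          row.xpath("./td[4]/text()").extract()[0])     # work location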

Write pipelines.py

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class TencentspiderPipeline(object):
    """Save item data to a local json-lines file"""

    def __init__(self):
        self.filename = open("tencent.json", "w", encoding='utf-8')

    def process_item(self, item, spider):
        try:
            text = json.dumps(dict(item), ensure_ascii=False) + "\n"
            self.filename.write(text)
        except BaseException as e:
            print(e)
        return item

    def close_spider(self, spider):
        self.filename.close()
  • Handles every item scraped from each page (a standalone smoke test follows)
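The pipeline has no hard dependency on Scrapy at run time, so it can be smoke-tested with a plain dict standing in for an item; the spider argument is never touched, so None is safe here, and the field values are made up:

# smoke test: push one fake item through the pipeline, then inspect tencent.json
from TencentSpider.pipelines import TencentspiderPipeline

pipeline = TencentspiderPipeline()
pipeline.process_item({'positionname': '测试职位', 'workLocation': '深圳'}, None)
pipeline.close_spider(None)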

Write middlewares.py

# -*- coding: utf-8 -*-
# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
import random

import scrapy
from scrapy import signals
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class TencentspiderSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class TencentspiderDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class MyUserAgentMiddleware(UserAgentMiddleware):
    """Set a random User-Agent on every request"""

    def __init__(self, user_agent):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        return cls(user_agent=crawler.settings.get('MY_USER_AGENT'))

    def process_request(self, request, spider):
        agent = random.choice(self.user_agent)
        request.headers['User-Agent'] = agent
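The random User-Agent logic is easy to verify in isolation by instantiating the middleware with a short list (the two agents below are placeholders) and passing it a bare Request:

# verify MyUserAgentMiddleware outside the middleware chain
from scrapy.http import Request

from TencentSpider.middlewares import MyUserAgentMiddleware

mw = MyUserAgentMiddleware(user_agent=[
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",  # placeholder
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11",            # placeholder
])
req = Request('https://hr.tencent.com/position.php')
mw.process_request(req, spider=None)
print(req.headers['User-Agent'])  # one of the two agents, chosen at random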

Write settings.py

# -*- coding: utf-8 -*-
# Scrapy settings for TencentSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'TencentSpider'

SPIDER_MODULES = ['TencentSpider.spiders']
NEWSPIDER_MODULE = 'TencentSpider.spiders'

# a plain scrapy project has none of the following 5 redis-related settings
# use the scrapy_redis dedup component; dedup happens in the redis database (required)
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# use the scrapy_redis scheduler; requests are distributed via redis (required)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# keep the scrapy-redis queues in redis so a crawl can be paused and resumed,
# i.e. do not clear the redis queues (optional)
SCHEDULER_PERSIST = True
# connection parameters for the redis database (required)
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
DUPEFILTER_DEBUG = True

# scrapy-redis stores everything in redis as key-value pairs; the common keys are:
# 1. "<spider>:items"      -> list, scraped items serialized as json strings
# 2. "<spider>:dupefilter" -> set, 40-character hashes of visited urls, used for dedup
# 3. "<spider>:start_urls" -> list, the first url(s) the spider crawls on startup
# 4. "<spider>:requests"   -> zset, serialized request objects used by the scheduler

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'TencentSpider (+http://www.yourdomain.com)'
# pool of user agents picked at random per request
MY_USER_AGENT = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.21 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.21",
    "Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)",
    "Mozilla/5.0 (Windows NT 6.2; rv:30.0) Gecko/20150101 Firefox/32.0",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.2)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
    "Mozilla/4.0 (compatib1e; MSIE 6.1; Windows NT)",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; SLCC1; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.5.21022; .NET CLR 3.5.30729; .NET CLR 3.0.30618)",
    "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET4.0C; .NET4.0E; Media Center PC 6.0)",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.93 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:23.0) Gecko/20100101 Firefox/23.0",
    "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2)",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/20100101 Firefox/17.0",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36",
    "Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; InfoPath.2)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.10 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) AppleWebKit/602.1.21 (KHTML, like Gecko) Version/9.2 Safari/602.1.21",
    "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729)",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.152 Safari/537.36"
]

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip,deflate,br',
    'accept-language': 'zh-CN,zh;q=0.9',
    'cache-control': 'no-cache',
    'pragma': 'no-cache',
    'upgrade-insecure-requests': '1',
    'host': 'hr.tencent.com'
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'TencentSpider.middlewares.TencentspiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'TencentSpider.middlewares.TencentspiderDownloaderMiddleware': None,
    'TencentSpider.middlewares.MyUserAgentMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'TencentSpider.pipelines.TencentspiderPipeline': 300,
    # RedisPipeline writes each item into the redis list "<spider>:items" for
    # later distributed processing; scrapy-redis already implements this, so
    # no extra code is needed
    'scrapy_redis.pipelines.RedisPipeline': 100
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

LOG_LEVEL = 'DEBUG'
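Once a crawl has run, the four keys described in the comments above can be inspected directly with redis-py; a minimal sketch assuming the default key names:

# list the scrapy-redis keys and their redis types after a crawl
import redis

r = redis.Redis(host='127.0.0.1', port=6379)
for key in ('tencent:start_urls', 'tencent:dupefilter',
            'tencent:requests', 'tencent:items'):
    print(key, r.type(key))  # expected: list, set, zset, list (missing keys show as none)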

Set up redis

This sets up the single-node Windows build; for Linux, look up the equivalent steps yourself. Download: https://github.com/rgl/redis/downloads. Pick the latest build matching your machine; here I chose redis-2.4.6-setup-64-bit.exe. Double-click to install, then add C:\Program Files\Redis to the system environment variables. The configuration file is C:\Program Files\Redis\conf\redis.conf. Run the redis server with the command redis-server, and the redis client with redis-cli.
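Before starting any spiders it is worth confirming the server is reachable from each host; a one-line check with redis-py (installed alongside scrapy-redis):

# connectivity check against the redis server configured in settings.py
import redis

conn = redis.Redis(host='127.0.0.1', port=6379)
print(conn.ping())  # True when the server is up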

Run the crawler

Start the spiders

cd TencentSpider
scrapy crawl tencent
  • TencentSpider is the project folder; tencent is the spider name
  • At this point the spider sits idle, waiting for start urls.
  • You can start any number of spider instances on this machine or on others, as long as each host can reach the redis server

Set start_urls

# redis-cli
redis 127.0.0.1:6379> lpush tencent:start_urls https://hr.tencent.com/position.php?&start=0#a
(integer) 1
redis 127.0.0.1:6379>

Or run the following script:

# -*- coding: utf-8 -*-
import redis

if __name__ == '__main__':
    conn = redis.Redis(host='127.0.0.1', port=6379)
    # REDIS_START_URLS_AS_SET in settings defaults to False:
    # False -> start urls live in a redis list, True -> in a redis set
    # list
    conn.lpush('tencent:start_urls', 'https://hr.tencent.com/position.php?&start=0#a')
    # set
    # conn.sadd('tencent:start_urls', 'https://hr.tencent.com/position.php?&start=0#a')
    # conn.close()  # no need to close the connection
  • tencent:start_urls is the value of the redis_key variable in spiders/tencent.py
  • After a short wait every spider instance starts crawling; once the crawl finishes, stop them with ctrl + c

The results are saved both under the redis key tencent:items and in the tencent.json file at the project root; sample content:

{"positionname": "29302-服务采购商务岗", "positionlink": "https://hr.tencent.com/position_detail.php?id=49345&keywords=&tid=0&lid=0", "positionType": "职能类", "peopleNum": "1", "workLocation": "深圳", "publishTime": "2019-04-12", "positiondetail": "工作职责:• 负责相关产品和品类采购策略的制订及实施; • 负责相关产品及品类的采购运作管理,包括但不限于需求理解、供应商开发及选择、供应资源有效管理、商务谈判、成本控制、交付管理、组织验收等 • 支持业务部门的采购需求; • 收集、分析市场及行业相关信息,为采购决策提供依据。 工作要求:• 认同腾讯企业文化理念,正直、进取、尽责; • 本科或以上学历,管理、传媒、经济或其他相关专业,市场营销及内容类产品运营工作背景者优先; • 五年以上工作经验,对采购理念和采购过程管理具有清晰的认知和深刻的理解;拥有二年以上营销/设计采购、招标相关类管理经验; • 熟悉采购运作及管理,具有独立管理重大采购项目的经验,具有较深厚的采购专业知识; • 具备良好的组织协调和沟通能力、学习能力和团队合作精神强,具有敬业精神,具备较强的分析问题和解决问题的能力; • 了解IP及新文创行业现状及发展,熟悉市场营销相关行业知识和行业运作特点; • 具有良好的英语听说读写能力,英语可作为工作语言;同时有日语听说读写能力的优先; • 具备良好的文档撰写能力。计算机操作能力强,熟练使用MS OFFICE办公软件和 ERP 等软件的熟练使用。"} {"positionname": "CSIG16-自动驾驶高精地图(地图编译)", "positionlink": "https://hr.tencent.com/position_detail.php?id=49346&keywords=&tid=0&lid=0", "positionType": "技术类", "peopleNum": "1", "workLocation": "北京", "publishTime": "2019-04-12", "positiondetail": "工作职责:地图数据编译工具软件开发 工作要求: 硕士以上学历,2年以上工作经验,计算机、测绘、GIS、数学等相关专业;  精通C++编程,编程基础扎实;  熟悉常见数据结构,有较复杂算法设计经验;  精通数据库编程,如MySQL、sqlite等;  有实际的地图项目经验,如地图tile、大地坐标系、OSM等;  至少熟悉一种地图数据规格,如MIF、NDS、OpenDrive等;  有较好的数学基础,熟悉几何和图形学基本算法,;  具备较好的沟通表达能力和团队合作意识。"} {"positionname": "32032-资深特效美术设计师(上海)", "positionlink": "https://hr.tencent.com/position_detail.php?id=49353&keywords=&tid=0&lid=0", "positionType": "设计类", "peopleNum": "1", "workLocation": "上海", "publishTime": "2019-04-12", "positiondetail": "工作职责:负责游戏3D和2D特效制作,制作规范和技术标准的制定; 与项目组开发人员深入沟通,准确实现项目开发需求。 工作要求:5年以上端游、手游特效制作经验,熟悉UE4引擎; 能熟练使用相关软件和引擎工具制作高品质的3D特效; 善于使用第三方软件制作高品质序列资源,用于引擎特效; 可以总结自己的方法论和经验用于新人和带领团队; 对游戏开发和技术有热情和追求,有责任心,善于团队合作,沟通能力良好,应聘简历须附带作品。"} ...... ...... ......

This crawler comes with no guarantee of continued validity: if the source site changes, it will stop working.

Reposted from: https://www.cnblogs.com/leffss/p/11003085.html
