The first spider: zhilian_fist

This crawl is split across two spiders. The first spider collects the URLs that need to be visited and stores them in a text file; the second spider reads the URLs collected by the first one and scrapes the content of each URL in turn. Run the first spider and then the second to complete the crawl; a minimal run-order sketch follows.
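The sketch below is not part of the original post: it simply invokes scrapy crawl for each project in turn, and the project directory names zhilian_fist and zhilian_second are assumptions based on the settings files shown later.

# run_both.py -- hypothetical helper script, not part of the original project.
# Runs the URL-collecting spider first, then the detail spider.
import subprocess

# Step 1: zhilian_url writes the listing URLs to the text file 'myurls'.
subprocess.run(['scrapy', 'crawl', 'zhilian_url'], cwd='zhilian_fist', check=True)

# Step 2: zhilian_second reads 'myurls' and scrapes every listing it contains.
subprocess.run(['scrapy', 'crawl', 'zhilian_second'], cwd='zhilian_second', check=True)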

This post is for learning and exchange only; please do not run it carelessly, so as not to disrupt the normal operation of the website.
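One way to keep the load on the site light is Scrapy's built-in throttling. The two settings below are a suggestion only and are not part of the original settings.py files:

# Optional throttling (an addition, not in the original project).
DOWNLOAD_DELAY = 1           # wait about one second between requests
AUTOTHROTTLE_ENABLED = True  # let Scrapy adapt the delay to server responsiveness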

Contents of the spander.py file under the spiders directory:

# -*- coding:utf-8 -*-
import scrapy
from ..items import ZhilianFistItem


class zhilian_url(scrapy.Spider):
    name = 'zhilian_url'
    start_urls = ['http://jobs.zhaopin.com/']

    def parse(self, response):
        myurl = ZhilianFistItem()
        # Collect the listing links on the index page.
        urls = response.xpath('/html/body/div/div/div/a[@target="_blank"]/@href').extract()
        for url in urls:
            myurl['url'] = url
            yield myurl

The items.py file:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
from scrapy import Item, Field


class ZhilianFistItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = Field()

The middlewares.py file:

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals


class ZhilianFistSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

The pipelines.py file:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class ZhilianFistPipeline(object):

    def open_spider(self, spider):
        print('++++++++++++    start    ++++++++++++')
        # Open the text file that will hold the collected URLs.
        self.fp = open('myurls', 'w')
        print('++++++++++++      ok     ++++++++++++')

    def process_item(self, item, spider):
        # Skip direct .htm pages; only keep listing URLs under jobs.zhaopin.com.
        if '.htm' in item['url']:
            pass
        elif 'http://jobs.zhaopin.com/' in item['url']:
            print('++++++++++++    saving    ++++++++++++')
            self.fp.writelines(item['url'] + "\n")
            print('++++++++++++     ok      ++++++++++++')
            return item
        else:
            pass

    # Scrapy calls close_spider on an item pipeline when the spider finishes;
    # the method was originally named spider_closed, which a pipeline never receives.
    def close_spider(self, spider):
        print('++++++++++++    done    ++++++++++++')
        self.fp.close()
        print('++++++++++++    ok    ++++++++++++')

The settings.py file:

# -*- coding: utf-8 -*-

BOT_NAME = 'zhilian_fist'

SPIDER_MODULES = ['zhilian_fist.spiders']
NEWSPIDER_MODULE = 'zhilian_fist.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
    'Host': 'jobs.zhaopin.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:53.0) Gecko/20100101 Firefox/53.0'
}

ITEM_PIPELINES = {
    'zhilian_fist.pipelines.ZhilianFistPipeline': 300,
}

The second spider: zhilian_second

The spander.py file:

# -*- coding:utf-8 -*-
import scrapy
from ..items import ZhilianSecondItem
from scrapy import Request
from bs4 import BeautifulSoup

class spider(scrapy.Spider):
    name = 'zhilian_second'
    start_urls = []

    def __init__(self):
        # Read the URL list produced by the first spider.
        links = open('E:/PythonWorkStation/zhilian_fist/myurls')
        for line in links:
            # The trailing newline must be stripped, otherwise the URL cannot be requested.
            line = line[:-1]
            self.start_urls.append(line)

    def parse(self, response):
        item = ZhilianSecondItem()
        title_list = response.xpath('//div/span[@class="post"]/a/text()').extract()
        company_list = response.xpath('//div/span[@class="company_name"]/a/text()').extract()
        salary_list = response.xpath('//div/span[@class="salary"]/text()').extract()
        address_list = response.xpath('//div/span[@class="address"]/text()').extract()
        release_list = response.xpath('//div/span[@class="release_time"]/text()').extract()

        # Follow the "next page" link if the page has one.
        next_links = response.xpath('//span[@class="search_page_next"]/a/@href').extract()
        if next_links:
            # Keep only the "pN" page segment of the href.
            next_url = next_links[0].split('/')[2]
            if len(response.url.split('/')) == 5:
                # First page: simply append the page segment.
                yield Request(response.url + next_url)
            elif len(response.url.split('/')) > 5:
                # Later pages: cut i characters off the end of the current URL
                # before appending the new page segment; the cut is one shorter
                # when the next page number is 10, 100, 1000 or 10000.
                i = len(next_url) + 1
                if next_url.lstrip('p') in ('10', '100', '1000', '10000'):
                    i = i - 1
                yield Request(response.url[:-i] + next_url)

        for a, s, d, f, g in zip(title_list, company_list, salary_list, address_list, release_list):
            item['title'] = a
            item['company'] = s
            item['salary'] = d
            item['address'] = f
            item['release'] = g
            yield item
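As a side note, the next-page handling above can usually be expressed more simply with response.urljoin, which resolves the extracted href against the current page URL. The lines below are only a sketch of an alternative, not the author's code, and they assume the "next page" href is a normal relative or absolute link:

# Alternative pagination sketch (assumes the next-page href is a regular link).
next_href = response.xpath('//span[@class="search_page_next"]/a/@href').extract_first()
if next_href:
    # response.urljoin builds the absolute URL, avoiding the manual slicing above.
    yield Request(response.urljoin(next_href), callback=self.parse)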

The items.py file:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class ZhilianSecondItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    company = scrapy.Field()
    salary = scrapy.Field()
    address = scrapy.Field()
    release = scrapy.Field()

The middlewares.py file:

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals


class ZhilianSecondSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

The pipelines.py file:

# -*- coding: utf-8 -*-


class ZhilianSecondPipeline(object):

    def open_spider(self, spider):
        self.file = open('E:/招聘岗位.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.file.write(item['title'] + "," + item['company'] + "," + item['salary'] + ","
                        + item['address'] + "," + item['release'] + '\n')
        return item

    # Scrapy calls close_spider on an item pipeline when the spider finishes;
    # the original "spoder_closed" typo meant the file was never explicitly closed.
    def close_spider(self, spider):
        self.file.close()

The settings.py file:

# -*- coding: utf-8 -*-
BOT_NAME = 'zhilian_second'
SPIDER_MODULES = ['zhilian_second.spiders']
NEWSPIDER_MODULE = 'zhilian_second.spiders'
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    'zhilian_second.pipelines.ZhilianSecondPipeline': 300,
}
LOG_LEVEL = 'INFO'

Because there was so much to crawl that waiting for it to finish would have taken far too long, I terminated the program before it ran to completion, but by that point it had still scraped several hundred thousand job postings.

Each scraped record is written as comma-separated fields:
(job title, company name, salary, address, posting date)
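A small reader sketch (not from the original post) shows how the output file could be loaded back for analysis. The path E:/招聘岗位.txt is the one opened by the second pipeline; note that the naive comma split will mis-parse any field that itself contains a comma.

# Hypothetical reader for the pipeline's output file.
records = []
with open('E:/招聘岗位.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip('\n').split(',')
        if len(parts) == 5:  # title, company, salary, address, release date
            records.append(dict(zip(['title', 'company', 'salary', 'address', 'release'], parts)))

print(len(records), 'records loaded')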
