Two Scrapy spiders dividing the work of crawling all job listings from Zhilian Zhaopin.
Spider 1
This post is for learning and exchange only; please do not hammer the site and disrupt its normal operation.
The spander.py file under the spiders directory:
# -*- coding:utf-8 -*-
import scrapy
from ..items import ZhilianFistItem


class zhilian_url(scrapy.Spider):
    name = 'zhilian_url'
    start_urls = ['http://jobs.zhaopin.com/']

    def parse(self, response):
        myurl = ZhilianFistItem()
        # Collect every link that opens in a new window under the
        # page's nested divs; these are the city/channel index pages.
        urls = response.xpath('/html/body/div/div/div/a[@target="_blank"]/@href').extract()
        for url in urls:
            myurl['url'] = url
            yield myurl
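The XPath above can be sanity-checked outside Scrapy. A minimal sketch using the standard library's ElementTree (which supports only a subset of XPath), run against a made-up fragment that mirrors the page structure the spider expects:

```python
import xml.etree.ElementTree as ET

# Made-up fragment shaped like the structure the spider's XPath targets.
snippet = """<html><body><div><div><div>
  <a target="_blank" href="http://jobs.zhaopin.com/beijing/">Beijing</a>
  <a href="http://jobs.zhaopin.com/about.htm">About</a>
</div></div></div></body></html>"""

root = ET.fromstring(snippet)  # root is the <html> element
# Select only the <a> elements opened in a new window, as the spider does.
urls = [a.get('href') for a in root.findall('./body/div/div/div/a[@target="_blank"]')]
print(urls)  # → ['http://jobs.zhaopin.com/beijing/']
```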
The items.py file:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
from scrapy import Item, Field


class ZhilianFistItem(Item):
    url = Field()
The middlewares.py file:
# -*- coding: utf-8 -*-
# Define here the models for your spider middleware
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals


class ZhilianFistSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    # Note: the methods below need a "self" parameter, which the
    # original post dropped.
    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
The pipelines.py file:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class ZhilianFistPipeline(object):
    def open_spider(self, spider):
        # Open the plain-text file that will collect the channel URLs.
        self.fp = open('myurls', 'w')

    def process_item(self, item, spider):
        # Skip job detail pages (they end in ".htm"); keep only the
        # channel index links under jobs.zhaopin.com.
        if '.htm' not in item['url'] and 'http://jobs.zhaopin.com/' in item['url']:
            self.fp.write(item['url'] + "\n")
        return item

    # Scrapy calls close_spider() on pipelines; the original method was
    # named spider_closed() and therefore never ran, leaking the file.
    def close_spider(self, spider):
        self.fp.close()
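The branching in process_item is easier to reason about as a pure predicate. A sketch of the same rule as a standalone function (the name keep_url is mine, not part of the project):

```python
def keep_url(url):
    """Mirror the pipeline's filter: drop job detail pages (their
    paths end in .htm) and keep only jobs.zhaopin.com channel links."""
    if '.htm' in url:
        return False
    return 'http://jobs.zhaopin.com/' in url

print(keep_url('http://jobs.zhaopin.com/beijing/'))    # → True  (channel page, kept)
print(keep_url('http://jobs.zhaopin.com/job123.htm'))  # → False (detail page, dropped)
print(keep_url('http://www.zhaopin.com/'))             # → False (off-site, dropped)
```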
The settings.py file:
# -*- coding: utf-8 -*-
BOT_NAME = 'zhilian_fist'

SPIDER_MODULES = ['zhilian_fist.spiders']
NEWSPIDER_MODULE = 'zhilian_fist.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
    'Host': 'jobs.zhaopin.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:53.0) Gecko/20100101 Firefox/53.0'
}

ITEM_PIPELINES = {
    'zhilian_fist.pipelines.ZhilianFistPipeline': 300,
}
The second spider: zhilian_second
The spander.py file:
# -*- coding:utf-8 -*-
import scrapy
from ..items import ZhilianSecondItem
from scrapy import Request
class spider(scrapy.Spider):
    name = 'zhilian_second'
    start_urls = []

    def __init__(self):
        # Seed start_urls from the file written by the first spider.
        links = open('E:/PythonWorkStation/zhilian_fist/myurls')
        for line in links:
            # The trailing newline must be stripped; with it the URL
            # cannot be requested.
            line = line[:-1]
            self.start_urls.append(line)

    def parse(self, response):
        item = ZhilianSecondItem()
        title_list = response.xpath('//div/span[@class="post"]/a/text()').extract()
        company_list = response.xpath('//div/span[@class="company_name"]/a/text()').extract()
        salary_list = response.xpath('//div/span[@class="salary"]/text()').extract()
        address_list = response.xpath('//div/span[@class="address"]/text()').extract()
        release_list = response.xpath('//div/span[@class="release_time"]/text()').extract()

        # Follow the "next page" link. extract() returns a list (never
        # None), so test for a non-empty result to avoid an IndexError
        # on the last page.
        next_links = response.xpath('//span[@class="search_page_next"]/a/@href').extract()
        if next_links:
            next_url = next_links[0].split('/')[2]
            if len(response.url.split('/')) == 5:
                # First page: the URL has no page segment yet, so append.
                yield Request(response.url + next_url)
            elif len(response.url.split('/')) > 5:
                # Later pages: slice off the current "pN" segment before
                # appending the next one.
                i = len(next_url) + 1
                if next_url.lstrip('p') in ('10', '100', '1000', '10000'):
                    # The page number just gained a digit, so the old
                    # segment is one character shorter than the new one.
                    i = i - 1
                yield Request(response.url[:-i] + next_url)

        for a, s, d, f, g in zip(title_list, company_list, salary_list, address_list, release_list):
            item['title'] = a
            item['company'] = s
            item['salary'] = d
            item['address'] = f
            item['release'] = g
            yield item
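The next-page handling rewrites the trailing "pN" segment of the current URL, and the slicing arithmetic is subtle. The same logic can be sketched as a standalone helper (build_next_url is a hypothetical name; the sketch assumes, as the spider does, that later-page URLs end with a trailing slash):

```python
def build_next_url(current_url, next_segment):
    """Rewrite current_url so its trailing page segment ('p2', 'p3', ...)
    becomes next_segment, mirroring the spider's slicing arithmetic."""
    if len(current_url.split('/')) == 5:
        # First page, e.g. http://jobs.zhaopin.com/beijing/ : just append.
        return current_url + next_segment
    # Later pages: cut off the old "pN/" tail before appending.
    i = len(next_segment) + 1
    if next_segment.lstrip('p') in ('10', '100', '1000', '10000'):
        # The page number just gained a digit, so the old segment is
        # one character shorter than the new one.
        i -= 1
    return current_url[:-i] + next_segment

print(build_next_url('http://jobs.zhaopin.com/beijing/', 'p2'))    # → .../beijing/p2
print(build_next_url('http://jobs.zhaopin.com/beijing/p2/', 'p3'))  # → .../beijing/p3
print(build_next_url('http://jobs.zhaopin.com/beijing/p9/', 'p10')) # → .../beijing/p10
```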
The items.py file:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class ZhilianSecondItem(scrapy.Item):
    title = scrapy.Field()
    company = scrapy.Field()
    salary = scrapy.Field()
    address = scrapy.Field()
    release = scrapy.Field()
The middlewares.py file is the unmodified Scrapy template, identical to the first project's except that the class is named ZhilianSecondSpiderMiddleware.
The pipelines.py file:
# -*- coding: utf-8 -*-


class ZhilianSecondPipeline(object):
    def open_spider(self, spider):
        self.file = open('E:/招聘岗位.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.file.write(item['title'] + "," + item['company'] + "," +
                        item['salary'] + "," + item['address'] + "," +
                        item['release'] + '\n')
        return item

    # The original defined "spoder_closed", a typo Scrapy would never
    # call; the hook Scrapy actually invokes is close_spider().
    def close_spider(self, spider):
        self.file.close()
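Joining the fields with bare commas produces a file that breaks as soon as a title or company name itself contains a comma; the standard library's csv module quotes such fields automatically. A minimal sketch of that alternative (the output path and sample row are illustrative, not from the project):

```python
import csv

# One illustrative record shaped like the spider's items.
rows = [{'title': 'Java开发工程师', 'company': 'Example, Inc.',
         'salary': '8001-10000', 'address': '北京', 'release': '06-01'}]

with open('jobs.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'company', 'salary',
                                           'address', 'release'])
    writer.writeheader()
    writer.writerows(rows)  # "Example, Inc." is quoted, not split in two
```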
The settings.py file:
# -*- coding: utf-8 -*-
BOT_NAME = 'zhilian_second'
SPIDER_MODULES = ['zhilian_second.spiders']
NEWSPIDER_MODULE = 'zhilian_second.spiders'
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    'zhilian_second.pipelines.ZhilianSecondPipeline': 300,
}
LOG_LEVEL = 'INFO'
Because the crawl is so large that waiting for it to finish would take too long, I terminated the program before it completed; even so, it had already collected hundreds of thousands of job postings.