The first spider: zhilian_fist

This crawl is split across two spiders. The first spider collects the URLs that need to be visited and stores them in a text file; the second spider reads the URLs collected by the first one and scrapes the content of each URL in turn. Run the first spider and then the second to complete the crawl; a minimal run-order sketch follows.
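The sketch below is not part of the original post: it simply invokes scrapy crawl for each project in turn, and the project directory names zhilian_fist and zhilian_second are assumptions based on the settings files shown later.

# run_both.py -- hypothetical helper script, not part of the original project.
# Runs the URL-collecting spider first, then the detail spider.
import subprocess

# Step 1: zhilian_url writes the listing URLs to the text file 'myurls'.
subprocess.run(['scrapy', 'crawl', 'zhilian_url'], cwd='zhilian_fist', check=True)

# Step 2: zhilian_second reads 'myurls' and scrapes every listing it contains.
subprocess.run(['scrapy', 'crawl', 'zhilian_second'], cwd='zhilian_second', check=True)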

This post is for learning and exchange only; please do not run it carelessly, so as not to disrupt the normal operation of the website.
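One way to keep the load on the site light is Scrapy's built-in throttling. The two settings below are a suggestion only and are not part of the original settings.py files:

# Optional throttling (an addition, not in the original project).
DOWNLOAD_DELAY = 1           # wait about one second between requests
AUTOTHROTTLE_ENABLED = True  # let Scrapy adapt the delay to server responsiveness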

Contents of the spander.py file under the spiders directory:

# -*- coding:utf-8 -*-
import scrapy
from ..items import ZhilianFistItem


class zhilian_url(scrapy.Spider):
    name = 'zhilian_url'
    start_urls = ['http://jobs.zhaopin.com/']

    def parse(self, response):
        myurl = ZhilianFistItem()
        # Collect the listing links on the index page.
        urls = response.xpath('/html/body/div/div/div/a[@target="_blank"]/@href').extract()
        for url in urls:
            myurl['url'] = url
            yield myurl

The items.py file:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
from scrapy import Item, Field


class ZhilianFistItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = Field()

The middlewares.py file:

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals


class ZhilianFistSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

The pipelines.py file:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class ZhilianFistPipeline(object):

    def open_spider(self, spider):
        print('++++++++++++    start    ++++++++++++')
        # Open the text file that will hold the collected URLs.
        self.fp = open('myurls', 'w')
        print('++++++++++++      ok     ++++++++++++')

    def process_item(self, item, spider):
        # Skip direct .htm pages; only keep listing URLs under jobs.zhaopin.com.
        if '.htm' in item['url']:
            pass
        elif 'http://jobs.zhaopin.com/' in item['url']:
            print('++++++++++++    saving    ++++++++++++')
            self.fp.writelines(item['url'] + "\n")
            print('++++++++++++     ok      ++++++++++++')
            return item
        else:
            pass

    # Scrapy calls close_spider on an item pipeline when the spider finishes;
    # the method was originally named spider_closed, which a pipeline never receives.
    def close_spider(self, spider):
        print('++++++++++++    done    ++++++++++++')
        self.fp.close()
        print('++++++++++++    ok    ++++++++++++')

The settings.py file:

# -*- coding: utf-8 -*-

BOT_NAME = 'zhilian_fist'

SPIDER_MODULES = ['zhilian_fist.spiders']
NEWSPIDER_MODULE = 'zhilian_fist.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
    'Host': 'jobs.zhaopin.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:53.0) Gecko/20100101 Firefox/53.0'
}

ITEM_PIPELINES = {
    'zhilian_fist.pipelines.ZhilianFistPipeline': 300,
}

The second spider: zhilian_second

The spander.py file:

# -*- coding:utf-8 -*-
import scrapy
from ..items import ZhilianSecondItem
from scrapy import Request
from bs4 import BeautifulSoup

class spider(scrapy.Spider):
    name = 'zhilian_second'
    start_urls = []

    def __init__(self):
        # Read the URL list produced by the first spider.
        links = open('E:/PythonWorkStation/zhilian_fist/myurls')
        for line in links:
            # The trailing newline must be stripped, otherwise the URL cannot be requested.
            line = line[:-1]
            self.start_urls.append(line)

    def parse(self, response):
        item = ZhilianSecondItem()
        title_list = response.xpath('//div/span[@class="post"]/a/text()').extract()
        company_list = response.xpath('//div/span[@class="company_name"]/a/text()').extract()
        salary_list = response.xpath('//div/span[@class="salary"]/text()').extract()
        address_list = response.xpath('//div/span[@class="address"]/text()').extract()
        release_list = response.xpath('//div/span[@class="release_time"]/text()').extract()

        # Follow the "next page" link if the page has one.
        next_links = response.xpath('//span[@class="search_page_next"]/a/@href').extract()
        if next_links:
            # Keep only the "pN" page segment of the href.
            next_url = next_links[0].split('/')[2]
            if len(response.url.split('/')) == 5:
                # First page: simply append the page segment.
                yield Request(response.url + next_url)
            elif len(response.url.split('/')) > 5:
                # Later pages: cut i characters off the end of the current URL
                # before appending the new page segment; the cut is one shorter
                # when the next page number is 10, 100, 1000 or 10000.
                i = len(next_url) + 1
                if next_url.lstrip('p') in ('10', '100', '1000', '10000'):
                    i = i - 1
                yield Request(response.url[:-i] + next_url)

        for a, s, d, f, g in zip(title_list, company_list, salary_list, address_list, release_list):
            item['title'] = a
            item['company'] = s
            item['salary'] = d
            item['address'] = f
            item['release'] = g
            yield item
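As a side note, the next-page handling above can usually be expressed more simply with response.urljoin, which resolves the extracted href against the current page URL. The lines below are only a sketch of an alternative, not the author's code, and they assume the "next page" href is a normal relative or absolute link:

# Alternative pagination sketch (assumes the next-page href is a regular link).
next_href = response.xpath('//span[@class="search_page_next"]/a/@href').extract_first()
if next_href:
    # response.urljoin builds the absolute URL, avoiding the manual slicing above.
    yield Request(response.urljoin(next_href), callback=self.parse)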

The items.py file:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class ZhilianSecondItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    company = scrapy.Field()
    salary = scrapy.Field()
    address = scrapy.Field()
    release = scrapy.Field()

The middlewares.py file:

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals


class ZhilianSecondSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

The pipelines.py file:

# -*- coding: utf-8 -*-


class ZhilianSecondPipeline(object):

    def open_spider(self, spider):
        self.file = open('E:/招聘岗位.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.file.write(item['title'] + "," + item['company'] + "," + item['salary'] + ","
                        + item['address'] + "," + item['release'] + '\n')
        return item

    # Scrapy calls close_spider on an item pipeline when the spider finishes;
    # the original "spoder_closed" typo meant the file was never explicitly closed.
    def close_spider(self, spider):
        self.file.close()

The settings.py file:

# -*- coding: utf-8 -*-
BOT_NAME = 'zhilian_second'
SPIDER_MODULES = ['zhilian_second.spiders']
NEWSPIDER_MODULE = 'zhilian_second.spiders'
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    'zhilian_second.pipelines.ZhilianSecondPipeline': 300,
}
LOG_LEVEL = 'INFO'

Because there was so much to crawl that waiting for it to finish would have taken far too long, I terminated the program before it ran to completion, but by that point it had still scraped several hundred thousand job postings.

Each scraped record is written as comma-separated fields:
(job title, company name, salary, address, posting date)
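A small reader sketch (not from the original post) shows how the output file could be loaded back for analysis. The path E:/招聘岗位.txt is the one opened by the second pipeline; note that the naive comma split will mis-parse any field that itself contains a comma.

# Hypothetical reader for the pipeline's output file.
records = []
with open('E:/招聘岗位.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip('\n').split(',')
        if len(parts) == 5:  # title, company, salary, address, release date
            records.append(dict(zip(['title', 'company', 'salary', 'address', 'release'], parts)))

print(len(records), 'records loaded')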
