Scraping Job Listings from the 58同城 (58.com) Recruitment Site with the Scrapy Framework

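Background: for a hands-on project course at school, we were asked to scrape job listings from a recruitment website and analyze them. Here I share the process and results of scraping the 58同城 recruitment site.
Preparation:
1. Set up the environment: install everything Scrapy needs first. I won't go into detail here; a search on Baidu turns up many tutorials. Some are incomplete or inaccurate, so read several of them and get the environment working before anything else. I did the installation on Windows 7.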

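2. Once the environment is ready, study the structure of the Scrapy framework and how a crawl runs. There are plenty of introductions online, so I won't repeat them either. To quote the Baidu Baike description: Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. It has a wide range of uses, including data mining, monitoring, and automated testing.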

There is also a Chinese-language site about Scrapy that is worth studying. For this project I only worked through its first few topics.

Writing the code:

1. Create a new project from cmd.

scrapy startproject tc
(tc, short for 58同城, is the project name)

2. Write the project's items class:


# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TcItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()    # job title
    Cpname = scrapy.Field()  # company name
    pay = scrapy.Field()     # salary
    edu = scrapy.Field()     # education requirement
    num = scrapy.Field()     # number of openings
    year = scrapy.Field()    # years of experience required
    FL = scrapy.Field()      # benefits

These are the fields I defined for the data I wanted to scrape.
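Because each field is later filled from an XPath `.extract()` call, every value arrives as a list of strings. The following is a minimal sketch of how those lists could be collapsed before analysis. It uses a plain dict as a stand-in for `TcItem` so that it runs without Scrapy; the helper names `first_text` and `flatten_item`, the sample values, and the choice to keep `FL` as a list are all assumptions for illustration, not part of the original project.

```python
# Field names copied from TcItem; everything else here is illustrative.
FIELDS = ("name", "Cpname", "pay", "edu", "num", "year", "FL")


def first_text(values):
    """Return the first non-empty, stripped string in values, or ""."""
    for value in values:
        text = value.strip()
        if text:
            return text
    return ""


def flatten_item(raw):
    """Collapse list-valued fields to single strings.

    FL (benefits) is assumed to hold several entries, so it stays a list.
    """
    flat = {}
    for field in FIELDS:
        values = raw.get(field, [])
        if field == "FL":
            flat[field] = [v.strip() for v in values if v.strip()]
        else:
            flat[field] = first_text(values)
    return flat


# Hypothetical sample values, not real scraped data.
sample = {
    "name": ["  Python Developer "],
    "Cpname": ["Example Co."],
    "pay": ["6000-8000"],
    "edu": ["\n", "Bachelor"],
    "num": ["3"],
    "year": ["1-2 years"],
    "FL": [" insurance ", "", "paid leave"],
}
print(flatten_item(sample))
```

In a real Scrapy project this kind of cleanup would more naturally live in an item pipeline; the dict version is only meant to show the shape of the data.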
3. In the spiders directory I created tc_spider.py. Its code is as follows:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request

from tc.items import TcItem


class TcSpider(scrapy.Spider):
    name = 'tc'
    allowed_domains = ['jn.58.com']

    theurl = "http://jn.58.com/tech/pn"
    theurl2 = "/?utm_source=market&spm=b-31580022738699-me-f-824.bdpz_biaoti&PGTID=0d303655-0010-915b-ca53-cb17de8b2ef6&ClickID=3"

    # Listing pages pn1 through pn76.
    start_urls = []
    for n in range(1, 77):
        start_urls.append(theurl + str(n) + theurl2)

    def parse(self, response):
        # Collect the detail-page link of every job on the listing page.
        hrefs = response.xpath("//*[@id='infolist']/dl/dt/a/@href").extract()
        for href in hrefs:
            yield Request(response.urljoin(href), callback=self.parse_item)

        # Pagination
        # next_page = response.xpath("//*[@class='nextMsg']/a/@href")
        # if next_page:
        #     url = response.urljoin(next_page[0].extract())
        #     yield scrapy.Request(url, self.parse)

    def parse_item(self, response):
        # Each field holds the list of strings returned by extract().
        item = TcItem()
        item['name'] = response.xpath("//*[@class='headConLeft']/h1/text()").extract()
        item['Cpname'] = response.xpath("//*[@class='company']/a/text()").extract()
        item['pay'] = response.xpath("//*[@class='salaNum']/strong/text()").extract()
        item['edu'] = response.xpath("//*[@class='xq']/ul/li[1]/div[2]/text()").extract()
        item['num'] = response.xpath("//*[@class='xq']/ul/li[2]/div[1]/text()").extract()
        item['year'] = response.xpath("//*[@class='xq']/ul/li[2]/div[2]/text()").extract()
        item['FL'] = response.xpath("//*[@class='cbSum']/span/text()").extract()
        yield item
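The spider starts from a fixed range of listing pages, pn1 through pn76. As a rough sketch, the same list of URLs can be generated by a small standalone function. The name `listing_urls` and its parameters are hypothetical, and the sketch assumes the long tracking query string is optional, which the original post does not confirm.

```python
BASE_URL = "http://jn.58.com/tech/pn"


def listing_urls(first_page=1, last_page=76, query=""):
    """Build the listing URLs pn<first_page> .. pn<last_page>, inclusive.

    query is appended after the trailing slash, e.g. "?utm_source=market".
    """
    return [
        f"{BASE_URL}{page}/{query}"
        for page in range(first_page, last_page + 1)
    ]


urls = listing_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

Keeping the page range in one place makes it easy to test the spider on two or three pages before running it over all of them.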

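The code has four main parts: (1) defining the site and the range of pages to crawl; (2) writing the XPath expression for each field; (3) looping over the link to each job posting, which lets the spider follow it and scrape the static information on the detail page; (4) continuous crawling, that is, looping over the listing pages.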
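Part (3), the loop over job links, can be tried out without running a crawl. The sketch below applies the same XPath shape to a tiny, hypothetical listing snippet using the standard library's `xml.etree.ElementTree` in place of Scrapy's selectors. The markup, the sample hrefs, and the function name `detail_links` are assumptions; ElementTree also only accepts well-formed XML and a limited XPath subset, so real pages still need Scrapy's own selectors.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, simplified listing markup; the real page is far larger.
SAMPLE_PAGE = """
<html><body>
  <div id="infolist">
    <dl><dt><a href="/tech/1001x.shtml">Python Developer</a></dt></dl>
    <dl><dt><a href="http://jn.58.com/tech/1002x.shtml">Java Developer</a></dt></dl>
  </div>
</body></html>
"""


def detail_links(page_source, page_url):
    """Return absolute detail-page URLs found under the element with id 'infolist'."""
    root = ET.fromstring(page_source.strip())
    anchors = root.findall(".//*[@id='infolist']/dl/dt/a")
    # urljoin resolves relative hrefs against the listing page URL.
    return [urljoin(page_url, a.get("href")) for a in anchors if a.get("href")]


links = detail_links(SAMPLE_PAGE, "http://jn.58.com/tech/pn1/")
for link in links:
    print(link)
```

Resolving each href with `urljoin` mirrors what `response.urljoin` does inside a spider, so relative links do not break the follow-up requests.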
