1. Determine the URL and the parameters it carries

URL for fetching Java job listings in Beijing:

URL for fetching Java job listings in Shanghai:

Comparing the two URLs shows that three parameters are passed: jl is the city code, kw is the job keyword, and p is the page number of the listing currently shown.
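
As a small illustration (not part of the original write-up), the query string can be rebuilt with urllib.parse; jl=530 is the city code used in the Beijing URL above, and quoting turns the keyword 'Java开发' into 'Java%E5%BC%80%E5%8F%91':

from urllib.parse import urlencode

# Rebuild the search URL from its three parameters (jl = city code, kw = keyword, p = page).
params = {'jl': 530, 'kw': 'Java开发', 'p': 1}
print('https://sou.zhaopin.com/?' + urlencode(params))
# -> https://sou.zhaopin.com/?jl=530&kw=Java%E5%BC%80%E5%8F%91&p=1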

2. Check whether the data is loaded dynamically

Run the following code, then open the saved page in a browser from the development tool to inspect it.

import requests
from fake_useragent import UserAgent

headers = {
    'User-Agent': UserAgent().random
}
url = 'https://sou.zhaopin.com/?jl=530&kw=Java%E5%BC%80%E5%8F%91&p=1'
res = requests.get(url=url, headers=headers)
res.encoding = res.apparent_encoding
text = res.text
with open('zl_src_code.html', 'w', encoding='utf-8') as f:
    f.write(text)

Result:

This shows that the page is loaded dynamically and that login is required.

3. Log in with a cookie or Selenium and crawl

Grab the cookie from the packet-capture tool.

By inspection, some of the parameters can be dropped:

Paste all the cookie parameters into a file and delete the useless ones identified above:

Run the following code to build a cookie dict (the cookie file uses an absolute path because the cookie will later be set inside the Scrapy project):

def get_cookie():
    with open(r'E:\Code\Python\2021\PythonDemo\zl\zl\zl_cookie.txt', 'r', encoding='utf-8') as f:
        cookie_str = f.read()
    cookies = dict()
    for item in cookie_str.split('; '):
        # split only on the first '=' so cookie values containing '=' stay intact
        cookies[item.split('=', 1)[0]] = item.split('=', 1)[1]
    return cookies

Run the following code to verify that the cookie login works.

import requests
from fake_useragent import UserAgent

headers = {
    'User-Agent': UserAgent().random
}


def get_cookie():
    with open(r'E:\Code\Python\2021\PythonDemo\zl\zl\zl_cookie.txt', 'r', encoding='utf-8') as f:
        cookie_str = f.read()
    cookies = dict()
    for item in cookie_str.split('; '):
        # split only on the first '=' so cookie values containing '=' stay intact
        cookies[item.split('=', 1)[0]] = item.split('=', 1)[1]
    return cookies


url = 'https://sou.zhaopin.com/?jl=530&kw=Java%E5%BC%80%E5%8F%91&p=1'
res = requests.get(url=url, headers=headers, cookies=get_cookie())
res.encoding = res.apparent_encoding
text = res.text
with open('zl_src_code.html', 'w', encoding='utf-8') as f:
    f.write(text)

Then open the saved page in a browser from the development tool to check it.

Crawling with the cookie succeeded.

4. Keep the logged-in state with the cookie and get the 24 hot cities and their codes

Inspecting the source crawled earlier shows that the data is assigned to a JS variable and then rendered into the page dynamically, and that it is in JSON format, so the JSON needs to be parsed.

View the JSON structure with an online JSON tool:
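
The screenshot from the online tool is not reproduced here; roughly, the part of the structure that the parsing code relies on looks like the sketch below (only the fields actually used are shown, and the values are illustrative):

# Rough shape of the embedded JSON (illustrative; only the fields used below are shown).
state = {
    'baseData': {
        'hotCity': [
            {'name': '北京', 'code': '530'},
            # ... 24 hot cities in total, each with a name and a code
        ]
    }
}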

The Python code is as follows:

import requests
from fake_useragent import UserAgent
from bs4 import BeautifulSoup
import json


def get_cookie():
    with open(r'E:\Code\Python\2021\PythonDemo\zl\zl\zl_cookie.txt', 'r', encoding='utf-8') as f:
        cookie_str = f.read()
    cookies = dict()
    for item in cookie_str.split('; '):
        # split only on the first '=' so cookie values containing '=' stay intact
        cookies[item.split('=', 1)[0]] = item.split('=', 1)[1]
    return cookies


headers = {
    'User-Agent': UserAgent().random
}  # random request header
url = 'https://sou.zhaopin.com/?jl=530&kw=Java%E5%BC%80%E5%8F%91&p=1'  # the Chinese keyword is already URL-encoded
res = requests.get(url=url, headers=headers, cookies=get_cookie())
res.encoding = res.apparent_encoding  # detect the page encoding automatically
text = res.text
soup = BeautifulSoup(text, 'html.parser')
# BeautifulSoup is used here to pull the hot-city JSON out of a <script> tag (XPath did not work);
# [18:] skips the JS variable assignment that precedes the JSON data (see note above).
jsons_str = soup.select('body > script:nth-child(10)')[0].string[18:]
for city in json.loads(jsons_str)['baseData']['hotCity']:
    print('{}: {}'.format(city['name'], city['code']))

Run result:

1. Create the connection.py connection file

from redis import Redis


class RedisConnection:
    host = '127.0.0.1'
    port = 6379
    decode_responses = True
    # password = '5180'

    @classmethod
    def getConnection(cls):
        conn = None
        try:
            conn = Redis(
                host=cls.host,
                port=cls.port,
                decode_responses=cls.decode_responses,
                # password=cls.password
            )
        except Exception as e:
            print(e)
        return conn

    @classmethod
    def close(cls, conn):
        if conn:
            conn.close()

Improve the code so that the city names and codes are saved into Redis:

import requests
from fake_useragent import UserAgent
from bs4 import BeautifulSoup
import json
from connection import RedisConnection


def get_cookie():
    with open(r'E:\Code\Python\2021\PythonDemo\zl\zl\zl_cookie.txt', 'r', encoding='utf-8') as f:
        cookie_str = f.read()
    cookies = dict()
    for item in cookie_str.split('; '):
        # split only on the first '=' so cookie values containing '=' stay intact
        cookies[item.split('=', 1)[0]] = item.split('=', 1)[1]
    return cookies


headers = {
    'User-Agent': UserAgent().random
}  # random request header
url = 'https://sou.zhaopin.com/?jl=530&kw=Java%E5%BC%80%E5%8F%91&p=1'  # the Chinese keyword is already URL-encoded
res = requests.get(url=url, headers=headers, cookies=get_cookie())
res.encoding = res.apparent_encoding  # detect the page encoding automatically
text = res.text
soup = BeautifulSoup(text, 'html.parser')
# BeautifulSoup pulls the hot-city JSON out of a <script> tag (XPath did not work here)
jsons_str = soup.select('body > script:nth-child(10)')[0].string[18:]
conn = RedisConnection.getConnection()
for city in json.loads(jsons_str)['baseData']['hotCity']:
    conn.hset('zl_hotCity', city['name'], city['code'])
    print('{}: {}'.format(city['name'], city['code']))
RedisConnection.close(conn)
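
To confirm that the hash was written, a quick check (a sketch reusing the RedisConnection class above) is:

from connection import RedisConnection

# Read the whole hash back; with decode_responses=True the result is a plain dict of strings.
conn = RedisConnection.getConnection()
print(conn.hgetall('zl_hotCity'))
RedisConnection.close(conn)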

5. Write the Scrapy project

(1) Preparation
Run scrapy startproject zl to create the Scrapy project.
Run scrapy genspider zlCrawler www.xxx.com to create the spider file.
Create the study database and the MySQL table shown below:
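
The screenshot of the table is not reproduced here. As a rough sketch, assuming the study database already exists, a zl table compatible with the pipeline's insert statement used later (column names follow the item fields; types and lengths are assumptions) could be created like this:

from pymysql import connect

# Sketch of the zl table (types and lengths are assumptions); the leading auto-increment id
# matches the null placeholder in the pipeline's INSERT statement.
ddl = """
CREATE TABLE IF NOT EXISTS zl (
    id           INT PRIMARY KEY AUTO_INCREMENT,
    job_name     VARCHAR(100),
    company      VARCHAR(100),
    salary       VARCHAR(50),
    job_area     VARCHAR(50),
    time_worked  VARCHAR(50),
    qualf        VARCHAR(50),
    tec_required VARCHAR(255)
) DEFAULT CHARSET=utf8;
"""

conn = connect(host='127.0.0.1', port=3306, user='root', password='qwe12333', db='study', charset='utf8')
with conn.cursor() as cursor:
    cursor.execute(ddl)
conn.commit()
conn.close()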

(2) Edit the project settings file settings.py:

# Scrapy settings for zl project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from fake_useragent import UserAgent

BOT_NAME = 'zl'

SPIDER_MODULES = ['zl.spiders']
NEWSPIDER_MODULE = 'zl.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = UserAgent().random

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'zl.pipelines.ZlPipeline': 300,
}

(3) Edit items.py to define the crawled fields

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class ZlItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    job_name = scrapy.Field()      # job title
    company = scrapy.Field()       # company name
    salary = scrapy.Field()        # salary
    job_area = scrapy.Field()      # job location
    time_worked = scrapy.Field()   # required work experience
    qualf = scrapy.Field()         # required education
    tec_required = scrapy.Field()  # required skills

(4) Create the dbutil package and the connection.py file
connection.py contains the following:

from pymysql import connect
from redis import Redisclass MysqlConnection:host = '127.0.0.1'port = 3306user = 'root'password = 'qwe12333'db = 'study'charset = 'utf8'@classmethoddef getConnection(cls):conn = Nonetry:conn = connect(host=cls.host,port=cls.port,user=cls.user,password=cls.password,db=cls.db,charset=cls.charset)except Exception as e:print(e)return conn@classmethoddef close(cls, conn):if conn:conn.close()class RedisConnection:host = '127.0.0.1'port = 6379decode_responses = True# password = '5180'@classmethoddef getConnection(cls):conn = Nonetry:conn = Redis(host=cls.host,port=cls.port,decode_responses=cls.decode_responses,# password=cls.password)except Exception as e:print(e)return conn@classmethoddef close(cls, conn):if conn:conn.close()

(5) Edit zlCrawler.py and write the spider code

import scrapy
from zl.items import ZlItem
from zl.dbutil.connection import RedisConnection


def get_cookie():
    with open(r'E:\Code\Python\2021\PythonDemo\zl\zl\zl_cookie.txt', 'r', encoding='utf-8') as f:
        cookie_str = f.read()  # read the cookie file; note the absolute path
    cookies = dict()
    for item in cookie_str.split('; '):
        cookies[item.split('=', 1)[0]] = item.split('=', 1)[1]  # build the cookie dict
    return cookies


class ZlcrawlerSpider(scrapy.Spider):
    name = 'zlCrawler'
    # allowed_domains = ['www.xxx.com']
    start_urls = []

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.url = 'https://sou.zhaopin.com/?jl={}&kw=Java%E5%BC%80%E5%8F%91&p={}'
        self.cookies = get_cookie()
        self.item = ZlItem()  # a single item instance reused for every parsed row
        self.redis_conn = RedisConnection.getConnection()  # get a Redis connection
        self.zl_hotCity = self.redis_conn.hgetall('zl_hotCity')  # read every field of the hash
        RedisConnection.close(self.redis_conn)  # release the connection

    def start_requests(self):  # build the start URL for each city and attach the cookie
        for city_name in self.zl_hotCity:
            city_code = self.zl_hotCity[city_name]
            yield scrapy.Request(self.url.format(city_code, 1), cookies=self.cookies,
                                 meta={'city_code': city_code, 'page_num': 1})

    def parse(self, response):
        next_page = response.xpath('//div[@class="soupager"]/button[2]/@class').extract_first()
        # Too many requests can get the IP banned: the status code is still 200 but the page
        # carries no data, so check for the pager before parsing.
        if next_page is not None:
            for row in response.xpath('//div[@class="joblist-box__item clearfix"]'):
                self.item['job_name'] = row.xpath('.//span[@class="iteminfo__line1__jobname__name"]/text()').extract_first()
                self.item['company'] = row.xpath('.//span[@class="iteminfo__line1__compname__name"]/text()').extract_first()
                self.item['salary'] = row.xpath('.//p[@class="iteminfo__line2__jobdesc__salary"]/text()').extract_first().strip()
                self.item['job_area'] = row.xpath('.//ul[@class="iteminfo__line2__jobdesc__demand"]/li[1]/text()').extract_first()
                self.item['time_worked'] = row.xpath('.//ul[@class="iteminfo__line2__jobdesc__demand"]/li[2]/text()').extract_first()
                self.item['qualf'] = row.xpath('.//ul[@class="iteminfo__line2__jobdesc__demand"]/li[3]/text()').extract_first()
                self.item['tec_required'] = ' '.join(row.xpath('.//div[@class="iteminfo__line3__welfare__item"]/text()').extract())
                yield self.item
            city_code = response.meta['city_code']
            page_num = response.meta['page_num']
            if 'disable' not in next_page:  # the "next page" button is disabled on the last page
                yield scrapy.Request(url=self.url.format(city_code, page_num + 1),
                                     meta={'city_code': city_code, 'page_num': page_num + 1},
                                     callback=self.parse, cookies=self.cookies)

Too many requests may get the IP banned: the status code is still 200 but no data is returned, and you have to wait about an hour before access returns to normal.
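
One optional mitigation, not part of the original project (the values below are arbitrary), is to slow the crawl down with Scrapy's throttling settings in settings.py:

# Optional throttling in settings.py; values are arbitrary and only meant to reduce
# the chance of the IP being blocked.
DOWNLOAD_DELAY = 2                # wait between requests, in seconds
RANDOMIZE_DOWNLOAD_DELAY = True   # jitter the delay between 0.5x and 1.5x DOWNLOAD_DELAY
AUTOTHROTTLE_ENABLED = True       # let Scrapy adapt the delay to server latency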

(6) Edit pipelines.py to save the data into MySQL

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
from zl.dbutil.connection import MysqlConnection
from time import time


class ZlPipeline:
    def __init__(self):
        self.start_time = time()
        self.conn = MysqlConnection.getConnection()
        self.cursor = self.conn.cursor()
        self.sql = 'insert into zl values(null, %s, %s, %s, %s, %s, %s, %s);'
        self.count = 0

    def process_item(self, item, spider):
        try:
            self.count += self.cursor.execute(
                self.sql,
                (item['job_name'], item['company'], item['salary'], item['job_area'],
                 item['time_worked'], item['qualf'], item['tec_required']))
        except Exception as e:
            self.conn.rollback()
        if self.count % 10 == 0:
            self.conn.commit()  # commit every 10 inserted rows
        print('{}: {}--{}'.format(self.count, item['company'], item['job_name']))
        return item  # hand the item back so Scrapy can keep processing it

    def close_spider(self, spider):
        if self.cursor:
            self.cursor.close()
        self.conn.commit()
        MysqlConnection.close(self.conn)
        print('Elapsed: {} seconds'.format(time() - self.start_time))

(7) Create main.py to launch Scrapy

from scrapy import cmdline

cmdline.execute('scrapy crawl zlCrawler'.split())

Scrapy project structure diagram:

Run result:
