Scraping Java job listings for 24 popular cities from Zhilian Zhaopin
1. Determining the URL and the parameters it carries
URL for Java job listings in Beijing:
URL for Java job listings in Shanghai:
Comparing the two shows the URL carries three parameters: jl is the city code, kw is the job keyword, and p is the current page of the listing results.
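As a quick sketch, a search URL like the ones above can be assembled from those three parameters with the standard library (the city code 530 and the keyword simply mirror the example URL used later in this article):

```python
from urllib.parse import urlencode

# Assemble a search URL from the three parameters described above:
# jl = city code, kw = job keyword, p = page number.
params = {'jl': 530, 'kw': 'Java开发', 'p': 1}
url = 'https://sou.zhaopin.com/?' + urlencode(params)
print(url)  # the Chinese keyword is percent-encoded automatically
```

This produces exactly the URL-encoded form seen in the browser's address bar.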
2. Checking whether the data is loaded dynamically
Run the following code, then open the saved page in a browser from the development tools to inspect it:
import requests
from fake_useragent import UserAgent

headers = {'User-Agent': UserAgent().random}
url = 'https://sou.zhaopin.com/?jl=530&kw=Java%E5%BC%80%E5%8F%91&p=1'
res = requests.get(url=url, headers=headers)
res.encoding = res.apparent_encoding
text = res.text
with open('zl_src_code.html', 'w', encoding='utf-8') as f:
    f.write(text)
Result:
This shows the page is rendered dynamically and requires login.
3. Logging in with a cookie (or Selenium) to scrape
Capture the cookie in the browser's packet-capture tool.
Observation shows that some of its parameters can be dropped:
Paste all of the cookie parameters into a file, then delete the useless parameters identified above:
Run the following code to build a cookie dictionary (the cookie file uses an absolute path because the cookie will later be set inside the Scrapy project):
def get_cookie():
    with open(r'E:\Code\Python\2021\PythonDemo\zl\zl\zl_cookie.txt', 'r', encoding='utf-8') as f:
        cookie_str = f.read()
    cookies = dict()
    for item in cookie_str.split('; '):
        key, value = item.split('=', 1)  # split on the first '=' only, so '=' inside a value survives
        cookies[key] = value
    return cookies
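To illustrate the parsing logic on a made-up cookie string (all names and values below are invented, not from the real site):

```python
# A made-up cookie string; real ones come from the browser's capture tool.
cookie_str = 'sessionid=abc123; token=xyz=42; city=530'

cookies = dict()
for item in cookie_str.split('; '):
    key, value = item.split('=', 1)  # split on the first '=' only, so '=' inside a value survives
    cookies[key] = value

print(cookies)  # {'sessionid': 'abc123', 'token': 'xyz=42', 'city': '530'}
```

Note that splitting on every '=' would truncate the token value at 'xyz', which is why the one-split form matters.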
To verify that the cookie login works, run the following code:
import requests
from fake_useragent import UserAgent

headers = {'User-Agent': UserAgent().random}

def get_cookie():
    with open(r'E:\Code\Python\2021\PythonDemo\zl\zl\zl_cookie.txt', 'r', encoding='utf-8') as f:
        cookie_str = f.read()
    cookies = dict()
    for item in cookie_str.split('; '):
        key, value = item.split('=', 1)
        cookies[key] = value
    return cookies

url = 'https://sou.zhaopin.com/?jl=530&kw=Java%E5%BC%80%E5%8F%91&p=1'
res = requests.get(url=url, headers=headers, cookies=get_cookie())
res.encoding = res.apparent_encoding
text = res.text
with open('zl_src_code.html', 'w', encoding='utf-8') as f:
    f.write(text)
Then open the saved file in a browser to check.
The cookie-based scrape succeeded.
4. Keeping the login state with the cookie, then extracting the 24 popular cities and their codes
Inspecting the previously scraped source shows the data is assigned to a JavaScript variable and rendered into the page dynamically; the payload is in JSON form, so it needs to be parsed.
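As a self-contained illustration of that pattern (the variable name __INITIAL_STATE__ and the miniature HTML below are invented for this sketch; the real page's script tag and assignment prefix differ):

```python
import json
import re

# A miniature page embedding JSON in a JS variable, mimicking the structure described above.
html = ('<html><body><script>__INITIAL_STATE__='
        '{"baseData":{"hotCity":[{"name":"Beijing","code":"530"}]}}'
        '</script></body></html>')

# Strip the JavaScript assignment prefix, then parse the remainder as JSON.
match = re.search(r'__INITIAL_STATE__=(\{.*\})</script>', html)
data = json.loads(match.group(1))
for city in data['baseData']['hotCity']:
    print('{}: {}'.format(city['name'], city['code']))  # Beijing: 530
```

The real code below does the prefix-stripping with a fixed slice instead of a regex, since the assignment prefix there has a known length.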
Inspect the JSON structure with an online JSON viewer:
The Python code is as follows:
import requests
from fake_useragent import UserAgent
from bs4 import BeautifulSoup
import json

def get_cookie():
    with open(r'E:\Code\Python\2021\PythonDemo\zl\zl\zl_cookie.txt', 'r', encoding='utf-8') as f:
        cookie_str = f.read()
    cookies = dict()
    for item in cookie_str.split('; '):
        key, value = item.split('=', 1)
        cookies[key] = value
    return cookies

headers = {'User-Agent': UserAgent().random}  # random request header
url = 'https://sou.zhaopin.com/?jl=530&kw=Java%E5%BC%80%E5%8F%91&p=1'  # the Chinese keyword is already URL-encoded
res = requests.get(url=url, headers=headers, cookies=get_cookie())
res.encoding = res.apparent_encoding  # detect the page encoding automatically
text = res.text
soup = BeautifulSoup(text, 'html.parser')
# Extract the hot-city JSON with BeautifulSoup (XPath failed to match this node);
# the [18:] slice strips the JavaScript assignment prefix.
jsons_str = soup.select('body > script:nth-child(10)')[0].string[18:]
for city in json.loads(jsons_str)['baseData']['hotCity']:
    print('{}: {}'.format(city['name'], city['code']))
Run result:
1. Create the connection file connection.py
from redis import Redis


class RedisConnection:
    host = '127.0.0.1'
    port = 6379
    decode_responses = True
    # password = '5180'

    @classmethod
    def getConnection(cls):
        conn = None
        try:
            conn = Redis(host=cls.host,
                         port=cls.port,
                         decode_responses=cls.decode_responses,
                         # password=cls.password
                         )
        except Exception as e:
            print(e)
        return conn

    @classmethod
    def close(cls, conn):
        if conn:
            conn.close()
Improve the code so that the city names and codes are saved into Redis:
import requests
from fake_useragent import UserAgent
from bs4 import BeautifulSoup
import json
from connection import RedisConnection

def get_cookie():
    with open(r'E:\Code\Python\2021\PythonDemo\zl\zl\zl_cookie.txt', 'r', encoding='utf-8') as f:
        cookie_str = f.read()
    cookies = dict()
    for item in cookie_str.split('; '):
        key, value = item.split('=', 1)
        cookies[key] = value
    return cookies

headers = {'User-Agent': UserAgent().random}  # random request header
url = 'https://sou.zhaopin.com/?jl=530&kw=Java%E5%BC%80%E5%8F%91&p=1'  # the Chinese keyword is already URL-encoded
res = requests.get(url=url, headers=headers, cookies=get_cookie())
res.encoding = res.apparent_encoding  # detect the page encoding automatically
text = res.text
soup = BeautifulSoup(text, 'html.parser')
jsons_str = soup.select('body > script:nth-child(10)')[0].string[18:]  # hot-city JSON (XPath could not match this node)
conn = RedisConnection.getConnection()
for city in json.loads(jsons_str)['baseData']['hotCity']:
    conn.hset('zl_hotCity', city['name'], city['code'])  # store name -> code in a Redis hash
    print('{}: {}'.format(city['name'], city['code']))
RedisConnection.close(conn)
5. Writing the Scrapy project
(1) Preparation
Run scrapy startproject zl to create the Scrapy project.
Run scrapy genspider zlCrawler www.xxx.com to create the spider file.
Create the study database and the MySQL table as shown:
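The table definition itself is not reproduced here, but from the pipeline's INSERT later in this article (an auto-increment id plus the seven item fields) a plausible schema is the following sketch; the column types and lengths are assumptions:

```sql
-- Assumed schema for the zl table: column names follow the item fields,
-- but the types and lengths are guesses.
CREATE TABLE zl (
    id INT PRIMARY KEY AUTO_INCREMENT,
    job_name VARCHAR(255),
    company VARCHAR(255),
    salary VARCHAR(64),
    job_area VARCHAR(64),
    time_worked VARCHAR(64),
    qualf VARCHAR(64),
    tec_required VARCHAR(512)
) DEFAULT CHARSET = utf8;
```

The leading id column is why the INSERT statement passes null as its first value.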
(2) Edit the project configuration file settings.py:
# Scrapy settings for zl project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from fake_useragent import UserAgent

BOT_NAME = 'zl'

SPIDER_MODULES = ['zl.spiders']
NEWSPIDER_MODULE = 'zl.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = UserAgent().random

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'zl.pipelines.ZlPipeline': 300,
}
(3) Edit items.py to define the fields to scrape
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class ZlItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    job_name = scrapy.Field()      # job title
    company = scrapy.Field()       # company name
    salary = scrapy.Field()        # salary
    job_area = scrapy.Field()      # work location
    time_worked = scrapy.Field()   # required work experience
    qualf = scrapy.Field()         # required education
    tec_required = scrapy.Field()  # required skills
(4) Create the dbutil package and its connection.py file
connection.py contains:
from pymysql import connect
from redis import Redis


class MysqlConnection:
    host = '127.0.0.1'
    port = 3306
    user = 'root'
    password = 'qwe12333'
    db = 'study'
    charset = 'utf8'

    @classmethod
    def getConnection(cls):
        conn = None
        try:
            conn = connect(host=cls.host,
                           port=cls.port,
                           user=cls.user,
                           password=cls.password,
                           db=cls.db,
                           charset=cls.charset)
        except Exception as e:
            print(e)
        return conn

    @classmethod
    def close(cls, conn):
        if conn:
            conn.close()


class RedisConnection:
    host = '127.0.0.1'
    port = 6379
    decode_responses = True
    # password = '5180'

    @classmethod
    def getConnection(cls):
        conn = None
        try:
            conn = Redis(host=cls.host,
                         port=cls.port,
                         decode_responses=cls.decode_responses,
                         # password=cls.password
                         )
        except Exception as e:
            print(e)
        return conn

    @classmethod
    def close(cls, conn):
        if conn:
            conn.close()
(5) Edit zlCrawler.py and write the spider code
import scrapy
from zl.items import ZlItem
from zl.dbutil.connection import RedisConnection


def get_cookie():
    with open(r'E:\Code\Python\2021\PythonDemo\zl\zl\zl_cookie.txt', 'r', encoding='utf-8') as f:
        cookie_str = f.read()  # read the cookie file (note: absolute path)
    cookies = dict()
    for item in cookie_str.split('; '):
        key, value = item.split('=', 1)
        cookies[key] = value  # build the cookie dict
    return cookies


class ZlcrawlerSpider(scrapy.Spider):
    name = 'zlCrawler'
    # allowed_domains = ['www.xxx.com']
    start_urls = []

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.url = 'https://sou.zhaopin.com/?jl={}&kw=Java%E5%BC%80%E5%8F%91&p={}'
        self.cookies = get_cookie()
        self.redis_conn = RedisConnection.getConnection()  # get a Redis connection
        self.zl_hotCity = self.redis_conn.hgetall('zl_hotCity')  # read every field of the hash
        RedisConnection.close(self.redis_conn)  # release the connection

    def start_requests(self):  # issue the first request per city, attaching the cookie
        for city_name in self.zl_hotCity:
            city_code = self.zl_hotCity[city_name]
            yield scrapy.Request(self.url.format(city_code, 1), cookies=self.cookies,
                                 meta={'city_code': city_code, 'page_num': 1})

    def parse(self, response):
        next_page = response.xpath('//div[@class="soupager"]/button[2]/@class').extract_first()
        # Too many requests can get the IP banned: the status code is still 200
        # but the page carries no data, so check for the pager first.
        if next_page is not None:
            for row in response.xpath('//div[@class="joblist-box__item clearfix"]'):
                item = ZlItem()  # a fresh item per row, so concurrent requests cannot overwrite each other
                item['job_name'] = row.xpath('.//span[@class="iteminfo__line1__jobname__name"]/text()').extract_first()
                item['company'] = row.xpath('.//span[@class="iteminfo__line1__compname__name"]/text()').extract_first()
                item['salary'] = row.xpath('.//p[@class="iteminfo__line2__jobdesc__salary"]/text()').extract_first().strip()
                item['job_area'] = row.xpath('.//ul[@class="iteminfo__line2__jobdesc__demand"]/li[1]/text()').extract_first()
                item['time_worked'] = row.xpath('.//ul[@class="iteminfo__line2__jobdesc__demand"]/li[2]/text()').extract_first()
                item['qualf'] = row.xpath('.//ul[@class="iteminfo__line2__jobdesc__demand"]/li[3]/text()').extract_first()
                item['tec_required'] = ' '.join(row.xpath('.//div[@class="iteminfo__line3__welfare__item"]/text()').extract())
                yield item
            city_code = response.meta['city_code']
            page_num = response.meta['page_num']
            if 'disable' not in next_page:
                yield scrapy.Request(url=self.url.format(city_code, page_num + 1),
                                     meta={'city_code': city_code, 'page_num': page_num + 1},
                                     callback=self.parse, cookies=self.cookies)
Too many requests can get the IP banned: the status code stays 200 but no data comes back; normal access resumes after about an hour.
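One standard mitigation (not part of the original project) is to throttle the crawl in settings.py; these are built-in Scrapy settings, and the values below are only illustrative:

```python
# Built-in Scrapy throttling settings; the values are illustrative, not tuned.
DOWNLOAD_DELAY = 2               # pause between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # vary the delay (0.5x-1.5x) to look less robotic
AUTOTHROTTLE_ENABLED = True      # adapt the delay to the observed server latency
```

A longer delay trades crawl speed for a lower chance of hitting the empty-page ban described above.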
(6) Edit pipelines.py to save the data into MySQL
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from zl.dbutil.connection import MysqlConnection
from time import time


class ZlPipeline:
    def __init__(self):
        self.start_time = time()
        self.conn = MysqlConnection.getConnection()
        self.cursor = self.conn.cursor()
        self.sql = 'insert into zl values(null, %s, %s, %s, %s, %s, %s, %s);'
        self.count = 0

    def process_item(self, item, spider):
        try:
            self.count += self.cursor.execute(
                self.sql,
                (item['job_name'], item['company'], item['salary'], item['job_area'],
                 item['time_worked'], item['qualf'], item['tec_required']))
        except Exception as e:
            print(e)
            self.conn.rollback()
        if self.count % 10 == 0:
            self.conn.commit()  # commit every 10 inserted rows
        print('{}: {}--{}'.format(self.count, item['company'], item['job_name']))
        return item  # return the item so later pipelines can still process it

    def close_spider(self, spider):
        if self.cursor:
            self.cursor.close()
        self.conn.commit()  # commit whatever is left
        MysqlConnection.close(self.conn)
        print('Elapsed: {} seconds'.format(time() - self.start_time))
(7) Create main.py to launch Scrapy
from scrapy import cmdline

cmdline.execute('scrapy crawl zlCrawler'.split())
Scrapy project structure:
Run result: