Analyzing the page

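First, open the Tencent careers site:
https://careers.tencent.com/search.html
Inspecting the HTML and locating a job entry shows that it has no href attribute, so the link to the detail page cannot be read straight from the page source.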

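XHR (XMLHttpRequest) is the browser API for sending asynchronous requests; jQuery's ajax is a wrapper around it, not the other way round.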

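Looking through the XHR requests in the developer tools, we find the one that returns the job data, together with its request method (GET) and address.

Request URL: https://careers.tencent.com/tencentcareer/api/post/Query?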

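Looking at the request parameters, the timestamp and pageIndex values are the ones that change between requests.
After some analysis we find:

**timestamp** is the current Unix time multiplied by 1000 and rounded to an integer (that is, milliseconds)
**pageIndex** is the page number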
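
As a minimal sketch of these two changing parameters, the helper below builds the query string for one page without touching the network. The name `build_query_params` and the `page_size` default are my own; the keys are taken from the request shown later in this post, with the empty filter fields left out for brevity (whether the server accepts the request without them is an assumption I have not verified).

```python
import time
from urllib.parse import urlencode

# Endpoint found in the XHR tab (from the post).
QUERY_URL = 'https://careers.tencent.com/tencentcareer/api/post/Query?'


def build_query_params(page_index, page_size=10):
    """Return the query parameters for one page of the job list.

    Hypothetical helper: the name and the page_size default are illustrative.
    """
    return {
        'timestamp': round(time.time() * 1000),  # milliseconds since the epoch
        'pageIndex': page_index,                 # page number, starting at 1
        'pageSize': page_size,
        'language': 'zh-cn',
        'area': 'cn',
    }


params = build_query_params(1)
# Print the full URL that a GET request with these parameters would use.
print(QUERY_URL + urlencode(params))
```

Only `timestamp` and `pageIndex` differ from one call to the next; everything else stays fixed.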

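Sending the request to find the PostId

Build the request parameters, send the request, and parse the JSON that comes back; then use jsonpath to find every PostId and save the result in Postid_data.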
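
To make it clearer what the `$..PostId` expression does, here is a standard-library-only sketch of the same "find this key at any depth" search. The function `find_all` and the sample response are made up for illustration; the real API response may be shaped differently.

```python
def find_all(node, key):
    """Collect every value stored under `key`, at any depth (like `$..key`)."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            # Keep descending: the key may also appear deeper down.
            found.extend(find_all(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_all(item, key))
    return found


# Made-up response shape, for illustration only.
sample = {
    'Code': 200,
    'Data': {
        'Count': 2,
        'Posts': [
            {'PostId': '1001', 'RecruitPostName': 'Backend Engineer'},
            {'PostId': '1002', 'RecruitPostName': 'Data Analyst'},
        ],
    },
}

print(find_all(sample, 'PostId'))  # ['1001', '1002']
```

One difference worth keeping in mind: this sketch returns an empty list when nothing matches, which is not necessarily how the jsonpath library reports a miss.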

The fields we want in the end are the job title, the job responsibilities, and the job requirements.

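    import time

    import jsonpath
    import requests

    # url is the Query address above; headers is the request-header dict
    # (both appear in the complete code)
    endpage = 10  # fetch 10 pages of data
    for i in range(1, endpage + 1):
        timestamp = round(time.time() * 1000)
        # Build the request parameters
        params = {
            'timestamp': timestamp,
            'countryId': '',
            'cityId': '',
            'bgIds': '',
            'productId': '',
            'categoryId': '',
            'parentCategoryId': '',
            'attrId': '',
            'keyword': '',
            'pageIndex': i,
            'pageSize': '10',
            'language': 'zh-cn',
            'area': 'cn',
        }
        # Send the request and parse the JSON response
        response = requests.get(url, headers=headers, params=params).json()
        # Use jsonpath to find every PostId and store the list in Postid_data
        Postid_data = jsonpath.jsonpath(response, '$..PostId')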

Requesting the detail address and saving the data

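    # url2 is the ByPostId address, headers1 the request headers, and
    # collection the MongoDB collection (all appear in the complete code)
    data_list = []
    # Loop over the PostIds found on this page
    for post_id in Postid_data:
        temp = {}
        params2 = {
            'timestamp': round(time.time() * 1000),
            'postId': post_id,
            'language': 'zh-cn',
        }
        response_xiangxi = requests.get(url2, headers=headers1, params=params2).json()
        name = jsonpath.jsonpath(response_xiangxi, '$..RecruitPostName')
        duty = jsonpath.jsonpath(response_xiangxi, '$..Responsibility')
        require = jsonpath.jsonpath(response_xiangxi, '$..Requirement')
        temp['name'] = name[0]
        temp['duty'] = duty[0]
        temp['require'] = require[0]
        # Collect the useful fields one record at a time
        data_list.append(temp)
    # insert_many rejects an empty list, so only write when there is data
    if data_list:
        collection.insert_many(data_list)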
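
One thing the loop above glosses over: to my understanding, `jsonpath.jsonpath` returns `False` rather than an empty list when nothing matches, in which case `name[0]` would raise a `TypeError`. The sketch below shows one way to guard against that. The helpers `first_or_none` and `build_record` are hypothetical names of my own, not part of jsonpath.

```python
def first_or_none(matches):
    """Return the first match, or None when there is nothing to take."""
    if not matches:  # covers False, None and an empty list
        return None
    return matches[0]


def build_record(name_matches, duty_matches, require_matches):
    """Assemble one record from three jsonpath-style result values."""
    return {
        'name': first_or_none(name_matches),
        'duty': first_or_none(duty_matches),
        'require': first_or_none(require_matches),
    }


# A posting whose Requirement field was not found (simulated with False).
print(build_record(['Backend Engineer'], ['Build services'], False))
```

Whether a record with a missing field should be stored or skipped is a separate decision; this only keeps the loop from crashing.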

Complete code, written in an object-oriented style

    import time

    import requests
    from jsonpath import jsonpath
    from pymongo import MongoClient


    class Baiduzhaopin(object):

        def __init__(self, page):
            # Job list address
            self.url_home = 'https://careers.tencent.com/tencentcareer/api/post/Query?'
            # Job detail address
            self.url_li = 'https://careers.tencent.com/tencentcareer/api/post/ByPostId?'
            self.headers = {
                'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
                'cookie': '_ga=GA1.2.1729346649.1610940890; _gcl_au=1.1.582164431.1610940891; sensorsdata2015jssdkcross=%7B%22distinct_'
                          'id%22%3A%223cebc2924a43644f09780490a12dc21b%40devS%22%2C%22first_id%22%3A%22%22%2C%22props%22%3A%7B%22%24late'
                          'st_traffic_source_type%22%3A%22%E8%87%AA%E7%84%B6%E6%90%9C%E7%B4%A2%E6%B5%81%E9%87%8F%22%2C%22%24latest_search_'
                          'keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC%22%2C%22%24latest_referrer%22%3A%22https%3A%2F%2Fwww.baidu.'
                          'com%2Flink%22%7D%2C%22%24device_id%22%3A%22177138f3ae240-0699c0d8afaf5-31346d-1327104-177138f3ae3c0e%22%7D; loading=agree',
            }
            # Set up MongoDB
            self.client = MongoClient('localhost', 27017)
            self.col = self.client['python']['Tencent_recruitment']
            self.page = page

        def postid(self, page):
            # Timestamp in milliseconds, rounded
            time_temp = round(time.time() * 1000)
            # Build the request parameters
            param = {
                'timestamp': time_temp,
                'countryId': '',
                'cityId': '',
                'bgIds': '',
                'productId': '',
                'categoryId': '',
                'parentCategoryId': '',
                'attrId': '',
                'keyword': '',
                'pageIndex': page,
                'pageSize': '10',
                'language': 'zh-cn',
                'area': 'cn',
            }
            # Send the request
            response = requests.get(self.url_home, headers=self.headers, params=param).json()
            # Find the PostIds in the JSON data; jsonpath is really handy
            postid_data = jsonpath(response, '$..PostId')
            return postid_data

        def target_data(self, postid):
            # Nothing to do if no PostId was found on this page
            if not postid:
                return
            # Fetch and save the data
            for post_id in postid:
                temp = {}
                params = {
                    'timestamp': round(time.time() * 1000),
                    'postId': post_id,
                    'language': 'zh-cn',
                }
                response_target = requests.get(self.url_li, headers=self.headers, params=params).json()
                name = jsonpath(response_target, '$..RecruitPostName')
                duty = jsonpath(response_target, '$..Responsibility')
                require = jsonpath(response_target, '$..Requirement')
                temp['name'] = name[0]
                temp['duty'] = duty[0]
                temp['require'] = require[0]
                print(temp)
                self.col.insert_one(temp)

        def run(self):
            for page in range(1, self.page + 1):
                # Request the list address and get the PostIds
                post_id = self.postid(page)
                # Request each detail address, then print the data and save it to MongoDB
                self.target_data(post_id)


    if __name__ == '__main__':
        # Crawl 10 pages of data
        baidu = Baiduzhaopin(10)
        baidu.run()
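
The overall flow of the class (list page, then PostIds, then one detail request per PostId, then save) can be tried out without the network or MongoDB by passing the three steps in as functions. This is only a sketch of that idea: `crawl`, its parameters, and the stand-in data are all hypothetical and not part of the code above.

```python
import time


def crawl(pages, fetch_post_ids, fetch_detail, save, delay=0.0):
    """Walk pages 1..pages, fetch each posting's detail, pass it to `save`.

    Returns the number of records saved.
    """
    saved = 0
    for page in range(1, pages + 1):
        # `or []` tolerates a falsy result when a page has no PostIds.
        for post_id in fetch_post_ids(page) or []:
            save(fetch_detail(post_id))
            saved += 1
            if delay:
                time.sleep(delay)  # optional pause between detail requests
    return saved


# Stand-ins for the real requests and the MongoDB collection.
fake_ids = {1: ['a', 'b'], 2: ['c']}
store = []
count = crawl(
    pages=2,
    fetch_post_ids=lambda page: fake_ids.get(page, []),
    fetch_detail=lambda post_id: {'postId': post_id},
    save=store.append,
)
print(count, store)
```

With the real class, the three callables would correspond roughly to the list request, the detail request, and `insert_one`; a non-zero `delay` spaces out the detail requests.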

Results

In MongoDB

Summary

I am new to web scraping and this code is far from perfect. Corrections are welcome, thank you!

