51job (前程无忧) crawler: for learning purposes only

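51job job listing URL: https://search.51job.com/list/090200,000000,0000,00,9,99,%25E5%25A4%25A7%25E6%2595%25B0%25E6%258D%25AE,2,1.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=
Start by right-clicking the page and choosing Inspect to analyze it. Here we have already located the links to the detail pages.
Each detail-page link sits in the href attribute of an a tag, so we can extract it with XPath:
urls = html.xpath("//div[@class='dw_table']//div[@class='el']/p/span/a/@href")
Requesting one of these URLs takes us to the detail page.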
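
To make the link extraction concrete without touching the network, here is a minimal sketch on a hypothetical, well-formed snippet that only imitates the class names used in the XPath above (the markup and the example URLs are assumptions, not copied from the site). It uses the standard library's xml.etree.ElementTree instead of lxml, which supports only a subset of XPath and cannot select @href directly, so the attribute is read with .get("href").

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the listing markup; the real page is larger
# and is parsed with lxml in this article.
SAMPLE = """
<html><body>
  <div class="dw_table">
    <div class="el title"><span>header row</span></div>
    <div class="el">
      <p><span><a href="https://jobs.51job.com/example/1.html">Job A</a></span></p>
    </div>
    <div class="el">
      <p><span><a href="https://jobs.51job.com/example/2.html">Job B</a></span></p>
    </div>
  </div>
</body></html>
"""


def extract_links(markup):
    """Return the href of every <a> under div.dw_table // div.el / p / span."""
    root = ET.fromstring(markup.strip())
    anchors = root.findall(".//div[@class='dw_table']//div[@class='el']/p/span/a")
    return [a.get("href") for a in anchors]


links = extract_links(SAMPLE)
print(links)
```

Note that @class='el' is an exact string comparison, so the row whose class is "el title" is skipped. The same holds for the lxml expression used in this article.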

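This is the detail page. All the information shown on the left can be found in the source code on the right, so the next step is simply to extract it with XPath.
That completes the page analysis; the rest is up to the code.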

Function that extracts the detail-page links from the listing pages:

def get_urls():
    for i in range(1, 46):  # limit the number of pages
        print("Fetching page {}".format(i))
        url = 'https://search.51job.com/list/090200,000000,0000,00,9,99,%25E5%25A4%25A7%25E6%2595%25B0%25E6%258D%25AE,2,{}.html?'.format(i)
        response = requests.get(url, headers=headers)
        html = etree.HTML(response.text)
        # Collect the href of every job entry on this listing page
        urls = html.xpath("//div[@class='dw_table']//div[@class='el']/p/span/a/@href")
        # print(urls)
        parse_urls(urls)
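
The listing URL can be read as a template: the number just before .html is the page index, and the long %25E5... segment is the search keyword 大数据 ("big data") percent-encoded twice. The sketch below rebuilds that URL with the standard library. The helper name build_list_url and the LIST_URL constant are hypothetical, and whether the site accepts other keywords in the same position is an assumption that has not been checked here.

```python
from urllib.parse import quote

# Template taken from the URL used in get_urls(); only the keyword and
# page slots are parameterized (hypothetical helper, not from the article).
LIST_URL = (
    "https://search.51job.com/list/090200,000000,0000,00,9,99,"
    "{keyword},2,{page}.html?"
)


def build_list_url(keyword, page):
    """Return the listing URL for a keyword and a 1-based page number."""
    # First pass turns the UTF-8 bytes into %XX; the second pass turns
    # each "%" into "%25", which is the form seen in the article's URL.
    encoded = quote(quote(keyword))
    return LIST_URL.format(keyword=encoded, page=page)


for page in range(1, 4):
    print(build_list_url("大数据", page))
```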

Parse the detail pages and extract the data:

def parse_urls(urls):
    for ul in urls:
        try:
            print(ul)
            response = requests.get(ul, headers=headers)
            response.encoding = 'gbk'
            html = etree.HTML(response.text)
            # print(response.text)
            position_name = html.xpath("//div[@class='cn']/h1/text()")[0]  # job title
            company_name = html.xpath("/html/body/div[3]/div[2]/div[2]/div/div[1]/p[1]/a[1]/text()")[0]  # company name
            address = html.xpath("//div[@class='cn']/p[2]/text()")[0]  # location
            salary = html.xpath("//div[@class='cn']/strong/text()")[0]  # salary
            induction_requirements = html.xpath("//div[@class='cn']/p[2]/text()")[1]  # entry requirements
            education = html.xpath("//div[@class='cn']/p[2]/text()")[2]  # education
            number = html.xpath("//div[@class='cn']/p[2]/text()")[3]  # number of openings
            release_time = html.xpath("//div[@class='cn']/p[2]/text()")[4]  # posting date
            print(position_name, company_name, address, salary, induction_requirements, education, number, release_time)
            datas = [position_name, company_name, address, salary, induction_requirements, education, number, release_time]
            # writer.writerow(datas)
        except Exception as e:
            # A missing field raises IndexError above, so the whole record is skipped
            print('Error: {}; incomplete data, record discarded'.format(e))
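
parse_urls() reads five fields by position from the same list of text nodes and discards the whole record when any index is missing. As an alternative, the sketch below maps those text nodes to named fields in one place and pads missing trailing fields with empty strings instead of dropping the record. The helper name split_summary and the sample values are hypothetical, and it assumes the text nodes may carry surrounding whitespace such as non-breaking spaces.

```python
# Field names follow the variable names used in parse_urls().
FIELDS = ("address", "induction_requirements", "education", "number", "release_time")


def split_summary(texts):
    """Map the summary line's text nodes to named fields.

    Whitespace (including non-breaking spaces) is stripped, empty nodes
    are dropped, and missing trailing fields become empty strings.
    """
    cleaned = [t.strip() for t in texts if t.strip()]
    padded = cleaned + [""] * (len(FIELDS) - len(cleaned))
    return dict(zip(FIELDS, padded))


# Made-up input that imitates a summary line with only three fields present.
sample = ["  Shanghai\xa0\xa0", "\xa0\xa03-4 years\xa0\xa0", "\xa0\xa0Bachelor"]
print(split_summary(sample))
```

Padding keeps partially filled records, but it cannot tell which field is absent: if a middle field is missing, the later values shift left by one position, just as the positional indexing in parse_urls() would.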

The complete code:

# Time: 2020/03/29
# Author: 渔戈
import requests
from lxml import etree
import csv

# Write the data to a CSV file
fp = open('前程无忧.csv', 'a', encoding='utf-8', newline='')
writer = csv.writer(fp)  # create the CSV writer
header = ['position_name', 'company_name', 'address', 'salary', 'induction_requirements', 'education', 'number', 'release_time']
writer.writerow(header)  # write the header row
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36',
}


def get_urls():
    for i in range(1, 46):  # limit the number of pages; there are at most 45
        print("Fetching page {}".format(i))
        url = 'https://search.51job.com/list/090200,000000,0000,00,9,99,%25E5%25A4%25A7%25E6%2595%25B0%25E6%258D%25AE,2,{}.html?'.format(i)
        response = requests.get(url, headers=headers)
        html = etree.HTML(response.text)
        # Collect the href of every job entry on this listing page
        urls = html.xpath("//div[@class='dw_table']//div[@class='el']/p/span/a/@href")
        # print(urls)
        parse_urls(urls)


def parse_urls(urls):
    for ul in urls:
        try:
            print(ul)
            response = requests.get(ul, headers=headers)
            response.encoding = 'gbk'
            html = etree.HTML(response.text)
            # print(response.text)
            position_name = html.xpath("//div[@class='cn']/h1/text()")[0]  # job title
            company_name = html.xpath("/html/body/div[3]/div[2]/div[2]/div/div[1]/p[1]/a[1]/text()")[0]  # company name
            address = html.xpath("//div[@class='cn']/p[2]/text()")[0]  # location
            salary = html.xpath("//div[@class='cn']/strong/text()")[0]  # salary
            induction_requirements = html.xpath("//div[@class='cn']/p[2]/text()")[1]  # entry requirements
            education = html.xpath("//div[@class='cn']/p[2]/text()")[2]  # education
            number = html.xpath("//div[@class='cn']/p[2]/text()")[3]  # number of openings
            release_time = html.xpath("//div[@class='cn']/p[2]/text()")[4]  # posting date
            print(position_name, company_name, address, salary, induction_requirements, education, number, release_time)
            datas = [position_name, company_name, address, salary, induction_requirements, education, number, release_time]
            writer.writerow(datas)
        except Exception as e:
            # A missing field raises IndexError above, so the whole record is skipped
            print('Error: {}; incomplete data, record discarded'.format(e))


if __name__ == '__main__':
    get_urls()
    fp.close()
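
Because the script opens the CSV file in append mode and always writes the header, running it a second time adds another header row in the middle of the data. One possible way around this, sketched below, is to write the header only when the file is new or empty and to let a with block close the file. The helper name append_rows and the demo file are hypothetical; the column names are the ones from the script above.

```python
import csv
import os
import tempfile

HEADER = ['position_name', 'company_name', 'address', 'salary',
          'induction_requirements', 'education', 'number', 'release_time']


def append_rows(path, rows):
    """Append rows to a CSV file, writing the header only once."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    # The with block closes the file even if writing raises an exception.
    with open(path, 'a', encoding='utf-8', newline='') as fp:
        writer = csv.writer(fp)
        if is_new:
            writer.writerow(HEADER)
        writer.writerows(rows)


# Demo with placeholder rows in a temporary directory: two separate
# "runs" still leave exactly one header row in the file.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
append_rows(demo_path, [['a'] * len(HEADER)])
append_rows(demo_path, [['b'] * len(HEADER)])

with open(demo_path, encoding='utf-8', newline='') as fp:
    lines = list(csv.reader(fp))
print(len(lines))
```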
