Lagou (拉勾网) Job-Posting Scraper Project

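This project scrapes Python job postings. The data is not used commercially; this is just practice for a beginner. If you spot problems in the code, corrections are welcome. Thanks.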

Project overview:
Target site: Lagou (拉勾网)
URL: https://www.lagou.com/
Search keyword: python
Tech stack: selenium + bs4 + time + re + xlwt
Scrape date: 2020.08.11
Author: YRH

1. Import the libraries

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time
import re
import xlwt

2. Create the browser object and open the site

driver = webdriver.Chrome()  # create a Chrome instance
driver.get("https://www.lagou.com/")  # open the site
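After the page opens, elements such as the search box may not be present right away, and the complete code in this post handles that with fixed `time.sleep` calls. As a hedged alternative, the sketch below polls until a condition succeeds or a timeout expires. The helper `wait_for` and the stand-in `fake_lookup` are hypothetical names introduced for illustration, and only the standard library is used so the idea can be tried without a browser.

```python
import time


def wait_for(condition, timeout=10.0, interval=0.2):
    """Poll condition() until it returns a truthy value, then return that value.

    Exceptions raised by condition() are treated as "not ready yet".
    Raises TimeoutError once `timeout` seconds pass without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
        except Exception:
            result = None
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


# Demonstration with a fake lookup that only succeeds on the third poll,
# standing in for an element that appears after the page finishes loading.
attempts = {"n": 0}


def fake_lookup():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise LookupError("element not present yet")
    return "search_input"


found = wait_for(fake_lookup, timeout=2.0, interval=0.01)
print(found, "after", attempts["n"], "polls")
```

With a real driver the call would look like `wait_for(lambda: driver.find_element(By.ID, "search_input"))`. Selenium also ships `WebDriverWait(driver, 10).until(...)`, which implements the same polling idea and is probably the better choice in real code.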

3. Complete code

# -*- coding: utf-8 -*-
# Author : YRH
# Date : 2020.08.11
# Project : Scrape Python job postings from Lagou (拉勾网)
# Tool : PyCharm
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time
import re
import xlwt

driver = webdriver.Chrome()  # create a Chrome instance
driver.get("https://www.lagou.com/")  # open the site


# Page-fetching function.
# Dismisses the pop-ups, enters the search keyword, turns the pages,
# and passes each page's HTML to the parsing function.
def getHtml():
    info = []  # list that collects the extracted rows
    time.sleep(0.5)
    # A location-selection dialog pops up; it must be closed or nothing below works
    try:
        if driver.find_element(By.ID, "cboxLoadedContent"):
            driver.find_element(By.ID, "cboxClose").click()
    except Exception:
        pass
    time.sleep(0.5)
    # Locate the search box and type the keyword
    inpText = driver.find_element(By.ID, "search_input")
    inpText.send_keys("python")
    driver.find_element(By.ID, "search_button").click()
    time.sleep(1)
    # Dismiss the pop-up advertisement
    try:
        if driver.find_element(By.CLASS_NAME, "body-box"):
            driver.find_element(By.CLASS_NAME, "body-btn").click()
    except Exception:
        pass
    count = 1
    # First page
    try:
        html = driver.page_source
        html = str(html).replace(u"\u2002", "").replace(u"\xa9", "")
        parse(html, info)
        print("Scraped page " + str(count))
        count += 1
    except Exception:
        print("Failed to get page " + str(count))
        count += 1
    # Second page through the last page (page 30)
    while count <= 30:
        driver.find_element(By.CLASS_NAME, "pager_next").click()
        time.sleep(2)
        html = driver.page_source
        html = str(html).replace(u"\u2002", "").replace(u"\xa9", "")
        parse(html, info)
        print("Scraped page " + str(count))
        count += 1
    print("Scraper finished")
    saveData(info)  # save the collected rows
    driver.close()  # close the browser window


# Information-extraction function.
# Parses the HTML and extracts the fields of each posting.
def parse(html, infoList):
    soup = BeautifulSoup(html, "lxml")
    try:
        # Each page holds only 14 postings, so loop over the values of the
        # li attribute "data-index". The li tags of the postings share no
        # other attribute value, so this is the only way to select them.
        a = 1
        while a <= 14:
            ul = soup.find("ul", class_='item_con_list').find_all_next("li", {"data-index": str(a)})
            for li in ul:
                # Job title
                try:
                    name = li.find("a", class_="position_link").find_next("h3").text
                except Exception:
                    name = " "
                # Address
                try:
                    address = li.find("span", class_='add').text
                    address = str(address).replace("[", "").replace("]", "")
                except Exception:
                    address = " "
                try:
                    text = li.find("div", class_='li_b_l').text
                    # Education cannot be captured together with the other
                    # fields, so it is extracted separately
                    education = re.findall(r"(大专|本科|硕士|博士|不要求|不限)", text)
                    if len(education) >= 1:
                        education = education[0]
                    else:
                        education = " "
                    # Salary and experience requirement
                    text = re.findall(r"(.*?)\n(.*?)/", text, re.X)
                    if len(text[0]) == 2:
                        money = text[0][0]
                        experience = text[0][1]
                    else:
                        money = " "
                        experience = " "
                except Exception:
                    education = " "
                    money = " "
                    experience = " "
                # Company name
                try:
                    company = li.find("div", class_='company_name').find_next("a").text
                except Exception:
                    company = " "
                # Store job title, address, education, salary, experience, and company name
                infoList.append([name, address, education, money, experience, company])
            a += 1
    except Exception:
        print("The number of postings on the page has changed")


# Data saving.
# Writes the data to an external Excel file.
def saveData(infoList):
    print("save........")
    workbook = xlwt.Workbook(encoding="utf-8")  # create the workbook
    movieBook = workbook.add_sheet("sheet1")  # create the worksheet
    # Column headers: job title, company address, education, salary, experience, company name
    head = ["岗位名称", "公司地址", "学历要求", "薪资", "经验要求", "公司名称"]
    for i in range(0, len(head)):
        movieBook.write(0, i, head[i])  # arguments: row, column, value
    # Write the data row by row
    y = 1
    for a in infoList:
        for x in range(0, len(a)):
            movieBook.write(y, x, a[x])
        y += 1
    workbook.save("拉勾网招聘信息.xls")  # save the spreadsheet
    print("Saved " + str(y - 1) + " postings in total")
    print("Data saved successfully")
    print("Save routine finished")


if __name__ == '__main__':
    getHtml()
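The parsing step pulls salary, experience, and education out of one block of text. A possible pitfall of searching that whole text for an education keyword is that a phrase such as "经验不限" contains "不限", which would be picked up before the real education value. The sketch below is one hedged alternative that captures all three fields with a single pattern. It assumes the text has the shape "salary, newline, experience / education"; the function name `parse_requirements` and the sample strings are hypothetical and were not captured from the live site.

```python
import re

# Assumed shape of the text: "<salary>\n<experience> / <education>".
REQUIREMENTS_RE = re.compile(
    r"(?P<money>\S+)\s+(?P<experience>[^/]+?)\s*/\s*(?P<education>\S+)"
)


def parse_requirements(text):
    """Return (money, experience, education).

    A single space marks a missing field, mirroring the placeholder
    used by the script in this post.
    """
    match = REQUIREMENTS_RE.search(text)
    if match is None:
        return " ", " ", " "
    return match.group("money"), match.group("experience"), match.group("education")


# Hypothetical sample strings for illustration only.
samples = [
    "\n15k-30k\n经验3-5年 / 本科\n",
    "\n8k-12k\n经验不限 / 大专\n",
    "no recognizable fields",
]
for sample in samples:
    print(parse_requirements(sample))
```

Because the education group is anchored after the slash, "不限" inside the experience phrase is no longer mistaken for the education requirement. If the real page text differs from the assumed shape, the pattern would need adjusting.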

4. Partial screenshot of the scraping results

I'm still a beginner, so any advice is appreciated.
