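Hangzhou Python crawler jobs: scraping job sites with Python (Zhaopin, Lagou, Boss Zhipin)
I needed this recently, so I wrote a few scrapers myself.
I'll just post the code.
1. Zhaopin (智联)
The results are stored in a pandas DataFrame.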
import requests
from requests.exceptions import RequestException
from bs4 import BeautifulSoup
import pandas as pd
import time
headers={
'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
'Cookie':'<cookie string copied from your own logged-in zhaopin.com browser session>'  # session-specific, replace before running
}
url='http://sou.zhaopin.com/jobs/searchresult.ashx'
a= []
b=[]
c=[]
d=[]
e=[]
f=[]
g=[]
h=[]
i=[]
j=[]
k=[]
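def get_one_page(url, headers, params):
    """Fetch one page of search results; return the HTML text, or None on failure."""
    try:
        response = requests.get(url, headers=headers, params=params)
        time.sleep(2)  # pause between requests
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None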
def get_detail_info(html):
    """Parse one result page and append each field to the module-level lists a..k."""
    soup = BeautifulSoup(html, "lxml")
    positions = soup.select('.zwmc a')
    companys = soup.select("td.gsmc > a:nth-of-type(1)")
    salarys = soup.select("td.zwyx")
    locations = soup.select("td.gzdd")
    release_dates = soup.select(".gxsj span")
    company_natures = soup.select('li.newlist_deatil_two > span:nth-of-type(2)')
    company_sizes = soup.select("li.newlist_deatil_two > span:nth-of-type(3)")
    experiences = soup.select("li.newlist_deatil_two > span:nth-of-type(4)")
    educations = soup.select("li.newlist_deatil_two > span:nth-of-type(5)")
    dutys = soup.select("li.newlist_deatil_last")
    urls = soup.select('td.zwmc > div > a')
    # zip() stops at the shortest list, so one missing field on the page shortens every column
    for position, company, salary, location, release_date, company_nature, company_size, experience, education, duty, url in zip(
            positions, companys, salarys, locations, release_dates, company_natures, company_sizes, experiences, educations, dutys, urls):
        a.append(position.get_text())
        b.append(company.get_text())
        c.append(salary.get_text())
        d.append(location.get_text())
        e.append(release_date.get_text())
        f.append(company_nature.get_text())
        g.append(company_size.get_text())
        h.append(experience.get_text())
        i.append(education.get_text())
        j.append(duty.get_text())
        k.append(url.get("href"))
    return (a, b, c, d, e, f, g, h, i, j, k)
def transform_into_dataframe(a, b, c, d, e, f, g, h, i, j, k):
    data = {
        "position": a,
        "company": b,
        "salary": c,
        "location": d,
        "release_date": e,
        "company_nature": f,
        "company_size": g,
        "experience": h,
        "education": i,
        "duty": j,
        "url": k
    }
    position_data = pd.DataFrame(data)
    return position_data
def main(url, headers, params):
    html = get_one_page(url, headers, params)
    if html is not None:  # skip parsing when the request failed
        get_detail_info(html)
    # a..k accumulate across calls, so the frame holds every page fetched so far
    position_data = transform_into_dataframe(a, b, c, d, e, f, g, h, i, j, k)
    return position_data
if __name__ == "__main__":
    for page in range(1, 11):
        params = {
            "jl": "杭州",      # city: Hangzhou
            "kw": "数据分析",  # keyword: data analysis
            "isadv": 0,
            "we": "0103",
            "isfilter": 1,
            "p": page,
            "sf": 8001,
            "st": 10000
        }
        position_data = main(url, headers, params)
        print("------------------page {} processed--------------".format(page))
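The scraper above builds eleven parallel lists and relies on `zip()` to line them up, which silently truncates when one selector returns fewer elements than the others. The following is a minimal sketch of one way to make that failure visible and to collect results row by row; `columns_to_rows` is a hypothetical helper name and the sample values are made up, not taken from the site.

```python
def columns_to_rows(columns):
    """Convert {field: [values, ...]} into a list of one-dict-per-job rows.

    Raises ValueError when the lists differ in length, instead of letting
    zip() silently drop the extra items.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("misaligned columns: {}".format(lengths))
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


# Made-up values standing in for what the selectors return on one page.
page_columns = {
    "position": ["数据分析师", "数据分析专员"],
    "company": ["Company A", "Company B"],
    "salary": ["8001-10000", "6001-8000"],
}

all_rows = []                      # grows across pages
all_rows.extend(columns_to_rows(page_columns))
print(all_rows[0]["position"])     # 数据分析师
print(len(all_rows))               # 2
# With pandas installed, pd.DataFrame(all_rows) builds the frame once at the end.
```

Collecting a list of dicts and building the DataFrame once after the last page also avoids keeping module-level lists that every call mutates.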
2. Lagou (拉钩)
The results are stored in MySQL.
# -*- coding: utf-8 -*-
"""
Created on Sat Feb 17 23:14:47 2018
@author: Administrator
"""
import time
import requests
import pymysql
config={
"host":"127.0.0.1",
"user":"root",
"password":"root",
"database":"pachong",
"charset":"utf8"
}
def lagou(page):
    headers = {
        'Referer': 'https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90?city=%E6%9D%AD%E5%B7%9E&cl=false&fromSearch=true&labelWords=&suginput=',
        'Origin': 'https://www.lagou.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Cookie': '<cookie string copied from your own lagou.com browser session>'  # session-specific, replace before running
    }
    dates = {'first': 'true',
             'pn': page,
             'kd': "数据分析"}  # search keyword: data analysis
    url = 'https://www.lagou.com/jobs/positionAjax.json?city=%E6%9D%AD%E5%B7%9E&needAddtionalResult=false&isSchoolJob=0'
    resp = requests.post(url, data=dates, headers=headers)
    print(resp.content.decode('utf-8'))
    result = resp.json()['content']['positionResult']['result']
    db = pymysql.connect(**config)
    positionName = []
    count = 0  # rows inserted for this page
    for i in result:
        print(i)
        positionName.append(i['positionName'])
        timeNow = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
        # open a cursor for this row
        cursor = db.cursor()
        # list-valued fields may be empty or None; flatten them to plain strings
        if i['businessZones']:
            businessZones = "".join(i['businessZones'])
        else:
            businessZones = ""
        if i['companyLabelList']:
            companyLabelList = "".join(i['companyLabelList'])
        else:
            companyLabelList = ""
        if i['industryLables']:
            industryLables = "".join(i['industryLables'])
        else:
            industryLables = ""
        if i['positionLables']:
            positionLables = "".join(i['positionLables'])
        else:
            positionLables = ""
        # 31 columns, 31 placeholders, 31 values
        sql = """insert into lagou(positionName,workYear,salary,companyShortName
            ,companyIdInLagou,education,jobNature,positionIdInLagou,createTimeInLagou
            ,city,industryField,positionAdvantage,companySize,score,positionLables
            ,industryLables,publisherId,financeStage,companyLabelList,district,businessZones
            ,companyFullName,firstType,secondType,isSchoolJob,subwayline
            ,stationname,linestaion,resumeProcessRate,createByMe,keyByMe
            )VALUES (%s,%s,%s,%s,
            %s,%s,%s,%s,%s
            ,%s,%s,%s,%s,%s,%s,%s
            ,%s,%s,%s,%s,%s
            ,%s,%s,%s,%s,%s
            ,%s,%s,%s,%s,%s
            )"""
        cursor.execute(sql, (i['positionName'], i['workYear'], i['salary'], i['companyShortName']
            , i['companyId'], i['education'], i['jobNature'], i['positionId'], i['createTime']
            , i['city'], i['industryField'], i['positionAdvantage'], i['companySize'], i['score'], positionLables
            , industryLables, i['publisherId'], i['financeStage'], companyLabelList, i['district'], businessZones
            , i['companyFullName'], i['firstType'], i['secondType'], i['isSchoolJob'], i['subwayline']
            , i['stationname'], i['linestaion'], i['resumeProcessRate'], timeNow, "数据分析"
            ))
        db.commit()  # commit the row
        cursor.close()
        count = count + 1
    db.close()
def main(pages):
    page = 1
    while page <= pages:
        print('---------------------page', page, '--------------------')
        lagou(page)
        page = page + 1
if __name__ == '__main__':
    main(13)  # number of pages to scrape
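The INSERT statement in the Lagou scraper lists 31 columns and 31 `%s` placeholders by hand, which is easy to get out of step. Below is a minimal sketch of one way to generate the statement from a column list. It assumes the table and column names come from trusted constants, since they are interpolated into the SQL text and only the values are bound as parameters. `build_insert` and `join_labels` are hypothetical helper names, the job record is made up, and the demo uses the standard-library `sqlite3` module as a stand-in for MySQL so it runs without a server; with pymysql the placeholder stays `%s`.

```python
import sqlite3


def join_labels(value, sep=","):
    """Lagou list fields may be None or a list of strings; return one string."""
    return sep.join(value) if value else ""


def build_insert(table, columns, placeholder="%s"):
    """Return an INSERT statement with exactly one placeholder per column.

    pymysql expects %s as the placeholder, sqlite3 expects ?.
    """
    marks = ", ".join([placeholder] * len(columns))
    return "INSERT INTO {} ({}) VALUES ({})".format(table, ", ".join(columns), marks)


# Hypothetical, trimmed-down job record (the real response has many more keys).
job = {
    "positionName": "数据分析师",
    "salary": "10k-15k",
    "positionLables": ["SQL", "Python"],
    "businessZones": None,
}
row = {
    "positionName": job["positionName"],
    "salary": job["salary"],
    "positionLables": join_labels(job["positionLables"]),
    "businessZones": join_labels(job["businessZones"]),
}
columns = list(row)

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the MySQL database
conn.execute("CREATE TABLE lagou ({})".format(", ".join(c + " TEXT" for c in columns)))
conn.execute(build_insert("lagou", columns, placeholder="?"), [row[c] for c in columns])
conn.commit()
stored = conn.execute(
    "SELECT positionName, positionLables, businessZones FROM lagou").fetchall()
print(stored)  # [('数据分析师', 'SQL,Python', '')]
```

Note that `join_labels` defaults to a comma separator, whereas the scraper joins with an empty string; pass `sep=""` to reproduce the original behavior.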
3. Boss Zhipin (Boss直聘)
# -*- coding: utf-8 -*-
"""
Created on Wed Feb 21 13:08:53 2018
@author: Administrator
"""
import requests
from bs4 import BeautifulSoup
import time
import pandas as pd
from requests.exceptions import RequestException
import re
url = 'https://www.zhipin.com/c101210100/e_104-d_203-y_3-h_101210100/'
a= []
b=[]
c=[]
d=[]
e=[]
f=[]
g=[]
h=[]
i=[]
j=[]
k=[]
headers = {
'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Referer':'https://login.zhipin.com/',
'Cookie':'<cookie string copied from your own zhipin.com browser session>',  # session-specific, replace before running
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36'
}
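def get_one_page(url, headers, params):
    """Fetch one page of search results; return the HTML text, or None on failure."""
    try:
        response = requests.get(url, headers=headers, params=params)
        time.sleep(2)  # pause between requests
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None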
def get_detail_info(html):
    soup = BeautifulSoup(html, "lxml")
    positions = soup.select('div.job-title')
    companys = soup.select("div.info-company > div > h3 > a")
    salarys = soup.select("div.info-primary > h3 > a > span")
    # Five capture groups: location, experience, education, company nature, company size.
    # The literal HTML tags between the groups are missing from this listing.
    pattern = re.compile(r"""
\s*
(.*?)(.*?)(.*?)
.*? \s*
.*?(.*?)(.*?)
.*?""", re.S)
    re_datas = re.findall(pattern, html)
    release_dates = soup.select("div > div.info-publis > p")
    dutys = soup.select("div > div.info-primary > h3 > a > div.info-detail > p")
    urls = soup.select('div > div.info-primary > h3 > a')
    for position, company, salary, re_data, release_date, duty, url in zip(positions, companys, salarys, re_datas, release_dates, dutys, urls):
        a.append(position.get_text())      # position
        b.append(company.get_text())       # company
        c.append(salary.get_text())        # salary
        d.append(re_data[0])               # location
        e.append(release_date.get_text())  # release_date
        f.append(re_data[3])               # company_nature
        g.append(re_data[4])               # company_size
        h.append(re_data[1])               # experience
        i.append(re_data[2])               # education
        j.append(duty.get_text())          # duty
        k.append('https://www.zhipin.com' + str(url.get("href")))  # url
    return (a, b, c, d, e, f, g, h, i, j, k)
def transform_into_dataframe(a, b, c, d, e, f, g, h, i, j, k):
    data = {
        "position": a,
        "company": b,
        "salary": c,
        "location": d,
        "release_date": e,
        "company_nature": f,
        "company_size": g,
        "experience": h,
        "education": i,
        "duty": j,
        "url": k
    }
    position_data_zhipin = pd.DataFrame(data)
    return position_data_zhipin
def main(url, headers, params):
    html = get_one_page(url, headers=headers, params=params)
    if html is not None:  # skip parsing when the request failed
        get_detail_info(html)
    # a..k accumulate across calls, so the frame holds every page fetched so far
    position_data_zhipin = transform_into_dataframe(a, b, c, d, e, f, g, h, i, j, k)
    return position_data_zhipin
if __name__ == '__main__':
    for page in range(1, 11):
        params = {
            'query': '数据分析',  # keyword: data analysis
            'page': page,
            'ka': 'page-{}'.format(page)
        }
        position_data_zhipin = main(url, headers, params)
        print("------------------page {} processed--------------".format(page))
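The regular expression in `get_detail_info` pulls location, experience and education out of one paragraph, and company details out of another. Below is a minimal sketch of an alternative that splits a single paragraph into its fields. It assumes, without having been checked against the live site, that the values sit inside one `<p>` element separated by empty `<em>` tags; the sample markup and the helper name `split_fields` are made up for illustration.

```python
import re

# Made-up sample markup; the real page structure may differ.
SAMPLE = '<p>杭州 <em class="vline"></em>1-3年<em class="vline"></em>本科</p>'


def split_fields(p_html):
    """Split the inner HTML of one <p> on empty <em> separators."""
    inner = re.sub(r"^\s*<p>|</p>\s*$", "", p_html)       # drop the outer <p> ... </p>
    parts = re.split(r"<em[^>]*>\s*</em>", inner)         # cut at each empty <em>
    return [re.sub(r"<[^>]+>", "", part).strip() for part in parts]  # strip leftover tags


location, experience, education = split_fields(SAMPLE)
print(location, experience, education)  # 杭州 1-3年 本科
```

Splitting one element at a time keeps each job's fields together, so a listing with a missing value does not shift the columns of the listings after it.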