Python Web Data Scraping and Analysis: Zhaopin (智联招聘)

I. Data Scraping

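Zhaopin (智联招聘) is a company that provides one-stop professional human-resources services to large corporations and fast-growing small and medium-sized enterprises. Its website lets you search job postings by city and by position. In what follows, we scrape the postings published on the Zhaopin site and store them in a local MySQL database.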

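1. Target URL and related information
URL to scrape: https://sou.zhaopin.com/?jl=653&kw=数据分析师&kt=3&p=1
Here, jl=653 selects Hangzhou as the city, p=1 is the page number (the first page), and kw=数据分析师 sets the job keyword to "数据分析师" (data analyst).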
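The query parameters described above can be assembled programmatically rather than pasted into a string. The following is a minimal sketch using only the standard library; the helper name build_search_url is hypothetical, and kt=3 is simply copied from the article's URL because its meaning is not explained there.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://sou.zhaopin.com/"

def build_search_url(city_code, keyword, page, kt=3):
    """Return the search URL for one result page (hypothetical helper).

    jl = city code, kw = job keyword, p = page number.
    kt=3 is copied unchanged from the article's URL.
    """
    # urlencode percent-encodes the Chinese keyword as UTF-8
    query = urlencode({"jl": city_code, "kw": keyword, "kt": kt, "p": page})
    return f"{BASE_URL}?{query}"

url = build_search_url(653, "数据分析师", 1)
# Decode the query string again to check that nothing was lost
params = parse_qs(urlsplit(url).query)
print(url)
print(params["kw"])
```

Building the URL this way makes it easy to loop over page numbers or swap in another city code without hand-editing the string.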
The page to be scraped is shown in Figure 1.1.

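On this page we collect the URL of every job posting. Following each URL leads to the posting's detail page, which contains all the fields to be scraped (see Figure 1.2).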

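The fields to be scraped are: (1) position title, (2) salary, (3) city, (4) work experience, (5) education requirement, (6) number of openings, (7) position highlights, (8) job description, (9) company address, (10) company name, (11) company industry, (12) company size, (13) short company description.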
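2. Scraping the page information
2.1 Scraping the job URLs from the list page
Press F12 to open the browser's developer tools and inspect the page source. Searching it for "数据分析师" finds nothing, so the job list is not part of the static HTML but is loaded dynamically. The page is therefore scraped with Selenium, which drives a real browser and, based on the page structure, locates the links or buttons to follow.
Selenium and a browser driver must be installed before use; installation is not covered here. The following code scrapes the job URLs from the list page with Selenium: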

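from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Use Chrome; the path points to the local chromedriver executable
# (Selenium 4 style: the driver path is passed through a Service object)
browser = webdriver.Chrome(service=Service("d:/selenium/chromedriver.exe"))
url = 'https://sou.zhaopin.com/?jl=653&kw=数据分析师&kt=3&p=1'
# Implicit wait: element lookups retry for up to 10 s
browser.implicitly_wait(10)
# Request the URL
browser.get(url)
# Locate the job cards with a CSS selector
elems = browser.find_elements(By.CSS_SELECTOR, ".contentpile__content__wrapper.clearfix")
for elem in elems:
    # Locate the link inside each card by its class attribute
    detail_url = elem.find_element(By.CLASS_NAME, 'contentpile__content__wrapper__item__info').get_attribute('href')
    # getDetailInfo fetches the detail-page fields (defined in section 2.2)
    getDetailInfo(detail_url)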

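2.2 Scraping the job detail page
lxml is an HTML/XML parser whose main job is parsing HTML/XML and extracting data from it. With XPath expressions it can quickly locate specific elements and nodes. XPath syntax itself is not covered here. The following code scrapes the fields from a job detail page: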

import requests
from lxml import etree

def getDetailInfo(detail_url):
    # Fetch the detail-page HTML
    # e.g. detail_url = 'https://jobs.zhaopin.com/CC446556739J00069333602.htm'
    response = requests.get(detail_url)
    selector = etree.HTML(response.text)
    # Locate nodes with XPath and extract the required fields
    positions = selector.xpath('//h3[@class="summary-plane__title"]/text()')[0]
    salary = selector.xpath('//*[@class="summary-plane__salary"]/text()')[0]
    infos = selector.xpath('//ul[@class="summary-plane__info"]/li')
    city = infos[0].xpath('a/text()')[0]
    experience = infos[1].xpath('text()')[0]
    education = infos[2].xpath('text()')[0]
    nums = infos[3].xpath("text()")[0]
    points = selector.xpath('//div[@class="highlights__content"]/span')
    em_points = []
    for point in points:
        em_points.append(point.xpath('text()')[0])
    total_points = '/'.join(em_points)
    # 'string(.)' returns all the text under a node
    descriptions = selector.xpath('//*[@class="describtion__detail-content"]')[0].xpath('string(.)')
    address = selector.xpath('//span[@class="job-address__content-text"]/text()')[0]
    company_name = selector.xpath('//div[@class="company"]/a/text()')[0]
    company_detail = '/'.join(selector.xpath('//div[@class="company"]/div/button[1]/text()'))
    company_size = selector.xpath('//div[@class="company"]/div/button[2]/text()')[0]
    company_description = selector.xpath('//div[@class="company__description"]/text()')[0]
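To see how the attribute-based lookups in getDetailInfo behave without requesting the live site, the same idea can be tried on a small document. The sketch below swaps lxml for the standard library's xml.etree.ElementTree, which supports only a limited XPath subset (no text() or string(.)), so .text and itertext() stand in for them. The sample markup is made up and far simpler than a real detail page; parse_detail is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for a job detail page
SAMPLE = """
<html><body>
  <h3 class="summary-plane__title">Data Analyst</h3>
  <span class="summary-plane__salary">8千-1.5万</span>
  <ul class="summary-plane__info">
    <li><a>杭州</a></li>
    <li>1-3年</li>
    <li>本科</li>
    <li>招2人</li>
  </ul>
  <div class="describtion__detail-content"><p>Build reports.</p><p>Clean data.</p></div>
</body></html>
"""

def parse_detail(markup):
    """Extract a few fields with ElementTree's XPath subset."""
    root = ET.fromstring(markup.strip())
    infos = root.findall(".//ul[@class='summary-plane__info']/li")
    description = root.find(".//*[@class='describtion__detail-content']")
    return {
        "positions": root.find(".//h3[@class='summary-plane__title']").text,
        "salary": root.find(".//*[@class='summary-plane__salary']").text,
        "city": infos[0].find("a").text,
        "experience": infos[1].text,
        "education": infos[2].text,
        "nums": infos[3].text,
        # itertext() gathers all nested text, similar to XPath string(.)
        "descriptions": "".join(description.itertext()),
    }

info = parse_detail(SAMPLE)
print(info)
```

Real pages are rarely well-formed XML, which is why the article parses them with lxml's etree.HTML; ElementTree is used here only to keep the sketch dependency-free.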

2.3 Storing the data in MySQL
Prerequisite: MySQL is installed.
This article uses the third-party library pymysql to work with MySQL; install it before use:

pip install pymysql

The following code connects to the database and performs the insert:

import pymysql

def InsertIntoDatabase(positions,salary,city,experience,education,nums,total_points,descriptions,address,company_name,company_detail,company_size,company_description):
    conn = None
    try:
        # connect() opens the database connection and returns a connection object
        conn = pymysql.connect(host='localhost',user='root',password='******',port=3306,db='employee',charset='utf8')
        # The connection only provides the interface to the database; statements are executed through a cursor
        cursor = conn.cursor()
        # Run the database operations
        cursor.execute('use employee')
        sql = "insert into Employee_info (positions,salary,city,experience,education,nums,total_points,descriptions,address,company_name,company_detail,company_size,company_description) values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"
        cursor.execute(sql,(positions,salary,city,experience,education,nums,total_points,descriptions,address,company_name,company_detail,company_size,company_description))
        # Commit the current transaction
        conn.commit()
        # Close the cursor
        cursor.close()
    except Exception as e:
        print('An error occurred:', e)
    finally:
        # Close the database connection, if it was opened
        if conn is not None:
            conn.close()
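The article does not show how the Employee_info table is created. The sketch below illustrates the same parameterized-insert pattern end to end, using the standard library's sqlite3 in place of pymysql and MySQL so that it runs without a database server. The schema (an auto-increment id plus one text column per field) is an assumption, and sqlite3 uses ? placeholders where pymysql uses %s.

```python
import sqlite3

COLUMNS = ("positions", "salary", "city", "experience", "education", "nums",
           "total_points", "descriptions", "address", "company_name",
           "company_detail", "company_size", "company_description")

def create_table(conn):
    # Assumed schema: the article only implies an "id" column plus the 13 fields
    cols = ", ".join(f"{name} TEXT" for name in COLUMNS)
    conn.execute("CREATE TABLE IF NOT EXISTS Employee_info "
                 f"(id INTEGER PRIMARY KEY AUTOINCREMENT, {cols})")

def insert_row(conn, row):
    # Values are passed separately from the SQL text, never formatted into it
    placeholders = ", ".join("?" for _ in COLUMNS)
    sql = f"INSERT INTO Employee_info ({', '.join(COLUMNS)}) VALUES ({placeholders})"
    with conn:  # commits on success, rolls back if execute raises
        conn.execute(sql, tuple(row[name] for name in COLUMNS))

conn = sqlite3.connect(":memory:")
create_table(conn)
sample = {name: f"sample {name}" for name in COLUMNS}  # made-up values
insert_row(conn, sample)
count = conn.execute("SELECT COUNT(*) FROM Employee_info").fetchone()[0]
print(count)
```

Passing the values as a separate tuple, as the article's pymysql code also does, lets the driver handle quoting, so quotes inside a job description cannot break the statement.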

2.4 Complete code
The complete code for scraping the required information from the recruitment site follows:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from lxml import etree
import pymysql
import requests

def InsertIntoDatabase(positions,salary,city,experience,education,nums,total_points,descriptions,address,company_name,company_detail,company_size,company_description):
    conn = None
    try:
        conn = pymysql.connect(host='localhost',user='root',password='******',port=3306,db='employee',charset='utf8')
        cursor = conn.cursor()
        cursor.execute('use employee')
        sql = "insert into Employee_info (positions,salary,city,experience,education,nums,total_points,descriptions,address,company_name,company_detail,company_size,company_description) values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"
        cursor.execute(sql,(positions,salary,city,experience,education,nums,total_points,descriptions,address,company_name,company_detail,company_size,company_description))
        conn.commit()
        cursor.close()
    except Exception as e:
        print(e)
    finally:
        if conn is not None:
            conn.close()

def getDetailInfo(detail_url):
    # e.g. detail_url = 'https://jobs.zhaopin.com/CC446556739J00069333602.htm'
    response = requests.get(detail_url)
    selector = etree.HTML(response.text)
    try:
        positions = selector.xpath('//h3[@class="summary-plane__title"]/text()')[0]
        salary = selector.xpath('//*[@class="summary-plane__salary"]/text()')[0]
        infos = selector.xpath('//ul[@class="summary-plane__info"]/li')
        city = infos[0].xpath('a/text()')[0]
        experience = infos[1].xpath('text()')[0]
        education = infos[2].xpath('text()')[0]
        nums = infos[3].xpath("text()")[0]
        points = selector.xpath('//div[@class="highlights__content"]/span')
        em_points = []
        for point in points:
            em_points.append(point.xpath('text()')[0])
        total_points = '/'.join(em_points)
        # 'string(.)' returns all the text under a node
        descriptions = selector.xpath('//*[@class="describtion__detail-content"]')[0].xpath('string(.)')
        address = selector.xpath('//span[@class="job-address__content-text"]/text()')[0]
        company_name = selector.xpath('//div[@class="company"]/a/text()')[0]
        company_detail = '/'.join(selector.xpath('//div[@class="company"]/div/button[1]/text()'))
        company_size = selector.xpath('//div[@class="company"]/div/button[2]/text()')[0]
        company_description = selector.xpath('//div[@class="company__description"]/text()')[0]
        InsertIntoDatabase(positions,salary,city,experience,education,nums,total_points,descriptions,address,company_name,company_detail,company_size,company_description)
    except Exception as e:
        # A posting with a missing field is reported and skipped
        print(e)

# The functions are defined before the crawl loop that calls them
browser = webdriver.Chrome(service=Service("d:/selenium/chromedriver.exe"))
browser.implicitly_wait(10)
# As written, range(1, 3) with i+1 requests pages 2 and 3; use range(0, 3) to include page 1
urls = ['https://sou.zhaopin.com/?jl=653&kw=数据分析师&kt=3&p={}'.format(i + 1) for i in range(1, 3)]
for url in urls:
    browser.get(url)
    elems = browser.find_elements(By.CSS_SELECTOR, ".contentpile__content__wrapper.clearfix")
    for elem in elems:
        detail_url = elem.find_element(By.CLASS_NAME, 'contentpile__content__wrapper__item__info').get_attribute('href')
        getDetailInfo(detail_url)
browser.quit()

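II. Data Analysis

With the scraping done, we have the recruitment data we need and can analyze it. The analysis covers the following steps:

1. Importing the data from the database
2. Data cleaning, including missing-value handling and other preprocessing
3. Data analysis and visualization
3.1 Average salary
3.2 Salary versus work experience
3.3 Salary versus education
3.4 Text analysis of the job descriptions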

1. Data import
The data-import functions of the third-party library pandas can load data straight from the database. The code is as follows:

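import pandas as pd
import pymysql

# Connect to the database
conn=pymysql.connect(host='localhost',user='root',password='******',port=3306,db='employee',charset='utf8')
# Same table name as in the insert statement (MySQL table names can be case-sensitive)
sql="select * from Employee_info"
# Read the query result into a DataFrame
data=pd.read_sql(sql,conn)
# Drop the column named "id"
df=data.drop('id',axis=1)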

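2. Data cleaning and preprocessing
The most important step before any analysis is cleaning the data so that it is usable. Cleaning includes handling duplicates, outliers and so on, which is not covered here.
In the resulting DataFrame df, the "salary" column is text. To make it easier to analyze, the column is converted to numeric values, with missing values handled accordingly (replaced by the mean). Because salary values mostly look like "1千-5千" (1,000 to 5,000 yuan; 千 = thousand, 万 = ten thousand), the minimum and maximum salary can be split apart first and the average salary computed from them.
The code is as follows: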

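import numpy as np

# Convert one salary bound such as "5千" or "1.5万" to a number of yuan
def unitchange(salary):
    if not isinstance(salary, str):
        # Missing values (NaN) stay missing
        salary = np.nan
    elif salary.find("千") != -1:
        # Everything before the unit is the number, so "15千" gives 15000
        salary = float(salary[:salary.find("千")]) * 1000
    elif salary.find("万") != -1:
        salary = float(salary[:salary.find("万")]) * 10000
    else:
        salary = np.nan
    return salary

# df_clear1 is presumably the cleaned copy of df; the cleaning steps are not shown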
df_clear1['salary_range']=df_clear1['salary'].str.split("-")
df_clear1['min_salary']=df_clear1['salary_range'].str.get(0)
df_clear1['max_salary']=df_clear1['salary_range'].str.get(1)
df_clear1['min_salary'] = df_clear1['min_salary'].map(unitchange)
df_clear1['max_salary']=df_clear1['max_salary'].fillna('unknown').map(unitchange)
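The text mentions replacing missing salaries with the mean, but the snippet above stops at the unit conversion. The following plain-Python sketch shows one way the whole step could look, from parsing a range such as "1千-5千" to filling unparseable rows with the mean of the rest. The function names and the sample values (including "面议", negotiable) are made up for illustration. Unlike the article's code, a value with a single bound is treated as both minimum and maximum.

```python
import re
from statistics import mean

UNITS = {"千": 1_000, "万": 10_000}
# A number (optionally with decimals) followed by a unit character
BOUND = re.compile(r"(\d+(?:\.\d+)?)([千万])")

def parse_salary_range(text):
    """Return (min, max) in yuan, or (None, None) if no bound is found."""
    bounds = [float(num) * UNITS[unit] for num, unit in BOUND.findall(text)]
    if not bounds:
        return (None, None)
    return (bounds[0], bounds[-1])

def average_with_mean_fill(salaries):
    """Midpoint per row; rows that cannot be parsed get the mean of the rest."""
    mids = []
    for text in salaries:
        low, high = parse_salary_range(text)
        mids.append(None if low is None else (low + high) / 2)
    known = [m for m in mids if m is not None]
    fill = mean(known)  # raises StatisticsError if no row could be parsed
    return [fill if m is None else m for m in mids]

rows = ["1千-5千", "面议", "1万-1.5万"]  # made-up sample values
print(parse_salary_range(rows[0]))
print(average_with_mean_fill(rows))
```

With pandas, the filling step would typically be a fillna call with the column mean on the average_salary column.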

3. Data analysis and visualization
3.1 Average salary

# Compute the average salary
average_salary=(df_clear1['max_salary'].mean()+df_clear1['min_salary'].mean())/2
df_clear1['average_salary']=(df_clear1['max_salary']+df_clear1['min_salary'])/2
df_clear1.drop(['min_salary','max_salary','salary'],axis=1,inplace=True)

3.2 Salary versus work experience
(1) Use pandas' groupby() to group the rows by work experience and compute the average salary of each group.

# Select average_salary explicitly so that the text columns are not averaged
ex_sa=df_clear1.groupby('experience')[['average_salary']].mean().sort_values(by='average_salary')
print(ex_sa)

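The result is as follows:

Work experience	Salary level (yuan)
无经验 (no experience)	4944.444444
1年以下 (less than 1 year)	6666.666667
经验不限 (any experience)	6774.509804
1-3年 (1-3 years)	9195.121951
3-5年 (3-5 years)	14034.090909
5-10年 (5-10 years)	27500.000000
10年以上 (more than 10 years)	41666.666667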
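For readers less familiar with groupby(), the computation behind the table can be written out in plain Python: collect the salaries of each group, average them, and sort by the average. This is only a sketch with made-up rows, and mean_by_group is a hypothetical name. Unlike pandas, it does not skip missing values.

```python
from collections import defaultdict

def mean_by_group(rows, key, value):
    """Average `value` per distinct `key`, sorted by the average (ascending)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[key]].append(row[value])
    means = {group: sum(vals) / len(vals) for group, vals in buckets.items()}
    return sorted(means.items(), key=lambda item: item[1])

rows = [  # made-up rows, not the scraped data
    {"experience": "1-3年", "average_salary": 9000.0},
    {"experience": "3-5年", "average_salary": 15000.0},
    {"experience": "1-3年", "average_salary": 10000.0},
    {"experience": "无经验", "average_salary": 5000.0},
]
result = mean_by_group(rows, "experience", "average_salary")
print(result)
```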

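(2) Visualize the relationship between salary and work experience.
When plotting, the Chinese x-axis tick labels may be rendered as garbled characters. In that case matplotlib needs to be configured as follows.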

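import matplotlib
import matplotlib.pyplot as plt
# Set the default font (SimHei is a Chinese font and must be installed on the system)
matplotlib.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['font.family']='sans-serif'
# Keep the minus sign '-' from being rendered as a box
matplotlib.rcParams['axes.unicode_minus'] = False
ex_sa=df_clear1.groupby('experience')[['average_salary']].mean().sort_values(by='average_salary')
# DataFrame.plot creates its own figure
ex_sa.plot(kind='bar',rot=0)
plt.title("工资水平与工作年限")  # "Salary level vs. years of experience"
plt.show()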

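In the chart, leaving aside the group whose experience is "经验不限" (any experience), the remaining groups match what we would expect: the more work experience, the higher the salary.
3.3 Salary versus education
(1) Use pandas' groupby() to group the rows by education and compute the average salary of each group.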

# Select average_salary explicitly so that the text columns are not averaged
edu_sa=df_clear1.groupby('education')[['average_salary']].mean().sort_values(by='average_salary')

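The result is as follows (中技 = technical school, 大专 = junior college, 中专 = secondary vocational school, 学历不限 = no education requirement, 本科 = bachelor's degree, 硕士 = master's degree, 博士 = doctorate):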

           average_salary
education
中技            5500.000000
大专            6544.117647
中专            7100.000000
学历不限          7416.666667
本科           12228.155340
硕士           17444.444444
博士           25000.000000

(2) Visualize the relationship between salary and education.

plt.figure()
edu_sa.plot(kind='bar',rot=0)
plt.title("工资水平与学历")
plt.show()

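In the chart, leaving aside the group whose education is "学历不限" (no education requirement), the remaining groups broadly match what we would expect: the higher the education level, the higher the salary. The one exception is 中专, which averages slightly more than 大专.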

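3.4 Text analysis of the job descriptions
The job description is free text, so it calls for text-analysis methods. Text analysis usually involves the following steps:
(1) Chinese word segmentation: once the corpus is available, the first step is to segment it. Chinese has no explicit boundary between words, so a segmentation tool is needed to split each sentence into a space-separated sequence of words. This article uses the Jieba Chinese segmentation tool.
(2) Part-of-speech tagging: assigning the correct part of speech to every word or phrase in the segmentation result. Because the text here consists only of job-description vocabulary, this step is skipped.
(3) Data cleaning: the segmented corpus may still contain dirty data and stop words. To get better analysis results, the data set is cleaned and stop words are filtered out. This article filters the Jieba output against a stop-word list.
(4) Feature extraction: transforming the raw features into a set of core features with clear physical or statistical meaning that represent the original corpus as well as possible. The extracted features are usually stored in a vector space model.
(5) Weight calculation: when building the vector space model, how the weights are expressed matters a great deal. Common choices include Boolean weights, term-frequency weights, TF-IDF weights and entropy weights. This article uses TF-IDF weights.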

3.4.1 Chinese word segmentation
Prerequisite: the third-party library jieba is installed.

import jieba
# Segment the Chinese text in accurate mode (cut_all=False)
des=jieba.cut(df_clear1.loc[0,'descriptions'],cut_all=False)

3.4.2 Data cleaning

stopwords={}.fromkeys(['工作','职责','任职','资格','岗位职责','职位','岗位','负责','\xa0','有','能','在','的','对','被', '吗','也','中','最','有','和','及','等','中','或','Need','1','2','4','3','5','6','“','”','。','，','？','、','；','：',' ',';','.','You','一',':','（','）','．','5.1','【','】','(',')'])
# Keep every token that is not a stop word, separated by spaces
splitStr=''
for output in des:
    if output not in stopwords:
        splitStr+=output+' '
print(splitStr)

The output (the segmented Chinese job description with the stop words removed) is:

'游戏数据分析包括基础指标建设主题分析预测模型活动评估搭建游戏数据分析体系包括数据埋点设计数据清洗结合业务建立合适分析模型进行深入游戏数据分析根据产品要求设计数据指标体系制定数据规划运营数据设计分析逻辑发现潜在产品机会风险缺陷提供决策数据支持通过运营数据驱动产品优化并推动业务闭环应用数据分析经验搭建数据分析模型预测模型精准定位用户市场推演发展趋势完成部门游戏运营重点专项报告输出针对目标用户细分群体完成独立分析研究报告协助部门负责人完成其他业务专项研究商业分析要求本科以上学历年以上数据统计分析数据挖掘相关经验互联网行业数据分析经验者优先熟悉数据建模知识数据挖掘理论掌握数据分析体系方法掌握 Python R SAS SPSS 任数据分析工具机器学习经验优先拥有优秀业务建模能力实际业务搭建数据模型并对模型进行优化结果呈现为业务提供决策支持统计学社会学心理学专业背景优先一定文档撰写能力能够独立完成各类分析汇报文档较强逻辑思考分析能力抗压能力责任心强协调沟通能力强进取心团队合作意识 ’
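The later snippets use a variable named all_list that the article never defines. A plausible reading, and it is only an assumption, is that all_list holds the cleaned, space-separated text of every job description. The sketch below builds such a list. The helper names are hypothetical, and str.split with pre-spaced sample strings stands in for jieba so the example runs without it.

```python
def clean_description(text, tokenizer, stopwords):
    """Segment one description and drop stop words; tokens are space-joined."""
    return " ".join(tok for tok in tokenizer(text) if tok not in stopwords)

def build_all_list(descriptions, tokenizer, stopwords):
    """Return one cleaned string per description (hypothetical helper)."""
    return [clean_description(text, tokenizer, stopwords) for text in descriptions]

# Stand-in data and tokenizer so the sketch runs without jieba
stopwords = {"的", "和", "负责"}
descriptions = ["负责 游戏 数据 分析", "数据 清洗 和 建模"]
all_list = build_all_list(descriptions, str.split, stopwords)
print(all_list)
```

With the article's data, the call would presumably pass jieba.cut as the tokenizer and df_clear1['descriptions'] as the descriptions.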

3.4.3 Feature extraction and TF-IDF weight calculation
(1) Feature extraction and TF-IDF weight calculation

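# Convert the texts into TF-IDF feature vectors (a document-term matrix)
# all_list is presumably the list of cleaned description strings, one per posting; its construction is not shown
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf=TfidfVectorizer()
weight=tfidf.fit_transform(all_list[:2]).toarray()
# get_feature_names_out() requires scikit-learn >= 1.0 (older versions use get_feature_names())
word=tfidf.get_feature_names_out()
print('TF-IDF weight matrix:\n')
print(weight)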

[[ 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0.
0.04808649 0. 0.04808649 0. 0.04808649 0. 0.
0. 0. 0. 0.04808649 0. 0.
0.04808649 0. 0.09617299 0.28851896 0.04808649 0.04808649
0. 0.10264181 0. 0. 0. 0.03421394
0.03421394 0. 0.14425948 0.09617299 0.04808649 0. 0.
0.09617299 0. 0. 0. 0. 0. 0.
0.03421394 0. 0. 0.09617299 0. 0.
0.20528361 0.04808649 0. 0.04808649 0.09617299 0.04808649
0.04808649 0. 0. 0. 0. 0.04808649
0. 0.04808649 0. 0.04808649 0.03421394 0.04808649
0.04808649 0. 0.04808649 0. 0.04808649 0.04808649
0.04808649 0. 0. 0.04808649 0.19234597 0.
0.04808649 0. 0. 0.04808649 0. 0.
0.04808649 0.04808649 0.03421394 0. 0. 0.04808649
0. 0.09617299 0.04808649 0.04808649 0. 0.
0.04808649 0.04808649 0. 0.04808649 0. 0.
0.04808649 0.09617299 0.04808649 0.04808649 0.04808649 0.09617299
0. 0. 0.04808649 0.04808649 0. 0.06842787
0.14425948 0.04808649 0. 0.06842787 0. 0. 0.
0.30792542 0.38469194 0. 0. 0.09617299 0.04808649
0. 0. 0.09617299 0. 0. 0.04808649
0. 0. 0. 0. 0. 0.
0.04808649 0.04808649 0.04808649 0. 0. 0.03421394
0.14425948 0. 0.04808649 0.03421394 0.04808649 0.04808649
0.04808649 0.19234597 0.04808649 0.04808649 0.09617299 0. 0.
0.04808649 0.09617299 0. 0.04808649 0.03421394 0.
0.03421394 0.09617299 0. 0.04808649 0. 0. 0.
0. 0. 0.04808649 0.04808649 0. 0.19234597
0.04808649 0.04808649 0.04808649 0.04808649 0. 0.04808649
0.04808649 0. 0.04808649 0.24043246 0.04808649 0. 0.
0.04808649 0. 0.06842787 0. 0.04808649 0. 0.
0.14425948 0.04808649 0. 0. 0. 0.
0.04808649 0. 0.04808649 0. 0. 0.
0.04808649 0.14425948 0.04808649 0.06842787 0.04808649 0.09617299
0. 0. 0. 0.06842787 0.04808649 0.04808649
0.04808649 0. 0. 0. 0. 0.09617299
0.04808649 0.04808649 0. 0. ]
[ 0.06547278 0.06547278 0.06547278 0.06547278 0.06547278 0.06547278
0.06547278 0.06547278 0.06547278 0.06547278 0.06547278 0.06547278
0.06547278 0.13094556 0. 0.06547278 0. 0.06547278
0. 0.06547278 0.06547278 0.06547278 0.06547278 0.06547278
0. 0.06547278 0.06547278 0. 0.06547278 0. 0.
0. 0. 0.06547278 0.04658442 0.06547278 0.06547278
0.06547278 0.09316885 0.04658442 0.06547278 0. 0. 0.
0.06547278 0.06547278 0. 0.13094556 0.06547278 0.06547278
0.13094556 0.06547278 0.06547278 0.04658442 0.13094556 0.06547278
0. 0.13094556 0.06547278 0.04658442 0. 0.06547278
0. 0. 0. 0. 0.06547278 0.13094556
0.06547278 0.06547278 0. 0.06547278 0. 0.06547278
0. 0.04658442 0. 0. 0.06547278 0.
0.06547278 0. 0. 0. 0.06547278 0.06547278
0. 0. 0.06547278 0. 0.06547278 0.06547278
0. 0.13094556 0.06547278 0. 0. 0.04658442
0.06547278 0.06547278 0. 0.06547278 0. 0. 0.
0.06547278 0.06547278 0. 0. 0.06547278 0.
0.06547278 0.06547278 0. 0. 0. 0. 0.
0. 0.06547278 0.06547278 0. 0. 0.06547278
0.13975327 0. 0. 0.06547278 0.04658442 0.06547278
0.06547278 0.06547278 0.4192598 0. 0.06547278 0.06547278
0. 0. 0.06547278 0.06547278 0. 0.06547278
0.06547278 0. 0.06547278 0.06547278 0.06547278 0.06547278
0.06547278 0.06547278 0. 0. 0. 0.06547278
0.06547278 0.09316885 0. 0.06547278 0. 0.09316885
0. 0. 0. 0. 0. 0. 0.
0.06547278 0.13094556 0. 0. 0.06547278 0.
0.09316885 0.06547278 0.04658442 0. 0.06547278 0.
0.13094556 0.06547278 0.06547278 0.06547278 0.06547278 0. 0.
0.13094556 0. 0. 0. 0. 0.
0.19641834 0. 0. 0.06547278 0. 0. 0.
0.06547278 0.13094556 0. 0.06547278 0.13975327 0.06547278
0. 0.06547278 0.06547278 0. 0. 0.06547278
0.06547278 0.06547278 0.06547278 0. 0.13094556 0.
0.06547278 0.06547278 0.06547278 0. 0. 0.
0.04658442 0. 0. 0.06547278 0.06547278 0.06547278
0.04658442 0. 0. 0. 0.06547278 0.06547278
0.06547278 0.13094556 0. 0. 0. 0.06547278
0.06547278]]

This TF-IDF matrix can serve as the feature vectors; use it if you need to run classification or clustering.
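To make the numbers in the matrix less opaque, here is a plain-Python sketch of the weighting that TfidfVectorizer applies with its default settings, as far as the scikit-learn documentation describes it: a smoothed IDF of ln((1 + n) / (1 + df)) + 1, multiplied by the raw term count, followed by L2 normalization of each row. The token lists are made up, and the sketch skips the vectorizer's own tokenization. With two documents, a term found in only one of them weighs ln(3/2) + 1 ≈ 1.4055 times as much as a term found in both, which matches the ratio 0.04808649 / 0.03421394 in the first row above.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """TF-IDF with smoothed IDF and L2-normalized rows.

    docs: list of token lists. Returns (vocabulary, rows).
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in raw))
        rows.append([w / norm if norm else 0.0 for w in raw])
    return vocab, rows

docs = [["数据", "分析", "模型"], ["数据", "清洗"]]  # made-up token lists
vocab, rows = tfidf_matrix(docs)
shared, unique = vocab.index("数据"), vocab.index("分析")
ratio = rows[0][unique] / rows[0][shared]
print(ratio)
```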

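(2) Word cloud
A word cloud (also called a text cloud) visually emphasizes the keywords that occur most often in a text: the more frequent a word, the larger or more vivid it is drawn. The keywords are rendered as a colorful cloud-like image that conveys the topic and main ideas of the text at a glance.
Prerequisite: the third-party library wordcloud is installed.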

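import matplotlib.pyplot as plt
from wordcloud import WordCloud
# all_list[0] is presumably the cleaned text of the first job description
my_wordcloud=WordCloud().generate(all_list[0])
plt.imshow(my_wordcloud)
plt.axis('off')
plt.show()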

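The word cloud shows that the most frequent words in this job description are 数据分析 (data analysis), 数据 (data), 分析 (analysis), 模型 (model) and 能力 (ability).
***Note:*** if the corpus is Chinese, the word cloud may fail to render Chinese characters, so that every Chinese keyword shows up garbled.
Solution:
(1) Step 1: find the file wordcloud.py in the wordcloud installation directory (see Figure 1.6) and edit its source.

(2) Step 2: in wordcloud.py, find FONT_PATH and change DroidSansMono.ttf to msyh.ttf (see Figure 1.7); msyh.ttf is the Microsoft YaHei Chinese font. Also place the msyh.ttf font file in the same directory shown in Figure 1.6.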
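An alternative that leaves the installed library untouched is to pass the font to WordCloud through its font_path argument. The sketch below assumes a Chinese font file is available at the given path; "msyh.ttf" in the working directory is only an example location. The frequency counting is separated out so that part runs without wordcloud installed. Both helper names are hypothetical.

```python
from collections import Counter

def word_frequencies(cleaned_text):
    """Count the space-separated tokens of one cleaned description."""
    return Counter(cleaned_text.split())

def render_wordcloud(frequencies, font_path="msyh.ttf"):
    """Draw the cloud with an explicit font instead of editing wordcloud.py.

    font_path must point to a font file that contains Chinese glyphs;
    the default here is an assumed location, not a fixed one.
    """
    # Imported here so the counting part above runs without wordcloud installed
    from wordcloud import WordCloud
    return WordCloud(font_path=font_path).generate_from_frequencies(frequencies)

freqs = word_frequencies("数据 分析 数据 模型 能力 数据 分析")  # made-up text
print(freqs.most_common(2))
```

The returned object can be shown with plt.imshow exactly as in the article's snippet. Because the font is set per call, the setting also survives a reinstall or upgrade of the wordcloud package.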

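This concludes the walkthrough of the code for scraping and analyzing Zhaopin data. Much of the scraped information can still be analyzed further, for example position highlights, industry distribution and regional distribution. Also, the postings scraped here cover only "数据分析师" (data analyst) jobs in Hangzhou; if you are interested, widen the scope of the scrape, which may yield different or more interesting results.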