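ChinaHR (中华英才网) Crawler Walkthrough (1): Implementing a Basic Crawler

Welcome to the advanced, hands-on crawler tutorial. Open your IDE and let's start the Python journey!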

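The ChinaHR crawler

Now that the basics of Python crawlers are covered, it is time for practice. We will use a real example to explain more advanced crawler topics along the way. The crawler code is published at https://github.com/code-nick-python/yingcaiwang-spider

This example involves multithreading with threading and queue, and distributed crawling with redis. Without further ado, let's get started!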

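Writing the basic crawler

Here is what the web page looks like:

Our goal is to scrape the job title, salary, city, education requirement, headcount, company, and category for each posting. Let's work through the HTML next.

First, let's look at the database connection: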

from pymongo import MongoClient
import pymysql


# A class that holds the connection settings and opens the connections
class spider:
    # Settings for MongoDB and MySQL
    def __init__(self, data=''):
        self.host = 'localhost'
        self.port = 27017
        self.user = 'root'
        self.passwd = 'nick2005'
        self.db = 'scraping'
        self.charset = 'utf8'
        self.data = data

    # Connect to MongoDB and remove all existing documents
    def connect_to_mongodb(self):
        client = MongoClient(host=self.host, port=self.port)
        db = client.blog_database
        collection = db.blog
        # Originally collection.remove({}); PyMongo 4 no longer provides remove()
        collection.delete_many({})
        return collection

    # Connect to MySQL and empty the table
    def connect_to_mysql(self):
        conn = pymysql.connect(host=self.host, user=self.user, passwd=self.passwd, db=self.db, charset=self.charset)
        cur = conn.cursor()
        cur.execute('truncate table yingcaiwang;')
        return cur, conn

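First we import pymongo and pymysql, the Python drivers for MongoDB and MySQL respectively.

Next, __init__ defines basic settings such as the host, port, user, and password.

Then we define connect_to_mongodb(), which connects to MongoDB. Here collection.remove({}) (delete_many({}) in PyMongo 4 and later, where remove no longer exists) deletes every document, and the function returns the collection, not the database.

After that we define connect_to_mysql(), which connects to MySQL. It returns two values: the connection conn and the cursor cur used to execute SQL statements.

Now let's look at the main program: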
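The cur and conn pair follows Python's standard DB-API pattern: execute statements through the cursor, then commit on the connection. The sketch below illustrates that pattern with the standard-library sqlite3 module and an in-memory database, purely as a stand-in so it runs without a MySQL server. This is a substitution, not what the crawler itself does. Note that sqlite3 uses ? placeholders where pymysql uses %s, and the table name demo and its columns are hypothetical.

```python
import sqlite3

# In-memory database used only as a stand-in for the MySQL connection
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Hypothetical table for the illustration
cur.execute('CREATE TABLE demo (job TEXT, company TEXT)')

# Parameterized insert: the values are passed separately from the SQL text
cur.execute(
    'INSERT INTO demo (job, company) VALUES (?, ?)',
    ('Management Trainee', 'Example Co., Ltd.'),
)

# Changes become permanent only after commit()
conn.commit()

cur.execute('SELECT job, company FROM demo')
rows = cur.fetchall()
print(rows)

# Close the cursor first, then the connection
cur.close()
conn.close()
```

Passing the values as a separate tuple, as the crawler's INSERT statement also does, lets the driver handle quoting instead of building the SQL string by hand.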

# Import libraries
import requests
from bs4 import BeautifulSoup
import time
import re
import class_connect

# Instantiate the connection class
a = class_connect.spider()
collection = a.connect_to_mongodb()
cur, conn = a.connect_to_mysql()

# User-Agent header sent with every request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

# Record the start time
scrapy_time = time.time()

# Page through the listing with a for loop
for i in range(1, 208):
    # Request the page with requests
    link = "http://campus.chinahr.com/qz/P" + str(i) + "/?job_type=10&"
    r = requests.get(link, headers=headers, timeout=20)

    # Parse the page with BeautifulSoup
    soup = BeautifulSoup(r.text, "lxml")

    # Locate the fields with BeautifulSoup
    salary_list = soup.find_all('strong', class_='job-salary')     # salary
    city_list = soup.find_all('span', class_="job-city Fellip")     # city
    top_list = soup.find_all('div', class_="top-area")     # title and company
    job_info = soup.find_all('div', class_='job-info')     # city, education and headcount
    type_list = soup.find_all('span', class_='industry-name')     # category

    # Loop over every job on the page
    for x in range(len(top_list)):
        # Use strip() to trim the surrounding whitespace
        salary = salary_list[x].text.strip()     # salary
        city = city_list[x].text.strip()     # city
        top = top_list[x].text.strip()     # title and company
        job_and_company = top.split('\n', 1)     # separate title and company
        job_information = job_info[x].text.strip()     # city, education and headcount
        city_to_people = job_information.split('\n')     # separate city, education and headcount
        job_type = type_list[x].text.strip()     # category

        # Dictionary to insert into MongoDB
        record = {"job": job_and_company[0],
                  "company": job_and_company[1],
                  "salary": salary,
                  "city": city,
                  "type": job_type}

        # Loop over the city, education and headcount lines
        for each in range(0, 5):
            # Clean each line with regular expressions
            first = re.compile(r'  ')     # pattern that matches the spaces
            time_for_sub = first.sub('', city_to_people[each])     # replace them with nothing, i.e. remove them
            another = re.compile(r'/')     # pattern that matches "/"
            the_final_info = another.sub('', time_for_sub)     # replace "/" with nothing, i.e. remove it

            # Education requirement: store it in the dictionary
            if each == 3:
                record['background'] = the_final_info
                back = the_final_info
            # Headcount: store it in the dictionary
            if each == 4:
                record['people'] = the_final_info
                peo = the_final_info

        # Insert into MongoDB and MySQL
        collection.insert_one(record)
        cur.execute("INSERT INTO yingcaiwang(job,company,salary,city,type,background,people) VALUES(%s,%s,%s,%s,%s,%s,%s);",
                    (job_and_company[0], job_and_company[1], salary, city, job_type, back, peo))     # SQL statement

    # Rest three seconds after every fifth page to imitate a human visitor
    if i % 5 == 0:
        print('the %s page is finished,rest for three seconds' % (i))
        time.sleep(3)
    # Rest one second after every other page
    else:
        print('the %s page is finished,rest for one second' % (i))
        time.sleep(1)
    conn.commit()

# Record the end time and print the total time
scrapy_end = time.time()
scrapy_time_whole = scrapy_end - scrapy_time
print('it takes %s' % scrapy_time_whole)

# Commit the MySQL changes and close the connection
cur.close()
conn.commit()
conn.close()

The basics are all explained in the comments. Next, let's go over the key points.

split()

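Let's start with split(), taking this part of the program as the example:

top_list = soup.find_all('div', class_="top-area")     # title and company
top = top_list[x].text.strip()     # title and company
job_and_company = top.split('\n', 1)     # separate title and company

These three lines obtain the job title and the company. First, soup.find_all locates the element that holds both. The HTML looks like this:

Searching for class_='top-area' therefore finds the title and the company. strip() then trims the whitespace around the extracted text. If you print top, you get: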

金融投资类管培生
合生创展集团有限公司-合生创展集团有限公司

As you can see, the title and the company are separated by a newline, so we can use split to cut the string at the newline. Here is how split is used:

str.split('split-str',number)

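str is the string to split
split-str is the separator to split on
number is the maximum number of splits, that is, how many cuts to make (by default the string is cut at every occurrence)

So job_and_company = top.split('\n', 1) above cuts the string top at the newline once, which separates the title from the company.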
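To see what the split count changes, here is a minimal sketch. The sample string is made up for illustration and is not taken from the site.

```python
# Hypothetical sample text: a title, a company, and one extra line
sample = "Management Trainee\nExample Co., Ltd.\nextra line"

# One cut: everything after the first newline stays in one piece
once = sample.split('\n', 1)

# No count given: the string is cut at every newline
every = sample.split('\n')

print(once)
print(every)
```

With a count of 1 the result always has at most two items, which is why the crawler can safely read job_and_company[0] and job_and_company[1] as long as the text contains a newline.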

compile()

compile builds a compiled regular expression object from a pattern string, like this:

first = re.compile(r'  ')
time_for_sub = first.sub('', city_to_people[each])

sub()

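re.sub(pattern, repl, string, count=0, flags=0)

pattern : the regular expression pattern to search for
repl : the replacement text
string : the string in which the replacements are made
count : the maximum number of replacements (0, the default, replaces every match)

Because we already built the regular expression with compile, we do not pass pattern again. Instead we call sub on the compiled object, like this: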

time_for_sub = first.sub('', city_to_people[each])
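The following sketch puts compile and sub together. The input string is a made-up example that resembles one cleaned-up line of city_to_people, and the space pattern is assumed to be two consecutive spaces, as it appears in the listing.

```python
import re

# Hypothetical input resembling one line of city_to_people
raw = "  Bachelor  /"

first = re.compile(r'  ')     # matches two consecutive spaces
another = re.compile(r'/')    # matches a slash

# Calling sub on the compiled object: only repl and string are passed
no_spaces = first.sub('', raw)
cleaned = another.sub('', no_spaces)

# The module-level function takes the pattern as its first argument
same = re.sub(r'/', '', re.sub(r'  ', '', raw))

# count limits how many matches are replaced
only_first = re.sub(r'/', '', "a/b/c", count=1)

print(cleaned, same, only_first)
```

Since the two patterns never change, they could also be compiled once before the loops rather than on every iteration; the result is the same either way.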

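Sleep intervals

Some readers may be puzzled by sleep(): why deliberately slow the program down? Let's clear that up.

The reason is anti-crawler protection. A crawler is not a real visitor, yet it consumes the site's bandwidth, so some sites take measures to block crawler programs. Calling sleep() makes the request pattern look more like a person browsing. Here we rest one second after each page and three seconds after every fifth page.

That completes our basic crawler. Next time we will work on speed. See you then!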
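The pause schedule can be isolated into a small helper, sketched below. The function names pause_seconds and rest_after are hypothetical and do not appear in the crawler. The sleep function is passed in as a parameter so the schedule can be checked without actually waiting.

```python
import time


def pause_seconds(page):
    """Return how many seconds to rest after finishing the given page."""
    # Every fifth page gets the longer pause
    return 3 if page % 5 == 0 else 1


def rest_after(page, sleep=time.sleep):
    """Sleep for the scheduled time and return the number of seconds."""
    seconds = pause_seconds(page)
    sleep(seconds)
    return seconds


# Schedule for the first ten pages, computed without sleeping
schedule = [pause_seconds(p) for p in range(1, 11)]
print(schedule)
```

In the crawler's page loop, a call such as rest_after(i) would take the place of the if/else block around time.sleep.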