ChinaHR (中华英才网) Crawler Walkthrough (2): Multithreading with the threading Module

Welcome to this advanced, hands-on crawler tutorial. Open your IDE and let's start our Python journey!

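The threading module

threading is Python's module for multithreading. Multithreading is the technique of running multiple threads concurrently.
Using it improves overall throughput, which means a faster crawler.

Python differs in one respect, though: it has the GIL (Global Interpreter Lock), which lets only one thread execute Python code at any given moment. The GIL is like a pass of which only one copy exists. Python multithreading therefore gains speed by switching quickly between threads: while one thread waits on a network response it gives up the GIL, and another thread can run in the meantime. That is why it suits I/O-heavy work such as crawling.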
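To make the effect of thread switching concrete, here is a minimal sketch that is not part of the crawler. It assumes that `time.sleep` is a fair stand-in for waiting on a network response (both release the GIL); `fake_request`, `run_sequential` and `run_threaded` are hypothetical names used only for this illustration.

```python
import threading
import time


def fake_request(delay):
    # Stand-in for a network request (an assumption for this sketch):
    # sleeping releases the GIL, much like waiting on a socket does.
    time.sleep(delay)


def run_sequential(n, delay):
    # Issue the n fake requests one after another and time them.
    start = time.time()
    for _ in range(n):
        fake_request(delay)
    return time.time() - start


def run_threaded(n, delay):
    # Issue the n fake requests in n threads and time them.
    start = time.time()
    threads = [threading.Thread(target=fake_request, args=(delay,))
               for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start


sequential = run_sequential(5, 0.2)
threaded = run_threaded(5, 0.2)
print('sequential: %.2fs, threaded: %.2fs' % (sequential, threaded))
```

The threaded run should finish in roughly the time of a single wait, because the waits overlap. A CPU-bound function would not show the same gain under the GIL.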

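Even with the GIL, multithreading still improves efficiency considerably, and combining it with redis, which we will cover later, improves it further. Enough preamble; on to the program.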

Program walkthrough

First, here is the code with explanatory comments (the full code is available on GitHub):

# Import libraries
import requests
from bs4 import BeautifulSoup
import time
import re
import class_connect
import threading

# Put every URL we need to request into link_list
link_list = []
for i in range(1, 208):
    url = 'http://campus.chinahr.com/qz/P' + str(i) + '/?job_type=10&'
    link_list.append(url)

# Instantiate the database-connection class
a = class_connect.spider()
collection = a.connect_to_mongodb()
cur, conn = a.connect_to_mysql()


# Subclass threading.Thread and override its methods
class myThread(threading.Thread):
    def __init__(self, name, link_range):
        # Call Thread's own __init__(self)
        threading.Thread.__init__(self)
        # Thread name and the range of pages this thread crawls
        self.name = name
        self.link_range = link_range

    def run(self):
        # crawler is the main function
        print('Starting ' + self.name)
        crawler(self.name, self.link_range)
        print('Exiting ' + self.name)


# Request headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

# Record the start time
scrapy_time = time.time()


# Main function
def crawler(threadName, link_range):
    # Inside the function we must reconnect to MySQL, otherwise an error is raised
    cur, conn = a.connect_to_mysql()
    # Loop over this thread's share of the URL list
    for i in range(link_range[0], link_range[1] + 1):
        # Take each URL in turn and request it
        link = link_list[i - 1]
        r = requests.get(link, headers=headers, timeout=20)
        # Parse the page with BeautifulSoup
        soup = BeautifulSoup(r.text, "lxml")
        # Use soup.find_all to locate the elements we need
        salary_list = soup.find_all('strong', class_='job-salary')     # salary
        city_list = soup.find_all('span', class_="job-city Fellip")     # city
        top_list = soup.find_all('div', class_="top-area")     # job title and company
        job_info = soup.find_all('div', class_='job-info')     # city, education and headcount
        type_list = soup.find_all('span', class_='industry-name')     # category
        # Loop over every job posting on the page
        for x in range(len(top_list)):
            # Use strip() to extract the text
            salary = salary_list[x].text.strip()     # salary
            city = city_list[x].text.strip()      # city
            top = top_list[x].text.strip()     # job title and company
            job_and_company = top.split('\n', 1)     # split job title from company
            job_information = job_info[x].text.strip()     # city, education and headcount
            city_to_people = job_information.split('\n')     # split city, education and headcount
            type = type_list[x].text.strip()     # category
            # Dictionary for the MongoDB document
            all = {"job": job_and_company[0],
                   "company": job_and_company[1],
                   "salary": salary,
                   "city": city,
                   "type": type}
            # Use a for loop to separate city, education and headcount
            for each in range(0, 5):
                # Use re regular expressions
                first = re.compile(r'  ')     # pattern that matches the spaces
                time_for_sub = first.sub('', city_to_people[each])     # replace the spaces with nothing, i.e. remove them
                another = re.compile(r'/')     # pattern that matches /
                the_final_info = another.sub('', time_for_sub)     # replace / with nothing, i.e. remove it
                # Pick out education background and headcount and add them to the dictionary
                if each == 3:
                    all['background'] = the_final_info     # education background
                    back = the_final_info
                if each == 4:
                    all['people'] = the_final_info     # headcount
                    peo = the_final_info
            # Insert into the MongoDB and MySQL databases
            collection.insert_one(all)
            cur.execute("INSERT INTO yingcaiwang(job,company,salary,city,type,background,people) VALUES(%s,%s,%s,%s,%s,%s,%s);",
                        (job_and_company[0], job_and_company[1], salary, city, type, back, peo))     # SQL statement
            conn.commit()     # commit the change
        # Rest for 3 seconds after every 5th page
        if i % 5 == 0:
            print(threadName + " : 第%s页爬取完毕，休息三秒" % (i))
            print('the %s page is finished,rest for three seconds' % (i))
            time.sleep(3)
        # Rest for 1 second after every other page
        else:
            print(threadName + " : 第%s页爬取完毕，休息一秒" % (i))
            print('the %s page is finished,rest for one second' % (i))
            time.sleep(1)


# Split the URLs between the threads
thread_list = []
link_range_list = [(1, 40), (41, 80), (81, 120), (121, 160), (161, 207)]

# Start 5 threads with a for loop
for i in range(1, 6):
    thread = myThread('Thread-' + str(i), link_range_list[i - 1])
    thread.start()
    thread_list.append(thread)

# Wait for the threads to finish
for thread in thread_list:
    thread.join()

# Print the total time
scrapy_end = time.time()
scrapy_time_whole = scrapy_end - scrapy_time
print('it takes {}'.format(scrapy_time_whole))

# Commit the MySQL changes and close the connection
cur.close()
conn.commit()
conn.close()

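All the basic explanations are in the comments. Next, the key points.

Here we moved the previous installment's main program into a function and added the threading code after it. The following sections explain the threading parts.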

Overriding Thread's methods and splitting the URLs evenly

Here we override __init__ and run(self) of the Thread parent class to suit our needs.
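The subclassing pattern can be tried in isolation with a minimal sketch like the following. It is not the crawler itself: `WorkerThread` and the `results` dictionary are hypothetical names for this illustration, and `run` merely records the page numbers it was given instead of requesting anything.

```python
import threading


class WorkerThread(threading.Thread):
    def __init__(self, name, link_range, results):
        # Let the parent class set up the thread machinery first.
        threading.Thread.__init__(self)
        self.name = name
        self.link_range = link_range
        self.results = results

    def run(self):
        # run() is what executes in the new thread once start() is called.
        first, last = self.link_range
        self.results[self.name] = list(range(first, last + 1))


results = {}
worker = WorkerThread('Thread-1', (1, 3), results)
worker.start()   # start() arranges for run() to be called in a new thread
worker.join()    # wait until run() has returned
print(results)
```

Note that the code calls `start()`, not `run()`: calling `run()` directly would execute the work in the current thread.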

In addition, we need to split the URLs evenly so that the threads finish at roughly the same time.
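The listing hard-codes its page ranges. As an alternative, the ranges could be computed. The sketch below assumes 1-based, inclusive page ranges like those in the listing; `split_ranges` is a hypothetical helper, not part of the original program.

```python
def split_ranges(total, parts):
    # Divide pages 1..total into `parts` consecutive inclusive ranges
    # whose sizes differ by at most one page.
    base, extra = divmod(total, parts)
    ranges = []
    start = 1
    for k in range(parts):
        # The first `extra` ranges each take one additional page.
        size = base + (1 if k < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


link_range_list = split_ranges(207, 5)
print(link_range_list)
```

For 207 pages and 5 threads this gives ranges of 42, 42, 41, 41 and 41 pages, slightly more balanced than the hard-coded list, where the last thread handles 47 pages.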

thread.start()

thread.start() starts the thread. Thread names generally look like this:

Thread-1
Thread-2
......

So we can use a for loop to start the threads and append each one to the thread list.
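The start-in-a-loop pattern, reduced to its essentials, might look like the sketch below. It uses plain `threading.Thread` objects with explicit names instead of the `myThread` class, and `work` is a hypothetical placeholder for the crawling function.

```python
import threading
import time


def work():
    # Placeholder for the real crawling work (an assumption for this sketch).
    time.sleep(0.1)


thread_list = []
for i in range(1, 4):
    # Name each thread explicitly, as the listing does.
    thread = threading.Thread(target=work, name='Thread-' + str(i))
    thread.start()              # returns immediately; work() runs in the background
    thread_list.append(thread)  # keep a reference so we can join() later

names = [t.name for t in thread_list]
print(names)

for t in thread_list:
    t.join()
```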

thread.join()

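thread.join() waits for a thread to finish. We loop over the thread list with a for loop and wait for each thread in turn; the code after the loop runs only once all of the threads have finished.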
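The following sketch illustrates why the join loop matters: only after every `join()` has returned is it safe to assume that all the work is done. The `work` function, the `finished` list and the delays are assumptions made up for this illustration.

```python
import threading
import time

finished = []
lock = threading.Lock()


def work(name, delay):
    # Simulate a task of a given duration, then record that it finished.
    time.sleep(delay)
    with lock:
        finished.append(name)


thread_list = [
    threading.Thread(target=work, args=('Thread-%d' % i, 0.05 * i))
    for i in range(1, 4)
]
for t in thread_list:
    t.start()

# Block until every thread has completed.
for t in thread_list:
    t.join()

all_done = not any(t.is_alive() for t in thread_list)
print(sorted(finished), all_done)
```

Without the join loop, the final `print` could run while some threads are still sleeping, and `finished` might be incomplete. The same reasoning explains why the crawler measures its total time only after joining all five threads.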

That's it for multithreading. Next time we will cover using threading together with queue. See you then!