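Python Distributed Crawler in Practice

This walkthrough builds, from scratch, a distributed crawler that covers every tag on Douban Books (豆瓣读书).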

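Tools used in this walkthrough:

IDE: PyCharm
Tools: Python, Scrapy, Linux, MySQL, Redis
Required modules: scrapy, pymysql, scrapy_redis, selenium (step 1 also uses requests and lxml)
Data collected: title, author, publication date, price, rating, number of raters, number of comments, book category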

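First, the plan:

step1. Crawl the links of all tag pages and save them to the database
step2. Crawl the links of all content pages under each tag
step3. Crawl each content page in a distributed fashion (the main part)
step4. Run the Scrapy crawler on Linux

Enough talk, let's get started.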

step1. Crawl the links of all tag pages and save them to the database

For convenience, this step uses the requests library.

import requests
from lxml import etree

# Browser User-Agent, so the request looks like it comes from a normal browser
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
}


def crawl_tag_links(url):
    # Fetch the tag overview page, i.e. "https://book.douban.com/tag/?view=cloud"
    response = requests.get(url, headers=header)
    e = etree.HTML(response.text)
    # Extract every tag link (120 tag URLs when this was written)
    tag_links = e.xpath("//table[@class='tagCol']//a/@href")
    # The hrefs are site-relative paths (for example /tag/小说), so prepend the host
    tag_links = [f"https://book.douban.com{i}" for i in tag_links]
    # Return the links so the caller can save them
    return tag_links
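The link-completion step can also be sketched with the standard library alone. The snippet below is only an illustration: the helper name `complete_links` and the sample paths are made up, and unlike the f-string above it also percent-encodes the non-ASCII tag name, which is the form the URL takes once it is actually requested.

```python
from urllib.parse import quote, urljoin

BASE_URL = "https://book.douban.com"


def complete_links(paths, base=BASE_URL):
    """Turn site-relative paths into absolute, percent-encoded URLs."""
    # quote() keeps "/" as-is by default and encodes non-ASCII characters as UTF-8
    return [urljoin(base, quote(path)) for path in paths]


if __name__ == "__main__":
    # Hypothetical sample hrefs, shaped like the ones the XPath returns
    for link in complete_links(["/tag/小说", "/tag/历史"]):
        print(link)
```

Whether to store the encoded or the raw form is a design choice; the raw form is easier to read in the database, the encoded form matches what appears in response URLs.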

Save to the MySQL database

import pymysql


def save_tag_links(links):
    # Open the connection; change the host, user and password to match your setup
    conn = pymysql.connect(host="192.168.2.208", user="root", password="123456", database="douban")
    cursor = conn.cursor()
    # execute() returns the number of matching rows: 1 if the table exists, 0 if not
    if not cursor.execute("show tables like 'tag_links'"):
        # Create the table, named tag_links here
        cursor.execute("""create table tag_links(
            id int primary key auto_increment,
            url varchar(100),
            status int)""")
    sql = "insert into tag_links values (%s,%s,%s)"
    # Row layout:
    #   first 0  -> id column; it auto-increments, so passing 0 is enough
    #   link     -> URL of the tag page
    #   second 0 -> status "not crawled yet"; set it to 1 once the tag page has been
    #               crawled, so an interrupted run does not have to start over
    insert_links = [(0, link, 0) for link in links]
    try:
        # Batch insert
        cursor.executemany(sql, insert_links)
        # Inserts run inside a transaction and must be committed
        conn.commit()
    except Exception as err:
        # Roll back on failure
        conn.rollback()
        print(err)
    finally:
        cursor.close()
        conn.close()
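The status column is what makes the crawl resumable: only rows with status 0 are handed out, and a row flips to 1 once its page is done. A minimal sketch of that idea follows, with sqlite3 swapped in for MySQL so it runs without a server; the function names and example URLs are illustrative assumptions, not part of the original project.

```python
import sqlite3


def create_queue(conn):
    # Same three columns as tag_links: id, url, status (0 = pending, 1 = done)
    conn.execute(
        "create table if not exists tag_links("
        "id integer primary key autoincrement, url text, status integer)"
    )


def add_links(conn, links):
    # New links always start as pending
    conn.executemany(
        "insert into tag_links (url, status) values (?, 0)",
        [(link,) for link in links],
    )
    conn.commit()


def pending(conn):
    # Only rows that have not been crawled yet
    rows = conn.execute("select url from tag_links where status = 0 order by id")
    return [row[0] for row in rows]


def mark_done(conn, url):
    # Flip the flag after a successful crawl
    conn.execute("update tag_links set status = 1 where url = ?", (url,))
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_queue(conn)
    add_links(conn, ["https://example.com/a", "https://example.com/b"])
    mark_done(conn, "https://example.com/a")
    print(pending(conn))
```

After a crash, calling `pending()` again yields exactly the links that still need work.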

Complete code (wrapped in a class so it can be reused later)

import pymysql
import requests
from lxml import etree


class TagSpider():
    def __init__(self):
        # Browser User-Agent
        self.header = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
        }

    def crawl_tag_links(self, url):
        # Fetch the page that lists all popular tags
        response = requests.get(url, headers=self.header)
        e = etree.HTML(response.text)
        # Extract every tag link
        tag_links = e.xpath("//table[@class='tagCol']//a/@href")
        # The hrefs are site-relative paths (for example /tag/小说), so prepend the host
        tag_links = [f"https://book.douban.com{i}" for i in tag_links]
        # Save the links to MySQL
        self.save_tag_links(tag_links)

    def save_tag_links(self, links):
        # Open the connection
        conn = pymysql.connect(host="192.168.2.208", user="root", password="123456", database="douban")
        cursor = conn.cursor()
        # execute() returns the number of matching rows: 1 if the table exists, 0 if not
        if not cursor.execute("show tables like 'tag_links'"):
            # Create the table, named tag_links here
            cursor.execute("""create table tag_links(
                id int primary key auto_increment,
                url varchar(100),
                status int)""")
        sql = "insert into tag_links values (%s,%s,%s)"
        # Row layout: (0 for the auto-increment id, tag page URL, 0 for "not crawled yet").
        # The status becomes 1 after a successful crawl, so an interrupted run can resume.
        insert_links = [(0, link, 0) for link in links]
        try:
            # Batch insert
            cursor.executemany(sql, insert_links)
            # Inserts run inside a transaction and must be committed
            conn.commit()
        except Exception as err:
            # Roll back on failure
            conn.rollback()
            print(err)
        finally:
            cursor.close()
            conn.close()


if __name__ == '__main__':
    # URL of the page that lists all popular tags
    url = "https://book.douban.com/tag/?view=cloud"
    # Create the spider instance
    get_tag_links = TagSpider()
    # Crawl all tags
    get_tag_links.crawl_tag_links(url)

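step2. Crawl the links of all content pages under each tag

PS: Douban appears to cap browsing at the first 50 pages of each tag.
A rough estimate: 120 (tags) x 50 (pages) x 20 (content pages per listing page) = 120,000 records.
To cut down the crawl time, the rest of the work uses Scrapy.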
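The estimate and the paging scheme can be checked with a few lines. This is a sketch under the assumption that listing pages are addressed with a `start` offset growing by 20 per page, which is how the spider in this step builds its URLs; `page_urls` and the constants are illustrative names.

```python
PAGE_SIZE = 20   # content pages listed per listing page
MAX_PAGES = 50   # listing pages reachable per tag, per the limit noted above
TAG_COUNT = 120  # number of tags collected in step 1


def page_urls(tag_url, pages=MAX_PAGES, page_size=PAGE_SIZE):
    """Build the listing-page URLs for one tag: ?start=0, 20, 40, ..."""
    return [f"{tag_url}?start={offset}" for offset in range(0, pages * page_size, page_size)]


if __name__ == "__main__":
    urls = page_urls("https://book.douban.com/tag/小说")
    print(len(urls), urls[0], urls[-1])
    # Upper bound on the number of content-page links
    print(TAG_COUNT * MAX_PAGES * PAGE_SIZE)
```

The product is an upper bound; tags with fewer than 50 pages contribute fewer links.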

settings.py configuration file

BOT_NAME = 'doubandushulinks'
SPIDER_MODULES = ['doubandushulinks.spiders']
NEWSPIDER_MODULE = 'doubandushulinks.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    'doubandushulinks.pipelines.DoubandushulinksPipeline': 300,
}

Spider file

Approach: crawl the first 50 pages of each tag, and skip any page that shows "没有找到符合条件的图书" (no matching books found).

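# -*- coding: utf-8 -*-
import re
from urllib.parse import unquote

import pymysql
import scrapy


class DoubanlinksSpider(scrapy.Spider):
    name = 'doubanlinks'
    allowed_domains = ['douban.com']

    # This block runs once, when the class is defined: read the pending tag URLs
    conn = pymysql.connect(host="192.168.2.208", user="root", password="123456", database="douban")
    cursor = conn.cursor()
    # Select the URLs whose status is 0 (not crawled yet)
    cursor.execute("select url from tag_links where status = 0")
    urls = cursor.fetchall()
    # Good habit: close the cursor and connection when done
    cursor.close()
    conn.close()

    # Append the page offset to each tag URL; offsets step by 20, 50 pages per tag
    start_urls = [f"{url[0]}?start={j}" for url in urls for j in range(0, 1000, 20)]

    def parse(self, response):
        # "没有找到符合条件的图书" (no matching books found) means we are past the last page.
        # extract_first must be called; comparing the bare method is always true.
        if response.xpath("//p[@class='pl2']/text()").extract_first() != "没有找到符合条件的图书":
            # Each link belongs to a tag; keep the tag name for later data analysis
            tag = unquote(re.findall(r"tag/(.+)\?.+", response.url)[0])
            # Extract all content-page links on this listing page
            content_links = response.xpath("//h2/a/@href").extract()
            # Rows for the database: the leading 0 is the id, the trailing 0 means
            # "not crawled yet", same convention as in step 1
            item = {"data": [(0, url, tag, 0) for url in content_links]}
            # Yielding the item hands it to the pipeline
            yield item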
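The tag-name extraction in `parse` combines a regular expression with `unquote`, because the response URL carries the tag percent-encoded. The standalone sketch below reuses the same pattern; the helper name `tag_from_url` and the sample URL are assumptions for illustration.

```python
import re
from urllib.parse import unquote

# Same pattern as in the spider: everything between "tag/" and the query string
TAG_PATTERN = re.compile(r"tag/(.+)\?.+")


def tag_from_url(url):
    """Return the decoded tag name, or None if the URL has no tag plus query string."""
    match = TAG_PATTERN.search(url)
    return unquote(match.group(1)) if match else None


if __name__ == "__main__":
    print(tag_from_url("https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=20"))
```

Note that the pattern requires a query string, so a bare tag URL without `?start=...` does not match; returning None here avoids the IndexError that indexing an empty `findall` result would raise.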

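pipelines.py (the pipeline)

It works like the digestive tract that food passes through once you swallow it: large intestine, small intestine, duodenum…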

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql


class DoubandushulinksPipeline(object):
    # Called when the spider starts
    def open_spider(self, spider):
        # Open the database connection
        self.conn = pymysql.connect(host="192.168.2.208", user="root", password="123456", database="douban")
        self.cursor = self.conn.cursor()
        # Create the content_links table if it does not exist yet
        if not self.cursor.execute("show tables like 'content_links'"):
            self.cursor.execute("""create table content_links(
                id int primary key auto_increment,
                url varchar(100),
                type varchar(10),
                status int)""")

    def process_item(self, item, spider):
        # Store the rows in MySQL
        sql = "insert into content_links values (%s,%s,%s,%s)"
        try:
            # Batch insert
            self.cursor.executemany(sql, item["data"])
            self.conn.commit()
        except Exception as err:
            self.conn.rollback()
            print(err)
        # Hand the item on to any later pipeline
        return item

    # Called when the spider closes
    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
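The commit-or-rollback handling in `process_item` means a batch is stored completely or not at all. Here is a small sketch of that behaviour, again with sqlite3 standing in for MySQL. The `unique` constraint on `url` is an assumption added only to provoke a failure; the table in the original has no such constraint.

```python
import sqlite3


def save_batch(conn, rows):
    """Insert all rows or none of them; return True on success."""
    try:
        conn.executemany(
            "insert into content_links (url, type, status) values (?, ?, ?)", rows
        )
        conn.commit()
        return True
    except sqlite3.Error as err:
        # Undo the rows of this batch that were already inserted
        conn.rollback()
        print(err)
        return False


def count_rows(conn):
    return conn.execute("select count(*) from content_links").fetchone()[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table content_links(id integer primary key autoincrement, "
        "url text unique, type text, status integer)"
    )
    save_batch(conn, [("https://example.com/1", "小说", 0), ("https://example.com/2", "小说", 0)])
    # The second row repeats an existing URL, so the whole batch is rolled back
    save_batch(conn, [("https://example.com/3", "历史", 0), ("https://example.com/1", "历史", 0)])
    print(count_rows(conn))
```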

In total, 117,078 content-page links were crawled in about 5 minutes.

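step3. Crawl each content page in a distributed fashion

Now for the main part. Create a new crawler project here so the code written earlier does not get mixed up.

Right after finishing the program and starting a test run, the result was:

…

So a separate login program is needed:
log in once before crawling, and the data can then be crawled openly.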

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait as wait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By


def login(login_url, user_url):
    chrome = webdriver.Chrome()
    # Open the login page
    chrome.get(login_url)
    try:
        # Wait up to 60 seconds
        w = wait(chrome, 60)
        # Wait for the login to complete (log in by hand here, or write automatic login code)
        # The element below is only present once logged in
        w.until(EC.presence_of_element_located((By.CLASS_NAME, "bn-more")), message="login failed!")
        # Open the user page
        chrome.get(user_url)
        w.until(EC.presence_of_element_located((By.ID, "usr-profile-nav-doulists")), message="could not access the user page!")
        # get_cookies() returns a list of dicts -> [{...}, {...}, ...]
        json_cookies = chrome.get_cookies()
        cookies = {}
        for cookie in json_cookies:
            # Keep only the name/value pairs in a new cookies dict
            cookies[cookie["name"]] = cookie["value"]
        # Save to a file; the dict has to be converted to a string first
        with open("chrome_cookie.txt", "w") as f:
            f.write(str(cookies))
    except Exception as error:
        print(error)
        return False
    finally:
        # Close the browser whether or not the login succeeded
        chrome.quit()
    return True


if __name__ == '__main__':
    login_url = "https://accounts.douban.com/passport/login"
    user_url = "https://www.douban.com/people/215290729/"
    result = login(login_url, user_url)
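The cookie handling does not depend on a browser and can be tried on its own: flatten the list of cookie dicts into a name-to-value dict, write it out as a string, and read it back with `ast.literal_eval`. In this sketch the helper names, the sample cookie names and their values are all invented.

```python
import ast
import os
import tempfile


def cookies_to_dict(selenium_cookies):
    """Keep only name -> value from a list of cookie dicts."""
    return {cookie["name"]: cookie["value"] for cookie in selenium_cookies}


def save_cookies(cookies, path):
    # Written as the dict's repr, same as str(cookies) in the login script
    with open(path, "w", encoding="utf-8") as f:
        f.write(str(cookies))


def load_cookies(path):
    # literal_eval parses the repr back into a dict without executing code
    with open(path, encoding="utf-8") as f:
        return ast.literal_eval(f.read())


if __name__ == "__main__":
    # Made-up cookies in the shape a browser driver returns them
    raw = [
        {"name": "sid", "value": "abc123", "domain": ".example.com", "path": "/"},
        {"name": "token", "value": "xyz", "domain": ".example.com", "path": "/"},
    ]
    path = os.path.join(tempfile.mkdtemp(), "chrome_cookie.txt")
    save_cookies(cookies_to_dict(raw), path)
    print(load_cookies(path))
```

Writing the file with `json.dump` and reading it with `json.load` would work just as well for a flat dict of strings; the repr form is kept here because it is what the spider reads.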

Add the following to settings.py
Configure Scrapy before writing the spider file, so that it can run as a distributed crawler.

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# URL de-duplication through Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scrapy_redis scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Keep the scheduler state in Redis so a paused crawl can be resumed
SCHEDULER_PERSIST = True
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"
# Enable the Redis item pipeline
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
# Log level
LOG_LEVEL = 'DEBUG'
# Redis server IP
REDIS_HOST = "192.168.2.208"
# Redis port
REDIS_PORT = 6379
# Redis database number
REDIS_DB = 1
# Redis connection parameters
REDIS_PARAMS = {
    'socket_timeout': 30,
    'socket_connect_timeout': 30,
    'retry_on_timeout': True,
    'encoding': 'utf-8',
    'db': REDIS_DB
}
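What the shared duplicate filter buys you is easiest to see in miniature. Conceptually, every worker asks one common set whether a request fingerprint has been seen before, and only the first asker gets to crawl it. The sketch below replaces Redis with an in-process Python set and uses a deliberately simplified fingerprint (method plus URL), which is not how Scrapy computes fingerprints; the class and method names are illustrative.

```python
import hashlib


class SeenFilter:
    """In-memory stand-in for a set of request fingerprints shared by all workers."""

    def __init__(self):
        self._seen = set()

    @staticmethod
    def fingerprint(method, url):
        # Simplified fingerprint: hash of the HTTP method and the URL
        return hashlib.sha1(f"{method.upper()} {url}".encode("utf-8")).hexdigest()

    def request_seen(self, method, url):
        """Return True if this request was already recorded, else record it."""
        fp = self.fingerprint(method, url)
        if fp in self._seen:
            return True
        self._seen.add(fp)
        return False


if __name__ == "__main__":
    seen = SeenFilter()
    print(seen.request_seen("GET", "https://example.com/book/1"))  # first time
    print(seen.request_seen("get", "https://example.com/book/1"))  # duplicate
```

With the set living in Redis instead of in one process, several machines can run the same spider without fetching the same page twice.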

Spider file

# -*- coding: utf-8 -*-
import ast
import re
import sys

import pymysql
import redis
import scrapy

sys.path.append("..")
import settings


class DoubandushuSpider(scrapy.Spider):
    name = 'doubandushu'
    allowed_domains = ['douban.com']
    # Connect to Redis
    redis_cli = redis.Redis(host=settings.REDIS_HOST, port=settings.REDIS_PORT)
    # Connect to MySQL
    conn = pymysql.connect(host="192.168.2.208", user="root", password="123456", database="douban")
    cursor = conn.cursor()

    def start_requests(self):
        # Load the cookies saved by the login program
        with open("chrome_cookie.txt") as f:
            cookies = ast.literal_eval(f.read())
        # Read 200 content-page links at a time
        sql = "select id,url,type from content_links where status = 0 limit 200"
        self.cursor.execute(sql)
        urls = self.cursor.fetchall()
        # An empty result means the database has nothing left to crawl, so the loop ends
        while len(urls) != 0:
            for id, url, type in urls:
                yield scrapy.Request(url, callback=self.parse, cookies=cookies, meta={"id": id, "type": type})
            # After each batch of 200 content-page URLs, fetch the next 200
            self.cursor.execute(sql)
            urls = self.cursor.fetchall()

    def parse(self, response):
        # Extract the data
        try:
            name = response.xpath("//h1/span/text()").extract_first()
            info = response.xpath("string(//div[@id='info'])").extract_first()
            info = re.sub(r"[\n\s]", "", info)
            author = re.findall(r"作者:\s*(.+)出版社:", info)
            author = author[0] if author else None
            date = re.findall(r"出版年:\s*(\w{4})", info, re.A)
            date = date[0] if date else None
            price = re.findall(r"定价:\s*(.*[0-9])[\u4e00-\u9fa5]+:I*", info, re.A)
            try:
                price = price[0] if price else re.findall(r"定价:\s*(.*[0-9])[\u4e00-\u9fa5]*ISB*", info, re.A)[0]
            except Exception as error:
                price = 0
            score = response.xpath("//strong/text()").extract_first()
            score = score.replace(" ", "")
            if score == "":
                score = None
            rating_count = response.xpath("//a[@class='rating_people']/span/text()").extract_first()
            comment_count = response.xpath("//header//span[@class='pl']/a/text()").extract_first()
            comment_count = re.sub(r"[全部条\s]", "", comment_count)
            datas = {
                "id": response.meta["id"],
                "name": name,
                "author": author,
                "date": date,
                "price": price,
                "score": score,
                "rating_count": rating_count,
                "comment_count": comment_count,
                "type": response.meta["type"],
                "url": response.url
            }
            # Once the data is extracted and before it is saved, set the content-page
            # URL's status to 1. This keeps later reads from returning the same URL,
            # and a restarted crawler can carry on with the URLs whose status is 0.
            self.cursor.execute("update content_links set status = 1 where id = %s", (datas['id'],))
            self.conn.commit()
            yield datas
        except Exception as error:
            self.conn.rollback()
            print(error)
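The three regular expressions for author, publication year and price are easier to follow on a concrete string. The sketch below applies them to an invented info block; the sample text and the helper name `parse_info` are assumptions, and real pages may lay the fields out differently.

```python
import re


def parse_info(info):
    """Apply the spider's author/date/price patterns to the text of the info block."""
    # Strip all whitespace first, as the spider does
    info = re.sub(r"[\n\s]", "", info)
    author = re.findall(r"作者:\s*(.+)出版社:", info)
    # re.A makes \w ASCII-only, so only Latin letters, digits and "_" match
    date = re.findall(r"出版年:\s*(\w{4})", info, re.A)
    # Capture up to the last digit that is followed by Chinese characters and a colon
    price = re.findall(r"定价:\s*(.*[0-9])[\u4e00-\u9fa5]+:I*", info, re.A)
    return {
        "author": author[0] if author else None,
        "date": date[0] if date else None,
        "price": price[0] if price else None,
    }


# Invented sample in the "label: value" layout the patterns expect
SAMPLE = """
作者: 张三
出版社: 某某书局
出版年: 2019-5
页数: 300
定价: 39.00元
装帧: 平装
ISBN: 9780000000000
"""

if __name__ == "__main__":
    print(parse_info(SAMPLE))
```

Because the patterns rely on the order of the labels (author before publisher, price before another Chinese label), a page with a different field order falls through to None, which is why the spider guards every field.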

Crawling away happily.
The red text is not an error message; it is log output.

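step4. Run the Scrapy crawler on Linux

Create the Scrapy project on Linux

Replace settings.py

Upload the spider file into the spiders directory

Run the command

And then it crawls away happily.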

Finally: how do you get the data onto your local machine?

Easy:

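import json

import redis

# Connects to Redis on localhost, database 1 (REDIS_DB above); pass host=... if Redis runs elsewhere
redis_cli = redis.Redis(db=1)
while True:
    # blpop blocks until an item is available and returns a (key, value) pair
    data = redis_cli.blpop("doubandushu:items")
    # RedisPipeline stores items as JSON by default, so decode with json.loads;
    # ast.literal_eval fails on JSON null, which appears whenever a field is None
    print(json.loads(data[1].decode()))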
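One detail matters when decoding the stored items. Assuming they are serialized as JSON (to my knowledge the default for scrapy_redis's RedisPipeline), a field that was None in the spider arrives as `null`, which `json.loads` understands but `ast.literal_eval` rejects. The sketch below shows the difference on a hand-made payload; `decode_item` and the sample item are illustrative.

```python
import ast
import json


def decode_item(raw):
    """Decode one item payload (bytes holding JSON) into a dict."""
    return json.loads(raw.decode("utf-8"))


if __name__ == "__main__":
    # Hand-made payload shaped like a stored item with a missing score
    raw = json.dumps({"name": "某书", "score": None}).encode("utf-8")
    print(decode_item(raw))
    try:
        ast.literal_eval(raw.decode("utf-8"))
    except ValueError as err:
        # "null" is not a Python literal, so literal_eval refuses it
        print("literal_eval failed:", err)
```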

Finally, the follow-up piece: Douban Books data analysis in practice.
