Create the project

scrapy startproject douban

The red box in the command output marks the hint for creating a new spider.

Create the spider

cd douban
scrapy genspider girls https://www.douban.com/group/641424/

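At this point the project is basically set up. Here "girls" is the name of the spider and "https://www.douban.com/group/641424/" is the domain the spider will crawl. To make the project easier to launch, you can add an entrypoint.py file to the project with the following content: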

from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'girls'])

Project structure diagram

Create the Item

We create a new Item so that the scraped data is easy to store. The Item that holds the data looks like this:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class GirlItem(scrapy.Item):
    title = scrapy.Field()  # topic title
    author = scrapy.Field()  # author
    url = scrapy.Field()  # url
    lastTime = scrapy.Field()  # time of the latest reply
    detail_time = scrapy.Field()  # time the topic was posted
    detail_report = scrapy.Field()  # topic content

    def __str__(self):
        return '{"title": "%s", "author": "%s", "url": "%s", "lastTime": "%s", "detail_time": "%s", "detail_report": "%s"}\n' % (
            self['title'], self['author'], self['url'],
            self['lastTime'], self['detail_time'], self['detail_report'])

We override the __str__ method so that the item is rendered in exactly the format we want when it is written out.
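One caveat with the %-formatted string: it only looks like JSON, and a title that contains a double quote would produce a line that cannot be parsed back. As a hedged alternative (not the approach used in this post), the sketch below builds the same kind of line with json.dumps. A plain dict stands in for GirlItem so the snippet runs without Scrapy, and the names FIELDS and item_to_line are made up for this illustration.

```python
import json

# Field names taken from GirlItem above.
FIELDS = ("title", "author", "url", "lastTime", "detail_time", "detail_report")


def item_to_line(item):
    """Serialize the six fields as one JSON line; missing fields become ""."""
    record = {name: item.get(name, "") for name in FIELDS}
    # ensure_ascii=False keeps Chinese text readable in the output file.
    return json.dumps(record, ensure_ascii=False) + "\n"


# A plain dict stands in for a GirlItem here (assumption for the demo).
line = item_to_line({"title": 'say "hi"', "author": "amy"})
parsed = json.loads(line)
```

Because the quotes are escaped by json.dumps, each line of the output file can later be read back with json.loads.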

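The DoubanItem class above is generated automatically by Scrapy, and we can ignore it for now. You could also use that generated Item directly. I chose to create a separate one because it is easier to manage.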

Crawl the pages

First, edit settings.py: add USER_AGENT and change ROBOTSTXT_OBEY:

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'
ROBOTSTXT_OBEY = False

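The fields title, author, url and lastTime can be scraped from the first-level URL (the topic list), while detail_time and detail_report require following each url down to the topic page. So the parse method issues a follow-up request whose callback is detail_parse, and detail_parse saves the item to a file.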
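The hand-off between the two callbacks can be sketched without Scrapy. In the toy below, FakeRequest, FakeResponse and run are stand-ins invented for this illustration (they are not Scrapy classes); they only mimic the idea that a request carries a meta dict and that the engine later calls the request's callback with a response exposing the same meta.

```python
from dataclasses import dataclass, field


@dataclass
class FakeRequest:
    """Stand-in for a request: a URL, a callback, and a meta dict."""
    url: str
    callback: object
    meta: dict = field(default_factory=dict)


@dataclass
class FakeResponse:
    """Stand-in for a response that exposes the originating request's meta."""
    url: str
    meta: dict


def parse(rows):
    # First level: fill in what the list page provides, then drill down.
    for title, url in rows:
        item = {"title": title, "url": url}
        yield FakeRequest(url, callback=detail_parse, meta={"item": item})


def detail_parse(response):
    # Second level: pick up the partly filled item and complete it.
    item = response.meta["item"]
    item["detail_time"] = "(taken from the detail page)"
    return item


def run(rows):
    """Play the role of the engine: 'download' each request, call its callback."""
    return [req.callback(FakeResponse(req.url, req.meta)) for req in parse(rows)]


items = run([("hello", "https://example.com/topic/1")])
```

The same item object travels from parse to detail_parse, which is why the detail callback can add its two fields to the values collected on the list page.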

Full code:

# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup
# Absolute import, assuming the default layout created by "scrapy startproject douban"
from douban.items import GirlItem


class GirlsSpider(scrapy.Spider):
    name = 'girls'
    allowed_domains = ['www.douban.com']
    start_urls = ['https://www.douban.com/group/641424/discussion?start=25']

    # Override start_requests to send a browser User-Agent with the request
    # def start_requests(self):
    #     # Browser user agent
    #     headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
    #     return [scrapy.Request(url=self.start_urls[0], callback=self.parse, headers=headers)]

    def parse(self, response):
        html = response.text
        soup = BeautifulSoup(html, "lxml")
        # print("printing soup")
        # print(soup)
        table = soup.table
        tr_arr = table.find_all("tr")
        for tr in tr_arr:
            item = GirlItem()
            tds = tr.find_all('td')
            item['title'] = tds[0].get_text().replace('\n', '').replace(' ', '')
            item['author'] = tds[1].get_text().replace('\n', '').replace(' ', '')
            item['lastTime'] = tds[3].get_text().replace('\n', '')
            link = tds[0].find('a', href=True)
            if link is None:
                # Rows without a link (such as the table header) have no detail page
                item['url'] = ""
                continue
            item['url'] = link['href']
            # Follow the detail page, passing the partly filled item along
            yield scrapy.Request(item['url'], meta={'item': item}, callback=self.detail_parse)

        # Find the link to the next page (pagination); the last page has none
        paginator = soup.find(name='div', attrs={"class": "paginator"})
        next_span = paginator.find(name='span', attrs={"class": "next"}) if paginator else None
        next_link = next_span.find(name='link') if next_span else None
        if next_link is not None:
            print("Starting the next page")
            yield scrapy.Request(next_link['href'], callback=self.parse)

    def detail_parse(self, response):
        # Receive the data already scraped by the parent callback
        item = response.meta['item']
        try:
            item['detail_time'] = response.xpath('//*[@id="topic-content"]/div[2]/h3/span[2]/text()').extract()[0]
        except Exception as e:
            print(e)
            item['detail_time'] = ""
        try:
            item['detail_report'] = response.xpath('//*[@id="link-report"]').extract()[0].replace('\n', '')
        except Exception as e:
            print(e)
            item['detail_report'] = ""
        write_to_file('E:/douban-detail.txt', item)
        # return item


def write_to_file(file_name, txt):
    # print("Saving file " + str(file_name))
    # File modes:
    #   'r'  read
    #   'w'  write (creates the file if it does not exist)
    #   'a'  append
    #   'r+' read and write; raises an error (IOError) if the file does not exist
    #   'w+' read and write; creates the file if it does not exist
    #   'a+' append and read; creates the file if it does not exist
    with open(file_name, 'a', encoding='utf-8') as f:
        f.write(str(txt))
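The write_to_file helper relies on the 'a' (append) mode, so every item adds a line to the file instead of replacing what is already there. The short sketch below shows the difference between 'a' and 'w' on a throwaway file. The temporary path is an assumption made for the demo so that it does not touch E:/douban-detail.txt.

```python
import os
import tempfile


def write_to_file(file_name, txt):
    """Append str(txt) to file_name, creating the file if it does not exist."""
    with open(file_name, "a", encoding="utf-8") as f:
        f.write(str(txt))


# Hypothetical scratch path used only for this demonstration.
demo_path = os.path.join(tempfile.mkdtemp(), "douban-detail.txt")

write_to_file(demo_path, "first\n")
write_to_file(demo_path, "second\n")  # 'a' keeps the earlier line

with open(demo_path, encoding="utf-8") as f:
    appended = f.read()

# For contrast: 'w' truncates the file before writing.
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("third\n")

with open(demo_path, encoding="utf-8") as f:
    overwritten = f.read()
```

One consequence of append mode is that running the spider twice leaves both runs in the same file, so you may want to delete or rename the output between runs.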

Run the project

python entrypoint.py

