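Scraping the Dangdang Book Bestseller List with Scrapy

I. Scraping Task

Scrape the Dangdang book bestseller list and collect the data for the top 500 best-selling books.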

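II. Page Analysis

1. Open Dangdang and go to Book Rankings > Book Bestsellers to reach the bestseller list [http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-recent7-0-0-1-1]. Press Ctrl+Shift+I to open the Chrome developer tools, drill down level by level to the markup of a single product, and analyze it.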

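(The screenshot has been lost, but the steps can still be followed without it.)

2. The analysis settled on seven fields to scrape: book title (bname), purchase link (burl), comment count (bcomment), recommendation rate (btuijian), publication date (btime), discounted price (bprice), and discount (bdiscount).

3. Build the XPath expressions from the HTML source.

3.1 Book title

Markup: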

<div class="name"><a href="http://product.dangdang.com/25259300.html" target="_blank" title="流浪的地球（刘慈欣著，无删节无改写，大人孩子均可阅读，此版本当当网销量遥遥领先！根据本书改编的同名电影2019春节上映。）">流浪的地球（刘慈欣著，无删节无改写，大人孩子均可阅读，此版本<span class='dot'>...</span></a></div>

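The title can be taken either from the text under the div with class="name" or from the title attribute of the a tag. This crawl uses the latter, because the link text is truncated with "..." while the title attribute holds the full name. The resulting XPath is: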

"//div[@class='name']/a/@title"

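3.2 Purchase link

Markup: the book title markup above also contains the purchase link, so the link is a different attribute on the same path. The resulting XPath is: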

"//div[@class='name']/a/@href"

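As a quick check of the two expressions above, the sample markup from section 3.1 can be parsed with Python's standard library. This is only a sketch and not part of the original project: `xml.etree.ElementTree` supports just a subset of XPath and cannot select `/@title` directly, so the attributes are read with `.get()` after selecting the `a` element. The long title text is shortened here for readability, and the helper name `name_and_url` is made up for this example.

```python
import xml.etree.ElementTree as ET

# Sample markup from section 3.1, with the long title text shortened.
SAMPLE = (
    '<div class="name">'
    '<a href="http://product.dangdang.com/25259300.html" target="_blank" '
    'title="流浪的地球">流浪的地球<span class="dot">...</span></a>'
    '</div>'
)


def name_and_url(fragment):
    """Return (title, href) of the link inside the div with class="name"."""
    # Wrap the fragment so the div is a descendant of the parsed root.
    root = ET.fromstring("<root>" + fragment + "</root>")
    link = root.find(".//div[@class='name']/a")
    return link.get("title"), link.get("href")


if __name__ == "__main__":
    print(name_and_url(SAMPLE))
```

Inside the spider, the full XPath expressions with `/@title` and `/@href` do the same job in one step.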
3.3 Comment count

Markup:

<div class="star"><span class="level"><span style="width: 92%;"></span></span><a href="http://product.dangdang.com/25259300.html?point=comment_point" target="_blank">121374条评论</a><span class="tuijian">100%推荐</span></div>

The comment count is the text of the a tag under the div with class="star". The resulting XPath is:

"//div[@class='star']/a/text()"

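The XPath expressions for the remaining fields are derived the same way:

"//span[@class='tuijian']/text()"  # recommendation rate

"//div[@class='publisher_info']/span/text()"  # publication date

"//span[@class='price_n']/text()"  # discounted price

"//span[@class='price_s']/text()"  # discount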

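The text nodes selected by these expressions are raw strings such as "121374条评论" and "100%推荐" (both taken from the markup in section 3.3), so the numbers still have to be pulled out before any analysis. A minimal sketch of that cleanup follows. The helper names are made up for illustration, and the price format "¥35.00" used in the demo is an assumption that the source does not show.

```python
import re


def first_number(text):
    """Return the first integer or decimal number found in text, or None."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


def clean_comment_count(text):
    """Extract the comment count as an int, e.g. from '121374条评论'."""
    number = first_number(text)
    return int(number) if number is not None else None


def clean_recommend(text):
    """Extract the recommendation rate in percent, e.g. from '100%推荐'."""
    return first_number(text)


if __name__ == "__main__":
    print(clean_comment_count("121374条评论"))
    print(clean_recommend("100%推荐"))
    # Assumed price format; adjust the sample to the real page text.
    print(first_number("¥35.00"))
```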
III. The Scrapy Project

1. Create the project and files

In cmd, change to the target folder and run the following command to create a new Scrapy project:

scrapy startproject dangdang

Open the dangdang project in PyCharm.

Create a spider file:

scrapy genspider -t basic dd dangdang.com

The generated project includes items.py, pipelines.py, settings.py, and spiders/dd.py.

2. Write the code

2.1 First, define the data to scrape in the items file

items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DangdangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # fields to scrape
    bname = scrapy.Field()      # book title
    burl = scrapy.Field()       # purchase link
    bcomment = scrapy.Field()   # comment count
    btuijian = scrapy.Field()   # recommendation rate
    btime = scrapy.Field()      # publication date
    bprice = scrapy.Field()     # discounted price
    bdiscount = scrapy.Field()  # discount

2.2 Write the spider file dd.py

Modify the spider file that genspider generated earlier:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request  # needed for the follow-up page requests below

from dangdang.items import DangdangItem


class DdSpider(scrapy.Spider):
    name = 'dd'
    allowed_domains = ['dangdang.com']
    # URL to crawl (page 1 of the list)
    start_urls = ['http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-recent7-0-0-1-1']

    def parse(self, response):
        item = DangdangItem()
        # Fields to scrape; each one holds the list of values found on this page
        item['bname'] = response.xpath("//div[@class='name']/a/@title").extract()
        item['burl'] = response.xpath("//div[@class='name']/a/@href").extract()
        item['bcomment'] = response.xpath("//div[@class='star']/a/text()").extract()
        item['btuijian'] = response.xpath("//span[@class='tuijian']/text()").extract()
        item['btime'] = response.xpath("//div[@class='publisher_info']/span/text()").extract()
        item['bprice'] = response.xpath("//span[@class='price_n']/text()").extract()
        item['bdiscount'] = response.xpath("//span[@class='price_s']/text()").extract()
        yield item
        # Follow pages 2 to 25; range() excludes its upper bound.
        # Scrapy's duplicate filter drops URLs that were already requested.
        for i in range(2, 26):
            url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-recent7-0-0-1-%d' % i
            yield Request(url, callback=self.parse)
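The number of list pages to request follows from simple arithmetic. The sketch below assumes 20 books per page, which the source does not state, so 500 books need 25 pages. Because `range()` excludes its upper bound, covering pages 2 to 25 in the spider takes `range(2, 26)`. The names `BASE` and `page_urls` are made up for this example.

```python
# URL pattern from the spider; %d is the page number.
BASE = "http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-recent7-0-0-1-%d"


def page_urls(total_books=500, per_page=20):
    """Build the list-page URLs needed to cover total_books entries.

    per_page=20 is an assumption about the site's page size.
    """
    pages = -(-total_books // per_page)  # ceiling division
    return [BASE % page for page in range(1, pages + 1)]


if __name__ == "__main__":
    urls = page_urls()
    print(len(urls))
    print(urls[0])
    print(urls[-1])
```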

2.3 Write pipelines.py

To use pipelines.py, uncomment the ITEM_PIPELINES setting in settings.py (Ctrl+F helps locate it):

ITEM_PIPELINES = {
    'dangdang.pipelines.DangdangPipeline': 300,
}

pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class DangdangPipeline(object):
    def process_item(self, item, spider):
        # Each field is a page-level list; print the values book by book
        for i in range(0, len(item["bname"])):
            print(item["bname"][i])
            print(item["burl"][i])
            print(item["bcomment"][i])
            print(item["btuijian"][i])
            print(item["btime"][i])
            print(item["bprice"][i])
            print(item["bdiscount"][i])
            print()
        return item
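The pipeline walks the page-level lists by index, which only works while every list has the same length. As a sketch of a more defensive variant, the helper below turns one page-level item into one dict per book and fails loudly when the lists do not line up. The function name `rows_from_page` and the `FIELDS` constant are made up for this example, and a plain dict stands in for the Scrapy item.

```python
FIELDS = ["bname", "burl", "bcomment", "btuijian", "btime", "bprice", "bdiscount"]


def rows_from_page(item, fields=FIELDS):
    """Turn one page-level item (field -> list of values) into per-book dicts.

    Raises ValueError when the lists differ in length, because the values
    could then no longer be matched to the right book.
    """
    lengths = {name: len(item[name]) for name in fields}
    if len(set(lengths.values())) > 1:
        raise ValueError("field lists differ in length: %r" % lengths)
    columns = [item[name] for name in fields]
    return [dict(zip(fields, values)) for values in zip(*columns)]


if __name__ == "__main__":
    page = {"bname": ["A", "B"], "burl": ["u1", "u2"]}
    for row in rows_from_page(page, fields=["bname", "burl"]):
        print(row)
```

One record per book also makes the exported JSON easier to check than seven parallel lists per page.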

3. Run

scrapy crawl dd -o dangdang.json

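IV. Analysis of the Results

After the run, a dangdang.json file appears in the dangdang project folder, and the scraped values are printed in the PyCharm console.

Opening dangdang.json, however, shows unreadable content instead of the Chinese text.

A web search traced this to a missing encoding setting in settings.py, so I added one (anywhere in the file works; I put it at the end):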

FEED_EXPORT_ENCODING = 'utf-8'
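The unreadable content is most likely not corrupted data but JSON `\uXXXX` escape sequences, which is how a JSON serializer writes non-ASCII characters when it is restricted to ASCII output. The standard-library sketch below shows the same effect with `json.dumps`. It illustrates the idea and does not reproduce Scrapy's exporter code.

```python
import json

record = {"bname": "流浪的地球"}

# Default: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(record)

# ensure_ascii=False keeps the characters readable in the output.
readable = json.dumps(record, ensure_ascii=False)

if __name__ == "__main__":
    print(escaped)
    print(readable)
```

Both strings decode to the same data, so only the readability of the file changes.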

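Running the crawl again produced a readable dangdang.json.

A check showed that the data was incomplete, and verifying it by book title was difficult, so I decided to also collect each book's rank number and added the corresponding statements to the relevant files:

items.py:

bnum = scrapy.Field()  # book rank number

pipelines.py: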

print(item["bnum"][i])

dd.py:

item['bnum'] = response.xpath("//div[@class='list_num ']/text()").extract()  # rank numbers 4-500
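The predicate `@class='list_num '` is an exact string comparison, trailing space included, so any rank element whose class attribute differs even slightly is skipped. The comment "4-500" suggests that the first three ranks use a different class value. If so, `bnum` would be shorter than the other lists on page 1, and indexing it by `i` in the pipeline could go out of range. The sketch below contrasts exact matching with matching on the class token. The markup, including `list_num red`, is hypothetical and was not taken from the site.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: class values and ranks are assumptions for illustration.
SAMPLE = (
    '<ul>'
    '<li><div class="list_num red">1.</div></li>'
    '<li><div class="list_num ">4.</div></li>'
    '</ul>'
)


def ranks(fragment, exact_class=None):
    """Collect rank numbers, matching the class exactly or by token."""
    root = ET.fromstring(fragment)
    found = []
    for div in root.iter("div"):
        classes = div.get("class", "")
        if exact_class is not None:
            keep = classes == exact_class           # like @class='...'
        else:
            keep = "list_num" in classes.split()    # match one class token
        if keep:
            found.append(int(div.text.rstrip(".")))
    return found


if __name__ == "__main__":
    print(ranks(SAMPLE, exact_class="list_num "))
    print(ranks(SAMPLE))
```

In an XPath expression, the usual token-matching idiom is `contains(concat(' ', normalize-space(@class), ' '), ' list_num ')`.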

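After re-crawling, a comparison by rank number showed that only the books up to rank 80 had been scraped.

After restarting PyCharm and crawling again, the pages came out of order: the results were not stored in page order but in a seemingly random order.

Repeated tests gave the same result. A web search for a fix turned up nothing.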
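Out-of-order output is expected with Scrapy: requests are sent concurrently and each response is handled as it arrives, so page order is not guaranteed. One way to cope is to sort the records by rank after the crawl and list the ranks that are missing. The sketch below assumes one dict per book with a `bnum` string such as "81."; both the record shape and the trailing dot are assumptions, and the function names are made up for this example.

```python
def rank_number(text):
    """Convert a rank string such as '81.' to an int."""
    return int(text.strip().rstrip("."))


def sort_by_rank(rows):
    """Return the rows ordered by rank, regardless of arrival order."""
    return sorted(rows, key=lambda row: rank_number(row["bnum"]))


def missing_ranks(rows, total=500):
    """List the ranks from 1 to total that do not appear in rows."""
    seen = {rank_number(row["bnum"]) for row in rows}
    return [rank for rank in range(1, total + 1) if rank not in seen]


if __name__ == "__main__":
    rows = [{"bnum": "3."}, {"bnum": "1."}]
    print(sort_by_rank(rows))
    print(missing_ranks(rows, total=3))
```

The list of missing ranks shows directly which pages or entries were not collected, which is easier than comparing titles by hand.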
