Scraping partial data on Python books from Dangdang with Scrapy

1. Install the Scrapy framework

pip install scrapy

2. Create a folder named scrapy01 on drive E, then change into that folder from the command line

3. Create the project: scrapy startproject <project name>

scrapy startproject first_scrapy

4. Open the scrapy01 folder in PyCharm

5. In items.py, define the fields that will hold the scraped data

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class FirstScrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()  # book title
    price = scrapy.Field()  # price
    author = scrapy.Field()  # author
    date = scrapy.Field()  # publication date
    publisher = scrapy.Field()  # publisher

6. Create the spider test.py in the spiders folder, with the following code:

# author:WN
# datetime:2019/11/3 15:29
from abc import ABC

import scrapy

from .. import items


class MySpider(scrapy.Spider, ABC):
    # Spider name, used with "scrapy crawl"
    name = "mySpider"

    def start_requests(self):
        for num in range(1, 101):
            url = "http://search.dangdang.com/?key=Python&act=input&page_index=%d" % num
            # yield: each request is handed to the engine, and its response
            # is passed to the callback once it arrives
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        try:
            data = response.text
            # Scrapy locates data with XPath
            # Create a Selector object to query the page
            select = scrapy.Selector(text=data)
            book_data = select.xpath("//ul[@class='bigimg']/li")
            # Extract the fields of each book
            for book in book_data:
                # Create a fresh item per book so yielded items do not share state
                item = items.FirstScrapyItem()
                # default='' avoids calling strip() on None when nothing matches
                title = book.xpath("./a/img/@alt").extract_first(default='').strip()
                price = book.xpath("./p[@class='price']/span[@class='search_now_price']/text()").extract_first(default='').lstrip('¥')
                author = book.xpath("./p[@class='search_book_author']/span/a/@title").extract_first(default='')
                date = book.xpath("./p[@class='search_book_author']/span[2]/text()").extract_first(default='').strip()
                publisher = book.xpath("./p[@class='search_book_author']/span/a[@name='P_cbs']/text()").extract_first(default='')
                item['title'] = title
                item['price'] = price
                item['author'] = author
                item['date'] = date
                item['publisher'] = publisher
                yield item
        except Exception as e:
            print(e)
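The extraction logic in parse() can be tried without network access. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree in place of scrapy.Selector, since it understands a small subset of XPath. The SAMPLE fragment and the parse_books helper are hypothetical and written by hand for illustration; the real Dangdang markup is larger and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written fragment that only mimics the class names
# used by the spider's XPath expressions.
SAMPLE = """
<div>
  <ul class="bigimg">
    <li>
      <a><img alt=" Python Basics " /></a>
      <p class="price"><span class="search_now_price">¥59.00</span></p>
    </li>
    <li>
      <a><img alt="Learning Scrapy" /></a>
      <p class="price"></p>
    </li>
  </ul>
</div>
"""


def parse_books(markup):
    """Return one dict per <li>, using '' for any field that is missing."""
    root = ET.fromstring(markup.strip())
    books = []
    for li in root.findall(".//ul[@class='bigimg']/li"):
        img = li.find("./a/img")
        price = li.find("./p[@class='price']/span[@class='search_now_price']")
        # Fall back to '' when the element or its text is absent
        title = img.get("alt", "") if img is not None else ""
        price_text = (price.text or "") if price is not None else ""
        books.append({
            "title": title.strip(),
            "price": price_text.strip().lstrip("¥"),
        })
    return books


books = parse_books(SAMPLE)
for book in books:
    print(book)
```

The second list entry has no price element, so its price falls back to an empty string instead of raising an error. A missing field then costs only that field rather than the whole page.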

7. Add the following setting to settings.py so that the items yielded by test.py are passed to the class in pipelines.py

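# Register the pipeline class that receives the items
# Format: project_name.pipelines.ClassName
# The integer sets the order in which pipelines run (lower values run first);
# by convention it lies in the 0-1000 range
ITEM_PIPELINES = {
    'first_scrapy.pipelines.FirstScrapyPipeline': 300,
}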
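The integer only matters once more than one pipeline is registered. As a rough illustration (not Scrapy's own code), the sketch below sorts a settings dict by its values to show the order in which items would pass through the pipelines. CleanPricePipeline is a hypothetical second pipeline that does not exist in this project.

```python
# Illustration only: CleanPricePipeline is a hypothetical second pipeline.
ITEM_PIPELINES = {
    "first_scrapy.pipelines.FirstScrapyPipeline": 300,
    "first_scrapy.pipelines.CleanPricePipeline": 100,
}

# Lower values run first, so sorting by value gives the processing order.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in run_order:
    print(ITEM_PIPELINES[path], path)
```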

8. Write the code for pipelines.py. First create the MySQL database book and the table books:

create database book;
use book;
set character_set_results=gbk;
create table books(
bTitle varchar(256) primary key,
bPrice varchar(50),
bAuthor varchar(50),
bDate varchar(32),
bPublisher varchar(256)
);

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql


class FirstScrapyPipeline(object):
    # Called once when the spider starts
    def open_spider(self, spider):
        print('opened')
        try:
            # Connect to the database
            self.con = pymysql.connect(host='localhost', port=3306, user='root',
                                       password='root', db='book', charset='utf8')
            # Create a cursor
            self.cursor = self.con.cursor()
            self.opened = True
            self.count = 0
        except Exception as e:
            print(e)
            self.opened = False

    # Called once when the spider closes
    def close_spider(self, spider):
        if self.opened:
            self.con.commit()
            self.con.close()
            self.opened = False
            print("close")
            print("Total books scraped:", self.count)

    def process_item(self, item, spider):
        try:
            print(item['title'])
            print(item['price'])
            print(item['author'])
            print(item['date'])
            print(item['publisher'])
            if self.opened:
                self.cursor.execute(
                    'insert into books(bTitle,bPrice,bAuthor,bDate,bPublisher) values (%s,%s,%s,%s,%s)',
                    (item['title'], item['price'], item['author'], item['date'], item['publisher']))
                self.count += 1
        except Exception as err:
            print(err)
        return item
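The open_spider / process_item / close_spider lifecycle can be exercised without a MySQL server. The sketch below swaps pymysql for the standard library's sqlite3 with an in-memory database, so it is a stand-in and not the pipeline above. SqliteBookPipeline and the sample book are hypothetical, and sqlite3 uses ? placeholders where pymysql uses %s.

```python
import sqlite3


class SqliteBookPipeline:
    """Hypothetical stand-in for FirstScrapyPipeline, backed by SQLite."""

    def open_spider(self, spider):
        # In-memory database, so nothing is written to disk
        self.con = sqlite3.connect(":memory:")
        self.con.execute(
            "create table books(bTitle text primary key, bPrice text, "
            "bAuthor text, bDate text, bPublisher text)"
        )
        self.count = 0

    def process_item(self, item, spider):
        try:
            # Parameterized insert: values are passed separately from the SQL
            self.con.execute(
                "insert into books(bTitle,bPrice,bAuthor,bDate,bPublisher) "
                "values (?,?,?,?,?)",
                (item["title"], item["price"], item["author"],
                 item["date"], item["publisher"]),
            )
            self.count += 1
        except sqlite3.IntegrityError as err:
            # A repeated title violates the primary key and is skipped
            print(err)
        return item

    def close_spider(self, spider):
        self.con.commit()
        self.con.close()


# Drive the pipeline by hand; Scrapy would normally make these calls.
pipeline = SqliteBookPipeline()
pipeline.open_spider(spider=None)
book = {"title": "Python Basics", "price": "59.00", "author": "A. Author",
        "date": "/2019-01-01", "publisher": "Example Press"}
pipeline.process_item(book, spider=None)
pipeline.process_item(book, spider=None)  # same title again
stored = pipeline.con.execute("select count(*) from books").fetchone()[0]
inserted = pipeline.count
pipeline.close_spider(spider=None)
```

Because bTitle is the primary key, a book whose title appears on more than one result page is inserted only once. The second insert raises an integrity error that the except branch prints and skips.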

9. Run the project

(1) Run it from the command line: scrapy crawl <spider name> -s LOG_ENABLED=False. The trailing option suppresses the log output.

scrapy crawl mySpider -s LOG_ENABLED=False

(2) Create run.py in the parent folder of the spiders folder. Running this file runs the project without opening a DOS window. The code is as follows:

# author:WN
# datetime:2019/11/3 15:36
from scrapy import cmdline
# Run the crawl from Python; no DOS window is needed
# scrapy crawl <spider name> <option that suppresses log output>
cmdline.execute("scrapy crawl mySpider -s LOG_ENABLED=False".split())
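cmdline.execute receives the command as a list of arguments, which is why run.py calls split() on the command string. A small sketch of what that list looks like:

```python
command = "scrapy crawl mySpider -s LOG_ENABLED=False"

# split() breaks the string on whitespace into an argv-style list
argv = command.split()
print(argv)
```

Splitting on whitespace is enough here because no argument contains a space. For arguments that do, the standard library's shlex.split handles quoting.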
