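[Python] Scraping Movie Information with Scrapy and Storing It in a Database

This article uses the Scrapy framework to crawl movie information and store the data in a MySQL database.

1. Install the required Python modules

Install the modules with whichever Python package manager you use. For example, with pip: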

pip install scrapy
pip install pymysql
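
To confirm that both installs succeeded, a quick check like the following can help. It is a minimal sketch using only the standard library; missing_modules is a hypothetical helper name, not part of Scrapy or PyMySQL.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(["scrapy", "pymysql"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("scrapy and pymysql are both importable")
```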

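2. Create the Scrapy project

As with other Python frameworks, create the project with the scrapy startproject projectname command:

If you see the prompt shown above, the project was created successfully. If you instead see an error such as command not found, Scrapy is not installed correctly and you need to reinstall it. After the project is created, its directory looks like this:

Here is what some of the files are for.

items.py holds your models, i.e. the entities you scrape.
pipelines.py is where the scraped data is processed after the spider has extracted it from the page.
settings.py holds the framework configuration.
spiders/ is the folder for the spider code.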
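
For reference, the layout that scrapy startproject normally generates can be sketched as follows. The project name dialog is an assumption inferred from the dialog.pipelines.DialogPipeline setting used later in this article, and expected_layout is a hypothetical helper; substitute your own project name.

```python
from pathlib import PurePosixPath


def expected_layout(project):
    """Return the relative paths `scrapy startproject <project>` usually creates."""
    root = PurePosixPath(project)
    package = root / project  # the inner Python package has the same name
    return [
        str(root / "scrapy.cfg"),
        str(package / "__init__.py"),
        str(package / "items.py"),
        str(package / "middlewares.py"),
        str(package / "pipelines.py"),
        str(package / "settings.py"),
        str(package / "spiders" / "__init__.py"),
    ]


for path in expected_layout("dialog"):
    print(path)
```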

3. Coding

items.py: first we need to work out which pieces of information we want to scrape, then declare one field for each.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DialogItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class Movie(scrapy.Item):
    name = scrapy.Field()      # movie title
    href = scrapy.Field()      # movie link
    actor = scrapy.Field()     # actors
    status = scrapy.Field()    # status
    district = scrapy.Field()  # region
    director = scrapy.Field()  # director
    genre = scrapy.Field()     # genre
    intro = scrapy.Field()     # synopsis

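In the spiders folder, create the spider file MovieSpider.py and define a MovieSpider class that inherits from scrapy.Spider. The spider uses XPath to locate elements; the basics are introduced below, and the Runoob (菜鸟教程) XPath tutorial covers more usage.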

import scrapy

# The package name must match your project; this article's settings.py uses "dialog"
from dialog.items import Movie


class MovieSpider(scrapy.Spider):
    # Spider name; it is used later to start the crawl
    name = 'MovieSpider'
    # Domain only, without the scheme or the resource path
    allowed_domains = ['88ys.com']
    # Start URL, i.e. the first address the spider requests
    start_urls = ['https://www.88ys.com/vod-type-id-14-pg-1.html']

    def parse(self, response):
        # Each matching <li> is one movie entry on the list page
        urls = response.xpath('//li[@class="p1 m1"]')
        for item in urls:
            movie = Movie()
            movie['name'] = item.xpath('./a/span[@class="lzbz"]/p[@class="name"]/text()').extract_first()
            movie['href'] = 'https://www.88ys.com' + item.xpath('./a/@href').extract_first()
            # Pass the partly filled item on to the detail-page callback via meta
            request = scrapy.Request(movie['href'], callback=self.crawl_details)
            request.meta['movie'] = movie
            yield request

    def crawl_details(self, response):
        # Fill in the remaining fields from the detail page
        movie = response.meta['movie']
        movie['actor'] = response.xpath('//div[@class="ct-c"]/dl/dt[2]/text()').extract_first()
        movie['status'] = response.xpath('//div[@class="ct-c"]/dl/dt[1]/text()').extract_first()
        movie['district'] = response.xpath('//div[@class="ct-c"]/dl/dd[4]/text()').extract_first()
        movie['director'] = response.xpath('//div[@class="ct-c"]/dl/dd[3]/text()').extract_first()
        movie['genre'] = response.xpath('//div[@class="ct-c"]/dl/dd[1]/text()').extract_first()
        movie['intro'] = response.xpath('//div[@class="ee"]/text()').extract_first()
        yield movie
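
The spider builds each detail URL by prefixing the domain to the href. That raises a TypeError if extract_first() returns None, and it would double the prefix for an href that is already absolute. A hedged alternative is to resolve links with urllib.parse.urljoin; Scrapy's response.urljoin offers the same kind of resolution relative to the response. The helper name absolute_url and the sample paths below are hypothetical.

```python
from urllib.parse import urljoin

LIST_PAGE = "https://www.88ys.com/vod-type-id-14-pg-1.html"


def absolute_url(page_url, href):
    """Resolve href against page_url; return None when no href was extracted."""
    if not href:
        return None
    return urljoin(page_url, href)


# Hypothetical hrefs: site-relative, already absolute, and missing
print(absolute_url(LIST_PAGE, "/vod/1.html"))
print(absolute_url(LIST_PAGE, "https://www.88ys.com/vod/2.html"))
print(absolute_url(LIST_PAGE, None))
```

Inside parse, the equivalent would be to skip the entry when the href is missing and otherwise call response.urljoin(href).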

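XPath usage

Syntax	Description
//	Recursive search through the whole document
.	Selects the current node
..	Selects the parent node
text()	Selects the text inside the tag
@attribute	Selects the value of that attribute
tag	A node name, i.e. an HTML tag name
`div[@class="ct-c"]`	A div whose class attribute is `ct-c`
`/dl/dt[1]`	The first dt under dl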
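
The expressions in the table can be tried without running a crawl. The sketch below uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath: `//` is written `.//`, and `text()` and `@attribute` are emulated with `.text` and `.get()`. The sample markup is hypothetical, modelled on the selectors used in the spider rather than copied from the real site.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup shaped like the pages the spider targets
SAMPLE = """
<html><body>
  <ul>
    <li class="p1 m1"><a href="/vod/1.html"><span class="lzbz"><p class="name">Movie A</p></span></a></li>
    <li class="p1 m1"><a href="/vod/2.html"><span class="lzbz"><p class="name">Movie B</p></span></a></li>
  </ul>
  <div class="ct-c"><dl><dt>Status: HD</dt><dt>Cast: Someone</dt><dd>Genre: Drama</dd></dl></div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# .//li[@class="p1 m1"] : recursive search plus an attribute filter
items = root.findall('.//li[@class="p1 m1"]')

# ./a/... : a path relative to the current node; .text stands in for text()
names = [li.find('./a/span[@class="lzbz"]/p[@class="name"]').text for li in items]

# .get("href") stands in for @href
hrefs = [li.find('./a').get('href') for li in items]

# dt[1] : the first dt under dl (XPath positions start at 1)
first_dt = root.find('.//div[@class="ct-c"]/dl/dt[1]').text

# .. : the parent of the first matching node
parent = root.find('.//p[@class="name"]/..')

print(names, hrefs, first_dt, parent.tag)
```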

Write pipelines.py to store the scraped data in the database

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql


class DialogPipeline(object):
    def __init__(self):
        # PyMySQL 1.0+ accepts keyword arguments only
        self.conn = pymysql.connect(host='localhost', user='huangwei',
                                    password='123456789', database='db_88ys')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Parameterized insert; the driver escapes the values
        sql = ("insert into tb_movie(name, href, actor, status, district, director, genre, intro) "
               "values(%s, %s, %s, %s, %s, %s, %s, %s)")
        self.cursor.execute(sql, (item['name'], item['href'], item['actor'], item['status'],
                                  item['district'], item['director'], item['genre'], item['intro']))
        self.conn.commit()
        # Return the item so that any later pipeline still receives it
        return item

    def close_spider(self, spider):
        # Release the database resources when the spider finishes
        self.cursor.close()
        self.conn.close()
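
The INSERT statement in the pipeline expects a tb_movie table with those eight columns to exist already. The original post does not show its schema, so the following is only a sketch: the id column, the column types and lengths, the charset, and the create_table helper are all assumptions.

```python
# Hypothetical schema for tb_movie. Only the table and column names come from
# the pipeline's INSERT statement; everything else is an assumption.
CREATE_TB_MOVIE = """
CREATE TABLE IF NOT EXISTS tb_movie (
    id       INT AUTO_INCREMENT PRIMARY KEY,
    name     VARCHAR(255),
    href     VARCHAR(512),
    actor    TEXT,
    status   VARCHAR(255),
    district VARCHAR(255),
    director VARCHAR(255),
    genre    VARCHAR(255),
    intro    TEXT
) DEFAULT CHARSET = utf8mb4
"""


def create_table(connection):
    """Create tb_movie through a DB-API connection to the MySQL database."""
    cursor = connection.cursor()
    try:
        cursor.execute(CREATE_TB_MOVIE)
        connection.commit()
    finally:
        cursor.close()
```

To use it, pass a connection opened with the same pymysql.connect arguments as the pipeline, once, before the first crawl.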

Update the relevant settings in settings.py

# Whether to obey robots.txt
ROBOTSTXT_OBEY = False

# Send browser-like headers with each request
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}

# Enable the pipeline so that the scraped data is saved
ITEM_PIPELINES = {
    'dialog.pipelines.DialogPipeline': 300,
}
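
Because robots.txt is ignored here, it may be worth slowing the crawl down. The fragment below is not part of the original configuration. DOWNLOAD_DELAY, CONCURRENT_REQUESTS_PER_DOMAIN and AUTOTHROTTLE_ENABLED are standard Scrapy settings, but the values are assumptions to tune for the target site.

```python
# Optional additions to settings.py (values are illustrative assumptions)

# Wait about one second between requests to the same site
DOWNLOAD_DELAY = 1

# Limit the number of parallel requests per domain
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Let Scrapy adapt the delay to the server's response times
AUTOTHROTTLE_ENABLED = True
```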

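4. Start the spider

From the project directory, run scrapy crawl MovieSpider. Logs are printed during the run; add --nolog to the command to suppress them. Before starting, of course, the database table must already exist. The startup process is as follows: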
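
The same command can also be launched from a small Python script. This is a minimal sketch: build_crawl_command and run_spider are hypothetical helper names, and it assumes the scrapy executable is on PATH and that project_dir is the directory containing scrapy.cfg.

```python
import subprocess


def build_crawl_command(spider_name, nolog=False):
    """Return the argument list for `scrapy crawl <spider_name>`."""
    command = ["scrapy", "crawl", spider_name]
    if nolog:
        command.append("--nolog")  # suppress Scrapy's log output
    return command


def run_spider(project_dir, spider_name="MovieSpider", nolog=False):
    """Run the crawl inside project_dir and raise if Scrapy exits with an error."""
    subprocess.run(build_crawl_command(spider_name, nolog), cwd=project_dir, check=True)


# Show the command that run_spider would execute
print(" ".join(build_crawl_command("MovieSpider", nolog=True)))
```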

Finally, check the database: the crawl succeeded!
