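Crawling Maoyan's Top 100 movie board again, this time with Scrapy!
Background:
Around May of last year I wrote an article about crawling the Maoyan movie board with Requests. I came across that article again today, and since I have recently been learning the Scrapy framework, I decided to crawl the board once more, this time with Scrapy.
Overview:
The board page being crawled looks roughly like Figure 1-1. The fields to collect are the movie name, the starring cast, the release date, the score, and the poster image URL. The poster images are then downloaded and saved locally, as shown in Figure 1-2.
Figure 1-1
Figure 1-2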
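Parsing the page:
1. Open the page in Google Chrome and press F12 to open the developer tools. Click the arrow icon in the top-left corner, then hover over a movie title to jump to that element's position in the page source. Once the element is located, its attributes can be read directly from the source, as shown in Figure 2-1:
Figure 2-1
2. As the figure above shows, the information we need sits in these nodes and their attribute values, so the remaining question is how to extract them. The simplest approach is to select a node, right-click it, and choose "Copy > Copy XPath", then use that XPath expression to locate the element and read its data. XPath syntax itself is not covered here; any XPath tutorial explains it.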
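To make the relative-path idea concrete, here is a minimal sketch that walks a small, hypothetical piece of markup modeled on the XPath expressions used in the spider. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath (no `/@attr` or `text()` steps), so attributes and text are read through `.get()` and `.text` instead. The sample markup and the `parse_board` helper are illustrative assumptions, not the real Maoyan page or part of the project.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page has more nesting and attributes.
SAMPLE = """
<dl class="board-wrapper">
  <dd>
    <a href="/films/1" title="Movie A"></a>
    <div><div>
      <div><p>Movie A</p><p> Starring: X, Y </p><p>Release: 1993-01-01</p></div>
      <div><p><i>9.</i><i>5</i></p></div>
    </div></div>
  </dd>
</dl>
"""


def parse_board(markup):
    """Extract name, link, starring and score from each <dd> entry."""
    root = ET.fromstring(markup)  # root is the <dl> element
    movies = []
    for dd in root.findall("./dd"):
        link = dd.find("./a")
        # p[2] selects the second <p> child of the first matching <div>
        starring = dd.find("./div/div/div/p[2]")
        # div[2] selects the second inner <div>, which holds the two score parts
        score_parts = dd.findall("./div/div/div[2]/p/i")
        movies.append({
            "name": link.get("title"),
            "href": link.get("href"),
            "starring": (starring.text or "").strip(),
            "score": "".join(part.text or "" for part in score_parts),
        })
    return movies


movies = parse_board(SAMPLE)
print(movies)
```

In Scrapy the same paths are written as `dd.xpath('./a/@title')` and so on, with the leading `./` keeping each lookup relative to the current `dd` node.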
Code:
The spider file
# -*- coding: utf-8 -*-
import urllib.parse

import scrapy
from maoyan.items import MaoyanItem


class Top100Spider(scrapy.Spider):
    name = 'top_100'
    allowed_domains = ['trade.maoyan.com']
    start_urls = ['https://trade.maoyan.com/board/4']

    def parse(self, response):
        dd_list = response.xpath('//dl[@class="board-wrapper"]/dd')
        for dd in dd_list:
            item = MaoyanItem()
            item['name'] = dd.xpath('./a/@title').extract_first()  # movie name
            item['starring'] = dd.xpath('./div/div/div/p[2]/text()').extract_first()  # starring cast
            if item['starring'] is not None:
                item['starring'] = item['starring'].strip()
            item['releasetime'] = dd.xpath('./div/div/div/p[3]/text()').extract_first()  # release date
            score_one = dd.xpath('./div/div/div[2]/p/i[1]/text()').extract_first()  # first part of the score
            score_two = dd.xpath('./div/div/div[2]/p/i[2]/text()').extract_first()  # second part of the score
            # Either part may be missing, so fall back to an empty string
            item['score'] = (score_one or '') + (score_two or '')
            href = dd.xpath('./a/@href').extract_first()
            if href is None:
                continue  # no detail page link, nothing more to fetch for this entry
            url = 'https://trade.maoyan.com' + href  # movie detail page
            yield scrapy.Request(url, callback=self.parse_detail, meta={'item': item})

        # Follow the "next page" link, if there is one
        next_page = response.xpath('//div[@class="pager-main"]/ul/li/a[contains(text(), "下一页")]/@href').extract_first()
        if next_page is not None:
            print('Next page link: %s' % next_page)
            new_link = urllib.parse.urljoin(response.url, next_page)
            yield scrapy.Request(new_link, callback=self.parse)

    def parse_detail(self, response):
        item = response.meta['item']
        item['image'] = response.xpath('//div[@class ="celeInfo-left"]/div/img/@src').extract_first()  # poster image URL
        yield item
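Two small details in the spider are easy to get wrong: resolving the relative "next page" href against the current URL, and joining the two halves of the score when one of them is missing. The sketch below isolates both using only the standard library. The helper names `next_page_url` and `join_score` are hypothetical, and the `?offset=10` href is an assumed shape for the pagination link, not something confirmed by the article.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Resolve a relative next-page href against the current page URL."""
    if href is None:
        return None  # no "next page" link, so this is the last page
    return urljoin(current_url, href)


def join_score(first_part, second_part):
    """Concatenate the two score fragments, treating a missing one as empty."""
    return (first_part or "") + (second_part or "")


base = "https://trade.maoyan.com/board/4"
print(next_page_url(base, "?offset=10"))  # query-only href (assumed shape)
print(next_page_url(base, "/films/1"))    # absolute-path href
print(join_score("9.", "5"))
```

Inside a spider, `response.urljoin(href)` performs the same resolution against `response.url`, which avoids importing `urllib.parse` at all.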
items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class MaoyanItem(scrapy.Item):
    name = scrapy.Field()         # movie name
    starring = scrapy.Field()     # starring cast
    releasetime = scrapy.Field()  # release date
    image = scrapy.Field()        # poster image URL
    score = scrapy.Field()        # movie score
pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


# Download the poster images with ImagesPipeline
class MaoyanPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Skip items whose detail page yielded no image URL
        if item.get('image'):
            print('item image is', item['image'])
            yield scrapy.Request(item['image'])

    def item_completed(self, results, item, info):
        # results is a list of (success, info) tuples, one per requested image
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item
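The list comprehension in `item_completed` is easier to read once the shape of `results` is clear. Scrapy passes a list of two-element tuples: a success flag, and either a dict describing the downloaded file (with keys such as `url`, `path` and `checksum`) or a failure object. The sketch below reproduces that filtering on made-up data; the sample values and the `successful_paths` helper are hypothetical, and a plain `Exception` stands in for the failure object Scrapy would supply.

```python
def successful_paths(results):
    """Return the storage paths of the downloads that succeeded."""
    return [info["path"] for ok, info in results if ok]


# Made-up results for two requested images: one success, one failure.
sample_results = [
    (True, {
        "url": "https://example.com/poster-a.jpg",
        "path": "full/poster-a.jpg",
        "checksum": "0123456789abcdef",
    }),
    (False, Exception("download failed")),
]

paths = successful_paths(sample_results)
print(paths)
```

Each returned path is relative to the directory configured in `IMAGES_STORE`.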
settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for maoyan project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
import random

BOT_NAME = 'maoyan'

SPIDER_MODULES = ['maoyan.spiders']
NEWSPIDER_MODULE = 'maoyan.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'maoyan (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

USER_AGENTS_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
]

# Picked once when the settings module is loaded, so every request in a run
# shares the same User-Agent
USER_AGENT = random.choice(USER_AGENTS_LIST)

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    # 'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    'User-Agent': USER_AGENT,
}

# Enable the image pipeline; Scrapy only runs pipelines listed here
ITEM_PIPELINES = {
    'maoyan.pipelines.MaoyanPipeline': 300,
}

IMAGES_STORE = 'D:\\MaoYan'  # directory where downloaded images are saved
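Because `random.choice` runs once when the settings module is imported, the configuration above uses a single User-Agent for the whole crawl. If a different User-Agent per request is wanted, one common approach is a small downloader middleware. The sketch below is an assumption about how that could look, not part of the original project: the class name `RandomUserAgentMiddleware` is hypothetical, and the demo at the end uses a stand-in object instead of a real Scrapy request so that it runs without Scrapy installed.

```python
import random
from types import SimpleNamespace


class RandomUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # Read the pool from the project's USER_AGENTS_LIST setting
        return cls(crawler.settings.getlist("USER_AGENTS_LIST"))

    def process_request(self, request, spider):
        if self.user_agents:
            request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # returning None lets Scrapy continue handling the request


# Stand-in for a Scrapy request: only the headers mapping is needed here.
fake_request = SimpleNamespace(headers={})
middleware = RandomUserAgentMiddleware(["UA-1", "UA-2"])
middleware.process_request(fake_request, spider=None)
print(fake_request.headers["User-Agent"])
```

To use it in the project, the class would be registered under `DOWNLOADER_MIDDLEWARES` in settings.py, for example as `'maoyan.middlewares.RandomUserAgentMiddleware': 543`, where the module path is again an assumption about where the class is placed.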
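Summary:
That is how to crawl Maoyan's Top 100 movie board with Scrapy. The approach is not hard; the main difficulty lies in locating elements with XPath to extract the data. Once the crawl succeeds, all that is left is to sit back and enjoy the movies. Have a great weekend, everyone!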