Scraping Bilibili anime listings with the Scrapy framework

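It has been a while since I last wrote a crawler. Browsing Bilibili today, I noticed there is plenty to scrape, such as the ranking lists on the homepage. The anime (番剧) section looked like an easy place to find the underlying data source, so I picked the anime index linked from the homepage for practice.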

Page being scraped:
https://www.bilibili.com/anime/index/#season_version=-1&area=-1&is_finish=-1&copyright=-1&season_status=-1&season_month=-1&year=-1&style_id=-1&order=3&st=1&sort=0&page=1

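Looking at how the URL is built and stripping the parameters that do not affect the request, I arrived at this API URL:
https://api.bilibili.com/pgc/season/index//result?page=1&season_type=1&pagesize=20&type=1
Only the value of page= has to change to fetch each batch of results. The largest page number was 153 when I checked. The data is probably not very useful, but I wrote the code anyway.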
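The paging scheme can be sketched in a few lines. This is only an illustration and not part of the project: the helper name `build_page_url` is made up here, and the parameter values are simply the ones from the API URL above.

```python
from urllib.parse import urlencode

# Endpoint as used in this post (path copied as-is).
BASE_URL = "https://api.bilibili.com/pgc/season/index//result"


def build_page_url(page, pagesize=20):
    """Return the listing URL for one page (hypothetical helper)."""
    params = {
        "page": page,
        "season_type": 1,
        "pagesize": pagesize,
        "type": 1,
    }
    return BASE_URL + "?" + urlencode(params)


# Only `page` changes between requests.
urls = [build_page_url(p) for p in range(1, 4)]
for url in urls:
    print(url)
```

The spider further down builds the same list with a `%s` format string; either approach should work.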

A main script that launches Scrapy, so there is no need to type scrapy crawl <name> every time:

# -*- coding: utf-8 -*-
#@Project filename：PythonDemo  dramaMain.py
#@IDE   ：IntelliJ IDEA
#@Author ：ganxiang
#@Date   ：2020/03/02 0002 19:16
from scrapy.cmdline import execute
import os
import sys

# make the project directory importable, then run "scrapy crawl drama"
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
execute(['scrapy','crawl','drama'])

The spider, dramaSeries.py

# -*- coding: utf-8 -*-
import scrapy
import json
from ..items import DramaseriesItem


class DramaSpider(scrapy.Spider):
    name = 'drama'
    # allowed_domains expects bare domain names, not full URLs
    allowed_domains = ['api.bilibili.com']
    # running row number; pages arrive concurrently, so it follows arrival order
    i = 1
    # pages 1 to 100, 20 entries per page
    start_urls = ['https://api.bilibili.com/pgc/season/index//result?page=%s&season_type=1&pagesize=20&type=1' % s for s in range(1, 101)]

    def parse(self, response):
        drama = json.loads(response.text)
        data = drama['data']
        data_list = data['list']
        # print(data_list)
        for field in data_list:
            # create a fresh item per entry so yielded items do not share state
            item = DramaseriesItem()
            item['number'] = self.i
            item['badge'] = field['badge']
            item['cover_img'] = field['cover']
            item['index_show'] = field['index_show']
            item['link'] = field['link']
            item['media_id'] = field['media_id']
            item['order_type'] = field['order_type']
            item['season_id'] = field['season_id']
            item['title'] = field['title']
            print(self.i, item)
            # advance once per yielded item so rows are numbered consecutively
            self.i += 1
            yield item
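To see what the parse step does without hitting the network, the extraction can be tried on a hand-written payload. The sample values below are invented. Only the key names (`data`, `list`, `title`, `cover` and so on) come from the spider above. Using `.get()` so that a missing key yields `None` instead of raising `KeyError` is a choice made for this sketch, not something the spider does.

```python
import json

# Keys the spider copies from each entry; 'cover' is stored as 'cover_img'.
FIELD_MAP = {
    "badge": "badge",
    "cover": "cover_img",
    "index_show": "index_show",
    "link": "link",
    "media_id": "media_id",
    "order_type": "order_type",
    "season_id": "season_id",
    "title": "title",
}


def extract_rows(body, start=1):
    """Turn one JSON response body into a list of plain dicts."""
    payload = json.loads(body)
    rows = []
    for number, entry in enumerate(payload["data"]["list"], start=start):
        row = {"number": number}
        for source_key, item_key in FIELD_MAP.items():
            # .get() leaves None for keys an entry does not carry
            row[item_key] = entry.get(source_key)
        rows.append(row)
    return rows


# Invented sample data, shaped like the response the spider expects.
sample = json.dumps({
    "data": {
        "list": [
            {"title": "Example A", "media_id": 1, "season_id": 10, "cover": "a.jpg"},
            {"title": "Example B", "media_id": 2, "season_id": 20, "badge": "x"},
        ]
    }
})

for row in extract_rows(sample):
    print(row["number"], row["title"], row["cover_img"])
```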

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DramaseriesItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    number = scrapy.Field()
    badge = scrapy.Field()
    cover_img = scrapy.Field()
    index_show = scrapy.Field()
    link = scrapy.Field()
    media_id = scrapy.Field()
    order_type = scrapy.Field()
    season_id = scrapy.Field()
    title = scrapy.Field()

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from openpyxl import Workbook


class DramaseriesPipeline(object):
    # column order shared by the header row and every data row
    columns = ['number', 'title', 'link', 'media_id', 'season_id', 'index_show', 'cover_img', 'badge']

    def open_spider(self, spider):
        # create the workbook and write the header row once per crawl
        self.excelBook = Workbook()
        self.activeSheet = self.excelBook.active
        self.activeSheet.append(self.columns)

    def process_item(self, item, spider):
        self.activeSheet.append([item[name] for name in self.columns])
        return item

    def close_spider(self, spider):
        # save once at the end instead of rewriting the file for every item
        self.excelBook.save('./drama.xlsx')
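If openpyxl is not available, the same rows could be written as CSV with the standard library. This is a swapped-in alternative rather than what the project does. The function name `write_rows` and the sample row are invented for illustration, while the column order matches the pipeline above.

```python
import csv
import io

COLUMNS = ["number", "title", "link", "media_id", "season_id",
           "index_show", "cover_img", "badge"]


def write_rows(rows, stream):
    """Write a header row plus one line per item dict to a text stream."""
    writer = csv.DictWriter(stream, fieldnames=COLUMNS, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Invented sample row; 'order_type' is ignored because it is not a column.
rows = [{"number": 1, "title": "Example A", "link": "https://example.com/a",
         "media_id": 1, "season_id": 10, "index_show": "", "cover_img": "a.jpg",
         "badge": "", "order_type": "x"}]

buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```

To write a real file, pass `open('drama.csv', 'w', newline='', encoding='utf-8')` instead of the in-memory buffer. Scrapy's built-in feed export (`scrapy crawl drama -o drama.csv`) is another route that needs no pipeline at all.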

settings.py
Enable the following settings:

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
ITEM_PIPELINES = {
    'dramaSeries.pipelines.DramaseriesPipeline': 300,
}
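Since the robots.txt check is switched off, it may be worth slowing the crawl down a little. The values below are guesses for illustration and are not taken from the original project. The setting names themselves are standard Scrapy settings.

```python
# Optional politeness settings for settings.py (values are illustrative).
DOWNLOAD_DELAY = 1.0                 # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 2   # limit parallel requests per domain
AUTOTHROTTLE_ENABLED = True          # let Scrapy adapt the delay to server load
```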

The run collected a little over two thousand records, and there is a lot more that could be scraped.
