Scraping Bilibili's dynamic pages with Scrapy

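This time we will scrape Bilibili's popular videos: the title of each video, plus the number of people currently watching and the live danmaku (bullet comments) from its detail page, and store everything in MongoDB.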

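Open one of the detail pages and you will see that its content is loaded dynamically, so the approach used for static pages cannot extract it.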

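Instead, we capture the page's network traffic with Chrome's built-in developer tools (press F12).
Filter the requests by XHR and inspect each Response to find the data we need, then take the request URL from the Headers tab. Because we are scraping eight videos, each URL has to be expressed as a general pattern, which we assemble from values extracted with regular expressions.
We found three URLs.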

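This URL cannot be opened directly in the browser, but its response contains what we need. We decode it with response.body.decode() and then pull out the value with a regular expression.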
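A minimal sketch of that step, run against a made-up body rather than a real capture: the bytes are decoded to text and the content of the `<online_count>` tag (the tag the spider in this post searches for) is extracted with a regular expression. The sample payload and the helper name `extract_online_count` are assumptions for illustration only.

```python
import re

# Hypothetical stand-in for response.body; a real payload contains many more tags.
SAMPLE_BODY = b"<ip>0.0.0.0</ip><online_count>128</online_count><login>false</login>"


def extract_online_count(body: bytes) -> str:
    """Decode the raw body and return the text inside <online_count>, or '' if absent."""
    text = body.decode()  # bytes.decode() defaults to UTF-8
    matches = re.findall(r"<online_count>(.*?)</online_count>", text, re.S)
    return "".join(matches)


print(extract_online_count(SAMPLE_BODY))
```

Returning an empty string when the tag is missing mirrors what `''.join()` does on an empty match list, so the caller never has to handle an exception for this case.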

The letters dm in the path point to the danmaku URL. Opening it shows that all the danmaku are there, so this is the URL we need.
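To make the shape of that data concrete, here is a small sketch that parses a danmaku document with the standard library. It assumes the layout the spider in this post relies on: an `<i>` root with one `<d>` element per comment, whose `p` attribute is a comma-separated list with a Unix timestamp at index 4. The sample XML and the function name `parse_danmaku` are invented for the example.

```python
import time
import xml.etree.ElementTree as ET

# Hypothetical sample: <i> root, one <d> per comment.
# Field index 4 of the p attribute is read as a Unix timestamp, as in the spider.
SAMPLE_XML = (
    "<i>"
    '<d p="12.5,1,25,16777215,1569400000,0,abcd1234,1">first comment</d>'
    '<d p="40.0,1,25,16777215,1569400060,0,efgh5678,2">second comment</d>'
    "</i>"
)


def parse_danmaku(xml_text):
    """Return a list of (send_time, text) tuples, one per <d> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for d in root.findall("d"):
        fields = d.get("p", "").split(",")
        # Convert the timestamp to local time in a readable format.
        sent_at = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(int(fields[4])))
        rows.append((sent_at, d.text or ""))
    return rows


for sent_at, text in parse_danmaku(SAMPLE_XML):
    print(sent_at, text)
```

Keeping the time and the text as separate fields, instead of concatenating them into one string, makes the result easier to query once it is in MongoDB.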

The response here contains coin and the other figures we need, which completes the set of data sources.
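As a rough illustration of reading those figures, the sketch below parses a JSON body and picks out `coin`, `favorite` and `like` from its `data` object, which are the keys the spider in this post reads. The sample text and the helper name `parse_stats` are assumptions; a real response carries more fields.

```python
import json

# Hypothetical response text; only the keys the spider reads are shown.
SAMPLE_JSON = '{"code": 0, "data": {"aid": 123456, "coin": 42, "favorite": 7, "like": 99}}'


def parse_stats(text):
    """Pick coin, favorite and like out of the "data" object (None if missing)."""
    data = json.loads(text).get("data") or {}
    return {
        "coin": data.get("coin"),
        "favorite": data.get("favorite"),
        "like": data.get("like"),
    }


print(parse_stats(SAMPLE_JSON))
```

Falling back to an empty dict when `data` is missing avoids an `AttributeError` if the API answers with an error payload.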
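The remaining difficulty is assembling a general expression for each of these URLs.
The code below shows how, using regular expressions among other things.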
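Taken on its own, the URL assembly might look like the following sketch. It builds the three API URLs used in this post from the av number in the detail-page URL and a cid found in the page source. The sample URL, the sample page source and the function name `build_api_urls` are hypothetical, and unlike the spider, which joins every regex match, this version takes only the first cid it finds.

```python
import re

# Hypothetical inputs: a detail-page URL and a fragment of its HTML source.
DETAIL_URL = "https://www.bilibili.com/video/av123456"
PAGE_SOURCE = '{"aid":123456,"bvid":"","cid":7654321,"title":"demo"}'
# buvid value copied from the URL captured in this post.
BUVID = "D7512C54-9EB9-4D8A-ADF9-040A66C06A6C190950infoc"


def build_api_urls(detail_url, page_source, buvid=BUVID):
    """Build the three API URLs from the av number and the first cid in the page."""
    aid = detail_url.split("/av")[1].strip("/")
    match = re.search(r'"cid":(\d+)', page_source)
    if match is None:
        raise ValueError("cid not found in page source")
    cid = match.group(1)
    return {
        "barrage_api": "https://api.bilibili.com/x/v1/dm/list.so?oid=" + cid,
        "collection_api": "https://api.bilibili.com/x/web-interface/archive/stat?aid=" + aid,
        "watching_url": (
            "https://api.bilibili.com/x/player.so?id=cid%3A" + cid
            + "&aid=" + aid + "&buvid=" + buvid
        ),
    }


for name, url in build_api_urls(DETAIL_URL, PAGE_SOURCE).items():
    print(name, url)
```

Raising an error when no cid is found makes a changed page layout visible immediately, instead of silently producing URLs with an empty id.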
The spider code is as follows:

# -*- coding: utf-8 -*-
import json
import re
import time

import requests
import scrapy

from bilibili.items import BilibiliItem


class BlibiliSpider(scrapy.Spider):
    name = 'ganbei'
    # allowed_domains = ['www.bilibili.com']  # commented out, otherwise the later pages cannot be opened
    start_urls = ['https://www.bilibili.com/']

    def parse(self, response):
        listmain = response.xpath('//*[@id="reportFirst1"]/div[2]/div')[0:8]
        for each in listmain:
            item = {}
            url = ''.join(each.xpath('./div/a/@href').extract())
            urls = 'https:' + url
            video_name = ' '.join(each.xpath('./div/a/div/p[1]/text()').extract())
            item['title'] = video_name  # grab the title
            part_number = urls.split('/av')[1]
            cid = requests.get(url=urls).text  # page source, used to build the API URLs
            need_part = re.findall('","cid":(.*?),"', cid, re.S)
            need_part = ''.join(need_part)
            barrage_api = 'https://api.bilibili.com/x/v1/dm/list.so?oid=' + need_part
            collection_api = 'https://api.bilibili.com/x/web-interface/archive/' + 'stat?aid=' + part_number
            watching_url = ('https://api.bilibili.com/x/player.so?id=cid%3A' + need_part
                            + '&aid=' + part_number
                            + '&buvid=D7512C54-9EB9-4D8A-ADF9-040A66C06A6C190950infoc')
            item['barrage_api'] = ''.join(barrage_api)
            item['watching_url'] = ''.join(watching_url)
            item['collection_api'] = ''.join(collection_api)
            yield scrapy.Request(url=item['collection_api'], callback=self.collection, meta={'item': item})

    def collection(self, response):
        item = response.meta['item']
        all_text = json.loads(response.text)  # parse the JSON
        detail_text = all_text.get('data')
        coins = detail_text.get('coin')
        favorite = detail_text.get('favorite')
        prise_number = detail_text.get('like')
        item['prise_number'] = prise_number
        item['coin_number'] = coins
        item['collection'] = favorite
        yield scrapy.Request(item['watching_url'], callback=self.watching, meta={'item': item})

    def watching(self, response):
        item = response.meta['item']
        body = response.body.decode()
        online = re.findall('<online_count>(.*?)</online_count>', body, re.S)
        online_people = ''.join(online)
        item['watching_people'] = online_people
        yield scrapy.Request(url=item['barrage_api'], callback=self.barrage, meta={'item': item})

    def barrage(self, response):
        item = response.meta['item']
        bang_list = response.xpath('/i/d')
        all_barrage = []  # collected here so it can be put into the item below
        for bang in bang_list:
            content = bang.xpath('./text()').extract()
            content = ''.join(content)
            time_base = bang.xpath('./@p').extract()
            time_base = ''.join(time_base)
            time_one = int(time_base.split(',')[4])
            time_is = time.localtime(time_one)
            end_finish_time = time.strftime('%Y-%m-%d %H:%M:%S', time_is)  # format the timestamp
            all_dm_content = str(end_finish_time) + content
            all_barrage.append(all_dm_content)
        item['barrage'] = ''.join(all_barrage)
        yield BilibiliItem(title=item['title'],
                           praise_number=item['prise_number'],
                           coin_count=item['coin_number'],
                           collection_number=item['collection'],
                           barrage=item['barrage'],
                           watching_people=item['watching_people'])

items.py

import scrapy


class BilibiliItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    watching_people = scrapy.Field()
    barrage = scrapy.Field()
    praise_number = scrapy.Field()
    coin_count = scrapy.Field()
    collection_number = scrapy.Field()

pipelines

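from pymongo import MongoClient

# Assumption: the four connection constants are defined in the project's settings.py.
from bilibili.settings import Mongoip, MongoPort, MongoDBname, MongoItem


class BilibiliPipeline(object):
    def process_item(self, item, spider):
        return item


class CrawldataToMongoPipline(object):
    def __init__(self):
        host = Mongoip
        port = MongoPort
        dbName = MongoDBname
        client = MongoClient(host=host, port=port)  # create the connection object
        db = client[dbName]  # select the database, dbName='datago306'
        self.post = db[MongoItem]  # select the collection, MongoItem='jobItem'

    def process_item(self, item, spider):
        job_info = dict(item)  # convert the item to a dict
        self.post.insert_one(job_info)  # write to MongoDB; insert() was removed in PyMongo 4
        return item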
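The pipeline refers to Mongoip, MongoPort, MongoDBname and MongoItem without showing where they come from, and it only runs if it is registered in the project settings. A possible settings.py excerpt is sketched below. The host is an assumption (a local MongoDB instance), the database and collection names are the ones given in the pipeline's comments, and the module path assumes the pipelines live in bilibili/pipelines.py.

```python
# settings.py (excerpt)

# Register the MongoDB pipeline; the number sets its order among
# enabled pipelines (lower values run first).
ITEM_PIPELINES = {
    "bilibili.pipelines.CrawldataToMongoPipline": 300,
}

# Connection constants read by the pipeline.
Mongoip = "127.0.0.1"      # assumption: MongoDB running locally
MongoPort = 27017          # MongoDB's default port
MongoDBname = "datago306"  # database name from the pipeline comments
MongoItem = "jobItem"      # collection name from the pipeline comments
```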
