How to Fetch and Save Your Bilibili Watch History with Python

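Why save your Bilibili watch history? Because Bilibili keeps history for at most 3 months and automatically clears anything older. So I wrote a script that exports the history and stores it in a database: first, so that I can find a video again later if I need it, and second, so that I can categorize and annotate these videos in my own way.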

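Bilibili's history page is backed by a Web API: the backend delivers the data to the front end, and the two are decoupled. As you scroll down in the browser, the page issues HTTP requests on the fly and renders the data in batches. So the first step is to learn how to inspect the Web API. Taking Chrome as an example: open Bilibili, click the "History" button in the top-right corner, press F12 to bring up the developer tools, and switch to the Network tab. Because the history is loaded dynamically, click "XHR" in the filter bar (XMLHttpRequest, which is really just an HTTP request). The list below then shows only XHR-related entries.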

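Click the second row (cursor***) and several tabs appear on the right. In the Headers tab, the important parts are the Request URL and the Cookie under Request Headers. Copy the cookie value into a text file, for example cookie.txt. By default Chrome shows no Copy item in the context menu here; triple-click to select the whole field, and the context menu will then offer Copy.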

Switch to the Preview tab and you can see that Bilibili returns 20 records per request. The Preview and Response tabs let us examine the structure of the returned data.
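Judging from the fields that the code later in this post reads, each response has roughly the shape shown below. This is a hand-written sample with placeholder values, not real API output, and `split_response` is a hypothetical helper that only shows how the record list and the cursor are pulled out of the decoded JSON.

```python
# Hand-written sample mirroring the fields used later in this post.
# Every value here is a placeholder, not real API output.
sample_response = {
    "data": {
        "cursor": {"max": 1001, "view_at": 1612000000, "business": "archive", "ps": 20},
        "list": [
            {
                "title": "Example video",
                "author_name": "Example uploader",
                "view_at": 1612000000,
                "progress": 120,
                "history": {"business": "archive", "bvid": "BV-placeholder", "oid": 1001, "page": 1},
            },
        ],
    },
}


def split_response(resp):
    """Return (records, cursor) from one decoded history response."""
    data = resp.get("data") or {}
    return data.get("list") or [], data.get("cursor") or {}


records, cursor = split_response(sample_response)
print(len(records), cursor["max"], cursor["view_at"])
```

Falling back to an empty list and an empty dict keeps the caller from crashing when a response carries no `data` at all, for instance when the cookie has expired.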

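Scroll further down the history and more dynamic requests appear on the left. Their URLs differ mainly in the max and view_at query parameters that follow cursor, and both values come from the previous response. Put another way, each server response carries, besides the current 20 history records, the max parameter (the largest target ID) and the view_at parameter (the viewing timestamp) to use in the next request. The screenshot below illustrates this: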

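Once all of the history has been returned, the cursor in the final response has max = 0, view_at = 0, and ps = 0. With this understanding, we can now fetch the history with Python code. This post implements two features: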

Fetch the history
Save it to a database
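The cursor protocol described above can be sketched independently of HTTP. In the minimal example below, `fetch_page` is a hypothetical stand-in for one request to the history endpoint, and `fake_fetch` simulates a server holding three records and serving two per page. Only the loop structure reflects what this post describes.

```python
def iterate_history(fetch_page):
    """Follow cursors until the server reports ps == 0.

    fetch_page(max_id, view_at) must return (records, next_cursor).
    """
    max_id, view_at = 0, 0  # the first request starts from an empty cursor
    while True:
        records, cursor = fetch_page(max_id, view_at)
        yield from records
        if cursor.get("ps", 0) == 0:  # no further pages
            break
        max_id, view_at = cursor["max"], cursor["view_at"]


# Simulated data, newest first (placeholder values).
FAKE_RECORDS = [
    {"oid": 3, "view_at": 300},
    {"oid": 2, "view_at": 200},
    {"oid": 1, "view_at": 100},
]


def fake_fetch(max_id, view_at):
    """Simulated server: returns two records older than view_at per call."""
    remaining = [r for r in FAKE_RECORDS if view_at == 0 or r["view_at"] < view_at]
    page = remaining[:2]
    if not page:
        return [], {"max": 0, "view_at": 0, "ps": 0}
    last = page[-1]
    return page, {"max": last["oid"], "view_at": last["view_at"], "ps": 2}


print([r["oid"] for r in iterate_history(fake_fetch)])  # [3, 2, 1]
```

Writing the loop as a generator lets the caller start processing records before the whole history has been downloaded.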

Fetching the Bilibili history

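First, write a function that returns JSON data from a URL. In this function the request headers serve two purposes: supplying the client's cookie and making the request look like it comes from a browser.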

import json

import requests


def get_response_json(url, req_headers):
    """Fetch the URL and return the response body parsed as JSON."""
    resp = requests.get(url, headers=req_headers)
    return json.loads(resp.text)

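We fetch the JSON data from Bilibili's history API. The client cookie mentioned above carries the login information and is sent in the request headers. The code is wrapped in a class, with the cookie and headers loaded in the __init__ method. The BiliHistory class exposes two public methods: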

get_all_history(): returns the entire watch history as a list, each element being a dict
save_db(): saves the history to an SQLite database (history.db3, hard-coded)

The complete code of the BiliHistory class is as follows:

import time

# get_response_json is the function defined above; is_url_exists and
# create_url_info are the database helpers shown in the database section.


class BiliHistory(object):
    def __init__(self, cookie_file):
        self.base_url = "https://api.bilibili.com/x/web-interface/history/cursor"
        self.cookie = self._get_cookie_content(cookie_file)
        self.request_headers = self._set_req_headers()

    def _get_cookie_content(self, cookie_file):
        """Read the cookie string from the cookie file."""
        with open(cookie_file, 'r') as fp:
            # Strip the trailing newline: it is not valid in a header value.
            cookies = fp.read().strip()
        return cookies

    def _set_req_headers(self):
        """Build the request headers: 1) mimic a browser; 2) supply the cookie."""
        headers = {
            "Accept": "*/*",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6",
            "Connection": "keep-alive",
            "Cookie": self.cookie,
            "Host": "api.bilibili.com",
            "User-Agent": (
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/88.0.4324.96 Safari/537.36 Edg/88.0.705.50"
            ),
        }
        return headers

    def _get_history(self, max, view_at, business):
        """Fetch up to 20 history records using the three query-string parameters.

        Returns:
            history_list: a list of dicts, one per history record
            next_cursor: the cursor information for the next request
        """
        url = self.base_url + f"?max={max}&view_at={view_at}&business={business}"
        resp = get_response_json(url, self.request_headers)
        history_list = resp.get("data").get("list")
        next_cursor = resp.get("data").get("cursor")
        return history_list, next_cursor

    def get_all_history(self):
        """Fetch the entire watch history."""
        histories = []
        max = 0
        view_at = 0
        business = ''
        ps = 20
        while ps != 0:  # ps == 0 means there are no more records
            history, cursor = self._get_history(max, view_at, business)
            max = cursor.get("max")
            view_at = cursor.get("view_at")
            business = cursor.get("business")
            ps = cursor.get("ps")
            # Guard against a final page that carries no list.
            histories.extend(history or [])
            time.sleep(0.1)
        return histories

    def save_db(self):
        """Save the history to the SQLite database."""
        histories = self.get_all_history()
        for item in histories:
            history = item.get("history")
            view_time = item.get("view_at")
            # Insert the record only if it is not in the database yet.
            if not is_url_exists(view_time):
                url_content = {
                    "title": item.get("title"),
                    "business": history.get("business"),
                    "bvid": history.get("bvid"),
                    "cid": history.get("cid"),
                    "epid": history.get("epid"),
                    "oid": history.get("oid"),
                    "page": history.get("page"),
                    "part": history.get("part"),
                    "dt": history.get("dt"),
                    "author_name": item.get("author_name"),
                    "videos": item.get("videos"),
                    "is_fav": item.get("is_fav"),
                    "tag_name": item.get("tag_name"),
                    "view_at": item.get("view_at"),
                    "progress": item.get("progress"),
                    "show_title": item.get("show_title"),
                    "cover": item.get("cover"),
                    "uri": item.get("uri"),
                }
                create_url_info(url_content)

Each request fetches 20 records, and the script pauses for 0.1 seconds after every HTTP request.
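One practical detail when reading the cookie file: a text file saved from an editor often ends with a newline, and line breaks are not valid inside an HTTP header value, so the request is likely to fail if the raw file content is used as is. A minimal sketch of a more defensive loader follows. `load_cookie` is a hypothetical helper name, not part of the original script, and the cookie value in the demonstration is made up.

```python
import os
import tempfile


def load_cookie(path):
    """Read a cookie string from a text file and strip surrounding whitespace."""
    with open(path, "r", encoding="utf-8") as fp:
        cookie = fp.read().strip()
    if not cookie:
        raise ValueError(f"{path} is empty")
    if "\n" in cookie or "\r" in cookie:
        raise ValueError(f"{path} should hold the cookie on a single line")
    return cookie


# Demonstration with a temporary file holding a made-up cookie value.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("name1=value1; name2=value2\n")
print(repr(load_cookie(tmp.name)))
os.remove(tmp.name)
```

Failing early with a clear message is easier to debug than a header error raised deep inside the HTTP library.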

Saving the data to a database

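The script uses the SQLAlchemy ORM to create and query records. This post does not cover how to use SQLAlchemy and only provides the database code. Readers who need an introduction can refer to my blog post: A Concise SQLAlchemy Tutorial (SQLAlchemy简明教程).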

The database code is as follows:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from model import BiliHistory

engine = create_engine("sqlite:///history.db3", echo=False)
session = sessionmaker(bind=engine)()


def is_url_exists(view_time):
    """Return True if a record with this view time already exists."""
    item = session.query(BiliHistory).filter(BiliHistory.view_at == view_time).first()
    return item is not None


def create_url_info(url):
    """Insert one history record built from the given dict."""
    record = BiliHistory(
        title=url.get("title"),
        business=url.get("business"),
        bvid=url.get("bvid"),
        cid=url.get("cid"),
        epid=url.get("epid"),
        oid=url.get("oid"),
        page=url.get("page"),
        part=url.get("part"),
        dt=url.get("dt"),
        author_name=url.get("author_name"),
        videos=url.get("videos"),
        is_fav=url.get("is_fav"),
        tag_name=url.get("tag_name"),
        view_at=url.get("view_at"),
        progress=url.get("progress"),
        show_title=url.get("show_title"),
        cover=url.get("cover"),
        uri=url.get("uri"),
    )
    session.add(record)
    session.commit()
    session.close()
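The `model` module imported above is not shown in this post. As a rough sketch only, it might look like the following, assuming SQLAlchemy 1.4 or later. The table name, the surrogate `id` key, and every column type are assumptions inferred from the field names, not the author's actual schema.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class BiliHistory(Base):
    """Hypothetical ORM model; all column types are guesses."""
    __tablename__ = "bili_history"  # assumed table name

    id = Column(Integer, primary_key=True, autoincrement=True)  # assumed key
    title = Column(String)
    business = Column(String)
    bvid = Column(String)
    cid = Column(Integer)
    epid = Column(Integer)
    oid = Column(Integer)
    page = Column(Integer)
    part = Column(String)
    dt = Column(Integer)
    author_name = Column(String)
    videos = Column(Integer)
    is_fav = Column(Integer)
    tag_name = Column(String)
    view_at = Column(Integer, index=True)  # queried by is_url_exists
    progress = Column(Integer)
    show_title = Column(String)
    cover = Column(String)
    uri = Column(String)


if __name__ == "__main__":
    # Create the table in an in-memory database to check that the model is valid.
    engine = create_engine("sqlite:///:memory:")
    Base.metadata.create_all(engine)
    session = sessionmaker(bind=engine)()
    session.add(BiliHistory(title="Example", view_at=1612000000))
    session.commit()
    print(session.query(BiliHistory).count())
```

Whatever the real schema looks like, the table has to exist before `save_db()` runs, for example by calling `Base.metadata.create_all(engine)` once against history.db3.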

Source code

github - bilibili history

References

Bilibili history API