Scraping NetEase Cloud Classroom (网易云课堂) Lesson Titles and Durations with Python + Selenium

Please credit the source when reposting: https://blog.csdn.net/jpch89/article/details/84142555

Contents

Scraping NetEase Cloud Classroom Lesson Titles and Durations with Python + Selenium
- Software Installation
- Target Page
- Code
  - Notes
  - study163seleniumff.py
  - helper.py
- Final Result

Software Installation

selenium
pip install selenium
lxml (imported by the main script for parsing)
pip install lxml
geckodriver
https://github.com/mozilla/geckodriver/releases/

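Target Page

https://study.163.com/course/introduction.htm?courseId=1006078212#/courseDetail?tab=1

I first fetched the page with an ordinary HTTP request, but the returned source contained no lesson information at all, which means the page loads its content dynamically with JavaScript.
The browser's developer tools show that the lesson details come from a request to the following address:
https://study.163.com/dwr/call/plaincall/PlanNewBean.getPlanCourseDetail.dwr?1542346982156
The preview pane shows the lesson information as Unicode escape sequences.
Requesting that address directly returns an error. Rather than work out which request headers and parameters it expects, I went straight to Selenium: only one page needs to be scraped, so performance does not matter.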
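
The escaped text seen in the preview pane can be turned back into readable characters with Python's built-in unicode_escape codec. This is a minimal sketch, assuming the response text uses \uXXXX escapes; the function name decode_unicode_escapes is hypothetical, and the sample string is made up for illustration rather than copied from the real response.

```python
def decode_unicode_escapes(escaped: str) -> str:
    """Turn literal \\uXXXX sequences into the characters they stand for."""
    # The input must be pure ASCII (backslash escapes only) for this round trip.
    return escaped.encode('ascii').decode('unicode_escape')


# Hypothetical sample: the escaped form of the two characters in "课时" (lesson).
sample = r'\u8bfe\u65f6'
decoded = decode_unicode_escapes(sample)
```

Text that already contains non-ASCII characters would fail at the encode step, so this only suits fully escaped input.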

Code

Notes

study163seleniumff.py is the main script.
helper.py is a helper module in the same directory as the main script.
geckodriver.exe must be placed under the relative path ../drivers/.

study163seleniumff.py

from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service
from lxml import etree
import csv
from helper import Chapter, Lesson

# Fetch the data
url = 'https://study.163.com/course/introduction.htm?courseId=1006078212#/courseDetail?tab=1'
options = Options()
options.add_argument('-headless')  # run Firefox without a visible window
# The original 2018 code passed executable_path= and firefox_options= straight to
# Firefox(); Selenium 4 takes a Service object and options= instead.
driver = Firefox(service=Service(executable_path='../drivers/geckodriver'), options=options)
driver.get(url)
text = driver.page_source
driver.quit()

# Parse the data
html = etree.HTML(text)
chapters = html.xpath('//div[@class="chapter"]')
# CSV column headers: chapter no., chapter name, lesson no., lesson name, lesson duration
TABLEHEAD = ['章节号', '章节名', '课时号', '课时名', '课时长']
rows = []
for chapter_node in chapters:
    chapter = Chapter(chapter_node)
    chapter_info = chapter.chapter_info
    for lesson_node in chapter.get_lessons():
        lesson = Lesson(lesson_node)
        lesson_info = lesson.lesson_info
        values = (*chapter_info, *lesson_info)
        row = dict(zip(TABLEHEAD, values))
        rows.append(row)

# Store the data
with open('courseinfo.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.DictWriter(f, TABLEHEAD)
    writer.writeheader()
    writer.writerows(rows)
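
The script reads driver.page_source right after driver.get(url), which only works if the lesson list has already been rendered by then. If the CSV comes out empty, one option is to poll until the chapter markup shows up. The following is a hand-rolled sketch and not part of the original script: wait_for_marker and its parameters are hypothetical names, and the marker string assumes the chapter nodes appear as class="chapter" in the serialized page source. Selenium also ships WebDriverWait for this purpose.

```python
import time


def wait_for_marker(get_source, marker, timeout=10.0, interval=0.5):
    """Poll get_source() until marker appears in it, or raise TimeoutError.

    get_source is any zero-argument callable returning the current page
    source as a string, for example ``lambda: driver.page_source``.
    """
    deadline = time.monotonic() + timeout
    while True:
        source = get_source()
        if marker in source:
            return source
        if time.monotonic() >= deadline:
            raise TimeoutError(f'{marker!r} not found within {timeout} s')
        time.sleep(interval)
```

In the main script, text = wait_for_marker(lambda: driver.page_source, 'class="chapter"') could stand in for text = driver.page_source, placed before driver.quit().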

helper.py

class Chapter:
    def __init__(self, chapter):
        self.chapter = chapter
        self._chapter_info = None

    def parse_all(self):
        # Chapter number
        chapter_num = self.chapter.xpath('.//span[contains(@class, "chaptertitle")]/text()')[0]
        # Strip the trailing colon from the chapter number
        chapter_num = chapter_num[:-1]
        # Chapter name
        chapter_name = self.chapter.xpath('.//span[contains(@class, "chaptername")]/text()')[0]
        return chapter_num, chapter_name

    @property
    def chapter_info(self):
        # Parse once and reuse the cached tuple on later accesses
        if self._chapter_info is None:
            self._chapter_info = self.parse_all()
        return self._chapter_info

    def get_lessons(self):
        return self.chapter.xpath('.//div[@data-lesson]')


class Lesson:
    def __init__(self, lesson):
        self.lesson = lesson
        self._lesson_info = None

    @property
    def lesson_info(self):
        if self._lesson_info is None:
            # Lesson number; contains(@class, "ks") also matches class names such as
            # "kstime", so [0] relies on the lesson-number span coming first
            lesson_num = self.lesson.xpath('.//span[contains(@class, "ks")]/text()')[0]
            # Lesson name
            lesson_name = self.lesson.xpath('.//span[@title]/@title')[0]
            # Lesson duration
            lesson_len = self.lesson.xpath('.//span[contains(@class, "kstime")]/text()')[0]
            self._lesson_info = lesson_num, lesson_name, lesson_len
        return self._lesson_info

Final Result

The result is saved as courseinfo.csv in the same directory as the main script.
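
The storage step can be tried without a browser. Below is a minimal sketch that writes to an in-memory buffer instead of a file; the two rows are made-up placeholders, not real course data, while the header names are the ones used in the script. The utf-8-sig encoding used by the script prepends a byte order mark, which helps spreadsheet programs such as Excel recognize the file as UTF-8.

```python
import csv
import io

TABLEHEAD = ['章节号', '章节名', '课时号', '课时名', '课时长']

# Made-up (chapter_info, lesson_info) pairs standing in for parsed page data.
parsed = [
    (('章节1', 'Intro'), ('课时1', 'Welcome', '05:00')),
    (('章节1', 'Intro'), ('课时2', 'Setup', '12:30')),
]

# Same row-building idea as the main script: zip headers with the flattened values.
rows = [dict(zip(TABLEHEAD, (*chapter_info, *lesson_info)))
        for chapter_info, lesson_info in parsed]

buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, TABLEHEAD)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Encoding with utf-8-sig prepends the UTF-8 byte order mark (EF BB BF).
csv_bytes = csv_text.encode('utf-8-sig')
```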

Completed on 2018.11.16
