Qiushibaike (糗事百科) Jokes + Images + Videos Scraper

import json

import requests
from bs4 import BeautifulSoup
from lxml import etree


class QiuShi(object):
    # Constructor: request headers and the list-page URL template
    def __init__(self):
        self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"}
        self.base_url = "https://www.qiushibaike.com/8hr/page/{}"

    # Request method: return the page HTML, or None on a non-200 response
    def get_html_text(self, url):
        response = requests.get(url, headers=self.headers)
        if response.status_code == 200:
            return response.text
        else:
            return None

    # Parse a list page and return the detail-page URLs
    def parse_list_page(self, text):
        html = etree.HTML(text)
        urls = html.xpath("//a[@class='recmd-content']/@href")
        urls = list(map(lambda u: "https://www.qiushibaike.com" + u, urls))
        return urls

    # Parse a detail page and return the scraped data
    def parse_detail_page(self, text):
        soup = BeautifulSoup(text, "lxml")
        item = {}
        author = soup.find("span", attrs={"class": "side-user-name"}).string
        # get_text() also works when the text is split by child tags such as
        # <br/>; .string returns None in that case
        content = soup.find("div", attrs={"class": "content"}).get_text(strip=True)
        video = soup.find("video")
        if video:
            video_url = "https:" + video.find("source").get("src")
        else:
            video_url = None
        images = soup.find("div", attrs={"class": "thumb"})
        if images:
            img_urls = [i.get("src") for i in images.find_all("img")]
            img_urls = list(map(lambda u: "http:" + u, img_urls))
        else:
            img_urls = None
        like_num = soup.find("i", attrs={"class": "number"}).string
        item["author"] = author
        item["content"] = content
        item["video_url"] = video_url
        item["img_urls"] = img_urls
        item["like_num"] = like_num
        return item

    # Save the images and the video of one item
    def save_imgAndVideo(self, item):
        img_urls = item["img_urls"]
        video_url = item["video_url"]
        i = 0  # running index so that each file of one author gets its own name
        if img_urls:
            for u in img_urls:
                with open("./data/糗事百科爬虫图片与视频数据/" + item["author"] + str(i) + ".jpg", "wb") as fp:
                    fp.write(requests.get(u, headers=self.headers).content)
                i += 1
            print("images saved locally...")
        if video_url:
            with open("./data/糗事百科爬虫图片与视频数据/" + item["author"] + str(i) + ".mp4", "wb") as fp:
                fp.write(requests.get(video_url, headers=self.headers).content)
            i += 1
            print("video saved locally...")

    # Save the scraped data: append the item to the JSON file, one object per line
    def save_item_toJson(self, item):
        with open("./data/糗事百科爬虫数据.json", "a", encoding="utf-8") as fp:
            json.dump(item, fp, ensure_ascii=False)
            fp.write("\n")
        print(item["author"] + ": saved to the local JSON file...")
        self.save_imgAndVideo(item)

    # Main method: crawl list pages 1 to 9 and every detail page they link to
    def run(self):
        for i in range(1, 10):
            text = self.get_html_text(self.base_url.format(i))
            if text is None:  # list-page request failed, skip this page
                continue
            detail_urls = self.parse_list_page(text)
            for url in detail_urls:
                text = self.get_html_text(url)
                if text is None:  # detail-page request failed, skip this post
                    continue
                item = self.parse_detail_page(text)
                self.save_item_toJson(item)


if __name__ == '__main__':
    qs = QiuShi()
    qs.run()

Before running the code, create the directory data/糗事百科爬虫图片与视频数据/ under the current working directory; images and videos are saved there.
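The directory can also be created from Python instead of by hand. This is a minimal sketch that assumes the path used in the script; `ensure_data_dir` and `MEDIA_DIR` are hypothetical names and not part of the original code.

```python
import os

# Directory the scraper writes images and videos to (path taken from the script).
MEDIA_DIR = "./data/糗事百科爬虫图片与视频数据/"


def ensure_data_dir(path=MEDIA_DIR):
    """Create the directory and any missing parents if they do not exist yet."""
    # exist_ok=True makes the call safe to repeat on later runs.
    os.makedirs(path, exist_ok=True)
    return path
```

Calling `ensure_data_dir()` once before `qs.run()` should be enough. Because missing parents are created as well, the `data/` directory that holds the JSON file is created at the same time.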
Console output of a run: (screenshot not included)
Saved data: (screenshot not included)
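Because `save_item_toJson` appends one JSON object per line, the resulting file is not a single JSON document, and calling `json.load` on the whole file fails once it holds more than one item. The sketch below reads it back line by line. It assumes the file path from the script; `load_items` is a hypothetical helper name and not part of the original code.

```python
import json


def load_items(path="./data/糗事百科爬虫数据.json"):
    """Read the file written by the scraper: one JSON object per line."""
    items = []
    with open(path, encoding="utf-8") as fp:
        for line in fp:
            line = line.strip()
            if line:  # skip blank lines
                items.append(json.loads(line))
    return items
```

With this helper, `len(load_items())` gives the number of posts saved so far.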
