Versailles literature (凡尔赛文学) has gone viral. This peculiar internet genre, found mostly on WeChat Moments and Weibo, shows off wealth or a perfect relationship while pretending not to, all in a studiedly calm tone.
Ordinary bragging amounts to posting sports-car photos on social media or letting a designer bag's logo slip "accidentally" into frame, but Versailles literature is never that direct. A Weibo blogger even made a tutorial video on Versailles literature, breaking down its three essential elements.

On Douban, there is likewise a group called the Versailles Studies Research Group, whose members define Versailles as the spirit of performing an upper-class life. Enough background; on to the main topic: today we'll quickly scrape the Zhihu answers about Versailles quotes. Let's begin.

1. The page to scrape

Searching Zhihu for 凡尔赛语录 (Versailles quotes), the second result fits best, so that's the one we'll use.

Clicking through, we can see the question has 393 answers in total.

URL: https://www.zhihu.com/question/429548386/answer/1575062220

Strip off answer and everything after it and you get the question URL we want to crawl: https://www.zhihu.com/question/429548386. The trailing string of digits is the question id, which uniquely identifies a Zhihu question.
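If you'd rather not copy the id out by hand, a regular expression can pull it from either form of the URL. A small sketch (the helper name extract_question_id is my own, not part of the script below):

import re

def extract_question_id(url):
    """Return the numeric question id from a Zhihu question/answer URL."""
    match = re.search(r'/question/(\d+)', url)
    return match.group(1) if match else None

print(extract_question_id("https://www.zhihu.com/question/429548386/answer/1575062220"))  # 429548386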

2. Scraping the question's answers

Looking at the URL above, we find two kinds of data to scrape:

  1. The question details: creation time, follower count, view count, the question description, and so on
  2. The answers: each answerer's username and follower count, plus the answer body, publication time, comment count, upvote count, and other fields

The question details can be scraped straight from the URL above by parsing the page with bs4, while the answers have to come through the API link below, where the starting index and per-page offset determine what comes back, much like crawling ordinary paginated content.
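Here is a minimal sketch of the bs4 half (the full version in section 3 rotates the User-Agent and extracts more fields; the class names come from Zhihu's markup at the time of writing, so treat them as assumptions that may have changed since):

import requests
from bs4 import BeautifulSoup

url = "https://www.zhihu.com/question/429548386"
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
soup = BeautifulSoup(response.text, 'lxml')

# Class names reflect Zhihu's page structure at the time of writing
title = soup.find("h1", {"class": "QuestionHeader-title"}).text
numbers = soup.find_all("strong", {"class": "NumberBoard-itemValue"})
follower, watched = numbers[0].text, numbers[1].text
print(title, follower, watched)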

def init_url(question_id, limit, offset):
    base_url_start = "https://www.zhihu.com/api/v4/questions/"
    base_url_end = "/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics&limit={0}&offset={1}".format(limit, offset)
    return base_url_start + question_id + base_url_end

Set the number of answers per page with limit=20; offset then takes the values 0, 20, 40, and so on, and question_id is the string of digits mentioned above, here 429548386. Once that logic is clear, the rest is just writing the crawler to fetch the data. The complete code is in the next section; to run it, you only need to change the question id.
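Before the full script, a quick sanity check of the paging logic using init_url from above (only the limit and offset parts of the printed URLs change from page to page):

# Print the API URLs for the first three pages of answers
question_id = '429548386'
for page in range(3):
    print(init_url(question_id, 20, page * 20))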

3. Complete code

# Import the required libraries
import json
import re
import time
from datetime import datetime
from time import sleep
import pandas as pd
import numpy as np
import warnings
import requests
from bs4 import BeautifulSoup
import random
warnings.filterwarnings('ignore')


def get_ua():
    """Pick a random User-Agent from the UA pool.

    :return: a random UA string from the pool
    """
    ua_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60",
        "Opera/8.0 (Windows NT 5.1; U; en)",
        "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",
        "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
        "Mozilla/5.0 (Windows; U; Windows NT 5.2) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.27 Safari/525.13",
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Macintosh; U; IntelMac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1Safari/534.50",
        "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
    return random.choice(ua_list)


def filter_emoji(text):
    """Strip emoji characters from text.

    @param text:
    @return:
    """
    try:
        co = re.compile(u'[\U00010000-\U0010ffff]')
    except re.error:
        co = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
    text = co.sub('', text)
    return text


def get_question_base_info(url):
    """Fetch the question's detailed description and stats.

    @param url:
    @return:
    """
    response = requests.get(url=url, headers={'User-Agent': get_ua()}, timeout=10)
    # Fetch the page and parse it
    soup = BeautifulSoup(response.text, 'lxml')
    # Question title
    title = soup.find("h1", {"class": "QuestionHeader-title"}).text
    # Question description
    question = ''
    try:
        question = soup.find("div", {"class": "QuestionRichText--collapsed"}).text.replace('\u200b', '')
    except Exception as e:
        print(e)
    # Followers
    follower = int(soup.find_all("strong", {"class": "NumberBoard-itemValue"})[0].text.strip().replace(",", ""))
    # Views
    watched = int(soup.find_all("strong", {"class": "NumberBoard-itemValue"})[1].text.strip().replace(",", ""))
    # Number of answers
    answer_str = soup.find_all("h4", {"class": "List-headerText"})[0].span.text.strip()
    # Extract the number from the "xxx answers" string: [regex] leading run of digits
    answer_count = int(re.findall(r'\d*', answer_str)[0])
    # Question tags
    tag_list = []
    tags = soup.find_all("div", {"class": "QuestionTopic"})
    for tag in tags:
        tag_list.append(tag.text)
    return title, question, follower, watched, answer_count, tag_list


def init_url(question_id, limit, offset):
    """Build the url for each page of answers.

    @param question_id:
    @param limit:
    @param offset:
    @return:
    """
    base_url_start = "https://www.zhihu.com/api/v4/questions/"
    base_url_end = "/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed" \
                   "%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by" \
                   "%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count" \
                   "%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info" \
                   "%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting" \
                   "%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B" \
                   "%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics" \
                   "&limit={0}&offset={1}".format(limit, offset)
    return base_url_start + question_id + base_url_end


def get_time_str(timestamp):
    """Convert a unix timestamp to a standard date string.

    @param timestamp:
    @return:
    """
    datetime_str = ''
    try:
        # Unix timestamp -> datetime object
        datetime_time = datetime.fromtimestamp(timestamp)
        # datetime object -> date string
        datetime_str = datetime_time.strftime("%Y-%m-%d %H:%M:%S")
    except Exception as e:
        print(e)
        print("Date conversion error")
    return datetime_str


def get_answer_info(url, index):
    """Parse one page of answers.

    @param url:
    @param index:
    @return:
    """
    response = requests.get(url=url, headers={'User-Agent': get_ua()}, timeout=10)
    text = response.text.replace('\u200b', '')
    per_answer_list = []
    try:
        question_json = json.loads(text)
        # Answers on the current page
        print("Crawling answer page {0}: got {1} answers on this page".format(index + 1, len(question_json["data"])))
        for data in question_json["data"]:
            # Question-level fields: type, id, question type, creation and update time
            question_type = data["question"]['type']
            question_id = data["question"]['id']
            question_question_type = data["question"]['question_type']
            question_created = get_time_str(data["question"]['created'])
            question_updated_time = get_time_str(data["question"]['updated_time'])
            # Author-level fields: username, headline, gender, follower count
            author_name = data["author"]['name']
            author_headline = data["author"]['headline']
            author_gender = data["author"]['gender']
            author_follower_count = data["author"]['follower_count']
            # Answer-level fields: id, creation/update time, upvotes, comments, content
            answer_id = data['id']
            created_time = get_time_str(data["created_time"])
            updated_time = get_time_str(data["updated_time"])
            voteup_count = data["voteup_count"]
            comment_count = data["comment_count"]
            content = data["content"]
            per_answer_list.append([question_type, question_id, question_question_type, question_created,
                                    question_updated_time, author_name, author_headline, author_gender,
                                    author_follower_count, answer_id, created_time, updated_time,
                                    voteup_count, comment_count, content])
    except Exception:
        print("JSON parsing error")
    answer_column = ['问题类型', '问题id', '问题提问类型', '问题创建时间', '问题更新时间',
                     '答主用户名', '答主签名', '答主性别', '答主粉丝数',
                     '答案id', '答案创建时间', '答案更新时间', '答案赞同数', '答案评论数', '答案具体内容']
    per_answer_data = pd.DataFrame(per_answer_list, columns=answer_column)
    return per_answer_data


if __name__ == '__main__':
    # question_id = '424516487'
    question_id = '429548386'
    url = "https://www.zhihu.com/question/" + question_id

    # Get the question details
    title, question, follower, watched, answer_count, tag_list = get_question_base_info(url)
    print("Question url: " + url)
    print("Question title: " + title)
    print("Question description: " + question)
    print("Question tags: " + ', '.join(tag_list))
    print("The question has {0} followers and has been viewed {1} times".format(follower, watched))
    print("As of {}, the question has {} answers".format(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()), answer_count))

    # Get the answers, page by page
    limit, offset = 20, 0
    page_cnt = int(answer_count / limit) + 1
    answer_data = pd.DataFrame()
    for page_index in range(page_cnt):
        # Build the url for this page and fetch the data
        answer_url = init_url(question_id, limit, offset + page_index * limit)
        data_per_page = get_answer_info(answer_url, page_index)
        # DataFrame.append was removed in pandas 2.x; pd.concat is the drop-in equivalent
        answer_data = pd.concat([answer_data, data_per_page], ignore_index=True)
        sleep(3)
    print("\nCrawling finished, data saved!")
    answer_data.to_csv('凡尔赛沙雕语录_{0}.csv'.format(question_id), encoding='utf-8', index=False)

4. Results

In total, 393 answers were scraped. One thing to note: the file is saved as UTF-8, so if it looks garbled when you read it, first check that the encoding you read with matches; see the sketch below.
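If you read the file back with pandas, passing the encoding explicitly avoids the mojibake; a minimal sketch (the filename matches what the script above writes out):

import pandas as pd

# Read the exported file back; specify the encoding to match how it was written
df = pd.read_csv('凡尔赛沙雕语录_429548386.csv', encoding='utf-8')
print(df.shape)  # (number of answers scraped, 15 columns)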

A partial screenshot of the scraped results:


Thanks for reading this far. For more Python content, follow me and check out my profile; your likes, bookmarks, and comments are what keep me updating. Thank you.

