Environment setup
Code design
Usage and results

I. Environment setup

1. Software versions

Python 3.7.4

Anaconda3

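2. Problems during setup

Configuring the Anaconda environment variable
Problem: Anaconda was not on the system PATH, so third-party libraries installed with pip did not end up in the location where this Python installation could find and use them.
Solution: add the Anaconda path to the system environment variables.
pip network problems
Problem: pip could not update itself or download Python libraries. It looked like a slow connection, but the warning pip printed says that Python's ssl module was unavailable:
WARNING: pip is configured with locations that require TLS/SSL, however the ssl module in Python is not available.
Solution: install with pip install xxx -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com, which downloads from the Douban mirror over plain HTTP and therefore does not need TLS.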
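
Both problems above can be narrowed down from inside Python. The following is a minimal diagnostic sketch and not part of the original program; the function name `describe_environment` is hypothetical. It reports which interpreter is running, which `pip` executable comes first on PATH, and whether the `ssl` module can be imported (pip prints the warning quoted above when it cannot).

```python
import shutil
import sys


def describe_environment():
    """Collect basic facts about the active Python installation."""
    try:
        import ssl
        ssl_version = ssl.OPENSSL_VERSION
    except ImportError:
        # This is the situation the pip TLS/SSL warning refers to.
        ssl_version = None
    return {
        "python": sys.executable,             # interpreter currently running
        "version": sys.version.split()[0],    # e.g. "3.7.4"
        "pip_on_path": shutil.which("pip"),   # None if no pip is on PATH
        "ssl": ssl_version,                   # None if ssl cannot be imported
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If `pip_on_path` points outside the Anaconda directory, or `ssl` comes back as `None`, the PATH setting is the first thing to check.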

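II. Code design

1. Getting the cookie

After reading a lot of material online, I implemented a simulated login that obtains a cookie. However, the cookie obtained that way still cannot be used to connect to Weibo and fetch data, so the cookie used here was copied by hand after logging in to Weibo in a browser.
(Simulated-login methods are widely documented online, so they are not covered here.)

Target site: Weibo, www.weibo.cn
Getting the cookie: log in to www.weibo.cn with your username and password, then press F12 to open the developer tools. In the Network tab, find the request named weibo.cn and view its cookies.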
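
The value copied from the browser is a single string of `name=value` pairs separated by semicolons. As a rough sketch (the helper name `parse_cookie_string` and the sample values are made up for illustration), the string can be split into a dictionary to check that nothing was truncated while copying:

```python
def parse_cookie_string(raw):
    """Split a 'k1=v1; k2=v2' Cookie header value into a dict."""
    cookies = {}
    for part in raw.split(";"):
        part = part.strip()
        if not part:
            continue
        # Split on the first '=' only: cookie values may contain '=' themselves.
        name, _, value = part.partition("=")
        cookies[name] = value
    return cookies


if __name__ == "__main__":
    # Placeholder values, not a real session.
    sample = "_T_WM=abc123; SUB=xyz==; SUHB=0jB"
    for name, value in parse_cookie_string(sample).items():
        print(name, "=", value)
```

A dictionary like this can also be handed to `requests.get` through its `cookies` parameter; the program in this post sends the raw string in a `Cookie` header instead.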

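2. Scraping the data

- Scraping text

This function extracts the text of the comments; part of the handling of redundant characters is omitted.

- Scraping images

This function collects image links; the tags it filters on differ depending on whether the image belongs to a post or to a comment.

- Scraping emoticons

This function collects the emoticons used in Weibo comments.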
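
The three extraction steps can be illustrated together on a small piece of HTML. The sketch below is not the code from this post: it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages, and the names `CommentParser`, `parse_comments` and `SAMPLE_HTML` are hypothetical. The sample markup is hand-written to resemble what the functions expect (comment bodies in `<span class="ctt">`, picture links labelled "评论配图", emoticons as `<img>` tags); real pages may differ.

```python
from html.parser import HTMLParser


class CommentParser(HTMLParser):
    """Collect text, picture links and emoticon URLs from <span class="ctt">."""

    def __init__(self):
        super().__init__()
        self.texts = []        # one string per comment span
        self.images = []       # href of links labelled "评论配图"
        self.emoticons = []    # src of <img> tags inside a comment span
        self._depth = 0        # > 0 while inside a comment span
        self._text = []        # text pieces of the current comment
        self._in_link = False  # True while inside an <a> in a comment
        self._href = None
        self._link_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span":
            if self._depth:
                self._depth += 1          # nested span inside a comment
            elif attrs.get("class") == "ctt":
                self._depth, self._text = 1, []
        elif self._depth and tag == "a":
            self._in_link, self._href, self._link_text = True, attrs.get("href"), []
        elif self._depth and tag == "img" and attrs.get("src"):
            self.emoticons.append(attrs["src"])

    def handle_endtag(self, tag):
        if not self._depth:
            return
        if tag == "a" and self._in_link:
            label = "".join(self._link_text).strip()
            if label == "评论配图" and self._href:
                self.images.append(self._href)
            self._in_link = False
        elif tag == "span":
            self._depth -= 1
            if self._depth == 0:
                self.texts.append("".join(self._text).strip())

    def handle_data(self, data):
        if not self._depth:
            return
        # Link text (mentions, picture labels) is kept out of the comment text.
        (self._link_text if self._in_link else self._text).append(data)


def parse_comments(html):
    """Return (texts, image_links, emoticon_urls) found in the HTML."""
    parser = CommentParser()
    parser.feed(html)
    parser.close()
    return parser.texts, parser.images, parser.emoticons


SAMPLE_HTML = (
    '<div><span class="ctt">好看 <img src="//example.com/face.png" alt="[笑]"/>'
    '<a href="https://example.com/pic/1">评论配图</a></span>'
    '<span class="ctt"><a href="/u/1">@someone</a> 同意</span></div>'
)

if __name__ == "__main__":
    texts, images, emoticons = parse_comments(SAMPLE_HTML)
    print(texts, images, emoticons)
```

The emoticon address in the sample is protocol-relative (it starts with `//`), which is why the program in this post prefixes `http:` before downloading.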

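III. Usage and results

1. Usage

Replace the cookie with your own.

Change the comment URL to that of the post whose comments you want to scrape; the URL must end with "page=".

In the same directory as the program, create two folders named 评论图片 (comment images) and 评论表情 (comment emoticons). (The code uses relative paths, so the paths in the code do not need to be changed.)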
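
The last two steps can also be done in code. This is a small sketch under stated assumptions rather than part of the original program: the helper names `build_page_urls` and `ensure_output_dirs` are hypothetical, and creating the folders automatically is an alternative to making them by hand.

```python
import os


def build_page_urls(base_url, first_page, last_page):
    """Return one URL per page; base_url must end with 'page='."""
    if not base_url.endswith("page="):
        raise ValueError("base_url must end with 'page='")
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return [base_url + str(n) for n in range(first_page, last_page + 1)]


def ensure_output_dirs(root="."):
    """Create the two output folders under root if they are missing."""
    paths = [os.path.join(root, name) for name in ("评论图片", "评论表情")]
    for path in paths:
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return paths


if __name__ == "__main__":
    base = "https://weibo.cn/comment/hot/JohBbtrnO?rl=1&page="
    for url in build_page_urls(base, 1, 3):
        print(url)
```

Checking the "page=" suffix up front catches a mis-pasted URL before any request is sent.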

2. Results

Run the program and enter the first and last page numbers to scrape.
The corresponding content is then scraped.
Source code:

"""
Environment: Python 3.7.4
Purpose: scrape the text, images and emoticons of Weibo comments
Last modified: 2020.10.27
@author: Wenwen
"""
import csv
import os
import re
import time
import urllib.request

import requests
from bs4 import BeautifulSoup

num = 1  # running index of the comments saved so far


# Request function: fetch the full HTML of one page
def get_one_page(url):
    # Request headers, including User-agent and Cookie (replace the Cookie with your own)
    headers = {
        'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36',
        'Host': 'weibo.cn',
        'Accept': 'application/json, text/plain, */*',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        # 'br' is left out: requests only decodes Brotli when an extra package is installed
        'Accept-Encoding': 'gzip, deflate',
        'Cookie': '_T_WM=7c42b73c4c9cfa6fc4ff7350804c4504; SUB=_2A25yR6oNDeRhGedO41AZ8CzJyz2IHXVRyzZFrDV6PUJbktAKLXfakW1Nm4uHzxmmnLSVywkxToubmHotH6MGAF_Y; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9W5cHZZGnYY4KoxQoDxu1c2T5NHD95QpehnE1h5ESK5pWs4DqcjHIsyQi--Ri-zciKnfi--RiK.7iKyh; SUHB=0jBFua3v6DW2fH; SSOLoginState=1598282334',
        'DNT': '1',
        'Connection': 'keep-alive',
    }
    # Fetch the page with requests.get (verify=False skips TLS certificate checks)
    response = requests.get(url, headers=headers, verify=False)
    if response.status_code == 200:  # 200 means the request succeeded
        return response.text  # HTML document, passed on to the parsing functions
    return None


# Extract the text of the comments
def comment_page(html):
    global num
    print("No.", "Content")
    pattern = re.compile('<span class="ctt">.*?</span>', re.S)
    items = re.findall(pattern, html)
    re_results = re.findall(">.*?<", str(items), re.S)
    # Fragments containing any of these markers are markup residue, not comment text
    skip_markers = (">']['<", ">:<", "<a href", ">回复<", ">评论配图<",
                    "><", ">', '<", "@", "> <")
    first_results = [r for r in re_results
                     if not any(marker in r for marker in skip_markers)]
    datalist = []
    for last_result in first_results:
        text = last_result.replace(">", "").replace("<", "")
        datalist.append([num, text])
        print(num, text)
        with open('微博评论内容.txt', 'a', encoding='utf-8') as fp:
            fp.write(text + '\n')  # one comment per line
        num += 1
    return datalist


# Collect the links of pictures attached to comments
def comment_img(html):
    soup = BeautifulSoup(html, 'lxml')
    spans = soup.find_all('span', {'class': 'ctt'})  # all comment bodies
    img = []
    for span in spans:
        for a in span.find_all('a', string="评论配图"):
            link = a.get('href')
            if link:
                img.append(link)
    return img


# Collect the URLs of emoticons used in comments
def comment_emotion(html):
    soup = BeautifulSoup(html, 'lxml')
    spans = soup.find_all('span', {'class': 'ctt'})  # all comment bodies
    emotion = []
    for span in spans:
        for tag in span.find_all('img'):
            link = tag.get('src')
            if link:
                emotion.append('http:' + link)
    return emotion


# Scrape pages ori..end (inclusive) and save everything
def comment(ori, end):
    alllist, imglist, emotionlist = [], [], []
    for i in range(ori, end + 1):
        url = "https://weibo.cn/comment/hot/JohBbtrnO?rl=1&page=" + str(i)
        print('Scraping page %d of the comments' % i)
        html = get_one_page(url)
        if html is None:
            # The request failed; skip the page instead of crashing in the parsers
            print('Page %d could not be fetched, skipping it' % i)
        else:
            alllist += comment_page(html)
            imglist += comment_img(html)
            emotionlist += comment_emotion(html)
        time.sleep(3)  # pause between requests
    print('Finished: %d pages requested' % (end - ori + 1))
    print('Saving comments to 微博评论内容.csv ...')
    store_text(alllist)
    print('Saving comment images ...')
    store_img(imglist)
    print('Saving comment emoticons ...')
    store_emotion(emotionlist)
    print('All saved!')


# Save the comment text
def store_text(rows):
    with open('微博评论内容.csv', 'w', encoding='utf-8-sig', newline='') as f:
        csv.writer(f).writerows(rows)


# Save the comment images
def store_img(urls):
    for j, imgurl in enumerate(urls):
        urllib.request.urlretrieve(imgurl, os.path.join('评论图片', '%s.jpg' % j))


# Save the comment emoticons
def store_emotion(urls):
    for k, imgurl in enumerate(urls):
        urllib.request.urlretrieve(imgurl, os.path.join('评论表情', '%s.jpg' % k))


if __name__ == "__main__":
    try:
        ori = int(input("Enter the first page to scrape: "))
        end = int(input("Enter the last page to scrape: "))
    except ValueError:
        ori = end = 0  # non-numeric input falls through to the error message
    if 1 <= ori <= end:
        print("Comments on pages %d to %d will be scraped" % (ori, end))
        comment(ori, end)
    else:
        print("Please enter valid page numbers!")
