Data Collection with a Python Crawler on the Weibo Platform

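I. Setting Up the Environment
- 1. Software versions
- 2. Environment setup problems
II. Code Design
- 1. Getting the cookie
- 2. Scraping the data
  - - Scraping text
  - - Scraping images
  - - Scraping emoticons
III. Usage and Results
- 1. Usage
- 2. Results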

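This is my first article on CSDN. It offers one approach to scraping data from Weibo; questions and suggestions are welcome. The program scrapes the comment content (text, images, and emoticons) of a specified user's Weibo posts. I cover it in three parts:

Setting up the environment
Code design
Usage and results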

I. Setting Up the Environment

1. Software versions

Python 3.7.4 - https://www.python.org/downloads/
Anaconda3 - https://www.anaconda.com/products/individual#macos

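2. Environment setup problems

Configuring the Anaconda environment variable
Problem: Anaconda was not on the system PATH, so third-party Python libraries installed with pip did not end up in a location where they could be used.
Solution: add the Anaconda paths to the system environment variables.
pip network problems
Problem: I attributed this to a slow connection: pip could not upgrade itself or download Python libraries, and printed the following warning.
WARNING: pip is configured with locations that require TLS/SSL,however the ssl module in Python is not available.
Note that the warning itself says Python cannot load its ssl module, which is often a symptom of the PATH problem above rather than of network speed alone.
Solution: change the command to pip install xxx -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com, which downloads from the Douban mirror over plain HTTP.
Tutorial: https://www.cnblogs.com/singsong-ss/p/11857808.html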
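
Before switching mirrors, it can help to check whether Python can load its ssl module at all, since that is what the pip warning complains about. The snippet below is a minimal sketch of such a check; the helper name ssl_status is my own and is not part of the original program.

```python
def ssl_status():
    """Return the OpenSSL version string, or None if ssl cannot be imported."""
    try:
        import ssl
    except ImportError:
        return None
    return ssl.OPENSSL_VERSION


status = ssl_status()
if status is None:
    # Typical hint that the interpreter cannot find its OpenSSL libraries
    print("ssl module unavailable: check that Anaconda's directories are on PATH")
else:
    print("ssl module available:", status)
```

If the check reports that ssl is unavailable, fixing the environment variables is probably the more durable solution; the HTTP mirror only works around the missing TLS support.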

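II. Code Design

1. Getting the cookie

After reading a lot of material online, I implemented a simulated login that obtains a cookie. However, the cookie obtained that way still cannot be used to connect to Weibo and fetch data, so the cookie used here comes from logging in to Weibo manually.
(Simulated-login methods for getting a cookie are widely documented online, so I do not describe them here.)

Target site: Weibo, www.weibo.cn
Getting the cookie: log in to www.weibo.cn with your username and password, then press F12 to open the browser's developer tools. In the Network tab, find the request named weibo.cn and click it to view its cookies.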
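
The cookie copied from the developer tools is one long "name=value; name=value" string. As a rough sketch (the helper name parse_cookie_string and the sample values are placeholders of mine, not real credentials), it can be split into a dict to check that the expected fields such as SUB are present:

```python
def parse_cookie_string(raw):
    """Split a 'k1=v1; k2=v2' cookie string into a dict."""
    cookies = {}
    for part in raw.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue  # skip empty or malformed pieces
        name, value = part.split("=", 1)  # values may themselves contain '='
        cookies[name.strip()] = value.strip()
    return cookies


# Placeholder values for illustration only
sample = "_T_WM=abc123; SUB=_2A25xyz; SUHB=0jBF"
parsed = parse_cookie_string(sample)
print(sorted(parsed))
```

The resulting dict can be passed to requests.get(url, cookies=parsed), or the raw string can be placed in the Cookie request header as the program in this article does.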

2. Scraping the data

- Scraping text

This function scrapes the text of the comments. Some of the handling of redundant characters is omitted here.
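
To illustrate the idea on a small scale, here is a simplified sketch that pulls the text out of span elements with class "ctt" and strips any tags inside them. It is not the filtering used in the full program (which discards replies and @mentions); the sample HTML is made up, and the function name extract_comment_texts is my own.

```python
import re
from html import unescape

CTT_SPAN = re.compile(r'<span class="ctt">(.*?)</span>', re.S)
TAG = re.compile(r"<[^>]+>")


def extract_comment_texts(html):
    """Return the plain text of every <span class="ctt"> in the page."""
    texts = []
    for inner in CTT_SPAN.findall(html):
        text = unescape(TAG.sub("", inner)).strip()  # drop tags, decode entities
        if text:
            texts.append(text)
    return texts


# Made-up sample markup, assumed to resemble a comment page
sample_html = (
    '<div><span class="ctt">Nice photo!</span></div>'
    '<div><span class="ctt">回复<a href="/u/1">@someone</a>:thanks &amp; bye</span></div>'
    '<div><span class="ctt"><img src="//h5.sinaimg.cn/face.png" /></span></div>'
)
print(extract_comment_texts(sample_html))
```

The third span contains only an image, so it yields no text and is skipped.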

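- Scraping images

This function scrapes images. Post images and comment images are marked up differently, so the tag filter differs between the two cases.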
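
For comment images, the filter looks for links whose text is "评论配图" (comment image) inside the comment spans. The sketch below shows that selection using the standard library's html.parser in place of BeautifulSoup, so it runs without extra packages; the class name, the sample markup, and the sample URL are assumptions of mine.

```python
from html.parser import HTMLParser


class CommentImageLinks(HTMLParser):
    """Collect hrefs of <a> tags with a given text inside <span class="ctt">."""

    def __init__(self, link_text="评论配图"):
        super().__init__()
        self.link_text = link_text
        self.span_depth = 0       # > 0 while inside a comment span
        self.current_href = None  # href of the <a> being read, if any
        self.current_text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span":
            if self.span_depth or attrs.get("class") == "ctt":
                self.span_depth += 1
        elif tag == "a" and self.span_depth:
            self.current_href = attrs.get("href")
            self.current_text = []

    def handle_endtag(self, tag):
        if tag == "span" and self.span_depth:
            self.span_depth -= 1
        elif tag == "a" and self.current_href is not None:
            if "".join(self.current_text).strip() == self.link_text:
                self.links.append(self.current_href)
            self.current_href = None

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)


def extract_comment_images(html):
    parser = CommentImageLinks()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample markup; the last link sits outside a comment span
sample_html = (
    '<span class="ctt">look <a href="https://weibo.cn/mblog/pic/abc">评论配图</a></span>'
    '<span class="ctt"><a href="/u/1">@someone</a> hi</span>'
    '<a href="/other">评论配图</a>'
)
print(extract_comment_images(sample_html))
```

Only the first link is kept: the second has different link text and the third is not inside a comment span.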

- Scraping emoticons

This function scrapes the emoticons in Weibo comments.
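
Emoticons appear as img tags inside the comment spans, and their src values may be protocol-relative (starting with "//"). The sketch below collects them with a regular expression and resolves each src against a base URL with urllib.parse.urljoin, which is a variation on prefixing "http:" by hand. The sample markup, the image paths, and the function name are assumptions for illustration.

```python
import re
from urllib.parse import urljoin

CTT_SPAN = re.compile(r'<span class="ctt">(.*?)</span>', re.S)
IMG_SRC = re.compile(r'<img[^>]*?\bsrc="([^"]+)"')


def extract_emoticon_urls(html, base_url="https://weibo.cn/"):
    """Return absolute URLs of all images inside <span class="ctt">."""
    urls = []
    for inner in CTT_SPAN.findall(html):
        for src in IMG_SRC.findall(inner):
            urls.append(urljoin(base_url, src))  # fills in the missing scheme
    return urls


# Made-up sample markup; the last image is outside a comment span
sample_html = (
    '<span class="ctt">haha<img alt="[笑]" src="//h5.sinaimg.cn/face/smile.png" /></span>'
    '<span class="ctt"><img src="http://example.com/a.png" /></span>'
    '<img src="/outside.png" />'
)
print(extract_emoticon_urls(sample_html))
```

With an https base URL, protocol-relative links resolve to https, while links that already carry a scheme are left unchanged.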

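III. Usage and Results

1. Usage

Replace the cookie in the code with your own.

Change the comment URL to that of the Weibo post whose comments you want to scrape; the URL must end with "page=".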

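In the same directory as the program, create two folders named 评论图片 (comment images) and 评论表情 (comment emoticons). (The code uses relative paths, so the paths in the code do not need to be changed.)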
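
The preparation steps above can also be scripted. The following is a small sketch, not part of the original program: page_urls appends page numbers to a URL ending in "page=", and ensure_output_dirs creates the two folders if they are missing. Both helper names are my own; the base URL is the one used in the source code at the end of this article, and the demo writes into a temporary directory rather than the program's folder.

```python
import os
import tempfile

OUTPUT_DIRS = ("评论图片", "评论表情")  # folder names the script expects


def page_urls(base_url, first_page, last_page):
    """Build one URL per page from a base URL that ends with 'page='."""
    if not base_url.endswith("page="):
        raise ValueError("the comment URL must end with 'page='")
    return [base_url + str(page) for page in range(first_page, last_page + 1)]


def ensure_output_dirs(base_dir="."):
    """Create the output folders under base_dir if they do not exist yet."""
    paths = []
    for name in OUTPUT_DIRS:
        path = os.path.join(base_dir, name)
        os.makedirs(path, exist_ok=True)
        paths.append(path)
    return paths


base = "https://weibo.cn/comment/hot/JohBbtrnO?rl=1&page="
print(page_urls(base, 1, 2))

with tempfile.TemporaryDirectory() as demo_dir:
    print(ensure_output_dirs(demo_dir))
```

Calling ensure_output_dirs() with no argument would create the folders next to wherever the script is run from.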

2. Results

Run the program and enter the first and last page numbers to scrape.
The corresponding content is then scraped and saved.
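
Since the program asks for two page numbers, it is worth validating them before any request is sent. Here is a minimal sketch of such a check; the function name parse_page_range and the rule that pages start at 1 are my assumptions.

```python
def parse_page_range(start_text, end_text):
    """Convert two user-entered strings into a valid (start, end) page pair."""
    try:
        start, end = int(start_text), int(end_text)
    except ValueError:
        raise ValueError("page numbers must be integers") from None
    if start < 1 or end < start:
        raise ValueError("need 1 <= first page <= last page")
    return start, end


print(parse_page_range("1", "3"))
```

In the script, the two arguments would come from input(), and a ValueError could be caught to ask the user again.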

Source code:

"""
Environment: Python 3.7.4
Purpose: scrape the text, images, and emoticons of Weibo comments
Last modified: 2020.10.27
@author: Wenwen
"""
import csv
import os
import re
import time
import urllib.request

import requests
from bs4 import BeautifulSoup

num = 1  # running index of the scraped comments


# Request function: fetch the full HTML of one page
def get_one_page(url):
    # Request headers, including User-agent and Cookie (replace the Cookie with your own)
    headers = {
        'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36',
        'Host': 'weibo.cn',
        'Accept': 'application/json, text/plain, */*',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        # 'br' is left out: requests only decodes Brotli when an extra package is installed
        'Accept-Encoding': 'gzip, deflate',
        'Cookie': ('_T_WM=7c42b73c4c9cfa6fc4ff7350804c4504; '
                   'SUB=_2A25yR6oNDeRhGedO41AZ8CzJyz2IHXVRyzZFrDV6PUJbktAKLXfakW1Nm4uHzxmmnLSVywkxToubmHotH6MGAF_Y; '
                   'SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9W5cHZZGnYY4KoxQoDxu1c2T5NHD95QpehnE1h5ESK5pWs4DqcjHIsyQi--Ri-zciKnfi--RiK.7iKyh; '
                   'SUHB=0jBFua3v6DW2fH; SSOLoginState=1598282334'),
        'DNT': '1',
        'Connection': 'keep-alive',
    }
    # Fetch the page HTML with requests.get
    response = requests.get(url, headers=headers, verify=False)
    if response.status_code == 200:  # 200 means the request succeeded
        return response.text  # the HTML document, passed on to the parsing functions
    return None


# Scrape the text of the Weibo comments
def comment_page(html):
    global num
    print("No.", "Content")
    pattern = re.compile('<span class="ctt">.*?</span>', re.S)
    items = re.findall(pattern, html)
    result = str(items)
    re_results = re.findall(">.*?<", result, re.S)
    # Fragments containing any of these markers are markup or boilerplate, not comment text
    # (">回复<" is the "reply" label, ">评论配图<" is the "comment image" link text)
    skip_markers = (">']['<", ">:<", "<a href", ">回复<", ">评论配图<",
                    "><", ">', '<", "@", "> <")
    first_results = [r for r in re_results
                     if not any(marker in r for marker in skip_markers)]
    datalist = []
    for last_result in first_results:
        # Strip the surrounding angle brackets left over from the regex match
        excel = last_result.replace(">", "").replace("<", "")
        datalist.append([num, excel])
        print(num, excel)
        with open('微博评论内容.txt', 'a', encoding='utf-8') as fp:
            fp.write(excel + '\n')  # one comment per line
        num += 1
    return datalist


# Scrape the images attached to the Weibo comments
def comment_img(html):
    bf_1 = BeautifulSoup(html, 'lxml')  # BeautifulSoup object
    text_1 = bf_1.find_all('span', {'class': 'ctt'})  # all comment spans
    img = []
    for span in text_1:
        # "评论配图" is the link text of a comment image
        for x in span.find_all('a', string="评论配图"):
            link = x.get('href')
            if link:
                img.append(link)
    return img


# Scrape the emoticons in the Weibo comments
def comment_emotion(html):
    bf_1 = BeautifulSoup(html, 'lxml')  # BeautifulSoup object
    text_1 = bf_1.find_all('span', {'class': 'ctt'})  # all comment spans
    emotion = []
    for span in text_1:
        for x in span.find_all('img'):
            link = x.get('src')
            if link:
                emotion.append('http:' + link)  # src is protocol-relative
    return emotion


# Scrape the comments from page ori to page end
def comment(ori, end):
    alllist = []
    imglist = []
    emotionlist = []
    for i in range(ori, end + 1):
        url = "https://weibo.cn/comment/hot/JohBbtrnO?rl=1&page=" + str(i)
        html = get_one_page(url)
        print('Scraping page %d of the comments' % i)
        if html is not None:  # skip pages that failed to download
            alllist = alllist + comment_page(html)
            imglist = imglist + comment_img(html)
            emotionlist = emotionlist + comment_emotion(html)
        time.sleep(3)
        if i == end:
            print('Finished: scraped %d pages of comments' % (end - ori + 1))
            print('Saving comments to 微博评论内容.csv ...')
            store_text(alllist)
            print('Saving comment images to the folder ...')
            store_img(imglist)
            print('Saving comment emoticons to the folder ...')
            store_emotion(emotionlist)
            print('All saved!')


# Store the comment text
def store_text(rows):
    with open('微博评论内容.csv', 'w', encoding='utf-8-sig', newline='') as f:
        csv_writer = csv.writer(f)
        for item in rows:
            csv_writer.writerow(item)


# Store the comment images
def store_img(urls):
    for j, imgurl in enumerate(urls):
        urllib.request.urlretrieve(imgurl, os.path.join('评论图片', '%s.jpg' % j))


# Store the comment emoticons
def store_emotion(urls):
    for k, imgurl in enumerate(urls):
        urllib.request.urlretrieve(imgurl, os.path.join('评论表情', '%s.jpg' % k))


if __name__ == "__main__":
    try:
        ori = int(input("Enter the first page to scrape: "))
        end = int(input("Enter the last page to scrape: "))
    except ValueError:
        ori = end = 0  # non-numeric input falls through to the error message
    if ori and end:
        print("Scraping comments from page %d to page %d" % (ori, end))
        comment(ori, end)
    else:
        print("Please enter valid numbers!")
