Scraping Weibo with a Python crawler and analyzing the data

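I just finished watching 爱情公寓5 (iPartment 5), and 大力 in it is way too good-looking...

Then I opened 成果's Weibo, and those essay-length posts are way too satisfying to read...

So let's use Python to analyze what 狗哥 (the nickname used here for 成果) has been up to on Weibo these past few years.

Tools needed:

scrapy + pyecharts + pymysql

I won't cover how to use these libraries; look them up yourself.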

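Step 1: analyze 狗哥's Weibo page, of course

I recommend the mobile version of the Weibo site here, because it is simpler and less cluttered, which makes extracting information easier.

Click the location shown in the screenshot above.

Then refresh the page so the JSON data loads.

On inspection, this is the JSON file that loads the Weibo posts. Open it and take a look.

Inside it, data -> cards -> mblog holds the details of each post, such as the number of likes, the number of comments, and so on.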
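To make the data -> cards -> mblog layout concrete, here is a minimal sketch that walks a hand-written dictionary shaped like the response described above. The sample values and the helper name `extract_posts` are made up for illustration; only the key names come from this article.

```python
# Hypothetical sample that mimics the data -> cards -> mblog structure.
sample = {
    "data": {
        "cardlistInfo": {"since_id": 123456},
        "cards": [
            {
                "scheme": "https://m.weibo.cn/status/example",
                "mblog": {
                    "created_at": "03-15",
                    "text": "hello",
                    "attitudes_count": 10,  # likes
                    "comments_count": 2,
                    "reposts_count": 1,
                },
            },
            {"card_type": 11},  # a card without an mblog entry is skipped
        ],
    }
}


def extract_posts(payload):
    """Collect the mblog dictionaries from a parsed JSON page."""
    posts = []
    for card in (payload.get("data") or {}).get("cards") or []:
        mblog = card.get("mblog")
        if mblog:
            posts.append(mblog)
    return posts


posts = extract_posts(sample)
for post in posts:
    print(post["created_at"], post["attitudes_count"], post["comments_count"])
```

Cards that carry no mblog key are simply ignored, which is the same check the spider later performs.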

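Scroll to the bottom of this JSON file.

The last entry is a post from January 29, which means a single JSON file holds the posts from March 15 back to January 29.

So how do we get the posts from before January 29?

Careful analysis shows there is a pattern.

Scroll down on 狗哥's home page; when you reach the bottom, the page automatically loads a new JSON file.

The newly loaded JSON file:

Open it and compare its link with the previous one.

Everything at the front is identical; the only difference is at the end:

the second request has an extra since_id parameter.

Now open the first JSON file.

It contains a since_id as well.

At this point we can make a bold guess:

the first JSON file contains a since_id,

and that since_id identifies the next JSON file to load.

The since_id in that next JSON file in turn identifies the one after it,

and so on.

This way we can find every JSON file.

You can check a few yourself to verify.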
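The since_id chain can be sketched as a small loop. This is only an illustration under assumptions: `next_page_url` and `walk_pages` are hypothetical helper names, `fetch` stands for any caller-supplied function that returns the parsed JSON for a URL, and the fake pages below replace real network requests.

```python
def next_page_url(payload, first_url):
    """Build the URL of the next JSON page, or return None on the last page."""
    info = (payload.get("data") or {}).get("cardlistInfo") or {}
    since_id = info.get("since_id")
    if not since_id:
        return None
    return f"{first_url}&since_id={since_id}"


def walk_pages(first_url, fetch):
    """Yield every page by following since_id from one page to the next."""
    url = first_url
    while url:
        payload = fetch(url)
        yield payload
        url = next_page_url(payload, first_url)


# Fake pages standing in for real responses (made-up ids).
fake_pages = {
    "u?a=1": {"data": {"cardlistInfo": {"since_id": 1}}},
    "u?a=1&since_id=1": {"data": {"cardlistInfo": {"since_id": 2}}},
    "u?a=1&since_id=2": {"data": {"cardlistInfo": {}}},  # no since_id: stop
}
pages = list(walk_pages("u?a=1", fake_pages.__getitem__))
print(len(pages))
```

The loop stops as soon as a page carries no since_id, which is the same stopping rule the Scrapy spider uses.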

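With this worked out, we can start crawling.

Step 2: crawl the data

We can set start_urls to the link of the first JSON file that appears.

since_id # id of the next page
created_at # creation date
text # post content
source # device the post was sent from
scheme # link to the original post
reposts_count # number of reposts
textLength # length of the post in characters
comments_count # number of comments
attitudes_count # number of likes

These are fields in the JSON and can be read directly as dictionary entries.

Here is the code.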

import json

import scrapy
from bs4 import BeautifulSoup

from weibo.items import WeiboItem


class weibo_spider(scrapy.Spider):
    name = "weibo"
    allowed_domains = ["weibo.com", "weibo.cn"]
    # First JSON page
    start_urls = ["https://m.weibo.cn/api/container/getIndex?uid=1927305954&t=0&luicode=10000011&lfid=100103type%3D1%26q%3D%E6%88%90%E6%9E%9C&type=uid&value=1927305954&containerid=1076031927305954"]
    # Same URL with a trailing since_id parameter, used for the following pages
    url = "https://m.weibo.cn/api/container/getIndex?uid=1927305954&t=0&luicode=10000011&lfid=100103type%3D1%26q%3D%E6%88%90%E6%9E%9C&type=uid&value=1927305954&containerid=1076031927305954&since_id="

    def parse(self, response):
        # response.text replaces the deprecated response.body_as_unicode()
        data = json.loads(response.text).get('data') or {}
        since_id = (data.get('cardlistInfo') or {}).get('since_id')  # id of the next page
        for card in data.get('cards') or []:
            mblog = card.get('mblog')
            if not mblog:
                continue
            # The scraped text contains HTML tags, so strip them
            text = BeautifulSoup(str(mblog['text']), "html.parser").get_text()
            created_at = mblog['created_at']
            if len(created_at) < 6:
                # Posts from this year carry no year, so prepend it
                created_at = "2020-" + created_at
            # Hand the data over to the item defined in items.py
            yield WeiboItem(
                created_at=created_at,
                text=text,
                source=mblog['source'],
                scheme=card['scheme'],
                reposts_count=mblog['reposts_count'],
                comments_count=mblog['comments_count'],
                attitudes_count=mblog['attitudes_count'],
                textLength=len(text),
            )
        if not since_id:
            return
        # Link to the next JSON page
        yield scrapy.Request(self.url + str(since_id), callback=self.parse)
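The spider above prepends a hard-coded "2020-" to dates that come without a year. A small sketch of the same rule with the year as a parameter is shown below; `normalize_created_at` is a hypothetical helper, and it only covers the two shapes this article mentions (a short month-day value and a full date).

```python
def normalize_created_at(created_at, current_year=2020):
    """Return a YYYY-MM-DD string, adding the year to short MM-DD values."""
    created_at = created_at.strip()
    if len(created_at) < 6:  # short form such as "03-15"
        return f"{current_year}-{created_at}"
    return created_at  # already carries a year


print(normalize_created_at("03-15"))
print(normalize_created_at("2019-12-31"))
```

Passing the year in makes it easier to rerun the crawl in a later year without editing the parsing code.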

Scrapy's items.py file:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class WeiboItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    since_id = scrapy.Field()  # id of the next page
    created_at = scrapy.Field()  # creation date
    text = scrapy.Field()  # post content
    source = scrapy.Field()  # device the post was sent from
    scheme = scrapy.Field()  # link to the original post
    reposts_count = scrapy.Field()  # number of reposts
    textLength = scrapy.Field()  # length of the post in characters
    comments_count = scrapy.Field()  # number of comments
    attitudes_count = scrapy.Field()  # number of likes

Next, load the data into the database.

Scrapy's pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import json  # only needed for the commented-out JSON output below

import pymysql


class WeiboPipeline(object):
    account = {
        'user': 'root',
        'password': '*******',
        'host': 'localhost',
        'database': 'python'
    }

    def mysqlConnect(self):
        connect = pymysql.connect(**self.account)
        return connect

    def __init__(self):
        self.connect = self.mysqlConnect()  # connect to the database
        self.cursor = self.connect.cursor(cursor=pymysql.cursors.DictCursor)
        #### Alternative: write JSON to a file
        # self.fp = open("xiaofuren.json", 'w', encoding='utf-8')

    def insertMsg(self, scheme, text, source, reposts_count, comments_count, attitudes_count, textLength, created_at):
        try:
            # No column list is given, so the weibo table's columns must be in this order:
            # scheme, text, source, reposts_count, comments_count, attitudes_count, textLength, created_at.
            # Passing the values as parameters lets pymysql quote them safely.
            self.cursor.execute(
                "INSERT INTO weibo VALUES (%s, %s, %s, %s, %s, %s, %s, %s)",
                (scheme, text, source, reposts_count, comments_count, attitudes_count, textLength, created_at)
            )
            self.connect.commit()
        except Exception as e:
            print("insert_sql error: %s" % e)

    def open_spider(self, spider):
        print("Spider started ******************")

    def process_item(self, item, spider):
        self.insertMsg(item['scheme'], item['text'], item['source'], item['reposts_count'], item['comments_count'], item['attitudes_count'], item['textLength'], item['created_at'])
        return item
        #### Alternative: write JSON to a file
        # itme_json = json.dumps(dict(item), ensure_ascii=False)
        # self.fp.write(itme_json + '\n')
        # return item

    def close_spider(self, spider):
        print("Spider finished ***************")
        print("Data written successfully")
        self.cursor.close()
        self.connect.close()
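As the template comment in pipelines.py says, the pipeline only runs once it is registered in the ITEM_PIPELINES setting. A minimal settings.py fragment might look like this; the module path assumes the project package is named weibo (as the import `from weibo.items import WeiboItem` suggests), and 300 is just a conventional priority value, not something taken from this article.

```python
# settings.py (fragment)
# Assumption: the Scrapy project package is called "weibo".
# Lower numbers run earlier; values are conventionally chosen in the 0-1000 range.
ITEM_PIPELINES = {
    "weibo.pipelines.WeiboPipeline": 300,
}
```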

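The run took nearly 5 minutes. It is fairly slow, possibly because stripping the HTML tags slows down parsing.

Then take a look at the database.

There are 221 posts in total. Let's check that against the home page.

About 20 posts are missing; some reposts may not have been scraped. The last day does check out, though.

Now that we have the data, let's analyze it.

Step 3: data analysis

I used pyecharts.

This visualization library is impressive; it even has maps (though I didn't use them).

Export the data from the database: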

import datetime

import pymysql

account = {
    'user': 'root',
    'password': '*******',
    'host': 'localhost',
    'database': 'python'
}


def mysqlConnect(account):
    connect = pymysql.connect(**account)
    return connect


def getMessage(cursor, month, day, year, phone, dianzan, zhuanfa, pinlun, textLength, dates):
    sql = 'select * from weibo ORDER BY created_at'
    cursor.execute(sql)
    row = cursor.fetchall()
    # Dictionaries for counting posts per day of the month, per month, and per year
    Day = {i: 0 for i in range(1, 32)}
    Month = {i: 0 for i in range(1, 13)}
    Year = {i: 0 for i in range(2013, 2021)}
    for it in row:
        # strip() tolerates stray whitespace around the stored date string
        date = datetime.datetime.strptime(it['created_at'].strip(), "%Y-%m-%d")
        Year[date.year] += 1
        Day[date.day] += 1
        Month[date.month] += 1
        phone.append(it['source'])
        dianzan.append(it['attitudes_count'])
        zhuanfa.append(it['reposts_count'])
        pinlun.append(it['comments_count'])
        textLength.append(it['textLength'])
        dates.append(it['created_at'])
    for i in range(1, 32):
        day.append(Day[i])
    for i in range(1, 13):
        month.append(Month[i])
    for i in range(2013, 2021):
        year.append(Year[i])


if __name__ == '__main__':
    month = []  # posts per month
    year = []  # posts per year
    day = []  # posts per day of the month
    phone = []  # device used to post
    dianzan = []  # likes
    zhuanfa = []  # reposts
    pinlun = []  # comments
    textLength = []  # post length
    dates = []  # dates
    connect = mysqlConnect(account)
    cursor = connect.cursor(cursor=pymysql.cursors.DictCursor)
    getMessage(cursor, month, day, year, phone, dianzan, zhuanfa, pinlun, textLength, dates)
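As an alternative to the hand-built Day, Month, and Year dictionaries, the same tallies can be sketched with `collections.Counter`. This is not the code used for the charts in this article; `tally_dates` is a hypothetical helper and the three dates are made-up sample values.

```python
from collections import Counter
from datetime import datetime


def tally_dates(date_strings):
    """Count posts per year, per month, and per day of the month."""
    years, months, days = Counter(), Counter(), Counter()
    for value in date_strings:
        parsed = datetime.strptime(value.strip(), "%Y-%m-%d")
        years[parsed.year] += 1
        months[parsed.month] += 1
        days[parsed.day] += 1
    return years, months, days


years, months, days = tally_dates(["2020-01-29", "2020-03-15", "2019-01-28"])
# A Counter returns 0 for missing keys, so every month gets a value.
month_counts = [months[m] for m in range(1, 13)]
print(month_counts)
```

Because missing keys count as zero, there is no need to pre-fill the ranges, and a year outside 2013 to 2020 does not raise a KeyError.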

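The code has comments, so I won't explain it further.

Next comes the visualization.

First, chart the number of posts 狗哥 made by day, by month, and by year.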

from pyecharts import options as opts
from pyecharts.charts import Bar

# day, month, and year are the lists filled by getMessage() above

# Number of posts by day of the month
xday = list(range(1, 32))
bar = (
    Bar()
    .add_xaxis(xday)
    .add_yaxis("Posts per day of the month", day)
    .set_global_opts(title_opts=opts.TitleOpts(title="狗哥's Weibo post statistics"))
)
bar.render(path='day.html')

# By month
xmonth = list(range(1, 13))
bar = (
    Bar()
    .add_xaxis(xmonth)
    .add_yaxis("Posts per month", month)
    .set_global_opts(title_opts=opts.TitleOpts(title="狗哥's Weibo post statistics"))
)
bar.render(path='month.html')

# By year
xyear = list(range(2013, 2021))
bar = (
    Bar()
    .add_xaxis(xyear)
    .add_yaxis("Posts per year", year)
    .set_global_opts(title_opts=opts.TitleOpts(title="狗哥's Weibo post statistics"))
)
bar.render(path='year.html')

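By day:

Over the years, the 28th of the month has the most posts. 狗哥 probably likes to publish those essay-style posts near the end of the month to record what happened that month.

By month:

Judging from the data, 狗哥 likes to post in January; maybe the New Year holiday leaves more free time for posting.

By year:

2020 is probably the highest (and only 4 months have passed), presumably Weibo promotion right after the debut...

From 2018 to 2019 there are more essay-style posts: fresh out in the working world, posting now and then to vent a little...

Devices used to post

I'll put the code at the end...

Straight to the chart:

A loyal Apple fan.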
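The device chart code is not shown here, so the following is only a guess at how the tally could be done, not the author's code. `tally_sources` is a hypothetical helper, and the device names are made-up sample values standing in for the `phone` list filled earlier.

```python
from collections import Counter


def tally_sources(sources):
    """Count how often each posting device appears, most frequent first."""
    counts = Counter(s.strip() or "unknown" for s in sources)
    return counts.most_common()


# Made-up sample values; the real input would be the `phone` list.
pairs = tally_sources(["iPhone 11 Pro", "iPhone 11 Pro", "", "iPhone X"])
print(pairs)
```

The result is a list of (device, count) pairs, which is a convenient shape for feeding a pie or bar chart.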

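Now let's see how popularity changed over the years.

Likes on posts over the years:

Not much to analyze: 狗哥 became famous through 爱情公寓, so this year's likes were bound to explode.

But the very first post has over 30,000 likes; loyal fans must have read through every post and left a like on the last one they reached.

Reposts:

The heavily reposted ones are probably 狗哥's essay-style posts; they are quite entertaining, after all.

Comments:

Same as with likes, the oldest post has an especially large number, all from people digging up old posts.

Length of posts:

It looks like 狗哥 likes to publish an essay-length post every so often...

OK, that's it.

Weibo's anti-scraping measures are not strict: fetching posts requires no login, and logging in requires no captcha either. That is unlike Zhihu, where you cannot read articles without logging in and the captcha is especially troublesome.

Scraping Weibo comments, however, does require logging in.

The next post will show how to log in to Weibo and scrape comments.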
