1. Purpose

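A crawler is fundamentally a very simple thing; it only grows complicated because of business requirements. Plenty of easy-to-use crawler frameworks exist for developers' convenience, but this post does not use any of those ready-made frameworks, and it has no commercial purpose whatsoever: it simply scrapes the statistics of my own CSDN blog.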

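Of course, to make it more fun you can add features of your own, such as sending yourself an email when you gain a follower or receive a comment. I originally wrote some statistics features myself, but CSDN now offers a "Data Stargazing" (数据观星) feature, so in idle moments you can browse the visitor, like and comment counts of your blog there (and see how small they are).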

2. Implementation

2.1 Dependencies

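Python 3
BeautifulSoup v4.x (see its documentation)
urllib (part of the Python standard library)
tqdm

If running the code below reports a missing package, install it yourself (for example, pip install beautifulsoup4 tqdm).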

2.2 Fetching the blog page source

Everything that follows works on the page source, so this is the most basic step, and also the simplest.

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Blog address
url = 'https://blog.csdn.net/smileyan9'
html = urlopen(url)
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html.read(), 'html.parser')
print(soup.title)
print(soup.body)

Output:

<title>Smileyan's blog_smile-yan_CSDN博客-我的大后端,C/C++,我的Linux领域博主</title>
<body class="nodata" style="">
<script src="https://g.csdnimg.cn/common/csdn-toolbar/csdn-toolbar.js" type="text/javascript"></script>
......
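
The fetch step above can also be wrapped in a small helper. The following is only a sketch using the standard library: the User-Agent string, the 10-second timeout and the names build_request and fetch_html are assumptions made for illustration, not something the page is known to require.

```python
from urllib.request import Request, urlopen

# Hypothetical User-Agent string; any browser-like value is an assumption here.
DEFAULT_UA = "Mozilla/5.0 (compatible; blog-stats-demo)"


def build_request(url, user_agent=DEFAULT_UA):
    """Create a Request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download a page and return it as text, or None if the request fails."""
    try:
        with urlopen(build_request(url), timeout=timeout) as response:
            # Fall back to UTF-8 when the server does not declare a charset
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except OSError as err:  # URLError and socket timeouts are OSError subclasses
        print("request failed:", err)
        return None
```

The returned text can be handed to BeautifulSoup(text, 'html.parser') exactly like html.read() in the code above.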

2.2 Getting the totals

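These include the total view count, follower count, points and overall rank. Once a specified threshold is reached, an email is sent as a reminder.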

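from urllib.request import urlopen
from bs4 import BeautifulSoup

# Blog address
url = 'https://blog.csdn.net/smileyan9'
html = urlopen(url)
soup = BeautifulSoup(html.read(), 'html.parser')

# Narrow the search to the statistics block
sources = soup.select('.data-info')
soup = BeautifulSoup(str(sources), 'html.parser')
dls = soup.find_all(['dl'])
notes = ['Original', 'Weekly rank', 'Overall rank', 'Views', 'Level',
         'Points', 'Followers', 'Likes', 'Comments', 'Favorites']
# zip stops at the shorter sequence, so a layout change cannot raise IndexError
for note, dl in zip(notes, dls):
    print(note, ':', dl['title'])

Output:

Original : <number>
Weekly rank : <number>
Overall rank : <number>
Views : <number>
Level : 7级,点击查看等级说明
Points : <number>
Followers : <number>
Likes : <number>
Comments : <number>
Favorites : <number>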

2.3 Collecting all article URLs

This time the target is the CSDN blog address followed by /article/list/1, where the trailing number is the page number.

from urllib.request import urlopen
from bs4 import BeautifulSoup
from tqdm import tqdm
from time import sleep

# Number of list pages
page_num = 4
# Base address of a single article
key = 'https://smileyan.blog.csdn.net/article/details/'
# All article addresses
all_urls = []
for i in tqdm(range(page_num)):
    url = 'https://smileyan.blog.csdn.net/article/list/{}'.format(i + 1)
    html = urlopen(url)
    soup = BeautifulSoup(html.read(), 'html.parser')
    # Search by the CSS class used in the page source
    all_a = soup.select('.article-item-box')
    for one in all_a:
        # Build the full address from the article id
        target_url = key + one['data-articleid']
        all_urls.append(target_url)
    sleep(1)  # pause between pages to keep the request rate low

print(','.join(all_urls))
# 146 in total at the time of writing
print(len(all_urls))
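
The two URL patterns used in this section (/article/list/<page> and /article/details/<id>) can be isolated in small helpers, which makes them easy to check without touching the network. This is a minimal sketch; the function names list_page_urls and article_url are hypothetical.

```python
def list_page_urls(base_url, page_count):
    """Return the article-list URLs for pages 1..page_count."""
    base = base_url.rstrip("/")
    return ["{}/article/list/{}".format(base, page)
            for page in range(1, page_count + 1)]


def article_url(base_url, article_id):
    """Return the address of a single article from its numeric id."""
    return "{}/article/details/{}".format(base_url.rstrip("/"), article_id)


if __name__ == "__main__":
    for page_url in list_page_urls("https://smileyan.blog.csdn.net", 4):
        print(page_url)
```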

2.4 Collecting the view counts of all articles

This is much the same as above; only the location searched in the page source differs.

from urllib.request import urlopen
from bs4 import BeautifulSoup
from tqdm import tqdm
from time import sleep

# Number of list pages
page_num = 4
all_visitors = []
for i in tqdm(range(page_num)):
    url = 'https://blog.csdn.net/smileyan9/article/list/{}'.format(i + 1)
    html = urlopen(url)
    soup = BeautifulSoup(html.read(), 'html.parser')
    all_a = soup.select('.article-item-box')
    for one in all_a:
        # The view count is in the first .read-num element of each item
        num = one.select('.read-num')[0].get_text()
        all_visitors.append(int(num))
    sleep(1)  # pause between pages to keep the request rate low

print(len(all_visitors))
for visit in all_visitors:
    print(visit, end=',')
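
Once the per-article view counts are in a list, a few summary numbers are easy to derive. The sketch below assumes a plain list of integers like all_visitors; the function name summarize_views and the sample numbers are made up for illustration.

```python
from statistics import mean, median


def summarize_views(view_counts, top_n=3):
    """Return basic statistics for a list of per-article view counts."""
    if not view_counts:
        return {"articles": 0, "total": 0, "mean": 0, "median": 0, "top": []}
    return {
        "articles": len(view_counts),
        "total": sum(view_counts),
        "mean": mean(view_counts),
        "median": median(view_counts),
        # The largest counts first, limited to top_n entries
        "top": sorted(view_counts, reverse=True)[:top_n],
    }


if __name__ == "__main__":
    sample = [10, 200, 35, 5]  # made-up numbers, not real blog data
    print(summarize_views(sample))
```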

2.5 Sending an email

After getting the totals in section 2.2, you can set a goal and send yourself an email once it is reached.

#!/usr/bin/python
# -*- coding: UTF-8 -*-
from urllib.request import urlopen
from bs4 import BeautifulSoup
import smtplib
from email.mime.text import MIMEText
from email.header import Header

# Blog address
url = 'https://blog.csdn.net/smileyan9'
html = urlopen(url)
soup = BeautifulSoup(html.read(), 'html.parser')
sources = soup.select('.data-info')
soup = BeautifulSoup(str(sources), 'html.parser')
results = soup.find_all(['span'])

# Overall rank
place = results[2].get_text()
# Total points
# score = results[4].get_text()      # works as long as the points stay below 10,000
score = soup.find_all(['dl'])[5]['title']
# Total followers
fans = results[5].get_text()
# Total views
text = soup.find_all(style='min-width:58px')[0]
visitors = text['title']

print('Rank:', place)
print('Points:', score)
print('Followers:', fans)
print('Views:', visitors)

# Define the goals
target_place = 2000
target_visitors = 100 * 10000
target_fans = 1000            # so hard, hahaha
target_score = 10000

# A smaller rank is better, so compare negated values
now = [-int(place), int(visitors), int(fans), int(score)]
targets = [-target_place, target_visitors, target_fans, target_score]
messages = [
    'Congratulations! Your rank has reached the goal!',
    'Congratulations! Your view count has reached the goal!',
    'Congratulations! Your follower count has reached the goal!',
    'Congratulations! Your points have reached the goal!',
]

# Index of the first goal that has been reached, or -1 if none
reach = -1
for i in range(len(targets)):
    if now[i] >= targets[i]:
        reach = i
        break

if reach > -1:
    common = 'Rank: {}; Points: {}; Followers: {}; Views: {}'.format(place, score, fans, visitors)

    # Third-party SMTP service
    mail_host = 'smtp.exmail.qq.com'  # server
    mail_user = "root@smileyan.cn"    # user name
    mail_pass = "Your password"       # password
    sender = 'root@smileyan.cn'
    receivers = ['root@smileyan.cn']  # recipients; can be your QQ mailbox or any other mailbox

    # Include the current numbers after the congratulation line
    message = MIMEText(messages[reach] + '\n' + common, 'plain', 'utf-8')
    message['From'] = Header("Python script (by smileyan)", 'utf-8')
    message['To'] = Header("Lucky one", 'utf-8')
    subject = 'Congratulations'
    message['Subject'] = Header(subject, 'utf-8')

    try:
        smtpObj = smtplib.SMTP()
        smtpObj.connect(mail_host, 25)    # 25 is the SMTP port
        smtpObj.login(mail_user, mail_pass)
        smtpObj.sendmail(sender, receivers, message.as_string())
        print("Mail sent successfully")
    except smtplib.SMTPException:
        print("Error: unable to send mail")
else:
    print('The revolution has not yet succeeded; comrades, keep working!')

Remember to change the mailbox, password and so on. The code has been tested and sends mail correctly.
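
As a variation on the sending step, the message can be built with email.message.EmailMessage and the password read from an environment variable instead of being written into the script. This is only a sketch: the environment variable name MAIL_PASSWORD, the function names, and the use of SMTP over SSL on port 465 are assumptions, so check what your mail provider actually supports.

```python
import os
import smtplib
from email.message import EmailMessage


def build_notice(sender, receiver, subject, body):
    """Build a plain-text mail message without sending it."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = receiver
    message["Subject"] = subject
    message.set_content(body)
    return message


def send_notice(message, host, user, password_env="MAIL_PASSWORD", port=465):
    """Send the message over SMTP with SSL (assumed to be offered on port 465)."""
    # Raises KeyError if the (hypothetical) environment variable is not set
    password = os.environ[password_env]
    with smtplib.SMTP_SSL(host, port) as smtp:
        smtp.login(user, password)
        smtp.send_message(message)
```

A call would look like send_notice(build_notice(sender, receiver, 'Congratulations', text), 'smtp.exmail.qq.com', sender); nothing is sent until send_notice is called.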

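One remaining question is how to run this Python code at a fixed interval. First, you ideally need a machine that stays on all the time (a cheap cloud server is a good choice). Keeping a script running on the server is then easy: for example, wrap the code in an outer for loop and add a sleep to each iteration. Alternatively, write a Linux script or scheduled task that runs it every so often.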
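
The loop-plus-sleep idea can be sketched as follows. The function name run_periodically and its parameters are hypothetical; check_blog stands in for whatever scraping and mailing code you want to repeat.

```python
import time


def run_periodically(task, interval_seconds, max_runs=None, sleep=time.sleep):
    """Call task repeatedly, waiting interval_seconds between calls.

    max_runs=None means run forever; sleep is injectable so the loop
    can be tried out without actually waiting.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            task()
        except Exception as err:  # keep the loop alive if one run fails
            print("task failed:", err)
        runs += 1
        # Do not wait after the final run
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs


def check_blog():
    """Placeholder for the scraping and mailing code from section 2.5."""
    print("checking blog statistics ...")


if __name__ == "__main__":
    # Three quick runs, one second apart, as a demonstration
    run_periodically(check_blog, interval_seconds=1, max_runs=3)
```

Catching exceptions inside the loop is a design choice: a temporary network failure then costs one run instead of stopping the whole script.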

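Recently (2020.10.28) I found time to set this up on my own Huawei Cloud server. If you are interested, see my post "linux 定时任务 (python 爬虫统计博客数据)" (Linux scheduled tasks: collecting blog statistics with a Python crawler).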

3. A newer API

You can also try the newer API that CSDN officially provides. To stress it once more: this is not for inflating view counts.

Endpoint: https://blog.csdn.net/community/home-api/v1/get-business-list?page=4&size=20&businessType=blog&orderby=ViewCount&noMore=false&year=&month=&username=smileyan9

Purpose: an overview of the statistics for all of a blogger's articles.

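Key fields: page and size are the two pagination fields; the example above requests page 4 with 20 items per page. Setting businessType to blog restricts the results to articles, excluding CSDN's status-update-like blink posts, uploaded resources and so on. orderby is the sort key, and username is the CSDN user name.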
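
The query string described above can be assembled with urllib.parse.urlencode rather than by hand. This is a minimal sketch: the function name build_api_url and its default values are assumptions, while the parameter names are taken from the endpoint shown above.

```python
from urllib.parse import urlencode

API_URL = "https://blog.csdn.net/community/home-api/v1/get-business-list"


def build_api_url(username, page=1, size=20, orderby="ViewCount"):
    """Build the request URL for one page of a user's article list."""
    params = {
        "page": page,              # page number
        "size": size,              # items per page
        "businessType": "blog",    # articles only
        "orderby": orderby,        # sort key
        "noMore": "false",
        "year": "",
        "month": "",
        "username": username,      # CSDN user name
    }
    return API_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_api_url("smileyan9", page=4))
```

Keeping the parameters in a dictionary makes it easy to loop over page values without editing a long string.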

Here is a brief look at how to call the API; collect whatever data you need, but remember: no illegal or non-compliant use!

from urllib.request import urlopen
import json

page = 4
size = 20
# API address
url = ('https://blog.csdn.net/community/home-api/v1/get-business-list?page=' +
       str(page) + '&size=' + str(size) +
       '&businessType=blog&orderby=ViewCount&noMore=false&year=&month=&username=smileyan9')
html = urlopen(url)
result = json.loads(html.read())
print(result)

The printed result looks roughly like this:

{'code': 200, 'message': 'success', 'data':{'list': [{'articleId': 102787017, 'title': '解决hdfs 运行在9000端口外界不能访问', 'description': '问题描述默认情况下，hdfs运行在 127.0.0.1:9000，也就是说只运行在本地，而不是0.0.0.0，像Tomcat不管在云服务器还是虚拟机上，启动后我们直接.......}}
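
Given a response shaped like the one above, pulling out the article ids and titles takes only a few lines. This sketch relies solely on the fields visible in the sample (code, data, list, articleId, title); the function name list_articles is hypothetical, and the real response may carry more fields than shown.

```python
def list_articles(result):
    """Return (articleId, title) pairs from a decoded API response."""
    if result.get("code") != 200:
        raise ValueError("unexpected response code: {}".format(result.get("code")))
    return [(item["articleId"], item["title"])
            for item in result["data"]["list"]]


if __name__ == "__main__":
    # A cut-down response with only the fields used here
    sample = {
        "code": 200,
        "message": "success",
        "data": {"list": [
            {"articleId": 102787017, "title": "解决hdfs 运行在9000端口外界不能访问"},
        ]},
    }
    for article_id, title in list_articles(sample):
        print(article_id, title)
```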

4. Summary

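There is one kind of boredom called "let's scrape some data for fun", and another called "might as well churn out a blog post while I'm at it" ...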

Thanks for reading. If you found this fun, remember to click the like button at the bottom left. Thank you!

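I stumbled upon the simple and handy BeautifulSoup and used it to write this demo. Thanks also to CSDN for the free blog domain https://smileyan.blog.csdn.net/ . To repeat, the scraping code here is purely for fun, with no intention of "inflating visitor counts" or "commercial use".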

Smileyan
2020.10.20.17:04
2021.3.1 10:42

Collecting your own CSDN blog statistics in 30 lines of code