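Contents

  • 1. What is a web crawler
  • 2. Prerequisites for learning web crawling
  • 3. Environment setup
  • 4. The first step of crawling: fetching a page's HTML
    • 4.1 GET
    • 4.2 POST
  • 5. Extracting the data we want from HTML text with the BeautifulSoup module
    • 5.1 Tag
    • 5.2 NavigableString
    • 5.3 Comment
  • 6. BeautifulSoup traversal methods
    • 6.1 Nodes and tag names
    • 6.2 Searching the document tree
      • 6.2.1 find_all()
      • 6.2.2 find()
  • 7 Hands-on practice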

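1. What is a web crawler

A web crawler (also called a web spider or web robot) is a program or script that automatically fetches information from the World Wide Web according to a set of rules.

2. Prerequisites for learning web crawling

HTML, Python, the TCP/IP protocol suite, and HTTP.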

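3. Environment setup

3.1. Install Python 3:
https://www.python.org/downloads/release/python-372/
3.2. Install the requests library: pip install requests
3.3. Install the BeautifulSoup library: pip install beautifulsoup4
3.4. Install the lxml library:

How to install lxml for Python 3
On most current setups, pip install lxml installs a prebuilt wheel directly. If that fails, install the wheel file by hand as follows.
First install wheel: pip install wheel
Then open http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml and download the lxml wheel file (once on the page, press Ctrl+F, type lxml, and press Enter to find it).
Then open cmd, change to the folder that contains the downloaded .whl file, and run
pip3 install lxml-4.3.2-cp37-cp37m-win_amd64.whl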

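4. The first step of crawling: fetching a page's HTML

4.1 GET

Send a GET request to the website, and the website returns a response:

import requests  # import the requests library
r = requests.get('https://unsplash.com')  # send a GET request to the target URL; returns a Response object
print(r.text)  # r.text is the page HTML carried in the HTTP response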

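A GET request can also carry parameters:

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get("http://httpbin.org/get", params=payload)

The request sent by the code above includes two parameters, key1 and key2, together with their values. In effect it builds the following URL: http://httpbin.org/get?key1=value1&key2=value2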
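To see how the parameters end up in the URL without sending any request, the query string can be rebuilt with the standard library. This is a minimal sketch: build_get_url is a hypothetical helper written for illustration, not part of requests, and it assumes the base URL has no query string yet.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_get_url(base_url, params):
    """Hypothetical helper: append params to base_url as a query string."""
    return base_url + "?" + urlencode(params)


payload = {'key1': 'value1', 'key2': 'value2'}
url = build_get_url("http://httpbin.org/get", payload)
print(url)  # http://httpbin.org/get?key1=value1&key2=value2

# Round trip: parse the query string back into a dict of lists.
query = parse_qs(urlsplit(url).query)
print(query)  # {'key1': ['value1'], 'key2': ['value2']}
```

urlencode also percent-encodes characters that are not safe in a URL, which is why passing a dict is usually preferable to concatenating the string by hand.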

4.2 POST

4.2.1 A POST request without parameters:

r = requests.post("http://httpbin.org/post")

4.2.2 A POST request with parameters:

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post("http://httpbin.org/post", data=payload)
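Unlike GET parameters, form data passed through data= travels in the request body rather than in the URL. The sketch below illustrates that difference with the standard library only; it builds a request object but never sends it, and it is meant as an illustration of form encoding, not as a description of how requests is implemented internally.

```python
from urllib.parse import urlencode
from urllib.request import Request

payload = {'key1': 'value1', 'key2': 'value2'}

# Form-encode the payload into bytes for the request body.
body = urlencode(payload).encode('ascii')

# Build (but do not send) a POST request carrying that body.
req = Request("http://httpbin.org/post", data=body, method="POST")

print(req.get_method())  # POST
print(req.full_url)      # http://httpbin.org/post  (no query string)
print(req.data)          # b'key1=value1&key2=value2'
```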

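5. Extracting the data we want from HTML text with the BeautifulSoup module

Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four types: Tag, NavigableString, BeautifulSoup, and Comment.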
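As a quick orientation, the four object types can be observed on a tiny document. This is a minimal sketch; the one-line HTML string is made up for illustration, and it uses Python's built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag, NavigableString, Comment

soup = BeautifulSoup("<p>text<!--note--></p>", "html.parser")
p = soup.p
children = list(p.children)

print(type(soup).__name__)         # BeautifulSoup
print(type(p).__name__)            # Tag
print(type(children[0]).__name__)  # NavigableString
print(type(children[1]).__name__)  # Comment
```

Note that Comment is a subclass of NavigableString, so an isinstance check against NavigableString is true for comments as well.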

Suppose we have the following HTML page, stored in a string named html_doc that the code below uses:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p><p class="story">...</p>
"""
5.1 Tag

find() returns a single Tag object; find_all() returns a collection of Tag objects that can be iterated with a for loop. Once you have a tag, you can read information from it.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')  # create the BeautifulSoup object
find = soup.find('p')  # find the first p tag
print("find's return type is ", type(find))  # print the type of the return value
print("find's content is", find)  # print what find() returned
print("find's Tag Name is ", find.name)  # print the tag's name
print("find's Attribute(class) is ", find['class'])  # print the tag's class attribute

Output:

find's return type is  <class 'bs4.element.Tag'>
find's content is <p class="title"><b>The Dormouse's story</b></p>
find's Tag Name is  p
find's Attribute(class) is  ['title']
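The class attribute prints as a list because class can hold several values, while single-valued attributes such as id come back as plain strings. A small sketch of reading attributes from a Tag follows; the a element is a shortened, slightly altered copy of the sample page (the extra class value big is added for illustration), parsed with the built-in html.parser.

```python
from bs4 import BeautifulSoup

html = '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>'
a = BeautifulSoup(html, "html.parser").a

print(a.name)          # a
print(a['class'])      # ['sister', 'big']  (multi-valued attribute -> list)
print(a['id'])         # link1              (single-valued attribute -> str)
print(a.get('title'))  # None               (missing attribute, no KeyError)
print(a.attrs)         # all attributes as a dict
```

Indexing with a['title'] would raise KeyError here, so .get() is the safer choice when an attribute may be absent.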
5.2 NavigableString

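A NavigableString is the text content inside a tag. Access it with tag.string.
Continuing with the example above:

print('NavigableString is:', find.string)

Output:

NavigableString is: The Dormouse's story

Notes:
1. If a tag has exactly one child and that child is a NavigableString, .string returns that child.
2. If a tag has exactly one child and that child is another tag, .string also works and returns the same result as the child's .string.
3. If a tag contains more than one child, it is ambiguous which child .string should refer to, so .string is None.
So when a tag has several children, we cannot find that tag by searching on its text (the text/string argument). Consider the following case:

<a>abc<div class='no_print'></div>
</a>

First remove the child elements of the a tag other than the string:

[s.extract() for s in soup.find_all(name='div', class_='no_print')]

so that the tag to search takes the following form: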

<a>abc
</a>
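The three .string rules and the effect of extract() can be checked on small inputs. This is a minimal sketch using made-up one-line documents and the built-in html.parser; get_text() is shown as an alternative that does not require removing anything.

```python
from bs4 import BeautifulSoup

# Rule 1: exactly one child, and it is a string.
one_text = BeautifulSoup("<p>hello</p>", "html.parser").p.string

# Rule 2: exactly one child tag, which itself has a single string.
nested = BeautifulSoup("<p><b>hello</b></p>", "html.parser").p.string

# Rule 3: several children (a string and a div), so .string is None.
soup = BeautifulSoup("<a>abc<div class='no_print'></div></a>", "html.parser")
before = soup.a.string

# get_text() concatenates all descendant strings and works regardless.
text_anyway = soup.a.get_text()

# Remove the div; the a tag is left with a single string child.
for div in soup.find_all('div', class_='no_print'):
    div.extract()
after = soup.a.string

print(one_text, nested, before, text_anyway, after)
# hello hello None abc abc
```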

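Besides nested tags, a tag may also contain line breaks, comments, and so on.
It is best to strip <br> line-break tags with replace() before Beautiful Soup parses the HTML text:

html = html.replace('<br>', '').replace('<br/>', '')

Removing comments is a little different:

from bs4 import BeautifulSoup, Comment
# soup(...) is shorthand for soup.find_all(...); older bs4 versions call the string argument text
for comment in soup(string=lambda text: isinstance(text, Comment)):
    comment.extract()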
5.3 Comment

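This object is simply a comment in HTML or XML. Sometimes we do not want the comment content from the HTML, so we use this type to check whether a string is a comment.

import bs4

if isinstance(some_str, bs4.element.Comment):
    print('This string is a comment')
else:
    print('This string is not a comment')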
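Putting the comment check and removal together on a concrete input may help. The sketch below uses a made-up one-line document and the built-in html.parser; it collects the comments first and then extracts them, so the tree is not modified while it is being searched.

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup("<p>Hello<!-- hidden note --> world</p>", "html.parser")

# Find every string node that is a Comment.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print(len(comments))  # 1

# Detach each comment from the tree.
for c in comments:
    c.extract()

print(soup.p.get_text())  # Hello world
```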

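6. BeautifulSoup traversal methods

6.1 Nodes and tag names

The tree can be traversed through child nodes, parent nodes, and tag names. In the snippets below, title_tag, link, and sibling_soup are assumed to be objects that already exist (two Tag objects and a BeautifulSoup object).

soup.head  # find the head tag
soup.p  # find the first p tag

# loop over a tag's direct children
for child in title_tag.children:
    print(child)

soup.parent  # parent node

# all ancestors
for parent in link.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)

# sibling nodes
sibling_soup.b.next_sibling  # the next sibling
sibling_soup.c.previous_sibling  # the previous sibling

# all siblings
for sibling in soup.a.next_siblings:
    print(repr(sibling))

for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))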
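Since the snippets above depend on objects defined elsewhere, here is a self-contained sketch of the same navigation attributes. The small HTML string is made up for illustration and parsed with the built-in html.parser.

```python
from bs4 import BeautifulSoup

html = ("<html><body><p id='first'><b>one</b><i>two</i></p>"
        "<p id='second'>three</p></body></html>")
soup = BeautifulSoup(html, "html.parser")
b = soup.b

# Direct children of the first p tag.
child_names = [child.name for child in soup.p.children]
print(child_names)  # ['b', 'i']

# Ancestors of b, from the nearest up to the BeautifulSoup object itself.
parent_names = [parent.name for parent in b.parents]
print(parent_names)  # ['p', 'body', 'html', '[document]']

# Neighbours on the same level.
print(b.next_sibling.name)           # i
print(soup.i.previous_sibling.name)  # b

# All following siblings of the first p tag.
later_ids = [sib['id'] for sib in soup.p.next_siblings]
print(later_ids)  # ['second']
```

On real pages, siblings are often whitespace strings rather than tags, so sib.name may be None and sib['id'] may fail; the compact input here avoids that.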
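6.2 Searching the document tree

The most commonly used methods are find() and find_all(), but there are others, such as find_parent() and find_parents(), find_next_sibling() and find_next_siblings(), find_all_next() and find_next(), and find_all_previous() and find_previous().

6.2.1 find_all()

Full syntax:

find_all(name , attrs , recursive , string , limit , **kwargs)

name: finds all tags whose name is name.
attrs: the attributes the tag must have, for example a dictionary such as {'id': 'link2'}.
string: searches the strings in the document instead of the tags.
recursive: when find_all() is called on a tag, Beautiful Soup examines all of that tag's descendants. To search only the tag's direct children, pass recursive=False.
limit: stops the search after the given number of results.
Examples: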

soup.find_all("title")
# [<title>The Dormouse's story</title>]

soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]

soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

import re
soup.find(string=re.compile("sisters"))
# 'Once upon a time there were three little sisters; and their names were\n'
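The examples above do not exercise recursive, limit, or class filtering together, so here is a small sketch of those options. The HTML string is made up for illustration and parsed with the built-in html.parser.

```python
import re
from bs4 import BeautifulSoup

html = ("<div id='box'>"
        "<a class='sister' href='/a'>A</a>"
        "<p><a class='sister' href='/b'>B</a></p>"
        "<a class='other' href='/c'>C</a>"
        "</div>")
soup = BeautifulSoup(html, "html.parser")
box = soup.find(id='box')

# Default: search all descendants, in document order.
all_links = [a['href'] for a in box.find_all('a')]
print(all_links)  # ['/a', '/b', '/c']

# recursive=False: only direct children of box.
direct = [a['href'] for a in box.find_all('a', recursive=False)]
print(direct)  # ['/a', '/c']

# Filter by CSS class (class_ because class is a Python keyword).
sisters = [a['href'] for a in box.find_all('a', class_='sister')]
print(sisters)  # ['/a', '/b']

# limit: stop after the first two matches.
first_two = [a['href'] for a in box.find_all('a', limit=2)]
print(first_two)  # ['/a', '/b']

# string: match text nodes instead of tags.
by_string = [str(s) for s in box.find_all(string=re.compile('^[AB]$'))]
print(by_string)  # ['A', 'B']
```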
6.2.2 find()

Full syntax:

find(name , attrs , recursive , string , **kwargs)

Examples:

soup.find('title')
# <title>The Dormouse's story</title>
#
soup.find("head").find("title")
# <title>The Dormouse's story</title>
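One practical difference between the two methods is what they return when nothing matches: find() gives None, while find_all() gives an empty list. The sketch below shows this on a made-up snippet parsed with the built-in html.parser; checking for None before chaining calls avoids an AttributeError.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>T</title></head><body><p>x</p></body>",
                     "html.parser")

first = soup.find('title')
same = soup.find_all('title', limit=1)  # a one-element list
print(first.string, len(same))  # T 1

missing = soup.find('table')          # no match -> None
missing_all = soup.find_all('table')  # no match -> empty list
print(missing, list(missing_all))  # None []

# Guard before chaining: soup.find('table').find('tr') would raise AttributeError.
rows = missing.find_all('tr') if missing is not None else []
print(rows)  # []
```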

7 Hands-on practice

I want to crawl the titles and links of all the articles on my own blog.

import requests
from bs4 import BeautifulSoup

PAGE_CNT = 16
BLOG_URL = 'https://blog.csdn.net/linxinfa/article/list/'


def get_all_articles():
    # pretend to be a Mozilla browser
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1664.3 Safari/537.36"}
    cnt = 1
    for i in range(1, PAGE_CNT):
        # crawl article titles on pages 1 to PAGE_CNT - 1 (range() excludes its upper bound)
        web_url = BLOG_URL + str(i)
        r = requests.get(web_url, headers=headers)
        html_txt = r.text
        # remove <br> line-break tags
        html_txt = html_txt.replace('<br>', '').replace('<br/>', '')
        soup = BeautifulSoup(html_txt, 'lxml')
        tag_main = soup.find('main')
        tag_div = tag_main.find('div', class_='article-list')
        if tag_div:
            tag_h4_all = tag_div.find_all('h4')
            for tag_h4 in tag_h4_all:
                tag_a = tag_h4.find('a')
                if 'linxinfa' in tag_a['href']:
                    href = tag_a['href']
                    # tag_a.string cannot be used directly for the title, because the
                    # a node contains child nodes; extract them first
                    tag_a = [s.extract() for s in tag_a]
                    # take the title
                    title = tag_a[1]
                    # strip leading and trailing whitespace
                    title = title.strip()
                    print('%s. %s: %s' % (cnt, title, href))
                    cnt = cnt + 1


if '__main__' == __name__:
    get_all_articles()

Output:

1. windows开机自动运行脚本:https://blog.csdn.net/linxinfa/article/details/88633883
2. Python爬虫学习笔记:https://blog.csdn.net/linxinfa/article/details/88627954
3. Unity发布Android时需要的Android SDK的下载:https://blog.csdn.net/linxinfa/article/details/88605815
4. 如何下载安装JDK(Unity发布Android时需要安装JDK):https://blog.csdn.net/linxinfa/article/details/88597900
5. Unity的一些常用设置(Unity入门必看):https://blog.csdn.net/linxinfa/article/details/88577190
6. apk在手机上的安装路径在哪里,如何拿到程序的本地数据:https://blog.csdn.net/linxinfa/article/details/88559834
7. Unity中生成图片的灰白图:https://blog.csdn.net/linxinfa/article/details/88553932
8. iOS企业版app部署到自己服务器(不通过AppStore,在iOS设备上直接安装ipa):https://blog.csdn.net/linxinfa/article/details/88540213
9. 字体裁剪,精简字体,字体瘦身:FontSubsetGUI,FontCreator,FontPruner:https://blog.csdn.net/linxinfa/article/details/88427808
10. lua深拷贝一个table:https://blog.csdn.net/linxinfa/article/details/88390724
11. Unity的Animator怎么重新播放某个动画:https://blog.csdn.net/linxinfa/article/details/88357367
12. Unity使用tolua框架教程: LuaFramewrk:https://blog.csdn.net/linxinfa/article/details/88246345
13. 查看XCode的SDK版本:https://blog.csdn.net/linxinfa/article/details/88241479
14. 关于win操作系统引导模式:UEFI:https://blog.csdn.net/linxinfa/article/details/88202064
15. Unity写lua代码的vs插件:BabeLua:https://blog.csdn.net/linxinfa/article/details/88191485
16. 从零准备mac系统下的Unity开发和打包环境(针对Unity5.x):https://blog.csdn.net/linxinfa/article/details/88182785
17. Unity打ipa提审AppStore,邮件回复Invalid architecturese:https://blog.csdn.net/linxinfa/article/details/88105433
18. Unity使用CodeGuard进行c#代码混淆遇到的一个坑: 父子类混淆:https://blog.csdn.net/linxinfa/article/details/88061598
19. Unity制作飞金币到指定位置的粒子:https://blog.csdn.net/linxinfa/article/details/88020162
20. Unity的预设怎么改成文本形式存储(YAML):https://blog.csdn.net/linxinfa/article/details/87987265
21. Android之Volley库:https://blog.csdn.net/linxinfa/article/details/87946495
22. Android Support 包:Android Support v4、v7、v13等:https://blog.csdn.net/linxinfa/article/details/87945883
23. objective-c的%s和%@:https://blog.csdn.net/linxinfa/article/details/87936067
24. objective-c的nil和NULL:https://blog.csdn.net/linxinfa/article/details/87935584
25. objective-c的alloc和init:https://blog.csdn.net/linxinfa/article/details/87935086
26. iOS内购代码objective-c总结:https://blog.csdn.net/linxinfa/article/details/87934241
27. Unity打包iOS自动拷贝1024图标到xcode工程中(上架AppStore需要设置1024*1024图标):https://blog.csdn.net/linxinfa/article/details/87930755
28. uGUI判断鼠标或者手指是否点击在UI上:https://blog.csdn.net/linxinfa/article/details/87885130
29. Unity编辑器扩展: GUILayout、EditorGUILayout 控件整理:https://blog.csdn.net/linxinfa/article/details/87863123
30. Unity通过RGBA图生成alpha通道的图:https://blog.csdn.net/linxinfa/article/details/87861218
31. c#的delegate和event:https://blog.csdn.net/linxinfa/article/details/87857754
32. Unity适配iphone刘海屏:https://blog.csdn.net/linxinfa/article/details/87855958
33. Unity (C#) 使用 LitJson 处理 JSON 数据:https://blog.csdn.net/linxinfa/article/details/87855614
34. Unity生成二维码,ZXing库:https://blog.csdn.net/linxinfa/article/details/87854950
35. Unity Attribute的使用总结:https://blog.csdn.net/linxinfa/article/details/87809229
36. java反编译工具:https://blog.csdn.net/linxinfa/article/details/87806296
37. 如何通过反射调用内部静态函数:https://blog.csdn.net/linxinfa/article/details/87801196
38. Unity3D的四种坐标系:https://blog.csdn.net/linxinfa/article/details/87795248
39. Unity用Gizmos画线和图:https://blog.csdn.net/linxinfa/article/details/87793107
40. 样条曲线:https://blog.csdn.net/linxinfa/article/details/87711348
41. 为什么Inspector里显示的图片大小和文件夹中显示的大小不一样:https://blog.csdn.net/linxinfa/article/details/87708461
42. 获取某目录中所有文件,包括子目录中的文件:https://blog.csdn.net/linxinfa/article/details/87697244
43. Unity中所有特殊的文件夹:https://blog.csdn.net/linxinfa/article/details/87695652
44. unity用代码设置Splash Screen闪屏:https://blog.csdn.net/linxinfa/article/details/87694417
45. Unity中Bundle Identifier、Bundle Version、Bundle Version Code区别:https://blog.csdn.net/linxinfa/article/details/87693138
46. PostProcessBuildAttribute和PostProcessSceneAttribute:https://blog.csdn.net/linxinfa/article/details/87643297
47. unity unit类型转Color:https://blog.csdn.net/linxinfa/article/details/87641907
48. Unity Debug.Log输出带颜色文字的log:https://blog.csdn.net/linxinfa/article/details/87633970
49. iOS12独立沙盒账户登录:https://blog.csdn.net/linxinfa/article/details/87632942
50. Unity打iOS之xcodeapi的使用:https://blog.csdn.net/linxinfa/article/details/87618408
51. Unity打iOS,编译选项是不是一定要选择il2cpp:https://blog.csdn.net/linxinfa/article/details/87358809
52. libstdc++适配Xcode10与iOS12:https://blog.csdn.net/linxinfa/article/details/87283608
53. 局域网内,unity5.x mac版本,iOS打包插件怎么装:https://blog.csdn.net/linxinfa/article/details/87278972
54. 局域网内windows远程mac:https://blog.csdn.net/linxinfa/article/details/87256465
55. Unity与iOS交互(XUPorter的使用):https://blog.csdn.net/linxinfa/article/details/87103423
56. mac升级XCode到10.1 (iOS 12.1 SDK),Unity5.x无法访问原项目的问题:APFS硬盘格式无法识别:https://blog.csdn.net/linxinfa/article/details/87095810
57. iOS开发中静态库和动态库:https://blog.csdn.net/linxinfa/article/details/87074927
58. NGUI的UIPanel双层裁剪:https://blog.csdn.net/linxinfa/article/details/86669016
59. NGUI在不修改shader的情况下,把精灵图片置灰:https://blog.csdn.net/linxinfa/article/details/86667423
60. NGUI的UIPanel的Depth改良:二级排序:https://blog.csdn.net/linxinfa/article/details/86666688
61. Unity检测图集是否是正方形并且检测它的压缩格式是否是PVRTC:https://blog.csdn.net/linxinfa/article/details/86657128
62. HTML样式的使用笔记:https://blog.csdn.net/linxinfa/article/details/86625756
63. Unity编辑器工具编写记录:https://blog.csdn.net/linxinfa/article/details/86589042
64. Unity编辑器小窗口进度条:https://blog.csdn.net/linxinfa/article/details/86588972
65. lua设置和获取一个数字的二进制形式的某个位的值:https://blog.csdn.net/linxinfa/article/details/86581351
66. lua给整数数字前面补零:https://blog.csdn.net/linxinfa/article/details/86580547
67. Unity通过反射给gameObject添加组件:https://blog.csdn.net/linxinfa/article/details/86580046
68. Unity3D摄像机裁剪——NGUI篇:https://blog.csdn.net/linxinfa/article/details/86544157
69. Unity3D游戏开发之 模型、纹理、音频等资源导入事件监控:https://blog.csdn.net/linxinfa/article/details/86518733
70. ajax笔记:https://blog.csdn.net/linxinfa/article/details/86251700
71. 接YSDK上架应用宝遇到的问题及解决办法:https://blog.csdn.net/linxinfa/article/details/86084072
72. svn log命令的使用:https://blog.csdn.net/linxinfa/article/details/86078669
73. python写excel:https://blog.csdn.net/linxinfa/article/details/86004334
74. python2安装pip:https://blog.csdn.net/linxinfa/article/details/85999905
75. pc上火车票抢票神器:https://blog.csdn.net/linxinfa/article/details/85768431
76. webpy使用笔记:https://blog.csdn.net/linxinfa/article/details/85762531
77. 如何拿到别人的ipa包:https://blog.csdn.net/linxinfa/article/details/85597737
78. iOS armv7, armv7s, arm64区别与应用32位、64位配置:https://blog.csdn.net/linxinfa/article/details/85336462
79. 判断Android是否为模拟器:https://blog.csdn.net/linxinfa/article/details/85251631
80. XCode连苹果真机查看运行log:https://blog.csdn.net/linxinfa/article/details/85092532
81. 腾讯YSDK米大师接入:https://blog.csdn.net/linxinfa/article/details/84994146
82. Unity计算帧率:https://blog.csdn.net/linxinfa/article/details/84346182
83. Unity3D研究院之MAC&Windows跨平台解析Excel:https://blog.csdn.net/linxinfa/article/details/84344955
84. Unity用代码设置游戏icon:https://blog.csdn.net/linxinfa/article/details/84248474
85. iOS使用TestFlight进行内部和外部人员测试:https://blog.csdn.net/linxinfa/article/details/84142588
86. Unity游戏icon压缩格式设置(解决图标不清晰问题):https://blog.csdn.net/linxinfa/article/details/84140334
87. Xcode archive 打包ipa过程图解:https://blog.csdn.net/linxinfa/article/details/84070516
88. 关于Certificate、Provisioning Profile、App ID的介绍及其之间的关系:https://blog.csdn.net/linxinfa/article/details/83818496
89. 用Visual Studio查看图片的二进制流:https://blog.csdn.net/linxinfa/article/details/83793056
90. IOS马甲包混淆:https://blog.csdn.net/linxinfa/article/details/83792772
91. UGUI图文混排插件Text Mesh Pro:https://blog.csdn.net/linxinfa/article/details/83343304
92. 常用的工具软件:https://blog.csdn.net/linxinfa/article/details/82734526
93. Shell学习笔记:https://blog.csdn.net/linxinfa/article/details/82414147
94. 实用的FTP工具:https://blog.csdn.net/linxinfa/article/details/82386796
95. windows与mac文件夹共享:https://blog.csdn.net/linxinfa/article/details/82386413
96. 安装python3并安装paramiko:https://blog.csdn.net/linxinfa/article/details/82386364
97. Unity如何在Inspector中预览lua脚本的内容:https://blog.csdn.net/linxinfa/article/details/82215911
98. Android和iOS包批量重签名:https://blog.csdn.net/linxinfa/article/details/81066703
99. Unity修炼笔记:https://blog.csdn.net/linxinfa/article/details/80648374
100. iOS崩溃堆栈信息的符号化解析:https://blog.csdn.net/linxinfa/article/details/80485303
101. Unity Shader Graph学习笔记:https://blog.csdn.net/linxinfa/article/details/80388277
102. 使用最新版wampserver搭建 WAMP 平台超简单实用教程:https://blog.csdn.net/linxinfa/article/details/80404710
103. Unity发布webgl的一些问题:https://blog.csdn.net/linxinfa/article/details/80363671
104. Mac如何共享文件夹:https://blog.csdn.net/linxinfa/article/details/80328808
105. Unity 使用Fresnel Effect实现边缘光:https://blog.csdn.net/linxinfa/article/details/80309766
106. 用Laya开发微信小游戏:https://blog.csdn.net/linxinfa/article/details/80156348
107. 使用python将一定格式的文本转成csv文件供excel做数据分析:https://blog.csdn.net/linxinfa/article/details/79979487
108. Python常用函数的使用实例:https://blog.csdn.net/linxinfa/article/details/79902178
109. Python GUI编程(Tkinter):https://blog.csdn.net/linxinfa/article/details/79868816
110. Python远程SSH:https://blog.csdn.net/linxinfa/article/details/79868733
111. Unity通过AssetBundle加载资源实例化在iOS上崩问题的解决(Strip导致):https://blog.csdn.net/linxinfa/article/details/79757492
112. Unity 2017.1新功能 | Sprite Atlas与Sprite Mask详解:https://blog.csdn.net/linxinfa/article/details/79743559
113. Unity机器学习ML-Agent的使用:https://blog.csdn.net/linxinfa/article/details/79712513
114. Unity3D研究院之手游开发中所有特殊的文件夹:https://blog.csdn.net/linxinfa/article/details/79696016
115. MyEclipse破解教程:https://blog.csdn.net/linxinfa/article/details/79672461
116. Eclipse调试夜神模拟器:https://blog.csdn.net/linxinfa/article/details/79624703
117. 小米开放平台接入笔记:https://blog.csdn.net/linxinfa/article/details/79616666
118. 苹果开发者账号申请:https://blog.csdn.net/linxinfa/article/details/79564616
119. unity Timeline封装:https://blog.csdn.net/linxinfa/article/details/79557744
120. Unity2017 Timeline实例解析:游戏场景中的动画:https://blog.csdn.net/linxinfa/article/details/79531844
121. Unity使用Post/Get提交数据到http服务器:https://blog.csdn.net/linxinfa/article/details/79488967
122. es6中export、export default、import的理解:https://blog.csdn.net/linxinfa/article/details/79447519
123. Lua异常捕获_try catch封装: pcall和xpcall:https://blog.csdn.net/linxinfa/article/details/79281451
124. JavaScript的undefine和null:https://blog.csdn.net/linxinfa/article/details/79252230
125. 轻松使用Nginx搭建web服务器:https://blog.csdn.net/linxinfa/article/details/79216890
126. Apache 与 Nginx 比较:https://blog.csdn.net/linxinfa/article/details/79216675
127. H5无法调起android app 的坑之 scheme 大小写:https://blog.csdn.net/linxinfa/article/details/79154909
128. javascript Prototype constructor的理解:https://blog.csdn.net/linxinfa/article/details/79023106
129. JavaScript笔记:https://blog.csdn.net/linxinfa/article/details/79022712
130. Unity动态构建Mesh来绘制任意多边形:https://blog.csdn.net/linxinfa/article/details/78816362
131. 给Unity的GameObject拓展接口:https://blog.csdn.net/linxinfa/article/details/78816204
132. Unity C#执行bat脚本:https://blog.csdn.net/linxinfa/article/details/78690488
133. unity反向查找资源依赖:https://blog.csdn.net/linxinfa/article/details/78519469
134. unity关于android打包性能设置:https://blog.csdn.net/linxinfa/article/details/78458409
135. 垂直同步:https://blog.csdn.net/linxinfa/article/details/78458254
136. URI和URL的区别:https://blog.csdn.net/linxinfa/article/details/78126288
137. 网页调用 iOS/Android 客户端:https://blog.csdn.net/linxinfa/article/details/78126237
138. Unity3D研究院编辑器之创建旧版动画:https://blog.csdn.net/linxinfa/article/details/78123969
139. Unity真机非全屏播放视频:https://blog.csdn.net/linxinfa/article/details/78115614
140. lua中删除元素:https://blog.csdn.net/linxinfa/article/details/78113162
141. Unity3D编辑器:删掉MissingScirpt脚本:https://blog.csdn.net/linxinfa/article/details/78047996
142. c#和lua的反射:https://blog.csdn.net/linxinfa/article/details/78034404
143. Adreno GPU Profiler工具使用总结:https://blog.csdn.net/linxinfa/article/details/77712942
144. Go语言实现简单的web服务器:https://blog.csdn.net/linxinfa/article/details/77679692
145. unity移动平台路径问题:https://blog.csdn.net/linxinfa/article/details/77678849
146. Go语言学习笔记:https://blog.csdn.net/linxinfa/article/details/77678558
147. Unity发布Android,UIInput拉起输入法字都是白色的问题:https://blog.csdn.net/linxinfa/article/details/77675757
148. Unity的Scene场景选中物体Hierarchy窗口无法锁定选中的物体的问题:https://blog.csdn.net/linxinfa/article/details/77675671
149. GO语言的import:https://blog.csdn.net/linxinfa/article/details/77601432
150. 关于Unity android打包的keystore:https://blog.csdn.net/linxinfa/article/details/77572382
151. 光猫连接无线路由器:https://blog.csdn.net/linxinfa/article/details/76563173
152. lua解析json:https://blog.csdn.net/linxinfa/article/details/76557700
153. 微信开放平台android接入笔记(unity3d):https://blog.csdn.net/linxinfa/article/details/74994233
154. Eclipse好用的快捷键:https://blog.csdn.net/linxinfa/article/details/74911211
155. Android反编译:https://blog.csdn.net/linxinfa/article/details/74852421
156. Unity中动态阴影的制作:https://blog.csdn.net/linxinfa/article/details/73108328
157. 手把手教你使用Unity的Behavior Designer:https://blog.csdn.net/linxinfa/article/details/72937709
158. Unity与Android的交互:https://blog.csdn.net/linxinfa/article/details/72852155
159. sqlite学习:https://blog.csdn.net/linxinfa/article/details/71270427
160. bash学习笔记:https://blog.csdn.net/linxinfa/article/details/71158008
161. A* Pathfinding Project (Unity A*寻路插件) 使用教程:https://blog.csdn.net/linxinfa/article/details/71080462
162. python的代码缩进:https://blog.csdn.net/linxinfa/article/details/70991556
163. web.py学习笔记:https://blog.csdn.net/linxinfa/article/details/70952587
164. web.py框架:https://blog.csdn.net/linxinfa/article/details/70940112
165. 使用命令行运行unity并执行某个静态函数(运用于远程打包):https://blog.csdn.net/linxinfa/article/details/70914939
166. UGUI 列表循环使用:https://blog.csdn.net/linxinfa/article/details/70767777
167. Unity 代码混淆: CodeGuard的使用:https://blog.csdn.net/linxinfa/article/details/70767114
168. Unity 查看所有GUI默认样式:https://blog.csdn.net/linxinfa/article/details/70445451
169. html笔记:https://blog.csdn.net/linxinfa/article/details/70256392
170. python和lua的socket实例:https://blog.csdn.net/linxinfa/article/details/70228275
171. WinSCP的使用:https://blog.csdn.net/linxinfa/article/details/70157278
172. linux 笔记:https://blog.csdn.net/linxinfa/article/details/64907039
173. Mecanim Animator使用详解:https://blog.csdn.net/linxinfa/article/details/55667835
174. Unity 5 AudioMixer:https://blog.csdn.net/linxinfa/article/details/55667763
175. UIScrollView复用节点示例:https://blog.csdn.net/linxinfa/article/details/55210490
176. Unity发布的ios包在iphone上声音是从听筒里出来的问题:https://blog.csdn.net/linxinfa/article/details/55101602
177. lua table排序:https://blog.csdn.net/linxinfa/article/details/54981464
178. C#用正则表达式去匹配被双引号包起来的中文:https://blog.csdn.net/linxinfa/article/details/54881112
179. unity5.x assetbundle打包和加载:https://blog.csdn.net/linxinfa/article/details/54865998
180. 5.x依赖打包:https://blog.csdn.net/linxinfa/article/details/54861848
181. C#如何通过反射获取方法以及动态调用方法:https://blog.csdn.net/linxinfa/article/details/54647315
182. UnityEditor常用函数:https://blog.csdn.net/linxinfa/article/details/54647165
183. c#装箱操作和拆箱操作:https://blog.csdn.net/linxinfa/article/details/54093727
184. unity AssetBundleManifest:https://blog.csdn.net/linxinfa/article/details/53750317
185. C#判断机器是32位还是64位:https://blog.csdn.net/linxinfa/article/details/53750249
186. android 签名:https://blog.csdn.net/linxinfa/article/details/53692655
187. sourceCRT 传文件中途乱码的问题:https://blog.csdn.net/linxinfa/article/details/53582380
188. linux将本目录下的大小为0的文件移除:https://blog.csdn.net/linxinfa/article/details/53581172
189. python md5:https://blog.csdn.net/linxinfa/article/details/53574986
190. svn 常用指令:https://blog.csdn.net/linxinfa/article/details/53427898
191. 缓存服务器:https://blog.csdn.net/linxinfa/article/details/53364196
192. 关于腾讯MSDK的平台大区划分:https://blog.csdn.net/linxinfa/article/details/53207517
193. 关于moba游戏的移动同步技术:https://blog.csdn.net/linxinfa/article/details/53150385
194. svn合并的一些坑:https://blog.csdn.net/linxinfa/article/details/53019611
195. Windows 的路径中表示文件层级为什么会用反斜杠 ‘\’,而 UNIX 系统都用斜杠 ‘/’?:https://blog.csdn.net/linxinfa/article/details/52983362
196. Unity中C#如何执行cmd命令(System.Diagnostics.Process的使用):https://blog.csdn.net/linxinfa/article/details/52982384
197. 谈谈类之间的关联关系与依赖关系:https://blog.csdn.net/linxinfa/article/details/52934732
198. unity mesh合并:https://blog.csdn.net/linxinfa/article/details/52912631
199. eclipse识别不了模拟器解决办法:https://blog.csdn.net/linxinfa/article/details/52849050
200. svn提交的一个坑:https://blog.csdn.net/linxinfa/article/details/52252214
201. MSDK手Q邀请透传参数问题:url编解码与base64编解码:https://blog.csdn.net/linxinfa/article/details/52187553
202. eclipse识别不到真机设备问题的解决:https://blog.csdn.net/linxinfa/article/details/52152936
203. Unity3D Shader 入门:https://blog.csdn.net/linxinfa/article/details/52107989
204. Unity项目优化:https://blog.csdn.net/linxinfa/article/details/52016169
205. unity NGUI图文混排:https://blog.csdn.net/linxinfa/article/details/52013523
206. 继上一篇,制作序列化类的编辑器:https://blog.csdn.net/linxinfa/article/details/51973500
207. unity 类的序列化:https://blog.csdn.net/linxinfa/article/details/51971633
208. Unity UGGUI RawImage 渲染小地图:https://blog.csdn.net/linxinfa/article/details/51957414
209. UGUI如何在UI与UI直接穿插粒子特效和模型:https://blog.csdn.net/linxinfa/article/details/51955484
210. UGUI工厂:https://blog.csdn.net/linxinfa/article/details/51932982
211. C#文件读写常用接口:https://blog.csdn.net/linxinfa/article/details/51924008
212. ulua热更新小demo:https://blog.csdn.net/linxinfa/article/details/51920802
213. Unity ipv6的支持:https://blog.csdn.net/linxinfa/article/details/51909664
214. unity 5.x android发布注意事项:https://blog.csdn.net/linxinfa/article/details/51909222
215. 一个app启动另一个app:https://blog.csdn.net/linxinfa/article/details/51891480
216. vs2010字体和颜色的舒适设置:https://blog.csdn.net/linxinfa/article/details/51888734
217. lua中metatable学习:https://blog.csdn.net/linxinfa/article/details/51854924
218. Lua 中实现面向对象:https://blog.csdn.net/linxinfa/article/details/51833541
219. Network-Emulator Network-Emulator-Toolkit网络模拟器使用详细介绍:https://blog.csdn.net/linxinfa/article/details/51833249
220. lua遍历table:https://blog.csdn.net/linxinfa/article/details/51833071
221. unity自制延迟定时回调:https://blog.csdn.net/linxinfa/article/details/51832460
222. ulua使用笔记:https://blog.csdn.net/linxinfa/article/details/51824500
223. ulua与unity互传数组:https://blog.csdn.net/linxinfa/article/details/51811863
224. ulua学习笔记1:https://blog.csdn.net/linxinfa/article/details/51811765
225. Mono.Cecil简介与示例:https://blog.csdn.net/linxinfa/article/details/51803200
226. System.Reflection简介:https://blog.csdn.net/linxinfa/article/details/51803069
227. Unity3d资源处理器AssetPostprocessor简单用法:https://blog.csdn.net/linxinfa/article/details/51801319
228. 适配器模式:https://blog.csdn.net/linxinfa/article/details/51790175
229. JavaScript学习笔记:https://blog.csdn.net/linxinfa/article/details/51785624
230. HTML学习笔记:https://blog.csdn.net/linxinfa/article/details/51783178
231. MySql学习笔记:https://blog.csdn.net/linxinfa/article/details/51778638
232. 装饰器模式:https://blog.csdn.net/linxinfa/article/details/51775920
233. Objective-C学习笔记:https://blog.csdn.net/linxinfa/article/details/51772097
234. C++ 信号处理:https://blog.csdn.net/linxinfa/article/details/51768999
235. URL Scheme:https://blog.csdn.net/linxinfa/article/details/51741823
236. TCP/IP:https://blog.csdn.net/linxinfa/article/details/51720337
237. 版本控制的分支策略及初步实践:https://blog.csdn.net/linxinfa/article/details/51683185
238. Unity文件操作路径:https://blog.csdn.net/linxinfa/article/details/51679528
239. MongoDB学习:https://blog.csdn.net/linxinfa/article/details/51660965
240. 如何设置Atlas的Texture:https://blog.csdn.net/linxinfa/article/details/51638220
241. Unity3D性能优化:https://blog.csdn.net/linxinfa/article/details/51636894
242. 运算符重载:https://blog.csdn.net/linxinfa/article/details/51635551
243. 检测预设资源是否有UIFont为空的编辑器:https://blog.csdn.net/linxinfa/article/details/51614786
244. 递归调用示例:https://blog.csdn.net/linxinfa/article/details/51613000
245. BOM头的检测:https://blog.csdn.net/linxinfa/article/details/51585110
246. SSH Unexpected socket error:10106的解决办法:https://blog.csdn.net/linxinfa/article/details/51584640
247. SSH远程登录:https://blog.csdn.net/linxinfa/article/details/51584272
248. unity常用Attribute:https://blog.csdn.net/linxinfa/article/details/51582351
249. 腾讯新版MSDK for Unity:https://blog.csdn.net/linxinfa/article/details/51582088
250. 使用脚本将Unity的ogg音效全部改为2d音效:https://blog.csdn.net/linxinfa/article/details/51519974
251. Unity使用HttpWebRequest远程下载文件:https://blog.csdn.net/linxinfa/article/details/51097305
252. 脚本打包:https://blog.csdn.net/linxinfa/article/details/51096997
253. 项目中的问题与解决方案:https://blog.csdn.net/linxinfa/article/details/50909005
254. unity自动打包经验:https://blog.csdn.net/linxinfa/article/details/50528642
255. 接口和抽象类的区别:https://blog.csdn.net/linxinfa/article/details/49591035
256. Unity使用RenderTexture进行截屏:https://blog.csdn.net/linxinfa/article/details/49493775
257. unity中对于scrollview下拉加载的方法:https://blog.csdn.net/linxinfa/article/details/49486793
258. unity中的坐标转换:https://blog.csdn.net/linxinfa/article/details/49447261
259. jdk环境变量配置:https://blog.csdn.net/linxinfa/article/details/49333821
260. MVC设计模式:https://blog.csdn.net/linxinfa/article/details/49307549
261. android截屏代码:https://blog.csdn.net/linxinfa/article/details/49280021
262. 关于Android真机调试:https://blog.csdn.net/linxinfa/article/details/49148103
263. Unity与java相互调用:https://blog.csdn.net/linxinfa/article/details/49004131
264. 单例模板类:https://blog.csdn.net/linxinfa/article/details/48740615
265. 【转】Unity中HideInInspector和SerializeField一起使用:https://blog.csdn.net/linxinfa/article/details/48522469
266. Unity3D试题:https://blog.csdn.net/linxinfa/article/details/47358985
267. unity AssetBundle的资源管理:https://blog.csdn.net/linxinfa/article/details/47320343
268. C#解析XML:https://blog.csdn.net/linxinfa/article/details/47312833
269. Linux制作run安装包:https://blog.csdn.net/linxinfa/article/details/47300239
270. 写C# dll供Unity调用:https://blog.csdn.net/linxinfa/article/details/47295279
271. Visual Studio我常用的快捷键:https://blog.csdn.net/linxinfa/article/details/47058269
272. 使用UnityEditor做工具:https://blog.csdn.net/linxinfa/article/details/47058183
273. 关于MSDK的几个难点:https://blog.csdn.net/linxinfa/article/details/46777109
274. 腾讯MSDK for Unity:https://blog.csdn.net/linxinfa/article/details/46760129
275. 腾讯MSDK手Q微信授权登录:https://blog.csdn.net/linxinfa/article/details/46754991
276. xml中常用的转义符:https://blog.csdn.net/linxinfa/article/details/46572861
277. unity3D 音频播放:https://blog.csdn.net/linxinfa/article/details/46507483
278. unity3D 在屏幕边框创建碰撞框:https://blog.csdn.net/linxinfa/article/details/46506261
279. Unity3D 摇一摇功能:https://blog.csdn.net/linxinfa/article/details/46505071
280. unity3D 射线碰撞检测:https://blog.csdn.net/linxinfa/article/details/46504585
281. Matlab与C++混合编程:https://blog.csdn.net/linxinfa/article/details/46484079
282. C++的dll导出类:https://blog.csdn.net/linxinfa/article/details/46484015
283. C++链接库的编写与调用:https://blog.csdn.net/linxinfa/article/details/46483941
284. Unity3D 安卓发布:https://blog.csdn.net/linxinfa/article/details/46482933
285. C++解析一段以;分隔的字符串:https://blog.csdn.net/linxinfa/article/details/46482851
286. Linux 常用命令总结:https://blog.csdn.net/linxinfa/article/details/46482611
287. Android的WebView控件:https://blog.csdn.net/linxinfa/article/details/46482417
288. 使用NDK编译C/C++为.so文件:https://blog.csdn.net/linxinfa/article/details/46482303
289. 使用office2010将Excel转xml:https://blog.csdn.net/linxinfa/article/details/46482275
290. C#给C++传参的兼容问题:https://blog.csdn.net/linxinfa/article/details/46482043
291. C# 调用C++链接库与回调:https://blog.csdn.net/linxinfa/article/details/46481803
292. C# 数据封装和解析:https://blog.csdn.net/linxinfa/article/details/46471783
293. C# Socket模块:https://blog.csdn.net/linxinfa/article/details/46469107
294. unity3D 让粒子在UI上播放:https://blog.csdn.net/linxinfa/article/details/46408101
295. 在3D物体上创建UI:https://blog.csdn.net/linxinfa/article/details/46391287
296. 加锁单例:https://blog.csdn.net/linxinfa/article/details/46391045
297. unity3d 根据手指触摸的位置去放置UI:https://blog.csdn.net/linxinfa/article/details/46390513
298. unity3D LineRender的使用:插值移动终点:https://blog.csdn.net/linxinfa/article/details/46390427
299. unity3D 旋转3D物体:https://blog.csdn.net/linxinfa/article/details/46386567
300. C# 字符串md5加密:https://blog.csdn.net/linxinfa/article/details/46366353
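The title step in the script above works by extracting the children of the a tag and picking one by position. A non-destructive alternative is sketched below: it joins only the text nodes that are direct children of the tag. The helper own_text and the sample markup are hypothetical, written for illustration; the real page structure may differ and can change over time.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on one entry of an article list.
SAMPLE = (
    '<h4><a href="https://example.com/article/1">'
    '<span class="article-type">Original</span> Sample title </a></h4>'
)


def own_text(tag):
    """Join the text nodes that are direct children of tag, skipping child tags."""
    return ''.join(tag.find_all(string=True, recursive=False)).strip()


soup = BeautifulSoup(SAMPLE, 'html.parser')
tag_a = soup.find('h4').find('a')

print(tag_a.string)     # None, because the a tag has two children
print(own_text(tag_a))  # Sample title
print(tag_a['href'])    # https://example.com/article/1
```

Because nothing is extracted, the span stays in the tree and the tag can still be inspected afterwards.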

Next, a script that periodically checks the view counts of my blog articles and prints any changes:

import requests
from bs4 import BeautifulSoup
import time

article_dic = {}
while True:
    article_id = 0
    for i in range(1, 19):
        # crawl the article lists on pages 1 to 18 (range() excludes its upper bound)
        web_url = 'https://blog.csdn.net/linxinfa/article/list/%s?' % (i)
        r = requests.get(web_url)
        html_txt = r.text
        # remove <br> line-break tags
        html_txt = html_txt.replace('<br>', '').replace('<br/>', '')
        soup = BeautifulSoup(html_txt, 'lxml')
        tag_main = soup.find('main')
        tag_div = tag_main.find('div', class_='article-list')
        tag_article_all = tag_div.find_all('div', class_='article-item-box csdn-tracking-statistics')
        for article_item in tag_article_all:
            article_h4 = article_item.find('h4')
            article_h4_a = article_h4.find('a')
            if 'linxinfa' in article_h4_a['href']:
                title = u'《' + [s.extract() for s in article_h4_a][1].strip() + u'》'
                read_num_all = article_item.find_all('span', class_='read-num')
                read_num = int(read_num_all[0].find('span', class_='num').string)
                if article_id in article_dic:
                    old_read_cnt = article_dic[article_id]['cnt']
                    if old_read_cnt != read_num:
                        # '被阅读了' means 'was read'
                        print(title, '被阅读了', 'old_read_cnt: ', old_read_cnt, 'new_read_cnt: ', read_num)
                        article_dic[article_id] = {'title': title, 'cnt': read_num}
                else:
                    article_dic[article_id] = {'title': title, 'cnt': read_num}
                article_id += 1
    time.sleep(60)

Output (whenever an article's view count changes, the change is detected):

《uGUI学习篇: UI元素的渲染与性能》 被阅读了 old_read_cnt:  15 new_read_cnt:  16
《uGUI学习篇: Unity用脚本创建以Image显示的精灵动画》 被阅读了 old_read_cnt:  12 new_read_cnt:  13
《Python爬虫学习笔记》 被阅读了 old_read_cnt:  26 new_read_cnt:  27
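The core of the monitoring script is the comparison between the stored counts and the freshly scraped ones. A sketch of that comparison as a pure function follows, so it can be tried without any network access. The function detect_changes and the sample data are hypothetical; keying the dictionaries by article URL instead of list position is an assumption made here so that a newly published article does not shift every index.

```python
def detect_changes(previous, current):
    """Return (key, old_count, new_count) for every entry whose count changed.

    Entries seen for the first time are not reported as changes.
    """
    changes = []
    for key, new_cnt in current.items():
        old_cnt = previous.get(key)
        if old_cnt is not None and old_cnt != new_cnt:
            changes.append((key, old_cnt, new_cnt))
    return changes


# Made-up snapshots from two consecutive polling rounds.
previous = {'/article/1': 15, '/article/2': 12}
current = {'/article/1': 16, '/article/2': 12, '/article/3': 1}

changes = detect_changes(previous, current)
for key, old_cnt, new_cnt in changes:
    print(key, 'old_read_cnt:', old_cnt, 'new_read_cnt:', new_cnt)
# /article/1 old_read_cnt: 15 new_read_cnt: 16
```

In a polling loop, previous would be replaced by current after each round, followed by the same time.sleep() pause as in the script above.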
