用Python编写博客导出工具
用Python编写博客导出工具
罗朝辉 (http://kesalin.github.io/)
CC 许可,转载请注明出处
写在前面的话
我在 github 上用 octopress 搭建了个人博客,octopress 使用Markdown语法编写博文。之前我在CSDN博客上也写过不少的技术博文,都说自己的孩子再丑也是个宝,所以就起了把CSDN博客里面的文章导出到个人博客上的念头。刚开始想找个工具把CSDN博客导出为xml或文本,然后再把xml或文本转换为Markdown博文。可惜搜了一下现有博客导出工具,大部分要收费才能将全部博文导出为xml格式,所以就只好发明轮子了:写个工具将全部博文导出为Markdown博文(也是txt格式的)。
我将详细介绍这个工具的编写过程,希望没有学习过编程的人也能够学会一些简单的Python语法来修改这个脚本工具,以满足他们将其他类型的博客导出为文本格式。这也是我第一次学习和使用Python,所以相信我,你一定也可以将自己的博客导出为想要的文本格式。
本文源代码在这里:ExportCSDNBlog.py
考虑到大部分非程序员使用Windows系统,下面将介绍在Windows下如何编写这个工具。
下载工具
在 Windows 下安装Python开发环境(Linux/Mac下用pip安装相应包即可,程序员自己解决咯):
Python 2.7.3
请安装这个版本,更高版本的Python与一些库不兼容。
下载页面
下载完毕双击可执行文件进行安装,默认安装在C:\Python2.7。
six
下载页面 下载完毕,解压到Python安装目录下,如C:\Python2.7\six-1.8.0目录下。
BeautifulSoup 4.3.2
下载页面, 下载完毕,解压到Python安装目录下,如C:\Python2.7\BeautifulSoup目录下。
html5lib
下载页面 下载完毕,解压到Python安装目录下,如C:\Python2.7\html5lib-0.999目录下。
安装工具
Windows下启动命令行,依次进入如下目录,执行setup.py install进行安装:
C:\Python2.7\six-1.8.0>setup.py install
C:\Python2.7\html5lib-0.999>setup.py install
C:\Python2.7\BeautifulSoup>setup.py install
参考文档
Python 2.X文档
BeautifulSoup文档
正则表达式文档
正则表达式在线测试
用到的Python语法
这个工具只用到了一些基本的Python语法,如果你没有Python基础,稍微了解一下如下博文是很有好处的。
- string: 字符串操作,参考python: string的操作函数
- list: 列表操作,参考Python list 操作
- dictionary: 字典操作,参考Python中dict详解
- datetime: 日期时间,参考python datetime处理时间
编写博客导出工具
分析
首先来分析这样一个工具的需求:
导出所有CSDN博客文章为Markdown文本。
这个总需求其实可以分两步来做:
* 获得CSDN博客文章
* 将文章转换为Markdown文本
针对第一步:如何获取博客文章呢?
打开任何一个CSDN博客,我们都可以看到下方的页面导航显示“XXX条数据 共XXX页 1 2 3 … 尾页”,我们从这个地方入手考虑。每个页面上都会显示属于该页的文章标题及文章链接,如果我们依次访问这些页面链接,就能从每个页面链接中找出属于该页面的文章标题及文章链接。这样所有的文章标题以及文章链接就都获取到了,有了这些文章链接,我们就能获取对应文章的html内容,然后通过解析这些html页面来生成相应Markdown文本了。
实现
从上面的分析可以看出,首先我们需要根据首页获取所有的页面链接,然后遍历每一个页面链接来获取文章链接。
- 获取页面链接的代码:
<span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">1</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">2</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">3</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">4</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">5</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">6</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">7</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">8</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">9</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">10</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">11</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">12</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">13</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">14</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">15</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">16</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">17</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">18</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">19</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">20</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">21</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">22</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">23</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">24</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">25</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">26</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">27</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">28</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">29</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">30</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">31</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">32</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">33</span> |
|
参数 url = “http://blog.csdn.net/” + username,即你首页的网址。通过urllib2库打开这个url发起一个web请求,从response中获取返回的html页面内容保存到data中。你可以被注释的 print data 来查看到底返回了什么内容。
有了html页面内容,接下来就用BeautifulSoup来解析它。BeautifulSoup极大地减少了我们的工作量。我会详细在这里介绍它的使用,后面再次出现类似的解析就会从略了。soup.find_all(id=“papelist”) 将会查找html页面中所有id=“papelist”的tag,然后返回包含这些tag的list。对应 CSDN 博文页面来说,只有一处地方:
<span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">1</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">2</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">3</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">4</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">5</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">6</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">7</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">8</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">9</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">10</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">11</span> |
|
好,我们获得了papelist 的tag对象,通过这个tag对象我们能够找出尾页tag a对象,从这个tag a解析出对应的href属性,获得尾页的编号12,然后自己拼出所有page页面的访问url来,并保存在pageUrlList中返回。page页面的访问url形式示例如下:
> page 1: http://blog.csdn.net/kesalin/article/list/1
- 根据page来获取文章链接的代码:
<span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">1</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">2</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">3</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">4</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">5</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">6</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">7</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">8</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">9</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">10</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">11</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">12</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">13</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">14</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">15</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">16</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">17</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">18</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">19</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">20</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">21</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">22</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">23</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">24</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">25</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">26</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">27</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">28</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">29</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">30</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">31</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">32</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">33</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">34</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">35</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">36</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">37</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">38</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">39</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">40</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">41</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">42</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">43</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">44</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">45</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">46</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">47</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">48</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">49</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">50</span> |
|
从第一步获得所有的page链接保存在pageUrlList中,接下来就根据这些page 页面来获取对应page的article链接和标题。关键代码是下面这三行:
topArticleDocs = soup.find_all(id="article_toplist")
articleDocs = soup.find_all(id="article_list")
articleListDocs = articleListDocs + topArticleDocs + articleDocs
从page的html内容中查找置顶的文章(article_toplist)以及普通的文章(article_list)的tag对象,然后将这些tag保存到articleListDocs中。
article_toplist示例:(article_list的格式是类似的)
<span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">1</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">2</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">3</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">4</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">5</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">6</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">7</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">8</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">9</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">10</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">11</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">12</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">13</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">14</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">15</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">16</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">17</span> |
|
然后遍历所有的保存到articleListDocs里的tag对象,从中解析出link_title的span tag对象保存到linkDocs中;然后从中解析出链接的url和标题,这里去掉了置顶文章标题中的“置顶”两字;最后将url和标题保存到artices列表中返回。artices列表中的每一项内容示例:
title:招聘:有兴趣做一个与Android对等的操作系统么?
url:http://blog.csdn.net/kesalin/article/details/10474007
- 根据文章链接获取文章html内容并解析转换为Markdown文本
<span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">1</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">2</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">3</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">4</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">5</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">6</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">7</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">8</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">9</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">10</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">11</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">12</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">13</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">14</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">15</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">16</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">17</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">18</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">19</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">20</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">21</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">22</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">23</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">24</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">25</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">26</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">27</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">28</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">29</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">30</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">31</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">32</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">33</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">34</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">35</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">36</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">37</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">38</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">39</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">40</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">41</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">42</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">43</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">44</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">45</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">46</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">47</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">48</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">49</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">50</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">51</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">52</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">53</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">54</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">55</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">56</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">57</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">58</span> |
|
同前面的分析类似,在这里通过访问具体文章页面获得html内容,从中解析出文章标题,分类,发表时间,文章内容信息。然后把这些内容传递给函数exportToMarkdown,在其中生成相应的Markdown文本文件。值得一提的是,在解析文章内容信息时,由于html文档内容有一些特殊的标签或转义符号,需要作特殊处理,这些特殊处理在函数htmlContent2String中进行。目前只导出了所有的文本内容,图片,url链接以及表格都没有处理,后续我会尽量完善这些转换。
<span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">1</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">2</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">3</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">4</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">5</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">6</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">7</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">8</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">9</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">10</span> |
|
目前仅仅是删除所有的html标签,并在函数decodeHtmlSpecialCharacter中转换转义字符。
- 生成Markdown文本文件
<span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">1</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">2</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">3</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">4</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">5</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">6</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">7</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">8</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">9</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">10</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">11</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">12</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">13</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">14</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">15</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">16</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">17</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">18</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">19</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">20</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">21</span> <span class="line-number" style="margin: 0px; padding: 0px; border: 0px; font-family: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; font-stretch: inherit; line-height: inherit; vertical-align: baseline; color: rgb(88, 110, 117) !important;">22</span> |
|
生成Markdown文本文件就很简单了,在这里我需要生成github page用的Markdown博文形式,所以内容如此,你可以根据你的需要修改为其他形式的文本内容。
原文链接:http://blog.csdn.net/kesalin/article/details/40222093
用Python编写博客导出工具相关推荐
- 【小工具】CSDN博客导出工具-Java集成Maven开发
CSDN博客导出工具 之前一直想把CSDN的博客导入到自己的网站中,可是由于博客比较多,后面受朋友老郭启发,就找了个时间用Java开发了这款小工具. 转载请注明出处:http://chenhaoxia ...
- 【开源】博客导出工具
有很多朋友在遇到一些好博客文章的时候,都想把它们下载到电脑上,转换成某些格式的文档,以方便存储.阅读. 本人就这些需求,特开发了C#版[博客导出工具]. 该工具现支持的网站包括: CSDN.ITEYE ...
- CSDN博客导出工具
我把 CSDN 博客当作笔记本来用,记录我遇到的坎儿 突然有一天夜里,我写的博客提交之后都没了,也不知道是哪里出了问题,一般白天博客审查很快,夜里都下班了得等到第二天审查,可是第二天发现博客空了,草稿 ...
- CSDN博客导出备份工具
写的文章太多,虽然都不是什么有深度的文章,大多只是做做笔记,但还是担心,有一天博客被误删了怎么办,所以上网找,碰巧找到了这样一个csdn博客导出工具: [csdn博客文章]导出备份 貌似有的用户反馈没 ...
- 自己动手编写CSDN博客备份工具-blogspider之源码分析(3)
作者:gzshun. 原创作品,转载请标明出处! 来源:http://blog.csdn.net/gzshun 周星驰:剪头发不应该看别人怎么剪就发神经跟流行,要配合啊!你看你的发型,完全不配合你的脸 ...
- 快速利用工具编写博客
快速利用工具编写博客 博客不只是给别人看的,同时也是给自己看的.平时自己的一些想法.规划.总结等,可以用博客记录,那么怎么才能告别繁琐语法(喜欢纯代码除外)而快速编写博客呢? 一.下载工具 百 ...
- python+shell 备份 CSDN 博客文章,CSDN博客备份工具
python+shell 备份 CSDN 博客文章,CSDN博客备份工具 在 csdn 写了几年的博客了.多少也积累了两三百篇博文,近日,想把自己的这些文章全部备份下来,于是开始寻找解决方案. 我找到 ...
- 【Android 逆向】使用 Python 编写 APK 批处理分析工具
文章目录 一.涉及到的工具和脚本 二.使用 Python 编写 APK重打包工具 三.博客源码 一.涉及到的工具和脚本 apktool.jar : 反编译 APK 文件使用到的工具 ; 参考 [And ...
- Mac平台上的一个MarkDown编辑器和静态博客生成工具-mweb mac最新版下载
MWeb for Mac是一款Markdown + 文档管理 + 静态网页生成,集大成的 Markdown 应用.MWeb界面简洁高效.功能强大,全面支持 Github Flavored Markdo ...
最新文章
- 当年轻人开始谈论AI伦理
- acronym与abbr
- 01背包 || BZOJ 1606: [Usaco2008 Dec]Hay For Sale 购买干草 || Luogu P2925 [USACO08DEC]干草出售Hay For Sale...
- matlab调用opencv的函数
- java 反射用法_Java 反射的概念与使用
- c++ List(双向链表)
- C++ STL stirng的复制比较
- List.Sort用法
- svn+http+ad域
- 学生成绩abcde怎样划分_7月学考成绩出来啦!
- Mac控制中心使用方法
- 查看Sql Server2016是否激活
- 对话系统数据集--CrossWOZ
- 孙玄/陈东:聊一聊ZooKeeper的顺序一致性
- mysql嵌套查询效率低_mysql的嵌套查询效率很低
- 科学道德与学风-2021雨课堂答案-第4章
- ggplot2设置坐标轴范围_6.6 坐标轴:设置坐标轴上刻度的显示位置
- Layer Tree 绘制
- 用html写除法竖式代码,除法的竖式写法
- 【通信】经PPM调制的超宽带信号经斯白噪声信道的系统matlab仿真