A Beginner's Python Crawler (Part 1)

Preface

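A beginner-level Python crawler only requires the following four techniques:

The string function find
List slicing, list[-x:-y]
File reading and writing
The while loop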
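
As a quick illustration, the four building blocks can be tried out together on a small string. This is only a sketch: the sample text, the file name demo.txt, and the variable names are made up for the example and do not come from the blog being crawled.

```python
import os
import tempfile

text = "<b>one</b><b>two</b><b>three</b>"

# 1. str.find returns the index of the first match, or -1 if there is none
start = text.find("<b>")
end = text.find("</b>")

# 2. slicing cuts out the part between two indices
first = text[start + len("<b>"):end]

# 3. a while loop repeats find until it reports -1
words = []
pos = text.find("<b>")
while pos != -1:
    close = text.find("</b>", pos)
    words.append(text[pos + len("<b>"):close])
    pos = text.find("<b>", close)

# negative indices count from the end: list[-x:-y]
all_but_last = words[-3:-1]

# 4. write the result to a file and read it back
path = os.path.join(tempfile.gettempdir(), "demo.txt")
with open(path, "w") as f:
    f.write("\n".join(words))
with open(path) as f:
    saved = f.read().split("\n")
```

The same find, slice, loop, and write steps are what the crawler later in this post is built from.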

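Principle:
Everything on a web page corresponds to source code, so a crawler boils down to two parts: extracting pieces of the page source and visiting the addresses found there.
Step 1: cut out the code for the item to be crawled. For a single article: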

 <a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102wrup.html">写给那个茶水妹的《乘风破浪》诞生…</a>

This is the code that corresponds to the article. The URL we need to cut out of it is
http://blog.sina.com.cn/s/blog_4701280b0102wrup.html
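
The cut itself can be done with find and a slice. The following is a minimal sketch: the link text is shortened to a placeholder, the names prefix and suffix are chosen here for illustration, and len() stands in for hard-coded offsets.

```python
anchor = ('<a title="" target="_blank" '
          'href="http://blog.sina.com.cn/s/blog_4701280b0102wrup.html">'
          'title text</a>')

prefix = 'href="'   # the URL starts right after this marker
suffix = '.html'    # and ends with this one

start = anchor.find(prefix) + len(prefix)
end = anchor.find(suffix, start) + len(suffix)
url = anchor[start:end]
print(url)
```

Using len(prefix) and len(suffix) gives the same offsets as counting the characters by hand (6 for href=" and 5 for .html).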
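Step 2: visit the URL and save the page to disk.
Import the built-in urllib library:
import urllib
content = urllib.urlopen(url).read()
filename = "xxx"
Write it out as HTML:
with open(filename, "w") as f:
    f.write(content)
Or write it out as TXT:
with open("…../" + filename + ".txt", "w") as f:
    f.write(content)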
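For all the articles on the home page, however, we read the whole home page with urllib.urlopen(url).read(), cut every article URL out of what was read, and save each article to disk.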
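
The listings in this post use Python 2, where the call is urllib.urlopen. On Python 3 the equivalent function is urllib.request.urlopen, and read() returns bytes, so the file should be opened in binary mode. The sketch below assumes Python 3; the helper names fetch and save and the file name sample_page.html are hypothetical and not part of the original post.

```python
import os
import tempfile
from urllib.request import urlopen


def fetch(url):
    """Download a page and return its raw bytes (needs network access)."""
    with urlopen(url) as response:
        return response.read()


def save(content, path):
    """Write bytes to disk; binary mode avoids any re-encoding."""
    with open(path, "wb") as f:
        f.write(content)


# Offline demonstration of save(); fetch() is defined but not called here.
sample = b"<html><body>hello</body></html>"
target = os.path.join(tempfile.gettempdir(), "sample_page.html")
save(sample, target)
```

With network access, the two helpers would be combined as save(fetch(url), filename) for each article URL.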

Data source: Han Han's blog
http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html

Part 1: Download a single blog post and store it locally

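Step 1: analyze the HTML source
Right-click and choose Inspect Element to see the page source (the Chrome shortcut is F12). Then look for the article title, 写给那个茶水妹的《乘风破浪》诞生…, in the source by pressing Ctrl+F inside body: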

The markup around the article title follows this pattern:

 <a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102wrup.html">写给那个茶水妹的《乘风破浪》诞生…</a>

The parts we need are then looked up within this string.

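Step 2: process the code and extract the URL
Import the built-in urllib library:
import urllib
content = urllib.urlopen(url).read()
filename = "xxx"
Write it out as HTML:
with open(filename, "w") as f:
    f.write(content)
Or write it out as TXT:
with open("…../" + filename + ".txt", "w") as f:
    f.write(content)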

Implementation

# -*- coding: utf-8 -*-
import urllib

# Escape the double quotes inside the string
str0 = "<a title=\"\" target=\"_blank\" href=\"http://blog.sina.com.cn/s/blog_4701280b0102wrup.html\">写给那个茶水妹的《乘风破浪》诞生…</a>"
# The pattern is href="link" followed by ">title</a>"
# Find the indices that bound the title
title_1 = str0.find(r">")
title_2 = str0.find(r"</a>")
title = str0[title_1+1:title_2]
print title
# Find the indices that bound the http link
href = str0.find(r"href=")
html = str0.find(r".html")
# Slice out the url: +6 skips href=" and +5 keeps .html
url = str0[href+6:html+5]
# What is read back is HTML source; its type is str
content = urllib.urlopen(url).read()
m = url.find("blog_")
filename = url[m:]
filename_0 = "F://python/PyCharmWorkpalce/Crawler/pacong_data/"
filename_1 = filename_0 + filename
# Write it as html
with open(filename_1, "w+") as f:
    f.write(content)
# Write it as a txt file
with open(filename_1 + ".txt", "w+") as f:
    f.write(content)
# Save it as a txt file named after the title
# The source file is UTF-8 encoded, so the file name has to be decoded to unicode
with open(unicode(filename_0 + title + ".txt", "utf-8"), "w+") as f:
    f.write(content)
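
The three open() calls above differ only in the file name they build. The sketch below isolates that naming step. It assumes Python 3, where str is already Unicode and no unicode() call is needed; the helper name article_paths, the relative directory pacong_data, and the placeholder title are illustrative assumptions.

```python
import os


def article_paths(url, title, out_dir):
    """Return the three output paths used for one article."""
    name = url[url.find("blog_"):]              # e.g. blog_xxx.html
    return (
        os.path.join(out_dir, name),            # raw HTML
        os.path.join(out_dir, name + ".txt"),   # same content, .txt suffix
        os.path.join(out_dir, title + ".txt"),  # named after the title
    )


paths = article_paths(
    "http://blog.sina.com.cn/s/blog_4701280b0102wrup.html",
    "sample title",
    "pacong_data",
)
for p in paths:
    print(p)
```

os.path.join picks the right separator for the platform, which avoids hand-written prefixes such as F://.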

Output

Part 2: Crawl all the articles on the home page and store them locally

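This works much like crawling a single article: read the full source of the home page, then cut out each article URL. Start by looking at how the article URLs appear: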

<a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102wrup.html">写给那个茶水妹的《乘风破浪》诞生…</a></span> <a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102eo83.html">《论电影的七个元素》——关于我对电…</a></span>

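We therefore need to find, within everything that was read, the pattern that marks the first article. The distinctive markers, in order, are <a title="" target="_blank, then href="http:, then .html">, and finally </a></span>.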

# -*- coding: utf-8 -*-
# author : santi
# function :
# time :
import urllib
import time

# Read the entire content of the home page directly
str0 = "http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html"
con = urllib.urlopen(str0).read()
# Dump con to a file to inspect the pattern around the article titles
# <a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102elmo.html">2013年09月27日</a></span>
# <a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102wruo.html">写给那个茶水妹的《乘风破浪》诞生…</a></span>
with open(r"F:\python\PyCharmWorkpalce\Crawler\pacong_data\context.txt", "w") as f:
    f.write(con)

url_all = [""] * 60
url_name = [""] * 60
index = con.find("<a title=\"\" target=\"_blank")
href = con.find("href=\"http:", index)
html = con.find(".html\">", href)
title = con.find("</a></span>", html)
i = 0
# find returns -1 when nothing is found, which means every article has been
# collected and the while loop can stop. The limit of 50 is combined with "and"
# so that the loop also stops once 50 links have been stored.
while index != -1 and href != -1 and html != -1 and title != -1 and i < 50:
    url_all[i] = con[href+6:html+5]
    url_name[i] = con[html+7:title]
    print "finding...   " + url_all[i]
    index = con.find("<a title=\"\" target=\"_blank", title)
    href = con.find("href=\"http:", index)
    html = con.find(".html\">", href)
    title = con.find("</a></span>", html)
    i += 1
else:
    print "Find End!"

# Local storage
# http://blog.sina.com.cn/s/blog_4701280b0102wrup.html
m_0 = url_all[0].find("blog_")
m_1 = url_all[0].find(".html") + 5
filename_0 = "F://python/PyCharmWorkpalce/Crawler/pacong_data/"
j = 0
while j < i:
    filename_1 = url_all[j][m_0:m_1]
    content = urllib.urlopen(url_all[j]).read()
    print "downloading.... " + filename_1
    with open(filename_0 + filename_1, "w+") as f:
        f.write(content)
    with open(filename_0 + filename_1 + ".txt", "w+") as f:
        f.write(content)
    with open(unicode(filename_0 + url_name[j] + ".txt", "utf-8"), "w+") as f:
        f.write(content)
    # Pause between requests
    time.sleep(15)
    j += 1
print "Download article finished! "
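
The extraction loop can be tried without touching the network by running it on a short sample string. The sketch below is a variation on the listing above, not a copy of it: it assumes Python 3, collects results in a growing list instead of preallocated arrays, and uses a hypothetical helper name extract_links with made-up link texts.

```python
def extract_links(page, limit=50):
    """Collect (url, title) pairs from anchors shaped like the ones above."""
    links = []
    index = page.find('<a title="" target="_blank')
    while index != -1 and len(links) < limit:
        href = page.find('href="http:', index)
        html = page.find('.html">', href)
        end = page.find('</a></span>', html)
        if href == -1 or html == -1 or end == -1:
            break  # incomplete anchor: stop instead of slicing garbage
        url = page[href + len('href="'):html + len('.html')]
        title = page[html + len('.html">'):end]
        links.append((url, title))
        # continue searching after the anchor that was just consumed
        index = page.find('<a title="" target="_blank', end)
    return links


sample = (
    '<span><a title="" target="_blank" '
    'href="http://blog.sina.com.cn/s/blog_4701280b0102wrup.html">first</a></span>'
    '<span><a title="" target="_blank" '
    'href="http://blog.sina.com.cn/s/blog_4701280b0102eo83.html">second</a></span>'
)
links = extract_links(sample)
for url, title in links:
    print(url, title)
```

Each find call starts from the position returned by the previous one, so the search only ever moves forward through the page and ends when find returns -1 or the limit is reached.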

Output:

finding...   http://blog.sina.com.cn/s/blog_4701280b0102wrup.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102wruo.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102eohi.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102eo83.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102elmo.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102eksm.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102ek51.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102egl0.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102ef4t.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102edcd.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102ecxd.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102eck1.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102ec39.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102eb8d.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102eb6w.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102eau0.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e85j.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e7wj.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e7vx.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e7pk.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e7er.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e63p.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e5np.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e4qq.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e4gf.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e4c3.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e490.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e42a.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e3v6.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e3nr.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e150.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e11n.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e0th.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e0p3.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e0l4.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e0ib.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e0hj.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e0fm.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e0eu.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e0ak.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e07s.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e074.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e06b.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e061.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102e02q.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102dz9f.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102dz84.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102dz5s.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102dyao.html
finding...   http://blog.sina.com.cn/s/blog_4701280b0102dxmp.html
Find End!
downloading.... blog_4701280b0102wrup.html
downloading.... blog_4701280b0102wruo.html
downloading.... blog_4701280b0102eohi.html
downloading.... blog_4701280b0102eo83.html
downloading.... blog_4701280b0102elmo.html
downloading.... blog_4701280b0102eksm.html
downloading.... blog_4701280b0102ek51.html
downloading.... blog_4701280b0102egl0.html
downloading.... blog_4701280b0102ef4t.html
downloading.... blog_4701280b0102edcd.html
downloading.... blog_4701280b0102ecxd.html
downloading.... blog_4701280b0102eck1.html
downloading.... blog_4701280b0102ec39.html
downloading.... blog_4701280b0102eb8d.html
downloading.... blog_4701280b0102eb6w.html
downloading.... blog_4701280b0102eau0.html
downloading.... blog_4701280b0102e85j.html
downloading.... blog_4701280b0102e7wj.html
downloading.... blog_4701280b0102e7vx.html
downloading.... blog_4701280b0102e7pk.html
downloading.... blog_4701280b0102e7er.html
downloading.... blog_4701280b0102e63p.html
downloading.... blog_4701280b0102e5np.html
downloading.... blog_4701280b0102e4qq.html
downloading.... blog_4701280b0102e4gf.html
downloading.... blog_4701280b0102e4c3.html
downloading.... blog_4701280b0102e490.html
downloading.... blog_4701280b0102e42a.html
downloading.... blog_4701280b0102e3v6.html
downloading.... blog_4701280b0102e3nr.html
downloading.... blog_4701280b0102e150.html
downloading.... blog_4701280b0102e11n.html
downloading.... blog_4701280b0102e0th.html
downloading.... blog_4701280b0102e0p3.html
downloading.... blog_4701280b0102e0l4.html
downloading.... blog_4701280b0102e0ib.html
downloading.... blog_4701280b0102e0hj.html
downloading.... blog_4701280b0102e0fm.html
downloading.... blog_4701280b0102e0eu.html
downloading.... blog_4701280b0102e0ak.html
downloading.... blog_4701280b0102e07s.html
downloading.... blog_4701280b0102e074.html
downloading.... blog_4701280b0102e06b.html
downloading.... blog_4701280b0102e061.html
downloading.... blog_4701280b0102e02q.html
downloading.... blog_4701280b0102dz9f.html
downloading.... blog_4701280b0102dz84.html
downloading.... blog_4701280b0102dz5s.html
downloading.... blog_4701280b0102dyao.html
downloading.... blog_4701280b0102dxmp.html
Download article finished!

Illustration of the downloaded files: