Scraping Sina Blog with Python: an example that downloads Han Han's blog

# coding=utf-8
# __author__ = 'zhouyin'
# Python 2 code (print statement, urllib.urlopen)
import urllib
import time

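'''
==================== Step 1: download one page and name the local file after its URL ====================
# The href value below is a placeholder; substitute the URL of a real post.
str0 = '<a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_XXXXXXXXXXXXXXXX.html">地震思考录</a>'
title = str0.find(r'<a title=')
print title
href = str0.find(r'href=')
print href
html = str0.find(r'.html')
print html
url = str0[href + 6:html + 5]
print url
filename = url[-26:]  # use the last 26 characters of the URL as the file name
content = urllib.urlopen(url).read()
# print content
open(filename, 'w').write(content)  # save the page locally under that file name
====================================================================================================================
'''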

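'''
==================== Step 2: extract the first post URL from the article list ====================
con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html').read()
# fetch the first page of Han Han's article list
title = con.find(r'<a title=')
href = con.find(r'href=', title)
html = con.find(r'.html', href)
print html
url = con[href + 6:html + 5]
# The match starts at 'href=', so the slice skips those 5 characters plus the opening quote (+6).
# '.html' has to stay in the URL, so the slice ends 5 characters past the start of the match (+5).
# print con
print url
========================================================================================
'''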

'''
==================== Step 3: download the 50 posts listed on one page of Han Han's blog ====================
url = [''] * 50  # list holding 50 URLs
con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html').read()
# fetch the first page of Han Han's article list
print con
title = con.find(r'<a title=')
href = con.find(r'href=', title)
html = con.find(r'.html', href)
i = 0
while title != -1 and href != -1 and html != -1 and i < 50:
    url[i] = con[href + 6:html + 5]
    print url[i]
    # continue searching after the link that was just extracted
    title = con.find(r'<a title=', html)
    href = con.find(r'href=', title)
    html = con.find(r'.html', href)
    i = i + 1
else:
    print 'find all!'
j = 0
while j < i:  # download only the URLs that were actually found
    content = urllib.urlopen(url[j]).read()
    # the hanhan/ directory must already exist
    open(r'hanhan/' + url[j][-26:], 'w+').write(content)
    print 'downloading....', url[j]
    j = j + 1
    time.sleep(15)  # pause between requests
else:
    print 'download all pages!'
========================================================================================================================
'''

'''
==================== Step 4: download all posts from the 7 pages of Han Han's blog ====================
url = [''] * 350  # list holding up to 350 URLs (7 pages x 50 posts)
page = 1
i = 0  # running count of URLs collected across all pages
while page <= 7:
    con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_' + str(page) + '.html').read()
    # fetch the current page of Han Han's article list
    # print con
    title = con.find(r'<a title=')
    href = con.find(r'href=', title)
    html = con.find(r'.html', href)
    while title != -1 and href != -1 and html != -1 and i < 350:
        url[i] = con[href + 6:html + 5]
        print url[i]
        # continue searching after the link that was just extracted
        title = con.find(r'<a title=', html)
        href = con.find(r'href=', title)
        html = con.find(r'.html', href)
        i = i + 1
    else:
        print 'find all!'
    page = page + 1
j = 0
while j < i:  # download only the URLs that were actually found
    content = urllib.urlopen(url[j]).read()
    # the hanhan/ directory must already exist
    open(r'hanhan/' + url[j][-26:], 'w+').write(content)
    print 'downloading....', url[j]
    j = j + 1
    time.sleep(15)  # pause between requests
else:
    print 'download all pages!'
========================================================================================================================
'''
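
The find-and-slice idea behind the script can be tried without any network access. The sketch below restates it in Python 3, whereas the script itself is Python 2. The function name `extract_post_urls`, the `limit` parameter, and the sample markup with its `example.com` URLs are made up for illustration. It also assumes that every post link appears as an `<a title=... href="....html">` tag, which may not match the real page.

```python
def extract_post_urls(page_html, limit=50):
    """Collect up to `limit` URLs ending in .html from <a title=...> tags."""
    urls = []
    title = page_html.find('<a title=')
    while title != -1 and len(urls) < limit:
        href = page_html.find('href=', title)
        if href == -1:
            break
        html = page_html.find('.html', href)
        if html == -1:
            break
        # href + 6 skips the 5 characters of 'href=' plus the opening quote;
        # html + 5 keeps the 5 characters of '.html' inside the slice.
        urls.append(page_html[href + 6:html + 5])
        # Resume the search after the link that was just extracted.
        title = page_html.find('<a title=', html)
    return urls


# Hypothetical markup standing in for an article list page.
sample = (
    '<a title="" target="_blank" href="http://example.com/s/post_1.html">first</a>'
    '<a title="" target="_blank" href="http://example.com/s/post_2.html">second</a>'
)
print(extract_post_urls(sample))
```

Appending to a list, rather than filling a pre-sized one, leaves no empty slots when a page holds fewer links than expected, so a later download loop can simply iterate over the result.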
