Python Web Scraping in Practice (1): Scraping Jokes from Qiushibaike

Code:

# _*_ coding:utf-8 _*_
# Python 2 script: uses urllib2, print statements and raw_input.
import urllib2
import re
from datetime import datetime


class QSBK:
    def __init__(self):
        self.pageIndex = 1
        self.user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        self.headers = {'User-Agent': self.user_agent}
        # Each element of self.stories holds the stories of one page.
        self.stories = []
        # Set to True while the program should keep running.
        self.enable = False

    def getPage(self, pageIndex):
        """Download one page of the hot list and return its HTML, or None on error."""
        try:
            url = 'http://www.qiushibaike.com/hot/page' + str(pageIndex)
            request = urllib2.Request(url, headers=self.headers)
            response = urllib2.urlopen(request)
            pageCode = response.read().decode('utf-8')
            return pageCode
        except urllib2.URLError as e:
            if hasattr(e, 'reason'):
                print u"QSBK connect Error,reason: ", e.reason
            return None

    def getPageItems(self, pageIndex):
        """Return [author, text, timestamp, likes] lists for the stories on a page."""
        pageCode = self.getPage(pageIndex)
        if not pageCode:
            print "Page Loading Error..."
            return None
        pattern = re.compile(
            '<div.*?author clearfix">.*?<a.*?<img.*?>(.*?)</a>.*?<a.*?<h2>(.*?)</h2>.*?</a>.*?<div.*?'
            + 'content">(.*?)<!--(.*?)-->.*?</div>.*?<div class="stats.*?class="number">(.*?)</i>',
            re.S)
        items = re.findall(pattern, pageCode)
        pageStories = []
        for item in items:
            # Skip entries whose first captured group contains "img".
            haveImg = re.search("img", item[0])
            if not haveImg:
                replaceBR = re.compile('<br/>')
                text = re.sub(replaceBR, "\n", item[2])
                pageStories.append([item[1].strip(), text.strip(), item[3].strip(), item[4].strip()])
        return pageStories

    def loadPage(self):
        """Fetch the next page when fewer than two pages are buffered."""
        if self.enable:
            if len(self.stories) < 2:
                pageStories = self.getPageItems(self.pageIndex)
                if pageStories:
                    self.stories.append(pageStories)
                    self.pageIndex += 1

    def getOneStory(self, pageStories, page):
        """Print one story per Enter key press; entering Q quits."""
        for story in pageStories:
            user_input = raw_input()
            self.loadPage()
            if user_input == 'Q':
                self.enable = False
                return
            print u"Page %d\tAuthor: %s\tPosted: %s\tLikes: %s\n%s" % (page, story[0], datetime.fromtimestamp(int(story[2])), story[3], story[1])

    def start(self):
        print u"Reading Qiushibaike: press Enter to see the next joke, Q to quit"
        self.enable = True
        self.loadPage()
        nowPage = 0
        while self.enable:
            if len(self.stories) > 0:
                # Take the oldest buffered page and show its stories one by one.
                pageStories = self.stories[0]
                nowPage += 1
                del self.stories[0]
                self.getOneStory(pageStories, nowPage)


spider = QSBK()
spider.start()
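
The parsing step can be tried offline, without contacting the site. The block below is a minimal sketch in Python 3 (the listing above targets Python 2). It assumes a made-up HTML fragment, SAMPLE_HTML, that only imitates the structure the pattern expects, so the live site's markup may well differ. The helper name parse_stories is hypothetical and does not appear in the original script. The regular expression itself is the one from getPageItems.

```python
import re

# Hypothetical HTML fragment, invented for illustration. It only imitates
# the structure that the pattern from the listing expects.
SAMPLE_HTML = '''
<div class="author clearfix">
<a href="/users/1/"><img src="avatar.jpg" alt="demo"></a>
<a href="/users/1/"><h2>demo_user</h2></a>
</div>
<div class="content">first line<br/>second line<!--1448400000--></div>
<div class="stats"><span class="stats-vote"><i class="number">42</i> likes</span></div>
'''

# Same pattern as in getPageItems, split across lines for readability.
# re.S lets "." match newlines, so one match can span several lines.
PATTERN = re.compile(
    r'<div.*?author clearfix">.*?<a.*?<img.*?>(.*?)</a>.*?<a.*?<h2>(.*?)</h2>.*?</a>'
    r'.*?<div.*?content">(.*?)<!--(.*?)-->.*?</div>'
    r'.*?<div class="stats.*?class="number">(.*?)</i>',
    re.S,
)


def parse_stories(page_code):
    """Return [author, text, timestamp, likes] lists found in page_code."""
    stories = []
    for item in PATTERN.findall(page_code):
        # Mirror the listing: skip entries whose first group mentions "img".
        if re.search("img", item[0]):
            continue
        text = item[2].replace("<br/>", "\n")
        stories.append([item[1].strip(), text.strip(), item[3].strip(), item[4].strip()])
    return stories


stories = parse_stories(SAMPLE_HTML)
for author, text, timestamp, likes in stories:
    print(author, timestamp, likes)
    print(text)
```

Because every quantifier in the pattern is non-greedy, each capture stops at the nearest following delimiter. The pattern is therefore tightly coupled to the page layout and would need adjusting whenever the site changes its markup.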

Reposted from: https://www.cnblogs.com/AndyJee/p/4997101.html
