Python Crawler Example: Downloading Multiple Pages of Topics from Baidu Tieba

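Last week's web crawler course set a practice exercise: download multiple pages of topics from Baidu Tieba. What I actually built crawled multiple pages of a single post, which is not quite what the assignment asked for, namely multiple pages of a forum's topic list. After the instructor reviewed the exercise, the gap between my code and the instructor's was obvious: mine was non-standard and careless in many places, and it showed no object-oriented thinking. So I am redoing the exercise here to learn how an expert writes it.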

Example: Downloading Multiple Pages of Topics from Baidu Tieba

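First, take a look at Baidu Tieba (http://tieba.baidu.com/f?). We define the following functions:

loadPage(url): fetches a web page
writePage(html, filename): saves a fetched page to a local file
tiebaCrawler(url, beginPage, endPage, keyword): the scheduler; it supplies the URLs of the pages to crawl
main: the main control module, which provides a basic command-line interface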

import urllib.request
import urllib.parse


def loadPage(url):
    '''
    Function: fetch url and return the webpage content
    url: the wanted webpage url
    '''
    headers = {
        'Accept': 'text/html',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    }
    print('To send HTTP request to %s ' % url)
    request = urllib.request.Request(url, headers=headers)
    return urllib.request.urlopen(request).read().decode('utf-8')


def writePage(html, filename):
    '''
    Function: write the content of html into a local file
    html: the response content
    filename: the local file used to store the response
    '''
    print('To write html into a local file %s ...' % filename)
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(str(html))
    print('Work done!')
    print('---' * 10)


def tiebaCrawler(url, beginPage, endPage, keyword):
    '''
    Function: the scheduler of the crawler; visits every wanted url in turn
    url: the url of the tieba's webpage
    beginPage: initial page
    endPage: end page
    keyword: the wanted keyword
    '''
    for page in range(beginPage, endPage + 1):
        # Each listing page advances the pn query parameter by 50.
        pn = (page - 1) * 50
        queryurl = url + '&pn=' + str(pn)
        filename = keyword + str(page) + '_tieba.html'
        writePage(loadPage(queryurl), filename)


if __name__ == '__main__':
    kw = input('Please input the wanted tieba\'s name:')
    beginPage = int(input('The beginning page number:'))
    endPage = int(input('The ending page number:'))
    # Example Baidu Tieba query URL:
    # http://tieba.baidu.com/f?ie=utf-8&kw=%E8%B5%B5%E4%B8%BD%E9%A2%96&red_tag=m2239217474
    url = 'http://tieba.baidu.com/f?ie=utf-8&'
    key = urllib.parse.urlencode({'kw': kw})
    queryurl = url + key
    tiebaCrawler(queryurl, beginPage, endPage, kw)

Sample run:

Please input the wanted tieba's name:赵丽颖
The beginning page number:1
The ending page number:3
To send HTTP request to http://tieba.baidu.com/f?ie=utf-8&kw=%E8%B5%B5%E4%B8%BD%E9%A2%96&pn=0
To write html into a local file 赵丽颖1_tieba.html ...
Work done!
------------------------------
To send HTTP request to http://tieba.baidu.com/f?ie=utf-8&kw=%E8%B5%B5%E4%B8%BD%E9%A2%96&pn=50
To write html into a local file 赵丽颖2_tieba.html ...
Work done!
------------------------------
To send HTTP request to http://tieba.baidu.com/f?ie=utf-8&kw=%E8%B5%B5%E4%B8%BD%E9%A2%96&pn=100
To write html into a local file 赵丽颖3_tieba.html ...
Work done!
------------------------------
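
The pn values in the log (0, 50, 100) come from pn = (page - 1) * 50. As a minimal sketch of just that URL-building step, with no network access, the helper below rebuilds the request URLs for a page range. The name build_page_urls and the constants BASE_URL and PAGE_SIZE are hypothetical and not part of the original code; the step of 50 is taken from the code above, not from any Baidu documentation.

```python
import urllib.parse

BASE_URL = 'http://tieba.baidu.com/f?'
PAGE_SIZE = 50  # step used by the article's code: pn = (page - 1) * 50


def build_page_urls(keyword, begin_page, end_page):
    """Return the listing URLs for pages begin_page..end_page (inclusive)."""
    urls = []
    for page in range(begin_page, end_page + 1):
        # urlencode percent-encodes the keyword as UTF-8 and joins pairs with '&'.
        query = {'ie': 'utf-8', 'kw': keyword, 'pn': (page - 1) * PAGE_SIZE}
        urls.append(BASE_URL + urllib.parse.urlencode(query))
    return urls


for url in build_page_urls('赵丽颖', 1, 3):
    print(url)
```

Passing all three parameters to urlencode in one dictionary avoids concatenating '&pn=' by hand, and for the inputs of the sample run it should yield the same three URLs that appear in the log.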

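Takeaways

Split the functionality as much as possible so that each function does one thing, and keep each function's inputs and outputs under control; this reflects object-oriented thinking.
Give every function a docstring (in the format used in the code above) that makes two things clear: (a) what the function does, and (b) what each parameter means. Spelling these out greatly improves readability and also pushes you to keep your code disciplined.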
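
To make the object-oriented point concrete, here is one possible sketch that groups the same three steps (fetch, save, schedule) into a class. It is not the instructor's code: the class name TiebaCrawler, its methods, the shortened User-Agent, and the optional fetch parameter are all assumptions made for illustration. The fetch parameter is there so the scheduling logic can be exercised with a stand-in function instead of a real HTTP request.

```python
import os
import urllib.parse
import urllib.request


class TiebaCrawler:
    """Hypothetical class wrapping the fetch / save / schedule steps."""

    BASE_URL = 'http://tieba.baidu.com/f?'
    PAGE_SIZE = 50  # pn step, taken from the article's code

    def __init__(self, keyword, fetch=None):
        """keyword: name of the tieba; fetch: optional callable, url -> html text."""
        self.keyword = keyword
        self.fetch = fetch or self._download

    @staticmethod
    def _download(url):
        """Fetch url over HTTP and return the body decoded as UTF-8."""
        request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        with urllib.request.urlopen(request) as response:
            return response.read().decode('utf-8')

    def page_url(self, page):
        """Return the listing URL for a 1-based page number."""
        query = {'ie': 'utf-8', 'kw': self.keyword,
                 'pn': (page - 1) * self.PAGE_SIZE}
        return self.BASE_URL + urllib.parse.urlencode(query)

    def crawl(self, begin_page, end_page, out_dir='.'):
        """Save pages begin_page..end_page (inclusive); return the file paths."""
        paths = []
        for page in range(begin_page, end_page + 1):
            path = os.path.join(out_dir, '%s%d_tieba.html' % (self.keyword, page))
            with open(path, 'w', encoding='utf-8') as f:
                f.write(self.fetch(self.page_url(page)))
            paths.append(path)
        return paths
```

Calling TiebaCrawler('赵丽颖').crawl(1, 3) would send real requests and write three files into the current directory, while passing fetch=some_function swaps in any callable that maps a URL to text.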
