Advanced Python (20): A Python Web Crawler Example

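Contents

1. Foreword
2. A Simple Crawler Architecture
3. Program Entry Point (Crawler Scheduler)
4. URL Manager
5. Web Page Downloader
6. Web Page Parser
7. Web Page Outputer
8. Results
9. Further Reading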

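1. Foreword

This post walks through a Python web crawler example. It focuses on the crawler's technical architecture and on the key modules that make it up: the URL manager, the HTML downloader, and the HTML parser.

2. A Simple Crawler Architecture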
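
As a rough sketch of how the modules cooperate, the loop below pulls a URL from the pending set, "downloads" it, parses out data and new links, and feeds the links back into the pending set. Everything here is an assumption made for illustration: `FAKE_WEB`, `download`, `parse`, and `crawl` are hypothetical stand-ins that use an in-memory dictionary instead of the network, and they are not the classes defined in the later sections.

```python
# Hypothetical in-memory "web": URL -> (title, outgoing links).
FAKE_WEB = {
    'http://example.com/a': ('Page A', ['http://example.com/b', 'http://example.com/c']),
    'http://example.com/b': ('Page B', ['http://example.com/a']),
    'http://example.com/c': ('Page C', []),
}


def download(url):
    """Stand-in for the downloader: look the page up instead of fetching it."""
    return FAKE_WEB.get(url)


def parse(page):
    """Stand-in for the parser: split a page into new links and extracted data."""
    title, links = page
    return set(links), {'title': title}


def crawl(root_url, limit=10):
    """Scheduler loop: the two sets play the role of the URL manager."""
    new_urls, old_urls, collected = {root_url}, set(), []
    while new_urls and len(collected) < limit:
        url = new_urls.pop()
        old_urls.add(url)
        page = download(url)
        if page is None:
            continue
        links, data = parse(page)
        new_urls |= links - old_urls   # skip URLs that were already visited
        collected.append(data)         # stand-in for the outputer's collect step
    return collected


titles = sorted(d['title'] for d in crawl('http://example.com/a'))
print(titles)  # ['Page A', 'Page B', 'Page C']
```

Page A links to B and C, and B links back to A, but each page is visited only once because visited URLs are filtered out before they re-enter the pending set.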

3. Program Entry Point (Crawler Scheduler)

# coding: utf8
import datetime

from maya_Spider import url_manager, html_downloader, html_parser, html_outputer


class Spider_Main(object):
    # Initialization
    def __init__(self):
        # Set up the URL manager
        self.urls = url_manager.UrlManager()
        # Set up the HTML downloader
        self.downloader = html_downloader.HtmlDownloader()
        # Set up the HTML parser
        self.parser = html_parser.HtmlParser()
        # Set up the HTML outputer
        self.outputer = html_outputer.HtmlOutputer()

    # Crawler scheduling loop
    def craw(self, root_url):
        count = 1
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                print('craw %d : %s' % (count, new_url))
                html_content = self.downloader.download(new_url)
                new_urls, new_data = self.parser.parse(new_url, html_content)
                self.urls.add_new_urls(new_urls)
                self.outputer.collect_data(new_data)
                # Stop after 10 successfully crawled pages
                if count == 10:
                    break
                count = count + 1
            except Exception as e:
                # Report the failure and move on to the next URL
                print('craw failed: %s' % e)
        self.outputer.output_html()


if __name__ == '__main__':
    # Entry URL for the crawl
    root_url = 'http://baike.baidu.com/view/21087.htm'
    # Start time
    print('Timer started..............')
    start_time = datetime.datetime.now()
    obj_spider = Spider_Main()
    obj_spider.craw(root_url)
    # End time
    end_time = datetime.datetime.now()
    print('Total time: %ds' % (end_time - start_time).seconds)
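
One detail of the timing code is worth a small illustration. The `seconds` attribute of a `timedelta` holds only the seconds component within a day, while `total_seconds()` returns the whole duration. For a crawl of ten pages the two agree, so the snippet below is just a sketch of the difference, using a made-up duration.

```python
import datetime

# A made-up duration of one day and five seconds.
elapsed = datetime.timedelta(days=1, seconds=5)

print(elapsed.seconds)          # 5
print(elapsed.total_seconds())  # 86405.0
```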

4. URL Manager

class UrlManager(object):
    def __init__(self):
        # URLs waiting to be crawled
        self.new_urls = set()
        # URLs that have already been crawled
        self.old_urls = set()

    def add_new_url(self, url):
        if url is None:
            return
        # Only queue URLs that have not been seen before
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        # Move one URL from the pending set to the crawled set
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url
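
Because `set.pop()` removes an arbitrary element, the crawl order of the manager above is not predictable. If a first-in, first-out (breadth-first) order is wanted, one possible variant keeps a `deque` for ordering and a set for deduplication. This is a sketch of an alternative, not the author's implementation; `FifoUrlManager` is a hypothetical name, and the short placeholder strings stand in for real URLs.

```python
from collections import deque


class FifoUrlManager(object):
    """Hypothetical variant of UrlManager that hands URLs out in insertion order."""

    def __init__(self):
        self.queue = deque()  # pending URLs, oldest first
        self.seen = set()     # every URL ever added, pending or crawled

    def add_new_url(self, url):
        if url is None or url in self.seen:
            return
        self.seen.add(url)
        self.queue.append(url)

    def add_new_urls(self, urls):
        for url in urls or ():
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.queue) != 0

    def get_new_url(self):
        return self.queue.popleft()


manager = FifoUrlManager()
manager.add_new_urls(['u1', 'u2', 'u1', 'u3'])  # 'u1' is a duplicate
order = []
while manager.has_new_url():
    order.append(manager.get_new_url())
print(order)  # ['u1', 'u2', 'u3']
```

The method names match the original class, so the scheduler could use either one without changes.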

5. Web Page Downloader

import urllib.request


class HtmlDownloader(object):
    def download(self, url):
        if url is None:
            return None
        # Pose as a browser; the author notes that CSDN rejects direct access
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        # Build the request
        req = urllib.request.Request(url, headers=headers)
        # Fetch the page
        response = urllib.request.urlopen(req)
        # In Python 3, read() returns bytes rather than str,
        # so convert it with bytes.decode (UTF-8 by default)
        return response.read().decode()
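
The downloader decodes every page as UTF-8, which fails for pages served in another encoding. A possible refinement is to read the charset from the `Content-Type` header. In `urllib`, `response.headers` is an `email.message.Message` subclass, so `get_content_charset()` is available on it. The sketch below builds such a header object by hand so that it runs without network access; `decode_body` is a hypothetical helper, not part of the original code.

```python
from email.message import Message


def decode_body(raw, headers, default='utf-8'):
    """Decode raw bytes using the charset declared in the headers, if any."""
    charset = headers.get_content_charset() or default
    # errors='replace' keeps going when a page contains invalid bytes
    return raw.decode(charset, errors='replace')


# Hand-built headers standing in for response.headers
headers = Message()
headers['Content-Type'] = 'text/html; charset=gbk'

text = decode_body('中文'.encode('gbk'), headers)
print(text)  # 中文
```

Inside `download`, the equivalent call would be `decode_body(response.read(), response.headers)`.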

6. Web Page Parser

import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup


class HtmlParser(object):
    def _get_new_urls(self, page_url, soup):
        new_urls = set()
        # Original note: /view/123.htm; the pattern below matches /item/ links
        links = soup.find_all('a', href=re.compile(r'/item/.*?'))
        for link in links:
            new_url = link['href']
            # Turn the relative link into an absolute URL
            new_full_url = urljoin(page_url, new_url)
            new_urls.add(new_full_url)
        return new_urls

    # Extract the title and summary
    def _get_new_data(self, page_url, soup):
        # New dict for this page's data
        res_data = {}
        # url
        res_data['url'] = page_url
        # Title tag: <dd class="lemmaWgt-lemmaTitle-title"><h1>Python</h1>
        title_node = soup.find('dd', class_="lemmaWgt-lemmaTitle-title").find('h1')
        print(str(title_node.get_text()))
        res_data['title'] = str(title_node.get_text())
        # <div class="lemma-summary" label-module="lemmaSummary">
        summary_node = soup.find('div', class_="lemma-summary")
        res_data['summary'] = summary_node.get_text()
        return res_data

    def parse(self, page_url, html_content):
        if page_url is None or html_content is None:
            # Return a pair so the caller's two-value unpacking still works
            return None, None
        # html_content is already a str, so from_encoding is not needed
        soup = BeautifulSoup(html_content, 'html.parser')
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)
        return new_urls, new_data
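
To make the link-extraction step concrete, here is a minimal sketch that performs the same filter-and-join logic on a small inline HTML sample. It deliberately uses only the standard library (`html.parser` instead of BeautifulSoup) so that it runs without extra packages. `LinkCollector` and `sample_html` are hypothetical names introduced for this illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs of <a> tags whose href matches a pattern."""

    def __init__(self, page_url, pattern):
        super().__init__()
        self.page_url = page_url
        self.pattern = pattern
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if href and self.pattern.search(href):
            # Resolve the relative href against the page's own URL
            self.links.add(urljoin(self.page_url, href))


sample_html = '''
<a href="/item/Python">Python</a>
<a href="/item/Guido">Guido</a>
<a href="/help/about">About</a>
<a href="/item/Python">Python again</a>
'''

collector = LinkCollector('http://baike.baidu.com/view/21087.htm', re.compile(r'/item/'))
collector.feed(sample_html)
print(sorted(collector.links))
# ['http://baike.baidu.com/item/Guido', 'http://baike.baidu.com/item/Python']
```

The `/help/about` link is dropped by the pattern, and the duplicate `/item/Python` link collapses because the results are stored in a set.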

7. Web Page Outputer

class HtmlOutputer(object):
    def __init__(self):
        self.datas = []

    def collect_data(self, data):
        if data is None:
            return
        self.datas.append(data)

    def output_html(self):
        # The with block closes the file even if a write fails
        with open('maya.html', 'w', encoding='utf-8') as fout:
            fout.write('<html>')
            fout.write("<head><meta http-equiv='content-type' content='text/html;charset=utf-8'></head>")
            fout.write('<body>')
            fout.write('<table border="1">')
            # <th width="5%">Url</th>
            fout.write('''<tr style="color:red" width="90%"><th>Theme</th><th width="80%">Content</th></tr>''')
            for data in self.datas:
                fout.write('<tr>\n')
                # fout.write('\t<td>%s</td>' % data['url'])
                fout.write('\t<td align="center"><a href=\'%s\'>%s</a></td>' % (data['url'], data['title']))
                fout.write('\t<td>%s</td>\n' % data['summary'])
                fout.write('</tr>\n')
            fout.write('</table>')
            fout.write('</body>')
            fout.write('</html>')
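
The outputer writes scraped text straight into the HTML, so a title or summary containing `<`, `>`, or `&` could break the table. One way to guard against that is to escape each value with the standard library's `html.escape` before writing it. The sketch below shows the idea for a single row; `render_row` is a hypothetical helper, and the sample record is made up.

```python
import html


def render_row(data):
    """Build one table row with every scraped value HTML-escaped."""
    url = html.escape(data['url'], quote=True)
    title = html.escape(data['title'])
    summary = html.escape(data['summary'])
    return ('<tr>\n'
            '\t<td align="center"><a href="%s">%s</a></td>\n'
            '\t<td>%s</td>\n'
            '</tr>\n' % (url, title, summary))


# Made-up record in the same shape the parser returns
row = render_row({
    'url': 'http://baike.baidu.com/item/Python',
    'title': 'A & B <x>',
    'summary': 'plain text',
})
print(row)
```

In the printed row, the title appears as `A &amp; B &lt;x&gt;`, so a browser displays it as text instead of interpreting it as markup.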

8. Results

9. Further Reading

Complete code
