A Web Crawler in Python Written Without the Scrapy Framework

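The code in this article is excerpted, with minor changes, from 《Python程序设计（第2版）》 (Python Programming, 2nd Edition) by 董付国 (Dong Fuguo), Tsinghua University Press. It does not use the Scrapy crawler framework; instead it fetches pages with the standard library's urllib. If a page contains any of the keywords of interest, the page is saved as a local file. The crawl depth is bounded, so the program does not end up crawling the whole internet.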
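The depth-control idea can be seen in isolation with a small sketch that needs no network access. The link graph `LINKS` and the function name `crawl` are hypothetical, invented here for illustration; they stand in for real pages and are not taken from the book.

```python
# Hypothetical in-memory "web": each page maps to the pages it links to.
LINKS = {
    'a.html': ['b.html', 'c.html'],
    'b.html': ['d.html'],
    'c.html': ['a.html'],   # a link back to the start page
    'd.html': ['e.html'],
    'e.html': [],
}


def crawl(page, depth, processed):
    """Visit `page`, then follow its links while `depth` is above zero."""
    if page in processed:
        return              # never visit the same page twice
    processed.append(page)
    if depth > 0:
        for link in LINKS.get(page, []):
            # each hop away from the start page uses up one level of depth
            crawl(link, depth - 1, processed)


visited = []
crawl('a.html', 1, visited)
print(visited)
```

With a depth of 1 only the start page and the pages it links to directly are visited. The `processed` list is what keeps the link from `c.html` back to `a.html` from causing repeated visits.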

import os
import re
import urllib.request as lib


def craw_links(url, depth, keywords, processed):
    '''url: the url to crawl
    depth: the remaining depth to crawl
    keywords: the tuple of keywords to look for
    processed: the list of urls already crawled
    '''
    if url.startswith(('http://', 'https://')):
        if url not in processed:
            # mark this url as processed
            processed.append(url)
        else:
            # avoid processing the same url again
            return
        print('Crawling ' + url + '...')
        with lib.urlopen(url) as fp:
            # Python 3 returns bytes, so the contents need decoding
            contents = fp.read()
            # skip bytes that are not valid UTF-8 instead of raising
            contents_decoded = contents.decode('UTF-8', errors='ignore')
        # form a regular expression that matches any of the keywords
        pattern = '|'.join(keywords)
        # if no keywords are given, save every page;
        # otherwise save only pages that contain a keyword
        if not pattern or re.search(pattern, contents_decoded):
            filename = url.replace(':', '_').replace('/', '_')
            with open(os.path.join('craw', filename), 'wb') as fp:
                fp.write(contents)
        # find all the links in the current page
        links = re.findall('href="(.*?)"', contents_decoded)
        # crawl all links in the current page
        for link in links:
            # treat anything else as a path relative to the current page
            if not link.startswith(('http://', 'https://')):
                try:
                    index = url.rindex('/')
                    link = url[0:index+1] + link
                except ValueError:
                    pass
            # control the crawl depth
            if depth > 0 and link.endswith(('.htm', '.html')):
                craw_links(link, depth-1, keywords, processed)


if __name__ == '__main__':
    processed = []
    keywords = ('datetime', 'KeyWord2')
    # create the output directory if it does not exist yet
    os.makedirs('craw', exist_ok=True)
    start_url = 'https://docs.python.org/3/library/index.html'
    craw_links(start_url, 1, keywords, processed)
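
The listing resolves relative links by cutting the current URL at its last `/` and appending the link. That works for simple cases such as `datetime.html`, but not for links like `../index.html` or `/3/whatsnew/index.html`. One alternative, shown here as a sketch and not as the book's method, is the standard library's `urllib.parse.urljoin`. The helper name `resolve_link` is hypothetical.

```python
from urllib.parse import urljoin


def resolve_link(base_url, link):
    """Return `link` as an absolute URL, resolved against `base_url`."""
    # urljoin handles plain relative paths, '..' segments, root-relative
    # paths, and links that are already absolute
    return urljoin(base_url, link)


base = 'https://docs.python.org/3/library/index.html'
for link in ('datetime.html', '../index.html', '/3/whatsnew/index.html',
             'https://www.python.org/'):
    print(resolve_link(base, link))
```

Because `urljoin` returns an already absolute link unchanged, such a helper could be applied to every link without first checking for the `http://` or `https://` prefix.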
