Python Crawlers: the Web Page Downloader and the Web Page Parser

1. Web page downloader -- three ways to download a page with urllib2

import cookielib
import urllib2

url = "http://www.baidu.com"

print 'first method'
# Request the URL directly
response1 = urllib2.urlopen(url)
# Get the status code; 200 means the request succeeded
print response1.getcode()
# Read the content with response1.read() and print its length
print len(response1.read())

print 'second method'
# Build a Request object so that data and HTTP headers can be attached to the URL
request = urllib2.Request(url)
request.add_header("user-agent", "Mozilla/5.0")
response2 = urllib2.urlopen(request)
print response2.getcode()
print len(response2.read())

print 'third method'
# Add a handler for a special scenario (here, cookies)
# Create a cookie container
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
response3 = urllib2.urlopen(url)
print response3.getcode()
print cj
print response3.read()
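
The listing above targets Python 2. In Python 3, urllib2 became urllib.request and cookielib became http.cookiejar, so the same three methods could be sketched roughly as follows. The function names (fetch_direct, build_request, build_cookie_opener) are made up for this illustration and are not part of the original article. Nothing is downloaded until one of the functions is called.

```python
import http.cookiejar
import urllib.request


def fetch_direct(url):
    """Method 1: request the URL directly and return (status code, body bytes)."""
    with urllib.request.urlopen(url) as response:
        return response.getcode(), response.read()


def build_request(url, user_agent="Mozilla/5.0"):
    """Method 2: wrap the URL in a Request so headers (or data) can be attached."""
    request = urllib.request.Request(url)
    request.add_header("User-Agent", user_agent)
    return request


def build_cookie_opener():
    """Method 3: build an opener whose handler stores cookies in a CookieJar."""
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar))
    return opener, cookie_jar
```

A request built by build_request(url) can be passed to urllib.request.urlopen(), and the opener returned by build_cookie_opener() can be used as opener.open(url); the cookie jar then holds whatever cookies the server set.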

2. Web page parser

Regular expressions (pattern matching)

html.parser (structured parsing, DOM)

Beautiful Soup

lxml
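
To make the difference between pattern matching and structured parsing concrete, here is a small Python 3 sketch that pulls the same links out of a page both ways, using only the standard library. The HTML string and the LinkCollector class are invented for this illustration.

```python
import re
from html.parser import HTMLParser

html = '<p><a href="/a.html">A</a> and <a class="x" href="/b.html">B</a></p>'

# Pattern matching: treat the page as one string and search it with a regex
regex_links = re.findall(r'href="([^"]+)"', html)


class LinkCollector(HTMLParser):
    """Structured parsing: react to each start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")


collector = LinkCollector()
collector.feed(html)
print(regex_links)
print(collector.links)
```

The regex only sees characters, so it would also match an href inside a comment or a script. The parser knows which tag each attribute belongs to, which is why the structured tools are usually easier to keep correct on real pages.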

Beautiful Soup

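--A third-party Python library for extracting data from HTML or XML

--Official site: http://www.crummy.com/software/BeautifulSoup/

Install and test beautifulsoup4

--Install: download the package and place it in the Python directory; in a cmd window, cd into the extracted folder (cd beautifulsoup4-4.1.2), then run setup.py build followed by setup.py install

--Test: import bs4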
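
On current Python installations the package is more commonly installed with pip install beautifulsoup4 (this is an addition, not part of the original steps). Either way, a slightly fuller check than a bare import is to parse a tiny string, as in this Python 3 sketch:

```python
import bs4
from bs4 import BeautifulSoup

# If the import succeeds, the package is installed; print which version
print(bs4.__version__)

# Parse a one-tag document to confirm the parser works end to end
soup = BeautifulSoup("<p>ok</p>", "html.parser")
print(soup.p.get_text())
```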

3. Beautiful Soup syntax

Create a BeautifulSoup object -> search for nodes (find_all(name, attrs, string), find()) -> access node information (name, attributes, text)

Below is a BeautifulSoup test example:

Parsing an HTML string:

# coding=UTF-8
import re
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Build the soup with the built-in html.parser
soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')

print 'Get all links'
links = soup.find_all('a')
for link in links:
    print link.name, link['href'], link.get_text()

print 'Get the elsie link'
link_node = soup.find('a', href='http://example.com/elsie')
print link_node.name, link_node['href'], link_node.get_text()

print 'Match with a regular expression'
link_node = soup.find('a', href=re.compile(r"sie"))
print link_node.name, link_node['href'], link_node.get_text()

print 'Get the text of the p paragraph'
p_node = soup.find('p', class_='title')
print p_node.name, p_node.get_text()
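
The find_all(name, attrs, string) form mentioned earlier also takes the attrs and string arguments, which the example above does not exercise. The Python 3 sketch below shows them together with the usual ways of reading node information. It uses a shortened version of the sample document and assumes a Beautiful Soup release recent enough to accept the string argument.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Search by name and attrs: every <a> whose class is "sister"
sisters = soup.find_all("a", attrs={"class": "sister"})

# Search by name and string: the <a> whose text is exactly "Lacie"
lacie = soup.find("a", string="Lacie")

# Access node information: tag name, one attribute, all attributes, text
print(lacie.name)
print(lacie["href"])
print(lacie.attrs)
print(lacie.get_text())
print([a["id"] for a in sisters])
```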

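4. Example crawler

Crawling Baidu pages

-----------Define the goal: crawl the title and summary of the entries related to the Python entry on Baidu Baike

-----------Analyze the target

----------------------URL format: /view/125370.htm

----------------------Data format: title: <dd class="lemmaWgt-lemmaTitle-title"><h1>***</h1></dd>

summary: <div class="lemma-summary" label-module="lemmaSummary">***</div>

----------------------Page encoding: UTF-8

------------Write the code

------------Run the crawler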
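
The article does not include the crawler code itself. As a rough Python 3 sketch of the parsing step only, the function below applies the URL format and data format listed above to one page. The sample HTML, the example.com base URL, and the name parse_entry are all placeholders invented for this illustration; the real Baidu Baike markup may differ from the format recorded here.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical sample page that follows the data format listed above
SAMPLE_HTML = """
<html><body>
<dd class="lemmaWgt-lemmaTitle-title"><h1>Python</h1></dd>
<div class="lemma-summary" label-module="lemmaSummary">
Python is a programming language.
<a href="/view/125370.htm">related entry</a>
<a href="/help/about.htm">not an entry</a>
</div>
</body></html>
"""


def parse_entry(page_url, html):
    """Return (related entry URLs, {'url', 'title', 'summary'}) for one page."""
    soup = BeautifulSoup(html, "html.parser")

    # Related entries: links whose href matches the /view/<number>.htm format
    new_urls = set()
    for link in soup.find_all("a", href=re.compile(r"/view/\d+\.htm")):
        new_urls.add(urljoin(page_url, link["href"]))

    title_node = soup.find("dd", class_="lemmaWgt-lemmaTitle-title")
    summary_node = soup.find("div", class_="lemma-summary")
    data = {
        "url": page_url,
        "title": title_node.find("h1").get_text() if title_node else None,
        "summary": summary_node.get_text().strip() if summary_node else None,
    }
    return new_urls, data


# Placeholder page URL, used only to resolve the relative links
urls, data = parse_entry("http://example.com/view/1.htm", SAMPLE_HTML)
print(urls)
print(data["title"])
```

A full crawler would feed each URL in the returned set back through the downloader and this parser, keeping track of which URLs it has already visited.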
