Parsing HTML content with Python

I've started working through Cui Qingcai's book 《Python3网络爬虫开发实战》.

It contains a fun little piece of HTML:

<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>This is a Demo</title> <!-- displayed -->
</head>
<body>
<div id="container">
<div class="wrapper">
<h2 class="title">Hello World</h2> <!-- displayed -->
<p class="text">Hello, my name is Jieyun, you can call me June.</p> <!-- displayed -->
</div>
</div>
</body>
</html>

Here is how the page renders once the code above is saved as a .html file:

The <title>This is a Demo</title> inside <head></head> is shown on the browser tab.
The <h2 class="title">Hello World</h2> inside <body></body> is shown as the heading of the page content.
The <p class="text">Hello, my name is Jieyun, you can call me June.</p> inside <body></body> is shown as the main text of the page.

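I'm noting this easy-to-follow HTML down because it offers a simple way into understanding HTML.

Parsing HTML, step one: how to select the content you need by page node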

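The HTML above contains the line <div id="container">: a div node whose id is container. To select everything contained in <div id="container"></div>, you can write .select('#container') (this is BeautifulSoup's select method, which takes a CSS selector as a string).
Put bluntly: when selecting, id="container" becomes .select('#container') and class="wrapper" becomes .select('.wrapper').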
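To make the id and class rules concrete, here is a minimal sketch. It assumes an inline copy of the demo page's body instead of a saved file, and it uses Python's built-in html.parser (my choice here, so that nothing besides bs4 has to be installed).

```python
from bs4 import BeautifulSoup

# Inline copy of the demo page's body, so the sketch runs without a file.
html = """
<div id="container">
<div class="wrapper">
<h2 class="title">Hello World</h2>
<p class="text">Hello, my name is Jieyun, you can call me June.</p>
</div>
</div>
"""

# "html.parser" is Python's built-in parser (an assumption made for this sketch).
soup = BeautifulSoup(html, "html.parser")

by_id = soup.select("#container")    # '#' matches the id attribute
by_class = soup.select(".wrapper")   # '.' matches the class attribute

# select() always returns a list of matching tags, even for a single match.
print(len(by_id), by_id[0].name, by_id[0]["id"])
print(len(by_class), by_class[0].name, by_class[0]["class"])
```

Note that the selector is passed as a quoted string, and that the class attribute comes back as a list because an element can carry several classes.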
How do we extract the sentence

Hello, my name is Jieyun, you can call me June.

from the page?

Peel the onion one layer at a time: .select('#container .wrapper p') pinpoints exactly that sentence.
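The layer-by-layer selector can be tried out as follows. This is only a sketch on an inline copy of the demo markup, again with the built-in html.parser as an assumption. It also shows select_one, which BeautifulSoup provides for the case where only the first match is wanted.

```python
from bs4 import BeautifulSoup

# Inline copy of the demo page's body (assumed here instead of a saved file).
html = """
<div id="container">
<div class="wrapper">
<h2 class="title">Hello World</h2>
<p class="text">Hello, my name is Jieyun, you can call me June.</p>
</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A space between selectors means "descendant of":
# a p somewhere inside .wrapper, which is somewhere inside #container.
matches = soup.select("#container .wrapper p")
sentence = matches[0].get_text()

# select_one() returns the first matching tag directly, or None if nothing matches.
first = soup.select_one("#container .wrapper p")
missing = soup.select_one("#container .wrapper table")

print(sentence)
print(first.get_text() == sentence)
print(missing)
```

Checking for None before calling get_text() is worth the extra line on real pages, where a selector may not match anything.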

So how do we select the heading?
With .select('#container .wrapper h2'), of course. Talk is cheap, so let's do the whole thing in Python3.

Selecting content in Python3

Steps:
1 Type the first code listing above into a txt file, then save it in html format
2 Use BeautifulSoup + select to pick out the content
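The two steps can be sketched end to end without touching the desktop. The version below is an illustration under a few assumptions of mine: the page is written to a temporary directory rather than created by hand, the file name simpleHtml.html is reused only for familiarity, and the built-in html.parser stands in for lxml.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# The demo page, without the "displayed" annotations.
DEMO_HTML = """<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>This is a Demo</title>
</head>
<body>
<div id="container">
<div class="wrapper">
<h2 class="title">Hello World</h2>
<p class="text">Hello, my name is Jieyun, you can call me June.</p>
</div>
</div>
</body>
</html>
"""

with tempfile.TemporaryDirectory() as tmp:
    # Step 1: save the markup as an .html file (a temporary location, by assumption).
    path = Path(tmp) / "simpleHtml.html"
    path.write_text(DEMO_HTML, encoding="utf-8")

    # Step 2: read the file back and hand its text to BeautifulSoup.
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")

# The parsed tree lives in memory, so it is still usable after the file is gone.
heading = soup.select("#container .wrapper h2")[0].get_text()
paragraph = soup.select("#container .wrapper p")[0].get_text()
print(heading)
print(paragraph)
```

Passing encoding="utf-8" explicitly on both the write and the read keeps the result independent of the operating system's default encoding.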

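The program below is based on the original source program. I had only ever fetched pages by URL and had never opened an HTML file straight from disk, so I searched Baidu for 【python从本地读取html文件】 ("python read a local html file") and changed the original slightly. The change doesn't affect the essentials; the only goal is to get a soup object and then use select on it.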

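from bs4 import BeautifulSoup
# Open the local html file; utf-8 matches the page's <meta charset="UTF-8">
with open(r'C:\Users\gui\Desktop\CSDN\simpleHtml.html', encoding='utf-8') as file:
    # file is an ordinary text file object (not an HTTPResponse); its read() method returns the page content as a string
    soup = BeautifulSoup(file.read(), 'lxml')  # the 'lxml' parser requires the lxml package to be installed
title = soup.select('#container .wrapper h2')  # locate the heading
context = soup.select('#container .wrapper p')  # locate the main text
print(title[0].text)
print(context[0].text)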

Here is the output: