python beautifulsoup4_【python+beautifulsoup4】Beautifulsoup4

Beautiful soup将复杂HTML文档转换成一个复杂的属性结构，每个节点都是python对象，所有对象可归纳为4种Tag，NavigableString,BeautifulSoup,Comment

1.Tag 就是html中的一个个标签

tag有两个重要的属性，name和attrs

2.NavigableString 字符对象

#打印出标签p中的内容

print (soup.p.string)

3.BeautifulSoup 表示的是一个文档的内容

⼤部分时候,可以把它当作Tag 对象，是⼀个特殊的 Tag

4.Comment 特殊的NavigableString对象

输出的内容不包括注释符号

一、遍历文档树：

1.直接子节点：.contents和.children属性

.conten

tag 的 .content 属性可以将tag的⼦节点以列表的⽅式输出

Print(soup.head.contents)

# [

the domouse’s story]

.children返回的是list对象

print (soup.head.children)

for child in soup.body.children:

print (child)

2.所有子孙节点：.descendants

contents 和 .children 属性仅包含tag的直接⼦节点， .descendants 属性可以对所有tag的⼦孙节点进⾏递归循环，和 children类似，我们也需要遍历获取其中的内容。

for child in soup.descendants:

print (child)

通过一个例子来更直观的看出三者之间的区别

获取的节点如下

以下代码分别获取了class=‘catListTag’下直接子节点和子孙子节点的信息

运行结果：

D:\PycharmProjects\ImoocInterface\venv\Scripts\python.exe D:/PycharmProjects/ImoocInterface/soup_test.py

-------------------contents-----------------------

['\n',

我的标签

, '\n',

Autoit(1)
beautifulsoup4(1)
debug(1)
fiddler(1)
grid(1)
jdk(1)
python logging(1)
进程(1)
模块(1)
线程(1)
更多

, '\n']

-------------------children------------------------

我的标签

Autoit(1)
beautifulsoup4(1)
debug(1)
fiddler(1)
grid(1)
jdk(1)
python logging(1)
进程(1)
模块(1)
线程(1)
更多

-------------------descendants-----------------------

我的标签

Autoit(1)
beautifulsoup4(1)
debug(1)
fiddler(1)
grid(1)
jdk(1)
python logging(1)
进程(1)
模块(1)
线程(1)
更多

Autoit(1)

Autoit

(1)

beautifulsoup4(1)

beautifulsoup4

(1)

debug(1)

debug

.....................

对比三者可发现，contens和children输出为直接子节点的内容即

和

标签所包含的内容，而descendant输出为子孙节点的内容不仅有

和

所包含的内容，还直接输出了
- 标签下
- 标签的内容
  3.节点内容：.string属性
  
  二、搜索
  
  1. find_all(name, attrs, recursive, text,**kwargs)
  
  1) name 参数
  
  name 参数可以查找所有名字为 name 的tag,字符串对象会被⾃动忽略掉
  
  A.传字符串
  
  最简单的过滤器是字符串.在搜索⽅法中传⼊⼀个字符串参数,Beautiful Soup会查找与字符串完整匹配的内容,下⾯的例⼦⽤于查找⽂档中所有的 标签:
  
  soup.find_all('b')
  
  # [The Dormouse's story]
  
  B.传正则表达式
  
  如果传⼊正则表达式作为参数,Beautiful Soup会通过正则表达式的 match()来匹配内容.下⾯例⼦中找出所有以b开头的标签,这表示
  
  和 标签都应该被找到
  
  import re
  
  for tag in soup.find_all(re.compile("^b")):
  
  print(tag.name)
  
  # body
  
  # b
  
  C.传列表
  
  如果传⼊列表参数,Beautiful Soup会将与列表中任⼀元素匹配的内容返回.下⾯代码找到⽂档中所有
  
  soup.find_all(["a", "b"])
  
  # [The Dormouse's story,
  
  #
  
  ie
  
  ,
  
  # Lac
  
  ie,
  
  # Ti
  
  llie]
  
  2) keyword 参数
  
  soup.find_all(id_='link2')或soup.find_all(class_='link2')
  
  注意关键字后的下划线，没有下划线会报错
  
  # [Lac
  
  ie]
  
  3) text 参数
  
  通过 text 参数可以搜搜⽂档中的字符串内容，与 name 参数的可选值⼀样,text 参数接受字符串 , 正则表达式 , 列表
  
  soup.find_all(text="Elsie")
  
  # [u'Elsie']
  
  2.soup.find(name, attrs, recursive, text,**kwargs)
  
  找到第一个符合的对象
  
  三、css选择器
  
  与soup.find_all()类似，查找所有符合的节点并返回list
  
  (1)通过标签查找 soup.select(‘b’)
  
  返回所有标签为的节点
  
  (2)通过类名或ID查找
  
  Soup.select(‘.classname’)
  
  Soup.select(‘#id’)
  
  (3)组合查找
  
  标签+类Soup.select(‘b .classname’)
  
  返回b标签中类名为classname的节点
  
  子标签查找Soup.select(‘head>title’)
  
  返回父标签为head的title节点
  
  (4)属性查找
  
  Soup.select(“a[class=’link’]”)
  
  标签为a且class为link的节点
  
  Soup.select(“p a[class=’link’]”)
  
  返回p标签中a[class=’link’]的节点
  
  (5)获取内容
  
  get_text()
  
  四、爬网页图片
  
  1、目标网站
  
  1) 打开一个风景图的网站：https://www.enterdesk.com/
  
  2) 用 firebug 定位
  
  3)从下图可以看出，所有的图片都是 img 标签,父节点class属性为egeli_pic_dl
  
  2、用 find_all 找出所有的标签
  
  1).find_all(class_="legeli_pic_dl")获取所有的图片对象标签
  
  2).从标签里面提出 jpg 的 url 地址和 title
  
  1 from bs4 importBeautifulSoup2 importrequests3 importos4 r = requests.get("https://www.enterdesk.com/")5 #获取页面内容
  
  6 content =r.content7 #用html.parser解析html
  
  8 soup = BeautifulSoup(content, 'html.parser')9 #获取所有class为egene_pic_dl，返回tag类,为list
  
  10 all = soup.find_all(class_='egeli_pic_dl')11 for i inall:12 #获取图片路径和名称
  
  13 img_url = i.img['src']14 img_name = i.img['title']
  
  3.保存图片
  
  1).在当前脚本文件夹下创建一个 img 的子文件夹
  
  2).导入 os 模块，os.getcwd()这个方法可以获取当前脚本的路径
  
  3).用 open 打开写入本地电脑的文件路径，命名为：os.getcwd()+"\\img\\"+img_name+'.jpg'(命名重复的话，会被覆盖掉)
  
  4).requests 里 get 打开图片的 url 地址，content 方法返回的是二进制流文件，可以直接写到本地
  
  1 for i inall:2 #获取图片路径和名称
  
  3 img_url = i.img['src']4 img_name = i.img['title']5 #保存图片
  
  6 with open(os.getcwd()+'\\img\\'+img_name+'.jpg', 'wb') as f:7 f.write(requests.get(img_url).content)
  
  4.运行结果