Scraping information from the web with Python (in a crawler: soup.select() and soup.find

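1) Open a given URL with the webbrowser module

Read the URL from the command-line arguments in sys.argv, or paste it from the clipboard.
Open the web page with webbrowser.open().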

import webbrowser, sys, pyperclip
# The URL comes from the first command-line argument, otherwise from the clipboard
if len(sys.argv) > 1:
    content = sys.argv[1]
else:
    content = pyperclip.paste()
webbrowser.open(content)
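
The fallback logic above can also be written as a small function, which makes it easy to try without touching the real clipboard or browser. This is only a sketch: pick_target and open_target are hypothetical helper names, and read_clipboard stands in for pyperclip.paste.

```python
import webbrowser


def pick_target(argv, read_clipboard):
    """Return argv[1] if it was given, otherwise whatever read_clipboard() returns."""
    if len(argv) > 1:
        return argv[1]
    return read_clipboard()


def open_target(argv, read_clipboard, opener=webbrowser.open):
    """Pick the target and hand it to opener (webbrowser.open by default)."""
    target = pick_target(argv, read_clipboard)
    opener(target)
    return target


if __name__ == "__main__":
    # Dry run: print instead of launching a browser, and fake the clipboard.
    open_target(["test.py"], lambda: "https://blog.csdn.net/qq_45894443", opener=print)
```

In a real script the call would be open_target(sys.argv, pyperclip.paste).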

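Open the cmd command prompt and change the current working directory:

C:\Users\Lenovo>cd "F:\python_work"    # type the path you want to switch to
C:\Users\Lenovo>                       # the prompt does not change: cd recorded the directory for drive F: but did not switch drives
C:\Users\Lenovo>F:                     # switch to drive F: (cd /d "F:\python_work" would do both steps at once)
F:\python_work>test.py https://blog.csdn.net/qq_45894443  # pass the URL as a command-line argument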

2) Download a web page with the requests module and check for errors

import requests
res = requests.get("https://editor.csdn.net/md?articleId=107890815")
try:
    res.raise_for_status()  # raises an exception for 4xx/5xx status codes
except Exception as exc:
    print("There was a problem: %s" % (exc))

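When the request succeeded, res.raise_for_status() does nothing. If the page does not exist (or the server returns any other 4xx/5xx status), it raises an exception, which the try-except block prints: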

There was a problem: 404 Client Error: Not Found for url:
http://…
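
To see this behaviour without a network connection, a Response object can be built by hand. This is a sketch for experimenting only: check is a hypothetical helper, the URL is a placeholder, and in real code the Response comes from requests.get().

```python
import requests


def check(status_code, reason, url="http://example.invalid/page"):
    """Build a Response by hand and report what raise_for_status() does with it."""
    res = requests.Response()  # constructed locally, no request is sent
    res.status_code = status_code
    res.reason = reason
    res.url = url
    try:
        res.raise_for_status()
    except requests.exceptions.HTTPError as exc:
        return "There was a problem: %s" % exc
    return "no error"


print(check(200, "OK"))
print(check(404, "Not Found"))
```

Catching requests.exceptions.HTTPError instead of the broad Exception keeps unrelated bugs from being reported as download problems.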

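3) Save the downloaded file to disk

First, the file must be opened in write-binary mode by passing the string 'wb' as the second argument to open(). To write the web page to a file, use a for loop with the Response object's iter_content() method. If you do not need to write a file and just want to work with the HTML directly, use res.text.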

import requests
res = requests.get("https://www.sigs.tsinghua.edu.cn/zsjz/115163.jhtml")
try:
    res.raise_for_status()
except Exception as exc:
    print("There was a problem: %s" % (exc))
# Open in write-binary mode and write the response in chunks of up to 100000 bytes
with open("F:\\python_work\\zsjz_page.txt", 'wb') as file_object:
    for chunk in res.iter_content(100000):
        file_object.write(chunk)
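
The write loop does not depend on requests itself: it only needs something that yields bytes chunks. The sketch below factors it into a hypothetical helper, save_chunks, and exercises it with a hard-coded list of chunks and a temporary file, so it runs without a download.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of bytes chunks to path in binary mode; return the byte count."""
    total = 0
    with open(path, "wb") as file_object:
        for chunk in chunks:
            total += file_object.write(chunk)
    return total


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "page.txt")
    # Stand-in for res.iter_content(100000)
    written = save_chunks([b"<html>", b"<body>hello</body>", b"</html>"], demo_path)
    print(written, "bytes written to", demo_path)
```

With a real response the call would be save_chunks(res.iter_content(100000), "F:\\python_work\\zsjz_page.txt").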

4) Parse HTML with the BeautifulSoup module

Create a new HTML file with the following content and name it example.html:

<!-- This is the example.html example file. --> <html><head><title>The Website Title</title></head>
<body>
<p>Download my <strong>Python</strong> book from <a href="http:// inventwithpython.com">my website</a>.</p>
<p>Download my <strong>Python</strong> book from <a href="http:// inventwithpython-the-copied-one.com">my copied website</a>.</p>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Al Sweigart</span></p>
</body></html>

Now use BeautifulSoup to parse the HTML, find the element whose id attribute is author, and look up the links:

import bs4
fileObject = open("F:\\python_work\\CSDN\\example.html", 'rb')
soup = bs4.BeautifulSoup(fileObject, features='html.parser')
fileObject.close()  # the file has already been read by the parser
linkElem = soup.select('p #author')  # select() returns a list of Tag objects
print(linkElem[0].getText())  # Tag.getText() returns the text inside the matched Tag
print(str(linkElem[0]))  # str(Tag) shows the HTML markup the Tag represents
print(linkElem[0].attrs, '\n')  # Tag.attrs holds all of the Tag's HTML attributes as a dictionary
linkElem1 = soup.select('a[href]')  # find <a> elements that have an href attribute; returns a list
print(linkElem1, '\n')
linkElem2 = soup.find_all('a')  # find all <a> elements; returns a list
for elem in linkElem2:
    print(elem.get('href'))  # walk the list and extract each link
print('\n')
# Find <a> elements whose text is 'my website'; [0]['href'] takes the link of the first match
linkElem3 = soup.find_all('a', text='my website')[0]['href']
print(linkElem3)

Output:

Al Sweigart
<span id="author">Al Sweigart</span>
{'id': 'author'} 

[<a href="http:// inventwithpython.com">my website</a>, <a href="http:// inventwithpython-the-copied-one.com">my copied website</a>] 

http:// inventwithpython.com
http:// inventwithpython-the-copied-one.com


http:// inventwithpython.com
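
BeautifulSoup also accepts the markup as a string, which is handy for trying selectors without a file on disk. The sketch below uses a trimmed copy of example.html (the link paragraphs are left out) and assumes the bs4 package is installed.

```python
import bs4

# Trimmed copy of example.html, embedded as a string
EXAMPLE_HTML = """<!-- This is the example.html example file. -->
<html><head><title>The Website Title</title></head>
<body>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Al Sweigart</span></p>
</body></html>"""

soup = bs4.BeautifulSoup(EXAMPLE_HTML, features="html.parser")

author = soup.select("p #author")[0]   # first Tag matching the selector
author_text = author.getText()         # text inside the Tag
author_attrs = author.attrs            # the Tag's attributes as a dictionary
title_text = soup.title.string         # text of the <title> element

print(author_text)
print(str(author))
print(author_attrs)
print(title_text)
```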

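Examples of CSS selectors for select():

Selector passed to select()	Will match…
soup.select('div')	All elements named <div>
soup.select('#author')	The element with an id attribute of author
soup.select('.notice')	All elements that use a CSS class attribute named notice
soup.select('div span')	All <span> elements anywhere inside a <div> element
soup.select('div > span')	All <span> elements directly inside a <div> element, with no other element in between
soup.select('input[name]')	All elements named <input> that have a name attribute with any value
soup.select('input[type="button"]')	All elements named <input> that have a type attribute with the value button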

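Selector patterns can be combined to form more complex matches. For example, soup.select('p #author') matches any element whose id attribute is author, as long as it is also inside a <p> element.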
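
The selectors in the table can be tried on a small snippet. The markup below is made up for illustration (it is not part of example.html), and the counts in the comments are what these selectors should match on it.

```python
import bs4

html = """
<div><span>direct</span><p><span>nested</span></p></div>
<p class="notice">note</p>
<input name="q" type="text"><input type="button" value="Go">
"""
soup = bs4.BeautifulSoup(html, features="html.parser")

any_depth = soup.select("div span")       # both spans: any depth inside <div>
direct_only = soup.select("div > span")   # only the span that is a direct child
notices = soup.select(".notice")          # elements with class notice
named_inputs = soup.select("input[name]")             # inputs that have a name attribute
buttons = soup.select('input[type="button"]')         # inputs whose type is button

print(len(any_depth), len(direct_only), len(notices), len(named_inputs), len(buttons))
# → 2 1 1 1 1
```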

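The find_all() method of the soup object returned by BeautifulSoup:

find_all(tag, attributes, recursive, text, limit, keywords)

These are descriptive labels. In the bs4 signature the corresponding parameters are name, attrs, recursive, string (called text in older versions), limit, and **kwargs.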

tag: the tag name. You can search for a single tag, find_all('h1'), or for several at once, find_all(['h1', 'h2', 'h3']).

attributes: find tags by the attributes they carry. The attributes are passed as a dictionary, e.g. find_all(attrs={'class': 'red'}), or as a keyword argument, e.g. find_all(class_='red').

recursive: whether to search recursively. The default is True, which searches all descendants of the tag; with recursive=False only its direct children are searched.

text: match tags by the text they contain, e.g. find_all('a', text='inspirational').

Here, for example, to find my website and extract its link, soup.find_all('a', text='my website')[0]['href'] does it in one step.

limit: caps the number of results. Once set, the first limit matches in document order are returned.

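keyword: select tags that have a given attribute, e.g. find_all(class_='title'). This is equivalent to selecting by attribute without a tag name, find_all('', {'class': 'title'}), but the keyword form is shorter.

Note: write class_ when specifying the class attribute, because class is a reserved keyword in Python.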
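
The parameters above can be tried on a small made-up snippet. This is a sketch, assuming a bs4 version that accepts string= (the current name for the text argument); the markup and variable names are illustrative only.

```python
import bs4

html = """<html><body>
<h1 class="title">Top</h1>
<div><h2 class="title red">Inner</h2><a href="/a">my website</a><a href="/b">other</a></div>
</body></html>"""
soup = bs4.BeautifulSoup(html, features="html.parser")

headings = soup.find_all(["h1", "h2"])            # several tag names at once
red = soup.find_all(attrs={"class": "red"})       # attributes as a dictionary
titles = soup.find_all(class_="title")            # keyword form, note class_
# recursive=False looks only at direct children of <body>, so the nested <h2> is skipped
direct_h2 = soup.body.find_all("h2", recursive=False)
direct_h1 = soup.body.find_all("h1", recursive=False)
link = soup.find_all("a", string="my website")[0]["href"]   # match on the tag's text
first_link_only = soup.find_all("a", limit=1)     # stop after the first match

print(len(headings), len(red), len(titles), len(direct_h2), len(direct_h1), link, len(first_link_only))
# → 2 1 2 0 1 /a 1
```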

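The text argument is also fairly limited: it is compared with the tag's own string, so a tag whose text is split across nested child tags will generally not match. See https://www.jianshu.com/p/01780025d9a9 for details.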
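
One such limitation is easy to reproduce. The sketch below is an illustration with made-up markup, not an excerpt from the linked article, and it assumes a bs4 version that accepts string= (the current name for text).

```python
import bs4

html = '<a href="/plain">my website</a><a href="/bold">my <b>website</b></a>'
soup = bs4.BeautifulSoup(html, features="html.parser")

# The filter is compared with each tag's .string, which is None when the tag
# has more than one child node, so the second link is not found.
by_string = [a["href"] for a in soup.find_all("a", string="my website")]

# Comparing the full text of each tag catches both links.
by_full_text = [a["href"] for a in soup.find_all("a") if a.get_text() == "my website"]

print(by_string)
print(by_full_text)
```

If the markup you scrape mixes inline tags into link text, filtering on get_text() as in the second query is one possible workaround.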

In general, find_all is the quicker way to collect nodes in bulk, while soup.select is recommended when you need to pinpoint a specific node.
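
For simple conditions the two styles are interchangeable. The sketch below writes the same three queries both ways on made-up markup and checks that they return the same tags; it says nothing about speed.

```python
import bs4

html = """<p>Download from <a href="http://example.com/book">my website</a>, or see the <a>unlinked note</a>.</p>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Al Sweigart</span></p>"""
soup = bs4.BeautifulSoup(html, features="html.parser")

# Each entry pairs a CSS selector query with a find_all() call meant to match the same tags
pairs = {
    "has href": (soup.select("a[href]"), soup.find_all("a", href=True)),
    "by id": (soup.select("#author"), soup.find_all(id="author")),
    "by class": (soup.select("p.slogan"), soup.find_all("p", class_="slogan")),
}
for label, (via_select, via_find_all) in pairs.items():
    print(label, list(via_select) == list(via_find_all))
```

Nested conditions such as 'p #author' have no single find_all() equivalent; they need chained calls, which is where a selector tends to read more clearly.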