Python Web Scraping: How string, strings, stripped_strings, get_text, and text Differ

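There are many ways to extract text from HTML when scraping with Python. This post covers the BeautifulSoup attributes string, strings, stripped_strings, and text, and the method get_text.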

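string: returns the tag's text only when the tag has exactly one child, either a single string or a single tag that itself has a .string. The result is a string; in every other case it is None.

strings: yields every descendant string (non-tag text node) under the target tag. It returns a generator.

stripped_strings: yields every descendant string under the target tag, with leading and trailing whitespace stripped and whitespace-only strings skipped. It returns a generator.

get_text: a method, called as get_text(), that joins all descendant strings under the target tag into one string. Whitespace and line breaks from the HTML source are kept unless you pass strip=True.

text: a property that returns the same string as get_text() called with no arguments.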

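One extra note: when you get a generator back, you will usually want to convert it to a list, because printing the generator itself shows only a generator object, not its contents.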

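The examples below use a real-estate agency's website. The goal is to extract the information for each second-hand unit listed for sale.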

1. string

import requests
from bs4 import BeautifulSoup

url = 'https://gz.centanet.com/ershoufang/'
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"}

response = requests.get(url, headers=headers)
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(response.text, 'html.parser')

# One <p class="house-name"> element per listing
ps = soup.select('div[class="section"] div[class="house-item clearfix"] p[class="house-name"]')
for p in ps:
    house = p.string
    print(house)

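Running the code above prints a series of None values. string returns a value only when the tag has exactly one child, either a single string or a single tag that itself has a .string. As the figure below shows, each p.house-name element holds several child nodes, beginning with a whitespace-only text node, so string comes back as None.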
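To pin down when string returns a value, it helps to compare three tiny cases side by side. This is a minimal sketch on invented snippets (the names single, nested, and mixed are just labels for this example):

```python
from bs4 import BeautifulSoup

# Hypothetical snippets, invented to illustrate the rule.
single = BeautifulSoup("<p>Sunny Garden</p>", "html.parser").p
nested = BeautifulSoup("<p><b>Sunny Garden</b></p>", "html.parser").p
mixed = BeautifulSoup("<p>\n<b>Sunny Garden</b> 3 rooms</p>", "html.parser").p

print(single.string)  # Sunny Garden (the only child is a string)
print(nested.string)  # Sunny Garden (the only child is a tag, so .string looks inside it)
print(mixed.string)   # None (three children: a line break, <b>, and trailing text)
```

The third case matches what happens on the listing page: as soon as a tag has more than one child, string gives up and returns None, regardless of what the first child contains.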

2. strings

import requests
from bs4 import BeautifulSoup

url = 'https://gz.centanet.com/ershoufang/'
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"}

response = requests.get(url, headers=headers)
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(response.text, 'html.parser')

ps = soup.select('div[class="section"] div[class="house-item clearfix"] p[class="house-name"]')
for p in ps:
    # strings is a generator, so convert it to a list before printing
    house = list(p.strings)
    print(house)

As the figure below shows, each list contains 10 entries. They correspond to the text nodes boxed in red in the figure above.
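Because strings is a generator, it can be consumed only once per access. The following sketch, run on made-up markup rather than the live page, shows what printing the generator looks like and why converting it to a list is the usual approach:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration only.
html = "<p>\n<a>Sunny Garden</a>\n<span>3 rooms</span>\n</p>"
p = BeautifulSoup(html, "html.parser").p

gen = p.strings
print(gen)                 # a generator object, not the text itself

first_pass = list(gen)     # consumes the generator
second_pass = list(gen)    # nothing left to yield

print(first_pass)          # ['\n', 'Sunny Garden', '\n', '3 rooms', '\n']
print(second_pass)         # []

fresh = list(p.strings)    # reading p.strings again creates a new generator
```

If the strings are needed more than once, store the list rather than the generator.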

3. stripped_strings

import requests
from bs4 import BeautifulSoup

url = 'https://gz.centanet.com/ershoufang/'
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"}

response = requests.get(url, headers=headers)
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(response.text, 'html.parser')

houses = []
ps = soup.select('div[class="section"] div[class="house-item clearfix"] p[class="house-name"]')
for p in ps:
    # stripped_strings yields every text node under the tag in one pass,
    # with surrounding whitespace stripped and blank strings skipped
    house = list(p.stripped_strings)
    houses.append(house)
    print(house)
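One way to think about stripped_strings is as strings with a strip-and-filter step applied. The sketch below checks that equivalence on invented markup; the variable names manual and built_in are just labels for this example:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration only.
html = "<p>\n  <a>Sunny Garden</a>\n  <span> 3 rooms </span>\n</p>"
p = BeautifulSoup(html, "html.parser").p

# Strip each string by hand and drop the ones that end up empty
manual = [s.strip() for s in p.strings if s.strip()]
built_in = list(p.stripped_strings)

print(built_in)            # ['Sunny Garden', '3 rooms']
print(manual == built_in)  # True
```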

4. get_text

import requests
from bs4 import BeautifulSoup

url = 'https://gz.centanet.com/ershoufang/'
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"}

response = requests.get(url, headers=headers)
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(response.text, 'html.parser')

ps = soup.select('div[class="section"] div[class="house-item clearfix"] p[class="house-name"]')
for p in ps:
    # get_text is a method: without the parentheses you get the bound method, not the text
    house = p.get_text()
    print(house)
    print("==" * 40)

As the figure below shows, the part boxed in red is a single string.

5. text

import requests
from bs4 import BeautifulSoup

url = 'https://gz.centanet.com/ershoufang/'
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"}

response = requests.get(url, headers=headers)
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(response.text, 'html.parser')

ps = soup.select('div[class="section"] div[class="house-item clearfix"] p[class="house-name"]')
for p in ps:
    # text is a property, so no parentheses are needed
    house = p.text
    print(house)
    print("==" * 40)
