Crawler Basics: bs4 Data Parsing Examples

Scraping the novel Romance of the Three Kingdoms (三国演义)

# -*- coding:utf-8 -*-
import requests
from bs4 import BeautifulSoup
import lxml
if __name__ == "__main__":
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
    # Fetch the table-of-contents page
    url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
    response = requests.get(url=url, headers=headers)
    response.encoding = 'utf-8'
    soup = BeautifulSoup(response.text, 'lxml')
    # Select every li tag (one per chapter)
    list_li = soup.select('.book-mulu > ul > li')
    # print(list_li)
    # The with block closes the file even if a request raises
    with open('./sanguo.txt', 'w', encoding='utf-8') as fp:
        for li in list_li:
            detail_name = li.a.string
            detail_href = 'https://www.shicimingju.com' + li.a['href']
            # Request the chapter page and parse its content
            detail_page = requests.get(url=detail_href, headers=headers)
            detail_page.encoding = 'utf-8'
            detail_soup = BeautifulSoup(detail_page.text, 'lxml')
            # Locate the div that holds the chapter text by its class attribute
            div_tag = detail_soup.find('div', class_='chapter_content')
            # Drop the tags and keep only the text
            content = div_tag.text
            fp.write(detail_name + ':' + content + '\n')
            print(detail_name, 'scraped successfully!')
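The script above calls div_tag.text directly, which raises AttributeError whenever find returns None (for example if a chapter page has a different layout or the request was blocked). The following is a minimal sketch of a guarded version, not part of the original script. The helper name extract_chapter_text and the sample HTML are made up for illustration, and it uses the built-in 'html.parser' instead of 'lxml' so that it runs without extra packages beyond bs4.

```python
from typing import Optional

from bs4 import BeautifulSoup


def extract_chapter_text(html: str, css_class: str = "chapter_content") -> Optional[str]:
    """Return the text of the first <div class=css_class>, or None if it is absent.

    Hypothetical helper for illustration; the original script inlines this logic.
    """
    # 'html.parser' ships with Python; the scripts in this post use 'lxml'.
    soup = BeautifulSoup(html, "html.parser")
    div_tag = soup.find("div", class_=css_class)
    if div_tag is None:
        # find() returns None when nothing matches, so check before using it.
        return None
    # get_text() drops the tags; strip=True trims each text fragment.
    return div_tag.get_text(separator="\n", strip=True)


# Hypothetical chapter page, invented for this example.
SAMPLE_PAGE = """
<html><body>
  <div class="chapter_content"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""

print(extract_chapter_text(SAMPLE_PAGE))
print(extract_chapter_text("<html><body><p>no chapter here</p></body></html>"))
```

In the scraping loop, a None result could be used to skip the chapter and log its name instead of aborting the whole run.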

Scraping the novel Sheng Xu (圣墟)

import requests
from bs4 import BeautifulSoup
import lxml
import time
if __name__ == "__main__":
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
    url = 'http://www.bequgew.com/51561/'
    response = requests.get(url=url, headers=headers)
    response.encoding = 'utf-8'
    soup = BeautifulSoup(response.text, 'lxml')
    # Option 1: use select to find every li under the element with class article_texttitleb.
    # That element contains several ul blocks, so the descendant selector (a space before li)
    # is used instead of '.article_texttitleb > ul > li'.
    # list_li = soup.select('.article_texttitleb li')
    # Option 2: the next three statements have the same effect as the line above
    # and feed the for loop below.
    # find_all returns a list of every div whose class is article_texttitleb
    list_div = soup.find_all('div', class_='article_texttitleb')
    # Re-parse the first matching div (converted to a string) with BeautifulSoup
    div_bf = BeautifulSoup(str(list_div[0]), 'lxml')
    # find_all returns a list-like ResultSet of li tags (not a dictionary)
    li_all = div_bf.find_all('li')
    with open('./sx.txt', 'w', encoding='utf8') as fp:
        # for li in list_li:
        for li in li_all:
            detail_name = li.a.string
            detail_link = "http://www.bequgew.com" + li.a['href']
            detail_response = requests.get(url=detail_link, headers=headers)
            detail_response.encoding = 'utf-8'
            detail_soup = BeautifulSoup(detail_response.text, 'lxml')
            # find returns the first tag whose id is book_text (find_all would return a list);
            # .text extracts its content as plain text
            content = detail_soup.find('div', id='book_text').text
            fp.write(detail_name + content + '\n\n')
            print(detail_name, '--', 'download finished')
            # Pause between chapter requests
            time.sleep(10)
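The comments in the script above contrast the child combinator ('>') with the descendant selector (a space). A small offline sketch of the difference follows. The HTML is invented for illustration and assumes one ul sits directly under the container while another is wrapped in an extra div; whether the real page is laid out this way is an assumption. It uses the built-in 'html.parser' rather than 'lxml'.

```python
from bs4 import BeautifulSoup

# Hypothetical table of contents: the first <ul> is a direct child of the
# container, the second is nested inside an extra <div>.
SAMPLE_TOC = """
<div class="article_texttitleb">
  <ul><li><a href="/51561/1.html">Chapter 1</a></li></ul>
  <div class="more">
    <ul><li><a href="/51561/2.html">Chapter 2</a></li></ul>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_TOC, "html.parser")

# '>' only follows direct parent-child links: container > ul > li.
child_only = soup.select(".article_texttitleb > ul > li")

# A space matches li at any depth below the container.
all_descendants = soup.select(".article_texttitleb li")

child_titles = [li.a.string for li in child_only]
all_titles = [li.a.string for li in all_descendants]

print(child_titles)
print(all_titles)
```

The descendant form is the more forgiving choice when the nesting depth of the chapter list is not known in advance, at the cost of possibly matching li tags that are not chapters.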

Scraping images from Qiushibaike (糗事百科)

# -*- coding:utf-8 -*-
import requests
from bs4 import BeautifulSoup
import lxml
import os
if __name__ == "__main__":
    # Create the output directory on first run
    if not os.path.exists('./download'):
        os.mkdir('./download')
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
    url = 'https://www.qiushibaike.com/imgrank/'
    response = requests.get(url=url, headers=headers)
    parse_page = BeautifulSoup(response.text, 'lxml')
    # Select every img tag under an element with class thumb
    img_list = parse_page.select('.thumb img')
    for img_src in img_list:
        # The script assumes src has no scheme, so 'http:' is prepended
        img_http = 'http:' + img_src['src']
        # Use the last path segment as the file name
        img_name = img_http.split('/')[-1]
        img_path = './download/' + img_name
        # .content holds the raw bytes of the image
        img_content = requests.get(url=img_http, headers=headers).content
        # The with block closes the file, so no explicit close() is needed
        with open(img_path, 'wb') as fp:
            fp.write(img_content)
        print(img_name, '--', 'download finished')
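The script above builds the image URL by prepending 'http:' and takes the file name from the last '/'-separated piece. As a hedged alternative, the standard library can resolve the src against the page URL and ignore any query string when naming the file. This is only a sketch: the helper names resolve_image_url and image_file_name and the sample src value are hypothetical, not taken from the site.

```python
import posixpath
from urllib.parse import urljoin, urlsplit

PAGE_URL = "https://www.qiushibaike.com/imgrank/"


def resolve_image_url(page_url: str, src: str) -> str:
    """Resolve an <img src> value against the URL of the page it came from.

    urljoin handles absolute, protocol-relative ('//host/...') and
    relative values, reusing the page's scheme when src has none.
    """
    return urljoin(page_url, src)


def image_file_name(image_url: str) -> str:
    """Return the last path segment of the URL, ignoring any query string."""
    return posixpath.basename(urlsplit(image_url).path)


# Hypothetical src value, invented for this example.
src = "//pic.example.com/system/pictures/12345/medium/app12345.jpg?v=2"
full_url = resolve_image_url(PAGE_URL, src)

print(full_url)
print(image_file_name(full_url))
```

Because the page itself is fetched over https, resolving against PAGE_URL keeps the image request on https as well, whereas the hard-coded 'http:' prefix would not.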
