1. urllib and BeautifulSoup

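1.1 Basic usage of urllib

urllib is a package in the Python 3.x standard library that groups several modules for working with URLs. With it, a script can request web pages much as a browser would on behalf of a user.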

Steps:

Import the request module from urllib: from urllib import request
Request a URL, for example: resp = request.urlopen("http://www.baidu.com")
Read and decode the response, for example: print(resp.read().decode("utf-8"))

Example:

from urllib import request

resp = request.urlopen("http://www.baidu.com")
print(resp.read().decode("utf-8"))
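The same steps can be wrapped in a small helper. The sketch below is one possible shape rather than a definitive implementation: the function name fetch_text and its timeout and default_encoding parameters are assumptions made for illustration. It closes the response with a with block and uses the charset declared in the Content-Type header when the server sends one, falling back to UTF-8 otherwise.

```python
from urllib import request


def fetch_text(url, timeout=10, default_encoding="utf-8"):
    """Fetch a URL and return the body as text (hypothetical helper)."""
    # The with block closes the response even if reading fails
    with request.urlopen(url, timeout=timeout) as resp:
        raw = resp.read()
        # Use the charset from the Content-Type header when there is one
        charset = resp.headers.get_content_charset() or default_encoding
    return raw.decode(charset)
```

Calling fetch_text("http://www.baidu.com") would return the page source as a str. The timeout keeps the script from hanging indefinitely on an unresponsive server.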

1.1.1 Simulating a real browser

Send a User-Agent header with the request. The general pattern is:

# Create a request object with Request(url)
req = request.Request(url)
# Add a request header with add_header(key, value)
req.add_header(key, value)
# Send the request with urlopen
resp = request.urlopen(req)
# Decode the response bytes into text
print(resp.read().decode("utf-8"))

from urllib import request

# Create a request object with Request(url)
req = request.Request("http://www.baidu.com")
# Add a request header with add_header(key, value)
# The header name is "User-Agent", with a hyphen
req.add_header("User-Agent", "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Mobile Safari/537.36")
# Send the request with urlopen
resp = request.urlopen(req)
# Decode the response bytes into text
print(resp.read().decode("utf-8"))
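Headers can also be passed when the Request is created, and the result can be inspected before anything is sent. The following is a minimal sketch under stated assumptions: the helper name build_request and the User-Agent string are hypothetical and only serve as an illustration.

```python
from urllib import request

# Hypothetical User-Agent string, used only for illustration
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) ExampleCrawler/0.1"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries a User-Agent header (hypothetical helper)."""
    # Headers can be passed to the constructor instead of calling add_header
    return request.Request(url, headers={"User-Agent": user_agent})


req = build_request("http://www.baidu.com")
# Request stores header names capitalized, so the lookup key is "User-agent"
print(req.get_header("User-agent"))
```

Checking the header this way is a cheap way to catch a misspelled header name before the request goes out; the object is then passed to request.urlopen(req) as before.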

1.1.2 Using POST

Import the parse module from urllib: from urllib import parse
Build the POST data with urlencode:

postData = parse.urlencode([(key1, val1), (key2, val2), (keyn, valn)])

Send the POST request with postData:

request.urlopen(req, data=postData.encode('utf-8'))

Get the HTTP status code of the response: resp.status
Get the reason phrase that accompanies the status code (for example OK): resp.reason
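The encoding step can be checked without contacting a server. The sketch below builds a POST request and prints what would be sent; the helper name build_post_request, the URL and the form fields are all assumptions made for illustration.

```python
from urllib import parse, request


def build_post_request(url, fields):
    """Build a POST Request from a dict of form fields (hypothetical helper)."""
    # urlencode percent-encodes keys and values; encode() turns the str into bytes
    body = parse.urlencode(fields).encode("utf-8")
    # A Request that carries data is sent as POST
    return request.Request(url, data=body)


# Hypothetical URL and form fields, for illustration only
req = build_post_request("http://example.com/register", {"username": "admin", "email": ""})
print(req.get_method())  # POST
print(req.data)          # b'username=admin&email='
```

Passing this object to urlopen would send the form; resp.status and resp.reason can then be read from the returned response.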

Example:

from urllib.request import urlopen
from urllib.request import Request
from urllib import parse

req = Request("https://m.xbiquge.la/register.php")
# Build the POST data with parse.urlencode()
postData = parse.urlencode([
    ("SignupForm[username]", "admin"),
    ("SignupForm[password]", "123456"),
    ("SignupForm[email]", ""),
    ("register", "确认注册"),
])
req.add_header("Origin", "https://m.xbiquge.la")
# The header name is "User-Agent", with a hyphen
req.add_header("User-Agent", "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Mobile Safari/537.36")
resp = urlopen(req, data=postData.encode('utf-8'))
print(resp.read().decode("utf-8"))

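Output (the server returns an error page reporting that the user name already exists and that the email must not be empty):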

<!doctype html>
<html>
<head>
<title>出现错误！</title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name="MobileOptimized" content="240"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0,  minimum-scale=1.0, maximum-scale=1.0" />
<style>.content{margin:50px 7px 10px 7px;}.content a{color:#0080C0;}
</style>
</head>
<body>
<div class="content"><div class="c1"><h1>出现错误！</h1><strong>错误原因：</strong><ul>
<li>用户名已存在.<li/><li>email不能为空！<li/>        <br /><br /><br />请 <a href="javascript:history.back(1)">返 回</a> 并修正<br /><br />
</div>
</body>
</html>

1.2 BeautifulSoup

1.2.1 Installing BeautifulSoup4

Linux

sudo apt-get install python3-bs4

or, using pip:

sudo apt-get install python3-pip
pip install beautifulsoup4

Windows

pip install beautifulsoup4

or

pip3 install beautifulsoup4

Download: https://www.crummy.com/software/BeautifulSoup/#Download
Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#id4

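1.2.2 Using BeautifulSoup

Parse HTML with BeautifulSoup(html, 'html.parser')
Find a single node: soup.find(id='imooc')
Find multiple nodes: soup.find_all('a') (findAll is the older name for the same method)
Match with a regular expression: soup.find_all('a', href=reObj)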

from bs4 import BeautifulSoup as bs

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = bs(html_doc, 'html.parser')
# Pretty-print the parsed document
print(soup.prettify())

Output:

<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p><p class="story">...</p></body>
</html>

from bs4 import BeautifulSoup as bs
import re

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = bs(html_doc, 'html.parser')
# Pretty-print the parsed document
# print(soup.prettify())

# Get the title text
print(soup.title.string)  # The Dormouse's story
# Get the first a tag
print(soup.a)  # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# Get the tag whose id is link2
print(soup.find(id="link2"))  # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# Get the text of the tag whose id is link2
print(soup.find(id="link2").string)  # Lacie
print(soup.find(id="link2").get_text())  # Lacie
# Get all a tags
print(soup.find_all("a"))
'''
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
'''
# Print the text of every a tag
for link in soup.find_all("a"):
    print(link.string)
'''
Elsie
Lacie
Tillie
'''
print(soup.find('p', {"class": "story"}))
print(soup.find('p', {"class": "story"}).get_text())

# Search with regular expressions
# Find every tag whose name starts with b
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

data = soup.find_all("a", href=re.compile(r"^http://example\.com/"))
print(data)
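A common next step is to turn the matched tags into plain (text, URL) pairs. The sketch below shows one way to do that with the same find_all and regular-expression filter; the HTML fragment is made up for illustration, and the snippet assumes beautifulsoup4 is installed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, for illustration only
html = """
<p class="story">
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.org/other" id="link2">Other</a>
<a id="link3">No href</a>
</p>
"""

soup = BeautifulSoup(html, "html.parser")
# href=<compiled regex> keeps only tags whose href matches the pattern
links = soup.find_all("a", href=re.compile(r"^http://example\.com/"))
# A tag can be indexed like a dict to read its attributes
pairs = [(a.get_text(), a["href"]) for a in links]
print(pairs)  # [('Elsie', 'http://example.com/elsie')]
```

Tags without an href attribute, or with an href on another host, are simply left out of the result.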

1.3 Fetching Baidu Baike entry information

# Imports
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re


# Strip blank lines and stray whitespace
def bs_preprocess(html):
    """remove distracting whitespaces and newline characters"""
    pat = re.compile(r'(^[\s]+)|([\s]+$)', re.MULTILINE)
    html = re.sub(pat, '', html)  # remove leading and trailing whitespaces
    html = re.sub('\n', ' ', html)  # convert newlines to spaces
    # this preserves newline delimiters
    html = re.sub(r'[\s]+<', '<', html)  # remove whitespaces before opening tags
    html = re.sub(r'>[\s]+', '>', html)  # remove whitespaces after closing tags
    return html


# Request the URL and decode the response as UTF-8
resp = urlopen("https://baike.baidu.com/").read().decode("utf-8")
resp = bs_preprocess(resp)
# Parse with BeautifulSoup
soup = BeautifulSoup(resp, "html.parser")
# Pretty-print the parsed document
# print(soup.prettify())
# Walk every tag in document order
for tag in soup.find_all():
    # Remove em and img tags
    if tag.name in ["em", "img"]:
        tag.decompose()
    # Remove span tags whose parent is not a div
    elif tag.name == "span" and tag.parent is not None and tag.parent.name != "div":
        tag.decompose()
# Remove divs whose class name starts with category_ or content_cnt
for div in soup.find_all("div", {"class": re.compile("^(category_|content_cnt)")}):
    div.decompose()
# Get every a tag whose href starts with https://baike.baidu.com/item/
listUrls = soup.find_all("a", href=re.compile("^https://baike.baidu.com/item/"))
# print(listUrls)
# Print the name and URL of every entry
for url in listUrls:
    # Skip links whose text or href is empty
    if url.get_text().strip() != "" and url['href'].strip() != "":
        # string returns a single child string; get_text() returns all text under the tag
        print(url.get_text().strip(), "<--->", url['href'].strip())
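To see what bs_preprocess does to the markup, it can be run on a small fragment. The sketch below repeats the function so that it runs on its own; the sample string is made up for illustration.

```python
import re


def bs_preprocess(html):
    """Remove distracting whitespace and newline characters."""
    pat = re.compile(r'(^[\s]+)|([\s]+$)', re.MULTILINE)
    html = re.sub(pat, '', html)         # strip each line
    html = re.sub('\n', ' ', html)       # newlines become spaces
    html = re.sub(r'[\s]+<', '<', html)  # drop whitespace before a tag
    html = re.sub(r'>[\s]+', '>', html)  # drop whitespace after a tag
    return html


# Hypothetical input, for illustration only
sample = "  <p>\n    Hello   <b>world</b>  \n  </p>\n"
cleaned = bs_preprocess(sample)
print(cleaned)  # <p>Hello<b>world</b></p>
```

Note that whitespace next to a tag is removed entirely, so the space between Hello and the b tag disappears. That is harmless when only link text and href values are extracted, but it would glue words together if the full page text were needed.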

1.4 Storing data in MySQL

1.4.1 Installing and uninstalling

Install pymysql with pip

pip install pymysql

Install from a downloaded source package

python setup.py install

Uninstall

pip uninstall pymysql

1.4.2 Using pymysql

# Imports
import pymysql.cursors

# Open a database connection
connection = pymysql.connect(host='localhost', user='root', password='123456', db='baikeurl', charset='utf8mb4')
# Get a cursor
cursor = connection.cursor()
# Execute an SQL statement
cursor.execute(sql, (param1, paramN))
# Commit
connection.commit()
# Close
connection.close()

Create a new database named baikeurl (the name used in the connection above).

SQL to create the table:

SET NAMES utf8mb4;
SET FOREIGN_KEY_CHECKS = 0;

-- ----------------------------
-- Table structure for urls
-- ----------------------------
DROP TABLE IF EXISTS `urls`;
CREATE TABLE `urls`  (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `urlname` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
  `urlhref` varchar(1000) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
  PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB AUTO_INCREMENT = 1 CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci ROW_FORMAT = Dynamic;

SET FOREIGN_KEY_CHECKS = 1;

1.4.3 Storing data in MySQL

Open a database connection:

connection = pymysql.connect(host='localhost',user='root',password='123456',db='db',charset='utf8mb4')

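Get a cursor with connection.cursor()
Execute SQL with cursor.execute(sql, (param1, paramN))
Commit with connection.commit()
Close the connection with connection.close()
cursor.execute() returns the number of rows matched by a query
cursor.fetchone() fetches the next row
cursor.fetchmany(size=10) fetches the given number of rows
cursor.fetchall() fetches all rows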
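The call sequence above can be tried without a MySQL server. The sketch below swaps in the standard-library sqlite3 module, which follows the same connect, cursor, execute, commit and fetch pattern; this is a stand-in chosen so the example runs anywhere, not pymysql itself. Two differences to keep in mind: sqlite3 uses ? as the parameter placeholder where pymysql uses %s, and its execute() returns the cursor rather than a row count. The table rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL server (assumption)
connection = sqlite3.connect(":memory:")
try:
    cursor = connection.cursor()
    cursor.execute(
        "create table urls (id integer primary key autoincrement, urlname text, urlhref text)"
    )
    # Hypothetical rows, for illustration only
    rows = [
        ("Python", "https://baike.baidu.com/item/Python"),
        ("MySQL", "https://baike.baidu.com/item/MySQL"),
        ("PDF", "https://baike.baidu.com/item/PDF"),
    ]
    # Values are passed separately from the SQL text, never formatted into it
    cursor.executemany("insert into urls (urlname, urlhref) values (?, ?)", rows)
    connection.commit()

    cursor.execute("select urlname, urlhref from urls order by id")
    first = cursor.fetchone()        # the next row
    some = cursor.fetchmany(size=1)  # up to size rows
    rest = cursor.fetchall()         # every remaining row
finally:
    connection.close()

print(first)
print(some)
print(rest)
```

Each fetch call continues where the previous one stopped, so the three results together cover the three inserted rows exactly once.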

# Imports
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import pymysql.cursors


# Strip blank lines and stray whitespace
def bs_preprocess(html):
    """remove distracting whitespaces and newline characters"""
    pat = re.compile(r'(^[\s]+)|([\s]+$)', re.MULTILINE)
    html = re.sub(pat, '', html)  # remove leading and trailing whitespaces
    html = re.sub('\n', ' ', html)  # convert newlines to spaces
    # this preserves newline delimiters
    html = re.sub(r'[\s]+<', '<', html)  # remove whitespaces before opening tags
    html = re.sub(r'>[\s]+', '>', html)  # remove whitespaces after closing tags
    return html


# Request the URL and decode the response as UTF-8
resp = urlopen("https://baike.baidu.com/").read().decode("utf-8")
resp = bs_preprocess(resp)
# Parse with BeautifulSoup
soup = BeautifulSoup(resp, "html.parser")
# Pretty-print the parsed document
# print(soup.prettify())
# Walk every tag in document order
for tag in soup.find_all():
    # Remove em and img tags
    if tag.name in ["em", "img"]:
        tag.decompose()
    # Remove span tags whose parent is not a div
    elif tag.name == "span" and tag.parent is not None and tag.parent.name != "div":
        tag.decompose()
# Remove divs whose class name starts with category_ or content_cnt
for div in soup.find_all("div", {"class": re.compile("^(category_|content_cnt)")}):
    div.decompose()
# Get every a tag whose href starts with https://baike.baidu.com/item/
listUrls = soup.find_all("a", href=re.compile("^https://baike.baidu.com/item/"))
# print(listUrls)

# Open one database connection for the whole run
connection = pymysql.connect(host='localhost', user='root', password='123456', db='baikeurl', charset='utf8mb4')
try:
    # Get a cursor
    with connection.cursor() as cursor:
        # Print and store the name and URL of every entry
        for url in listUrls:
            # Skip links whose text or href is empty
            if url.get_text().strip() != "" and url['href'].strip() != "":
                # string returns a single child string; get_text() returns all text under the tag
                print(url.get_text().strip(), "<--->", url['href'].strip())
                # Build the SQL statement
                sql = "insert into `urls` (`urlname`, `urlhref`) values (%s, %s)"
                # Execute the SQL statement
                cursor.execute(sql, (url.get_text(), url['href']))
    # Commit
    connection.commit()
finally:
    connection.close()

1.4.4 Reading (querying) MySQL data

# Run the query; the return value is the number of rows
cursor.execute(sql)
# Fetch the next row
cursor.fetchone()
# Fetch up to size rows
cursor.fetchmany(size=None)
# Fetch all rows
cursor.fetchall()
# Close
connection.close()

# Imports
import pymysql.cursors

# Open a connection
connection = pymysql.connect(host='localhost', user='root', password='123456', db='baikeurl', charset='utf8mb4')
try:
    # Get a cursor
    with connection.cursor() as cursor:
        # Query
        sql = "select `urlname`, `urlhref` from `urls` where `id` is not null"
        count = cursor.execute(sql)
        print(count)
        # Fetch the data
        # result = cursor.fetchall()
        # print(result)
        result2 = cursor.fetchmany(size=3)
        print(result2)
finally:
    connection.close()
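When a table is large, fetchmany can be called in a loop so that only one batch of rows is held in memory at a time. The generator below is a minimal sketch of that idea; the name iter_batches is hypothetical, and sqlite3 again stands in for pymysql so the example runs without a server.

```python
import sqlite3


def iter_batches(cursor, size=2):
    """Yield lists of up to `size` rows until the result set is exhausted."""
    while True:
        batch = cursor.fetchmany(size)
        if not batch:
            break
        yield batch


# sqlite3 stands in for pymysql here (assumption)
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("create table urls (urlname text)")
cursor.executemany("insert into urls values (?)", [("a",), ("b",), ("c",)])
cursor.execute("select urlname from urls order by rowid")
batches = list(iter_batches(cursor, size=2))
connection.close()
print(batches)  # [[('a',), ('b',)], [('c',)]]
```

The loop stops on the first empty batch, which is how a DB-API cursor signals that no rows are left.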

2. Reading common document formats (TXT, PDF)

2.1 Reading a TXT document with Python

Read TXT documents: urlopen()
Read PDF documents: pdfminer3k

from urllib.request import urlopen

html = urlopen("https://www.csdn.net/robots.txt")
print(html.read().decode("utf-8"))

Output:

User-agent: *
Disallow: /scripts
Disallow: /public
Disallow: /css/
Disallow: /images/
Disallow: /content/
Disallow: /ui/
Disallow: /js/
Disallow: /scripts/
Disallow: /article_preview.html*
Disallow: /tag/
Disallow: /*?*
Disallow: /link/

Sitemap: https://www.csdn.net/sitemap-aggpage-index.xml
Sitemap: https://www.csdn.net/article/sitemap.txt
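Since the file fetched here is a robots.txt, its text can be handed to the standard-library urllib.robotparser module to check whether a given URL may be crawled. The sketch below feeds it a few of the rules from the output above; the two URLs being tested are made up for illustration.

```python
from urllib import robotparser

# A few of the rules from the robots.txt output
robots_txt = """\
User-agent: *
Disallow: /scripts
Disallow: /css/
Disallow: /images/
"""

parser = robotparser.RobotFileParser()
# parse() takes the lines of an already downloaded robots.txt
parser.parse(robots_txt.splitlines())

# Hypothetical URLs, for illustration only
print(parser.can_fetch("*", "https://www.csdn.net/css/site.css"))  # False
print(parser.can_fetch("*", "https://www.csdn.net/nav/python"))    # True
```

The sketch leaves out the wildcard rules such as /*?* on purpose; as far as I know the standard-library parser matches rules as plain path prefixes, so those lines would need separate handling.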

2.2 Installing pdfminer3k

Download: https://pypi.org/project/pdfminer3k/

pip install pdfminer3k

or, from a downloaded source package:

python setup.py install

2.3 Reading a PDF document with Python

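from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice

'''
File modes for open():
w: open for writing;
a: open for appending (starts at EOF, creates the file if needed);
r+: open for reading and writing;
w+: open for reading and writing (see w);
a+: open for reading and writing (see a);
rb: open for reading in binary mode;
wb: open for writing in binary mode (see w);
ab: open for appending in binary mode (see a);
rb+: open for reading and writing in binary mode (see r+);
wb+: open for reading and writing in binary mode (see w+);
ab+: open for reading and writing in binary mode (see a+)
'''
# Get the document object
# Open the file in binary read mode
fp = open("test.pdf", "rb")
# Create a parser bound to the file
parser = PDFParser(fp)
# The PDF document object
doc = PDFDocument()
# Link the parser and the document object
parser.set_document(doc)
doc.set_parser(parser)
# Initialize the document
doc.initialize("")  # empty password
# Create the PDF resource manager
resource = PDFResourceManager()
# Layout analysis parameters
laparam = LAParams()
# Create an aggregator
device = PDFPageAggregator(resource, laparams=laparam)
# Create the PDF page interpreter
interpreter = PDFPageInterpreter(resource, device)
# Get the collection of pages from the document object
for page in doc.get_pages():
    # Read the page with the page interpreter
    interpreter.process_page(page)
    # Get the content from the aggregator
    layout = device.get_result()
    for out in layout:
        if hasattr(out, "get_text"):
            print(out.get_text())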
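The binary file modes listed in the comment of the script are easy to mix up, so here is a small experiment with the three most common ones. It is only a sketch: the file name demo.bin is hypothetical and the file lives in a temporary directory that is deleted afterwards.

```python
import os
import tempfile

# Work in a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.bin")  # hypothetical file name

    with open(path, "wb") as fp:   # wb: create or truncate, then write bytes
        fp.write(b"abc")
    with open(path, "ab") as fp:   # ab: keep the content, write at the end
        fp.write(b"def")
    with open(path, "rb") as fp:   # rb: read bytes
        appended = fp.read()

    with open(path, "wb") as fp:   # wb again: the old content is discarded
        fp.write(b"xyz")
    with open(path, "rb") as fp:
        overwritten = fp.read()

print(appended)     # b'abcdef'
print(overwritten)  # b'xyz'
```

A PDF is a binary format, which is why the script opens test.pdf with "rb" rather than the default text mode.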

Output for the sample file test.pdf:

古之学者必有师。师者，所以传道受业解惑也。人非生而知之者，孰能无惑？
惑而不从师，其为惑也，终不解矣。生乎吾前，其闻道也固先乎吾，吾从而师之；
生乎吾后，其闻道也亦先乎吾，吾从而师之。吾师道也，夫庸知其年之先后生于吾
乎？是故无贵无贱，无长无少，道之所存，师之所存也。月份
一月份
二月份
三月份预期销售额700
500
800实际销售额650
600
600开始主页指南查询输入检索关
键字否是是否检索到相
关记录是结果分页显
示是否继续否结束

References:

https://www.bbsmax.com/A/VGzlB8Z15b/
https://www.bbsmax.com/A/ZOJPP97yJv/
https://www.shangmayuan.com/a/c7e0ed730ffb4a8e9f22936a.html
https://www.pythonf.cn/read/86
https://my.oschina.net/u/3238650/blog/2253390
https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#id4
