A Simple Guide to Using Python 3 Web Crawlers

I. Preface

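I don't work as a programmer, so why write this rough guide to using crawlers? Because I genuinely need one at work, and it saves me a lot of trouble.

Given a set of project records with a project number or project name, I can look up each project's amount and project manager on a web page. That is manageable for 10 records, but with 100 records, copying and pasting them into the page one by one is a lot of work. Using a crawler to submit the queries in bulk and collect the returned data cuts that workload considerably.

Note that I won't explain the underlying principles in detail: I am only half-trained in this myself and care mainly about practicality. Make sure you have some Python programming background first; everything here uses Python 3.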

II. Basics

1. Start with the simplest case: opening the Baidu home page.

import urllib.request


def url_test():
    """
    Create a Request instance. Besides the required url argument, two more can be set:
    data (default None): data submitted along with the url (e.g. the data to POST);
        supplying it changes the HTTP request from GET to POST.
    headers (default empty): a dict of the HTTP header key/value pairs to send.
    """
    ua_header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/"
                               "537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36"}
    ua_request = urllib.request.Request("http://www.baidu.com/", headers=ua_header)
    response2 = urllib.request.urlopen(ua_request)
    html = response2.read()
    # status code, final URL, and response headers
    res = [response2.getcode(), response2.geturl(), response2.info()]
    print(html)
    print(res)


if __name__ == '__main__':
    url_test()

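In the code above, ua_header holds the request headers, and urllib.request.Request("http://www.baidu.com/", headers=ua_header) takes the URL. urlopen opens the URL, read reads the response body, and html is the content of the Baidu home page (as bytes).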
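The docstring's point about data switching the request from GET to POST can be checked without sending anything. The sketch below only builds Request objects and inspects them; the header value and form field are placeholders chosen for illustration, not values from the original code.

```python
import urllib.parse
import urllib.request

# Placeholder header and form values, used only for illustration.
headers = {"User-Agent": "Mozilla/5.0 (example)"}
form = urllib.parse.urlencode({"q": "test"}).encode("utf-8")

get_request = urllib.request.Request("http://www.baidu.com/", headers=headers)
post_request = urllib.request.Request("http://www.baidu.com/", data=form, headers=headers)

print(get_request.get_method())    # GET: no data was supplied
print(post_request.get_method())   # POST: data was supplied
print(get_request.header_items())  # the headers attached to the request
```

Nothing goes over the network until urlopen is called, so this is a cheap way to confirm what a request will look like.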

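The main request headers mean the following:

1. Host (host and port)

Host: the host name and port number from the URL. It specifies the Internet host and port of the requested resource and is normally taken from the URL itself.

2. Connection (connection type)

Connection: the type of connection between the client and the server.

The client sends a request containing Connection: keep-alive; HTTP/1.1 uses keep-alive by default.
After the server receives the request:
- If the server supports keep-alive, it replies with a response containing Connection: keep-alive and leaves the connection open;
- If the server does not support keep-alive, it replies with a response containing Connection: close and closes the connection.
If the client receives a response containing Connection: keep-alive, it sends the next request over the same connection, until one side closes it.

In many cases keep-alive lets a connection be reused, which reduces resource consumption and shortens response time. For example, when a browser needs several files (such as an HTML file and its related images), it does not have to set up a new connection for each request.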
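Connection reuse can be observed locally. The sketch below is not part of the original article: it starts a throwaway HTTP/1.1 server on 127.0.0.1 (the handler name and paths are made up for the demo) and sends two requests through one http.client connection. The server records the client's source port for each request; identical ports mean both requests travelled over the same TCP connection.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

seen_client_ports = []  # source port of each request, to show the socket is reused


class KeepAliveHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # HTTP/1.1 keeps the connection open by default

    def do_GET(self):
        seen_client_ports.append(self.client_address[1])
        body = b"ok"
        self.send_response(200)
        # Content-Length tells the client where the body ends, so the
        # connection can stay open for the next request.
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), KeepAliveHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
for path in ("/a", "/b"):
    conn.request("GET", path)
    response = conn.getresponse()
    response.read()  # the body must be read before the connection can be reused
conn.close()
server.shutdown()
server.server_close()

print(seen_client_ports)  # two entries with the same port number
```

By contrast, each separate urllib.request.urlopen call opens its own connection, which is one reason bulk crawling with it is slow.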

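3. Upgrade-Insecure-Requests (upgrade to HTTPS requests)

Upgrade-Insecure-Requests: "upgrade insecure requests". It tells the server that the browser prefers HTTPS and can have http resources automatically replaced with https requests when a page loads, so that the browser no longer shows warnings about http requests inside https pages.

*HTTPS is an HTTP channel designed for security, so a page served over HTTPS is not allowed to contain HTTP requests; when one appears, the browser shows a warning or an error.*

4. User-Agent (browser name)

User-Agent: identifies the client browser. Fill in whatever your own browser sends.

5. Accept (accepted content types)

Accept: the MIME (Multipurpose Internet Mail Extensions) types that the browser or other client can accept. The server can use it to decide on and return a suitable format.

6. Referer (page the request came from)

Referer: the URL of the page that produced the request, i.e. the user reached the currently requested page from the Referer page. This header can be used to track which page, and which site, a web request came from.

Sometimes downloading an image from a site requires a matching Referer, otherwise the download fails. That is because the site uses hotlink protection: it checks the Referer to decide whether the request comes from its own pages, refuses it if not, and allows the download if so.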
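A minimal sketch of attaching a Referer when requesting an image. The function name and both URLs are hypothetical; whether a given site accepts the request depends on its own hotlink rules.

```python
import urllib.request


def build_image_request(image_url, page_url):
    """Build a request for image_url that declares page_url as its Referer."""
    headers = {
        "User-Agent": "Mozilla/5.0 (example)",  # placeholder value
        "Referer": page_url,  # the page that embeds the image
    }
    return urllib.request.Request(image_url, headers=headers)


# Hypothetical URLs, for illustration only.
req = build_image_request("https://img.example.com/a.png", "https://www.example.com/post/1")
print(req.get_header("Referer"))
```

Passing the returned object to urllib.request.urlopen would then send the image request with that Referer attached.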

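7. Accept-Encoding (content encodings)

Accept-Encoding: the encodings the browser can accept. An encoding is different from a file format: it exists to compress the content and speed up transfer. After receiving the response, the browser first decodes it and then checks the file format, which in many cases cuts download time considerably.

8. Accept-Language (languages)

Accept-Language: the languages the browser can accept, e.g. en or en-us for English, zh or zh-cn for Chinese. It is used when the server can provide more than one language version.

9. Accept-Charset (character encodings)

10. Cookie (Cookie)

Cookie: the browser uses this header to send cookies to the server. A cookie is a small piece of data stored in the browser that can record user information related to the server; put simply, it is tied to user login.

11. Content-Type (type of the POST data)

Content-Type: the type of the content carried in the body of a POST request.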

2. The GET method

# urllib's urlencode() takes a dict as its argument
import urllib.parse
import urllib.request

url = "https://tieba.baidu.com/f?ie=utf-8&"
ua_header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML"
                           ", like Gecko) Chrome/92.0.4515.107 Safari/537.36"}
values = {"kw": "游戏王ex2006"}
data = urllib.parse.urlencode(values)
url = url + data
ua_request = urllib.request.Request(url, headers=ua_header)
response = urllib.request.urlopen(ua_request)
# save the returned page to a local file
with open("index.html", "wb") as f:
    f.write(response.read())

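Open Baidu Tieba, search for 游戏王ex2006, and look at the URL in the browser's Network panel; you will find the string https://tieba.baidu.com/f?ie=utf-8&kw=%E6%B8%B8%E6%88%8F%E7%8E%8Bex2006&fr=search.

This is because data submitted in an HTTP request generally has to be URL-encoded and then either made part of the URL or passed to the Request object as a parameter. urllib.parse.urlencode(values) is what converts 游戏王ex2006 into its URL-encoded form.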
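The encoding can be tried in isolation. This short sketch encodes the same search keyword and then decodes the URL quoted above with parse_qs, which is a convenient way to read the parameters of a URL copied from the browser.

```python
import urllib.parse

values = {"kw": "游戏王ex2006"}
query = urllib.parse.urlencode(values)  # non-ASCII characters become %XX-escaped UTF-8 bytes
print(query)

# parse_qs reverses the encoding and returns each parameter as a list of values.
copied_url = "https://tieba.baidu.com/f?ie=utf-8&kw=%E6%B8%B8%E6%88%8F%E7%8E%8Bex2006&fr=search"
params = urllib.parse.parse_qs(urllib.parse.urlsplit(copied_url).query)
print(params["kw"])
```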

3. The POST method

from urllib import request
from urllib import parse
import time
import random
import hashlib


def get_salt():
    # Mirrors the site's JS quoted at the bottom: ts is the millisecond timestamp
    # as a string, and salt is ts with one random digit appended.
    lts_get = str(int(time.time() * 1000))
    salt_get = lts_get + str(random.randint(0, 9))
    return lts_get, salt_get


def get_md5(v, salt_value):
    # MD5 cannot be reversed, but the same input always produces the same digest,
    # which is a weakness: digests can be matched by collision or lookup.
    # navigator.appVersion in the JS corresponds to the User-Agent here
    bv = hashlib.md5(ua_header["User-Agent"].encode("utf-8")).hexdigest()
    # md5 needs bytes. A fresh md5 object is used so that sign covers only sign_get,
    # and the salt is the same one that is sent in the form.
    sign_get = "fanyideskweb" + v + salt_value + "Y2FYu%TNSbMCxc3t2u^XT"
    sign = hashlib.md5(sign_get.encode('utf-8')).hexdigest()  # the signature string
    return bv, sign


url = "https://fanyi.youdao.com/translate?smartresult_o=dict&smartresult=rule"
ua_header = {
    "Host": "fanyi.youdao.com",
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "X-Requested-With": "XMLHttpRequest",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "sec-ch-ua-mobile": "?0"
}

print("Enter the word to look up:")
content = input()
lts, salt = get_salt()
bv, sign = get_md5(content, salt)
# POST data of Youdao Translate, obtained by capturing the browser's traffic
post_json = {
    "i": content,
    "from": "AUTO",
    "to": "AUTO",
    "smartresult": "dict",
    "client": "fanyideskweb",
    "salt": salt,
    "sign": sign,
    "lts": lts,
    "bv": bv,
    "doctype": "json",
    "version": "2.1",
    "keyfrom": "fanyi.web",
    "action": "FY_BY_REALTlME"
}
post_data = parse.urlencode(post_json)
ua_request = request.Request(url=url, data=post_data.encode("utf-8"), headers=ua_header)
html = request.urlopen(ua_request).read().decode("utf-8")
print(html)

"""
fanyi.min.js
define("newweb/common/service", ["./utils", "./md5", "./jquery-1.7"], function(e, t) {
    var n = e("./jquery-1.7");
    e("./utils");
    e("./md5");
    var r = function(e) {
        var t = n.md5(navigator.appVersion),
            r = "" + (new Date).getTime(),
            i = r + parseInt(10 * Math.random(), 10);
        return {
            ts: r,
            bv: t,
            salt: i,
            sign: n.md5("fanyideskweb" + e + i + "Y2FYu%TNSbMCxc3t2u^XT")
        }
    };
"""

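The code above is this long because it has to get around the site's anti-crawler mechanism. For where the details come from, see this article: https://blog.csdn.net/qq_22808061/article/details/119385740, on the JS encryption met when crawling Youdao Translate (August 3, 2021; Python 3 crawler).

In the code above, focus on these two statements: post_data = parse.urlencode(post_json) and ua_request = request.Request(url=url, data=post_data.encode("utf-8"), headers=ua_header).

parse.urlencode(post_json) URL-encodes the POST data and joins the key/value pairs into one string; the encoded bytes are then sent as the request body.

One thing to note: do not add Accept-Encoding to the ua_header request headers, otherwise the server may return a compressed body that has to be decompressed after reading.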
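If Accept-Encoding is sent anyway, the body has to be decompressed before decoding. The helper below is a sketch under the assumption that the server uses gzip; decode_body is a made-up name, and the compressed bytes are simulated locally rather than fetched.

```python
import gzip


def decode_body(raw, content_encoding=None, charset="utf-8"):
    """Return the response body as text, decompressing gzip data when needed."""
    if content_encoding == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode(charset)


# Simulated response body. In real use, raw would be response.read() and
# content_encoding would be response.headers.get("Content-Encoding").
compressed = gzip.compress("翻译结果".encode("utf-8"))
print(decode_body(compressed, "gzip"))
print(decode_body(b"plain text"))
```

Leaving the header out, as the article suggests, is simpler for small responses; compression mainly pays off when the pages are large.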

4. A case study with GET and POST: bulk-downloading images from Baidu Tieba

import os
import urllib.request
import urllib.parse
from lxml import etree
import re
import time


class Spider:
    def __init__(self):
        self.tiebaName = "噬神者2"  # input("Enter the tieba to visit")
        self.beginPage = 46  # int(input("Enter the first page"))
        self.endPage = 50  # int(input("Enter the last page"))
        self.url = "http://tieba.baidu.com/f"
        self.ua_header = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36",
            "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6",
            "sec-ch-ua": '"Google Chrome";v="93", " Not;A Brand";v="99", "Chromium";v="93"',
            "sec-ch-ua-mobile": "?0",
            "sec-ch-ua-platform": "Windows",
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Site": "none",
            "Sec-Fetch-User": "?1",
            "Upgrade-Insecure-Requests": "1"
        }
        # image counter, used in the saved file names
        self.userName = 1

    def tiebaSpider(self):
        for page in range(self.beginPage, self.endPage + 1):
            pn = (page - 1) * 50  # offset of the first thread on this page
            word = {'kw': self.tiebaName, 'pn': pn}
            word = urllib.parse.urlencode(word)
            myUrl = self.url + "?" + word
            print(myUrl)
            time.sleep(15)
            self.loadPage(myUrl)

    def loadPage(self, url):
        req = urllib.request.Request(url, headers=self.ua_header)
        html = urllib.request.urlopen(req).read().decode('utf-8')
        # Grab the second half of every thread URL on this page, i.e. the thread id:
        # the "p/4884069807" in http://tieba.baidu.com/p/4884069807
        print(html)
        # Without re.S, "." does not match a newline, so a match cannot span lines.
        # With re.S, "." matches newlines too and the whole string is searched as one unit.
        pattern = re.compile(r'<div.*?class="threadlist_lz clearfix">(.*?)</div>', re.S)
        item_list = pattern.findall(html)
        # ***** XPath failed here. Pretty-printing the HTML showed that the content we want
        # is commented out. I don't know why yet; the page looks fine when opened in a browser.
        # I spent a long time on this without finding out why it is commented out; XPath simply
        # would not match, so I used a regular expression instead.
        # Parse the HTML into a document tree
        print(item_list)
        selector = etree.HTML(html)
        links = selector.xpath('//div[@class="threadlist_lz clearfix"]/div/a/@href')
        print("LINKS=%s" % links)
        for item in item_list:
            print(item)
            pattern = re.compile(r'href="(/p/\d{10})')
            m = pattern.search(item)
            if m:
                print(m)
                link = "http://tieba.baidu.com" + m.group(1)
                print(link)
                time.sleep(15)
                self.loadImages(link)
            else:
                continue

    def loadImages(self, link):
        """Fetch the images in one thread"""
        req = urllib.request.Request(link, headers=self.ua_header)
        html = urllib.request.urlopen(req).read()
        selector = etree.HTML(html)
        # Get the src of every image in this thread
        # Oddly enough, XPath works here
        imagesLinks = selector.xpath('//img[@class="BDE_Image"]/@src')
        # Take each image URL in turn, download it, and save it
        for imagesLink in imagesLinks:
            # In practice some images are broken and cannot be opened on Tieba, which crashes
            # the program; this check filters them out
            if imagesLink.startswith("https://imgsa.baidu.com/"):
                self.writeImages(imagesLink)
                time.sleep(15)
            else:
                continue

    def writeImages(self, imagesLink):
        """Save the image to a file numbered with userName"""
        print(imagesLink)
        # create the folder if it does not exist
        if not os.path.exists('./images/'):
            os.mkdir('./images/')
        images = urllib.request.urlopen(imagesLink).read()
        with open('./images/' + "噬神者2-46页-50页" + str(self.userName) + '.png', 'wb') as file:
            file.write(images)
        self.userName += 1


if __name__ == "__main__":
    mySpider = Spider()
    mySpider.tiebaSpider()

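This code uses a little XPath, and part of that failed. In later use I also found that XPath is often inaccurate or unusable, so I recommend locating content with regular expressions (re) instead. After several runs this code seems to have no major problems; it is just slow at downloading and saving images. It is best not to remove the 15-second time.sleep(15) pauses: at first I ran without them, and my IP was blocked for a while.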
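A likely reason for the XPath failure noted in the code comments is that an HTML parser keeps commented-out markup as a single comment node, so element queries never see the tags inside it, while a regular expression works on the raw text and ignores the comment markers. The sketch below uses a made-up page fragment (not real Tieba output) with the same two patterns, and also shows what re.S changes.

```python
import re

# Hypothetical page fragment: the thread list sits inside an HTML comment and spans lines.
sample_html = """<html><body>
<!--
<div class="threadlist_lz clearfix">
  <a href="/p/4884069807" title="example thread">example thread</a>
</div>
-->
</body></html>"""

block_pattern = re.compile(r'<div.*?class="threadlist_lz clearfix">(.*?)</div>', re.S)
link_pattern = re.compile(r'href="(/p/\d{10})')

# The regex sees the raw text, so the surrounding <!-- --> makes no difference.
blocks = block_pattern.findall(sample_html)
links = [m.group(1) for m in map(link_pattern.search, blocks) if m]
print(links)

# Without re.S the dot stops at newlines, so the multi-line block is not matched.
without_flag = re.findall(r'<div.*?class="threadlist_lz clearfix">(.*?)</div>', sample_html)
print(without_flag)
```

If XPath is preferred, another option under the same assumption is to strip the comment markers from the text before parsing it.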

5. Login and cookies

import re
from http import cookiejar
from urllib import parse
from urllib import request


def login_IT():
    """Login module"""
    user_name = input("Enter your account name: ")
    passwd = input("Enter your password: ")
    # 1. Build a CookieJar instance to store the cookies
    cookie_object = cookiejar.CookieJar()
    # 2. Use HTTPCookieProcessor() to create a cookie handler; its argument is the CookieJar object
    cookie_handler = request.HTTPCookieProcessor(cookie_object)
    # 3. Build an opener with build_opener()
    opener = request.build_opener(cookie_handler)
    # 4. addheaders takes a list in which each element is a (name, value) tuple;
    #    the opener sends these headers with every request
    opener.addheaders = [
        ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/87.0.4280.141 Safari/537.36"),
        ("Content-Type", "application/x-www-form-urlencoded; charset=UTF-8"),
        ("X-Requested-With", "XMLHttpRequest")]
    # 5. The account name and password to log in with
    data = {"LoginID": user_name, "Password": passwd, "X-Requested-With": "XMLHttpRequest"}
    # 6. URL-encode the data with urlencode()
    postdata = parse.urlencode(data)
    try:
        # 7. Build the Request object carrying the user name and password
        us_request = request.Request("http://132.230.114.133/DICT/ExpIndex/Login/CheckLogin",
                                     data=postdata.encode("utf-8"))
        # 8. Send the request through the opener, which stores the cookies set at login
        opener.open(us_request)
        # 9. The opener now holds the login cookies, so it can open pages that require login
        response = opener.open("http://132.230.114.133/Default")
        # 10. Read the response content
        response_utf = response.read().decode('utf-8')
        # print(response_utf)
        pattern = re.compile(r'<a href="#" onclick="showUserInfo\(\);">(.+?)</a>', re.S)
        result = pattern.findall(response_utf)
        if result:
            # strip Latin letters, digits, and non-word characters, leaving the user's name
            pattern = re.compile(r"[A-Za-z0-9\W]")
            m = pattern.sub('', result[0])
            # print(m)
            print("Login succeeded!\nWelcome, %s" % m)
        else:
            print("Login failed. Please check, or verify in a browser!")
            return 0
        cookie_str = ""
        for item in cookie_object:
            cookie_str = cookie_str + item.name + "=" + item.value + ";"
        login_cookie = cookie_str  # this is the cookie we want
        return login_cookie
    except Exception as e:
        print(5 * "!" + "Software error" + 5 * "!")
        print("Login failed. Please check, or verify in a browser!")
        print(e)
        return 0

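This way of logging in is rarely used, because it only works on simple self-built sites with no verification step at all. What about complex sites that require verification to log in? Log in through the web page, find the cookie issued after login, and paste it straight into the code. Don't bother with anything fancier.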
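A minimal sketch of the copy-and-paste approach. The cookie string and URL below are placeholders; in practice the string is the Cookie request header value copied from the browser's developer tools after logging in.

```python
import urllib.request
from http.cookies import SimpleCookie

# Placeholder cookie string standing in for the value copied from the browser.
copied_cookie = "sessionid=abc123; theme=dark"

headers = {
    "User-Agent": "Mozilla/5.0 (example)",  # placeholder value
    "Cookie": copied_cookie,
}
# Hypothetical URL of a page that requires login.
req = urllib.request.Request("http://www.example.com/private", headers=headers)
print(req.get_header("Cookie"))

# SimpleCookie can split the string when individual values are needed.
parsed = SimpleCookie()
parsed.load(copied_cookie)
cookie_values = {name: morsel.value for name, morsel in parsed.items()}
print(cookie_values)
```

A pasted cookie stops working once the session expires, at which point it has to be copied again.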

6. Using CSV

import csv


def write_csv_dict():
    headers = ["name", "age", "height"]
    values = [{"name": "小王", "age": 18, "height": 178},
              {"name": "小王", "age": 18, "height": 178},
              {"name": "小王", "age": 18, "height": 178}]
    with open("dict_demo.csv", "w", encoding="utf-8-sig", newline='') as f1:
        # csv.DictWriter() takes two arguments: the file object and the column titles
        writer = csv.DictWriter(f1, headers)
        # write the header row
        writer.writeheader()
        for value in values:
            writer.writerow(value)


def read_csv_dict():
    # "ANSI" is a Windows-only codec name; on other systems use the file's actual encoding
    with open("test.csv", "r", encoding="ANSI") as fp:
        # DictReader yields one dict per row; the first row supplies the keys
        # and is not returned as data
        reader = csv.DictReader(fp)
        for values in reader:
            print(values)


write_csv_dict()
read_csv_dict()

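On garbled Chinese text, see https://blog.csdn.net/qq_39248703/article/details/80175976?utm_medium=distribute.pc_relevant.none-task-blog-2_defaultbaidujs_baidulandingword~default-0.opensearchhbase&spm=1001.2101.3001.4242.1

The data is read in bulk, so the returned results need to be saved in bulk as well. I chose CSV for this and use the dict-based reader and writer for both. When writing the CSV, encoding="utf-8-sig" is set to prevent garbled characters; see the link above.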
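What utf-8-sig adds can be seen by looking at the first bytes of the file. The sketch below writes a small CSV to a temporary folder (the file name is arbitrary) and reads it back; the three leading bytes are the UTF-8 byte order mark, which spreadsheet programs such as Excel typically use to recognize the file as UTF-8.

```python
import csv
import os
import tempfile

rows = [{"name": "小王", "age": 18}]
path = os.path.join(tempfile.mkdtemp(), "bom_demo.csv")  # throwaway file for the demo

with open(path, "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.DictWriter(f, ["name", "age"])
    writer.writeheader()
    writer.writerows(rows)

with open(path, "rb") as f:
    first_bytes = f.read(3)
print(first_bytes)  # the byte order mark written by utf-8-sig

# Reading with utf-8-sig strips the mark again, so the first key is plain "name".
with open(path, "r", encoding="utf-8-sig", newline="") as f:
    read_back = list(csv.DictReader(f))
print(read_back)
```

Note that values come back as strings (age is "18", not 18), so numeric columns need converting after reading.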

III. The final code

import time
from urllib import request
from urllib import parse
import csv
import json
import os


class DICT_search:
    def __init__(self):
        self.url = "http://132.230.114.133/DICT/PMS/Project/getTableList"
        self.ua_header = {
            "Host": "132.230.114.133",
            "Cookie": "ASP.NET_SessionId=nnkl0nvimv5yxbe3jhzg0gzu",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, "
                          "like Gecko) Chrome/87.0.4280.141 Safari/537.36"
        }
        self.post_json = {
            "page": "1",
            "limit": "2000",
            "sort": "",
            "order": "",
            "SearchType": "1",
            "DelFlag": "0",
            "YearFlag": "",
            "ContractBH": "",
            "ProjectBH": "",
            "ProjectName": "",
            "QuJu": "",
            "TotalAmountS": "",
            "TotalAmountE": "",
            "QXSK": "",
            "HXFK": "",
            "Provider": "",
            "Status": "",
            "SZStatus": "",
            "shouqian_Name": "",
            "shouzhong_Name": "",
            "zonghe_Name": "",
        }
        # Output columns. The first four are filled from the query itself:
        # 项目编号 = project number, 项目名称 = project name,
        # 售中状态 = in-sale status, 结果 = number of results
        self.header_catalog = ["项目编号", "项目名称", "售中状态", "结果", "ProjectName", "CreateTime",
                               "ProjectCategory", "SZStatusName", "shouzhong_Name", "shouqian_Name",
                               "ProjectBH", "RowGuid"]

    def start(self):
        self.check_load_csv()
        self.write_csv_dict_header()

    def load_csv(self):
        """Read the table of items to query"""
        # To find the file's encoding, right-click it, open it in Notepad, and check the encoding shown
        try:
            with open("load_csv.csv", "r", encoding="ANSI") as fr:
                reader = csv.DictReader(fr)
                for values in reader:
                    self.search_project(values)
                    time.sleep(0.5)
        except Exception as e:
            print(5 * "!" + "Software error" + 5 * "!")
            print(e)
            return 0

    def search_project(self, search_values):
        """Query one project"""
        self.post_json["ProjectBH"] = search_values["ProjectBH"]
        self.post_json["ProjectName"] = search_values["ProjectName"]
        self.post_json["SZStatus"] = search_values["SZStatus"]
        # print(self.post_json)
        post_data = parse.urlencode(self.post_json)
        try:
            ua_request = request.Request(url=self.url, data=post_data.encode('utf-8'),
                                         headers=self.ua_header)
            response = request.urlopen(ua_request).read().decode('utf-8')
            # print(response)
            data = json.loads(response)
            # print(data)
            if data['count'] == 0:
                self.write_csv_dict(search_values, result_count=0, data={})
            elif data['count'] == 1:
                # print(data['data'][0].keys())
                self.write_csv_dict(search_values, result_count=1, data=data)
            else:
                self.write_csv_dict(search_values, result_count=data['count'], data=data)
        except Exception as e:
            print(5 * "!" + "Software error" + 5 * "!")
            print(e)
            return 0

    def write_csv_dict_header(self):
        """Write the header row, then run the queries"""
        if not os.path.exists('./result/'):
            os.mkdir('./result')
        try:
            with open("./result/result.csv", "w", encoding="ANSI", newline="") as fw:
                writer_query_catalog = csv.DictWriter(fw, self.header_catalog, extrasaction="ignore")
                writer_query_catalog.writeheader()
            # the header file is closed before rows are appended to it
            self.load_csv()
        except IOError as e:
            print(20 * "*")
            print("result.csv is already open. Back it up and try again; the file will be overwritten")
            print(20 * "*")
            print("Press any key to exit")
        except Exception as e:
            print(5 * "!" + "Software error" + 5 * "!")
            print(e)
            return 0

    def write_csv_dict(self, search_values, result_count, data):
        """Write rows according to the number of results found"""
        dict_result_catalog = {"项目编号": search_values["ProjectBH"],
                               "项目名称": search_values["ProjectName"],
                               "售中状态": search_values["SZStatus"],
                               "结果": result_count}
        if data != {}:
            data["data"][0]["RowGuid"] = 'http://132.230.114.133/RowGuid=' + data["data"][0]["RowGuid"]
            dict_result_catalog.update(data['data'][0])
        # the first row is always written, even when nothing was found
        with open("./result/result.csv", "a", encoding="ANSI", newline="") as fw:
            writer_query_catalog = csv.DictWriter(fw, self.header_catalog, extrasaction="ignore")
            writer_query_catalog.writerow(dict_result_catalog)
        if result_count > 1:
            # remaining results each get their own row
            for count in range(1, result_count):
                data["data"][count]["RowGuid"] = 'http://132.230.114.133/RowGuid=' + \
                                                 data["data"][count]["RowGuid"]
                with open("./result/result.csv", "a", encoding="ANSI", newline="") as fw:
                    writer_query_catalog = csv.DictWriter(fw, self.header_catalog, extrasaction="ignore")
                    writer_query_catalog.writerow(data['data'][count])
        print("Query succeeded. Press any key to exit")

    def check_load_csv(self):
        """Check whether load_csv.csv exists and whether it is empty"""
        try:
            with open("load_csv.csv", "r", encoding="ANSI") as fr:
                reader = csv.reader(fr)
                next(reader)  # skip the header row
                first_row = next(reader)
                print("The table is not empty: %s" % first_row)
        except IOError as e:
            print("*" * 20)
            print("load_csv.csv was not found and has been created again. Please check!!!")
            print("*" * 20)
            header = ["ProjectBH", "ProjectName", "SZStatus"]
            with open("load_csv.csv", "w", encoding="ANSI", newline="") as fw:
                writer_query_catalog = csv.DictWriter(fw, header, extrasaction="ignore")
                writer_query_catalog.writeheader()
            print("*" * 5 + "Created" + "*" * 5)
            print("Press any key to exit")
        except StopIteration as e:
            print(5 * "!" + "load_csv.csv may contain nothing to query. Please check!" + 5 * "!")
        except Exception as e:
            print(5 * "!" + "Software error" + 5 * "!")
            print(e)
            return 0


if __name__ == "__main__":
    myDICT = DICT_search()
    myDICT.start()
    os.system("pause")  # Windows only: keeps the console window open

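The code above does only three things.

First, from the page shown after logging in through the browser, find the User-Agent, Cookie, Host, and URL, as well as the data to POST.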