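Introduction

Overview: requests lets you simulate the requests a browser sends. Compared with urllib, which we used earlier, the requests API is more convenient (under the hood it wraps urllib3).

Note: requests only downloads the page content. It does not execute any JavaScript, so for pages that load data through scripts we have to analyse the target site ourselves and send the additional requests.

Install: pip3 install requests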
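To make the "no JavaScript" point concrete, here is a minimal sketch using only the standard library. The PAGE string and the LinkCounter class are made up for illustration. A static parser counts only the links that exist as real tags, while the link that a script would write into the page stays plain text inside the script element.

```python
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Counts <a> tags that exist as real elements in the static HTML."""

    def __init__(self):
        super().__init__()
        self.links = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += 1


# Hypothetical page: one static link, one link that only JavaScript would create.
PAGE = """
<html><body>
<a href="/news">news</a>
<script>document.write('<a href="/login">login</a>');</script>
</body></html>
"""


def count_static_links(html):
    """Return how many <a> elements a non-JavaScript client can see."""
    parser = LinkCounter()
    parser.feed(html)
    return parser.links


print(count_static_links(PAGE))
```

A browser would show two links on this page; a client that only downloads the HTML sees one.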

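In PyCharm, create a file with the following code:

import requests   # import the module


def run():        # define a run function
    print("Script is running")    # print a line of text


if __name__ == "__main__":   # program entry point
    run()    # call the run function defined above

If you see the output below, the interpreter and the requests installation are working:

Script is running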

Next, let's test whether the requests module actually works.

Change the run function in the code above:

import requests


def run():
    response = requests.get("http://www.baidu.com")
    response.encoding = "utf-8"   # the page is UTF-8; without this line the Chinese text may come out garbled
    print(response.text)


if __name__ == "__main__":
    run()

Output (if you see something like the following, the request succeeded):

<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下，你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>
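If the Chinese text in response.text ever shows up as a run of accented Latin characters, the usual cause is UTF-8 bytes being decoded as ISO-8859-1. The sketch below reproduces that mistake and its reversal with plain strings, no network needed; the garble and repair helpers are names I made up for the illustration.

```python
def garble(text):
    """Simulate the bug: UTF-8 bytes wrongly decoded as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def repair(garbled):
    """Undo the wrong decoding step, then decode the bytes as UTF-8."""
    return garbled.encode("iso-8859-1").decode("utf-8")


title = "百度一下，你就知道"
broken = garble(title)

print(broken == title)          # the garbled string no longer matches
print(repair(broken) == title)  # the round trip restores the original
```

With requests itself, setting response.encoding = "utf-8" before reading response.text avoids the wrong decoding in the first place. The repair only works while every byte is still present; text that has already lost characters through copy and paste cannot be restored this way.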

Next, let's actually download an image, for example the one below.

Image link: https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1562215040437&di=aa3fc27e7acb5ded2643b315497cfce2&imgtype=0&src=http%3A%2F%2Fimg.9ku.com%2Fgeshoutuji%2Fsingertuji%2F4%2F4779%2F4779_9.jpg

import requests


def run():
    response = requests.get("https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1562215040437&di=aa3fc27e7acb5ded2643b315497cfce2&imgtype=0&src=http%3A%2F%2Fimg.9ku.com%2Fgeshoutuji%2Fsingertuji%2F4%2F4779%2F4779_9.jpg")
    # "wb" writes the raw bytes; the with block closes the file automatically
    with open("Alizee.jpg", "wb") as f:
        f.write(response.content)


if __name__ == "__main__":
    run()

After running the code, a file named Alizee.jpg appears in the project folder.

Opening the file shows the image correctly, so the download worked.
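Opening the file by hand does not scale once we download many images. As a rough sketch, we can check the first bytes instead: every JPEG file starts with the bytes FF D8 FF. The helper names and the sample byte strings below are made up for the illustration; in the real script the data would be response.content.

```python
import os
import tempfile

JPEG_MAGIC = b"\xff\xd8\xff"  # every JPEG file starts with these three bytes


def looks_like_jpeg(data):
    """Return True when the byte string starts like a JPEG file."""
    return data[:3] == JPEG_MAGIC


def save_if_jpeg(data, path):
    """Write data to path only if it looks like a JPEG; return whether it was saved."""
    if not looks_like_jpeg(data):
        return False
    with open(path, "wb") as f:
        f.write(data)
    return True


# Demo with made-up bytes instead of a real download.
with tempfile.TemporaryDirectory() as folder:
    good = os.path.join(folder, "good.jpg")
    bad = os.path.join(folder, "bad.jpg")
    print(save_if_jpeg(JPEG_MAGIC + b"fake image data", good), os.path.exists(good))
    print(save_if_jpeg(b"<html>403 Forbidden</html>", bad), os.path.exists(bad))
```

This only inspects the header, so a truncated download would still pass; it mainly catches the case where the server sent an HTML error page instead of the picture.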

Let's keep modifying the code. Some servers restrict access to their images: the picture opens fine in a browser, but plain Python code cannot download it completely.

Modify the code to send request headers:

import requests


def run():
    # request headers; headers is a dict
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.5383.400 QQBrowser/10.0.1313.400"
    }
    response = requests.get("https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1562215040437&di=aa3fc27e7acb5ded2643b315497cfce2&imgtype=0&src=http%3A%2F%2Fimg.9ku.com%2Fgeshoutuji%2Fsingertuji%2F4%2F4779%2F4779_9.jpg", headers=headers)
    with open("Alizee.jpg", "wb") as f:
        f.write(response.content)


if __name__ == "__main__":
    run()

Look closely at the requests.get call: it now passes a headers argument, so the request carries a browser-like User-Agent. With that, the program downloads the complete image.
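When a server does refuse the request, it usually still sends something back, and writing that to Alizee.jpg gives a broken file. A small sketch of a check before saving follows. The function name is hypothetical and the values in the demo are made up; with requests you would pass response.status_code, response.headers and response.content.

```python
def is_usable_image(status_code, headers, body):
    """Rough check that a response carries an image rather than an error page."""
    content_type = headers.get("Content-Type", "")
    return status_code == 200 and content_type.startswith("image/") and len(body) > 0


# Made-up values standing in for a successful and a rejected response.
print(is_usable_image(200, {"Content-Type": "image/jpeg"}, b"\xff\xd8\xff"))
print(is_usable_image(403, {"Content-Type": "text/html"}, b"<html>denied</html>"))
```

The check assumes the server labels images with an image/... content type, which is common but not guaranteed.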

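Python crawler: page analysis

The site we are going to crawl today is http://www.umei.cc/bizhitupian/meinvbizhi

Be warned that some of the images are fairly revealing; skip them as you see fit.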

import requests

all_urls = []  # the list-page URLs we build


class Spider():
    # constructor: store the initial data
    def __init__(self, target_url, headers):
        self.target_url = target_url
        self.headers = headers

    # build every list-page URL we want to crawl
    def getUrls(self, start_page, page_num):
        global all_urls
        # loop over the page numbers
        for i in range(start_page, page_num + 1):
            url = self.target_url % i
            all_urls.append(url)


if __name__ == "__main__":
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
        'HOST': 'www.umei.cc',
    }
    target_url = "http://www.umei.cc/bizhitupian/meinvbizhi/%d.htm"  # URL pattern of the list pages
    spider = Spider(target_url, headers)
    spider.getUrls(1, 16)
    print(all_urls)

Running it shows that all the URLs are now stored in the all_urls list:

['http://www.umei.cc/bizhitupian/meinvbizhi/1.htm', 'http://www.umei.cc/bizhitupian/meinvbizhi/2.htm', 'http://www.umei.cc/bizhitupian/meinvbizhi/3.htm', 'http://www.umei.cc/bizhitupian/meinvbizhi/4.htm', 'http://www.umei.cc/bizhitupian/meinvbizhi/5.htm', 'http://www.umei.cc/bizhitupian/meinvbizhi/6.htm', 'http://www.umei.cc/bizhitupian/meinvbizhi/7.htm', 'http://www.umei.cc/bizhitupian/meinvbizhi/8.htm', 'http://www.umei.cc/bizhitupian/meinvbizhi/9.htm', 'http://www.umei.cc/bizhitupian/meinvbizhi/10.htm', 'http://www.umei.cc/bizhitupian/meinvbizhi/11.htm', 'http://www.umei.cc/bizhitupian/meinvbizhi/12.htm', 'http://www.umei.cc/bizhitupian/meinvbizhi/13.htm', 'http://www.umei.cc/bizhitupian/meinvbizhi/14.htm', 'http://www.umei.cc/bizhitupian/meinvbizhi/15.htm', 'http://www.umei.cc/bizhitupian/meinvbizhi/16.htm']

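The code above assumes some Python basics, but if you look closely there are only a few key points.

First, class Spider(): declares a class, and def __init__ declares its constructor. Any tutorial will get you through this in about 30 minutes.

There are many ways to build the URLs. I use the most direct one here, string formatting with the % operator: the %d in target_url is replaced by the page number.

Note the global variable all_urls. It stores the URLs of all the list pages, which are the pages we download next.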
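The same URL list can also be built without a global variable. This is only an alternative sketch, not what the rest of the article uses; build_page_urls is a name I made up.

```python
def build_page_urls(template, start_page, end_page):
    """Fill the %d placeholder in template with every page number in the range."""
    return [template % page for page in range(start_page, end_page + 1)]


urls = build_page_urls("http://www.umei.cc/bizhitupian/meinvbizhi/%d.htm", 1, 3)
for url in urls:
    print(url)
```

Returning the list makes the function easy to test, and the caller decides where to keep the result.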

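Next comes the core part of the crawler.

We need to analyse how the page is structured. First open http://www.umei.cc/bizhitupian/meinvbizhi/1.htm , then right-click and choose Inspect.

The source shows that every image entry sits inside an li tag.

Next we scrape the title and the image link of each picture.

We crawl with multiple threads. This also uses a design pattern; I call it the observer pattern here, although the Producer and Consumer classes below are closer to what is usually called producer-consumer.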

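import threading   # multithreading module
from lxml import etree # lxml module
import time # time module

Add a new global variable. Because several threads will work on it, we also need a thread lock:

all_img_urls = []       # list of image detail-page URLs
g_lock = threading.Lock()  # create a lock

Declare a producer class that keeps collecting image detail-page URLs and adds them to the global all_img_urls:

# Producer: extracts the detail-page links from each list page
class Producer(threading.Thread):
    def run(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
            'HOST': 'www.umei.cc'
        }
        global all_urls, all_img_urls
        while True:
            g_lock.acquire()  # all_urls is shared, so check and pop it while holding the lock
            if len(all_urls) == 0:
                g_lock.release()
                break
            page_url = all_urls.pop()  # pop removes the last element and returns it
            g_lock.release()  # release right away so other threads can continue
            try:
                print("Parsing " + page_url)
                response = requests.get(page_url, headers=headers, timeout=2)
                html_data = etree.HTML(response.text)
                all_pic_link = html_data.xpath("//a[@class='TypeBigPics']/@href")
                print(all_pic_link)
                g_lock.acquire()  # lock again while updating the shared list
                all_img_urls += all_pic_link  # += extends the list with every link, where append would add the whole list as one element
                print(all_img_urls)
                g_lock.release()  # release the lock
                time.sleep(0.5)
            except Exception as e:
                print("Failed to parse " + page_url, e)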

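The code above uses inheritance: Producer is a subclass of threading.Thread. For the basics of inheritance, the runoob tutorial at http://www.runoob.com/python3/python3-class.html is enough.

About the thread lock: when we call all_urls.pop() we do not want another thread to modify the list at the same moment, otherwise unexpected things can happen. So we lock the resource with g_lock.acquire(), and as soon as we are done we must release it with g_lock.release(). If we forget, the lock stays held and the other threads block forever.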
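Forgetting release() is easy when an exception is raised between the two calls. As a sketch of one way around that, a lock can be used in a with block, which releases it automatically. The function below is a made-up, network-free stand-in for the crawler: several threads pop numbers from a shared list, and each number must be handled exactly once.

```python
import threading


def drain_with_threads(items, worker_count=4):
    """Let several threads empty a shared list; every item is processed once."""
    pending = list(items)
    done = []
    lock = threading.Lock()

    def worker():
        while True:
            with lock:  # acquired here, released when the block ends
                if not pending:  # check and pop under the same lock
                    return
                item = pending.pop()
            result = item * 2  # the "work" happens outside the lock
            with lock:
                done.append(result)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


print(sorted(drain_with_threads(range(10))))
```

Checking whether the list is empty and popping from it happen under the same lock, so no thread can pop from a list that another thread has just emptied.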

To pick the URLs out of the page, I use XPath expressions.
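To try an XPath expression without hitting the site, we can run it against a small hand-written snippet. The sketch below swaps lxml for the standard library's xml.etree.ElementTree so it runs without extra installs; ElementTree supports only a subset of XPath and needs well-formed markup, and the example.com URLs are placeholders.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a list page; the URLs are placeholders.
SNIPPET = """
<div>
  <ul>
    <li><a class="TypeBigPics" href="http://example.com/detail/1.htm"><img src="http://example.com/thumb/1.jpg"/></a></li>
    <li><a class="TypeBigPics" href="http://example.com/detail/2.htm"><img src="http://example.com/thumb/2.jpg"/></a></li>
    <li><a class="other" href="http://example.com/ignore.htm">skip</a></li>
  </ul>
</div>
"""


def extract_detail_links(markup):
    """Return the href of every <a class="TypeBigPics"> element, in document order."""
    root = ET.fromstring(markup)
    return [a.get("href") for a in root.findall(".//a[@class='TypeBigPics']")]


for link in extract_detail_links(SNIPPET):
    print(link)
```

With lxml the same selection is etree.HTML(text).xpath("//a[@class='TypeBigPics']/@href"), which also copes with the messy HTML of real pages.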

The error-prone code goes inside a try: ... except: block. You can of course also define and handle your own exception types.
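A bare except hides every kind of error, including typos in our own code. One possible refinement, sketched below, is to catch only the errors we expect and retry a few times. The fetch parameter is any function that takes a URL, and fetch_with_retry and flaky are made-up names; the exceptions requests raises for network problems derive from OSError, so the default retry_on covers them.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=0.0, retry_on=(OSError,)):
    """Call fetch(url); on an expected error, report it and try again."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except retry_on as exc:  # anything else propagates immediately
            last_error = exc
            print("attempt %d for %s failed: %s" % (attempt, url, exc))
            time.sleep(delay)
    raise last_error  # every attempt failed


# Demo with a fake fetch function that fails twice before succeeding.
calls = []


def flaky(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "page content"


print(fetch_with_retry(flaky, "http://example.com/1.htm"))
```

In the crawler, fetch could be something like lambda u: requests.get(u, headers=headers, timeout=2).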

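If everything above works, we can add the following at the program entry point:

for x in range(2):
    t = Producer()
    t.start()

Run the program. Because Producer inherits from threading.Thread, the method we override is def run, which start() then executes in the new thread; you have already seen it in the code above. Now we can run it.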

Once it has run, the list of image detail pages is stored in all_img_urls.

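Next, I want to wait until all the detail-page URLs have been collected before moving on to the next analysis step.

Add this code:

threads = []
# start two threads to fetch the list pages
for x in range(2):
    t = Producer()
    t.start()
    # threads.append(t)

# for tt in threads:
#     tt.join()

print("Reached this point")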

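Now uncomment threads.append(t) and the tt.join() loop.

You will notice an essential difference. Because the program is multithreaded, without join the final print statement runs straight away and does not wait for the other threads to finish. Once we add the key call tt.join(), the main thread waits until all the child threads have finished and only then continues. That satisfies the condition I mentioned above: first collect the complete set of detail-page URLs.

What join does is thread synchronisation: when the main thread reaches join it blocks until the joined child thread has finished, and then carries on. You will run into this often.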
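The effect of join is easy to see in a small, network-free sketch. The Worker class and run_workers function are made up for the illustration; time.sleep stands in for a slow request.

```python
import threading
import time


class Worker(threading.Thread):
    """A thread that pretends to work, then records its number."""

    def __init__(self, number, results, lock):
        super().__init__()
        self.number = number
        self.results = results
        self.lock = lock

    def run(self):  # start() runs this method in the new thread
        time.sleep(0.05)  # stand-in for a slow network request
        with self.lock:
            self.results.append(self.number)


def run_workers(count):
    """Start count workers and return their results once all have finished."""
    results, lock = [], threading.Lock()
    workers = [Worker(n, results, lock) for n in range(count)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()  # block until this worker has finished
    return results  # complete, because every worker was joined


print(sorted(run_workers(5)))
```

If the join loop is removed, run_workers returns while the workers are still sleeping and the list is most likely still empty.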

Next, write a consumer (observer) that keeps watching the list of detail-page URLs we just collected.

Add a global variable to store the image links we extract:

pic_links = []            # list of image addresses

# Consumer
class Consumer(threading.Thread):
    def run(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
            'HOST': 'www.umei.cc'
        }
        global all_img_urls, pic_links  # the shared lists of detail pages and image links
        print("%s is running " % threading.current_thread().name)
        while True:
            g_lock.acquire()
            if len(all_img_urls) == 0:  # check and pop while holding the lock
                g_lock.release()
                break
            img_url = all_img_urls.pop()
            g_lock.release()
            print("Processing " + img_url)
            try:
                response = requests.get(img_url, headers=headers)
                html_data = etree.HTML(response.content.decode())
                title = html_data.xpath("//div[@class='ArticleTitle']/strong/text()")
                all_pic_src = html_data.xpath("//div[@class='ImageBody']/p/a/img/@src")
                pic_dict = {title[0]: all_pic_src[0]}  # dict mapping the title to the image URL
                g_lock.acquire()
                pic_links.append(pic_dict)  # a list of dicts
                # print(pic_links)
                # print(title[0] + " fetched")
                g_lock.release()
            except Exception as e:
                print("Problem with " + img_url, e)
            time.sleep(0.5)

# start 10 threads to collect the image links
for x in range(10):
    ta = Consumer()
    ta.start()

Run the program; it prints a list whose elements are dictionaries.

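Next comes the step mentioned at the start: saving the images. The procedure is the same, we write a custom class.

Once we have the image links we need to download them. The code below first creates a directory named after the title we extracted, and then creates the image file inside that directory.

File operations need one more module:

import os  # directory operations

class DownPic(threading.Thread):
    def run(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
        }
        global pic_links
        while True:  # deliberately an endless loop, so the thread keeps watching pic_links for new entries
            g_lock.acquire()  # lock
            if len(pic_links) == 0:  # nothing to download right now
                g_lock.release()  # always release the lock, whichever branch we take
                time.sleep(0.5)  # wait briefly instead of spinning at full speed
                continue
            pic = pic_links.pop()
            g_lock.release()
            # each entry is a dict of {title: image URL}
            for key, value in pic.items():
                print("==================", key, value)
                path = key.strip()
                if not os.path.exists(path):
                    os.makedirs(path)  # create the directory if it does not exist yet
                    print(path + ' directory created')
                else:
                    print(path + ' directory already exists')
                filename = path + "/" + path + ".jpg"
                if os.path.exists(filename):
                    continue
                response = requests.get(url=value, headers=headers)
                with open(filename, 'wb') as f:  # the with block closes the file
                    f.write(response.content)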
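Page titles can contain characters that are not allowed in file names, such as / or ? on Windows. The following sketch shows one way to clean a title before using it as a directory and file name. safe_name and target_path are hypothetical helpers, not part of the original script, and the demo writes into a temporary directory.

```python
import os
import re
import tempfile


def safe_name(title):
    """Strip whitespace and replace characters that are invalid in Windows file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title.strip())
    return cleaned or "untitled"


def target_path(base_dir, title):
    """Create base_dir/<name>/ if needed and return the path of <name>.jpg inside it."""
    name = safe_name(title)
    folder = os.path.join(base_dir, name)
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    return os.path.join(folder, name + ".jpg")


with tempfile.TemporaryDirectory() as base:
    path = target_path(base, "  sunset/beach: 2019?  ")
    print(os.path.relpath(path, base))
    print(os.path.isdir(os.path.dirname(path)))
```

Passing exist_ok=True also avoids an error when two threads try to create the same directory at the same time.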

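Then add this to the main program:

# start 10 threads to save the images
for x in range(10):
    down = DownPic()
    down.start()

Run the program; the downloaded images and their folders appear in the project directory.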
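Because DownPic loops forever, the script never exits on its own and has to be stopped by hand. A common alternative, sketched here as an assumption rather than the approach used in this article, is queue.Queue with a sentinel value that tells each worker to stop. The upper() call is a stand-in for the download, and run_pipeline and STOP are made-up names.

```python
import queue
import threading

STOP = object()  # sentinel: a worker that receives it exits its loop


def run_pipeline(items, worker_count=3):
    """Process items with several threads and return once all of them are done."""
    tasks = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            item = tasks.get()  # blocks until something arrives, no busy waiting
            if item is STOP:
                break
            with lock:
                results.append(item.upper())  # stand-in for downloading and saving

    workers = [threading.Thread(target=worker) for _ in range(worker_count)]
    for w in workers:
        w.start()
    for item in items:
        tasks.put(item)
    for _ in workers:
        tasks.put(STOP)  # one sentinel per worker
    for w in workers:
        w.join()
    return results


print(sorted(run_pipeline(["a.jpg", "b.jpg", "c.jpg"])))
```

The queue does its own locking, so the workers need neither a global list nor an explicit lock to take items from it.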

The complete code:

import requests
import threading   # multithreading module
from lxml import etree # parse pages with XPath
import time # time module
import os

all_img_urls = []       # list of image detail-page URLs
g_lock = threading.Lock()  # create a lock
pic_links = []            # list of image addresses
all_urls = []  # the list-page URLs we build


class Spider():
    # constructor: store the initial data
    def __init__(self, target_url, headers):
        self.target_url = target_url
        self.headers = headers

    # build every list-page URL we want to crawl
    def getUrls(self, start_page, page_num):
        global all_urls
        # loop over the page numbers
        for i in range(start_page, page_num + 1):
            url = self.target_url % i
            all_urls.append(url)


# Producer: extracts the detail-page links from each list page
class Producer(threading.Thread):
    def run(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
            'HOST': 'www.umei.cc'
        }
        global all_urls, all_img_urls
        while True:
            g_lock.acquire()  # all_urls is shared, so check and pop it while holding the lock
            if len(all_urls) == 0:
                g_lock.release()
                break
            page_url = all_urls.pop()  # pop removes the last element and returns it
            g_lock.release()  # release right away so other threads can continue
            try:
                print("Parsing " + page_url)
                response = requests.get(page_url, headers=headers, timeout=2)
                html_data = etree.HTML(response.text)
                all_pic_link = html_data.xpath("//a[@class='TypeBigPics']/@href")
                print(all_pic_link)
                g_lock.acquire()  # lock again while updating the shared list
                all_img_urls += all_pic_link  # += extends the list with every link
                # print(all_img_urls)
                g_lock.release()  # release the lock
                time.sleep(0.5)
            except Exception as e:
                print("Failed to parse " + page_url, e)


# Consumer
class Consumer(threading.Thread):
    def run(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
            'HOST': 'www.umei.cc'
        }
        global all_img_urls, pic_links  # the shared lists of detail pages and image links
        print("%s is running " % threading.current_thread().name)
        while True:
            g_lock.acquire()
            if len(all_img_urls) == 0:  # check and pop while holding the lock
                g_lock.release()
                break
            img_url = all_img_urls.pop()
            g_lock.release()
            try:
                response = requests.get(img_url, headers=headers)
                html_data = etree.HTML(response.content.decode())
                title = html_data.xpath("//div[@class='ArticleTitle']/strong/text()")
                all_pic_src = html_data.xpath("//div[@class='ImageBody']/p/a/img/@src")
                pic_dict = {title[0]: all_pic_src[0]}  # dict mapping the title to the image URL
                g_lock.acquire()
                pic_links.append(pic_dict)  # a list of dicts
                print(pic_links)
                # print(title[0] + " fetched")
                g_lock.release()
            except Exception as e:
                print("Problem with " + img_url, e)
            time.sleep(0.5)


class DownPic(threading.Thread):
    def run(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
        }
        global pic_links
        while True:  # deliberately an endless loop, so the thread keeps watching pic_links for new entries
            g_lock.acquire()  # lock
            if len(pic_links) == 0:  # nothing to download right now
                g_lock.release()  # always release the lock, whichever branch we take
                time.sleep(0.5)  # wait briefly instead of spinning at full speed
                continue
            pic = pic_links.pop()
            g_lock.release()
            # each entry is a dict of {title: image URL}
            for key, value in pic.items():
                print("==================", key, value)
                path = key.strip()
                if not os.path.exists(path):
                    os.makedirs(path)  # create the directory if it does not exist yet
                    print(path + ' directory created')
                else:
                    print(path + ' directory already exists')
                filename = path + "/" + path + ".jpg"
                if os.path.exists(filename):
                    continue
                response = requests.get(url=value, headers=headers)
                with open(filename, 'wb') as f:  # the with block closes the file
                    f.write(response.content)


if __name__ == "__main__":
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
        'HOST': 'www.umei.cc',
    }
    target_url = "http://www.umei.cc/bizhitupian/meinvbizhi/%d.htm"  # URL pattern of the list pages
    spider = Spider(target_url, headers)
    spider.getUrls(1, 16)

    threads = []
    # start two threads to fetch the list pages
    for x in range(2):
        t = Producer()
        t.start()
        threads.append(t)
    for tt in threads:
        tt.join()
    print("Reached this point")

    # start 10 threads to collect the image links
    for x in range(10):
        ta = Consumer()
        ta.start()

    # start 10 threads to save the images
    # (DownPic never exits on its own; stop the script manually once the downloads are done)
    for x in range(10):
        down = DownPic()
        down.start()

Reposted from: https://www.cnblogs.com/nikecode/p/11130801.html
