Python Crawler in Practice | (9) Scraping Sogou Images

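In this post we scrape Sogou Images: given a search term, we download the images related to it.

First open Sogou Images at https://pic.sogou.com/ and search for, say, "猫" (cat). The URL then looks like this: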

https://pic.sogou.com/pics?query=%C3%A8&w=05009900&p=&_asf=pic.sogou.com&_ast=1563449302&sc=index&sut=8710&sst0=1563449302189

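To crawl by URL alone, the URL has to carry both the search term and the paging information, so we use the URL below instead (I do not yet know how this URL was derived, so for now I am copying it as is):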

https://pic.sogou.com/pics?query={}&mode=1&start={}&reqType=ajax&reqFrom=result&tn=0

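The first {} is replaced with the search term, and the second {} with the paging information, which is a start offset (the page index times 48 in the code that follows).

First, set up the skeleton of the program: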
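As a minimal sketch of how the two placeholders get filled, the helper below builds the request URL for a given page. The names build_url, URL_TEMPLATE and PAGE_SIZE are hypothetical, and the explicit percent-encoding with quote() is an assumption: the post passes the raw word and lets requests encode it, which should produce the same UTF-8 encoding.

```python
from urllib.parse import quote

# Template taken from the post; only the two {} slots change per request.
URL_TEMPLATE = ('https://pic.sogou.com/pics?query={}&mode=1&start={}'
                '&reqType=ajax&reqFrom=result&tn=0')
PAGE_SIZE = 48  # assumed step between pages, matching the i*48 used in the post


def build_url(word, page_index):
    """Return the request URL for a zero-based page index (hypothetical helper)."""
    # quote() percent-encodes the search term as UTF-8
    return URL_TEMPLATE.format(quote(word), page_index * PAGE_SIZE)


print(build_url('猫', 0))
print(build_url('猫', 2))
```

With this step, page 0 starts at offset 0, page 1 at 48, page 2 at 96, and so on.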

import time
import requests
import os
from requests import RequestException
import json


def get_page(url):
    pass


def parse_page(html, count, word):
    pass


if __name__ == '__main__':
    word = '猫'  # search term
    page = 10  # number of pages to crawl
    count = 0  # running image counter
    if not os.path.exists(word):
        os.makedirs(word)  # create the output directory
    for i in range(page):
        url = 'https://pic.sogou.com/pics?query={}&mode=1&start={}&reqType=ajax&reqFrom=result&tn=0'.format(word, i * 48)
        # send the request and get the response
        html = get_page(url)
        # parse the response and store the data
        count = parse_page(html, count, word)
        time.sleep(1)

Send the request and get the response by writing the get_page(url) function:

def get_page(url):
    try:
        # put a User-Agent in the headers so the request looks like it comes from a browser
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            response.encoding = response.apparent_encoding
            return response
        return None
    except RequestException:
        return None

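Note that, unlike in earlier posts, this function returns only the response object. When parsing the result page we need response.text; when fetching an image URL to save the image we need response.content. Returning the response itself lets both kinds of request share this one function.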
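The difference between the two attributes can be illustrated without touching the network. In the sketch below, the two byte strings are made-up stand-ins for what response.content would hold for the JSON page and for an image; decoding bytes to str corresponds to what response.text provides, while image bytes are written to disk unchanged.

```python
import os
import tempfile

# Stand-in for the body of the JSON page (hypothetical payload).
raw_page = '{"items": []}'.encode('utf-8')
# Stand-in for the body of an image request: a JPEG marker plus padding.
raw_image = b'\xff\xd8\xff\xe0' + b'\x00' * 16

# Text use: decode the bytes to str before parsing, like response.text.
page_text = raw_page.decode('utf-8')

# Binary use: write the bytes as they are, in 'wb' mode, like response.content.
image_path = os.path.join(tempfile.mkdtemp(), 'demo.jpg')
with open(image_path, 'wb') as f:
    f.write(raw_image)

print(type(page_text).__name__, type(raw_image).__name__)
print(os.path.getsize(image_path))
```

Decoding image bytes as text would corrupt them, which is why the image branch has to use the raw bytes.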

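Opening the link above shows that it returns data in JSON format:

All image information sits under items. Each entry (highlighted in blue in the original screenshot) describes one image as a set of key-value pairs. The field we care about is pic_url, whose value is the actual link to the image. So we first extract each image's pic_url, then download and save the image.

Parse the response: load the JSON, extract and save each pic_url, then fetch each pic_url and save the image: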
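To make the structure concrete, here is a small sketch that pulls pic_url values out of a response shaped as described above. The sample string is hypothetical and heavily trimmed (only items and pic_url come from the post; the title key and example.com links are invented for illustration), and extract_pic_urls is a hypothetical helper that also skips entries lacking a link.

```python
import json

# Hypothetical, trimmed sample; real entries carry many more keys.
sample = '''
{"items": [
  {"pic_url": "https://example.com/a.jpg", "title": "cat a"},
  {"title": "entry without a link"},
  {"pic_url": "https://example.com/b.jpg", "title": "cat b"}
]}
'''


def extract_pic_urls(text):
    """Return every non-empty pic_url found under the top-level items list."""
    data = json.loads(text)
    return [item['pic_url'] for item in data.get('items', []) if item.get('pic_url')]


urls = extract_pic_urls(sample)
print(urls)
```

Using get() instead of direct indexing is a defensive choice here: an entry without pic_url, or a response without items, yields nothing rather than raising KeyError.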


def parse_page(html, count, word):
    if html is None:  # the request failed, so there is nothing to parse
        return count
    text = html.text
    if text:
        p = json.loads(text)['items']  # parse the JSON and take the items field
        print(len(p))  # number of images
        for i in p[:-1]:  # skips the last entry; use p[0:5] for the first 5 only
            print(i['pic_url'])
            count = count + 1
            # save the image URL
            with open(word + '/' + word + '_url_搜狗.txt', 'a', encoding='utf-8') as f:
                f.write(i['pic_url'] + '\n')
            pic = get_page(i['pic_url'])
            if pic:
                with open(word + '/' + '搜狗_' + str(count) + '.jpg', 'wb') as f:
                    f.write(pic.content)
            time.sleep(1)
    return count

Complete code:

import time
import requests
import os
from requests import RequestException
import json


def get_page(url):
    try:
        # put a User-Agent in the headers so the request looks like it comes from a browser
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            response.encoding = response.apparent_encoding
            return response
        return None
    except RequestException:
        return None


def parse_page(html, count, word):
    if html is None:  # the request failed, so there is nothing to parse
        return count
    text = html.text
    if text:
        p = json.loads(text)['items']  # parse the JSON and take the items field
        print(len(p))  # number of images
        for i in p[:-1]:  # skips the last entry; use p[0:5] for the first 5 only
            print(i['pic_url'])
            count = count + 1
            # save the image URL
            with open(word + '/' + word + '_url_搜狗.txt', 'a', encoding='utf-8') as f:
                f.write(i['pic_url'] + '\n')
            pic = get_page(i['pic_url'])
            if pic:
                with open(word + '/' + '搜狗_' + str(count) + '.jpg', 'wb') as f:
                    f.write(pic.content)
            time.sleep(1)
    return count


if __name__ == '__main__':
    word = '猫'  # search term
    page = 10  # number of pages to crawl
    count = 0  # running image counter
    if not os.path.exists(word):
        os.makedirs(word)  # create the output directory
    for i in range(page):
        url = 'https://pic.sogou.com/pics?query={}&mode=1&start={}&reqType=ajax&reqFrom=result&tn=0'.format(word, i * 48)
        # send the request and get the response
        html = get_page(url)
        # parse the response and store the data
        count = parse_page(html, count, word)
        time.sleep(1)
