Scraping Toutiao Street-Photography Images via Ajax

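Contents

Analyzing the requests
Fetching one batch of results
Parsing the JSON
Getting the image list
Saving locally
Putting it all together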

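Analyzing the requests

URL: https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D

The requests whose links begin with ?aid turn out to carry the content data.
Scrolling the page produces a series of consecutive ?aid requests: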
- https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search&offset=0&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20&en_qc=1&cur_tab=1&from=search_tab&pd=synthesis&timestamp=1582600289707
- https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search&offset=20&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20&en_qc=1&cur_tab=1&from=search_tab&pd=synthesis&timestamp=1582601046539

The only differences between these links are offset and timestamp. In testing, the value of timestamp did not appear to affect the content returned.
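The comparison can also be checked mechanically. The following is a minimal sketch that uses only the standard library to parse the two captured links above and report which query parameters differ. The helper name `diff_params` and the constants `URL_A` and `URL_B` are made up for this illustration and are not part of the original code.

```python
from urllib.parse import urlsplit, parse_qs

# The two links captured above, copied verbatim.
URL_A = ('https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search'
         '&offset=0&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20'
         '&en_qc=1&cur_tab=1&from=search_tab&pd=synthesis&timestamp=1582600289707')
URL_B = ('https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search'
         '&offset=20&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20'
         '&en_qc=1&cur_tab=1&from=search_tab&pd=synthesis&timestamp=1582601046539')


def diff_params(url_a, url_b):
    """Return the sorted names of query parameters whose values differ."""
    qa = parse_qs(urlsplit(url_a).query)
    qb = parse_qs(urlsplit(url_b).query)
    return sorted(k for k in qa.keys() | qb.keys() if qa.get(k) != qb.get(k))


print(diff_params(URL_A, URL_B))  # prints ['offset', 'timestamp']
```

Since count is 20 in both links, stepping offset by 20 should walk through consecutive pages, which is what the rest of the post relies on.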

Fetching one batch of results

import requests

headers = {
    'cookie': 'tt_webid=6797200619561698823; s_v_web_id=verify_k7195l9r_uXMR9eu7_6yoD_4gkg_BOXR_MKFTGKfMqteU; ttcid=258a3cc32ee8498599686a745574cf7b28; WEATHER_CITY=%E5%8C%97%E4%BA%AC; tt_webid=6797200619561698823; csrftoken=c44d5abf445176f703e9994d9aea0b16; tt_scid=WBDUqNCX24zV0vnk7GkqwcTaUwgHDmOuOTC4cg8N.K2fPREnRW.D6XVshWxiaxPAb9ed',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Mobile Safari/537.36',
}


def get_info(url):
    # url: the Ajax request URL
    # Returns the parsed JSON, or None if the request fails
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print('Error:', e.args)

test_json = get_info('https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search&offset=0&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20&en_qc=1&cur_tab=1&from=search_tab&pd=synthesis&timestamp=1582600289707')

Parsing the JSON

for info in test_json.get('data'):
    if info.get('abstract') is not None:
        print('Title:', info.get('title'))
        print('Author:', info.get('source'))
        print('Type:', info.get('display_type_self'))
        print('Article:', info.get('article_url'))

Title: 北京街拍：三里屯潮拍凸显与众不同的时尚街拍，与众不同最时尚
Author: 皇城根五爷原创街拍
Type: self_gallery
Article: http://toutiao.com/group/6428768938196206081/
Title: 随手街拍 殿堂级美女
Author: 宽城地出溜
Type: self_article
Article: http://toutiao.com/group/6777243413981954575/
Title: 街拍：有人说这样拍，才算时尚街拍
Author: 皇城根五爷原创街拍
Type: self_gallery
Article: http://toutiao.com/group/6609426526892982798/
Title: 路人街拍，“肥而不腻”的难得女神
Author: 西贝时尚
Type: self_article
Article: http://toutiao.com/group/6773219752874607108/
Title: 街拍：性感就是如此简单，丰满韵味包臀裙
Author: 感遇街拍
Type: self_article
Article: http://toutiao.com/group/6795774593480000011/
Title: 三里屯街拍
Author: 艾丝
Type: self_gallery
Article: http://toutiao.com/group/6760980286252515853/
Title: 街拍：人间极品，完美的身材，天使的脸庞，我去哪里找
Author: 感遇街拍
Type: self_article
Article: http://toutiao.com/group/6795524234970923531/
Title: 街拍：好身材高颜值的美女们
Author: 秋水一手咨询
Type: self_gallery
Article: http://toutiao.com/group/6797002678686712323/
Title: 街拍：冬季穿搭，三里屯潮人穿搭，永远都是少女们的时尚风向标
Author: 皇城根五爷原创街拍
Type: self_gallery
Article: http://toutiao.com/group/6774540258336834056/
Title: 街拍：美女姐姐上街头，美到发亮，100%回头率
Author: 感遇街拍
Type: self_article
Article: http://toutiao.com/group/6796140055858512398/
Title: 街拍：好好了解时尚，不断提升自己的穿搭技巧，彰显自身的个性
Author: 皇城根五爷原创街拍
Type: self_gallery
Article: http://toutiao.com/group/6796847662701216268/
Title: 街拍：你想和图几有故事
Author: 秋水一手咨询
Type: self_gallery
Article: http://toutiao.com/group/6796633637992268301/
Title: 50位街头摄影大师，50张经典街拍作品
Author: 宁影纪
Type: self_article
Article: http://toutiao.com/group/6793614971650441731/
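
The filtering step can be tried without touching the network. Below is a rough sketch that runs the same abstract check over a hand-written `SAMPLE` dictionary. The sample only mimics the handful of fields used above (the real response has many more), its placeholder values are invented, and the helper name `iter_articles` is hypothetical.

```python
# Hand-written stand-in for the Ajax response; only the fields used in this
# post are included, and the placeholder values are invented.
SAMPLE = {
    'data': [
        # An entry with no 'abstract' field; the loop above skips such entries.
        {'title': 'placeholder entry'},
        {
            'abstract': 'placeholder abstract',
            'title': '三里屯街拍',
            'source': '艾丝',
            'display_type_self': 'self_gallery',
            'article_url': 'http://toutiao.com/group/6760980286252515853/',
        },
    ]
}


def iter_articles(payload):
    """Yield (title, source, type, url) for entries that carry an abstract."""
    # 'or []' guards against a missing or null 'data' field.
    for info in payload.get('data') or []:
        if info.get('abstract') is not None:
            yield (info.get('title'), info.get('source'),
                   info.get('display_type_self'), info.get('article_url'))


articles = list(iter_articles(SAMPLE))
print(articles)
```

Wrapping the loop in a generator keeps printing separate from filtering, so the same check can be reused later when only gallery entries are wanted.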

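Getting the image list

At this point the image URLs are no longer in the Ajax response; they come from JavaScript embedded in the article page.

import re
import json

# Note: a different set of headers is needed here
newheaders = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'}


def get_imgs(url):
    # url: the article detail page URL
    # Returns a list of image URLs (empty if nothing could be extracted)
    try:
        response = requests.get(url, headers=newheaders)
        if response.status_code == 200:
            content = response.text
            # Extract the JSON fragment that holds the images. It is escaped
            # more than once, so json.loads has to be applied twice.
            images = re.search(r'gallery: JSON\.parse\((.*?)\),', content, re.S)
            if images is None:
                # Not a gallery page, or the page layout has changed
                return []
            images = json.loads(images.group(1))
            images = json.loads(images)
            images = images['sub_images']
            # Strip any leftover backslashes
            return [re.sub(r'\\', '', img.get('url')) for img in images]
    except requests.ConnectionError as e:
        print('Error:', e.args)
    return []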

get_imgs('http://toutiao.com/group/6428768938196206081/')

['http://p3.pstatp.com/origin/243b0002e5ef87385601','http://p1.pstatp.com/origin/243b0002e5f15c188892','http://p1.pstatp.com/origin/26e300004740640ff18a','http://p1.pstatp.com/origin/24390003cc898fb29a19','http://p1.pstatp.com/origin/24380000f50c5c50cd76','http://p3.pstatp.com/origin/24340002e8685c2bdd8f','http://p3.pstatp.com/origin/243b0002e5f3a9124820','http://p1.pstatp.com/origin/243a00030149960659dc','http://p1.pstatp.com/origin/243b0002e5f44b3d8cdc','http://p3.pstatp.com/origin/24380000f50fa79e73e6','http://p1.pstatp.com/origin/24390003cc8b1daf17a6']
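
Why json.loads is needed twice is easier to see on a small synthetic page. The sketch below builds a fake `page` string in which the gallery data is a JSON document wrapped in a JavaScript string literal, which is the shape the regular expression above expects. The page text is an assumption made for illustration (the real page is larger and may be escaped differently), and `extract_gallery` is a hypothetical name.

```python
import json
import re

# Build a fake page: JSON text, then wrapped once more as a quoted string.
inner = json.dumps({'sub_images': [
    {'url': 'http://p3.pstatp.com/origin/243b0002e5ef87385601'},
]})
page = 'var data = {\n  gallery: JSON.parse(' + json.dumps(inner) + '),\n  other: 1\n}'

GALLERY_RE = re.compile(r'gallery: JSON\.parse\((.*?)\),', re.S)


def extract_gallery(html):
    """Return the image URLs found in the gallery fragment, or [] if absent."""
    match = GALLERY_RE.search(html)
    if match is None:
        return []
    json_text = json.loads(match.group(1))  # quoted string literal -> JSON text
    gallery = json.loads(json_text)         # JSON text -> dict
    # Drop stray backslashes, mirroring the cleanup step in get_imgs.
    return [img['url'].replace('\\', '') for img in gallery['sub_images']]


print(extract_gallery(page))
```

The first call only unwraps the quoted string; the dictionary with sub_images appears after the second call.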

Saving locally

import os
from hashlib import md5


def save_img(filename, imgs):
    # filename: name of the target folder; imgs: list of image URLs
    # Create the folder if it does not exist
    if not os.path.exists(filename):
        os.mkdir(filename)
    for img in imgs:
        try:
            response = requests.get(img)
            if response.status_code == 200:
                # Name each file after the MD5 hash of its content
                file_path = '{0}/{1}.{2}'.format(filename, md5(response.content).hexdigest(), 'jpg')
                if not os.path.exists(file_path):
                    with open(file_path, "wb") as f:
                        f.write(response.content)
        except requests.ConnectionError as e:
            print('Error:', e.args)

save_img('hhh', get_imgs('http://toutiao.com/group/6428768938196206081/'))
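
Naming files by the MD5 of their content means the same image is never written twice, even if it appears under different URLs. A small offline sketch of that idea follows, using a temporary directory and made-up byte strings in place of downloaded images. The helper `save_bytes` is hypothetical and only isolates the naming logic from the download.

```python
import os
import tempfile
from hashlib import md5


def save_bytes(folder, content, ext='jpg'):
    """Write content to folder/<md5>.<ext>; return (path, was_written)."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, '{0}.{1}'.format(md5(content).hexdigest(), ext))
    if os.path.exists(path):
        return path, False  # identical content was already saved
    with open(path, 'wb') as f:
        f.write(content)
    return path, True


with tempfile.TemporaryDirectory() as tmp:
    first = save_bytes(tmp, b'fake image bytes')
    second = save_bytes(tmp, b'fake image bytes')   # same content, same name
    third = save_bytes(tmp, b'other image bytes')
    saved_files = sorted(os.listdir(tmp))

print(first[1], second[1], third[1], len(saved_files))  # prints True False True 2
```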

Putting it all together

from urllib.parse import quote
import time

headers = {
    'cookie': 'tt_webid=6797200619561698823; s_v_web_id=verify_k7195l9r_uXMR9eu7_6yoD_4gkg_BOXR_MKFTGKfMqteU; ttcid=258a3cc32ee8498599686a745574cf7b28; WEATHER_CITY=%E5%8C%97%E4%BA%AC; tt_webid=6797200619561698823; csrftoken=c44d5abf445176f703e9994d9aea0b16; tt_scid=WBDUqNCX24zV0vnk7GkqwcTaUwgHDmOuOTC4cg8N.K2fPREnRW.D6XVshWxiaxPAb9ed',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Mobile Safari/537.36',
}
newheaders = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36',
}


def get_url(keyword, offset):
    # keyword: search keyword; offset: result offset
    return ('https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search&offset=' + str(offset)
            + '&format=json&keyword=' + quote(keyword)
            + '&autoload=true&count=20&en_qc=1&cur_tab=1&from=search_tab&pd=synthesis&timestamp='
            + str(int(time.time() * 1000)))


def toutiao(keyword, number):
    # keyword: search keyword; number: how many galleries to download
    offset = 0
    total = 0
    baseurl = get_url(keyword, offset)
    infos = get_info(baseurl)
    if not os.path.exists(keyword):
        os.mkdir(keyword)
    # Stop when a request fails or returns no more data
    while infos and infos.get('data'):
        for info in infos.get('data'):
            # This approach only works for gallery-type articles
            if info.get('abstract') is not None and info.get('display_type_self') == "self_gallery":
                filename = keyword + "/【" + info.get('source') + '】' + info.get('title')
                total += 1
                save_img(filename, get_imgs(info.get('article_url')))
                if total == number:
                    return
        offset += 20
        baseurl = get_url(keyword, offset)
        infos = get_info(baseurl)

toutiao('街拍', 4)
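
The paging logic is the part most likely to misbehave (for example, looping forever when no more results come back), so it helps to exercise it without the network. The sketch below is one possible way to do that, not the original implementation: `collect_galleries` takes the page-fetching function as a parameter, and `fake_fetch` is a made-up stand-in for `get_info(get_url(keyword, offset))` that serves two pages and then fails. The `max_pages` cap is likewise an added assumption.

```python
def collect_galleries(fetch_page, number, page_size=20, max_pages=50):
    """Collect up to `number` gallery entries, paging by offset."""
    found = []
    for page in range(max_pages):
        payload = fetch_page(page * page_size)
        entries = (payload or {}).get('data') or []
        if not entries:
            break  # no more results, or the request failed
        for info in entries:
            if (info.get('abstract') is not None
                    and info.get('display_type_self') == 'self_gallery'):
                found.append(info)
                if len(found) == number:
                    return found
    return found


def fake_fetch(offset):
    """Hypothetical fetcher: two pages of data, then None like a failed request."""
    if offset >= 40:
        return None
    return {'data': [
        {'abstract': 'a', 'display_type_self': 'self_gallery', 'title': 'g%d' % offset},
        {'abstract': 'b', 'display_type_self': 'self_article', 'title': 'a%d' % offset},
    ]}


titles = [info['title'] for info in collect_galleries(fake_fetch, 4)]
print(titles)  # prints ['g0', 'g20']
```

Four galleries were requested but only two exist, and the loop still terminates instead of requesting ever-larger offsets.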
