Scraping Toutiao Street-Snap Images

# coding=utf-8
import os
import re
import time
from multiprocessing.pool import Pool
from urllib.parse import urlencode

import requests

# Replace the placeholders with the Cookie and User-Agent of your own browser session
headers = {'Cookie': 'your cookie', 'User-Agent': 'your user-agent'}

# Get the URLs of the detail pages listed on the search results page
def get_search_page(offset):
    params = {
        'aid': '24',
        'app_name': 'web_search',
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',  # search keyword, "street snap"
        'autoload': 'true',
        'count': '20',
        'en_qc': '1',
        'cur_tab': '1',
        'from': 'search_tab',
        'pd': 'synthesis',
        'timestamp': int(round(time.time() * 1000)),
    }
    base_url = 'https://www.toutiao.com/api/search/content/?' + urlencode(params)
    print(base_url)
    try:
        response1 = requests.get(base_url, headers=headers)
        if response1.status_code == 200:
            json1 = response1.json()
            # 'data' may be missing or null, so fall back to an empty list
            for item in json1.get('data') or []:
                title = item.get('title')
                url_group = item.get('share_url')
                # Entries without a title or share URL are not image pages; skip them
                if title is not None and url_group is not None:
                    yield {'title': title, 'url_group': url_group}
    except requests.ConnectionError:
        return None


# Get the set of images on a detail page
def get_images_group(url_group):
    print('Parsing detail page')
    try:
        # The request sits inside the try block so connection errors are caught
        response2 = requests.get(url_group, headers=headers)
        if response2.status_code == 200:
            content1s = re.findall(r'[a-zA-Z]+://[^\s]*&quot', response2.text, re.S)
            # re.findall returns an empty list, never None, when nothing matches
            if not content1s:
                print('Wrong article type, no image set found', url_group)
                return None
            for content1 in content1s:
                # Strip the trailing &quot to get the image URL
                content2 = re.sub('&quot', '', content1)
                # Use the part after /pgc-image/ as the file name
                content3 = re.sub(r'[a-zA-Z]+://[^\s]*/pgc-image/', '', content2)
                yield {'name': content3, 'image_url': content2}
    except requests.ConnectionError:
        print('Unable to connect')
        return None


# Save an image
def save_image(item, title):
    try:
        response = requests.get(item.get('image_url'))
        if response.status_code == 200:
            file_path1 = 'C:/Users/Desktop/图片/爬虫/' + title + '/'
            file_path = file_path1 + '{0}.{1}'.format(item.get('name'), 'jpg')
            if not os.path.exists(file_path):
                # exist_ok avoids an error when the directory already exists
                os.makedirs(file_path1, exist_ok=True)
                # The with block closes the file automatically
                with open(file_path, 'wb') as f:
                    f.write(response.content)
            else:
                print('Already downloaded', file_path)
    except requests.ConnectionError:
        print('Failed to save image')


# Main function
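def main(offset):
    for item1 in get_search_page(offset):
        url_group = item1.get('url_group')
        title = item1.get('title')
        for item2 in get_images_group(url_group):
            save_image(item2, title)
    print(offset)


GROUP_START = 1
GROUP_END = 20

if __name__ == '__main__':
    # Create a process pool; the number of worker processes can be passed to Pool()
    pool = Pool()
    groups = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]
    # The first argument is the function and the second is an iterable.
    # Each element of the iterable is passed to the function as its argument,
    # and the calls are distributed across the worker processes in the pool.
    pool.map(main, groups)
    # Close the pool so that it accepts no new tasks
    pool.close()
    # Block the main process until the worker processes exit
    pool.join()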
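
The image extraction step in get_images_group can be tried offline before hitting the site. The following is a minimal sketch, not the script's exact code: SAMPLE_HTML is a made-up fragment (real Toutiao markup may differ), extract_images is a hypothetical helper name, and the pattern uses a non-greedy `\S*?` so that two URLs in the same run of non-whitespace text are not merged into one match.

```python
import re

# Hypothetical detail-page fragment; the real page markup is an assumption here.
SAMPLE_HTML = (
    'img src=&quot;http://p1.example.com/large/pgc-image/abc123&quot; '
    'img src=&quot;http://p3.example.com/large/pgc-image/def456&quot;'
)

# A URL followed by the HTML entity &quot, matched non-greedily.
URL_PATTERN = re.compile(r'[a-zA-Z]+://\S*?&quot')


def extract_images(html):
    """Return a list of {'name', 'image_url'} dicts found in html."""
    results = []
    for match in URL_PATTERN.findall(html):
        # Drop the trailing '&quot' to recover the plain URL
        image_url = match[:-len('&quot')]
        # Use the part after the last '/pgc-image/' as the file name
        name = image_url.rsplit('/pgc-image/', 1)[-1]
        results.append({'name': name, 'image_url': image_url})
    return results


if __name__ == '__main__':
    for image in extract_images(SAMPLE_HTML):
        print(image['name'], image['image_url'])
```

Running the helper on a saved copy of a real detail page is a quick way to check whether the pattern still matches after the site changes its markup.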
