Python 3: a fault-tolerant, multi-process crawler for Toutiao street-snap photos

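This Python 3 script analyzes Toutiao's Ajax requests, extracts data with regular expressions, crawls Toutiao's street-snap (街拍) photos using a pool of worker processes with tolerant error handling, saves the results to MongoDB, and downloads the images.
Toutiao's content pages have been redesigned since earlier versions: the image pages now include not only Ajax-driven pages but also plain HTML content pages.
The script therefore uses two regular expressions and picks between them depending on which one matches.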
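The two-pattern fallback described above can be sketched in isolation. This is a minimal illustration, not the script's exact logic: the patterns are simplified, and `extract_title`, `article_page`, and `gallery_page` are hypothetical names and sample strings invented for the example. The real page markup may differ.

```python
import re

# Hypothetical, simplified patterns; the real page markup may differ.
ARTICLE_RE = re.compile(r"articleInfo:.*?title:\s'(.*?)'", re.S)
GALLERY_RE = re.compile(r"galleryInfo.*?title:\s'(.*?)'", re.S)


def extract_title(html):
    """Return (kind, title) from the first pattern that matches, else None."""
    for kind, pattern in (("article", ARTICLE_RE), ("gallery", GALLERY_RE)):
        match = pattern.search(html)
        if match:
            return kind, match.group(1)
    return None


# Made-up sample inputs standing in for the two page layouts.
article_page = "articleInfo: { title: 'Street style', content: '...' }"
gallery_page = "BASE_DATA.galleryInfo = { title: 'Gallery set', gallery: ... }"

print(extract_title(article_page))
print(extract_title(gallery_page))
print(extract_title("<html></html>"))
```

Trying the patterns in a fixed order keeps the parser tolerant of both layouts: a page that matches neither simply yields `None` and can be skipped.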

#!/usr/bin/env python
# -*- coding:utf-8 -*-
"""
@author:Aiker
@file:toutiao.py
@time: 9:35 PM
"""
import json
import os
import re
from json import JSONDecodeError
from multiprocessing import Pool
from urllib.parse import urlencode
from hashlib import md5
import pymongo
import requests
from requests.exceptions import RequestException

MONGO_URL = 'localhost:27017'
MONGO_DB = 'toutiao'
MONGO_TABLE = 'toutiao'
GROUP_START = 1
GROUP_END = 20
KEYWORD = '街拍'
client = pymongo.MongoClient(MONGO_URL, connect=False)
db = client[MONGO_DB]

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}


def get_url(url):
    """Fetch a URL and return the response text, or None on failure."""
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Request failed', url)
        return None


def get_page_index(offset, keyword):
    """Request one page of search results from the Ajax search API."""
    data = {
        'aid': '24',
        'app_name': 'web_search',
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': '20',
        'en_qc': '1',
        'cur_tab': '1',
        'from': 'search_tab',
        'pd': 'synthesis',
        'timestamp': '1124216535987'
    }
    # Encode the dict as a URL query string.
    url = 'https://www.toutiao.com/api/search/content/?' + urlencode(data)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Failed to request the index page')
        return None


def parse_page_index(html):
    """Yield the article URLs found in one page of search results."""
    try:
        data = json.loads(html)  # parse the JSON response
        if data and 'data' in data.keys():
            # print(data.keys())  # debug: list all keys
            for item in data.get('data'):
                if 'article_url' in item:  # skip entries without a URL to avoid None
                    # print(item)
                    yield item.get('article_url')  # this function is a generator
    except JSONDecodeError:
        pass
    except TypeError:
        # json.loads(None) raises TypeError when the index request failed.
        pass


def get_page_detail(url):
    """Fetch a detail page and return its HTML, or None on failure."""
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Error requesting the detail page', url)
        return None


def parse_page_detail(html, url):
    """Extract the title and image URLs, trying the article layout first."""
    pattern = re.compile(r"articleInfo:.*?title:\s'(.*?)',.*?content:\s'(.*?)'.*?groupId", re.S)
    result = re.findall(pattern, html)
    if result:
        title, content = result[0]
        # Assumes the embedded content is HTML-escaped, so each image URL ends at &quot;
        pattern = re.compile(r'(http://.*?)&quot;', re.S)
        images = re.findall(pattern, content)
        for image in images:
            download_image(image, title)
        return {
            'title': title,
            'url': url,
            'images': images
        }
    else:
        # Fall back to the gallery layout, whose data is a JSON string.
        pattern = re.compile(r'BASE_DATA.galleryInfo.*?title:\s\'(.*?)\'.*?gallery: JSON.parse\("(.*)"\)', re.S)
        result = re.findall(pattern, html)
        # print(result[0])
        if result:
            title, content = result[0]
            # Strip the backslash escapes so the string parses as JSON.
            data = json.loads(content.replace('\\', ''))
            # print(data)
            if data and 'sub_images' in data.keys():
                sub_images = data.get('sub_images')
                images = [item.get('url') for item in sub_images]
                for image in images:
                    download_image(image, title)
                return {
                    'title': title,
                    'url': url,
                    'images': images
                }


def save_to_mongo(result):
    """Store one parsed record in MongoDB."""
    # Collection.insert() was removed in PyMongo 4; insert_one() replaces it.
    if db[MONGO_TABLE].insert_one(result):
        print('Saved to MongoDB', result)
        return True
    return False


def download_image(url, title):
    """Download one image and hand its bytes to save_image."""
    print('Downloading', url)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            save_image(response.content, title)
        return None
    except RequestException:
        print('Error requesting the image', url)
        return None


def save_image(content, title):
    """Write the image to a per-title directory, named by its MD5 hash."""
    try:
        if title:
            # Remove special characters from the title so that
            # creating the directory does not fail.
            title = re.sub('[:?！!：？]', '', title)
        base_dir = 'z:\\toutiao\\'
        # Create the directory only if it does not exist yet.
        os.makedirs(base_dir + title, exist_ok=True)
        file_path = '{0}/{1}.{2}'.format(base_dir + title, md5(content).hexdigest(), 'jpg')
        if not os.path.exists(file_path):
            with open(file_path, 'wb') as f:
                f.write(content)
    except OSError:
        pass


def main(offset):
    """Crawl one page of search results starting at the given offset."""
    html = get_page_index(offset, KEYWORD)
    for url in parse_page_index(html):
        print(url)
        html = get_page_detail(url)
        if html:
            result = parse_page_detail(html, url)
            if result:
                save_to_mongo(result)
    # print(html)


if __name__ == '__main__':
    # main()
    groups = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]
    pool = Pool()
    pool.map(main, groups)
    pool.close()
    pool.join()
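The script distributes the page offsets with `multiprocessing.Pool`, which uses worker processes rather than threads. As a hedged sketch of a thread-based alternative, the standard library's `multiprocessing.pool.ThreadPool` offers the same `map()` interface. Here `fake_main` is a hypothetical stand-in for `main(offset)` and the pool size of 4 is an arbitrary assumption.

```python
from multiprocessing.pool import ThreadPool

GROUP_START = 1
GROUP_END = 20


def fake_main(offset):
    """Hypothetical stand-in for main(offset); it only echoes the offset."""
    return offset


# Same offset list as the script: one offset per page of 20 results.
offsets = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]

# ThreadPool exposes the same map() interface as Pool but runs threads.
with ThreadPool(4) as pool:
    results = pool.map(fake_main, offsets)

print(offsets[0], offsets[-1], len(offsets))
```

With `GROUP_START = 1` the first requested offset is 20, so the results at offset 0 are not fetched; setting `GROUP_START = 0` would include them. Threads tend to suit this kind of network-bound work, while the process pool in the script avoids sharing state between workers.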


The script downloads the images and saves the records to MongoDB.
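The image-saving step names each file by the MD5 hash of its bytes, so downloading the same image twice does not write it twice. The sketch below isolates that idea; `safe_dir_name` and `save_once` are hypothetical helper names, and a temporary directory stands in for the real download folder.

```python
import os
import re
import tempfile
from hashlib import md5


def safe_dir_name(title):
    """Strip the punctuation that the script removes from titles."""
    return re.sub('[:?！!：？]', '', title)


def save_once(content, title, base_dir):
    """Write content to base_dir/<title>/<md5>.jpg; return (path, written)."""
    folder = os.path.join(base_dir, safe_dir_name(title))
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, md5(content).hexdigest() + '.jpg')
    if os.path.exists(path):
        return path, False  # identical bytes were already saved
    with open(path, 'wb') as f:
        f.write(content)
    return path, True


# Demo with made-up bytes in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    first = save_once(b'fake image bytes', 'Street: snap?', tmp)
    second = save_once(b'fake image bytes', 'Street: snap?', tmp)
    print(first[1], second[1])
```

Because the file name depends only on the content, re-running the crawler over the same articles leaves existing files untouched.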
