Learning Python (5): Scraping the Toutiao Image Gallery

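Today I found some time to write a small crawler that downloads images from Toutiao (今日头条).
In brief:
1. The image index page sends an Ajax request, receives JSON data, and renders that data into the page.
2. The source of each detail page does contain the image URLs, but they cannot be read directly from img elements. They have to be extracted with a regular expression, and the extracted text can then be parsed as JSON to get the image URLs.
These are the two key points. Once you understand them, the rest is straightforward.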
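
The second point can be tried offline. The following is a minimal sketch, not the site's actual markup: `SAMPLE_PAGE` is a made-up fragment shaped like the `gallery: JSON.parse("...")` pattern that the crawler below searches for, and `extract_gallery_urls` is a hypothetical helper name. Toutiao's real pages may differ or may have changed since this was written.

```python
import json
import re

# Made-up page fragment (an assumption for illustration only).
SAMPLE_PAGE = r'''
<script>
  BASE_DATA.galleryInfo = {
    gallery: JSON.parse("{\"sub_images\":[{\"url\":\"http:\\/\\/p1.example.com\\/a.jpg\"},{\"url\":\"http:\\/\\/p1.example.com\\/b.jpg\"}]}"),
  }
</script>
'''

# Same pattern as in the crawler: capture the string passed to JSON.parse.
GALLERY_RE = re.compile(r'gallery: JSON\.parse\("(.*?)"\),')


def extract_gallery_urls(page_source):
    """Return the image URLs found in the embedded gallery JSON, if any."""
    match = GALLERY_RE.search(page_source)
    if match is None:
        return []
    # The payload is a JavaScript string literal, so quotes and slashes are
    # backslash-escaped. Dropping every backslash is the crude approach the
    # crawler uses; it would break on payloads that need real escapes.
    payload = match.group(1).replace("\\", "")
    data = json.loads(payload)
    return [image["url"] for image in data.get("sub_images", [])]


print(extract_gallery_urls(SAMPLE_PAGE))
```

Returning an empty list when the pattern is missing lets the caller fall back to another extraction strategy instead of crashing.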

    # coding=utf-8
    import hashlib
    import json
    import os
    import re
    import string
    import time
    import urllib.parse

    import requests
    from lxml import etree


    class toutiao(object):
        def __init__(self):
            self.header = {
                "content-type": "application/x-www-form-urlencoded",
                "referer": "https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D",
                "user-agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 LBBROWSER",
            }
            # Cookie strings copied from the browser: the first is sent to the
            # search API, the second to the detail pages.
            cookie_str = "tt_webid=6706425614288946700; WEATHER_CITY=%E5%8C%97%E4%BA%AC; UM_distinctid=16b8e5a1a2424d-0e0eb957d32c8e-19174638-1fa400-16b8e5a1a254a4; s_v_web_id=588f2ef74e1ceeaf59f3208561768055; __tasessionId=fjqbezvwc1561461398100; csrftoken=bbe2498f2893b007a4b41af04b99c838; tt_webid=6706425614288946700; CNZZDATA1259612802=1882149716-1561460036-https%253A%252F%252Fwww.baidu.com%252F%7C1561460036"
            img_str = "tt_webid=6706441381927290371; UM_distinctid=16b8e92149daa7-064dc97ae3da06-39395704-1fa400-16b8e92149eb16; tt_webid=6706441381927290371; csrftoken=247aa3e21fc9f5f62f6dae721f4fb01d; tt_webid=6706441381927290371; WEATHER_CITY=%E5%8C%97%E4%BA%AC; CNZZDATA1259612802=429449312-1561463830-%7C1561507030; s_v_web_id=89b01385acfcbede0b21c08deae4878d; __tasessionId=54mvwi6qw1561512119813"
            self.cookies = self.parse_cookies(cookie_str)
            self.img_cookies = self.parse_cookies(img_str)
            # 20 result pages with 20 items each.
            self.url_list = [self.get_url(i * 20) for i in range(0, 20)]
            self.page = 0

        @staticmethod
        def parse_cookies(cookie_str):
            # Strip the space after each ";" and split on the first "=" only,
            # so that values containing "=" stay intact.
            return dict(i.strip().split("=", 1) for i in cookie_str.split(";"))

        def get_url(self, offset):
            url = "https://www.toutiao.com/api/search/content/?"
            timescape = "%d" % time.time()
            params = {
                "aid": "24",
                "app_name": "web_search",
                "offset": offset,
                "format": "json",
                "keyword": "美女",  # search keyword
                "autoload": "true",
                "count": "20",
                "en_qc": "1",
                "cur_tab": "1",
                "from": "search_tab",
                "pd": "synthesis",
                "timestamp": timescape,
            }
            return url + urllib.parse.urlencode(params)

        def parse_url(self, url, header, cookies):
            # Return the response on HTTP 200, otherwise None.
            response = requests.get(url, headers=header, cookies=cookies)
            if response.status_code == 200:
                return response
            print("Request failed: " + url)
            return None

        def run(self):
            self.start()

        def start(self):
            # Fetch the result pages one by one, pausing between requests.
            while self.page < len(self.url_list):
                time.sleep(2)
                url = self.url_list[self.page]
                print(url)
                self.page += 1
                response = self.parse_url(url, self.header, self.cookies)
                if response is not None:
                    self.get_url_item(response.json())

        def get_url_item(self, json_data):
            data_list = json_data.get("data")
            if not data_list:
                print("No data in response: %s" % json_data)
                return
            for data in data_list:
                try:
                    item = {
                        "title": data["title"],
                        "share_url": data["share_url"],
                        "id": data["item_id"],
                        "group_id": data["group_id"],
                    }
                    self.get_detail(item)
                except Exception as e:
                    # Skip entries that are missing a field or fail to download.
                    print("Skipping item: %s" % e)
                    continue

        def get_detail(self, item: dict):
            url = item["share_url"]
            title = item["title"]
            if not url:
                print(title + ": link is empty")
                return
            response = self.parse_url(url, self.header, self.img_cookies)
            if response is None:
                return
            if url.find("http://toutiao.com") >= 0:
                # Article page: the image URLs sit inside HTML-escaped content.
                str_temp = re.findall(r"content: \'&lt;div&gt;&lt;p&gt;(.*?)&lt;\/p&gt;&lt;\/div&gt;\',", response.text)
                if len(str_temp) > 0:
                    for str_url in str_temp:
                        str_img = re.findall(r"&quot;(.*?)&quot;", str_url)
                        str_img = [i for i in str_img if i.find("http://") >= 0]
                        print("*" * 50)
                        print(str_img)
                        for img in str_img:
                            self.save_pic(img, title)
                else:
                    # Gallery page: the image list is a JSON string passed to JSON.parse.
                    str_temp = re.findall(r'gallery: JSON\.parse\("(.*?)"\),', response.text)
                    if len(str_temp) > 0:
                        json_str = json.loads(str_temp[0].replace("\\", ""))
                        print("-" * 50)
                        print(json_str)
                        for sub_image in json_str["sub_images"]:
                            self.save_pic(sub_image["url"], title)
            else:
                # Any other page: fall back to the src of every img element.
                html = etree.HTML(response.content)
                img_url_list = html.xpath("//img/@src")
                for img in img_url_list:
                    self.save_pic(img, title)

        def save_pic(self, img, title):
            if not img:
                return
            # Keep only letters and digits so the title is a safe folder name.
            title_new = "".join([i for i in title if i not in string.punctuation and i.isalnum()])
            # os.mkdir creates one level at a time, so create ./pic first.
            if not os.path.exists("./pic"):
                os.mkdir("./pic")
            path_dir = "./pic/" + title_new
            if not os.path.exists(path_dir):
                os.mkdir(path_dir)
            # Name the file after the MD5 of its URL.
            str_md5 = hashlib.md5(img.encode()).hexdigest()
            response = requests.get(img, headers=self.header)
            file_path = path_dir + "/" + str_md5 + ".png"
            with open(file_path, "wb") as f:
                f.write(response.content)
            print("Downloaded " + img + " -> " + file_path)


    if __name__ == '__main__':
        spider = toutiao()
        spider.run()
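
One detail in the class above is worth isolating: turning a browser-copied cookie string into a dict. Such strings separate pairs with "; ", so splitting on ";" alone leaves a leading space in every key after the first, and taking `split("=")[1]` truncates any value that itself contains "=". The sketch below compares both approaches. `parse_cookie_header` is a hypothetical helper name and the sample values are made up.

```python
def parse_cookie_header(cookie_header):
    """Turn 'a=1; b=2' into {'a': '1', 'b': '2'}."""
    cookies = {}
    for pair in cookie_header.split(";"):
        pair = pair.strip()
        if not pair or "=" not in pair:
            continue  # ignore empty or malformed fragments
        name, value = pair.split("=", 1)  # split on the first "=" only
        cookies[name] = value
    return cookies


# Made-up sample; the last value deliberately contains "=" characters.
sample = "tt_webid=123; WEATHER_CITY=%E5%8C%97%E4%BA%AC; token=abc=="

# Naive parsing keeps the leading spaces and cuts the token short.
naive = {i.split("=")[0]: i.split("=")[1] for i in sample.split(";")}

print(naive)
print(parse_cookie_header(sample))
```

For anything beyond a quick script, the standard library's `http.cookies.SimpleCookie` is another option for parsing cookie strings.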

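Key points:
1. Stripping punctuation
isalnum(): returns True if the string contains at least one character and every character is a letter or a digit (or a mix of both); otherwise it returns False. In Python 3 the check is Unicode-aware, so Chinese characters also count as letters: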

    title_new = "".join([i for i in title if i not in string.punctuation and i.isalnum()])
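
As a small runnable sketch of this filter (the function names `clean_title` and `clean_title_short` are hypothetical, and the sample titles are made up): since `isalnum()` already rejects punctuation and whitespace, the `string.punctuation` test appears to be redundant, and the shorter variant below should give the same result.

```python
import string


def clean_title(title):
    """The filter from the expression above, wrapped in a function."""
    return "".join([i for i in title if i not in string.punctuation and i.isalnum()])


def clean_title_short(title):
    """Same idea, relying on isalnum() alone."""
    return "".join(ch for ch in title if ch.isalnum())


for sample in ["Hello, world! 2019", "街拍：美图!"]:
    print(clean_title(sample), clean_title_short(sample))
```

Note that spaces are removed as well, and that two different titles can collapse to the same folder name.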

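2. os.mkdir cannot create nested directories. It creates only one level at a time, so the parent folder has to exist first (os.makedirs creates the whole path in one call).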
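
A minimal sketch of the difference, run inside a temporary directory so that nothing is left behind. `ensure_dir` is a hypothetical helper name and the folder names are made up.

```python
import os
import tempfile


def ensure_dir(path):
    """Create path and any missing parents; do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as root:
    nested = os.path.join(root, "pic", "some_title")
    try:
        os.mkdir(nested)  # the parent "pic" does not exist yet
    except FileNotFoundError:
        print("os.mkdir cannot create missing parent directories")
    ensure_dir(nested)  # creates "pic" and "some_title" together
    print(os.path.isdir(nested))
```

With `exist_ok=True` the call is safe to repeat, which removes the need for an `os.path.exists` check before each download.
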
3. urllib.parse.urlencode turns a dict of parameters into a query string, which is then appended to the base URL:

    def get_url(self, offset):
        url = "https://www.toutiao.com/api/search/content/?"
        timescape = "%d" % time.time()
        params = {
            "aid": "24",
            "app_name": "web_search",
            "offset": offset,
            "format": "json",
            "keyword": "美女",
            "autoload": "true",
            "count": "20",
            "en_qc": "1",
            "cur_tab": "1",
            "from": "search_tab",
            "pd": "synthesis",
            "timestamp": timescape,
        }
        return url + urllib.parse.urlencode(params)
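
To see what `urlencode` does with a non-ASCII keyword, here is a reduced standalone sketch. `build_search_url` is a hypothetical helper, only a few of the parameters are kept, and the timestamp is a fixed made-up value so the output is reproducible. The keyword 街拍 is the one that appears percent-encoded in the referer header of the crawler.

```python
import urllib.parse

BASE_URL = "https://www.toutiao.com/api/search/content/?"


def build_search_url(offset, keyword, timestamp):
    """Append an encoded query string to the base URL."""
    params = {
        "offset": offset,  # ints are converted to strings by urlencode
        "format": "json",
        "keyword": keyword,  # non-ASCII text is UTF-8 percent-encoded
        "count": "20",
        "timestamp": timestamp,
    }
    return BASE_URL + urllib.parse.urlencode(params)


url = build_search_url(20, "街拍", 1561461398)
print(url)

# Decoding the query string gives the original keyword back.
query = urllib.parse.urlsplit(url).query
print(urllib.parse.parse_qs(query)["keyword"])
```

Building the query from a dict avoids hand-escaping the keyword and keeps the parameters easy to change per request.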
