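Scraping Recipes from the Douguo Meishi App

Main Features

Scrape the recipes listed under the recipe categories of the Douguo Meishi app with multiple threads, and store them in MongoDB.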

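Tech Stack

whistle: capture and analyze the app's network packets
Nox Android emulator: run the Douguo app
Python: write the crawler code
VS Code: editor
MongoDB: store the data
Robo 3T: MongoDB GUI client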

Douguo App Interface

Some screenshots

Code

spider_douguo.py:

# spider_douguo.py
import json
from concurrent.futures import ThreadPoolExecutor  # thread pool
from multiprocessing import Queue

import requests

from handle_mongo import mongo_info

# Queue of ingredient names (third-level categories) waiting to be searched
queue_list = Queue()


# Send a POST request using the headers captured from the app
def handle_request(url, data):
    header = {
        "client": "4",
        "version": "6962.2",
        "device": "SM-G955N",
        "sdk": "25,7.1.2",
        "channel": "baidu",
        # "resolution": "1600*900",
        # "display-resolution": "1600*900",
        # "dpi": "2.0",
        # "android-id": "784F438E43A20000",
        # "pseudo-id": "864394010787945",
        "brand": "samsung",
        "scale": "2.0",
        "timezone": "28800",
        "language": "zh",
        "cns": "2",
        "carrier": "CMCC",
        "User-Agent": "Mozilla/5.0 (Linux; Android 7.1.2; SM-G955N Build/N2G48H; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/75.0.3770.143 Mobile Safari/537.36",
        "imei": "864394010787945",
        "terms-accepted": "1",
        "newbie": "1",
        "reach": "10000",
        "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
        "Accept-Encoding": "gzip",
        "Connection": "Keep-Alive",
        "Host": "api.douguo.net",
        # Content-Length is left out: requests computes it from the body
    }
    response = requests.post(url=url, headers=header, data=data)
    return response


# Fetch the category list and queue every third-level category name
def handle_cat():
    url = "http://api.douguo.net/recipe/flatcatalogs"
    data = {
        "client": "4",
        "_vs": "2305",
    }
    response = handle_request(url, data)
    index_dict = json.loads(response.text)
    for index_item in index_dict["result"]["cs"]:
        for index_item_1 in index_item["cs"]:
            for index_item_2 in index_item_1["cs"]:
                queue_list.put(index_item_2["name"])


# Search recipes by keyword and store each one with its details
def handle_search(keyword):
    print("Ingredient being processed:", keyword)
    url = "http://api.douguo.net/search/universalnew/0/10"
    data = {
        "client": "4",
        "keyword": keyword,
        "_vs": "400",
    }
    response = handle_request(url, data)
    caipu_list_dict = json.loads(response.text)
    for item in caipu_list_dict["result"]["recipe"]["recipes"]:
        caipu_info = {}
        caipu_info["shicai"] = keyword
        caipu_info["caipu_name"] = item["n"]
        caipu_info["author_name"] = item["an"]
        caipu_info["caipu_id"] = item["id"]
        caipu_info["cookstory"] = item["cookstory"]
        caipu_info["img"] = item["img"]
        caipu_info["major"] = item["major"]
        caipu_info["detail_url"] = item["au"]
        detail_info_dict = json.loads(handle_detail(caipu_info))
        caipu_info["tips"] = detail_info_dict["result"]["recipe"]["tips"]
        caipu_info["cookstep"] = detail_info_dict["result"]["recipe"]["cookstep"]
        print("Recipe being stored:", caipu_info["caipu_name"])
        mongo_info.insert_item(caipu_info)


# Fetch the details of one recipe; returns the raw response text
def handle_detail(item):
    url = "http://api.douguo.net/recipe/detail/" + str(item["caipu_id"])
    # Serialize _ext with json.dumps so the payload is well-formed JSON
    ext = {
        "query": {
            "kw": item["shicai"],
            "src": "11101",
            "idx": "1",
            "type": "13",
            "id": item["caipu_id"],
        }
    }
    data = {
        "client": "4",
        "_vs": "11101",
        "_ext": json.dumps(ext, ensure_ascii=False),
    }
    response = handle_request(url, data)
    return response.text


if __name__ == "__main__":
    handle_cat()
    pool = ThreadPoolExecutor(max_workers=20)  # create the thread pool
    # while queue_list.qsize() > 0:  # raises an error on macOS
    while not queue_list.empty():
        pool.submit(handle_search, queue_list.get())  # function, then its argument
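
One thing the loop above does not show: an exception raised inside a function passed to pool.submit is stored on the returned Future and stays invisible unless result() is called. The sketch below is one possible way to surface such failures. fake_handle_search and run_all are hypothetical names used only for illustration, and the stand-in worker does no network or database work.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_handle_search(keyword):
    # Hypothetical stand-in for handle_search: fails for one keyword.
    if keyword == "bad":
        raise KeyError("recipes")
    return f"done: {keyword}"


def run_all(keywords, worker, max_workers=4):
    """Run worker(keyword) for every keyword; return (results, errors)."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which keyword each future belongs to.
        futures = {pool.submit(worker, kw): kw for kw in keywords}
        for future in as_completed(futures):
            kw = futures[future]
            try:
                results[kw] = future.result()  # re-raises the worker's exception
            except Exception as exc:
                errors[kw] = exc
    return results, errors


results, errors = run_all(["potato", "bad", "tomato"], fake_handle_search)
```

Collecting the errors per keyword makes it possible to log or retry the ingredients whose response was missing an expected key, instead of losing them silently.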

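Storing the data in MongoDB:

# handle_mongo.py (the module name that spider_douguo.py imports)
import pymongo
from pymongo.collection import Collection


class Connect_mongo(object):
    def __init__(self):
        self.client = pymongo.MongoClient(host="127.0.0.1", port=27017)
        self.db_data = self.client["dougou_meishi"]

    def insert_item(self, item):
        db_collection = Collection(self.db_data, "t_douguo_item")
        # insert_one is used because Collection.insert was removed in PyMongo 4
        db_collection.insert_one(item)


# Shared instance imported by the spider
mongo_info = Connect_mongo()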

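Notes

Paste the headers from the captured packet into the editor, then use a regular-expression find-and-replace to turn them into key-value pairs.

Captured headers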

client: 4
version: 6962.2
device: SM-G955N
sdk: 25,7.1.2
channel: baidu
resolution: 1600*900
display-resolution: 1600*900
dpi: 2.0
brand: samsung
scale: 2.0
timezone: 28800
Content-Type: application/x-www-form-urlencoded; charset=utf-8
Accept-Encoding: gzip
Connection: Keep-Alive
Cookie: duid=64275234
Host: api.douguo.net
Content-Length: 147

After processing (the space after each colon is dropped, because a header value must not start with whitespace):

"client":"4",
"version":"6962.2",
"device":"SM-G955N",
"sdk":"25,7.1.2",
"channel":"baidu",
"resolution":"1600*900",
"display-resolution":"1600*900",
"dpi":"2.0",
"brand":"samsung",
"scale":"2.0",
"timezone":"28800",
"Content-Type":"application/x-www-form-urlencoded; charset=utf-8",
"Accept-Encoding":"gzip",
"Connection":"Keep-Alive",
"Cookie":"duid=64275234",
"Host":"api.douguo.net",
"Content-Length":"147",

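The header conversion can also be done in Python instead of the editor. The following is a minimal sketch, not the exact pattern used for the find-and-replace above: HEADER_LINE and parse_headers are hypothetical names, and the sample text is a shortened copy of the captured headers.

```python
import re

RAW_HEADERS = """\
client: 4
version: 6962.2
sdk: 25,7.1.2
Content-Type: application/x-www-form-urlencoded; charset=utf-8
Host: api.douguo.net
"""

# Header name up to the first colon, optional spaces or tabs, then the value.
HEADER_LINE = re.compile(r"^([^:\s]+):[ \t]*(.*)$", re.MULTILINE)


def parse_headers(raw):
    """Turn 'Name: value' lines into a dict without leading or trailing spaces."""
    return {name: value.strip() for name, value in HEADER_LINE.findall(raw)}


headers = parse_headers(RAW_HEADERS)
```

The resulting dict can be passed directly as the headers argument of requests.post.
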
Likewise, turn the URL-encoded parameters into key-value pairs

client=4&_session=123&keyword=%E5%9C%9F%E8%B1%86&_vs=11110&sign_ran=123123&code=123123

First replace every & with a newline

Result of the replacement:

client=4
_session=123
keyword=%E5%9C%9F%E8%B1%86
_vs=11110
sign_ran=123123
code=123123

Then convert each line to key-value format

Result:

"client":"4"
"_session":"123"
"keyword":"%E5%9C%9F%E8%B1%86"
"_vs":"11110"
"sign_ran":"123123"
"code":"123123"

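The same two steps can be sketched with the standard library's urllib.parse.parse_qsl, which splits on & and = in one call. Note one difference from the manual result above: parse_qsl also decodes percent-encoding, so keyword comes back as the original Chinese text rather than %E5%9C%9F%E8%B1%86. The variable names are chosen for illustration.

```python
from urllib.parse import parse_qsl

raw = (
    "client=4&_session=123&keyword=%E5%9C%9F%E8%B1%86"
    "&_vs=11110&sign_ran=123123&code=123123"
)

# parse_qsl returns (key, value) pairs; dict() keeps one value per key.
data = dict(parse_qsl(raw))
```

Passing the decoded dict as data to requests.post is fine, since requests URL-encodes form fields again when it builds the request body.
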
Project Code (Runnable)

Click here to go to GitHub

Problems Encountered

Q: Error message

    while queue_list.qsize() > 0:
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/queues.py", line 120, in qsize
    return self._maxsize - self._sem._semlock._get_value()

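A:
On macOS, calling qsize() on a multiprocessing Queue fails: the platform does not implement sem_getvalue(), so the call raises NotImplementedError. The temporary workaround is to test queue.empty() instead.
Original code: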

while queue_list.qsize() > 0:
    pool.submit(handle_search, queue_list.get())  # function, then its argument

After the change:

...
while not queue_list.empty():
    pool.submit(handle_search, queue_list.get())  # function, then its argument
...
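
An alternative worth considering, since the workers here are threads rather than processes, is the thread-safe queue.Queue from the standard library, whose qsize() works on every platform including macOS. The sketch below is an illustration of that swap, not the project's code: handle_search_stub is a hypothetical stand-in that does no network or database work.

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Queue


def handle_search_stub(keyword):
    # Hypothetical stand-in for handle_search.
    return keyword.upper()


queue_list = Queue()
for name in ["potato", "tomato", "tofu"]:
    queue_list.put(name)

# queue.Queue.qsize() does not rely on sem_getvalue(), so this is portable.
size_before = queue_list.qsize()

futures = []
with ThreadPoolExecutor(max_workers=3) as pool:
    while not queue_list.empty():
        futures.append(pool.submit(handle_search_stub, queue_list.get()))

# Results are read in submission order.
processed = [future.result() for future in futures]
```

Because only the main thread fills and drains the queue here, both the empty() check and the qsize() check are reliable in this layout.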

References

Python爬虫工程师必学——App数据抓取实战 (course: "Must-Learn for Python Crawler Engineers: Hands-On App Data Scraping")
ImportError: No module named pymongo

