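Taobao enforces rate limits. When crawling data from the Taobao site, we want to keep its data interface from returning a message like the following: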

u'\r\nvar propvalues={"error_response":{"code":7,"msg":"App Call Limited","sub_code":"accesscontrol.limited-by-dynamic-access-count","sub_msg":"This ban will last for 1 more seconds","request_id":"2elclr9dnika"}}'

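The usual approach is to add sleep(15) after the line that reads res.text, which slows the request rate, but crawling that way is far too slow.

For a new link address (one whose cid value has changed), Taobao's rate-limit rule does not kick in immediately. After experimenting, I simply dropped the sleep call. Instead, whenever a request returns the App Call Limited message, I record which cid it was for. We can then submit those link addresses manually to retrieve the data the crawler failed to download.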
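The bookkeeping described above can be sketched as follows. This is a minimal sketch, not the crawler's actual code: the helper names `parse_limit_error` and `record_if_limited` are hypothetical, the cid values in the demo are only examples, and it is written for Python 3 while the crawler in this article targets Python 2.7.5. It assumes a rate-limit reply has the same shape as the sample message quoted at the top of the article.

```python
import json
import re


def parse_limit_error(text):
    """Return (sub_code, ban_seconds) for a rate-limit reply, otherwise None."""
    payload = text.strip()
    # Replies may be wrapped as a JavaScript assignment: var propvalues={...}
    if payload.startswith('var '):
        payload = payload.partition('=')[2]
    try:
        body = json.loads(payload.rstrip(';'))
    except ValueError:
        return None  # not JSON, so not a recognisable error reply
    error = body.get('error_response') if isinstance(body, dict) else None
    if not error or error.get('msg') != 'App Call Limited':
        return None
    # sub_msg reads like "This ban will last for 1 more seconds"
    match = re.search(r'(\d+) more seconds', error.get('sub_msg', ''))
    ban_seconds = int(match.group(1)) if match else None
    return error.get('sub_code'), ban_seconds


def record_if_limited(cid, text, limited_cids):
    """Append cid to limited_cids when text is a rate-limit reply."""
    info = parse_limit_error(text)
    if info is not None:
        limited_cids.append(cid)
    return info


SAMPLE = ('\r\nvar propvalues={"error_response":{"code":7,"msg":"App Call Limited",'
          '"sub_code":"accesscontrol.limited-by-dynamic-access-count",'
          '"sub_msg":"This ban will last for 1 more seconds",'
          '"request_id":"2elclr9dnika"}}')

limited = []
record_if_limited(50010850, SAMPLE, limited)                        # rate-limited reply
record_if_limited(1622, '{"itemcats_get_response":{}}', limited)    # normal reply
print(limited)
```

The list of recorded cid values is what you would later resubmit by hand in the browser. The parsed ban duration could also drive a short, targeted pause instead of a fixed sleep(15), although the article's crawler does not do this.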

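In this example I originally wanted to crawl all the data in a single run, so I defined the contents of the top-level category file category-top.csv as follows: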

0|{"itemcats_get_response":{"item_cats":{"item_cat":[
{"cid":16,"is_parent":true,"name":"女装/女士精品","parent_cid":0,"status":"normal"},
{"cid":120886001,"is_parent":true,"name":"公益","parent_cid":0,"status":"normal"},
{"cid":98,"is_parent":true,"name":"包装","parent_cid":0,"status":"normal"},
{"cid":120950002,"is_parent":true,"name":"天猫点券","parent_cid":0,"status":"normal"},
{"cid":50802001,"is_parent":true,"name":"数字阅读","parent_cid":0,"status":"normal"},
{"cid":120894001,"is_parent":true,"name":"淘女郎","parent_cid":0,"status":"normal"},
{"cid":50023722,"is_parent":true,"name":"隐形眼镜/护理液","parent_cid":0,"status":"normal"},
{"cid":50026555,"is_parent":true,"name":"购物提货券","parent_cid":0,"status":"normal"},
{"cid":50026523,"is_parent":true,"name":"休闲娱乐","parent_cid":0,"status":"normal"},
{"cid":50008075,"is_parent":true,"name":"餐饮美食卡券","parent_cid":0,"status":"normal"},
{"cid":50019095,"is_parent":true,"name":"消费卡","parent_cid":0,"status":"normal"},
{"cid":50014927,"is_parent":true,"name":"教育培训","parent_cid":0,"status":"normal"},
{"cid":26,"is_parent":true,"name":"汽车/用品/配件/改装","parent_cid":0,"status":"normal"},
{"cid":50020808,"is_parent":true,"name":"家居饰品","parent_cid":0,"status":"normal"},
{"cid":50020857,"is_parent":true,"name":"特色手工艺","parent_cid":0,"status":"normal"},
{"cid":50025707,"is_parent":true,"name":"度假线路/签证送关/旅游服务","parent_cid":0,"status":"normal"},
{"cid":50024099,"is_parent":true,"name":"电子元器件市场","parent_cid":0,"status":"normal"},
{"cid":30,"is_parent":true,"name":"男装","parent_cid":0,"status":"normal"},
{"cid":50008164,"is_parent":true,"name":"住宅家具","parent_cid":0,"status":"normal"},
{"cid":50020611,"is_parent":true,"name":"商业/办公家具","parent_cid":0,"status":"normal"},
{"cid":50010788,"is_parent":true,"name":"彩妆/香水/美妆工具","parent_cid":0,"status":"normal"},
{"cid":1801,"is_parent":true,"name":"美容护肤/美体/精油","parent_cid":0,"status":"normal"},
{"cid":50023282,"is_parent":true,"name":"美发护发/假发","parent_cid":0,"status":"normal"},
{"cid":1512,"is_parent":false,"name":"手机","parent_cid":0,"status":"normal"},
{"cid":14,"is_parent":true,"name":"数码相机/单反相机/摄像机","parent_cid":0,"status":"normal"},
{"cid":1201,"is_parent":false,"name":"MP3/MP4/iPod/录音笔","parent_cid":0,"status":"normal"},
{"cid":1101,"is_parent":false,"name":"笔记本电脑","parent_cid":0,"status":"normal"},
{"cid":50019780,"is_parent":false,"name":"平板电脑/MID","parent_cid":0,"status":"normal"},
{"cid":50018222,"is_parent":true,"name":"DIY电脑","parent_cid":0,"status":"normal"},
{"cid":11,"is_parent":true,"name":"电脑硬件/显示器/电脑周边","parent_cid":0,"status":"normal"},
{"cid":50018264,"is_parent":true,"name":"网络设备/网络相关","parent_cid":0,"status":"normal"},
{"cid":50008090,"is_parent":true,"name":"3C数码配件","parent_cid":0,"status":"normal"},
{"cid":50012164,"is_parent":true,"name":"闪存卡/U盘/存储/移动硬盘","parent_cid":0,"status":"normal"},
{"cid":50007218,"is_parent":true,"name":"办公设备/耗材/相关服务","parent_cid":0,"status":"normal"},
{"cid":50018004,"is_parent":true,"name":"电子词典/电纸书/文化用品","parent_cid":0,"status":"normal"},
{"cid":20,"is_parent":true,"name":"电玩/配件/游戏/攻略","parent_cid":0,"status":"normal"},
{"cid":50022703,"is_parent":true,"name":"大家电","parent_cid":0,"status":"normal"},
{"cid":50011972,"is_parent":true,"name":"影音电器","parent_cid":0,"status":"normal"},
{"cid":50012100,"is_parent":true,"name":"生活电器","parent_cid":0,"status":"normal"},
{"cid":50012082,"is_parent":true,"name":"厨房电器","parent_cid":0,"status":"normal"},
{"cid":50002768,"is_parent":true,"name":"个人护理/保健/按摩器材","parent_cid":0,"status":"normal"},
{"cid":27,"is_parent":true,"name":"家装主材","parent_cid":0,"status":"normal"},
{"cid":124912001,"is_parent":false,"name":"合约机","parent_cid":0,"status":"normal"},
{"cid":50020332,"is_parent":true,"name":"基础建材","parent_cid":0,"status":"normal"},
{"cid":50020485,"is_parent":true,"name":"五金/工具","parent_cid":0,"status":"normal"},
{"cid":50026535,"is_parent":true,"name":"医疗及健康服务","parent_cid":0,"status":"normal"},
{"cid":50020579,"is_parent":true,"name":"电子/电工","parent_cid":0,"status":"normal"},
{"cid":50050471,"is_parent":true,"name":"婚庆/摄影/摄像服务","parent_cid":0,"status":"normal"},
{"cid":50011949,"is_parent":true,"name":"特价酒店/特色客栈/公寓旅馆","parent_cid":0,"status":"normal"},
{"cid":21,"is_parent":true,"name":"居家日用","parent_cid":0,"status":"normal"},
{"cid":50016349,"is_parent":true,"name":"厨房/烹饪用具","parent_cid":0,"status":"normal"},
{"cid":50016348,"is_parent":true,"name":"家庭/个人清洁工具","parent_cid":0,"status":"normal"},
{"cid":50008163,"is_parent":true,"name":"床上用品","parent_cid":0,"status":"normal"},
{"cid":35,"is_parent":true,"name":"奶粉/辅食/营养品/零食","parent_cid":0,"status":"normal"},
{"cid":50014812,"is_parent":true,"name":"尿片/洗护/喂哺/推车床","parent_cid":0,"status":"normal"},
{"cid":50022517,"is_parent":true,"name":"孕妇装/孕产妇用品/营养","parent_cid":0,"status":"normal"},
{"cid":50008165,"is_parent":true,"name":"童装/婴儿装/亲子装","parent_cid":0,"status":"normal"},
{"cid":50020275,"is_parent":true,"name":"传统滋补营养品","parent_cid":0,"status":"normal"},
{"cid":50002766,"is_parent":true,"name":"零食/坚果/特产","parent_cid":0,"status":"normal"},
{"cid":50016422,"is_parent":true,"name":"粮油米面/南北干货/调味品","parent_cid":0,"status":"normal"},
{"cid":121380001,"is_parent":true,"name":"国内机票/国际机票/增值服务","parent_cid":0,"status":"normal"},
{"cid":121536003,"is_parent":true,"name":"数字娱乐","parent_cid":0,"status":"normal"},
{"cid":121536007,"is_parent":true,"name":"全球购代购市场","parent_cid":0,"status":"normal"},
{"cid":40,"is_parent":true,"name":"腾讯QQ专区","parent_cid":0,"status":"normal"},
{"cid":50010728,"is_parent":true,"name":"运动/瑜伽/健身/球迷用品","parent_cid":0,"status":"normal"},
{"cid":50013886,"is_parent":true,"name":"户外/登山/野营/旅行用品","parent_cid":0,"status":"normal"},
{"cid":50011699,"is_parent":true,"name":"运动服/休闲服装","parent_cid":0,"status":"normal"},
{"cid":25,"is_parent":true,"name":"玩具/童车/益智/积木/模型","parent_cid":0,"status":"normal"},
{"cid":50011665,"is_parent":true,"name":"网游装备/游戏币/帐号/代练","parent_cid":0,"status":"normal"},
{"cid":50008907,"is_parent":true,"name":"手机号码/套餐/增值业务","parent_cid":0,"status":"normal"},
{"cid":99,"is_parent":true,"name":"网络游戏点卡","parent_cid":0,"status":"normal"},
{"cid":23,"is_parent":true,"name":"古董/邮币/字画/收藏","parent_cid":0,"status":"normal"},
{"cid":50007216,"is_parent":true,"name":"鲜花速递/花卉仿真/绿植园艺","parent_cid":0,"status":"normal"},
{"cid":50004958,"is_parent":true,"name":"移动/联通/电信充值中心","parent_cid":0,"status":"normal"},
{"cid":50011740,"is_parent":true,"name":"流行男鞋","parent_cid":0,"status":"normal"},
{"cid":50006843,"is_parent":true,"name":"女鞋","parent_cid":0,"status":"normal"},
{"cid":50006842,"is_parent":true,"name":"箱包皮具/热销女包/男包","parent_cid":0,"status":"normal"},
{"cid":1625,"is_parent":true,"name":"女士内衣/男士内衣/家居服","parent_cid":0,"status":"normal"},
{"cid":50010404,"is_parent":true,"name":"服饰配件/皮带/帽子/围巾","parent_cid":0,"status":"normal"},
{"cid":50011397,"is_parent":true,"name":"珠宝/钻石/翡翠/黄金","parent_cid":0,"status":"normal"},
{"cid":28,"is_parent":true,"name":"ZIPPO/瑞士军刀/眼镜","parent_cid":0,"status":"normal"},
{"cid":33,"is_parent":true,"name":"书/杂志/报纸","parent_cid":0,"status":"normal"},
{"cid":34,"is_parent":true,"name":"音乐/影视/明星/音像","parent_cid":0,"status":"normal"},
{"cid":50017300,"is_parent":true,"name":"乐器/吉他/钢琴/配件","parent_cid":0,"status":"normal"},
{"cid":29,"is_parent":true,"name":"宠物/宠物食品及用品","parent_cid":0,"status":"normal"},
{"cid":2813,"is_parent":true,"name":"成人用品/情趣用品","parent_cid":0,"status":"normal"},
{"cid":50012029,"is_parent":true,"name":"运动鞋new","parent_cid":0,"status":"normal"},
{"cid":50013864,"is_parent":true,"name":"饰品/流行首饰/时尚饰品新","parent_cid":0,"status":"normal"},
{"cid":50014811,"is_parent":true,"name":"网店/网络服务/软件","parent_cid":0,"status":"normal"},
{"cid":50023724,"is_parent":true,"name":"其他","parent_cid":0,"status":"normal"},
{"cid":50017652,"is_parent":true,"name":"TP服务商大类","parent_cid":0,"status":"normal"},
{"cid":50023575,"is_parent":true,"name":"房产/租房/新房/二手房/委托服务","parent_cid":0,"status":"normal"},
{"cid":50023717,"is_parent":true,"name":"OTC药品/医疗器械/计生用品","parent_cid":0,"status":"normal"},
{"cid":50023878,"is_parent":true,"name":"自用闲置转让","parent_cid":0,"status":"normal"},
{"cid":50024186,"is_parent":true,"name":"保险","parent_cid":0,"status":"normal"},
{"cid":50024612,"is_parent":true,"name":"阿里健康送药服务","parent_cid":0,"status":"normal"},
{"cid":50024971,"is_parent":true,"name":"新车/二手车","parent_cid":0,"status":"normal"},
{"cid":50025004,"is_parent":true,"name":"个性定制/设计服务/DIY","parent_cid":0,"status":"normal"},
{"cid":50025110,"is_parent":true,"name":"电影/演出/体育赛事","parent_cid":0,"status":"normal"},
{"cid":50025618,"is_parent":true,"name":"理财","parent_cid":0,"status":"normal"},
{"cid":50025705,"is_parent":true,"name":"洗护清洁剂/卫生巾/纸/香薰","parent_cid":0,"status":"normal"},
{"cid":50025968,"is_parent":true,"name":"司法拍卖拍品专用","parent_cid":0,"status":"normal"},
{"cid":50026316,"is_parent":true,"name":"咖啡/麦片/冲饮","parent_cid":0,"status":"normal"},
{"cid":50023804,"is_parent":true,"name":"装修设计/施工/监理","parent_cid":0,"status":"normal"},
{"cid":50026800,"is_parent":true,"name":"保健食品/膳食营养补充食品","parent_cid":0,"status":"normal"},
{"cid":50050359,"is_parent":true,"name":"水产肉类/新鲜蔬果/熟食","parent_cid":0,"status":"normal"},
{"cid":50074001,"is_parent":true,"name":"摩托车/装备/配件","parent_cid":0,"status":"normal"},
{"cid":50158001,"is_parent":true,"name":"网络店铺代金/优惠券","parent_cid":0,"status":"normal"},
{"cid":50230002,"is_parent":true,"name":"服务商品","parent_cid":0,"status":"normal"},
{"cid":50454031,"is_parent":true,"name":"景点门票/演艺演出/周边游","parent_cid":0,"status":"normal"},
{"cid":50468001,"is_parent":true,"name":"手表","parent_cid":0,"status":"normal"},
{"cid":50510002,"is_parent":true,"name":"运动包/户外包/配件","parent_cid":0,"status":"normal"},
{"cid":50008141,"is_parent":true,"name":"酒类","parent_cid":0,"status":"normal"},
{"cid":50734010,"is_parent":true,"name":"资产","parent_cid":0,"status":"normal"},
{"cid":50025111,"is_parent":true,"name":"本地化生活服务","parent_cid":0,"status":"normal"},
{"cid":121938001,"is_parent":false,"name":"淘点点预定点菜","parent_cid":0,"status":"normal"},
{"cid":121940001,"is_parent":false,"name":"淘点点现金券","parent_cid":0,"status":"normal"},
{"cid":122650005,"is_parent":true,"name":"童鞋/婴儿鞋/亲子鞋","parent_cid":0,"status":"normal"},
{"cid":122684003,"is_parent":true,"name":"自行车/骑行装备/零配件","parent_cid":0,"status":"normal"},
{"cid":122718004,"is_parent":true,"name":"家庭保健","parent_cid":0,"status":"normal"},
{"cid":122852001,"is_parent":true,"name":"居家布艺","parent_cid":0,"status":"normal"},
{"cid":122950001,"is_parent":true,"name":"节庆用品/礼品","parent_cid":0,"status":"normal"},
{"cid":122952001,"is_parent":true,"name":"餐饮具","parent_cid":0,"status":"normal"},
{"cid":122928002,"is_parent":true,"name":"收纳整理","parent_cid":0,"status":"normal"},
{"cid":122966004,"is_parent":true,"name":"处方药","parent_cid":0,"status":"normal"},
{"cid":123536002,"is_parent":true,"name":"阿里通信专属类目","parent_cid":0,"status":"normal"},
{"cid":123500005,"is_parent":true,"name":"资产(政府类专用)","parent_cid":0,"status":"normal"},
{"cid":123690003,"is_parent":true,"name":"精制中药材","parent_cid":0,"status":"normal"},
{"cid":124024001,"is_parent":true,"name":"农业生产资料(农村淘宝专用)","parent_cid":0,"status":"normal"},
{"cid":124044001,"is_parent":true,"name":"品牌台机/品牌一体机/服务器","parent_cid":0,"status":"normal"},
{"cid":124050001,"is_parent":true,"name":"全屋定制","parent_cid":0,"status":"normal"},
{"cid":124242008,"is_parent":true,"name":"智能设备","parent_cid":0,"status":"normal"},
{"cid":124354002,"is_parent":true,"name":"电动车/配件/交通工具","parent_cid":0,"status":"normal"},
{"cid":124466001,"is_parent":true,"name":"农用物资","parent_cid":0,"status":"normal"},
{"cid":124468001,"is_parent":true,"name":"农机/农具/农膜","parent_cid":0,"status":"normal"},
{"cid":124470001,"is_parent":true,"name":"畜牧/养殖物资","parent_cid":0,"status":"normal"},
{"cid":124470006,"is_parent":true,"name":"整车(经销商)","parent_cid":0,"status":"normal"},
{"cid":124484008,"is_parent":true,"name":"模玩/动漫/周边/cos/桌游","parent_cid":0,"status":"normal"},
{"cid":124458005,"is_parent":true,"name":"茶","parent_cid":0,"status":"normal"},
{"cid":124568010,"is_parent":true,"name":"室内设计师","parent_cid":0,"status":"normal"},
{"cid":124750013,"is_parent":true,"name":"俪人购(俪人购专用)","parent_cid":0,"status":"normal"},
{"cid":124698018,"is_parent":true,"name":"装修服务","parent_cid":0,"status":"normal"},
{"cid":124844002,"is_parent":true,"name":"拍卖会专用","parent_cid":0,"status":"normal"},
{"cid":124868003,"is_parent":true,"name":"盒马","parent_cid":0,"status":"normal"},
{"cid":124852003,"is_parent":true,"name":"二手数码","parent_cid":0,"status":"normal"},
{"cid":125102006,"is_parent":true,"name":"到家业务","parent_cid":0,"status":"normal"},
{"cid":125406001,"is_parent":true,"name":"享淘卡","parent_cid":0,"status":"normal"},
{"cid":126040001,"is_parent":true,"name":"橙运","parent_cid":0,"status":"normal"},
{"cid":126252002,"is_parent":true,"name":"门店O2O","parent_cid":0,"status":"normal"},
{"cid":126488005,"is_parent":true,"name":"天猫零售O2O","parent_cid":0,"status":"normal"},
{"cid":126488008,"is_parent":true,"name":"阿里健康B2B平台","parent_cid":0,"status":"normal"},
{"cid":126602002,"is_parent":true,"name":"生活娱乐充值","parent_cid":0,"status":"normal"},
{"cid":126700003,"is_parent":true,"name":"家装灯饰光源","parent_cid":0,"status":"normal"},
{"cid":126762001,"is_parent":true,"name":"美容美体仪器","parent_cid":0,"status":"normal"},
{"cid":127076003,"is_parent":true,"name":"平台充值活动(仅内部店铺)","parent_cid":0,"status":"normal"},
{"cid":127492006,"is_parent":true,"name":"标准件/零部件/工业耗材","parent_cid":0,"status":"normal"},
{"cid":127484003,"is_parent":true,"name":"润滑/胶粘/试剂/实验室耗材","parent_cid":0,"status":"normal"},
{"cid":127508003,"is_parent":true,"name":"机械设备","parent_cid":0,"status":"normal"},
{"cid":127458007,"is_parent":true,"name":"搬运/仓储/物流设备","parent_cid":0,"status":"normal"},
{"cid":127442006,"is_parent":true,"name":"纺织面料/辅料/配套","parent_cid":0,"status":"normal"},
{"cid":127450004,"is_parent":true,"name":"金属材料及制品","parent_cid":0,"status":"normal"},
{"cid":127452002,"is_parent":true,"name":"橡塑材料及制品","parent_cid":0,"status":"normal"},
{"cid":127588002,"is_parent":true,"name":"阿里云云市场","parent_cid":0,"status":"normal"},
{"cid":127878006,"is_parent":true,"name":"新制造","parent_cid":0,"status":"normal"},
{"cid":127924022,"is_parent":true,"name":"零售通","parent_cid":0,"status":"normal"}
]},"request_id":"s82mq3r0hshh"}}|0

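If category-top.csv is defined as above, the crawler will run for a very long time. Along the way it may receive many rate-limit messages, hit fatal errors while parsing incomplete responses, or fail with other network errors, and a crawl like that is hard to keep under control. So let's not be greedy: we split the 165 top-level categories above into 165 separate crawls. In other words, we redefine the contents of category-top.csv as shown below. Note that the file must consist of exactly 3 lines.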

[myth@contoso ~]$ cat /home/myth/taobao/category-top.csv
0|{"itemcats_get_response":{"item_cats":{"item_cat":[
{"cid":16,"is_parent":true,"name":"女装/女士精品","parent_cid":0,"status":"normal"},
]},"request_id":"s82mq3r0hshh"}}|0
[myth@contoso ~]$
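Producing 165 of these 3-line files by hand is tedious, so the split can be sketched in code. This is only an illustration under stated assumptions: the function name `split_top_level` is hypothetical, the two-category sample stands in for the full 165-category file, and the code is written for Python 3 rather than the Python 2.7.5 used by the crawler.

```python
import json


def split_top_level(full_text):
    """Turn one multi-category category-top.csv into a list of 3-line file bodies."""
    head, _, rest = full_text.strip().partition('|')
    payload, _, tail = rest.rpartition('|')
    response = json.loads(payload)['itemcats_get_response']
    request_id = response['request_id']
    bodies = []
    for item in response['item_cats']['item_cat']:
        # sort_keys reproduces the original key order: cid, is_parent, name, ...
        record = json.dumps(item, ensure_ascii=False, sort_keys=True,
                            separators=(',', ':'))
        bodies.append(
            head + '|{"itemcats_get_response":{"item_cats":{"item_cat":[\n'
            + record + ',\n'
            + ']},"request_id":"' + request_id + '"}}|' + tail + '\n'
        )
    return bodies


SAMPLE = (
    '0|{"itemcats_get_response":{"item_cats":{"item_cat":[\n'
    '{"cid":16,"is_parent":true,"name":"女装/女士精品","parent_cid":0,"status":"normal"},\n'
    '{"cid":30,"is_parent":true,"name":"男装","parent_cid":0,"status":"normal"}\n'
    ']},"request_id":"s82mq3r0hshh"}}|0\n'
)

bodies = split_top_level(SAMPLE)
print(bodies[0])
```

Each element of `bodies` can be written to category-top.csv in turn before a crawler run. The trailing comma on the middle line is kept on purpose, because the crawler strips a trailing comma and newline from every line before joining them.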

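If you debug the crawler, you will see that for the top-level category 女装/女士精品 it fetches the following data from the Taobao server, in this recursive order: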

Top-level category:
{"cid":16,"is_parent":true,"name":"女装/女士精品","parent_cid":0,"status":"normal"}

女装/女士精品:
[{u'status': u'normal', u'parent_cid': 16, u'name': u'连衣裙', u'is_parent': False, u'cid': 50010850},
{u'status': u'normal', u'parent_cid': 16, u'name': u'T恤', u'is_parent': False, u'cid': 50000671},
{u'status': u'normal', u'parent_cid': 16, u'name': u'衬衫', u'is_parent': False, u'cid': 162104},
{u'status': u'normal', u'parent_cid': 16, u'name': u'裤子', u'is_parent': True, u'cid': 1622},
{u'status': u'normal', u'parent_cid': 16, u'name': u'牛仔裤', u'is_parent': False, u'cid': 162205},
{u'status': u'normal', u'parent_cid': 16, u'name': u'半身裙', u'is_parent': False, u'cid': 1623},
{u'status': u'normal', u'parent_cid': 16, u'name': u'马夹', u'is_parent': False, u'cid': 50013196},
{u'status': u'normal', u'parent_cid': 16, u'name': u'蕾丝衫/雪纺衫', u'is_parent': False, u'cid': 162116},
{u'status': u'normal', u'parent_cid': 16, u'name': u'毛针织衫', u'is_parent': False, u'cid': 50000697},
{u'status': u'normal', u'parent_cid': 16, u'name': u'短外套', u'is_parent': False, u'cid': 50011277},
{u'status': u'normal', u'parent_cid': 16, u'name': u'西装', u'is_parent': False, u'cid': 50008897},
{u'status': u'normal', u'parent_cid': 16, u'name': u'卫衣/绒衫', u'is_parent': False, u'cid': 50008898},
{u'status': u'normal', u'parent_cid': 16, u'name': u'毛衣', u'is_parent': False, u'cid': 162103},
{u'status': u'normal', u'parent_cid': 16, u'name': u'风衣', u'is_parent': False, u'cid': 50008901},
{u'status': u'normal', u'parent_cid': 16, u'name': u'毛呢外套', u'is_parent': False, u'cid': 50013194},
{u'status': u'normal', u'parent_cid': 16, u'name': u'棉衣/棉服', u'is_parent': False, u'cid': 50008900},
{u'status': u'normal', u'parent_cid': 16, u'name': u'羽绒服', u'is_parent': False, u'cid': 50008899},
{u'status': u'normal', u'parent_cid': 16, u'name': u'皮衣', u'is_parent': False, u'cid': 50008904},
{u'status': u'normal', u'parent_cid': 16, u'name': u'皮草', u'is_parent': False, u'cid': 50008905},
{u'status': u'normal', u'parent_cid': 16, u'name': u'中老年女装', u'is_parent': False, u'cid': 50000852},
{u'status': u'normal', u'parent_cid': 16, u'name': u'大码女装', u'is_parent': False, u'cid': 1629},
{u'status': u'normal', u'parent_cid': 16, u'name': u'套装/学生校服/工作制服', u'is_parent': True, u'cid': 1624},
{u'status': u'normal', u'parent_cid': 16, u'name': u'婚纱/旗袍/礼服', u'is_parent': True, u'cid': 50011404},
{u'status': u'normal', u'parent_cid': 16, u'name': u'唐装/民族服装/舞台服装', u'is_parent': True, u'cid': 50008906},
{u'status': u'normal', u'parent_cid': 16, u'name': u'背心吊带', u'is_parent': False, u'cid': 121412004},
{u'status': u'normal', u'parent_cid': 16, u'name': u'抹胸', u'is_parent': False, u'cid': 121434004}]

裤子:
[{u'status': u'normal', u'parent_cid': 1622, u'name': u'休闲裤', u'is_parent': False, u'cid': 162201},
{u'status': u'normal', u'parent_cid': 1622, u'name': u'西装裤/正装裤', u'is_parent': False, u'cid': 50022566},
{u'status': u'normal', u'parent_cid': 1622, u'name': u'打底裤', u'is_parent': False, u'cid': 50007068},
{u'status': u'normal', u'parent_cid': 1622, u'name': u'棉裤/羽绒裤', u'is_parent': False, u'cid': 50026651}]

套装/学生校服/工作制服:
[{u'status': u'normal', u'parent_cid': 1624, u'name': u'学生校服', u'is_parent': False, u'cid': 50008903},
{u'status': u'normal', u'parent_cid': 1624, u'name': u'职业女裙套装', u'is_parent': False, u'cid': 162401},
{u'status': u'normal', u'parent_cid': 1624, u'name': u'职业女裤套装', u'is_parent': False, u'cid': 162402},
{u'status': u'normal', u'parent_cid': 1624, u'name': u'休闲运动套装', u'is_parent': False, u'cid': 162404},
{u'status': u'normal', u'parent_cid': 1624, u'name': u'其它制服/套装', u'is_parent': False, u'cid': 162403},
{u'status': u'normal', u'parent_cid': 1624, u'name': u'医护制服', u'is_parent': False, u'cid': 50011411},
{u'status': u'normal', u'parent_cid': 1624, u'name': u'酒店工作制服', u'is_parent': False, u'cid': 50011412},
{u'status': u'normal', u'parent_cid': 1624, u'name': u'时尚套装', u'is_parent': False, u'cid': 123216004}]

婚纱/旗袍/礼服:
[{"cid":162701,"is_parent":false,"name":"婚纱","parent_cid":50011404,"status":"normal"},
{"cid":50005065,"is_parent":false,"name":"旗袍","parent_cid":50011404,"status":"normal"},
{"cid":162702,"is_parent":false,"name":"礼服\/晚装","parent_cid":50011404,"status":"normal"}]

唐装/民族服装/舞台服装:
[{u'status': u'normal', u'parent_cid': 50008906, u'name': u'民族服装/舞台装', u'is_parent': False, u'cid': 162703},
{u'status': u'normal', u'parent_cid': 50008906, u'name': u'唐装/中式服装', u'is_parent': True, u'cid': 1636}]

唐装/中式服装:
[{u'status': u'normal', u'parent_cid': 1636, u'name': u'上衣', u'is_parent': False, u'cid': 50003509},
{u'status': u'normal', u'parent_cid': 1636, u'name': u'裤子', u'is_parent': False, u'cid': 50003510},
{u'status': u'normal', u'parent_cid': 1636, u'name': u'裙子', u'is_parent': False, u'cid': 50003511}]
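The order seen in this listing (each level is listed in full first, then every child is visited in turn, with parent categories expanded recursively) can be reproduced with a small in-memory model. This is a sketch only: the `CHILDREN` table is a hand-picked subset of the data above that stands in for the server, and `walk` is a hypothetical name, not a function from the crawler.

```python
# Stand-in for the server: parent cid -> list of (cid, name, is_parent).
CHILDREN = {
    16: [(50010850, u'连衣裙', False),
         (1622, u'裤子', True),
         (50008906, u'唐装/民族服装/舞台服装', True)],
    1622: [(162201, u'休闲裤', False)],
    50008906: [(162703, u'民族服装/舞台装', False),
               (1636, u'唐装/中式服装', True)],
    1636: [(50003509, u'上衣', False)],
}


def walk(parent_cid, listed, leaves):
    """Record one level, then descend into each child in order."""
    children = CHILDREN.get(parent_cid, [])
    # Step 1: the whole level is recorded first
    # (the crawler appends it to category-all.csv at this point).
    listed.append((parent_cid, [cid for cid, _name, _is_parent in children]))
    # Step 2: leaves trigger a property fetch, parents recurse.
    for cid, _name, is_parent in children:
        if is_parent:
            walk(cid, listed, leaves)
        else:
            leaves.append(cid)


listed, leaves = [], []
walk(16, listed, leaves)
print([parent for parent, _ in listed])
print(leaves)
```

The first list shows the order in which category levels are requested, and the second shows the order in which leaf categories have their properties fetched.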

The crawler runs on Linux, although you can also run it on Windows. The Python version is:

[myth@contoso ~]$ python --version
Python 2.7.5
[myth@contoso ~]$

The crawler code is as follows:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import requests
import sys
import json

# Python 2 only: make utf-8 the default encoding so that Chinese category
# names can be formatted and written to the output files.
reload(sys)
sys.setdefaultencoding('utf-8')

session = requests.Session()

# Read category-top.csv, dropping the trailing comma and newline of each line,
# then join the lines into a single "cid|json|parentId" string.
# This relies on the 3-line file layout (exactly one category line).
data = list()
with open('category-top.csv', 'r') as f:
    for line in f:
        data.append(line.strip(',\n'))
cidStr = ''.join(data)


def createCidSelect(cidStr):
    # cidStr has the form "cid|json response|parentId".
    cidArr = cidStr.split("|")
    cid = cidArr[0]
    spanId = cidArr[2]
    if '' == cid:
        return False
    cidArr = json.loads(cidArr[1])['itemcats_get_response']
    cidArr = cidArr['item_cats']
    cidArr = cidArr['item_cat']
    count = len(cidArr)
    # Append every category of this level to category-all.csv.
    file = open('category-all.csv', 'a')
    list1 = list()
    for i in range(count):
        if cidArr[i]['status'] == 'normal':
            file.write('{0},{1},{2},{3},{4};\n'.format(
                cidArr[i]['status'],
                cidArr[i]['parent_cid'],
                cidArr[i]['name'],
                int(cidArr[i]['is_parent']),
                cidArr[i]['cid']))
            list1.append(cidArr[i])
    file.close()
    parentId = cid
    # Then visit each category of this level in turn.
    for item in list1:
        childCidList(item, parentId)


def childCidList(item, parentId):
    cid = 0
    try:
        cid = item['cid']
        # Leaf category: fetch its properties and stop descending.
        if item['is_parent'] == False:
            loadScript(cid)
            return
        # Parent category: fetch its child categories and recurse.
        url = 'http://open.taobao.com/apitools/ajax_props.do?_tb_token_=3365b5d353fed&cid='+ str(cid) +'&act=childCid&restBool=false&ua=090%23qCQXNTXpXOVXPvi0XXXXXQkOIr77HU0hzDlo3e5rAGB2zoPlhnG5%2ByiUIr7ejGmnfjLiXXfbC7NK%2BvQXaKZdRva2jrbsXmLiXXfbC7NK24QXrpehnTFfoVM3eeu8iGliXX5dtRJXExTEMiwtXvXQsVW8ZxDiXXF2mp%2F9vQjBXvXzbc9P9lqAxgLAq6anQgwoWawOSBLiXajeGXriHnepAFhnPIj3Ho39h9kvXP73IzgeG%2FXXHYVmV6hnD6u3HoPsH4QXaPjPiq2d7D7bPvQXiHDow1Qg%2FrliXXfMhTQ%2F%2BvQXaKZWvPXMjrY0VBViXi2oemXumVM3oMavtXFjQ7%2Ba2T%3D%3D'
        headers = {
            "Accept": "*/*",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "en-US,en;q=0.9",
            "Connection": "keep-alive",
            "Cookie": "t=bbb6c14edb1c8d0e65996158979a8027; cna=I0NFEysmiWkCAQ6cyYxll3kh; tg=0; lgc=mycarting; tracknick=mycarting; mt=np=; v=0; cookie2=3253fc64aa52d22785ce4a3f5af722d2; _tb_token_=3365b5d353fed; dnk=mycarting; JSESSIONID=B94DBA871F64C03C095830C238F218A9; uc1=cookie14=UoTeNzVRvRsWPg%3D%3D&lng=zh_CN&cookie16=W5iHLLyFPlMGbLDwA%2BdvAGZqLg%3D%3D&existShop=false&cookie21=VFC%2FuZ9ajCbF99I65Qm9gQ%3D%3D&tag=8&cookie15=Vq8l%2BKCLz3%2F65A%3D%3D&pas=0; uc3=nk2=DkmnuVZqM291&id2=UU8OcO9lI45Clg%3D%3D&vt3=F8dBzr2Fa6i4%2Fc9OIz8%3D&lg2=UtASsssmOIJ0bQ%3D%3D; existShop=MTUyOTIxODk0Nw%3D%3D; sg=g43; csg=c98ee665; cookie1=VTrg90saGzeX9ovigm2mqTr%2Fu6w0vPNLI3IgZ0vhu9E%3D; unb=2761447894; skt=ed11d5f665cdb5a8; _cc_=W5iHLLyFfA%3D%3D; _l_g_=Ug%3D%3D; _nk_=mycarting; cookie17=UU8OcO9lI45Clg%3D%3D; apushdf188ec636caeab174aad0f3441beb09=%7B%22ts%22%3A1529221047670%2C%22parentId%22%3A1529220867224%7D; isg=BP7-BDyyKY-LxH2pSg1zscaeTx2Al8izHjVx9qgGccE8S58lEM-zyUVtxx-H87rR",
            "Host": "open.taobao.com",
            "Referer": "http://open.taobao.com/apitools/apiPropTools.htm?spm=0.0.0.0.mlPbbQ",
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36",
        }
        res = session.get(url, headers=headers)
        if (res.text != '' and res.text.find('{"itemcats_get_response":{"item_cats":{"item_cat":') > 0):
            cidStr = str(cid) + '|' + res.text + '|' + str(parentId)
            createCidSelect(cidStr)
        else:
            # Unexpected reply: print it so it can be inspected.
            print 'childCidList: ' + res.text
    except Exception as err:
        print "cid : " + str(cid)
        print "parentId : " + str(parentId)
        print err


def loadScript(cid):
    try:
        url = 'http://open.taobao.com/apitools/ajax_props.do?_tb_token_=3365b5d353fed&act=props&cid='+ str(cid) +'&restBool=false&ua=090%23qCQXc4XpXODXPXi0XXXXXQkOIr77HUR5flY73eg3AGB3fzQocPf5Aw1OIruEk0Rs24QXQczccXFFoVM3VUVTihPz9JPWqaLiXXB%2B0ydC24QXrpec2XdDoVM3VUpKijLiXXB%2B0ydC24QXrpec2vzsoVM3ebpQinDiXXF2mp%2F9vQjBXvXUM%2Ben9l8BvNoGriLiXajeGXrfHnepFehnPIj3Ho39h9kvXP73IzgeG%2FXXHYVmV6hnD6u3HoPsH4QXaOXTsEIXgwYSPvQXit2CqnY8PmLiXXB%2B0ydC3vQXiPR22amsXvXqzwE6XkFGOYnqq4QXius%2BSbQ%3D'
        headers = {
            "Accept": "*/*",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "en-US,en;q=0.9",
            "Connection": "keep-alive",
            "Cookie": "t=bbb6c14edb1c8d0e65996158979a8027; cna=I0NFEysmiWkCAQ6cyYxll3kh; tg=0; lgc=mycarting; tracknick=mycarting; mt=np=; v=0; cookie2=3253fc64aa52d22785ce4a3f5af722d2; _tb_token_=3365b5d353fed; dnk=mycarting; JSESSIONID=B94DBA871F64C03C095830C238F218A9; uc1=cookie14=UoTeNzVRvRsWPg%3D%3D&lng=zh_CN&cookie16=W5iHLLyFPlMGbLDwA%2BdvAGZqLg%3D%3D&existShop=false&cookie21=VFC%2FuZ9ajCbF99I65Qm9gQ%3D%3D&tag=8&cookie15=Vq8l%2BKCLz3%2F65A%3D%3D&pas=0; uc3=nk2=DkmnuVZqM291&id2=UU8OcO9lI45Clg%3D%3D&vt3=F8dBzr2Fa6i4%2Fc9OIz8%3D&lg2=UtASsssmOIJ0bQ%3D%3D; existShop=MTUyOTIxODk0Nw%3D%3D; sg=g43; csg=c98ee665; cookie1=VTrg90saGzeX9ovigm2mqTr%2Fu6w0vPNLI3IgZ0vhu9E%3D; unb=2761447894; skt=ed11d5f665cdb5a8; _cc_=W5iHLLyFfA%3D%3D; _l_g_=Ug%3D%3D; _nk_=mycarting; cookie17=UU8OcO9lI45Clg%3D%3D; isg=BJubr7m9RASWA7jy73KOygu5KvbF2KV447J0TY3ZbBqxbLlOFUA_wrmuAsRizAdq; apushdf188ec636caeab174aad0f3441beb09=%7B%22ts%22%3A1529221231139%2C%22parentId%22%3A1529220867224%7D",
            "Host": "open.taobao.com",
            "Referer": "http://open.taobao.com/apitools/apiPropTools.htm?spm=0.0.0.0.mlPbbQ",
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36",
        }
        res = session.get(url, headers=headers)
        # A successful reply holds two statements, each ending with ';',
        # so splitting on ';' yields 3 parts (the last one is the remainder).
        outArr = res.text.split(";")
        if len(outArr) == 3:
            if (outArr[0] != '' and outArr[0].find('var props={"itemprops_get_response":{"item_props":{"item_prop":') > 1):
                file1 = open('props.csv', 'a')
                file1.write('{0}\n'.format(outArr[0]))
                file1.close()
            else:
                print str(cid) + " : " + outArr[0]
            if (outArr[1] != '' and outArr[1].find('var propvalues={"itempropvalues_get_response":{"last_modified":') > 1):
                file2 = open('propvalues.csv', 'a')
                file2.write('{0}\n'.format(outArr[1]))
                file2.close()
            else:
                print str(cid) + " : " + outArr[1]
        else:
            print outArr
    except Exception as err:
        print cid
        print err


createCidSelect(cidStr)

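How do we run the crawler?

First, log in to the Taobao Open Platform: http://open.taobao.com/apitools/apiPropTools.htm?spm=0.0.0.0.mlPbbQ

To be safe, confirm once more that the following address returns data successfully. It must be sent from the Chrome browser in which you have just logged in: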

Request URL:http://open.taobao.com/apitools/ajax_props.do?_tb_token_=3365b5d353fed&cid=16&act=childCid&restBool=false&ua=090%23qCQXU4XvXpXXPXi0XXXXXQkOIr7EkU9szQ4bI%2B5rAGB3fovZcnGnGDkIOrgyTU5nq4QXi6W21dwWXvXBV7Vhihc3oVMCx5QuYk3G4k9sXvXq2CCyOmlXKotK%2BvQXaBVRozUEXudBmmLiXXfbC7NK24QXrpecvTFfoVM3ecgeijLiXXfbC7NKH4QXaOXTsEO4%2FBDGPvQXit2CqnY8PCLiXajeGXriHYVCOFhnDXa3HoUmh9kvXP73IzgeG%2FXXHYVmV6hnDXa3Ho64wvQXib%2Fc2viqUjp%2FXvXuCVHkRwiP3vQXi3e7PUasXvXq2C9LOMVXKym324QXQW3c6vF6oVM37RMEihPz9JPRq4QXi6W21dw%3D

Likewise, confirm that the following address returns data successfully, again sending it from the same logged-in Chrome browser:

Request URL:http://open.taobao.com/apitools/ajax_props.do?_tb_token_=3365b5d353fed&act=props&cid=50010850&restBool=false&ua=090%23qCQXt4XOX6TXPXi0XXXXXQkOIr7EkU7GDQToIeg3AGBvDrxmhYZhGQ86OzpMHU0nq4QXiP%2B0gzfsXvXqtKoQXP7PI0%2Fk%2BvQXaKZWwTXkjr56WYDiXXF2mp%2F9vQjBXvXzFZEaXQDA2dbv5I106NXHuL%2BkykLiXajeGXriHYVCOFhnP353HoUmh9kvXP73IzgeG%2FXXHYVmV6hnDXa3Ho64H4QXa67Mf8dkHqJwPvQXit2CqnY8PmLiXXfUZ8wl3vQXiXXXXXfsXvXqtKy3XPZPPrBC24QXQczccXFzoVM3aTd4ihPz9JPWqX%3D%3D

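A single request to the address above makes the Taobao server return 2 statements of JavaScript-formatted data (in other words, 2 very long lines of string data, each ending with a semicolon). The payload is large: the 2 lines total 7.5 MB. To check that there really are 2 lines, paste the page source into Notepad++ and it is obvious at a glance.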
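How such a two-statement reply can be taken apart is sketched below. Treat it as an illustration under assumptions: `split_statements` is a hypothetical helper, the sample payload is a tiny made-up stand-in for the real 7.5 MB reply (only the `var props=` and `var propvalues=` prefixes come from the crawler), and splitting on `;` mirrors the crawler, so it would break if any property value itself contained a semicolon. The code is written for Python 3.

```python
import json


def split_statements(text):
    """Parse 'var a={...};var b={...};' into {'a': dict, 'b': dict}."""
    result = {}
    for part in text.split(';'):
        part = part.strip()
        if not part.startswith('var '):
            continue  # skips the empty remainder after the last ';'
        name, _, payload = part[4:].partition('=')
        result[name.strip()] = json.loads(payload)
    return result


# Made-up miniature reply with the same two-statement layout.
SAMPLE = (
    '\r\nvar props={"itemprops_get_response":{"item_props":{"item_prop":[]}}};'
    '\r\nvar propvalues={"itempropvalues_get_response":'
    '{"last_modified":"2018-06-17 00:00:00"}};\r\n'
)

parsed = split_statements(SAMPLE)
print(sorted(parsed))
```

Parsing each statement into a dictionary is one way to feed the saved lines of props.csv and propvalues.csv into whatever formatting step comes later.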

If you want to look at the data for the 连衣裙 (dresses) category, it is available at:

https://pan.baidu.com/s/1geos-Z-tYcdSOCpXGxtMkw

Start the crawler:

[myth@contoso ~]$ cd /home/myth/taobao

[myth@contoso taobao]$ python taobao.py

Watch the crawled data as it arrives:

tail -f /home/myth/taobao/category-all.csv
tail -f /home/myth/taobao/props.csv

tail -f /home/myth/taobao/propvalues.csv

[myth@contoso ~]$ cd /home/myth/taobao
[myth@contoso taobao]$ ls
category-all.csv  category-top.csv  props.csv  propvalues.csv  taobao.py  venv
[myth@contoso taobao]$ ls -lht
total 94M
-rw-rw-r-- 1 myth myth  94M Jun 18 23:20 propvalues.csv
-rw-rw-r-- 1 myth myth 186K Jun 18 23:20 props.csv
-rw-rw-r-- 1 myth myth 1.7K Jun 18 23:18 category-all.csv
-rw-rw-r-- 1 myth myth 5.8K Jun 18 20:49 taobao.py
-rw-rw-r-- 1 myth myth  178 Jun 17 15:16 category-top.csv
drwxrwxr-x 5 myth myth   82 Jun 16 04:53 venv
[myth@contoso taobao]$

To continue with the next top-level category, we may want to save category-top.csv, props.csv and propvalues.csv as category-top1.csv, props1.csv and propvalues1.csv respectively, and then clear all the data crawled so far:

cat /dev/null > /home/myth/taobao/category-all.csv && cat /dev/null > /home/myth/taobao/props.csv && cat /dev/null > /home/myth/taobao/propvalues.csv
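The same archive-then-clear step can be scripted. The sketch below is an assumption-laden illustration, not part of the original workflow: `archive_and_reset` is a hypothetical helper, the numbering scheme simply follows the category-top1.csv naming used above, and the demo runs in a temporary directory rather than /home/myth/taobao. It is written for Python 3.

```python
import os
import shutil
import tempfile

ARCHIVED = ['category-top.csv', 'props.csv', 'propvalues.csv']
TRUNCATED = ['category-all.csv', 'props.csv', 'propvalues.csv']


def archive_and_reset(directory, run_number):
    """Copy the finished files to numbered names, then empty the output files."""
    for name in ARCHIVED:
        source = os.path.join(directory, name)
        if os.path.exists(source):
            base, ext = os.path.splitext(name)   # 'props', '.csv'
            target = os.path.join(directory, '%s%d%s' % (base, run_number, ext))
            shutil.copyfile(source, target)
    for name in TRUNCATED:
        # Opening in 'w' mode truncates the file (or creates it empty).
        open(os.path.join(directory, name), 'w').close()


# Demo in a throwaway directory.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, 'props.csv'), 'w') as handle:
    handle.write('var props={}\n')
archive_and_reset(demo_dir, 1)
print(sorted(os.listdir(demo_dir)))
```

Calling it with an increasing run number after each top-level category keeps every run's output separate.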

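We can manually replace the first top-level category record in category-top.csv, 女装/女士精品:

{"cid":16,"is_parent":true,"name":"女装/女士精品","parent_cid":0,"status":"normal"},

with the following top-level category record:

{"cid":120886001,"is_parent":true,"name":"公益","parent_cid":0,"status":"normal"},

and then crawl the second top-level category, 公益. Repeating this for each category in turn lets us crawl all of the product category data, key property data, sales property data and non-key property data on the Taobao site.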

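Why are props.csv and propvalues.csv not written in the same format as category-all.csv? Because we want the data returned when a link is submitted manually in the browser to have exactly the same format as the data the crawler saves. That way, once the missing data has been filled in, everything can be handed together to a formatting tool (which you write yourself) for further processing. For example, you could assemble the data into SQL INSERT statements for import into a database, or into Redis's import format.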
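As one example of such a formatting step, the lines of category-all.csv can be parsed and loaded into a database. This sketch makes several assumptions: the table name `category` and its column layout are hypothetical, an in-memory SQLite database stands in for whatever database you actually use, and parameterized inserts are used in place of hand-assembled SQL strings. The line layout (`status,parent_cid,name,is_parent,cid;`) is the one the crawler writes. The code is written for Python 3.

```python
import sqlite3


def parse_category_line(line):
    """Parse 'status,parent_cid,name,is_parent,cid;' into a typed tuple."""
    line = line.strip()
    if line.endswith(';'):
        line = line[:-1]
    # Split from both ends so a comma inside the name would not break parsing.
    status, parent_cid, rest = line.split(',', 2)
    name, is_parent, cid = rest.rsplit(',', 2)
    return status, int(parent_cid), name, int(is_parent), int(cid)


SAMPLE_LINES = [
    'normal,16,连衣裙,0,50010850;\n',
    'normal,16,裤子,1,1622;\n',
]

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE category ('
    'cid INTEGER PRIMARY KEY, parent_cid INTEGER, name TEXT, '
    'is_parent INTEGER, status TEXT)'
)
rows = [parse_category_line(line) for line in SAMPLE_LINES]
conn.executemany(
    'INSERT INTO category (status, parent_cid, name, is_parent, cid) '
    'VALUES (?, ?, ?, ?, ?)',
    rows,
)
print(conn.execute('SELECT cid, name FROM category ORDER BY cid').fetchall())
```

Placeholders let the database driver handle quoting, which matters here because category names contain slashes and other punctuation.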

Launch the development tool:

/home/myth/pycharm-community-2018.1.4/bin/pycharm.sh

Clear all the data crawled so far:

cat /dev/null > /home/myth/taobao/category-all.csv && cat /dev/null > /home/myth/taobao/props.csv && cat /dev/null > /home/myth/taobao/propvalues.csv

View the top-level category currently being crawled:

cat /home/myth/taobao/category-top.csv
