This is an introduction to an Instagram crawler.

GitHub source reference (code and scraped data): https://github.com/hilqiqi0/crawler/tree/master/simple/instagram

Each scraped record is saved in the format: { image URL, comment count, like count, post text }
e.g.: {
        "img_url": "https://scontent-sin6-2.cdninstagram.com/vp/0e345bfd870f2fb489f091ed5507397f/5C1A8CB6/t51.2885-15/e35/40949123_1104283529724860_6046749716819964824_n.jpg",
        "comment_count": 12932,
        "like_count": 1321753,
        "text": "Featured photo by @maomay__\\nWeekend Hashtag Project: #WHPperspective\\nThis weekend, the goal is to take photos and videos from a different point of view, as in this featured photo by Mao May (@maomay__). Here are some tips to get you started:\\nCapture a familiar subject or scene from an unexpected angle. Get up close and let a face cover the entire frame, or make a puppy look large by shooting from ground-level as she stares down. Find a high vantage point to show the wider context of a festival scene or bustling market.\\nUse geometry to your advantage. Look for graphic lines — in bridges or telephone wires — that converge to a vanishing point in your composition. Find a new way to capture patterns in everyday places, like the wheels of bicycles lined up in a rack, or symmetrical bricks in an unruly garden.\\nPlay an eye trick. Defy gravity with simple editing, like rotating the frame. Recruit a friend to make a well-timed leap, that, when rotated, looks like they’re flying through air. Or turn a dandelion into a human-size parasol by playing with scale and distance.\\n\\nPROJECT RULES: Please add the #WHPperspective hashtag only to photos and videos shared over this weekend and only submit your own visuals to the project. If you include music in your video submissions, please only use music to which you own the rights. Any tagged photo or video shared over the weekend is eligible to be featured next week."
    }
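A record in this shape can be serialized with the standard `json` module; the values below are abbreviated placeholders, not real data:

```python
import json

# abbreviated placeholder record, for illustration only
sample = {
    "img_url": "https://scontent-sin6-2.cdninstagram.com/example.jpg",
    "comment_count": 12932,
    "like_count": 1321753,
    "text": "Featured photo by @maomay__ ...",
}

# ensure_ascii=False keeps non-ASCII caption text human-readable
line = json.dumps(sample, ensure_ascii=False)
```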

Summary of technical difficulties: 1. A proxy/VPN is needed to reach Instagram; 2. Before around August–September, Instagram had no anti-scraping measures; since then, AJAX requests require an anti-scraping signature.

Anti-scraping algorithm (requests must now carry an 'X-Instagram-GIS' header):
        1. Concatenate rhx_gis and queryVariables with a colon
        2. Take the MD5 hash of the result
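The two steps above can be sketched as follows; the `rhx_gis` and cursor values here are made-up placeholders, not real tokens:

```python
import hashlib
import json

def sign_request(rhx_gis, query_variables):
    # X-Instagram-GIS header value: md5 of "rhx_gis:queryVariables"
    return hashlib.md5((rhx_gis + ":" + query_variables).encode("utf-8")).hexdigest()

# placeholder values for illustration only
rhx_gis = "8ab5d27d52dc15f0959b8d9d074a5e34"
query_variables = json.dumps(
    {"id": "25025320", "first": 12, "after": "XXXX"}, separators=(",", ":")
)

gis_token = sign_request(rhx_gis, query_variables)
headers = {"X-Instagram-GIS": gis_token}  # attach to each AJAX request
```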

Code notes and modifications: 0. By default 120 posts are downloaded; to get more, remove the count check or raise the threshold
       1. The code uses Lantern as the proxy, on port 52212; if you use a different tool, change the proxy port number
       2. The code scrapes the profile of the user "instagram" on https://www.instagram.com; to scrape another blogger, change the username
       3. This code is only a test; it has not yet been modularized or packaged

About the workflow and analysis: 1. see the references at the end of the article; 2. read the code directly.

import re
import json
import time
import random
import hashlib
import requests
from pyquery import PyQuery as pq

url_base = 'https://www.instagram.com/instagram/'
uri = 'https://www.instagram.com/graphql/query/?query_hash=a5164aed103f24b03e7b7747a2d94e3c&variables=%7B%22id%22%3A%22{user_id}%22%2C%22first%22%3A12%2C%22after%22%3A%22{cursor}%22%7D'

headers = {
    'Connection': 'keep-alive',
    'Host': 'www.instagram.com',
    'Referer': 'https://www.instagram.com/instagram/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'
}

# Lantern local proxy; change the port if you use a different tool
proxy = {'http': 'http://127.0.0.1:52212', 'https': 'http://127.0.0.1:52212'}


def hashStr(strInfo):
    h = hashlib.md5()
    h.update(strInfo.encode('utf-8'))
    return h.hexdigest()


def get_html(url):
    try:
        response = requests.get(url, headers=headers, proxies=proxy)
        if response.status_code == 200:
            return response.text
        print('Failed to fetch page source, status code:', response.status_code)
    except Exception as e:
        print(e)
    return None


def get_json(headers, url):
    try:
        response = requests.get(url, headers=headers, proxies=proxy, timeout=10)
        if response.status_code == 200:
            return response.json()
        print('Failed to fetch JSON, status code:', response.status_code)
    except Exception as e:
        print(e)
        # back off for 60-100 seconds, then retry
        time.sleep(60 + float(random.randint(1, 4000)) / 100)
        return get_json(headers, url)


def get_samples(html):
    samples = []
    user_id = re.findall('"profilePage_([0-9]+)"', html, re.S)[0]
    GIS_rhx_gis = re.findall('"rhx_gis":"([0-9a-z]+)"', html, re.S)[0]
    print('user_id:' + user_id)
    print(GIS_rhx_gis)
    doc = pq(html)
    for item in doc('script[type="text/javascript"]').items():
        if not item.text().strip().startswith('window._sharedData'):
            continue
        # window._sharedData holds JSON; strip the "window._sharedData = " prefix and trailing ';'
        js_data = json.loads(item.text()[21:-1])
        media = js_data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']
        # the 12 posts embedded in the initial page
        edges = media['edges']
        page_info = media['page_info']
        # cursor pointing at the next page
        cursor = page_info['end_cursor']
        # whether another page exists
        flag = page_info['has_next_page']
        for edge in edges:
            # skip videos (is_video is a bool, not the string "true")
            if edge['node']['is_video']:
                continue
            time.sleep(1)
            sample = {}
            if edge['node']['display_url']:
                sample['img_url'] = edge['node']['display_url']
                sample['comment_count'] = edge['node']['edge_media_to_comment']['count']
                sample['like_count'] = edge['node']['edge_liked_by']['count']
            if edge['node']['shortcode']:
                # per-post JSON: https://www.instagram.com/p/{shortcode}/?__a=1
                textUrl = 'https://www.instagram.com/p/' + edge['node']['shortcode'] + '/?__a=1'
                textResponse = get_json(headers, textUrl)
                sample['text'] = textResponse['graphql']['shortcode_media']['edge_media_to_caption']['edges'][0]['node']['text']
            print(sample)
            samples.append(sample)
    print(cursor, flag)
    # fetch the remaining pages via AJAX
    while flag:
        url = uri.format(user_id=user_id, cursor=cursor)
        queryVariables = '{"id":"' + user_id + '","first":12,"after":"' + cursor + '"}'
        # anti-scraping signature: md5(rhx_gis + ":" + queryVariables)
        headers['X-Instagram-GIS'] = hashStr(GIS_rhx_gis + ':' + queryVariables)
        js_data = get_json(headers, url)
        media = js_data['data']['user']['edge_owner_to_timeline_media']
        cursor = media['page_info']['end_cursor']
        flag = media['page_info']['has_next_page']
        for info in media['edges']:
            if info['node']['is_video']:
                continue
            sample = {}
            sample['img_url'] = info['node']['display_url']
            sample['comment_count'] = info['node']['edge_media_to_comment']['count']
            sample['like_count'] = info['node']['edge_media_preview_like']['count']
            if info['node']['shortcode']:
                time.sleep(1)
                textUrl = 'https://www.instagram.com/p/' + info['node']['shortcode'] + '/?__a=1'
                textResponse = get_json(headers, textUrl)
                sample['text'] = textResponse['graphql']['shortcode_media']['edge_media_to_caption']['edges'][0]['node']['text']
            print(sample)
            samples.append(sample)
        print(cursor, flag)
        # stop after about 120 samples; remove or raise this to download more
        if len(samples) > 120:
            return samples
    return samples


def main():
    html = get_html(url_base)
    samples = get_samples(html)
    with open('./samples.txt', 'a', encoding='utf-8') as f:
        f.write(str(samples))


if __name__ == '__main__':
    start = time.time()
    main()
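Note that main() writes str(samples), a Python literal rather than JSON, so json.loads cannot read the file back. A minimal sketch for loading it, assuming the file holds a single list (the script opens in append mode, so running it twice would concatenate two literals and break this):

```python
import ast
import os
import tempfile

def load_samples(path):
    # the file contains the str() of a Python list of dicts; parse it safely
    with open(path, encoding="utf-8") as f:
        return ast.literal_eval(f.read())

# demo with a throwaway file
demo = [{"img_url": "https://example.com/a.jpg", "comment_count": 1,
         "like_count": 2, "text": "hi"}]
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "w", encoding="utf-8") as f:
    f.write(str(demo))
loaded = load_samples(path)
os.remove(path)
```

Writing one json.dumps line per record instead of str(samples) would make the file easier to append to and parse incrementally.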

Reference 1: https://www.jianshu.com/p/985c2b4e8f6c

Reference 2: https://blog.csdn.net/geng333abc/article/details/79403395
