This is an introduction to an Instagram crawler.

GitHub source reference (code and scraped data): https://github.com/hilqiqi0/crawler/tree/master/simple/instagram

Each scraped record is saved in the format: { image URL, comment count, like count, post text }
e.g.: {
        "img_url": "https://scontent-sin6-2.cdninstagram.com/vp/0e345bfd870f2fb489f091ed5507397f/5C1A8CB6/t51.2885-15/e35/40949123_1104283529724860_6046749716819964824_n.jpg",
        "comment_count": 12932,
        "like_count": 1321753,
        "text": "Featured photo by @maomay__\\nWeekend Hashtag Project: #WHPperspective\\nThis weekend, the goal is to take photos and videos from a different point of view, as in this featured photo by Mao May (@maomay__). Here are some tips to get you started:\\nCapture a familiar subject or scene from an unexpected angle. Get up close and let a face cover the entire frame, or make a puppy look large by shooting from ground-level as she stares down. Find a high vantage point to show the wider context of a festival scene or bustling market.\\nUse geometry to your advantage. Look for graphic lines — in bridges or telephone wires — that converge to a vanishing point in your composition. Find a new way to capture patterns in everyday places, like the wheels of bicycles lined up in a rack, or symmetrical bricks in an unruly garden.\\nPlay an eye trick. Defy gravity with simple editing, like rotating the frame. Recruit a friend to make a well-timed leap, that, when rotated, looks like they’re flying through air. Or turn a dandelion into a human-size parasol by playing with scale and distance.\\n\\nPROJECT RULES: Please add the #WHPperspective hashtag only to photos and videos shared over this weekend and only submit your own visuals to the project. If you include music in your video submissions, please only use music to which you own the rights. Any tagged photo or video shared over the weekend is eligible to be featured next week."
    }
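A record in this shape can be serialized with the standard `json` module; the values below are abbreviated placeholders, not real data:

```python
import json

# abbreviated placeholder record, for illustration only
sample = {
    "img_url": "https://scontent-sin6-2.cdninstagram.com/example.jpg",
    "comment_count": 12932,
    "like_count": 1321753,
    "text": "Featured photo by @maomay__ ...",
}

# ensure_ascii=False keeps non-ASCII caption text human-readable
line = json.dumps(sample, ensure_ascii=False)
```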

Summary of technical difficulties: 1. A proxy/VPN is needed to reach Instagram; 2. Before around August–September, Instagram had no anti-scraping measures; since then, AJAX requests require an anti-scraping signature.

Anti-scraping algorithm (requests must now carry an 'X-Instagram-GIS' header):
        1. Concatenate rhx_gis and queryVariables with a colon
        2. Take the MD5 hash of the result
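The two steps above can be sketched as follows; the `rhx_gis` and cursor values here are made-up placeholders, not real tokens:

```python
import hashlib
import json

def sign_request(rhx_gis, query_variables):
    # X-Instagram-GIS header value: md5 of "rhx_gis:queryVariables"
    return hashlib.md5((rhx_gis + ":" + query_variables).encode("utf-8")).hexdigest()

# placeholder values for illustration only
rhx_gis = "8ab5d27d52dc15f0959b8d9d074a5e34"
query_variables = json.dumps(
    {"id": "25025320", "first": 12, "after": "XXXX"}, separators=(",", ":")
)

gis_token = sign_request(rhx_gis, query_variables)
headers = {"X-Instagram-GIS": gis_token}  # attach to each AJAX request
```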

Code notes and modifications: 0. By default 120 posts are downloaded; to get more, remove the count check or raise the threshold
       1. The code uses Lantern as the proxy, on port 52212; if you use a different tool, change the proxy port number
       2. The code scrapes the profile of the user "instagram" on https://www.instagram.com; to scrape another blogger, change the username
       3. This code is only a test; it has not yet been modularized or packaged

About the workflow and analysis: 1. see the references at the end of the article; 2. read the code directly.

import re
import json
import time
import random
import hashlib
import requests
from pyquery import PyQuery as pq

url_base = 'https://www.instagram.com/instagram/'
uri = 'https://www.instagram.com/graphql/query/?query_hash=a5164aed103f24b03e7b7747a2d94e3c&variables=%7B%22id%22%3A%22{user_id}%22%2C%22first%22%3A12%2C%22after%22%3A%22{cursor}%22%7D'

headers = {
    'Connection': 'keep-alive',
    'Host': 'www.instagram.com',
    'Referer': 'https://www.instagram.com/instagram/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'
}

# Lantern local proxy; change the port if you use a different tool
proxy = {'http': 'http://127.0.0.1:52212', 'https': 'http://127.0.0.1:52212'}


def hashStr(strInfo):
    h = hashlib.md5()
    h.update(strInfo.encode('utf-8'))
    return h.hexdigest()


def get_html(url):
    try:
        response = requests.get(url, headers=headers, proxies=proxy)
        if response.status_code == 200:
            return response.text
        print('Failed to fetch page source, status code:', response.status_code)
    except Exception as e:
        print(e)
    return None


def get_json(headers, url):
    try:
        response = requests.get(url, headers=headers, proxies=proxy, timeout=10)
        if response.status_code == 200:
            return response.json()
        print('Failed to fetch JSON, status code:', response.status_code)
    except Exception as e:
        print(e)
        # back off for 60-100 seconds, then retry
        time.sleep(60 + float(random.randint(1, 4000)) / 100)
        return get_json(headers, url)


def get_samples(html):
    samples = []
    user_id = re.findall('"profilePage_([0-9]+)"', html, re.S)[0]
    GIS_rhx_gis = re.findall('"rhx_gis":"([0-9a-z]+)"', html, re.S)[0]
    print('user_id:' + user_id)
    print(GIS_rhx_gis)
    doc = pq(html)
    for item in doc('script[type="text/javascript"]').items():
        if not item.text().strip().startswith('window._sharedData'):
            continue
        # window._sharedData holds JSON; strip the "window._sharedData = " prefix and trailing ';'
        js_data = json.loads(item.text()[21:-1])
        media = js_data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']
        # the 12 posts embedded in the initial page
        edges = media['edges']
        page_info = media['page_info']
        # cursor pointing at the next page
        cursor = page_info['end_cursor']
        # whether another page exists
        flag = page_info['has_next_page']
        for edge in edges:
            # skip videos (is_video is a bool, not the string "true")
            if edge['node']['is_video']:
                continue
            time.sleep(1)
            sample = {}
            if edge['node']['display_url']:
                sample['img_url'] = edge['node']['display_url']
                sample['comment_count'] = edge['node']['edge_media_to_comment']['count']
                sample['like_count'] = edge['node']['edge_liked_by']['count']
            if edge['node']['shortcode']:
                # per-post JSON: https://www.instagram.com/p/{shortcode}/?__a=1
                textUrl = 'https://www.instagram.com/p/' + edge['node']['shortcode'] + '/?__a=1'
                textResponse = get_json(headers, textUrl)
                sample['text'] = textResponse['graphql']['shortcode_media']['edge_media_to_caption']['edges'][0]['node']['text']
            print(sample)
            samples.append(sample)
    print(cursor, flag)
    # fetch the remaining pages via AJAX
    while flag:
        url = uri.format(user_id=user_id, cursor=cursor)
        queryVariables = '{"id":"' + user_id + '","first":12,"after":"' + cursor + '"}'
        # anti-scraping signature: md5(rhx_gis + ":" + queryVariables)
        headers['X-Instagram-GIS'] = hashStr(GIS_rhx_gis + ':' + queryVariables)
        js_data = get_json(headers, url)
        media = js_data['data']['user']['edge_owner_to_timeline_media']
        cursor = media['page_info']['end_cursor']
        flag = media['page_info']['has_next_page']
        for info in media['edges']:
            if info['node']['is_video']:
                continue
            sample = {}
            sample['img_url'] = info['node']['display_url']
            sample['comment_count'] = info['node']['edge_media_to_comment']['count']
            sample['like_count'] = info['node']['edge_media_preview_like']['count']
            if info['node']['shortcode']:
                time.sleep(1)
                textUrl = 'https://www.instagram.com/p/' + info['node']['shortcode'] + '/?__a=1'
                textResponse = get_json(headers, textUrl)
                sample['text'] = textResponse['graphql']['shortcode_media']['edge_media_to_caption']['edges'][0]['node']['text']
            print(sample)
            samples.append(sample)
        print(cursor, flag)
        # stop after about 120 samples; remove or raise this to download more
        if len(samples) > 120:
            return samples
    return samples


def main():
    html = get_html(url_base)
    samples = get_samples(html)
    with open('./samples.txt', 'a', encoding='utf-8') as f:
        f.write(str(samples))


if __name__ == '__main__':
    start = time.time()
    main()
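Note that main() writes str(samples), a Python literal rather than JSON, so json.loads cannot read the file back. A minimal sketch for loading it, assuming the file holds a single list (the script opens in append mode, so running it twice would concatenate two literals and break this):

```python
import ast
import os
import tempfile

def load_samples(path):
    # the file contains the str() of a Python list of dicts; parse it safely
    with open(path, encoding="utf-8") as f:
        return ast.literal_eval(f.read())

# demo with a throwaway file
demo = [{"img_url": "https://example.com/a.jpg", "comment_count": 1,
         "like_count": 2, "text": "hi"}]
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "w", encoding="utf-8") as f:
    f.write(str(demo))
loaded = load_samples(path)
os.remove(path)
```

Writing one json.dumps line per record instead of str(samples) would make the file easier to append to and parse incrementally.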

Reference 1: https://www.jianshu.com/p/985c2b4e8f6c

Reference 2: https://blog.csdn.net/geng333abc/article/details/79403395
