Resource Recommendation

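A while ago I needed to scrape Facebook for a project, but because of the pandemic, permission applications for the official Graph API for individual developers were suspended. At a loss, I turned to GitHub to look for resources. I tried quite a few packages, and the ones I found most useful are listed below. They are shared for personal study and exchange only, not for commercial use.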

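An online tool for scraping basic information from Facebook pages (publicly listed address, phone number, email, opening hours, and so on). It is quick and convenient, and a free trial is available. https://phantombuster.com/automations/facebook/8369/facebook-profile-scraper
From GitHub. In my tests it was quite capable at scraping posts, videos, and other content from personal profiles. It requires valid credentials (registered email and password). https://github.com/harismuneer/Ultimate-Facebook-Scraper
From GitHub. It scrapes all posts from a public page, along with their timestamps, the numbers of shares, likes, and comments, post IDs, and so on, and it needs no credentials. It is one of the few working codebases I found that can scrape public pages. Unfortunately, it cannot retrieve the actual text of the comments. https://github.com/kevinzg/facebook-scraper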
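To make the third option more concrete, here is a minimal sketch of how its output could be handled. It assumes that the library yields one dictionary per post and that the keys match the fields listed above (`post_id`, `time`, `likes`, `comments`, `shares`). `summarize_post` and `collect` are hypothetical helpers written for this illustration; they are not part of the library.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List

# Fields named in the description above; the actual keys may differ
# between library versions (assumption).
FIELDS = ["post_id", "time", "likes", "comments", "shares"]


def summarize_post(post: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the fields of interest; missing keys become None."""
    return {key: post.get(key) for key in FIELDS}


def collect(posts: Iterable[Dict[str, Any]], limit: int = 10) -> List[Dict[str, Any]]:
    """Summarize at most `limit` posts from any iterable of post dicts."""
    return [summarize_post(post) for post in islice(posts, limit)]


# With the package installed, real posts could be fed in like this
# ("nintendo" is only a placeholder account name):
#   from facebook_scraper import get_posts
#   rows = collect(get_posts("nintendo", pages=2))
```

Because `collect` accepts any iterable, it can be tried on hand-written dictionaries before any network request is made.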

Practical Usage

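In the end I chose the third option above to scrape all posts from the Facebook public pages of the target companies and export the data to xlsx: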

import re
import pandas as pd
from Facebook_Scraper.facebook_scraper import get_posts
from Facebook_Scraper.facebook_scraper import fetch_share_and_reactions


def facebook_scrap():
    # The incorporation and dissolution dates are stored as timestamps;
    # read them as strings so that only the date part can be kept.
    data = pd.read_excel('../data/dataset.xlsx',
                         converters={'Date of Establishment_legal': str,
                                     'Dissolved_legal': str})
    # Column 'Date of Establishment_legal' contains the company's incorporation date,
    # column 'Dissolved_legal' contains the company's dissolution date, and column
    # 'Facebook' contains the link to the company's Facebook public page, if any.
    # We only keep companies with Facebook links (copy to avoid chained-assignment warnings).
    data = data[data['Facebook'].notna()].copy()
    data['Date of Establishment_legal'] = data['Date of Establishment_legal'].apply(lambda x: x[0:10])
    data['Dissolved_legal'] = data['Dissolved_legal'].apply(
        lambda x: x[0:10] if isinstance(x, str) else x)

    # The scraper takes an account name as input, so extract it from each link.
    links = data['Facebook'].to_list()
    pattern = re.compile(r'https://www\.facebook\.com/([a-zA-Z0-9.]+)')
    account = []
    for link in links:
        match = pattern.search(str(link))
        # None marks a link from which no account name could be extracted.
        account.append(match.group(1) if match else None)

    abbreviation = data['Company name_abbreviation'].to_list()
    incorporation_date = data['Date of Establishment_legal'].to_list()
    dissolution_date = data['Dissolved_legal'].to_list()

    # Start scraping posts.
    rows = []
    for i in range(len(account)):
        if account[i] is None:
            continue  # skip companies whose link did not match the pattern
        cnt = 0
        # There are about 2 posts per page, so pages=4000 should be enough to
        # scrape all the posts published since the account was created.
        for post in get_posts(account=account[i], pages=4000):
            cnt += 1
            more_info_post = fetch_share_and_reactions(post)
            more_info_post['Company name_abbreviation'] = abbreviation[i]
            more_info_post['account'] = account[i]
            more_info_post['incorporation_date'] = incorporation_date[i]
            more_info_post['dissolution_date'] = dissolution_date[i]
            rows.append(more_info_post)
        print(account[i], cnt, 'posts are scraped.')

    useful_columns = ['post_id', 'text', 'shared_text', 'time', 'image', 'likes',
                      'comments', 'shares', 'post_url', 'link',
                      'Company name_abbreviation', 'account',
                      'incorporation_date', 'dissolution_date']
    # Build the table once at the end; DataFrame.append was removed in pandas 2.0.
    posts_data = pd.DataFrame(rows, columns=useful_columns)
    posts_data.to_excel('../data/all_facebook_posts.xlsx', index=False)
    return posts_data
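The regular expression in the function above only matches links that start with `https://www.facebook.com/`. As a hedged alternative, the sketch below pulls the account name out with `urllib.parse` from the standard library, which also tolerates other subdomains, trailing slashes, and query strings. `extract_account` is a hypothetical helper, not part of the original script or of the scraping library.

```python
from typing import Optional
from urllib.parse import urlparse


def extract_account(link: str) -> Optional[str]:
    """Return the first path segment of a facebook.com URL, or None."""
    parsed = urlparse(link.strip())
    host = parsed.netloc.lower()
    # Accept facebook.com and any of its subdomains (www., m., ...).
    if host != "facebook.com" and not host.endswith(".facebook.com"):
        return None
    segments = [segment for segment in parsed.path.split("/") if segment]
    return segments[0] if segments else None
```

Links without a scheme (for example `www.facebook.com/name`) return `None` here, and links of the form `/pages/<name>/<id>` yield `pages` rather than the page name, so such cases would still need separate handling.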

