[459] Scraping WeChat Official Accounts (Part 2)

Challenges

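Getting all of an official account's historical articles (from what source?)
Getting each article's read count and like count (viewing an article on a computer shows only the content, with no read count, like count, or comments)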

Solving challenge 1

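Sogou WeChat search can search official account articles, but it appears to show only an account's ten most recent articles.
Sogou WeChat search: https://www.sogou.com/wapindex/ or
https://weixin.sogou.com/
Capture the traffic with a packet-capture tool (Fiddler). This is fairly costly, and it appears to capture only original articles, so it does not meet my needs.
Crawl through a personal WeChat subscription account, a surprisingly effective trick.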
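For the first option, here is a minimal sketch of how the Sogou account-search URL could be assembled. The query parameters are taken from the URL used in the full Sogou script later in this post; the helper name `build_sogou_search_url` is my own, and Sogou may change these parameters at any time.

```python
from urllib.parse import urlencode

SOGOU_WEIXIN_SEARCH = 'https://weixin.sogou.com/weixin'


def build_sogou_search_url(keyword):
    """Return a Sogou WeChat search URL for the given keyword."""
    params = {
        'type': '1',       # the script in this post uses type=1 to look up an account
        'query': keyword,  # urlencode percent-encodes non-ASCII names
        'ie': 'utf8',
        's_from': 'input',
    }
    return SOGOU_WEIXIN_SEARCH + '?' + urlencode(params)


print(build_sogou_search_url('spider'))
# https://weixin.sogou.com/weixin?type=1&query=spider&ie=utf8&s_from=input
```

Using `urlencode` instead of string formatting keeps Chinese account names safely encoded.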

Steps

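You need a personal WeChat subscription account. Log in or register on the WeChat Official Accounts Platform: https://mp.weixin.qq.com/

If you have not registered yet, you can sign up with your own WeChat ID; the process is very simple.

After logging in, click "管理" (Manage) > "素材管理" (Material Management) in the left menu, then click "新建图文素材" (New article material) on the right.

A new tab opens. Find "超链接" (Hyperlink) in the toolbar at the top and click it.

A small dialog pops up. Choose "查找文章" (Find article) and enter the official account you want to look up; the account "宅基地" is used as the example here.

After you click the account, all of its historical articles are listed.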

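Searching by official account name

The search returns every matching official account. I only take the first result for testing; if you are interested, you can fetch all of them.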
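As a small sketch of "take the first result": the full script in this post reads the `list` key of the search response and then the `fakeid` of its first entry. The helper name `first_fakeid` and the sample payload below are hypothetical; a real response carries more fields than shown.

```python
def first_fakeid(search_json):
    """Return the fakeid of the first account in a search response, or None."""
    accounts = search_json.get('list') or []
    if not accounts:
        return None  # no account matched the query
    return accounts[0].get('fakeid')


# Hypothetical, trimmed-down payload used only for illustration.
sample = {'list': [{'fakeid': 'FAKEID_A'}, {'fakeid': 'FAKEID_B'}]}
print(first_fakeid(sample))        # FAKEID_A
print(first_fakeid({'list': []}))  # None
```

Returning `None` for an empty result avoids the `IndexError` that indexing `[0]` directly would raise when nothing matches.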

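Get the fakeid of the official account you want to scrape
With the account selected, get the address of the article list interface
Page through the article list and fetch the content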
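The paging step can be sketched as follows, assuming (as the full script in this post does) that the article list interface returns five items per page and that the `begin` parameter advances by five. The helper names `page_offsets` and `list_params` are my own; the parameter keys are copied from the script.

```python
import random


def page_offsets(total, page_size=5):
    """Return the `begin` values needed to walk through `total` articles."""
    return list(range(0, total, page_size))


def list_params(token, fakeid, begin, page_size=5):
    """Build the query parameters for one page of the article list."""
    return {
        'token': token,
        'lang': 'zh_CN',
        'f': 'json',
        'ajax': '1',
        'random': random.random(),
        'action': 'list_ex',
        'begin': str(begin),      # offset of the first article on this page
        'count': str(page_size),
        'query': '',
        'fakeid': fakeid,
        'type': '9',
    }


print(page_offsets(12))  # [0, 5, 10]
```

Unlike the `num + 1` loop in the full script, `range` does not request an extra empty page when the article count is an exact multiple of five.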

For the finer details, consult further material; this article by 崔 (Cui) is a useful reference: https://mp.weixin.qq.com/s?__biz=MzI5NDY1MjQzNA==&mid=2247483970&idx=1&sn=cde40462d2346ded8e8c11ab4442bbab&chksm=ec5edd3fdb2954299e5b4736b3729014d4853e50e643de06640ba3af370753db069667511db1&mpshare=1&scene=1&srcid=0612suzxGJXTmoak9i81rRSZ&pass_ticket=YsJz0pUrK8Yj6XuoyHfGbfjFAgRZ9wHQMTLCnfaYLlQGaOXangzh2LWgrfB8lf76#rd

Full code

Scraping via the personal official account interface

# -*- coding: utf-8 -*-
import json
import random
import re
import time

import requests
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By

# Official account login name
user = "your official account login name"
# Official account password
password = "your official account password"
# Official accounts to scrape
gzlist = ['name of the official account to scrape']


# Log in to the official accounts platform, collect the cookies issued after
# login, and save them to a local text file
def weChat_login():
    # Dictionary that holds the cookie name/value pairs
    post = {}

    # Start Chrome through webdriver. The window stays visible (not headless)
    # because the QR code has to be scanned by hand further down.
    print("Starting the browser and opening the official account login page")
    driver = Chrome()  # expects chromedriver to be on PATH
    driver.get('https://mp.weixin.qq.com/')
    time.sleep(5)

    print("Entering the login name and password...")
    # Clear each field, then fill it in
    account_input = driver.find_element(By.XPATH, "./*//input[@id='account']")
    account_input.clear()
    account_input.send_keys(user)
    password_input = driver.find_element(By.XPATH, "./*//input[@id='pwd']")
    password_input.clear()
    password_input.send_keys(password)

    # "Remember account" has to be ticked by hand once the password is filled in
    print("Please tick 'remember account' on the login page")
    time.sleep(10)
    # Click the login button
    driver.find_element(By.XPATH, "./*//a[@id='loginBt']").click()

    # Scan the QR code with your phone
    print("Please scan the QR code with your phone to log in")
    time.sleep(20)
    print("Logged in")

    # Reload the platform home page; once logged in it shows the dashboard,
    # and the cookies are read from this session
    driver.get('https://mp.weixin.qq.com/')
    cookie_items = driver.get_cookies()
    driver.quit()

    # get_cookies() returns a list of dicts; flatten it to name -> value and
    # store it as JSON in a local file named cookie.txt
    for cookie_item in cookie_items:
        post[cookie_item['name']] = cookie_item['value']
    with open('cookie.txt', 'w+', encoding='utf-8') as f:
        f.write(json.dumps(post))
    print("Cookies saved locally")


# Scrape an official account's article list and save it to a local text file
def get_content(query):
    # query is the name of the official account to scrape
    url = 'https://mp.weixin.qq.com'
    header = {
        "HOST": "mp.weixin.qq.com",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0"
    }

    # Load the cookies saved by weChat_login()
    with open('cookie.txt', 'r', encoding='utf-8') as f:
        cookies = json.loads(f.read())

    # After login the home page redirects to a URL such as
    # https://mp.weixin.qq.com/cgi-bin/home?t=home/index&lang=zh_CN&token=1849751598
    # and the token is taken from that URL
    response = requests.get(url=url, cookies=cookies)
    token = re.findall(r'token=(\d+)', str(response.url))[0]

    # Account search interface. Three values vary: the token, a random number,
    # and the account name being searched for
    search_url = 'https://mp.weixin.qq.com/cgi-bin/searchbiz?'
    query_id = {
        'action': 'search_biz',
        'token': token,
        'lang': 'zh_CN',
        'f': 'json',
        'ajax': '1',
        'random': random.random(),
        'query': query,
        'begin': '0',
        'count': '5'
    }
    search_response = requests.get(search_url, cookies=cookies, headers=header, params=query_id)
    # Take the first account in the search results
    lists = search_response.json().get('list')[0]
    # The fakeid is required later by the article list interface
    fakeid = lists.get('fakeid')

    # Article list interface. It needs the token of the logged-in account, the
    # fakeid of the account being scraped, and a random number
    appmsg_url = 'https://mp.weixin.qq.com/cgi-bin/appmsg?'
    query_id_data = {
        'token': token,
        'lang': 'zh_CN',
        'f': 'json',
        'ajax': '1',
        'random': random.random(),
        'action': 'list_ex',
        'begin': '0',  # changes per page, increasing by 5 each time
        'count': '5',
        'query': '',
        'fakeid': fakeid,
        'type': '9'
    }
    appmsg_response = requests.get(appmsg_url, cookies=cookies, headers=header, params=query_id_data)
    # Total number of articles
    max_num = appmsg_response.json().get('app_msg_cnt')
    # Each page holds at most 5 articles, so work out how many pages to request
    num = int(int(max_num) / 5)
    # begin starts at 0 and grows by 5 for every page
    begin = 0
    fileName = query + '.txt'
    while num + 1 > 0:
        query_id_data['random'] = random.random()
        query_id_data['begin'] = str(begin)
        print('Fetching page, begin =', begin)

        # Write the title and link of every article on this page to the local file
        query_fakeid_response = requests.get(appmsg_url, cookies=cookies, headers=header, params=query_id_data)
        fakeid_list = query_fakeid_response.json().get('app_msg_list') or []
        for item in fakeid_list:
            content_link = item.get('link')
            content_title = item.get('title')
            with open(fileName, 'a', encoding='utf-8') as fh:
                fh.write(content_title + ":\n" + content_link + "\n")
        num -= 1
        begin += 5
        time.sleep(2)


if __name__ == '__main__':
    try:
        # Log in and save the cookies locally
        weChat_login()
        # Then scrape each account through the platform's article interface
        for query in gzlist:
            print("Start scraping official account: " + query)
            get_content(query)
            print("Scraping finished")
    except Exception as e:
        print(str(e))

Scraping via Sogou WeChat search

# -*- coding:utf-8 -*-
import os
import re
import time
from urllib.parse import quote

import requests
from pyquery import PyQuery as pq
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options

# The three lines below prevent encoding errors on Python 2; they are not
# needed on Python 3
# import sys
# reload(sys)
# sys.setdefaultencoding('utf-8')


# Request the search entry page, using the official account as the keyword
def get_search_result_by_keywords(sogou_search_url):
    # Browser-like request header
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0'}
    # Request timeout in seconds
    timeout = 5
    # Perform the request inside a requests.Session
    s = requests.Session()
    log(u'Search URL: %s' % sogou_search_url)
    return s.get(sogou_search_url, headers=headers, timeout=timeout).content


# Extract the address of the official account's home page
def get_wx_url_by_sougou_search_html(sougou_search_html):
    doc = pq(sougou_search_html)
    return doc('div[class=txt-box]')('p[class=tit]')('a').attr('href')


# Load the account home page with webdriver, mainly for the JS-rendered parts
def get_selenium_js_html(url):
    options = Options()
    options.add_argument('--headless')  # run Chrome without a window
    driver = Chrome(options=options)  # expects chromedriver to be on PATH
    driver.get(url)
    time.sleep(3)
    # Execute JS to get the full rendered page
    html = driver.execute_script("return document.documentElement.outerHTML")
    driver.quit()
    return html


# Select the article entries on the account page
def parse_wx_articles_by_html(selenium_html):
    doc = pq(selenium_html)
    return doc('div[class="weui_media_box appmsg"]')


# Convert the article entries into a list of dictionaries
def switch_arctiles_to_list(articles):
    articles_list = []
    i = 1
    # Walk through the articles and parse each one
    if articles:
        for article in articles.items():
            log(u'Collecting (%d/%d)' % (i, len(articles)))
            articles_list.append(parse_one_article(article))
            i += 1
    return articles_list


# Parse a single article entry
def parse_one_article(article):
    # Title
    title = article('h4[class="weui_media_title"]').text().strip()
    log(u'Title: %s' % title)
    # Address that goes with the title (the page stores it in the "hrefs" attribute)
    url = 'http://mp.weixin.qq.com' + article('h4[class="weui_media_title"]').attr('hrefs')
    log(u'URL: %s' % url)
    # Summary
    summary = article('.weui_media_desc').text()
    log(u'Summary: %s' % summary)
    # Publication date
    date = article('.weui_media_extra_info').text().strip()
    log(u'Published: %s' % date)
    # Cover image
    pic = parse_cover_pic(article)
    return {
        'title': title,
        'url': url,
        'summary': summary,
        'date': date,
        'pic': pic
    }


# Find the cover image and return its address
def parse_cover_pic(article):
    # attr() returns None when there is no style attribute
    pic = article('.weui_media_hd').attr('style') or ''
    p = re.compile(r'background-image:url\((.*?)\)')
    rs = p.findall(pic)
    cover = rs[0] if len(rs) > 0 else ''
    log(u'Cover image: %s' % cover)
    return cover


# Custom log function that prefixes the time
def log(msg):
    print(u'%s: %s' % (time.strftime('%Y-%m-%d_%H-%M-%S'), msg))


# Verification check
def need_verify(selenium_html):
    """The site sometimes blocks the IP. If the html contains a tag with
    id=verify_change, the request was redirected and should be retried later."""
    return pq(selenium_html)('#verify_change').text() != ''


# Create a folder named after the official account
def create_dir(keywords):
    if not os.path.exists(keywords):
        os.makedirs(keywords)


def run(keywords):
    """Crawler entry point."""
    # Step 0: create a folder named after the official account
    create_dir(keywords)

    # Sogou WeChat search entry URL
    sogou_search_url = 'http://weixin.sogou.com/weixin?type=1&query=%s&ie=utf8&s_from=input&_sug_=n&_sug_type_=' % quote(keywords)

    # Step 1: GET the Sogou WeChat search page, using the account's English ID as the keyword
    log(u'Starting, official account English ID: %s' % keywords)
    log(u'Calling the Sogou search engine')
    sougou_search_html = get_search_result_by_keywords(sogou_search_url)

    # Step 2: parse the account home page link out of the search results
    log(u'Got sougou_search_html, extracting the account home page wx_url')
    wx_url = get_wx_url_by_sougou_search_html(sougou_search_html)
    log(u'Got wx_url: %s' % wx_url)

    # Step 3: use Selenium with headless Chrome to get the html after the JS has rendered
    log(u'Rendering the html with selenium')
    selenium_html = get_selenium_js_html(wx_url)

    # Step 4: check whether the target site has blocked the crawler
    if need_verify(selenium_html):
        log(u'The crawler was blocked by the target site, please try again later')
        return []

    # Step 5: use PyQuery to parse the article list out of the html from Step 3
    log(u'Rendering finished, parsing the account articles')
    articles = parse_wx_articles_by_html(selenium_html)
    log(u'Found %d articles' % len(articles))

    # Step 6: pack the article data into a list of dictionaries
    log(u'Collecting the article data into dictionaries')
    articles_list = switch_arctiles_to_list(articles)
    return [content['title'] for content in articles_list]


if __name__ == '__main__':
    gongzhonghao = input(u'input weixin gongzhonghao:')
    if not gongzhonghao:
        gongzhonghao = 'spider'
    text = " ".join(run(gongzhonghao))
    print(text)

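Run the main block directly and type the English ID of the official account you want to scrape into the console. A Chinese name may match several accounts, whereas the English ID is matched exactly and returns a single result. To find an account's English ID, open the account on your phone and view its profile.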

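Hotlink protection

WeChat applies hotlink protection to the images in official account articles, so the images do not display anywhere other than official accounts, mini programs, and a PC browser. The following article explains how to deal with WeChat's hotlink protection.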

https://blog.csdn.net/tjcyjd/article/details/74643521
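As a rough illustration only, and not the method from the linked article: hotlink protection of this kind is typically decided from the `Referer` request header, so one possible workaround is to fetch the image server-side with a `Referer` that WeChat accepts. That this is sufficient is an assumption on my part; the helper name `build_image_request` and the image URL are hypothetical.

```python
from urllib.request import Request

# Assumption: a Referer pointing at the WeChat platform is accepted.
WECHAT_REFERER = 'https://mp.weixin.qq.com/'


def build_image_request(image_url, referer=WECHAT_REFERER):
    """Build (but do not send) a request for an article image."""
    headers = {
        'Referer': referer,
        'User-Agent': 'Mozilla/5.0',
    }
    return Request(image_url, headers=headers)


# Hypothetical image URL used only for illustration.
req = build_image_request('https://mmbiz.qpic.cn/example.jpg')
print(req.get_header('Referer'))  # https://mp.weixin.qq.com/
```

The request object can then be passed to `urllib.request.urlopen` and the bytes stored or re-served from your own server.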

References: https://blog.csdn.net/d1240673769/article/details/75907152/
https://blog.csdn.net/wnma3mz/article/details/78570580
https://www.jianshu.com/p/874e85bedb4b
