I subscribe to nearly 100 WeChat official accounts, and finding an article I read before can be surprisingly hard: if I forgot to bookmark it, the search can easily eat half an hour. Worse still, the article may have been deleted by its publisher, or taken down for violating platform rules. So is there a crawler that can download all of an official account's articles to my local machine and provide convenient search? That would make looking up articles of a certain kind easy, and since everything would be stored locally, I would no longer have to worry about posts being deleted.

I recently came across an impressive Python crawler project that does exactly this: it crawls WeChat official account articles. After reading through the feature list, I only wish I had found it sooner; the author's work is genuinely excellent, so I am sharing it here. You can simply use its features, or study its implementation. Visit the project page below; you should have no trouble deploying it yourself.

Project address: https://github.com/wonderfulsuccess/weixin_crawler

Feature Showcase

Main UI
[screenshot: 爬虫主界面.gif]

Adding crawl tasks, and the list of accounts already crawled
[screenshot: 公众号.png]

Crawler view

Settings view
[screenshot: 设置.png]

Official account history article list
[screenshot: 历史文章列表.gif]

Reports
[screenshot: 报告.gif]

Search
[screenshot: 搜索.gif]

Introduction

weixin_crawler is a WeChat official account article crawler built with Scrapy, Flask, Echarts, Elasticsearch and more. It ships with analysis reports and full-text search; even millions of documents can be searched almost instantly. weixin_crawler was designed to crawl the complete publishing history of WeChat official accounts, as much and as fast as possible.

weixin_crawler is still under maintenance; the approach works, so feel free to try it.

Try official account data collection right away, no deployment needed

With the standalone executable WCplus.exe (https://shimo.im/docs/E1IjqOy2cYkPRlZd) you can immediately try weixin_crawler's data collection, Excel export and PDF export features.

Main Features

  1. Written in Python 3
  2. Built on Scrapy, making real use of many of its features; a good open-source project for studying Scrapy in depth
  3. A highly usable UI built with Flask, Flask-socketio and Vue; a practical and powerful data assistant for roles such as new-media operations
  4. Thanks to Scrapy, MongoDB and Elasticsearch, crawling, storage and indexing are all simple and efficient; weixin_crawler is not only a crawler but also a search engine
  5. Supports crawling the complete article history of any official account
  6. Supports crawling per-article metrics such as reads, likes, rewards and comments
  7. Ships with a data analysis report module for individual official accounts
  8. Full-text search via Elasticsearch, with multiple search modes and sort modes, plus trend charts for search results (see the sketch after this list)
  9. Official accounts can be grouped, and groups can be used to restrict the search scope
  10. An original phone-automation approach that lets the crawler run unattended: with the help of adb, weixin_crawler operates an Android phone automatically, without any human monitoring
  11. Supports collecting from multiple WeChat apps at once; in theory the crawling speed scales linearly
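
For a taste of what the Elasticsearch-backed search enables, here is a minimal sketch of a full-text query. The index name weixin_articles and the content/read_num fields are illustrative assumptions, not the project's actual mapping:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    resp = es.search(
        index="weixin_articles",  # assumed index name, not the project's
        body={
            "query": {"match": {"content": "python 爬虫"}},  # ik-tokenized Chinese text
            "sort": [{"read_num": {"order": "desc"}}],       # e.g. rank by reading count
            "size": 10,
        },
    )
    for hit in resp["hits"]["hits"]:
        print(hit["_source"].get("title"))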

Main Tools Used

Language: Python 3.6
Frontend: Flask / Flask-socketio / gevent (web framework); Vue / jQuery / W3.CSS / Echarts / Font-Awesome (JS/CSS libraries)
Backend: Scrapy (crawler); MongoDB / Redis (storage); Elasticsearch (indexing)

How to Run

weixin_crawler has been run successfully on Windows, Mac and Linux, though trying it on Windows first is recommended.

Install mongodb / redis / elasticsearch and run them in the background

Download mongodb / redis / elasticsearch from their official sites and install them.

Run them all at the same time under the default configuration: mongodb on localhost:27017 and redis on localhost:6379 (otherwise you have to change the settings in weixin_crawler/project/configs/auth.py).

In order to tokenize Chinese, the elasticsearch-analysis-ik plugin has to be installed into Elasticsearch. A quick way to check that all three services are reachable is sketched below.
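
Not part of the project, but a quick Python sanity check, assuming the default ports above and the pymongo / redis / elasticsearch client packages, that all three services are up before starting the crawler:

    from pymongo import MongoClient
    import redis
    from elasticsearch import Elasticsearch

    # each call raises (or the assert fails) if the service is not reachable
    MongoClient("localhost", 27017, serverSelectionTimeoutMS=3000).admin.command("ping")
    redis.Redis(host="localhost", port=6379).ping()
    assert Elasticsearch("http://localhost:9200").ping()
    print("mongodb / redis / elasticsearch are all up")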

Install the proxy server and run proxy.js

Install nodejs, then npm install anyproxy and redis in weixin_crawler/proxy.

cd to weixin_crawler/proxy and run node proxy.js.

Install the anyproxy HTTPS CA certificate on both the computer and the phone.

If you are not sure how to use anyproxy, here is the doc.
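
Before pointing the phone's proxy settings at the computer, a minimal check that the proxy is actually listening; this assumes proxy.js keeps anyproxy's default proxy port 8001, so adjust if yours differs:

    import socket

    # raises ConnectionRefusedError if nothing is listening on the proxy port
    with socket.create_connection(("127.0.0.1", 8001), timeout=3):
        print("anyproxy is listening on port 8001")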

Install the needed Python packages

NOTE: you may not be able to simply run pip install -r requirements.txt to install every package; twisted, which scrapy depends on, is one of the usual offenders. When you hit problems installing a Python package (twisted, for instance), there is always a way out: download the right version of the package to your drive and run $ pip install <package-file>.

I am not sure whether your Python environment will throw other package-not-found errors; just install whatever package turns out to be needed.

Some source code has to be modified (this may not be the most reasonable approach):

scrapy: Python36\Lib\site-packages\scrapy\http\request\__init__.py --> weixin_crawler\source_code\request\__init__.py

scrapy: Python36\Lib\site-packages\scrapy\http\response\__init__.py --> weixin_crawler\source_code\response\__init__.py

pyecharts: Python36\Lib\site-packages\pyecharts\base.py --> weixin_crawler\source_code\base.py (here the function get_echarts_options is added at line 106)
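
A hedged helper for the file swaps above, assuming the reconstructed paths match your machine: back up each installed file, then overwrite it with the patched copy shipped in weixin_crawler\source_code.

    import shutil

    # (installed file, patched replacement) pairs from the list above;
    # adjust the site-packages prefix to your own Python install
    SWAPS = [
        (r"Python36\Lib\site-packages\scrapy\http\request\__init__.py",
         r"weixin_crawler\source_code\request\__init__.py"),
        (r"Python36\Lib\site-packages\scrapy\http\response\__init__.py",
         r"weixin_crawler\source_code\response\__init__.py"),
        (r"Python36\Lib\site-packages\pyecharts\base.py",
         r"weixin_crawler\source_code\base.py"),
    ]

    for installed, patched in SWAPS:
        shutil.copy2(installed, installed + ".bak")  # keep a backup first
        shutil.copy2(patched, installed)             # apply the project's patch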

If you want weixin_crawler to work automatically, the following steps are necessary; otherwise you would have to operate the phone by hand so that Anyproxy can capture the request data.

Install adb and add it to your PATH (on Windows, for example)

Install an Android emulator (NOX suggested) or plug in your phone, and make sure you can operate it with adb from the command line.

If multiple phones are connected to your computer, you have to find out their adb ports, which will be needed when adding crawlers.

adb does not support Chinese input, which is bad news for searching official accounts by name. To input Chinese, the ADBKeyboard app has to be installed on your Android phone and set as the default input method; more is here. A minimal sketch of typing Chinese through it follows.
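
A small Python sketch of sending Chinese text through ADBKeyboard's broadcast interface (the nickname here is just an example):

    import subprocess

    def adb_type_chinese(text, serial=None):
        # plain `adb shell input text` cannot send Chinese; ADBKeyboard
        # accepts it via its ADB_INPUT_TEXT broadcast instead
        cmd = ["adb"] + (["-s", serial] if serial else [])
        cmd += ["shell", "am", "broadcast", "-a", "ADB_INPUT_TEXT", "--es", "msg", text]
        subprocess.run(cmd, check=True)

    adb_type_chinese("人民日报")  # e.g. search an official account by nickname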

Why can weixin_crawler work automatically? Here is the reason:

To crawl a WeChat official account, you have to search for the account on your phone and tap its "全部消息" (all messages); you then get a message list, and rolling down loads more of it. Any message in the list can be tapped when you also want to crawl the account's reading data.

Given the nickname of an official account, weixin_crawler operates the WeChat app installed on the phone while anyproxy listens in the background. This way weixin_crawler captures all the requests issued by the WeChat app, and then it is show time for Scrapy.

As you would expect, in order to let weixin_crawler operate the WeChat app we have to tell adb where to tap, swipe and type; most of this is defined in weixin_crawler/project/phone_operate/config.py. phone_operate operates WeChat just like a human being: its eyes are the Baidu OCR API plus predefined tap areas, and its fingers are adb. An illustrative sketch of this loop follows.
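
An illustrative reduction of that loop, not the project's actual code: the coordinates are hypothetical stand-ins for the tap targets defined in weixin_crawler/project/phone_operate/config.py, and ADBKeyboard is assumed for the Chinese input.

    import subprocess
    import time

    SEARCH_BOX = (540, 120)    # hypothetical tap targets; the real ones live in
    ALL_MESSAGES = (540, 860)  # weixin_crawler/project/phone_operate/config.py

    def tap(x, y):
        subprocess.run(["adb", "shell", "input", "tap", str(x), str(y)], check=True)

    def open_history(nickname):
        tap(*SEARCH_BOX)  # open the in-app account search
        subprocess.run(["adb", "shell", "am", "broadcast", "-a", "ADB_INPUT_TEXT",
                        "--es", "msg", nickname], check=True)  # type the nickname
        time.sleep(2)
        tap(*ALL_MESSAGES)  # enter the account's 全部消息 page
        for _ in range(50):  # roll down; every swipe makes WeChat request more
            subprocess.run(["adb", "shell", "input", "swipe",
                            "540", "1500", "540", "400"], check=True)
            time.sleep(1)  # ...history items for anyproxy (and Scrapy) to capture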

Run main.py:

$ cd weixin_crawler/project/

$ python(3) ./main.py

Now open your browser, and everything you want will be at localhost:5000.

Somewhere in this long list of steps you may get stuck; join our community for help, and tell us what you have done and what error you ran into.

Let's go explore the world at localhost:5000 together.

Author: somenzz
Source: Jianshu (简书)
