Scraping article information with a Sogou WeChat crawler

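author: Voccoo
time: 2019-4-1

"""
1. This demo only crawls the articles published within a given time range by a specified official account, or by the official accounts matching a specified keyword. To also collect and store the account information itself, adapt the first request (gzh_url) or extract it directly from the articles.
2. This demo was written in a hurry and many details are still rough, for example the missing checks for empty return values. Adapt it to your own needs before using it.
3. The proxies are stored in Redis, which is simply the author's habit. To use another source, modify redis_proxy().
4. 'For proxies, use Zhima IP!'
5. This demo only extracts the article title; modify it yourself to extract more information.
"""
from fake_useragent import UserAgent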
import requests, time
from scrapy import Selector
import random
import redis, json
from urllib.parse import quote


# Redis serves as the proxy IP pool.
# Take one proxy IP from Redis.
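def redis_proxy():
    # Connection settings follow the author's local setup; change them to match yours.
    redis_conn = redis.StrictRedis(host='localhost',
                                   password='Cs123456.',
                                   port=6379,
                                   db=1)
    # blpop blocks until the 'ips' list has an entry and returns a (key, value) tuple.
    redis_ip = redis_conn.blpop('ips')
    ip = json.loads(redis_ip[1].decode('UTF-8'))
    proxy = {'https': 'https://{}'.format(ip['ip'])}
    return proxy


# Fetch the HTML of an article page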
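def get_html_act(url, referer):
    ua = UserAgent()
    # Keep retrying with a fresh proxy until a request succeeds.
    while True:
        proxies = redis_proxy()
        try:
            headers = {
                'User-Agent': ua.random,
                'Upgrade-Insecure-Requests': '1',
            }
            session = requests.session()
            # Warm-up request so the session picks up cookies from mp.weixin.qq.com.
            session.get('https://mp.weixin.qq.com/',
                        headers=headers,
                        proxies=proxies,
                        timeout=3)
            # Reuse the session so the cookies from the warm-up request are sent
            # (the original called requests.get here, which discards them).
            html = session.get(url,
                               headers=headers,
                               proxies=proxies,
                               # allow_redirects=False,
                               timeout=3)
            if html.status_code == 200:
                return Selector(text=html.text)
            else:
                print('---bad status code---proxy {} is blocked---!'.format(proxies['https']))
        except Exception as e:
            print('-----timeout or request error: {}----'.format(e))


# Fetch the HTML of a Sogou search page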
def get_html(url, referer):
    ua = UserAgent()
    # Keep retrying with a fresh proxy until a request succeeds.
    while True:
        proxies = redis_proxy()
        try:
            headers = {
                "Host": "weixin.sogou.com",
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
                'User-Agent': ua.random,
                "ContentType": "text/xml;charset=utf-8",
                'Referer': referer,
                'Upgrade-Insecure-Requests': '1',
            }
            session = requests.session()
            html = session.get(url,
                               headers=headers,
                               proxies=proxies,
                               allow_redirects=False,
                               timeout=3)
            if html.status_code == 200:
                return Selector(text=html.text)
            else:
                print('---bad status code---proxy {} is blocked---!'.format(proxies['https']))
        except Exception as e:
            print('-----timeout or request error: {}----'.format(e))


def run(gzh, start_time, endtime):
    """
    :param gzh: official account name, or a keyword to search accounts by
    :param start_time: start date, e.g. '2019-03-01'
    :param endtime: end date, e.g. '2019-04-01'
    :return: None; article titles are printed

    ps: The first URL searches for official accounts in order to obtain their wxid.
    The query can be a keyword or an account name. When a keyword is used,
    this demo only reads the first page of account results; add paging yourself if needed.
    """
    gzh_url = 'https://weixin.sogou.com/weixin?type=1&s_from=input&query={}&ie=utf8&_sug_=n&_sug_type_='.format(quote(gzh))
    gzh_html = get_html(gzh_url, 'https://weixin.sogou.com/')
    wxid_list = gzh_html.css('ul.news-list2 li::attr(d)').extract()
    for wxid in wxid_list:
        page_ = True
        page_count = 1
        url = 'https://weixin.sogou.com/weixin?type=2&ie=utf8&query={}&tsn=5&ft={}&et={}&interation=&wxid={}&usip={}&page={}'.format(
            quote(gzh), start_time, endtime, wxid, quote(gzh), page_count)
        referer = 'https://weixin.sogou.com/weixin?type=2&s_from=input&query={}&ie=utf8&_sug_=n&_sug_type_='.format(quote(gzh))
        while page_:
            response = get_html(url, referer)
            article_urls = response.css('div.news-box ul.news-list li div.txt-box h3 a::attr(data-share)').extract()
            # A full page holds 10 results, so there may be another page after it.
            if len(article_urls) == 10:
                print('--next page--entering page {}--'.format(page_count + 1))
                url = url.replace('&page={}'.format(page_count), '&page={}'.format(page_count + 1))
                page_count += 1
            else:
                page_ = False
            for al in article_urls:
                article_html = get_html_act(al, '')
                article_name = article_html.css('#activity-name::text').extract_first()
                if article_name:
                    # Print the title of the article behind the current link
                    print(article_name.strip())
                else:
                    # No title found; print the link instead
                    print(al)


if __name__ == '__main__':
    # Start date
    start_time = '2019-03-01'
    # End date
    endtime = '2019-04-01'
    # Official account name; an account keyword also works
    gzh = '痴海'
    run(gzh, start_time, endtime)
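
The fetch functions above retry inside `while True`, so an exhausted proxy pool or a permanently blocked URL would keep them looping, and the notes at the top mention that empty return values are not checked. Below is a minimal sketch of a bounded retry that returns `None` when every attempt fails. The names `fetch_with_retries`, `fetch_once`, `next_proxy` and `max_attempts` are hypothetical and do not appear in the original demo; the sketch is one possible way to restructure the loop, not the author's method.

```python
def fetch_with_retries(fetch_once, next_proxy, max_attempts=5):
    """Try fetch_once(proxy) with a fresh proxy on each attempt.

    fetch_once and next_proxy are hypothetical callables supplied by the
    caller: next_proxy() returns a proxy, and fetch_once(proxy) returns a
    result or None, or raises on failure.
    Returns the first non-None result, or None if all attempts fail.
    """
    for attempt in range(1, max_attempts + 1):
        proxy = next_proxy()
        try:
            result = fetch_once(proxy)
        except Exception as exc:  # broad on purpose, mirroring the demo
            print('attempt {} via {} failed: {}'.format(attempt, proxy, exc))
            continue
        if result is not None:
            return result
        print('attempt {} via {} returned nothing'.format(attempt, proxy))
    # Every attempt failed; the caller decides whether to skip or abort.
    return None
```

In the demo, `next_proxy` could be `redis_proxy`, and `fetch_once` could wrap the body of the `try` block in `get_html`. The caller would then check for `None` before calling `.css(...)` on the result, which covers the empty-return case the notes mention.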
