Python Crawler Series: Scraping WeChat Official Account News Data



This code is for learning and exchange only; do not use it for illegal purposes.

  • Monitor a specified directory for files, and crawl the data each file describes (a minimal sketch of this trigger follows below)
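At its core the trigger is watchdog's observer pattern: an Observer thread watches a directory and invokes handler callbacks for every filesystem event. A minimal sketch of just that trigger, assuming a placeholder ./category/ directory:

import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class TaskFileHandler(FileSystemEventHandler):
    # on_created fires for every file dropped into the watched directory
    def on_created(self, event):
        if not event.is_directory:
            print("new task file:", event.src_path)  # the crawl would start here

observer = Observer()
observer.schedule(TaskFileHandler(), "./category/", recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)  # keep the main thread alive; watchdog runs in the background
except KeyboardInterrupt:
    observer.stop()
observer.join()

The full implementation below plugs the crawl into on_created, on_modified, and on_moved.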

1. Code Implementation

# -*- coding:utf-8 -*-
'''
@Author  : wanglei
@Date    : 2019/10/10
@Desc    : Sogou WeChat news crawler (weixin.sogou.com)
'''
import hashlib
import json
import os
import re
import threading
import time
import urllib.parse
from queue import Queue

import MySQLdb
import requests
from bs4 import BeautifulSoup
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

# ---------------------------------------------------------------------------
threadNum = 1
mysql_user = "root"
mysql_password = "root"
mysql_database = "news"
mysql_table = "wx_sou_news"
monitorPath = r"c:/users/asus/PycharmProjects/it002/crawler/wxSou/category/"
# ---------------------------------------------------------------------------

# NOTE: the Cookie is session-bound (SNUID/SUID expire quickly); replace it
# with values captured from your own browser before running.
headers = {
    "Cookie": "SUV=005A28DAABD72B4D5D1D3A358EDDF616; CXID=2B6C8A624E9C1A5A5FA5AAEA5CE40242; SUID=4D2BD7AB3220910A000000005D1CCB39; YYID=18AFB43C3A5F3B3AA6478C7D6E3167A1; pgv_pvi=6812688384; weixinIndexVisited=1; sct=1; QIDIANID=Q4l+8p+7M86kIsIyKi6QuMMxpv2kxXbyv7+NiKkBnBxeMNoejOSOJh0JOzT5VdC8; SMYUV=1567084685160551; UM_distinctid=16cdd86a76bb81-06651bda9ec90a-4d045769-1fa400-16cdd86a76c963; GOTO=Af99046; ad=Hjx0Nlllll2NtCV$lllllVCT74llllllNYkMnZllll9lllllj0ULn5@@@@@@@@@@; wuid=AAFeRCfSKQAAAAqHEEfWUAAAkwA=; FREQUENCY=1568125047430_1; front_screen_resolution=1920*1080; usid=kY0HNhvMU5jsLQMx; IPLOC=CN5101; ld=MZllllllll2NJXlklllllVL4ltYlllllHIjeFlllll9lllllVklll5@@@@@@@@@@; LSTMV=146%2C33; LCLKINT=3197; SNUID=533ADB0C6164F215F13E1515562E4F098; ABTEST=8|1569627405|v1",
    "Host": "weixin.sogou.com",
    "Referer": "https://weixin.sogou.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3314.0 Safari/537.36 SE 2.X MetaSr 1.0",
    "X-Requested-With": "XMLHttpRequest"
}


class FileEventHandler(FileSystemEventHandler):
    """Treats every file dropped into the watched directory as a list of
    Sogou channel ids, then starts spider threads for them."""

    def on_any_event(self, event):
        pass

    def rmFile(self, path):
        try:
            os.remove(path)
        except Exception:
            pass

    def checkPathStatus(self, path):
        # Ignore the editor's temporary save files (jb_tmp/jb_old) and
        # paths that no longer exist by the time the event fires.
        if "jb_tmp" in path or "jb_old" in path or not os.path.exists(path):
            return False
        return True

    def requestsUrl(self, path):
        if self.checkPathStatus(path):
            categoryQueue = getCategoryQueue(path)
            self.rmFile(path)
            for i in range(threadNum):
                w = wxSouSpider(categoryQueue)
                w.start()

    def on_moved(self, event):
        if not event.is_directory:
            self.requestsUrl(event.dest_path)

    def on_created(self, event):
        if not event.is_directory:
            self.requestsUrl(event.src_path)

    def on_modified(self, event):
        if not event.is_directory:
            self.requestsUrl(event.src_path)


class wxSouSpider(threading.Thread):
    def __init__(self, categoryQueue, *args, **kwargs):
        super(wxSouSpider, self).__init__(*args, **kwargs)
        self.categoryQueue = categoryQueue

    def getHTML(self, url):
        # Retry until the request succeeds, backing off briefly on failure.
        while True:
            try:
                resp = requests.get(url, headers=headers, timeout=10)
                return resp.content.decode("utf-8")
            except Exception:
                time.sleep(1)

    def getCategoryUrl(self, category):
        return ("https://weixin.sogou.com/pcindex/pc/pc_" + str(category) +
                "/pc_" + str(category) + ".html")

    def md(self, s):
        # md5 hex digest, used as a dedup hash of the banner image url
        return hashlib.md5(str(s).encode("utf-8")).hexdigest()

    def getDate(self, ts):
        # Turn the page's unix timestamp into a relative label such as
        # "5分钟前" ("5 minutes ago"); labels stay in Chinese for storage.
        try:
            diffTime = int(time.time()) - int(ts)
            if diffTime < 60:
                num, unit = diffTime, "秒"           # seconds
            elif diffTime < 3600:
                num, unit = diffTime // 60, "分钟"    # minutes
            elif diffTime < 86400:
                num, unit = diffTime // 3600, "小时"  # hours
            else:
                num, unit = diffTime // 86400, "天"   # days
            return str(num) + unit + "前"             # "... ago"
        except Exception:
            return None

    def getNewsList(self, url, category):
        html = self.getHTML(url)
        soup = BeautifulSoup(html, "html.parser")
        newsList = []
        try:
            lis = soup.find("ul", attrs={"class": "news-list"}).find_all("li")
            for li in lis:
                metaNews = {'category': category, 'banner': "", 'hs': ""}
                try:
                    # The banner extraction was garbled in the original post;
                    # the img-box selector is an assumption from the list markup.
                    metaNews['banner'] = li.find("div", attrs={"class": "img-box"}).find("img")['src']
                    metaNews['hs'] = self.md(metaNews['banner'])
                except Exception:
                    pass
                try:
                    txtBox = li.find("div", attrs={"class": "txt-box"})
                    a = txtBox.find("h3").find("a")
                except Exception:
                    continue
                metaNews['title'] = ""
                try:
                    metaNews['title'] = a.text
                except Exception:
                    pass
                metaNews['url'] = ""
                try:
                    # "&times" in the raw href gets decoded to "×", so restore
                    # the "&timestamp" query parameter.
                    metaNews['url'] = str(a['href']).replace("×tamp", "&timestamp")
                except Exception:
                    pass
                metaNews['description'] = ""
                try:
                    metaNews['description'] = txtBox.find("p", "txt-info").text
                except Exception:
                    pass
                metaNews['source'] = ""
                try:
                    metaNews['source'] = txtBox.find("div", "s-p").find("a").text
                except Exception:
                    pass
                metaNews['date'] = ""
                try:
                    # the "t" attribute of div.s-p carries the publish timestamp
                    metaNews['date'] = self.getDate(txtBox.find("div", "s-p")['t'])
                except Exception:
                    pass
                newsList.append(metaNews)
            return newsList
        except Exception:
            return None

    def pipLine(self, news):
        # Persist one item; a parameterized query keeps quotes in titles
        # from breaking the insert.
        try:
            conn = MySQLdb.connect(user=mysql_user, host="127.0.0.1",
                                   password=mysql_password,
                                   database=mysql_database, charset='utf8')
            cursor = conn.cursor()
            cursor.execute(
                "insert into " + mysql_table +
                " (category, banner, title, url, description, source, `date`, hs)"
                " values (%s, %s, %s, %s, %s, %s, %s, %s)",
                (news['category'], news['banner'], news['title'], news['url'],
                 news['description'], news['source'], news['date'], news['hs']))
            conn.commit()
            conn.close()
            return True
        except Exception:
            return False

    def run(self):
        while True:
            if self.categoryQueue.empty():
                break
            category = self.categoryQueue.get()
            url = self.getCategoryUrl(category)
            newsList = self.getNewsList(url, category)
            if newsList:
                for news in newsList:
                    if not self.pipLine(news):
                        break


def getCategoryQueue(path):
    # Each non-empty line of the dropped file is one channel id.
    try:
        categoryQueue = Queue(0)
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                category = line.replace("\r", "").replace("\n", "").replace("\t", "").replace(" ", "")
                if category:
                    categoryQueue.put(category)
        return categoryQueue
    except Exception:
        pass


if __name__ == '__main__':
    observer = Observer()
    event_handler = FileEventHandler()
    observer.schedule(event_handler, monitorPath, True)  # recursive watch
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
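The script assumes the wx_sou_news table already exists. A minimal one-off setup sketch follows; the column types are inferred from the insert statement in pipLine, and the cats.txt file name and the numeric channel ids are placeholders:

import MySQLdb

# Create the target table; the column types here are assumptions inferred
# from the insert statement above.
conn = MySQLdb.connect(user="root", host="127.0.0.1", password="root",
                       database="news", charset="utf8")
cursor = conn.cursor()
cursor.execute("""
    create table if not exists wx_sou_news (
        id          int auto_increment primary key,
        category    varchar(16),
        banner      varchar(512),
        title       varchar(256),
        url         varchar(1024),
        description text,
        source      varchar(128),
        `date`      varchar(32),
        hs          char(32)
    ) default charset=utf8
""")
conn.commit()
conn.close()

# Trigger a crawl: write one channel id per line into the watched directory.
# The file name and the ids 0/1/2 are placeholders; each id N maps to
# https://weixin.sogou.com/pcindex/pc/pc_N/pc_N.html.
with open(r"c:/users/asus/PycharmProjects/it002/crawler/wxSou/category/cats.txt",
          "w", encoding="utf-8") as f:
    f.write("0\n1\n2\n")

Once the file appears, the handler reads it into a queue, deletes it, and starts threadNum spider threads that fetch each channel page and write the parsed items to MySQL.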
