Beating the Anti-Crawler Protection on Sogou WeChat Official Account Articles

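The approach is simple: selenium + chromedriver. Everything on the Sogou side is done inside a Chrome browser driven by Selenium. The article pages on mp.weixin.qq.com belong to Tencent and are not protected against crawlers, so urllib, requests, or similar libraries are enough for them.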
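
As a rough illustration of the second half of that split, here is a minimal standard-library sketch for the mp.weixin.qq.com side. It assumes, as the script in this post does, that article images carry their address in a data-src attribute. The names ImgSrcParser, extract_image_urls and fetch_article are hypothetical, and the User-Agent value is a placeholder, since the post does not show its own header dict.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class ImgSrcParser(HTMLParser):
    """Collect the data-src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("data-src")
            if src:
                self.sources.append(src)


def extract_image_urls(html, base_url):
    """Return absolute image URLs taken from data-src attributes."""
    parser = ImgSrcParser()
    parser.feed(html)
    # urljoin also resolves protocol-relative ("//host/path") addresses.
    return [urljoin(base_url, src) for src in parser.sources]


def fetch_article(url, timeout=10):
    """Download one article page as text (performs a network request)."""
    # Placeholder User-Agent: an assumption, not the header used in the post.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling extract_image_urls(fetch_article(url), url) on an article link would then give the list of images to download.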

You have to scan the QR code to log in. Without logging in, only the first 10 pages of results can be collected.

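import threading
import time
import urllib.request

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

# NOTE: header, connect, cursor, html, get_total_url, check_mkdir and
# insert_into_table are used below but are not defined in the original post.
# They must be defined before this script will run.

driver = webdriver.Chrome()
driver.get("http://weixin.sogou.com/")
# find_element(By.XPATH, ...) replaces find_element_by_xpath, which newer
# Selenium 4 releases no longer provide.
# Open the login dialog; scan the QR code before typing the keyword.
driver.find_element(By.XPATH, '//*[@id="loginBtn"]').click()

find = input("Enter the keyword to search for: ")
driver.find_element(By.XPATH, '//*[@id="query"]').send_keys(find)
driver.find_element(By.XPATH, '//*[@id="searchForm"]/div/input[3]').click()
time.sleep(2)

# Walk the result pages and collect every article link.
url_list = []
while True:
    page_source = driver.page_source
    bs_obj = BeautifulSoup(page_source, "html.parser")
    one_url_list = bs_obj.find_all("div", {"class": "txt-box"})
    for url in one_url_list:
        url_list.append(url.h3.a.attrs["href"])
    # Stop when there is no "next page" link; the original looped forever
    # and crashed here on the last page.
    next_link = bs_obj.find("a", {"id": "sogou_next"})
    if next_link is None:
        break
    next_page = "http://weixin.sogou.com/weixin" + next_link.attrs["href"]
    driver.get(next_page)
    time.sleep(1)


def get_img(url, num, connect, cursor):
    """Download every image of one article and record it in the database."""
    response = requests.get(url, headers=header).content
    content = str(response, encoding="utf-8")
    bs_obj = BeautifulSoup(content, "html.parser")
    img_list = bs_obj.find_all("img")
    path = r"C:\Users\Mr.Guo\Pictures\weixin"
    check_mkdir(path)
    count = 0
    for img in img_list:
        try:
            imgurl = get_total_url(img.attrs["data-src"])
            # Use the num argument, not the global loop variable url_num,
            # which changes while the threads are running.
            store_name = "%s%s" % (num, count)
            urllib.request.urlretrieve(imgurl, path + "\\%s.jpeg" % store_name)
            # While html is undefined this line raises NameError, which the
            # except below swallows, so count never advances.
            insert_into_table(connect, cursor, store_name, html)
            count += 1
        except Exception:
            # Images without data-src and failed downloads are skipped.
            pass


# One thread per article.
for url_num in range(len(url_list)):
    t = threading.Thread(target=get_img, args=(url_list[url_num], url_num, connect, cursor))
    t.start()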
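
The script starts one thread per collected URL and silently drops every exception, which can mean hundreds of simultaneous downloads and no record of what failed. As a possible alternative, not the author's method, the sketch below caps concurrency with the standard library's ThreadPoolExecutor and keeps the errors. The name run_bounded and the default of 8 workers are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_bounded(urls, worker, max_workers=8):
    """Call worker(url, index) for every URL with at most max_workers threads.

    Returns (results, errors): two dicts keyed by the URL's index.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the index of the URL it is processing.
        futures = {pool.submit(worker, url, i): i for i, url in enumerate(urls)}
        for future in as_completed(futures):
            index = futures[future]
            try:
                results[index] = future.result()
            except Exception as exc:
                # Keep the failure so it can be inspected or retried later.
                errors[index] = exc
    return results, errors
```

With the script above, the final loop could become something like run_bounded(url_list, lambda url, i: get_img(url, i, connect, cursor)), keeping in mind that sharing one database cursor across threads may not be safe with every driver.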
