1. Why use a proxy pool

  • Many websites have dedicated anti-scraping measures, so you may run into IP bans and similar problems.
  • Large numbers of free proxies are published openly on the internet; they are a resource worth using.
  • With scheduled checking and maintenance you can keep several proxies usable at any time.

2. Requirements for a proxy pool

  • Crawl many sites, check asynchronously
  • Filter on a schedule, keep it continuously updated
  • Expose an API that makes proxies easy to fetch

3. Proxy pool architecture

4. Download the proxy-pool maintenance code from GitHub

https://github.com/Germey/ProxyPool


5. Configure the proxy-pool code; running it then raises errors

    import redis
ModuleNotFoundError: No module named 'redis'

Cause: Python does not ship with a Redis client, so `import redis` fails.

Fix: install the redis library for Python.

Go to https://github.com/andymccurdy/redis-py and download the package.

# install the unzip/zip tools
yum install -y unzip zip
# extract
unzip redis-py-master.zip -d /usr/local/redis
# enter the extracted directory
cd /usr/local/redis/redis-py-master
# install the redis library
sudo python setup.py install

Running it again still fails.

The redis library was installed for the wrong Python version; install it explicitly with the intended interpreter:

sudo /root/anaconda3/bin/python setup.py install
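When several interpreters coexist (system Python, Anaconda), a quick check of which one a script actually runs under avoids this class of mistake:

```python
import sys

# Print the interpreter running this script; if it is not the Anaconda
# python used above, packages were installed into a different
# site-packages directory and the import will still fail.
print(sys.executable)
```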

Run again.

The previous error message is gone, but a new one appears:

    import aiohttp
ModuleNotFoundError: No module named 'aiohttp'

Fix:

pip install aiohttp

Run again; another error:

  from fake_useragent import UserAgent,FakeUserAgentError
ModuleNotFoundError: No module named 'fake_useragent'

Fix:

pip install fake-useragent

Run again; another error:

Traceback (most recent call last):
  File "/home/henry/dev/myproject/flaskr/flaskr.py", line 23, in <module>
    app.run()
  File "/home/henry/.local/lib/python3.5/site-packages/flask/app.py", line 841, in run
    run_simple(host, port, self, **options)
  File "/home/henry/.local/lib/python3.5/site-packages/werkzeug/serving.py", line 739, in run_simple
    inner()
  File "/home/henry/.local/lib/python3.5/site-packages/werkzeug/serving.py", line 699, in inner
    fd=fd)
  File "/home/henry/.local/lib/python3.5/site-packages/werkzeug/serving.py", line 593, in make_server
    passthrough_errors, ssl_context, fd=fd)
  File "/home/henry/.local/lib/python3.5/site-packages/werkzeug/serving.py", line 504, in __init__
    HTTPServer.__init__(self, (host, int(port)), handler)
  File "/usr/lib/python3.5/socketserver.py", line 440, in __init__
    self.server_bind()
  File "/usr/lib/python3.5/http/server.py", line 138, in server_bind
    socketserver.TCPServer.server_bind(self)
  File "/usr/lib/python3.5/socketserver.py", line 454, in server_bind
    self.socket.bind(self.server_address)
OSError: [Errno 98] Address already in use
[Finished in 1.9s]

Temporary workaround:
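Errno 98 means another process is already bound to Flask's default port 5000 (for example a leftover instance of the app). One way to confirm that, sketched in Python (the port number is assumed to be Flask's default):

```python
import socket

def port_in_use(port, host='127.0.0.1'):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # connect_ex returns 0 when the connection succeeds, i.e. the port is taken
        return s.connect_ex((host, port)) == 0

print(port_in_use(5000))
```

If it prints True, kill the old process or start the app on another port (e.g. `app.run(port=5001)`).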

Run again; the error is now:

ssh://root@192.168.33.12:22/root/anaconda3/bin/python3 -u /www/python3/maoyantop100/ProxyPool-master/run.py
Ip processing running
 * Serving Flask app "proxypool.api" (lazy loading)
 * Environment: production
   WARNING: Do not use the development server in a production environment.
   Use a production WSGI server instead.
 * Debug mode: off
Refreshing ip
 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
Waiting for adding
PoolAdder is working
Callback crawl_ip181
Getting http://www.ip181.com/
Getting result http://www.ip181.com/ 200
ValidityTester is working
Async Error
Callback crawl_kuaidaili
Getting https://www.kuaidaili.com/free/inha/1/
Getting result https://www.kuaidaili.com/free/inha/1/ 200
Getting 183.163.40.223:31773 from crawl_kuaidaili
Getting https://www.kuaidaili.com/free/inha/2/
Getting result https://www.kuaidaili.com/free/inha/2/ 503
Process Process-2:
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/root/anaconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/www/python3/maoyantop100/ProxyPool-master/proxypool/schedule.py", line 130, in check_pool
    adder.add_to_queue()
  File "/www/python3/maoyantop100/ProxyPool-master/proxypool/schedule.py", line 87, in add_to_queue
    raw_proxies = self._crawler.get_raw_proxies(callback)
  File "/www/python3/maoyantop100/ProxyPool-master/proxypool/getter.py", line 28, in get_raw_proxies
    for proxy in eval("self.{}()".format(callback)):
  File "/www/python3/maoyantop100/ProxyPool-master/proxypool/getter.py", line 51, in crawl_kuaidaili
    re_ip_adress = ip_adress.findall(html)
TypeError: expected string or bytes-like object
TypeError: expected string or bytes-like object
Refreshing ip
Waiting for adding

Fix: in getter.py, line 51 (crawl_kuaidaili), change re_ip_adress = ip_adress.findall(html) to re_ip_adress = ip_adress.findall(str(html)).
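The crash happens because a compiled pattern's findall only accepts str or bytes, and a failed download (here, kuaidaili answering 503) leaves html as None. A minimal reproduction, with the IP pattern simplified for illustration:

```python
import re

ip_pattern = re.compile(r'(\d+\.\d+\.\d+\.\d+)')
html = None  # what the crawler ends up with when the request fails (e.g. a 503)

try:
    ip_pattern.findall(html)  # raises: findall needs a str or bytes
except TypeError as err:
    print('TypeError:', err)

# Wrapping the value in str() avoids the exception; matching against the
# string 'None' simply yields an empty list, so the crawler carries on.
print(ip_pattern.findall(str(html)))  # → []
```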

Run again:

With that, all of the errors above are resolved.

Check the valid free proxies captured in the Redis database.

Requesting the API endpoint returns one random usable proxy.
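Client code can then pull a proxy from the pool like this; the /get path on port 5000 is assumed here to match the local ProxyPool deployment, and the standard-library urllib is used so the snippet runs on its own:

```python
from urllib.request import urlopen
from urllib.error import URLError

# Assumed endpoint of the local pool's Flask API; adjust to your deployment.
PROXY_POOL_URL = 'http://127.0.0.1:5000/get'

def get_random_proxy():
    """Ask the pool's API for one usable proxy; return None if the pool is down."""
    try:
        with urlopen(PROXY_POOL_URL, timeout=5) as resp:
            if resp.getcode() == 200:
                return resp.read().decode().strip()
    except (URLError, OSError):
        pass
    return None

print(get_random_proxy())
```

The returned "ip:port" string can be plugged straight into the proxies argument of requests, as the reference code below does.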

Full reference code:

import requests
from urllib.parse import urlencode
from requests.exceptions import ConnectionError
from pyquery import PyQuery as pq
import pymongo
from lxml.etree import XMLSyntaxError
from config import *

client = pymongo.MongoClient(MONGO_URL)
db = client[MONGO_DB_WEIXIN]

base_url = 'http://weixin.sogou.com/weixin?'
headers = {
    'Cookie': 'ABTEST=0|1532956836|v1; SNUID=86EFE3F74346317411F0D85443087F1E; IPLOC=CN3100; SUID=C4ACA0B44842910A000000005B5F10A4; SUID=C4ACA0B45018910A000000005B5F10A4; weixinIndexVisited=1; SUV=00CF640BB4A0ACC45B5F10A581EB0750; sct=1; JSESSIONID=aaa5-uR_KeqyY51fCIHsw; ppinf=5|1532957080|1534166680|dHJ1c3Q6MToxfGNsaWVudGlkOjQ6MjAxN3x1bmlxbmFtZToxODolRTklODIlQjklRTYlOUYlQUZ8Y3J0OjEwOjE1MzI5NTcwODB8cmVmbmljazoxODolRTklODIlQjklRTYlOUYlQUZ8dXNlcmlkOjQ0Om85dDJsdUJ4alZpSjlNNDczeEphazBteWRkeE1Ad2VpeGluLnNvaHUuY29tfA; pprdig=XiDXOUL6rc8Ehi5XsOUYk-BVIFnPjZrNpwSjN3OknS0KjPtL7-KA8pqp9rKFEWK7YIBYgcZYkB5zhQ3teTjyIEllimmEiMUBBxbe_-O8DMu6ovVCimv7V1ejJQI_vWh-Q2b1UvYM_6Pei5mh9HBEYeqi-oNVJb-U4VAcC-BiiXo; sgid=03-34235539-AVtfEZhssmpJkiaRxNjPAr1k; ppmdig=1532957080000000f835369148a0058ef4c8400357ffc265',
    'Host': 'weixin.sogou.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0'
}
proxy = None

# 3. A request may hit anti-scraping measures; switch to a proxy when that happens
def get_proxy():
    try:
        response = requests.get(PROXY_POOL_URL)
        if response.status_code == 200:
            return response.text
        return None
    except ConnectionError:
        return None

# 2. Request a URL and return the index-page HTML
def get_html(url, count=1):
    print('Crawling', url)
    print('Trying Count', count)
    global proxy
    if count >= MAX_COUNT:
        print('Tried Too Many Counts')
        return None
    try:
        if proxy:
            proxies = {'http': 'http://' + proxy}
            response = requests.get(url, allow_redirects=False, headers=headers, proxies=proxies)
        else:
            response = requests.get(url, allow_redirects=False, headers=headers)
        if response.status_code == 200:
            return response.text
        if response.status_code == 302:
            print('302')
            proxy = get_proxy()
            if proxy:
                print('Using Proxy', proxy)
                count += 1
                return get_html(url, count)
            else:
                print('Get Proxy Failed')
                return None
    except ConnectionError as e:
        print('Error Occurred', e.args)
        proxy = get_proxy()
        count += 1
        return get_html(url, count)

# Sample run log:
# Crawling http://weixin.sogou.com/weixin?query=%E9%A3%8E%E6%99%AF&type=2&page=1
# Trying Count 1
# 302
# Using Proxy 190.11.106.94:8080
# Crawling http://weixin.sogou.com/weixin?query=%E9%A3%8E%E6%99%AF&type=2&page=1
# Trying Count 2
# 302
# Using Proxy 213.128.7.72:53281
# Crawling http://weixin.sogou.com/weixin?query=%E9%A3%8E%E6%99%AF&type=2&page=1
# Trying Count 3
# 302
# Using Proxy 190.147.43.62:53281
# Crawling http://weixin.sogou.com/weixin?query=%E9%A3%8E%E6%99%AF&type=2&page=1
# Trying Count 4
# 302
# Using Proxy 39.104.62.87:8080
# Crawling http://weixin.sogou.com/weixin?query=%E9%A3%8E%E6%99%AF&type=2&page=1
# Trying Count 5
# Tried Too Many Counts
# ... (pages 2 through 100 follow the same pattern)

# 1. Build the URL for a WeChat keyword search
def get_index(keyword, page):
    data = {'query': keyword, 'type': 2, 'page': page}
    queries = urlencode(data)
    url = base_url + queries
    html = get_html(url)
    return html

# 4. Parse the index-page HTML and yield the detail-page URLs
def parse_index(html):
    doc = pq(html)
    items = doc('.news-box .news-list li .txt-box h3 a').items()
    for item in items:
        yield item.attr('href')

# 5. Request a detail-page URL and return its HTML
def get_detail(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except ConnectionError:
        return None

# 6. Parse the detail page: title, account name, publish date, article content, etc.
def parse_detail(html):
    try:
        doc = pq(html)
        title = doc('.rich_media_title').text()
        content = doc('.rich_media_content').text()
        date = doc('#post-date').text()
        nickname = doc('#js_profile_qrcode > div > strong').text()
        wechat = doc('#js_profile_qrcode > div > p:nth-child(3) > span').text()
        return {
            'title': title,
            'content': content,
            'date': date,
            'nickname': nickname,
            'wechat': wechat
        }
    except XMLSyntaxError:
        return None

# Save the record to MongoDB
def save_to_mongo(data):
    if db['articles'].update({'title': data['title']}, {'$set': data}, True):
        print('Saved to Mongo', data['title'])
    else:
        print('Saved to Mongo Failed', data['title'])

# Driver / debug entry point
def main():
    for page in range(1, 101):
        html = get_index(KEYWORD, page)
        # print(html)
        if html:
            article_urls = parse_index(html)
            # print(article_urls)
            # e.g. http://mp.weixin.qq.com/s?src=11&timestamp=1532962678&ver=1030&signature=y0i61ogz4QZEkNu-BrqFNFPnKwRh7qkdb7OpPVZjO2WEPPaZMv*w2USW1uosLJUJF6O4VXRw4DSLlwpCBtLjEW7fncV6idpY5xChzALf47rn8-PauyK5rgHvQTFs0ePy&new=1
            for article_url in article_urls:
                # print(article_url)
                article_html = get_detail(article_url)
                if article_html:
                    article_data = parse_detail(article_html)
                    print(article_data)
                    if article_data:
                        save_to_mongo(article_data)

# Sample parsed article (content truncated):
# {
#     'title': '广东首批十条最美公路出炉!茂名周边也有,自驾约起!',
#     'content': '由广东省旅游局、广东省交通运输厅\n联合重磅发布了\n十条首批广东最美旅游公路\n条条惊艳!希望自驾骑行的别错过!...',
#     'date': '',
#     'nickname': '茂名建鸿传媒网',
#     'wechat': ''
# }

if __name__ == '__main__':
    main()
