[Python 3.6 Crawler Learning Notes] (11) Using Proxy IPs and Testing IP Availability with Multithreading (Boosting Page Views)

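Preface: I originally planned to write a script that floods Qzone message boards, but Tencent (TX) shut that down with an endless loop of Qzone CAPTCHAs. I foolishly spent the morning learning CAPTCHA recognition, only to discover that the message could not be posted at all: even when the answer was correct, the CAPTCHA kept popping up. With no way forward, I took a nap and started on something new: proxy IPs! I should have used them earlier, but I had been relying on selenium the whole time and never looked closely at headers, nor used cookies or proxy IPs. I will fill those in when I need them, along with simple CAPTCHA recognition.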

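For background on proxy IPs, see: "Quickly Increasing Blog Page Views with Proxy IPs in a Python Crawler"
I also found a very good article: "Python3 Web Crawlers (11): Crawler Black Magic, Making Your Crawler Behave More Like a Human User (Proxy IP Pools and More)"
As for inflating page views, the main mechanism is that a site counts a new view only when the visit comes from a different IP. Some sites count by cookies instead, which is the cruder of the two checks.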

Part 1: Using proxy IPs with requests, ChromeDriver, and PhantomJS

1-1 Using a proxy IP with requests

import requests

# ip: one proxy address as a string ('host:port'), defined elsewhere
http = 'http://' + str(ip)
proxies = {"http": http}
try:
    r = requests.get("http://blog.csdn.net/qq_36962569/article/details/77387299", proxies=proxies)
except Exception as e:
    print(e)

In the same way, headers, cookies, and data can be passed to requests directly:

requests.get(url, headers=headers)
requests.get(url, cookies=cookies)
requests.get(url, data=data)

Several parameters can also be passed at once:

requests.get(url,headers=headers,data=data)
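
The proxies argument is just a dictionary keyed by URL scheme, and requests picks the entry that matches the scheme of the target URL, so a dictionary with only an "http" key is not applied to https:// pages. Below is a minimal sketch of a helper that builds the dictionary for both schemes. The name build_proxies and the address 203.0.113.10:8080 (a documentation-only address) are placeholders of mine, not something from the original scripts.

```python
def build_proxies(address, scheme="http"):
    """Return a requests-style proxies dict for one 'host:port' string.

    build_proxies is a hypothetical helper name used only for illustration.
    """
    address = address.strip()
    # Add a scheme only when the caller did not supply one already.
    if "://" not in address:
        address = f"{scheme}://{address}"
    # requests selects the proxy by the scheme of the target URL,
    # so the same proxy is registered for both http and https.
    return {"http": address, "https": address}


if __name__ == "__main__":
    # 203.0.113.10 is a placeholder from a documentation-only range.
    print(build_proxies("203.0.113.10:8080"))
```

The result can then be passed along with the other keyword arguments, for example requests.get(url, headers=headers, proxies=build_proxies(ip), timeout=5).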

Reference links:
"Python Notes 7: Requests Crawler Tips" (highly recommended, very detailed)
"Python Crawler Tips: Setting a Proxy IP"

1-2 Using a proxy IP with ChromeDriver

from selenium import webdriver

def ChromeDriverWithIP():
    PROXY = "47.52.108.18"
    chrome_options = webdriver.ChromeOptions()
    # Two ways to add the proxy IP
    # chrome_options.add_argument('--proxy-server=http://35.189.128.127')
    chrome_options.add_argument('--proxy-server={0}'.format(PROXY))
    # Pass the options carrying the proxy IP to the driver
    # (newer Selenium releases take options= instead of chrome_options=)
    chrome = webdriver.Chrome(chrome_options=chrome_options)
    chrome.get('http://www.cnblogs.com/buzhizhitong/p/5714419.html')
    print('2: ', chrome.page_source)

1-3 Using a proxy IP with PhantomJS

# How to change the proxy dynamically with PhantomJS and selenium
from selenium import webdriver
from selenium.webdriver import DesiredCapabilities
from selenium.webdriver.common.proxy import Proxy
from selenium.webdriver.common.proxy import ProxyType

def DynamicUsingIP():
    # Create a proxy object
    proxy = Proxy({
        'proxyType': ProxyType.MANUAL,
        'httpProxy': '210.38.1.134'  # proxy IP and port
    })
    desired_capabilities = DesiredCapabilities.PHANTOMJS.copy()
    # Add the proxy IP to the capabilities
    proxy.add_to_capabilities(desired_capabilities)
    driver = webdriver.PhantomJS(desired_capabilities=desired_capabilities)
    # Test it: open a page that reports the IP address in use
    driver.get('http://1212.ip138.com/ic.asp')
    print(driver.page_source)
    # # Now switch to another IP
    # # Create a new proxy
    # proxy = Proxy(
    #     {
    #         'proxyType': ProxyType.MANUAL,
    #         'httpProxy': 'ip:port'  # proxy IP and port
    #     }
    # )
    # # Create a new set of desired capabilities
    # desired_capabilities = DesiredCapabilities.PHANTOMJS.copy()
    # # Add the proxy IP to the capabilities
    # proxy.add_to_capabilities(desired_capabilities)
    # # Start a new session and pass the capabilities in
    # driver.start_session(desired_capabilities)
    # driver.get('http://httpbin.org/ip')
    # print(driver.page_source)
    driver.quit()

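Reference links:
"A Roundup of selenium and PhantomJS Pitfalls" (covers things to watch out for with PhantomJS)
"Setting a Proxy IP in Selenium" (covers several ways to set it)
"How to Set a Proxy IP with selenium and phantomjs"
"Setting proxy and headers in phantomjs and selenium"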

Part 2: Testing proxy IP availability

2-1 Testing without threads

import time
import requests

# IP check: save the usable IPs back to IP.txt
def IPCheck():
    IP = []
    SuccessIP = []
    # Read the file (strip() also handles a last line without a newline)
    with open('IP.txt', 'r') as f:
        for line in f:
            IP.append(line.strip())
    # Use each proxy with the requests module
    for ip in IP:
        http = 'http://' + str(ip)
        proxies = {"http": http}
        time.sleep(10)
        try:
            r = requests.get("http://blog.csdn.net/qq_36962569/article/details/77387299", proxies=proxies)
        except Exception:
            print(str(ip) + '---connect failed')
        else:
            SuccessIP.append(ip)
            print(str(ip) + '---success')
    # Save the working IPs again
    n = 0
    f = open('IP.txt', 'w')
    for ip in SuccessIP:
        f.write(ip + '\n')
        n += 1
    f.close()
    print('Total are ' + str(n) + ' successful IP')
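
Reading the proxy list is a step both test functions share, so it can be pulled into a small helper. This is only a sketch, assuming IP.txt holds one host:port entry per line. The name load_proxies is my own, and dropping blank lines and duplicates is an extra I added, not something the original script does.

```python
import os
import tempfile


def load_proxies(path):
    """Read one proxy per line, dropping blanks and duplicates (order kept).

    load_proxies is a hypothetical helper name used only for illustration.
    """
    seen = set()
    proxies = []
    with open(path, "r") as f:
        for line in f:
            address = line.strip()  # also copes with a missing final newline
            if address and address not in seen:
                seen.add(address)
                proxies.append(address)
    return proxies


if __name__ == "__main__":
    # Demo on a temporary file filled with placeholder addresses.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("203.0.113.1:8080\n\n203.0.113.1:8080\n203.0.113.2:3128")
    print(load_proxies(tmp.name))
    os.remove(tmp.name)
```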

This is very slow: testing 50 IPs takes roughly 3 minutes. With multithreading, testing 70 takes only about ten seconds (seriously fast).
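
The gap comes from waiting: each check spends nearly all of its time blocked on the network, so running the checks one after another adds the waits up, while threads let them overlap. Here is a rough offline sketch of that effect. It uses time.sleep as a stand-in for a blocked request, so slow_check, the 0.05 second delay, and the function names are all illustrative assumptions, not measurements of real proxies.

```python
import threading
import time


def slow_check(_proxy, delay=0.05):
    """Stand-in for one network check: it only waits, like a blocked request."""
    time.sleep(delay)


def run_sequential(items):
    """Check the items one by one and return the elapsed seconds."""
    start = time.perf_counter()
    for item in items:
        slow_check(item)
    return time.perf_counter() - start


def run_threaded(items):
    """Check every item in its own thread and return the elapsed seconds."""
    start = time.perf_counter()
    threads = [threading.Thread(target=slow_check, args=(item,)) for item in items]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every worker has finished
    return time.perf_counter() - start


if __name__ == "__main__":
    items = list(range(10))
    print("sequential: %.2fs" % run_sequential(items))
    print("threaded:   %.2fs" % run_threaded(items))
```

The sequential time grows with the number of items, while the threaded time stays close to the duration of a single wait.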
Reference links:
"Using Python to Verify Whether a Proxy IP Works"

2-2 Testing with multithreading

import socket
import threading
import requests

# Verify IP availability with multiple threads
def TreadCheckIP():
    # Load the IPs
    proxys = []
    with open('IP.txt', 'r') as f:
        for line in f:
            proxys.append(line.strip())
    proxy_ip = open('proxy_ip.txt', 'w')  # new file for storing the valid IPs
    lock = threading.Lock()  # create a lock

    # Function that checks whether one proxy IP works
    def test(i):
        socket.setdefaulttimeout(5)  # set the global timeout
        try:
            http = 'http://' + str(proxys[i])
            proxies = {"http": http}
            r = requests.get("http://blog.csdn.net/qq_36962569/article/details/77387299", proxies=proxies)
            lock.acquire()  # take the lock
            print(proxys[i], 'is OK')
            proxy_ip.write('%s\n' % str(proxys[i]))  # write this proxy IP
            lock.release()  # release the lock
        except Exception as e:
            lock.acquire()
            print(proxys[i], e)
            lock.release()

    # Single-threaded check
    '''for i in range(len(proxys)):
        test(i)'''
    # Multithreaded check
    threads = []
    for i in range(len(proxys)):
        thread = threading.Thread(target=test, args=[i])
        threads.append(thread)
        thread.start()
    # Block the main thread until all worker threads finish
    for thread in threads:
        thread.join()
    proxy_ip.close()  # close the file

I still do not understand multithreading very well and could not have written this myself yet; I will keep studying it.
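
As a possible next step in that study, the standard library's concurrent.futures offers a bounded thread pool, which avoids starting one thread per IP and removes the need for a manual lock because results are collected in the main thread. This is a sketch of an alternative, not the approach used above. check_all, safe_check, and max_workers=20 are names and values I chose for illustration, and the checker is passed in so the example runs without a network.

```python
from concurrent.futures import ThreadPoolExecutor


def check_all(proxies, check, max_workers=20):
    """Run check(proxy) for every proxy in a bounded thread pool.

    check is a caller-supplied function returning True when the proxy works.
    Returns the working proxies in their original order.
    """
    proxies = list(proxies)

    def safe_check(proxy):
        # A failing or raising check simply marks the proxy as unusable.
        try:
            return bool(check(proxy))
        except Exception:
            return False

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the results in the same order as the inputs.
        results = list(pool.map(safe_check, proxies))
    return [proxy for proxy, ok in zip(proxies, results) if ok]


if __name__ == "__main__":
    # Stand-in checker so the sketch runs offline; a real one would
    # send a request through the proxy and return True on success.
    def fake_check(proxy):
        return proxy.endswith(":8080")

    print(check_all(["203.0.113.1:8080", "203.0.113.2:3128"], fake_check))
```

Because the working proxies come back as a list, writing them to proxy_ip.txt can happen once at the end, outside any thread.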
Reference links:
"Growing as a Python Crawler (2): Scraping Proxy IPs and Verifying Them with Multiple Threads" (very well written)