Python crawler: multithreaded scraping of proxy servers

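Python is a powerful scripting language and is often used to write crawlers. The script below uses multiple threads to scrape proxy servers and then check which of them actually work.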

First, use Google to find web pages that list proxy server addresses. I chose to scrape http://www.88181.com/ and collected 800 proxies from it (taken from 8 pages).

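#!/usr/bin/env python
# coding: utf-8
# Python 2 script (it uses urllib2 and the print statement)

import urllib2
import re
import threading
import time

rawProxyList = []      # scraped proxies, each stored as [ip, port, addr]
checkedProxyList = []  # proxies that passed the check: (ip, port, addr, timeused)

# Scrape the proxy listing site.
# The scraped port is a string of letters joined by '+';
# this table maps each letter back to a digit.
portdicts = {'v':"3",'m':"4",'a':"2",'l':"9",'q':"0",'b':"5",'i':"7",'w':"6",'r':"8",'c':"1"}

targets = []
for i in xrange(1, 9):
    target = r"http://www.88181.com/proxy%d.html" % i
    targets.append(target)
#print targets

# Regular expression for one proxy entry.
# The code below reads four groups from each match: ip, encoded port, agent, location.
# The HTML tags of the original pattern are missing from this listing.
p = re.compile(r'''

(.+?)(.+?).+?(.+?)''')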

# Thread class that scrapes the proxies from one listing page
class ProxyGet(threading.Thread):
    def __init__(self, target):
        threading.Thread.__init__(self)
        self.target = target

    def getProxy(self):
        print "Target site: " + self.target
        req = urllib2.urlopen(self.target)
        result = req.read()
        #print chardet.detect(result)
        matchs = p.findall(result)
        for row in matchs:
            ip = row[0]
            port = row[1]
            # decode the port: letters joined by '+' become digits
            port = map(lambda x: portdicts[x], port.split('+'))
            port = ''.join(port)
            agent = row[2]
            # decode the location from cp936 (GBK) and re-encode it as UTF-8
            addr = row[3].decode("cp936").encode("utf-8")
            proxy = [ip, port, addr]
            #print proxy
            rawProxyList.append(proxy)

    def run(self):
        self.getProxy()

# Thread class that checks a list of proxies
class ProxyCheck(threading.Thread):
    def __init__(self, proxyList):
        threading.Thread.__init__(self)
        self.proxyList = proxyList
        self.timeout = 5
        self.testUrl = "http://www.baidu.com/"
        # string that must appear in the page fetched through the proxy
        self.testStr = "030173"

    def checkProxy(self):
        cookies = urllib2.HTTPCookieProcessor()
        for proxy in self.proxyList:
            proxyHandler = urllib2.ProxyHandler({"http": r'http://%s:%s' % (proxy[0], proxy[1])})
            #print r'http://%s:%s' %(proxy[0],proxy[1])
            opener = urllib2.build_opener(cookies, proxyHandler)
            opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:22.0) Gecko/20100101 Firefox/22.0')]
            #urllib2.install_opener(opener)
            t1 = time.time()
            try:
                #req = urllib2.urlopen("http://www.baidu.com", timeout=self.timeout)
                req = opener.open(self.testUrl, timeout=self.timeout)
                #print "urlopen is ok...."
                result = req.read()
                #print "read html...."
                timeused = time.time() - t1
                pos = result.find(self.testStr)
                #print "pos is %s" %pos
                # find() returns -1 when the string is absent
                if pos > -1:
                    checkedProxyList.append((proxy[0], proxy[1], proxy[2], timeused))
                    #print "ok ip: %s %s %s %s" %(proxy[0],proxy[1],proxy[2],timeused)
                else:
                    continue
            except Exception as e:
                # any failure (timeout, refused connection, bad response) rejects the proxy
                #print e.message
                continue

    def run(self):
        self.checkProxy()

if __name__ == "__main__":
    getThreads = []
    checkThreads = []

    # start one thread per target page to scrape proxies
    for i in range(len(targets)):
        t = ProxyGet(targets[i])
        getThreads.append(t)

    for i in range(len(getThreads)):
        getThreads[i].start()

    for i in range(len(getThreads)):
        getThreads[i].join()

    print '.'*10 + "Scraped %s proxies in total" % len(rawProxyList) + '.'*10

    # start 20 checker threads: split the scraped proxies into 20 parts, one part per thread
    chunk = (len(rawProxyList) + 19) / 20  # ceiling of len/20 (Python 2 integer division)
    for i in range(20):
        t = ProxyCheck(rawProxyList[chunk * i:chunk * (i + 1)])
        checkThreads.append(t)

    for i in range(len(checkThreads)):
        checkThreads[i].start()

    for i in range(len(checkThreads)):
        checkThreads[i].join()

    print '.'*10 + "%s proxies passed the check in total" % len(checkedProxyList) + '.'*10

    # persist the results, fastest proxy first
    f = open("proxy_list.txt", 'w+')
    for proxy in sorted(checkedProxyList, key=lambda x: x[3]):
        print "checked proxy is: %s:%s\t%s\t%s" % (proxy[0], proxy[1], proxy[2], proxy[3])
        f.write("%s:%s\t%s\t%s\n" % (proxy[0], proxy[1], proxy[2], proxy[3]))
    f.close()

Partial log:

Target site: http://www.88181.com/proxy1.html

Target site: http://www.88181.com/proxy2.html
Target site: http://www.88181.com/proxy3.html
Target site: http://www.88181.com/proxy4.html
Target site: http://www.88181.com/proxy5.html
Target site: http://www.88181.com/proxy6.html
Target site: http://www.88181.com/proxy7.html
Target site: http://www.88181.com/proxy8.html
..........Scraped 800 proxies in total..........
..........478 proxies passed the check in total.........
173.213.113.111:8089 United States 0.341555833817
173.213.113.111:3128 United States 0.347477912903
210.101.131.232:8080 韩国首尔 0.418715000153
.....
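
Two steps in the script are easy to misread: the port decoding in getProxy, and the slicing that hands each of the 20 checker threads its share of the list. The sketch below isolates both in Python 3 (the script itself is Python 2). The helper names decode_port and split_evenly are hypothetical, and the sample port strings are made up for illustration; only the portdicts table comes from the script.

```python
# Letter-to-digit table copied from the script above.
portdicts = {'v': "3", 'm': "4", 'a': "2", 'l': "9", 'q': "0",
             'b': "5", 'i': "7", 'w': "6", 'r': "8", 'c': "1"}


def decode_port(encoded):
    """Turn a '+'-joined letter string such as 'r+q+r+q' into a digit string.

    Hypothetical helper; it mirrors the map/join lines in getProxy.
    """
    return ''.join(portdicts[letter] for letter in encoded.split('+'))


def split_evenly(items, parts=20):
    """Cut items into `parts` consecutive slices of ceil(len/parts) elements.

    Hypothetical helper; it mirrors the slicing in the main block. Trailing
    slices are shorter (or empty) when len(items) is not a multiple of parts.
    """
    size = (len(items) + parts - 1) // parts  # ceiling division
    return [items[size * i:size * (i + 1)] for i in range(parts)]


if __name__ == "__main__":
    print(decode_port("r+q+r+q"))        # 8080
    print(decode_port("v+c+a+r"))        # 3128
    chunks = split_evenly(list(range(800)))
    print(len(chunks), len(chunks[0]))   # 20 40
```

With the 800 proxies from the log, each of the 20 threads receives 40 entries.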

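urllib2 exists only in Python 2. For readers on Python 3, here is a minimal sketch of the same check built on urllib.request, assuming the same test URL, marker string and 5-second timeout as the script. The function name check_proxy and its return convention (elapsed seconds, or None on failure) are assumptions made for illustration and are not part of the original.

```python
import http.client
import time
import urllib.request


def check_proxy(ip, port, test_url="http://www.baidu.com/",
                test_str=b"030173", timeout=5):
    """Fetch test_url through the proxy; return elapsed seconds or None.

    Hypothetical helper: a Python 3 counterpart of ProxyCheck.checkProxy
    for a single proxy.
    """
    proxy_handler = urllib.request.ProxyHandler(
        {"http": "http://%s:%s" % (ip, port)})
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(), proxy_handler)
    opener.addheaders = [("User-agent", "Mozilla/5.0")]
    start = time.time()
    try:
        with opener.open(test_url, timeout=timeout) as resp:
            body = resp.read()  # bytes in Python 3, hence the bytes marker
    except (OSError, http.client.HTTPException):
        # URLError, timeouts and refused connections are all OSError subclasses
        return None
    elapsed = time.time() - start
    return elapsed if test_str in body else None


if __name__ == "__main__":
    # 127.0.0.1:1 is assumed to have no proxy listening, so this should print None
    print(check_proxy("127.0.0.1", "1", timeout=1))
```

Returning the elapsed time lets the caller sort working proxies by speed, as the script does before writing proxy_list.txt.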