After a long break I recently picked up web scraping again and wrote a proxy-IP scraper, validator, and store.

1. The proxies are scraped from the Xici proxy site (xicidaili.com), using the requests + BeautifulSoup libraries.

2. Each proxy is validated against the JD.com and Taobao home pages, using urllib + BeautifulSoup.

3. Validated proxies are stored in a local SQL Server 2008 database, accessed through the pyodbc library.

4. Validation runs on 20 threads, using Python's threading module.

5. A second script periodically pulls the proxies back out of the database and deletes the ones that have gone stale.
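Both scripts below read and write a single table named `ip` in an `ip_save` database. The article never shows the schema; from the INSERT/UPDATE statements it must hold the proxy address (`ips`), the port, and a `time_used` column, all stored as text. A minimal sketch of that assumed shape, using sqlite3 as a stand-in for SQL Server (the `port` column name is my guess; `ips` and `time_used` appear verbatim in the queries):

```python
import sqlite3

# In-memory stand-in for the SQL Server 'ip_save' database; the real scripts
# connect with pyodbc instead. The 'port' column name is an assumption.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE ip (
        ips       VARCHAR(50),  -- proxy address
        port      VARCHAR(10),  -- proxy port
        time_used VARCHAR(50)   -- last measured response time, stored as text
    )
""")
cur.execute("INSERT INTO ip VALUES (?, ?, ?)", ("1.2.3.4", "8080", "0.52"))
conn.commit()
count = cur.execute("SELECT COUNT(*) FROM ip").fetchone()[0]  # → 1
```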

Scraper code:

# -*- coding: utf-8 -*-
# Python 2 script
import time
import pyodbc
import requests
import urllib
import threading
import socket
import sys
from bs4 import BeautifulSoup

reload(sys)
sys.setdefaultencoding("utf-8")  # Python 2 only

target_url = []
aim_ip = []
for i in range(1, 2):
    url = 'http://www.xicidaili.com/nn/%d' % i
    target_url.append(url)
all_message = []


class ipGet(threading.Thread):
    def __init__(self, target):
        threading.Thread.__init__(self)
        self.target = target

    def Get_ip(self):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}
        html = requests.get(self.target, headers=headers)
        soup = BeautifulSoup(html.text, 'html.parser')
        trs = soup.find('table', id='ip_list').find_all('tr')
        for tr in trs[1:]:  # skip the header row
            tds = tr.find_all('td')
            ip = tds[1].text.strip()
            opening = tds[2].text.strip()  # the port
            all_message.append([ip, opening])

    def run(self):
        self.Get_ip()


class ipCheck(threading.Thread):
    def __init__(self, ipList):
        threading.Thread.__init__(self)
        self.ipList = ipList
        self.timeout = 6
        self.test_url = 'http://www.jd.com/?cu=true&utm_source=click.linktech.cn&utm_medium=tuiguang&utm_campaign=t_4_A100220955&utm_term=7e7c13a102664ab3a6886ccefa66d930&abt=3'
        self.another_url = 'https://www.taobao.com/'

    def Check_ip(self):
        socket.setdefaulttimeout(3)
        for ip in self.ipList:
            try:
                proxy_host = "http://" + ip[0] + ":" + ip[1]
                proxy_temp = {"http": proxy_host}
                t_start = time.time()
                res = urllib.urlopen(self.test_url, proxies=proxy_temp).read()
                res2 = urllib.urlopen(self.another_url, proxies=proxy_temp).read()
                t_use = time.time() - t_start
                soup = BeautifulSoup(res, 'html.parser')
                soup2 = BeautifulSoup(res2, 'html.parser')
                # both home pages carry a <link rel="dns-prefetch"> tag; if it is
                # missing, the proxy most likely returned an error page
                ans = soup.find('link', rel='dns-prefetch')
                ans2 = soup2.find('link', rel='dns-prefetch')
                if ans is not None and ans2 is not None:
                    aim_ip.append((ip[0], ip[1], t_use))
            except Exception, e:
                print e

    def run(self):
        self.Check_ip()


class save_csv():  # despite the name, this saves into SQL Server, not a CSV file
    def __init__(self, SaveList):
        self.driver = '{SQL Server}'
        self.server = '(local)'
        self.database = 'ip_save'
        self.savelist = SaveList

    def Save_ip(self):
        base = pyodbc.connect(DRIVER=self.driver, SERVER=self.server, DATABASE=self.database)
        source = base.cursor()
        counts = 0
        for each in self.savelist:
            aim = source.execute("select * from ip where ips='%s'" % each[0])
            if aim.fetchone() is None:
                source.execute("Insert into ip values('%s','%s','%s')"
                               % (each[0], each[1], each[2]))
            else:
                print "The ip '%s' already exists!" % each[0]
                counts += 1
        base.commit()
        source.close()
        base.close()
        return counts


if __name__ == '__main__':
    GetThreading = []
    CheckThreading = []
    for i in range(len(target_url)):
        t = ipGet(target_url[i])
        GetThreading.append(t)
    for t in GetThreading:
        t.start()
        print t.is_alive()
    for t in GetThreading:
        t.join()
    print '@' * 3 + ' ' * 2 + "Scraped %s proxies in total" % len(all_message) + ' ' * 2 + '@' * 3
    size = (len(all_message) + 19) / 20  # ceil division: split the work over 20 threads
    for i in range(20):
        t = ipCheck(all_message[size * i:size * (i + 1)])
        CheckThreading.append(t)
    for t in CheckThreading:
        t.start()
        print t.is_alive()
    for t in CheckThreading:
        t.join()
    print '@' * 3 + ' ' * 2 + "%s proxies passed validation" % len(aim_ip) + ' ' * 2 + '@' * 3
    t = save_csv(aim_ip)
    counts = t.Save_ip()
    print '@' * 3 + ' ' * 2 + "Added %s new proxies" % (len(aim_ip) - counts) + ' ' * 2 + '@' * 3
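The main block above splits `all_message` into 20 roughly equal slices, one per `ipCheck` thread, using ceil division of `(len(all_message) + 19) / 20` items per slice. Factored out as a small helper (the function name is mine, not the article's), the chunking logic is:

```python
def split_for_threads(items, n):
    """Split items into n slices using the same ceil-division
    arithmetic as the script; trailing slices may be empty."""
    size = (len(items) + n - 1) // n
    return [items[size * i:size * (i + 1)] for i in range(n)]

# 45 proxies over 20 threads -> 15 slices of 3, then 5 empty slices
proxies = [("ip%d" % i, "8080") for i in range(45)]
chunks = split_for_threads(proxies, 20)
```

Every item lands in exactly one slice and order is preserved, so the threads collectively cover the whole list without overlap.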

Periodic validation:

# -*- coding: utf-8 -*-
# Python 2 script
import pyodbc
import threading
import socket
import urllib
import time
from bs4 import BeautifulSoup


class Get_ip_sql():
    def __init__(self):
        self.driver = '{SQL Server}'
        self.server = '(local)'
        self.database = 'ip_save'

    def Get(self):
        base = pyodbc.connect(DRIVER=self.driver, SERVER=self.server, DATABASE=self.database)
        source = base.cursor()
        CheckList = list(source.execute("Select * from ip"))
        counts = source.execute("Select count(*) from ip")
        row = counts.fetchone()
        return CheckList, row[0]


class Check_ip_intime(threading.Thread):
    def __init__(self, CheckList):
        threading.Thread.__init__(self)
        self.checklist = CheckList
        self.driver = '{SQL Server}'
        self.server = '(local)'
        self.database = 'ip_save'
        self.test_url = 'http://www.jd.com/?cu=true&utm_source=click.linktech.cn&utm_medium=tuiguang&utm_campaign=t_4_A100220955&utm_term=7e7c13a102664ab3a6886ccefa66d930&abt=3'
        self.another_url = 'https://www.taobao.com/'

    def Work(self):
        base = pyodbc.connect(DRIVER=self.driver, SERVER=self.server, DATABASE=self.database)
        source = base.cursor()
        socket.setdefaulttimeout(3)
        for each in self.checklist:
            try:
                proxy_host = "http://" + each[0] + ":" + bytes(each[1])  # bytes == str in Python 2
                proxy_temp = {'http': proxy_host}
                t_start = time.time()
                res = urllib.urlopen(self.test_url, proxies=proxy_temp).read()
                res2 = urllib.urlopen(self.another_url, proxies=proxy_temp).read()
                t_use = bytes(time.time() - t_start)
                soup = BeautifulSoup(res, 'html.parser')
                soup2 = BeautifulSoup(res2, 'html.parser')
                ans = soup.find('link', rel='dns-prefetch')
                ans2 = soup2.find('link', rel='dns-prefetch')
                if ans is None or ans2 is None:
                    source.execute("Delete from ip where ips = '%s'" % each[0])
                else:
                    source.execute("Update ip set time_used = '%s' where ips = '%s'"
                                   % (t_use, each[0]))
                    print each[0]
            except Exception, e:
                # any failure (timeout, refused connection, bad response)
                # counts as a dead proxy and removes the row
                source.execute("Delete from ip where ips = '%s'" % each[0])
                print e
        base.commit()

    def run(self):
        self.Work()


class Count_ip():
    def __init__(self):
        self.driver = '{SQL Server}'
        self.server = '(local)'
        self.database = 'ip_save'

    def Compute(self):
        base = pyodbc.connect(DRIVER=self.driver, SERVER=self.server, DATABASE=self.database)
        source = base.cursor()
        col = source.execute("Select count(*) from ip")
        ans = col.fetchone()
        return ans[0]


if __name__ == '__main__':
    t = Get_ip_sql()
    Check, counts = t.Get()
    CheckThreading = []
    size = (counts + 4) / 5  # ceil division: split the work over 5 threads
    for i in range(5):
        t = Check_ip_intime(Check[size * i:size * (i + 1)])
        CheckThreading.append(t)
    for t in CheckThreading:
        t.start()
        print t.is_alive()
    for t in CheckThreading:
        t.join()
    c = Count_ip()
    ans = c.Compute()
    print '@' * 3 + ' ' * 2 + "Deleted %s dead proxies" % (counts - ans) + ' ' * 2 + '@' * 3
    print '@' * 3 + ' ' * 2 + "%s proxies left" % ans + ' ' * 2 + '@' * 3
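One caveat: both scripts build their SQL with %-string interpolation, which breaks on values containing quotes and is an injection risk. pyodbc, like other DB-API drivers, supports `?` placeholders that bind values safely. A sketch of the same queries in parameterized form, using sqlite3 (which shares the qmark paramstyle) since SQL Server is not available here:

```python
import sqlite3

# sqlite3 stand-in for the pyodbc connection; same ? placeholder syntax.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE ip (ips VARCHAR(50), port VARCHAR(10), time_used VARCHAR(50))")
cur.execute("INSERT INTO ip VALUES (?, ?, ?)", ("1.2.3.4", "8080", "0.52"))

# parameterized versions of the UPDATE/DELETE statements in the scripts above
cur.execute("UPDATE ip SET time_used = ? WHERE ips = ?", ("0.31", "1.2.3.4"))
cur.execute("DELETE FROM ip WHERE ips = ?", ("9.9.9.9",))
conn.commit()
```

The driver quotes and escapes the bound values itself, so a proxy string containing `'` can no longer corrupt the statement.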
