View the page source, then parse the tags with XPath

import requests
import parsel

# Fetch the free-proxy page and pull each table row out with XPath.
proxies_list = []
url = "https://www.kuaidaili.com/free/"
headers = {"User-Agent": "Mozilla/5.0"}
r = requests.get(url, headers=headers, timeout=30)
data = r.text
# print(data)
html_data = parsel.Selector(data)
tr_parse = html_data.xpath('//table[@class="table table-bordered table-striped"]/tbody/tr')
for tr in tr_parse:
    proxies_dict = {}
    http_type = tr.xpath('./td[4]/text()').extract_first()  # protocol column
    ip = tr.xpath('./td[1]/text()').extract_first()         # IP column
    ip_port = tr.xpath('./td[2]/text()').extract_first()    # port column
    proxies_dict[http_type] = ip + ':' + ip_port
    proxies_list.append(proxies_dict)
print(proxies_list)

def check_ip(proxies_list):
    # Check proxy quality: anything that cannot fetch Baidu within 0.1 s is discarded.
    headers = {"User-Agent": "Mozilla/5.0"}
    can_use = []
    for ip in proxies_list:
        try:
            # requests expects lowercase scheme keys ('http'), so normalize before passing
            response = requests.get('http://www.baidu.com', headers=headers, timeout=0.1,
                                    proxies={k.lower(): v for k, v in ip.items()})
            if response.status_code == 200:
                can_use.append(ip)
        except Exception as e:
            print(ip, e)
    return can_use

print(check_ip(proxies_list))  # print the high-quality proxies
Sample output: the proxy that timed out is reported with its error, then the usable list is printed.

{'HTTP': '125.94.44.129:1080'} HTTPConnectionPool(host='www.baidu.com', port=80): Read timed out. (read timeout=0.1)
[{'HTTP': '60.190.250.120:8080'}, {'HTTP': '118.112.195.91:9999'}, {'HTTP': '110.243.5.163:9999'}, {'HTTP': '118.89.91.108:8888'}, {'HTTP': '125.122.199.13:9000'}, {'HTTP': '171.11.28.248:9999'}, {'HTTP': '211.152.33.24:39406'}, {'HTTP': '59.62.35.130:9000'}, {'HTTP': '123.163.96.124:9999'}, {'HTTP': '125.117.135.10:9000'}, {'HTTP': '175.44.108.164:9999'}, {'HTTP': '110.243.15.228:9999'}, {'HTTP': '1.193.245.47:9999'}, {'HTTP': '59.62.24.87:9000'}]

Using the proxy IP pool to make the requests:

import requests
import parsel

# Crawl pages 1-4 of the free-proxy list, sending each request through a
# proxy from the pool built above; collect fresh proxies into proxies_list.
proxies_list = []
proxy = [
    {'HTTP': '60.190.250.120:8080'}, {'HTTP': '118.112.195.91:9999'},
    {'HTTP': '110.243.5.163:9999'}, {'HTTP': '118.89.91.108:8888'},
    {'HTTP': '125.122.199.13:9000'}, {'HTTP': '171.11.28.248:9999'},
    {'HTTP': '211.152.33.24:39406'}, {'HTTP': '59.62.35.130:9000'},
    {'HTTP': '123.163.96.124:9999'}, {'HTTP': '125.117.135.10:9000'},
    {'HTTP': '175.44.108.164:9999'}, {'HTTP': '110.243.15.228:9999'},
    {'HTTP': '1.193.245.47:9999'}, {'HTTP': '59.62.24.87:9000'},
]

for a in range(1, 5):
    url = "https://www.kuaidaili.com/free/inha/" + str(a) + "/"
    headers = {"User-Agent": "Mozilla/5.0"}
    for i in proxy:
        try:
            r = requests.get(url, headers=headers, timeout=1,
                             proxies={k.lower(): v for k, v in i.items()})
        except Exception:
            continue  # this proxy is dead, try the next one
        if r.status_code == 200:
            html = r.text
            html_parsel_data = parsel.Selector(html)
            tr_parse = html_parsel_data.xpath('//table[@class="table table-bordered table-striped"]/tbody/tr')
            for tr in tr_parse:
                proxy_dict = {}
                http_type = tr.xpath('./td[4]/text()').extract_first()
                ip = tr.xpath('./td[1]/text()').extract_first()
                ip_port = tr.xpath('./td[2]/text()').extract_first()
                proxy_dict[http_type] = ip + ':' + ip_port
                proxies_list.append(proxy_dict)
            break  # page fetched, move on to the next page
        else:
            continue

def check_ip(proxies_list):
    headers = {"User-Agent": "Mozilla/5.0"}
    can_use = []
    for ip in proxies_list:
        try:
            response = requests.get('http://www.baidu.com', headers=headers, timeout=0.1,
                                    proxies={k.lower(): v for k, v in ip.items()})
            if response.status_code == 200:
                can_use.append(ip)
        except Exception as e:
            print(ip, e)
    return can_use

print(check_ip(proxies_list))  # print the high-quality proxies

[{'HTTP': '175.42.128.48:9999'}, {'HTTP': '123.101.212.223:9999'}, {'HTTP': '60.190.250.120:8080'}, {'HTTP': '125.94.44.129:1080'}, {'HTTP': '118.112.195.91:9999'}, {'HTTP': '110.243.5.163:9999'}, {'HTTP': '118.89.91.108:8888'}, {'HTTP': '125.122.199.13:9000'}, {'HTTP': '171.11.28.248:9999'}, {'HTTP': '211.152.33.24:39406'}, {'HTTP': '59.62.35.130:9000'}, {'HTTP': '123.163.96.124:9999'}, {'HTTP': '125.117.135.10:9000'}, {'HTTP': '175.44.108.164:9999'}, {'HTTP': '110.243.15.228:9999'}, {'HTTP': '59.62.24.87:9000'}, {'HTTP': '113.124.93.190:9999'}, {'HTTP': '119.119.239.155:9000'}, {'HTTP': '60.13.42.157:9999'}, {'HTTP': '180.104.63.242:9000'}, {'HTTP': '175.42.68.223:9999'}, {'HTTP': '1.198.73.202:9999'}, {'HTTP': '125.108.76.226:9000'}, {'HTTP': '106.75.177.227:8111'}, {'HTTP': '124.93.201.59:42672'}, {'HTTP': '121.233.206.211:9999'}, {'HTTP': '175.44.109.104:9999'}, {'HTTP': '118.212.104.240:9999'}, {'HTTP': '163.204.240.107:9999'}, {'HTTP': '60.13.42.77:9999'}, {'HTTP': '49.89.86.30:9999'}, {'HTTP': '106.42.217.26:9999'}, {'HTTP': '115.29.170.58:8118'}, {'HTTP': '183.166.133.196:9999'}, {'HTTP': '114.223.208.165:8118'}, {'HTTP': '175.44.109.71:9999'}, {'HTTP': '163.204.244.219:9999'}, {'HTTP': '210.5.10.87:53281'}, {'HTTP': '123.101.213.137:9999'}, {'HTTP': '171.15.49.169:9999'}, {'HTTP': '1.198.72.171:9999'}, {'HTTP': '125.108.101.220:9000'}, {'HTTP': '36.250.156.85:9999'}, {'HTTP': '123.169.167.44:9999'}, {'HTTP': '123.169.167.44:9999'}, {'HTTP': '115.219.168.69:8118'}, {'HTTP': '1.199.30.73:9999'}, {'HTTP': '222.74.65.69:56210'}, {'HTTP': '110.243.26.53:9999'}, {'HTTP': '171.13.7.108:9999'}, {'HTTP': '175.43.151.48:9999'}, {'HTTP': '1.193.245.3:9999'}, {'HTTP': '163.204.240.35:9999'}, {'HTTP': '113.195.16.66:9999'}, {'HTTP': '27.43.188.27:9999'}, {'HTTP': '113.208.115.190:8118'}, {'HTTP': '125.110.100.170:9000'}, {'HTTP': '1.198.72.19:9999'}, {'HTTP': '121.232.199.174:9000'}]
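With the checked pool in hand, each request can be routed through a randomly chosen proxy. A minimal sketch, assuming can_use holds the result of check_ip above (the two entries shown are placeholders):

import random
import requests

# Placeholder pool: in practice, use the return value of check_ip().
can_use = [{'HTTP': '60.190.250.120:8080'}, {'HTTP': '118.112.195.91:9999'}]

headers = {"User-Agent": "Mozilla/5.0"}
choice = random.choice(can_use)                      # rotate: pick a random proxy each time
proxies = {k.lower(): v for k, v in choice.items()}  # requests expects lowercase scheme keys
r = requests.get('http://www.baidu.com', headers=headers, proxies=proxies, timeout=3)
print(r.status_code)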

XPath syntax

Getting the href attribute and the text of a node, as sketched below:
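A minimal parsel sketch of both expressions (the sample HTML below is made up for illustration):

import parsel

html = '<a href="https://www.kuaidaili.com/free/">free proxy list</a>'
sel = parsel.Selector(html)

# @href selects the attribute node; text() selects the text node
href = sel.xpath('//a/@href').extract_first()   # 'https://www.kuaidaili.com/free/'
text = sel.xpath('//a/text()').extract_first()  # 'free proxy list'
print(href, text)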
