Python Crawler: Scraping Free Proxy IPs

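The crawlers I wrote earlier could only fetch a small amount of data. When a single IP hits a site too frequently, the site flags it as abnormal crawler traffic and blocks it. The way around this is to send requests through proxy IPs, which in requests means passing a proxies argument with each request. Proxy IPs come in paid and free varieties; the free ones tend to be unstable, and many do not work at all. The goal here is to scrape free proxy IPs from https://www.xicidaili.com/nn/, check whether each one is usable, and save the usable ones to a database. Because free proxies go stale quickly, the database has to be refreshed often.

The following modules need to be imported: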
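As a rough illustration of what "sending a request through a proxy" means, here is a minimal sketch. It uses urllib from the standard library instead of requests so that it runs without extra packages; the mapping it builds has the same shape that the proxies argument of requests.get accepts. The helper names build_proxies and make_opener and the address 192.0.2.10 are placeholders of mine, not part of the original crawler.

```python
import urllib.request


def build_proxies(ip, port, scheme='http'):
    """Return a mapping from URL scheme to proxy URL."""
    proxy_url = '%s://%s:%d' % (scheme, ip, port)
    return {'http': proxy_url, 'https': proxy_url}


def make_opener(proxies):
    """Build a urllib opener that routes its traffic through the given proxies."""
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))


# Placeholder address from a documentation range, not a real proxy.
proxies = build_proxies('192.0.2.10', 8080)
opener = make_opener(proxies)  # building the opener sends no traffic
# opener.open('http://example.com/', timeout=5) would go through the proxy.
print(proxies)
```

With requests, the same dictionary would be passed as `requests.get(url, proxies=proxies, timeout=5)`.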

import pymysql
from fake_useragent import UserAgent
import requests
from bs4 import BeautifulSoup
import threading
import time

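Inspecting the page shows that each proxy's information sits in a tr tag, with the individual fields in td cells.

The plan is to fetch the tr rows inside a custom thread, then write a function that extracts the useful fields from them and returns a list of IP records.

def parse_msg(rows):
    # rows: the <tr> tags of the proxy table, with the header row already removed
    ip_list = []
    for row in rows:
        tds = row.find_all('td')
        # cell 1 holds the IP, cell 2 the port, cell 5 the protocol type
        ip, port, typ = tds[1].text, int(tds[2].text), tds[5].text.lower()
        ip_list.append({'ip': ip, 'port': port, 'typ': typ})
    return ip_list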
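The cell positions used above (1 for the IP, 2 for the port, 5 for the type) can be checked offline against a hand-written table fragment. The sketch below swaps BeautifulSoup for html.parser from the standard library so it has no dependencies. The sample HTML is made up to mirror the layout described above, and RowCollector and rows_to_proxies are hypothetical helper names.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cells = None   # cells of the row being read, or None outside a row
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._cells = []
        elif tag == 'td' and self._cells is not None:
            self._cells.append('')   # start a cell even if it turns out empty
            self._in_td = True

    def handle_data(self, data):
        if self._in_td:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_td = False
        elif tag == 'tr' and self._cells is not None:
            if self._cells:          # header rows use <th>, so they have no cells
                self.rows.append(self._cells)
            self._cells = None


def rows_to_proxies(rows):
    """Turn raw cell lists into the same dictionaries parse_msg produces."""
    return [
        {'ip': c[1].strip(), 'port': int(c[2]), 'typ': c[5].strip().lower()}
        for c in rows if len(c) > 5
    ]


# Invented sample data; the addresses come from a documentation range.
SAMPLE = """
<table>
  <tr><th>Country</th><th>IP</th><th>Port</th><th>Location</th><th>Anonymity</th><th>Type</th></tr>
  <tr><td></td><td>192.0.2.1</td><td>8080</td><td>n/a</td><td>high</td><td>HTTP</td></tr>
  <tr><td></td><td>192.0.2.2</td><td>3128</td><td>n/a</td><td>high</td><td>HTTPS</td></tr>
</table>
"""

collector = RowCollector()
collector.feed(SAMPLE)
parsed = rows_to_proxies(collector.rows)
print(parsed)
```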

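The IPs collected at this point vary widely in quality, so each one still has to be tested by making a request through it.

def check_ip(ip, proxies_list):
    try:
        proxy = get_headers_proxy(ip)
        url = 'https://www.ipip.net/'
        r = requests.get(url, headers={'User-Agent': UserAgent().random}, proxies=proxy, timeout=5)
        r.raise_for_status()
    except Exception:
        # any failure (timeout, refused connection, bad status) means the proxy is unusable
        pass
    else:
        proxies_list.append(ip)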

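Here get_headers_proxy builds the proxy mapping in the form that requests expects:

def get_headers_proxy(dic):
    # e.g. 'http://1.2.3.4:8080', used for both http and https targets
    s = dic['typ'] + '://' + dic['ip'] + ':' + str(dic['port'])
    return {'http': s, 'https': s}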

The usable IPs are then saved to the database:

def save_mysql(ip_list):
    conn = pymysql.connect(host='localhost', user='root', passwd='root', db='python', charset="utf8")
    cursor = conn.cursor()
    cursor.execute('SET NAMES utf8;')
    cursor.execute('SET CHARACTER SET utf8;')
    cursor.execute('SET character_set_connection=utf8;')
    # parameterized query: the driver fills in the values
    query = """insert into proxy_ip(ip,port,typ)values(%s,%s,%s)"""
    for item in ip_list:
        values = (item['ip'], item['port'], item['typ'])
        cursor.execute(query, values)
    conn.commit()
    cursor.close()
    conn.close()
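Since the table has to be refreshed often, repeated runs will try to insert the same proxy more than once. One possible way to handle that is a composite key on (ip, port) with an insert that replaces existing rows. The sketch below uses sqlite3 from the standard library rather than MySQL so that it runs without a database server. The table schema is an assumption, because the original does not show how proxy_ip was created, and save_sqlite is a hypothetical name.

```python
import sqlite3


def save_sqlite(conn, ip_list):
    """Insert proxies, replacing any row with the same (ip, port)."""
    # Assumed schema; the article does not show the real table definition.
    conn.execute(
        'CREATE TABLE IF NOT EXISTS proxy_ip ('
        ' ip TEXT NOT NULL, port INTEGER NOT NULL, typ TEXT NOT NULL,'
        ' PRIMARY KEY (ip, port))'
    )
    rows = [(p['ip'], p['port'], p['typ']) for p in ip_list]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            'INSERT OR REPLACE INTO proxy_ip (ip, port, typ) VALUES (?, ?, ?)',
            rows,
        )


conn = sqlite3.connect(':memory:')
batch = [
    {'ip': '192.0.2.1', 'port': 8080, 'typ': 'http'},
    {'ip': '192.0.2.2', 'port': 3128, 'typ': 'https'},
]
save_sqlite(conn, batch)
save_sqlite(conn, batch)  # a second run replaces rows instead of duplicating them
count = conn.execute('SELECT COUNT(*) FROM proxy_ip').fetchone()[0]
print(count)
```

In MySQL the closest equivalents would be a unique key on the same columns combined with REPLACE INTO or INSERT ... ON DUPLICATE KEY UPDATE.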

Next comes the custom thread class:

class GetThread(threading.Thread):
    def __init__(self, args):
        threading.Thread.__init__(self, args=args)
        self.proxies_list = []

    def run(self):
        # args[0] is the page number to fetch
        url = 'http://www.xicidaili.com/nn/%d' % self._args[0]
        user_agent = UserAgent().random
        headers = {'User-Agent': user_agent}
        r = requests.get(url, headers=headers)
        r.encoding = r.apparent_encoding
        r.raise_for_status()
        soup = BeautifulSoup(r.text, 'lxml')
        ip_msg = soup.find_all('tr')[1:]  # skip the header row
        ip_list = parse_msg(ip_msg)
        # check every candidate in its own thread
        threads = []
        for ip in ip_list:
            t = threading.Thread(target=check_ip, args=[ip, self.proxies_list])
            t.start()
            time.sleep(0.1)
            threads.append(t)
        for t in threads:
            t.join()

    def get_proxies_list(self):
        return self.proxies_list
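Starting one thread per candidate is fine for a single page, but with many pages running at once the thread count grows quickly. A bounded pool is one alternative. The sketch below uses concurrent.futures from the standard library; filter_working is a hypothetical helper, and fake_check stands in for a real network probe so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_working(proxies, check, max_workers=20):
    """Return the proxies for which check(proxy) is true, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(check, proxies))
    return [p for p, ok in zip(proxies, results) if ok]


def fake_check(proxy):
    # Stand-in for a real probe: pretend only proxies on port 8080 respond.
    return proxy['port'] == 8080


# Invented candidates; the addresses come from a documentation range.
candidates = [
    {'ip': '192.0.2.1', 'port': 8080, 'typ': 'http'},
    {'ip': '192.0.2.2', 'port': 3128, 'typ': 'http'},
    {'ip': '192.0.2.3', 'port': 8080, 'typ': 'https'},
]
working = filter_working(candidates, fake_check)
print([p['ip'] for p in working])
```

A real check function passed to filter_working should catch its own exceptions and return False, since an exception raised inside pool.map propagates to the caller.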

Complete code

import pymysql
from fake_useragent import UserAgent
import requests
from bs4 import BeautifulSoup
import threading
import time


def get_headers_proxy(dic):
    # e.g. 'http://1.2.3.4:8080', used for both http and https targets
    s = dic['typ'] + '://' + dic['ip'] + ':' + str(dic['port'])
    return {'http': s, 'https': s}


def parse_msg(rows):
    # rows: the <tr> tags of the proxy table, with the header row already removed
    ip_list = []
    for row in rows:
        tds = row.find_all('td')
        # cell 1 holds the IP, cell 2 the port, cell 5 the protocol type
        ip, port, typ = tds[1].text, int(tds[2].text), tds[5].text.lower()
        ip_list.append({'ip': ip, 'port': port, 'typ': typ})
    return ip_list


def check_ip(ip, proxies_list):
    try:
        proxy = get_headers_proxy(ip)
        url = 'https://www.ipip.net/'
        r = requests.get(url, headers={'User-Agent': UserAgent().random}, proxies=proxy, timeout=5)
        r.raise_for_status()
    except Exception:
        # any failure (timeout, refused connection, bad status) means the proxy is unusable
        pass
    else:
        proxies_list.append(ip)


def save_mysql(ip_list):
    conn = pymysql.connect(host='localhost', user='root', passwd='root', db='python', charset="utf8")
    cursor = conn.cursor()
    cursor.execute('SET NAMES utf8;')
    cursor.execute('SET CHARACTER SET utf8;')
    cursor.execute('SET character_set_connection=utf8;')
    # parameterized query: the driver fills in the values
    query = """insert into proxy_ip(ip,port,typ)values(%s,%s,%s)"""
    for item in ip_list:
        values = (item['ip'], item['port'], item['typ'])
        cursor.execute(query, values)
    conn.commit()
    cursor.close()
    conn.close()


class GetThread(threading.Thread):
    def __init__(self, args):
        threading.Thread.__init__(self, args=args)
        self.proxies_list = []

    def run(self):
        # args[0] is the page number to fetch
        url = 'http://www.xicidaili.com/nn/%d' % self._args[0]
        user_agent = UserAgent().random
        headers = {'User-Agent': user_agent}
        r = requests.get(url, headers=headers)
        r.encoding = r.apparent_encoding
        r.raise_for_status()
        soup = BeautifulSoup(r.text, 'lxml')
        ip_msg = soup.find_all('tr')[1:]  # skip the header row
        ip_list = parse_msg(ip_msg)
        # check every candidate in its own thread
        threads = []
        for ip in ip_list:
            t = threading.Thread(target=check_ip, args=[ip, self.proxies_list])
            t.start()
            time.sleep(0.1)
            threads.append(t)
        for t in threads:
            t.join()

    def get_proxies_list(self):
        return self.proxies_list


if __name__ == '__main__':
    threads = []
    # one thread per listing page, pages 1 to 49, started 3 seconds apart
    for i in range(1, 50):
        t = GetThread(args=[i])
        t.start()
        time.sleep(3)
        threads.append(t)
    for t in threads:
        t.join()
    for t in threads:
        proxies_list = t.get_proxies_list()
        save_mysql(proxies_list)

Results

P.S. After testing, the free proxies really are too unstable. Paying for proxies is still the way to go.
