Python Crawler Example (6): Scraping the Mayi (蚂蚁) Free Proxy List

SQL statement for the database table:

CREATE TABLE `free_ip` (
  `free_ip_id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'primary key',
  `ip` varchar(255) DEFAULT NULL COMMENT 'IP address',
  `port` varchar(255) DEFAULT NULL COMMENT 'port',
  `yini_class` varchar(255) DEFAULT NULL COMMENT 'anonymity level',
  `http_type` varchar(255) DEFAULT NULL COMMENT 'proxy type',
  `response_time` varchar(255) DEFAULT NULL COMMENT 'response time',
  `address` varchar(255) DEFAULT NULL COMMENT 'geographic location',
  `validate_time` varchar(255) DEFAULT NULL COMMENT 'last verified time',
  `hashcode` varchar(255) DEFAULT NULL COMMENT 'deduplication key',
  PRIMARY KEY (`free_ip_id`),
  UNIQUE KEY `hashcode` (`hashcode`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=4220 DEFAULT CHARSET=utf8;
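
The `hashcode` column carries a UNIQUE key, and the script in this post fills it with the MD5 of "ip:port", so a proxy that has already been stored is rejected by the database on the next crawl. The following is a minimal Python 3 sketch of that idea, not the author's code: it assumes an in-memory SQLite table as a stand-in for the MySQL `free_ip` table, and the helper names `proxy_hash` and `insert_proxy` are hypothetical.

```python
import hashlib
import sqlite3


def proxy_hash(ip, port):
    """Return the MD5 hex digest of "ip:port", used as the dedup key."""
    return hashlib.md5("{}:{}".format(ip, port).encode("utf-8")).hexdigest()


def insert_proxy(conn, ip, port):
    """Insert one proxy; return False if its hashcode already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO free_ip (ip, port, hashcode) VALUES (?, ?, ?)",
                (ip, port, proxy_hash(ip, port)),
            )
        return True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint on hashcode rejected a duplicate proxy.
        return False


# Assumption: an in-memory SQLite table stands in for the MySQL free_ip table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE free_ip ("
    "free_ip_id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "ip TEXT, port TEXT, hashcode TEXT UNIQUE)"
)

first = insert_proxy(conn, "10.0.0.1", "8080")   # new row, accepted
second = insert_proxy(conn, "10.0.0.1", "8080")  # same ip:port, rejected
row_count = conn.execute("SELECT COUNT(*) FROM free_ip").fetchone()[0]
```

With MySQL the same duplicate would surface as an exception from `cursor.execute`, which is why the script wraps the insert in a try/except and prints a failure message rather than stopping.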

Source code:

# coding:utf-8
# Note: this script is written for Python 2 (print statements, reload(sys)).
import hashlib
import logging
import logging.handlers
import platform
import re
import sys

import MySQLdb
import requests
import urllib3
from bs4 import BeautifulSoup

urllib3.disable_warnings()

reload(sys)
sys.setdefaultencoding('utf-8')

session = requests.session()

sysStr = platform.system()
if sysStr == "Windows":
    LOG_FILE_check = 'H:\\log\\log.txt'
else:
    LOG_FILE_check = '/log/wlb/crawler/cic.log'

# Create the rotating handler: 128 MB per file, at most ten backup files
handler = logging.handlers.RotatingFileHandler(LOG_FILE_check, maxBytes=128 * 1024 * 1024, backupCount=10)
fmt = '\n' + '%(asctime)s - %(filename)s:%(lineno)s  - %(message)s'
formatter = logging.Formatter(fmt)  # create the formatter
handler.setFormatter(formatter)  # attach the formatter to the handler
logger = logging.getLogger('check')  # get the logger named "check"
logger.addHandler(handler)  # attach the handler to the logger
logger.setLevel(logging.DEBUG)


def md5(text):
    # Return the MD5 hex digest of the given string
    m = hashlib.md5()
    m.update(text)
    return m.hexdigest()


def fetch_page(url, headers):
    # Download the page and re-decode it with the charset declared in the HTML
    result = session.get(url=url, headers=headers).text
    return result.encode('ISO-8859-1').decode(requests.utils.get_encodings_from_content(result)[0])


def freeIp():
    for i in range(1, 1000):
        print "Crawling page:", i
        url = "http://www.ip181.com/daili/" + str(i) + ".html"
        headers = {
            "Host": "www.ip181.com",
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.91 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Referer": url,
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.8",
        }
        try:
            result = fetch_page(url, headers)
        except Exception:
            # Retry once if the first request or decode fails
            result = fetch_page(url, headers)

        soup = BeautifulSoup(result, 'html.parser')
        result_soup = soup.find_all("div", attrs={"class": "col-md-12"})[1]
        # Flatten the HTML so the regular expressions below can match across rows
        result_soup = str(result_soup).replace('\r\n\t', '').replace('\r\n', '').replace('\n\t', '').replace('\n', '').replace(' class="warning"', '')
        # Keep only the table body that follows the header row (the header text is matched as it appears on the page)
        result_soups = re.findall('最近验证时间</td></tr>(.*?)</tbody></table><div class="page">共', result_soup)[0]
        print result_soups
        result_list = re.findall('<tr><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td></tr>', result_soups)

        for item in result_list:
            ip = item[0]
            port = item[1]
            yini_class = item[2]
            http_type = item[3]
            response_time = item[4]
            address = item[5]
            validate_time = item[6]
            proxy = str(ip) + ":" + port
            hashcode = md5(proxy)  # deduplication key, unique in the table
            try:
                # Database connection: replace with your own database settings
                conn = MySQLdb.connect(host="110.110.110.717", user="lg", passwd="456", db="369", charset="utf8")
                cursor = conn.cursor()
                sql = """INSERT INTO free_ip (ip,port,yini_class,http_type,response_time,address,validate_time,hashcode) VALUES (%s,%s,%s,%s,%s,%s,%s,%s)"""
                params = (ip, port, yini_class, http_type, response_time, address, validate_time, hashcode)
                cursor.execute(sql, params)
                conn.commit()
                cursor.close()
                conn.close()
                print "          Insert succeeded      "
            except Exception as e:
                # A duplicate hashcode also ends up here, because of the UNIQUE key
                print "********Insert failed********"
                print e


freeIp()
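
The row-extraction step in the script can be tried in isolation. The following is a minimal sketch in Python 3 (the script itself targets Python 2) that applies the same seven-column row pattern to a small HTML fragment. The fragment is hypothetical and only shaped like the table the script expects, the helper name `parse_rows` is an assumption, and the whitespace stripping is a simplified version of the script's chain of `replace` calls.

```python
import re

# Same seven-column row pattern as the script uses.
ROW_PATTERN = re.compile(
    r"<tr><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td>"
    r"<td>(.*?)</td><td>(.*?)</td><td>(.*?)</td></tr>"
)
FIELDS = ("ip", "port", "yini_class", "http_type",
          "response_time", "address", "validate_time")


def parse_rows(table_html):
    """Flatten the HTML, drop the "warning" class, and return rows as dicts."""
    flat = re.sub(r"[\r\n\t]", "", table_html).replace(' class="warning"', "")
    return [dict(zip(FIELDS, row)) for row in ROW_PATTERN.findall(flat)]


# Hypothetical fragment for illustration only (not real proxy data).
sample = """
<tr class="warning"><td>10.0.0.1</td><td>8080</td><td>high</td><td>HTTP</td>
<td>0.5s</td><td>example</td><td>2017-01-01 00:00:00</td></tr>
<tr><td>10.0.0.2</td><td>3128</td><td>transparent</td><td>HTTPS</td>
<td>1.2s</td><td>example</td><td>2017-01-01 00:05:00</td></tr>
"""
rows = parse_rows(sample)
```

A pattern like this depends on the exact markup: an extra attribute on any `<td>`, or a changed column count, makes the row silently disappear from the result, so it is worth checking the number of parsed rows against what the page shows.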

Crawl results:

Reposted from: https://www.cnblogs.com/xuchunlin/p/6774414.html
