Simulated Login and Page Scraping over HTTPS with Python

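My earlier attempts never worked because I was not using the HTTPS-specific functions. After looking into it more carefully this time, I found a few points to watch. First, when simulating the login with a POST, the cookie value that must go in the request header probably differs from site to site. Second, when fetching a page with GET, the request has to carry the Set-Cookie value returned in the POST response. Only then does the GET request actually benefit from the successful login.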
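The second point can be sketched as follows. A Set-Cookie value carries attributes such as `expires` and `Path` after the actual `name=value` pair, and only the pair belongs in the `Cookie` request header. This is a minimal Python 3 sketch, not the code used in this post: the function name `cookie_header_from_set_cookie` and the sample cookie names and values are made up for illustration.

```python
def cookie_header_from_set_cookie(set_cookie_values):
    """Build a Cookie request header from a list of Set-Cookie header values.

    Only the leading name=value pair of each value is kept; attributes
    such as expires, Max-Age, Path and HttpOnly are dropped.
    """
    pairs = []
    for raw in set_cookie_values:
        first = raw.split(";", 1)[0].strip()
        if "=" in first:
            pairs.append(first)
    return "; ".join(pairs)


# Hypothetical values, shaped like what a login response might return.
example = [
    "csrftoken=abc123; expires=Thu, 04-Jun-2015 08:00:00 GMT; Max-Age=31449600; Path=/",
    "sessionid=xyz789; HttpOnly; Path=/",
]
print(cookie_header_from_set_cookie(example))
# prints: csrftoken=abc123; sessionid=xyz789
```

With Python 3's `http.client`, the list of values would typically come from `response.headers.get_all("Set-Cookie")`, which returns one entry per header instead of a single joined string.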

After writing the POST and GET parts, I also added a simple command-line interface.

import httplib, urllib
import urllib2      # not used below
import cookielib    # not used below
import sys

file_text = "build_change.txt"
resultTable = dict()
host = 'buuuuuuu.knight.com'

def Login(username, password, csrf='Gy2O70iSjOTbWhWgBLvf4HDuf4jUe4RP'):
    # POST the login form. The csrf token is sent both in the form body
    # and in the Cookie header.
    url = '/login/'
    values = {
        'username': username,
        'password': password,
        'next': '',
        'csrfmiddlewaretoken': csrf,
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Connection': 'keep-alive',
        'Cookie': 'csrftoken=%s' % csrf,
        'Referer': 'https://buuuuuuu.knight.com/login/',
        'Origin': 'https://buuuuuuu.knight.com',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    }
    values = urllib.urlencode(values)
    conn = httplib.HTTPSConnection(host, 443)
    conn.request("POST", url, values, headers)
    response = conn.getresponse()
    print 'Login: ', response.status, response.reason
    # Uncomment to dump the response headers while debugging:
    # hdata = response.getheaders()
    # for i in xrange(len(hdata)):
    #     for j in xrange(len(hdata[i])):
    #         print hdata[i][j],
    #     print
    return response.getheader("set-cookie")

def GetHtml(_url, cookie):
    # GET a page, sending back the cookie returned by the login POST.
    get_headers = {
        'Host': host,  # the original listing had a differently masked host name here
        'Connection': 'keep-alive',
        'Cache-Control': 'max-age=0',
        'Cookie': cookie,
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36',
        'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
    }
    conn = httplib.HTTPSConnection(host)
    conn.request("GET", _url, None, get_headers)
    res2 = conn.getresponse()
    print "Get %s:" % _url, res2.status, res2.reason
    data = res2.read()
    fp = open(file_text, "w")
    fp.write(data)
    fp.close()

def ParseHtml():
    # Collect every block from a line containing class="change-body"
    # up to the next line containing </div>.
    fp = open(file_text, "r")
    content = fp.readline()
    _pos = 0
    while content:
        if content.find("class=\"change-body\"") >= 0:
            topic = content.split(">")
            resultTable[_pos] = topic[1]
            while content:
                content = fp.readline()
                resultTable[_pos] = resultTable[_pos] + content
                if content.find("</div>") >= 0:
                    _pos = _pos + 1
                    break
        content = fp.readline()
    fp.close()
    print "Parse html success."

def GenerateResultTxt():
    f = open("build_change_result.txt", "w")
    for m in resultTable.keys():
        f.write("-------------------------------------------------------------------------------------------\n")
        f.write(resultTable[m])
    f.close()
    print "Generate result success : build_change_result.txt ."

def Help():
    print '-h    :    help'
    print '-u    :    username(must)'
    print '-p    :    password(must)'
    print '-c    :    csrftoken(optional)'
    print '-s    :    sandbox build id(must)'
    print 'For example:'
    print '[1]  python BuildChange.py -h'
    print '[2]  python BuildChange.py -u u -p p -s s1 s2'
    print '[3]  python BuildChange.py -u u -p p -c c -s s1 s2'

def ParseParam(com):
    length = len(com)
    username = ""
    password = ""
    csrf = ""
    sid1 = ""
    sid2 = ""
    # Accepted argv lengths: 2 (-h), 8 (-u -p -s), 10 (-u -p -c -s).
    if length == 2 or length == 8 or length == 10:
        if com[1] == '-h':
            Help()
            return  # do not fall through to the parameter error below
        for i in range(1, length):
            if com[i] == '-u' and i < (length-1):
                username = com[i+1]
            elif com[i] == '-p' and i < (length-1):
                password = com[i+1]
            elif com[i] == '-c' and i < (length-1):
                csrf = com[i+1]
            elif com[i] == '-s' and i < (length-2):
                sid1 = com[i+1]
                sid2 = com[i+2]
    if username == "" or password == "" or sid1 == "" or sid2 == "":
        print '[Error] Parameter error!'
        print '[Error] Run "python BuildChange.py -h" to see how to use this script.'
    else:
        if csrf == "":
            cookie = Login(username, password)
        else:
            cookie = Login(username, password, csrf)
        _url = "//changelog//between//%s//and//%s/" % (sid1, sid2)
        GetHtml(_url, cookie)
        ParseHtml()
        GenerateResultTxt()

# C:\Python27\python.exe C:\Users\knight\Desktop\build\BuildChange.py -u xux -p KKKKKKKK -s 1859409 1858525
if __name__ == "__main__":
    ParseParam(sys.argv)
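The hand-written option loop in ParseParam could also be expressed with the standard `argparse` module, which generates the `-h` text and the "missing argument" errors on its own. The following is a minimal Python 3 sketch under that assumption, not part of the original script: `build_parser` and the `dest` names are hypothetical, and only the flags listed in Help() are taken from the post.

```python
import argparse


def build_parser():
    """Return a parser for the -u / -p / -c / -s flags described in Help()."""
    parser = argparse.ArgumentParser(
        prog="BuildChange.py",
        description="Log in and fetch the change log between two builds.",
    )
    parser.add_argument("-u", dest="username", required=True, help="username")
    parser.add_argument("-p", dest="password", required=True, help="password")
    parser.add_argument("-c", dest="csrf", default=None,
                        help="csrftoken (optional)")
    # -s takes exactly two values: the two sandbox build ids.
    parser.add_argument("-s", dest="build_ids", nargs=2, required=True,
                        metavar=("ID1", "ID2"), help="sandbox build ids")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch
# can be run as-is.
args = build_parser().parse_args(["-u", "u", "-p", "p", "-s", "1859409", "1858525"])
print(args.username, args.build_ids)
# prints: u ['1859409', '1858525']
```

In a real script the call would be `build_parser().parse_args()` with no argument, so that `sys.argv` is used, and `args.csrf` would be passed to Login only when it is not `None`.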

Reposted from: https://blog.51cto.com/xuxueliang/1422522
