Python: Collecting Data from Sina Weibo with a Crawler

If you try to collect Sina Weibo data without simulating a login, this is what the server returns:

<!DOCTYPE html>
<html>
<head><meta http-equiv="Content-type" content="text/html; charset=gb2312"/><title>Sina Visitor System</title>
</head>
<body>
<span id="message"></span>
<script type="text/javascript" src="/js/visitor/mini.js"></script>
<script type="text/javascript">
    window.use_fp = "1" == "1"; // Whether to collect a device fingerprint.
    var url = url || {};
    (function () {
        this.l = function (u, c) {
            try {
                var s = document.createElement("script");
                s.type = "text/javascript";
                s[document.all ? "onreadystatechange" : "onload"] = function () {
                    if (document.all && this.readyState != "loaded" && this.readyState != "complete") {
                        return
                    }
                    this[document.all ? "onreadystatechange" : "onload"] = null;
                    this.parentNode.removeChild(this);
                    if (c) {
                        c()
                    }
                };
                s.src = u;
                document.getElementsByTagName("head")[0].appendChild(s)
            } catch (e) {
            }
        };
    }).call(url);

    // Entry point of the flow.
    wload(function () {
        try {
            var need_restore = "1" == "1"; // Whether to run the identity-restore flow.
            // If the restore flow is needed, try to read the user identity from the cookie.
            if (!need_restore || !Store.CookieHelper.get("SRF")) {
                // If that fails, run the create-visitor flow.
                // If the flow takes too long (more than 3 s), treat it as an error.
                var error_timeout = window.setTimeout("error_back()", 3000);
                tid.get(function (tid, where, confidence) {
                    // The fingerprint was obtained successfully; clear the error timeout.
                    window.clearTimeout(error_timeout);
                    incarnate(tid, where, confidence);
                });
            } else {
                // The user identity exists; try to restore it.
                restore();
            }
        } catch (e) {
            // Error.
            error_back();
        }
    });

    // "Return" callback.
    var return_back = function (response) {
        if (response["retcode"] == 20000000) {
            back();
        } else {
            // Error.
            error_back(response["msg"]);
        }
    };

    // Redirect back to the original address.
    var back = function() {
        var url = "http://weibo.com/zhaoliying?is_search=0&visible=0&is_tag=0&profile_ftype=1&page=2";
        if (url != "none") {
            window.location.href = url;
        }
    };

    // Cross-domain broadcast.
    var cross_domain = function (response) {
        var from = "weibo";
        if (response["retcode"] == 20000000) {
            var crossdomain_host = "login.sina.com.cn";
            if (crossdomain_host != "none") {
                var cross_domain_intr = window.location.protocol + "//" + crossdomain_host + "/visitor/visitor?a=crossdomain&cb=return_back&s=" +
                    encodeURIComponent(response["data"]["sub"]) + "&sp=" + encodeURIComponent(response["data"]["subp"]) + "&from=" + from + "&_rand=" + Math.random();
                url.l(cross_domain_intr);
            } else {
                back();
            }
        } else {
            // Error.
            error_back(response["msg"]);
        }
    };

    // Give the user a visitor identity.
    var incarnate = function (tid, where, conficence) {
        var gen_conf = "";
        var from = "weibo";
        var incarnate_intr = window.location.protocol + "//" + window.location.host + "/visitor/visitor?a=incarnate&t=" +
            encodeURIComponent(tid) + "&w=" + encodeURIComponent(where) + "&c=" + encodeURIComponent(conficence) +
            "&gc=" + encodeURIComponent(gen_conf) + "&cb=cross_domain&from=" + from + "&_rand=" + Math.random();
        url.l(incarnate_intr);
    };

    // Restore the user's lost identity.
    var restore = function () {
        var from = "weibo";
        var restore_intr = window.location.protocol + "//" + window.location.host +
            "/visitor/visitor?a=restore&cb=restore_back&from=" + from + "&_rand=" + Math.random();
        url.l(restore_intr);
    };

    // Cross-domain restore of the lost identity.
    var restore_back = function (response) {
        // If the identity is restored, run the broadcast flow; otherwise run the create-visitor flow.
        if (response["retcode"] == 20000000) {
            var url = "http://weibo.com/zhaoliying?is_search=0&visible=0&is_tag=0&profile_ftype=1&page=2";
            var alt = response["data"]["alt"];
            var savestate = response["data"]["savestate"];
            if (alt != "") {
                requrl = (url == "none") ? "" : "&url=" + encodeURIComponent(url);
                var params = "entry=sso&alt=" + encodeURIComponent(alt) + "&returntype=META" +
                    "&gateway=1&savestate=" + encodeURIComponent(savestate) + requrl;
                window.location.href = "http://login.sina.com.cn/sso/login.php?" + params;
            } else {
                cross_domain(response);
            }
        } else {
            tid.get(function (tid, where, confidence) {
                incarnate(tid, where, confidence);
            });
        }
    };

    // On error, go back to the login page.
    var error_back = function (msg) {
        var url = "http://weibo.com/zhaoliying?is_search=0&visible=0&is_tag=0&profile_ftype=1&page=2";
        if (url != "none") {
            if (url.indexOf("ssovie4c55=0") === -1) {
                url += (((url.indexOf("?") === -1) ? "?" : "&") + "ssovie4c55=0");
            }
            window.location.href = "http://weibo.com/login.php";
        } else {
            if(document.getElementById("message")) {
                document.getElementById("message").innerHTML = "Error occurred" + (msg ? (": " + msg) : ".");
            }
        }
    }
</script>
</body>
</html>

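No data can be collected this way at all: the response is the Sina Visitor System page, not the profile page that was requested.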
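A crawler can notice this situation by checking the page title. The following is a minimal sketch, assuming the interstitial always carries the title "Sina Visitor System" as in the response above; the helper name `is_visitor_page` is hypothetical and not part of the original code.

```python
import re


def is_visitor_page(html):
    """Return True if the HTML looks like the Sina Visitor System page."""
    # Assumption: the interstitial is identified by its <title>, as in the dump above.
    match = re.search(r"<title>(.*?)</title>", html, re.S | re.I)
    return bool(match) and match.group(1).strip() == "Sina Visitor System"


# Hypothetical responses used only for illustration.
blocked = "<html><head><title>Sina Visitor System</title></head><body></body></html>"
normal = "<html><head><title>Some profile page</title></head><body></body></html>"
print(is_visitor_page(blocked), is_visitor_page(normal))
```

If the check returns True, the request was made without a valid login session and the body contains no Weibo posts.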

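First, log in to Sina Weibo with a normal account at https://login.sina.com.cn/signup/signin.php?entry=sso, as shown in the figure:

Then download Http Analyzer (download link: http://download.csdn.net/detail/u010343650/9665839) and capture the traffic for analysis, as shown in the figure:

From the capture, or from the link provided by the blog author, we can see:

The four values above (servertime, nonce, pubkey, rsakv) can be obtained in code like this: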

import json
import re

import requests


def getLoginInfo():
    preLoginURL = r'http://login.sina.com.cn/sso/prelogin.php?entry=weibo&callback=sinaSSOController.preloginCallBack&su=&rsakt=mod&client=ssologin.js(v1.4.18)'
    html = requests.get(preLoginURL).text
    # The response is JSONP: take the JSON object between the parentheses
    jsonStr = re.findall(r'\((\{.*?\})\)', html)[0]
    data = json.loads(jsonStr)
    servertime = data["servertime"]
    nonce = data["nonce"]
    pubkey = data["pubkey"]
    rsakv = data["rsakv"]
    return servertime, nonce, pubkey, rsakv
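The parsing step can be tried without touching the network. The sketch below applies the same idea to a made-up JSONP string. The field values are placeholders rather than real server output, and `parse_prelogin` is a hypothetical helper name.

```python
import json
import re


def parse_prelogin(jsonp_text):
    """Extract the four login parameters from a JSONP pre-login response."""
    # Grab the JSON object wrapped in callback( ... ).
    match = re.search(r"\((\{.*\})\)", jsonp_text, re.S)
    if match is None:
        raise ValueError("no JSON payload found in pre-login response")
    data = json.loads(match.group(1))
    return data["servertime"], data["nonce"], data["pubkey"], data["rsakv"]


# Placeholder response: the values are invented for illustration only.
sample = (
    'sinaSSOController.preloginCallBack({"retcode":0,"servertime":1480000000,'
    '"nonce":"ABC123","pubkey":"EB2A","rsakv":"0000000000"})'
)
servertime, nonce, pubkey, rsakv = parse_prelogin(sample)
print(servertime, nonce, pubkey, rsakv)
```

Raising an error when the payload is missing makes a changed response format easier to spot than an IndexError from `re.findall(...)[0]`.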

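With the pre-login values in hand, the next step is to prepare the login request itself.

Logging in requires the servertime, nonce, pubkey and rsakv fields. The capture also shows the script http://i.sso.sina.com.cn/js/ssologin.js. Copy it into a text file, open it in Notepad++ and search for username to see how the credentials are encrypted. As the figure shows, the username is Base64-encoded, and the same file contains the encryption routine for the login password.

So to produce the encrypted password we need four parameters: the original password, servertime, nonce and pubkey. The corresponding functions are: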

import base64
import binascii

import rsa


# Encode the user name; su is the user name field in the POST data
def getSu(username):
    su = base64.b64encode(username.encode('utf-8')).decode('utf-8')
    return su


def getSp(password, servertime, nonce, pubkey):
    pubkey = int(pubkey, 16)
    key = rsa.PublicKey(pubkey, 65537)
    # The plaintext layout below is taken from the JS encryption file
    message = str(servertime) + '\t' + str(nonce) + '\n' + str(password)
    message = message.encode('utf-8')
    sp = rsa.encrypt(message, key)
    # Convert every byte of the binary data to its 2-digit hexadecimal form
    sp = binascii.b2a_hex(sp)
    return sp
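To see what these two functions operate on, the sketch below reproduces only the encoding steps with the standard library: the Base64 user name and the plaintext that is later passed to RSA. It deliberately leaves out the RSA encryption itself. The credentials and the helper names `encode_su` and `build_sp_plaintext` are made up for illustration.

```python
import base64
import binascii


def encode_su(username):
    """Base64-encode the user name, as done for the su field."""
    return base64.b64encode(username.encode("utf-8")).decode("utf-8")


def build_sp_plaintext(password, servertime, nonce):
    """Build the plaintext servertime<TAB>nonce<NEWLINE>password as bytes."""
    return (str(servertime) + "\t" + str(nonce) + "\n" + str(password)).encode("utf-8")


# Invented credentials and server values, for illustration only.
su = encode_su("user@example.com")
plaintext = build_sp_plaintext("secret", 1480000000, "ABC123")
hex_form = binascii.b2a_hex(plaintext)  # same hex step that getSp applies to the ciphertext

print(su)
print(plaintext)
print(hex_form)
```

Decoding `su` with `base64.b64decode` returns the original user name, which shows that su is only an encoding, while sp is the part that is actually encrypted.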

Now go back to http://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.18). This is the address the login data is POSTed to. Below is the data I submitted:

postData = {
    'entry': 'weibo',
    'gateway': '1',
    'from': '',
    'savestate': '7',
    'userticket': '1',
    "pagerefer": "http://weibo.com/zhaoliying?is_search=0&visible=0&is_tag=0&profile_ftype=1&page=1",
    "vsnf": "1",
    "su": su,
    "service": "miniblog",
    "servertime": servertime,
    "nonce": nonce,
    "pwencode": "rsa2",
    "rsakv": rsakv,
    "sp": sp,
    "sr": "1440*900",
    "encoding": "UTF-8",
    "prelt": "126",
    "url": "http://open.weibo.com/ajaxlogin.php?framelogin=1&callback=parent.sinaSSOController.feedBackUrlCallBack",
    "returntype": "META",
}

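After submitting, the problem shown in the figure appears: the login has to be done twice (one login is not enough, a second one is required).

The link inside location.replace has to be parsed out. The parsing method: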
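As a rough illustration of that parsing step, the sketch below pulls the target URL out of a `location.replace(...)` call with a regular expression. The HTML snippet and the URL in it are invented placeholders, and `extract_redirect` is a hypothetical helper name.

```python
import re


def extract_redirect(html):
    """Return the URL passed to location.replace(...), or None if absent."""
    match = re.search(r"location\.replace\(['\"](.*?)['\"]\)", html)
    return match.group(1) if match else None


# Invented response body, for illustration only.
sample_html = (
    "<html><head><script>"
    "location.replace('http://example.com/next?ticket=XYZ&retcode=0');"
    "</script></head></html>"
)
print(extract_redirect(sample_html))
```

Returning None when nothing matches lets the caller tell a failed login apart from a parsing crash.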

Remember to keep the session obtained from the first simulated login and use that same session for the second login. The complete code then looks like this:

# -*- coding: UTF-8 -*-
__author__ = 'Mouse'

import base64
import binascii
import json
import re

import requests
import rsa

agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:41.0) Gecko/20100101 Firefox/41.0'
headers = {
    'User-Agent': agent
}


def get_logininfo():
    preLogin_url = r'http://login.sina.com.cn/sso/prelogin.php?entry=weibo&callback=sinaSSOController.preloginCallBack&' \
                   r'su=&rsakt=mod&client=ssologin.js(v1.4.18)'
    html = requests.get(preLogin_url).text
    jsonStr = re.findall(r'\((\{.*?\})\)', html)[0]
    data = json.loads(jsonStr)
    servertime = data["servertime"]
    nonce = data["nonce"]
    pubkey = data["pubkey"]
    rsakv = data["rsakv"]
    return servertime, nonce, pubkey, rsakv


def get_su(username):
    """Encode the user name; su is the user name field in the POST data."""
    su = base64.b64encode(username.encode('utf-8')).decode('utf-8')
    return su


def get_sp(password, servertime, nonce, pubkey):
    pubkey = int(pubkey, 16)
    key = rsa.PublicKey(pubkey, 65537)
    # The plaintext layout below is taken from the JS encryption file
    message = str(servertime) + '\t' + str(nonce) + '\n' + str(password)
    message = message.encode('utf-8')
    sp = rsa.encrypt(message, key)
    # Convert every byte of the binary data to its 2-digit hexadecimal form
    sp = binascii.b2a_hex(sp)
    return sp


def login(su, sp, servertime, nonce, rsakv):
    post_data = {
        'entry': 'weibo',
        'gateway': '1',
        'from': '',
        'savestate': '7',
        'userticket': '1',
        "pagerefer": "http://weibo.com/zhaoliying?is_search=0&visible=0&is_tag=0&profile_ftype=1&page=1",
        "vsnf": "1",
        "su": su,
        "service": "miniblog",
        "servertime": servertime,
        "nonce": nonce,
        "pwencode": "rsa2",
        "rsakv": rsakv,
        "sp": sp,
        "sr": "1440*900",
        "encoding": "UTF-8",
        "prelt": "126",
        "url": "http://open.weibo.com/ajaxlogin.php?framelogin=1&callback=parent.sinaSSOController.feedBackUrlCallBack",
        "returntype": "META",
    }
    login_url = r'http://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.18)'
    session = requests.Session()
    # First login: the response only contains a location.replace(...) redirect
    res = session.post(login_url, data=post_data, headers=headers)
    html = res.content.decode('gbk')
    info = re.findall(r"location\.replace\(\'(.*?)\'", html)[0]
    print(info)
    # Second login: follow the redirect with the same session
    login_index = session.get(info, headers=headers)
    uuid = login_index.text
    uuid_pa = r'"uniqueid":"(.*?)"'
    uuid_res = re.findall(uuid_pa, uuid, re.S)[0]
    web_weibo_url = "http://weibo.com/%s/profile?topnav=1&wvr=6&is_all=1" % uuid_res
    weibo_page = session.get(web_weibo_url, headers=headers)
    weibo_pa = r'<title>(.*?)</title>'
    userName = re.findall(weibo_pa, weibo_page.content.decode("utf-8", 'ignore'), re.S)[0]
    print('Login succeeded, your user name is: ' + userName)
    return session


# Using the logged-in session, fetch the given URL and return the raw HTML
def get_rawhtml(session, url):
    response = session.get(url)
    content = response.text
    print(content)
    return content  # the return value is the raw HTML content


def crawler(session, number, url):
    for n in range(1, number + 1):
        # Build the page URL in a separate variable so the account name in url is not overwritten
        page_url = 'http://weibo.com/' + url + '?is_search=0&visible=0&is_tag=0&profile_ftype=1&page=' + str(n)
        print("crawler url", page_url)
        content = get_rawhtml(session, page_url)  # fetch the page source
        print("page %d fetched successfully" % n)


def main_crawler(session, url, page_num):
    print("URL", url)
    crawler(session, page_num, url)  # start crawling


if __name__ == '__main__':
    servertime, nonce, pubkey, rsakv = get_logininfo()
    print("servertime is :", servertime)
    print("nonce is :", nonce)
    print("pubkey is :", pubkey)
    print("rsakv is :", rsakv)
    # name = input('Enter your user name: ')
    su = get_su("")
    # password = input('Enter your password: ')
    sp = get_sp("", servertime, nonce, pubkey)
    session = login(su, sp, servertime, nonce, rsakv)
    print("session is ", session)
    weibo_url = "zhaoliying"
    main_crawler(session, weibo_url, 2)
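The crawler requests one URL per page, all following the same pattern. As a small sketch of that pattern, the helper below builds the page URLs with `urllib.parse.urlencode`, keeping the account name separate from the generated address so that every page number yields a well-formed URL. The function name `build_page_url` is hypothetical; the query parameters are the ones used in the code above.

```python
from urllib.parse import urlencode


def build_page_url(user, page):
    """Build the profile URL for one page of a user's posts."""
    query = urlencode({
        "is_search": 0,
        "visible": 0,
        "is_tag": 0,
        "profile_ftype": 1,
        "page": page,
    })
    return "http://weibo.com/" + user + "?" + query


# Pages 1 and 2 of the account used throughout this post.
urls = [build_page_url("zhaoliying", n) for n in range(1, 3)]
for u in urls:
    print(u)
```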

Output of a run:

That completes the simulated login!
