Scraping QQ friends' Qzone posts (说说) with a Python 3 crawler

Development environment: Windows 10

Python 3.6.3

PyCharm 2018.1

Libraries: csv (standard library), requests, pymysql, selenium

Headless browser: PhantomJS

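Main approach

1. Export the friend list from QQ Mail as a file and use csv to extract every friend's QQ number.

2. Use selenium and PhantomJS to simulate a Qzone login.

3. Use the requests library to replay the request.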

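In QQ Mail, click Contacts in the left sidebar, then click the option to export the contact file.

Read the CSV file with the csv module and collect every QQ number.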

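import csv

def get_qq():
    # The exported file keeps the e-mail address in the fourth column;
    # the part before '@' is the QQ number.
    qq_num = []
    with open('qq.csv') as f:
        for row in csv.reader(f):
            qq_num.append(row[3].split('@')[0])
    return qq_num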
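The reader above takes every row at face value, so a header row or a contact with a non-QQ address would end up in the list too. Here is a minimal sketch of a stricter variant. The function name `extract_qq_numbers`, the `email_column` parameter and the sample rows are made up for illustration, and the real export layout may differ.

```python
import csv
import io

def extract_qq_numbers(csv_file, email_column=3):
    """Return the QQ numbers found in the given column of an exported contact file."""
    numbers = []
    for row in csv.reader(csv_file):
        if len(row) <= email_column:
            continue  # short or blank row
        local, _, domain = row[email_column].strip().partition('@')
        # Keep only numeric qq.com addresses; this also drops a header row.
        if domain.lower() == 'qq.com' and local.isdigit():
            numbers.append(local)
    return numbers

# Hypothetical sample data standing in for qq.csv.
sample = io.StringIO(
    "name,group,phone,email\n"
    "Alice,friends,,10001@qq.com\n"
    "Bob,friends,,bob@example.com\n"
    "Carol,work,,20002@qq.com\n"
)
print(extract_qq_numbers(sample))  # ['10001', '20002']
```

Passing an open file object instead of a file name keeps the function easy to try out on in-memory data.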

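Open Qzone with the browser's developer tools open and click "说说" (posts); you will see the browser send a request to https://user.qzone.qq.com/proxy/domain/taotao.qq.com/cgi-bin/emotion_cgi_msglist_v6.

The response is a JSON-like string. total is the total number of posts, so the number of pages can be computed from it.

msglist is the list of posts; each entry is an object whose field names are self-explanatory.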
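Because the body is JSON wrapped in a callback rather than plain JSON, it has to be unwrapped before decoding. The sketch below does that with a regular expression instead of fixed offsets. The helper name `parse_jsonp`, the callback name `_Callback` and the shortened sample body are assumptions for illustration, not a capture of a real response.

```python
import json
import re

def parse_jsonp(text):
    """Strip a callback wrapper such as _Callback(...); and decode the JSON inside."""
    match = re.search(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', text, re.S)
    if match is None:
        raise ValueError('response is not wrapped in a callback')
    return json.loads(match.group(1))

# Hypothetical, heavily shortened response body.
sample = '_Callback({"total": 45, "msglist": [{"content": "hello"}]});'
data = parse_jsonp(sample)
print(data['total'], len(data['msglist']))  # 45 1
```

Slicing a fixed number of characters off each end also works, but only as long as the callback name keeps the same length.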

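Next, look at the HTTP request parameters.

uin and host_uin (sent as hostUin in the code further down) are the QQ number of the space being visited, and pos is (page number - 1) * 20.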
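The paging arithmetic can be sketched as follows, assuming pages are numbered from 1 and every page holds 20 posts. The function names and the sample total are made up for illustration.

```python
import math

PAGE_SIZE = 20  # posts per page

def page_count(total, page_size=PAGE_SIZE):
    """Number of pages needed to hold `total` posts."""
    return math.ceil(total / page_size)

def page_to_pos(page, page_size=PAGE_SIZE):
    """Offset of the first post on a 1-based page: (page - 1) * page_size."""
    if page < 1:
        raise ValueError('page numbers start at 1')
    return (page - 1) * page_size

total = 45  # hypothetical value of the total field
offsets = [page_to_pos(p) for p in range(1, page_count(total) + 1)]
print(offsets)  # [0, 20, 40]
```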

g_tk and qzonetoken are signatures.

g_tk is derived from cookies['p_skey'] with the following algorithm:

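def get_gtk(cookies):
    # "Times 33" string hash over the p_skey cookie, seeded with 5381.
    hashes = 5381
    for letter in cookies['p_skey']:
        hashes += (hashes << 5) + ord(letter)
    # Keep the low 31 bits, as the Qzone page script does.
    return hashes & 0x7fffffff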
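To sanity-check the hash without logging in, here is a small standalone sketch that takes the key as a plain string. The name `compute_g_tk` and the test keys are made up, and the final 31-bit mask is an assumption about how the value is truncated before it is sent.

```python
def compute_g_tk(p_skey):
    """Hash a p_skey string: start at 5381, then hash = hash * 33 + code point."""
    hashes = 5381
    for letter in p_skey:
        hashes += (hashes << 5) + ord(letter)  # same as hashes * 33 + ord(letter)
    # Assumption: the value is truncated to 31 bits before being sent as g_tk.
    return hashes & 0x7fffffff

print(compute_g_tk(''))   # 5381: an empty key leaves the seed untouched
print(compute_g_tk('a'))  # 177670 = 5381 * 33 + 97
```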

qzonetoken can be found in the page's HTML source.

Log in to Qzone with selenium and PhantomJS to obtain the cookies, gtk and qzonetoken.

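import os
import re
import time

from selenium import webdriver

def login_qzone():
    try:
        # PhantomJS support is absent from Selenium 4; an older release is required.
        d = webdriver.PhantomJS()
        d.get('https://qzone.qq.com')
        d.maximize_window()
        time.sleep(1)
        # Screenshot the login page and open the image (works on Windows).
        d.save_screenshot('code.png')
        time.sleep(1)
        os.system('code.png')
        # Give the user 10 seconds to complete the login.
        time.sleep(10)
        html = d.page_source
        qzonetoken = re.search(r'window\.g_qzonetoken = \(function\(\)\{ try\{return "(.*?)";\}', html).group(1)
        cookies = {}
        for cookie in d.get_cookies():
            cookies[cookie['name']] = cookie['value']
        gtk = get_gtk(cookies)
        return cookies, gtk, qzonetoken
    except Exception:
        print('Login failed')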
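The token extraction is the part most likely to break when the page changes, so it helps to try it in isolation. A minimal sketch, assuming the token still sits in a `window.g_qzonetoken = (function(){ try{return "...";}` assignment; the helper name and the sample HTML with its placeholder token are made up.

```python
import re

# Same pattern as in the login code, with the dot escaped and flexible whitespace.
TOKEN_RE = re.compile(
    r'window\.g_qzonetoken\s*=\s*\(function\(\)\{\s*try\{return\s*"(.*?)";\}'
)

def extract_qzonetoken(html):
    """Return the token embedded in the page source, or None if it is absent."""
    match = TOKEN_RE.search(html)
    return match.group(1) if match else None

# Hypothetical fragment of page source with a placeholder token.
sample_html = (
    '<script>window.g_qzonetoken = (function(){ try{return "abc123";} '
    'catch(e) {} })();</script>'
)
print(extract_qzonetoken(sample_html))      # abc123
print(extract_qzonetoken('<html></html>'))  # None
```

Returning None instead of raising lets the caller tell "login page still showing" apart from other failures.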

Get the total number of pages (the request is the same one used to fetch the posts)

import json

import requests

def get_pages(qq, cookies, gtk, qzonetoken):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
    }
    params = {
        "uin": qq,
        "inCharset": "utf-8",
        "outCharset": "utf-8",
        "hostUin": qq,
        "notice": "0",
        "sort": "0",
        "pos": "0",
        "num": "20",
        "cgi_host": "http://taotao.qq.com/cgi-bin/emotion_cgi_msglist_v6",
        "code_version": "1",
        "format": "jsonp",
        "need_private_comment": "1",
        "g_tk": gtk,
        "qzonetoken": qzonetoken
    }
    r = requests.get(url='https://user.qzone.qq.com/proxy/domain/taotao.qq.com/cgi-bin/emotion_cgi_msglist_v6',
                     params=params, headers=headers, cookies=cookies)
    if r.status_code != 200:
        print('Failed to get the post count for {}'.format(qq))
        return 0
    # Strip the callback wrapper (10 leading, 2 trailing characters) to leave plain JSON
    raw_msg = json.loads(r.text[10:-2])
    if not raw_msg['msglist']:
        return 0
    total = raw_msg['total']
    # 20 posts per page, rounded up
    pages = total // 20
    if total % 20 != 0:
        pages += 1
    return pages

Fetch the posts

import json

import requests

def get_talk(qq, page, cookies, gtk, qzonetoken):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
    }
    params = {
        "uin": qq,
        "inCharset": "utf-8",
        "outCharset": "utf-8",
        "hostUin": qq,
        "notice": "0",
        "sort": "0",
        # Offset of the first post on this page: (page - 1) * 20
        "pos": str((page - 1) * 20),
        "num": "20",
        "cgi_host": "http://taotao.qq.com/cgi-bin/emotion_cgi_msglist_v6",
        "code_version": "1",
        "format": "jsonp",
        "need_private_comment": "1",
        "g_tk": gtk,
        "qzonetoken": qzonetoken
    }
    r = requests.get(url='https://user.qzone.qq.com/proxy/domain/taotao.qq.com/cgi-bin/emotion_cgi_msglist_v6',
                     params=params, headers=headers, cookies=cookies)
    if r.status_code != 200:
        print('Failed to get page {} of posts for {}'.format(page, qq))
        return []
    # Strip the callback wrapper (10 leading, 2 trailing characters) to leave plain JSON
    raw_msg = json.loads(r.text[10:-2])
    msglist = raw_msg['msglist']
    return msglist

Clean the data

import re
import time

def clean_data(qq, msglist):
    item_list = []
    try:
        for msg in msglist:
            item = {}
            item['content'] = msg['content']
            # Keep only Chinese characters, for word segmentation later
            item['filiter_content'] = ''.join(re.findall(r'[\u4e00-\u9fa5]+', msg['content']))
            item['create_time'] = time.localtime(msg['created_time'])
            if msg['lbs']['id']:
                item['location'] = '{},{}'.format(msg['lbs']['pos_x'], msg['lbs']['pos_y'])
            else:
                item['location'] = ''
            item['source'] = msg['source_name']
            item_list.append(item)
    except Exception:
        # Some users hide their posts, in which case msglist is None
        pass
    save(qq, item_list)
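The two transformations in the cleaning step are easy to try on their own. The sketch below uses the same character range for the Chinese-only filter, but formats the timestamp as a string with `datetime` rather than keeping the `time.localtime` struct. That is a swapped-in alternative, not what the code above does. The helper names and the sample post are made up.

```python
import re
from datetime import datetime

CHINESE_RUN = re.compile(r'[\u4e00-\u9fa5]+')

def keep_chinese(text):
    """Concatenate every run of characters in the U+4E00..U+9FA5 range."""
    return ''.join(CHINESE_RUN.findall(text))

def to_datetime_string(epoch_seconds):
    """Format a Unix timestamp as local time in 'YYYY-MM-DD HH:MM:SS' form."""
    return datetime.fromtimestamp(epoch_seconds).strftime('%Y-%m-%d %H:%M:%S')

# Hypothetical post, with only the two fields used here.
msg = {'content': '今天天气good!很好123', 'created_time': 1525000000}
print(keep_chinese(msg['content']))             # 今天天气很好
print(to_datetime_string(msg['created_time']))  # value depends on the local time zone
```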

Save to the database

import pymysql

def save(qq, item_list):
    # Keyword arguments work across PyMySQL versions; positional ones do not.
    db = pymysql.connect(host='localhost', user='root', password='1234', database='db1', charset='utf8mb4')
    cursor = db.cursor(pymysql.cursors.DictCursor)
    sql = "insert into qqtalk (qq, content, filiter_content, create_time, source, location) values (%s,%s,%s,%s,%s,%s)"
    for item in item_list:
        cursor.execute(sql, (
            qq, item['content'], item['filiter_content'], item['create_time'], item['source'], item['location']))
    # Commit once after all rows for this friend have been inserted
    db.commit()
    cursor.close()
    db.close()
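The pieces above still need a loop that visits each friend and each page in turn. Here is a minimal sketch of such a sequential driver. It takes the fetching and handling steps as callables so it can run without network access. `crawl_all`, its parameters and the stand-in lambdas are all hypothetical; in a real run the callables would wrap the functions from this post with the cookies, gtk and qzonetoken already bound.

```python
import time

def crawl_all(qq_numbers, get_pages, get_talk, handle, delay=0.0):
    """Visit every friend sequentially, page by page; return the number of pages handled.

    get_pages(qq) -> int, get_talk(qq, page) -> list or None and
    handle(qq, msglist) -> None are supplied by the caller.
    """
    fetched = 0
    for qq in qq_numbers:
        pages = get_pages(qq)
        for page in range(1, pages + 1):
            msglist = get_talk(qq, page)
            if msglist:
                handle(qq, msglist)
                fetched += 1
            if delay:
                time.sleep(delay)  # optional pause between requests
    return fetched

# Stand-in callables so the sketch runs offline.
fake_totals = {'10001': 45, '20002': 0}
collected = []
n = crawl_all(
    fake_totals,
    get_pages=lambda qq: -(-fake_totals[qq] // 20),  # ceiling division
    get_talk=lambda qq, page: [{'content': 'post'}],
    handle=lambda qq, msglist: collected.append((qq, len(msglist))),
)
print(n, collected)  # 3 [('10001', 1), ('10001', 1), ('10001', 1)]
```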

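In total I crawled 600 friends without multithreading; it took close to 3 hours and yielded about 190,000 posts.

I used the collected location data to draw a heat map.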
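A heat map needs counts per area rather than raw points. One possible preprocessing step, sketched under the assumption that each stored location is an 'x,y' string of two numbers (or empty), is to snap points to a grid and count them. The function name, the cell size and the sample coordinates are made up, and the plotting itself is left to whatever tool is used.

```python
from collections import Counter

def bin_locations(locations, cell=0.5):
    """Count posts per grid cell; each location is an 'x,y' string or ''."""
    counts = Counter()
    for loc in locations:
        if not loc:
            continue  # posts without location data
        try:
            x_text, y_text = loc.split(',')
            x, y = float(x_text), float(y_text)
        except ValueError:
            continue  # malformed entry
        # Snap each coordinate down to the corner of its cell.
        key = (round(x // cell * cell, 6), round(y // cell * cell, 6))
        counts[key] += 1
    return counts

# Hypothetical coordinates in the 'x,y' form stored by the cleaning step.
sample = ['116.40,39.90', '116.45,39.95', '', '121.47,31.23', 'bad']
for cell_origin, n in bin_locations(sample).most_common():
    print(cell_origin, n)
```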
