Crawling all the posts of a Sina Weibo big-V account with Python

Related GitHub repository: https://github.com/KaguraTyan/web_crawler

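When crawling a website, the mobile (m) site is usually the first choice, the WAP site the second, and the PC site the last, because the PC site has the most verification checks. This is not absolute: sometimes the PC site carries the most complete information, and if you need all of it, the PC site is the better choice. Mobile sites generally use a domain that begins with m, so this time we analyze Weibo's HTTP requests through m.weibo.cn.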
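As a small illustration of what those requests look like, the sketch below builds the m.weibo.cn API URL that the full code in this article requests. The endpoint and the parameter names (type, value, containerid, page) are taken from that code; the helper name build_index_url and the placeholder container ID are hypothetical and only serve the example.

```python
from urllib.parse import urlencode

# Endpoint used by the full crawler code in this article
API_BASE = "https://m.weibo.cn/api/container/getIndex"


def build_index_url(uid, containerid=None, page=None):
    """Build the getIndex URL for a user; containerid and page are optional."""
    params = {"type": "uid", "value": uid}
    if containerid is not None:
        params["containerid"] = containerid
    if page is not None:
        params["page"] = page
    return API_BASE + "?" + urlencode(params)


# User info request (no containerid yet)
print(build_index_url("1806732505"))
# Post list request; EXAMPLE_CONTAINER_ID is a placeholder, not a real ID
print(build_index_url("1806732505", containerid="EXAMPLE_CONTAINER_ID", page=2))
```

Building the query string with urlencode rather than string concatenation keeps the URL valid if a parameter value ever contains characters that need escaping.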

Preparation

1. Environment

python 3
win10
chrome
urllib
json
xlwt
time
os

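2. Proxy IPs

Using proxy IPs is one way for a crawler to get around anti-crawling measures. Many sites count how many times a single IP visits within a given period and ban that IP when the count is too high (for example, to prevent vote rigging). A crawler can therefore configure several proxies and switch between them at intervals; even if one of them is banned, the others can still be used to finish the job. In the urllib.request library, a proxy server is configured through ProxyHandler. There are many free proxy IP pools online, such as 西刺免费代理IP (http://www.xicidaili.com); choose one according to your needs. Free proxies are generally only suitable for personal crawling, though: the same IP may be used by many people at once, it tends to stay usable only briefly, it is slow, and its anonymity is poor. Professional crawler engineers and crawling companies need higher-quality private proxies, which are usually bought from a dedicated provider and then used with username/password authorization.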
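For the username/password case mentioned above, a minimal sketch with urllib.request follows. It assumes the provider accepts credentials embedded in the proxy URL, which ProxyHandler turns into a Proxy-Authorization header. The helper name make_proxy_handler, the host proxy.example.com, and the credentials are all hypothetical.

```python
import urllib.request
from urllib.parse import quote


def make_proxy_handler(host, port, user=None, password=None):
    """Return a ProxyHandler for one proxy, with optional credentials."""
    if user is not None and password is not None:
        # Percent-encode the credentials so characters such as "@" or ":"
        # do not break the proxy URL
        credentials = "{}:{}@".format(quote(user, safe=""), quote(password, safe=""))
    else:
        credentials = ""
    proxy_url = "http://{}{}:{}".format(credentials, host, port)
    # Register the proxy for both schemes so https requests also go through it
    return urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})


# Hypothetical proxy and credentials, for illustration only
handler = make_proxy_handler("proxy.example.com", 8080, "user", "p@ss")
opener = urllib.request.build_opener(handler)
print(handler.proxies["http"])
```

Requests sent with opener.open() would then go through the authenticated proxy; nothing is sent over the network until that call is made.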

Using a single proxy IP

#!/usr/bin/python
# -*- coding: UTF-8 -*-
import urllib.request

url = "https://www.douban.com/"
header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36"}

# Build two proxy handlers: one with a proxy IP and one without.
# The target URL is https, so the proxy is registered under "https" as well;
# with only an "http" entry, urllib would not use the proxy for this request.
proxy_addr = "61.135.217.7:80"
httpproxy_handler = urllib.request.ProxyHandler({"http": proxy_addr, "https": proxy_addr})
nullproxy_handler = urllib.request.ProxyHandler({})

proxySwitch = True  # proxy switch

# Create a custom opener from a handler with urllib.request.build_opener(),
# choosing the proxy mode according to the switch
if proxySwitch:
    opener = urllib.request.build_opener(httpproxy_handler)
else:
    opener = urllib.request.build_opener(nullproxy_handler)

request = urllib.request.Request(url, headers=header)

# Method 1: only requests sent with opener.open() use the custom proxy;
# urlopen() does not.
response = opener.open(request)

# Method 2: urllib.request.install_opener(opener) installs the opener globally.
# After that, requests sent with either opener.open() or urlopen() use the custom proxy.
# urllib.request.install_opener(opener)
# response = urllib.request.urlopen(request)

data = response.read().decode('utf-8', 'ignore')
print(data)

Picking a proxy at random from a list of proxy IPs

#!/usr/bin/python
# -*- coding: UTF-8 -*-
import urllib.request
import random

url = "https://www.douban.com/"
header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36"}

# List of proxy IPs to draw from
proxy_list = [
    "220.168.52.245:55255",
    "124.193.135.242:54219",
    "36.7.128.146:52222",
]

# Pick one proxy at random
proxy_addr = random.choice(proxy_list)
print(proxy_addr)

# Build the proxy handler from the chosen proxy.
# The target URL is https, so the proxy is registered for both schemes.
httpproxy_handler = urllib.request.ProxyHandler({"http": proxy_addr, "https": proxy_addr})
opener = urllib.request.build_opener(httpproxy_handler)
request = urllib.request.Request(url, headers=header)
response = opener.open(request)
data = response.read().decode('utf-8', 'ignore')
print(data)
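The idea that other proxies can take over when one is banned can be sketched as a fallback loop. This is only one possible approach, not the author's code: the function names fetch_with_fallback and fetch_via_proxy are hypothetical, and the proxy entries are assumed to be dicts in the form that ProxyHandler accepts.

```python
import random
import urllib.request


def fetch_via_proxy(url, proxy, timeout=10):
    """Fetch url through one proxy dict, e.g. {"http": "host:port"}."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxy))
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8", "ignore")


def fetch_with_fallback(url, proxies, fetch=fetch_via_proxy):
    """Try the proxies in random order until one of them succeeds."""
    candidates = list(proxies)
    random.shuffle(candidates)
    last_error = None
    for proxy in candidates:
        try:
            return fetch(url, proxy)
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            last_error = exc
    raise RuntimeError("all proxies failed") from last_error
```

Passing the fetch function as a parameter keeps the rotation logic separate from the network call, so the loop can be exercised with a stub before any real proxy is involved.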

Full code

'''
Crawl and save the text, images, publish time, like count, comment count and repost count.
Weibo account crawled:
洋葱故事会   https://m.weibo.cn/u/1806732505
'''
# Required modules
import os
import urllib.request
import time
import json
import xlwt

# Weibo ID of the big-V account to crawl
id = '1806732505'

# Proxy IP
proxy_addr = "122.241.72.191:808"


# Open a page through the proxy and return the decoded response body
def use_proxy(url, proxy_addr):
    req = urllib.request.Request(url)
    req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0")
    # The API is served over https, so the proxy is registered for both schemes;
    # with only an 'http' entry the request would bypass the proxy
    proxy = urllib.request.ProxyHandler({'http': proxy_addr, 'https': proxy_addr})
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)
    data = urllib.request.urlopen(req).read().decode('utf-8', 'ignore')
    return data


# Get the containerid of the account's Weibo tab; it is needed to crawl the posts
def get_containerid(url):
    data = use_proxy(url, proxy_addr)
    content = json.loads(data).get('data')
    containerid = None
    for tab in content.get('tabsInfo').get('tabs'):
        if tab.get('tab_type') == 'weibo':
            containerid = tab.get('containerid')
    return containerid


# Get the account's basic user info: screen name, profile URL, avatar,
# following count, follower count, gender, level and so on
def get_userInfo(id):
    url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=' + id
    data = use_proxy(url, proxy_addr)
    content = json.loads(data).get('data')
    user = content.get('userInfo')
    profile_image_url = user.get('profile_image_url')
    description = user.get('description')
    profile_url = user.get('profile_url')
    verified = user.get('verified')
    guanzhu = user.get('follow_count')
    name = user.get('screen_name')
    fensi = user.get('followers_count')
    gender = user.get('gender')
    urank = user.get('urank')
    print("Screen name: " + name + "\n"
          + "Profile URL: " + profile_url + "\n"
          + "Avatar URL: " + profile_image_url + "\n"
          + "Verified: " + str(verified) + "\n"
          + "Description: " + description + "\n"
          + "Following: " + str(guanzhu) + "\n"
          + "Followers: " + str(fensi) + "\n"
          + "Gender: " + gender + "\n"
          + "Level: " + str(urank) + "\n")
    return name


# Save the images of one post
def savepic(pic_urls, created_at, page, num):
    srcpath = 'weibo_img/洋葱故事会/'
    if not os.path.exists(srcpath):
        os.makedirs(srcpath)
    picpath = str(created_at) + 'page' + str(page) + 'num' + str(num) + 'pic'
    for i in range(len(pic_urls)):
        path = srcpath + picpath + str(i) + ".jpg"
        urllib.request.urlretrieve(pic_urls[i], path)


# Get the posts and append them to a text file: text, detail page URL,
# like count, comment count, repost count and so on
def get_weibo(id, file):
    i = 1
    url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=' + id
    # The containerid is the same for every page, so look it up once
    containerid = get_containerid(url)
    while True:
        weibo_url = url + '&containerid=' + containerid + '&page=' + str(i)
        try:
            data = use_proxy(weibo_url, proxy_addr)
            content = json.loads(data).get('data') or {}
            cards = content.get('cards') or []
            if len(cards) > 0:
                for j in range(len(cards)):
                    print("-----Crawling page " + str(i) + ", post " + str(j) + "------")
                    card_type = cards[j].get('card_type')
                    if card_type == 9:
                        mblog = cards[j].get('mblog')
                        attitudes_count = mblog.get('attitudes_count')  # likes
                        comments_count = mblog.get('comments_count')    # comments
                        created_at = mblog.get('created_at')            # publish time
                        reposts_count = mblog.get('reposts_count')      # reposts
                        scheme = cards[j].get('scheme')                 # post URL
                        text = mblog.get('text')                        # post text
                        pictures = mblog.get('pics')                    # attached images (list)
                        pic_urls = []                                   # image URLs
                        if pictures:
                            for picture in pictures:
                                pic_url = picture.get('large').get('url')
                                pic_urls.append(pic_url)
                        # Save the text
                        with open(file, 'a', encoding='utf-8') as fh:
                            # Short dates carry no year, so the hard-coded year 2019 is prepended
                            if len(str(created_at)) < 6:
                                created_at = '2019-' + str(created_at)
                            # page, index, post URL, publish time, text, likes, comments, reposts, image URLs
                            fh.write(str(i) + '\t' + str(j) + '\t' + str(scheme) + '\t' + str(created_at) + '\t' + text + '\t' + str(attitudes_count) + '\t' + str(comments_count) + '\t' + str(reposts_count) + '\t' + str(pic_urls) + '\n')
                        # Save the images
                        savepic(pic_urls, created_at, i, j)
                i += 1
                # Sleep 1s to avoid putting a heavy load on the server
                time.sleep(1)
            else:
                break
        except Exception as e:
            print(e)
            # Stop instead of retrying the same page forever
            break


def txt_xls(filename, xlsname):
    """
    Convert the text file to an xls file.
    :param filename: name of the txt file
    :param xlsname: name of the resulting excel file
    """
    with open(filename, 'r', encoding='utf-8') as f:
        xls = xlwt.Workbook()
        # Create the sheet
        sheet = xls.add_sheet('sheet1', cell_overwrite_ok=True)
        headers = ['Page', 'Index on page', 'Post URL', 'Publish time', 'Text',
                   'Likes', 'Comments', 'Reposts', 'Image URLs']
        for col, header in enumerate(headers):
            sheet.write(0, col, header)
        x = 1
        while True:
            # Read the text file line by line
            line = f.readline()
            if not line:
                break  # no more content, leave the loop
            items = line.rstrip('\n').split('\t')
            for i in range(0, len(items)):
                sheet.write(x, i, items[i])  # x is the row, i is the column
            x += 1  # move to the next excel row
        xls.save(xlsname)  # save the xls file


if __name__ == "__main__":
    name = get_userInfo(id)
    file = str(name) + id + ".txt"
    get_weibo(id, file)
    txtname = file
    xlsname = str(name) + id + ".xls"
    txt_xls(txtname, xlsname)
    print('finish')
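To make the per-post extraction easier to follow in isolation, here is a small sketch that pulls the same fields out of one card without any network access. The function name parse_card and the sample data are hypothetical: the sample is hand-written to contain only the keys that the code above reads, and it is not a real API response.

```python
def parse_card(card):
    """Return the fields the crawler saves, or None for cards that are not posts."""
    if card.get("card_type") != 9:  # the crawler above only handles card_type 9
        return None
    mblog = card.get("mblog") or {}
    pics = mblog.get("pics") or []
    return {
        "scheme": card.get("scheme"),
        "created_at": mblog.get("created_at"),
        "text": mblog.get("text", ""),
        "attitudes_count": mblog.get("attitudes_count"),
        "comments_count": mblog.get("comments_count"),
        "reposts_count": mblog.get("reposts_count"),
        "pic_urls": [p["large"]["url"] for p in pics if p.get("large")],
    }


# Hand-written sample shaped like the fields read above (not real data)
sample_cards = [
    {
        "card_type": 9,
        "scheme": "https://m.weibo.cn/status/EXAMPLE",
        "mblog": {
            "created_at": "03-05",
            "text": "hello",
            "attitudes_count": 12,
            "comments_count": 3,
            "reposts_count": 1,
            "pics": [{"large": {"url": "https://example.com/a.jpg"}}],
        },
    },
    {"card_type": 11},
]

rows = [row for row in map(parse_card, sample_cards) if row is not None]
print(rows)
```

Separating parsing from fetching and file writing also makes it possible to check the extraction logic against saved responses before running a full crawl.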

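Crawl results

Running the script prints the account's basic information, appends one tab-separated line per post to a text file named after the account's screen name and ID, downloads the attached images into weibo_img/洋葱故事会/, and finally converts the text file into an .xls workbook with the same base name.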
