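Weibo Crawler (Python): An Adventure

The pitfalls I ran into while using Python to crawl the posts and images published by a celebrity on Weibo
Initial code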

#-*-coding:utf8-*-
# Original Python 2 version
import re
import string
import sys
import os
import urllib
import urllib2
from bs4 import BeautifulSoup
import requests
from lxml import etree

reload(sys)
sys.setdefaultencoding('utf-8')
if(len(sys.argv) >=2):
    user_id = (int)(sys.argv[1])
else:
    user_id = (int)(raw_input(u"Enter user_id: "))

cookie = {"Cookie": "#your cookie"}
url = 'http://weibo.cn/u/%d?filter=1&page=1'%user_id
html = requests.get(url, cookies = cookie).content
selector = etree.HTML(html)
pageNum = (int)(selector.xpath('//input[@name="mp"]')[0].attrib['value'])

result = ""
urllist_set = set()
word_count = 1
image_count = 1

print u'Crawler ready...'

for page in range(1,pageNum+1):
    # fetch the page
    url = 'http://weibo.cn/u/%d?filter=1&page=%d'%(user_id,page)
    lxml = requests.get(url, cookies = cookie).content

    # crawl the text
    selector = etree.HTML(lxml)
    content = selector.xpath('//span[@class="ctt"]')
    for each in content:
        text = each.xpath('string(.)')
        if word_count >= 4:
            text = "%d :"%(word_count-3) +text+"\n\n"
        else :
            text = text+"\n\n"
        result = result + text
        word_count += 1

    # crawl the images
    soup = BeautifulSoup(lxml, "lxml")
    urllist = soup.find_all('a',href=re.compile(r'^http://weibo.cn/mblog/oripic',re.I))
    first = 0
    for imgurl in urllist:
        urllist_set.add(requests.get(imgurl['href'], cookies = cookie).url)
        image_count +=1

fo = open("/Users/Personals/%s"%user_id, "wb")
fo.write(result)
word_path=os.getcwd()+'/%d'%user_id
print u'Text posts crawled'
link = ""
fo2 = open("/Users/Personals/%s_imageurls"%user_id, "wb")
for eachlink in urllist_set:
    link = link + eachlink +"\n"
fo2.write(link)
print u'Image links crawled'
if not urllist_set:
    print u'No images found on this page'
else:
    # download the images into the weibo_image folder under the current directory
    image_path=os.getcwd()+'/weibo_image'
    if os.path.exists(image_path) is False:
        os.mkdir(image_path)
    x=1
    for imgurl in urllist_set:
        temp= image_path + '/%s.jpg' % x
        print u'Downloading image %s' % x
        try:
            urllib.urlretrieve(urllib2.urlopen(imgurl).geturl(),temp)
        except:
            print u"Failed to download image: %s"%imgurl
        x+=1
print u'Original posts crawled: %d in total, saved to %s'%(word_count-4,word_path)
print u'Images crawled: %d in total, saved to %s'%(image_count-1,image_path)

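Environment: Python 3.7, pip 10
Parameters: cookies is copied from the browser after logging in to Weibo; user_id is the uid in the URL of the celebrity's Weibo page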
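
A minimal sketch of preparing these two parameters, assuming the cookie was copied from the browser as one `name=value; name=value` header string. The `cookies=` argument of `requests` expects a mapping from cookie name to value, so the string is split first. The helper names `parse_cookie_header` and `build_page_url` and the sample values are illustrative, not part of the original script.

```python
def parse_cookie_header(raw):
    """Split a raw 'a=1; b=2' Cookie header string into a {name: value} dict."""
    cookies = {}
    for pair in raw.split(";"):
        if "=" not in pair:
            continue  # skip empty or malformed fragments
        name, value = pair.split("=", 1)
        cookies[name.strip()] = value.strip()
    return cookies


def build_page_url(user_id, page=1):
    """Build the weibo.cn listing URL that the script requests for one page."""
    return "http://weibo.cn/u/%d?filter=1&page=%d" % (user_id, page)


# Hypothetical cookie names and uid, for illustration only.
print(parse_cookie_header("name1=value1; name2=value2"))
print(build_page_url(1234567890, page=2))
```

The resulting dict can be passed as `requests.get(url, cookies=...)`.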

pip version
Error: You are using pip version xxx, however version xxx is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Cause: the installed pip is out of date (version 10, while version 18 was available). The message is a warning rather than a failure.
Fix: run 'pip install --upgrade pip' to upgrade.
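etree
Error: (message not recorded)
Cause: at the time I assumed that etree no longer exists in the lxml package under Python 3. In fact lxml still ships lxml.etree on Python 3, so the failure more likely came from the local installation or from the IDE failing to resolve the compiled module.
Fix: import lxml.html
etree = lxml.html.etree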
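sys
Error: module 'sys' has no attribute 'setdefaultencoding'
Cause: in Python 3, str is Unicode and the default encoding is UTF-8, so sys.setdefaultencoding no longer exists. The built-in reload is gone as well (it now lives in importlib).
Fix: delete the reload(sys) and sys.setdefaultencoding('utf-8') lines.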
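
In Python 3 the conversion between bytes and text is always explicit, which is why the old default-encoding hack is no longer needed. A small sketch with made-up sample text:

```python
import sys

# Python 3 has no sys.setdefaultencoding; str is already Unicode.
print(hasattr(sys, "setdefaultencoding"))

text = "微博 crawler"             # str: a sequence of Unicode code points
raw = text.encode("utf-8")        # bytes, the same kind of object as requests' .content
round_trip = raw.decode("utf-8")  # back to str, naming the encoding explicitly
```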
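raw_input
Error: name 'raw_input' is not defined
Cause: Python 3.x no longer has the raw_input function.
Fix: replace raw_input with input, which behaves the same way in Python 3 (it returns the typed line as a string).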
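
A sketch of the same "command-line argument or prompt" logic in Python 3. The function name `read_user_id` and its `ask` parameter are illustrative; `ask` only exists so the prompt can be replaced in a test.

```python
def read_user_id(argv, ask=input):
    """Return the uid from argv[1] if present, otherwise prompt for it."""
    if len(argv) >= 2:
        return int(argv[1])
    return int(ask("Enter user_id: "))
```

In a script this would be called as `read_user_id(sys.argv)` after `import sys`.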
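fo = open("/xxxx/%s"%user_id, "wb")
Error: UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in position 2: illegal multibyte sequence
Cause: on Windows, a file opened in text mode without an explicit encoding uses the system default encoding, which is GBK here. The crawled text is Unicode and contains characters such as '\xa0' that GBK cannot represent, so the write fails.
Fix: open the file in text mode with an explicit encoding (binary mode "wb" does not accept an encoding argument): fo = open("/xxxx/%s" % user_id, "w", encoding="utf-8")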
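
A minimal sketch of the fix, writing to a throwaway temporary directory. The sample text and file name are made up; the point is only that naming the encoding makes the result independent of the OS locale.

```python
import os
import tempfile

text = "1 :\xa0hello\n\n"  # '\xa0' (no-break space) is the character from the error above

# GBK has no mapping for '\xa0', which is what the traceback complains about.
try:
    text.encode("gbk")
except UnicodeEncodeError as exc:
    print("gbk cannot encode it:", exc)

path = os.path.join(tempfile.mkdtemp(), "title.txt")
# Text mode plus an explicit encoding works the same on every platform.
with open(path, "w", encoding="utf-8") as fo:
    fo.write(text)

with open(path, "r", encoding="utf-8") as fo:
    read_back = fo.read()
```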
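File not found
Error: No such file or directory: '/xxxx/user_id'
Cause: the target directory does not exist. open() creates the file but not any missing directories.
Fix: for a relative path, drop the leading /; for an absolute path, prefix the string with r (a raw string) so that backslashes are not treated as escapes. In both cases the directory itself must already exist.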
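
One way to avoid this error altogether is to create the directory before opening the file. A sketch, assuming a hypothetical helper named `open_for_write`; the demo runs inside a temporary directory so it does not touch real paths.

```python
import os
import tempfile


def open_for_write(path):
    """Create the parent directory if needed, then open path for UTF-8 text output."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)  # no error if it already exists
    return open(path, "w", encoding="utf-8")


# Demo inside a throwaway directory; the file name is made up.
base = tempfile.mkdtemp()
target = os.path.join(base, "text", "123_title.txt")
with open_for_write(target) as fo:
    fo.write("ok")
```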
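soup.find_all('a', href=re.compile(r'^http://weibo.cn/mblog/oripic', re.I))
Problem: no image links are found, so the script reports that the page contains no images.
Cause: the pattern on the a tags matches nothing; the tags to look for are img.
Fix: change it to soup.find_all('img') and read each tag's src attribute; the images are then found.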
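
To see what selecting img tags yields without installing anything, here is a dependency-free sketch that uses the standard library's `html.parser` as a stand-in for `soup.find_all('img')`. The class name `ImgSrcCollector` and the sample HTML are illustrative. It also skips tags that have no src, a case where indexing `tag['src']` in BeautifulSoup would raise KeyError.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag that has one."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:  # skip <img> tags without a usable src
                self.sources.append(src)


# Made-up page fragment for illustration.
sample = '<div><img src="http://example.com/a.jpg"><img alt="no src"><a href="/x">x</a></div>'
collector = ImgSrcCollector()
collector.feed(sample)
print(collector.sources)
```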
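Image download error
Error: AttributeError("module 'urllib' has no attribute 'urlretrieve'")
Cause: urllib differs between Python 2 and Python 3. In Python 3, urlretrieve and urlopen live in urllib.request, and urllib2 no longer exists.
Fix: urllib.request.urlretrieve(url, temp, Schedule), where Schedule is a progress callback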
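
The third argument of `urllib.request.urlretrieve` is a report hook called with the block count, the block size and the total size (-1 when the server does not report one). A sketch of a hook that guards against the unknown-size case; the names `percent_done`, `report` and `download` are illustrative.

```python
import urllib.request


def percent_done(block_num, block_size, total_size):
    """Convert urlretrieve's reporthook arguments into a 0-100 percentage.

    Returns None when the total size is unknown (total_size <= 0).
    """
    if total_size <= 0:
        return None
    return min(100.0, 100.0 * block_num * block_size / total_size)


def report(block_num, block_size, total_size):
    """Report hook: print progress whenever the total size is known."""
    pct = percent_done(block_num, block_size, total_size)
    if pct is not None:
        print("%.2f%%" % pct)


def download(url, dest):
    """Download url to dest, printing progress; returns the local filename."""
    filename, _headers = urllib.request.urlretrieve(url, dest, report)
    return filename
```

Keeping the arithmetic in `percent_done` separate from the printing makes it easy to check without any network access.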

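Some arguments of the functions I call are highlighted in red, but this does not affect execution. I do not know why yet; it may be related to the development tool (I use IDEA).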

Code after the fixes

# coding:utf-8
import urllib.request
import re
import importlib
import sys
import os
import urllib
from bs4 import BeautifulSoup
import requests
import lxml.html

etree = lxml.html.etree


def pachong(c):
    # importlib.reload(sys)
    # sys.setdefaultencoding('utf-8')
    # if (len(sys.argv) >= 2):
    #     user_id = (int)(sys.argv[1])
    # else:
    user_id = int(input("Enter user_id: "))

    cookie = {"Cookie": c}
    url = 'http://weibo.cn/u/%d?filter=1&page=1' % user_id
    print("url: ", url)
    r = requests.get(url, cookies=cookie)
    html = r.content
    print("html: ", html)
    selector = etree.HTML(html)
    pageNum = int(selector.xpath('//input[@name="mp"]')[0].attrib['value'])

    result = ""
    urllist_set = set()
    word_count = 1
    image_count = 1

    print('Crawler ready...')

    for page in range(1, pageNum + 1):
        # fetch the page (renamed from lxml so the module name is not shadowed)
        url = 'http://weibo.cn/u/%d?filter=1&page=%d' % (user_id, page)
        page_html = requests.get(url, cookies=cookie).content

        # crawl the text
        selector = etree.HTML(page_html)
        content = selector.xpath('//span[@class="ctt"]')
        for each in content:
            text = each.xpath('string(.)')
            if word_count >= 4:
                text = "%d :" % (word_count - 3) + text + "\n\n"
            else:
                text = text + "\n\n"
            result = result + text
            word_count += 1

        # crawl the images
        soup = BeautifulSoup(page_html, "lxml")
        urllist = soup.find_all('img')
        for imgurl in urllist:
            urllist_set.add(requests.get(imgurl['src'], cookies=cookie).url)
            image_count += 1

    # make sure the output directory exists before writing into it
    os.makedirs("text", exist_ok=True)
    word_path = os.path.join(os.getcwd(), "text")
    with open("text/%s_title.txt" % user_id, "w", encoding="utf-8") as fo:
        fo.write(result)
    print('Text posts crawled')

    link = ""
    for eachlink in urllist_set:
        link = link + eachlink + "\n"
    with open("text/%s_imageurls.txt" % user_id, "w", encoding="utf-8") as fo2:
        fo2.write(link)
    print('Image links crawled')

    # defined up front so the summary below works even when no image was found
    image_path = os.getcwd() + '/weibo_image'
    if not urllist_set:
        print('No images found on this page')
    else:
        # download the images into the weibo_image folder under the current directory
        if os.path.exists(image_path) is False:
            os.mkdir(image_path)
        x = 1
        for imgurl in urllist_set:
            temp = image_path + '/%s.jpg' % x
            print('Downloading image %s' % x)
            try:
                url = urllib.request.urlopen(imgurl).geturl()
                urllib.request.urlretrieve(url, temp, Schedule)
            except Exception as e:
                print("Failed to download image: %s" % imgurl)
                print("e: ", repr(e))
            x += 1

    print('Original posts crawled: %d in total, saved to %s' % (word_count - 4, word_path))
    print('Images crawled: %d in total, saved to %s' % (image_count - 1, image_path))


def Schedule(a, b, c):
    # a: number of data blocks downloaded so far
    # b: size of each data block
    # c: size of the remote file
    per = 100.0 * a * b / c
    if per > 100:
        per = 100
    print('%.2f%%' % per)


if __name__ == "__main__":
    pachong("#your cookie")
