Python Baidu Tieba Crawler (Image Download)

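I wrote this Baidu Tieba crawler in Python in my spare time, as a practice exercise while learning the language.

Given a Tieba thread URL, the program scrapes the original poster's text posts or images. For now its main feature is downloading every image posted by the original poster. The text-scraping mode only prints to the screen and does not save anything to a local file; interested readers are welcome to add that. For learning purposes only; please credit the source when reposting.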

tieba_spider.py

#coding:utf-8
import urllib2, re, time, threading
import DownQueue

# Pretend to be a regular browser
user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'
# Thread URL; see_lz=1 shows only the original poster, the page number is appended after pn=
url = 'http://tieba.baidu.com/p/3271638607?see_lz=1&pn='
header = {'User-Agent': user_agent}

g_worker = DownQueue.down()  # downloader thread


class Tieba_Spider(threading.Thread):
    def __init__(self, url, type):
        threading.Thread.__init__(self)
        self.url = url
        self.type = type  # 0: print posts to the screen, 1: download images
        self.num = 0      # number of pages in the thread

    def run(self):
        self.start_spider()

    def get_info(self):
        # Fetch the first page to read the page count and the thread title
        try:
            req = urllib2.Request(self.url, headers=header)
            response = urllib2.urlopen(req)
            htm = response.read().decode('gbk')
            self.num = self.get_page_num(htm)
            print 'It has %d page(s)' % self.num
            self.title = self.get_title(htm)
            print 'Its title is %s' % self.title
        except urllib2.URLError as e:
            if hasattr(e, 'code'):
                print 'Error code :', e.code
            if hasattr(e, 'reason'):
                print 'Reason :', e.reason

    def start_spider(self):
        global g_worker
        self.get_info()
        for i in range(1, self.num + 1):
            print 'start : ', i
            try:
                req = urllib2.Request(self.url + str(i), headers=header)
                response = urllib2.urlopen(req)
                htm = response.read().decode('gbk')
                if self.type == 0:
                    self.page_deal(htm)
                elif self.type == 1:
                    self.down_pic(htm)
            except urllib2.URLError as e:
                if hasattr(e, 'code'):
                    print 'Error code :', e.code
                if hasattr(e, 'reason'):
                    print 'Reason :', e.reason
        # All pages have been processed: tell the downloader it may stop
        g_worker.set_flag(True)

    def get_page_num(self, htm):
        match = re.search(r'<span class="red">(\d+)</span>', htm)
        if match:
            return int(match.group(1))
        else:
            return 0

    def get_title(self, htm):
        match = re.search(r'class="core_title_txt(\s+)"(\s+)title="(.*?)"', htm)
        if match:
            return match.group(3)
        else:
            print 'no match title'
            return ''

    def page_deal(self, htm):
        # Print the text of every post on the page
        match = re.findall(r'id="post_content_(.*?)">(.*?)</div>', htm)
        if match:
            for it in match:
                print it[1], '\n'
        else:
            print 'no deal'

    def down_pic(self, htm):
        # Queue every image URL on the page for download
        global g_worker
        match = re.findall(r'<img class="BDE_Image" pic_type=(.*?)src="(.*?)"', htm)
        if match:
            for it in match:
                print 'picture url :', it[1], '\n'
                g_worker.push(it[1])
        else:
            print 'no deal'


if __name__ == '__main__':
    # type 1 downloads images; type 0 prints the original poster's posts to the screen
    spider = Tieba_Spider(url, 1)
    spider.start()
    g_worker.start()
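The regular expressions used by get_page_num, get_title and down_pic can be tried offline without touching the network. The following is a minimal sketch in Python 3 (the listing above targets Python 2). SAMPLE_HTML is a made-up fragment shaped only to match those patterns, not real Tieba markup, and the parse_* helper names are hypothetical.

```python
import re

# Hypothetical HTML fragment, written only to exercise the patterns
# from tieba_spider.py; real Tieba pages may look different.
SAMPLE_HTML = (
    '<h1 class="core_title_txt  " title="Sample thread">Sample thread</h1>'
    '<span class="red">3</span>'
    '<div id="post_content_1001">hello'
    '<img class="BDE_Image" pic_type="0" width="560" src="http://example.com/a.jpg">'
    '</div>'
    '<div id="post_content_1002">'
    '<img class="BDE_Image" pic_type="0" src="http://example.com/b.png">'
    '</div>'
)


def parse_page_count(html):
    """Return the page count, or 0 when the marker is missing."""
    match = re.search(r'<span class="red">(\d+)</span>', html)
    return int(match.group(1)) if match else 0


def parse_title(html):
    """Return the thread title, or an empty string when it is missing."""
    match = re.search(r'class="core_title_txt\s+"\s+title="(.*?)"', html)
    return match.group(1) if match else ''


def parse_image_urls(html):
    """Return the src of every <img class="BDE_Image"> tag, in page order."""
    return re.findall(r'<img class="BDE_Image" pic_type=.*?src="(.*?)"', html)


if __name__ == '__main__':
    print(parse_page_count(SAMPLE_HTML))
    print(parse_title(SAMPLE_HTML))
    print(parse_image_urls(SAMPLE_HTML))
```

Patterns like these depend on the exact attribute order in the page source, so they are likely to need adjusting whenever the site changes its markup.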

DownQueue.py

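#coding:utf-8
import os, threading, Queue, re
import urllib2


class down(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
        self.queue = Queue.Queue(1000)
        self.semaphore = threading.Semaphore(0)
        self.flag = False  # stop flag

    def push(self, obj):
        self.queue.put(obj)
        self.semaphore.release()

    def set_flag(self, f):
        self.flag = f
        # Wake the worker so it re-checks the stop flag even when no
        # further URL is pushed; otherwise it could block forever.
        self.semaphore.release()

    def run(self):
        # Make sure the output directory exists before writing into it
        if not os.path.isdir('./spider_pic'):
            os.makedirs('./spider_pic')
        while True:
            self.semaphore.acquire()  # wait for a pushed URL or a stop signal
            if self.queue.empty():
                if self.flag:  # exit condition: queue is empty and the stop flag is set
                    break
                continue
            obj = self.queue.get()
            data = urllib2.urlopen(obj).read()
            pic = re.search(r'.*/(.*)', obj)  # file name = text after the last '/'
            print 'downloading ', pic.group(1)
            fd = open('./spider_pic/%s' % pic.group(1), 'wb')
            fd.write(data)
            fd.close()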
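An alternative way to tell a download thread that no more work is coming is to push a sentinel object onto the queue, which lets the blocking Queue.get() call do the waiting on its own. The following is a minimal Python 3 sketch of that technique, not the code used in this post. DownloadWorker, close and fake_fetch are hypothetical names, and fake_fetch stands in for the real HTTP download so the example runs offline.

```python
import queue
import threading

_STOP = object()  # sentinel: signals that no more work is coming


class DownloadWorker(threading.Thread):
    """Consumer thread that processes URLs until it receives the sentinel."""

    def __init__(self, fetch, maxsize=1000):
        super().__init__()
        self.fetch = fetch                # callable taking a URL, returning bytes
        self.tasks = queue.Queue(maxsize)
        self.results = []                 # (url, data) pairs, in processing order

    def push(self, url):
        self.tasks.put(url)

    def close(self):
        # Queued after all URLs, so it is only seen once they are processed.
        self.tasks.put(_STOP)

    def run(self):
        while True:
            url = self.tasks.get()        # blocks until an item is available
            if url is _STOP:
                break
            self.results.append((url, self.fetch(url)))


def fake_fetch(url):
    """Stand-in for a real download (assumption: no network access here)."""
    return ('bytes of ' + url).encode('utf-8')


def demo():
    worker = DownloadWorker(fake_fetch)
    worker.start()
    for name in ('a.jpg', 'b.jpg', 'c.jpg'):
        worker.push('http://example.com/' + name)
    worker.close()
    worker.join()                         # returns once the sentinel is consumed
    return worker.results


if __name__ == '__main__':
    for url, data in demo():
        print(url, len(data))
```

Because the sentinel travels through the same queue as the URLs, the worker cannot exit before the remaining items are handled, and the producer does not need a separate flag or semaphore.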

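The Tieba_Spider class is the crawler: it extracts the image links from the original poster's posts and pushes them onto the queue of the down class, whose job is to download the images. Both classes inherit from threading.Thread.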
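The text-scraping mode only prints posts to the screen. One possible way to add the missing save-to-file step is sketched below in Python 3, reusing the post pattern from page_deal. extract_posts, save_posts and the posts.txt file name are hypothetical, and the HTML in demo is a made-up fragment rather than real Tieba markup.

```python
import os
import re
import tempfile

# Same pattern as page_deal in tieba_spider.py: group 1 is the post id,
# group 2 is the post body.
POST_PATTERN = re.compile(r'id="post_content_(.*?)">(.*?)</div>')


def extract_posts(html):
    """Return the body of every post on the page, stripped of outer spaces."""
    return [body.strip() for _post_id, body in POST_PATTERN.findall(html)]


def save_posts(posts, path):
    """Append posts to a UTF-8 text file, one blank line between posts."""
    with open(path, 'a', encoding='utf-8') as fd:
        for post in posts:
            fd.write(post + '\n\n')


def demo():
    # Hypothetical page fragment, used only to exercise the two helpers.
    html = ('<div id="post_content_1">first post</div>'
            '<div id="post_content_2"> second post </div>')
    path = os.path.join(tempfile.mkdtemp(), 'posts.txt')
    save_posts(extract_posts(html), path)
    with open(path, encoding='utf-8') as fd:
        return fd.read()


if __name__ == '__main__':
    print(demo())
```

Opening the file in append mode means each page of the thread can be written as it is fetched, so nothing has to be held in memory across pages.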
