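Scraping Baidu Tieba Posts

It started when I came across an article online about scraping forum posts and couldn't resist trying it myself. After following along and finishing that version, it didn't feel like enough: it only fetched the content of a single post, which seemed a bit thin. So I reworked it and rewrote it into a considerably more capable crawler.

Simplified version

Introduction

Given a post's URL, fetch the content of each floor (reply) and save it to a local file.

Here is a sample run: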

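*****************************************
**    Welcome to Spider of baidutieba  **
**      Created on 2017-04-25          **
**      @author: Jimy _Fengqi          **
*****************************************
Enter the post ID
http://tieba.baidu.com/p/5081359666
Only fetch the original poster's replies? Enter 1 for yes, 0 for no
0
Write floor numbers? Enter 1 for yes, 0 for no
1
Opening page: http://tieba.baidu.com/p/5081359666?see_lz=0&pn=1
1
Funny bloopers in "In the Name of the People": Yike, you're holding your phone upside down
Funny bloopers in "In the Name of the People": Yike, you're holding your phone upside down
This post has 1 page(s)
Opening page: http://tieba.baidu.com/p/5081359666?see_lz=0&pn=1
Lately everyone seems to be caught up in binge-watching "In the Name of the People". The show tells of a project director at a national ministry, a man who eats zhajiang noodles, who is reported for taking tens of millions in bribes. Just as this corrupt official's mask is finally torn off, Ding Yizhen, the vice mayor of Jingzhou who is closely tied to the case, flees abroad with the help of a mysterious figure. The trail eventually leads to a fight over the shares of the state-owned Dafeng Garment Factory, triggered by Jingzhou's Guangming Lake project, and a string of twisty battles of wits unfolds!<br><img class="BDE_Image" pic_type="1" width="430" height="240" src="https://imgsa.baidu.com/forum/w%3D580/sign=b0a6294fe1f81a4c2632ecc1e72b6029/26fcffd98d1001e985948f34b20e7bec54e79716.jpg" ><br>Blooper 1:<br>The story of "In the Name of the People" is set in 2015, but when the stargazing district chief Sun Liancheng takes a call about the Dafeng factory demolition and picks up his mobile phone, the desk phone on his table shows a date in 2016.<br><img class="BDE_Image" pic_type="1" width="560" height="265" src="https://imgsa.baidu.com/forum/w%3D580/sign=3ad145f3104c510faec4e21250582528/f9e90e1001e939019caa160571ec54e736d19616.jpg" ><br><img class="BDE_Image" pic_type="1" width="560" height="269" src="https://imgsa.baidu.com/forum/w%3D580/sign=556db4475b0fd9f9a0175561152cd42b/082082e93901213f069bd5e75ee736d12f2e9516.jpg" ><br>Blooper 2:<br>Chen Hai phones Hou Liangping to say he is close to the truth and that everything will be easy once he reaches Beijing, but he is hit by a car while walking and talking on the phone. The shot flashes by quickly in the show, but when Xiaoliujun slowed the footage down, the blooper showed. Watch the accident closely: there is nobody in the car.<br><img class="BDE_Image" pic_type="1" width="560" height="266" src="https://imgsa.baidu.com/forum/w%3D580/sign=a5d9ff04d3f9d72a17641015e42b282a/84d9ba01213fb80ec479faec3cd12f2eb9389416.jpg" ><br><img class="BDE_Image" pic_type="1" width="560" height="287" src="https://imgsa.baidu.com/forum/w%3D580/sign=919a77f2dd2a60595210e1121835342d/bc31a23fb80e7bec947298da252eb9389b506b16.jpg" ><br>Blooper 3:<br>Hou Liangping and his team stop Secretary Li Dakang's official car at a highway entrance and arrest Ouyang Jing right in front of him, earning Hou Liangping a withering glare from Secretary Dakang. On the way back, Lu Yike hurriedly calls her boss to explain, and at this point she is holding her phone the right way up. Soon the chief prosecutor calls her back, and now Lu Yike is holding her phone upside down; perhaps she was so absorbed in the call that she never noticed.<br><img class="BDE_Image" pic_type="1" width="560" height="270" src="https://imgsa.baidu.com/forum/w%3D580/sign=186c79216a59252da3171d0c049a032c/a40f3b0e7bec54e7f7448125b3389b504fc26a16.jpg" ><br><img class="BDE_Image" pic_type="1" width="560" height="266" src="https://imgsa.baidu.com/forum/w%3D580/sign=15bfce522f2dd42a5f0901a3333a5b2f/3d3ef8ec54e736d1edbb173391504fc2d5626916.jpg" ><br>Douban reviews:<br>So far, with the series not yet fully aired, 103168 ratings have been submitted. 1-star ratings account for 3.5%, 2-star for 2.3%, 3-star for 10.4%, 4-star for 31.1%, and 5-star for the largest share at 52.7%. Overall that is four and a half stars and a score of 8.5.<br><img class="BDE_Image" pic_type="1" width="560" height="294" src="https://imgsa.baidu.com/forum/w%3D580/sign=a3cc8b26de2a283443a636036bb4c92e/fedcd7e736d12f2e7aad355b45c2d56285356816.jpg" >
Writing data for page 1
Writing finished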

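Approach:

Baidu Tieba post URLs follow a regular pattern: https://tieba.baidu.com/p/5081359666

That is, every post URL starts with 'https://tieba.baidu.com/p/', followed by the post ID.

If only the original poster's replies should be shown, '?see_lz=1' is appended to the URL; otherwise it is '?see_lz=0'.

If the post spans several pages, '&pn=1' is appended as well, where the number after '=' is the page number.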
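The URL rules above can be captured in one small helper. This is only a sketch written for Python 3 (the article's own code targets Python 2); the function name build_post_url and its parameters are made up for illustration and do not appear in the original code.

```python
BASE_URL = "https://tieba.baidu.com/p/"


def build_post_url(post_id, only_author=False, page=1):
    """Return the URL of one page of a post (hypothetical helper)."""
    see_lz = 1 if only_author else 0  # 1 = original poster only, 0 = everyone
    return "{}{}?see_lz={}&pn={}".format(BASE_URL, post_id, see_lz, page)


print(build_post_url("5081359666"))
print(build_post_url("5081359666", only_author=True, page=3))
```

Keeping the URL assembly in one place means the see_lz and pn rules only need to be right once.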

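That makes the plan clear:

1. Decide whether to fetch only the original poster's replies.

2. Using the post ID, fetch the first page and read the post's total page count from it.

3. Fetch the floor content of every page.

4. Save the collected content to a local file.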
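The four steps can be strung together roughly as follows. This is a Python 3 sketch, not the article's code: crawl_post and its fetch parameter are hypothetical, and fetch is any callable that maps a URL to HTML text, so the flow can be tried on canned pages without touching the network. The two regular expressions are the ones the article uses; the sample HTML is invented.

```python
import io
import re

PAGE_COUNT_RE = re.compile(
    r'<li class="l_reply_num.*?</span>.*?<span.*?>(.*?)</span>', re.S)
FLOOR_RE = re.compile(r'<div id="post_content_.*?>(.*?)</div>', re.S)


def crawl_post(post_url, fetch, out, only_author=False):
    """Run the four steps; returns the number of pages processed."""
    see_lz = "?see_lz=%d" % (1 if only_author else 0)    # step 1
    first_page = fetch(post_url + see_lz + "&pn=1")      # step 2
    match = PAGE_COUNT_RE.search(first_page)
    pages = int(match.group(1)) if match else 1
    floor = 1
    for pn in range(1, pages + 1):                       # step 3
        html = first_page if pn == 1 else fetch(post_url + see_lz + "&pn=%d" % pn)
        for body in FLOOR_RE.findall(html):
            out.write("%d: %s\n" % (floor, body.strip()))  # step 4
            floor += 1
    return pages


# Canned pages standing in for real responses (made-up markup).
fake_pages = {
    "http://t/p/1?see_lz=0&pn=1":
        '<li class="l_reply_num"><span>3</span>x<span>2</span></li>'
        '<div id="post_content_1" class="c"> first </div>',
    "http://t/p/1?see_lz=0&pn=2":
        '<div id="post_content_2" class="c">second</div>'
        '<div id="post_content_3">third</div>',
}
buffer = io.StringIO()
page_count = crawl_post("http://t/p/1", fake_pages.__getitem__, buffer)
print(page_count)
print(buffer.getvalue())
```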

Key pieces

1. Fetching the page content:

def getPage(self, pageNum):
    try:
        # Page URL = base URL + see_lz flag + page number
        url = self.baseUrl + self.seelz + '&pn=' + str(pageNum)
        print "Opening page: " + url
        request = urllib2.Request(url)
        response = urllib2.urlopen(request)
        return response.read().decode('utf-8')
    except urllib2.URLError, e:
        if hasattr(e, 'reason'):
            print "Failed to connect to Baidu Tieba, reason:", e.reason
        return None
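For readers on Python 3, where urllib2 no longer exists, a rough equivalent might look like the sketch below. The name get_page and the opener parameter are assumptions added for illustration; opener defaults to urllib.request.urlopen and exists only so the function can be exercised with a stand-in instead of a real request.

```python
import urllib.error
import urllib.request


def get_page(url, opener=urllib.request.urlopen, timeout=10):
    """Return the decoded HTML at `url`, or None if the request fails."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except urllib.error.URLError as e:
        # HTTPError is a subclass of URLError, so it is handled here too.
        print("Failed to connect to Baidu Tieba, reason:", getattr(e, "reason", e))
        return None
```

Returning None on failure mirrors the original method, so callers should check the result before parsing it.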

2. Getting the number of pages in the post:

# Get the total number of pages in a post
def getPageNum(self, page):
    pattern = re.compile('<li class="l_reply_num.*?</span>.*?<span.*?>(.*?)</span>', re.S)
    result = re.search(pattern, page)
    if result:
        print result.group(1)
        return result.group(1).strip()
    else:
        return '1'  # nothing matched: there is still at least one page
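To see what this regular expression captures, it helps to run it on a small fragment. The sketch below is Python 3; the sample markup is made up to resemble what the pattern expects and is not copied from a real Tieba page, and get_page_count is a hypothetical name.

```python
import re

PAGE_COUNT_RE = re.compile(
    r'<li class="l_reply_num.*?</span>.*?<span.*?>(.*?)</span>', re.S)


def get_page_count(html):
    """Return the page count found in `html`, or 1 if nothing matches."""
    match = PAGE_COUNT_RE.search(html)
    if match is None:
        return 1
    return int(match.group(1).strip())


# Invented fragment: the first <span> is the reply count, the second the page count.
sample = ('<li class="l_reply_num" style="margin-left:8px">'
          '<span class="red">128</span> replies, '
          '<span class="red">3</span> pages</li>')
print(get_page_count(sample))
```

The pattern skips the first span and captures the text of the second one, which is why the reply count is ignored.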

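3. Matching the floor content:

def getContent(self, page):
    # Match the content of every floor
    pattern = re.compile('<div id="post_content_.*?>(.*?)</div>', re.S)
    items = re.findall(pattern, page)
    contents = []
    for item in items:
        print item
        # Strip the HTML tags and pad each floor with newlines
        content = "\n" + self.tool.replace(item) + "\n"
        contents.append(content.encode('utf-8'))
    return contents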
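The matching and cleaning can be tried on a small sample. This Python 3 sketch uses a reduced version of the cleanup rules (drop images, turn line breaks into newlines, strip remaining tags) rather than the full Tool class; clean and get_floors are hypothetical names and the sample markup is invented.

```python
import re

FLOOR_RE = re.compile(r'<div id="post_content_.*?>(.*?)</div>', re.S)


def clean(fragment):
    """Reduce one floor's HTML to plain text (simplified cleanup)."""
    fragment = re.sub(r'<img.*?>', '', fragment)      # drop images
    fragment = re.sub(r'<br\s*/?>', '\n', fragment)   # line breaks become newlines
    fragment = re.sub(r'<.*?>', '', fragment)         # strip any other tag
    return fragment.strip()


def get_floors(html):
    """Return the cleaned text of every floor found in `html`."""
    return [clean(item) for item in FLOOR_RE.findall(html)]


sample = ('<div id="post_content_1" class="d_post_content">  First floor<br>'
          '<img class="BDE_Image" src="a.jpg" >second line</div>'
          '<div id="post_content_2" class="d_post_content">'
          '<a href="#">link</a> text</div>')
print(get_floors(sample))
```

Note that the pattern stops at the first closing div, so a floor containing nested div elements would be cut short; that limitation is shared with the original regular expression.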

4. Writing the data to a local file:

def writeData(self, contents):
    for item in contents:
        if self.floorTag == '1':
            # Separator line carrying the floor number
            floorLine = '\n' + str(self.floor) + "-------------------------------------------------------------\n"
            self.file.write(floorLine)
        self.file.write(item)
        self.floor += 1
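The same writing logic can be tested without creating a file by writing into an in-memory buffer. The Python 3 sketch below is an illustration only: write_floors, its parameters, and the 20-dash separator are assumptions, not part of the article's code.

```python
import io

SEPARATOR = "-" * 20  # arbitrary length chosen for this sketch


def write_floors(out, contents, first_floor=1, with_floor_numbers=True):
    """Write each floor to `out`; return the number the next floor would get."""
    floor = first_floor
    for item in contents:
        if with_floor_numbers:
            out.write("\n%d%s\n" % (floor, SEPARATOR))
        out.write(item)
        floor += 1
    return floor


buffer = io.StringIO()
next_floor = write_floors(buffer, ["first reply\n", "second reply\n"])
print(buffer.getvalue())
print(next_floor)
```

Returning the next floor number lets the caller continue the numbering across pages, which is what the self.floor counter does in the class.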

The complete code:

#!/usr/bin/python
#coding:utf-8
import urllib
import urllib2
import re
import time
import sys
import os

reload(sys)
sys.setdefaultencoding('utf-8')


# Helper class for cleaning up page tags
class Tool:
    # Remove img tags and runs of 7 spaces
    removeImg = re.compile('<img.*?>| {7}')
    # Remove hyperlink tags
    removeAddr = re.compile('<a.*?>|</a>')
    # Tags that end a line are replaced with \n
    replaceLine = re.compile('<tr>|<div>|</div>|</p>')
    # Table cell tags <td> are replaced with \t
    replaceTD = re.compile('<td>')
    # Paragraph openings are replaced with \n plus an indent
    replacePara = re.compile('<p.*?>')
    # Single or double <br> tags are replaced with \n
    replaceBR = re.compile('<br><br>|<br>')
    # Strip every remaining tag
    removeExtraTag = re.compile('<.*?>')
    # Remove forward slashes and backslashes
    removeLine1 = re.compile(r'/')
    removeLine2 = re.compile(r'\\')

    def replace(self, x):
        x = re.sub(self.removeImg, "", x)
        x = re.sub(self.removeAddr, "", x)
        x = re.sub(self.replaceLine, "\n", x)
        x = re.sub(self.replaceTD, "\t", x)
        x = re.sub(self.replacePara, "\n    ", x)
        x = re.sub(self.replaceBR, "\n", x)
        x = re.sub(self.removeExtraTag, "", x)
        # strip() removes leading and trailing whitespace
        return x.strip()

    def replaceSlash(self, x):
        x = re.sub(self.removeLine1, "", x)
        x = re.sub(self.removeLine2, "", x)
        return x.strip()


class BaiDuTieBa:
    def __init__(self, baseUrl, seelz, floorTag):
        self.baseUrl = baseUrl
        self.seelz = '?see_lz=' + str(seelz)
        self.tool = Tool()
        self.file = None
        self.floor = 1
        self.defaultTitle = "Baidu Tieba"
        self.floorTag = floorTag

    def getPage(self, pageNum):
        try:
            url = self.baseUrl + self.seelz + '&pn=' + str(pageNum)
            print "Opening page: " + url
            request = urllib2.Request(url)
            response = urllib2.urlopen(request)
            return response.read().decode('utf-8')
        except urllib2.URLError, e:
            if hasattr(e, 'reason'):
                print "Failed to connect to Baidu Tieba, reason:", e.reason
            return None

    # Get the total number of pages in a post
    def getPageNum(self, page):
        pattern = re.compile('<li class="l_reply_num.*?</span>.*?<span.*?>(.*?)</span>', re.S)
        result = re.search(pattern, page)
        if result:
            print result.group(1)
            return result.group(1).strip()
        else:
            return None

    # The title may sit in an h1, h2 or h3 tag, so try each in turn
    def getTitle(self, page):
        for tag in ('h1', 'h2', 'h3'):
            pattern = re.compile('<%s class="core_title_txt.*?>(.*?)</%s>' % (tag, tag), re.S)
            result = re.search(pattern, page)
            if result:
                title = result.group(1).strip()
                print title
                return title
        return None

    # Open the output file, named after the post title when one was found
    def setFileTilte(self, title):
        if title is not None:
            title = self.tool.replaceSlash(title)
            self.file = open(title + ".txt", "w+")
        else:
            self.file = open(self.defaultTitle + ".txt", "w+")
        self.file.write(self.baseUrl)

    def getContent(self, page):
        # Match the content of every floor
        pattern = re.compile('<div id="post_content_.*?>(.*?)</div>', re.S)
        items = re.findall(pattern, page)
        contents = []
        for item in items:
            print item
            content = "\n" + self.tool.replace(item) + "\n"
            contents.append(content.encode('utf-8'))
        return contents

    def writeData(self, contents):
        for item in contents:
            if self.floorTag == '1':
                floorLine = '\n' + str(self.floor) + "-------------------------------------------------------------\n"
                self.file.write(floorLine)
            self.file.write(item)
            self.floor += 1

    def start(self):
        indexPage = self.getPage(1)
        # Stop early if the page could not be fetched or has no page count
        pageNum = self.getPageNum(indexPage) if indexPage else None
        if pageNum is None:
            print "The URL is no longer valid, please try again"
            return
        title = self.getTitle(indexPage)
        print title
        self.setFileTilte(title)
        try:
            print "This post has " + str(pageNum) + " page(s)"
            for i in range(1, int(pageNum) + 1):
                page = self.getPage(i)
                if page is None:
                    continue
                contents = self.getContent(page)
                print "Writing data for page " + str(i)
                self.writeData(contents)
        # A write error occurred
        except IOError, e:
            print "Write error, reason: " + e.message
        finally:
            print "Writing finished"
            self.file.close()


if __name__ == '__main__':
    print '''
*****************************************
**    Welcome to Spider of baidutieba  **
**      Created on 2017-04-25          **
**      @author: Jimy _Fengqi          **
*****************************************'''
    print "Enter the post ID"
    basURL = "http://tieba.baidu.com/p/" + str(raw_input('http://tieba.baidu.com/p/'))
    seelz = raw_input("Only fetch the original poster's replies? Enter 1 for yes, 0 for no\n")
    floorTag = raw_input("Write floor numbers? Enter 1 for yes, 0 for no\n")
    bdtb = BaiDuTieBa(basURL, seelz, floorTag)
    bdtb.start()

Full version

Sample run:

Final version:

#!/usr/bin/python
#coding:utf-8
import urllib2
import re
import json
import urllib
import time
import random
import os
import threading
import HTMLParser
from bs4 import BeautifulSoup as BS
import sys

reload(sys)
sys.setdefaultencoding('utf-8')


# Helper class for cleaning up page tags
class Tool:
    # Remove img tags and runs of 7 spaces
    removeImg = re.compile('<img.*?>| {7}')
    # Remove hyperlink tags
    removeAddr = re.compile('<a.*?>|</a>')
    # Tags that end a line are replaced with \n
    replaceLine = re.compile('<tr>|<div>|</div>|</p>')
    # Table cell tags <td> are replaced with \t
    replaceTD = re.compile('<td>')
    # Paragraph openings are replaced with \n plus an indent
    replacePara = re.compile('<p.*?>')
    # Single or double <br> tags are replaced with \n
    replaceBR = re.compile('<br><br>|<br>')
    # Strip every remaining tag
    removeExtraTag = re.compile('<.*?>')
    # Remove forward slashes and backslashes
    removeLine1 = re.compile(r'/')
    removeLine2 = re.compile(r'\\')

    def replace(self, x):
        x = re.sub(self.removeImg, "", x)
        x = re.sub(self.removeAddr, "", x)
        x = re.sub(self.replaceLine, "\n", x)
        x = re.sub(self.replaceTD, "\t", x)
        x = re.sub(self.replacePara, "\n    ", x)
        x = re.sub(self.replaceBR, "\n", x)
        x = re.sub(self.removeExtraTag, "", x)
        # strip() removes leading and trailing whitespace
        return x.strip()

    def replaceSlash(self, x):
        x = re.sub(self.removeLine1, "", x)
        x = re.sub(self.removeLine2, "", x)
        return x.strip()


class GetBaiduTieba:
    def __init__(self, keyword):
        self.keyword = keyword
        self.tiebaUrl = 'http://tieba.baidu.com/f?kw=%s' % self.keyword
        self.tool = Tool()  # tag-cleaning helper
        self.info_list = []  # per post: URL, title, reply count, author, creation time
        self.tiezi_info_list = []  # per floor: floor number, author, time, content, comment count
        self.create_dir(self.keyword)
        self.tiezi_path = self.keyword + '/' + self.keyword + '.txt'
        self.tiezi_file = open(self.tiezi_path, 'w')

    # Create a directory if it does not exist yet
    def create_dir(self, path):
        if not os.path.exists(path):
            os.makedirs(path)

    # Fetch the content of a page
    def get_html(self, url):
        self.my_log(1, u'start crawl %s ...' % url)
        # Set the request header
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; rv:16.0) Gecko/20100101 Firefox/16.0'}
        req = urllib2.Request(url=url, headers=headers)
        try:
            html = urllib2.urlopen(req).read().decode('utf-8')
            # Convert HTML entities such as &quot; back into the characters they stand for
            html = HTMLParser.HTMLParser().unescape(html)
            #html = html.decode('utf-8','replace').encode(sys.getfilesystemencoding())  # transcode to avoid garbled output
        except urllib2.HTTPError, e:
            self.my_log(2, u"Failed to connect to Baidu Tieba, reason: %s " % e.code)
            return None
        except urllib2.URLError, e:
            if hasattr(e, 'reason'):
                self.my_log(2, u"Failed to connect to Baidu Tieba, reason: %s " % e.reason)
            return None
        return html

    # Custom log function; the log level is given as a number
    def my_log(self, log_leavel, msg):
        # 0: no output   1: main   2: error   3: warning
        log = {0: lambda: no_log(msg), 1: lambda: main_log(msg), 2: lambda: error_log(msg), 3: lambda: warning_log(msg)}

        def no_log(msg):
            pass

        def main_log(msg):
            print u'main: %s: %s' % (time.strftime('%Y-%m-%d_%H-%M-%S'), msg)

        def error_log(msg):
            print u'error: %s: %s' % (time.strftime('%Y-%m-%d_%H-%M-%S'), msg)

        def warning_log(msg):
            print u'warning:  %s: %s' % (time.strftime('%Y-%m-%d_%H-%M-%S'), msg)

        return log[log_leavel]()

    # Get the total number of posts in this forum
    def get_Total_Num(self, html):
        try:
            # Brute-force regex match on the pager block
            pattern = re.compile('<div id="frs_list_pager".*?class="next pagination-item ".*?href=(.*?)class=.*?</div>', re.S)
            # There is only one pager, so search() is enough
            result = re.search(pattern, html)
            # Search the matched text again for the number we want
            patternNum = re.compile('pn=(\d+)')
            Num = re.search(patternNum, result.group(1))
            PageNum = int(Num.group(1))
        except Exception, e:
            self.my_log(3, u'Could not find the number of posts in the forum, reason: %s' % e)
            return None
        # Logging and returning happen outside the try block: the original did both in
        # "finally", which raised NameError whenever the match had failed
        self.my_log(1, u'This forum has %s posts in total' % PageNum)
        return PageNum

    # Get the forum's detailed counters: threads, posts, posters
    def get_page_content(self, html):
        try:
            tiebaNumContent = []
            # The pattern captures three values
            pattern = re.compile('<div class="th_footer_l".*?<span.*?>(\d+)</span>.*?<span.*?>(\d+)</span>.*?class="red_text">(\d+).*?</div>', re.S)
            results = re.search(pattern, html)
            # group(0) is the whole match; groups 1 to 3 are the three parenthesised parts
            tiebaTheme = results.group(1)
            tiebaNum = results.group(2)
            tiebaPeople = results.group(3)
            tiebaNumContent.append(tiebaTheme)
            tiebaNumContent.append(tiebaNum)
            tiebaNumContent.append(tiebaPeople)  # the original appended the list to itself here
        except Exception, e:
            self.my_log(3, u'Could not find the number of posts in the forum, reason: %s' % e)
            return None
        self.my_log(1, u'This forum has %s threads and %s posts; %s people have posted here' % (tiebaTheme, tiebaNum, tiebaPeople))
        self.tiezi_file.write(u'This forum has %s threads and %s posts; %s people have posted here\n' % (tiebaTheme, tiebaNum, tiebaPeople))
        return tiebaNumContent

    # Fetch the post list of each page, stepping through the page offsets
    def getAll_tiezi_list(self, PageNum):
        if PageNum < 51:
            self.my_log(1, u'The forum has less than one page of posts')
            return None
        #for num in range(50,PageNum+1,50):
        #for num in range(50,500,50):
        # range(50,50,50) is empty, so no further list pages are fetched as written;
        # switch to one of the commented-out ranges above to crawl more pages
        for num in range(50, 50, 50):
            current_url = self.tiebaUrl + "&ie=utf-8&pn=" + str(num)
            target_Content = self.get_Single_Title_And_Url(current_url)

    # Get the posts listed on a single page
    def get_Single_Title_And_Url(self, url):
        info = []
        html = self.get_html(url)
        if not html:
            self.my_log(3, u'Failed to fetch page %s, skipping' % url)
            return None
        try:
            #pattern=re.compile('<li class=" j_thread_list clearfix".*?<span class="threadlist_rep_num center_text".*?>(.*?)</span>.*?<a href=(.*?)title=(.*?)target=.*?class="frs-author-name j_user_card".*?>(.*?)</a>.*?class="pull-right is_show_create_time".*?>(.*?)</span>.*?</li>',re.S)
            # Match every list item on the page
            pattern = re.compile('<li class=" j_thread_list clearfix".*?</li>', re.S)
            tiezi_Contents = re.findall(pattern, html)
            # Debug log
            self.my_log(1, u'Found %d posts on page %s' % (len(tiezi_Contents), url))
            # Matches reply count, post URL and post title
            replyNum_tirle_url_pattern = re.compile('<span class="threadlist_rep_num center_text".*?>(.*?)</span>.*?<a href="(.*?)" title="(.*?)" target=', re.S)
            # Matches author and creation time
            author_creattime_Pattern = re.compile('<span class="frs-author-name-wrap".*?target="_blank">(.*?)</a>.*?class="pull-right is_show_create_time".*?>(.*?)</span>', re.S)
            for item in tiezi_Contents:
                tmp = {}
                replyNum_tirle_url = re.search(replyNum_tirle_url_pattern, item)
                author_creattime = re.search(author_creattime_Pattern, item)
                replyNum_Tmp = replyNum_tirle_url.group(1)  # reply count
                tieziNum = str(replyNum_tirle_url.group(2))[-10:]  # post ID: last 10 characters of the link
                ttieziUrl_Tmp = 'http://tieba.baidu.com' + replyNum_tirle_url.group(2)  # post URL
                tieziTitle_Tmp = replyNum_tirle_url.group(3)  # post title
                author_Tmp = author_creattime.group(1)  # author
                creat_time_Tmp = author_creattime.group(2)  # creation time
                self.my_log(1, u"author:%s|created:%s|title:%s|url:%s|replies:%s|post id:%s" % (author_Tmp, creat_time_Tmp, tieziTitle_Tmp, ttieziUrl_Tmp, replyNum_Tmp, tieziNum))
                # Write the collected data to the file
                self.tiezi_file.write(u"author:%s|created:%s|title:%s|url:%s|replies:%s|post id:%s\n" % (author_Tmp, creat_time_Tmp, tieziTitle_Tmp, ttieziUrl_Tmp, replyNum_Tmp, tieziNum))
                tmp['replyNum'] = replyNum_Tmp
                tmp['tieziNum'] = tieziNum
                tmp['tieziUrl'] = ttieziUrl_Tmp
                tmp['tieziTitle'] = tieziTitle_Tmp.strip()
                tmp['author'] = author_Tmp.encode('utf-8')
                tmp['creat_time'] = creat_time_Tmp
                info.append(tmp)
                # Posts with 10 replies or fewer are dropped for now
                if int(replyNum_Tmp) > 10:
                    self.info_list.append(tmp)
            self.my_log(1, u"%d posts remain after matching" % len(info))
        except Exception, e:
            self.my_log(2, u'Error while matching data, skipping, reason: %s' % e)
        self.my_log(1, u'Finished searching page %s' % url)
        return info

    # Get the page count of every post. Posts with few replies were already
    # filtered out above, so no further filtering is needed here
    def get_each_tiezi_content(self):
        tmp_test_url = []  # URLs that raised an error
        tiezi_count = len(self.info_list)
        for i in range(tiezi_count):
            #tmp_single_info_list=random.choice(self.info_list)  # pick a random URL
            tmp_single_info_list = self.info_list[i]
            # The value after ?see_lz= decides whether only the original poster is shown
            tiezi_url = tmp_single_info_list['tieziUrl'] + '?see_lz=0&pn=1'
            tiezi_num = 1
            try:
                html = self.get_html(tiezi_url)
                # Fall back to one page when the page count cannot be found
                tiezi_num = self.get_tiezi_Page_Num(html) or 1
            except Exception, e:
                self.my_log(2, u'get_each_tiezi_content(): error while matching data, skipping, reason: %s' % e)
                tmp_test_url.append(tiezi_url)  # collect the failing URL
            finally:
                tmp_single_info_list['tiezi_page_num'] = tiezi_num  # store the page count with the post
                tmp_single_info_list['tieziUrl'] = tiezi_url  # store the rebuilt URL
                self.info_list[i] = tmp_single_info_list
                #for a,b in self.info_list[i].items():  # one way to iterate a dict; checks the data was added
                #  print a,b
                #self.my_log( 1,u'%s|%s|page count of this post: %s' % (sys._getframe().f_lineno,sys._getframe().f_code.co_name,str(tiezi_num)))
                #del self.info_list[0]
        self.my_log(1, u'Number of failing URLs: %d' % len(tmp_test_url))

    # Get the total number of pages in a post
    def get_tiezi_Page_Num(self, page):
        if not page:
            self.my_log(3, u'Failed to fetch the page content, skipping')
            return None
        # Pattern for the page count
        pattern = re.compile('<li class="l_reply_num.*?</span>.*?<span.*?>(.*?)</span>', re.S)
        result = re.search(pattern, page)
        if result:
            return result.group(1).strip()
        else:
            return None

    def mutil_thread(self):
        # Posts hold a lot of content, so only three threads are started, for three posts.
        # random.sample avoids picking the same post twice, and the threads are joined
        # only after all of them have started so that they actually run concurrently
        chosen = random.sample(self.info_list, min(3, len(self.info_list)))
        threads = []
        for tmp_single_info_list in chosen:
            p = threading.Thread(target=self.loop_for_every_tiezi, args=(tmp_single_info_list,))
            p.start()
            threads.append(p)
            time.sleep(3)
        for p in threads:
            p.join()
        #tmp_single_info_list=random.choice(self.info_list)  # pick a random post
        #self.loop_for_every_tiezi(tmp_single_info_list)

    def loop_for_every_tiezi(self, tmp_single_info_list):
        #self.my_log(1,u'thread %s is running...' % threading.current_thread().name)
        #tmp_single_info_list=random.choice(self.info_list)  # pick a random post
        page_num = tmp_single_info_list['tiezi_page_num']
        page_url_base = tmp_single_info_list['tieziUrl'][:-1]  # drop the trailing page number
        tiezi_num = tmp_single_info_list['tieziNum']
        path = self.keyword + '/' + tiezi_num
        self.create_dir(path)  # one directory per post, named after the post ID
        tiezi_file_path = path + '/' + tiezi_num + '.txt'  # file holding all replies of this post
        tiezi_file_symbol = open(tiezi_file_path, 'w')
        for num in range(1, int(page_num) + 1):
            current_url = page_url_base + str(num)  # rebuild the URL for this page
            current_tiezi_html = self.get_html(current_url)  # fetch this page of the post
            self.get_every_tiezi_content(current_tiezi_html, num, tiezi_file_symbol)
        tiezi_file_symbol.close()

    # Match the content of every floor on one page of a post
    def get_every_tiezi_content(self, html, num, tiezi_file_symbol):
        # Nested functions are defined first and called at the end.
        # Captures reply time, floor number, comment count, author and content of a floor
        author_content_floor_time_pattern = re.compile('data-field=.*?"date":"(.*?)".*?post_no":(\d+),.*?comment_num":(\d+),.*?>.*?<li class="d_name".*?target="_blank">(.*?)</a>.*?<div id="post_content_.*?>(.*?)</div>', re.S)

        def save_tiezi_content(floor_reply_content, tiezi_file_symbol):
            floor_reply_content = self.tool.replace(floor_reply_content)
            floor_reply_content = self.tool.replaceSlash(floor_reply_content)
            tiezi_file_symbol.write(floor_reply_content)
            temp_data = u'\n**********************separator*************************\n'
            tiezi_file_symbol.write(temp_data)

        # Parse one floor, record it and write it out (shared by both functions below)
        def save_floor(floor_html):
            item = re.search(author_content_floor_time_pattern, floor_html)
            tmp = {}
            floor_reply_time = item.group(1).strip()
            floor_post_num = item.group(2)
            floor_comment_num = item.group(3)
            floor_author = item.group(4)
            floor_reply_content = item.group(5)
            tmp['floor_reply_time'] = floor_reply_time
            tmp['floor_post_num'] = floor_post_num
            tmp['floor_comment_num'] = floor_comment_num
            tmp['floor_author'] = floor_author
            tmp['floor_reply_content'] = floor_reply_content
            self.tiezi_info_list.append(tmp)
            self.my_log(1, u'floor %s|author: %s|time: %s|comments on this floor: %s' % (floor_post_num, floor_author, floor_reply_time, floor_comment_num))
            tiezi_file_symbol.write(u'floor %s|author: %s|time: %s|comments on this floor: %s\n' % (floor_post_num, floor_author, floor_reply_time, floor_comment_num))
            # The reply content needs extra cleaning, so it is written separately
            save_tiezi_content(floor_reply_content, tiezi_file_symbol)

        # Only the first floor
        def get_every_tiezi_first_floor_content(html):
            try:
                # Pattern for the first floor
                first_floor_pattern = re.compile('<div class="l_post j_l_post l_post_bright noborder ".*?<div class="clear"></div>.*?</div>', re.S)
                item = re.search(first_floor_pattern, html)
                save_floor(item.group(0))
            except Exception, e:
                self.my_log(2, u'Failed to match the first floor, skipping, reason: %s' % e)

        # Every floor except the first
        def get_every_tiezi_not_first_floor_content(html):
            other_floor_pattern = re.compile('<div class="l_post j_l_post l_post_bright  ".*?<div class="clear"></div>.*?</div>', re.S)
            items = re.findall(other_floor_pattern, html)
            self.my_log(0, u'Not the first floor, found %s replies' % len(items))
            try:
                for floor_content in items:
                    save_floor(floor_content)
            except Exception, e:
                self.my_log(2, u'Failed while reading reply content of the other floors, skipping, reason: %s' % e)

        if not html:
            self.my_log(3, u'Failed to fetch the page content, skipping')
            return None
        # The first floor differs from the others and only appears on page 1 of the post
        if num == 1:
            get_every_tiezi_first_floor_content(html)
        get_every_tiezi_not_first_floor_content(html)

    def run(self):
        self.my_log(1, u'start crawl...')
        # step 1: fetch the forum's entry page
        tieba_html = self.get_html(self.tiebaUrl)
        # step 2: find out how much content the forum has
        Page_Num = self.get_Total_Num(tieba_html)
        tieba_Num_Content = self.get_page_content(tieba_html)
        # step 3: get the post list of the first page
        self.get_Single_Title_And_Url(self.tiebaUrl)
        # step 4: use Page_Num to get the post lists of the remaining pages
        self.getAll_tiezi_list(Page_Num)
        self.my_log(1, u'total length is %d' % len(self.info_list))
        # step 5: go through every post and get its page count
        self.get_each_tiezi_content()
        self.my_log(1, u'Start mutil thread to crawl...')
        time.sleep(3)
        # step 6: fetch the full content of the chosen posts
        self.mutil_thread()
        #self.loop_for_every_tiezi()
        self.my_log(1, u'total length is %d' % len(self.info_list))
        self.my_log(1, u'End crawl')
        # Close the file
        self.tiezi_file.close()

    # Test function
    def test(self):
        url = 'http://tieba.baidu.com/p/5062576866?see_lz=0&pn=1'
        html = self.get_html(url)
        # get_every_tiezi_content needs an open file, not a file name
        test_file = open(self.keyword + '/5062576866.txt', 'w')
        self.get_every_tiezi_content(html, 1, test_file)
        test_file.close()


if __name__ == '__main__':
    print '''
*****************************************
**    Welcome to Spider of baidutieba  **
**      Created on 2017-04-25          **
**      @author: Jimy _Fengqi          **
*****************************************'''
    keyword = raw_input(u'Enter the name of the forum to crawl: ')
    if not keyword:
        keyword = 'python'
    print 'Crawling the %s forum' % keyword
    #GetBaiduTieba(keyword).test()
    GetBaiduTieba(keyword).run()
    #mytieba=GetBaiduTieba(keyword)
    #mytieba.run()
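The final version crawls a few posts in separate threads. The general pattern, start every thread first and only then wait for all of them, can be sketched on its own in Python 3 as below. The names crawl_concurrently and worker are hypothetical; worker stands in for whatever function crawls a single post, and here it is replaced by a trivial function so the sketch runs without network access.

```python
import threading


def crawl_concurrently(posts, worker, max_threads=3):
    """Run `worker` on up to `max_threads` posts in parallel and collect the results."""
    results = {}
    lock = threading.Lock()

    def run(post):
        value = worker(post)
        with lock:  # protect the shared dict while storing the result
            results[post] = value

    threads = [threading.Thread(target=run, args=(post,))
               for post in posts[:max_threads]]
    for t in threads:
        t.start()   # start them all first ...
    for t in threads:
        t.join()    # ... then wait for all of them
    return results


print(crawl_concurrently(["5081359666", "5062576866"], worker=len))
```

Calling join() immediately after each start() would make the threads run one after another, which defeats the purpose of using threads.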
