Python Crawler: Scraping Baidu Tieba Posts

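#!/usr/bin/python
# -*- coding: utf-8 -*-

import urllib2
import re


# Helper class that cleans up page markup.
# NOTE: the tag patterns were stripped when this listing was extracted;
# the ones below are reconstructed from the comments and are assumptions.
class Tool:
    # Remove img tags and runs of 7 spaces
    removeImg = re.compile('<img.*?>| {7}')
    # Remove hyperlink tags (the link text is kept)
    removeAddr = re.compile('<a.*?>|</a>')
    # Tags that end a line are replaced with \n
    replaceLine = re.compile('<tr>|<div>|</div>|</p>')
    # Table cells (<td>) are replaced with \t
    replaceTD = re.compile('<td>')
    # A paragraph start is replaced with \n plus a two-space indent
    replacePara = re.compile('<p.*?>')
    # A single or double line break is replaced with \n
    replaceBR = re.compile('<br><br>|<br>')
    # Strip any remaining tags
    removeExtraTag = re.compile('<.*?>')

    def replace(self, x):
        x = re.sub(self.removeImg, "", x)
        x = re.sub(self.removeAddr, "", x)
        x = re.sub(self.replaceLine, "\n", x)
        x = re.sub(self.replaceTD, "\t", x)
        x = re.sub(self.replacePara, "\n  ", x)
        x = re.sub(self.replaceBR, "\n", x)
        x = re.sub(self.removeExtraTag, "", x)
        # strip() removes leading and trailing whitespace
        return x.strip()


# Baidu Tieba crawler
class BDTB:
    # Initialize with the base URL, the "original poster only" flag
    # and the floor-separator flag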

    def __init__(self, baseUrl, seeLZ, floorTag):
        self.baseURL = baseUrl
        self.seeLZ = '?see_lz=' + str(seeLZ)
        self.tool = Tool()
        # File object used for all writes
        self.file = None
        # Floor number, starting at 1
        self.floor = 1
        # Default title, used when the real one cannot be parsed
        self.defaultTitle = u"Baidu Tieba post"
        # Whether to write a separator line before each floor
        self.floorTag = floorTag

    # Fetch the HTML of the given page of the post
    def getPage(self, pageNum):
        try:
            url = self.baseURL + self.seeLZ + '&pn=' + str(pageNum)
            request = urllib2.Request(url)
            response = urllib2.urlopen(request)
            return response.read().decode('utf-8')
        except urllib2.URLError, e:
            if hasattr(e, "reason"):
                print u"Failed to connect to Baidu Tieba, reason:", e.reason
            return None

    # Get the post title

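    # NOTE: the three HTML patterns in the methods below were stripped when
    # this listing was extracted. The replacements assume Tieba markup with
    # an <h1 class="core_title_txt ...">, an <li class="l_reply_num ...">
    # and <div id="post_content_..."> elements; adjust them to the live page.
    def getTitle(self, page):
        pattern = re.compile('<h1 class="core_title_txt.*?>(.*?)</h1>', re.S)
        result = re.search(pattern, page)
        if result:
            return result.group(1).strip()
        else:
            return None

    # Get the number of pages in the post
    def getPageNum(self, page):
        pattern = re.compile('<li class="l_reply_num.*?</span>.*?<span.*?>(.*?)</span>', re.S)
        result = re.search(pattern, page)
        if result:
            return result.group(1).strip()
        else:
            return None

    # Get the content of every floor on the given page.
    # The page passed in is used as-is; re-fetching page 1 here would
    # write the first page over and over.
    def getContent(self, page):
        pattern = re.compile('<div id="post_content_.*?>(.*?)</div>', re.S)
        items = re.findall(pattern, page)
        contents = []
        for item in items:
            # Clean the markup and pad each post with newlines
            content = "\n" + self.tool.replace(item) + "\n"
            contents.append(content.encode('utf-8'))
        return contents

    # Open the output file, named after the post title
    def setFileTitle(self, title):
        if title is not None:
            self.file = open(title + ".txt", "w+")
        else:
            self.file = open(self.defaultTitle + ".txt", "w+")

    # Write the floors to the file, with a separator line if requested
    def writeData(self, contents):
        for item in contents:
            if self.floorTag == '1':
                floorline = "\n" + str(self.floor) + u"-------------------------------------\n"
                self.file.write(floorline)
            self.file.write(item)
            self.floor += 1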

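    # Fetch every page of the post and write it to the output file
    def start(self):
        indexPage = self.getPage(1)
        if indexPage is None:
            print "The URL is no longer valid, please try again"
            return
        pageNum = self.getPageNum(indexPage)
        title = self.getTitle(indexPage)
        if pageNum is None:
            print "The URL is no longer valid, please try again"
            return
        self.setFileTitle(title)
        try:
            print "This post has " + str(pageNum) + " pages"
            for i in range(1, int(pageNum) + 1):
                print "Writing page " + str(i)
                page = self.getPage(i)
                if page is None:
                    # Skip pages that could not be fetched
                    continue
                contents = self.getContent(page)
                self.writeData(contents)
        except IOError, e:
            print "Write failed, reason: " + str(e)
        finally:
            # Always close the file, whether or not the write succeeded
            self.file.close()
            print "Finished"


print u"Enter the post ID"
baseURL = 'http://tieba.baidu.com/p/' + str(raw_input(u'http://tieba.baidu.com/p/'))
seeLZ = raw_input("Only show the original poster? Enter 1 for yes, 0 for no\n")
floorTag = raw_input("Write floor separators? Enter 1 for yes, 0 for no\n")
bdtb = BDTB(baseURL, seeLZ, floorTag)
bdtb.start()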
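
The tag-cleaning step done by `Tool.replace` can be tried offline, without fetching anything from Tieba. The following is a minimal Python 3 sketch (the listing above targets Python 2). The helper name `clean_post_html`, the `_RULES` table and the sample HTML string are made up for illustration, and the tag patterns are assumptions based on the comments in the `Tool` class, not a guaranteed match for Tieba's current markup.

```python
import re

# Ordered (pattern, replacement) pairs. Order matters: the catch-all
# tag remover must run last, after the tags that carry layout meaning.
_RULES = [
    (re.compile(r'<img.*?>| {7}'), ''),             # images and runs of 7 spaces
    (re.compile(r'<a.*?>|</a>'), ''),               # hyperlink tags, link text kept
    (re.compile(r'<tr>|<div>|</div>|</p>'), '\n'),  # block boundaries -> newline
    (re.compile(r'<td>'), '\t'),                    # table cells -> tab
    (re.compile(r'<p.*?>'), '\n  '),                # paragraph start -> newline + indent
    (re.compile(r'<br><br>|<br>'), '\n'),           # line breaks -> newline
    (re.compile(r'<.*?>'), ''),                     # any remaining tag
]


def clean_post_html(html):
    """Return the visible text of one post body (hypothetical helper)."""
    for pattern, replacement in _RULES:
        html = pattern.sub(replacement, html)
    return html.strip()


if __name__ == '__main__':
    # Made-up sample resembling a single post body
    sample = ('<div id="post_content_1">Hello <a href="#">world</a><br>'
              'second line<img src="x.png"></div>')
    print(clean_post_html(sample))
```

Keeping the rules in one ordered list makes the dependency between them visible: if the catch-all `<.*?>` rule ran first, the `<br>` and `<td>` tags would be deleted before they could be turned into newlines and tabs.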
