Python crawler | Scraping comments from the Honghu forum

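This time I scrape the comments of one thread on the Honghu forum (鸿鹄论坛, bbs.hh010.com).

The page in this example is very easy to scrape because it is a static page, so this time I also added multithreading and saving the results to a database.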
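
A minimal sketch of why a static page is convenient: the data is already in the HTML text that comes back, and the page URLs follow a fixed pattern, so plain string handling is enough. The URL pattern and the space-uid-....html link shape are taken from the script in this post; the HTML fragment, the helper names page_url and extract_user_ids, and the assumption that user ids are purely numeric are mine, for illustration only.

```python
import re

# Hypothetical HTML fragment standing in for a downloaded forum page.
SAMPLE_HTML = """
<strong><a class="xi2" href="space-uid-1001.html">alice</a></strong>
<strong><a class="xi2" href="space-uid-2002.html">bob</a></strong>
"""


def page_url(thread_id, page):
    """Build the URL of one page of a thread (pattern as used in the script)."""
    return f"https://bbs.hh010.com/thread-{thread_id}-{page}-1.html"


def extract_user_ids(html):
    """Return every user id found in space-uid-<id>.html links.

    Assumes ids are numeric; the dot before 'html' is escaped so it only
    matches a literal dot.
    """
    return re.findall(r"space-uid-(\d+)\.html", html)


print(page_url(568709, 2))
print(extract_user_ids(SAMPLE_HTML))
```

No JavaScript rendering or browser automation is involved here, which is what makes this kind of page a good place to practise threading and database code.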

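The comments and the code are all below.
The original poster's own post is not included in the scraped comments
(it is not a comment, after all~~).
In the save function I only wrote part of the detailed-information scraping; feel free to extend it yourself.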
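
If you do extend the detail scraping in save, one possible way is to pair labels with values instead of indexing details[0] to details[5] by hand. This is only a sketch: the labels are the keys used in the script in this post, while the helper name build_detail_dict and the sample values are hypothetical.

```python
# Field labels in the order the script reads them from the profile page:
# online time, registration time, last visit, last activity, last post, time zone.
DETAIL_LABELS = ['在线时间', '注册时间', '最后访问', '上次活动时间', '上次发表时间', '所在时区']


def build_detail_dict(details, labels=DETAIL_LABELS):
    """Pair each label with the value at the same position.

    zip stops at the shorter sequence, so a profile that exposes fewer
    fields yields a smaller dict instead of raising IndexError.
    """
    return {label: value.strip() for label, value in zip(labels, details)}


# Hypothetical values, as the XPath query might return them.
sample = [' 12 hours', '2019-1-1 10:00', '2020-2-2 12:00']
print(build_detail_dict(sample))
```

Adding another field then means appending one label to the list, provided the page really lists the values in that order.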

import queue
import random
import re
import threading
import time

import pymongo
import requests
from lxml import etree

exitflag = 0  # end flag: 1 means the main thread has finished filling the queue
user_url_list = []
myclient = pymongo.MongoClient("mongodb://localhost:27017/")
mydb = myclient["honghu_db"]
db_comment = mydb["comment"]


# HonghuComment fetches the comment data and saves it
class HonghuComment(object):
    def __init__(self):
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
        }

    # Fetch the first page and save each comment plus a brief user record to the database
    def gethtml1(self, url):
        print('page:1')
        r = requests.get(url, headers=self.headers)
        selector = etree.HTML(r.text)
        names = selector.xpath('//a[@class="xw1"]/text()')
        pids = selector.xpath('//*[@class="p_pop blk bui card_gender_"]/@id')
        user_urls = selector.xpath('//strong/a[@class="xi2"]/@href')
        for i in range(len(names)):
            # pids and user_urls are offset by one on the first page,
            # where the original poster's post is not treated as a comment
            pid = pids[i + 1][-8:]
            user_id = re.findall(r'space-uid-(.*?)\.html', user_urls[i + 1])[0].strip()
            print('name:', names[i], ' pid:', pid, '\nuser_id:', user_id)
            comment_dict = {'name': names[i], 'pid': pid, 'user_id': user_id}
            user_url_list.append(user_urls[i + 1])
            response_time = selector.xpath('//em[@id="authorposton' + str(pid) + '"]/text()')[0]
            print('response_time:', response_time)
            comment_dict['response_time'] = response_time
            comment = selector.xpath('//td[@id="postmessage_' + str(pid) + '"]/text()')[0].strip()
            if comment == '':
                # No text: the comment consists of images, so store their paths instead
                comment_images = selector.xpath('//td[@id="postmessage_' + str(pid) + '"]/img/@src')
                for comment_image in comment_images:
                    print('https://bbs.hh010.com/' + comment_image)
                comment_dict['comment_images'] = comment_images
            else:
                print('comment:', comment)
                comment_dict['comment'] = comment
            db_comment.insert_one(comment_dict)
        time.sleep(5)

    # Fetch pages 2 to n and save each comment plus a brief user record to the database
    def gethtml(self, url, i):
        print('page:', i)
        time.sleep(random.randint(2, 3) + random.random())  # politely pause for a few random seconds
        try:
            r = requests.get(url, headers=self.headers)
            selector = etree.HTML(r.text)
            names = selector.xpath('//a[@class="xw1"]/text()')
            pids = selector.xpath('//*[@class="p_pop blk bui card_gender_"]/@id')
            user_urls = selector.xpath('//strong/a[@class="xi2"]/@href')
            for j in range(len(names)):  # j, so the page number i is not shadowed
                pid = pids[j][-8:]
                user_id = re.findall(r'space-uid-(.*?)\.html', user_urls[j])[0].strip()
                print('name:', names[j], ' pid:', pid, '\nuser_id:', user_id)
                comment_dict = {'name': names[j], 'pid': pid, 'user_id': user_id}
                user_url_list.append(user_urls[j])
                response_time = selector.xpath('//em[@id="authorposton' + str(pid) + '"]/text()')[0]
                print('response_time:', response_time)
                comment_dict['response_time'] = response_time
                comment = selector.xpath('//td[@id="postmessage_' + str(pid) + '"]/text()')[0].strip()
                if comment == '':
                    comment_images = selector.xpath('//td[@id="postmessage_' + str(pid) + '"]/img/@src')
                    for comment_image in comment_images:
                        print('https://bbs.hh010.com/' + comment_image)
                    comment_dict['comment_images'] = comment_images
                else:
                    print('comment:', comment)
                    comment_dict['comment'] = comment
                db_comment.insert_one(comment_dict)  # insert the record
        except Exception as e:
            print('Error', e)


# Save each user's detailed data to the database
def save(name, q):
    # Keep working until the main thread has set exitflag AND the queue is empty,
    # so items still waiting in the queue are not dropped
    while not (exitflag and q.empty()):
        try:
            num, id_url = q.get(timeout=1)
        except queue.Empty:
            continue
        try:  # one failing user must not stop the whole thread
            time.sleep(random.randint(4, 9) + random.random())  # politely pause for a few random seconds
            r_save = requests.get(id_url)
            selector_save = etree.HTML(r_save.text)
            user_id = re.findall(r'space-uid-(.*?)\.html', id_url)[0].strip()
            user_group = re.findall(r'href=\"https://bbs.hh010.com/home.php\?mod=spacecp&amp;ac=usergroup&amp;.*?>(.*?)</a', r_save.text, re.M | re.S)[0].strip()
            print(name, ' ', num, ' user_id:', user_id, ' user_group:', user_group)
            id_find = db_comment.find_one({'user_id': user_id})
            detail_update = db_comment.update_one({'_id': id_find['_id']}, {'$set': {'user_group': user_group}})
            print(db_comment.find_one({'user_id': user_id}))
            details = selector_save.xpath('//ul[@id="pbbs"]/li/text()')
            # Keys: online time, registration time, last visit,
            # last activity time, last post time, time zone
            detail_dict = {'在线时间': details[0], '注册时间': details[1], '最后访问': details[2],
                           '上次活动时间': details[3], '上次发表时间': details[4], '所在时区': details[5]}
            detail_update = db_comment.update_one({'_id': id_find['_id']}, {'$set': detail_dict})
            print(detail_update)
        except Exception as e:
            print('Error', e)


# Worker thread (a thread, not a process)
class myThread(threading.Thread):
    def __init__(self, name, q):
        threading.Thread.__init__(self, name=str(name))
        self.q = q

    def run(self):
        print("Starting thread: " + self.name)
        save(self.name, self.q)
        print("Exiting thread: " + self.name)


if __name__ == '__main__':
    url1 = 'https://bbs.hh010.com/thread-568709-1-1.html'
    hc = HonghuComment()
    hc.gethtml1(url1)
    # Choose the pages to walk through
    for i in range(2, 3):
        url = 'https://bbs.hh010.com/thread-568709-' + str(i) + '-1.html'
        hc.gethtml(url, i)
    starttime = time.time()
    q = queue.Queue(3)  # the queue holds at most 3 pending items
    threads = []
    # Choose the number of threads to use
    for threadname in range(3):
        thread = myThread(threadname, q)
        thread.start()
        threads.append(thread)
    # Fill the queue
    for i in range(len(user_url_list)):
        q.put([i, user_url_list[i]])
    exitflag = 1
    # Wait for all threads to finish
    for t in threads:
        t.join()
    endtime = time.time()
    print("Exiting main thread")
    time.sleep(5)
    print(endtime - starttime)
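
As an alternative to the global exitflag, the same thread-plus-queue idea can be sketched with sentinel values: the main thread puts one None per worker after the real items, and each worker stops when it receives one. This is not the approach used in the script above, just a different technique for the same job. The function fetch_user_group is a hypothetical stand-in for the real request and parsing, and the names worker and run are mine.

```python
import queue
import threading


def fetch_user_group(url):
    """Hypothetical stand-in for the real HTTP request and parsing."""
    return 'group-for-' + url


def worker(q, results, lock):
    """Take (num, url) items from the queue until a None sentinel arrives."""
    while True:
        item = q.get()
        if item is None:  # sentinel: no more work for this thread
            break
        num, url = item
        try:
            group = fetch_user_group(url)
            with lock:  # protect the shared dict
                results[num] = group
        except Exception as e:
            # One failing item must not stop the thread
            print('Error', e)


def run(urls, n_threads=3):
    q = queue.Queue(maxsize=n_threads)
    results = {}
    lock = threading.Lock()
    threads = [threading.Thread(target=worker, args=(q, results, lock))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for item in enumerate(urls):  # (index, url) pairs
        q.put(item)
    for _ in threads:  # one sentinel per worker, after all real items
        q.put(None)
    for t in threads:
        t.join()
    return results


print(run(['space-uid-1.html', 'space-uid-2.html', 'space-uid-3.html']))
```

Because the sentinels are queued after every real item, no item can be left unprocessed when the workers exit, and the workers block on q.get() instead of polling.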

This crawler is for learning purposes only.
