MongoDB Crawler Practice: Hupu Forum

Goal:

The goal of this exercise is to scrape the data for all posts on the Buxingjie (步行街) board of the Hupu forum. The site address is:
https://bbs.hupu.com/bxj

Code:

import requests
from bs4 import BeautifulSoup
import datetime
import re


def get_page(link):
    # Send a browser-like User-Agent header with the request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/70.0.3538.25 Safari/537.36 '
                      'Core/1.70.3741.400 QQBrowser/10.5.3863.400'
    }
    r = requests.get(link, headers=headers)
    html = r.content               # raw bytes of the response body
    html = html.decode('UTF-8')    # decode the bytes into a str
    soup = BeautifulSoup(html, 'lxml')
    return soup


def get_data(post_list):
    data_list = []
    for post in post_list:
        # Title
        title = post.find('div', class_="titlelink box").a.text
        # Author
        author = post.find('div', class_="author box").a.text
        # Time posted
        date_time = post.find('div', class_="author box").contents[5].text
        # Reply and view counts, shown on the page as "replies / views"
        reply_view = post.find('span', class_="ansour box").text.strip()
        reply = reply_view.split('/')[0].strip()
        view = reply_view.split('/')[1].strip()
        # Last reply: user and time
        last_reply = post.find('div', class_="endreply box").a.text
        last_reply_time = post.find('div', class_="endreply box").span.text
        data_list.append([title, author, date_time, reply, view,
                          last_reply, last_reply_time])
    return data_list


link = "https://bbs.hupu.com/bxj"
soup = get_page(link)
post_list = soup.find('ul', class_="for-list").find_all('li')
data_list = get_data(post_list)
for each in data_list:
    print(each)

The result is as follows:

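In the code above, the get_page() function fetches the page content. Unlike the earlier examples, it reads the response through r.content rather than r.text. The site serves gzip-compressed pages, which requests decompresses transparently; r.content returns the resulting raw bytes, and the code decodes them explicitly from UTF-8 into a Unicode string instead of relying on the encoding that requests infers for r.text.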
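To make the bytes-versus-text distinction concrete, here is a minimal offline sketch. It uses a hard-coded byte string as a stand-in for a real r.content (an assumption, so that no network request is needed) and shows what happens when UTF-8 bytes are decoded with the wrong codec. ISO-8859-1 is chosen because it is the fallback requests uses for text responses that declare no charset; whether Hupu's responses trigger that fallback is not something the original states.

```python
# Stand-in for r.content: the UTF-8 bytes of a short Chinese string.
# (Hypothetical sample data, not taken from the site.)
raw = "虎扑步行街".encode("utf-8")

# Decoding explicitly as UTF-8, as get_page() does, recovers the text.
good = raw.decode("utf-8")

# Decoding the same bytes as ISO-8859-1 does not raise an error,
# but every byte becomes its own character, producing mojibake.
bad = raw.decode("iso-8859-1")

print(good)
print(len(raw), len(good), len(bad))
```

Because the wrong codec fails silently, decoding r.content with a known encoding is a reasonable way to rule this problem out.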

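Storing the scraped data in MongoDB

A class named MongoAPI wraps the database connection and the common operations on it. The code below scrapes only one page of data.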

import requests
from bs4 import BeautifulSoup
import datetime
from pymongo import MongoClient


class MongoAPI(object):
    def __init__(self, db_ip, db_port, db_name, table_name):
        self.db_ip = db_ip
        self.db_port = db_port
        self.db_name = db_name
        self.table_name = table_name
        self.conn = MongoClient(host=self.db_ip, port=self.db_port)
        self.db = self.conn[self.db_name]
        self.table = self.db[self.table_name]

    def get_one(self, query):
        # Return one matching document, without its _id field
        return self.table.find_one(query, projection={"_id": False})

    def get_all(self, query):
        # Return a cursor over all documents matching the query
        return self.table.find(query)

    def add(self, kv_dict):
        # Insert one document into the collection.
        # insert_one() replaces insert(), which was removed in PyMongo 4.
        return self.table.insert_one(kv_dict)

    def delete(self, query):
        # Delete all documents matching the query
        return self.table.delete_many(query)

    def check_exist(self, query):
        # Return True if at least one document matches, otherwise False
        ret = self.table.find_one(query)
        return ret is not None

    def update(self, query, kv_dict):
        # Update the first matching document; create one if none matches
        self.table.update_one(query, {'$set': kv_dict}, upsert=True)


def get_page(link):
    # Send a browser-like User-Agent header with the request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/70.0.3538.25 Safari/537.36 '
                      'Core/1.70.3741.400 QQBrowser/10.5.3863.400'
    }
    r = requests.get(link, headers=headers)
    html = r.content
    html = html.decode('UTF-8')
    soup = BeautifulSoup(html, 'lxml')
    return soup


def get_data(post_list):
    data_list = []
    for post in post_list:
        # Title
        title = post.find('div', class_="titlelink box").a.text
        # Author
        author = post.find('div', class_="author box").a.text
        # Time posted
        date_time = post.find('div', class_="author box").contents[5].text
        # Reply and view counts, shown on the page as "replies / views"
        reply_view = post.find('span', class_="ansour box").text.strip()
        reply = reply_view.split('/')[0].strip()
        view = reply_view.split('/')[1].strip()
        # Last reply: user and time
        last_reply = post.find('div', class_="endreply box").a.text
        last_reply_time = post.find('div', class_="endreply box").span.text
        data_list.append([title, author, date_time, reply, view,
                          last_reply, last_reply_time])
    return data_list


link = "https://bbs.hupu.com/bxj"
soup = get_page(link)
post_list = soup.find('ul', class_="for-list").find_all('li')
data_list = get_data(post_list)
hupu_post = MongoAPI("localhost", 27017, "hupu", "post")
for each in data_list:
    hupu_post.add({
        'title': each[0],
        'author': each[1],
        'date_time': each[2],
        'reply': each[3],
        'view': each[4],
        'last_reply': each[5],
        'last_reply_time': each[6],
    })

Scraping all data from the first 10 pages

import time

# Reuses get_page(), get_data() and MongoAPI from the previous listing
hupu_post = MongoAPI("localhost", 27017, "hupu", "post")
for i in range(1, 11):
    link = "https://bbs.hupu.com/bxj-" + str(i)
    soup = get_page(link)
    post_list = soup.find('ul', class_="for-list").find_all('li')
    data_list = get_data(post_list)
    for each in data_list:
        # Some posts were already written by the previous run, so use
        # update (with upsert) here to avoid creating duplicate records
        hupu_post.update({'title': each[0]}, {
            'title': each[0],
            'author': each[1],
            'date_time': each[2],
            'reply': each[3],
            'view': each[4],
            'last_reply': each[5],
            'last_reply_time': each[6],
        })
    print('Page', i, 'done; sleeping for 3 seconds')
    time.sleep(3)
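The effect of updating with upsert=True, keyed on the title, can be illustrated without a running MongoDB server. The sketch below is a simplified in-memory stand-in: the function upsert_by_title and the sample posts are hypothetical, and it models only the "match on title, set fields, insert if missing" behavior, not MongoDB itself.

```python
def upsert_by_title(store, post):
    """Mimic update_one({'title': ...}, {'$set': post}, upsert=True)
    on a plain list of dicts. Returns True if a new record was inserted."""
    for doc in store:
        if doc["title"] == post["title"]:
            doc.update(post)       # match found: overwrite the given fields
            return False           # nothing new was inserted
    store.append(dict(post))       # no match: insert a new record
    return True


store = []
first = upsert_by_title(store, {"title": "Post A", "reply": "10"})
# Same title scraped again later, with a newer reply count
second = upsert_by_title(store, {"title": "Post A", "reply": "12"})
third = upsert_by_title(store, {"title": "Post B", "reply": "3"})

print(len(store))         # number of stored records
print(store[0]["reply"])  # reply count currently stored for "Post A"
```

One consequence of keying on the title alone is that two different posts with identical titles would be merged into a single record. If the post link were also scraped, it would likely make a safer key.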

The result is as follows:

Viewing the database
