Static Web Page Crawler in Practice (Part 2)

Using Chouti (dig.chouti.com) as the example, this post crawls a user's comments and stores them in a MongoDB database.

"""
Connect to MongoDB, then import the scraped data.
"""
import requests
import urllib.request
import re
from bs4 import BeautifulSoup
import pymongo
import time
from datetime import datetime, timedelta


def transTime(rtime):
    # Convert a relative time label into a (days, hours, minutes) tuple.
    # The label is matched as UTF-8 bytes:
    #   \xe5\xa4\xa9             = "天"   (days)
    #   \xe5\xb0\x8f\xe6\x97\xb6 = "小时" (hours)
    #   \xe5\x88\x86\xe9\x92\x9f = "分钟" (minutes)
    #   \xe5\xb0\x8f\xe4\xba\x8e = "小于" (less than)
    rtime = rtime.encode('utf-8')
    if b"\xe5\xa4\xa9" in rtime:
        # The label contains days: capture days and hours.
        res = re.search(b"(.*)\xe5\xa4\xa9(.*)\xe5\xb0\x8f\xe6\x97\xb6", rtime)
        days = str(res[1], encoding='utf-8')
        hours = str(res[2], encoding='utf-8')
        return days, hours, 0
    elif b"\xe5\xb0\x8f\xe6\x97\xb6" in rtime:
        # The label contains hours: capture hours and minutes.
        res = re.search(b"(.*)\xe5\xb0\x8f\xe6\x97\xb6(.*)\xe5\x88\x86\xe9\x92\x9f", rtime)
        hours = str(res[1], encoding='utf-8')
        minutes = str(res[2], encoding='utf-8')
        return 0, hours, minutes
    elif b'\xe5\xb0\x8f\xe4\xba\x8e' in rtime:
        # Less than 1 minute ago.
        return 0, 0, 0
    else:
        # Minutes only.
        minutes = re.search(b"^(.*)\xe5\x88\x86", rtime).group(1)
        minutes = str(minutes, encoding='utf-8')
        return 0, 0, minutes


def transDatetime(days, hours, minutes):
    # Subtract the relative offset from the current time to get an absolute datetime.
    now = datetime.now()
    d1 = now - timedelta(minutes=int(minutes))
    if hours != 0:
        d1 = d1 - timedelta(hours=int(hours))
    if days != 0:
        d1 = d1 - timedelta(days=int(days))
    return d1


def saveComments(n):
    # Store the data in MongoDB. Reference: https://www.jianshu.com/p/7d14c3ad810f
    # Create the connection. MongoDB runs on the local machine, so "localhost"
    # (or 127.0.0.1) is enough; 27017 is the port.
    client = pymongo.MongoClient("localhost", 27017)
    db = client['mydb']  # database to use
    collection = db['saveComments_1']  # collection to use
    # Loop over the page numbers to crawl each page in turn.
    # References: https://www.cnblogs.com/dudududu/p/8823871.html
    #             https://blog.csdn.net/weixin_41032076/article/details/80171640
    for page in range(n):  # for example, 10 pages
        url = 'https://dig.chouti.com/user/cocolary/comments/' + str(page)
        headers2 = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                                  'Chrome/51.0.2704.63 Safari/537.36'}
        request = urllib.request.Request(url, headers=headers2)
        answer = urllib.request.urlopen(request)
        html_text = answer.read()
        soup = BeautifulSoup(html_text.decode('utf-8'), 'html.parser')
        # Find the div whose class attribute is content-list.
        news_list = soup.find('div', {'class': 'content-list'})
        # Find every div under news_list.
        news = news_list.find_all('div')
        # Iterate over the divs.
        for item in news:
            try:
                comment_time = item.find('div', {'class': 'comment-time'}).get_text().strip()  # time label
                com = item.find('span', {'class': 'text-comment-con'}).get_text().strip()  # comment text
                source = item.find('span', {'class': 'content-source'}).get_text().strip()  # source
                Section = item.find('span', {'class': 'content-kind'}).get_text().strip()  # source section
                ding_num = item.find('span', {'class': 'ding-num'}).get_text().strip()  # ding-num (upvotes)
                cai_num = item.find('span', {'class': 'cai-num'}).get_text().strip()  # cai-num (downvotes)
                title_content = item.find('div', {'class': 'comment-title'}).find('a').get_text().strip()  # news text
                title_href = item.find('div', {'class': 'comment-title'}).find('a').get('href')  # news href
                state_href = item.find('div', {'class': 'comment-state'}).find('a').get('href')  # comment state href
                days, hours, minutes = transTime(comment_time)
                comment_time = transDatetime(days, hours, minutes)
                # state_href contains the news ID followed by the comment ID.
                bianhao = re.findall(r"\d+\.?\d*", state_href)
                k = 0
                for number in bianhao:
                    if k == 0:
                        title_id = number  # news ID
                        k = 1
                    else:
                        comment_id = number  # comment ID
                        k = 0
                data = {}
                data['CommentTime'] = comment_time.strftime("%Y-%m-%d %H:%M:%S")  # comment time
                data['com_content'] = com  # comment text
                source = source[1:]  # drop the dash in front of the source (Weibo, WeChat, ...)
                data['Source'] = source
                data['Section'] = Section
                data['news_content'] = title_content  # news text
                # Strip the square brackets.
                ding_num = ding_num.replace('[', '').replace(']', '')
                cai_num = cai_num.replace('[', '').replace(']', '')
                data['Ups'] = int(ding_num)
                data['Downs'] = int(cai_num)
                data['NID'] = int(title_id)
                data['CID'] = int(comment_id)
                collection.insert_one(data)  # insert the record
            except AttributeError:
                # Skip divs that lack any of the expected elements.
                continue
        answer.close()


def saveInfo(soup, num):
    # Store the data in MongoDB. Reference: https://www.jianshu.com/p/7d14c3ad810f
    # Create the connection. MongoDB runs on the local machine, so "localhost"
    # (or 127.0.0.1) is enough; 27017 is the port.
    client = pymongo.MongoClient("localhost", 27017)
    db = client['mydb']  # database to use
    collection = db['saveInfo_1']  # collection to use
    name = soup.find('div', {'class': "tu"}).get_text().strip()  # username
    eare = soup.find('div', {'class': "tu-m"}).find_all('span')
    # servetime = soup.find('div',{'class':"medal"}).get_text().strip()   populated by JavaScript
    signNature = soup.find('div', {'class': "tu-b"}).get_text().strip()
    score = soup.find('div', {'class': "profile-B_2"}).find('span').get_text().strip()
    score = int(score)
    k = 0
    for i in eare:
        if k == 0:
            eare_one = i.get_text().strip()
            k = 1
        else:
            eare_two = i.get_text().strip()
            break
    # sex = soup.find('div',{'class':"tum_sex"}).get_text()   populated by JavaScript
    posts = int(soup.find(id='shu_fa').get_text().strip())
    recommend = int(soup.find(id='shu_digg').get_text().strip())
    all_comments_num = num
    Info = {}
    Info["Nick"] = name
    Info["posts"] = posts
    Info["eare_one"] = eare_one
    Info["eare_two"] = eare_two
    Info["jifen"] = score
    Info["recommend"] = recommend
    Info["all_comments_num"] = all_comments_num
    Info["signNatur"] = signNature
    collection.insert_one(Info)  # insert the record


url = 'https://dig.chouti.com/user/cocolary/comments/1'
headers2 = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/51.0.2704.63 Safari/537.36'}
request = urllib.request.Request(url, headers=headers2)
answer = urllib.request.urlopen(request)
html_text = answer.read()
soup = BeautifulSoup(html_text.decode('utf-8'), 'html.parser')
# Get the number of comment pages (the division assumes 15 comments per page).
comments = soup.find(id="shu_comment")
num = int(comments.text)
numm = num / 15 + 1
n = int(numm)
# Save the crawled comments to MongoDB.
# saveComments(2)
# Save the crawled profile information to MongoDB.
saveInfo(soup, num)
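
The byte-level matching in transTime is hard to read at a glance. As a rough sketch of the same idea, the version below works on str directly and returns a timedelta. It assumes the labels look like "3天2小时前", "5小时12分钟前", "8分钟前" or "小于1分钟前"; these formats are inferred from the regular expressions in transTime and have not been checked against the live site. The names parse_relative_time and to_absolute are hypothetical and do not appear in the original script.

```python
import re
from datetime import datetime, timedelta

# Every unit is optional. The label formats are an assumption inferred
# from transTime, not a documented format of the site.
_RELATIVE = re.compile(
    r"(?:(?P<days>\d+)\s*天)?\s*"
    r"(?:(?P<hours>\d+)\s*小时)?\s*"
    r"(?:(?P<minutes>\d+)\s*分)?"
)


def parse_relative_time(label):
    """Return the offset described by a label such as '3天2小时前'."""
    label = label.strip()
    if "小于" in label:
        # "Less than 1 minute" is treated as a zero offset, as in transTime.
        return timedelta()
    match = _RELATIVE.match(label)
    # Keep only the units that were actually present in the label.
    parts = {unit: int(value)
             for unit, value in match.groupdict().items()
             if value is not None}
    return timedelta(**parts)


def to_absolute(label, now=None):
    """Subtract the parsed offset from `now` (defaults to the current time)."""
    if now is None:
        now = datetime.now()
    return now - parse_relative_time(label)
```

Passing `now` explicitly makes the conversion reproducible, which helps when checking the parser against saved pages.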

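The script derives the page count as int(num / 15 + 1). If the intent is "how many 15-comment pages are needed", that expression yields one page too many whenever num is an exact multiple of 15 (30 comments gives 3). A minimal sketch using ceiling division follows; the page size of 15 is taken from the script, and page_count is a hypothetical helper name.

```python
def page_count(total_comments, per_page=15):
    """Return the number of pages needed to list total_comments items."""
    if total_comments <= 0:
        return 0
    # Ceiling division using integers only.
    return (total_comments + per_page - 1) // per_page
```

Whether the site numbers its comment pages from 0 or from 1 is not stated in the original, so the range passed to saveComments may still need adjusting.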