Contents

Preface
I. Analyzing the page
II. Writing the code
Summary

Preface

This time we want to fetch every comment on the projects that Gitee returns when searching for java, and to tell comments apart from replies.

Writing the code

1. Analyzing the page

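On the search results page, clicking through a few pages shows that pageno in the URL is the current page number and q is the search term. By changing these two parameters we can generate every URL we need.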
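The URL pattern described above can be sketched as follows. This is a minimal illustration, not the code used later in the article: the helper names build_search_url and build_search_urls are made up here, and the skin and type parameters are simply copied from the URL that appears in the article's code.

```python
from typing import List
from urllib.parse import urlencode

# Host taken from the search URL used in this article's code.
SEARCH_BASE = "https://search.gitee.com/"


def build_search_url(query: str, page: int) -> str:
    """Return the search URL for one result page (hypothetical helper)."""
    params = {
        "skin": "rec",
        "type": "repository",
        "q": query,      # search term
        "pageno": page,  # current page number
    }
    return f"{SEARCH_BASE}?{urlencode(params)}"


def build_search_urls(query: str, start: int, end: int) -> List[str]:
    """Return the URLs for pages start..end, both inclusive."""
    return [build_search_url(query, page) for page in range(start, end + 1)]


for url in build_search_urls("java", 10, 11):
    print(url)
```

Using urlencode instead of string concatenation also takes care of escaping search terms that contain spaces or non-ASCII characters.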

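Next we open one of the links. The project examined here, for example, has 107 comments. Note the "Load more" button at the end: the page does not show all comments at once, so we have to keep clicking "Load more" until it is no longer available.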

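Ordinary comments are easy to extract once we have the page source, but how do we separate replies from comments and keep each reply together with the comment it belongs to? Inspecting the page structure shows that we can split the page by div tags whose class attribute is comment, then extract the comment texts under each div: the first one is the main comment, and the rest are replies to it.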
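The grouping idea can be shown on a toy document. The markup below is hypothetical and heavily simplified (the real Gitee page has more nesting and other attributes), and the sketch uses the standard library's xml.etree.ElementTree rather than the lxml and regular-expression approach used later in this article.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; not the real Gitee page structure.
SAMPLE = """
<div class="comments">
  <div class="comment">
    <p>Great project!</p>
    <p>Thanks, glad it helps.</p>
    <p>Agreed.</p>
  </div>
  <div class="comment">
    <p>Does it support JDK 8?</p>
  </div>
</div>
"""


def split_threads(markup):
    """Group each comment block into a main comment and its replies."""
    root = ET.fromstring(markup.strip())
    threads = []
    # Exact match on class="comment", so the outer "comments" div is skipped.
    for block in root.findall(".//div[@class='comment']"):
        texts = ["".join(p.itertext()).strip() for p in block.findall(".//p")]
        if not texts:
            continue
        # The first text is the main comment, the rest are its replies.
        threads.append({"comment": texts[0], "replies": texts[1:]})
    return threads


for thread in split_threads(SAMPLE):
    print(thread)
```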

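With that solved, there are still some special comments to watch out for, such as comments that @-mention someone and comments that have been blocked. Their tags are located slightly differently, so they need dedicated handling.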
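One way to handle blocked comments is sketched below. It assumes, as the parsing code in this article suggests, that a blocked comment still yields a text entry (the site's blocked notice) but no author entry, so the author list ends up shorter than the text list. The helper name align_authors is made up for this illustration.

```python
BLOCKED_TEXT = "此条评论已被系统屏蔽"  # notice the site shows for a blocked comment
PLACEHOLDER = "系统屏蔽"               # stand-in author name used in this article


def align_authors(texts, authors, blocked_text=BLOCKED_TEXT, placeholder=PLACEHOLDER):
    """Return a copy of authors padded so that it lines up with texts.

    Assumes every blocked text has no matching entry in authors.
    """
    aligned = list(authors)
    # Walking in ascending order keeps every later index valid after an insert.
    for index, text in enumerate(texts):
        if text == blocked_text:
            aligned.insert(index, placeholder)
    return aligned


texts = ["first comment", BLOCKED_TEXT, "a reply", BLOCKED_TEXT]
authors = ["alice", "bob"]
print(list(zip(align_authors(texts, authors), texts)))
```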

2. Writing the code

First, fetch the links from the search results.

def get_url(start, end):
    '''
    :start: first page
    :end: last page
    '''
    # Collect the project links on every page from start to end.
    while True:
        try:
            t_url = []
            h_url = 'https://search.gitee.com/?skin=rec&type=repository&q=java&pageno='
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36 Edg/89.0.774.76',
            }
            for i in range(start, end + 1):
                new_url = h_url + str(i)
                print(f'Fetching all links on page {i}')
                res = requests.get(new_url, headers=headers).text
                html = etree.HTML(res)
                x_urls = html.xpath('//div[@class="title"]/a[@class="ns" and @target="_blank"]/@href')
                x_urls.pop(0)  # drop the first match on each page
                t_url.append({str(i): x_urls})
                time.sleep(1)
        except Exception:
            # Exception instead of BaseException, so Ctrl+C still stops the script.
            # Any failure restarts the whole page range.
            print('Fetch failed, fetching the links again')
        else:
            break
    return t_url

Next is the function that fetches the page source for one link. The values passed to time.sleep need adjusting to your own network speed.

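from selenium.webdriver.common.by import By  # needed for find_element in Selenium 4


def get_html(url):
    # Fetch the source of a project page with all comments loaded.
    while True:
        # Open Chrome; a failure to start the browser is not retried.
        wd = webdriver.Chrome()
        try:
            # Request the page
            wd.get(url)
            while True:
                # Look for the "Load more" button
                html = wd.page_source
                h1 = etree.HTML(html)
                button = h1.xpath('//div[@class="ui link button btn-load-more"]')
                # No button at all: nothing more to load
                if len(button) == 0:
                    break
                # Look for the disabled button that marks the end
                later = h1.xpath('//div[@class="ui link button btn-load-more disabled"]')
                if len(later) == 0:
                    # Not at the end yet, so click "Load more"
                    i = wd.find_element(By.XPATH, '//*[@class="ui link button btn-load-more"]')
                    i.click()
                    time.sleep(0.5)
                else:
                    break
        except Exception:
            print('Fetch failed, retrying')
        else:
            return html
        finally:
            # Close the browser whether the attempt succeeded or not
            wd.quit()
            time.sleep(0.5)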
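Because a fixed time.sleep has to be tuned by hand, one alternative is to poll for a condition with a timeout. The helper below is a generic standard-library sketch, not part of the original code; wait_until is a made-up name, and the commented usage at the end assumes the wd driver object from get_html.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.2):
    """Poll predicate until it returns a truthy value or timeout seconds pass.

    Returns the truthy value, or None if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Small self-contained demonstration: the condition becomes true after 0.3 s.
started = time.monotonic()
print(wait_until(lambda: time.monotonic() - started > 0.3, timeout=2.0, interval=0.05))

# Hypothetical usage inside get_html, after clicking "Load more":
#     before = len(wd.page_source)
#     i.click()
#     wait_until(lambda: len(wd.page_source) != before, timeout=5)
```

Selenium also ships its own explicit-wait helper, WebDriverWait, which serves the same purpose when the condition is about elements on the page.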

Then we parse the page source and extract the comments.

def get_i(lists):
    # Return the indexes of the blocked comments
    a = []
    for index, text in enumerate(lists):
        if text == '此条评论已被系统屏蔽':  # notice the site shows for a blocked comment
            a.append(index)
    return a


def jianxi(data):
    # Parse the page source.
    # Match every top-level comment block, replies included.
    # strs = re.findall('<div class="comments"><div.*?class="comment.*?<div class="comments">|<div class="comments"><div.*?"comment note".*?</div>\n<input id',data,re.S)
    strs = re.findall('<div class="comments"><div.*?data-note-id=".*?<div class="comments">|<div class="comments"><div.*?"comment note".*?</div>\n<input id', data, re.S)
    html = etree.HTML(data)
    # Project name
    title = html.xpath('//head//meta[@itemprop="name"]/@content')
    # Number of comments on the project
    num = html.xpath('//span[@class="comments-count"]/text()')
    try:
        num = num[0].split()
    except IndexError:
        print('This project has no comments; check the page to confirm')
        all_data = {'title': title[0], 'comment_num': '0'}
        print(all_data)
        return all_data
    nu = 0  # counts parsed comments, to check against the page's count
    comments = []  # stores the comments
    if int(num[0]) != 0:
        for str1 in strs:
            # Authors and timestamps in this block
            str2 = etree.HTML(str1)
            comments_authors = str2.xpath('//a[@class="author js-popover-card"]/text()')
            comments_times = str2.xpath('//span[@class="timeago"]/@title')
            # Tags that hold one comment or one reply
            str3 = re.findall('<div class="content arrow_box".*?</div></div></div>|<div class="children-comments comments".*?</div></div></div>', str1, re.S)
            comments_comments = []
            for str4 in str3:
                str5 = etree.HTML(str4)
                # Extract the text pieces and join them
                comments_comment = str5.xpath('//p//text() | //div[@class="author blocked-title pb-2"]/text()')
                co = ''
                for com in comments_comment:
                    co += com
                comments_comments.append(co)
                nu += 1
            b = get_i(comments_comments)
            # Insert a placeholder author at each blocked index so that
            # authors and texts stay aligned
            for index in b:
                comments_authors.insert(index, '系统屏蔽')
            c0 = comments_comments.pop(0)
            a0 = comments_authors.pop(0)
            t0 = comments_times.pop(0)
            # Check whether the comment has replies
            if len(comments_comments) != 0:
                c = [{a0: c0, 'time': t0}]
                d = []
                for i in range(len(comments_comments)):
                    d.append({comments_authors[i]: comments_comments[i], 'time': comments_times[i]})
                c.append({'回复': d})  # '回复' ("replies") is the key for the reply list
                comments.append(c)
            else:
                comments.append([{a0: c0, 'time': t0}])
    else:
        print('This project has no comments')
    # Check that the number of parsed comments matches the page
    if nu == int(num[0]):
        print('Parsed correctly')
    else:
        print('Parse error')
        print(f'{int(num[0]) - nu} comments missing')
    if int(num[0]) != 0:
        all_data = {'title': title[0], 'comment_num': nu, 'comment': comments}
    else:
        all_data = {'title': title[0], 'comment_num': num[0]}
    print(all_data)
    return all_data
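To make the nested result of jianxi easier to consume, the structure can be flattened into rows. This is a sketch under the assumption that the result has the shape documented in the article's main block: a list of threads, each holding one {author: text, 'time': ...} dict and optionally one {'回复': [...]} dict. The names flatten and _unpack and the sample data are invented for illustration.

```python
REPLY_KEY = "回复"  # key that jianxi uses for the list of replies


def _unpack(entry):
    """Split {author: text, 'time': t} into (author, text, t)."""
    time_value = entry["time"]
    # The only other key is the author name; its value is the comment text.
    author, text = next((k, v) for k, v in entry.items() if k != "time")
    return author, text, time_value


def flatten(result):
    """Yield (kind, author, text, time) rows from a jianxi-style result."""
    for thread in result.get("comment", []):
        for entry in thread:
            if REPLY_KEY in entry:
                for reply in entry[REPLY_KEY]:
                    yield ("reply",) + _unpack(reply)
            else:
                yield ("comment",) + _unpack(entry)


# Invented sample data in the same shape as the jianxi result.
sample = {
    "title": "demo",
    "comment_num": 3,
    "comment": [
        [{"alice": "Nice work", "time": "2021-05-01 10:00"},
         {REPLY_KEY: [{"bob": "Thanks", "time": "2021-05-01 11:00"}]}],
        [{"carol": "Any docs?", "time": "2021-05-02 09:00"}],
    ],
}

for row in flatten(sample):
    print(row)
```

Because the author name is used as a dictionary key, an author literally named time or 回复 would collide with the fixed keys; storing author and text under separate fixed keys would avoid that.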

3. Full code

from selenium import webdriver
from selenium.webdriver.common.by import By
import requests
from lxml import etree
import time
import re


def get_url(start, end):
    # Collect the project links on every page from start to end.
    while True:
        try:
            t_url = []
            h_url = 'https://search.gitee.com/?skin=rec&type=repository&q=java&pageno='
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36 Edg/89.0.774.76',
            }
            for i in range(start, end + 1):
                new_url = h_url + str(i)
                print(f'Fetching all links on page {i}')
                res = requests.get(new_url, headers=headers).text
                html = etree.HTML(res)
                x_urls = html.xpath('//div[@class="title"]/a[@class="ns" and @target="_blank"]/@href')
                x_urls.pop(0)  # drop the first match on each page
                t_url.append({str(i): x_urls})
                time.sleep(1)
        except Exception:
            # Exception instead of BaseException, so Ctrl+C still stops the script.
            # Any failure restarts the whole page range.
            print('Fetch failed, fetching the links again')
        else:
            break
    return t_url


def get_html(url):
    # Fetch the source of a project page with all comments loaded.
    while True:
        # Open Chrome; a failure to start the browser is not retried.
        wd = webdriver.Chrome()
        try:
            # Request the page
            wd.get(url)
            while True:
                # Look for the "Load more" button
                html = wd.page_source
                h1 = etree.HTML(html)
                button = h1.xpath('//div[@class="ui link button btn-load-more"]')
                # No button at all: nothing more to load
                if len(button) == 0:
                    break
                # Look for the disabled button that marks the end
                later = h1.xpath('//div[@class="ui link button btn-load-more disabled"]')
                if len(later) == 0:
                    # Not at the end yet, so click "Load more"
                    i = wd.find_element(By.XPATH, '//*[@class="ui link button btn-load-more"]')
                    i.click()
                    time.sleep(0.5)
                else:
                    break
        except Exception:
            print('Fetch failed, retrying')
        else:
            return html
        finally:
            # Close the browser whether the attempt succeeded or not
            wd.quit()
            time.sleep(0.5)


def get_i(lists):
    # Return the indexes of the blocked comments
    a = []
    for index, text in enumerate(lists):
        if text == '此条评论已被系统屏蔽':  # notice the site shows for a blocked comment
            a.append(index)
    return a


def jianxi(data):
    # Parse the page source.
    # Match every top-level comment block, replies included.
    # strs = re.findall('<div class="comments"><div.*?class="comment.*?<div class="comments">|<div class="comments"><div.*?"comment note".*?</div>\n<input id',data,re.S)
    strs = re.findall('<div class="comments"><div.*?data-note-id=".*?<div class="comments">|<div class="comments"><div.*?"comment note".*?</div>\n<input id', data, re.S)
    html = etree.HTML(data)
    # Project name
    title = html.xpath('//head//meta[@itemprop="name"]/@content')
    # Number of comments on the project
    num = html.xpath('//span[@class="comments-count"]/text()')
    try:
        num = num[0].split()
    except IndexError:
        print('This project has no comments; check the page to confirm')
        all_data = {'title': title[0], 'comment_num': '0'}
        print(all_data)
        return all_data
    nu = 0  # counts parsed comments, to check against the page's count
    comments = []  # stores the comments
    if int(num[0]) != 0:
        for str1 in strs:
            # Authors and timestamps in this block
            str2 = etree.HTML(str1)
            comments_authors = str2.xpath('//a[@class="author js-popover-card"]/text()')
            comments_times = str2.xpath('//span[@class="timeago"]/@title')
            # Tags that hold one comment or one reply
            str3 = re.findall('<div class="content arrow_box".*?</div></div></div>|<div class="children-comments comments".*?</div></div></div>', str1, re.S)
            comments_comments = []
            for str4 in str3:
                str5 = etree.HTML(str4)
                # Extract the text pieces and join them
                comments_comment = str5.xpath('//p//text() | //div[@class="author blocked-title pb-2"]/text()')
                co = ''
                for com in comments_comment:
                    co += com
                comments_comments.append(co)
                nu += 1
            b = get_i(comments_comments)
            # Insert a placeholder author at each blocked index so that
            # authors and texts stay aligned
            for index in b:
                comments_authors.insert(index, '系统屏蔽')
            c0 = comments_comments.pop(0)
            a0 = comments_authors.pop(0)
            t0 = comments_times.pop(0)
            # Check whether the comment has replies
            if len(comments_comments) != 0:
                c = [{a0: c0, 'time': t0}]
                d = []
                for i in range(len(comments_comments)):
                    d.append({comments_authors[i]: comments_comments[i], 'time': comments_times[i]})
                c.append({'回复': d})  # '回复' ("replies") is the key for the reply list
                comments.append(c)
            else:
                comments.append([{a0: c0, 'time': t0}])
    else:
        print('This project has no comments')
    # Check that the number of parsed comments matches the page
    if nu == int(num[0]):
        print('Parsed correctly')
    else:
        print('Parse error')
        print(f'{int(num[0]) - nu} comments missing')
    if int(num[0]) != 0:
        all_data = {'title': title[0], 'comment_num': nu, 'comment': comments}
    else:
        all_data = {'title': title[0], 'comment_num': num[0]}
    print(all_data)
    return all_data


if __name__ == '__main__':
    # jianxi(data) returns
    # {'title':xxx,'comment_num':xxx,'comment':[[{'xxx':'xxx','time':'xxx'},{'回复':[{'xxx':'xxx','time':'xxx'}]}]]}
    startTime = int(time.time())
    start = 10
    end = 11
    p_urls = get_url(start, end)
    print(p_urls)
    for i in range(start, end + 1):
        urls = p_urls.pop(0)
        print(f'Fetching page {i}')
        page_urls = urls[str(i)]
        for j, url in enumerate(page_urls, 1):
            print(f'Fetching comments for project {j}/{len(page_urls)}')
            data = get_html(url)
            jianxi(data)
    # For testing a single link:
    # url = 'https://gitee.com/jfinal/jfinal-weixin?_from=gitee_search'
    # data = get_html(url)
    # jianxi(data)
    endTime = int(time.time())
    differenceTime = endTime - startTime
    print('Run time: ' + str(differenceTime) + ' seconds')

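Summary

The code this time runs to nearly 200 lines, but it is manageable as long as you write it step by step following the workflow above. The q parameter mentioned earlier is the search term; I did not make it configurable in the code and only searched for java. Change it yourself if you need to.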

