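[Python] Scraping TapTap Reviews of Genshin Impact and Generating a Word Cloud

Preface

I originally wanted to scrape Bilibili, but the comments in its game section appear to be loaded dynamically, and after some digging I couldn't work out how to fetch them. So I switched to TapTap: its review pages are addressed by URL, which makes it easy to build the URL for any page you want.

Goal

Scrape player reviews of Genshin Impact from the TapTap community, then compute word frequencies and generate a word cloud to visualize the keywords.

Steps

Scraper

The goal is to collect four fields for each review: username, score, time, and review text. The first step is to fetch the review list on each page: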

response = requests.get(self.comments_url % page, headers=self.headers)
print(f'Fetched page {page}, status {response.status_code}.')
time.sleep(random.random())  # pause for a random fraction of a second between requests
html = etree.HTML(response.text)
# Each <li> under the review list is one review
contents = html.xpath('//ul[contains(@class, "taptap-review-list")]/li')
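
Since each review page is addressed purely through the `page` query parameter, the list of URLs to visit can be built up front. A minimal sketch, assuming the URL template from the full source; `page_urls` is a hypothetical helper name, not part of the original script:

```python
# URL template taken from the full source; %d is replaced by the page number.
COMMENTS_URL = 'https://www.taptap.com/app/168332/review?order=default&page=%d#review-list'


def page_urls(template, start, count):
    """Return `count` consecutive page URLs beginning at page `start`."""
    return [template % page for page in range(start, start + count)]


if __name__ == '__main__':
    # The original loop uses range(10), i.e. pages 0 through 9.
    for url in page_urls(COMMENTS_URL, 0, 3):
        print(url)
```

Keeping the start page as a parameter makes it easy to check by hand whether the site counts pages from 0 or from 1.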

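Then iterate over the list and parse out each field:

# Fall back to a placeholder when the username node is missing or empty
names = content.xpath('.//a[@class="taptap-user-name"]/text()')
user = names[0] if names and names[0] else 'Anonymous'
# Characters 7-8 of the style attribute hold the pixel width of the colored stars
score = content.xpath('.//div[@class="item-text-score"]/i[@class="colored"]/@style')[0][7:9]
score = int(score) / 14  # 14 is apparently the width of one star, so this gives the rating
comment_time = content.xpath('(.//span)[4]/text()')[0]
comment = content.xpath('(.//div[@class="item-text-body"])[1]/p/text()')
comment = '\n'.join(comment)  # one paragraph per line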
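
The slice `[7:9]` and the division by 14 only make sense if the `style` attribute looks like `width: 70px`, with each star 14 pixels wide. That format is inferred from the code, not confirmed against the page. Under that assumption, a regex-based sketch that also copes with widths that are not exactly two digits (`parse_score` and `STAR_WIDTH_PX` are hypothetical names):

```python
import re

STAR_WIDTH_PX = 14  # assumed width of one star, inferred from the "/ 14" in the script


def parse_score(style):
    """Convert a style string such as 'width: 70px' into a star rating.

    Returns None when no pixel width can be found.
    """
    match = re.search(r'width:\s*(\d+)px', style)
    if match is None:
        return None
    return int(match.group(1)) // STAR_WIDTH_PX


if __name__ == '__main__':
    print(parse_score('width: 70px'))
    print(parse_score('width: 14px'))
    print(parse_score('color: red'))
```

Unlike the original, this returns an integer rather than a float, and it returns `None` instead of raising when the attribute has an unexpected shape.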

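Finally, save the data to files for later use:

comment_dir = {'user': users, 'score': scores, 'time': times, 'comment': comments}
comment_df = pd.DataFrame(comment_dir)
# Full table with all four fields
comment_df.to_csv('./tables/taptap_comments.csv')
# Review text only; header=False keeps the column name out of the text to be segmented
comment_df['comment'].to_csv('./tables/comments.csv', index=False, header=False)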

Word segmentation

With the data scraped, the next step is to segment it into words. I use the jieba library for this:

jieba.load_userdict('./dictionary/my_dict.txt')
with open('./tables/comments.csv', 'r', encoding='utf-8') as f:
    word_list = jieba.cut(f.read())
with open('./dictionary/ignore_dict.txt', 'r', encoding='utf-8') as f:
    ignore_words = set(f.read().splitlines())  # a set makes the membership test fast

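The user dictionary declares words that jieba does not know, and the ignore dictionary is then used to filter out words that carry no meaning:

words = []
for word in word_list:
    if word not in ignore_words:
        # Strip newlines and spaces, then skip tokens that end up empty
        word = re.sub(r'[\n ]', '', word)
        if len(word) < 1:
            continue
        words.append(word)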
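
The same filtering step can be tried out without jieba by feeding it an already tokenized list. A minimal sketch; `filter_words` and the sample tokens are made up for illustration, and the ignore words are held in a set so each lookup takes constant time:

```python
import re


def filter_words(tokens, ignore_words):
    """Drop ignored tokens, strip newlines/spaces, and skip empty results."""
    ignore = set(ignore_words)
    kept = []
    for token in tokens:
        if token in ignore:
            continue
        token = re.sub(r'[\n ]', '', token)
        if token:
            kept.append(token)
    return kept


if __name__ == '__main__':
    sample_tokens = ['钟离', '的', ' ', '角色', '\n', '，', '氪']
    print(filter_words(sample_tokens, ['的', '，']))
```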

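Next, run a word-frequency analysis on the segmented data:

frq = {}
for word in words:
    frq[word] = frq.get(word, 0) + 1
items = list(frq.items())
items.sort(key=lambda x: x[1], reverse=True)
print('Top 10 words by frequency:')
# Slicing avoids an IndexError when there are fewer than 10 distinct words
for word, count in items[:10]:
    print(word, '：', count)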
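
The standard library's `collections.Counter` does the same counting and sorting in two calls. A minimal sketch with a made-up word list (not data from the crawl); `top_words` is a hypothetical helper name:

```python
from collections import Counter


def top_words(words, n=10):
    """Return the n most frequent words as (word, count) pairs."""
    return Counter(words).most_common(n)


if __name__ == '__main__':
    sample = ['钟离', '角色', '钟离', '氪', '钟离', '角色']
    for word, count in top_words(sample, 3):
        print(word, ':', count)
```

`most_common` simply returns fewer pairs when the input has fewer than `n` distinct words, so no bounds check is needed.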

Word cloud

Finally, call the WordCloud library to generate the word cloud:

wordle = word_cloud.generate(wordle_data)
image = wordle.to_image()
image.show()
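
Since the word counts are already available at this point, they can also be handed to the library directly through `WordCloud.generate_from_frequencies`, so the image reflects exactly those counts rather than a re-tokenized string. A sketch under assumptions: the font path is the one used in the full source, `count_words` and `render_cloud` are hypothetical helpers, and the mask is left out for brevity:

```python
from collections import Counter


def count_words(words):
    """Return a plain dict mapping each word to its number of occurrences."""
    return dict(Counter(words))


def render_cloud(words, out_path, font_path='./fonts/simhei.ttf'):
    """Draw a word cloud from word counts and save it to out_path."""
    # Imported here so count_words stays usable without wordcloud installed.
    from wordcloud import WordCloud

    cloud = WordCloud(font_path=font_path, background_color='white',
                      max_words=300, width=500, height=350)
    cloud.generate_from_frequencies(count_words(words))
    cloud.to_file(out_path)
```

A font that contains Chinese glyphs is required either way; without one the words render as empty boxes.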

Results

Word cloud:

Word frequency:

Top 10 words by frequency:
钟离 ： 337
角色 ： 205
氪 ： 187
玩家 ： 182
好 ： 135
人 ： 131
原神 ： 130
真的 ： 126
让 ： 123
这个 ： 118

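On the frequency count: the ignore dictionary is still incomplete, so some meaningless words slip through. I didn't feel like polishing it any further, so I'm leaving it as it is. The word cloud uses a mask of Keqing. Keqing, waifu forever! (Even though I haven't pulled her yet.)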

(Immune)

Source code

Main file genshin_wordle.py:

import requests
import time
import random
import jieba
import re
import pandas as pd
import numpy as np
from lxml import etree
from PIL import Image
from wordcloud import WordCloud


class Genshin:

    def __init__(self):
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
        }
        self.comments_url = 'https://www.taptap.com/app/168332/review?order=default&page=%d#review-list'

    def crawl_comments(self):
        users = []
        scores = []
        times = []
        comments = []
        # Scrape 10 pages of reviews
        for page in range(10):
            response = requests.get(self.comments_url % page, headers=self.headers)
            print(f'Fetched page {page}, status {response.status_code}.')
            time.sleep(random.random())
            html = etree.HTML(response.text)
            contents = html.xpath('//ul[contains(@class, "taptap-review-list")]/li')
            # Iterate over the reviews on this page
            for content in contents:
                # Parse each field; fall back to a placeholder if the username is missing
                names = content.xpath('.//a[@class="taptap-user-name"]/text()')
                user = names[0] if names and names[0] else 'Anonymous'
                score = content.xpath('.//div[@class="item-text-score"]/i[@class="colored"]/@style')[0][7:9]
                score = int(score) / 14
                comment_time = content.xpath('(.//span)[4]/text()')[0]
                comment = content.xpath('(.//div[@class="item-text-body"])[1]/p/text()')
                comment = '\n'.join(comment)
                # Append one complete record to the lists
                users.append(user)
                scores.append(score)
                times.append(comment_time)
                comments.append(comment)
        # Bundle the data and write it to files
        comment_dir = {'user': users, 'score': scores, 'time': times, 'comment': comments}
        comment_df = pd.DataFrame(comment_dir)
        comment_df.to_csv('./tables/taptap_comments.csv')
        # header=False keeps the column name out of the text to be segmented
        comment_df['comment'].to_csv('./tables/comments.csv', index=False, header=False)
        print(comment_df)

    def word_frequency(self, words):
        frq = {}
        for word in words:
            frq[word] = frq.get(word, 0) + 1
        items = list(frq.items())
        items.sort(key=lambda x: x[1], reverse=True)
        print('Top 10 words by frequency:')
        # Slicing avoids an IndexError when there are fewer than 10 distinct words
        for word, count in items[:10]:
            print(word, '：', count)

    def segmentation(self):
        # Load the custom and ignore dictionaries, then segment the text
        jieba.load_userdict('./dictionary/my_dict.txt')
        with open('./tables/comments.csv', 'r', encoding='utf-8') as f:
            word_list = jieba.cut(f.read())
        with open('./dictionary/ignore_dict.txt', 'r', encoding='utf-8') as f:
            ignore_words = set(f.read().splitlines())
        words = []
        # Iterate over the segmented words
        for word in word_list:
            if word not in ignore_words:
                word = re.sub(r'[\n ]', '', word)
                if len(word) < 1:
                    continue
                words.append(word)
        # Shared with generate_wordle() through a module-level variable
        global wordle_data
        wordle_data = ','.join(words)
        print(wordle_data)
        self.word_frequency(words)

    def generate_wordle(self):
        # Configure the word cloud
        wordle_mask = np.array(Image.open('./images/keqing.jpg'))
        word_cloud = WordCloud(
            font_path='./fonts/simhei.ttf',
            background_color="white",
            mask=wordle_mask,
            max_words=300,
            min_font_size=5,
            max_font_size=100,
            width=500,
            height=350,
        )
        global wordle_data
        wordle = word_cloud.generate(wordle_data)
        image = wordle.to_image()
        image.show()
        wordle.to_file('./images/genshin_wordle.png')


genshin = Genshin()
genshin.crawl_comments()
genshin.segmentation()
genshin.generate_wordle()

Custom dictionary my_dict.txt:

璃月
迪卢克
刻晴
阿贝多
米哈游

Ignore dictionary ignore_dict.txt:

，
的
。
了
我
是
就
游戏
也不
？
都
你
"
玩
说
一个
这
在
给
和
还
有
没有
没
就是
但是
吧
到
什么
现在
个
能
！
—
他
很
