Building a Word Cloud in Python (Scraping Douban Movie Reviews)

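This program is a hands-on project I wrote after finishing a Python course. It analyzes a site's HTML to scrape movie reviews and then turns them into a word cloud.

The program has three main stages.

1. Fetch the web page data

A Python crawler fetches the Douban Movie page that lists the films currently showing in theaters. The URL is: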

https://movie.douban.com/cinema/nowplaying/qingdao/
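
As a rough illustration of what this stage extracts, the sketch below pulls the movie id and title out of a small HTML fragment. The fragment is a hypothetical, simplified stand-in for the real page, and the names NowPlayingParser and parse_now_playing are made up for this example. It uses only the standard library's html.parser so that it runs without extra packages; the full program further down uses requests_html and BeautifulSoup instead. For brevity it does not check that the list items sit inside the div with id "nowplaying".

```python
from html.parser import HTMLParser

# Hypothetical, simplified stand-in for the "now playing" markup.
SAMPLE_HTML = """
<div id="nowplaying">
  <ul>
    <li class="list-item" data-subject="1001"><img alt="Movie A" src="a.jpg"></li>
    <li class="list-item" data-subject="1002"><img alt="Movie B" src="b.jpg"></li>
  </ul>
</div>
"""


class NowPlayingParser(HTMLParser):
    """Collect {'id', 'name'} for every <li class="list-item"> element."""

    def __init__(self):
        super().__init__()
        self.movies = []
        self._current = None  # the movie dict for the <li> we are inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "li" and "list-item" in classes:
            # The movie id is stored in the data-subject attribute.
            self._current = {"id": attrs.get("data-subject"), "name": None}
            self.movies.append(self._current)
        elif tag == "img" and self._current is not None:
            # The poster's alt text carries the movie title.
            self._current["name"] = attrs.get("alt")

    def handle_endtag(self, tag):
        if tag == "li":
            self._current = None


def parse_now_playing(html):
    parser = NowPlayingParser()
    parser.feed(html)
    return parser.movies


movies = parse_now_playing(SAMPLE_HTML)
print(movies)
```

Each dictionary has the same shape as the entries that getNowPlayingMovie_list returns in the full program.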

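2. Clean the data

The reviews for a given movie are stored in the list eachCommentList. To make cleaning and word-frequency counting easier, the list is joined into a single string, comments. Function words (stop words) such as “也” (also), “太” (too), and “的” (a possessive particle) are then removed before the word frequencies are counted.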
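
The cleaning step can be sketched with the standard library alone. This is a minimal example under a few assumptions: the sample comments and the token list are invented, the tokens are split by hand rather than by jieba, and the helper names clean_comments and count_words are hypothetical.

```python
import re
from collections import Counter


def clean_comments(each_comment_list):
    """Join the comments and keep only Chinese characters."""
    comments = "".join(each_comment_list)
    # Same character range as the full program: drops punctuation, digits, Latin text.
    return "".join(re.findall(r"[\u4e00-\u9fa5]", comments))


def count_words(words, stop_words):
    """Count word frequencies, skipping stop words."""
    return Counter(w for w in words if w not in stop_words)


# Invented sample data for illustration.
sample_comments = ["剧情太好了!", "画面也很美, 10/10"]
cleaned = clean_comments(sample_comments)

# Tokens split by hand here; the full program relies on jieba instead.
tokens = ["剧情", "太", "好", "画面", "也", "美", "剧情"]
frequencies = count_words(tokens, stop_words={"也", "太", "的"})
print(cleaned)
print(frequencies)
```

Using a set for the stop words keeps each membership check cheap, which matters once the stop-word file grows to a few thousand entries.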

3. Display the result as a word cloud

Finally, the wordcloud package renders the review data as a word cloud.
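
A minimal sketch of this last stage follows, assuming the wordcloud package is installed and a Chinese-capable font file such as simhei.ttf is available. The helper names select_keywords and render_word_cloud and the sample weights are hypothetical. The WordCloud import sits inside the rendering function so that the selection helper can be tried on its own.

```python
def select_keywords(weights, stop_words, limit=200):
    """Drop stop words and keep the `limit` highest-weighted keywords."""
    kept = {word: score for word, score in weights.items() if word not in stop_words}
    ranked = sorted(kept.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:limit])


def render_word_cloud(frequencies, out_path, font_path="simhei.ttf"):
    """Render a word -> weight mapping to an image file."""
    # Imported here so select_keywords stays usable without wordcloud installed.
    from wordcloud import WordCloud

    cloud = WordCloud(font_path=font_path, background_color="white", max_font_size=80)
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(out_path)


# Invented sample weights for illustration.
weights = {"剧情": 1.0, "的": 0.9, "画面": 0.6, "演员": 0.4}
selected = select_keywords(weights, stop_words={"的"}, limit=2)
print(selected)
# render_word_cloud(selected, "demo.png")  # needs wordcloud and the font file
```

A font that contains Chinese glyphs is needed because the default font bundled with wordcloud does not cover them.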

# Word segmentation package
import json
import jieba
import jieba.analyse
# numpy numerical package
import numpy as np
import re
import matplotlib
import matplotlib.pyplot as plt
import requests_html
import requests
from bs4 import BeautifulSoup as bs
matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)
# Word cloud package
from wordcloud import WordCloud
import os
import tkinter as tk
from tkinter import *
# Imported after tkinter's wildcard import so that Image refers to PIL.Image
from PIL import ImageTk, Image

# Parse the now-playing page and return a list of {'id', 'name'} dicts
def getNowPlayingMovie_list(url):
    # url = 'https://movie.douban.com/cinema/nowplaying/qingdao/'
    session = requests_html.HTMLSession()
    r = session.get(url)
    html_data = r.html.html
    soup = bs(html_data, 'html.parser')
    nowplaying_movie = soup.find_all('div', id='nowplaying')
    nowplaying_movie_list = nowplaying_movie[0].find_all('li', attrs={'class': 'list-item'})
    nowplaying_list = []
    for item in nowplaying_movie_list:
        nowplaying_dict = {}
        nowplaying_dict['id'] = item['data-subject']
        for tag_img_item in item.find_all('img'):
            nowplaying_dict['name'] = tag_img_item['alt']
            nowplaying_list.append(nowplaying_dict)
    return nowplaying_list


# Scrape the comments of one movie
# Parameters: the movie id and the number of the comment page to scrape
def getCommentsById(movieId, pageNum):
    eachCommentList = []
    if pageNum > 0:
        start = (pageNum - 1) * 20
    else:
        return False
    requrl = 'https://movie.douban.com/subject/' + movieId + '/comments?start=' + str(start) + '&limit=20'
    # print(requrl)
    session = requests_html.HTMLSession()
    response = session.get(requrl)
    html_data = response.html.html
    soup = bs(html_data, 'html.parser')
    comment_div_list = soup.find_all('div', attrs={'class': 'comment'})
    for item in comment_div_list:
        # Get the span inside the p tag (the comment text)
        b = item.find('p').find('span')
        if b.string is not None:
            # eachCommentList.append(item.find_all('p')[0].string)
            eachCommentList.append(b.string)
    return eachCommentList


def main():
    url = str(inp1.get())
    page_num = int(inp2.get())
    NowPlayingMovie_list = getNowPlayingMovie_list(url)
    # Scrape the first page_num comment pages of every movie
    for t in range(len(NowPlayingMovie_list)):
        # This request is only needed by the commented-out page-count lookup below
        temp = 'https://movie.douban.com/subject/' + NowPlayingMovie_list[t]['id'] + '/?from=playing_poster'
        session = requests_html.HTMLSession()
        r = session.get(temp)
        html_data = r.html.html
        # aaa = bs(html_data)
        # a = aaa.find('a', attrs={'href':'comments?sort=new_score&status=P'}).string
        # page_num = int(re.findall('\d+', a)[0])
        # print(page_num)
        commentList = []
        for i in range(page_num):
            num = i + 1
            commentList_temp = getCommentsById(NowPlayingMovie_list[t]['id'], num)
            commentList.append(commentList_temp)
        # Convert the list data to a single string
        comments = ''
        file = NowPlayingMovie_list[t]
        file_name = file.get("name")
        for k in range(len(commentList)):
            comments += (str(commentList[k])).strip()
        # Use a regular expression to keep only Chinese characters (drops punctuation)
        pattern = re.compile(r'[\u4e00-\u9fa5]')
        filterdata = re.findall(pattern, comments)
        cleaned_comments = ''.join(filterdata)
        # Extract weighted keywords with jieba's TextRank
        result = jieba.analyse.textrank(cleaned_comments, topK=200, withWeight=True)
        keywords = dict()
        for i in result:
            keywords[i[0]] = i[1]
        json_str = json.dumps(keywords, ensure_ascii=False)
        with open("out.txt", "a", encoding="utf-8") as fObj:
            fObj.write("Before removing stop words: ")
            fObj.write(json_str)
            fObj.write('\n')
        # print("Before removing stop words", keywords)
        # Load the stop-word list
        stop_words = []
        for line in open('stopword.txt', 'r', encoding='utf-8'):
            stop_words.append(line.rstrip('\n'))
        keywords = {x: keywords[x] for x in keywords if x not in stop_words}
        json_str = json.dumps(keywords, ensure_ascii=False)
        with open("out.txt", "a", encoding="utf-8") as fObj:
            fObj.write("After removing stop words: ")
            fObj.write(json_str)
            fObj.write('\n')
        # print("After removing stop words", keywords)
        # Mask image that gives the word cloud its shape
        pic = np.array(Image.open('dol.jpg'))
        # Build the word cloud
        wordcloud = WordCloud(
            scale=8,
            font_path='simhei.ttf',
            background_color='white',
            max_font_size=80,
            mask=pic,
            stopwords=stop_words,
        )
        word_frequence = keywords
        myword = wordcloud.fit_words(word_frequence)
        # Show the word cloud
        plt.rcParams["font.sans-serif"] = ["SimHei"]
        plt.rcParams["axes.unicode_minus"] = False
        fig = plt.figure(t)
        plt.imshow(myword)
        plt.axis('off')
        plt.title('Movie: 《' + file_name + '》', size=26)
        plt.savefig(file_name + '.png')
        plt.draw()
        plt.pause(4)  # keep the figure open for 4 seconds
        plt.close(fig)


# if __name__ == '__main__':
#     print(os.getcwd())
#     main()

# Build the Tkinter window; main() is triggered by the search button
root = Tk()
root.geometry('1000x600')
canvas = tk.Canvas(root, width=1000, height=600, bd=0, highlightthickness=0)
# Background image for the window
imgpath = 'D:\\123.jpg'
img = Image.open(imgpath)
photo = ImageTk.PhotoImage(img)
canvas.create_image(500, 240, image=photo)
canvas.pack()
entry = tk.Entry(root, insertbackground='blue', highlightthickness=2)
entry.pack()
root.title('Movie Word Cloud Search')
# Input for the now-playing page URL
lb1 = Label(root, text='Enter the link to search:', bg='blue', font=('华文新魏', 15))
lb1.place(relx=0.23, rely=0.2, relwidth=0.25, relheight=0.05)
inp1 = Entry(root)
inp1.place(relx=0.25, rely=0.3, relwidth=0.2, relheight=0.07)
# Input for the number of comment pages to scrape
lb2 = Label(root, text='Enter the number of pages:', bg='blue', font=('华文新魏', 15))
lb2.place(relx=0.53, rely=0.2, relwidth=0.25, relheight=0.05)
inp2 = Entry(root)
inp2.place(relx=0.55, rely=0.3, relwidth=0.2, relheight=0.07)
btn1 = Button(root, text='Search', fg='red', bg='blue', font=('华文新魏', 20), command=main)
btn1.place(relx=0.45, rely=0.7, relwidth=0.1, relheight=0.1)
root.mainloop()
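
The page-to-offset arithmetic in getCommentsById is easy to get wrong by one, so it may help to look at it in isolation. The sketch below uses the same URL pattern and the same 20 comments per page as the code above; the helper name comment_page_url and the movie id "123" are hypothetical.

```python
def comment_page_url(movie_id, page_num, page_size=20):
    """Build the comments URL for a 1-based page number."""
    if page_num < 1:
        raise ValueError("page_num must be 1 or greater")
    # Page 1 starts at offset 0, page 2 at offset 20, and so on.
    start = (page_num - 1) * page_size
    return (
        "https://movie.douban.com/subject/"
        f"{movie_id}/comments?start={start}&limit={page_size}"
    )


first_page = comment_page_url("123", 1)
third_page = comment_page_url("123", 3)
print(first_page)
print(third_page)
```

Raising ValueError for an invalid page number is a design choice of this sketch; getCommentsById above returns False in that case.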

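Screenshot of the result: the program expects the now-playing URL given above as input; to use a different link, that page's HTML has to be analyzed again.

This is the word cloud generated from the reviews of one of the scraped movies. The mask image is a little small and could be swapped for a different one.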
