Parsing the Maoyan Top 100 Data

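This post follows the previous one on scraping the data. It covers parsing that data and tries several more ways to extract and store it. The previous post is at: link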

1. Extraction methods

1. Parsing with regular expressions

import re

# get_large_thumb, get_release_time and get_release_area are helpers
# that are not defined in this post.

def parse_one_page(html):
    pattern = (r'<dd>.*?board-index.*?">(\d+)</i>.*?data-src="(.*?)".*?/>'
               r'.*?movie-item-info.*?title="(.*?)".*?star">(.*?)</p>'
               r'.*?releasetime">(.*?)</p>.*?integer">(.*?)</i>'
               r'.*?fraction">(\d+)</i>.*?</dd>')
    # re.S lets "." match any character, including newlines
    regex = re.compile(pattern, re.S)
    items = regex.findall(html)
    for item in items:
        yield {
            'index': item[0],
            'thumb': get_large_thumb(item[1]),
            'title': item[2],
            'actors': item[3].strip()[3:],
            'release_time': get_release_time(item[4].strip()[5:]),
            'area': get_release_area(item[4].strip()[5:]),
            'score': item[5] + item[6],
        }
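To see what re.S and the non-greedy .*? contribute, a reduced version of the pattern can be run against a hand-written fragment. The HTML below is an assumption modeled on the class names used in the pattern above, not a copy of the live page, and split_release is a hypothetical stand-in for the get_release_time and get_release_area helpers, which this post does not show.

```python
import re

# Hand-written sample; assumed to resemble one <dd> entry of the board page.
SAMPLE = """
<dd>
  <i class="board-index board-index-1">1</i>
  <p class="name"><a title="Sample Film">Sample Film</a></p>
  <p class="releasetime">上映时间：2000-01-01(中国香港)</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>
</dd>
"""

PATTERN = re.compile(
    r'<dd>.*?board-index.*?">(\d+)</i>'
    r'.*?releasetime">(.*?)</p>'
    r'.*?integer">(.*?)</i>.*?fraction">(\d+)</i>.*?</dd>',
    re.S,  # let "." match newlines, so one pattern can span a multi-line <dd>
)


def split_release(text):
    """Hypothetical helper: split '2000-01-01(中国香港)' into (date, area)."""
    m = re.match(r'([\d-]+)(?:\((.*?)\))?', text)
    if m is None:
        return text, ''
    return m.group(1), m.group(2) or ''


def parse_sample(html):
    for index, release, integer, fraction in PATTERN.findall(html):
        # [5:] drops the 5-character label "上映时间：" in front of the date
        date, area = split_release(release.strip()[5:])
        yield {'index': index, 'time': date, 'area': area,
               'score': integer + fraction}


rows = list(parse_sample(SAMPLE))
print(rows)
```

Compiling the same pattern without re.S finds nothing in this sample, because the fields of one entry sit on different lines.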

2. Parsing with XPath paths in lxml

from lxml import etree

def parse_one_page2(html):
    parse = etree.HTML(html)
    items = parse.xpath("//*[@id='app']//div//dd")
    for item in items:
        yield {
            'index': item.xpath("./i/text()")[0],
            'thumb': get_large_thumb(str(item.xpath("./a/img[2]/@data-src")[0].strip())),
            'name': item.xpath("./a/@title")[0],
            # [3:] drops the leading label, as in the other parsers
            'star': item.xpath(".//p[@class='star']/text()")[0].strip()[3:],
            'time': get_release_time(item.xpath(".//p[@class='releasetime']/text()")[0].strip()[5:]),
            # the area is taken from the release-time text, as in the regex version
            'area': get_release_area(item.xpath(".//p[@class='releasetime']/text()")[0].strip()[5:]),
            # text() was written with full-width parentheses in the original
            'score': item.xpath(".//p[@class='score']/i[1]/text()")[0] +
                     item.xpath(".//p[@class='score']/i[2]/text()")[0],
        }

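This method is generally used to parse regularly structured information. It is a powerful tool for parsing and for information extraction in crawlers.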
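One caveat with the XPath version: every xpath(...)[0] raises IndexError when an expression matches nothing, for example on an entry with a missing field. A small guard avoids that. The name first is a hypothetical helper, not part of lxml; it works on any list, which is what xpath() returns for the node and text() queries used above.

```python
def first(results, default=''):
    """Return the first result, stripped if it is a string, else a default."""
    if not results:
        return default
    value = results[0]
    return value.strip() if isinstance(value, str) else value


# Plain lists stand in for xpath() results, so this sketch runs without lxml.
print(first(['  9.'], default='0'))
print(first([], default='0'))
```

With it, a field could be read as first(item.xpath("./a/@title")) instead of indexing with [0].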

3. The soup.select method in bs4

from bs4 import BeautifulSoup

def parse_one_page3(html):
    soup = BeautifulSoup(html, 'lxml')
    items = range(10)  # each page is assumed to list 10 films
    for item in items:
        yield {
            'index': soup.select("dd i.board-index")[item].string,
            'thumb': get_large_thumb(soup.select("a > img.board-img")[item]['data-src']),
            'name': soup.select(".name a")[item].string,
            'star': soup.select(".star")[item].string.strip()[3:],
            'time': get_release_time(soup.select(".releasetime")[item].string.strip()[5:]),
            # the area is taken from the release-time text, as in the regex version
            'area': get_release_area(soup.select(".releasetime")[item].string.strip()[5:]),
            'score': soup.select(".integer")[item].string + soup.select(".fraction")[item].string,
        }

This version extracts the fields with BeautifulSoup and CSS selectors.

4. API functions: the find functions

from bs4 import BeautifulSoup

def parse_one_page4(html):
    soup = BeautifulSoup(html, 'lxml')
    items = range(10)  # each page is assumed to list 10 films
    for item in items:
        yield {
            'index': soup.find_all(class_="board-index")[item].string,
            'thumb': get_large_thumb(soup.find_all(class_="board-img")[item].attrs['data-src']),
            'name': soup.find_all(name='p', attrs={'class': "name"})[item].string,
            'star': soup.find_all(name='p', attrs={'class': "star"})[item].string.strip()[3:],
            'time': get_release_time(soup.find_all(class_='releasetime')[item].string.strip()[5:]),
            'area': get_release_area(soup.find_all(class_='releasetime')[item].string.strip()[5:]),
            'score': soup.find_all(name='i', attrs={'class': "integer"})[item].string.strip() +
                     soup.find_all(name='i', attrs={'class': "fraction"})[item].string.strip(),
        }

Besides pairing with CSS selectors, BeautifulSoup can also extract data directly with its own find_all function, as shown above.

2. Storage methods

1. Storing as a dictionary (JSON string)

import json

def write_to_file(items):
    # 'a' means append; utf_8_sig keeps Simplified Chinese text from being garbled
    with open('save.csv', 'a', encoding='utf_8_sig') as f:
        f.write(json.dumps(items, ensure_ascii=False) + '\n')
        print('Finished scraping film #%s' % items["index"])
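Because each call appends one JSON object per line, the resulting file is in JSON Lines form rather than CSV, despite the .csv name, and it can be read back line by line. The sketch below uses a temporary file and made-up records; read_items is a hypothetical name.

```python
import json
import os
import tempfile


def read_items(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    # utf_8_sig also strips the byte-order mark written at the start of the file
    with open(path, encoding='utf_8_sig') as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Made-up records, appended one at a time the way write_to_file does.
sample = [{'index': '1', 'name': '示例电影A', 'score': '9.6'},
          {'index': '2', 'name': '示例电影B', 'score': '9.5'}]
path = os.path.join(tempfile.mkdtemp(), 'save.jsonl')
for item in sample:
    with open(path, 'a', encoding='utf_8_sig') as f:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

loaded = list(read_items(path))
print(loaded == sample)
```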

2. Formatted storage (CSV rows by field name)

import csv

def write_to_file2(items):
    with open('save2.csv', 'a', encoding='utf_8_sig', newline='') as f:
        fieldnames = ['index', 'thumb', 'name', 'star', 'time', 'area', 'score']
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writerow(items)
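csv.DictWriter writes values in the order given by fieldnames and, by default, raises ValueError when a dict carries a key that is not listed. That matters here because the regex parser yields 'title', 'actors' and 'release_time', while fieldnames expects 'name', 'star' and 'time'. A minimal sketch with made-up rows and a shortened field list, writing to an in-memory buffer:

```python
import csv
import io

fieldnames = ['index', 'name', 'score']
buf = io.StringIO()
w = csv.DictWriter(buf, fieldnames=fieldnames)
w.writeheader()  # optional; the version in this post writes no header row
w.writerow({'index': '1', 'name': 'Sample A', 'score': '9.6'})

try:
    # 'title' is not in fieldnames, so this row is rejected
    w.writerow({'index': '2', 'title': 'Sample B', 'score': '9.5'})
    rejected = False
except ValueError:
    rejected = True

print(buf.getvalue())
print(rejected)
```

Renaming the keys in the parser, or passing extrasaction='ignore' to DictWriter, are two ways to reconcile the mismatch.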

3. Storing values only

import csv

def write_to_file3(items):
    with open('save.csv', 'a', encoding='utf_8_sig', newline='') as f:
        w = csv.writer(f)
        w.writerow(items.values())
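Writing items.values() relies on every dict having its keys in the same order. That holds for the parsers above because each builds its dict in a fixed order, and dicts preserve insertion order in Python 3.7 and later. A short round trip with a made-up row:

```python
import csv
import io

item = {'index': '1', 'name': 'Sample A', 'score': '9.6'}  # made-up row

buf = io.StringIO()
csv.writer(buf).writerow(item.values())

buf.seek(0)
row = next(csv.reader(buf))
# The file holds values only, so the reader has to supply the column names.
restored = dict(zip(item.keys(), row))
print(restored == item)
```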

3. Data analysis: visualization

As an example, we draw a bar chart of the ten highest-rated films.

1. Preliminaries: import the required libraries, load the data, and set the theme

import matplotlib.pyplot as plt
import pylab as pl
import pandas as pd

# Use SimHei so that Chinese film names render correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['font.family'] = 'sans-serif'
# Fix garbled rendering of the minus sign '-'
plt.rcParams['axes.unicode_minus'] = False
# Set the theme
plt.style.use('ggplot')
# Set the figure size
fig = plt.figure(figsize=(8, 5))
colors1 = '#6D6D6D'
# Load the raw data; utf_8_sig matches the encoding used when writing save2.csv
columns = ['index', 'thumb', 'name', 'star', 'time', 'area', 'score']
df = pd.read_csv('save2.csv', encoding='utf_8_sig', header=None, names=columns, index_col='index')

2. Plotting

def annsis1():
    # ascending=False sorts in descending order, True in ascending order
    df_score = df.sort_values('score', ascending=False)
    name1 = df_score['name'][:10]    # x-axis labels
    score1 = df_score['score'][:10]  # y-axis values
    # Draw the bar chart; range() keeps the bars in sorted order along the x axis
    plt.bar(range(10), score1, tick_label=name1)
    plt.ylim(9, 10)
    plt.title("Top 10 highest-rated films", color=colors1)
    plt.xlabel('Film')
    plt.ylabel('Score')
    # Label each bar with its value
    for x, y in enumerate(list(score1)):
        plt.text(x, y + 0.01, '%s' % round(y, 1), ha='center', color=colors1)
    plt.xticks(rotation=270)  # rotate the labels by 270°
    plt.tight_layout()        # trim surrounding whitespace
    plt.show()

The labels are rotated by 270° so that long film names do not overlap their neighbors.
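As a cross-check on the chart, the same ranking can be computed with the standard library alone. This sketch assumes rows in the column order written by write_to_file2 and uses made-up data in place of save2.csv; top_n is a hypothetical name.

```python
import csv
import io

COLUMNS = ['index', 'thumb', 'name', 'star', 'time', 'area', 'score']


def top_n(lines, n=10):
    """Return (name, score) pairs for the n highest scores, best first."""
    rows = csv.DictReader(lines, fieldnames=COLUMNS)
    # Scores are read as strings, so convert before sorting
    ranked = sorted(rows, key=lambda r: float(r['score']), reverse=True)
    return [(r['name'], float(r['score'])) for r in ranked[:n]]


# Made-up rows standing in for save2.csv (no header row, as in this post).
SAMPLE = io.StringIO(
    "1,a.jpg,Sample A,Actor X,2000-01-01,Area P,9.5\r\n"
    "2,b.jpg,Sample B,Actor Y,2001-01-01,Area Q,9.6\r\n"
    "3,c.jpg,Sample C,Actor Z,2002-01-01,Area R,9.4\r\n"
)
print(top_n(SAMPLE, n=2))
```

pandas performs the string-to-number conversion during read_csv, which is why the plotting code can sort the score column directly.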

3. Results
