Cris's Big Data Projects: Hive Statistics on Popular YouTube Videos

Author: Cris

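Table of Contents

- 1. Project Requirements
- 2. Table Structure and ETL
  - 2.1 Table Structure
  - 2.2 ETL of the Raw Data
  - 2.3 Writing the ETL Code in IDEA
    - Mapper Stage
    - EtlStringUtil
    - Driver
    - Testing
- 3. Running the ETL and Creating Tables on the Server
  - 3.1 Packaging and Data Cleaning
  - 3.2 Creating the Tables
  - 3.3 Loading the Data
  - 3.4 Inserting Data into the ORC Tables
- 4. Business Analysis
  - 4.1 Top 10 Videos by View Count
  - 4.2 Top 10 Video Categories by Popularity
  - 4.3 Categories of the 20 Most-Viewed Videos and the Number of Top 20 Videos in Each Category
  - 4.4 Categories of the Videos Related to the Top 50 Most-Viewed Videos, Ranked by Number of Videos
  - 4.5 Top 10 Videos by Popularity in Each Category, Using Music as an Example
  - 4.6 Top 10 Videos by Traffic in Each Category, Using Music as an Example
  - 4.7 Top 10 Users by Upload Count and the 20 Most-Viewed Videos Among Their Uploads
  - 4.8 Top 10 Videos by View Count in Every Category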

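1. Project Requirements

Compute the usual metrics for the YouTube site, mostly TopN metrics:

– Top 10 videos by view count

– Top 10 video categories by popularity

– Categories of the Top 20 videos by view count

– Ranking of the categories of the videos related to the Top 50 videos by view count

– Top 10 videos by popularity in each category

– Top 10 videos by traffic in each category

– Top 10 users by number of uploads, and the videos they uploaded

– Top 10 videos by view count in every category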

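2. Table Structure and ETL

2.1 Table Structure

Video table

Field	Note	Description
video id	Unique video ID	11-character string
uploader	Video uploader	Username of the user who uploaded the video, String
age	Video age	Whole number of days the video has been on the platform
category	Video category	Category assigned to the video when it was uploaded
length	Video length	Video length as an integer
views	View count	Number of times the video has been viewed
rate	Video rating	Out of 5
ratings	Traffic	The video's traffic, an integer
comments	Comment count	Number of comments on the video, an integer
related ids	Related video IDs	IDs of related videos, at most 20

The related ids field may be empty (a newly uploaded video may not yet have had related videos assigned to it by data analysis).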

User table

Field	Note	Type
uploader	Uploader username	string
videos	Number of uploaded videos	int
friends	Number of friends	int

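2.2 ETL of the Raw Data

Looking at the raw data, a video can belong to several categories, separated by the & character with a space on each side. A video can also have several related videos, separated by "\t". To make these multi-valued fields easy to work with during analysis, we first restructure and clean the data: the categories stay separated by "&" but lose the surrounding spaces, and the related video IDs are also joined with "&".

In a raw data row, the fourth column holds the categories, so the ETL has to strip the spaces from it. The tenth column onward holds the related video IDs, so the ETL has to replace the \t separators between them with the chosen delimiter.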

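2.3 Writing the ETL Code in IDEA

Before writing the ETL code, be clear about the goal of the cleaning, that is, the rules by which the raw data is to be transformed. This is the heart of ETL: only when the relationship between the raw material and the final product is clear can the ETL code be written quickly.

- Mapper Stage

ETL usually only filters and rewrites the raw data, which the Mapper stage can do on its own, so an ETL job typically has a Mapper and no Reducer. Keep this in mind!

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Mapper for the ETL job; no Reducer is needed.
 *
 * @author zc-cris
 * @version 1.0
 */
@SuppressWarnings("JavaDoc")
public class MovieTopMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private final Text k = new Text();

    /**
     * Validates and converts each line through the utility class.
     *
     * @param key     byte offset of the line in the input
     * @param value   one line of text
     * @param context MapReduce context object
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] strings = value.toString().split("\t");
        String result = EtlStringUtil.handle(strings);
        if (result != null) {
            // custom counter for successfully cleaned rows
            context.getCounter("ETLCounter", "true").increment(1);
            k.set(result);
            context.write(k, NullWritable.get());
        } else {
            // custom counter for rows that failed cleaning and were filtered out
            context.getCounter("ETLCounter", "false").increment(1);
        }
    }
}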

- EtlStringUtil

As the Mapper code above shows, we need a utility class that does the actual validation and conversion:

/**
 * ETL utility class.
 *
 * @author zc-cris
 * @version 1.0
 */
public class EtlStringUtil {

    private static StringBuilder stringBuilder = new StringBuilder();
    /** Number of mandatory columns (everything before the related video IDs). */
    private static final int STANDARD_LENGTH = 9;
    private static final int STANDARD_LENGTH_SUB_1 = 8;

    /**
     * Quick test through main to make sure the utility class works.
     *
     * @param args arguments of the main method
     */
    public static void main(String[] args) {
        String handle = EtlStringUtil.handle("SDNkMu8ZT68\tw00dy911\t630\tPeople & Blogs\t186\t10181\t3.49\t494\tcris\tloveu\tsimida".split("\t"));
        System.out.println("handle = " + handle);
    }

    /**
     * Converts the input string array into a valid row.
     *
     * @param strings the string array to process
     * @return the processed string, or null if the row has too few columns
     */
    static String handle(String[] strings) {
        // Always clear the content left in the shared StringBuilder first!
        stringBuilder.delete(0, stringBuilder.length());
        if (strings.length < STANDARD_LENGTH) {
            return null;
        }
        // remove the spaces around "&" in the category column
        strings[3] = strings[3].replaceAll(" ", "");
        if (strings.length == STANDARD_LENGTH) {
            // no related video IDs: join the nine columns with tabs
            for (int i = 0; i < STANDARD_LENGTH_SUB_1; i++) {
                stringBuilder.append(strings[i]).append("\t");
            }
            stringBuilder.append(strings[8]);
        } else {
            for (int i = 0; i < STANDARD_LENGTH; i++) {
                stringBuilder.append(strings[i]).append("\t");
            }
            // join the related video IDs with "&"
            for (int i = 9; i < strings.length - 1; i++) {
                stringBuilder.append(strings[i]).append("&");
            }
            stringBuilder.append(strings[strings.length - 1]);
        }
        return stringBuilder.toString();
    }
}
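The same cleaning rule can also be written without the shared static StringBuilder, which removes the need to remember to clear it. The following is only a sketch under that assumption: the class name EtlJoinSketch and the method clean are hypothetical and are not part of the author's project, and the sample line is the one used in the main method above.

```java
import java.util.Arrays;

/** Hypothetical stateless variant of the cleaning rule; not the author's class. */
class EtlJoinSketch {

    /** Number of mandatory columns before the related video IDs start. */
    private static final int STANDARD_LENGTH = 9;

    /** Returns the cleaned line, or null when the row has fewer than 9 columns. */
    static String clean(String line) {
        String[] fields = line.split("\t");
        if (fields.length < STANDARD_LENGTH) {
            return null;
        }
        // "People & Blogs" becomes "People&Blogs"
        fields[3] = fields[3].replace(" ", "");
        // the first nine columns stay tab-separated
        String head = String.join("\t", Arrays.copyOfRange(fields, 0, STANDARD_LENGTH));
        if (fields.length == STANDARD_LENGTH) {
            return head;
        }
        // everything after column nine is a related video ID, joined with "&"
        String related = String.join("&", Arrays.copyOfRange(fields, STANDARD_LENGTH, fields.length));
        return head + "\t" + related;
    }

    public static void main(String[] args) {
        String raw = "SDNkMu8ZT68\tw00dy911\t630\tPeople & Blogs\t186\t10181\t3.49\t494\tcris\tloveu\tsimida";
        System.out.println(clean(raw));
    }
}
```

For the sample line, the category column becomes People&Blogs and the last column becomes loveu&simida, the same result the handle method produces.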

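- Driver

Last comes the driver class for the whole ETL job, written the way Hadoop recommends: it implements Tool and is launched through ToolRunner.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * ETL driver class.
 *
 * @author zc-cris
 * @version 1.0
 */
@SuppressWarnings("JavaDoc")
public class MovieTopDriver implements Tool {

    /** Job configuration */
    private Configuration configuration = new Configuration();

    /** The MapReduce job */
    private Job job = null;

    public static void main(String[] args) {
        try {
            // ToolRunner parses the generic Hadoop options, calls setConf, then run
            System.exit(ToolRunner.run(new MovieTopDriver(), args));
        } catch (Exception e) {
            e.printStackTrace();
            System.exit(1);
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        job = Job.getInstance(configuration);
        job.setJarByClass(MovieTopDriver.class);
        job.setMapperClass(MovieTopMapper.class);
        // map-only job: no Reducer
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        initFileInputPath(args[0]);
        initFileOutputPath(args[1]);
        boolean flag = job.waitForCompletion(true);
        return flag ? 0 : 1;
    }

    /**
     * Checks the output path of the cleaned files.
     *
     * @param outputPath ETL output path
     * @throws IOException
     */
    private void initFileOutputPath(String outputPath) throws IOException {
        FileSystem fileSystem = FileSystem.get(configuration);
        Path path = new Path(outputPath);
        if (fileSystem.exists(path)) {
            // delete an ETL output directory left over from a previous run
            fileSystem.delete(path, true);
        }
        FileOutputFormat.setOutputPath(job, path);
    }

    /**
     * Checks the input path of the files to be cleaned.
     *
     * @param inputPath ETL input path
     * @throws IOException
     */
    private void initFileInputPath(String inputPath) throws IOException {
        FileSystem fileSystem = FileSystem.get(configuration);
        Path f = new Path(inputPath);
        if (fileSystem.exists(f)) {
            FileInputFormat.setInputPaths(job, f);
        } else {
            throw new RuntimeException("ETL input path does not exist!");
        }
    }

    @Override
    public void setConf(Configuration conf) {
        this.configuration = conf;
    }

    @Override
    public Configuration getConf() {
        return this.configuration;
    }
}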

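- Testing

Any code we write must be tested by ourselves before it is committed!

The test for the utility class is already in its main method above. Only once the utility class works does it make sense to test the whole conversion flow locally.

Then run the job locally and inspect the output (the result screenshot is not reproduced here).

Looking at the fourth and tenth columns of the output confirms that the ETL requirement has been met.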

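3. Running the ETL and Creating Tables on the Server

The ETL has already run successfully on the local machine. Next we package the program, move it to Linux to clean the data there, and then create the tables.

3.1 Packaging and Data Cleaning

Package the program with Maven.

Use the rz command to send the data and the jar to Linux (this is somewhat slow; lrzsz is not recommended for large files).

Upload the raw data to HDFS:

hadoop fs -put ./movie /movie

Then run the ETL program:

hadoop jar ./movieTop-1.0-SNAPSHOT.jar "com.cris.MovieTopDriver" /movie /movie_etl_out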

3.2 Creating the Tables

We need four tables: two raw-data tables that map directly onto the ETL output, and two more tables stored in the compressed ORC format. All the actual queries run against the ORC tables, which makes them much faster, and the compressed files are much smaller.

Create the tables:

-- raw video data table
create table movieTop_ori(videoId string, uploader string, age int, category array<string>, length int, views int, rate float, ratings int, comments int, relatedId array<string>)
row format delimited
fields terminated by "\t"
collection items terminated by "&"
stored as textfile;

-- raw user data table
create table movieTop_user_ori(uploader string, videos int, friends int)
row format delimited
fields terminated by "\t"
stored as textfile;

-- ORC tables
create table movieTop_orc(videoId string, uploader string, age int, category array<string>, length int, views int, rate float, ratings int, comments int, relatedId array<string>)
row format delimited
fields terminated by "\t"
collection items terminated by "&"
stored as orc;

create table movieTop_user_orc(uploader string, videos int, friends int)
row format delimited
fields terminated by "\t"
stored as orc;
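To make the two delimiters of the raw table concrete, here is a small Java sketch of how one cleaned line breaks down: "\t" separates the columns, and "&" separates the items of the two array columns. It only illustrates the layout and is not how Hive reads the file internally. The class RowLayoutSketch and its methods are hypothetical, and the sketch returns an empty list for a row without related IDs; how Hive itself represents that case is not shown here.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration of how the table delimiters split one cleaned line. */
class RowLayoutSketch {

    /** The cleaned form of the sample line used in the EtlStringUtil test. */
    static final String SAMPLE =
            "SDNkMu8ZT68\tw00dy911\t630\tPeople&Blogs\t186\t10181\t3.49\t494\tcris\tloveu&simida";

    /** Splits an array column on "&", the collection item delimiter. */
    static List<String> items(String column) {
        return column.isEmpty() ? Collections.<String>emptyList() : Arrays.asList(column.split("&"));
    }

    /** category: the fourth column (index 3) of the cleaned line. */
    static List<String> categories(String cleanedLine) {
        return items(cleanedLine.split("\t")[3]);
    }

    /** relatedId: the tenth column (index 9), absent when a video has no related IDs. */
    static List<String> relatedIds(String cleanedLine) {
        String[] columns = cleanedLine.split("\t");
        return columns.length > 9 ? items(columns[9]) : Collections.<String>emptyList();
    }

    public static void main(String[] args) {
        System.out.println("category  = " + categories(SAMPLE));
        System.out.println("relatedId = " + relatedIds(SAMPLE));
    }
}
```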

3.3 Loading the Data

First load the data produced by the ETL, then the user data:

load data inpath "/movie_etl_out" into table movieTop_ori;

load data inpath "/user.txt" into table movieTop_user_ori;

3.4 Inserting Data into the ORC Tables

insert into table movieTop_orc select * from movieTop_ori;

insert into table movieTop_user_orc select * from movieTop_user_ori;

4. Business Analysis

4.1 Top 10 Videos by View Count

Approach: a global sort on the views column with order by is enough, limited to the first 10 rows.

select videoId, views from movieTop_orc order by views desc limit 10;

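4.2 Top 10 Video Categories by Popularity

Approach:

Popularity here means the number of videos in a category, so we want the 10 categories that contain the most videos.
We group by category and count the videoId values in each group.
In the current table one video maps to one or more categories, so before grouping by category the category array has to be expanded into one row per category (explode).
Finally we sort by popularity and show the first 10 rows.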
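The steps above can be pictured on a tiny in-memory data set. The sketch below is plain Java, not Hive, and every name and every sample row in it is made up for illustration: the inner loop plays the role of explode, the counting map plays the role of group by with count, and the final sort plus cut-off plays the role of order by with limit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory picture of explode + group by + order by + limit. */
class CategoryHeatSketch {

    /** Made-up sample: videoId mapped to its category array. */
    static Map<String, List<String>> sample() {
        Map<String, List<String>> videos = new LinkedHashMap<>();
        videos.put("v1", Arrays.asList("Music"));
        videos.put("v2", Arrays.asList("Music", "Comedy"));
        videos.put("v3", Arrays.asList("People", "Blogs"));
        videos.put("v4", Arrays.asList("Comedy"));
        videos.put("v5", Arrays.asList("Music"));
        return videos;
    }

    /** Counts how many videos carry each category. */
    static Map<String, Integer> countByCategory(Map<String, List<String>> videos) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<String>> video : videos.entrySet()) {
            // explode: one (videoId, category) pair per array element
            for (String category : video.getValue()) {
                // group by category, count the videoId values in the group
                counts.merge(category, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Sorts by count descending and keeps the first n entries. */
    static List<String> topCategories(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey() + "=" + entries.get(i).getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(topCategories(countByCategory(sample()), 2));
    }
}
```

With the made-up sample, Music is counted three times and Comedy twice, so the top two entries are Music=3 and Comedy=2.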

The first step explodes the category array of every video, producing a temporary table with one row per (video, category) pair:

-- LATERAL VIEW udtf(expression) tableAlias AS columnAlias
select videoId, category_name
from movieTop_orc lateral view explode(category) category_info as category_name;

Then query the 10 most popular categories from that temporary table:

select count(videoId) as hot, category_name
from (
    select videoId, category_name
    from movieTop_orc lateral view explode(category) category_info as category_name
) as tbl_temp1
group by category_name
order by hot desc
limit 10;

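4.3 Categories of the 20 Most-Viewed Videos and the Number of Top 20 Videos in Each Category

Approach:

First find the full rows of the 20 most-viewed videos, in descending order.
Then explode the category array of these 20 rows into one row per category.
Finally query each category name and how many Top 20 videos it contains.

To be safe, keep the column used in order by (views here) in the select list of the subquery, since some Hive versions reject ordering by a column that is not selected. Likewise, every non-aggregated column in select must appear in group by.

Remember the lateral view explode(column) tableAlias as columnAlias syntax.

select videoId, category, views from movieTop_orc order by views desc limit 20;

select videoId, category_name
from (
    select videoId, category, views from movieTop_orc order by views desc limit 20
) as tbl_temp
lateral view explode(category) category_info as category_name;

select count(videoId) as video_count, category_name
from (
    select videoId, category_name
    from (
        select videoId, category, views from movieTop_orc order by views desc limit 20
    ) as tbl_temp
    lateral view explode(category) category_info as category_name
) as tbl_temp2
group by category_name
order by video_count desc;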

4.4 Categories of the Videos Related to the Top 50 Most-Viewed Videos, Ranked by Number of Videos

In the steps below, t1 to t4 each stand for the result of the previous step; the last query nests them into one statement.

First query the 50 most-viewed videos (the related videos are what matters here!):

select views, relatedId from movieTop_orc order by views desc limit 50;

The related video column is an array, so it has to be exploded:

select views, related_info from t1 lateral view explode(relatedId) related_numbers as related_info;

To find the categories of these related videos, join back to the movieTop_orc table:

select distinct(t2.related_info), tbl_ori.category, t2.views from t2 join movieTop_orc as tbl_ori on tbl_ori.videoId = t2.related_info;

The category column is also an array, so it has to be exploded too:

select category_name, views, related_info from t3 lateral view explode(category) category_info as category_name;

Rank the categories:

select category_name, count(related_info) as hot from t4 group by category_name order by hot desc;

All steps combined into one statement:

select category_name, count(related_info) as hot
from (
    select category_name, views, related_info
    from (
        select distinct(t2.related_info), tbl_ori.category, t2.views
        from (
            select views, related_info
            from (
                select views, relatedId from movieTop_orc order by views desc limit 50
            ) as t1
            lateral view explode(relatedId) related_numbers as related_info
        ) as t2
        join movieTop_orc as tbl_ori on tbl_ori.videoId = t2.related_info
    ) as t3
    lateral view explode(category) category_info as category_name
) as t4
group by category_name order by hot desc;
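Because this query nests four steps, it may help to trace them on a handful of made-up rows. The Java sketch below mirrors the steps in memory and is only an illustration under stated assumptions: the Video class, the sample data and the method names are all hypothetical, the inner join is modelled by skipping related IDs that are not in the table, and distinct is modelled as distinct (related ID, views) pairs, since the category of a related video is determined by its ID. The result is returned in alphabetical order rather than sorted by count.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical in-memory walk-through of steps t1 to t4; not Hive code. */
class RelatedCategorySketch {

    static final class Video {
        final String id;
        final int views;
        final List<String> categories;
        final List<String> relatedIds;

        Video(String id, int views, List<String> categories, List<String> relatedIds) {
            this.id = id;
            this.views = views;
            this.categories = categories;
            this.relatedIds = relatedIds;
        }
    }

    /** Made-up sample; "x" is a related ID with no row of its own. */
    static List<Video> sample() {
        return Arrays.asList(
                new Video("a", 100, Arrays.asList("Music"), Arrays.asList("c", "d")),
                new Video("b", 90, Arrays.asList("Comedy"), Arrays.asList("c", "x")),
                new Video("c", 10, Arrays.asList("Music", "Comedy"), new ArrayList<String>()),
                new Video("d", 5, Arrays.asList("Music"), new ArrayList<String>()));
    }

    static Map<String, Integer> relatedCategoryCounts(List<Video> videos, int topN) {
        Map<String, Video> byId = new HashMap<>();
        for (Video v : videos) {
            byId.put(v.id, v);
        }
        // t1: the topN most-viewed videos
        List<Video> top = new ArrayList<>(videos);
        top.sort((x, y) -> Integer.compare(y.views, x.views));
        top = top.subList(0, Math.min(topN, top.size()));

        Set<String> distinctRows = new HashSet<>();
        Map<String, Integer> counts = new TreeMap<>();
        for (Video t : top) {
            for (String relatedId : t.relatedIds) {      // t2: explode relatedId
                Video related = byId.get(relatedId);     // t3: join on videoId
                if (related == null || !distinctRows.add(relatedId + "|" + t.views)) {
                    continue;                            // unmatched or duplicate row
                }
                for (String category : related.categories) {   // t4: explode category
                    counts.merge(category, 1, Integer::sum);   // group by + count
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(relatedCategoryCounts(sample(), 2));
    }
}
```

In the sample, video c is related to both top videos and therefore contributes twice, while the unknown ID x is dropped by the join, giving Comedy=2 and Music=3.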

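4.5 Top 10 Videos by Popularity in Each Category, Using Music as an Example

To find the Top 10 videos in the Music category we first have to locate the Music category, which means exploding category. So we create a table that stores the rows with the category exploded into a categoryId column.
Then we insert the exploded data into that table.
Finally we rank the videos in the chosen category (Music) by popularity.

Tips: for a column that holds several tags (usually an array, such as category information), it often pays to build a table with one row per tag per record. That makes complex SQL queries on the tags much easier.

create table movieTop_category(videoId string, uploader string, age int, categoryId string, length int, views int, rate float, ratings int, comments int, relatedId array<string>)
row format delimited
fields terminated by "\t"
collection items terminated by "&"
stored as orc;

insert into table movieTop_category
select videoId, uploader, age, categoryId, length, views, rate, ratings, comments, relatedId
from movieTop_orc lateral view explode(category) category_tbl as categoryId;

Finally run the query:

select videoId, views from movieTop_category where categoryId = 'Music' order by views desc limit 10;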

4.6 Top 10 Videos by Traffic in Each Category, Using Music as an Example

select videoId, ratings from movieTop_category where categoryId = 'Music' order by ratings desc limit 10;

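4.7 Top 10 Users by Upload Count and the 20 Most-Viewed Videos Among Their Uploads

The subquery picks the 10 users with the most uploads. Joining them to the video table and sorting by views returns the 20 most-viewed videos across all ten users together, not 20 per user.

select t2.videoId, t2.views, t1.uploader
from (
    select uploader, videos from movieTop_user_orc order by videos desc limit 10
) as t1
join movieTop_orc as t2 on t1.uploader = t2.uploader
order by t2.views desc limit 20;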

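4.8 Top 10 Videos by View Count in Every Category

Tips: a per-category TopN requirement like this is usually solved with a ranking function (rank or row_number) combined with an over window.

Start from the table with the exploded categoryId.
In a subquery, partition by categoryId, sort within each partition, and generate an increasing number per row; this column is named rk in the query below.
From the temporary table produced by the subquery, keep the rows whose rk is less than or equal to 10.

select *
from (
    select videoId, views, categoryId, row_number() over(partition by categoryId order by views desc) as rk
    from movieTop_category
) as t1
where t1.rk < 11;

Ranking within every category is different from ranking within one given category: the former needs a ranking function with an over window, while the latter only needs order by with a where clause.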
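What a row-numbering window over partitions computes can be traced by hand on a few rows. The following plain-Java sketch is an illustration only: the class, its methods and the sample rows are hypothetical, each row is a {videoId, categoryId, views} triple, and the output lines have the form category:videoId:rk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory picture of numbering rows per partition and keeping the first n. */
class TopPerCategorySketch {

    /** Made-up rows: {videoId, categoryId, views}. */
    static List<String[]> sample() {
        return Arrays.<String[]>asList(
                new String[] {"v1", "Music", "50"},
                new String[] {"v2", "Music", "70"},
                new String[] {"v3", "Music", "60"},
                new String[] {"v4", "Comedy", "10"},
                new String[] {"v5", "Comedy", "30"});
    }

    static List<String> topPerCategory(List<String[]> rows, int n) {
        // partition by categoryId
        Map<String, List<String[]>> partitions = new TreeMap<>();
        for (String[] row : rows) {
            partitions.computeIfAbsent(row[1], key -> new ArrayList<>()).add(row);
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String[]>> partition : partitions.entrySet()) {
            List<String[]> part = partition.getValue();
            // order by views desc inside the partition
            part.sort((a, b) -> Integer.compare(Integer.parseInt(b[2]), Integer.parseInt(a[2])));
            // number the rows 1, 2, 3, ... and keep those with rk <= n
            for (int i = 0; i < Math.min(n, part.size()); i++) {
                result.add(partition.getKey() + ":" + part.get(i)[0] + ":" + (i + 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String line : topPerCategory(sample(), 2)) {
            System.out.println(line);
        }
    }
}
```

Each category restarts its numbering at 1, which is why a single order by with limit cannot produce this result for all categories at once.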

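Note that the result screenshot (not reproduced here) was not a scrolling capture, so it did not show the rankings for every category.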