Project: YouToBe

3. Project

The raw YouTube data can be downloaded here: https://pan.baidu.com/s/1we1KPA2IIEAGIJczyr2dMQ

3.1 Data structure

3.1.1 Video table

3.1.2 User table

3.2 Raw data location

HDFS directories:

Video dataset: /youtube/video/2008

User dataset: /youtube/users/2008

3.3 Technology stack

Hadoop 2.7.2

Hive 1.2.2

MySQL 5.6

3.3.1 Data cleaning

Hadoop MapReduce

3.3.2 Data analysis

MapReduce or Hive

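3.4 ETL of the raw data

Inspecting the raw data shows that a video can belong to several categories, separated by an "&" with a space on each side, and that a video can also have several related videos, separated by "\t". To make these multi-valued fields easier to work with during analysis, we first clean and restructure the data: categories are separated by "&" with the surrounding spaces removed, and the related video ids are also joined with "&".

The project's pom.xml file: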

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.z</groupId>
    <artifactId>youtube</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>
    <name>youtube</name>
    <url>http://maven.apache.org</url>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>
    <repositories>
        <repository>
            <id>centor</id>
            <url>http://central.maven.org/maven2/</url>
        </repository>
    </repositories>
    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-yarn-server-resourcemanager</artifactId>
            <version>2.7.2</version>
        </dependency>
    </dependencies>
</project>
3.4.1 ETL: ETLUtil

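package com.z.youtube.util;

public class ETLUtil {
    /**
     * Cleans one raw video record:
     * 1. drops invalid records (fewer than 9 fields);
     * 2. removes the spaces around "&" in the category field;
     * 3. joins the related video ids with "&" instead of "\t".
     *
     * @param ori one raw, tab-separated input line
     * @return the cleaned line, or null if the record is invalid
     */
    public static String oriString2ETLString(String ori) {
        String[] splits = ori.split("\t");
        // 1. Drop invalid records
        if (splits.length < 9) return null;
        // 2. Remove the spaces around "&" in the category field
        splits[3] = splits[3].replaceAll(" ", "");
        StringBuilder sb = new StringBuilder();
        // 3. Keep "\t" after each of the first 9 fields; join the related ids with "&"
        for (int i = 0; i < splits.length; i++) {
            sb.append(splits[i]);
            if (i != splits.length - 1) {
                sb.append(i < 9 ? "\t" : "&");
            }
        }
        return sb.toString();
    }
}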
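
To see what the cleaning rule does to a single record, here is a minimal, self-contained sketch that can be run without Hadoop. The class name EtlSketch and the sample record are made up for illustration; the logic is meant to mirror the utility method above, not replace it.

```java
import java.util.StringJoiner;

// Illustrative only: EtlSketch and the sample record are hypothetical.
class EtlSketch {

    // Same rule as the utility above: validate, strip spaces in the category
    // field, keep tabs between the 9 fixed fields, join related ids with "&".
    static String clean(String line) {
        String[] f = line.split("\t");
        if (f.length < 9) {
            return null; // fewer than 9 fields: invalid record
        }
        f[3] = f[3].replace(" ", ""); // "People & Blogs" -> "People&Blogs"

        StringJoiner fixed = new StringJoiner("\t");
        for (int i = 0; i < 9; i++) {
            fixed.add(f[i]);
        }
        if (f.length == 9) {
            return fixed.toString(); // record without related videos
        }
        StringJoiner related = new StringJoiner("&");
        for (int i = 9; i < f.length; i++) {
            related.add(f[i]);
        }
        return fixed + "\t" + related;
    }

    public static void main(String[] args) {
        String raw = "vid001\tuserA\t700\tPeople & Blogs\t120\t1000\t4.5\t10\t3\trelA\trelB";
        System.out.println(clean(raw));          // category and related ids now use "&"
        System.out.println(clean("too\tshort")); // null: the record is rejected
    }
}
```

After cleaning, the category field and the related-id field each form a single "&"-separated column, which is the shape the Hive tables later in this post expect for their array columns.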
3.4.2 ETL: Mapper

package com.z.youtube.mr.etl;

import java.io.IOException;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import com.z.youtube.util.ETLUtil;

public class VideoETLMapper extends Mapper<Object, Text, NullWritable, Text> {
    // Reused output value, to avoid allocating a new Text per record
    Text text = new Text();

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String etlString = ETLUtil.oriString2ETLString(value.toString());
        // Skip records that the cleaning step rejected
        if (StringUtils.isBlank(etlString)) return;
        text.set(etlString);
        context.write(NullWritable.get(), text);
    }
}
3.4.3 ETL: Runner

package com.z.youtube.mr.etl;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class VideoETLRunner implements Tool {
    private Configuration conf = null;

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    @Override
    public Configuration getConf() {
        return this.conf;
    }

    @Override
    public int run(String[] args) throws Exception {
        conf = this.getConf();
        // Pass the input and output paths to the helper methods through the configuration
        conf.set("inpath", args[0]);
        conf.set("outpath", args[1]);
        Job job = Job.getInstance(conf, "youtube-video-etl");
        job.setJarByClass(VideoETLRunner.class);
        job.setMapperClass(VideoETLMapper.class);
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(Text.class);
        // Map-only job: no reducers
        job.setNumReduceTasks(0);
        this.initJobInputPath(job);
        this.initJobOutputPath(job);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    // Deletes the output directory if it already exists, then registers it as the job output
    private void initJobOutputPath(Job job) throws IOException {
        Configuration conf = job.getConfiguration();
        String outPathString = conf.get("outpath");
        FileSystem fs = FileSystem.get(conf);
        Path outPath = new Path(outPathString);
        if (fs.exists(outPath)) {
            fs.delete(outPath, true);
        }
        FileOutputFormat.setOutputPath(job, outPath);
    }

    // Registers the input directory, failing fast if it does not exist
    private void initJobInputPath(Job job) throws IOException {
        Configuration conf = job.getConfiguration();
        String inPathString = conf.get("inpath");
        FileSystem fs = FileSystem.get(conf);
        Path inPath = new Path(inPathString);
        if (fs.exists(inPath)) {
            FileInputFormat.addInputPath(job, inPath);
        } else {
            throw new RuntimeException("Input path does not exist in HDFS: " + inPathString);
        }
    }

    public static void main(String[] args) {
        try {
            int resultCode = ToolRunner.run(new VideoETLRunner(), args);
            if (resultCode == 0) {
                System.out.println("Success!");
            } else {
                System.out.println("Fail!");
            }
            System.exit(resultCode);
        } catch (Exception e) {
            e.printStackTrace();
            System.exit(1);
        }
    }
}
3.4.4 Running the ETL

Tip: the Maven arguments for compiling and packaging are: -P local clean package

bin/yarn jar ~/softwares/jars/youtube-0.0.1-SNAPSHOT.jar \
com.z.youtube.mr.etl.VideoETLRunner \
/youtube/video/2008/0222 \
/youtube/output/video/2008/0222
3.5 Preparation

3.5.1 Creating the tables

Create the tables youtube_ori and youtube_user_ori,

and the tables youtube_orc and youtube_user_orc.

youtube_ori:

create table youtube_ori(
videoId string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>)
row format delimited
fields terminated by "\t"
collection items terminated by "&"
stored as textfile;
youtube_user_ori:

create table youtube_user_ori(
uploader string,
videos int,
friends int)
clustered by (uploader) into 24 buckets
row format delimited
fields terminated by "\t"
stored as textfile;
The raw data will then be inserted into the ORC tables.

youtube_orc:

create table youtube_orc(
videoId string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>)
clustered by (uploader) into 8 buckets
row format delimited fields terminated by "\t"
collection items terminated by "&"
stored as orc;
youtube_user_orc:

create table youtube_user_orc(
uploader string,
videos int,
friends int)
clustered by (uploader) into 24 buckets
row format delimited
fields terminated by "\t"
stored as orc;
3.5.2 Loading the ETL output

youtube_ori:

load data inpath "/youtube/output/video/2008/0222" into table youtube_ori;

youtube_user_ori:

load data inpath "/youtube/user/2008/0903" into table youtube_user_ori;

3.5.3 Inserting data into the ORC tables

youtube_orc:

insert into table youtube_orc select * from youtube_ori;

youtube_user_orc:

insert into table youtube_user_orc select * from youtube_user_ori;

3.6 Business analysis

3.6.1 Top 10 videos by view count

Approach:

1) A global sort on the views column with order by is enough; we limit the output to the first 10 rows.

Final code:

select
    videoId,
    uploader,
    age,
    category,
    length,
    views,
    rate,
    ratings,
    comments
from
    youtube_orc
order by
    views desc
limit 10;
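3.6.2 Top 10 video categories by popularity

Approach:

1) That is, count how many videos each category contains and show the 10 categories with the most videos.

2) We need to group by category and count the videoId values in each group.

3) In the current table one video maps to one or more categories, so before grouping by category the category array has to be exploded into one row per category; only then can we count.

4) Finally, sort by popularity and show the first 10 rows.

Final code:

select
    category_name as category,
    count(t1.videoId) as hot
from (
    select
        videoId,
        category_name
    from
        youtube_orc lateral view explode(category) t_category as category_name) t1
group by
    t1.category_name
order by
    hot desc
limit 10;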
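
To make the explode, group, count, and sort steps concrete, the following in-memory sketch mimics them with Java streams. It is only an illustration, not how Hive executes the query: the class name CategoryHotSketch and the sample rows are made up, and breaking ties by category name is an assumption added here so the result is deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative only: CategoryHotSketch and the sample rows are hypothetical.
class CategoryHotSketch {

    // Each inner list holds the categories of one video (one table row).
    static List<Map.Entry<String, Long>> topCategories(List<List<String>> categoriesPerVideo, int n) {
        Map<String, Long> counts = categoriesPerVideo.stream()
                .flatMap(List::stream) // explode: one element per (video, category) pair
                .collect(Collectors.groupingBy(c -> c, Collectors.counting())); // group by + count

        List<Map.Entry<String, Long>> ranked = new ArrayList<>(counts.entrySet());
        ranked.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue()); // order by hot desc
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey()); // tie-break (assumption)
        });
        return new ArrayList<>(ranked.subList(0, Math.min(n, ranked.size()))); // limit n
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Music"),
                Arrays.asList("People", "Blogs"),
                Arrays.asList("Music", "Entertainment"),
                Arrays.asList("Music"));
        System.out.println(topCategories(rows, 2));
    }
}
```

A video with two categories contributes one count to each of them, which is why the exploded row count can exceed the number of videos.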
3.6.3 Categories of the 20 most-viewed videos, and how many of those Top 20 videos each category contains

Approach:

1) First find all columns of the 20 most-viewed videos, sorted in descending order.

2) Explode the category column of these 20 rows into one row per category.

3) Finally, query each category name and how many Top 20 videos it contains.

Final code:

select
    category_name as category,
    count(t2.videoId) as hot_with_views
from (
    select
        videoId,
        category_name
    from (
        select
            *
        from
            youtube_orc
        order by
            views desc
        limit 20) t1 lateral view explode(category) t_category as category_name) t2
group by
    category_name
order by
    hot_with_views desc;
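3.6.4 Category popularity ranking of the videos related to the Top 50 most-viewed videos

Approach:

1) Query all columns of the 50 most-viewed videos (which of course include each video's related videos); call this temporary table t1.

t1: the 50 most-viewed videos

select
    *
from
    youtube_orc
order by
    views desc
limit 50;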
2) Explode the relatedId column of these 50 videos into rows; call this temporary table t2.

t2: the related video ids, one per row

select
explode(relatedId) as videoId
from
t1;
3) Inner join the related video ids with the youtube_orc table.

t4: two columns, the related video id found above and its category array; exploding category then yields t5.

(select
    distinct(t2.videoId),
    t3.category
from
    t2
inner join
    youtube_orc t3 on t2.videoId = t3.videoId) t4 lateral view explode(category) t_category as category_name;
4) Group by video category, count the videos in each group, then rank.

Final code:

select
    category_name as category,
    count(t5.videoId) as hot
from (
    select
        videoId,
        category_name
    from (
        select
            distinct(t2.videoId),
            t3.category
        from (
            select
                explode(relatedId) as videoId
            from (
                select
                    *
                from
                    youtube_orc
                order by
                    views desc
                limit 50) t1) t2
        inner join
            youtube_orc t3 on t2.videoId = t3.videoId) t4
    lateral view explode(category) t_category as category_name) t5
group by
    category_name
order by
    hot desc;
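
The nested query is easier to follow when each subquery is read as one pipeline stage. The sketch below walks through the same stages on in-memory data; it is an illustration under stated assumptions, not the Hive execution plan. The class RelatedCategorySketch, its Video type, and the sample data are hypothetical, and the final ordering by count is left out for brevity (the result is simply keyed by category name).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustrative only: RelatedCategorySketch, Video, and the sample data are hypothetical.
class RelatedCategorySketch {

    static final class Video {
        final String id;
        final int views;
        final List<String> categories;
        final List<String> relatedIds;

        Video(String id, int views, List<String> categories, List<String> relatedIds) {
            this.id = id;
            this.views = views;
            this.categories = categories;
            this.relatedIds = relatedIds;
        }
    }

    static Map<String, Long> relatedCategoryHeat(List<Video> videos, int topN) {
        // Lookup side of the inner join (plays the role of youtube_orc t3)
        Map<String, Video> byId = new HashMap<>();
        for (Video v : videos) {
            byId.put(v.id, v);
        }
        Map<String, Long> counts = videos.stream()
                .sorted(Comparator.comparingInt((Video v) -> v.views).reversed())
                .limit(topN)                           // t1: the topN most-viewed videos
                .flatMap(v -> v.relatedIds.stream())   // t2: explode(relatedId)
                .distinct()                            // each related id only once
                .map(byId::get)                        // join on videoId
                .filter(Objects::nonNull)              // inner join: drop ids with no match
                .flatMap(v -> v.categories.stream())   // t5: explode(category)
                .collect(Collectors.groupingBy(c -> c, Collectors.counting()));
        return new TreeMap<>(counts);
    }

    public static void main(String[] args) {
        List<Video> videos = Arrays.asList(
                new Video("v1", 100, Arrays.asList("Music"), Arrays.asList("v3", "v4")),
                new Video("v2", 90, Arrays.asList("Comedy"), Arrays.asList("v3", "v5", "missing")),
                new Video("v3", 10, Arrays.asList("Music", "Entertainment"), Collections.<String>emptyList()),
                new Video("v4", 5, Arrays.asList("Comedy"), Collections.<String>emptyList()),
                new Video("v5", 1, Arrays.asList("Music"), Arrays.asList("v1")));
        System.out.println(relatedCategoryHeat(videos, 2));
    }
}
```

Note that the related id "missing" has no row of its own, so the inner join silently drops it; related ids pointing outside the dataset behave the same way in the query.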
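3.6.5 Top 10 most popular videos in each category, using Music as an example

Approach:

1) To find the Top 10 videos in the Music category we first have to locate that category, which means exploding category; so we create a table that stores the data with the category exploded into a categoryId column.

2) Insert data into the exploded-category table.

3) Compute the video popularity for the chosen category (Music).

Final code:

Create the category table: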

create table youtube_category(
videoId string,
uploader string,
age int,
categoryId string,
length int,
views int,
rate float,
ratings int,
comments int,
relatedId array<string>)
row format delimited
fields terminated by "\t"
collection items terminated by "&"
stored as orc;
Insert data into the category table:

insert into table youtube_category
select
videoId,
uploader,
age,
categoryId,
length,
views,
rate,
ratings,
comments,
relatedId
from
youtube_orc lateral view explode(category) catetory as categoryId;
Top 10 for the Music category (other categories work the same way):

select
    videoId,
    views
from
    youtube_category
where
    categoryId = "Music"
order by
    views desc
limit 10;
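3.6.6 Top 10 videos by traffic in each category, using Music as an example

Approach:

1) Create the exploded category table (the table with categoryId expanded into rows).

2) Sort by ratings.

Final code:

select
    videoId,
    views,
    ratings
from
    youtube_category
where
    categoryId = "Music"
order by
    ratings desc
limit 10;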
3.6.7 Top 10 users by number of uploaded videos, and their videos ranked in the top 20 by view count

Approach:

1) First find the 10 users who uploaded the most videos.

select
    *
from
    youtube_user_orc
order by
    videos desc
limit 10;
2) Join with the youtube_orc table on the uploader column and sort the result by views.

Final code:

select
    t2.videoId,
    t2.views,
    t2.ratings,
    t1.videos,
    t1.friends
from (
    select
        *
    from
        youtube_user_orc
    order by
        videos desc
    limit 10) t1
join
    youtube_orc t2
on
    t1.uploader = t2.uploader
order by
    views desc
limit 20;
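
It may help to spell out what this join returns: the 20 most-viewed videos among those uploaded by the Top 10 uploaders, not the 20 most-viewed videos overall. The in-memory sketch below illustrates that reading. The class TopUploaderVideosSketch, its User and Video types, and the sample data are all hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative only: TopUploaderVideosSketch and its sample data are hypothetical.
class TopUploaderVideosSketch {

    static final class User {
        final String uploader;
        final int videos;

        User(String uploader, int videos) {
            this.uploader = uploader;
            this.videos = videos;
        }
    }

    static final class Video {
        final String videoId;
        final String uploader;
        final int views;

        Video(String videoId, String uploader, int views) {
            this.videoId = videoId;
            this.uploader = uploader;
            this.views = views;
        }
    }

    static List<String> topVideosOfTopUploaders(List<User> users, List<Video> videos,
                                                int topUsers, int topVideos) {
        // t1: the users with the most uploads
        Set<String> top = users.stream()
                .sorted(Comparator.comparingInt((User u) -> u.videos).reversed())
                .limit(topUsers)
                .map(u -> u.uploader)
                .collect(Collectors.toSet());
        return videos.stream()
                .filter(v -> top.contains(v.uploader)) // join on uploader
                .sorted(Comparator.comparingInt((Video v) -> v.views).reversed())
                .limit(topVideos)
                .map(v -> v.videoId)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<User> users = Arrays.asList(
                new User("alice", 300), new User("bob", 200), new User("carol", 5));
        List<Video> videos = Arrays.asList(
                new Video("x1", "carol", 1000000),
                new Video("a1", "alice", 50),
                new Video("a2", "alice", 70),
                new Video("b1", "bob", 60));
        System.out.println(topVideosOfTopUploaders(users, videos, 2, 2));
    }
}
```

In the sample data the single most-viewed video belongs to an occasional uploader, so it never appears in the result; prolific uploaders and highly viewed videos need not overlap.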
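However, there seems to be something off in the raw data: none of the videos uploaded by the Top 10 uploaders rank among the 20 most-viewed videos.

3.6.8 Top 10 videos by view count in each category

Approach:

1) First obtain the table with categoryId exploded.

2) In a subquery, partition by categoryId, sort within each partition, and generate an increasing number; name this column rank.

3) From the temporary table produced by the subquery, select the rows whose rank is less than or equal to 10.

Final code:

select
    t1.*
from (
    select
        videoId,
        categoryId,
        views,
        row_number() over(partition by categoryId order by views desc) rank
    from
        youtube_category) t1
where
    rank <= 10;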
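
The row_number() over(partition by ... order by ...) pattern amounts to "sort inside each group, then keep the first N rows of every group". A minimal in-memory sketch of that idea follows; the class TopPerCategorySketch, its Row type, and the sample rows are hypothetical, and rows with equal views keep their input order here, whereas Hive gives no such guarantee.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustrative only: TopPerCategorySketch, Row, and the sample rows are hypothetical.
class TopPerCategorySketch {

    // Mirrors three columns of the exploded category table.
    static final class Row {
        final String videoId;
        final String categoryId;
        final int views;

        Row(String videoId, String categoryId, int views) {
            this.videoId = videoId;
            this.categoryId = categoryId;
            this.views = views;
        }
    }

    // Returns, per category, the ids of its n most-viewed videos in descending order of views.
    static Map<String, List<String>> topPerCategory(List<Row> rows, int n) {
        // partition by categoryId
        Map<String, List<Row>> partitions = rows.stream()
                .collect(Collectors.groupingBy(r -> r.categoryId));
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, List<Row>> e : partitions.entrySet()) {
            List<String> ids = e.getValue().stream()
                    .sorted(Comparator.comparingInt((Row r) -> r.views).reversed()) // order by views desc
                    .limit(n)                                                       // rank <= n
                    .map(r -> r.videoId)
                    .collect(Collectors.toList());
            result.put(e.getKey(), ids);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("a", "Music", 50),
                new Row("b", "Music", 90),
                new Row("c", "Music", 70),
                new Row("d", "Comedy", 10),
                new Row("e", "Comedy", 30));
        System.out.println(topPerCategory(rows, 2));
    }
}
```

Unlike a single global order by with limit, every category gets its own ranking, so small categories still appear in the result.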
---------------------
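Author: Zoin
Source: CSDN
Original: https://blog.csdn.net/a376712116/article/details/81073241
Copyright notice: this is an original article by the blogger; please include a link to the original post when reposting.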
