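[Hive] HiveQL in Practice: Analytic Functions and Window Functions
This article imports the MySQL sample database Sakila into Hive with Sqoop, then walks through Hive's analytic functions and window functions in detail.
I. Environment
1 Hive version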
hive> select version();
OK
2.3.3 r8a511e3f79b43d4be41cd231cf5c99e43b248383
Time taken: 0.944 seconds, Fetched: 1 row(s)
hive>
2 MySQL database
mysql> select database();
+------------+
| database() |
+------------+
| sakila |
+------------+
1 row in set (0.11 sec)
Note: this walkthrough uses the MySQL sample database Sakila. For installation, see the article "MySQL Sample Database Sakila: Download and Installation".
3 Data preparation
1) Method 1: MySQL -> HDFS -> Hive
-- Import the city table from Sakila into HDFS
[hadoop@strong ~]$ sqoop import --connect jdbc:mysql://strong.hadoop.com:3306/sakila --username root --password root --table city --columns "city_id,city,country_id,last_update" --warehouse-dir /user/sqoop1 -m 1
-- View the imported data
[hadoop@strong ~]$ hdfs dfs -cat /user/sqoop1/city/part-m-00000
-- Create the city table in Hive
hive> create table city(city_id int, city string,country_id int,last_update string)
> row format delimited
> fields terminated by ',';
OK
Time taken: 1.386 seconds
-- Load the city files from HDFS
hive> load data inpath '/user/sqoop1/city/' into table city;
Loading data to table hive.city
OK
Time taken: 4.764 seconds
-- View the data loaded into Hive
hive> select * from city limit 10;
OK
1    A Corua (La Corua)  87   2006-02-15 04:45:25.0
2    Abha                82   2006-02-15 04:45:25.0
3    Abu Dhabi           101  2006-02-15 04:45:25.0
4    Acua                60   2006-02-15 04:45:25.0
5    Adana               97   2006-02-15 04:45:25.0
6    Addis Abeba         31   2006-02-15 04:45:25.0
7    Aden                107  2006-02-15 04:45:25.0
8    Adoni               44   2006-02-15 04:45:25.0
9    Ahmadnagar          44   2006-02-15 04:45:25.0
10   Akishima            50   2006-02-15 04:45:25.0
Time taken: 10.489 seconds, Fetched: 10 row(s)
2) Method 2: MySQL -> Hive
-- Remove the HDFS directory
[hadoop@strong ~]$ hdfs dfs -rm -R /user/hadoop/city
-- Drop the city table
hive> drop table city;
OK
Time taken: 0.995 seconds
-- Import the city table from Sakila straight into the city table in Hive
[hadoop@strong ~]$ sqoop import --connect jdbc:mysql://strong.hadoop.com:3306/sakila --username root --password root --table city --columns "city_id,city,country_id,last_update" --hive-import --hive-database hive --create-hive-table --fields-terminated-by ',' -m 1
Note: the import may fail with the following error: Import failed: java.io.IOException: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf
Fix: copy the hive-common jar into Sqoop's lib directory:
[hadoop@strong ~]$ cp /usr/local/apache-hive-2.3.3-bin/lib/hive-common-2.3.3.jar /usr/local/sqoop-1.4.7/lib/
-- View the data imported into Hive
hive> select * from city limit 10;
OK
1    A Corua (La Corua)  87   2006-02-15 04:45:25.0
2    Abha                82   2006-02-15 04:45:25.0
3    Abu Dhabi           101  2006-02-15 04:45:25.0
4    Acua                60   2006-02-15 04:45:25.0
5    Adana               97   2006-02-15 04:45:25.0
6    Addis Abeba         31   2006-02-15 04:45:25.0
7    Aden                107  2006-02-15 04:45:25.0
8    Adoni               44   2006-02-15 04:45:25.0
9    Ahmadnagar          44   2006-02-15 04:45:25.0
10   Akishima            50   2006-02-15 04:45:25.0
Time taken: 0.544 seconds, Fetched: 10 row(s)
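-- Check the HDFS directory
[hadoop@strong ~]$ hdfs dfs -ls /user/hadoop/city
ls: `/user/hadoop/city': No such file or directory
Note: the Sqoop import log contains this line:
18/06/28 12:34:38 INFO hive.HiveImport: Export directory is contains the _SUCCESS file only, removing the directory.
It shows that no HDFS directory is left behind. During the import the HDFS files serve only as temporary staging and are deleted once the data has been loaded into Hive.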
II. Hive analytic functions in practice
1 Sample data
hive> select * from city where country_id in (101,102,108);
OK
city.city_id  city.city        city.country_id  city.last_update
3             Abu Dhabi        101              2006-02-15 04:45:25.0
12            al-Ayn           101              2006-02-15 04:45:25.0
88            Bradford         102              2006-02-15 04:45:25.0
149           Dundee           102              2006-02-15 04:45:25.0
280           Kragujevac       108              2006-02-15 04:45:25.0
312           London           102              2006-02-15 04:45:25.0
368           Novi Sad         108              2006-02-15 04:45:25.0
470           Sharja           101              2006-02-15 04:45:25.0
494           Southampton      102              2006-02-15 04:45:25.0
495           Southend-on-Sea  102              2006-02-15 04:45:25.0
496           Southport        102              2006-02-15 04:45:25.0
500           Stockport        102              2006-02-15 04:45:25.0
589           York             102              2006-02-15 04:45:25.0
Time taken: 0.359 seconds, Fetched: 13 row(s)
2 Row_Number function
hive> select t.*,row_number()over(partition by country_id) rnk from city t where country_id in(101,102,108);
t.city_id  t.city           t.country_id  t.last_update          rnk
3          Abu Dhabi        101           2006-02-15 04:45:25.0  1
470        Sharja           101           2006-02-15 04:45:25.0  2
12         al-Ayn           101           2006-02-15 04:45:25.0  3
589        York             102           2006-02-15 04:45:25.0  1
500        Stockport        102           2006-02-15 04:45:25.0  2
496        Southport        102           2006-02-15 04:45:25.0  3
495        Southend-on-Sea  102           2006-02-15 04:45:25.0  4
494        Southampton      102           2006-02-15 04:45:25.0  5
312        London           102           2006-02-15 04:45:25.0  6
149        Dundee           102           2006-02-15 04:45:25.0  7
88         Bradford         102           2006-02-15 04:45:25.0  8
280        Kragujevac       108           2006-02-15 04:45:25.0  1
368        Novi Sad         108           2006-02-15 04:45:25.0  2
Time taken: 63.74 seconds, Fetched: 13 row(s)
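The query above has no order by inside over(), so the numbering within each country follows whatever order the rows happen to arrive in, and it is not guaranteed to be the same on every run. The following is a minimal sketch against the same city table, not part of the original run: the first query makes the numbering deterministic, and the second uses it for a common "top N per group" pattern. The aliases rn and ranked are illustrative names only.

```sql
-- Deterministic numbering: order rows by city_id within each country
select t.*,
       row_number() over (partition by country_id order by city_id) as rn
from city t
where country_id in (101, 102, 108);

-- Top N per group: keep the two smallest city_id values of each country
select city_id, city, country_id
from (
  select city_id, city, country_id,
         row_number() over (partition by country_id order by city_id) as rn
  from city
  where country_id in (101, 102, 108)
) ranked
where rn <= 2;
```

With the sample data, the second query should keep city_id 3 and 12 for country 101, 88 and 149 for country 102, and 280 and 368 for country 108.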
3 Rank function
-- With partition by
hive> select t.*,rank()over(partition by country_id order by city_id) rnk from city t where country_id in(101,102,108);
t.city_id  t.city           t.country_id  t.last_update          rnk
3          Abu Dhabi        101           2006-02-15 04:45:25.0  1
12         al-Ayn           101           2006-02-15 04:45:25.0  2
470        Sharja           101           2006-02-15 04:45:25.0  3
88         Bradford         102           2006-02-15 04:45:25.0  1
149        Dundee           102           2006-02-15 04:45:25.0  2
312        London           102           2006-02-15 04:45:25.0  3
494        Southampton      102           2006-02-15 04:45:25.0  4
495        Southend-on-Sea  102           2006-02-15 04:45:25.0  5
496        Southport        102           2006-02-15 04:45:25.0  6
500        Stockport        102           2006-02-15 04:45:25.0  7
589        York             102           2006-02-15 04:45:25.0  8
280        Kragujevac       108           2006-02-15 04:45:25.0  1
368        Novi Sad         108           2006-02-15 04:45:25.0  2
Time taken: 57.047 seconds, Fetched: 13 row(s)
-- Without partition by
hive> select t.*,rank()over(order by city_id) rnk from city t where country_id in(101,102,108);
t.city_id  t.city           t.country_id  t.last_update          rnk
3          Abu Dhabi        101           2006-02-15 04:45:25.0  1
12         al-Ayn           101           2006-02-15 04:45:25.0  2
88         Bradford         102           2006-02-15 04:45:25.0  3
149        Dundee           102           2006-02-15 04:45:25.0  4
280        Kragujevac       108           2006-02-15 04:45:25.0  5
312        London           102           2006-02-15 04:45:25.0  6
368        Novi Sad         108           2006-02-15 04:45:25.0  7
470        Sharja           101           2006-02-15 04:45:25.0  8
494        Southampton      102           2006-02-15 04:45:25.0  9
495        Southend-on-Sea  102           2006-02-15 04:45:25.0  10
496        Southport        102           2006-02-15 04:45:25.0  11
500        Stockport        102           2006-02-15 04:45:25.0  12
589        York             102           2006-02-15 04:45:25.0  13
Time taken: 73.697 seconds, Fetched: 13 row(s)
-- Rank leaves gaps in the numbering when rows tie, for example:
hive> select t.*,rank()over(order by country_id) rnk from city t where country_id in(101,102,108);
t.city_id  t.city           t.country_id  t.last_update          rnk
3          Abu Dhabi        101           2006-02-15 04:45:25.0  1
470        Sharja           101           2006-02-15 04:45:25.0  1
12         al-Ayn           101           2006-02-15 04:45:25.0  1
589        York             102           2006-02-15 04:45:25.0  4
500        Stockport        102           2006-02-15 04:45:25.0  4
496        Southport        102           2006-02-15 04:45:25.0  4
495        Southend-on-Sea  102           2006-02-15 04:45:25.0  4
494        Southampton      102           2006-02-15 04:45:25.0  4
312        London           102           2006-02-15 04:45:25.0  4
149        Dundee           102           2006-02-15 04:45:25.0  4
88         Bradford         102           2006-02-15 04:45:25.0  4
280        Kragujevac       108           2006-02-15 04:45:25.0  12
368        Novi Sad         108           2006-02-15 04:45:25.0  12
Time taken: 57.533 seconds, Fetched: 13 row(s)
Note: rank() skips numbers after ties. The three rows tied at rank 1 are followed by rank 4, and the eight rows tied at rank 4 are followed by rank 12.
4 Dense_rank function
hive> select t.*,dense_rank()over(order by country_id) rnk from city t where country_id in(101,102,108);
t.city_id  t.city           t.country_id  t.last_update          rnk
3          Abu Dhabi        101           2006-02-15 04:45:25.0  1
470        Sharja           101           2006-02-15 04:45:25.0  1
12         al-Ayn           101           2006-02-15 04:45:25.0  1
589        York             102           2006-02-15 04:45:25.0  2
500        Stockport        102           2006-02-15 04:45:25.0  2
496        Southport        102           2006-02-15 04:45:25.0  2
495        Southend-on-Sea  102           2006-02-15 04:45:25.0  2
494        Southampton      102           2006-02-15 04:45:25.0  2
312        London           102           2006-02-15 04:45:25.0  2
149        Dundee           102           2006-02-15 04:45:25.0  2
88         Bradford         102           2006-02-15 04:45:25.0  2
280        Kragujevac       108           2006-02-15 04:45:25.0  3
368        Novi Sad         108           2006-02-15 04:45:25.0  3
Time taken: 60.528 seconds, Fetched: 13 row(s)
Note: dense_rank() does not skip numbers. Tied rows share a rank and the next distinct value gets the next consecutive rank.
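The difference between the three ranking functions is easiest to see when they run over the same ordering in one query. This is a small sketch against the same city table rather than output from the original session; the aliases rn, rnk and dense_rnk are illustrative.

```sql
-- Compare the three ranking functions over the same ordering.
-- row_number(): always 1..N, ties are broken arbitrarily
-- rank():       ties share a rank, the following rank is skipped ahead
-- dense_rank(): ties share a rank, the following rank is consecutive
select city_id, country_id,
       row_number() over (order by country_id) as rn,
       rank()       over (order by country_id) as rnk,
       dense_rank() over (order by country_id) as dense_rnk
from city
where country_id in (101, 102, 108);
```

For the eight rows of country 102, rn should run from 4 to 11 while rnk stays at 4 and dense_rnk at 2. Which row receives which rn inside a tie is not defined, because the ordering column is identical for all of them.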
5 Lead function
hive> select t.*, lead(city_id)over(partition by country_id order by city_id) next_city from city t where country_id in(101,102,108);
t.city_id  t.city           t.country_id  t.last_update          next_city
3          Abu Dhabi        101           2006-02-15 04:45:25.0  12
12         al-Ayn           101           2006-02-15 04:45:25.0  470
470        Sharja           101           2006-02-15 04:45:25.0  NULL
88         Bradford         102           2006-02-15 04:45:25.0  149
149        Dundee           102           2006-02-15 04:45:25.0  312
312        London           102           2006-02-15 04:45:25.0  494
494        Southampton      102           2006-02-15 04:45:25.0  495
495        Southend-on-Sea  102           2006-02-15 04:45:25.0  496
496        Southport        102           2006-02-15 04:45:25.0  500
500        Stockport        102           2006-02-15 04:45:25.0  589
589        York             102           2006-02-15 04:45:25.0  NULL
280        Kragujevac       108           2006-02-15 04:45:25.0  368
368        Novi Sad         108           2006-02-15 04:45:25.0  NULL
Time taken: 59.491 seconds, Fetched: 13 row(s)
Note: when lead(city_id) reaches past the last row of the partition, it returns NULL.
6 Lag function
hive> select t.*, lag(city_id)over(partition by country_id order by city_id) prior_city from city t where country_id in(101,102,108);
t.city_id  t.city           t.country_id  t.last_update          prior_city
3          Abu Dhabi        101           2006-02-15 04:45:25.0  NULL
12         al-Ayn           101           2006-02-15 04:45:25.0  3
470        Sharja           101           2006-02-15 04:45:25.0  12
88         Bradford         102           2006-02-15 04:45:25.0  NULL
149        Dundee           102           2006-02-15 04:45:25.0  88
312        London           102           2006-02-15 04:45:25.0  149
494        Southampton      102           2006-02-15 04:45:25.0  312
495        Southend-on-Sea  102           2006-02-15 04:45:25.0  494
496        Southport        102           2006-02-15 04:45:25.0  495
500        Stockport        102           2006-02-15 04:45:25.0  496
589        York             102           2006-02-15 04:45:25.0  500
280        Kragujevac       108           2006-02-15 04:45:25.0  NULL
368        Novi Sad         108           2006-02-15 04:45:25.0  280
Time taken: 56.631 seconds, Fetched: 13 row(s)
-- Specify the offset (the default offset is 1)
hive> select t.*, lag(city_id,2,'No Data')over(partition by country_id order by city_id) prior_city from city t where country_id in(101,102,108);
t.city_id  t.city           t.country_id  t.last_update          prior_city
3          Abu Dhabi        101           2006-02-15 04:45:25.0  NULL
12         al-Ayn           101           2006-02-15 04:45:25.0  NULL
470        Sharja           101           2006-02-15 04:45:25.0  3
88         Bradford         102           2006-02-15 04:45:25.0  NULL
149        Dundee           102           2006-02-15 04:45:25.0  NULL
312        London           102           2006-02-15 04:45:25.0  88
494        Southampton      102           2006-02-15 04:45:25.0  149
495        Southend-on-Sea  102           2006-02-15 04:45:25.0  312
496        Southport        102           2006-02-15 04:45:25.0  494
500        Stockport        102           2006-02-15 04:45:25.0  495
589        York             102           2006-02-15 04:45:25.0  496
280        Kragujevac       108           2006-02-15 04:45:25.0  NULL
368        Novi Sad         108           2006-02-15 04:45:25.0  NULL
Time taken: 66.487 seconds, Fetched: 13 row(s)
Note: the default value 'No Data' never appears in the output. Rows without a row two positions earlier show NULL, most likely because the string default cannot be converted to the int type of city_id.
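One way to make the third argument of lag take effect is to give it the same type as the first argument. The sketch below shows two options against the same city table. It assumes the NULLs in the run above come from the type mismatch between the int column and the string default, which the original session does not confirm; the alias prior2_city and the sentinel -1 are illustrative choices.

```sql
-- Option 1: keep the int type and use an int sentinel as the default
select city_id, country_id,
       lag(city_id, 2, -1) over (partition by country_id order by city_id) as prior2_city
from city
where country_id in (101, 102, 108);

-- Option 2: cast the column to string so a text default has a matching type
select city_id, country_id,
       lag(cast(city_id as string), 2, 'No Data')
         over (partition by country_id order by city_id) as prior2_city
from city
where country_id in (101, 102, 108);
```

The same three-argument form (value, offset, default) is available for lead.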
-- Return the value before and the value after the current row
hive> select t.*, lag(city_id)over(partition by country_id order by city_id) prior_city,lead(city_id)over(partition by country_id order by city_id) next_city from city t where country_id in(101,102,108);
t.city_id  t.city           t.country_id  t.last_update          prior_city  next_city
3          Abu Dhabi        101           2006-02-15 04:45:25.0  NULL        12
12         al-Ayn           101           2006-02-15 04:45:25.0  3           470
470        Sharja           101           2006-02-15 04:45:25.0  12          NULL
88         Bradford         102           2006-02-15 04:45:25.0  NULL        149
149        Dundee           102           2006-02-15 04:45:25.0  88          312
312        London           102           2006-02-15 04:45:25.0  149         494
494        Southampton      102           2006-02-15 04:45:25.0  312         495
495        Southend-on-Sea  102           2006-02-15 04:45:25.0  494         496
496        Southport        102           2006-02-15 04:45:25.0  495         500
500        Stockport        102           2006-02-15 04:45:25.0  496         589
589        York             102           2006-02-15 04:45:25.0  500         NULL
280        Kragujevac       108           2006-02-15 04:45:25.0  NULL        368
368        Novi Sad         108           2006-02-15 04:45:25.0  280         NULL
Time taken: 52.446 seconds, Fetched: 13 row(s)
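A typical use of lag and lead together is to compute the distance between neighbouring rows. The following sketch, which was not part of the original run, subtracts the neighbouring city_id values within each country; gap_from_prior and gap_to_next are illustrative aliases.

```sql
-- Difference between each city_id and its neighbours within the same country.
-- Arithmetic with NULL yields NULL, so the first row of each partition has no
-- gap_from_prior and the last row has no gap_to_next.
select city_id, country_id,
       city_id - lag(city_id) over (partition by country_id order by city_id)  as gap_from_prior,
       lead(city_id) over (partition by country_id order by city_id) - city_id as gap_to_next
from city
where country_id in (101, 102, 108);
```

For country 101 (city_id 3, 12, 470) this should give the pairs (NULL, 9), (9, 458) and (458, NULL).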
7 First_Value function
hive> select t.*, first_value(city_id)over(partition by country_id order by city_id) first_city from city t where country_id in(101,102,108);
t.city_id  t.city           t.country_id  t.last_update          first_city
3          Abu Dhabi        101           2006-02-15 04:45:25.0  3
12         al-Ayn           101           2006-02-15 04:45:25.0  3
470        Sharja           101           2006-02-15 04:45:25.0  3
88         Bradford         102           2006-02-15 04:45:25.0  88
149        Dundee           102           2006-02-15 04:45:25.0  88
312        London           102           2006-02-15 04:45:25.0  88
494        Southampton      102           2006-02-15 04:45:25.0  88
495        Southend-on-Sea  102           2006-02-15 04:45:25.0  88
496        Southport        102           2006-02-15 04:45:25.0  88
500        Stockport        102           2006-02-15 04:45:25.0  88
589        York             102           2006-02-15 04:45:25.0  88
280        Kragujevac       108           2006-02-15 04:45:25.0  280
368        Novi Sad         108           2006-02-15 04:45:25.0  280
Time taken: 78.285 seconds, Fetched: 13 row(s)
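8 Last_value function
hive> select t.*, last_value(city_id)over(partition by country_id order by city_id rows between unbounded preceding and unbounded following) last_city from city t where country_id in(101,102,108);
t.city_id  t.city           t.country_id  t.last_update          last_city
3          Abu Dhabi        101           2006-02-15 04:45:25.0  470
12         al-Ayn           101           2006-02-15 04:45:25.0  470
470        Sharja           101           2006-02-15 04:45:25.0  470
88         Bradford         102           2006-02-15 04:45:25.0  589
149        Dundee           102           2006-02-15 04:45:25.0  589
312        London           102           2006-02-15 04:45:25.0  589
494        Southampton      102           2006-02-15 04:45:25.0  589
495        Southend-on-Sea  102           2006-02-15 04:45:25.0  589
496        Southport        102           2006-02-15 04:45:25.0  589
500        Stockport        102           2006-02-15 04:45:25.0  589
589        York             102           2006-02-15 04:45:25.0  589
280        Kragujevac       108           2006-02-15 04:45:25.0  368
368        Novi Sad         108           2006-02-15 04:45:25.0  368
Time taken: 49.807 seconds, Fetched: 13 row(s)
Note: when order by is given without a window clause, the window defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. Under that default, last_value only sees the rows up to the current row and its ties, so here it would return each row's own city_id rather than the last value of the partition. Add rows between unbounded preceding and unbounded following to get the true last value. first_value is not affected in the query of the previous section, because the default window always starts at the first row of the partition, but writing the window explicitly for both functions avoids surprises.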
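The effect of the default window on last_value can be checked by putting both forms side by side. This is a sketch against the same city table, not output from the original session; last_default_frame and last_full_frame are illustrative aliases.

```sql
-- last_default_frame: default window (range between unbounded preceding and
--                     current row), so the "last" row is the current row
-- last_full_frame:    explicit window covering the whole partition
select city_id, country_id,
       last_value(city_id) over (partition by country_id order by city_id) as last_default_frame,
       last_value(city_id) over (partition by country_id order by city_id
                                 rows between unbounded preceding and unbounded following) as last_full_frame
from city
where country_id in (101, 102, 108);
```

Because city_id is unique within each country, last_default_frame should simply repeat city_id on every row, while last_full_frame should show 470, 589 and 368 for countries 101, 102 and 108.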
9 Count function
hive> select country_id,count(1) cqty from city where country_id in(101,102,108) group by country_id;
country_id  cqty
101         3
102         8
108         2
Time taken: 56.987 seconds, Fetched: 3 row(s)
10 Sum function
hive> select country_id,sum(city_id) csum from city where country_id in(101,102,108) group by country_id;
country_id  csum
101         485
102         3123
108         648
Time taken: 57.993 seconds, Fetched: 3 row(s)
11 Avg function
hive> select country_id,avg(city_id) cavg from city where country_id in(101,102,108) group by country_id;
country_id  cavg
101         161.66666666666666
102         390.375
108         324.0
Time taken: 54.42 seconds, Fetched: 3 row(s)
12 Max function
hive> select country_id,max(city_id) cmax from city where country_id in(101,102,108) group by country_id;
country_id  cmax
101         470
102         589
108         368
Time taken: 58.02 seconds, Fetched: 3 row(s)
13 Min function
hive> select country_id,min(city_id) cmin from city where country_id in(101,102,108) group by country_id;
country_id  cmin
101         3
102         88
108         280
Time taken: 57.602 seconds, Fetched: 3 row(s)
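Note: each of the five aggregate functions above (count, sum, avg, max, min) can also be used as a window function by appending an over clause, optionally with a window specification.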
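As a sketch of the aggregate functions used with over instead of group by (same city table, illustrative aliases, not part of the original run): every detail row is kept, and the per-country aggregate is repeated on each row.

```sql
-- Aggregates as window functions: all 13 detail rows are returned, each one
-- carrying the aggregate values of its own country.
select city_id, city, country_id,
       count(1)     over (partition by country_id) as cqty,
       sum(city_id) over (partition by country_id) as csum,
       avg(city_id) over (partition by country_id) as cavg,
       max(city_id) over (partition by country_id) as cmax,
       min(city_id) over (partition by country_id) as cmin
from city
where country_id in (101, 102, 108);
```

Each of the three rows for country 101 should carry the same values as the grouped queries above: cqty 3, csum 485, cmax 470 and cmin 3.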
A window specification takes one of the following forms:
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
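Two common uses of these window forms are a running total and a moving average. The query below is a minimal sketch against the same city table; the aliases running_sum and moving_avg are illustrative, and the query was not part of the original session.

```sql
-- running_sum: rows from the start of the partition up to the current row
-- moving_avg:  the previous row, the current row and the next row
select city_id, country_id,
       sum(city_id) over (partition by country_id order by city_id
                          rows between unbounded preceding and current row) as running_sum,
       avg(city_id) over (partition by country_id order by city_id
                          rows between 1 preceding and 1 following) as moving_avg
from city
where country_id in (101, 102, 108);
```

For country 101 (city_id 3, 12, 470) the running sum should be 3, 15 and 485. The moving average should be 7.5 for the first row and 241.0 for the last, since each of those rows has only one neighbour inside the window, and about 161.67 for the middle row.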