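This article shows how to import Sakila, the MySQL sample database, into Hive with Sqoop, and then walks through Hive's analytic functions and window functions in detail.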

I. Environment

1 Hive version

hive> select version();

OK

2.3.3 r8a511e3f79b43d4be41cd231cf5c99e43b248383

Time taken: 0.944 seconds, Fetched: 1 row(s)

hive>

2 MySQL database

mysql> select database();

+------------+

| database() |

+------------+

| sakila |

+------------+

1 row in set (0.11 sec)

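Note: this walkthrough uses the MySQL sample database Sakila. For installation details, see the article "MySQL之示例数据库Sakila下载及安装" (downloading and installing the Sakila sample database for MySQL).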

3 Data preparation

1) Method 1: MySQL -> HDFS -> Hive

-- Import the city table from Sakila into HDFS

[hadoop@strong ~]$ sqoop import --connect jdbc:mysql://strong.hadoop.com:3306/sakila --username root --password root --table city --columns "city_id,city,country_id,last_update" --warehouse-dir /user/sqoop1 -m 1

-- View the imported data

[hadoop@strong ~]$ hdfs dfs -cat /user/sqoop1/city/part-m-00000

-- Create the city table in Hive

hive> create table city(city_id int, city string,country_id int,last_update string)

> row format delimited

> fields terminated by ',';

OK

Time taken: 1.386 seconds
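The table definition above tells Hive to split each text line on commas into four columns. The following is a minimal Python sketch of that mapping, not anything Hive or Sqoop runs. The sample line is hypothetical: it assumes Sqoop wrote comma-separated text, and the helper name parse_city_line is made up for illustration.

```python
# Hypothetical sample line, assuming Sqoop wrote comma-separated text
# (the layout implied by "fields terminated by ','" in the table definition).
sample_line = "3,Abu Dhabi,101,2006-02-15 04:45:25.0\n"


def parse_city_line(line):
    """Split one text line into the four columns declared for the city table."""
    city_id, city, country_id, last_update = line.rstrip("\n").split(",")
    # city_id and country_id are declared as int, the other two as string.
    return int(city_id), city, int(country_id), last_update


parsed = parse_city_line(sample_line)
print(parsed)
```

A city name that itself contained a comma would break this naive split, which is worth keeping in mind when choosing the field delimiter.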

-- Load the city files from HDFS

hive> load data inpath '/user/sqoop1/city/' into table city;

Loading data to table hive.city

OK

Time taken: 4.764 seconds

-- View the data loaded into Hive

hive> select *from city limit 10;

OK

1   A Corua (La Corua)  87   2006-02-15 04:45:25.0
2   Abha                82   2006-02-15 04:45:25.0
3   Abu Dhabi           101  2006-02-15 04:45:25.0
4   Acua                60   2006-02-15 04:45:25.0
5   Adana               97   2006-02-15 04:45:25.0
6   Addis Abeba         31   2006-02-15 04:45:25.0
7   Aden                107  2006-02-15 04:45:25.0
8   Adoni               44   2006-02-15 04:45:25.0
9   Ahmadnagar          44   2006-02-15 04:45:25.0
10  Akishima            50   2006-02-15 04:45:25.0

Time taken: 10.489 seconds, Fetched: 10 row(s)

2) Method 2: MySQL -> Hive

-- Delete the HDFS files

[hadoop@strong ~]$ hdfs dfs -rm -R /user/hadoop/city

-- Drop the city table

hive> drop table city;

OK

Time taken: 0.995 seconds

-- Import the city table from Sakila directly into the city table in Hive

[hadoop@strong ~]$ sqoop import --connect jdbc:mysql://strong.hadoop.com:3306/sakila --username root --password root --table city --columns "city_id,city,country_id,last_update" --hive-import --hive-database hive --create-hive-table --fields-terminated-by ',' -m 1

Note: the following error may occur: Import failed: java.io.IOException: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf

Fix: copy the hive-common jar into Sqoop's lib directory:

[hadoop@strong ~]$ cp /usr/local/apache-hive-2.3.3-bin/lib/hive-common-2.3.3.jar /usr/local/sqoop-1.4.7/lib/

-- View the data imported into Hive

hive> select *from city limit 10;

OK

1   A Corua (La Corua)  87   2006-02-15 04:45:25.0
2   Abha                82   2006-02-15 04:45:25.0
3   Abu Dhabi           101  2006-02-15 04:45:25.0
4   Acua                60   2006-02-15 04:45:25.0
5   Adana               97   2006-02-15 04:45:25.0
6   Addis Abeba         31   2006-02-15 04:45:25.0
7   Aden                107  2006-02-15 04:45:25.0
8   Adoni               44   2006-02-15 04:45:25.0
9   Ahmadnagar          44   2006-02-15 04:45:25.0
10  Akishima            50   2006-02-15 04:45:25.0

Time taken: 0.544 seconds, Fetched: 10 row(s)

-- Check the HDFS files

[hadoop@strong ~]$ hdfs dfs -ls /user/hadoop/city

ls: `/user/hadoop/city': No such file or directory

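Note: the Sqoop import log contains the line "18/06/28 12:34:38 INFO hive.HiveImport: Export directory is contains the _SUCCESS file only, removing the directory." This shows that no HDFS files are left behind: during the import the HDFS directory is only a temporary staging area, and it is removed once the data has been loaded into Hive.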

II. Hive analytic functions in practice

1 Sample data

hive> select *from city where country_id in (101,102,108);

OK

city.city_id  city.city        city.country_id  city.last_update
3             Abu Dhabi        101              2006-02-15 04:45:25.0
12            al-Ayn           101              2006-02-15 04:45:25.0
88            Bradford         102              2006-02-15 04:45:25.0
149           Dundee           102              2006-02-15 04:45:25.0
280           Kragujevac       108              2006-02-15 04:45:25.0
312           London           102              2006-02-15 04:45:25.0
368           Novi Sad         108              2006-02-15 04:45:25.0
470           Sharja           101              2006-02-15 04:45:25.0
494           Southampton      102              2006-02-15 04:45:25.0
495           Southend-on-Sea  102              2006-02-15 04:45:25.0
496           Southport        102              2006-02-15 04:45:25.0
500           Stockport        102              2006-02-15 04:45:25.0
589           York             102              2006-02-15 04:45:25.0

Time taken: 0.359 seconds, Fetched: 13 row(s)

2 Row_Number function

hive> select t.*,row_number()over(partition by country_id) rnk from city t where country_id in(101,102,108);

t.city_id  t.city           t.country_id  t.last_update          rnk
3          Abu Dhabi        101           2006-02-15 04:45:25.0  1
470        Sharja           101           2006-02-15 04:45:25.0  2
12         al-Ayn           101           2006-02-15 04:45:25.0  3
589        York             102           2006-02-15 04:45:25.0  1
500        Stockport        102           2006-02-15 04:45:25.0  2
496        Southport        102           2006-02-15 04:45:25.0  3
495        Southend-on-Sea  102           2006-02-15 04:45:25.0  4
494        Southampton      102           2006-02-15 04:45:25.0  5
312        London           102           2006-02-15 04:45:25.0  6
149        Dundee           102           2006-02-15 04:45:25.0  7
88         Bradford         102           2006-02-15 04:45:25.0  8
280        Kragujevac       108           2006-02-15 04:45:25.0  1
368        Novi Sad         108           2006-02-15 04:45:25.0  2

Time taken: 63.74 seconds, Fetched: 13 row(s)
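The query above has no ORDER BY inside OVER, so the order in which rows are numbered within each partition is whatever the engine happens to produce. A minimal sketch of a deterministic variant follows. It uses SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or newer) purely as a local stand-in for Hive, with a subset of the sample rows; the alias rn and the trimmed table layout are assumptions made for the example.

```python
import sqlite3

# Subset of the sample rows, inserted out of order on purpose: (city_id, city, country_id).
rows = [(470, "Sharja", 101), (3, "Abu Dhabi", 101), (12, "al-Ayn", 101),
        (368, "Novi Sad", 108), (280, "Kragujevac", 108)]

conn = sqlite3.connect(":memory:")
conn.execute("create table city (city_id integer, city text, country_id integer)")
conn.executemany("insert into city values (?, ?, ?)", rows)

# ORDER BY inside OVER fixes the numbering order within each country.
numbered = conn.execute(
    """
    select city_id, country_id,
           row_number() over (partition by country_id order by city_id) as rn
    from city
    order by country_id, city_id
    """
).fetchall()
print(numbered)
```

With the ordering in place, the lowest city_id in each country always receives number 1, regardless of how the rows are stored.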

3 Rank function

-- With partition by

hive> select t.*,rank()over(partition by country_id order by city_id) rnk from city t where country_id in(101,102,108);

t.city_id  t.city           t.country_id  t.last_update          rnk
3          Abu Dhabi        101           2006-02-15 04:45:25.0  1
12         al-Ayn           101           2006-02-15 04:45:25.0  2
470        Sharja           101           2006-02-15 04:45:25.0  3
88         Bradford         102           2006-02-15 04:45:25.0  1
149        Dundee           102           2006-02-15 04:45:25.0  2
312        London           102           2006-02-15 04:45:25.0  3
494        Southampton      102           2006-02-15 04:45:25.0  4
495        Southend-on-Sea  102           2006-02-15 04:45:25.0  5
496        Southport        102           2006-02-15 04:45:25.0  6
500        Stockport        102           2006-02-15 04:45:25.0  7
589        York             102           2006-02-15 04:45:25.0  8
280        Kragujevac       108           2006-02-15 04:45:25.0  1
368        Novi Sad         108           2006-02-15 04:45:25.0  2

Time taken: 57.047 seconds, Fetched: 13 row(s)

-- Without partition by

hive> select t.*,rank()over(order by city_id) rnk from city t where country_id in(101,102,108);

t.city_id  t.city           t.country_id  t.last_update          rnk
3          Abu Dhabi        101           2006-02-15 04:45:25.0  1
12         al-Ayn           101           2006-02-15 04:45:25.0  2
88         Bradford         102           2006-02-15 04:45:25.0  3
149        Dundee           102           2006-02-15 04:45:25.0  4
280        Kragujevac       108           2006-02-15 04:45:25.0  5
312        London           102           2006-02-15 04:45:25.0  6
368        Novi Sad         108           2006-02-15 04:45:25.0  7
470        Sharja           101           2006-02-15 04:45:25.0  8
494        Southampton      102           2006-02-15 04:45:25.0  9
495        Southend-on-Sea  102           2006-02-15 04:45:25.0  10
496        Southport        102           2006-02-15 04:45:25.0  11
500        Stockport        102           2006-02-15 04:45:25.0  12
589        York             102           2006-02-15 04:45:25.0  13

Time taken: 73.697 seconds, Fetched: 13 row(s)

-- Rank leaves gaps in the numbering when rows tie on the ordering column, for example:

hive> select t.*,rank()over(order by country_id) rnk from city t where country_id in(101,102,108);

t.city_id  t.city           t.country_id  t.last_update          rnk
3          Abu Dhabi        101           2006-02-15 04:45:25.0  1
470        Sharja           101           2006-02-15 04:45:25.0  1
12         al-Ayn           101           2006-02-15 04:45:25.0  1
589        York             102           2006-02-15 04:45:25.0  4
500        Stockport        102           2006-02-15 04:45:25.0  4
496        Southport        102           2006-02-15 04:45:25.0  4
495        Southend-on-Sea  102           2006-02-15 04:45:25.0  4
494        Southampton      102           2006-02-15 04:45:25.0  4
312        London           102           2006-02-15 04:45:25.0  4
149        Dundee           102           2006-02-15 04:45:25.0  4
88         Bradford         102           2006-02-15 04:45:25.0  4
280        Kragujevac       108           2006-02-15 04:45:25.0  12
368        Novi Sad         108           2006-02-15 04:45:25.0  12

Time taken: 57.533 seconds, Fetched: 13 row(s)

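Note: Rank gives tied rows the same rank and then skips the numbers those ties occupy, so the ranks above jump from 1 to 4 to 12.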

4 Dense_rank function

hive> select t.*,dense_rank()over(order by country_id) rnk from city t where country_id in(101,102,108);

t.city_id  t.city           t.country_id  t.last_update          rnk
3          Abu Dhabi        101           2006-02-15 04:45:25.0  1
470        Sharja           101           2006-02-15 04:45:25.0  1
12         al-Ayn           101           2006-02-15 04:45:25.0  1
589        York             102           2006-02-15 04:45:25.0  2
500        Stockport        102           2006-02-15 04:45:25.0  2
496        Southport        102           2006-02-15 04:45:25.0  2
495        Southend-on-Sea  102           2006-02-15 04:45:25.0  2
494        Southampton      102           2006-02-15 04:45:25.0  2
312        London           102           2006-02-15 04:45:25.0  2
149        Dundee           102           2006-02-15 04:45:25.0  2
88         Bradford         102           2006-02-15 04:45:25.0  2
280        Kragujevac       108           2006-02-15 04:45:25.0  3
368        Novi Sad         108           2006-02-15 04:45:25.0  3

Time taken: 60.528 seconds, Fetched: 13 row(s)

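Note: Dense_rank also gives tied rows the same rank, but the next distinct value gets the next consecutive number, so no gaps appear.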
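To see rank and dense_rank side by side, here is a small sketch that reproduces the tie pattern of the sample data (three rows for country 101, eight for 102, two for 108). SQLite via Python's sqlite3 module (SQLite 3.25 or newer) stands in for Hive; the one-column table and the aliases rnk and dense_rnk are assumptions made for the example.

```python
import sqlite3

# country_id values only, with the same tie pattern as the sample data:
# three rows for 101, eight for 102, two for 108.
country_ids = [101] * 3 + [102] * 8 + [108] * 2

conn = sqlite3.connect(":memory:")
conn.execute("create table city (country_id integer)")
conn.executemany("insert into city values (?)", [(c,) for c in country_ids])

# Compute both rankings per row, then collapse to one line per country.
comparison = conn.execute(
    """
    select distinct country_id, rnk, dense_rnk
    from (
        select country_id,
               rank()       over (order by country_id) as rnk,
               dense_rank() over (order by country_id) as dense_rnk
        from city
    )
    order by country_id
    """
).fetchall()
print(comparison)
```

The rank of a value is one plus the number of rows that sort before it (0, 3 and 11 rows here), while dense_rank is one plus the number of distinct values that sort before it.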

5 Lead function

hive> select t.*, lead(city_id)over(partition by country_id order by city_id) next_city from city t where country_id in(101,102,108);

t.city_id  t.city           t.country_id  t.last_update          next_city
3          Abu Dhabi        101           2006-02-15 04:45:25.0  12
12         al-Ayn           101           2006-02-15 04:45:25.0  470
470        Sharja           101           2006-02-15 04:45:25.0  NULL
88         Bradford         102           2006-02-15 04:45:25.0  149
149        Dundee           102           2006-02-15 04:45:25.0  312
312        London           102           2006-02-15 04:45:25.0  494
494        Southampton      102           2006-02-15 04:45:25.0  495
495        Southend-on-Sea  102           2006-02-15 04:45:25.0  496
496        Southport        102           2006-02-15 04:45:25.0  500
500        Stockport        102           2006-02-15 04:45:25.0  589
589        York             102           2006-02-15 04:45:25.0  NULL
280        Kragujevac       108           2006-02-15 04:45:25.0  368
368        Novi Sad         108           2006-02-15 04:45:25.0  NULL

Time taken: 59.491 seconds, Fetched: 13 row(s)

Note: when lead(city_id) looks past the last row of its partition, it returns NULL.

6 Lag function

hive> select t.*, lag(city_id)over(partition by country_id order by city_id) prior_city from city t where country_id in(101,102,108);

t.city_id  t.city           t.country_id  t.last_update          prior_city
3          Abu Dhabi        101           2006-02-15 04:45:25.0  NULL
12         al-Ayn           101           2006-02-15 04:45:25.0  3
470        Sharja           101           2006-02-15 04:45:25.0  12
88         Bradford         102           2006-02-15 04:45:25.0  NULL
149        Dundee           102           2006-02-15 04:45:25.0  88
312        London           102           2006-02-15 04:45:25.0  149
494        Southampton      102           2006-02-15 04:45:25.0  312
495        Southend-on-Sea  102           2006-02-15 04:45:25.0  494
496        Southport        102           2006-02-15 04:45:25.0  495
500        Stockport        102           2006-02-15 04:45:25.0  496
589        York             102           2006-02-15 04:45:25.0  500
280        Kragujevac       108           2006-02-15 04:45:25.0  NULL
368        Novi Sad         108           2006-02-15 04:45:25.0  NULL

Time taken: 56.631 seconds, Fetched: 13 row(s)

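-- Specify the offset (the default offset is 1) and a default value.
-- In the output below the default 'No Data' never appears: rows with no value two positions back still show NULL, presumably because the string default does not match the int type of city_id. The alias next_city is carried over from the lead example; the values are the city_id two rows earlier.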

hive> select t.*, lag(city_id,2,'No Data')over(partition by country_id order by city_id) next_city from city t where country_id in(101,102,108);

t.city_id  t.city           t.country_id  t.last_update          next_city
3          Abu Dhabi        101           2006-02-15 04:45:25.0  NULL
12         al-Ayn           101           2006-02-15 04:45:25.0  NULL
470        Sharja           101           2006-02-15 04:45:25.0  3
88         Bradford         102           2006-02-15 04:45:25.0  NULL
149        Dundee           102           2006-02-15 04:45:25.0  NULL
312        London           102           2006-02-15 04:45:25.0  88
494        Southampton      102           2006-02-15 04:45:25.0  149
495        Southend-on-Sea  102           2006-02-15 04:45:25.0  312
496        Southport        102           2006-02-15 04:45:25.0  494
500        Stockport        102           2006-02-15 04:45:25.0  495
589        York             102           2006-02-15 04:45:25.0  496
280        Kragujevac       108           2006-02-15 04:45:25.0  NULL
368        Novi Sad         108           2006-02-15 04:45:25.0  NULL

Time taken: 66.487 seconds, Fetched: 13 row(s)
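In the output above, the rows without a value two positions back show NULL rather than 'No Data'. A default of the same type as the column avoids that ambiguity. The sketch below uses -1 as a hypothetical sentinel and the alias two_back, both chosen only for illustration, and runs on SQLite via Python's sqlite3 module (SQLite 3.25 or newer) as a stand-in for Hive. SQLite is dynamically typed and would accept a string default, so this only illustrates the three-argument form lag(column, offset, default), not Hive's type handling.

```python
import sqlite3

# Subset of the sample rows: (city_id, country_id).
rows = [(3, 101), (12, 101), (470, 101), (280, 108), (368, 108)]

conn = sqlite3.connect(":memory:")
conn.execute("create table city (city_id integer, country_id integer)")
conn.executemany("insert into city values (?, ?)", rows)

# Offset 2 with an integer default (-1) instead of a string default.
two_back = conn.execute(
    """
    select city_id,
           lag(city_id, 2, -1) over (partition by country_id order by city_id) as two_back
    from city
    order by country_id, city_id
    """
).fetchall()
print(two_back)
```

Only Sharja (470) has a row two positions earlier in its partition, so it is the only row that does not fall back to the sentinel.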

-- Return the values before and after the current row

hive> select t.*, lag(city_id)over(partition by country_id order by city_id) prior_city,lead(city_id)over(partition by country_id order by city_id) next_city from city t where country_id in(101,102,108);

t.city_id  t.city           t.country_id  t.last_update          prior_city  next_city
3          Abu Dhabi        101           2006-02-15 04:45:25.0  NULL        12
12         al-Ayn           101           2006-02-15 04:45:25.0  3           470
470        Sharja           101           2006-02-15 04:45:25.0  12          NULL
88         Bradford         102           2006-02-15 04:45:25.0  NULL        149
149        Dundee           102           2006-02-15 04:45:25.0  88          312
312        London           102           2006-02-15 04:45:25.0  149         494
494        Southampton      102           2006-02-15 04:45:25.0  312         495
495        Southend-on-Sea  102           2006-02-15 04:45:25.0  494         496
496        Southport        102           2006-02-15 04:45:25.0  495         500
500        Stockport        102           2006-02-15 04:45:25.0  496         589
589        York             102           2006-02-15 04:45:25.0  500         NULL
280        Kragujevac       108           2006-02-15 04:45:25.0  NULL        368
368        Novi Sad         108           2006-02-15 04:45:25.0  280         NULL

Time taken: 52.446 seconds, Fetched: 13 row(s)
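A common use of lag and lead together is computing the distance between a row and its neighbours. The sketch below does this for city_id within each country. It runs on SQLite via Python's sqlite3 module (SQLite 3.25 or newer) as a stand-in for Hive, on a subset of the sample rows; the aliases gap_prev and gap_next are made up for the example.

```python
import sqlite3

# Subset of the sample rows: (city_id, country_id).
rows = [(3, 101), (12, 101), (470, 101), (280, 108), (368, 108)]

conn = sqlite3.connect(":memory:")
conn.execute("create table city (city_id integer, country_id integer)")
conn.executemany("insert into city values (?, ?)", rows)

# Arithmetic with a NULL neighbour yields NULL (None in Python).
gaps = conn.execute(
    """
    select city_id,
           city_id - lag(city_id)  over (partition by country_id order by city_id) as gap_prev,
           lead(city_id) over (partition by country_id order by city_id) - city_id as gap_next
    from city
    order by country_id, city_id
    """
).fetchall()
print(gaps)
```

The first row of each partition has no previous neighbour and the last row has no next one, so those gaps come back as NULL.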

7 First_Value function

hive> select t.*, first_value(city_id)over(partition by country_id order by city_id) first_city from city t where country_id in(101,102,108);

t.city_id  t.city           t.country_id  t.last_update          first_city
3          Abu Dhabi        101           2006-02-15 04:45:25.0  3
12         al-Ayn           101           2006-02-15 04:45:25.0  3
470        Sharja           101           2006-02-15 04:45:25.0  3
88         Bradford         102           2006-02-15 04:45:25.0  88
149        Dundee           102           2006-02-15 04:45:25.0  88
312        London           102           2006-02-15 04:45:25.0  88
494        Southampton      102           2006-02-15 04:45:25.0  88
495        Southend-on-Sea  102           2006-02-15 04:45:25.0  88
496        Southport        102           2006-02-15 04:45:25.0  88
500        Stockport        102           2006-02-15 04:45:25.0  88
589        York             102           2006-02-15 04:45:25.0  88
280        Kragujevac       108           2006-02-15 04:45:25.0  280
368        Novi Sad         108           2006-02-15 04:45:25.0  280

Time taken: 78.285 seconds, Fetched: 13 row(s)

8 Last_value function

hive> select t.*, last_value(city_id)over(partition by country_id order by city_id rows between unbounded preceding and unbounded following) last_city from city t where country_id in(101,102,108);

t.city_id  t.city           t.country_id  t.last_update          last_city
3          Abu Dhabi        101           2006-02-15 04:45:25.0  470
12         al-Ayn           101           2006-02-15 04:45:25.0  470
470        Sharja           101           2006-02-15 04:45:25.0  470
88         Bradford         102           2006-02-15 04:45:25.0  589
149        Dundee           102           2006-02-15 04:45:25.0  589
312        London           102           2006-02-15 04:45:25.0  589
494        Southampton      102           2006-02-15 04:45:25.0  589
495        Southend-on-Sea  102           2006-02-15 04:45:25.0  589
496        Southport        102           2006-02-15 04:45:25.0  589
500        Stockport        102           2006-02-15 04:45:25.0  589
589        York             102           2006-02-15 04:45:25.0  589
280        Kragujevac       108           2006-02-15 04:45:25.0  368
368        Novi Sad         108           2006-02-15 04:45:25.0  368

Time taken: 49.807 seconds, Fetched: 13 row(s)

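Note: when using first_value and last_value, specify the window rows between unbounded preceding and unbounded following, otherwise the results can be surprising. When ORDER BY is present and no frame is given, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so last_value only sees the rows up to the current row (and its peers) and typically just returns the current row's own value.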
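The effect of the default frame on last_value is easiest to see with both variants in one query. This is a minimal sketch on SQLite via Python's sqlite3 module (SQLite 3.25 or newer), used as a stand-in for Hive because it applies the same default frame when ORDER BY is present; the aliases last_default and last_full are made up for the example.

```python
import sqlite3

# Subset of the sample rows: (city_id, country_id).
rows = [(3, 101), (12, 101), (470, 101), (280, 108), (368, 108)]

conn = sqlite3.connect(":memory:")
conn.execute("create table city (city_id integer, country_id integer)")
conn.executemany("insert into city values (?, ?)", rows)

# last_default: frame ends at the current row, so it echoes the current city_id.
# last_full: frame spans the whole partition, so it returns the true last city_id.
last_values = conn.execute(
    """
    select city_id,
           last_value(city_id) over (partition by country_id order by city_id) as last_default,
           last_value(city_id) over (partition by country_id order by city_id
                                     rows between unbounded preceding
                                              and unbounded following) as last_full
    from city
    order by country_id, city_id
    """
).fetchall()
print(last_values)
```

Only the explicitly framed column returns 470 and 368 for every row of its partition, which matches the Hive output above.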

9 Count function

hive> select country_id,count(1) cqty from city where country_id in(101,102,108) group by country_id;

country_id  cqty
101         3
102         8
108         2

Time taken: 56.987 seconds, Fetched: 3 row(s)

10 Sum function

hive> select country_id,sum(city_id) csum from city where country_id in(101,102,108) group by country_id;

country_id  csum
101         485
102         3123
108         648

Time taken: 57.993 seconds, Fetched: 3 row(s)

11 Avg function

hive> select country_id,avg(city_id) cavg from city where country_id in(101,102,108) group by country_id;

country_id  cavg
101         161.66666666666666
102         390.375
108         324.0

Time taken: 54.42 seconds, Fetched: 3 row(s)

12 Max function

hive> select country_id,max(city_id) cmax from city where country_id in(101,102,108) group by country_id;

country_id  cmax
101         470
102         589
108         368

Time taken: 58.02 seconds, Fetched: 3 row(s)

13 Min function

hive> select country_id,min(city_id) cmin from city where country_id in(101,102,108) group by country_id;

country_id  cmin
101         3
102         88
108         280

Time taken: 57.602 seconds, Fetched: 3 row(s)

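Note: the five aggregate functions above (count, sum, avg, max, min) can also be followed by an OVER clause and a window specification.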
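As a sketch of what adding OVER to these aggregates looks like: the GROUP BY queries above return one row per country, whereas the windowed form keeps every city row and attaches the per-country aggregate to it. The example runs on SQLite via Python's sqlite3 module (SQLite 3.25 or newer) as a stand-in for Hive, on a subset of the sample rows.

```python
import sqlite3

# Subset of the sample rows: (city_id, country_id).
rows = [(3, 101), (12, 101), (470, 101), (280, 108), (368, 108)]

conn = sqlite3.connect(":memory:")
conn.execute("create table city (city_id integer, country_id integer)")
conn.executemany("insert into city values (?, ?)", rows)

# No GROUP BY: each row is kept and carries its partition's aggregates.
windowed = conn.execute(
    """
    select city_id, country_id,
           count(*)     over (partition by country_id) as cqty,
           sum(city_id) over (partition by country_id) as csum,
           max(city_id) over (partition by country_id) as cmax
    from city
    order by country_id, city_id
    """
).fetchall()
print(windowed)
```

The values for country 101 and 108 (3 / 485 / 470 and 2 / 648 / 368) agree with the GROUP BY results shown earlier.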

The window specification takes one of the following forms:

(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
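Two frames built from the forms above, sketched on the first four cities of country 102: a running total (UNBOUNDED PRECEDING to CURRENT ROW) and a sum over the current row plus one neighbour on each side (1 PRECEDING to 1 FOLLOWING). SQLite via Python's sqlite3 module (SQLite 3.25 or newer) stands in for Hive; the aliases running_sum and neighbor_sum are made up for the example.

```python
import sqlite3

# First four cities of country 102 from the sample data: (city_id, country_id).
rows = [(88, 102), (149, 102), (312, 102), (494, 102)]

conn = sqlite3.connect(":memory:")
conn.execute("create table city (city_id integer, country_id integer)")
conn.executemany("insert into city values (?, ?)", rows)

frames = conn.execute(
    """
    select city_id,
           sum(city_id) over (order by city_id
                              rows between unbounded preceding and current row) as running_sum,
           sum(city_id) over (order by city_id
                              rows between 1 preceding and 1 following) as neighbor_sum
    from city
    order by city_id
    """
).fetchall()
print(frames)
```

At the edges of the partition the frame simply contains fewer rows, which is why the first and last neighbor_sum values add up only two city_id values.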
