Hive Exercises (with Data)
The data used in these exercises has been uploaded:
Link: https://pan.baidu.com/s/1L5znszdXLUytH9qvTdO4JA
Extraction code: lzyd
Exercise 1: Word Count
Complete a word count job using Hive.
- Create a table and load an article into it.
hive> create table `article` (`sentence` string);
OK
Time taken: 1.019 seconds
hive> load data local inpath '/mnt/hgfs/vm_shared/The_Man_of_Property.txt' overwrite into table article;
Loading data to table default.article
Table default.article stats: [numFiles=1, numRows=0, totalSize=632207, rawDataSize=0]
OK
Time taken: 1.386 seconds
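A quick sanity check right after loading confirms the article was imported (a sketch; the exact rows returned depend on the file):

```sql
-- Peek at the first few raw lines of the loaded table
select sentence from article limit 3;
```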
- Tokenize: split each line of the article into words on spaces, using the split function.
hive> select split(sentence," ") from article;
["Preface"]
["“The","Forsyte","Saga”","was","the","title","originally","destined","for","that","part","of","it","which","is","called","“The","Man","of","Property”;","and","to","adopt","it","for","the","collected","chronicles","of","the","Forsyte","family","has","indulged","the","Forsytean","tenacity","that","is","in","all","of","us.","The","word","Saga","might","be","objected","to","on","the","ground","that","it","connotes","the","heroic","and","that","there","is","little","heroism","in","these","pages.","But","it","is","used","with","a","suitable","irony;","and,","after","all,","this","long","tale,","though","it","may","deal","with","folk","in","frock","coats,","furbelows,","and","a","gilt-edged","period,","is","not","devoid","of","the","essential","heat","of","conflict.","Discounting","for","the","gigantic","stature","and","blood-thirstiness","of","old","days,","as","they","have","come","down","to","us","in","fairy-tale","and","legend,","the","folk","of","the","old","Sagas","were","Forsytes,","assuredly,","in","their","possessive","instincts,","and","as","little","proof","against","the","inroads","of","beauty","and","passion","as","Swithin,","Soames,","or","even","Young","Jolyon.","And","if","heroic","figures,","in","days","that","never","were,","seem","to","startle","out","from","their","surroundings","in","fashion","unbecoming","to","a","Forsyte","of","the","Victorian","era,","we","may","be","sure","that","tribal","instinct","was","even","then","the","prime","force,","and","that","“family”","and","the","sense","of","home","and","property","counted","as","they","do","to","this","day,","for","all","the","recent","efforts","to","“talk","them","out.”"]
["So","many","people","have","written","and","claimed","that","their","families","were","the","originals","of","the","Forsytes","that","one","has","been","almost","encouraged","to","believe","in","the","typicality","of","an","imagined","species.","Manners","change","and","modes","evolve,","and","“Timothy’s","on","the","Bayswater","Road”","becomes","a","nest","of","the","unbelievable","in","all","except","essentials;","we","shall","not","look","upon","its","like","again,","nor","perhaps","on","such","a","one","as","James","or","Old","Jolyon.","And","yet","the","figures","of","Insurance","Societies","and","the","utterances","of","Judges","reassure","us","daily","that","our","earthly","paradise","is","still","a","rich","preserve,","where","the","wild","raiders,","Beauty","and","Passion,","come","stealing","in,","filching","security","from","beneath","our","noses.","As","surely","as","a","dog","will","bark","at","a","brass","band,","so","will","the","essential","Soames","in","human","nature","ever","rise","up","uneasily","against","the","dissolution","which","hovers","round","the","folds","of","ownership."]
["“Let","the","dead","Past","bury","its","dead”","would","be","a","better","saying","if","the","Past","ever","died.","The","persistence","of","the","Past","is","one","of","those","tragi-comic","blessings","which","each","new","age","denies,","coming","cocksure","on","to","the","stage","to","mouth","its","claim","to","a","perfect","novelty."]
["But","no","Age","is","so","new","as","that!","Human","Nature,","under","its","changing","pretensions","and","clothes,","is","and","ever","will","be","very","much","of","a","Forsyte,","and","might,","after","all,","be","a","much","worse","animal."]
...
["The","End"]
Time taken: 0.086 seconds, Fetched: 2866 row(s)
The result is a set of string arrays, 2,866 rows in total.
- Use the wc command to check the file's line count:
[root@node1 vm_shared]# wc The_Man_of_Property.txt
2866  111783  632207  The_Man_of_Property.txt
(lines  words  bytes  filename)
- Also 2,866 lines, which shows that split puts each line's words into a single array.
- After tokenizing, turn each word into a row of its own, which makes it easy to count identical words. Use explode for this.
hive> select explode(split(sentence," ")) from article;
...
we
are
not
at
home.”
And
in
young
Jolyon’s
face
he
slammed
the
door.
The
End
Time taken: 0.085 seconds, Fetched: 111818 row(s)
- Each word now sits on its own row, 111,818 rows in total.
- The split-out words still carry punctuation we don't want, so use a regular expression to extract just the word.
select regexp_extract(word,'[a-zA-Z]+',0) from (select explode(split(sentence," ")) word from article) t;
...
at
home
And
in
young
Jolyon
face
he
slammed
the
door
The
End
Time taken: 0.066 seconds, Fetched: 111818 row(s)
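For reference, regexp_extract(str, pattern, idx) returns the portion of str matched by pattern, where idx 0 means the whole match. A minimal illustration, run against the existing table (a sketch):

```sql
-- '[a-zA-Z]+' grabs the first run of letters, dropping punctuation
select regexp_extract('home.”', '[a-zA-Z]+', 0) from article limit 1;
-- result: home
```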
- With the words extracted, it is time to count them. Note that the regex below is widened to [a-zA-Z]+[\’]*[a-zA-Z]+ so that contractions such as you’re keep their apostrophe.
select word, count(*)
from (select regexp_extract(str,'[a-zA-Z]+[\’]*[a-zA-Z]+',0) word from (select explode(split(sentence," ")) str from article) t1
) t2
group by word;
...
yield 4
yielded 3
yielding 2
yields 1
you 522
young 198
younger 10
youngest 3
youngling 1
your 130
yours 2
yourself 22
yourselves 1
youth 10
you’d 14
you’ll 21
you’re 23
you’ve 25
Time taken: 27.26 seconds, Fetched: 9872 row(s)
- The results look reasonable.
- If you skip the regex filtering, the query becomes quite a bit simpler:
select word, count(*) AS cnt
from (select explode(split(sentence,' ')) word from article
) t
group by word;
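To surface the most frequent words first, the counts can be ordered; a sketch (the alias cnt is my own naming):

```sql
select word, count(*) as cnt
from (select explode(split(sentence, ' ')) word from article) t
group by word
order by cnt desc
limit 20;
```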
Data Preparation for the Exercises
- Data survey
trains.csv (order-to-product mapping)
----------------------
order_id: order number
product_id: product ID
add_to_cart_order: the position at which the product was added to the cart
reordered: whether this product has been bought by the user before (1 = yes, 0 = no)
orders.csv (role in the warehouse: user-behavior table)
----------------------
order_id: order number
user_id: user ID
eval_set: which set the order belongs to (prior history or training)
order_number: the sequence number of the user's orders
order_dow: order day of week, the weekday the order was placed (0-6)
order_hour_of_day: the hour of day the order was placed (0-23)
days_since_prior_order: days elapsed since the user's previous order
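Once the orders table has been created and loaded (done below), a quick profile of eval_set shows how the data splits between prior history and training orders; a hedged sketch:

```sql
-- Distribution of orders across eval_set buckets
select eval_set, count(*) as cnt
from orders
group by eval_set;
```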
- Create the table and load the data
create table trains(
order_id string,
product_id string,
add_to_cart_order string,
reordered string
)row format delimited
fields terminated by ','
lines terminated by '\n';
hive> load data local inpath '/mnt/hgfs/vm_shared/trains.csv' overwrite into table trains;
Loading data to table default.trains
Table default.trains stats: [numFiles=1, numRows=0, totalSize=24680147, rawDataSize=0]
OK
Time taken: 1.801 seconds
hive> select * from trains limit 10;
OK
trains.order_id trains.product_id trains.add_to_cart_order trains.reordered
order_id product_id add_to_cart_order reordered
1 49302 1 1
1 11109 2 1
1 10246 3 0
1 49683 4 0
1 43633 5 1
1 13176 6 0
1 47209 7 0
1 22035 8 1
36 39612 1 0
Time taken: 0.1 seconds, Fetched: 10 row(s)
The first row of the result is the column header I defined; the second row is the header carried inside the data file itself, so that first data row is dirty data and needs to be removed.
There are several ways to remove it.
Method 1. Since the data is already loaded, overwrite the table with HQL:
insert overwrite table trains
select * from trains where order_id != 'order_id';
Method 2. Delete the first line from the dataset before loading it.
[root@node1 vm_shared]# head trains.csv   # the first few lines of the dataset, header included
order_id,product_id,add_to_cart_order,reordered
1,49302,1,1
1,11109,2,1
1,10246,3,0
1,49683,4,0
1,43633,5,1
1,13176,6,0
1,47209,7,0
1,22035,8,1
36,39612,1,0
- Run the command
sed '1d' trains.csv > trains_tmp.csv
- Result
[root@node1 vm_shared]# head trains_tmp.csv   # the header line has been removed
1,49302,1,1
1,11109,2,1
1,10246,3,0
1,49683,4,0
1,43633,5,1
1,13176,6,0
1,47209,7,0
1,22035,8,1
36,39612,1,0
36,19660,2,1
Method 3. Add the table property 'skip.header.line.count'='1' when creating the table, so the first line is skipped automatically on load. For example:
create table xxx(
...
)
row format delimited
fields terminated by '\t'
tblproperties ('skip.header.line.count'='1');
- Create the orders table and load orders.csv in the same way.
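The orders DDL is not shown above; it would look roughly like this, mirroring the trains table (a sketch, with every column typed as string for simplicity):

```sql
create table orders(
order_id string,
user_id string,
eval_set string,
order_number string,
order_dow string,
order_hour_of_day string,
days_since_prior_order string)
row format delimited
fields terminated by ','
lines terminated by '\n'
tblproperties ('skip.header.line.count'='1');
```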
Exercise 2: How many orders does each user have?
select user_id, count(*) from orders group by user_id;
...
Time taken: 32.335 seconds, Fetched: 206209 row(s)
Exercise 3: How many products does each user buy per order, on average?
Note: aggregate functions (count, sum, avg, max, min) are normally used together with GROUP BY.
- Create the table and load the data
create table priors(
order_id string,
product_id string,
add_to_cart_order string,
reordered string)
row format delimited
fields terminated by ','
lines terminated by '\n'
tblproperties ('skip.header.line.count'='1');
hive> load data local inpath '/mnt/hgfs/vm_shared/priors.csv' overwrite into table priors;
Loading data to table default.priors
Table default.priors stats: [numFiles=1, numRows=0, totalSize=577550706, rawDataSize=0]
OK
Time taken: 13.463 seconds
- Since the requirement is the average number of products per order for each user, divide each user's total product count by that user's order count.
select order_id, count(product_id) cnt from priors group by order_id;

select o.user_id, sum(p.cnt)/count(o.order_id)
from orders o
join (select order_id, count(product_id) cnt from priors group by order_id) p
  on o.order_id = p.order_id
group by user_id
limit 10;
1 5.9
2 13.928571428571429
3 7.333333333333333
4 3.6
5 9.25
6 4.666666666666667
7 10.3
8 16.333333333333332
9 25.333333333333332
10 28.6
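The intermediate per-order subquery can also be avoided: counting priors rows per user and dividing by the number of distinct orders gives the same average. A hedged alternative sketch:

```sql
select o.user_id,
       count(p.product_id) / count(distinct o.order_id) as avg_products
from orders o
join priors p on o.order_id = p.order_id
group by o.user_id
limit 10;
```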
Exercise 4: How are each user's orders distributed across the days of the week (pivoting rows into columns)?
select
user_id
, sum(case when order_dow='0' then 1 else 0 end) dow0
, sum(case when order_dow='1' then 1 else 0 end) dow1
, sum(case when order_dow='2' then 1 else 0 end) dow2
, sum(case when order_dow='3' then 1 else 0 end) dow3
, sum(case when order_dow='4' then 1 else 0 end) dow4
, sum(case when order_dow='5' then 1 else 0 end) dow5
, sum(case when order_dow='6' then 1 else 0 end) dow6
from orders
group by user_id;
Exercise 5: Which users have purchased 100 or more distinct products?
trains and priors share the same table structure; search across the combined full data of both sets.
-- define a temporary dataset with WITH ... AS
with user_pro_cnt_tmp as (select * from
(
-- 订单训练数据
select
a.user_id,b.product_id
from orders as a
left join trains b
on a.order_id=b.order_id
union all
-- historical (prior) orders
select
a.user_id,b.product_id
from orders as a
left join priors b
on a.order_id=b.order_id
) t
)
select
user_id
, count(distinct product_id) pro_cnt
from user_pro_cnt_tmp
group by user_id
having pro_cnt >= 100
limit 10;