Hive Exercises (with Data)
The data used in these exercises has been uploaded:
Link: https://pan.baidu.com/s/1L5znszdXLUytH9qvTdO4JA
Extraction code: lzyd
Exercise 1: Word Count
Complete a word count job using Hive.
- Create a table and load an article into it.
hive> create table `article` (`sentence` string);
OK
Time taken: 1.019 seconds
hive> load data local inpath '/mnt/hgfs/vm_shared/The_Man_of_Property.txt' overwrite into table article;
Loading data to table default.article
Table default.article stats: [numFiles=1, numRows=0, totalSize=632207, rawDataSize=0]
OK
Time taken: 1.386 seconds
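A quick sanity check right after loading confirms the article was imported (a sketch; the exact rows returned depend on the file):

```sql
-- Peek at the first few raw lines of the loaded table
select sentence from article limit 3;
```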
- Tokenize: split each line of the article into words on spaces, using the split function.
hive> select split(sentence," ") from article;
["Preface"]
["“The","Forsyte","Saga”","was","the","title","originally","destined","for","that","part","of","it","which","is","called","“The","Man","of","Property”;","and","to","adopt","it","for","the","collected","chronicles","of","the","Forsyte","family","has","indulged","the","Forsytean","tenacity","that","is","in","all","of","us.","The","word","Saga","might","be","objected","to","on","the","ground","that","it","connotes","the","heroic","and","that","there","is","little","heroism","in","these","pages.","But","it","is","used","with","a","suitable","irony;","and,","after","all,","this","long","tale,","though","it","may","deal","with","folk","in","frock","coats,","furbelows,","and","a","gilt-edged","period,","is","not","devoid","of","the","essential","heat","of","conflict.","Discounting","for","the","gigantic","stature","and","blood-thirstiness","of","old","days,","as","they","have","come","down","to","us","in","fairy-tale","and","legend,","the","folk","of","the","old","Sagas","were","Forsytes,","assuredly,","in","their","possessive","instincts,","and","as","little","proof","against","the","inroads","of","beauty","and","passion","as","Swithin,","Soames,","or","even","Young","Jolyon.","And","if","heroic","figures,","in","days","that","never","were,","seem","to","startle","out","from","their","surroundings","in","fashion","unbecoming","to","a","Forsyte","of","the","Victorian","era,","we","may","be","sure","that","tribal","instinct","was","even","then","the","prime","force,","and","that","“family”","and","the","sense","of","home","and","property","counted","as","they","do","to","this","day,","for","all","the","recent","efforts","to","“talk","them","out.”"]
["So","many","people","have","written","and","claimed","that","their","families","were","the","originals","of","the","Forsytes","that","one","has","been","almost","encouraged","to","believe","in","the","typicality","of","an","imagined","species.","Manners","change","and","modes","evolve,","and","“Timothy’s","on","the","Bayswater","Road”","becomes","a","nest","of","the","unbelievable","in","all","except","essentials;","we","shall","not","look","upon","its","like","again,","nor","perhaps","on","such","a","one","as","James","or","Old","Jolyon.","And","yet","the","figures","of","Insurance","Societies","and","the","utterances","of","Judges","reassure","us","daily","that","our","earthly","paradise","is","still","a","rich","preserve,","where","the","wild","raiders,","Beauty","and","Passion,","come","stealing","in,","filching","security","from","beneath","our","noses.","As","surely","as","a","dog","will","bark","at","a","brass","band,","so","will","the","essential","Soames","in","human","nature","ever","rise","up","uneasily","against","the","dissolution","which","hovers","round","the","folds","of","ownership."]
["“Let","the","dead","Past","bury","its","dead”","would","be","a","better","saying","if","the","Past","ever","died.","The","persistence","of","the","Past","is","one","of","those","tragi-comic","blessings","which","each","new","age","denies,","coming","cocksure","on","to","the","stage","to","mouth","its","claim","to","a","perfect","novelty."]
["But","no","Age","is","so","new","as","that!","Human","Nature,","under","its","changing","pretensions","and","clothes,","is","and","ever","will","be","very","much","of","a","Forsyte,","and","might,","after","all,","be","a","much","worse","animal."]
...
["The","End"]
Time taken: 0.086 seconds, Fetched: 2866 row(s)
The result is a set of string arrays, 2,866 rows in total.
- Use the wc command to check the file's line count:
[root@node1 vm_shared]# wc The_Man_of_Property.txt
2866  111783  632207  The_Man_of_Property.txt
(lines  words  bytes  filename)
- Also 2,866 lines, which shows that split puts each line's words into a single array.
- After tokenizing, turn each word into a row of its own, which makes it easy to count identical words. Use explode for this.
hive> select explode(split(sentence," ")) from article;
...
we
are
not
at
home.”
And
in
young
Jolyon’s
face
he
slammed
the
door.
The
End
Time taken: 0.085 seconds, Fetched: 111818 row(s)
- Each word now sits on its own row, 111,818 rows in total.
- The split-out words still carry punctuation we don't want, so use a regular expression to extract just the word.
select regexp_extract(word,'[a-zA-Z]+',0) from (select explode(split(sentence," ")) word from article) t;
...
at
home
And
in
young
Jolyon
face
he
slammed
the
door
The
End
Time taken: 0.066 seconds, Fetched: 111818 row(s)
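For reference, regexp_extract(str, pattern, idx) returns the portion of str matched by pattern, where idx 0 means the whole match. A minimal illustration, run against the existing table (a sketch):

```sql
-- '[a-zA-Z]+' grabs the first run of letters, dropping punctuation
select regexp_extract('home.”', '[a-zA-Z]+', 0) from article limit 1;
-- result: home
```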
- With the words extracted, it is time to count them. Note that the regex below is widened to [a-zA-Z]+[\’]*[a-zA-Z]+ so that contractions such as you’re keep their apostrophe.
select word, count(*)
from (select regexp_extract(str,'[a-zA-Z]+[\’]*[a-zA-Z]+',0) word from (select explode(split(sentence," ")) str from article) t1
) t2
group by word;
...
yield 4
yielded 3
yielding 2
yields 1
you 522
young 198
younger 10
youngest 3
youngling 1
your 130
yours 2
yourself 22
yourselves 1
youth 10
you’d 14
you’ll 21
you’re 23
you’ve 25
Time taken: 27.26 seconds, Fetched: 9872 row(s)
- The results look reasonable.
- If you skip the regex filtering, the query becomes quite a bit simpler:
select word, count(*) AS cnt
from (select explode(split(sentence,' ')) word from article
) t
group by word;
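To surface the most frequent words first, the counts can be ordered; a sketch (the alias cnt is my own naming):

```sql
select word, count(*) as cnt
from (select explode(split(sentence, ' ')) word from article) t
group by word
order by cnt desc
limit 20;
```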
Data Preparation for the Exercises
- Data survey
trains.csv (order-to-product mapping)
----------------------
order_id: order number
product_id: product ID
add_to_cart_order: the position at which the product was added to the cart
reordered: whether this product has been bought by the user before (1 = yes, 0 = no)
orders.csv (role in the warehouse: user-behavior table)
----------------------
order_id: order number
user_id: user ID
eval_set: which set the order belongs to (prior history or training)
order_number: the sequence number of the user's orders
order_dow: order day of week, the weekday the order was placed (0-6)
order_hour_of_day: the hour of day the order was placed (0-23)
days_since_prior_order: days elapsed since the user's previous order
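Once the orders table has been created and loaded (done below), a quick profile of eval_set shows how the data splits between prior history and training orders; a hedged sketch:

```sql
-- Distribution of orders across eval_set buckets
select eval_set, count(*) as cnt
from orders
group by eval_set;
```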
- Create the table and load the data
create table trains(
order_id string,
product_id string,
add_to_cart_order string,
reordered string
)row format delimited
fields terminated by ','
lines terminated by '\n';
hive> load data local inpath '/mnt/hgfs/vm_shared/trains.csv' overwrite into table trains;
Loading data to table default.trains
Table default.trains stats: [numFiles=1, numRows=0, totalSize=24680147, rawDataSize=0]
OK
Time taken: 1.801 seconds
hive> select * from trains limit 10;
OK
trains.order_id trains.product_id trains.add_to_cart_order trains.reordered
order_id product_id add_to_cart_order reordered
1 49302 1 1
1 11109 2 1
1 10246 3 0
1 49683 4 0
1 43633 5 1
1 13176 6 0
1 47209 7 0
1 22035 8 1
36 39612 1 0
Time taken: 0.1 seconds, Fetched: 10 row(s)
The first row of the result is the column header I defined; the second row is the header carried inside the data file itself, so that first data row is dirty data and needs to be removed.
There are several ways to remove it.
Method 1. Since the data is already loaded, overwrite the table with HQL:
insert overwrite table trains
select * from trains where order_id != 'order_id';
Method 2. Delete the first line from the dataset before loading it.
[root@node1 vm_shared]# head trains.csv   # the first few lines of the dataset, header included
order_id,product_id,add_to_cart_order,reordered
1,49302,1,1
1,11109,2,1
1,10246,3,0
1,49683,4,0
1,43633,5,1
1,13176,6,0
1,47209,7,0
1,22035,8,1
36,39612,1,0
- Run the command
sed '1d' trains.csv > trains_tmp.csv
- Result
[root@node1 vm_shared]# head trains_tmp.csv   # the header line has been removed
1,49302,1,1
1,11109,2,1
1,10246,3,0
1,49683,4,0
1,43633,5,1
1,13176,6,0
1,47209,7,0
1,22035,8,1
36,39612,1,0
36,19660,2,1
Method 3. Add the table property 'skip.header.line.count'='1' when creating the table, so the first line is skipped automatically on load. For example:
create table xxx(
...
)
row format delimited
fields terminated by '\t'
tblproperties ('skip.header.line.count'='1');
- Create the orders table and load orders.csv in the same way.
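The orders DDL is not shown above; it would look roughly like this, mirroring the trains table (a sketch, with every column typed as string for simplicity):

```sql
create table orders(
order_id string,
user_id string,
eval_set string,
order_number string,
order_dow string,
order_hour_of_day string,
days_since_prior_order string)
row format delimited
fields terminated by ','
lines terminated by '\n'
tblproperties ('skip.header.line.count'='1');
```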
Exercise 2: How many orders does each user have?
select user_id, count(*) from orders group by user_id;
...
Time taken: 32.335 seconds, Fetched: 206209 row(s)
Exercise 3: How many products does each user buy per order, on average?
Note: aggregate functions (count, sum, avg, max, min) are normally used together with GROUP BY.
- Create the table and load the data
create table priors(
order_id string,
product_id string,
add_to_cart_order string,
reordered string)
row format delimited
fields terminated by ','
lines terminated by '\n'
tblproperties ('skip.header.line.count'='1');
hive> load data local inpath '/mnt/hgfs/vm_shared/priors.csv' overwrite into table priors;
Loading data to table default.priors
Table default.priors stats: [numFiles=1, numRows=0, totalSize=577550706, rawDataSize=0]
OK
Time taken: 13.463 seconds
- Since the requirement is the average number of products per order for each user, divide each user's total product count by that user's order count.
select order_id, count(product_id) cnt from priors group by order_id;

select o.user_id, sum(p.cnt)/count(o.order_id)
from orders o
join (select order_id, count(product_id) cnt from priors group by order_id) p
  on o.order_id = p.order_id
group by user_id
limit 10;
1 5.9
2 13.928571428571429
3 7.333333333333333
4 3.6
5 9.25
6 4.666666666666667
7 10.3
8 16.333333333333332
9 25.333333333333332
10 28.6
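The intermediate per-order subquery can also be avoided: counting priors rows per user and dividing by the number of distinct orders gives the same average. A hedged alternative sketch:

```sql
select o.user_id,
       count(p.product_id) / count(distinct o.order_id) as avg_products
from orders o
join priors p on o.order_id = p.order_id
group by o.user_id
limit 10;
```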
Exercise 4: How are each user's orders distributed across the days of the week (pivoting rows into columns)?
select
user_id
, sum(case when order_dow='0' then 1 else 0 end) dow0
, sum(case when order_dow='1' then 1 else 0 end) dow1
, sum(case when order_dow='2' then 1 else 0 end) dow2
, sum(case when order_dow='3' then 1 else 0 end) dow3
, sum(case when order_dow='4' then 1 else 0 end) dow4
, sum(case when order_dow='5' then 1 else 0 end) dow5
, sum(case when order_dow='6' then 1 else 0 end) dow6
from orders
group by user_id;
Exercise 5: Which users have purchased 100 or more distinct products?
trains and priors share the same table structure; search across the combined full data of both sets.
-- define a temporary dataset with WITH ... AS
with user_pro_cnt_tmp as (select * from
(
-- 订单训练数据
select
a.user_id,b.product_id
from orders as a
left join trains b
on a.order_id=b.order_id
union all
-- historical (prior) orders
select
a.user_id,b.product_id
from orders as a
left join priors b
on a.order_id=b.order_id
) t
)
select
user_id
, count(distinct product_id) pro_cnt
from user_pro_cnt_tmp
group by user_id
having pro_cnt >= 100
limit 10;