The data used in these exercises has been uploaded.
Link: https://pan.baidu.com/s/1L5znszdXLUytH9qvTdO4JA
Extraction code: lzyd

Exercise 1: word count

Complete a word count job with Hive.

  • Create a table and load an article into it.
hive> create table `article` (`sentence` string);
OK
Time taken: 1.019 seconds
hive> load data local inpath '/mnt/hgfs/vm_shared/The_Man_of_Property.txt' overwrite into table article;
Loading data to table default.article
Table default.article stats: [numFiles=1, numRows=0, totalSize=632207, rawDataSize=0]
OK
Time taken: 1.386 seconds
  • Tokenization
    Split the article into words on spaces:
    use the split function, splitting each sentence on the space character.
hive> select split(sentence," ") from article;
["Preface"]
["“The","Forsyte","Saga”","was","the","title","originally","destined","for","that","part","of","it","which","is","called","“The","Man","of","Property”;","and","to","adopt","it","for","the","collected","chronicles","of","the","Forsyte","family","has","indulged","the","Forsytean","tenacity","that","is","in","all","of","us.","The","word","Saga","might","be","objected","to","on","the","ground","that","it","connotes","the","heroic","and","that","there","is","little","heroism","in","these","pages.","But","it","is","used","with","a","suitable","irony;","and,","after","all,","this","long","tale,","though","it","may","deal","with","folk","in","frock","coats,","furbelows,","and","a","gilt-edged","period,","is","not","devoid","of","the","essential","heat","of","conflict.","Discounting","for","the","gigantic","stature","and","blood-thirstiness","of","old","days,","as","they","have","come","down","to","us","in","fairy-tale","and","legend,","the","folk","of","the","old","Sagas","were","Forsytes,","assuredly,","in","their","possessive","instincts,","and","as","little","proof","against","the","inroads","of","beauty","and","passion","as","Swithin,","Soames,","or","even","Young","Jolyon.","And","if","heroic","figures,","in","days","that","never","were,","seem","to","startle","out","from","their","surroundings","in","fashion","unbecoming","to","a","Forsyte","of","the","Victorian","era,","we","may","be","sure","that","tribal","instinct","was","even","then","the","prime","force,","and","that","“family”","and","the","sense","of","home","and","property","counted","as","they","do","to","this","day,","for","all","the","recent","efforts","to","“talk","them","out.”"]
["So","many","people","have","written","and","claimed","that","their","families","were","the","originals","of","the","Forsytes","that","one","has","been","almost","encouraged","to","believe","in","the","typicality","of","an","imagined","species.","Manners","change","and","modes","evolve,","and","“Timothy’s","on","the","Bayswater","Road”","becomes","a","nest","of","the","unbelievable","in","all","except","essentials;","we","shall","not","look","upon","its","like","again,","nor","perhaps","on","such","a","one","as","James","or","Old","Jolyon.","And","yet","the","figures","of","Insurance","Societies","and","the","utterances","of","Judges","reassure","us","daily","that","our","earthly","paradise","is","still","a","rich","preserve,","where","the","wild","raiders,","Beauty","and","Passion,","come","stealing","in,","filching","security","from","beneath","our","noses.","As","surely","as","a","dog","will","bark","at","a","brass","band,","so","will","the","essential","Soames","in","human","nature","ever","rise","up","uneasily","against","the","dissolution","which","hovers","round","the","folds","of","ownership."]
["“Let","the","dead","Past","bury","its","dead”","would","be","a","better","saying","if","the","Past","ever","died.","The","persistence","of","the","Past","is","one","of","those","tragi-comic","blessings","which","each","new","age","denies,","coming","cocksure","on","to","the","stage","to","mouth","its","claim","to","a","perfect","novelty."]
["But","no","Age","is","so","new","as","that!","Human","Nature,","under","its","changing","pretensions","and","clothes,","is","and","ever","will","be","very","much","of","a","Forsyte,","and","might,","after","all,","be","a","much","worse","animal."]
...
["The","End"]
Time taken: 0.086 seconds, Fetched: 2866 row(s)
The result is a long list of string arrays, 2866 rows in total.
  • Use the wc command to check the file's line count.
wc The_Man_of_Property.txt
2866   111783  632207  The_Man_of_Property.txt
(the columns are line count, word count, byte count, and file name)
  • Also 2866 lines, so split puts each line of the text into its own array.
  • After tokenizing, put each word on its own row so identical words can be counted; explode does this.
hive> select explode(split(sentence," ")) from article;
...
we
are
not
at
home.”
And
in
young
Jolyon’s
face
he
slammed
the
door.
The
End
Time taken: 0.085 seconds, Fetched: 111818 row(s)
  • Each word is now on its own line, 111818 rows in total.
  • The split-out tokens still carry punctuation we don't want, so use a regex to extract just the word.
select regexp_extract(word,'[a-zA-Z]+',0) from (select explode(split(sentence," ")) word from article) t;
...
at
home
And
in
young
Jolyon
face
he
slammed
the
door
The
End
Time taken: 0.066 seconds, Fetched: 111818 row(s)
  • With the words extracted cleanly, it's time to count.
select word, count(*)
from (select regexp_extract(str,'[a-zA-Z]+[\’]*[a-zA-Z]+',0) word from (select explode(split(sentence," ")) str from article) t1
) t2
group by word;
...
yield   4
yielded 3
yielding    2
yields  1
you 522
young   198
younger 10
youngest    3
youngling   1
your    130
yours   2
yourself    22
yourselves  1
youth   10
you’d   14
you’ll  21
you’re  23
you’ve  25
Time taken: 27.26 seconds, Fetched: 9872 row(s)
  • The results look good.
  • In fact, skipping the regex filter makes this quite a bit simpler:
select word, count(*) AS cnt
from (select explode(split(sentence,' ')) word from article
) t
group by word;
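  • To actually read off the most frequent words, the same aggregate can be sorted; a minimal sketch (the limit of 10 is an arbitrary choice, not from the original):
select word, count(*) as cnt
from (select explode(split(sentence,' ')) word from article) t
group by word
order by cnt desc
limit 10;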

Data preparation for the exercises

  1. Data exploration
trains.csv (order — product)
----------------------
order_id: order ID
product_id: product ID
add_to_cart_order: position at which the product was added to the cart
reordered: whether the product in this order is a repeat purchase (1 = yes, 0 = no)
orders.csv (positioned in the data warehouse as the user-behavior table)
----------------------
order_id: order ID
user_id: user ID
eval_set: which set the order belongs to (prior history or training)
order_number: sequence number of the user's orders
order_dow: order day of week, the weekday the order was placed on (0-6)
order_hour_of_day: hour of day the order was placed (0-23)
days_since_prior_order: days elapsed since the user's previous order
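A quick way to sanity-check the orders table once it is loaded (in the next step) is to see how eval_set splits the data; a hedged sketch:
select eval_set, count(*) from orders group by eval_set;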
  2. Create the tables and load the data
create table trains(
order_id string,
product_id string,
add_to_cart_order string,
reordered string
)row format delimited
fields terminated by ','
lines terminated by '\n';
hive> load data local inpath '/mnt/hgfs/vm_shared/trains.csv' overwrite into table trains;
Loading data to table default.trains
Table default.trains stats: [numFiles=1, numRows=0, totalSize=24680147, rawDataSize=0]
OK
Time taken: 1.801 seconds
hive> select * from trains limit 10;
OK
trains.order_id trains.product_id   trains.add_to_cart_order    trains.reordered
order_id    product_id  add_to_cart_order   reordered
1   49302   1   1
1   11109   2   1
1   10246   3   0
1   49683   4   0
1   43633   5   1
1   13176   6   0
1   47209   7   0
1   22035   8   1
36  39612   1   0
Time taken: 0.1 seconds, Fetched: 10 row(s)
  • In the result, the first row is the column names I defined, while the second row is the field-name line carried in the data file itself, so that first data row is dirty data pulled in by the load and needs to be removed.

  • There are several ways to remove it.

Method 1. The data is already loaded, so overwrite it in place with HQL.

insert overwrite table trains
select * from trains where order_id != 'order_id';
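As a sanity check (a sketch, not from the original), the stray header row should no longer be found:
select count(*) from trains where order_id = 'order_id';
-- expected result: 0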

Method 2. Delete the first line from the dataset itself before loading.

[root@node1 vm_shared]# head trains.csv    // the first few lines of the dataset, with the header row
order_id,product_id,add_to_cart_order,reordered
1,49302,1,1
1,11109,2,1
1,10246,3,0
1,49683,4,0
1,43633,5,1
1,13176,6,0
1,47209,7,0
1,22035,8,1
36,39612,1,0
  • Run the command
sed '1d' trains.csv > trains_tmp.csv
  • Result
[root@node1 vm_shared]# head trains_tmp.csv    // the first line is gone
1,49302,1,1
1,11109,2,1
1,10246,3,0
1,49683,4,0
1,43633,5,1
1,13176,6,0
1,47209,7,0
1,22035,8,1
36,39612,1,0
36,19660,2,1

Method 3. Add the table property 'skip.header.line.count'='1' when creating the table, so the first line is skipped automatically on load. For example:

create table xxx(
...
)
row format delimited
fields terminated by ','
tblproperties ('skip.header.line.count'='1');
  • Create the orders table and load orders.csv the same way; a sketch follows below.
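A sketch of that DDL, assuming string columns named after the data dictionary above and the same shared-folder path for the file:
create table orders(
order_id string,
user_id string,
eval_set string,
order_number string,
order_dow string,
order_hour_of_day string,
days_since_prior_order string
)row format delimited
fields terminated by ','
lines terminated by '\n'
tblproperties ('skip.header.line.count'='1');
load data local inpath '/mnt/hgfs/vm_shared/orders.csv' overwrite into table orders;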

Exercise 2: How many orders does each user have?

select user_id, count(*) from orders group by user_id;
...
Time taken: 32.335 seconds, Fetched: 206209 row(s)
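  • To see the heaviest users instead of scrolling through all 206209 rows, the same aggregate can be sorted (a sketch; the limit is arbitrary):
select user_id, count(*) as order_cnt
from orders
group by user_id
order by order_cnt desc
limit 10;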

Exercise 3: On average, how many products does each user buy per order?

Note: aggregate functions (count, sum, avg, max, min) must be used together with group by.

  • Create the table and load the data
create table priors(
order_id string,
product_id string,
add_to_cart_order string,
reordered string)
row format delimited
fields terminated by ','
lines terminated by '\n'
tblproperties ('skip.header.line.count'='1');
hive> load data local inpath '/mnt/hgfs/vm_shared/priors.csv' overwrite into table priors;
Loading data to table default.priors
Table default.priors stats: [numFiles=1, numRows=0, totalSize=577550706, rawDataSize=0]
OK
Time taken: 13.463 seconds
  • Since the requirement is each user's average number of products per order, we divide each user's total product count by their number of orders.
select order_id, count(product_id) cnt from priors group by order_id;
select o.user_id, sum(p.cnt)/count(o.order_id) from orders o join (select order_id, count(product_id) cnt from priors group by order_id) p on o.order_id=p.order_id group by user_id limit 10;
1   5.9
2   13.928571428571429
3   7.333333333333333
4   3.6
5   9.25
6   4.666666666666667
7   10.3
8   16.333333333333332
9   25.333333333333332
10  28.6
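  • The same averages can be computed in one pass, without the per-order subquery, by counting joined product rows against distinct orders (a sketch, equivalent under the same inner join):
select o.user_id, count(p.product_id)/count(distinct o.order_id) avg_products
from orders o
join priors p on o.order_id = p.order_id
group by o.user_id
limit 10;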

Exercise 4: Each user's order distribution across the days of the week (pivoting rows to columns)?

select
user_id
, sum(case when order_dow='0' then 1 else 0 end) dow0
, sum(case when order_dow='1' then 1 else 0 end) dow1
, sum(case when order_dow='2' then 1 else 0 end) dow2
, sum(case when order_dow='3' then 1 else 0 end) dow3
, sum(case when order_dow='4' then 1 else 0 end) dow4
, sum(case when order_dow='5' then 1 else 0 end) dow5
, sum(case when order_dow='6' then 1 else 0 end) dow6
from orders
group by user_id;
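  • Hive's if() function expresses the same pivot a little more tersely; a sketch of an equivalent query:
select
user_id
, sum(if(order_dow='0',1,0)) dow0
, sum(if(order_dow='1',1,0)) dow1
, sum(if(order_dow='2',1,0)) dow2
, sum(if(order_dow='3',1,0)) dow3
, sum(if(order_dow='4',1,0)) dow4
, sum(if(order_dow='5',1,0)) dow5
, sum(if(order_dow='6',1,0)) dow6
from orders
group by user_id;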

Exercise 5: Which users have purchased at least 100 distinct products?

The trains table and the priors table have the same schema, so we search across the full data of both sets combined.

-- define a temporary dataset with WITH ... AS
with user_pro_cnt_tmp as (select * from
(
-- training order data
select
a.user_id,b.product_id
from orders as a
left join trains b
on a.order_id=b.order_id
union all
-- historical (prior) order data
select
a.user_id,b.product_id
from orders as a
left join priors b
on a.order_id=b.order_id
) t
)
select
user_id
, count(distinct product_id) pro_cnt
from user_pro_cnt_tmp
group by user_id
having pro_cnt >= 100
limit 10;
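  • A note on the design: union all is used rather than union because it skips deduplication, which is cheaper here; the later count(distinct product_id) removes duplicate (user_id, product_id) pairs anyway.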
