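[Hive] Hive Weibo Case Study

Contents

Data preparation and description
- Data description
- Data sample
- Field description
- Data storage
- Getting started
Requirements
- 1. Data processing: propose a solution for the data problem (15 points)
- 2. Organize the data (10 points)
- 3. Count the total number of posts and the number of distinct users (7 points)
- 4. Sum the forward counts of all posts per user, and output the top 5 users with their totals (7 points)
- 5. Count the posts that contain pictures (7 points)
- 6. Count the distinct users who post from an iPhone (7 points)
- 7. Add the like count and the forward count of posts, sort the sums in descending order, take the first 10 records, and output userid and the total (6 points)
- 8. Collect the user ID and data source of posts with fewer than 1000 comments into a view, then count the users in the view whose data source is "ipad客户端" (10 points)
- 9. Find the user whose post content contains "iphone" the most times, and output the user id and the count (note: the count is the number of occurrences of "iphone", not the number of posts containing "iphone") (15 points)
- 10. For each day, find the ID of the user who posted the most and the number of posts (8 points)
- 11. Count the photos that are referenced multiple times (the same photo appears in more than one post) (8 points)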

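Data preparation and description

Data description

Historical user data up to 2013-12-15: 221 MB compressed and 878 MB uncompressed. The full data set consists of 1206 small files, all in JSON format.
Data download link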

Data sample

[{"beCommentWeiboId":"","beForwardWeiboId":"","catchTime":"1387165034","commentCount":"6","content":"Raresmileyportrait(1977)","createTime":"1387130972","info1":"","info2":"","info3":"","mlevel":"","musicurl":[],"pic_list":["http://ww2.sinaimg.cn/thumbnail/69d3e27djw1ebkxp7rtczj20mo0mogmy.jpg"],"praiseCount":"5","reportCount":"70","source":"","userId":"1775493757","videourl":[],"weiboId":"3655954636173507","weiboUrl":"http://weibo.com/1775493757/AntDppU0H"}]
[{"beCommentWeiboId":"","beForwardWeiboId":"3655954636173507","catchTime":"1387165034","commentCount":"29","content":"玲笑容！","createTime":"1387139090","info1":"","info2":"","info3":"","mlevel":"","musicurl":[],"pic_list":[],"praiseCount":"72","reportCount":"61","source":"新浪微博","userId":"1719481457","videourl":[],"weiboId":"3655988685551869","weiboUrl":"http://weibo.com/1719481457/Anuwkniih"}]
[{"beCommentWeiboId":"","beForwardWeiboId":"","catchTime":"1387165034","commentCount":"4","content":"lifeisbeautifulandallisaboutconfident&amp;trust&amp;friends&amp;LOVE,thanksto@黄伟文,youmakemefeellikehongkongismagic&amp;happiness.","createTime":"1387053188","info1":"","info2":"","info3":"","mlevel":"","musicurl":[],"pic_list":[],"praiseCount":"8","reportCount":"8","source":"","userId":"1733190683","videourl":[],"weiboId":"3655628385727081","weiboUrl":"http://weibo.com/1733190683/Anl9co1Sh"}]

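Field description

19 fields in total:

beCommentWeiboId    whether the post is a comment
beForwardWeiboId    whether the post is a forward (in the samples it holds the ID of the forwarded post, or is empty)
catchTime           crawl time (Unix timestamp in seconds)
commentCount        number of comments
content             content
createTime          creation time (Unix timestamp in seconds)
info1               info field 1
info2               info field 2
info3               info field 3
mlevel              not sure
musicurl            music links
pic_list            photo list (may contain several)
praiseCount         number of likes
reportCount         number of forwards
source              data source
userId              user id
videourl            video links
weiboId             post id
weiboUrl            post URL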

Data storage

hdfs://hdp01:9000/data/weibo
When creating the table, create it as an external table.

[hdp01@hdp01 weibo]$ hdfs dfs -ls /data/weibo
Found 2 items
-rw-r--r--   2 hdp01 supergroup    1004992 2020-01-11 16:17 /data/weibo/1387159770_1087770692_20100101000000_VCSvoMgPvrSTKhCkkIA7uMV9Hn10877706927159770ouss.json
-rw-r--r--   2 hdp01 supergroup     680641 2020-01-11 16:17 /data/weibo/1387159770_1180721740_20100101000000_tBx94gQvEoOWTiB4n3gORSmS11807217407159771ouss.json

Getting started

hive> set hive.exec.mode.local.auto=true;
--hive> set hive.cli.print.header=true;
hive> create database weibo;
hive> use weibo;

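Requirements

1. Data processing: propose a solution for the data problem (15 points)

Problem: there are too many small data files, so they need to be merged. Propose a solution.
Answer: merge them with a MapReduce job.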
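
One way to avoid writing a dedicated MapReduce job is to let Hive do the merging. The following is a minimal sketch, assuming the raw lines are readable through the single-column weibo_json table created in step 2 and that Hive runs on the MapReduce engine, as in the logs in this article. The target table name weibo_json_merged and the size thresholds are illustrative choices, not part of the original solution.

```sql
-- Pack many small files into fewer input splits when reading
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- Merge small output files at the end of map-only and map-reduce jobs
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
-- Approximate size of each merged file, in bytes (illustrative value)
set hive.merge.size.per.task=256000000;
-- Trigger the merge when the average output file is smaller than this (illustrative value)
set hive.merge.smallfiles.avgsize=128000000;

-- Hypothetical target table: rewriting the rows lets Hive write them back as a few larger files
create table if not exists weibo_json_merged as
select json from weibo_json;
```

The design choice here is to trade one extra copy of the data for far fewer files, which reduces the number of map tasks and the load on the NameNode in later queries.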

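2. Organize the data (10 points)

(Create a Hive table weibo_json(json string) with a single column, load all the data, and verify by querying the first 5 rows.)
(Parse the JSON data in weibo_json into a weibo table with 19 columns, and write the necessary SQL statements.)

Create the weibo_json table

hive> create external table if not exists weibo_json(
    > json string)
    > location "/data/weibo";
-- Because this is an external table whose location points to /data/weibo, the data can be queried as soon as the table is created.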
hive> select * from weibo_json limit 2;
OK
[{"beCommentWeiboId":"","beForwardWeiboId":"","catchTime":"1387159495","commentCount":"1419","content":"分享图片","createTime":"1386981067","info1":"","info2":"","info3":"","mlevel":"","musicurl":[],"pic_list":["http://ww3.sinaimg.cn/thumbnail/40d61044jw1ebixhnsiknj20qo0qognx.jpg"],"praiseCount":"5265","reportCount":"1285","source":"iPad客户端","userId":"1087770692","videourl":[],"weiboId":"3655325888057474","weiboUrl":"http://weibo.com/1087770692/AndhixO7g"}]
[{"beCommentWeiboId":"","beForwardWeiboId":"","catchTime":"1387159495","commentCount":"91","content":"行走：#去远方发现自己#@费勇主编，跨界明星联合执笔，分享他们观行思趣的心发现、他们的成长与心路历程，当当网限量赠送出品人@陈坤抄诵印刷版《心经》，赠完不再加印哦！详情请戳：http://t.cn/8k622Sj","createTime":"1386925242","info1":"","info2":"","info3":"","mlevel":"","musicurl":[],"pic_list":["http://ww4.sinaimg.cn/thumbnail/b2336177jw1ebi6j4twk7j20m80tkgra.jpg"],"praiseCount":"1","reportCount":"721","source":"","userId":"2989711735","videourl":[],"weiboId":"3655091741442099","weiboUrl":"http://weibo.com/2989711735/An7bE639F"}]
Time taken: 0.134 seconds, Fetched: 2 row(s)

Create the parsed table weibo

-- One record, expanded for readability:
[{"beCommentWeiboId":"","beForwardWeiboId":"","catchTime":"1387159495","commentCount":"1419","content":"分享图片","createTime":"1386981067","info1":"","info2":"","info3":"","mlevel":"","musicurl":[],"pic_list":["http://ww3.sinaimg.cn/thumbnail/40d61044jw1ebixhnsiknj20qo0qognx.jpg"],"praiseCount":"5265","reportCount":"1285","source":"iPad客户端","userId":"1087770692","videourl":[],"weiboId":"3655325888057474","weiboUrl":"http://weibo.com/1087770692/AndhixO7g"}
]
-- Create table weibo
create table if not exists weibo as select
get_json_object(json,'$[0].beCommentWeiboId') comment,
get_json_object(json,'$[0].beForwardWeiboId') forward,
get_json_object(json,'$[0].catchTime') time,
get_json_object(json,'$[0].commentCount') count,
get_json_object(json,'$[0].content') content,
get_json_object(json,'$[0].createTime') ctime,
get_json_object(json,'$[0].info1') u1,
get_json_object(json,'$[0].info2') u2,
get_json_object(json,'$[0].info3') u3,
get_json_object(json,'$[0].mlevel') ml,
get_json_object(json,'$[0].musicurl') murl,
get_json_object(json,'$[0].pic_list') pic_list,
get_json_object(json,'$[0].praiseCount') pcount,
get_json_object(json,'$[0].reportCount') rcount,
get_json_object(json,'$[0].source') sre,
get_json_object(json,'$[0].userId') uid,
get_json_object(json,'$[0].videourl') vurl,
get_json_object(json,'$[0].weiboId') weiboid,
get_json_object(json,'$[0].weiboUrl') weibourl from weibo_json;
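
The path '$[0]' keeps only the first object of each JSON array, so it is worth checking that no line holds more than one record. The queries below are a minimal sketch of such a check. Whether any multi-record lines exist in this data set is not stated in the article, and the column aliases are names introduced here.

```sql
-- Lines whose array has a second element would lose data under '$[0]'
select count(*) as multi_record_lines
from weibo_json
where get_json_object(json, '$[1]') is not null;

-- The raw line count and the parsed row count should match
select count(*) as raw_lines from weibo_json;
select count(*) as parsed_rows from weibo;
```

If the first query returns 0, parsing with '$[0]' covers every record.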

3. Count the total number of posts and the number of distinct users (7 points)

-- Total number of posts:
hive> select count(*) from weibo;
OK
2795
Time taken: 0.266 seconds, Fetched: 1 row(s)
-- Number of distinct users:
hive> select count(distinct uid) from weibo;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hdp01_20200111164547_e782dba7-be05-4d5f-ac89-ed0ca0c8cf20
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1578725328740_0002, Tracking URL = http://hdp02:8088/proxy/application_1578725328740_0002/
Kill Command = /home/hdp01/apps/hadoop-2.7.7/bin/hadoop job  -kill job_1578725328740_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-01-11 16:45:53,654 Stage-1 map = 0%,  reduce = 0%
2020-01-11 16:45:57,875 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.97 sec
2020-01-11 16:46:03,122 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.43 sec
MapReduce Total cumulative CPU time: 2 seconds 430 msec
Ended Job = job_1578725328740_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.43 sec   HDFS Read: 974677 HDFS Write: 103 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 430 msec
OK
474
Time taken: 17.347 seconds, Fetched: 1 row(s)

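4. Sum the forward counts of all posts per user, and output the top 5 users with their totals (7 points)

-- Analysis
Group by: uid
Compute: sum(rcount)
Order by: the sum
--
select
uid,sum(cast(rcount as int)) totalcount
from weibo
group by uid
order by totalcount desc limit 5;

1087770692  5480962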
1769962140  1196060
3005254180  482943
1270468784  449891
2803301701  336813

5. Count the posts that contain pictures (7 points)

-- pic_list: photo list (may contain several)
hive> select pic_list from weibo limit 3;
OK
["http://ww3.sinaimg.cn/thumbnail/40d61044jw1ebixhnsiknj20qo0qognx.jpg"]
["http://ww4.sinaimg.cn/thumbnail/b2336177jw1ebi6j4twk7j20m80tkgra.jpg"]
[]
-- Count, keeping only posts with pictures (an empty list is the 2-character string [])
hive> select count(weiboid) from weibo where length(pic_list)>2;
1579
-- or
hive> select count(weiboid) from weibo where pic_list like '%http%';
1579

6. Count the distinct users who post from an iPhone (7 points)

-- Keep rows whose source contains "iphone"
hive> select count(distinct uid) from weibo where lower(sre) like "%iphone%";
2
-- or
hive> desc function instr;
OK
instr(str, substr) - Returns the index of the first occurance of substr in str
hive> select count(distinct uid) from weibo where instr(lower(sre),"iphone")>0;
2

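7. Add the like count and the forward count of posts, sort the sums in descending order, take the first 10 records, and output userid and the total (6 points)

-- Analysis
Group by: uid
Compute: sum(pcount+rcount)
Order by: the sum
---
select
uid,sum(cast(pcount as int)+cast(rcount as int)) totalcount
from weibo
group by uid
order by totalcount desc limit 10;

1087770692    7507175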
1769962140  1201930
1270468784  497905
3005254180  482944
2803301701  348426
2656274875  302841
1649005320  239881
1211441627  221671
1266321801  217585
2534616035  178187
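
In Hive, casting an empty string to a number yields NULL, and NULL plus a number is NULL, so a post with one missing counter would contribute nothing to the sum above. The variant below is a hedged sketch that treats a missing counter as 0. Whether such rows exist in this data set is an assumption, so the totals may or may not differ from the ones listed.

```sql
-- Treat a missing like or forward counter as 0 instead of dropping the whole post
select
  uid,
  sum(coalesce(cast(pcount as bigint), 0)
      + coalesce(cast(rcount as bigint), 0)) as totalcount
from weibo
group by uid
order by totalcount desc
limit 10;
```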

8. Collect the user ID and data source of posts with fewer than 1000 comments into a view, then count the users in the view whose data source is "ipad客户端" (10 points)

-- commentCount (number of comments) < 1000
hive> create view count_view as
    > select
    > uid,sre
    > from weibo
    > where cast(count as int)<1000;
OK
uid sre
Time taken: 0.177 seconds
-- Filter sre = "ipad客户端"
select
count(distinct uid)
from count_view
where lower(sre) like "%ipad客户端%";

2

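9. Find the user whose post content contains "iphone" the most times, and output the user id and the count (note: the count is the number of occurrences of "iphone", not the number of posts containing "iphone") (15 points)

-- Analysis
Group by: uid
Compute: sum(number of occurrences of "iphone" in each of the user's posts)
Order by: the sum, descending, limit 1
--
-- myfun is the custom function defined and registered below
select
uid, sum(myfun(lower(content),"iphone")) totalcount
from weibo
group by uid
order by totalcount desc limit 1;

uid    totalcount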
1547286441  2

Custom function: count the occurrences of "iphone" in each post

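package com.study.follow.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

public class GetStrCount extends UDF {
    /**
     * Counts the non-overlapping occurrences of index in content.
     * @param content the original string
     * @param index the string to search for
     * @return the number of occurrences (0 if either argument is null or index is empty)
     */
    public int evaluate(String content, String index) {
        // Guard against NULL columns and an empty search string (which would loop forever)
        if (content == null || index == null || index.isEmpty()) {
            return 0;
        }
        int count = 0;
        // e.g. content = "iphone i iphone"
        while (content.contains(index)) {
            count++;
            // Continue searching after the match just found
            content = content.substring(content.indexOf(index) + index.length());
        }
        return count;
    }
}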

-- Add the jar
hive> add jar /home/hdp01/tmpfiles/hiveData/follow-1.0-SNAPSHOT.jar;
Added [/home/hdp01/tmpfiles/hiveData/follow-1.0-SNAPSHOT.jar] to class path
Added resources: [/home/hdp01/tmpfiles/hiveData/follow-1.0-SNAPSHOT.jar]
hive> list jar;
/home/hdp01/tmpfiles/hiveData/follow-1.0-SNAPSHOT.jar
hive> create temporary function myfun as 'com.study.follow.udf.GetStrCount';
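
The same count can also be computed with built-in functions only, which avoids building and registering a jar. The query below is a sketch of that idea, not the article's method: deleting every "iphone" from the lower-cased content shortens it by 6 characters per occurrence, so the length difference divided by the length of "iphone" gives the number of occurrences.

```sql
-- Occurrences = (characters removed when every "iphone" is deleted) / length("iphone")
select
  uid,
  sum(cast((length(lower(content))
            - length(regexp_replace(lower(content), 'iphone', '')))
           / length('iphone') as int)) as totalcount
from weibo
group by uid
order by totalcount desc
limit 1;
```

Because "iphone" cannot overlap with itself, this should count the same matches as the UDF. Rows with NULL content produce NULL and are ignored by sum.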

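10. For each day, find the ID of the user who posted the most and the number of posts (8 points)

-- Analysis
Number of posts per user per day
Group by: uid, ctime
Compute: count(weiboId)
Order by: descending, limit 1
--
-------- First interpretation: posts per user per day, sorted globally
select
uid,
day(from_unixtime(cast(ctime as bigint),"yyyy-MM-dd")) fbday,
count(weiboId) totalcount
from weibo
group by uid,day(from_unixtime(cast(ctime as bigint),"yyyy-MM-dd"))
order by totalcount desc limit 1;

-- The original run omitted desc, so it returned a row with the smallest count instead:
uid fbday   totalcount
1032747075  5   1
-- With desc, the query returns the overall maximum, which according to the per-day results below is uid 1180721740 on day 5 with 86 posts.

-------- Second interpretation: rank within each day and take the top user for each day
Per-group top N with row_number
-- Step 1: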
create table step01 as
select
uid,
day(from_unixtime(cast(ctime as bigint),"yyyy-MM-dd")) fbday,
count(weiboId) totalcount
from weibo
group by uid,day(from_unixtime(cast(ctime as bigint),"yyyy-MM-dd"));
-- Step 2: rank within each day on top of step 1
select * from
(select
uid,
fbday,
totalcount,
row_number() over(partition by fbday order by totalcount desc) index
from step01) a
where index=1;

a.uid    a.fbday a.totalcount    a.index
1087770692  1   35  1
1087770692  2   25  1
1087770692  3   26  1
1180721740  4   32  1
1180721740  5   86  1
1087770692  6   39  1
1180721740  7   33  1
1087770692  8   35  1
1087770692  9   36  1
1087770692  10  34  1
1087770692  11  34  1
1087770692  12  25  1
1180721740  13  38  1
1087770692  14  29  1
1180721740  15  44  1
1087770692  16  38  1
1087770692  17  42  1
1087770692  18  44  1
1087770692  19  38  1
1087770692  20  56  1
1087770692  21  43  1
1087770692  22  43  1
1087770692  23  36  1
1180721740  24  23  1
1087770692  25  28  1
1087770692  26  34  1
1087770692  27  38  1
1180721740  28  29  1
1087770692  29  26  1
1087770692  30  23  1
1180721740  31  17  1
-- Or combine step 1 and step 2 into a single query
select * from
(select
uid,
fbday,
totalcount,
row_number() over(partition by fbday order by totalcount desc) index
from
(select
uid,
day(from_unixtime(cast(ctime as bigint),"yyyy-MM-dd")) fbday,
count(weiboId) totalcount
from weibo
group by uid,day(from_unixtime(cast(ctime as bigint),"yyyy-MM-dd"))
) b
) a
where index=1;

a.uid    a.fbday a.totalcount    a.index
1087770692  1   35  1
1087770692  2   25  1
1087770692  3   26  1
1180721740  4   32  1
1180721740  5   86  1
1087770692  6   39  1
1180721740  7   33  1
1087770692  8   35  1
1087770692  9   36  1
1087770692  10  34  1
1087770692  11  34  1
1087770692  12  25  1
1180721740  13  38  1
1087770692  14  29  1
1180721740  15  44  1
1087770692  16  38  1
1087770692  17  42  1
1087770692  18  44  1
1087770692  19  38  1
1087770692  20  56  1
1087770692  21  43  1
1087770692  22  43  1
1087770692  23  36  1
1180721740  24  23  1
1087770692  25  28  1
1087770692  26  34  1
1087770692  27  38  1
1180721740  28  29  1
1087770692  29  26  1
1087770692  30  23  1
1180721740  31  17  1

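Note:
The solution above applies day() to the date, so if the data spans several months, the same day of the month from different months is merged into one group.
A correct solution should therefore group by the full date without day(). The listings above are kept as they were originally run.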
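
A corrected version can group by the full calendar date instead of the day of the month. The query below is a minimal sketch under that assumption. The aliases fbdate and rn are names introduced here, and from_unixtime formats the timestamp in the session time zone. It has not been run against this data, so no output is shown.

```sql
-- Rank users within each calendar date and keep the top poster per date
select uid, fbdate, totalcount
from (
  select
    uid,
    fbdate,
    totalcount,
    row_number() over (partition by fbdate order by totalcount desc) as rn
  from (
    select
      uid,
      from_unixtime(cast(ctime as bigint), 'yyyy-MM-dd') as fbdate,
      count(weiboid) as totalcount
    from weibo
    group by uid, from_unixtime(cast(ctime as bigint), 'yyyy-MM-dd')
  ) b
) a
where rn = 1;
```

row_number() returns exactly one user per date even when several users tie. Replacing it with rank() would return all tied users.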

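11. Count the photos that are referenced multiple times (the same photo appears in more than one post) (8 points)

-- Analysis
Number of times each photo is referenced
Filter: count > 1
First look at the photo column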
hive> select pic_list from weibo limit 3;
["http://ww1.sinaimg.cn/thumbnail/5ec0ffd6jw1ebecowl3qnj20gf20511x.jpg"]
[]
["http://ww2.sinaimg.cn/square/75b52ed2jw1eb9zeuce89j214c0qo10f.jpg","http://ww1.sinaimg.cn/square/75b52ed2jw1eb9zev47cbj21400qo7el.jpg","http://ww3.sinaimg.cn/square/75b52ed2jw1eb9zevnsokj21400qo45p.jpg","http://ww1.sinaimg.cn/square/75b52ed2jw1eb9zew8xrhj21400qowmu.jpg","http://ww2.sinaimg.cn/square/75b52ed2jw1eb9zex7hurj21400qo12o.jpg","http://ww4.sinaimg.cn/square/75b52ed2jw1eb9zexs7aoj21400qo45i.jpg","http://ww3.sinaimg.cn/square/75b52ed2jw1eb9zexsc99j21400qon5o.jpg","http://ww1.sinaimg.cn/square/75b52ed2jw1eb9zf4e36tj21400qojyu.jpg","http://ww1.sinaimg.cn/square/75b52ed2jw1eb9zf615s2j21400qojxx.jpg"]
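The column is a string that looks like an array, so first strip the surrounding [] and then split on commas
Filter out pic_list values that are empty ([])
-- 1. Explode the photos: one row per (post, referenced photo)
create table ti11_step01 as
select
weiboid,pic.picurl picurl
from weibo
lateral view explode(split(substr(pic_list,2,length(pic_list)-2),",")) pic as picurl
where pic_list<>"[]";

hive> select * from ti11_step01 limit 10;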
OK
ti11_step01.weiboid ti11_step01.picurl
3655325888057474    "http://ww3.sinaimg.cn/thumbnail/40d61044jw1ebixhnsiknj20qo0qognx.jpg"
3655091741442099    "http://ww4.sinaimg.cn/thumbnail/b2336177jw1ebi6j4twk7j20m80tkgra.jpg"
3655043682911102    "http://ww3.sinaimg.cn/thumbnail/40d61044jw1ebi12vz90oj20hs0ns0tt.jpg"
3655008891426001    "http://ww1.sinaimg.cn/thumbnail/40d61044jw1ebhx30o0haj20hs0bvq3l.jpg"
3654600311574268    "http://ww2.sinaimg.cn/thumbnail/40d61044jw1ebgm5qh2v4j20qo0qoq6f.jpg"
3654376063078681    "http://ww2.sinaimg.cn/thumbnail/52e0e4f8jw1ebfwa7tpg0j208c069t9b.jpg"
3654376063078681    "http://ww2.sinaimg.cn/thumbnail/52e0e4f8jw1ebfwa7tpg0j208c069t9b.jpg"
3654267598493042    "http://ww3.sinaimg.cn/thumbnail/5951573cjw1ebfjwrobmrj20m83huhdt.jpg"
3654281733100362    "http://ww2.sinaimg.cn/thumbnail/40d61044jw1ebflkk2aijj20hs0bv74n.jpg"
3653970221948710    "http://ww3.sinaimg.cn/thumbnail/5ec0ffd6jw1ebelpus6rpj20ne0x3ah0.jpg"
-- 2. Count how many times each photo is referenced
Group by: picurl
Filter: reference count > 1
select
picurl,
count(*) totalcount
from ti11_step01
group by picurl
having totalcount>1;

picurl   totalcount
"http://ww1.sinaimg.cn/thumbnail/40d61044jw1e0ujdhhczsj.jpg"  2
"http://ww1.sinaimg.cn/thumbnail/40d61044jw1e1g0d0eerfj.jpg"  2
"http://ww1.sinaimg.cn/thumbnail/40d61044jw1e3upos6jaaj20an0fs75w.jpg"    2
"http://ww1.sinaimg.cn/thumbnail/40d61044jw1e9qefdld2zj20680aqwes.jpg"    2
"http://ww1.sinaimg.cn/thumbnail/53fcb2b7jw1dwky1br0c1j.jpg"  2
"http://ww1.sinaimg.cn/thumbnail/568e0e3bgw1dwj4rguztxj.jpg"  2
"http://ww1.sinaimg.cn/thumbnail/62ddcfb5jw1e1h4vefh3uj.jpg"  2
"http://ww1.sinaimg.cn/thumbnail/7390d6e5jw1dx9v40n5qrj.jpg"  2
"http://ww1.sinaimg.cn/thumbnail/7d440be4jw1e5nqzfiw8nj20co2iqqro.jpg"    2
-- 3. Count the photos that are referenced more than once
select count(*) totalcount
from
(select
picurl,count(*) totalcount
from ti11_step01
group by picurl
having totalcount>1) a;

totalcount
43
-- Steps 1, 2 and 3 combined into a single query
select count(*) totalcount
from
(select
picurl,count(*) totalcount
from
(select
weiboid,pic.picurl picurl
from weibo
lateral view explode(split(substr(pic_list,2,length(pic_list)-2),",")) pic as picurl
where pic_list<>"[]") b
group by picurl
having totalcount>1) a;

totalcount
43
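
The sample of ti11_step01 above lists the same (weiboid, picurl) pair twice, which suggests the raw data contains duplicate records. With count(*), such a photo is counted as referenced twice even though only one post uses it. The query below is a hedged sketch that counts distinct posts per photo instead. It has not been run against this data, so the result may differ from 43.

```sql
-- Count photos that appear in more than one distinct post
select count(*) as totalcount
from (
  select picurl
  from (
    select weiboid, pic.picurl as picurl
    from weibo
    lateral view explode(split(substr(pic_list, 2, length(pic_list) - 2), ',')) pic as picurl
    where pic_list <> '[]'
  ) b
  group by picurl
  having count(distinct weiboid) > 1
) a;
```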

