Hive Weibo Data Statistical Analysis

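1. Data description: historical weibo data of users, up to 2013-12-15.
The data is 221 MB compressed and 878 MB uncompressed, split across 1206 small files, all in JSON format.
2. Sample record: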

[{"beCommentWeiboId":"","beForwardWeiboId":"","catchTime":"1387157643","commentCount":"682","content":"喂！2014。。。2014！喂。。。","createTime":"1387086483","info1":"","info2":"","info3":"","mlevel":"","musicurl":[],"pic_list":["http://ww1.sinaimg.cn/square/47119b17jw1ebkc9b07x9j218g0xcair.jpg","http://ww4.sinaimg.cn/square/47119b17jw1ebkc9ebakij218g0xc113.jpg","http://ww2.sinaimg.cn/square/47119b17jw1ebkc9hml7dj218g0xcgt6.jpg","http://ww3.sinaimg.cn/square/47119b17jw1ebkc9kyakyj218g0xcqb3.jpg"],"praiseCount":"1122","reportCount":"671","source":"iPhone客户端","userId":"1192336151","videourl":[],"weiboId":"3655768039404271","weiboUrl":"http://weibo.com/1192336151/AnoMrDstN"}]

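3. Field description
19 fields in total

Field             Description                              Type
beCommentWeiboId  whether the weibo is a comment           string
beForwardWeiboId  whether the weibo is a forward (repost)  string
catchTime         crawl time                               string
commentCount      number of comments                       int
content           content                                  string
createTime        creation time                            string
info1             info field 1                             string
info2             info field 2                             string
info3             info field 3                             string
mlevel            not sure                                 string
musicurl          music link                               string
pic_list          photo list (may hold several)            string
praiseCount       number of likes                          int
reportCount       number of forwards (reposts)             int
source            data source                              string
userId            user ID                                  string
videourl          video link                               string
weiboId           weibo ID                                 string
weiboUrl          weibo URL                                string

4. Functional requirements
Create the tables as external tables.
Data storage directory: hdfs://hadoop01:9000/data/weibo
1. Data processing: propose a solution for the data problem (15 points)
Too many data files: they need to be merged; give a solution.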

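2. Organize the data (10 points)
(Create the Hive table weibo_json(json string) with a single column, load all the data, and verify by querying the first 5 rows.)
(Parse the JSON data in weibo_json into the 19-field weibo table; write the necessary SQL statements.)

3. Count the total number of weibos and the number of distinct users (7 points)

4. Sum the number of times all of each user's weibos were forwarded, output the top 5 users with their totals (7 points)

5. Count the weibos that contain pictures (7 points)

6. Count the distinct users who post from an iPhone (7 points)

7. Add the number of likes and the number of forwards of each weibo, sort the sums in descending order, take the first 10 records, and output userid and the total (6 points)

8. Collect the user ID and data source of weibos with fewer than 1000 comments into a view, then count the users in the view whose data source is "ipad客户端" (iPad client) (10 points)

9. Find the user whose weibo content contains "iphone" the most times; output the user ID and the count (note: the count is the number of occurrences of "iphone", not the number of weibos that contain "iphone") (15 points) ***** custom function

10. For each day, find the ID of the user who posted the most weibos and the number of weibos posted (8 points)

11. Count the photos that are referenced more than once (the same photo appears in several weibos; more than 1 weibo counts as several) (8 points)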
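
Requirement 1 (too many small files) has no worked answer in this write-up. The following is a minimal sketch of one common approach, assuming Hive runs on MapReduce and the raw table weibo_json from the next section already exists: combine small files into fewer input splits on read, merge small output files on write, and rewrite the raw data once into a compacted table. The table name weibo_json_merged and the two size thresholds are illustrative assumptions, not values from the original exercise.

```sql
-- Read side: pack many small files into each input split
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- Write side: merge small output files of map-only and map-reduce jobs
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;

-- Target size of a merged file, and the average output size below which
-- a merge is triggered (both in bytes; assumed values, tune per cluster)
set hive.merge.size.per.task=256000000;
set hive.merge.smallfiles.avgsize=128000000;

-- Hypothetical compacted copy of the raw one-column table
create table if not exists weibo_json_merged (json string);
insert overwrite table weibo_json_merged select json from weibo_json;
```

Later steps could then read from the compacted table instead of the 1206 original files.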

2. Organizing the data

-- Create the database
create database weibo;
use weibo;

-- Option 1: create the table, then load the local files into it
create table weibo_json (json string);
load data local inpath "/home/hadoop/hive_data/weibo" into table weibo_json;

-- Option 2 (use instead of option 1): create an external table, as the requirements ask,
-- directly over a directory that already holds the data
create external table weibo_json (json string) location "/home/hadoop/hive_data/weibo";

-- Verify
select json from weibo_json limit 5;

-- Create the 19-field weibo table
create table if not exists weibo (
  beCommentWeiboId string,
  beForwardWeiboId string,
  catchTime string,
  commentCount int,
  content string,
  createTime string,
  info1 string,
  info2 string,
  info3 string,
  mlevel string,
  musicurl string,
  pic_list string,
  praiseCount int,
  reportCount int,
  source string,
  userId string,
  videourl string,
  weiboId string,
  weiboUrl string
) row format delimited fields terminated by '\t';

-- Parse the data in weibo_json into the weibo table.
-- General form of one field extraction: get_json_object(line, '$.userid') as userid
insert into table weibo select
get_json_object(json,'$[0].beCommentWeiboId') as beCommentWeiboId,
get_json_object(json,'$[0].beForwardWeiboId') as beForwardWeiboId,
get_json_object(json,'$[0].catchTime') as catchTime,
get_json_object(json,'$[0].commentCount') as commentCount,
get_json_object(json,'$[0].content') as content,
get_json_object(json,'$[0].createTime') as createTime,
get_json_object(json,'$[0].info1') as info1,
get_json_object(json,'$[0].info2') as info2,
get_json_object(json,'$[0].info3') as info3,
get_json_object(json,'$[0].mlevel') as mlevel,
get_json_object(json,'$[0].musicurl') as musicurl,
get_json_object(json,'$[0].pic_list') as pic_list,
get_json_object(json,'$[0].praiseCount') as praiseCount,
get_json_object(json,'$[0].reportCount') as reportCount,
get_json_object(json,'$[0].source') as source,
get_json_object(json,'$[0].userId') as userId,
get_json_object(json,'$[0].videourl') as videourl,
get_json_object(json,'$[0].weiboId') as weiboId,
get_json_object(json,'$[0].weiboUrl') as weiboUrl
from weibo_json;
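
The statement above calls get_json_object once per field, so each row's JSON string is parsed 19 times. As a hedged alternative, json_tuple extracts several keys in one pass. The sketch below assumes every row of weibo_json holds exactly one object wrapped in square brackets, as in the sample record, and shows only three of the 19 keys; the alias t is illustrative.

```sql
-- json_tuple expects a JSON object, so strip the surrounding [ ] first
select t.userId, t.weiboId, t.source
from weibo_json
lateral view json_tuple(
  substr(json, 2, length(json) - 2),   -- drop the leading '[' and trailing ']'
  'userId', 'weiboId', 'source'
) t as userId, weiboId, source
limit 5;
```

If a row ever held more than one object, both this sketch and the $[0] paths above would only see the first one.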

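3. Count the total number of weibos and the number of distinct users (7 points)

select count(*) from weibo;
-- distinct removes duplicate user IDs
select count(distinct userId) from weibo;
Results:
Total number of weibos
hive> select count(*) from weibo ;
OK
1451868
Number of distinct users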
Total MapReduce CPU Time Spent: 23 seconds 890 msec
OK
78540
Time taken: 23.817 seconds, Fetched: 1 row(s)
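
Both figures can also be computed by a single query, which saves one job. A small sketch, where the column aliases weibo_total and user_total are illustrative:

```sql
-- Total weibos and distinct users in one statement
select count(*)               as weibo_total,
       count(distinct userId) as user_total
from weibo;
```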

4. Sum the forward counts (reportCount) of all weibos per user, output the top 5 users with their totals (7 points)

select userId, sum(reportCount) as reportCount, count(*) weibonum from weibo group by userId order by reportCount desc limit 5;
Result (userId, total forwards, number of weibos):
OK
1793285524      76454805        1414
1629810574      73656898        1243
2803301701      68176008        3443
1266286555      55111054        278
1191258123      54808042        411

5. Count the weibos that contain pictures (pic_list) (7 points)

Method 1:
select count(weiboId) from weibo where length(pic_list)>2;
Method 2:
select count(weiboId) tupian from weibo where pic_list like '%http%';
Result:
OK
750512

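6. Count the distinct users who post from an iPhone (7 points)

-- source is lower-cased first, so the pattern must be lower-case as well
select count(distinct userId) from weibo where lower(source) like '%iphone%';
select count(distinct userId) from weibo where instr(lower(source),"iphone")>0;
Result:
OK
936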
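
Before trusting a substring match on source, it can help to look at which source values actually match. A quick sanity-check sketch, where the alias cnt and the limit of 20 are arbitrary choices:

```sql
-- List the source values that the iPhone filter picks up, most frequent first
select source, count(*) as cnt
from weibo
where lower(source) like '%iphone%'
group by source
order by cnt desc
limit 20;
```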

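7. Add the number of likes and the number of forwards of each weibo, sort the sums in descending order, take the first 10 records, and output userid and the total (6 points)

select userId, sum(praiseCount+reportCount) as times from weibo group by userId order by times desc limit 10;
Result: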
1793285524      114941096
1629810574      97612070
1266286555      83789422
2803301701      74208822
1195242865      69292231
1191258123      61985742
1197161814      59093308
2656274875      52380775
2202387347      51623117
1195230310      48321083

8. Collect the user ID and data source of weibos with fewer than 1000 comments into a view, then count the users in the view whose data source is "ipad客户端" (iPad client) (10 points)

create view weibo_source_view as select userId,source from weibo where commentCount<1000;
select * from weibo_source_view limit 10 ;
Result:
2989711735
1087770692      iPad客户端
1390470392
1390470392
1498502972
1087770692      iPad客户端
1589706710
1087770692      iPad客户端
1087770692      iPad客户端
1589706710
-- Count the users in the view whose source is the iPad client
select count(distinct userId) as time from weibo_source_view where lower(source) like '%ipad客户端%';

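9. Find the user whose weibo content contains "iphone" the most times; output the user ID and the count
(Note: the count is the number of occurrences of "iphone", not the number of weibos that contain "iphone") (15 points) ***** custom function

-- Splitting on "iphone" yields one more piece than there are occurrences, hence the -1
select userId,sum(size(split(lcase(content),"iphone"))-1) as no
from weibo
group by userid order by no desc limit 1;
Result: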
OK
1781387491      512
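
The split-based count relies on content never being NULL: in Hive versions where size(NULL) returns -1, a NULL content would contribute -2 to a user's sum. A hedged, built-in-only variant (not the custom function the task hints at) counts occurrences from the length removed by regexp_replace and skips NULL content; the alias iphone_cnt is illustrative.

```sql
-- Occurrences = (characters removed when deleting every 'iphone') / length('iphone')
select userId,
       sum(cast((length(lower(content))
                 - length(regexp_replace(lower(content), 'iphone', ''))) / 6 as int)) as iphone_cnt
from weibo
where content is not null
group by userId
order by iphone_cnt desc
limit 1;
```

The divisor 6 is the length of the search string, so it has to change if a different keyword is counted.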

10. For each day, find the ID of the user who posted the most weibos and the number of weibos posted (8 points)

select userId,from_unixtime(cast(createTime as int ),'yyyy-MM-dd') as weibo_data ,count(*) as no from weibo group by from_unixtime(cast(createTime as int),'yyyy-MM-dd'), userId order by no desc limit 5;
Result:
OK
1601563722      2013-09-23      196
1713926427      2013-12-13      190
1642591402      2013-11-23      165
1713926427      2013-12-15      147
1713926427      2013-12-14      144
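
The query above returns the five busiest (user, day) pairs overall, so a single user can appear for several days while most days do not appear at all. To get one top user for every day, a window function can rank users within each day. A sketch under that reading of the task, where the aliases weibo_date, cnt, rn, daily, and ranked are illustrative:

```sql
select weibo_date, userId, cnt
from (
  select weibo_date, userId, cnt,
         -- rank the users of each day by their number of posts
         row_number() over (partition by weibo_date order by cnt desc) as rn
  from (
    select from_unixtime(cast(createTime as bigint), 'yyyy-MM-dd') as weibo_date,
           userId,
           count(*) as cnt
    from weibo
    group by from_unixtime(cast(createTime as bigint), 'yyyy-MM-dd'), userId
  ) daily
) ranked
where rn = 1
order by weibo_date;
```

row_number() keeps exactly one user per day even when several users tie; replacing it with rank() would keep all tied users.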

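11. Count the photos that are referenced more than once (the same photo appears in several weibos; more than 1 weibo counts as several) (8 points)

First SQL statement (one row per photo URL):
create table tupian_list as
select explode(split(substring(pic_list,2,length(pic_list)-2),",")) as list
from weibo where pic_list !="[]";
Second SQL statement (photos that appear at least twice):
select list,count(*) as cnt from tupian_list group by list having count(*) >=2 order by cnt limit 10 ;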
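
The statements above list photos that appear more than once, while the task asks how many such photos there are. A minimal sketch of that final count, reusing the tupian_list table; the aliases reused and reused_photo_cnt are illustrative:

```sql
-- Number of distinct photo URLs that occur at least twice
select count(*) as reused_photo_cnt
from (
  select list
  from tupian_list
  group by list
  having count(*) >= 2
) reused;
```

Because tupian_list does not keep weiboId, a URL repeated inside a single weibo would also be counted; carrying weiboId into that table and counting distinct weiboId per URL would rule this out.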
