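I used a movie dataset found online, which comes as three files: movies.dat, users.dat, and ratings.dat. They hold roughly 3,000, 6,000, and 1 million records respectively, a good size for experimenting.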

First, the data layout:

RATINGS FILE DESCRIPTION
================================================================================
All ratings are contained in the file "ratings.dat" and are in the
following format:

UserID::MovieID::Rating::Timestamp

- UserIDs range between 1 and 6040
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings
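As a quick illustration of the format above, here is a minimal Python sketch that splits one ratings.dat line into typed fields and converts the epoch timestamp to a UTC datetime. The `Rating` tuple and `parse_rating` function are names made up for this example, and the sample line is the first row from the query output later in this post, put back into `::` form.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: int
    rated_at: datetime  # timezone-aware, UTC


def parse_rating(line: str) -> Rating:
    """Parse one 'UserID::MovieID::Rating::Timestamp' line."""
    user_id, movie_id, rating, ts = line.strip().split("::")
    return Rating(
        int(user_id),
        int(movie_id),
        int(rating),
        # The timestamp field is seconds since the Unix epoch.
        datetime.fromtimestamp(int(ts), tz=timezone.utc),
    )


if __name__ == "__main__":
    print(parse_rating("1::1193::5::978300760"))
```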

USERS FILE DESCRIPTION
================================================================================
User information is in the file "users.dat" and is in the following
format:

UserID::Gender::Age::Occupation::Zip-code

All demographic information is provided voluntarily by the users and is
not checked for accuracy. Only users who have provided some demographic
information are included in this data set.

- Gender is denoted by a "M" for male and "F" for female
- Age is chosen from the following ranges:

* 1: "Under 18"
* 18: "18-24"
* 25: "25-34"
* 35: "35-44"
* 45: "45-49"
* 50: "50-55"
* 56: "56+"

- Occupation is chosen from the following choices:

* 0: "other" or not specified
* 1: "academic/educator"
* 2: "artist"
* 3: "clerical/admin"
* 4: "college/grad student"
* 5: "customer service"
* 6: "doctor/health care"
* 7: "executive/managerial"
* 8: "farmer"
* 9: "homemaker"
* 10: "K-12 student"
* 11: "lawyer"
* 12: "programmer"
* 13: "retired"
* 14: "sales/marketing"
* 15: "scientist"
* 16: "self-employed"
* 17: "technician/engineer"
* 18: "tradesman/craftsman"
* 19: "unemployed"
* 20: "writer"
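The age and occupation columns are stored as numeric codes, so a lookup is needed to read them. A minimal Python sketch of decoding one users.dat line follows; `User`, `parse_user`, `AGE_LABELS`, and `OCCUPATIONS` are names invented for this example, with the labels copied from the lists above.

```python
from typing import NamedTuple

# Age codes are the lower bound of each range, not the age itself.
AGE_LABELS = {
    1: "Under 18", 18: "18-24", 25: "25-34", 35: "35-44",
    45: "45-49", 50: "50-55", 56: "56+",
}

# The list index is the occupation code (0..20).
OCCUPATIONS = [
    "other or not specified", "academic/educator", "artist",
    "clerical/admin", "college/grad student", "customer service",
    "doctor/health care", "executive/managerial", "farmer", "homemaker",
    "K-12 student", "lawyer", "programmer", "retired", "sales/marketing",
    "scientist", "self-employed", "technician/engineer",
    "tradesman/craftsman", "unemployed", "writer",
]


class User(NamedTuple):
    user_id: int
    gender: str
    age_group: str
    occupation: str
    zip_code: str  # kept as text so leading zeros survive


def parse_user(line: str) -> User:
    """Parse one 'UserID::Gender::Age::Occupation::Zip-code' line."""
    user_id, gender, age, occupation, zip_code = line.strip().split("::")
    return User(
        int(user_id),
        gender,
        AGE_LABELS[int(age)],
        OCCUPATIONS[int(occupation)],
        zip_code,
    )


if __name__ == "__main__":
    print(parse_user("1::F::1::10::48067"))
```

Keeping the zip code as a string matters for values such as 02460, which would lose the leading zero as an integer.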

MOVIES FILE DESCRIPTION
================================================================================

Movie information is in the file "movies.dat" and is in the following
format:

MovieID::Title::Genres

- Titles are identical to titles provided by the IMDB (including
year of release)
- Genres are pipe-separated and are selected from the following genres:

* Action
* Adventure
* Animation
* Children's
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
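A movies.dat line packs two extra pieces of structure into its fields: the release year at the end of the title and a pipe-separated genre list. The Python sketch below pulls both out. `Movie` and `parse_movie` are hypothetical names, and the sample line is only an illustration of the documented format.

```python
import re
from typing import List, NamedTuple, Optional


class Movie(NamedTuple):
    movie_id: int
    title: str            # full title, year included
    year: Optional[int]   # None if the title has no trailing "(YYYY)"
    genres: List[str]


# Matches a four-digit year in parentheses at the end of the title.
_YEAR_AT_END = re.compile(r"\((\d{4})\)\s*$")


def parse_movie(line: str) -> Movie:
    """Parse one 'MovieID::Title::Genres' line."""
    movie_id, title, genres = line.rstrip("\r\n").split("::")
    match = _YEAR_AT_END.search(title)
    year = int(match.group(1)) if match else None
    return Movie(int(movie_id), title, year, genres.split("|"))


if __name__ == "__main__":
    print(parse_movie("1::Toy Story (1995)::Animation|Children's|Comedy"))
```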

****************************************************************************************************

2. The main part

Create the database and the tables:

create database movies;
use movies;
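-- First attempts at a table (both fail to parse: Hive declares a column as "name TYPE" with no colon, and Long is not a Hive type)
CREATE TABLE users(userid:Long);
create table users(userid:Bigint);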
CREATE TABLE ratings(userid Int,movieid Int,rating Int,timestamp Timestamp)PARTITIONED BY(dt String) ROW FORMAT DELIMITED FIELDS TERMINATED BY '::';
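Error: FAILED: ParseException line 1:55 Failed to recognize predicate 'timestamp'. Failed rule: 'identifier' in column specification

timestamp is a reserved keyword in Hive 1.2.1, so it cannot be used as a bare column name. Rename the column and try again, this time with ',' as the delimiter: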

CREATE TABLE ratings(userid Int,movieid Int,rating Int,timestamped Timestamp)PARTITIONED BY(dt String) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/home/dyq/Documents/movies/ratings-douhao.dat' into table ratings PARTITION(dt="20161201");
hive> select * from ratings limit 10;
OK
1 1193 5 NULL 20161201
1 661 3 NULL 20161201
1 914 3 NULL 20161201
1 3408 4 NULL 20161201
1 2355 5 NULL 20161201
1 1197 3 NULL 20161201
1 1287 5 NULL 20161201
1 2804 5 NULL 20161201
1 594 4 NULL 20161201
1 919 4 NULL 20161201

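Every timestamp comes back NULL. The column is declared as Timestamp, and Hive only parses text of the form yyyy-mm-dd hh:mm:ss into that type, so the raw epoch seconds in the file are rejected. Declaring the column as String avoids this, so the table is recreated below. As for the delimiter, "::" caused trouble because FIELDS TERMINATED BY only honors a single character, which is why the data files were converted to use my preferred ",".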
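The conversion from "::" to "," is not shown in this post. A minimal Python sketch of one way to do it follows; `convert_line` and `convert_file` are hypothetical helper names, and the latin-1 encoding is an assumption about the files. The check for a pre-existing delimiter matters mostly for movies.dat, whose titles can contain commas (for example a title written as "American President, The (1995)"); for such a file a tab is a safer target than a comma.

```python
def convert_line(line: str, old: str = "::", new: str = ",") -> str:
    """Replace the field separator, refusing to corrupt the row."""
    fields = line.rstrip("\r\n").split(old)
    for field in fields:
        if new in field:
            # The new delimiter already occurs inside a field, so the
            # converted row would split into too many columns.
            raise ValueError(f"field {field!r} already contains {new!r}")
    return new.join(fields)


def convert_file(src: str, dst: str, new: str = ",") -> int:
    """Convert src into dst line by line; return the number of lines."""
    count = 0
    # Assumption: latin-1, which can decode any byte sequence.
    with open(src, encoding="latin-1") as fin, \
            open(dst, "w", encoding="latin-1") as fout:
        for line in fin:
            fout.write(convert_line(line, new=new) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    print(convert_line("1::1193::5::978300760"))
```

If a tab is used instead, the table definition would need FIELDS TERMINATED BY '\t' to match.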

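drop table ratings;
CREATE TABLE ratings(userid Int,movieid Int,rating Int,timestamped String)PARTITIONED BY(dt String) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- reload the data, since dropping the table removed it
LOAD DATA LOCAL INPATH '/home/dyq/Documents/movies/ratings-douhao.dat' into table ratings PARTITION(dt="20161201");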

hive> select * from ratings limit 10;
OK
1 1193 5 978300760 20161201
1 661 3 978302109 20161201
1 914 3 978301968 20161201
1 3408 4 978300275 20161201
1 2355 5 978824291 20161201
1 1197 3 978302268 20161201
1 1287 5 978302039 20161201
1 2804 5 978300719 20161201
1 594 4 978302268 20161201
1 919 4 978301368 20161201
Time taken: 0.122 seconds, Fetched: 10 row(s)

Everything works now! I have to say, Hive's parsing is not very powerful.

Next, create the movies and users tables.

CREATE TABLE movies(movieid Int,title String,genres String)PARTITIONED BY(dt String) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/home/dyq/Documents/movies/movies-douhao.dat' into table movies PARTITION(dt="20161201");

CREATE TABLE users(userid Int,gender String,age Int,occupation String,zip-code String)PARTITIONED BY(dt String) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

FAILED: ParseException line 1:73 cannot recognize input near '-' 'code' 'String' in column type

A hyphen is not allowed in an unquoted column name, so zip-code becomes zipcode:

CREATE TABLE users(userid Int,gender String,age Int,occupation String,zipcode String)PARTITIONED BY(dt String) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/home/dyq/Documents/movies/users-douhao.dat' into table users PARTITION(dt="20161201");

hive> select * from users limit 10;
OK
1 F 1 10 48067 20161201
2 M 56 16 70072 20161201
3 M 25 15 55117 20161201
4 M 45 7 02460 20161201
5 M 25 20 55455 20161201
6 F 50 9 55117 20161201
7 M 35 1 06810 20161201
8 M 25 12 11413 20161201
9 M 25 17 61614 20161201
10 F 35 1 95370 20161201
Time taken: 0.168 seconds, Fetched: 10 row(s)

*****************************************************************
Creating indexes:

create index ratings_userid_index on table ratings(userid) as 'COMPACT' with deferred rebuild;
show index on ratings;
drop index ratings_userid_index on ratings;

create index ratings_movieid_index on table ratings(movieid) as 'COMPACT' with deferred rebuild;
show index on ratings;
drop index ratings_movieid_index on ratings;

Join before adding the index:
select movies.movieid,movies.title,ratings.rating from movies join ratings on(movies.movieid=ratings.movieid);
Time taken: 40.721 seconds, Fetched: 1000209 row(s)

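Join after adding the index (same query):
Time taken: 40.816 seconds, Fetched: 1000209 row(s)

The two timings are essentially identical. That is unsurprising: a join with no filter has to read every row of ratings anyway, and an index created WITH DEFERRED REBUILD stays empty until ALTER INDEX ... REBUILD is run.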

Looking up a single value:
select movies.movieid,movies.title,ratings.rating from movies join ratings on(movies.movieid=ratings.movieid) where movies.movieid=2716;
Time taken: 33.834 seconds, Fetched: 2181 row(s)

After indexing:
drop index ratings_movieid_index on ratings;
drop index ratings_userid_index on ratings;
select movies.movieid,movies.title,ratings.rating from movies join ratings on(movies.movieid=ratings.movieid) where movies.movieid=2716;

Time taken: 29.428 seconds, Fetched: 2181 row(s)

Reposted from: https://www.cnblogs.com/PMP4561705/p/6123809.html
