Integrating Spark with Hive: Hive Operations and Functions
1. Copy the hive-site.xml file into Spark's conf directory.
2.[hadoop@hadoop002 bin]$ ./spark-shell --master local[2] --jars ~/software/mysql-connector-java-5.1.47.jar
Note: use the 5.x series of mysql-connector-java.
scala> spark.sql("show databases").show
+------------+
|databaseName|
+------------+
| default|
| test|
+------------+
scala> spark.sql("select * from test.wc").show
20/02/19 08:50:05 WARN metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
+-----------------+
| sentence|
+-----------------+
|hello hello hello|
| spark hadoop|
| hive|
+-----------------+
3. Another way to launch
[hadoop@hadoop002 bin]$ ./spark-sql --master local --jars ~/software/mysql-connector-java-5.1.47.jar --driver-class-path ~/software/mysql-connector-java-5.1.47.jar
--driver-class-path indicates that the driver side also needs this jar. Another option is to place the jar under $SPARK_HOME's lib directory, but then every Spark program will load it on startup.
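A third option is to configure the driver classpath once in conf/spark-defaults.conf instead of passing it on every launch. A minimal sketch, assuming the jar path used in this cluster (adjust to your own path):

```properties
# conf/spark-defaults.conf -- path below is this cluster's; adjust to yours
spark.driver.extraClassPath /home/hadoop/software/mysql-connector-java-5.1.47.jar
```

This keeps the launch command short, at the cost of applying the setting to every Spark application started from this installation.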
spark-sql (default)> desc formatted wc;
20/02/19 21:08:05 INFO metastore.HiveMetaStore: 0: get_database: test
20/02/19 21:08:05 INFO HiveMetaStore.audit: ugi=hadoop ip=unknown-ip-addr cmd=get_database: test
20/02/19 21:08:05 INFO metastore.HiveMetaStore: 0: get_table : db=test tbl=wc
20/02/19 21:08:05 INFO HiveMetaStore.audit: ugi=hadoop ip=unknown-ip-addr cmd=get_table : db=test tbl=wc
20/02/19 21:08:05 INFO metastore.HiveMetaStore: 0: get_table : db=test tbl=wc
20/02/19 21:08:05 INFO HiveMetaStore.audit: ugi=hadoop ip=unknown-ip-addr cmd=get_table : db=test tbl=wc
20/02/19 21:08:05 INFO codegen.CodeGenerator: Code generated in 181.19869 ms
col_name data_type comment
sentence string NULL
# Detailed Table Information
Database test
Table wc
Owner hadoop
Created Time Sun Nov 10 16:53:07 CST 2019
Last Access Thu Jan 01 08:00:00 CST 1970
Created By Spark 2.2 or prior
Type MANAGED
Provider hive
Table Properties [transient_lastDdlTime=1573378511]
Statistics 36 bytes
Location hdfs://hadoop002:8020/user/hive/warehouse/test.db/wc
Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat org.apache.hadoop.mapred.TextInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Storage Properties [serialization.format=1]
Partition Provider Catalog
Time taken: 0.511 seconds, Fetched 19 row(s)
20/02/19 21:08:05 INFO thriftserver.SparkSQLCLIDriver: Time taken: 0.511 seconds, Fetched 19 row(s)
spark-sql (default)>
4. Using the thriftserver and beeline
4.1 Start the long-running service
[hadoop@hadoop002 sbin]$ ./start-thriftserver.sh --help
[hadoop@hadoop002 sbin]$ ./start-thriftserver.sh --jars ~/software/mysql-connector-java-5.1.47.jar
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to /home/hadoop/app/spark-2.4.4-bin-2.6.0-cdh5.15.1/logs/spark-hadoop-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-hadoop002.out
[hadoop@hadoop002 sbin]$ tail -200f /home/hadoop/app/spark-2.4.4-bin-2.6.0-cdh5.15.1/logs/spark-hadoop-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-hadoop002.out
*****
20/02/19 09:37:16 INFO service.AbstractService: Service:ThriftBinaryCLIService is started.
20/02/19 09:37:16 INFO service.AbstractService: Service:HiveServer2 is started.
20/02/19 09:37:16 INFO thriftserver.HiveThriftServer2: HiveThriftServer2 started
20/02/19 09:37:16 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@72c5064f{/sqlserver,null,AVAILABLE,@Spark}
20/02/19 09:37:16 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4a0c04ab{/sqlserver/json,null,AVAILABLE,@Spark}
20/02/19 09:37:16 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5d9d8ecf{/sqlserver/session,null,AVAILABLE,@Spark}
20/02/19 09:37:16 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@43cc7951{/sqlserver/session/json,null,AVAILABLE,@Spark}
20/02/19 09:37:17 INFO thrift.ThriftCLIService: Starting ThriftBinaryCLIService on port 10000 with 5...500 worker threads
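As the log shows, the ThriftServer listens on port 10000 by default. Per the Spark "Distributed SQL Engine" documentation, the listening port and host can be overridden with --hiveconf at startup; a sketch using this cluster's jar path and a hypothetical alternate port:

```shell
./start-thriftserver.sh \
  --jars ~/software/mysql-connector-java-5.1.47.jar \
  --hiveconf hive.server2.thrift.port=10001 \
  --hiveconf hive.server2.thrift.bind.host=hadoop002
```

Clients would then connect with jdbc:hive2://hadoop002:10001 instead.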
4.2 Start beeline
[hadoop@hadoop002 spark-2.4.4-bin-2.6.0-cdh5.15.1]$ ./bin/beeline -u jdbc:hive2://hadoop002:10000
Connecting to jdbc:hive2://hadoop002:10000
20/02/19 09:48:58 INFO jdbc.Utils: Supplied authorities: hadoop002:10000
20/02/19 09:48:58 INFO jdbc.Utils: Resolved authority: hadoop002:10000
20/02/19 09:48:58 INFO jdbc.HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://hadoop002:10000
Error: Failed to open new session: java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.security.AccessControlException: Permission denied: user=anonymous, access=EXECUTE, inode="/tmp":hadoop:supergroup:drwx------
The anonymous user has no permission on /tmp, so reconnect as the hadoop user with -n:
[hadoop@hadoop002 bin]$ ./beeline -n hadoop -u jdbc:hive2://hadoop002:10000
Connecting to jdbc:hive2://hadoop002:10000
20/02/19 10:19:18 INFO jdbc.Utils: Supplied authorities: hadoop002:10000
20/02/19 10:19:18 INFO jdbc.Utils: Resolved authority: hadoop002:10000
20/02/19 10:19:18 INFO jdbc.HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://hadoop002:10000
Connected to: Spark SQL (version 2.4.4)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1.spark2 by Apache Hive
0: jdbc:hive2://hadoop002:10000> show databases;
+---------------+--+
| databaseName |
+---------------+--+
| default |
| test |
+---------------+--+
2 rows selected (0.642 seconds)
0: jdbc:hive2://hadoop002:10000> use test;
+---------+--+
| Result |
+---------+--+
+---------+--+
No rows selected (0.05 seconds)
0: jdbc:hive2://hadoop002:10000> select * from wc;
+--------------------+--+
| sentence |
+--------------------+--+
| hello hello hello |
| spark hadoop |
| hive |
+--------------------+--+
3 rows selected (1.471 seconds)
0: jdbc:hive2://hadoop002:10000>
5. ThriftServer vs. an ordinary Spark application
The ThriftServer is a long-running service, up 24/7; an ordinary application is gone once it finishes.
The former requests resources only once, at startup; the latter must request resources on every launch.
With the former, multiple submitted queries can share resources, for example cached data.
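Because every beeline session talks to the ThriftServer's single SparkContext, a table cached in one session is visible to all others. A sketch of what this sharing looks like, run inside beeline against the server started above:

```sql
-- session A: cache the table in the ThriftServer's shared memory
CACHE TABLE test.wc;

-- session B (a separate beeline connection): this scan reads the cached data
SELECT * FROM test.wc;
```

An ordinary Spark application cannot share a cache this way, since its SparkContext dies with it.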
6. Accessing the ThriftServer from client code via JDBC
import java.sql.DriverManager

/** Query data through a JDBC client connection to the ThriftServer */
object JDBC2ThriftClientApp {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val connection = DriverManager.getConnection("jdbc:hive2://hadoop002:10000")
    val pstmt = connection.prepareStatement("select * from test.wc")
    val rs = pstmt.executeQuery()
    while (rs.next()) {
      println(rs.getObject(1))
    }
    rs.close()
    pstmt.close()
    connection.close()
  }
}
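To compile such a client, only a Hive JDBC driver is needed on the classpath. One possible build.sbt fragment; the artifact version is an assumption chosen to match the "Hive JDBC (version 1.2.1.spark2)" shown in the beeline output above, so match it to your own server:

```scala
// build.sbt fragment -- version 1.2.1 is an assumption, align with your ThriftServer
libraryDependencies += "org.apache.hive" % "hive-jdbc" % "1.2.1"
```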
7. Operating Hive from Spark code
import java.util.Properties

import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.{SaveMode, SparkSession}

object HiveSourceApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .appName("HiveSourceApp")
      .enableHiveSupport()
      .getOrCreate()

    // spark.sql("show databases").show()

    // Read the connection settings
    val config = ConfigFactory.load()
    val url = config.getString("db.default.url")
    val user = config.getString("db.default.user")
    val password = config.getString("db.default.password")
    val driver = config.getString("db.default.driver")
    val database = config.getString("db.default.database")
    val table = config.getString("db.default.table")

    // Read data from MySQL
    val connectionProperties = new Properties()
    connectionProperties.put("user", user)
    connectionProperties.put("password", password)

    // TODO: business logic
    val jdbcDF = spark.read.jdbc(url, s"$database.$table", connectionProperties)
    jdbcDF.show()

    // Write the data into Hive
    jdbcDF.write.mode(SaveMode.Append).saveAsTable("test.hive")

    spark.stop()
  }
}
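The config.getString calls above expect a Typesafe Config file such as src/main/resources/application.conf. A sketch with hypothetical values; only the key names are taken from the code, everything on the right-hand side is an assumption to adapt to your environment:

```properties
# application.conf -- values below are placeholders, not from the original post
db.default.url = "jdbc:mysql://hadoop002:3306"
db.default.user = "root"
db.default.password = "root"
db.default.driver = "com.mysql.jdbc.Driver"
db.default.database = "test"
db.default.table = "wc"
```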