Sparkling Water: Introduction and Practice (Command Line)
Sparkling Water integrates Spark and H2O into a single tool. The core idea is to use H2O for data mining while Spark handles data processing and part of the computation. The overall architecture is as follows:
As the figure shows, Spark preprocesses the source data and hands it to H2O for modeling, and Spark also serves as the compute engine during prediction. The great strength of Sparkling Water is that it uses the same data structures as Spark, which makes data processing very flexible.
When loading data we can use either Spark or H2O, and the two can share the same data structure (an RDD/DataFrame). For data mining, however, H2O can only work on its own .hex frames, so a conversion is needed before computation.
Sharing the RDD structure has many benefits, for example using Spark to clean the data before modeling. With that said, let's look at how to use it.
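As a minimal sketch of that round trip (this assumes a running sparkling-shell session where `spark` is the SparkSession, as set up in the steps below; the toy DataFrame and variable names here are purely illustrative):

```scala
import org.apache.spark.h2o._

// Start (or reuse) the H2O cluster inside the Spark application
val hc = H2OContext.getOrCreate(spark)

// Prepare / clean data on the Spark side
val df = spark.range(5).toDF("x")

// Hand it to H2O for mining, then bring results back as a DataFrame
val hf   = hc.asH2OFrame(df)
val back = hc.asDataFrame(hf)
```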
1. Download and Installation
(1) Download from the official link
The official download is quite slow (unless you have a VPN), so I also provide a copy on my own netdisk.
(2) Upload the archive to your Linux machine and unzip it
unzip sparkling-water-3.26.2-2.4.zip
(3) Start sparkling-water
Find the sparkling-shell script under bin in the unzipped directory and run it:
./sparkling-shell
The startup output looks like this:
[root@node11 bin]# ./sparkling-shell

Using Spark defined in the SPARK_HOME=/opt/software/spark-2.4.3-bin-hadoop2.7 environmental property

-----
  Spark master (MASTER)     : local[*]
  Spark home (SPARK_HOME)   : /opt/software/spark-2.4.3-bin-hadoop2.7
  H2O build version         : 3.26.0.2 (yau)
  Sparkling Water version   : 3.26.2-2.4
  Spark build version       : 2.4.3
  Scala version             : 2.11
----

19/08/20 15:59:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://node11:4040
Spark context available as 'sc' (master = local[*], app id = local-1566316763461).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_221)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.spark.h2o._
import org.apache.spark.h2o._

scala>
With that, the single-node sparkling-water is ready to use.
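sparkling-shell forwards its options to spark-shell, so you can pick a different master or memory settings the same way; the master URL below is hypothetical:

```shell
./sparkling-shell --master spark://node11:7077 --driver-memory 4g
```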
2. A Worked Example
This walkthrough follows the official examples at https://github.com/h2oai/sparkling-water/tree/master/examples.
(1) Import the spark.h2o package
import org.apache.spark.h2o._
(2) Initialize, which starts H2O
val hc = H2OContext.getOrCreate(spark)
The command produces output like this:
19/08/20 16:04:44 WARN internal.InternalH2OBackend: To avoid non-deterministic behavior of Spark broadcast-based joins,
we recommend to set `spark.sql.autoBroadcastJoinThreshold` property of SparkSession to -1.
E.g. spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
We also recommend to avoid using broadcast hints in your Spark SQL code.
19/08/20 16:04:44 WARN internal.InternalH2OBackend: Increasing 'spark.locality.wait' to value 0 (Infinitive) as we need to ensure we run on the nodes with H2O
hc: org.apache.spark.h2o.H2OContext =
Sparkling Water Context:
 * Sparkling Water Version: 3.26.2-2.4
 * H2O name: sparkling-water-root_local-1566316763461
 * cluster size: 1
 * list of used nodes:
  (executorId, host, port)
  ------------------------
  (driver,node11,54321)
  ------------------------

  Open H2O Flow in browser: http://192.168.12.137:54321 (CMD + click in Mac OSX)
(3) Import the implicit conversions from hc
import hc.implicits._
(4) Import the implicit conversions from spark
import spark.implicits._
(5) Define the path to the weather data
val weatherDataFile = "/opt/software/sparkling-water-3.26.2-2.4/examples/smalldata/chicago/Chicago_Ohare_International_Airport.csv"
Note that an absolute path works best here. The official example uses a relative path, which you have to adapt yourself; I used the absolute path directly, otherwise the load in the next step fails with a path-not-found error:
org.apache.spark.sql.AnalysisException: Path does not exist: file:/opt/software/sparkling-water-3.26.2-2.4/bin/examples/smalldata/chicago/Chicago_Ohare_International_Airport.csv;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:558)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:355)
  at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:545)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:615)
  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:467)
  ... 55 elided
(6) Load the data from the path defined above
import org.apache.spark.sql.functions._  // needed for to_date, year, month, dayofmonth

val weatherTable = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(weatherDataFile)
  .withColumn("Date", to_date('Date, "MM/dd/yyyy"))
  .withColumn("Year", year('Date))
  .withColumn("Month", month('Date))
  .withColumn("DayofMonth", dayofmonth('Date))
(7) Import java.io.File
import java.io.File
(8) Define the path to the airline data
val dataFile = "/opt/software/sparkling-water-3.26.2-2.4/examples/smalldata/airlines/allyears2k_headers.zip"
(9) Load the airline data
val airlinesH2OFrame = new H2OFrame(new File(dataFile))
which returns something like:
airlinesH2OFrame: water.fvec.H2OFrame =
Frame key: allyears2k_headers.hex
   cols: 31
   rows: 43978
 chunks: 1
   size: 2154864
(10) Convert the H2OFrame (.hex) into a Spark DataFrame
val airlinesTable = hc.asDataFrame(airlinesH2OFrame)
(11) Filter the data with Spark
val flightsToORD = airlinesTable.filter('Dest === "ORD")
(12) Count how many rows remain after filtering
flightsToORD.count
Result:
19/08/20 16:24:08 WARN util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
res0: Long = 2103
(13) Join the two DataFrames with Spark
val joinedDf = flightsToORD.join(weatherTable, Seq("Year", "Month", "DayofMonth"))
(14) Import the H2OFrame helpers
import water.support.H2OFrameSupport._
(15) Convert the joined DataFrame to an H2OFrame (.hex), marking the date columns as categorical
val joinedHf = columnsToCategorical(hc.asH2OFrame(joinedDf), Array("Year", "Month", "DayofMonth"))
(16) Import the deep learning classes
import _root_.hex.deeplearning.DeepLearning
import _root_.hex.deeplearning.DeepLearningModel.DeepLearningParameters
import _root_.hex.deeplearning.DeepLearningModel.DeepLearningParameters.Activation
(17) Set the deep learning parameters
val dlParams = new DeepLearningParameters()
dlParams._train = joinedHf
dlParams._response_column = "ArrDelay"
dlParams._epochs = 5
dlParams._activation = Activation.RectifierWithDropout
dlParams._hidden = Array[Int](100, 100)
(18) Train the model
val dl = new DeepLearning(dlParams)
val dlModel = dl.trainModel.get
Output:
dlModel: hex.deeplearning.DeepLearningModel =
Model Metrics Type: Regression
 Description: Metrics reported on full training frame
 model id: DeepLearning_model_1566317084383_1
 frame id: frame_rdd_21_b16087b00dcb5349ed00b2f0a1964249
 MSE: 246.94397
 RMSE: 15.714451
 mean residual deviance: 246.94397
 mean absolute error: 9.7153425
 root mean squared log error: NaN

Variable Importances:
  Variable       Relative Importance  Scaled Importance  Percentage
  DepDelay       1.000000             1.000000           0.020609
  NASDelay       0.953474             0.953474           0.019650
  Diverted       0.952912             0.952912           0.019639
  Cancelled      0.940236             0.940236           0.019378
  DayofMonth.12  0.929144             ...
(19) Score with the model
val predictionsHf = dlModel.score(joinedHf)
val predictionsDf = hc.asDataFrame(predictionsHf)
(20) View the predictions
predictionsDf.show
+-------------------+
| predict|
+-------------------+
| -14.28115203904661|
|-17.384369532025993|
|-15.648360659746515|
|-21.735323004320165|
|-0.4630290696992674|
| -9.351177667940217|
| 112.65659409295617|
| 30.161421574369385|
| 15.403270012684139|
| 170.8349751399989|
| 12.498370529294341|
| 147.3795710418184|
|-6.1483336982319585|
| 44.329600499888926|
| 17.50615431570487|
| 102.51282569095915|
| 7.4154391246514955|
| 9.09458182717221|
|-12.357870505795454|
|-14.798434263256837|
+-------------------+
only showing top 20 rows
The official materials inspect the results in RStudio; viewing them this way is not ideal, since show displays at most 20 rows by default.
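From the shell you can at least widen the view: `show` accepts an explicit row count and a truncate flag, or you can write the full result out (the row count and output path below are just examples):

```scala
// Print up to 100 rows without truncating wide values
predictionsDf.show(100, false)

// Or persist all predictions for inspection outside the shell
predictionsDf.write.mode("overwrite").csv("/tmp/ord_predictions")
```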