The previous chapter covered how to use H2O. This one looks at H2O's architecture, its interfaces, and a practical example.

1, H2O architecture
H2O's architecture is described in many places; here we walk through the description on the official site.

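The top layer is the client layer, that is, the interface layer. H2O supports several kinds of external clients: JavaScript, R, Python, Excel, Tableau, Flow, and others.
Below it is H2O's core engine layer, the JVM components. Each JVM process is split into three layers (language, algorithms, and core infrastructure), which together provide the execution engine, the algorithm engine, the data engine, and the task-processing engine:
Language layer
  Rapids Expression Evaluation Engine
  Scala
  Customer Algorithm
Algorithm layer
  Parse
  Algorithm (GLM, GBM, DL, K-means, PCA)
  In-H2O Prediction Engine
Core infrastructure layer
  Memory Management (data storage)
    Fluid Vector Frame
    Distributed K/V Store
    Non-blocking HashMap
  CPU Management (computation)
    Job
    MR Task
    Fork/Join
At the bottom are Spark, Hadoop, and standalone H2O.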
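The CPU management layer lists MR Task and Fork/Join. As a rough illustration of the general fork/join divide-and-combine pattern, here is a minimal sketch built only on the JDK's java.util.concurrent classes. It is not H2O's MRTask API; the class names (ChunkSum, SumTask) and the chunk threshold are hypothetical.

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class ChunkSum {
    // Hypothetical task: sums a slice of an array by splitting it in half
    // until the slice is small enough to sum directly.
    static class SumTask extends RecursiveTask<Double> {
        private static final int THRESHOLD = 1_000; // assumed chunk size
        private final double[] data;
        private final int lo;
        private final int hi;

        SumTask(double[] data, int lo, int hi) {
            this.data = data;
            this.lo = lo;
            this.hi = hi;
        }

        @Override
        protected Double compute() {
            if (hi - lo <= THRESHOLD) {
                // "Map" step: process one small chunk sequentially.
                double sum = 0;
                for (int i = lo; i < hi; i++) {
                    sum += data[i];
                }
                return sum;
            }
            int mid = (lo + hi) >>> 1;
            SumTask left = new SumTask(data, lo, mid);
            SumTask right = new SumTask(data, mid, hi);
            left.fork();                       // run the left half asynchronously
            double rightSum = right.compute(); // compute the right half in this thread
            return left.join() + rightSum;     // "reduce" step: combine partial results
        }
    }

    static double parallelSum(double[] data) {
        return ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
    }

    public static void main(String[] args) {
        double[] column = new double[10_000];
        Arrays.fill(column, 1.0);
        System.out.println(parallelSum(column));
    }
}
```

The same shape (split the data into chunks, process each chunk independently, then combine partial results) is what a distributed map/reduce task applies across the nodes of a cluster.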

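Prepare a file named list.txt:
localhost:54321
localhost:54323
Open a terminal and run: java -Xmx2G -jar h2o.jar -ip localhost -flatfile list.txt. Then run the same command a second time. This starts an H2O cluster with two nodes. Open http://localhost:54321/ or http://localhost:54323/ to check the cluster status.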
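The flatfile above holds one host:port entry per line. As a small sketch of that format, the following hypothetical helper (FlatfileUrls and toUrls are illustrative names, not part of H2O) turns such lines into the URLs you would open in a browser. It works on in-memory lines; reading list.txt from disk is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FlatfileUrls {
    // Converts flatfile lines of the form host:port into browser URLs.
    static List<String> toUrls(List<String> lines) {
        List<String> urls = new ArrayList<>();
        for (String line : lines) {
            String entry = line.trim();
            if (entry.isEmpty()) {
                continue; // skip blank lines
            }
            int colon = entry.lastIndexOf(':');
            if (colon <= 0 || colon == entry.length() - 1) {
                throw new IllegalArgumentException("Expected host:port but got: " + entry);
            }
            String host = entry.substring(0, colon);
            int port = Integer.parseInt(entry.substring(colon + 1));
            urls.add("http://" + host + ":" + port + "/");
        }
        return urls;
    }

    public static void main(String[] args) {
        // Same entries as list.txt in the text.
        List<String> lines = Arrays.asList("localhost:54321", "localhost:54323");
        for (String url : toUrls(lines)) {
            System.out.println(url);
        }
    }
}
```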

2, H2O interfaces
A model built with H2O can be exported and used as a Plain Old Java Object (POJO) or as a Model ObJect, Optimized (MOJO). The following uses a Gradient Boosting Machine model to show how:

Download the MOJO to your machine (for example as gbm_b3929752_4f0e_4c9b_a99c_702cbd42feca.zip), unzip it, change into the experimental directory, and create a main.java program:

import java.io.*;
import hex.genmodel.easy.RowData;
import hex.genmodel.easy.EasyPredictModelWrapper;
import hex.genmodel.easy.prediction.*;
import hex.genmodel.MojoModel;

public class main {
    public static void main(String[] args) throws Exception {
        // Load the MOJO archive and wrap it for row-by-row prediction.
        EasyPredictModelWrapper model = new EasyPredictModelWrapper(
            MojoModel.load("gbm_b3929752_4f0e_4c9b_a99c_702cbd42feca.zip"));

        // One input row; the keys are the column names used in training.
        RowData row = new RowData();
        row.put("AGE", "68");
        row.put("RACE", "2");
        row.put("DCAPS", "2");
        row.put("VOL", "0");
        row.put("GLEASON", "6");

        BinomialModelPrediction p = model.predictBinomial(row);
        System.out.println("Has penetrated the prostatic capsule (1=yes; 0=no): " + p.label);
        System.out.print("Class probabilities: ");
        for (int i = 0; i < p.classProbabilities.length; i++) {
            if (i > 0) {
                System.out.print(",");
            }
            System.out.print(p.classProbabilities[i]);
        }
        System.out.println("");
    }
}

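Put h2o-genmodel.jar and gbm_b3929752_4f0e_4c9b_a99c_702cbd42feca.zip in the current directory.
Compile: javac -cp h2o-genmodel.jar -J-Xms2g -J-XX:MaxPermSize=128m main.java
(The permanent generation was removed in Java 8, so -J-XX:MaxPermSize=128m can be dropped on Java 8 and later.)
Run: $ java -cp .:h2o-genmodel.jar main (Linux)
     $ java -cp .;h2o-genmodel.jar main (Windows)
The program prints the predicted label and the class probabilities. If you parameterize the whole process (model file and input values), it can be turned into a general-purpose scoring program.
For the other deployment options, see the documentation: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html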
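One way to parameterize the input values is to pass them on the command line as NAME=VALUE pairs. The sketch below shows only that parsing step, using plain JDK classes; RowArgs and parse are hypothetical names, and wiring the result into the MOJO program above (one row.put(name, value) per entry) is left as an assumption rather than shown.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RowArgs {
    // Turns arguments such as "AGE=68" into column-name/value pairs,
    // keeping the order in which they were given.
    static Map<String, String> parse(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected NAME=VALUE but got: " + arg);
            }
            row.put(arg.substring(0, eq), arg.substring(eq + 1));
        }
        return row;
    }

    public static void main(String[] args) {
        // Fall back to the values from the example above when no arguments are given.
        String[] example = {"AGE=68", "RACE=2", "DCAPS=2", "VOL=0", "GLEASON=6"};
        Map<String, String> row = parse(args.length > 0 ? args : example);
        row.forEach((name, value) -> System.out.println(name + " -> " + value));
    }
}
```

The model file name could be handled the same way, for instance as the first argument, so that one compiled program serves several MOJOs.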

3, H2O in practice
The following example is taken from https://github.com/h2oai/h2o-3/blob/master/h2o-r/demos/rdemo.lending.club.large.R:

library(h2o)
h2o.init()

# Set this to TRUE if you want to fetch the data directly from S3.
# This is useful if your cluster is running in EC2.
data_source_is_s3 <- FALSE

locate_source <- function(s) {
  if (data_source_is_s3)
    myPath <- paste0("s3n://h2o-public-test-data/", s)
  else
    myPath <- h2o:::.h2o.locate(s)
}

plot_scoring <- function(model) {
  sh <- h2o.scoreHistory(object = model)
  par(mfrow = c(1, 2))
  if (model@algorithm == "gbm" | model@algorithm == "drf") {
    min <- min(range(sh$training_rmse), range(sh$validation_rmse))
    max <- max(range(sh$training_rmse), range(sh$validation_rmse))
    plot(x = sh$number_of_trees, y = sh$validation_rmse, col = "orange", main = model@model_id, ylim = c(min, max))
    points(x = sh$number_of_trees, y = sh$training_rmse, col = "blue")
    min <- min(range(sh$training_auc), range(sh$validation_auc))
    max <- max(range(sh$training_auc), range(sh$validation_auc))
    plot(x = sh$number_of_trees, y = sh$validation_auc, col = "orange", main = model@model_id, ylim = c(min, max))
    points(x = sh$number_of_trees, y = sh$training_auc, col = "blue")
    return(data.frame(number_of_trees = sh$number_of_trees, validation_auc = sh$validation_auc, validation_rmse = sh$validation_rmse))
  }
  if (model@algorithm == "deeplearning") {
    plot(x = sh$epochs, y = sh$validation_rmse, col = "orange", main = model@model_id)
    plot(x = sh$epochs, y = sh$validation_auc, col = "orange", main = model@model_id)
  }
}

# Pick either the big or the small demo.
small_test <- locate_source("bigdata/laptop/lending-club/LoanStats3a.csv")
big_test <- c(locate_source("bigdata/laptop/lending-club/LoanStats3a.csv"),
              locate_source("bigdata/laptop/lending-club/LoanStats3b.csv"),
              locate_source("bigdata/laptop/lending-club/LoanStats3c.csv"),
              locate_source("bigdata/laptop/lending-club/LoanStats3d.csv"))

print("Import approved loan requests for Lending Club...")
loanStats <- h2o.importFile(path = big_test, parse = F)
col_types <- c('numeric', 'numeric', 'numeric', 'numeric', 'numeric', 'enum', 'string', 'numeric',
               'enum', 'enum', 'enum', 'string', 'enum', 'numeric', 'enum', 'enum', 'enum', 'enum',
               'string', 'enum', 'enum', 'enum', 'enum', 'enum', 'numeric', 'numeric', 'enum',
               'numeric', 'numeric', 'numeric', 'numeric', 'numeric', 'numeric', 'string', 'numeric',
               'enum', 'numeric', 'numeric', 'numeric', 'numeric', 'numeric', 'numeric', 'numeric',
               'numeric', 'numeric', 'enum', 'numeric', 'enum', 'enum', 'numeric', 'enum', 'numeric')
loanStats <- h2o.parseRaw(data = loanStats, destination_frame = "loanStats", col.types = col_types)

print("Create bad loan label, this will include charged off, defaulted, and late repayments on loans...")
loanStats <- loanStats[!(loanStats$loan_status %in% c("Current", "In Grace Period", "Late (16-30 days)", "Late (31-120 days)")), ]
loanStats <- loanStats[!is.na(loanStats$id),]
loanStats$bad_loan <- loanStats$loan_status %in% c("Charged Off", "Default", "Does not meet the credit policy.  Status:Charged Off")
loanStats$bad_loan <- as.factor(loanStats$bad_loan)
print(paste(nrow(loanStats), "of 550573 loans have either been paid off or defaulted..."))

print("Turn string interest rate and revolving util columns into numeric columns...")
loanStats$int_rate <- h2o.strsplit(loanStats$int_rate, split = "%")
loanStats$int_rate <- h2o.trim(loanStats$int_rate)
loanStats$int_rate <- as.h2o(as.numeric(as.matrix(loanStats$int_rate)))
# loanStats$int_rate <- as.numeric(loanStats$int_rate)
loanStats$revol_util <- h2o.strsplit(loanStats$revol_util, split = "%")
loanStats$revol_util <- h2o.trim(loanStats$revol_util)
loanStats$revol_util <- as.h2o(as.numeric(as.matrix(loanStats$revol_util)))
# loanStats$revol_util <- as.numeric(loanStats$revol_util)

print("Calculate the longest credit length in years...")
time1 <- as.Date(h2o.strsplit(x = loanStats$earliest_cr_line, split = "-")[,2], format = "%Y")
time2 <- as.Date(h2o.strsplit(x = loanStats$issue_d, split = "-")[,2], format = "%Y")
loanStats$credit_length_in_years <- year(time2) - year(time1)
## Ideally you can parse the column as a Date column immediately
## loanStats$earliest_cr_line <- as.Date(x = loanStats$earliest_cr_line, format = "%b-%Y")
## loanStats$issue_d          <- as.Date(x = loanStats$issue_d, format = "%b-%Y")
## loanStats$credit_length_in_years <- year(loanStats$issue_d) - year(loanStats$earliest_cr_line)

print("Convert emp_length column into numeric...")
## remove " year" and " years", also translate n/a to ""
loanStats$emp_length <- h2o.sub(x = loanStats$emp_length, pattern = "([ ]*+[a-zA-Z].*)|(n/a)", replacement = "")
loanStats$emp_length <- h2o.trim(loanStats$emp_length)
loanStats$emp_length <- h2o.sub(x = loanStats$emp_length, pattern = "< 1", replacement = "0")
loanStats$emp_length <- h2o.sub(x = loanStats$emp_length, pattern = "10\\+", replacement = "10")
loanStats$emp_length <- as.h2o(as.numeric(as.matrix(loanStats$emp_length)))
# loanStats$emp_length <- as.numeric(loanStats$emp_length)

print("Map multiple levels into one factor level for verification_status...")
loanStats$verification_status <- as.character(loanStats$verification_status)
loanStats$verification_status <- h2o.sub(x = loanStats$verification_status, pattern = "VERIFIED - income source", replacement = "verified")
loanStats$verification_status <- h2o.sub(x = loanStats$verification_status, pattern = "VERIFIED - income", replacement = "verified")
loanStats$verification_status <- as.factor(loanStats$verification_status)

## Check to make sure all the string/enum to numeric conversion completed correctly
x <- c("int_rate", "revol_util", "credit_length_in_years", "emp_length", "verification_status")
c1 <- as.data.frame(loanStats[1,x])
c2 <- data.frame(int_rate = 10.65, revol_util = 83.7, credit_length_in_years = 26,emp_length = 10, verification_status = "verified")
if (!all(c1 == c2)) {
  print(c1)
  print(c2)
  stop("Conversion column(s) did not run correctly.")
}

print("Calculate the total amount of money earned or lost per loan...")
loanStats$earned <- loanStats$total_pymnt - loanStats$loan_amnt

print("Set variables to predict bad loans...")
myY <- "bad_loan"
myX <- c("loan_amnt", "term", "home_ownership", "annual_inc", "verification_status", "purpose",
         "addr_state", "dti", "delinq_2yrs", "open_acc", "pub_rec", "revol_bal", "total_acc",
         "emp_length", "collections_12_mths_ex_med", "credit_length_in_years", "inq_last_6mths", "revol_util")

loanStats$inq_last_6mths <- as.factor(loanStats$inq_last_6mths)
loanStats$collections_12_mths_ex_med <- as.factor(loanStats$collections_12_mths_ex_med)
loanStats$pub_rec <- as.factor(loanStats$pub_rec)

# Random 80/20 split into training and validation frames.
data  <- loanStats
rand  <- h2o.runif(data)
train <- data[rand$rnd <= 0.8, ]
valid <- data[rand$rnd > 0.8, ]

models <- c()
for (i in 4:5) {
  start     <- Sys.time()
  gbm_model <- h2o.gbm(x = myX, y = myY, training_frame = train, validation_frame = valid, balance_classes = T,
                       learn_rate = 0.05, score_each_iteration = T, ntrees = 100, max_depth = i)
  end       <- Sys.time()
  gbmBuild  <- end - start
  print(paste("Took", gbmBuild, units(gbmBuild), "to build a GBM Model with 100 trees and an auc of :",
              h2o.auc(gbm_model), "on the training set and",
              h2o.auc(gbm_model, valid = T), "on the validation set."))
  gbm_score <- plot_scoring(model = gbm_model)
  models <- c(models, gbm_model)
}

##### Validate Results
max_auc_on_valid <- c()
for (model in models) {
  sh <- h2o.scoreHistory(model)
  best_model <- sh[sh$validation_auc == max(sh$validation_auc),]
  max_auc_on_valid <- rbind(max_auc_on_valid, best_model)
}

best_model = which(max_auc_on_valid$validation_auc == max(max_auc_on_valid$validation_auc))
gbm_model = models[[best_model]]

print("The variable importance for the GBM model...")
print(h2o.varimp(gbm_model))
print("The confusion matrix for the GBM model...")
print(h2o.confusionMatrix(gbm_model, valid = T))
h2o.auc(gbm_model)
h2o.auc(gbm_model, valid = T)

## Do a post-analysis of how much money we would've saved with this model...
printMoney <- function(x) {
  x <- round(abs(x), 2)
  format(x, digits = 10, nsmall = 2, decimal.mark = ".", big.mark = ",")
}

## Calculate how much money will be lost to false negative, vs how much will be saved due to true positives
loanStats$pred <- h2o.predict(gbm_model, loanStats)[,1]
net <- as.data.frame(h2o.group_by(data = loanStats, by = c("bad_loan", "pred"), sum("earned")))
n1  <- net[ net$bad_loan == 0 & net$pred == 0, 3]
n2  <- net[ net$bad_loan == 0 & net$pred == 1, 3]
n3  <- net[ net$bad_loan == 1 & net$pred == 1, 3]
n4  <- net[ net$bad_loan == 1 & net$pred == 0, 3]

## Calculate the amount of earned
print(paste0("Total amount of profit still earned using the model : $", printMoney(n1), ""))
print(paste0("Total amount of profit forfeited using the model : $", printMoney(n2), ""))
print(paste0("Total amount of loss that could have been prevented : $", printMoney(n3), ""))
print(paste0("Total amount of loss that still would've accrued : $", printMoney(n4), ""))

## Value of the GBM Model
diff <- n3 + n2
print(paste0("Total immediate gain the implementation of the model would've had on completed approved loans : $", printMoney(diff), ""))

## Run prediction of two similar applicants
a1 <- as.h2o(data.frame(loan_amnt = 25000, term = "36 months", home_ownership = "RENT", annual_inc = 70000, purpose = "credit card"))
a2 <- as.h2o(data.frame(loan_amnt = 25000, term = "36 months", home_ownership = "RENT", annual_inc = 70000, purpose = "medical"))

p1 <- h2o.predict(object = gbm_model, newdata = a1)
p2 <- h2o.predict(object = gbm_model, newdata = a2)

if(sum(p1$predict == 1)) stop("Loan for credit card debt should be approved")
if(sum(p2$predict == 0)) stop("Loan for medical bills should not be approved")
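
The post-analysis in the script above groups the earned column by the actual label (bad_loan) and the predicted label (pred), which yields the four amounts n1 to n4. A minimal Java sketch of that four-bucket sum follows. The numbers are made-up toy values, not Lending Club data, and LoanBuckets and sumEarned are hypothetical names.

```java
public class LoanBuckets {
    // Sums "earned" into four buckets indexed by [actual bad_loan][predicted bad_loan].
    static double[][] sumEarned(int[] badLoan, int[] pred, double[] earned) {
        double[][] net = new double[2][2];
        for (int i = 0; i < earned.length; i++) {
            net[badLoan[i]][pred[i]] += earned[i];
        }
        return net;
    }

    public static void main(String[] args) {
        // Toy data: five loans with actual label, predicted label, and money earned or lost.
        int[] badLoan   = {0, 0, 1, 1, 0};
        int[] pred      = {0, 1, 1, 0, 0};
        double[] earned = {500.0, 200.0, -1000.0, -300.0, 100.0};

        double[][] net = sumEarned(badLoan, pred, earned);
        System.out.println("n1, profit still earned:   " + net[0][0]);
        System.out.println("n2, profit forfeited:      " + net[0][1]);
        System.out.println("n3, loss prevented:        " + net[1][1]);
        System.out.println("n4, loss still accrued:    " + net[1][0]);
        // Same combination as diff <- n3 + n2 in the R script.
        System.out.println("n3 + n2:                   " + (net[1][1] + net[0][1]));
    }
}
```

Because n3 is a loss (negative) and n2 is a forgone profit (positive), n3 + n2 is the net amount tied to the loans the model would have rejected; the R script prints its absolute value through printMoney.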
