Data Mining: H2O Machine Learning Platform Architecture, Interfaces, and Practice

The previous chapter covered how to use H2O. This one looks at H2O's architecture, its model interfaces, and a worked example.

1. H2O Architecture
Many resources describe the H2O architecture; here we walk through the description given on the official site.

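The top layer is the client layer, that is, the interface layer. H2O supports external interaction through JavaScript, R, Python, Excel, Tableau, Flow, and other clients.
Below it is H2O's core engine layer, the JVM Components. Each JVM process is divided into three layers (language, algorithm, and core infrastructure), which between them handle expression execution, algorithms, data storage, and task processing:
Language layer
    Rapids Expression Evaluation Engine
    Scala
    Customer Algorithm
Algorithm layer
    Parse
    Algorithm (GLM, GBM, DL, K-means, PCA)
    In-H2O Prediction Engine
Core infrastructure layer
    Memory Management (data storage)
        Fluid Vector Frame
        Distributed K/V Store
        Non-blocking HashMap
    CPU Management (computation)
        Job
        MR Task
        Fork/Join
The bottom layer is the runtime environment: Spark, Hadoop, or standalone H2O.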
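The CPU management layer above lists MR Task and Fork/Join. As a rough illustration of that divide-and-combine style, here is a minimal sketch that sums a column of numbers using only the JDK's own fork/join classes. It does not use H2O's MRTask class; the class name ChunkSum and the chunk size are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Illustrative only: plain java.util.concurrent, not H2O's MRTask.
public class ChunkSum extends RecursiveTask<Double> {
    // Hypothetical chunk size; ranges at or below it are summed directly.
    private static final int CHUNK_SIZE = 1_000;

    private final double[] column;
    private final int from;
    private final int to;

    ChunkSum(double[] column, int from, int to) {
        this.column = column;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Double compute() {
        if (to - from <= CHUNK_SIZE) {
            // "Map" step: process one chunk locally.
            double sum = 0.0;
            for (int i = from; i < to; i++) {
                sum += column[i];
            }
            return sum;
        }
        // Split the range in two, run the left half asynchronously.
        int mid = (from + to) >>> 1;
        ChunkSum left = new ChunkSum(column, from, mid);
        ChunkSum right = new ChunkSum(column, mid, to);
        left.fork();
        double rightSum = right.compute();
        // "Reduce" step: combine the partial results.
        return left.join() + rightSum;
    }

    static double sum(double[] column) {
        return ForkJoinPool.commonPool().invoke(new ChunkSum(column, 0, column.length));
    }

    public static void main(String[] args) {
        double[] column = new double[10_000];
        Arrays.fill(column, 1.0);
        System.out.println(sum(column)); // prints 10000.0
    }
}
```

The work is split recursively until a range is small enough to process directly, and partial results are merged on the way back up. In H2O the same idea is applied to chunks of a distributed frame rather than to a local array.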

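To try a two-node cluster on a single machine, prepare a file named list.txt:
localhost:54321
localhost:54323
Open a terminal and run:
java -Xmx2G -jar h2o.jar -ip localhost -flatfile list.txt
Run the same command a second time in another terminal, and a two-node H2O cluster is up. Open http://localhost:54321/ or http://localhost:54323/ to check the cluster status.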
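Besides the web page, each node can be queried over HTTP. The sketch below turns the entries of a flatfile such as list.txt into status URLs. It assumes the H2O-3 REST API exposes cluster status at the path /3/Cloud; check that path against the REST API reference for your version. The class name FlatfileUrls is made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FlatfileUrls {
    // Assumed cluster-status path of the H2O-3 REST API.
    private static final String STATUS_PATH = "/3/Cloud";

    // Turns flatfile lines such as "localhost:54321" into status URLs.
    static List<String> statusUrls(List<String> flatfileLines) {
        List<String> urls = new ArrayList<>();
        for (String line : flatfileLines) {
            String entry = line.trim();
            if (entry.isEmpty()) {
                continue; // skip blank lines
            }
            urls.add("http://" + entry + STATUS_PATH);
        }
        return urls;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("localhost:54321", "localhost:54323");
        for (String url : statusUrls(lines)) {
            System.out.println(url);
        }
    }
}
```

Fetching either URL with any HTTP client while the cluster is running should return a JSON description of the cluster, which is handy for scripted health checks.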

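2. H2O Interfaces
A model built with H2O can be exported and used as a Plain Old Java Object (POJO) or as a Model ObJect, Optimized (MOJO). The following uses a Gradient Boosting Machine model to show how:

Download the MOJO to your machine (for example, as gbm_b3929752_4f0e_4c9b_a99c_702cbd42feca.zip), unzip it, change into the experimental directory, and create a new program main.java. Note that the program itself loads the .zip archive directly.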

import java.io.*;
import hex.genmodel.easy.RowData;
import hex.genmodel.easy.EasyPredictModelWrapper;
import hex.genmodel.easy.prediction.*;
import hex.genmodel.MojoModel;

public class main {
    public static void main(String[] args) throws Exception {
        // Load the MOJO archive and wrap it for row-by-row scoring.
        EasyPredictModelWrapper model = new EasyPredictModelWrapper(MojoModel.load("gbm_b3929752_4f0e_4c9b_a99c_702cbd42feca.zip"));

        // Build one input row, keyed by column name.
        RowData row = new RowData();
        row.put("AGE", "68");
        row.put("RACE", "2");
        row.put("DCAPS", "2");
        row.put("VOL", "0");
        row.put("GLEASON", "6");

        // Score the row and print the predicted label and class probabilities.
        BinomialModelPrediction p = model.predictBinomial(row);
        System.out.println("Has penetrated the prostatic capsule (1=yes; 0=no): " + p.label);
        System.out.print("Class probabilities: ");
        for (int i = 0; i < p.classProbabilities.length; i++) {
            if (i > 0) {
                System.out.print(",");
            }
            System.out.print(p.classProbabilities[i]);
        }
        System.out.println("");
    }
}

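Put h2o-genmodel.jar and gbm_b3929752_4f0e_4c9b_a99c_702cbd42feca.zip in the current directory.
Compile:
javac -cp h2o-genmodel.jar -J-Xms2g -J-XX:MaxPermSize=128m main.java
(The permanent generation was removed in Java 8, so on Java 8 and later the -J-XX:MaxPermSize option has no effect and can be dropped.)
Run:
$ java -cp .:h2o-genmodel.jar main (Linux)
$ java -cp .;h2o-genmodel.jar main (Windows)
The program prints the predicted label and the class probabilities. If you parameterize the model file and the input values, the same program can serve as a general-purpose scorer.
For the other deployment options, see the documentation: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html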
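One way to approach the parameterization mentioned above is to read the input row from the command line instead of hard-coding it. The sketch below parses NAME=VALUE arguments into an ordered map. It deliberately leaves out the h2o-genmodel calls so that it compiles on its own; the class name RowArgs and the NAME=VALUE convention are assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RowArgs {
    // Parses arguments of the form NAME=VALUE into an insertion-ordered map.
    static Map<String, String> parse(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected NAME=VALUE, got: " + arg);
            }
            row.put(arg.substring(0, eq), arg.substring(eq + 1));
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, String> row = parse(args);
        // In the MOJO program, each entry would be copied into the RowData
        // object with row.put(name, value) before calling predictBinomial.
        System.out.println(row);
    }
}
```

Running java RowArgs AGE=68 RACE=2 prints {AGE=68, RACE=2}. The model file name could be passed the same way, for example as the first argument, so that one compiled program can score rows against different MOJO files.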

3. H2O in Practice
The following example is taken from https://github.com/h2oai/h2o-3/blob/master/h2o-r/demos/rdemo.lending.club.large.R:

library(h2o)
h2o.init()

# Set this to TRUE if you want to fetch the data directly from S3.
# This is useful if your cluster is running in EC2.
data_source_is_s3 <- FALSE

locate_source <- function(s) {
  if (data_source_is_s3)
    myPath <- paste0("s3n://h2o-public-test-data/", s)
  else
    myPath <- h2o:::.h2o.locate(s)
}

plot_scoring <- function(model) {
  sh <- h2o.scoreHistory(object = model)
  par(mfrow=c(1,2))
  if(model@algorithm == "gbm" | model@algorithm == "drf"){
    min <- min(range(sh$training_rmse), range(sh$validation_rmse))
    max <- max(range(sh$training_rmse), range(sh$validation_rmse))
    plot(x = sh$number_of_trees, y = sh$validation_rmse, col = "orange", main = model@model_id, ylim = c(min,max))
    points(x = sh$number_of_trees, y = sh$training_rmse, col = "blue")
    min <- min(range(sh$training_auc), range(sh$validation_auc))
    max <- max(range(sh$training_auc), range(sh$validation_auc))
    plot(x = sh$number_of_trees, y = sh$validation_auc, col = "orange", main = model@model_id, ylim = c(min,max))
    points(x = sh$number_of_trees, y = sh$training_auc, col = "blue")
    return(data.frame(number_of_trees = sh$number_of_trees, validation_auc = sh$validation_auc, validation_rmse = sh$validation_rmse))
  }
  if(model@algorithm == "deeplearning"){
    plot(x = sh$epochs, y = sh$validation_rmse, col = "orange", main = model@model_id)
    plot(x = sh$epochs, y = sh$validation_auc, col = "orange", main = model@model_id)
  }
}

# Pick either the big or the small demo.
small_test <-  locate_source("bigdata/laptop/lending-club/LoanStats3a.csv")
big_test <-  c(locate_source("bigdata/laptop/lending-club/LoanStats3a.csv"),
               locate_source("bigdata/laptop/lending-club/LoanStats3b.csv"),
               locate_source("bigdata/laptop/lending-club/LoanStats3c.csv"),
               locate_source("bigdata/laptop/lending-club/LoanStats3d.csv"))

print("Import approved loan requests for Lending Club...")
loanStats <- h2o.importFile(path = big_test, parse = F)
col_types <- c('numeric', 'numeric', 'numeric', 'numeric', 'numeric', 'enum', 'string', 'numeric','enum', 'enum', 'enum', 'string', 'enum', 'numeric', 'enum', 'enum', 'enum', 'enum','string', 'enum', 'enum', 'enum', 'enum', 'enum', 'numeric', 'numeric', 'enum','numeric', 'numeric', 'numeric', 'numeric', 'numeric', 'numeric', 'string', 'numeric','enum', 'numeric', 'numeric', 'numeric', 'numeric', 'numeric', 'numeric', 'numeric','numeric', 'numeric', 'enum', 'numeric', 'enum', 'enum', 'numeric', 'enum', 'numeric')
loanStats <- h2o.parseRaw(data = loanStats, destination_frame = "loanStats", col.types = col_types)

print("Create bad loan label, this will include charged off, defaulted, and late repayments on loans...")
loanStats <- loanStats[!(loanStats$loan_status %in% c("Current", "In Grace Period", "Late (16-30 days)", "Late (31-120 days)")), ]
loanStats <- loanStats[!is.na(loanStats$id),]
loanStats$bad_loan <- loanStats$loan_status %in% c("Charged Off", "Default", "Does not meet the credit policy.  Status:Charged Off")
loanStats$bad_loan <- as.factor(loanStats$bad_loan)
print(paste(nrow(loanStats), "of 550573 loans have either been paid off or defaulted..."))

print("Turn string interest rate and revolving util columns into numeric columns...")
loanStats$int_rate <- h2o.strsplit(loanStats$int_rate, split = "%")
loanStats$int_rate <- h2o.trim(loanStats$int_rate)
loanStats$int_rate <- as.h2o(as.numeric(as.matrix(loanStats$int_rate)))
# loanStats$int_rate <- as.numeric(loanStats$int_rate)
loanStats$revol_util <- h2o.strsplit(loanStats$revol_util, split = "%")
loanStats$revol_util <- h2o.trim(loanStats$revol_util)
loanStats$revol_util <- as.h2o(as.numeric(as.matrix(loanStats$revol_util)))
# loanStats$revol_util <- as.numeric(loanStats$revol_util)

print("Calculate the longest credit length in years...")
time1 <- as.Date(h2o.strsplit(x = loanStats$earliest_cr_line, split = "-")[,2], format = "%Y")
time2 <- as.Date(h2o.strsplit(x = loanStats$issue_d, split = "-")[,2], format = "%Y")
loanStats$credit_length_in_years <- year(time2) - year(time1)
## Ideally you can parse the column as a Date column immediately
## loanStats$earliest_cr_line <- as.Date(x = loanStats$earliest_cr_line, format = "%b-%Y")
## loanStats$issue_d          <- as.Date(x = loanStats$issue_d, format = "%b-%Y")
## loanStats$credit_length_in_years <- year(loanStats$issue_d) - year(loanStats$earliest_cr_line)

print("Convert emp_length column into numeric...")
## remove " year" and " years", also translate n/a to ""
loanStats$emp_length <- h2o.sub(x = loanStats$emp_length, pattern = "([ ]*+[a-zA-Z].*)|(n/a)", replacement = "")
loanStats$emp_length <- h2o.trim(loanStats$emp_length)
loanStats$emp_length <- h2o.sub(x = loanStats$emp_length, pattern = "< 1", replacement = "0")
loanStats$emp_length <- h2o.sub(x = loanStats$emp_length, pattern = "10\\\\+", replacement = "10")
loanStats$emp_length <- as.h2o(as.numeric(as.matrix(loanStats$emp_length)))
# loanStats$emp_length <- as.numeric(loanStats$emp_length)

print("Map multiple levels into one factor level for verification_status...")
loanStats$verification_status <- as.character(loanStats$verification_status)
loanStats$verification_status <- h2o.sub(x = loanStats$verification_status, pattern = "VERIFIED - income source", replacement = "verified")
loanStats$verification_status <- h2o.sub(x = loanStats$verification_status, pattern = "VERIFIED - income", replacement = "verified")
loanStats$verification_status <- as.factor(loanStats$verification_status)

## Check to make sure all the string/enum to numeric conversion completed correctly
x <- c("int_rate", "revol_util", "credit_length_in_years", "emp_length", "verification_status")
c1 <- as.data.frame(loanStats[1,x])
c2 <- data.frame(int_rate = 10.65, revol_util = 83.7, credit_length_in_years = 26,emp_length = 10, verification_status = "verified")
if(!all(c1 == c2)) {
  print(c1)
  print(c2)
  stop("Conversion column(s) did not run correctly.")
}

print("Calculate the total amount of money earned or lost per loan...")
loanStats$earned <- loanStats$total_pymnt - loanStats$loan_amnt

print("Set variables to predict bad loans...")
myY <- "bad_loan"
myX <-  c("loan_amnt", "term", "home_ownership", "annual_inc", "verification_status", "purpose",
          "addr_state", "dti", "delinq_2yrs", "open_acc", "pub_rec", "revol_bal", "total_acc",
          "emp_length", "collections_12_mths_ex_med", "credit_length_in_years", "inq_last_6mths", "revol_util")

loanStats$inq_last_6mths <- as.factor(loanStats$inq_last_6mths)
loanStats$collections_12_mths_ex_med <- as.factor(loanStats$collections_12_mths_ex_med)
loanStats$pub_rec <- as.factor(loanStats$pub_rec)

data  <- loanStats
rand  <- h2o.runif(data)
train <- data[rand$rnd <= 0.8, ]
valid <- data[rand$rnd > 0.8, ]

# Train one GBM per max_depth value and keep every model for comparison.
models <- c()
for(i in 4:5){
  start     <- Sys.time()
  gbm_model <- h2o.gbm(x = myX, y = myY, training_frame = train, validation_frame = valid, balance_classes = T,
                       learn_rate = 0.05, score_each_iteration = T, ntrees = 100, max_depth = i)
  end       <- Sys.time()
  gbmBuild  <- end - start
  print(paste("Took", gbmBuild, units(gbmBuild), "to build a GBM Model with 100 trees and an AUC of :",
              h2o.auc(gbm_model) , "on the training set and",
              h2o.auc(gbm_model, valid = T), "on the validation set."))
  gbm_score <- plot_scoring(model = gbm_model)
  models <- c(models, gbm_model)
}

##### Validate Results
max_auc_on_valid <- c()
for(model in models) {
  sh <- h2o.scoreHistory(model)
  best_model <- sh[sh$validation_auc == max(sh$validation_auc),]
  max_auc_on_valid <- rbind(max_auc_on_valid, best_model)
}

best_model = which(max_auc_on_valid$validation_auc == max(max_auc_on_valid$validation_auc))
gbm_model = models [[best_model]]

print("The variable importance for the GBM model...")
print(h2o.varimp(gbm_model))
print("The confusion matrix for the GBM model...")
print(h2o.confusionMatrix(gbm_model, valid = T))
h2o.auc(gbm_model)
h2o.auc(gbm_model, valid = T)

## Do a post - analysis of how much money we would've saved with this model...
printMoney <- function(x){
  x <- round(abs(x),2)
  format(x, digits=10, nsmall=2, decimal.mark=".", big.mark=",")
}

## Calculate how much money will be lost to false negative, vs how much will be saved due to true positives
loanStats$pred <- h2o.predict(gbm_model, loanStats)[,1]
net <- as.data.frame(h2o.group_by(data = loanStats, by = c("bad_loan", "pred"), sum("earned")))
n1  <- net[ net$bad_loan == 0 & net$pred == 0, 3]
n2  <- net[ net$bad_loan == 0 & net$pred == 1, 3]
n3  <- net[ net$bad_loan == 1 & net$pred == 1, 3]
n4  <- net[ net$bad_loan == 1 & net$pred == 0, 3]

## Calculate the amount of earned
print(paste0("Total amount of profit still earned using the model : $", printMoney(n1) , ""))
print(paste0("Total amount of profit forfeited using the model : $", printMoney(n2) , ""))
print(paste0("Total amount of loss that could have been prevented : $", printMoney(n3) , ""))
print(paste0("Total amount of loss that still would've accrued : $", printMoney(n4) , ""))

## Value of the GBM Model
diff <- n3 + n2
print(paste0("Total immediate gain the implementation of the model would've had on completed approved loans : $",printMoney(diff),""))

## Run prediction of two similar applicants
a1 <- as.h2o(data.frame(loan_amnt = 25000, term = "36 months", home_ownership = "RENT", annual_inc = 70000, purpose = "credit card"))
a2 <- as.h2o(data.frame(loan_amnt = 25000, term = "36 months", home_ownership = "RENT", annual_inc = 70000, purpose = "medical"))

p1 <- h2o.predict(object = gbm_model, newdata = a1)
p2 <- h2o.predict(object = gbm_model, newdata = a2)

if(sum(p1$predict == 1)) stop("Loan for credit card debt should be approved")
if(sum(p2$predict == 0)) stop("Loan for medical bills should not be approved")
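The post-analysis step in the script groups loans by the actual label and the predicted label, then sums the earned column for each of the four combinations (n1 to n4). A minimal sketch of that grouping in plain Java, on a handful of made-up toy values rather than the Lending Club data, could look like this. The class name EarnedByOutcome and the "bad_loan,pred" key format are assumptions made for this example.

```java
import java.util.Map;
import java.util.TreeMap;

public class EarnedByOutcome {
    // Sums "earned" for each (bad_loan, pred) pair, keyed as "bad_loan,pred".
    static Map<String, Double> sumEarned(int[] badLoan, int[] pred, double[] earned) {
        Map<String, Double> totals = new TreeMap<>();
        for (int i = 0; i < earned.length; i++) {
            totals.merge(badLoan[i] + "," + pred[i], earned[i], Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Toy values for illustration only, not real loan data.
        int[] badLoan  = {0,     0,     1,       1,       0};
        int[] pred     = {0,     1,     1,       0,       0};
        double[] earned = {500.0, 300.0, -2000.0, -1500.0, 250.0};

        Map<String, Double> totals = sumEarned(badLoan, pred, earned);
        System.out.println("profit still earned (0,0): " + totals.get("0,0"));
        System.out.println("profit forfeited    (0,1): " + totals.get("0,1"));
        System.out.println("loss prevented      (1,1): " + totals.get("1,1"));
        System.out.println("loss still accrued  (1,0): " + totals.get("1,0"));
    }
}
```

The four sums correspond to n1, n2, n3, and n4 in the R script: good loans the model would approve, good loans it would reject, bad loans it would reject, and bad loans it would still approve.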
