AutoGluon: training models quickly with automated machine learning
AutoGluon
Supported tasks
Installation
$ pip install -U "mxnet<2.0.0" -i https://pypi.tuna.tsinghua.edu.cn/simple # CPU version
$ pip install -U "mxnet_cu101" -i https://pypi.tuna.tsinghua.edu.cn/simple # GPU version
$ pip install autogluon -i https://pypi.tuna.tsinghua.edu.cn/simple
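Hands-on
Tabular data prediction
For standard datasets represented as tables (stored as CSV files, loaded from a database, and so on),
AutoGluon can build models that predict the values in one column from the values in the other columns.
This lets you reach high accuracy on standard supervised learning tasks (classification and regression) without handling tedious steps such as data cleaning, feature engineering, hyperparameter optimization, and model selection.
- Example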
import time

import fire
from autogluon.tabular import TabularDataset, TabularPredictor


class DataFrameDataset:
    label = "class"  # label column of the tabular dataset

    def train_data(self):
        """
        Load the training dataset: predict whether a person's income exceeds 50K USD.
        :return: the dataset as a DataFrame
        """
        # train_data_ = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
        train_data_ = TabularDataset('./train.csv')
        print("Train Data:\n", train_data_.head())
        return train_data_

    def test_data(self):
        """
        Load the test dataset.
        :return: full test data, test data without the label column, and the labels
        """
        # test_data_ = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
        test_data_ = TabularDataset('./test.csv')
        print("Test Data:\n", test_data_.head())
        y_test_ = test_data_[self.label]
        test_data_no_label_ = test_data_.drop(columns=[self.label])
        return test_data_, test_data_no_label_, y_test_

    def user_data(self):
        """The user's own test data (here: the first row of the test set)."""
        # test_data_ = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')[0]
        test_data_ = TabularDataset('./test.csv').head(1)
        y_test_ = test_data_[self.label]
        test_data_no_label_ = test_data_.drop(columns=[self.label])
        return test_data_no_label_, y_test_


class Model:
    dataset = DataFrameDataset()
    model_path = "table_predictor"

    def __init__(self):
        self.predictor: TabularPredictor = None

    def train(self, eval_metric="roc_auc", presets="medium_quality_faster_train", time_limit=60, holdout_frac=0.1):
        """
        Commonly tuned parameters.
        :param eval_metric: evaluation metric
            f1: binary classification
            roc_auc: binary classification
            log_loss: classification
            mean_absolute_error: regression
            median_absolute_error: regression
        :param presets: training presets
            best_quality: trade training time for a high-accuracy model
            medium_quality_faster_train: trade quality for fast training
            good_quality_faster_inference_only_refit: a reasonably good model with relatively fast inference
            optimize_for_deployment:
        :param time_limit: training time limit (seconds)
        :param holdout_frac: fraction of the training set to split off as validation data
        Note: fit() also accepts hyperparameters, a user-defined search space (see the SDK docs),
        for example to set the number of training epochs (num_epochs); it is not exposed by this method.
        :return:
        """
        print("Starting training ..........")
        self.predictor = TabularPredictor(label=self.dataset.label, path=self.model_path, eval_metric=eval_metric)
        train_data = self.dataset.train_data()
        self.predictor.fit(train_data, time_limit=time_limit, excluded_model_types=['KNN', 'NN', 'custom'],
                           presets=presets, holdout_frac=holdout_frac)

        print("Evaluating model ..........")
        test_data, test_data_no_label, y_test = self.dataset.test_data()
        # roc_auc is computed from probabilities, so predict_proba is used here
        y_pred = self.predictor.predict_proba(test_data_no_label)
        evaluate = self.predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)
        board = self.predictor.leaderboard(test_data, silent=True)
        print("Evaluation results:\n", evaluate)
        print("Model performance on the test set:\n", board)
        print("Feature importance:\n", self.predictor.feature_importance(data=train_data))
        print("Model info:\n", self.predictor.info())
        # keep the best model (and the models it depends on); delete the rest
        self.predictor.delete_models(models_to_keep='best', dry_run=False)

    def predict(self):
        """
        Run a prediction.
        get_model_best(): name of the best model
        predict(): predicted label
        predict_proba(): probability of each label
        :return:
        """
        self.predictor = TabularPredictor.load(self.model_path)
        best_model = self.predictor.get_model_best()
        print("Best Model:\n", best_model)
        test_data_no_label, y_test = self.dataset.user_data()
        start_time = time.time()
        y_pred = self.predictor.predict(test_data_no_label, model=best_model)
        print("inference time: ", time.time() - start_time)
        y_pred_prob = self.predictor.predict_proba(test_data_no_label, model=best_model)
        print("y_test: ", y_test)
        print("y_pred: ", y_pred)
        print(y_pred_prob)
        print("Prediction result:", y_test == y_pred)


if __name__ == '__main__':
    fire.Fire(Model())
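The holdout_frac argument passed to fit() above sets the share of training rows reserved for validation. The sketch below illustrates the idea in plain Python; it is not AutoGluon's implementation, and the function name holdout_split, the seed, and the rounding rule are assumptions made for illustration.

```python
import random


def holdout_split(rows, holdout_frac=0.1, seed=0):
    """Randomly split rows into (train, validation) by a holdout fraction."""
    if not 0.0 < holdout_frac < 1.0:
        raise ValueError("holdout_frac must be between 0 and 1")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_val = int(len(rows) * holdout_frac)  # rounding down is an assumption
    val_idx = set(indices[:n_val])
    train = [row for i, row in enumerate(rows) if i not in val_idx]
    val = [row for i, row in enumerate(rows) if i in val_idx]
    return train, val


# 39073 is the number of rows reported for train.csv in the run output.
rows = list(range(39073))
train_rows, val_rows = holdout_split(rows, holdout_frac=0.1)
print(len(train_rows), len(val_rows))
```

A larger holdout_frac gives a more reliable validation score at the cost of fewer rows to train on.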
- Run
$ python3 auto.py train
Starting training ..........
Warning: path already exists! This predictor may overwrite an existing predictor! path="table_predictor"
Loaded data from: ./train.csv | Columns = 15 / 15 | Rows = 39073 -> 39073
Train Data:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country class
0 25 Private 178478 Bachelors 13 Never-married Tech-support Own-child White Female 0 0 40 United-States <=50K
1 23 State-gov 61743 5th-6th 3 Never-married Transport-moving Not-in-family White Male 0 0 35 United-States <=50K
2 46 Private 376789 HS-grad 9 Never-married Other-service Not-in-family White Male 0 0 15 United-States <=50K
3 55 ? 200235 HS-grad 9 Married-civ-spouse ? Husband White Male 0 0 50 United-States >50K
4 36 Private 224541 7th-8th 4 Married-civ-spouse Handlers-cleaners Husband White Male 0 0 40 El-Salvador <=50K
Presets specified: ['medium_quality_faster_train']
Beginning AutoGluon training ... Time limit = 60s
AutoGluon will save models to "table_predictor/"
AutoGluon Version: 0.3.1
Train Data Rows: 39073
Train Data Columns: 14
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2 unique label values: [' <=50K', ' >50K']
If 'binary' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping: class 1 = >50K, class 0 = <=50K
Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.
To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 2261.62 MB
Train Data (Original) Memory Usage: 22.92 MB (1.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
...
...
AutoGluon training complete, total runtime = 67.34s ...
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("table_predictor/")
Evaluating model ..........
Loaded data from: ./test.csv | Columns = 15 / 15 | Rows = 9769 -> 9769
Test Data:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country class
0 31 Private 169085 11th 7 Married-civ-spouse Sales Wife White Female 0 0 20 United-States <=50K
1 17 Self-emp-not-inc 226203 12th 8 Never-married Sales Own-child White Male 0 0 45 United-States <=50K
2 47 Private 54260 Assoc-voc 11 Married-civ-spouse Exec-managerial Husband White Male 0 1887 60 United-States >50K
3 21 Private 176262 Some-college 10 Never-married Exec-managerial Own-child White Female 0 0 30 United-States <=50K
4 17 Private 241185 12th 8 Never-married Prof-specialty Own-child White Male 0 0 20 United-States <=50K
Evaluation: roc_auc on test data: 0.9323364763680665
Evaluations on test data:
{
    "roc_auc": 0.9323364763680665,
    "accuracy": 0.8761388064284983,
    "balanced_accuracy": 0.8000729586881633,
    "mcc": 0.6412270975073234,
    "f1": 0.7151600753295669,
    "precision": 0.7870466321243523,
    "recall": 0.6553062985332183
}
Evaluation results:
 {'roc_auc': 0.9323364763680665, 'accuracy': 0.8761388064284983, 'balanced_accuracy': 0.8000729586881633, 'mcc': 0.6412270975073234, 'f1': 0.7151600753295669, 'precision': 0.7870466321243523, 'recall': 0.6553062985332183}
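As a quick check on how the auxiliary metrics relate to each other, the f1 score is the harmonic mean of precision and recall. The short sketch below recomputes it from the two logged values; only the numbers come from the output above, and the variable names are illustrative.

```python
# Values copied from the evaluation output above.
precision = 0.7870466321243523
recall = 0.6553062985332183

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # compare with the logged f1 of 0.7151600753295669
```

The gap between precision and recall shows that the model misses more positive (>50K) cases than it falsely flags, which is why balanced_accuracy is noticeably lower than accuracy.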
Model performance on the test set:
model score_test score_val pred_time_test pred_time_val fit_time pred_time_test_marginal pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 WeightedEnsemble_L2 0.932336 0.935558 0.202042 0.067755 27.769457 0.018307 0.001736 1.468465 2 True 9
1 CatBoost 0.931855 0.934702 0.038226 0.020100 25.316929 0.038226 0.020100 25.316929 1 True 5
2 LightGBM 0.931088 0.934438 0.145509 0.045919 0.984063 0.145509 0.045919 0.984063 1 True 2
3 LightGBMXT 0.928022 0.930726 0.345757 0.102603 2.624385 0.345757 0.102603 2.624385 1 True 1
4 NeuralNetFastAI 0.914286 0.914985 0.178472 0.074495 15.291426 0.178472 0.074495 15.291426 1 True 8
5 RandomForestGini 0.911646 0.910570 0.547053 0.109948 3.671543 0.547053 0.109948 3.671543 1 True 3
6 RandomForestEntr 0.911283 0.911003 0.578620 0.111891 4.344845 0.578620 0.111891 4.344845 1 True 4
7 ExtraTreesEntr 0.904868 0.905856 0.734225 0.175408 3.133456 0.734225 0.175408 3.133456 1 True 7
8 ExtraTreesGini 0.904081 0.905642 1.049894 0.122711 2.265818 1.049894 0.122711 2.265818 1 True 6
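The leaderboard reports both total and marginal times. For a stacked model such as WeightedEnsemble_L2, the total appears to include the base models it depends on. The sketch below checks this with the logged numbers, assuming the ensemble is built on CatBoost and LightGBM (the two base models that are not deleted by delete_models later in the log); that dependency is inferred from the arithmetic rather than stated by the log.

```python
# Values copied from the leaderboard above.
catboost = {"pred_time_test": 0.038226, "fit_time": 25.316929}
lightgbm = {"pred_time_test": 0.145509, "fit_time": 0.984063}
# Marginal cost of the ensemble layer itself (the *_marginal columns).
ensemble_marginal = {"pred_time_test": 0.018307, "fit_time": 1.468465}

total_pred = (catboost["pred_time_test"]
              + lightgbm["pred_time_test"]
              + ensemble_marginal["pred_time_test"])
total_fit = (catboost["fit_time"]
             + lightgbm["fit_time"]
             + ensemble_marginal["fit_time"])

# Compare with the WeightedEnsemble_L2 row: pred_time_test and fit_time.
print(round(total_pred, 6), round(total_fit, 6))
```

In practice this means the ensemble's inference latency is dominated by its slowest base models, so a single model such as CatBoost can be the better choice when latency matters more than the last fraction of a point in roc_auc.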
Computing feature importance via permutation shuffling for 14 features using 1000 rows with 3 shuffle sets...
2.79s = Expected runtime (0.93s per shuffle set)
1.05s = Actual runtime (Completed 3 of 3 shuffle sets)
Feature importance:
importance stddev p_value n p99_high p99_low
capital-gain 0.067067 0.006993 0.001802 3 0.107138 0.026997
age 0.041595 0.015031 0.020439 3 0.127725 -0.044534
relationship 0.022320 0.003899 0.005010 3 0.044662 -0.000022
education-num 0.021446 0.006913 0.016467 3 0.061057 -0.018166
occupation 0.020063 0.004353 0.007664 3 0.045004 -0.004877
marital-status 0.018524 0.004348 0.008936 3 0.043436 -0.006389
capital-loss 0.016525 0.003324 0.006609 3 0.035571 -0.002520
hours-per-week 0.014453 0.000872 0.000606 3 0.019451 0.009455
fnlwgt 0.010131 0.001359 0.002972 3 0.017917 0.002345
workclass 0.006703 0.001663 0.009949 3 0.016229 -0.002824
education 0.004152 0.000358 0.001231 3 0.006200 0.002103
native-country 0.004013 0.002588 0.057583 3 0.018840 -0.010815
sex 0.002201 0.000911 0.026324 3 0.007422 -0.003020
race 0.002195 0.001788 0.083653 3 0.012439 -0.008049
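The p99_high and p99_low columns above are numerically consistent with a two-sided 99% Student-t confidence interval around the mean importance, computed from stddev and the number of shuffle sets n. This is an inference from the logged values, not something the log states. The sketch below reproduces the bounds for capital-gain; the function name p99_bounds and the hard-coded critical value are assumptions for illustration.

```python
import math

# Two-sided 99% Student-t critical value for n - 1 = 2 degrees of freedom.
# Only valid for n = 3 shuffle sets, as in the log above.
T_995_DF2 = 9.924843


def p99_bounds(importance, stddev, n, t_crit=T_995_DF2):
    """Return (high, low) bounds of a t-based confidence interval."""
    half_width = t_crit * stddev / math.sqrt(n)
    return importance + half_width, importance - half_width


# capital-gain row: importance=0.067067, stddev=0.006993, n=3
high, low = p99_bounds(0.067067, 0.006993, 3)
print(round(high, 4), round(low, 4))  # logged: p99_high=0.107138, p99_low=0.026997
```

With only 3 shuffle sets the critical value is large, which is why most features have a p99_low below zero; more shuffle sets would tighten the intervals.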
Model info:
...
Deleting model LightGBMXT. All files under table_predictor/models/LightGBMXT/ will be removed.
Deleting model RandomForestGini. All files under table_predictor/models/RandomForestGini/ will be removed.
Deleting model RandomForestEntr. All files under table_predictor/models/RandomForestEntr/ will be removed.
Deleting model ExtraTreesGini. All files under table_predictor/models/ExtraTreesGini/ will be removed.
Deleting model ExtraTreesEntr. All files under table_predictor/models/ExtraTreesEntr/ will be removed.
Deleting model NeuralNetFastAI. All files under table_predictor/models/NeuralNetFastAI/ will be removed.
$ python3 auto.py predict
Best Model:
 WeightedEnsemble_L2
inference time: 0.6184391975402832
y_test: 0 <=50K
Name: class, dtype: object
y_pred: 0 <=50K
Name: class, dtype: object
<=50K >50K
0 0.934742 0.065258
Prediction result: 0 True
Name: class, dtype: bool
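The output above shows both predict() and predict_proba() for the same row. As a minimal sketch of how the two relate, the predicted label is the class with the highest probability; for binary classification this matches a 0.5 threshold. The dictionary below copies the logged probabilities, with the leading space in the class names taken from the label values reported during training.

```python
# One row of predict_proba() output, copied from the log above.
proba_row = {" <=50K": 0.934742, " >50K": 0.065258}

# The predicted label is the class with the largest probability.
predicted_label = max(proba_row, key=proba_row.get)
print(repr(predicted_label))
```

Keeping the probabilities is useful when the two kinds of error have different costs, since you can then apply your own decision threshold instead of the default.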
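Image classification prediction task
To classify images, AutoGluon can automatically produce high-quality image classification models. It trains highly accurate neural networks on the image dataset you provide and automatically applies accuracy-boosting techniques such as transfer learning and hyperparameter optimization on your behalf.
- Example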
import os
import time

import cv2
import fire
import numpy as np
import requests
from autogluon.vision import ImageDataset, ImagePredictor
from tensorflow.keras.datasets import mnist


class MnistDataSets:
    """Dataset and label configuration."""
    datasets_dir = "mnist_datasets"

    def download_mnist_data(self):
        """Load the official MNIST handwritten digit dataset and write it to per-label image folders."""
        (self.x_train, self.y_train), (self.x_test, self.y_test) = mnist.load_data()
        # print(self.x_train.shape)     # (60000, 28, 28)
        # print(self.y_train.shape)     # (60000,)
        # print(self.x_train[0].shape)  # (28, 28)
        # The training set contains 60000 images, each a single-channel 28x28 array.
        for label in self.label_mapping.keys():
            os.makedirs(name=f"{self.datasets_dir}/train/{label}", exist_ok=True)
            os.makedirs(name=f"{self.datasets_dir}/test/{label}", exist_ok=True)
        train_length = self.x_train.shape[0]
        # train_length = 3000
        test_length = self.x_test.shape[0]
        # test_length = 1000
        for index in range(train_length):
            cv2.imwrite(filename=f"{self.datasets_dir}/train/{self.y_train[index]}/{time.time()}.jpg",
                        img=self.x_train[index])
            # break
        for index in range(test_length):
            cv2.imwrite(filename=f"{self.datasets_dir}/test/{self.y_test[index]}/{time.time()}.jpg",
                        img=self.x_test[index])
            # break

    @property
    def label_mapping(self):
        """Label mapping."""
        return {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 0: 0}

    def get_online_test_data(self):
        """
        Fetch one handwritten digit image online and preprocess it.
        :return: the preprocessed grayscale image and its label
        """
        label = 3
        url = "https://img1.baidu.com/it/u=3472197447,93830654&fm=253&fmt=auto&app=138&f=JPEG?w=500&h=281"
        image = requests.get(url).content
        # np.fromstring is deprecated for binary data; np.frombuffer is the replacement
        nparr = np.frombuffer(image, np.uint8)
        gray = cv2.imdecode(nparr, cv2.IMREAD_GRAYSCALE)
        gray = cv2.resize(gray, (28, 28))
        _, gray = cv2.threshold(gray, thresh=165, maxval=255, type=cv2.THRESH_BINARY)
        return gray, label

    def load_data(self):
        """
        Load the dataset from the image folders.
        :return: training data and test data
        """
        train_data, val_data, test_data = ImageDataset.from_folders(root=self.datasets_dir, train="train", test="test")
        print("Training dataset:\n", train_data)
        print("Test dataset:\n", test_data)
        print("Validation dataset:\n", val_data)
        return train_data, test_data


class Model(MnistDataSets):
    model_path = "mnist-model"

    def __init__(self):
        self.predictor: ImagePredictor = None

    def train(self):
        """
        Train the model.
        :return:
        """
        train_data, test_data = self.load_data()
        self.predictor = ImagePredictor()
        print("Starting model training....")
        self.predictor.fit(train_data, hyperparameters={'epochs': 10})
        print("Saving model....")
        self.predictor.save(self.model_path)
        print("Evaluating model....")
        evaluate = self.predictor.evaluate(test_data)
        print("Model evaluation results:\n", evaluate)

    def predict(self):
        gray, label = self.get_online_test_data()
        self.predictor = ImagePredictor.load(self.model_path)
        print(self.predictor.list_models())
        # Save the preprocessed image itself. The original passed the JPEG-encoded
        # buffer from cv2.imencode to cv2.imwrite, which stores the byte stream as pixels.
        cv2.imwrite("3.jpg", gray)
        start_time = time.time()
        pred = self.predictor.predict("./3.jpg")
        print("inference time: ", time.time() - start_time)  # very slow here, so the library is not ideal for this use
        print(pred)


if __name__ == '__main__':
    fire.Fire(Model())
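The preprocessing in get_online_test_data above binarizes the downloaded image with cv2.threshold and THRESH_BINARY. As a minimal sketch of what that step does, the pure-Python function below applies the same rule to a nested list: a pixel becomes maxval if it is strictly greater than thresh, otherwise 0. The function name threshold_binary and the sample values are illustrative only.

```python
def threshold_binary(image, thresh=165, maxval=255):
    """Binary threshold: pixel > thresh maps to maxval, everything else to 0."""
    return [[maxval if pixel > thresh else 0 for pixel in row] for row in image]


# A tiny 2x3 grayscale "image" used only for illustration.
sample = [[0, 120, 165],
          [166, 200, 255]]
binary = threshold_binary(sample)
print(binary)  # [[0, 0, 0], [255, 255, 255]]
```

Note that a pixel equal to the threshold (165 here) maps to 0. A threshold this high keeps only bright strokes, so it works best when the digit is lighter than the background, as in MNIST.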
- Run
$ python3 predictor-image.py download_mnist_data
$ python3 predictor-image.py train
Starting model training....
`time_limit=auto` set to `time_limit=7200`.
Reset labels to [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Randomly split train_data into train[54000]/validation[6000] splits.
No GPU detected/allowed, using most conservative search space.
Starting fit without HPO
modified configs(<old> != <new>): {
root.img_cls.model resnet101 != resnet18
root.train.early_stop_baseline 0.0 != -inf
root.train.batch_size 32 != 16
root.train.early_stop_patience -1 != 10
root.train.early_stop_max_value 1.0 != inf
root.train.epochs 200 != 10
root.gpus (0,) != ()
root.misc.seed 42 != 48
}
Saved config to /Users/rockontrol/Desktop/python_code/code/test/autopluon/e4f22295/.trial_0/config.yaml
Model resnet18 created, param count: 11181642
AMP not enabled. Training in float32....
Epoch[0] Batch [2349] Speed: 5.057824 samples/sec accuracy=0.626835 lr=0.000100
Epoch[0] Batch [2399] Speed: 5.103389 samples/sec accuracy=0.630052 lr=0.000100
Epoch[0] Batch [2449] Speed: 5.524156 samples/sec accuracy=0.632398 lr=0.000100
`time_limit=7199.991618871689` reached, exit early...
Finished, total runtime is 7260.00 s
{ 'best_config': { 'batch_size': 16,
                   'dist_ip_addrs': None,
                   'early_stop_baseline': -inf,
                   'early_stop_max_value': inf,
                   'early_stop_patience': 10,
                   'epochs': 10,
                   'final_fit': False,
                   'gpus': [],
                   'log_dir': '/Users/rockontrol/Desktop/python_code/code/test/autopluon/e4f22295',
                   'lr': 0.01,
                   'model': 'resnet18',
                   'ngpus_per_trial': 0,
                   'nthreads_per_trial': 32,
                   'num_trials': 1,
                   'num_workers': 4,
                   'problem_type': 'multiclass',
                   'scheduler': 'local',
                   'search_strategy': 'random',
                   'searcher': 'random',
                   'seed': 48,
                   'time_limits': 7200,
                   'wall_clock_tick': 1642659798.8088949},
  'total_time': 7200.409552574158,
  'train_acc': 0.6342192524115756,
  'valid_acc': -inf}
Saving model....
Evaluating model....
[Epoch 0] validation: top1=0.948200 top5=0.999200
Model evaluation results:
 {'loss': 0.29002630821466446, 'top1': 0.9482, 'top5': 0.9992}
$ python3 predictor-image.py predict
inference time: 23.1709041595459
0 0
Name: label, dtype: int64
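The training log above ends with valid_acc at -inf, even though 10 epochs were requested. A rough calculation from the logged numbers suggests why: on CPU the 7200 s time limit was reached before the first epoch finished, so no validation pass ran during fit(). The sketch below assumes the batch index in the log is zero-based and that every batch holds 16 samples.

```python
import math

# Numbers taken from the training log above.
train_images = 54000      # train split after the 54000/6000 split
batch_size = 16
time_limit_s = 7200
last_logged_batch = 2449  # assumed zero-based, from "Epoch[0] Batch [2449]"

batches_per_epoch = math.ceil(train_images / batch_size)
batches_done = last_logged_batch + 1
fraction_of_epoch = batches_done / batches_per_epoch
implied_speed = batches_done * batch_size / time_limit_s  # samples per second

print(batches_per_epoch)            # 3375
print(round(fraction_of_epoch, 3))  # 0.726
print(round(implied_speed, 2))      # 5.44
```

The implied throughput of about 5.4 samples per second agrees with the logged Speed values, so at this rate 10 full epochs would take well over a day on this machine. A GPU or a much smaller training subset would be needed to run the requested schedule.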
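Object detection prediction task
- Given the poor inference performance of the classification model and the not very friendly API design, object detection was not tested.
Text data prediction task
- It should perform similarly to the tabular data case.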