AutoGluon

Supported tasks

Installation

$ pip install -U "mxnet<2.0.0" -i https://pypi.tuna.tsinghua.edu.cn/simple # CPU version
$ pip install -U "mxnet_cu101" -i https://pypi.tuna.tsinghua.edu.cn/simple # GPU version (CUDA 10.1 build)
$ pip install autogluon -i https://pypi.tuna.tsinghua.edu.cn/simple
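
After running the commands above, it can help to confirm which packages actually ended up in the environment. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, and the distribution names checked are assumed to match the pip commands above (the GPU build is published as `mxnet-cu101`, not `mxnet`).

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names are assumptions based on the pip commands above.
    for name in ("mxnet", "mxnet-cu101", "autogluon"):
        version = installed_version(name)
        print(f"{name}: {version if version else 'not installed'}")
```

Only one of `mxnet` and `mxnet-cu101` is expected to be present, depending on which command was used.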

Hands-on

Tabular prediction

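For standard datasets represented as tables (stored as CSV files, pulled from a database, and so on), AutoGluon can build models that predict the values in one column from the values in the other columns.
This lets you reach high accuracy on standard supervised learning tasks (classification and regression) without dealing with tedious steps such as data cleaning, feature engineering, hyperparameter optimization, and model selection.
  • Example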
import time

import fire
from autogluon.tabular import TabularDataset, TabularPredictor


class DataFrameDataset:
    label = "class"  # label column of the tabular dataset

    def train_data(self):
        """
        Load the training set (also available online at the commented-out URL).
        The task is to predict whether a person's income exceeds 50K USD.
        The dataset is returned as a DataFrame.
        :return:
        """
        # train_data_ = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
        train_data_ = TabularDataset('./train.csv')
        print("Train Data:\n", train_data_.head())
        return train_data_

    def test_data(self):
        """
        Load the test set (also available online at the commented-out URL).
        :return:
        """
        # test_data_ = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
        test_data_ = TabularDataset('./test.csv')
        print("Test Data:\n", test_data_.head())
        y_test_ = test_data_[self.label]
        test_data_no_label_ = test_data_.drop(columns=[self.label])
        return test_data_, test_data_no_label_, y_test_

    def user_data(self):
        """The user's own test data (here, the first row of the test set)."""
        # test_data_ = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')[0]
        test_data_ = TabularDataset('./test.csv').head(1)
        y_test_ = test_data_[self.label]
        test_data_no_label_ = test_data_.drop(columns=[self.label])
        return test_data_no_label_, y_test_


class Model:
    dataset = DataFrameDataset()
    model_path = "table_predictor"

    def __init__(self):
        self.predictor: TabularPredictor = None

    def train(self, eval_metric="roc_auc", presets="medium_quality_faster_train", time_limit=60, holdout_frac=0.1):
        """
        Commonly tuned parameters.
        :param eval_metric: evaluation metric
            f1: binary classification
            roc_auc: binary classification
            log_loss: classification
            mean_absolute_error: regression
            median_absolute_error: regression
        :param presets: training presets
            best_quality: trade training time for the most accurate model
            medium_quality_faster_train: trade quality for fast training
            good_quality_faster_inference_only_refit: a reasonably good model with relatively fast inference
            optimize_for_deployment:
        :param time_limit: training time limit, in seconds
        :param holdout_frac: fraction of the training set to hold out as validation data
        :param hyperparameters: lets you define the search space (see the SDK docs for details),
            for example the number of training iterations such as num_epochs
        :return:
        """
        print("Start training ..........")
        self.predictor = TabularPredictor(label=self.dataset.label, path=self.model_path, eval_metric=eval_metric)
        train_data = self.dataset.train_data()
        self.predictor.fit(train_data, time_limit=time_limit, excluded_model_types=['KNN', 'NN', 'custom'],
                           presets=presets, holdout_frac=holdout_frac)

        print("Evaluating model ..........")
        test_data, test_data_no_label, y_test = self.dataset.test_data()
        y_pred = self.predictor.predict_proba(test_data_no_label)
        evaluate = self.predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)
        board = self.predictor.leaderboard(test_data, silent=True)
        print("Evaluation results:\n", evaluate)
        print("Model performance on the test set:\n", board)
        print("Feature importance:\n", self.predictor.feature_importance(data=train_data))
        print("Model info:\n", self.predictor.info())
        # Keep the best model and delete the others.
        self.predictor.delete_models(models_to_keep='best', dry_run=False)

    def predict(self):
        """
        Run a prediction.
        get_model_best(): returns the name of the best model
        predict(): returns the predicted label
        predict_proba(): returns the probability of each label
        :return:
        """
        self.predictor = TabularPredictor.load(self.model_path)
        best_model = self.predictor.get_model_best()
        print("Best Model:\n", best_model)
        test_data_no_label, y_test = self.dataset.user_data()
        start_time = time.time()
        y_pred = self.predictor.predict(test_data_no_label, model=best_model)
        print("inference time: ", time.time() - start_time)
        y_pred_prob = self.predictor.predict_proba(test_data_no_label, model=best_model)
        print("y_test: ", y_test)
        print("y_pred: ", y_pred)
        print(y_pred_prob)
        print("Prediction correct:", y_test == y_pred)


if __name__ == '__main__':
    fire.Fire(Model())
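
The `predict` method above times a single call with `time.time()`, so the reported figure may include one-off start-up costs. A minimal sketch of a steadier measurement follows, assuming a few untimed warm-up calls and several timed repeats are acceptable; `time_call` is a hypothetical helper, not part of AutoGluon.

```python
import statistics
import time


def time_call(fn, *args, warmup=1, repeats=5, **kwargs):
    """Run fn(*args, **kwargs) repeatedly and return the per-call durations in seconds."""
    for _ in range(warmup):
        fn(*args, **kwargs)  # warm-up calls are executed but not timed
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Stand-in workload. In the script above this could instead be
    # lambda: predictor.predict(test_data_no_label, model=best_model)
    durations = time_call(sum, range(100_000), repeats=5)
    print(f"median: {statistics.median(durations):.6f}s  max: {max(durations):.6f}s")
```

Reporting the median rather than a single sample makes the number less sensitive to whatever else the machine is doing at that moment.
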
  • Run
$ python3 auto.py train
Start training ..........
Warning: path already exists! This predictor may overwrite an existing predictor! path="table_predictor"
Loaded data from: ./train.csv | Columns = 15 / 15 | Rows = 39073 -> 39073
Train Data:
   age   workclass  fnlwgt   education  education-num       marital-status          occupation    relationship    race      sex  capital-gain  capital-loss  hours-per-week  native-country   class
0   25     Private  178478   Bachelors             13        Never-married        Tech-support       Own-child   White   Female             0             0              40   United-States   <=50K
1   23   State-gov   61743     5th-6th              3        Never-married    Transport-moving   Not-in-family   White     Male             0             0              35   United-States   <=50K
2   46     Private  376789     HS-grad              9        Never-married       Other-service   Not-in-family   White     Male             0             0              15   United-States   <=50K
3   55           ?  200235     HS-grad              9   Married-civ-spouse                   ?         Husband   White     Male             0             0              50   United-States    >50K
4   36     Private  224541     7th-8th              4   Married-civ-spouse   Handlers-cleaners         Husband   White     Male             0             0              40     El-Salvador   <=50K
Presets specified: ['medium_quality_faster_train']
Beginning AutoGluon training ... Time limit = 60s
AutoGluon will save models to "table_predictor/"
AutoGluon Version:  0.3.1
Train Data Rows:    39073
Train Data Columns: 14
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
    2 unique label values:  [' <=50K', ' >50K']
    If 'binary' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 =  >50K, class 0 =  <=50K
    Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.
    To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
    Available Memory:                    2261.62 MB
    Train Data (Original)  Memory Usage: 22.92 MB (1.0% of available memory)
    Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
    Stage 1 Generators:
...
...
AutoGluon training complete, total runtime = 67.34s ...
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("table_predictor/")
Evaluating model ..........
Loaded data from: ./test.csv | Columns = 15 / 15 | Rows = 9769 -> 9769
Test Data:
   age          workclass  fnlwgt      education  education-num       marital-status        occupation relationship    race      sex  capital-gain  capital-loss  hours-per-week  native-country   class
0   31            Private  169085           11th              7   Married-civ-spouse             Sales         Wife   White   Female             0             0              20   United-States   <=50K
1   17   Self-emp-not-inc  226203           12th              8        Never-married             Sales    Own-child   White     Male             0             0              45   United-States   <=50K
2   47            Private   54260      Assoc-voc             11   Married-civ-spouse   Exec-managerial      Husband   White     Male             0          1887              60   United-States    >50K
3   21            Private  176262   Some-college             10        Never-married   Exec-managerial    Own-child   White   Female             0             0              30   United-States   <=50K
4   17            Private  241185           12th              8        Never-married    Prof-specialty    Own-child   White     Male             0             0              20   United-States   <=50K
Evaluation: roc_auc on test data: 0.9323364763680665
Evaluations on test data:
{
    "roc_auc": 0.9323364763680665,
    "accuracy": 0.8761388064284983,
    "balanced_accuracy": 0.8000729586881633,
    "mcc": 0.6412270975073234,
    "f1": 0.7151600753295669,
    "precision": 0.7870466321243523,
    "recall": 0.6553062985332183
}
Evaluation results:
 {'roc_auc': 0.9323364763680665, 'accuracy': 0.8761388064284983, 'balanced_accuracy': 0.8000729586881633, 'mcc': 0.6412270975073234, 'f1': 0.7151600753295669, 'precision': 0.7870466321243523, 'recall': 0.6553062985332183}
Model performance on the test set:
                 model  score_test  score_val  pred_time_test  pred_time_val   fit_time  pred_time_test_marginal  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0  WeightedEnsemble_L2    0.932336   0.935558        0.202042       0.067755  27.769457                 0.018307                0.001736           1.468465            2       True          9
1             CatBoost    0.931855   0.934702        0.038226       0.020100  25.316929                 0.038226                0.020100          25.316929            1       True          5
2             LightGBM    0.931088   0.934438        0.145509       0.045919   0.984063                 0.145509                0.045919           0.984063            1       True          2
3           LightGBMXT    0.928022   0.930726        0.345757       0.102603   2.624385                 0.345757                0.102603           2.624385            1       True          1
4      NeuralNetFastAI    0.914286   0.914985        0.178472       0.074495  15.291426                 0.178472                0.074495          15.291426            1       True          8
5     RandomForestGini    0.911646   0.910570        0.547053       0.109948   3.671543                 0.547053                0.109948           3.671543            1       True          3
6     RandomForestEntr    0.911283   0.911003        0.578620       0.111891   4.344845                 0.578620                0.111891           4.344845            1       True          4
7       ExtraTreesEntr    0.904868   0.905856        0.734225       0.175408   3.133456                 0.734225                0.175408           3.133456            1       True          7
8       ExtraTreesGini    0.904081   0.905642        1.049894       0.122711   2.265818                 1.049894                0.122711           2.265818            1       True          6
Computing feature importance via permutation shuffling for 14 features using 1000 rows with 3 shuffle sets...
    2.79s   = Expected runtime (0.93s per shuffle set)
    1.05s   = Actual runtime (Completed 3 of 3 shuffle sets)
Feature importance:
                importance    stddev   p_value  n  p99_high   p99_low
capital-gain      0.067067  0.006993  0.001802  3  0.107138  0.026997
age               0.041595  0.015031  0.020439  3  0.127725 -0.044534
relationship      0.022320  0.003899  0.005010  3  0.044662 -0.000022
education-num     0.021446  0.006913  0.016467  3  0.061057 -0.018166
occupation        0.020063  0.004353  0.007664  3  0.045004 -0.004877
marital-status    0.018524  0.004348  0.008936  3  0.043436 -0.006389
capital-loss      0.016525  0.003324  0.006609  3  0.035571 -0.002520
hours-per-week    0.014453  0.000872  0.000606  3  0.019451  0.009455
fnlwgt            0.010131  0.001359  0.002972  3  0.017917  0.002345
workclass         0.006703  0.001663  0.009949  3  0.016229 -0.002824
education         0.004152  0.000358  0.001231  3  0.006200  0.002103
native-country    0.004013  0.002588  0.057583  3  0.018840 -0.010815
sex               0.002201  0.000911  0.026324  3  0.007422 -0.003020
race              0.002195  0.001788  0.083653  3  0.012439 -0.008049
Model info:
...
Deleting model LightGBMXT. All files under table_predictor/models/LightGBMXT/ will be removed.
Deleting model RandomForestGini. All files under table_predictor/models/RandomForestGini/ will be removed.
Deleting model RandomForestEntr. All files under table_predictor/models/RandomForestEntr/ will be removed.
Deleting model ExtraTreesGini. All files under table_predictor/models/ExtraTreesGini/ will be removed.
Deleting model ExtraTreesEntr. All files under table_predictor/models/ExtraTreesEntr/ will be removed.
Deleting model NeuralNetFastAI. All files under table_predictor/models/NeuralNetFastAI/ will be removed.
$ python3 auto.py predict
Best Model:
 WeightedEnsemble_L2
inference time:  0.6184391975402832
y_test:  0     <=50K
Name: class, dtype: object
y_pred:  0     <=50K
Name: class, dtype: object
      <=50K      >50K
0  0.934742  0.065258
Prediction correct: 0    True
Name: class, dtype: bool
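
The auxiliary metrics in the evaluation output above (accuracy, balanced accuracy, MCC, F1, precision, recall) all follow from the four cells of a binary confusion matrix. The sketch below shows those standard formulas. The counts passed in are an assumption: the log does not print a confusion matrix, so they were back-derived from the reported metrics and the 9769 test rows. `binary_metrics` is a hypothetical helper name.

```python
import math


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute common binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "balanced_accuracy": (recall + specificity) / 2,
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
        "f1": 2 * precision * recall / (precision + recall),
        "precision": precision,
        "recall": recall,
    }


if __name__ == "__main__":
    # Assumed counts, inferred from the logged metrics rather than read from the log.
    for name, value in binary_metrics(tp=1519, fp=411, fn=799, tn=7040).items():
        print(f"{name}: {value:.6f}")
```

`roc_auc` is not included because it depends on the ranking of predicted probabilities, not on a single confusion matrix.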

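Image classification

To classify images, AutoGluon can automatically produce high-quality image classification models. It trains highly accurate neural networks on the image dataset you provide and, on your behalf, automatically applies accuracy-improving techniques such as transfer learning and hyperparameter optimization.
  • Example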
from autogluon.vision import ImageDataset, ImagePredictor
from tensorflow.keras.datasets import mnist
import abc
import pandas as pd
import os
import numpy as np
import requests
import cv2
import time


class MnistDataSets:
    """Dataset and label configuration."""
    datasets_dir = "mnist_datasets"

    def download_mnist_data(self):
        """Load the official handwritten-digit dataset and save it as JPEG files in per-label folders."""
        (self.x_train, self.y_train), (self.x_test, self.y_test) = mnist.load_data()
        # print(self.x_train.shape)     # (60000, 28, 28)
        # print(self.y_train.shape)     # (60000,)
        # print(self.x_train[0].shape)  # (28, 28)
        # The training set holds 60000 images, each a single-channel 28x28 array.
        for label in self.label_mapping.keys():
            os.makedirs(name=f"{self.datasets_dir}/train/{label}", exist_ok=True)
            os.makedirs(name=f"{self.datasets_dir}/test/{label}", exist_ok=True)
        train_length = self.x_train.shape[0]
        # train_length = 3000
        test_length = self.x_test.shape[0]
        # test_length = 1000
        for index in range(train_length):
            cv2.imwrite(filename=f"{self.datasets_dir}/train/{self.y_train[index]}/{time.time()}.jpg",
                        img=self.x_train[index])
            # break
        for index in range(test_length):
            cv2.imwrite(filename=f"{self.datasets_dir}/test/{self.y_test[index]}/{time.time()}.jpg",
                        img=self.x_test[index])
            # break

    @property
    def label_mapping(self):
        """Label mapping."""
        return {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 0: 0}

    def get_online_test_data(self):
        """
        Fetch a handwritten-digit image from the web and preprocess it.
        :return:
        """
        label = 3
        url = "https://img1.baidu.com/it/u=3472197447,93830654&fm=253&fmt=auto&app=138&f=JPEG?w=500&h=281"
        image = requests.get(url).content
        # np.frombuffer replaces np.fromstring, which is deprecated for binary data.
        nparr = np.frombuffer(image, np.uint8)
        gray = cv2.imdecode(nparr, cv2.IMREAD_GRAYSCALE)
        gray = cv2.resize(gray, (28, 28))
        _, gray = cv2.threshold(gray, thresh=165, maxval=255, type=cv2.THRESH_BINARY)
        return gray, label

    def load_data(self):
        """
        Load the dataset.
        :return:
        """
        train_data, val_data, test_data = ImageDataset.from_folders(root=self.datasets_dir, train="train", test="test")
        print("Training set:\n", train_data)
        print("Test set:\n", test_data)
        print("Label info:\n", val_data)
        return train_data, test_data


class Model(MnistDataSets):
    model_path = "mnist-model"

    def __init__(self):
        self.predictor: ImagePredictor = None

    def train(self):
        """
        Train the model.
        :return:
        """
        train_data, test_data = self.load_data()
        self.predictor = ImagePredictor()
        print("Start training model....")
        self.predictor.fit(train_data, hyperparameters={'epochs': 10})
        print("Saving model....")
        self.predictor.save(self.model_path)
        print("Evaluating model....")
        evaluate = self.predictor.evaluate(test_data)
        print("Evaluation results:\n", evaluate)

    def predict(self):
        gray, label = self.get_online_test_data()
        self.predictor = ImagePredictor.load(self.model_path)
        print(self.predictor.list_models())
        # Write the preprocessed image to disk. The original code passed the
        # cv2.imencode buffer to imwrite, which stores the compressed bytes as
        # pixel data instead of the image itself.
        cv2.imwrite("3.jpg", gray)
        start_time = time.time()
        pred = self.predictor.predict("./3.jpg")
        # This step is very slow, so the library is not ideal here.
        print("inference time: ", time.time() - start_time)
        print(pred)


if __name__ == '__main__':
    import fire
    fire.Fire(Model())
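
`download_mnist_data` above writes every image into `mnist_datasets/<split>/<label>/`, the one-folder-per-class layout that the script then passes to `ImageDataset.from_folders`. The sketch below reproduces only that directory layout with the standard library, so it can be inspected without downloading MNIST; `make_class_folders` and `count_images` are hypothetical helper names, and the `.jpg` placeholder is an empty file rather than a real image.

```python
import tempfile
from pathlib import Path


def make_class_folders(root, splits=("train", "test"), labels=range(10)):
    """Create root/<split>/<label>/ directories and return them in creation order."""
    created = []
    for split in splits:
        for label in labels:
            folder = Path(root) / split / str(label)
            folder.mkdir(parents=True, exist_ok=True)
            created.append(folder)
    return created


def count_images(root, split, suffix=".jpg"):
    """Count files with the given suffix in each label folder under root/<split>/."""
    return {
        folder.name: sum(1 for f in folder.iterdir() if f.suffix == suffix)
        for folder in sorted((Path(root) / split).iterdir())
        if folder.is_dir()
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        make_class_folders(tmp)
        # Empty placeholder standing in for a JPEG written by cv2.imwrite.
        (Path(tmp) / "train" / "3" / "example.jpg").touch()
        print(count_images(tmp, "train"))
```

Counting files per label folder is also a quick way to check that a download finished and that no class is empty before starting a long training run.
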
  • Run
$ python3 predictor-image.py download_mnist_data
$ python3 predictor-image.py train
Start training model....
`time_limit=auto` set to `time_limit=7200`.
Reset labels to [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Randomly split train_data into train[54000]/validation[6000] splits.
No GPU detected/allowed, using most conservative search space.
Starting fit without HPO
modified configs(<old> != <new>): {
root.img_cls.model   resnet101 != resnet18
root.train.early_stop_baseline 0.0 != -inf
root.train.batch_size 32 != 16
root.train.early_stop_patience -1 != 10
root.train.early_stop_max_value 1.0 != inf
root.train.epochs    200 != 10
root.gpus            (0,) != ()
root.misc.seed       42 != 48
}
Saved config to /Users/rockontrol/Desktop/python_code/code/test/autopluon/e4f22295/.trial_0/config.yaml
Model resnet18 created, param count:                                         11181642
AMP not enabled. Training in float32....
Epoch[0] Batch [2349]   Speed: 5.057824 samples/sec     accuracy=0.626835       lr=0.000100
Epoch[0] Batch [2399]   Speed: 5.103389 samples/sec     accuracy=0.630052       lr=0.000100
Epoch[0] Batch [2449]   Speed: 5.524156 samples/sec     accuracy=0.632398       lr=0.000100
`time_limit=7199.991618871689` reached, exit early...
Finished, total runtime is 7260.00 s
{ 'best_config': { 'batch_size': 16,
                   'dist_ip_addrs': None,
                   'early_stop_baseline': -inf,
                   'early_stop_max_value': inf,
                   'early_stop_patience': 10,
                   'epochs': 10,
                   'final_fit': False,
                   'gpus': [],
                   'log_dir': '/Users/rockontrol/Desktop/python_code/code/test/autopluon/e4f22295',
                   'lr': 0.01,
                   'model': 'resnet18',
                   'ngpus_per_trial': 0,
                   'nthreads_per_trial': 32,
                   'num_trials': 1,
                   'num_workers': 4,
                   'problem_type': 'multiclass',
                   'scheduler': 'local',
                   'search_strategy': 'random',
                   'searcher': 'random',
                   'seed': 48,
                   'time_limits': 7200,
                   'wall_clock_tick': 1642659798.8088949},
  'total_time': 7200.409552574158,
  'train_acc': 0.6342192524115756,
  'valid_acc': -inf}
Saving model....
Evaluating model....
[Epoch 0] validation: top1=0.948200 top5=0.999200
Evaluation results:
 {'loss': 0.29002630821466446, 'top1': 0.9482, 'top5': 0.9992}
$ python3 predictor-image.py predict
inference time:  23.1709041595459
0    0
Name: label, dtype: int64
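
The training log above reports roughly 5 samples/sec on CPU with 54000 training images, a batch size of 16, and a 7200 s time limit. A rough back-of-the-envelope check, sketched below, suggests that a single epoch would take longer than the time limit at that speed, which would be consistent with the run ending in `Epoch[0]` and with `valid_acc` being `-inf`. `epoch_estimate` is a hypothetical helper, and the estimate ignores validation and data-loading overhead.

```python
import math


def epoch_estimate(num_samples: int, batch_size: int, samples_per_sec: float):
    """Return (batches per epoch, estimated seconds per epoch)."""
    batches = math.ceil(num_samples / batch_size)
    seconds = num_samples / samples_per_sec
    return batches, seconds


if __name__ == "__main__":
    # Figures taken from the log: 54000 training images, batch size 16,
    # logged speeds between about 5.06 and 5.52 samples/sec, 7200 s limit.
    time_limit = 7200
    for speed in (5.06, 5.52):
        batches, seconds = epoch_estimate(54000, 16, speed)
        print(f"{speed} samples/s: {batches} batches/epoch, "
              f"~{seconds:.0f}s per epoch, {seconds / time_limit:.2f}x the time limit")
```

Under these assumptions, finishing the 10 requested epochs would need either a GPU or a much longer time limit.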

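Object detection

  • Given the classification model's poor inference performance and an API design that is not very friendly, I did not test object detection.

Text prediction

  • This should perform about as well as the tabular task.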
