Auto-sklearn on Ubuntu 14: Installation and Debugging Summary

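1. Overview
Purpose: my company needed an algorithm comparison, and I used the opportunity to learn auto-sklearn along the way. It has also been a long time since I last updated this blog. This write-up focuses on practical use; I will dig into the theory later when I have time.
2. Principles
2.1 What is auto-sklearn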

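Figure 1: Auto-sklearn framework structure (taken from the 2015 paper, when only classification was supported; current versions also support regression)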

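Auto-sklearn is an automated machine learning framework whose structure is shown in Figure 1. The user only supplies data and labels; the framework automatically handles data preprocessing, feature preprocessing, and (classification/regression) algorithm selection, and the resulting model can be exported, stored, and reused.
auto-sklearn won the machine learning blog contest held by KDnuggets. Other popular automated machine learning frameworks include auto_ml and TPOT.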

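2.2 A brief introduction to the principles
Auto-sklearn optimizes hyperparameters with Bayesian optimization, which repeatedly iterates over the following steps:
1) Build a probabilistic model that captures the relationship between hyperparameter settings and machine learning performance.
2) Use this model to pick promising hyperparameter settings to try next, trading off exploration against exploitation. Exploration means probing regions the model knows little about; exploitation means concentrating on regions already known to perform well.
3) Run the machine learning algorithm with the chosen hyperparameter settings.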
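
The three steps above can be sketched as a toy loop. This is a minimal illustration using only the standard library, not auto-sklearn's or SMAC's implementation: the objective function, the nearest-neighbour surrogate, and the names objective, surrogate, optimize, and kappa are all assumptions made up for the example.

```python
import random


def objective(x):
    """Hypothetical validation loss of one hyperparameter x in [0, 1]."""
    return (x - 0.3) ** 2


def surrogate(x, history):
    """Step 1 (toy model): predict the loss of the nearest evaluated point
    and use the distance to that point as a crude uncertainty estimate."""
    nearest_x, nearest_loss = min(history, key=lambda h: abs(h[0] - x))
    return nearest_loss, abs(nearest_x - x)


def optimize(n_iter=30, n_candidates=200, kappa=1.0, seed=0):
    rng = random.Random(seed)
    # Start from a few random evaluations so the surrogate has data.
    history = [(x, objective(x)) for x in (rng.random() for _ in range(3))]
    for _ in range(n_iter):
        candidates = [rng.random() for _ in range(n_candidates)]

        def acquisition(x):
            # Step 2: a low predicted loss rewards exploitation,
            # a large uncertainty bonus rewards exploration.
            mean, uncertainty = surrogate(x, history)
            return mean - kappa * uncertainty

        x_next = min(candidates, key=acquisition)
        # Step 3: run the (normally expensive) evaluation with that setting.
        history.append((x_next, objective(x_next)))
    best_x, best_loss = min(history, key=lambda h: h[1])
    return best_x, best_loss, history
```

Raising kappa makes the loop explore more; setting it to 0 makes it purely exploitative. In a real system the surrogate would be a proper probabilistic model (random forests in SMAC's case) rather than this nearest-neighbour stand-in.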

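The process is explained further below:
It can be summarized as jointly selecting the algorithm, the preprocessing method, and their hyperparameters. Concretely, the choice of classifier/regressor and the choice of preprocessing method are top-level categorical hyperparameters, and the hyperparameters of a method become active only when that method is selected. Bayesian optimization is used to search this combined space, which calls for a variant that copes with high-dimensional, conditional spaces. Auto-sklearn uses SMAC, which is based on random forests and has been reported to work best for this kind of problem.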
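
To make the idea of a conditional search space concrete, the sketch below samples configurations in which the top-level choices (preprocessor, classifier) decide which hyperparameters exist at all. The space, its value ranges, and the names SPACE and sample_configuration are hypothetical and far smaller than auto-sklearn's real space; plain random sampling stands in for SMAC here.

```python
import random

# Hypothetical search space: each top-level choice maps a method name to
# the hyperparameters (name -> (low, high)) that the method activates.
SPACE = {
    "preprocessor": {
        "none": {},
        "pca": {"keep_variance": (0.5, 1.0)},
    },
    "classifier": {
        "random_forest": {"max_features": (0.1, 1.0)},
        "svm": {"C": (0.01, 100.0), "gamma": (0.0001, 1.0)},
    },
}


def sample_configuration(rng):
    """Pick one method per top-level choice, then sample only the
    hyperparameters that belong to the picked methods."""
    config = {}
    for choice, methods in SPACE.items():
        method = rng.choice(sorted(methods))
        config[choice] = method
        for name, (low, high) in methods[method].items():
            key = "{}:{}:{}".format(choice, method, name)
            config[key] = rng.uniform(low, high)
    return config
```

A configuration that selects svm contains C and gamma but no max_features entry, which is the "activation" behaviour described above.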
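In practical terms, Auto-sklearn is a drop-in replacement for a scikit-learn estimator, so scikit-learn must be installed to make use of it. Auto-sklearn also supports parallel execution through a shared file system, and it can use scikit-learn's model persistence features.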

References:
https://www.leiphone.com/news/201701/dKfVIWiDaWvdMqKu.html?winzoom=1&viewType=weixin

https://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning.pdf

3. Installation
3.1 System requirements
• Linux operating system (for example Ubuntu)
• Python (>=3.5)
• C++ compiler (with C++11 support) and SWIG (version 3.0 or later)
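
A quick way to check these requirements from Python is sketched below. It only tests the interpreter version and whether the tools are on PATH; it does not verify C++11 support or the SWIG version. The function name check_requirements and the tool names g++ and swig are assumptions chosen for this example.

```python
import shutil
import sys


def check_requirements(min_python=(3, 5), tools=("g++", "swig")):
    """Return a dict mapping each requirement to True/False."""
    label = "python>={}.{}".format(*min_python)
    report = {label: sys.version_info[:2] >= min_python}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        report[tool] = shutil.which(tool) is not None
    return report
```

Printing the returned dict before starting the installation shows at a glance which prerequisite is still missing.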
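3.2 Installation process
This test uses Ubuntu 14.04 with gcc version 4.8.4.
3.2.1 Upgrading to Python 3.5
Ubuntu 14 ships Python 2.7 as its default python, so per the requirements in 3.1 it must be upgraded to Python 3.5 or later.

Add the PPA: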

sudo add-apt-repository ppa:fkrull/deadsnakes
sudo apt-get update

Install Python 3.5:

sudo apt-get install python3.5
sudo apt-get install python3.5-dev
sudo apt-get install libncurses5-dev

Retire the original Python 3.4 and link python3 to the new 3.5:

sudo mv /usr/bin/python3 /usr/bin/python3-old
sudo ln -s /usr/bin/python3.5 /usr/bin/python3

Install the latest pip:

wget https://bootstrap.pypa.io/get-pip.py
sudo python3 get-pip.py
sudo pip3 install setuptools --upgrade
sudo pip3 install ipython[all]

Retire the original Python 2.7 and link python to the new 3.5:

sudo mv /usr/bin/python /usr/bin/python-old
sudo ln -s /usr/bin/python3.5 /usr/bin/python

Reference:
https://www.jianshu.com/p/4f4b2ed568f4

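3.2.2 Installing SWIG 3
Following the official instructions (sudo apt-get install build-essential swig) installs SWIG 2 by default, and the later pyrfr installation then fails with "Can not install pyrfr , error: command 'swig' failed with exit". Following the answer to that error in https://github.com/automl/auto-sklearn/issues/314, install SWIG 3 instead: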

sudo apt-get install swig3.0
sudo ln -s /usr/bin/swig3.0 /usr/bin/swig

The first command installs SWIG 3; the second adds a symlink named swig that points to it.
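
To confirm that the swig command now resolves to version 3, the version string printed by swig -version can be parsed. The sketch below assumes the output contains a line of the form "SWIG Version 3.0.2"; the helper names are made up for this example.

```python
import re
import subprocess


def parse_swig_version(output):
    """Extract (major, minor) from text such as 'SWIG Version 3.0.2'."""
    match = re.search(r"SWIG Version (\d+)\.(\d+)", output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def installed_swig_version(executable="swig"):
    """Run `<executable> -version` and parse it; None if it cannot be run."""
    try:
        output = subprocess.check_output(
            [executable, "-version"], universal_newlines=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_swig_version(output)
```

If installed_swig_version() returns None or a tuple smaller than (3, 0), the symlink is missing or still points at SWIG 2, and the pyrfr build is likely to fail again.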
3.2.3 Installing auto-sklearn

Install all of auto-sklearn's dependencies:

curl https://raw.githubusercontent.com/automl/auto-sklearn/master/requirements.txt | xargs -n 1 -L 1 sudo pip install

Install auto-sklearn:

sudo pip install auto-sklearn

Note: after installation, training with fit fails with the error "Error when using rf_with_instances.py". The fix is to modify the source of rf_with_instances.py inside SMAC3. Reference:
https://github.com/automl/SMAC3/issues/298
Proceed as follows:

cd /usr/local/lib/python3.5/dist-packages/smac/epm
sudo vim rf_with_instances.py

Change the line

self.rf.fit(data, rng=self.rng)

to

self.rf.fit(data, self.rng)
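
The same one-line change can be applied with a short script instead of editing in vim. This is only a sketch of the substitution described above; the function names are assumptions, and the file path has to be passed in by the caller (on the setup described here it would be rf_with_instances.py under /usr/local/lib/python3.5/dist-packages/smac/epm, edited with root privileges).

```python
OLD_CALL = "self.rf.fit(data, rng=self.rng)"
NEW_CALL = "self.rf.fit(data, self.rng)"


def patch_source(text):
    """Return (patched_text, number_of_replaced_calls)."""
    count = text.count(OLD_CALL)
    return text.replace(OLD_CALL, NEW_CALL), count


def patch_file(path):
    """Rewrite the file in place only if the old call is present."""
    with open(path, encoding="utf-8") as handle:
        original = handle.read()
    patched, count = patch_source(original)
    if count:
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(patched)
    return count
```

Because patch_file returns 0 when the old call is absent, running it twice, or against a SMAC3 version that does not contain this line, leaves the file untouched.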

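4. Debugging
Auto-sklearn currently supports only supervised classification and regression (the official site says it hopes to support deep learning and more in the future).
Two experiments were run. The first applies auto-sklearn classification to handwritten digit recognition (see 4.1). The second applies auto-sklearn regression to predicting the number of IPTV users, 10 minutes ahead only (see 4.2).
4.1 Handwritten digit recognition (classification)
The official example, run in ipython, trains a classifier for handwritten digit recognition. It reaches 99.3% accuracy and takes 1 hour to run.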

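To shorten the run time, set the parameters time_left_for_this_task and per_run_time_limit. time_left_for_this_task is the total time budget for the task in seconds; the default is 3600, which is why the example runs for 1 hour. per_run_time_limit is the time limit in seconds for a single model-fitting run. For example, time_left_for_this_task can be set to 60 seconds and per_run_time_limit to 10 seconds.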
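
A minimal sketch of such a time-limited run, modelled on the digits example and using the 60 s / 10 s settings mentioned above, could look like this. The wrapper function run_digits_example and its argument names are made up for this note; with such a short budget the accuracy should not be expected to match the 1-hour result.

```python
def run_digits_example(total_seconds=60, per_run_seconds=10):
    """Train auto-sklearn on the scikit-learn digits data with a short
    time budget and return the accuracy on the held-out split."""
    # Imported inside the function so that merely defining it does not
    # require auto-sklearn to be installed.
    import autosklearn.classification
    import sklearn.datasets
    import sklearn.metrics
    import sklearn.model_selection

    digits = sklearn.datasets.load_digits()
    X_train, X_test, y_train, y_test = \
        sklearn.model_selection.train_test_split(
            digits.data, digits.target, random_state=1)
    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=total_seconds,  # total budget in seconds
        per_run_time_limit=per_run_seconds,     # limit per single model fit
    )
    automl.fit(X_train, y_train)
    y_hat = automl.predict(X_test)
    return sklearn.metrics.accuracy_score(y_test, y_hat)
```

Calling run_digits_example() uses the short budget; run_digits_example(3600, 360) would correspond to the settings used for the 1-hour regression run in 4.2.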

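By default the API limits memory use to 3 GB; this is configurable, see the API documentation: http://automl.github.io/auto-sklearn/stable/api.html
For real applications, the official recommendation is to let auto-sklearn run for 24 hours, or longer if possible.

auto-sklearn models are saved the same way as scikit-learn models, which 4.2 also relies on: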

from sklearn.externals import joblib
…………
automl.fit(X_train, y_train)
joblib.dump(automl, '.\\model\\automl_train.m')
…………
automl_new = joblib.load('.\\model\\automl_train.m')

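4.2 IPTV user count prediction (regression)
Section 4.1 was a quick trial of the official example. This case predicts the number of IPTV users (a regression problem): data is read from the MySQL database at 10.10.1.209 (internal company production data) and fed to auto-sklearn for automated machine learning.

The core code is the training and prediction logic in the full listing at the end of this section.

With just 1 minute of training, prediction yields an MSE (mean squared error) of 0.0007; the GBDT model tried earlier gave an MSE of 0.0006.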
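
For reference, the MSE is the average of the squared differences between the true and predicted targets. The sketch below computes it by hand on made-up numbers; the values are hypothetical and are not taken from the IPTV data.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared) / len(squared)


# Hypothetical targets and predictions, for illustration only.
y_true = [1.00, 0.98, 1.02, 1.01]
y_pred = [1.01, 0.99, 1.00, 1.01]
mse = mean_squared_error(y_true, y_pred)  # about 0.00015
```

Because the errors are squared, the MSE is not in the same unit as the target; taking the square root of the reported 0.0007 gives a root mean squared error of roughly 0.026.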
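The show_models() method shows that two algorithms were tried in total: first SVR (the regression variant of support vector machines), then GBDT.
With a longer run time, auto-sklearn can try more data preprocessing methods, more feature preprocessing methods, and cycle through more machine learning algorithms.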

The complete code is as follows (1 hour of training):

# Using auto-sklearn
from sklearn import ensemble
from sklearn.externals import joblib
import numpy as np
import pandas as pd
import MySQLdb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
#import autosklearn.classification
import autosklearn.regression
import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics


def isholiday(data):
    # Holiday weight for a date string in 'YYYY-MM-DD' form.
    if data == '2017-04-01':
        return 0.2
    if data == '2017-04-02':
        return 1
    if data == '2017-04-03':
        return 1
    if data == '2017-04-04':
        return 0.8
    if data == '2017-04-28':
        return 0.2
    if data == '2017-04-29':
        return 1
    if data == '2017-04-30':
        return 1
    if data == '2017-05-01':
        return 0.8
    else:
        return 0


def getResult():
    # Read the November 2017 per-minute online-user counts from MySQL.
    db = MySQLdb.connect("10.10.1.209", "root", "123456", "db_iptv_10")
    cursor = db.cursor()
    sql = "select time, online_number from hh_online_smooth_gdbt_province_minute where SUBSTR(time, 1, 6)='201711';"
    cursor.execute(sql)
    rows = cursor.fetchall()
    return rows


def getFeatures(time):
    feature_one = []
    time = pd.to_datetime(time)
    # Index of the 10-minute slot within the day: hour * 6 + minute / 10.
    time_index = int(str(time).split(' ')[1].split(':')[0]) * 6 + \
        int(str(time).split(' ')[1].split(':')[1]) / int(10)
    workday = time.isoweekday()
    # isholiday expects 'YYYY-MM-DD', so pass only the date part
    # (str(time) would also contain the clock time and never match).
    holiday = isholiday(str(time.date()))
    feature_one.append([time_index, workday, holiday])
    return feature_one


def getCollection():
    rows = getResult()
    data_column = []
    time_column = []
    for row in rows:
        data_column.append(int(row[1]))
        time_column.append(pd.to_datetime(str(row[0])))
    data = pd.DataFrame(data_column, index=time_column)
    print('------------------')
    #print(data)
    #shift_data = data
    # Target at time t is the ratio x_t / x_(t+1).
    shift_data = (data.shift() / data).shift(-1)
    #print(data.shift())
    #print(shift_data)
    # print rows
    collections = []
    for time in shift_data.index:
        time_index = int(str(time).split(' ')[1].split(':')[0]) * 6 + \
            int(str(time).split(' ')[1].split(':')[1]) / int(10)
        weekday = time.isoweekday()
        holiday = isholiday(str(time.date()))
        online_number = shift_data.ix[time].values[0]
        collections.append([time_index, weekday, holiday, online_number])
    collections = np.array(collections)
    # Drop rows whose target is NaN (no following sample to form the ratio).
    repetion_index = []
    for i in range(collections.shape[0]):
        if str(collections[i,3]) == 'nan':
            repetion_index.append(i)
    collections = np.delete(collections, repetion_index, axis=0)
    return collections


def train_model():
    collections = getCollection()
    print(collections)
    features = collections[:,:3]
    #print('---------')
    #print(features)
    ff = pd.DataFrame(features, columns=['index', 'weekday', 'holiday'])
    targets = collections[:,3:]
    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(features, targets, random_state=1)
    #print(y_train)
    automl = autosklearn.regression.AutoSklearnRegressor(time_left_for_this_task=3600, per_run_time_limit=360)
    automl.fit(X_train, y_train)
    joblib.dump(automl, '.\\model\\automl_train.m')
    y_hat = automl.predict(X_test)
    #print(y_hat)
    print("model", automl.show_models())
    print("mean_squared_error", sklearn.metrics.mean_squared_error(y_test, y_hat))
    print("mean_absolute_error", sklearn.metrics.mean_absolute_error(y_test, y_hat))
    print("median_absolute_error", sklearn.metrics.median_absolute_error(y_test, y_hat))


def predict_model(time, number):
    feature = []
    automl = joblib.load('.\\model\\automl_train.m')
    time = pd.to_datetime(time)
    time_index = int(str(time).split(' ')[1].split(':')[0]) * 6 + int(str(time).split(' ')[1].split(':')[1]) / int(10)
    weekday = time.isoweekday()
    holiday = isholiday(str(time.date()))
    feature = [[time_index, weekday, holiday]]
    y_predict = automl.predict(np.array(feature))
    #print(y_predict)
    # The model predicts x_t / x_(t+1), so the next count is x_t / ratio.
    data = round(number/y_predict[0])
    print(feature, data)


if __name__ == '__main__':
    train_model()
    predict_model('201712131340', 495183)
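
Note that the listing trains on a ratio rather than on the raw user count: (data.shift() / data).shift(-1) pairs each timestamp t with x_t / x_(t+1), and predict_model divides the current count by the predicted ratio to recover the next count. The pure-Python sketch below reproduces that transformation on made-up numbers; the function names and sample values are assumptions for illustration only.

```python
def ratio_targets(counts):
    """Target for position t is counts[t] / counts[t + 1]. The last position
    has no successor and is dropped (the NaN row removed in the listing)."""
    return [counts[t] / counts[t + 1] for t in range(len(counts) - 1)]


def next_count(current_count, predicted_ratio):
    """Invert the ratio to get the forecast for the next 10-minute slot."""
    return round(current_count / predicted_ratio)


# Hypothetical online-user counts at three consecutive 10-minute slots.
counts = [400, 500, 625]
targets = ratio_targets(counts)
forecast = next_count(500, targets[1])
```

Predicting a ratio close to 1 instead of an absolute count keeps the target on a similar scale across the day, which also explains why the reported MSE values are so small.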
