How to Develop an End-to-End Machine Learning Project and Deploy It to Heroku with Flask

There's one question I always get asked regarding Data Science:

What is the best way to master Data Science? What will get me hired?

My answer remains constant: there is no alternative to working on portfolio-worthy projects.

Even after passing the TensorFlow Developer Certificate Exam, I’d say that you can only really prove your competency with projects that showcase your research, programming skills, mathematical background, and so on.

In my post on how to build an effective Data Science Portfolio, I shared many project ideas and other tips to prepare an awesome portfolio. This post is dedicated to one of those ideas: building an end-to-end data science/ML project.

Agenda

This tutorial is intended to walk you through all the major steps involved in completing an end-to-end Machine Learning project. For this project, I’ve chosen a supervised learning regression problem.

Here are the major topics covered:

  • Pre-requisites and Resources

  • Data Collection and Problem Statement

  • Exploratory Data Analysis with Pandas and NumPy

  • Data Preparation using Sklearn

  • Selecting and Training a few Machine Learning Models

  • Cross-Validation and Hyperparameter Tuning using Sklearn

  • Deploying the Final Trained Model on Heroku via a Flask App

Let’s start building…

Pre-requisites and Resources

To go through this project and tutorial, you should be familiar with Machine Learning algorithms, Python environment setup, and common ML terminology. Here are a few resources to get you started:

  • Read the first 2–3 chapters of The Hundred-Page Machine Learning Book: http://themlbook.com/wiki/doku.php

  • A list of tasks for almost every Machine Learning project — keep referring to this list while working on this (or any other) ML project.

  • You need a Python environment set up — a virtual environment dedicated to this project.

  • You should be familiar with Jupyter Notebook.

That’s it, so make sure you have an understanding of these concepts and tools and you’re ready to go!

Data Collection and Problem Statement

The first step is to get your hands on the data. But if you already have access to data (as most product-based companies do), then the first step is to define the problem that you want to solve. We don’t have the data yet, so we are going to collect it first.

We are using the Auto MPG dataset from the UCI Machine Learning Repository. Here is the link to the dataset:

  • http://archive.ics.uci.edu/ml/datasets/Auto+MPG

The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes.

Once you have downloaded the data, move it to your project directory, activate your virtualenv, and start the Jupyter local server.

You can also download the data into your project from the notebook using wget:

!wget "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"

The next step is to load this .data file into a pandas dataframe. For that, make sure you have pandas and other general use case libraries installed. Import all the general use case libraries like so:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Then read and load the file into a dataframe using the read_csv() method:

# defining the column names
cols = ['MPG','Cylinders','Displacement','Horsepower','Weight','Acceleration', 'Model Year', 'Origin']
# reading the .data file using pandas
df = pd.read_csv('./auto-mpg.data', names=cols, na_values = "?",comment = '\t',sep= " ",skipinitialspace=True)
#making a copy of the dataframe
data = df.copy()

Next, look at a few rows of the dataframe and read the description of each attribute from the website. This helps you define the problem statement.
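
A minimal sketch of that first look (standard pandas, nothing specific to this project):

##taking a quick look at the first few rows
data.head()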

Problem Statement — The data contains the MPG (Miles per Gallon) variable, which is continuous data and tells us about the fuel efficiency of a vehicle from the 70s and 80s.

Our aim here is to predict the MPG value for a vehicle, given the other attributes of that vehicle.

Exploratory Data Analysis with Pandas and NumPy

For this rather simple dataset, the exploration is broken down into a series of steps:

Check the data types of the columns

##checking the data info
data.info()

Check for null values

##checking for all the null values
data.isnull().sum()

The Horsepower column has 6 missing values. We’ll have to study the column a bit more.

Check for outliers in the Horsepower column

##summary statistics of quantitative variables
data.describe()

##looking at horsepower box plot
sns.boxplot(x=data['Horsepower'])

Since there are a few outliers, we can use the median of the column to impute the missing values using the pandas median() method.

##imputing the values with median
median = data['Horsepower'].median()
data['Horsepower'] = data['Horsepower'].fillna(median)
data.info()

Look at the category distribution in the categorical columns

##category distribution
data["Cylinders"].value_counts() / len(data)
data['Origin'].value_counts()

The 2 categorical columns are Cylinders and Origin, which only have a few categories of values. Looking at the distribution of the values among these categories tells us how the data is distributed:

Plot for correlations

##pairplots to get an intuition of potential correlations
sns.pairplot(data[["MPG", "Cylinders", "Displacement", "Weight", "Horsepower"]], diag_kind="kde")

The pair plot gives you a brief overview of how each variable behaves with respect to every other variable.

For example, the MPG column (our target variable) is negatively correlated with the displacement, weight, and horsepower features.

Set aside the test data set

This is one of the first things we should do, as we want to test our final model on unseen/unbiased data.

There are many ways to split the data into training and testing sets, but we want our test set to represent the overall population and not just a few specific categories. Thus, instead of using the simple and common train_test_split() method from sklearn, we use stratified sampling.

Stratified Sampling — We create homogeneous subgroups called strata from the overall population and sample the right number of instances for each stratum to ensure that the test set is representative of the overall population.

In the category distribution step above, we saw how the data is distributed over each category of the Cylinders column. We’re using the Cylinders column to create the strata:

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data["Cylinders"]):
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]

Checking the distribution in the training set:

##checking for cylinder category distribution in training set
strat_train_set['Cylinders'].value_counts() / len(strat_train_set)

And in the testing set:

strat_test_set["Cylinders"].value_counts() / len(strat_test_set)

You can compare these results with the output of train_test_split() to find out which one produces better splits.
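
As a minimal sketch (assuming the same dataframe and a random_state of 42), a plain random split can be produced and its Cylinders distribution compared against the stratified one. This is presumably also where the train_set used in the next subsection comes from:

##comparing stratified sampling with a plain random split
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

##proportions under a random split vs. the stratified split
test_set["Cylinders"].value_counts() / len(test_set)
strat_test_set["Cylinders"].value_counts() / len(strat_test_set)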

Checking the Origin Column

The Origin column, which describes where the vehicle comes from, has discrete values that look like country codes.

To add some complication and make it more explicit, I converted these numbers to strings:

##converting integer classes to countries in Origin column
train_set['Origin'] = train_set['Origin'].map({1: 'India', 2: 'USA', 3: 'Germany'})
train_set.sample(10)

We’ll have to preprocess this categorical column by one-hot encoding these values:

##one hot encoding
train_set = pd.get_dummies(train_set, prefix='', prefix_sep='')
train_set.head()

Testing for new variables — Analyze the correlation of each variable with the target variable

## testing new variables by checking their correlation w.r.t. MPG
data['displacement_on_power'] = data['Displacement'] / data['Horsepower']
data['weight_on_cylinder'] = data['Weight'] / data['Cylinders']
data['acceleration_on_power'] = data['Acceleration'] / data['Horsepower']
data['acceleration_on_cyl'] = data['Acceleration'] / data['Cylinders']

corr_matrix = data.corr()
corr_matrix['MPG'].sort_values(ascending=False)

We found that acceleration_on_power and acceleration_on_cyl are two new variables that turned out to be more positively correlated with MPG than the original variables.

This brings us to the end of the Exploratory Analysis. We are ready to proceed to our next step of preparing the data for Machine Learning.

Data Preparation using Sklearn

One of the most important aspects of Data Preparation is that we have to keep automating our steps in the form of functions and classes. This makes it easier for us to integrate the methods and pipelines into the main product.

Here are the major tasks to prepare the data and encapsulate functionalities:

Preprocessing the Categorical Attribute — Converting the Origin column

##one-hot encoding the categorical values
from sklearn.preprocessing import OneHotEncoder

data_cat = data[["Origin"]]   ##the categorical column (this selection is not shown in the original snippet)
cat_encoder = OneHotEncoder()
data_cat_1hot = cat_encoder.fit_transform(data_cat)
data_cat_1hot                 # returns a sparse matrix
data_cat_1hot.toarray()[:5]

Data Cleaning — Imputer

We’ll be using the SimpleImputer class from the impute module of the Sklearn library:

##handling missing values
from sklearn.impute import SimpleImputer

num_data = data.select_dtypes(include=['float64', 'int64'])   ##numerical columns only (this selection is not shown in the original snippet)
imputer = SimpleImputer(strategy="median")
imputer.fit(num_data)

Attribute Addition — Adding custom transformations

In order to make changes to datasets and create new variables, sklearn offers the BaseEstimator and TransformerMixin classes. Using them, we can develop new features by defining our own transformer class.

We have created a class to add the two new features found in the EDA step above:

  • acc_on_power — Acceleration divided by Horsepower
  • acc_on_cyl — Acceleration divided by the number of Cylinders
from sklearn.base import BaseEstimator, TransformerMixin

acc_ix, hpower_ix, cyl_ix = 4, 2, 0

##custom class inheriting the BaseEstimator and TransformerMixin
class CustomAttrAdder(BaseEstimator, TransformerMixin):
    def __init__(self, acc_on_power=True):
        self.acc_on_power = acc_on_power  # new optional variable

    def fit(self, X, y=None):
        return self  # nothing else to do

    def transform(self, X):
        acc_on_cyl = X[:, acc_ix] / X[:, cyl_ix]  # required new variable
        if self.acc_on_power:
            acc_on_power = X[:, acc_ix] / X[:, hpower_ix]
            return np.c_[X, acc_on_power, acc_on_cyl]  # returns a 2D array
        return np.c_[X, acc_on_cyl]

attr_adder = CustomAttrAdder(acc_on_power=True)
data_tr_extra_attrs = attr_adder.transform(data_tr.values)
data_tr_extra_attrs[0]

Setting up a Data Transformation Pipeline for numerical and categorical attributes

As I said, we want to automate as much as possible. Sklearn offers a great number of classes and methods to develop such automated pipelines of data transformations.

The major transformations are to be performed on numerical columns, so let’s create the numerical pipeline using the Pipeline class:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def num_pipeline_transformer(data):
    '''
    Function to process numerical transformations
    Argument:
        data: original dataframe
    Returns:
        num_attrs: numerical dataframe
        num_pipeline: numerical pipeline object
    '''
    numerics = ['float64', 'int64']
    num_attrs = data.select_dtypes(include=numerics)

    num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attrs_adder', CustomAttrAdder()),
        ('std_scaler', StandardScaler()),
    ])
    return num_attrs, num_pipeline

In the above code snippet, we have cascaded a set of transformations:

  • Imputing missing values — using the SimpleImputer class discussed above.

  • Custom attribute addition — using the custom attribute class defined above.

  • Standard scaling of each attribute — it is always a good practice to scale the values before feeding them to the ML model, which we do with the StandardScaler class.

Combined Pipeline for both Numerical and Categorical columns

We have the numerical transformation ready. The only categorical column we have is Origin, for which we need to one-hot encode the values.

Here’s how we can use the ColumnTransformer class to capture both of these tasks in one go:

from sklearn.compose import ColumnTransformer

def pipeline_transformer(data):
    '''
    Complete transformation pipeline for both
    numerical and categorical data.
    Argument:
        data: original dataframe
    Returns:
        prepared_data: transformed data, ready to use
    '''
    cat_attrs = ["Origin"]
    num_attrs, num_pipeline = num_pipeline_transformer(data)

    full_pipeline = ColumnTransformer([
        ("num", num_pipeline, list(num_attrs)),
        ("cat", OneHotEncoder(), cat_attrs),
    ])
    prepared_data = full_pipeline.fit_transform(data)
    return prepared_data

To the ColumnTransformer instance, we provide the numerical pipeline object created by the function defined above, and then the OneHotEncoder() class to process the Origin column.

Final Automation

With these classes and functions defined, we now have to integrate them into a single flow, which is going to be simply two function calls.

  1. Preprocessing the Origin column to convert integers to country names:

##preprocess the Origin column in data
def preprocess_origin_cols(df):
    df["Origin"] = df["Origin"].map({1: "India", 2: "USA", 3: "Germany"})
    return df

  2. Calling the final pipeline_transformer function defined above:

##from raw data to processed data in 2 steps
preprocessed_df = preprocess_origin_cols(data)
prepared_data = pipeline_transformer(preprocessed_df)
prepared_data

Voilà, your data is ready to use in just two steps!

The next step is to start training our ML models.

Selecting and Training Machine Learning Models

Since this is a regression problem, I chose to train the following models:

  1. Linear Regression

  2. Decision Tree Regressor

  3. Random Forest Regressor

  4. SVM Regressor

I’ll explain the flow for Linear Regression, and then you can follow the same steps for all the other models.

It’s a simple 4-step process:

  1. Create an instance of the model class.
  2. Train the model using the fit() method.
  3. Make predictions by first passing the data through the pipeline transformer.
  4. Evaluate the model using Root Mean Squared Error (a typical performance metric for regression problems).
##data: training features, data_labels: the corresponding MPG values (the target segregated from the training set)
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(prepared_data, data_labels)

##testing the predictions with the first 5 rows
sample_data = data.iloc[:5]
sample_labels = data_labels.iloc[:5]
sample_data_prepared = pipeline_transformer(sample_data)

print("Prediction of samples: ", lin_reg.predict(sample_data_prepared))

Evaluating the model:

from sklearn.metrics import mean_squared_error

mpg_predictions = lin_reg.predict(prepared_data)
lin_mse = mean_squared_error(data_labels, mpg_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

RMSE for Linear regression: 2.95904
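
The other three models follow exactly the same steps. A minimal sketch (the tree_reg and forest_reg names match the instances used in the next sections; the SVR kernel choice here is only illustrative):

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

##same flow: instantiate, then fit on the prepared training data
tree_reg = DecisionTreeRegressor()
tree_reg.fit(prepared_data, data_labels)

forest_reg = RandomForestRegressor()
forest_reg.fit(prepared_data, data_labels)

svm_reg = SVR(kernel='linear')
svm_reg.fit(prepared_data, data_labels)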

Cross-Validation and Hyperparameter Tuning using Sklearn

Now, if you perform the same steps for the Decision Tree, you’ll see that you have achieved a 0.0 RMSE value, which is not possible – there is no “perfect” Machine Learning model (we’ve not reached that point yet).

Problem: we are testing our model on the same data we trained it on. And we can’t use the test data until we finalize our best model, the one that is ready to go into production.

Solution: Cross-Validation

Scikit-Learn’s K-fold cross-validation feature randomly splits the training set into K distinct subsets called folds. Then it trains and evaluates the model K times, picking a different fold for evaluation every time and training on the other K-1 folds.

The result is an array containing the K evaluation scores. Here’s how I did it for 10 folds:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg,
                         prepared_data,
                         data_labels,
                         scoring="neg_mean_squared_error",
                         cv=10)
tree_reg_rmse_scores = np.sqrt(-scores)

The scoring method gives you negative values to denote errors, so we have to negate the scores explicitly before taking the square root.

For Decision Tree, here is the list of all scores:

Take the average of these scores:
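
A minimal sketch of printing those scores and their average (the exact numbers will differ from run to run):

##inspecting the cross-validation RMSE scores
print("Scores:", tree_reg_rmse_scores)
print("Mean:", tree_reg_rmse_scores.mean())
print("Standard deviation:", tree_reg_rmse_scores.std())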

Fine-Tuning Hyperparameters

After testing all the models, you’ll find that RandomForestRegressor has performed the best, but it still needs to be fine-tuned.

A model is like a radio station with a lot of knobs to handle and tune. Now, you can either tune all these knobs manually or provide a range of values/combinations that you want to test.

We use GridSearchCV to find out the best combination of hyperparameters for the RandomForest model:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor()

grid_search = GridSearchCV(forest_reg, param_grid,
                           scoring='neg_mean_squared_error',
                           return_train_score=True,
                           cv=10,
                           )
grid_search.fit(prepared_data, data_labels)

GridSearchCV requires you to pass the parameter grid. This is a Python dictionary with parameter names as keys, mapped to the list of values you want to test for that parameter.

We can pass the model, scoring method, and cross-validation folds to it.

Train the model and it returns the best parameters and results for each combination of parameters:
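
A minimal sketch of reading those results off the fitted grid search:

##best combination found by the grid search
grid_search.best_params_

##RMSE for every tested combination
cv_scores = grid_search.cv_results_
for mean_score, params in zip(cv_scores['mean_test_score'], cv_scores['params']):
    print(np.sqrt(-mean_score), params)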

Check Feature Importance

We can also check the feature importance by listing the features and zipping them up with the best estimator’s feature_importances_ attribute, as follows:

# feature importances
feature_importances = grid_search.best_estimator_.feature_importances_

extra_attrs = ["acc_on_power", "acc_on_cyl"]
numerics = ['float64', 'int64']
num_attrs = list(data.select_dtypes(include=numerics))
attrs = num_attrs + extra_attrs
sorted(zip(attrs, feature_importances), reverse=True)

We see that acc_on_power, which is the derived feature, has turned out to be the most important feature.

You might want to keep iterating a few times before finalizing the best configuration.

The model is now ready with the best configuration.

Evaluate the Entire System

It’s time to evaluate this entire system:

##capturing the best configuration
final_model = grid_search.best_estimator_

##segregating the target variable from test set
X_test = strat_test_set.drop("MPG", axis=1)
y_test = strat_test_set["MPG"].copy()

##preprocessing the test data origin column
X_test_preprocessed = preprocess_origin_cols(X_test)

##preparing the data with final transformation
X_test_prepared = pipeline_transformer(X_test_preprocessed)

##making final predictions
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)

If you want to look at my complete project, here is the GitHub repository:

With that, you have your final model ready to go into production.

For deployment, we save our model into a file using the pickle module and develop a Flask web service to be deployed on Heroku. Let's see how that works.

What do you need to deploy the application?

In order to deploy any trained model, you need the following:

  • A trained model ready to deploy — save the model into a file to be further loaded and used by the web service.

  • A web service — this gives your model a purpose in practice. For our fuel consumption model, that purpose is using a vehicle’s configuration to predict its efficiency. We’ll use Flask to develop this service.

  • A cloud service provider — you need special cloud servers to deploy the application. For simplicity, we are going to use Heroku (I'll cover AWS and GCP in other articles).
Let’s get started by looking at each of these processes one by one.

Saving the Trained Model

Once you’re confident enough to take your trained and tested model into the production-ready environment, the first step is to save it into a .h5 or .bin file using a library like pickle.

Make sure you have pickle installed in your environment.

Next, let’s import the module and dump the model into a .bin file:

import pickle

##dump the model into a file
with open("model.bin", 'wb') as f_out:
    pickle.dump(final_model, f_out)  # write final_model in .bin file
    f_out.close()  # close the file

This will save your model in your present working directory unless you specify some other path.

It’s time to test if we are able to use this file to load our model and make predictions. We are going to use the vehicle config defined below:

##vehicle config
vehicle_config = {
    'Cylinders': [4, 6, 8],
    'Displacement': [155.0, 160.0, 165.5],
    'Horsepower': [93.0, 130.0, 98.0],
    'Weight': [2500.0, 3150.0, 2600.0],
    'Acceleration': [15.0, 14.0, 16.0],
    'Model Year': [81, 80, 78],
    'Origin': [3, 2, 1]
}

Let’s load the model from the file:

##loading the model from the saved file
with open('model.bin', 'rb') as f_in:
    model = pickle.load(f_in)

Make predictions on the vehicle_config:

##defined in prev_blog
predict_mpg(vehicle_config, model)

##output: array([34.83333333, 18.50666667, 20.56333333])

The output is the same as we predicted earlier using final_model.

Developing a web service

The next step is to package this model into a web service that, when given the data through a POST request, returns the MPG (Miles per Gallon) predictions as a response.

I am using the Flask web framework, a commonly used lightweight framework for developing web services in Python. In my opinion, it is probably the easiest way to implement a web service.

Flask gets you started with very little code, and you don’t need to worry about the complexity of handling HTTP requests and responses.

Here are the steps:

  • Create a new directory for your Flask application.
  • Set up a dedicated environment with dependencies installed using pip.
  • Install the following packages:

pandas
numpy
sklearn
flask
gunicorn
seaborn

The next step is to activate this environment and start developing a simple endpoint to test the application.

Create a new file, main.py, and import the flask module:

from flask import Flask

Create a Flask app by instantiating the Flask class:

##creating a flask app and naming it "app"
app = Flask('app')

Create a route and a function corresponding to it that will return a simple string:

@app.route('/test', methods=['GET'])
def test():
    return 'Pinging Model Application!!'

The above code makes use of decorators — an advanced Python feature. You can read more about decorators here.

We don’t need a deep understanding of decorators; just know that adding the @app.route decorator on top of the test() function assigns that web service address (the /test route) to that function.

Now, to run the application we need this last piece of code:

if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port=9696)

The run method starts our flask application service. The 3 parameters specify:

  • debug=True — restarts the application automatically when it encounters any change in the code

  • host='0.0.0.0' — makes the web service public

  • port=9696 — the port that we use to access the application

Now, run main.py in your terminal:

python main.py

Opening the URL http://0.0.0.0:9696/test in your browser will print the response string on the webpage:
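
If you would rather check it from the notebook, a minimal sketch with requests would be:

import requests

requests.get("http://localhost:9696/test").text
##output: 'Pinging Model Application!!'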

With the application now running, let’s bring in the model.

Create a new directory model_files to store all the model-related code.

In this directory, create an ml_model.py file which will contain the data preparation code and the predict function we wrote here.

Copy and paste the libraries you imported earlier in the article and the preprocessing/transformation functions. The file should look like this:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

##functions
def preprocess_origin_cols(df):
    df["Origin"] = df["Origin"].map({1: "India", 2: "USA", 3: "Germany"})
    return df

acc_ix, hpower_ix, cyl_ix = 3, 5, 1

class CustomAttrAdder(BaseEstimator, TransformerMixin):
    def __init__(self, acc_on_power=True):  # no *args or **kargs
        self.acc_on_power = acc_on_power

    def fit(self, X, y=None):
        return self  # nothing else to do

    def transform(self, X):
        acc_on_cyl = X[:, acc_ix] / X[:, cyl_ix]
        if self.acc_on_power:
            acc_on_power = X[:, acc_ix] / X[:, hpower_ix]
            return np.c_[X, acc_on_power, acc_on_cyl]
        return np.c_[X, acc_on_cyl]

def num_pipeline_transformer(data):
    numerics = ['float64', 'int64']
    num_attrs = data.select_dtypes(include=numerics)
    num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attrs_adder', CustomAttrAdder()),
        ('std_scaler', StandardScaler()),
    ])
    return num_attrs, num_pipeline

def pipeline_transformer(data):
    cat_attrs = ["Origin"]
    num_attrs, num_pipeline = num_pipeline_transformer(data)
    full_pipeline = ColumnTransformer([
        ("num", num_pipeline, list(num_attrs)),
        ("cat", OneHotEncoder(), cat_attrs),
    ])
    full_pipeline.fit_transform(data)
    return full_pipeline

def predict_mpg(config, model):
    if type(config) == dict:
        df = pd.DataFrame(config)
    else:
        df = config

    preproc_df = preprocess_origin_cols(df)
    print(preproc_df)
    pipeline = pipeline_transformer(preproc_df)
    prepared_df = pipeline.transform(preproc_df)
    print(len(prepared_df[0]))
    y_pred = model.predict(prepared_df)
    return y_pred

In the same directory, add your saved model.bin file as well.

Now, in main.py we are going to import the predict_mpg function to make predictions. But to do that, we need to create an empty __init__.py file inside model_files to tell Python that the directory is a package.

Your directory should have this tree:
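
Something along these lines (a sketch of the layout implied by the steps above; the deployment files come later):

.
├── main.py
└── model_files/
    ├── __init__.py
    ├── ml_model.py
    └── model.bin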

Next up, define the /predict route, which will accept the vehicle_config from an HTTP POST request and return the predictions using the model and the predict_mpg() method.

In your main.py, first import:

import pickle
from flask import Flask, request, jsonify
from model_files.ml_model import predict_mpg

Then add the predict route and the corresponding function:

@app.route('/predict', methods=['POST'])
def predict():
    vehicle = request.get_json()
    print(vehicle)

    with open('./model_files/model.bin', 'rb') as f_in:
        model = pickle.load(f_in)
        f_in.close()

    predictions = predict_mpg(vehicle, model)

    result = {
        'mpg_prediction': list(predictions)
    }
    return jsonify(result)

Here, we’ll only be accepting POST requests for our function, and thus we have methods=['POST'] in the decorator.

  • First, we capture the data (vehicle_config) from our request using the get_json() method and store it in the variable vehicle.

  • Then we load the trained model into the model variable from the file we have in the model_files folder.

  • Now, we make the predictions by calling the predict_mpg function and passing the vehicle and the model.

  • Finally, we create a JSON response of the array returned in the predictions variable and return this JSON as the method response.

We can test this route using Postman or the requests package. Start the server by running main.py, then in your notebook add this code to send a POST request with the vehicle_config:

import requests

url = "http://localhost:9696/predict"
r = requests.post(url, json=vehicle_config)
r.text.strip()

##output: '{"mpg_predictions":[34.60333333333333,19.32333333333333,14.893333333333333]}'

Great! Now comes the last part: this same functionality should work when deployed on a remote server.

Deploying the application on Heroku

To deploy this flask application on Heroku, you need to follow these very simple steps:

  1. Create a Procfile in the main directory — this contains the command to run the application on the server.

  2. Add the following in your Procfile:

web: gunicorn wsgi:app

We are using gunicorn (installed earlier) to deploy the application:

Gunicorn is a pure-Python HTTP server for WSGI applications. It allows you to run any Python application concurrently by running multiple Python processes within a single dyno. It provides a perfect balance of performance, flexibility, and configuration simplicity.

Now, create a wsgi.py file and add:

##importing the app from main file
from main import app

if __name__ == "__main__":
    app.run()

Make sure you delete the run code from main.py.
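
After removing that block, main.py is left with just the imports, the app instance, and the two routes. A sketch of what remains (the same code shown earlier, minus the run call):

import pickle
from flask import Flask, request, jsonify
from model_files.ml_model import predict_mpg

app = Flask('app')

@app.route('/test', methods=['GET'])
def test():
    return 'Pinging Model Application!!'

@app.route('/predict', methods=['POST'])
def predict():
    vehicle = request.get_json()
    with open('./model_files/model.bin', 'rb') as f_in:
        model = pickle.load(f_in)
    predictions = predict_mpg(vehicle, model)
    return jsonify({'mpg_prediction': list(predictions)})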

Write all the Python dependencies into requirements.txt.

You can use pip freeze > requirements.txt, or simply put in the above-mentioned list of packages plus any other package your application is using.
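
For reference, a minimal requirements.txt for this app might look something like the following (names only, no version pins; note that scikit-learn is the actual PyPI package behind the sklearn import):

flask
gunicorn
pandas
numpy
scikit-learn
seaborn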

Now, using the terminal:

  • initialize an empty git repository,
  • add the files to the staging area,
  • and commit the files to the local repository:

$ git init
$ git add .
$ git commit -m "Initial Commit"

Next, create a Heroku account if you haven’t already. Then login to the Heroku CLI:

heroku login

Approve the login from the browser as the page pops up.

Now create the Heroku app:

heroku create <name of your app>

I named it mpg-flask-app. This creates the app on Heroku and gives us the URL on which the app will be deployed.

Finally, push all your code to the Heroku remote:

$ git push heroku master

And Voilà! Your web service is now deployed on https://mpg-flask-app.herokuapp.com/predict.

Again, test the endpoint using the requests package by sending the same vehicle config:
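
A minimal sketch of that check (swap in the URL Heroku printed for your own app):

import requests

url = "https://mpg-flask-app.herokuapp.com/predict"
r = requests.post(url, json=vehicle_config)
r.text.strip()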

With that, you have all the major skills you need to start building more complex ML applications.

You can refer to my GitHub repository for this project.

And you can develop this entire project along with me:

Next Steps

This was still a simple project. For the next steps, I’d recommend you take up a more complex dataset – maybe pick up a classification problem and repeat these tasks until deployment.

Check out Data Science with Harshit — My YouTube Channel

Here is the complete tutorial (in playlist form) on my YouTube channel, where you can follow along while working on this project.

With this channel, I plan to roll out a couple of series covering the entire data science space. Here is why you should be subscribing to the channel:

  • These series would cover all the required/demanded quality tutorials on each of the topics and subtopics, like Python fundamentals for Data Science.

  • Explained Mathematics and derivations of why we do what we do in ML and Deep Learning.

  • Podcasts with Data Scientists and Engineers at Google, Microsoft, Amazon, etc., and CEOs of big data-driven companies.

  • Projects and instructions to implement the topics learned so far. Learn about new certifications, Bootcamps, and resources to crack those certifications, like the TensorFlow Developer Certificate Exam by Google.

Translated from: https://www.freecodecamp.org/news/end-to-end-machine-learning-project-turorial/
