Tricks of Building an ML or DNN Model

Machine Learning

Anyone can easily fit data into a model with machine learning or deep learning frameworks. Following best practices may help you stand out from others. You may also consider the tricks below; these are some methods I applied during my journey as a data scientist.

Table of Contents

Data Preparation

  • Process Your Own Data
  • Use Tensor
  • Data Augmentation
  • Sampling Same Data

Model Training

  • Saving Intermediate Checkpoint
  • Virtual Epoch
  • Simple is Beauty
  • Simplifying Problem

Debugging

  • Simplifying Problem
  • Using Eval Mode for Training
  • Data Shifting
  • Addressing Underfitting
  • Addressing Overfitting

Production

  • Meta Data Association
  • Switch to Inference Mode
  • Scaling Cost
  • Stateless
  • Batch Process
  • Use C++

Data Preparation

Process Your Own Data

It is suggested to handle data processing within the model (or within the prediction service). The reason is that a consumer may not know how to do it, and this keeps feature engineering transparent to them.

  • Taking a text classification problem as an example: suppose you are using BERT for classification. You cannot ask your client to perform the tokenization and feature conversion (converting text to token IDs).

  • Taking a regression problem as an example, suppose the date (e.g., 10/31/2019) is one of the features. In your initial model, you may only use the day of the week (i.e., Thursday) as a feature. After several iterations, the day of the week is no longer a good feature, and you want to use only the day of the month (i.e., 31). If your client has passed the raw date (i.e., 10/31/2019) rather than the derived feature from day 1, you do not need to change the API interface in order to roll out the new model.

  • Taking automatic speech recognition as an example, a consumer can only send audio to you, not classic features such as Mel-frequency cepstral coefficients (MFCC).

So it is suggested to embed data preprocessing in your pipeline rather than asking your client to do it.

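As a minimal sketch of this idea (the class and names below are hypothetical illustrations, not from the original article), the prediction service can own the tokenization step so that callers only ever send raw text:

# Hypothetical sketch: preprocessing lives inside the prediction
# service, so the consumer never deals with feature engineering.
class TextClassifierService:
    def __init__(self, tokenizer, model):
        self.tokenizer = tokenizer  # e.g., a BERT word-piece tokenizer
        self.model = model

    def predict(self, raw_text):
        # The caller sends raw text; token IDs stay an internal detail.
        token_ids = self.tokenizer(raw_text)
        return self.model(token_ids)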

Use Tensor

A tensor is an N-dimensional array optimized for multidimensional calculation. It is faster than using a Python dictionary or array, and the expected data format for deep learning frameworks (e.g., PyTorch or TensorFlow) is the tensor.

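For example, a minimal PyTorch sketch of moving from a nested Python list to a tensor:

import torch

# A nested Python list becomes a single tensor; subsequent math runs
# as vectorized operations rather than element-by-element Python loops.
data = [[1.0, 2.0], [3.0, 4.0]]
x = torch.tensor(data)   # shape (2, 2)
y = x @ x.T              # one optimized matrix multiply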

Data Augmentation

Lack of labeled data is one of the challenges practitioners usually face. Transfer learning is one way to overcome it; you can consider using ResNet (for computer vision) or BERT (for natural language processing). Alternatively, you can generate synthetic data to increase the amount of labeled data. albumentations and imgaug help generate image data, while nlpaug generates textual data.

If you understand your data, you should tailor the augmentation approach to it. Remember the golden rule of data science: garbage in, garbage out.

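As a hedged example, here is the commonly documented way to use nlpaug's word-level synonym augmenter (check the library docs for the exact signatures of your installed version):

import nlpaug.augmenter.word as naw

# Replace some words with WordNet synonyms; the augmented text should
# keep the same label as the original.
aug = naw.SynonymAug(aug_src='wordnet')
text = "The quick brown fox jumps over the lazy dog"
augmented = aug.augment(text)  # returns a string or a list, depending on version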

Sampling Same Data

Most of the time, we want to draw data randomly so that the sample distribution is preserved across the train, test, and validation sets. Meanwhile, you want this "random" behavior to be reproducible, so that you get the same train, test, and validation sets every time.

  • If the data come with a date attribute, you can easily split by this column.
  • Otherwise, you can fix the random seed so that you get consistent "random" behavior, as in the snippet below.
import torch
import numpy as np
import random

seed = 1234
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)

Model Training

Saving Intermediate Checkpoint

Regarding saving a trained model, one of the easier ways is to save it after completing the entire training process. However, this has several drawbacks. Let's go through them together.

  • Due to model complexity, computing resources, and the size of the training data, the entire model training process may take several days or weeks. It is too risky not to persist intermediate checkpoints, as a machine can shut down unexpectedly.

  • In general, training a model for longer leads to a better result (e.g., lower loss). However, overfitting can happen, and the last checkpoint does not deliver the best result most of the time. We usually need to use an intermediate checkpoint for production.

  • Save money by using an early-stop mechanism. If you notice that a model has not improved for several epochs, you may stop training earlier to save time and resources. You may argue that the best model could still come after a few more epochs; it is a matter of how you balance it.

So how can we do it? Ideally, you may persist all checkpoints (e.g., saving the model after every epoch), but that requires a lot of storage. In practice, it is recommended to keep only the best model (or the best three models) and the last model, as in the sketch below.

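A minimal sketch using Keras's ModelCheckpoint callback (the filenames below are placeholders):

import tensorflow as tf

# Persist only the best checkpoint (by validation loss) and the last
# one, instead of saving after every epoch.
best_ckpt = tf.keras.callbacks.ModelCheckpoint(
    'best_model.h5', monitor='val_loss', save_best_only=True)
last_ckpt = tf.keras.callbacks.ModelCheckpoint('last_model.h5')

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=20, callbacks=[best_ckpt, last_ckpt])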

Virtual Epoch

Epoch is a very common parameter in model training, and it may hurt model performance if it is not set correctly.

For instance, if we have 1 million records and we set 5 epochs for training, there are 5 million (1M × 5) training records in total. After three weeks, we get another 0.5 million records. If we use the same number of epochs (i.e., 5) for model training, the total training data becomes 7.5 million (1.5M × 5). The issues are:

  • It is not easy to tell whether an improvement in the model comes from the increase in unique training data or simply from the increase in total training data.
  • The new 0.5M records extend training time by hours or even days, which increases the risk of machine failure.

Instead of a static epoch count, a virtual epoch is suggested to replace the original epoch. The virtual epoch can be calculated from the size of the training data, the desired number of checkpoints, and the batch size.

Here is our usual setup:

# original
num_data = 1000 * 1000
batch_size = 100
num_step = 14 * 1000 * 1000
num_checkpoint = 20
steps_per_epoch = num_step // num_checkpoint

# TensorFlow/Keras
model.fit(x, epochs=num_checkpoint,
          steps_per_epoch=steps_per_epoch,
          batch_size=batch_size)

Instead, you can use the following setup:

num_data = 1000 * 1000
num_total_data = 14 * 1000 * 1000
batch_size = 100
num_checkpoint = 20
steps_per_epoch = num_total_data // (batch_size * num_checkpoint)

# TensorFlow/Keras
model.fit(x, epochs=num_checkpoint,
          steps_per_epoch=steps_per_epoch,
          batch_size=batch_size)

Simple is Beauty

Practitioners tend to use state-of-the-art models to build the initial model. In fact, building a simple enough model as a baseline is always recommended. The reasons are:

  • We always need a baseline model to justify the proposed model. Without one, it is hard to tell a client that our amazing deep neural network model is better than others.

  • The baseline model does not need to be very good in terms of performance, but it must be explainable. A business user always wants to know the reasons behind a prediction result.

  • Ease of implementation is very important. A client cannot wait a year to get a good enough model. We need to build a series of models to gain momentum from investors, then build the wonderful model on top of the initial one.

Here are some suggested baseline models for different fields:

  • Acoustic: Instead of training a model to get a vector representation (i.e., an embeddings layer), you may use classic features such as Mel-frequency cepstral coefficients (MFCC) or Mel spectrogram features. Pass those features to a single long short-term memory (LSTM) or convolutional neural network (CNN) layer plus a fully connected layer for classification or prediction.

  • Computer Vision (CV): TODO

  • Natural Language Processing (NLP): Using bag-of-words or classic word embeddings with an LSTM is a good starting point; shift to transformer-based models such as BERT or XLNet later. A minimal baseline sketch follows below.
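
Here is one possible minimal NLP baseline, sketched with scikit-learn (my choice of library for illustration, not the article's): a bag-of-words pipeline with logistic regression that trains in seconds and is easy to explain.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Bag-of-words features feeding a linear classifier: a fast,
# explainable baseline before reaching for BERT or XLNet.
baseline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
# baseline.fit(train_texts, train_labels)
# print(baseline.score(val_texts, val_labels))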

Debugging

Simplifying Problem

Sometimes a classification problem includes 1 million records with 1000 categories. It is very hard to debug your model when its performance is lower than your expectation. Bad performance can be caused by model complexity, data quality, or a bug. Therefore, it is recommended to simplify the problem so that we can guarantee the pipeline is bug-free. We leverage the overfitting behavior to achieve this.

Instead of classifying 1000 categories, you can sample 10 categories with 100 records per category and train your model. By using the same set (or a subset) of training data as the evaluation dataset, you should be able to overfit the model and achieve good results (e.g., 80 or even 90+ accuracy). If not, there may be bugs in your model development. A subsampling sketch follows below.

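A minimal subsampling sketch, assuming features x and integer labels y as NumPy arrays with at least 100 records per class (these names are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(1234)

# Sample 10 of the 1000 categories, 100 records each.
chosen = rng.choice(1000, size=10, replace=False)
subset_idx = np.concatenate([
    rng.choice(np.flatnonzero(y == c), size=100, replace=False)
    for c in chosen
])
x_small, y_small = x[subset_idx], y[subset_idx]
# Train and evaluate on this same subset; a bug-free pipeline should
# overfit it and reach near-perfect accuracy.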

Using Eval Mode for Training

If evaluation set accuracy does not change over the first several epochs, you may have forgotten to reset "train" mode after evaluation.

In PyTorch, you need to swap between train and eval modes during training and evaluation. If train mode is enabled, batch normalization, dropout, and other such layers are affected. Sometimes you may forget to re-enable it after evaluation.

model = MyModel()  # default mode is training mode
for e in range(epoch):
    # model.train()  # forgot to re-enable train mode here
    logits = model(x_train)
    loss = loss_func(logits, y_train)
    model.zero_grad()
    loss.backward()
    optimizer.step()

    model.eval()  # enable eval mode
    with torch.no_grad():
        eval_preds = model(x_val)

Data Shifting

Data shifting happens when the training dataset differs from the evaluation/testing dataset. In a computer vision (CV) task, for example, most of your training data may be daytime pictures while the test data are nighttime pictures.

If you find a big difference between training loss/accuracy and test loss/accuracy, you may randomly pick some samples from both datasets to check for a shift. To address this problem, you may consider:

  1. Make sure to maintain a similar distribution of data across the training, test, and online prediction datasets.

  2. Add more training data if possible.

  3. Add synthetic data by leveraging libraries. Consider using nlpaug (for natural language processing and acoustic tasks) and imgaug (for computer vision tasks).

Addressing Underfitting

Underfitting means the training error is larger than the expected error; in other words, the model cannot achieve the expected performance. Many factors can cause a large error. To address the problem, start with the easier fixes and see whether they resolve it. If the problem can be fixed at an early stage, you save time and human effort.

  1. Perform error analysis. Interpret your model via LIME, SHAP, or Anchor so that you can get a sense of the problem.

  2. The initial model may be too simple. Increase model complexity, for example by adding long short-term memory (LSTM) layers, convolutional neural network (CNN) layers, or fully connected (FC) layers.

  3. Overfit the model a little by reducing regularization. Dropout and weight decay are designed to prevent overfitting; try removing those regularization layers to see whether the problem goes away.

  4. Adopt a state-of-the-art model architecture. Consider using transformers (e.g., BERT or XLNet) for natural language processing (NLP).

  5. Introduce synthetic data. Generating more data helps improve model performance without human labeling effort. Theoretically, the generated data should share the same label, which lets the model "see" more diverse data and eventually improves robustness. You can leverage nlpaug (for natural language processing and acoustic tasks) and imgaug (for computer vision tasks) to perform data augmentation.

  6. Assign better hyper-parameters and an optimizer. Instead of using the default/general learning rate, epochs, and batch size, consider performing hyper-parameter tuning. Beam search, grid search, or random search can identify better hyper-parameters and an optimizer (see the sketch after this list). This approach is relatively simple since only hyper-parameters change, but it may take a long time.

  7. Revisit your data and introduce extra features.
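
As an illustration of point 6, here is a minimal random-search sketch; train_and_evaluate() is a hypothetical helper that trains a model with the given hyper-parameters and returns a validation score:

import random

search_space = {
    'lr': [1e-4, 3e-4, 1e-3, 3e-3],
    'batch_size': [32, 64, 128],
}

best_score, best_params = float('-inf'), None
for _ in range(10):
    # Draw one random combination per trial.
    params = {k: random.choice(v) for k, v in search_space.items()}
    score = train_and_evaluate(**params)  # hypothetical helper
    if score > best_score:
        best_score, best_params = score, params
print(best_params, best_score)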

Addressing Overfitting

Besides underfitting, you may also face overfitting problems. Overfitting means your model fits the training data too closely and does not generalize well to other data. In other words, your train loss/accuracy is better than your validation loss/accuracy. Consider the following approaches to address it:

  1. Perform error analysis. Interpret your model via LIME, SHAP, or Anchor so that you can get a sense of the problem.

  2. Add more training data if possible.

  3. Introduce regularization and normalization layers. Dropout (a regularization layer) and batch normalization (a normalization layer) help reduce overfitting by removing some inputs and smoothing inputs.

  4. Introduce synthetic data. Generating more data helps improve model performance without human labeling effort. Theoretically, the generated data should share the same label, which lets the model "see" more diverse data and eventually improves robustness. You can leverage nlpaug (for natural language processing and acoustic tasks) and imgaug (for computer vision tasks) to perform data augmentation.

  5. Assign better hyper-parameters and an optimizer. Instead of using the default/general learning rate, epochs, and batch size, consider performing hyper-parameter tuning. Beam search, grid search, or random search can identify better hyper-parameters and an optimizer. This approach is relatively simple since only hyper-parameters change, but it may take a long time.

  6. Use an early-stop mechanism to find the optimal model (see the sketch after this list).

  7. Remove features.

  8. The model may be too complex. Decrease model complexity.
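
A minimal early-stopping sketch for point 6, using Keras's EarlyStopping callback:

import tensorflow as tf

# Stop when validation loss has not improved for 3 epochs, and keep
# the weights of the best epoch seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=3, restore_best_weights=True)

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[early_stop])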

Production

Meta Data Association

After your model is rolled out, you need to check some exceptional cases. One way to do this is to generate an ID and persist it to a database. However, this comes with several issues that increase the difficulty of troubleshooting. Here are some disadvantages:

  • The coupling problem impacts system flexibility. From an architecture design point of view, decoupling is one way to build a highly flexible system. If we generate an ID and pass prediction results with this ID to a client, the client needs to persist it in their database. If we ever change its format or data type, we have to inform every consumer to update their database schema.

  • We may need to gather more metadata based on the consumer's primary key. An extra primary key increases joining complexity and storage consumption.

To overcome this, the prediction result should instead be associated directly with the consumer's own primary key.

Switch to Inference Mode

When using PyTorch, there are several settings you should take care of when deploying your model to production. As mentioned above, eval in PyTorch makes layers such as Dropout and BatchNorm work in inference mode, e.g., no dropout is applied at inference time. This not only speeds up the process but also feeds all information through the neural network. detach and torch.no_grad help you get results from the graph while using less memory.

model.eval()  # enable eval mode
with torch.no_grad():
    eval_preds = model(x_val)

Scaling Cost

When you try to scale out an API to handle more throughput, you may sometimes consider using GPUs. It is true that a GPU VM is much more expensive than a CPU one. However, GPUs bring advantages such as shorter computation time, so fewer VMs are required to maintain the same service level. Evaluate whether GPUs actually save money for your workload.

Stateless

Try to make your API stateless so that the API service can be scaled easily. Stateless means NOT saving any intermediate result on the API server (in memory or local storage). Keep the API server simple: return the result to the client without storing anything in memory or local storage.

Batch Process

Predicting a set of records is usually faster than predicting records one by one. Most modern machine learning and deep learning frameworks optimize prediction performance (in terms of speed) for batches. You may notice great improvements by switching to batch-mode prediction, as in the sketch below.

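A minimal PyTorch sketch of the difference, assuming records is a list of input tensors of equal shape and model is an already-loaded network (both are assumptions for illustration):

import torch

model.eval()
with torch.no_grad():
    # Slow: one forward pass per record.
    preds_slow = [model(x.unsqueeze(0)) for x in records]

    # Fast: stack the records into a single batch and predict once.
    batch = torch.stack(records)   # shape (N, ...)
    preds_fast = model(batch)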

Use C++

Although Python is the first-class citizen of the machine learning field, it may be too slow compared to other programming languages such as C++. You may consider using TorchScript if you need low-latency inference. The general idea is that you can still train your model in Python and then generate a C++-compatible model from it, as sketched below.

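A minimal TorchScript export sketch; the example input shape is an assumption for illustration:

import torch

model.eval()
example_input = torch.rand(1, 3, 224, 224)  # assumed input shape

# Trace the Python model into a TorchScript module that a C++
# application can later load with torch::jit::load.
traced = torch.jit.trace(model, example_input)
traced.save("model_traced.pt")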

Translated from: https://medium.com/towards-artificial-intelligence/tricks-of-building-an-ml-or-dnn-model-b2de54cf440a
