Monte Carlo Dropout

There ain’t no such thing as a free lunch, at least according to the popular adage. Well, not anymore! Not when it comes to neural networks, that is. Read on to see how to improve your network’s performance with an incredibly simple yet clever trick called Monte Carlo Dropout.

Dropout

The magic trick we are about to introduce only works if your neural network has dropout layers, so let’s kick off by briefly introducing these. Dropout boils down to simply switching off some neurons at each training step. At each step, a different set of neurons is switched off. Mathematically speaking, each neuron has some probability p of being ignored, called the dropout rate. The dropout rate is typically set between 0 (no dropout) and 0.5 (approximately 50% of all neurons will be switched off). The exact value depends on the network type, layer size, and the degree to which the network overfits the training data.

A full network (left) and the same network with two neurons dropped out in a particular training step (right).
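
To make the mechanics concrete, here is a minimal sketch (using the Keras Dropout layer; the tensor and rate are just for illustration) of what a dropout layer does to its input in training mode: each element is zeroed with probability p, and the surviving elements are scaled by 1 / (1 − p) so that the expected sum of the inputs is preserved.

import tensorflow as tf
from tensorflow import keras

# A dropout layer with rate p = 0.5: each input element is zeroed
# with probability 0.5 during training.
dropout = keras.layers.Dropout(0.5)
x = tf.ones((1, 10))  # ten inputs, all equal to 1

# Training mode: roughly half of the elements are dropped, the rest
# are scaled by 1 / (1 - p) = 2 to preserve the expected sum.
print(dropout(x, training=True).numpy())   # e.g. [[2. 0. 2. 2. 0. ...]]

# Inference mode: the layer is a no-op.
print(dropout(x, training=False).numpy())  # [[1. 1. 1. ... 1.]]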

But why do this? Dropout is a regularization technique, that is, it helps prevent overfitting. With little data and/or a complex network, the model might memorize the training data and, as a result, work great on the data it has seen during training but deliver terrible results on new, unseen data. This is called overfitting, and dropout seeks to alleviate it.

How? There are two ways to understand why switching off some parts of the model might be beneficial. First, the information spreads out more evenly across the network. Think about a single neuron somewhere inside the network. There are a couple of other neurons that provide it with inputs. With dropout, each of these input sources can disappear at any time during training. Hence, our neuron cannot rely on one or two inputs only; it has to spread out its weights and pay attention to all inputs. As a result, it becomes less sensitive to input changes, which helps the model generalize better.

The other explanation of dropout’s effectiveness is even more important from the point of view of our Monte Carlo trick. Since in every training iteration you randomly sample the neurons to be dropped out in each layer (according to that layer’s dropout rate), a different set of neurons is dropped out each time. Hence, each time the model’s architecture is slightly different, and you can think of the outcome as an averaging ensemble of many different neural networks, each trained on one batch of data only.

A final detail: dropout is only used during training. At inference time, that is, when we make predictions with our network, we typically don’t apply any dropout — we want to use all the trained neurons and connections.

Monte Carlo

Now that we have dropout out of the way, what is Monte Carlo? If you’re thinking about a neighborhood in Monaco, you’re right! But there is more to it.

Monte Carlo, Monaco. Photo by Geoff Brooks on Unsplash

In statistics, Monte Carlo refers to a class of computational algorithms that rely on repeated random sampling to obtain a distribution of some numerical quantity.

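As a quick illustration of the idea (a classic toy example, not from the original article), we can estimate π by repeatedly sampling random points in the unit square and checking how many fall inside the quarter circle:

import numpy as np

rng = np.random.default_rng(42)

def estimate_pi(num_samples):
    # Sample points uniformly in the unit square.
    points = rng.random((num_samples, 2))
    # The fraction landing inside the quarter circle approximates pi / 4.
    inside = (points ** 2).sum(axis=1) < 1.0
    return 4 * inside.mean()

# Repeated random sampling yields a distribution of estimates, not a single number.
estimates = [estimate_pi(10_000) for _ in range(100)]
print(np.mean(estimates), np.std(estimates))  # mean close to 3.1416, small spread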

Monte Carlo Dropout: model accuracy

Monte Carlo Dropout, proposed by Gal & Ghahramani (2016), is a clever realization that the use of regular dropout can be interpreted as a Bayesian approximation of a well-known probabilistic model: the Gaussian process. We can treat the many different networks (with different neurons dropped out) as Monte Carlo samples from the space of all available models. This provides mathematical grounds to reason about the model’s uncertainty and, as it turns out, often improves its performance.

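In notation (a standard way to write this estimator; the symbols below are ours, not the article’s), the Monte Carlo Dropout prediction averages T stochastic forward passes, each with a freshly sampled dropout mask:

p(y = c | x) ≈ (1/T) · Σ_{t=1}^{T} softmax(f(x; Ŵ_t))_c

where Ŵ_t denotes the network weights with the t-th random dropout mask applied. More samples T give a more stable estimate, at the cost of more forward passes.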

How does it work? We simply apply dropout at test time, that's all! Then, instead of one prediction, we get many, one from each model. We can then average them or analyze their distributions. And the best part: it does not require any changes to the model’s architecture. We can even use this trick on a model that has already been trained! To see it working in practice, let’s train a simple network to recognize digits from the MNIST dataset.

import numpy as np
from tensorflow import keras

(X_train, y_train), (X_test, y_test) = keras.datasets.mnist.load_data()

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=(28, 28)))
model.add(keras.layers.Dropout(0.25))
model.add(keras.layers.Dense(300, activation="relu"))
model.add(keras.layers.Dropout(0.25))
model.add(keras.layers.Dense(300, activation="relu"))
model.add(keras.layers.Dropout(0.25))
model.add(keras.layers.Dense(10, activation="softmax"))

optimizer = keras.optimizers.Nadam(learning_rate=0.001)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
model.fit(X_train, y_train, epochs=30)
model.evaluate(X_test, y_test)

After training for 30 epochs, this model scores an accuracy of 96.7% on the test set. To turn on dropout at prediction time, we simply need to set training=True to ensure training-like behavior, that is, dropping out some neurons. This way, each prediction will be slightly different, and we may generate as many as we like.

Let’s create two useful functions: predict_proba() generates the desired number of predictions (num_samples) and averages the predicted class probabilities for each of the 10 digits in the MNIST dataset, while predict_class() simply picks the class with the highest predicted probability as the most likely one.

def predict_proba(X, model, num_samples):
    preds = [model(X, training=True) for _ in range(num_samples)]
    return np.stack(preds).mean(axis=0)

def predict_class(X, model, num_samples):
    proba_preds = predict_proba(X, model, num_samples)
    return np.argmax(proba_preds, axis=1)

Now, let’s make 100 predictions and evaluate accuracy on the test set.

y_pred = predict_class(X_test, model, 100)
acc = np.mean(y_pred == y_test)

This yields an accuracy of 97.2%. Compared to the previous result, we have decreased the error rate from 3.3% to 2.8%, a relative reduction of about 15% ((3.3 − 2.8) / 3.3 ≈ 0.15), without changing or retraining the model at all!

Monte Carlo Dropout: prediction uncertainty

Let’s take a look at prediction uncertainty. In classification tasks, class probabilities obtained from the softmax output are often erroneously interpreted as model confidence. However, Gal & Ghahramani (2016) show that a model can be uncertain in its predictions even with a high softmax output. We can see it in our MNIST predictions as well. Let’s compare the softmax output with the Monte Carlo Dropout-predicted probabilities for a single test example.

y_pred_proba = predict_proba(X_test, model, 100)

softmax_output = np.round(model.predict(X_test[1:2]), 3)
mc_pred_proba = np.round(y_pred_proba[1], 3)
print(softmax_output, mc_pred_proba)

softmax_output: [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
mc_pred_proba:  [0. 0. 0.989 0.008 0.001 0. 0. 0.001 0.001 0. ]

Both agree that the test example is most likely a 2 (the third of the ten digit classes). However, the softmax output is 100% sure that this is the case, which should already alert you that something is not right: probability estimates of 0% or 100% are usually dangerous. Monte Carlo Dropout provides us with much more information about the prediction uncertainty: the digit is most likely a 2, there is a small chance it is a 3, and a 4, although unlikely, is still more probable than a 1, for instance.

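We can also quantify this uncertainty instead of just eyeballing it, for example by looking at the spread of the individual Monte Carlo predictions rather than only their mean. A minimal sketch (our addition, reusing the same sampling idea as predict_proba()):

# Stack the individual stochastic predictions instead of averaging them.
preds = np.stack([model(X_test[1:2], training=True) for _ in range(100)])

# The per-class standard deviation across the 100 forward passes flags
# the classes the model is genuinely unsure about.
print(np.round(preds.std(axis=0), 3))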

Monte Carlo Dropout: regression problems

So far, we have talked about a classification task. Let’s now turn to a regression problem to see how Monte Carlo Dropout provides us with prediction uncertainty. Let’s fit a regression model to predict house prices using the Boston housing dataset.

(X_train, y_train), (X_test, y_test) = keras.datasets.boston_housing.load_data()

model = keras.models.Sequential()
model.add(keras.layers.Dropout(0.1))
model.add(keras.layers.Dense(128, activation="relu"))
model.add(keras.layers.Dropout(0.1))
model.add(keras.layers.Dense(64, activation="relu"))
model.add(keras.layers.Dropout(0.1))
model.add(keras.layers.Dense(1, activation="relu"))

optimizer = keras.optimizers.Nadam(learning_rate=0.001)
model.compile(loss="mse", optimizer=optimizer)
model.fit(X_train, y_train, epochs=30, validation_split=0.1)

For a classification task, we have defined functions to predict class probabilities and the most likely class. Similarly, for the regression problem, we need functions to get the predictive distribution and a point estimate (let’s use the mean for this).

def predict_dist(X, model, num_samples):
    preds = [model(X, training=True) for _ in range(num_samples)]
    return np.hstack(preds)

def predict_point(X, model, num_samples):
    pred_dist = predict_dist(X, model, num_samples)
    return pred_dist.mean(axis=1)

Let’s again make 100 predictions for one test example and plot their distribution, marking its mean, which is our point estimate, or best guess.

import seaborn as sns
import matplotlib.pyplot as plt

y_pred_dist = predict_dist(X_test, model, 100)
y_pred = predict_point(X_test, model, 100)

sns.kdeplot(y_pred_dist[0], shade=True)
plt.axvline(y_pred[0], color='red')
plt.show()

Predictive price distribution for one test example from Boston housing data. The red line denotes the mean.

For this particular test example, the mean of the predictive distribution amounts to 18, but we can see that other values are not unlikely: the model is not very certain about its predictions.

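A natural way to summarize such a predictive distribution is an interval estimate. As a sketch (our addition, not from the original article), we can take empirical percentiles of the sampled predictions to get a rough 95% prediction interval for each test example:

# Empirical 2.5th and 97.5th percentiles across the 100 samples give
# a rough 95% prediction interval for each house price.
lower = np.percentile(y_pred_dist, 2.5, axis=1)
upper = np.percentile(y_pred_dist, 97.5, axis=1)
print(lower[0], y_pred[0], upper[0])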

Monte Carlo Dropout: an implementation detail

Just one final remark: we have been implementing Monte Carlo Dropout by setting the model’s training mode to true throughout this article. This works well, but it might affect other parts of the model that behave differently at training and inference time, such as batch normalization. To make sure we only switch on dropout without affecting anything else, we should create a custom MonteCarloDropout layer that inherits from the regular Dropout layer and has its training parameter set to true by default (the following piece of code has been adapted from Geron (2019)).

class MonteCarloDropout(keras.layers.Dropout):
    def call(self, inputs):
        return super().call(inputs, training=True)
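
Used in a model, the custom layer is a drop-in replacement for the regular one (a hypothetical snippet for illustration; the layer sizes are arbitrary):

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    MonteCarloDropout(0.25),
    keras.layers.Dense(300, activation="relu"),
    MonteCarloDropout(0.25),
    keras.layers.Dense(10, activation="softmax"),
])

# Plain model(X) or model.predict(X) is now stochastic, so repeated calls
# already yield Monte Carlo samples without passing training=True.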

Conclusion

  • Monte Carlo Dropout boils down to training a neural network with regular dropout and keeping it switched on at inference time. This way, we can generate multiple different predictions for each instance.
  • For classification tasks, we can average the softmax outputs for each class. This tends to lead to more accurate predictions, which additionally express the model’s uncertainty properly.
  • For regression tasks, we can analyze the predictive distribution to check which values are likely, or summarize it using its mean or median.
  • Monte Carlo Dropout is very easy to implement in TensorFlow: it only requires setting a model’s training mode to true before making predictions. The safest way to do so is to write a custom three-liner class inheriting from the regular Dropout layer.

Sources

  • Gal Y. & Ghahramani Z., 2016, Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning, Proceedings of the 33rd International Conference on Machine Learning
  • Geron A., 2019, Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd edition

Thanks for reading! I hope you have learned something useful that will boost your projects!
