A Visual Guide to Evolution Strategies

brief

Original blog post: link

The very first figure is already delightful!
Survival of the fittest.

The main article begins below:
In this post, I explain how evolution strategies (ES) work with the aid of a few visual examples. I try to keep the equations light, and I provide links to the original articles for readers who wish to understand more details. This is the first post in a series, where I plan to show how to apply these algorithms to a range of tasks from MNIST, OpenAI Gym, Roboschool to PyBullet environments.

Introduction

Neural network models are highly expressive and flexible, and if we are able to find a suitable set of model parameters, we can use neural nets to solve many challenging problems. Deep learning’s success largely comes from the ability to use the backpropagation algorithm to efficiently calculate the gradient of an objective function over each model parameter. With these gradients, we can efficiently search over the parameter space to find a solution that is often good enough for our neural net to accomplish difficult tasks.
However, there are many problems where the backpropagation algorithm cannot be used. For example, in reinforcement learning (RL) problems, we can also train a neural network to make decisions to perform a sequence of actions to accomplish some task in an environment. However, it is not trivial to estimate the gradient of reward signals given to the agent in the future with respect to an action performed by the agent right now, especially if the reward is realised many timesteps in the future. Even if we are able to calculate accurate gradients, there is also the issue of being stuck in a local optimum, of which there are many for RL tasks.

A whole area within RL is devoted to studying this credit-assignment problem, and great progress has been made in recent years. However, credit assignment is still difficult when the reward signals are sparse. In the real world, rewards can be sparse and noisy. Sometimes we are given just a single reward, like a bonus check at the end of the year, and depending on our employer, it may be difficult to figure out exactly why it is so low. For these problems, rather than rely on a very noisy and possibly meaningless gradient estimate of the future to our policy, we might as well just ignore any gradient information, and attempt to use black-box optimisation techniques such as genetic algorithms (GA) or ES.
OpenAI published a paper called Evolution Strategies as a Scalable Alternative to Reinforcement Learning where they showed that evolution strategies, while being less data efficient than RL, offer many benefits. The ability to abandon gradient calculation allows such algorithms to be evaluated more efficiently. It is also easy to distribute the computation for an ES algorithm to thousands of machines for parallel computation. By running the algorithm from scratch many times, they also showed that policies discovered using ES tend to be more diverse compared to policies discovered by RL algorithms.
I would like to point out that even the problem of identifying a machine learning model, such as designing a neural net’s architecture, is one where we cannot directly compute gradients. While RL, Evolution, GA etc., can be applied to search in the space of model architectures, in this post, I will focus only on applying these algorithms to search for parameters of a pre-defined model.

What is an Evolution Strategy?


Two-dimensional Rastrigin function has many local optima.

The diagrams below are top-down plots of shifted 2D Schaffer and Rastrigin functions, two of several simple toy problems used for testing continuous black-box optimisation algorithms. Lighter regions of the plots represent higher values of $F(x,y)$. As you can see, there are many local optima in these functions. Our job is to find a set of model parameters $(x,y)$, such that $F(x,y)$ is as close as possible to the global maximum.


Although there are many definitions of evolution strategies, we can define an evolution strategy as an algorithm that provides the user a set of candidate solutions to evaluate a problem. The evaluation is based on an objective function that takes a given solution and returns a single fitness value. Based on the fitness results of the current solutions, the algorithm will then produce the next generation of candidate solutions that is more likely to produce even better results than the current generation. The iterative process will stop once the best known solution is satisfactory for the user.
Given an evolution strategy algorithm called EvolutionStrategy, we can use it in the following way:

import numpy as np

solver = EvolutionStrategy()

while True:
    # ask the ES to give us a set of candidate solutions
    solutions = solver.ask()

    # create an array to hold the fitness results
    fitness_list = np.zeros(solver.popsize)

    # evaluate the fitness for each given solution
    for i in range(solver.popsize):
        fitness_list[i] = evaluate(solutions[i])

    # give the list of fitness results back to the ES
    solver.tell(fitness_list)

    # get the best parameter and its fitness from the ES
    best_solution, best_fitness = solver.result()

    if best_fitness > MY_REQUIRED_FITNESS:
        break

Although the size of the population is usually held constant for each generation, it doesn’t need to be. The ES can generate as many candidate solutions as we want, because the solutions produced by an ES are sampled from a distribution whose parameters are being updated by the ES at each generation. I will explain this sampling process with an example of a simple evolution strategy.

Simple Evolution Strategy

One of the simplest evolution strategies we can imagine will just sample a set of solutions from a Normal distribution, with a mean $\mu$ and a fixed standard deviation $\sigma$. In our 2D problem, $\mu=(\mu_x,\mu_y)$ and $\sigma=(\sigma_x,\sigma_y)$. Initially, $\mu$ is set at the origin. After the fitness results are evaluated, we set $\mu$ to the best solution in the population, and sample the next generation of solutions around this new mean. This is how the algorithm behaves over 20 generations on the two problems mentioned earlier:


In the visualisation above, the green dot indicates the mean of the distribution at each generation, the blue dots are the sampled solutions, and the red dot is the best solution found so far by our algorithm.
This simple algorithm will generally only work for simple problems. Given its greedy nature, it throws away all but the best solution, and can be prone to getting stuck at a local optimum for more complicated problems. It would be beneficial to sample the next generation from a probability distribution that represents a more diverse set of ideas, rather than just from the best solution from the current generation.
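To make this concrete, here is a minimal sketch of such a solver in the ask/tell style used above. This is only my own illustration of the scheme just described; the class and parameter names are not from any particular library:

import numpy as np

class SimpleES:
    """Sample solutions around the current mean with a fixed standard deviation,
    and move the mean to the best solution found in each generation."""
    def __init__(self, num_params=2, sigma=0.5, popsize=64):
        self.mu = np.zeros(num_params)   # mean starts at the origin
        self.sigma = sigma               # fixed standard deviation
        self.popsize = popsize

    def ask(self):
        # sample the population from N(mu, sigma^2 * I)
        self.solutions = self.mu + self.sigma * np.random.randn(self.popsize, len(self.mu))
        return self.solutions

    def tell(self, fitness_list):
        # greedy update: the next mean is simply this generation's best solution
        self.mu = self.solutions[np.argmax(fitness_list)].copy()

Plugging this solver into the training loop shown earlier is enough to reproduce the greedy behaviour described above.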

Simple Genetic Algorithm

One of the oldest black-box optimisation algorithms is the genetic algorithm. There are many variations, with varying degrees of sophistication, but I will illustrate the simplest version here.
The idea is quite simple: keep only the best-performing 10% of solutions in the current generation, and let the rest of the population die. In the next generation, a new solution is sampled by randomly selecting two solutions from the survivors of the previous generation and recombining their parameters to form a new solution. This crossover recombination process uses a coin toss to determine which parent to take each parameter from. In the case of our 2D toy function, our new solution might inherit $x$ or $y$ from either parent with 50% chance. Gaussian noise with a fixed standard deviation will also be injected into each new solution after this recombination process.


The figure above illustrates how the simple genetic algorithm works. The green dots represent members of the elite population from the previous generation, the blue dots are the offspring that form the set of candidate solutions, and the red dot is the best solution.
Genetic algorithms help diversity by keeping track of a diverse set of candidate solutions to reproduce the next generation. However, in practice, most of the solutions in the elite surviving population tend to converge to a local optimum over time. There are more sophisticated variations of GA out there, such as CoSyNe, ESP, and NEAT, where the idea is to cluster similar solutions in the population together into different species, to maintain better diversity over time.
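For completeness, here is a minimal sketch of this simple GA in the same ask/tell style; only the truncation selection, coin-toss crossover, and fixed Gaussian mutation described above are implemented, and the class and parameter names are my own:

import numpy as np

class SimpleGA:
    """Keep the top 10% of solutions, breed offspring by coin-toss crossover
    between two random survivors, then add fixed Gaussian mutation noise."""
    def __init__(self, num_params=2, sigma=0.1, popsize=256, elite_ratio=0.10):
        self.sigma = sigma
        self.popsize = popsize
        self.elite_size = max(2, int(popsize * elite_ratio))
        self.solutions = np.random.randn(popsize, num_params)  # initial random population

    def ask(self):
        return self.solutions

    def tell(self, fitness_list):
        # survivors: the elite fraction with the highest fitness
        elite = self.solutions[np.argsort(fitness_list)[-self.elite_size:]]
        children = []
        for _ in range(self.popsize):
            a, b = elite[np.random.randint(self.elite_size, size=2)]
            mask = np.random.rand(len(a)) < 0.5                    # coin toss per parameter
            child = np.where(mask, a, b) + self.sigma * np.random.randn(len(a))
            children.append(child)
        self.solutions = np.array(children)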

Covariance-Matrix Adaptation Evolution Strategy (CMA-ES)

A shortcoming of both the Simple ES and Simple GA is that our standard deviation noise parameter is fixed. There are times when we want to explore more and increase the standard deviation of our search space, and there are times when we are confident we are close to a good optimum and just want to fine-tune the solution. We basically want our search process to behave like this:


Amazing isn’it it? The search process shown in the figure above is produced by Covariance-Matrix Adaptation Evolution Strategy (CMA-ES). CMA-ES an algorithm that can take the results of each generation, and adaptively increase or decrease the search space for the next generation. It will not only adapt for the mean μ\muμ and σ\sigmaσ parameters, but will calculate the entire covariance matrix of the parameter space. At each generation, CMA-ES provides the parameters of a multi-variate normal distribution to sample solutions from. So how does it know how to increase or decrease the search space?
Before we discuss its methodology, let’s review how to estimate a covariance matrix. This will be important to understand CMA-ES’s methodology later on. If we want to estimate the covariance matrix of our entire sampled population of size $N$, we can do so using the set of equations below to calculate the maximum likelihood estimate of a covariance matrix $C$. We first calculate the means of each of the $x_i$ and $y_i$ in our population:
在讨论其方法论之前,我们先回顾一下如何估计协方差矩阵。这对后面理解 CMA-ES 的方法论很重要。如果我们想估计我们整个大小为 NNN 的抽样人口的协方差矩阵,我们可以使用下面的方程组来计算协方差矩阵 CCC 的最大似然估计。我们首先计算人口中 xix_ixi​ 和 yiy_iyi​ 各自的平均值:
$\mu_x=\frac{1}{N}\sum_{i=1}^N x_i, \qquad \mu_y=\frac{1}{N}\sum_{i=1}^N y_i$
The terms of the $2\times2$ covariance matrix $C$ will be:
$\sigma_x^2=\frac{1}{N}\sum_{i=1}^N (x_i-\mu_x)^2$
$\sigma_y^2=\frac{1}{N}\sum_{i=1}^N (y_i-\mu_y)^2$
$\sigma_{xy}=\frac{1}{N}\sum_{i=1}^N (x_i-\mu_x)(y_i-\mu_y)$
Of course, these resulting mean estimates $\mu_x$ and $\mu_y$, and covariance terms $\sigma_x^2$, $\sigma_y^2$, $\sigma_{xy}$, will just be an estimate of the actual covariance matrix that we originally sampled from, and are not particularly useful to us on their own.
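In numpy, this maximum likelihood estimate takes only a few lines (a generic sketch with a randomly drawn population, just for illustration; it is not part of CMA-ES itself):

import numpy as np

# x, y hold a sampled population of size N (random values here, purely for illustration)
x, y = np.random.randn(100), np.random.randn(100)

mu_x, mu_y = x.mean(), y.mean()
sigma_x2 = np.mean((x - mu_x) ** 2)              # variance of x
sigma_y2 = np.mean((y - mu_y) ** 2)              # variance of y
sigma_xy = np.mean((x - mu_x) * (y - mu_y))      # covariance term
C = np.array([[sigma_x2, sigma_xy],
              [sigma_xy, sigma_y2]])             # same result as np.cov(x, y, bias=True)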
CMA-ES modifies the above covariance calculation formula in a clever way to make it adapt well to an optimisation problem. I will go over how it does this step-by-step. Firstly, it focuses on the best $N_{best}$ solutions in the current generation. For simplicity, let’s set $N_{best}$ to be the best 25% of solutions. After sorting the solutions based on fitness, the mean $\mu^{(g+1)}$ of the next generation is calculated as the average of only these best 25% of solutions. The covariance matrix $C^{(g+1)}$ of the next generation is then computed from these same best solutions, but using the current generation’s mean $\mu^{(g)}$, rather than the freshly updated $\mu^{(g+1)}$, in the calculation.

Armed with a set of $\mu_x$, $\mu_y$, $\sigma_x$, $\sigma_y$, and $\sigma_{xy}$ parameters for the next generation $(g+1)$, we can now sample the next generation of candidate solutions.
Below is a set of figures to visually illustrate how it uses the results from the current generation $(g)$ to construct the solutions in the next generation $(g+1)$:

  1. Calculate the fitness score of each candidate solution in generation $(g)$.
  2. Isolate the best 25% of the population in generation $(g)$, shown in purple.
  3. Using only the best solutions, along with the mean $\mu^{(g)}$ of the current generation (the green dot), calculate the covariance matrix $C^{(g+1)}$ of the next generation.
  4. Sample a new set of candidate solutions using the updated mean $\mu^{(g+1)}$ and covariance matrix $C^{(g+1)}$ (see the sketch below).
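A minimal sketch of these four steps for the 2D case is given below. This is only my simplified illustration of the idea; the real CMA-ES adds weighted recombination, evolution paths, and step-size control on top of it:

import numpy as np

def cma_like_update(solutions, fitness_list, mu_g, elite_frac=0.25):
    """One simplified generation update: new mean from the best 25% of solutions,
    new covariance computed around the *current* mean mu_g."""
    n_best = max(2, int(len(solutions) * elite_frac))
    best = solutions[np.argsort(fitness_list)[-n_best:]]       # steps 1-2: keep the best 25%
    mu_next = best.mean(axis=0)                                 # new mean mu^(g+1)
    diff = best - mu_g                                          # step 3: deviations from the old mean
    C_next = diff.T @ diff / n_best                             # covariance matrix C^(g+1)
    return mu_next, C_next

# step 4: sample the next generation from the updated distribution, e.g.
# next_solutions = np.random.multivariate_normal(mu_next, C_next, size=popsize)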

Let’s visualise the scheme one more time, on the entire search process on both problems:

Because CMA-ES can adapt both its mean and covariance matrix using information from the best solutions, it can decide to cast a wider net when the best solutions are far away, or narrow the search space when the best solutions are close by. My description of the CMA-ES algorithm for a 2D toy problem is highly simplified to get the idea across. For more details, I suggest reading the CMA-ES Tutorial prepared by Nikolaus Hansen, the author of CMA-ES.
This algorithm is one of the most popular gradient-free optimisation algorithms out there, and has been the algorithm of choice for many researchers and practitioners alike. The only real drawback is its performance when the number of model parameters we need to solve for is large, as the covariance calculation is $O(N^2)$, although recently there have been approximations to make it $O(N)$. CMA-ES is my algorithm of choice when the search space is less than a thousand parameters. I found it still usable up to $\sim$10K parameters if I’m willing to be patient.

Natural Evolution Strategies

Imagine if you had built an artificial life simulator, and you sample a different neural network to control the behavior of each ant inside an ant colony. Using the Simple Evolution Strategy for this task will optimise for traits and behaviours that benefit individual ants, and with each successive generation, our population will be full of alpha ants who only care about their own well-being.
Instead of using a rule that is based on the survival of the fittest ants, what if you take an alternative approach where you take the sum of all fitness values of the entire ant population, and optimise for this sum instead to maximise the well-being of the entire ant population over successive generations? Well, you would end up creating a Marxist utopia.
A perceived weakness of the algorithms mentioned so far is that they discard the majority of the solutions and only keep the best solutions. Weak solutions contain information about what not to do, and this is valuable information to calculate a better estimate for the next generation.
Many people who studied RL are familiar with the REINFORCE paper. In this 1992 paper, Williams outlined an approach to estimate the gradient of the expected rewards with respect to the model parameters of a policy neural network. This paper also proposed using REINFORCE as an Evolution Strategy, in Section 6 of the paper. This special case of REINFORCE-ES was expanded later on in Parameter-Exploring Policy Gradients (PEPG, 2009) and Natural Evolution Strategies (NES, 2014).
In this approach, we want to use all of the information from each member of the population, good or bad, for estimating a gradient signal that can move the entire population to a better direction in the next generation. Since we are estimating a gradient, we can also use this gradient in a standard SGD update rule typically used for deep learning. We can even use this estimated gradient with Momentum SGD, RMSProp, or Adam if we want to.
The idea is to maximise the expected value of the fitness score of a sampled solution. If the expected result is good enough, then the best performing member within a sampled population will be even better, so optimising for the expectation might be a sensible approach. Maximising the expected fitness score of a sampled solution is almost the same as maximising the total fitness score of the entire population.
If $z$ is a solution vector sampled from a probability distribution function $\pi(z, \theta)$, we can define the expected value of the objective function $F$ as:
$J(\theta)=E_{\theta}[F(z)]=\int F(z)\,\pi(z,\theta)\,dz$
where $\theta$ are the parameters of the probability distribution function. For example, if $\pi$ is a normal distribution, then $\theta$ would be $\mu$ and $\sigma$. For our simple 2D toy problems, each solution $z$ is a 2D vector $(x, y)$.
The NES paper contains a nice derivation of the gradient of $J(\theta)$ with respect to $\theta$. Using the same log-likelihood trick as in the REINFORCE algorithm allows us to calculate the gradient of $J(\theta)$:
$\nabla_{\theta} J(\theta)\approx\frac{1}{N}\sum_{i=1}^N F(z^i)\,\nabla_{\theta}\log\pi(z^i,\theta)$
With this gradient $\nabla_{\theta} J(\theta)$, we can use a learning rate parameter $\alpha$ (such as 0.01) and start optimising the $\theta$ parameters of the pdf $\pi$ so that our sampled solutions will likely get higher fitness scores on the objective function $F$. Using SGD (or Adam), we can update $\theta$ for the next generation:
$\theta \rightarrow \theta + \alpha\nabla_{\theta} J(\theta)$
and sample a new set of candidate solutions $z$ from this updated pdf, and continue until we arrive at a satisfactory solution.
In Section 6 of the REINFORCE paper, Williams derived closed-form formulas for the gradient $\nabla_{\theta} \log \pi(z^i, \theta)$, for the special case where $\pi(z, \theta)$ is a factored multi-variate normal distribution (i.e., the correlation parameters are zero). In this special case, $\theta$ consists of the $\mu$ and $\sigma$ vectors. Therefore, each element of a solution can be sampled from a univariate normal distribution:
$z_j \sim N(\mu_j,\sigma_j)$
The closed-form formulas for $\nabla_{\theta} \log N(z^i, \theta)$, for each individual element of vector $\theta$, for each solution $i$ in the population, can be derived as:
$\nabla_{\mu_j} \log N(z^i, \mu, \sigma) = \frac{z_j^i - \mu_j}{\sigma_j^2},$
$\nabla_{\sigma_j} \log N(z^i, \mu, \sigma) = \frac{(z_j^i - \mu_j)^2 - \sigma_j^2}{\sigma_j^3}.$
For clarity, I use the index $j$ to count across parameter space, and this is not to be confused with the superscript $i$, used to count across each sampled member of the population. For our 2D problems, $z_1=x$, $z_2=y$, $\mu_1=\mu_x$, $\mu_2=\mu_y$, $\sigma_1=\sigma_x$, $\sigma_2=\sigma_y$ in this context.
These two formulas can be plugged back into the approximate gradient formula to derive explicit update rules for $\mu$ and $\sigma$. In the papers mentioned above, they derived more explicit update rules, incorporated a baseline, and introduced other tricks such as antithetic sampling in PEPG, which is what my implementation is based on. NES proposed incorporating the inverse of the Fisher Information Matrix into the gradient update rule. But the concept is basically the same as in other ES algorithms: we update the mean and standard deviation of a multi-variate normal distribution at each new generation, and sample a new set of solutions from the updated distribution. Below is a visualization of this algorithm in action, following the formulas described above:


We see that this algorithm is able to dynamically change the $\sigma$’s to explore or fine-tune the solution space as needed. Unlike CMA-ES, there is no correlation structure in our implementation, so we don’t get the diagonal ellipse samples, only the vertical or horizontal ones, although in principle we can derive update rules to incorporate the entire covariance matrix if we needed to, at the expense of computational efficiency.
I like this algorithm because, like CMA-ES, the $\sigma$’s can adapt so our search space can be expanded or narrowed over time. Because the correlation parameter is not used in this implementation, the efficiency of the algorithm is $O(N)$, so I use PEPG if the performance of CMA-ES becomes an issue. I usually use PEPG when the number of model parameters exceeds several thousand.
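Below is a bare-bones sketch of this REINFORCE-style update for the factored Gaussian case, using plain SGD and none of the extra tricks (baseline, antithetic sampling, rank shaping) from the papers; the function name and defaults are my own:

import numpy as np

def reinforce_es_step(mu, sigma, evaluate, popsize=100, alpha=0.01):
    """One generation: sample, evaluate, then ascend the estimated gradient of J(theta).
    mu and sigma are 1-D numpy arrays of the same length as a solution vector."""
    z = mu + sigma * np.random.randn(popsize, len(mu))     # sample solutions z^i ~ N(mu, sigma^2)
    F = np.array([evaluate(zi) for zi in z])                # fitness of each sampled solution

    # closed-form log-likelihood gradients for a factored Gaussian
    grad_mu = (z - mu) / sigma**2                            # d log N / d mu_j
    grad_sigma = ((z - mu)**2 - sigma**2) / sigma**3         # d log N / d sigma_j

    # Monte Carlo estimate of grad J(theta), averaged over the population
    mu = mu + alpha * np.mean(F[:, None] * grad_mu, axis=0)
    sigma = sigma + alpha * np.mean(F[:, None] * grad_sigma, axis=0)
    return mu, sigma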

OpenAI Evolution Strategy

In OpenAI’s paper, they implement an evolution strategy that is a special case of the REINFORCE-ES algorithm outlined earlier. In particular, σ\sigmaσ is fixed to a constant number, and only the μ\muμ parameter is updated at each generation. Below is how this strategy looks like, with a constant σ\sigmaσ parameter:在OpenAI的论文中,他们实现了一种进化策略,这是前面概述的REINFORCE-ES算法的特例。特别是,σ\sigmaσ被固定为一个常数,只有μ\muμ参数在每一代中被更新。下面是这个策略的样子,在 σ\sigmaσ 参数不变的情况下:


In addition to this simplification, the paper also proposed a modification of the update rule that is suitable for parallel computation across different worker machines. In their update rule, a large grid of random numbers is pre-computed using a fixed seed. By doing this, each worker can reproduce the parameters of every other worker over time, and each worker needs only to communicate a single number, the final fitness result, to all of the other workers. This is important if we want to scale evolution strategies to thousands or even a million workers located on different machines, since while it may not be feasible to transmit an entire solution vector a million times at each generation update, it may be feasible to transmit only the final fitness results. In the paper, they showed that by using 1440 workers on Amazon EC2 they were able to solve the MuJoCo Humanoid walking task in $\sim$10 minutes.
I think in principle, this parallel update rule should work with the original algorithm where they can also adapt $\sigma$, but perhaps in practice, they wanted to keep the number of moving parts to a minimum for large-scale parallel computing experiments. This inspiring paper also discussed many other practical aspects of deploying ES for RL-style tasks, and I highly recommend going through it to learn more.
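The resulting update rule itself is very compact. Here is a minimal sketch of the fixed-sigma version, without the shared-seed communication trick or the other refinements from the paper (in practice the fitness values would also be rank-shaped or standardised first):

import numpy as np

def openai_es_step(mu, evaluate, sigma=0.1, popsize=100, alpha=0.01):
    """With sigma held constant, only the mean mu moves along the estimated gradient."""
    eps = np.random.randn(popsize, len(mu))                  # one noise vector per worker
    F = np.array([evaluate(mu + sigma * e) for e in eps])    # each worker reports one number

    # gradient estimate: average of fitness-weighted noise directions, scaled by 1/sigma
    grad = (F[:, None] * eps).mean(axis=0) / sigma
    return mu + alpha * grad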

Fitness Shaping

Most of the algorithms above are usually combined with a fitness shaping method, such as the rank-based fitness shaping method I will discuss here. Fitness shaping allows us to prevent outliers in the population from dominating the approximate gradient calculation mentioned earlier:
$\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} F(z^i)\,\nabla_{\theta} \log \pi(z^i, \theta).$
If a particular $F(z^m)$ is much larger than the other $F(z^i)$ in the population, then the gradient might become dominated by this outlier and increase the chance of the algorithm getting stuck in a local optimum. To mitigate this, one can apply a rank transformation of the fitness. Rather than use the actual fitness function, we rank the results and use an augmented fitness function which is proportional to the solution’s rank in the population. Below is a comparison of what the original set of fitness values may look like, and what the ranked fitness values look like:

What this means is, suppose we have a population size of 101. We would evaluate each solution against the actual fitness function, then sort the solutions based on their fitness. We assign an augmented fitness value of -0.50 to the worst performer, -0.49 to the second-worst solution, …, 0.49 to the second-best solution, and finally a fitness value of 0.50 to the best solution. This augmented set of fitness values will be used to calculate the gradient update, instead of the actual fitness values. In a way, it is similar to applying Batch Normalization to the results, but more direct. There are alternative methods for fitness shaping, but they all basically give similar results in the end.
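A minimal sketch of this rank transformation (my own implementation of the scheme just described):

import numpy as np

def rank_transform(fitness_list):
    """Replace raw fitness values with evenly spaced ranks in [-0.5, 0.5]."""
    n = len(fitness_list)
    ranks = np.empty(n)
    ranks[np.argsort(fitness_list)] = np.arange(n)    # 0 for the worst, n-1 for the best
    return ranks / (n - 1) - 0.5                      # worst -> -0.50, best -> 0.50

# with a population of 101, the shaped values run from -0.50 to 0.50 in steps of 0.01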
I find fitness shaping to be very useful for RL tasks if the objective function is non-deterministic for a given policy network, which is often the case in RL environments where maps are randomly generated and various opponents have random policies. It is less useful for optimising well-behaved functions that are deterministic, and the use of fitness shaping can sometimes slow down the time it takes to find a good solution.

MNIST

Although ES might be a way to search for more novel solutions that are difficult for gradient-based methods to find, it still vastly underperforms gradient-based methods on many problems where we can calculate high-quality gradients. For instance, only an idiot would attempt to use a genetic algorithm for image classification. But sometimes such people do exist in the world, and sometimes these explorations can be fruitful!
Since all ML algorithms should be tested on MNIST, I also tried to apply these various ES algorithms to find weights for a small, simple 2-layer convnet used to classify MNIST, just to see where we stand compared to SGD. The convnet only has $\sim$11k parameters, so we can accommodate the slower CMA-ES algorithm. The code and the experiments are available here.
Below are the results for various ES methods, using a population size of 101, over 300 epochs. We keep track of the model parameters that performed best on the entire training set at the end of each epoch, and evaluate this model once on the test set after 300 epochs. It is interesting how the test set accuracy is sometimes higher than that of the training set for the models that have lower scores.


We should take these results with a grain of salt, since they are based on a single run, rather than the average of 5-10 runs. The results based on a single run seem to indicate that CMA-ES is the best at the MNIST task, but the PEPG algorithm is not that far off. Both of these algorithms achieved ~98% test accuracy, 1% lower than the SGD/Adam baseline. Perhaps the ability to dynamically alter its covariance matrix and standard deviation parameters over each generation allowed it to fine-tune its weights better than OpenAI’s simpler variation.

Try It Yourself

There are probably open-source implementations of all of the algorithms described in this article. The author of CMA-ES, Nikolaus Hansen, has been maintaining a numpy-based implementation of CMA-ES with lots of bells and whistles. His Python implementation introduced me to the training loop interface described earlier. Since this interface is quite easy to use, I also implemented the other algorithms, such as the Simple Genetic Algorithm, PEPG, and OpenAI’s ES, using the same interface, put them in a small Python file called es.py, and also wrapped the original CMA-ES library in this small library. This way, I can quickly compare different ES algorithms by just changing one line:

import numpy as np
import es

# choose one of the solvers below
# solver = es.SimpleGA(...)
# solver = es.PEPG(...)
# solver = es.OpenES(...)
solver = es.CMAES(...)

while True:
    solutions = solver.ask()
    fitness_list = np.zeros(solver.popsize)
    for i in range(solver.popsize):
        fitness_list[i] = evaluate(solutions[i])
    solver.tell(fitness_list)
    result = solver.result()
    if result[1] > MY_REQUIRED_FITNESS:
        break

You can look at es.py on GitHub and the IPython notebook examples using the various ES algorithms.
In this IPython notebook that accompanies es.py, I show how to use the ES solvers in es.py to solve a 100-Dimensional version of the Rastrigin function with even more local optimum points. The 100-D version is somewhat more challenging than the trivial 2D version used to produce the visualizations in this article. Below is a comparison of the performance for various algorithms discussed:

On this 100-D Rastrigin problem, none of the optimisers got to the global optimum solution, although CMA-ES comes close. CMA-ES blows everything else away: PEPG is in 2nd place, and OpenAI-ES / Genetic Algorithm fall behind. I had to use an annealing schedule to gradually lower $\sigma$ for OpenAI-ES to make it perform better on this task.

Final solution that CMA-ES discovered for the 100-D Rastrigin function.
The global optimal solution is a 100-dimensional vector of exactly 10.
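For reference, below is a minimal sketch of the kind of fitness function being maximised here. I’m assuming the standard Rastrigin form, shifted so that the global optimum sits at a vector of all 10s, and negated so that higher fitness is better:

import numpy as np

def rastrigin_fitness(z, shift=10.0):
    """Negative shifted Rastrigin: maximum fitness of 0 at z = [shift, shift, ..., shift]."""
    x = np.asarray(z) - shift
    d = len(x)
    return -(10.0 * d + np.sum(x ** 2 - 10.0 * np.cos(2.0 * np.pi * x)))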

