ISSN 1751-956X

Authors: Seyed Sajad Mousavi, Michael Schukat, Enda Howley





Abstract:

Recent advances in combining deep neural network architectures with reinforcement learning (RL) techniques have shown promising results in solving complex control problems with high-dimensional state and action spaces. Inspired by these successes, in this study the authors built two kinds of RL algorithms: deep policy-gradient (PG) and value-function-based agents, which can predict the best possible traffic signal for a traffic intersection. At each time step, these adaptive traffic-light control agents receive a snapshot of the current state of a graphical traffic simulator and produce control signals. The PG-based agent maps its observation directly to the control signal; however, the value-function-based agent first estimates values for all legal control signals. The agent then selects the optimal control action with the highest value. Their methods show promising results in a traffic network simulated in the Simulation of Urban Mobility (SUMO) traffic simulator, without suffering from instability issues during the training process.


1 Introduction

With regard to the fast-growing population around the world, the urban population in the 21st century is expected to increase dramatically. Hence, it is imperative that urban infrastructure is managed effectively to contend with this growth. One of the most critical considerations when designing modern cities is developing smart traffic management systems. The main goal of a traffic management system is reducing traffic congestion, which nowadays is one of the major issues of megacities. Efficient urban traffic management results in time and financial savings as well as reducing carbon dioxide emissions into the atmosphere. To address this issue, a lot of solutions have been proposed [1–4]. They can be roughly classified into three groups. The first is pre-timed signal control, where a fixed time is determined for all green phases according to historical traffic demand, without considering possible fluctuations in traffic demand. The second is vehicle-actuated signal control, where traffic demand information, provided by inductive loop detectors at an equipped intersection, is used to decide how to control the signals, e.g. extending or terminating a green phase. The third is adaptive signal control, where the signal timing control is managed and updated automatically according to the current state of the intersection (i.e. traffic demand, queue length of vehicles in each lane of the intersection and traffic flow fluctuation) [5]. In this paper, we are interested in the third approach and aim to propose two novel methods for traffic signal control by leveraging recent advances in the machine learning and artificial intelligence fields [6, 7].


Reinforcement learning (RL) [8] as a machine learning technique for the traffic signal control problem has led to impressive results [2, 9] and has shown promising potential as a solver. It does not need a perfect knowledge of the environment in advance, for example, the traffic flow. Instead, RL agents are able to gain knowledge and model the dynamics of the environment just by interacting with it. An RL agent learns by trial and error: it receives a scalar reward after taking each action in the environment. The obtained reward reflects how good the taken action was, and the agent's goal is to learn an optimal control policy, so that the discounted cumulative reward is maximised via repeated interaction with its environment. Aside from traffic control, RL has been applied to a number of real-world problems such as cloud computing [10, 11].


Typically, the complexity of using RL in real-world applications such as traffic signal management grows exponentially as state and action spaces increase. To deal with this problem, function approximation techniques and hierarchical RL (HRL) approaches can be used. Recently, deep learning has gained huge attraction and has been successfully combined with RL techniques to deal with complex optimisation problems such as playing Atari 2600 games [7], the computer Go program [12] etc., where the classical RL methods could not provide optimal solutions. In this way, the current state of the environment is fed into a deep neural net [e.g. a convolutional neural network (CNN) [13]] trained by RL techniques to predict the next possible optimal action(s).


Inspired by the successes of combining RL with the deep learning paradigm, and with regard to the complex nature of the environment in the traffic signal control problem, in this paper we aim to use the effectiveness and power of deep RL to build adaptive signal control methods in order to optimise the traffic flow. Although a few previous studies have tried to apply deep RL to the traffic signal control problem [14, 15], in this research the state representation is different. Also, one of our methods uses a policy-gradient (PG) method, which does not suffer from oscillations and instabilities during the training process and can take full advantage of the available data of the environment to develop the optimal control policy. We propose adaptive signal controllers built from a combination of two RL approaches (i.e. PG and action-value function) and a deep convolutional neural network, which perceive embedded camera observations in order to produce control signals at an isolated intersection. We conduct simulated experiments with our proposed methods in the Simulation of Urban Mobility (SUMO) traffic simulator.


The rest of this paper is organised as follows. Section 2 provides related work in the area of traffic light control (TLC). Section 3 gives a brief review of the RL techniques which we have used in this research. Section 4 presents how to formulate the TLC problem as an RL task and the proposed methods to solve the task. Then, Section 5 provides simulation results and the performance of the proposed approaches. Finally, Section 6 concludes this paper and gives some directions for future research.



2 Related work

A lot of research has been done in the academic and industry communities to build adaptive traffic signal control systems. In particular, significant research has been conducted employing RL methods in the area of traffic light signal control [16–20]. These works have achieved promising results. However, their simulation testbeds have not been mature enough to be comparable with more realistic situations. The development of advanced traffic simulation tools has enabled researchers to develop novel state representations and reward functions for RL algorithms, which could consider more aspects of the complexity and reality of real-world traffic problems [3, 5, 21–24]. All these attempts viewed the TLC problem as a fully observable Markov decision process (MDP) and investigated whether the Q-learning algorithm can be applied to it. However, Richter's study formulated the traffic problem as a partially observable MDP (POMDP) and applied PG methods to guarantee local convergence under a partially observable environment [25].


With advances in deep learning and its application to different domains [11, 26, 27], deep learning has gained attention in the area of traffic management systems. Previous research has used deep stacked auto-encoder (SAE) neural networks to estimate Q values, where each Q value corresponds to an available signal phase [28]. It considered measures of speed and queue length as its state at each time step of the learning process of its proposed method. Two recent studies by van der Pol and Oliehoek [14] and Genders and Razavi [15] provided deep RL agents that used a deep Q-network [7] to map from given states to Q values. Their state representations were a binary matrix of the positions of vehicles on the lanes of an intersection, and a combination of the presence matrix of vehicles, speed and the current traffic signal phase, respectively. However, we use raw visual input data of the traffic simulator snapshots as system states. Moreover, in addition to estimating the Q-function, one of the proposed methods directly maps from the input state to a probability distribution over actions (i.e. signal phases) via a deep PG method.



3 Background

In this section, we will review RL approaches and briefly describe how RL is applied to real-world problems where the numbers of states and actions are extremely high, so that regular RL techniques cannot deal with them.

3.1 Reinforcement learning

A common RL [8] setting is shown in Fig. 1, where an RL agent interacts with an environment. The interaction continues until a terminal state is reached or the agent meets a termination condition. Usually, the problems that RL techniques are applied to are treated as MDPs. An MDP is defined as a five-tuple (S, A, T, R, γ), where S is the set of states in the state space of the environment, A is the set of actions in the action space that the agent can use in order to interact with the environment, T is the transition function, which gives the probability of moving between the environment states, R is the reward function and γ ∈ [0, 1] is known as the discount factor, which models the relative importance of future and immediate rewards. At each time step t, the agent perceives the state st ∈ S and, based on its observation, selects an action at. Taking the action leads the state of the environment to transition to the next state st+1 ∈ S according to the transition function T. Then, the agent receives the reward rt, which is determined by the reward function R. The goal of the learning agent in the RL framework is to learn an optimal policy π: S × A → [0, 1], which defines the probability of selecting action at in state st, so that by following the underlying policy the expected cumulative discounted reward over time is maximised. The discounted future reward, Rt, at time t is defined as follows:
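Presumably this is the standard discounted return of [8] (written here in its infinite-horizon form), consistent with the role of γ described next:

$$R_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}$$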

where the role of the discount factor γ is to trade off the worth of immediate and future rewards. In most real-world problems, there are many states and actions, which makes it impossible to apply classic RL techniques that rely on tabular representations of their state and action spaces.


Hence, it is common to use function approximators [29] or decomposition and aggregation techniques such as HRL approaches [30–32] and advanced HRL [33]. Different forms of function approximators can be used with RL techniques: for example, linear function approximation, a linear combination of features f of the state and action spaces and learned weights w (e.g. ∑_i f_i w_i), or non-linear function approximation (e.g. a neural network). Until recently, the majority of work in RL has applied linear function approximators. More recently, deep neural networks (DNNs) such as CNNs, recurrent neural networks, SAEs etc. have also been commonly used as function approximators for large RL tasks [6, 34]. The interested reader is referred to [35] for a review of using DNNs within the RL framework.


3.2 Deep learning and deep Q-learning

Deep learning techniques are one of the best solutions for addressing high-dimensional data and extracting discriminative information from the data. Deep learning algorithms have the capability of automating feature extraction (the extraction of representations) from the data. The representation is learned through the data, which are fed directly into deep nets without using human knowledge (i.e. automated feature extraction). Deep learning models contain multiple layers of representations; indeed, such a model is a stack of building blocks such as auto-encoders, restricted Boltzmann machines and convolutional layers. During training, the raw data are fed into a network consisting of multiple layers. The output of each layer, which is a non-linear feature transformation, is used as input to the next layer of the DNN. The output representation of the final layer can be used for constructing classifiers, or for those applications which can achieve better efficiency and performance with an abstract, hierarchical representation of the data as input. A non-linear transformation is applied at each layer to its input in order to learn and extract underlying explanatory factors. Consequently, this process learns a hierarchy of abstract representations.

One of the main advantages of DNNs is the capability of automating feature extraction from raw input data. A deep Q-learning network (DQN) [6] uses this benefit of deep learning in order to represent the agent's observation as an abstract representation when learning an optimal control policy. The DQN method combines a DNN function approximator with Q-learning to learn the action-value function and, as a result, a policy π: the behaviour of the agent, which tells the agent what action should be selected for each input state. Applying non-linear function approximators such as neural networks together with model-free RL algorithms in high-dimensional continuous state and action spaces has some convergence problems [36]. The reasons for these issues are: (i) consecutive states in RL tasks are correlated; (ii) the underlying policy of the agent changes frequently because of slight changes in Q values. To cope with these problems, the DQN provides some solutions which improve the performance of the algorithm significantly. For the problem of correlated states, the DQN uses the previously proposed experience replay approach [37]. In this way, at each time step the DQN stores the agent's experience (st, at, rt, st+1) into a data set D, where st, at and rt are the state, chosen action and received reward, respectively, and st+1 is the state at the next time step. To update the network, the DQN applies stochastic minibatch updates with uniformly random sampling from the experience replay memory (previously observed transitions) at training time. This negates strong correlations between consecutive samples.
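As a concrete illustration of the replay mechanism just described, the following is a minimal sketch of an experience replay buffer with uniform minibatch sampling. The class name, capacity and batch size are illustrative assumptions, not the paper's implementation.

```python
import random
from collections import deque


class ReplayBuffer:
    """Store (s, a, r, s_next, done) transitions and sample uniformly at random,
    which breaks the strong correlations between consecutive samples."""

    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)  # oldest transitions are dropped first

    def store(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.memory, batch_size)  # uniform random minibatch
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.memory)
```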

Another approach to dealing with the aforementioned convergence issues, which we also examine in this research, is the family of PG methods. This approach has demonstrated better convergence properties in some RL problems [38].



3.3 PG methods

A PG method tries to optimise a parameterised policy function by a gradient-descent method. Indeed, PG methods search the policy space to learn policies directly, instead of estimating state-value or action-value functions. Unlike traditional RL algorithms, PG methods do not suffer from the convergence problems of estimating value functions under non-linear function approximation or in environments which might be POMDPs. They can also deal with the complexity of continuous state and action spaces better than purely value-based methods [38]. PG methods estimate policy gradients using Monte Carlo estimates of the policy gradients [39]. These methods are guaranteed to converge to a local optimum of their parameterised policy function. However, PG methods typically result in high variance in their gradient estimates. Hence, in order to reduce the variance of the gradient estimators, some methods subtract a baseline function from the policy gradients. The baseline function can be calculated in different manners [40, 41]. Inspired by these features of PG methods and the successes of neural networks in automatic feature abstraction, we use DNNs to represent an optimal traffic control policy directly in the traffic signal control problem.



4 System description

In this section, we will formulate the TLC problem as an RL task by describing the states, actions and reward function. We then present the policy as a DNN and describe how to train the network.


4.1 State representation

We represent the state of the system as an image st ∈ R^d, a snapshot of the current state of a graphical simulator {e.g. the SUMO graphical user interface (GUI) [42]}, which is a vector of the raw pixel values of the current view of the intersection at each step of the simulation (as shown in Fig. 1). This kind of representation is like putting a camera on an intersection, which enables it to view the whole intersection. The state representation in the TLC literature usually uses a vector representing the presence of vehicles at the intersection, a Boolean-valued vector where a value of 1 indicates the presence of a vehicle and a value of 0 indicates the absence of a vehicle [14, 43], or a combination of the presence vector with another vector indicating the vehicles' speeds at the given intersection [15]. Apart from requiring this prior knowledge to be provided, these state representations make assumptions which are not generalisable to the real world. By feeding the state as an image to a CNN, however, the system can detect the location and presence of all vehicles of different lengths and, as a result, the vehicles' queue on each lane. Furthermore, by stacking a history of consecutive observations as input, the convolutional layers of a deep network are able to estimate the velocity and travel direction of vehicles. Hence, the system can implicitly benefit from this information as well.


4.2 Action set

To control the traffic signal phases, we define a set of possible actions A = {North/South green (NSG), East/West green (EWG)}. NSG allows vehicles to pass from North to South and vice versa, and also indicates that the vehicles on the East/West route should stop and not proceed through the intersection. EWG allows vehicles to pass from East to West and vice versa, and implies that the vehicles on the North/South route should stop and not proceed through the intersection. At each time step t, the agent, following its strategy, chooses an action at ∈ A. Depending on the selected action, the vehicles on the corresponding lanes are allowed to cross the intersection.


4.3 Reward function

Typically, an immediate reward rt ∈ ℝ is a scalar value which the agent receives after taking the chosen action in the environment at each time step. We set the reward as the difference between the total cumulative delays of two consecutive actions, i.e.
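Consistent with the statement below that positive rewards correspond to a decrease in delay, the reward is presumably

$$r_t = D_{t-1} - D_t,$$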

where Dt and Dt−1 are the total cumulative delays at the current and previous time steps. The total cumulative delay at time t is the sum of the cumulative delays of all the vehicles that have appeared in the system from t = 0 up to the current time step t. Positive reward values imply the taken actions led to a decrease in the total cumulative delay, and negative rewards imply an increase in the delay. With regard to the reward values, the agent may decide to change its policy in certain states of the system in the future.


4.4 Agent's policy

The agent chooses actions based on a policy π. In the policy-based algorithm, the policy is defined as a mapping from the input state to a probability distribution over the actions A. We use a DNN as the function approximator and refer to its parameters θ as the policy parameters. The policy distribution π(at | st; θ) is learned by performing gradient descent on the policy parameters. The action-value function maps the input state to action values, each of which represents the future reward that can be achieved for the given state and action. The optimal policy can then be extracted by performing a greedy approach to select the best possible action.


4.5 Objective function and system training

There are many measures, such as maximising throughput, minimising and balancing queue length, minimising the delay etc., in the traffic signal management literature that can be considered as the learning agent's objective function. In this research, the agent aims to maximise the reduction in the total cumulative delay, which empirically has been shown to maximise throughput and to reduce queue length (more details are discussed in Section 5.3). The objective of the agent is to maximise the expected cumulative discounted reward. We aim to maximise the reward under the probability distribution π(at | st; θ).

We divide the system training based on two RL approaches: value-function-based and policy-based. In the value-function-based approach, the value function Qπ(s, a) is defined as follows:
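The definition is presumably the usual expected discounted return for taking action a in state s and following π thereafter, which can be written in its recursive (Bellman) form as

$$Q^{\pi}(s, a) = \mathbb{E}\big[\, r + \gamma\, Q^{\pi}(s', a') \,\big],$$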

where it is implicit that s, s′ ∈ S and a ∈ A. The value function can be parameterised as Q(s, a; θ) with a parameter vector θ.

Typically, gradient-descent methods are used to learn the parameters θ by trying to minimise the following loss function, the mean-squared error in Q values, where r + γ max_a′ Q(s′, a′; θ) is the target value. In the DQN algorithm, a target Q-network is used to address the instability problem of the policy.
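In its standard form, the mean-squared error loss referred to above is presumably

$$L(\theta) = \mathbb{E}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta)\big)^{2}\Big].$$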

The network is trained with the target Q-network to obtain consistent Q-learning targets, by keeping the weight parameters (θ−) used in the Q-learning target fixed and updating them periodically every N steps from the parameters of the main network θ. The target value of the DQN is represented as follows:
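Presumably this is the standard DQN target computed with the frozen parameters θ− of the target network:

$$y = r + \gamma \max_{a'} Q(s', a'; \theta^{-}),$$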

where θ− are the parameters of the target network. The stochastic gradient-descent method is used in order to optimise (5).

The parameters of the deep Q-learning algorithm are updated as follows:
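A standard stochastic gradient-descent update consistent with the description below would be

$$\theta_{i+1} = \theta_i + \alpha\,\big(y_i - Q(s, a; \theta_i)\big)\,\nabla_{\theta} Q(s, a; \theta_i),$$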

where yi is the target value for iteration i and α is a scalar learning rate. Algorithm 1 (see Fig. 2) presents the pseudo-code for the training algorithm.

In the policy-based approach, the gradient of the objective function represented in (3) is given by:
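Presumably this is the standard REINFORCE estimator of the policy gradient:

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi}\big[\nabla_{\theta} \log \pi(a_t \mid s_t; \theta)\, R_t\big].$$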

This (8) is the standard learning rule of the REINFORCE algorithm [44]. It updates the policy parameters θ in the direction ∇θ log π(at | st; θ), so that the probability of action at in state st is increased if it has led to a high cumulative reward, and decreased if the action has resulted in a low reward. This gradient estimate has high variance. It is common to reduce the variance by subtracting a baseline function bt(st) from the return Rt, without changing the expectation. Commonly, an estimate of the state-value function is used as the baseline, bt(st) = V^πθv(st). Thus, the adjusted gradient is ∇θ log π(at | st; θ)(Rt − bt(st)). The value Rt − bt(st) is known as the advantage function.


With regard to the advantage actor–critic method [45], a single update is computed by selecting actions using the underlying policy for up to M steps or until a terminal state is met. In this way, the agent obtains up to M rewards from the environment at each update point and updates the policy parameters after every n ≤ M steps using n-step returns. The parameter vector θ is updated through the stochastic gradient-descent method:
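A standard form of this update, consistent with the advantage estimate described next, is presumably

$$\theta \leftarrow \theta + \alpha\, \nabla_{\theta} \log \pi(a_t \mid s_t; \theta)\, A(s_t, a_t; \theta, \theta_v),$$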

where A(st, at; θ, θv) is an estimate of the advantage function corresponding to ∑_{i=0}^{n−1} γ^i r_{t+i} + γ^n V(s_{t+n}; θv) − V(st; θv), where n may take different values depending on the state, up to M. This process is an actor–critic algorithm: the policy π(at | st; θ) refers to the actor, and the estimate of the state-value function V^πθv(st) to the critic [45, 46].

Algorithm 2 (see Fig. 3) shows the pseudo-code for the training algorithm.
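To make the n-step advantage actor–critic update concrete, here is a minimal sketch in PyTorch. It assumes a network `net` that, like the policy-based model described later in Section 5.2, returns (action probabilities, state value) for a batched observation; all names and the equal weighting of the two loss terms are illustrative assumptions rather than the paper's code.

```python
import torch


def actor_critic_update(net, optimizer, steps, bootstrap_value, gamma=0.99):
    """One update from an n-step rollout (n <= M).

    `steps` is a list of (state, action, reward) tuples collected under the
    current policy; `bootstrap_value` is V(s_{t+n}) for the state reached after
    the last step (0.0 if that state is terminal).
    """
    R = bootstrap_value
    policy_loss = torch.zeros(())
    value_loss = torch.zeros(())
    for state, action, reward in reversed(steps):
        R = reward + gamma * R                          # n-step discounted return
        probs, value = net(state.unsqueeze(0))          # actor and critic outputs
        advantage = R - value.squeeze()                 # A(s_t, a_t) = R - V(s_t)
        log_prob = torch.log(probs.squeeze(0)[action])
        policy_loss = policy_loss - log_prob * advantage.detach()  # actor term
        value_loss = value_loss + advantage.pow(2)                  # critic term
    optimizer.zero_grad()
    (policy_loss + value_loss).backward()
    optimizer.step()
```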


5 Experiment and results


In this section, we present the simulation environment in which our experiments have been carried out. We then describe the details of the DNN used, including the hyper-parameters, to represent the agent's policy.


5.1 Experiment setup

We have used the SUMO [42] tool to simulate traffic in all experiments. SUMO is a well-known open-source traffic simulator which provides useful application programming interfaces and a GUI view to model large road networks, as well as facilities to handle them. In particular, we utilised SUMO–GUI v0.28.0, as it allows snapshots to be taken of each step of the simulation. The intersection geometry used in this paper is shown in Fig. 4. There are four incoming lanes to the intersection and four outgoing lanes from the intersection. To generate traffic demand from the different directions (i.e. North-to-South, West-to-East and vice versa) onto the road network, we randomly sample from a uniform probability distribution with probability 0.1 to model vehicle generation at each of the 3600 time steps.


5.2 System architecture and hyper-parameters

We took snapshots from the SUMO–GUI and applied some basic pre-processing. The snapshots were converted from the red–green–blue representation to grey-scale and resized to 128 × 128 frames. To enable our system to memorise a history of past observations, we stacked the last four frames of the history and provided them to the system as input. So, the input to the network was a 128 × 128 × 4 image.
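A minimal sketch of this pre-processing, assuming Pillow for the grey-scale conversion and resizing and a deque for the four-frame history; the function names, the normalisation to [0, 1] and the uint8 RGB input format are assumptions made for illustration.

```python
from collections import deque

import numpy as np
from PIL import Image

FRAME_SIZE = 128
HISTORY_LEN = 4

history = deque(maxlen=HISTORY_LEN)  # last four grey-scale frames


def preprocess(snapshot_rgb: np.ndarray) -> np.ndarray:
    """Convert a uint8 RGB snapshot to a 128 x 128 grey-scale frame in [0, 1]."""
    img = Image.fromarray(snapshot_rgb).convert("L").resize((FRAME_SIZE, FRAME_SIZE))
    return np.asarray(img, dtype=np.float32) / 255.0


def build_state(snapshot_rgb: np.ndarray) -> np.ndarray:
    """Stack the last four frames into the 128 x 128 x 4 network input."""
    frame = preprocess(snapshot_rgb)
    if not history:                       # at the start of an episode, repeat the first frame
        history.extend([frame] * HISTORY_LEN)
    else:
        history.append(frame)
    return np.stack(history, axis=-1)     # shape (128, 128, 4)
```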

We applied approximately the same architecture as the deep Q-network (DQN) algorithm introduced by Mnih et al. [6, 7]. The network consists of a stack of two convolutional layers with 16 8 × 8 filters and 32 4 × 4 filters, with strides 4 and 2, respectively. The final hidden layer is fully connected with 256 hidden nodes. All three hidden layers are followed by a rectifier non-linearity. The main difference from the network architecture of the DQN method is the last layer: the last layer of the DQN is a fully connected linear layer with a number of output neurons [i.e. Q values Q(a, s)] corresponding to each action in a given Atari 2600 game, while in the policy-based model the last layer represents two sets of outputs, a softmax output resulting in a probability distribution over the actions A [i.e. the policy π(a, s)] and a single linear output node resulting in the estimate of the state-value function V(s). For the value-function model we used the same architecture as the DQN; its output layer corresponds to the action values.
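To make the architecture described above concrete, the following is a sketch in PyTorch of the shared convolutional trunk with the two output variants: Q values for the value-function-based agent, and a softmax policy plus a scalar state value for the policy-based agent. The layer sizes follow the description above; the class name, the absence of padding and other details are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TrafficNet(nn.Module):
    """Two conv layers (16 8x8 filters, stride 4; 32 4x4 filters, stride 2),
    a 256-unit fully connected layer, and either a Q-value head or a
    policy head plus a state-value head."""

    def __init__(self, n_actions: int = 2, policy_based: bool = True):
        super().__init__()
        self.policy_based = policy_based
        self.conv1 = nn.Conv2d(4, 16, kernel_size=8, stride=4)   # input: 4 stacked 128x128 frames
        self.conv2 = nn.Conv2d(16, 32, kernel_size=4, stride=2)
        self.fc = nn.Linear(32 * 14 * 14, 256)                   # 14x14 feature maps for 128x128 input
        if policy_based:
            self.policy_head = nn.Linear(256, n_actions)          # softmax over {NSG, EWG}
            self.value_head = nn.Linear(256, 1)                   # state-value estimate V(s)
        else:
            self.q_head = nn.Linear(256, n_actions)               # Q(s, a) for each signal phase

    def forward(self, x: torch.Tensor):
        h = F.relu(self.conv1(x))
        h = F.relu(self.conv2(h))
        h = F.relu(self.fc(h.flatten(start_dim=1)))
        if self.policy_based:
            return F.softmax(self.policy_head(h), dim=-1), self.value_head(h)
        return self.q_head(h)


# Example: one observation; the 128x128x4 stack from pre-processing is
# transposed to channels-first (1, 4, 128, 128) for PyTorch.
obs = torch.zeros(1, 4, 128, 128)
probs, value = TrafficNet(policy_based=True)(obs)
```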

In all of our experiments, the discount factor was set to γ = 0.99 and all weights of the network were updated by the Adam optimiser [47] with a learning rate α = 0.00001 and with mini-batches of size M (up to 32), the maximum number of steps that the agent can take following its policy before it needs to update it. The network was trained for about 1050 epochs, roughly 2 million time steps. Each epoch corresponded to ten episodes, and each episode was a complete SUMO–GUI simulation. The policies learned by the agent were evaluated every ten episodes by running SUMO–GUI for five episodes and averaging the resulting rewards, total cumulative delay and queue length.


To evaluate our proposed methods, we also built a shallow neural network (SNN) with one hidden layer. The hidden layer has 64 hidden nodes followed by a rectifier non-linearity. The output layer is a fully connected linear layer with a number of output neurons corresponding to each traffic signal phase at the intersection. Two vectors are used as the input state of the network: the first represents the number of queued vehicles on the lanes of the intersection (i.e. North, South, East and West) and the second corresponds to the current traffic signal phase of the intersection. The SNN is trained with the same hyper-parameters and optimisation method (i.e. the gradient-descent algorithm) as the proposed methods.


5.3 Results and discussion

To evaluate the performance of the proposed methods, we compared them against a baseline traffic controller, a controller that gives an equal fixed time to each phase of the intersection. We ran the SUMO–GUI simulator for the proposed model using the configuration settings explained in Section 5.2 and compared the average reward, average total cumulative delay and average queue length achieved against the baseline. Fig. 5 shows the average reward received while the agent follows a certain policy. As shown in Fig. 5, the proposed method performs significantly better than the baseline and yields larger reward magnitudes as training progresses over more epochs. This gradually increasing reward reflects the agent's ability to learn an optimal control policy in a stable manner. Unlike using deep RL for estimating the Q values in the traffic light optimisation problem [14], the proposed agent does not suffer from stability issues. To assess the policy learned by the agent, two of the most common performance metrics in the traffic signal control literature were implemented: the cumulative delay and the queue length. Figs. 6 and 7 illustrate the performance of the learning agent compared with the baseline in terms of the average cumulative delay time and the average queue length metrics, respectively, while the agent follows the learned policy over time. The plots clearly show the agent is able to find a policy that minimises both queue length and total cumulative delay. Moreover, these graphs reveal that by using the reward function for reducing cumulative delay, the intersection queue length is reduced as well as the total cumulative delay of all vehicles.


We also compared the proposed methods with the SNN, a shallow neural network with one hidden layer. Table 1 reports a comparison of the proposed models and the SNN model in terms of the mean and standard deviation (μ, σ) of the average queue length, the average cumulative delay time and the received average reward metrics. The results in Table 1 are calculated from the last 100 training epochs of each method. Comparing the metrics shown in Table 1 demonstrates that the proposed models significantly outperform the SNN method. On the basis of the data in Table 1, we observe 67 and 72% reductions in the average cumulative delay and queue length for the PG method, and 68 and 73% reductions for the value-function-based method, compared with the SNN. Furthermore, we can see that the proposed methods received average rewards superior to the SNN. Considering these results, it is obvious that the policy-gradient and value-function agents could learn the control policies better than the SNN approach.



6 Conclusion

In this paper, we applied deep RL algorithms, focusing on both policy-based and value-function-based methods, to the traffic signal control problem in order to find optimal signalling control policies, using only the raw visual input data of the traffic simulator snapshots. Our approaches have led to promising results and showed that they can find more stable control policies compared with previous work using deep RL for traffic light optimisation. In this work we developed and tested the proposed methods in a small application; extending the work to more complex traffic simulations, for instance considering many intersections with multiple agents controlling each intersection, and using multi-agent learning techniques to handle the coordination problem between agents, would be a direction for future research.


