Abstract
- Keywords
1 INTRODUCTION
2 REINFORCEMENT LEARNING
3 PROBLEM FORMULATION
- A. Problem Statement
- B. System State Space S
- C. Action Space A
- D. System Transition Times
- E. Reward Function
- F. Learning Function Q
- G. Action selecting method
4 TEST AND RESULTS ANALYSIS
- Demonstration Test
- Comparison Test
5 CONCLUSION

Abstract

This paper addresses a multi-AGV flow-shop scheduling problem with a reinforcement learning method.

The objective is to obtain an AGV schedule that minimizes the average job delay and the total makespan.

In this new method, AGVs share full information on each machine's instantaneous state and the job being executed, so they make decisions with a thorough understanding of the entire flow shop.

Simulation results demonstrate that this new method learns an optimal or near-optimal solution from past experience and performs better than the multi-agent scheduling method in a dynamic environment.

Keywords

multi-AGV
flow-shop
reinforcement learning
Markov problem
optimal solution

1 INTRODUCTION

The AGVs move on fixed transfer tracks and serve the machines by transporting semi-finished products.

Each AGV can acquire the state of the entire system and take corresponding actions. This method gives AGVs the learning capability to decide which tasks need to be done based on the current situation of the system.

2 REINFORCEMENT LEARNING

Reinforcement learning has been applied in many manufacturing systems, such as flow shops [8,9], job shops [10] and Flexible Manufacturing Systems (FMS).

[8] Brauer W, Weiss G. "Multi-machine scheduling: a multi-agent learning approach". In: Proceedings international conference on multi-agent systems; 1998. p. 42-8.
[9] Creighton DC, Nahavandi S. "The application of a reinforcement learning agent to a multi-product manufacturing facility". In: Proceedings of IEEE international conference on industrial technology, Bangkok, Thailand; 2002. p. 1229-34.
[10] Aydin ME. "Dynamic job-shop scheduling using reinforcement learning agents". Robotics and Autonomous Systems 2000; 33(2): 169-78.

Author	Contribution	Paper
Wang and Usher	applied Q-learning to a single-machine dispatching rule selection problem to find the optimal policies.	Wang YC, Usher JM. "Application of reinforcement learning for agent-based production scheduling". Engineering Applications of Artificial Intelligence 2005; 18(1): 73-82.
Jiao et al.	addressed a stochastic economic lot scheduling problem (SELSP) for a single-machine make-to-stock production system with two RL algorithms, QLS and QLIH, for real-time decision-making.	Jiao, W. X., and L. X. Zhu. 2012. "Intelligent Dynamic Control of Stochastic Economic Lot Scheduling by Agent-Based Reinforcement Learning". International Journal of Production Research. 50: 4381-4395.
Arviv et al.	proposed a collaborative RL method to solve a two-robot flow-shop scheduling problem with four robot collaboration levels.	Kfir Arviv, Helman Stern, Yael Edan. "Collaborative reinforcement learning for a two robot job transfer flow-shop scheduling problem". International Journal of Production Research, 54:4, 1196-1209.

Key question: how can the problem be modeled effectively?

3 PROBLEM FORMULATION

A. Problem Statement

Problem description:

There are $n$ jobs ($J_{1},J_{2},...,J_{j},...,J_{n}$) to be scheduled on $m$ machines ($M_{1},...,M_{i},...,M_{m}$).
All jobs are independent, and each job requires $m$ operations $O_{ij}$, where $O_{ij}$ denotes the $i$-th operation of the $j$-th job. The processing sequence is $O_{1j} \rightarrow O_{2j} ... \rightarrow O_{mj}$.
A machine can process at most one job at a time. The input buffer $IJB_{i}$ and the output buffer $OJB_{i}$ of each machine have unlimited capacity. Jobs waiting to be processed on machine $M_{i}$ are stored in $IJB_{i}$, and jobs already processed by $M_{i}$ are stored in $OJB_{i}$.
The processing time of job $J_{j}$ on machine $M_{i}$ is $p_{ij}$.
Several AGVs $R_{1},\cdots,R_{q},\cdots,R_{k}$ $(1 \le q \le k)$ are deployed to transfer products between the machine buffers.
$t_{q,i}$ denotes the time AGV $R_{q}$ needs to travel from $OJB_{i}$ to $IJB_{i+1}$. The loaded and empty travel speeds of an AGV are assumed to be the same. Given the distance $D_{(i,i+1)}$ (in meters) between the buffers $OJB_{i}$ and $IJB_{i+1}$, the job transfer time or empty travel time is therefore $t_{q,i}=\frac{D_{(i,i+1)}}{T_{q}}$, with $T_{q}$ read here as the travel speed of $R_{q}$.
Objective: minimize the total travel and waiting time of the AGVs.
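The travel-time relation above can be sketched in a few lines. This is only an illustration: the distances, the speed value and the function name `transfer_time` are assumptions, not values from the paper.

```python
def transfer_time(distance_m: float, speed_m_per_s: float) -> float:
    """Travel time between adjacent buffers: t = D / speed.

    The same formula is used for loaded and empty travel, because the
    notes assume both speeds are equal.
    """
    if speed_m_per_s <= 0:
        raise ValueError("speed must be positive")
    return distance_m / speed_m_per_s


# Hypothetical layout: distances D_(i,i+1) in meters between OJB_i and IJB_(i+1).
distances = [6.0, 9.0, 12.0]
speed = 1.5  # assumed AGV speed in m/s

times = [transfer_time(d, speed) for d in distances]
print(times)  # [4.0, 6.0, 8.0]
```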

As more experience accumulates, the deviation between the true Q function value and the estimated Q function value gradually decreases.

A system transition epoch $t$ is defined as the moment when an AGV unloads a job into a machine's buffer and is ready to carry out another job. At this moment the next action is selected and the AGV moves to the related machine. A learning episode is a sequence of such epochs that ends when the system reaches the final state.

B. System State Space S

The system state space is defined by the states of the buffer-machine pairs and the current positions of the AGVs.

There are four possible states for the buffer-machine pair $M_{i}$ and $OJB_{i}$ $(i=1,2...m)$, written $s_{i} \in \{1,2,3,4\}$:

$s_{i}=1$: $M_{i}$ is idle and $OJB_{i}$ is empty;
$s_{i}=2$: $M_{i}$ is working and $OJB_{i}$ is empty;
$s_{i}=3$: $M_{i}$ is idle and $OJB_{i}$ is not empty;
$s_{i}=4$: $M_{i}$ is working and $OJB_{i}$ is not empty.

The whole system state is an $(m+k)$ vector $\mathbf{S}=\{s_{1}, s_{2},...s_{i},...s_{m}; R_{1}, R_{2},...R_{j},...R_{k}\}$ with $s_{i} \in \{1, 2, 3, 4\}$ and $R_{j} \in \{L_{11}, L_{12}, ..., L_{pz}, ... L_{mm}\}$, $(1 \le p \le m-1, z=p+1)$, where $R_{j} = L_{pz}$ means that $R_{j}$ is located between machines $M_{p}$ and $M_{z}$ (with $z=p+1$) and $R_{j} = L_{pp}$ means that $R_{j}$ is located at machine $M_{p}$.

At the beginning the state of the system is $S_{0} = \{2,1,1..1; L_{11},...L_{11}\}$, because every machine is idle except the first one and all the machines' output buffers are empty.
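A small sketch of how the state vector could be encoded may help here. The tuple layout and the helper names `pair_state` and `initial_state` are assumptions made for illustration; the paper only defines the symbols.

```python
from typing import Tuple


def pair_state(machine_busy: bool, output_buffer_nonempty: bool) -> int:
    """Map a machine / output-buffer pair to s_i in {1, 2, 3, 4}.

    1: idle, buffer empty      2: working, buffer empty
    3: idle, buffer not empty  4: working, buffer not empty
    """
    return 1 + int(machine_busy) + 2 * int(output_buffer_nonempty)


def initial_state(m: int, k: int) -> Tuple[tuple, tuple]:
    """Build S_0: only the first machine works, all output buffers are empty.

    AGV locations are stored as (p, z) pairs standing for L_pz, so (1, 1)
    means "at machine M_1" (an assumed encoding).
    """
    machines = (pair_state(True, False),) + (pair_state(False, False),) * (m - 1)
    agvs = ((1, 1),) * k
    return machines, agvs


s0 = initial_state(m=6, k=2)
print(s0)  # ((2, 1, 1, 1, 1, 1), ((1, 1), (1, 1)))
```

Keeping the state as nested tuples makes it hashable, so it can be used directly as a key in a tabular Q function.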

In the final state all the jobs have been processed by all the machines and are located in the last machine's output buffer, while the AGVs may be located at any machine.

C. Action Space A

$A=\{ a_{1},a_{2},\cdots,a_{m-1} \}$
$a_{i}$, $i=1,2,\cdots,m-1$, denotes the action in which an AGV transfers a job from the output buffer of $M_{i}$ to the input buffer of $M_{i+1}$.
Because every job has to be transferred across $m-1$ machine-to-machine links, the minimum number of transitions is $n \times (m-1)$.
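As a quick check of the count above, the following sketch enumerates the action indices and evaluates $n \times (m-1)$ for the sizes used later in the demonstration test ($m=6$, $n=50$). The function names are illustrative only.

```python
def action_space(m: int) -> list:
    """Indices i of the actions a_i, i = 1 .. m-1 (move a job from M_i to M_(i+1))."""
    return list(range(1, m))


def min_transitions(n: int, m: int) -> int:
    """Each of the n jobs needs m-1 transfers, so at least n * (m - 1) transitions."""
    return n * (m - 1)


print(action_space(6))         # [1, 2, 3, 4, 5]
print(min_transitions(50, 6))  # 250
```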

D. System Transition Times

The system transition occurs when an AGV drops off a job into the input buffer of one of the machines; the AGV is then ready for its next transport task.

Consider an AGV located at $M_{k}$ that has to transfer a job from $M_{i}$ to $M_{i+1}$. This transfer task consists of two consecutive AGV moves:

1. The AGV travels from its current position $M_{k}$ to $M_{i}$.
2. The AGV carries the job to the input buffer of $M_{i+1}$.

If the output buffer of $M_{i}$ holds no finished job, the second step may be delayed. The AGV then has to wait at machine $M_{i}$ for a time $d$ until $M_{i}$ finishes its job. Here $d$ is defined as the waiting time of the AGV after it arrives at the output buffer of $M_{i}$.

The total transition time is
$$T = \begin{cases} t_{k,i}+t_{i,i+1}+d, & \text{if } s_{i}=2 \\ t_{k,i}+t_{i,i+1}, & \text{if } s_{i} \in \{3,4\} \end{cases}$$

The system transition points are $\{ t_{0},t_{1},\cdots,t_{T} \}$, and the time interval between them is $T$.
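The two cases of the transition time can be written as a small function. This is a sketch under the assumption that the travel times and the waiting time $d$ are already known; the function name and the handling of $s_{i}=1$ (which the notes leave undefined) are choices made for illustration.

```python
def transition_time(t_move: float, t_carry: float, s_i: int, wait: float = 0.0) -> float:
    """Total transition time T for one transfer task.

    t_move  : empty travel time t_(k,i) from the AGV's position M_k to M_i
    t_carry : loaded travel time t_(i,i+1) from M_i to M_(i+1)
    s_i     : state of the pair (M_i, OJB_i)
    wait    : waiting time d, only counted when s_i == 2
    """
    if s_i == 2:
        # Machine still working, output buffer empty: the AGV waits d.
        return t_move + t_carry + wait
    if s_i in (3, 4):
        # A finished job is already in the output buffer: no waiting.
        return t_move + t_carry
    # s_i == 1 has no case in the formula, so it is rejected here (assumption).
    raise ValueError("transition time is not defined for s_i = 1")


print(transition_time(3.0, 5.0, s_i=2, wait=4.0))  # 12.0
print(transition_time(3.0, 5.0, s_i=4))            # 8.0
```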

E. Reward Function

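The action reward of an AGV is defined as $r_{q}(t,i)$. It is computed at each transition epoch $t$ for the AGV $R_{q}$ that has just completed a transfer task at $M_{k}$ and is about to transfer a job from $M_{i}$ to $M_{i+1}$.

When an AGV $R_{q}$ completes a transfer task, it takes on its next task at another machine.
Let that next machine be $M_{i}$, let $AT(i)$ be the time at which the AGV arrives at the output buffer of $M_{i}$ (with $AT(i)>t$), and let $JW(i)$ be the processing end time of the job that has been waiting longest. Because $M_{i}$ may be in any state, three situations can arise when the AGV arrives.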

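1. Machine state $s_{i} \in \{2\}$: the AGV has to wait for the machine to finish its operation.
2. Machine state $s_{i} \in \{3,4\}$: the AGV meets no delay and starts the next transport job immediately.
3. Machine state $s_{i} \in \{1\}$: the AGV will not come to this machine to pick up a job.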

For all AGVs $R_{1} \dots R_{q} \dots R_{k}$:

Let AGVs $R_{1} \dots R_{q}$ use the reward function $r_{1}(t,i)$, which aims to minimize the AGV waiting time.
$$r_{1}(t,i)=JW(i)-AT(i), \quad \forall i:s_{i} \in \{ 2 \}$$
The reward value supplied to the Q function is set to $r_{1m}(t,i)$, the minimum AGV waiting time over all machines in state $s \in \{ 2 \}$.
$$r_{1m}(t,i) = \min(r_{1}(t,i), \forall i:s_{i} \in \{ 2 \})$$
Let AGVs $R_{q} \dots R_{k}$ use the reward function $r_{2}(t,i)$, which aims to minimize the job waiting time.
$$r_{2}(t,i)=AT(i)-JW(i), \quad \forall i:s_{i} \in \{ 3,4 \}$$
The reward value supplied to the Q function is set to $r_{2m}(t,i)$, the minimum job waiting time over all machines in state $s \in \{3,4\}$.
$$r_{2m}(t,i) = \min(r_{2}(t,i), \forall i:s_{i} \in \{ 3,4 \})$$
Minimizing these waiting times reduces the overall completion time.
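The two reward values can be illustrated with a short sketch. The dictionaries keyed by machine index and the sample numbers are made up for the example; returning `None` when no machine qualifies is also an assumption, since the notes do not cover that case.

```python
from typing import Dict, Optional


def agv_wait_reward(jw: Dict[int, float], at: Dict[int, float],
                    states: Dict[int, int]) -> Optional[float]:
    """r_1m: smallest AGV waiting time JW(i) - AT(i) over machines with s_i == 2."""
    values = [jw[i] - at[i] for i, s in states.items() if s == 2]
    return min(values) if values else None


def job_wait_reward(jw: Dict[int, float], at: Dict[int, float],
                    states: Dict[int, int]) -> Optional[float]:
    """r_2m: smallest job waiting time AT(i) - JW(i) over machines with s_i in {3, 4}."""
    values = [at[i] - jw[i] for i, s in states.items() if s in (3, 4)]
    return min(values) if values else None


# Hypothetical snapshot at one transition epoch (machine index -> value).
states = {1: 2, 2: 3, 3: 4, 4: 2}          # s_i
jw = {1: 20.0, 2: 8.0, 3: 11.0, 4: 17.0}   # JW(i): job finish times
at = {1: 14.0, 2: 12.0, 3: 16.0, 4: 15.0}  # AT(i): AGV arrival times

print(agv_wait_reward(jw, at, states))  # 2.0  (machine 4: 17 - 15)
print(job_wait_reward(jw, at, states))  # 4.0  (machine 2: 12 - 8)
```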

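These two reward functions should not be adopted by the same AGV at the same time. Each AGV has one learning function $Q_{q}(S_{t},a_{t})$, which is updated according to its associated reward value.
When an AGV is ready for its next job, the reward is computed and used as the immediate reward $r_{1m}(t,i)$ or $r_{2m}(t,i)$ in the corresponding Q-learning function. The next action is then selected with the $\epsilon$-greedy policy, taking past learning experience into account.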

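F. Learning Function Q

In this learning function, the $Q$ value is computed from the reward values that reflect the AGV and job waiting times.
To reduce the overall completion time, both the AGV and the job waiting times should be minimized.
The number of Q-learning functions in the whole system depends on the number of AGVs.
Each AGV has an associated Q function, which represents either the AGV waiting time or the job waiting time.
$$Q_{q}(s_{t},a_{t}) = (1-\alpha)Q_{q}(s_{t},a_{t})+\alpha\left(r_{qm}(t,i)+\gamma\min\left(Q_{q}(s_{t+1},a_{t})-Q_{q}(s_{t},a_{t})\right)\right), \quad q=1,2$$
The Q value is updated at every transition epoch, when the corresponding AGV has just finished a job and received its reward value.
The learning phase is complete when the system enters the final state.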
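For intuition, here is a minimal tabular update in Python. Note the swap: it uses the textbook cost-minimizing Q-learning target $r + \gamma \min_{a'} Q(s_{t+1}, a')$, which is an assumption and not necessarily the exact expression given in these notes. The defaults $\alpha = 0.8$ and $\gamma = 0.1$ are the values from the demonstration test; the state labels and Q entries are invented.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions,
             alpha: float = 0.8, gamma: float = 0.1) -> float:
    """One tabular Q-learning step in cost-minimizing form (assumed variant).

    Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * min_a' Q(s',a'))
    The min is used because rewards here are waiting times to be reduced.
    """
    best_next = min(q[(next_state, a)] for a in actions)
    q[(state, action)] = (1 - alpha) * q[(state, action)] + alpha * (reward + gamma * best_next)
    return q[(state, action)]


# Hypothetical table: unseen (state, action) pairs default to 0.0.
q = defaultdict(float)
actions = [1, 2, 3]
q[("s1", 1)] = 5.0
q[("s1", 2)] = 3.0
q[("s1", 3)] = 4.0

new_value = q_update(q, "s0", 2, reward=2.0, next_state="s1", actions=actions)
print(round(new_value, 4))  # 0.8 * (2.0 + 0.1 * 3.0), about 1.84
```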

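G. Action selecting method

The policy is $\epsilon$-greedy, with $\epsilon$ decreasing over time to encourage exploration at the beginning and exploitation as the AGVs improve.
$$\epsilon = \frac{1}{i^{\beta}}$$
$i$: the index of the current iteration;
$\beta$: a positive value that sets how fast $\epsilon$ decays to 0.
AGVs that use the same type of reward function share the same Q-value matrix.
Knowledge learned by different AGVs with the same optimization objective can thus be shared, which speeds up the convergence of the Q values.
Otherwise, with limited learning time, the Q values of the system's state-action pairs may not converge, and the optimal AGV schedule cannot be obtained.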
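The decay schedule and the shared table can be sketched as follows. Picking the action with the smallest Q value in the greedy branch is an assumption based on the Q function representing waiting time; the table layout, the reward-type keys and the function names are hypothetical.

```python
import random


def epsilon(iteration: int, beta: float = 0.5) -> float:
    """Exploration rate 1 / i^beta for iteration i = 1, 2, ..."""
    return 1.0 / (iteration ** beta)


def select_action(q: dict, state, actions, iteration: int,
                  beta: float = 0.5, rng=random):
    """Epsilon-greedy choice over a Q table keyed by (state, action)."""
    if rng.random() < epsilon(iteration, beta):
        return rng.choice(actions)  # explore
    # Exploit: lowest estimated waiting time (assumed cost-minimizing reading).
    return min(actions, key=lambda a: q.get((state, a), 0.0))


# AGVs with the same reward type read and write the same table (sketch).
shared_q = {"agv_wait": {}, "job_wait": {}}
agv_reward_type = {"R1": "agv_wait", "R2": "job_wait"}

print([round(epsilon(i), 3) for i in (1, 4, 100)])  # [1.0, 0.5, 0.1]
```

With $\beta = 0.5$ the first iteration is fully random, and exploration has dropped to 10% by iteration 100.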

4 TEST AND RESULTS ANALYSIS

Demonstration Test

Item	Value
$\beta$	0.5
$\alpha_{t}$	0.8
$\gamma$	0.1
Machines $m$	6
Jobs $n$	50
AGVs $r$	2
CPU	Intel Core™4 Duo CPU (2.00 GHz)
RAM	8 GB
Number of trials	5
Number of learning episodes	20
Job processing time on each machine	sampled from a distribution over $[4, 12]$
AGV transport time between adjacent machines	sampled at various values in $[2, 10]$
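A test instance with these sizes could be generated as below. The notes do not say which distribution is used, so the uniform sampling, the seed and the function name `sample_instance` are all assumptions.

```python
import random


def sample_instance(n: int = 50, m: int = 6, seed: int = 0):
    """Draw processing times p_ij and transport times for one test instance.

    Uniform sampling is an assumption; the source only gives the ranges
    [4, 12] for processing and [2, 10] for transport.
    """
    rng = random.Random(seed)
    # processing[i][j]: time of job j on machine i
    processing = [[rng.uniform(4, 12) for _ in range(n)] for _ in range(m)]
    # transport[i]: travel time between machine i and machine i + 1
    transport = [rng.uniform(2, 10) for _ in range(m - 1)]
    return processing, transport


processing, transport = sample_instance()
print(len(processing), len(processing[0]), len(transport))  # 6 50 5
```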

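Comparison Test

In the multi-agent method, every physical and logical entity is abstracted as an agent.
A machine agent is responsible for the processing work of its machine, and an AGV agent decides which transport task its AGV will carry out.
Agents make decisions according to the rule base deployed in their knowledge module.

When the system starts processing products, the agents communicate with each other to obtain the information they need for decision-making.
This interaction mechanism is defined by the roles of the agents and the relationships between them.
After an agent receives messages from the others, it filters out the key information relevant to its decision mechanism and decides which action to take next.

In this experiment, the transport agents (AGVs) follow the rule that the job with the longest waiting time is transferred first.
After all jobs have been processed, the whole system is reset and runs another 9 processing cycles. Finally, the average makespan over these 10 cycles is recorded.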
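The baseline dispatching rule can be pictured with a few lines of Python. The data layout (a list of machine index and finish time pairs for jobs sitting in output buffers) is an assumption for illustration, not the agents' actual message format.

```python
from typing import List, Optional, Tuple


def longest_waiting_job(waiting: List[Tuple[int, float]]) -> Optional[int]:
    """Pick the machine whose finished job has waited longest.

    waiting: (machine_index, finish_time) for each job in an output buffer.
    The job that finished earliest has waited longest, so the minimum
    finish time wins. Returns None when nothing is waiting.
    """
    if not waiting:
        return None
    machine_index, _ = min(waiting, key=lambda job: job[1])
    return machine_index


# Hypothetical snapshot: jobs finished at machines 2, 1 and 4.
print(longest_waiting_job([(2, 14.0), (1, 9.5), (4, 11.0)]))  # 1
```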

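The makespan in the RL scenario is smaller than with the multi-agent method, especially for large problem sizes, which suggests that the RL method can find better solutions in complex industrial environments.
However, given the benefits of the multi-agent method's distributed structure, it may perform better when dynamic changes occur in the system.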

5 CONCLUSION

Further study should focus on the collaboration among all the AGVs and on applying this scheduling method to more complex job-shop systems.

[Paper Notes] A Reinforcement Learning Method for Multi-AGV Scheduling in Manufacturing