Lect2_MDPs
Contents
- Markov Decision Processes
  - Markov Processes
    - Definition
    - Markov Property
    - State Transition Matrix
  - Markov Reward Process
    - Definition
    - Return
    - Why discount
    - Value Function
    - Bellman Equation
  - Markov Decision Processes
    - Definition
    - Policy
    - Value Function
    - Bellman Expectation Equation
    - Optimal Value Function
    - Finding an optimal Policy
    - Bellman Optimality Equation
Markov Decision Processes
MDPs formally describe an environment for reinforcement learning.
Markov Processes
Definition
A memoryless stochastic process, i.e. a sequence of random states $S_1, S_2, \dots$ with the [Markov property](#Markov Property)
A Markov Process (or Markov Chain) is a tuple $\langle \mathcal{S}, \mathcal{P} \rangle$
- $\mathcal{S}$ is a (finite) set of states
- $\mathcal{P}$ is a state [transition probability matrix](#state transition matrix),
$$\mathcal{P}_{ss'} = \mathbb{P}\left[S_{t+1} = s' \mid S_t = s \right]$$
Markov Property
“The future is independent of the past given the present”
$$\mathbb{P}\left[ S_{t+1} \mid S_t \right] = \mathbb{P}\left[ S_{t+1} \mid S_1, \dots, S_t \right]$$
The state is a sufficient statistic of the future
State Transition Matrix
The state transition probability $\mathcal{P}_{ss'}$ (subscripts: source state first, destination state second) is
$$\mathcal{P}_{ss'} = \mathbb{P}\left[S_{t+1} = s' \mid S_t = s \right]$$
The state transition matrix $\mathcal{P}$ collects the transition probabilities from every state $s$ to every successor state $s'$:
$$\mathcal{P} = \begin{bmatrix} \mathcal{P}_{11} & \cdots & \mathcal{P}_{1n} \\ \vdots & & \vdots \\ \mathcal{P}_{n1} & \cdots & \mathcal{P}_{nn} \end{bmatrix}$$
By the axioms of probability, each row must sum to one:
$$\sum_{j=1}^n \mathcal{P}_{ij} = 1 \qquad \forall i = 1, \dots, n$$
Example:
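As a small runnable illustration (the states and probabilities below are invented, not taken from the lecture's example), a Markov chain is fully specified by its transition matrix, and trajectories can be sampled from it row by row:

```python
import numpy as np

# Invented 3-state Markov chain for illustration.
states = ["Sunny", "Rainy", "Cloudy"]
P = np.array([
    [0.8, 0.1, 0.1],   # transitions out of Sunny
    [0.3, 0.5, 0.2],   # transitions out of Rainy
    [0.4, 0.3, 0.3],   # transitions out of Cloudy
])

# Each row is a distribution over successor states, so rows sum to 1.
assert np.allclose(P.sum(axis=1), 1.0)

def sample_chain(P, start, steps, rng):
    """Sample a trajectory of state indices from the chain."""
    s, path = start, [start]
    for _ in range(steps):
        s = int(rng.choice(len(P), p=P[s]))  # next state ~ P[s, :]
        path.append(s)
    return path

rng = np.random.default_rng(0)
path = sample_chain(P, start=0, steps=5, rng=rng)
```

Because of the Markov property, each step of the sampler only ever looks at the current row `P[s]`, never at the earlier history.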
Markov Reward Process
Definition
a Markov chain with values
A Markov Reward Process is a tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$
- $\mathcal{S}$ is a finite set of states
- $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \mathbb{P}\left[S_{t+1} = s' \mid S_t = s \right]$
- $\mathcal{R}$ is a reward function, $\mathcal{R}_s = \mathbb{E}\left[R_{t+1} \mid S_t = s \right]$
- $\gamma$ is a discount factor, $\gamma \in [0, 1]$
Note that the reward here depends only on the state, $\mathcal{R}_s$; for example, in the figure below, Class 1 gives $R = -2$ regardless of whether the next state is Facebook or Class 2.
Return
The return $G_t$ is the total discounted reward from time-step $t$:
$$G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^\infty \gamma^k R_{t+k+1}$$
- $\gamma \in [0,1]$ determines how much future rewards are worth at the present time-step: we prefer rewards received now, so rewards further in the future are discounted more heavily
- $\gamma$ close to $0$ gives "myopic" evaluation that emphasizes short-term rewards
- $\gamma$ close to $1$ gives "far-sighted" evaluation that values future rewards almost as much as immediate ones
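The definition above can be checked numerically. A minimal sketch, using a made-up reward sequence; the backward fold relies on the recursion $G_t = R_{t+1} + \gamma G_{t+1}$:

```python
# Return computation via the recursion G_t = R_{t+1} + gamma * G_{t+1},
# folding the reward sequence from the end. The rewards are made up.
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [-2, -2, -2, 10]                      # hypothetical episode
g_myopic = discounted_return(rewards, 0.0)      # only the first reward
g_farsighted = discounted_return(rewards, 1.0)  # undiscounted sum
```

With $\gamma = 0$ only the first reward survives ($G_t = -2$); with $\gamma = 1$ the return is the plain sum ($G_t = 4$), illustrating the myopic vs. far-sighted extremes.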
Why discount
- Some Markov processes contain cycles and never terminate; discounting avoids infinite returns.
- We do not have a perfect model of the environment, so our estimates of the future are uncertain and should not be fully trusted. Discounting expresses this uncertainty: we prefer to obtain reward sooner rather than at some uncertain future point.
- If the reward has real (e.g. financial) value, immediate reward is worth more than delayed reward, just as money now is worth more than money later.
- Human and animal behaviour also shows a preference for immediate reward.
- The factor can sometimes be set to $0$, in which case only the immediate reward matters, or to $1$, in which case future rewards count exactly as much as current ones.
Value Function
The state value function $v(s)$ of an MRP is the expected return starting from state $s$:
$$v(s) = \mathbb{E}\left[G_t \mid S_t = s \right]$$
Bellman Equation
Writing $S_{t+1}$ as $s'$, the definition of expectation gives:
$$v(s) = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'} v(s') \tag{4}$$
The step from equation 1 to equation 2 is not entirely obvious; it still requires proving that $\mathbb{E}[G_{t+1} \mid S_t] = \mathbb{E}\left[v(S_{t+1}) \mid S_t\right] = \mathbb{E}\left[\mathbb{E}[G_{t+1} \mid S_{t+1}] \mid S_t\right]$
First recall the definition of conditional expectation: $\mathbb{E}[X \mid Y = y] = \sum_{x} x \operatorname{P}(X = x \mid Y = y)$
Write $G_{t+1} = g'$, $S_{t+1} = s'$, $S_t = s$:
$$\begin{aligned} \mathbb{E}\left[\mathbb{E}[G_{t+1} \mid S_{t+1}] \mid S_t \right] &= \mathbb{E}\left[\mathbb{E}[g' \mid s'] \mid s \right] \\ &= \mathbb{E}\left[\sum_{g'} g' p(g' \mid s') \,\Big|\, s \right] \\ &= \sum_{s'} \Big(\sum_{g'} g' p(g' \mid s', s) \Big) p(s' \mid s) \\ &= \sum_{s'} \sum_{g'} g' \frac{p(g', s', s)}{p(s', s)} \frac{p(s', s)}{p(s)} \\ &= \sum_{s'} \sum_{g'} g' p(g', s' \mid s) \\ &= \sum_{g'} g' p(g' \mid s) \\ &= \mathbb{E}[G_{t+1} \mid S_t] \end{aligned}$$
In matrix form, equation 3 reads $\mathbf{v} = \mathcal{R} + \gamma \mathcal{P} \mathbf{v}$, which can be solved directly for the analytic solution of the value function:
$$\mathbf{v} = \left(I - \gamma \mathcal{P}\right)^{-1} \mathcal{R}$$
However, for an MRP with $n$ states the computational complexity is $O(n^3)$, so the analytic solution is only practical for small MRPs. For large MRPs, iterative methods are used instead:
- Dynamic programming
- Monte-Carlo evaluation
- Temporal-Difference learning
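For a small MRP the analytic solution is a few lines of NumPy. A sketch with an invented 3-state MRP, solving the linear system $(I - \gamma \mathcal{P})\mathbf{v} = \mathcal{R}$ rather than forming the inverse explicitly:

```python
import numpy as np

# Analytic MRP evaluation: solve (I - gamma*P) v = R.
# The 3-state MRP below is invented for illustration.
P = np.array([
    [0.0, 0.9, 0.1],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],   # absorbing state
])
R = np.array([-2.0, -2.0, 0.0])
gamma = 0.9

v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)

# v must satisfy the Bellman equation v = R + gamma * P v
assert np.allclose(v, R + gamma * P @ v)
```

Using `np.linalg.solve` instead of `np.linalg.inv` is both faster and numerically safer, though the $O(n^3)$ cost remains.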
Markov Decision Processes
Definition
An MDP is an MRP with decisions. It is an environment in which all states are Markov.
A Markov Decision Process is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$
- $\mathcal{S}$ is a finite set of states
- $\mathcal{A}$ is a finite set of actions
- $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'}^{a} = \mathbb{P}\left[S_{t+1} = s' \mid S_t = s, A_t = a \right]$
- $\mathcal{R}$ is a reward function, $\mathcal{R}_s^a = \mathbb{E}\left[R_{t+1} \mid S_t = s, A_t = a \right]$
- $\gamma$ is a discount factor, $\gamma \in [0, 1]$
Policy
A policy π\piπ is a distribution over actions given states
$$\pi(a \mid s) = \mathbb{P}\left[A_t = a \mid S_t = s \right]$$
Given an MDP $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ and a policy $\pi$:
The state sequence $S_1, S_2, \dots$ is a Markov process $\langle \mathcal{S}, \mathcal{P}^\pi \rangle$
The state and reward sequence $S_1, R_2, S_2, \dots$ is a Markov reward process $\langle \mathcal{S}, \mathcal{P}^\pi, \mathcal{R}^\pi, \gamma \rangle$
where
$$\mathcal{P}_{ss'}^\pi = \sum_{a \in \mathcal{A}} \pi(a \mid s) \mathcal{P}_{ss'}^a \qquad \mathcal{R}_s^\pi = \sum_{a \in \mathcal{A}} \pi(a \mid s) \mathcal{R}_s^a$$
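The averaging of MDP dynamics over a policy can be sketched directly; the 2-state, 2-action MDP and the uniform policy below are invented for illustration:

```python
import numpy as np

# Induced MRP from an MDP and a policy:
# P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'},  R^pi_s = sum_a pi(a|s) R^a_s.
P = np.array([                       # P[a, s, s']
    [[0.7, 0.3], [0.2, 0.8]],        # dynamics under action 0
    [[0.1, 0.9], [0.6, 0.4]],        # dynamics under action 1
])
R = np.array([[1.0, 0.0],            # R[a, s]
              [0.0, 2.0]])
pi = np.full((2, 2), 0.5)            # pi[s, a]: uniform random policy

P_pi = np.einsum("sa,ast->st", pi, P)   # sum over actions, weighted by pi
R_pi = np.einsum("sa,as->s", pi, R)

# The induced transition matrix is still row-stochastic.
assert np.allclose(P_pi.sum(axis=1), 1.0)
```

Once `P_pi` and `R_pi` are formed, all of the MRP machinery from the previous section (e.g. the analytic value-function solution) applies unchanged.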
Value Function
- State-value function: $v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s \right]$
- Action-value function: $q_\pi(s,a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a \right]$
Bellman Expectation Equation
The value function can be decomposed into the immediate reward plus the discounted value of the successor state:
$$v_\pi(s) = \mathbb{E}_\pi\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \right] \tag{5}$$
$$q_\pi(s,a) = \mathbb{E}_\pi\left[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right] \tag{6}$$
Equations 5 and 6 relate the value function at the current state to the value function at the next state. Next, consider the relationship between the state-value function and the action-value function:
$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s, a) \tag{7}$$
(summing over all actions $a$, shown as filled black circles in the backup diagram)
$$q_\pi(s,a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_\pi(s') \tag{8}$$
(summing over all successor states $s'$, shown as open circles in the backup diagram)
- Substituting equations 7 and 8 into each other yields equations 5 and 6 in a form with the expectation $\mathbb{E}[\ ]$ removed:
$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left(\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_\pi(s') \right) \tag{9}$$
$$q_\pi(s,a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \sum_{a' \in \mathcal{A}} \pi(a' \mid s') q_\pi(s', a') \tag{10}$$
Alternatively, equations 9 and 10 can be derived directly from equations 5 and 6:
$$\begin{aligned} v_\pi(s) &= \mathbb{E}_\pi\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \right] \\ &= \mathbb{E}_\pi\left[R_{t+1} \mid S_t = s \right] + \gamma \mathbb{E}_\pi\left[v_\pi(S_{t+1}) \mid S_t = s \right] \\ &= \sum_{a \in \mathcal{A}} \pi(a \mid s) \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^\pi v_\pi(s') \\ &= \sum_{a \in \mathcal{A}} \pi(a \mid s) \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \left[\sum_{a \in \mathcal{A}} \pi(a \mid s) \mathcal{P}_{ss'}^a \right] v_\pi(s') \\ &= \sum_{a \in \mathcal{A}} \pi(a \mid s) \left(\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_\pi(s') \right) \triangleq \text{Equation 9.} \end{aligned}$$
$$\begin{aligned} q_\pi(s,a) &= \mathbb{E}_\pi\left[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right] \\ &= \mathbb{E}_\pi\left[R_{t+1} \mid S_t = s, A_t = a \right] + \gamma \mathbb{E}_\pi\left[q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right] \\ &= \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \sum_{a' \in \mathcal{A}} \pi(a' \mid s') q_\pi(s', a') \triangleq \text{Equation 10.} \end{aligned}$$
Equation 9 can be applied directly to evaluate concrete examples:
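A sketch of applying Equation 9 iteratively: starting from $v = 0$ and repeatedly applying the Bellman expectation backup converges to $v_\pi$ (this is iterative policy evaluation). The 2-state MDP and the uniform policy below are invented for illustration:

```python
import numpy as np

# Iterative policy evaluation: repeatedly apply the Equation-9 backup
# v(s) <- sum_a pi(a|s) (R^a_s + gamma * sum_{s'} P^a_{ss'} v(s'))
# until convergence. The MDP and policy are invented.
P = np.array([                           # P[a, s, s']
    [[0.7, 0.3], [0.2, 0.8]],
    [[0.1, 0.9], [0.6, 0.4]],
])
R = np.array([[1.0, 0.0], [0.0, 2.0]])   # R[a, s]
pi = np.full((2, 2), 0.5)                # uniform random policy pi[s, a]
gamma = 0.9

v = np.zeros(2)
for _ in range(1000):
    # q[s, a] = R^a_s + gamma * sum_{s'} P^a_{ss'} v(s')   (Equation 8)
    q = R.T + gamma * np.einsum("ast,t->sa", P, v)
    v_new = (pi * q).sum(axis=1)         # Equation 7, giving Equation 9
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
```

Because the backup is a $\gamma$-contraction, the loop converges geometrically; the same structure reappears later as value iteration when the policy average is replaced by a max.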
Optimal Value Function
Define $v_*(s) = \underset{\pi}{\max}\, v_\pi(s)$ and $q_*(s,a) = \underset{\pi}{\max}\, q_\pi(s,a)$.
An MDP is “solved” when we know the optimal value function
Policies can be compared (partially ordered) as follows: $\pi \geq \pi'$ if $v_\pi(s) \geq v_{\pi'}(s)$ for all $s$.
Finding an optimal Policy
If we know q∗(s,a)q_*(s,a)q∗(s,a), we immediately have the optimal policy
$$\pi_*(a \mid s) = \begin{cases} 1, & \text{if } a = \underset{a \in \mathcal{A}}{\operatorname{arg\,max}}\ q_*(s,a) \\ 0, & \text{otherwise} \end{cases}$$
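Extracting this deterministic policy from a table of $q_*$ values is a one-liner; the q-values below are arbitrary placeholders, not computed from any MDP:

```python
import numpy as np

# Greedy (deterministic) policy from q*: probability 1 on argmax_a q*(s,a).
q_star = np.array([
    [1.0, 3.0],    # q*(s0, a0), q*(s0, a1)
    [2.0, 0.5],    # q*(s1, a0), q*(s1, a1)
])

best = q_star.argmax(axis=1)                 # greedy action in each state
pi_star = np.zeros_like(q_star)
pi_star[np.arange(len(q_star)), best] = 1.0  # one-hot rows

# pi_star is a valid policy: each row is a distribution over actions.
assert np.allclose(pi_star.sum(axis=1), 1.0)
```

This is why knowing $q_*$ "solves" the MDP: no model of $\mathcal{P}$ or $\mathcal{R}$ is needed to act optimally once $q_*$ is known.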
Bellman Optimality Equation
$$v_*(s) = \underset{a}{\max}\ q_*(s,a) \tag{11}$$
$$q_*(s,a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_*(s') \tag{12}$$
Substituting Equations 11 and 12 into each other:
$$v_*(s) = \underset{a}{\max} \left(\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_*(s') \right) \tag{13}$$
$$q_*(s,a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a\, \underset{a'}{\max}\ q_*(s', a') \tag{14}$$
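Iterating the Bellman optimality backup to a fixed point gives value iteration. A sketch on an invented 2-state, 2-action MDP (value iteration itself belongs to the next lecture's dynamic-programming material, so this is only a preview):

```python
import numpy as np

# Value iteration: sweep the Bellman optimality backup
# v(s) <- max_a (R^a_s + gamma * sum_{s'} P^a_{ss'} v(s'))
# to a fixed point. The MDP below is invented.
P = np.array([                           # P[a, s, s']
    [[0.7, 0.3], [0.2, 0.8]],
    [[0.1, 0.9], [0.6, 0.4]],
])
R = np.array([[1.0, 0.0], [0.0, 2.0]])   # R[a, s]
gamma = 0.9

v = np.zeros(2)
for _ in range(2000):
    q = R.T + gamma * np.einsum("ast,t->sa", P, v)   # Equation 12
    v_new = q.max(axis=1)                            # Equation 11
    if np.max(np.abs(v_new - v)) < 1e-12:
        break
    v = v_new

pi_greedy = q.argmax(axis=1)   # optimal actions read off from q*
```

Unlike the Bellman expectation equation, the optimality equation is nonlinear because of the max, so there is no closed-form solution and iterative methods like this are required.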