


state-action rewards

finite horizon MDPs

linear dynamical systems


- linear quadratic regulation (LQR)

Just a recap: an MDP is a five-tuple (S, A, {P_sa}, γ, R). (The figure below gives the general definition; we then make some changes to it to produce variations.)

First, we change the reward function so that it is a function not only of the state but also of the action. Recall that in the last lecture I said that for an MDP with an infinite or continuous state space we cannot apply value iteration directly; for continuous MDPs we use approximations of the optimal value function. But later, we'll talk about a special case of MDPs where you can actually represent the value function exactly, even if you have an infinite state space or a continuous state space. To get to these special cases of infinite-state MDPs, I'll use this new variation of the reward function and an alternative to discounting, which make the formulation a little easier.

First variation: state-action rewards. For example, it is more costly for a robot to move than to stay still, so the reward depends on the action as well.
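With state-action rewards, the Bellman equation simply moves the reward inside the maximization. A sketch in the usual notation, assuming the discounted, discrete-state setting of the recap above:

```latex
V^*(s) = \max_{a} \Big[ R(s,a) + \gamma \sum_{s'} P_{sa}(s')\, V^*(s') \Big],
\qquad
\pi^*(s) = \arg\max_{a} \Big[ R(s,a) + \gamma \sum_{s'} P_{sa}(s')\, V^*(s') \Big]
```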

Second variation: finite horizon MDP.

The figure notes that the optimal policy may be non-stationary, that is to say, my optimal action to take will be different for different time steps. An example followed (see the box at the lower right of the blackboard).

Since we have a non-stationary optimal policy, I'm going to allow non-stationary transition probabilities as well. Take an airplane, for example: the longer it flies, the less fuel it has, so its transition probabilities vary with time.

So now that we have a non-stationary optimal policy, let's talk about an algorithm to actually find the optimal policy. Let's define the following (in the figure, the value function sums the rewards starting at time t and ending at time T), and write out the value iteration algorithm (below the horizontal line).
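Since the figure is not reproduced here, the definition and the recursion can be written out as follows. This is a reconstruction in the usual finite horizon notation, where the superscript (t) marks the time-varying rewards and transition probabilities; the exact symbols on the blackboard may differ.

```latex
\begin{align*}
V_t^*(s) &= \mathbb{E}\big[ R^{(t)}(s_t,a_t) + \dots + R^{(T)}(s_T,a_T) \,\big|\, \pi^*,\ s_t = s \big] \\
V_T^*(s) &= \max_{a} R^{(T)}(s,a) \\
V_t^*(s) &= \max_{a} \Big[ R^{(t)}(s,a) + \sum_{s'} P^{(t)}_{sa}(s')\, V_{t+1}^*(s') \Big], \qquad t < T \\
\pi_t^*(s) &= \arg\max_{a} \Big[ R^{(t)}(s,a) + \sum_{s'} P^{(t)}_{sa}(s')\, V_{t+1}^*(s') \Big]
\end{align*}
```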

In fact, there is a very nice dynamic programming algorithm for this: start at time T and compute backwards.
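The backward pass can be sketched in a few lines of Python. This is only an illustration for a small discrete MDP: the function name, the callback signatures `P(t, s, a)` and `R(t, s, a)`, and the five-cell corridor example are all assumptions made for this sketch, not something from the lecture.

```python
def finite_horizon_value_iteration(states, actions, P, R, T):
    """Backward dynamic programming for a finite horizon MDP.

    P(t, s, a) returns a dict {next_state: probability} (may depend on t).
    R(t, s, a) returns the reward for taking action a in state s at time t.
    Returns (V, pi) with V[t][s] the optimal value and pi[t][s] the
    optimal action at time t, for t = 0..T.
    """
    V = [dict() for _ in range(T + 1)]
    pi = [dict() for _ in range(T + 1)]
    for t in range(T, -1, -1):          # start at T and go backwards
        for s in states:
            def q(a):
                if t == T:              # last step: only the immediate reward
                    return R(t, s, a)
                nxt = P(t, s, a)
                return R(t, s, a) + sum(p * V[t + 1][s2] for s2, p in nxt.items())
            best = max(actions, key=q)
            V[t][s], pi[t][s] = q(best), best
    return V, pi


# Hypothetical example: five cells in a row. Each time step spent in
# cell 0 pays 1, each time step spent in cell 4 pays 10, moves are
# deterministic and the walls block movement.
STATES = [0, 1, 2, 3, 4]
ACTIONS = ["left", "right"]
CELL_REWARD = {0: 1.0, 4: 10.0}

def P(t, s, a):
    step = -1 if a == "left" else 1
    return {min(4, max(0, s + step)): 1.0}

def R(t, s, a):
    return CELL_REWARD.get(s, 0.0)

V, pi = finite_horizon_value_iteration(STATES, ACTIONS, P, R, T=5)
print([pi[t][1] for t in range(5)])
# ['right', 'right', 'right', 'left', 'left']
```

In cell 1 the best action flips from "right" to "left" once too few steps remain to reach cell 4, which is the non-stationary behaviour described above.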

Someone asked about the fact that the finite horizon MDP no longer has the discounting term, γ (which makes rewards decay over time).

Andrew: you usually use either a finite horizon MDP or discounting, not both. The two actually serve a similar purpose: both guarantee that the value function is finite.

That covers finite horizon MDPs.

Next, we combine the two ideas above into a special kind of MDP (it makes strong but reasonable assumptions), along with a very elegant and effective algorithm that can solve even very large MDPs.


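We will use the same dynamic programming algorithm as for the finite horizon MDP. Specify the problem first: assume A, B, and W are given; our goal is to find an optimal policy. Also, the noise (W) in the figure is actually not very important, so later on I may ignore it or treat it somewhat loosely.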

Assume U_t and V_t are positive semidefinite, so the reward is never positive.

As a concrete example to make this easier to understand, suppose you have a helicopter and want s_t to be as close to 0 as possible.
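Because the figure is missing, here is the problem setup written out. It follows the standard LQR formulation and is a reconstruction: Σ_w is assumed to be the noise covariance that the notes call W, and the identity matrices in the helicopter line are just one illustrative choice.

```latex
\begin{align*}
s_{t+1} &= A_t s_t + B_t a_t + w_t, \qquad w_t \sim \mathcal{N}(0, \Sigma_w) \\
R^{(t)}(s_t, a_t) &= -\big( s_t^\top U_t s_t + a_t^\top V_t a_t \big), \qquad U_t \succeq 0,\ V_t \succeq 0 \\
\text{helicopter example: } U_t = I,\ V_t = I \ &\Rightarrow\ R^{(t)}(s_t, a_t) = -\lVert s_t \rVert^2 - \lVert a_t \rVert^2
\end{align*}
```

With this reward, the only way to avoid a penalty is to keep both the state and the action near zero.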

In the next few steps, I'm going to derive things for the general case of non-stationary dynamics. There will be more math and equations for LQR; I'm going to write out the fairly general case of time-varying dynamics and time-varying reward functions. To make it easier to follow, you can ignore the time subscripts and imagine that everything is fixed.

After that, I'll cover an extension of this called differential dynamic programming.

First, a word on how to come up with a linear model. The key assumption in the model is that the dynamics are linear. There is also the assumption that the reward function is quadratic.

If you have an inverted pendulum system, and you want to model the inverted pendulum using a linear model like this…

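Another way to come up with a linear model is to linearize a nonlinear model. (One caveat: linearize around the expected operating point, such as the upright position of the inverted pendulum; otherwise the error far from the linearization point is too large.)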

Write it (linearizing a nonlinear model) in math: first the simple case, then the fully general one in a second.
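In the general case the linearization is a first-order Taylor expansion around a chosen point (s̄, ā): s_{t+1} ≈ f(s̄, ā) + A (s_t − s̄) + B (a_t − ā), where A and B are the Jacobians of f with respect to the state and the action. The sketch below estimates them by central finite differences. The pendulum model, its parameters, and the function names are assumptions made for illustration (angle measured from upright, scalar torque-like action, one Euler step), not the model from the lecture.

```python
import math

def pendulum_step(s, a, dt=0.01, g=9.81, length=1.0):
    """Hypothetical inverted pendulum, one Euler step.
    s = [theta, omega], with theta measured from the upright position."""
    theta, omega = s
    alpha = (g / length) * math.sin(theta) + a   # angular acceleration
    return [theta + dt * omega, omega + dt * alpha]

def linearize(f, s_bar, a_bar, eps=1e-6):
    """Central-difference Jacobians of f at (s_bar, a_bar).
    Returns A (n x n, derivative w.r.t. the state) and
    B (length n, derivative w.r.t. the scalar action)."""
    n = len(s_bar)
    A = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s_plus, s_minus = list(s_bar), list(s_bar)
        s_plus[j] += eps
        s_minus[j] -= eps
        f_plus, f_minus = f(s_plus, a_bar), f(s_minus, a_bar)
        for i in range(n):
            A[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    f_plus, f_minus = f(s_bar, a_bar + eps), f(s_bar, a_bar - eps)
    B = [(f_plus[i] - f_minus[i]) / (2 * eps) for i in range(n)]
    return A, B

# Linearize around the upright equilibrium (theta = 0, omega = 0, a = 0).
A, B = linearize(pendulum_step, [0.0, 0.0], 0.0)
print(A, B)
```

At the upright equilibrium f(s̄, ā) = s̄ = 0, so the constant term vanishes and the model is purely linear, s_{t+1} ≈ A s_t + B a_t. Linearizing around a point far from where the system actually operates would give a model that is accurate in the wrong place.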

Now I'll pose an LQR problem: given A, B, U, V, our goal is to come up with a policy that maximizes the expected value of the rewards.

The algorithm is exactly the dynamic programming algorithm for finite horizon problems mentioned earlier.

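In the figure below, the step marked (because … ≥ 0) holds because the matrix V_T is positive semidefinite, so at the final time T you certainly want to take no action.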

OK, now let's do the dynamic programming (DP) step: simply put, given V*_{t+1}, we want to compute V*_t.

It turns out LQR has the following useful property: each of these value functions can be represented as a quadratic function. (I don't understand this yet.)


The value function at state s_t equals the immediate reward at time t plus the expected value of the next state s_{t+1}...

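Taking the derivative with respect to a_t and setting it to zero gives the optimal action (some linear combination of the states s_t, or equivalently a matrix L_t times s_t).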

Then plug a_t back in and carry out the maximization, and you get this; the recursion is called the discrete time Riccati equation:
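The formulas in the missing figure can be re-derived as follows. This is a reconstruction under the conventions assumed in these notes: reward −(sᵀU s + aᵀV a), value function V*_{t+1}(s) = sᵀΦ_{t+1}s + Ψ_{t+1}, and noise covariance Σ_w. The way signs are grouped on the blackboard may look different.

```latex
\begin{align*}
V_t^*(s_t) &= \max_{a_t} \Big[ -s_t^\top U_t s_t - a_t^\top V_t a_t
  + \mathbb{E}_{s_{t+1} \sim \mathcal{N}(A_t s_t + B_t a_t,\ \Sigma_w)} \big[ s_{t+1}^\top \Phi_{t+1} s_{t+1} + \Psi_{t+1} \big] \Big] \\
a_t^* &= L_t s_t, \qquad L_t = \big( V_t - B_t^\top \Phi_{t+1} B_t \big)^{-1} B_t^\top \Phi_{t+1} A_t \\
\Phi_t &= A_t^\top \Big( \Phi_{t+1} + \Phi_{t+1} B_t \big( V_t - B_t^\top \Phi_{t+1} B_t \big)^{-1} B_t^\top \Phi_{t+1} \Big) A_t - U_t \\
\Psi_t &= \Psi_{t+1} + \operatorname{tr}\big( \Sigma_w \Phi_{t+1} \big) \\
\Phi_T &= -U_T, \qquad \Psi_T = 0
\end{align*}
```

Note that L_t and Φ_t do not depend on Σ_w; the noise only shifts the constant Ψ_t, which is why it can be treated loosely.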

So, to summarize, our algorithm for finding the exact solution to finite horizon LQR problems is as follows.
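A minimal sketch of that algorithm, restricted to the scalar case (one state variable, one action variable, time-invariant A, B, U, V) so that it needs no matrix library. The function name and arguments are assumptions made for this sketch; it uses the conventions reward = −(U s² + V a²) and V*_t(s) = Φ_t s² + Ψ_t, and requires V > 0 so that the division is well defined.

```python
def lqr_scalar(A, B, U, V, sigma2, T):
    """Backward Riccati recursion for scalar finite horizon LQR.

    Dynamics: s' = A*s + B*a + w, with w ~ N(0, sigma2).
    Reward:   -(U*s**2 + V*a**2), with U >= 0 and V > 0.
    Returns lists phi, psi, L indexed by t = 0..T such that
    V*_t(s) = phi[t]*s**2 + psi[t] and the optimal action is a_t = L[t]*s_t.
    """
    phi = [0.0] * (T + 1)
    psi = [0.0] * (T + 1)
    L = [0.0] * (T + 1)
    phi[T] = -float(U)      # at the last step the best action is a_T = 0
    for t in range(T - 1, -1, -1):
        p = phi[t + 1]                  # p <= 0 throughout
        m = V - B * p * B               # positive because V > 0 and p <= 0
        L[t] = B * p * A / m            # optimal feedback gain
        phi[t] = A * (p + p * B * B * p / m) * A - U
        psi[t] = psi[t + 1] + sigma2 * p
    return phi, psi, L


phi, psi, L = lqr_scalar(A=1.0, B=1.0, U=1.0, V=1.0, sigma2=0.1, T=1)
print(L[0], phi[0], psi[0])
# -0.5 -1.5 -0.1
```

The one-step case can be checked by hand: maximizing −s² − a² − (s + a)² over a gives a = −s/2 and a value of −1.5 s². In the matrix case the same recursion applies with the division replaced by a linear solve, which is where the cubic cost in the number of state variables comes from.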


So the very cool thing about the solution of discrete time, finite horizon LQR problems is that this is a problem with an infinite, continuous state space. But nonetheless, under the assumptions we made, you can prove that the value function is a quadratic function of the state. Therefore, just by computing the matrices Φ_t (phi) and the real numbers Ψ_t (psi), you can exactly represent the value function, even for these infinitely large, continuous state spaces. And so the computation of this algorithm scales only like the cube of the number of state variables, that is, only polynomially… It's easily applied to problems with even very large state spaces, so we actually often apply variations of this algorithm to a particular subset of the things we do on our helicopter, which has high-dimensional state spaces with twelve or more dimensions. This has worked very well for that.

So it turns out there are even more things you can do with this…


