
  • Value Function Approximation
    • Introduction
      • Why do we need it?
      • Types of Value Function Approximation
      • Which Function Approximator?
    • Incremental Methods
      • Value Function Approx. by SGD
      • Linear Function Approximation
      • Incremental Prediction Algorithms
      • Control with Value Function Approximation
  • Action-Value Function Approximation
    • Linear Action-Value Function Approximation
    • Incremental Prediction Algorithms
    • Convergence of Prediction Algorithms
    • Convergence of Control Algorithms
  • Batch Methods

Value Function Approximation


Why do we need it?

  • So far we have represented the value function by a lookup table:

    • Every state $s$ has an entry $V(s)$
    • Or every state-action pair $(s, a)$ has an entry $Q(s,a)$
  • Problem with large MDPs:
    • There are too many states and/or actions to store in memory
    • It is too slow to learn the value of each state individually

Solution for large MDPs:

  • Estimate the value function with function approximation:
    $$
    \begin{aligned} \hat{v}(s,\mathbf{w}) &\approx v_\pi(s) \\ \text{or}\ \hat{q}(s,a,\mathbf{w}) &\approx q_\pi(s,a) \end{aligned}
    $$
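To make the contrast with a lookup table concrete, here is a minimal sketch in Python. The state-space size, the polynomial feature map `features`, and the linear form of `v_hat` are all assumptions chosen for illustration; the notes above do not prescribe any particular approximator at this point.

```python
import numpy as np

n_states = 1_000_000   # hypothetical size of the state space
n_features = 8         # hypothetical number of features per state


def features(s: int) -> np.ndarray:
    """Hypothetical feature map x(s): powers of the normalised state index."""
    z = s / n_states
    return np.array([z ** k for k in range(n_features)])


def v_hat(s: int, w: np.ndarray) -> float:
    """Approximate value v_hat(s, w) = x(s)^T w (linear form assumed here)."""
    return float(features(s) @ w)


table = np.zeros(n_states)   # lookup table: one stored entry V(s) per state
w = np.zeros(n_features)     # approximator: one weight vector shared by all states

print(table.size, w.size)    # number of stored values in each representation
```

The table stores one number per state, while the approximator stores only the weight vector `w`. Changing a single component of `w` changes the estimate for many states at once, which is the price and the benefit of sharing parameters.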

Types of Value Function Approximation

Which Function Approximator?

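There are many function approximators, e.g.

  • Linear combinations of features
  • Neural network
  • Decision tree
  • Nearest neighbor
  • Fourier / wavelet bases
  • …

Here we consider differentiable function approximators, such as linear combinations of features and neural networks, because the incremental methods below update the parameters by gradient descent.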

Incremental Methods

Value Function Approx. by SGD

Goal: find the parameter vector $\mathbf{w}$ that minimises the mean-squared error between the approximate value function $\hat{v}(s,\mathbf{w})$ and the true value function $v_\pi(s)$:
$$
\pmb{J}(\mathbf{w}) = \mathbb{E}_\pi \left[(v_\pi(S) - \hat{v}(S, \mathbf{w}))^2 \right] \tag{1}
$$
Gradient descent finds a local minimum:
$$
\Delta \mathbf{w} = -\frac{1}{2} \alpha \nabla_{\mathbf{w}} \pmb{J}(\mathbf{w}) = \alpha {\color{red}{\mathbb{E}_\pi}} \left[(v_\pi(S) - \hat{v}(S, \mathbf{w})) \nabla_{\mathbf{w}}\hat{v}(S,\mathbf{w}) \right]
$$
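Stochastic gradient descent samples this gradient, using a single sampled state $S$ instead of the expectation:
$$
\Delta \mathbf{w} = \alpha (v_\pi(S) - \hat{v}(S, \mathbf{w})) \nabla_{\mathbf{w}}\hat{v}(S,\mathbf{w})
$$
The expected update is equal to the full gradient update.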
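A minimal sketch of the sampled update in Python may help. It rests on several assumptions that are not in the notes: the approximator is linear in hypothetical random features `X`, the true values `v_pi` are available from an oracle (in practice they are not, and must be replaced by a target), and states are sampled uniformly as a stand-in for the distribution under $\pi$.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_features = 50, 4
X = rng.normal(size=(n_states, n_features))  # hypothetical features x(s), one row per state
w_true = np.array([1.0, -2.0, 0.5, 3.0])     # hypothetical weights defining the oracle
v_pi = X @ w_true                            # stand-in for the true values v_pi(s)


def v_hat(s, w):
    """Linear approximator v_hat(s, w) = x(s)^T w (assumed for illustration)."""
    return X[s] @ w


def grad_v_hat(s, w):
    """Gradient of v_hat with respect to w; for a linear model it is x(s)."""
    return X[s]


def sgd_step(w, s, alpha):
    """One sampled update: w + alpha * (v_pi(s) - v_hat(s, w)) * grad v_hat(s, w)."""
    return w + alpha * (v_pi[s] - v_hat(s, w)) * grad_v_hat(s, w)


alpha = 0.05

# Averaging the sampled update over all states gives the full gradient update.
w0 = np.zeros(n_features)
sampled_mean = np.mean([sgd_step(w0, s, alpha) - w0 for s in range(n_states)], axis=0)
full_update = alpha * ((v_pi - X @ w0) @ X) / n_states

# Run SGD from zero weights, one sampled state per step.
w = np.zeros(n_features)
for _ in range(5000):
    s = rng.integers(n_states)  # uniform sampling, an assumption of this sketch
    w = sgd_step(w, s, alpha)

mse = np.mean((v_pi - X @ w) ** 2)
print(mse)
```

Because the oracle here is exactly linear in the features, the weights can match it and the error shrinks toward zero. With a real MDP the true $v_\pi$ is generally not representable, and the sketch would only reach the best fit under the sampling distribution.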



