Table of Contents

  • Value Function Approximation
    • Introduction
      • Why Is It Needed?
      • Types of Value Function Approximation
      • Which Function Approximator?
    • Incremental Methods
      • Value Function Approx. by SGD
      • Linear Function Approximation
      • Incremental Prediction Algorithms
      • Control with Value Function Approximation
  • Action-Value Function Approximation
    • Linear Action-Value Function Approximation
    • Incremental Prediction Algorithms
    • Convergence of Prediction Algorithms
    • Convergence of Control Algorithms
  • Batch Methods

Value Function Approximation

Introduction

Why Is It Needed?

  • So far we have represented the value function by a lookup table:

    • Every state $s$ has an entry $V(s)$
    • Or every state-action pair $(s,a)$ has an entry $Q(s,a)$
  • Problem with large MDPs:
    • There are too many states and/or actions to store in memory
    • It is too slow to learn the value of each state individually
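To make the lookup-table representation concrete, here is a minimal sketch, assuming a generic "move the entry towards a target" update with a hypothetical step size `alpha` and hypothetical state keys. The point it illustrates is that each entry is stored and updated independently, so experience in one state says nothing about any other state.

```python
# Tabular value functions: one independent entry per state / state-action pair.
from collections import defaultdict

V = defaultdict(float)   # V[s]      -> estimated state value (0.0 until updated)
Q = defaultdict(float)   # Q[(s, a)] -> estimated action value (0.0 until updated)

def tabular_update(table, key, target, alpha=0.1):
    """Move the single entry table[key] a fraction alpha towards target (hypothetical rule)."""
    table[key] += alpha * (target - table[key])

# Updating state 3 leaves its neighbours 2 and 4 untouched: no generalisation.
tabular_update(V, 3, target=1.0)
print(V[3], V[2], V[4])  # 0.1 0.0 0.0
```

Memory grows with the number of distinct keys, and every state must be visited to be learned, which is exactly the pair of problems listed above.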

Solution for large MDPs:

  • Estimate the value function with function approximation:
    $$\hat{v}(s,\mathbf{w}) \approx v_\pi(s) \quad \text{or} \quad \hat{q}(s,a,\mathbf{w}) \approx q_\pi(s,a)$$
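A parametric $\hat{v}(s,\mathbf{w})$ replaces the table with a small weight vector shared by all states. The sketch below is only an illustration under stated assumptions: states are real numbers, the features are the hypothetical polynomial basis `[1, s, s^2]`, and the single training step is a plain gradient step on the squared error towards a hypothetical target. It shows that learning from one state also moves the estimate for a nearby state that was never seen.

```python
# Parametric value function: 3 shared weights instead of one entry per state.
# Assumption: scalar states and hypothetical polynomial features x(s) = [1, s, s^2].

def features(s):
    return [1.0, s, s * s]

def v_hat(s, w):
    """Linear approximation v_hat(s, w) = x(s) . w."""
    return sum(xi * wi for xi, wi in zip(features(s), w))

def sgd_step(s, target, w, alpha=0.1):
    """One gradient step on 0.5 * (target - v_hat)^2; here grad_w v_hat(s, w) = x(s)."""
    error = target - v_hat(s, w)
    return [wi + alpha * error * xi for wi, xi in zip(w, features(s))]

w = [0.0, 0.0, 0.0]
w = sgd_step(1.0, target=1.0, w=w)
# Training on s = 1.0 also changes the estimate for the unseen state s = 1.1.
print(v_hat(1.0, w), v_hat(1.1, w))
```

The weight vector has a fixed size regardless of how many states exist, which is what makes the approach viable for large or continuous state spaces.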

Types of Value Function Approximation

Which Function Approximator?

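There are many function approximators, e.g.

  • Linear combinations of features
  • Neural network
  • Decision tree
  • Nearest neighbor
  • Fourier / wavelet bases
  • …

We consider differentiable function approximators, e.g. linear combinations of features and neural networks, because the gradient-based updates below require $\nabla_{\mathbf{w}}\hat{v}(s,\mathbf{w})$. Decision trees and nearest-neighbor methods do not provide such a gradient.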
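What "differentiable" buys us is the gradient $\nabla_{\mathbf{w}}\hat{v}(s,\mathbf{w})$. The sketch below computes it for two of the approximators listed above and checks each against central finite differences. Assumptions (not from the lecture): a scalar state, hypothetical features `[1, s, s^2]` for the linear model, and a hypothetical one-hidden-layer `tanh` network with `H = 3` hidden units whose parameters are packed into one flat vector.

```python
# Analytic gradients of two differentiable approximators, checked numerically.
import math

def linear_v(s, w):
    x = [1.0, s, s * s]  # hypothetical features
    return sum(xi * wi for xi, wi in zip(x, w))

def linear_grad(s, w):
    # For a linear model the gradient w.r.t. w is just the feature vector.
    return [1.0, s, s * s]

def net_v(s, w, H=3):
    # Flat layout of w: H input weights a, H biases b, H output weights c, 1 output bias d.
    a, b, c, d = w[:H], w[H:2 * H], w[2 * H:3 * H], w[3 * H]
    return sum(c[j] * math.tanh(a[j] * s + b[j]) for j in range(H)) + d

def net_grad(s, w, H=3):
    a, b, c = w[:H], w[H:2 * H], w[2 * H:3 * H]
    h = [math.tanh(a[j] * s + b[j]) for j in range(H)]
    dh = [1.0 - hj * hj for hj in h]             # derivative of tanh
    ga = [c[j] * dh[j] * s for j in range(H)]    # dv/da_j
    gb = [c[j] * dh[j] for j in range(H)]        # dv/db_j
    gc = h                                       # dv/dc_j
    return ga + gb + gc + [1.0]                  # dv/dd

def numeric_grad(f, s, w, eps=1e-6):
    # Central differences, one parameter at a time.
    g = []
    for i in range(len(w)):
        wp, wm = list(w), list(w)
        wp[i] += eps
        wm[i] -= eps
        g.append((f(s, wp) - f(s, wm)) / (2 * eps))
    return g

w_lin = [0.5, -1.0, 2.0]
w_net = [0.3, -0.7, 1.1, 0.1, 0.0, -0.2, 0.9, -0.4, 0.6, 0.05]
s = 0.8
lin_err = max(abs(p - q) for p, q in zip(linear_grad(s, w_lin), numeric_grad(linear_v, s, w_lin)))
net_err = max(abs(p - q) for p, q in zip(net_grad(s, w_net), numeric_grad(net_v, s, w_net)))
print(lin_err, net_err)  # both should be tiny
```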

Incremental Methods

Value Function Approx. by SGD

Goal: find the parameter vector $\mathbf{w}$ minimising the mean-squared error between the approximate value function $\hat{v}(s,\mathbf{w})$ and the true value function $v_\pi(s)$:
$$J(\mathbf{w}) = \mathbb{E}_\pi\left[(v_\pi(S) - \hat{v}(S,\mathbf{w}))^2\right] \tag{1}$$
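Objective (1) is an expectation over the states visited under $\pi$. The sketch below evaluates it exactly on a hypothetical three-state problem where the true values `V_PI` and the on-policy state distribution `MU` are assumed known (in real problems neither is); the numbers and the features `[1, s]` are illustrative only.

```python
# J(w) = E_pi[(v_pi(S) - v_hat(S, w))^2] on a hypothetical 3-state problem.
STATES = [0.0, 0.5, 1.0]
MU = [0.5, 0.3, 0.2]     # hypothetical probability of visiting each state under pi
V_PI = [0.0, 0.4, 1.0]   # hypothetical true values v_pi(s)

def features(s):
    return [1.0, s]      # hypothetical features: bias and the state itself

def v_hat(s, w):
    return sum(xi * wi for xi, wi in zip(features(s), w))

def J(w):
    # Expectation = sum over states weighted by their visit probability.
    return sum(mu * (v - v_hat(s, w)) ** 2 for s, mu, v in zip(STATES, MU, V_PI))

print(J([0.0, 0.0]), J([0.0, 1.0]))
```

Weighting by `MU` means errors in frequently visited states cost more than errors in rarely visited ones.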
Gradient descent finds a local minimum:
$$\Delta\mathbf{w} = -\frac{1}{2}\alpha\nabla_{\mathbf{w}}J(\mathbf{w}) = \alpha\,\mathbb{E}_\pi\left[(v_\pi(S) - \hat{v}(S,\mathbf{w}))\nabla_{\mathbf{w}}\hat{v}(S,\mathbf{w})\right]$$
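The full-gradient update can be run directly when the expectation is computable. This is a minimal sketch on the same kind of hypothetical three-state problem (known `V_PI`, known state distribution `MU`, hypothetical linear features `[1, s]`, so $\nabla_{\mathbf{w}}\hat{v}(s,\mathbf{w})$ is the feature vector). In real RL the expectation is not available, which is why sampled updates are used in practice.

```python
# Full (expected) gradient descent on J(w) for a hypothetical 3-state problem.
STATES = [0.0, 0.5, 1.0]
MU = [0.5, 0.3, 0.2]     # hypothetical on-policy state distribution
V_PI = [0.0, 0.4, 1.0]   # hypothetical true values

def features(s):
    return [1.0, s]

def v_hat(s, w):
    return sum(xi * wi for xi, wi in zip(features(s), w))

def grad_v_hat(s, w):
    return features(s)   # linear model: gradient w.r.t. w is x(s)

def J(w):
    return sum(mu * (v - v_hat(s, w)) ** 2 for s, mu, v in zip(STATES, MU, V_PI))

def full_gradient_step(w, alpha):
    # Delta w = alpha * E_pi[(v_pi(S) - v_hat(S, w)) * grad_w v_hat(S, w)] = -0.5 * alpha * grad J
    delta = [0.0] * len(w)
    for s, mu, v in zip(STATES, MU, V_PI):
        err = v - v_hat(s, w)
        for i, g in enumerate(grad_v_hat(s, w)):
            delta[i] += alpha * mu * err * g
    return [wi + di for wi, di in zip(w, delta)]

w = [0.0, 0.0]
history = [J(w)]
for _ in range(500):
    w = full_gradient_step(w, alpha=0.5)
    history.append(J(w))
print(w, history[0], history[-1])
```

With this step size the objective decreases at every iteration on this toy problem and `w` approaches the `MU`-weighted least-squares fit; a step size too large for the curvature of $J$ would diverge instead.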
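Stochastic gradient descent samples the gradient, using a single state $S$ in place of the expectation:
$$\Delta\mathbf{w} = \alpha(v_\pi(S) - \hat{v}(S,\mathbf{w}))\nabla_{\mathbf{w}}\hat{v}(S,\mathbf{w})$$
The expected update is equal to the full gradient update.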
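The claim that the expected update equals the full gradient update can be checked numerically. In this sketch (hypothetical three-state problem, assumed-known `V_PI` and `MU`, hypothetical features `[1, s]`), the per-state sampled update is averaged over $S \sim \mu$ and compared with $-\tfrac{1}{2}\alpha\nabla_{\mathbf{w}}J(\mathbf{w})$ estimated by finite differences. A short SGD run that samples states from `MU` follows.

```python
# Expected SGD update vs. full gradient update, then a sampled SGD run.
import random

STATES = [0.0, 0.5, 1.0]
MU = [0.5, 0.3, 0.2]     # hypothetical on-policy state distribution
V_PI = [0.0, 0.4, 1.0]   # hypothetical true values (unknown in practice)

def features(s):
    return [1.0, s]

def v_hat(s, w):
    return sum(xi * wi for xi, wi in zip(features(s), w))

def J(w):
    return sum(mu * (v - v_hat(s, w)) ** 2 for s, mu, v in zip(STATES, MU, V_PI))

def sgd_delta(s, v_target, w, alpha):
    """Sampled update for one state: alpha * (v_pi(s) - v_hat(s, w)) * grad_w v_hat(s, w)."""
    err = v_target - v_hat(s, w)
    return [alpha * err * g for g in features(s)]

def expected_sgd_delta(w, alpha):
    """E_{S~MU}[sampled update], computed by summing over the states."""
    total = [0.0] * len(w)
    for s, mu, v in zip(STATES, MU, V_PI):
        for i, d in enumerate(sgd_delta(s, v, w, alpha)):
            total[i] += mu * d
    return total

def full_gradient_delta(w, alpha, eps=1e-6):
    """-0.5 * alpha * grad J(w), with grad J estimated by central differences."""
    out = []
    for i in range(len(w)):
        wp, wm = list(w), list(w)
        wp[i] += eps
        wm[i] -= eps
        out.append(-0.5 * alpha * (J(wp) - J(wm)) / (2 * eps))
    return out

w0 = [0.2, -0.3]
print(expected_sgd_delta(w0, 0.1), full_gradient_delta(w0, 0.1))

# SGD: sample S ~ MU and apply the sampled update (constant step size).
rng = random.Random(0)
w = [0.0, 0.0]
for _ in range(5000):
    k = rng.choices(range(len(STATES)), weights=MU)[0]
    w = [wi + di for wi, di in zip(w, sgd_delta(STATES[k], V_PI[k], w, alpha=0.1))]
print(w, J(w))
```

With a constant step size the SGD weights keep fluctuating around the minimiser rather than settling exactly; the two printed update vectors, by contrast, agree up to floating-point error.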
