Reinforcement Learning Note 1: Introduction


Textbook: Sutton and Barto, Reinforcement Learning: An Introduction

Bolei Zhou's (周博磊) Chinese-language course

Coding

Framework: PyTorch


Differences from supervised learning:
Supervised learning: 1. assumes data points are unrelated (i.i.d.); 2. has labels.
Reinforcement learning: data are not necessarily i.i.d.; there is no immediate feedback (delayed reward).

Exploration (taking new actions) & exploitation (taking the currently best-known action)

Features:

  • Trial-and-error exploration
  • Delayed reward
  • Time matters (sequential, non-i.i.d. data)
  • The agent's actions affect the subsequent data it receives

Compared with supervised learning, reinforcement learning can sometimes surpass human-level behavior.

Possible rollout sequences (trajectories of observations, actions, and rewards)

Agent & environment

Rewards: scalar feedback

Sequential decision making: trade-off between near-term and long-term rewards

Full observation & partial observation

RL Agent:

Components:
1. Policy: the agent's behavior function

A map from state/observation to action

Stochastic policy (probabilistic sampling): $\pi(a|s) = P[A_t = a \mid S_t = s]$

Deterministic policy: $a^* = \arg\max_a \pi(a|s)$
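
As an illustration (not part of the original notes), the two policy types can be sketched with a hypothetical tabular policy over a small discrete state/action space; `policy_probs` below is a made-up table:

```python
import numpy as np

# Hypothetical pi(a|s) table: 2 states x 3 actions, each row sums to 1.
policy_probs = np.array([
    [0.2, 0.5, 0.3],   # pi(.|s=0)
    [0.7, 0.1, 0.2],   # pi(.|s=1)
])

rng = np.random.default_rng(0)

def stochastic_policy(state):
    # Sample a ~ pi(a|s): the chosen action can differ on every call.
    return int(rng.choice(policy_probs.shape[1], p=policy_probs[state]))

def deterministic_policy(state):
    # a* = argmax_a pi(a|s): always the most probable action.
    return int(np.argmax(policy_probs[state]))

print(stochastic_policy(0), deterministic_policy(0))  # e.g. "2 1"
```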

2. Value function:

The expected discounted sum of future rewards under a particular policy $\pi$.

The discount factor weights immediate vs. future rewards.

Used to quantify the goodness/badness of states and actions:
$v_{\pi}(s) \overset{\triangle}{=} E_\pi[G_t] = E_\pi\big[\textstyle\sum_{k=0}^{\infty}\gamma^k R_{t+k+1} \mid S_t = s\big]$

Q-function (used to select among actions):
$q_\pi(s,a) \overset{\triangle}{=} E_\pi[G_t \mid S_t = s, A_t = a]$
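
As a small worked example (not from the notes), the discounted return $G_t=\sum_{k=0}^{\infty}\gamma^k R_{t+k+1}$ can be computed from one sampled reward sequence; $v_\pi(s)$ is then the average of such returns over many rollouts starting from $s$ under $\pi$. The rewards and $\gamma$ below are arbitrary illustration values:

```python
def discounted_return(rewards, gamma=0.9):
    # G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Rewards observed after time t along one rollout (made-up numbers).
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.0 + 0.81 * 2.0 = 2.62
```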

3. Model

A model predicts what the environment will do next (e.g., the next state and reward).
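
As a rough sketch (my own illustration, not the notes'), a tabular model can be written as a transition table $P(s'\mid s,a)$ plus a reward table $R(s,a)$; the values below are placeholders:

```python
import numpy as np

n_states, n_actions = 4, 2
rng = np.random.default_rng(0)

# P[s, a, s'] = predicted probability of next state s' (placeholder: uniform).
P = np.full((n_states, n_actions, n_states), 1.0 / n_states)
# R[s, a] = predicted expected immediate reward (placeholder: zeros).
R = np.zeros((n_states, n_actions))

def model_predict(s, a):
    # The model's guess of what the environment will do next.
    next_s = int(rng.choice(n_states, p=P[s, a]))
    return next_s, float(R[s, a])

print(model_predict(0, 1))
```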

Types of RL agents based on what the agent learns:
Value-based agent

Explicitly learns a value function; the policy is implicit (derived from the value function).

Policy-based agent

Explicitly learns a policy; no value function.

Actor-critic agent

Combines a policy and a value function.

Types of RL agents based on whether there is a model:
Model-based

Learns a model (the environment's transitions) directly.

Model-free

Learns a value function / policy function directly; no model.

Exploration and Exploitation

Exploration: trial and error, trying out new actions.

Exploitation: choosing the best action according to what is currently known.
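
A common way to trade the two off is an $\epsilon$-greedy rule: explore with probability $\epsilon$, exploit otherwise. A minimal sketch (not from the notes; the Q-values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon, explore: pick a uniformly random action.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    # Otherwise exploit: pick the action with the highest current Q estimate.
    return int(np.argmax(q_values))

q_s = np.array([0.1, 0.5, 0.3])   # hypothetical Q(s, .) estimates
print(epsilon_greedy(q_s))         # usually 1, occasionally a random action
```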

import gym

# Running this in Python, it freezes as soon as it starts... (likely the render window)
env = gym.make('CartPole-v0')
env.reset()
env.render()
action = env.action_space.sample()  # sample a random action from the action space
observation, reward, done, info = env.step(action)  # old Gym API: 4 return values
env.close()
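
As a usage example (not in the original notes), the same calls can be wrapped in a loop to run one full episode with a random policy under the same old Gym API:

```python
import gym

env = gym.make('CartPole-v0')
observation = env.reset()
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()                   # random policy for illustration
    observation, reward, done, info = env.step(action)   # old Gym API: 4 return values
    total_reward += reward
env.close()
print('episode return:', total_reward)
```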

Example

Next class: Markov Decision Process

