Trust Region Policy Optimization
- Perspective 1: Off-policy
- Importance Sampling
- Gradient Optimization
- Perspective 2: Numerical Optimization
- Trust Region Optimization
- Monte Carlo Approximation
The full name of the TRPO algorithm is Trust Region Policy Optimization.
Perspective 1: Off-policy
In policy-gradient training, the agent usually completes one full interaction with the environment to obtain a Monte Carlo sampled trajectory, and the policy network's parameters are updated once per trajectory; this is the on-policy approach. Its estimates have low error, but sample efficiency is poor because each sample is used only once. To improve sample efficiency, we instead let an old network with parameters $\theta'$ collect samples and use them to update the parameters $\theta$ of the current network; this is the off-policy approach. But can samples collected by the old network be used directly to update the current network's parameters?
Importance Sampling
$$\begin{split} \mathbb{E}_{x\sim p(x)}\left[ f(x) \right] &= \int f(x)p(x)\,dx \\ &= \int f(x)\frac{p(x)}{q(x)}\, q(x)\,dx\\ &= \mathbb{E}_{x\sim q(x)}\left[ f(x)\frac{p(x)}{q(x)} \right] \end{split}$$
This shows that the expectation of $f(x)$ under $p(x)$ can be estimated by sampling from $q(x)$ instead. Although the two expectations are identical, their variances differ.
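As a quick numerical check, the following NumPy sketch (with an arbitrarily chosen $f$, $p$, and $q$; all values are illustrative, not taken from the text) estimates the same expectation both directly under $p$ and by reweighting samples from $q$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy choices for illustration: p = N(0, 1), q = N(0.5, 1), f(x) = x**2.
def p_pdf(x): return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
def q_pdf(x): return np.exp(-(x - 0.5)**2 / 2) / np.sqrt(2 * np.pi)
def f(x): return x**2

# Direct Monte Carlo estimate: E_{x~p}[f(x)].
x_p = rng.normal(0.0, 1.0, n)
direct = f(x_p).mean()

# Importance-sampling estimate: E_{x~q}[f(x) * p(x)/q(x)].
x_q = rng.normal(0.5, 1.0, n)
weights = p_pdf(x_q) / q_pdf(x_q)
importance = (f(x_q) * weights).mean()

print(direct, importance)   # both close to the true value 1.0
```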
The variance under $p(x)$ is
$$Var_{x\sim p(x)}\left[f(x)\right] = \mathbb{E}_{x\sim p(x)}\left[ f^2(x) \right]-\left[ \mathbb{E}_{x\sim p(x)}\left[ f(x) \right]\right]^2$$
while the variance of the importance-weighted estimator under $q(x)$ is
$$\begin{split} Var_{x\sim q(x)}\left[f(x)\frac{p(x)}{q(x)}\right] &= \mathbb{E}_{x\sim q(x)}\left[\left( f(x)\frac{p(x)}{q(x)}\right)^2\right]-\left[ \mathbb{E}_{x\sim q(x)}\left[ f(x)\frac{p(x)}{q(x)} \right] \right]^2 \\ &=\int f^2(x)\frac{p^2(x)}{q^2(x)}\,q(x)\,dx-\left[ \mathbb{E}_{x\sim p(x)}\left[ f(x) \right] \right]^2\\ &= \mathbb{E}_{x\sim p(x)}\left[f^2(x)\frac{p(x)}{q(x)} \right]-\left[ \mathbb{E}_{x\sim p(x)}\left[ f(x) \right] \right]^2 \end{split}$$
Comparing the two, the variances differ only in the first term, by a factor of $\frac{p(x)}{q(x)}$; therefore the distributions $p(x)$ and $q(x)$ must not be too far apart (the sketch below illustrates this numerically). With this caveat in mind, let us return to the objective function $J(\theta)$ of the policy gradient method.
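The following NumPy sketch (same toy $p$ and $f$ as above, with proposals $q$ placed progressively farther from $p$; again purely illustrative) shows that the importance-sampling estimate keeps roughly the same mean while its variance grows sharply:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

def p_pdf(x): return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # p = N(0, 1)
def f(x): return x**2

def is_samples(mu_q):
    """Samples of f(x) * p(x)/q(x) with x ~ q = N(mu_q, 1)."""
    x = rng.normal(mu_q, 1.0, n)
    q_pdf = np.exp(-(x - mu_q)**2 / 2) / np.sqrt(2 * np.pi)
    return f(x) * p_pdf(x) / q_pdf

for mu_q in (0.1, 1.0, 2.0):          # q drifts progressively away from p
    s = is_samples(mu_q)
    print(mu_q, s.mean(), s.var())    # mean stays near 1, variance grows sharply
```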
Denote a trajectory by $\tau = \{s_1,a_1,s_2,a_2,\dots,s_T,a_T\}$.
Under parameters $\theta$, its probability and its return are:
$$\begin{split} \pi_{\theta}(\tau) &= \pi(s_1)\prod_{t=1}^{T} \pi_{\theta}(a_t\mid s_t)\,\pi(s_{t+1}\mid s_t,a_t)\\ R(\tau) &= \sum_{t=1}^{T}r_t \end{split}$$
Note that in $\pi_{\theta}(\tau)$ only the policy factors $\pi_{\theta}(a_t\mid s_t)$ depend on the parameters $\theta$. Our goal is to maximize the expected return; using gradient ascent requires the gradient:
$$\begin{split} \theta^* &=\underset{\theta}{\argmax}\, J(\theta)\\ &=\underset{\theta}{\argmax} \sum_{\tau} R(\tau)\, \pi_{\theta}(\tau)\\ \nabla J(\theta)&= \sum_{\tau} R(\tau)\, \nabla\pi_{\theta}(\tau)\\ &=\sum_{\tau} \left[R(\tau)\, \pi_{\theta}(\tau)\frac{\nabla \pi_{\theta}(\tau)}{\pi_{\theta}(\tau)} \right]\\ &=\sum_{\tau} \left[R(\tau)\,\pi_{\theta}(\tau)\,\nabla \log \pi_{\theta}(\tau)\right]\\ &= \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}\left[R(\tau)\,\nabla \log \pi_{\theta}(\tau)\right] \end{split}$$
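The last line is the familiar REINFORCE estimator. Below is a minimal PyTorch sketch of this gradient for a single recorded trajectory; the tiny softmax policy and the placeholder `states`, `actions`, and total return `R_tau` are assumptions for illustration only:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Hypothetical tiny softmax policy: 4-dimensional states, 2 discrete actions.
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))

# Placeholder on-policy trajectory (s_1..s_T, a_1..a_T) and its total return R(tau).
states = torch.randn(10, 4)
actions = torch.randint(0, 2, (10,))
R_tau = torch.tensor(3.0)

# grad J(theta) ~= R(tau) * sum_t grad log pi_theta(a_t | s_t)
log_probs = Categorical(logits=policy(states)).log_prob(actions)
loss = -(R_tau * log_probs.sum())      # negate because optimizers minimize
loss.backward()                        # gradients now sit in the policy parameters
```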
Gradient Optimization
Suppose we use a policy network $\pi_{\theta'}$ with different parameters to sample data, and use those samples to update the parameters of the policy network $\pi_{\theta}$. Applying a Monte Carlo approximation of the expectation together with the importance-sampling identity gives:
$$\begin{split} \nabla \log \pi_{\theta}(\tau)&=\nabla \log \left[ \pi(s_1)\prod_{t=1}^{T} \pi_{\theta}(a_t\mid s_t)\,\pi(s_{t+1}\mid s_t,a_t) \right]\\ &=\nabla \left[\log \pi(s_1)+\sum_{t=1}^T\log \pi_{\theta}(a_t\mid s_t)+\sum_{t=1}^T \log\pi(s_{t+1}\mid s_t,a_t) \right]\\ &=\sum_{t=1}^T\nabla\log \pi_{\theta}(a_t\mid s_t) \\ \nabla J_{\theta'}(\theta)&= \mathbb{E}_{(s_t,a_t) \sim \pi_{\theta'}}\left[\frac{\pi_{\theta}(s_t,a_t)}{\pi_{\theta'}(s_t,a_t)}R^{\theta'}(s_t,a_t)\,\nabla \log \pi_{\theta}(a_t\mid s_t)\right]\\ &=\mathbb{E}_{(s_t,a_t) \sim \pi_{\theta'}}\left[\frac{\pi_{\theta}(a_t\mid s_t)\,\pi_{\theta}(s_t)}{\pi_{\theta'}(a_t\mid s_t)\,\pi_{\theta'}(s_t)}R^{\theta'}(s_t,a_t)\,\nabla \log \pi_{\theta}(a_t\mid s_t)\right]\\ &\approx \mathbb{E}_{(s_t,a_t) \sim \pi_{\theta'}}\left[\frac{\pi_{\theta}(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}R^{\theta'}(s_t,a_t)\,\nabla \log \pi_{\theta}(a_t\mid s_t)\right]\\ &=\mathbb{E}_{(s_t,a_t) \sim \pi_{\theta'}}\left[\frac{\nabla \pi_{\theta}(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}R^{\theta'}(s_t,a_t)\right] \end{split}$$
where $R^{\theta'}(s_t,a_t)=\sum_{i=t}^{T} \gamma^{i-t}\, r_{s_i,a_i}$ is the discounted return computed along the trajectory collected by $\pi_{\theta'}$. The objective function then becomes:
$$J(\theta)= \mathbb{E}_{(s_t,a_t) \sim \pi_{\theta'}}\left[\frac{\pi_{\theta}(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}R^{\theta'}(s_t,a_t)\right]$$
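A minimal PyTorch sketch of this surrogate objective, assuming a placeholder batch (`states`, `actions`, and reward-to-go values `returns`) collected with an old policy `policy_old`; the old policy's log-probabilities are held fixed so that gradients flow only through $\pi_\theta$:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Hypothetical tiny softmax policies: pi_theta (trained) and pi_theta' (behavior).
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
policy_old = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
policy_old.load_state_dict(policy.state_dict())

# Placeholder batch of (s_t, a_t, R^{theta'}(s_t, a_t)) collected with policy_old.
states = torch.randn(64, 4)
actions = torch.randint(0, 2, (64,))
returns = torch.randn(64)

dist = Categorical(logits=policy(states))
with torch.no_grad():                                   # behavior policy is fixed
    dist_old = Categorical(logits=policy_old(states))

# J(theta) = E[ pi_theta(a|s) / pi_theta'(a|s) * R^{theta'}(s, a) ], batch estimate.
ratio = torch.exp(dist.log_prob(actions) - dist_old.log_prob(actions))
J = (ratio * returns).mean()
(-J).backward()                                         # gradient ascent on J(theta)
```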
Recall that importance sampling is valid only when the two distributions do not differ too much, so TRPO adds a constraint that keeps $\pi_{\theta}$ close to $\pi_{\theta'}$:
$$\begin{split} &J(\theta)= \mathbb{E}_{(s_t,a_t) \sim \pi_{\theta'}}\left[\frac{\pi_{\theta}(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}R^{\theta'}(s_t,a_t)\right]\\ &\mathrm{s.t.}\quad KL(\pi_{\theta}\,\|\,\pi_{\theta'})< \delta \end{split}$$
In other words, a trust region is set so that the parameter update never moves outside of it. The closely related PPO algorithm instead adds a regularization-like penalty term:
$$J_{ppo}(\theta)= J(\theta)-\beta\, KL(\pi_{\theta}\,\|\,\pi_{\theta'})$$
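A compact sketch of this penalized objective under the same assumptions as before (placeholder networks and batch; `beta` is a hypothetical penalty coefficient), using `torch.distributions.kl_divergence` for the KL term:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, kl_divergence

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))      # pi_theta
policy_old = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))  # pi_theta'
beta = 0.01                                  # assumed KL penalty coefficient

states = torch.randn(64, 4)                  # placeholder batch sampled by pi_theta'
actions = torch.randint(0, 2, (64,))
returns = torch.randn(64)                    # placeholder R^{theta'}(s_t, a_t)

dist = Categorical(logits=policy(states))
with torch.no_grad():
    dist_old = Categorical(logits=policy_old(states))

ratio = torch.exp(dist.log_prob(actions) - dist_old.log_prob(actions))
kl = kl_divergence(dist, dist_old).mean()    # KL(pi_theta || pi_theta')
J_ppo = (ratio * returns).mean() - beta * kl
(-J_ppo).backward()                          # gradient ascent on J_ppo
```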
Perspective 2: Numerical Optimization
Vanilla policy gradient converges quickly but is unstable and sensitive to hyperparameters, whereas TRPO converges stably and is far less sensitive to them. Suppose the goal is to maximize $J(\theta)$. Gradient ascent can be used, but in some cases the exact gradient cannot be obtained, for example when the objective is
$$J(\theta)=\mathbb{E}_{S}\left[V(S;\theta)\right]$$
Computing this expectation requires evaluating an integral before the gradient can be obtained, and that integral may not have a closed-form solution, so stochastic gradient ascent is generally used instead: sample a single state $s$ from $S$, compute the gradient $g=\frac{\partial V(s;\theta)}{\partial \theta}$, and update the parameters with it.
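A minimal sketch of this stochastic-gradient-ascent loop, with a stand-in differentiable $V(s;\theta)$ (a toy quadratic chosen purely for illustration):

```python
import torch

theta = torch.zeros(3, requires_grad=True)   # parameters to optimize
lr = 0.1

def V(s, theta):
    # Toy stand-in for V(s; theta); any differentiable function of theta works here.
    return -((theta - s) ** 2).sum()

for _ in range(200):
    s = torch.randn(3)                                  # sample one state s ~ S
    g = torch.autograd.grad(V(s, theta), theta)[0]      # g = dV(s; theta) / dtheta
    with torch.no_grad():
        theta += lr * g                                 # stochastic gradient ascent step
# theta approaches 0 = argmax_theta E_S[V(S; theta)] for this toy objective
```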
Trust Region Optimization
Define
$$\mathcal{N}(\theta) = \left\{ \theta' \mid \left\| \theta' - \theta \right\| \le \delta \right\}$$
as a neighborhood of the parameters $\theta$. We construct a known, differentiable approximation $L(\theta'\mid\theta)$ that stays sufficiently close to $J(\theta')$ inside this neighborhood, so that maximizing $J(\theta')$ over the trust region reduces to maximizing $L(\theta'\mid\theta)$. Trust-region optimization typically alternates between two key steps:
- Approximation within the trust region: $L(\theta\mid\theta_{old})\approx J(\theta)$
- Maximization of the approximation within the trust region: $\theta_{new}=\underset{\theta\in \mathcal{N}(\theta_{old})}{\argmax}\, L(\theta\mid\theta_{old})$
The first step can use Monte Carlo approximation, a Taylor expansion, or other approximation methods; the second step is a constrained optimization problem and is comparatively complicated to solve. By repeating these two steps (the constrained maximization in the second step may itself take several iterations), the trust-region algorithm converges to a local optimum of $J(\theta)$.
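As a concrete toy instance of this two-step loop, the sketch below maximizes a one-dimensional objective using a first-order Taylor expansion as $L$ and a simple shrink-on-failure rule for the trust-region radius; the objective and all constants are illustrative assumptions:

```python
import numpy as np

def J(theta):                              # toy objective to maximize (illustrative)
    return -np.cos(theta) - 0.1 * theta ** 2

def grad_J(theta):
    return np.sin(theta) - 0.2 * theta

theta = 3.0                                # current iterate theta_old
delta = 0.5                                # trust-region radius

for _ in range(30):
    # Step 1: approximate J near theta_old, here by a first-order Taylor expansion
    #         L(t | theta_old) = J(theta_old) + grad_J(theta_old) * (t - theta_old).
    g = grad_J(theta)
    # Step 2: maximize L over |t - theta_old| <= delta; for a linear model the
    #         maximizer is a full step of size delta in the direction of the gradient.
    theta_new = theta + delta * np.sign(g)
    if J(theta_new) > J(theta):            # accept steps that improve the true objective
        theta = theta_new
    else:                                  # otherwise shrink the trust region and retry
        delta *= 0.5

print(theta, J(theta))                     # ends near a local maximizer of J
```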
In reinforcement learning we have:
$$\begin{split} V_{\pi}(s)&=\mathbb{E}_{A\sim \pi(\cdot\mid s;\theta)}\left[Q_{\pi}(s,A)\right]\\ &=\sum_a \pi(a\mid s;\theta)\,Q_{\pi}(s,a)\\ &=\sum_a \pi(a\mid s;\theta_{old})\, \frac{\pi(a\mid s;\theta)}{\pi(a\mid s;\theta_{old})}\,Q_{\pi}(s,a) \\ &=\mathbb{E}_{A\sim \pi(\cdot\mid s;\theta_{old})}\left[ \frac{\pi(A\mid s;\theta)}{\pi(A\mid s;\theta_{old})}\,Q_{\pi}(s,A) \right]\\ J(\theta)&=\mathbb{E}_{s}\left[V_{\pi}(s)\right]\\ &=\mathbb{E}_{s}\left[\mathbb{E}_{A\sim \pi(\cdot\mid s;\theta_{old})}\left[ \frac{\pi(A\mid s;\theta)}{\pi(A\mid s;\theta_{old})}\,Q_{\pi}(s,A) \right] \right] \end{split}$$
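A tiny NumPy check of the reweighting identity above for a single state with a discrete action space (the probability tables and $Q$ values are arbitrary placeholders):

```python
import numpy as np

# Arbitrary placeholder quantities for one state s with three discrete actions.
pi_new = np.array([0.2, 0.5, 0.3])    # pi(a|s; theta)
pi_old = np.array([0.4, 0.4, 0.2])    # pi(a|s; theta_old)
Q = np.array([1.0, -0.5, 2.0])        # Q_pi(s, a)

v_direct = np.sum(pi_new * Q)                            # E_{A ~ pi_theta}[Q(s, A)]
v_reweighted = np.sum(pi_old * (pi_new / pi_old) * Q)    # E_{A ~ pi_theta_old}[ratio * Q]
assert np.isclose(v_direct, v_reweighted)
```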
Monte Carlo Approximation
We now apply a Monte Carlo approximation to the objective $J(\theta)$. Suppose sampling yields a trajectory $\tau = \{s_1,a_1,s_2,a_2,\dots,s_T,a_T\}$;
then the approximation function is
$$L(\theta\mid\theta_{old})=\frac{1}{N}\sum_{i=1}^{N} \frac{\pi(a_i\mid s_i;\theta)}{\pi(a_i\mid s_i;\theta_{old})}\,Q_{\pi}(s_i,a_i)$$
where $N$ is the number of sampled state-action pairs.
Since $Q_{\pi}(s_i,a_i)$ cannot be computed directly, we use the discounted return as an approximation of the action-value function:
$$\begin{split} R_i&=\sum_{t=i}^{T} \gamma^{t-i}\, r_t\\ L(\theta\mid\theta_{old})&\approx \frac{1}{N}\sum_{i=1}^{N} \frac{\pi(a_i\mid s_i;\theta)}{\pi(a_i\mid s_i;\theta_{old})}\,R_i \end{split}$$
We then maximize this approximation within the trust region (for example by gradient ascent):
$$\begin{split} &\theta_{new} =\underset{\theta}{\argmax}\, L(\theta\mid\theta_{old})\\ &\mathrm{s.t.}\quad \theta\in \mathcal{N}(\theta_{old}) \end{split}$$
where the trust region $\mathcal{N}(\theta_{old})$ can be defined by either
$$KL(\pi_{\theta}\,\|\,\pi_{\theta_{old}})\le \delta \qquad \text{or} \qquad \left\| \theta- \theta_{old} \right\| \le \delta.$$
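Putting the pieces of this section together, here is a minimal PyTorch sketch of one trust-region update: compute the discounted returns $R_i$, ascend on the Monte Carlo approximation $L(\theta\mid\theta_{old})$, and undo any step that violates the KL constraint. The trajectory data and hyperparameters are placeholders, and the simple check-and-reject rule stands in for the conjugate-gradient and line-search machinery of the full TRPO algorithm:

```python
import copy
import torch
import torch.nn as nn
from torch.distributions import Categorical, kl_divergence

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))  # pi_theta
policy_old = copy.deepcopy(policy)           # pi_{theta_old}, frozen during this update
gamma, delta, lr, n_steps = 0.99, 0.01, 0.05, 10

# Placeholder trajectory sampled with policy_old.
states = torch.randn(20, 4)
actions = torch.randint(0, 2, (20,))
rewards = torch.randn(20)

# R_i = sum_{t=i}^{T} gamma^{t-i} r_t (discounted reward-to-go).
returns = torch.zeros_like(rewards)
running = 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    returns[t] = running

opt = torch.optim.SGD(policy.parameters(), lr=lr)
with torch.no_grad():
    dist_old = Categorical(logits=policy_old(states))

for _ in range(n_steps):
    backup = copy.deepcopy(policy.state_dict())          # kept in case the step must be undone
    dist = Categorical(logits=policy(states))
    ratio = torch.exp(dist.log_prob(actions) - dist_old.log_prob(actions))
    L = (ratio * returns).mean()                         # L(theta | theta_old)
    opt.zero_grad()
    (-L).backward()                                      # gradient ascent on L
    opt.step()
    with torch.no_grad():
        kl = kl_divergence(Categorical(logits=policy(states)), dist_old).mean()
    if kl > delta:                                       # left the trust region: undo the step
        policy.load_state_dict(backup)
        break
```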