My Roadmap in Reinforcement Learning

一、前言

前段时间接受导师的建议，学习了一些强化学习和GANs的内容，第一周先看的强化学习，二三周看的GANs。强化学习（RL）是一个很有趣的领域，一直以来也是我很喜欢的一个AI的分支，被誉为是AI皇冠上的明珠，因为通过RL能很直观地反映出“智能”。第一周看完之后有不少收获，当时想着要写一篇博客记录下来，结果一拖再拖…
时至今日，已经是第四周了，本来给自己定的本周计划是入门object tracking领域，几天过去了，感觉tracking的入门门槛相对比较高，自己这几天论文看得很艰难，有点迷失了，加上我导师找我谈话催论文，弄得我心烦意乱无心学习。既然如此，不如利用现在的迷失时间来归纳一下RL方面的内容。闲话少说，现在开始进入RL时间。

================================================================

q-learning谈起">二、从Q-Learning谈起

要入门RL，首先的入门算法就是Q-Learning了。Q-Learning如果换一个更大的名字，应该是基于值迭代的马尔科夫链的一个求解算法，认识到Q-Learning的这个名字有助于更深地把握其背后的数学思想，从而能够将其用到求解其他基于马尔科夫链建模的数学模型中去（至少我是这么认为的。。）。
Q-Learning因为是基于值迭代的求解算法（这个和后面会提到的基于策略迭代的算法，比如policy gradient刚好对立），所以理解起来，它其实就是在玩一个Q-tabel，也就是存储着 $Q(state, action)$ 的一个表格，表格的横栏表示不同的state，纵栏表示不同的action。

Q-Learning算法的目的就是不断地迭代优化这个Q-tabel，直至其收敛以逼近理想化的Q-value（真正理想化的Q-value是得不到的，只能approximate），收敛之后每个state的决策就可以直接从Q-tabel中查找Q-value最大的action作为当前state的决策了。

$\pi(s) = \arg max_aQ(s,a)$

而Q-learning优化Q-tabel的方法基于以下几个数学公式：

一是将discounted future reward定义为

$R_t=r_t+ \gamma r_{t+1}+\gamma^2 r_{t+2}+\gamma^3 r_{t+3}...+\gamma^{n-t} r_{n}$ （其中 $\gamma$ 是discount factor）

二是将Q-value（或者Q function）定义为：

the maximum discounted future reward when we perform action a in state s, and continue optimally from that point on
或者
the best possible score at the end of the game after performing action a in state s

从这个定义来看Q-value是一个很理想化的值，想真正地得到理想的Q-value是没有办法。但是，我们可以去逼近它，所以说Q-Learning整个算法其实就是在为了逼近理想化的Q-value而不断地“努力”。

Q-value（或者Q function）用数学表达就是：

$Q(s_t, a_t)=max R_t$
（注意：在intel的博文中写成了 $R_{t+1}$ ，我觉得应该有问题）

三是，拜一、二、所赐，就能顺理成章地得到大名鼎鼎的Bellman Equation了：

$Q(s,a)=r+\gamma max_{a'}Q(s',a)$

其中， $r$ 代表即时的reward（也就是执行action后立马得到的reward）， $s'$ 是执行 $r$ 之后转到的下一个state， $r$ 和 $s'$ 都是通过simulator（也就是game模拟器）观测到的。

拿到了Bellman Equation，就可以快乐地迭代更新Q-table中的Q-value了，整个的算法流程如下：

这里写图片描述

上述算法流程在“select and carry out an action a”的时候可以有不同的策略，比如可以是随机的，也可以是以 $\epsilon$ 的概率的随机选择， $1-\epsilon$ 的概率选择当前Q-value最高对应的action，这种选取方法被称为 $\epsilon$ -greedy exploration。

虽然Q-Learning算法迭代之初得到的Q-value可能相比理想的Q-value差别很大，但是已经有理论证明，只要有足够多的iterations，那么Q-tabel最终会收敛并且能够表示“理想的（却是我们真正想要的）”Q-value。

相关链接：
- Intel官博 Guest Post (Part I): Demystifying Deep Reinforcement Learning
- A painless Q-Learning tutorial

————————————————————————————————————-

Sarsa和Sarsa-Lambda

观察上图的Q-Learning算法流程图，虽然每次都是根据最大化的原则来选择 $a'$ 所对应的 $Q(s',a')$ 来更新 $Q(s,a)$ ，但是进入下一个iteration后，选择的action未必就是 $a'$ ，因此这样来看，传统的Q-Learning看起来有点“不负责任”，盲目地追求用尽量大的值来更新 $Q(s,a)$ 却又不真正地执行 $a'$ 。由此有了Q-Learning的另一个版本，叫做SARSA，这个奇怪的名字其实就是（state-action-reward-sate-action）的首字母组合，从名字就能看出，SARSA是属于“言出必行”类型的算法，是一个“实践派”，既然使用了最大的 $Q(s',a')$ 来更新 $Q(s,a)$ ，那么我就下一个iteration就执行 $a'$ 。相比之下，Q-Learning则有点冒险，因为它过度地去explore了，不像SARSA那么保守务实，一步一一个脚印。

两者算法的对比见下图，从 $Q(s,a)$ 的更新来看，Sarsa言出必行。因此，Q-Learning是off-policy，而Sarsa是on-policy算法：

另外还有一个Q-Learning的版本叫做Sarsa-Lambda，理解起来就是，Sarsa是属于单步更新算法，以“寻宝为例”，Sarsa只会给寻找到宝藏的前一步一个奖励，而忽略了之前许多步的作用。因此引入了回合更新算法，这样就可以照顾到先前的步。

而lambda=1的时候相当于同等地看待先前的所有步，为了引入时间上的discount，所以lambda常常介于0,1之间（0对应的就是普通的Sarsa了）

为了更具体地了解一下Sarsa-lambda的思想和具体实现，可以看一下下面的代码语句块：

class SarsaLambdaTable(RL): # 继承 RL class
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9, trace_decay=0.9):
        ...
    def check_state_exist(self, state):
        ...
    def learn(self, s, a, r, s_, a_):
        # 这部分和 Sarsa 一样
        self.check_state_exist(s_)
        q_predict = self.q_table.ix[s, a]
        if s_ != 'terminal':
            q_target = r + self.gamma * self.q_table.ix[s_, a_]
        else:
            q_target = r
        error = q_target - q_predict

        # 这里开始不同:
        # 对于经历过的 state-action, 我们让他+1, 证明他是得到 reward 路途中不可或缺的一环
        self.eligibility_trace.ix[s, a] += 1

        # Q table 更新
        self.q_table += self.lr * error * self.eligibility_trace

        # 随着时间衰减 eligibility trace 的值, 离获取 reward 越远的步, 他的"不可或缺性"越小
        self.eligibility_trace *= self.gamma*self.lambda_

从代码中可以看到，借用eligibility_trace为“桥梁”，在每一步的Q tabel更新时可以更新整个Q table（或者先前所有步），并且eligibility_trace随时间以gamma，随步距以lambda_的速度衰减。

————————————————————————————————————-
以上就是初入RL需要掌握的三个经典算法， Q-Learning（off-policy）， Sarsa（on-policy）， Sarsa-Lambda（on-policy）。其中Sarsa和Sarsa-Lambda主要参考了莫烦强化学习教程，在我看来应该是优于Q-Learning的算法，尤其是Sarsa-Lambda的回合更新在reward的分配上是一个考虑相对更加周全的算法，不过Sarsa， Sarsa-Lambda背后的主要思想都是继承了Q-Learning，都是基于值迭代的马尔科夫链的一个求解算法!

================================================================

三、Deep Q Network

前边提到的Q-Learning及其衍生算法有一个弊病，那就是当state的数量很大时，比如在处理视频游戏（比如Atari游戏）的时候，用像素组合表征的state是一个天文数字，这样一来Q-Learning就有点捉襟见肘了。而神经网络对于建模高维结构化数据是一大利器，所以用Network去扮演高维Q table的角色，学习一个Q function成了一个自然的想法。

使用神经网络来approximate $Q（s,a）$ 有两种可以选择的结构:

其中，以state作为输入，输出各个action对应的Q-value只需要一次forward pass就能得到Q tabel的一行了，更加方便和容易建模。

需要一提的是，如果输入的state是图像或者视频，通常不会使用pooling层，因为pooling层会对transition不敏感，而视频游戏中对物体的位置信息是需要保留的。

根据对Q function的理解，给定一个transition <s,a,r,s′> <script type="math/tex" id="MathJax-Element-27"> </script>，那么DQN的损失函数定义为：

$L=\frac{1}{2}[r+max_{a'}Q(s',a')-Q(s,a)]^2$

其中 $r+max_{a'}Q(s',a')$ 是target， $Q(s,a)$ 是prediction， $s'$ 是 $s$ 的下一个state。

先前的Q-table更新算法由此变更为：
————————————————————————-
1. Do a feedforward pass for the current state $s$ to get predicted Q-values for all actions.
2. Do a feedforward pass for the next state $s'$ and calculate maximum overall network outputs $max_{a'}Q(s',a')$ .
3. Set Q-value target for action to $r + \gamma max_{a'} Q(s’, a’)$ (use the max calculated in step 2). For all other actions, set the Q-value target to the same as originally returned from step 1, making the error 0 for those outputs.
4. Update the weights using backpropagation.
————————————————————————-

Experience Replay
从以上算法描述看可以发现，DQN基本就是Q-Learning的神经网络版本，既没有吸收Sarsa也没有吸收Sarsa-Lambda的思想，仍然是一个off-policy的算法。off-policy的好处就是可以进行Experience Replay.可以把所有的experiences <s,a,r,s′> <script type="math/tex" id="MathJax-Element-37">< s, a, r, s’ ></script>存储在一个replay memory中，然后可以选择random minibatch来训练DQN网络。Experience Replay是一个训练DQN的很重要的trick。

Fixed Q target
为了让DQN训练更加稳定，还有一个很重要的trick就是fixed Q target不得不提。具体来说，就是在训练DQN的时候，为了避免Network每一步都频繁更新导致网络学习不稳定。特意复制了一个相同的结构的Q Network放在一旁作为target，也即上面算法流程中的tt。这个Q Network就是fixed的，不是每一步都更新，而是每隔几步再把非fixed的Q network的参数复制到这个Q Network上。
这里写图片描述
可以参考：https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/4-2-DQN2/

使用Experience Replay， $\epsilon$ -greedy exploration和Fixed Q target策略后的DQN算法伪代码：

这里写图片描述
通过上面的伪代码可以发现，Experience Replay中的experience是来自于 $Q(\phi(s_t),a;\theta)$ 的，而非来自于fixed的 $Q(\phi(s_t),a;\theta^-)$ 。

其实DeepMind还使用了许多tricks来使DQN最终work，包括error clipping, reward clipping等，这里就不细说了。Deep Q-Learning算法已经被Google申请了专利。

参考：Intel官博 Guest Post (Part I): Demystifying Deep Reinforcement Learning

===========================================

四、Policy Gradient

DQN通过用NN来建模高维结构化数据，扮演Q table的角色，使得处理视频游戏这种高维state的RL问题成为了可能。但是，如果我们的处理的问题的action也很多，甚至是连续的action，而不是离散的，那么DQN也会捉襟见肘了。而这里，就要policy gradient（PG）登场了，PG可以很好地建模连续action的RL问题。

首先摘抄一段来自Andrej Karpathy博客中的几句话：

Similar to what happened in Computer Vision, the progress in RL is not driven as much as you might reasonably assume by new amazing ideas. In Computer Vision, the 2012 AlexNet was mostly a scaled up (deeper and wider) version of 1990’s ConvNets. Similarly, the ATARI Deep Q Learning paper from 2013 is an implementation of a standard algorithm (Q Learning with function approximation, which you can find in the standard RL book of Sutton 1998), where the function approximator happened to be a ConvNet. AlphaGo uses policy gradients with Monte Carlo Tree Search (MCTS) - these are also standard components.

前边讲到的DQN算法只是把function approximator换成了ConvNet，这一部分要讲的policy gradient算法，一种比Q-Learning更好的算法。

It turns out that Q-Learning is not a great algorithm (you could say that DQN is so 2013 (okay I’m 50% joking)). In fact most people prefer to use Policy Gradients, including the authors of the original DQN paper who have shown Policy Gradients to work better than Q Learning when tuned well. PG is preferred because it is end-to-end: there’s an explicit policy and a principled approach that directly optimizes the expected reward.

一句话说Policy Gradient：

Policy Gradients: Run a policy for a while. See what actions led to high rewards. Increase their probability.

下面这段解释没太理解（来自Andrej的博客），字面上看好像还是没有解释清为什么不区别对待frame 50和frame 150，明明这两次bounce都可以通过即时的reward来反映出是“正确”还是“错误”的action。。那么让frame 50的action概率增大，frame 150的概率减小应该很容易做到呀？

If you think through this process you’ll start to find a few funny properties. For example what if we made a good action in frame 50 (bouncing the ball back correctly), but then missed the ball in frame 150? If every single action is now labeled as bad (because we lost), wouldn’t that discourage the correct bounce on frame 50? You’re right - it would. However, when you consider the process over thousands/millions of games, then doing the first bounce correctly makes you slightly more likely to win down the road, so on average you’ll see more positive than negative updates for the correct bounce and your policy will end up doing the right thing.

为了写Policy Gradient这一部分，我又把Andrej Karpathy的博客又看了一遍，因为之前理解了又忘了。。

其实归根结底，Policy Gradient就是直接优化一个policy network（用数学表示为 $p(x|\theta)$ ，其中 $\theta$ 就是网络的参数， $x$ 就是policy network， $x$ 是对 $p(a|i)$ 的建模， $i$ 是输入图片或者视频帧， $a$ 是action），不像先前基于value的Q-Learning，Sarsa等间接地去approximate Q-function（或者Q-value）去找到最佳的policy，这么一想PG解决强化学习更加直接而干脆。

使用PG还有一个好处就是，PG支持策略的概率输出，而Q-Learning则是确定性的算法，概率的输出更加符合直觉，因为这个世界上很多时候解都不止一个，就比如说想要填饱肚子可以吃饭，也可以吃面。

Andrej使用了130多行写了一个PG，我摘取了比较重要的片段以助理解：
第3行，依据概率（体现出PG的概率输出特点）选择action，这里之所以用2,3可能是Gym这个游戏刚好action限于2,3之间吧
第9行，计算出encourage当前action的梯度（用于后面的梯度反传）
第33行，将第9行得到的梯度（都是encourage属性的梯度）用discounted_epr进行调制，这里正是PG的魔法所在。discounted_epr其实就是一个episode中所有的reward（一个list）进行时间衰减后的结果（一个新的list）
第34行，将调制得到的梯度送回网络反向传播，更新参数。

从代码也可以看初PG是一个回合更新算法，直到拿到了done信号才会进行梯度反传更新，done信号只有在一个episode结束后才会变为True。在Pong-v0游戏中，应该就是电脑或者玩家中的一者，用掉21条命，另一方率先得到21后，一个episode即结束，然后收集reward和梯度，用时间衰减后的reward去调制梯度，反向传播。

 # forward the policy network and sample an action from the returned probability
 aprob, h = policy_forward(x)
 action = 2 if np.random.uniform() < aprob else 3 # roll the dice!

 # record various intermediates (needed later for backprop)
 xs.append(x) # observation
 hs.append(h) # hidden state
 y = 1 if action == 2 else 0 # a "fake label"
 dlogps.append(y - aprob) # grad that encourages the action that was taken to be taken (see http://cs231n.github.io/neural-networks-2/#losses if confused)

 # step the environment and get new measurements
 observation, reward, done, info = env.step(action)
 reward_sum += reward

 drs.append(reward) # record reward (has to be done after we call step() to get reward for previous action)

 if done: # an episode finished
   episode_number += 1

   # stack together all inputs, hidden states, action gradients, and rewards for this episode
   epx = np.vstack(xs)
   eph = np.vstack(hs)
   epdlogp = np.vstack(dlogps)
   epr = np.vstack(drs)
   xs,hs,dlogps,drs = [],[],[],[] # reset array memory

   # compute the discounted reward backwards through time
   discounted_epr = discount_rewards(epr)
   # standardize the rewards to be unit normal (helps control the gradient estimator variance)
   discounted_epr -= np.mean(discounted_epr)
   discounted_epr /= np.std(discounted_epr)

   epdlogp *= discounted_epr # modulate the gradient with advantage (PG magic happens right here.)
   grad = policy_backward(eph, epdlogp)

================================================================

先写到这里吧，遗留的几个问题是：
一是，先前提到的Andrej博客中的那段解释frame 50和frame 150“胡子眉毛一把抓”的处理方式的理解。
二是， discount_rewards(epr)函数（见下）没太看明白 [已解决]

def discount_rewards(r):
  """ take 1D float array of rewards and compute discounted reward """
  discounted_r = np.zeros_like(r)
  running_add = 0
  for t in reversed(xrange(0, r.size)):
    if r[t] != 0: running_add = 0 # reset the sum, since this was a game boundary (pong specific!)
    running_add = running_add * gamma + r[t]
    discounted_r[t] = running_add
  return discounted_r

其实这个discount_reward函数，输入的是一维的reward，输出的就是时间衰减后的discount_reward。而通过运行程序会发现，在Pong-v0游戏中，reward的计算方法是：

每一帧都会计算一个reward
没有死亡发生时，reward为0
对方死一次，reward为+1
自己死一次，reward为-1
谁死够21次，游戏结束，也就是一个episode结束

由于是每完成一个episode才用收集到的encorage属性的gradient和reward进行一次梯度反传更新，那么一个episode内的reward其实是一个一维向量，里面有几千个元素，视游戏中各episode的长短而各不相同。
reward写出来可能是这种形式: reward = [0,0,1,0,0,…,0,0,-1,0,0,…,0,-1]
也就是说大部分时候，reward都是0，只有当有死亡发生时，reward才会变成1或者-1，所以上面的discount_reward函数才会用一句：

if r[t] != 0: running_add = 0 # reset the sum, since this was a game boundary (pong specific!)

来表示一旦遇到死亡发生，reward的衰减就从这一次的死亡开始重新计算！[完]

三是，对于 discounted_epr为什么要做normalize（去均值并除以方差）的进一步理

上面从Karpathy的角度，intuitively阐述了Poliy Gradient的思想，如果从严格的数学形式来理解证明，可以看CS294的lecture 4 policy gradient introduction。我截取了三张PPT如下，涵盖了PG的推导，其中的符号含义应该也可以猜到，其中的 $\tau$ 表示的是一条马尔科夫链的trajectory。

这里写图片描述