A Casual Overview of Reinforcement Learning


[update 20200712]

OpenAI's site is a very good reference: spinningup


Plan

  1. Finish Hung-yi Lee's (李宏毅) RL video lectures.
  2. Implement the algorithms one by one, based on the OpenAI Spinning Up tips.
  3. In the meantime, master PyTorch/TF and deep learning basics.
  4. When time permits, keep an eye on the research frontier.

 


Reinforcement Learning Overview

This overview is largely based on this article: https://medium.com/@SmartLabAI/reinforcement-learning-algorithms-an-intuitive-overview-904e2dff5bbc.

On-Policy vs Off-Policy

[update 0710] After watching Hung-yi Lee's DRL lectures, I realized that the relationship between the replay buffer in TD-based Q-learning and on-/off-policy is mainly a matter of data distribution. The tuples in the buffer are individual experiences, not trajectories, so they are independent of which policy is currently being trained. However, the distribution of tuples in the whole buffer differs from the distribution of data that would be collected by the current policy, and sampling from the replay buffer is usually uniform; so if the replay buffer has to be put in one category, it corresponds to off-policy. For MC, which works with trajectories, using trajectories generated by pi to train pi' is even more clearly the off-policy setting.
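For concreteness, the replay buffer discussed here is just a container of experience tuples sampled uniformly; a minimal sketch (my own illustration, not from the lectures):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores individual experience tuples (not trajectories) and samples them uniformly."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # experiences collected by older policies stay here

    def push(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)  # uniform sampling over the whole buffer
```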

[source:https://www.quora.com/Why-is-Q-Learning-deemed-to-be-off-policy-learning]

The main criterion: when updating Q, is the policy that Q evaluates the same as the policy currently interacting with the environment? In SARSA it is; in Q-learning, the update of Q essentially evaluates \pi^* rather than the current \pi. The question is whether the a' in Q(s',a') is produced by the current actor given s', or is some approximation like the max function in Q-learning. When a replay buffer is used, or when a' is generated by a target actor, the method is called off-policy; otherwise it is on-policy. This is my current understanding after browsing a lot of material. [update 0413] See the figure below; I now have a new understanding: remember that the Q function is TD-based, so consecutive actions have a temporal order. In other words, when training Q, we need to know which Q(?) differs from the current Q by r. If that ? matches the action the current policy would output, then we are training Q to follow the current policy, so it is on-policy. Otherwise, as in Q-learning, the current policy gives an epsilon-greedy choice but the Q update assumes the next step is totally greedy, so Q does not match the current policy, hence off-policy.

When a replay buffer is involved, the a' given by the replay buffer is an action from some historical policy, which differs from the action the current actor would return, so Q is not being trained to be the evaluation function of the current policy, hence off-policy.

Summary: whether a method is on- or off-policy depends on whether, when training Q, the a' in Q(s',a') matches the action suggested by the current actor; in other words, whether we are training Q to be the evaluation function of the current policy.
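To make the criterion concrete, here is a minimal tabular sketch (my own illustration) contrasting the two updates; the only difference is where the next action in the target comes from:

```python
import numpy as np

# Q is an array of shape [n_states, n_actions]; (s, a, r, s2) is one transition.

def epsilon_greedy(Q, s, eps=0.1):
    # Behavior policy used to interact with the environment.
    if np.random.rand() < eps:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.99):
    # On-policy: a2 is the action actually chosen by the current (epsilon-greedy) policy at s2.
    Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])

def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.99):
    # Off-policy: the target assumes a greedy next action, regardless of what the behavior policy does.
    Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
```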

 

The next question: how should one choose between the two? The answer below is a good reference, especially for the interpretation of 'take action'. In short, Q-learning (off-policy) directly learns the optimal policy but may be unstable and hard to converge; SARSA is more conservative, so it is worth considering when mistakes during training are costly.

https://stats.stackexchange.com/questions/326788/when-to-choose-sarsa-vs-q-learning

One last question: today I realized that TD intrinsically carries a temporal order: Q(s,a) for a->a'->a'' and Q(s,a) for a->a''->a' may simply not have the same value. The idea behind TD needs further thought.

On-policy vs. Off-policy

An on-policy agent learns the value based on its current action a derived from the current policy, whereas its off-policy counterpart learns it based on the action a* obtained from another policy.

The reason that Q-learning is off-policy is that it updates its Q-values using the Q-value of the next state s′ and the greedy action a′. In other words, it estimates the return (total discounted future reward) for state-action pairs assuming a greedy policy were followed despite the fact that it's not following a greedy policy.

The reason that SARSA is on-policy is that it updates its Q-values using the Q-value of the next state s′ and the current policy's action a′′. It estimates the return for state-action pairs assuming the current policy continues to be followed.

The distinction disappears if the current policy is a greedy policy. However, such an agent would not be good since it never explores.

Policy Optimization (Policy Iteration)

f(state) -> action.

What action to take now?

  1. Policy Gradient. Think of playing a game 10,000 times to estimate the expected reward under f(s); then you can do gradient ascent on it.
  2. Trust Region Policy Optimization (TRPO). An on-policy method that does PG with a big step, but not too big: the step stays within the trust region. This is ensured by a constraint on the difference between the current behavior (not the parameters) and the behavior of the target network. Both continuous and discrete action spaces are supported.
  3. Proximal Policy Optimization (PPO). Like TRPO, an on-policy method, but it treats that constraint as a penalty (regularization). Popularized by OpenAI. Does PG while adjusting the gradient updates smartly to avoid performance issues and instability. Easier to implement and solve than TRPO, with similar performance, so it is preferred over TRPO (see the sketch after this list).
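PPO exists in a KL-penalty variant (as described above) and a clipped variant; below is a minimal PyTorch-style sketch of the clipped surrogate loss (my own illustration; the tensor names logp_new, logp_old, adv are assumptions, not from the post).

```python
import torch

def ppo_clip_loss(logp_new, logp_old, adv, clip_eps=0.2):
    """Clipped surrogate objective (to be minimized).

    logp_new: log pi_theta(a|s) under the policy being optimized
    logp_old: log pi_theta_old(a|s) under the policy that collected the data
    adv:      advantage estimates for the same (s, a) pairs
    """
    ratio = torch.exp(logp_new - logp_old)                  # importance ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    # Take the pessimistic (minimum) of the unclipped and clipped objectives.
    return -torch.min(ratio * adv, clipped * adv).mean()
```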

Remarks

  1. Why we invented TRPO/PPO: each time the policy is updated, all previously collected samples become outdated, and it is too costly to regenerate all samples on every policy update. PPO allows reusing the old experiences, moving from on-policy toward off-policy.
  2. Rewards should be centered at 0. Since PG is based on sampling, if all rewards are positive, the probability of the actions that happen not to be sampled keeps shrinking.
  3. For a given (s,a), only the discounted reward obtained afterwards should be considered (a minimal sketch combining remarks 2 and 3 follows this list).
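Here is a minimal REINFORCE-style sketch of remarks 2 and 3 (my own illustration; it assumes the log-probabilities of the sampled actions and the per-step rewards of one episode are available): the reward-to-go implements remark 3, and mean-centering acts as a simple baseline for remark 2.

```python
import torch

def reinforce_loss(logp, rewards, gamma=0.99):
    """Vanilla policy-gradient loss for a single episode.

    logp:    tensor of log pi(a_t|s_t) for the sampled actions, shape [T]
    rewards: sequence of rewards r_t, length T
    """
    # Reward-to-go: for each t, sum the discounted rewards from t onwards (remark 3).
    returns, g = [], 0.0
    for r in reversed(list(rewards)):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)
    # Centering / baseline: subtract the mean so the "weights" are not all positive (remark 2).
    advantages = returns - returns.mean()
    return -(logp * advantages).mean()
```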

[Update 20200719]

Policy gradient is only one kind of policy search. PG optimizes each action according to the gradient, but there are other methods that search and optimize directly in policy space. One resource I have seen so far: the ICML 2015 tutorial. TO CHECK MORE ON PS
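As a contrast to PG, a gradient-free policy search can be as simple as hill climbing in parameter space; a toy sketch (my own illustration; evaluate(params) is a hypothetical function that runs a few episodes with the parameterized policy and returns the average return):

```python
import numpy as np

def random_search(evaluate, dim, iters=100, sigma=0.1, seed=0):
    """Toy gradient-free policy search: hill climbing over policy parameters."""
    rng = np.random.default_rng(seed)
    best_params = np.zeros(dim)
    best_score = evaluate(best_params)
    for _ in range(iters):
        candidate = best_params + sigma * rng.standard_normal(dim)  # perturb the parameters
        score = evaluate(candidate)
        if score > best_score:  # keep the perturbation only if it improves the return
            best_params, best_score = candidate, score
    return best_params, best_score
```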

 

Q-Learning (Value Iteration)

f(state, action)->expected action value

Action-value function: how good is it if a particular action is taken?

DQN tends to overestimate the Q value because of the greedy max_a Q(s,a) in its target. Variations/tips:

  1. Double DQN is DQN with a target network: separate the Q that selects the action from the Q that computes the value used in the Bellman update (see the sketch after this list).
  2. Dueling DQN is DQN with separate outputs for V(s) and A(s,a); then Q(s,a)=V(s)+A(s,a).
    1. Advantage: the update of V(s) influences Q(s,a) for all actions, even those that are not sampled. In practice,
      1. some normalization should be done: keep sum(A)=0. HOW TO IMPLEMENT? Simply subtract avg(A) from A.
      2. also add constraints on A, so that the network will not simply set V(s) to 0.
  3. Prioritized Replay: prefer samples with a large TD error.
  4. Multi-step: combine MC with TD; use not just one transition but multiple consecutive transitions.
  5. Noisy Exploration: noise on the action (epsilon-greedy) or noise on the parameters (noise on theta before each episode; state-dependent exploration).
  6. Distributional Q: Q(s,a) is no longer a scalar expected value but a distribution (several bins). C51 uses a distributional Bellman equation instead of considering only the EXPECTATION of future rewards. Distributional Reinforcement Learning with Quantile Regression (QR-DQN): instead of returning an expected value of an action, it returns a distribution; quantiles can then be used to identify the 'best' action.
  7. Rainbow = DQN + double + dueling + noisy + prioritized + distributional + multi-step.
  8. Hindsight Experience Replay: DQN with goals added to the input. Especially useful for sparse-reward settings, like some 1/0 games. Can also be combined with DDPG.
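Two of these tips are easy to show in code. The sketch below (my own illustration in PyTorch; q_online and q_target are assumed to map a state batch to [batch, n_actions] Q-values) computes a Double-DQN target and the dueling aggregation with the mean-subtraction from 2.1.1:

```python
import torch

def double_dqn_target(q_online, q_target, r, s2, done, gamma=0.99):
    # Tip 1: the online network selects the next action, the target network evaluates it.
    with torch.no_grad():
        a2 = q_online(s2).argmax(dim=1, keepdim=True)      # action selection
        q_next = q_target(s2).gather(1, a2).squeeze(1)     # action evaluation
        return r + gamma * (1.0 - done) * q_next

def dueling_aggregate(v, adv):
    # Tip 2: combine V(s) of shape [batch, 1] and A(s, a) of shape [batch, n_actions],
    # subtracting the mean advantage so the advantages are centered at zero.
    return v + adv - adv.mean(dim=1, keepdim=True)
```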

DQN for Continuous actions

  1. sample a set of candidate actions and pick the one with the highest Q value
  2. use gradient ascent to solve the argmax (DDPG); see the sketch after this list
  3. modify the network structure to make the argmax optimization easy (CHECK THE PAPER. are we solving an optimization problem using DL?)
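A rough sketch of options 1 and 2 (my own illustration; it assumes a critic q_net(s, a) that takes state and action batches, and a state tensor s of shape [1, obs_dim]):

```python
import torch

def argmax_by_sampling(q_net, s, n_samples=64, low=-1.0, high=1.0, act_dim=2):
    # Option 1: sample candidate actions uniformly and keep the one with the highest Q(s, a).
    a = torch.rand(n_samples, act_dim) * (high - low) + low
    s_rep = s.expand(n_samples, -1)
    q = q_net(s_rep, a).squeeze(-1)
    return a[q.argmax()]

def argmax_by_gradient_ascent(q_net, s, steps=50, lr=0.1, act_dim=2):
    # Option 2: treat the action as a variable and run gradient ascent on Q(s, a).
    a = torch.zeros(1, act_dim, requires_grad=True)
    opt = torch.optim.Adam([a], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-q_net(s, a)).sum().backward()  # maximize Q by minimizing -Q
        opt.step()
    return a.detach()
```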

Hybrid

For Actor-Critic and DDPG, check Actor-Critic, DDPG and GAN

  1. DDPG
  2. A3C. Asynchronous: several agents are trained in parallel. Actor-Critic: policy gradient and Q-learning are combined. Also check Soft Actor-Critic.
  3. TD3

Model-based vs Model-free

  • Model: a world model, i.e., structured information about the environment that can be used for planning. To some extent, the state transition probabilities are known.
  • Model-free methods see the environment as a black box that only provides states and rewards as numbers. No extra information can be exploited.

For more on model-based methods, check Model-based Reinforcement Learning

Other topics

Some other research directions:

Sparse Reward

In many real applications, the reward is very sparse: consider a robot task of inserting a bar into a hole; most of the time it fails and gets 0 reward. To deal with this:

  • Reward shaping: manually add extra rewards to guide the agent (domain knowledge is exploited).
  • Curriculum Learning: learn from simple to hard, step by step. First train on easy-to-learn data, e.g. tuples that carry rewards, then add harder data, e.g. experiences with sparse rewards.
    • Reverse Curriculum Generation: first sample states near the goal, then move further and further away.

Hierarchical RL

Divide the goal into sub-goals that may not be directly related to the final goal. The sub-goals can then be divided again at the next level, forming a hierarchy.

Imitation Learning

  • Behavior cloning: the same as supervised learning (a minimal sketch follows this list). Problems:
    1. The expert experience is limited. Try data aggregation (DAgger): expert -> pi_1 -> trajectories -> expert labels -> pi_2 -> ... Not a good solution.
    2. Some parts of the demonstration should be cloned and others should not, but the learner does not know which.
    3. Data mismatch. TO BE CLARIFIED. The distributions of the training data and the test data are not the same. Even if the learner has learnt 99% from the expert, the resulting reward could be very different due to the nature of RL/MDPs.
  • Inverse RL: more interesting than behavior cloning. Instead of learning by cloning expert actions, first infer the reward function from the expert's actions, then optimize over that reward function.
    1. How to learn the reward: again, GAN-style. Update the reward function so that the teacher's actions always score better; update the actor to obtain a better reward.
    2. Different ways to demonstrate: first-person or third-person. In the third-person case, add feature-extraction layers to the network to make third-person experiences look like first-person ones.
    3. Advantage: usually not many demonstrations are needed.
    4. CHECK THE LINK WITH STRUCTURED LEARNING
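Since behavior cloning is plain supervised learning, a minimal sketch is just a classification step over expert (state, action) pairs (my own illustration; policy_net, optimizer, and the expert batch are assumed, with discrete actions):

```python
import torch
import torch.nn.functional as F

def behavior_cloning_step(policy_net, optimizer, states, expert_actions):
    """One supervised step on a batch of expert (state, action) pairs.

    policy_net(states) is assumed to return action logits for discrete actions.
    """
    logits = policy_net(states)
    loss = F.cross_entropy(logits, expert_actions)  # imitate the expert label
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```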

Meta-learning/Transfer-learning

I would say it's the same as in the context of DL

Multi-agent

When I get to this topic, I should take the chance to learn some game theory as well!

