A combined algorithm: double Q-learning + DQN.
The paper makes five main contributions:
1. DQN overestimates action values.
2. This overestimation is harmful.
3. Double Q-learning reduces the overestimation, by decoupling the network that evaluates an action's value from the network that selects the action.
4. It proposes the Double DQN architecture (three convolutional layers plus fully connected layers) and its parameter-update formula.
5. It shows that Double DQN is effective.
Compared with DQN, the change boils down to a single point:
In the target, the two Q terms use different parameters θ: one comes from the target network, the other from the current (online) network. Since the target network is an older copy that is only updated periodically, the two roles are decoupled in time.
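For reference, the two targets from the paper, where θ_t is the online network and θ_t^- is the periodically copied target network:

DQN:        Y_t = R_{t+1} + γ · max_a Q(S_{t+1}, a; θ_t^-)
Double DQN: Y_t = R_{t+1} + γ · Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ_t^-)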
Useful reference:
强化学习(十)Double DQN (DDQN)
Understanding:
Double DQN works together with the target network from the Nature DQN: it avoids using one and the same network both to select and to evaluate the action when estimating the next state's value, which is exactly what produces the overestimation.
Double DQN
DQN has a well-known problem: the Q values it estimates tend to be too large. The target is built from the maximum Q value at the next state s', but that maximum is itself an estimate, which in turn depends on the estimate at the state after that, and so on; taking a max over noisy estimates at every step pushes the values upward.
We can confirm this empirically: start exploratory rollouts from the same s', measure the actual return of an action, and compare it with DQN's estimate; DQN's number comes out larger.
This systematic over-reporting is annoying, so people came up with the idea of letting two estimators keep each other honest.
The idea is intuitive. A single Q network tends to brag, so use two Q networks: because their parameters differ, they give slightly different evaluations of the same action, and we take the smaller of the two values to build the target. That stops a single network's bragging from feeding into the target. A minimal sketch of this is given right below.
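Here is a minimal sketch of this "take the smaller estimate" idea (q1_net, q2_net and the tensor names are hypothetical; the implementation at the end of this post uses the other variant described next):

import torch

def min_of_two_target(reward, next_state, q1_net, q2_net, gamma):
    # each network computes its own greedy value at s';
    # using the smaller one keeps a single network's overestimation out of the target
    with torch.no_grad():
        q1_next = q1_net(next_state).max(1, keepdim=True)[0]
        q2_next = q2_net(next_state).max(1, keepdim=True)[0]
        return reward + gamma * torch.min(q1_next, q2_next)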
The other approach also uses two Q networks: network Q1 proposes the action with the highest Q value, and network Q2 then evaluates that action with its own Q value.
Conveniently, if we are already using Fixed Q-targets, we already have two Q networks at hand.
So, as you can see, this optimization is very easy to bolt onto DQN, and it is the only difference between Double DQN and DQN.
————Double DQN原理是什么,怎样实现?(附代码)
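To make the one-line difference concrete, here is a sketch of the two target computations (the function names are mine; eval_net and target_net match the naming used in the implementation below):

import torch

def dqn_target(reward, next_state, target_net, gamma):
    # vanilla DQN: the target network both selects and evaluates the greedy action
    with torch.no_grad():
        return reward + gamma * target_net(next_state).max(1, keepdim=True)[0]

def double_dqn_target(reward, next_state, eval_net, target_net, gamma):
    # Double DQN: the online network picks the action, the target network scores it
    with torch.no_grad():
        next_action = eval_net(next_state).argmax(1, keepdim=True)
        return reward + gamma * target_net(next_state).gather(1, next_action)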
Code implementation:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import gym
import matplotlib.pyplot as plt
import copy
import os
import random
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"
# hyper-parameters
BATCH_SIZE = 128
LR = 0.01
GAMMA = 0.90
EPISILO = 0.9
MEMORY_CAPACITY = 2000
Q_NETWORK_ITERATION = 100
env = gym.make("CartPole-v0")
env = env.unwrapped
NUM_ACTIONS = env.action_space.n
NUM_STATES = env.observation_space.shape[0]
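# ENV_A_SHAPE is 0 when actions are scalars (a discrete space like CartPole); otherwise it records the shape of an action array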
ENV_A_SHAPE = 0 if isinstance(env.action_space.sample(), int) else env.action_space.sample().shape
class Net(nn.Module):
    """A small MLP that maps a state vector to one Q value per action."""
def __init__(self):
super(Net, self).__init__()
self.fc1 = nn.Linear(NUM_STATES, 50)
self.fc1.weight.data.normal_(0,0.1)
self.fc2 = nn.Linear(50,30)
self.fc2.weight.data.normal_(0,0.1)
self.out = nn.Linear(30,NUM_ACTIONS)
self.out.weight.data.normal_(0,0.1)
def forward(self,x):
x = self.fc1(x)
x = F.relu(x)
x = self.fc2(x)
x = F.relu(x)
        action_value = self.out(x)  # Q values for each action (not probabilities)
        return action_value
class DQN():
    """Double DQN agent: an online (eval) network plus a periodically copied target network."""
def __init__(self):
super(DQN, self).__init__()
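        # two identical networks: eval_net is trained every step, target_net is a delayed copy
        # (.cuda() assumes a CUDA GPU is available)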
self.eval_net, self.target_net = Net().cuda(), Net().cuda()
self.learn_step_counter = 0
self.memory_counter = 0
self.memory = np.zeros((MEMORY_CAPACITY, NUM_STATES * 2 + 2))
        # why NUM_STATES * 2 + 2: each transition stores (state, action, reward, next_state);
        # action and reward are scalars, while state and next_state are ndarrays of length NUM_STATES
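        # only eval_net is updated by the optimizer; target_net is refreshed by copying weights in learn()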
self.optimizer = torch.optim.Adam(self.eval_net.parameters(), lr=LR)
self.loss_func = nn.MSELoss()
def choose_action(self, state):
        state = torch.unsqueeze(torch.FloatTensor(state), 0).cuda() # add a batch dimension: shape (1, NUM_STATES)
        if np.random.rand() <= EPISILO: # greedy policy with probability EPISILO (np.random.rand, not randn)
action_value = self.eval_net.forward(state)
action = torch.max(action_value, 1)[1].cpu().data.numpy()
action = action[0] if ENV_A_SHAPE == 0 else action.reshape(ENV_A_SHAPE)
else: # random policy
action = np.random.randint(0,NUM_ACTIONS)
action = action if ENV_A_SHAPE ==0 else action.reshape(ENV_A_SHAPE)
return action
def store_transition(self, state, action, reward, next_state):
transition = np.hstack((state, [action, reward], next_state))
index = self.memory_counter % MEMORY_CAPACITY
self.memory[index, :] = transition
self.memory_counter += 1
def learn(self):
        # every Q_NETWORK_ITERATION learning steps, copy eval_net's parameters into target_net
        if self.learn_step_counter % Q_NETWORK_ITERATION == 0:
            self.target_net.load_state_dict(self.eval_net.state_dict())
self.learn_step_counter+=1
#sample batch from memory
sample_index = np.random.choice(MEMORY_CAPACITY, BATCH_SIZE)
batch_memory = self.memory[sample_index, :]
batch_state = torch.FloatTensor(batch_memory[:, :NUM_STATES]).cuda()
batch_action = torch.LongTensor(batch_memory[:, NUM_STATES:NUM_STATES+1].astype(int)).cuda()
batch_reward = torch.FloatTensor(batch_memory[:, NUM_STATES+1:NUM_STATES+2]).cuda()
batch_next_state = torch.FloatTensor(batch_memory[:,-NUM_STATES:]).cuda()
        # Double DQN target: eval_net selects the greedy action at s', target_net evaluates it
        with torch.no_grad():
            actions_value = self.eval_net.forward(batch_next_state)
            next_action = torch.unsqueeze(torch.max(actions_value, 1)[1], 1)
            next_q = self.target_net.forward(batch_next_state).gather(1, next_action)
            target_q = batch_reward + GAMMA * next_q
        # current estimate of Q(s, a) from the online network
        eval_q = self.eval_net.forward(batch_state).gather(1, batch_action)
loss = self.loss_func(eval_q, target_q)
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
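# reward shaping: replace the environment's default reward with one based on cart position and pole angle,
# which gives a more informative learning signal on CartPole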
def reward_func(env, x, x_dot, theta, theta_dot):
r1 = (env.x_threshold - abs(x))/env.x_threshold - 0.5
r2 = (env.theta_threshold_radians - abs(theta)) / env.theta_threshold_radians - 0.5
reward = r1 + r2
return reward
def main():
dqn = DQN()
episodes = 250
print("Collecting Experience....")
for i in range(episodes):
state = env.reset()
ep_reward = 0
while True:
env.render()
action = dqn.choose_action(state)
next_state, _, done, info = env.step(action)
x, x_dot, theta, theta_dot = next_state
reward = reward_func(env, x, x_dot, theta, theta_dot)
dqn.store_transition(state, action, reward, next_state)
ep_reward += reward
if dqn.memory_counter >= MEMORY_CAPACITY:
dqn.learn()
if done:
                    print("episode: {}, episode reward: {}".format(i, round(ep_reward, 3)))
if done:
break
state = next_state
if __name__ == '__main__':
main()
One practical note: the number of training episodes must not be too small; if episodes is set too low, the pole will never manage to stay upright no matter what.