A combined algorithm: double Q-learning + DQN.
The paper makes five main contributions:
1. DQN overestimates action values.
2. This overestimation is harmful.
3. Double Q-learning reduces the overestimation, by decoupling the network that evaluates an action's value from the network that selects the action.
4. It proposes the Double DQN architecture (three convolutional layers plus fully connected layers) and its parameter-update formula.
5. It shows that Double DQN is effective.
Compared with DQN, the change boils down to a single point:
In the target, the two Q terms use different parameters θ: one comes from the target network, the other from the current (online) network. Since the target network is an older copy that is only updated periodically, the two roles are decoupled in time.
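For reference, the two targets from the paper, where θ_t is the online network and θ_t^- is the periodically copied target network:

DQN:        Y_t = R_{t+1} + γ · max_a Q(S_{t+1}, a; θ_t^-)
Double DQN: Y_t = R_{t+1} + γ · Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ_t); θ_t^-)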
Useful reference:
强化学习(十)Double DQN (DDQN)
Understanding:
Double DQN works together with the target network from the Nature DQN: it avoids using one and the same network both to select and to evaluate the action when estimating the next state's value, which is exactly what produces the overestimation.
Double DQN
DQN has a well-known problem: the Q values it estimates tend to be too large. The target is built from the maximum Q value at the next state s', but that maximum is itself an estimate, which in turn depends on the estimate at the state after that, and so on; taking a max over noisy estimates at every step pushes the values upward.
We can confirm this empirically: start exploratory rollouts from the same s', measure the actual return of an action, and compare it with DQN's estimate; DQN's number comes out larger.
This systematic over-reporting is annoying, so people came up with the idea of letting two estimators keep each other honest.
The idea is intuitive. A single Q network tends to brag, so use two Q networks: because their parameters differ, they give slightly different evaluations of the same action, and we take the smaller of the two values to build the target. That stops a single network's bragging from feeding into the target. A minimal sketch of this is given right below.
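Here is a minimal sketch of this "take the smaller estimate" idea (q1_net, q2_net and the tensor names are hypothetical; the implementation at the end of this post uses the other variant described next):

import torch

def min_of_two_target(reward, next_state, q1_net, q2_net, gamma):
    # each network computes its own greedy value at s';
    # using the smaller one keeps a single network's overestimation out of the target
    with torch.no_grad():
        q1_next = q1_net(next_state).max(1, keepdim=True)[0]
        q2_next = q2_net(next_state).max(1, keepdim=True)[0]
        return reward + gamma * torch.min(q1_next, q2_next)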
The other approach also uses two Q networks: network Q1 proposes the action with the highest Q value, and network Q2 then evaluates that action with its own Q value.
Conveniently, if we are already using Fixed Q-targets, we already have two Q networks at hand.
So, as you can see, this optimization is very easy to bolt onto DQN, and it is the only difference between Double DQN and DQN.
————Double DQN原理是什么,怎样实现?(附代码)
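To make the one-line difference concrete, here is a sketch of the two target computations (the function names are mine; eval_net and target_net match the naming used in the implementation below):

import torch

def dqn_target(reward, next_state, target_net, gamma):
    # vanilla DQN: the target network both selects and evaluates the greedy action
    with torch.no_grad():
        return reward + gamma * target_net(next_state).max(1, keepdim=True)[0]

def double_dqn_target(reward, next_state, eval_net, target_net, gamma):
    # Double DQN: the online network picks the action, the target network scores it
    with torch.no_grad():
        next_action = eval_net(next_state).argmax(1, keepdim=True)
        return reward + gamma * target_net(next_state).gather(1, next_action)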
Code implementation:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import gym
import matplotlib.pyplot as plt
import copy
import os
import random
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"
# hyper-parameters
BATCH_SIZE = 128
LR = 0.01
GAMMA = 0.90
EPISILO = 0.9
MEMORY_CAPACITY = 2000
Q_NETWORK_ITERATION = 100
env = gym.make("CartPole-v0")
env = env.unwrapped
NUM_ACTIONS = env.action_space.n
NUM_STATES = env.observation_space.shape[0]
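# ENV_A_SHAPE is 0 when actions are scalars (a discrete space like CartPole); otherwise it records the shape of an action array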
ENV_A_SHAPE = 0 if isinstance(env.action_space.sample(), int) else env.action_space.sample().shape
class Net(nn.Module):
    """A small MLP that maps a state vector to one Q value per action."""
def __init__(self):
super(Net, self).__init__()
self.fc1 = nn.Linear(NUM_STATES, 50)
self.fc1.weight.data.normal_(0,0.1)
self.fc2 = nn.Linear(50,30)
self.fc2.weight.data.normal_(0,0.1)
self.out = nn.Linear(30,NUM_ACTIONS)
self.out.weight.data.normal_(0,0.1)
def forward(self,x):
x = self.fc1(x)
x = F.relu(x)
x = self.fc2(x)
x = F.relu(x)
        action_value = self.out(x)  # Q values for each action (not probabilities)
        return action_value
class DQN():
    """Double DQN agent: an online (eval) network plus a periodically copied target network."""
def __init__(self):
super(DQN, self).__init__()
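        # two identical networks: eval_net is trained every step, target_net is a delayed copy
        # (.cuda() assumes a CUDA GPU is available)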
self.eval_net, self.target_net = Net().cuda(), Net().cuda()
self.learn_step_counter = 0
self.memory_counter = 0
self.memory = np.zeros((MEMORY_CAPACITY, NUM_STATES * 2 + 2))
        # why NUM_STATES * 2 + 2: each transition stores (state, action, reward, next_state);
        # action and reward are scalars, while state and next_state are ndarrays of length NUM_STATES
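        # only eval_net is updated by the optimizer; target_net is refreshed by copying weights in learn()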
self.optimizer = torch.optim.Adam(self.eval_net.parameters(), lr=LR)
self.loss_func = nn.MSELoss()
def choose_action(self, state):
        state = torch.unsqueeze(torch.FloatTensor(state), 0).cuda() # add a batch dimension: shape (1, NUM_STATES)
        if np.random.rand() <= EPISILO: # greedy policy with probability EPISILO (np.random.rand, not randn)
action_value = self.eval_net.forward(state)
action = torch.max(action_value, 1)[1].cpu().data.numpy()
action = action[0] if ENV_A_SHAPE == 0 else action.reshape(ENV_A_SHAPE)
else: # random policy
action = np.random.randint(0,NUM_ACTIONS)
action = action if ENV_A_SHAPE ==0 else action.reshape(ENV_A_SHAPE)
return action
def store_transition(self, state, action, reward, next_state):
transition = np.hstack((state, [action, reward], next_state))
index = self.memory_counter % MEMORY_CAPACITY
self.memory[index, :] = transition
self.memory_counter += 1
def learn(self):
        # every Q_NETWORK_ITERATION learning steps, copy eval_net's parameters into target_net
        if self.learn_step_counter % Q_NETWORK_ITERATION == 0:
            self.target_net.load_state_dict(self.eval_net.state_dict())
self.learn_step_counter+=1
#sample batch from memory
sample_index = np.random.choice(MEMORY_CAPACITY, BATCH_SIZE)
batch_memory = self.memory[sample_index, :]
batch_state = torch.FloatTensor(batch_memory[:, :NUM_STATES]).cuda()
batch_action = torch.LongTensor(batch_memory[:, NUM_STATES:NUM_STATES+1].astype(int)).cuda()
batch_reward = torch.FloatTensor(batch_memory[:, NUM_STATES+1:NUM_STATES+2]).cuda()
batch_next_state = torch.FloatTensor(batch_memory[:,-NUM_STATES:]).cuda()
        # Double DQN target: eval_net selects the greedy action at s', target_net evaluates it
        with torch.no_grad():
            actions_value = self.eval_net.forward(batch_next_state)
            next_action = torch.unsqueeze(torch.max(actions_value, 1)[1], 1)
            next_q = self.target_net.forward(batch_next_state).gather(1, next_action)
            target_q = batch_reward + GAMMA * next_q
        # current estimate of Q(s, a) from the online network
        eval_q = self.eval_net.forward(batch_state).gather(1, batch_action)
loss = self.loss_func(eval_q, target_q)
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
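# reward shaping: replace the environment's default reward with one based on cart position and pole angle,
# which gives a more informative learning signal on CartPole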
def reward_func(env, x, x_dot, theta, theta_dot):
r1 = (env.x_threshold - abs(x))/env.x_threshold - 0.5
r2 = (env.theta_threshold_radians - abs(theta)) / env.theta_threshold_radians - 0.5
reward = r1 + r2
return reward
def main():
dqn = DQN()
episodes = 250
print("Collecting Experience....")
for i in range(episodes):
state = env.reset()
ep_reward = 0
while True:
env.render()
action = dqn.choose_action(state)
next_state, _, done, info = env.step(action)
x, x_dot, theta, theta_dot = next_state
reward = reward_func(env, x, x_dot, theta, theta_dot)
dqn.store_transition(state, action, reward, next_state)
ep_reward += reward
if dqn.memory_counter >= MEMORY_CAPACITY:
dqn.learn()
if done:
                    print("episode: {}, episode reward: {}".format(i, round(ep_reward, 3)))
if done:
break
state = next_state
if __name__ == '__main__':
main()
One practical note: the number of training episodes must not be too small; if episodes is set too low, the pole will never manage to stay upright no matter what.