Applying REINFORCE to easy21 - reinforcement-learning

I'm trying to apply the REINFORCE algorithm (with a softmax policy and an undiscounted return Gt with a baseline) to David Silver's Easy21, and I am having problems with the implementation. Compared with a pure Monte Carlo approach, the result does not converge to Q*. Here is the relevant code:
hit = True
stick = False
actions = [hit, stick]
alpha = 0.1
theta = np.random.randn(420).reshape((420,1))

def psi(state, action):
    if state.player < 1 or state.player > 21:
        return np.zeros((420, 1))
    dealers = [int(state.dealer == x + 1) for x in range(0, 10)]
    players = [int(state.player == x + 1) for x in range(0, 21)]
    actions = [int(action == hit), int(action == stick)]
    psi = [1 if (i == 1 and j == 1 and k == 1) else 0
           for i in dealers for j in players for k in actions]
    return np.array(psi).reshape((420, 1))

def Q(state, action, weight):
    return np.matmul(psi(state, action).T, weight)

def softmax(state, weight):
    allQ = [Q(state, a, weight) for a in actions]
    probs = np.exp(allQ) / np.sum(np.exp(allQ))
    return probs.reshape((2,))

def score_function(state, action, weight):
    probs = softmax(state, weight)
    expected_score = (probs[0] * psi(state, hit)) + (probs[1] * psi(state, stick))
    return psi(state, action) - expected_score

def softmax_policy(state, weight):
    probs = softmax(state, weight)
    if np.random.random() < probs[1]:
        return stick
    else:
        return hit

if __name__ == "__main__":
    Q_star = np.load('Q_star.npy')
    for k in range(1, ITERATIONS):
        terminal = False
        state = game.initialise_state()
        action = softmax_policy(state, theta)
        history = [state, action]
        while not terminal:
            state, reward = game.step(state, action)
            action = softmax_policy(state, theta)
            terminal = state.terminal
            if terminal:
                state_action_pairs = zip(history[0::3], history[1::3])
                history.append(reward)
                history.append(state)
                Gt = sum(history[2::3])
                for s, a in state_action_pairs:
                    advantage = Gt - Q(s, a, prev_theta)
                    theta += alpha * score_function(s, a, theta) * advantage
            else:
                history.append(reward)
                history.append(state)
                history.append(action)
        if k % 10000 == 0:
            print("MSE: " + str(round(np.sum((Q_star - generate_Q(theta)) ** 2),2)))
Output:
python reinforce.py
MSE: 288.18
MSE: 248.45
MSE: 227.08
MSE: 215.46
MSE: 207.3
MSE: 202.61
MSE: 197.82
MSE: 195.96
MSE: 194.01
The table below shows the value function created using this algorithm:
Update:
Fixed the code by using a different theta initialisation:
theta = np.zeros((420,1))
Current value function:
But the current value function still does not match Q*
(missing peak at player sum = 11)
The entire code is available at:
https://github.com/Soundpulse/easy21-rl/blob/main/reinforce.py
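As a small sanity check on the score function used above, here is a minimal sketch restricted to a single state, where the one-hot features reduce the score to "indicator of the chosen action minus the action probabilities". The preference values are made up for illustration, and the helper names `softmax` and `score` are hypothetical stand-ins, not the functions from the linked repository. The check shows that the policy-weighted average of the score is zero, which is the property that lets a state-dependent baseline be subtracted without biasing the gradient. It says nothing by itself about an action-dependent baseline such as Q(s, a).

```python
import math

def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def score(prefs, action):
    """Gradient of log pi(action) w.r.t. the preferences of one state.

    With one-hot (tabular) features this is e_action - pi.
    """
    probs = softmax(prefs)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

prefs = [0.5, -0.2]  # hypothetical preferences for [hit, stick]
probs = softmax(prefs)

# Policy-weighted average of the score, one entry per preference.
expected = [
    sum(probs[a] * score(prefs, a)[i] for a in range(len(prefs)))
    for i in range(len(prefs))
]
print(expected)  # both entries are zero up to floating-point rounding
```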

Related

Why the Loss function does not decrease significantly in Flux.jl

After trying several activation functions and epoch counts, I still cannot fit the model to the y data, which is a function of the input data.
using Flux, Plots, Statistics
x = Array{Float64}(rand(5, 100));
w = [diff(x[1,:]); 0]./x[1,:];
y1 = cumsum(cos.(cumsum(w)));
scatter(y1)
y = reshape(y1, (1, 100));
data = [(x, y)];
model = Chain(Dense(5 => 100), Dense(100 => 1), identity)
model[1].weight;
loss(m, x, y) = Flux.mse(m(x), y)
Flux.mse(model(x), y)
Flux.mse(model(x), y) == mean((model(x) .- y).^2)
opt_stat = Flux.setup(ADAM(), model)
loss_history = []
epochs = 10000
for epoch in 1:epochs
    Flux.train!(loss, model, data, opt_stat)
    # print report
    train_loss = Flux.mse(model(x), y)
    push!(loss_history, train_loss)
    println("Epoch = $epoch : Training Loss = $train_loss")
end
ŷ = model(x)
Flux.mse(model(x), y)
Y = reshape(ŷ, (100, 1));
scatter(Y)

How can I use DQN and DDPG to successfully train an agent in a custom environment?

I'm new to AI and want to get into the field. I have spent some time writing a program that trains an agent in a simple custom environment, but after training in Colab for 10000 episodes it still does not perform well. I suspect something is wrong either with the custom environment or with the training process.
Env: a helicopter tries to get through a continuous flow of birds (at most 10) that move from right to left, and fuel appears at random. While the helicopter is alive, i.e., it has not collided with a bird and still has fuel, its reward increases by 1. Fuel starts at 1000, and colliding with a fuel icon (at most 2) resets fuel_left to 1000.
The environment is shown in the figure:
After 10000 episodes of DDPG/DQN, the agent still cannot survive for more than 15 seconds. Could you point out where the problem is?
Action space (1 dim): 0, 1, 2, 3, 4 -> the helicopter moves up, down, left, or right, or stays still.
State space (28 dim): (x, y) for 10 birds, 2 fuel icons, and 1 helicopter, plus the fuel left and the rewards obtained.
Rewards: while the helicopter is alive, the reward increases by 1.
The environment code is as follows (custom.py):
import numpy as np
import cv2
import matplotlib.pyplot as plt
import random
import math
import time
from gym import Env, spaces
import time
font = cv2.FONT_HERSHEY_COMPLEX_SMALL
class ChopperScape(Env):
    def __init__(self):
        super(ChopperScape,self).__init__()
        self.maxbirdnum = 10
        self.maxfuelnum = 2
        self.observation_shape = (28,)
        self.canvas_shape = (600,800,3)
        self.action_space = spaces.Discrete(5,)
        self.last_action = 0
        self.obs = np.zeros(self.observation_shape)
        self.canvas = np.ones(self.canvas_shape) * 1
        self.elements = []
        self.maxfuel = 1000
        self.y_min = int (self.canvas_shape[0] * 0.1)
        self.x_min = 0
        self.y_max = int (self.canvas_shape[0] * 0.9)
        self.x_max = self.canvas_shape[1]

    def draw_elements_on_canvas(self):
        self.canvas = np.ones(self.canvas_shape) * 1
        for elem in self.elements:
            elem_shape = elem.icon.shape
            x,y = elem.x, elem.y
            self.canvas[y : y + elem_shape[1], x:x + elem_shape[0]] = elem.icon
        text = 'Fuel Left: {} | Rewards: {}'.format(self.fuel_left, self.ep_return)
        self.canvas = cv2.putText(self.canvas, text, (10,20), font, 0.8, (0,0,0), 1, cv2.LINE_AA)

    def reset(self):
        self.fuel_left = self.maxfuel
        self.ep_return = 0
        self.obs = np.zeros(self.observation_shape)
        self.obs[26] = self.maxfuel
        self.bird_count = 0
        self.fuel_count = 0
        x = random.randrange(int(self.canvas_shape[0] * 0.05), int(self.canvas_shape[0] * 0.90))
        y = random.randrange(int(self.canvas_shape[1] * 0.05), int(self.canvas_shape[1] * 0.90))
        self.chopper = Chopper("chopper", self.x_max, self.x_min, self.y_max, self.y_min)
        self.chopper.set_position(x,y)
        self.obs[24] = x
        self.obs[25] = y
        self.elements = [self.chopper]
        self.canvas = np.ones(self.canvas_shape) * 1
        self.draw_elements_on_canvas()
        return self.obs

    def get_action_meanings(self):
        return {0: "Right", 1: "Left", 2: "Down", 3: "Up", 4: "Do Nothing"}

    def has_collided(self, elem1, elem2):
        x_col = False
        y_col = False
        elem1_x, elem1_y = elem1.get_position()
        elem2_x, elem2_y = elem2.get_position()
        if 2 * abs(elem1_x - elem2_x) <= (elem1.icon_w + elem2.icon_w):
            x_col = True
        if 2 * abs(elem1_y - elem2_y) <= (elem1.icon_h + elem2.icon_h):
            y_col = True
        if x_col and y_col:
            return True
        return False

    def step(self, action):
        done = False
        reward = 1
        assert self.action_space.contains(action), "invalid action"
        if action == 4:
            self.chopper.move(0,5)
        elif action == 1:
            self.chopper.move(0,-5)
        elif action == 2:
            self.chopper.move(5,0)
        elif action == 0:
            self.chopper.move(-5,0)
        elif action == 3:
            self.chopper.move(0,0)
        if random.random() < 0.1 and self.bird_count<self.maxbirdnum:
            spawned_bird = Bird("bird_{}".format(self.bird_count), self.x_max, self.x_min, self.y_max, self.y_min)
            self.bird_count += 1
            bird_y = random.randrange(self.y_min, self.y_max)
            spawned_bird.set_position(self.x_max, bird_y)
            self.elements.append(spawned_bird)
        if random.random() < 0.05 and self.fuel_count<self.maxfuelnum:
            spawned_fuel = Fuel("fuel_{}".format(self.bird_count), self.x_max, self.x_min, self.y_max, self.y_min)
            self.fuel_count += 1
            fuel_x = random.randrange(self.x_min, self.x_max)
            fuel_y = self.y_max
            spawned_fuel.set_position(fuel_x, fuel_y)
            self.elements.append(spawned_fuel)
        for elem in self.elements:
            if isinstance(elem, Bird):
                if elem.get_position()[0] <= self.x_min:
                    self.elements.remove(elem)
                    self.bird_count -= 1
                else:
                    elem.move(-5,0)
                if self.has_collided(self.chopper, elem):
                    done = True
                    reward = -100000.0*(1.0/self.ep_return+1)
            if isinstance(elem, Fuel):
                flag1 = False
                flag2 = False
                if self.has_collided(self.chopper, elem):
                    self.fuel_left = self.maxfuel
                    flag1 = True
                    reward += 2
                    # time.sleep(0.5)
                if elem.get_position()[1] <= self.y_min:
                    flag2 = True
                    self.fuel_count -= 1
                else:
                    elem.move(0, -5)
                if flag1 == True or flag2 == True:
                    self.elements.remove(elem)
        self.fuel_left -= 1
        if self.fuel_left == 0:
            done = True
        self.draw_elements_on_canvas()
        self.ep_return += 1
        birdnum = 0
        fuelnum = 0
        x_, y_ = self.chopper.get_position()
        dis = 0.0
        for elem in self.elements:
            x,y = elem.get_position()
            if isinstance(elem,Bird):
                self.obs[2*birdnum] = x
                self.obs[2*birdnum+1] = y
                birdnum += 1
                dis += math.hypot(x_-x,y_-y)
            if isinstance(elem,Fuel):
                base = self.maxbirdnum*2
                self.obs[base+2*fuelnum] = x
                self.obs[base+2*fuelnum+1] = y
                fuelnum += 1
        self.obs[24] = x_
        self.obs[25] = y_
        self.obs[26] = self.fuel_left
        self.obs[27] = self.ep_return
        if x_ == self.x_min or x_ == self.x_max or y_ == self.y_max or y_ == self.y_min:
            reward -= random.random()
        for i in range(26):
            if i%2 == 0:
                self.obs[i]/=800.0
            else:
                self.obs[i]/=600.0
        self.obs[26]/=1000.0
        self.obs[27]/=100.0
        # print('reward:',reward)
        # if done == True:
        #     time.sleep(1)
        return self.obs, reward, done, {}

    def render(self, mode = "human"):
        assert mode in ["human", "rgb_array"], "Invalid mode, must be either \"human\" or \"rgb_array\""
        if mode == "human":
            cv2.imshow("Game", self.canvas)
            cv2.waitKey(10)
        elif mode == "rgb_array":
            return self.canvas

    def close(self):
        cv2.destroyAllWindows()
class Point(object):
    def __init__(self, name, x_max, x_min, y_max, y_min):
        self.x = 0
        self.y = 0
        self.x_min = x_min
        self.x_max = x_max
        self.y_min = y_min
        self.y_max = y_max
        self.name = name

    def set_position(self, x, y):
        self.x = self.clamp(x, self.x_min, self.x_max - self.icon_w)
        self.y = self.clamp(y, self.y_min, self.y_max - self.icon_h)

    def get_position(self):
        return (self.x, self.y)

    def move(self, del_x, del_y):
        self.x += del_x
        self.y += del_y
        self.x = self.clamp(self.x, self.x_min, self.x_max - self.icon_w)
        self.y = self.clamp(self.y, self.y_min, self.y_max - self.icon_h)

    def clamp(self, n, minn, maxn):
        return max(min(maxn, n), minn)

class Chopper(Point):
    def __init__(self, name, x_max, x_min, y_max, y_min):
        super(Chopper, self).__init__(name, x_max, x_min, y_max, y_min)
        self.icon = cv2.imread("chopper1.jpg") / 255.0
        self.icon_w = 64
        self.icon_h = 64
        self.icon = cv2.resize(self.icon, (self.icon_h, self.icon_w))

class Bird(Point):
    def __init__(self, name, x_max, x_min, y_max, y_min):
        super(Bird, self).__init__(name, x_max, x_min, y_max, y_min)
        self.icon = cv2.imread("bird1.jpg") / 255.0
        self.icon_w = 32
        self.icon_h = 32
        self.icon = cv2.resize(self.icon, (self.icon_h, self.icon_w))

class Fuel(Point):
    def __init__(self, name, x_max, x_min, y_max, y_min):
        super(Fuel, self).__init__(name, x_max, x_min, y_max, y_min)
        self.icon = cv2.imread("fuel1.jpg") / 255.0
        self.icon_w = 32
        self.icon_h = 32
        self.icon = cv2.resize(self.icon, (self.icon_h, self.icon_w))

if __name__ == '__main__':
    from IPython import display
    env = ChopperScape()
    obs = env.reset()
    while True:
        # random agent
        action = random.randrange(-1,1)
        obs, reward, done, info = env.step(action)
        # Render the game
        env.render()
        if done == True:
            break
    env.close()
The DDPG algorithm used to train the agent is as follows (ddpg.py):
from custom import ChopperScape
import random
import collections
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
# Hyperparameters
lr_mu = 0.005
lr_q = 0.01
gamma = 0.99
batch_size = 32
buffer_limit = 50000
tau = 0.005 # for target network soft update
class ReplayBuffer():
    def __init__(self):
        self.buffer = collections.deque(maxlen=buffer_limit)

    def put(self, transition):
        self.buffer.append(transition)

    def sample(self, n):
        mini_batch = random.sample(self.buffer, n)
        s_lst, a_lst, r_lst, s_prime_lst, done_mask_lst = [], [], [], [], []
        for transition in mini_batch:
            s, a, r, s_prime, done = transition
            s_lst.append(s)
            a_lst.append([a])
            r_lst.append(r)
            s_prime_lst.append(s_prime)
            done_mask = 0.0 if done else 1.0
            done_mask_lst.append(done_mask)
        return torch.tensor(s_lst, dtype=torch.float), torch.tensor(a_lst, dtype=torch.float), \
               torch.tensor(r_lst, dtype=torch.float), torch.tensor(s_prime_lst, dtype=torch.float), \
               torch.tensor(done_mask_lst, dtype=torch.float)

    def size(self):
        return len(self.buffer)

class MuNet(nn.Module):
    def __init__(self):
        super(MuNet, self).__init__()
        self.fc1 = nn.Linear(28, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc_mu = nn.Linear(64, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        mu = torch.tanh(self.fc_mu(x))
        return mu

class QNet(nn.Module):
    def __init__(self):
        super(QNet, self).__init__()
        self.fc_s = nn.Linear(28, 64)
        self.fc_a = nn.Linear(1,64)
        self.fc_q = nn.Linear(128, 32)
        self.fc_out = nn.Linear(32,1)

    def forward(self, x, a):
        h1 = F.relu(self.fc_s(x))
        h2 = F.relu(self.fc_a(a))
        cat = torch.cat([h1,h2], dim=1)
        q = F.relu(self.fc_q(cat))
        q = self.fc_out(q)
        return q

class OrnsteinUhlenbeckNoise:
    def __init__(self, mu):
        self.theta, self.dt, self.sigma = 0.1, 0.01, 0.1
        self.mu = mu
        self.x_prev = np.zeros_like(self.mu)

    def __call__(self):
        x = self.x_prev + self.theta * (self.mu - self.x_prev) * self.dt + \
            self.sigma * np.sqrt(self.dt) * np.random.normal(size=self.mu.shape)
        self.x_prev = x
        return x

def train(mu, mu_target, q, q_target, memory, q_optimizer, mu_optimizer):
    s,a,r,s_prime,done_mask = memory.sample(batch_size)
    core = q_target(s_prime, mu_target(s_prime)) * done_mask
    target = r + gamma * core
    q_loss = F.smooth_l1_loss(q(s,a), target.detach())
    q_optimizer.zero_grad()
    q_loss.backward()
    q_optimizer.step()
    mu_loss = -q(s,mu(s)).mean() # That's all for the policy loss.
    mu_optimizer.zero_grad()
    mu_loss.backward()
    mu_optimizer.step()

def soft_update(net, net_target):
    for param_target, param in zip(net_target.parameters(), net.parameters()):
        param_target.data.copy_(param_target.data * (1.0 - tau) + param.data * tau)

def main():
    env = ChopperScape()
    memory = ReplayBuffer()
    q, q_target = QNet(), QNet()
    q_target.load_state_dict(q.state_dict())
    mu, mu_target = MuNet(), MuNet()
    mu_target.load_state_dict(mu.state_dict())
    score = 0.0
    print_interval = 20
    mu_optimizer = optim.Adam(mu.parameters(), lr=lr_mu)
    q_optimizer = optim.Adam(q.parameters(), lr=lr_q)
    ou_noise = OrnsteinUhlenbeckNoise(mu=np.zeros(1))
    for n_epi in range(10000):
        s = env.reset()
        done = False
        while not done:
            a = mu(torch.from_numpy(s).float())
            a = a.item() + ou_noise()[0]
            print('action:',a)
            s_prime, r, done, info = env.step(a)
            env.render()
            memory.put((s,a,r/100.0,s_prime,done))
            score += r
            s = s_prime
        if memory.size()>20000:
            for _ in range(10):
                train(mu, mu_target, q, q_target, memory, q_optimizer, mu_optimizer)
                soft_update(mu, mu_target)
                soft_update(q, q_target)
        if n_epi%print_interval==0 and n_epi!=0:
            print("# of episode :{}, avg score : {:.1f}".format(n_epi, score/print_interval))
            score = 0.0
    env.close()

if __name__ == '__main__':
    main()
The DQN algorithm is as follows (dqn.py):
import gym
import collections
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from custom import ChopperScape
#Hyperparameters
learning_rate = 0.0005
gamma = 0.98
buffer_limit = 50000
batch_size = 32
class ReplayBuffer():
    def __init__(self):
        self.buffer = collections.deque(maxlen=buffer_limit)

    def put(self, transition):
        self.buffer.append(transition)

    def sample(self, n):
        mini_batch = random.sample(self.buffer, n)
        s_lst, a_lst, r_lst, s_prime_lst, done_mask_lst = [], [], [], [], []
        for transition in mini_batch:
            s, a, r, s_prime, done_mask = transition
            s_lst.append(s)
            a_lst.append([a])
            r_lst.append([r])
            s_prime_lst.append(s_prime)
            done_mask_lst.append([done_mask])
        return torch.tensor(s_lst, dtype=torch.float), torch.tensor(a_lst), \
               torch.tensor(r_lst), torch.tensor(s_prime_lst, dtype=torch.float), \
               torch.tensor(done_mask_lst)

    def size(self):
        return len(self.buffer)

class Qnet(nn.Module):
    def __init__(self):
        super(Qnet, self).__init__()
        self.fc1 = nn.Linear(28, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, 5)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def sample_action(self, obs, epsilon):
        out = self.forward(obs)
        # coin = random.random()
        # if coin < epsilon:
        #     return random.randint(0,1)
        # else :
        #     return out.argmax().item()
        return out.argmax().item()

def train(q, q_target, memory, optimizer):
    for _ in range(10):
        s,a,r,s_prime,done_mask = memory.sample(batch_size)
        q_out = q(s)
        q_a = q_out.gather(1,a)
        max_q_prime = q_target(s_prime).max(1)[0].unsqueeze(1)
        target = r + gamma * max_q_prime * done_mask
        loss = F.smooth_l1_loss(q_a, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def main():
    env = ChopperScape()
    q = torch.load('10000_dqn_3.pt')
    q_target = torch.load('10000_dqn_3_qtarget.pt')
    # q_target.load_state_dict(q.state_dict())
    memory = ReplayBuffer()
    print_interval = 20
    score = 0.0
    optimizer = optim.Adam(q.parameters(), lr=learning_rate)
    for n_epi in range(10000):
        epsilon = max(0.01, 0.08 - 0.01*(n_epi/200)) #Linear annealing from 8% to 1%
        s = env.reset()
        done = False
        while not done:
            a = q.sample_action(torch.from_numpy(s).float(), epsilon)
            s_prime, r, done, info = env.step(a)
            env.render()
            done_mask = 0.0 if done else 1.0
            memory.put((s,a,r,s_prime, done_mask))
            s = s_prime
            if done:
                break
            score += r
        if memory.size()>20000:
            train(q, q_target, memory, optimizer)
        if n_epi%print_interval==0 and n_epi!=0:
            q_target.load_state_dict(q.state_dict())
            print("n_episode :{}, score : {:.1f}, n_buffer : {}, eps : {:.1f}%".format(n_epi, score/print_interval, memory.size(), epsilon*100))
            score = 0.0
    env.close()

def test():
    env = ChopperScape()
    q = torch.load('10000_dqn_q.pt')
    done = False
    s = env.reset()
    while not done:
        a = q.sample_action(torch.from_numpy(s).float(), 1)
        s_prime, r, done, info = env.step(a)
        env.render()
        s = s_prime
        if done:
            break

if __name__ == '__main__':
    main()
When running DQN, please comment out the action-conversion part in the step method of class ChopperScape in custom.py.
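For reference, dqn.py computes an annealed epsilon, but sample_action has its exploration branch commented out and always returns the greedy action. The commented-out branch also draws from random.randint(0,1), which would cover only two of the five actions. Below is a minimal sketch of epsilon-greedy selection over all actions together with the same annealing schedule. The function names `epsilon_greedy` and `linear_epsilon` and the Q-values are hypothetical and only for illustration; this is not a claim about what alone would fix the training.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    n_actions = len(q_values)
    if rng.random() < epsilon:
        return rng.randrange(n_actions)  # uniform over every action index
    return max(range(n_actions), key=lambda a: q_values[a])

def linear_epsilon(n_epi, start=0.08, end=0.01, decay_per_episode=0.01 / 200):
    """Same schedule as in dqn.py: anneal linearly from 8% down to 1%."""
    return max(end, start - decay_per_episode * n_epi)

q_values = [0.1, 0.7, -0.3, 0.2, 0.0]  # hypothetical Q-values for the 5 actions
greedy = epsilon_greedy(q_values, epsilon=0.0)
print(greedy)  # 1, the index of the largest Q-value
```

In the training loop, the call would take the network output converted to a list and the epsilon computed for the current episode.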

Why is my REINFORCE algorithm not learning?

I am training a REINFORCE algorithm on the CartPole environment. Due to the simple nature of the environment, I expect it to learn quickly. However, that doesn't happen.
Here is the main portion of the algorithm -
for i in range(episodes):
    print("i = ", i)
    state = env.reset()
    done = False
    transitions = []
    tot_rewards = 0
    while not done:
        act_proba = model(torch.from_numpy(state))
        action = np.random.choice(np.array([0,1]), p = act_proba.data.numpy())
        next_state, reward, done, info = env.step(action)
        tot_rewards += 1
        transitions.append((state, action, tot_rewards))
        state = next_state
    if i%50==0:
        print("i = ", i, ",reward = ", tot_rewards)
    score.append(tot_rewards)
    reward_batch = torch.Tensor([r for (s,a,r) in transitions])
    disc_rewards = discount_rewards(reward_batch)
    nrml_disc_rewards = normalize_rewards(disc_rewards)
    state_batch = torch.Tensor([s for (s,a,r) in transitions])
    action_batch = torch.Tensor([a for (s,a,r) in transitions])
    pred_batch = model(state_batch)
    prob_batch = pred_batch.gather(dim=1, index=action_batch.long().view(-1, 1)).squeeze()
    loss = -(torch.sum(torch.log(prob_batch)*nrml_disc_rewards))
    opt.zero_grad()
    loss.backward()
    opt.step()
Here is the entire algorithm -
#I referred to this when writing the code - https://github.com/DeepReinforcementLearning/DeepReinforcementLearningInAction/blob/master/Chapter%204/Ch4_book.ipynb
import numpy as np
import gym
import torch
from torch import nn
env = gym.make('CartPole-v0')
learning_rate = 0.0001
episodes = 10000
def discount_rewards(reward, gamma = 0.99):
    return torch.pow(gamma, torch.arange(len(reward)))*reward

def normalize_rewards(disc_reward):
    return disc_reward/(disc_reward.max())

class NeuralNetwork(nn.Module):
    def __init__(self, state_size, action_size):
        super(NeuralNetwork, self).__init__()
        self.state_size = state_size
        self.action_size = action_size
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(state_size, 300),
            nn.ReLU(),
            nn.Linear(300, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, action_size),
            nn.Softmax()
        )

    def forward(self,x):
        x = self.linear_relu_stack(x)
        return x

model = NeuralNetwork(env.observation_space.shape[0], env.action_space.n)
opt = torch.optim.Adam(params = model.parameters(), lr = learning_rate)
score = []
for i in range(episodes):
    print("i = ", i)
    state = env.reset()
    done = False
    transitions = []
    tot_rewards = 0
    while not done:
        act_proba = model(torch.from_numpy(state))
        action = np.random.choice(np.array([0,1]), p = act_proba.data.numpy())
        next_state, reward, done, info = env.step(action)
        tot_rewards += 1
        transitions.append((state, action, tot_rewards))
        state = next_state
    if i%50==0:
        print("i = ", i, ",reward = ", tot_rewards)
    score.append(tot_rewards)
    reward_batch = torch.Tensor([r for (s,a,r) in transitions])
    disc_rewards = discount_rewards(reward_batch)
    nrml_disc_rewards = normalize_rewards(disc_rewards)
    state_batch = torch.Tensor([s for (s,a,r) in transitions])
    action_batch = torch.Tensor([a for (s,a,r) in transitions])
    pred_batch = model(state_batch)
    prob_batch = pred_batch.gather(dim=1, index=action_batch.long().view(-1, 1)).squeeze()
    loss = -(torch.sum(torch.log(prob_batch)*nrml_disc_rewards))
    opt.zero_grad()
    loss.backward()
    opt.step()

The mistake is in how you compute the discounted rewards.
In REINFORCE (and many other algorithms) you need, for every step, the discounted sum of the rewards from that step onward.
This means that the discounted returns for the first two steps should be:
G_1 = r_1 + gamma * r_2 + gamma ^ 2 * r_3 + ... + gamma ^ (T-1) * r_T
G_2 = r_2 + gamma * r_3 + gamma ^ 2 * r_4 + ... + gamma ^ (T-2) * r_T
And so on...
This gives you an array containing the discounted return for every step (i.e. [G_1, G_2, G_3, ... , G_T]).
However, what you currently compute only applies a discount to the current step's reward:
G_1 = r_1
G_2 = gamma * r_2
G_3 = gamma ^ 2 * r_3
And so on...
Here is Python code that fixes the problem. We iterate from the back of the reward list to the front, which is more computationally efficient.
def discount_rewards(reward, gamma=0.99):
    R = 0
    returns = []
    reward = reward.tolist()
    for r in reward[::-1]:
        R = r + gamma * R
        returns.append(R)
    returns = torch.tensor(returns[::-1])
    return returns
Here is a figure showing the progression of the algorithm's score over the first 5000 steps.
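To make the difference between the two computations concrete, here is a small worked example in plain Python (no torch), using a made-up three-step episode with reward 1 per step and gamma = 0.5. The function names `returns_to_go` and `per_step_discount` are hypothetical labels for the corrected computation and for the per-step discounting described above.

```python
def returns_to_go(rewards, gamma=0.99):
    """G_t = r_t + gamma * r_{t+1} + ..., computed with one backward pass."""
    running = 0.0
    out = []
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]

def per_step_discount(rewards, gamma=0.99):
    """Per-step discounting only: gamma^t * r_t, with no summation."""
    return [gamma ** t * r for t, r in enumerate(rewards)]

rewards = [1.0, 1.0, 1.0]  # hypothetical 3-step episode
print(returns_to_go(rewards, gamma=0.5))      # [1.75, 1.5, 1.0]
print(per_step_discount(rewards, gamma=0.5))  # [1.0, 0.5, 0.25]
```

With returns-to-go, early actions get credit for everything that follows them; with per-step discounting, each action is only weighted by its own reward.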

Why does Deep Adaptive Input Normalization (DAIN) normalize time series data across rows?

The DAIN paper describes how a network learns to normalize time series data by itself, and below is how the authors implemented it. The code leads me to think that normalization happens across rows, not columns. Can anyone explain why it is implemented that way? I always thought that time series should be normalized only across columns, so that each feature keeps its own information.
Here is the piece that does the normalization:
```python
class DAIN_Layer(nn.Module):
    def __init__(self, mode='adaptive_avg', mean_lr=0.00001, gate_lr=0.001, scale_lr=0.00001, input_dim=144):
        super(DAIN_Layer, self).__init__()
        print("Mode = ", mode)
        self.mode = mode
        self.mean_lr = mean_lr
        self.gate_lr = gate_lr
        self.scale_lr = scale_lr
        # Parameters for adaptive average
        self.mean_layer = nn.Linear(input_dim, input_dim, bias=False)
        self.mean_layer.weight.data = torch.FloatTensor(data=np.eye(input_dim, input_dim))
        # Parameters for adaptive std
        self.scaling_layer = nn.Linear(input_dim, input_dim, bias=False)
        self.scaling_layer.weight.data = torch.FloatTensor(data=np.eye(input_dim, input_dim))
        # Parameters for adaptive scaling
        self.gating_layer = nn.Linear(input_dim, input_dim)
        self.eps = 1e-8

    def forward(self, x):
        # Expecting (n_samples, dim, n_feature_vectors)
        # Nothing to normalize
        if self.mode == None:
            pass
        # Do simple average normalization
        elif self.mode == 'avg':
            avg = torch.mean(x, 2)
            avg = avg.resize(avg.size(0), avg.size(1), 1)
            x = x - avg
        # Perform only the first step (adaptive averaging)
        elif self.mode == 'adaptive_avg':
            avg = torch.mean(x, 2)
            adaptive_avg = self.mean_layer(avg)
            adaptive_avg = adaptive_avg.resize(adaptive_avg.size(0), adaptive_avg.size(1), 1)
            x = x - adaptive_avg
        # Perform the first + second step (adaptive averaging + adaptive scaling )
        elif self.mode == 'adaptive_scale':
            # Step 1:
            avg = torch.mean(x, 2)
            adaptive_avg = self.mean_layer(avg)
            adaptive_avg = adaptive_avg.resize(adaptive_avg.size(0), adaptive_avg.size(1), 1)
            x = x - adaptive_avg
            # Step 2:
            std = torch.mean(x ** 2, 2)
            std = torch.sqrt(std + self.eps)
            adaptive_std = self.scaling_layer(std)
            adaptive_std[adaptive_std <= self.eps] = 1
            adaptive_std = adaptive_std.resize(adaptive_std.size(0), adaptive_std.size(1), 1)
            x = x / (adaptive_std)
        elif self.mode == 'full':
            # Step 1:
            avg = torch.mean(x, 2)
            adaptive_avg = self.mean_layer(avg)
            adaptive_avg = adaptive_avg.resize(adaptive_avg.size(0), adaptive_avg.size(1), 1)
            x = x - adaptive_avg
            # # Step 2:
            std = torch.mean(x ** 2, 2)
            std = torch.sqrt(std + self.eps)
            adaptive_std = self.scaling_layer(std)
            adaptive_std[adaptive_std <= self.eps] = 1
            adaptive_std = adaptive_std.resize(adaptive_std.size(0), adaptive_std.size(1), 1)
            x = x / adaptive_std
            # Step 3:
            avg = torch.mean(x, 2)
            gate = F.sigmoid(self.gating_layer(avg))
            gate = gate.resize(gate.size(0), gate.size(1), 1)
            x = x * gate
        else:
            assert False
        return x
```

I am not sure either, but they do a transpose in the forward function of the MLP class: x = x.transpose(1, 2). So it seems to me that they normalize over time for each feature.
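The effect of the transpose followed by a mean over the last axis can be illustrated with a toy example. This is only a sketch under the assumption that each sample arrives as (time steps, features) and is transposed to (features, time steps) before the layer, as the answer suggests; plain nested lists stand in for tensors, and the numbers are made up.

```python
# One hypothetical sample: 4 time steps (rows) x 2 features (columns).
sample = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [4.0, 40.0],
]

def transpose(matrix):
    """Swap rows and columns, as x.transpose(1, 2) does for each sample."""
    return [list(col) for col in zip(*matrix)]

def mean_last_axis(matrix):
    """Average each row, as torch.mean(x, 2) does for each sample."""
    return [sum(row) / len(row) for row in matrix]

features_by_time = transpose(sample)  # now 2 features x 4 time steps
per_feature_mean = mean_last_axis(features_by_time)
print(per_feature_mean)  # [2.5, 25.0]
```

After the transpose, each row holds one feature over time, so averaging along the last axis yields one mean per feature, which is the usual per-column statistic of the original layout.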

I'm trying to load the Poker Hand dataset (CSV) into TensorFlow, but the accuracy is always about 50%. What can I do about it?

I am trying to train an MLP that consists of just a softmax layer. The TensorFlow tutorials use the MNIST dataset, but I am trying a different one, the Poker Hand dataset (10 classes). With my program, however, the accuracy is always about 50%, which is frustrating.
Here is my code:
# coding=utf-8
from __future__ import print_function
import tensorflow as tf
import numpy as np
import datetime
class Arc:
    def __init__(self):
        self.filenames = ['train.csv', 'test.csv']
        self.batchSize = 128
        self.trainIters = 100000
        self.totalEpoch = 1
        self.min_after_dequeue = 256
        self.capacity = 640

    def readData(self, filenames=None):
        files = tf.train.string_input_producer(filenames)
        reader = tf.TextLineReader()
        key, value = reader.read(files)
        record_defaults = [[1], [1], [4], [1], [8], [1], [2], [1], [11], [1], [5]]
        s1, c1, s2, c2, s3, c3, s4, c4, s5, c5, hand = tf.decode_csv(value,
                                                                     record_defaults=record_defaults)
        features = tf.pack(tf.to_float([s1, c1, s2, c2, s3, c3, s4, c4, s5, c5]))
        hand = tf.one_hot(hand, 10, 1, 0, -1, tf.int32)
        features_batch, hand_batch = tf.train.shuffle_batch(
            [features, hand],
            batch_size=self.batchSize,
            capacity=self.capacity,
            min_after_dequeue=self.min_after_dequeue)
        return features_batch, hand_batch

    def fullyConnected(self, incoming, n_units, bias=True,
                       regularizer=None, weight_decay=0.001, trainable=True,
                       name="FullyConnected"):
        if isinstance(incoming, tf.Tensor):
            input_shape = incoming.get_shape().as_list()
        elif type(incoming) in [np.array, list, tuple]:
            input_shape = np.shape(incoming)
        else:
            raise Exception("Invalid incoming layer")
        assert len(input_shape) > 1, "Incoming Tensor shape must be at least 2-D"
        n_inputs = int(np.prod(input_shape[1:]))
        with tf.name_scope(name) as scope:
            W_init = tf.uniform_unit_scaling_initializer(dtype=tf.float32, seed=None)
            W_regul = None
            if regularizer:
                if regularizer == 'L1':
                    W_regul = lambda x: tf.mul(tf.nn.l2_loss(x), weight_decay, name='L2-Loss')
                elif regularizer == 'L2':
                    W_regul = lambda x: tf.mul(tf.reduce_sum(tf.abs(x)), weight_decay, name='L1-Loss')
            with tf.device(''):
                try:
                    W = tf.get_variable(scope + 'W', [n_inputs, n_units], tf.float32, W_init, W_regul)
                except Exception as e:
                    W = tf.get_variable(scope + 'W', [n_inputs, n_units], tf.float32, W_init)
                    if regularizer is not None:
                        if regularizer == 'L1':
                            W = lambda x: tf.mul(tf.nn.l2_loss(W), weight_decay, name='L2-Loss')
                        elif regularizer == 'L2':
                            W = lambda x: tf.mul(tf.reduce_sum(tf.abs(W)), weight_decay, name='L1-Loss')
            b = None
            if bias:
                b_init = tf.constant_initializer(0.)
                with tf.device(''):
                    b = tf.get_variable(scope + 'b', [n_units], tf.float32, b_init, W_regul, trainable=trainable)
            inference = incoming
            if len(input_shape) > 2:
                inference = tf.reshape(inference, [-1, n_inputs])
            inference = tf.matmul(inference, W)
            if b: inference += b
        return inference

    def network(self, net):
        net = self.fullyConnected(net, 10)
        net = tf.nn.softmax(net)
        return net

    def run(self):
        features, hand = self.readData(['train.csv'])
        x = tf.placeholder(dtype=tf.float32,
                           shape=[None, 10],
                           name='Placeholder_X')
        y = tf.placeholder(dtype=tf.float32,
                           shape=[None, 10],
                           name='Placeholder_Y')
        pred = self.network(x)
        cost = tf.reduce_mean(-tf.reduce_sum(y * tf.log(pred), reduction_indices=[1]))
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(cost)
        correctPred = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
        accuracy = tf.reduce_mean(tf.cast(correctPred, tf.float32))
        init = tf.initialize_all_variables()
        startTime = datetime.datetime.now()
        with tf.Session() as sess:
            sess.run(init)
            coord = tf.train.Coordinator()
            threads = tf.train.start_queue_runners(sess=sess, coord=coord)
            iter = 1
            while iter * self.batchSize < self.trainIters:
                example, label = sess.run([features, hand])
                try:
                    sess.run(optimizer, feed_dict={x: example, y: label})
                except Exception as e:
                    print(e.message)
                if iter % 10 == 0:
                    loss, acc = sess.run([cost, accuracy], feed_dict={x: example, y: label})
                    print("Iter " + str(iter * self.batchSize) + ", Minibatch Loss= " + \
                          "{:.6f}".format(loss) + ", Training Accuracy= " + \
                          "{:.5f}".format(acc))
                iter += 1
            coord.request_stop()
            coord.join(threads)
        print('all done')
        endTime = datetime.datetime.now()
        fitTime = (endTime - startTime)
        print("Training Time:", fitTime)

if __name__ == '__main__':
    net = Arc()
    net.run()
I got the result as
Iter 1280, Minibatch Loss= 2.210387, Training Accuracy= 0.40625
Iter 2560, Minibatch Loss= 2.371088, Training Accuracy= 0.35156
Iter 3840, Minibatch Loss= 1.723017, Training Accuracy= 0.42188
Iter 5120, Minibatch Loss= 1.650101, Training Accuracy= 0.43750
....
....
Iter 98560, Minibatch Loss= 0.990002, Training Accuracy= 0.54688
Iter 99840, Minibatch Loss= 1.142664, Training Accuracy= 0.52344
all done
Training Time: 0:00:12.081167
What mistake did I make? I guess maybe the queue caused that?

I took a look at it, and there are several problems in your code:
No activation function.
Only one fully connected layer, which has very little capacity.
The printed loss is not the correct value.
No encoding of the categorical input values (one-hot encode s1 into 4 values and c1 into 13 values, then concatenate the results).
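The last point can be sketched in plain Python. This assumes that suits are coded 1 to 4 and ranks 1 to 13 in the CSV and that each row is ordered s1, c1, ..., s5, c5 as in the question's decode_csv call. The helper names `one_hot` and `encode_hand` and the sample row are made up for illustration; they are not part of TensorFlow.

```python
N_SUITS, N_RANKS = 4, 13  # assumed value ranges: suits 1..4, ranks 1..13

def one_hot(value, size):
    """Return a list of length `size` with a 1.0 at position value - 1."""
    if not 1 <= value <= size:
        raise ValueError(f"value {value} outside 1..{size}")
    return [1.0 if i == value - 1 else 0.0 for i in range(size)]

def encode_hand(row):
    """Encode [s1, c1, ..., s5, c5] as 5 * (4 + 13) = 85 binary features."""
    features = []
    for suit, rank in zip(row[0::2], row[1::2]):
        features += one_hot(suit, N_SUITS) + one_hot(rank, N_RANKS)
    return features

row = [1, 10, 1, 11, 1, 13, 1, 12, 1, 1]  # hypothetical hand, all cards in suit 1
encoded = encode_hand(row)
print(len(encoded), int(sum(encoded)))  # 85 10
```

With this encoding the network input would have 85 columns instead of 10, so the placeholder shape in the question would need to change accordingly.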
