Problem implementing the “Continual Learning Through Synaptic Intelligence” paper - deep-learning

I am trying to reproduce the results of the “Continual Learning Through Synaptic Intelligence” paper [1]. I implemented the algorithm as best I could understand it after going through the paper many times. I also looked at its official implementation on GitHub, which is in TensorFlow 1.0, but could not understand much of it since I am not very familiar with that framework.
I got some results, but they are not as good as those in the paper. I wanted to ask if anyone can help me figure out where I am going wrong. Before going into coding details, I want to discuss the pseudocode so that I understand what is wrong with my implementation.
Here is roughly the pseudocode I have implemented. Please help me.
lambda = 1
xi = 1e-3
total_tasks = 5

model = NN(total_tasks)
## multi-headed linear model ([784(input) --> 256 --> 256 --> 2(output)] * 5, 5 separate heads)
## output layer is a 2-neuron head (separate head for each task, 5 tasks in total)
## output is a vector of size 2 (for 2 classes)

prev_theta = model.theta(copy=True)  # updated at the end of each task
## model.theta() returns the list of shared parameters (i.e. layer1 and layer2, excluding the output heads)
## copy=True gives a copy of the parameters,
## so it does not affect the original parameters attached to the computational graph

omega_total = zero_like(prev_theta)  ## capital Omega in the paper (per-parameter regularization strength)
omega = zero_like(prev_theta)        ## small omega in the paper (per-parameter contribution to the loss)

for task_num in range(total_tasks):
    optimizer = ADAM()  # created before every task (or reset)
    prev_theta_step = model.theta(copy=True)  # updated at the end of every step

    ## training for the task starts
    for step in range(steps_per_epoch):
        for epoch in range(10):
            X, Y = train_dataset[task_num].sample()
            ## X is a flattened image of size 784
            ## Y is a binary vector of size 2 ([0,1] or [1,0])

            Y_pred = model(X, task_num)  # model is multi-headed, task_num selects the head
            loss = CROSS_ENTROPY(Y_pred, Y)

            if task_num > 0:  ## reg_loss starts from the second task
                theta = model.theta()
                ## here copy is not True, so it returns the parameters attached to the computational graph
                reg_loss = torch.sum(omega_total * torch.square(theta - prev_theta))
                loss = loss + lambda * reg_loss

            optimizer.zero_grad()
            loss.backward()

            theta = model.theta(copy=True)
            grads = model.theta_grads()  ## grads of the shared parameters only
            omega = omega - grads * (theta - prev_theta_step)
            prev_theta_step = theta

            optimizer.step()

    ## training for the task complete, update importance parameters
    theta = model.theta(copy=True)
    omega_total += relu( omega / ((theta - prev_theta)**2 + xi) )
    prev_theta = theta
    omega = torch.zeros(theta_shape)

    ## evaluation code
    ...
    ## evaluation done
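For reference, here is a minimal PyTorch sketch of the same per-step bookkeeping with real tensors. It assumes model.theta() returns the list of shared parameter tensors and that omega, omega_total and prev_theta are lists of tensors of matching shapes; these names mirror the pseudocode and are not a real API. One detail worth comparing against the pseudocode above: this sketch pairs the gradient computed at step t with the parameter change produced by that same optimizer.step(), i.e. it snapshots the parameters just before the step and accumulates just after it.

import torch

# --- inside the inner training loop, for one step ---
optimizer.zero_grad()
loss.backward()

# snapshot the shared parameters and their gradients before the update
theta_before = [p.detach().clone() for p in model.theta()]
grads = [p.grad.detach().clone() for p in model.theta()]

optimizer.step()

# accumulate the path integral: omega_k += -g_k * (theta_k_after - theta_k_before)
with torch.no_grad():
    for w, g, p, p0 in zip(omega, grads, model.theta(), theta_before):
        w -= g * (p.detach() - p0)

# --- at the end of each task ---
with torch.no_grad():
    for W, w, p, p_prev in zip(omega_total, omega, model.theta(), prev_theta):
        delta = p.detach() - p_prev
        W += torch.relu(w / (delta ** 2 + xi))   # capital Omega update
        w.zero_()                                # reset small omega for the next task
    prev_theta = [p.detach().clone() for p in model.theta()]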
I am also attaching the results I got. In the results, ‘one’ (blue) is without the regularization loss (lambda=0) and ‘two’ (green) is with the regularization loss (lambda=1).
Thank you for reading so far. Kindly help me out.

Related

Fluctuating RAM in Google Colab while running a BERT model

I am running a simple comment-classification task on Google Colab. I am using DistilBERT for contextual embeddings. I use only 4000 training samples because the notebook keeps crashing.
When I run the cell for obtaining the embeddings, I keep a tab on how the RAM utilisation evolves. I see it oscillating between roughly 3 GB and 8 GB.
Shouldn't it just keep increasing? Can anyone explain how this works at a lower level?
Here is my code; the last cell block is where I see the behaviour described above.
# For DistilBERT:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')
## Want BERT instead of distilBERT? Uncomment the following line:
#model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')
# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)
max_len=80
tokenized = sample['comment_text'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True,max_length= max_len)))
padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape
input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)
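One way to keep peak memory bounded when obtaining the embeddings is to run the forward pass in small chunks instead of pushing all rows through the model at once. A rough sketch under the same setup as above (assuming ppb is the transformers package and input_ids / attention_mask are the tensors built in the snippet), keeping only the [CLS] vector per row:

import torch

batch_size = 64
features = []
model.eval()
with torch.no_grad():
    for start in range(0, input_ids.size(0), batch_size):
        ids = input_ids[start:start + batch_size]
        mask = attention_mask[start:start + batch_size]
        out = model(ids, attention_mask=mask)
        # out[0] is the last hidden state, shape [batch, max_len, hidden];
        # keep only the first ([CLS]) token so full sequences are not stored
        features.append(out[0][:, 0, :])
features = torch.cat(features, dim=0)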

When using the frame skipping wrapper for OpenAI Gym, what is the purpose of the np.max line?

I'm implementing the following wrapper used commonly in OpenAI's Gym for Frame Skipping. It can be found in dqn/atari_wrappers.py
I'm very confused about the following line:
max_frame = np.max(np.stack(self._obs_buffer), axis=0)
I have added comments throughout the code for the parts I understand and to aid anyone who may be able to help.
np.stack(self._obs_buffer) stacks the two states in _obs_buffer.
np.max returns the maximum along axis 0.
But what I don't understand is why we're doing this or what it's really doing.
class MaxAndSkipEnv(gym.Wrapper):
    """Return only every 4th frame"""
    def __init__(self, env=None, skip=4):
        super(MaxAndSkipEnv, self).__init__(env)
        # Initialise a double-ended queue that can store a maximum of two states
        self._obs_buffer = deque(maxlen=2)
        # _skip = 4
        self._skip = skip

    def _step(self, action):
        total_reward = 0.0
        done = None
        for _ in range(self._skip):
            # Take a step
            obs, reward, done, info = self.env.step(action)
            # Append the new state to the double-ended queue buffer
            self._obs_buffer.append(obs)
            # Update the total reward by adding the reward obtained from this step
            total_reward += reward
            # If the game ends, break the for loop
            if done:
                break
        max_frame = np.max(np.stack(self._obs_buffer), axis=0)
        return max_frame, total_reward, done, info
At the end of the for loop, self._obs_buffer holds the last two frames.
Those two frames are then max-pooled pixel-wise. Atari renders some sprites only on alternate frames (a hardware sprite limitation), so taking the element-wise maximum of the last two frames removes that flicker; the resulting observation also carries a little temporal information across the skipped frames.
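A tiny NumPy example of what that max line computes, pixel-wise, over the two buffered frames:

import numpy as np

# two 2x2 "frames"; the sprite is visible in a different frame each time
frame_a = np.array([[0, 255],
                    [0,   0]], dtype=np.uint8)
frame_b = np.array([[0,   0],
                    [255, 0]], dtype=np.uint8)

pooled = np.max(np.stack([frame_a, frame_b]), axis=0)
print(pooled)
# [[  0 255]
#  [255   0]]   -> the sprite is visible in the pooled frame either way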

Custom environment Gym for step function processing with DDPG Agent

I'm new to reinforcement learning, and I would like to process audio signals using this technique. I built a basic step signal that I wish to flatten, as a way to get hands-on with OpenAI Gym and reinforcement learning in general.
To do so, I am using the GoalEnv provided by OpenAI, since I know what the target is: the flat signal.
Here is the image with the input and desired signal:
The step function calls _set_action, which performs achieved_signal = convolution(input_signal, low_pass_filter) - offset; low_pass_filter takes a cutoff frequency as input as well.
The cutoff frequency and offset are the parameters that act on the observation to produce the output signal.
The reward function returns the negative of the frame-to-frame L2 norm between the achieved signal and the desired signal, to penalize a large distance.
Following is the environment I created:
def butter_lowpass(cutoff, nyq_freq, order=4):
    normal_cutoff = float(cutoff) / nyq_freq
    b, a = signal.butter(order, normal_cutoff, btype='lowpass')
    return b, a

def butter_lowpass_filter(data, cutoff_freq, nyq_freq, order=4):
    b, a = butter_lowpass(cutoff_freq, nyq_freq, order=order)
    y = signal.filtfilt(b, a, data)
    return y

class StepSignal(gym.GoalEnv):

    def __init__(self, input_signal, sample_rate, desired_signal):
        super(StepSignal, self).__init__()
        self.initial_signal = input_signal
        self.signal = self.initial_signal.copy()
        self.sample_rate = sample_rate
        self.desired_signal = desired_signal
        self.distance_threshold = 10e-1
        max_offset = abs(max( max(self.desired_signal) , max(self.signal))
                         - min( min(self.desired_signal) , min(self.signal)) )
        self.action_space = spaces.Box(low=np.array([10e-4, -max_offset]),
                                       high=np.array([self.sample_rate/2-0.1, max_offset]),
                                       dtype=np.float16)
        obs = self._get_obs()
        self.observation_space = spaces.Dict(dict(
            desired_goal=spaces.Box(-np.inf, np.inf, shape=obs['achieved_goal'].shape, dtype='float32'),
            achieved_goal=spaces.Box(-np.inf, np.inf, shape=obs['achieved_goal'].shape, dtype='float32'),
            observation=spaces.Box(-np.inf, np.inf, shape=obs['observation'].shape, dtype='float32'),
        ))

    def step(self, action):
        range = self.action_space.high - self.action_space.low
        action = range / 2 * (action + 1)
        self._set_action(action)
        obs = self._get_obs()
        done = False
        info = {
            'is_success': self._is_success(obs['achieved_goal'], self.desired_signal),
        }
        reward = -self.compute_reward(obs['achieved_goal'], self.desired_signal)
        return obs, reward, done, info

    def reset(self):
        self.signal = self.initial_signal.copy()
        return self._get_obs()

    def _set_action(self, actions):
        actions = np.clip(actions, a_max=self.action_space.high, a_min=self.action_space.low)
        cutoff = actions[0]
        offset = actions[1]
        print(cutoff, offset)
        self.signal = butter_lowpass_filter(self.signal, cutoff, self.sample_rate/2) - offset

    def _get_obs(self):
        obs = self.signal
        achieved_goal = self.signal
        return {
            'observation': obs.copy(),
            'achieved_goal': achieved_goal.copy(),
            'desired_goal': self.desired_signal.copy(),
        }

    def compute_reward(self, goal_achieved, goal_desired):
        d = np.linalg.norm(goal_desired - goal_achieved)
        return d

    def _is_success(self, achieved_goal, desired_goal):
        d = self.compute_reward(achieved_goal, desired_goal)
        return (d < self.distance_threshold).astype(np.float32)
The environment can then be instantiated into a variable, and flattened through the FlattenDictWrapper as advised here https://openai.com/blog/ingredients-for-robotics-research/ (end of the page).
length = 20
sample_rate = 30 # 30 Hz
in_signal_length = 20*sample_rate # 20sec signal
x = np.linspace(0, length, in_signal_length)
# Desired output
y = 3*np.ones(in_signal_length)
# Step signal
in_signal = 0.5*(np.sign(x-5)+9)
env = gym.make('stepsignal-v0', input_signal=in_signal, sample_rate=sample_rate, desired_signal=y)
env = gym.wrappers.FlattenDictWrapper(env, dict_keys=['observation','desired_goal'])
env.reset()
The agent is a DDPG Agent from keras-rl, since the actions can take any values in the continuous action_space described in the environment.
nb_actions = env.action_space.shape[0]
# Building Actor agent (Policy-net)
actor = Sequential()
actor.add(Flatten(input_shape=(1,) + env.observation_space.shape, name='flatten'))
actor.add(Dense(128))
actor.add(Activation('relu'))
actor.add(Dense(64))
actor.add(Activation('relu'))
actor.add(Dense(nb_actions))
actor.add(Activation('linear'))
actor.summary()
# Building Critic net (Q-net)
action_input = Input(shape=(nb_actions,), name='action_input')
observation_input = Input(shape=(1,) + env.observation_space.shape, name='observation_input')
flattened_observation = Flatten()(observation_input)
x = Concatenate()([action_input, flattened_observation])
x = Dense(128)(x)
x = Activation('relu')(x)
x = Dense(64)(x)
x = Activation('relu')(x)
x = Dense(1)(x)
x = Activation('linear')(x)
critic = Model(inputs=[action_input, observation_input], outputs=x)
critic.summary()
# Building Keras agent
memory = SequentialMemory(limit=2000, window_length=1)
policy = BoltzmannQPolicy()
random_process = OrnsteinUhlenbeckProcess(size=nb_actions, theta=0.6, mu=0, sigma=0.3)
agent = DDPGAgent(nb_actions=nb_actions, actor=actor, critic=critic, critic_action_input=action_input,
memory=memory, nb_steps_warmup_critic=2000, nb_steps_warmup_actor=10000,
random_process=random_process, gamma=.99, target_model_update=1e-3)
agent.compile(Adam(lr=1e-3, clipnorm=1.), metrics=['mae'])
Finally, the agent is trained:
filename = 'mem20k_heaviside_flattening'
hist = agent.fit(env, nb_steps=10, visualize=False, verbose=2, nb_max_episode_steps=5)
with open('./history_dqn_test_' + filename + '.pickle', 'wb') as handle:
    pickle.dump(hist.history, handle, protocol=pickle.HIGHEST_PROTOCOL)
agent.save_weights('h5f_files/dqn_{}_weights.h5f'.format(filename), overwrite=True)
Now here is the catch: the agent always seems stuck in the same neighbourhood of output values across all episodes for a given instance of my env.
The cumulative reward is negative, since I only allow the agent to receive negative rewards. I took this from https://github.com/openai/gym/blob/master/gym/envs/robotics/fetch_env.py, which is part of OpenAI's code, as an example.
Within one episode, I expected to see varying sets of actions converging towards a (cutoff_final, offset_final) that brings my input step signal close to my flat target signal, which is clearly not the case. In addition, I expected to see different actions across successive episodes.
I wonder why the actor and critic nets need an input with an additional dimension, in input_shape=(1,) + env.observation_space.shape
I think the GoalEnv is designed with HER (Hindsight Experience Replay) in mind, since it uses the "sub-spaces" inside the observation_space to learn from sparse reward signals (there is a paper on the OpenAI website that explains how HER works). I haven't looked at the implementation, but my guess is that there needs to be an additional input because HER also processes the "goal" parameter.
Since it seems you are not using HER (which works with any off-policy algorithm, including DQN, DDPG, etc.), you should handcraft an informative reward function (i.e. rewards that are not binary, e.g. 1 if the objective is achieved, 0 otherwise) and use the base Env class. The reward should be calculated inside the step method; since rewards in MDPs are functions like r(s, a, s'), you will probably have all the information you need there. Hope it helps.
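To make that suggestion concrete, here is a rough sketch (not a drop-in replacement) of how the same environment could be written against the plain gym.Env interface, with a dense reward computed directly inside step. It reuses the butter_lowpass_filter helper defined above, and the class name StepSignalEnv is just illustrative:

import numpy as np
import gym
from gym import spaces

class StepSignalEnv(gym.Env):
    def __init__(self, input_signal, sample_rate, desired_signal):
        super().__init__()
        self.initial_signal = input_signal
        self.signal = input_signal.copy()
        self.sample_rate = sample_rate
        self.desired_signal = desired_signal
        hi = max(np.max(desired_signal), np.max(input_signal))
        lo = min(np.min(desired_signal), np.min(input_signal))
        max_offset = abs(hi - lo)
        self.action_space = spaces.Box(low=np.array([1e-3, -max_offset]),
                                       high=np.array([sample_rate / 2 - 0.1, max_offset]),
                                       dtype=np.float32)
        self.observation_space = spaces.Box(-np.inf, np.inf,
                                            shape=input_signal.shape, dtype=np.float32)

    def step(self, action):
        cutoff, offset = np.clip(action, self.action_space.low, self.action_space.high)
        self.signal = butter_lowpass_filter(self.signal, cutoff, self.sample_rate / 2) - offset
        # dense reward computed here: negative L2 distance to the target signal
        reward = -np.linalg.norm(self.desired_signal - self.signal)
        done = False
        return self.signal.copy(), reward, done, {}

    def reset(self):
        self.signal = self.initial_signal.copy()
        return self.signal.copy()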

Dice and CE loss not training network together

I am training a segmentation network on the Kaggle Salt challenge. My Dice and CE losses decrease, but then suddenly the Dice loss increases and CE jumps up a bit, and this keeps happening to the Dice loss. I have been trying all day to fix this but can't get it to work. I am training on only 10 data points to overfit my data, but even that is not happening. Any help would be greatly appreciated.
Plots of Dice (top) and CE: (loss curve image)
Here's my dice and training code:
def dice(input, target, weights=torch.tensor([1,1]).float().cuda()):
    smooth = .001
    dummy = np.zeros([batch_size, 2, 100, 100])  # dummy to one-hot encode target for weighted dice
    dummy[:,0,:,:][target==0] = 1  # background class is 0
    dummy[:,1,:,:][target==1] = 1  # salt class is 1
    target = torch.tensor(dummy).float().cuda()
    # print(input.size(), input[:,0,:,:].size())
    input1 = input[:,0,:,:].contiguous().view(-1)  # flatten both classes separately
    target1 = target[:,0,:,:].contiguous().view(-1)
    input2 = input[:,1,:,:].contiguous().view(-1)
    target2 = target[:,1,:,:].contiguous().view(-1)
    score1 = 2*(input1*target1).sum()/(input1.sum()+target1.sum()+smooth)  # background
    score2 = 2*(input2*target2).sum()/(input2.sum()+target2.sum()+smooth)  # salt
    score = 1 - (weights[0]*score1 + weights[1]*score2)/2
    if score < 0:
        score = score - score
    return score
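For comparison, here is a common way to write a soft Dice loss entirely in torch (no NumPy round-trip), one-hot encoding the target with F.one_hot and turning the network output into probabilities first. This is only a reference sketch: it assumes logits of shape [N, 2, H, W] and integer labels of shape [N, H, W], and note that the output fed to a CrossEntropyLoss criterion is raw logits, not probabilities.

import torch
import torch.nn.functional as F

def soft_dice_loss(logits, target, smooth=1e-3):
    probs = torch.softmax(logits, dim=1)               # [N, 2, H, W] probabilities
    one_hot = F.one_hot(target.long(), num_classes=2)  # [N, H, W, 2]
    one_hot = one_hot.permute(0, 3, 1, 2).float()      # [N, 2, H, W]
    dims = (0, 2, 3)                                   # sum over batch and pixels
    intersection = (probs * one_hot).sum(dims)
    union = probs.sum(dims) + one_hot.sum(dims)
    dice_per_class = (2 * intersection + smooth) / (union + smooth)
    return 1 - dice_per_class.mean()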
Here's the training loop:
def train(epoch):
    for idx, batch_data in enumerate(dataloader):
        x, target = batch_data['image'].float().cuda(), batch_data['label'].float().cuda()
        optimizer.zero_grad()
        output = net(x)
        # print(output.size())
        output.squeeze_(1)
        # print('out', output.size(), target.size())
        bce_loss = criterion(output, target.long())
        lc.append(bce_loss.item())
        dice_loss = dice(output, target)
        ld.append(dice_loss.item())
        loss = dice_loss + bce_loss
        l.append(loss.item())
        loss.backward()
        optimizer.step()
    print('Epoch {}, loss {}, bce {}, dice {}'.format(
        epoch, sum(l)/len(l), sum(lc)/len(lc), sum(ld)/len(ld)))
Here's the rest of the code (I downloaded it from a Kaggle kernel): https://github.com/bluesky314/Salt-Segmentation/blob/master/kernel-2.ipynb (the training shown here is from running that cell (14) a second time, so the ups and downs don't appear, but they can be seen in the plot).
dataset = DatasetSalt(limit_paths=10) just limits the dataset to a given number of examples by taking only the first few image paths.
Would really appreciate any help; I've been struggling with this for 8+ hours.

Number of observations when using plm with first differences

I have a simple issue after running a regression with panel data using plm with a dataset that resembles the one below:
dataset <- data.frame(id = rep(c(1,2,3,4,5), 2),
time = rep(c(0,1), each = 5),
group = rep(c(0,1,0,0,1), 2),
Y = runif(10,0,1))
model <-plm(Y ~ time*group, method = 'fd', effect = 'twoways', data = dataset,
index = c('id', 'time'))
summary(model)
stargazer(model)
As you can see, both the model summary and the table displayed by stargazer say that my number of observations is 10. However, is it not more correct to say that N = 5, since the time dimension is removed after taking first differences?
You are right about the number of observations. However, your code does not do what you want it to do (estimate a first-differenced model).
If you want a first-differenced model, switch the argument name from method to model (and delete the effect argument because it does not make sense for a first-differenced model):
model <-plm(Y ~ time*group, model = 'fd', data = dataset,
index = c('id', 'time'))
summary(model)
## Oneway (individual) effect First-Difference Model
##
## Call:
## plm(formula = Y ~ time * group, data = dataset, model = "fd",
## index = c("id", "time"))
##
## Balanced Panel: n = 5, T = 2, N = 10
## Observations used in estimation: 5
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -0.3067240 -0.0012185 0.0012185 0.1367080 0.1700160
## [...]
In the summary output, you can see the number of observations in your original data (N=10) and the number of observations used in the FD model (5).