I know the softmax activation function: the sum of the outputs of a layer with a softmax activation is always equal to one, that is, the output vector is normalized. This is also necessary because the maximum accumulated probability cannot exceed one. OK, this is clear.
But my question is the following: when softmax is used as a classifier, the argmax function is used to get the index of the class. So, what difference does it make whether the accumulated probability is one or higher, if the important parameter is the index that identifies the correct class?
An example in Python, where I made another softmax (it is not really a softmax function), but the classifier works in the same way as the classifier with the real softmax function:
import numpy as np
classes = 10
classes_list = ['dog', 'cat', 'monkey', 'butterfly', 'donkey',
'horse', 'human', 'car', 'table', 'bottle']
# This simulates an NN with its weights and the previous
# layer with a ReLU activation
a = np.random.normal(0, 0.5, (classes,512)) # Output from previous layer
w = np.random.normal(0, 0.5, (512,1)) # weights
b = np.random.normal(0, 0.5, (classes,1)) # bias
# correct solution:
def softmax(a, w, b):
a = np.maximum(a, 0) # ReLU simulation
x = np.matmul(a, w) + b
e_x = np.exp(x - np.max(x))
return e_x / e_x.sum(axis=0), np.argsort(e_x.flatten())[::-1]
# approx solution (probability is upper than one):
def softmax_app(a, w, b):
a = np.maximum(a, 0) # ReLU simulation
w_exp = np.exp(w)
coef = np.sum(w_exp)
matmul = np.exp(np.matmul(a,w) + b)
res = matmul / coef
return res, np.argsort(res.flatten())[::-1]
teor = softmax(a, w, b)
approx = softmax_app(a, w, b)
class_teor = classes_list[teor[-1][0]]
class_approx = classes_list[approx[-1][0]]
print(np.array_equal(teor[-1], approx[-1]))
print(class_teor == class_approx)
The class obtained with both methods is always the same (I'm talking about predictions, not about training). I ask this because I'm implementing the softmax on an FPGA device, and with the second method two passes are not needed to calculate the softmax function: one to compute the exponentiated matrix and its sum, and a second one to perform the division.
Let's review the uses of softmax:
You should use softmax if:
You are training a NN and want to limit the range of output values during training (you could use other activation functions instead). This can marginally help towards clipping the gradient.
You are performing inference on a NN and you want to obtain a metric on the "degree of confidence" of your classification result (in the range of 0-1).
You are performing inference on a NN and wish to get the top K results. In this case it is recommended as a way to have a "degree of confidence" metric to compare them.
You are performing inference on several NNs (ensemble methods) and wish to average them out (otherwise their results wouldn't be easily comparable).
You should not use (or remove) softmax if:
You are performing inference on a NN and you only care about the top class. Note that the NN could have been trained with softmax (for better accuracy, faster convergence, etc.).
In your case, your insight is right: softmax as an activation function in the last layer is meaningless if your problem only requires you to get the index of the maximum value during the inference phase. Besides, since you are targeting an FPGA implementation, this would only give you extra headaches.
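As a quick sanity check of that last point: the index of the maximum is already decided before the softmax, because exp is monotonically increasing and the normalizing denominator is the same for every class. A small sketch:
import numpy as np

logits = np.random.normal(0, 1, 10)     # raw scores from the last layer
probs = np.exp(logits - logits.max())
probs /= probs.sum()                    # proper softmax, sums to 1

# the (monotonic) exponential and the shared division do not change the argmax
assert np.argmax(logits) == np.argmax(probs)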
Related
I'm currently learning how to use PyTorch to model NNs and did the "Getting Started" session on the PyTorch website.
I tried to train a PyTorch NN to apply a function, e.g. f(x) = 2x - 1, to a given list of input integers, but my model is far from learning the right thing.
How can I build and train a PyTorch model to learn a given mathematical function f(x)?
I've tried the model below and trained it with 10 random numbers, with labels generated by the 'myFunc' function, to learn the function 2x-1.
Thanks for your help.
import torch
from torch import nn
import torch.nn.functional as F

batch_size = 10
def myFunc(a):
#y = 2x-1
return 2*a-1
class NeuralNetwork(nn.Module):
def __init__(self):
super().__init__()
self.lin1 = nn.Linear(batch_size,1)
self.lin2 = nn.Linear(1,batch_size)
def forward(self, x):
x = self.lin1(x)
x = F.relu(x)
x = self.lin2(x)
return x
model = NeuralNetwork()
Theoretically, for your example of an affine-linear function over a bounded interval, you need only
linear(bias) -> relu -> linear(bias)
with one node per linear layer. Or just one linear layer without activation.
For more general functions, you will need larger layers in the construction of the first type, with one node for every piece of a piece-wise approximation. The last layer always needs to be linear, without activation. Using more layers might give more pieces with fewer total nodes.
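For illustration, a minimal sketch of the single-linear-layer case (the hyperparameters are just placeholders, not taken from your code):
import torch
from torch import nn

# f(x) = 2x - 1 is affine, so a single Linear(1, 1) layer can represent it exactly:
# the weight should converge to 2 and the bias to -1.
model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

x = torch.randn(100, 1)      # 100 training inputs, one feature each
y = 2 * x - 1                # targets generated by the function itself

for _ in range(2000):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

print(model.weight.item(), model.bias.item())   # should end up close to 2 and -1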
Background
I'm learning about LSTMs and figured I'd try training a very basic LSTM model (big believer in learning by doing).
To start with something basic, I tried to implement an LSTM that would sum up the last 10 inputs it has seen. I generated a dataset consisting of 1000 random numbers between 0 and 1, with 1000 labels representing the sum of the previous 10 numbers (label[i] = data[i-9:i+1].sum()), and tried to train the LSTM to recognize this pattern.
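For reference, a minimal sketch of that data generation (assuming NumPy; the names are just illustrative):
import numpy as np

rng = np.random.default_rng(0)
data = rng.random(1000)                 # 1000 random numbers in [0, 1)
# label[i] is the sum of the window data[i-9:i+1]; the first 9 windows are simply shorter
labels = np.array([data[max(0, i - 9):i + 1].sum() for i in range(len(data))])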
I know that simpler models can solve this (i.e. linear regression), but I believe that an LSTM should also be able to solve this fairly basic problem.
My initial implementation does seem to work, but when I try to improve the implementation I start getting constant output values after a few timesteps; it looks like approximately the average of the training labels.
I'd appreciate any insights as to why the second and third iterations don't work, especially the third iteration, since it looks like the same implementation as what I read in the "Deep Learning for Coders with fastai & PyTorch" book.
What I've tried so far
I've done 3 iterations so far:
Initially, I generated all sub-sequences of length 10 from the input data along with the corresponding label ([(data[i-9:i+1], label[i]) for i in range(9, len(data))]) and fed this into the LSTM.
This iteration worked very well: if I feed a sequence of 10 inputs, I get an output from the LSTM that is very close to the sum. However, it is kind of cheating, in that I'm basically telling the LSTM that the sequence length is 10. I believe an LSTM should be able to infer the sequence length, so I tried to remove that bit of information.
In my second iteration, I fed the entire sequence into the LSTM at once: ([data[i] for i in range(len(data))], [label[i] for i in range(len(data))]). Basically, a single input with a sequence length of 1000, and a single output of 1000 labels.
However, after training, while running on validation data of length 100, all outputs except the first few are a constant number, approximately the average of the training labels.
In my last iteration, I tried feeding inputs one at a time to the LSTM (1000 inputs with a sequence length of 1), manually storing the hidden and cell states and passing them into the next run with the next input. This produces results similar to #2.
Network setup
For all runs, I used a single-layer LSTM with 25 hidden units, since it's a fairly simple problem. I did try adding layers with dropout or increasing the hidden size, but it didn't help.
Code samples
First iteration:
import torch
from torch import nn
from fastai.torch_core import Module  # fastai's Module, so no explicit super().__init__() call is needed

class LSTMModel(Module):
def __init__(self, seq_len, layers, input_size, hidden_size):
self.lstm1 = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=layers, bidirectional=False, batch_first=True)
self.fc = nn.Linear(hidden_size, 1)
def forward(self,x):
x, _ = self.lstm1(x)
# [:,-1,:] to grab the last output for sequence length of 10
x = self.fc(x[:,-1,:])
return x[:,0]
Second iteration:
class LSTMModel(Module):
def __init__(self, layers, input_size, hidden_size):
self.lstm1 = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=layers, bidirectional=False)
self.fc = nn.Linear(hidden_size, 1)
self.input_size = input_size
def forward(self,x):
x,h = self.lstm1(x)
x = self.fc(x)
return x
Final iteration:
class LSTMModel(Module):
def __init__(self, layers, input_size, hidden_size):
self.lstm1 = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=layers, bidirectional=False, batch_first=True)
self.fc = nn.Linear(hidden_size, 1)
self.h = [torch.zeros(layers, 1, hidden_size).cuda() for _ in range(2)]
def forward(self,x):
x,h = self.lstm1(x, self.h)
self.h = [h_.detach() for h_ in h]
x = self.fc(x)
return x
def reset(self):
for h in self.h: h.zero_()
I'm trying to build an RL model where the input is an NxM matrix, N being the number of selectable actions and M being the number of features describing each action.
In all the RL problems I've seen so far, the state space is either a vector, which is passed to a regular neural network, or an image, which is passed through a convolutional neural network.
But say we have an environment where the objective is to learn to select the strongest worker for a fixed task, and a single state representation looks like this:
import pandas as pd

names = ['Bob','Henry','Mike','Phil']
max_squat = [300,400,200,100]
max_bench = [200,100,225,100]
max_deadlift = [600,400,300,225]
strongest_worker_df = pd.DataFrame({'Name':names,'Max_Squat':max_squat,'Max_Bench':max_bench,'Max_Deadlift':max_deadlift})
I want to pass in this 2D matrix (without the Name column, of course) as an input and have it return a row index, then pass that row index as an action to the environment and get a reward, and then run a reinforcement learning algorithm on the gradient of the reward with respect to the action selection.
Any suggestions on how to go about this, specifically the state representation?
Well, as long as your matrix is of fixed size (N and M don't change), you could just vectorize it (concatenate the rows) and the network would work with that.
This is perhaps suboptimal, though, because given the problem setting it seems preferable to pass each row through the same neural net to extract features and then have a top-level discriminator that operates on the concatenated features.
An example model that would do this (in TensorFlow code) is:
from tensorflow.keras.layers import Input, Dense, Dropout, Flatten
from tensorflow.keras.models import Model

# N = number of rows (selectable actions), M = features per row, as in the question
model_input = x = Input(shape=(N, M))
x = Dense(64, activation='relu')(x)
x = Dropout(0.1)(x)
x = Dense(32, activation='relu')(x)
x = Dropout(0.1)(x)
x = Dense(16, activation='relu')(x)
x = Dropout(0.1)(x)
# The layers above this line define the feature generator; at this point
# your model has 16 features for every person, i.e. an Nx16 matrix.
# Each person's features have gone through the same nodes and have
# received the same transformations from them.
x = Flatten()(x)
# The Nx16 matrix is now flattened; below we define the discriminator,
# which has a softmax output of size N (the highest output identifies
# the selected index).
x = Dense(16, activation='relu')(x)
x = Dropout(0.1)(x)
x = Dense(16, activation='relu')(x)
x = Dropout(0.1)(x)
x = Dense(16, activation='relu')(x)
x = Dropout(0.1)(x)
x = Dense(N, activation='softmax')(x)
model = Model(inputs=model_input, outputs=x)
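To turn the model's output back into a row index during rollout, something like the following would work (a sketch; `state` stands for a single NxM observation and is my own placeholder name):
import numpy as np

probs = model.predict(state[np.newaxis, ...])[0]   # softmax scores over the N rows
action = int(np.argmax(probs))                     # greedy choice of worker/row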
I'm trying to implement the WNGrad optimizer (technically WN-Adam, Algorithm 4 in the WNGrad paper) in PyTorch. I've never implemented an optimizer in PyTorch before, so I don't know if I've done it correctly (I started from the Adam implementation). The optimizer does not make much progress and stalls, which is what I would expect given that the bj values can only increase monotonically (this happens quickly, so no progress is made), but I'm guessing I have a bug. Standard optimizers (Adam, SGD) work fine on the same model I'm trying to optimize.
Does this implementation look correct?
import torch
from torch.optim import Optimizer
class WNAdam(Optimizer):
"""Implements WNAdam algorithm.
It has been proposed in `WNGrad: Learn the Learning Rate in Gradient Descent`_.
Arguments:
params (iterable): iterable of parameters to optimize or dicts defining
parameter groups
lr (float, optional): learning rate (default: 0.1)
beta1 (float, optional): exponential smoothing coefficient for gradient.
When beta=0 this implements WNGrad.
.. _WNGrad\: Learn the Learning Rate in Gradient Descent:
https://arxiv.org/abs/1803.02865
"""
def __init__(self, params, lr=0.1, beta1=0.9):
if not 0.0 <= beta1 < 1.0:
raise ValueError("Invalid beta1 parameter: {}".format(beta1))
defaults = dict(lr=lr, beta1=beta1)
super().__init__(params, defaults)
def step(self, closure=None):
"""Performs a single optimization step.
Arguments:
closure (callable, optional): A closure that reevaluates the model
and returns the loss.
"""
loss = None
if closure is not None:
loss = closure()
for group in self.param_groups:
for p in group['params']:
if p.grad is None:
continue
grad = p.grad.data
state = self.state[p]
# State initialization
if len(state) == 0:
state['step'] = 0
# Exponential moving average of gradient values
state['exp_avg'] = torch.zeros_like(p.data)
# Learning rate adjustment
state['bj'] = 1.0
exp_avg = state['exp_avg']
beta1 = group['beta1']
state['step'] += 1
state['bj'] += (group['lr']**2)/(state['bj'])*grad.pow(2).sum()
# update exponential moving average
exp_avg.mul_(beta1).add_(1 - beta1, grad)
bias_correction = 1 - beta1 ** state['step']
p.data.sub_(group['lr'] / state['bj'] / bias_correction, exp_avg)
return loss
The paper's author has an open-source implementation on GitHub.
The WNGrad paper states that it is inspired by batch (and weight) normalization. You should use the L2 norm with respect to the weight dimensions (don't sum over everything), as shown in the algorithm.
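A sketch of what that change to the b_j bookkeeping could look like; treating dimension 0 as the "weight dimension" is an assumption on my part, not something taken from the reference implementation:
import torch

def update_bj(bj, grad, lr):
    """Accumulate one b_j per output row (dim 0) of the parameter, using that
    row's squared L2 gradient norm instead of a single sum over the whole tensor."""
    if grad.dim() > 1:
        sq_norm = grad.pow(2).reshape(grad.size(0), -1).sum(dim=1)  # shape: (rows,)
    else:
        sq_norm = grad.pow(2)                                       # per-element for 1-D params
    return bj + (lr ** 2) / bj * sq_norm
In step() above, state['bj'] would then start as torch.ones(p.size(0)) (or torch.ones_like(p) for 1-D parameters) instead of the scalar 1.0, and the parameter update would divide each row of exp_avg by its own b_j entry (unsqueezing as needed to broadcast).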
I'm using the Keras 2.0.2 Functional API (TensorFlow 1.0.1) to implement a network that takes several inputs and produces two outputs, a and b. I need to train the network using the cosine_proximity loss, such that b is the label for a. How do I do this?
Sharing my code here. The last line model.fit(..) is the problematic part because I don't have labeled data per se. The label is produced by the model itself.
from keras.models import Model
from keras.layers import Input, LSTM
from keras import losses
shared_lstm = LSTM(dim)
q1 = Input(shape=(..,.. ), name='q1')
q2 = Input(shape=(..,.. ), name='q2')
a = shared_lstm(q1)
b = shared_lstm(q2)
model = Model(inputs=[q1,q2], outputs=[a, b])
model.compile(optimizer='adam', loss=losses.cosine_proximity)
model.fit([testq1, testq2], [?????])
You can define a fake true label first. For example, define it as a 1-D array of ones of the size of your input data.
Now comes the loss function. You can write it as follows.
from keras import backend as K

def my_cosine_proximity(y_true, y_pred):
a = y_pred[0]
b = y_pred[1]
# depends on whether you want to normalize
a = K.l2_normalize(a, axis=-1)
b = K.l2_normalize(b, axis=-1)
return -K.mean(a * b, axis=-1) + 0 * y_true
I have multiplied y_true by zero and added it just so that Theano does not give a missing-input warning/error.
You should call your fit function normally, i.e. by including your fake ground-truth labels.
model.compile('adam', my_cosine_proximity) # 'adam' used as an example optimizer
model.fit([testq1, testq2], fake_y_true)
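One caveat: with two separate outputs, Keras applies the loss to each output independently, so inside the loss y_pred is only one of the two tensors and y_pred[0]/y_pred[1] index into the batch. If you want a and b together in a single loss call, one option (a sketch, assuming Keras 2.x; dim, q1, q2, a, b, testq1 and testq2 are from the question's code) is to concatenate the outputs and split them inside the loss:
import numpy as np
from keras.layers import Concatenate
from keras import backend as K

merged = Concatenate(axis=-1)([a, b])               # shape: (batch, 2 * dim)
model = Model(inputs=[q1, q2], outputs=merged)

def cosine_between_halves(y_true, y_pred):
    a_part = K.l2_normalize(y_pred[:, :dim], axis=-1)
    b_part = K.l2_normalize(y_pred[:, dim:], axis=-1)
    # y_true is a dummy target; multiplying it by zero keeps Keras happy
    return -K.sum(a_part * b_part, axis=-1) + 0 * K.sum(y_true, axis=-1)

model.compile(optimizer='adam', loss=cosine_between_halves)
fake_y = np.ones((len(testq1), 2 * dim))            # dummy targets shaped like the merged output
model.fit([testq1, testq2], fake_y)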