Does layer normalization delay training? - deep-learning

You can find my code here with the results:
https://github.com/shwe87/tfm-asr/blob/master/ASR-Spanish-Bi-RNN-17062020.ipynb
I tested two simple models for ASR in Spanish:
Model 1:
- Layer Normalization
- Bi-directional GRU
- Dropout
- Fully Connected layer
- Dropout
- Fully Connected layer as a classifier (classifies one of the alphabet chars)
Model 2:
- Conv Layer 1
- Conv Layer 2
- Fully Connected
- Dropout
- Bidirectional GRU
- Fully connected layer as a classifier
I tried with 30 epochs because I have limited resources of GPU.
The validation and training loss for these two models:
Model 1 performed not so good as expected.
Model 2 worked too well, after 20 epochs, it started overfitting (please see the graph in the notebook results) and in the output, I could actually see some words creating which seems like the labels. Although it is overfitting, it still needs training because it doesn't predict the total outcome. For start, I am happy with this model.
I tested a third complex Model.
You can find it here with the results output:
https://github.com/shwe87/tfm-asr/blob/master/ASR-DNN.ipynb
Model 3:
- Layer Normalization
- RELU
- Bidirectional GRU
- Dropout
- Stack this 10 times more.
The valid loss and training loss for this model:
I tested this on 30 epochs and there were no good results, actually, all the predictions were blank...
Is this because this complex model needs more epochs for training?
Update:
I modified the model by adding 2 convolutional layer before the stacked GRU and the model seems to have improved.
I see that in the first model and the third model I applied layer normalization and both's prediction seems to be very bad....Does layer normalization makes the learning delay? But according to papers like:
https://www.arxiv-vanity.com/papers/1607.06450/ layer normalization speeds up the training and also helps in speeding the training loss. So, I am really confused. I have limited resources of GPU and I am not sure if I should go for another try without layer normalization.......

Related

How can you increase the accuracy of ResNet50?

I'm using Resnet50 model to classify images into two classes: normal cells and cancer cells.
so I want to to increase the accuracy but i don't know what to modify.
# we are using resnet50 for transfer learnin here. So we have imported it
from tensorflow.keras.applications import resnet50
# initializing model with weights='imagenet'i.e. we are carring its original weights
model_name='resnet50'
base_model=resnet50.ResNet50(include_top=False, weights="imagenet",input_shape=img_shape, pooling='max')
last_layer=base_model.output # we are taking last layer of the model
# Add flatten layer: we are extending Neural Network by adding flattn layer
flatten=layers.Flatten()(last_layer)
# Add dense layer
dense1=layers.Dense(100,activation='relu')(flatten)
# Add dense layer to the final output layer
output_layer=layers.Dense(class_count,activation='softmax')(flatten)
# Creating modle with input and output layer
model=Model(inputs=base_model.inputs,outputs=output_layer)
model.compile(Adamax(learning_rate=.001), loss='categorical_crossentropy', metrics=['accuracy'])
There were 48 errors in 534 test cases Model accuracy= 91.01 %
Also what do you think about the results of the graph?
this is the classification report
i got good results but is there a possibility to increase accuracy more than that?
This is a broad question as there are many ways one can attempt to generally improve the network's accuracy. some of which may be
Increase the dimension of the layers that are learned in transfer learning (make sure not to overfit)
Use transfer learning with Convolution layers and not MLP
let the optimization algorithm choose the learning rate on its own
Play with additional augmentations to the dataset
and the list goes on.
Also, if possible, I would suggest comparing your results to other publicly available benchmarks - by doing so you might understand the upper bounds of the accuracies better

Fully connected neural network with constant loss

I am working on a project to predict soccer player values from a set of inputs. The data consists of about 19,000 rows and 8 columns (7 columns for input and 1 column for the target) all of numerical values.
I am using a fully connected Neural Network for the prediction but the problem is the loss is not decreasing as it should.
The loss is very large (1e+13) and doesn’t decrease as it should, it just fluctuates.
This is the function I am using to run the model:
def gradient_descent(model, learning_rate, num_epochs, data_loader, criterion):
losses = []
optimizer = torch.optim.Adam(model.parameters())
for epoch in range(num_epochs): # one epoch
for inputs, outputs in data_loader: # one iteration
inputs, outputs = inputs.to(torch.float32), outputs.to(torch.float32)
logits = model(inputs)
loss = criterion(torch.squeeze(logits), outputs) # forward-pass
optimizer.zero_grad() # zero out the gradients
loss.backward() # compute the gradients (backward-pass)
optimizer.step() # take one step
losses.append(loss.item())
loss = sum(losses[-len(data_loader):]) / len(data_loader)
print(f'Epoch #{epoch}: Loss={loss:.3e}')
return losses
The model is fully connected neural network with 4 hidden layers, each with 7 neurons. input layer has 7 neurons and output has 1. I am using MSE for loss function. I tried changing the learning rate but it is still bad.
What could be the reason behind this?
Thank you!
It is difficult to diagnose your problem from the information you provided, but I'll try to point you in some useful directions.
Data Normalization:
The way we initialize the weights in deep NN has a significant effect on the training process. See, e.g.:
He, K., Zhang, X., Ren, S. and Sun, J., Delving deep into rectifiers: Surpassing human-level performance on imagenet classification (ICCV 2015).
Most initialization methods assume the inputs have zero mean and unit variance (or similar statistics). If your inputs violate these assumptions, you will find it difficult to train. See, e.g., this post.
Normalize the Targets:
You are trying to solve a regression problem (MSE loss), it might be the case that your targets are poorly scaled and causing very large loss values. Try and normalize the targets to span a more compact range.
Learning Rate:
Try and adjust your learning rate: both increasing it and decreasing it by orders of magnitude.

Small validation accuracy ResNet 50

https://github.com/slavaglaps/ResNet_cifar10/blob/master/resnet.ipynb
This is my model trained in 100 epochs
Accuracy on similar models and similar data reaches 90%
What is my problem?
I think it's worth reducing the learning rate with the passage of the epochs.
What do you think that can help me?
There are a few subtle differences.
You are trying to apply ImageNet style architecture to Cifar-10. First convolution is 3 x 3, not 7 x 7. There is no max-pooling layer. The image is downsampled purely by using stride-2 convolutions.
You should probably do mean-centering by keeping featurewise_center = True in ImageDataGenerator.
Do not use very high number of filters such as [512, 1024, 2048]. There are only 50,000 images for you to train unlike ImageNet which has about a million.
In short, read up section 4.2 in the deep residual network paper and try to replicate the network. You may also read this blog.

Implementing joint learning in keras

i am trying to implement a model that is composed of two layers to segment object candidates in keras
So basically this model has the following architecture
Image(channel,width,height) -> multiple convolution and pooling layers- > output('n' feature maps , height width )
Now this single output is being used by two layers
which are as follows
1) convolution (1*1) - > dense layer with m units (output = n * 1*1 ) - > pixel classifier using fully connected layers of h*w dimesion -> upsmapling to (H,N) - > output
2) convolution -> maxpooling->dense layer - > score
Cost function uses outputs of both these layers which is sum of binary logistic regression of each output
Now I have two questions
1) how to implement dense connection over convoluted output in layer 1 to produce h*w pixel classifier as mentioned above
2) How to merge the two layers to calculate the single cost function and then train both the layers jointly using back-propagation
Can anyone tell me how to create the model for above mentioned network architecture.i am new to deep learning so if there something which i misunderstood i ll appreciate if anyone can explain me the errors in my understanding
Thanks
It's easier when you share the code you already have.
For the transition convolution to dense, you have to use model.add(Flatten()), like in the examples here.
Unfortunately, I don't know for the second question, but according to what I just read in the Keras Models, you have to use the graph model.

Hard to understand Caffe MNIST example

After going through the Caffe tutorial here: http://caffe.berkeleyvision.org/gathered/examples/mnist.html
I am really confused about the different (and efficient) model using in this tutorial, which is defined here: https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet_train_test.prototxt
As I understand, Convolutional layer in Caffe simply calculate the sum of Wx+b for each input, without applying any activation function. If we would like to add the activation function, we should add another layer immediately below that convolutional layer, like Sigmoid, Tanh, or Relu layer. Any paper/tutorial I read on the internet applies the activation function to the neuron units.
It leaves me a big question mark as we only can see the Convolutional layers and Pooling layers interleaving in the model. I hope someone can give me an explanation.
As a site note, another doubt for me is the max_iter in this solver:
https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet_solver.prototxt
We have 60.000 images for training, 10.000 images for testing. So why does the max_iter here only 10.000 (and it still can get > 99% accuracy rate)? What does Caffe do in each iteration?
Actually, I'm not so sure if the accuracy rate is the total correct prediction/test size.
I'm very amazed of this example, as I haven't found any example, framework that can achieve this high accuracy rate in that very short time (only 5 mins to get >99% accuracy rate). Hence, I doubt there should be something I misunderstood.
Thanks.
Caffe uses batch processing. The max_iter is 10,000 because the batch_size is 64. No of epochs = (batch_size x max_iter)/No of train samples. So the number of epochs is nearly 10. The accuracy is calculated on the test data. And yes, the accuracy of the model is indeed >99% as the dataset is not very complicated.
For your question about the missing activation layers, you are correct. The model in the tutorial is missing activation layers. This seems to be an oversight of the tutorial. For the real LeNet-5 model, there should be activation functions following the convolution layers. For MNIST, the model still works surprisingly well without the additional activation layers.
For reference, in Le Cun's 2001 paper, it states:
As in classical neural networks, units in layers up to F6 compute a dot product between their input vector and their weight vector, to which a bias is added. This weighted sum, denoted a_i, for unit i, is then passed through a sigmoid squashing function to produce the state of unit i ...
F6 is the "blob" between the two fully connected layers. Hence the first fully connected layers should have an activation function applied (the tutorial uses ReLU activation functions instead of sigmoid).
MNIST is the hello world example for neural networks. It is very simple to today's standard. A single fully connected layer can solve the problem with accuracy of about 92%. Lenet-5 is a big improvement over this example.