Is the GFLOPS figure in object detection theoretically a case of "the bigger the better"? What metrics does this value affect? - yolov5

Model Summary: 290 layers, 8580290 parameters, 8580290 gradients, 7.9 GFLOPS
Model Summary: 277 layers, 8590690 parameters, 8590690 gradients, 33.3 GFLOPS
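The GFLOPs figure in such a summary is the estimated compute for a single forward pass at a fixed input size. A minimal sketch of how such a figure can be measured (this assumes the thop package; a torchvision model is used here as a stand-in, not YOLOv5 itself):

import torch
from thop import profile               # pip install thop; counts multiply-accumulates (MACs)
from torchvision.models import resnet18

model = resnet18()                     # stand-in model, only the shapes matter for illustration
dummy = torch.randn(1, 3, 224, 224)    # FLOPs depend on the chosen input resolution
macs, params = profile(model, inputs=(dummy,))
print(f"{macs * 2 / 1e9:.1f} GFLOPs at 224x224, {params / 1e6:.1f}M parameters")  # FLOPs ~= 2 x MACs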

Related

VGG16 architecture: where does the number 4096 in the dense layers come from? Any mathematical explanation?

In VGG16, the last 4 layers are:
Convolution using 512 filters+Max pooling
Fully connected with 4096 nodes
Fully connected with 4096 nodes
Output layer with softmax activation and 1000 nodes.
Where did the number 4096 come from?
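For reference, a quick way to inspect those layer sizes (assuming torchvision is installed; this only prints the classifier head and does not download weights):

from torchvision import models

vgg = models.vgg16()      # random weights are fine just to inspect shapes
print(vgg.classifier)
# (0) Linear(in_features=25088, out_features=4096)   <- 25088 = 7 x 7 x 512 flattened feature map
# (3) Linear(in_features=4096,  out_features=4096)
# (6) Linear(in_features=4096,  out_features=1000)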

Is a convolution kernel size of 1 meaningful?

I have time series data of size 32x32 and I used a sliding window to form a 3D array of shape timesteps x 32 x 32, so my inputs have shape (batch_size x timesteps x 32 x 32).
I tried a (1,3,3) Conv3D filter and a (3,3) TimeDistributed Conv2D filter separately. The performance is different.
Can I say that a convolution kernel of size 1 (along time) can still extract features? What is the difference between these two kernels?
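A small Keras sketch of the two layers (the filter count, padding and the extra channel axis are assumptions, not from the question); it shows that a (1,3,3) Conv3D kernel never mixes information across timesteps, so per frame it sees the same 3x3 neighbourhood as a TimeDistributed Conv2D:

import tensorflow as tf

x = tf.random.normal((4, 8, 32, 32, 1))   # (batch, timesteps, 32, 32, channels=1)

conv3d = tf.keras.layers.Conv3D(16, kernel_size=(1, 3, 3), padding="same")
td_conv2d = tf.keras.layers.TimeDistributed(
    tf.keras.layers.Conv2D(16, kernel_size=(3, 3), padding="same"))

print(conv3d(x).shape)     # (4, 8, 32, 32, 16) -- kernel depth 1 in time: each frame is filtered independently
print(td_conv2d(x).shape)  # (4, 8, 32, 32, 16) -- same spatial receptive field per frame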

Question on the kernel dimensions for convolutions on mel filter bank features

I am currently trying to understand the following paper: https://arxiv.org/pdf/1703.08581.pdf. I am struggling to understand a part about how a convolution is performed on an input of log mel filterbank features:
We train seq2seq models for both end-to-end speech translation, and a baseline model for speech recognition. We found
that the same architecture, a variation of that from [10], works
well for both tasks. We use 80 channel log mel filterbank features extracted from 25ms windows with a hop size of 10ms,
stacked with delta and delta-delta features. The output softmax
of all models predicts one of 90 symbols, described in detail in
Section 4, that includes English and Spanish lowercase letters.
The encoder is composed of a total of 8 layers. The input
features are organized as a T × 80 × 3 tensor, i.e. raw features,
deltas, and delta-deltas are concatenated along the ’depth’ dimension. This is passed into a stack of two convolutional layers
with ReLU activations, each consisting of 32 kernels with shape
3 × 3 × depth in time × frequency. These are both strided by
2 × 2, downsampling the sequence in time by a total factor of 4,
decreasing the computation performed in the following layers.
Batch normalization [26] is applied after each layer.
As I understand it, the input to the convolutional layer is 3-dimensional: number of 25 ms windows (T) x 80 (features for each window) x 3 (features, delta features and delta-delta features). However, the kernels used on those inputs seem to have 4 dimensions, and I do not understand why that is. Wouldn't a 4-dimensional kernel need a 4-dimensional input?
In my head, the input has the same dimensions as an RGB picture: width (time) x height (frequency) x colour channels (features, delta features and delta-delta features). Therefore I would think of a kernel for a 2D convolution as a filter of size a (filter width) x b (filter height) x 3 (depth of the input). Am I missing something here? What is wrong with my idea, or what is done differently in this paper?
Thanks in advance for your answer!
I figured it out; it was just a misunderstanding on my side. The authors are using 32 kernels of shape 3x3, which results (after two layers with 2x2 striding) in an output of shape T/4 x 20 x 32, where T stands for the time dimension.
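For anyone else who was confused, a tiny Keras sketch of the shape arithmetic (the "same" padding here is an assumption; the depth/channel axis is consumed automatically by each 3x3 kernel):

import tensorflow as tf

T = 100                               # arbitrary number of 25 ms frames
x = tf.random.normal((1, T, 80, 3))   # time x mel bins x (raw, delta, delta-delta)

conv1 = tf.keras.layers.Conv2D(32, (3, 3), strides=(2, 2), padding="same", activation="relu")
conv2 = tf.keras.layers.Conv2D(32, (3, 3), strides=(2, 2), padding="same", activation="relu")

print(conv2(conv1(x)).shape)          # (1, 25, 20, 32), i.e. T/4 x 20 x 32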

Convergence failure while training GAN for 128x128 images

Thanks for looking at this question!
I attempted to train a simple DCGAN to generate room designs from a dataset of 215 coloured images of size 128x128. My attempt can be summarised as follows:
Generator: 5 deconvolution layers from (100x1) noise input to (128x128x1) grayscale image output (a sketch follows after this list)
Discriminator: 4 convolution layers from (128x128x1) grayscale image input
Optimizer: Adam at learning rate of 0.002 for both Generator and Discriminator
Batch size: 21 images/batch
Epoch: 100 epochs with 10 batches/epoch
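For concreteness, a simplified Keras sketch of a generator matching the description above; only the 100-dimensional noise input, the five transposed-convolution layers and the 128x128x1 output follow the description, while the layer widths and kernel sizes are placeholders:

import tensorflow as tf
from tensorflow.keras import layers

def build_generator():
    return tf.keras.Sequential([
        tf.keras.Input(shape=(100,)),                # noise vector
        layers.Dense(4 * 4 * 256),
        layers.Reshape((4, 4, 256)),
        # five transposed-conv ("deconvolution") layers: 4 -> 8 -> 16 -> 32 -> 64 -> 128
        layers.Conv2DTranspose(128, 4, strides=2, padding="same", activation="relu"),
        layers.Conv2DTranspose(64, 4, strides=2, padding="same", activation="relu"),
        layers.Conv2DTranspose(32, 4, strides=2, padding="same", activation="relu"),
        layers.Conv2DTranspose(16, 4, strides=2, padding="same", activation="relu"),
        layers.Conv2DTranspose(1, 4, strides=2, padding="same", activation="tanh"),
    ])

print(build_generator().output_shape)  # (None, 128, 128, 1)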
Results:
1. D-loss was close to 0 and G-loss close to 1. After this, I cut my discriminator down by 2 convolution layers and reduced the Adam learning rate to 0.00002, hoping that the discriminator would not overpower my generator.
2. After (1), D-loss and G-loss hover around 0.5 - 1.0. However, the generated images still look like noise even after 100 epochs.
Questions:
Is there something wrong in terms of how I trained my GAN?
How should I modify my approach to successfully train the GAN?
Thank you so much everyone for your help, really looking forward to your answers!

Neural network with sigmoid neurons does not learn if a constant is added to all weights and biases after initialization

I'm about to experiment with a neural network for handwriting recognition, which can be found here:
https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/src/network.py
If the weights and biases are randomly initialized, it recognizes over 80% of the digits after a few epochs. If I add a small constant of 0.27 to all weights and biases after initialization, learning is much slower, but eventually it reaches the same accuracy of over 80%:
self.biases = [np.random.randn(y, 1)+0.27 for y in sizes[1:]]
self.weights = [np.random.randn(y, x)+0.27 for x, y in zip(sizes[:-1], sizes[1:])]
Epoch 0 : 205 / 2000
Epoch 1 : 205 / 2000
Epoch 2 : 205 / 2000
Epoch 3 : 219 / 2000
Epoch 4 : 217 / 2000
...
Epoch 95 : 1699 / 2000
Epoch 96 : 1706 / 2000
Epoch 97 : 1711 / 2000
Epoch 98 : 1708 / 2000
Epoch 99 : 1730 / 2000
If I add a small constant of 0.28 to all weights and biases after initialization, the network isn't learning at all anymore.
self.biases = [np.random.randn(y, 1)+0.28 for y in sizes[1:]]
self.weights = [np.random.randn(y, x)+0.28 for x, y in zip(sizes[:-1], sizes[1:])]
Epoch 0 : 207 / 2000
Epoch 1 : 209 / 2000
Epoch 2 : 209 / 2000
Epoch 3 : 209 / 2000
Epoch 4 : 209 / 2000
...
Epoch 145 : 234 / 2000
Epoch 146 : 234 / 2000
Epoch 147 : 429 / 2000
Epoch 148 : 234 / 2000
Epoch 149 : 234 / 2000
I think this has to do with the sigmoid function, which gets very flat when its output is close to zero or one. But what happens at the point where the mean of the weights and biases is 0.28? Why is there such a steep drop in the number of recognized digits? And why are there outliers like the 429 above?
Initialization plays a big role in training networks. A good initialization can make training and convergence a lot faster, while a bad one can make it many times slower. It can even determine whether the network converges at all.
You might want to read this for some more information on the topic:
https://towardsdatascience.com/weight-initialization-in-neural-networks-a-journey-from-the-basics-to-kaiming-954fb9b47c79
By adding 0.27 to all weights and biases you probably shift the network away from the optimal solution and increase the gradients. Depending on the layer count this can lead to exploding gradients, so you get very big weight updates every iteration. What could be happening is that some weight is 0.3 (after adding 0.27 to it) while, say, the optimal value would be 0.1. You then get an update of -0.4 and land at -0.1. The next update might be 0.4 (or something close) and you are back at the original problem. So instead of moving slowly towards the optimal value, the optimization just overshoots and bounces back and forth. This might settle down after some time or lead to no convergence at all, since the network just keeps bouncing around.
Also, in general you want biases to be initialized to 0 or very close to zero. If you experiment with this further, you might want to try not adding 0.27 to the biases and instead setting them to 0 (or something close to 0) initially. Maybe by doing this the network can actually learn again.
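To see how far such a shift pushes the first layer's pre-activations into the flat region of the sigmoid (the effect you suspected), here is a small NumPy sketch; the uniform [0,1] inputs are only a stand-in for MNIST pixel intensities and the 30-neuron layer is illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 784)                  # stand-in for one MNIST image (pixels in [0, 1])

for shift in (0.0, 0.27, 0.28):
    W = rng.standard_normal((30, 784)) + shift  # first hidden layer: 30 neurons, shifted weights
    b = rng.standard_normal(30) + shift         # shifted biases
    z = W @ x + b                               # pre-activations
    a = sigmoid(z)
    print(f"shift={shift:.2f}  mean z={z.mean():7.2f}  mean sigmoid'(z)={np.mean(a * (1 - a)):.6f}")
# With a positive shift, every pre-activation lands far from zero, where the sigmoid is nearly flat.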