What is the reasoning that floating-point parameters are preferred over binary parameters?
Is there a specific limitation of binary parameters?
Is there something that is not possible using binary parameters, and we need floating point parameters (and gradient descent) to achieve it?
A function from binary parameters to the loss is not differentiable. More precisely, if you treat the binary values as thresholded real numbers, the gradient is zero almost everywhere, which means you cannot learn the parameters with gradient descent. You can use various hacks to get around this, but the result will not be "standard gradient descent". Since gradient descent is the core learning method for neural networks, you are left with either a bit of alchemy (hacks that are not mathematically sound) or very inefficient alternatives (e.g. genetic algorithms). Overall, the standard practice is to train in a proper, smooth space (e.g. floats) and then, if needed, binarise / discretise your model.
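To make the "zero gradient" point concrete, here is a minimal sketch in plain Python. The one-weight model, the names `binary_loss`, `smooth_loss` and `finite_diff`, and the values of `x` and `target` are all made up for illustration. It estimates the derivative numerically for a thresholded (binary) weight and for an ordinary float weight, then trains in float space and binarises at the end.

```python
def binary_loss(w_real, x=2.0, target=1.0):
    # The weight is binarised to +1 / -1 before it is used (hypothetical toy model).
    w_bin = 1.0 if w_real >= 0 else -1.0
    return (w_bin * x - target) ** 2


def smooth_loss(w, x=2.0, target=1.0):
    # Same toy model, but the float weight is used directly.
    return (w * x - target) ** 2


def finite_diff(f, w, eps=1e-4):
    # Central finite-difference estimate of df/dw.
    return (f(w + eps) - f(w - eps)) / (2 * eps)


# Away from the threshold, nudging the weight does not change the binarised loss.
grad_binary = finite_diff(binary_loss, 0.3)
grad_smooth = finite_diff(smooth_loss, 0.3)

# Train in the smooth space, then binarise the result.
w = 0.3
for _ in range(100):
    w -= 0.1 * finite_diff(smooth_loss, w)
w_binarised = 1.0 if w >= 0 else -1.0
```

The binary version gives gradient descent nothing to follow, while the smooth version has a usable slope at the same point.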
I have been reading the Deep Learning book by Ian Goodfellow and it mentions in Section 6.5.7 that
The main memory cost of the algorithm is that we need to store the input to the nonlinearity of the hidden layer.
I understand that backprop stores the gradients in a fashion similar to dynamic programming so as not to recompute them. But I am confused as to why it stores the input as well.
Backpropagation is a special case of reverse mode automatic differentiation (AD).
In contrast to the forward mode, the reverse mode has the major advantage that you can compute the derivative of an output w.r.t. all inputs of a computation in one pass.
However, the downside is that you need to store all intermediate results of the algorithm you want to differentiate in a suitable data structure (like a graph or a Wengert tape) for as long as you are computing its Jacobian with reverse mode AD, because you're basically "working your way backwards" through the algorithm.
Forward mode AD does not have this disadvantage, but you need to repeat its calculation for every input, so it only makes sense if your algorithm has a lot more output variables than input variables.
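As a rough illustration of why the input to the nonlinearity has to be kept around, here is a minimal hand-written reverse pass in plain Python for a one-unit "network" y = w2 * tanh(w1 * x). The function names and the dictionary used as a tape are assumptions made for this sketch, not how any particular framework does it.

```python
import math


def forward(x, w1, w2):
    # Forward pass; every intermediate value is recorded on a "tape".
    z = w1 * x          # input to the nonlinearity
    h = math.tanh(z)    # hidden activation
    y = w2 * h
    tape = {"x": x, "w1": w1, "w2": w2, "z": z, "h": h}
    return y, tape


def backward(tape):
    # Walk the computation backwards, starting from dy/dy = 1.
    dy = 1.0
    dw2 = dy * tape["h"]                            # needs the stored activation
    dh = dy * tape["w2"]
    dz = dh * (1.0 - math.tanh(tape["z"]) ** 2)     # needs the stored input z
    dw1 = dz * tape["x"]
    dx = dz * tape["w1"]
    return {"x": dx, "w1": dw1, "w2": dw2}


y, tape = forward(0.5, 1.5, -2.0)
grads = backward(tape)
```

The local derivative of tanh is evaluated at `z`, so without the stored `z` (or something equivalent, such as `h`) the backward pass would have to rerun the forward computation.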
According to the documentation on pre-trained computer vision models for transfer learning (e.g., here), input images should come in "mini-batches of 3-channel RGB images of shape (3 x H x W), where H and W are expected to be at least 224".
However, when running transfer learning experiments on 3-channel images with height and width smaller than expected (e.g., smaller than 224), the networks generally run smoothly and often achieve decent performance.
Hence, it seems to me that the "minimum height and width" is somehow a convention and not a critical parameter. Am I missing something here?
There is a limitation on your input size, which corresponds to the receptive field of the last convolutional layer of your network. Intuitively, you can observe the spatial dimensionality decreasing as you progress through the network. At least this is the case for feature-extractor CNNs, which aim at extracting feature embeddings from the input image. That is, most pre-trained models, such as vanilla VGG and ResNet networks, do not retain spatial dimensionality. If the input of a convolutional layer is smaller than the kernel size (even when padded), then you simply won't be able to perform the operation.
TLDR: adaptive pooling layer
For example, the standard resnet50 model accepts input sizes only in the range 193-225, and this is due to the architecture and its downscaling layers (see below).
The only reason the default PyTorch model works is that it uses an adaptive pooling layer, which removes the restriction on input size. So it's gonna work, but you should be ready for performance decay and other fun things :)
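A quick way to see the effect of the downscaling layers is to push an input size through the usual convolution size formula, out = floor((n + 2 * padding - kernel) / stride) + 1. The sketch below is plain Python, not torchvision code. The list of (kernel, stride, padding) triples is my assumption about the stride-2 layers of a ResNet-50-style network (stem convolution, max pool, and one strided 3x3 convolution in each of the last three stages), and the function names are made up.

```python
def conv_out(n, kernel, stride, padding):
    # Spatial output size of a convolution or pooling layer along one dimension.
    return (n + 2 * padding - kernel) // stride + 1


# Assumed stride-2 layers of a ResNet-50-style network: (kernel, stride, padding).
DOWNSAMPLING_LAYERS = [
    (7, 2, 3),  # stem convolution
    (3, 2, 1),  # max pooling
    (3, 2, 1),  # strided convolution, stage 2
    (3, 2, 1),  # strided convolution, stage 3
    (3, 2, 1),  # strided convolution, stage 4
]


def final_feature_size(n):
    # Side length of the last feature map for an n x n input.
    for kernel, stride, padding in DOWNSAMPLING_LAYERS:
        n = conv_out(n, kernel, stride, padding)
    return n


def global_average(feature_map):
    # What adaptive average pooling to 1 x 1 does: average over whatever size arrives.
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


sizes = {n: final_feature_size(n) for n in (224, 64, 32)}
```

Under these assumptions a 224 x 224 input ends up as a 7 x 7 feature map, while 64 x 64 and 32 x 32 inputs shrink to 2 x 2 and 1 x 1. A fixed 7 x 7 pooling window cannot handle those, but averaging over the whole map still works, which is why small inputs run without errors.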
Hope you will find it useful:
https://discuss.pytorch.org/t/how-can-torchvison-models-deal-with-image-whose-size-is-not-224-224/51077/3
What is Adaptive average pooling and How does it work?
https://pytorch.org/docs/stable/generated/torch.nn.AdaptiveAvgPool2d.html
https://github.com/pytorch/vision/blob/c187c2b12d86c3909e59a40dbe49555d85b98703/torchvision/models/resnet.py#L118
https://github.com/pytorch/vision/blob/c187c2b12d86c3909e59a40dbe49555d85b98703/torchvision/models/resnet.py#L151
https://developpaper.com/pytorch-implementation-examples-of-resnet50-resnet101-and-resnet152/
I have a dataset in which there is high correlation between the (500+) columns. From what I understand (and correct me if I am wrong), one of the reasons that you normalise to zero mean and a standard deviation of one is so that it is easier for an optimizer with a given learning rate to deal with many different problems, rather than having to adapt the learning rate to the scale of X.
Similarly, is there a reason why I should 'whiten' my dataset? It seems to be a common step in image processing. Would it make things easier for the optimizer somehow if the columns were independent?
I understand that classically people used to decorrelate the matrices so that the weights became more statistically significant, and also to make the matrix inversion more stable. The matrix inversion part at least seems to be non-existent when it comes to DL, since we use variations of Stochastic Gradient Descent (SGD) these days instead.
It's not really essential now. Read this note from Andrej. Normally we don't use PCA in deep learning architectures, because we don't need to reduce features: deep architectures can extract hierarchical features themselves. It's always good to zero-center the data, which means normalizing it in order to reduce variance in the batch. In any case, in CNNs we normally use a batch normalization layer, which really helps the network converge without covariate shift. Also, modern optimization techniques like Adam and RMSprop make the data pre-processing part less important.
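For reference, this is roughly what zero-centering plus whitening does to correlated columns. It is a small plain-Python sketch for two columns only. It uses Cholesky whitening rather than the PCA variant because the 2 x 2 case can be written out by hand, and `whiten_2d`, `covariance` and the sample rows are invented for illustration.

```python
import math


def covariance(rows):
    # Population covariance entries (var_x, cov_xy, var_y) of two-column data.
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    a = sum((r[0] - mx) ** 2 for r in rows) / n
    b = sum((r[0] - mx) * (r[1] - my) for r in rows) / n
    c = sum((r[1] - my) ** 2 for r in rows) / n
    return a, b, c


def whiten_2d(rows):
    # Zero-center, then apply the inverse Cholesky factor of the covariance.
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    a, b, c = covariance(rows)
    # Covariance = L * L^T with L = [[l11, 0], [l21, l22]].
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 * l21)
    whitened = []
    for x, y in rows:
        u, v = x - mx, y - my
        z1 = u / l11                # forward substitution, first row of L
        z2 = (v - l21 * z1) / l22   # forward substitution, second row of L
        whitened.append((z1, z2))
    return whitened


# Two strongly correlated columns (made-up numbers).
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
white = whiten_2d(rows)
var_1, cov_12, var_2 = covariance(white)
```

After the transform both columns have unit variance and zero covariance, so a single learning rate sees the same scale in every direction. Whether that is worth the extra step with batch normalization and adaptive optimizers is the judgment call discussed above.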
I'm enrolled in Coursera ML class and I just started learning about neural networks.
One thing that truly mystifies me is how recognizing something so “human”, like a handwritten digit, becomes easy once you find the good weights for linear combinations.
It is even crazier when you understand that something seemingly abstract (like a car) can be recognized just by finding some really good parameters for linear combinations, and combining them, and feeding them to each other.
Combinations of linear combinations are much more expressive than I once thought.
This led me to wonder if it is possible to visualize a NN's decision process, at least in simple cases.
For example, if my input is 20x20 greyscale image (i.e. total 400 features) and the output is one of 10 classes corresponding to recognized digits, I would love to see some kind of visual explanation of which cascades of linear combinations led the NN to its conclusion.
I naïvely imagine that this may be implemented as visual cue over the image being recognized, maybe a temperature map showing “pixels that affected the decision the most”, or anything that helps to understand how neural network worked in a particular case.
Is there some neural network demo that does just that?
This is not a direct answer to your question. I would suggest you take a look at convolutional neural networks (CNN). In CNNs you can almost see the concept that is learned. You should read this publication:
Y. LeCun, L. Bottou, Y. Bengio and P. Haffner: Gradient-Based Learning Applied to Document Recognition, Proceedings of the IEEE, 86(11):2278-2324, November 1998
CNNs are often called "trainable feature extractors". In fact, CNNs implement 2D filters with trainable coefficients. This is why the activations of the first layers are usually shown as 2D images (see Fig. 13). In this paper the authors use another trick to make the networks even more transparent: the last layer is a radial basis function layer (with Gaussian functions), i.e. the distance to an (adjustable) prototype for each class is calculated. You can really see the learned concepts by looking at the parameters of the last layer (see Fig. 3).
Note that CNNs are still artificial neural networks; the layers are just not fully connected, and some neurons share the same weights.
Maybe it doesn't answer the question directly, but I found this interesting piece in this Andrew Ng, Jeff Dean, Quoc Le, Marc’Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen and Greg Corrado paper (emphasis mine):
In this section, we will present two visualization techniques to verify if the optimal stimulus of the neuron is indeed a face. The first method is visualizing the most responsive stimuli in the test set. Since the test set is large, this method can reliably detect near optimal stimuli of the tested neuron. The second approach is to perform numerical optimization to find the optimal stimulus
...
These visualization methods have complementary strengths and weaknesses. For instance, visualizing the most responsive stimuli may suffer from fitting to noise. On the other hand, the numerical optimization approach can be susceptible to local minima. Results, shown [below], confirm that the tested neuron indeed learns the concept of faces.
In other words, they take a neuron that is best-performing at recognizing faces and
select images from the dataset that cause it to output the highest confidence;
mathematically find an image (not in the dataset) that would get the highest confidence.
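The second approach (numerical optimization of the input) can be sketched on a toy scale. The code below is plain Python and is only an illustration under my own assumptions: a single sigmoid neuron with made-up weights stands in for the tested neuron, the input is constrained to unit length so the optimization has a finite answer, and the names `neuron` and `optimal_stimulus` are hypothetical.

```python
import math


def neuron(x, w, b):
    # Sigmoid activation of one neuron for input x.
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-s))


def optimal_stimulus(w, b, steps=300, lr=0.5):
    # Gradient ascent on the input, projected back onto the unit sphere each step.
    n = len(w)
    x = [1.0 / math.sqrt(n)] * n
    for _ in range(steps):
        a = neuron(x, w, b)
        grad = [a * (1.0 - a) * wi for wi in w]   # d(activation) / d(input)
        x = [xi + lr * gi for xi, gi in zip(x, grad)]
        norm = math.sqrt(sum(xi * xi for xi in x))
        x = [xi / norm for xi in x]
    return x


w = [0.6, -0.8, 0.0]   # made-up weights with unit length
b = 0.0
best = optimal_stimulus(w, b)
```

For a single neuron the answer is unsurprising: the input that excites it most points in the direction of its weight vector. For a neuron deep inside a network the same loop applies, with the gradient obtained by backpropagating to the input, and the result is the kind of "optimal input" image shown in the paper.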
It's fun to see that it actually “captures” features of the human face.
The learning is unsupervised, i.e. input data didn't say whether an image is a face or not.
Interestingly, here are generated “optimal input” images for cat heads and human bodies:
I am trying to implement the cosine and sine functions in floating point (but I have no floating point hardware).
Since my processor has no floating-point hardware, nor instructions, I have already implemented algorithms for floating point multiplication, division, addition, subtraction, and square root. So those are the tools I have available to me to implement cosine and sine.
I was considering using the CORDIC method, at this site
However, I implemented division and square root using Newton's method, so I was hoping to use the most efficient method.
Please don't tell me to just go look in a book or that "papers exist"; no kidding, they exist. I am looking for names of well-known algorithms that are known to be fast and efficient.
First off, depending on your accuracy requirements, this can be considerably fussier than your earlier questions.
Now that you've been warned: you'll first want to reduce the argument modulo pi/2 (or 2pi, or pi, or pi/4) to get the input into a manageable range. This is the subtle part. For a nice discussion of the issues involved, download a copy of K.C. Ng's ARGUMENT REDUCTION FOR HUGE ARGUMENTS: Good to the Last Bit. (A simple Google search on the title will get you a PDF.) It's very readable, and does a great job of describing why this is tricky.
After doing that, you only need to approximate the functions on a small range around zero, which is easily done via a polynomial approximation. A Taylor series will work, though it is inefficient. A truncated Chebyshev series is easy to compute and reasonably efficient; computing the minimax approximation is better still. This is the easy part.
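The two steps can be sketched as follows, in Python for readability. The same structure carries over to a software floating-point routine that only has add, subtract, multiply and divide. This is a sketch under loud assumptions: the reduction is the naive `x - k * pi/2`, which loses accuracy for large arguments (that is exactly the problem Ng's paper addresses), and the polynomials use plain Taylor coefficients rather than Chebyshev or minimax ones. The names `sin_approx` and `cos_approx` are made up.

```python
import math

HALF_PI = math.pi / 2


def _sin_poly(r):
    # Odd Taylor polynomial up to r**9, evaluated with Horner's scheme; for |r| <= pi/4.
    r2 = r * r
    return r * (1 + r2 * (-1 / 6 + r2 * (1 / 120 + r2 * (-1 / 5040 + r2 / 362880))))


def _cos_poly(r):
    # Even Taylor polynomial up to r**8, evaluated with Horner's scheme; for |r| <= pi/4.
    r2 = r * r
    return 1 + r2 * (-1 / 2 + r2 * (1 / 24 + r2 * (-1 / 720 + r2 / 40320)))


def sin_approx(x):
    # Step 1: naive argument reduction, x = k * pi/2 + r with |r| <= pi/4.
    k = round(x / HALF_PI)
    r = x - k * HALF_PI
    # Step 2: pick the polynomial and sign from the quadrant k mod 4.
    quadrant = k % 4
    if quadrant == 0:
        return _sin_poly(r)
    if quadrant == 1:
        return _cos_poly(r)
    if quadrant == 2:
        return -_sin_poly(r)
    return -_cos_poly(r)


def cos_approx(x):
    # cos(x) = sin(x + pi/2), so the same reduction is reused.
    return sin_approx(x + HALF_PI)
```

The evaluation is a fixed number of multiplies and adds regardless of the input, so the runtime is deterministic. Swapping the Taylor coefficients for minimax ones would give the same accuracy with fewer terms.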
I have implemented sine and cosine exactly as described, entirely in integer, in the past (sorry, no public sources). Using hand-tuned assembly, results in the neighborhood of 100 cycles are entirely reasonable on "typical" processors. I don't know what hardware you're dealing with (the performance will mostly be gated on how quickly your hardware can produce the high part of an integer multiply).
For various levels of precision, you can find some good approximations here:
http://www.ganssle.com/approx.htm
With the added advantage that they are deterministic in runtime unlike the various "converging series" options which can vary wildly depending on the input value. This matters if you are doing anything real-time (games, motion control etc.)
Since you have the basic arithmetic operations implemented, you may as well implement sine and cosine using their Taylor series expansions.