Final accuracy for double vs single precision neural nets - deep-learning

I wonder if anyone has experience with training deep neural nets on, say, the CIFAR-10 or ILSVRC-2012 datasets and comparing the final results for single- and double-precision computations?

Can't comment, but this is less of an answer and more of just information and a thought experiment. I think this question will be hard to test, simply because using 64-bit precision typically forces the computation onto the CPU instead of the GPU and increases runtime considerably.
First, a paper on the subject from Vincent Vanhoucke at Google: http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/37631.pdf. This paper focuses on optimizing deep networks on CPUs, and a big part of the optimization is using 'fixed-point SIMD' instructions, which, as explained in the next link, equates to 8-bit precision (unless I'm mistaken). An explanation of this paper can be found at http://petewarden.com/2015/05/23/why-are-eight-bits-enough-for-deep-neural-networks/.
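For intuition on what fixed-point arithmetic buys, here is a minimal sketch of symmetric linear quantization of a weight vector to 8 bits; the scheme and names are illustrative, not taken from the paper:

import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=1000).astype(np.float32)  # toy weight vector

# Symmetric linear quantization to signed 8-bit integers.
scale = np.abs(w).max() / 127.0
w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_dq = w_q.astype(np.float32) * scale  # dequantized approximation

print("max abs error:", np.abs(w - w_dq).max())
print("max error relative to weight range:", np.abs(w - w_dq).max() / np.abs(w).max())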
In my own experience, I've used 16 bit precision for my inputs instead of 32 bit for Deep Q-Learning and I've noticed no difference in performance.
I'm not an expert in low-level computing, but what are the extra digits really helping with? The point of training is to maximize the probability that the network assigns the correct class with a large margin (i.e., a softmax output of 95%+). A difference of ±0.0001 won't change the predicted class.
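To make that last point concrete, here is a toy check (my own sketch, not a benchmark) comparing a softmax computed in float32 and float64; the probabilities differ only around the seventh decimal place, and the predicted classes essentially always agree:

import numpy as np

rng = np.random.default_rng(42)
logits64 = rng.normal(size=(1000, 10))        # fake pre-softmax scores in float64
logits32 = logits64.astype(np.float32)

def softmax(x):
    # subtract the row-wise max for numerical stability
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

p64 = softmax(logits64)
p32 = softmax(logits32).astype(np.float64)

print("max probability difference:", np.abs(p64 - p32).max())
print("rows where the predicted class differs:",
      int((p64.argmax(axis=1) != p32.argmax(axis=1)).sum()))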

Related

How does the number of Gibbs sampling iterations impact Latent Dirichlet Allocation?

The documentation of MALLET mentions the following:
--num-iterations [NUMBER]
The number of sampling iterations should be a trade off between the time taken to complete sampling and the quality of the topic model.
MALLET furthermore provides an example:
// Run the model for 50 iterations and stop (this is for testing only,
// for real applications, use 1000 to 2000 iterations)
model.setNumIterations(50);
It is obvious that too few iterations lead to bad topic models.
However, does increasing the number of Gibbs sampling iterations necessarily benefit the quality of the topic model (measured by perplexity, topic coherence or on a downstream task)?
Or is it possible that the model quality decreases if --num-iterations is set too high?
On a personal project, averaged over 10-fold cross-validation, increasing the number of iterations from 100 to 1000 did not impact the average accuracy (measured as Mean Reciprocal Rank) on a downstream task. However, within the cross-validation splits the performance changed significantly, although the random seed was fixed and all other parameters were kept the same. What part of the background knowledge about Gibbs sampling am I missing to explain this behavior?
I am using a symmetric prior for alpha and beta without hyperparameter optimization and the parallelized LDA implementation provided by MALLET.
The 1000 iteration setting is designed to be a safe number for most collection sizes, and also to communicate "this is a large, round number, so don't think it's very precise". It's likely that smaller numbers will be fine. I once ran a model for 1000000 iterations, and fully half the token assignments never changed from the 1000 iteration model.
Could you be more specific about the cross validation results? Was it that different folds had different MRRs, which were individually stable over iteration counts? Or that individual fold MRRs varied by iteration count, but they balanced out in the overall mean? It's not unusual for different folds to have different "difficulty". Fixing the random seed also wouldn't make a difference if the data is different.
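If you want to probe the iterations-vs-quality question empirically without long MALLET runs, a sketch along these lines may help. Note the caveats: gensim's LdaModel uses online variational Bayes rather than MALLET's collapsed Gibbs sampler, and the corpus and parameter values here are toy placeholders:

from gensim import corpora, models

# Toy corpus; in practice use the same documents you feed to MALLET.
texts = [["topic", "model", "gibbs"], ["gibbs", "sampling", "iterations"],
         ["topic", "coherence", "perplexity"], ["sampling", "chain", "burn", "in"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

for iters in (50, 1000):
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2,
                          iterations=iters, passes=10, random_state=0)
    print(iters, "iterations, per-word bound:", lda.log_perplexity(corpus))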

What does it mean to normalize based on mean and standard deviation of images in the imagenet training dataset?

In the implementation of the DenseNet model as in the CheXNet paper, section 3.1 mentions that:
Before inputting the images into the network, we downscale the images to 224x224 and normalize based on the mean and standard deviation of images in the ImageNet training set.
Why would we want to normalize a new set of images with the mean and std of a different dataset?
How do we get the mean and std of the ImageNet dataset? Are they provided somewhere?
Subtracting the mean centers the input to 0, and dividing by the standard deviation makes any scaled feature value the number of standard deviations away from the mean.
Consider how a neural network learns its weights. C(NN)s learn by continually adding gradient error vectors (multiplied by a learning rate) computed from backpropagation to various weight matrices throughout the network as training examples are passed through.
The thing to notice here is the "multiplied by a learning rate".
If we didn't scale our input training vectors, the ranges of our distributions of feature values would likely be different for each feature, and thus the learning rate would cause corrections in each dimension that differ (proportionally speaking) from one another. We might be overcompensating a correction in one weight dimension while undercompensating in another.
This is non-ideal, as we might find ourselves in an oscillating state (unable to settle on a better optimum in cost(weights) space) or in a slow-moving state (traveling too slowly to reach a better optimum).
Original Post: https://stats.stackexchange.com/questions/185853/why-do-we-need-to-normalize-the-images-before-we-put-them-into-cnn
They used mean and std dev of the ImageNet training set because the weights of their model were pretrained on ImageNet (see Model Architecture and Training section of the paper).
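For reference, the per-channel statistics usually quoted for the ImageNet training set are the ones shipped with torchvision's pretrained models: mean [0.485, 0.456, 0.406] and std [0.229, 0.224, 0.225] for RGB images scaled to [0, 1]. A minimal preprocessing sketch, assuming a PyTorch/torchvision pipeline (the paper does not spell out the exact transform calls):

from torchvision import transforms

imagenet_mean = [0.485, 0.456, 0.406]   # per-channel mean of ImageNet training images
imagenet_std = [0.229, 0.224, 0.225]    # per-channel std of ImageNet training images

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                # downscale to the network's input size
    transforms.ToTensor(),                        # HWC uint8 [0, 255] -> CHW float [0, 1]
    transforms.Normalize(mean=imagenet_mean, std=imagenet_std),
])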

Counting the number of multiply-add operations (MAC) in Caffe CNN's architecture

Lately I've been benchmarking some CNNs regarding time, number of multiply-add operations (MAC), number of parameters and model size. I have seen some similar SO questions (here and here), and in the latter they suggest using Netscope CNN Analyzer. This tool lets me calculate most of the things I need just by inputting my Caffe network definition.
However, the number of multiply-add operations of some architectures I've seen in papers and around the internet doesn't match what Netscope outputs, whereas other architectures match. I'm always comparing either FLOPs or MAC with the MACC column in Netscope, but there is a ~10x factor that I'm forgetting at some point (check the table below for more detail).
Architecture    MAC (paper/internet)    macc column in Netscope
VGG-16          ~15.5G                  ~157G
GoogLeNet       ~1.55G                  ~16G
Reference about GoogLeNet macc number and VGG16 macc number in Netscope.
Could anybody who has used that tool point out what mistake I'm making when reading the Netscope output?
I've found what was causing the discrepancy between Netscope and the information I'd found in papers. Most preset architectures in Netscope use a batch size of 10 (this is the case for VGG and GoogLeNet, for example), hence the x10 factor multiplying the number of mult-add operations.
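As a sanity check on that factor, convolution MACs can be counted per image and then multiplied by the prototxt batch size; the helper below is an illustrative sketch, not Netscope's actual code:

def conv_macs(out_h, out_w, out_c, k_h, k_w, in_c, batch=1, groups=1):
    # Multiply-accumulate operations of one convolution layer (bias ignored).
    return batch * out_h * out_w * out_c * (k_h * k_w * in_c // groups)

# First conv layer of VGG-16: 224x224x64 output, 3x3 kernel over 3 input channels.
print(conv_macs(224, 224, 64, 3, 3, 3))            # ~86.7M MACs for one image
print(conv_macs(224, 224, 64, 3, 3, 3, batch=10))  # 10x larger with batch size 10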

RNN L2 Regularization stops learning

I use a bidirectional RNN to detect an event with unbalanced occurrence. The positive class occurs 100 times less often than the negative class.
With no regularization I can get 100% accuracy on the training set and 30% on the validation set.
When I turn on L2 regularization, the result is only 30% accuracy on the training set as well, instead of the longer learning and 100% validation accuracy I was hoping for.
I was thinking that maybe my data set is too small, so just as an experiment I merged the training set with the test set, which I had not used before. The situation was the same as when I used L2 regularization, even though I did not use it this time: I got 30% accuracy on train+test and on validation.
I used 128 hidden units and 80 timesteps in the experiments mentioned above.
When I increased the number of hidden units to 256, I could again overfit the train+test set to reach 100% accuracy, but still got only 30% on the validation set.
I have tried many hyperparameter options with almost no result. Maybe the weighted cross-entropy is causing the problem; in the experiments above the weight on the positive class is 5. With larger weights the results are often worse, around 20% accuracy.
I tried LSTM and GRU cells, with no difference.
The best results I got: 2 hidden layers with 256 hidden units each, which took around 3 days of computation and 8 GB of GPU memory. I got around 40-50% accuracy before it started overfitting again, with L2 regularization on but not so strong.
Is there some general guideline for what to do in this situation? I was not able to find anything.
Too many hidden units can overfit your model. You can try a smaller number of hidden units. As you mentioned, training with more data might improve the performance. If you don't have enough data, you can generate artificial data: researchers add distortions to their training data to increase its size, but in a controlled way. This strategy works well for image data; if you are dealing with text data, you can probably use a knowledge base to improve the performance.
There is a lot of ongoing work that uses knowledge bases to solve NLP and deep-learning-related tasks.
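On the weighted cross-entropy point: a common heuristic is to tie the positive-class weight to the observed imbalance rather than hand-tuning it. A minimal sketch, assuming a PyTorch-style setup (the post does not say which framework was used), and keeping in mind that the poster found very large weights can make things worse:

import torch
import torch.nn as nn

# Suppose the positive class is ~100x rarer than the negative class.
n_pos, n_neg = 1_000, 100_000

# Weight the positive class by the inverse class-frequency ratio as a starting point.
pos_weight = torch.tensor([n_neg / n_pos])           # = 100.0
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(32, 1)                          # raw network outputs for a batch
targets = torch.randint(0, 2, (32, 1)).float()       # 0 = negative, 1 = positive
print(criterion(logits, targets).item())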

How to accelerate preconditioned conjugate gradient using cusparse?

I am working on a CFD project and I am using the new CUDA 5 library "cusparse" to solve a system of linear equations. I tested the sample code "conjugateGradientPrecond". The results show that the preconditioned conjugate gradient using ILU took more time to reach the final answer than the conjugate gradient without preconditioning. The former algorithm does need fewer iterations, but it spends too much time in "cusparseScsrsv_solve", so the overall time is longer.
Here is my question: is there any other preconditioner for the conjugate gradient that can greatly decrease the iteration count while avoiding time-consuming calls like "cusparseScsrsv_solve"?
Preconditioning techniques such as ILU0/ILUT and IC0/ICT require solving a triangular system twice at each CG iteration (once for the lower and once for the upper factor of the preconditioning matrix). By nature, solving triangular systems is a sequential problem, but for sparse matrices an analysis phase can be performed to extract some degree of parallelism (refer to this post). In general, for sparse systems one cannot single out a best preconditioning technique, but the simple diagonal (aka Jacobi) preconditioner imposes negligible overhead and offers a high degree of parallelism in a GPU implementation.
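To illustrate why the diagonal preconditioner avoids the sequential bottleneck, here is a plain NumPy/SciPy sketch of Jacobi-preconditioned CG: applying the preconditioner is just an element-wise division, so there is no triangular solve, and in a cuSPARSE/cuBLAS version the sparse matrix-vector product and the vector updates map directly onto library calls:

import numpy as np
import scipy.sparse as sp

def jacobi_pcg(A, b, tol=1e-8, maxiter=1000):
    # Conjugate gradient with diagonal (Jacobi) preconditioning.
    inv_diag = 1.0 / A.diagonal()        # preconditioner application is element-wise
    x = np.zeros_like(b)
    r = b - A @ x
    z = inv_diag * r
    p = z.copy()
    rz = r @ z
    for k in range(maxiter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, k + 1
        z = inv_diag * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, maxiter

# Toy symmetric positive-definite, diagonally dominant system.
n = 1000
A = sp.diags([-1.0, 4.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)
x, iters = jacobi_pcg(A, b)
print("converged in", iters, "iterations; residual norm:", np.linalg.norm(b - A @ x))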