Task
Consider the following problem for a neural network: given a real number x, it should output 1 if x is equal to a (i.e. 7.123456789) and output 0 otherwise.
Dataset
For this problem, we construct a dataset of 2000 samples, of which 1000 are equal to a. The other 1000 samples are distinct from a and randomly sampled "far enough" from a, where "far enough" is determined by two parameters: gap (< 1) and sparsity (> 1). Each such sample is drawn uniformly from (a - gap*sparsity, a - gap) ∪ (a + gap, a + gap*sparsity). In other words, the gap determines the number of significant decimals and the sparsity determines the variance of x.
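For reference, a minimal NumPy sketch of this sampling scheme (my own reconstruction; the variable names are illustrative, not from the original code):

    import numpy as np

    a, gap, sparsity, n = 7.123456789123456, 1e-4, 1e2, 1000

    pos = np.full(n, a, dtype=np.float64)                    # 1000 samples equal to a
    magnitudes = np.random.uniform(gap, gap * sparsity, n)   # distance from a in (gap, gap*sparsity)
    signs = np.random.choice([-1.0, 1.0], size=n)            # left or right of a
    neg = a + signs * magnitudes                             # 1000 samples "far enough" from a

    x = np.concatenate([pos, neg])
    y = np.concatenate([np.ones(n), np.zeros(n)])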
Model
Feed-forward network with a single hidden layer of hidden_dim ReLU units. Adam optimizer with standard parameters, a 1e-4 learning rate, 2000 training epochs, and BCE loss; float64 weights and float64 input tensors.
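A minimal PyTorch sketch of the setup described here (the hidden_dim value and the use of BCEWithLogitsLoss instead of an explicit sigmoid + BCELoss are my choices):

    import torch
    import torch.nn as nn

    hidden_dim = 128  # illustrative value

    model = nn.Sequential(
        nn.Linear(1, hidden_dim),
        nn.ReLU(),
        nn.Linear(hidden_dim, 1),
    ).double()                          # float64 weights

    criterion = nn.BCEWithLogitsLoss()  # BCE loss on the single output
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)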
Results
(Figure: best validation accuracy as a function of gap.)
(Figure: validation accuracies over iterations for different gaps and hidden_dim.)
Unfortunately, I don't have enough reputation to embed the images...
Question
As can be seen from the figures, the lower the gap, the harder the problem for the neural network (sparsity is fixed to 1e2). For a gap below 1e-4 it cannot learn the difference at all. Below is an example of 10 samples along with their labels for gap = 1e-4 and sparsity = 1e2:
(array([[7.113817389875074 ],
[7.123456789123456 ],
[7.121479348058312 ],
[7.123456789123456 ],
[7.123456789123456 ],
[7.11874792046549 ],
[7.1248334895646215],
[7.123456789123456 ],
[7.131370982033207 ],
[7.118704508563716 ]]),
array([0., 1., 0., 1., 1., 0., 0., 1., 0., 0.]))
I cannot find any literature or tips for this kind of data. Can anyone recommend some? I think the performance could be significantly improved with a non-standard weight initialization, but I am surprised that there is no literature on this. Moreover, from a mathematical viewpoint precision should not be a problem at all (the reals are dense).
The described task is a toy problem, but I am interested in it because I have high-precision spectroscopic measurements.
It is unclear whether this approach would help your actual problem or only the toy example, but you should normalize your data so that the range of expected inputs falls within a meaningful range (a common input range is [-1, 1]). This seems to be an issue with how neural networks are implemented in common deep learning frameworks rather than with the theory (although, technically, a 1-layer feed-forward network is not "deep"). A theoretical network could learn a mapping at arbitrary precision, but in practice the weights and biases are stored with limited precision, so you are limited in the number of digits of a quantity you can reliably represent. The numbers above have enough significant figures that they are nearing the limit of double-precision floats and are beyond the limit of single-precision floats. As a result, most of the "usable precision" in the model bias is wasted specifying the first significant figures, i.e. the mean (which is shared by almost all of the data), and if the bias cannot fully capture the mean, some of the "usable precision" of the weights is wasted capturing the residual of the mean not captured by the bias parameters. This is unlikely to lead to good performance.
By normalizing, you make the bias-fitting task much easier (the bias is close to 0), and at a minimum the "usable precision" of the float parameters is meaningfully spent fitting the data. I suspect that solving the bias problem will improve performance by freeing the model weights from the task of estimating the residual of the mean not estimated by the bias parameters.
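As a concrete illustration of this suggestion, a minimal sketch of one such normalization (the reference point a, gap, and sparsity are taken from the question; the particular choice of scale is my assumption):

    import numpy as np

    a, gap, sparsity = 7.123456789123456, 1e-4, 1e2

    def normalize(x):
        # Center on a and rescale so the inputs fall roughly in [-1, 1].
        return (x - a) / (gap * sparsity)

    x_raw = np.array([7.113817389875074, 7.123456789123456, 7.1248334895646215])
    x_norm = normalize(x_raw)   # approx. [-0.96, 0.0, 0.14] instead of values near 7.12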
Related
I'm investigating the task of training a neural network to predict one future value given a sinusoidal input. So, for example, as seen in the figure, the input signal is x and the expected output signal is y; the model's output is y^. The regression task is fairly straightforward, and there are a lot of choices for this problem. I'm using a simple recurrent neural network with mean-squared error (MSE) loss between y and y^.
Additionally, suppose I know that the sinusoid is made up of N modalities, e.g., at some points, the wave oscillates at 5 Hz, then 10 Hz, then back to 5 Hz, then up to 15 Hz maybe—i.e., N=3.
In this case, I have ground-truth class labels in a vector k and the model does both regression and classification, additionally outputting a vector k^. An example is shown in the Figure. As this is a multi-class problem with exclusivity, I figured binary cross entropy (BCE) loss should be relevant here.
I'm sure there is a lot of research about combining loss functions, but does just adding MSE and BCE make sense? Scaling one up or down by a factor of 10 doesn't seem to change the learning outcome too much. So I was wondering what is considered the standard approach to problems where there is a joint classification and regression objective.
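For concreteness, a weighted sum is one common way to combine the two objectives. A minimal sketch (the weight lambda_cls and the use of CrossEntropyLoss for the mutually exclusive classes are my assumptions, not code from the question):

    import torch
    import torch.nn as nn

    mse = nn.MSELoss()
    ce = nn.CrossEntropyLoss()   # suitable for mutually exclusive classes
    lambda_cls = 1.0             # relative weight of the classification term

    def joint_loss(y_hat, y, k_logits, k):
        # y_hat, y: regression output/target; k_logits: class scores; k: class indices
        return mse(y_hat, y) + lambda_cls * ce(k_logits, k)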
Additionally, on top of just BCE, I want to penalize k^ for quickly jumping around between classes; for example, if the model guesses one class, I'd like it to stay in that one class and switch only when it's necessary. See how in the Figure, there are fast dark blue blips in k^. I would like the same solid bands as seen in k, and naive BCE loss doesn't account for that.
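One possible way to discourage the rapid switching (a suggestion of mine, not something from the question) is to add a temporal smoothness term that penalizes changes in the predicted class probabilities between consecutive time steps:

    import torch

    def switching_penalty(k_logits):
        # k_logits: (T, C) class scores over T time steps.
        # Penalize the total variation of the class probabilities over time.
        probs = torch.softmax(k_logits, dim=-1)
        return (probs[1:] - probs[:-1]).abs().sum(dim=-1).mean()

    # total = mse_term + lambda_cls * ce_term + lambda_smooth * switching_penalty(k_logits)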
Appreciate any and all advice!
So, going through the AMP (Automatic Mixed Precision) training tutorial for normal networks, I found out that there are two pieces, autocast and GradScaler. I just want to know if it's advisable/necessary to use GradScaler during training, because it is written in the documentation that:
Gradient scaling helps prevent gradients with small magnitudes from flushing to zero (“underflowing”) when training with mixed precision.
scaler = torch.cuda.amp.GradScaler()

for epoch in range(1):
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast():
            output = net(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()
        opt.zero_grad()
Also, looking at NVIDIA Apex Documentation for PyTorch, they have used it as,
from apex import amp

model, optimizer = amp.initialize(model, optimizer)
loss = criterion(…)

with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
I think this is what GradScaler does too, so I think it is a must. Can someone help me with this query?
Short answer: yes, your model may fail to converge without GradScaler().
There are three basic problems with using FP16:
Weight updates: with half precision, 1 + 0.0001 rounds to 1. autocast() takes care of this one.
Vanishing gradients: with half precision, the smallest normal number is roughly 2^-14 (about 6e-5), so small gradient values underflow to zero far sooner than with single precision, whose smallest normal number is roughly 2^-126 (about 1e-38). GradScaler() takes care of this one.
Explosive loss: similarly, overflow is much more likely with half precision. This is also managed by the autocast() context.
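A quick illustration of these failure modes in isolation (my own example, not from the tutorial):

    import torch

    one = torch.tensor(1.0, dtype=torch.float16)
    print(one + 1e-4)                               # tensor(1., dtype=torch.float16) -- the update is lost
    print(torch.tensor(1e-9, dtype=torch.float16))  # tensor(0., dtype=torch.float16) -- underflow
    print(torch.tensor(1e5, dtype=torch.float16))   # tensor(inf, dtype=torch.float16) -- overflow (max is 65504)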
I am working on a project to predict soccer player values from a set of inputs. The data consists of about 19,000 rows and 8 columns (7 input columns and 1 target column), all numerical values.
I am using a fully connected neural network for the prediction, but the problem is that the loss is not decreasing as it should.
The loss is very large (about 1e+13) and doesn't decrease; it just fluctuates.
This is the function I am using to run the model:
def gradient_descent(model, learning_rate, num_epochs, data_loader, criterion):
    losses = []
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    for epoch in range(num_epochs):                               # one epoch
        for inputs, outputs in data_loader:                       # one iteration
            inputs, outputs = inputs.to(torch.float32), outputs.to(torch.float32)
            logits = model(inputs)
            loss = criterion(torch.squeeze(logits), outputs)      # forward pass
            optimizer.zero_grad()                                 # zero out the gradients
            loss.backward()                                       # compute the gradients (backward pass)
            optimizer.step()                                      # take one step
            losses.append(loss.item())
        epoch_loss = sum(losses[-len(data_loader):]) / len(data_loader)
        print(f'Epoch #{epoch}: Loss={epoch_loss:.3e}')
    return losses
The model is a fully connected neural network with 4 hidden layers, each with 7 neurons. The input layer has 7 neurons and the output layer has 1. I am using MSE as the loss function. I tried changing the learning rate, but it is still bad.
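For reference, a minimal sketch of the architecture as described (this is my reconstruction, not the original code; the ReLU activations are my assumption):

    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(7, 7), nn.ReLU(),   # hidden layer 1
        nn.Linear(7, 7), nn.ReLU(),   # hidden layer 2
        nn.Linear(7, 7), nn.ReLU(),   # hidden layer 3
        nn.Linear(7, 7), nn.ReLU(),   # hidden layer 4
        nn.Linear(7, 1),              # output layer
    )
    criterion = nn.MSELoss()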
What could be the reason behind this?
Thank you!
It is difficult to diagnose your problem from the information you provided, but I'll try to point you in some useful directions.
Data Normalization:
The way we initialize the weights in a deep NN has a significant effect on the training process. See, e.g.:
He, K., Zhang, X., Ren, S. and Sun, J., Delving deep into rectifiers: Surpassing human-level performance on imagenet classification (ICCV 2015).
Most initialization methods assume the inputs have zero mean and unit variance (or similar statistics). If your inputs violate these assumptions, you will find it difficult to train. See, e.g., this post.
Normalize the Targets:
You are trying to solve a regression problem (MSE loss); it might be the case that your targets are poorly scaled, causing very large loss values. Try to normalize the targets to span a more compact range.
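A minimal sketch of standardizing both inputs and targets before training (scikit-learn's StandardScaler is one convenient option; the data below is dummy data with the shapes from the question):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Dummy data: 19,000 rows, 7 features, 1 target.
    X_train = np.random.rand(19000, 7) * 1e3
    y_train = np.random.rand(19000) * 1e8                        # large target values -> large MSE

    x_scaler, y_scaler = StandardScaler(), StandardScaler()
    X_train_n = x_scaler.fit_transform(X_train)                  # zero mean, unit variance per column
    y_train_n = y_scaler.fit_transform(y_train.reshape(-1, 1))   # same for the target

    # After training on (X_train_n, y_train_n), undo the target scaling on predictions:
    # y_pred = y_scaler.inverse_transform(model_output)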
Learning Rate:
Try adjusting your learning rate: both increasing and decreasing it by orders of magnitude.
Link to paper
I'm trying to understand the region proposal network (RPN) in Faster R-CNN. I understand what it's doing, but I still don't understand exactly how training works, especially the details.
Let's assume we're using VGG16's last layer with shape 14x14x512 (before maxpool and with 228x228 images) and k=9 different anchors. At inference time I want to predict 9*2 class labels and 9*4 bounding box coordinates. My intermediate layer is a 512 dimensional vector.
(image shows 256 from ZF network)
In the paper they write
"we randomly sample 256 anchors in an image to compute the loss
function of a mini-batch, where the sampled positive and negative
anchors have a ratio of up to 1:1"
That's the part I'm not sure about. Does this mean that, for each of the k=9 anchor types, the corresponding classifier and regressor are trained with mini-batches that only contain positive and negative anchors of that type?
So that I basically train k different networks with shared weights in the intermediate layer? Each mini-batch would then consist of training data x = the 3x3x512 sliding window of the conv feature map and y = the ground truth for that specific anchor type.
And at inference time I put them all together.
I appreciate your help.
Not exactly. From what I understand, the RPN predicts W*H*k bounding boxes per feature map, and then 256 of them are randomly sampled per the 1:1 positive/negative criterion and used in the loss computation for that particular mini-batch. You're still only training one network, not k, since the 256 random samples are not restricted to any particular anchor type.
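As a rough illustration of that sampling step (my own sketch following the paper's description of taking up to 128 positives and padding with negatives; not actual Faster R-CNN code):

    import numpy as np

    def sample_anchors(labels, batch_size=256, positive_fraction=0.5):
        # labels: array over all W*H*k anchors, 1 = positive, 0 = negative, -1 = ignored.
        pos = np.flatnonzero(labels == 1)
        neg = np.flatnonzero(labels == 0)
        n_pos = min(len(pos), int(batch_size * positive_fraction))
        n_neg = min(len(neg), batch_size - n_pos)        # pad with negatives if too few positives
        pos = np.random.choice(pos, n_pos, replace=False)
        neg = np.random.choice(neg, n_neg, replace=False)
        return np.concatenate([pos, neg])                # indices used in the mini-batch loss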
Disclaimer: I only started learning about CNNs a month ago, so I may not understand what I think I understand.
I am doing a text classification task with R, and I obtained a document-term matrix of size 22490 by 120,000 (only 4 million non-zero entries, less than 1% of the entries). Now I want to reduce the dimensionality using PCA (Principal Component Analysis). Unfortunately, R cannot handle this huge matrix, so I stored the sparse matrix in a file in the Matrix Market format, hoping to use some other technique to do PCA.
So could anyone give me some hints for useful libraries (in any programming language) that can do PCA on this large-scale matrix with ease, or for how to do a longhand PCA myself, in other words, compute the covariance matrix first and then compute its eigenvalues and eigenvectors.
What I want is to compute all PCs (120,000) and keep only the top N PCs that account for 90% of the variance. Obviously, in this case, I have to set a threshold a priori to zero out some very tiny variance values (in the covariance matrix); otherwise, the covariance matrix will not be sparse and its size will be 120,000 by 120,000, which is impossible to handle on a single machine. Also, the loadings (eigenvectors) will be extremely large and should be stored in sparse format.
Thanks very much for any help !
Note: I am using a machine with 24GB RAM and 8 cpu cores.
The Python toolkit scikit-learn has a few PCA variants, of which RandomizedPCA can handle sparse matrices in any of the formats supported by scipy.sparse. scipy.io.mmread should be able to parse the Matrix Market format (I never tried it, though).
Disclaimer: I'm on the scikit-learn development team.
EDIT: the sparse matrix support from RandomizedPCA has been deprecated in scikit-learn 0.14. TruncatedSVD should be used in its stead. See the documentation for details.
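A minimal sketch of what that might look like (the file name and number of components are placeholders, not taken from the question):

    from scipy.io import mmread
    from sklearn.decomposition import TruncatedSVD

    X = mmread("dtm.mtx").tocsr()               # 22490 x 120000 sparse document-term matrix
    svd = TruncatedSVD(n_components=300)        # keep the 300 strongest components (illustrative)
    X_reduced = svd.fit_transform(X)            # dense 22490 x 300 array
    print(svd.explained_variance_ratio_.sum())  # fraction of variance retained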
Instead of running PCA, you could try Latent Dirichlet Allocation (LDA), which decomposes the document-word matrix into a document-topic matrix and a topic-word matrix. Here is a link to an R implementation: http://cran.r-project.org/web/packages/lda/ - there are quite a few implementations out there if you google, though.
With LDA you need to specify a fixed number of topics (similar to principal components) in advance. A potentially better alternative is HDP-LDA (http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/npbayes-r21.tgz), which learns the number of topics that form a good representation of your corpus.
If you can fit your dataset in memory (which it seems like you can), then you also shouldn't have a problem running the LDA code.
As a number of people on the scicomp forum pointed out, there should be no need to compute all 120k principal components. Algorithms like power iteration (http://en.wikipedia.org/wiki/Power_iteration) compute the largest eigenvalues of a matrix, and LDA algorithms will converge to a minimum-description-length representation of the data given the number of topics specified.
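For reference, a minimal sketch of power iteration on a sparse symmetric matrix (my own illustration, not from the linked page):

    import numpy as np
    import scipy.sparse as sp

    def power_iteration(A, num_iters=100):
        # Returns an approximation of the dominant eigenvalue and eigenvector of A.
        v = np.random.rand(A.shape[1])
        for _ in range(num_iters):
            w = A @ v
            v = w / np.linalg.norm(w)
        return v @ (A @ v), v

    A = sp.random(500, 500, density=0.01, format="csr")
    A = A + A.T                       # symmetrize so the dominant eigenpair is well-behaved
    eigval, eigvec = power_iteration(A)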
In R, big.PCA from the bigpca package (http://cran.r-project.org/web/packages/bigpca/bigpca.pdf) does the job.
Regarding the text classification task: I solved an almost identical problem using a technique for PCA of sparse matrices. This technique can handle very large sparse matrices. The results show that such a simple PCA outperforms word2vec, which suggests that the simple PCA also outperforms LDA.
I suppose you wouldn't be able to compute all the principal components, but you can still obtain a reduced-dimension version of your dataset matrix. I've implemented a simple routine in MATLAB, which can easily be replicated in Python.
Compute the covariance matrix of your input dataset and convert it to a dense matrix. Assuming S is your 120,000 x 22,490 sparse input matrix, this would look like:
Smul = full(S.'*S);      % Gram matrix (22490 x 22490, dense)
Sm   = full(mean(S));    % column means (1 x 22490)
Sm2  = 120000*Sm.'*Sm;   % outer product of the means, scaled by the number of rows
Scov = Smul - Sm2;       % (unnormalized) covariance matrix
Apply the eigs function to the covariance matrix to obtain the N dominant eigenvectors:
[V,D] = eigs(Scov,N);
Then obtain the PCs by projecting the zero-centered matrix onto the eigenvectors:
Sr=(S-Sm)*V;
Sr is the reduced-dimension version of S.
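For completeness, a rough Python/SciPy equivalent of the same idea (shapes are shrunk for illustration; in practice S would be the 120,000 x 22,490 sparse matrix):

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import eigsh

    S = sp.random(1000, 200, density=0.01, format="csr")   # stand-in for the real sparse data
    N = 10                                                  # number of components to keep

    Smul = (S.T @ S).toarray()                   # dense Gram matrix (n_cols x n_cols)
    Sm = np.asarray(S.mean(axis=0)).ravel()      # column means
    Scov = Smul - S.shape[0] * np.outer(Sm, Sm)  # unnormalized covariance, as in the MATLAB code
    vals, V = eigsh(Scov, k=N)                   # N dominant eigenpairs
    Sr = S @ V - Sm @ V                          # equals (S - Sm) @ V: the reduced-dimension data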