I have read many papers and web articles that claim that depthwise-separable convolutions reduce the memory required by a deep learning model compared to standard convolution. However, I do not understand how this would be the case, since depthwise-separable convolution requires storing an extra intermediate-step matrix as well as the final output matrix.
Here are two scenarios:
Typical convolution: You have a single 3x3x3 filter, which is applied to a 7x7 RGB input volume. This results in an output of size 5x5x1 which needs to be stored in GPU memory. Assuming float32 activations, this requires 100 bytes of memory.
Depthwise-separable convolution: You have three 3x3x1 filters applied to a 7x7 RGB input volume. This results in three output volumes, each of size 5x5x1. You then apply a 1x1 convolution to the concatenated 5x5x3 volume to get a final output volume of size 5x5x1. With float32 activations, this requires 300 bytes for the intermediate 5x5x3 volume and 100 bytes for the final output, for a total of 400 bytes of memory.
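For concreteness, here is a minimal PyTorch sketch of the two scenarios above as I understand them (bias terms omitted; layer names are my own):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 7, 7)  # 7x7 RGB input

# Typical convolution: one 3x3x3 filter -> 5x5x1 output (25 floats = 100 bytes)
standard = nn.Conv2d(3, 1, kernel_size=3, bias=False)
out_standard = standard(x)

# Depthwise-separable: three 3x3x1 filters (groups=3), then a 1x1 pointwise convolution
depthwise = nn.Conv2d(3, 3, kernel_size=3, groups=3, bias=False)
pointwise = nn.Conv2d(3, 1, kernel_size=1, bias=False)
intermediate = depthwise(x)              # 5x5x3 intermediate volume (75 floats = 300 bytes)
out_separable = pointwise(intermediate)  # 5x5x1 final output (25 floats = 100 bytes)

print(out_standard.shape, intermediate.shape, out_separable.shape)
# torch.Size([1, 1, 5, 5]) torch.Size([1, 3, 5, 5]) torch.Size([1, 1, 5, 5])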
As additional evidence, when using an implementation of U-Net in PyTorch with typical nn.Conv2d convolutions, the model has 17.3M parameters and a forward/backward pass size of 320MB. If I replace all convolutions with depthwise-separable convolutions, the model has 2M parameters and a forward/backward pass size of 500MB. So fewer parameters, but more memory required.
I am sure I am going wrong somewhere, as every article states that depthwise-separable convolutions require less memory. Where am I going wrong with my logic?
I am using the YOLOv5 model for custom object recognition, and when I export it to a tflite model for inclusion in a mobile app, the resulting inference time is 5201.2 ms. How can I reduce the inference time for faster recognition? The dataset I used for training is 2200 images, and I trained with the yolov5x model. Thanks for helping me!
You have several options:
Train a smaller YOLO model (m instead of x, for example)
Resize the images (from 640x640 to, for example, 320x320; note that the dimensions need to be a multiple of the maximum stride, which is 32)
Quantize the model to FP16 or INT8 (see the sketch after this list)
Use the NNAPI delegate (only provides a speedup if the device contains a HW accelerator: GPU, DSP, NN engine)
None of these options excludes the others; all can be used at the same time for maximum inference speed. Options 1, 2 and 3 will sacrifice some model performance for inference speed.
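For option 3, here is a minimal post-training FP16 quantization sketch using the TensorFlow Lite converter, assuming you have already exported the model as a TensorFlow SavedModel (the paths below are placeholders):

import tensorflow as tf

# "yolov5_saved_model" is a placeholder path to a SavedModel exported from YOLOv5
converter = tf.lite.TFLiteConverter.from_saved_model("yolov5_saved_model")

# Post-training FP16 quantization: weights are stored as float16,
# roughly halving the model size
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]

tflite_model = converter.convert()
with open("yolov5_fp16.tflite", "wb") as f:
    f.write(tflite_model)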
Link to paper
I'm trying to understand the region proposal network in faster rcnn. I understand what it's doing, but I still don't understand how training exactly works, especially the details.
Let's assume we're using VGG16's last layer with shape 14x14x512 (before maxpool and with 228x228 images) and k=9 different anchors. At inference time I want to predict 9*2 class labels and 9*4 bounding box coordinates. My intermediate layer is a 512 dimensional vector.
(the figure in the paper shows 256 because it is drawn for the ZF network)
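For concreteness, here is a small PyTorch sketch of how I picture this head (layer names and the use of ReLU are my own assumptions):

import torch
import torch.nn as nn

k = 9  # anchors per spatial location

# 3x3 "intermediate" conv producing a 512-d feature per location,
# then two sibling 1x1 convs for 2k class scores and 4k box coordinates
intermediate = nn.Conv2d(512, 512, kernel_size=3, padding=1)
cls_head = nn.Conv2d(512, 2 * k, kernel_size=1)  # object / not-object per anchor
reg_head = nn.Conv2d(512, 4 * k, kernel_size=1)  # box deltas per anchor

feat = torch.randn(1, 512, 14, 14)               # VGG16 conv5_3 feature map
h = torch.relu(intermediate(feat))
scores, deltas = cls_head(h), reg_head(h)
print(scores.shape, deltas.shape)                # (1, 18, 14, 14), (1, 36, 14, 14)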
In the paper they write:

"we randomly sample 256 anchors in an image to compute the loss function of a mini-batch, where the sampled positive and negative anchors have a ratio of up to 1:1"
That's the part I'm not sure about. Does this mean that for each of the 9 (= k) anchor types, the corresponding classifier and regressor are trained with mini-batches that only contain positive and negative anchors of that type?
Such that I would basically train k different networks with shared weights in the intermediate layer? Each mini-batch would then consist of the training data x = the 3x3x512 sliding window of the conv feature map and y = the ground truth for that specific anchor type.
And at inference time I put them all together.
I appreciate your help.
Not exactly. From what I understand, the RPN predicts W*H*k bounding boxes per feature map (one per anchor at each spatial position), and then 256 of them are randomly sampled according to the 1:1 criterion, and these are used in the loss function of that particular mini-batch. You're still only training one network, not k, since the 256 random samples are not restricted to any particular anchor type.
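For illustration, here is a rough NumPy sketch of that sampling step as I understand it (the function name and the rule for padding with extra negatives are my assumptions, not taken from the official code):

import numpy as np

def sample_rpn_minibatch(labels, batch_size=256, pos_fraction=0.5):
    # labels: array of length W*H*k with 1 = positive, 0 = negative, -1 = ignored.
    # Returns indices of the sampled anchors; all anchor types are pooled together.
    pos_idx = np.flatnonzero(labels == 1)
    neg_idx = np.flatnonzero(labels == 0)

    n_pos = min(len(pos_idx), int(batch_size * pos_fraction))
    n_neg = batch_size - n_pos  # pad with negatives if there are too few positives

    pos_sample = np.random.choice(pos_idx, n_pos, replace=False)
    neg_sample = np.random.choice(neg_idx, min(n_neg, len(neg_idx)), replace=False)
    return np.concatenate([pos_sample, neg_sample])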
Disclaimer: I only started learning about CNNs a month ago, so I may not understand what I think I understand.
I get how the DFT via correlation works, and use that as a basis for understanding the results of the FFT. If I have a discrete signal that was sampled at 44.1 kHz, then taking 1 s of data gives me 44,100 samples. In order to run the FFT on that, I would need an array of 44,100 samples and a DFT with N = 44,100 in order to get the resolution necessary to detect frequencies up to 22 kHz, right? (Because the FFT can only correlate the input with sinusoidal components up to bin N/2, i.e. half the sampling rate.)
That's obviously a lot of data points and calculation time, and I have read that this is where the short-time Fourier transform (STFT) comes in. If I take the first 1024 samples (~23 ms) and run the FFT on that, then take an overlapping block of 1024 samples, I can get the frequency content of the signal every 23 ms. But then how do I interpret the output? If the output of the FFT on static data is N/2 data points with fs/(N/2) bandwidth, what is the bandwidth of the STFT's frequency output?
Here's an example that I ran in Mathematica:
100Hz sine wave at 44.1kHz sample rate:
Then I run the FFT on only the first 1024 points:
The frequency of interest is then at data point 3, which should somehow correspond to 100 Hz. I think 44100/1024 ≈ 43 is something like a scaling factor, which means that a signal at 1 Hz in this little window corresponds to a signal of 43 Hz in the full data array. However, this would give me an output of 43 Hz * 3 = 129 Hz. Is my logic correct but not my implementation?
As I have already stated in my earlier comments, the variable N affects the resolution achievable by the output frequency spectrum, not the range of frequencies you can detect. A larger N gives you higher resolution at the expense of higher computation time, and a lower N gives you lower computation time but can cause spectral leakage, which is the effect you see in your last figure.
As for your other question: theoretically the bandwidth of an FFT is infinite, but we band-limit our result to the range of frequencies [-fs/2, fs/2] because all frequencies outside that band are susceptible to aliasing and are therefore of no use. Furthermore, if the input signal is real (which is true in most cases, including ours), then the frequencies in [-fs/2, 0] are just a reflection of the frequencies in [0, fs/2], so some FFT procedures only output the spectrum from [0, fs/2], which I think applies to your case. This means that the N/2 data points you received as output represent the frequencies in the range [0, fs/2], so that is the bandwidth you are working with, both in the case of the FFT and in the case of the STFT (the STFT is just a series of FFTs, and each FFT in an STFT gives you a spectrum with data points in this band).
I would also like to point out that the STFT will most likely not reduce your computation time if your input is a varying signal such as music, because in that case you will need to perform it several times over the duration of the song for it to be of any use. It will, however, let you understand the frequency characteristics of your song much better than a single FFT would.
To visualise the results of an FFT you use frequency (and/or phase) spectrum plots, but to visualise the results of an STFT you will most likely need to create a spectrogram, which is basically a graph made by putting the individual FFT spectra side by side. The process of creating a spectrogram can be seen in the figure below (Source: Dan Ellis, Introduction to Speech Processing). The spectrogram shows you how your signal's frequency characteristics change over time, and how you interpret it depends on what specific features you are looking to extract/detect from the audio. You might want to look at the spectrogram Wikipedia page for more information.
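To make the bin-to-frequency mapping concrete, here is a small NumPy sketch for your 100 Hz example (the hop size and the use of rfft are my own choices):

import numpy as np

fs = 44100                       # sample rate (Hz)
N = 1024                         # FFT / window length
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 100 * t)  # 100 Hz sine, 1 s long

# One FFT frame: bin k corresponds to frequency k * fs / N (k = 0 .. N/2)
frame = x[:N]
spectrum = np.abs(np.fft.rfft(frame))
freqs = np.fft.rfftfreq(N, d=1 / fs)   # 0, 43.07, 86.13, 129.2, ... Hz
print(freqs[np.argmax(spectrum)])      # ~86.1 Hz: the bin closest to 100 Hz

# A bare-bones STFT: hop by N//2 samples and stack the frames' spectra
hop = N // 2
frames = [x[i:i + N] for i in range(0, len(x) - N + 1, hop)]
stft = np.array([np.abs(np.fft.rfft(f)) for f in frames])  # shape (n_frames, N//2 + 1)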
This question is about the buffer required by cuFFT. In the User Guide it is documented that
In the worst case, the CUFFT Library allocates space for 8*batch*n[0]*..*n[rank-1] cufftComplex or cufftDoubleComplex elements (where batch denotes the number of transforms that will be executed in parallel, rank is the number of dimensions of the input data (see Multidimensional transforms) and n[] is the array of transform dimensions) for single and double-precision transforms respectively.
What does "array of transform dimensions" mean? How much buffer does cuFFT need? My understanding of the above is that it needs at least 8x the size of the array being FFTed, but that does not make sense to me.
Thanks in advance
Daniel
The "array of transform dimensions" is the array containing the problem size in each dimension, see the section on multidimensional transforms for more information.
cuFFT allocates temporary space to accommodate the intermediate data; the part of the doc you quoted says this is "the worst case", so it's not "at least 8x", it's at most. The doc goes on to say:
Depending on the configuration of the plan, less memory may be used. In some specific cases, the temporary space allocations can be as low as 1*batch*n[0]*..*n[rank-1] cufftComplex or cufftDoubleComplex elements.
So for an NxM 2D single-precision transform:
1*N*M*sizeof(cufftComplex) <= space for tmp data <= 8*N*M*sizeof(cufftComplex)
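As a hypothetical worked example (the transform size is mine, not from your question), the bounds for a 1024x1024 single-precision transform come out as:

# Rough bounds on the cuFFT work area for a 2D single-precision complex transform
sizeof_cufftComplex = 8            # two float32 values
N, M, batch = 1024, 1024, 1        # hypothetical transform size

elements = batch * N * M
lower = 1 * elements * sizeof_cufftComplex  # best case
upper = 8 * elements * sizeof_cufftComplex  # worst case
print(lower / 2**20, upper / 2**20)         # 8.0 and 64.0 MiB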
Use cufftGetSize1d and cufftEstimate1d to find the amount of memory that will be allocated for the buffer. The documentation says cufftEstimate1d gives an estimate of the maximum amount, while cufftGetSize1d provides a more precise estimate.
In my case I use both 64- and 8192-point FFTs, and I see the same behaviour: the buffer allocated is only 1*batch*n[0] elements. I've run the test with different amounts of data and different FFT sizes and I always get this same value.
To conclude, if you need to determine the memory used by an FFT, the cuFFT library provides functions to do this.
I am doing a text classification task with R, and I obtain a document-term matrix of size 22490 by 120,000 (only 4 million non-zero entries, i.e. less than 1% of the entries). Now I want to reduce the dimensionality using PCA (Principal Component Analysis). Unfortunately, R cannot handle this huge matrix, so I store the sparse matrix in a file in Matrix Market format, hoping to use some other technique to do the PCA.
So could anyone give me some hints on useful libraries (in any programming language) that could do PCA on this large-scale matrix with ease, or on how to do a longhand PCA myself, in other words, calculate the covariance matrix first and then calculate its eigenvalues and eigenvectors.
What I want is to calculate all PCs (120,000) and choose only the top N PCs that account for 90% of the variance. Obviously, in this case I would have to give a threshold a priori to set some very tiny variance values to 0 (in the covariance matrix); otherwise the covariance matrix will not be sparse and its size would be 120,000 by 120,000, which is impossible to handle on a single machine. Also, the loadings (eigenvectors) will be extremely large and should be stored in sparse format.
Thanks very much for any help!
Note: I am using a machine with 24GB RAM and 8 CPU cores.
The Python toolkit scikit-learn has a few PCA variants, of which RandomizedPCA can handle sparse matrices in any of the formats supported by scipy.sparse. scipy.io.mmread should be able to parse the Matrix Market format (I never tried it, though).
Disclaimer: I'm on the scikit-learn development team.
EDIT: the sparse matrix support from RandomizedPCA has been deprecated in scikit-learn 0.14. TruncatedSVD should be used in its stead. See the documentation for details.
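For reference, here is a minimal sketch of the TruncatedSVD route starting from a Matrix Market file (the filename and the number of components are placeholders; increase n_components until the explained variance reaches the 90% target):

from scipy.io import mmread
from sklearn.decomposition import TruncatedSVD

# "dtm.mtx" is a placeholder for the document-term matrix in Matrix Market format
X = mmread("dtm.mtx").tocsr()     # stays sparse, e.g. 22490 x 120000

svd = TruncatedSVD(n_components=500, random_state=0)
X_reduced = svd.fit_transform(X)  # dense array, 22490 x 500

# fraction of variance captured by the kept components
print(svd.explained_variance_ratio_.sum())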
Instead of running PCA, you could try Latent Dirichlet Allocation (LDA), which decomposes the document-word matrix into a document-topic matrix and a topic-word matrix. Here is a link to an R implementation: http://cran.r-project.org/web/packages/lda/ - there are quite a few implementations out there if you google, though.
With LDA you need to specify a fixed number of topics (similar to principal components) in advance. A potentially better alternative is HDP-LDA (http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/npbayes-r21.tgz), which learns the number of topics that form a good representation of your corpus.
If you can fit your dataset in memory (which it seems like you can), then you also shouldn't have a problem running the LDA code.
As a number of people on the scicomp forum pointed out, there should be no need to compute all 120k principal components. Algorithms like power iteration (http://en.wikipedia.org/wiki/Power_iteration) calculate the largest eigenvalues of a matrix, and LDA algorithms will converge to a minimum-description-length representation of the data given the number of topics specified.
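For illustration, here is a bare-bones power iteration in Python/NumPy (a generic sketch, not tied to any of the packages mentioned above):

import numpy as np

def power_iteration(A, n_iter=200):
    # Estimate the dominant eigenvalue/eigenvector of a symmetric matrix A.
    # A only needs to support matrix-vector products, so a scipy.sparse matrix works too.
    v = np.random.default_rng(0).standard_normal(A.shape[1])
    for _ in range(n_iter):
        v = A @ v
        v /= np.linalg.norm(v)
    eigenvalue = v @ (A @ v)  # Rayleigh quotient
    return eigenvalue, v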
In R, big.PCA from the bigpca package (http://cran.r-project.org/web/packages/bigpca/bigpca.pdf) does the job.
I solved almost the same problem (a text classification task) using a technique for PCA of sparse matrices.
This technique can handle very large sparse matrices.
The results show that such a simple PCA outperforms word2vec, which implies that the simple PCA outperforms LDA as well.
I suppose you wouldn't be able to compute all principal components. But you can still obtain a reduced-dimension version of your dataset matrix. I've implemented a simple routine in MATLAB, which can easily be replicated in Python.
Compute the covariance matrix of your input dataset and convert it to a dense matrix. Assuming S is your input 120,000 * 22490 sparse matrix, this would look like:
Smul = full(S.'*S);    % S'*S term (22490 x 22490), made dense
Sm   = full(mean(S));  % column means (1 x 22490)
Sm2  = 120000*Sm.'*Sm; % outer product of the means, scaled by the number of rows
Scov = Smul - Sm2;     % (unnormalised) covariance / scatter matrix
Apply the eigs function to the covariance matrix to obtain its first N dominant eigenvectors:
[V,D] = eigs(Scov,N);  % V: 22490 x N eigenvectors, D: N x N diagonal matrix of eigenvalues
And obtain the PCs by projecting the zero-centered matrix onto the eigenvectors:
Sr = (S-Sm)*V;  % center the data and project onto the eigenvectors
Sr is the reduced dimension version of S.