caffe memory required how to calculate? - deep-learning

When you want to train a net you will get a log information like:
Memory required for data: 493376512
How do you interpret the number? Is it in bytes, bits?

To answer your direct question, the memory usage for the layer is given in bytes, not bits.
Overall, the memory reported is all of the layer's required memory that can be computed at initialization. The parameter and buffer space is allocated as needed, so it's not useful to report those in an aggregate sum -- some of them might take less total memory due to serialized re-use.
As the one answer says, the most effective way to determine the max memory usage is to run a forward pass and grab its high-water mark.

See shelhamer's comment:
https://github.com/BVLC/caffe/issues/2387#issuecomment-97910200
It's the memory required for all the top blobs or "data" in the sense
of layer outputs. It excludes the diffs, the parameters, and any
intermediate blobs within layers. #jeffdonahue do you remember the
motivation for this number?
And longjon:
Since layers are individually responsible for allocating parameters
and buffers, there's no way in general to know the total amount of
memory required except by running a forward pass.

Related

Is there a way to know what's the extra space that cudaMalloc is going to reserve?

When I use cudaMalloc (100) it reserves more than 100 B (According to some users here it's due to granularity issues and housekeeping information.
Is it possible to determine how big this space will be based on the Bytes I need to reserve?
Thank you so much.
EDIT: I'll explain why I need to know.
I want to apply the convolution algorithm over huge images on the GPU. To do so, since there isn't enough memory on the GPU to hold it, I need to split the image in batches of rows an call the kernel several times.
In fact, I need to send 2 images, the OnlyRead matrix and the Results matrix.
I want to calcule a priori the max number of rows I can send to the device according to the amount of free memory.
The first cudaMalloc executes successfully, but the problem appears when trying to execute the second CudaMalloc since the first reserve took more Bytes than expected.
What I'm doing now is considering the free memory amount a 10% less than what it is... but that's just a magical number that came from nowhere..
"Is there a way to know what's the extra space that cudaMalloc is going to reserve?"
Not without violating CUDA's platform guarantees, no. cudaMalloc() returns a pointer to the requested amount of memory. You can't make any assumptions about the amount of memory that happens to be valid after the end of the requested amount - the CUDA allocator already makes use of suballocators, and unlike CPU-based memory allocators, the data structures to track free lists etc. are not interleaved with the allocated memory. So for example, it would be unwise to assume that the CUDA runtime's guarantees about the alignment of the returned pointers mean anything other than that returned pointers will have a certain alignment.
If you study the CUDA runtime's behavior, that will shed light on the behavior of that particular CUDA runtime, but the behavior may change with future releases and break your code.

CUDA memory for lookup tables

I'm designing a set of mathematical functions and implementing them in both CPU and GPU (with CUDA) versions.
Some of these functions are based upon lookup tables. Most of the tables take 4KB, some of them a bit more. The functions based upon lookup tables take an input, pick one or two entry of the lookup table and then compute the result by interpolating or applying similar techniques.
My question is now: where should I save these lookup tables? A CUDA device has many places for storing values (global memory, constant memory, texture memory,...). Provided that every table could be read concurrently by many threads and that the input values, and therefore the lookup indices, can be completely uncorrelated among the threads of every warp (resulting in uncorrelated memory accesses), which memory provides the fastest access?
I add that the contents of these tables are precomputed and completely constant.
EDIT
Just to clarify: I need to store about 10 different 4KB lookup tables. Anyway it would be great to know wether the solution as for this case would be the same for the case with e.g. 100 4KB tables or with e.g. 10 16KB lookup tables.
Texture memory (now called read only data cache) would probably be a choice worth exploring, although not for the interpolation benefits. It supports 32 bit reads without reading beyond this amount. However, you're limited to 48K in total. For Kepler (compute 3.x) this is quite simple to program now.
Global memory, unless you configure it in 32 bit mode, will often drag in 128 bytes for each thread, hugely multiplying what is actually data needed from memory as you (apparently) can't coalesce the memory accesses. Thus the 32 bit mode is probably what you need if you want to use more than 48K (you mentioned 40K).
Thinking of coalescing, if you were to access a set of values in series from these tables, you might be able to interleave the tables such that these combinations could be grouped and read as a 64 or 128 bit read per thread. This would mean the 128 byte reads from global memory could be useful.
The problem you will have is that you're making the solution memory bandwidth limited by using lookup tables. Changing the L1 cache size (on Fermi / compute 2.x) to 48K will likely make a significant difference, especially if you're not using the other 32K of shared memory. Try texture memory and then global memory in 32 bit mode and see which works best for your algorithm. Finally pick a card with a good memory bandwidth figure if you have a choice over hardware.

data per block in CUDA -- does it broadcast in one transaction?

i have a data array that is per-block.
i have N blocks inside a cuda Grid and a constant array of data "block_data[]" with size N.
so, all threads in a given block 'X' access block_data[X] just one time, and do something with that value.
my question is: does this broadcast scheme work efficiently?
if not, what approach should i take?
edit after comments: my only problem with constant memory is its limited size, since i could have more than 64K blocks. That would mean more than 64KB
regards
If you just use a normal global memory access then the transaction is fairly inefficient, although depending on how much work your kernel is doing the impact is probably quite small.
I'm assuming sizeof(block_data) is one byte (inferred from your question "...could have more than 64K blocks. That would mean more than 64KB").
If the operation is cached in L1 then you will fetch 128 bytes for the one bit of info you need (sizeof(block_data)), if other warps in the block request the same data then they should get from L1. The efficiency of the load is 1/128 but you should only pay that once for the block.
If the operation is not cached in L1 (e.g. you pass "-dlcm=cg" to the assembler) then you will fetch 32 bytes. The efficiency is 1/32 but you pay that once for each warp.
Once the data is loaded, it is broadcast to all threads in the warp.
An alternative would be to mark the data as const __restrict__ which indicates to the compiler that the data is a) read-only and b) not aliased by any other pointer. Since the compiler can detect that the access is uniform then it can optimise the access to use one of the read-only caches (e.g. constant cache or, on compute capability >=3.5, read-only data cache aka texture cache).
If you want to change the values in block_data[N] array, better use the concept of shared memory __shared__. If you are not changing the value of block_data[N], use __const__ or use the concept of cache. By using L2 Cache, you can get 1536KB of memory (Kepler).

How to create Large Bit array in cuda?

I need to keep track of around 10000 elements of an array in my algorithm .So for this I need boolean for each record.If I used char array to keep track of 10000 arrays (as 0/1),it would take up lot of memory.
So Can I create an bit array of 10000 bits in Cuda where each bit represents corresponding array record?
As Roger said, the answer is yes, CUDA provides the same bitwise operations (i.e. >>, << and &) as normal C so you can implement your bit array essentially normally (almost, see thread synchronisation issues below).
However, for your situation it is almost certainly not a good idea.
There are problems with thread syncronisation. Imagine two of the threads on your GPU are inverting two bits of a single entry of your array. Each thread will read the same value out of memory, and apply their operation to it, but the thread that writes its value back to memory last will overwrite the result of the other thread. (Note: if your bit array is not being modified by the GPU code then this isn't a problem.)
And, unless this is explicitly required, you shouldn't be optimising for memory use, an array with 10K elements does not take much memory at all: even if you were storing each boolean in an 64 bit integer it's only 80 KB. And obviously you can store them in a smaller datatype. (You should only start worrying about compressing the array as much as possible when you are getting upwards of tens of millions, or even hundreds of millions of elements.)
Also, the way GPUs work means that you might get best performance by using a reasonably large data type for each boolean (most likely a 32 bit one) so that, for example, memory coalescing works better. (I haven't tested this assertion, you would need to run some benchmarks to check it.)

matrix multiplication in cuda

say I want to multiply two matrices together, 50 by 50. I have 2 ways to arrange threads and blocks.
a) one thread to calculate each element of the result matrix. So I have a loop in thread multiplies one row and one column.
b) one thread to do each multiplication. Each element of the result matrix requires 50 threads. After multiplications are done, I can use binary reduction to sum the results.
I wasn't sure which way to take, so I took b. It wasn't ideal. In fact it was slow. Any idea why? My guess would be there are just too many threads and they are waiting for resource most of time, is that true?
As with so many things in high performance computing, the key to understanding performance here is understanding the use of memory.
If you are using one thread do to do one multiplication, then for that thread you have to pull two pieces of data from memory, multiply them, then do some logarthmic number of adds. That's three memory accesses for a mult and an add and a bit - the arithmatic intensity is very low. The good news is that there are many many threads worth of tasks this way, each of which only needs a tiny bit of memory/registers, which is good for occupancy; but the memory access to work ratio is poor.
The simple one thread doing one dot product approach has the same sort of problem - each multiplication requires two memory accesses to load. The good news is that there's only one store to global memory for the whole dot product, and you avoid the binary reduction which doesn't scale as well and requires a lot of synchronization; the down side is there's way less threads now, which at least your (b) approach had working for you.
Now you know that there should be some way of doing more operations per memory access than this; for square NxN matricies, there's N^3 work to do the multiplication, but only 3xN^2 elements - so you should be able to find a way to do way more than 1 computation per 2ish memory accesses.
The approach taken in the CUDA SDK is the best way - the matricies are broken into tiles, and your (b) approach - one thread per output element - is used. But the key is in how the threads are arranged. By pulling in entire little sub-matricies from slow global memory into shared memory, and doing calculations from there, it's possible to do many multiplications and adds on each number you've read in from memory. This approach is the most successful approach in lots of applications, because getting data - whether it's over a network, or from main memory for a CPU, or off-chip access for a GPU - often takes much longer than processing the data.
There's documents in NVidia's CUDA pages (esp http://developer.nvidia.com/object/cuda_training.html ) which describe their SDK example very nicely.
Have you looked at the CUDA documentation: Cuda Programming Model
Also, sample source code: Matrix Multiplication
Did you look at
$SDK/nvidia-gpu-sdk-3.1/C/src/matrixMul
i.e. the matrix multiplication example in the SDK?
If you don't need to implement this yourself, just use a library -- CUBLAS, MAGMA, etc., provide tuned matrix multiplication implementations.