CUDA memory issue with allennlp coreference resolution API - allennlp

Hi, I have a CUDA memory issue even though I am using multiple GPUs. I am calling the coreference resolution API on a long document (around 2000 words). It seems that the memory is not distributed across the GPUs. How can I solve this issue? (I am currently using the API as shown here: https://demo.allennlp.org/coreference-resolution)

The coref model uses a lot of memory. It does not automatically take advantage of multiple GPUs. The best thing you can do is reduce the maximum sequence length you send to the model until it fits.
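One way to reduce the sequence length is to split the document into shorter windows before sending each one to the model. The following is a minimal sketch of that idea only: the function name `split_into_windows`, the window size of 512 tokens and the overlap of 64 tokens are illustrative assumptions, not AllenNLP settings, and the call to the coreference model itself is deliberately left out. Note that coreference links spanning two windows would be lost with this approach.

```python
def split_into_windows(tokens, max_len, overlap=0):
    """Split a token list into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens, so mentions close to a
    window boundary appear in both neighbouring windows.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the document
        start += step
    return windows


# Hypothetical 2000-token document; a whitespace split stands in for a real tokenizer.
document = ("word " * 2000).split()
windows = split_into_windows(document, max_len=512, overlap=64)
print(len(windows), [len(w) for w in windows])
```

Each window would then be passed to the model separately, and `max_len` lowered until a single window fits in GPU memory.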

Related

Is it possible to force cudaMallocManaged to allocate on a specific GPU ID (e.g. via cudaSetDevice)?

I want to use cudaMallocManaged, but is it possible to force it to allocate memory on a specific GPU ID (e.g. via cudaSetDevice) on a multi-GPU system?
The reason is that I need to allocate several arrays on the GPU, and I know which sets of these arrays need to work together, so I want to manually make sure they are on the same GPU.
I searched the CUDA documentation, but didn't find any information related to this. Can someone help? Thanks!
No, you can't do this directly via cudaMallocManaged. The idea behind managed memory is that the allocation migrates to whatever processor it is needed on.
If you want to manually make sure a managed allocation is "present" on (migrated to) a particular GPU, you would typically use cudaMemPrefetchAsync. Some examples are here and here. This is generally recommended for good performance if you know which GPU the data will be needed on, rather than using "on-demand" migration.
Some blogs on managed memory/unified memory usage are here and here, and some recorded training is available here, session 6.
From N.2.1.1. Explicit Allocation Using cudaMallocManaged() (emphasis mine):
By default, the devices of compute capability lower than 6.x allocate managed memory directly on the GPU. However, the devices of compute capability 6.x and greater do not allocate physical memory when calling cudaMallocManaged(): in this case physical memory is populated on first touch and may be resident on the CPU or the GPU.
So for any recent architecture it works like NUMA nodes on the CPU: Allocation says nothing about where the memory will be physically allocated. This instead is decided on "first touch", i.e. initialization. So as long as the first write to these locations comes from the GPU where you want it to be resident, you are fine.
Therefore I also don't think a feature request will find support. In this memory model allocation and placement just are completely independent operations.
In addition to explicit prefetching as Robert Crovella described it, you can give more information about which devices will access which memory locations in which way (reading/writing) by using cudaMemAdvise (See N.3.2. Data Usage Hints).
The idea behind all this is that you can start off by just using cudaMallocManaged and not caring about placement, etc. during fast prototyping. Later you profile your code and then optimize the parts that are slow using hints and prefetching to get (almost) the same performance as with explicit memory management and copies. The final code may not be that much easier to read / less complex than with explicit management (e.g. cudaMemcpy gets replaced with cudaMemPrefetchAsync), but the big difference is that you pay for certain mistakes with worse performance instead of a buggy application with e.g. corrupted data that might be overlooked.
In multi-GPU applications this idea of not caring about placement at the start is probably not applicable, but NVIDIA seems to want cudaMallocManaged to be as uncomplicated as possible for this type of workflow.

Get memory usage of a CUDA context

Is there a way to get the memory usage of a CUDA context, rather than having to use cudaMemGetInfo, which only reports global information for a device? Or at least a way to find out how much memory is occupied by the current application?
It seems to be impossible to get this per context. However, retrieving per-process memory usage is a good alternative. As Robert has pointed out, per-process memory usage can be retrieved using NVML, specifically with the nvmlDeviceGetComputeRunningProcesses function.

Parallelizing FFT (using CUDA)

In my application I need to transform each line of an image, apply a filter and transform it back.
I want to be able to perform multiple FFTs at the same time on the GPU. More precisely, I'm using NVIDIA's CUDA. Now, some considerations:
CUDA's FFT library, CUFFT, can only be called from the host ( https://devtalk.nvidia.com/default/topic/523177/cufft-device-callable-library/).
On this topic (running FFTW on GPU vs using CUFFT), Robert Crovella says
"cufft routines can be called by multiple host threads".
I believed that doing all these FFTs in parallel would increase performance, but Robert comments
"the FFT operations are of reasonably large size, then just calling the cufft library routines as indicated should give you good speedup and approximately fully utilize the machine"
So,
Is this it? Is there no gain in performing more than one FFT at a time?
Is there any library that supports calls from the device?
Should I just use cufftPlanMany() instead (as referred to in "is-there-a-method-of-fft-that-will-run-inside-cuda-kernel" by hang or as referred to in the previous topic, by Robert)?
Or is the best option to use multiple host threads?
(this 2 links limit is killing me...)
My objective is to get some discussion on what's the best solution to this problem, since many have faced similar situations.
This might become obsolete once NVIDIA implements device calls in CUFFT.
(They said they are working on this, but there is no expected release date; see the discussion on the NVIDIA forum, first link.)
So, Is this it? Is there no gain in performing more than one FFT at a time?
If the individual FFTs are large enough to fully utilize the device, there is no gain in performing more than one FFT at a time. You can still use standard methods like overlap of copy and compute to get the most performance out of the machine.
If the FFTs are small, then the batched plan is a good way to get the most performance. If you go this route, I recommend using CUDA 5.5, as there have been some API improvements.
Is there any library that supports calls from the device?
The cuFFT library cannot be called from device code.
There are other CUDA libraries, of course, such as ArrayFire, which may have options I'm not familiar with.
Should I just use cufftPlanMany() instead (as referred to in "is-there-a-method-of-fft-that-will-run-inside-cuda-kernel" by hang or as referred to in the previous topic, by Robert)?
Or is the best option to use multiple host threads?
Batched plan is preferred over multiple host threads - the API can do a better job of resource management that way, and you will have more API-level visibility (such as through the resource estimation functions in CUDA 5.5) as to what is possible.
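A batched plan describes every signal in one call through a transform length, a batch count, a stride between consecutive elements of one signal (istride in cufftPlanMany) and a distance between the first elements of consecutive signals (idist). For a one-dimensional batch, element x of signal b lives at flat index b * idist + x * istride. The sketch below only models that addressing in plain Python for a small row-major image; it does not call cuFFT, and the helper names `batched_index` and `gather` are hypothetical.

```python
def batched_index(batch, x, stride, dist):
    """Flat index of element x of signal `batch` in a 1D batched layout."""
    return batch * dist + x * stride


def gather(buffer, n, howmany, stride, dist):
    """Collect the `howmany` signals of length n that the layout describes."""
    return [
        [buffer[batched_index(b, x, stride, dist)] for x in range(n)]
        for b in range(howmany)
    ]


width, height = 4, 3
image = list(range(width * height))  # row-major image, values are the flat indices

# One signal per image line: contiguous elements, lines are `width` apart.
rows = gather(image, n=width, howmany=height, stride=1, dist=width)

# One signal per image column: elements are `width` apart, columns start 1 apart.
cols = gather(image, n=height, howmany=width, stride=width, dist=1)

print(rows)
print(cols)
```

For the line-by-line filtering described in the question, the first layout applies: the transform length is the image width and the batch count is the image height, so a single plan covers the whole image.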

How to adjust the number of CUDA blocks and threads to get optimal performance

I've tested several numbers of blocks and threads empirically, and the execution time can be greatly reduced with specific values.
I don't understand the difference between blocks and threads. I suspect that the threads in a block share some specific cache memory, but this is still quite fuzzy to me. For the moment, I parallelize my functions into N parts, which are distributed over blocks/threads.
My goal would be to automatically adjust the number of blocks and threads according to the amount of memory I have to use. Is that possible? Thank you.
Hong Zhou's answer is good, so far. Here are some more details:
When using shared memory you might want to consider it first, because it is a very limited resource and it's not unlikely for kernels to have very specific needs that constrain the many variables controlling parallelism.
You either have blocks with many threads sharing larger regions or blocks with fewer threads sharing smaller regions (under constant occupancy).
If your code can live with as little as 16KB of shared memory per multiprocessor, you might want to opt for larger (48KB) L1 caches by calling
cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
Further, L1-caches can be disabled for non-local global access using the compiler option -Xptxas=-dlcm=cg to avoid pollution when the kernel accesses global memory carefully.
Before worrying about optimal performance based on occupancy you might also want to check that device debugging support is turned off for CUDA >= 4.1 (or appropriate optimization options are given, read my post in this thread for a suitable compiler configuration).
Now that we have a memory configuration and registers are actually used aggressively, we can analyze the performance under varying occupancy:
The higher the occupancy (warps per multiprocessor) the less likely the multiprocessor will have to wait (for memory transactions or data dependencies) but the more threads must share the same L1 caches, shared memory area and register file (see CUDA Optimization Guide and also this presentation).
The ABI can generate code for a variable number of registers (more details can be found in the thread I cited). At some point, however, register spilling occurs. That is, register values get temporarily stored on the (relatively slow, off-chip) local memory stack.
Watching stall reasons, memory statistics and arithmetic throughput in the profiler while varying the launch bounds and parameters will help you find a suitable configuration.
It's theoretically possible to find optimal values from within an application. However, having the client code adjust optimally to both different devices and launch parameters can be nontrivial, and will require recompilation or different variants of the kernel to be deployed for every target device architecture.
I believe that automatically adjusting the number of blocks and threads is a very difficult problem. If it were easy, CUDA would most probably have this feature for you.
The reason is that the optimal configuration depends on the implementation and on the kind of algorithm you are implementing. It requires profiling and experimenting to get the best performance.
Here are some limitations which you can consider.
Register usage in your kernel.
Occupancy of your current implementation.
Note: having more threads does not equate to best performance. Best performance is obtained by getting the right occupancy in your application and keeping the GPU cores busy all the time.
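To make the interaction between block size, register usage and shared memory concrete, here is a rough sketch of a theoretical occupancy estimate. It is a simplification under stated assumptions: the per-multiprocessor limits in `EXAMPLE_LIMITS` are hypothetical values chosen for illustration and must be replaced with those of the actual device, and the register and shared memory allocation granularity that real hardware applies is ignored. The function name `theoretical_occupancy` is also an assumption.

```python
# Hypothetical per-multiprocessor limits; look up the real values for your device.
EXAMPLE_LIMITS = {
    "warp_size": 32,
    "max_warps_per_sm": 64,
    "max_blocks_per_sm": 32,
    "regs_per_sm": 65536,
    "smem_per_sm": 48 * 1024,
}


def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                          limits=EXAMPLE_LIMITS):
    """Fraction of the maximum resident warps a launch configuration can reach."""
    if threads_per_block <= 0 or regs_per_thread <= 0:
        raise ValueError("threads_per_block and regs_per_thread must be positive")
    warp = limits["warp_size"]
    warps_per_block = -(-threads_per_block // warp)  # ceiling division

    # Resident blocks per multiprocessor allowed by each resource.
    by_warps = limits["max_warps_per_sm"] // warps_per_block
    by_regs = limits["regs_per_sm"] // (regs_per_thread * warps_per_block * warp)
    if smem_per_block > 0:
        by_smem = limits["smem_per_sm"] // smem_per_block
    else:
        by_smem = limits["max_blocks_per_sm"]  # shared memory is not a constraint

    blocks = min(by_warps, by_regs, by_smem, limits["max_blocks_per_sm"])
    return blocks * warps_per_block / limits["max_warps_per_sm"]


# Sweep a few block sizes for a kernel assumed to use 40 registers per thread.
for threads in (64, 128, 256, 512, 1024):
    print(threads, theoretical_occupancy(threads, regs_per_thread=40, smem_per_block=0))
```

Such an estimate only narrows down the candidates. As noted above, the highest occupancy is not necessarily the fastest configuration, so the remaining options still need to be profiled.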
I've got quite a good answer here. In short, computing the optimal distribution over blocks and threads is a difficult problem.

Graph algorithms on GPU

The current GPU execution and memory models are somewhat limited (memory limits, limits on data structures, no recursion...).
Do you think it would be feasible to implement a graph theory problem on a GPU? For example, vertex cover, dominating set, independent set or max clique?
Is it also feasible to have branch-and-bound algorithms on GPUs? Recursive backtracking?
You will be interested in
Exploring the Limits of GPUs With Parallel Graph Algorithms
Accelerating large graph algorithms on the GPU using CUDA.
This is tangentially related to your question, but I've implemented a "recursive" backtracking algorithm for enumerating "self-avoiding walks" on a lattice (n.b.: the stack was simulated within the CUDA kernel, to avoid the overhead of creating local variables for a whole bunch of function calls). It's possible to do this efficiently, so I'm sure this can be adapted to a graph theoretical context. Here's a link to a seminar on the topic where I gave some general discussion about backtracking within the Single Instruction Multiple Data (SIMD) paradigm; it's a pdf of about 1MB in size http://bit.ly/9ForGS .
I don't claim to know about the wider literature on graph theoretical algorithms on GPUs, but hope the above helps a little.
(#TheMachineCharmer, thanks for the links.)
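The simulated stack mentioned in the answer above can be illustrated with a small sequential sketch: backtracking over self-avoiding walks on the square lattice, with an explicit stack in place of recursive calls. This is plain Python rather than a CUDA kernel, and it is not the code from the seminar; the function name `count_self_avoiding_walks` and the data layout are assumptions made for illustration.

```python
def count_self_avoiding_walks(n_steps):
    """Count self-avoiding walks with n_steps steps on the square lattice,
    starting at the origin, using an explicit stack instead of recursion."""
    moves = ((1, 0), (-1, 0), (0, 1), (0, -1))
    path = [(0, 0)]      # sites of the current walk; doubles as the position stack
    visited = {(0, 0)}   # sites occupied by the current walk
    next_move = [0]      # next_move[d]: index of the next move to try at depth d
    count = 0
    while path:
        depth = len(path) - 1
        if depth == n_steps:
            count += 1   # a complete walk was found
        if depth == n_steps or next_move[-1] == len(moves):
            # Backtrack: leave the last site and drop its stack frame.
            visited.remove(path.pop())
            next_move.pop()
            continue
        dx, dy = moves[next_move[-1]]
        next_move[-1] += 1
        x, y = path[-1]
        site = (x + dx, y + dy)
        if site not in visited:
            # Descend: push the new site with a fresh move counter.
            path.append(site)
            visited.add(site)
            next_move.append(0)
    return count


print([count_self_avoiding_walks(n) for n in range(5)])
```

Because all state lives in fixed-layout arrays (the path and one move counter per depth), the same loop structure can be given a bounded, preallocated stack, which is what makes it usable where recursion is unavailable.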