Get memory usage of a CUDA context - cuda

Is there a way to get the memory usage of a CUDA context, rather than having to use cudaMemGetInfo, which only reports device-wide information? Or at least a way to find out how much memory is occupied by the current application?

It seems to be impossible. However, retrieving per-process memory usage is still a good alternative. And, as Robert has pointed out, per-process memory usage can be retrieved using NVML, specifically with the nvmlDeviceGetComputeRunningProcesses function.
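A minimal sketch of that NVML approach (assuming the program links against the NVML library and that device index 0 is the device of interest; the fixed-size array is just for illustration):

#include <stdio.h>
#include <nvml.h>

int main() {
    if (nvmlInit() != NVML_SUCCESS) return 1;

    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);                 // device 0; adjust as needed

    unsigned int count = 32;                             // generous upper bound for this sketch
    nvmlProcessInfo_t procs[32];
    if (nvmlDeviceGetComputeRunningProcesses(dev, &count, procs) == NVML_SUCCESS) {
        for (unsigned int i = 0; i < count; ++i)
            printf("pid %u uses %llu bytes of device memory\n",
                   procs[i].pid, (unsigned long long)procs[i].usedGpuMemory);
    }

    nvmlShutdown();
    return 0;
}

You would then pick your own process id out of the list to get the memory occupied by your application.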

Related

Is it possible to force cudaMallocManaged to allocate on a specific GPU id (e.g. via cudaSetDevice)?

I want to use cudaMallocManaged, but is it possible to force it to allocate memory on a specific GPU id (e.g. via cudaSetDevice) on a multi-GPU system?
The reason is that I need to allocate several arrays on the GPU, and I know which sets of these arrays need to work together, so I want to manually make sure they are on the same GPU.
I searched the CUDA documentation, but didn't find any info related to this. Can someone help? Thanks!
No, you can't do this directly via cudaMallocManaged. The idea behind managed memory is that the allocation migrates to whatever processor it is needed on.
If you want to manually make sure a managed allocation is "present" on (migrated to) a particular GPU, you would typically use cudaMemPrefetchAsync. Some examples are here and here. This is generally recommended for good performance if you know which GPU the data will be needed on, rather than using "on-demand" migration.
Some blogs on managed memory/unified memory usage are here and here, and some recorded training is available here, session 6.
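A minimal sketch of that prefetching pattern (the array size, device id, and the commented-out kernel are placeholders):

#include <cuda_runtime.h>

int main() {
    const size_t N = 1 << 20;
    const int targetDevice = 1;                   // the GPU the data should be resident on

    float *a = nullptr;
    cudaMallocManaged(&a, N * sizeof(float));     // the allocation itself does not fix placement

    // Migrate the managed allocation to the chosen GPU before it is needed there.
    cudaMemPrefetchAsync(a, N * sizeof(float), targetDevice, 0);

    cudaSetDevice(targetDevice);
    // myKernel<<<grid, block>>>(a, N);           // launch the work on that GPU here
    cudaDeviceSynchronize();

    cudaFree(a);
    return 0;
}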
From N.2.1.1. Explicit Allocation Using cudaMallocManaged() (emphasis mine):
By default, the devices of compute capability lower than 6.x allocate managed memory directly on the GPU. However, the devices of compute capability 6.x and greater do not allocate physical memory when calling cudaMallocManaged(): in this case physical memory is populated on first touch and may be resident on the CPU or the GPU.
So for any recent architecture it works like NUMA nodes on the CPU: Allocation says nothing about where the memory will be physically allocated. This instead is decided on "first touch", i.e. initialization. So as long as the first write to these locations comes from the GPU where you want it to be resident, you are fine.
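As an illustration of first-touch placement, here is a sketch (assuming a device of compute capability 6.x or higher; the init kernel and sizes are made up):

#include <cuda_runtime.h>

__global__ void firstTouchInit(float *a, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) a[i] = 0.0f;                       // the first write populates the physical pages
}

int main() {
    const size_t n = 1 << 20;
    const int owningDevice = 0;                   // the GPU that should end up owning the pages

    float *a = nullptr;
    cudaMallocManaged(&a, n * sizeof(float));     // on cc >= 6.x, no physical pages yet

    cudaSetDevice(owningDevice);
    firstTouchInit<<<(unsigned)((n + 255) / 256), 256>>>(a, n);
    cudaDeviceSynchronize();                      // pages are now resident on owningDevice

    cudaFree(a);
    return 0;
}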
Therefore I also don't think a feature request would find support. In this memory model, allocation and placement are completely independent operations.
In addition to explicit prefetching as Robert Crovella described it, you can give more information about which devices will access which memory locations in which way (reading/writing) by using cudaMemAdvise (See N.3.2. Data Usage Hints).
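A hedged sketch of such hints (the device id, size, and the "lookup table" use case are placeholders):

#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;
    const int gpuId = 0;

    float *lut = nullptr;
    cudaMallocManaged(&lut, n * sizeof(float));

    // Prefer keeping the pages resident on gpuId.
    cudaMemAdvise(lut, n * sizeof(float), cudaMemAdviseSetPreferredLocation, gpuId);

    // If the data is mostly read (e.g. a lookup table), read-only copies can be kept
    // on each accessing processor instead of migrating the pages back and forth.
    cudaMemAdvise(lut, n * sizeof(float), cudaMemAdviseSetReadMostly, gpuId);

    // Declare that the host will also access this memory.
    cudaMemAdvise(lut, n * sizeof(float), cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);

    cudaFree(lut);
    return 0;
}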
The idea behind all this is that you can start off by just using cudaMallocManaged, without caring about placement etc., during fast prototyping. Later you profile your code and optimize the slow parts using hints and prefetching to get (almost) the same performance as with explicit memory management and copies. The final code may not be much easier to read or less complex than with explicit management (e.g. cudaMemcpy gets replaced with cudaMemPrefetchAsync), but the big difference is that you pay for certain mistakes with worse performance instead of with a buggy application, e.g. corrupted data that might be overlooked.
In multi-GPU applications this idea of not caring about placement at the start is probably not applicable, but NVIDIA seems to want cudaMallocManaged to be as uncomplicated as possible for this type of workflow.

Meaning of memory allocation for a Google Cloud Function

I'm developing functions on Google Cloud Functions using Firebase. I saw that each function has a memory allocation, but I can't find documentation describing what it is for and what it means.
From what I understand, the more memory you allocate, the more instances of the function can be created at runtime. So the more people use my app, the more memory I need to allocate for each function.
Is it right to think of it like this? Does anyone have documentation for that?
Thanks in advance and have a good day!
Adrien
Not quite: the memory allocation is the amount of RAM (and related resources) each instance of the function gets, which can be helpful when processing bulk documents, images, video compression, etc.
I have also heard that increasing the memory allocation also increases the CPU clock, from 400 MHz at 256 MB of RAM up to 4.8 GHz at 8 GB, respectively.
This is found at the end of the first section, before the free tier:
https://cloud.google.com/functions/pricing#cloud-functions-pricing

Matlab - Memory usage of a program

I'm currently implementing different signal processing algorithms in MATLAB, in order to later implement one of them in C++. To choose between them I'm going to perform a number of tests, one being a memory usage check. That is, I want to see how much memory the different algorithms use. Since the implementations are divided into sub-functions, I'm having problems collecting information about the actual memory usage.
This is what I've tried so far:
I've used the profiler to check memory usage of every function.
Problem: It only shows allocated memory usage. It doesn't show e.g. memory usage of variables in every function.
I've used whos at the end of every function to collect information about all the variables in the workspace of the functions. I then added these to a global variable.
Problem: The global variable keeps increasing even after the execution is done and it seems to never stop.
Now to my question. How can I, in a rather simple way, get information about the memory usage of my program, all functions included?
Best regards
I think your strategy of calling whos at the end of every function (just before it returns) is a good one, but maybe you want to print the result to the screen rather than accumulate it in a global. If it "keeps increasing", then maybe you have a callback function that is being called unbeknownst to you, and that includes one of your whos calls. By printing to the screen (and maybe including a disp('**** memory usage at the end of <function name> ***') just before it), you will find out why it "keeps going".
The alternative of using memory is somewhat helpful, but it gives information about "available" memory, as well as all the memory used by Matlab (not just the variables).
Of course any snapshot of memory usage doesn't necessarily grab the peak - it's possible that a statement like
x = sum(repmat(A, [1000 1]));
would require quite a large peak memory usage (as you replicate the matrix A 1000 times), yet a snapshot of memory (or running whos) right before or after won't tell you what just happened...
The best way to monitor memory usage is to use the profiler, with the memory option turned on:
profile -memory on
% run your code
profreport
The profiler returns memory usage and function calls statistics. Note that the memory option has an impact on your execution speed.
You can use the memory function. Also, see the memory management functions. Take a look at MATLAB memory usage.

CUDA: Can I find out if I have global memory coalescence?

I'm using a GeForce GTX 580 (compute capability 2.0).
In my program I suspect that the bottleneck is access to global memory in the kernel. I suspect this because all the calculations involve numbers obtained by indexing an array stored in global memory, and because switching from double precision to single precision only improves performance by about 10%. (AFAIK it should be roughly twice as fast on a Fermi device if the floating-point operations are the bottleneck.)
So to improve this bottleneck I thought about memory coalescence. The problem here is that I don't know if I achieved it or not. Either I already have it, and this is as good as it gets (25 times faster than the sequential version on an Intel i7), or I might get it to run much faster by somehow rewriting the code to get coalescence.
But is there a way to know? Can I somehow "turn off" coalescence to find out, or find out in another way?
The CUDA Visual Profiler will show you the load/store efficiency of each kernel in the summary table; Grizzly gave a good answer about how this has changed in the newer cards here: Compute Prof's fields for incoherent and coherent gst/gld? (CUDA/OpenCL)
No, memory coalescence is not something you turn on or off; it is something you achieve by using the correct memory access patterns and alignment. I am not sure, as I have never used it (I don't work on Windows), but I think NVIDIA's Parallel Nsight can tell you whether your memory accesses are coalesced or not.
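As a rough illustration of the access patterns in question (a sketch with made-up kernels), consecutive threads of a warp should read consecutive addresses:

// Coalesced: thread i touches element i, so a warp reads one contiguous,
// aligned segment of global memory in few transactions.
__global__ void coalescedRead(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Strided / uncoalesced: neighbouring threads touch addresses that are
// `stride` elements apart, so a warp needs many separate transactions.
__global__ void stridedRead(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[((size_t)i * stride) % n] * 2.0f;
}

Profiling two such kernels side by side should show a clear difference in global load efficiency, which is one practical way to see how close your own kernel is to the coalesced case.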

Where can I find information about the Unified Virtual Addressing in CUDA 4.0?

Where can I find information / changesets / suggestions for using the new enhancements in CUDA 4.0? I'm especially interested in learning about Unified Virtual Addressing.
Note: I would really like to see an example where we can access the RAM directly from the GPU.
Yes, using host memory (if that is what you mean by RAM) will most likely slow your program down, because transfers to/from the GPU take some time and are limited by RAM and PCI bus transfer rates. Try to keep everything in GPU memory: upload once, execute the kernel(s), download once. If you need anything more complicated, try to use asynchronous memory transfers with streams.
As far as I know, "Unified Virtual Addressing" is really more about using multiple devices and abstracting away from explicit memory management. Think of it as a single virtual GPU; everything else is still valid.
Accessing host memory from the device without explicit copies is already possible with device-mapped memory. See cudaHostAlloc and the cudaMalloc* family in the reference manual found on the NVIDIA CUDA website.
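A sketch of that mapped (zero-copy) host memory approach, which lets a kernel dereference host RAM directly (the kernel and sizes are made up; expect it to be slow, since every access crosses the PCIe bus):

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void incrementHostMemory(int *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1;                          // reads/writes go over the PCIe bus
}

int main() {
    const int n = 1024;

    cudaSetDeviceFlags(cudaDeviceMapHost);         // enable mapping before the context is created

    int *hostPtr = nullptr, *devPtr = nullptr;
    cudaHostAlloc(&hostPtr, n * sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&devPtr, hostPtr, 0); // device-side alias of the host allocation

    for (int i = 0; i < n; ++i) hostPtr[i] = i;

    incrementHostMemory<<<(n + 255) / 256, 256>>>(devPtr, n);
    cudaDeviceSynchronize();

    printf("hostPtr[10] = %d\n", hostPtr[10]);     // expect 11
    cudaFreeHost(hostPtr);
    return 0;
}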
CUDA 4.0 UVA (Unified Virtual Addressing) does not help you access main memory from CUDA threads. As in previous versions of CUDA, you still have to map the main memory using the CUDA API for direct access from GPU threads, and it will slow down performance as mentioned above. Similarly, you cannot access GPU device memory from a CPU thread just by dereferencing a pointer to the device memory. UVA only guarantees that the address spaces do not overlap across multiple devices (including CPU memory); it does not provide coherent accessibility.