Meaning of memory allocation for Google Cloud Functions - google-cloud-functions

I'm developing functions on Google Cloud Functions using Firebase. I saw that each function has a memory allocation setting, but I can't find documentation describing its purpose and meaning.
From what I understand, the more memory you allocate, the more instances of the function can be created at runtime. So the more people use my app, the more memory I need to allocate for each function.
Is it right to think like this? Does someone have documentation for that?
Thanks in advance and have a good day!
Adrien

Not quite. The memory allocation determines the resources (RAM, and with it CPU) available to a single instance of your function, not how many instances can be created; instances are scaled automatically with load. A higher allocation helps when one invocation has to do memory-heavy work such as processing bulk documents, images, or video compression.
I have also heard that increasing the memory allocation also increases the CPU allocated to the function, from 400 MHz at 256 MB of RAM up to 4.8 GHz at 8 GB, respectively.
This is shown near the end of the first part of the pricing page, just before the free tier:
https://cloud.google.com/functions/pricing#cloud-functions-pricing

Related

Is it possible to force cudaMallocManaged to allocate on a specific GPU id (e.g. via cudaSetDevice)?

I want to use cudaMallocManaged, but is it possible to force it to allocate memory on a specific GPU id (e.g. via cudaSetDevice) on a multi-GPU system?
The reason is that I need to allocate several arrays on the GPU, and I know which sets of these arrays need to work together, so I want to manually make sure they are on the same GPU.
I searched the CUDA documentation but didn't find any info related to this. Can someone help? Thanks!
No you can't do this directly via cudaMallocManaged. The idea behind managed memory is that the allocation migrates to whatever processor it is needed on.
If you want to manually make sure a managed allocation is "present" on (migrated to) a particular GPU, you would typically use cudaMemPrefetchAsync. Some examples are here and here. This is generally recommended for good performance if you know which GPU the data will be needed on, rather than using "on-demand" migration.
Some blogs on managed memory/unified memory usage are here and here, and some recorded training is available here, session 6.
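A minimal sketch of that pattern, assuming a hypothetical target GPU id of 1 and an arbitrary array size (error checking omitted):
```
#include <cuda_runtime.h>

int main() {
    const int targetDevice = 1;                       // hypothetical GPU id
    const size_t bytes = (1 << 20) * sizeof(float);   // arbitrary example size

    cudaSetDevice(targetDevice);

    float *a = nullptr;
    cudaMallocManaged((void **)&a, bytes);            // no placement decided yet

    // Migrate the pages to the target GPU up front instead of relying on
    // on-demand page migration when a kernel first touches them.
    cudaMemPrefetchAsync(a, bytes, targetDevice, 0);

    // ... launch kernels on targetDevice that use `a` ...

    cudaDeviceSynchronize();
    cudaFree(a);
    return 0;
}
```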
From N.2.1.1. Explicit Allocation Using cudaMallocManaged() (emphasis mine):
By default, the devices of compute capability lower than 6.x allocate managed memory directly on the GPU. However, the devices of compute capability 6.x and greater do not allocate physical memory when calling cudaMallocManaged(): in this case physical memory is populated on first touch and may be resident on the CPU or the GPU.
So for any recent architecture it works like NUMA nodes on the CPU: Allocation says nothing about where the memory will be physically allocated. This instead is decided on "first touch", i.e. initialization. So as long as the first write to these locations comes from the GPU where you want it to be resident, you are fine.
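A small illustrative sketch of first-touch placement (the device id and size are made-up values, and error checking is omitted):
```
#include <cuda_runtime.h>

__global__ void firstTouch(float *data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] = 0.0f;          // the first write decides physical placement
}

int main() {
    const size_t n = 1 << 20;                          // arbitrary example size
    float *a = nullptr;
    cudaMallocManaged((void **)&a, n * sizeof(float)); // no physical pages yet on cc >= 6.x

    cudaSetDevice(1);                                  // hypothetical GPU the data should live on
    int blocks = (int)((n + 255) / 256);
    firstTouch<<<blocks, 256>>>(a, n);                 // populate the pages on that GPU
    cudaDeviceSynchronize();

    cudaFree(a);
    return 0;
}
```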
Therefore I also don't think a feature request will find support. In this memory model, allocation and placement are simply independent operations.
In addition to explicit prefetching as Robert Crovella described it, you can give more information about which devices will access which memory locations in which way (reading/writing) by using cudaMemAdvise (See N.3.2. Data Usage Hints).
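A short sketch of what such hints might look like (the device ids are hypothetical and error checking is omitted):
```
#include <cuda_runtime.h>

int main() {
    const size_t bytes = (1 << 20) * sizeof(float);    // arbitrary example size
    float *a = nullptr;
    cudaMallocManaged((void **)&a, bytes);

    // Prefer to keep the pages resident on GPU 0 ...
    cudaMemAdvise(a, bytes, cudaMemAdviseSetPreferredLocation, 0);
    // ... while letting GPU 1 access them remotely instead of migrating them.
    cudaMemAdvise(a, bytes, cudaMemAdviseSetAccessedBy, 1);
    // If the data is mostly read, read-only copies may be duplicated on readers.
    cudaMemAdvise(a, bytes, cudaMemAdviseSetReadMostly, 0);

    // ... kernels on GPU 0 and GPU 1 that use `a` ...

    cudaFree(a);
    return 0;
}
```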
The idea behind all this is that you can start off by just using cudaMallocManaged and not caring about placement, etc. during fast prototyping. Later you profile your code and then optimize the parts that are slow using hints and prefetching to get (almost) the same performance as with explicit memory management and copies. The final code may not be much easier to read or less complex than with explicit management (e.g. cudaMemcpy gets replaced with cudaMemPrefetchAsync), but the big difference is that you pay for certain mistakes with worse performance instead of a buggy application with, e.g., corrupted data that might be overlooked.
In Multi-GPU applications this idea of not caring about placement at the start is probably not applicable, but NVIDIA seems to want cudaMallocManaged to be as uncomplicated as possible for this type of workflow.

CUDA memory issue with allennlp coreference resolution API

Hi, I have a CUDA memory issue even though I am using multiple GPUs. I am calling the coreference resolution API on a long document (around 2000 words). It seems that the memory usage is not parallelized across the GPUs. How can I solve this issue? (I am currently using the API as shown here: https://demo.allennlp.org/coreference-resolution)
The coref model uses a lot of memory. It does not automatically take advantage of multiple GPUs. The best thing you can do is reduce the maximum sequence length you send to the model until it fits.

What affects GCP Cloud Function memory usage

I recently redeployed a handful of Python GCP Cloud Functions and noticed they are using about 50 MB more memory, triggering memory limit errors (I had to increase the memory allocation from 256 MB to 512 MB to get them to run). Unfortunately, that is 2x the cost.
I am trying to figure out what caused the memory increase. The only thing I can think of is a recent Python package upgrade, so I pinned all package versions in requirements.txt based on my local virtual env, which has not changed lately. The memory usage increase remained.
Are there other factors that would lead to a memory utilization increase? The Python runtime is still 3.7, and the data that the functions process has not changed. It also doesn't seem to be a change that GCP has made to Cloud Functions in general, because it has only happened with functions I have redeployed.
I can point out a few possible causes of memory limit errors:
One of the reasons for running out of memory in Cloud Functions is discussed in the documentation:
Files that you write consume memory available to your function, and sometimes persist between invocations. Failing to explicitly delete these files may eventually lead to an out-of-memory error and a subsequent cold start.
As mentioned in this StackOverflow answer, if you allocate anything in the global scope without releasing it, that memory remains allocated and is carried over to future invocations, because instances are reused. To minimize memory usage, only allocate objects locally so they get cleaned up when the function completes. Memory leaks are often difficult to detect.
Also, Cloud Functions need to send a response when they're done; if they don't, their allocated resources won't be freed. An unhandled exception in a function may therefore lead to a memory limit error.
You may also want to check Auto-scaling and Concurrency, which mentions another possibility:
Each instance of a function handles only one concurrent request at a time. This means that while your code is processing one request, there is no possibility of a second request being routed to the same instance. Thus the original request can use the full amount of resources (CPU and memory) that you requested.
Lastly, this may be caused by issues with logging. If you are logging objects, this may prevent these objects from being garbage collected. You may need to make the logging less verbose and use string representations to see if the memory usage gets better. Either way, you could try using the Profiler in order to get more information about what’s going on with your Cloud Function’s memory.

Get memory usage of a CUDA context

Is there a way to get the memory usage of a CUDA context, rather than having to use cudaMemGetInfo, which only reports global information for a device? Or at least a way to get how much memory is occupied by the current application?
It seems this is not possible. However, retrieving per-process memory usage is still a good alternative, and as Robert has pointed out, per-process memory usage can be retrieved using NVML, specifically with the nvmlDeviceGetComputeRunningProcesses function.
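A rough sketch of how that NVML query can be used (the device index and array capacity are arbitrary, and error handling is minimal):
```
// link with -lnvidia-ml (NVML ships with the driver / CUDA toolkit)
#include <nvml.h>
#include <stdio.h>

int main(void) {
    nvmlInit();

    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);      // device index 0 as an example

    unsigned int count = 32;                  // in: capacity of the array, out: number of processes
    nvmlProcessInfo_t procs[32];
    if (nvmlDeviceGetComputeRunningProcesses(dev, &count, procs) == NVML_SUCCESS) {
        for (unsigned int i = 0; i < count; ++i) {
            printf("pid %u uses %llu bytes of device memory\n",
                   procs[i].pid, (unsigned long long)procs[i].usedGpuMemory);
        }
    }

    nvmlShutdown();
    return 0;
}
```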

Where can I find information about the Unified Virtual Addressing in CUDA 4.0?

Where can I find information / changesets / suggestions for using the new enhancements in CUDA 4.0? I'm especially interested in learning about Unified Virtual Addressing.
Note: I would really like to see an example where we can access the RAM directly from the GPU.
Yes, using host memory (if that is what you mean by RAM) will most likely slow your program down, because transfers to/from the GPU take some time and are limited by RAM and PCI bus transfer rates. Try to keep everything in GPU memory. Upload once, execute kernel(s), download once. If you need anything more complicated try to use asynchronous memory transfers with streams.
As far as I know, "Unified Virtual Addressing" is really more about using multiple devices and abstracting away explicit memory management between them. Think of it as a single virtual GPU; everything else is still valid.
Using host memory automatically is already possible with device-mapped memory. See cudaMalloc* in the reference manual found on the NVIDIA CUDA website.
CUDA 4.0 UVA (Unified Virtual Addressing) does not help you access main memory from CUDA threads. As in previous versions of CUDA, you still have to map the main memory using the CUDA API for direct access from GPU threads, and it will slow down performance as mentioned above. Similarly, you cannot access GPU device memory from a CPU thread just by dereferencing a pointer to the device memory. UVA only guarantees that the address spaces do not overlap across multiple devices (including CPU memory); it does not provide coherent accessibility.
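As an illustration of the mapped (zero-copy) host memory mentioned above, here is a minimal sketch; the size and kernel are made up, and error checking is omitted:
```
#include <cuda_runtime.h>

__global__ void incrementInHostRam(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;          // every access goes over the PCIe bus to host RAM
}

int main() {
    const int n = 1024;                               // arbitrary example size

    // Historically required so host allocations can be mapped into the
    // device address space; must come before the context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    int *hostPtr = nullptr;
    cudaHostAlloc((void **)&hostPtr, n * sizeof(int), cudaHostAllocMapped);

    // Device-side alias of the host pointer (with UVA the two are usually
    // identical, but the query is the portable way to obtain it).
    int *devPtr = nullptr;
    cudaHostGetDevicePointer((void **)&devPtr, hostPtr, 0);

    incrementInHostRam<<<(n + 255) / 256, 256>>>(devPtr, n);
    cudaDeviceSynchronize();

    cudaFreeHost(hostPtr);
    return 0;
}
```
Note that every access in such a kernel goes over the PCIe bus, which is why the answers above warn that this is usually slower than keeping the data in device memory.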