I hear this term used a lot, something about a build OOMing or running out of memory; what does that mean? I'm speaking from the context of running a dataset build in Transforms Python or Transforms SQL.
OOM == OutOfMemory
In the case of a Transform, this happens when the JVM tries to allocate more heap memory than it has available or can free using GC (Garbage Collection). This can happen, for example, in your driver when materializing huge query plans, or in an executor when dealing with very large array columns or other data that can't fit into memory.
It can also happen when JVM and non-JVM processes together use more non-heap memory than is available from the combination of unused heap memory and memoryOverhead (this applies to both the driver and executors).
Typical causes are not having enough main JVM memory or memoryOverhead, or using too much Python memory (e.g. on your driver when calling .collect(), or on an executor when using UDFs).
I am writing a simplistic raytracer. The idea is that for every pixel there is a thread that traverses a certain structure (geometry) that resides in global memory.
I invoke my kernel like so:
trace<<<gridDim, blockDim>>>(width, height, frameBuffer, scene)
Where scene is a structure that was previously allocated with cudaMalloc. Every thread has to start traversing this structure starting from the same node, and chances are that many concurrent threads will attempt to read the same nodes many times. Does that mean that when such reads take place, it cripples the degree of parallelism?
Given that geometry is large, I would assume that replicating it is not an option. I mean the whole processing still happens fairly fast, but I was wondering whether it is something that has to be dealt with, or simply left flung to the breeze.
First of all, I think you have the wrong idea when you ask whether concurrent reads cripple the degree of parallelism; that is exactly what it means to be parallel. Each thread reads concurrently. The question you should really be asking is whether performance suffers from the extra memory accesses when each thread basically wants the same thing, i.e. the same node.
According to the article here, memory accesses can be coalesced only within a warp, and only if data locality is present.
That means that if the threads within a warp are trying to access memory locations near each other, those accesses can be coalesced. In your case, each thread is trying to access the "same" node until it reaches the point where the traversal branches.
So the memory accesses will be coalesced within the warp until the threads branch off.
Efficient access to global memory from each thread depends on both your device architecture and your code. Arrays allocated in global memory are aligned to 256-byte memory segments by the CUDA driver. The device can access global memory via 32-, 64-, or 128-byte transactions that are aligned to their size. The device coalesces global memory loads and stores issued by the threads of a warp into as few transactions as possible to minimize the DRAM bandwidth used. A misaligned data access on devices with compute capability less than 2.0 reduces the effective bandwidth of accessing data; this is not a serious issue on devices with compute capability > 2.0. That said, pretty much regardless of your device generation, accessing global memory with large strides gives poor effective bandwidth (Reference), and I would assume the same behavior is likely for random access.
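To make that concrete, here is a minimal sketch (my own illustration, not code from the post; the kernel and array names are made up) contrasting a fully coalesced access pattern with a large-stride one:

__global__ void coalescedRead(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];   // neighbouring threads read neighbouring addresses,
                          // so the warp's requests collapse into a few transactions
}

__global__ void stridedRead(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(i * stride) % n];   // a large stride scatters the warp's reads
                                         // across many segments, so many more
                                         // transactions are issued for the same data
}

The shared starting node in your traversal is closer to the first case: all threads of a warp reading the same node land in the same memory segment, so those reads coalesce much like neighbouring addresses do.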
If you are changing the structure while it is being read, which I assume you are (if it's a scene, you probably render it each frame?), then yes, it cripples performance and may cause undefined behaviour. This is called a race condition. You can use atomic operations to overcome this type of problem; atomic operations guarantee that the conflicting accesses don't corrupt each other.
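For illustration only (the names are hypothetical, not from the question), this is the kind of atomic update meant here: if the traversal also had to update something shared, such as a per-node counter, a plain increment from many threads would race, whereas atomicAdd makes the read-modify-write indivisible:

__global__ void countVisits(int* counts, const int* hitNode, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&counts[hitNode[i]], 1);   // indivisible read-modify-write,
                                             // so concurrent updates are not lost
}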
You can try stuffing the 'scene' into shared memory, if you can fit it.
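A rough sketch of that suggestion, assuming the node array is small enough for the per-block shared memory limit; the Node layout and names are invented for the example:

struct Node { int left, right; float4 bounds; };   // hypothetical node layout

__global__ void traceShared(const Node* sceneNodes, int numNodes /*, ... */)
{
    extern __shared__ Node localNodes[];            // sized via the launch config

    // All threads in the block cooperate to stage the nodes once.
    for (int i = threadIdx.x; i < numNodes; i += blockDim.x)
        localNodes[i] = sceneNodes[i];
    __syncthreads();                                // make the copy visible to everyone

    // ... traverse localNodes[] instead of re-reading global memory ...
}

// Launched with the dynamic shared memory size as the third configuration parameter:
// traceShared<<<grid, block, numNodes * sizeof(Node)>>>(d_nodes, numNodes);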
You can also try using streams to increase concurrency; kernels launched into the same stream are implicitly ordered with respect to each other, which gives you a form of synchronization.
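A minimal sketch of the streams idea, with a made-up kernel standing in for one independent chunk of work; kernels queued in the same stream run in order, while kernels in different streams may overlap:

#include <cuda_runtime.h>

// Hypothetical kernel representing one independent chunk of work.
__global__ void processChunk(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void launchOverlapped(float* d_chunk0, float* d_chunk1, int n)
{
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    int block = 256;
    int grid  = (n + block - 1) / block;

    // Work in different streams may execute concurrently if resources allow.
    processChunk<<<grid, block, 0, s0>>>(d_chunk0, n);
    processChunk<<<grid, block, 0, s1>>>(d_chunk1, n);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}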
I am a newbie in CUDA programming and in the process of rewriting a C code as new, parallelized CUDA code.
Is there a way to write output data files directly from the device, without bothering to copy arrays from device to host? I assume that if cuPrintf exists, there must be a way to write a cuFprintf?
Sorry, if the answer has already been given in a previous topic, I can't seem to find it...
Thanks!
The short answer is: no, there is not.
cuPrintf and the built-in printf support in the Fermi and Kepler runtime are implemented using device-to-host copies. The mechanism is no different from using cudaMemcpy to transfer a buffer to the host yourself.
Just about all CUDA compatible GPUs support so-called zero-copy (AKA "pinned, mapped") memory, which allows the GPU to map a host buffer into its address space and execute DMA transfers into that mapped host memory. Note, however, that setup and initialisation of mapped memory has considerably higher overhead than conventional memory allocation (so you really need a lot of transactions to amortise that overhead over the life of your application), and that the CUDA driver can't use zero-copy with anything other than addresses backed by physical memory. So you can't mmap a file and use zero-copy on it, i.e. you will still need explicit host-side file IO code to get from a zero-copy buffer to disk.
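To make that last paragraph concrete, here is a minimal sketch (my own illustration, not code from the answer) of the zero-copy pattern: the kernel DMA-writes its results into a mapped host buffer, and ordinary host-side file IO still does the final write to disk. The kernel, buffer size, and file name are all made up:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that writes its results straight into mapped host memory.
__global__ void fillResults(float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i * 0.5f;
}

int main()
{
    const int n = 1 << 20;

    cudaSetDeviceFlags(cudaDeviceMapHost);                          // enable mapped memory

    float* h_buf = 0;
    cudaHostAlloc((void**)&h_buf, n * sizeof(float), cudaHostAllocMapped);

    float* d_buf = 0;
    cudaHostGetDevicePointer((void**)&d_buf, h_buf, 0);             // device-side alias

    fillResults<<<(n + 255) / 256, 256>>>(d_buf, n);                // DMA writes into host RAM
    cudaDeviceSynchronize();

    // Host-side file IO is still required to get the data onto disk.
    FILE* f = fopen("results.bin", "wb");
    fwrite(h_buf, sizeof(float), n, f);
    fclose(f);

    cudaFreeHost(h_buf);
    return 0;
}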
Is there a way to query the RAM footprint of a specific object in JRuby?
Have you heard about jmap, jhat, and visualvm?
jmap outputs the object memory maps / heap memory details for the given Java process.
jhat is a heap analysis tool that reads a Java heap dump (such as one produced by jmap) and lets you query it using a SQL-like language to get more detail.
visualvm is yet another Java tool, for viewing details about Java applications as they are running in the JVM.
I found this post from Charles Nutter to be very helpful in getting me started in JRuby profiling and memory inspection.
My program runs 2 threads - Thread A (for input) and B (for processing). I also have a pair of pointers to 2 buffers, so that when Thread A has finished copying data into Buffer 1, Thread B starts processing Buffer 1 and Thread A starts copying data into Buffer 2. Then when Buffer 2 is full, Thread A copies data into Buffer 1 and Thread B processes Buffer 2, and so on.
My problem comes when I try to cudaMemcpy Buffer[] into d_Buffer (which was previously cudaMalloc'd by the main thread, i.e. before thread creation; Buffer[] was also malloc'd by the main thread). I get an "invalid argument" error, but have no idea which argument is the invalid one.
I've reduced my program to a single-threaded program that still uses 2 buffers, so the copying and processing take place one after another instead of simultaneously. The cudaMemcpy line is exactly the same as in the double-threaded version. The single-threaded program works fine.
I'm not sure where the error lies.
Thank you.
Regards,
Rayne
If you are doing this with CUDA 3.2 or earlier, the reason is that GPU contexts are tied to a specific host thread. If a multi-threaded program allocates memory on the same GPU from different host threads, the allocations wind up establishing different contexts, and pointers from one context are not portable to another context. Each context has its own "virtualised" memory space to work with.
The solution is either to use the context migration API to transfer a single context from thread to thread as they do work, or to try the new public CUDA 4.0rc2 release, which should support what you are trying to do without the use of context migration. The downside is that 4.0rc2 is a testing release, and it requires a particular beta release driver. That driver won't work with all hardware (laptops, for example).
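For what the context migration route looks like in practice, here is a minimal sketch using the driver API calls cuCtxPopCurrent / cuCtxPushCurrent. It is an illustration under the assumption that a single context holds all of your allocations, not a drop-in fix for your program; the function names are mine:

#include <cuda.h>

// Thread A: done for now, detach the context so another thread can use it.
void handOffContext(CUcontext* ctx)
{
    cuCtxPopCurrent(ctx);        // the context is now "floating"
}

// Thread B: attach the floating context before touching any device pointers
// that were allocated in it (e.g. before the failing cudaMemcpy).
void takeOverContext(CUcontext ctx)
{
    cuCtxPushCurrent(ctx);       // allocations made in thread A are valid again here
}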