Why is learning at CPU slower than at GPU - catboost

I have:
GPU : GeForce RTX 2070 8GB.
CPU : AMD Ryzen 7 1700 Eight-Core Processor.
RAM : 32GB.
Driver Version: 418.43.
CUDA Version: 10.1.
In my project, the GPU is also slower than the CPU, but here I will use the example from the documentation.
from catboost import CatBoostClassifier
import time

start_time = time.time()
train_data = [[0, 3],
              [4, 1],
              [8, 1],
              [9, 1]]
train_labels = [0, 0, 1, 1]
# task_type was set to "CPU" for one run and "GPU" for the other
model = CatBoostClassifier(iterations=1000, task_type="CPU")  # or task_type="GPU"
model.fit(train_data, train_labels, verbose=False)
print(time.time() - start_time)
Training time on gpu: 4.838074445724487
Training time on cpu: 0.43390488624572754
Why is the training time on gpu more than on cpu?

Be careful: I have no experience with CatBoost, so the following is from a CUDA point of view.
Data transfer: launching a kernel (a function called by the host, e.g. the CPU, and executed by the device, e.g. the GPU) requires data to be transferred from host to device. Transfer time grows with the data size, and by default host memory is pageable (non-pinned, i.e. allocated with plain malloc() rather than cudaMallocHost()), which makes transfers slower. See https://www.cs.virginia.edu/~mwb7w/cuda_support/pinned_tradeoff.html to get an idea of how transfer time scales with data size and to find out more.
Kernel launch overhead: each time the host calls a kernel, it enqueues the kernel in the device's work queue; i.e. for each iteration, the host instantiates a kernel and adds it to the queue. Before the introduction of CUDA Graphs (whose documentation also points out that kernel launch overhead can be significant when the kernel has a short execution time), the overhead of each kernel launch cannot be avoided. I don't know how CatBoost handles its iterations, but given the difference between the execution times, it does not seem to have amortized the launch overheads (IMHO).
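A rough way to see both effects, sketched with CuPy rather than CatBoost internals (the library, array sizes and timing loop below are my own illustration, not anything CatBoost does): host-to-device transfer time grows with the amount of data, and even a tiny transfer pays a fixed overhead.

import time
import numpy as np
import cupy as cp

for n in (1_000, 1_000_000, 100_000_000):
    host = np.zeros(n, dtype=np.float32)   # pageable (non-pinned) host memory
    start = time.time()
    dev = cp.asarray(host)                 # host -> device transfer
    cp.cuda.Device().synchronize()
    print(f"{n:>11} floats copied to the GPU in {time.time() - start:.6f} s")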

CatBoost uses different techniques for small datasets (rows < 50k or columns < 10) that reduce overfitting but take more time. Try training with a gigantic dataset, for instance the Epsilon dataset; see the GitHub issue https://github.com/catboost/catboost/issues/505. A benchmark sketch follows below.
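Here is a sketch of such a comparison (my own benchmark outline, not from the issue; it assumes catboost and scikit-learn are installed and uses a synthetic dataset in place of Epsilon): on a few hundred thousand rows the GPU typically pulls ahead, while the 4-row toy example above is dominated by fixed overhead.

import time
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification

# Synthetic dataset large enough for the GPU to amortize transfer and launch overhead.
X, y = make_classification(n_samples=200_000, n_features=100, random_state=0)

for task_type in ("CPU", "GPU"):
    model = CatBoostClassifier(iterations=200, task_type=task_type, verbose=False)
    start = time.time()
    model.fit(X, y)
    print(task_type, "training time:", time.time() - start)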

Related

How to extend tensorflow's GPU memory from system RAM

I want to train Google's object detection API with Faster R-CNN and ResNet-101 on the MSCOCO dataset. I used only 10,000 images for training. My graphics card is a GeForce 930M/PCIe/SSE2, with NVIDIA driver version 384.90.
I have 8 GB of system RAM, but TensorFlow reports only 1.96 GB of GPU memory.
How can I extend my GPU's memory? I want to use the full system memory.
You can train on the CPU to take advantage of the RAM on your machine. However, to run something on the GPU it has to be loaded into GPU memory first. You can swap memory in and out, because not all of the results are needed at every step, but you pay with a very long training time, and I would rather advise you to reduce the batch size. Nevertheless, details about this process and an implementation can be found here: https://medium.com/@Synced/how-to-train-a-very-large-and-deep-model-on-one-gpu-7b7edfe2d072.
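If you do want to fall back to the CPU so that the model lives in system RAM, a minimal TensorFlow 1.x sketch looks like this (the shapes here are made up for illustration):

import tensorflow as tf

# Pin the graph to the CPU so variables are stored in system RAM
# instead of the ~2 GB of GPU memory on the 930M.
with tf.device('/cpu:0'):
    x = tf.placeholder(tf.float32, shape=[None, 1024])
    w = tf.Variable(tf.random_normal([1024, 10]))
    logits = tf.matmul(x, w)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())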

Why mxnet's GPU version cost more memory than CPU version?

I made a very simple network using mxnet (two fully connected layers with dimension 512).
By changing ctx = mx.cpu() to ctx = mx.gpu(0), I run the same code on both CPU and GPU.
The host memory cost of the GPU version is much bigger than that of the CPU version (I checked that using 'top', not 'nvidia-smi').
It seems strange: the GPU version already has memory on the GPU, so why does it still need more space in host memory?
(In the screenshot, line 1 is the CPU program and line 2 is the GPU program.)
It may be due to differences in how long each process was running.
Looking at your screenshot, the CPU process shows 5:48.85 while the GPU process shows 9:11.20, so the GPU training ran for almost double the time, which could be the reason.
When running on the GPU you load a bunch of lower-level libraries (CUDA, cuDNN, etc.) which are mapped into your process's RAM first. If your network is very small, as in your current case, the overhead of loading these libraries in RAM is higher than the cost of storing the network weights in RAM.
For any more sizable network running on the CPU, the amount of memory used by the weights will be significantly larger than that used by the loaded libraries.
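One way to check this yourself (a sketch of mine, not from the question; it assumes mxnet with CUDA support and psutil are installed): host RSS jumps by a large fixed amount as soon as the GPU context is created, before any network exists.

import os
import psutil
import mxnet as mx

proc = psutil.Process(os.getpid())
print("RSS before GPU context: %d MB" % (proc.memory_info().rss // 2**20))

# Touching the GPU forces CUDA, cuDNN, etc. to be loaded into host memory.
a = mx.nd.zeros((1,), ctx=mx.gpu(0))
a.wait_to_read()

print("RSS after GPU context:  %d MB" % (proc.memory_info().rss // 2**20))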

High cpu utilization while using 3D convolution on gpu in theano 0.9

My cpu utilization is 100% when using theano.tensor.nnet.conv3d
I am adding conv3d_fft, convgrad3d_fft and convtransp3d_fft to my compiling mode in the theano function.
The interesting part is my gpu utilization is also high. My data is all on gpu memory.
Any ideas about why the CPU utilization is so high?
Thanks!
Update: I tried the same convolution in Keras with Theano backend. It is shockingly faster than theano despite the fact that it also uses conv3d.
To invoke a task on the GPU, the CPU has to do a non-negligible amount of work. If the tasks on the GPU are small, the work on the CPU becomes comparatively large.
By the way, memory transfers are done by the DMA engine and do not cause CPU load beyond invoking the transfer.
I once had a CUDA program where the CPU invoked a GPU task every ~100 µs. Profiling the program, I found that it was CPU bound, even though the CPU did nothing except invoke GPU tasks.
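As an illustration of that effect (a CuPy sketch under my own assumptions, not the original Theano code): issuing a very large number of tiny GPU kernels keeps a CPU core busy even though each kernel does almost no work.

import time
import cupy as cp

x = cp.ones((32, 32), dtype=cp.float32)

start = time.time()
for _ in range(100_000):
    x = x + 1                       # each iteration enqueues a tiny kernel from the CPU side
cp.cuda.Stream.null.synchronize()
print("100k tiny kernel launches took", time.time() - start, "s")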

CUDA: why is there a large amount of GPU idle time?

Question
total GPU time + total CPU overhead is smaller than the total execution time. Why?
Detail
I am studying how frequent global memory access and kernel launch may affect the performance and I have designed a code which has multiple small kernels and ~0.1 million kernel calls in total. Each kernel reads data from global memory, processes them and then writes back to the global memory. As expected, the code runs much slower than the original design which has only one large kernel and very few kernel launches.
The problem arose when I used the command-line profiler to get "gputime" (the execution time for a GPU kernel or memory copy) and "cputime" (the CPU overhead for a non-blocking call, or the sum of gputime and CPU overhead for a blocking call). To my understanding, the sum of all gputimes and all cputimes should exceed the entire execution time (the last "gpuendtimestamp" minus the first "gpustarttimestamp"), but it turns out the contrary is true (sum of gputimes = 13.835064 s, sum of cputimes = 4.547344 s, total time = 29.582793 s). Between the end of one kernel and the start of the next there is often a large amount of waiting time, larger than the CPU overhead of the next kernel. Most of the kernels suffering from this problem are memcpyDtoH, memcpyDtoD, and Thrust internal functions such as launch_closure_by_value, fast_scan, etc. What is the probable reason?
System
Windows 7, TCC driver, VS 2010, CUDA 4.2
Thanks for your help!
This is possibly a combination of profiling, which increases latency, and the Windows WDDM subsystem. To overcome the high latency of the latter, the CUDA driver batches GPU operations and submits them in groups with a single Windows kernel call. This can cause large periods of GPU inactivity if CUDA API commands are sitting in an unsubmitted batch.
(Copied @talonmies' comment into an answer, to enable voting and accepting.)

GPU reads from CPU or CPU writes to the GPU?

I am a beginner in parallel programming. I have a query which might seem silly, but I didn't get a definitive answer when I googled it.
In GPU computing there is a device, i.e. the GPU, and a host, i.e. the CPU. I wrote a simple hello-world program which allocates some memory on the GPU, passes two parameters (say src[] and dest[]) to the kernel, copies the src string, i.e. "Hello world", to the dest string, and gets the dest string back from the GPU to the host.
Is the string src read by the GPU, or does the CPU write it to the GPU? Also, when we get the string back from the GPU, does the GPU write to the CPU or does the CPU read from the GPU?
In transferring the data back and forth there can be four possibilities
1. CPU to GPU
- CPU writes to GPU
- GPU reads from CPU
2. GPU to CPU
- GPU writes to the CPU
- CPU reads from GPU
Can someone please explain which of these are possible and which are not?
In earlier versions of CUDA and corresponding hardware models, the GPU was more strictly a coprocessor owned by the CPU; the CPU wrote information to the GPU, and read the information back when the GPU was ready. At the lower level, this meant that really all four things were happening: the CPU wrote data to PCIe, the GPU read data from PCIe, the GPU then wrote data to PCIe, and the CPU read back the result. But transactions were initiated by the CPU.
More recently (CUDA 3? 4? maybe even beginning in 2?), some of these details are hidden from the application level, so that, effectively, GPU code can cause transfers to be initiated in much the same way as the CPU can. Consider unified virtual addressing, whereby programmers can access a unified virtual address space for CPU and GPU memory. When the GPU requests memory in the CPU space, this must initiate a transfer from the CPU, essentially reading from the CPU. The ability to put data onto the GPU from the CPU side is also retained. Basically, all ways are possible now, at the top level (at low levels, it's largely the same sort of protocol as always: both read from and write to the PCIe bus, but now, GPUs can initiate transactions as well).
Actually none of these.
Your CPU code initiates the copy of the data, but the data itself is transferred by the memory controller to the memory of the GPU through whatever bus you have on your system. Meanwhile, the CPU can process other data.
Similarly, when the GPU has finished running the kernels you launched, your CPU code initiates the copy of data, but meanwhile both GPU and CPU can handle other data or run other code.
The copies are called asynchronous or non-blocking. You can optionally do blocking copies, in which the CPU waits for the copy to be completed.
When launching asynchronous tasks, you usually register an "event", which is some kind of flag that you can check later on, to see if the task is finished or not.
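For concreteness, here is a small Python sketch of those host-initiated transfers using CuPy (my own example, not from the answers above); both directions are started by CPU code, while the bytes are moved by the DMA engine.

import numpy as np
import cupy as cp

src = np.frombuffer(b"Hello world", dtype=np.uint8)

dst_gpu = cp.asarray(src)       # host -> device copy, initiated by the CPU
dst_cpu = cp.asnumpy(dst_gpu)   # device -> host copy, also initiated by the CPU

print(dst_cpu.tobytes())        # b'Hello world'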
In OpenCL the host (CPU) exclusively controls all transfers of data between the host and the GPU. The host transfers data to the GPU using buffers, and transfers (reads) data back from the GPU using buffers. For some systems and devices, the transfer isn't physically copying bytes, as the host and GPU use the same physical memory; this is called zero copy.
I just found out in this forum http://devgurus.amd.com/thread/129897 that using CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR in clCreateBuffer allocates memory on the host and that it won't be copied to the device.
There may be an issue with performance, but this is what I am looking for. Your comments, please.
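A hedged pyopencl sketch of those buffer flags (assuming pyopencl is installed; whether the result is truly zero-copy depends on the device and driver, as noted above):

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

host_data = np.arange(1024, dtype=np.float32)
mf = cl.mem_flags
# Allocate host-accessible memory and initialize it from host_data.
buf = cl.Buffer(ctx, mf.ALLOC_HOST_PTR | mf.COPY_HOST_PTR, hostbuf=host_data)

out = np.empty_like(host_data)
cl.enqueue_copy(queue, out, buf)   # read the buffer back to verify its contents
queue.finish()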