CUDA: Write directly from device to host pinned memory without sacrificing throughput? - cuda

In CUDA, is it possible to write directly to host (pinned) memory from a device kernel?
In my current setup, I first write to device DRAM and then copy from DRAM into host pinned memory.
I'm wondering if I can just write directly to host memory (i.e. use one step instead of two) without sacrificing throughput.
From what I understand, unified memory isn't the answer - guides mention that it's slower (perhaps because of its paging semantics?).
But I haven't tried it, so perhaps I'm mistaken - maybe there's an option to force everything to reside in host pinned memory?

There are numerous questions here on the cuda SO tag about how to use pinned memory for "zero-copy" operations. Here is one example. You can find many more.
If you only have to write to each output point once, and your writes are/would be nicely coalesced, then there should not be a major performance difference between the costs of:
writing to device memory and then cudaMemcpy D->H after the kernel
writing directly to host-pinned memory
You will still need a cudaDeviceSynchronize() after the kernel call, before accessing the data on the host, to ensure consistency.
Differences on the order of ~10 microseconds are still possible due to CUDA operation overheads.
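For illustration, here is a minimal sketch of the second option (my own example, assuming a 64-bit/UVA platform, where a pinned host pointer is directly usable in device code):
#include <cstdio>

// Sketch: the kernel writes its output directly into pinned host memory.
__global__ void fill(int *out, size_t n)
{
  for (size_t i = blockIdx.x*blockDim.x+threadIdx.x; i < n; i += gridDim.x*blockDim.x)
    out[i] = (int)i;               // grid-stride loop, coalesced writes
}

int main()
{
  const size_t n = 1048576;
  int *h;
  cudaHostAlloc(&h, n*sizeof(int), cudaHostAllocDefault);  // pinned host allocation
  fill<<<160, 1024>>>(h, n);       // device writes straight to host memory
  cudaDeviceSynchronize();         // required before the host touches the results
  printf("%d %d\n", h[0], h[n-1]);
  cudaFreeHost(h);
  return 0;
}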
It should be possible to demonstrate that bulk transfer of data using direct read/writes to pinned memory from kernel code will achieve approximately the same bandwidth as what you would get with a cudaMemcpy transfer.
As an aside, the "paging semantics" of unified memory can be worked around, but again, a well-optimized code in any of these three scenarios is not likely to show marked performance or duration differences.
Responding to comments: my use of "approximately" above is probably a stretch. Here's a kernel that writes 4GB of data in less than half a second on a PCIe Gen2 system:
$ cat t2138.cu
template <typename T>
__global__ void k(T *d, size_t n){
  for (size_t i = blockIdx.x*blockDim.x+threadIdx.x; i < n; i+=gridDim.x*blockDim.x)
    d[i] = 0;
}

int main(){
  int *d;
  size_t n = 1048576*1024;
  cudaHostAlloc(&d, sizeof(d[0])*n, cudaHostAllocDefault);
  k<<<160, 1024>>>(d, n);
  k<<<160, 1024>>>(d, n);
  cudaDeviceSynchronize();
  int *d1;
  cudaMalloc(&d1, sizeof(d[0])*n);
  cudaMemcpy(d, d1, sizeof(d[0])*n, cudaMemcpyDeviceToHost);
}
$ nvcc -o t2138 t2138.cu
$ compute-sanitizer ./t2138
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
$ nvprof ./t2138
==21201== NVPROF is profiling process 21201, command: ./t2138
==21201== Profiling application: ./t2138
==21201== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 72.48% 889.00ms 2 444.50ms 439.93ms 449.07ms void k<int>(int*, unsigned long)
27.52% 337.47ms 1 337.47ms 337.47ms 337.47ms [CUDA memcpy DtoH]
API calls: 60.27% 1.88067s 1 1.88067s 1.88067s 1.88067s cudaHostAlloc
28.49% 889.01ms 1 889.01ms 889.01ms 889.01ms cudaDeviceSynchronize
10.82% 337.55ms 1 337.55ms 337.55ms 337.55ms cudaMemcpy
0.17% 5.1520ms 1 5.1520ms 5.1520ms 5.1520ms cudaMalloc
0.15% 4.6178ms 4 1.1544ms 594.35us 2.8265ms cuDeviceTotalMem
0.09% 2.6876ms 404 6.6520us 327ns 286.07us cuDeviceGetAttribute
0.01% 416.39us 4 104.10us 59.830us 232.21us cuDeviceGetName
0.00% 151.42us 2 75.710us 13.663us 137.76us cudaLaunchKernel
0.00% 21.172us 4 5.2930us 3.0730us 8.5010us cuDeviceGetPCIBusId
0.00% 9.5270us 8 1.1900us 428ns 4.5250us cuDeviceGet
0.00% 3.3090us 4 827ns 650ns 1.2230us cuDeviceGetUuid
0.00% 3.1080us 3 1.0360us 485ns 1.7180us cuDeviceGetCount
$
4GB/0.44s = 9GB/s
4GB/0.34s = 11.75GB/s (typical for PCIE Gen2 to pinned memory)
We can see that, contrary to my previous statement, transferring data by in-kernel copying to a pinned allocation does seem to be slower (about 33% slower in my test case) than using a bulk copy (cudaMemcpy DtoH to a pinned allocation). However, this isn't quite an apples-to-apples comparison, because for the cudaMemcpy comparison to be sensible the kernel would still have to write the 4GB of data to the device allocation first. The speed of that write depends on the GPU's device memory bandwidth, which of course varies by GPU, so 33% is probably too high an estimate of the difference. But if your GPU has lots of memory bandwidth, the estimate will be pretty close. (On my V100, writing 4GB to device memory only takes ~7ms.)
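For reference, here is a sketch of how that device-memory write can be timed (same kernel as above; the CUDA-event harness and launch configuration are illustrative, not part of the measured run):
#include <cstdio>

template <typename T>
__global__ void k(T *d, size_t n){
  for (size_t i = blockIdx.x*blockDim.x+threadIdx.x; i < n; i+=gridDim.x*blockDim.x)
    d[i] = 0;
}

int main(){
  size_t n = 1048576ULL*1024ULL;             // 2^30 ints = 4GB
  int *d;
  cudaMalloc(&d, sizeof(d[0])*n);            // device allocation this time
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  k<<<160, 1024>>>(d, n);                    // warm-up
  cudaEventRecord(start);
  k<<<160, 1024>>>(d, n);                    // timed write of 4GB to device memory
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("device write: %.3f ms, %.1f GB/s\n", ms, (sizeof(d[0])*n)/(ms*1e6));
  cudaFree(d);
  return 0;
}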

Related

Why does memory prefetch have no impact when transferring from device to host?

I have the following setup:
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Error-check macro assumed by the code below (not shown in the original post).
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        exit(1);
    }
}

constexpr uint32_t N{512};
constexpr uint32_t DATA_SIZE{sizeof(float) * N * N};

__managed__ float ma[N * N];
__managed__ float mb[N * N];
__managed__ float mc[N * N];

__global__ void kernel()
{
    for (uint32_t i{0}; i < N * N; ++i)
    {
        mc[i] = ma[i] + mb[i];
    }
}

int main(int argc, char *[])
{
    for (uint32_t i{0}; i < N * N; ++i)
    {
        ma[i] = 1.0f;
        mb[i] = 2.0f;
    }

    int deviceId{};
    gpuErrchk(cudaGetDevice(&deviceId));
    gpuErrchk(cudaMemPrefetchAsync(ma, DATA_SIZE, deviceId, nullptr));
    gpuErrchk(cudaMemPrefetchAsync(mb, DATA_SIZE, deviceId, nullptr));

    kernel<<<1, 1>>>();
    gpuErrchk(cudaPeekAtLastError());

    gpuErrchk(cudaMemPrefetchAsync(mc, DATA_SIZE, cudaCpuDeviceId, nullptr));
    gpuErrchk(cudaDeviceSynchronize());

    float result{0.0f};
    for (uint32_t i{0}; i < N * N; ++i)
    {
        result += mc[i];
    }
    return static_cast<int>(result);
}
I compile the code with -O3 optimizations. Profiling it with nvprof ./test gives me the following (only the memory part):
==29300== Unified Memory profiling result:
Device "Quadro P1000 (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
2 1.0000MB 1.0000MB 1.0000MB 2.000000MB 164.9620us Host To Device
20 153.60KB 4.0000KB 1.0000MB 3.000000MB 266.0500us Device To Host
19 - - - - 551.9440us Gpu page fault groups
Total CPU Page faults: 9
The first line - HtoD - is straightforward - there were 2 prefetches for ma and mb arrays 1MB each.
The second line is strange for 2 reasons:
Prefetching was ignored (well, not completely, more on this later)
The total size of the data transferred is 3MB, even though the array (mc) is only 1MB and only 1MB was specified in cudaMemPrefetchAsync.
If I run the same code with prefetching commented out I have the following results:
==30051== Unified Memory profiling result:
Device "Quadro P1000 (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
20 102.40KB 4.0000KB 508.00KB 2.000000MB 189.9230us Host To Device
29 105.93KB 4.0000KB 512.00KB 3.000000MB 278.4960us Device To Host
24 - - - - 1.311533ms Gpu page fault groups
Total CPU Page faults: 14
As seen in the tables, prefetching has an impact on the number of transfers - removing it changes HtoD from 2 to 20 transfers and DtoH from 20 to 29. It also has an impact on performance, but that impact is not major. Especially if I compare it with the third variation of the same code, where I use cudaMalloc instead of managed memory:
Type Time(%) Time Calls Avg Min Max Name
0.00% 164.80us 2 82.401us 82.209us 82.593us [CUDA memcpy HtoD]
0.00% 81.665us 1 81.665us 81.665us 81.665us [CUDA memcpy DtoH]
I am running on an NVIDIA Quadro P1000 laptop with Ubuntu 18.04 and CUDA 11.8.
To summarize, here are my questions:
Why does prefetching the memory to the host have almost no impact (29 migrations vs 20 migrations)?
Why is more memory transferred to the host than requested (3MB instead of the requested 1MB)?
Why, even with prefetching, is the managed memory an order of magnitude slower than device memory allocated with cudaMalloc?
As Robert Crovella has mentioned in the comments, the behavior is caused by the fact that the initial location of managed memory is unspecified:
By default, the devices of compute capability lower than 6.x allocate managed memory directly on the GPU. However, the devices of compute capability 6.x and greater do not allocate physical memory when calling cudaMallocManaged(): in this case physical memory is populated on first touch and may be resident on the CPU or the GPU.
In my case, the memory was allocated on the GPU. That explains why there were 3MB transferred from the device to the host and the number of transfers themselves. If I remove the initialization loop, the number of HtoD becomes zero (despite the 2 calls to cudaMemPrefetchAsync) and DtoH becomes one.
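For illustration, here is a sketch of making the intended residency explicit before first touch. The cudaMemAdvise hints are my addition (an assumption about what may avoid the surprise migrations, not part of the original fix); everything else mirrors the code in the question:
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line)
{
    if (code != cudaSuccess) { fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line); exit(1); }
}

constexpr uint32_t N{512};
constexpr uint32_t DATA_SIZE{sizeof(float) * N * N};
__managed__ float ma[N * N];
__managed__ float mb[N * N];
__managed__ float mc[N * N];

__global__ void kernel()
{
    for (uint32_t i{0}; i < N * N; ++i) mc[i] = ma[i] + mb[i];
}

int main()
{
    int deviceId{};
    gpuErrchk(cudaGetDevice(&deviceId));

    // Hint that the inputs should live on the CPU while they are initialized,
    // so the first touch below does not leave them resident on the GPU.
    gpuErrchk(cudaMemAdvise(ma, DATA_SIZE, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId));
    gpuErrchk(cudaMemAdvise(mb, DATA_SIZE, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId));

    for (uint32_t i{0}; i < N * N; ++i) { ma[i] = 1.0f; mb[i] = 2.0f; }

    // Migrate the inputs to the GPU in bulk, run the kernel, then bring mc back.
    gpuErrchk(cudaMemPrefetchAsync(ma, DATA_SIZE, deviceId, nullptr));
    gpuErrchk(cudaMemPrefetchAsync(mb, DATA_SIZE, deviceId, nullptr));
    kernel<<<1, 1>>>();
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaMemPrefetchAsync(mc, DATA_SIZE, cudaCpuDeviceId, nullptr));
    gpuErrchk(cudaDeviceSynchronize());

    float result{0.0f};
    for (uint32_t i{0}; i < N * N; ++i) result += mc[i];
    printf("%f\n", result);
    return 0;
}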

Why does this kernel not achieve peak IPC on a GK210?

I decided that it would be educational for me to try to write a CUDA kernel that achieves peak IPC, so I came up with this kernel (host code omitted for brevity but is available here)
#define WORK_PER_THREAD 4
__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  i *= WORK_PER_THREAD;
  if (i < n)
  {
    #pragma unroll
    for(int j=0; j<WORK_PER_THREAD; j++)
      y[i+j] = a * x[i+j] + y[i+j];
  }
}
I ran this kernel on a GK210, with n=32*1000000 elements, and expected to see an IPC of close to 4, but ended up with a lousy IPC of 0.186
ubuntu@ip-172-31-60-181:~/ipc_example$ nvcc saxpy.cu
ubuntu@ip-172-31-60-181:~/ipc_example$ sudo nvprof --metrics achieved_occupancy --metrics ipc ./a.out
==5828== NVPROF is profiling process 5828, command: ./a.out
==5828== Warning: Auto boost enabled on device 0. Profiling results may be inconsistent.
==5828== Profiling application: ./a.out
==5828== Profiling result:
==5828== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Tesla K80 (0)"
Kernel: saxpy_parallel(int, float, float*, float*)
1 achieved_occupancy Achieved Occupancy 0.879410 0.879410 0.879410
1 ipc Executed IPC 0.186352 0.186352 0.186352
I was even more confused when I set WORK_PER_THREAD=16: fewer threads are launched, and each has 16 (as opposed to 4) independent instructions to execute, yet the IPC dropped to 0.01.
My two questions are:
What is the peak IPC I can expect on a GK210? I think it is 8 = 4 warp schedulers * 2 instruction dispatches per cycle, but I want to be sure.
Why does this kernel achieve such low IPC while achieved occupancy is high, why does IPC decrease as WORK_PER_THREAD increases, and how can I improve the IPC of this kernel?
What is the peak IPC I can expect on a GK210?
The peak IPC per SM is equal to the number of warp schedulers in an SM times the issue rate of each warp scheduler. This information can be found in the whitepaper for a particular GPU. The GK210 whitepaper is here. From that document (e.g. SM diagram on p8) we see that each SM has 4 warp schedulers capable of dual issue. Therefore the peak theoretically achievable IPC is 8 instructions per clock per SM. (however as a practical matter even for well-crafted codes, you're unlikely to see higher than 6 or 7).
Why does this kernel achieve such low IPC while achieved occupancy is high, why does IPC decrease as WORK_PER_THREAD increases, and how can I improve the IPC of this kernel?
Your kernel requires global transactions at nearly every operation. Global loads and even L2 cache loads have latency. When everything you do is dependent on those, there is no way to avoid the latency, so your warps are frequently stalled. The peak observable IPC per SM on a GK210 is somewhere in the vicinity of 6, but you won't get that with continuous load and store operations. Your kernel does 2 loads, and one store (12 bytes total moved), for each multiply/add. You won't be able to improve it. (Your kernel has high occupancy because the SMs are loaded up with warps, but low IPC because those warps are frequently stalled, unable to issue an instruction, waiting for latency of load operations to expire.) You'll need to find other useful work to do.
What might that be? Well if you do a matrix multiply operation, which has considerable data reuse and a relatively low number of bytes per math op, you're likely to see better measurements.
What about your code? Sometimes the work you need to do is like this. We'd call that a memory-bound code. For a kernel like this, the figure of merit to use for judging "goodness" is not IPC but achieved bandwidth. If your kernel requires a particular number of bytes loaded and stored to perform its work, then if we compare the kernel duration to just the memory transactions, we can get a measure of goodness. Stated another way, for a pure memory bound code (i.e. your kernel) we would judge goodness by measuring the total number of bytes loaded and stored (profiler has metrics for this, or for a simple code you can compute it directly by inspection), and divide that by the kernel duration. This gives the achieved bandwidth. Then, we compare that to the achievable bandwidth based on a proxy measurement. A possible proxy measurement tool for this is bandwidthTest CUDA sample code.
As the ratio of these two bandwidths approaches 1.0, your kernel is doing "well", given the memory bound work it is trying to do.
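As a sketch of that bookkeeping for the kernel in the question (the host-side timing harness and launch configuration here are illustrative assumptions, not the omitted host code):
#include <cstdio>

#define WORK_PER_THREAD 4
__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  i *= WORK_PER_THREAD;
  if (i < n)
  {
    #pragma unroll
    for(int j=0; j<WORK_PER_THREAD; j++)
      y[i+j] = a * x[i+j] + y[i+j];
  }
}

int main()
{
  const int n = 32*1000000;
  float *x, *y;
  cudaMalloc(&x, n*sizeof(float));
  cudaMalloc(&y, n*sizeof(float));
  const int threads = 256;
  const int blocks = (n/WORK_PER_THREAD + threads - 1)/threads;
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  saxpy_parallel<<<blocks, threads>>>(n, 2.0f, x, y);   // warm-up
  cudaEventRecord(start);
  saxpy_parallel<<<blocks, threads>>>(n, 2.0f, x, y);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  double bytes = 3.0*n*sizeof(float);                   // 2 loads + 1 store per element
  printf("achieved bandwidth: %.1f GB/s\n", bytes/(ms*1e6));
  // compare this number against a proxy such as the bandwidthTest sample
  cudaFree(x);
  cudaFree(y);
  return 0;
}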

Are memory operations on L2 cache significantly faster than on global memory for NVIDIA GPUs?

Modern GPU architectures have both an L1 cache and an L2 cache. It is well known that the L1 cache is much faster than global memory. However, the speed of the L2 cache is less clear in the CUDA documentation. I looked it up, but could only find that global memory operations have a latency of about 300-500 cycles while L1 cache operations take only about 30 cycles. Can anyone give the speed of the L2 cache? Such information would be very useful, since there is little point in optimizing code for L2 cache use if it is not much faster than global memory. If the speed differs across architectures, I just want to focus on the latest ones, such as the NVIDIA GeForce RTX 3090 (Compute Capability 8.6) or the NVIDIA Tesla V100 (Compute Capability 7.0).
Thank you!
There are at least 2 figures of merit commonly used when discussing GPU memory: latency and bandwidth. From a latency perspective, this number is not published by NVIDIA (that I know of) and the usual practice is to discover it with careful microbenchmarking.
From a bandwidth perspective, AFAIK this number is also not published by NVIDIA (for L2 cache), but it should be fairly easy to discover it with a fairly simple test case of a copy kernel. We can estimate the bandwidth of global memory simply by ensuring that our copy kernel uses a copy footprint that is much larger than the published L2 cache size (6MB for V100), whereas we can estimate the bandwidth of L2 by keeping our copy footprint smaller than that.
Such a code (IMO) is fairly trivial to write:
$ cat t44.cu
template <typename T>
__global__ void k(volatile T * __restrict__ d1, volatile T * __restrict__ d2, const int loops, const int ds){
  for (int i = 0; i < loops; i++)
    for (int j = threadIdx.x+blockDim.x*blockIdx.x; j < ds; j += gridDim.x*blockDim.x)
      if (i&1) d1[j] = d2[j];
      else d2[j] = d1[j];
}

const int dsize = 1048576*128;
const int iter = 64;

int main(){
  int *d;
  cudaMalloc(&d, dsize);
  // case 1: 32MB copy, should exceed L2 cache on V100
  int csize = 1048576*8;
  k<<<80*2, 1024>>>(d, d+csize, iter, csize);
  // case 2: 2MB copy, should fit in L2 cache on V100
  csize = 1048576/2;
  k<<<80*2, 1024>>>(d, d+csize, iter, csize);
  cudaDeviceSynchronize();
}
$ nvcc -o t44 t44.cu
$ nvprof ./t44
==53310== NVPROF is profiling process 53310, command: ./t44
==53310== Profiling application: ./t44
==53310== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 6.9032ms 2 3.4516ms 123.39us 6.7798ms void k<int>(int volatile *, int volatile *, int, int)
API calls: 89.47% 263.86ms 1 263.86ms 263.86ms 263.86ms cudaMalloc
4.45% 13.111ms 8 1.6388ms 942.75us 2.2322ms cuDeviceTotalMem
3.37% 9.9523ms 808 12.317us 186ns 725.86us cuDeviceGetAttribute
2.34% 6.9006ms 1 6.9006ms 6.9006ms 6.9006ms cudaDeviceSynchronize
0.33% 985.49us 8 123.19us 85.864us 180.73us cuDeviceGetName
0.01% 42.668us 8 5.3330us 1.8710us 22.553us cuDeviceGetPCIBusId
0.01% 34.281us 2 17.140us 6.2880us 27.993us cudaLaunchKernel
0.00% 8.0290us 16 501ns 256ns 1.7980us cuDeviceGet
0.00% 3.4000us 8 425ns 217ns 876ns cuDeviceGetUuid
0.00% 3.3970us 3 1.1320us 652ns 2.0020us cuDeviceGetCount
$
Based on the profiler output, we can estimate global memory bandwidth as:
2*64*32MB/6.78ms = 604GB/s
we can estimate L2 bandwidth as:
2*64*2MB/123us = 2.08TB/s
Both of these are rough measurements (I'm not doing careful benchmarking here), but bandwidthTest on this V100 GPU reports a device memory bandwidth of ~700GB/s, so I believe the 600GB/s number is "in the ballpark". If we use that to judge that the L2 cache measurement is in the ballpark, then we might guess that the L2 cache may be ~3-4x faster than global memory in some circumstances.
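If you want to repeat this experiment on a different GPU, the L2 size used to choose the copy footprints can be queried at run time rather than hard-coded (a small sketch):
#include <cstdio>

int main()
{
  int dev = 0, l2bytes = 0;
  cudaGetDevice(&dev);
  cudaDeviceGetAttribute(&l2bytes, cudaDevAttrL2CacheSize, dev);
  printf("L2 cache size: %d bytes\n", l2bytes);
  // pick a copy footprint well below this value to measure L2 bandwidth,
  // and well above it to measure device (global) memory bandwidth
  return 0;
}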

How does CUDA handle __syncthreads() in a kernel?

Suppose I have a block of size 1024 and assume my GPU has 192 CUDA cores.
How does CUDA handle __syncthreads() in kernels when the number of CUDA cores is lower than the block size?
__global__ void staticReverse(int *d, int n)
{
  __shared__ int s[1024];
  int t = threadIdx.x;
  int tr = n-t-1;
  s[t] = d[t];
  __syncthreads();
  d[t] = s[tr];
}
How does 'tr' remain in local memory?
I think you are mixing a few things.
First of all, a GPU having 192 CUDA cores means 192 is the total core count for the whole device. Each block, however, maps to a single Streaming Multiprocessor (SM), which may have a lower core count (depending on the GPU generation).
Let us assume that you own a Pascal GPU which has 64 cores per SM, and that you have 3 SMs.
A single block maps to a single SM. So you will have 64 cores handling 1024 threads concurrently. Such an SM has enough registers to hold all the necessary data for 1024 threads, but it has only 64 cores, which quickly swap which threads they are handling.
This way all the local data, e.g. tr, can remain in memory.
Now, because of this quick swapping and concurrent execution, it may happen -- completely by accident -- that some threads get ahead of others. If you want to ensure that at a certain point all threads are at the same spot, you use __syncthreads(). All that function does is instruct the scheduler to assign work to the CUDA cores so that all threads reach that spot in the program before any of them proceeds past it.
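For completeness, a minimal host-side harness for the kernel in the question (the data and launch configuration are illustrative assumptions, not part of the original post):
#include <cstdio>

// staticReverse kernel exactly as in the question
__global__ void staticReverse(int *d, int n)
{
  __shared__ int s[1024];
  int t = threadIdx.x;
  int tr = n-t-1;
  s[t] = d[t];
  __syncthreads();   // barrier: every thread has written s[t] before anyone reads s[tr]
  d[t] = s[tr];
}

int main()
{
  const int n = 1024;                       // one block of 1024 threads
  int h[n];
  for (int i = 0; i < n; i++) h[i] = i;
  int *d;
  cudaMalloc(&d, n*sizeof(int));
  cudaMemcpy(d, h, n*sizeof(int), cudaMemcpyHostToDevice);
  staticReverse<<<1, n>>>(d, n);            // correct regardless of how many cores the SM has
  cudaMemcpy(h, d, n*sizeof(int), cudaMemcpyDeviceToHost);
  printf("h[0]=%d, h[%d]=%d\n", h[0], n-1, h[n-1]);   // expect 1023 and 0
  cudaFree(d);
  return 0;
}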

wildly varying performance of cuMemAlloc/cuMemFree

In my application cuMemAlloc/cuMemFree seem awfully slow most of the time. However, I found that they are sometimes 10 times faster than usual. The test program below finishes in about 0.4s on two machines, both with cuda 5.5 but one with a compute capability 2.0 card, the other with a 3.5 one.
If the cublas initialization is removed, it takes about 5s. With the cublas initialization in, but allocating a different number of bytes such as 4000, it slows down about the same. Needless to say, I'm puzzled by this.
What could be causing this? If it's not a bug in my code, what kind of workaround do I have? The only thing I could think of is preallocating an arena and implementing my own allocator.
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cublas_v2.h>

#define cudaCheck(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(CUresult code, const char *file, int line)
{
  if (code != CUDA_SUCCESS) {
    fprintf(stderr, "GPUassert: %d %s %d\n", code, file, line);
    exit(1);
  }
}

int main(int argc, char *argv[])
{
  CUcontext context;
  CUdevice device;
  int devCount;

  cudaCheck(cuInit(0));
  cudaCheck(cuDeviceGetCount(&devCount));
  cudaCheck(cuDeviceGet(&device, 0));
  cudaCheck(cuCtxCreate(&context, 0, device));

  cublasStatus_t stat;
  cublasHandle_t handle;
  stat = cublasCreate(&handle);
  if (stat != CUBLAS_STATUS_SUCCESS) {
    printf("CUBLAS initialization failed\n");
    exit(1);
  }

  {
    int i;
    for (i = 0; i < 30000; i++) {
      CUdeviceptr devBufferA;
      cudaCheck(cuMemAlloc(&devBufferA, 8000));
      cudaCheck(cuMemFree(devBufferA));
    }
  }
  return 0;
}
I took your code and profiled it on a linux 64-bit system with the 319.21 driver and CUDA 5.5 and a non-display compute capability 3.0 device. My first observation is that the run time is about 0.5s, which seems much faster than you are reporting. If I analyse the nvprof output, I get these histograms:
cuMemFree
Time (us) Frequency
3.65190000e+00 2.96670000e+04
4.59380000e+00 2.76000000e+02
5.53570000e+00 3.20000000e+01
6.47760000e+00 1.00000000e+00
7.41950000e+00 1.00000000e+00
8.36140000e+00 6.00000000e+00
9.30330000e+00 0.00000000e+00
1.02452000e+01 1.00000000e+00
1.11871000e+01 2.00000000e+00
1.21290000e+01 1.40000000e+01
cuMemAlloc
Time (us) Frequency
3.53840000e+00 2.98690000e+04
4.50580000e+00 8.60000000e+01
5.47320000e+00 2.00000000e+01
6.44060000e+00 0.00000000e+00
7.40800000e+00 0.00000000e+00
8.37540000e+00 6.00000000e+00
9.34280000e+00 0.00000000e+00
1.03102000e+01 0.00000000e+00
1.12776000e+01 1.20000000e+01
1.22450000e+01 5.00000000e+00
which tells me that 99.6% of cuMemAlloc calls take less than 3.5384 microseconds, and 98.9% of cuMemFree calls take less than 3.6519 microseconds. No free or allocate operation took more than 12.25 microseconds.
So my conclusions based on these results are
Both cuMemFree and cuMemAlloc are extremely fast, with every one of the 60000 total calls to those APIs in your example taking less than 12.25 microseconds
The median call time for both APIs is 2.7 microseconds, with a standard deviation of 0.25 microseconds, suggesting that there is very little variability in the API latency as well
Very occasionally (about 0.01% of the time), both APIs can be around six times slower than this median. This is probably due to operating system level resource contention
Every single one of the above points completely contradicts every assertion you have made in your question.
Given how different your results apparently are, I can only guess that you are running on a known high-latency platform like WDDM Windows, and that driver batching and WDDM subsystem latency are completely dominating the performance of the code. In that case, the simplest workaround would seem to be to change platforms.
The CUDA memory manager is known to be slow. I've seen mention that it is "two orders of magnitude" slower than host malloc() and free(). This information may be dated, but there are some graphs here:
http://www.cs.virginia.edu/~mwb7w/cuda_support/memory_management_overhead.html
I think this is because the CUDA memory manager is optimized for handling a small number of memory allocations, at the cost of slowing down when there is a large number of allocations. And that, in turn, is because it is generally not efficient to handle many small buffers in a kernel.
There are two main issues with dealing with many buffers in a kernel:
1) It implies passing a table of pointers to the kernel. If there is a pointer for each thread, you incur an initial cost of loading the pointer from a table in global memory before you can start working with the memory. Following a series of pointers is sometimes called "pointer chasing", and it is especially expensive on a GPU because memory access is relatively more expensive there.
2) More importantly, a pointer for each thread implies a non-coalesced memory access pattern. On current architectures, if each thread in a warp loads a 32-bit value from global memory that is more than 128 bytes away from the others, 32 memory transactions are required to serve the warp. Each transaction will load 128 bytes and then discard 124 bytes. If all threads in a warp load values from the same naturally aligned 128-byte area, all the loads are served by a single memory transaction. So, in a memory-bound kernel, memory throughput may be only 1/32 of potential.
The most efficient way to handle memory with CUDA is often to allocate a few large chunks and index into them in the kernel.
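A minimal sketch of that pattern (runtime API for brevity; names and sizes are illustrative): one large allocation is carved into per-block slices by indexing, instead of issuing many small cuMemAlloc/cuMemFree calls:
#include <cstdio>

__global__ void work(float *pool, size_t slice_elems)
{
  // each block owns one contiguous slice of the single large allocation
  float *my = pool + (size_t)blockIdx.x * slice_elems;
  for (size_t i = threadIdx.x; i < slice_elems; i += blockDim.x)
    my[i] = 2.0f * i;              // contiguous, coalesced accesses within the slice
}

int main()
{
  const int nblocks = 1024;
  const size_t slice_elems = 4096;
  float *pool;
  cudaMalloc(&pool, (size_t)nblocks * slice_elems * sizeof(float));   // one allocation
  work<<<nblocks, 256>>>(pool, slice_elems);
  cudaDeviceSynchronize();
  cudaFree(pool);                                                     // one free
  return 0;
}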