Why does data migrate from Host to Device when the CPU tries to read managed memory initialized by the GPU? - cuda

In the following test code, we initialize data on the GPU and then access it on the CPU. I have 2 questions about the profiling result from nvprof.
Why is there a data migration from Host To Device? In my understanding it should be Device to Host.
Why is the H->D count 2? I think it should be 1, because the data fits in one page.
Thanks in advance!
My environment:
Driver Version: 418.87.00
CUDA Version: 10.1
ubuntu 18.04
#include <cuda.h>
#include <iostream>
using namespace std;

__global__ void setVal(char* data, int idx)
{
    data[idx] = 'd';
}

int main()
{
    const int count = 8;
    char* data;
    cudaMallocManaged((void **)&data, count);
    setVal<<<1,1>>>(data, 0); // GPU page fault
    cout << " cpu read " << data[0] << endl;
    cudaFree(data);
    return 0;
}
==28762== Unified Memory profiling result:
Device "GeForce GTX 1070 (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
2 32.000KB 4.0000KB 60.000KB 64.00000KB 11.74400us Host To Device
1 - - - - 362.9440us Gpu page fault groups
Total CPU Page faults: 1

Why is there a data migration from Host To Device? In my understanding it should be Device to Host.
You are thrashing data between host and device. Because the GPU kernel launch is asynchronous, your host code, issued after the kernel launch, is actually accessing the data before the GPU code does.
Put a cudaDeviceSynchronize() after your kernel call, so that the CPU code does not attempt to read the data until after the kernel is complete.
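For reference, a minimal sketch of that fix (the same program as above, with the synchronization added before the CPU read):

#include <cuda_runtime.h>
#include <iostream>
using namespace std;

__global__ void setVal(char* data, int idx)
{
    data[idx] = 'd';
}

int main()
{
    const int count = 8;
    char* data;
    cudaMallocManaged((void **)&data, count);
    setVal<<<1,1>>>(data, 0);      // GPU page fault, page migrates to the device
    cudaDeviceSynchronize();       // wait for the kernel before touching data on the CPU
    cout << " cpu read " << data[0] << endl;  // now only a D->H migration should appear
    cudaFree(data);
    return 0;
}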
I don't have an answer for your other question. The profiler is often not able to perfectly resolve very small amounts of activity. It does not necessarily instrument all SMs during a profiling run, and some of its results may be scaled for the size of a GPC, a TPC, and/or the entire GPU. That would be my guess, although it is just speculation. I generally don't expect perfectly accurate results from the profiler when profiling tiny bits of code doing almost nothing.

Related

CUDA C: host doesn't send all the threads at once

I'm trying to learn CUDA, so I wrote some silly code (see below) in order to understand how CUDA works. I set the number of blocks to 1024, but when I run my program it seems that the host doesn't send all 1024 threads at once to the GPU. Instead, the GPU processes about 350 threads first, then another 350 threads, and so on. Why? Thanks in advance!!
PS1: My computer has Ubuntu installed and an NVIDIA GeForce GTX 1660 SUPER
PS2: In my program, each block goes to sleep for a few seconds and nothing else. Also the host creates an array called "H_Arr" and sends it to the GPU, although the device does not use this array. Of course, the latter doesn't make much sense, but I'm just experimenting to understand how CUDA works.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <limits>
#include <iostream>
#include <fstream>
#include <unistd.h>
using namespace std;

__device__ void funcDev(int tid);

int NB = 1024;
int NT = 1;

__global__
void funcGlob(int NB, int NT){
    int tid = blockIdx.x * NT + threadIdx.x;
#if __CUDA_ARCH__ >= 200
    printf("start block %d \n", tid);
    if(tid < NB * NT){
        funcDev(tid);
    }
    printf("end block %d\n", tid);
#endif
}

__device__ void funcDev(int tid){
    clock_t clock_count;
    clock_count = 10000000000;
    clock_t start_clock = clock();
    clock_t clock_offset = 0;
    while (clock_offset < clock_count)
    {
        clock_offset = clock() - start_clock;
    }
}

int main(void)
{
    int i;
    ushort *D_Arr, *H_Arr;
    H_Arr = new ushort[NB * NT + 1];
    for(i = 0; i < NB * NT + 1; i++){ H_Arr[i] = 1; }  // index from 0 to stay inside the array
    cudaMalloc((void**)&D_Arr, (NB * NT + 1) * sizeof(ushort));
    cudaMemcpy(D_Arr, H_Arr, (NB * NT + 1) * sizeof(ushort), cudaMemcpyHostToDevice);
    funcGlob<<<NB, NT>>>(NB, NT);
    cudaFree(D_Arr);
    delete [] H_Arr;
    return 0;
}
I wrote a program in CUDA C. I set the number of blocks to be 1024. If I understood correctly, in theory 1024 processes should run simultaneously. However, this is not what happens.
GTX 1660 Super seems to have 22 SMs.
It is a compute capability 7.5 device. If you run the deviceQuery cuda sample code on your GPU, you can confirm that (the compute capability and the number of SMs, called "Multiprocessors"), and also discover that your GPU has a limit of 16 blocks resident per SM at any moment.
So I haven't studied your code at all, really, but since you are launching 1024 blocks (of one thread each), it would be my expectation that the block scheduler would deposit an initial wave of 16x22=352 blocks on the SMs, and then it would wait for some of those blocks to finish/retire before it would be able to deposit any more.
So an "initial wave" of 352 scheduled blocks sounds just right to me.
Throughout your posting, you refer primarily to threads. While it might be correct to say "350 threads are running" (since you are launching one thread per block) it isn't very instructive to think of it that way. The GPU work distributor schedules blocks, not individual threads, and the relevant limit here is the blocks per SM limit.
If you don't really understand the distinction between blocks and threads, or other basic CUDA concepts, you can find many questions here on the SO cuda tag about these concepts, and this online training series, particularly the first 3-4 units, will also be illuminating.
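If you want to check that arithmetic programmatically, a rough sketch along these lines (my own example, not part of the original code) queries the SM count and asks the occupancy API how many resident blocks per SM a 1-thread-per-block kernel allows; on a CC 7.5 part that should come out to the 16-per-SM limit mentioned above:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void sleepKernel() { /* stand-in for the busy-wait kernel above */ }

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // How many 1-thread blocks of this kernel can be resident on one SM?
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, sleepKernel,
                                                  1 /*threads per block*/, 0 /*dynamic smem*/);

    printf("SMs: %d, resident blocks per SM: %d, first wave: %d blocks\n",
           prop.multiProcessorCount, blocksPerSM,
           prop.multiProcessorCount * blocksPerSM);
    return 0;
}

On the GTX 1660 SUPER this should report 22 SMs and 16 blocks per SM, i.e. the 352-block initial wave described above.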

wildly varying performance of cuMemAlloc/cuMemFree

In my application cuMemAlloc/cuMemFree seem awfully slow most of the time. However, I found that they are sometimes 10 times faster than usual. The test program below finishes in about 0.4s on two machines, both with cuda 5.5 but one with a compute capability 2.0 card, the other with a 3.5 one.
If the cublas initialization is removed, it takes about 5s. With the cublas initialization in, but allocating a different number of bytes such as 4000, it slows down by about the same amount. Needless to say, I'm puzzled by this.
What could be causing this? If it's not a bug in my code, what kind of workaround do I have? The only thing I could think of is preallocating an arena and implementing my own allocator.
#include <stdio.h>
#include <cuda.h>
#include <cublas_v2.h>

#define cudaCheck(ans) { gpuAssert((ans), __FILE__, __LINE__); }

inline void gpuAssert(CUresult code, const char *file, int line)
{
    if (code != CUDA_SUCCESS) {
        fprintf(stderr, "GPUassert: %d %s %d\n", code, file, line);
        exit(1);
    }
}

int main(int argc, char *argv[])
{
    CUcontext context;
    CUdevice device;
    int devCount;
    cudaCheck(cuInit(0));
    cudaCheck(cuDeviceGetCount(&devCount));
    cudaCheck(cuDeviceGet(&device, 0));
    cudaCheck(cuCtxCreate(&context, 0, device));

    cublasStatus_t stat;
    cublasHandle_t handle;
    stat = cublasCreate(&handle);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf("CUBLAS initialization failed\n");
        exit(1);
    }

    {
        int i;
        for (i = 0; i < 30000; i++) {
            CUdeviceptr devBufferA;
            cudaCheck(cuMemAlloc(&devBufferA, 8000));
            cudaCheck(cuMemFree(devBufferA));
        }
    }
    return 0;
}
I took your code and profiled it on a linux 64 bit system with the 319.21 driver and CUDA 5.5 and a non-display compute 3.0 device. My first observation is that the run time is about 0.5s, which seems much faster than you are reporting. If I analyse the nvprof output, I get these histograms:
cuMemFree
Time (us) Frequency
3.65190000e+00 2.96670000e+04
4.59380000e+00 2.76000000e+02
5.53570000e+00 3.20000000e+01
6.47760000e+00 1.00000000e+00
7.41950000e+00 1.00000000e+00
8.36140000e+00 6.00000000e+00
9.30330000e+00 0.00000000e+00
1.02452000e+01 1.00000000e+00
1.11871000e+01 2.00000000e+00
1.21290000e+01 1.40000000e+01
cuMemAlloc
Time (us) Frequency
3.53840000e+00 2.98690000e+04
4.50580000e+00 8.60000000e+01
5.47320000e+00 2.00000000e+01
6.44060000e+00 0.00000000e+00
7.40800000e+00 0.00000000e+00
8.37540000e+00 6.00000000e+00
9.34280000e+00 0.00000000e+00
1.03102000e+01 0.00000000e+00
1.12776000e+01 1.20000000e+01
1.22450000e+01 5.00000000e+00
which tells me that 99.6% of cuMemAlloc calls take less than 3.5384 microseconds, and 98.9% of cuMemFree calls take less than 3.6519 microseconds. No free or allocate operation took more than 12.25 microseconds.
So my conclusions based on these results are
Both cuMemFree and cuMemAlloc are extremely fast, with every one of the 60000 total calls to those APIs in your example taking less than 12.25 microseconds
The median call time for both APIs is 2.7 microseconds, with a standard deviation of 0.25 microseconds, suggesting that there is very little variability in the API latency as well
Very occasionally (about 0.01% of the time), both APIs can be around six times slower than this median. This is probably due to operating system level resource contention
Every single one of the above points completely contradicts every assertion you have made in your question.
Given how different your results apparently are, I can only guess that you are running on a known high-latency platform like WDDM Windows, and that driver batching and WDDM subsystem latency are completely dominating the performance of the code. In that case, the simplest workaround would seem to be to change platforms.
The CUDA memory manager is known to be slow. I've seen mention that it is "two orders of magnitude" slower than host malloc() and free(). This information may be dated, but there are some graphs here:
http://www.cs.virginia.edu/~mwb7w/cuda_support/memory_management_overhead.html
I think this is because the CUDA memory manager is optimized for handling a small number of memory allocations, at the cost of slowing down when there is a large number of allocations. And this is because, in general, it is not efficient to handle many small buffers in a kernel.
There are two main issues with dealing with many buffers in a kernel:
1) It implies passing a table of pointers to the kernel. If there is a pointer for each thread, you incur an initial cost of loading the pointer from a table in global memory before you can start working with the memory. Following a series of pointers is sometimes called "pointer chasing", and it is especially expensive on a GPU because memory access is relatively more expensive.
2) More importantly, a pointer for each thread implies a non-coalesced memory access pattern. On current architectures, if each thread in a warp loads a 32-bit value from global memory that is more than 128 bytes away from the others, 32 memory transactions are required to serve the warp. Each transaction will load 128 bytes and then discard 124 bytes. If all threads in a warp load values from the same naturally aligned 128-byte region, all the loads are served by a single memory transaction. So, in a memory-bound kernel, memory throughput may be only 1/32 of potential.
The most efficient way to handle memory with CUDA is often to allocate a few large chunks and index into them in the kernel.
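As a hedged illustration of that last point, here is a minimal bump-allocator sketch over a single large cuMemAlloc'd arena (the GpuArena type and its functions are invented for this example, not an existing API). Sub-buffers are handed out as offsets into the arena, so the per-buffer cuMemAlloc/cuMemFree calls disappear entirely:

#include <cuda.h>
#include <stddef.h>

// Hypothetical bump allocator over one big cuMemAlloc'd arena.
typedef struct {
    CUdeviceptr base;       // start of the arena in device memory
    size_t      capacity;   // total size of the arena
    size_t      used;       // bytes handed out so far
} GpuArena;

int arenaInit(GpuArena *a, size_t capacity)
{
    a->capacity = capacity;
    a->used = 0;
    return cuMemAlloc(&a->base, capacity) == CUDA_SUCCESS ? 0 : -1;
}

// Hand out 256-byte-aligned sub-buffers instead of calling cuMemAlloc per buffer.
CUdeviceptr arenaAlloc(GpuArena *a, size_t bytes)
{
    size_t aligned = (a->used + 255) & ~(size_t)255;
    if (aligned + bytes > a->capacity) return 0;   // out of arena space
    a->used = aligned + bytes;
    return a->base + aligned;
}

void arenaReset(GpuArena *a)   { a->used = 0; }        // "free" everything at once
void arenaDestroy(GpuArena *a) { cuMemFree(a->base); } // one real free at shutdown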

Maximum number of threads for a CUDA kernel on Tesla M2050

I am testing what the maximum number of threads is for a simple kernel. I find the total number of threads cannot exceed 4096. The code is as follows:
#include <stdio.h>
#define N 100

__global__ void test(){
    printf("%d %d\n", blockIdx.x, threadIdx.x);
}

int main(void){
    double *p;
    size_t size = N * sizeof(double);
    cudaMalloc(&p, size);
    test<<<64,128>>>();
    //test<<<64,128>>>();
    cudaFree(p);
    return 0;
}
My test environment: CUDA 4.2.9 on Tesla M2050. The code is compiled with
nvcc -arch=sm_20 test.cu
While checking the output, I found some combinations are missing. Running the command
./a.out|wc -l
I always got 4096. When I check cc 2.0, I can only find that the maximum numbers of blocks for the x,y,z dimensions are (1024,1024,512) and that the maximum number of threads per block is 1024. The calls to the kernel (either <<<64,128>>> or <<<128,64>>>) are well within these limits. Any idea?
NB: The CUDA memory operations are there to block the code so that the output from the kernel will be shown.
You are abusing kernel printf, and using it to judge how many threads you can run is a completely nonsensical idea. The runtime has a limited buffer size for printf output, and you are simply overflowing it with output when you run enough threads. There is an API to query and set the printf buffer size, using cudaDeviceGetLimit and cudaDeviceSetLimit (thanks to Robert Crovella for the link to the printf documentation in comments).
You can find the maximum number of threads a given kernel can run by looking here in the documentation.
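If you do want to keep using in-kernel printf for experiments like this, a small sketch of the buffer-size API mentioned above looks roughly like the following (the 64 MB value is just an arbitrary example):

#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    size_t printfSize = 0;
    cudaDeviceGetLimit(&printfSize, cudaLimitPrintfFifoSize);      // query current FIFO size
    printf("default printf FIFO: %zu bytes\n", printfSize);

    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 64 * 1024 * 1024); // enlarge it (example value)
    cudaDeviceGetLimit(&printfSize, cudaLimitPrintfFifoSize);
    printf("new printf FIFO: %zu bytes\n", printfSize);
    return 0;
}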

JCuda Pinned Memory Example

JCuda + GEForce Gt640 Question:
I'm trying to reduce the latency associated with copying memory from Device to Host after the result has been computed by the GPU. Doing the simple Vector Add program, I found that the bulk of the latency is indeed copying the result buffer back to the Host side. The transfer latency of the source buffers to the Device side is negligible at ~0.30 ms, while copying the result back is on the order of 20 ms.
I did the research and found that a better alternative to copying out the results is to use pinned memory. From what I learned, this memory is allocated on the host side, but the kernel has direct access to it over PCIe, which in turn yields higher speed than copying the result back in bulk after the computation. I'm using the following example, but the results aren't what I expect.
Kernel: {Simple Example to illustrate point, Launching 1 block 1 thread only}
extern "C"
__global__ void add(int* test)
{
test[0]=1; test[1]=2; test[2]=3; test[3]=4; test[4]=5;
}
Java:
import java.io.*;
import jcuda.*;
import jcuda.runtime.*;
import jcuda.driver.*;
import static jcuda.runtime.cudaMemcpyKind.*;
import static jcuda.driver.JCudaDriver.*;

public class JCudaTest
{
    public static void main(String args[])
    {
        // Initialize the driver and create a context for the first device.
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);

        // Load the ptx file.
        CUmodule module = new CUmodule();
        JCudaDriver.cuModuleLoad(module, "JCudaKernel.ptx");

        // Obtain a function pointer to the kernel function.
        CUfunction function = new CUfunction();
        JCudaDriver.cuModuleGetFunction(function, module, "add");

        Pointer P = new Pointer();
        JCudaDriver.cuMemAllocHost(P, 5*Sizeof.INT);
        Pointer kernelParameters = Pointer.to(P);

        // Call the kernel function with 1 block, 1 thread:
        JCudaDriver.cuLaunchKernel(function, 1, 1, 1, 1, 1, 1, 0, null, kernelParameters, null);

        int [] T = new int[5];
        JCuda.cudaMemcpy(Pointer.to(T), P, 5*Sizeof.INT, cudaMemcpyHostToHost);

        // Print the results:
        for(int i=0; i<5; i++)
            System.out.println(T[i]);
    }
}
1.) Build the Kernel:
root#NVS295-CUDA:~/JCUDA/MySamples# nvcc -ptx JCudaKernel.cu
root#NVS295-CUDA:~/JCUDA/MySamples# ls -lrt | grep ptx
-rw-r--r-- 1 root root 3295 Mar 27 17:46 JCudaKernel.ptx
2.) Build the Java:
root#NVS295-CUDA:~/JCUDA/MySamples# javac -cp "../JCuda-All-0.5.0-bin-linux-x86/*:." JCudaTest.java
3.) Run the code:
root#NVS295-CUDA:~/JCUDA/MySamples# java -cp "../JCuda-All-0.5.0-bin-linux-x86/*:." JCudaTest
0
0
0
0
0
Expecting:
1
2
3
4
5
Note: I'm using JCuda0.5.0 for x86 if that matters.
Please let me know what I'm doing wrong and thanks in advance:
Ilir
The problem here is that the device may not access host memory directly.
Admittedly, the documentation sounds misleading here:
cuMemAllocHost
Allocates bytesize bytes of host memory that is page-locked and accessible to the device...
This sounds like a clear statement. However, "accessible" here does not mean that the memory may be used directly as a kernel parameter in all cases. This is only possible on devices that support Unified Addressing. For all other devices, it is necessary to obtain a device pointer that corresponds to the allocated host pointer, with cuMemHostGetDevicePointer.
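For reference, the driver-API pattern being described looks roughly like this in plain C (a sketch only: it uses cuMemHostAlloc with the DEVICEMAP flag and a context created with CU_CTX_MAP_HOST, and the JCuda driver bindings should expose the analogous calls):

#include <cuda.h>
#include <stdio.h>

int main()
{
    cuInit(0);
    CUdevice dev;   cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, CU_CTX_MAP_HOST, dev);   // allow mapped host memory

    CUmodule  module; cuModuleLoad(&module, "JCudaKernel.ptx");
    CUfunction add;   cuModuleGetFunction(&add, module, "add");

    // Page-locked, device-mapped host allocation ...
    int *hostPtr = NULL;
    cuMemHostAlloc((void**)&hostPtr, 5 * sizeof(int), CU_MEMHOSTALLOC_DEVICEMAP);

    // ... and the device pointer that corresponds to it (required on non-UVA devices).
    CUdeviceptr devPtr;
    cuMemHostGetDevicePointer(&devPtr, hostPtr, 0);

    void *params[] = { &devPtr };          // pass the *device* pointer, not the host pointer
    cuLaunchKernel(add, 1, 1, 1, 1, 1, 1, 0, NULL, params, NULL);
    cuCtxSynchronize();                    // wait for the kernel before reading on the host

    for (int i = 0; i < 5; i++)
        printf("%d\n", hostPtr[i]);        // results are visible directly, no bulk copy needed

    cuMemFreeHost(hostPtr);
    return 0;
}

On a device with Unified Addressing the device pointer returned is typically identical to the host pointer, but querying it explicitly works in both cases.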
The key point of page-locked host memory is that the data transfer between the host and device is faster. An example of how this memory may be used in JCuda can be seen in the JCudaBandwidthTest sample (this is for the runtime API, but for the driver API, it works analogously).
EDIT:
Note that the new Unified Memory feature of CUDA 6 actually supports what you originally intended to do: with cudaMallocManaged you can allocate memory that is directly accessible to the host and the device (in the sense that it can, for example, be passed to a kernel, written by the device, and afterwards read by the host without additional effort). Unfortunately, this concept does not map very well to Java, because the memory is still managed by CUDA - and this memory cannot replace the memory that is used by the Java VM for, say, a float[] array. But at least it should be possible to create a ByteBuffer from the memory that was allocated with cudaMallocManaged, so that you may access this memory, for example, as a FloatBuffer.

driver.Context.synchronize() - what else to take into consideration - "a clean-up operation failed"

I have this code here (modified due to the answer).
Info:
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 46 registers, 120 bytes cmem[0], 176 bytes cmem[2], 76 bytes cmem[16]
I don't know what else to take into consideration in order to make it work for different combinations of points "numPointsRs" and "numPointsRp".
When, for example, I run the code with Rs=10000 and Rp=100000 with block=(128,1,1), grid=(200,1), it's fine.
My computations:
46 registers * 128 threads = 5888 registers.
My card has a limit of 32768 registers, so 32768/5888 = 5 + some => 5 blocks/SM (my card has a limit of 6).
With the occupancy calculator I found that using 128 threads/block gives me 42%, and I am within the limits of my card.
Also, the number of threads per MP is 640 (the limit is 1536).
Now, if I try to use Rs=100000 and Rp=100000 (for the same threads and blocks), it gives me the message in the title, with:
cuEventDestroy failed: launch timeout
cuModuleUnload failed: launch timeout
1) I don't know/understand what else needs to be computed.
2) I can't understand how we use/find the number of blocks. I can see that mostly someone puts (threads-1+points)/threads, but that still doesn't work.
--------------UPDATED----------------------------------------------
After using driver.Context.synchronize(), the code works for many points (1000000)!
But what impact does this addition have on the code? (For many points the screen freezes for 1 minute or more.) Should I use it or not?
--------------UPDATED2----------------------------------------------
Now the code doesn't work again, without my changing anything!
Snapshot of code:
import pycuda.gpuarray as gpuarray
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np
import cmath
import pycuda.driver as drv
import pycuda.tools as t
#---- Initialization and passing(allocate memory and transfer data) to GPU -------------------------
Rs_gpu=gpuarray.to_gpu(Rs)
Rp_gpu=gpuarray.to_gpu(Rp)
J_gpu=gpuarray.to_gpu(np.ones((numPointsRs,3)).astype(np.complex64))
M_gpu=gpuarray.to_gpu(np.ones((numPointsRs,3)).astype(np.complex64))
Evec_gpu=gpuarray.to_gpu(np.zeros((numPointsRp,3)).astype(np.complex64))
Hvec_gpu=gpuarray.to_gpu(np.zeros((numPointsRp,3)).astype(np.complex64))
All_gpu=gpuarray.to_gpu(np.ones(numPointsRp).astype(np.complex64))
#-----------------------------------------------------------------------------------
mod = SourceModule("""
#include <pycuda-complex.hpp>
#include <cmath>
#include <vector>

typedef pycuda::complex<float> cmplx;
typedef float fp3[3];
typedef cmplx cp3[3];

__device__ __constant__ float Pi;

extern "C"{

__device__ void computeEvec(fp3 Rs_mat[], int numPointsRs,
                            cp3 J[],
                            cp3 M[],
                            fp3 Rp,
                            cmplx kp,
                            cmplx eta,
                            cmplx *Evec,
                            cmplx *Hvec, cmplx *All)
{
    while (c < numPointsRs){
        ...
        c++;
    }
}

__global__ void computeEHfields(float *Rs_mat_, int numPointsRs,
                                float *Rp_mat_, int numPointsRp,
                                cmplx *J_,
                                cmplx *M_,
                                cmplx kp,
                                cmplx eta,
                                cmplx E[][3],
                                cmplx H[][3], cmplx *All )
{
    fp3 * Rs_mat = (fp3 *)Rs_mat_;
    fp3 * Rp_mat = (fp3 *)Rp_mat_;
    cp3 * J = (cp3 *)J_;
    cp3 * M = (cp3 *)M_;

    int k = threadIdx.x + blockIdx.x * blockDim.x;
    while (k < numPointsRp)
    {
        computeEvec( Rs_mat, numPointsRs, J, M, Rp_mat[k], kp, eta, E[k], H[k], All );
        k += blockDim.x * gridDim.x;
    }
}

}
""", no_extern_c=1, options=['--ptxas-options=-v'])
#call the function(kernel)
func = mod.get_function("computeEHfields")
func(Rs_gpu,np.int32(numPointsRs),Rp_gpu,np.int32(numPointsRp),J_gpu, M_gpu, np.complex64(kp), np.complex64(eta),Evec_gpu,Hvec_gpu, All_gpu, block=(128,1,1),grid=(200,1))
#----- get data back from GPU-----
Rs=Rs_gpu.get()
Rp=Rp_gpu.get()
J=J_gpu.get()
M=M_gpu.get()
Evec=Evec_gpu.get()
Hvec=Hvec_gpu.get()
All=All_gpu.get()
My card:
Device 0: "GeForce GTX 560"
CUDA Driver Version / Runtime Version 4.20 / 4.10
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1024 MBytes (1073283072 bytes)
( 0) Multiprocessors x (48) CUDA Cores/MP: 0 CUDA Cores // the tool reports 0 here, but 336 CUDA cores => 7 MPs with 48 cores/MP
There are quite a few issues that you have to deal with. Answer 1 provided by @njuffa is the best general solution. I'll provide more feedback based upon the limited data you have provided.
PTX output of 46 registers is not the number of registers used by your kernel. PTX is an intermediate representation. The offline or JIT compiler will convert this to device code. Device code may use more or less registers. Nsight Visual Studio Edition, the Visual Profiler, and the CUDA command line profiler can all provide you the correct register count.
The occupancy calculation is not simply RegistersPerSM / RegistersPerThread. Registers are allocated based upon a granularity. For CC 2.1 the granularity is 4 registers per thread per warp (128 registers). 2.x devices can actually allocate at a 2 register granularity but this can lead to fragmentation later in the kernel.
In your occupancy calculation you state
My card has limit 32768registers,so 32768/5888=5 +some => 5 block/SM
(my card has limit 6).
I'm not sure what 6 means. Your device has 7 SMs. The maximum blocks per SM for 2.x devices is 8 blocks per SM.
You have provided an insufficient amount of code. If you provide pieces of code please provide the size of all inputs, the number of times each loop will be executed, and a description of the operations per function. Looking at the code you may be doing too many loops per thread. Without knowing the order of magnitude of the outer loop we can only guess.
Given that the launch is timing out you should probably approach debugging as follows:
a. Add a line to the beginning of the code
if (blockIdx.x > 0) { return; }
Run the exact code you have in one of the previously mentioned profilers to estimate the duration of a single block. Using the launch information provided by the profiler (registers per thread, shared memory, ...), use the occupancy calculator in the profiler or the xls to determine the maximum number of blocks that you can run concurrently. For example, if the theoretical block occupancy is 3 blocks per SM, and the number of SMs is 7, then you can run 21 blocks at a time, which for your launch is 9 waves. NOTE: this assumes equal work per thread. Change the early-exit code to allow 1 wave (21 blocks). If this launch times out then you need to reduce the amount of work per thread. If it passes then calculate how many waves you have and estimate when you will time out (2 sec on Windows, ? on Linux).
b. If you have too many waves then you have to reduce the launch configuration. Given that you index by gridDim.x and blockDim.x, you can do this by passing these dimensions in as parameters to your kernel. This will require you to minimally change your indexing code. You will also have to pass a blockIdx.x offset. Change your host code to launch multiple kernels back to back. Since there should be no conflict, you can launch these in multiple streams to benefit from overlap at the end of each wave (see the sketch below).
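Here is a hedged sketch of that chunked-launch idea in CUDA C (the kernel, names, and numbers are illustrative stand-ins for the PyCUDA kernel above, not the original code):

#include <cuda_runtime.h>

// Illustrative kernel: takes an explicit block offset so the same grid can be launched in slices.
__global__ void computeChunk(float *out, int numPoints, int blockOffset)
{
    int k = (blockIdx.x + blockOffset) * blockDim.x + threadIdx.x;
    if (k < numPoints)
        out[k] = 0.0f;   // placeholder for the real per-point work
}

void launchInWaves(float *d_out, int numPoints)
{
    const int threads         = 128;
    const int totalBlocks     = (numPoints + threads - 1) / threads;
    const int blocksPerLaunch = 21;   // e.g. 3 blocks/SM * 7 SMs, as estimated above

    for (int offset = 0; offset < totalBlocks; offset += blocksPerLaunch) {
        int blocks = (totalBlocks - offset < blocksPerLaunch) ? (totalBlocks - offset)
                                                              : blocksPerLaunch;
        computeChunk<<<blocks, threads>>>(d_out, numPoints, offset);
        cudaDeviceSynchronize();   // keep each launch short enough to stay under the watchdog
    }
}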
"launch timeout" would appear to indicate that the kernel ran too long and was killed by the watchdog timer. This can happen on GPUs that are also used for graphics output (e.g. a graphical desktop), where the task of the watchdog timer is to prevent the desktop from locking up for more than a few seconds. Best I can recall the watchdog time limit is on the order of 5 seconds or thereabouts.
At any given moment, the GPU can either run graphics, or CUDA, so the watchdog timer is needed when running a GUI to prevent the GUI from locking up for an extended period of time, which renders the machine inoperable through the GUI.
If possible, avoid using this GPU for the desktop and/or other graphics (e.g. don't run X if you are on Linux). If running without graphics isn't an option, then to avoid watchdog-timer kernel termination you will have to reduce kernel execution time: do less work per kernel launch, optimize the code so the kernel runs faster for the same amount of work, or deploy a faster GPU.
To provide more input on @njuffa's answer: on Windows systems you can increase the launch timeout, or TDR (Timeout Detection & Recovery), by following these steps:
1: Open the options in Nsight Monitor.
2: Set an appropriate value for WDDM TDR Delay.
CAUTION: If this value is too small you may still get the timeout error, while for larger values your screen will stay frozen until the kernel finishes its job.
source