Get nvprof default behavior with ncu (NsightComputeCli) - cuda

The default nvprof output is great, but nvprof has been deprecated in favor of ncu. How can I make ncu give me an output that looks more like nvprof?
minimal working example
I have two range kernels, one of which is launched in a deliberately suboptimal way (using only 1 thread). It takes much longer than the other.
profile.cu
#include <stdio.h>
//! makes sure both range functions executed correctly
bool check_range(int N, float *x_d) {
float *x_h;
cudaMallocHost(&x_h,N*sizeof(float));
cudaMemcpy(x_h, x_d, N*sizeof(float), cudaMemcpyDeviceToHost);
bool success=true;
for( int i=0; i < N; i++)
if( x_h[i] != i ) {
printf("\33[31mERROR: x[%d]=%g\33[0m\n",i,x_h[i]);
success=false;
break;
}
cudaFreeHost(x_h);
return success;
}
//! called with many threads
__global__ void range_fast(int N, float *x) {
for( int i=threadIdx.x; i < N; i+=blockDim.x)
x[i]=i;
}
//! only gets called with 1 thread. This is the bottleneck I want to detect
__global__ void range_slow(int N, float *x) {
for( int i=threadIdx.x; i < N; i+=blockDim.x)
x[i]=i;
}
int main(int argc, char *argv[]) {
int N=(1<<20)*10;
float *x_fast, *x_slow;
cudaMalloc(&x_fast,N*sizeof(float));
cudaMalloc(&x_slow,N*sizeof(float));
range_fast<<<1,512>>>(N,x_fast);
range_slow<<<1,1>>>(N,x_slow);
check_range(N,x_fast);
check_range(N,x_slow);
cudaFree(x_fast);
cudaFree(x_slow);
return 0;
}
compilation
nvcc profile.cu -o profile.exe
nvprof profiling
nvprof ./profile.exe
nvprof output
           Type  Time(%)      Time  Calls       Avg       Min       Max  Name
GPU activities:   99.17%  1.20266s      1  1.20266s  1.20266s  1.20266s  range_slow(int, float*)
                   0.53%  6.3921ms      2  3.1961ms  3.1860ms  3.2061ms  [CUDA memcpy DtoH]
                   0.31%  3.7273ms      1  3.7273ms  3.7273ms  3.7273ms  range_fast(int, float*)
     API calls:   88.79%  1.20524s      2  602.62ms  3.2087ms  1.20203s  cudaMemcpy
                   9.31%  126.39ms      2  63.196ms  100.62us  126.29ms  cudaMalloc
                   1.11%  15.121ms      2  7.5607ms  7.5460ms  7.5754ms  cudaHostAlloc
                   0.64%  8.6687ms      2  4.3344ms  4.2029ms  4.4658ms  cudaFreeHost
                   0.09%  1.2195ms      2  609.73us  103.80us  1.1157ms  cudaFree
This gives me a clear idea about which functions are taking most of the runtime, and that range_slow is the bottleneck.
ncu profiling
ncu ./profile.exe
ncu output
The ncu output has far more details, most of which I don't really care about. It also isn't as nicely summarized.

The functionality of nvprof has been split across 2 separate tools in the "new" profiling suite. Nsight Compute focuses mostly on profiling kernels (i.e. device code); although it can of course report kernel duration, it is less concerned with things like API call activity and memory copy activity.
The tool that has this functionality is Nsight Systems.
Try:
nsys profile --stats=true ./profile.exe
Amongst other things, you will get a pareto list of GPU activities (broken into separate pareto lists of kernel activities and memory copy activities) and a pareto list of API calls.

Related

cuda unified memory and pointer aliasing

I am refreshing my CUDA knowledge, especially unified memory (my last real CUDA development was 3 years ago), so I am a bit rusty.
The problem:
I am creating a task from a container using unified memory. However, I get a crash. After a few days of investigation,
I am able to say where the crash is (the copy constructor), but not why, because all pointers are allocated correctly.
As far as I can tell, I am not contradicting the NVIDIA post (https://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/)
about C++ and unified memory.
#include <cuda.h>
#include <cstdio>
template<class T>
struct container{
container(int size = 1){ cudaMallocManaged(&p,size*sizeof(T));}
~container(){cudaFree(p);}
__device__ __host__ T& operator[](int i){ return p[i];}
T * p;
};
struct task{
int* a;
};
__global__ void kernel_gpu(task& t, container<task>& v){
printf(" gpu value task %i, should be 2 \n", *(t.a)); // this works
task tmp(v[0]); // BUG
printf(" gpu value task from vector %i, should be 1 \n", *(tmp.a));
}
void kernel_cpu(task& t, container<task>& v){
printf(" cpu value task %i, should be 2 \n", *(t.a)); // this works
task tmp(v[0]);
printf(" cpu value task from vector %i, should be 1 \n", *(tmp.a));
}
int main(int argc, const char * argv[]) {
int* p1;
int* p2;
cudaMallocManaged(&p1,sizeof(int));
cudaMallocManaged(&p2,sizeof(int));
*p1 = 1;
*p2 = 2;
task t1,t2;
t1.a=p1;
t2.a=p2;
container<task> c(2);
c[0] = t1;
c[1] = t2;
//gpu does not work
kernel_gpu<<<1,1>>>(c[1],c);
cudaDeviceSynchronize();
//cpu should work, no concurrent access
kernel_cpu(c[1],c);
printf("job done !\n");
cudaFree(p1);
cudaFree(p2);
return 0;
}
In principle I can pass an object as an argument when its memory has been allocated properly. However, it looks like it is not possible to use a second level
of indirection (here the container).
I am making a conceptual mistake, but I do not see where.
Best,
Timocafe
my machine: cuda 7.5, gcc 4.8.2, Tesla K20 m
Although the memory was allocated as Unified Memory, the container object itself is declared in host code and lives in host memory: container<task> c(2);. You cannot pass it by reference to device code, and dereferencing it in a kernel will very likely result in an illegal memory access.
You may want to use cuda-memcheck to identify such issues.
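One way to make the second level of indirection work is to place the container object itself in managed memory and pass it to the kernel by pointer. The following is a minimal sketch under that assumption, not the only possible fix. The helper run_demo, the switch from references to pointers, and the use of placement new are illustrative choices that do not appear in the question.

```cuda
#include <cstdio>
#include <new>
#include <cuda_runtime.h>

struct task {
    int* a;
};

template <class T>
struct container {
    explicit container(int size = 1) { cudaMallocManaged(&p, size * sizeof(T)); }
    ~container() { cudaFree(p); }
    __host__ __device__ T& operator[](int i) { return p[i]; }
    T* p;
};

// Both arguments are pointers into managed memory, so the device can follow them.
__global__ void kernel_gpu(task* t, container<task>* v) {
    printf(" gpu value task %i, should be 2 \n", *(t->a));
    task tmp((*v)[0]);  // the container object is now reachable from the device
    printf(" gpu value task from vector %i, should be 1 \n", *(tmp.a));
}

// Hypothetical helper: returns true if the kernel ran without a CUDA error.
bool run_demo() {
    int *p1, *p2;
    cudaMallocManaged(&p1, sizeof(int));
    cudaMallocManaged(&p2, sizeof(int));
    *p1 = 1;
    *p2 = 2;

    // Allocate storage for the container object in managed memory,
    // then construct it in place with placement new.
    container<task>* c;
    cudaMallocManaged(&c, sizeof(container<task>));
    new (c) container<task>(2);
    (*c)[0].a = p1;
    (*c)[1].a = p2;

    kernel_gpu<<<1, 1>>>(&(*c)[1], c);
    cudaError_t err = cudaGetLastError();               // launch errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors

    c->~container();  // run the destructor by hand, then release the storage
    cudaFree(c);
    cudaFree(p1);
    cudaFree(p2);
    return err == cudaSuccess;
}

int main() {
    bool ok = run_demo();
    printf("%s\n", ok ? "job done !" : "kernel failed");
    return ok ? 0 : 1;
}
```

The host only touches the managed allocations before the launch and after cudaDeviceSynchronize(), which matters on older GPUs such as the K20 where concurrent host and device access to managed memory is not allowed.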

Execution time issue in CUDA benchmarks

I am trying to profile some CUDA Rodinia benchmarks in terms of their SM and memory utilization, power consumption, etc. For that, I simultaneously execute the benchmark and the profiler, which essentially spawns a pthread to profile the GPU execution using the NVML library.
The issue is that the execution time of a benchmark is much higher (about 3 times) when I do not run the profiler alongside it than when the benchmark runs with the profiler. The CPU frequency scaling governor is set to userspace, so I do not think the CPU frequency is changing. Is it due to fluctuation in the GPU frequency?
Below is the code for the profiler.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h> // exit()
#include <unistd.h> // usleep()
#include "nvml.h"
#define NUM_THREADS 1
void *PrintHello(void *threadid)
{
long tid;
tid = (long)threadid;
// printf("Hello World! It's me, thread #%ld!\n", tid);
nvmlReturn_t result;
nvmlDevice_t device;
nvmlUtilization_t utilization;
nvmlClockType_t jok;
unsigned int device_count, i,powergpu,clo;
char version[80];
result = nvmlInit();
result = nvmlSystemGetDriverVersion(version,80);
printf("\n Driver version: %s \n\n", version);
result = nvmlDeviceGetCount(&device_count);
printf("Found %d device%s\n\n", device_count,
device_count != 1 ? "s" : "");
printf("Listing devices:\n");
result = nvmlDeviceGetHandleByIndex(0, &device);
while(1)
{
result = nvmlDeviceGetPowerUsage(device,&powergpu );
result = nvmlDeviceGetUtilizationRates(device, &utilization);
printf("\n%d\n",powergpu);
if (result == NVML_SUCCESS)
{
printf("%d\n", utilization.gpu);
printf("%d\n", utilization.memory);
}
result=nvmlDeviceGetClockInfo(device,NVML_CLOCK_SM,&clo);
if(result==NVML_SUCCESS)
{
printf("%d\n",clo);
}
usleep(500000);
}
pthread_exit(NULL);
}
int main (int argc, char *argv[])
{
pthread_t threads[NUM_THREADS];
int rc;
long t;
for(t=0; t<NUM_THREADS; t++){
printf("In main: creating thread %ld\n", t);
rc = pthread_create(&threads[t], NULL, PrintHello, (void *)t);
if (rc){
printf("ERROR; return code from pthread_create() is %d\n", rc);
exit(-1);
}
}
/* Last thing that main() should do */
pthread_exit(NULL);
}
With your profiler running, the GPU(s) are being pulled out of their sleep state (due to the access to the NVML API, which is querying data from the GPUs). This makes them respond much more quickly to a CUDA application, and so the application appears to run "faster" if you time the entire application execution (e.g. using the Linux time command).
One solution is to place the GPUs in "persistence mode" with the nvidia-smi command (use nvidia-smi --help to get command line help).
Another solution would be to do the timing from within the application, and exclude the CUDA start-up time from the timing measurement, perhaps by executing a CUDA call such as cudaFree(0); prior to the start of timing.
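The second suggestion can be sketched roughly as follows. This is only an illustration: the kernel scale and the helper timed_run_ms are hypothetical stand-ins for the benchmark's real work, and the code assumes a compiler with C++11 support for std::chrono.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the benchmark's real work (hypothetical).
__global__ void scale(int n, float* x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = 2.0f * i;
}

// Times allocation, kernel execution and cleanup; returns milliseconds.
double timed_run_ms(int n) {
    auto t0 = std::chrono::steady_clock::now();
    float* d_x = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(n, d_x);
    cudaDeviceSynchronize();  // kernel launches are asynchronous, so wait here
    cudaFree(d_x);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    cudaFree(0);  // forces CUDA context creation before the timed region starts
    double ms = timed_run_ms(1 << 20);
    printf("timed region: %.3f ms\n", ms);
    return 0;
}
```

Because the context already exists when the timer starts, the measurement should no longer depend on whether another process happened to wake the GPU beforehand.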

Dot Product in CUDA using atomic operations - getting wrong results

I am trying to implement the dot product in CUDA and compare the result with what MATLAB returns. My CUDA code (based on this tutorial) is the following:
#include <stdio.h>
#define N (2048 * 8)
#define THREADS_PER_BLOCK 512
#define num_t float
// The kernel - DOT PRODUCT
__global__ void dot(num_t *a, num_t *b, num_t *c)
{
__shared__ num_t temp[THREADS_PER_BLOCK];
int index = threadIdx.x + blockIdx.x * blockDim.x;
temp[threadIdx.x] = a[index] * b[index];
__syncthreads(); //Synchronize!
*c = 0.00;
// Does it need to be tid==0 that
// undertakes this task?
if (0 == threadIdx.x) {
num_t sum = 0.00;
int i;
for (i=0; i<THREADS_PER_BLOCK; i++)
sum += temp[i];
atomicAdd(c, sum);
//WRONG: *c += sum; This read-write operation must be atomic!
}
}
// Initialize the vectors:
void init_vector(num_t *x)
{
int i;
for (i=0 ; i<N ; i++){
x[i] = 0.001 * i;
}
}
// MAIN
int main(void)
{
num_t *a, *b, *c;
num_t *dev_a, *dev_b, *dev_c;
size_t size = N * sizeof(num_t);
cudaMalloc((void**)&dev_a, size);
cudaMalloc((void**)&dev_b, size);
cudaMalloc((void**)&dev_c, size);
a = (num_t*)malloc(size);
b = (num_t*)malloc(size);
c = (num_t*)malloc(size);
init_vector(a);
init_vector(b);
cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);
dot<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c);
cudaMemcpy(c, dev_c, sizeof(num_t), cudaMemcpyDeviceToHost);
printf("a = [\n");
int i;
for (i=0;i<10;i++){
printf("%g\n",a[i]);
}
printf("...\n");
for (i=N-10;i<N;i++){
printf("%g\n",a[i]);
}
printf("]\n\n");
printf("a*b = %g.\n", *c);
free(a); free(b); free(c);
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
}
and I compile it with:
/usr/local/cuda-5.0/bin/nvcc -m64 -I/usr/local/cuda-5.0/include -gencode arch=compute_20,code=sm_20 -o multi_dot_product.o -c multi_dot_product.cu
g++ -m64 -o multi_dot_product multi_dot_product.o -L/usr/local/cuda-5.0/lib64 -lcudart
Information about my NVIDIA cards can be found at http://pastebin.com/8yTzXUuK. I tried to verify the result in MATLAB using the following simple code:
N = 2048 * 8;
a = zeros(N,1);
for i=1:N
a(i) = 0.001*(i-1);
end
dot_product = a'*a;
But as N increases, I'm getting significantly different results (for instance, for N=2048*32 CUDA returns 6.73066e+07 while MATLAB returns 9.3823e+07; for N=2048*64 CUDA gives 3.28033e+08 while MATLAB gives 7.5059e+08). I am inclined to believe that the discrepancy stems from the use of float in my C code, but if I replace it with double the compiler complains that atomicAdd does not support double parameters. How should I fix this problem?
Update: Also, for high values of N (e.g. 2048*64), I noticed that the result returned by CUDA changes at every run. This does not happen if N is low (e.g. 2048*8).
At the same time I have a more fundamental question: the variable temp is an array of size THREADS_PER_BLOCK and is shared between threads in the same block. Is it also shared between blocks, or does every block operate on its own copy of this variable? Should I think of the method dot as instructions to every block? Can someone elaborate on how exactly the jobs are split and how the variables are shared in this example?
Comment this line out of your kernel:
// *c = 0.00;
And add these lines to your host code, before the kernel call (after the cudaMalloc of dev_c):
num_t h_c = 0.0f;
cudaMemcpy(dev_c, &h_c, sizeof(num_t), cudaMemcpyHostToDevice);
And I believe you'll get results that match MATLAB, more or less.
The fact that you have this line in your kernel unprotected by any synchronization is messing you up. Every thread of every block, whenever they happen to execute, is zeroing out c as you have written it.
By the way, we can do significantly better with this operation in general by using a classical parallel reduction method. A basic (not optimized) illustration is here. If you combine that method with your usage of shared memory and a single atomicAdd at the end (one atomicAdd per block) you'll have a significantly improved implementation. Although it's not a dot product, this example combines those ideas.
Edit: responding to a question below in the comments:
A kernel function is the set of instructions that all threads in the grid (all threads associated with a kernel launch, by definition) execute. However, it's reasonable to think of execution as being managed by threadblock, since the threads in a threadblock execute together to a large extent. Even within a threadblock, though, execution is not necessarily in perfect lockstep across all threads. Normally when we think of lockstep execution, we think of a warp, which is a group of 32 threads in a single threadblock. Therefore, since execution amongst warps within a block can be skewed, this hazard was present even for a single threadblock. If there were only one threadblock, we could have gotten rid of the hazard in your code using appropriate sync and control mechanisms like __syncthreads() and if (threadIdx.x == 0), etc. But these mechanisms are useless for the general case of controlling execution across multiple threadblocks. Multiple threadblocks can execute in any order. The only defined sync mechanism across an entire grid is the kernel launch itself. Therefore, to fix your issue, we had to zero out c prior to the kernel launch.
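Putting the pieces of this answer together, a dot product with a shared-memory reduction and one atomicAdd per block might look like the sketch below. It is a basic, unoptimized illustration rather than the code from the linked examples. The names dot_reduce and dot_gpu are made up here, and the reduction loop assumes THREADS_PER_BLOCK is a power of two.

```cuda
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 512  // must be a power of two for the loop below

__global__ void dot_reduce(const float* a, const float* b, float* c, int n) {
    __shared__ float temp[THREADS_PER_BLOCK];  // each block gets its own copy
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = (index < n) ? a[index] * b[index] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            temp[threadIdx.x] += temp[threadIdx.x + stride];
        __syncthreads();
    }

    // One atomicAdd per block folds the block's partial sum into the result.
    if (threadIdx.x == 0) atomicAdd(c, temp[0]);
}

// Hypothetical host wrapper: copies inputs, launches the kernel, returns the sum.
float dot_gpu(const float* a, const float* b, int n) {
    float *dev_a, *dev_b, *dev_c;
    size_t size = n * sizeof(float);
    cudaMalloc(&dev_a, size);
    cudaMalloc(&dev_b, size);
    cudaMalloc(&dev_c, sizeof(float));
    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

    float h_c = 0.0f;  // zero the accumulator before the launch, not in the kernel
    cudaMemcpy(dev_c, &h_c, sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    dot_reduce<<<blocks, THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c, n);
    cudaMemcpy(&h_c, dev_c, sizeof(float), cudaMemcpyDeviceToHost);  // waits for the kernel

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return h_c;
}

int main() {
    const int n = 2048 * 8;
    float* a = (float*)malloc(n * sizeof(float));
    double ref = 0.0;  // double-precision CPU reference
    for (int i = 0; i < n; i++) {
        a[i] = 0.001f * i;
        ref += (double)a[i] * a[i];
    }
    float got = dot_gpu(a, a, n);
    printf("GPU (float): %g, CPU reference (double): %g\n", got, ref);
    free(a);
    return 0;
}
```

The order in which blocks reach the atomicAdd is still undefined, so the last few digits of a float result can vary from run to run. That small variation is a separate effect from the large errors caused by zeroing c inside the kernel.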

cudaMemcpy is too slow on Tesla C2075

I'm currently working on a server with 2 CUDA-capable GPUs: a Quadro 400 and a Tesla C2075. I made a simple vector addition test program. My problem is that while the Tesla C2075 is supposed to be more powerful than the Quadro 400, it takes longer to do the job. I found that cudaMemcpy takes up most of the execution time and that it runs slower on the more powerful GPU.
Here's the source:
void get_matrix(float* arr1,float* arr2,int N1,int N2)
{
int Nx,Ny;
int n_blocks,n_threads;
int dev=0; // 1
float time;
size_t size;
clock_t start,end;
cudaSetDevice(dev);
cudaDeviceProp deviceProp;
start = clock();
cudaGetDeviceProperties(&deviceProp, dev);
Nx=N1;
Ny=N2;
n_threads=256;
n_blocks=(Nx*Ny+n_threads-1)/n_threads;
size=Nx*Ny*sizeof(float);
cudaMalloc((void**)&d_A,size);
cudaMalloc((void**)&d_B,size);
cudaMemcpy(d_A, arr1, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, arr2, size, cudaMemcpyHostToDevice);
vector_add<<<n_blocks,n_threads>>>(d_A,d_B,size);
cudaMemcpy(arr1, d_A, size, cudaMemcpyDeviceToHost);
printf("Running device %s \n",deviceProp.name);
end = clock();
time=float(end-start)/float(CLOCKS_PER_SEC);
printf("time = %e\n",time);
}
int main()
{
int const nx = 20000,ny = nx;
static float a[nx*ny],b[nx*ny];
for(int i=0;i<nx;i++)
{
for(int j=0;j<ny;j++)
{
a[j+ny*i]=j+10*i;
b[j+ny*i]=-(j+10*i);
}
}
get_matrix(a,b,nx,ny);
return 0;
}
The output is:
Running device Quadro 400
time = 1.100000e-01
Running device Tesla C2075
time = 1.050000e+00
And my questions are:
Should I modify the code depending on what GPU I am going to use?
Is there any connection between the number of blocks, threads per block specified in the code and the number of multiprocessors, cores per multiprocessor available on a GPU?
I'm running Linux Open Suse 11.2. The source code is compiled using the nvcc compiler (version 4.2).
Thanks for your help!
Try invoking get_matrix(a,b,nx,ny) twice and take the second timing result. The first call into the CUDA API creates the CUDA context, which often takes a long time.
Please refer to this section in the CUDA C Best Practices Guide for how to determine the block size and grid size.
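Both points can be tried out with a small program along these lines. It is a sketch under several assumptions: the question does not show vector_add, so the kernel below is a guess at its shape; run_once is a made-up helper; and cudaOccupancyMaxPotentialBlockSize comes from CUDA releases much newer than the nvcc 4.2 mentioned in the question, so it is offered as one possible way to pick a block size rather than what the guide prescribes.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the vector_add kernel, which the question does not show.
__global__ void vector_add(float* a, const float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += b[i];
}

// Runs the copy, compute, copy-back sequence once; returns wall time in seconds.
double run_once(float* h_a, const float* h_b, int n) {
    auto start = std::chrono::steady_clock::now();
    size_t size = n * sizeof(float);
    float *d_A, *d_B;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMemcpy(d_A, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_b, size, cudaMemcpyHostToDevice);

    // Ask the runtime for a block size that maximizes occupancy for this kernel.
    int min_grid = 0, n_threads = 0;
    cudaOccupancyMaxPotentialBlockSize(&min_grid, &n_threads, vector_add, 0, 0);
    if (n_threads <= 0) n_threads = 256;             // fallback if the query failed
    int n_blocks = (n + n_threads - 1) / n_threads;  // enough blocks to cover n

    vector_add<<<n_blocks, n_threads>>>(d_A, d_B, n);
    cudaMemcpy(h_a, d_A, size, cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(d_A);
    cudaFree(d_B);
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}

int main() {
    const int n = 1 << 22;
    float* a = new float[n];
    float* b = new float[n];
    for (int i = 0; i < n; i++) {
        a[i] = 1.0f;
        b[i] = 2.0f;
    }
    double first = run_once(a, b, n);   // includes CUDA context creation
    double second = run_once(a, b, n);  // context already exists
    printf("first call:  %e s\n", first);
    printf("second call: %e s\n", second);
    delete[] a;
    delete[] b;
    return 0;
}
```

Comparing the second timing across the two GPUs should give a fairer picture than the single-call numbers in the question.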

atomic operation disrupting all kernels

I am running some image processing operations on GPU and I need the histogram of the output.
I have written and tested the processing kernels. Also I have tested the histogram kernel for samples of the output pictures separately. They both work fine but when I put all of them in one loop I get nothing.
This is my histogram kernel:
__global__ void histogram(int n, uchar* color, uchar* mask, int* bucket, int ch, int W, int bin)
{
unsigned int X = blockIdx.x*blockDim.x+threadIdx.x;
unsigned int Y = blockIdx.y*blockDim.y+threadIdx.y;
int l = (256%bin==0)?256/bin: 256/bin+1;
int c;
if (X+Y*W < n && mask[X+Y*W])
{
c = color[(X+Y*W)*3]/bin;
atomicAdd(&bucket[c], 1);
c = color[(X+Y*W)*3+1]/bin;
atomicAdd(&bucket[c+l], 1);
c = color[(X+Y*W)*3+2]/bin;
atomicAdd(&bucket[c+l*2], 1);
}
}
It updates the histogram vectors for red, green, and blue ('l' is the length of each vector).
When I comment out the atomicAdd calls, the output is produced again, but of course not the histogram.
Why don't they work together?
Edit:
This is the loop:
cudaMemcpy(frame_in_gpu,frame_in.data, W*H*3*sizeof(uchar),cudaMemcpyHostToDevice);
cuda_process(frame_in_gpu, frame_out_gpu, W, H, dimGrid,dimBlock);
cuda_histogram(W*H, frame_in_gpu, mask_gpu, hist, 3, W, bin, dimg_histogram, dimb_histogram);
Then I copy the output to host memory and write it to a video.
These are C functions that only launch their kernels with the grid and block dimensions given as inputs. Also:
dim3 dimBlock(32,32);
dim3 dimGrid(W/32,H/32);
dim3 dimb_histogram(16,16);
dim3 dimg_histogram(W/16,H/16);
I changed this for histogram because it worked better with it. Does it matter?
Edit2:
I am using -arch=sm_11 option for compilation. I just read it somewhere. Could anyone tell me how I should choose it?
Perhaps you should try compiling without the -arch=sm_11 flag.
SM 1.1 is the first architecture that supported atomic operations on global memory, while your GPU supports SM 2.0. Hence there is no reason to compile for SM 1.1 except for backward compatibility.
One possible issue could be that SM 1.1 does not support atomic operations on 64-bit ints in global memory. So I would suggest you recompile the code without the -arch option, or use -arch=sm_20 if you like.
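If you are unsure which value to pass to -arch, you can ask the runtime for the compute capability of each installed GPU. The sketch below does that. The helper arch_flag_for_device is a made-up name, and the printed flag is only a suggestion based on the reported major and minor version.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Writes a flag such as "-arch=sm_20" for the given device into buf.
// Returns false if the device properties could not be queried.
bool arch_flag_for_device(int dev, char* buf, size_t len) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return false;
    snprintf(buf, len, "-arch=sm_%d%d", prop.major, prop.minor);
    return true;
}

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; dev++) {
        char flag[32];
        if (arch_flag_for_device(dev, flag, sizeof(flag)))
            printf("device %d: compile with %s\n", dev, flag);
    }
    return 0;
}
```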