I'm working with CUDA as a beginner and am trying to run some pre-written code.
The compiler gives an error for every atomic operation the code contains.
For example:
__global__ void MarkEdgesUV(unsigned int *d_edge_flag, unsigned long long int *d_appended_uvw, unsigned int *d_size, int no_of_edges)
{
    unsigned int tid = blockIdx.x*MAX_THREADS_PER_BLOCK + threadIdx.x;
    if(tid<no_of_edges)
    {
        if(tid>0)
        {
            unsigned long long int test = INF;
            test = test << NO_OF_BITS_MOVED_FOR_VERTEX_IDS;
            test |= INF;
            unsigned long long int test1 = d_appended_uvw[tid]>>(64-(NO_OF_BITS_MOVED_FOR_VERTEX_IDS+NO_OF_BITS_MOVED_FOR_VERTEX_IDS));
            unsigned long long int test2 = d_appended_uvw[tid-1]>>(64-(NO_OF_BITS_MOVED_FOR_VERTEX_IDS+NO_OF_BITS_MOVED_FOR_VERTEX_IDS));
            if(test1>test2)
                d_edge_flag[tid]=1;
            if(test1 == test)
                atomicMin(d_size, tid); // also to know the last element in the array, i.e. the size of the new edge list
        }
        else
            d_edge_flag[tid]=1;
    }
}
gives the error:
error: identifier "atomicMin" is undefined
This happens to be well-tested, reliable code... I also checked, and the usage of atomics seems correct... please explain why this error has occurred.
I guess you are compiling with nvcc only (defaulting to sm_10), without specifying the minimum required compute capability. In fact, atomicMin() operating on 32-bit words in global memory was introduced with compute capability 1.1 (CC 1.1) devices, as you can see in this table.
Try compiling with
nvcc -arch sm_11 ...
This will let the atomicMin() function be recognized. However, it is best to compile for the actual architecture of your GPU.
EDIT: since it is a Tesla C2075:
nvcc -arch sm_20 ...
should work as well.
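If you're not sure what compute capability your GPU has, here is a minimal sketch (assuming a single device at index 0) that queries it through the runtime API and prints the matching -arch value:
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("%s: compute capability %d.%d -> compile with -arch=sm_%d%d\n",
           prop.name, prop.major, prop.minor, prop.major, prop.minor);
    return 0;
}
Save it as, say, devquery.cu (the file name is just for illustration) and build it with a plain nvcc devquery.cu -o devquery.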
EDIT: as requested by the OP, adding a reference:
You can find a very detailed explanation of the meaning of -arch and -code options here.
In particular it's reported explicitly that both -arch and -code options can be omitted.
nvcc x.cu
is short hand for
nvcc x.cu -arch=compute_10 -code=sm_10,compute_10
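Conversely, if I remember the nvcc shorthand correctly, specifying a real architecture expands to the matching virtual architecture plus machine code, so for this question
nvcc x.cu -arch=sm_11
is roughly equivalent to
nvcc x.cu -arch=compute_11 -code=sm_11,compute_11
which generates sm_11 machine code and also embeds the PTX for later JIT compilation.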
Related
The default nvprof output is great, but nvprof has been deprecated in favor of ncu. How can I make ncu give me an output that looks more like nvprof?
minimal working example
I have 2 range functions, one of which is called in a very suboptimal way (using only 1 thread). It takes much longer than the other range function.
profile.cu
#include <stdio.h>

//! makes sure both range functions executed correctly
bool check_range(int N, float *x_d) {
    float *x_h;
    cudaMallocHost(&x_h, N*sizeof(float));
    cudaMemcpy(x_h, x_d, N*sizeof(float), cudaMemcpyDeviceToHost);
    bool success = true;
    for( int i=0; i < N; i++)
        if( x_h[i] != i ) {
            printf("\33[31mERROR: x[%d]=%g\33[0m\n", i, x_h[i]);
            success = false;
            break;
        }
    cudaFreeHost(x_h);
    return success;
}

//! called with many threads
__global__ void range_fast(int N, float *x) {
    for( int i=threadIdx.x; i < N; i+=blockDim.x)
        x[i] = i;
}

//! only gets called with 1 thread. This is the bottleneck I want to detect
__global__ void range_slow(int N, float *x) {
    for( int i=threadIdx.x; i < N; i+=blockDim.x)
        x[i] = i;
}

int main(int argc, char *argv[]) {
    int N = (1<<20)*10;
    float *x_fast, *x_slow;
    cudaMalloc(&x_fast, N*sizeof(float));
    cudaMalloc(&x_slow, N*sizeof(float));
    range_fast<<<1,512>>>(N, x_fast);
    range_slow<<<1,1>>>(N, x_slow);
    check_range(N, x_fast);
    check_range(N, x_slow);
    cudaFree(x_fast);
    cudaFree(x_slow);
    return 0;
}
compilation
nvcc profile.cu -o profile.exe
nvprof profiling
nvprof ./profile.exe
nvprof output
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 99.17% 1.20266s 1 1.20266s 1.20266s 1.20266s range_slow(int, float*)
0.53% 6.3921ms 2 3.1961ms 3.1860ms 3.2061ms [CUDA memcpy DtoH]
0.31% 3.7273ms 1 3.7273ms 3.7273ms 3.7273ms range_fast(int, float*)
API calls: 88.79% 1.20524s 2 602.62ms 3.2087ms 1.20203s cudaMemcpy
9.31% 126.39ms 2 63.196ms 100.62us 126.29ms cudaMalloc
1.11% 15.121ms 2 7.5607ms 7.5460ms 7.5754ms cudaHostAlloc
0.64% 8.6687ms 2 4.3344ms 4.2029ms 4.4658ms cudaFreeHost
0.09% 1.2195ms 2 609.73us 103.80us 1.1157ms cudaFree
This gives me a clear idea about which functions are taking most of the runtime, and that range_slow is the bottleneck.
ncu profiling
ncu ./profile.exe
ncu output
The ncu output has far more details, most of which I don't really care about. It also isn't as nicely summarized.
The functionality of nvprof has been broken into 2 separate tools in the "new" profiling toolchain. The Nsight Compute tool is mostly focused on kernel (i.e. device code) profiling, and although it can of course report kernel duration, it is less interested in things like API call activity and memory copy activity.
The tool that has this functionality is Nsight Systems.
Try:
nsys profile --stats=true ./profile.exe
Amongst other things, you will get a pareto list of GPU activities (broken into separate pareto lists of kernel activities and memory copy activities) and a pareto list of API calls.
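If you then want the kernel-level detail from Nsight Compute, but only for the kernel that nsys flags as hot, you can restrict ncu to that kernel by name. The exact option spelling has changed between tool versions, so check ncu --help, but on recent versions something like
ncu -k range_slow ./profile.exe
should limit the report to the range_slow launches.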
I have recently installed CUDA on my Arch Linux machine through the system's package manager, and I have been trying to test whether or not it is working by running a simple vector addition program.
I simply copy-paste the code from this tutorial (both the single-kernel and multi-kernel versions) into a file titled cuda_test.cu and run
> nvcc cuda_test.cu -o cuda_test
In either case, the program runs and I get no errors (both in the sense that the program doesn't crash and that the output reports no errors). But when I try to run the CUDA profiler on the program:
> sudo nvprof ./cuda_test
I get result:
==3201== NVPROF is profiling process 3201, command: ./cuda_test
Max error: 0
==3201== Profiling application: ./cuda_test
==3201== Profiling result:
No kernels were profiled.
No API activities were profiled.
==3201== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.
The latter warning is not my main problem or the topic of my question; my problem is the messages saying that no kernels and no API activities were profiled.
Does this mean that the program was run entirely on my CPU? or is it an error in nvprof?
I have found a discussion about the same error here, but there the answer was that the wrong version of CUDA was installed; in my case, the version installed is the latest available through the system's package manager (version 10.1.243-1).
Is there any way I can get nvprof to display the expected output?
Edit
Trying to adhere to the warning at the end does not solve the problem:
Adding a call to cudaProfilerStop() (or cuProfilerStop()), also adding cudaDeviceReset(); at the end as suggested, including the appropriate header (cuda_profiler_api.h or cudaProfiler.h), and compiling with
> nvcc cuda_test.cu -o cuda_test -lcuda
yields a program which can still run, but which, when nvprof is run upon it, returns:
==12558== NVPROF is profiling process 12558, command: ./cuda_test
Max error: 0
==12558== Profiling application: ./cuda_test
==12558== Profiling result:
No kernels were profiled.
No API activities were profiled.
==12558== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.
======== Error: Application received signal 139
This has not solved the original problem, and has in fact created a new error; the same happens when cudaProfilerStop() is used on its own or alongside cuProfilerStop() and cudaDeviceReset();
The code
The code is, as mentioned, copied from a tutorial to test whether CUDA is working, though I have also included calls to cudaProfilerStop() and cudaDeviceReset(). For clarity, it is included here:
#include <iostream>
#include <math.h>
#include <cuda_profiler_api.h>

// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
    int index = threadIdx.x;
    int stride = blockDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

int main(void)
{
    int N = 1<<20;
    float *x, *y;

    cudaProfilerStart();

    // Allocate Unified Memory – accessible from CPU or GPU
    cudaMallocManaged(&x, N*sizeof(float));
    cudaMallocManaged(&y, N*sizeof(float));

    // initialize x and y arrays on the host
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    // Run kernel on 1M elements on the GPU
    add<<<1, 1>>>(N, x, y);

    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();

    // Check for errors (all values should be 3.0f)
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i]-3.0f));
    std::cout << "Max error: " << maxError << std::endl;

    // Free memory
    cudaFree(x);
    cudaFree(y);

    cudaDeviceReset();
    cudaProfilerStop();

    return 0;
}
This problem was apparently somewhat well known, after some searching I found this thread about the error-code in the edited version; the solution as discussed there is to call nvprof with the flag --unified-memory-profiling off:
> sudo nvprof --unified-memory-profiling off ./cuda_test
This makes nvprof work as expected, even without the call to cudaProfilerStop().
You can solve the problem by using
sudo nvprof --unified-memory-profiling per-process-device <your program>
The code below compiles just fine, but when I try to run it, I get:
GPUassert: invalid device symbol file.cu 114
When I comment out the lines marked by (!!!), the error does not show up. My question is: what is causing this error? It makes no sense to me.
Compiling with nvcc file.cu -arch compute_11
#include "stdio.h"
#include <algorithm>
#include <ctime>
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
#define THREADS 64
#define BLOCKS 256
#define _dif (((1ll<<32)-121)/(THREADS*BLOCKS)+1)
#define HASH_SIZE 1024
#define ROUNDS 16
#define HASH_ROW (HASH_SIZE/ROUNDS)+(HASH_SIZE%ROUNDS==0?0:1)
#define HASH_COL 1000000000/HASH_SIZE
typedef unsigned long long ull;
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
//fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
printf("GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
__device__ unsigned int primes[1024];
//__device__ unsigned char primes[(1<<28)+1];
__device__ long long n = 1ll<<32;
__device__ ull dev_base;
__device__ unsigned int dev_hash;
__device__ unsigned int dev_index;
time_t curtime;
__device__ int hashh(long long x) {
return (x>>1)%1024;
}
// compute (x^e)%n
__device__ ull mulmod(ull x,ull e,ull n) {
ull ans = 1;
while(e>0) {
if(e&1) ans = (ans*x)%n;
x = (x*x)%n;
e>>=1;
}
return ans;
}
// determine whether n is strong probable prime base a or not.
// n is ODD
__device__ int is_SPRP(ull a,ull n) {
int d=0;
ull t = n-1;
while(t%2==0) {
++d;
t>>=1;
}
ull x = mulmod(a,t,n);
if(x==1) return 1;
for(int i=0;i<d;++i) {
if(x==n-1) return 1;
x=(x*x)%n;
}
return 0;
}
__device__ int prime(long long x) {
//unsigned long long b = 2;
//return is_SPRP(b,(unsigned long long)x);
return is_SPRP((unsigned long long)primes[(((long long)0xAFF7B4*x)>>7)%1024],(unsigned long long)x);
}
__global__ void find(unsigned int *out,unsigned int *c) {
unsigned int buff[HASH_ROW][256];
int local_c[HASH_ROW];
for(int i=0;i<HASH_ROW;++i) local_c[i]=0;
long long b = 121+(threadIdx.x+blockIdx.x*blockDim.x)*_dif;
long long e = b+_dif;
if(b%2==0) ++b;
for(long long i=b;i<e && i<n;i+=2) {
if(i%3==0 || i%5==0 || i%7==0) continue;
int hash_num = hashh(i)-(dev_hash*(HASH_ROW));
if(0<=hash_num && hash_num<HASH_ROW) {
if(prime(i)) continue;
buff[hash_num][local_c[hash_num]++]=(unsigned int)i;
if(local_c[hash_num]==256) {
int start = atomicAdd(c+hash_num,local_c[hash_num]);
if(start+local_c[hash_num]>=HASH_COL) return;
unsigned int *out_offset = out+hash_num*(HASH_COL)*4;
for(int i=0;i<local_c[hash_num];++i) out_offset[i+start]=buff[hash_num][i]; //(!!!)
local_c[hash_num]=0;
}
}
}
for(int i=0;i<HASH_ROW;++i) {
int start = atomicAdd(c+i,local_c[i]);
if(start+local_c[i]>=HASH_COL) return;
unsigned int *out_offset = out+i*(HASH_COL)*4;
for(int j=0;j<local_c[i];++j) out_offset[j+start]=buff[i][j]; //(!!!)
}
}
int main(void) {
printf("HASH_ROW: %d\nHASH_COL: %d\nPRODUCT: %d\n",(int)HASH_ROW,(int)HASH_COL,(int)(HASH_ROW)*(HASH_COL));
ull *base_adr;
gpuErrchk(cudaGetSymbolAddress((void**)&base_adr,dev_base));
gpuErrchk(cudaMemset(base_adr,0,7));
gpuErrchk(cudaMemset(base_adr,0x02,1));
}
A rather unusual error.
The failure is occurring because:
By specifying a virtual architecture only (-arch compute_11) you defer the PTX compile step until runtime (i.e. you are forcing JIT-compile)
The JIT-compile is failing (at runtime)
The failure of the JIT-compile (and link) means device symbols cannot be properly established
Due to the problem with device symbols, the operation cudaGetSymbolAddress on the device symbol dev_base fails, and throws an error.
Why is the JIT-compile failing? You can find out yourself by triggering the machine code compile (which runs the ptxas assembler) by specifying -arch=sm_11 instead of -arch compute_11. If you do that, you'll get this result:
ptxas error : Entry function '_Z4findPjS_' uses too much local data (0x10100 bytes, 0x4000 max)
So even though your code doesn't call the find kernel, it must compile successfully to have a sane device environment for symbols.
Why does this compile error occur? Because you are requesting too much local memory per thread. cc 1.x devices are limited to 16KB local memory per thread, and your find kernel is requesting quite a bit more than that (over 64KB).
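You can verify that figure against the kernel source: buff[HASH_ROW][256] is 64 x 256 x sizeof(unsigned int) = 65,536 bytes, and local_c[HASH_ROW] adds another 64 x 4 = 256 bytes, for a total of 65,792 bytes = 0x10100, exactly the amount ptxas reports, versus the 0x4000 (16KB) per-thread limit.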
When I initially tried it on my device, I was using a cc2.0 device which has a higher limit (512KB per thread) and so the JIT-compile step succeeded.
In general, I would recommend specifying both a virtual architecture and a machine architecture, and the shorthand way to do that is:
nvcc -arch=sm_11 ....
(for a cc1.1 device)
This question/answer may also be of interest, and the nvcc manual has more details about virtual vs. machine architecture, and how to specify the compilation phases for each.
I believe the reason the error goes away when you comment out those particular lines in the kernel is that, with those commented out, the compiler is able to optimize out the accesses to those local memory areas, and therefore also the instantiation of the local memory itself. This allows the JIT-compile step to complete successfully, and your code runs "without runtime error".
You can verify this by commenting those lines out and then specify a full compile (nvcc -arch=sm_11 ...), where -arch is short for --gpu-architecture.
This error usually means the kernel has been compiled for the wrong architecture. You need to find out what the compute capability of your GPU is, and then compile it for that architecture. E.g. if your GPU has compute capability 1.1, compile it with -arch=sm_11. You can also build an executable for more than one architecture.
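For example, one way to build a fatbinary containing code for several architectures (adjust the list to the devices you actually care about) is:
nvcc x.cu -gencode arch=compute_11,code=sm_11 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_20,code=compute_20
The last -gencode entry also embeds PTX, so the code can still be JIT-compiled for newer devices.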
I am trying to profile some CUDA Rodinia benchmarks in terms of their SM and memory utilization, power consumption, etc. For that, I simultaneously execute the benchmark and the profiler, which essentially spawns a pthread that samples the GPU using the NVML library.
The issue is that the execution time of a benchmark is much higher (about 3 times) when I do not invoke the profiler along with it than when the benchmark executes together with the profiler. The frequency scaling governor for the CPU is userspace, so I do not think the CPU frequency is changing. Is it due to fluctuations in the GPU frequency?
Below is the code for the profiler.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>   // for exit()
#include "nvml.h"
#include "unistd.h"

#define NUM_THREADS 1

void *PrintHello(void *threadid)
{
    long tid;
    tid = (long)threadid;
    // printf("Hello World! It's me, thread #%ld!\n", tid);
    nvmlReturn_t result;
    nvmlDevice_t device;
    nvmlUtilization_t utilization;
    nvmlClockType_t jok;
    unsigned int device_count, i, powergpu, clo;
    char version[80];

    result = nvmlInit();
    result = nvmlSystemGetDriverVersion(version, 80);
    printf("\n Driver version: %s \n\n", version);

    result = nvmlDeviceGetCount(&device_count);
    printf("Found %d device%s\n\n", device_count,
           device_count != 1 ? "s" : "");
    printf("Listing devices:\n");

    result = nvmlDeviceGetHandleByIndex(0, &device);

    while(1)
    {
        result = nvmlDeviceGetPowerUsage(device, &powergpu);
        result = nvmlDeviceGetUtilizationRates(device, &utilization);
        printf("\n%d\n", powergpu);
        if (result == NVML_SUCCESS)
        {
            printf("%d\n", utilization.gpu);
            printf("%d\n", utilization.memory);
        }
        result = nvmlDeviceGetClockInfo(device, NVML_CLOCK_SM, &clo);
        if (result == NVML_SUCCESS)
        {
            printf("%d\n", clo);
        }
        usleep(500000);
    }
    pthread_exit(NULL);
}

int main(int argc, char *argv[])
{
    pthread_t threads[NUM_THREADS];
    int rc;
    long t;
    for(t=0; t<NUM_THREADS; t++){
        printf("In main: creating thread %ld\n", t);
        rc = pthread_create(&threads[t], NULL, PrintHello, (void *)t);
        if (rc){
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }
    /* Last thing that main() should do */
    pthread_exit(NULL);
}
With your profiler running, the GPU(s) are being pulled out of their sleep state (due to the access to the nvml API, which is querying data from the GPUs). This makes them respond much more quickly to a CUDA application, and so the application appears to run "faster" if you time the entire application execution (e.g. using the linux time command).
One solution is to place the GPUs in "persistence mode" with the nvidia-smi command (use nvidia-smi --help to get command line help).
Another solution would be to do the timing from within the application, and exclude the CUDA start-up time from the timing measurement, perhaps by executing a cuda command such as cudaFree(0); prior to the start of timing.
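For example (the exact flag spellings can be checked with nvidia-smi --help; persistence mode requires root):
sudo nvidia-smi -pm 1
And for the in-application approach, a minimal sketch of excluding the start-up cost from your own timing:
cudaFree(0);   // forces CUDA context creation / wakes the GPU
// start your timer here, then launch the kernels you want to measure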
I am running some image processing operations on GPU and I need the histogram of the output.
I have written and tested the processing kernels. Also I have tested the histogram kernel for samples of the output pictures separately. They both work fine but when I put all of them in one loop I get nothing.
This is my histogram kernel:
__global__ void histogram(int n, uchar* color, uchar* mask, int* bucket, int ch, int W, int bin)
{
    unsigned int X = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int Y = blockIdx.y*blockDim.y + threadIdx.y;
    int l = (256%bin==0) ? 256/bin : 256/bin+1;
    int c;
    if (X+Y*W < n && mask[X+Y*W])
    {
        c = color[(X+Y*W)*3]/bin;
        atomicAdd(&bucket[c], 1);
        c = color[(X+Y*W)*3+1]/bin;
        atomicAdd(&bucket[c+l], 1);
        c = color[(X+Y*W)*3+2]/bin;
        atomicAdd(&bucket[c+l*2], 1);
    }
}
It updates the histogram vectors for red, green, and blue ('l' is the length of the vectors).
When I comment out the atomicAdds, it again produces the output, but of course not the histogram.
Why don't they work together?
Edit:
This is the loop:
cudaMemcpy(frame_in_gpu,frame_in.data, W*H*3*sizeof(uchar),cudaMemcpyHostToDevice);
cuda_process(frame_in_gpu, frame_out_gpu, W, H, dimGrid,dimBlock);
cuda_histogram(W*H, frame_in_gpu, mask_gpu, hist, 3, W, bin, dimg_histogram, dimb_histogram);
Then I copy the output to host memory and write it to a video.
These are c codes that only call their kernels with dimGrid and dimBlock that are given as inputs. Also:
dim3 dimBlock(32,32);
dim3 dimGrid(W/32,H/32);
dim3 dimb_Histogram(16,16);
dim3 dimg_Histogram(W/16,H/16);
I changed this for histogram because it worked better with it. Does it matter?
Edit2:
I am using the -arch=sm_11 option for compilation. I just read about it somewhere. Could anyone tell me how I should choose it?
Perhaps you should try to compile without the -arch=sm_11 flag.
SM 1.1 is the first architecture that supported atomic operations on global memory, while your GPU supports SM 2.0. Hence there is no reason to compile for SM 1.1 unless you need backward compatibility.
One possible issue could be that SM 1.1 does not support atomic operations on 64-bit integers in global memory. So I would suggest you recompile the code without the -arch option, or use
-arch=sm_20 if you like
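Independently of the -arch flag, adding error checking right after the launch (inside your cuda_histogram wrapper) would show immediately whether the kernel is failing to launch. Here is a sketch using the standard runtime error queries, with the launch configuration names taken from your question:
histogram<<<dimg_histogram, dimb_histogram>>>(W*H, frame_in_gpu, mask_gpu, hist, 3, W, bin);
cudaError_t err = cudaGetLastError();          // catches launch/configuration errors
if (err != cudaSuccess)
    printf("histogram launch failed: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();                 // catches errors during kernel execution
if (err != cudaSuccess)
    printf("histogram execution failed: %s\n", cudaGetErrorString(err));
If the kernel was built for the wrong architecture, the first check typically reports something like "invalid device function".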