CUDA cudaLaunchCooperativeKernel and grid synchronization

I am trying to understand how to synchronize a grid of threads with cudaLaunchCooperativeKernel.
https://developer.nvidia.com/blog/cooperative-groups/
I have a very simple kernel where two threads update an array, sync and both print the array:
#include <cassert>
#include <cstdio>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void kernel(float *buf){
    cg::grid_group grid = cg::this_grid();
    if(grid.thread_rank() < 2)
        buf[grid.thread_rank()] = 10 + grid.thread_rank();
    assert(grid.is_valid()); // ok!
    grid.sync();
    if(grid.thread_rank() < 2)
        printf("thread=%d: %g %g\n", (int)grid.thread_rank(), buf[0], buf[1]);
}
Instead of printing values (10,11) twice, I get:
thread=0: 10 0
thread=1: 0 11
All CUDA calls returned success, cuda-memcheck is happy, my card is a "GeForce RTX 2060 SUPER", and it does support cooperative launches, which I checked with:
int supportsCoopLaunch = 0;
if( cudaSuccess != cudaDeviceGetAttribute(&supportsCoopLaunch, cudaDevAttrCooperativeLaunch, dev) )
    throw std::runtime_error("Cooperative Launch is not supported on this machine configuration.");
I am confused... why don't I see the synchronization?

This test is incorrect:
int supportsCoopLaunch = 0;
if( cudaSuccess != cudaDeviceGetAttribute(&supportsCoopLaunch, cudaDevAttrCooperativeLaunch, dev) )
    throw std::runtime_error("Cooperative Launch is not supported on this machine configuration.");
The support (or lack of it) is not communicated via the cudaError_t return value of the function; instead it is communicated via the value placed in the supportsCoopLaunch variable. You would want to do something like:
int supportsCoopLaunch = 0;
cudaDeviceGetAttribute(&supportsCoopLaunch, cudaDevAttrCooperativeLaunch, dev);
if( supportsCoopLaunch != 1)
    throw std::runtime_error("Cooperative Launch is not supported on this machine configuration.");

I found the bug. The actual code was something like this:
__device__ void kernel(float *buf){ /* see the function body above */ }

__global__ void parent_kernel(){
    float buf[2]; // per-thread local buffer!!! grid.sync() will not make it shared!
    kernel(buf);  // each thread passes its own private buffer
}
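For completeness, a minimal sketch of a corrected host-side setup: one device-wide buffer allocated with cudaMalloc and a cooperative launch via cudaLaunchCooperativeKernel (the launch dimensions are illustrative; a real launch must also respect the co-residency limits):

float *buf_d = nullptr;
cudaMalloc(&buf_d, 2 * sizeof(float));   // one buffer shared by the whole grid

void *args[] = { &buf_d };               // kernel arguments, passed by address
// grid.sync() is only legal if all blocks are resident simultaneously
cudaLaunchCooperativeKernel((void*)kernel, dim3(1), dim3(64), args, 0, 0);
cudaDeviceSynchronize();
cudaFree(buf_d);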


Is there a proper CUDA atomicLoad function?

I've run into the issue that the CUDA atomic API does not have an atomicLoad function.
After searching on Stack Overflow, I found the following implementation of a CUDA atomicLoad.
But it looks like this function fails to work in the following example:
#include <cassert>
#include <iostream>
#include <cuda_runtime_api.h>

template <typename T>
__device__ T atomicLoad(const T* addr) {
    const volatile T* vaddr = addr;  // to bypass the cache
    __threadfence();                 // for seq_cst loads; remove for acquire semantics
    const T value = *vaddr;
    // fence to ensure that dependent reads are correctly ordered
    __threadfence();
    return value;
}
__global__ void initAtomic(unsigned& count, const unsigned initValue) {
    count = initValue;
}

__global__ void addVerify(unsigned& count, const unsigned biasAtomicValue) {
    atomicAdd(&count, 1);
    // NOTE: When the following while loop is uncommented, addVerify gets stuck;
    // it never reads the final value of count
    // while (atomicLoad(&count) != (1024 * 1024 + biasAtomicValue)) {
    //     printf("count = %u\n", atomicLoad(&count));
    // }
}
int main() {
    std::cout << "Hello, CUDA atomics!" << std::endl;
    const auto atomicSize = sizeof(unsigned);
    unsigned* datomic = nullptr;
    cudaMalloc(&datomic, atomicSize);
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    constexpr unsigned biasAtomicValue = 11;
    initAtomic<<<1, 1, 0, stream>>>(*datomic, biasAtomicValue);
    addVerify<<<1024, 1024, 0, stream>>>(*datomic, biasAtomicValue);
    cudaStreamSynchronize(stream);
    unsigned countHost = 0;
    cudaMemcpyAsync(&countHost, datomic, atomicSize, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream); // wait for the async copy before checking
    assert(countHost == 1024 * 1024 + biasAtomicValue);
    cudaStreamDestroy(stream);
    return 0;
}
If you uncomment the section with atomicLoad, the application gets stuck ...
Maybe I missed something? Is there a proper way to load a variable that is modified atomically?
P.S.: I know there is a cuda::atomic implementation, but this API is not supported by my hardware.
Since warps execute in lockstep (at least on older architectures), if you put a conditional wait on one thread and the producer on another thread in the same warp, the warp can get stuck in the wait if it is scheduled first. Only architectures with independent thread scheduling can make progress in that situation, so you should query the compute capability before running this. Volta (compute capability 7.0) and onwards is OK.
Also, you are launching 1 million threads and waiting on all of them at once. The GPU may not have enough execution ports/pipelines to keep 1 million threads in flight; it might only work on a GPU with 64k CUDA pipelines (assuming 16 threads in flight per pipeline). Instead of waiting on millions of threads, just spawn sub-kernels from the main kernel when the condition occurs; dynamic parallelism is the key feature here. You should also check the minimum compute capability required for dynamic parallelism (3.5), in case someone is using an ancient NVIDIA card.
The atomicAdd command returns the old value at the target address. If you meant to launch a third kernel exactly once, only after the condition is met, then you can simply check that returned value with an "if" before starting the dynamic parallelism, as in the sketch below.
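For illustration, a minimal sketch of that pattern (the kernel name and the total parameter are mine): the thread that performs the final increment sees old == total - 1 and can trigger the follow-up work exactly once.

__global__ void addAndTrigger(unsigned *count, unsigned total) {
    // atomicAdd returns the value *before* the increment, so exactly one
    // thread in the whole grid observes old == total - 1
    unsigned old = atomicAdd(count, 1u);
    if (old == total - 1u) {
        printf("final increment reached; launch the follow-up kernel here\n");
    }
}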
You are also printing a million times; that is not good for performance, and it may take some time before the text appears in the console output if you have a slow CPU/RAM.
Lastly, you can optimize the performance of atomic operations by running them on shared memory first and going to a global atomic only once per block, as sketched below. This misses the point of the condition if there are more threads than the condition value (assuming an increment of 1 per thread), so it may not be applicable to all algorithms.
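A minimal sketch of that block-level aggregation, assuming one increment per thread (the kernel name is mine):

__global__ void addPerBlock(unsigned *count) {
    __shared__ unsigned blockSum;        // per-block partial sum
    if (threadIdx.x == 0) blockSum = 0;
    __syncthreads();

    atomicAdd(&blockSum, 1u);            // cheap shared-memory atomic
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(count, blockSum);      // one global atomic per block
}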

How to copy dynamically allocated memory from device to host? [duplicate]

The CUDA programming guide states that "Memory allocated via malloc() can be copied using the runtime (i.e., by calling any of the copy memory functions from Device Memory)", but somehow I'm having trouble reproducing this functionality. Code:
#include <cstdio>
#include <cstdlib>

__device__ int* p;

__global__ void allocate_p() {
    p = (int*) malloc(10 * sizeof(int));
    printf("p = %p (seen by GPU)\n", p);
}

int main() {
    cudaError_t err;
    int* localp = (int*) malloc(10 * sizeof(int));
    allocate_p<<<1,1>>>();
    cudaDeviceSynchronize();
    // Getting pointer to device-allocated memory
    int* tmpp = NULL;
    cudaMemcpyFromSymbol(&tmpp, p, sizeof(tmpp));
    printf("p = %p (seen by CPU)\n", tmpp);
    //cudaMalloc((void**)&tmpp, 10 * sizeof(int));
    err = cudaMemcpy(tmpp, localp, 10 * sizeof(int), cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();
    printf(" err:%i %s", (int)err, cudaGetErrorString(err));
    free(localp);
    return 0;
}
crashes with output:
p = 0x601f920 (seen by GPU)
p = 0x601f920 (seen by CPU)
err:11 invalid argument
I gather that the host sees the appropriate address on the device, but somehow it does not like that it comes from malloc().
If I allocate earlier with cudaMalloc((void**)&np, 40); and then pass the pointer np as an argument to the kernel allocate_p, where it is assigned to p (instead of using malloc()), then the code runs fine.
What am I doing wrong / how can malloc()-allocated device memory be used in host-side functions?
As far as I am aware, it isn't possible to copy runtime heap memory using the host API functions. It certainly was not possible in CUDA 4.x and the CUDA 5.0 release candidate has not changed this. The only workaround I can offer is to use a kernel to "gather" final results and stuff them into a device transfer buffer or zero copy memory which can be accessed via the API or directly from the host. You can see an example of this approach in this answer and another question where Mark Harris from NVIDIA confirmed that this is a limitation of the (then) current implementation in the CUDA runtime.
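A sketch of that gather approach (the transfer buffer and kernel name are mine; the heap pointer stays in a __device__ symbol as in the question):

__device__ int *p;   // still points into the device runtime heap

// copy the heap allocation into a buffer the host API is allowed to touch
__global__ void gather(int *transfer, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) transfer[i] = p[i];
}

// host side:
// int *transfer_d;
// cudaMalloc(&transfer_d, n * sizeof(int));
// gather<<<(n + 255) / 256, 256>>>(transfer_d, n);
// cudaMemcpy(host_buf, transfer_d, n * sizeof(int), cudaMemcpyDeviceToHost);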

Allocating device memory for a __global__ function in CUDA

I want to do this program in CUDA.
1. In "main.cpp":
struct Center{
    double * Data;
    int dimension;
};
typedef struct Center Center;

// I allocate a pointer to M Center elements with cudaMalloc, as follows
....
#include "kernel.cu"
....
Center *V_dev;
int M = 100, N = 4;
cudaStatus = cudaMalloc((void**)&V_dev, M*sizeof(Center));
Init<<<1,M>>>(V_dev, M, N); // I always know the dimension N before calling
My "kernel.cu" file is something like this
#include "cuda_runtime.h"
#include"device_launch_parameters.h"
... //other include headers to allow my .cu file to know the Center type definition
__global__ void Init(Center *V, int N, int dimension){
V[threadIdx.x].dimension = dimension;
V[threadIdx.x].Data = (double*)malloc(dimension*sizeof(double));
for(int i=0; i<dimension; i++)
V[threadIdx.x].Data[i] = 0; //For the value, it can be any kind of operation returning a float that i want to be able put here
}
I'm on Visual Studio 2008 and CUDA 5.0. When I build my project, I get this error:
error: calling a __host__ function("malloc") from a __global__ function("Init") is not allowed
I want to know, please, how I can make this work. (I know that malloc and other CPU memory-allocation functions are not allowed for device memory.)
malloc is allowed in device code, but you have to be compiling for a cc 2.0 or greater target GPU.
Adjust your VS project settings to remove any GPU device settings like compute_10,sm_10 and replace them with compute_20,sm_20 or higher to match your GPU. (And, to run that code, your GPU needs to be cc 2.0 or higher.)
On the command line, the equivalent is the compiler parameter -arch=sm_20 (and, again, a GPU which supports it).
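Once it compiles, it is also worth guarding the in-kernel malloc, since the device heap (8 MB by default) can run out; a sketch of the guarded kernel:

__global__ void Init(Center *V, int N, int dimension){
    V[threadIdx.x].dimension = dimension;
    V[threadIdx.x].Data = (double*)malloc(dimension*sizeof(double));
    if (V[threadIdx.x].Data == NULL)  // device-side malloc returns NULL on failure
        return;
    for(int i=0; i<dimension; i++)
        V[threadIdx.x].Data[i] = 0;
}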

A function calls another function in CUDA C++

I have a problem with CUDA programming!
Input is a matrix A(2 x 2).
Output is a matrix A(2 x 2) where every new value is the 3rd power of the old value.
Example:
input:  A = { 2, 2 }    output:  A = { 8, 8 }
            { 2, 2 }                 { 8, 8 }
I have 2 functions in the file CudaCode.CU:
__global__ void Power_of_02(int &a)
{
    a = a*a;
}
//***************
__global__ void Power_of_03(int &a)
{
    int tempt = a;
    Power_of_02(a); // a = a^2
    a = a*tempt;    // a = a^3
}
and the kernel:
__global__ void CudaProcessingKernel(int *dataA) // kernel function
{
    int bx = blockIdx.x;
    int tx = threadIdx.x;
    int tid = bx * XTHREADS + tx;
    if(tid < 16)
    {
        Power_of_03(dataA[tid]);
    }
    __syncthreads();
}
I think it's right, but this error appears: calling a __global__ function("Power_of_02") from a __global__ function("Power_of_03") is only allowed on the compute_35 architecture or above.
Why am I wrong? How do I fix it?
The error is fairly explanatory. A CUDA function decorated with __global__ represents a kernel. Kernels can be launched from host code. On cc 3.5 or higher GPUs, you can also launch a kernel from device code. So if you call a __global__ function from device code (i.e. from another CUDA function that is decorated with __global__ or __device__), then you must be compiling for the appropriate architecture. This is called CUDA dynamic parallelism, and you should read the documentation to learn how to use it, if you want to use it.
When you launch a kernel, whether from host or device code, you must provide a launch configuration, i.e. the information between the triple-chevron notation:
CudaProcessingKernel<<<grid, threads>>>(d_A);
If you want to use your power-of-2 code from another kernel, you will need to call it in a similar, appropriate fashion.
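For illustration only, a device-side launch might look like the sketch below (my own pointer-based variant of the child kernel; it assumes a cc 3.5+ GPU and compilation with -rdc=true):

__global__ void Power_of_02_child(int *a)   // hypothetical pointer-based variant
{
    *a = (*a) * (*a);
}

__global__ void CudaProcessingKernel(int *dataA)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < 16)
    {
        // device-side (dynamic parallelism) launch: one child grid per element
        Power_of_02_child<<<1, 1>>>(&dataA[tid]);
    }
}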
Based on the structure of your code, however, it seems like you can make things work by declaring your power-of-2 and power-of-3 functions as __device__ functions:
__device__ void Power_of_02(int &a)
{
    a = a*a;
}
//***************
__device__ void Power_of_03(int &a)
{
    int tempt = a;
    Power_of_02(a); // a = a^2
    a = a*tempt;    // a = a^3
}
This should probably work for you and perhaps was your intent. Functions decorated with __device__ are not kernels (so they are not callable directly from host code) but are callable directly from device code on any architecture. The programming guide will also help to explain the difference.

Understanding cuda heap memory limitations per thread

This question is about the heap size limitation in CUDA.
Having visited some questions concerning this topic, including this one:
new operator in kernel .. strange behaviour
I've made some tests. Given a kernel as follows:
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda.h>
#include <cuda_runtime.h>

#define CUDA_CHECK( err ) __cudaSafeCall( err, __FILE__, __LINE__ )
#define CUDA_CHECK_ERROR() __cudaCheckError( __FILE__, __LINE__ )

inline void __cudaSafeCall( cudaError err, const char *file, const int line )
{
    if ( cudaSuccess != err )
    {
        fprintf( stderr, "cudaSafeCall() failed at %s:%i : %s\n",
                 file, line, cudaGetErrorString( err ) );
        exit( -1 );
    }
    return;
}

inline void __cudaCheckError( const char *file, const int line )
{
    cudaError err = cudaGetLastError();
    if ( cudaSuccess != err )
    {
        fprintf( stderr, "cudaCheckError() failed at %s:%i : %s\n",
                 file, line, cudaGetErrorString( err ) );
        exit( -1 );
    }
    return;
}

#define NP 900000
__device__ double *temp;
__device__ double *temp2;

__global__
void test(){
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if(i==0){
        temp = new double[NP];
        //temp2 = new double[NP];
    }
    if(i==0){
        for(int k=0;k<NP;k++){
            temp[k] = 1.;
            if(k%1000 == 0){
                printf("%d : %g\n", k, temp[k]);
            }
        }
    }
    if(i==0){
        delete [] temp;
        //delete [] temp2;
    }
}

int main(){
    //cudaDeviceSetLimit(cudaLimitMallocHeapSize, 32*1024*1024);
    //for(int k=0;k<2;k++){
    test<<<ceil((float)NP/512), 512>>>();
    CUDA_CHECK_ERROR();
    //}
    CUDA_CHECK( cudaDeviceSynchronize() ); // flush device printf and catch runtime errors
    return 0;
}
I want to test the heap size limitation.
1. Dynamically allocating one array (temp) from a single thread, with a size roughly over 960,000*sizeof(double) (close to 8 MB, which is the default heap size limit), gives an error: OK. 900,000 works. (Does anyone know how to calculate the true limit?)
2. Raising the heap size limit allows allocating more memory: normal, OK.
3. Back to an 8 MB heap size, allocating one array per thread with TWO threads (i.e., replacing if(i==0) by if(i==0 || i==1)), each of size 900,000*sizeof(double), fails. But 450,000*sizeof(double) each works. Still OK.
4. Here comes my problem: allocating TWO arrays (temp and temp2) from ONE thread (thread 0), each of size 900,000*sizeof(double), works too, but it should not? Indeed, when I try to write to both arrays, it fails. Does anyone have an idea why allocation behaves differently with two arrays from one thread rather than one array from each of two threads?
EDIT: another test, which I find interesting for those who, like me, are learning the usage of the heap:
5. Executing the kernel two times, with one array of size 900,000*sizeof(double) allocated by the single thread 0, works if there is a delete. If the delete is omitted, the second launch fails, but the first call executes.
EDIT 2: how can a device-wide variable be allocated once (not from the host, using dynamic allocation in device code) yet be writable by all threads?
Probably you are not testing for a returned null pointer on the new operation, which is a valid method in C++ for the operator to report a failure.
When I modify your code as follows, I get the message "second new failed":
#include <stdio.h>
#define NP 900000
__device__ double *temp;
__device__ double *temp2;

__global__
void test(){
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if(i==0){
        temp = new double[NP];
        if (temp == 0) {printf("first new failed\n"); return;}
        temp2 = new double[NP];
        if (temp2 == 0) {printf("second new failed\n"); return;}
    }
    if(i==0){
        for(int k=0;k<NP;k++){
            temp[k] = 1.;
            if(k%1000 == 0){
                printf("%d : %g\n", k, temp[k]);
            }
        }
    }
    if(i==0){
        delete [] temp;
        delete [] temp2;
    }
}

int main() {
    test<<<1,1>>>();
    cudaDeviceSynchronize();
    return 0;
}
It's convenient if you provide a complete, compilable code, for others to work with, just as I have.
For your first EDIT question, it's not surprising that the second new will work if the first is deleted. The first allocates nearly all of the 8 MB available. If you delete that allocation, then the second one will succeed. Referring to the documentation, we see that memory allocated dynamically in this fashion lives for the entire lifetime of the CUDA context, or until a corresponding delete operation is performed (i.e. not just for a single kernel call; the completion of the kernel does not necessarily free the allocation).
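For reference, the commented-out line in the question's main is how that 8 MB default is raised; a sketch (the 32 MB figure is arbitrary), noting that the limit must be set before any kernel using the device heap has launched:

size_t heap = 0;
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 32*1024*1024); // before the first launch
cudaDeviceGetLimit(&heap, cudaLimitMallocHeapSize);
printf("device heap limit: %zu bytes\n", heap);            // verify what was granted
test<<<ceil((float)NP/512), 512>>>();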
For your second EDIT question, you are already demonstrating a method, using your __device__ double *temp; pointer, by which one thread can allocate storage which all threads can access. You will have a problem across blocks, however, because there is no guarantee of synchronization or execution order amongst blocks, so if you allocate from thread 0 in block 0, that is only useful if block 0 executes before other blocks. You could come up with a complicated scheme to check whether the variable allocation was already done (perhaps by testing the pointer for NULL, and also perhaps using atomics), but it creates fragile code. It's better to plan your global allocations ahead of time and allocate accordingly from the host, as sketched below.
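A minimal sketch of that plan-ahead alternative: keep the same __device__ pointer, but back it with host-allocated memory so every block can rely on it (the kernel use_temp is mine; note the kernel must not delete this pointer):

#define NP 900000
__device__ double *temp;      // same device-wide pointer as before

__global__ void use_temp(){   // hypothetical kernel: any thread in any block may use temp
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i < NP) temp[i] = 1.;
}

int main(){
    double *buf = nullptr;
    cudaMalloc(&buf, NP * sizeof(double));        // planned, host-side allocation
    cudaMemcpyToSymbol(temp, &buf, sizeof(buf));  // publish the pointer to all blocks
    use_temp<<<(NP + 511)/512, 512>>>();
    cudaDeviceSynchronize();
    cudaFree(buf);                                // freed by the host, not by delete in a kernel
    return 0;
}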