cuda - memory allocation crashes

OK, I am trying to allocate an array of structs on the GPU, and it crashes (gives the "stopped working" message).
This is the struct:
typedef struct point_t {
    int id;
    float x, y;
} point;
this is a part of the cuda code:
cudaError_t d_LoadPoints(point* points, int n, int chunkSize){
    // Error code to check return values for CUDA calls
    cudaError_t err = cudaSuccess;
    size_t nBytes = n * sizeof(point);
    // Allocate the device input points array
    point* d_points;
    err = cudaMalloc((void**)&d_points, nBytes);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector points (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    err = cudaMemcpy(d_points, points, nBytes, cudaMemcpyHostToDevice);
    puts("memory allocated successfully");
    return err; // the function is declared cudaError_t, so return the last status
}
I tried printing the first elements of the points array, as well as n and chunkSize, and they come out correctly.
This is the point it seems to crash at (I disabled the rest of the code).
It crashes regardless of the debug prints.
The only thing I can think of is the size.
n is 250,000 and chunkSize is 64,000, which I was planning to allocate across 125 blocks with 512 threads each.
I have no idea if this is a good idea or not, but this is a side topic since I can't even reach the kernel call.
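As a side note on that geometry, here is a hedged sketch of what the kernel stage could look like once the copy succeeds. The kernel name d_ProcessPoints and its body are hypothetical placeholders, and the launch fragment reuses d_points and chunkSize from the function above:
__global__ void d_ProcessPoints(point* d_points, int n){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n){
        // ... per-point work would go here ...
    }
}
// host side: chunkSize = 64,000 = 125 blocks * 512 threads
int threadsPerBlock = 512;
int blocks = (chunkSize + threadsPerBlock - 1) / threadsPerBlock; // = 125
d_ProcessPoints<<<blocks, threadsPerBlock>>>(d_points, chunkSize);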

Restarting Visual Studio fixed the problem.

Related

CUDA cudaLaunchCooperativeKernel and grid synchronization

I am trying to understand how to synchronize a grid of threads with cudaLaunchCooperativeKernel.
https://developer.nvidia.com/blog/cooperative-groups/
I have a very simple kernel where two threads update an array, sync, and then both print the array:
#include <cooperative_groups.h>
#include <cassert>
#include <cstdio>
namespace cg = cooperative_groups;
__global__ void kernel(float *buf){
    cg::grid_group grid = cg::this_grid();
    if(grid.thread_rank() < 2)
        buf[grid.thread_rank()] = 10 + grid.thread_rank();
    assert(grid.is_valid()); // ok!
    grid.sync();
    if(grid.thread_rank() < 2)
        printf("thread=%d: %g %g\n", (int)grid.thread_rank(), buf[0], buf[1]);
}
Instead of printing values (10,11) twice, I get:
thread=0: 10 0
thread=1: 0 11
All CUDA calls were fine, cuda-memcheck is happy, and my card is a "GeForce RTX 2060 SUPER", which does support cooperative kernel launches; I checked with:
int supportsCoopLaunch = 0;
if( cudaSuccess != cudaDeviceGetAttribute(&supportsCoopLaunch, cudaDevAttrCooperativeLaunch, dev) )
    throw std::runtime_error("Cooperative Launch is not supported on this machine configuration.");
I am confused... Why don't I see the synchronization?
This test is incorrect:
int supportsCoopLaunch = 0;
if( cudaSuccess != cudaDeviceGetAttribute(&supportsCoopLaunch, cudaDevAttrCooperativeLaunch, dev) )
    throw std::runtime_error("Cooperative Launch is not supported on this machine configuration.");
The support (or lack thereof) is not communicated via the cudaError_t return value of the function; instead, it is communicated via the value placed in the supportsCoopLaunch variable. You would want to do something like:
int supportsCoopLaunch = 0;
cudaDeviceGetAttribute(&supportsCoopLaunch, cudaDevAttrCooperativeLaunch, dev);
if( supportsCoopLaunch != 1)
    throw std::runtime_error("Cooperative Launch is not supported on this machine configuration.");
I found the bug. The actual code was something like this:
__device__ void kernel(float *buf){ /* see the function body above */ }
__global__ void parent_kernel(){
    float buf[2]; // per-thread local buffer!!! The kernel will not 'sync' it!
    kernel(buf);  // different threads get different buffers
}
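For completeness, a minimal sketch of a corrected version: one device-global buffer handed to every thread, launched through the cooperative-launch API. The single-block geometry here is just the smallest configuration on which grid.sync() is legal:
#include <cooperative_groups.h>
#include <cstdio>
namespace cg = cooperative_groups;
__global__ void kernel(float *buf){
    cg::grid_group grid = cg::this_grid();
    if(grid.thread_rank() < 2)
        buf[grid.thread_rank()] = 10 + grid.thread_rank();
    grid.sync(); // legal: the kernel is launched cooperatively below
    if(grid.thread_rank() < 2)
        printf("thread=%d: %g %g\n", (int)grid.thread_rank(), buf[0], buf[1]);
}
int main(){
    float *d_buf;
    cudaMalloc(&d_buf, 2 * sizeof(float)); // one buffer shared by the whole grid
    void *args[] = { &d_buf };
    // cooperative kernels must be launched through this API, not <<<...>>>
    cudaLaunchCooperativeKernel((void*)kernel, dim3(1), dim3(2), args, 0, 0);
    cudaDeviceSynchronize();
    cudaFree(d_buf);
    return 0;
}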

CUDA/C - Using malloc in kernel functions gives strange results

I'm new to CUDA/C and new to stack overflow. This is my first question.
I'm trying to allocate memory dynamically in a kernel function, but the results are unexpected.
I read that using malloc() in a kernel can lower performance a lot, but I need it anyway, so I first tried with a simple int ** array just to test the possibility; later I'll actually need to allocate more complex structs.
In my main I used cudaMalloc() to allocate the space for the array of int *, and then I used malloc() in the kernel to have every thread allocate the array for its index of the outer array. I then used a second, single-thread kernel to check the result, but it doesn't always work.
Here's the main code:
#include <stdio.h>
#include <stdlib.h>
#define N_CELLE 1024*2
#define L_CELLE 512
extern "C" {
int main(int argc, char **argv) {
    int *result = (int *)malloc(sizeof(int));
    int *d_result;
    int size_numbers = N_CELLE * sizeof(int *);
    int **d_numbers;
    cudaMalloc((void **)&d_numbers, size_numbers);
    cudaMalloc((void **)&d_result, sizeof(int)); // holds a single int result
    kernel_one<<<2, 1024>>>(d_numbers);
    cudaDeviceSynchronize();
    kernel_two<<<1, 1>>>(d_numbers, d_result);
    cudaMemcpy(result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", *result);
    cudaFree(d_numbers);
    cudaFree(d_result);
    free(result);
    return 0;
}
}
I used extern "C" because I couldn't compile while importing my header, which is not used in this example code. I pasted it since I don't know whether it is relevant or not.
This is kernel_one code:
__global__ void kernel_one(int **d_numbers) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    d_numbers[i] = (int *)malloc(L_CELLE * sizeof(int)); // per-thread allocation from the device heap
    for(int j = 0; j < L_CELLE; j++)
        d_numbers[i][j] = 1;
}
And this is kernel_two code:
__global__ void kernel_two(int **d_numbers, int *d_result) {
    int temp = 0;
    for(int i = 0; i < N_CELLE; i++) {
        for(int j = 0; j < L_CELLE; j++)
            temp += d_numbers[i][j];
    }
    *d_result = temp;
}
Everything works fine (i.e., the count is correct) as long as I allocate no more than 1024*2*512 ints in total on the device. For example, if I #define N_CELLE 1024*4 the program starts giving "random" results, such as negative numbers.
Any idea of what the problem could be?
Thanks anyone!
In-kernel memory allocation draws from a statically sized runtime heap. At larger sizes, you are exceeding the size of that heap, and your two kernels are then attempting to read from and write to uninitialised memory. This produces a runtime error on the device and renders the results invalid. You would already know this if you either added correct API error checking on the host side or ran your code with the cuda-memcheck utility.
The solution is to ensure that the heap size is set to something appropriate before trying to run a kernel. Adding something like this:
size_t heapsize = sizeof(int) * size_t(N_CELLE) * size_t(2*L_CELLE);
cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapsize);
to your host code before any other API calls, should solve the problem.
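For illustration, a hedged sketch of the question's setup with this fix applied, plus an in-kernel null check (it reuses the question's names; the 2*L_CELLE factor simply leaves slack for per-allocation overhead):
#include <cstdio>
#define N_CELLE (1024*2)
#define L_CELLE 512
__global__ void kernel_one(int **d_numbers) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    d_numbers[i] = (int *)malloc(L_CELLE * sizeof(int));
    if (d_numbers[i] == NULL) { // in-kernel malloc returns NULL on heap exhaustion
        printf("thread %d: malloc failed\n", i);
        return;
    }
    for (int j = 0; j < L_CELLE; j++)
        d_numbers[i][j] = 1;
}
int main() {
    size_t heapsize = sizeof(int) * size_t(N_CELLE) * size_t(2 * L_CELLE);
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapsize); // must run before the heap is first used
    int **d_numbers;
    cudaMalloc((void **)&d_numbers, N_CELLE * sizeof(int *));
    kernel_one<<<N_CELLE / 1024, 1024>>>(d_numbers);
    cudaError_t err = cudaDeviceSynchronize();
    printf("%s\n", cudaGetErrorString(err));
    cudaFree(d_numbers);
    return 0;
}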
I don't know anything about CUDA, but these are severe bugs:
You cannot convert from int** to void**; they are not compatible types. Casting doesn't solve the problem, it hides it.
&d_numbers gives the address of a pointer to a pointer, which is wrong; it is of type int***.
Both of the above bugs result in undefined behavior. If your program somehow seems to work in some conditions, that's just pure (bad) luck.

How to copy dynamically allocated memory from device to host? [duplicate]

The CUDA programming guide states that "Memory allocated via malloc() can be copied using the runtime (i.e., by calling any of the copy memory functions from Device Memory)", but somehow I'm having trouble reproducing this functionality. Code:
#include <cstdio>
#include <cstdlib>
__device__ int* p;
__global__ void allocate_p() {
    p = (int*) malloc(10 * sizeof(int)); // 40 bytes on the device runtime heap
    printf("p = %p (seen by GPU)\n", p);
}
int main() {
    cudaError_t err;
    int* localp = (int*) malloc(10 * sizeof(int));
    allocate_p<<<1,1>>>();
    cudaDeviceSynchronize();
    // Getting pointer to device-allocated memory
    int* tmpp = NULL;
    cudaMemcpyFromSymbol(&tmpp, p, sizeof(tmpp));
    printf("p = %p (seen by CPU)\n", tmpp);
    //cudaMalloc((void**)&tmpp, 40);
    err = cudaMemcpy(tmpp, localp, 40, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();
    printf(" err:%i %s", (int)err, cudaGetErrorString(err));
    free(localp); // allocated with malloc(), so free() rather than delete
    return 0;
}
crashes with output:
p = 0x601f920 (seen by GPU)
p = 0x601f920 (seen by CPU)
err:11 invalid argument
I gather that the host sees the appropriate address on the device, but somehow does not like it coming from malloc().
If I allocate earlier with cudaMalloc((void**)&np, 40); and then pass the pointer np as an argument to the kernel allocate_p, where it is assigned to p (instead of using malloc()), then the code runs fine.
What am I doing wrong? How do we use malloc()-allocated device memory in host-side functions?
As far as I am aware, it isn't possible to copy runtime heap memory using the host API functions. It certainly was not possible in CUDA 4.x, and the CUDA 5.0 release candidate has not changed this. The only workaround I can offer is to use a kernel to "gather" final results and stuff them into a device transfer buffer or zero-copy memory, either of which can be accessed via the API or directly from the host. You can see an example of this approach in this answer, and in another question Mark Harris from NVIDIA confirmed that this is a limitation of the (then) current implementation in the CUDA runtime.
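By way of illustration, here is a hedged sketch of that gather workaround (my own staging-buffer version, not code from the linked answer): a second kernel copies the runtime-heap allocation into a cudaMalloc'ed buffer, which the host API can then copy normally:
#include <cstdio>
__device__ int *p;
__global__ void allocate_and_fill() {
    p = (int *)malloc(10 * sizeof(int));
    for (int i = 0; i < 10; i++) p[i] = i;
}
__global__ void gather(int *staging) {
    for (int i = 0; i < 10; i++) staging[i] = p[i]; // device-side copy out of the heap
    free(p);
}
int main() {
    int host[10];
    int *d_staging;
    cudaMalloc((void **)&d_staging, 10 * sizeof(int)); // API-visible transfer buffer
    allocate_and_fill<<<1,1>>>();
    gather<<<1,1>>>(d_staging);
    cudaMemcpy(host, d_staging, 10 * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 10; i++) printf("%d ", host[i]);
    printf("\n");
    cudaFree(d_staging);
    return 0;
}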

External call of a class method in a kernel

I have a class FPlan that has a number of methods, such as perturb and packing.
__host__ __device__ void Perturb_action(FPlan *dfp){
    dfp->perturb();
    dfp->packing();
}
__global__ void Vector_Perturb(FPlan **dfp, int n){
    int i = threadIdx.x; // note: threadIdx, not threadIx
    if(i < n) Perturb_action(dfp[i]);
}
in main:
FPlan **fp_vec;
fp_vec = (FPlan**)malloc(VEC_SIZE*sizeof(FPlan*));
//initialize the vec
for(int i=0; i<VEC_SIZE; i++)
    fp_vec[i] = &fp;
//fp of type FPlan that is initialized
int v_sz = sizeof(fp_vec);
double test = fp_vec[0]->getCost();
printf("the cost before perturb %f\n", test);
FPlan **value;
cudaMalloc(&value, v_sz);
cudaMemcpy(value, &fp_vec, v_sz, cudaMemcpyHostToDevice);
//call kernel
dim3 threadsPerBlock(VEC_SIZE);
dim3 numBlocks(1);
Vector_Perturb<<<numBlocks, threadsPerBlock>>>(value, VEC_SIZE);
cudaMemcpy(fp_vec, value, v_sz, cudaMemcpyDeviceToHost);
test = fp_vec[0]->getCost();
printf("the cost after perturb %f\n", test);
test = fp_vec[1]->getCost();
printf("the cost after perturb %f\n", test);
Before the perturb, the printf for fp_vec[0] shows a cost of 0.8.
After the perturb, fp_vec[0] shows the value inf and fp_vec[1] shows 0.8.
The expected output after the perturbation should be something like fp_vec[0] = 0.7 and fp_vec[1] = 0.9. I want to apply these perturbations to an array of type FPlan.
What am I missing? Is calling an external function supported in CUDA?
This seems to be a common problem these days:
Consider the following code:
#include <stdio.h>
#include <stdlib.h>
int main() {
    int* arr = (int*) malloc(100);
    printf("sizeof(arr) = %zu\n", sizeof(arr)); // %zu: sizeof yields a size_t
    return 0;
}
What is the expected output? 100? No, it's 4 (at least on a 32-bit machine). sizeof() returns the size of the type of a variable, not the allocated size of an array.
int v_sz = sizeof(fp_vec);
double test = fp_vec[0]->getCost();
printf("the cost before perturb %f\n", test);
FPlan **value;
cudaMalloc(&value, v_sz);
cudaMemcpy(value, &fp_vec, v_sz, cudaMemcpyHostToDevice);
You are allocating 4 (or 8) bytes on the device and copying 4 (or 8) bytes. The result is undefined (and may well be garbage every time).
Besides that, you should do proper error checking of your CUDA calls.
Have a look: What is the canonical way to check for errors using the CUDA runtime API?
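Returning to the size bug itself, the corrected fragment might look like this (a hedged sketch reusing the question's names; note that even with the size fixed, the copied FPlan* values still point at host objects, which device code cannot safely dereference):
int v_sz = VEC_SIZE * sizeof(FPlan*); // the array's real size, not the size of a pointer
FPlan **value;
cudaMalloc((void**)&value, v_sz);
cudaMemcpy(value, fp_vec, v_sz, cudaMemcpyHostToDevice); // fp_vec itself, not &fp_vec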

Understanding CUDA heap memory limitations per thread

This question is about heap size limitation in cuda.
Having visited some questions concerning this topic, including this one:
new operator in kernel .. strange behaviour
I've made some tests. Given a kernel as follows:
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda.h>
#include <cuda_runtime.h>
#define CUDA_CHECK( err ) __cudaSafeCall( err, __FILE__, __LINE__ )
#define CUDA_CHECK_ERROR() __cudaCheckError( __FILE__, __LINE__ )
inline void __cudaSafeCall( cudaError err, const char *file, const int line )
{
    if ( cudaSuccess != err )
    {
        fprintf( stderr, "cudaSafeCall() failed at %s:%i : %s\n",
                 file, line, cudaGetErrorString( err ) );
        exit( -1 );
    }
}
inline void __cudaCheckError( const char *file, const int line )
{
    cudaError err = cudaGetLastError();
    if ( cudaSuccess != err )
    {
        fprintf( stderr, "cudaCheckError() failed at %s:%i : %s\n",
                 file, line, cudaGetErrorString( err ) );
        exit( -1 );
    }
}
#define NP 900000
__device__ double *temp;
__device__ double *temp2;
__global__
void test(){
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if(i==0){
        temp = new double[NP];
        //temp2 = new double[NP];
    }
    if(i==0){
        for(int k=0; k<NP; k++){
            temp[k] = 1.;
            if(k%1000 == 0){
                printf("%d : %g\n", k, temp[k]);
            }
        }
    }
    if(i==0){
        delete [] temp; // array form, matching new double[NP]
        //delete [] temp2;
    }
}
int main(){
    //cudaDeviceSetLimit(cudaLimitMallocHeapSize, 32*1024*1024);
    //for(int k=0; k<2; k++){
    test<<<ceil((float)NP/512), 512>>>();
    CUDA_CHECK_ERROR();
    CUDA_CHECK( cudaDeviceSynchronize() ); // ensure the kernel has finished before exit
    //}
    return 0;
}
I want to test the heap size limitation.
1. Dynamically allocating one array (temp) with one thread, whose size is roughly over 960,000*sizeof(double) (close to 8 MB, which is the default limit of the heap size), gives an error: OK. 900,000 works. (Does someone know how to calculate the true limit?)
2. Raising the heap size limit allows more memory to be allocated: normal, OK.
3. Back at an 8 MB heap size, allocating one array per thread with TWO threads (so, replacing if(i==0) by if(i==0 || i==1)), each of 900,000*sizeof(double), fails. But 450,000*sizeof(double) each works. Still OK.
4. Here comes my problem: allocating TWO arrays with ONE thread (so, temp and temp2 for thread 0), each of 900,000*sizeof(double), works too, but it should not? Indeed, when I try to write to both arrays, it fails. Does anyone have an idea why allocation behaves differently with two arrays in one thread rather than one array in each of two threads?
EDIT: another test, which I find interesting for those who, like me, are learning the usage of the heap:
5. Executing the kernel two times, with one array of size 900,000*sizeof(double) allocated by the single thread 0, works if the delete is present. If the delete is omitted, the second launch fails, but the first call still executes.
EDIT 2: how do you allocate a device-wide variable once, writable by all threads, not from the host but with dynamic allocation in device code?
Probably you are not testing for a returned null pointer on the new operation, which is how the operator reports a failure in CUDA device code.
When I modify your code as follows, I get the message "second new failed":
#include <stdio.h>
#define NP 900000
__device__ double *temp;
__device__ double *temp2;
__global__
void test(){
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if(i==0){
        temp = new double[NP];
        if (temp == 0) {printf("first new failed\n"); return;}
        temp2 = new double[NP];
        if (temp2 == 0) {printf("second new failed\n"); return;}
    }
    if(i==0){
        for(int k=0; k<NP; k++){
            temp[k] = 1.;
            if(k%1000 == 0){
                printf("%d : %g\n", k, temp[k]);
            }
        }
    }
    if(i==0){
        delete [] temp;
        delete [] temp2;
    }
}
int main() {
    test<<<1,1>>>();
    cudaDeviceSynchronize();
    return 0;
}
It's convenient if you provide complete, compilable code for others to work with, just as I have.
For your first EDIT question, it's not surprising that the second new works if the first allocation is deleted. The first allocates nearly all of the 8 MB available; if you delete that allocation, the second one will succeed. Referring to the documentation, we see that memory allocated dynamically in this fashion lives for the entire lifetime of the CUDA context, or until a corresponding delete operation is performed (i.e., not just for a single kernel call; the completion of the kernel does not free the allocation).
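A small self-contained sketch of that lifetime rule, with hypothetical kernel names: the allocation made in one launch is still valid in the next, until a delete runs:
#include <cstdio>
__device__ double *temp;
__global__ void alloc_once() { temp = new double[1000]; temp[0] = 1.; }
__global__ void use_later() { printf("still valid: %g\n", temp[0]); }
__global__ void release() { delete [] temp; }
int main() {
    alloc_once<<<1,1>>>();
    use_later<<<1,1>>>(); // the heap allocation persists between launches
    release<<<1,1>>>();   // without this, the heap stays consumed
    cudaDeviceSynchronize();
    return 0;
}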
For your second EDIT question, you are already demonstrating the method: via your __device__ double *temp; pointer, one thread can allocate storage which all threads can access. You will have a problem across blocks, however, because blocks have no guaranteed execution or synchronization order, so an allocation made by thread 0 of block 0 is only useful if block 0 executes before the other blocks. You could devise a complicated scheme to check whether the allocation has already been done (perhaps by testing the pointer for NULL, perhaps with atomics), but that creates fragile code. It's better to plan your global allocations ahead of time and allocate accordingly from the host.
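As an illustration of that last suggestion, a hedged sketch: allocate once from the host with cudaMalloc and publish the pointer through the __device__ symbol, so every thread in every block can use it with no ordering concerns (it reuses the question's temp and NP):
#include <cstdio>
#define NP 900000
__device__ double *temp;
__global__ void test() {
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i < NP) temp[i] = 1.; // safe from any thread in any block
}
int main() {
    double *d_buf;
    cudaMalloc(&d_buf, NP * sizeof(double));         // host-side allocation: no device heap limit
    cudaMemcpyToSymbol(temp, &d_buf, sizeof(d_buf)); // publish it as the device-global pointer
    test<<<(NP + 511)/512, 512>>>();
    cudaError_t err = cudaDeviceSynchronize();
    printf("%s\n", cudaGetErrorString(err));
    cudaFree(d_buf);
    return 0;
}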