Deleting cudaMalloc-allocated memory inside a kernel - CUDA

I want to delete an array allocated with cudaMalloc inside a kernel using delete[], but the memory checker reports an access violation; the array stays in memory and the kernel continues to execute.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
__global__ void kernel(int *a)
{
int *b = new int[10];
delete[] b; // no violation
delete[] a; // Memory Checker detects access violation.
}
int main()
{
int *d_a;
cudaMalloc(&d_a, 10 * sizeof(int));
kernel<<<1, 1>>>(d_a);
return 0;
}
What is the difference between memory allocated by cudaMalloc and new in device code?
Is it possible to delete memory allocated by cudaMalloc in device code?
Thanks

cudaMalloc in host code and new (or malloc) in device code allocate out of logically separate areas. The two areas are not generally interoperable from an API standpoint.
No, it is not possible to free memory allocated with cudaMalloc from device code; release it with cudaFree on the host.
You may wish to read the documentation. The description given there for in-kernel malloc and free generally applies to in-kernel new and delete as well.
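To make the pairing concrete, here is a minimal sketch (my own example, not from the original answer) showing the two heaps side by side: memory from cudaMalloc is released with cudaFree on the host, while memory from in-kernel new is released with delete[] inside device code; the kernel may read and write the cudaMalloc'ed buffer, but must not delete it.
__global__ void kernel(int *a)
{
    a[0] = 42;               // using the cudaMalloc'ed buffer in the kernel is fine
    int *b = new int[10];    // allocated on the device heap
    b[0] = a[0];
    delete[] b;              // paired with in-kernel new: OK
    // delete[] a;           // NOT allowed: a belongs to the cudaMalloc heap
}

int main()
{
    int *d_a;
    cudaMalloc(&d_a, 10 * sizeof(int));
    kernel<<<1, 1>>>(d_a);
    cudaDeviceSynchronize();
    cudaFree(d_a);           // paired with host-side cudaMalloc: OK
    return 0;
}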

Related

Can we copy "normal" GPU memory to "unified" memory?

We have two GPU memories: one is allocated with cuMalloc as normal device memory, the other is allocated with cuMallocManaged as unified memory. Is it possible to copy between them? And if we use the driver API, which direction should I use?
float *normalMem, *unifiedMem;
cuMalloc(&normalMem, 100);
cuMallocManaged(&unifiedMem, 100);
cuMemcpyD2D(unifiedMem, normalMem, 100); // ? D2D? or D2H? or else?
Yes, you can. Look at the following code, for instance.
It declares a normal device pointer, a managed pointer, and a host pointer, each holding 100 floats.
It initializes the values through the host pointer and then copies them with cudaMemcpy to the normal device pointer.
The values are then copied from the device pointer to the managed pointer with a device-to-device cudaMemcpy.
The managed pointer is used in a kernel to show that the values have been copied through both buffers.
I think the code is pretty self-explanatory.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void test(float *d_ptr)
{
    // Print the values seen through the managed pointer.
    for (int i = 0; i < 100; i++)
        printf("%f \n", d_ptr[i]);
    printf("\n");
}

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
    size_t size = sizeof(float) * 100;
    float *h_p = (float *)malloc(size);
    float *d_p, *dm_p;                 // note: both must be declared as pointers
    cudaMalloc(&d_p, size);            // normal device memory
    cudaMallocManaged(&dm_p, size);    // managed (unified) memory

    for (int i = 0; i < 100; i++)
        h_p[i] = 2 * (float)i;

    // host -> device, then device -> managed
    cudaMemcpy(d_p, h_p, size, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();
    cudaMemcpy(dm_p, d_p, size, cudaMemcpyDeviceToDevice);
    cudaDeviceSynchronize();

    test<<<1, 1>>>(dm_p);
    cudaDeviceSynchronize();

    cudaFree(dm_p);
    cudaFree(d_p);
    free(h_p);
    return 0;
}
Remember to read the Unified Memory access rules.
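Since the question asks specifically about the driver API, here is a hedged sketch of the equivalent using driver calls (my assumption of the details, not part of the original answer): with unified addressing, a managed allocation is valid as a device pointer, so a device-to-device copy (cuMemcpyDtoD, or the untyped cuMemcpy) moves the data from the ordinary device allocation into the managed one.
#include <cuda.h>
#include <cstdio>

int main()
{
    cuInit(0);
    CUdevice dev;
    CUcontext ctx;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    size_t size = 100 * sizeof(float);
    CUdeviceptr normalMem, unifiedMem;
    cuMemAlloc(&normalMem, size);                                 // ordinary device memory
    cuMemAllocManaged(&unifiedMem, size, CU_MEM_ATTACH_GLOBAL);   // managed (unified) memory

    cuMemsetD32(normalMem, 0, 100);            // give the source some defined contents
    cuMemcpyDtoD(unifiedMem, normalMem, size); // device-to-device direction
    cuCtxSynchronize();

    // The managed buffer is directly readable from the host after the sync.
    printf("first element: %f\n", ((float *)unifiedMem)[0]);

    cuMemFree(normalMem);
    cuMemFree(unifiedMem);
    cuCtxDestroy(ctx);
    return 0;
}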

How to copy dynamically allocated memory from device to host? [duplicate]

The CUDA programming guide states that "Memory allocated via malloc() can be copied using the runtime (i.e., by calling any of the copy memory functions from Device Memory)", but somehow I'm having trouble reproducing this functionality. Code:
#include <cstdio>
#include <cstdlib>

__device__ int *p;

__global__ void allocate_p() {
    p = (int *)malloc(10);
    printf("p = %p (seen by GPU)\n", p);
}

int main() {
    cudaError_t err;
    int *localp = (int *)malloc(10);

    allocate_p<<<1, 1>>>();
    cudaDeviceSynchronize();

    // Getting pointer to device-allocated memory
    int *tmpp = NULL;
    cudaMemcpyFromSymbol(&tmpp, p, 4);
    printf("p = %p (seen by CPU)\n", tmpp);

    // cudaMalloc((void**)&tmpp, 40);
    err = cudaMemcpy(tmpp, localp, 40, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();
    printf(" err:%i %s", (int)err, cudaGetErrorString(err));

    delete localp;
    return 0;
}
crashes with output:
p = 0x601f920 (seen by GPU)
p = 0x601f920 (seen by CPU)
err:11 invalid argument
I gather that the host sees the appropriate address on the device, but somehow does not like it coming from malloc().
If I allocate earlier by cudaMalloc((void**)&np, 40); and then pass the pointer np as argument to kernel allocate_p, where it will be assigned to p (instead of malloc()), then the code runs fine.
What am I doing wrong / how do we use malloc() allocated device-memory in host-side functions?
As far as I am aware, it isn't possible to copy runtime heap memory using the host API functions. It certainly was not possible in CUDA 4.x and the CUDA 5.0 release candidate has not changed this. The only workaround I can offer is to use a kernel to "gather" final results and stuff them into a device transfer buffer or zero copy memory which can be accessed via the API or directly from the host. You can see an example of this approach in this answer and another question where Mark Harris from NVIDIA confirmed that this is a limitation of the (then) current implementation in the CUDA runtime.
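A minimal sketch of the gather workaround described above, reusing the names from the question (the extra fill and free kernels are my additions for illustration): a kernel copies the runtime-heap data into a buffer that was allocated with cudaMalloc, and that buffer can then be copied back with cudaMemcpy.
#include <cstdio>

__device__ int *p;

__global__ void allocate_and_fill_p() {
    p = (int *)malloc(10 * sizeof(int));
    for (int i = 0; i < 10; i++)
        p[i] = i * i;
}

// Gather kernel: copy from the runtime heap into a cudaMalloc'ed buffer
// that the host API is allowed to touch.
__global__ void gather_p(int *out) {
    for (int i = 0; i < 10; i++)
        out[i] = p[i];
}

__global__ void free_p() {
    free(p);
}

int main() {
    int *d_out;
    cudaMalloc(&d_out, 10 * sizeof(int));

    allocate_and_fill_p<<<1, 1>>>();
    gather_p<<<1, 1>>>(d_out);

    int h_out[10];
    cudaMemcpy(h_out, d_out, 10 * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 10; i++)
        printf("%d ", h_out[i]);
    printf("\n");

    free_p<<<1, 1>>>();
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}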

Making CUB BlockRadixSort entirely on-chip?

I am reading the CUB documentation and examples:
#include <cub/cub.cuh>   // or equivalently <cub/block/block_radix_sort.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockRadixSort for 128 threads owning 4 integer items each
    typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;

    // Allocate shared memory for BlockRadixSort
    __shared__ typename BlockRadixSort::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_keys[4];
    ...

    // Collectively sort the keys
    BlockRadixSort(temp_storage).Sort(thread_keys);
    ...
}
In the example, each thread has 4 keys. It looks like thread_keys will be allocated in local memory, which is off-chip. If I only have 1 key per thread, could I declare "int thread_key;" and keep this variable in a register only?
BlockRadixSort(temp_storage).Sort() takes a pointer to the keys as a parameter. Does that mean the keys have to be in global memory?
I would like to use this code, but I want each thread to hold one key in a register and keep the keys on-chip (in registers/shared memory) after they are sorted.
Thanks in advance!
You can do this using shared memory (which will keep it "on-chip"). I'm not sure how to do it using strictly registers without de-constructing the BlockRadixSort object.
Here's an example that uses shared memory to hold the initial data to be sorted and the final sorted results. The sample is mostly set up for one data element per thread, since that seems to be what you are asking for. It's not difficult to extend it to multiple elements per thread, and I have put most of the plumbing in place to do that, with the exception of the data synthesis and debug printouts:
#include <cub/cub.cuh>
#include <stdio.h>

#define nTPB 32
#define ELEMS_PER_THREAD 1

// Block-sorting CUDA kernel (nTPB threads each owning ELEMS_PER_THREAD integers)
__global__ void BlockSortKernel()
{
    __shared__ int my_val[nTPB * ELEMS_PER_THREAD];
    using namespace cub;

    // Specialize BlockRadixSort collective types
    typedef BlockRadixSort<int, nTPB, ELEMS_PER_THREAD> my_block_sort;

    // Allocate shared memory for collectives
    __shared__ typename my_block_sort::TempStorage sort_temp_stg;

    // need to extend synthetic data for ELEMS_PER_THREAD > 1
    my_val[threadIdx.x * ELEMS_PER_THREAD] = (threadIdx.x + 5) % nTPB; // synth data
    __syncthreads();
    printf("thread %d data = %d\n", threadIdx.x, my_val[threadIdx.x * ELEMS_PER_THREAD]);

    // Collectively sort the keys
    my_block_sort(sort_temp_stg).Sort(*static_cast<int(*)[ELEMS_PER_THREAD]>(static_cast<void *>(my_val + (threadIdx.x * ELEMS_PER_THREAD))));
    __syncthreads();

    printf("thread %d sorted data = %d\n", threadIdx.x, my_val[threadIdx.x * ELEMS_PER_THREAD]);
}

int main(){
    BlockSortKernel<<<1, nTPB>>>();
    cudaDeviceSynchronize();
}
This seems to work correctly for me; in this case I happened to be using RHEL 5.5/gcc 4.1.2, CUDA 6.0 RC, and CUB v1.2.0 (which is quite recent).
The strange/ugly static cast is needed, as far as I can tell, because CUB's Sort expects a reference to an array whose length equals the customization parameter ITEMS_PER_THREAD (i.e., ELEMS_PER_THREAD):
__device__ __forceinline__ void Sort(
    Key (&keys)[ITEMS_PER_THREAD],
    int  begin_bit = 0,
    int  end_bit   = sizeof(Key) * 8)
{ ...
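A hedged variant of the kernel above (my sketch, not part of the original answer): the cast can be avoided by staging each thread's keys in a local array of length ELEMS_PER_THREAD, which already has the reference-to-array type that Sort() expects, and then writing the sorted keys back to shared memory.
#include <cub/cub.cuh>
#include <stdio.h>

#define nTPB 32
#define ELEMS_PER_THREAD 1

__global__ void BlockSortKernelNoCast()
{
    __shared__ int my_val[nTPB * ELEMS_PER_THREAD];
    typedef cub::BlockRadixSort<int, nTPB, ELEMS_PER_THREAD> my_block_sort;
    __shared__ typename my_block_sort::TempStorage sort_temp_stg;

    // Synthetic data, one element per thread (extend for ELEMS_PER_THREAD > 1).
    my_val[threadIdx.x * ELEMS_PER_THREAD] = (threadIdx.x + 5) % nTPB;
    __syncthreads();

    // Stage the keys in a per-thread local array; this matches Sort()'s parameter type.
    int thread_keys[ELEMS_PER_THREAD];
    for (int k = 0; k < ELEMS_PER_THREAD; k++)
        thread_keys[k] = my_val[threadIdx.x * ELEMS_PER_THREAD + k];

    my_block_sort(sort_temp_stg).Sort(thread_keys);

    // Write the sorted keys back to shared memory (blocked arrangement).
    for (int k = 0; k < ELEMS_PER_THREAD; k++)
        my_val[threadIdx.x * ELEMS_PER_THREAD + k] = thread_keys[k];
    __syncthreads();

    printf("thread %d sorted data = %d\n", threadIdx.x, my_val[threadIdx.x * ELEMS_PER_THREAD]);
}

int main() {
    BlockSortKernelNoCast<<<1, nTPB>>>();
    cudaDeviceSynchronize();
    return 0;
}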

Dynamic parallelism in CUDA: global memory allocations to exchange data with child kernels

I have the following code:
__global__ void interpolation(const double2* __restrict__ data, double2* __restrict__ result, const double* __restrict__ x, const double* __restrict__ y, const int N1, const int N2, int M)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    [...]
    double phi_cap1, phi_cap2;

    if (i < M) {
        for (int m = 0; m < (2*K+1); m++) {
            [calculate phi_cap1];
            for (int n = 0; n < (2*K+1); n++) {
                [calculate phi_cap2];
                [calculate phi_cap = phi_cap1 * phi_cap2];
                [use phi_cap];
            }
        }
    }
}
I would like to use dynamic parallelism on a Kepler K20 card to dispatch the processing of phi_cap1 and phi_cap2 in parallel to a bunch of threads to reduce the computation time. K=6 in my code, so I'm launching a single block of 13x13 threads.
Following the CUDA Dynamic Parallelism Programming Guide, I'm allocating a matrix phi_cap of 169 elements (formed by the products of phi_cap1 and phi_cap2), needed to exchange the data with the child kernel, in global memory. Indeed, quoting the guide,
As a general rule, all storage passed to a child kernel should be allocated explicitly from the global-memory heap.
I then ended up with the following code:
__global__ void interpolation(const double2* __restrict__ data, double2* __restrict__ result, const double* __restrict__ x, const double* __restrict__ y, const int N1, const int N2, int M)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    [...]
    dim3 dimBlock(2*K+1, 2*K+1); dim3 dimGrid(1, 1);

    if (i < M) {
        double* phi_cap;
        cudaMalloc((void**)&phi_cap, sizeof(double)*(2*K+1)*(2*K+1));

        child_kernel<<<dimGrid, dimBlock>>>(cc_diff1, cc_diff2, phi_cap);

        for (int m = 0; m < (2*K+1); m++) {
            for (int n = 0; n < (2*K+1); n++) {
                [use phi_cap];
            }
        }
    }
}
The problem is that the first routine takes 5 ms to run, while the second routine, even with the child_kernel launch commented out, takes 23 ms, with practically all the time spent in the cudaMalloc API call.
Since with dynamic parallelism one often needs to allocate memory to exchange data with the child kernels, and the only solution seems to be global memory, which takes so much time to allocate, it seems to me that data exchange is a serious bottleneck for the usefulness of dynamic parallelism, unless there is a way to circumvent the global memory allocation issue.
The question then is: is there any workaround for this issue, namely the large cost of allocating global memory from within a kernel? Thanks
SOLUTION PROPOSED IN THE COMMENTS
Allocate the required global memory from outside the parent kernel. I have verified that this is much faster.
You are calling cudaMalloc from each thread where i < M, which means that you are making M cudaMalloc calls.
The bigger M is, the worse it is going to get.
Instead, you could make a single cudaMalloc call from the first thread of the block, allocating M times the size you used before (in your case you should actually allocate more, so that each block is properly aligned). After that, sync the threads and you can start your child kernels, with a correctly computed phi_cap address for each child kernel.
Alternatively (if your specific situation allows you to allocate enough memory that you can hold on to between the kernel calls) you could allocate the memory once outside of the kernel and reuse it. That would be a lot quicker. If M varies between kernel calls you could allocate as much as you would need for the biggest M.
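A hedged sketch of the pre-allocation approach (the sizes, names, and placeholder computations here are my assumptions): a single host-side cudaMalloc provides a pool, and each parent thread uses its own slice instead of calling cudaMalloc in-kernel. It needs to be compiled with relocatable device code (-rdc=true) for an sm_35+ target, and the device-side cudaDeviceSynchronize() follows the legacy CDP model of the Kepler-era question.
#include <cstdio>

#define K 6
#define PHI_CAP_SIZE ((2 * K + 1) * (2 * K + 1))   // 13 x 13 = 169 doubles per parent thread

__global__ void child_kernel(double *phi_cap)
{
    // Placeholder for [calculate phi_cap = phi_cap1 * phi_cap2]:
    // each child thread just records its flattened index here.
    int idx = threadIdx.y * (2 * K + 1) + threadIdx.x;
    phi_cap[idx] = (double)idx;
}

__global__ void interpolation(double *phi_cap_pool, const int M)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < M) {
        // Per-thread slice of the pre-allocated pool; no in-kernel cudaMalloc.
        double *phi_cap = phi_cap_pool + (size_t)i * PHI_CAP_SIZE;
        child_kernel<<<dim3(1, 1), dim3(2 * K + 1, 2 * K + 1)>>>(phi_cap);
        cudaDeviceSynchronize();   // wait for the child before [use phi_cap]

        double acc = 0.0;          // stand-in for [use phi_cap]
        for (int m = 0; m < 2 * K + 1; m++)
            for (int n = 0; n < 2 * K + 1; n++)
                acc += phi_cap[m * (2 * K + 1) + n];
        if (i == 0) printf("thread 0: acc = %f\n", acc);
    }
}

int main()
{
    const int M = 256;   // hypothetical problem size
    double *phi_cap_pool;
    cudaMalloc(&phi_cap_pool, (size_t)M * PHI_CAP_SIZE * sizeof(double));  // single host-side allocation
    interpolation<<<(M + 127) / 128, 128>>>(phi_cap_pool, M);
    cudaDeviceSynchronize();
    cudaFree(phi_cap_pool);
    return 0;
}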

Synchronizing device memory access with the host thread

Is it possible for a CUDA kernel to synchronize writes to device-mapped memory without any host-side invocation (e.g., of cudaDeviceSynchronize)? When I run the following program, the kernel does not appear to wait for its writes to device-mapped memory to complete before terminating: examining the page-locked host memory immediately after the kernel launch shows no modification of the memory (unless a delay is inserted or the call to cudaDeviceSynchronize is uncommented):
#include <stdio.h>
#include <cuda.h>

__global__ void func(int *a, int N) {
    int idx = threadIdx.x;

    if (idx < N) {
        a[idx] *= -1;
        __threadfence_system();
    }
}

int main(void) {
    int *a, *a_gpu;
    const int N = 8;
    size_t size = N * sizeof(int);

    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc((void **)&a, size, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&a_gpu, (void *)a, 0);

    for (int i = 0; i < N; i++) {
        a[i] = i;
    }
    for (int i = 0; i < N; i++) {
        printf("%i ", a[i]);
    }
    printf("\n");

    func<<<1, N>>>(a_gpu, N);
    // cudaDeviceSynchronize();

    for (int i = 0; i < N; i++) {
        printf("%i ", a[i]);
    }
    printf("\n");

    cudaFreeHost(a);
}
I'm compiling the above for sm_20 with CUDA 4.2.9 on Linux and running it on a Fermi GPU (S2050).
A kernel launch returns control to the host code immediately, before any kernel activity has occurred. Kernel execution is in this way asynchronous to host execution and does not block host execution. So it's no surprise that you have to wait a bit, or else use a barrier (like cudaDeviceSynchronize()), to see the results of the kernel.
As described here:
In order to facilitate concurrent execution between host and device,
some function calls are asynchronous: Control is returned to the host
thread before the device has completed the requested task. These are:
Kernel launches;
Memory copies between two addresses to the same device memory;
Memory copies from host to device of a memory block of 64 KB or less;
Memory copies performed by functions that are suffixed with Async;
Memory set function calls.
This is all intentional, of course, so that you can use the GPU and CPU simultaneously. If you don't want this behavior, a simple solution, as you've already discovered, is to insert a barrier. If your kernel is producing data which you will immediately copy back to the host, you don't need a separate barrier: the cudaMemcpy call after the kernel will wait until the kernel is completed before it begins its copy operation.
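A minimal illustration of that implicit barrier (an assumed example, not the asker's mapped-memory code): the blocking cudaMemcpy issued after the kernel does not start copying until the kernel has finished, so the host sees the results without an explicit cudaDeviceSynchronize().
#include <cstdio>

__global__ void negate(int *a, int n) {
    int idx = threadIdx.x;
    if (idx < n)
        a[idx] *= -1;
}

int main() {
    const int N = 8;
    int h_a[N], *d_a;
    for (int i = 0; i < N; i++)
        h_a[i] = i;

    cudaMalloc(&d_a, N * sizeof(int));
    cudaMemcpy(d_a, h_a, N * sizeof(int), cudaMemcpyHostToDevice);

    negate<<<1, N>>>(d_a, N);
    // No cudaDeviceSynchronize() needed: the blocking copy below waits for the kernel.
    cudaMemcpy(h_a, d_a, N * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++)
        printf("%d ", h_a[i]);
    printf("\n");

    cudaFree(d_a);
    return 0;
}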
I guess to answer your question, you want kernel launches to be synchronous without even having to use a barrier (why do you want to do this? Is adding the cudaDeviceSynchronize() call a problem?). It is possible to do this:
"Programmers can globally disable asynchronous kernel launches for all
CUDA applications running on a system by setting the
CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is
provided for debugging purposes only and should never be used as a way
to make production software run reliably. "
If you want this synchronous behavior, it's better just to use the barriers (or depend on another subsequent CUDA call, like cudaMemcpy). If you use the above method and depend on it, your code will break as soon as somebody else tries to run it without the environment variable set. So it's really not a good idea.