I recently ran into a problem when copying dynamically allocated device data to host memory. The data is allocated dynamically inside a kernel, and I try to copy it from device to host in a host function. Here is the code:
#include <cuda.h>
#include <stdio.h>

#define N 100

__device__ int* d_array;

__global__ void allocDeviceMemory()
{
    d_array = new int[N];
    for (int i = 0; i < N; i++)
        d_array[i] = 123;
}

int main()
{
    allocDeviceMemory<<<1, 1>>>();
    cudaDeviceSynchronize();

    int* d_a = NULL;
    cudaMemcpyFromSymbol((void**)&d_a, "d_array", sizeof(d_a), 0, cudaMemcpyDeviceToHost);
    printf("gpu address: %p\n", d_a);

    int* h_array = (int*)malloc(N*sizeof(int));
    cudaError_t errr = cudaMemcpy(h_array, d_a, N*sizeof(int), cudaMemcpyDeviceToHost);
    printf("h_array: %d, %d\n", h_array[0], errr);

    getchar();
    return 0;
}
Another poster already had the same issue with CUDA 4.1, and some experts suggested that upgrading the CUDA driver and runtime to a newer version would solve it:
CUDA - Copy device data to host?
I have CUDA Toolkit 4.2, the latest developer drivers, and a C2075, but the problem above still occurs. Please let me know how to solve it.
Unfortunately there is no way to do what you are trying to do in CUDA 4. The host API cannot copy from dynamically allocated addresses on the device runtime heap; only device code can access them. If you want to retrieve the data with the host API, you will need to have device code write it into an "output" buffer allocated with the host API first, after which you are free to use cudaMemcpy to fetch it from the host (see the sketch below).
You can see confirmation of this limitation from Mark Harris of Nvidia here.
Since this answer was posted in 2012, the restriction on host API interoperability appears to have been set in stone, and is explicitly documented in the CUDA programming guide.
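A minimal sketch of that workaround: the kernel writes the device-heap data into a buffer allocated with cudaMalloc, which the host can then read with cudaMemcpy. The kernel copyToOutput and the buffer d_out are illustrative names, not from the original code:
#include <cuda.h>
#include <stdio.h>

#define N 100

__device__ int* d_array;

__global__ void allocDeviceMemory()
{
    d_array = new int[N];
    for (int i = 0; i < N; i++)
        d_array[i] = 123;
}

// Copy from the device-heap allocation into a buffer the host API knows about
__global__ void copyToOutput(int* d_out)
{
    for (int i = 0; i < N; i++)
        d_out[i] = d_array[i];
}

int main()
{
    allocDeviceMemory<<<1, 1>>>();

    int* d_out = NULL;
    cudaMalloc((void**)&d_out, N * sizeof(int));   // allocated with the host API
    copyToOutput<<<1, 1>>>(d_out);
    cudaDeviceSynchronize();

    int* h_array = (int*)malloc(N * sizeof(int));
    cudaMemcpy(h_array, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);  // now legal
    printf("h_array: %d\n", h_array[0]);
    return 0;
}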
Related
I have a program that uses cudaMallocHost() to allocate pinned memory, but I forgot to call cudaFreeHost() to free it...
I ran the program once and exited, but the next time I tried to run the same program, it threw a segmentation fault when I called cudaMallocHost.
I suspect this is because the memory was pinned during the first run, and when I try to run the program again the OS can't find any more memory that can be pinned...
My question is: is there any CUDA API I can call to release already-pinned host memory without knowing its host address?
I looked through the CUDA documentation but didn't find one, and rebooting didn't help either.
Edit
I ran htop and found 17 GB of memory that nobody seems to be using.
I wonder if this is the memory that I pinned?
I made some tests with htop and a small application.
Here is the code I used:
#include <cuda_runtime.h>
#include <unistd.h>
#include <vector>

int main(void)
{
    std::vector<void*> arPtrs;
    for (int i = 0; i < 5; ++i)
    {
        void* ptr = nullptr;
        cudaMallocHost(&ptr, 1 * 1024 * 1024 * 1024);
        arPtrs.push_back(ptr);
        sleep(2);
    }
    return 0;
}
As you can see, I don't call cudaFreeHost on my pointers.
In parallel I monitor the memory with htop. htop shows the memory as released when the application exits.
Could your memory be in use by another user, so that you can't see it?
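For comparison, here is a variant of the same test that releases the pinned buffers explicitly; cudaFreeHost is the matching free call for cudaMallocHost:
#include <cuda_runtime.h>
#include <unistd.h>
#include <vector>

int main(void)
{
    std::vector<void*> arPtrs;
    for (int i = 0; i < 5; ++i)
    {
        void* ptr = nullptr;
        cudaMallocHost(&ptr, 1 * 1024 * 1024 * 1024);  // 1 GiB of pinned host memory
        arPtrs.push_back(ptr);
        sleep(2);
    }
    for (void* ptr : arPtrs)
        cudaFreeHost(ptr);  // release the pinned allocations before exit
    return 0;
}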
Is there a way to copy from device to host within the kernel?
Something like the following code:
__global__ void kernel(int n, double *devA, double *hostA) {
    double x = 1.0;
    do_computation();
    cudaMemcpy(hostA, &x, sizeof(double), cudaMemcpyDeviceToHost);
    do_computation();
    cudaMemcpy(hostA, devA, sizeof(double), cudaMemcpyDeviceToHost);
}
Is it possible? Based on the CUDA documentation, cudaMemcpy is not callable from the device, right?
NOTE: I don't want to use pinned memory. It performs poorly here because I will constantly be checking the host variable (memory), so using pinned memory would cause page faults (at best, on post-Pascal hardware), and those will definitely happen! If both host and device access the same location, it basically becomes a ping-pong effect!
Is it possible?
In one word, no.
Based on the CUDA documentation, the cudaMemcpy is not callable from the device, right?
In fact, if you read the documentation, you will see that cudaMemcpy is supported in device code, but only for device-to-device transfers, and it cannot use local variables as source or destination.
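As a rough illustration of the allowed device-to-device form, the sketch below assumes you compile with relocatable device code (-rdc=true) and link against cudadevrt so the device runtime's cudaMemcpyAsync is available; the kernel name and the devSrc/devDst buffers are illustrative:
// Sketch only: issues a device-to-device copy from inside a kernel.
// Host pointers or local variables would not be valid as source or destination here.
__global__ void copy_on_device(double *devDst, const double *devSrc, size_t n)
{
    cudaMemcpyAsync(devDst, devSrc, n * sizeof(double), cudaMemcpyDeviceToDevice, 0);
}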
On my machine, the call to cudaMemPrefetchAsync in the code below returns 10 (cudaErrorInvalidDevice) rather than 0. The setup is an Alienware 17 laptop running Windows 10 with an NVIDIA GTX 1080 GPU and onboard Intel HD Graphics 530, using driver 376.19 from NVIDIA (the mobile driver).
I've compiled for compute_61, sm_61. Another user tried running the same code on a Pascal architecture (Titan X) and it returned 0 correctly. I've also tested this in both Debug and Release mode with the same result. Any ideas?
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
int main()
{
int* data;
size_t len = 10;
cudaError_t err = cudaSetDevice(0);
err = cudaMallocManaged(reinterpret_cast<void **>(&data), len, cudaMemAttachGlobal);
err = cudaMemPrefetchAsync(data, len, 0, 0);
}
There is a known bug, confirmed by an NVIDIA employee, for the Windows environment (see the bottom of the linked post).
On the other hand, there are reports that code like yours works fine under the Linux OS, or with Maxwell cards.
I have the very same issue as you, but so far no solution, even with CUDA 9.0 RC. My advice is to use the regular memory approach for now, since the issue has been reported for more than a year with no fix.
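A sketch of that regular-memory approach: plain cudaMalloc plus explicit cudaMemcpy calls instead of managed memory and cudaMemPrefetchAsync (buffer names are illustrative):
#include "cuda_runtime.h"

int main()
{
    const size_t len = 10;
    int host_data[len] = {0};

    int* data = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&data), len * sizeof(int));          // ordinary device allocation
    cudaMemcpy(data, host_data, len * sizeof(int), cudaMemcpyHostToDevice);  // explicit copy replaces the prefetch

    // ... launch kernels that use data ...

    cudaMemcpy(host_data, data, len * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(data);
    return 0;
}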
I have a __global__ function in CUDA. Can it call itself?
Here is my example:
__global__ void
force_create_empty_nodes (struct NODE *Nodes, int topnode, int bits, int no, int x, int y,
                          int z, struct topnode_data *TopNodes)
{
    /* ... some code ... */
    force_create_empty_nodes<<<1, 8>>>(Nodes, topnode+1, bits+1, no+1,
                                       x+1, y+1, z+1, TopNodes);
}
And the error I receive is:
error: kernel launch from __device__ or __global__ functions requires separate compilation mode
Here is my make command:
nvcc -c -arch compute_35 cudaForceNodes.cu -o obj/cudaForceNodes.o
Calling a kernel from another kernel is called dynamic parallelism. The documentation for it is here.
It requires:
A compute capability 3.5 device. You can find the compute capability of your device by running the cuda deviceQuery sample.
Various switches in the compile command, including those specifying compilation for a cc3.5 architecture and those needed for separate (device) compilation, and linking with the device runtime.
Since your GT550M is not a cc 3.5 device, you won't be able to use this feature. There is no other way to call a kernel from within a kernel.
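For reference, on a device that does support dynamic parallelism, the compile and link step with the switches mentioned above would look roughly like this (paths match the original make command; exact flags can vary by toolkit version):
nvcc -arch=sm_35 -rdc=true -c cudaForceNodes.cu -o obj/cudaForceNodes.o
nvcc -arch=sm_35 -dlink obj/cudaForceNodes.o -o obj/cudaForceNodes_dlink.o -lcudadevrt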
I'm using Visual Studio 2010 and a GTX480 with compute capability 2.0.
I have tried setting sm to 2.0, but when I attempt to use printf() in a kernel, I get:
error : calling a host function("printf") from a __device__/__global__
function("test") is not allowed
This is my code:
#include "util\cuPrintf.cu"
#include <cuda.h>
#include <iostream>
#include <stdio.h>
#include <conio.h>
#include <cuda_runtime.h>
__global__ void test (void)
{
printf("Hello, world from the device!\n");
}
void main(void)
{
test<<<1,1>>>();
getch();
}
I found an example in the CUDA C Programming Guide, page 106, section B.16.4 "Examples".
At last, it works for me :D Thank you.
#include "stdio.h"
#include <conio.h>
// printf() is only supported
// for devices of compute capability 2.0 and higher
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 200)
#define printf(f, ...) ((void)(f, __VA_ARGS__),0)
#endif
__global__ void helloCUDA(float f)
{
printf("Hello thread %d, f=%f\n", threadIdx.x, f);
}
int main()
{
helloCUDA<<<1, 5>>>(1.2345f);
cudaDeviceSynchronize();
getch();
return 0;
}
To use printf in kernel code, you have to do three things:
Make sure that cstdio or stdio.h is included in the kernel compilation unit. CUDA implements kernel printf by overloading, so you must include one of those headers.
Compile your code for compute capability 2.x or 3.x and run it on a supported GPU (so pass something like -arch=sm_20 to nvcc or the IDE equivalent in Visual Studio or Nsight Eclipse edition)
Ensure that the kernel has finished running by including an explicit or implicit synchronization point in your host code (cudaDeviceSynchronize for example).
You are probably compiling for an architecture that does not support printf(). By default the project is compiled for compute architecture 1.0. To change this, in VS open the project properties -> CUDA C/C++ -> Device and change the "Code Generation" property to "compute_20,sm_20".
You do not need #include "util\cuPrintf.cu". Please see this for details on how to use printf and how to flush the output so you actually see the result.
If you're getting that error, it probably means that your GPU does not have Compute capability 2.x or higher. This thread goes into more detail on what your options are for printing inside a kernel function.