Actually, the title already is the full question.
Why did Nvidia decide to call its GPU entry functions "kernels", yet in CUDA they must be annotated with __global__ rather than __kernel__?
The goal is to separate the entity (the kernel) from its scope or location. There are three types of functions relevant to your question:
__device__ functions can be called only from the device and execute only on the device.
__global__ functions can be called from the host and execute on the device.
__host__ functions are called from the host and run on the host.
If the qualifier had been named __kernel__, it would be impossible to distinguish the functions in the way they are separated above. __global__ here means "in the space shared between host and device", i.e. in the "global" area between them.
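A minimal sketch to illustrate the three qualifiers (the function names and launch configuration are made up for illustration):

    // __device__: callable only from device code, executes on the device
    __device__ float square(float x) { return x * x; }

    // __global__: callable from the host (a kernel launch), executes on the device
    __global__ void squareKernel(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = square(in[i]);
    }

    // __host__: called from the host, runs on the host (also the default when no qualifier is given)
    __host__ void launchSquare(const float *d_in, float *d_out, int n)
    {
        squareKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    }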
I'm new to CUDA, and so far all the tutorials I've seen are for arrays.
I am wondering whether you can define something like a double variable in CUDA, or does something like that have to live on the CPU?
You can have a scalar variable as a kernel parameter, as a private variable, as a shared-memory variable, and even as a global compilation-unit variable.
You can have scalar fields in classes, arrays of structs, structs of arrays, anything that is plain old data. You can use typedefs, macros, and any bit-level hacking, as long as the variable is loaded/stored with proper alignment.
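A short sketch of the places a scalar can live (all names are illustrative):

    __device__ double d_scale = 2.0;              // global compilation-unit variable in device memory

    __global__ void scaleKernel(double *data, double offset, int n)   // 'offset' is a scalar kernel parameter
    {
        __shared__ double s_offset;               // scalar in shared memory
        if (threadIdx.x == 0)
            s_offset = offset;
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            double tmp = data[i] * d_scale + s_offset;   // 'tmp' is a private (per-thread) scalar
            data[i] = tmp;
        }
    }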
I want to know what role the CPUs play when HPC Linpack (the CUDA version) is running. They receive data from other cluster nodes and perform CPU-GPU data exchange, don't they? So does their work not influence performance?
In typical usage, both the GPU and the CPU contribute to the numerical calculations. The host code will use MKL or another BLAS implementation for the host-generated numerical results, and the device code will use CUBLAS or something related for the device results.
A version of HPL is available to registered developers in source code format, so you can inspect all this yourself.
And, as you say, the CPUs are also involved in various other administrative activities, such as internode data exchange in a multinode setting.
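As a rough, hypothetical illustration of that division of labour (not HPL's actual code; it assumes the matrices are already resident both on the host and on the device, and omits error checking):

    #include <cublas_v2.h>
    #include <cblas.h>
    #include <cuda_runtime.h>

    // Split an n x n DGEMM: cuBLAS computes the first 'split' columns of C on the GPU
    // while a host BLAS (e.g. MKL/OpenBLAS) computes the remaining columns on the CPU.
    void split_dgemm(cublasHandle_t handle, int n, int split,
                     const double *hA, const double *hB, double *hC,
                     const double *dA, const double *dB, double *dC)
    {
        const double alpha = 1.0, beta = 0.0;

        // GPU part: C(:, 0:split) = A * B(:, 0:split), launched asynchronously
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, split, n, &alpha, dA, n, dB, n, &beta, dC, n);

        // CPU part: C(:, split:n) = A * B(:, split:n), runs concurrently on the host
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n - split, n, alpha, hA, n, hB + (size_t)split * n, n,
                    beta, hC + (size_t)split * n, n);

        cudaDeviceSynchronize();   // wait for the GPU portion before using dC
    }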
I know the make_float4 constructor is in vector_functions.h, but which header file implements float4 operations in CUDA?
Thanks.
I don't believe there is a standard CUDA header file (i.e. one that nvcc will find automatically, such as those in /usr/local/cuda/include) that implements a variety of float4 operators.
However, the "helper" header file at
/usr/local/cuda/samples/common/inc/helper_math.h
(example path on Linux), which is installed with the CUDA samples, defines a number of arithmetic operators on float4 quantities.
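If you prefer not to pull in helper_math.h, defining the handful of operators you need yourself is straightforward; a minimal sketch in the same spirit:

    #include <cuda_runtime.h>   // for float4 / make_float4

    // Element-wise add and scalar multiply for float4, usable in both host and device code
    inline __host__ __device__ float4 operator+(float4 a, float4 b)
    {
        return make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
    }

    inline __host__ __device__ float4 operator*(float4 a, float s)
    {
        return make_float4(a.x * s, a.y * s, a.z * s, a.w * s);
    }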
What is the difference between these two definitions?
If not, does it mean I will never be able to run code built for sm > 21 on a GPU with compute capability 2.1?
That's correct. For a compute capability 2.1 device, the maximum code specification (virtual architecture/target architecture) you can give is -arch=sm_21. Code compiled with -arch=sm_30, for example, would not run correctly on a cc 2.1 device.
For more information, you can take a look at the nvcc manual section which covers virtual architectures, as well as the manual section which covers the compile switches specifying virtual architecture and compile targets (code architecture).
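For example, with a toolkit old enough to still support Fermi devices, the relevant flags might look like this (illustrative command lines only):

    nvcc -arch=sm_21 -o app app.cu                           # shorthand: compute_20 virtual arch, sm_21 code
    nvcc -gencode arch=compute_20,code=sm_21 -o app app.cu   # the same thing, spelled out explicitly
    nvcc -arch=sm_30 -o app app.cu                           # builds code that will not run on a cc 2.1 device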
Just curious.
Why do functions in the driver API use an unsigned integer type as CUdeviceptr, instead of void*?
The runtime API uses void*, though.
I believe the underlying reason is that a CUdeviceptr is a handle to an allocation in device memory, not an address in device memory. The driver looks up addresses internally from a memory map using this handle, and the internal driver API requires it to be an unsigned integer.
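A small sketch contrasting the two allocation calls (it assumes cuInit() and a current context have already been set up for the driver API, and omits error checking):

    #include <cuda.h>            // driver API
    #include <cuda_runtime.h>    // runtime API

    void allocate_both(size_t bytes)
    {
        // Driver API: CUdeviceptr is an integer-typed handle, not a host pointer
        CUdeviceptr dptr;
        cuMemAlloc(&dptr, bytes);
        cuMemFree(dptr);

        // Runtime API: the allocation is exposed as a void*
        void *p = NULL;
        cudaMalloc(&p, bytes);
        cudaFree(p);
    }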
Tim Murray, who was at one stage in charge of CUDA driver development at NVIDIA, wrote this answer on another forum a few years ago. I think that is about as authoritative an answer as you will find (although Nick Wilt, who was the original CUDA driver author, also answers questions here on Stack Overflow occasionally and might chime in and provide a better answer than mine).