Nvcc Optimization Flags [closed] - cuda

I have a CUDA C++ program in which I do some operations with OpenCV. I compile it with the command
nvcc file.cu -o o2 `pkg-config --libs --cflags opencv4`
I wonder which nvcc optimization flags would be the best, most efficient, and most useful. Thanks in advance.

There is documentation for nvcc.
There is also command-line help (nvcc --help).
You may find information about optimization and switches in either of those resources.
You shouldn't need any extra flags to get the fastest possible device code from nvcc (in particular, do not specify -G, which disables device code optimizations). For host code optimization, you may wish to try -O3.
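As a sketch, combining the question's own compile line with host-side -O3 might look like the following; the sm_75 architecture target is only an illustrative example, not something prescribed here, so substitute the value that matches your GPU.
nvcc -O3 -arch=sm_75 file.cu -o o2 `pkg-config --cflags --libs opencv4`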

Related

Why are Cuda kernels annotated with `__global__` instead of `__kernel__` [closed]

Actually, the title already is the full question.
Why did Nvidia decide to call its GPU entry functions kernels, but in Cuda they must be annotated with __global__ instead of __kernel__?
The goal is to separate the entity (the kernel) from its scope or location.
There are three types of functions relevant to your question (illustrated in the sketch below):
__device__ functions can be called only from the device and execute only on the device.
__global__ functions can be called from the host and execute on the device.
__host__ functions are called from the host and run on the host.
If the qualifier had been named __kernel__, it would name the entity rather than its scope, and the functions could not be distinguished in the way they are separated above.
The __global__ here means "in the space shared between host and device", i.e. global with respect to both of them.
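A minimal sketch showing the three qualifiers side by side; the function names and sizes are made up for illustration only.
#include <cstdio>
#include <cuda_runtime.h>

// __device__: callable only from device code, runs on the device
__device__ int square(int x) { return x * x; }

// __global__: callable from the host, runs on the device (this is the "kernel")
__global__ void squareAll(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = square(data[i]);
}

// __host__: called from the host, runs on the host (the default for plain functions)
__host__ void report(int v) { printf("result: %d\n", v); }

int main()
{
    const int n = 4;
    int h[n] = {1, 2, 3, 4};
    int *d;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);
    squareAll<<<1, n>>>(d, n);          // kernel launch from the host
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    report(h[3]);                       // prints 16
    return 0;
}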

Does cudaDeviceSynchronize() stop working when there are other jobs running on the gpu? [closed]

There are jobs running on the GPU, and if I run another program on top of them, that program stops at its cudaDeviceSynchronize() call. Why does this happen?
Currently only one process is allowed to use a GPU at a given point in time. There is no fairness mechanism or time quantum to kill a "job" if it runs for hours on the GPU; the basic usage is first come, first served.
But you may use the CUDA Multi-Process Service (MPS). It basically allows multiple processes to share a single GPU:
https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
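A rough sketch of how MPS is typically started on device 0; the required compute mode, environment variables, and paths vary by driver version, so treat the linked document as the authoritative reference.
# optional: restrict the device so only the MPS server can create contexts on it
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
# start the MPS control daemon; subsequent CUDA processes then share the GPU through it
nvidia-cuda-mps-control -d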

understanding HPC Linpack (CUDA edition) [closed]

I want to know what role the CPUs play when HPC Linpack (CUDA version) is running. They are receiving data from other cluster nodes and performing CPU-GPU data exchange, aren't they? So their work doesn't influence performance, does it?
In typical usage both the GPU and the CPU contribute to the numerical calculations. The host code will use MKL or another BLAS implementation for host-generated numerical results, and the device code will use CUBLAS or something related for device numerical results.
A version of HPL is available to registered developers in source code format, so you can inspect all of this yourself.
And, as you say, the CPUs are also involved in various other administrative activities, such as internode data exchange in a multi-node setting.
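To illustrate the device-side BLAS work mentioned above: the call that dominates Linpack's floating-point work is a double-precision GEMM. A minimal cuBLAS sketch of that kind of call might look like the following; the sizes and (uninitialized) data are placeholders, not taken from HPL.
// compile with: nvcc dgemm_sketch.cu -lcublas
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main()
{
    const int n = 1024;                      // illustrative size only
    double *dA, *dB, *dC;
    cudaMalloc(&dA, n * n * sizeof(double));
    cudaMalloc(&dB, n * n * sizeof(double));
    cudaMalloc(&dC, n * n * sizeof(double));

    cublasHandle_t handle;
    cublasCreate(&handle);

    const double alpha = 1.0, beta = 0.0;
    // C = alpha*A*B + beta*C : the DGEMM shape at the heart of Linpack
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}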

Platform vs Software Framework [closed]

CUDA advertises itself as a parallel computing platform. However, I'm having trouble seeing how it's any different from a software framework (a collection of libraries used for some functionality). I am using CUDA in class, and all I'm seeing is that it provides C libraries with functions that help with parallel computing on the GPU, which fits my definition of a framework. So tell me, how is a platform like CUDA different from a framework? Thank you.
CUDA, the hardware platform, is the actual GPU and its scheduler (the "CUDA architecture"). However, CUDA is also a programming language, very close to C. To work with software written in CUDA you also need an API for calling these functions, allocating memory, etc. from your host language. So CUDA is a platform, a language, and a set of APIs.
If the latter (a set of APIs) matches your definition of a software framework, then the answer is simply yes, as both descriptions are true.

Difference between CUDA level and compute level? [closed]

What is the difference between these two definitions?
If not, does that mean I will never be able to run code built for sm > 21 on a GPU with compute capability 2.1?
That's correct. For a compute capability 2.1 device, the maximum code specification (virtual architecture / target architecture) you can give it is -arch=sm_21. Code compiled for -arch=sm_30, for example, would not run correctly on a cc 2.1 device.
For more information, you can take a look at the nvcc manual section that covers virtual architectures, as well as the section that covers the compile switches specifying the virtual architecture and compile targets (code architecture).
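As a sketch, the distinction shows up directly in the compile command; note that which sm values are accepted depends on your CUDA toolkit version, and newer toolkits have dropped the older targets.
nvcc -arch=sm_21 file.cu -o app    # code a compute capability 2.1 GPU can run
nvcc -arch=sm_30 file.cu -o app    # requires a compute capability 3.0 (or compatible newer) GPU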