Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
I am new to CUDA and am going to buy a GPU that is sufficient for my needs without spending much. I will be working on an application that requires graphics rendering as well as other general-purpose computation.
What should be my primary consideration when buying?
No. of SMs
No. of CUDA Cores
Core/Shader/Memory Clock
Memory Size
Memory Bus width
How do the above mentioned specifications affect CUDA performance?
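For an installed card, most of the figures listed above can be read back through the CUDA runtime. The following is a minimal sketch, assuming device 0 and double-data-rate memory (two transfers per clock); the helper name `peak_bandwidth_gbs` is hypothetical and the bandwidth it returns is only a theoretical upper bound. The number of CUDA cores per SM is not exposed as an attribute, so it is not printed here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: theoretical peak memory bandwidth in GB/s.
// Assumes double-data-rate memory, i.e. two transfers per memory clock.
double peak_bandwidth_gbs(int mem_clock_khz, int bus_width_bits) {
    return 2.0 * mem_clock_khz * 1e3 * (bus_width_bits / 8.0) / 1e9;
}

int main() {
    const int dev = 0;  // assumption: inspect the first CUDA device
    int sms = 0, core_khz = 0, mem_khz = 0, bus_bits = 0;
    size_t free_bytes = 0, total_bytes = 0;

    // Query each attribute in turn, stopping at the first failure.
    cudaError_t err = cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, dev);
    if (err == cudaSuccess)
        err = cudaDeviceGetAttribute(&core_khz, cudaDevAttrClockRate, dev);
    if (err == cudaSuccess)
        err = cudaDeviceGetAttribute(&mem_khz, cudaDevAttrMemoryClockRate, dev);
    if (err == cudaSuccess)
        err = cudaDeviceGetAttribute(&bus_bits, cudaDevAttrGlobalMemoryBusWidth, dev);
    if (err == cudaSuccess)
        err = cudaMemGetInfo(&free_bytes, &total_bytes);  // uses the current device
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("SMs:              %d\n", sms);
    printf("Core clock:       %.0f MHz\n", core_khz / 1e3);
    printf("Memory clock:     %.0f MHz\n", mem_khz / 1e3);
    printf("Memory bus width: %d bits\n", bus_bits);
    printf("Memory size:      %.0f MiB\n", total_bytes / (1024.0 * 1024.0));
    printf("Peak bandwidth:   %.1f GB/s (theoretical)\n",
           peak_bandwidth_gbs(mem_khz, bus_bits));
    return 0;
}
```

Roughly speaking, the SM count and core clock bound the compute throughput, while the memory clock and bus width bound how fast a memory-bound kernel can stream data.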
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 9 days ago.
I have a deep learning algorithm that uses CUDA. I have an RTX 3060 graphics card, but only 13% of it is used when running the code. What should I do to make my code run faster and use my graphics card at full performance?
I installed NVIDIA's CUDA drivers properly.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 1 year ago.
I want to find out how many clock cycles each type of memory access takes in CUDA. I want to compare the speed of the different memory and cache types on the GPU for each specific architecture. Is there a source that lists memory access latencies in clock cycles per architecture, or is there a method to measure them?
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
There are jobs running on the GPU, and if I run another program on top of them, it stops at the call to cudaDeviceSynchronize(). Why does this happen?
Currently only one process is allowed to use a GPU at a given point in time. There is neither fairness nor a time quantum after which a "job" is killed if it runs for hours on a GPU. The basic policy is first come, first served.
However, you can use the CUDA Multi-Process Service (MPS), which allows multiple processes to share a single GPU:
https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
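How a device is shared between processes also depends on its compute mode, which can be read from the CUDA runtime. A minimal sketch, assuming device 0; the helper name `compute_mode_name` is hypothetical, and the program only reports the mode without changing it:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: map a compute-mode value to a readable name.
const char* compute_mode_name(int mode) {
    switch (mode) {
        case cudaComputeModeDefault:          return "Default";
        case cudaComputeModeExclusive:        return "Exclusive (legacy)";
        case cudaComputeModeProhibited:       return "Prohibited";
        case cudaComputeModeExclusiveProcess: return "Exclusive Process";
        default:                              return "Unknown";
    }
}

int main() {
    const int dev = 0;  // assumption: inspect the first CUDA device
    int mode = 0;
    cudaError_t err = cudaDeviceGetAttribute(&mode, cudaDevAttrComputeMode, dev);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // In Default mode several processes may create contexts on the device;
    // in Exclusive Process mode only one process may do so at a time.
    printf("Compute mode of device %d: %s\n", dev, compute_mode_name(mode));
    return 0;
}
```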
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 8 years ago.
Our application has an FFT part that we would like to port to the GPU. We have a Tesla K20m GPU. Which version of cuFFT is optimized for the K20m card?
There is no specific version of the cuFFT library that is optimized for a specific card. Just use the standard cuFFT library that ships with CUDA 5.0 (or CUDA 5.5 RC, if you like).
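As an illustration of using the standard library, here is a minimal sketch of a 1D complex-to-complex forward transform with cuFFT. The helper name `forward_fft`, the signal length of 8, and the all-ones input are assumptions made for the example, not part of the original application. It should build with something like `nvcc fft.cu -lcufft`.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical helper: in-place forward FFT of `data` on the GPU.
// Returns true on success; on failure `data` may be left unchanged.
bool forward_fft(std::vector<cufftComplex>& data) {
    const size_t bytes = data.size() * sizeof(cufftComplex);
    cufftComplex* d_data = nullptr;
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return false;

    bool ok = cudaMemcpy(d_data, data.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;

    // One plan for a single (batch = 1) complex-to-complex transform.
    cufftHandle plan = 0;
    const bool have_plan =
        ok && cufftPlan1d(&plan, static_cast<int>(data.size()), CUFFT_C2C, 1) == CUFFT_SUCCESS;
    ok = have_plan &&
         cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD) == CUFFT_SUCCESS;

    // The blocking copy back also waits for the transform to finish.
    ok = ok && cudaMemcpy(data.data(), d_data, bytes,
                          cudaMemcpyDeviceToHost) == cudaSuccess;

    if (have_plan) cufftDestroy(plan);
    cudaFree(d_data);
    return ok;
}

int main() {
    // Assumed test signal: 8 samples, all equal to 1 + 0i.
    std::vector<cufftComplex> signal(8, make_cuFloatComplex(1.0f, 0.0f));
    if (!forward_fft(signal)) {
        fprintf(stderr, "FFT failed\n");
        return 1;
    }
    // For a constant input, all energy should land in bin 0.
    for (size_t i = 0; i < signal.size(); ++i)
        printf("bin %zu: %.3f %+.3fi\n", i, signal[i].x, signal[i].y);
    return 0;
}
```

The same source works across GPU generations; the library selects its kernels for the device at run time, which is why no card-specific cuFFT build is needed.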
Closed. This question is off-topic. It is not currently accepting answers.
Closed 9 years ago.
I was wondering what the absolute fastest (lowest-latency) way is to produce an external signal from a PC (for example, a CMOS state change from 0 to 1 on an electrical wire connected to another device), counting from the moment the CPU assembler program knows that the signal must be produced.
I know that network devices, USB, and VGA monitor output have large latency compared to other interfaces (SATA, PCI-E). Which interface or hardware modification can provide near-zero output latency from, let's suppose, an assembler program?
I don't know if it is really the fastest interface you can provide, because that also depends on your definition of "external", but http://en.wikipedia.org/wiki/InfiniBand certainly comes close to what your question aims at. Latency is 200 nanoseconds and below in certain scenarios ...