Differentiating between memory-bound and compute-bound CUDA kernels [closed] - cuda

I am trying to write a static analyzer that differentiates between data-intensive and computation-intensive CUDA kernels. As far as I have researched, there is not much literature on this topic. One way to accomplish it is to calculate the kernel's CGMA (compute to global memory access) ratio: if the ratio is "too high", the kernel is probably compute intensive; otherwise it is memory intensive.
The problem with this method is that I can't decide on a threshold value for the ratio, i.e. above what value a kernel should be classified as compute intensive. One option is to use the ratio of CUDA cores to load/store units as the threshold. What does SO think?
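For concreteness, here is a hedged sketch of the kind of counting involved; the kernel below is just an illustrative toy example, not taken from any paper, and the counts are per thread.

    // Illustrative kernel for CGMA counting (my own example).
    // Per thread: 2 global loads + 1 global store = 3 global memory accesses,
    // and 2 floating-point operations (one multiply, one add),
    // so CGMA = 2 / 3 ~= 0.67 FLOPs per global access.
    // A ratio this low means the kernel is limited by memory bandwidth on
    // essentially any modern GPU, so it would be classified as memory intensive.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];   // 1 mul + 1 add; 2 loads + 1 store
    }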
I came across this paper in which they calculate a parameter called "memory intensity". First they compute a parameter called the activity factor, which is then used to calculate memory intensity. Please find the paper here; memory intensity is defined on page 6.
Does a better approach exist? I am stuck in my research because of this and desperately need help.

Related

Designing a circuit that calculates Hamming distance? [closed]

I came across this question and I couldn't find it in textbooks or on the internet; it seems pretty unique.
I guess there would be some comparators and adders involved, but I have no clue where to start.
The first step will undoubtedly be XORing the two bit sets. Then you need to count the number of logical ones in the output. The best approach to designing your circuit is to build a hardware analogue of the bit-counting hack discussed in this question and explained well in nneonneo's answer there. This yields the optimal tree of adders rather than sequential counting: at each layer you know the maximum possible sum of a subset of the inputs and how many bits it fits in, which eliminates the need for carry bits between fields. The software version is written for 32 bits but is easily adapted to fewer or more bits; a minimal sketch of it follows.
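Here is that 32-bit computation in code (a minimal sketch; the masks are the standard SWAR constants, and the function name is my own):

    // Hamming distance of two 32-bit words via the parallel (SWAR) bit count.
    // Each step sums adjacent fields that are wide enough to hold the result,
    // so no carry can escape a field -- the software analogue of the adder tree.
    __host__ __device__ unsigned hamming32(unsigned a, unsigned b)
    {
        unsigned v = a ^ b;                                   // differing bits
        v = v - ((v >> 1) & 0x55555555u);                     // 16 fields of 2 bits
        v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);     // 8 fields of 4 bits
        v = (v + (v >> 4)) & 0x0F0F0F0Fu;                     // 4 fields of 8 bits
        return (v * 0x01010101u) >> 24;                       // total in the top byte
    }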
For more possible algorithms for computing Hamming weight see this link.

understanding HPC Linpack (CUDA edition) [closed]

I want to know what role the CPUs play when HPC Linpack (the CUDA version) is running. They receive data from other cluster nodes and perform CPU-GPU data exchange, don't they? So their work doesn't influence performance, does it?
In typical usage both GPU and CPU are contributing to the numerical calculations. The host code will use MKL or another BLAS implementation for host-generated numerical results, and the device code will use CUBLAS or something related for device numerical results.
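For instance, a device-side DGEMM in an HPL-like code would go through CUBLAS roughly like this (a minimal sketch, assuming the matrices are already resident in device memory; the function name is my own, not taken from the HPL source):

    #include <cublas_v2.h>

    // Sketch: C = alpha*A*B + beta*C on the GPU -- the kind of call the device
    // side of a hybrid HPL makes, while the host side calls dgemm from MKL
    // (or another CPU BLAS) on its share of the work.
    void device_dgemm(cublasHandle_t handle, int m, int n, int k,
                      const double *dA, const double *dB, double *dC)
    {
        const double alpha = 1.0, beta = 1.0;
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, n, k,
                    &alpha, dA, m,   // lda = m (column-major)
                            dB, k,   // ldb = k
                    &beta,  dC, m);  // ldc = m
    }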
A version of HPL is available to registered developers in source code format, so you can inspect all this yourself.
And, as you say, the CPUs are also involved in various other administrative activities, such as internode data exchange in a multinode setting.

What is the absolutely fastest way to output a signal to external hardware in modern PC? [closed]

I was wondering: what is the absolutely fastest way (lowest latency) to produce an external signal from a PC (for example, a CMOS state change from 0 to 1 on a wire connected to another device), counting from the moment the CPU's assembler program knows that the signal must be produced?
I know that network devices, USB, and VGA monitor output have fairly large latency compared to other interfaces (SATA, PCI-E). Which interface, or what hardware modification, can provide near-zero output latency from, let's say, an assembler program?
I don't know if it is really the fastest interface available, because that also depends on your definition of "external", but http://en.wikipedia.org/wiki/InfiniBand certainly comes close to what your question aims at. Latency is 200 nanoseconds and below in certain scenarios.

Is it possible to add 1,000,000 doubles in one clock cycle on a GPU? [closed]

I'm curious: on a GPU, is it possible to add millions of numbers in parallel, within a few clock cycles - or is this operation something that is theoretically impossible to parallelize?
By GPU, I mean any offering from nVidia or AMD, e.g. Tesla M2050.
In only one clock cycle, or a "few"? If the former, then no: there are nowhere near enough hardware resources in any GPU to add millions of doubles in the same clock cycle. If you mean "relatively few clock cycles with respect to a typical CPU", then yes. The type of addition you wish to perform is also a factor. For example, are you doing a reduction sum over the elements of an array? Adding two vectors together? Adding a constant to a vector? These all have different performance characteristics on GPUs.
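To make the reduction-sum case concrete, here is a minimal sketch of the usual block-level tree reduction (my own example; a full sum of 1,000,000 doubles needs a second pass over the per-block partial sums, or double-precision atomicAdd on hardware that supports it):

    // One partial sum per thread block, using a shared-memory tree reduction.
    // blockDim.x must be a power of two. Launch as
    //   partial_sum<<<blocks, threads, threads * sizeof(double)>>>(in, out, n);
    // With 1e6 elements and 256 threads per block this is ~4000 blocks --
    // finished in a handful of kernel latencies, not one clock cycle.
    __global__ void partial_sum(const double *in, double *block_sums, int n)
    {
        extern __shared__ double sdata[];
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + threadIdx.x;

        sdata[tid] = (i < n) ? in[i] : 0.0;
        __syncthreads();

        // Tree reduction within the block: log2(blockDim.x) steps.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        if (tid == 0)
            block_sums[blockIdx.x] = sdata[0];
    }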

What is the fastest library for finding FFT on a GPU? [closed]

Which is the fastest library for computing FFTs on a GPU? Please give answers for both NVIDIA and ATI cards and, if possible, include timing figures.
Thanks.
For NVIDIA GPUs, look at the CUFFT library. As far as I can tell, AMD has not productized FFT on ATI GPUs yet, but it might be worth looking at the ACML-GPU library. You could also look at OpenCL FFT libraries, which should work on both vendors' GPUs.
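For CUFFT the basic usage is short enough to show (a minimal sketch, assuming a single-precision 1-D complex-to-complex transform of length N with the data already in device memory; the function name is my own):

    #include <cufft.h>

    // Forward, in-place, single-precision 1-D FFT of length N on device data.
    void forward_fft(cufftComplex *d_data, int N)
    {
        cufftHandle plan;
        cufftPlan1d(&plan, N, CUFFT_C2C, 1);                  // plan one transform
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);    // execute in place
        cufftDestroy(plan);                                   // release the plan
    }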
Giving timing figures is impossible, because it varies so much depending on the actual hardware you have, your problem size, etc.
The NukadaFFT library is supposed to be the highest-performance FFT implementation on NVIDIA hardware. There are links to papers documenting the library's performance; in some cases throughput is claimed to be 25% higher than running the same FFT with CUFFT. That comes at a price in flexibility, because the code only supports transforms up to radix 32.