Slightly odd situation, but I wish to use the BSF/BSR instructions inside CUDA. Just wondering if there's any way to run these instructions in CUDA without a lot of overhead.
The list of integer intrinsics is available in the documentation. You can use the __clz intrinsic, for example, to mimic BSR. For BSF, I think __ffs should do most of the job.
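For instance, a minimal sketch (the wrapper names `bsr`/`bsf` are mine) of how the two intrinsics can stand in for the instructions:

```cuda
// BSR: index of the highest set bit. __clz() counts leading zeros in 32 bits.
__device__ int bsr(unsigned int x)   // x must be non-zero
{
    return 31 - __clz(x);
}

// BSF: index of the lowest set bit. __ffs() is 1-based and returns 0 for x == 0.
__device__ int bsf(unsigned int x)   // x must be non-zero
{
    return __ffs(x) - 1;
}
```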
Related
I have a large program that uses all the registers I allocated per thread (64) and spills to local memory. I would like to be able to tell the compiler which variables should remain in registers at all cost, and which ones I don't really care about. Does the "register" C/C++ keyword work in nvcc? Is there a different mechanism perhaps?
Thanks!
You can use register in CUDA C/C++ if you want to. In any context, it is only a hint to the compiler. It may be ignored. There is no stated guarantee that it does anything at all.
I think these statements are pretty much true for most language implementations of register.
I also think it's quite likely that the compiler can do a better job than you can of deciding what should be in registers, and appropriate priority.
The typical CUDA C/C++ mechanisms for controlling register usage work at a higher level; they are (see the sketch after the list):
the -maxrregcount compile switch
the __launch_bounds__ directive.
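A hedged illustration of both mechanisms (the kernel and the numbers are made up, not recommendations):

```cuda
// File-wide cap on registers per thread (applies to every kernel in the file):
//   nvcc -maxrregcount=32 kernel.cu
//
// Per-kernel alternative: __launch_bounds__ tells the compiler the maximum
// block size (and, optionally, a minimum number of resident blocks per SM),
// so it can budget registers to meet that occupancy.
__global__ void
__launch_bounds__(256, 2)   // at most 256 threads/block, aim for 2 blocks/SM
scaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}
```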
In CUDA there are __ballot(), __any(), __all(), __popc() and a bunch of lanemask functions to perform warp voting operations across all lanes (usually of size 32) within a warp. I'm wondering whether there are any such functions implemented in OpenCL to perform the same operations within one wavefront. If there are no such functions, I may need to implement them myself as inline functions to use in my project.
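For reference, a minimal sketch of how I use them in CUDA (the kernel is illustrative only; newer CUDA releases prefer the *_sync variants such as __ballot_sync()):

```cuda
__global__ void voteDemo(const int *in, unsigned int *out)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int pred = in[tid] > 0;                // each lane's vote

    unsigned int ballot = __ballot(pred);  // bit i is set iff lane i voted true
    int count = __popc(ballot);            // how many lanes voted true
    if (__all(pred))  count = 32;          // every lane voted true
    if (!__any(pred)) count = 0;           // no lane voted true

    if ((threadIdx.x & 31) == 0)           // lane 0 of each warp writes the result
        out[tid / 32] = count;
}
```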
According to the OpenCL v. 1.1 specification, section 6.11 "Built-in Functions", I believe that the answer is no.
However on NVIDIA GPUs, you can probably use inline PTX to implement these things (or at least this blogger was able to use inline PTX).
Actually, check out OpenCL subgroups. They define some cross-lane functions like sub_group_all() and sub_group_any(), as well as some other interesting things.
Subgroups are a relatively new critter and I am not sure who all supports them. The Intel GPU implementation (an extension, actually) has a few more interesting shuffle functions to permute lanes (within the register file), as well as explicit block writes and reads. I bet AMD also supports subgroups, but I am not sure about NVidia.
I think my kernel is memory bound (because most GPGPU code is memory bound), but I don't actually know for sure. How can I find this out for myself? One probably has to use the visual profiler, as it depends on the GPU used.
If it is explained in the CUDA Programming Guide or in other NVIDIA documentation, don't hesitate to just post a link with a page number, so I can read up on it myself.
Clarification
I would prefer a general "rule" for determining the limiting factor, but in my specific case you can find details about my kernel here: Using `overlap`, `kernel time` and `utilization` to optimize one's kernels
This presentation from NVIDIA talks about selectively disabling memory accesses and arithmetic in your kernel by modifying your source code, in order to determine whether one of them is limiting your performance.
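A rough sketch of the idea (the kernel and the MEMORY_ONLY/COMPUTE_ONLY macros are mine, not from the presentation): compile the same kernel three ways and compare the timings.

```cuda
__global__ void testKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

#if defined(MEMORY_ONLY)
    out[i] = in[i];                       // keep the loads/stores, drop the math
#elif defined(COMPUTE_ONLY)
    float v = (float)i;                   // keep the math, drop the memory traffic
    for (int k = 0; k < 64; ++k)
        v = v * 1.0001f + 0.5f;
    if (v == -1.0f) out[i] = v;           // never taken: stops dead-code elimination
#else
    float v = in[i];                      // full version
    for (int k = 0; k < 64; ++k)
        v = v * 1.0001f + 0.5f;
    out[i] = v;
#endif
}
```

If the MEMORY_ONLY timing is close to the full timing, the kernel is memory bound; if the COMPUTE_ONLY timing is close, it is compute bound.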
A nice trick that requires no source code modification can be used for code compiled for compute capability 2.0 and above (based on an answer here):
using the "--use_fast_math" flag, one can easily increase or decrease compute pressure.
if setting this flag gives a large speed-up, that indicates a compute-bound kernel.
if setting this flag gives little to no speed-up, that indicates a balanced or memory-bound kernel.
I thought I would pitch in an answer even though there is an accepted answer and this question is old.
I had a similar problem in my code, although at the time I didn't know it.
I ran the NVIDIA Visual Profiler (nvvp) and analysed my program. I found that the profiler had detected that my program was limited in some fashion, and it made some recommendations.
A great tool to use if you are unsure where to begin.
I know that there is a restriction to call only __device__ functions in a kernel. This prevents me from calling standard functions like strcmp() and so on in the kernel.
At this point I am not able to understand/find the reasons for this. Couldn't the compiler just follow every include in string.h and so on, inlining the calls to strcmp() in the kernel? I guess the reason I am looking for is simple and I am missing something here.
Is reimplementing all the functions and datatypes I need for kernel computation the only way? Is there a codebase with such reimplementations?
Yes, the only way to use the stdlib's functions from a kernel is to reimplement them. But I strongly advise you to reconsider this idea, since it's highly unlikely you would need to run code that uses strcmp() on a GPU. Please add additional details about your problem, so a better solution can be proposed (I highly doubt that serial string comparison on a GPU is what you really need).
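That said, if you really do need it, a minimal sketch of such a reimplementation (the name dev_strcmp is mine):

```cuda
// strcmp() rewritten as a __device__ function so it can run in a kernel.
__device__ int dev_strcmp(const char *a, const char *b)
{
    while (*a && *a == *b) { ++a; ++b; }
    return (unsigned char)*a - (unsigned char)*b;
}
```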
It's hardly possible to simply recompile the whole stdlib for the GPU, since it depends heavily on system calls (like memory allocation), which cannot be used on the GPU (well, in recent versions of the CUDA toolkit you can allocate device memory from a kernel, but it's not "the CUDA way", it is supported only by the newest hardware, and it is very bad for performance).
Besides, the CPU versions of most functions are far from "good" for GPUs. So, in the vast majority of cases, compiling your ordinary CPU functions for the GPU would lead to no good, and the compiler doesn't even try it.
Standard functions like strcmp() have not been compiled for the CUDA architecture. I have not seen any standard C libraries for CUDA.
There are several ways of using CUDA:
auto-parallelizing tools such as PGI Workstation;
wrappers such as Thrust (in STL style);
the NVIDIA GPU SDK (runtime/driver API).
Which one is better for performance, learning curve, or other factors?
Any suggestions?
Performance rankings will likely be 3, 2, 1.
Learning curve is (1+2), 3.
If you become a CUDA expert, then it will be next to impossible to beat the performance of your hand-rolled code using the GPU SDK, because of the control it gives you and all the tricks in the book it lets you use.
That said, a wrapper like Thrust is written by NVIDIA engineers and has been shown on several problems to reach 90-95+% of the efficiency of hand-rolled CUDA. The reductions, scans, and many cool iterators it provides are useful for a wide class of problems too.
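For a feel of what that buys you, a minimal sketch of a Thrust sum reduction (the sizes are made up):

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <cstdio>

int main()
{
    thrust::device_vector<float> d(1 << 20, 1.0f);  // a million ones on the device
    float sum = thrust::reduce(d.begin(), d.end(), 0.0f, thrust::plus<float>());
    std::printf("sum = %f\n", sum);                 // expect 1048576
    return 0;
}
```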
Auto-parallelizing tools tend not to do quite as good a job with the different memory types that karlphillip mentioned.
My preferred workflow is using Thrust to write as much as I can and then using the GPU SDK for the rest. This is largely a factor of not trading away too much performance to reduce development time and increase maintainability.
Go with the traditional CUDA SDK, for both performance and smaller learning curve.
CUDA exposes several types of memory (global, shared, texture) which have a dramatic impact on the performance of your application; there are great articles about this on the web.
This page is very interesting and mentions the great series of articles about CUDA on Dr. Dobb's.
I believe that the NVIDIA GPU SDK is the best, with a few caveats. For example, try to avoid using the cutil.h functions, as these were written solely for use with the SDK, and I personally, as well as many others, have run into problems and bugs with them that are hard to fix. (There is also no documentation for this "library", and I've heard that NVIDIA does not support it at all.)
Instead, as you mentioned, use one of the two provided APIs. In particular I recommend the Runtime API, as it is a higher-level API, so you don't have to worry quite as much about all of the low-level implementation details as you do with the Driver API.
Both APIs are fully documented in the CUDA Programming Guide and CUDA Reference Guide, both of which are updated and provided with each CUDA release.
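As a hedged sketch of the Runtime API style (the kernel and sizes are made up): device memory management plus the <<< >>> launch syntax, with the low-level context handling done for you.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *x, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main()
{
    const int n = 1024;
    float h[n], *d;
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);     // one thread per element
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return 0;
}
```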
It depends on what you want to do on the GPU. If your algorithm would benefit greatly from the things Thrust can offer, like reduction or prefix sum, then Thrust is definitely worth a try, and I bet you can't write the code faster yourself in pure CUDA C.
However, if you're porting already-parallel algorithms from the CPU to the GPU, it might be easier to write them in plain CUDA C. I have already had successful projects with a good speedup going this route, and the CPU/GPU code that does the actual calculations is almost identical.
You can combine the two paradigms to some extent, but as far as I know you're launching new kernels for each Thrust call; if you want everything in one big fat kernel (taking too-frequent kernel starts out of the equation), you have to use plain CUDA C with the SDK.
I find pure CUDA C actually easier to learn, as it gives you quite a good understanding of what is going on on the GPU. Thrust adds a lot of magic between your lines of code.
I never used auto-parallelizing tools such as PGI Workstation, but I wouldn't advise adding even more "magic" into the equation.