Is it better to use a float instead of an int in CUDA?
Does a float decrease bank conflicts and ensure coalescing? (Or does it have nothing to do with this?)
Bank conflicts when reading shared memory depend on the size and layout of the data being read, not the data type. Since int and float are the same size (at least I think they are on all CUDA platforms), there's no difference.
Coalescing usually refers to global memory accesses, and again this is determined by the number of bytes read and the access pattern, not the datatype.
Both int and float are four bytes, so it doesn't make any difference (if you're accessing them both the same way) which you use in terms of coalescing your global memory accesses or bank conflicts on shared memory accesses.
Having said that, you may get better performance with floats, since the devices are designed to crunch them as fast as possible; ints are often used for control and indexing, and hence have lower throughput. Of course, it's really more complicated than that: if you had nothing but floats, the integer hardware would sit idle, which would be a waste.
Bank conflicts and coalescence are all about memory access patterns (whether the threads within a warp all read/write to different locations with uniform stride). Thus, these concerns are independent of data type (float, int, double, etc.)
Note that data type does have an impact on the computation performance. Single precision float is faster than double precision etc. The beefy FPUs in the GPUs generally means that doing calculations in fixed point is unnecessary and may even be detrimental.
Take a look at the "Mathematical Functions" section of the CUDA C Programming Guide. Using device runtime functions (intrinsic functions) may provide better performance for various types: you can perform several operations in a single instruction, in fewer clock cycles.
For some of the functions of Section C.1, a less accurate, but faster version exists in the device runtime component; it has the same name prefixed with __ (such as __sinf(x)). The compiler has an option (-use_fast_math) that forces each function in the table to compile to its intrinsic counterpart... selectively replace mathematical function calls by calls to intrinsic functions only where it is merited by the performance gains and where changed properties such as reduced accuracy and different special case handling can be tolerated.
For example, instead of x/y use __fdividef(x, y), and instead of sinf(x) use __sinf(x).
You may also find operations like x + c*y performed in a single instruction (a fused multiply-add).
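As a minimal sketch (the kernel name and the 2.0f constant are just illustrative), the fast-math intrinsics can be dropped in wherever the reduced accuracy is acceptable:
__global__ void fast_math_example(const float* x, const float* y, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;
    // __fdividef and __sinf are the faster, less accurate intrinsic versions
    float ratio = __fdividef(x[i], y[i]);
    // __fmaf_rn(a, b, c) computes a*b + c as a single fused multiply-add
    out[i] = __fmaf_rn(ratio, 2.0f, __sinf(x[i]));
}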
Related
Suppose a thread of a kernel is trying to update 4 different places on the shared memory. Can I cause that operation to fail and be reversed if any other thread has overwritten any of those locations? Specifically, can this be performed atomically?
mem[a] = x;
mem[b] = y;
mem[c] = z;
mem[d] = w;
No, except for a special case.
This can't be performed atomically in the general case, where a, b, c, and d are arbitrary (i.e. not necessarily adjacent), and/or x, y, z, and w are each 32 bits or larger.
I'm using "atomically" to refer to an atomic RMW operation that the hardware provides.
Such operations are limited to a maximum of 64-bits total, so 4 32-bit or larger quantities could not work. Furthermore all data must be contiguous and "naturally" aligned, so independent locations cannot be accessed in a single atomic cycle.
In the special case where the 4 quantities are 16-bit or 8-bit quantities, and adjacent and aligned, you could use a custom atomic.
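As a hedged sketch of that special case (the helper name and byte packing are my own; only atomicCAS is the real CUDA intrinsic): pack the four adjacent, aligned 8-bit values into a single 32-bit word and commit them with one compare-and-swap, which fails if any other thread has changed the word in the meantime:
// Hypothetical helper: 'mem32' must point to a 4-byte-aligned word (e.g. in shared memory)
// holding the four 8-bit slots. 'expected' is the value read before computing x, y, z, w.
// Returns true only if no other thread modified the word, i.e. the update "commits".
__device__ bool try_update4(unsigned int* mem32, unsigned int expected,
                            unsigned char x, unsigned char y,
                            unsigned char z, unsigned char w)
{
    unsigned int newval = (unsigned int)x
                        | ((unsigned int)y << 8)
                        | ((unsigned int)z << 16)
                        | ((unsigned int)w << 24);
    // atomicCAS writes newval only if *mem32 still equals 'expected'
    return atomicCAS(mem32, expected, newval) == expected;
}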
Alternatives to consider:
You can use critical sections to achieve such things, probably at considerable cost in performance, code complexity, and fragility.
Another alternative is to recast your algorithm to use some form of parallel reduction. Since you appear to be operating at the threadblock level, this may be the best approach.
I'd like to optimize the random access read, and random access write in the following code:
__global__ void kernel(float* input, float* output, float* table, size_t size)
{
    int x_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (x_id >= size)   // >= rather than > to avoid reading one element past the end
        return;

    float in_f = input[x_id];
    int in_i = (int)(floorf(in_f));
    int table_index = (int)((in_f - float(in_i)) * 1024000.0f);

    float* t = table + table_index;
    output[table_index] = t[0] * in_f;
}
As you can see, the index to the table and to the output are determined at run-time, and completely random.
I understand that I can use texture memory or __ldg() for reading such data.
So, my questions are:
Is there a better way to read randomly indexed data than using texture memory or __ldg()?
What about random-access writes, as in the case of output[table_index] above?
Actually, I'm adding the code here only to give an example of random-access reads and writes. I don't need the code itself optimized; I just need a high-level description of the best way to deal with such a situation.
There are no magic bullets for random access of data on the GPU.
The best advice is to attempt to perform data reorganization or some other method to regularize the data access. For repeated/heavy access patterns, even such intensive methods as sorting operations on the data may result in an overall net improvement in performance.
Since your question implies that the random access is unavoidable, the main thing you can do is intelligently make use of caches.
The L2 is a device-wide cache, and all DRAM accesses go through it. So thrashing of the L2 may be unavoidable if you have large-scale random accesses. There aren't any functions to disable (selectively or otherwise) the L2 for either read or write accesses (*).
For smaller scale cases, the main thing you can do is route the accesses through one of the "non-L1" caches, i.e. the texture cache (on all GPUs) and the Read-Only cache (i.e. __ldg()) on cc3.5 and higher GPUs. The use of these caches may help in 2 ways:
For some access patterns that would thrash the linear-organized L1, you may get some cache hits in the texture or read-only cache, due to a different caching strategy employed by those caches.
On devices that also have an L1 cache in use, routing the "random" traffic through an alternate cache will keep the L1 "unpolluted" and therefore less likely to thrash. In other words, the L1 may still provide caching benefit for other accesses, since it is not being thrashed by the random accesses.
Note that the compiler may route traffic through the read-only cache for you, without explicit use of __ldg(), if you decorate appropriate pointers with const __restrict__ as described here
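A minimal sketch of that decoration (the kernel and parameter names are just placeholders); the explicit __ldg() form shown in the comment is equivalent on cc3.5+ devices:
// 'table' and 'indices' are read-only for the lifetime of the kernel, so the compiler
// is free to route their loads through the read-only (texture) cache.
__global__ void lookup(const float* __restrict__ table,
                       const int* __restrict__ indices,
                       float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;
    // explicit alternative: float v = __ldg(&table[indices[i]]);
    out[i] = table[indices[i]];
}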
You can also use cache control hints on loads and stores.
Similar to the above advice for protecting the L1, it may make sense on some devices, in some cases, to perform loads and stores in an "uncached" fashion. You can generally get the compiler to handle this for you through the use of the volatile keyword: keep both an ordinary and a volatile pointer to the same data, so that accesses you can regularize go through the "ordinary" pointer while "random" accesses go through the volatile version. Other mechanisms for pursuing uncached access are the ptxas compiler switches (e.g. -Xptxas -dlcm=cg), or managing the load/store operations via inline PTX with appropriate caching modifiers.
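A rough sketch of the dual-pointer idea (names and the arithmetic are illustrative only):
__global__ void process(float* data, const int* rand_idx, int n)
{
    // Same buffer viewed two ways: regular, coalesced accesses go through 'data',
    // scattered accesses go through the volatile alias so they are issued uncached.
    volatile float* data_uncached = data;

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    float v = data[i];                     // regular read
    data_uncached[rand_idx[i]] = v * 2.0f; // "random" write through the volatile alias
}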
The "uncached" advice is the main advice I can offer for "random" writes. Use of the surface mechanism might provide some benefit for some access patterns, but I think it is unlikely to make any improvement for random patterns.
(*) This has changed in recent versions of CUDA and for recent GPU families such as Ampere (cc 8.x). There is a new capability to reserve a portion of the L2 for data persistence. Also see here
There is a huge bunch of data waiting to be processed with a machine learning algorithm on the CUDA device. However, I have some concerns about the device's memory, so I am trying to use float numbers instead of double (I guess this is a good solution unless someone suggests better). Is there any way to keep double precision for results obtained from float numbers? I guess not, and maybe this is a slightly silly question. So what is the correct way to handle a huge data instance on the device?
No, there's no way to keep double precision in the results if you process the data as float. Handle it as double. If memory size is a problem, the usual approach is to handle the data in chunks. Copy a chunk to the GPU, start the GPU processing, and while the processing is going on, copy more data to the GPU, and copy some of the results back. This is a standard approach to handling problems that "don't fit" in the GPU memory size.
This is called overlap of copy and compute, and you use CUDA streams to accomplish this. The CUDA samples (such as simple multi-copy and compute) have a variety of codes which demonstrate how to use streams.
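A rough host-side sketch of that chunking pattern, assuming pinned host buffers h_in/h_out (allocated with cudaHostAlloc), device buffers d_in/d_out, a total element count 'total', and a hypothetical myKernel:
// Two streams so the copies of one chunk can overlap with the compute of another.
const size_t chunk = 1 << 20;   // elements per chunk (placeholder value)
cudaStream_t streams[2];
for (int s = 0; s < 2; ++s)
    cudaStreamCreate(&streams[s]);

for (size_t offset = 0; offset < total; offset += chunk) {
    int s = (offset / chunk) % 2;   // alternate between the two streams
    size_t count = (total - offset < chunk) ? (total - offset) : chunk;

    // H2D copy, kernel and D2H copy are queued in the same stream, so they run in
    // order for this chunk, while work queued in the other stream may overlap.
    cudaMemcpyAsync(d_in + offset, h_in + offset, count * sizeof(double),
                    cudaMemcpyHostToDevice, streams[s]);
    myKernel<<<(count + 255) / 256, 256, 0, streams[s]>>>(d_in + offset,
                                                          d_out + offset, count);
    cudaMemcpyAsync(h_out + offset, d_out + offset, count * sizeof(double),
                    cudaMemcpyDeviceToHost, streams[s]);
}
cudaDeviceSynchronize();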
You can indeed compute double precision results from floating point data. At any point in your calculation you can cast a float value to a double value, and according to standard C type promotion rules from there on all calculations with this value will be in double precision.
This applies as long as you use double precision variables to store the result and don't cast it to any other type. Beware of implicit casts in function calls.
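To make the promotion concrete, a deliberately simple (serial) sketch: the inputs stay float in memory, but each value is cast to double before accumulation, so the result keeps double precision:
__global__ void sum_as_double(const float* data, double* result, int n)
{
    double acc = 0.0;               // double-precision accumulator
    for (int i = 0; i < n; ++i)
        acc += (double)data[i];     // each float is promoted to double here
    *result = acc;                  // stored in a double, no narrowing cast back to float
}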
CUDA provides built-in vector data types like uint2, uint4 and so on. Are there any advantages to using these data types?
Let's assume that I have a tuple which consists of two values, A and B. One way to store them in memory is to allocate two arrays: the first array stores all the A values and the second array stores all the B values at indexes that correspond to the A values. Another way is to allocate one array of type uint2. Which one should I use? Which way is recommended? Do the members of uint3, i.e. x, y and z, reside side by side in memory?
This is going to be a bit speculative but may add to #ArchaeaSoftware's answer.
I'm mainly familiar with Compute Capability 2.0 (Fermi). For this architecture, I don't think that there is any performance advantage to using the vectorized types, except maybe for 8- and 16-bit types.
Looking at the declaration for char4:
struct __device_builtin__ __align__(4) char4
{
    signed char x, y, z, w;
};
The type is aligned to 4 bytes. I don't know what __device_builtin__ does. Maybe it triggers some magic in the compiler...
Things look a bit strange for the declarations of float1, float2, float3 and float4:
struct __device_builtin__ float1
{
    float x;
};

__cuda_builtin_vector_align8(float2, float x; float y;);

struct __device_builtin__ float3
{
    float x, y, z;
};

struct __device_builtin__ __builtin_align__(16) float4
{
    float x, y, z, w;
};
float2 gets some form of special treatment. float3 is a struct without any alignment and float4 gets aligned to 16 bytes. I'm not sure what to make of that.
Global memory transactions are 128 bytes, aligned to 128 bytes. Transactions are always performed for a full warp at a time. When a warp reaches a function that performs a memory transaction, say a 32-bit load from global memory, the chip will at that time perform as many transactions as are necessary for servicing all the 32 threads in the warp. So, if all the accessed 32-bit values are within a single 128-byte line, only one transaction is necessary. If the values come from different 128-byte lines, multiple 128-byte transactions are performed. For each transaction, the warp is put on hold for around 600 cycles while the data is fetched from memory (unless it's in the L1 or L2 caches).
So, I think the key to finding out what type of approach gives the best performance, is to consider which approach causes the fewest 128-byte memory transactions.
Assuming that the built-in vector types are just structs, some of which have special alignment, using the vector types causes the values to be stored in an interleaved way in memory (an array of structs). So, if the warp is loading all the x values at that point, the other values (y, z, w) will be pulled into L1 because of the 128-byte transactions. When the warp later tries to access those, it's possible that they are no longer in L1, so new global memory transactions must be issued. Also, if the compiler is able to issue wider instructions to read more values at the same time, for future use, it will use registers to store those between the point of the load and the point of use, perhaps increasing the register usage of the kernel.
On the other hand, if the values are packed into a struct of arrays, the load can be serviced with as few transactions as possible. So, when reading from the x array, only x values are loaded in the 128-byte transactions. This could cause fewer transactions, less reliance on the caches and a more even distribution between compute and memory operations.
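A minimal sketch of the two layouts for the A/B tuple from the question (the struct and kernel names are mine, just to illustrate the idea):
struct PairAoS { unsigned int a, b; };   // AoS: one array of these, a and b interleaved

struct PairsSoA {                        // SoA: two separate, tightly packed arrays
    unsigned int* a;
    unsigned int* b;
};

// With SoA, a warp that only needs the 'a' values reads contiguous 32-bit words,
// so the loads coalesce into as few 128-byte transactions as possible.
__global__ void touch_a_only(PairsSoA p, unsigned int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = p.a[i] + 1u;
}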
I don't believe the built-in tuples in CUDA ([u]int[2|4], float[2|4], double[2]) have any intrinsic advantages; they exist mostly for convenience. You could define your own C++ classes with the same layout and the compiler would operate on them efficiently. The hardware does have native 64-bit and 128-bit loads, so you'd want to check the generated microcode to know for sure.
As for whether you should use an array of uint2 (array of structures, or AoS) or two arrays of uint (structure of arrays, or SoA), there are no easy answers - it depends on the application. For built-in types of convenient size (2x32-bit or 4x32-bit), AoS has the advantage that you only need one pointer to load/store each data element. SoA requires multiple base pointers, or at least multiple offsets and separate load/store operations per element; but it may be faster for workloads that sometimes only operate on a subset of the elements.
As an example of a workload that uses AoS to good effect, look at the nbody sample (which uses float4 to hold XYZ+mass of each particle). The Black-Scholes sample uses SoA, presumably because float3 is an inconvenient element size.
There's some good info in another thread that contradicts many of the major conclusions stated here.
I need to keep track of around 10000 elements of an array in my algorithm, so I need a boolean for each record. If I used a char array to keep track of those 10000 elements (as 0/1), it would take up a lot of memory.
So, can I create a bit array of 10000 bits in CUDA, where each bit represents a corresponding array record?
As Roger said, the answer is yes, CUDA provides the same bitwise operations (i.e. >>, << and &) as normal C so you can implement your bit array essentially normally (almost, see thread synchronisation issues below).
However, for your situation it is almost certainly not a good idea.
There are problems with thread synchronisation. Imagine two of the threads on your GPU are inverting two bits of a single entry of your array. Each thread will read the same value out of memory and apply its operation to it, but the thread that writes its value back to memory last will overwrite the result of the other thread. (Note: if your bit array is not being modified by the GPU code then this isn't a problem.) One way around this is to do the read-modify-write atomically, as in the sketch below.
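A hedged sketch of that approach (the helper names are mine; atomicOr and atomicAnd are the real CUDA intrinsics), storing the bit array as 32-bit words:
// Around (10000 + 31) / 32 = 313 words are needed for 10000 bits.
__device__ void set_bit(unsigned int* bits, int i)
{
    atomicOr(&bits[i / 32], 1u << (i % 32));     // set bit i without losing concurrent updates
}

__device__ void clear_bit(unsigned int* bits, int i)
{
    atomicAnd(&bits[i / 32], ~(1u << (i % 32))); // clear bit i atomically
}

__device__ bool get_bit(const unsigned int* bits, int i)
{
    return (bits[i / 32] >> (i % 32)) & 1u;
}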
And, unless this is explicitly required, you shouldn't be optimising for memory use; an array with 10K elements does not take much memory at all. Even if you stored each boolean in a 64-bit integer it would only be 80 KB, and obviously you can store them in a smaller datatype. (You should only start worrying about compressing the array as much as possible when you get upwards of tens of millions, or even hundreds of millions, of elements.)
Also, the way GPUs work means that you might get the best performance by using a reasonably large data type for each boolean (most likely a 32-bit one) so that, for example, memory coalescing works better. (I haven't tested this assertion; you would need to run some benchmarks to check it.)