I want to use cudaMemcpy to create a ones vector (1,...,1) so that I can do things like sum the rows/columns of a matrix or add a vector to a matrix with CUBLAS. The code will be run on different platforms, so
how can I guarantee that I'm always going to be working with 4-byte floats?
Is there a sizeof function for data types on the GPU, or does the device always use the same data type specifications as the host?
Answering your second question first, the device always uses the same data type specification as the host compiler. So you can use sizeof(...) on the host to determine relevant sizes. Many things would be broken if this were not the case.
To answer your first question, then, we need only ask whether, amongst the supported host-side compilers for CUDA, the float representation is always 32 bits. The answer is yes.
As an aside, note that this is true on most platforms generally: finding a system with anything other than 32-bit floats is difficult. But as far as I know, there is no general C or C++ requirement that float be 32 bits. Someone else may prove me wrong.
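If it helps, here is a minimal sketch of building such a ones vector (assuming you simply want n floats set to 1.0f on the device; the function name is just for illustration):

#include <vector>
#include <cuda_runtime.h>

static_assert(sizeof(float) == 4, "float is expected to be 4 bytes");

float *make_ones(int n)
{
    std::vector<float> h_ones(n, 1.0f);               // host-side vector of ones
    float *d_ones = nullptr;
    cudaMalloc((void **)&d_ones, n * sizeof(float));  // sizeof(float) is identical on host and device
    cudaMemcpy(d_ones, h_ones.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    return d_ones;
}

You can then pass d_ones to cublasSgemv together with your matrix to sum its rows or columns.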
I need to decode an RLE in CUDA and I have been trying to think about the most efficient way of expanding the RLE into a list with all my values. Say my values are 2, 3, 4 and my runs are 3, 3, 1; I want to expand that to 2, 2, 2, 3, 3, 3, 4.
At first I thought I could use cudaMemset, but I am pretty sure now that that launches a kernel, and since I only have CUDA Compute Capability 3.0 I do not have dynamic parallelism available, quite apart from it probably being inefficient to launch a new kernel for each value/run pair.
So I want to know if this solution is sound before I go and implement it, since so many things end up not working well on CUDA if you aren't being clever. Would it be reasonable to make a kernel that will call cudaMalloc and then cudaMemcpy to the destination? I can easily compute the prefix sums to know where to copy the memory to and from, and make all my reads at least coalesced. What I am worried about is calling cudaMalloc and cudaMemcpy so many times.
Another potential option is writing these values to shared memory and then copying those to global memory. I want to know if my first solution should work and be efficient, or if I have to do the latter.
You don't want to think about doing a separate operation (e.g. cudaMalloc, or cudaMemset) for each value/run pair.
After computing the prefix sum on the run sequence, the last value in the prefix sum will be the total allocation size. Use that for a single cudaMalloc operation for the entire final expanded sequence.
Once you have the necessary space allocated, and the prefix sum computed, the actual expansion is pretty straightforward.
Thrust can make this pretty easy if you want a fast prototype; there is example code for it.
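As a rough sketch of what the expansion can look like once the prefix sum and the single allocation are in place (the names here are hypothetical; offsets is the exclusive prefix sum of the runs, and total is its last value plus the last run length):

__global__ void expand_rle(const int *values, const int *offsets,
                           int num_runs, int *out, int total)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= total) return;

    // Binary search: find the last run whose starting offset is <= i
    int lo = 0, hi = num_runs - 1;
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;
        if (offsets[mid] <= i) lo = mid;
        else                   hi = mid - 1;
    }
    out[i] = values[lo];   // every output element in this run gets the same value
}

Each thread writes exactly one output element, so the writes are coalesced and there is a single kernel launch no matter how many value/run pairs you have.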
@RobertCrovella is of course correct, but you can go even further in terms of efficiency if you have the leeway to slightly tweak your compression scheme.
Sorry for the self-plug, but you might be interested in my own implementation of a variant of Run-Length Encoding, with the addition of anchoring of output positions into the input (e.g. "in which offset in which run do we have the 2048th element?"); this allows for a more equitable assignment of work to thread blocks and avoids the need for a full-blown prefix sum. It's still a work in progress, so I only get ~34 GB/sec on a 336 GB/sec memory-bandwidth card (Titan X) at the time of writing, but it's quite usable.
There is a huge bunch of data waiting to be processed with a machine learning algorithm on the CUDA device. However, I have some concerns about the memory of the device, so I am trying to use float numbers instead of double (I guess that is a good solution unless someone suggests better). Is there any way of keeping double precision for results obtained from float numbers? I guess not; maybe this is a bit of a silly question. So what is the correct way to handle a huge data instance on the device?
No, there's no way to keep double precision in the results if you process the data as float. Handle it as double. If memory size is a problem, the usual approach is to handle the data in chunks. Copy a chunk to the GPU, start the GPU processing, and while the processing is going on, copy more data to the GPU, and copy some of the results back. This is a standard approach to handling problems that "don't fit" in the GPU memory size.
This is called overlap of copy and compute, and you use CUDA streams to accomplish this. The CUDA samples (such as simple multi-copy and compute) have a variety of codes which demonstrate how to use streams.
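A minimal sketch of that pattern with two streams (the kernel process and the chunk size are placeholders; h_in and h_out should be pinned with cudaMallocHost for the async copies to actually overlap):

#include <cuda_runtime.h>

__global__ void process(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];                  // placeholder for the real work
}

void run_in_chunks(const float *h_in, float *h_out, float *d_in, float *d_out, int N)
{
    const int CHUNK = 1 << 20;
    cudaStream_t streams[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

    for (int off = 0, s = 0; off < N; off += CHUNK, s ^= 1) {
        int len = (N - off < CHUNK) ? (N - off) : CHUNK;
        cudaMemcpyAsync(d_in + off, h_in + off, len * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);        // copy next chunk in
        process<<<(len + 255) / 256, 256, 0, streams[s]>>>(d_in + off, d_out + off, len);
        cudaMemcpyAsync(h_out + off, d_out + off, len * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);        // copy results back
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < 2; ++s) cudaStreamDestroy(streams[s]);
}

Work alternates between the two streams, so the copies for one chunk can run while the previous chunk is still being processed. For brevity the sketch assumes full-sized device buffers; a truly out-of-core version would instead allocate one chunk-sized device buffer per stream.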
You can indeed compute double precision results from floating point data. At any point in your calculation you can cast a float value to a double value, and according to standard C type promotion rules from there on all calculations with this value will be in double precision.
This applies as long as you use double precision variables to store the result and don't cast it to any other type. Beware of implicit casts in function calls.
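For example, to accumulate float data in double precision (a minimal sketch; it works the same way whether it runs in host code or inside a kernel):

double dot_double(const float *x, const float *y, int n)
{
    double acc = 0.0;                        // keep the running sum in double
    for (int i = 0; i < n; ++i)
        acc += (double)x[i] * (double)y[i];  // floats are promoted before the multiply
    return acc;                              // don't cast the result back to float
}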
I'm working on an algorithm that has to do a small number of operations on a large number of small arrays, somewhat independently. To give an idea:
1k sorts of arrays of typically 0.5k-1k elements each.
1k LU-solves of matrices of rank 10-20.
Everything is in floats.
Then, there is some horizontality to this problem: the above operations have to be carried out independently on 10k arrays. Also, the intermediate results need not be stored: for example, I don't need to keep the sorted arrays, only the sum of the smallest m elements.
The whole thing has been programmed in C++ and runs. My question is: would you expect a problem like this to enjoy significant speed-ups (a factor of 2 or more) with CUDA?
You can run this in 5 lines of ArrayFire code. I'm getting speedups of ~6X with this over the CPU, and ~4X over Thrust (which was designed for vectors, not matrices). Since you're only using a single GPU, you can run the free version of ArrayFire.
// Example 1: sort each 512-element column independently
array x = randu(512,1000,f32);
array y = sort(x);

// Example 2: LU-decomposition of each 15x15 matrix in a batch of 1000
array a = randu(15,15,1000,f32), b;
gfor (array i, a.dim(2))
    b(span,span,i) = lu(a(span,span,i));
Keep in mind that GPUs perform best when memory accesses are aligned to multiples of 32, so a bunch of 32x32 matrices will perform better than a bunch of 31x31.
If you "only" need a factor of 2 speed up I would suggest looking at more straightforward optimisation possibilities first, before considering GPGPU/CUDA. E.g. assuming x86 take a look at using SSE for a potential 4x speed up by re-writing performance critical parts of your code to use 4 way floating point SIMD. Although this would tie you to x86 it would be more portable in that it would not require the presence of an nVidia GPU.
Having said that, there may even be simpler optimisation opportunities in your code base, such as eliminating redundant operations (useless copies and initialisations are a favourite) or making your memory access pattern more cache-friendly. Try profiling your code with a decent profiler to see where the bottlenecks are.
Note however that in general sorting is not a particularly good fit for either SIMD or CUDA, but other operations such as LU decomposition may well benefit.
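For example, a 4-wide SSE version of a simple saxpy-style loop might look like this (a sketch only; real code needs a scalar tail loop when n is not a multiple of 4, and benefits from aligned loads):

#include <xmmintrin.h>

// y[i] += a * x[i], processing four floats per iteration
void saxpy_sse(float a, const float *x, float *y, int n)
{
    __m128 va = _mm_set1_ps(a);                      // broadcast a into all four lanes
    for (int i = 0; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);             // load 4 floats (unaligned-safe)
        __m128 vy = _mm_loadu_ps(y + i);
        vy = _mm_add_ps(vy, _mm_mul_ps(va, vx));     // 4 multiply-adds at once
        _mm_storeu_ps(y + i, vy);
    }
}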
Just a few pointers you may already have incorporated:
1) If you just need the m smallest elements, you are probably better off just searching for the smallest element, removing it, and repeating m times.
2) Have you already parallelized the code on the CPU? OpenMP or similar...
3) Have you thought about buying better hardware? (I know it's not the nice thing to do, but if you want to reach performance goals for a specific application it's sometimes the cheapest possibility...)
If you want to do it in CUDA, it should work conceptually, so no big problems should occur. However, there are always the little things, which depend on experience and so on.
Consider the Thrust library for the sorting part; hopefully someone else can suggest a good LU-decomposition algorithm.
I am doing some programming on MIPS, which has a bunch of 32-bit registers, but I also know that you can store 64-bit integers. How does this work? Does the integer take up two registers? If so, how does the system know to combine the two registers into one long string of binary?
According to Wikipedia, the 32-bit MIPS instruction set includes "Load Double Word" and "Store Double Word" instructions that load/store a pair of consecutive registers from/to memory.
For the actual arithmetic, it looks like you typically have to use multiple instructions.
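To illustrate what "multiple instructions" means, this is roughly the logic the compiler emits for a 64-bit add on a 32-bit target, written out in C (on MIPS the carry test corresponds to an sltu instruction and the adds to addu):

#include <stdint.h>

// 64-bit add built from 32-bit halves: (ahi:alo) + (bhi:blo) -> (*rhi:*rlo)
void add64(uint32_t alo, uint32_t ahi, uint32_t blo, uint32_t bhi,
           uint32_t *rlo, uint32_t *rhi)
{
    uint32_t lo = alo + blo;        // add the low words
    uint32_t carry = (lo < alo);    // carry out of the low word
    *rlo = lo;
    *rhi = ahi + bhi + carry;       // add the high words plus the carry
}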
You need to check the documentation for your platform, since it may vary. For example, for MIPS 32-bits, check something like this quick reference (see the "C calling convention" part).
For more details, though, you'd need a more complete reference (the quick one doesn't list any 64-bit arithmetic instructions that I could see, so if they don't exist, you'd have to implement them yourself, and then you can use your own convention for how to store the values).
Is it better to use a float instead of an int in CUDA?
Does a float decrease bank conflicts and insure coalescence? (or has it nothing to do with this?)
Bank conflicts when reading shared memory depend on the size of the data each thread reads and on the access pattern, not on the data type. Since int and float are the same size (4 bytes on all CUDA platforms), there's no difference.
Coalescence usually refers to global memory accesses - and again, this is to do with the number of bytes read, not the datatype.
Both int and float are four bytes, so it doesn't make any difference (if you're accessing them both the same way) which you use in terms of coalescing your global memory accesses or bank conflicts on shared memory accesses.
Having said that, you may get better performance with floats, since the devices are designed to crunch them as fast as possible; ints are more often used for control and indexes, etc., and hence have lower performance. Of course it's really more complicated than that - if you had nothing but floats then the integer hardware would sit idle, which would be a waste.
Bank conflicts and coalescence are all about memory access patterns (whether the threads within a warp all read/write to different locations with uniform stride). Thus, these concerns are independent of data type (float, int, double, etc.)
Note that data type does have an impact on the computation performance. Single precision float is faster than double precision etc. The beefy FPUs in the GPUs generally means that doing calculations in fixed point is unnecessary and may even be detrimental.
Take a look at the "Mathematical Functions" section of the CUDA Developers Guide. Using device runtime (intrinsic) functions may provide better performance for various types: you can sometimes do the work of several operations in a single instruction and in fewer clock cycles.
For some of the functions of Section C.1, a less accurate, but faster version exists in the device runtime component; it has the same name prefixed with __ (such as __sinf(x)). The compiler has an option (-use_fast_math) that forces each function in the table to compile to its intrinsic counterpart... selectively replace mathematical function calls by calls to intrinsic functions only where it is merited by the performance gains and where changed properties such as reduced accuracy and different special case handling can be tolerated.
For example, instead of x/y use __fdividef(x, y), and instead of sinf(x) use __sinf(x).
And you may find more operations, like x + c*y, that can be performed with a single instruction (a fused multiply-add).
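For illustration, a kernel using the intrinsic versions might look like this (a sketch; whether the reduced accuracy is acceptable depends on your application):

__global__ void intrinsics_demo(const float *x, const float *y, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float s = __sinf(x[i]);            // faster, less accurate than sinf(x[i])
    float q = __fdividef(x[i], y[i]);  // faster, less accurate than x[i] / y[i]
    out[i] = __fmaf_rn(q, y[i], s);    // fused multiply-add: q * y[i] + s in one instruction
}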