Uniformly distributed pseudorandom integers inside CUDA kernel [duplicate] - cuda

This question already has answers here:
Generating random number within Cuda kernel in a varying range
How can I generate uniformly distributed pseudorandom integers within a kernel? As far as I know, the cuRAND API offers a discrete Poisson distribution, but not a uniform integer one.

I suggest two options within a kernel:
1) Use curand_uniform to obtain a random floating-point number from a uniform distribution, then map it to an integer interval:
float randu_f = curand_uniform(&localState);
randu_f *= (B - A + 0.999999f);           // do not use (B - A + 1); see the note on curand_uniform below
randu_f += A;
int randu_int = __float2int_rz(randu_f);  // truncate towards zero
__float2int_rz converts the single-precision floating-point value x to a signed integer in round-towards-zero mode.
Note that curand_uniform returns a sequence of pseudorandom floats uniformly distributed between 0.0 and 1.0, where 0.0 is excluded and 1.0 is included.
That is why you should scale by the largest float below 1 (or by something slightly less than 1, as in the 0.999999 above) rather than by (B-A+1): there is a small chance the generator returns exactly 1.0 and the result would go out of bounds. I have not verified, however, that this constant together with the floating-point operations on the GPU guarantees the result never exceeds the defined bounds.
2) Call curand, which returns a sequence of pseudorandom unsigned integers, and reduce it modulo the width of the interval:
int randu_int = A + curand(&localState) % (B - A + 1);  // (B - A + 1) covers [A, B] inclusive
However, the modulo operation is very expensive on the GPU, so method 1 is faster.
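
For completeness, here is a minimal self-contained sketch of option 1; the kernel name, the [A, B] bounds and the seed are illustrative choices, not something taken from the original question:

#include <curand_kernel.h>
#include <cstdio>

// Each thread draws one pseudorandom integer in [A, B] using curand_uniform.
__global__ void random_ints(int *out, int n, int A, int B, unsigned long long seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    curandState localState;
    curand_init(seed, tid, 0, &localState);       // one independent sequence per thread

    float randu_f = curand_uniform(&localState);  // uniform in (0.0, 1.0]
    randu_f *= (B - A + 0.999999f);               // avoid (B - A + 1): 1.0 is a possible return value
    randu_f += A;
    out[tid] = __float2int_rz(randu_f);           // truncate towards zero -> integer in [A, B]
}

int main()
{
    const int n = 256;
    int *d_out, h_out[n];
    cudaMalloc(&d_out, n * sizeof(int));
    random_ints<<<1, n>>>(d_out, n, 10, 20, 1234ULL);
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("first draws: %d %d %d\n", h_out[0], h_out[1], h_out[2]);
    cudaFree(d_out);
    return 0;
}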

CUDA float addition gives wrong answer (compared to CPU float ops) [closed]

I am new to CUDA. I was using CUDA to compute the dot product of float vectors and I came across a floating-point addition issue. In essence, the following is the simple kernel. I'm compiling with -arch=sm_50.
The basic idea is for thread 0 to add up the values of vector a.
__global__ void temp(float *a, float *b, float *c) {
    if (0 == threadIdx.x && blockIdx.x == 0 && blockIdx.y == 0) {
        float xx = 0.0f;
        for (int i = 0; i < LENGTH; i++) {   // LENGTH is a compile-time constant (1000 in the tests below)
            xx += a[i];
        }
        *c = xx;
    }
}
When I initialize 'a' with 1000 elements of 1.0 I get the desired result of 1000.00
but when I initialize 'a' with 1.1, I should get 1100.00xx but instead I am getting 1099.989014. The CPU implementation simply yields 1100.000024.
I am trying to understand what the issue is here! :-(
I even tried to count the number of 1.1 elements in the a vector and that yields 1000, which is expected. I even used atomicAdd and still I have the same issue.
I would be very grateful if someone could help me out here!
Best
EDIT:
The biggest concern here is the disparity between the CPU result and the GPU result! I understand floats can be off by a few decimal places, but the GPU error is very significant! :-(
It is not possible to represent 1.1 exactly in IEEE-754 floating-point representation. As @RobertCrovella mentioned in his comment, the computation performed on the CPU does not use the same IEEE-754 settings as the GPU one.
Indeed, 1.1 in single precision is stored as 0x3F8CCCCD, which is 1.10000002384185. When summing 1000 elements, the last bits get lost in rounding: one bit for the first addition, two bits after four additions, and so on, up to 10 bits after 1000. Depending on the rounding mode, you may effectively truncate those 10 bits for the last half of the operations, ending up summing 0x3F8CCC00, which is 1.09997558.
The result from CUDA divided by 1000 is 0x3F8CCC71, which is consistent with a calculation in 32 bits.
When compiling on the CPU, depending on the optimization flags, you may be using fast math, which works at the internal register precision. If vector registers are not used, that can be the x87 FPU, which has 80-bit precision. In that case, the computation reads 1.1 as a float (1.10000002384185), adds it 1000 times at the higher precision, hence loses no bits to rounding, ends up with 1100.00002384185, and displays 1100.000024, its round-to-nearest representation.
Depending on compilation flags, the truly equivalent computation on the CPU may require enforcing 32-bit floating-point arithmetic, which can be done, for example, with the addss instruction of the SSE2 instruction set.
You can also play with the /fp: option (MSVC) or -mfpmath (GCC) and inspect the generated instructions; the x87 assembly instruction fadd is the 80-bit precision addition.
All of this has nothing to do with GPU floating-point precision. It is rather a misunderstanding of the IEEE-754 standard and of the legacy x87 FPU behaviour.
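
As a host-only sanity check (an illustration, not part of the original answer), the effect can be reproduced by accumulating the same single-precision value into a float accumulator versus a double accumulator; only the accumulator width differs:

#include <cstdio>

int main()
{
    const float v = 1.1f;     // stored as 0x3F8CCCCD ~= 1.10000002384185791
    float  sum_f = 0.0f;      // 32-bit accumulator: rounds after every addition
    double sum_d = 0.0;       // 64-bit accumulator: mimics the extended-precision CPU path

    for (int i = 0; i < 1000; ++i) {
        sum_f += v;
        sum_d += v;           // v is still the single-precision 1.1; only the sum is wider
    }

    // The float sum drifts visibly below 1100 (comparable to the GPU result),
    // while the double sum prints 1100.000024, matching the CPU result in the question.
    printf("float accumulator : %f\n", sum_f);
    printf("double accumulator: %f\n", sum_d);
    return 0;
}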

Grid and Block dimension in (py)CUDA [duplicate]

This question already has answers here:
How do I choose grid and block dimensions for CUDA kernels?
I have a question regarding the dimensions of blocks and grids in (py)CUDA. I know that there are limits on the total size of a block, but not of the grid,
and that the actual block size influences the runtime. But what I'm wondering is: for a block of 256 threads, does it make a difference whether I launch it as (256,1), as (128,2), as (64,4), etc.?
If it makes a difference: which is the fastest?
Yes, it makes a difference.
(256,1) creates a (1D) block of 256 threads in the X-dimension, all of which have a y-index of 0.
(128,2) creates a (2D) block of 128x2 threads, i.e. 128 in the x-dimension and 2 in the y-dimension. These threads will have an x-index ranging from 0 to 127 and a y-index ranging from 0 to 1.
The structure of your kernel code must account for this thread indexing/numbering.
For example if your kernel code starts with something like:
int idx=threadIdx.x+blockDim.x*blockIdx.x;
and doesn't create any other index variables, it's probably assuming a 1D threadblock and 1D grid.
If, on the other hand, your kernel code starts with something like:
int idx = threadIdx.x+blockDim.x*blockIdx.x;
int idy = threadIdx.y+blockDim.y*blockIdx.y;
It's probably expecting a 2D grid and 2D threadblocks.
Generally speaking, the two approaches are not interchangeable, meaning you cannot launch a kernel that expects a 1D grid with a 2D grid and expect everything to work normally, and vice-versa.
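
As an illustration (a sketch, not from the original answer), a kernel written for 2D blocks can flatten both thread coordinates into a single element index; the kernel name and sizes below are made up:

// Kernel written for 2D blocks: it flattens (x, y) thread coordinates into a
// single element index, so a (128,2) block covers the same 256 elements per
// block as a (256,1) block would.
__global__ void fill_with_index(int *data, int n)
{
    int x = threadIdx.x + blockDim.x * blockIdx.x;
    int y = threadIdx.y + blockDim.y * blockIdx.y;
    int idx = y * (gridDim.x * blockDim.x) + x;   // flattened global index
    if (idx < n) data[idx] = idx;
}

int main()
{
    const int n = 1024;
    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));

    dim3 block(128, 2);        // 256 threads per block, laid out as 128x2
    dim3 grid(n / 256, 1);     // this particular kernel also works with block = dim3(256, 1)
    fill_with_index<<<grid, block>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}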

3 questions about alignment

The discussion is restricted to compute capability 2.x
Question 1
The size of a curandState is 48 bytes (measured by sizeof()). When an array of curandStates is allocated, is each element somehow padded (for example, to 64 bytes)? Or are they just placed contiguously in the memory?
Question 2
The OP of Passing structs to CUDA kernels states that "the align part was unnecessary". But without alignment, wouldn't access to that structure be divided into two separate accesses, one to a and one to b?
Question 3
struct Position
{
    double x, y, z;
};
Suppose each thread is accessing the structure above:
int globalThreadID=blockIdx.x*blockDim.x+threadIdx.x;
Position positionRegister=positionGlobal[globalThreadID];
To optimize memory access, should I simply use three separate double variables x, y, z to replace the structure?
Thanks for your time!
(1) They are placed contiguously in memory.
(2) If the array is in global memory, each memory transaction is 128 bytes, aligned to 128 bytes. You get two transactions only if a and b happen to span a 128-byte boundary.
(3) Performance can often be improved by using a structure of arrays instead of an array of structures. This just means that you pack all your x values together in one array, then the y values, and so on. This makes sense when you look at what happens when all 32 threads in a warp reach the point where, for instance, x is needed. By having all the values packed together, all the threads in the warp can be serviced with as few transactions as possible. Since a global memory transaction is 128 bytes, a single transaction can service all the threads if the value is a 32-bit word. The code example you gave might cause the compiler to keep the values in registers until they are needed.
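
A minimal sketch contrasting the two layouts (the kernel and struct names are illustrative, not from the original question):

// Array of structures (AoS), using the Position struct from the question:
// threads in a warp reading .x touch addresses 24 bytes apart, so one field
// read spreads over several 128-byte transactions.
struct Position { double x, y, z; };

__global__ void aosKernel(const Position *positions, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = positions[i].x + positions[i].y + positions[i].z;
}

// Structure of arrays (SoA): consecutive threads read consecutive doubles,
// so each field read coalesces into as few transactions as possible.
struct PositionsSoA { double *x, *y, *z; };

__global__ void soaKernel(PositionsSoA p, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = p.x[i] + p.y[i] + p.z[i];
}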

Cuda linear interpolation using textures

I have a curve as follows:
float points[] = {1, 4, 6, 9, 14, 25, 69};
float images[] = {0.3, 0.4, 0.7, 0.9, 1, 2.5, 5.3};
In order to interpolate, say, f(3), I would use linear interpolation between the points 1 and 4.
In order to interpolate, say, f(15), I would apply a binary search on the array of points, get the lower bound, which is 25, consider interpolation in the interval [14, 25], and so on.
I have found that this method makes my device function very slow. I've heard I can use texture memory and tex1D to do this. Is that possible even if points[] is not uniform (i.e. not incremented by a constant step)?
Any idea?
It looks like this problem can be broken into two parts:
Use the points array to convert the x value in f(x) to a floating point index between 0 and 7 (requires binary search on points[])
Use that floating point index to get a linearly interpolated value from the images array
Cuda texture memory can make step 2 very fast. I am guessing, however, that most of the time in your kernel is spent on step 1, and I don't think texture memory can help you there.
If you aren't already taking advantage of shared memory, moving your arrays to shared memory will give you a much bigger speedup than using texture memory. There is 48k of shared memory on recent hardware, so if your arrays are less than 24k (6k elements) they should both fit in shared memory. Step 1 can benefit greatly from shared memory because it requires non-contiguous reads of points[], which is very very slow in global memory.
If your arrays don't fit in shared memory, you should break up your arrays into equally sized pieces with 6k elements each and assign each piece to a block. Have each block read through all of the points you are interpolating, and have it ignore a point if it's not within the portion of the points[] array stored in its shared memory.
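
A minimal device-side sketch of the shared-memory approach (assuming, as in the question, a small table that fits in shared memory; the kernel name and signature are made up):

#define N_POINTS 7   // size of points[]/images[] in the question

// Stage both tables in shared memory, then let each thread binary-search for
// its bracketing interval and linearly interpolate within it.
__global__ void interpolate(const float *points, const float *images,
                            const float *queries, float *results, int nQueries)
{
    __shared__ float sPoints[N_POINTS];
    __shared__ float sImages[N_POINTS];

    for (int i = threadIdx.x; i < N_POINTS; i += blockDim.x) {
        sPoints[i] = points[i];
        sImages[i] = images[i];
    }
    __syncthreads();

    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= nQueries) return;

    float x = queries[q];

    // Binary search for the first point >= x (a lower bound).
    int lo = 0, hi = N_POINTS - 1;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (sPoints[mid] < x) lo = mid + 1; else hi = mid;
    }
    int i1 = (lo < 1) ? 1 : lo;   // right end of the bracketing interval
    int i0 = i1 - 1;              // left end

    float t = (x - sPoints[i0]) / (sPoints[i1] - sPoints[i0]);
    results[q] = sImages[i0] + t * (sImages[i1] - sImages[i0]);
}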

as3 Number type - Logic issues with large numbers

I'm curious about an issue spotted in our team with a very large number:
var n:Number = 64336512942563914;
trace(n < Number.MAX_VALUE); // true
trace(n); // 64336512942563910
var a1:Number = n +4;
var a2:Number = a1 - n;
trace(a2); // 8 Expect to see 4
trace(n + 4 - n); // 8
var a3:Number = parseInt("64336512942563914");
trace(a3); // 64336512942563920
n++;
trace(n); //64336512942563910
trace(64336512942563914 == 64336512942563910); // true
What's going on here?
Although n is large, it's smaller than Number.MAX_VALUE, so why am I seeing such odd behaviour?
I thought that perhaps it was an issue with formatting large numbers when being trace'd out, but that doesn't explain n + 4 - n == 8
Is this some weird floating point number issue?
Yes, it is a floating point issue, but it is not a weird one. It is all expected behavior.
The Number data type in AS3 is actually a "64-bit double-precision format as specified by the IEEE Standard for Binary Floating-Point Arithmetic (IEEE-754)" (source). Because the number you assigned to n has too many digits to fit into those 64 bits, it gets rounded off, and that's the reason for all the "weird" results.
If you need to do some exact big integer arithmetic, you will have to use a custom big integer class, e.g. this one.
Numbers in Flash are double-precision floating-point numbers, meaning they store a number of significant digits and an exponent. The format favours a larger range of expressible numbers over precision due to memory constraints. For numbers with many significant digits, rounding will occur at some point, which is what you are seeing: the least significant digits are being rounded. Google "double precision floating point" and you'll find plenty of technical information on why.
It is the nature of the data type. If you need exact numbers you should really stick to uint- or int-based integers. Other languages have fixed-point or big-integer processing libraries (sometimes called BigInt or Decimal) which are wrappers around ints and longs to express much larger numbers at the cost of memory consumption.
We have an as3 implementation of BigDecimal copied from Java that we use for ALL calculations. In a trading app the floating point errors were not acceptable.
To be safe with integer math when using Numbers, I checked the AS3 documentation for Number and it states:
The Number class can be used to represent integer values well beyond
the valid range of the int and uint data types. The Number data type
can use up to 53 bits to represent integer values, compared to the 32
bits available to int and uint.
A 53-bit integer gets you to 2^53 - 1, which, if I'm not mistaken, is 9007199254740991, or about 9 quadrillion. The other 11 bits that help make up the 64-bit Number are used in the exponent. The number used in the question is about 64.3 quadrillion. Going past that point (9 quadrillion) requires more bits for the significant portion (the mantissa) than is allotted, and so rounding occurs. A helpful video explaining why this makes sense is by PBS Studios' Infinite Series.
So yeah, one must search for outside resources such as the BigInt. Hopefully, the resources I linked to are useful to someone.
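Since Number is plain IEEE-754 binary64, the same behaviour can be reproduced outside AS3; a small C++ sketch (an illustration, not part of the original answers):

#include <cstdio>

int main()
{
    // Around 6.4e16 the gap between adjacent doubles is 8, so integers in this
    // range are silently rounded to a multiple of 8 when stored.
    double n  = 64336512942563914.0;   // cannot be stored exactly
    double a1 = n + 4;                 // rounds up to the next representable double
    double a2 = a1 - n;

    printf("n stored as: %.0f\n", n);  // a nearby multiple of 8, not ...914
    printf("a2 = %.0f\n", a2);         // prints 8, matching the AS3 trace
    return 0;
}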
This is, indeed, a floating point number approximation issue.
I guess n is too large compared to 4, so it has to stick with children of its age:
trace(n - n + 4) is ok since it does n-n = 0; 0 + 4 = 4;
Actually, Number is not the type to use for large integers; it is meant for floating-point numbers. If you want to compute with large integers exactly, you have to stay within the limit of uint.MAX_VALUE.
Cheers!