CUB device scan with custom scan op fails

I am using CUB::InclusiveScan, which takes a custom binary, non-commutative operator. I define my operator as follows:
template <typename T>
struct MultAddFunctor
{
    const T factor;
    MultAddFunctor(T factor) : factor(factor) {}
    __device__ __forceinline__
    T operator()(const T &a, const T &b) const {
        return factor*a + b;
    }
};
Otherwise, my code is nearly identical to the example code in the documentation (except I have freed allocated memory and added additional syncs to rule that out as a problem). When factor is 1.0 this produces the correct results (which is just a prefix-sum). When factor is something else (such as 0.8) the results are correct for the first 12 values but diverge considerably after that. For example, if the array being scanned is just a bunch of 1.0s, I get the following results:
Index  CUDA    Serial  Match
0 1.000 1.000 ✅
1 1.800 1.800 ✅
2 2.440 2.440 ✅
3 2.952 2.952 ✅
4 3.362 3.362 ✅
5 3.689 3.689 ✅
6 3.951 3.951 ✅
7 4.161 4.161 ✅
8 4.329 4.329 ✅
9 4.463 4.463 ✅
10 4.571 4.571 ✅
11 4.656 4.656 ✅
12 6.313 4.725 ❌
13 6.050 4.780 ❌
14 5.840 4.824 ❌
15 5.672 4.859 ❌
...
At element 12, there is a sudden jump in the values and then a decrease even though this should just keep getting larger at a consistent rate.
At first, I thought it was due to the non-commutativity of the operation, but the docs explicitly state that that is fine. I also thought that the factor field itself may not be getting to the device correctly, but even if I hard-code a 0.8 in the equation it is still incorrect (although, factor is probably always in global memory so in the future moving factor into shared/local would be better).
What other reason could there be that the scan is computing the incorrect results?

The reason for the failure here is that the cub parallel scan, like every other parallel scan implementation I am aware of, requires a binary scan operator that is associative.
This follows directly from the definition of associativity, using addition as an example. The associative property of addition says that
(a+b)+c = a+(b+c)
When applied here, it means that in a parallel setting I can apply the binary op either to (a+b) first or to (b+c) first, then apply the binary op to the remaining operand, and I should get the same result. This is associativity, not commutativity. The op here is not associative. To demonstrate this, we actually perform the math indicated by the operator for each grouping (imagining/pretending that factor is 1000):
step 1: op(a,b) = 1000*a + b              op(b,c) = 1000*b + c
step 2: op(op(a,b), c) = 1000*(1000*a + b) + c        op(a, op(b,c)) = 1000*a + (1000*b + c)
Expanded, the left grouping gives 1000000*a + 1000*b + c while the right grouping gives 1000*a + 1000*b + c, so the equality does not hold.
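A common fix (a sketch of my own, not part of the original answer) is to rewrite the recurrence y[i] = factor*y[i-1] + x[i] as a scan over pairs using an operator that is associative: each input x[i] is lifted to the pair (factor, x[i]), the pairs are combined as shown below, and after the scan the second component of element i holds y[i]. The type and functor names here are illustrative only.
struct ScaledPair
{
    float m;    // accumulated factor (product of the per-element factors)
    float v;    // accumulated recurrence value
};

struct ScaledAddFunctor
{
    __device__ __forceinline__
    ScaledPair operator()(const ScaledPair &a, const ScaledPair &b) const {
        // (m1, v1) op (m2, v2) = (m1*m2, v1*m2 + v2), which is associative
        return ScaledPair{a.m * b.m, a.v * b.m + b.v};
    }
};
// Fill the device input with pairs {factor, x[i]}, run
// cub::DeviceScan::InclusiveScan(d_temp, temp_bytes, d_in, d_out, ScaledAddFunctor(), n),
// and read y[i] from d_out[i].v.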

Related

Do __shfl_xx_sync() intrinsics with mask need an additional __syncwarp()?

Do __shfl_xx_sync() instructions, where only some lanes participate, need an additional __syncwarp() instruction, or is setting a mask enough?
I cannot provide a working minimal example, as it is very long and confidential code and the error appeared only in certain run/build configurations.
The code looks basically like the following:
if (threadIdx.x >= 30) {
    temp.x = __shfl_up_sync(0xC0000000, x, 1);
    temp.y = __shfl_up_sync(0xC0000000, y, 1);
}
// __syncwarp();
__shfl_up_sync(0xffffffff, w, 1, 32);
Release builds worked fine; in debug builds, lanes 30 and 31 waited (according to the debugger and the SASS) at a different sync instruction than the other lanes.
When I introduced __syncwarp(), the debug builds also ran through. But can I be sure that this definitely fixes the problem?
I am using a mask in the first two shuffle instructions indicating that only lanes 30 and 31 participate. What happens if the scheduler decides that lanes 0 to 29 are executed first and they reach the second shuffle instruction (the one in which all lanes participate)? Those shuffle instructions then wait for lanes 30 and 31, which in turn arrive at the upper shuffle instructions. Can the two shuffles be distinguished from each other?
If the __syncwarp() is needed: why would it behave differently than the shuffle instruction with mask 0xffffffff itself? Because it is of a different type (a shuffle sync instead of a plain sync)? Or did the program only work this way by accident?
(The __syncwarp() intrinsic is probably useful here anyway (for performance reasons), as the threads converge at that point.)
If __syncwarp() is not enough: How to make sure the kernel does not hang? Is there generally another recommended way than __syncwarp()?
I am running this on a Turing RTX 2060 Mobile (and debugging with Visual Studio).
No, you should not need a __syncwarp() here. CUDA went from e.g. __shfl_up() to __shfl_up_sync() to avoid this. I think the problem is that you are trying to shuffle up data from a thread that is not participating in the call, i.e. thread 30 is trying to get data from thread 29, so thread 29 has to participate.
Threads may only read data from another thread which is actively participating in the __shfl_sync() command. If the target thread is inactive, the retrieved value is undefined.
from the docs. Although this explanation is still unsatisfactory, as you seem to get a deadlock instead of an undefined value. But maybe this is intended behavior for a debug build?
That being said, I'm not quite sure how to do this elegantly, because just including thread 29 in the conditional and mask will only shift the problem to 29 trying to get data from 28. In the examples given in the documentation, they always do the intrinsic with all threads and then conditionally use the results.
My best guess is that you want thread 29 to participate, but with a delta of 0. I have not found anything saying that delta needs to be the same across threads.
You might also want to use __ballot_sync() to retrieve a mask as can be seen in Listing 3 of this blogpost to avoid bugs from manually specifying the mask, which needs to be changed whenever the conditional is changed.
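To make that documented pattern concrete, here is a rough sketch (my own, not from the documentation or the original answer; the variable names and float types follow the question and are assumptions) of restructuring the snippet so that every lane executes the shuffles and only lanes 30 and 31 consume the results:
// All 32 lanes execute the shuffles, so no lane reads from an inactive source
// and no lane can be left waiting at a different synchronization point.
float up_x = __shfl_up_sync(0xffffffff, x, 1);
float up_y = __shfl_up_sync(0xffffffff, y, 1);
if (threadIdx.x >= 30) {   // only the lanes that need the values use them
    temp.x = up_x;
    temp.y = up_y;
}
__shfl_up_sync(0xffffffff, w, 1, 32);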

CUDA out of memory message after using just ~2.2GB of memory on a GTX1080

I'm doing matrix multiplication on a GTX1080 GPU using JCuda, version 0.8.0RC with CUDA 8.0. I load two matrices A and B into the device in row-major vector form, and read the product matrix from the device. But I'm finding that I run out of device memory earlier than I would expect. For example, if matrix A is dimensioned 100000 * 5000 = 500 million entries = 2GB worth of float values, then:
cuMemAlloc(MatrixA, 100000 * 5000 * Sizeof.FLOAT);
works fine. But if I increase the number or rows to 110000 from 100000, I get the following error on this call (which is made before the memory allocations for matrices B and C, so those are not part of the problem):
Exception in thread "main" jcuda.CudaException: CUDA_ERROR_OUT_OF_MEMORY
at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:344)
at jcuda.driver.JCudaDriver.cuMemAlloc(JCudaDriver.java:3714)
at JCudaMatrixMultiply.main(JCudaMatrixMultiply.java:84) (my code)
The issue is that allocating a matrix of this size on the device should take only about 2.2GB, and the GTX1080 has 8GB of memory, so I don't see why I'm running out of memory. Does anyone have any thoughts on this? It's true that I'm using JCuda 0.8.0RC with the release version of CUDA 8, but I tried downloading the RC version of CUDA 8 (8.0.27) to use with JCuda 0.8.0RC and had some problems getting it to work. If version compatibility is likely to be the issue, however, I can try again.
Matrices of 100000 * 5000 are pretty big, of course, and I won't need to work with larger matrices for a while on my neural network project, but I would like to be confident that I can use all 8GB of memory on this new card. Thanks for any help.
tl;dr:
When calling
cuMemAlloc(MatrixA, (long)110000 * 5000 * Sizeof.FLOAT);
// ^ cast to long here
or alternatively
cuMemAlloc(MatrixA, 110000L * 5000 * Sizeof.FLOAT);
// ^ use the "long" literal suffix here
it should work.
The last argument to cuMemAlloc is of type size_t. This is an implementation-specific unsigned integer type for "arbitrary" sizes. The closest possible primitive type in Java for this is long. And in general, every size_t in CUDA is mapped to long in JCuda. In this case, the Java long is passed as a jlong into the JNI layer, and this is simply cast to size_t for the actual native call.
(The lack of unsigned types in Java and the odd plethora of integer types in C can still cause problems. Sometimes, the C types and the Java types just don't match. But as long as the allocation is not larger than 9 Million Terabytes (!), a long should be fine here...)
But the comment by havogt led to the right track. What happens here is indeed an integer overflow: the computation of the actual value
110000 * 5000 * Sizeof.FLOAT = 2200000000
is by default done using the int type in Java, and this is where the overflow happens: 2200000000 is larger than Integer.MAX_VALUE. The result will be a negative value. When this is cast to the (unsigned) size_t value in the JNI layer, it becomes a ridiculously large positive value, which clearly causes the error.
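To spell out the numbers (my arithmetic, not part of the original answer): 110000 * 5000 * 4 = 2,200,000,000, while Integer.MAX_VALUE is 2,147,483,647, so the 32-bit result wraps to 2,200,000,000 - 2^32 = -2,094,967,296. Reinterpreted as a 64-bit unsigned size_t, that negative value becomes 2^64 - 2,094,967,296, roughly 1.8 * 10^19 bytes, far beyond anything the driver could allocate.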
When doing the computation using long values, either by explicitly casting to long or by appending the L suffix to one of the literals, the value is passed to CUDA as the proper long value of 2200000000.

Why does cuFFT performance suffer with overlapping inputs?

I'm experimenting with using cuFFT's callback feature to perform input format conversion on the fly (for instance, calculating FFTs of 8-bit integer input data without first doing an explicit conversion of the input buffer to float). In many of my applications, I need to calculate overlapped FFTs on an input buffer, as described in this previous SO question. Typically, adjacent FFTs might overlap by 1/4 to 1/8 of the FFT length.
cuFFT, with its FFTW-like interface, explicitly supports this via the idist parameter of the cufftPlanMany() function. Specifically, if I want to calculate FFTs of size 32768 with an overlap of 4096 samples between consecutive inputs, I would set idist = 32768 - 4096. This does work properly in the sense that it yields the correct output.
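For reference, the plan setup for this case might look roughly like the following (a sketch based only on the numbers quoted above, not the code from the linked gist; the variable names are mine):
// Batched 1D real-to-complex plan whose consecutive inputs overlap by
// 'overlap' samples: idist = nfft - overlap.
int nfft = 32768, overlap = 4096, batch = 1024;
int n[1]       = { nfft };
int inembed[1] = { nfft };          // must be non-NULL for the advanced layout to take effect
int onembed[1] = { nfft / 2 + 1 };
cufftHandle plan;
cufftPlanMany(&plan, 1, n,
              inembed, /*istride=*/1, /*idist=*/nfft - overlap,
              onembed, /*ostride=*/1, /*odist=*/nfft / 2 + 1,
              CUFFT_R2C, batch);
// The int8_t -> float conversion is then attached as a load callback via
// cufftXtSetCallback(plan, ..., CUFFT_CB_LD_REAL, ...).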
However, I'm seeing strange performance degradation when using cuFFT in this way. I have devised a test that implements this format conversion and overlap in two different ways:
Explicitly tell cuFFT about the overlapping nature of the input: set idist = nfft - overlap as I described above. Install a load callback function that just does the conversion from int8_t to float as needed on the buffer index provided to the callback.
Don't tell cuFFT about the overlapping nature of the input; lie to it and set idist = nfft. Then let the callback function handle the overlapping by calculating the correct index that should be read for each FFT input.
A test program implementing both of these approaches with timing and equivalence tests is available in this GitHub gist. I didn't reproduce it all here for brevity. The program calculates a batch of 1024 32768-point FFTs that overlap by 4096 samples; the input data type is 8-bit integers. When I run it on my machine (with a Geforce GTX 660 GPU, using CUDA 8.0 RC on Ubuntu 16.04), I get the following result:
executing method 1...done in 32.523 msec
executing method 2...done in 26.3281 msec
Method 2 is noticeably faster, which I would not expect. Look at the implementations of the callback functions:
Method 1:
template <typename T>
__device__ cufftReal convert_callback(void * inbuf, size_t fft_index,
                                      void *, void *)
{
    return (cufftReal)(((const T *) inbuf)[fft_index]);
}
Method 2:
template <typename T>
__device__ cufftReal convert_and_overlap_callback(void *inbuf,
                                                  size_t fft_index, void *, void *)
{
    // fft_index is the index of the sample that we need, not taking
    // the overlap into account. Convert it to the appropriate sample
    // index, considering the overlap structure. First, grab the FFT
    // parameters from constant memory.
    int nfft = overlap_params.nfft;
    int overlap = overlap_params.overlap;
    // Calculate which FFT in the batch that we're reading data for. This
    // tells us how much overlap we need to account for. Just use integer
    // arithmetic here for speed, knowing that this would cause a problem
    // if we did a batch larger than 2Gsamples long.
    int fft_index_int = fft_index;
    int fft_batch_index = fft_index_int / nfft;
    // For each transform past the first one, we need to slide "overlap"
    // samples back in the input buffer when fetching the sample.
    fft_index_int -= fft_batch_index * overlap;
    // Cast the input pointer to the appropriate type and convert to a float.
    return (cufftReal) (((const T *) inbuf)[fft_index_int]);
}
Method 2 has a significantly more complex callback function, one that even involves integer division by a non-compile time value! I would expect this to be much slower than method 1, but I'm seeing the opposite. Is there a good explanation for this? Is it possible that cuFFT structures its processing much differently when the input overlaps, thus resulting in the degraded performance?
It seems like I should be able to achieve performance that is quite a bit faster than method 2 if the index calculations could be removed from the callback (but that would require the overlapping to be specified to cuFFT).
Edit: After running my test program under nvvp, I can see that cuFFT definitely seems to be structuring its computations differently. It's hard to make sense of the kernel symbol names, but the kernel invocations break down like this:
Method 1:
__nv_static_73__60_tmpxft_00006cdb_00000000_15_spRealComplex_compute_60_cpp1_ii_1f28721c__ZN13spRealComplex14packR2C_kernelIjfEEvNS_19spRealComplexR2C_stIT_T0_EE: 3.72 msec
spRadix0128C::kernel1Tex<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, CONSTANT, ALL, WRITEBACK>: 7.71 msec
spRadix0128C::kernel1Tex<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, CONSTANT, ALL, WRITEBACK>: 12.75 msec (yes, it gets invoked twice)
__nv_static_73__60_tmpxft_00006cdb_00000000_15_spRealComplex_compute_60_cpp1_ii_1f28721c__ZN13spRealComplex24postprocessC2C_kernelTexIjfL9fftAxii_t1EEEvP7ComplexIT0_EjT_15coordDivisors_tIS6_E7coord_tIS6_ESA_S6_S3_: 7.49 msec
Method 2:
spRadix0128C::kernel1MemCallback<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, L1, ALL, WRITEBACK>: 5.15 msec
spRadix0128C::kernel1Tex<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, CONSTANT, ALL, WRITEBACK>: 12.88 msec
__nv_static_73__60_tmpxft_00006cdb_00000000_15_spRealComplex_compute_60_cpp1_ii_1f28721c__ZN13spRealComplex24postprocessC2C_kernelTexIjfL9fftAxii_t1EEEvP7ComplexIT0_EjT_15coordDivisors_tIS6_E7coord_tIS6_ESA_S6_S3_: 7.51 msec
Interestingly, it looks like cuFFT invokes two kernels to actually compute the FFTs using method 1 (when cuFFT knows about the overlapping), but with method 2 (where it doesn't know that the FFTs are overlapped), it does the job with just one. For the kernels that are used in both cases, it does seem to use the same grid parameters between methods 1 and 2.
I don't see why it should have to use a different implementation here, especially since the input stride istride == 1. It should just use a different base address when fetching data at the transform input; the rest of the algorithm should be exactly the same, I think.
Edit 2: I'm seeing some even more bizarre behavior. I realized by accident that if I fail to destroy the cuFFT handles appropriately, I see differences in measured performance. For example, I modified the test program to skip destruction of the cuFFT handles and then executed the tests in a different sequence: method 1, method 2, then method 2 and method 1 again. I got the following results:
executing method 1...done in 31.5662 msec
executing method 2...done in 17.6484 msec
executing method 2...done in 17.7506 msec
executing method 1...done in 20.2447 msec
So the performance seems to change depending upon whether there are other cuFFT plans in existence when creating a plan for the test case! Using the profiler, I see that the structure of the kernel launches doesn't change between the two cases; the kernels just all seem to execute faster. I have no reasonable explanation for this effect either.
If you specify non-standard strides (it doesn't matter whether for the batch or the transform), cuFFT uses a different path internally.
Regarding edit 2:
This is likely GPU Boost adjusting the clocks on the GPU. cuFFT plans do not have an impact on one another.
Ways to get more stable results:
run a warmup kernel (anything that makes the whole GPU work is good) and then your problem
increase the batch size
run the test several times and take the average
lock the clocks of the GPU (not really possible on GeForce; Tesla can do it)
At the suggestion of @llukas, I filed a bug report with NVIDIA regarding the issue (https://partners.nvidia.com/bug/viewbug/1821802 if you're registered as a developer). They acknowledged the poorer performance with overlapped plans. They actually indicated that the kernel configuration used in both cases is suboptimal and they plan to improve it eventually. No ETA was given, but it will likely not be in the next release (8.0 was just released last week). Finally, they said that as of CUDA 8.0, there is no workaround to make cuFFT use a more efficient method with strided inputs.

cublas one function call produced three executions

I called the cublas_Sgemm_v2 function 10236 times with the first matrix non-transposed and the second transposed. However, in the nvprof results, I saw three items produced from that function call. The (m, n, k) values passed to the function call are (588, 588, 20).
These are the items listed in the nvprof results.
Time(%) Time Calls Avg Min Max Name
12.32% 494.86ms 10236 48.344us 47.649us 49.888us sgemm_sm35_ldg_nt_128x8x128x16x16
8.64% 346.91ms 10236 33.890us 32.352us 35.488us sgemm_sm35_ldg_nt_64x16x128x8x32
8.11% 325.63ms 10236 31.811us 31.360us 32.512us sgemm_sm35_ldg_nt_128x16x64x16x16
Is this expected, and why is that? Can someone explain what the values in the function names such as sgemm_sm35_ldg_nt_128x8x128x16x16 mean?
I also have other calls to cublas_Sgemm_v2 with different transpose settings, and for those I see only one item per call.
UPDATE:
As @Marco13 asked, I put more results here:
Time(%) Time Calls Avg Min Max Name
--------------------------------------------------------------------------------
Resulted from 7984 calls with (Trans, NonTrans) with (m, n, k) = (588, 100, 588)
20.84% 548.30ms 7984 68.675us 58.977us 81.474us sgemm_sm35_ldg_tn_32x16x64x8x16
Resulted from 7984 calls with (NonTrans, NonTrans) with (m, n, k) = (588, 100, 588)
12.95% 340.71ms 7984 42.674us 21.856us 64.514us sgemm_sm35_ldg_nn_64x16x64x16x16
All the following resulted from 3992 calls with (NonTrans, Trans) with (m, n, k) = (588, 588, 100)
9.81% 258.15ms 3992 64.666us 61.601us 68.642us sgemm_sm35_ldg_nt_128x8x128x16x16
6.84% 179.90ms 3992 45.064us 40.097us 49.505us sgemm_sm35_ldg_nt_64x16x128x8x32
6.33% 166.51ms 3992 41.709us 38.304us 61.185us sgemm_sm35_ldg_nt_128x16x64x16x16
Another run with 588 changed to 288:
Time(%) Time Calls Avg Min Max Name
--------------------------------------------------------------------------------
Resulted from 7984 calls with (Trans, NonTrans) with (m, n, k) = (288, 100, 288)
22.01% 269.11ms 7984 33.706us 30.273us 39.232us sgemm_sm35_ldg_tn_32x16x64x8x16
Resulted from 7984 calls with (NonTrans, NonTrans) with (m, n, k) = (288, 100, 288)
14.79% 180.78ms 7984 22.642us 18.752us 26.752us sgemm_sm35_ldg_nn_64x16x64x16x16
Resulted from 3992 calls with (NonTrans, Trans) with (m, n, k) = (288, 288, 100)
7.43% 90.886ms 3992 22.766us 19.936us 25.024us sgemm_sm35_ldg_nt_64x16x64x16x16
From the last three lines it looks like certain transposition types can be more efficient than others, and certain matrix sizes are more economical in terms of computation time relative to matrix size. What is the guideline for ensuring economical computation?
UPDATE 2:
For the case of (m, n, k) = (588, 100, 588) above, I manually transposed the matrix before calling the sgemm function (see the sketch after the table below for one way this can be done on the device), and then there is only one item in the nvprof result. The time it takes is only a little less than the sum of the two items in the above table, so there is not much performance gain from doing so.
Time(%) Time Calls Avg Min Max Name
--------------------------------------------------------------------------------
31.65% 810.59ms 15968 50.763us 21.505us 72.098us sgemm_sm35_ldg_nn_64x16x64x16x16
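(The update above does not say how the manual transpose was performed; one common device-side option, sketched here purely for reference and with placeholder names, is cublasSgeam with alpha = 1 and beta = 0:)
// Out-of-place transpose d_At = transpose(d_A), where d_A is a rows x cols
// column-major matrix; handle, pointers and dimensions are placeholders.
const float alpha = 1.0f, beta = 0.0f;
cublasSgeam(handle,
            CUBLAS_OP_T, CUBLAS_OP_N,
            cols, rows,              // dimensions of the result (cols x rows)
            &alpha, d_A, rows,       // op(A) = A^T
            &beta,  d_At, cols,      // beta = 0, so this operand is not read
            d_At, cols);             // leading dimension of the result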
Sorry, not an answer - but slightly too long for a comment:
Concerning the edit, about the influence of the "transpose" state: transposing a matrix might cause an access pattern that is worse in terms of memory coalescing. A quick web search brings up some results about this ( https://devtalk.nvidia.com/default/topic/528450/cuda-programming-and-performance/cublas-related-question/post/3734986/#3734986 ), but with a slightly different setup than yours:
DGEMM performance on a K20c
args: ta=N tb=N m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.13280010 sec GFLOPS=1034.93
args: ta=T tb=N m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.13872910 sec GFLOPS=990.7
args: ta=N tb=T m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.12521601 sec GFLOPS=1097.61
args: ta=T tb=T m=4096 n=4096 k=4096 alpha=-1 beta=2 lda=4096 ldb=4096 ldc=4096
elapsed = 0.13652611 sec GFLOPS=1006.69
In this case, the differences do not seem worth the hassle of changing the matrix storage (e.g. from column-major to row-major, to avoid transposing the matrix), because all patterns seem to run with a similar speed. But your mileage may vary - particularly, the difference in your tests between (t,n) and (n,n) is very large (548ms vs 340ms), which I found quite surprising. If you have the choice to easily switch between various representations of the matrix, then a benchmark covering all four cases may be worthwhile.
In any case, regarding your question about the functions that are called there: the CUBLAS code for the sgemm function in CUBLAS 1.1 was already full of unrolled loops and already contained 80 (!) versions of the sgemm function for different cases, assembled using a #define-hell. It has to be assumed that this has become even more unreadable in the newer CUBLAS versions, where the newer compute capabilities have to be taken into account - and the function names that you found there indicate that this is indeed the case:
sgemm_sm35_ldg_nt_64x16x128x8x32
sm35 : Runs on a device with compute capability 3.5
ldg : ? Non-texture-memory version ? (CUBLAS 1.1 contained functions called sgemm_main_tex_* which worked on texture memory, and functions sgemm_main_gld_* which worked on normal, global memory)
nt : First matrix is Not transposed, second one is Transposed
64x16x128x8x32 - Probably related to tile sizes, maybe shared memory etc...
Still, I think it's surprising that a single call to sgemm causes three of these internal functions to be called. But as mentioned in the comment: I assume that they try to handle the "main" part of the matrix with a specialized, efficient version, and "border tiles" with one that is capable of doing range checks and/or cope with warps that are not full. (Not very precise, just to be suggestive: A matrix of size 288x288 could be handled by an efficient core for matrices of size 256x256, and two calls for the remaining 32x288 and 288x32 entries).
But all this is also the reason why I guess there can hardly be a general guideline concerning the matrix sizes: The "best" matrix size in terms of computation time over matrix size will at least depend on
the hardware version (compute capability) of the target system
the transposing-flags
the CUBLAS version
EDIT Concerning the comment: One could imagine that there should be a considerable difference between the transposed and the non-transposed processing. When multiplying two matrices
a00 a01 a02 b00 b01 b02
a10 a11 a12 * b10 b11 b12
a20 a21 a22 b20 b21 b22
Then the first element of the result will be
a00 * b00 + a01 * b10 + a02 * b20
(which simply is the dot product of the first row of a and the first column of b). For this computation one has to read consecutive values from a. But the values that are read from b are not consecutive. Instead, they are "the first value in each row". One could think that this would have a negative impact on memory coalescing. But for sure, the NVIDIA engineers have tried hard to avoid any negative impact here, and the implementation of sgemm in CUBLAS is far, far away from "a parallel version of the naive 3-nested-loops implementation" where this access pattern would have such an obvious drawback.
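(Purely to make that access pattern explicit, here is the naive triple loop in row-major C-style indexing; this only illustrates the statement above, is emphatically not what CUBLAS does, and N, a, b, c are placeholders.)
// c = a * b for N x N row-major matrices: a is walked along a row with
// stride 1, while b is walked down a column with stride N, which is the
// non-consecutive access the text above refers to.
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += a[i * N + k] * b[k * N + j];
        c[i * N + j] = sum;
    }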

CUDA memory operation order within a single thread

From the CUDA Programming Guide (v. 5.5):
The CUDA programming model assumes a device with a weakly-ordered
memory model, that is:
The order in which a CUDA thread writes data to shared memory, global memory, page-locked host memory, or the memory of a peer device
is not necessarily the order in which the data is observed being
written by another CUDA or host thread;
The order in which a CUDA thread reads data from shared memory, global memory, page-locked host memory, or the memory of a peer device
is not necessarily the order in which the read instructions appear in
the program for instructions that are independent of each other
However, do we have a guarantee that the (dependent) memory operations as seen from the single thread are actually consistent? If I do - say:
arr[x] = 1;
int z = arr[y];
where x happens to be equal to y, and no other thread is touching the memory, do I have a guarantee that z is 1? Or do I still need to put some volatile or a barrier between those two operations?
In response to Orpedo's answer.
If your compiler doesn't compile the functionality stated by your code into equal functionality in machine-code, the compiler is either broken or you haven't taken the optimizations into consideration...
My problem is: what optimizations (done either by the compiler or by the hardware) are allowed?
It could happen, for example, that the store instruction is non-blocking and the load instruction that follows is somehow handled by the memory controller faster than the already queued-up store.
I don't know CUDA hardware. Do I have a guarantee that the above will never happen?
The CUDA Programming Guide is simply stating that you cannot predict the order in which the threads are executed, but every single thread will still run as a sequential thread.
In the example you state, where x and y are the same and NO OTHER THREAD is touching the memory, you DO have a guarantee that z = 1.
Here the point being that if you have several threads doing operations on the same data (e.g. an array), you are NOT guaranteed that thread #9 executes before #10.
Take an example:
__device__ void sum_all(float *x, float *result, int N){
    x[threadIdx.x] = threadIdx.x;
    result[threadIdx.x] = 0;
    for(int i = 0; i < N; i++)
        result[threadIdx.x] += x[i];
}
Here we have some dumb function, which SHOULD fill a shared array (x) with the numbers from m ... n (read from one number to another number), and then sum up the numbers already put into the array and store the result in another array.
Given that your lowest-indexed thread is enumerated thread #0, you would expect that after thread #0 has run this code, x should contain
x[] = {0, 0, 0 ... 0} and result[] = {0, 0, 0 ... 0}
next for thread #1
x[] = {0, 1, 0 ... 0} and result[] = {0, 1, 0 ... 0}
next for thread #2
x[] = {0, 1, 2 ... 0} and result[] = {0, 1, 3 ... 0}
and so forth.
But this is NOT guaranteed. You can't know if e.g. thread #3 runs first, hence changing the array x[] before thread #0 runs. You actually don't even know if the arrays are changed by some other thread while you are executing the code.
I am not sure if this is explicitly stated in the CUDA documentation (I wouldn't expect it to be), as this is a basic principle of computing. Basically, what you are asking is whether running your code on a GPU will change the functionality of your code.
The cores of a GPU are generally the same as those of a CPU, just with less control logic, a smaller instruction set, and typically only single-precision support.
In a CUDA GPU there is one program counter for each warp (a group of 32 cores executing in lockstep). Like on a CPU, the program counter advances by one instruction after each instruction, unless you have branches or jumps. This gives the sequential flow of the program, and this cannot be changed.
Branches and jumps can only be introduced by the software running on the core, and hence are determined by your compiler. Compiler optimizations can in fact change the functionality of your code, but only in the case where the code is implemented "wrong" with respect to the compiler.
So in short: your code will always be executed in the order in which it is laid out in memory, no matter whether it is executed on a CPU or a GPU. If your compiler doesn't compile the functionality stated by your code into equal functionality in machine code, the compiler is either broken or you haven't taken the optimizations into consideration...
Hope this was clear enough :)
As far as I understood you're basically asking whether memory dependencies and alias analysis information are being respected in the CUDA compiler.
The answer to that question is, assuming that the CUDA compiler is free of bugs, yes, because as Robert noted the CUDA compiler uses LLVM under the hood, and two basic passes (which, at the moment, I really don't think could be excluded from the pipeline) are:
Memory dependence analysis
Alias Analysis
These two passes detect memory locations potentially pointing to the same address and use liveness analysis on variables (even outside the block scope) to avoid dangerous optimizations (e.g. you can't write to a live variable before its next read; the data may still be needed).
I don't know the compiler internals, but assuming (as with any other reasonably trusted compiler) that it will do its best to be bug-free, the analyses that take place in there should really not bother you at all and assure you that, at least in theory, what you just presented as an example (i.e. the dependent load being serviced faster than the store) cannot happen.
What guarantees you that? Nothing but the fact that the company is giving you a compiler to use, and there are disclaimers in case it doesn't work in exceptional cases :)
Also, aside from the compiler topic, the instruction execution also depends on the hardware specification - in this case, a SIMT hardware instruction issuing unit.
See http://www.csl.cornell.edu/~cbatten/pdfs/kim-simt-vstruct-isca2013.pdf and all the referenced papers for more information.