Should calls to kernels be encoded to fit the "GPU layout", or the "algorithm layout"?

I'm just learning about CUDA/OpenCL, but I have a conceptual question. Suppose I am encoding an algorithm to do a breadth-first search on a graph. Suppose my target device is a GPU with only 2 work-groups of 2 processing elements (i.e., 4 cores). Intuitively, I guess the search could be done in parallel by keeping an array of "nodes to visit". For each pass, every node in that array is visited in parallel, and its neighbors are added to the next "nodes to visit" array. For example, a graph could spawn the following search:
start: must visit A
parallel pass 1: must visit B, C, D
parallel pass 2: must visit E, F, G, H, I, J, K, L, M
parallel pass 3: must visit N, O, P
parallel pass 4: must visit Q
parallel pass 5: done
A way to do it, I suppose, would be to call an NDRange kernel 5 times (example in OpenCL):
size_t g1 = 1, g3 = 3, g9 = 9; // global size == local size: one work-group per pass
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &g1, &g1, 0, NULL, NULL);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &g3, &g3, 0, NULL, NULL);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &g9, &g9, 0, NULL, NULL);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &g3, &g3, 0, NULL, NULL);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &g1, &g1, 0, NULL, NULL);
(Of course, this is hard-coded for this case, so to avoid that, I guess I could keep a counter of nodes to visit.) This sounds wrong, though: it doesn't fit the layout of a GPU at all. For example, on the third call, you are using 1 work group of 9 work items, when your GPU has 2 work groups of 2 work items...!
Another way I see could be the following (example in OpenCL):
size_t global = 4, local = 2; // 2 work-groups of 2 work-items
while (work not complete...)
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
This would call "clEnqueueNDRangeKernel" continuously in a way that fits the GPU layout perfectly, until it receives a signal that the work is done. But this time, the ids received by the kernel wouldn't fit the layout of the algorithm!
What is the right way to do this? Am I missing something?

Your question is one of the most unobvious and interesting ones.
IMO, you should implement a "pure", easy-to-analyze algorithm if you want your application to run on different hardware. If someone else comes across your implementation, it will at least be easy to tweak.
Otherwise, if you put performance first, hardcode every single piece of the software to achieve optimal performance on a single target platform. Anyone who works with your code later will have to learn the hardware peculiarities anyway.

From what I can infer from your question, I'd say that the short answer is both.
I say that because the way you want to solve a problem is always linked to how you posed that problem.
The best illustration is the number of sorting algorithms that exist. From the pure sorting point of view ("I need my data sorted"), the problem has been solved for a while. However, that didn't stop researchers from investigating new algorithms, because they added constraints and/or new input to the problem: "I need my data sorted as fast as possible" (the reason algorithms are categorized with big O notation), "I need my data sorted as fast as possible knowing that the data structure is..." or "knowing that there is an X % chance that..." or "I don't care about the speed but I care about memory", etc.
Now your problem seems to be: I want a breadth-first search algorithm that runs efficiently (this I'm guessing; why learn OpenCL/CUDA otherwise?) on GPUs.
This simple sentence hides a lot of constraints. For instance you have to take into account that:
it takes a lot of time to send the data through the PCIe bus (for discrete GPUs).
(Global) memory access latency is high.
Threads work in lockstep (the number varies between vendors).
Latency is hidden with throughput, and so on.
Note also that this is not necessarily the same as "I want a parallel breadth-first search algorithm" (which could run on a CPU with, again, different constraints).
A quick Google search with the keywords "breadth first search parallel GPU" returns, among others, these articles, which seem promising (I just went through the abstracts):
Efficient Parallel Graph Exploration on Multi-Core CPU and GPU (Use CPU and GPU)
An Effective GPU Implementation of Breadth-First Search
Scalable GPU Graph Traversal (From NVIDIA)
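A pattern these approaches share is to decouple the launch shape from the frontier shape. Purely as an illustration (CUDA syntax for concreteness; the frontier buffers, counters, and CSR adjacency arrays are my assumptions, not something from the question or the papers), a fixed-size grid can stride over a variable-sized frontier:
// Sketch: a grid sized for the hardware walks a frontier of any size.
// frontier/nextFrontier hold node ids; adjOffsets/adjList form a CSR graph.
__global__ void bfsPass(const int *frontier, const int *frontierCount,
                        int *nextFrontier, int *nextCount,
                        const int *adjOffsets, const int *adjList,
                        int *visited)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < *frontierCount; i += stride)
    {
        int node = frontier[i];
        for (int e = adjOffsets[node]; e < adjOffsets[node + 1]; ++e)
        {
            int nb = adjList[e];
            if (atomicCAS(&visited[nb], 0, 1) == 0)         // first visit wins
                nextFrontier[atomicAdd(nextCount, 1)] = nb; // enqueue for next pass
        }
    }
}
The host launches this once per pass with a grid shaped for the hardware (e.g., 2 groups of 2 in the toy GPU above), swapping frontier and nextFrontier between passes: the launch fits the GPU layout, while the stride loop remaps the ids to fit the algorithm layout.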


In CUDA, how can I get this warp's thread mask in conditionally executed code (in order to execute, e.g., __shfl_sync or <cg>.shfl)?

I'm trying to update some older CUDA code (pre CUDA 9.0), and I'm having some difficulty updating usage of warp shuffles (e.g., __shfl).
Basically the relevant part of the kernel might be something like this:
int f = d[threadIdx.x];
int warpLeader = <something in [0,32)>;

// Point being, some threads in the warp get removed by i < stop.
for (int i = k; i < stop; i += skip)
{
    // Point being, potentially more threads don't see the shuffle below.
    if (mem[threadIdx.x + i/2] == foo)
    {
        // Pre CUDA 9.0.
        f = __shfl(f, warpLeader);
    }
}
Maybe that's not the best example (the real code is too complex to post), but the two things accomplished easily with the old intrinsics were:
Shuffle/broadcast to whatever threads happen to be here at this time.
Still get to use the warp-relative thread index.
I'm not sure how to do the above post CUDA 9.0.
This question is almost/partially answered here: How can I synchronize threads within warp in conditional while statement in CUDA?, but I think that post has a few unresolved questions.
I don't believe __shfl_sync(__activemask(), ...) will work. This was noted in the linked question and many other places online.
The linked question says to use coalesced_group, but my understanding is that this type of cooperative_group re-ranks the threads, so if you had a warpLeader (on [0, 32)) in mind as above, I'm not sure there's a way to "figure out" its new rank in the coalesced_group.
(Also, based on the truncated comment conversation in the linked question, it seems unclear if coalesced_group is just a nice wrapper for __activemask() or not anyway ...)
It is possible to iteratively build up a mask using __ballot_sync as described in the linked question, but for code similar to the above, that can become pretty tedious. Is this our only way forward for CUDA > 9.0?
I don't believe __shfl_sync(__activemask(), ...) will work. This was noted in the linked question and many other places online.
The linked question doesn't show any such usage. Furthermore, the canonical blog specifically says that usage is the one that satisfies this:
Shuffle/broadcast to whatever threads happen to be here at this time.
The blog states that this is incorrect usage:
//
// Incorrect use of __activemask()
//
if (threadIdx.x < NUM_ELEMENTS) {
    unsigned mask = __activemask();
    val = input[threadIdx.x];
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(mask, val, offset);
}
(which is conceptually similar to the usage given in your linked question.)
But for "opportunistic" usage, as defined in that blog, it actually gives an example of usage in listing 9 that is similar to the one that you state "won't work". It certainly does work following exactly the definition you gave:
Shuffle/broadcast to whatever threads happen to be here at this time.
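That pattern, in a minimal sketch (illustrative, in the spirit of the blog's listing 9; the lowest active lane is elected leader rather than using a predetermined warpLeader):
unsigned mask = __activemask();        // whichever threads happen to be here
int leader = __ffs(mask) - 1;          // lowest-numbered active lane
int v = __shfl_sync(mask, f, leader);  // broadcast f from that leader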
If your algorithm intent is exactly that, it should work fine. However, for many cases, that isn't really a correct description of the algorithm intent. In those cases, the blog recommends a stepwise process to arrive at a correct mask:
Don’t just use FULL_MASK (i.e. 0xffffffff for 32 threads) as the mask value. If not all threads in the warp can reach the primitive according to the program logic, then using FULL_MASK may cause the program to hang.
Don’t just use __activemask() as the mask value. __activemask() tells you what threads happen to be convergent when the function is called, which can be different from what you want to be in the collective operation.
Do analyze the program logic and understand the membership requirements. Compute the mask ahead based on your program logic.
If your program does opportunistic warp-synchronous programming, use “detective” functions such as __activemask() and __match_all_sync() to find the right mask.
Use __syncwarp() to separate operations with intra-warp dependences. Do not assume lock-step execution.
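As an illustration of step 3 applied to the loop in your question (a sketch only: it assumes the full warp is converged before the loop and that lanes advance their loop counters in step; the shuffle is valid only if warpLeader is in the mask actually passed):
unsigned mask = __ballot_sync(0xffffffffu, k < stop);  // lanes entering the loop
for (int i = k; i < stop; i += skip)
{
    // Compute this iteration's branch membership before diverging.
    unsigned taken = __ballot_sync(mask, mem[threadIdx.x + i/2] == foo);
    if (mem[threadIdx.x + i/2] == foo)
    {
        f = __shfl_sync(taken, f, warpLeader);  // requires warpLeader in 'taken'
    }
    mask = __ballot_sync(mask, i + skip < stop);  // lanes continuing the loop
}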
Note that steps 1 and 2 are not contradictory to other comments. If you know for certain that you intend the entire warp to participate (not typically known in an "opportunistic" setting), then it is perfectly fine to use a hardcoded full mask.
If you really do intend the opportunistic definition you gave, there is nothing wrong with the usage of __activemask() to supply the mask, and in fact the blog gives a usage example of that, and step 4 also confirms that usage, for that case.

How does MIPS r10000 fetch hide instruction cache latency?

I am studying the different pipeline stages of the MIPS R10000. The paper says that the processor fetches 4 instructions from the instruction cache every cycle. But the latency of the instruction cache must be more than one cycle; I don't know its exact hit latency, but the hit latency of the L1 data cache in a Haswell processor is about 4 cycles.
So, if we assume the L1 instruction cache latency is 3-4 cycles, how can the processor fetch 4 instructions each cycle?
The MIPS R10000 had a single cycle latency instruction cache and could fetch a contiguous block of four instructions within a cache block without alignment constraints.
Mechanically, this probably meant that it used four SRAM banks with at least partially independent addressing (the cache set address decode could be shared).
Since each bank is independently addressable, any contiguous sequence of four words within a sixteen-word block (four banks of four rows) can be accessed. Addressing rows [0, 0, 0, 0] gets words [0, 1, 2, 3] (words 0-3); rows [1, 0, 0, 0] gets words [4, 1, 2, 3] (words 1-4); rows [1, 1, 0, 0] gets words [4, 5, 2, 3] (words 2-5); ...; rows [3, 3, 3, 2] gets words [12, 13, 14, 11] (words 11-14); rows [3, 3, 3, 3] gets words [12, 13, 14, 15] (words 12-15).
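To make the row selection concrete, here is a small sketch (my illustration, not from the R10000 documentation) of which row each of the four banks must drive for an unaligned four-word fetch:
// Bank b serves words with (word % 4 == b); for a fetch starting at
// startWord (0-15), each bank supplies the first word >= startWord
// that maps to it, and the row is that word / 4.
void bankRows(int startWord, int rows[4])
{
    for (int bank = 0; bank < 4; ++bank)
    {
        int w = startWord + ((bank - startWord % 4 + 4) % 4);
        rows[bank] = w / 4;
    }
}
// e.g., bankRows(1, rows)  -> rows = {1, 0, 0, 0} (words 1-4)
//       bankRows(11, rows) -> rows = {3, 3, 3, 2} (words 11-14)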
(The same banking could cross cache block boundaries, but then two cache block hits would have to be confirmed in parallel. Memoization of the way for the previous access would reduce this to one set check for a common case of sequential accesses in largish cache blocks; one set would use the memoized way and the other would perform the normal check when entering a new cache block. Page crossing is a similar issue.)
(A common alternative for multiple instruction fetch does have an alignment constraint of a naturally aligned chunk of, e.g., 16 bytes.)
This processor did not redirect instruction fetch until a branch was detected in the second pipeline stage (decode), so a taken branch introduced a one cycle bubble even with a correct prediction. An incorrect prediction might not be determined until some cycles later because execution started in the fourth pipeline stage and instructions were executed out-of-order. (An incorrectly predicted taken branch could decode instructions already fetched in the taken branch bubble as these were stored in a "resume cache".)
Buffering of instructions can smooth out such bubbles, since throughput rarely approaches the maximum anyway because of data dependencies and other hazards.
In general, a cache can provide multiple words per fetch (a natural alignment restriction facilitates a single bank providing the chunk) or be accessed multiple times per cycle (e.g., more deeply pipelining the instruction cache than other parts of the pipeline or using expensive multiported SRAM).
As long as a new address is provided every cycle, a fetch of multiple contiguous instructions can be done every cycle. If two addresses are available (predicted) per cycle, instructions after a taken branch could be fetched in the same cycle. (Another method of reducing the taken branch penalty — and providing other post-branch optimization opportunities — is to use a trace cache.)

Monte Carlo with rates, system simulation with CUDA C++

So I am trying to simulate a 1-D physical model called TASEP.
I wrote code to simulate this system in C++, but I definitely need a performance boost.
The model is very simple (C++ code below): an array of 1's and 0's. 1 represents a particle and 0 means no particle, i.e., empty. A particle moves one element to the right, at rate 1, if that element is empty. A particle at the last location will disappear at rate beta (say 0.3). Finally, if the first location is empty, a particle will appear there at rate alpha.
The single-threaded version is easy: I just pick an element at random and act with probability 1, alpha, or beta, as written above. But this can take a lot of time.
So I tried to do a similar thing with many threads, using the GPU, and that raised a lot of questions:
Is using the GPU and CUDA a good idea at all for such a thing?
How many threads should I have? I can have a thread for each site (10E+6); should I?
How do I synchronize access to memory between different threads? So far I have used atomic operations.
What is the right way to generate random data? If I use a million threads, is it OK to have a random generator for each?
How do I take care of the rates?
I am very new to CUDA. I managed to run code from the CUDA samples and some tutorials. Although I have some code for the above (it still gives strange results, though), I am not posting it here, because I think the questions are more general.
So here is the c++ one threaded version of it:
int Tasep()
{
    const int L = 750000;
    // rates
    int alpha = 330;
    int beta = 300;
    int ProbabilityNormalizer = 1000;
    static bool system[L]; // static: 750 kB would risk overflowing the stack
    int pos = 0;
    InitArray(system); // init to 0's and 1's

    /* Loop */
    for (long long j = 0; j < 10LL * L * L; j++) // note: 10*L*L overflows int
    {
        unsigned long randomNumber = xorshf96();
        pos = randomNumber % L; // pick a random location in the array
        if (pos == 0 && system[0] == 0) // first site, and empty
            system[0] = (alpha > (xorshf96() % ProbabilityNormalizer)); // insert a particle with chance alpha
        else if (pos == L - 1) // last site
            system[L - 1] = system[L - 1] && (beta < (xorshf96() % ProbabilityNormalizer)); // remove a particle, if present, with chance beta
        else if (system[pos] && !system[pos + 1]) // current site occupied and the next one empty: jump right
        {
            system[pos] = false;
            system[pos + 1] = true;
        }
        if ((j % 1000) == 0) // just do some logging
            Log(system, j);
    }
    getchar();
    return 0;
}
I would be truly grateful to whoever is willing to help and give advice.
I think that your goal is to perform something called a Monte Carlo simulation,
but I have failed to fully understand your main objective (i.e., a frequency, an average power lost, etc.).
Question 01
Since you asked about random data: I do believe you can have multiple random seeds (maybe one for each thread). I would advise you to generate the seeds on the GPU using any pseudo-random generator (you can even use the same one as on the CPU), store them in GPU global memory, and launch as many threads as you can, using dynamic parallelism.
So, yes, CUDA is a suitable approach, but keep in mind the balance between the time you will need to learn it and the time you will need to get results from your current code.
If you will use this knowledge in the future, learning CUDA may be worth it; it is also worth it if you can scale your code across many GPUs, it is taking too much time on the CPU, and you will need to solve this equation very often. But if it is a simple one-time result, and it looks like you are close, I would advise you to let the CPU solve it, because, from my experience, you will probably spend more time learning CUDA than the CPU will take to solve it (IMHO).
Question 02
How many threads to use is a very common rookie question. The answer is very dependent on your project, but taking your code as an insight, I would use as many as I can, giving every thread a different seed.
My suggestion is to use registers for what you call "sites" (be aware that there are strong limitations) and then run multiple loops to evaluate your particles, in the same way a car tire rolls over a bad road (the data sitting in shared memory), so your L is limited to 255 per loop (avoid spills at all cost; fewer registers means more warps per block). To create perturbation, I would load vectors into shared memory: one for alpha (short), one for beta (short) (I assume different distributions), one for "particle present or not" at the next site (char), and another two to combine with threadIdx, blockIdx, and some current-time info as a pseudo-random source (to help you pick the initial alpha, beta, and occupancy).
This way you can reuse these rates for every thread in the block, and since the data do not change (only the reading position changes) you have to synchronize only once after reading; you can also pick the perturbation position at random and reuse the data. The initial values can be loaded from global memory and "refreshed" after a specific number of loops to hide the loading latency. In short, you reuse the same data in shared memory multiple times, but the values selected by each thread change at every iteration due to the pseudo-random value.
Taking into account that you are talking about large numbers and that you can load different data in every block, the pseudo-random algorithm should be good enough. You can even use results stored on the GPU from previous runs as a random source, flip one variable and do some bit operations, so you can use every bit as a particle.
Question 03
For your specific project I would strongly recommend avoiding thread cooperation and making the threads completely independent. That said, you can use shuffles within the same warp at low cost.
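To make the atomics point concrete, here is a minimal sketch of an independent per-thread move (illustrative only: all names are invented, the positions are assumed to be pre-chosen bulk sites, and such a parallel sweep is not exactly equivalent to the serial dynamics):
// Each thread attempts one hop at its pre-chosen bulk site p (0 < p < L-1).
// atomicCAS guarantees that at most one thread claims a given destination.
__global__ void tryMoves(int *site, const int *chosenPos, int nMoves)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= nMoves) return;
    int p = chosenPos[id];
    if (site[p] == 1 && atomicCAS(&site[p + 1], 0, 1) == 0)
        site[p] = 0;  // the particle hopped p -> p + 1
}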
Question 04
It is hard to generate truly random data, but what you should worry about is the period (any pseudo-random generator has a period, after which it repeats). I would suggest using a single generator that can work in parallel with your kernel and feed your kernels (you can use dynamic parallelism). In your case, since you just want some randomness, you should not worry much about consistency. I gave an example of pseudo-random data use in the previous question; that may help. Keep in mind that there is no true random generator, though there are alternatives, such as internet bits, for example.
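For the per-thread generator question specifically, the stock alternative is cuRAND's device API: one state per thread, same seed, a different subsequence for each. A minimal sketch (kernel and variable names are mine):
#include <curand_kernel.h>

__global__ void initRng(curandState *states, unsigned long long seed, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n)
        curand_init(seed, id, 0, &states[id]);  // per-thread subsequence
}

__global__ void step(curandState *states, bool *system, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= n) return;
    curandState local = states[id];    // work on a register copy
    float u = curand_uniform(&local);  // uniform in (0, 1]
    // ... compare u against alpha/beta to accept or reject a move on system ...
    states[id] = local;                // write the state back
}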
Question 05
Already covered under Question 03, but keep in mind that you do not need a full sequence of values, only enough of one to use in multiple ways, to give your kernel enough time to process; then you can refresh your sequence. If you guarantee not to feed blocks with the same sequence, it will be very hard to fall into patterns.
Hope I have helped. I have been working with CUDA for a bit more than a year; I started just like you, and I still improve my code every week; now it is almost good enough. Now I see how well it fits my statistical challenge: clustering random things.
Good luck!

CUDA memory operation order within a single thread

From the CUDA Programming Guide (v. 5.5):
The CUDA programming model assumes a device with a weakly-ordered memory model, that is:
The order in which a CUDA thread writes data to shared memory, global memory, page-locked host memory, or the memory of a peer device is not necessarily the order in which the data is observed being written by another CUDA or host thread;
The order in which a CUDA thread reads data from shared memory, global memory, page-locked host memory, or the memory of a peer device is not necessarily the order in which the read instructions appear in the program, for instructions that are independent of each other.
However, do we have a guarantee that the (dependent) memory operations, as seen from a single thread, are actually consistent? If I do, say:
arr[x] = 1;
int z = arr[y];
where x happens to be equal to y, and no other thread is touching the memory, do I have a guarantee that z is 1? Or do I still need to put some volatile or a barrier between those two operations?
In response to Orpedo's answer.
If your compiler doesn't compile the functionality stated by your code into equal functionality in machine-code, the compiler is either broken or you haven't taken the optimizations into consideration...
My problem is what optimizations (done either by compiler or hardware) are allowed?
It could happen, for example, that the store instruction is non-blocking and the load instruction that follows is somehow serviced by the memory controller faster than the already queued-up store.
I don't know CUDA hardware. Do I have a guarantee that the above will never happen?
The CUDA Programming Guide is simply stating that you cannot predict the order in which the threads are executed, but every single thread will still run as a sequential thread.
In the example you state, where x and y are the same and NO OTHER THREAD is touching the memory, you DO have a guarantee that z = 1.
The point here being that if you have several threads doing operations on the same data (e.g., an array), you are NOT guaranteed that thread #9 executes before thread #10.
Take an example:
__device__ void sum_all(float *x, float *result, int N)
{
    x[threadIdx.x] = threadIdx.x;
    result[threadIdx.x] = 0;
    for (int i = 0; i < N; i++)
        result[threadIdx.x] += x[i];
}
Here we have some dumb function, which SHOULD fill a shared array (x) with the numbers from m ... n (read: from one number to another), then sum up the numbers already put into the array and store the result in another array.
Given that your lowest-indexed thread is enumerated thread #0, you would expect the first time this code runs that x should contain
x[] = {0, 0, 0 ... 0} and result[] = {0, 0, 0 ... 0}
next for thread #1
x[] = {0, 1, 0 ... 0} and result[] = {0, 1, 0 ... 0}
next for thread #2
x[] = {0, 1, 2 ... 0} and result[] = {0, 1, 3 ... 0}
and so forth.
But this is NOT guaranteed. You can't know if e.g. thread #3 runs first, hence changing the array x[] before thread #0 runs. You actually don't even know if the arrays are changed by some other thread while you are executing the code.
I am not sure if this is explicitly stated in the CUDA documentation (I wouldn't expect it to be), as this is a basic principle of computing. Basically, what you are asking is whether running your code on a GFX will change the functionality of your code.
The cores of a GPU are generally much the same as those of a CPU, just with less control arithmetic, a smaller instruction set, and typically only single-precision support.
In a CUDA GPU there is 1 program counter for each warp (a section of 32 synchronous cores). Like on a CPU, the program counter increases by one address element after each instruction, unless you have branches or jumps. This gives the sequential flow of the program, and this cannot be changed.
Branches and jumps can only be introduced by the software running on the core, and hence are determined by your compiler. Compiler optimizations can in fact change the functionality of your code, but only in the case where the code is implemented "wrong" with respect to the compiler.
So in short: your code will always be executed in the order it is laid out in memory, no matter whether it is executed on a CPU or a GPU. If your compiler doesn't compile the functionality stated by your code into equal functionality in machine-code, the compiler is either broken or you haven't taken the optimizations into consideration...
Hope this was clear enough :)
As far as I understand, you're basically asking whether memory dependences and alias-analysis information are respected in the CUDA compiler.
The answer to that question is, assuming that the CUDA compiler is free of bugs, yes, because, as Robert noted, the CUDA compiler uses LLVM under the hood, and two basic passes (which, at the moment, I really don't think could be excluded from the pipeline) are:
Memory dependence analysis
Alias Analysis
These two passes detect memory locations potentially pointing to the same address and use liveness analysis on variables (even outside the block scope) to avoid dangerous optimizations (e.g., you can't overwrite a live variable before its next read; the data may still be useful).
I don't know the compiler internals, but assuming (like for any other reasonably trusted compiler) that it will do its best to be bug-free, the analyses that take place in there should really not bother you at all and should assure you that, at least in theory, what you just presented as an example (i.e., the dependent load completing faster than the store) cannot happen.
What guarantees you that? Nothing but the fact that the company is providing the compiler, and there are disclaimers for the exceptional cases in which it doesn't hold :)
Also, aside from the compiler topic, instruction execution depends on the hardware specification as well; in this case, a SIMT hardware instruction-issuing unit.
cfr. http://www.csl.cornell.edu/~cbatten/pdfs/kim-simt-vstruct-isca2013.pdf and all the referenced papers for more information

My simple Turing machine

I'm trying to understand and implement the simplest Turing machine and would like feedback on whether I'm making sense.
We have an infinite tape (let's say an array called T with a pointer at 0 at the start) and an instruction table:
( S , R , W , D , N )
S->STEP (Start at step 1)
R->READ (0 or 1)
W->WRITE (0 or 1)
D->DIRECTION (0=LEFT 1=RIGHT)
N->NEXTSTEP (Non-existing step is HALT)
My understanding is that a 3-state, 2-symbol machine is the simplest one. The "3-state" part I don't understand. "2-symbol" because we use 0 and 1 for READ/WRITE.
For example:
(1,0,1,1,2)
(1,1,0,1,2)
Starting at step 1: if the read is 0 then {Write 1, Move Right} else {Write 0, Move Right}, and then go to step 2, which does not exist, which halts the program.
What does 3-state mean? Does this machine pass as a Turing machine? Can we simplify it more?
I think the confusion might come from your use of "steps" instead of "states". You can think of a machine's state as the value it has in its memory (although as a previous poster noted, some people also take a machine's state to include the contents of the tape -- however, I don't think that definition is relevant to your question). It's possible that this change in terminology might be at the heart of your confusion. Let me explain what I think it is you're thinking. :)
You gave lists of five numbers -- for example, (1,0,1,1,2). As you correctly state, this should be interpreted (reading from left to right) as "If the machine is in state 1 AND the current square contains a 0, print a 1, move right, and change to state 2." However, your use of the word "step" seems to suggest that that "step 2" must be followed by "step 3", etc., when in reality a Turing machine can go back and forth between states (and of course, there can only be finitely many possible states).
So to answer your questions:
Turing machines keep track of "states" not "steps";
What you've described is a legitimate Turing machine;
A simpler (albeit otherwise uninteresting) Turing machine would be one that starts in the HALT state.
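For concreteness, here is a tiny simulator for the (S, R, W, D, N) table described in the question (a sketch: the tape is bounded rather than infinite, which is harmless here since these rules only move right):
#include <cstdio>
#include <map>
#include <vector>

struct Rule { int write, dir, next; };  // D: 0 = LEFT, 1 = RIGHT

int main()
{
    // The two rules from the question: (1,0,1,1,2) and (1,1,0,1,2).
    std::map<std::pair<int, int>, Rule> rules = {
        {{1, 0}, {1, 1, 2}},
        {{1, 1}, {0, 1, 2}},
    };
    std::vector<int> tape(16, 0);  // "infinite" tape, truncated
    int head = 0, state = 1;
    while (true)
    {
        auto it = rules.find({state, tape[head]});
        if (it == rules.end()) break;  // non-existing step: HALT
        tape[head] = it->second.write;
        head += it->second.dir ? 1 : -1;
        state = it->second.next;
    }
    for (int cell : tape) std::printf("%d", cell);
    std::printf("\n");  // one rule fires, then the machine halts in state 2
}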
Response to comment:
Correct me if I'm misinterpreting your comment, but I did not mean to suggest a Turing machine could be in more than one state at a time, only that the number of possible states can be any finite number. For example, for a 3-state machine, you might label the possible states A, B, and C. (In the example you provided, you labeled the two possible states as '1' and '2') At any given time, exactly one of those values (states) would be in the machine's memory. We would say, "the machine is in state A" or "the machine is in state B", etc. (Your machine starts in state '1' and terminates after it enters state '2').
Also, it's no longer clear to me what you mean by a "simpler/est" machine. The smallest known Universal Turing machine (i.e., a Turing machine that can simulate another Turing machine, given an appropriate tape) requires 2 states and 5 symbols (see the relevant Wikipedia article).
On the other hand, if you're looking for something simpler than a Turing machine with the same computation power, Post-Turing machines might be of interest.
I believe that the concept of state is basically the same as in finite state machines. If I recall correctly, you need a separate termination state, to which the Turing machine can transition after it has finished running the program. As for why 3 states, I'd guess that the other two states are for initialisation and execution respectively.
Unfortunately none of that is guaranteed to be correct, but I thought I'd post my thoughts anyway, since the question had been unanswered for 5 hours. I suspect that if you were to re-ask this question on cstheory.stackexchange.com you might get a better/more definitive answer.
"State" in the context of Turing machines should be clarified as to which is being described: (i) the current instruction, or (ii) the list of symbols on the tape together with the current instruction, or (iii) the list of symbols on the tape together with the current instruction placed to the left of the scanned symbol or to the right of the scanned symbol. Reference