CUDA kernel with different array sizes

I am working on a fluid dynamics problem in CUDA and ran into the following issue:
if I have an array, e.g. debug_array, with length 600 and an array
value_array with length 100, and I want to do something like
for (int i = 0; i < 6; i++)
{
    debug_array[bx*block_size+tx+i] = value_array[bx*block_size+tx];
}
block_size in this example would be based on the 100-element array, e.g.
4 blocks with block_size 25.
If value_array contains e.g. 10; 20; 30; ...
I would expect debug_array to hold groups of 6 identical values, like
10;10;10;10;10;10;20;20;20;20;20;20;30;...
The problem is that it is not picking up all values from value_array. Any idea
why this isn't working, or a good workaround?
What does work is defining float val = value_array[bx*block_size+tx]; outside the for loop and keeping debug_array[bx*block_size+tx+i] = val; inside the loop.
But I would like to avoid that, as my kernels have between 5 and 10 device function calls inside the loop and it makes them hard to read.
Thanks in advance; any advice will be appreciated.
Markus

There seems to be an error in computing the index:
Let's assume bx = 0 and tx = 0:
The first 6 elements of debug_array will be filled with data.
Next thread, tx = 1: elements 1 to 6 will be filled with data (overwriting existing data).
Because the threads run in parallel, it is not determined which thread is scheduled first, and therefore which values end up in debug_array.
You should have written:
debug_array[6*(bx*block_size+tx)+i] = value_array[bx*block_size+tx];
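For concreteness, here is a minimal sketch of a kernel using the corrected indexing; the kernel name and signature are illustrative, not from the question:

__global__ void expand(const float *value_array, float *debug_array)
{
    // With <<<4, 25>>> this covers indices 0..99 of value_array.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Each thread owns a disjoint group of 6 slots in debug_array (600 total).
    for (int i = 0; i < 6; i++)
        debug_array[6 * idx + i] = value_array[idx];
}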

If changing the code to move the value_array expression out of the loop and into a temp variable makes the code work - and that is the only code change you made - then this smells like a compiler bug.
Try changing your nvcc compiler options to reduce or disable optimizations and see if the value_array expression inside the loop changes behavior. Also, make sure you're using the latest CUDA tools.
Optimizing compilers often move expressions that do not depend on the loop index variable out of the loop, exactly like your manual workaround. This is called "loop-invariant code motion", and it makes loops faster by reducing the amount of code executed in each iteration. If manually hoisting the invariant code out of the loop works, but letting the compiler do it on its own doesn't, that casts doubt on the compiler.
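To illustrate, here is the transformation written out by hand, using the names from the question (the indexing is left as the question has it; see the first answer for the index fix):

// Before: the value_array load is re-evaluated on every iteration.
for (int i = 0; i < 6; i++)
    debug_array[bx*block_size+tx+i] = value_array[bx*block_size+tx];

// After loop-invariant code motion: the load happens exactly once.
float val = value_array[bx*block_size+tx];
for (int i = 0; i < 6; i++)
    debug_array[bx*block_size+tx+i] = val;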

Related

Monte Carlo with rates, system simulation with CUDA C++

So I am trying to simulate a 1-D physical model called TASEP (totally asymmetric simple exclusion process).
I wrote code to simulate this system in C++, but I definitely need a performance boost.
The model is very simple (C++ code below): an array of 1's and 0's, where 1 represents a particle and 0 means the site is empty. A particle moves one site to the right, at rate 1, if that site is empty. A particle at the last site disappears at rate beta (say 0.3). Finally, if the first site is empty, a particle appears there at rate alpha.
The single-threaded version is easy: I just pick a site at random and act with probability 1, alpha, or beta, as described above. But this can take a lot of time.
So I tried to do a similar thing with many threads, using the GPU, and that raised a lot of questions:
Is using the GPU and CUDA at all a good idea for such a thing?
How many threads should I have? I could have a thread for each site (10E+6); should I?
How do I synchronize access to memory between different threads? I have used atomic operations so far.
What is the right way to generate random data? If I use a million threads, is it OK to have a random generator for each?
How do I take care of the rates?
I am very new to CUDA. I have managed to run code from the CUDA samples and some tutorials. I do have some code for the above (it still gives strange results, though), but I am not posting it here because I think the questions are more general.
So here is the single-threaded C++ version:
#include <cstdio> // getchar

// Helpers defined elsewhere in the original code:
unsigned long xorshf96();             // fast xorshift PRNG
void InitArray(bool *system);         // init to 0's and 1's
void Log(const bool *system, long long j);

int Tasep()
{
    const int L = 750000;
    // rates
    int alpha = 330;
    int beta = 300;
    int ProbabilityNormalizer = 1000;
    bool system[L];                   // note: ~750 KB on the stack
    int pos = 0;
    InitArray(system);                // init to 0's and 1's
    /* Loop (10*L*L overflows a 32-bit int, so use a 64-bit counter) */
    for (long long j = 0; j < 10LL * L * L; j++)
    {
        unsigned long randomNumber = xorshf96();
        pos = (randomNumber % L);     // pick a random location in the array
        if (pos == 0 && system[0] == 0)           // first site and empty
            system[0] = (alpha > (xorshf96() % ProbabilityNormalizer)); // insert a particle with chance alpha
        else if (pos == L - 1)                    // last site
            system[L - 1] = system[L - 1] && (beta < (xorshf96() % ProbabilityNormalizer)); // remove a particle, if present, with chance beta
        else if (system[pos] && !system[pos + 1]) // current site occupied, next one empty: jump right
        {
            system[pos] = false;
            system[pos + 1] = true;
        }
        if ((j % 1000) == 0)          // just do some logging
            Log(system, j);
    }
    getchar();
    return 0;
}
I would be truly grateful to whoever is willing to help and give advice.
I think that your goal is to perform what is called a Monte Carlo simulation, but I have failed to fully understand your main objective (i.e. a frequency, an average power loss, etc.).
Question 01
Since you asked about random data: yes, you can have multiple random seeds (maybe one for each thread). I would advise you to generate the seeds on the GPU using any pseudo-random generator (you can even use the same one as on the CPU), store them in GPU global memory, and launch as many threads as you can using dynamic parallelism.
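As a minimal sketch of that idea using cuRAND (kernel names and indexing are illustrative, and this uses ordinary host-side launches rather than dynamic parallelism):

#include <curand_kernel.h>

// One RNG state per thread: same seed, different sequence number,
// which gives statistically independent streams.
__global__ void init_rng(curandState *states, unsigned long long seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curand_init(seed, tid, 0, &states[tid]);
}

__global__ void kernel_using_rng(curandState *states, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState local = states[tid];    // work on a register copy
    out[tid] = curand_uniform(&local);  // uniform float in (0, 1]
    states[tid] = local;                // save state back for the next launch
}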
So yes, CUDA is a suitable approach, but keep in mind the balance between the time you will need to learn it and the time it would take to get the result from your current code.
If you will make use of this knowledge in the future, learning CUDA may be worth it; it is also worth it if you can scale your code across many GPUs, it is taking too much time on the CPU, and you will need to solve this kind of problem very often. But if it is a simple one-time result, I would advise you to let the CPU solve it, because, from my experience, you will probably spend more time learning CUDA than the CPU would take to solve it (IMHO).
Question 02
How many threads to use is a very common question from rookies. The answer is very dependent on your project, but taking your code as an insight, I would use as many as I can, giving every thread a different seed.
My suggestion is to use registers for what you call "sites" (be aware that there are strong limitations), and then run multiple loops to evaluate your particles, in the very same way a car tire rolls over a bad road (the road being the data in shared memory), so your L is limited to 255 per loop (avoid spills at all cost to your project; fewer registers means more warps per block).
To create perturbations, I would load vectors into shared memory: one for alpha (short), one for beta (short) (I assume they have different distributions), one for "particle present or not" at the next site (char), and another two to combine with the thread ID, block ID, and some current-time info into a pseudo-random source (to help you pick the initial alpha, beta, and occupancy). That way you can reuse these rates for every thread in the block, and since the data does not change (only the reading position changes) you have to synchronize only once after loading; you can also randomly pick the perturbation position and reuse the data. The initial values can be loaded from global memory and "refreshed" after a specific number of loops to hide the load latency. A sketch of this pattern follows below.
In short, you reuse the same data in shared memory multiple times, but the values selected by each thread change at every iteration due to the pseudo-random index. Taking into account that you are dealing with large numbers and that you can load different data into every block, the pseudo-random algorithm should be good enough. You can even use results stored in the GPU from previous runs as a random source, flip one variable and do some bit operations, so you can use every bit as a particle.
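Here is a loose sketch of that reuse pattern, assuming 256 threads per block; every name, size, and constant below is illustrative rather than taken from the question:

__global__ void mcStep(const short *gAlpha, const short *gBeta, unsigned int seed)
{
    // Per-block lookup vectors: loaded once, then only read.
    __shared__ short sAlpha[256];
    __shared__ short sBeta[256];

    int tid = threadIdx.x;
    sAlpha[tid] = gAlpha[blockIdx.x * 256 + tid];
    sBeta[tid]  = gBeta[blockIdx.x * 256 + tid];
    __syncthreads(); // sync once after loading; the data never changes

    // Cheap per-thread xorshift, seeded from thread and block IDs.
    unsigned int s = seed ^ (tid * 2654435761u) ^ (blockIdx.x << 16);
    for (int it = 0; it < 1000; ++it)
    {
        s ^= s << 13; s ^= s >> 17; s ^= s << 5;
        short a = sAlpha[s & 255];       // random pick; shared data is reused
        short b = sBeta[(s >> 8) & 255];
        // ... evaluate a particle move with rates a and b ...
        (void)a; (void)b;                // placeholder for the actual update
    }
}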
Question 03
For your specific project I would strongly recommend avoiding thread cooperation and making the threads completely independent. But you can use shuffle within the same warp at low cost.
Question 04
It is hard to generate truly random data, but you should worry about the generator's period (every pseudo-random generator has a period after which it repeats). I would suggest using a single generator that can run in parallel with your kernel and feed your kernels (you can use dynamic parallelism). In your case, since you just want some randomness, you should not worry much about consistency. I gave an example of using pseudo-random data in the previous question; that may help. Keep in mind that there is no true random generator in software, but there are alternatives, such as bits gathered from the internet, for example.
Question 05
Already explained under Question 03, but keep in mind that you do not need a full sequence of values, only a large enough part to use in multiple ways, giving your kernel enough time to process it before you refresh the sequence. If you guarantee not to feed any block the same sequence twice, it will be very hard to fall into patterns.
Hope I have helped. I have been working with CUDA for a bit more than a year; I started just like you, and still every week I improve my code, and now it is almost good enough. Now I can see how well it fits my statistical challenge: clustering random things.
Good luck!

Possible to call subfunction in S-function level-2

I have been trying to convert my level-1 S-function to level-2, but I got stuck at calling another subfunction from function Outputs(block). I tried looking for other threads, but to no avail; do you mind providing related links?
My output depends on a lot of processing of the inputs; this is why I need to call the sub-function to calculate and then return the output values. All the examples I can find calculate their outputs directly in "function Outputs(block)"; in my case I thought that was not possible.
I then tried to use the Interpreted MATLAB Function block but failed, because the output dimension is NOT the same as the input dimension, and it also does not support returning more than ONE output.
Dear Sir/Madam,
I read in the S-function documentation that a level-1 S-function "supports vector inputs and outputs. DOES NOT support multiple input and output ports".
Does the second sentence mean the input and output dimensions MUST BE the same?
I have been using S-function level-1 to do the following:
[a1, b1] = choose_cells(c, d);
where a1 and b1 are outputs, c and d are inputs. All the variables are having a single value, except d is an array with 6 values.
Referring to the image attached: in an S-function block the input dimension must be the SAME as the output dimension, or we get an error. In this case the input dimension is 7 while the output dimension is 2, so I have to include the "Terminator" blocks in the diagram for it to work; otherwise I get an error.
My problem is that when the system gets bigger, the array d could contain hundreds of variables. With this method I would have to add hundreds of "Terminator" blocks to make it work, which definitely does not sound practical.
Could you please suggest a sensible way to implement this?
Thanks in advance.
http://imgur.com/ib6BTTp

Error assigning values to bytes in a 2D array of registers in Verilog

Hi, when I write this piece of code:
module memo(out1);
    reg [3:0] mem [2:0];
    output wire [3:0] out1;
    initial
    begin
        mem[0][3:0] = 4'b0000;
        mem[1][3:0] = 4'b1000;
        mem[2][3:0] = 4'b1010;
    end
    assign out1 = mem[1];
endmodule
I get the following warnings, which make the code unsynthesizable:
WARNING:Xst:1780 - Signal mem<2> is never used or assigned. This unconnected signal will be trimmed during the optimization process.
WARNING:Xst:653 - Signal mem<1> is used but never assigned. This sourceless signal will be automatically connected to value 1000.
WARNING:Xst:1780 - Signal mem<0> is never used or assigned. This unconnected signal will be trimmed during the optimization process.
Why am I getting these warnings?
Haven't I assigned the values of mem[0], mem[1] and mem[2]?! Thanks for your help!
Your module has no inputs and a single output, out1. I'm not totally sure what the point of the module is with respect to your larger system, but you're basically initializing mem and then only using mem[1]. You could equivalently have a module that just assigns out1 the value 4'b1000 (mem never changes). So yes, you did initialize the array, but because you didn't use any of the other values, the Xilinx tools are optimizing your module during synthesis and "trimming the fat." If you were to simulate this module (say, in ModelSim) you'd see your initializations just fine. Based on your warnings, though, I'm not sure why you've concluded that your code is unsynthesizable. It appears to me that you could definitely synthesize it, but that it's just a rather roundabout way to assign a single value of 4'b1000.
With regard to using initial blocks to store values in block RAM (e.g. to make a ROM), that's fine; I've done it several times without issue. A common use is to store coefficients in block RAM, which are read out later. That said, the way this module is written there's no way to read anything out of mem anyway.

CUDA: sum of data on a global memory variable

I've launched a kernel with 2100 blocks and 4 threads per block.
At some point in this kernel, all the threads have to execute a function and put its result into an array (in global memory) at position "threadIdx.x".
I know for certain that, in this phase of the project, the function always returns 1.012086.
Now, I've written this code to do the sum:
currentErrors[threadIdx.x] = 0;
for (i = 0; i < gridDim.x; i++)
{
    if (i == blockIdx.x)
    {
        currentErrors[threadIdx.x] += globalError(mynet, myoutput);
    }
}
But when the kernel ends, every position of the array holds the value 1.012086 (instead of 1.012086 * 2100).
Where am I going wrong?
Thanks for your help!
To compute a final sum out of partial results of your blocks, I would suggest doing it the following way:
Let every block write a partial result into a separate cell of a gridDim.x-sized array.
Copy the array to host.
Perform final sum on the host.
I assume each block has a lot to compute on its own, which would warrant the usage of CUDA in the first place.
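Here is a minimal, self-contained sketch of that pattern; the kernel and variable names are illustrative, the constant 1.012086f stands in for globalError(mynet, myoutput), and the float atomicAdd on shared memory requires compute capability 2.0 or later:

#include <cstdio>
#include <cuda_runtime.h>

// Each block accumulates its threads' contributions in shared memory,
// then thread 0 writes the block's partial result into its own cell.
__global__ void partialSums(float *blockResults)
{
    __shared__ float acc;
    if (threadIdx.x == 0) acc = 0.0f;
    __syncthreads();

    atomicAdd(&acc, 1.012086f);          // stand-in for globalError(...)
    __syncthreads();

    if (threadIdx.x == 0)
        blockResults[blockIdx.x] = acc;  // one cell per block, so no races
}

int main()
{
    const int nBlocks = 2100, nThreads = 4;
    float *d_results;
    cudaMalloc(&d_results, nBlocks * sizeof(float));
    partialSums<<<nBlocks, nThreads>>>(d_results);

    // Steps 2 and 3: copy the gridDim.x partial results back, finish on the host.
    static float h_results[2100];
    cudaMemcpy(h_results, d_results, sizeof(h_results), cudaMemcpyDeviceToHost);
    float total = 0.0f;
    for (int i = 0; i < nBlocks; ++i) total += h_results[i];
    printf("total = %f\n", total);       // 2100 blocks * 4 threads * 1.012086
    cudaFree(d_results);
    return 0;
}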
As for your current state: I think something is wrong in your kernel. Every block writes into the same cells of currentErrors, so the blocks overwrite each other's results instead of accumulating a total.
The loop you presented does not really make sense either: for each block there is only one i that does anything. The code is equivalent to simply writing:
currentErrors[threadIdx.x]=0;
currentErrors[threadIdx.x]+=globalError(mynet,myoutput);
save for some unpredictable scheduling differences.
Remember that blocks are not executed in sync. Each block can run before, during or after any other block.
Also:
You may be interested in the parallel prefix sum algorithm.
You may want to check out an efficient CUDA implementation of prefix sum.

Looping inside CUDA code

I ran some CUDA code that updates an array of floats. I have a wrapper function like the one discussed in this question: How can I compile CUDA code then link it to a C++ project?
Inside my CUDA function I create a for loop like this...
int tid = threadIdx.x;
for (int i = 0; i < X; i++)
{
    // code here
}
Now the issue is that if X equals 100, everything works just fine, but if X equals 1000000, my vector does not get updated (almost as if the code inside the for loop never executes).
If, instead, I call the CUDA function in a for loop inside the wrapper function, like this, it works fine (but for some reason is significantly slower than simply doing the same work on the CPU)...
for (int i = 0; i < 1000000; i++)
{
    update<<<NumObjects, 1>>>(dev_a, NumObjects);
}
Does anyone know why I can loop a million times in the wrapper function, but not call the CUDA "update" function once and run a million-iteration for loop inside it?
You should be calling cudaThreadSynchronize (cudaDeviceSynchronize in newer toolkits) and cudaGetLastError after running this to see whether there was an error. I imagine that in the first case the kernel timed out: this happens when a kernel takes too long to complete on a GPU that is also driving a display, and the watchdog simply kills it.
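For example, a sketch of that check inside the wrapper, reusing the names from the question:

update<<<NumObjects, 1>>>(dev_a, NumObjects);

cudaError_t err = cudaGetLastError();   // launch-time errors (bad configuration, etc.)
if (err == cudaSuccess)
    err = cudaDeviceSynchronize();      // execution errors, e.g. the launch timeout
if (err != cudaSuccess)
    fprintf(stderr, "update failed: %s\n", cudaGetErrorString(err));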
The second thing, the reason it takes much longer to execute, is that there is a fixed overhead for each kernel launch. When the loop was inside the kernel, you paid this overhead once and ran the whole loop; now you are paying it X times. The overhead is fairly small, but large enough that as much of the loop as possible should be moved inside the kernel.
If X is particularly large, you might look into running as much of the loop inside each kernel launch as will complete in a safe amount of time, and then looping over those launches.
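One possible shape for that, assuming the kernel is extended with hypothetical start and count parameters so that each launch processes only a slice of the loop:

// Each launch runs CHUNK iterations, staying under the watchdog limit.
const int CHUNK = 10000;
for (int start = 0; start < 1000000; start += CHUNK)
    update<<<NumObjects, 1>>>(dev_a, NumObjects, start, CHUNK);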