Can I use AS3 Stage3D AGAL to achieve CUDA-like processing? (actionscript-3)

I have a program that detects a ball at runtime in a 320x240 stream, but if I stream at a bigger resolution it gets too slow. I'm assuming that if I could use the GPU to process each pixel (together with its neighbor frames and neighbor pixels) it would be faster. Does anyone know whether I can get data BACK from the GPU with AGAL?
In short, I have the loop below, which goes through each pixel of the frame, and I want to do as much of the calculation as possible on the GPU to achieve better performance.
for (var i:int = cv.length - 1; i > 1; i--) {
    if ((110 * 255) < (cv[i] & 0x0000FF00) && (cv[i] & 0x0000FF00) < (150 * 255)) { // i is green
        if ((cv[i + 2] & 0x0000FF00) > (150 * 255)) { // i+2 is bright
            if (floodhere(cv, i + 2)) { // area is large
                prevDiff[i] = 0xffffffff; // white
                close.push(i);
            } else {
                prevDiff[i] = 0xffff0000; // area is small -> red
            }
        } else {
            prevDiff[i] = 0xff000055; // blue
        }
    } else {
        prevDiff[i] = 0xff000000; // black
    }
}

You can use AGAL to make fast calculations on the GPU, just be aware of the limits.
It goes roughly like this:
You need to upload your data as textures (n*m matrices); one data point is a 3x8-bit value. Uploading any kind of big data to the GPU is slow, so you should not do it every frame. Getting the texture back to ActionScript is slow too.
You can also upload data to the GPU's global variable memory (but only a limited amount).
The GPU will run your AGAL program in parallel on every element of this matrix, and the output will be an n*m matrix too.
Every program instance has access to three things: its coordinates, the global variables, and the uploaded matrices. The output of your program is written to the same position in an output matrix. If you write multiple programs, they can access this output matrix quickly, but getting it back to normal memory (for ActionScript manipulation) is slow.
AGAL programs are very limited compared to ActionScript:
- max. 256 instructions.
- no loops, functions, classes. You only have mathematical operators and conditionals ("if-else").
- cannot write to the global memory

You may be able to use PixelBender. It also works in separate thread(s) and makes use of multi-core CPUs, so it is much quicker than ActionScript.
See http://www.flashmagazine.com/tutorials/detail/using_pixel_bender_to_calculate_information/ for an example

There is no way to get data back; you can only get colors back. Moreover, to get pixel colors into ActionScript you have to copy the data from a texture to a BitmapData, which is very slow.

Related

What is the fastest way to update a single float value to the GPU to access it in a CUDA kernel?

I have an OpenGL particle simulation where the position of each particle is calculated in a CUDA kernel. Most memory resides in GPU memory, but there is a single float value I have to update from the CPU each frame.
At the moment I use cudaMemcpyAsync() to copy the float value to the GPU, but (at least from what I can tell) this slows down the performance quite a bit. I tried to use nvprof to see which calls take the longest, with these results:
Calls  Avg       Min       Max       Name
  477  2.9740us  2.8160us  4.5440us  simulation(float3*, float*, float3*, float*)
  477  89.033us  18.600us  283.00us  cudaLaunchKernel
  477  47.819us  10.200us  120.70us  cudaMemcpyAsync
I think I can't really do much about the kernel launch itself, but of the calls that happen every frame, cudaMemcpyAsync() seems to take the longest.
I have also tried to use pinned memory and cudaHostGetDevicePointer() as described here, however for some reason this increases the kernel launch times even more, more than making up for the time saved by not needing the memcpy.
I guess there has to be a better/faster way to update my single float variable to the GPU?
The easiest way is to add an extra parameter to the simulation kernel as a plain float value rather than a pointer to float, so that the data travels directly in the kernel launch parameter structure that CUDA sends to the GPU when you launch the kernel. Then you avoid that data copy command altogether. (I'm assuming CUDA packs the whole kernel parameter descriptor into a single copy, since kernel parameter space is limited to a few kB or less.)
simulation(fooPointer,
           barPointer,
           fooBarPointer,
           floatVariable);
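For illustration, a minimal sketch of the idea (the kernel name and parameter list are hypothetical, loosely mirroring the signature in the profiler output; only the last parameter matters here):
__global__ void simulation(float3* pos, float* mass, float3* vel,
                           float dt)   // per-frame value passed by value
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // ... use dt directly: it arrives packed into the kernel launch
    // parameters, so no separate cudaMemcpyAsync is needed for it.
}

// host side, once per frame:
// simulation<<<blocks, threads>>>(d_pos, d_mass, d_vel, hostFloat);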
Or, try double buffering between data update and rendering, or between data update and compute, so that the rendered image lags the simulation calculation by 1-2 frames (per-frame latency gets worse) but "frames per second" increases.
If it's not an interactive simulation, hiding compute/render/data latencies by double or triple buffering should work.
If you are after minimizing per-frame latency (quicker response to user input into the simulation?), then you should embed the float variable at the end of an array that you already send to the simulation, or of whatever structure you are using. If you already send a 1 MB+ float buffer to the GPU, appending 4 B (one float) to the end of it should not make much difference, and you can then access it from there; one copy operation is faster than two copy operations of the same total size.
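A rough sketch of that approach, with hypothetical names (h_buf/d_buf stand for a staging buffer you already transfer every frame; the scalar rides along in the last slot):
// setup (once):
const int N = 1 << 18;                               // floats already sent per frame
float *h_buf, *d_buf;
cudaMallocHost(&h_buf, (N + 1) * sizeof(float));     // pinned host staging buffer
cudaMalloc(&d_buf, (N + 1) * sizeof(float));

// per frame:
h_buf[N] = floatVariable;                            // append the scalar at the end
cudaMemcpyAsync(d_buf, h_buf, (N + 1) * sizeof(float),
                cudaMemcpyHostToDevice, stream);
// inside the kernel, read the scalar as d_buf[N]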
If you are literally sending just 4 B to the GPU each frame (with a simple function to generate that data), then (as 3Dave said in the comments) you can try adding an extra kernel that updates the value on the GPU, so you only pay the kernel launch overhead instead of both the copy command overhead and the data copy overhead. On the positive side, that extra kernel's overhead might be hidden if there is a "graph" of kernels running each frame automatically, without enqueueing all of them again and again.
Here, in https://devblogs.nvidia.com/cuda-graphs/, the part:
We are going to create a simple code which mimics this pattern. We will then use this to demonstrate the overheads involved with the standard launch mechanism and show how to introduce a CUDA Graph comprising the multiple kernels, which can be launched from the application in a single operation.
cudaGraphLaunch(instance, stream);
They say the per-kernel launch overhead in this "graph" feature is only 3-4 microseconds when there are many (20) kernels in the algorithm.
Since graphs support other commands too, you can try running the copy and compute parts in parallel CUDA streams within a graph and switch their inputs with double buffering, so everything stays within CUDA's context before the output is sent to rendering.
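A hedged sketch of how such a graph could be captured with stream capture (buffer and kernel names are placeholders; the pattern follows the linked post rather than anything in the original answer):
cudaStream_t stream;
cudaGraph_t graph;
cudaGraphExec_t instance;
cudaStreamCreate(&stream);

// capture one frame's worth of work into a graph (done once)
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
cudaMemcpyAsync(d_param, &h_param, sizeof(float),
                cudaMemcpyHostToDevice, stream);           // per-frame scalar
simulation<<<blocks, threads, 0, stream>>>(d_pos, d_param, d_vel, d_mass);
cudaStreamEndCapture(stream, &graph);
// instantiate signature as used in the linked post; newer CUDA releases
// expose a flags-based overload instead
cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);

// every frame: update h_param on the host, then replay the graph in one call
cudaGraphLaunch(instance, stream);
cudaStreamSynchronize(stream);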
(Maybe) you don't even have to change the data mechanism at all. Just try sending the binary representation of the float in place of a pointer value, read only the pointer value itself (not the data it would point to) in the kernel, and convert it back to a float. I don't know whether CUDA returns an error for this as long as you never access the (bogus) pointer address that the float data represents inside the kernel.
simulation(fooPointer,
           barPointer,
           fooBarPointer,
           toPtr(floatData)   // <----- float packed into a 64/32-bit pointer value
);
and in the kernel:
float val = fromPtrToFloat(parameter4);   // converts the pointer value itself, not the data it points to
But this may not be a preferred practice if you can simply use "value" type parameters.
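A hedged sketch of what those hypothetical toPtr / fromPtrToFloat helpers could look like (the pointer is never dereferenced; it only carries the float's bit pattern):
#include <cstdint>
#include <cstring>

// host side: pack the float's bits into a pointer-sized value
static float* toPtr(float v)
{
    uint32_t bits;
    std::memcpy(&bits, &v, sizeof(bits));                  // bit-exact copy
    return reinterpret_cast<float*>(static_cast<uintptr_t>(bits));
}

// device side: unpack the pointer value back into a float
__device__ float fromPtrToFloat(const float* p)
{
    uint32_t bits = static_cast<uint32_t>(reinterpret_cast<uintptr_t>(p));
    return __uint_as_float(bits);                          // CUDA bit-cast intrinsic
}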

How to perform basic operations (+ - * /) on GPU and store the result on it

I have the following code line; gamma is a CPU variable that I will afterwards need to copy to the GPU. gamma_x and delta are also stored on the CPU. Is there any way that I can execute the following line and store its result directly on the GPU? So basically, keep gamma, gamma_x and delta on the GPU and get the output of the following line on the GPU. It would speed up my code a lot for the lines that follow.
I tried with magma_dcopy, but so far I couldn't find a way to make it work, because the output of magma_ddot is a CPU double.
gamma = -(gamma_x[i+1] + magma_ddot(i,&d_gamma_x[1],1,&(d_l2)[1],1, queue))/delta;
The very short answer is no, you can't do this, or at least not if you use magma_ddot.
However, magma_ddot is itself only a very thin wrapper around cublasDdot, and the cublas function fully supports having the result of the operation stored in GPU memory rather than returned to the host.
In theory you could do something like this:
// before the apparent loop you have not shown us:
double* dotresult;
cudaMalloc(&dotresult, sizeof(double));

for (int i = ....) {
    // ...
    // magma_ddot(i, &d_gamma_x[1], 1, &(d_l2)[1], 1, queue);
    cublasSetPointerMode(queue->cublas_handle(), CUBLAS_POINTER_MODE_DEVICE);
    // in device pointer mode the result argument must itself be a device pointer
    cublasDdot(queue->cublas_handle(), i, &d_gamma_x[1], 1, &(d_l2)[1], 1, dotresult);
    cudaDeviceSynchronize();
    cublasSetPointerMode(queue->cublas_handle(), CUBLAS_POINTER_MODE_HOST);
    // now dotresult holds the magma_ddot result in device memory
    // ...
}
Note that this might make Magma blow up, depending on how you are using it, because Magma uses CUBLAS internally and the way CUBLAS state and asynchronous operations are handled inside Magma is completely undocumented. Having said that, if you are careful, it should be OK.
To then execute your calculation, either write a very simple kernel and launch it with one thread, or perhaps use a simple thrust call with a lambda expression, depending on your preference. I leave that as an exercise to the reader.
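A minimal sketch of the single-thread-kernel route, assuming hypothetical device-side copies d_gamma and d_delta and a device-resident gamma_x (these mirror the host variables in the question; adapt as needed):
__global__ void compute_gamma(double* d_gamma, const double* d_gamma_x,
                              const double* dotresult, const double* d_delta,
                              int i)
{
    // one thread is enough for a scalar update; everything stays on the GPU
    *d_gamma = -(d_gamma_x[i + 1] + *dotresult) / (*d_delta);
}

// launched right after the cublasDdot call above:
// compute_gamma<<<1, 1>>>(d_gamma, d_gamma_x, dotresult, d_delta, i);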

Monte Carlo with rates, system simulation with CUDA C++

So I am trying to simulate a 1-D physical model named TASEP.
I wrote code to simulate this system in C++, but I definitely need a performance boost.
The model is very simple (C++ code below): an array of 1's and 0's, where 1 represents a particle and 0 means empty. A particle moves one element to the right, at rate 1, if that element is empty. A particle at the last location disappears at rate beta (say 0.3). Finally, if the first location is empty, a particle appears there at rate alpha.
The single-threaded version is easy: I just pick an element at random and act with probability 1 / alpha / beta, as written above. But this can take a lot of time.
So I tried to do a similar thing with many threads, using the GPU, and that raised a lot of questions:
Is using the GPU and CUDA at all a good idea for such a thing?
How many threads should I have? I can have a thread for each site ( 10E+6 ), should I?
How do I synchronize the access to memory between different threads? I used atomic operations so far.
What is the right way to generate random data? If I use a million threads is it ok to have a random generator for each?
How do I take care of the rates?
I am very new to CUDA. I managed to run code from the CUDA samples and some tutorials. Although I have some code for the above (it still gives strange results), I am not posting it here, because I think the questions are more general.
So here is the single-threaded C++ version of it:
int Tasep()
{
    const int L = 750000;
    // rates
    int alpha = 330;
    int beta = 300;
    int ProbabilityNormalizer = 1000;
    bool system[L];
    int pos = 0;
    InitArray(system); // init to 0's and 1's

    /* Loop */
    for (long long j = 0; j < 10LL * L * L; j++) // 64-bit counter: 10*L*L overflows a 32-bit int
    {
        unsigned long randomNumber = xorshf96();
        pos = (randomNumber % (L)); // pick a random location in the array
        if (pos == 0 && system[0] == 0) // first site and empty
            system[0] = (alpha > (xorshf96() % ProbabilityNormalizer)); // insert a particle with chance alpha
        else if (pos == L - 1) // last site
            system[L - 1] = system[L - 1] && (beta < (xorshf96() % ProbabilityNormalizer)); // remove a particle, if any, with chance beta
        else if (system[pos] && !system[pos + 1]) // current location holds a particle and the next one is empty - jump right
        {
            system[pos] = false;
            system[pos + 1] = true;
        }
        if ((j % 1000) == 0) // just do some logging
            Log(system, j);
    }
    getchar();
    return 0;
}
I would be truly grateful for whoever is willing to help and give his/her advice.
I think that your goal is to perform something called a Monte Carlo simulation, but I have failed to fully understand your main objective (i.e. get a frequency, an average power loss, etc.).
Question 01
Since you asked about random data: yes, I do believe you can have multiple random seeds (maybe one for each thread). I would advise you to generate the seeds on the GPU using any pseudo-random generator (you can even use the same one as on the CPU), store the seeds in GPU global memory, and launch as many threads as you can using dynamic parallelism.
So yes, CUDA is a suitable approach, but keep in mind the balance between the time you will need to learn it and the time your current code needs to get a result.
If you will use this knowledge in the future, learning CUDA may be worth it; it is also worth it if you can scale your code across many GPUs, it is taking too much time on the CPU, and you will need to solve this problem very often. It looks like you are close, but if this is a simple one-time result, I would advise you to let the CPU solve it, because, from my experience, you will probably spend more time learning CUDA than the CPU will take to solve it (IMHO).
Question 02
The number of threads is a very common question from rookies. The answer is very dependent on your project, but taking your code as a guide, I would use as many as I can, giving every thread a different seed.
My suggestion is to use registers as what you call "sites" (be aware that there are strong limitations) and then run multiple loops to evaluate your particles, in the same way a car tire rolls over a bad road (the data stays in shared memory), so your L is limited to 255 per loop (avoid register spilling at all costs; fewer registers means more warps per block). To create the perturbation, I would load vectors into shared memory: one for alpha (short), one for beta (short) (I assume different distributions), one for "particle present or not" at the next site (char), and another two to combine with threadIdx, blockIdx, and some current-time info into a pseudo-random source (to help pick the initial alpha, beta, and occupancy). That way you can reuse these rates for every thread in the block, and since the data does not change (only the reading position changes) you only have to synchronize once after reading; you can also pick the perturbation position at random and reuse the data. The initial values can be loaded from global memory and "refreshed" after a specific number of loops to hide the loading latency.
In short, you reuse the same shared data multiple times, but the values selected by every thread change at every iteration because of the pseudo-random value. Considering that you are dealing with large numbers and you can load different data in every block, the pseudo-random algorithm should be good enough. You can even use results stored on the GPU from previous runs as a random source, flip one variable and do some bit operations, so you can use every bit as a particle.
Question 03
For your specific project I would strongly recommend avoiding thread cooperation and making the threads completely independent. But you can use shuffle within the same warp at no great cost.
Question 04
It is hard to generate truly random data, but you should worry about how long your period is (any generator has a period after which it repeats). I would suggest using a single generator that can work in parallel with your kernel and feeding your kernels from it (you can use dynamic parallelism). In your case, since you only want some randomness, you should not worry much about consistency. I gave an example of using pseudo-random data in the previous question; that may help. Keep in mind that there is no real random generator, although there are alternatives such as internet bits, for example.
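As a concrete (hypothetical) illustration of per-thread generators, here is a common cuRAND pattern: one state per thread, seeded once and reused at every update. The TASEP-specific parts are left as placeholders and are not from the answer:
#include <curand_kernel.h>

__global__ void init_rng(curandState* states, unsigned long long seed, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        curand_init(seed, tid, 0, &states[tid]);   // same seed, different subsequence per thread
}

__global__ void tasep_step(bool* site, curandState* states, int n,
                           float alpha, float beta)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    curandState local = states[tid];               // work on a register copy
    float r = curand_uniform(&local);              // one uniform draw per update
    // ... apply the alpha/beta/jump rules from the question using r ...
    states[tid] = local;                           // write the state back
}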
Question 05
Already explained in Question 03, but keep in mind that you do not need a full sequence of values, only a part large enough to use in multiple ways, giving your kernel enough time to process before you refresh the sequence. If you guarantee not to feed a block with the same sequence, it will be very hard for it to fall into patterns.
I hope I have helped. I have been working with CUDA for a bit more than a year; I started just like you, and I still improve my code every week, and now it is almost good enough. Now I see how it perfectly fits my statistical challenge: clustering random things.
Good luck!

Simulation loop GPU utilization

I am struggling with GPU utilization in a simulation loop.
There are 3 kernels launched in every cycle.
The next time step size is computed by the second kernel.
while (time < end)
{
    kernel_Flux<<<...>>>(...);
    kernel_Timestep<<<...>>>(d_timestep);
    cudaMemcpy(&h_timestep, d_timestep, sizeof(float), ...);
    kernel_Integrate<<<...>>>(d_timestep);
    time += h_timestep;
}
I only need to copy back a single float. What would be the most efficient way to avoid unnecessary synchronizations?
Thank you in advance. :-)
In CUDA, all operations running in the default stream are synchronized. So in the code you've posted the kernels will run one after another. From what I can see, kernel_Integrate() depends on the result of kernel_Timestep(), so there is no way of avoiding that synchronization. Anyway, if kernel_Flux() and kernel_Timestep() work on independent data, you can try to execute them in parallel in two different streams.
If you care a lot about the iteration time, you can probably set up a new stream dedicated to the memcpy of h_timestep out of the device (you need to use cudaMemcpyAsync in this case). Then use something like speculative execution, where your loop proceeds before you know the new time. To do so you will have to set up GPU memory buffers for the next several iterations, probably as a circular buffer. You also need cudaEventRecord and cudaStreamWaitEvent to synchronize the different streams, such that the next iteration is allowed to proceed only once the time corresponding to the buffer you are about to overwrite has been calculated (i.e. the memcpy stream has done its job); otherwise you will lose the state at that iteration.
Another potential solution, which I haven't tried but suspect would work, is to make use of dynamic parallelism. If your card supports it, you can probably put the whole loop on the GPU.
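A hedged sketch of that dynamic-parallelism variant (compute capability 3.5+, compiled with -rdc=true; kernel names and launch sizes are taken from the question as placeholders, and note that device-side cudaDeviceSynchronize() is deprecated in recent CUDA releases):
__global__ void simulation_loop(float* d_timestep, float end, int gs, int bs)
{
    float time = 0.0f;
    while (time < end) {
        kernel_Flux<<<gs, bs>>>();
        kernel_Timestep<<<gs, bs>>>(d_timestep);
        cudaDeviceSynchronize();                 // wait for the child kernels
        kernel_Integrate<<<gs, bs>>>(d_timestep);
        cudaDeviceSynchronize();
        time += *d_timestep;                     // no copy back to the host needed
    }
}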
EDIT:
Sorry, I just realized that you have a third kernel. Your delay due to synchronization may be because you are not using cudaMemcpyAsync? It is very likely that the third kernel will run longer than the memcpy, so you should be able to proceed without any delay; the only synchronization needed is after each iteration.
The ideal solution would be to move everything to the GPU. However, I cannot do so, because I need to launch CUDPP compact after every few iterations, and it supports neither CUDA streams nor dynamic parallelism. I know that the Thrust 1.8 library has a copy_if method which does the same thing and works with dynamic parallelism. The problem is that it does not compile with separate compilation on.
To sum up, now I use the following code:
while (time < end)
{
    kernel_Flux<<<gs, bs, 0, stream1>>>();
    kernel_Timestep<<<gs, bs, 0, stream1>>>(d_timestep);
    cudaEventRecord(event, stream1);
    cudaStreamWaitEvent(stream2, event, 0);
    cudaMemcpyAsync(&h_timestep, d_timestep, sizeof(float), cudaMemcpyDeviceToHost, stream2);
    kernel_Integrate<<<gs, bs, 0, stream1>>>(d_timestep);
    cudaStreamSynchronize(stream2);
    time += h_timestep;
}

Why does FFT of sine wave have magnitudes in multiple bins

I've been playing around with Web Audio some. I have a simple oscillator node playing at a frequency of context.sampleRate / analyzerNode.fftSize * 5 (107.666015625 in this case). When I call analyzer.getByteFrequencyData I would expect it to have a value in the 5th bin, and nowhere else. What I actually see is [0,0,0,240,255,255,255,240,0,0...]
Why am I getting values in multiple bins?
The webaudio AnalyserNode applies a Blackman window before computing the FFT. This windowing function will smear the single tone.
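For reference (not part of the original answer), the conventional Blackman window has roughly this shape, up to the exact normalization convention the spec uses:
w[n] = 0.42 - 0.5*cos(2*pi*n / (N-1)) + 0.08*cos(4*pi*n / (N-1)),  for n = 0 .. N-1
Multiplying the time-domain samples by this taper widens the FFT's main lobe, which is why a single tone shows up in a few neighbouring bins instead of exactly one.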
That is because your sequence is finite, and therefore your signal is assumed to last only a finite amount of time. You are effectively calculating the FFT with a rectangular window, i.e. your signal is considered to last only for the number of generated samples, and that "discontinuity" (the fact that the signal has a finite number of samples) creates the spectral leakage. To minimise this effect you can try several window functions which, applied to your data prior to the FFT calculation, reduce it.
It looks like you might be clipping somewhere in your computation by using a test signal too large for your data or arithmetic format. Try again using a floating point format.