STM32F4 nanosecs delay - stm32f4discovery

I've been playing with SysTick for a couple of days and I cannot reach a nanosecond-scale delay. Is it possible to reach such small values with SysTick, or do I have to use timers and interrupts? The LEDs won't toggle with delays shorter than about 350 ns. Here is an image from my USB oscilloscope:
In general I want to make a project (above I am just experimenting with LEDs and SysTick) which will look like this:
where Δt = 250 ns (the other parameters will be determined somehow). The question is: can I generate these pulses using SysTick?

The STM32F407VG has a 24-bit SysTick timer, and its maximum clock speed is 168 MHz (the core clock speed). That means that even if you set your SysTick reload register to the smallest value:
0x000001 (1 cycle)
you can only get down to about 5.95 ns, one core clock period, so that is the finest granularity available.

I found this in section 6.2 Clocks of the RM0368 reference manual:
The RCC feeds the external clock of the Cortex System Timer (SysTick) with the AHB clock (HCLK) divided by 8. The SysTick can work either with this clock or with the Cortex clock (HCLK), configurable in the SysTick control and status register.
So the maximum tick rate may be limited by the clock divisions. Check Figure 12 (Clock tree) to see which clock configuration you should use to get the maximum speed.
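As a rough sketch, assuming a 168 MHz core clock and the standard CMSIS register names, a busy-wait delay that selects the processor clock rather than HCLK/8 could look like this (keep in mind that at this scale the call and loop overhead alone can eat a large part of a 250 ns budget):

#include "stm32f4xx.h"   /* CMSIS device header, assumed available */

/* Busy-wait for 'ticks' core-clock cycles (168 MHz -> ~5.95 ns per tick). */
static void delay_ticks(uint32_t ticks)
{
    SysTick->LOAD = ticks - 1U;                  /* reload value */
    SysTick->VAL  = 0U;                          /* clear current value and COUNTFLAG */
    SysTick->CTRL = SysTick_CTRL_CLKSOURCE_Msk   /* use the processor clock (HCLK) */
                  | SysTick_CTRL_ENABLE_Msk;     /* start counting, no interrupt */
    while ((SysTick->CTRL & SysTick_CTRL_COUNTFLAG_Msk) == 0U)
    {
        /* wait until the counter wraps once */
    }
    SysTick->CTRL = 0U;                          /* stop the timer */
}

/* e.g. delay_ticks(42) is roughly 250 ns at 168 MHz, ignoring call overhead */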

Related

Is it possible to design Arinc653 scheduler with Azure-RTOS?

I want to design a scheduler that works in an ARINC 653 manner, just for experimental purposes.
Is it possible to manipulate the scheduler in this way?
I know there is time-slicing in ThreadX, but all the examples I've encountered use TX_NO_TIME_SLICE (and my attempts with time-slicing did not work either).
Besides, I'm not sure whether a time slice makes the thread wait until its deadline is met, or puts it to sleep so that other threads get to run.
In short: an ARINC 653 scheduler defines a constant major frame in which each 'thread' has a fixed amount of running time, and the major frame repeats endlessly. If a thread is assigned, say, 3 ms within a major frame and finishes its job in 1 ms, the kernel still waits 2 ms before switching to the next 'thread'.
You can use time slicing to limit the amount of time each thread runs: https://learn.microsoft.com/en-us/azure/rtos/threadx/chapter4#tx_thread_create
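As a sketch of what that looks like (stack size, priorities, slice length and the thread_entry function are placeholders, not taken from the question):

/* thread_entry, its stack and the tick counts are illustrative placeholders */
TX_THREAD my_thread;
UCHAR my_stack[1024];

UINT status = tx_thread_create(&my_thread, "worker", thread_entry, 0,
                               my_stack, sizeof(my_stack),
                               16, 16,            /* priority, preemption threshold */
                               4,                 /* time slice: 4 timer ticks instead of TX_NO_TIME_SLICE */
                               TX_AUTO_START);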
I understand that the characteristic of the ARINC 653 scheduler that you want to emulate is time partitioning. The ThreadX scheduling policy is based on priority, preemption threshold and time-slicing.
You can emulate time partitioning with ThreadX. To achieve that you can use a timer, in which you suspend/resume the threads for each frame. Timers execute in a different context than threads; they are lightweight and not affected by thread priorities. By default ThreadX uses a timer thread, set to the highest priority, to execute timers, but for better performance you can compile ThreadX to run the timers inside an interrupt (define the option TX_TIMER_PROCESS_IN_ISR).
An example:
Threads thd1,thd2,thd3 belong to frame A
Threads thd4,thd5,thd6 belong to frame B
Timer tm1 is triggered once every frame change
Pseudo code for tm1:
static VOID tm1(ULONG id)                /* ThreadX timer callback signature */
{
    /* thd1..thd6 are TX_THREAD control blocks created elsewhere */
    static int frame_b_active = 0;

    frame_b_active = !frame_b_active;    /* toggle on every frame change */
    if (frame_b_active)
    {
        /* stop frame A, run frame B */
        tx_thread_suspend(&thd1);
        tx_thread_suspend(&thd2);
        tx_thread_suspend(&thd3);
        tx_thread_resume(&thd4);
        tx_thread_resume(&thd5);
        tx_thread_resume(&thd6);
    }
    else
    {
        /* stop frame B, run frame A */
        tx_thread_suspend(&thd4);
        tx_thread_suspend(&thd5);
        tx_thread_suspend(&thd6);
        tx_thread_resume(&thd1);
        tx_thread_resume(&thd2);
        tx_thread_resume(&thd3);
    }
}
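
For completeness, a sketch of how tm1 could be registered as a periodic application timer; FRAME_TICKS is a placeholder for the frame length expressed in timer ticks:

TX_TIMER frame_timer;

/* fire first after FRAME_TICKS, then periodically every FRAME_TICKS */
UINT status = tx_timer_create(&frame_timer, "frame switch",
                              tm1, 0,              /* expiration function and its input */
                              FRAME_TICKS, FRAME_TICKS,
                              TX_AUTO_ACTIVATE);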

Issue with delta time server-side

How can I make a server-side game run the same on every machine? When I use the server's delta time, it behaves differently on every computer/phone.
Would something called a 'fixed timestep' help me?
Yes, a fixed timestep can help you, but simple delta-scaled movement can also help.
A fixed timestep is commonly used for physics, because physics sometimes needs to be updated more often (120-200 Hz) than the game's render method.
However, you can still use a fixed timestep without physics.
You just need to interpolate your game objects with
lerp(oldValue, newValue, accumulator / timestep);
In your case, small frame-rate differences are probably causing the unexpected results.
To avoid that, you should make movement depend on the delta:
player.x += 5 * 60 * delta; // assuming your game targets 60 fps
instead of
player.x += 5;
That way the only difference between machines is the last delta, and it's negligible: the frame-time difference between 60 fps and 58 fps is only about 0.0005 s.
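
Putting both ideas together, here is a minimal sketch of a fixed-timestep update with interpolated rendering; the Player struct, the speed constant and the 60 Hz step are illustrative assumptions:

#include <math.h>

typedef struct { double x, prev_x; } Player;   /* hypothetical game object */

static double lerp(double a, double b, double t) { return a + (b - a) * t; }

/* Called once per rendered frame with the real elapsed time frame_dt. */
static void update_and_render(Player *p, double frame_dt, double *accumulator)
{
    const double timestep = 1.0 / 60.0;        /* fixed simulation step */

    *accumulator += frame_dt;
    while (*accumulator >= timestep) {
        p->prev_x = p->x;
        p->x += 5.0 * 60.0 * timestep;         /* same speed on every machine */
        *accumulator -= timestep;
    }

    /* render between the last two simulated states to hide the step size */
    double render_x = lerp(p->prev_x, p->x, *accumulator / timestep);
    (void)render_x;                            /* draw the player at render_x */
}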

Why cufftPlanMany() takes too long?

When calling cufftPlanMany() the first time, it takes about 0.7 sec, but all subsequent calls are fast.
Any idea how to accelerate the first call of cufftPlanMany()?
The first call to cufftPlanMany causes libcufft.so to be loaded. This in turn initializes the CUDA context if needed and loads all the kernels. It will always take some time, depending on the size of the library. 0.7 of a second is a bit excessive, and it will be reduced in the next version of cuFFT. We have also reduced the time of each subsequent cufftPlan* call a bit.
Why do you need low initialization time?
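If the startup cost matters, one common workaround (a general pattern, not something specific to this answer) is to pay it once during application initialization, before the latency-sensitive path runs; a sketch:

#include <cuda_runtime.h>
#include <cufft.h>

/* Warm up the CUDA context and cuFFT at startup so later plan calls are fast. */
static void warm_up_cufft(void)
{
    cudaFree(0);                       /* forces CUDA context creation */

    cufftHandle plan;
    if (cufftPlan1d(&plan, 64, CUFFT_C2C, 1) == CUFFT_SUCCESS)   /* throwaway plan loads the library */
        cufftDestroy(plan);
}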

Why does FFT of sine wave have magnitudes in multiple bins

I've been playing around with Web Audio some. I have a simple oscillator node playing at a frequency of context.sampleRate / analyzerNode.fftSize * 5 (107.666015625 in this case). When I call analyzer.getByteFrequencyData I would expect it to have a value in the 5th bin, and nowhere else. What I actually see is [0,0,0,240,255,255,255,240,0,0...]
Why am I getting values in multiple bins?
The webaudio AnalyserNode applies a Blackman window before computing the FFT. This windowing function will smear the single tone.
This has to do with the fact that your sequence is finite, and therefore your signal is assumed to last for a finite amount of time. You are effectively computing the FFT with a rectangular window, i.e. the signal is considered to last only for the number of generated samples, and that "discontinuity" (the fact that the signal has a finite number of samples) creates spectral leakage. To minimise this effect, you can try several window functions which, when applied to your data prior to the FFT calculation, reduce it.
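To illustrate the windowing idea, here is a minimal C sketch that builds a conventional Blackman window (the textbook 0.42/0.5/0.08 coefficients, which may differ slightly from the exact variant the AnalyserNode uses) and applies it in place to a block of samples before the FFT:

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Multiply 'samples' in place by a Blackman window of length n. */
static void apply_blackman_window(float *samples, int n)
{
    for (int i = 0; i < n; ++i) {
        double w = 0.42
                 - 0.5  * cos(2.0 * M_PI * i / (n - 1))
                 + 0.08 * cos(4.0 * M_PI * i / (n - 1));
        samples[i] *= (float)w;
    }
}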
It looks like you might be clipping somewhere in your computation by using a test signal too large for your data or arithmetic format. Try again using a floating point format.

CUDA samples matrixMul error

I am very new to CUDA and started reading about parallel programming and CUDA just a few weeks ago. After I installed the CUDA toolkit, I was browsing the SDK samples (which come with the installation of the toolkit) and wanted to try some of them out. I started with matrixMul from the 0_Simple folder. This program executes fine (I am using Visual Studio 2010).
Now I want to change the size of the matrices and try a bigger one (for example 960x960 or 1024x1024). In this case, something crashes (I get a black screen, and then the message: display driver stopped responding and has recovered).
I am changing these two lines in the code (in the main function):
dim3 dimsA(8*4*block_size, 8*4*block_size, 1);
dim3 dimsB(8*4*block_size, 8*4*block_size, 1);
before they were:
dim3 dimsA(5*2*block_size, 5*2*block_size, 1);
dim3 dimsB(5*2*block_size, 5*2*block_size, 1);
Can someone point out what I am doing wrong, and should I alter something else in this example for it to work properly? Thx!
Edit: like some of you suggested, I changed the timeout value (0 somehow did not work for me, so I set the timeout to 60), so my driver does not crash anymore, but I get a huge list of errors, like:
... ... ...
Error! Matrix[409598]=6.40005159, ref=6.39999986 error term is > 1e-5
Error! Matrix[409599]=6.40005159, ref=6.39999986 error term is > 1e-5
Does this have something to do with the allocation of memory? Should I make changes there, and what could they be?
Your new problem is actually just the strict tolerances provided in the NVidia example. Your kernel is running correctly. It's just complaining that the accumulated error is greater than the limit they had set for this example. This is just because you're doing a lot more math operations, which all accumulate error. If you look at the numbers it's giving you, you're only off from the reference answer by about 0.00005, which is not unusual after a lot of single-precision floating-point math. The reason you're getting these errors now and not with the default matrix sizes is that the original matrices were smaller and thus required far fewer operations to multiply. Matrix multiplication of N x N matrices requires on the order of N^3 operations, so the number of operations required increases much faster than the size of the matrix, and the accumulated error increases in proportion to the number of operations.
If you look near the end of the runTest() function, there's a call to computeGold() which computes the reference answer on your CPU. There should then be a call to something like shrCompareL2fe that compares the results. The last parameter to this is a tolerance. If you increase this tolerance (say, to 1e-3 or 1e-4 instead of 1e-5), you should eliminate these error messages. Note that there may be a couple of these calls. The version of the SDK examples that I have has an optional CUBLAS implementation, so it has a comparison of that against the gold, too. The one right after the print statement that says "Comparing CUDA matrixMul & Host results" is the one you'd want to change.
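The comparison is essentially a relative-error check; a rough illustration (not the SDK's exact shrCompareL2fe code) of what loosening the tolerance means:

#include <math.h>
#include <stdbool.h>

/* Returns true if every GPU result is within 'eps' relative error of the CPU reference. */
static bool results_match(const float *gpu, const float *ref, int n, float eps)
{
    for (int i = 0; i < n; ++i) {
        float abs_err = fabsf(gpu[i] - ref[i]);
        float rel_err = abs_err / fmaxf(fabsf(ref[i]), 1e-7f);
        if (rel_err > eps)          /* e.g. eps = 1e-3f instead of 1e-5f */
            return false;
    }
    return true;
}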
I'd advise looking at the indexing used in the kernel (matrixMulCUDA) a bit more closely - it sounds like you're writing to unallocated memory.
More specifically, is the only thing you changed the dimsA and dimsB variables? Inside the kernel they use the thread and block index to access the data - did you also increase the data size accordingly? There is no bounds checking in the kernel, so if you just change the kernel launch configuration but not the data, the odds are you're writing past the end of your data into some other memory.
Have you disabled Timeout Detection and Recovery (TDR) in Windows? It is entirely possible that your code is running fine, but that the larger matrices caused the kernel execution to exceed Windows' timeout, which makes Windows assume the card is locked up, so it resets the card and gives you a message identical to the one you describe. Even if that is not your problem here, you definitely want to disable that before doing any serious CUDA work in Windows. The timeout is quite short by default, since normal graphics rendering should take small fractions of a second per frame.
See this post on the NVidia forums that describes TDR and how to turn it off:
WDDM TDR - NVidia devtalk forum
In particular, you probably want to set the key HKLM\System\CurrentControlSet\Control\GraphicsDrivers\TdrLevel to 0 (Detection Disabled).
Alternatively, you can increase the timeout period by setting
HKLM\System\CurrentControlSet\Control\GraphicsDrivers\TdrDelay. It defaults to 2 and is specified in seconds. Personally, I have found that TDR is always annoying when doing work in CUDA, so I just turn it off entirely. IIRC, you need to restart your system for any TDR-related changes to take effect.