When I call IDirect3D9::CreateDevice I get FPU exception and my app crashes ("floating point division by zero").
It happens since latest update of Windows 10 (April 2018) and laptops with dual graphics (NVIDIA + Intel).
We found a temporary solution - globally disable all FPU exceptions in our app by calling Set8087CW($133F) when our app is started.
I'm sure that our code which works with DirectX is correct and there is no mistakes in parameters which I pass to CreateDevice(). Also I pass D3DCREATE_FPU_PRESERVE flag to CreateDevice().
This problem didn't happen with any previous version of Windows 10 or with Windows 8/7/Vista/XP. Also this problem doesn't occur if manually choose Intel video card for our app instead of NVIDIA.
Our app is written on Delphi (Pascal).
The documentation for D3DCREATE_FPU_PRESERVE does say that "Portions of Direct3D assume floating-point unit exceptions are masked; unmasking these exceptions may result in undefined behavior". I suppose that means you were lucky in the past; this is one of those cases where "undefined" means "used to work and now doesn't".
Floating point exceptions basically result for any operation that would produce a NaN or infinite value from finite values (the most common is division by zero).
If it happens on one driver/GPU combination but not the other, it is likely that the drivers/GPUs are rounding differently in certain corner cases. Keep an eye out for edge values - very small values multiplied by very large ones for instance. For instance (x * 1e10 + y) / 1e10 might result in zero in some scenarios, which could then allow divide by zero errors to creep in. This kind of debugging can be pretty annoying but is part of the reality of dealing with a variety of GPUs.
Related
I have heard the term "Single Cycle Cpu" and was trying to understand what single cycle cpu actually meant. Is there a clear agreed definition and consensus and what is means?
Some home brew "single cycle cpu's" I've come across seem to use both the rising and the falling edges of the clock to complete a single instruction. Typically, the rising edge acts as fetch/decode and the falling edge as execute.
However, in my reading I came across the reasonable point made here ...
https://zipcpu.com/blog/2017/08/21/rules-for-newbies.html
"Do not transition on any negative (falling) edges.
Falling edge clocks should be considered a violation of the one clock principle,
as they act like separate clocks.".
This rings true to me.
Needing both the rising and falling edges (or high and low phases) is effectively the same as needing the rising edge of two cycles of a single clock that's running twice as fast; and this would be a "two cycle" CPU wouldn't it.
So is it honest to state that a design is a "single cycle CPU" when both the rising and falling edges are actively used for state change?
It would seem that a true single cycle cpu must perform all state changing operations on a single clock edge of a single clock cycle.
I can imagine such a thing is possible providing the data strorage is all synchronous. If we have a synchronous system that has settled then on the next clock edge we can clock the results into a synchronous data store and simultaneously clock the program counter on to the next address.
But if the target data store is for example async RAM then the surely control lines would be changing whilst that data is being stored leading to unintended behaviours.
Am I wrong, are there any examples of a "single cycle cpu" that include async storage in the mix?
It would seem that using async RAM in ones design means one must use at least two logical clock cycles to achive the state change.
Of course, with some more complexity one could perhaps add anhave a cpu that uses a single edge where instructions use solely synchronout components, but relies on an extra cycle when storing to async data; but then that still wouldn't be a single cycle cpu, but rather a a mostly single cycle cpu.
So no CPU that writes to async RAM (or other async component) can honestly be considered a single cycle CPU because the entire instruction cannot be carried out on a single clock edge. The RAM write needs two edges (ie falling and rising) and this breaks the single clock principal.
So is there a commonly accepted single cycle CPU and are we applying the term consistently?
What's the story?
(Also posted in my hackday log https://hackaday.io/project/166922-spam-1-8-bit-cpu/log/181036-single-cycle-cpu-confusion and also on a private group in hackaday)
=====
Update: Looking at simple MIP's it seems the models use synchronous memory and so can probably operate off a single edge ad maybe it does - therefore warrant the category "single cycle".
And perhaps FPGA memory is always synchronous - I don't know about that.
But is the term using inconsistently elsewhere - ie like most Homebrew TTL Computers out there??
Or am I just plain wrong?
====
Update :
Some may have misunderstood my point.
Numerous home brew TTL cpu's claim "single cycle CPU" status (not interested for the purposes of this discussion in more complex beasts that do pipelining or whatever).
By single cycle these CPU's they typically mean that they do something like advancing the PC on one edge of the clock and then the use the opposing edge of the clock to update flipflops with the result. OR they will use the other phase of the clock to update async components like latches and sram.
However, the ZipCPU reference I provided suggests that using the opposing clock edge is akin to using a second clock cycle or even a second clock. BTW Ben Eater in his vids even compares the inverted clock that he uses to update his SRAM to being a second clock.
My objection to the use of "single cycle CPU" with such CPU's (basically most/all home bred TTL CPU's I've seen as they all work that way) is that I agree with ZipCPU that using the opposing edge (or phase) of the clock for the commit is effectively the same as using a second clock and this makes a mockery of the "single cycle" claim.
If the use of oposing edge is effectively the same a using a single edge but of dual clock cycles then I think that makes use of the term questionable. So I take ZipCPU's point to heart and tighten the term to mean use of a single edge.
On the other hand is seems perfectly possible to build a CPU that uses only sync components (ie edge triggered flip flops) and which uses only a single edge, where on each edge, we clock whatever is on the bus into whatever device is selected for write and at the same moment advance the PC.
Between one edge and the next same direction edge, settling occurs.
In this manner we end up with CPI=1 and use of only a single edge - which is very distinctly different to the common TTL CPU pattern of using both edges of the clock.
BTW my impression of FPGA's (which I'm not referring to here) is that the storage elements in FPGA are all synchronous flip flops. I don't know, but that's what my reading suggests. Anyway, if this is true then a trivial FPGA based CPU probabnly has a CPI=1 and uses only say the +ve edge and so these might well meet my narrow definition of "single cycle cpu". Also, my reading suggests that various MIP's impls (educational efforts probably) are probably meeting my definition ok.
This seems mostly a question of definitions and terminology, moreso than how to actually build simple CPUs.
If you insist on that strict definition of "single cycle CPU" meaning to truly use only clock edge to set everything in motion for that instruction, then yes, that would exclude real-world toy/hobby CPUs that use a 2nd clock edge to give a consistent interval for memory access.
But they certainly still fulfil the spirit of a single-cycle CPU, which is that every instruction runs in 1 clock cycle, with no pipelining and no multi-cycle microcode.
A whole clock cycle does have 2 clock edges, and it's normal for real-world (non-single-cycle) CPUs to use the "other" edge for internal timing in some cases, but we still talk about their frequency in whole cycles, not the edge frequency, except in cases like DDR memory where we do talk about the transfer rate = twice the memory clock frequency. What sets that apart is always using both edges, and for approximately equal things, not just some extra timing / synchronization within a clock cycle.
Now could you build a CPU that keeps a store value on a memory bus for some minimum time, without using a clock edge? Maybe.
Perhaps make sure the critical path leading to store-data it is short enough that the data is always ready. And possibly propagate some "data-ready" signal along with your computations (or just from the longest critical path of any instruction), and after a couple gate delays after the data is on the bus, flip the memory clock. (And on the next CPU clock edge, flip it back). If your memory doesn't mind its clock not having a uniform duty cycle, this might be fine as long as each half of the memory clock is long enough.
For loading from memory, you can maybe do something similar by initiating a memory load cycle some gate-delays after the CPU clock edge that starts this "cycle" of your single-cycle CPU. This might even involve building a long chain of gate delays intentionally with inverters dedicated to that purpose. Or perhaps even an analog RC time delay, but either way that's obviously worse than just using the other edge of the main clock, and you'd only ever do this as an exercise in single-cycle dogmatic purity. (It can also be flaky because gate-delay isn't constant, depending on voltage and temperature, so one side of the chip running hotter than the other could change relative timing.)
The definition says that a single cycle CPU takes just one instruction per one cycle. So it's possible to make a conclusion in theory that there are other CPU's that takes more or less instruction per cycle. You can check it out that there are some concepts like multi-cycle processor and pipelined processor (Instruction pipelining). "Pipelining attempts to keep every part of the processor busy with some instruction by dividing incoming instructions into a series of sequential steps." acorkcording to Wiki. I don't know how exactly it works, but maybe it just uses available registers (maybe instead of using for example EAX, ECX is used like EAX or maybe it works some other way, but one for sure is 100% true that number of registers in increasing, so maybe it's one of main purposes. Source: https://en.wikipedia.org/wiki/Instruction_pipelining
I think the answer for the question: "is-a-single-cycle-cpu-possible-if-asynchronous-components-are-used" depends on CPU controller that controls both CPU and RAM with opcodes. I found interesing information about this on site: http://people.cs.pitt.edu/~cho/cs1541/current/handouts/lect-pipe_4up.pdf
https://ibb.co/tKy6sR2
CONCLUSION: In my opinion, if we consider the term "single cycle CPU" it should be the simplest possible construction. The term "asynchronous" implements a conclusion, that is more complex than "synchronous". So both terms are not equivalent. It's something like "Can a basic data type be considered as a structure?". In my opinion the word "single" means the simplest possible and "asynchronous" means some modification, so more complex, so just think it's not possible, but maybe the term "are used" can be bypassed by "are used at the time" - if some switch, some controller can turn off asynchronous mode and make this all the simplest possible. But generally just think it's not possible
I know this question seems very generic as it can depend on the platform,
but I understand with procedure / function calls, the assembler code to push return address on the stack and local variables etc. can be part of either the caller function or callee function.
When a hardware exception or interrupt occurs tho, the Program Counter will get the address of the exception handler via the exception table, but where is the actual code to store the state, return address etc. Or is this automatically done at the hardware level for interrupts and exceptions?
Thanks in advance
since you are asking about arm and you tagged microcontroller you might be talking about the arm7tdmi but are probably talking about one of the cortex-ms. these work differently than the full sized arm architecture. as documented in the architectural reference manual that is associated with these cores (the armv6-m or armv7-m depending on the core) it documents that the hardware conforms to the ABI, plus stuff for an interrupt. So the return address the psr and registers 0 through 4 plus some others are all put on the stack, which is unusual for an architecture to do. R14 instead of getting the return address gets an invalid address of a specific pattern which is all part of the architecture, unlike other processor ip, addresses spaces on the cortex-ms are encouraged or dictated by arm, that is why you see ram starts at 0x20000000 usually on these and flash is less than that, there are some exceptions where they place ram in the "executable" range pretending to be harvard when really modified harvard. This helps with the 0xFFFxxxxx link register return address, depending on the manual they either yada yada over the return address or they go into detail as to what the patterns you find mean.
likewise the address in the vector table is spelled out something like the first 16 are system/arm exceptions then interrupts follow after that where it can be up to 128 or 256 possible interrupts, but you have to look at the chip vendor (not arm) documentation for that to see how many they exposed and what is tied to what. if you are not using those interrupts you dont have to leave a huge hole in your flash for vectors, just use that flash for your program (so long as you insure you are never going to fire that exception or interrupt).
For function calls, which occur at well defined (synchronous) locations in the program, the compiler generates executable instructions to manage the return address, registers and local variables. These instructions are integrated with your function code. The details are hardware and compiler specific.
For a hardware exception or interrupt, which can occur at any location (asynchronous) in the program, managing the return address and registers is all done in hardware. The details are hardware specific.
Think about how a hardware exception/interrupt can occur at any point during the execution of a program. And then consider that if a hardware exception/interrupt required special instructions integrated into the executable code then those special instructions would have to be repeated everywhere throughout the program. That doesn't make sense. Hardware exception/interrupt management is handled in hardware.
The "code" isn't software at all; by definition the CPU has to do it itself internally because interrupts happen asynchronously. (Or for synchronous exceptions caused by instructions being executed, then the internal handling of that instruction is what effectively triggers it).
So it's microcode or hardwired logic inside the CPU that generates the stores of a return address on an exception, and does any other stuff that the architecture defines as happening as part of taking an exception / interrupt.
You might as well as where the code is that pushes a return address when the call instruction executes, on x86 for example where the call instruction pushes return info onto the stack instead of overwriting a link register (the way most RISCs do).
Recently, I have been making program (FDTD Operation) using the CUDA
development environment, OS is Windows server 2008 , Graphic card is TeslaC2070, compiler is VS2010. This program calculates using single and double precision floating-point.
I was reading the CUDA programming guide 3.2 and 4.0 . In appendix, guide tell me sin(), cos() has maximum accuracy of 2 ULP. My original CPU program produces results which are different to the CUDA Version.
I want to make results correctly same. Is it possible?
To quote Goldberg (a paper that every Computer Scientist, Computational Scientist, and possibly even every scientist who programs, should read):
Due to roundoff errors, the associative laws of algebra do not
necessarily hold for floating-point numbers.
This means that when you change the order of operations—even when using ostensibly associative arithmetic—you are likely to get slightly different answers.
Parallelism, by definition, results in different ordering of operations relative to serial arithmetic. "Embarrasingly parallel" computations, that is, computations where each output element is computed independently from all others, sometimes do not have to worry about this. But collective operations, like reductions or scans, and spatial neighborhood computations, such stencils (as in FDTD), do experience this effect.
In practice, even using a different compiler (and even different compiler options) can change the result of floating point computation, even when compiling the same code, with or without parallelism.
I'm thinking about this question for a time: when does an ARM7(with 3 pipelines) processor increase its PC register.
I originally thought that after an instruction has been executed, the processor first check is there any exception in the last execution, then increase PC by 2 or 4 depending on current state. If an exception occur, ARM7 will change its running mode, store PC in the LR of current mode and begin to process current exception without modifying the PC register.
But it make no sense when analyzing returning instructions. I can not work out why PC will be assigned LR when returning from an undefined-instruction-exception while LR-4 from prefetch-abort-exception, don't both of these exceptions happened at the decoding state? What's more, according to my textbook, PC will always be assigned LR-4 when returning from prefetch-abort-exception no matter what state the processor is(ARM or Thumb) before exception occurs. However, I think PC should be assigned LR-2 if the original state is Thumb, since a Thumb-instruction is 2 bytes long instead of 4 bytes which an ARM-instruction holds, and we just wanna roll-back an instruction in current state. Is there any flaws in my reasoning or something wrong with the textbook.
Seems a long question. I really hope anyone can help me get the right answer.
Thanks in advance.
You return to LR from undefined-instruction handling because that points to the instruction after the one that caused the trap; you don't want to return to the same undefined instruction again, it'll only hit the same trap.
You return to LR-4 from prefetch-abort if you want to execute the same instruction again; presumably because you've mapped some memory in for it so it'll now work.
At what point in the pipeline an ARM7 actually increases its PC is irrelevant, because the value of PC during execution and consequently the value of LR in abort handlers is something laid down as part of the ARM architecture standard, based largely on what the ancient ARM2 did with its PC.
However, I think PC should be assigned LR-2 if the original state is Thumb
That would make sense, but then the exception handler would need to know whether the original code that caused it to trigger was ARM or Thumb code. This might have impacted compatibility too, as there was plenty of non-Thumb-aware exception handling code around. So instead the Thumb architecture fudged the LR on entry to exception handlers so that the handler could always use the same instruction to return, the one they were used to using for non-Thumb code.
I've noticed that CUDA applications tend to have a rough maximum run-time of 5-15 seconds before they will fail and exit out. I realize it's ideal to not have CUDA application run that long but assuming that it is the correct choice to use CUDA and due to the amount of sequential work per thread it must run that long, is there any way to extend this amount of time or to get around it?
I'm not a CUDA expert, --- I've been developing with the AMD Stream SDK, which AFAIK is roughly comparable.
You can disable the Windows watchdog timer, but that is highly not recommended, for reasons that should be obvious.
To disable it, you need to regedit HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog\Display\DisableBugCheck, create a REG_DWORD and set it to 1.
You may also need to do something in the NVidia control panel. Look for some reference to "VPU Recovery" in the CUDA docs.
Ideally, you should be able to break your kernel operations up into multiple passes over your data to break it up into operations that run in the time limit.
Alternatively, you can divide the problem domain up so that it's computing fewer output pixels per command. I.e., instead of computing 1,000,000 output pixels in one fell swoop, issue 10 commands to the gpu to compute 100,000 each.
The basic unit that has to fit within the time slice is not your entire application, but the execution of a single command buffer. In the AMD Stream SDK, a long sequence of operations can be broken up into multiple time slices by explicitly flushing the command queue with a CtxFlush() call. Perhaps CUDA has something similar?
You should not have to read all of your data back and forth across the PCIX bus on every time slice; you can leave your textures, etc. in gpu local memory; you just have some command buffers complete occasionally, to prove to the OS that you're not stuck in an infinite loop.
Finally, GPUs are fast, so if your application is not able to do useful work in that 5 or 10 seconds, I'd take that as a sign that something is wrong.
[EDIT Mar 2010 to update:] (outdated again, see the updates below for the most recent information) The registry key above is out-of-date. I think that was the key for Windows XP 64-bit. There are new registry keys for Vista and Windows 7. You can find them here: http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx
or here: http://msdn.microsoft.com/en-us/library/ee817001.aspx
[EDIT Apr 2015 to update:] This is getting really out of date. The easiest way to disable TDR for Cuda programming, assuming you have the NVIDIA Nsight tools installed, is to open the Nsight Monitor, click on "Nsight Monitor options", and under "General" set "WDDM TDR enabled" to false. This will change the registry setting for you. Close and reboot. Any change to the TDR registry setting won't take effect until you reboot.
[EDIT August 2018 to update:]
Although the NVIDIA tools allow disabling the TDR now, the same question is relevant for AMD/OpenCL developers. For those: The current link that documents the TDR settings is at https://learn.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys
On Windows, the graphics driver has a watchdog timer that kills any shader programs that run for more than 5 seconds. Note that the Xorg/XFree86 drivers don't do this, so one possible workaround is to run the CUDA apps on Linux.
AFAIK it is not possible to disable the watchdog timer on Windows. The only way to get around this on Windows is to use a second card that has no displayed screens on it. It doesn't have to be a Tesla but it must have no active screens.
Resolve Timeout Detection and Recovery - WINDOWS 7 (32/64 bit)
Create a registry key in Windows to change the TDR settings to a
higher amount, so that Windows will allow for a longer delay before
TDR process starts.
Open Regedit from Run or DOS.
In Windows 7 navigate to the correct registry key area, to create the
new key:
HKEY_LOCAL_MACHINE>SYSTEM>CurrentControlSet>Control>GraphicsDrivers.
There will probably one key in there called DxgKrnlVersion there as a
DWord.
Right click and select to create a new key REG_DWORD, and name it
TdrDelay. The value assigned to it is the number of seconds before
TDR kicks in - it > is currently 2 automatically in Windows (even
though the reg. key value doesn't exist >until you create it). Assign
it with a new value (I tried 4 seconds), which doubles the time before
TDR. Then restart PC. You need to restart the PC before the value will
work.
Source from Win7 TDR (Driver Timeout Detection & Recovery)
I have also verified this and works fine.
The most basic solution is to pick a point in the calculation some percentage of the way through that I am sure the GPU I am working with is able to complete in time, save all the state information and stop, then to start again.
Update:
For Linux: Exiting X will allow you to run CUDA applications as long as you want. No Tesla required (A 9600 was used in testing this)
One thing to note, however, is that if X is never entered, the drivers probably won't be loaded, and it won't work.
It also seems that for Linux, simply not having any X displays up at the time will also work, so X does not need to be exited as long as you screen to a non-X full-screen terminal.
This isn't possible. The time-out is there to prevent bugs in calculations from taking up the GPU for long periods of time.
If you use a dedicated card for CUDA work, the time limit is lifted. I'm not sure if this requires a Tesla card, or if a GeForce with no monitor connected can be used.
The solution I use is:
1. Pass all information to device.
2. Run iterative versions of algorithms, where each iteration invokes the kernel on the memory already stored within the device.
3. Finally transfer memory to host only after all iterations have ended.
This enables control over iterations from CPU (including option to abort), without the costly device<-->host memory transfers between iterations.
The watchdog timer only applies on GPUs with a display attached.
On Windows the timer is part of the WDDM, it is possible to modify the settings (timeout, behaviour on reaching timeout etc.) with some registry keys, see this Microsoft article for more information.
It is possible to disable this behavior in Linux. Although the "watchdog" has an obvious purpose, it may cause some very unexpected results when doing extensive computations using shaders / CUDA.
The option can be toggled in your X-configuration (likely /etc/X11/xorg.conf)
Adding: Option "Interactive" "0" to the device section of your GPU does the job.
see CUDA Visual Profiler 'Interactive' X config option?
For details on the config
and
see ftp://download.nvidia.com/XFree86/Linux-x86/270.41.06/README/xconfigoptions.html#Interactive
For a description of the parameter.