ThreadX kernel enter function

What does the ThreadX kernel enter function do?
What does it mean that this function does not return?
How are the threads created in the tx_application_define function scheduled and executed?

The ThreadX kernel-enter routine performs the following steps:
1. If ThreadX initialization needs to take place:
   - Call any port-specific pre-processing.
   - Invoke the low-level initialization to handle all processor-specific initialization issues.
   - Invoke the high-level initialization to exercise all of the ThreadX components.
   - Call any port-specific post-processing.
2. Call the user-provided initialization function tx_application_define.
3. Call any port-specific pre-scheduler processing.
4. Enter the scheduling loop to start executing threads.
To answer your questions:
In step #2, the ThreadX kernel-enter routine calls function tx_application_define, which is up to you to implement. It is similar in essence to a user callback routine, except that it is not provided as a function pointer (i.e., the tx_application_define symbol is resolved at link time rather than at runtime). This function is where you should typically create all of your threads.
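For illustration, here is a minimal sketch of such a setup; the thread name, stack size, priority, and entry-function body are arbitrary choices for the example, not anything prescribed by ThreadX:

#include "tx_api.h"

#define STACK_SIZE 1024

static TX_THREAD my_thread;            /* thread control block */
static UCHAR     my_stack[STACK_SIZE]; /* statically allocated stack */

static VOID my_thread_entry(ULONG input)
{
    while (1)
    {
        /* thread body: do some work, then sleep for 10 ticks */
        tx_thread_sleep(10);
    }
}

VOID tx_application_define(VOID *first_unused_memory)
{
    /* called once by the kernel-enter routine; create all threads here */
    tx_thread_create(&my_thread, "my thread", my_thread_entry, 0,
                     my_stack, STACK_SIZE,
                     4, 4,              /* priority, preemption threshold */
                     TX_NO_TIME_SLICE, TX_AUTO_START);
}

int main(void)
{
    tx_kernel_enter();                  /* never returns */
    return 0;
}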
In step #4, the ThreadX kernel-enter routine starts an infinite loop, which is in essence the scheduler itself. This is where all the context switches are managed, and the threads go in and out of execution. Upon every HW interrupt, the PC (program-counter) jumps from the currently executing thread to the IV (interrupt-vector), and from there to the connected ISR (interrupt-service-routine). After that, it jumps back to the scheduler (i.e., into the infinite loop), which determines whether a context-switch is required or not. Execution eventually returns to the last executing thread or to some other thread, depending on the scheduler decision.
As you can understand, every context switch is the result of a HW interrupt, but not every HW interrupt results in a context switch. You should typically refrain from enabling interrupts yourself (e.g., by calling function __enable_interrupt from within function tx_application_define), as the ThreadX kernel-enter routine takes care of that just before it enters the scheduling loop.

Related

What thread runs the callback passed to cudaStreamAddCallback?

If I register a callback via cudaStreamAddCallback(), what thread is going to run it?
The CUDA documentation says that cudaStreamAddCallback
adds a callback to be called on the host after all currently enqueued items in the stream have completed. For each cudaStreamAddCallback call, a callback will be executed exactly once. The callback will block later work in the stream until it is finished.
but says nothing about how the callback itself is called.
Just to flesh out comments so that this question has an answer and will fall off the unanswered queue:
The short answer is that this is an internal implementation detail of the CUDA runtime and you don't need to worry about it.
The longer answer is that if you look carefully at the operation of the CUDA runtime, you will notice that context establishment on a device (be it explicit via the driver API, or implicit via the runtime API) spawns a small thread pool. It is these threads which are used to implement features of the runtime like stream command queues and callback operations. Again, an internal implementation detail which the programmer doesn't need to know about.
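For context, a minimal sketch of registering such a callback looks like this (the callback body and names are illustrative; note that a stream callback must not make CUDA API calls itself):

#include <cstdio>
#include <cuda_runtime.h>

void CUDART_CB myCallback(cudaStream_t stream, cudaError_t status, void *userData)
{
    /* runs on an internal CUDA runtime thread, not the thread that registered it */
    printf("stream work done, status = %d\n", (int)status);
}

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    /* ... enqueue kernels / async copies on stream here ... */
    cudaStreamAddCallback(stream, myCallback, NULL, 0);  /* flags must be 0 */
    cudaStreamSynchronize(stream);   /* wait so the callback has run */
    cudaStreamDestroy(stream);
    return 0;
}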

Can a CUDA event be fired from device-side code?

Is there any way to fire an event (for benchmarking purposes, similar to cudaEvents in the CPU code) from a device kernel in CUDA?
E.g. suppose I would like to measure the time passed from kernel launch to the first thread that starts its computation, and the time passed from when the last thread finishes its computation to the return of control to the CPU.
Can I do that?
The device runtime API (used with dynamic parallelism) does have limited stream and events support, but event timing is not supported.
So, no, you can't do that.
An ugly workaround would be writing to some managed-memory location, and having a host-side thread poll it and fire the event when the value changes.
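A rough sketch of that workaround follows; it assumes a platform that allows concurrent host access to managed memory while a kernel is running (e.g. Linux with a Pascal-or-newer GPU), and all names are illustrative rather than any supported device-side event API:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void work(volatile int *flag)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        *flag = 1;                    /* first thread signals "computation started" */
    /* ... actual computation ... */
}

int main()
{
    volatile int *flag;
    cudaMallocManaged((void **)&flag, sizeof(int));
    *flag = 0;

    /* record events in an empty side stream, so they are not ordered after the kernel */
    cudaStream_t evStream;
    cudaStreamCreateWithFlags(&evStream, cudaStreamNonBlocking);
    cudaEvent_t start, seen;
    cudaEventCreate(&start);
    cudaEventCreate(&seen);

    cudaEventRecord(start, evStream);
    work<<<64, 256>>>(flag);
    while (*flag == 0) { }            /* host-side thread polls the managed flag */
    cudaEventRecord(seen, evStream);  /* "fire" the event when the value changes */

    cudaEventSynchronize(seen);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, seen);
    printf("launch to first write: ~%.3f ms\n", ms);

    cudaDeviceSynchronize();
    return 0;
}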

Understanding hardware interrupts and exceptions at processor and hardware level

After a lot of reading about interrupt handling and so on, I still can't figure out the full process of interrupt handling from the very beginning.
For example:
A division by zero.
The CPU fetches the instruction to divide a number by zero and sends it to the ALU.
Assume the ALU started the process of the division, or ran some checks before starting it.
1. How is the exception signaled to the CPU?
How does the CPU know which exception has occurred from only a one-bit signal? Is there a register that it reads after it gets interrupted to find this out?
2. How does my application catch the exception?
Do I need to write some function to catch a specific SIGNAL, or something else? And when I write an exception-handling routine like
try {}
catch {}
and an exception occurs, how can I know which exception was thrown and handle it well?
The part that bugs me most is, for example, when an interrupt is signaled from the keyboard to the PIC: the PIC in its turn signals to the CPU that an interrupt occurred by asserting the INT wire.
But how does the CPU know which device needs to be served?
What is the process the CPU goes through when its INTR pin is asserted?
Does it have a routine that checks some register holding the number of the interrupt (set by the PIC when it asserts the INT wire)?
It's really important for me to understand this topic; I have read and researched for a couple of weeks but cannot connect the dots in my head.
Thanks.
There are typically several things associated with interrupts other than just a pin. On most recent microcontrollers there is an interrupt vector table placed in memory, holding the handler address for each interrupt, and a register that records the interrupt event/flag.
When an event that is handled by an interrupt occurs, the corresponding flag is set. Depending on priorities and the current state of the CPU, the context-switch time may vary; for example, a low-priority interrupt flagged during a higher-priority interrupt will have to wait until the high-priority interrupt is finished. If nesting is possible, higher-priority interrupts may interrupt lower-priority ones.
In the particular case of exceptions like dividing by zero, which would indeed be detected by the ALU, the CPU may or may not offer a dedicated interrupt that fires on events like this. For other types of exceptions an interrupt might not be available, and the CPU would just act accordingly, for example by rebooting.
In conclusion, an interrupt event proceeds in the following manner:
1. The interrupt event occurs and the corresponding flag in the register is set.
2. When the time comes, the CPU switches context to the interrupt handler function.
3. At the end of the handler, the interrupt flag is cleared and the CPU is ready to re-flag the interrupt when the next event comes.
How simultaneous interrupts or different priority levels are arbitrated varies with different hardware.
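As a rough C-style illustration of that sequence (the register names and the write-1-to-clear behavior are hypothetical, modeled on common microcontrollers):

#include <stdint.h>

#define IRQ_FLAGS    (*(volatile uint32_t *)0x40000010u) /* hypothetical flag register */
#define UART_RX_FLAG (1u << 3)                           /* hypothetical flag bit */

/* entered via the interrupt vector table when the UART RX event is flagged */
void uart_rx_isr(void)
{
    /* ... service the peripheral: read the received byte, etc. ... */
    IRQ_FLAGS = UART_RX_FLAG;   /* write-1-to-clear: ready for the next event */
}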
It may be simplest to understand interrupts if one starts with the way they work on the Z80 in its simplest interrupt mode. That processor checks the state of a pin called /IRQ at a certain point during each instruction; if the pin is asserted and an "interrupt enabled" flag is set, then when it is time to fetch the next instruction the processor won't advance the program counter or read a byte from memory, but will instead disable the "interrupt enabled" flag and "pretend" that it read an "RST 38h" instruction. That instruction behaves like a single-byte "CALL 0038h" instruction, pushing the program counter and transferring control to that address.
Code at 0038h can then poll various peripherals if they need any service, use an "ei" instruction to turn the "interrupt enabled" flag back on, and perform a "ret". If no peripheral still has an immediate need for service at that point, code can then resume with whatever it was doing before the interrupt occurred. To prevent problems if the interrupt line is still asserted when the "ret" is executed, some special logic will ensure that the interrupt line will be ignored during that instruction (or any other instruction which immediately follows "ei"). If another peripheral has developed a need for service while the interrupt handler was running, the system will return to the original code, notice the state of /IRQ while it processes the first instruction after returning, and then restart the sequence with the RST 38h.
In the simple Z80 approach, there is only one kind of interrupt; any peripheral can assert /IRQ, and if any peripheral does so the Z80 will need to ask every peripheral if it wants attention. In more advanced systems, it's possible to have many different interrupts, so that when a peripheral needs service control can be dispatched to a routine which is designed to handle just that peripheral. The same general principles still apply, however: an interrupt effectively inserts a "call" instruction into whatever the processor was doing, does something to ensure that the processor will be able to service whatever needed attention without continuously interrupting that process [on the Z80, it simply disables interrupts, but systems with multiple interrupt sources can leave higher-priority sources enabled while servicing lower ones], and then returns to whatever the processor had been doing while re-enabling interrupts.
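Rendered as C-like pseudocode (the peripherals and their query/service functions are purely illustrative), the Z80-style handler at 0038h amounts to:

#include <stdbool.h>

bool uart_needs_service(void);    /* hypothetical peripheral queries */
bool timer_needs_service(void);
void service_uart(void);
void service_timer(void);

/* reached as if by "CALL 0038h"; interrupts were disabled automatically on entry */
void irq_handler_0038h(void)
{
    if (uart_needs_service())
        service_uart();
    if (timer_needs_service())
        service_timer();
    /* "ei" + "ret": re-enable interrupts and resume the interrupted code;
       if /IRQ is still asserted, the whole sequence restarts */
}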

CUDA kernel launched after call to thrust is synchronous or asynchronous?

I am having some trouble with the results of my computations; for some reason they are not correct. I checked the code and it seems right (although I will check it again).
My question is whether custom CUDA kernels are synchronous or asynchronous when launched after a call to thrust, e.g.
thrust::sort_by_key(args);
arrangeData<<<blocks,threads>>>(args);
will the kernel arrangeData run after thrust::sort_by_key has finished?
Assuming your code looks like that, and there is no usage of streams going on (neither the kernel call nor the thrust call indicates any stream usage as you have posted it), then both activities are issued to the default stream. I also assume (although it would not change my answer in this case) that the args passed to the thrust call are device arguments, not host arguments (e.g. device_vector, not host_vector).
All CUDA API and kernel calls issued to the default stream (or any given single stream) will be executed in order.
The arrangeData kernel will not begin until any kernels launched by the thrust::sort_by_key call are complete.
You can verify this using a profiler, e.g. nvvp.
Note that synchronous vs. asynchronous may be a bit confusing. When we talk about kernel launches being asynchronous, we are almost always referring to the host CPU activity, i.e. the kernel launch is asynchronous with respect to the host thread, which means it returns control to the host thread immediately, and its execution will occur at some unspecified time with respect to the host thread.
CUDA API calls and kernel calls issued to the same stream are always synchronous with respect to each other. A given kernel will not begin execution until all prior cuda activity issued to that stream (even things like cudaMemcpyAsync) has completed.
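As a sketch of the pattern from the question (arrangeData's body and the data here are placeholders), the same-stream ordering looks like this:

#include <thrust/device_vector.h>
#include <thrust/sort.h>

__global__ void arrangeData(const int *keys, int *vals, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        /* ... rearrange vals using the now-sorted keys ... */
    }
}

int main()
{
    const int n = 1 << 20;
    thrust::device_vector<int> keys(n), vals(n);
    /* ... fill keys and vals ... */

    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());  /* default stream */

    arrangeData<<<(n + 255) / 256, 256>>>(                /* same stream, so it starts */
        thrust::raw_pointer_cast(keys.data()),            /* only after the kernels    */
        thrust::raw_pointer_cast(vals.data()), n);        /* launched by the sort end  */

    cudaDeviceSynchronize();   /* make results visible to the host thread */
    return 0;
}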

Do cudaBindTextureToArray and cudaUnbindTexture break GPU-CPU concurrency?

I want my CPU and GPU to overlap computation; however, my GPU code contains some synchronous function calls like cudaBindTextureToArray() and cudaUnbindTexture() for which no asynchronous counterparts exist. Will these calls break GPU-CPU concurrency?
In general, the functions that may be asynchronous are listed here:
- Kernel launches;
- Memory copies between two addresses to the same device memory;
- Memory copies from host to device of a memory block of 64 KB or less;
- Memory copies performed by functions that are suffixed with Async;
- Memory set function calls.
Asynchronous functions usually have an Async suffix, and they will usually accept a stream parameter.
Functions that don't meet the above description should be assumed to be synchronous. Specific exceptions (like cudaSetDevice()) are usually evident from their description.
In the context of a single-device system, synchronous functions (with the exception of specific stream synchronizing functions like cudaStreamSynchronize and cudaStreamWaitEvent) will:
1. Wait to begin until all cuda activity has completed (i.e. all previous cuda API calls and kernel calls have completed).
2. Execute their designated activity (e.g. cudaMemcpy() will begin the designated copy operation after step 1 is complete).
3. Release the calling (host) thread after step 2 is complete.
Therefore the calling (host) thread is blocked from the moment the cudaMemcpy() call is made until all previous cuda activity is complete and the cudaMemcpy() call is complete. I think most people would say this may "break" GPU-CPU concurrency, because for the duration of the sequence described above (steps 1-3) the CPU thread is effectively doing nothing.
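To make steps 1-3 concrete, here is a small sketch (the kernel and buffer sizes are placeholders):

#include <cuda_runtime.h>

__global__ void kernel(float *d) { /* ... GPU work ... */ }

int main()
{
    const size_t bytes = 1 << 20;
    float *d_a, *h_a;
    cudaMalloc(&d_a, bytes);
    cudaMallocHost(&h_a, bytes);   /* pinned host buffer */

    kernel<<<256, 256>>>(d_a);     /* asynchronous: the host thread continues... */
    /* ...so CPU work placed here overlaps the kernel */

    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);
    /* synchronous: the host thread is blocked here until the kernel above
       completes (step 1) and the copy executes (step 2), then released (step 3) */

    cudaFreeHost(h_a);
    cudaFree(d_a);
    return 0;
}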
Whether or not it makes much difference in your application will depend on what is happening before and after the synchronous call in question.