In " lmax disruptor architecture design" it is showing that , they are taking input and enqueing it in input disruptor and there are multiple event handlers like journaling ,un-marshalling , business logic and after that that enqueing it to output disruptor and output disruptor has Marshalling , journaling etc event handlers ..
My doubt is: why not use one disruptor with the combined event handlers of the input and output disruptors? We could handle events in such a way that the output disruptor's event handlers are called after the business logic processing.
Correct me if I misunderstood it.
In the article there are potentially multiple output disruptors.
It is a perfectly valid design choice in some cases to have the output processors in the same disruptor as the business logic processor. The advantage is that there is no copying of the data that needs to be output if it is already contained in the input event.
However, in this case there are multiple output processors, and if one of them is slow it would potentially cause the disruptor to fill to capacity and prevent the business logic processor from processing new events.
By having the output disruptors as separate ring buffers, the business logic processor can decide how to handle an output disruptor that is slow and whose ring buffer is full. It also allows multiple input disruptors to share the same output disruptor if that output disruptor needs exclusive access to some external resource.
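As a rough illustration of that trade-off, here is a sketch with plain Python bounded queues standing in for ring buffers (this is not the actual Disruptor API, and the handler names are made up):

import queue

# Bounded queues stand in for ring buffers. The business logic processor
# publishes to each output "disruptor" and decides what to do when one of
# them is full, rather than letting a slow output stall the input side.
input_ring = queue.Queue(maxsize=1024)
marshalling_ring = queue.Queue(maxsize=1024)
journalling_ring = queue.Queue(maxsize=1024)

def business_logic_processor(process_event):
    while True:
        event = input_ring.get()
        result = process_event(event)                    # business logic only
        for out_ring in (marshalling_ring, journalling_ring):
            try:
                out_ring.put_nowait(result)
            except queue.Full:
                handle_slow_output(out_ring, result)     # policy decision lives here

def handle_slow_output(out_ring, result):
    # Simplest possible policy: apply back-pressure by blocking. A real system
    # might instead drop, spill to disk, or raise an alert.
    out_ring.put(result)

The point is that handling a full output buffer becomes a decision the business logic processor gets to make, instead of the whole pipeline stalling implicitly.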
I have a feed-forward neural network which is basically a composition of N functions. I want to pipeline the training procedure of this network in a multi-device environment by executing some of these functions on one device, forwarding the result to a second device, executing some more functions there, and so on. So far, I think something like the following would work:
subfunctions = [...]  # a list of jit-ed functions, each of which executes one or more network layers
x = some_provided_input
for f in subfunctions:
    x = f(x)  # these get called asynchronously, right?
In addition, I need the final device to send back a "message" with backpropagated gradients to its previous device, which it in turn will also send back (after applying chain rule).
I also need these things to happen concurrently, i.e. call the function of device 1 again while device 2 is just beginning to process the input it got from device 1 (think of a pipelined execution).
Is there native support in Jax for such operations, or should I be looking into something like mpi4jax? Would that even work for me if I'm looking into managing, say, GPU devices and not CPU processes?
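Concretely, assuming at least two visible devices and with layer_block_1 / layer_block_2 as stand-ins for the jit-ed sub-functions, I picture something like:

import jax
import jax.numpy as jnp

devices = jax.devices()              # e.g. two GPUs

@jax.jit
def layer_block_1(x):                # placeholder for the first group of layers
    return jnp.tanh(x)

@jax.jit
def layer_block_2(x):                # placeholder for the second group of layers
    return jnp.tanh(x)

def pipelined_forward(batches):
    outputs = []
    for x in batches:
        h = layer_block_1(jax.device_put(x, devices[0]))
        # Dispatch is asynchronous, so the host loop can move on and enqueue
        # more work for devices[0] while devices[-1] is still computing.
        h = jax.device_put(h, devices[-1])
        outputs.append(layer_block_2(h))
    return outputs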
When calling model.rayIntersect() in the Autodesk Forge viewer, I noticed that the intersects returned did not always reflect the accurate intersections unless I waited on the GEOMETRY_LOADED_EVENT.
From inspecting the non-minified source code of the viewer (here) it does not appear to me that waiting on GEOMETRY_LOADED_EVENT is necessary based on any of the operations in the rayIntersect() function. It is my understanding that we could get the mesh data of objects in the viewer simply from the fragments, which does not require the GEOMETRY_LOADED_EVENT. Is there another event I could wait on before calling model.rayIntersect() that may fire more quickly?
I am working to perform this intersection calculation on large models in a headless form of the viewer, where waiting on the GEOMETRY_LOADED_EVENT can take quite some time, so I would prefer not to wait for it to finish.
The hit-testing logic in Forge Viewer is pretty complex and may use different approaches (such as hit-testing a BVH, hit-testing individual meshes, or checking pixels in an "ID buffer") depending on your environment and settings.
The BVH is computed by the viewer after it receives the "fragment list" with bounding boxes of all fragments (an async operation that may take a while), and the ID buffer is generated as part of the standard rendering pipeline, so for these to work you should actually wait for the Autodesk.Viewing.GEOMETRY_LOADED_EVENT event.
CUDA 10 added runtime API calls for putting streams (= queues) into "capture mode", so that instead of being executed, the enqueued work is recorded into a "graph". These graphs can then be made to actually execute, or they can be cloned.
But what is the rationale behind this feature? Isn't it unlikely to execute the same "graph" twice? After all, even if you do run the "same code", at least the data is different, i.e. the parameters the kernels take likely change. Or - am I missing something?
PS - I skimmed this slide deck, but still didn't get it.
My experience with graphs is indeed that they are not so mutable. You can change the parameters with 'cudaGraphHostNodeSetParams', but in order for the change of parameters to take effect, I had to rebuild the graph executable with 'cudaGraphInstantiate'. This call takes so long that any gain from using graphs is lost (in my case).

Setting the parameters only worked for me when I built the graph manually. When getting the graph through stream capture, I was not able to set the parameters of the nodes, as you do not have the node pointers. You would think the call 'cudaGraphGetNodes' on a stream-captured graph would return you the nodes, but the node pointer returned was NULL for me even though the 'numNodes' variable had the correct number. The documentation explicitly mentions this as a possibility but fails to explain why.
Task graphs are quite mutable.
There are API calls for changing/setting the parameters of task graph nodes of various kinds, so one can use a task graph as a template: instead of enqueueing the individual nodes before every execution, one just changes the parameters of the relevant nodes before every execution (and perhaps not all nodes actually need their parameters changed).
For example, see the documentation for cudaGraphHostNodeGetParams and cudaGraphHostNodeSetParams.
Another useful feature is concurrent kernel execution. In manual mode, one can add nodes to the graph with dependencies, and the runtime will exploit the available concurrency automatically using multiple streams. The feature itself is not new, but making it automatic is useful for certain applications.
When training a deep learning model, it often happens that you re-run the same set of kernels in the same order but with updated data. Also, I would expect CUDA to perform optimizations by knowing statically what the next kernels will be. We can imagine that CUDA can prefetch more instructions or adapt its scheduling strategy when it knows the whole graph.
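For instance, here is a rough sketch of that "same kernels, new data" pattern using PyTorch's wrapper around CUDA Graphs rather than the raw runtime API (the tensor names and shapes are placeholders):

import torch

device = torch.device("cuda")
static_input = torch.zeros(1024, 1024, device=device)
weight = torch.randn(1024, 1024, device=device)

# Warm up on a side stream so lazy initialization is not captured.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_output = static_input @ weight
torch.cuda.current_stream().wait_stream(s)

# Capture the kernel sequence once.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = static_input @ weight

# Each iteration: copy new data into the captured input tensor and replay
# the whole graph with a single CPU-side launch.
for _ in range(10):
    static_input.copy_(torch.randn(1024, 1024, device=device))
    g.replay()
    # static_output now holds this iteration's result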
CUDA Graphs is trying to solve the problem that in the presence of too many small kernel invocations, you see quite some time spent on the CPU dispatching work for the GPU (overhead).
It allows you to trade resources (time, memory, etc.) to construct a graph of kernels that you can launch with a single invocation from the CPU instead of doing multiple invocations. If you don't have enough invocations, or your algorithm is different each time, then it won't be worth it to build a graph.
This works really well for anything iterative that uses the same computation underneath (e.g., algorithms that need to converge to something) and it's pretty prominent in a lot of applications that are great for GPUs (e.g., think of the Jacobi method).
You are not going to see great results if you have an algorithm that you invoke once, or if your kernels are big; in that case the CPU invocation overhead is not your bottleneck. A succinct explanation of when you need it can be found in Getting Started with CUDA Graphs.
Where task-graph-based paradigms shine, though, is when you define your program as tasks with dependencies between them. You give a lot of flexibility to the driver / scheduler / hardware to do the scheduling itself without much fine-tuning on the developer's part. There's a reason why we have been spending years exploring the ideas of dataflow programming in HPC.
I am trying to use eventlets to process a large number of data requests, approx. 100,000 requests at a time to a remote server, each of which should generate a 10k-15k byte JSON response. I have to decode the JSON, then perform some data transformations (some field name changes, some simple transforms like English->metric, but a few require minor parsing), and send all 100,000 requests out the back end as XML in a couple of formats expected by a legacy system.

I'm using the code from the eventlet example which uses imap() ("for body in pool.imap(fetch, urls):...."), lightly modified. eventlet is working well so far on a small sample (5K urls) to fetch the JSON data.

My question is whether I should add the non-I/O processing (JSON decode, field transform, XML encode) to the "fetch()" function so that all that transform processing happens in the greenthread, or should I do the bare minimum in the greenthread, return the raw response body, and do the main processing in the "for body in pool.imap():" loop? I'm concerned that if I do the latter, the amount of data from completed threads will start building up and bloat memory, whereas doing the former would essentially throttle the process so that the XML output keeps up.

Suggestions as to the preferred method to implement this are welcome. Oh, and this will eventually run off of cron hourly, so it really has a time window it has to fit into. Thanks!
Ideally, you put each data processing operation into a separate green thread. Then, only when required, combine several operations into a batch or use a pool to throttle concurrency.
When you do non-I/O-bound processing in one loop, you essentially throttle concurrency to one simultaneous task. But you can run those in parallel using the (OS) thread pool in the eventlet.tpool module.
Throttle concurrency only when you have too much parallel CPU-bound code running.
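A rough sketch of that split (the URL list and transform_to_xml are placeholders; fetching and decoding stay in the green thread, while the CPU-heavy transform goes through tpool):

import json
import eventlet
from eventlet import tpool

urllib_request = eventlet.import_patched("urllib.request")    # green-friendly sockets

pool = eventlet.GreenPool(size=200)   # tune to what the remote server tolerates
urls = ["http://example.com/item/%d" % i for i in range(100)]  # placeholder URLs

def transform_to_xml(data):
    # Placeholder for the real field renaming / unit conversion / XML encoding.
    return "<item>%s</item>" % data.get("name", "")

def process(url):
    body = urllib_request.urlopen(url).read()      # I/O: yields to other green threads
    data = json.loads(body)                        # cheap enough to keep inline
    return tpool.execute(transform_to_xml, data)   # CPU-bound work off the event hub

for xml in pool.imap(process, urls):
    pass  # write the XML out here; consuming the iterator keeps memory bounded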
For message-oriented middleware that does not consistently support priority messages (such as AMQP) what is the best way to implement priority consumption when queues have only FIFO semantics? The general use case would be a system in which consumers receive messages of a higher priority before messages of a lower priority when a large backlog of messages exists in Queue(s).
Given only FIFO support for a given single queue, you will of course have to introduce either multiple queues, an intermediary, or a more complex consumer.
Multiple queues could be handled in a couple of ways. The producer and consumer could agree to have two queues between them, one for high-priority, and one for background tasks.
If your producer is constrained to a single queue, but you have control over the consumer, consider introducing a fan-out router in the path. So producer->Router is a single queue, and the router then has two queues to the consumer.
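For illustration, a toy in-process version of that router, with plain Python queues standing in for broker queues and an assumed "priority" field on each message:

import queue
import threading

inbound = queue.Queue()     # the single queue the producer writes to
high_q = queue.Queue()      # router output for high-priority messages
low_q = queue.Queue()       # router output for everything else

def router():
    while True:
        msg = inbound.get()
        target = high_q if msg.get("priority") == "high" else low_q
        target.put(msg)

threading.Thread(target=router, daemon=True).start()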
Another way to tackle it, which is likely less than ideal, would be to have your consumer spin a thread to front the queue, then dispatch the work internally. Something like the router version above, but inside a single app. This has the downside of having multiple messages in flight inside your app, which may complicate recovery in the event of a failure.
Don't forget to consider starvation of the effectively low-priority events, whatever they are, if some of them should be processed even if there are higher-priority events still hanging about.
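One simple way to keep the low-priority side moving, again sketched with plain in-process queues (HIGH_PER_LOW is an arbitrary tuning knob): prefer the high-priority queue, but take a low-priority message after every N high-priority ones.

import queue

high_q = queue.Queue()
low_q = queue.Queue()
HIGH_PER_LOW = 10   # take one low-priority message after this many high ones

def consume(handle):
    served_high = 0
    while True:
        prefer_low = served_high >= HIGH_PER_LOW or high_q.empty()
        source = low_q if (prefer_low and not low_q.empty()) else high_q
        try:
            msg = source.get(timeout=1)
        except queue.Empty:
            continue
        if source is high_q:
            served_high += 1
        else:
            served_high = 0
        handle(msg)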