Microsoft Orleans maximum grains per silo - configuration

I am testing Microsoft Orleans for feasibility as a distributed computing framework. It seems like it may work, but I was wondering: how do I set the maximum number of active grains in a given silo?
My grains will not be purely CPU-bound and will perform some IO and other related tasks. I am worried that if I let it run wild, it will spin up a massive number of instances, which will bog the whole thing down.
Is silo configuration like this possible?

Orleans is very well suited for non-CPU-bound work. Orleans grains are designed to use Task<T> for asynchrony instead of threads, so you should always perform asynchronous IO, using C#'s async/await feature.
If you absolutely need to perform blocking IO, you can perform the IO outside the context of the grain and await the result in your grain, like so:
var result = await Task.Run(() => {
    // Perform blocking work.
    return 43;
});

It is better to offload all blocking operations to the thread pool and keep MaxActiveThreads at its default (the number of cores).
Basically, you never want to block Orleans threads (the ones counted by MaxActiveThreads). Orleans threads should do light compute and issue external async calls (to other grains or external services). All heavy compute should be done off the Orleans threads.
You can do that and still maintain single threaded execution guarantees.
See here: https://dotnet.github.io/orleans/docs/grains/external_tasks_and_grains.html

It appears that the silo configuration XML file has that configuration ability.
<Scheduler MaxActiveThreads="15"/>
The XSD for it is specified online.

How can I reduce compute costs and waste in my Foundry transforms?

We have a lot of pretty complicated data pipelines and the amount of compute being consumed has been steadily rising every month. How can I figure out where compute is being wasted and make things more efficient?
So, this will turn into a little bit of an involved answer but hopefully I can point people to a useful set of resources to help them manage waste.
Let's start in the obvious place. Compute profiles:
Engineers will commonly increase the executor memory to solve an executor OOM, but the cause of the OOM is often skew. Try to mitigate the skew first and increase memory second.
Memory is relatively cheap, but when you increase memory you do so on every executor, which can get expensive across a large number of executors. Usually only a single executor is OOMing and 90% of the time it is due to skew.
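One rough way to check for skew before raising executor memory is to count rows per join or group key and compare the largest count to the median. The following is a minimal sketch in plain Python, not a Foundry or Spark API: the `skew_ratio` helper and the sample keys are hypothetical and only illustrate the idea.

```python
from collections import Counter
from statistics import median

def skew_ratio(keys):
    """Return the largest rows-per-key count divided by the median count."""
    counts = Counter(keys)
    return max(counts.values()) / median(counts.values())

# Illustrative data: one "hot" key dominates the column used for joins/grouping.
keys = ["a"] * 1000 + ["b"] * 10 + ["c"] * 10 + ["d"] * 12
ratio = skew_ratio(keys)
print(f"largest key is {ratio:.1f}x the median key")
```

A ratio far above 1 suggests that a single executor will receive most of the rows for that key, which is the situation where adding memory to every executor is the expensive fix.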
Local Spark: You can use the compute profile KUBERNETES_NO_EXECUTORS on small transforms (a rule of thumb might be <50 MB of input and output data), which means your transform will run on the driver (reminder on drivers vs executors). This means 2 fewer modules are spun up, reducing the amount of resources consumed by about 66%. Often a job this small does not need executors, and using them just causes shuffles and other wasted compute. When you're dealing with small data, try to use local Spark: your jobs will spin up faster and cost less.
Views: Docs on views have not been added to public docs yet, but you can find them on your platform docs at documentation/product/views/overview.
Views are a really useful way to reduce compute usage by eliminating the need for a transform altogether. Anywhere you have an identity transform being used to move a dataset between projects, or a transform that exists only to union several other datasets together, this transform can be replaced by a view. Views work by containing the information on the backing datasets and files, rather than containing any files themselves. They therefore require no processing of their own.
Incremental Pipelines: Where you have data that does not need to be changed after it is processed you might be able to use an incremental pipeline. This way you only process the new data as it comes into your pipeline without having to reprocess the entire mass of data.
This is probably the most powerful tool to reduce compute consumption in large intensive pipelines with high data throughput.

Using CUDA GPUs at prediction time for high throughput streams

We're trying to develop a Natural Language Processing application that has a user facing component. The user can call models through an API, and get the results back.
The models are pretrained using Keras with Theano. We use GPUs to speed up the training, and prediction is also sped up significantly by using the GPU. Currently, we have a machine with two GPUs. However, at runtime (e.g. when running the user facing bits) there is a problem: multiple Python processes sharing the GPUs via CUDA do not seem to offer a parallelism speed up.
We're using nvidia-docker with libgpuarray (pygpu), Theano and Keras.
The GPUs are still mostly idle, but adding more Python workers does not speed up the process.
What is the preferred way of solving the problem of running GPU models behind an API? Ideally we'd utilize the existing GPUs more efficiently before buying new ones.
I can imagine that we want some sort of buffer before sending it off to the GPU, rather than requesting a lock for each HTTP call?
This is not an answer to your more general question, but rather an answer based on how I understand the scenario you described.
If someone has coded a system which uses a GPU for some computational task, they have (hopefully) taken the time to parallelize its execution so as to benefit from the full resources the GPU can offer, or something close to that.
That means that if you add a second similar task, even in parallel, the total amount of time to complete them should be similar to the amount of time to complete them serially, i.e. one after the other, since there are very few underutilized GPU resources for the second task to benefit from. In fact, it could even be the case that both tasks will be slower (if, say, they both utilize the L2 cache a lot, and when running together they thrash it).
At any rate, when you want to improve performance, a good thing to do is profile your application - in this case, using the nvprof profiler or its nvvp frontend (the first link is the official documentation, the second link is a presentation).
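The buffering idea raised in the question (collecting requests before sending them to the GPU, rather than taking a lock per HTTP call) might look roughly like the sketch below. It is framework-agnostic and makes several assumptions: `predict_batch` is a hypothetical stand-in for a real batched model call, the `Batcher` class is invented for illustration, and the batch size and wait time are arbitrary.

```python
import queue
import threading
from concurrent.futures import Future

def predict_batch(inputs):
    # Hypothetical stand-in for a batched model call on the GPU.
    return [len(text) for text in inputs]

class Batcher:
    """Collects requests from many callers and runs them through the model in batches."""

    def __init__(self, max_batch=32, max_wait=0.01):
        self.requests = queue.Queue()
        self.max_batch = max_batch
        self.max_wait = max_wait  # seconds to wait for each additional request
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, item):
        """Called by each API worker; returns a Future holding the prediction."""
        future = Future()
        self.requests.put((item, future))
        return future

    def _loop(self):
        while True:
            batch = [self.requests.get()]  # block until at least one request arrives
            try:
                while len(batch) < self.max_batch:
                    batch.append(self.requests.get(timeout=self.max_wait))
            except queue.Empty:
                pass  # nothing more arrived in time; run what we have
            results = predict_batch([item for item, _ in batch])
            for (_, future), result in zip(batch, results):
                future.set_result(result)

batcher = Batcher()
futures = [batcher.submit(text) for text in ["hello", "gpu", "batching"]]
outputs = [f.result(timeout=5) for f in futures]
print(outputs)
```

Only one thread ever talks to the model here, so the GPU sees a few large calls instead of many small, serialized ones. Whether that helps in practice is something the profiler should confirm.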

CUDA contexts, streams, and events on multiple GPUs

TL;DR version: "What's the best way to round-robin kernel calls to multiple GPUs with Python/PyCUDA such that CPU and GPU work can happen in parallel?" with a side of "I can't have been the first person to ask this; anything I should read up on?"
Full version:
I would like to know the best way to design context, etc. handling in an application that uses CUDA on a system with multiple GPUs. I've been trying to find literature that talks about guidelines for when context reuse vs. recreation is appropriate, but so far haven't found anything that outlines best practices, rules of thumb, etc.
The general overview of what we're needing to do is:
1. Requests come in to a central process.
2. That process forks to handle a single request.
3. Data is loaded from the DB (relatively expensive).
The following is then repeated an arbitrary number of times based on the request (dozens):
1. A few quick kernel calls to compute data that is needed for later kernels.
2. One slow kernel call (10 sec).
Finally:
1. Results from the kernel calls are collected and processed on the CPU, then stored.
At the moment, each kernel call creates and then destroys a context, which seems wasteful. Setup is taking about 0.1 sec per context and kernel load, and while that's not huge, it is precluding us from moving other quicker tasks to the GPU.
I am trying to figure out the best way to manage contexts, etc. so that we can use the machine efficiently. I think that in the single-gpu case, it's relatively simple:
1. Create a context before starting any of the GPU work.
2. Launch the kernels for the first set of data.
3. Record an event for after the final kernel call in the series.
4. Prepare the second set of data on the CPU while the first is computing on the GPU.
5. Launch the second set, repeat.
6. Ensure that each event gets synchronized before collecting the results and storing them.
That seems like it should do the trick, assuming proper use of overlapped memory copies.
However, I'm unsure what I should do when wanting to round-robin each of the dozens of items to process over multiple GPUs.
The host program is Python 2.7, using PyCUDA to access the GPU. Currently it's not multi-threaded, and while I'd rather keep it that way ("now you have two problems" etc.), if the answer means threads, it means threads. Similarly, it would be nice to just be able to call event.synchronize() in the main thread when it's time to block on data, but for our needs efficient use of the hardware is more important. Since we'll potentially be servicing multiple requests at a time, letting other processes use the GPU when this process isn't using it is important.
I don't think that we have any explicit reason to use Exclusive compute modes (i.e. we're not filling up the memory of the card with one work item), so I don't think that solutions that involve long-standing contexts are off the table.
Note that answers in the form of links to other content that covers my questions are completely acceptable (encouraged, even), provided they go into enough detail about the why, not just the API. Thanks for reading!
Caveat: I'm not a PyCUDA user (yet).
With CUDA 4.0+ you don't even need an explicit context per GPU. You can just call cudaSetDevice (or the PyCUDA equivalent) before doing per-device stuff (cudaMalloc, cudaMemcpy, launch kernels, etc.).
If you need to synchronize between GPUs, you will need to potentially create streams and/or events and use cudaEventSynchronize (or the PyCUDA equivalent). You can even have one stream wait on an event inserted in another stream to do sophisticated dependencies.
So I suspect the answer today is quite a lot simpler than talonmies' excellent pre-CUDA-4.0 answer.
You might also find this answer useful.
(Re)Edit by OP: Per my understanding, PyCUDA supports versions of CUDA prior to 4.0, and so still uses the old API/semantics (the driver API?), so talonmies' answer is still relevant.

Reliably monitor a serial port (Nortel CS1000)

I have a custom python script that monitors the call logs from a Nortel phone system. This phone system is under extremely high volume throughout the day and it's starting to appear that some records may be getting lost.
Some of you may dislike this, but I'm not interested in sharing the source code or current method in any way. I would rather consider this from a "new project" approach.
I'm looking for insight into the easiest and safest way to reliably monitor heavy data output through a serial port on Linux. I'm not limiting this to any particular set of tools or languages, I want to find out what works best to do this one critical job. I'm comfortable enough parsing the data and inserting it into mysql that we could just assume the data could be dropped to a text file.
Thank you
Well, the way that I would approach this is to have 2 threads (or processes) working.
Thread 1: The read thread
This thread does nothing but read data from the raw serial port and put the data into a local buffer/queue (In memory is preferred for speed). It should do nothing else. Depending on the clock speed of the serial connection, this should be pretty easy to do.
Thread 2: The processing thread
This thread just sleeps until there is data in the local buffer to process, then reads and processes it. That's it.
The reason for splitting it in two is so that if one is busy (say, the processing thread is blocked on MySQL) it won't affect the other. After all, while the serial port is buffered by the OS, the buffer size is limited.
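The two-thread layout described above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: an `io.BytesIO` object stands in for the serial port (with pyserial it would presumably be a `serial.Serial` object), the sample records are made up, and appending to a list stands in for the MySQL insert.

```python
import io
import queue
import threading

def reader(port, buf):
    """Thread 1: only read lines from the port and enqueue them."""
    # A real port would loop indefinitely; the stand-in stops at end of input.
    for line in iter(port.readline, b""):
        buf.put(line)
    buf.put(None)  # sentinel: tell the processor the source is exhausted

def processor(buf, sink):
    """Thread 2: block until data is queued, then parse and store it."""
    while True:
        line = buf.get()  # sleeps here while the queue is empty
        if line is None:
            break
        sink.append(line.decode("ascii", errors="replace").strip())

# Stand-in for the serial port, holding three made-up call records.
port = io.BytesIO(b"CALL 001\r\nCALL 002\r\nCALL 003\r\n")
buf = queue.Queue()  # unbounded in-memory buffer between the two threads
records = []

threads = [
    threading.Thread(target=reader, args=(port, buf)),
    threading.Thread(target=processor, args=(buf, records)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(records)
```

Because the reader never touches the database, a slow insert only makes the queue grow. It does not delay the next read from the port.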
But then again, any local program is likely going to be way faster than the serial port can send data. Serial transfer is actually quite slow relative to the clock speed of the processor (115.2 kbps is about the limit on standard hardware). So unless you're CPU-bound (such as on an Arduino), I can't see normal conditions affecting it too much. Your choice of language really shouldn't be of too much concern (assuming modern hardware). Stick to what you know.
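To put a number on how slow that is, here is a back-of-the-envelope calculation. It assumes 8N1 framing (10 bits on the wire per byte) and a hypothetical 80-byte call record; neither figure comes from the phone system itself.

```python
BAUD = 115_200       # bits per second on the wire
BITS_PER_BYTE = 10   # assumes 8N1 framing: 1 start + 8 data + 1 stop bit
RECORD_BYTES = 80    # hypothetical length of one call record

bytes_per_second = BAUD // BITS_PER_BYTE
records_per_second = bytes_per_second // RECORD_BYTES
print(bytes_per_second, records_per_second)
```

Under those assumptions the port delivers at most about 11.5 KB per second, which any modern machine can read and queue without effort.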

Reporting Services won't use more than 25% of CPU

I've set up a solution that creates rapid fire PDF reports. Currently it seems I can't get Reporting Services to use all the resources it has available to it. The system doesn't appear to be IO bound, CPU bound, or memory bound. Any suggestions on trying to figure out why it's running so?
The application isn't network IO bound, and it is multi-threaded to 2 times the number of processors.
SQL Server Reporting Services limits the number of reports run to 2 simultaneous ad-hoc reports and 2 simultaneous web reports. This is a hard limit imposed by the server.
Robin Day is probably right. However, if you are using a processor that supports hyper-threading, you may get a performance benefit by turning it off in the BIOS. You can try an A/B performance test.
You could also check that the SQL instance (when you say reporting service you mean SSRS, right?) does not have processor affinity set.
Is this a case of not using a multi threaded approach? Is the machine using 100% of one core of a processor and that's the bottleneck?
EDIT: Sorry for stating the obvious, was just an idea before you mentioned that it was already multi threaded. I'm afraid I can't offer any more suggestions.
Any suggestions on trying to figure out why it's running so?
a) There's an API to restrict a whole process to one CPU: test that using GetProcessAffinityMask.
b) 'Thread state' and 'Thread wait reason' are two of the performance counters ... maybe you can read these to see why threads that you think ought to be running aren't.
All the threads of your application may be fighting for a single lock. Use a profiler to see if there is contention somewhere.
If you have four cores, that would explain why you see 25% overall CPU usage.
Maybe the server can't deliver more data over the network (so it's network IO bound)?