How many Uvicorn workers should I have in production? (Gunicorn)

My Environment
FastAPI
Gunicorn & Uvicorn Worker
AWS EC2 c5.2xlarge (8 vCPU)
Document
https://fastapi.tiangolo.com/deployment/server-workers/
Question
Currently I'm using 24 Uvicorn workers on my production server (c5.2xlarge):
gunicorn main:app --workers 24 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:80
I've learned that one process runs on one core. Therefore, if I have 8 processes, I can make use of all the cores (a c5.2xlarge has 8 vCPUs).
I'm curious: in this situation, is there any performance benefit to running more than 8 processes?

The recommended number of workers is (2 x number_of_cores) + 1.
You can read more about it at
https://docs.gunicorn.org/en/stable/design.html#:~:text=Gunicorn%20should%20only%20need%204,workers%20to%20start%20off%20with.
In your case, with 8 CPU cores, you should be using 17 workers.
Additional thoughts on async systems:
The "two times the number of cores" figure is not scientific, as the documentation itself says. But the idea is that one worker can do I/O while another does CPU processing at the same time, making maximum use of the available hardware threads. Even with async systems this conceptually holds and should give you maximum efficiency.
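A minimal sketch of applying that rule in a Gunicorn config file; the file name gunicorn_conf.py and the choice to use a config file at all are illustrative assumptions, not part of the original setup:

    # gunicorn_conf.py -- compute workers from the (2 x cores) + 1 rule.
    # multiprocessing.cpu_count() returns logical CPUs (vCPUs on EC2).
    import multiprocessing

    workers = multiprocessing.cpu_count() * 2 + 1  # 17 on a c5.2xlarge
    worker_class = "uvicorn.workers.UvicornWorker"
    bind = "0.0.0.0:80"

You would then start the server with: gunicorn main:app -c gunicorn_conf.py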

In general the best practice is:
number_of_workers = number_of_cores x 2 + 1
or more precisely:
number_of_workers = number_of_cores x num_of_threads_per_core + 1
The reason for it is CPU hyperthreading, which allows each core to run multiple concurrent threads. The number of concurrent threads is decided by the chip designers.
Two concurrent threads per CPU core are common, but some processors can support more than two.
The vCPU count listed for an AWS EC2 instance is already the hyperthreaded number of processing units on the machine (number_of_cores x num_of_threads_per_core). It should not be confused with the number of physical cores available on that machine.
So in your case, the c5.2xlarge's 8 vCPUs mean you have 8 concurrently executing hardware threads (4 physical cores x 2 threads per core) to size your workers against.
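A small sketch to check this distinction on the instance itself; psutil is an assumed third-party dependency (pip install psutil), not something from the question:

    import os
    import psutil  # assumed dependency for the physical-core count

    logical = os.cpu_count()                    # vCPUs: cores x threads per core
    physical = psutil.cpu_count(logical=False)  # physical cores only
    print(f"{physical} physical cores, {logical} logical CPUs")
    # Expected on a c5.2xlarge: 4 physical cores, 8 logical CPUs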

Related

How to make sense of OpenShift pods CPU usage metrics

Can someone please help explain how to understand the CPU usage metrics reported on the OpenShift web console?
Below is an example for my application. The cursor on the graph points to 0.008 cores, and the value differs at different times. What does 0.008 cores mean? How should this value be understood if my project on OpenShift doesn't have resource limits or quotas set? Thanks!
Compute Resources
Each container running on a node uses compute resources like CPU and memory. You can specify how much CPU and memory a container needs to improve scheduling and performance.
CPU
CPU is often measured in units called millicores. Each millicore is equivalent to 1/1000 of a CPU core.
1000 millicores = 1 core
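So the 0.008 cores reading from the graph is just a fractional core value; a one-line worked conversion:

    cores = 0.008                # value shown on the OpenShift graph
    millicores = cores * 1000    # 1 core = 1000 millicores
    print(f"{cores} cores = {millicores:.0f}m")  # 0.008 cores = 8 millicores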

Apache Superset - Concurrent loading of dashboard slices (Athena)

I've got a dashboard with a few slices set up. The slices load one after the other, not concurrently, which results in a bad user experience. My data sits on S3 and I'm using the Athena connector for queries. I can see that the calls to Athena are fired in order, with each query waiting for the previous one to finish before running.
I'm using gevent, which as far as I can tell shouldn't be a problem?
Below is an extract from my Superset config:
SUPERSET_WORKERS = 8
WEBSERVER_THREADS = 8
They used to be set to 2 and 1 respectively, but I upped both to 8 to see if that was the issue. I'm getting the same result though.
Is this a simple misconfiguration issue or am I missing something?
It's important to understand multiprocessing and multithreading before increasing the workers and threads for Gunicorn. For CPU-intensive operations you want many processes, while for I/O-intensive operations you probably want many threads.
For your problem, you don't need many processes, but rather threads within a process. With that config, the next step would be to debug how you're spawning greenlets (gevent facilitates concurrency, and concurrency is not the same as parallel processing); see the sketch below.
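A minimal sketch of concurrent I/O with greenlets, assuming gevent is installed; fetch_slice and the URLs are hypothetical placeholders, not Superset internals:

    import gevent
    from gevent import monkey
    monkey.patch_all()  # make blocking socket I/O cooperative; must run early

    import urllib.request

    def fetch_slice(url):
        # each greenlet yields to the others while it waits on I/O
        with urllib.request.urlopen(url) as resp:
            return resp.status

    urls = ["https://example.com/a", "https://example.com/b"]  # placeholders
    jobs = [gevent.spawn(fetch_slice, u) for u in urls]
    gevent.joinall(jobs, timeout=30)
    print([job.value for job in jobs])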
To bootstrap gunicorn with multiple threads you could do something like this:
gunicorn -b localhost:8000 main:app --threads 30 --reload
Please post more code to facilitate targeted help.

Is it valid to assume that Google virtual CPUs are all on one socket (if < 16 vCPUs)?

We're building a high-performance scientific computing application (lots and lots of computations) using Java. To the best of our knowledge, Google Compute Engine does not expose the true physical socket information, nor does it have a service like AWS's dedicated hosts (https://aws.amazon.com/ec2/dedicated-hosts/, see the section on "affinity") where, for a fee, one can see the actual physical sockets.
However, based on our understanding, the JIT compiler will do a lot better if it knows that all the threads are really on a single physical socket. Would it be reasonable, therefore, to assume that even though Google Compute Engine does NOT expose the true underlying physical socket structure, an instance with <= 16 vCPUs is definitely (or most likely, e.g. >95%) running on a single physical socket? If so, can we also assume that the CPU numbers reported by cat /proc/cpuinfo correspond in sequence to the physical and logical cores, so that if we wanted to put two threads onto the same physical core (but two different logical cores), we could pin them to CPU 0 and CPU 1 and know that CPU0 and CPU1 belong to the same physical core, CPU2 and CPU3 to the next, and so on?
If so, would it be reasonable to assume that instances with 32 or 64 vCPUs span 2 and 4 sockets respectively? And that the output of cat /proc/cpuinfo also follows a logical order, so that not only are CPU0 and CPU1 on the same physical core, but CPU0 through CPU15 are on physical socket #1, CPU16 through CPU31 on physical socket #2, and so on?
As you inferred, GCE currently does not expose the actual NUMA architecture of the machine, and we do not guarantee that a VM will run entirely on one socket, nor can you intentionally land VM threads on specific cores/hyperthreads. These capabilities are on our radar as possible future enhancements/features.
I don't believe this is specifically documented at present; however, I am speaking as a Product Manager for GCE.
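Rather than assuming the CPU numbering, you can read the topology the guest kernel reports via standard Linux sysfs paths (keeping in mind that on GCE this is still the virtualized topology, not the physical one). A minimal sketch:

    import glob

    def cpu_topology():
        # map each logical CPU to the (socket, core) the kernel reports
        topo = {}
        for cpu_dir in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*")):
            cpu = int(cpu_dir.rsplit("cpu", 1)[1])
            with open(cpu_dir + "/topology/physical_package_id") as f:
                socket = int(f.read())
            with open(cpu_dir + "/topology/core_id") as f:
                core = int(f.read())
            topo[cpu] = (socket, core)
        return topo

    for cpu, (socket, core) in sorted(cpu_topology().items()):
        print(f"CPU{cpu}: socket {socket}, core {core}")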

GridEngine on single machine: How can I limit cores for each job?

I have a single machine with 32 cores (2 processors) and 32G of RAM. I installed Grid Engine to submit jobs to the queues I created, but jobs seem to run on all cores.
I wonder if there is a way to limit the cores and RAM for each job. For example, I have two queues: parallel.q and serial.q. I want to allocate 20G of RAM and 20 cores to serial.q, with each job using only one core and at most 1G of RAM, and 8G of RAM plus 8 cores to a single parallel job. The remaining 4 cores and 4G of RAM are left for other usage.
How can I configure my queues or Grid Engine to get this setup right? I tried to read the manual, but I don't have a clue.
Thanks!
I don't have a problem with parallel jobs. I have some serial jobs that call several different programs, and somehow the system assigns them all available cores. But I don't want all cores used per job; for example, only two cores should be available to each job (each job runs several programs sequentially, in which case the system allocates each program a core). By the way, I would like to have some idle cores available at all times to process other jobs, like data processing. Is that possible or necessary?
In fact, if I understand correctly, you want to partition a single machine into several sub-queues, is that right?
This may be problematic with SGE, because the host configuration only lets you set the number of CPUs available on a given node. You then create your queues and assign different hosts to different queues.
In your case, you should assign the same host to one master queue, and then add subordinate queues that can each use only a given MAX_SLOTS number of slots.
But if I may ask one question: why do you need to partition it? If you set up only one queue and configure a parallel environment, you can just submit your jobs using qsub -pe <parallelEnvironment> <NSLOTS> and the grid engine takes care of everything. I suggest you set up at least an OpenMP parallel environment, because you probably won't need MPI on a shared-memory machine like yours (it sounds like a great machine, by the way).
Another thing is that you must be able to configure your model run so that your code can work with a limited number of CPUs; this is very important. In practice, you must assign the same number of CPUs to the simulation code as to SGE. This information is contained in the $NSLOTS variable of your qsub script; see the sketch below.
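A minimal sketch of honoring the granted slot count from inside the job, assuming the simulation driver is Python and the job was submitted with qsub -pe <parallelEnvironment> <NSLOTS>; using OMP_NUM_THREADS assumes an OpenMP-based simulation code:

    import os

    # SGE exports NSLOTS into the job environment; default to 1 outside SGE
    nslots = int(os.environ.get("NSLOTS", "1"))

    # cap the simulation's threading at the slots SGE actually granted
    os.environ["OMP_NUM_THREADS"] = str(nslots)
    print(f"running with {nslots} slots")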

Multiple GPUs and Multiple Executables

Suppose I have 4 GPUs and would like to run 50 CUDA programs in parallel. My question is: is the NVIDIA driver smart enough to run the 50 CUDA programs on the different GPUs, or do I have to set the CUDA device for each program?
Thank you
The first point to make is that you cannot run 50 applications in parallel on 4 GPUs on just about any CUDA platform. If you have a Hyper-Q capable GPU, there is the possibility of up to 32 threads or MPI processes queuing work to the GPU. Otherwise there is a single command queue.
For anything other than the latest Kepler Tesla cards, the CUDA driver only supports a single active context at a time. If you run more than one application on a GPU, the processes will each have contexts, which simply contend with one another on a "first come, first served" basis. If one application blocks the others with a long-running kernel or similar, there is no pre-emption or anything else which makes the process yield to another process. When the GPU is shared with a display manager, there is a watchdog timer that imposes an upper limit of a few seconds before the application gets its context killed. The result is that only one context ever runs on the hardware at a time. Context switching isn't free, and there is a performance penalty when multiple processes contend for a single device.
Furthermore, every context present on a GPU requires device memory. On the platform you are asking about, Linux, there is no memory paging, so every context's resources must coexist in GPU memory. I don't believe it would be possible to have even 12 non-trivial contexts running on any current GPU simultaneously; you would run out of available memory well before that number. Trying to run more applications would result in a context establishment failure.
As for the behaviour of the driver in distributing multiple applications across multiple GPUs: AFAIK the Linux driver doesn't do any intelligent distribution of processes amongst GPUs, except when one or more of the GPUs are in a non-default compute mode. If no device is specifically requested, the driver will always try to find the first valid, free GPU it can run a process or thread on. If a GPU is busy and marked compute-exclusive (either thread or process) or marked prohibited, the driver will skip over it when trying to find a GPU to run on. If all GPUs are exclusive and occupied or prohibited, the application will fail with a "no valid device available" error.
So in summary, for everything other than Hyper-Q devices, there is no performance gain in doing what you are asking about (quite the opposite), and I would expect it to break if you tried. A much saner approach would be to use compute exclusivity in combination with a resource-managing task scheduler like Torque or one of the (former) Sun Grid Engine versions, which could schedule your processes to run in an orderly fashion according to GPU availability. This is how most general-purpose HPC clusters deal with scheduling in multi-GPU environments; a minimal sketch of the idea follows.
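A toy sketch of that "one process per GPU at a time" scheduling idea, pinning each process via CUDA_VISIBLE_DEVICES; the binary names are hypothetical placeholders, and a real cluster would use Torque/SGE rather than this:

    import os
    import queue
    import subprocess
    from threading import Thread

    NUM_GPUS = 4
    work = queue.Queue()
    for i in range(50):
        work.put(f"./cuda_prog_{i}")  # hypothetical CUDA binaries

    def gpu_worker(gpu_id):
        # run at most one process on this GPU at a time
        while True:
            try:
                prog = work.get_nowait()
            except queue.Empty:
                return
            env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
            subprocess.run([prog], env=env)

    threads = [Thread(target=gpu_worker, args=(g,)) for g in range(NUM_GPUS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()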