PM2 cluster mode: what is the ideal number of workers?

I am using PM2 to run my Node.js application.
When starting it in cluster mode with "pm2 start server -i 0", PM2 will automatically spawn as many workers as there are CPU cores.
What is the ideal number of workers to run, and why?

Beware of the context switch
When running multiple processes on your machine, try to make sure each CPU core will be kept busy by a single application thread at a time. As a general rule, you should look to spawn N-1 application processes, where N is the number of available CPU cores. That way, each process is guaranteed to get a good slice of one core, and there's one spare for the kernel scheduler to run other server tasks on. Additionally, try to make sure the server will be running little or no work other than your Node.js application, so processes don't fight for CPU.
We made a mistake where we deployed two busy Node.js applications to our servers, both apps spawning N-1 processes each. The applications' processes started vehemently competing for CPU, resulting in CPU load and usage increasing dramatically. Even though we were running these on beefy 8-core servers, we were paying a noticeable penalty due to context switching. Context switching is the behaviour whereby the CPU suspends one task in order to work on another. When context switching, the kernel must save all the state of one process while it loads and executes the state of another. After simply reducing the number of processes the applications spawned, such that they each shared an equal number of cores, load dropped significantly:
https://engineering.gosquared.com/optimising-nginx-node-js-and-networking-for-heavy-workloads
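Under that rule, a minimal sketch of launching PM2 with N-1 workers (assuming server.js is your entry point and nproc is available):
# Spawn N-1 PM2 workers, where N is the number of CPU cores.
CORES=$(nproc)
pm2 start server.js -i $((CORES - 1))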

Related

Why do my Google Cloud Compute Engine instances always unexpectedly restart?

Help! Help! Help!
It is really annoying and I almost cannot bear it anymore! I'm using Google Cloud Compute Engine instances, but they often unexpectedly restart without any advance notification. The restarts seem to happen randomly and I have no idea what's going wrong! I'm pretty sure the instances are busy (CPU usage > 50% and all GPUs in use) when a restart happens. Could anyone please tell me how to solve this problem? Thanks in advance!
The issue is right here:
all GPUs are in use
If you check the official documentation about GPU:
GPU instances must terminate for host maintenance events, but can automatically restart. These maintenance events typically occur once per week, but can occur more frequently when necessary. You must configure your workloads to handle these maintenance events cleanly. Specifically, long-running workloads like machine learning and high-performance computing (HPC) must handle the interruption of host maintenance events. Learn how to handle host maintenance events on instances with GPUs.
This is because an instance that has a GPU attached cannot be live-migrated to another host for maintenance, as happens for the rest of the virtual machines. To give you a physical GPU attached to the instance with bare-metal performance, Compute Engine uses GPU passthrough, which sadly means that if the host has to go through maintenance, the VM goes down with it.
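For the maintenance-event case, one way to see an event coming (a sketch, assuming you can run a small watcher on the instance) is to poll the metadata server:
# Poll the GCE metadata server for a pending host maintenance event.
# Prints NONE when nothing is scheduled.
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event"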
This sounds like a Preemptible VM instance.
Preemptible instances function like normal instances, but have the following limitations:
Compute Engine might terminate preemptible instances at any time due to system events. The probability that Compute Engine will terminate a preemptible instance for a system event is generally low, but might vary from day to day and from zone to zone depending on current conditions.
Compute Engine always terminates preemptible instances after they run for 24 hours.
To check if your instance is preemptible using the gcloud CLI, just run:
gcloud compute instances describe instance-name --format="(scheduling.preemptible)"
Result:
scheduling:
  preemptible: false
Replace instance-name with your instance's actual name.
Or simply via the UI: click on the compute instance and scroll down to its availability settings.
To check for system operations performed on your instance, you can review them using the following command:
gcloud compute operations list
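Preemption events show up in that list; to narrow it down to them, you can filter on the operation type used for preemptions:
# List only instance-preemption events for the current project.
gcloud compute operations list --filter="operationType=compute.instances.preempted"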

How to prevent two CUDA programs from interfering

I've noticed that if two users try to run CUDA programs at the same time, it tends to lock up either the card or the driver (or both?). We need to either reset the card or reboot the machine to restore normal behavior.
Is there a way to get a lock on the GPU so other programs can't interfere while it's running?
Edit
OS is Ubuntu 11.10 running on a server. While there is no X Windows running, the card is used to display the text system console. There are multiple users.
If you are running on either Linux or Windows with the TCC driver, you can put the GPU into compute exclusive mode using the nvidia-smi utility.
Compute exclusive mode makes the driver refuse a context establishment request if another process already holds a context on that GPU. Any process trying to run on a busy compute-exclusive GPU will receive a "no device available" error and fail.
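A sketch with nvidia-smi (GPU index 0 is an assumption; requires root):
# Put GPU 0 into exclusive-process compute mode: only one context may exist at a time.
sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
# Verify the setting.
nvidia-smi -q -i 0 | grep "Compute Mode"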
You can use something like Task Spooler to queue the programs and run them one at a time.
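For example (a sketch; the program names are placeholders):
# Queue two CUDA programs; Task Spooler runs one job at a time by default.
ts ./cuda_program_a
ts ./cuda_program_b
# Show queued, running, and finished jobs.
ts -l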
We use the TORQUE Resource Manager, but it's harder to configure than ts. With TORQUE you can have multiple queues (e.g. one for CUDA jobs, two for CPU jobs) and assign a different job to each GPU.
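Submission under TORQUE then looks something like this (queue name and job script are assumptions):
# Submit a job script to a dedicated CUDA queue.
qsub -q cuda run_cuda_job.sh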

KVM/QEMU maximum VM count limit

For a research project I am trying to boot as many VMs as possible, using the Python libvirt bindings, in KVM under Ubuntu Server 12.04. All the VMs are set to idle after boot and to use a minimal amount of memory. At most I was able to boot 1000 VMs on a single host, at which point the kernel (Linux 3.x) became unresponsive, even though both CPU and memory usage were nowhere near the limits (48 AMD cores, 128 GB memory). Before that, the booting process became successively slower after a couple of hundred VMs.
I assume this must be related to the KVM/QEMU driver, as the Linux kernel itself should have no problem handling this few processes. However, I did read that the QEMU driver is now multi-threaded. Any ideas what the cause of this slowness may be, or at least where I should start looking?
You are booting all the VMs using qemu-kvm, and after hundreds of VMs it becomes successively slower. When it does, stop using KVM and boot with plain QEMU instead; I expect you will see the same slowness. My guess is that after that many VMs, the hardware virtualization support that KVM relies on is exhausted, because KVM is essentially a thin software layer over a few added hardware registers. So KVM might be the culprit here.
Also, what is the purpose of this experiment?
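A sketch of that comparison (disk image and memory size are assumptions):
# Boot the same minimal guest with and without KVM acceleration and compare.
qemu-system-x86_64 -enable-kvm -m 64 -hda tiny.img   # hardware-assisted (KVM)
qemu-system-x86_64 -m 64 -hda tiny.img               # pure emulation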
The following virtual hardware limits for guests have been tested. We ensure host and VMs install and work successfully, even when reaching the limits and there are no major performance regressions (CPU, memory, disk, network) since the last release (SUSE Linux Enterprise Server 11 SP1).
Max. Guest RAM Size --- 512 GB
Max. Virtual CPUs per Guest --- 64
Max. Virtual Network Devices per Guest --- 8
Max. Block Devices per Guest --- 4 emulated (IDE), 20 para-virtual (using virtio-blk)
Max. Number of VM Guests per VM Host Server --- Limit is defined as the total number of virtual CPUs in all guests being no greater than 8 times the number of CPU cores in the host
For more limitations of KVM, please refer to this document link.
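To make the last tested limit concrete for the asker's 48-core host, assuming one virtual CPU per guest:
# Tested limit: total vCPUs across all guests <= 8 x host cores.
# 8 * 48 cores = 384 single-vCPU guests at most.
echo $(( 8 * 48 ))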

Running Hudson on EC2

I am planning to install Hudson on Amazon EC2 using an Ubuntu image. The code I am going to test does not have a big memory overhead; I will be executing mainly Python unit tests.
Which EC2 instance type should I use? Would a micro instance be sufficient (have enough memory), or should I use a bigger instance?
Jenkins itself will happily run on a micro, but there are two problems: 1) you won't have much memory left for building and testing (around 150 MB); but the bigger problem is 2) if your CPU usage spikes for more than a few seconds, Amazon will simply crush your instance with throttling, cutting off 97% or more of the available CPU. http://gregsramblings.com/2011/02/07/amazon-ec2-micro-instance-cpu-steal/
The throttling made things completely impossible for us: a build with testing took 12 minutes on EC2 instead of 25 seconds on a quad-core i7 laptop.
But! There's a fix for the frugal:
Run a Jenkins master on a micro, but start up a small instance when needed to run the actual tests. That gives us plenty of memory and decent CPU, yet it's still incredibly cheap (ten cents per push [or commit]). However, it substantially increases build time because it has to boot the instance and all that.
The setup is rather involved, and requires working around some limitations of the ec2 plugin (which, overall, works extremely well), so we wrote up a blog post if you want to do this: http://wkmacura.tumblr.com/post/5416465911/jenkins-ec2
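The plugin handles the instance lifecycle for you; under the hood it is just launching an instance on demand, something like this hand-rolled sketch (AMI ID and instance type are placeholders):
# Start a small build slave on demand; the Jenkins EC2 plugin automates this step.
aws ec2 run-instances --image-id ami-12345678 --instance-type m1.small --count 1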
We're running Hudson on EC2 and integration testing Ruby/Rails. We're doing just fine on a micro instance, as I think you will be too.

Development Environment for Testing MySQL Replication

Is there an easy way to setup an environment on one machine (or a VM) with MySQL replication? I would like to put together a proof of concept of MySQL replication with one Master write instance and two slave instances for reads.
I can see doing it across 2 or 3 VMs running on my computer, but that would really bog down my system. I'd rather have everything running on the same VM. What's the best way to proof out scalability solutions like this in a local dev environment?
Thanks for your help,
Dave
I think that to truly test MySQL replication, it is important to do so under realistic constraints.
If you put all the replica nodes under one operating system, you no longer have the bandwidth constraint; the data transfer speed would be much higher than what you would get if those replica DBs were on different sites.
Everything under one VM is a shortcut on configuration; for instance, it does not make you go through the configuration of the networking.
I suggest you use multiple VMs. Even if you have to put them on one physical machine, you can always configure the hypervisor to make the packets go through a router, in which case the I/O will be bound by whatever throughput the network interface has.
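Once the VMs can reach each other, the replication handshake itself is short. A minimal sketch of the classic master/slave setup (hostnames and credentials are placeholders; this assumes fresh servers, classic MySQL 5.x syntax, and binary logging enabled on the master):
# On the master (my.cnf: server-id=1, log-bin=mysql-bin): create a replication account.
mysql -u root -e "GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%' IDENTIFIED BY 'repl-pass';"
# On each slave (my.cnf: unique server-id): point it at the master and start replicating.
mysql -u root -e "CHANGE MASTER TO MASTER_HOST='master-vm', MASTER_USER='repl', MASTER_PASSWORD='repl-pass'; START SLAVE;"
# Check replication health on the slave.
mysql -u root -e "SHOW SLAVE STATUS\G"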
I can see doing it across 2 or 3 VMs running on my computer, but that would really bog down my system.
You can try to make a few VMs with JeOS (Just Enough OS) versions of the operating system you want. I know Ubuntu has one, and it can boot in 128 MB of RAM, which makes it convenient to deploy lots of cloned VMs on one physical machine without monster amounts of RAM.
Next step would be doing the same thing on a cloud (Infrastructure as a Service, IaaS) provider, and try your setup on different geographical sites.
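Cloning such a base VM for each node is a one-liner under libvirt (a sketch; the domain names are assumptions):
# Clone a minimal JeOS base VM for each replica node.
virt-clone --original jeos-base --name mysql-slave1 --auto-clone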
If what you're testing is machine-to-machine replication, then setting up multiple VMs on a virtual private network would be the correct environment to test it. If you use Ubuntu Server, you don't have to install more than you actually need -- just give the VMs enough space for a base install + MySQL + your data. Memory usage can be as little as 256MB per VM. All you have to do is suspend or shut down the VMs when you're not running a full-up test.
I've had situations where I was running 4 or more VMs simultaneously on my workstation, either for development or testing purposes -- it's not that taxing unless you're trying to do video rendering in each VM.