Ensuring One Job Per Node on StarCluster / Sun Grid Engine (SGE)

When submitting jobs via qsub on a StarCluster / SGE cluster, is there an easy way to ensure that each node receives at most one job at a time? I am having issues where multiple jobs end up on the same node, leading to out-of-memory (OOM) issues.
I tried using -l cpu=8, but I think that checks only the total number of cores on the box itself, not the number of cores in use.
I also tried -l slots=8 but then I get:
Unable to run job: "job" denied: use parallel environments instead of requesting slots explicitly.

In your config file (.starcluster/config) add this section:
[plugin sge]
setup_class = starcluster.plugins.sge.SGEPlugin
slots_per_host = 1
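If the cluster is already running, the plugin can be applied without rebuilding it; the cluster tag mycluster below is a placeholder:
starcluster runplugin sge mycluster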

This largely depends on how the cluster's resources are configured (memory limits, etc.). However, one thing to try is to request a large amount of memory for each job:
-l h_vmem=xxG
This has the side effect of excluding other jobs from running on the node, since most of the node's memory has already been requested by a previously running job.
Just make sure the memory you request is not above the allowable limit for the node. You can see whether a job is exceeding this limit by checking the output of qstat -j <jobid> for errors.
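For example, on nodes with roughly 30G of RAM, a request like the following (the job script name is illustrative) leaves too little free memory for a second job to be scheduled alongside it:
qsub -l h_vmem=28G myjob.sh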

I accomplished this by setting the number of slots on each of my nodes to 1 using:
qconf -aattr queue slots "[nodeXXX=1]" all.q
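To apply this to every execution host in the default all.q in one pass, something like:
for host in $(qconf -sel); do
  qconf -aattr queue slots "[$host=1]" all.q
done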

Related

Container for thread-level process isolation

I want to know if it is possible to customize an LXC kernel (or a related system like OpenVZ, etc.) to operate at the thread level; see this mention:
Unlike Docker, Virtuozzo, and LXC, which operate on the process level,
LVE is able to operate on the thread level. This allows multithreaded
servers such as Apache (with its 'worker' MPM) to take advantage of
LVE without having to run a separate instance per LVE user.
source:
blog.phusion.nl/2016/02/03/lve-an-alternative-container-technology-to-docker-and-virtuozzolxc/

Apache Superset - Concurrent loading of dashboard slices (Athena)

I've got a dashboard with a few slices set up. The slices load one after the other, not concurrently, which results in a bad user experience. My data is sitting in S3 and I'm using the Athena connector for queries. I can see that the calls to Athena are fired in order, with each query waiting for the one before it to finish before running.
I'm using gevent, which as far as I can tell shouldn't be a problem?
Below is an extract from my Superset config:
SUPERSET_WORKERS = 8
WEBSERVER_THREADS = 8
They used to be set to 2 and 1 respectively, but I upped both to 8 to see if that was the issue. I'm getting the same result, though.
Is this a simple misconfiguration issue or am I missing something?
It's important to understand multiprocessing and multithreading before increasing the workers and threads for gunicorn. For CPU-intensive operations you want many processes, while for I/O-intensive operations you probably want many threads.
For your problem, you don't need many processes but rather many threads within a process. With that config, the next step would be to debug how you're spawning greenlets (gevent facilitates concurrency, and concurrency is not the same as parallel processing).
To bootstrap gunicorn with multiple threads you could do something like this:
gunicorn -b localhost:8000 main:app --threads 30 --reload
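Since you're using gevent, an async worker class may be a better fit than plain threads; a sketch (the superset:app module path is an assumption, check your installation's entry point, and 8088 is Superset's usual port):
gunicorn -b localhost:8088 --worker-class gevent --workers 2 superset:app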
Please post more code to facilitate targeted help.

Kubernetes on GCE / Preventing pods from being evicted with "The node was low on compute resources."

This was a painful investigation into aspects that, so far, aren't well covered by the documentation (at least from what I've googled).
My cluster's kube-proxy pods were evicted (more experienced users can probably imagine the issues that followed). I searched a lot, but found no clues about how to bring them up again.
Eventually, describing the affected pod gave a clear reason: "The node was low on compute resources."
Still not that experienced with balancing resources between pods/deployments and "physical" compute, how would one prioritize (or take a similar approach) to make sure specific pods never end up in such a state?
The cluster was created with fairly low resources in order to get our hands dirty while keeping costs low and, as it happens, witnessing exactly this kind of problem (gcloud container clusters create deemx --machine-type g1-small --enable-autoscaling --min-nodes=1 --max-nodes=5 --disk-size=30). Is using g1-small prohibitive here?
If you are using iptables-based kube-proxy (the current best practice), then kube-proxy being killed should not immediately cause your network connectivity to fail, but new services and updates to endpoints will stop working. Still, your apps should continue to work, but degrade slowly.
If you are using userspace kube-proxy, you might want to upgrade.
The error message sounds like it was due to memory pressure on the machine.
When there is memory pressure, Kubelet tries to terminate things in order of lowest to highest QoS level.
If your kube-proxy pod is not using Guaranteed resources, then you might want to change that.
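A pod only gets the Guaranteed QoS class when every container's CPU and memory requests equal its limits. You can check which class the kube-proxy pod currently has (the pod name and namespace below are typical but may differ on your cluster):
kubectl describe pod kube-proxy-<node-name> -n kube-system | grep "QoS Class"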
Other things to look at:
If kube-proxy suddenly used a lot more memory, it could be terminated. If you created a huge number of pods, services, or endpoints, this could cause it to use more memory.
If you started processes on the machine that are not under Kubernetes control, that could cause the kubelet to make an incorrect decision about what to terminate. Avoid this.
It is possible that on a machine as small as a g1-small, the amount of node resources held back is insufficient, such that too much guaranteed work got put on the machine (see allocatable vs capacity). This might need tweaking.
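You can compare the two for a node with (the node name is a placeholder):
kubectl describe node <node-name>
and inspect the Capacity and Allocatable sections of the output.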
Node oom documentation

GridEngine on single machine: How can I limit cores for each job?

I have a single machine with 32 cores (2 processors) and 32G of RAM. I installed Grid Engine to submit jobs to the queues I created, but it seems jobs run on all cores.
I wonder if there is a way to limit the cores and RAM for each job. For example, I have two queues, parallel.q and serial.q: I'd allocate 20G of RAM and 20 cores to serial.q, with each job using only one core and at most 1G of RAM, and 8G of RAM plus 8 cores to a single parallel job. That leaves 4 cores and 4G of RAM for other usage.
How can I configure my queues or Grid Engine to get this right? I tried reading the manual, but I don't have a clue.
Thanks!
I don't have a problem with parallel jobs. I have some serial jobs that call several different programs, and somehow the system assigns them all available cores. I don't want all cores used per job; for example, only two cores should be available to each job. (Each job runs several programs sequentially, in which case the system allocates each program a core.) BTW, I would like to have some idle cores available at all times for other work, like processing data. Is that possible or necessary?
In fact, if I understand correctly, you want to partition a single machine into several sub-queues, is that right?
This may be problematic with SGE, because the host configuration lets you set the number of CPUs available on a given node. Then you create your queues and assign different hosts to different queues.
In your case, you should assign the same host to one master queue, and then add subordinate queues that can each use only a given MAX_SLOTS number of slots.
But if I may ask one question: why do you need to partition it? If you set up only one queue and configure a parallel environment, then you can just submit your jobs using qsub -pe <parallelEnvironment> <NSLOTS> and the grid engine takes care of everything. I suggest you set up at least an OpenMP parallel environment, because you probably won't need MPI on a shared-memory machine like yours (it seems like a great machine, BTW).
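As a minimal sketch, such an SMP/OpenMP-style parallel environment could be created like this (the PE name smp and the slot count are examples):
cat > smp_pe.txt <<'EOF'
pe_name            smp
slots              32
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
EOF
qconf -Ap smp_pe.txt                  # add the PE from the file
qconf -aattr queue pe_list smp all.q  # attach it to the queue
The allocation_rule of $pe_slots keeps all of a job's slots on a single host, which is what you want for OpenMP.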
Another thing: you must be able to configure your model run so that the code you are using can run with a limited number of CPUs; this is very important. In practice, you must hand the simulation code the same number of CPUs that you requested from SGE. This information is contained in the $NSLOTS variable of your qsub script.
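A minimal qsub script sketch tying the two together (the PE name smp comes from the sketch above; the executable and its use of OMP_NUM_THREADS are assumptions about your code):
#!/bin/bash
#$ -pe smp 8                      # request 8 slots on one host
export OMP_NUM_THREADS=$NSLOTS    # hand the granted slot count to the OpenMP code
./my_simulation                   # hypothetical simulation binary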

Performing distributed CUDA/OpenCL-based password cracking

Is there a way to perform a distributed (as in a cluster of connected computers) CUDA/OpenCL-based dictionary attack?
For example, one computer with an NVIDIA card sharing the load of the dictionary attack with another coupled computer, thus utilizing a second array of GPUs?
The idea is to ensure a scalability option for future expansion without needing to replace the whole set of hardware we are using (and let's say cloud is not an option).
This is a simple master/slave work-delegation problem. The master work server hands out a unit of work to any connecting slave process. Slaves work on one unit and queue one unit. When they complete a unit, they report back to the server. Work units that are exhaustively checked are used to estimate operations per second. Depending on your setup, I would adjust work units to be somewhere in the 15-60 second range. Anything that doesn't get a response by the 10 minute mark is recycled back into the queue.
For queuing, offer the current list of uncracked hashes, the dictionary range to be checked, and the permutation rules to be applied. The master server should be able to adapt queues per machine and per permutation rule set so that all machines finish their work within a minute or so of each other.
Alternately, coding could be made simpler if each unit of work were the same size. Even then, no machine would be idle longer than the time it takes the slowest machine to complete one unit of work. Size your work units so that the fastest machine doesn't enter a case of resource starvation (it shouldn't complete work faster than five seconds, and should always have a second unit queued). Using that method, hopefully your fastest and slowest machines don't differ by a factor of more than 100x.
It seems to me that it would be quite easy to write your own service to do just this.
Super Easy Setup
Let's say you have some GPU-enabled program X that takes a hash h and a list of dictionary words D as input, then uses the dictionary words to try to crack the password. With one machine, you simply run X(h, D).
If you have N machines, you split the dictionary into N parts (D_1, D_2, D_3, ..., D_N), then run X(h, D_i) on machine i.
This could easily be done using SSH: the master machine splits the dictionary up, copies a part to each slave machine using SCP, then connects to the slaves and tells them to run the program.
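A minimal sketch of that flow, assuming GNU split, passwordless SSH to the slaves, and a cracker binary X that takes a hash and a wordlist path:
#!/bin/bash
HOSTS=(node1 node2 node3 node4)                 # slave machines (examples)
HASH="$1"                                       # hash to crack, passed as an argument
split -n l/${#HOSTS[@]} dictionary.txt part_    # one line-aligned chunk per host
i=0
for chunk in part_*; do
  scp "$chunk" "${HOSTS[$i]}:/tmp/"             # ship the chunk to its slave
  ssh "${HOSTS[$i]}" "./X $HASH /tmp/$chunk" &  # start the cracker remotely
  i=$((i + 1))
done
wait                                            # block until every slave finishes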
Slightly Smarter Setup
When one machine cracks the password, it can easily notify the master that it has completed the task; the master then kills the programs running on the other slaves.
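Once a slave reports success, the master could stop the rest; for instance (the pkill pattern assumes X was launched as ./X on each slave, as in the sketch above):
for host in "${HOSTS[@]}"; do
  ssh "$host" "pkill -f './X'"
done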