GridEngine on single machine: How can I limit cores for each job?

I have a single machine with 32 cores (2 processors) and 32 GB of RAM. I installed Grid Engine and submit jobs to the queues I created, but the jobs seem to run on all cores.
I wonder whether there is a way to limit the cores and RAM for each job. For example, I have two queues, parallel.q and serial.q. I want to allocate 20 GB of RAM and 20 cores to serial.q, with each serial job using only one core and at most 1 GB of RAM, and 8 GB of RAM plus 8 cores to a single parallel job. The remaining 4 cores and 4 GB of RAM are left for other use.
How can I configure my queues or Grid Engine to get this right? I tried reading the manual but could not work it out.
Thanks!
I don't have a problem with parallel jobs. My serial jobs call several different programs, and somehow the system gives them every available core. I don't want all cores to be used by jobs; for example, I want only two cores available to each job. (Each job runs several programs sequentially, and the system allocates each program a core.) BTW, I would also like some cores to stay idle at all times for other work, such as processing data. Is that possible, or even necessary?

If I understand correctly, you want to partition a single machine into several sub-queues, is that right?
This may be problematic with SGE because the host configuration lets you set the number of CPUs available on a given node. Then you create your queues and assign different hosts to different queues.
In your case, you should assign the same host to one master queue, and then add subordinate queues that can use only a given MAX_SLOTS slots.
But if I may ask one question: why partition it at all? If you set up only one queue and configure a parallel environment, you can just submit your jobs using qsub -pe <parallelEnvironment> <NSLOTS> and Grid Engine takes care of everything. I suggest you set up at least an OpenMP parallel environment, because you probably won't need MPI on a shared-memory machine like yours (it seems a great machine BTW).
Another thing is that you must be able to configure your model run so that the code you are using can run with a limited number of CPUs; this is very important. In practice you must assign the same number of CPUs to the simulation code as to SGE. This information is available in the $NSLOTS variable in your qsub script.
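To make the $NSLOTS point concrete, here is a minimal sketch of how a job script written in Python might read the slot count and pass it on to a threaded program. The helper names slots_from_env and thread_env are made up for this example, and it assumes the program honors the standard OpenMP variable OMP_NUM_THREADS; other software may need its own setting.

```python
import os


def slots_from_env(env=None, default=1):
    """Return the slot count granted to the job, read from NSLOTS.

    Falls back to `default` when NSLOTS is missing or not a number,
    e.g. when the script is run outside Grid Engine.
    """
    env = os.environ if env is None else env
    try:
        n = int(env.get("NSLOTS", default))
    except ValueError:
        n = default
    return max(1, n)


def thread_env(env=None):
    """Copy the environment and cap OpenMP threads at the slot count."""
    env = dict(os.environ if env is None else env)
    # Assumption: the program being launched respects OMP_NUM_THREADS.
    env["OMP_NUM_THREADS"] = str(slots_from_env(env))
    return env


if __name__ == "__main__":
    print(thread_env()["OMP_NUM_THREADS"])
```

The resulting dictionary can be handed to subprocess.run(..., env=thread_env()) when launching the real program, so a job submitted with 2 slots starts 2 threads rather than one per core.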

Related

What is the difference between Nvidia Hyper Q and Nvidia Streams?

I always thought that Hyper-Q technology was nothing but streams on the GPU. Later I found I was wrong (am I?). So I did some reading about Hyper-Q and got even more confused.
I was going through one article and it had these two statements:
A. Hyper-Q is a flexible solution that allows separate connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process
B. Hyper-Q increases the total number of connections (work queues) between the host and the GK110 GPU by allowing 32 simultaneous, hardware-managed connections (compared to the single connection available with Fermi)
Of the points above, point B says that multiple connections can be created to a single GPU from the host. Does it mean I can create multiple contexts on a single GPU through different applications? Does it mean that I will have to execute all applications on different streams? What if all my connections consume memory and compute resources: who manages the resource (memory/cores) scheduling?

Think of HyperQ as streams implemented in hardware on the device side.
Before the arrival of HyperQ, e.g. on Fermi, commands (kernel launches, memory transfers, etc.) from all streams were placed in a single work queue by the driver on the host. That meant that commands could not overtake each other, and you had to be careful issuing them in the right order on the host to achieve best overlap.
On the GK110 GPU and later devices with HyperQ, there are (at least) 32 work queues on the device. This means that commands from different queues can be reordered relative to each other until they start execution. So both orderings in the example linked above lead to good overlap on a GK110 device.
This is particularly important for multithreaded host code, where you can't control the order without additional synchronization between threads.
Note that of the 32 hardware queues only 8 are used by default to save resources. Set the CUDA_DEVICE_MAX_CONNECTIONS environment variable to a higher value if you need more.
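As far as I know, the variable has to be in the environment before the CUDA runtime initializes, so one option is to set it for the child process at launch. A minimal sketch, assuming a hypothetical executable ./my_cuda_app and a helper name invented for this example; the 1 to 32 range check simply reflects the 32 hardware queues mentioned above:

```python
import os


def env_with_max_connections(n, base=None):
    """Return a copy of the environment with CUDA_DEVICE_MAX_CONNECTIONS set.

    `n` is limited to 1..32 to match the number of hardware work queues.
    """
    if not 1 <= n <= 32:
        raise ValueError("expected a value between 1 and 32")
    env = dict(os.environ if base is None else base)
    env["CUDA_DEVICE_MAX_CONNECTIONS"] = str(n)
    return env


# Usage (the binary name is a placeholder, not a real program):
# import subprocess
# subprocess.run(["./my_cuda_app"], env=env_with_max_connections(32), check=True)
```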

looking for simple cluster configuration

I am using Compute Engine for embarrassingly parallel scientific calculations. Some of my calculations require a single core and some require 64-core machines. I currently use my own scripts: I have a qsub-like command that creates a new instance with the required number of cores, boots it from a custom image with the software pre-installed, connects to a storage bucket via gcsfuse, runs the required command, and then kills the instance when it's done.
Do I really need to do all of that with my own scripts, or is there a tool that I should use instead? I'd much rather use a ready-made tool for all of the management.
My usage fluctuates widely (hundreds of cores in parallel for 3 hours, then 2 days with nothing, etc.), so I don't want constant-sized machines: I'd like to be billed by the minute for my computations.

You may want to use the autoscaling feature of managed instance groups in Google Compute Engine (GCE). This feature adds instances to your instance group when load increases (upscaling) and removes instances when load decreases (downscaling). Moreover, you can define an autoscaling policy based on CPU utilization, load balancer utilization, or requests per second. Please refer to the autoscaler decisions document to understand the decisions the autoscaler might make when scaling instance groups.

Limit cores per job in sun gridengine

I'm currently setting up a gridengine on Ubuntu 16.04 using the sun gridengine.
Most of the features I want to use are working. However, I'm struggling with the following problem:
I have a 32-core machine (64 threads).
I'm running jobs that use software like Matlab.
These software packages can use multiple threads for calculation.
Current situation:
The queue has 2 slots, and processors is set to 1.
I submit one job and all 64 threads are used for the calculation.
I submit a second job and both run in parallel.
So, for run-time tests, I cannot control the number of cores used.
I also tried to set up a parallel environment (connected to that queue), but if I run a job there, all cores are still used.
I guess I have a general problem of understanding.
Does anybody know how to set up something like this:
a) each slot can only use one core (then the parallel environment would allow me to specify the slots/cores of a job)
b) restrict the cores of a submitted job
It is also important that this is not only an upper but also a lower bound. But this could be handled by the number of slots, I guess.
Thanks already in advance for any ideas.

You can't (easily) control the number of threads a process spawns, but with a recent Grid Engine you can control the number of cores it can access. If your Grid Engine is recent, check out the -binding parameter of qsub and the USE_CGROUPS option in sge_conf. If you have an older Grid Engine, you could try playing tricks with the starter_method.
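As one illustration of the starter_method idea, a wrapper could narrow the CPU affinity of the process before the real job starts. The sketch below is not how -binding or cgroups work internally; pick_cores and restrict_to_cores are hypothetical helpers, and os.sched_getaffinity / os.sched_setaffinity exist on Linux only.

```python
import os


def pick_cores(available, n):
    """Choose up to n core IDs (lowest first) from the allowed set."""
    return set(sorted(available)[:max(1, n)])


def restrict_to_cores(n):
    """Pin the current process, and any children it starts, to at most n cores.

    Linux only. Threads can still be spawned freely, but they all
    share the selected cores.
    """
    cores = pick_cores(os.sched_getaffinity(0), n)
    os.sched_setaffinity(0, cores)
    return cores


if __name__ == "__main__":
    # Assumption: NSLOTS holds the number of slots granted to the job.
    print(restrict_to_cores(int(os.environ.get("NSLOTS", "1"))))
```

Note that this naive version always picks the lowest-numbered cores, so two jobs running at once would land on the same ones; a real wrapper would need some way to hand out distinct cores per job.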

Is it possible to split Cuda jobs between GPU & CPU?

I'm having some trouble understanding how, or whether, it's possible to share a workload between a GPU and a CPU. I have a large log file; I need to read each line and then run about 5 million operations on it (testing for various scenarios). My current approach has been to read a few hundred lines, add them to an array and then send it to each GPU. This works fine, but because there is so much work per line and so many lines, it takes a long time. I noticed that while this is going on my CPU cores are basically doing nothing. I'm using EC2, so I have 2 quad-core Xeons and 2 Tesla GPUs. One CPU core reads the file (running the main program) and the GPUs do the work, so I'm wondering what I can do to involve the other 7 cores in the process.
I'm a bit confused about how to design a program to balance the tasks between GPU and CPU, because they would finish their jobs at different times, so I couldn't just send work to all of them at the same time. I thought about setting up a queue (I'm new to C, so I'm not sure if this is possible yet), but then is there a way to know when a GPU job is completed (since I thought sending jobs to CUDA was asynchronous)? My kernel is very similar to a normal C function, so converting it for CPU use is not a problem; balancing the work seems to be the issue. I went through 'CUDA by Example' again but couldn't really find anything about this type of balancing.
Any suggestions would be great.

I think the key is to create a multithreaded app, following all the common practices for that, and have two types of worker threads. One that does work with the GPU and one that does work with the CPU. So basically, you will need a thread pool and a queue.
http://en.wikipedia.org/wiki/Thread_pool_pattern
The queue can be very simple. You can have one shared integer that is the index of the current row in the log file. When a thread is ready to retrieve more work, it locks that index, gets some number of lines from the log file, starting at the line designated by the index, then increases the index by the number of lines that it retrieved, and then unlocks.
When a worker thread is done with one chunk of the log file, it posts its results back to the main thread and gets another chunk (or exits if there are no more lines to process).
The app launches some combination of GPU and CPU worker threads to utilize all available GPUs and CPU cores.
One problem you may run into is that if the CPU is busy, performance of the GPUs may suffer, as slight delays in submitting new work or processing results from the GPUs are introduced. You may need to experiment with the number of threads and their affinity. For instance, you may need to reserve one CPU core for each GPU by manipulating thread affinities.
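The shared-index queue described above can be sketched in a few lines. This is an illustration in Python rather than the C/CUDA the question uses: process_in_chunks is a made-up name, and each "worker" is a plain callable standing in for a GPU or CPU worker that takes a list of lines and returns a list of results.

```python
import threading


def process_in_chunks(lines, workers, chunk_size=100):
    """Run one thread per worker; each thread claims chunks via a shared index."""
    next_index = 0            # index of the first line nobody has claimed yet
    lock = threading.Lock()   # guards next_index and results
    results = {}              # chunk start index -> list of results

    def run(worker):
        nonlocal next_index
        while True:
            with lock:
                start = next_index
                if start >= len(lines):
                    return    # no more lines to process
                next_index = min(start + chunk_size, len(lines))
                end = next_index
            chunk_result = worker(lines[start:end])  # work happens outside the lock
            with lock:
                results[start] = chunk_result

    threads = [threading.Thread(target=run, args=(w,)) for w in workers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Reassemble the results in original line order.
    ordered = []
    for start in sorted(results):
        ordered.extend(results[start])
    return ordered
```

Because each thread only asks for more lines when it is ready, a fast GPU worker naturally ends up taking more chunks than a slow CPU worker, which is the balancing the question asks about. In CPython the global interpreter lock limits pure-Python CPU work, so this shows the queueing pattern only, not real speedup.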

Since you say the work is line-by-line, maybe you can split the jobs across 2 different processes:
One CPU + GPU process
One CPU process that uses the remaining 7 cores
You can start each process at a different offset: for example, the 1st process reads lines 1-50, 101-150, etc. while the 2nd one reads lines 51-100, 151-200, etc.
This saves you the headache of optimizing the CPU-GPU interaction.
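The interleaved offsets can be computed up front. A small sketch, with block_ranges as a hypothetical helper name; it returns 1-based, inclusive line ranges for one process out of several:

```python
def block_ranges(total_lines, block_size, num_procs, proc_id):
    """List the (first_line, last_line) blocks assigned to process proc_id.

    Blocks are dealt out round-robin: process 0 gets block 0, process 1
    gets block 1, and so on, wrapping around after num_procs blocks.
    """
    ranges = []
    start = proc_id * block_size            # 0-based offset of the first block
    while start < total_lines:
        last = min(start + block_size, total_lines)
        ranges.append((start + 1, last))    # convert to 1-based, inclusive
        start += block_size * num_procs     # skip the other processes' blocks
    return ranges


if __name__ == "__main__":
    print(block_ranges(200, 50, 2, 0))  # [(1, 50), (101, 150)]
    print(block_ranges(200, 50, 2, 1))  # [(51, 100), (151, 200)]
```

A fixed split like this does not adapt if one process is much faster than the other, so the block sizes may need tuning by hand.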

How to measure current load of MySQL server?

How to measure current load of MySQL server? I know I can measure different things like CPU usage, RAM usage, disk IO etc but is there a generic load measure for example the server is at 40% load etc?

mysql> SHOW GLOBAL STATUS;
Found here.

The notion of "40% load" is not really well-defined. Your particular application may react differently to constraints on different resources. Applications will typically be bound by one of three factors: available (physical) memory, available CPU time, and disk IO.
On Linux (or possibly other *NIX) systems, you can get a snapshot of these with vmstat, or iostat (which provides more detail on disk IO).
However, to connect these to "40% load", you need to understand your database's performance characteristics under typical load. The best way to do this is to test with typical queries under varying amounts of load, until you observe response times increasing dramatically (this will mean you've hit a bottleneck in memory, CPU, or disk). This load should be considered your critical level, which you do not want to exceed.

is there a generic load measure for example the server is at 40% load ?
Yes! there is:
SELECT LOAD_FILE("/proc/loadavg")
This works on a Linux machine. It displays the system load averages for the past 1, 5, and 15 minutes.
The system load average is the average number of processes that are in either a runnable or an uninterruptible state. A process in a runnable state is either using the CPU or waiting to use the CPU.
A process in an uninterruptible state is waiting for some I/O access, e.g. waiting for disk. The averages are taken over the three time intervals. Load averages are not normalized for the number of CPUs in a system, so a load average of 1 means a single-CPU system is loaded all the time, while on a 4-CPU system it means it was idle 75% of the time.
So if you want to normalize, you also need to count the number of CPUs.
You can do that too with
SELECT LOAD_FILE("/proc/cpuinfo")
see also 'man proc'
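The normalization itself is a single division. A minimal sketch in Python, assuming you already have the text of /proc/loadavg and the CPU count (the function names are made up for this example):

```python
def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    one, five, fifteen = (float(field) for field in text.split()[:3])
    return one, five, fifteen


def normalized_load(text, cpus):
    """Divide each load average by the CPU count; 1.0 means fully loaded."""
    return tuple(value / cpus for value in parse_loadavg(text))


if __name__ == "__main__":
    import os
    with open("/proc/loadavg") as f:   # Linux only
        print(normalized_load(f.read(), os.cpu_count() or 1))
```

With this, a raw load of 1.00 on a 4-CPU machine comes out as 0.25, matching the "idle 75% of the time" reading above.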

With top or htop you can follow the usage on Linux in real time.

On linux based systems the standard check is usually uptime, a load index is returned according to metrics described here.

Aside from all the good answers on this page (SHOW GLOBAL STATUS, vmstat, top...), there is also a very simple-to-use tool written by Jeremy Zawodny that is perfect for non-admin users. It is called "mytop". More info: http://jeremy.zawodny.com/mysql/mytop/

Hi friend, as per my research there are commands like:
MYTOP: an open source program written in Perl.
MTOP: also an open source program written in Perl. It works the same as MYTOP, but it also monitors queries that take a long time and kills them after a specified time.
Link for details of above command
