I want to know if it is possible to customize an LXC kernel (or a related system such as OpenVZ) to operate at the thread level rather than the process level. See this quote:
Unlike Docker, Virtuozzo, and LXC, which operate on the process level,
LVE is able to operate on the thread level. This allows multithreaded
servers such as Apache (with its 'worker' MPM) to take advantage of
LVE without having to run a separate instance per LVE user.
source:
blog.phusion.nl/2016/02/03/lve-an-alternative-container-technology-to-docker-and-virtuozzolxc/
I'm looking for a way to effectively virtualize a single process (and, presumably, any children it creates). Although a Container model sounds appropriate, products like Docker don't quite fit the bill.
The Intel VMX model allows one to create a hypervisor, then launch a VM, which will exit back to the hypervisor under certain (programmable) conditions, such as privileged instruction execution, CR3/CR8 manipulation, exceptions, direct I/O, etc. The features of the VMX model fit very well with my needs, except that I can't require a VM with a separate instance of the entire OS to accomplish the task - I just want my hypervisor to control one child application (think Photoshop/Excel/Firefox; one process and its progeny, if any) that's running under the host OS, and catch VM exits under the specified conditions (for debugging and/or emulation purposes). Outside of the exit conditions, the child process should run unencumbered, and have access to all OS resources to which it would be entitled without the VM, including filesystem, graphical output, keyboard/mouse input, IPC/messaging, etc. For my purposes, I am not interested in isolation or access restriction, which is the typical motivation for using a VM - to the contrary, I want the child process to be fully enmeshed in the host OS environment. While operating entirely in user-space is preferable, I can utilize Ring 0 to facilitate this. (Note that the question is Intel-specific and OS-agnostic, although it's likely to be implemented in a *nix environment first.)
I'm wondering what would happen if I had my hypervisor set up a VMCS that simply mirrored the host's actual configuration, including page tables, IDT, etc., then VMLAUNCH 0(%rip) (in effect, a pseudo-fork?) and execute the child process from there. (That seems far too simplistic to actually work, but the notion does have some appeal). Assuming that's a Bad Idea™, how might I approach this problem?
I always thought that Hyper-Q technology was nothing but streams on the GPU. Later I found I was wrong (am I?). So I did some reading about Hyper-Q and got even more confused.
I was going through one article and it had these two statements:
A. Hyper-Q is a flexible solution that allows separate connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process
B. Hyper-Q increases the total number of connections (work queues) between the host and the GK110 GPU by allowing 32 simultaneous, hardware-managed connections (compared to the single connection available with Fermi)
Point B says that multiple connections can be created from the host to a single GPU. Does that mean I can create multiple contexts on a single GPU from different applications? Does it mean that I have to run every application on a different stream? And if all my connections consume a lot of memory and compute resources, who manages the scheduling of those resources (memory/cores)?
Think of HyperQ as streams implemented in hardware on the device side.
Before the arrival of HyperQ, e.g. on Fermi, commands (kernel launches, memory transfers, etc.) from all streams were placed in a single work queue by the driver on the host. That meant that commands could not overtake each other, and you had to be careful issuing them in the right order on the host to achieve best overlap.
On the GK110 GPU and later devices with HyperQ, there are (at least) 32 work queues on the device. This means that commands from different queues can be reordered relative to each other until they start execution. So both orderings in the example linked above lead to good overlap on a GK110 device.
This is particularly important for multithreaded host code, where you can't control the order without additional synchronization between threads.
Note that of the 32 hardware queues only 8 are used by default to save resources. Set the CUDA_DEVICE_MAX_CONNECTIONS environment variable to a higher value if you need more.
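As a minimal sketch of raising that limit, the variable can be placed in the environment of a child process before it initializes CUDA. The helper name `cuda_env_with_connections` is hypothetical, and the clamp to 32 simply mirrors the queue count quoted above rather than querying the driver.

```python
import os


def cuda_env_with_connections(connections, base_env=None):
    """Return a copy of the environment with CUDA_DEVICE_MAX_CONNECTIONS set.

    The value is clamped to 1..32, mirroring the 32 hardware queues
    mentioned above (an assumption of this sketch, not a driver query).
    """
    env = dict(os.environ if base_env is None else base_env)
    clamped = max(1, min(32, int(connections)))
    env["CUDA_DEVICE_MAX_CONNECTIONS"] = str(clamped)
    return env


# Build an environment that asks for all 32 queues.
env = cuda_env_with_connections(32, base_env={"PATH": "/usr/bin"})
print(env["CUDA_DEVICE_MAX_CONNECTIONS"])  # prints 32
```

The returned dictionary could then be passed as the `env` argument of `subprocess.run` when starting the CUDA application, so that the setting is in place before the application creates its context.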
I am using Compute Engine for embarrassingly parallel scientific calculations. Some of my calculations require a single core and some require 64-core machines. I am currently using my own scripts: I have a qsub-like command that creates a new instance with the required number of cores, boots it from a custom image with the pre-installed software, connects to a storage bucket via gcsfuse, runs the required command, and then kills the instance once it's done.
Do I really need to do all of that with my own scripts, or is there a tool that I should use instead? I'd much rather use some ready-made tool for all of the management.
My usage fluctuates widely (hundreds of cores in parallel for 3 hours, then 2 days with nothing, etc.), so I don't want constant-sized machines: I like to be billed by the minute for my computations.
You may want to use the autoscaling feature of managed instance groups in Google Compute Engine (GCE). It adds instances to your instance group when load increases (scaling up) and removes them when load decreases (scaling down). You can define an autoscaling policy based on CPU utilization, load balancer utilization, or requests per second. Please refer to the autoscaler decisions document to understand the decisions the autoscaler might make when scaling instance groups.
I have a single machine with 32 cores (2 processors) and 32 GB of RAM. I installed Grid Engine and created some queues to submit jobs to, but jobs seem to run on all cores.
I wonder if there is a way to limit the cores and RAM for each job. For example, I have two queues, parallel.q and serial.q. I want to allocate 20 GB of RAM and 20 cores to serial.q, with each serial job using only one core and at most 1 GB of RAM, and 8 GB of RAM plus 8 cores to a single parallel job. The remaining 4 cores and 4 GB of RAM are left for other use.
How can I configure my queues or Grid Engine to get this setup right? I tried to read the manual, but I don't have a clue.
Thanks!
I don't have a problem with parallel jobs. Some of my serial jobs call several different programs, and somehow the system assigns them all available cores. I don't want all cores to be used by jobs; for example, I want only two cores available to each job. (Each job runs several programs sequentially, in which case the system allocates each program a core.) BTW, I would like to keep some cores idle at all times for other work, such as processing data. Is that possible, or necessary?
If I understand correctly, you want to partition a single machine into several sub-queues, is that right?
This may be problematic with SGE, because the number of CPUs available on a node is set in the host configuration. You then create your queues and assign different hosts to different queues.
In your case, you should assign the same host to one master queue, and then add subordinate queues that can each use at most a given MAX_SLOTS slots.
But if I may ask one question: why should you partition it? If you set up only one queue and configure a parallel environment, then you can just submit your jobs using qsub -pe <parallelEnvironment> <NSLOTS> and the grid engine takes care of everything. I suggest you set up at least an OpenMP parallel environment, because you probably won't need MPI on a shared-memory machine like yours (it seems a great machine BTW).
Another thing is that you must be able to configure your model run so that the code you are using runs on a limited number of CPUs; this is very important. In practice, you must give the simulation code the same number of CPUs that you request from SGE. That number is available in the $NSLOTS variable of your qsub script.
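To illustrate that last point, a job wrapper can read `$NSLOTS` and hand the same count on to the program. This is only a sketch: the helper names are hypothetical, and it assumes the code is OpenMP-based, so that `OMP_NUM_THREADS` is the setting that limits its CPU use.

```python
import os


def granted_slots(env=None, default=1):
    """Return the slot count SGE granted to this job via NSLOTS.

    Falls back to `default` when NSLOTS is missing or not a positive
    integer (for example when the script runs outside the scheduler).
    """
    env = os.environ if env is None else env
    try:
        slots = int(env.get("NSLOTS", default))
    except ValueError:
        return default
    return slots if slots > 0 else default


def openmp_env(env=None):
    """Copy the environment and cap OpenMP threads at the granted slots."""
    source = os.environ if env is None else env
    result = dict(source)
    result["OMP_NUM_THREADS"] = str(granted_slots(source))
    return result


print(openmp_env({"NSLOTS": "8"})["OMP_NUM_THREADS"])  # prints 8
```

Inside a real job, `openmp_env()` would be called without arguments and its result passed as the environment of the simulation program, so a job submitted with 8 slots runs with 8 threads and a plain serial job falls back to 1.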
I'm working on a cluster with a lot of nodes, and each node has two GPUs. On the cluster, I can't launch "nvidia-smi" to check which device is busy. My code selects the best device (with cudaChooseDevice) in terms of capability, but when the cluster assigns me the same node for two different jobs, I end up with two tasks running on the same GPU.
My question is: is there a way to check at runtime whether a device is busy?
Thanks
Your cluster managers should install and use cluster management (job-scheduling) software that allows them to assign and track GPUs just like CPUs and memory. There are a number of job schedulers that can do this. Even without explicit GPU support in the job-scheduler, it's possible to build job entry/exit scripts that will assign GPUs properly.
You can effectively include the same functionality that nvidia-smi uses by embedding NVML in your applications. Any query or data item reported on by nvidia-smi can be accessed programmatically through NVML.
It's also not clear to me why you could not launch a script for your job which checks which devices are busy using nvidia-smi, then picks an un-busy device.
But keep in mind that any runtime check you might do would be subject to the behavior of other applications. If those applications (whether launched by you or other users) have unusual behavior, your runtime check can easily be defeated.
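The script-based check suggested above might be sketched as follows. It assumes `nvidia-smi` can be run from inside the job, and it treats a GPU as idle when it is below arbitrary utilization and memory thresholds. The function names and thresholds are hypothetical, and the result is only a snapshot, so it is subject to the caveat just described.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(csv_text):
    """Parse 'index, util %, mem MiB' rows into (index, util, mem) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 3:
            continue  # skip anything that is not a data row
        try:
            rows.append(tuple(int(f) for f in fields))
        except ValueError:
            continue  # skip rows with non-numeric fields
    return rows


def pick_idle_gpu(rows, max_util=5, max_mem_mib=100):
    """Return the index of the least-used GPU under both thresholds, or None."""
    idle = [r for r in rows if r[1] <= max_util and r[2] <= max_mem_mib]
    if not idle:
        return None
    return min(idle, key=lambda r: (r[1], r[2]))[0]


def query_idle_gpu():
    """Run nvidia-smi and pick an idle GPU; None if the tool is unavailable."""
    try:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    return pick_idle_gpu(parse_gpu_rows(out.stdout))


sample = "0, 97, 10240\n1, 0, 3\n"
print(pick_idle_gpu(parse_gpu_rows(sample)))  # prints 1
```

A job script could export the chosen index through `CUDA_VISIBLE_DEVICES` before starting the application, so that the application only sees that device.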