How to make sense of OpenShift pods CPU usage metrics - openshift

Can someone please help explain how to understand the CPU usage metrics reported on the OpenShift web console.
Below is an example for my application. The cursor on the graph points to 0.008 cores and is different at different times. What does 0.008 cores mean? How should this value be understood if my project on OpenShift doesn't resource limit and quote set? Thanks!

Compute Resources Learn More
Each container running on a node uses compute resources like CPU and memory. You can specify how much CPU and memory a container needs to improve scheduling and performance.
CPU
CPU is often measured in units called millicores. Each millicore is equivalent to 1⁄1000 of a CPU core.
1000 millcores = 1 core

Related

How to extend tensorflow's GPU memory from system RAM

I want to train googles object detection with faster_rcnn_with resnet101 using mscoco datasetcode. I used only 10,000 images for training purpose.I used graphics: GeForce 930M/PCIe/SSE2. NVIDIA Driver Version:384.90. here is the picture of my GeForce.
And I have 8Gb RAM but in tensorflow gpu it is showed 1.96 Gb.
. Now How can I extend my PGU's RAM. I want to use full system memory.
You can train on the cpu to take advantage of the RAM on your machine. However, to run something on the gpu it has to be loaded to the gpu first. Now you can swap memory in and out, because not all the results are needed at any step. However, you pay with a very long training time and I would rather advise you to reduce the batch size. Nevertheless, details about this process and implementation can be found here: https://medium.com/#Synced/how-to-train-a-very-large-and-deep-model-on-one-gpu-7b7edfe2d072.

Why mxnet's GPU version cost more memory than CPU version?

I made a very simple network using mxnet(two fc layers with dim of 512).
By changing the ctx = mx.cpu() or ctx = mx.gpu(0), I run the same code on both CPU and GPU.
The memory cost of GPU is much bigger than CPU version.(I checked that using 'top' instead of 'nvidia-smi').
It seems strange, as the GPU version also has memory on GPU already, why GPU still need more space on memory?
(line 1 is CPU program / line 2 is GPU program)
It may be due to differences in how much time each process was running.
Looking at your screenshot, CPU process has 5:48.85 while GPU has 9:11.20 - so the GPU training was running almost double the time which could be the reason.
When running on GPU you are loading a bunch of different lower-level libraries in memory (CUDA, CUDnn, etc) which are allocated first in your RAM. If your network is very small like in your current case, the overhead of loading the libraries in RAM will be higher than the cost of storing the network weights in RAM.
For any more sizable network, when running on CPU the amount of memory used by the weights will be significantly larger than the libraries loaded in memory.

Google Compute Engine CPU usage discrepancy in Dashboard and VM

In Google Cloud, I have an autscaling setup which scales depending on CPU usage. Even if no processes are consuming CPU on the VM, it is maxed out in the dashboard. Basically, the dashboard shows a much larger CPU usage than what the VM actually is using. This is triggering a lot of instance start-ups. Unnecessarily.
This is the CPU usage graph from Google Compute Engine dashboard. This is the usage for a single instance and not for the overall group.
Google Compute Engine CPU usage from dashboard
And, this is the CPU usage stats for all the CPUs from the VM. This has hardly any consumption, and most of the time is spent idle.
CPU usage from the VM
Is it possible that these VMs perform some heavy I/O operations? (using persistent disk, network communication). If so, this could be virtualization overhead of VMs, which is not visible to the guest operating system.
Look at your "CPU Usage from the VM" screenshot, the very top line, right side:
load average: 0.61, 0.94, 0.86
This is your CPU usage average over 1, 5, 15 minutes. See here for more details.
So you can see that there is something going on, and the Google Compute Engine CPU graph isn't wrong. It looks like it's graphing your 1 minute average.

Is there any way or even possible to get the overall utilization of a GPU during a period of time?

I am trying to get the information about the overall utilization of a GPU (mine is an NVIDIA Tesla K20, running on Linux) during a period of time.
By "overall" I mean something like, how many streaming multi-processors are scheduled to run, and how many GPU cores are scheduled to run (I suppose if a core is running, it will run at its full speed/frequency?). It would be also nice if I can get the overall utilization measured by flops.
Of course before asking the question here, I've searched and investigated several existing tools/libraries, including NVML (and nvidia-smi built on top of it), CUPTI (and nvprof), PAPI, TAU, and Vampir. However, it seems (but I am not sure yet) none of them could provide me with the needed information. E.g., NVML can report "GPU Utilization" by percent, but according to its document/comment, this utilization is "Percent of time over the past second during which one or more kernels was executing on the GPU", which is apparently not accurate enough. For nvprof, it can report flops for individual kernel (with very high overhead), but I still don't know how well the GPU is utilized.
PAPI seems to be able to get instruction count, but it cannot different float point operation from others. I haven't tried other two tools (TAU and Vampir) yet, but I doubt they can meet my need.
So I am wondering is it even possible to get the overall utilization information of a GPU? If not, what is the best alternative to estimate it? The purpose I am doing this is to find a better schedule for multiple jobs running on top of GPU.
I am not sure if I've described my question clearly enough, so please let me know if there is anything I can add for a better description.
Thank you very much!
nVidia Nsight plugin to Visual Studio has very nice graphical features that give the statistics you want. But I have the feeling that you have a Linux machine so Nsight won't work.
I suggest using nVidia Visual Profiler.
The metrics reference is fairly complete and can be found here. This is how I would gather the data you are interested in:
Active SMX units - look at sm_efficiency. It should be close to 100%. If it's lower, then some of the SMX units are not active.
Active cores / SMX - This depends. K20 has a Quad-warp scheduler with dual instruction issue. A warp fires 32 SM cores. K20 has 192 SP cores and 64 DP cores. You need to look at ipc metric (instructions per cycle). If your program is DP and IPC is 2 then you have 100% utilization (for the entire workload execution). That means that 2 warps scheduled instructions so all your 64 DP cores were active during all the cycles. If your program is SP, your IPC theoretically should be 6. However in practice this is very hard to get. An IPC of 6, means that 3 of the schedulers launched 2 warps each, and gave work to 3 x 2 x 32 = 192 SP cores.
FLOPS - Well, if your program uses floating point operations, then I would look to flop_count_sp and divide it by the elapsed seconds.
Regarding frequency, I wouldn't worry but it doesn't harm to check with nvidia-smi. If your card has enough cooling then it will stay at peak frequency while running.
Check the metrics reference as it will provide you much more useful information.
I think NVprof also supports multiple processes. Check here. You can also filter by process ID. So you can collect these metrics "multi-context" or "single-context". In the metrics reference table, you have a column that states if they can be collected in both the cases.
Note: The metrics are computed using the HW performance counters, and driver level analysis. If nvidia tools cannot provide more than this, then it's not likely that other tools will be able to offer more. But I think that properly combining the metrics can tell you everything you want about your app run.

GridEngine on single machine: How can I limit cores for each job?

I have a single machine with 32 cores (2 processors), and 32G RAM. I installed gridengine to submit jobs to those queues I created. But it seems jobs are running on all cores.
I wonder if there is way to limit cores and RAMs for each job. For example I have two queues: parallel.q and serial.q, so that I allocate 20G RAMS and 20 cores to serial.q but I want each job only use one core and maximum 1G RAMs, and 8G RAMs + 8 cores to a single parallel job. All 4 cores and 4G rams left for other usage.
How can I config my queue or gridengine to get the setting right? I tried to read the manual, but don't have a clue.
Thanks!
I don't have problem with parallel jobs. I have some serial jobs will call several different programs somehow the system will assign them all cores available. But I don't want all cores be used for jobs rather for example only two cores available for each job.(Each job has several programs run sequentially, in which case systems allocate each program a core). BTW, I would like have some idle cores all the time to process other jobs, like processing data. Is it possible or necessary?
In fact, if I understand well, you want to partition a single machine with several sub-queues, is that right?
This may be problematic with SGE because the host configuration allows you to set the number of CPU available on a given node. Than you create your queues and assign different hosts to different queues.
In your case, you shoud assign the same host to one master queue, and then add subordinate queues that can use only a given MAX_SLOTS slots.
But if I may ask one question: why should you partition it? If you set up only one queue and configure some parallel environment then you can just submit your jobs using qsub -pe <parallelEnvironment> <NSLOTS> and the grid engine takes care of everything. I suggest you setup at least an OpenMP parallel environment, because you won't probably need MPI on a shared memory machine like yours (it seems a great machine BTW).
Another thing is that you must be able to configure your model run so that the code that you are using can be used with a limited number of CPU; this is very important. In practice you must assign the same number of CPUs to the simulation code than to the SGE. This information is contained in the $NSLOTS variable of your qsub-script.