Overload due to Apache process and MySQL process - mysql

I have a small site that only 20 suppliers access to run queries. The server is under high load during peak hours. Here is the output:
top - 10:15:42 up 32 days, 20:08, 4 users, load average: 2.20, 2.06, 1.94
Tasks: 500 total, 1 running, 498 sleeping, 0 stopped, 1 zombie
Cpu(s): 7.1%us, 2.3%sy, 0.0%ni, 90.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32931056k total, 3124852k used, 29806204k free, 49508k buffers
Swap: 3999740k total, 0k used, 3999740k free, 1364836k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10130 mysql 20 0 6207m 567m 5468 S 232 1.8 14306:04 mysqld
27534 worldsto 20 0 307m 20m 5364 S 5 0.1 0:01.97 apache2
29237 worldsto 20 0 299m 12m 3696 S 2 0.0 0:00.07 apache2
29003 worldsto 20 0 299m 13m 3716 S 1 0.0 0:00.12 apache2
root@server70:~# ps -ef | grep apache | wc
434 2368 17756
CPU(s): 24
RAM size: 32 GB
From what I have seen in the Apache logs, all the connections come from supplier and company IP addresses. I am sure something is wrong with the Apache processes that makes MySQL use this much CPU.
Can someone please help me identify and fix this problem? Thanks.

The best troubleshooting step is this:
Connect to your MySQL server and run:
SHOW FULL PROCESSLIST
That shows every query that is currently running. You will probably see the same query showing up multiple times, perhaps with different IDs, something like:
SELECT * FROM foo WHERE fooid='1'
SELECT * FROM foo WHERE fooid='2'
...etc...
If so, the column being filtered on ('fooid' here) probably needs an index.
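When the process list is long, grouping the queries by shape makes the repeats easier to spot. Here is a minimal sketch, assuming you have already copied the query text from SHOW FULL PROCESSLIST into a list of strings. The names normalize and count_query_shapes are illustrative, and the 'bar' query is made up for contrast:

```python
import re
from collections import Counter


def normalize(query):
    """Replace literals with '?' so queries that differ only in values group together."""
    q = re.sub(r"'[^']*'", "?", query)   # quoted string literals
    q = re.sub(r"\b\d+\b", "?", q)       # bare numeric literals
    return re.sub(r"\s+", " ", q).strip()


def count_query_shapes(queries):
    """Count how often each normalized query shape occurs."""
    return Counter(normalize(q) for q in queries)


# The first two queries come from the example above; the third is hypothetical.
sample = [
    "SELECT * FROM foo WHERE fooid='1'",
    "SELECT * FROM foo WHERE fooid='2'",
    "SELECT * FROM bar WHERE name = 'x'",
]

for shape, n in count_query_shapes(sample).most_common():
    print(n, shape)
```

A shape with a high count and a simple WHERE clause is a reasonable first candidate to check with EXPLAIN.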

Related

WordPress Database Keeps crashing frequently

I have a website that is using WordPress + WooCommerce to manage an e-commerce website. Right now we are using a plugin called: "WP All Import" with the WooCommerce Add-on to Import from a .CSV file all the product data (SKU, Title, Description, Price, Image link, etc).
The problem is that when we run the import, it frequently crashes with an error message.
This is the error that keeps showing
We asked our host, and they answered with the following:
"
We are sorry for the server issues. The website requests are causing the MariaDB to allocate all CPU resources and making the server restart to kill the processes
Nov 25 13:29:35 server mysqld: 2020-11-25 13:29:35 140531759913152 [Note] /usr/sbin/mysqld (mysqld 10.2.36-MariaDB) starting as process 22045 ...
Nov 25 13:29:35 server mysqld: 2020-11-25 13:29:35 140531759913152 [Warning] Could not increase number of max_open_files to more than 524288 (request: 524423)
top - 13:33:21 up 1:02, 2 users, load average: 9.48, 10.41, 10.28
Tasks: 145 total, 24 running, 119 sleeping, 0 stopped, 2 zombie
%Cpu(s): 85.3 us, 14.5 sy, 0.0 ni, 0.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 4194304 total, 1292680 free, 2622632 used, 278992 buff/cache
KiB Swap: 524288 total, 338664 free, 185624 used. 1390602 avail Mem
"
Tools like GTmetrix also show that our performance is not where it needs to be: we get an "E" score, and the most important recommendation is to reduce the number of DOM elements (currently approx. 2100).
Thanks in advance

502 Proxy Error from OpenShift DIY project

On my OpenShift account I have set up Tomcat 8 and JDK 8 on a DIY application, with the MySQL and phpMyAdmin cartridges installed.
My war file points to everything correctly and there are no errors on startup in any of the logs. However, when I try to go to my OpenShift URL I receive this 502 Proxy Error in the browser. I'm using Chrome.
The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request GET /.
What could be causing this problem?
@Graham Where's the fun in that? So I'm going to share my experience, in case anybody else gets here. I think in my instance I was hitting the upper limit of authorized CPU / memory usage for my 'free' gear. Nothing really jumps out and yells "You hit the limit", but it was pretty clear something was wrong. I'm pretty happy with the results, glad I stuck it out. I've learned a whole lot about deployment to an online server with meager $$ resources.
General troubleshooting instructions start here.
First, I shut down the server hard with rhc app-force-stop <app_name>. After that I was able to start up the system again and it would work fine. In my case I was trying to do too much with the size of server I was paying for (free!). The free server includes 512 MB RAM and 1 GB storage. I was trying to run Node, a MongoDB and a Cron cartridge in there. Additionally I had a whole lot of asynchronous input/output with quite a large stack built up. In hindsight, not clever.
Error detection wasn't real easy. I didn't learn anything at all from the log files. Generally when something went wrong they just stopped recording anything at all.
There are 11 tests to do. First, log in to the server via SSH with your command-line tool. Note, there is no magic "you screwed up here" message. You've got to look at your usage and compare it to your authorized usage levels. So yeah, this took me a while, but I documented this for my own notes. Here's a good place to share with others. I've learned a whole lot with this exercise. Good luck. (Oh, and in my case, I deleted the cron cartridge and the mongodb cartridge. I'm hosting the DB at mlab.com, where it's accessible from my other projects. Success for me.)
1) Memory Fail Counts: (results should be zero...)
oo-cgroup-read memory.failcnt // my results --> 160031
oo-cgroup-read memory.memsw.failcnt // my results --> 8572
2) Check disk Quotas
[xyz-abc.rhcloud.com 5xxx3]\> quota -s
Disk quotas for user 5xxx3 (uid 3488):
Filesystem blocks quota limit grace files quota limit grace
/dev/mapper/EBSStore01-user_home01
608M 0 1024M 12664 0 80000
3) Check your actual disk usage. (du = disk usage; -s sums each directory; -h prints human-readable sizes: byte, kilobyte, megabyte, gigabyte, terabyte and petabyte.)
du -sh ~
du: cannot read directory `/var/lib/openshift/5xxx3/.tmp': Permission denied
du: cannot read directory `/var/lib/openshift/5xxx3/.sandbox': Permission denied
du: cannot read directory `/var/lib/openshift/5xxx3/.ssh': Permission denied
du: cannot read directory `/var/lib/openshift/5xxx3/.gearstats': Permission denied
607M /var/lib/openshift/5xxx3/
4) List open files. (lsof, meaning "list open files", is used on many Unix-like systems to report all open files and the processes that opened them. -n: do not resolve hostnames (no DNS). -P: do not resolve port names (list port numbers instead).)
lsof -n -P
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
mongod 471639 3488 11u IPv4 423798423 0t0 TCP 127.x.y.z:27017 (LISTEN)
node 475151 3488 10u IPv4 423815802 0t0 TCP 127.x.y.z:8080 (LISTEN)
5) Display the top CPU-intensive processes. (top provides frequently refreshed information about the most CPU-intensive processes currently running. You do not need to include a - before options. -b: run in batch mode; don't accept command-line input. Useful for sending output to another command or to a file. -n num: update the display num times, then exit.)
top -b -n 1
top - 00:48:37 up 13 days, 23:52, 0 users, load average: 2.91, 2.27, 2.09
Tasks: 13 total, 1 running, 12 sleeping, 0 stopped, 0 zombie
Cpu(s): 11.6%us, 10.0%sy, 0.1%ni, 77.5%id, 0.5%wa, 0.0%hi, 0.2%si, 0.1%st
Mem: 15297608k total, 14537912k used, 759696k free, 36456k buffers
Swap: 52428792k total, 16372136k used, 36056656k free, 2720680k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
60898 3488 20 0 12800 968 744 R 1.9 0.0 0:00.02 top
55776 3488 20 0 106m 2740 808 S 0.0 0.0 0:00.00 sshd
55779 3488 20 0 104m 2260 1432 S 0.0 0.0 0:00.09 bash
432471 3488 20 0 106m 888 884 S 0.0 0.0 0:00.00 sshd
432475 3488 20 0 55144 1540 1536 S 0.0 0.0 0:00.11 sftp-server
471611 3488 20 0 9508 412 404 S 0.0 0.0 0:00.00 control
471612 3488 20 0 181m 2152 1720 S 0.0 0.0 0:00.01 logshifter
471624 3488 20 0 4072 456 448 S 0.0 0.0 0:00.00 scl
471625 3488 20 0 9236 812 808 S 0.0 0.0 0:00.00 bash
471639 3488 20 0 373m 14m 13m S 0.0 0.1 0:03.53 mongod
475123 3488 20 0 778m 5264 5172 S 0.0 0.0 0:00.08 node
475124 3488 20 0 117m 2148 1708 S 0.0 0.0 0:00.00 logshifter
475151 3488 20 0 863m 114m 6776 S 0.0 0.8 0:04.10 node
6) Review memory usage. (free displays statistics about memory usage: total, free, used, physical, swap, shared, and buffers used by the kernel. Options: -b reports memory in bytes; -k (the default) in kilobytes; -m in megabytes.)
free
total used free shared buffers cached
Mem: 15297608 14767896 529712 766468 36484 2746820
-/+ buffers/cache: 11984592 3313016
Swap: 52428792 16334312 36094480
This is where I've gone astray. There is still a tiny bit of free space, but it doesn't take much to see that as soon as I do intensive I/O, I'm going to go south fast here. When that happened I didn't see any error log / messages at all. Things just stopped working.
7) Check your sockets. (ss - socket statistics. The output will contain all tcp, udp and unix socket connection details. )
ss
State Recv-Q Send-Q Local Address:Port Peer Address:Port
(in this case there are no open sockets.. the line above is just the column headers..)
8) Check vmstat. (vmstat gives summary information on memory, processes, paging, etc. free: amount of free/idle memory. si: amount swapped in from disk per second, in kilobytes. so: amount swapped out to disk per second, in kilobytes.)
vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 16248996 425248 33476 2946912 88 90 321 247 4 3 12 10 78 0 0
9) Check I/O stats. (iostat – Central Processing Unit (CPU) statistics and input/output statistics for devices and partitions.)
iostat
Linux 2.6.32-573.12.1.el6.x86_64 (ex-std-node842.prod.rhcloud.com) 03/14/2016 _x86_64_ (4 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
11.60 0.12 10.21 0.49 0.06 77.52
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
xvda 6.56 197.23 147.83 238703267 178916836
xvdf 15.08 337.29 347.44 408209376 420504392
xvdg 15.13 337.45 347.44 408413143 420502512
xvdp 65.18 1603.17 1060.59 1940282568 1283607613
dm-0 7.97 108.87 33.25 131768290 40238544
dm-1 70.00 1574.18 1060.36 1905191416 1283329611
dm-2 3.48 87.89 114.58 106366791 138678084
10) Check processor statistics. (mpstat reports processor-related statistics.)
mpstat
Linux 2.6.32-573.12.1.el6.x86_64 (ex-std-node842.prod.rhcloud.com) 03/14/2016 _x86_64_ (4 CPU)
01:10:59 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
01:10:59 AM all 11.60 0.12 10.01 0.49 0.00 0.21 0.06 0.00 77.52
11) User Limits (ulimit User limits - limit the use of system-wide resources. -a All current limits are reported. )
ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 59663
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 350
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
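The memory check in step 6 is the one that mattered most here, and it is easy to misread because the "used" figure on the Mem: row includes buffers and cache. The sketch below is a minimal parser for the older free layout shown above (the one with a "-/+ buffers/cache" row), fed with the figures from step 6. The function name parse_free_legacy is illustrative, and newer versions of free print different columns, so treat this as an assumption about the format:

```python
def parse_free_legacy(text):
    """Parse old-style `free` output and compute memory used excluding buffers/cache."""
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0] == "Mem:":
            total, used, free, shared, buffers, cached = (int(p) for p in parts[1:7])
            app_used = used - buffers - cached  # same value as the '-/+ buffers/cache' row
            return {
                "total_kb": total,
                "app_used_kb": app_used,
                "app_used_fraction": app_used / total,
            }
    raise ValueError("no 'Mem:' row found")


# Figures copied from the `free` output in step 6.
FREE_OUTPUT = """\
             total       used       free     shared    buffers     cached
Mem:      15297608   14767896     529712     766468      36484    2746820
-/+ buffers/cache:   11984592    3313016
Swap:     52428792   16334312   36094480
"""

stats = parse_free_legacy(FREE_OUTPUT)
print(stats["app_used_kb"], "kB used by applications,",
      round(stats["app_used_fraction"] * 100, 1), "% of total")
```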

Is it possible to check how many files a particular query opens in MySQL?

I have a large open-files limit in MySQL.
I have set open_files_limit to 150000, but MySQL still uses almost 80% of it.
Also, I have low traffic, with max concurrent connections around 30, and no query has more than 4 joins.
The files opened by the server are visible in the performance_schema.
See table performance_schema.file_instances.
http://dev.mysql.com/doc/refman/5.5/en/file-instances-table.html
As for tracing which query opens which file, it does not work that way, due to caching in the server itself (table cache, table definition cache).
MySQL shouldn't open that many files, unless you have set a ludicrously large value for the table_cache parameter (the default is 64, the maximum is 512K).
You can reduce the number of open files by issuing the FLUSH TABLES command.
Otherwise, the appropriate value of table_cache can be roughly estimated (on Linux) by running strace -c against all mysqld threads. You get something like:
# strace -f -c -p $( pidof mysqld )
Process 13598 attached with 22 threads
[ ...pause while it gathers information... ]
^C
Process 13598 detached
...
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
58.82 0.040000 51 780 io_getevents
29.41 0.020000 105 191 93 futex
11.76 0.008000 103 78 select
0.00 0.000000 0 72 stat
0.00 0.000000 0 20 lstat
0.00 0.000000 0 16 lseek
0.00 0.000000 0 16 read
0.00 0.000000 0 9 3 open
0.00 0.000000 0 5 close
0.00 0.000000 0 6 poll
...
------ ----------- ----------- --------- --------- ----------------
...and see whether there's a reasonable difference in impact in open() and close() calls; those are the calls which table_cache affects, and that influence how many open files there are at any given point.
If the impact of open() is negligible, then by all means reduce table_cache. It is mostly needed on slow I/O subsystems (IOSS), and there aren't many of those left around.
If you're running on Windows, you'll have to try and use ProcMon by SysInternals, or some similar tool.
Once you have table_cache at a manageable level, a query that currently opens too many files will simply close and re-open many of those same files. You may notice a performance impact, which in all likelihood will be negligible. Chances are that a smaller table cache might actually get you results faster, as fetching an item from a modern, fast IOSS cache may well be faster than searching for it in a really large cache.
If you're into optimizing your server, you may want to look at this article too. The take-away is that as caches go, larger is not always better (it also applies to indexing).
Inspecting a specific query on Linux
On Linux you can use strace (see above) and verify what files are opened and how:
$ sudo strace -f -p $( pidof mysqld ) 2>&1 | grep 'open("'
Meanwhile from a different terminal I run a query, and:
[pid 8894] open("./ecm/db.opt", O_RDONLY) = 39
[pid 8894] open("./ecm/prof2_people.frm", O_RDONLY) = 39
[pid 8894] open("./ecm/prof2_discip.frm", O_RDONLY) = 39
[pid 8894] open("./ecm/prof2_discip.ibd", O_RDONLY) = 19
[pid 8894] open("./ecm/prof2_discip.ibd", O_RDWR) = 19
[pid 8894] open("./ecm/prof2_people.ibd", O_RDONLY) = 20
[pid 8894] open("./ecm/prof2_people.ibd", O_RDWR) = 20
[pid 8894] open("/proc/sys/vm/overcommit_memory", O_RDONLY|O_CLOEXEC) = 39
...these are the files that the query used (be sure to run the query on a "cold-started" MySQL so that caching does not hide them). The highest file handle assigned was 39, so at no point were there more than 40 open files.
The same files can be checked from /proc/$PID/fd or from MySQL:
select * from performance_schema.file_instances where open_count > 1;
but the count from MySQL is slightly lower, since it does not take into account socket descriptors, log files, and temporary files.
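The /proc/$PID/fd check mentioned above can be scripted. This is a minimal, Linux-only sketch: each entry in that directory is one open descriptor, so counting the entries gives the number of descriptors the process currently holds. The function name count_open_fds and the proc_root parameter are illustrative, and reading another user's process (such as mysqld) normally requires root:

```python
import os


def count_open_fds(pid="self", proc_root="/proc"):
    """Count entries in <proc_root>/<pid>/fd; each entry is one open descriptor."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


if __name__ == "__main__":
    try:
        # 'self' inspects this Python process; pass the mysqld PID instead.
        print("open descriptors:", count_open_fds())
    except OSError as exc:
        print("could not read the fd directory:", exc)
```

Sampling this before and after a query on a freshly started server gives a rough upper bound on the descriptors that query caused to be opened, including the sockets and log files that performance_schema does not count.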
This would only be possible by adjusting the source code to add logging at that level.
Alternative: run a test using this scenario.
You will have to set up an automated test to make this possible:
Log your queries;
Create a script that preloads your heap with a normal dataset (otherwise you are testing against empty memory), and take a snapshot of the number of open tables;
Run every query and take a snapshot of open tables. (In retrospect) I think you could do this without restarting MySQL every time, so just run every query and record the results. Debugging is tedious work: not impossible, just really tedious.
Personally I would start differently:
Install cacti and percona cacti plugin
Register a week of normal workload
Then hunt down high load queries (slow log > 0.1 second, run through a script to find repeating queries).
Another week monitoring
Then hunt down additional queries with a high repeat count. This is often inefficient code firing a high number of queries where fewer could be used (like retrieving the keys and then fetching the values for each key one by one; this happens a lot when programmers use an ORM).

Is the mytop Perl monitor for MySQL no longer valid in terms of Queries/Questions since mysqld 5.1.63?

We have Jeremy Zawodny's Perl mytop, version 2009-04-06, installed via apt-get on Debian Squeeze with mysqld version 5.1.63. Re-running "apt-get install mytop" reports that no newer version of mytop is available.
This version of mytop seems outdated, as it systematically reports a very low number of queries executed by mysqld. It uses this status query to get the total number of queries since startup:
SHOW STATUS LIKE 'Questions';
This gives a misleading result on the newer mysqld. To get the total number of queries since startup, the newer server reports them under 'Queries' instead of 'Questions':
SHOW STATUS LIKE 'Queries';
You can see the enormous difference between the two variables with:
mysql> SHOW STATUS LIKE 'Que%';
+---------------+--------+
| Variable_name | Value |
+---------------+--------+
| Queries | 135903 |
| Questions | 160 |
+---------------+--------+
2 rows in set (0.00 sec)
which returns both the 'Queries' and 'Questions' values.
mytop -uJohnDoe2 -ppassword
Here is the original mytop output:
MySQL on localhost (5.1.63-0+squeeze1-log) up 0+01:42:35 [13:36:44]
Queries: 265.0 qps: 0 Slow: 0.0 Se/In/Up/De(%): 14760/00/00/00
qps now: 0 Slow qps: 0.0 Threads: 5 ( 1/ 5) 1500/00/00/00
Key Efficiency: 100.0% Bps in/out: 0.9/173.8 Now in/out: 8.3/ 1.5k
Id User Host/IP DB Time Cmd Query or State
-- ---- ------- -- ---- --- --------------
28019 root localhost 0 Query show full processlist
....
I copied mytop to mytop.pl, replaced the string "Questions" with "Queries" in the Perl source, and ran
mytop.pl -uJohnDoe2 -ppassword
And the modified mytop.pl gives more realistic monitoring:
MySQL on localhost (5.1.63-0+squeeze1-log) up 0+01:42:23 [13:36:32]
Queries: 136.1k qps: 23 Slow: 0.0 Se/In/Up/De(%): 28/00/00/00
qps now: 18 Slow qps: 0.0 Threads: 5 ( 1/ 5) 27/00/00/00
Key Efficiency: 100.0% Bps in/out: 0.1/ 18.3 Now in/out: 8.4/ 1.5k
Id User Host/IP DB Time Cmd Query or State
-- ---- ------- -- ---- --- --------------
30789 root localhost 0 Query show full processlist
....
Have you observed this problem on your system? That is,
is the Perl monitor for MySQL now invalid in terms of Queries/Questions since mysqld 5.1.63?
ADDED:
After reading the answer of Shlomi Noach, I added this link for modified Perl script file:
mytop.pl.
I have indeed noticed this change, and blogged about it: questions or queries?
Apparently this came as a surprise to the developers of other monitoring tools (Innotop, MonYOG).
With regard to your case, you have two very simple options:
Switch to innotop instead
Modify the source code for mytop and replace Questions with Queries.
The change was made in 5.1.31; in the aforementioned post you can also read a comment by an Oracle employee.
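To see how much the choice of counter matters, you can divide each counter by the uptime. The sketch below uses the values from the question ('Queries' 135903, 'Questions' 160, uptime 0+01:42:23). It is only the plain average over the whole uptime, not necessarily the exact arithmetic mytop performs, and the function names are illustrative:

```python
def uptime_to_seconds(uptime):
    """Convert an uptime string in mytop's 'D+HH:MM:SS' style to seconds."""
    days, clock = uptime.split("+")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return int(days) * 86400 + hours * 3600 + minutes * 60 + seconds


def average_qps(counter, uptime):
    """Average statements per second over the whole uptime."""
    return counter / uptime_to_seconds(uptime)


uptime = "0+01:42:23"  # taken from the modified mytop header in the question
print("Queries:  ", round(average_qps(135903, uptime), 2), "per second")
print("Questions:", round(average_qps(160, uptime), 2), "per second")
```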

A top-like utility for monitoring CUDA activity on a GPU [closed]

I'm trying to monitor a process that uses CUDA and MPI. Is there any way I could do this, something like the command "top" but that monitors the GPU too?
To get real-time insight on used resources, do:
nvidia-smi -l 1
This loops and refreshes the view every second.
If you do not want to keep past traces of the looped call in the console history, you can also do:
watch -n0.1 nvidia-smi
Where 0.1 is the time interval, in seconds.
I find gpustat very useful. It can be installed with pip install gpustat, and prints breakdown of usage by processes or users.
I'm not aware of anything that combines this information, but you can use the nvidia-smi tool to get the raw data, like so (thanks to @jmsu for the tip on -l):
$ nvidia-smi -q -g 0 -d UTILIZATION -l
==============NVSMI LOG==============
Timestamp : Tue Nov 22 11:50:05 2011
Driver Version : 275.19
Attached GPUs : 2
GPU 0:1:0
Utilization
Gpu : 0 %
Memory : 0 %
Recently, I have written a monitoring tool called nvitop, the interactive NVIDIA-GPU process viewer.
It is written in pure Python and is easy to install.
Install from PyPI:
pip3 install --upgrade nvitop
Install the latest version from GitHub (recommended):
pip3 install git+https://github.com/XuehaiPan/nvitop.git#egg=nvitop
Run as a resource monitor:
nvitop -m
nvitop will show the GPU status like nvidia-smi but with additional fancy bars and history graphs.
For the processes, it will use psutil to collect process information and display the USER, %CPU, %MEM, TIME and COMMAND fields, which is much more detailed than nvidia-smi. Besides, it is responsive for user inputs in monitor mode. You can interrupt or kill your processes on the GPUs.
nvitop comes with a tree-view screen and an environment screen:
In addition, nvitop can be integrated into other applications. For example, integrate into PyTorch training code:
import os
from nvitop.core import host, CudaDevice, HostProcess, GpuProcess
from torch.utils.tensorboard import SummaryWriter
device = CudaDevice(0)
this_process = GpuProcess(os.getpid(), device)
writer = SummaryWriter()
for epoch in range(n_epochs):  # n_epochs comes from your own training setup
    # some training code here
    # ...
    this_process.update_gpu_status()
    writer.add_scalars(
        'monitoring',
        {
            'device/memory_used': float(device.memory_used()) / (1 << 20),  # convert bytes to MiBs
            'device/memory_percent': device.memory_percent(),
            'device/memory_utilization': device.memory_utilization(),
            'device/gpu_utilization': device.gpu_utilization(),
            'host/cpu_percent': host.cpu_percent(),
            'host/memory_percent': host.virtual_memory().percent,
            'process/cpu_percent': this_process.cpu_percent(),
            'process/memory_percent': this_process.memory_percent(),
            'process/used_gpu_memory': float(this_process.gpu_memory()) / (1 << 20),  # convert bytes to MiBs
            'process/gpu_sm_utilization': this_process.gpu_sm_utilization(),
            'process/gpu_memory_utilization': this_process.gpu_memory_utilization(),
        },
        global_step  # step counter maintained by your training loop
    )
See https://github.com/XuehaiPan/nvitop for more details.
Note: nvitop is dual-licensed by the GPLv3 License and Apache-2.0 License. Please feel free to use it as a dependency for your own projects. See Copyright Notice for more details.
Just use watch nvidia-smi; by default it refreshes the output every 2 seconds.
You can also use watch -n 5 nvidia-smi (-n 5 sets a 5-second interval).
Use the "--query-compute-apps=" argument:
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
For further help, run:
nvidia-smi --help-query-compute-apps
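The CSV output of that query is convenient to post-process. The sketch below parses text in the shape I would expect from --format=csv (a header row, then one row per process, with memory printed as a number followed by a unit). The header text and the sample row are assumptions for illustration, so check them against what your driver version actually prints. The function name parse_compute_apps is illustrative:

```python
import csv
import io


def parse_compute_apps(csv_text):
    """Parse CSV text with columns pid, process_name, used memory into dicts."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    rows = [row for row in reader if row]
    apps = []
    for pid, name, mem in rows[1:]:  # rows[0] is the header row
        fields = mem.split()
        value = fields[0] if fields else ""
        apps.append({
            "pid": int(pid),
            "process_name": name,
            # Anything that is not a plain number (e.g. an N/A marker) becomes None.
            "used_memory_mib": int(value) if value.isdigit() else None,
        })
    return apps


# Hypothetical sample; the real header wording may differ between driver versions.
SAMPLE = """\
pid, process_name, used_gpu_memory [MiB]
7336, ./align, 61 MiB
"""

for app in parse_compute_apps(SAMPLE):
    print(app)
```

In practice the text would come from running the nvidia-smi command above, for example through subprocess.run with capture_output=True and text=True.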
You can try nvtop, which is similar to the widely used htop tool but for NVIDIA GPUs.
Download and install the latest stable CUDA driver (4.2) from here. On Linux, nvidia-smi 295.41 gives you just what you want. Use nvidia-smi:
[root@localhost release]# nvidia-smi
Wed Sep 26 23:16:16 2012
+------------------------------------------------------+
| NVIDIA-SMI 3.295.41 Driver Version: 295.41 |
|-------------------------------+----------------------+----------------------+
| Nb. Name | Bus Id Disp. | Volatile ECC SB / DB |
| Fan Temp Power Usage /Cap | Memory Usage | GPU Util. Compute M. |
|===============================+======================+======================|
| 0. Tesla C2050 | 0000:05:00.0 On | 0 0 |
| 30% 62 C P0 N/A / N/A | 3% 70MB / 2687MB | 44% Default |
|-------------------------------+----------------------+----------------------|
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0. 7336 ./align 61MB |
+-----------------------------------------------------------------------------+
EDIT: In latest NVIDIA drivers, this support is limited to Tesla Cards.
Another useful monitoring approach is to use ps filtered on processes that consume your GPUs. I use this one a lot:
ps f -o user,pgrp,pid,pcpu,pmem,start,time,command -p `lsof -n -w -t /dev/nvidia*`
That'll show all nvidia GPU-utilizing processes and some stats about them. lsof ... retrieves a list of all processes using an nvidia GPU owned by the current user, and ps -p ... shows ps results for those processes. ps f shows nice formatting for child/parent process relationships / hierarchies, and -o specifies a custom formatting. That one is similar to just doing ps u but adds the process group ID and removes some other fields.
One advantage of this over nvidia-smi is that it'll show process forks as well as main processes that use the GPU.
One disadvantage, though, is it's limited to processes owned by the user that executes the command. To open it up to all processes owned by any user, I add a sudo before the lsof.
Lastly, I combine it with watch to get a continuous update. So, in the end, it looks like:
watch -n 0.1 'ps f -o user,pgrp,pid,pcpu,pmem,start,time,command -p `sudo lsof -n -w -t /dev/nvidia*`'
Which has output like:
Every 0.1s: ps f -o user,pgrp,pid,pcpu,pmem,start,time,command -p `sudo lsof -n -w -t /dev/nvi... Mon Jun 6 14:03:20 2016
USER PGRP PID %CPU %MEM STARTED TIME COMMAND
grisait+ 27294 50934 0.0 0.1 Jun 02 00:01:40 /opt/google/chrome/chrome --type=gpu-process --channel=50877.0.2015482623
grisait+ 27294 50941 0.0 0.0 Jun 02 00:00:00 \_ /opt/google/chrome/chrome --type=gpu-broker
grisait+ 53596 53596 36.6 1.1 13:47:06 00:05:57 python -u process_examples.py
grisait+ 53596 33428 6.9 0.5 14:02:09 00:00:04 \_ python -u process_examples.py
grisait+ 53596 33773 7.5 0.5 14:02:19 00:00:04 \_ python -u process_examples.py
grisait+ 53596 34174 5.0 0.5 14:02:30 00:00:02 \_ python -u process_examples.py
grisait+ 28205 28205 905 1.5 13:30:39 04:56:09 python -u train.py
grisait+ 28205 28387 5.8 0.4 13:30:49 00:01:53 \_ python -u train.py
grisait+ 28205 28388 5.3 0.4 13:30:49 00:01:45 \_ python -u train.py
grisait+ 28205 28389 4.5 0.4 13:30:49 00:01:29 \_ python -u train.py
grisait+ 28205 28390 4.5 0.4 13:30:49 00:01:28 \_ python -u train.py
grisait+ 28205 28391 4.8 0.4 13:30:49 00:01:34 \_ python -u train.py
This may not be elegant, but you can try
while true; do sleep 2; nvidia-smi; done
I also tried the method by @Edric, which works, but I prefer the original layout of nvidia-smi.
You can use the monitoring program glances with its GPU monitoring plug-in:
open source
to install: sudo apt-get install -y python-pip; sudo pip install glances[gpu]
to launch: sudo glances
It also monitors the CPU, disk IO, disk space, network, and a few other things:
In Linux Mint, and most likely Ubuntu, you can try "nvidia-smi --loop=1"
If you just want to find the processes running on the GPU, you can simply use the following command:
lsof /dev/nvidia*
For me, nvidia-smi and watch -n 1 nvidia-smi are enough in most cases. Sometimes nvidia-smi shows no process even though the GPU memory is used up, so I need to use the above command to find the processes.
I created a batch file with the following code in a windows machine to monitor every second. It works for me.
:loop
cls
"C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi"
timeout /T 1
goto loop
The nvidia-smi executable is usually located under "C:\Program Files\NVIDIA Corporation" if you want to run the command only once.
You can use nvidia-smi pmon -i 0 to monitor every process on GPU 0,
including compute mode, SM usage, memory usage, encoder usage and decoder usage.
There is a Prometheus GPU Metrics Exporter (PGME) that leverages the nvidia-smi binary. You may try this out. Once you have the exporter running, you can access it via http://localhost:9101/metrics. For two GPUs, the sample result looks like this:
temperature_gpu{gpu="TITAN X (Pascal)[0]"} 41
utilization_gpu{gpu="TITAN X (Pascal)[0]"} 0
utilization_memory{gpu="TITAN X (Pascal)[0]"} 0
memory_total{gpu="TITAN X (Pascal)[0]"} 12189
memory_free{gpu="TITAN X (Pascal)[0]"} 12189
memory_used{gpu="TITAN X (Pascal)[0]"} 0
temperature_gpu{gpu="TITAN X (Pascal)[1]"} 78
utilization_gpu{gpu="TITAN X (Pascal)[1]"} 95
utilization_memory{gpu="TITAN X (Pascal)[1]"} 59
memory_total{gpu="TITAN X (Pascal)[1]"} 12189
memory_free{gpu="TITAN X (Pascal)[1]"} 1738
memory_used{gpu="TITAN X (Pascal)[1]"} 10451
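If you want to consume these metrics from a script rather than from Prometheus itself, the text format is simple enough to parse by hand. The following is a minimal sketch that handles only the line shape shown above (one gpu label per metric), not the full Prometheus exposition format. The function name parse_gpu_metrics is illustrative, and the sample is a subset of the output above:

```python
import re
from collections import defaultdict

# Matches lines such as: utilization_gpu{gpu="TITAN X (Pascal)[1]"} 95
LINE = re.compile(r'^(\w+)\{gpu="([^"]*)"\}\s+(\S+)$')


def parse_gpu_metrics(text):
    """Return {gpu_label: {metric_name: value}} from exporter text output."""
    metrics = defaultdict(dict)
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            name, gpu, value = match.groups()
            metrics[gpu][name] = float(value)
    return dict(metrics)


SAMPLE = '''\
utilization_gpu{gpu="TITAN X (Pascal)[0]"} 0
memory_total{gpu="TITAN X (Pascal)[0]"} 12189
memory_used{gpu="TITAN X (Pascal)[0]"} 0
utilization_gpu{gpu="TITAN X (Pascal)[1]"} 95
memory_total{gpu="TITAN X (Pascal)[1]"} 12189
memory_used{gpu="TITAN X (Pascal)[1]"} 10451
'''

for gpu, values in parse_gpu_metrics(SAMPLE).items():
    fraction = values["memory_used"] / values["memory_total"]
    print(gpu, "util:", values["utilization_gpu"], "mem used:", round(fraction * 100, 1), "%")
```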
Run nvidia-smi in device monitoring mode, e.g.:
$ nvidia-smi dmon -d 3 -s pcvumt
# gpu pwr gtemp mtemp mclk pclk pviol tviol sm mem enc dec fb bar1 rxpci txpci
# Idx W C C MHz MHz % bool % % % % MB MB MB/s MB/s
0 273 54 - 9501 2025 0 0 100 11 0 0 18943 75 5906 659
0 280 54 - 9501 2025 0 0 100 11 0 0 18943 75 7404 650
0 277 54 - 9501 2025 0 0 100 11 0 0 18943 75 7386 719
0 279 55 - 9501 2025 0 0 99 11 0 0 18945 75 6592 692
0 281 55 - 9501 2025 0 0 99 11 0 0 18945 75 7760 641
0 279 55 - 9501 2025 0 0 99 11 0 0 18945 75 7775 668
0 279 55 - 9501 2025 0 0 100 11 0 0 18947 75 7589 690
0 281 55 - 9501 2025 0 0 99 12 0 0 18947 75 7514 657
0 279 55 - 9501 2025 0 0 100 11 0 0 18947 75 6472 558
0 280 54 - 9501 2025 0 0 100 11 0 0 18947 75 7066 683
Full details are in man nvidia-smi.
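The dmon output is whitespace-separated with a commented header, so it is straightforward to turn into records for logging or averaging. This is a minimal sketch that assumes the layout shown above: the first line starting with # holds the column names, the second holds the units, and a value of - means "not available". The function name parse_dmon is illustrative, and the sample rows are copied from the output above:

```python
def parse_dmon(text):
    """Parse `nvidia-smi dmon` text into a list of {column: int or None} dicts."""
    names, rows = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            if names is None:          # first comment line: column names
                names = line.lstrip("#").split()
            continue                   # later comment lines: units, ignored
        if names is None:
            raise ValueError("data row seen before the header line")
        values = line.split()
        rows.append({n: (None if v == "-" else int(v)) for n, v in zip(names, values)})
    return rows


SAMPLE = """\
# gpu pwr gtemp mtemp mclk pclk pviol tviol sm mem enc dec fb bar1 rxpci txpci
# Idx W C C MHz MHz % bool % % % % MB MB MB/s MB/s
0 273 54 - 9501 2025 0 0 100 11 0 0 18943 75 5906 659
0 280 54 - 9501 2025 0 0 100 11 0 0 18943 75 7404 650
0 277 54 - 9501 2025 0 0 100 11 0 0 18943 75 7386 719
0 279 55 - 9501 2025 0 0 99 11 0 0 18945 75 6592 692
"""

samples = parse_dmon(SAMPLE)
average_sm = sum(row["sm"] for row in samples) / len(samples)
print("average SM utilization:", average_sm, "%")
```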