Resource monitor using Hyperledger Fabric Caliper - Error: socket hang up - containers

The following configuration is used to monitor the resource utilization of a Docker container on a remote machine using Hyperledger Caliper (0.4.2):
test:
  name: sample fabric network
  description: sample Network Benchmark
  workers:
    type: local
    number: 5
  rounds:
    - label: readAsset
      description: Using Chaincode Asset Management contract
      txNumber: 10000
      rateControl:
        type: fixed-load
        opts:
          tps: 200
      workload:
        module: workload/readAsset.js
        arguments:
          contractId: asset-management
monitors:
  resource:
    - module: docker
      options:
        interval: 5
        containers:
          - http://34.233.177.187:7051/peer0.org1.example.com
It fails with Error retrieving remote containers: Error: socket hang up, as shown below, and does not generate any resource-monitoring report. The transactions themselves are submitted to the remote network successfully; only the monitoring of the container peer0.org1.example.com fails. The machine IP and container name above are also defined in the connection profile.
Log from the Caliper console:
2021.02.17-03:46:22.319 info [caliper] [worker-orchestrator] 5 workers prepared, progressing to test phase.
2021.02.17-03:46:22.945 error [caliper] [monitor-docker] Error retrieving remote containers: Error: socket hang up
2021.02.17-03:46:22.947 info [caliper] [round-orchestrator] Monitors successfully started
2021.02.17-03:46:22.948 info [caliper] [worker-message-handler] Worker#0 is starting Round#0
2021.02.17-03:46:22.952 info [caliper] [worker-message-handler] Worker#1 is starting Round#0
2021.02.17-03:46:22.956 info [caliper] [caliper-worker] Worker #1 starting workload loop
2021.02.17-03:46:22.958 info [caliper] [caliper-worker] Worker #0 starting workload loop
2021.02.17-03:46:22.961 info [caliper] [worker-message-handler] Worker#2 is starting Round#0
2021.02.17-03:46:22.967 info [caliper] [worker-message-handler] Worker#3 is starting Round#0
2021.02.17-03:46:22.970 info [caliper] [caliper-worker] Worker #2 starting workload loop
2021.02.17-03:46:22.973 info [caliper] [worker-message-handler] Worker#4 is starting Round#0
2021.02.17-03:46:22.975 info [caliper] [caliper-worker] Worker #3 starting workload loop
2021.02.17-03:46:22.977 info [caliper] [caliper-worker] Worker #4 starting workload loop
2021.02.17-03:46:27.352 info [caliper] [default-observer] [readAsset Round 0 Transaction Info] - Submitted: 15 Succ: 10 Fail:0 Unfinished:5
2021.02.17-03:46:32.322 info [caliper] [default-observer] [readAsset Round 0 Transaction Info] - Submitted: 45 Succ: 30 Fail:0 Unfinished:15
2021.02.17-03:46:37.322 info [caliper] [default-observer] [readAsset Round 0 Transaction Info] - Submitted: 70 Succ: 60 Fail:0 Unfinished:10
2021.02.17-03:46:42.323 info [caliper] [default-observer] [readAsset Round 0 Transaction Info] - Submitted: 89 Succ: 79 Fail:0 Unfinished:10
2021.02.17-03:46:47.323 info [caliper] [default-observer] [readAsset Round 0 Transaction Info] - Submitted: 111 Succ: 103 Fail:0 Unfinished:8
2021.02.17-03:46:52.323 info [caliper] [default-observer] [readAsset Round 0 Transaction Info] - Submitted: 128 Succ: 120 Fail:0 Unfinished:8
2021.02.17-03:46:57.323 info [caliper] [default-observer] [readAsset Round 0 Transaction Info] - Submitted: 153 Succ: 148 Fail:0 Unfinished:5
2021.02.17-03:47:02.323 info [caliper] [default-observer] [readAsset Round 0 Transaction Info] - Submitted: 173 Succ: 165 Fail:0 Unfinished:8
2021.02.17-03:47:07.324 info [caliper] [default-observer] [readAsset Round 0 Transaction Info] - Submitted: 194 Succ: 189 Fail:0 Unfinished:5
2021.02.17-03:47:12.324 info [caliper] [default-observer] [readAsset Round 0 Transaction Info] - Submitted: 214 Succ: 205 Fail:0 Unfinished:9
2021.02.17-03:47:17.325 info [caliper] [default-observer] [readAsset Round 0 Transaction Info] - Submitted: 233 Succ: 223 Fail:0 Unfinished:10
2021.02.17-03:47:22.325 info [caliper] [default-observer] [readAsset Round 0 Transaction Info] - Submitted: 252 Succ: 239 Fail:0 Unfinished:13
2021.02.17-03:47:27.325 info [caliper] [default-observer] [readAsset Round 0 Transaction Info] - Submitted: 269 Succ: 267 Fail:0 Unfinished:2
When monitoring is done from the local machine, it works. For that I used the configuration below, which works successfully for local Docker containers:
monitors:
  resource:
    - module: docker
      options:
        interval: 5
        containers:
          - all
Is there any additional configuration needed to monitor remote container resource utilization in Caliper?
Thank you!

Caliper uses the dockerode#3.1.0 module (https://www.npmjs.com/package/dockerode) to connect to containers; the socket hang up indicates that a connection failed to be established, due to a timeout or something similar.
There are two options available:
1. Debug the dockerode-based connection. Perhaps increasing the log level in Caliper will help here. It would also be relatively easy to craft a small app using dockerode#3.1.0 to check its ability to connect to the passed container URL; if you are able to connect, but the same URL does not work in Caliper, then there is a bug that needs to be fixed.
2. Swap to a Prometheus-based monitor that uses cAdvisor to collect container statistics (you would get Grafana for free here too).

Related

Cannot execute apache2 restart command in ubuntu21.10

After I executed the command: sudo systemctl restart apache2 in the terminal, it prompted an error:
Job for apache2.service failed because the control process exited with error code.
See "systemctl status apache2.service" and "journalctl -xeu apache2.service" for details.
After executing the systemctl status apache2.service command according to the above prompt, I received this prompt:
× apache2.service - The Apache HTTP Server
Loaded: loaded (/lib/systemd/system/apache2.service; enabled; vendor prese>
Active: failed (Result: exit-code) since Mon 2021-12-20 09:50:13 +06; 9min>
Docs: https://httpd.apache.org/docs/2.4/
Process: 14920 ExecStart=/usr/sbin/apachectl start (code=exited, status=1/F>
CPU: 18ms
Dec 20 09:50:13 RTT apachectl[14923]: [Mon Dec 20 09:50:13.739503 2021] [alias:>
Dec 20 09:50:13 RTT apachectl[14923]: AH00558: apache2: Could not reliably dete>
Dec 20 09:50:13 RTT apachectl[14923]: (98)Address already in use: AH00072: make>
Dec 20 09:50:13 RTT apachectl[14923]: (98)Address already in use: AH00072: make>
Dec 20 09:50:13 RTT apachectl[14923]: no listening sockets available, shutting >
Dec 20 09:50:13 RTT apachectl[14923]: AH00015: Unable to open logs
Dec 20 09:50:13 RTT systemd[1]: apache2.service: Failed with result 'exit-code'.
Dec 20 09:50:13 RTT apachectl[14920]: Action 'start' failed.
Dec 20 09:50:13 RTT apachectl[14920]: The Apache error log may have more inform>
Dec 20 09:50:13 RTT systemd[1]: Failed to start The Apache HTTP Server.
After executing the journalctl -xeu apache2.service command according to the above prompt, I received this prompt:
The process' exit code is 'exited' and its exit status is 1.
Dec 20 09:50:13 RTT apachectl[14923]: [Mon Dec 20 09:50:13.739503 2021] [alias:>
Dec 20 09:50:13 RTT apachectl[14923]: AH00558: apache2: Could not reliably dete>
Dec 20 09:50:13 RTT apachectl[14923]: (98)Address already in use: AH00072: make>
Dec 20 09:50:13 RTT apachectl[14923]: (98)Address already in use: AH00072: make>
Dec 20 09:50:13 RTT apachectl[14923]: no listening sockets available, shutting >
Dec 20 09:50:13 RTT apachectl[14923]: AH00015: Unable to open logs
Dec 20 09:50:13 RTT systemd[1]: apache2.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ The unit apache2.service has entered the 'failed' state with result 'exit-co>
Dec 20 09:50:13 RTT apachectl[14920]: Action 'start' failed.
Dec 20 09:50:13 RTT apachectl[14920]: The Apache error log may have more inform>
Dec 20 09:50:13 RTT systemd[1]: Failed to start The Apache HTTP Server.
░░ Subject: A start job for unit apache2.service has failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ A start job for unit apache2.service has finished with a failure.
░░
░░ The job identifier is 3993 and the job result is failed.
It's possible that you already have some other web server running on your machine which is using port 80, because you have the error (98)Address already in use: AH00072.
Please try this:
First, check what is using port 80:
sudo netstat -nap | grep 80
Now use this command to identify the process (in my case "httpd"):
ps -ax | grep httpd
If you identify the process, kill it and try to restart apache2 again.

slurmd.service is Failed & there is no PID file /var/run/slurmd.pid

I am trying to start slurmd.service using the commands below, but it does not stay up permanently. I would be grateful if you could help me resolve this issue!
systemctl start slurmd
scontrol update nodename=fwb-lab-tesla1 state=idle
This is the unit file of slurmd.service:
cat /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm/slurm.conf
[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmd.pid
KillMode=process
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity
[Install]
WantedBy=multi-user.target
and this is the status of the node:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
gpucompute* up infinite 1 drain fwb-lab-tesla1
$ sinfo -R
REASON USER TIMESTAMP NODELIST
Low RealMemory root 2020-09-28T16:46:28 fwb-lab-tesla1
$ sinfo -Nl
Thu Oct 1 14:00:10 2020
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
fwb-lab-tesla1 1 gpucompute* drained 32 32:1:1 64000 0 1 (null) Low RealMemory
Here are the contents of slurm.conf:
$ cat /etc/slurm/slurm.conf
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=FWB-Lab-Tesla
#ControlAddr=137.72.38.102
#
MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
#SlurmUser=slurm
SlurmdUser=root
StateSaveLocation=/var/spool/slurm/StateSave
SwitchType=switch/none
TaskPlugin=task/cgroup
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
# Prevent very long time waits for mix serial/parallel in multi node environment
SchedulerParameters=pack_serial_at_end
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/filetxt
# Need slurmdbd for gres functionality
#AccountingStorageTRES=CPU,Mem,gres/gpu,gres/gpu:Titan
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log
#
#
# COMPUTE NODES
GresTypes=gpu
#NodeName=fwb-lab-tesla[1-32] Gres=gpu:4 RealMemory=64000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
#PartitionName=compute Nodes=fwb-lab-tesla[1-32] Default=YES MaxTime=INFINITE State=UP
#NodeName=fwb-lab-tesla1 NodeAddr=137.73.38.102 Gres=gpu:4 RealMemory=64000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
NodeName=fwb-lab-tesla1 NodeAddr=137.73.38.102 Gres=gpu:4 RealMemory=64000 CPUs=32 State=UNKNOWN
PartitionName=gpucompute Nodes=fwb-lab-tesla1 Default=YES MaxTime=INFINITE State=UP
There is no slurmd.pid in the path below. It appears briefly after the system starts, but is gone again after a few minutes.
$ ls /var/run/
abrt cryptsetup gdm lvm openvpn-server slurmctld.pid tuned
alsactl.pid cups gssproxy.pid lvmetad.pid plymouth sm-notify.pid udev
atd.pid dbus gssproxy.sock mariadb ppp spice-vdagentd user
auditd.pid dhclient-eno2.pid httpd mdadm rpcbind sshd.pid utmp
avahi-daemon dhclient.pid initramfs media rpcbind.sock sudo vpnc
certmonger dmeventd-client ipmievd.pid mount samba svnserve xl2tpd
chrony dmeventd-server lightdm munge screen sysconfig xrdp
console ebtables.lock lock netreport sepermit syslogd.pid xtables.lock
crond.pid faillock log NetworkManager setrans systemd
cron.reboot firewalld lsm openvpn-client setroubleshoot tmpfiles.d
[shirin@FWB-Lab-Tesla Seq2KMR33]$ systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2020-09-28 15:41:25 BST; 2 days ago
Main PID: 1492 (slurmctld)
CGroup: /system.slice/slurmctld.service
└─1492 /usr/sbin/slurmctld
Sep 28 15:41:25 FWB-Lab-Tesla systemd[1]: Starting Slurm controller daemon...
Sep 28 15:41:25 FWB-Lab-Tesla systemd[1]: Started Slurm controller daemon.
I try to start slurmd.service, but it returns to failed again after a few minutes:
$ systemctl status slurmd
● slurmd.service - Slurm node daemon
Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
Active: failed (Result: timeout) since Tue 2020-09-29 18:11:25 BST; 1 day 19h ago
Process: 25650 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
CGroup: /system.slice/slurmd.service
└─2986 /usr/sbin/slurmd
Sep 29 18:09:55 FWB-Lab-Tesla systemd[1]: Starting Slurm node daemon...
Sep 29 18:09:55 FWB-Lab-Tesla systemd[1]: Can't open PID file /var/run/slurmd.pid (yet?) after start: No ...ctory
Sep 29 18:11:25 FWB-Lab-Tesla systemd[1]: slurmd.service start operation timed out. Terminating.
Sep 29 18:11:25 FWB-Lab-Tesla systemd[1]: Failed to start Slurm node daemon.
Sep 29 18:11:25 FWB-Lab-Tesla systemd[1]: Unit slurmd.service entered failed state.
Sep 29 18:11:25 FWB-Lab-Tesla systemd[1]: slurmd.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
Log output of starting slurmd:
[2020-09-29T18:09:55.074] Message aggregation disabled
[2020-09-29T18:09:55.075] gpu device number 0(/dev/nvidia0):c 195:0 rwm
[2020-09-29T18:09:55.075] gpu device number 1(/dev/nvidia1):c 195:1 rwm
[2020-09-29T18:09:55.075] gpu device number 2(/dev/nvidia2):c 195:2 rwm
[2020-09-29T18:09:55.075] gpu device number 3(/dev/nvidia3):c 195:3 rwm
[2020-09-29T18:09:55.095] slurmd version 17.11.7 started
[2020-09-29T18:09:55.096] error: Error binding slurm stream socket: Address already in use
[2020-09-29T18:09:55.096] error: Unable to bind listen port (*:6818): Address already in use
The log file states that it cannot bind to the standard slurmd port 6818, because something else is already using that address.
Do you have another slurmd running on this node? Or something else listening there? Try netstat -tulpen | grep 6818 to see what is using the address.

Percona MySQL Server working but filling the messages log with errors

I have Percona MySQL server 5.7 running under CentOS 7 and although mysql is running without any noticeable errors, it is filling my /var/log/messages with the following every ten seconds:
Nov 15 10:07:27 server systemd: mysqld.service holdoff time over, scheduling restart.
Nov 15 10:07:27 server systemd: Starting MySQL Percona Server...
Nov 15 10:07:27 server mysqld_safe: 171115 10:07:27 mysqld_safe Adding '/usr/lib64/libjemalloc.so.1' to LD_PRELOAD for mysqld
Nov 15 10:07:27 server mysqld_safe: 171115 10:07:27 mysqld_safe Logging to '/var/lib/mysql/server.local.err'.
Nov 15 10:07:27 server mysqld_safe: 171115 10:07:27 mysqld_safe A mysqld process already exists
Nov 15 10:07:27 server systemd: mysqld.service: main process exited, code=exited, status=1/FAILURE
Nov 15 10:07:28 server systemd: Failed to start MySQL Percona Server.
Nov 15 10:07:28 server systemd: Unit mysqld.service entered failed state.
Nov 15 10:07:28 server systemd: Triggering OnFailure= dependencies of mysqld.service.
Nov 15 10:07:28 server systemd: mysqld.service failed.
Nov 15 10:07:28 server systemd: Started Service Status Monitor.
Nov 15 10:07:28 server systemd: Starting Service Status Monitor...
Even though it states that it failed to start the Percona server, it appears to be working, as my website is still executing MySQL queries. I know very little about MySQL administration and was hoping a MySQL guru could shed some light on what is happening.
The clue is here: "A mysqld process already exists". It can't start mysqld because another mysqld process is already running and using the same port. You need to kill that process before the one you tried to start can run.
Re your comment:
Since this is CentOS 7, I assume mysql.service is being called by systemd.
In my experience, if you start mysqld "ad hoc" without using systemd, then systemd has no idea that it's running, and tries to start mysqld on its own. Systemd also cannot shut down an instance of mysqld unless it started that instance.
If the mysqld process is active, find it with ps -ef | grep mysqld and kill it with kill -9 <pid>.

mysql terminated with status 1?

This morning I noticed that my mysql server was not running. A look at the logs turned up the information below. While it is troubling that the mysqld service ran out of memory and was killed, it is more troubling that mysql could not restart.
Any ideas on why mysql could not respawn? How can I test to make sure that if the process is killed it will respawn?
Thank you.
Oct 10 06:37:09 ip-xxx-xxx-xxx-xxx kernel: [12218775.475042] Out of memory: Kill process 810 (mysqld) score 232 or sacrifice child
Oct 10 06:37:09 ip-xxx-xxx-xxx-xxx kernel: [12218775.475060] Killed process 810 (mysqld) total-vm:888108kB, anon-rss:139816kB, file-rss:0kB
Oct 10 06:37:09 ip-xxx-xxx-xxx-xxx kernel: [12218775.655663] init: mysql main process (810) killed by KILL signal
Oct 10 06:37:09 ip-xxx-xxx-xxx-xxx kernel: [12218775.655745] init: mysql main process ended, respawning
Oct 10 06:37:10 ip-xxx-xxx-xxx-xxx kernel: [12218776.044805] type=1400 audit(1381408630.181:13): apparmor="STATUS" operation="profile_replace" name="/usr/sbin/mysqld" pid=27754 comm="apparmor_parser"
Oct 10 06:37:10 ip-xxx-xxx-xxx-xxx kernel: [12218776.676434] init: mysql main process (27763) terminated with status 1
Oct 10 06:37:10 ip-xxx-xxx-xxx-xxx kernel: [12218776.676489] init: mysql main process ended, respawning
Oct 10 06:37:11 ip-xxx-xxx-xxx-xxx kernel: [12218777.468923] init: mysql post-start process (27764) terminated with status 1
Oct 10 06:37:11 ip-xxx-xxx-xxx-xxx kernel: [12218777.512363] type=1400 audit(1381408631.649:14): apparmor="STATUS" operation="profile_replace" name="/usr/sbin/mysqld" pid=27800 comm="apparmor_parser"
Oct 10 06:37:11 ip-xxx-xxx-xxx-xxx kernel: [12218777.681433] init: mysql main process (27804) terminated with status 1
Oct 10 06:37:11 ip-xxx-xxx-xxx-xxx kernel: [12218777.681491] init: mysql respawning too fast, stopped
I would try running mysqld directly as a command and looking at the output. It could be, for example, InnoDB corruption causing it to stop immediately after spawning, at which point upstart might try to respawn until AppArmor stops it.
http://ubuntuforums.org/showthread.php?t=1475798
Probably a looping script issue.
An old question but a recurring issue. The question has two parts:
first, why does the mysql process run out of memory?
second, why can't the mysql process start again?
The first one usually comes down to an oversized configuration: overly large buffer settings can make mysql ask for more memory than the system can provide. See this question to get insights into how to find the best fit for your environment.
The second issue can be very tricky. There are many possible problems that can prevent mysql from starting. The following steps can be performed to figure out the cause.
The first clue can be found in the mysql error log file, in most cases located at /var/log/mysql/error.log.
However, regardless of the nature of the issue, the error log file may be empty. In that case, try to:
look into syslog: in a terminal,
type tail -f /var/log/syslog
and in another terminal try to start mysql:
service mysql start
If this approach doesn't provide any useful clue, try the following:
Start mysqld directly in verbose mode:
su mysql
mysqld -v
as shown here.
The output messages can be helpful in finding the root cause that prevents mysql from starting.

Can't connect to MySQL for single dyno on restart occasionally

Every once in a while, when we either restart the app or a dyno gets cycled/restarted automatically, a dyno comes back up with a Can't connect to MySQL server on (url) (110) error. It then continuously throws !! Unexpected error while processing request: can't modify frozen array errors on every request until we manually restart the dyno, after which it is perfectly fine, just like all of our other dynos.
Jan 29 22:18:23 myapp heroku/web.26: State changed from up to starting
Jan 29 22:18:27 myapp heroku/web.26: Stopping all processes with SIGTERM
Jan 29 22:18:29 myapp heroku/web.26: Process exited with status 0
Jan 29 22:18:32 myapp heroku/web.26: Starting process with command `bundle exec thin -p 41924 -e production -R /home/heroku_rack/heroku.ru start`
Jan 29 22:18:40 myapp app/web.26: >> Thin web server (v1.4.1 codename Chromeo)
Jan 29 22:18:40 myapp app/web.26: >> Maximum connections set to 1024
Jan 29 22:18:40 myapp app/web.26: >> Listening on 0.0.0.0:41924, CTRL+C to stop
Jan 29 22:18:42 myapp heroku/web.26: State changed from starting to up
Jan 29 22:19:08 myapp app/web.26: Starting the New Relic Agent.
Jan 29 22:19:08 myapp app/web.26: Installed New Relic Browser Monitoring middleware
Jan 29 22:19:08 myapp app/web.26: ** [NewRelic][01/29/13 19:19:08 -0800 dd1c41a1-e62a-46f0-bdc8-f48919d4db3c (2)] INFO : Dispatcher: thin
Jan 29 22:19:08 myapp app/web.26: ** [NewRelic][01/29/13 19:19:08 -0800 dd1c41a1-e62a-46f0-bdc8-f48919d4db3c (2)] INFO : Application: myapp
Jan 29 22:19:08 myapp app/web.26: ** [NewRelic][01/29/13 19:19:08 -0800 dd1c41a1-e62a-46f0-bdc8-f48919d4db3c (2)] INFO : New Relic Ruby Agent 3.5.3 Initialized: pid = 2
Jan 29 22:19:08 myapp app/web.26: ** [NewRelic][01/29/13 19:19:08 -0800 dd1c41a1-e62a-46f0-bdc8-f48919d4db3c (2)] INFO : NewRelic::Agent::Samplers::DelayedJobSampler sampler not available: No DJ worker present
Jan 29 22:19:18 myapp app/web.26: ** [NewRelic][01/29/13 19:19:18 -0800 dd1c41a1-e62a-46f0-bdc8-f48919d4db3c (2)] INFO : Connected to NewRelic Service at collector-1.newrelic.com
Jan 29 22:19:18 myapp app/web.26: ** [NewRelic][01/29/13 19:19:18 -0800 dd1c41a1-e62a-46f0-bdc8-f48919d4db3c (2)] INFO : Reporting performance data every 60 seconds.
Jan 29 22:19:31 myapp app/web.26: !! Unexpected error while processing request: Can't connect to MySQL server on 'myrdsinstance.us-east-1.rds.amazonaws.com' (110)
Jan 29 22:19:31 myapp app/web.26: !! Unexpected error while processing request: can't modify frozen array
Jan 29 22:19:31 myapp app/web.26: !! Unexpected error while processing request: can't modify frozen array
Jan 29 22:19:31 myapp app/web.26: !! Unexpected error while processing request: can't modify frozen array
Jan 29 22:19:31 myapp app/web.26: !! Unexpected error while processing request: can't modify frozen array
Jan 29 22:19:31 myapp app/web.26: !! Unexpected error while processing request: can't modify frozen array
Our MySQL instance is an RDS instance and we have never encountered issues connecting to it from other EC2 instances or locally, only from our Heroku dynos.
Thinking that it might be a back_log issue, we raised that value, but it had no effect on the occurrence of the errors.
What else can we try to help either diagnose what is going wrong or eliminate this connection issue? Being on RDS and Heroku, our logging options seem rather limited... Is there some way of making MySQL automatically reconnect, since Heroku overwrites your database.yml?