How should we decide the optimal heartbeat.timeout configuration for Flink jobs? I am using Flink 1.10.3 and my service fails with a heartbeat timeout exception. Currently the default value of 50 seconds is in use.
In my Flink job I tried increasing heartbeat.timeout from 50 s to 5 min, but it did not help; the exception kept occurring.
The reason for the heartbeat timeout exception in my case was that the task managers were crashing because their heap memory was being exhausted.
So I changed taskmanager.memory.managed.fraction from 0.4 down to 0.05, which in turn increased the available heap memory.
Now, the frequency of heartbeat failure has reduced and the pipeline is also able to restart from failures.
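For reference, both settings discussed here live in conf/flink-conf.yaml. A minimal sketch (the values are only examples; heartbeat.timeout is in milliseconds, and 50000 ms is the Flink 1.10 default):

    # conf/flink-conf.yaml
    # heartbeat timeout in milliseconds (e.g. 300000 = 5 minutes)
    heartbeat.timeout: 300000
    # shrink managed memory so more of the TaskManager's budget goes to the JVM heap
    taskmanager.memory.managed.fraction: 0.05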
You can modify conf/flink-conf.yaml, or pass the setting through the -D dynamic parameter.
This may help:
https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/cli/
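For example, a sketch of passing the value at submission time instead of editing the config file (this assumes a Flink CLI that accepts -D dynamic properties; on Flink 1.10 YARN per-job mode the equivalent flag is -yD, and my-job.jar is a placeholder):

    # set the heartbeat timeout (in milliseconds) for a single job submission
    ./bin/flink run \
      -Dheartbeat.timeout=300000 \
      ./my-job.jar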
I am using Couchbase Server Community Edition version 4.0.0. I repeatedly find the log lines below in the babysitter.log file. The frequency is higher when the cluster is under load.
WARNING 473: Slow GET operation on connection (10.3.4.14:55846 => 10.3.9.13:11210): 4325 ms
WARNING 1057: Slow DELETE operation on connection (10.3.2.23:46152 => 10.3.9.13:11210): 1280 ms
I find little to no documentation for this warning. What do these log lines mean, and how can I further debug the cause of the slow operations?
At some point we added logging to Couchbase to help identify bigger problems that come about from resource exhaustion, by correlating them with smaller problems that happen earlier. These are some of those warnings.
They are generally safe to ignore. They do indicate that your system might be under heavy load and that the process perhaps isn't getting enough time to send/receive/handle results.
I'd also say that version 4.0 is quite old at this stage; Couchbase has been much improved and ships much newer runtimes. In particular, this warning comes from an Erlang process, one of the components whose runtime we've updated. On a newer release I'd expect you to see less of this, and perhaps a little more detail when it does occur.
We are running AWS Aurora (Serverless RDS) in our production environment. It is configured to scale between 2 capacity units (4 GB RAM) and 8 capacity units (16 GB RAM).
For the last 2 months our database never auto-scaled; it was running at the minimum capacity. In the past week, due to an increase in system usage, auto-scaling started triggering every few minutes, and it was scaling between 4 and 8 capacity units during the daytime.
Since last week we have been getting an error (not all the time, but every few minutes) when our application sends SQL queries to the database: "Incorrect arguments to mysqld_stmt_execute". It happens for both read and write operations.
We suspected auto-scaling might be the reason, so we set the same capacity units for both min (8) and max (8) to prevent scaling. Scaling stopped and the error did not appear again, which confirmed that it was caused by auto-scaling. Auto-scaling helped us reduce cost, but unfortunately it caused this error.
We don't understand why this error happens during scaling. Can someone explain why scaling causes this issue and how to avoid it?
Or is it related to connection pooling? I've raised it with the connection pool project as well:
https://github.com/brettwooldridge/HikariCP/issues/1407
This was a problem with prepared statement caching. When new servers are provisioned during scaling and the cached prepared statements are executed against them, MySQL throws this error. We disabled the prepared statement cache and are no longer getting the error.
Although this works, it means we can't cache prepared statements, which may cost a little performance. For now that's acceptable, as we haven't noticed any slowdown.
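For anyone looking for the concrete change, here is a minimal sketch of how the cache can be turned off when using HikariCP with MySQL Connector/J (the URL and credentials are placeholders, and the property names are Connector/J's, so adjust for your driver):

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;

    public class DataSourceFactory {
        public static HikariDataSource create() {
            HikariConfig config = new HikariConfig();
            // placeholder endpoint and credentials -- substitute your Aurora cluster values
            config.setJdbcUrl("jdbc:mysql://my-aurora-cluster:3306/mydb");
            config.setUsername("app");
            config.setPassword("secret");

            // turn off the driver's prepared-statement cache ...
            config.addDataSourceProperty("cachePrepStmts", "false");
            // ... and use client-side emulation so no server-side statement handles
            // are kept across a scaling event
            config.addDataSourceProperty("useServerPrepStmts", "false");

            return new HikariDataSource(config);
        }
    }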
I am currently working on the performance of my RESTful API, implemented with Node.js and MySQL. For load testing my APIs I'm using JMeter. When I call one of my URLs with the following load configuration:
Virtual Users : 100,
Total Duration : 60s,
Time delay : 0s,
Ramp-up Period : 1s
JMeter shows status OK for around 300-400 results and after that it times out for the rest of the requests. After this I'm not able to SSH in or ping my server from my machine until I restart my system. Why is this happening? Is it a problem with my API design or with server load?
Most probably your server simply becomes overloaded. I would suggest monitoring its baseline health metrics using top, vmstat, sar, etc. during the test execution. Also consider increasing your ramp-up period so the load increases more gradually; that way you will be able to correlate system behavior with the increasing load and determine the maximum number of concurrent requests your system can serve.
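For example, a quick sketch of running those tools on the server while the test is going (the intervals are arbitrary):

    vmstat 5                 # CPU, run queue and swapping, sampled every 5 seconds
    sar -u -r 5              # CPU and memory utilization, sampled every 5 seconds
    top -b -n 1 >> top.log   # one batch-mode snapshot of top, appended to a log file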
Alternatively (or in addition to top, vmstat and sar) you can use the JMeter PerfMon Plugin, which is capable of collecting a lot of metrics and sending them to JMeter, so you will have a system performance report along with the load test results. Check out the How to Monitor Your Server Health & Performance During a JMeter Load Test guide for detailed installation and configuration instructions along with usage examples.
My Rails application takes a JSON blob of ~500 entries (from an API endpoint) and throws it onto a Sidekiq/Redis background queue. The background job parses the blob and then loops through the entries, performing a basic Rails Model.find_or_initialize_by_field_and_field() and model.update_attributes().
If this job were in the foreground, it would take a matter of seconds (if that long). I'm seeing these jobs remain in the sidekiq queue for 8 hours. Obviously, something's not right.
I've recently re-tuned the MySQL database to use 75% of available RAM as the buffer pool size, divided amongst 3 buffer pool instances. I originally thought that might be part of the deadlock, but the load average on the box is still well below any problematic level (5 CPUs and a load of ~2.5). At this point I'm not convinced the DB is the problem, though of course I can't rule it out.
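For context, that tuning corresponds to a my.cnf fragment roughly like this (the sizes assume a hypothetical 16 GB box rather than my actual hardware):

    [mysqld]
    innodb_buffer_pool_size      = 12G   # ~75% of available RAM
    innodb_buffer_pool_instances = 3     # divided amongst 3 buffer pools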
I'm sure, at this point, that I need to scale back the Sidekiq worker instances. In anticipation of the added load I increased the concurrency to 300 per worker (I have 2 active workers on different servers). Under a relatively small amount of load the queues operate as expected; even the problematic jobs complete in ~1 minute. Per the Sidekiq documentation, though, more than 50 concurrent workers is a bad idea. I wasn't having any stability issues at 150 workers per instance. The problem has been this newly introduced job that performs ~500 MySQL finds and updates.
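For reference, here is roughly where those settings live (the values are illustrative rather than my exact config; whether the ActiveRecord pool in database.yml actually covers the Sidekiq concurrency is something I still need to verify):

    # config/sidekiq.yml
    :concurrency: 300

    # config/database.yml (production)
    production:
      pool: 300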
If this were a database timeout issue, the background jobs should have failed and been moved from the active (busy) queue to the failed queue. That's not the case. They're just getting stuck in the queue.
What other MySQL or Rails/Sidekiq tuning parameters should I be examining to ensure these jobs succeed, fail, or properly time out?
I have an intensive query, issued from a Java/Hibernate application, that is timing out after 3-4 minutes. Which my.cnf settings control that behavior? Or, alternatively, how can this be configured through Hibernate? Why the query takes so long is beyond the scope of this question. Thanks.
You could be hitting InnoDB's lock wait timeout and may need to adjust the innodb_lock_wait_timeout parameter, which controls how long a statement waits for a row lock before it is cut off.
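A sketch of what that might look like in my.cnf (the value is in seconds and is only an example; the default is 50):

    [mysqld]
    innodb_lock_wait_timeout = 300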
It's also possible that the query is still running but the web process that initiated it has timed out. If you can still see your long-running query with SHOW PROCESSLIST after the fact, you will need to adjust your web server's request timeout.