MySQL replication is not running although mysql says it is

I have two servers configured in a master-master pair using MMM. I recently had an issue where the passive master hit a replication error ("got a packet bigger than max_allowed_packet"), but the slave IO and SQL threads continued running. And Seconds_Behind_Master was still showing 0 even though the slave was not executing new statements.
I thought this type of error would cause replication to stop (it has done this in the past). Instead, replication kept running and our monitors didn't notice the problem. Also, the replication errors continually showed up in the MySQL error log instead of in "Last_Error" in "show slave status".
We are running version 5.0.33.
Any ideas what happened here? Thanks!

For the max_allowed_packet size, it sounds like your two DBs are not configured identically. At least the network protocol settings should be identical.
Did you try show slave status on both machines?
Silent failure is a terrible situation. I wonder which records did not make it. Do you have a way of finding out?
Are you getting periodic errors in the error log or a flood of identical errors? Is the sequence number incrementing on the passive master?
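As a quick check (a sketch; the 64 MB value below is only an example), you can compare the setting on both machines and raise the smaller one:
-- Run on both the active and the passive master and compare:
SHOW VARIABLES LIKE 'max_allowed_packet';
-- If one side is smaller, raise it, and persist the change in my.cnf too.
-- The replication threads pick up the new value after STOP SLAVE; START SLAVE.
SET GLOBAL max_allowed_packet = 64 * 1024 * 1024;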
Jacob

Related

MySQL replication does not work for all tables

I have a problem with MySQL replication: there is one table on the master server which doesn't appear on the slave server. Both master and slave have the same master_log_file and master_log_position, and both the slave_io and slave_sql threads are running. I even tried adding an empty table to the master database, and it does appear in the slave database. It's not the first time I've hit such an error, but before, my symptoms were that new data did not appear in the slave database. Are there any other solutions for this problem than stopping replication on the slave, dropping the database, dumping it on the master server, rsyncing to the slave server, and restarting replication from the new file/position?
I noticed using
SHOW SLAVE STATUS;
that Relay_Log_Pos is smaller than Read_Master_Log_Pos and Relay_Log_File differs from Master_Log_File, but Slave_SQL_Running_State says
Slave has read all relay log; waiting for more updates
Seconds_Behind_Master says 0.
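One way to confirm whether a given table has actually diverged is to checksum it on both servers (a sketch; mydb.mytable is a placeholder name):
-- Run the same statement on the master and on the slave, then compare:
CHECKSUM TABLE mydb.mytable;
-- Quick existence check for the missing table on the slave:
SHOW TABLES IN mydb LIKE 'mytable';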
MySQL officially only supports replication to the next higher release series (although it will work for 5.7.13+); see Replication Compatibility Between MySQL Versions:
MySQL supports replication from one release series to the next higher release series. For example, you can replicate from a master running MySQL 5.5 to a slave running MySQL 5.6, from a master running MySQL 5.6 to a slave running MySQL 5.7, and so on.
However, you may encounter difficulties when replicating from an older master to a newer slave if the master uses statements or relies on behavior no longer supported in the version of MySQL used on the slave.
By default, replication will stop if an error occurs, and you have to restart it (after fixing the error). If you use the --slave-skip-errors=all option, however, it will skip these errors:
Normally, replication stops when an error occurs on the slave, which gives you the opportunity to resolve the inconsistency in the data manually. This option causes the slave SQL thread to continue replication when a statement returns any of the errors listed in the option value.
Do not use this option unless you fully understand why you are getting errors. If there are no bugs in your replication setup and client programs, and no bugs in MySQL itself, an error that stops replication should never occur. Indiscriminate use of this option results in slaves becoming hopelessly out of synchrony with the master, with you having no idea why this has occurred.
MySQL 5.5 and 5.7 will actually behave differently for a lot of statements, so enabling this option in this scenario will require even more care.
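For reference, the option goes in the slave's my.cnf and cannot be changed at runtime; listing specific, understood error codes is far safer than "all". A sketch, with 1062 (duplicate key) and 1032 (row not found) as example codes:
[mysqld]
# Skip only specific, understood error codes; never "all" casually.
slave-skip-errors = 1062,1032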
Without seeing your actual CREATE TABLE statement, it is unclear what exactly caused the problem and how to fix it (or whether that is possible), but you should especially check your configuration settings. MySQL 5.7 enables strict mode by default, so a usual suspect for incompatibilities is e.g. a zero default value for date/timestamp columns like DEFAULT '0000-00-00' (either explicit or implicit), which is not allowed anymore; see NO_ZERO_DATE.
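You can quickly check whether strict mode and NO_ZERO_DATE are active on the 5.7 slave, and reproduce the failure with a throwaway table (a sketch; t_zero_date is a hypothetical name):
-- On the 5.7 slave, strict mode plus NO_ZERO_DATE is the default:
SELECT @@GLOBAL.sql_mode;
-- This succeeds on a lax 5.5/5.6 master but is rejected by a strict
-- 5.7 slave (ER_INVALID_DEFAULT), which breaks the replicated statement:
CREATE TABLE t_zero_date (
  d DATETIME NOT NULL DEFAULT '0000-00-00 00:00:00'
);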
Even if you are not too concerned about 100% replication fidelity (drift can snowball very fast, but that is up to you to evaluate for your scenario), resetting your slave (after fixing e.g. the configuration settings) at least once is probably the easiest solution, as there might have been other things you missed; if it executes without errors, it will also double-check that your tables and data up to that point are compatible with your 5.7 slave now.
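If you do rebuild the slave, the sequence looks roughly like this (a sketch, not a script; the log file name and position are placeholders for the master's coordinates at dump time, e.g. as recorded by mysqldump --master-data):
-- On the slave, after loading the fresh dump from the master:
STOP SLAVE;
RESET SLAVE;
CHANGE MASTER TO
  MASTER_LOG_FILE = 'mysql-bin.000123',
  MASTER_LOG_POS = 4;
START SLAVE;
SHOW SLAVE STATUS;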

Lost connection to MySQL server during query on random simple queries

FINAL UPDATE: We fixed this problem by finding a way to accomplish our goals without forking. But forking was the cause of the problem.
---Original Post---
I'm running a Ruby on Rails stack; our MySQL server is separate, but housed at the same site as our app servers. (We've tried swapping it out for a different MySQL server with double the specs, but no improvement was seen.)
During business hours we get a handful of these from no particular query:
ActiveRecord::StatementInvalid: Mysql2::Error: Lost connection to MySQL server during query
Most of the queries that fail are really simple, and there seems to be no pattern from one query to another. This all started when I upgraded from Rails 4.1 to 4.2.
I'm at a loss as to what to try. Our database server sits at less than 5% CPU throughout the day. I do get bug reports from users whose random interactions fail because of this, so it's not queries that have been running for hours or anything like that; of course, when they retry the exact same thing, it works.
Our servers are configured by cloud66.
So in short: our MySQL server is going away for some reason, but it's not for lack of resources. It's also a brand-new server; we migrated from another server when this problem started.
This also happens to me on localhost while developing features sometimes, so I don't believe it's a load issue.
We're running the following:
ruby 2.2.5
rails 4.2.6
mysql2 0.4.8
UPDATE: per the first answer below I increased our max_connections variable to 500 last night, and confirmed the increase via
show global variables like 'max_connections';
I'm still getting dropped connections; the first one today was dropped only a few minutes ago....
ActiveRecord::StatementInvalid: Mysql2::Error: Lost connection to MySQL server during query
I ran select * from information_schema.processlist; and I got 36 rows back. Does this mean my app servers were running 36 connections at that moment? or can a process be multiple connections?
UPDATE: I just set net_read_timeout = 60 (it was 30 before). I'll see if that helps.
UPDATE: It didn't help, I'm still looking for a solution...
Here's my database.yml with credentials removed.
production:
  adapter: mysql2
  encoding: utf8
  host: localhost
  database:
  username:
  password:
  port: 3306
  reconnect: true
The connection to MySQL can be disrupted by a number of means, but I would recommend revisiting Mario Carrion's answer since it's a very wise answer.
It seems likely that the connection is disrupted because it's being shared with the other processes, causing communication protocol errors...
...this could easily happen if the connection pool is process bound, which I believe it is, in ActiveRecord, meaning that the same connection could be "checked-out" a number of times simultaneously in different processes.
The solution is that database connections must be established only AFTER the fork statement in the application server.
I'm not sure which server you're using, but if you're using a warmup feature - don't.
If you're running any database calls before the first network request - don't.
Either of these actions could potentially initialize the connection pool before forking occurs, causing the MySQL connection pool to be shared between processes while the locking system isn't.
I'm not saying this is the only possible reason for the issue, as stated by #sloth-jr, there are other options... but most of them seem less likely according to your description.
Sidenote:
I ran select * from information_schema.processlist; and I got 36 rows back. Does this mean my app servers were running 36 connections at that moment? or can a process be multiple connections?
Each process could hold a number of connections. In your case, you might have up to 500 × 36 connections. (see edit)
In general, the number of connections in the pool can often be the same as the number of threads in each process (it shouldn't be less than the number of threads, or contention will slow you down). Sometimes it's good to add a few more, depending on your application.
EDIT:
I apologize for ignoring the fact that the process count was referencing the MySQL data and not the application data.
The process count you showed is the MySQL server data, which seems to use a thread per connection IO scheme. The "Process" data actually counts active connections and not actual processes or threads (although it should translate to the number of threads as well).
This means that out of a possible 500 connections per application process (i.e., if you're using 8 processes for your application, that would be 8 × 500 = 4,000 allowed connections), your application has only opened 36 connections so far.
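If you want to see where those 36 connections come from, you can group the processlist by client host (a sketch using standard information_schema columns):
-- HOST is "hostname:port", so strip the port before grouping:
SELECT SUBSTRING_INDEX(HOST, ':', 1) AS client_host,
       COUNT(*) AS connections
FROM information_schema.PROCESSLIST
GROUP BY client_host;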
This indicates a timeout error. It's usually a general resource or connection error.
I would check your MySQL config for max connections on MySQL console:
show global variables like 'max_connections';
And ensure the number of pooled connections used by Rails database.yml is less than that:
pool: 10
Note that database.yml reflects number of connections that will be pooled by a single Rails process. If you have multiple processes or other servers like Sidekiq, you'll need to add them together.
Increase max_connections if necessary in your MySQL server config (my.cnf), assuming your kit can handle it.
[mysqld]
max_connections = 100
Note other things might be blocking too, e.g. open files, but looking at connections is a good starting point.
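To see whether the limit has ever actually been hit since the server started, compare the high-water mark against the configured ceiling:
-- High-water mark of simultaneous connections since server start:
SHOW GLOBAL STATUS LIKE 'Max_used_connections';
-- Current count and the configured ceiling:
SHOW GLOBAL STATUS LIKE 'Threads_connected';
SHOW GLOBAL VARIABLES LIKE 'max_connections';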
You can also monitor active queries:
select * from information_schema.processlist;
as well as monitoring the MySQL slow log.
One issue may be a long-running update command. If you have a slow-running command that affects a lot of records (e.g. a whole table), it might be blocking even the simplest queries. This means you could see random queries timeout, but if you check MySQL status, the real cause is another long-running query.
Things you did not mention but should look into:
Are you using unicorn? If so, are you reconnecting and disconnecting in your after_fork and before_fork hooks? (See the sketch after this list.)
Is reconnect: true set in your database.yml configuration?
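If unicorn is in play, the commonly recommended hooks look roughly like this in config/unicorn.rb (a sketch, not code from this particular app):
# Disconnect in the master, reconnect in each worker, so forked
# workers never share a single MySQL socket.
before_fork do |server, worker|
  ActiveRecord::Base.connection.disconnect! if defined?(ActiveRecord::Base)
end
after_fork do |server, worker|
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord::Base)
end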
Well, at first glance this sounds like your webserver is keeping the MySQL sessions open and sometimes a user runs into a timeout. Try disabling keep-alive for the MySQL sessions.
It will be a resource hog, but you're only using 5% ...
Other tips:
Enable the MySQL slow query log and take a look.
Write a short script which pulls and logs the MySQL processlist every minute, and cross-check the log with the timeouts.
Look at the pool size in your DB connection, or set one!
http://guides.rubyonrails.org/configuring.html#database-pooling
In total across all your processes, it should not exceed the max_connections MySQL is configured to allow!
Good luck!
Find out if your database is limited in the number of concurrent connections, because normally an SQL database is supposed to allow more than one active connection.
(Contact your hosting provider.)
Would you mind posting some of your queries? The MySQL documentation has this to say about it:
https://dev.mysql.com/doc/refman/5.7/en/error-lost-connection.html
TL;DR:
Network problems: are any of your boxes renewing leases periodically, or experiencing other network connection errors (netstat / ss), firewall timeouts, etc.? Not sure how managed your hosts are by cloud66....
Query timed out: this can happen if you've got commands backed up behind blocking statements (e.g. ALTERs or locking backups on MyISAM tables). How simple are your queries? No Cartesian products in play? EXPLAIN on the query could help.
Exceeding max_allowed_packet: are you storing pictures, video content, etc.?
There are lots of possibilities here, and without more information it will be difficult to pinpoint this.
I would look first at the MySQL error log, then work from the DB server back to your application.
UPDATE: this didn't work.
Here's the solution. Special thanks to #Myst for pointing out that forking can cause issues; I had no idea to look at this particular code. The errors seemed random because we forked in this fashion in several places.
It turns out that when I was forking processes, Rails was using the same database connection for all forked processes. This created a situation where, when one of the processes (the parent process?) terminated the database connection, the remaining processes would have their connections interrupted.
The solution was to change this code:
def recalculate_completion
  Process.fork do
    if self.course
      self.course.user_groups.includes(user: [:events]).each do |ug|
        ug.recalculate_completion
      end
    end
  end
end
into this code:
def recalculate_completion
  # Release the parent's connection before forking so the child
  # doesn't inherit and share the same MySQL socket.
  ActiveRecord::Base.remove_connection
  Process.fork do
    # The forked child opens its own connection...
    ActiveRecord::Base.establish_connection
    if self.course
      self.course.user_groups.includes(user: [:events]).each do |ug|
        ug.recalculate_completion
      end
    end
    # ...and closes it before exiting.
    ActiveRecord::Base.remove_connection
  end
  # The parent re-establishes its own connection after forking.
  ActiveRecord::Base.establish_connection
end
Making this change stopped the errors from our servers and everything appears to be working well now. If anyone has any more info as to why this worked I would be happy to hear it, as I would like to have a deeper understanding of this.
Edit: it turns out this didn't work either.... we still got dropped connections but not as often.
If you have the query cache enabled, please reset it and it should work:
RESET QUERY CACHE;

MySQL/MariaDB replication: Can I interrupt the process?

I have a replication setup here where data get replicated from a stationary host to a notebook.
Replication happens in two steps: the copying of the relay files, which is quite fast, and the application of the relay log events to the database, which tends to be slow.
Now my question: Suppose the slave has gotten all data from the master, but the "import process" still runs. Can I safely shut down the slave host and resume the still pending part of the replication without disturbing the process in any way?
So, while connected to the host, I say "stop slave", shut down the notebook, go home, and then "start slave" again without having a connection to the host. Can I expect the slave instance to resume the import process?
Your laptop is permanently a Slave to the other machine, correct? You are just breaking the network connection to the Master every night?
There are two threads on the Slave. The I/O thread is responsible for pulling data from the binlog on the Master and putting the stuff into the "relay log" on the Slave. If (when) the network goes away, this thread repeatedly retries. There are settings that control how frequently it retries and when it eventually gives up; consider tuning them.
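The knobs involved look roughly like this (a sketch; MASTER_RETRY_COUNT as a CHANGE MASTER TO option requires MySQL 5.6 or later, and the values are only examples):
-- Retry every 10 seconds, and give up only after 86400 attempts:
STOP SLAVE;
CHANGE MASTER TO
  MASTER_CONNECT_RETRY = 10,
  MASTER_RETRY_COUNT = 86400;
START SLAVE;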
The SQL thread is responsible for applying whatever is in the relay-log. Effectively, the SQL thread can run all the time. It's quite happy to "do nothing" when there is nothing to do.
The I/O thread creates new relay-log files as needed; the SQL thread deletes a log as it finishes with it.
I have dealt with dozens of slaves over the years; I don't recall any issue with network or power failures. You are essentially causing at least a network failure every night. If you are also powering down the laptop, do it gracefully. InnoDB (but not MyISAM) recovers nicely from power failures, but don't push your luck.
STOP/START SLAVE seems unnecessary, but won't hurt. Things should "resume" and eventually "catch up".
Your quote talks about the Master purging binlogs. Well, there is an issue here. The Master does not keep track of what Slaves exist, so it can't tell if your Slave is un-connected for longer than the Master is keeping the binlogs.
See expire_logs_days. I suggest you set it higher than the number of vacation days you might ever take.
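A sketch of setting it at runtime (35 days is only an example value):
SET GLOBAL expire_logs_days = 35;
-- Also put expire_logs_days = 35 under [mysqld] in my.cnf so the
-- setting survives a restart.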
My experience with Slaves predates GTIDs, Galera, etc.; will you be using such?
I have partially found the answer to my question:
The MySQL documentation says:
If the slave stops before the SQL thread has executed all the fetched statements, the I/O thread has at least fetched everything so that a safe copy of the statements is stored locally in the slave's relay logs, ready for execution the next time that the slave starts. This enables the master server to purge its binary logs sooner because it no longer needs to wait for the slave to fetch their contents.
This indicates that it is perfectly possible to resume the import process (the execution of the statements); however, it still remains unclear
whether I need to start the slave before the described things happen, and
what happens if the slave doesn't find its master when I do start slave.

MySQL replication slow: lag suddenly increases and then is slow again; slave unable to catch up

I have been facing a major issue with my MySQL server for two days: my slave server is seconds behind master by 70,000, and the lag has not come down in two days. At night it suddenly increases, but then it is slow again. Is there any way to synchronize master-slave replication FAST? What is the problem? The slave's IO and SQL threads are both running (YES). Please help me out if there is any way.
Is it repeatedly bouncing between 70000 and about 0? If so, that is a mystery that I have seen on and off for more than a decade. Ignore it, it will go away.
If Seconds_behind_master is rising at the rate of 1 second per second, then look at what the Slave is doing: SHOW PROCESSLIST; You will probably find something like an ALTER that has been running a long time, tying up replication.
If Seconds_behind_master is getting big, but not going down much, then there are several possible answers.
Is the Slave a "weaker" machine than the Master? Keep in mind that Replication is (depending on the version) only single-threaded. Multiple writes can happen on the Master simultaneously, but then have to be done one at a time on the Slave.
Is the Slave running a big query that is locking what the replication thread would like to get to? Look at the Slave's PROCESSLIST.
Which Engine are you using? VM? Cloud hosted? Performing backups at night?

Robust fault tolerant MySQL replication

Is there any way to get a fault tolerant MySQL replication? I am in an environment that has many networking issues. It appears that replication gets an error and just stops. I need it to continue to work and recover from these faults. There is some wrapper software that checks the state of replication and restarts it in the case of losing its log position. Is there an alternative?
Note:
Replication is done from an embedded computer running MySQL 4.1 to an external computer running MySQL 5.0.45.
What error are you getting? You also haven't described what replication scheme or MySQL version you're using. The specific errors you're getting are important.
Replication usually stops when there's a primary/unique key conflict in a Master-Master replication. Other than that on a typical Master-Slave replication setup, networking issues shouldn't cause problems.
Try using MySQL 5.1 or newer, since replication in 5.0 is statement-based and causes problems in master-master setups or when you're using stored procedures.
(Also, stay away from MySQL Cluster ... noticed the advice in another comment.)
Replication errors only happen if the databases get out of sync somehow; having the server simply continue would mean incoherent databases, and I really doubt you'd want that.
In my experience, the only time you end up with such errors is if one of the master servers did not complete a query and the slave noticed.
In any case, if you really want to have the slave continue via some sort of cron job, you could always have a query run every few minutes asking the slave "SHOW SLAVE STATUS", then checking the error column; if an error is present, send a "STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;" sequence (see the sketch below). But it would probably be much more apt to send an email to an admin when MySQL encounters an error instead, so he/she can investigate the source of the problem and make sure the databases are actually in sync; otherwise you're likely to see more errors in the near future as the databases become more and more out of sync.
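For reference, the manual skip sequence such a cron job would issue looks like this (a sketch; blindly skipping events is exactly what the warning above is about):
SHOW SLAVE STATUS;  -- inspect the Last_Errno / Last_Error columns first
STOP SLAVE;
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;  -- skip exactly one event
START SLAVE;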
Consider MySQL Cluster using the NDB storage engine; it's meant to be shared-nothing and fault tolerant.
MySQL replication will normally detect problems and reconnect anyway, continuing from where it left off.
If you're getting replication errors, it's likely that the source is something else. MySQL replication effectively does a "tail -f" on the binary log and replays it on the slave (it's slightly smarter than that, but not much).
If the databases become out of sync, MySQL replication will neither detect nor repair this, but it may eventually cause it to break as a subsequent update cannot proceed due to conflicting data on the slave.
The default network timeout on the replication slave is much too long (slave_net_timeout defaults to 3600 seconds, i.e. one hour); you'll want to reduce this.
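Concretely, lowering it makes the I/O thread notice a dead link and reconnect much sooner (60 seconds below is only an example):
SHOW GLOBAL VARIABLES LIKE 'slave_net_timeout';
SET GLOBAL slave_net_timeout = 60;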
Data becoming out of sync is difficult to avoid; mitigation steps are:
Monitor replication using something like mk-table-checksum from Maatkit
Audit all your code for replication-unsafe queries
If using 5.1, switch to row-based replication, which is less likely to suffer from this problem (see the sketch below)
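Switching is a one-line change (a sketch; binlog_format is dynamic on 5.1, but persist it in my.cnf as well):
-- Row events ship the changed rows themselves instead of re-executing
-- the statement on the slave:
SET GLOBAL binlog_format = 'ROW';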