MySQL replication with custom query for reverse hashes

I have a MySQL DB with a quickly growing amount of data.
I'd like to use some web based tool that plugs into the DB so that I can analyze data and create reports.
The idea would be to use replication in order to give R/O access to the slave DB instead of having to worry about security issues on the master (which also contains data not relevant to this project, but just as important).
The master DB contains strings that arrive from the source already hashed (SHA-1) and, on the slave, they need to go back to their original form using a reverse hash database.
This will allow whatever tool I plug into the slave-DB (living on another server) to work straight out of the box.
My question is: what is the best way to do replication while somehow reshaping the slave-DB with the mentioned strings back into the source format?
Example:
MASTER DB:
a8866ghfde332as
a8fwe43kf3e3t42
SLAVE DB:
John Smith
Rose White
The slave DB should already contain the reversed values in its tables; the reversal should NOT happen at query time.
How do you guys think I should approach this?
Is replication the way to go?
Thank you for any help!
EDIT
I should specify some details:
the slave DB would also contain a reverse hash (lookup) table
the amount of source strings is limited, so there is little risk of collisions
the best option would be to replicate only certain tables to the slave; on the slave, a reverse hash lookup would run on every INSERT and store the reversed value in another table (or column), ready to be read by the web based tool
The setup I am willing to use is mainly focused on NOT having anything connect to the master other than the source (which creates records in the DB) and the slave DB itself.
This would improve security by keeping the reverse lookup table in a DB (the slave) that is NOT in direct contact with the source of the data.
So, even if somebody hacks the source and makes it to the master DB, no useful data could be retrieved, because the strings in question are hashed.

It is easier, simpler, and more foolproof to replicate everything from master to slave in MySQL, so plan to replicate everything unless you have an extremely compelling reason not to.
That said, MySQL has absolutely no problem with the slave having tables that the master does not have -- tables created directly on the slave will not cause a problem if no tables with conflicting schema+table names exist on the master.
You don't want to try to have the slave "fix" the data on the way in, because that's not something MySQL replication is designed to do, nor is it something readily accomplished. Triggers on the slave's tables fire only when the master writes events to its binlog in statement mode, which is not as reliable as row mode nor as flexible as mixed mode. And even if you had this working, you would lose the ability to compare master and slave data sets with table checksums, which is an important part of the ongoing maintenance of master/slave replication.
However... I do see a way to accomplish what you want to do with the reverse hash tables: create the lookup tables on the slave, and then create views that reconstruct the data in its desired form by joining the replicated tables to the lookup tables, and run your queries on the slave against these views.
When a view simply joins properly indexed tables and doesn't include anything unusual like aggregate functions (e.g. COUNT()), UNION, or EXISTS, the server will process queries against the view as if the underlying tables had been queried directly, using all available and appropriate indexes... so this approach should not cause significant performance penalties. In fact, on the slave you can declare indexes on the replicated tables that you don't have or need on the master (except for UNIQUE indexes, which wouldn't make sense there), and these can be designed as needed for the slave-specific workload.
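As a minimal sketch of that idea (all table and column names here are hypothetical, assuming SHA-1 hex digests): create the reverse lookup table on the slave plus a view that joins it to a replicated table, for example a people table with a name_hash column:

CREATE TABLE hash_lookup (
  name_hash  CHAR(40)     NOT NULL PRIMARY KEY,  -- SHA-1 hex digest
  plain_name VARCHAR(255) NOT NULL
) ENGINE=InnoDB;

CREATE VIEW people_plain AS
SELECT p.id,
       COALESCE(l.plain_name, p.name_hash) AS name
FROM people AS p
LEFT JOIN hash_lookup AS l ON l.name_hash = p.name_hash;

Your reporting tool then queries people_plain on the slave; any hash that has no entry in the lookup table simply shows up as the raw hash value.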

Hash functions are not injective, so it is possible for two different inputs to produce the same output. As such, it would not generally be possible to accurately rebuild the original data on the slave.
To demonstrate this on a simple level, consider a hashing function for integers that happens to return the square of its input: -1 => 1, 0 => 0, 1 => 1, 2 => 4, 3 => 9, etc. Now consider the inverse, the square root: 1 => -1 and 1, 4 => -2 and 2, etc., so a single hash value can correspond to more than one original input.
It may be better to only replicate the specific data needed for your project to the slaves, and do it without hashing.

Related

mysql automatic replication of partial data

I have to create a dashboard based on a table in MySQL, using only today's data.
This DB is used by a service with a massive amount of data and continuous reads and writes, so I'd like to replicate part of this table (only today's data) into a "slave" instance.
Is it possible to do this in MySQL, without scripting?
Thanks
MySQL has no built-in feature to replicate a subset of rows. There are replication filters to replicate a subset of schemas or tables, but not rows.
One workaround could be to replicate fully to the replica, then on the replica delete any data that is more than one day old.
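A minimal sketch of that purge, run periodically (e.g. from cron) against the replica only; the table and column names are hypothetical and assume an indexed datetime column:

DELETE FROM mydb.events_log          -- run on the replica, never on the source
WHERE created_at < CURRENT_DATE;     -- keep only today's rows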
But this would work only for a database that is INSERT-only. If UPDATE and DELETE operations are also replicated, they may try to change rows that have already been purged on the replica. With ROW-based binary logs, this results in a replication error when the row can't be found, and replication stops.
It might work if you only use STATEMENT-based binary logs, but I've never tried it so I can't predict what other problems might occur. Also, you can't fully prevent ROW-based binary logs from occurring, because individual sessions can change their binary log format.
I think you're going to need a bespoke solution no matter what. Probably not using replication, but just an ETL job to query the current day's data and import it into another MySQL instance (not a replica).

MySQL replication/synchronization: purge from master but not from slave

I came across this problem a few days ago and have been tinkering with, and pondering about, several different approaches, but I cannot seem to find a good answer:
I have two MySQL servers, one master/hot and one slave/archive. All write requests go to the master, and shall also (eventually) be replicated/copied to the slave. However, certain data in the master grows "stale" after a while (say a week) and shall then be purged, so as to keep the master's tables short. This purge should however not affect the slave. How can I go about achieving this?
Essentially, my master database acts sort of like a "hot" database, where data is fresh and is purged once it goes old. It should contain data that users might need quickly, and thus we want to keep the tables small. My slave, on the other hand, works more like an archive, which should contain all data regardless of "hotness". Queries to the slave don't need to execute quickly, and the slave's data can lag behind a few minutes, but it needs to contain all records since our beginning of time.
My initial thought was to utilize ordinary replication, but can I somehow filter certain queries so that they do not affect the slave? I was thinking of creating a purge query, which removes old data from the master but doesn't affect the slave. From reading the MySQL documentation, it seems that this filtering can only be done at the database or table level.
Another thought was to do this via an external application, and manually SELECT data from the master and INSERT it into the slave, with some clever logic to decide what data to select. This works well for log tables, which will only ever add data, but it doesn't work well for tables that represent state, such as user settings. This approach would probably also involve a lot of special cases, as I cannot find a good, consistent way of describing all tables in our database (there are log tables, state tables, config tables and a few which I cannot really categorize).
None of these approaches seem to solve the problem in a simple fashion, but I feel I cannot be the first to have this problem. Any ideas are welcome, and thanks in advance.
If more info is needed, feel free to comment and I'll edit it in
Just use regular replication. When you delete data on the master, run the delete in the same session like this:
SET sql_log_bin = 0;
DELETE FROM my_table WHERE whatever = true;
SET sql_log_bin = 1;
This prevents those statements from being written to the binary log, and therefore the delete won't be replicated to the slave.
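Note that changing sql_log_bin requires the SUPER privilege (or the equivalent administrative privilege in newer MySQL versions), and it only affects the session that sets it, so other sessions keep replicating normally.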

Is it possible to determine MySQL replication "position" with a normal query?

I have a MySQL (RDS) database that is replicated from one datacenter to another. There is also a message bus which spans these two locations, and it carries messages when certain writes to the database take place.
The messages and the MySQL replication race between the two locations. We need to make sure we don't process the message before the write that it refers to has definitely made it into the replica.
At the moment we use a custom "last updated at" field on the tables that are replicated. It seems like there should be a global variable we could use instead, though -- something that monotonically increases whenever there's a write anywhere in the database, and is available at both the master and the slave.
Does such a variable exist? Do I need special privileges to read it?
If there is not such a thing, what would be the tradeoffs associated with implementing it ourselves?

MySQL: Writing to slave node

Let's say I have a database of Cars. I have Makes and Models (FK to Makes). I plan on having users track their cars; each Car has a FK to Model. Now, I have a lot of users, and I want to split up my database to distribute load. The Makes and Models tables don't change much, but they need to be shared across shards. My thought is to use MySQL replication from a master DB of makes and models to each slave database. My question is: can I safely write to the slave databases assuming I don't write to those tables on the master?
And while on the subject, is there any way to guarantee that a slave database has the latest data? For example, someone just added the 'Taurus' make and then wants to add their car. Can I ensure that the slave database they are using has the latest master data?
Yes, in general you can safely write to a table on the slaves that is not being written on the master. If you do things like insert auto_increment rows on the slaves and on the master, independently, you will of course have problems. You should configure that table to be excluded from replication entirely, really.
For checking whether you have the latest data, SHOW SLAVE STATUS includes a field Seconds_Behind_Master that tells you whether the slave is up to date. Obviously you want it to be zero. To be certain that inserted and replicated data is present, of course, you need to wait a second and then see that Seconds_Behind_Master is zero.
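As a hedged sketch of that exclusion (the database and table names are hypothetical), you could add this to the slave's my.cnf:

[mysqld]
replicate-ignore-table = carsdb.user_cars

and then check the lag on the slave with:

SHOW SLAVE STATUS;

looking at the Seconds_Behind_Master field as described above.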
This was a good solution I gleaned while searching
I included the main point, as available here:
http://erlycoder.com/43/mysql-master-slave-and-master-master-replication-step-by-step-configuration-instructions-
MySQL master-master replication and autoincrement indexes
If you are using master-slave replication, then most likely you will design your application to write to the master and read from the slave or several slaves. But when you are using master-master replication, you are going to read from and write to any of the master servers. In this case a problem with auto-increment indexes arises: when both servers have to add a record to the same table at the same time (a different record on each server), each one will assign it the same index and then try to replicate it to the other, creating a collision. A simple trick allows you to avoid such collisions on MySQL server.
On the Master 1/Slave 2 add to /etc/my.cnf:
auto_increment_increment= 2
auto_increment_offset = 1
On the Master 2/Slave 1 add to /etc/my.cnf:
auto_increment_increment= 2
auto_increment_offset = 2
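With these settings, one server hands out odd auto-increment values (1, 3, 5, ...) while the other hands out even ones (2, 4, 6, ...), so the two masters can never generate the same key for the same table.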

Replication with lots of temporary table writes

I've got a database which I intend to replicate for backup reasons (performance is not a problem at the moment).
We've set up the replication correctly and tested it and all was fine.
Then we realized that it replicates all the writes to the temporary tables, which in effect meant that replication of one day's worth of data took almost two hours for the idle slave.
The reason for that is that we recompute some of the data in our db via cronjob every 15 mins to ensure it's in sync (it takes ~3 minutes in total, so it is unacceptable to do those operations during a web request; instead we just store the modifications without attempting to recompute anything while in the web request, and then do all of the work in bulk). In order to process that data efficiently, we use temporary tables (as there's lots of interdependencies).
Now, the first problem is that temporary tables do not persist if we restart the slave while it's in the middle of processing transactions that use that temp table. That can be avoided by not using temporary tables, although this has its own issues.
The more serious problem is that the slave could easily catch up in less than half an hour if it wasn't for all that recomputation (which it replays one run after the other, so there's no benefit to rebuilding the data every 15 mins... you can literally see it stuck at, say, 11:15, only to quickly catch up and get stuck at 11:30, etc.).
One solution we came up with is to move all that recomputation out of the replicated db, so that the slave doesn't replicate it. But it has disadvantages in that we'd have to prune the tables it eventually updates, making our slave in effect "castrated", i.e. we'd have to recompute everything on it before we could actually use it.
Did anyone have a similar problem and/or how would you solve it? Am I missing something obvious?
I've come up with the solution. It makes use of the replicate-do-db option mentioned by Nick. I'm writing it down here in case somebody else has a similar problem.
The problem with just using replicate-(wild-)do* options in this case (like I said, we use temp tables to repopulate a central table) is that either you ignore temp tables and repopulate the central one with no data (which causes further problems as all the queries relying on the central table being up-to-date will produce different results) or you ignore the central table, which has a similar problem. Not to mention, you have to restart mysql after adding any of those options to my.cnf. We wanted something that would cover all those cases (and future ones) without the need for any further restart.
So, what we decided to do is to split the database into the "real" and a "workarea" databases. Only the "real" database is replicated (I guess you could decide on a convention of table names to be used for replicate-wild-do-table syntax).
All the temporary table work happens in the "workarea" db, and to avoid the dependency problem mentioned above, we won't populate the central table (which sits in the "real" db) by INSERT ... SELECT or RENAME TABLE, but rather query the tmp tables to generate a sort of diff against the live table (i.e. generate INSERT statements for new rows, DELETE statements for old ones, and UPDATEs where necessary).
This way the only queries that are replicated are exactly the updates that are required, nothing else; i.e. some (most?) of the recomputation queries happening every fifteen minutes might not even make their way to the slave, and the ones that do will be minimal and not computationally expensive at all, just simple INSERTs and DELETEs.
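A rough sketch of such a diff, with hypothetical names (real_db.central is the replicated live table, workarea.central_new the freshly recomputed copy, both keyed by id):

-- add rows that exist only in the recomputed copy
INSERT INTO real_db.central (id, value)
SELECT n.id, n.value
FROM workarea.central_new AS n
LEFT JOIN real_db.central AS c ON c.id = n.id
WHERE c.id IS NULL;

-- remove rows that are no longer present in the recomputed copy
DELETE c
FROM real_db.central AS c
LEFT JOIN workarea.central_new AS n ON n.id = c.id
WHERE n.id IS NULL;

-- update rows whose value has changed
UPDATE real_db.central AS c
JOIN workarea.central_new AS n ON n.id = c.id
SET c.value = n.value
WHERE c.value <> n.value;

Only the rows that actually changed are touched in the replicated "real" database, which keeps the replicated workload small.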
In MySQL, as of 5.0 I believe, you can do table wildcards to replicate specific tables. There are a number of command-line options that can be set but you can also do this via your MySQL config file.
[mysqld]
replicate-do-db = db1
replicate-do-table = db2.mytbl2
replicate-wild-do-table = database_name.%
replicate-wild-do-table = another_db.%
The idea is that you tell it not to replicate any tables other than the ones you specify.