replication, uuid field as primary key, row based replication - mysql

I am looking into the best way to implement multi-site replication with MySQL. As I am mainly an MS-SQL user, I feel quite uncomfortable with MySQL replication terminology, and with the multiple MySQL versions that do not do exactly the same thing the same way.
My objective is, in SQL Server terminology, to have a publisher and many subscribers. Subscribers are open to user updates. Changes are replicated to the publisher, which then distributes them to the other subscribers.
So my objective here is to determine the correct primary key rule to be used for our tables. As we want exclusively surrogate keys, we have the possibility to use either integer/auto-increment or uuid/uuid_short fields. Integer/auto-increment does not fit our needs, as it can create replication conflicts when, between two synchronisations, both servers insert a record into the same table. So, in my view, the correct solution to avoid replication/primary key conflicts would be to:
use uuid or uuid-short fields as primary keys
have the corresponding uuid values set by the server at INSERT time
set the replication to RBR (Row-Based Replication - sounds equivalent to MS-SQL merge replication) mode, and not SBR (Statement-Based Replication - sounds like transactional replication). As I understand it, RBR will insert the calculated uuid value 'as is' on the other servers at replication time, while SBR will re-execute the uuid() function and generate a new value on each server ... thus causing a major problem!
Will it work?

I think there are two unrelated issues here:
1.
Choosing a primary key in a way which is probably unique - UUID is an acceptable approach. You can use it with SBR or RBR, provided the client app generates the UUID. If you happen to be calling the MySQL function to generate it, then you need to use row-based replication. Row-based replication is generally much better, so the only reason you'd want to use statement-based replication is backwards compatibility with a MySQL <5.1 slave, or a legacy application which relies on some of its behaviour.
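As a sketch of this first point, here is what client-side UUID generation might look like (the table and column names are hypothetical):

```python
import uuid

def new_order_row(customer_id):
    """Generate the primary key on the client, not with MySQL's uuid().

    Because the value is fixed before the INSERT statement is sent,
    it replicates byte-for-byte under both SBR and RBR.
    """
    pk = str(uuid.uuid4())  # random (version 4) UUID, generated client-side
    sql = "INSERT INTO orders (id, customer_id) VALUES (%s, %s)"
    params = (pk, customer_id)
    return sql, params

sql, params = new_order_row(42)
```

Since the UUID travels as a literal parameter, the replication format no longer matters for this column.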
2.
Secondly, you appear to want to do multi-master replication. IT WILL NOT WORK.
MySQL replication is extremely simplistic - it writes changes to a log, pushes them from one server to another, reading the log as needed.
You can't do multi-master replication, because it will fail. Not only does MySQL not actually support it (for arbitrary topologies), it also doesn't work in practice.
MySQL replication has no "conflict resolution" algorithm, it will simply stop and break if a conflict is discovered.
Even assuming that your database contains NO UNIQUE INDEXES except primary keys which are allocated by UUIDs, your application can still have rules which make multi-master fail.
Any constraints, invariants or assumptions within your application, probably only work if the database has immediate-consistency. Trying to use multi-master replication breaks this assumption and will cause your application to enter unexpected (i.e. normally impossible) states.

Related

MySQL replication with custom query for reverse hashes

I have a MySQL DB with a quickly growing amount of data.
I'd like to use some web based tool that plugs into the DB so that I can analyze data and create reports.
The idea would be to use replication in order to give R/O access to the slave DB instead of having to worry about security issues on the master (which also contains data not relevant to this project, but just as important).
The master DB contains strings that are hashed (SHA-1) from the source and, on the slave, they need to go back to their original form using a reverse hash database.
This will allow whatever tool I plug into the slave-DB (living on another server) to work straight out of the box.
My question is: what is the best way to do replication while somehow reshaping the slave-DB with the mentioned strings back into the source format?
example
MASTER DB
a8866ghfde332as
a8fwe43kf3e3t42
SLAVE DB
John Smith
Rose White
The slave DB should already contain the tables reversed and should NOT be reversed when doing a query.
How do you guys think I should approach this?
Is replication the way to go?
Thank you for any help!
EDIT
I should specify some details:
the slave DB would also contain a reverse hash (lookup) table
the amount of source strings is limited so there's little risk of collisions
the best option would be to replicate only certain tables to the slave, where the slave DB does a reverse hash lookup every time there is an INSERT and saves the reversed hash in another table (or column), ready to be read by the web-based tool
The setup I am willing to use is mainly focused on NOT having anything connect to the master other than the source (which creates records in the DB) and the slave DB itself.
This would result in better security by having the reverse lookup table sitting in a DB (the slave) that is NOT in direct contact with the source of data.
So, even if somebody hacks the source and makes it to the master DB, no useful data could be retrieved, the strings in question being hashed.
It is easier, simpler, and more foolproof to replicate everything from master to slave in MySQL, so plan to replicate everything unless you have an extremely compelling reason not to.
That said, MySQL has absolutely no problem with the slave having tables that the master does not have -- tables created directly on the slave will not cause a problem if no tables with conflicting schema+table names exist on the master.
You don't want to try to have the slave "fix" the data on the way in, because that's not something MySQL replication is designed to do, nor is it something readily accomplished. Triggers fire on the slave's tables only when the master writes events to its binlog in statement mode, which is not as reliable as row mode nor as flexible as mixed mode. And even if you had this working, you would lose the ability to compare master and slave data sets with table checksums, which is an important part of the ongoing maintenance of master/slave replication.
However... I do see a way to accomplish what you want to do with the reverse hash tables: create the lookup tables on the slave, and then create views that reconstruct the data in its desired form by joining the replicated tables to the lookup tables, and run your queries on the slave against these views.
When a view simply joins properly indexed tables and doesn't include anything unusual like aggregate functions (e.g. COUNT()) or UNION or EXISTS, the server will process queries against the view as if the underlying tables had been queried directly, using all available and appropriate indexes... so this approach should not cause significant performance penalties. In fact, on the slave you can declare indexes on the replicated tables that you don't have or need on the master (except UNIQUE indexes, which wouldn't make sense there), and these can be designed as needed for the slave-specific workload.
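A minimal sketch of this view-based approach, using SQLite in place of the MySQL slave purely for illustration (all table, column and view names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- table replicated from the master: stores only hashes
    CREATE TABLE events (id INTEGER PRIMARY KEY, user_hash TEXT);
    -- lookup table that exists only on the slave
    CREATE TABLE hash_lookup (hash TEXT PRIMARY KEY, plaintext TEXT);
    -- view joining the replicated table to the lookup table,
    -- so reporting tools see the original strings
    CREATE VIEW events_plain AS
        SELECT e.id, l.plaintext AS user_name
        FROM events e JOIN hash_lookup l ON l.hash = e.user_hash;
""")
conn.execute("INSERT INTO hash_lookup VALUES ('a8866ghfde332as', 'John Smith')")
conn.execute("INSERT INTO events VALUES (1, 'a8866ghfde332as')")
rows = conn.execute("SELECT user_name FROM events_plain").fetchall()
```

The replicated table is never modified on the slave, so checksum-based verification against the master keeps working.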
Hash functions are not injective: two different inputs can produce the same output. As such, it would not, in general, be possible to accurately rebuild the data on the slave.
On a simple level, and to demonstrate this, consider a hashing function for integers that happens to return the square of the input: -1 => 1, 0 => 0, 1 => 1, 2 => 4, 3 => 9, etc. Now consider the inverse, the square root: 1 => -1 & 1, 4 => -2 & 2, etc.
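The same toy example, written out in Python:

```python
def toy_hash(n):
    """A toy 'hash' for integers: the square of the input."""
    return n * n

# Two different inputs collide on the same output...
assert toy_hash(-2) == toy_hash(2) == 4

# ...so the inverse is ambiguous: given only the output 4,
# there is no way to know whether the original input was -2 or 2.
preimages_of_4 = [n for n in range(-10, 11) if toy_hash(n) == 4]
```

Real cryptographic hashes make collisions rare, but the inverse is still not a function; the asker's plan works only because the lookup table pins each hash to one known source string.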
It may be better to only replicate the specific data needed for your project to the slaves, and do it without hashing.

MySQL sharding approaches?

What is the best approach for sharding MySQL tables?
The approaches I can think of are :
Application Level sharding?
Sharding at MySQL proxy layer?
Central lookup server for sharding?
Do you know of any interesting projects or tools in this area?
The best approach for sharding MySQL tables is to not do it unless it is totally unavoidable.
When you are writing an application, you usually want to do so in a way that maximizes velocity - developer speed. You optimize for latency (time until the answer is ready) or throughput (number of answers per time unit) only when necessary.
You partition and then assign partitions to different hosts (= shard) only when the sum of all these partitions does no longer fit onto a single database server instance - the reason for that being either writes or reads.
The write case is either a) the frequency of writes permanently overloads the server's disks, or b) there are so many writes going on that replication permanently lags in the replication hierarchy.
The read case for sharding is when the size of the data is so large that the working set of it no longer fits into memory and data reads start hitting the disk instead of being served from memory most of the time.
Only when you have to shard do you do it.
The moment you shard, you are paying for that in multiple ways:
Much of your SQL is no longer declarative.
Normally, in SQL you are telling the database what data you want and leave it to the optimizer to turn that specification into a data access program. That is a good thing, because it is flexible, and because writing these data access programs is boring work that harms velocity.
With a sharded environment you are probably joining a table on node A against data on node B, or you have a table larger than a node, spread across nodes A and B, and are joining data from it against data that is on nodes B and C. You start writing application-side hash-based join resolutions manually in order to resolve that (or you are reinventing MySQL Cluster), meaning you end up with a lot of SQL that is no longer declarative, but expresses SQL functionality in a procedural way (e.g. you are using SELECT statements in loops).
You are incurring a lot of network latency.
Normally, an SQL query can be resolved locally and the optimizer knows about the costs associated with local disk accesses and resolves the query in a way that minimizes the costs for that.
In a sharded environment, queries are resolved by either running key-value accesses across a network to multiple nodes (hopefully with batched key accesses and not individual key lookups per round trip) or by pushing parts of the WHERE clause onward to the nodes where they can be applied (that is called 'condition pushdown'), or both.
But even in the best of cases this involves many more network round trips than the local situation, and it is more complicated. Especially since the MySQL optimizer knows nothing about network latency at all (OK, MySQL Cluster is slowly getting better at that, but for vanilla MySQL outside of Cluster it is still true).
You are losing a lot of expressive power of SQL.
Ok, that is probably less important, but foreign key constraints and other SQL mechanisms for data integrity are incapable of spanning multiple shards.
MySQL has no working API for asynchronous queries.
When data of the same type resides on multiple nodes (e.g. user data on nodes A, B and C), horizontal queries often need to be resolved against all of these nodes ("Find all user accounts that have not been logged in for 90 days or more"). Data access time grows linearly with the number of nodes, unless multiple nodes can be asked in parallel and the results aggregated as they come in ("Map-Reduce").
The precondition for that is an asynchronous communication API, which does not exist for MySQL in a good working shape. The alternative is a lot of forking and connections in the child processes, which is visiting the world of suck on a season pass.
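A rough sketch of the fan-out-and-merge ("Map-Reduce") pattern described above, with the per-shard MySQL query stubbed out and all node names hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in data: what each node would return for
# "find users not logged in for 90 days or more".
SHARDS = {
    "node_a": [("alice", 120), ("bob", 10)],
    "node_b": [("carol", 95)],
    "node_c": [("dave", 200), ("erin", 30)],
}

def query_shard(node):
    # stand-in for a blocking MySQL query against one node
    return [user for user, idle_days in SHARDS[node] if idle_days >= 90]

# "Map": ask every node in parallel; "Reduce": merge results as they arrive.
# Threads sidestep the lack of an async MySQL API, at the cost of one
# connection (and one blocked thread) per node.
with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
    stale_users = sorted(u for part in pool.map(query_shard, SHARDS) for u in part)
```

Without the parallel fan-out, total latency grows linearly with the number of nodes; with it, latency is roughly that of the slowest node.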
Once you start sharding, data structure and network topology become visible as performance points to your application. In order to perform reasonably well, your application needs to be aware of these things, and that means that really only application level sharding makes sense.
The question is more if you want to auto-shard (determining which row goes into which node by hashing primary keys for example) or if you want to split functionally in a manual way ("The tables related to the xyz user story go to this master, while abc and def related tables go to that master").
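A minimal sketch of the auto-sharding variant - routing a row by hashing its primary key - assuming a fixed list of nodes (names hypothetical):

```python
import hashlib

NODES = ["node_a", "node_b", "node_c"]  # hypothetical shard hosts

def shard_for(primary_key):
    """Pick the node that owns a row by hashing its primary key.

    A stable digest (md5 of the key's bytes) is used instead of Python's
    built-in hash(), so every application server routes the same key to
    the same node across processes and restarts.
    """
    digest = hashlib.md5(str(primary_key).encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

node = shard_for("user:12345")
```

Note the weakness mentioned later in the thread: with plain modulo routing, changing the number of nodes remaps almost every key, which is why schemes with many small fixed partitions are preferred.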
Functional sharding has the advantage that, if done right, it is invisible to most developers most of the time, because all tables related to their user story will be available locally. That allows them to still benefit from declarative SQL as long as possible, and will also incur less network latency because the number of cross-network transfers is kept minimal.
Functional sharding has the disadvantage that it does not allow for any single table to be larger than one instance, and it requires manual attention of a designer.
Functional sharding has the advantage that it can be applied relatively easily to an existing codebase, with a number of changes that is not overly large. Booking.com has done it multiple times over the past years and it worked well for them.
Having said all that, looking at your question, I do believe that you are asking the wrong questions, or I am completely misunderstanding your problem statement.
Application Level sharding: dbShards is the only product that I know of that does "application aware sharding". There are a few good articles on the website. Just by definition, application-aware sharding is going to be more efficient. If an application knows exactly where to go with a transaction without having to look it up or get redirected by a proxy, that in itself will be faster. And speed is often one of the primary concerns, if not the only concern, when someone is looking into sharding.
Some people "shard" with a proxy, but in my eyes that defeats the purpose of sharding. You are just using another server to tell your transactions where to find the data or where to store it. With application aware sharding, your application knows where to go on its own. Much more efficient.
This is the same as #2 really.
Do you know of any interesting projects or tools in this area?
Several new projects in this space:
citusdata.com
spockproxy.sourceforge.net
github.com/twitter/gizzard/
Application level of course.
The best approach I've ever read, I found in this book:
High Performance MySQL
http://www.amazon.com/High-Performance-MySQL-Jeremy-Zawodny/dp/0596003064
Short description: you can split your data into many parts and store ~50 parts on each server. This helps you avoid the second biggest problem of sharding - rebalancing. Just move some of the parts to the new server and everything will be fine :)
I strongly recommend buying it and reading the "mysql scaling" part.
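As I understand the book's idea, it can be sketched roughly like this (the partition count, server names and split are made up for illustration):

```python
# Rows are assigned to a fixed number of small virtual partitions, and the
# partitions (not the rows) are mapped to servers. Rebalancing means moving
# a partition's data and flipping its map entry - no rows are re-hashed.
N_PARTITIONS = 100  # fixed forever; ~50 per server with two servers

partition_map = {p: ("server_1" if p < 50 else "server_2")
                 for p in range(N_PARTITIONS)}

def server_for(row_id):
    return partition_map[row_id % N_PARTITIONS]

# Adding a third server: copy the data of some partitions over,
# then reassign those partitions in the map.
for p in range(0, N_PARTITIONS, 3):
    partition_map[p] = "server_3"
```

Because `row_id % N_PARTITIONS` never changes, only the moved partitions are touched; plain `row_id % n_servers` routing would instead remap almost every row when a server is added.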
As of 2018, there seems to be a MySQL-native solution for this. There are actually at least two - InnoDB Cluster and NDB Cluster (the latter has a commercial and a community version).
Since most people who use MySQL Community Edition are more familiar with the InnoDB engine, this is what should be explored as a first priority. It supports replication and partitioning/sharding out of the box and is based on MySQL Router for different routing/load-balancing options.
The syntax for your tables creation would need to change, for example:
CREATE TABLE t1 (col1 INT, col2 CHAR(5), col3 DATETIME) PARTITION BY HASH ( YEAR(col3) );
(this is only one of four partitioning types)
One very important limitation:
InnoDB foreign keys and MySQL partitioning are not compatible. Partitioned InnoDB tables cannot have foreign key references, nor can they have columns referenced by foreign keys. InnoDB tables which have or which are referenced by foreign keys cannot be partitioned.
Shard-Query is an OLAP based sharding solution for MySQL. It allows you to define a combination of sharded tables and unsharded tables. The unsharded tables (like lookup tables) are freely joinable to sharded tables, and sharded tables may be joined to each other as long as the tables are joined by the shard key (no cross shard or self joins that cross shard boundaries). Being an OLAP solution, Shard-Query usually has minimum response times of 100ms or less, even for simple queries so it will not work for OLTP. Shard-Query is designed for analyzing big data sets in parallel.
OLTP sharding solutions exist for MySQL as well. Closed-source solutions include ScaleDB and DBShards. Open-source OLTP solutions include JetPants, Cubrid and Flock/Gizzard (Twitter infrastructure).
Do you know of any interesting projects or tools in this area?
As of 2022, here are two tools:
Vitess (website: https://vitess.io & repo: https://github.com/vitessio/vitess)
PlanetScale (https://planetscale.com)
You can consider this middleware
shardingsphere

Can you index tables differently on Master and Slave (MySQL)

Is it possible to set up different indexing on a read-only slave than on the master? Basically, this seems to make sense given the different requirements of the two systems, but I want to make sure it will work and not cause any problems.
I believe so. Once replication is working, you can drop the indexes on the slave and create the indexes you want, and that should do it. Since MySQL replicates statements and not data (at least by default), as long as the SQL necessary to insert, update or select from the table doesn't need to change, it shouldn't notice.
Now there are obviously downsides to this. If you make a unique key that isn't on the master, you could get data inserted on the master that can't be inserted on the slave. If an update is done that uses an index it may run fast on the master but cause a table scan on the slave (since you don't have whatever index was handy).
And if any DDL changes ever happen on the master (such as to alter an index) that will be passed to the slave and the new index will be created there as well, even though you don't want it to.
For sure. I do it all the time. Issues I've run into:
Referencing indexes via FORCE/USE/IGNORE INDEX in SELECTS will error out
Referencing indexes in ALTER statements on the master can break replication
Adds another step to promoting a slave to be the master in case of emergency
If you're using statement-based replication (the norm), and you're playing around with UNIQUE indexes, any INSERT ... ON DUPLICATE KEY, INSERT IGNORE or REPLACE statements will cause extreme data drift / divergence
Performance differences (both good and bad)
Sure, I think it's even a common practice to replicate InnoDB tables into MyISAM tables on the slave to be able to add full-text indexes.

MySQL vs SQLite + UNIQUE Indexes

For reasons that are irrelevant to this question, I'll need to run several SQLite databases instead of the more common MySQL for some of my projects. I would like to know how SQLite compares to MySQL in terms of speed and performance regarding disk I/O (the database will be hosted on a USB 2.0 pen drive).
I've read the Database Speed Comparison page at http://www.sqlite.org/speed.html and I must say I was surprised by the performance of SQLite but since those benchmarks are a bit old I was looking for a more updated benchmark (SQLite 3 vs MySQL 5), again my main concern is disk performance, not CPU/RAM.
Also, since I don't have that much experience with SQLite, I'm curious whether it has anything similar to the TRIGGER (on update, delete) events of the InnoDB MySQL engine. I also couldn't find any way to declare a field as UNIQUE like MySQL has, only PRIMARY KEY - is there anything I'm missing?
As a final question I would like to know if a good (preferably free or open source) SQLite database manager exists.
A few questions in there:
In terms of disk I/O limits, I wouldn't imagine that the database engine makes a lot of difference. There might be a few small things, but I think it's mostly just whether the database can read/write data as fast as your application wants it to. Since you'd be using the same amount of data with either MySQL or SQLite, I'd think it won't change much.
SQLite does support triggers: CREATE TRIGGER Syntax
SQLite does support UNIQUE constraints: column constraint definition syntax.
To manage my SQLite databases, I use the Firefox Add-on SQLite Manager. It's quite good, does everything I want it to.
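For what it's worth, both SQLite features mentioned above (triggers and column-level UNIQUE constraints) can be checked quickly with Python's built-in sqlite3 module (table and trigger names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY,
        email TEXT UNIQUE          -- column-level UNIQUE constraint
    );
    CREATE TABLE audit (msg TEXT);
    -- AFTER UPDATE trigger, comparable to MySQL's ON UPDATE triggers
    CREATE TRIGGER users_updated AFTER UPDATE ON users
    BEGIN
        INSERT INTO audit VALUES ('user ' || OLD.id || ' changed');
    END;
""")
conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")
conn.execute("UPDATE users SET email = 'b@example.com' WHERE id = 1")
audit_rows = conn.execute("SELECT msg FROM audit").fetchall()

try:
    # violates the UNIQUE constraint on email
    conn.execute("INSERT INTO users (email) VALUES ('b@example.com')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

The trigger fires and the duplicate insert is rejected, just as the answer describes.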
In terms of disk I/O limits, I wouldn't imagine that the database engine makes a lot of difference.
In MySQL/MyISAM the data is stored unordered, so RANGE reads on the PRIMARY KEY will theoretically need to issue several HDD seek operations.
In MySQL/InnoDB the data is sorted by PRIMARY KEY, so RANGE reads on the PRIMARY KEY can be done with a single disk seek (in theory).
To sum that up:
MyISAM - data is written to the HDD unordered. Slow primary-key range reads if the primary key is not an AUTO_INCREMENT unique field.
InnoDB - data is ordered; bad for flash drives (as data needs to be re-ordered after inserts = additional writes). Very fast for primary-key range reads, slower for writes.
InnoDB is not well suited to flash memory: seeks are very fast there (so you won't get much benefit from ordering the data), and the additional writes needed to maintain the order wear flash memory.
MyISAM / InnoDB makes a huge difference for conventional and flash drives (I don't know about SQLite), but I'd rather use MySQL/MyISAM.
I actually prefer using SQLiteSpy http://www.portablefreeware.com/?id=1165 as my SQLite interface.
It supports things like REGEXP which can come in handy.

How to apply an index to a MySQL / MyISAM table without locking?

Having a production table, with a very critical column (date) missing an index, are there any ways to apply said index without user impact?
The table currently gets around 5-10 inserts every second, so full table locking is out; redirecting those inserts to alternative table / database, even temporarily, is also denied (for corporate politics reasons). Any other ways?
As far as I know this is not possible with MyISAM. With 5-10 INSERTs per second you should consider InnoDB anyway, unless you're not reading that much.
Are you using replication, preferably in a master-master setup? (You should!) If so, you could CREATE INDEX on the standby server, switch roles, do the same again, then switch back. Be sure to disable replication temporarily (when using master-master) to avoid replicating the CREATE INDEX to the active node.
Depending on whether you use that table primarily to archive logs or similar, you might as well look into the ARCHIVE storage engine.