Database strategy for synchronization based on changes - MySQL

I have a Spring+Hibernate+MySQL backend that exposes my model (8 different entities) to a desktop client. To stay synchronized, I want the client to regularly ask the server for recent changes. The process may be as follows:
Point A: The client connects for the first time and retrieves the entire model from the server.
Point B: The client asks the server for all changes since Point A.
Point C: The client asks the server for all changes since Point B.
To retrieve the changes (Points B and C) I could create an HQL query that returns all rows in all my tables that were last modified since my previous retrieval (sketched below). However, I'm afraid this could be a heavy query and degrade my performance if executed often.
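For concreteness, the per-table query might look like the sketch below (the last_modified column is my assumption; each of the 8 entity tables would need one, and Hibernate can populate it automatically, e.g. via an interceptor):

    SELECT *
      FROM customer
     WHERE last_modified > :lastSyncTime;
    -- repeated for each of the 8 entity tables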
For this reason I was considering alternatives, such as keeping a separate table with recent updates for fast access. I have looked into using the L2 query cache, but it doesn't seem to serve my purpose.
Does anyone know a good strategy for this? My initial thought is to keep control of the synchronization myself and avoid "automatic" synchronization tools.
Many thanks

You can store changes in a queue table. Triggers can populate the queue on insert, update, and delete, which preserves the order of the changes (insert, update, update, delete). Empty the queue after each download.
Emptying the queue would cause issues if you have multiple clients, though... you may need to think about a design that handles that case.
There are several designs you could go with, all with trade-offs. I have used the queue design before, but it was only copying data to a single destination, not multiple.
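A minimal sketch of such a queue table and one of its triggers (all names are my assumptions; matching AFTER UPDATE and AFTER DELETE triggers would write to the same queue):

    CREATE TABLE change_queue (
        id         BIGINT AUTO_INCREMENT PRIMARY KEY,
        table_name VARCHAR(64) NOT NULL,
        row_id     BIGINT      NOT NULL,
        action     ENUM('INSERT','UPDATE','DELETE') NOT NULL,
        changed_at TIMESTAMP   NOT NULL DEFAULT CURRENT_TIMESTAMP
    );

    DELIMITER //
    CREATE TRIGGER customer_after_insert AFTER INSERT ON customer
    FOR EACH ROW
    BEGIN
        -- record what changed, in order, for later download
        INSERT INTO change_queue (table_name, row_id, action)
        VALUES ('customer', NEW.id, 'INSERT');
    END//
    DELIMITER ;

With multiple clients, instead of emptying the table, each client could remember the highest id it has seen and only fetch rows above that high-water mark.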

Related

How to keep two databases with different schemas up-to-date

Our company has a really old legacy system with a very bad database design (no foreign keys, columns with serialized PHP arrays, etc. :(). We decided to rewrite the system from scratch with a new database schema.
We want to rewrite the system in parts, so we will split the old monolithic application into many smaller ones.
The problem is: we want to have live data in both databases, old and new schema.
I'd like to ask if any of you know best practices for doing this.
What we are thinking of:
asynchronous data synchronization with a message queue
making a REST API in the new system and having the legacy application use it instead of DB calls
some kind of table replication
Thank you very much
I had to deal with a similar problem in the past. There was a system that no longer had support, but there were people using it because it had some features (security holes) that allowed them certain functionality. However, they also needed new functionality.
I selected the tables involved in the new system and created triggers to cross-update the tables, so when a record was created in the old system the trigger created a copy in the new system, and vice versa. If you design this properly, you can have both systems working at the same time, in real time.
The drawback is that while both systems are running, everything becomes slower, since you have to maintain the integrity of two databases on every operation.
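A sketch of one direction of such a cross-update trigger (database, table, and column names are my assumptions; note that MySQL triggers can cascade, so the pair of triggers needs a guard against feeding each other):

    DELIMITER //
    CREATE TRIGGER legacy_db.customer_after_insert
    AFTER INSERT ON legacy_db.customer
    FOR EACH ROW
    BEGIN
        -- copy the new record into the new schema;
        -- IGNORE breaks the loop when the mirror trigger fires back
        INSERT IGNORE INTO new_db.customer (id, name, email)
        VALUES (NEW.id, NEW.name, NEW.email);
    END//
    DELIMITER ;

A mirror trigger on new_db.customer writes back to legacy_db.customer the same way.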
I would start by adding a database layer to accept API calls from the business layer, then write to both the old schema and the new. This adds complexity up front, but it lets you guarantee that the data stays in sync.
This would require changing the legacy system to call an API instead of issuing SQL statements. If they did not have the foresight to do that originally, you may not be able to take my approach. But, you should do it going forward.
Triggers may or may not work out. In older versions of MySQL (before 5.7), there can be only one trigger of a given type on a given table, which forces you to lump unrelated things into a single trigger.
Replication can handle some kinds of changes -- engine, datatypes, etc. But it cannot help with splitting one table into two. Be careful with the replication of triggers and where a trigger takes effect (between master and slave). In general, a stored routine should be performed on the master, letting the effect be replicated to the slave. But it may be worth considering having the trigger run on the slave instead, or having different triggers on the two servers.
Another thought is to do the transformation in stages. By carefully sequencing schema changes versus application of triggers versus code changes versus the database layer, you can do partial transformations one at a time, sometimes without a big outage to update everything simultaneously (fingers crossed). A simple example: (1) change the code to dynamically handle either the new or the old schema, (2) change the schema, (3) clean up the code (remove handling of the old schema).
Doing a database migration may be a tedious task considering the complexity of the data and the structure of the tables, which is of course without any constraints or a proper design. But given that your legacy application was doing its job, the amount of corrupt or unusable data should be minimal.
For the stated problem I would suggest a DB migration task that converts all the old legacy data into the new form, and then developing the new application. The advantages are:
1) There is no need to keep 2 different applications.
2) No need to change the code in the legacy application - which can become messy.
3) The DB migration gives us a chance to correct any corrupt data (if needed).
A DB migration may not be practical in all scenarios, but if you can do it with less effort than building the database sync and new APIs for the legacy application, I would suggest going for it.

Granular 'Up to the minute' data recoverability of MySQL database data

I operate a web-based online game with a MySQL backend. Every day many writes are performed against hundreds of related tables holding user data.
Frequently a user's account will become compromised. I would like the ability to restore the user's data to a certain point in time prior to the attack without affecting any other user data.
I'm aware of binary logging in MySQL, but as far as I know this gives whole-database recovery up to a certain point in time. I would like a more granular solution, i.e. being able to specify which tables, which rows, etc.
What should I be looking into? What are the general best-practices?
If you create and use audit tables (populated through triggers), you can always get back to the data for one particular user in any table.
Be sure to write your general restore script before you need it, though. It's much easier to plug a user ID into a script you already have available than to sit there looking at the audit tables wondering, how do I do this again?
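A minimal sketch of an audit table for one of those user-data tables (all names are my assumptions):

    CREATE TABLE player_audit (
        audit_id   BIGINT AUTO_INCREMENT PRIMARY KEY,
        player_id  BIGINT    NOT NULL,
        gold       INT       NOT NULL,
        level      INT       NOT NULL,
        changed_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
    );

    -- record the pre-update state of the row
    CREATE TRIGGER player_after_update AFTER UPDATE ON player
    FOR EACH ROW
        INSERT INTO player_audit (player_id, gold, level)
        VALUES (OLD.player_id, OLD.gold, OLD.level);

    -- the restore script then reads the last state before the attack:
    SELECT * FROM player_audit
     WHERE player_id = 42 AND changed_at < '2011-06-01 12:00:00'
     ORDER BY changed_at DESC
     LIMIT 1;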
MySQL (or any other RDBMS that I'm aware of) is not able to do that by itself. Therefore you should implement it yourself in your application layer.
This is not possible without external modules.
As thejh suggested in the comments, revisions would be a good solution. If you only need to work with user data, create a table that resembles the user table with an additional timestamp or similar, and run a cron job once a week/day/... that copies recently modified user data (tracked via additional flags/dates in the actual user table) into this table.
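A sketch of the copy such a cron job could run (all names are my assumptions; the script would supply the last run time):

    INSERT INTO user_revisions (user_id, name, email, revised_at)
    SELECT id, name, email, NOW()
      FROM users
     WHERE modified_at > :last_run_time;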

Message Queues vs DB Table Queue via Cron

We have a large project coming up soon with quite a lot of media processing (images, video) as well as email output etc. - the sort of stuff we'd normally put into a table called "email_queue" and process with a script run from cron.
I have been reading a lot about message queue systems like beanstalkd, and have even set one up. It was easy and nice to use; the problem is that I am unsure whether I am missing something.
Could someone detail the benefits of using a queue system rather than a table and a cron job? I really can't see what they are.
Thanks
Differences:
Once a message is put on the queue it can be delivered immediately. So if your cron normally ran every 5 minutes, you could process faster with queueing.
If your queueing system supports transactions, then it will automatically re-deliver a message if the processing fails.
It can be harder to query what is in your queue. A database table has a nice way to search (SQL).
If you have multiple servers/processes/threads handling messages, the queue system will make sure a message is delivered to only one of them. With a DB table you need to handle this via application code (locking, flags, etc.); see the sketch after this list.
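For example, with a DB table each worker needs an explicit claim step so two workers don't grab the same row. One portable MySQL sketch (table and column names are my assumptions):

    -- atomically claim the oldest pending row for this connection
    UPDATE email_queue
       SET status = 'processing', claimed_by = CONNECTION_ID()
     WHERE status = 'pending'
     ORDER BY id
     LIMIT 1;

    -- fetch the claimed row, process it, then mark it done
    SELECT id, payload FROM email_queue
     WHERE status = 'processing' AND claimed_by = CONNECTION_ID();

    UPDATE email_queue
       SET status = 'done'
     WHERE status = 'processing' AND claimed_by = CONNECTION_ID();

A message broker does this bookkeeping for you.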
A message queue (a distributed one at least, e.g. RabbitMQ) gives you the ability to distribute work across physical nodes. You still need to have a process on each node to dequeue work and process it.
It ultimately comes down to your requirements, I guess. You can achieve a more manageable solution at scale by using message queues: you can decouple your nodes more easily.
Of course, there is a learning curve... so it again comes back to your target goals.
Note that on each node you can still reuse your cron/DB table until (and if) you wish to change the implementation. That's what's great about decoupling when you can.
First, queues are often backed by actual DB tables and can maintain message durability. That aside, a queue is a natural way to shove off work that needs to be done asynchronously, which is very powerful if you design on that principle from the start.
Other than the fact that a table (entity) has a set of hard columns (attributes), both a table composed of a set of records and a queue are nothing more than lists of stuff. You are employing the queue-as-a-table as a formal queue; it's just that you are polling it on a regular (cron) basis.
MQs add another nifty feature, though, of generally synchronizing access to the message itself (you may or may not be doing this in your SQL to get the next item).
I like to consider the cron/table mechanism as POLL-based and the MQ as EVENT-based.
The benefit of a queue, in my opinion, is that it takes care of the syncing and status updating. MQs can be set up to "broadcast" (topic) or make the message available to a group of consumers or listeners.
MQs, though asynchronous, would likely keep up within your cron window. With a table, how do you know that the number of messages you process can be handled before the next cron job runs and tries to step on the previous job?
Multiple consumers for the MQ allow you to scale the work as you see fit. In the example above, if you saw that your load average (just like the OS's process queue) is greater than you like, you can provision another consumer to handle that load, bringing it online and offline as metrics demand.
MQs can be set up to have different operational parameters such as message priority and performance (some queues can remain in memory, others persist to disk).
The downside is that (as already mentioned) the queue can sometimes be hard to query and to obtain metrics from. I always prefer MQ systems that have a DB backing store so that I can watch the queue myself with SQL.
This gets asked fairly frequently, and there's usually not a compelling reason to go with an MQ if you're comfortable with databases. Here's one example thread.
My take is that you might want to avoid the learning curve unless your data requirements include exceptionally high volumes, which is unlikely if you're thinking cron rather than a process with a timer (much less multiple processes with timers).

MySQL Databases. How Many for a Web App?

I'm building a web app. This app will use MySQL to store all the information associated with each user. However, it will also use MySQL to store sys admin type stuff like error logs, event logs, various temporary tokens, etc. This second set of information will probably be larger than the first set, and it's not as important. If I lost all my error logs, the site would go on without a hiccup.
I am torn on whether to have multiple databases for these different types of information, or to stuff it all into a single database in multiple tables.
The reason to keep it all in one is that I would only have to open one connection. I've noticed a measurable time penalty for opening connections, particularly with remote MySQL servers.
What do you guys do?
First, I must say I think storing all your event logs and error logs in the DB is a very bad idea; instead you may want to store them on the filesystem.
You will only need error logs or event logs if something in your web app goes unexpectedly wrong. Then you download the file and examine it, that's all. There is no need to store them in the DB. It will slow down your DB and your web app.
As an answer to your question: if you really want to do that, you should separate them, and you should find a way to keep your pages running even when your event-log and error-log databases are under load and responding slowly.
Going with two distinct databases (one for your application's "core" data, and another one for "technical" data) might not be a bad idea, at least if you expect your application to have a lot of users:
it'll allow you to put one DB on one server, and the other DB on a second server
and you can think about scaling a bit more later: more servers for the "core" data, and still only one for the "technical" data -- or the opposite
if the "technical" data is not as important, you can (more easily) have two distinct backup processes/policies
having two distinct databases, and two distinct servers, also means you can run heavy calculations on the technical data without impacting the DB server that hosts the "core" data -- and those calculations can be useful, on logs and the like
as a side note: if you don't need that kind of "reporting" calculation, maybe storing that data in a DB is not useful, and files would do perfectly well?
Maybe opening two connections means a bit more time -- but that difference is probably rather negligible, is it not?
I've worked a couple of times on applications that used two databases:
one "master"/"write" database, used only for writes
and one "slave" database (a replication of the first one, to several slave servers), used for reads
This way, yes, we sometimes open two connections -- but one server alone would not have been able to handle the load...
Use connection pooling anyway, so the time to get a connection is not a problem. But if you have 2 connections, transaction handling becomes more complicated. On the other hand, sometimes it's handy to have 2 connections: if something goes wrong in the business transaction, you can roll it back and still log the failure on the admin connection. But I would still stick to one database.
I would use only one database - mostly for the reason you supply: you only need one connection to reach both the logging and the user data.
Depending on your programming language, some frameworks (J2EE, for example) provide connection pooling. With two databases you would need two pools. In PHP, on the other hand, the performance cost comes into play when setting up a connection (or two).
I see no reason for two databases. It'd be perfectly acceptable to have tables that are devoted to "technical" and "business" data, but the logical separation should be sufficient.
Physical separation doesn't seem necessary to me, unless you mean an application and data warehouse star schema. In that case, it's either real-time updates or, more typically, a nightly batch ETL.
It makes no difference to MySQL in any way whether you use separate "databases"; they are simply catalogues.
It may make setting permissions easier, which is a legitimate reason to do it. Other than that, it is exactly the same as keeping the tables in the same DB (except that you can have several tables with the same name... but please don't).
Putting them on separate servers might be a good idea, however, as you probably don't want your core critical data (user info, for example) mixed in with your high-volume, unimportant data. This is particularly true for old audit data, debug logs, etc.
Also, short-lived data such as search results, sessions, etc. could be placed on a different server - it presumably has no high-availability[1] requirement.
Having said that, if you don't need to do this, dump it all on one server where it's easier to manage (backup, provide high availability, manage security, etc.).
It is generally not possible to take a consistent snapshot of data across more than one server. This is a good reason to have only one (or only one that you care about for backup purposes).
[1] Of the data, not the database.
In MySQL, InnoDB has the option of storing all tables together in one shared tablespace file, or having one file per table.
Having one file per table is somewhat recommended anyway, and if you do that, it makes little difference at the storage level whether you have one database or several.
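For reference, this is controlled by the real innodb_file_per_table option; a quick check and enable might look like this (it only affects tables created afterwards):

    SHOW VARIABLES LIKE 'innodb_file_per_table';
    SET GLOBAL innodb_file_per_table = 1;  -- dynamic on newer versions; otherwise set it in my.cnf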
With connection pooling, one database or several is probably not going to matter either.
So, in my opinion, the question is if you'd ever consider separating the "other half" of the database to a separate server - with the separate server having perhaps a very different hardware configuration, such as no RAID. If so, consider using separate databases. If not, use a single database.

Pattern for updating slave SQL Server 2008 databases from a master whilst minimising disruption

We have an ASP.NET web application hosted by a web farm of many instances using SQL Server 2008 in which we do aggregation and pre-processing of data from multiple sources into a format optimised for fast end user query performance (producing 5-10 million rows in some tables). The aggregation and optimisation is done by a service on a back end server which we then want to distribute to multiple read only front end copies used by the web application instances to facilitate maximum scalability.
My question is about the best way to get this data from a back end database out to the read only front end copies in such a way that does not kill their performance during the process. The front end web application instances will be under constant high load and need to have good responsiveness at all times.
The backend database is constantly being updated so I suspect that transactional replication will not be the best approach, as the constant stream of updates to the copies will hurt their performance.
Staleness of data is not a huge issue so snapshot replication might be the way to go, but this will result in poor performance during the periods of replication.
Doing a drop and bulk insert will result in periods with no data for user queries.
I don't really want to get into writing a complex cluster approach where we drop copies out of the cluster during updating - is there something along these lines that we can do without too much effort, or is there a better alternative?
There is actually a technology built into SQL Server 2005 (and 2008) that is designed to address this kind of issue: Service Broker (I'll refer to it as SSB). The problem is that it has a very steep learning curve.
I know MySpace went public about how it uses SSB to manage its park of SQL Servers: MySpace Uses SQL Server Service Broker to Protect Integrity of 1 Petabyte of Data. I know of several more (major) sites that use similar patterns, but unfortunately they have not gone public, so I cannot name them. I was personally involved in some projects around this technology (I am a former member of the SQL Server team).
Now bear in mind that SSB is not a dedicated data-transfer technology like Replication. As such, you will not find anything similar to Replication's publishing wizards and simple deployment options (check a table and it gets transferred). SSB is a reliable messaging technology, and as such its primitives stop at the level of message exchange; you would have to write the code that captures the data changes, packs them as messages, and also unpacks the messages into relational tables at the destination.
Why some companies still prefer SSB over Replication for a task like you describe is that SSB has a far better story when it comes to reliability and scalability. I know of projects that exchange data between 1500+ sites, far beyond the capabilities of Replication. SSB is also abstracted from the physical topology: you can move databases, rename machines, and rebuild servers all without changing the application. Because data flows over logical routes, the application can adapt on the fly to new topologies. SSB is also resilient to long periods of disconnect and downtime, being capable of resuming the data flow after hours, days and even months of disconnection. High throughput achieved by engine integration (SSB is part of the SQL engine itself, not a collection of satellite applications and processes like Replication) means that the backlog of changes can be processed in reasonable time (I know of sites that go through half a million transactions per minute). SSB applications typically rely on internal Activation to process the incoming data. SSB also has some unique features, like built-in load balancing (via routes) with sticky-session semantics, support for deadlock-free application-specific correlated processing, priority data delivery, specific support for database mirroring, certificate-based authentication for cross-domain operations, built-in persisted timers, and many more.
This is not a specific answer to "how to move data from table T on server A to server B". It is more a generic technology for "exchanging data between server A and server B".
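To give a flavor of the kind of code involved, here is a minimal SSB setup plus a send (all object names are my assumptions):

    -- plumbing: message type, contract, queue, service
    CREATE MESSAGE TYPE [//Example/Change] VALIDATION = WELL_FORMED_XML;
    CREATE CONTRACT [//Example/ChangeContract]
        ([//Example/Change] SENT BY INITIATOR);
    CREATE QUEUE dbo.ChangeQueue;
    CREATE SERVICE [//Example/ChangeService] ON QUEUE dbo.ChangeQueue
        ([//Example/ChangeContract]);

    -- sending one change notification
    DECLARE @h UNIQUEIDENTIFIER;
    BEGIN DIALOG CONVERSATION @h
        FROM SERVICE [//Example/ChangeService]
        TO SERVICE '//Example/ChangeService'
        ON CONTRACT [//Example/ChangeContract]
        WITH ENCRYPTION = OFF;
    SEND ON CONVERSATION @h
        MESSAGE TYPE [//Example/Change] (N'<change table="T" id="42"/>');

The receiving side would RECEIVE from its queue (typically via activation) and apply the changes to the local tables.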
I've never had to deal with this scenario before, but I did come up with a possible solution. Basically, it would require a change in your main database structure: instead of storing the data itself, you would keep records of modifications to the data. Thus, if a record is added, you store "Table X, inserted new record with these values: ...". For modifications, just store the table, field and changed value. For deletions, just store which record is deleted. Every modification is stored with a timestamp.
Your client systems would keep their local copies of the database and regularly ask for all database modifications after a certain date/time. You then execute those modifications on the local database and it is up to date again.
And the back end? Well, it would just keep a list of modifications and perhaps a table with the base data. Keeping just the modifications also means you're keeping track of history, allowing you to ask the system what it looked like a year ago.
How well this would perform depends on the number of modifications on the back-end database. But if you request the changes every 15 minutes, it shouldn't be that much data each time.
But again, I never had the chance to work this out in a real application, so it's still a theoretical principle for me. It seems fast, but a lot of work would be required.
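A sketch of the modification-log table this design implies (SQL Server flavored; all names are my assumptions):

    CREATE TABLE dbo.Modifications (
        Id         BIGINT IDENTITY(1,1) PRIMARY KEY,
        TableName  VARCHAR(128) NOT NULL,
        RowId      BIGINT       NOT NULL,
        Action     CHAR(1)      NOT NULL,  -- 'I', 'U' or 'D'
        ColName    VARCHAR(128) NULL,      -- for updates
        NewValue   VARCHAR(MAX) NULL,
        ModifiedAt DATETIME     NOT NULL DEFAULT GETDATE()
    );

    -- a client that last synced at @lastSync pulls everything newer:
    -- SELECT * FROM dbo.Modifications WHERE ModifiedAt > @lastSync ORDER BY Id;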
Option 1: Write an app to transfer the data using row-level transactions. It might take longer, but would result in no interruption of the site using the data, because the rows are there before and after the read occurs, just with new data. This processing would happen on a separate server to minimize load.
In SQL Server 2008 you can set READ_COMMITTED_SNAPSHOT to ON to ensure that a row being updated does not cause blocking (see below).
But basically all this app does is read the new data as it becomes available out of one database and into the other.
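The snapshot setting mentioned above is a one-off database option (the database name here is a placeholder; switching it on needs a moment with no other active connections):

    ALTER DATABASE FrontEndDb SET READ_COMMITTED_SNAPSHOT ON;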
Option 2: Move the data (tables or entire database) from the aggregation server to the front-end server. Automate this if possible. Then switch your web application to point to the new database or tables for future requests. This works but requires control over the web app, which you may not have.
Option 3: If you are talking about a single table (though this can work with many), you can do a view swap. You write your code against a SQL view which points to table A. You do your work on table B, and when it's ready, you update the view to point to table B (as sketched below). You can even write a function that determines the active table and automates the whole swap.
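A sketch of that view swap (SQL Server syntax; object names are my assumptions):

    CREATE VIEW dbo.ActiveData AS
        SELECT * FROM dbo.TableA;
    GO

    -- rebuild dbo.TableB with the fresh aggregate, then repoint readers:
    ALTER VIEW dbo.ActiveData AS
        SELECT * FROM dbo.TableB;
    GO

Queries against dbo.ActiveData switch tables with no window in which the data is missing.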
Option 4: You might be able to use something like byte-level replication of the server. That sounds scary, though. It is basically copying the server from point A to point B exactly, down to the very bytes. It's mostly used in DR situations, and this sounds like it could be a kinda/sorta DR situation, but not really.
Option 5: Give up and learn how to sell insurance. :)