How do I keep 2 scratch Databases in sync - mysql

My question is a lot like this one. However I'm on MySQL and I'm looking for the "lowest tech" solution that I can find.
The situation is that I have 2 databases that should have the same data in them, but they are updated primarily when they are not able to contact each other. I suspect that there is some sort of clustering or master/slave thing that would be able to sync them just fine. However, in my case that is major overkill, as this is just a scratch DB for my own use.
What is a good way to do this?
My current approach is to have a Federated table on one of them and, every so often, stuff the data over the wire to the other with an INSERT ... SELECT. It gets a bit convoluted trying to deal with primary keys and whatnot. (INSERT IGNORE seems to not work correctly.)
p.s. I can easily build a query that selects the rows to transfer.

MySQL's inbuilt replication is very easy to set up and works well even when the DBs are disconnected most of the time. I'd say configuring this would be much simpler than any custom solution out there.
See http://www.howtoforge.com/mysql_database_replication for instructions, you should be up and running in 10-15 mins and you won't have to think about it again.
The only downside I can see is that it is asynchronous and one-way, i.e. you must have one designated master that gets all the changes.

My current solution is:
1. set up a federated table on the source box that grabs the table on the target box
2. set up a view on the source box that selects the rows to be updated (as a join of the federated table)
3. set up another federated table on the target box that grabs the view on the source box
4. issue an INSERT ... SELECT ... ON DUPLICATE KEY UPDATE on the target box to run the pull.
I guess I could just grab the source table and do it all in one shot, but based on the query logs I've been seeing, I'm guessing that I'd end up with about 20K queries being run or about 100-300MB of data transfer, depending on how things happen. The above setup should result in about 4 queries and little more data transferred than actually needs to be.
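As a rough illustration of the final pull step, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The table names (source_items, target_items) and the updated_at column are assumptions made up for the example, and two local tables play the role of the federated tables. SQLite (3.24 or later) spells the upsert as ON CONFLICT ... DO UPDATE; in MySQL the same idea is written with ON DUPLICATE KEY UPDATE.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory DB with a 'source' and a 'target' copy of one table."""
    conn = sqlite3.connect(":memory:")
    for name in ("source_items", "target_items"):
        conn.execute(
            f"CREATE TABLE {name} (id INTEGER PRIMARY KEY, val TEXT, updated_at INTEGER)"
        )
    conn.executemany(
        "INSERT INTO source_items VALUES (?, ?, ?)",
        [(1, "a", 1), (2, "b-new", 5), (3, "c", 1)],
    )
    conn.executemany(
        "INSERT INTO target_items VALUES (?, ?, ?)",
        [(2, "b-old", 1), (4, "d", 1)],
    )
    conn.commit()
    return conn


def pull_changes(conn: sqlite3.Connection) -> None:
    """Copy rows that are missing or stale on the target, in a single statement."""
    conn.execute(
        """
        INSERT INTO target_items (id, val, updated_at)
        SELECT s.id, s.val, s.updated_at
        FROM source_items AS s
        LEFT JOIN target_items AS t ON t.id = s.id
        WHERE t.id IS NULL OR s.updated_at > t.updated_at
        ON CONFLICT(id) DO UPDATE SET
            val = excluded.val,
            updated_at = excluded.updated_at
        """
    )
    conn.commit()
```

The WHERE clause plays the part of the view that selects only the rows worth transferring, so rows that exist only on the target are left alone and running the pull twice changes nothing.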

Related

Data pipeline proposal

Our product has been growing steadily over the last few years, and we are now at a turning point as far as the data size of some of our tables is concerned: we expect these tables to double or triple in the next few months, and to grow even more in the next few years. We are talking in the range of 1.4M now, so over 3M by the end of the summer and (since we expect growth to be exponential) around 10M at the end of the year. (M being million, not mega/1000).
The table we are talking about is sort of a logging table. The application receives data files (csv/xls) on a daily basis and the data is transferred into said table. It is then used in the application for a specific amount of time - a couple of weeks/months - after which it becomes rather redundant. That is: if all goes well. If there is some problem down the road, the data in the rows can be useful to inspect for problem solving.
What we would like to do is periodically clean up the table, removing any number of rows based on certain requirements, but instead of actually deleting the rows move them 'somewhere else'.
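To make the intended "move instead of delete" concrete, here is a minimal sketch using Python's built-in sqlite3 module in place of MySQL. The table names (import_log, import_log_archive), the columns and the date-based cutoff are all assumptions invented for the example; the real selection criteria would be whatever the requirements dictate.

```python
import sqlite3


def make_tables(conn: sqlite3.Connection) -> None:
    """Create a live table and an archive table with the same columns."""
    for name in ("import_log", "import_log_archive"):
        conn.execute(
            f"CREATE TABLE {name} (id INTEGER PRIMARY KEY, payload TEXT, received_at TEXT)"
        )


def archive_old_rows(conn: sqlite3.Connection, cutoff: str) -> int:
    """Move rows received before `cutoff` (ISO date string) into the archive.

    Both statements run in one transaction, so a failure part-way through
    neither loses rows nor duplicates them.
    """
    with conn:  # commits on success, rolls back on an exception
        conn.execute(
            "INSERT INTO import_log_archive (id, payload, received_at) "
            "SELECT id, payload, received_at FROM import_log WHERE received_at < ?",
            (cutoff,),
        )
        cur = conn.execute("DELETE FROM import_log WHERE received_at < ?", (cutoff,))
    return cur.rowcount  # number of rows removed from the live table
```

The archive stays queryable with plain SQL, and the live table only ever shrinks.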
We currently use MySQL as a database and the 'somewhere else' could be the same, but can be anything. For other projects we have a Master/Slave setup where the whole database is involved, but that's not what we want or need here. It's just some tables where the Master table would need to become shorter and the Slave only bigger, not a one-on-one sync.
The main requirement for the secondary store would be that the data should be easy to inspect/query when needed, either by SQL or another DSL, or just visual tooling. So we are not interested in backing up the data to one or more CSV files or another plain text format, since that is not as easy to inspect. The logs would then be somewhere on S3, so we would need to download them and grep/sed/awk through them... We'd much rather have something database-like that we can consult.
I hope the problem is clear?
For the record: while the solution can be anything we prefer to have the simplest solution possible. It's not that we don't want Apache Kafka (example), but then we'd have to learn it, install it, maintain it. Every new piece of technology adds onto our stack, the lighter it remains the more we like it ;).
Thanks!
PS: we are not just being lazy here, we have done some research but we just thought it'd be a good idea to get some more insight in the problem.

MySQL Database Structure with revisions/history

I've been looking into various DB structures for the task I'm trying to achieve, but it seems like my ideas are flawed. I first looked into the wiki's DB, but it seemed a bit complicated for what I want to do, and then I saw this, which looks closer to what I am trying to do.
I was thinking of having a table which will keep the final form and an extra table where it will keep all the revisions/history. I am not sure though if that would be too much. Although I am not sure the above example is using this method.
I've done something similar - a database table with the ability to fork into multiple revisions and unlimited undo capability, without slowing down the database. I used an additional table to keep track of the "change vectors". Each change can be undone.
There are several types of transactions, so your change table has to keep track. For example, the simple one would be value change. You record what the position (unique ID and column name) is, and the value before and after the change. During undo, the previous value is restored.
The most expensive change is the addition or removal of a column. This is where you would use external storage if you don't want to have an excessively large longtext/longblob column. A NoSQL database such as MongoDB is suitable for this use.
Hope this helps get you started.

MySQL structure for DBs larger than 10mm records

I am working with an application which has 3 tables, each with more than 10mm records and larger than 2GB.
Every time data is inserted there's at least one record added to each of the three tables and possibly more.
After every INSERT a script is launched which queries all these tables in order to extract data relevant to the last INSERT (let's call this the aggregation script).
What is the best way to divide the DB in smaller units and across different servers so that the load for each server is manageable?
Notes:
1. There are in excess of 10 inserts per second and hence the aggregation script is run the same number of times.
2. The aggregation script is resource intensive
3. The aggregation script has to be run on all the data in order to find which one is relevant to the last insert
4. I have not found a way of somehow dividing the DB into smaller units
5. I know very little about distributed DBs, so please use very basic terminology and provide links for further reading if possible
There are two answers to this from a database point of view.
1. Find a way of breaking up the database into smaller units. This is very dependent on how your database is used. It is really your best bet, because it's the only way to get the database to look at less data at once. This is called sharding:
http://en.wikipedia.org/wiki/Shard_(database_architecture)
2. Have multiple "slave" databases in read-only mode. These are basically copies of your database (with a little lag). Any read-only queries in your code for which that lag is acceptable can go to these copies instead. This takes some load off the master database, but any particular query will still be just as resource intensive.
From a programming perspective, you already have nearly all your information (aside from ids). You could try to find some way of using that information for all your needs rather than having to requery the database after insert. You could have some process that only creates ids that you query first. Imagine you have tables A, B, C. You would have other tables that only have primary keys that are A_ids, B_ids, C_ids. Step one, get new ids from the id tables. Step two, insert into A, B, C and do whatever else you want to do at the same time.
Also, the general efficiency/performance of all queries should be reviewed. Make sure you have indexes on anything you are querying. Run EXPLAIN on all the queries you are running to make sure they are using indexes.
This is really a mid-level/senior DBA type of thing to do. Ask around your company and have them lend you a hand and teach you.
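As a small illustration of checking index use, here is a sketch with Python's built-in sqlite3 module. SQLite's counterpart to MySQL's EXPLAIN is EXPLAIN QUERY PLAN, and its output format differs from MySQL's; the helper functions and any table or index names passed to them are assumptions for the example.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> list:
    """Return the human-readable plan lines SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # the last column holds the description


def uses_index(conn: sqlite3.Connection, sql: str, params: tuple, index_name: str) -> bool:
    """True if the named index appears anywhere in the plan."""
    return any(index_name in line for line in query_plan(conn, sql, params))
```

Running a check like this before and after adding an index on the filtered column shows whether the query switches from a full table scan to an index search.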

Migrating and comparing a SQL Server database

Today we downloaded RedGate's Toolbelt, in order to automate some database tasks that take a very long time in our company.
The first one involves a 15 GB database we have, with a lot of indexes, constraints and also several triggers. We want to migrate this database exactly, with the schema, all the data, triggers, etc., to a new DB, with the idea of reducing its size and also getting better performance by leaving behind all the mistakes made in the past. Unfortunately this was the first customer release DB of one of our products, and we used it to test a lot of things that did not always work very well. We are sure that if we do something like this, we will get more than 50% of the size back on our disk.
Can one of the Toolbelt tools, or several combined, be used to do this? If not, is there another tool available for this task?
One common way this can happen is if you are not selecting all your tables to be included in the compare. For example, you may have selected a child table and not the parent table. This could lead to a FK error like you describe.

Never delete entries? Good idea? Usual?

I am designing a system and I don't think it's a good idea to give the end user the ability to delete entries in the database. I think that way because often the end user, once given admin rights, might end up making a mess in the database and then turn to me to fix it.
Of course, they will need to be able to remove entries, or at least think that they did, if they are set as admin.
So, I was thinking that all the entries in the database should have an "active" field. If they try to remove an entry, it will just set the flag to "false" or something similar. Then there will be some kind of super admin that would be my company's team who could change this field.
I already saw that in another company I worked for, but I was wondering if it was a good idea. I could just make regular database backups and then roll back if they commit an error and adding this field would add some complexity to all the queries.
What do you think? Should I do it that way? Do you use this kind of trick in your applications?
In one of our databases, we distinguished between transactional and dictionary records.
In a couple of words, transactional records are things that you cannot roll back in real life, like a call from a customer. You can change the caller's name, status etc., but you cannot dismiss the call itself.
Dictionary records are things that you can change, like assigning a city to a customer.
Transactional records and things that lead to them were never deleted, while dictionary ones could be deleted all right.
By "things that lead to them" I mean that as soon as the record appears in the business rules which can lead to a transactional record, this record also becomes transactional.
Like, a city can be deleted from the database. But when a rule appeared that said "send an SMS to all customers in Moscow", the cities became transactional records as well, or we would not be able to answer the question "why did this SMS get sent".
A rule of thumb for distinguishing was this: is it only my company's business?
If one of my employees made a decision based on data from the database (like, he made a report based on which some management decision was made, and then the data the report was based on disappeared), it was considered OK to delete these data.
But if the decision affected some immediate actions with customers (like calling, messing with the customer's balance etc.), everything that led to these decisions was kept forever.
It may vary from one business model to another: sometimes it may be required to record even internal data, sometimes it's OK to delete data that affects the outside world.
But for our business model, the rule from above worked fine.
A couple of reasons people do things like this are auditing and automated rollback. If a row is completely deleted then there's no way to automatically roll back that deletion if it was in error. Also, keeping a row and its previous state around is important for auditing - a super user should be able to see who deleted what and when, as well as who changed what, etc.
Of course, that's all dependent on your current application's business logic. Some applications have no need for auditing and it may be proper to fully delete a row.
The downside to just setting a flag such as IsActive or DeletedDate is that all of your queries must take that flag into account when pulling data. This makes it more likely that another programmer will accidentally forget this flag when writing reports...
A slightly better alternative is to archive that record into a different database. This way it's been physically moved to a location that is not normally searched. You might add a couple fields to capture who deleted it and when; but the point is it won't be polluting your main database.
Further, you could provide an undo feature to bring it back fairly quickly; and do a permanent delete after 30 days or something like that.
UPDATE concerning views:
With views, the data still participates in your indexing scheme. If the amount of potentially deleted data is small, views may be just fine as they are simpler from a coding perspective.
I prefer the method that you are describing. It's nice to be able to undo a mistake. More often than not, there is no easy way of going back on a DELETE query. I've never had a problem with this method, and unless you are filling your database with 'deleted' entries, there shouldn't be an issue.
I use a combination of techniques to work around this issue. For some things, adding the extra "active" field makes sense. Then the user has the impression that an item was deleted because it no longer shows up on the application screen. The scenarios where I would implement this include items for which a history must be kept... let's say invoices and payments. I wouldn't want such things being deleted for any reason.
However, there are some items in the database that are not so sensitive, let's say a list of categories that I want to be dynamic... I may then allow users with admin privileges to add and delete a category, and the delete could be permanent. However, as part of the application logic I will check whether the category is used anywhere before allowing the delete.
I suggest having a second database like DB_Archives where you add every row deleted from DB. The is_active field negates the very purpose of foreign key constraints, and YOU have to make sure that a row is not marked as deleted while it is still referenced elsewhere. This becomes overly complicated when your DB structure is massive.
This is an accepted practice that exists in many applications (Drupal's versioning system, et al.). Since MySQL scales very quickly and easily, you should be okay.
I've been working on a project lately where all the data was kept in the DB as well. The status of each individual row was kept in an integer field (data could be active, deleted, in_need_for_manual_correction, historic).
You should consider using views to access only the active/historic/... data in each table. That way your queries won't get more complicated.
Another thing that made things easy was the use of UPDATE/INSERT/DELETE triggers that handled all the flag changing inside the DB and thus kept the complex stuff out of the application (for the most part).
I should mention that the DB was an MSSQL 2005 server, but I guess the same approach should work with MySQL, too.
Yes and no.
It will complicate your application much more than you expect, since every table that does not allow deletion will sit behind an extra check (IsDeleted=false) etc. That does not sound like much, but when you build a larger application and, in a query over 11 tables, 9 require a non-deletion check... it's tedious and error prone. (Well yeah, then there are deleted/non-deleted views... when you remember to create and use them.)
Some schema upgrades will become a PITA, since you'll have to relax FKs and invent "suitable" data for very, very old data.
I've not tried it, but I have thought a moderate amount about a solution where you'd zip the row data to XML and store that in some "Historical" table. Then in case of "must have that restored now OMG the world is dying!1eleven" it's possible to dig it out.
I agree with all respondents that if you can afford to keep old data around forever it's a good idea; for performance and simplicity, I agree with the suggestion of moving "logically deleted" records to "old stuff" tables rather than adding "is_deleted" flags (moving to a totally different database seems a bit like overkill, but you can easily change to that more drastic approach later if eventually the amount of accumulated data turns out to be a problem for a single db with normal and "old stuff" tables).