How to get changes made to a MySQL table during a time interval? - mysql

I am essentially creating a data warehouse.
For the warehouse to remain consistent with the source data, I have to pull changes daily from the source MySQL DBs.
My problem is that some source MySQL tables have no 'lastupdated' equivalent column.
How can I pull changes in this scenario?

In a data warehouse, in order to capture changes in the target system, there has to be a way to identify the changed and new records on the source side. This is normally done with the help of some flag or lastupdated column. If neither of these is present and the table is small, you can consider truncating and reloading the entire data from source to target. This may not be feasible for large tables, though.
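For a small table, the truncate-and-reload can be as simple as this sketch (the schema and table names are hypothetical, and it assumes the source and warehouse schemas are reachable from the same connection):

TRUNCATE TABLE warehouse.customers;
INSERT INTO warehouse.customers
SELECT * FROM source_db.customers;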
You can also refer to some other techniques mentioned in the blog below:
https://www.hvr-software.com/blog/change-data-capture/

Related

mysql order database use 2 tables for current and history?

I have a question about performance. I have a table that contains order headers; whenever an order is shipped and invoiced, the row is moved from orderheaders to orderheaders_history. This was designed like this by someone else. Now I'm thinking that one column in the orderheaders table to define whether the row is current or history would be more efficient. Which way is actually better?
Thanks!
Historical data in a separate table makes sense only if you have a lot of records. What counts as "a lot" can only be determined by you. If your queries are getting slower and having historical data grouped with your more current data is not a business need, then you can separate them. But generally speaking, I would keep a single table with both current and historical data, and only if I could build a solid case for it would I go for a historical table.
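As a rough sketch, that flag approach could look like this (the column and index names here are assumptions):

ALTER TABLE orderheaders
  ADD COLUMN is_history TINYINT(1) NOT NULL DEFAULT 0,
  ADD INDEX idx_is_history (is_history);

-- marking an order as shipped and invoiced
UPDATE orderheaders SET is_history = 1 WHERE id = 12345;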
Once you hit a critical mass of records and you need to put your records in separate places, you can then start dumping your data onto cheaper, slower hard drives (though not necessarily less reliable ones), which will enable you to have a lower total cost of data storage while still having all data ready for use.
Finally, if you already have a working, automated process around this setup, I would not go and change it just for the sake of changing it. Changing a database is risky and you don't want to lose or corrupt data without good reason. If the process does create overhead and you do not benefit from the separation of these tables, then I would suggest you consider reverting to a simpler table structure.

MYSQL - Database Design Large-scale real world deployment

I would love to hear some opinions or thoughts on a mysql database design.
Basically, I have a Tomcat server which receives different types of data from about 1000 systems out in the field. Each of these systems is unique, and will be reporting unique data.
The data sent can be categorized as frequent and infrequent data. The infrequent data is only sent about once a day and doesn't change much; it is basically just configuration-based data.
Frequent data is sent every 2-3 minutes while the system is turned on, and represents the current state of the system.
This data needs to be databased for each system, and be accessible at any given time from a php page. Essentially for any system in the field, a PHP page needs to be able to access all the data on that client system and display it. In other words, the database needs to show the state of the system.
The information itself is all text-based, and there is a lot of it. The config data (that doesn't change much) consists of key-value pairs, and there are currently about 100 of them.
My idea for the design was to have 100+ columns, and 1 row for each system to hold the config data. But I am worried about having that many columns, mainly because it isn't very future-proof if I need to add columns later. I am also worried about insert speed if I do it that way. This might blow out to a 2000-row x 200-column table that gets accessed about 100 times a second, so I need to cater for this in my initial design.
I am also wondering if there are any design philosophies out there that cater for frequently changing and seldom-changing data based on the storage engine. That would make sense, as I want to keep INSERT/UPDATE time low, and I don't care too much about the SELECT time from PHP.
I would also love to know how to split up the data. I.e., if frequently changing data can be categorised in a few different ways, should I have a bunch of tables representing the data and join them on selects? I am worried about this because I will probably have to make a report showing common properties between all systems (i.e. show all systems with a certain condition).
I hope I have provided enough information here for someone to point me in the right direction; any help on the matter would be great. Or if someone has done something similar and can offer advice I would be very appreciative. Thanks heaps :)
~ Dan
I've posted some questions in a comment. It's hard to give you advice about your rapidly changing data without knowing more about what you're trying to do.
For your configuration data, don't use a 100-column table. Wide tables are notoriously hard to handle in production. Instead, use a four-column table containing these columns:
SYSTEM_ID VARCHAR System identifier
POSTTIME DATETIME The time the information was posted
NAME VARCHAR The name of the parameter
VALUE VARCHAR The value of the parameter
The first three of these columns are your composite primary key.
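A minimal sketch of that table in MySQL (the table name and column sizes are assumptions):

CREATE TABLE system_config (
  system_id VARCHAR(32) NOT NULL,   -- system identifier
  posttime DATETIME NOT NULL,       -- when the value was posted
  name VARCHAR(64) NOT NULL,        -- parameter name
  value VARCHAR(255) NOT NULL,      -- parameter value
  PRIMARY KEY (system_id, posttime, name)
);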
This design has the advantage that it grows (or shrinks) as you add to (or subtract from) your configuration parameter set. It also allows for the storing of historical data. That means new data points can be INSERTed rather than UPDATEd, which is faster. You can run a daily or weekly job to delete history you're no longer interested in keeping.
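That cleanup job could be a single statement like this (the 30-day retention window is an arbitrary assumption):

DELETE FROM system_config
WHERE posttime < NOW() - INTERVAL 30 DAY;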
(Edit: if you really don't need history, get rid of the POSTTIME column and use MySQL's nice extension feature INSERT ... ON DUPLICATE KEY UPDATE when you post stuff. See http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html)
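A sketch of that history-free variant, assuming the primary key then becomes (system_id, name):

INSERT INTO system_config (system_id, name, value)
VALUES ('sys-001', 'poll_interval', '120')
ON DUPLICATE KEY UPDATE value = VALUES(value);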
If your rapidly changing data is similar in form (name/value pairs) to your configuration data, you can use a similar schema to store it.
You may want to create a "current data" table using the MEMORY access method for this stuff. MEMORY tables are very fast to read and write because the data is all in RAM in your MySQL server. The downside is that a MySQL crash and restart will give you an empty table, with the previous contents lost. (MySQL servers crash very infrequently, but when they do they lose MEMORY table contents.)
You can run an occasional job (every few minutes or hours) to copy the contents of your MEMORY table to an on-disk table if you need to save history.
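As a sketch, that combination might look like this (the table names and snapshot query are assumptions; state_history is a normal on-disk table with the same columns plus a snapshot timestamp):

CREATE TABLE current_state (
  system_id VARCHAR(32) NOT NULL,
  name VARCHAR(64) NOT NULL,
  value VARCHAR(255) NOT NULL,
  PRIMARY KEY (system_id, name)
) ENGINE=MEMORY;

-- occasional job: snapshot the in-memory state to disk
INSERT INTO state_history (system_id, name, value, posttime)
SELECT system_id, name, value, NOW() FROM current_state;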
(Edit: You might consider adding memcached (http://memcached.org/) to your web application system in the future to handle a high read rate, rather than constructing a database design for version 1 that handles a high read rate. That way you can see which parts of your overall app design have trouble scaling. I wish somebody had convinced me to do this in the past, rather than overdesigning for early versions.)

Cross Stream Data changes - EDW

I have a scenario where Data Stream B is dependent on Data Stream A. Whenever there is a change in Data Stream A, Stream B must be re-processed. So a common process is required to identify the changes across data streams and trigger the re-processing tasks.
Is there a good way to do this besides triggers?
Your question is rather unclear, and I think any answer depends very heavily on what your data looks like, how you load it, how you can identify changes, whether you need to show multiple versions of one fact or dimension value to users, and so on.
Here is a short description of how we handle it; it may or may not help you:
1. We load raw data incrementally daily, i.e. we load all data generated in the last 24 hours in the source system (I'm glossing over timing issues, but they aren't important here).
2. We insert the raw data into a loading table; that table already contains all data that we have previously loaded from the same source.
3. If rows are completely new (i.e. the PK value in the raw data is new), they are processed normally.
4. If we find a row where we already have the PK in the table, we know it is an updated version of data that we've already processed (see the sketch after this list).
5. Where we find updated data, we flag it for special processing and re-generate any data depending on it (this is all done in stored procedures).
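A minimal sketch of how step 4's check might look in SQL, assuming hypothetical loading_table and processed_data tables keyed on order_id, with a needs_reprocess flag column:

-- flag incoming rows whose PK already exists in the processed data
UPDATE loading_table l
JOIN processed_data p ON p.order_id = l.order_id
SET l.needs_reprocess = 1;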
I think you're asking how to do step 5, but it depends on the data that changes and what your users expect to happen. For example, if one item in an order changes, we re-process the entire order to ensure that the order-level values are correct. If a customer's address changes, we have to re-assign them to a new sales region.
There is no generic way to identify data changes and process them, because everyone's data and requirements are different and everyone has a different toolset and different constraints and so on.
If you can make your question more specific, then maybe you'll get a better answer. E.g., if you already have a working solution based on triggers, then why do you want to change it? What problem are you having that is making you look for an alternative?

Read vs Write tables database design

I have a user activity tracking log table where I log all user activity as it occurs. This is an extremely high-write table due to the in-depth, click-by-click tracking. Up to here, the database design is perfect. The problem is the next step.
I need to output the data for the business folks, and these people can query to fetch past activity data. Hence there is medium-to-high read traffic as well. I do not like the idea of reading and writing from the same high-traffic table.
So ideally I want to split the tables: the first one for quick writes (few to no FKs), then copy that data over, fully formatted and with all the labels for the IDs pulled in, to a read table for reading use.
So questions:
1) Is this the best approach for me?
2) If I do keep 2 tables, how do I keep them in sync? I can't copy the data to the read table the instant it is written to the write table, as that would defeat the whole purpose of having separate tables. Nor can I let the read table go stale, because the tracked activity data links with other user data like session_id, etc., so if these IDs are not ready when their use case calls for them, the writes will fail.
I am using MySQL for user data and HBase for some large tables, with PHP CodeIgniter for my app.
Thanks.
Yes, having 2 separate tables is the best approach. I had the same problem to solve a few months ago, though for a daemon-type application and not a website.
Eventually I ended up with one MEMORY table keeping "live" data which is inserted/updated/deleted on almost every event, and another table holding duplicates of the live data rows, but without the unnecessary system columns: my history table, which was used for reading only, per request.
The live table is only relevant to the running process, so I don't care if the contained data is lost due to a server failure; whatever data needs to be read later is already stored in the history table. So there's no problem in duplicating the data in the two tables: your goal is performance, not normalization.
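As a sketch of that layout under hypothetical names (activity_live holds the in-memory working set; activity_history is the durable read copy):

-- live table: read/written on almost every event, contents lost on restart
CREATE TABLE activity_live (
  session_id VARCHAR(64) NOT NULL,
  action VARCHAR(32) NOT NULL,
  created_at DATETIME NOT NULL
) ENGINE=MEMORY;

-- history table: durable, read per request
CREATE TABLE activity_history (
  session_id VARCHAR(64) NOT NULL,
  action VARCHAR(32) NOT NULL,
  created_at DATETIME NOT NULL,
  INDEX idx_session (session_id)
) ENGINE=InnoDB;

Each event is then written to both tables; reads for reporting hit only activity_history.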

How to design a SQL db with undo-redo?

I'm trying to figure out how to design my DB tables to allow Undo-Redo.
Pretend you have a tasks table with the following structure:
id <int>
title <varchar>
memo <string>
date_added <datetime>
date_due <datetime>
Now assume that over a few days and multiple log-ins, several edits have taken place, but a user wants to go back to one of the earlier versions.
Would you have a separate table tracking the changes - or - would you try to keep the changes within the tasks table ("ghost" rows, for lack of a better term)?
Would you track all of the columns or just the ones that changed each time?
If it matters, I'm using MySQL. Also, if it matters, I'd like to be able to show the history (a la Photoshop) and allow a user to switch to any version.
Bonus question: Would you save the whole memo cell on a change or would you try to save the delta only? Reason I ask is because the memo cell could be large and only a single word or character might be changed each revision. Granted, saving the delta would require parsing, but if undos aren't expected very often, wouldn't it be better to save space rather than processing time?
Thank you for your help.
I would create a History table for your tasks table: same structure as tasks, plus a new field named previousId. This would hold the previous change's id, so you can go back and forth through different changes (undo/redo).
Why a new History table? For a simple reason: do not overload the tasks table with things it was not designed for.
As for space, in the History table, instead of a memo, use a binary format and zip the content of the text you want to store. Don't try to detect changes; you will run into buggy code which will result in frustration and wasted time...
Optimization:
Even better, you may keep only three columns in the History table:
1. taskId (foreign key to tasks)
2. data - a binary field. Before saving in the History table, create an XML string holding only the fields that have changed.
3. previousId (will help maintain a queue of changes and allow navigation back and forth)
As for data field, create an XML string like this:
<task>
<title>Title was changed</title>
<date_added>2011-03-26 01:29:22</date_added>
</task>
This will basically tell you that this time you changed only the title and the date_added fields.
After the XML string is built, just zip it if you want and store it into History table's data field.
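If you go the zipping route, MySQL's built-in COMPRESS() and UNCOMPRESS() functions are one option; here is a sketch against a hypothetical History table like the one below (the id values are made up):

INSERT INTO history (taskId, data, previousId)
VALUES (42, COMPRESS('<task><title>Title was changed</title></task>'), 7);

SELECT UNCOMPRESS(data) FROM history WHERE taskId = 42;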
XML also allows for flexibility: if you add or remove a field in the tasks table, you don't need to update the History table too. This way the structures of the tasks and History tables are decoupled, and you don't need to update two tables each time.
PS: don't forget to add some indexes to quickly navigate through the History table. Fields to be indexed: taskId and previousId, as you will need fast queries against this table.
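A minimal sketch of such a History table (the surrogate id column, types, and index names are assumptions added here; following the chain of previousId values gives you the back-and-forth navigation):

CREATE TABLE history (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  taskId INT NOT NULL,        -- foreign key to tasks
  data BLOB NOT NULL,         -- zipped XML of the changed fields
  previousId INT NULL,        -- previous change in the chain
  INDEX idx_taskId (taskId),
  INDEX idx_previousId (previousId),
  FOREIGN KEY (taskId) REFERENCES tasks (id)
);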
Hope this helps.
When I do similar types of things using SQL, I always use a second table for revision history. This prevents your primary table from becoming overly large with versions. The rationale is that retrieving the current record happens almost 100% of the time; viewing history and rolling back (undo) is very infrequent.
If you only have a single UNDO or history, then tracking in-table is probably fine.
Whether you want to save deltas or the entire cell depends on expected growth / usage. If you are comfortable creating the logic to manage deltas, that will save you space. If things don't really create new versions that often, I wouldn't start with that (applying YAGNI).
You might want to compress revisions in delta form but you should still have the current revision in full for quick retrieval.
However, older-to-newer deltas require lots of processing unless you have some non-delta base to start from. Newer-to-older deltas require reprocessing every time something changes. So deltas usually don't buy you many benefits, only greater complexity.
Last I checked, which was some years ago, MediaWiki, the software behind Wikipedia, stored full texts and provided some means to compress older revisions with gzip to save space, plus a dedicated archive table for deleted revisions / pages.
Their website has an ER diagram of their database layout which you might find useful.