Managing Historical Data Dependencies

Managing Historical Data Dependencies - mysql

3 Tables: Device, SoftwareRevision, Message. All data entered is handed by PHP scripts on an Apache server.
A device can have one software revision. A software revision can have many devices. A device can have many messages. A message can have one device.
Something like above.
The issue is, the SoftwareRevision changes how the message is used in the front end application. This means that when the software is updated on the device, we need older messages to retain the information that they were received from a different software revision.
The TL;DR here is that the fully normalized way I see of doing this becomes a real pain. I've got about 5 of these situations in my current project and 3 of them are nested inside of each other.
I see three ways of doing this:
The first is the above fully normalized way. In order to find out how to use the message on the front end application, one must find the latest entry into Device_SoftwareRevision_Records that is before the datetime of the given message. This gets really fiddly when you have a more complex database and application. Just to get the current SoftwareRevision_ID for a device you have to use a MAX GROUP BY type statement (I've ended up having to use views to simplify).
The second is to directly link the Message to the SoftwareVersion. This means you don't have to go through the whole MAX GROUP BY WHERE blah blah. The SoftwareVersion_ID is retrieved by a PHP script and then the message is entered. Of course, this is denormalized so now there is potential for duplicate data.
Aaaand heres our fully denormalized version. The Software_Revision_Records table is purely for bookkeeping purposes. Easy to use for the front-end application but a pain to update at the back-end. The back-end updating can actually be streamlined with triggers for entering into the Software_Revision_Records table so the only thing that can really go wrong is the message gets the wrong software revision when it is entered.
Is there a better way of doing this that I have missed? Is it such a sin to denormalize the database in this situation? Will my decision here cause the business to erupt into flames (probably not)?

If the messages are tied to the software revision for that particular device, then it might make more sense to reflect that relationship in the data model. i.e. have a foreign key from Messages to Device_SoftwareRevision_Records rather than from Messages to Device. You still have the relationship from Messages to Device indirectly, it's normalised, and there's no messing around with dates trying to figure out which messages were created while a given software revision was in place.
In cases where you do need dates, it might also be worth considering having both a start and stop date, and filling in any null dates with something like 9999-12-31 (to indicate that a record has not yet been ended). You can easily find the latest record without needing to do a max. It will also make it a lot easier to query the table if you do need to compare it to other dates - you can just do a between on a single record. In this example, you'd just look for this:
where Message.TimeStamp between Device_SoftwareRevision_Records.StartDate and Device_SoftwareRevision_Records.EndDate
That said, I would still - if at all possible - change the model to relate Messages to the correct table rather than rely on dates. Being able to do simple joins will be quicker, more convenient, more obvious if anyone new needs to learn the structure, and is likely to perform better.

Related

Recurring data demand - automated query, or store data directly in SQL?

This is a simple question even though the title sounds complicated.
Let's say I'm storing data from a bunch of applications into one central database/ data warehouse. This is data at a pretty fine level -- say, daily summaries of various metrics.
HOWEVER, I know in the front-end I will be frequently displaying weekly and monthly aggregates of this data as well.
One idea would be to have scripting language do this for me after querying the SQL database - but that seems horribly inefficient, perhaps.
The second idea would be to have views in the database that represent business weeks and months -- this might be the best way to do it.
But my final idea is -- couldn't a SQL client simply run a query that aggregates all the daily data into weeks (or months) and store them in a separate table? The advantage of this is that it would reduce querying time of any user, since all the query work is done before a website or button is even loaded/ pushed. Even with a view, I guess that aggregation calculation would have to be done as soon as the view was queried.
The only downside to having the queries aggregated from the weeks/ months perhaps even once a day (instead of every time the website is loaded) -- is that it won't be up-to-date/ may reflect inconsistencies.
I'm not really an expert when it comes to this bigger picture stuff -- anyone have any thoughts? thanks

It depends on the user experience you're trying to create.
Is the user base expecting to watch monthly aggregates with one finger on the F5 key when watching this month's statistics? To cover this scenario, you might want to have a view with criteria that presents a window always relative to getdate(). Keeping in mind that with good indexing strategies and query design should mitigate the impact of this sort of approach to nearly nothing.
Is the user expecting informational data that doesn't include today's data? More performance might be seen out of a nightly job that does the aggregation into a new table.
Of all the scenarios, though, I would not recommend manual aggregation. Down that road are unexpected bugs and exceptions that can really be handled with a good SQL statement. Aggregates are a big part of all DBMSs', let their software handle that and work on the rest of your application.

MySQL database activity log: fields vs table

So basically I am in the process of creating a personal finance tracking system. It occurred to be that keeping tabs on when each instance and transaction was last edited or updated might be of relevant information some day.
Now as far as I can see there are two approaches to implement something like this:
Create "updated" fields to all the tables I want to keep track of and then let mysql update those fields for me (ON UPDATE clause)
Create a completely seperate table for holding the log data and then update that with a triggers and transactions
Now it seems that 1st approach would have the benefit of keeping things simple and easy to maintain. However how this will impact the performance if I suddenly decide to get every log in the database for review. Also this would kind of goes against normalization (not by much though) with same data stored in multiple tables.
The second approach would allow more flexibility to the logging system and might actually shorten the sql query necessary to retrieve certain data. However it would make the schema more complex as two additional tables would have to be created (the actual log table and many-to-many relation table for holding the keys) and maintained. On the other hand if I ever want to implement an activity history this approach would propably be the only one capable of doing it.
As such I would like to know some more pros and cons to each method. Since 2nd option allows more flexibility I am considering implementing it but I am not sure about performance issues. In the end it comes down to two guestions:
Are there any real life examples where both approaches are
implemented?
And:
Are there any studies, comparisons or other resource that might shed
some light on which is considered more performance friendly and "best
practices" approach?

It depends on what kind of reporting you need and your current architecture.
If you just want to know last update date, then having 2 fields (creation date and last update) should be enough. That's because having separate table won't give any perfomance boost, but will make your code harder to maintain.
It's another story if you want to have something more elaborate, like reporting differences (what was changed) and/or have full change log on each transaction (there might be few updates to one transaction, right?). In this case you actually must have separate table, because otherwise it will bloat your table and reduce perfomance.
Based on my experience, I'd go with separate table. That's because it will be easier to maintain - your logging logic will be practically separated from everything else and I think one day you'll need that additional info on your transactions and full transaction history.
As far as perfomance goes, you won't notice any formidable difference unless your system will be under serious load. But as your system is personal, any choice would suffice, just don't forget about proper indexing.
Note that I'm making alot of assumptions here, so if you want something more specific, please provide your actual architecture and reporting needs. I'd suggest some books on high availability/perfomance, but they are not on your specific needs, but on general availability/perfomance.

MYSQL - Database Design Large-scale real world deployment

I would love to hear some opinions or thoughts on a mysql database design.
Basically, I have a tomcat server which recieves different types of data from about 1000 systems out in the field. Each of these systems are unique, and will be reporting unique data.
The data sent can be categorized as frequent, and unfrequent data. The unfrequent data is only sent about once a day and doesn't change much - it is basically just configuration based data.
Frequent data, is sent every 2-3 minutes while the system is turned on. And represents the current state of the system.
This data needs to be databased for each system, and be accessible at any given time from a php page. Essentially for any system in the field, a PHP page needs to be able to access all the data on that client system and display it. In other words, the database needs to show the state of the system.
The information itself is all text-based, and there is a lot of it. The config data (that doesn't change much) is key-value pairs and there is currently about 100 of them.
My idea for the design was to have 100+ columns, and 1 row for each system to hold the config data. But I am worried about having that many columns, mainly because it isn't too future proof if I need to add columns in the future. I am also worried about insert speed if I do it that way. This might blow out to a 2000row x 200column table that gets accessed about 100 times a second so I need to cater for this in my initial design.
I am also wondering, if there is any design philosophies out there that cater for frequently changing, and seldomly changing data based on the engine. This would make sense as I want to keep INSERT/UPDATE time low, and I don't care too much about the SELECT time from php.
I would also love to know how to split up data. I.e. if frequently changing data can be categorised in a few different ways should I have a bunch of tables, representing the data and join them on selects? I am worried about this because I will probably have to make a report to show common properties between all systems (i.e. show all systems with a certain condition).
I hope I have provided enough information here for someone to point me in the right direction, any help on the matter would be great. Or if someone has done something similar and can offer advise I would be very appreciative. Thanks heaps :)
~ Dan

I've posted some questions in a comment. It's hard to give you advice about your rapidly changing data without knowing more about what you're trying to do.
For your configuration data, don't use a 100-column table. Wide tables are notoriously hard to handle in production. Instead, use a four-column table containing these columns:
SYSTEM_ID VARCHAR System identifier
POSTTIME DATETIME The time the information was posted
NAME VARCHAR The name of the parameter
VALUE VARCHAR The value of the parameter
The first three of these columns are your composite primary key.
This design has the advantage that it grows (or shrinks) as you add to (or subtract from) your configuration parameter set. It also allows for the storing of historical data. That means new data points can be INSERTed rather than UPDATEd, which is faster. You can run a daily or weekly job to delete history you're no longer interested in keeping.
(Edit if you really don't need history, get rid of the POSTTIME column and use MySQL's nice extension feature INSERT ON DUPLICATE KEY UPDATE when you post stuff. See http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html)
If your rapidly changing data is similar in form (name/value pairs) to your configuration data, you can use a similar schema to store it.
You may want to create a "current data" table using the MEMORY access method for this stuff. MEMORY tables are very fast to read and write because the data is all in RAM in your MySQL server. The downside is that a MySQL crash and restart will give you an empty table, with the previous contents lost. (MySQL servers crash very infrequently, but when they do they lose MEMORY table contents.)
You can run an occasional job (every few minutes or hours) to copy the contents of your MEMORY table to an on-disk table if you need to save history.
(Edit: You might consider adding memcached http://memcached.org/ to your web application system in the future to handle a high read rate, rather than constructing a database design for version 1 that handles a high read rate. That way you can see which parts of your overall app design have trouble scaling. I wish somebody had convinced me to do this in the past, rather than overdesigning for early versions. )

Medium-term temporary tables - creating tables on the fly to last 15-30 days?

Context
I'm currently developing a tool for managing orders and communicating between technicians and services. The industrial context is broadcast and TV. Multiple clients expecting media files each made to their own specs imply widely varying workflows even within the restricted scope of a single client's orders.
One client can ask one day for a single SD file and the next for a full-blown HD package containing up to fourteen files... In a MySQL db I am trying to store accurate information about all the small tasks composing the workflow, in multiple forms:
DATETIME values every time a task is accomplished, for accurate tracking
paths to the newly created files in the company's file system in VARCHARs
archiving background info in TEXT values (info such as user comments, e.g. when an incident happens and prevents moving forward, they can comment about it in this feed)
Multiply that by 30 different file types and this is way too much for a single table. So I thought I'd break it up by client: one table per client so that any order only ever requires the use of that one table that doesn't manipulate more than 15 fields. Still, this a pretty rigid solution when a client has 9 different transcoding specs and that a particular order only requires one. I figure I'd need to add flags fields for each transcoding field to indicate which ones are required for that particular order.
Concept
I then had this crazy idea that maybe I could create a temporary table to last while the order is running (that can range from about 1 day to 1 month). We rarely have more than 25 orders running simultaneously so it wouldn't get too crowded.
The idea is to make a table tailored for each order, eliminating the need for flags and unnecessary forever empty fields. Once the order is complete the table would get flushed, JSON-encoded, into a TEXT or BLOB so it can be restored later if changes need made.
Do you have experience with DBMS's (MySQL in particular) struggling from such practices if it has ever existed? Does this sound like a viable option? I am happy to try (which I already started) and I am seeking advice so as to keep going or stop right here.
Thanks for your input!

Well, of course that is possible to do. However, you can not use the MySQL temporary tables for such long-term storage, you will have to use "normal" tables, and have some clean-up routine...
However, I do not see why that amount of data would be too much for a single table. If your queries start to run slow due to much data, then you should add some indexes to your database. I also think there is another con: It will be much harder to build reports later on, when you have 25 tables with the same kind of data, you will have to run 25 queries and merge the data.
I do not see the point, really. The same kinds of data should be in the same table.

Never delete entries? Good idea? Usual?

I am designing a system and I don't think it's a good idea to give the ability to the end user to delete entries in the database. I think that way because often then end user, once given admin rights, might end up making a mess in the database and then turn to me to fix it.
Of course, they will need to be able to do remove entries or at least think that they did if they are set as admin.
So, I was thinking that all the entries in the database should have an "active" field. If they try to remove an entry, it will just set the flag to "false" or something similar. Then there will be some kind of super admin that would be my company's team who could change this field.
I already saw that in another company I worked for, but I was wondering if it was a good idea. I could just make regular database backups and then roll back if they commit an error and adding this field would add some complexity to all the queries.
What do you think? Should I do it that way? Do you use this kind of trick in your applications?

In one of our databases, we distinguished between transactional and dictionary records.
In a couple of words, transactional records are things that you cannot roll back in real life, like a call from a customer. You can change the caller's name, status etc., but you cannot dismiss the call itself.
Dictionary records are things that you can change, like assigning a city to a customer.
Transactional records and things that lead to them were never deleted, while dictionary ones could be deleted all right.
By "things that lead to them" I mean that as soon as the record appears in the business rules which can lead to a transactional record, this record also becomes transactional.
Like, a city can be deleted from the database. But when a rule appeared that said "send an SMS to all customers in Moscow", the cities became transactional records as well, or we would not be able to answer the question "why did this SMS get sent".
A rule of thumb for distinguishing was this: is it only my company's business?
If one of my employees made a decision based on data from the database (like, he made a report based on which some management decision was made, and then the data report was based on disappeared), it was considered OK to delete these data.
But if the decision affected some immediate actions with customers (like calling, messing with the customer's balance etc.), everything that lead to these decisions was kept forever.
It may vary from one business model to another: sometimes, it may be required to record even internal data, sometimes it's OK to delete data that affects outside world.
But for our business model, the rule from above worked fine.

A couple reasons people do things like this is for auditing and automated rollback. If a row is completely deleted then there's no way to automatically rollback that deletion if it was in error. Also, keeping a row around and its previous state is important for auditing - a super user should be able to see who deleted what and when as well as who changed what, etc.
Of course, that's all dependent on your current application's business logic. Some applications have no need for auditing and it may be proper to fully delete a row.

The downside to just setting a flag such as IsActive or DeletedDate is that all of your queries must take that flag into account when pulling data. This makes it more likely that another programmer will accidentally forget this flag when writing reports...
A slightly better alternative is to archive that record into a different database. This way it's been physically moved to a location that is not normally searched. You might add a couple fields to capture who deleted it and when; but the point is it won't be polluting your main database.
Further, you could provide an undo feature to bring it back fairly quickly; and do a permanent delete after 30 days or something like that.
UPDATE concerning views:
With views, the data still participates in your indexing scheme. If the amount of potentially deleted data is small, views may be just fine as they are simpler from a coding perspective.

I prefer the method that you are describing. Its nice to be able to undo a mistake. More often than not, there is no easy way of going back on a DELETE query. I've never had a problem with this method and unless you are filling your database with 'deleted' entries, there shouldn't be an issue.

I use a combination of techniques to work around this issue. For some things adding the extra "active" field makes sense. Then the user has the impression that an item was deleted because it no longer shows up on the application screen. The scenarios where I would implement this would include items that are required to keep a history...lets say invoice and payment. I wouldn't want such things being deleted for any reason.
However, there are some items in the database that are not so sensitive, lets say a list of categories that I want to be dynamic...I may then have users with admin privileges be allowed to add and delete a category and the delete could be permanent. However, as part of the application logic I will check if the category is used anywhere before allowing the delete.

I suggest having a second database like DB_Archives whre you add every row deleted from DB. The is_active field negates the very purpose of foreign key constraints, and YOU have to make sure that this row is not marked as deleted when it's referenced elsewhere. This becomes overly complicated when your DB structure is massive.

There is an acceptable practice that exists in many applications (drupal's versioning system, et. al.). Since MySQL scales very quickly and easily, you should be okay.

I've been working on a project lately where all the data was kept in the DB as well. The status of each individual row was kept in an integer field (data could be active, deleted, in_need_for_manual_correction, historic).
You should consider using views to access only the active/historic/... data in each table. That way your queries won't get more complicated.
Another thing that made things easy was the use of UPDATE/INSERT/DELETE triggers that handled all the flag changing inside the DB and thus kept the complex stuff out of the application (for the most part).
I should mention that the DB was a MSSQL 2005 server, but i guess the same approach should work with mysql, too.

Yes and no.
It will complicate your application much more than you expect since every table that does not allow deletion will be behind extra check (IsDeleted=false) etc. It does not sound much but then when you build larger application and in query of 11 tables 9 require chech of non-deletion.. it's tedious and error prone. (Well yeah, then there are deleted/nondeleted views.. when you remember to do/use them)
Some schema upgrades will become PITA since you'll have to relax FK:s and invent "suitable" data for very, very old data.
I've not tried, but have thought a moderate amount about solution where you'd zip the row data to xml and store that in some "Historical" table. Then in case of "must have that restored now OMG the world is dying!1eleven" it's possible to dig out.

I agree with all respondents that if you can afford to keep old data around forever it's a good idea; for performance and simplicity, I agree with the suggestion of moving "logically deleted" records to "old stuff" tables rather than adding "is_deleted" flags (moving to a totally different database seems a bit like overkill, but you can easily change to that more drastic approach later if eventually the amount of accumulated data turns out to be a problem for a single db with normal and "old stuff" tables).

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008