My application serves customers which are online stores. One of the tables in my DB is "Product" and it has a column "In_Stock". This is a boolean (bit(1)) column. My customers send data feeds of their product catalog and each customer has their own version of this table. I would like to track changes to this In Stock column, something to the effect of...
11/13/2016 true
12/26/2016 false
01/07/2017 true
Just so that when I do some auditing, I can see for a given time period what was the state of a given product.
How best can I do this?
It seems overkill to create a separate history table and have it updated by a trigger just for one boolean column. Would a history column suffice? I can save the data there in some kind of JSON string.
Sorry, any workable solution will require a second table.
One such solution is Version Normal Form (vnf) which is a special case of 2nf. Consider your table containing the boolean field (assuming it is properly normalized to at least 3nf). Now you want to track the changes made to the boolean field. One way is to turn the rows into versions by adding an EffectiveDate column then, instead of updating the row, write a new row with the current date in the date field (or updating if the boolean field is unchanged).
This allows the tracking of the field, there being a new version for every time the field is changed. But there are severe disadvantages, not least of which is the fact that a row is no longer an entity, but a version of an entity. This makes is impossible to use a foreign key to this table as those want to refer to an entity.
But look carefully at the design. Before the change, you had a good, normalized table with no tracking of changes. After adding the EffectiveDate column, there has been a subtle change. All the fields except the boolean field are, as before, dependent only on the PK. The boolean field is dependent not only on the PK but the new date field as well. It is no longer is 2nf.
Normalizing the table requires moving the boolean field and the date field to a new table:
create table NewTable(
EntityID int not null references OriginalTable( ID ),
EffDate date not null,
TrackedCol boolean,
constraint PK_NewTable primary key( EntityID, EffDate )
);
The first version is inserted when a new row is inserted into the original table. From then on, another version is added only when an update to the original table changes the value of the boolean field.
Here is a previous answer that includes the query to get the current and any past values of the versioned data. I've discussed this design many times here.
Also, there is a way to structure the design so the application code doesn't need to be changed. That is, the redesign will be completely transparent to existing code. The answer linked above contains another link to more documentation to show how that is done.
I would do trigger thing. But don't replicate whole column - take unique column id, log timestamp and boolean value.
Sometimes having good logs is priceless :)
I've written an audit trail module for this purpose, it basically duplicates the table, add some information to each row and keep the original data table untouched except for triggers.
Related
I have a few tables storing their corresponding records for my system. For example, there could be a table called templates and logos. But for each table, one of the rows will be a default in the system. I would have normally added a is_default column for each table, but all of the rows except for 1 would have been 0.
Another colleague of mine sees another route, in which there is a system_defaults table. And that table has a column for each table. For example, this table would have a template_id column and a logo_id column. Then that column stores the corresponding default.
Is one way more correct than the other generally? The first way, there are many columns with the same value, except for 1. And the second, I suppose I just have to do a join to get the details, and the table grows sideways whenever I add a new table that has a default.
The solutions mainly differ in the ways to make sure that no more than one default value is assigned for each table.
is_default solution: Here it may happen that more than one record of a table has the value 1. It depends on the SQL dialect of your database whether this can be excluded by a constraint. As far as I understand MySQL, this kind of constraint can't be expressed there.
Separate table solution: Here you can easily make sure by your table design that at most one default is present per table. By assigning not null constraints, you can also force defaults for specific tables, or not. When you introduce a new table, you are extending your database (and the software working on it) anyway, so the additional attribute on the default table won't hurt.
A middle course might be the following: Have a table
Defaults
id
table_name
row_id
with one record per table, identified by the table name. Technically, the problem of more than one default per table may also occur here. But if you only insert records into this table when a new table gets introduced, then your operative software will only need to perform updates on this table, never inserts. You can easily check this via code inspection.
I currently have a non-temporal MySQL DB and need to change it to a temporal MySQL DB. In other words, I need to be able to retain a history of changes that have been made to a record over time for reporting purposes.
My first thought for implementing this was to simply do inserts into the tables instead of updates, and when I need to select the data, simply doing a GROUP BY on some column and ordering by the timestamp DESC.
However, after thinking about things a bit, I realized that that will really mess things up because the primary key for each insert (which would really just be simulating a number of updates on a single record) will be different and thus mess up any linkage that uses the primary key to link to other records in the DB.
As such, my next thought was to continue updating the main tables in the DB, but also create a new insert into an "audit table" that is simply a copy of the full record after the update, and then when I needed to report on temporal data, I could use the audit table for querying purposes.
Can someone please give me some guidance or links on how to properly do this?
Thank you.
Make the given table R temporal(ie, to maintain the history).
One design is to leave the table R as it is and create a new table R_Hist with valid_start_time and valid_end_time.
Valid time is the time when the fact is true.
The CRUD operations can be given as:
INSERT
Insert into both R
Insert into R_Hist with valid_end_time as infinity
UPDATE
Update in R
Insert into R_Hist with valid_end_time as infinity
Update valid_end_time with the current time for the “latest” tuple
DELETE
Delete from R
Update valid_end_time with the current time for the “latest” tuple
SELECT
Select from R for ‘snapshot’ queries (implicitly ‘latest’ timestamp)
Select from R_Hist for temporal operations
Instead, you can choose to design new table for every attribute of table R. By this particular design you can capture attribute level temporal data as opposed to entity level in the previous design. The CRUD operations are almost similar.
I did a column Deleted and a column DeletedDate. Deleted defaults to false and deleted date null.
Complex primary key on IDColumn, Deleted, and DeletedDate.
Can index by deleted so you have real fast queries.
No duplicate primary key on your IDColumn because your primary key includes deleted and deleted date.
Assumption: you won't write to the same record more than once a millisecond. Could cause duplicate primary key issue if deleted date is not unique.
So then I do a transaction type deal for updates: select row, take results, update specific values, then insert. Really its an update to deleted true deleted date to now() then you have it spit out the row after update and use that to get primary key and/or any values not available to whatever API you built.
Not as good as a temporal table and takes some discipline but it builds history into 1 table that is easy to report on.
I may start updating the deleted date column and change it to added/Deleted in addition to the added date so I can sort records by 1 column, the added/deleted column while always updated the addedBy column and just set the same value as the added/Deleted column for logging sake.
Either way could just do a complex case when not null as addedDate else addedDate as addedDate order by AddedDate desc. so, yeah, whatever, this works.
I am doing EF 6 code first with MVC 5. One of my classes has a property that can mean three things:
Confirmed by user
Has not answered yet
Declined by user
My question is, what should I use?
A nullable bool, obviously mapped to the choices above
An enum (the column would store an integer as a foreign key to another table listing the states)
Or two bool columns (HasAnswered, IsConfirmed) where IsConfirmed only gets accessed if the user has answered
I am very thankful for every opinion you might have.
Disclaimer.. I thought you were suggesting an enum as a column datatype
None of the above.
An enum.. what if want to add more data to each status or rename one?
Nullable bool .. as above, plus what if you need to add another status?
Two bool columns.. same as above, plus you could introduce normalisation issues.
I'd go with a tiny_int called status (or something similar).
Mainly for flexibility.. if you need the status titles in the DB or need to add any other data to each option you can place it in another table and foreign key it in. If you don't you just need to translate the numbers somewhere.
Another option is to separate out the actions (the answer) from the state.
Consider an answer table with a column indicating confirmation or denial and a reference to the original table. By joining the tables it is possible to separate out the original rows that have no answers and those that have been confirmed/denied.
UPDATE
In response to Null's comment, an int is so much more powerful..
A status database table:
id, title, description, priority, cost... (attach extra data to the status)
Or in PHP pseudo code (without a table):
$query = new Query('SELECT * FROM table WHERE status = :status');
$query->bind('status'=>TableClass::STATUS_ANSWERED)
I wasn't suggesting you translate the numbers everywhere.. just somewhere.
The final great thing about an int is it separates meaning from the workings.. what if I want to rename one of the enums? I change one row with a foreign key, or potentially millions with an enum.
I am considering designing a relational DB schema for a DB that never actually deletes anything (sets a deleted flag or something).
1) What metadata columns are typically used to accomodate such an architecture? Obviously a boolean flag for IsDeleted can be set. Or maybe just a timestamp in a Deleted column works better, or possibly both. I'm not sure which method will cause me more problems in the long run.
2) How are updates typically handled in such architectures? If you mark the old value as deleted and insert a new one, you will run into PK unique constraint issues (e.g. if you have PK column id, then the new row must have the same id as the one you just marked as invalid, or else all of your foreign keys in other tables for that id will be rendered useless).
If your goal is auditing, I'd create a shadow table for each table you have. Add some triggers that get fired on update and delete and insert a copy of the row into the shadow table.
Here are some additional questions that you'll also want to consider
How often do deletes occur. What's your performance budget like? This can affect your choices. The answer to your design will be different depending of if a user deleting a single row (like lets say an answer on a Q&A site vs deleting records on an hourly basis from a feed)
How are you going to expose the deleted records in your system. Is it only through administrative purposes or can any user see deleted records. This makes a difference because you'll probably need to come up with a filtering mechanism depending on the user.
How will foreign key constraints work. Can one table reference another table where there's a deleted record?
When you add or alter existing tables what happens to the deleted records?
Typically the systems that care a lot about audit use tables as Steve Prentice mentioned. It often has every field from the original table with all the constraints turned off. It often will have a action field to track updates vs deletes, and include a date/timestamp of the change along with the user.
For an example see the PostHistory Table at https://data.stackexchange.com/stackoverflow/query/new
I think what you're looking for here is typically referred to as "knowledge dating".
In this case, your primary key would be your regular key plus the knowledge start date.
Your end date might either be null for a current record or an "end of time" sentinel.
On an update, you'd typically set the end date of the current record to "now" and insert a new record the starts at the same "now" with the new values.
On a "delete", you'd just set the end date to "now".
i've done that.
2.a) version number solves the unique constraint issue somewhat although that's really just relaxing the uniqueness isn't it.
2.b) you can also archive the old versions into another table.
How do you retain historical relational data if rows are changed? In this example, users are allowed to edit the rows in the Property table at any time. Tests can have any number of properties. If they edit the field 'Name' in the Property table, or drop a row in the Property table, Test rows might not hold conditions at the time of the test. Would you change the design of the Test table by adding a property names column, and dropping the TestProperty mapping table? The property names column would have to be something like a delimited list of strings. How is problem usually handled?
3 tables:
Test:
TestId AUTONUMBER,
Name CHAR,
TestDate DATE
Property:
PropertyId AUTONUMBER,
Name CHAR
TestProperty: (maps properties to tests)
TestId
PropertyId
I do not think the question has been answered fully.
If they edit the field 'Name' in the Property table ... Would you change the design of the Test table by adding a property names column, and dropping the TestProperty mapping table?
Definitely not. That would add massive duplication for no purpose.
If your requirement is to maintain the integrity of the data values (in Property) at the time of the Test, the correct (database) method is to implement a History table. That should be an exact copy of the source table, plus one item: a TIMESTAMP or DATETIME column is added to the PK.PropertyHistory
PropertyId AUTONUMBER,
Name CHAR
CONSTRAINT PRIMARY KEY CLUSTERED UC_PK (PropertyId)
PropertyHistory
PropertyId INT,
AuditedDtm DATETIME,
Name CHAR
CONSTRAINT PRIMARY KEY CLUSTERED UC_PK (PropertyId, AuditedDtm)
For this to be meaningful and useable, the Test table needs a timestamp as well, to identify which version of ProperyHistory to reference:TestProperty
TestId
PropertyId
TestDtm DATETIME
The property names column would have to be something like a delimited list of strings.
That would break basic design rules as well as Database Normalisation rules, and prevent you from performing ordinary Relational operations on it. Never store more than one data value in a single column.
... or drop a row in the Property table
Deletion is something different again. If it is a "database" then it has Integrity. Therfore you cannot delete a parent row if it has child rows in some other table (and you can delete it if it does not have children). This is usually implemented as a "soft delete", an Indicator such as IsObsolete is added. This is referenced in the various SELECTS to exclude the row from being used (to add new children) but remains available as the parent for existing children.
If you want to retain property relations, even if the property doesn't exist. Make it so that Properties aren't necessarily deleted, but add a flag that denotes if the property is currently active. If a property's name is changed, create a new property with the new name and set the old property to inactive.
If you do this, you'll have to create some way of garbage collecting the inactive properties.
I'd never make a single column into a field that imitates a one-to-multi relationship with a comma-denoted list. Otherwise, you defeat the purpose of relational database.
Seems like you're using Test as both a template for a particular instance of a test, as well as the test itself. Maybe every time a user performs a test according to the specification in Test, create a row in, say, TestRun? This would preserve the particular Propertys, and if the entries in Property change later, then subsequent TestRuns would reflect the new changes.