This is a bit tricky to search on keywords, so my apologies if this question exists.
Let's say I have a standard type 2 slowly changing dimension, modeled with a Startdate and Enddate column. The record with the NULL Enddate is the current version of the dimension record.
I understand it's pretty straightforward when I detect what will be a fact table insert from the source data. The new fact table record is simply inserted and mapped to the most current dimension record, found by matching the business key and taking the dimension record where the Enddate = NULL.
I'm having a little trouble figuring out what to do when there's an update to a measure in the source system, what will amount to an update, not an insert, in my fact table. It seems I only have the business key to join on, and the existing record in the fact table could point to a previous version of the dimension record. I'm unsure of how to grab the correct surrogate key from the dimension and perform the fact table update.
I can provide more detail if needed.
Thanks in advance.
Do you have an insert or create date on the source table? If you are using SCD type 2, you can use it to look up the dimension row that was in effect at that time and return the correct dimension row, something like:
select dim.*
from dim
where src.src_dt between dim.startdate and coalesce(dim.enddate, '9999-12-31')
  and dim.keys = src.keys
The coalesce handles the current row, whose Enddate is NULL.
When you do the lookup, use the date along with the natural keys to get the correct dimension row, rather than always selecting the most current dimension row; do this for both inserts and updates.
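For example, a rough sketch of the fact-table update using that kind of date-based lookup, written with MySQL-style update-join syntax; all of the table and column names here (fact_sales, staging_src, dim_customer, order_no, customer_bk, customer_sk, measure_amt) are made up for illustration since the actual schema isn't given:
update fact_sales f
join staging_src s
    on s.order_no = f.order_no                              -- locate the existing fact row by its own business/degenerate key
join dim_customer d
    on d.business_key = s.customer_bk                       -- natural key match
   and s.src_change_dt >= d.startdate
   and (s.src_change_dt < d.enddate or d.enddate is null)   -- NULL Enddate marks the current version
set f.customer_sk = d.customer_sk,                          -- surrogate key of the version in effect at the change date
    f.measure_amt = s.measure_amt;                          -- the corrected measure
Whether you repoint the fact row's surrogate key at all, or keep the key captured when the fact was first inserted, is a design decision; many designs keep the original key so the fact reflects the dimension as it was at transaction time, and only the measure is updated.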
I know there are ways to make an audit table in order to get the change history for an entire table in SQL, for example:
Is there a MySQL option/feature to track history of changes to records?
However, I want to know if there is a way to get the change history for a specific row - i.e., a record of edits for row 1 in a table. If there is a way to do this, I would greatly appreciate it. Thanks!
What we have done in the past is have change history tables. The first one would be:
Change History
ch_ID Primary Key
Table_Name Name of the table for the change
Table_PK The PK from that table
Type insert, update, delete
Change_Date Date of the change
Change_By Who made the change
Second would be:
Change History Details
chd_ID Primary Key
ch_ID FK to the Change History table
Column_Name Name of the column that changed
Old_Value Value before the change
New_Value Value after the change
We then used triggers on the table to populate these. You can't rely on stored procedures to capture the info, because DBAs usually don't go through stored procedures when making data changes. You can then query by table name and the primary key of the record you are interested in. You can also add a screen name to the first table to let you pull all of the changes made to a record from that screen.
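As a rough MySQL sketch of that scheme (the table, column, and trigger names here are only placeholders, and the audited customer table with its customer_id and email columns is hypothetical):
create table change_history (
    ch_id       int auto_increment primary key,
    table_name  varchar(64) not null,                 -- name of the table for the change
    table_pk    varchar(64) not null,                 -- the PK value from that table
    change_type enum('insert','update','delete') not null,
    change_date datetime not null,
    change_by   varchar(128) not null
);

create table change_history_detail (
    chd_id      int auto_increment primary key,
    ch_id       int not null,
    column_name varchar(64) not null,
    old_value   text,
    new_value   text,
    constraint fk_chd_ch foreign key (ch_id) references change_history (ch_id)
);

delimiter $$
create trigger customer_audit_upd
after update on customer
for each row
begin
    declare v_ch_id int;

    insert into change_history (table_name, table_pk, change_type, change_date, change_by)
    values ('customer', new.customer_id, 'update', now(), current_user());
    set v_ch_id = last_insert_id();

    -- one block like this per audited column
    if not (old.email <=> new.email) then              -- <=> is the null-safe comparison
        insert into change_history_detail (ch_id, column_name, old_value, new_value)
        values (v_ch_id, 'email', old.email, new.email);
    end if;
end$$
delimiter ;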
I usually suggest having a second history_x table for each table x. history_x in this scenario is nearly identical to x; it differs in that its copy of x's primary key is not a primary key (and not auto-incrementing, even if x's is), and it has its own primary key and sometimes some sort of additional changed_when datetime field.
Then two triggers are made:
AFTER INSERT ON x basically just clones a new row in x to history_x
AFTER UPDATE ON x just clones the new state of row x to history_x
How to handle DELETE varies. Often, if you're going as far as to actually delete the x record, the corresponding history records can be deleted with it. If you're just flagging the x as "retired", that is covered by the UPDATE handling. If you need to preserve the history after a delete, you can just add an x_deleted "flag" field and a DELETE trigger that clones the last state of the row, but sets the x_deleted flag in history to "true".
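A rough sketch of those two triggers, assuming a hypothetical table x (id, col_a, col_b); real column lists would obviously differ:
create table history_x (
    history_id   int auto_increment primary key,      -- history_x's own PK
    x_id         int not null,                        -- copy of x's PK (not auto-incrementing)
    col_a        varchar(100),
    col_b        int,
    changed_when datetime not null
);

create trigger x_ai after insert on x
for each row
    insert into history_x (x_id, col_a, col_b, changed_when)
    values (new.id, new.col_a, new.col_b, now());

create trigger x_au after update on x
for each row
    insert into history_x (x_id, col_a, col_b, changed_when)
    values (new.id, new.col_a, new.col_b, now());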
Also, this obviously doesn't track PK changes to x, but could if history_x has two copies of x's PK; one would be the historical PK value captured by the triggers with the rest of the fields, and the second would be bound to a foreign key that would cascade all the old history to reference the new key.
Edit: If you can take advantage of the semi-global nature of session/# variables, you can even add information such as who made the change; but connection pooling can often interfere with that (each connection is its own session).
Edit#2/Warning: If you're storing large data such as BLOBs or large TEXT fields, they should probably NOT be cloned every update.
Oh yeah, the "changed_when" data can also be more useful if expressed as a valid_from and valid_until pair of fields. valid_until should be NULL for the newest history record, and when a new history record is added, the previous newest should have its valid_until field set. changed_when is enough for a log, but if you need to actually use the old values, WHERE ? >= valid_from AND (? < valid_until OR valid_until IS NULL) is a lot easier than WHERE changed_when <= ? ORDER BY changed_when DESC LIMIT 1.
Based on how it sounds to me, what you want to do is to use ROW_NUMBER() in a query.
See here for more details.
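For instance, assuming a history table along the lines of the history_x example above and MySQL 8.0+ (window functions aren't available in 5.x), something like this numbers the edits of a single row, newest first:
select h.*,
       row_number() over (partition by h.x_id
                          order by h.changed_when desc) as version_no
from history_x h
where h.x_id = 1;        -- the specific row whose edit history you want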
This is a problem that bothers me whenever there is a need to add a new field to a table. Here the table has about 1.5 billion records (partitioned and sharded, so it is physically separate files). Now I need to add a nullable varchar(1024) field, which is going to accept some JSON strings. It is possible that the field length will have to be increased in the future to accommodate longer strings.
Here are the arguments:
All existing rows will have null values for this field. (fav. new table)
Only 5% of the newly inserted records will have value for this. (fav. new table )
Most of the current queries on the table will need to access this field. (fav. alter)
I'm not sure whether query memory allocation has a role to play in this, based on where I store the field.
So should I add the column to the current table, or define another table with the same primary key to store this data?
Your comments would help me reach a decision.
Well, if your older records won't need that varchar field, you should put it in another table, and when pulling the data, join to it on the primary key of the other table.
It's not a big deal: you can simply add a column to that table, and the existing rows will just have NULL for the new column.
I think that, regardless of the 3 situations you have posited, you should alter the existing table, rather than creating a new one.
My reasoning is as follows:
1) Your table is very large (1.5 billion rows). If you create a new table, you would replicate the PK for 1.5 billion rows in the new table.
This will cause the following problems:
a) Wastage of DB space.
b) Time-intensive. Populating a new table with 1.5 billion rows and updating their PKs is a non-trivial exercise.
c) Rollback-segment exhaustion. If the rollback segments have insufficient space during the insertion of the new rows, the insert will fail. This will increase the DB fragmentation.
On the other hand, all these problems are avoided by altering the table:
1) There is no space wastage.
2) The operation won't be time-consuming.
3) There is no risk of rollback segment failure or DB fragmentation.
So alter the table.
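For what it's worth, the alter itself is a single statement; the table and column names below are placeholders, and on some platforms (e.g. MySQL 8.0+) adding a nullable column can be done as a metadata-only change rather than rewriting the 1.5 billion rows:
alter table big_table
    add column json_attributes varchar(1024) null;    -- existing rows simply report NULL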
Both these approaches have merits and demerits. I think I found a compromise between these two options, which has the benefits of both approaches:
Create a new table to hold the JSON string. This table has the same primary key as the first table. Say the first table is Customer, and the second table is Customer_json_attributes.
Alter the current table (Customer) to add a flag indicating the presence of a value in the JSON field, say json_present_indicator char(1).
The application sets json_present_indicator = 'Y' in the first table if there is a value for the JSON field in the second table, and 'N' if not.
Select queries will use a left join with json_present_indicator = 'Y' as a join condition. This will be an efficient join because the query will only search the second table when the indicator is 'Y'. Remember, only 5% of the records will have a value in the JSON field.
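A sketch of such a query, using the Customer / Customer_json_attributes names from above (the customer_id and json_value column names are assumptions):
select c.*,
       j.json_value
from customer c
left join customer_json_attributes j
       on j.customer_id = c.customer_id
      and c.json_present_indicator = 'Y';    -- only probe the second table when the flag says there is data
Keeping the indicator test in the ON clause (rather than the WHERE clause) preserves the left join, so rows flagged 'N' still come back, just with a NULL json_value.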
My application serves customers that are online stores. One of the tables in my DB is "Product" and it has a column "In_Stock". This is a boolean (bit(1)) column. My customers send data feeds of their product catalogs, and each customer has their own version of this table. I would like to track changes to this In_Stock column, something to the effect of...
11/13/2016 true
12/26/2016 false
01/07/2017 true
Just so that when I do some auditing, I can see for a given time period what was the state of a given product.
How best can I do this?
It seems overkill to create a separate history table and have it updated by a trigger just for one boolean column. Would a history column suffice? I can save the data there in some kind of JSON string.
Sorry, any workable solution will require a second table.
One such solution is Version Normal Form (vnf), which is a special case of 2nf. Consider your table containing the boolean field (assuming it is properly normalized to at least 3nf). Now you want to track the changes made to the boolean field. One way is to turn the rows into versions by adding an EffectiveDate column; then, instead of updating the row, you write a new row with the current date in the date field (or just update in place if the boolean field is unchanged).
This allows the tracking of the field, there being a new version for every time the field is changed. But there are severe disadvantages, not least of which is the fact that a row is no longer an entity, but a version of an entity. This makes it impossible to use a foreign key to this table, as those want to refer to an entity.
But look carefully at the design. Before the change, you had a good, normalized table with no tracking of changes. After adding the EffectiveDate column, there has been a subtle change. All the fields except the boolean field are, as before, dependent only on the PK. The boolean field is dependent not only on the PK but on the new date field as well. It is no longer in 2nf.
Normalizing the table requires moving the boolean field and the date field to a new table:
create table NewTable(
    EntityID int not null references OriginalTable( ID ),
    EffDate date not null,
    TrackedCol boolean,
    constraint PK_NewTable primary key( EntityID, EffDate )
);
The first version is inserted when a new row is inserted into the original table. From then on, another version is added only when an update to the original table changes the value of the boolean field.
Here is a previous answer that includes the query to get the current and any past values of the versioned data. I've discussed this design many times here.
Also, there is a way to structure the design so the application code doesn't need to be changed. That is, the redesign will be completely transparent to existing code. The answer linked above contains another link to more documentation to show how that is done.
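As a rough illustration (this is not the exact query from the linked answer), a point-in-time lookup against NewTable could look like:
select v.TrackedCol
from NewTable v
where v.EntityID = 42                          -- the entity of interest
  and v.EffDate = (
        select max(h.EffDate)
        from NewTable h
        where h.EntityID = v.EntityID
          and h.EffDate <= '2016-12-26'        -- "as of" date; drop this filter to get the current value
  );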
I would do the trigger thing. But don't replicate the whole row - just take the unique id column, a log timestamp, and the boolean value.
Sometimes having good logs is priceless :)
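A minimal sketch of that idea, assuming the products live in a table product (product_id, in_stock, ...); the actual names will differ:
create table product_stock_log (
    log_id     int auto_increment primary key,
    product_id int not null,
    changed_at datetime not null,
    in_stock   bit(1) not null
);

delimiter $$
create trigger product_stock_au
after update on product
for each row
begin
    if not (old.in_stock <=> new.in_stock) then        -- log only when the flag actually changed
        insert into product_stock_log (product_id, changed_at, in_stock)
        values (new.product_id, now(), new.in_stock);
    end if;
end$$
delimiter ;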
I've written an audit trail module for this purpose; it basically duplicates the table, adds some information to each row, and keeps the original data table untouched except for the triggers.
I currently have a non-temporal MySQL DB and need to change it to a temporal MySQL DB. In other words, I need to be able to retain a history of changes that have been made to a record over time for reporting purposes.
My first thought for implementing this was to simply do inserts into the tables instead of updates, and when I need to select the data, simply doing a GROUP BY on some column and ordering by the timestamp DESC.
However, after thinking about things a bit, I realized that this will really mess things up, because the primary key for each insert (which would really just be simulating a number of updates on a single record) will be different, and thus break any linkage that uses the primary key to link to other records in the DB.
As such, my next thought was to continue updating the main tables in the DB, but also create a new insert into an "audit table" that is simply a copy of the full record after the update, and then when I needed to report on temporal data, I could use the audit table for querying purposes.
Can someone please give me some guidance or links on how to properly do this?
Thank you.
Make the given table R temporal (i.e., maintain its history).
One design is to leave the table R as it is and create a new table R_Hist with valid_start_time and valid_end_time.
Valid time is the time when the fact is true.
The CRUD operations can be handled as follows (a SQL sketch of the UPDATE case follows this list):
INSERT
Insert into R
Insert into R_Hist with valid_end_time as infinity
UPDATE
Update in R
Insert into R_Hist with valid_end_time as infinity
Update valid_end_time with the current time for the “latest” tuple
DELETE
Delete from R
Update valid_end_time with the current time for the “latest” tuple
SELECT
Select from R for ‘snapshot’ queries (implicitly ‘latest’ timestamp)
Select from R_Hist for temporal operations
Instead, you can choose to design a new table for every attribute of table R. With this design you can capture attribute-level temporal data, as opposed to the entity-level history in the previous design. The CRUD operations are almost the same.
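As a sketch of the UPDATE case, assuming R(id, some_col) and R_Hist(id, some_col, valid_start_time, valid_end_time), with '9999-12-31' standing in for infinity (all names are illustrative):
-- run these inside one transaction
update R
set some_col = 'new value'
where id = 42;

-- close out the previously "latest" history row first,
-- so the new row inserted below isn't caught by this filter
update R_Hist
set valid_end_time = now()
where id = 42
  and valid_end_time = '9999-12-31';

-- open a new history row valid from now until "infinity"
insert into R_Hist (id, some_col, valid_start_time, valid_end_time)
select id, some_col, now(), '9999-12-31'
from R
where id = 42;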
I added a column Deleted and a column DeletedDate. Deleted defaults to false and DeletedDate defaults to null.
Composite primary key on IDColumn, Deleted, and DeletedDate.
You can index by Deleted so you get really fast queries.
There is no duplicate primary key on IDColumn, because the primary key also includes Deleted and DeletedDate.
Assumption: you won't write to the same record more than once a millisecond. It could cause a duplicate primary key issue if DeletedDate is not unique.
So then I do a transaction-type deal for updates: select the row, take the results, update the specific values, then insert. Really it's an update that sets Deleted to true and DeletedDate to now(); then you have it spit out the row after the update and use that to get the primary key and/or any values not available to whatever API you built.
Not as good as a temporal table, and it takes some discipline, but it builds the history into one table that is easy to report on.
I may start updating the DeletedDate column and change it to an added/deleted date, in addition to the added date, so I can sort records by one column (the added/deleted column), while always updating the AddedBy column and just setting it to the same value as the added/deleted column, for logging's sake.
Either way, you could just do a CASE: CASE WHEN DeletedDate IS NOT NULL THEN DeletedDate ELSE AddedDate END AS AddedDate, then ORDER BY AddedDate DESC. So, yeah, whatever, this works.
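A very rough sketch of this pattern; note that MySQL does not allow NULLs in primary key columns, so this sketch uses a sentinel "not deleted yet" date instead of NULL, which differs slightly from the description above (all names are placeholders):
create table my_record (
    IDColumn    int not null,
    Deleted     boolean not null default 0,
    DeletedDate datetime(3) not null default '9999-12-31 00:00:00.000',   -- sentinel = still the live row
    some_value  varchar(100),
    primary key (IDColumn, Deleted, DeletedDate)
);

-- an "update" retires the live row...
update my_record
set Deleted = 1,
    DeletedDate = now(3)                       -- millisecond precision, per the uniqueness assumption
where IDColumn = 42
  and Deleted = 0;

-- ...and inserts the new state as the live row
insert into my_record (IDColumn, Deleted, DeletedDate, some_value)
values (42, 0, '9999-12-31 00:00:00.000', 'new value');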
In order to determine how often some object has been used, I use a table with the following fields;
id - objectID - timestamp
Every time an object is used, it's ID and time() are added in. This allows me to determine how often an object has been used in the last hour/minute/second etc.
After one hour, the row is useless (I'm not checking above one hour). However, it is my understanding that it is unwise to simply delete the row, because it may mess up the primary key (auto_increment ID).
So I added a field called "active". Prior to checking how often an object has been used, I loop over all rows WHERE active = 1 and set active to 0 if more than 1 hour has passed. I don't think this would give any concurrency problems between multiple users, but it leaves me with a lot of unused data.
Now I'm thinking that maybe it's best, prior to inserting new usage data, to check whether there is a row with active = 0 and then, rather than inserting a new row, update that one with the new data and set active to 1 again. However, this would require table locking to prevent multiple clients from updating the same row.
Can anyone shed some more light on this, please?
I've never heard anywhere that deleting rows messes up primary keys.
Are you perhaps attempting to ensure that the id values automatically assigned by auto_increment match those of another table? This is not necessary - you can simply use an INTEGER PRIMARY KEY as the id column and assign the values explicitly.
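So if the rows really are useless after an hour, you can simply delete them (the table name below is a placeholder for yours):
delete from usage_log
where `timestamp` < now() - interval 1 hour;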
You could execute an update query that matches all rows older than 1 hour:
UPDATE `table` SET active=0 WHERE `timestamp` < now() - interval 1 hour