MySQL history table design and query - mysql

TL;DR: Is this design correct and how should I query it?
Let's say we have history tables for city and address designed like this:
CREATE TABLE city_history (
id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
name VARCHAR(128) NOT NULL,
history_at DATETIME NOT NULL,
obj_id INT UNSIGNED NOT NULL
);
CREATE TABLE address_history (
id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
city_id INT NULL,
building_no VARCHAR(10) NULL,
history_at DATETIME NOT NULL,
obj_id INT UNSIGNED NOT NULL
);
Original tables are pretty much the same, minus the history-specific columns history_at and obj_id (city: id, name; address: id, city_id, building_no). There's also a foreign key relation between city and address (city_id).
History tables are populated on every change of the original entry (create, update, delete) with the exact state of the entry at given time.
obj_id holds id of original object - no foreign key, because original entry can be deleted and history entries can't. history_at is the time of creation of history entry.
History entries are created for every table independently - change in city name creates city_history entry but does not create address_history entry.
So to see the state of the whole address with its city (e.g. on printed documents) at any point in time T1, we take from both history tables the most recent entry for the given obj_id created before T1, right?
With this design, in theory, we should be able to see the state of a single address with its city at any given point in time. Could anyone help me create such a query for a given address id and time? Please note that there could be multiple records with the exact same timestamp.
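For concreteness, a sketch of the kind of query I mean (assuming @address_id and @t1 are set as session variables, and breaking ties on history_at by taking the highest id):
-- state of address @address_id (and its city) as of @t1
SELECT ah.obj_id AS address_id,
       ah.building_no,
       ch.name AS city_name
FROM address_history ah
LEFT JOIN city_history ch
       ON ch.id = (SELECT c2.id
                   FROM city_history c2
                   WHERE c2.obj_id = ah.city_id
                     AND c2.history_at <= @t1
                   ORDER BY c2.history_at DESC, c2.id DESC   -- id breaks timestamp ties
                   LIMIT 1)
WHERE ah.id = (SELECT a2.id
               FROM address_history a2
               WHERE a2.obj_id = @address_id
                 AND a2.history_at <= @t1
               ORDER BY a2.history_at DESC, a2.id DESC       -- id breaks timestamp ties
               LIMIT 1);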
There is also a need to create a report showing every change of state of a given address in a given time period, with entries like "city_name, building_no, changed_at". Is that something that can be created with an SQL query? Performance doesn't matter so much here; such reports won't be generated often.
The above report will probably be needed in an interactive version where user can filter results e.g. by city name or building number. Is it still possible to do in SQL?
In reality the address and address_history tables have 4 more foreign keys that should be joined in the report (street, zip code, etc.). Wouldn't the query be like ten pages long to provide all the needed functionality?
I've tried to build some queries and play with greatest-n-per-group, but I don't think I'm getting anywhere with this. Is this design really OK for my use cases (and if so, can you please provide some queries for me to play with to get where I want)? Or should I rethink the whole design?
Any help appreciated.

(My answer copied from here, since that question never marked an answer as accepted.)
My normal "pattern" in (very) pseudo-code:
Table A: a_id (PK), a_stuff
Table A_history: a_history_id (PK), a_id(FK referencing A.a_id), valid_from, valid_to, a_stuff
Triggers on A (a MySQL sketch follows the list):
On insert: insert values into A_history with valid_from = now, and valid_to = null.
On update: set valid_to = now for last history record of a_id; and do the same insert from the "on insert" trigger with the updated values for the row.
On delete: set valid_to = now for last history record of a_id.
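A minimal sketch of those triggers, assuming tables a(a_id, a_stuff) and a_history(a_history_id, a_id, valid_from, valid_to, a_stuff) as described (if rows in a are physically deleted, the FK on a_history.a_id may need to be relaxed or given an ON DELETE rule):
DELIMITER //
CREATE TRIGGER a_after_insert AFTER INSERT ON a FOR EACH ROW
BEGIN
  INSERT INTO a_history (a_id, valid_from, valid_to, a_stuff)
  VALUES (NEW.a_id, NOW(), NULL, NEW.a_stuff);
END//
CREATE TRIGGER a_after_update AFTER UPDATE ON a FOR EACH ROW
BEGIN
  -- close the currently open history record...
  UPDATE a_history SET valid_to = NOW()
  WHERE a_id = NEW.a_id AND valid_to IS NULL;
  -- ...and open a new one with the updated values
  INSERT INTO a_history (a_id, valid_from, valid_to, a_stuff)
  VALUES (NEW.a_id, NOW(), NULL, NEW.a_stuff);
END//
CREATE TRIGGER a_after_delete AFTER DELETE ON a FOR EACH ROW
BEGIN
  UPDATE a_history SET valid_to = NOW()
  WHERE a_id = OLD.a_id AND valid_to IS NULL;
END//
DELIMITER ;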
In this scenario, you'd query history with "x >= from and x < to" (not BETWEEN, since a previous record's "to" value should match the next record's "from" value).
Additionally, this pattern also makes "change log" reports easier.
Without a table dedicated to change logging, the relevant records can be found just by SELECT * FROM A_history WHERE valid_from BETWEEN [reporting interval] OR valid_to BETWEEN [reporting interval].
If there is a central change log table, the triggers can just be modified to include log entry inserts as well. (Unless log entries include "meta" data such as reason for change, who changed, etc... obviously).
Note: This pattern can be implemented without triggers. Using a stored procedure, or even just multiple queries in code, can actually negate the need for the non-history table.
The history table's "a_id" would need to be replaced with whatever uniquely identifies the record normally, though; it could still be an id value, but these values would need to be synthesized when inserting, and known when updating/deleting.
Queries (see the sketch below):
(if not new) UPDATE the most recent entry's valid_to.
(if not deleting) INSERT new entry
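A sketch of this trigger-less variant, with both statements run in one transaction (@a_id and @new_stuff stand in for values supplied by application code):
START TRANSACTION;
-- (if not new) close the most recent entry
UPDATE a_history SET valid_to = NOW()
WHERE a_id = @a_id AND valid_to IS NULL;
-- (if not deleting) open a new entry with the new values
INSERT INTO a_history (a_id, valid_from, valid_to, a_stuff)
VALUES (@a_id, NOW(), NULL, @new_stuff);
COMMIT;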

This is a very "traditional" problem when it comes down to versioning (or monitoring) changes to a certain row.
There are various "solutions", each having its own drawbacks and advantages.
The following "statements" are a result of my experience; they are neither perfect, nor do I claim they are the "only ones"!
1.) Creating a "history table": That's the worst idea of all. You would always need to take into account which table you need to query, depending on the data that should be queried. That's a "chicken-and-egg" problem...
2.) Using ONE table with ONE (increasing) "revision" number: That's a better approach, but it will get "hard" to query: determining the "most recent row" per "id" is very costly no matter which approach is used.
My personal experience is that following the pattern of a "doubly linked list" is the best way to solve this when it comes down to millions of records:
3.) Maintain two columns on every entity, let's say prev_version_id and next_version_id. prev_version_id is NULL if there is no previous version; next_version_id is NULL if there is no later version.
This approach would require you to ALWAYS perform two actions upon an update:
Create the new row
Update the old row's reference (next_version_id) to point at the just-inserted row.
However, when your database has grown to something like 100 million rows, you will be very happy that you chose this path:
Querying the "Oldest" Version is as simple as querying where ISNULL(prev_version_id) and entity_id = 5
Querying the "Latest" Version is as simple as querying where ISNULL(next_version_id) and entity_id = 5
Getting a full version history just targets entity_id = 5 in the data table, sortable by either prev_version_id or next_version_id.
The very often neglected fact: the first two queries will also work to get a list of ALL first versions or of ALL recent versions of an entity - in about NO TIME! (Don't underestimate how "costly" it can be to determine the most recent version of an entity otherwise! Believe me, when "testing" everything seems equally fine, but the real struggle starts when live data with millions of records is used.)
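A sketch of approach 3, assuming a table entity_version(version_id AUTO_INCREMENT PK, entity_id, prev_version_id, next_version_id, name) - the names are illustrative only:
-- on update: insert the new version, then link the old head to it
INSERT INTO entity_version (entity_id, prev_version_id, next_version_id, name)
VALUES (5, @old_version_id, NULL, 'new value');
UPDATE entity_version
SET next_version_id = LAST_INSERT_ID()
WHERE version_id = @old_version_id;
-- latest version of entity 5
SELECT * FROM entity_version WHERE entity_id = 5 AND next_version_id IS NULL;
-- oldest version of entity 5
SELECT * FROM entity_version WHERE entity_id = 5 AND prev_version_id IS NULL;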
cheers,
dognose

Related

How to track change(Update/delete) in MYSQL for later query (NOT FOR LOG)

I have researched some questions on Stack Overflow, but what I want is for later query purposes, not for logging purposes.
I have a project that needs to get a value from a certain moment in time.
For example
I have a user table
User:
id
name
address
Pet:
id
name
type
Adoption:
id
user_id
pet_id
Data:
User:
1, John, One Street
Pet:
1, Lucy, Cat
Adoption:
1, 1, 1
Let's say the user changes their address, so it looks like
User:
1, John, Another Street
And what I need is:
What was the address (or other field) of the user when they adopted the pet?
What I am thinking of is to always create a new row in the same table (in this case user) and have the new row refer to the previous row:
User:
2, 1, John, Another Street ( where 1 is referring to the previous id / updated from)
1, NULL, John, One Street, deleted (NULL means this is newly created data)
The advantage of this is that it's easy to query (I just query like usual).
The downside is that the table will become huge recording every update. Is there any solution?
Thank you
This is what I do sometimes:
For any field whose value changes I need to track, I design a separate changes table.
For example, for the address field, which is a concept associated with the user entity and not a direct property of the adoption entity, I define the table:
UserAddressChanges(UserID, Address, ChangeDateTime, ChangerPersonID)
This way, the changes data may be used in any other sub-system or system, independent of your current adoption use-case.
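For example, to answer the adoption question with this table (a sketch; it assumes Adoption also stores an adoption_at timestamp, which is not in the original schema):
SELECT a.id AS adoption_id,
       (SELECT c.Address
        FROM UserAddressChanges c
        WHERE c.UserID = a.user_id
          AND c.ChangeDateTime <= a.adoption_at
        ORDER BY c.ChangeDateTime DESC
        LIMIT 1) AS address_at_adoption
FROM Adoption a
WHERE a.id = 1;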
I use in-table change tracking for very simple tables like:
UniversityManagers(PersonID, AssignDateTime, AssignorPersonID)
For more complex tables with frequent changes (and usually few references to previous data) where I need full record logging, I separate the main table (of current records) from the log table, which has extra fields such as LogID, ChangeDateTime, ChangerPersonID, ChangerIP, ...
There are different approaches to this.
Perhaps the simplest is to denormalize the data. If there is data you need at the point of adoption, include it as columns in the adoption table. This address is the "point-in-time" address.
This method is useful for simple things, but it does not scale well. And you have to pre-define the columns you want.
The next step is to create audit tables for all your tables, or at least all tables of interest. Every time a record changes in user, a new record is added into userAudit. Audit tables are usually maintained using triggers.
The advantage of audit tables is that they do not clutter the existing table (and logic). The same queries work on the existing tables.
Finally, you can just cave in and realize that your data model is overly simplified. You really have slowly changing dimensions. This data can be represented using version effective dates and version end dates for each row. The user table ends up looking like:
user_id name address version_eff_dt version_end_dt
Because user_id is no longer a primary key, you might want two tables users and userHistory, or something like that.
This is a "correct" representation of the data at any point in time. However, it usually requires restructuring queries because a single user appears multiple times in the table -- and user_id is no longer the primary key.

Database design: Managing old and new data in database table

I have a table Student with fields as follows:
Student table (one record per student)
student_id
Name
Parent_Name
Address_line1, Address_line2, Address_line3
Photo_path
Signature_file_path
Preferred_examcity_choice1, Preferred_examcity_choice2, Preferred_examcity_choice3
Gender
Nationality
.
.
.
I am inserting into this table on Registration form completion through the web interface.
Now there is one more module in the web interface for updating the student data; on every update request I update the student table records and insert a new entry in student_data_change_request. A student can change records any number of times.
student_data_change_request
request_id(auto_incr PK)
old_name
new_name
old_photo_path
new_photo_path
old_signature_file_path
new_signature_file_path
Now coming to the problem: earlier, students were allowed to change very few fields; now the client wants to allow the candidate to update more fields (around 20), and adding old and new columns for each corresponding column isn't elegant or preferred (I guess), since I would end up creating 40 columns to keep track of 20 columns. So how should I redesign my table? Suggestions are welcome.
One approach is to have a shadow table named (table)_xx that has the same columns plus the time, date, an update/insert/delete flag, the user or whatever, and no referential integrity. Set a trigger to update that table from the source whenever anything happens.
If you've got genuine business requirements that need history then do those properly but this pattern is great as a general audit, debugging and forensic tool.
It's also really easy to automate/script as you just generate it from the DB metadata.
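A MySQL sketch of that shadow-table idea for the student table (columns abbreviated; the audit columns and trigger name are illustrative):
CREATE TABLE student_xx (
  audit_id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  audit_action  CHAR(1) NOT NULL,        -- 'I', 'U' or 'D'
  audit_at      DATETIME NOT NULL,
  audit_user    VARCHAR(64) NULL,
  student_id    INT UNSIGNED NOT NULL,   -- deliberately no FK back to student
  name          VARCHAR(128) NULL,
  parent_name   VARCHAR(128) NULL,
  address_line1 VARCHAR(255) NULL
  -- ...repeat for the remaining student columns
);
CREATE TRIGGER student_after_update AFTER UPDATE ON student FOR EACH ROW
  INSERT INTO student_xx (audit_action, audit_at, audit_user,
                          student_id, name, parent_name, address_line1)
  VALUES ('U', NOW(), CURRENT_USER(),
          NEW.student_id, NEW.Name, NEW.Parent_Name, NEW.Address_line1);
-- similar AFTER INSERT / AFTER DELETE triggers complete the pattern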
Usually a historical table looks like:
request_id
column_name
old_value
new_value
dt
request_id and column_name form the primary key. When you update the student table, you insert a new entry in student_data_change_request for each updated column.
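A sketch of that layout (reusing the table name from the question, but redefined with the columns above):
CREATE TABLE student_data_change_request (
  request_id  INT UNSIGNED NOT NULL,
  column_name VARCHAR(64)  NOT NULL,
  old_value   TEXT NULL,
  new_value   TEXT NULL,
  dt          DATETIME NOT NULL,
  PRIMARY KEY (request_id, column_name)
);
-- one update that changes two fields produces two rows
INSERT INTO student_data_change_request VALUES
  (101, 'name',       'Old Name', 'New Name', NOW()),
  (101, 'photo_path', '/old.jpg', '/new.jpg', NOW());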
Edited:
Another way:
request_id
value_type
name
photo_path
signature_file_path
...
and insert a first entry with the old values and a second entry with the new values. The value_type column marks whether the row holds the old or new values.
I would rather have just one table, with an additional column for effective date. Then a view that picks up just the most recent row for each student_id becomes your first "table". If for some reason you must show "current" and "most recently changed" values side-by-side, that is another view.
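A sketch of that view, assuming an effective_date column has been added to student and that (student_id, effective_date) is unique:
CREATE VIEW current_student AS
SELECT s.*
FROM student s
WHERE s.effective_date = (SELECT MAX(s2.effective_date)
                          FROM student s2
                          WHERE s2.student_id = s.student_id);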
As usual, it all depends on how you intend to use the data.
My strong preference in these cases is the solution #mathguy suggests - embedding the concept of time in the main table design. This allows you to ask the question "what was this student's address on 1 Jan?", or "who had signature x on 12 Feb?".
If you have to report or execute business logic that reflects the status at any point in time, this design works really well. For instance, if you have to report on how many students lived in a particular address for a given term, you want to know when the records were valid.
But not all applications care about "time" - sometimes, you just want to have an audit table, so you can trace what happened over time in case of anomalies.
In that case, #loztinspace's solution is useful - but in my experience, this rapidly escalates into more work, because those who want to inspect the audit records cannot, or should not, get access to a SQL prompt on your production environment.

Database not having unique ID's

I have a database containing attendance on a monthly basis. Now, I want to display that data in a series of text boxes. But my problem is that it does not contain any unique IDs, which is making my task difficult. Have a look at the attachment so that you guys can get the picture of my problem.
http://s26.postimg.org/p8v0zhemx/image.png
Thank you so much in advance.
EDIT:
For future researchers using a ListView, this is the query I used with MySQL.
You have to make a composite key if your db does not have a unique id. Google it.
The query I managed to pull out of my head:
"SELECT empno, line1, time1, line2, time2, line3, time3, line4, time4, line5, time5, line6, time6 FROM attendancelist WHERE empno = '" & ListPayroll.SelectedItems(0).Text & "' AND line1 = '" & ListPayroll.SelectedItems(0).SubItems(1).Text & "'"
It looks to me like your sample data table contains tons of attendance data that basically look like this:
employee workdate starttime endtime
00117 2014-02-03 08:15 17:30
00117 2014-02-04 09:00 17:30
00117 2014-02-05 null null
etc.
If the employee was absent on the given day, that's indicated by null values in starttime and endtime. If the employee was not employed at all on the particular date, you'd simply leave the row out of the table entirely. I think that's what the first five days of employee 00001 in your sample data's first row mean -- not present, not absent.
Your raw data is arranged in a pretty doggone inconvenient report layout that puts a week's worth of days on each row. You can probably write a simple dotnet program to slurp up your six-day-week input table and insert six rows (or fewer) of this data from each row of that table.
Once you've loaded your data from that input table, you can switch over to maintaining it in your new table. That will be much easier for you to handle in a program. You will also be able to write a query program that will recreate your six-day-row report, if that's what your users prefer.
Arranged the way I have shown it, you'll get a nice little attendance table. If you know ahead of time you'll have at most one record per day per employee, you can use the columns I've shown, and use a composite primary key consisting of (employee, workdate).
If you might have more than one row per employee per date you'll need to add an id column, that can be an autoincrementing surrogate primary key.
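A sketch of that normalized attendance table (names are illustrative, not taken from the original spreadsheet-style layout):
CREATE TABLE attendance (
  employee  CHAR(5) NOT NULL,
  workdate  DATE    NOT NULL,
  starttime TIME    NULL,   -- NULL start and end time = absent that day
  endtime   TIME    NULL,
  PRIMARY KEY (employee, workdate)
);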
If all you need is an arbitrary unique identifier for update and delete (as indicated in the comments), then add one:
ALTER TABLE my_table
ADD COLUMN id INT AUTO_INCREMENT PRIMARY KEY;
That is, of course, assuming you have that ability or can convince someone who does. It is a remarkably minor change. If column names are specified in the existing INSERT queries, it won't require a change to them. Someone ought to be willing to do it.
If you have the freedom to modify the schema, please consider revising it. This is very, very poorly designed (having repetitive columns and columns containing multiple pieces of information). If you cannot modify this one, creating a new, better designed schema and importing data from this schema may be another option you want to consider. (Creating a "new schema" could also be accomplished using a set of separate tables.)
Also be aware that with the current structure, you will need extremely heavy validation on the code side to prevent users from saving invalid data when they modify it.

MySQL - Migrating some ID numbers over from randomly generated to autoincremental

I am in the process of rewriting a company's entire system. The original developer was a bit silly and generated ID numbers for each customer report randomly in his database. Each ID number is up to 7 digits long - but could be anything.
I am migrating over all his old data to our new, far more logically structured database. I obviously want to use a MySQL auto-increment for our ID field. However, it's vital that we keep the old ID numbers as customers still phone up each day with those to reference against.
Ideally, the perfect scenario would be we go live December 1st - everything up to December 1st is all randomly IDed, and from December 1st onwards they automatically increment starting at the highest random ID in the old database.
Is such a thing possible with MySQL without any issues? I am currently using two columns - one, our logical autoincrementing ID, and a second column called old_id which was being used during migration. But we need the call centre staff to only be using one ID or mass confusion will ensue.
Thanks!
If you start numbering from the highest random value, just changing the field to auto-increment should be enough; the normal behaviour is that MySQL won't change IDs already set, and starts numbering from the highest value + 1.
If you want to start from a specific value (say 10,000,000) you can set
ALTER TABLE theTableInQuestion AUTO_INCREMENT=10000000
Of course, be sure to create backups and test, but it should not pose any problems at all. (Note that the old records will be stored in order of the id-field, which is random, and won't reflect the creation order.)
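A sketch of both variants, assuming the id column is already the primary key (the table name is taken from the example above, everything else is a placeholder):
-- make the existing column auto-increment; numbering continues from MAX(id) + 1
ALTER TABLE theTableInQuestion
  MODIFY COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT;
-- or pin the starting point explicitly
ALTER TABLE theTableInQuestion AUTO_INCREMENT = 10000000;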
As you need to keep the old IDs, I'm going to assume that you're going to create a new column for autoincrement ID that will become your primary key but keep the existing ID column and rename it (to old_id, maybe?). I'm also going to assume you record when a customer signed up.
If you make your old ID column nullable (allow NULL as a valid value) then you can simply check whether or not the old ID column is NULL. If it's not NULL then treat that as the ID, otherwise use the autoincrement column.
Finding a customer:
SELECT *
FROM customer
WHERE (id = /*Put your ID here*/ AND reg_date >= /*Put the date the new regime starts here*/)
OR (id_old = /*put your ID here*/ AND reg_date < /*Put the date the new regime starts here*/)
This will occasionally return 2 rows so you'll have to use some other criteria to uniquely identify the customer in question.
As for associating an old customer with other tables in the database, you can always use the new ID internally throughout the entire DB once its generated. You will have to update tables that are using the old ID as the foreign key, obviously.
UPDATE target_table
JOIN customers on target_table.cust_id = customers.id_old
SET target_table.cust_id = customers.id;
(Note: The above is just a quick and dirty query that hasn't been tested! I'd suggest testing on a copy of the database before you try it for real!)

What is the best method to store default values in database?

I have several tables like Buyers, Shops, Brands, Money_Collectors, e.t.c.
Each one of those has a default value, e.g. the default Buyer is David, the default Shop is Ebay, and so on.
I would like to save those default values in a database (so that user could change them).
I thought of adding an is_default column to each one of the tables, but it seems inefficient because only one row in each table may be the default.
Then I thought that the best would be to have Defaults table that will contain all the default values. This table will have 1 row and N columns, where N is the number of the default values:
Defaults table:
buyer shop brand money_collector
----- ---- ----- ---------------
David Ebay Dell NULL (no default value)
But, this seems to be not the best approach because the table structure changes when a new default value is added.
What would be the best approach to store default values ?
Just to be clear.
The best way is with a column on each table that the dropdowns source from.
And here's why...
"Shouldn't I worry about space when
saving data in a database?"
The short answer is no. The longer answer is that what you should worry about is performance. Focusing on space will lead you to do very bad things.
Bad things that you'll do if space is a concern.
You'll bury meaning in Primary Keys, i.e. Smart Keys.
You'll try to store multiple values in one column.
You'll index too little
(No doubt we could create a list of 50 bad practices which save space)
"suppose there are 50 shops (select box with 50 possible values). In this case, to store the default shop you need 50 boolean fields"
Well it's ONE Boolean column. It exists on each row.
Let me ask you this. If you created a table with 1 date column and inserted 1 row, how much space would you use on disk?
If you said 7 or 8 bytes, then you're off by about a factor of 1000.
The smallest unit of disk space is a block. Blocks are typically 8KB (they can be as small as 2KB or as large as 32KB, in general; no nitpicking here, the actual limits are unimportant).
Let's say you have 8KB blocks; then your 1-column, 1-row table takes 8KB. If you insert another 999 rows, it will still take up 8KB. (Again, no nitpicking; there is overhead per block and per row - it's an example.)
So in your look up table with 50 store names, the likelihood that adding 50 bytes to the size of the table forces you to expand from 1 block to 2 is slim to none and completely irrelevant.
On the other hand, your default table will certainly take up at least one additional block.
But the worst hit to PERFORMANCE is that your call to fill a drop down will need two round trips to the database, one to get the list, one to get the default. (yes, you may be able to do this in one but go with it)
So you've saved exactly zero space and doubled your network traffic.
You see what I'm saying.
Another crucial reason to stop worrying about space is that you're giving up clarity. Think of the developer you're going to hire to run this app. When he joins the team and looks at the database, imagine the two scenarios.
There's a Boolean column named Default_value
There's a table with no relationships to anything that's named Default_Values
You ask him to build a new form with a dropdown for 'store'.
In scenario 1, he finds the store table, wires up the dropdown to a simple query of the table and uses the default_value field to select the initial value.
In scenario 2, without some training, how would he know to look for a separate table? Maybe he'd see the table, but by the time you're hiring, your data model has hundreds of tables.
Again, a little contrived but the point is salient. Clarity in the database is well, well worth a byte per row.
Technical stuff
I'm not a MySQL guy, but in Oracle, a null column at the end of a row takes no additional space. In Oracle I would use a Varchar2(1) and let 'T' = default and leave the others null. That would have the effect of using only 1 additional byte in total, and not per row. YMMV with MySQL; you can pose that question separately if you can't Google the answer.
But the time to worry about that is on millions of rows, not hundreds. Any table which feeds a dropdown will never be big enough to start worrying about extra bytes.
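A MySQL sketch of the flag-column approach, assuming a shops lookup table (the table and column names are illustrative):
ALTER TABLE shops ADD COLUMN is_default TINYINT(1) NOT NULL DEFAULT 0;
UPDATE shops SET is_default = 1 WHERE name = 'Ebay';
-- one round trip fills the dropdown and identifies the initial selection
SELECT shop_id, name, is_default FROM shops ORDER BY name;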
What if you create an XML document and store it in an XML column? The XML could have a tag per table with a sub-node holding its default values.
You should rather create a table with two columns and n rows:
Defaults table:
buyer, David
shop, Ebay
brand, Dell
This way you can add new values without having to change the table structure.
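A sketch of that key/value layout (names assumed):
CREATE TABLE defaults (
  setting_name  VARCHAR(64)  NOT NULL PRIMARY KEY,
  setting_value VARCHAR(128) NULL
);
INSERT INTO defaults (setting_name, setting_value) VALUES
  ('buyer', 'David'),
  ('shop',  'Ebay'),
  ('brand', 'Dell');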
You can create a catalog table (some kind of metadata table) containing the default values as strings for the desired table columns. Then you can use the convert function for getting the appropriate value. Below is a sample table definition (Transact-SQL was used):
create table dbo.cat_default_values
(
id_column varchar(30) not null,
id_table varchar(30) not null,
datatype varchar(30) not null,
value varchar(100) not null,
f_creation datetime not null,
usr_creation char(8) null,
primary key clustered (id_column, id_table)
)
declare @defaultValueInt int,
@defaultValueVarchar varchar(30)
select @defaultValueInt = convert(int, value)
from cat_default_values where id_column = 'defColumInteger' and id_table = 'table1'
select @defaultValueVarchar = value
from cat_default_values where id_column = 'defColumVarchar' and id_table = 'table1'
What you are trying to store is not metadata, so I would not invent an external data store (coupled with extra code) to hold it.
I assume you have PK sequence generation logic under your control. I would assign a magic number x and insert a record in each table with _id = x as the default value. If you want to show the user the default value, you can handle it in your query in a uniform way, or you can handle it in application logic on insert. The good thing about this is that you have access to the default value at all times without writing any extra logic, and the logic for maintaining the default value of a table can be maintained using the same code (templating ;)
(From the lessons W3c learned from modeling schema information of XML using DTD.)
The only catch is that this logic should be made explicit, either through extensive documentation or enforced with a trigger.
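A sketch of the magic-id convention against a hypothetical shops table (the reserved value 0 is an assumption; with an AUTO_INCREMENT key, inserting 0 requires the NO_AUTO_VALUE_ON_ZERO sql_mode or a different reserved value):
INSERT INTO shops (shop_id, name) VALUES (0, 'Ebay');  -- the reserved default row
-- reading the default is uniform across all tables
SELECT name FROM shops WHERE shop_id = 0;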