Versioned and indexed data store - MySQL

I have a requirement to store all versions of an entity in an easily indexed way and was wondering if anyone has input on what system to use.
Without versioning the system is simply a relational database with a row per entity: for example, one row per person. If the person's state changes, that row is changed to reflect this. With versioning, the entry should be updated in such a way that we can always go back to a previous version. If I could use a temporal database this would come for free, and I would be able to ask 'what is the state of all people, as of yesterday at 2pm, living in Dublin and aged 30'. Unfortunately there don't seem to be any mature open-source projects that support temporal queries.
A really nasty way to do this is just to insert a new row per state change. This leads to duplication, since a person has many fields but perhaps only one changes per update. It also makes it quite slow to select the correct version of every person given a timestamp.
In theory it should be possible to use a relational database and a version control system to mimic a temporal database but this sounds pretty horrendous.
So I was wondering if anyone has come across something similar before and how they approached it?
Update
As suggested by Aaron, here's the query we currently use (in MySQL). It's definitely slow on our table with >200k rows. (id is the table key; person_id is the per-person id, duplicated when a person has many revisions.)
select name from person p where p.id = (select max(id) from person where person_id = p.person_id and timestamp <= :timestamp)
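For reference, one rewrite we are considering (a sketch; like the query above, it assumes id increases with each revision) pairs a composite index with a derived table, so the latest revision per person can be resolved from the index:

alter table person add index idx_person_rev (person_id, timestamp, id);

select p.name
from person p
join (select person_id, max(id) as max_id
      from person
      where timestamp <= :timestamp
      group by person_id) latest on p.id = latest.max_id;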
Update
It looks like the best way to do this is with a temporal DB, but given that there aren't any open-source ones out there, the next best method is to store a new row per update. The only problems are the duplication of unchanged columns and a slow query.

There are two ways to tackle this. Both assume that you always insert new rows. In every case, you must insert a timestamp (created) which tells you when a row was "modified".
The first approach uses a number to count how many instances you already have. The primary key is the object key plus the version number. The problem with this approach seems to be that you'll need a select max(version) to make a modification. In practice, this is rarely an issue since for all updates from the app, you must first load the current version of the person, modify it (and increment the version) and then insert the new row. So the real problem is that this design makes it hard to run updates in the database (for example, assign a property to many users).
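A minimal sketch of this design; the person table and its columns are illustrative:

create table person (
  person_id int not null,                          -- stable key of the person
  version   int not null,                          -- incremented on every change
  name      varchar(100),
  created   timestamp not null default current_timestamp,
  primary key (person_id, version)
);

-- the app loads the current row, modifies it, and inserts the next version:
insert into person (person_id, version, name)
select person_id, max(version) + 1, 'New Name'
from person
where person_id = 42
group by person_id;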
The next approach uses links in the database. Instead of a composite key, you give each object a new key and you have a replacedBy field which contains the key of the next version. This approach makes it simple to find the current version (... where replacedBy is NULL). Updates are a problem, though, since you must insert a new row and update an existing one.
To solve this, you can add a back pointer (previousVersion). This way, you can insert the new rows and then use the back pointer to update the previous version.
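A sketch of the linked design with both pointers (replacedBy and previousVersion as described above; the rest is illustrative):

create table person (
  id              int primary key auto_increment,  -- key of this version
  person_id       int not null,                    -- stable key of the person
  name            varchar(100),
  replacedBy      int null,                        -- forward pointer to the next version
  previousVersion int null,                        -- back pointer to the prior version
  created         timestamp not null default current_timestamp
);

-- insert the new version first, carrying the id of the row it replaces (@old_id):
insert into person (person_id, name, previousVersion)
values (42, 'New Name', @old_id);

-- then follow the back pointer to close out the old version:
update person set replacedBy = last_insert_id() where id = @old_id;

-- the current version of everyone:
select * from person where replacedBy is null;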

Here is a (somewhat dated) survey of the literature on temporal databases: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.6988&rep=rep1&type=pdf
I would recommend spending a good while sitting down with those references and/or Google Scholar to try to find some good techniques that fit your data model. Good luck!

Related

How to design a MySQL table that tracks the Status of each Asset, as well as every old Status?

I would like to create a table that tracks the status of each asset as well as each past status. Basically I want to keep a log of all status changes.
Do I create a timestamp for each updated status and have every update be its own separate row, linked back to the asset through the assetid? Then sort by the timestamp to get these statuses in order? I can see this table getting unwieldy if there are tons of rows for each asset and the table grows linearly over time.
This is for a MySQL database.
Here is an example of how I have designed a database table for tracking/logging purposes.
Columns:
auto-increment PK (if you don't have a better PK)
timestamp
tracked object id (asset_id in your case)
event type (you probably don't need this, but it is explained below)
content (this could also be named status in your case)
My example is very simplified, but the main idea is to insert each record as its own row. Create the table with proper primary keys and indexes to get good search performance.
Using this structure you should be able to search by asset or by status, get the latest changes, and so on. The exact structure depends on your needs, so I have usually modified it to fit them.
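A minimal sketch of the table and two typical queries (all names are illustrative):

create table asset_status_log (
  id         bigint primary key auto_increment,
  ts         timestamp not null default current_timestamp,
  asset_id   int not null,
  event_type varchar(50),                          -- optional, see below
  status     varchar(50) not null,
  index idx_asset_ts (asset_id, ts)
);

-- full history of one asset, newest first:
select * from asset_status_log where asset_id = 42 order by ts desc;

-- latest status of every asset:
select l.*
from asset_status_log l
join (select asset_id, max(id) as max_id
      from asset_status_log
      group by asset_id) m on l.id = m.max_id;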
Don't worry too much about the event column. I just put it here because most implementations like this are based on event sourcing. Here is a link to one article that explains it: http://scottlobdell.me/2017/01/practical-implementation-event-sourcing-mysql/
I suggest reading more about event sourcing to see whether that design could work in your case. Look only at the database example, because it is similar to mine.
In the end you have a journal of status changes; it is then up to your code how to read the data and present the results.
About the linear growth: I would not say it is a big problem. Of course, if you can say more about what "tons of rows" means, please ask. I have not seen any scaling problems; the same structure works very well with relational and NoSQL databases, and MySQL also has features to optimize this kind of structure if the size of the data becomes an issue.

What's a suitable table design for objects which can be "trashed"

I'm designing a simple media "server" as part of a larger application. I've chosen to adopt similar terminology to the AWS S3 service, i.e. Objects and Buckets (i.e. files and directories).
I have two tables:
cdn_bucket
id, directory
and
cdn_object
id, bucket_id, filename, is_deleted
Other tables in the database can include objects using a foreign key on cdn_object.id. This has nice side-effects in that I can specify a constraint to set the field to NULL in the event that the object is deleted (or indeed prevent deletion if needed), e.g.:
blog_post
id, title, body, featured_image
FOREIGN KEY (featured_image) REFERENCES cdn_object (id) ON DELETE SET NULL
I was told once that I shouldn't delete things, ever (that's an argument for another post, please don't comment on it here); hence the is_deleted flag. To clarify the question, this is what I mean by "trashed", i.e recoverable.
This works great; however, I can't leverage the cascading functionality of the constraints (i.e. I mark an object as deleted, but the referring table, e.g. blog_post.featured_image, still references the old ID).
I was wondering what the SO opinions might be on the following two approaches, or if there's another approach which might be better.
1. Join the cdn_object table
SELECT bp.*, cdno.id AS featured_image FROM blog_post bp LEFT JOIN cdn_object cdno ON cdno.id = bp.featured_image AND cdno.is_deleted = 0
Pro: easy to implement.
Con: every query has to join the cdn_object table.
or
2. Use a trash table
Have another table, cdn_object_trash, and have the code 'move' the row from cdn_object when it's deleted, triggering all the cascading constraints.
Pro: allows the relational rules to do what they were designed to do
Con: bad by design? Not sure.
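For what it's worth, the 'move' in option 2 would be a small transaction (a sketch, assuming cdn_object_trash mirrors cdn_object's columns):

start transaction;
insert into cdn_object_trash select * from cdn_object where id = :id;
delete from cdn_object where id = :id;  -- fires the ON DELETE SET NULL cascades
commit;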
My gut feeling tells me I should use the is_deleted flag and write code accordingly, but this is a generic class, so I'd prefer not to force the developer to write the join every time if I can configure that logic in the DB.
I hope my situation/question is clear, please ask me to clarify any points if needed.
Your third option is to set up a reasonable backup and retention schedule, and use cascading deletes. While I understand the desire to "never delete anything", abiding by that principle forces you to be redundant in your programming choices (option 1) or to figure out how to build a trash table to redundantly store information (option 2: do you build a single table with a string representation of the data, or do you make a trash copy of the schema?). Both of those choices seem like a lot of work to maintain over the long haul.
I've worked with variants of both choices, and if those were the only options on the table, option 1 is a bit easier to maintain; however, you have to be EXTREMELY diligent in using it, and you have to make sure that future development efforts live up to that same standard.

Best practice for enabling "undelete" for database entities?

This is for a CRM application using PHP/MySQL. Various entities like customer, contact, note, etc, can be "deleted" by the user. Rather than actually deleting the entity from the database, I just want it to appear deleted to the application, but be kept in the DB and able to be "restored" if needed at a later time. Maybe even add some kind of "recycle bin" to the app.
I've thought of several ways to do this:
Move the deleted entity to another table. (customer to customer_deleted)
Change an attribute on the entity. (enabled to false)
I'm sure there are other ways, and that each has its own implications for DB size, performance, etc. I'm just wondering what's the generally recommended way to do something like this?
I would go with a combination of both:
Set a deleted flag to true
Use a cronjob to move the entries after a while to a table using the ARCHIVE storage engine
If you need to restore an entry, select it back into the original table and delete it from the archive
Why would I go this way?
If a customer deleted the wrong one, the restore can be done instantly
After a few weeks/months the main table may grow too much, so I would archive all entries that have been deleted for more than, say, a week
A common practice is to set a deleted_at column to the date at which the entity was deleted by the user (defaulting to NULL). You may also include a deleted_by column marking who deleted it. Using some kind of deleted column makes FK relationships easier to work with, since these won't break. By moving the row to a new table you would have to update the FKs (and then update them again if you ever undelete). The downside is that you have to ensure all your queries exclude deleted rows (which wouldn't be a problem if you moved the row to a new table). Many ORMs make this filtering easier, so it depends on what you are using.
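A sketch of that approach against, say, a customer table (deleted_at and deleted_by as described; everything else is illustrative):

alter table customer
  add column deleted_at datetime null default null,
  add column deleted_by int null default null;

-- soft delete:
update customer set deleted_at = now(), deleted_by = :user_id where id = :id;

-- every normal query must now exclude deleted rows:
select * from customer where deleted_at is null;

-- restore from the "recycle bin":
update customer set deleted_at = null, deleted_by = null where id = :id;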

versioning each field vs history date field?

Which do you recommend and why?
I have a few tables; when I make a change to the data, it should go to a history (audit) table with an effective date.
The other solution is versioning each field, inserting a new row whenever the data changes.
Which is the best method for invoice information? Item name and price change all the time.
These are slowly changing dimensions, type 2 and type 4, respectively.
Both methods are valid and may be more appropriate for your needs, depending on your model and query requirements.
Basically, type 2 (versioning) is more appropriate when you need to query historical values as often as the current one, while type 4 (history table) is more suited when you are querying the current value more often and there are more queries (more queries to develop I mean) against the most recent value.
A system we use and are happy with:
For each table that requires history, we create a similar table, adding a timestamp field at the end which becomes part of the PK.
On each update of the original table, we insert into the history table with the same conditions:
update x set ... where something
insert into x_history
select *, now() from x where something
That keeps your data clean and your tables slim.
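If you would rather not rely on the application issuing both statements, a trigger can automate the copy. A sketch, assuming a table x (id, name); the names are illustrative:

create table x_history (
  id         int not null,
  name       varchar(100),
  changed_at timestamp not null default current_timestamp,
  primary key (id, changed_at)                     -- the timestamp is part of the PK
);

create trigger x_history_trg
after update on x
for each row
insert into x_history (id, name) values (new.id, new.name);

An identical AFTER INSERT trigger would also capture the first version of each row.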
My personal preference would be to use the Observer Pattern in your application and to implement a separate history table. This means that you can pull the data from the history table when you need it, and you don't compromise the speed of querying the main table.

MySQL - Best method of saving and loading items

So in my older work, I had always used the 'text' data type to store items, like so:
0=4151:54;1=995:5000;2=521:1;
So basically: slot=item:amount;
I've been looking into the best ways of storing information in a SQL database, and everywhere I go it says that using text is a big performance hit.
I was thinking of doing something else, like having a table with the following columns:
id, owner_id, slot_id, item_id, amount
Whereas now I can just insert a row for each item a character allocates. But I have no clue how to save them, since a slot's item can change, etc. A character has 28 inventory slots and 500 bank slots; should I insert them all at registration, or is there a smarter way to save the items?
Yes, use that structure. Using text to store relational data defeats the purpose of a relational database.
I don't see what you mean by insert them all at registration. Can you not insert them as you need to?
Edit
Based on your previous comment I would recommend only inserting a slot as it is needed (if I understand your problem). It may be an idea to keep the ID of the slot in the application, if need be.
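A sketch of that structure with an upsert, so a slot's item can change in place (columns from the question; the unique key and names are assumptions):

create table character_item (
  id       int primary key auto_increment,
  owner_id int not null,
  slot_id  smallint not null,
  item_id  int not null,
  amount   int not null default 1,
  unique key uq_owner_slot (owner_id, slot_id)     -- one row per slot per character
);

-- create the slot on first use, or overwrite it if it already exists:
insert into character_item (owner_id, slot_id, item_id, amount)
values (:owner, :slot, :item, :amount)
on duplicate key update item_id = values(item_id), amount = values(amount);

-- emptying a slot is just a delete:
delete from character_item where owner_id = :owner and slot_id = :slot;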
If I understand you correctly, and that the slot's item can change, then you want to further abstract the mapping between item_id and the item:
entry_tbl.item_id->item_rel_realitems_tbl.real_id->items_tbl
This way, all entries with an item_id point to a table that maps those IDs to a mutable item. When you UPDATE an item in items_tbl, the change is automatically reflected for every entry.
Another JOIN is needed, however. I would also use stored procedures in any case, to abstract the mechanism from the semantics.
I am not sure I understand the wording of your question however.
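For concreteness, the extra JOIN might look like this (a sketch; the key names are assumptions):

select e.slot_id, e.amount, i.*
from entry_tbl e
join item_rel_realitems_tbl r on r.id = e.item_id
join items_tbl i on i.id = r.real_id;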