Best practice for enabling "undelete" for database entities? - mysql

This is for a CRM application using PHP/MySQL. Various entities like customer, contact, note, etc, can be "deleted" by the user. Rather than actually deleting the entity from the database, I just want it to appear deleted to the application, but be kept in the DB and able to be "restored" if needed at a later time. Maybe even add some kind of "recycle bin" to the app.
I've thought of several ways to do this:
Move the deleted entity to another table. (customer to customer_deleted)
Change an attribute on the entity. (enabled to false)
I'm sure there are other ways and that each have their own implications on DB size, performance, etc, I'm just wondering what's the generally recommended way to do something like this?

I would go a combination of both:
Set a flag deleted to true
Use a cron job to move the entries after a while to a table of type ARCHIVE
If you need to restore the entry, select it back into the article table and delete it from the archive
Why would I go this way?
If a customer deletes the wrong entry, the restore can be done instantly
After a few weeks or months the article table may grow too large, so I would archive all entries that have been flagged as deleted for a week or so
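A rough sketch of that flag-then-archive flow, using Python's built-in sqlite3 as a stand-in for MySQL so it can be run anywhere. The table names (article, article_archive), the deleted_at column and the cutoff handling are assumptions made for illustration; in MySQL the archive table could use the ARCHIVE engine and the job would be started from cron.

```python
import sqlite3

# SQLite stands in for MySQL; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT, deleted_at TEXT);
CREATE TABLE article_archive (id INTEGER PRIMARY KEY, title TEXT, deleted_at TEXT);
INSERT INTO article VALUES (1, 'live', NULL);
INSERT INTO article VALUES (2, 'old trash', '2020-01-01 00:00:00');
INSERT INTO article VALUES (3, 'fresh trash', '2020-01-09 00:00:00');
""")

def archive_older_than(cutoff):
    """Move rows flagged as deleted before `cutoff` into the archive table."""
    with conn:  # both statements commit together or not at all
        conn.execute(
            "INSERT INTO article_archive (id, title, deleted_at) "
            "SELECT id, title, deleted_at FROM article "
            "WHERE deleted_at IS NOT NULL AND deleted_at < ?", (cutoff,))
        conn.execute(
            "DELETE FROM article "
            "WHERE deleted_at IS NOT NULL AND deleted_at < ?", (cutoff,))

def restore_from_archive(article_id):
    """Copy an archived row back as a live row, then drop it from the archive."""
    with conn:
        conn.execute(
            "INSERT INTO article (id, title, deleted_at) "
            "SELECT id, title, NULL FROM article_archive WHERE id = ?",
            (article_id,))
        conn.execute("DELETE FROM article_archive WHERE id = ?", (article_id,))

def ids(table):
    return [r[0] for r in conn.execute(f"SELECT id FROM {table} ORDER BY id")]

archive_older_than("2020-01-08 00:00:00")
main_after_archive = ids("article")
archive_after_archive = ids("article_archive")
restore_from_archive(2)
main_after_restore = ids("article")
```

Recently flagged rows stay in the main table, so an instant restore is just a matter of clearing the flag; only the older ones need the copy-back step.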

A common practice is to set a deleted_at column to the date at which the entity was deleted by the user (defaults to null). You may also include a deleted_by column for marking who deleted it. Using some kind of deleted column makes FK relationships easier to work with since these won't break. By moving the row to a new table you would have to update FKs (and then update them again if you ever undelete). The downside is that you have to ensure all your queries exclude deleted rows (whereas this wouldn't be a problem if you moved the row to a new table). Many ORMs make this filtering easier, so it depends on what you are using.
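For illustration, here is a minimal sketch of the deleted_at / deleted_by idea. It uses Python's sqlite3 module in place of PHP/MySQL purely so it is self-contained; the customer table layout and the helper names (soft_delete, restore, live_customers) are made up for the example.

```python
import sqlite3
from datetime import datetime, timezone

# SQLite stands in for MySQL; table, column and function names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " deleted_at TEXT DEFAULT NULL,"    # NULL means the row is live
    " deleted_by INTEGER DEFAULT NULL)"  # id of the user who deleted it
)
conn.executemany("INSERT INTO customer (name) VALUES (?)", [("Alice",), ("Bob",)])

def soft_delete(customer_id, user_id):
    """Mark the row as deleted instead of removing it."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "UPDATE customer SET deleted_at = ?, deleted_by = ? WHERE id = ?",
        (now, user_id, customer_id))

def restore(customer_id):
    """Undelete by clearing the markers; foreign keys never changed."""
    conn.execute(
        "UPDATE customer SET deleted_at = NULL, deleted_by = NULL WHERE id = ?",
        (customer_id,))

def live_customers():
    """Every read path has to carry this filter."""
    rows = conn.execute(
        "SELECT name FROM customer WHERE deleted_at IS NULL ORDER BY id")
    return [r[0] for r in rows]

soft_delete(2, user_id=7)
after_delete = live_customers()
restore(2)
after_restore = live_customers()
```

A "recycle bin" screen would simply be the opposite filter (deleted_at IS NOT NULL).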

Related

What's a suitable table design for objects which can be "trashed"

I'm designing a simple media "server" as part of a larger application. I've chosen to adopt similar terminology as the AWS S3 service, i.e Objects and Buckets (i.e files and directories).
I have two tables:
cdn_bucket
id, directory
and
cdn_object
id, bucket_id, filename, is_deleted
Other tables in the database can include objects using a foreign key on cdn_object.id. This has nice side-effects in that I can specify a constraint to set the field NULL in the event that the object is deleted (or indeed prevent deletion if needed). e.g:
blog_post
id, title, body, featured_image
CONSTRAINT: featured_image = cdn_object.id ON DELETE SET NULL
I was told once that I shouldn't delete things, ever (that's an argument for another post, please don't comment on it here); hence the is_deleted flag. To clarify the question, this is what I mean by "trashed", i.e recoverable.
This works great, however I can't leverage the cascading functionality of the constraints (i.e I mark an object as deleted, but the referring table, e.g blog_post.featured_image references the old ID).
I was wondering what the SO opinions might be on the following two approaches, or if there's another approach which might be better.
1. Join the cdn_object table
SELECT bp.*, cdno.id featured_image FROM blog_post bp JOIN cdn_object cdno ON cdno.id = bp.featured_image AND cdno.is_deleted = 0.
Pro: easy to implement.
Con: every query has to join the cdn_object table.
or
2. Use a trash table
Have another table, cdn_object_trash, and have the code 'move' the row out of cdn_object when it's deleted, triggering all the cascading constraints.
Pro: allows the relational rules to do what they were designed to do
Con: bad by design? Not sure.
My gut feeling tells me I should use the is_deleted flag and write code accordingly, but this is a generic class and so I'd prefer to not force the developer to write the join every time if I can configure that logic in the DB.
I hope my situation/question is clear, please ask me to clarify any points if needed.
Your third option is to set up a reasonable backup and retention schedule, and use cascading deletes. While I understand the desire to "never delete anything", abiding by that principle is forcing you to be redundant in your programming choices (option 1) or to figure out how to build a trash table to redundantly store information (option 2; do you build a single table with a string representation of the data, or do you make a trash copy of the schema?). Both of those choices seem like a lot of work to maintain (over the long haul).
I've worked with variants of both choices, and if those were the only options on the table, option 1 is a bit easier to maintain; however, you have to be EXTREMELY diligent in using it, and you have to make sure that future development efforts live up to that same standard.
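To make the trade-off of option 2 concrete, here is a small sketch using Python's sqlite3 as a stand-in for MySQL. The schema mirrors the question's tables, but the trash_object helper and the column list of cdn_object_trash are assumptions. It shows the constraint firing when the row is moved, and also that the reference is then gone for good unless you record it somewhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE cdn_object (id INTEGER PRIMARY KEY, bucket_id INTEGER, filename TEXT);
CREATE TABLE cdn_object_trash (id INTEGER PRIMARY KEY, bucket_id INTEGER, filename TEXT);
CREATE TABLE blog_post (
    id INTEGER PRIMARY KEY,
    title TEXT,
    featured_image INTEGER REFERENCES cdn_object(id) ON DELETE SET NULL
);
INSERT INTO cdn_object VALUES (10, 1, 'hero.png');
INSERT INTO blog_post VALUES (1, 'Hello', 10);
""")

def trash_object(object_id):
    """Hypothetical helper: copy the row to the trash table, then really delete it."""
    with conn:  # one transaction for the copy and the delete
        conn.execute(
            "INSERT INTO cdn_object_trash (id, bucket_id, filename) "
            "SELECT id, bucket_id, filename FROM cdn_object WHERE id = ?",
            (object_id,))
        # The real DELETE is what fires ON DELETE SET NULL on blog_post.
        conn.execute("DELETE FROM cdn_object WHERE id = ?", (object_id,))

trash_object(10)
featured = conn.execute(
    "SELECT featured_image FROM blog_post WHERE id = 1").fetchone()[0]
trashed = conn.execute("SELECT filename FROM cdn_object_trash").fetchall()
```

After the move, blog_post.featured_image is NULL, so restoring the object from the trash would not reattach it to the post by itself. That is the extra bookkeeping a trash table tends to bring with it.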

Proper way to store requests in Mysql (or any) database

What is the "proper" (most normalized?) way to store requests in the database? For example, a user submits an article. This article must be reviewed and approved before it is posted to the site.
Which is the more proper way:
A) store it in the Articles table with an "Approved" field which is either 0, 1, or 2 (denied, approved, pending)
OR
B) Have an ArticleRequests table which has the same fields as Articles, and upon approval, move the row data from ArticleRequests to Articles.
Thanks!
Since every article is going to have an approval status, and each time an article is requested you're very likely going to need to know that status - keep it inline with the table.
Do consider calling the field ApprovalStatus, though. You may want to add a related table to contain each of the statuses unless they aren't going to change very often (or ever).
EDIT: Reasons to keep fields in related tables are:
If the related field is not always applicable, or may frequently be null.
If the related field is only needed in rare scenarios and is better described by using a foreign key into a related table of associated attributes.
In your case those above reasons don't apply.
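As a sketch of what option A with a related status table might look like (Python's sqlite3 standing in for MySQL; the approval_status table, the column names and the approve helper are assumptions, while the 0/1/2 codes come from the question):

```python
import sqlite3

# SQLite stands in for MySQL; names other than the 0/1/2 codes are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE approval_status (id INTEGER PRIMARY KEY, label TEXT NOT NULL UNIQUE);
INSERT INTO approval_status VALUES (0, 'denied'), (1, 'approved'), (2, 'pending');
CREATE TABLE articles (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    -- new submissions start out as pending
    approval_status INTEGER NOT NULL DEFAULT 2 REFERENCES approval_status(id)
);
INSERT INTO articles (title) VALUES ('First draft'), ('Second draft');
""")

def approve(article_id):
    """Approval is an in-place status change; no row moves between tables."""
    conn.execute("UPDATE articles SET approval_status = 1 WHERE id = ?", (article_id,))

def titles_with_status(label):
    rows = conn.execute(
        "SELECT a.title FROM articles a "
        "JOIN approval_status s ON s.id = a.approval_status "
        "WHERE s.label = ? ORDER BY a.id", (label,))
    return [r[0] for r in rows]

approve(1)
published = titles_with_status("approved")
pending = titles_with_status("pending")
```

If the set of statuses will never change, the lookup table can be dropped and the integer (or an ENUM in MySQL) kept on its own.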
Definitely do 'A'.
If you do B, you'll be creating a new table with the same fields as the other one and that means you're doing something wrong. You're repeating yourself.
I think it's better to store the data in the main table with a specific status, because then there is no need to move data between tables when an article is approved, and the article appears on the site at the same moment. If you don't want to keep disapproved articles, you can create a cron script which removes the unnecessary data or moves it to an archive table. This puts less load on your DB because you can schedule the cleanup of old articles for a quiet time, for example at night.
Regarding the problem of using the approval status in each query: if you are planning a very popular, high-load site with article search or listings, you will likely use a standalone search server like Sphinx or Solr (MySQL is not a good solution for this purpose) and feed it only the rows with status='Approved'. Delta indexing helps you keep that data up to date.

Versioned and indexed data store

I have a requirement to store all versions of an entity in an easily indexed way and was wondering if anyone has input on what system to use.
Without versioning the system is simply a relational database with a row per, for example, person. If the person's state changes that row is changed to reflect this. With versioning the entry should be updated in such a way so that we can always go back to a previous version. If I could use a temporal database this would be free and I would be able to ask 'what is the state of all people as of yesterday at 2pm living in Dublin and aged 30'. Unfortunately there doesn't seem to be any mature open source projects that can do temporal.
A really nasty way to do this is just to insert a new row per state change. This leads to duplication, as a person can have many fields but only one changing per update. It is also then quite slow to select the correct version for every person given a timestamp.
In theory it should be possible to use a relational database and a version control system to mimic a temporal database but this sounds pretty horrendous.
So I was wondering if anyone has come across something similar before and how they approached it?
Update
As suggested by Aaron here's the query we currently use (in mysql). It's definitely slow on our table with >200k rows. (id = table key, person_id = id per person, duplicated if the person has many revisions)
select name from person p where p.id = (select max(id) from person where person_id = p.person_id and timestamp <= :timestamp)
Update
It looks like the best way to do this is with a temporal db but given that there aren't any open source ones out there the next best method is to store a new row per update. The only problem is duplication of unchanged columns and a slow query.
There are two ways to tackle this. Both assume that you always insert new rows. In every case, you must insert a timestamp (created) which tells you when a row was "modified".
The first approach uses a number to count how many instances you already have. The primary key is the object key plus the version number. The problem with this approach seems to be that you'll need a select max(version) to make a modification. In practice, this is rarely an issue since for all updates from the app, you must first load the current version of the person, modify it (and increment the version) and then insert the new row. So the real problem is that this design makes it hard to run updates in the database (for example, assign a property to many users).
The next approach uses links in the database. Instead of a composite key, you give each object a new key and you have a replacedBy field which contains the key of the next version. This approach makes it simple to find the current version (... where replacedBy is NULL). Updates are a problem, though, since you must insert a new row and update an existing one.
To solve this, you can add a back pointer (previousVersion). This way, you can insert the new rows and then use the back pointer to update the previous version.
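Here is a rough sketch of the first approach (object key plus version number, always inserting new rows), with Python's sqlite3 standing in for MySQL. The column names, the save_new_version helper and the sample data are invented for the example, and the as-of lookup is just one possible way to write it.

```python
import sqlite3

# SQLite stands in for MySQL; schema and helper names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    person_id INTEGER NOT NULL,
    version   INTEGER NOT NULL,
    name      TEXT,
    city      TEXT,
    created   TEXT NOT NULL,           -- when this version was written
    PRIMARY KEY (person_id, version)   -- object key plus version number
);
""")

def save_new_version(person_id, name, city, created):
    """Never update in place: read the latest version number and insert the next one."""
    with conn:
        (latest,) = conn.execute(
            "SELECT COALESCE(MAX(version), 0) FROM person WHERE person_id = ?",
            (person_id,)).fetchone()
        conn.execute(
            "INSERT INTO person VALUES (?, ?, ?, ?, ?)",
            (person_id, latest + 1, name, city, created))

def as_of(person_id, timestamp):
    """State of one person at a point in time, or None if they did not exist yet."""
    return conn.execute(
        "SELECT name, city FROM person "
        "WHERE person_id = ? AND created <= ? "
        "ORDER BY version DESC LIMIT 1",
        (person_id, timestamp)).fetchone()

save_new_version(1, "Ann", "Cork", "2024-01-01 09:00:00")
save_new_version(1, "Ann", "Dublin", "2024-02-01 09:00:00")

in_january = as_of(1, "2024-01-15 00:00:00")
in_march = as_of(1, "2024-03-01 00:00:00")
before_any = as_of(1, "2023-06-01 00:00:00")
```

Because (person_id, version) is the primary key, the per-person lookup can walk the key backwards instead of running a correlated max() subquery for every row.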
Here is a (somewhat dated) survey of the literature on temporal databases: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.6988&rep=rep1&type=pdf
I would recommend spending a good while sitting down with those references and/or Google Scholar to try to find some good techniques that fit your data model. Good luck!

How to delete from a database?

I know of two ways to delete data from a database table
DELETE it forever
Use a flag like isActive/isDeleted
Now the problem with isActive is that I have to check everywhere in my SQL queries whether the record is active or not. Using DELETE however gets rid of the data forever.
What would be the best way to backup this data?
Assuming I have multiple tables in a database, should I have a common function which just backs everything up and stores it in another table (in XML probably?) or is there any other way.
I am using MySQL but am curious about techniques used in other DBs as well.
Replace the table with a view that hides the inactive items.
Or write a trigger on DELETE that backs up the row to an archive table.
You could use a trigger that fires on deleting records to back them up into some kind of graveyard table.
You could use an isDeleted column and define a view which selects all columns except isDeleted with the condition isDeleted=false. Then have all your stored procedures work only with the view.
You could maintain a history table, where you back the record up and time stamp
One of the biggest reasons for not deleting data is that it may be required for a relation - for example, the user may decide to delete an old customer from the database, but you still need the customer record because it is referenced by old invoices (which may have a much longer lifespan).
Based on this the best solution is often the "IsDeleted" type of column, combined with a view (Quassnoi has mentioned partitioning, which can help with performance issues that might pop up due to a lot of invisible data).
You can partition your tables on the DELETED column and define the views which would include the condition:
… AND deleted = 0
This will make the queries over the active data just as simple and efficient.
Well, if you were using SqlServer you can use triggers, which will allow you to move the record to a deleted table.
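The view and trigger suggestions from these answers can be combined; below is a small sketch, run against SQLite through Python's sqlite3 so that it is self-contained. The names customer_all, customer, customer_graveyard and customer_backup are made up, and MySQL's trigger syntax differs slightly (for example it needs FOR EACH ROW), so treat this as an outline rather than copy-paste DDL.

```python
import sqlite3

# SQLite stands in for MySQL; every object name here is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_all (
    id INTEGER PRIMARY KEY,
    name TEXT,
    is_deleted INTEGER NOT NULL DEFAULT 0
);

-- The application reads from the view, so it never sees flagged rows.
CREATE VIEW customer AS
    SELECT id, name FROM customer_all WHERE is_deleted = 0;

-- Graveyard table filled by the trigger whenever a row is really deleted.
CREATE TABLE customer_graveyard (id INTEGER, name TEXT, deleted_at TEXT);

CREATE TRIGGER customer_backup BEFORE DELETE ON customer_all
BEGIN
    INSERT INTO customer_graveyard VALUES (OLD.id, OLD.name, datetime('now'));
END;

INSERT INTO customer_all (name) VALUES ('Alice'), ('Bob'), ('Carol');
""")

# Soft delete: flag the row, the view hides it.
conn.execute("UPDATE customer_all SET is_deleted = 1 WHERE name = 'Bob'")
# Hard delete: the trigger copies the row to the graveyard first.
conn.execute("DELETE FROM customer_all WHERE name = 'Carol'")

visible = [r[0] for r in conn.execute("SELECT name FROM customer ORDER BY id")]
graveyard = [r[0] for r in conn.execute("SELECT name FROM customer_graveyard")]
```

You would normally pick one of the two mechanisms; they are shown side by side here only to compare what each leaves behind.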

MySQL - Saving and loading

I'm currently working on a game, and a while ago I started on loading and saving.
I've been thinking, but I really can't decide, since I'm not sure which would be more efficient.
My first option:
When a user registers, only the one record is inserted (into 'characters' table). When the user tries to login, and after he/she has done so successfully, the server will try loading all information from the user (which is separate across multiple tables, and combines via mysql 'LEFT JOIN'), it'll run though all the information it has and apply them to the entity instance, if it runs into a NULL (which means the information isn't in the database yet) it'll automatically use a default value.
At saving, it'll insert or update, so that any defaults that have been generated at loading will be saved now.
My second option:
Simply insert all the required rows at registration (the rows are inserted from the website when the registration is finished).
Downsides to first option: useless checks if the user has logged in once already, since all the tables will be generated after first login.
Upsides to first option: if any records are deleted from the tables, it would insert default data instead of kicking the player off saying its character information is damaged/lost.
Downsides to second option: it could waste a bit of memory, since all tables are inserted at registration, and there could be spamming bots, and people who don't even manage to get online.
Upsides to second option: We don't have to check for anything in the server.
I also noted that the first option may screw up any search systems (via admincp, if we try looking up specific users).
I would go with the second option, add default rows to your user account, and flag the main user table as incomplete. This will maintain data integrity across your database, since every user record is complete in its entirety. If you need to remove the record, you can simply add a cascading delete script to clean house.
Also, I wouldn't develop your data schema based on malicious bots creating accounts. If you are concerned about the integrity of your user accounts, add some sort of data validation to your solution, or an automated clean-house script to clear out incomplete accounts once they meet certain criteria, e.g. the creation date passing a certain threshold.
You mention that there's multiple tables of data for each user, with some that can have a default value if none exist in the table. I'm guessing this is set up something like a main "characters" table, with username, password, and email, and a separate table for something like "favorite shortcuts around the site", and if they haven't specified personal preferences, it defaults to a basic list of "profile, games list, games by category" etc.
Then the question becomes when registering, should an explicit copy of the favorite shortcuts default be added for that user, or have the null value default to a default list?
I'd suggest that it depends on the nature of the auxiliary data tables; specifically the default value for those tables. How often would the defaults change? If the default changes often, a setup like your first option would mean that users with only a 'basic' entry automatically pick up the new default, while those who did specify their own entries would keep their preferences. Using your second option, if the default changed, a search/replace would have to be done to change entries that held the old default to the new default in order to keep users updated.
The other suggestion is to take another look at your database structure. You don't mention that your current table layout is set in stone; is there a way to not have all the LEFT JOIN tables, and have just one 'characters' table?
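For reference, the first option (load with LEFT JOIN, fall back to defaults on NULL, write the row on save) could look roughly like this. Python's sqlite3 is used as a stand-in for MySQL, and the character_stats table, its columns and the default values are all assumptions since the question does not show the real schema.

```python
import sqlite3

# SQLite stands in for MySQL; schema, defaults and helper names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE characters (id INTEGER PRIMARY KEY, username TEXT NOT NULL);
CREATE TABLE character_stats (
    character_id INTEGER PRIMARY KEY REFERENCES characters(id),
    level INTEGER,
    gold  INTEGER
);
INSERT INTO characters VALUES (1, 'veteran'), (2, 'newbie');
INSERT INTO character_stats VALUES (1, 12, 340);  -- 'newbie' has no stats row yet
""")

DEFAULT_LEVEL, DEFAULT_GOLD = 1, 0

def load_character(character_id):
    """LEFT JOIN keeps the character even without a stats row; COALESCE fills defaults."""
    return conn.execute(
        "SELECT c.username, COALESCE(s.level, ?), COALESCE(s.gold, ?) "
        "FROM characters c "
        "LEFT JOIN character_stats s ON s.character_id = c.id "
        "WHERE c.id = ?",
        (DEFAULT_LEVEL, DEFAULT_GOLD, character_id)).fetchone()

def save_character(character_id, level, gold):
    """Insert-or-update on save. INSERT OR REPLACE is SQLite syntax;
    MySQL would use INSERT ... ON DUPLICATE KEY UPDATE instead."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO character_stats VALUES (?, ?, ?)",
            (character_id, level, gold))

veteran = load_character(1)
newbie = load_character(2)
save_character(2, newbie[1], newbie[2])  # defaults become real rows on first save
stats_rows = conn.execute("SELECT COUNT(*) FROM character_stats").fetchone()[0]
```

Pushing the defaults into the query keeps the NULL checks out of the game server's loading code, which takes away part of the stated downside of the first option.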