Recently, I stumbled upon the following problem. Given a simple data model with "Books" and "Authors", each "Book" has a reference to an "Author". Persistence is achieved with a relational database. Besides adding books and authors, it is also possible to delete them. Usually, if I want to delete an Author, I would perform a SQL DELETE operation and remove the corresponding row. However, I have seen that in other projects people don't call DELETE. Instead, they add some kind of active/deleted flag and mark the corresponding row as "deleted".
My questions are: Is this in general best practice? What are the advantages? My best guess is:
Setting a flag performs better than a DELETE operation
If you run out of space, it is still possible to run a cleanup service which looks for deleted objects and removes the corresponding rows
Setting a delete flag is better for database consistency, because deleting an "Author" in the example above could break foreign key references in the corresponding "Book" entries.
Anyway, these are just guesses. Does someone know the answer?
There are many reasons to not use delete. First, maintaining history can be very important. I wouldn't use "just" a delete flag, but instead have dates of validity.
Second, in an operational system, delete can be an expensive operation. The row needs to be deleted from the table, from associated indexes, and then there might be cascading deletes and triggers.
Third, delete can prevent other operations from working well, because tables and rows and indexes get locked. This can slow down an operational system, particularly during peak periods.
Fourth, delete can make it tricky to maintain relational integrity, especially if the necessary cascading deletes are not defined.
Fifth, storage is cheap. Processing power is cheap. So, for many databases, deleting records to recover space is simply unnecessary.
This doesn't mean that you should always avoid deleting records. But there are very valid reasons for not rushing to remove data.
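As a rough illustration of the flag idea for the Books/Authors example, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, the `deleted_at` column and the function names are my own assumptions, not something from the question. The point is only that the author row stays in place, so the book's foreign key keeps pointing at something.

```python
import sqlite3
from datetime import datetime, timezone


def make_library_db():
    """Create an in-memory database with a hypothetical author/book schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript("""
        CREATE TABLE author (
            id         INTEGER PRIMARY KEY,
            name       TEXT NOT NULL,
            deleted_at TEXT              -- NULL means the author is still active
        );
        CREATE TABLE book (
            id        INTEGER PRIMARY KEY,
            title     TEXT NOT NULL,
            author_id INTEGER NOT NULL REFERENCES author(id)
        );
    """)
    return conn


def soft_delete_author(conn, author_id):
    """Mark the author as deleted instead of removing the row."""
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE author SET deleted_at = ? WHERE id = ? AND deleted_at IS NULL",
            (now, author_id),
        )


def active_authors(conn):
    """Every normal query has to filter out the soft-deleted rows."""
    rows = conn.execute(
        "SELECT name FROM author WHERE deleted_at IS NULL ORDER BY id"
    )
    return [row[0] for row in rows]
```

Storing a timestamp rather than a boolean is one way to move toward the "dates of validity" mentioned above. The cost is that every read query has to remember the `deleted_at IS NULL` filter.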
I have a web form that is used to create and update personal information. On save, I collect all the info in a large multidimensional JSON array. When updating the database, the information will potentially consist of three parts: new rows to be created, rows that need to be updated, and rows that need to be deleted. These rows are spread across about 5 tables.
My question is this: how should I approach the MySQL queries? My initial thought was to DELETE all the information from all the tables and do a clean INSERT of all the new information in one go. I guess the other approach would be to do 3 queries: UPDATE all those with an existing ID, DELETE all those marked for deletion, and INSERT all the newly created data (data without existing IDs).
Which of these approaches would be best, or is there a better way of doing this? Thanks for any advice. I appreciate it.
Delete-all followed by insert-all should NEVER be practiced.
Reasons:
It is far too costly. Mostly the user performs edits, so for what was just a few updates you run one delete and a hundred inserts.
It plays havoc with ON DELETE CASCADE foreign keys.
It upsets auto-increment fields even when they were apparently not touched.
You need to implement a unit of work. I don't know which language you are working with, but some languages have built-in support for that. In .NET we have DataSets.
Basics:
Keep track of each record you fetched from the database. Internally maintain a flag for each record to note which were loaded from the database (i.e. untouched), which have modifications (and need an update query), and which were newly added. For the deleted records, maintain a separate list (maybe of their IDs). How to achieve this is a matter for separate discussion.
When the user clicks Save, start a database transaction. This is not strictly part of the current discussion, but it is almost always done in similar conditions.
In the transaction, first loop through the deleted items array and fire a delete query for each of them.
Then loop through the modified items array. For each modified item you may simply update all of its columns to the latest values. If the number of columns is too large (>30), things change a bit.
Then come the newly created items. Fire one insert for each of them.
Finally commit the transaction.
If the language you are programming in supports try/catch blocks, perform all of the above steps (after beginning the transaction) inside the try block, and roll back the transaction in the catch block.
This approach looks more complicated and seems to fire more queries than the simple delete-all/insert-all approach, but trust me: we have been there, done that, and then spent sleepless nights undoing all that was done. Never go the delete/insert way unless you can really justify it.
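The save steps described above could be sketched roughly like this. It is an illustration only: it uses Python with sqlite3 rather than MySQL, covers a single hypothetical `person` table instead of five, and assumes the change tracking has already produced the three lists.

```python
import sqlite3

# Hypothetical single-table schema; the real form spans about five tables.
PERSON_SCHEMA = """
CREATE TABLE person (
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL,
    email TEXT NOT NULL UNIQUE
);
"""


def save_people(conn, deleted_ids, modified, added):
    """Apply one form submission as a single transaction.

    deleted_ids: ids of rows to remove
    modified:    (id, name, email) tuples for rows that were edited
    added:       (name, email) tuples for rows that have no id yet
    """
    try:
        cur = conn.cursor()
        # sqlite3 opens a transaction implicitly before the first
        # DELETE/UPDATE/INSERT, so no explicit BEGIN is issued here.
        for person_id in deleted_ids:
            cur.execute("DELETE FROM person WHERE id = ?", (person_id,))
        for person_id, name, email in modified:
            cur.execute(
                "UPDATE person SET name = ?, email = ? WHERE id = ?",
                (name, email, person_id),
            )
        for name, email in added:
            cur.execute(
                "INSERT INTO person (name, email) VALUES (?, ?)",
                (name, email),
            )
        conn.commit()
    except sqlite3.Error:
        # Any failure undoes the deletes and updates that already ran.
        conn.rollback()
        raise
```

If any statement fails (for instance a duplicate email), nothing from that submission is kept, which is the whole reason for wrapping the three loops in one transaction.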
How to do the change tracking depends a lot on the language and the type of application you are using. Even for .NET the approach differs between desktop applications and web applications. Tracking deletions is easy, and so is tracking new insertions. The update marks are applied by trapping the edit event on any of the columns of that record.
EDIT
The data spans about five tables, hence the three loops (delete/update/insert) have to be done five times, once for each table. First draw the relationships among the tables. Process the top table first, then the tables which are directly connected to the top-level table, and so on. If you have a cyclic relationship among the tables then you have to be especially careful.
The code for the Save operation is about to grow quite long: 5x3=15 operations, each with its own SQL. None of these operations are expected to be reusable, hence putting them in separate methods is futile. Everything is about to go in a large procedural block, so comment the code religiously and mark the table boundaries and the operations.
You probably don't want to do any deletes. Just mark the obsolete entries as "inactive", or maybe timestamp them as having an ending validity.
In using this philosophy, all edits are actually insertions. No modifications (except to change the "expire" field) and no deletes. To update a name, mark the record as expired and insert a new record with a beginning validity timestamp at the same time.
In such a database, auditing and data recovery are easily performed.
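To make the "every edit is an insertion" idea concrete, here is a small sketch in Python with sqlite3. The `person_version` table and its `valid_from`/`valid_to` columns are names I made up for illustration. Timestamps are passed in as ISO-8601 strings so that plain string comparison orders them correctly.

```python
import sqlite3

VERSION_SCHEMA = """
CREATE TABLE person_version (
    version_id INTEGER PRIMARY KEY,
    person_id  INTEGER NOT NULL,
    name       TEXT NOT NULL,
    valid_from TEXT NOT NULL,
    valid_to   TEXT               -- NULL marks the current version
);
"""


def set_name(conn, person_id, name, now):
    """'Update' a name: expire the current version, then insert a new one."""
    with conn:
        conn.execute(
            "UPDATE person_version SET valid_to = ? "
            "WHERE person_id = ? AND valid_to IS NULL",
            (now, person_id),
        )
        conn.execute(
            "INSERT INTO person_version (person_id, name, valid_from) "
            "VALUES (?, ?, ?)",
            (person_id, name, now),
        )


def expire(conn, person_id, now):
    """'Delete' a person: only the expiry timestamp changes, no row is removed."""
    with conn:
        conn.execute(
            "UPDATE person_version SET valid_to = ? "
            "WHERE person_id = ? AND valid_to IS NULL",
            (now, person_id),
        )


def current_name(conn, person_id):
    row = conn.execute(
        "SELECT name FROM person_version "
        "WHERE person_id = ? AND valid_to IS NULL",
        (person_id,),
    ).fetchone()
    return row[0] if row else None


def name_as_of(conn, person_id, when):
    """Auditing and recovery: read the version that was valid at a past moment."""
    row = conn.execute(
        "SELECT name FROM person_version "
        "WHERE person_id = ? AND valid_from <= ? "
        "AND (valid_to IS NULL OR valid_to > ?)",
        (person_id, when, when),
    ).fetchone()
    return row[0] if row else None
```

The only UPDATE that ever runs is the one that sets the expiry column, which matches the rule stated above.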
Sometimes you are required to keep your logs and records for crime-prevention purposes.
When you give users permission to delete records, it means that you'll lose evidence.
In ordinary cases, do you actually delete the record, or do you switch a "hidden" flag so that the log is kept?
If you allow any modification to data then you will lose evidence. Maybe you should design your database so you never use UPDATE or DELETE, only INSERT.
Unless the government has told you to keep all records, I recommend not going too much out of your way to do it.
Apart from keeping records for auditing purposes as you mention, the use of a 'Deleted' flag also allows you to incorporate 'undo' functionality.
If you physically delete data, then it will be quite a bit of work to get the old data back. But if you use flags then it can be as easy as re-setting the flag to get the data to re-appear.
If a lot of deletes happen in your database, then the downside of flags is that you will be holding on to a lot of data that isn't being used.
Instead of just deleting, you can first insert the record into a history table for any type of modification that happens. Then you will always have the data available without keeping needless information in your main table.
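A minimal sketch of the history-table idea, in Python with sqlite3. The `users` and `users_history` tables and the function names are assumptions for the example. The old row is copied and then changed or removed inside one transaction, so the two steps cannot drift apart.

```python
import sqlite3

HISTORY_SCHEMA = """
CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE users_history (
    history_id  INTEGER PRIMARY KEY,
    user_id     INTEGER NOT NULL,
    name        TEXT NOT NULL,
    change_type TEXT NOT NULL,     -- 'UPDATE' or 'DELETE'
    changed_at  TEXT NOT NULL
);
"""


def _archive(conn, user_id, change_type, changed_at):
    # Copy the row as it looks right now, before it is modified.
    conn.execute(
        "INSERT INTO users_history (user_id, name, change_type, changed_at) "
        "SELECT id, name, ?, ? FROM users WHERE id = ?",
        (change_type, changed_at, user_id),
    )


def rename_user(conn, user_id, new_name, changed_at):
    with conn:
        _archive(conn, user_id, "UPDATE", changed_at)
        conn.execute("UPDATE users SET name = ? WHERE id = ?", (new_name, user_id))


def delete_user(conn, user_id, changed_at):
    with conn:
        _archive(conn, user_id, "DELETE", changed_at)
        conn.execute("DELETE FROM users WHERE id = ?", (user_id,))
```

The main table only ever holds live rows, and restoring a deleted user means copying the last history row back.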
I've always had a weird feeling in my gut about actually deleting rows from certain types of tables.
For example, if I have a table of Users and a user deletes their account, rather than fully deleting their row, I have been marking it as "dead" or inactive. This allows me to retain a record of their existence if I ever need it again.
In situations like this, considering performance, overhead, etc., should I delete the row or simply mark it as inactive?
Which is more "common"?
Personally, I almost always use "soft deletes" as you describe.
If space is a concern, I'll have a job that will periodically clean up the soft-deleted records after they've been deleted for a certain amount of time.
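Such a cleanup job could look roughly like the following Python/sqlite3 sketch. The `users.deleted_at` column and the retention period are assumptions. The column is expected to hold UTC timestamps in the same ISO-8601 format that `isoformat()` produces, so that string comparison matches time order.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def purge_soft_deleted(conn, retention_days, now=None):
    """Hard-delete rows that were soft-deleted more than retention_days ago.

    Returns the number of rows that were physically removed.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=retention_days)).isoformat()
    with conn:
        cur = conn.execute(
            "DELETE FROM users "
            "WHERE deleted_at IS NOT NULL AND deleted_at < ?",
            (cutoff,),
        )
    return cur.rowcount
```

Run from a scheduler (cron or similar) with, say, `purge_soft_deleted(conn, 30)`, this keeps a 30-day "undo" window while still bounding table growth.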
Perhaps you could move the inactive MySQL records to a separate table designed to hold inactive accounts? That way, you could simply move them back over if you need to, or delete the table if database size becomes an issue.
Data is too valuable to be permanently deleted from the database. Mark it as dead.
I generally use a status column for such cases, following this pattern:
0 Inactive
1 Active
2 Trashed
In addition to "soft" deletes, another solution is to use "audit tables". I asked what they were on dba.stackexchange.com recently.
Audit tables are typically used to record actions, such as insert/update/delete, performed on a second table, possibly storing old and new values, time, etc.
They can be implemented using triggers in a straightforward way.
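As an example of the trigger approach, here is a sketch using SQLite trigger syntax driven from Python. The `users` and `users_audit` tables are hypothetical, and MySQL or other engines use slightly different trigger syntax, so treat this as the shape of the idea rather than something to paste into another database.

```python
import sqlite3

AUDIT_SCHEMA = """
CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE users_audit (
    audit_id   INTEGER PRIMARY KEY,
    action     TEXT NOT NULL,        -- 'INSERT', 'UPDATE' or 'DELETE'
    user_id    INTEGER NOT NULL,
    old_name   TEXT,                 -- NULL for inserts
    new_name   TEXT,                 -- NULL for deletes
    changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TRIGGER users_audit_insert AFTER INSERT ON users
BEGIN
    INSERT INTO users_audit (action, user_id, new_name)
    VALUES ('INSERT', NEW.id, NEW.name);
END;
CREATE TRIGGER users_audit_update AFTER UPDATE ON users
BEGIN
    INSERT INTO users_audit (action, user_id, old_name, new_name)
    VALUES ('UPDATE', OLD.id, OLD.name, NEW.name);
END;
CREATE TRIGGER users_audit_delete AFTER DELETE ON users
BEGIN
    INSERT INTO users_audit (action, user_id, old_name)
    VALUES ('DELETE', OLD.id, OLD.name);
END;
"""


def make_audited_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(AUDIT_SCHEMA)
    return conn


def audit_trail(conn):
    """Return the recorded actions in the order they happened."""
    return conn.execute(
        "SELECT action, user_id, old_name, new_name "
        "FROM users_audit ORDER BY audit_id"
    ).fetchall()
```

Application code keeps issuing ordinary INSERT, UPDATE and DELETE statements against `users`, and the audit rows appear as a side effect.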
Pros:
the "unused" data is in a separate table
it's easy to turn the level-of-detail knob from fine-grained to coarse-grained
it may be more efficient space-wise, depending on the exact implementation
Cons:
since the data is in a separate table, it could cause key conflicts if a row is later "undeleted"
it may be less efficient space-wise, depending on the exact implementation
This question made me remember this entertaining anecdote. My point: there are so many factors to consider when choosing between hard and soft delete that there is no rule of thumb telling you which one to pick.
It's not a specific question, more a general wondering.
When you have to make a delete on multiple tables in a 1:M relationship, is it better to make a FK constraint with a cascade delete or to join the tables in the delete statement?
I had an old project that had separate delete statements for related tables, and a few times some of the statements were not executed and data integrity was compromised. I had to make a decision between the two, so I was thinking a bit what would be a better solution.
There is also an option to make a stored procedure or a transaction.
So I am looking for an opinion or advice...?
I'd say it's safer to use a cascade delete. If you decide to use joins, you have to remember to use them every time you delete anything from the parent table; and even if you're disciplined enough to do that, you can't be sure about your coworkers or the people who will support your software in the future. Also, encoding such knowledge about table relationships more than once violates the DRY principle.
If you use a cascade delete though, nobody has to remember anything, and child rows will always be deleted as needed.
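For reference, a cascade delete is declared on the foreign key itself. Below is a minimal sketch in Python with sqlite3, using made-up `parent` and `child` tables. Note that SQLite only enforces foreign keys when `PRAGMA foreign_keys = ON` is set on the connection; other engines enforce them by default.

```python
import sqlite3


def make_cascade_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # required in SQLite for FK enforcement
    conn.executescript("""
        CREATE TABLE parent (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE child (
            id        INTEGER PRIMARY KEY,
            parent_id INTEGER NOT NULL
                      REFERENCES parent(id) ON DELETE CASCADE,
            label     TEXT NOT NULL
        );
    """)
    return conn


def delete_parent(conn, parent_id):
    """One statement; the database removes the matching child rows itself."""
    with conn:
        conn.execute("DELETE FROM parent WHERE id = ?", (parent_id,))
```

The caller never mentions the `child` table, which is exactly the "nobody has to remember anything" property.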
If your database has proper RI defined for it then there shouldn't be any case of compromised data integrity. All of your related tables should have declarative RI, which means that you can't delete a parent while it still has children.
Also, if you have code that is only deleting some of the rows at times then that is poor coding and poor testing. These kinds of actions should be a single transaction. Your suggestion of using a stored procedure is a great approach for solving that problem and is pretty standard.
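To illustrate declarative RI without cascading, plus a single-transaction delete routine, here is a sketch in Python with sqlite3. SQLite has no stored procedures, so a Python function stands in for the stored procedure. The `customer` and `orders` tables are assumptions for the example.

```python
import sqlite3


def make_restrict_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # required in SQLite for FK enforcement
    conn.executescript("""
        CREATE TABLE customer (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE orders (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customer(id),  -- no cascade
            item        TEXT NOT NULL
        );
    """)
    return conn


def delete_customer(conn, customer_id):
    """Stand-in for a stored procedure: children first, then the parent,
    all inside one transaction so a failure leaves nothing half-deleted."""
    with conn:
        conn.execute("DELETE FROM orders WHERE customer_id = ?", (customer_id,))
        conn.execute("DELETE FROM customer WHERE id = ?", (customer_id,))
```

With this schema, a bare `DELETE FROM customer` for a customer who still has orders fails with an integrity error instead of silently orphaning or removing rows, and `delete_customer` is the one sanctioned way to remove both.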
As has already been mentioned, cascading deletes have the danger of deleting rows that someone did not intend to delete. Consider that sometimes people might be accessing your data from somewhere outside of your application, especially when fixing data issues. When someone accidentally tries to delete the wrong parent and gets an RI error, that's good. When they accidentally try to delete the wrong parent and it not only deletes that parent but also 20 children in 5 other tables, that's bad.
Also, cascading deletes are very hidden. If a developer is coding a delete for the parent then they should know that they have to use the delete stored procedure to take care of children. It's much preferable to have a developer not code against that, get an error, and fix his code (or realize that he doesn't really want to do all of that deleting) than it is to have a developer throw in a delete and have no one realize that it's killing off children until the code has gone live.
I prefer to have my developers knowledgeable about the application rather than make it easier for them to remain ignorant of it.
Cascade delete causes lots of issues and thus is extremely dangerous. I would not recommend its use. In the first place, suppose I need to delete a record that has millions of child records. You could lock up the database and make it unusable for hours. I know of very few DBAs who will permit cascade delete to be used in their databases.
Next, it does not help with data integrity if you have defined the FKs. A delete with child records still existing will fail, which is a good thing. I want the customer delete to fail if he has existing orders, for instance. Cascade delete used thoughtlessly (as it usually is in my experience) can cause things to be deleted that you really don't want to delete.
Use both!
"Joined" manual deletes are usually better for avoiding deadlocks and other contention problems, as you can break up the deletes into smaller units of work. If you do have contention, it's definitely easier to find the cause of the conflict.
As stated, "Delete Cascade" will absolutely guarantee referential integrity.
So use both: do explicit deletes of the "children" in joined SQL statements to avoid deadlocks and performance problems, but leave "CASCADE DELETE" enabled to catch anything you missed. As there should be no children left when you come to delete the parent, this won't cost you anything, unless you made a mistake with your deletes, in which case the cost is worth it to maintain your referential integrity.
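One possible reading of "use both", sketched in Python with sqlite3: children are removed explicitly in small batches, each in its own short transaction, and the foreign key still carries ON DELETE CASCADE as a safety net. Table names and the batch size are assumptions, and the batching syntax (a LIMIT inside a subquery) is the SQLite form; other engines spell it differently.

```python
import sqlite3


def make_batch_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # required in SQLite for FK enforcement
    conn.executescript("""
        CREATE TABLE parent (id INTEGER PRIMARY KEY);
        CREATE TABLE child (
            id        INTEGER PRIMARY KEY,
            parent_id INTEGER NOT NULL
                      REFERENCES parent(id) ON DELETE CASCADE   -- safety net
        );
    """)
    return conn


def delete_parent_in_batches(conn, parent_id, batch_size=1000):
    """Delete children in small units of work, then the parent.

    Returns the number of child rows removed by the explicit deletes.
    """
    removed = 0
    while True:
        with conn:  # one short transaction per batch keeps locks brief
            cur = conn.execute(
                "DELETE FROM child WHERE id IN ("
                "  SELECT id FROM child WHERE parent_id = ? LIMIT ?)",
                (parent_id, batch_size),
            )
        if cur.rowcount == 0:
            break
        removed += cur.rowcount
    with conn:
        # Normally no children remain here; the cascade only acts if a row
        # was missed, for example one inserted while the batches were running.
        conn.execute("DELETE FROM parent WHERE id = ?", (parent_id,))
    return removed
```

The trade-off is that the whole operation is no longer atomic: a crash midway leaves the parent with fewer children, which may or may not be acceptable for your data.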
I've never used triggers before, but this seems like a solid use case. I'd like to know if triggers are what I should be using, and if so, I could use a little hand-holding on how to go about it.
Essentially I have two heavily denormalized tables, goals and users_goals. Both have title columns (VARCHAR) that duplicate the title data. Thus, there will be one main goal of "Learn how to use triggers", and many (well, maybe not many in this case) users' goals with the same title. The architecture of the site demands that this be the case.
I haven't had a need to have a relationship between these two tables just yet. I link from individual users' goals to the main goals, but simply do so with a query by title, (with an INDEX on the title column). Now I need to have a third table that relates these two tables, but it only needs to be eventually consistent. There would be two columns, both FOREIGN KEYs, goal_id and users_goal_id.
Are triggers the way to go with this? And if so, what would that look like?
Yes you could do this using triggers, but the exact implementation depends on your demands.
If you want to rebuild all your queries so they don't use the title for the join but the goal_id instead, you can just build that. If you need to keep the titles in sync as well, that's an extra.
First for the join. You stated that one goal has many user goals. Does that mean that each user goal belongs to only one goal? If so, you don't need the extra table. You can just add a column goal_id to your user_goals table. Make sure there is a foreign key constraint (I hope you're using InnoDB tables), so you can enforce referential integrity.
Then the triggers. I'm not exactly sure how to write them in MySQL. I use triggers a lot in Oracle, but only seldom in MySQL. Anyway, I'd suggest you build three triggers:
Update trigger on goals table. This trigger should update related user_goals table when the title is modified.
Update trigger on the user_goals table. If user_goals.title is modified, this trigger should check if the title in the goals table differs from the new title in user_goals. If so, you have two options:
Exception: Don't allow the title to be modified in the user_goals child table.
Update: Allow the title to be changed. Update the parent record in goals. The trigger on goals will update the other related user_goals for you.
You could also silently ignore the change by changing the value back in the trigger, but that wouldn't be a good idea.
Insert trigger on user_goals. Easiest option is to query the title of the specified goal_id and don't allow inserting another value for title. You could opt to update goals if a title is given.
Insert trigger on goals. No need for this one.
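Since the exact MySQL syntax is left open above, here is a sketch of the three triggers in SQLite syntax, driven from Python, picking the "Exception" option for the child table. It assumes a `goal_id` column has been added to `user_goals` as suggested. MySQL would need different syntax (for instance SIGNAL instead of RAISE), so this only shows the logic.

```python
import sqlite3

GOALS_SCHEMA = """
CREATE TABLE goals (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE user_goals (
    id      INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    goal_id INTEGER NOT NULL REFERENCES goals(id),
    title   TEXT NOT NULL
);

-- 1. Update trigger on goals: push a changed title down to the children.
CREATE TRIGGER goals_title_update AFTER UPDATE OF title ON goals
BEGIN
    UPDATE user_goals SET title = NEW.title WHERE goal_id = NEW.id;
END;

-- 2. Update trigger on user_goals ("Exception" option): reject a title
--    that differs from the parent goal's title.
CREATE TRIGGER user_goals_title_update BEFORE UPDATE OF title ON user_goals
WHEN NEW.title <> (SELECT title FROM goals WHERE id = NEW.goal_id)
BEGIN
    SELECT RAISE(ABORT, 'title must match the parent goal');
END;

-- 3. Insert trigger on user_goals: same check for new rows.
CREATE TRIGGER user_goals_title_insert BEFORE INSERT ON user_goals
WHEN NEW.title <> (SELECT title FROM goals WHERE id = NEW.goal_id)
BEGIN
    SELECT RAISE(ABORT, 'title must match the parent goal');
END;
"""


def make_goals_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # required in SQLite for FK enforcement
    conn.executescript(GOALS_SCHEMA)
    return conn
```

The propagation in trigger 1 does not trip trigger 2, because by the time it runs the parent row already holds the new title, so the child's new title matches.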
No, you should never use triggers at all if you can avoid it.
Triggers are an anti-pattern to me; they have the effect of "doing stuff behind the programmer's back".
Imagine a future maintainer of your application needs to change something. If they are not aware of the trigger (imagine they haven't checked your database schema creation scripts in detail), they could spend a long, long time trying to work out why the data changes behind their back.
If you need to have several pieces of client-side code updating the tables, consider making them use a stored procedure; document this in the code maintenance manual (and comments etc) to ensure that future developers do the same.
If you can get away with it, just write a common routine on the client side which is always called to update the shared column(s).
Even triggers do nothing to ensure that the columns are always in sync, so you will need to implement a periodic process which checks this anyway. They will otherwise go out of sync sooner or later (maybe just because some operations engineer decides to start doing manual updates; maybe one table gets restored from a backup and the other doesn't).
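The periodic check can be a single query. Here is a sketch in Python with sqlite3, assuming the `goals` and `user_goals` tables are linked by a `goal_id` column (an assumption taken from the other answer, not from the question's current schema).

```python
import sqlite3

CHECK_SCHEMA = """
CREATE TABLE goals (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE user_goals (
    id      INTEGER PRIMARY KEY,
    goal_id INTEGER NOT NULL,
    title   TEXT NOT NULL
);
"""


def find_title_mismatches(conn):
    """Return (user_goal_id, user_goal_title, goal_title) for every row
    whose duplicated title has drifted from the main goal's title."""
    return conn.execute(
        "SELECT ug.id, ug.title, g.title "
        "FROM user_goals AS ug "
        "JOIN goals AS g ON g.id = ug.goal_id "
        "WHERE ug.title <> g.title "
        "ORDER BY ug.id"
    ).fetchall()
```

A scheduled job can run this and either alert someone or repair the rows, depending on which table you treat as the source of truth.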