It's not a specific question, more a general wondering.
When you have to delete rows from multiple tables in a 1:M relationship, is it better to make an FK constraint with a cascade delete, or to join the tables in the delete statement?
I had an old project that had separate delete statements for the related tables, and a few times some of the statements were not executed and data integrity was compromised. I had to choose between the two approaches, so I was thinking about which would be the better solution.
There is also an option to make a stored procedure or a transaction.
So I am looking for an opinion or advice...?
I'd say it's safer to use a cascade delete. If you decide to use joins, you have to remember to use them every time you delete anything from the parent table; and even if you're disciplined enough to do that, you can't be sure about your coworkers or the people who will support your software in the future. Also, encoding such knowledge about table relationships more than once violates the DRY principle.
If you use a cascade delete though, nobody has to remember anything, and child rows will always be deleted as needed.
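As a minimal sketch of the cascade approach (using SQLite for portability; the table names are made up, and MySQL/InnoDB declares `ON DELETE CASCADE` the same way in the FK definition):

```python
import sqlite3

# In-memory demo: a parent table and a child table whose FK cascades.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires this per connection
conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE child (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER NOT NULL
            REFERENCES parent(id) ON DELETE CASCADE
    )
""")
conn.execute("INSERT INTO parent (id) VALUES (1)")
conn.executemany("INSERT INTO child (id, parent_id) VALUES (?, 1)", [(10,), (11,)])

# One delete on the parent; the engine removes the children automatically,
# so no application code has to remember the relationship.
conn.execute("DELETE FROM parent WHERE id = 1")
remaining_children = conn.execute("SELECT COUNT(*) FROM child").fetchone()[0]
```

The point of the sketch: the relationship is declared exactly once, in the schema, instead of being repeated in every delete statement.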
If your database has proper RI defined for it then there shouldn't be any case of compromised data integrity. All of your related tables should have declarative RI, which means that you can't delete a parent while it still has children.
Also, if you have code that is only deleting some of the rows at times then that is poor coding and poor testing. These kinds of actions should be a single transaction. Your suggestion of using a stored procedure is a great approach for solving that problem and is pretty standard.
As has already been mentioned, cascading deletes carry the danger of deleting rows that someone did not intend to delete. Consider that sometimes people access your data from outside your application, especially when fixing data issues. When someone accidentally tries to delete the wrong parent and gets an RI error, that's good. When they accidentally try to delete the wrong parent and it not only deletes that parent but 20 children across 5 other tables, that's bad.
Also, cascading deletes are very hidden. If a developer is coding a delete for the parent then they should know that they have to use the delete stored procedure to take care of children. It's much preferable to have a developer not code against that, get an error, and fix his code (or realize that he doesn't really want to do all of that deleting) than it is to have a developer throw in a delete and have no one realize that it's killing off children until the code has gone live.
IMO, I prefer to have my developers knowledgeable about the application rather than make it easier for them to remain ignorant of it.
Cascade delete causes lots of issues and thus is extremely dangerous. I would not recommend its use. In the first place, suppose I need to delete a record that has millions of child records: that could lock up the database and make it unusable for hours. I know of very few DBAs who will permit cascade delete to be used in their databases.
Next, it does not help with data integrity if you have defined the FKs. A delete attempted while child records still exist will fail, which is a good thing. I want the customer delete to fail if he has existing orders, for instance. Cascade delete used thoughtlessly (as it usually is, in my experience) can cause things to be deleted that you really don't want to delete.
Use both!
"Joined" manual deletes are usually better for avoiding deadlocks and other contention problems as you can break up the deletes into smaller units of work. If you do have contention its definitely easier to find the cause of the conflict.
As stated "Delete Cascade" will absolutely guarantee referential integrity.
So use both -- do explicit deletes of the "children" in joined SQL to avoid deadlocks and performance problems, but leave "CASCADE DELETE" enabled to catch anything you missed. As there should be no children left by the time you come to delete the parent, this won't cost you anything -- unless you made a mistake with your deletes, in which case the cost is worth it to maintain your referential integrity.
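The hybrid approach above can be sketched as follows (SQLite in-memory demo; `customers`/`orders` are hypothetical names): delete the children explicitly inside one transaction, while keeping `ON DELETE CASCADE` in the schema as a safety net.

```python
import sqlite3

# Parent (customers) / child (orders) with the cascade left enabled.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL
            REFERENCES customers(id) ON DELETE CASCADE
    )
""")
conn.execute("INSERT INTO customers (id) VALUES (1)")
conn.executemany("INSERT INTO orders (id, customer_id) VALUES (?, 1)",
                 [(i,) for i in range(5)])

with conn:  # a single transaction: all deletes commit or roll back together
    # Explicit child delete first -- in real life this could be batched
    # to keep lock times short on very large child sets.
    conn.execute("DELETE FROM orders WHERE customer_id = ?", (1,))
    # By now no children remain, so the cascade has nothing left to do;
    # it only fires if the explicit delete above missed something.
    conn.execute("DELETE FROM customers WHERE id = ?", (1,))

orders_left = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
customers_left = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```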
Recently, I stumbled upon the following problem: given is a simple data model with "Books" and "Authors". Each "Book" has a reference to an "Author". Persistence is achieved with a relational database. Besides adding books and authors, it is also possible to delete them. Usually, if I want to delete an "Author", I would perform a SQL DELETE operation and remove the corresponding row. However, I have seen that in other projects, people don't call DELETE. Instead, they add some kind of active/deleted flag and mark the corresponding row as "deleted".
My questions are: Is this in general best practice? What are the advantages? My best guess is:
Setting a flag performs better than a DELETE operation
If you run out of space, you can still run a cleanup service that looks for deleted objects and removes the corresponding rows
Setting a delete flag is better for database consistency, because deleting an "Author" in the example above could break foreign keys in the corresponding "Book" entries.
Anyway, these are just guesses. Does someone know the answer?
There are many reasons to not use delete. First, maintaining history can be very important. I wouldn't use "just" a delete flag, but instead have dates of validity.
Second, in an operational system, delete can be an expensive operation. The row needs to be deleted from the table, from associated indexes, and then there might be cascading deletes and triggers.
Third, delete can prevent other operations from working well, because tables and rows and indexes get locked. This can slow down an operational system, particularly during peak periods.
Fourth, delete can make it tricky to maintain relational integrity -- especially if those cascading deletes are not defined.
Fifth, storage is cheap. Processing power is cheap. So, for many databases, deleting records to recover space is simply unnecessary.
This doesn't mean that you should always avoid deleting records. But there are very valid reasons for not rushing to remove data.
UNIQUE is an index which makes your field, well, unique. But is it worth using it if you're already doing your validation in PHP prior to inserting new data? An extra INDEX isn't the end of the world but if you're after query optimization then UNIQUE just gets in the way, right?
Why wear a seat belt if you're a good driver and you can save two seconds of your total trip time?
One of the most important lessons for a programmer to learn is that he is human and he makes mistakes. Worse, everyone else working on this code is human, too.
Why does the UNIQUE constraint exist? To protect the database from humans making mistakes. Turning off your UNIQUE constraint says "You do not need to worry, Mr. Database, I will never give you data that doesn't match my intent."
What if something happens to your code such that your validation for uniqueness breaks? Now your code dumps duplicate records into the database. But if you had a UNIQUE constraint on that column, when your front-end code stopped working, you'd get your queries blowing up.
You're human. Accept it. Let the computer do its job and help protect you from yourself.
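A small sketch of the failure mode described above (SQLite standing in for MySQL; table and column names are made up): the application-side check has silently stopped running, but the database-level UNIQUE constraint still rejects the duplicate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE)")
conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")

duplicate_rejected = False
try:
    # Imagine the PHP-side uniqueness validation is broken or bypassed:
    # the INSERT goes straight to the database with a duplicate value.
    conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")
except sqlite3.IntegrityError:
    # The constraint is the seat belt: the query blows up loudly
    # instead of silently corrupting the data.
    duplicate_rejected = True
```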
UNIQUE is not only for making sure data is valid. The primary purpose is to optimize queries: if the database knows the field is unique, it can stop searching for hits as soon as the first record is found. You can't pass that information to the database through well-crafted queries alone.
That is an interesting question.
Are you sure that there is no way for your code to be bypassed ?
Are you sure nothing else will ever access the data beside the PHP application ?
Are you sure the rest of your application won't fail in the case where a duplicate is inserted ?
What would be the implication of having duplicate entries, would that cause problem for future references or calculations ?
These are some of the questions that constraints at the database level help answer.
As for optimization, a constraint does not make retrieving data noticeably slower, and it can in fact be used in the execution plan at some point, since it is backed by an index.
So no, it won't get in the way of optimization and it will also protect your data from inconsistencies.
As pst mentions, at this stage in your development, you are in no position to begin optimizing your database or the application in question.
It's generally not a bad thing to add additional sanity checks in your system. Yes, you're hurting performance just that tiny little bit, but in no way will any user ever notice an extra CPU tick or two.
Think about this: Today you do your validation in php, but do not assert uniqueness in the database. In the future, you, a colleague, or some other guy who has forked your project changes the original php validation, ruins it, or forgets it altogether. At this point, you'll probably wish you had that added check in your database.
tl;dr: Transactional integrity (in the database) handles race conditions (in the application).
The Concurrency and integrity section of these Rails docs explains why this is necessary with an example scenario.
Databases with transactional integrity guarantee uniqueness through isolation, while applications actually take a few separate steps (get the value, check whether other values exist, then save the value) outside of transactional isolation, which leaves them vulnerable to race conditions, especially at scale.
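The check-then-insert race can be sketched like this (SQLite; sequential code stands in for two concurrent requests): both "requests" run their validation while the table is still empty, so both checks pass, but the UNIQUE constraint still lets only one INSERT through.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE)")

email = "a@example.com"
# Both requests validate before either has saved -- the race window.
check_1 = conn.execute(
    "SELECT COUNT(*) FROM users WHERE email = ?", (email,)).fetchone()[0] == 0
check_2 = conn.execute(
    "SELECT COUNT(*) FROM users WHERE email = ?", (email,)).fetchone()[0] == 0

inserted = 0
for passed in (check_1, check_2):
    if passed:
        try:
            conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
            inserted += 1
        except sqlite3.IntegrityError:
            pass  # the constraint caught what the application check missed
```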
I've always had a weird feeling in my gut about actually deleting rows from certain types of tables.
For example, if I have a table of Users...when they delete their account, rather than fully deleting their row, I have been marking as "dead" or inactive. This allows me to retain a record of their existence if I ever need it again.
In situations like this - considering performance, overhead, etc - should I delete the row, or simply mark as inactive?
Which is more "common"?
Personally, I almost always use "soft deletes" as you describe.
If space is a concern, I'll have a job that will periodically clean up the soft-deleted records after they've been deleted for a certain amount of time.
Perhaps you could move the inactive MySQL records to a separate table designed to hold inactive accounts? That way, you could simply move them back over if you need to, or delete the table if database size becomes an issue.
Data is too valuable to be permanently deleted from the database. Mark it as dead.
I generally use a status column for such cases, in this pattern:
0 Inactive
1 Active
2 Trashed
In addition to "soft" deletes, another solution is to use "audit tables". I asked what they were on dba.stackexchange.com recently.
Audit tables are typically used to record actions, such as insert/update/delete, performed on a second table, possibly storing old and new values, time, etc.
They can be implemented using triggers in a straightforward way.
Pros:
the "unused" data is in a separate table
it's easy to turn the level-of-detail knob from fine-grained to coarse-grained
it may be more efficient space-wise, depending on the exact implementation
Cons:
since data is in a separate table, it could cause key conflicts in the case that a row were "undeleted"
it may be less efficient space-wise, depending on the exact implementation
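A straightforward trigger-based audit table can be sketched like this (SQLite syntax; `users`/`users_audit` are hypothetical names, and MySQL triggers differ only in the `FOR EACH ROW` syntax): the delete trigger copies the removed row into the audit table before it disappears.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE users_audit (
        action  TEXT,
        user_id INTEGER,
        name    TEXT,
        at      TEXT DEFAULT CURRENT_TIMESTAMP
    );
    -- Record every delete, preserving the old values.
    CREATE TRIGGER users_delete_audit
    AFTER DELETE ON users
    BEGIN
        INSERT INTO users_audit (action, user_id, name)
        VALUES ('delete', OLD.id, OLD.name);
    END;
""")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Alice')")
conn.execute("DELETE FROM users WHERE id = 1")  # a real, hard delete
audited = conn.execute(
    "SELECT action, user_id, name FROM users_audit").fetchall()
```

The live table stays clean, while the audit table keeps the history out of the way of everyday queries.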
This question made me remember this entertaining anecdote. My point: there are so many factors to consider when choosing between hard and soft delete that there is no rule of thumb telling you which one to pick.
I have many tables where data needs to be "marked for deletion" but not deleted, or toggle between published and hidden data.
The most intuitive way to handle these cases is to add a column to the table, such as deleted int(1) or public int(1). This raises the concern of having to remember to specify WHERE deleted=0 every single time that table is accessed.
I considered overcoming this by creating duplicate tables for deleted/unpublished data, such as article => article_deleted, and moving the data instead of deleting it. This presents two issues:
Foreign key constraints end up being extremely annoying to maintain
Number of tables with hidden content doubles (in my case ~20 becomes ~40 tables)
My last idea is to create a duplicate of the entire database called unreleased and migrate data there.
My question isn't about safety of the data management, but more of - what is the right way of doing it from the beginning?
I have run into this exact issue before and I think it is a bad idea to create an unnecessarily cumbersome DB because you are afraid of bad code.
I think it would be a better idea to do thorough testing on your Test server before you release to production. Even I was tripped up by the "Deleted" column a few times when I first encountered it, but I eventually caught on; if you have a proper Dev/Test/Production environment, you should be fine.
In summary, keep the delete column and demand more from your coders.
UPDATE:
Alternatively you could create a view that only pulls the records that aren't deleted and make sure everyone uses that for select queries.
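The view approach from the UPDATE above, sketched in SQLite (the `article` table follows the questioner's naming; `article_live` is a made-up view name): selects go through a view that hides soft-deleted rows, so nobody has to remember the WHERE clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT, "
    "deleted INTEGER DEFAULT 0)"
)
conn.execute(
    "INSERT INTO article (id, title, deleted) "
    "VALUES (1, 'Kept', 0), (2, 'Gone', 1)"
)
# The filter lives in exactly one place: the view definition.
conn.execute(
    "CREATE VIEW article_live AS "
    "SELECT id, title FROM article WHERE deleted = 0"
)

titles = [row[0] for row in conn.execute("SELECT title FROM article_live")]
```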
I think your initial approach is "correct" and "right", but your concern about it being slightly error-prone is a valid one.
You'll probably just have to make sure that your test procedures are rigorous enough to catch errors.
The first approach is the best I've come up with. I call the column active instead of deleted: the record exists, but it can be either active or inactive. That way, if you really do need to delete things, the terminology doesn't get screwy.
I've never used triggers before, but this seems like a solid use case. I'd like to know if triggers are what I should be using, and if so, I could use a little hand-holding on how to go about it.
Essentially I have two heavily denormalized tables, goals and users_goals. Both have title columns (VARCHAR) that duplicate the title data. Thus, there will be one main goal of "Learn how to use triggers", and many (well, maybe not many in this case) users' goals with the same title. The architecture of the site demands that this be the case.
I haven't had a need to have a relationship between these two tables just yet. I link from individual users' goals to the main goals, but simply do so with a query by title, (with an INDEX on the title column). Now I need to have a third table that relates these two tables, but it only needs to be eventually consistent. There would be two columns, both FOREIGN KEYs, goal_id and users_goal_id.
Are triggers the way to go with this? And if so, what would that look like?
Yes you could do this using triggers, but the exact implementation depends on your demands.
If you want to rebuild all your queries so they join on goal_id instead of the title, you can just build that. If you need to keep the titles in sync as well, that's extra work.
First for the join. You stated that one goal has many user goals. Does that mean that each user goal belongs to only one goal? If so, you don't need the extra table. You can just add a column goal_id to your user_goals table. Make sure there is a foreign key constraint (I hope you're using InnoDB tables), so you can enforce referential integrity.
Then the trigger. I'm not exactly sure how to write them on MySQL. I do use triggers a lot on Oracle, but only seldom on MySQL. Anyway, I'd suggest you build three triggers:
Update trigger on goals table. This trigger should update related user_goals table when the title is modified.
Update trigger on the user_goals table. If user_goals.title is modified, this trigger should check if the title in the goals table differs from the new title in user_goals. If so, you have two options:
Exception: Don't allow the title to be modified in the user_goals child table.
Update: Allow the title to be changed. Update the parent record in goals. The trigger on goals will update the other related user_goals for you.
You could also silently ignore the change by changing the value back in the trigger, but that wouldn't be a good idea.
Insert trigger on user_goals. Easiest option is to query the title of the specified goal_id and don't allow inserting another value for title. You could opt to update goals if a title is given.
Insert trigger on goals. No need for this one.
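Trigger 1 above can be sketched like this (SQLite syntax for portability; MySQL's version is similar but uses `FOR EACH ROW` in a `BEGIN ... END` body): when a goal's title changes, the trigger propagates the new title to the denormalized copies in user_goals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE goals (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE user_goals (
        id      INTEGER PRIMARY KEY,
        goal_id INTEGER REFERENCES goals(id),
        title   TEXT
    );
    -- Keep the duplicated title column in sync with the parent goal.
    CREATE TRIGGER goals_title_sync
    AFTER UPDATE OF title ON goals
    BEGIN
        UPDATE user_goals SET title = NEW.title WHERE goal_id = NEW.id;
    END;
""")
conn.execute("INSERT INTO goals VALUES (1, 'Learn triggers')")
conn.execute("INSERT INTO user_goals VALUES (10, 1, 'Learn triggers')")

# Renaming the parent goal fires the trigger on the child rows.
conn.execute("UPDATE goals SET title = 'Master triggers' WHERE id = 1")
synced_title = conn.execute(
    "SELECT title FROM user_goals WHERE id = 10").fetchone()[0]
```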
No. You should avoid triggers entirely whenever you can.
Triggers are an anti-pattern to me; they have the effect of "doing stuff behind the programmer's back".
Imagine a future maintainer of your application needs to do something, if they are not aware of the trigger (imagine they haven't checked your database schema creation scripts in detail), then they could spend a long, long time trying to work out why this happens.
If you need to have several pieces of client-side code updating the tables, consider making them use a stored procedure; document this in the code maintenance manual (and comments etc) to ensure that future developers do the same.
If you can get away with it, just write a common routine on the client side which is always called to update the shared column(s).
Even triggers do nothing to guarantee that the columns stay in sync forever, so you will need to implement a periodic process that checks this anyway. They will otherwise go out of sync sooner or later (maybe just because some operations engineer decides to start doing manual updates; maybe because one table gets restored from a backup and the other doesn't).
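The periodic consistency check suggested above amounts to one join query that finds drifted rows (SQLite sketch, reusing the goals/user_goals names from the question; the typo in the seeded data is deliberate):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE goals (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE user_goals (id INTEGER PRIMARY KEY, goal_id INTEGER, title TEXT);
    INSERT INTO goals VALUES (1, 'Learn triggers');
    -- Row 11 has drifted from its parent (e.g. via a manual update).
    INSERT INTO user_goals VALUES
        (10, 1, 'Learn triggers'),
        (11, 1, 'Learn trigers');
""")

# Report every child row whose title no longer matches its parent goal.
drifted = conn.execute("""
    SELECT ug.id
    FROM user_goals ug
    JOIN goals g ON g.id = ug.goal_id
    WHERE ug.title <> g.title
""").fetchall()
```

A scheduled job could run this query and either repair the rows or alert a human.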