Delete MySQL rows, or mark "dead"? - mysql

I've always had a weird feeling in my gut about actually deleting rows from certain types of tables.
For example, if I have a table of Users...when they delete their account, rather than fully deleting their row, I have been marking as "dead" or inactive. This allows me to retain a record of their existence if I ever need it again.
In situations like this - considering performance, overhead, etc - should I delete the row, or simply mark as inactive?
Which is more "common"?

Personally, I almost always use "soft deletes" as you describe.
If space is a concern, I'll have a job that will periodically clean up the soft-deleted records after they've been deleted for a certain amount of time.
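A minimal sketch of that soft delete plus periodic cleanup, using Python's built-in sqlite3 as a stand-in for MySQL so it runs anywhere. The `users` table, the `deleted_at` column and both function names are assumptions made up for illustration, not something from the question.

```python
import sqlite3

# sqlite3 stands in for MySQL; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, deleted_at TEXT)"
)
conn.executemany(
    "INSERT INTO users (id, name, deleted_at) VALUES (?, ?, ?)",
    [
        (1, "alice", None),                   # active
        (2, "bob", "2020-01-01 00:00:00"),    # soft-deleted long ago
        (3, "carol", "2024-06-01 00:00:00"),  # soft-deleted recently
    ],
)


def soft_delete(conn, user_id, now):
    """Mark a row as deleted instead of removing it."""
    conn.execute(
        "UPDATE users SET deleted_at = ? WHERE id = ? AND deleted_at IS NULL",
        (now, user_id),
    )


def purge_soft_deleted(conn, cutoff):
    """Hard-delete rows that were soft-deleted before `cutoff`."""
    cur = conn.execute(
        "DELETE FROM users WHERE deleted_at IS NOT NULL AND deleted_at < ?",
        (cutoff,),
    )
    return cur.rowcount


soft_delete(conn, 1, "2024-06-15 00:00:00")
purged = purge_soft_deleted(conn, "2024-01-01 00:00:00")
remaining = [row[0] for row in conn.execute("SELECT id FROM users ORDER BY id")]
```

Storing a timestamp rather than a boolean flag is what lets the cleanup job decide how long a row has been "dead"; the timestamps here are ISO strings so that plain string comparison orders them correctly.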

Perhaps you could move the inactive MySQL records to a separate table designed to hold inactive accounts? That way, you could simply move them back over if you need to, or delete the table if database size becomes an issue.

Data are too valuable to be permanently deleted from the database. Mark as dead.
I generally use a status column for such cases, following this pattern:
0 Inactive
1 Active
2 Trashed

In addition to "soft" deletes, another solution is to use "audit tables". I asked what they were on dba.stackexchange.com recently.
Audit tables are typically used to record actions, such as insert/update/delete, performed on a second table, possibly storing old and new values, time, etc.
They can be implemented using triggers in a straightforward way.
Pros:
the "unused" data is in a separate table
it's easy to turn the level-of-detail knob from fine-grained to coarse-grained
it may be more efficient space-wise, depending on the exact implementation
Cons:
since data is in a separate table, it could cause key conflicts in the case that a row were "undeleted"
it may be less efficient space-wise, depending on the exact implementation
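As a rough illustration of the trigger-based audit table described above, here is a sketch using Python's sqlite3 as a stand-in for MySQL (MySQL's trigger syntax differs in details, for example it requires FOR EACH ROW). The `users` and `users_audit` tables and their columns are hypothetical.

```python
import sqlite3

# sqlite3 stands in for MySQL; all table, column and trigger names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);

    -- The audit table records what happened to rows of `users`.
    CREATE TABLE users_audit (
        audit_id   INTEGER PRIMARY KEY AUTOINCREMENT,
        user_id    INTEGER,
        action     TEXT,
        old_name   TEXT,
        new_name   TEXT,
        changed_at TEXT DEFAULT CURRENT_TIMESTAMP
    );

    CREATE TRIGGER users_audit_update AFTER UPDATE ON users
    BEGIN
        INSERT INTO users_audit (user_id, action, old_name, new_name)
        VALUES (OLD.id, 'update', OLD.name, NEW.name);
    END;

    CREATE TRIGGER users_audit_delete AFTER DELETE ON users
    BEGIN
        INSERT INTO users_audit (user_id, action, old_name)
        VALUES (OLD.id, 'delete', OLD.name);
    END;
    """
)

conn.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
conn.execute("UPDATE users SET name = 'alicia' WHERE id = 1")
conn.execute("DELETE FROM users WHERE id = 1")  # a real, hard delete

audit = conn.execute(
    "SELECT user_id, action, old_name, new_name FROM users_audit ORDER BY audit_id"
).fetchall()
```

The main table ends up empty, while the audit table still holds the old values, which is the "unused data is in a separate table" point from the pros list.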

This question made me remember this entertaining anecdote. My point: there are so many factors to consider when choosing between hard and soft delete that there is no rule of thumb telling you which one to pick.

Related

Relational Database: DELETE versus "Mark for Deletion"

Recently, I stumbled upon the following problem: Given is a simple data model with "Books" and "Authors". Each "Book" has a reference to an "Author". Persistence is achieved with a relational database. Besides adding books and authors, it is also possible to delete them. Usually, if I want to delete an Author, I would perform a SQL DELETE operation and remove the corresponding row. However, I have seen that in other projects, people don't call DELETE. Instead, they add some kind of active/deleted flag and mark the corresponding row as "deleted".
My questions are: Is this in general best practice? What are the advantages? My best guess is:
Setting a flag performs better than a DELETE operation
If you run out of space, it is still possible to run a cleanup service which looks for deleted objects and removes the corresponding rows
Setting a delete flag is better for database consistency because a deletion of an "Author" in the example above could break foreign keys in the corresponding "Book" entries.
Anyway, these are just guesses. Does someone know the answer?
There are many reasons to not use delete. First, maintaining history can be very important. I wouldn't use "just" a delete flag, but instead have dates of validity.
Second, in an operational system, delete can be an expensive operation. The row needs to be deleted from the table, from associated indexes, and then there might be cascading deletes and triggers.
Third, delete can prevent other operations from working well, because tables and rows and indexes get locked. This can slow down an operational system, particularly during peak periods.
Fourth, delete can be tricky to maintain relational integrity -- especially if those cascading deletes are not defined.
Fifth, storage is cheap. Processing power is cheap. So, for many databases, deleting records to recover space is simply unnecessary.
This doesn't mean that you should always avoid deleting records. But there are very valid reasons for not rushing to remove data.

Delete all then insert all, or update, delete and insert as needed?

I have a web form that is used to create and update personal information. On save, I collect all the info in a large multidimensional JSON array. When updating the database, the information will potentially consist of three parts: new rows to be created, rows that need to be updated and rows that need to be deleted. These rows will also be spread across about 5 tables.
My question is this: how should I approach the MySQL queries? My initial thought was to DELETE all the information from all the tables, and do a clean INSERT of all the new information in one go. I guess the other approach would be to do 3 queries: UPDATE all those with an existing ID, DELETE all those marked for deletion, and INSERT all the newly created data (data without existing IDs).
Which of these approaches would be best, or is there a better way of doing this? Thanks for any advice. I appreciate it.
Delete all and insert all should NEVER be practiced.
Reasons:
It is too costly. Users mostly perform edits, so for what was just a few updates, you did one delete and a hundred inserts.
It plays havoc with on-delete-cascade foreign keys.
It upsets auto-increment fields even when they were apparently not touched.
You need to implement a unit of work. I don't know which language you are working with, but some languages have built-in support for that. In .NET we have DataSets.
Basics:
Keep track of each record you fetched from the database. Internally maintain a flag for each record to note which were loaded from the database (i.e. untouched), which have modifications (need an update query) and which are newly added. For the deleted records, maintain a separate list (maybe of their IDs). How to achieve this is a matter for a separate discussion.
When the user clicks Save, start a database transaction. This is not strictly part of the current discussion, but is almost always done in similar conditions.
In the transaction, first loop through the deleted items array. Fire a delete query for each of them.
Then loop through the modified items array. For each modified item you may simply update all of its columns to the latest values. If the number of columns is too large (>30) then things change a bit.
Then come the newly created items. Fire one insert for each of them.
Finally commit the transaction.
If the language you are programming in supports try/catch blocks, then perform all of the above steps (after beginning the transaction) in a try/catch. In the catch block, roll back the transaction.
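The steps above can be sketched roughly as follows for a single table. This uses Python's sqlite3 as a stand-in for MySQL; the `phones` table and the `save` function with its three arguments are hypothetical names for the tracked deleted, modified and new records.

```python
import sqlite3

# sqlite3 stands in for MySQL; the `phones` table is a hypothetical example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phones (id INTEGER PRIMARY KEY, number TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO phones (id, number) VALUES (?, ?)",
    [(1, "111"), (2, "222"), (3, "333")],
)
conn.commit()


def save(conn, deleted_ids, modified, added):
    """Apply tracked changes in one transaction: deletes, then updates, then inserts."""
    try:
        # The sqlite3 module opens a transaction implicitly at the first write.
        for row_id in deleted_ids:
            conn.execute("DELETE FROM phones WHERE id = ?", (row_id,))
        for row_id, number in modified:
            conn.execute(
                "UPDATE phones SET number = ? WHERE id = ?", (number, row_id)
            )
        for number in added:
            conn.execute("INSERT INTO phones (number) VALUES (?)", (number,))
        conn.commit()
    except sqlite3.Error:
        conn.rollback()  # undo every statement issued since the last commit
        raise


save(conn, deleted_ids=[1], modified=[(2, "999")], added=["444"])
after_ok = conn.execute("SELECT id, number FROM phones ORDER BY id").fetchall()

# A failing save: the NULL number violates NOT NULL, so the DELETE before it
# is rolled back as well and the table is left unchanged.
try:
    save(conn, deleted_ids=[2], modified=[], added=[None])
except sqlite3.IntegrityError:
    pass
after_failed = conn.execute("SELECT id, number FROM phones ORDER BY id").fetchall()
```

With five tables, the same three loops would simply be repeated per table inside the one try block, so that a failure in any table rolls back all of them.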
This approach looks more complicated and seems to fire more queries than the simple delete-all/insert-all approach, but trust me, we have been there, done that and then spent sleepless nights undoing all that was done. Never go the delete/insert way unless you can really justify it.
How to do the change tracking depends a lot on the language and type of application you are using. Even for .NET the approach differs for desktop applications and web applications. Tracking deletions is easy. So is tracking new insertions. The update marks are applied by trapping the edit event on any of the columns of that record.
EDIT
The data spans about five tables. Hence the three loops (delete/update/insert) have to be done five times, once for each table. First draw the relationships among the tables. Process the top table first. Then process the tables which are directly connected to the top-level tables and so on. If you have a cyclic relationship among the tables then you have to be especially careful.
The code behind the Save operation is about to grow quite long: 5x3=15 operations, each with its own SQL. None of these operations are expected to be reusable, hence putting them in separate methods is futile. Everything is about to go in a large procedural block. Hence religiously comment the code. Mark the table boundaries and the operations.
You probably don't want to do any deletes. Just mark the obsolete entries as "inactive", or maybe timestamp them as having an ending validity.
In using this philosophy, all edits are actually insertions. No modifications (except to change the "expire" field) and no deletes. To update a name, mark the record as expired and insert a new record with a beginning validity timestamp at the same time.
In such a database, auditing and data recovery are easily performed.
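Here is one way the "every edit is an insertion" idea might look, again with Python's sqlite3 standing in for MySQL. The `person_names` table, the `valid_from`/`valid_to` columns and the two helper functions are assumptions for illustration only.

```python
import sqlite3

# sqlite3 stands in for MySQL; table, columns and helpers are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE person_names (
        row_id     INTEGER PRIMARY KEY,
        person_id  INTEGER NOT NULL,
        name       TEXT NOT NULL,
        valid_from TEXT NOT NULL,
        valid_to   TEXT            -- NULL means "still current"
    )
    """
)


def set_name(conn, person_id, name, now):
    """Expire the current row (if any) and insert the new value."""
    with conn:  # commits on success, rolls back on an exception
        conn.execute(
            "UPDATE person_names SET valid_to = ? "
            "WHERE person_id = ? AND valid_to IS NULL",
            (now, person_id),
        )
        conn.execute(
            "INSERT INTO person_names (person_id, name, valid_from) "
            "VALUES (?, ?, ?)",
            (person_id, name, now),
        )


def name_as_of(conn, person_id, when):
    """Return the name that was valid at `when`, or None."""
    row = conn.execute(
        "SELECT name FROM person_names "
        "WHERE person_id = ? AND valid_from <= ? "
        "AND (valid_to IS NULL OR valid_to > ?)",
        (person_id, when, when),
    ).fetchone()
    return row[0] if row else None


set_name(conn, 1, "Ann Smith", "2023-01-01")
set_name(conn, 1, "Ann Jones", "2024-01-01")

old_name = name_as_of(conn, 1, "2023-06-01")
new_name = name_as_of(conn, 1, "2024-06-01")
before_any = name_as_of(conn, 1, "2022-01-01")
```

The only UPDATE ever issued touches the expiry column, so every earlier value stays queryable by date, which is what makes auditing and recovery straightforward in this scheme.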

Is it usual to delete the actual record, or to switch "the undisplay flag", when designing a database?

Sometimes you are required to keep your logs and records for crime-prevention purposes.
When you give users permission to delete records, it means that you'll lose evidence.
In ordinary cases, do you actually delete the record, or switch the undisplay flag to keep a log?
If you allow any modification to data then you will lose evidence. Maybe you should design your database so you never use UPDATE or DELETE, only INSERT.
Unless the government has told you to keep all records, I recommend not going too much out of your way to do it.
Apart from keeping records for auditing purposes as you mention, the use of a 'Deleted' flag also allows you to incorporate 'undo' functionality.
If you physically delete data, then it will be quite a bit of work to get the old data back. But if you use flags then it can be as easy as re-setting the flag to get the data to re-appear.
If a lot of deletes happen in your database, then the downside of flags is that you will be holding on to a lot of data that isn't being used.
Instead of just deleting, you can first insert the record into a history table for any type of modification that happens. Then you will always have the data available without having needless information in your main table.
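A small sketch of the history-table approach, with Python's sqlite3 standing in for MySQL. The `records` and `records_history` tables and both helper functions are hypothetical; the restore step is added here only to show the "undo" idea mentioned above.

```python
import sqlite3

# sqlite3 stands in for MySQL; table and function names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE records_history (id INTEGER, payload TEXT, deleted_at TEXT);
    INSERT INTO records (id, payload) VALUES (1, 'keep me'), (2, 'delete me');
    """
)


def delete_with_history(conn, record_id, now):
    """Copy the row into the history table, then delete it, in one transaction."""
    with conn:
        conn.execute(
            "INSERT INTO records_history (id, payload, deleted_at) "
            "SELECT id, payload, ? FROM records WHERE id = ?",
            (now, record_id),
        )
        conn.execute("DELETE FROM records WHERE id = ?", (record_id,))


def restore(conn, record_id):
    """Move a row back from the history table into the main table."""
    with conn:
        conn.execute(
            "INSERT INTO records (id, payload) "
            "SELECT id, payload FROM records_history WHERE id = ?",
            (record_id,),
        )
        conn.execute("DELETE FROM records_history WHERE id = ?", (record_id,))


delete_with_history(conn, 2, "2024-06-01 12:00:00")
ids_after_delete = [r[0] for r in conn.execute("SELECT id FROM records ORDER BY id")]
history_rows = conn.execute(
    "SELECT id, payload, deleted_at FROM records_history"
).fetchall()

restore(conn, 2)
ids_after_restore = [r[0] for r in conn.execute("SELECT id FROM records ORDER BY id")]
```

Doing the copy and the delete in the same transaction matters: otherwise a crash between the two statements could leave the row in both tables or in neither.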

How can I fix this scaling issue with soft deleting items?

I have a database where most tables have a delete flag for the tables. So the system soft deletes items (so they are no longer accessible unless by admins for example)
What worries me is in a few years, when the tables are much larger, is that the overall speed of the system is going to be reduced.
What can I do to counteract effects like that?
Do I index the delete field?
Do I move the deleted data to an identical delete table and back when undeleted?
Do I spread out the data over a few MySQL servers over time? (based on growth)
I'd appreciate any and all suggestions or stories.
UPDATE:
So partitioning seems to be the key to this. But wouldn't partitioning just create two "tables", one with the deleted items and one without the deleted items?
So over time the deleted partition will grow large and the occasional fetches from it will be slow (and slower over time).
Would the speed difference be something I should worry about, since I fetch most (if not all) data by some key value? (Some are searches, but they can be slow for this setup.)
I'd partition the table on the DELETE flag.
The deleted rows will be physically kept in other place, but from SQL's point of view the table remains the same.
Oh, hell yes, index the delete field. You're going to be querying against it all the time, right? Compound indexes with other fields you query against a lot, like parent IDs, might also be a good idea.
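To make the compound-index suggestion concrete, here is a sketch using Python's sqlite3 as a stand-in. The `comments` table, its columns and the index name are hypothetical, and the query plan shown comes from SQLite's planner, not MySQL's, so treat it only as an illustration of the idea (in MySQL you would check with EXPLAIN on your own data).

```python
import sqlite3

# sqlite3 stands in for MySQL; table, column and index names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE comments (
        id         INTEGER PRIMARY KEY,
        parent_id  INTEGER NOT NULL,
        body       TEXT,
        is_deleted INTEGER NOT NULL DEFAULT 0
    );
    -- Compound index: the column you look rows up by, plus the delete flag.
    CREATE INDEX idx_comments_parent_deleted
        ON comments (parent_id, is_deleted);
    """
)
conn.executemany(
    "INSERT INTO comments (parent_id, body, is_deleted) VALUES (?, ?, ?)",
    [(p, f"comment {n}", n % 2) for p in range(1, 51) for n in range(10)],
)

# Typical application query: live rows for one parent.
live = conn.execute(
    "SELECT COUNT(*) FROM comments WHERE parent_id = ? AND is_deleted = 0",
    (7,),
).fetchone()[0]

# Ask SQLite how it would run the query; the last column is a text description.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT body FROM comments WHERE parent_id = ? AND is_deleted = 0",
    (7,),
).fetchall()
plan_text = " ".join(row[3] for row in plan)
```

A flag column on its own has only two values, so an index on the flag alone is often not very selective; pairing it with the lookup column, as suggested above, is usually where the benefit comes from.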
Arguably, this decision could be made later if and only if performance problems actually appear. It very much depends on how many rows are added at what rate, your box specs, etc. Obviously, the level of abstraction in your application (and the limitations of any libraries you are using) will help determine how difficult such a change will be.
If it becomes a problem, or you are certain that it will be, start by partitioning on the deleted flag between two tables, one that holds current data and one that holds historical/deleted data. If, as you said, the "deleted" data will only be available to administrators, it is reasonable to suppose that (in most applications) the total number of users (here limited only to admins) will not be sufficient to cause a problem. This means that your admins might need to wait a little while longer when searching that particular table, but your user base (arguably more important in most applications) will experience far less latency. If performance becomes unacceptable for the admins, you will likely want to index the user_id (or transaction_id or whatever) field you access the deleted records by (I generally index every field by which I access the table, but at certain scale there can be trade-offs regarding which indexes are most worthwhile).
Depending on how the data is accessed, there are other simple tricks you can employ. If the admin is looking for a specific record most of the time (as opposed to, say, reading a "history" or "log" of user activity), one can often assume that more recent records will be looked at more often than old records. Some DBs include tuning options for making recent records easier to find than older records, but you'll have to look it up for your particular database. Failing that, you can manually do it. The easiest way would be to have an ancient_history table that contains all records older than n days, weeks or months, depending on your constraints and suspected usage patterns. Newer data then lives inside a much smaller table. Even if the admin is going to "browse" all the records rather than searching for a specific one, you can start by showing the first n days and have a link to see all days should they not find what they are looking for (e.g., most online banking applications let you browse transactions but show only the first 30 days of history unless you request otherwise).
Hopefully you can avoid having to go a step further and shard on user_id or some such scheme. Depending on the scale of the rest of your application, you might have to do this anyway. Unless you are positive that you will need to, I strongly suggest using vertical partitioning first (e.g., keeping your forum_posts on a separate machine from your sales_records), as it is FAR easier to set up and maintain. If you end up needing to shard on user_id, I suggest using google ;-]
Good luck. BTW, I'm not a DBA so take this with a grain of salt.

Never delete entries? Good idea? Usual?

I am designing a system and I don't think it's a good idea to give the end user the ability to delete entries in the database. I think that way because often the end user, once given admin rights, might end up making a mess in the database and then turn to me to fix it.
Of course, they will need to be able to remove entries, or at least think that they did, if they are set as admin.
So, I was thinking that all the entries in the database should have an "active" field. If they try to remove an entry, it will just set the flag to "false" or something similar. Then there will be some kind of super admin that would be my company's team who could change this field.
I already saw that in another company I worked for, but I was wondering if it was a good idea. I could just make regular database backups and then roll back if they commit an error and adding this field would add some complexity to all the queries.
What do you think? Should I do it that way? Do you use this kind of trick in your applications?
In one of our databases, we distinguished between transactional and dictionary records.
In a couple of words, transactional records are things that you cannot roll back in real life, like a call from a customer. You can change the caller's name, status etc., but you cannot dismiss the call itself.
Dictionary records are things that you can change, like assigning a city to a customer.
Transactional records and things that lead to them were never deleted, while dictionary ones could be deleted all right.
By "things that lead to them" I mean that as soon as the record appears in the business rules which can lead to a transactional record, this record also becomes transactional.
Like, a city can be deleted from the database. But when a rule appeared that said "send an SMS to all customers in Moscow", the cities became transactional records as well, or we would not be able to answer the question "why did this SMS get sent".
A rule of thumb for distinguishing was this: is it only my company's business?
If one of my employees made a decision based on data from the database (like, he made a report based on which some management decision was made, and then the data the report was based on disappeared), it was considered OK to delete these data.
But if the decision affected some immediate actions with customers (like calling, messing with the customer's balance etc.), everything that led to these decisions was kept forever.
It may vary from one business model to another: sometimes, it may be required to record even internal data, sometimes it's OK to delete data that affects outside world.
But for our business model, the rule from above worked fine.
A couple of reasons people do things like this are auditing and automated rollback. If a row is completely deleted then there's no way to automatically roll back that deletion if it was in error. Also, keeping a row around and its previous state is important for auditing - a super user should be able to see who deleted what and when as well as who changed what, etc.
Of course, that's all dependent on your current application's business logic. Some applications have no need for auditing and it may be proper to fully delete a row.
The downside to just setting a flag such as IsActive or DeletedDate is that all of your queries must take that flag into account when pulling data. This makes it more likely that another programmer will accidentally forget this flag when writing reports...
A slightly better alternative is to archive that record into a different database. This way it's been physically moved to a location that is not normally searched. You might add a couple fields to capture who deleted it and when; but the point is it won't be polluting your main database.
Further, you could provide an undo feature to bring it back fairly quickly; and do a permanent delete after 30 days or something like that.
UPDATE concerning views:
With views, the data still participates in your indexing scheme. If the amount of potentially deleted data is small, views may be just fine as they are simpler from a coding perspective.
I prefer the method that you are describing. It's nice to be able to undo a mistake. More often than not, there is no easy way of going back on a DELETE query. I've never had a problem with this method and unless you are filling your database with 'deleted' entries, there shouldn't be an issue.
I use a combination of techniques to work around this issue. For some things adding the extra "active" field makes sense. Then the user has the impression that an item was deleted because it no longer shows up on the application screen. The scenarios where I would implement this would include items that are required to keep a history... let's say invoices and payments. I wouldn't want such things being deleted for any reason.
However, there are some items in the database that are not so sensitive, let's say a list of categories that I want to be dynamic... I may then allow users with admin privileges to add and delete a category, and the delete could be permanent. However, as part of the application logic I will check if the category is used anywhere before allowing the delete.
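The "check if the category is used anywhere before allowing the delete" step might look like this. It is a sketch with Python's sqlite3 standing in for MySQL; the `categories` and `products` tables and the function name are hypothetical.

```python
import sqlite3

# sqlite3 stands in for MySQL; tables and function name are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE products (
        id          INTEGER PRIMARY KEY,
        name        TEXT,
        category_id INTEGER REFERENCES categories (id)
    );
    INSERT INTO categories (id, name) VALUES (1, 'books'), (2, 'unused');
    INSERT INTO products (id, name, category_id) VALUES (1, 'novel', 1);
    """
)


def delete_category_if_unused(conn, category_id):
    """Hard-delete a category only when no product references it."""
    in_use = conn.execute(
        "SELECT EXISTS (SELECT 1 FROM products WHERE category_id = ?)",
        (category_id,),
    ).fetchone()[0]
    if in_use:
        return False
    with conn:
        conn.execute("DELETE FROM categories WHERE id = ?", (category_id,))
    return True


deleted_used = delete_category_if_unused(conn, 1)    # referenced by 'novel'
deleted_unused = delete_category_if_unused(conn, 2)  # not referenced
remaining = [r[0] for r in conn.execute("SELECT id FROM categories ORDER BY id")]
```

An application-level check like this can race with a concurrent insert; an enforced foreign key constraint is the way to have the database itself reject the delete.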
I suggest having a second database like DB_Archives where you add every row deleted from DB. The is_active field negates the very purpose of foreign key constraints, and YOU have to make sure that a row is not marked as deleted while it's referenced elsewhere. This becomes overly complicated when your DB structure is massive.
This is an accepted practice that exists in many applications (Drupal's versioning system, et al.). Since MySQL scales very quickly and easily, you should be okay.
I've been working on a project lately where all the data was kept in the DB as well. The status of each individual row was kept in an integer field (data could be active, deleted, in_need_for_manual_correction, historic).
You should consider using views to access only the active/historic/... data in each table. That way your queries won't get more complicated.
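A sketch of the status column plus views, using Python's sqlite3 as a stand-in for MySQL. The `items` table, the numeric status codes and the view names are hypothetical; the answer above does not say which values were actually used.

```python
import sqlite3

# Hypothetical status codes; the original project's values are not known.
STATUS_ACTIVE, STATUS_DELETED, STATUS_HISTORIC = 1, 2, 3

# sqlite3 stands in for MySQL; table and view names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE items (
        id     INTEGER PRIMARY KEY,
        title  TEXT,
        status INTEGER NOT NULL DEFAULT 1
    );
    -- Queries read from the views and never repeat the status filter.
    CREATE VIEW active_items  AS SELECT id, title FROM items WHERE status = 1;
    CREATE VIEW deleted_items AS SELECT id, title FROM items WHERE status = 2;
    INSERT INTO items (id, title, status)
    VALUES (1, 'current', 1), (2, 'removed', 2), (3, 'old', 3);
    """
)

active_before = conn.execute("SELECT id, title FROM active_items ORDER BY id").fetchall()

# "Deleting" item 1 is just a status change on the base table.
conn.execute("UPDATE items SET status = ? WHERE id = ?", (STATUS_DELETED, 1))

active_after = conn.execute("SELECT id, title FROM active_items ORDER BY id").fetchall()
deleted_after = conn.execute("SELECT id, title FROM deleted_items ORDER BY id").fetchall()
```

Application queries that only ever touch `active_items` cannot forget the filter, which addresses the "another programmer forgets the flag" concern raised earlier in the thread.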
Another thing that made things easy was the use of UPDATE/INSERT/DELETE triggers that handled all the flag changing inside the DB and thus kept the complex stuff out of the application (for the most part).
I should mention that the DB was an MSSQL 2005 server, but I guess the same approach should work with MySQL, too.
Yes and no.
It will complicate your application much more than you expect, since every table that does not allow deletion will be behind an extra check (IsDeleted=false) etc. It does not sound like much, but when you build a larger application and, in a query of 11 tables, 9 require a check for non-deletion, it's tedious and error prone. (Well yeah, then there are deleted/non-deleted views... when you remember to create and use them.)
Some schema upgrades will become a PITA since you'll have to relax FKs and invent "suitable" data for very, very old data.
I've not tried, but have thought a moderate amount about solution where you'd zip the row data to xml and store that in some "Historical" table. Then in case of "must have that restored now OMG the world is dying!1eleven" it's possible to dig out.
I agree with all respondents that if you can afford to keep old data around forever it's a good idea; for performance and simplicity, I agree with the suggestion of moving "logically deleted" records to "old stuff" tables rather than adding "is_deleted" flags (moving to a totally different database seems a bit like overkill, but you can easily change to that more drastic approach later if eventually the amount of accumulated data turns out to be a problem for a single db with normal and "old stuff" tables).