Designing tables for deleted data - mysql

I have many tables where data needs to be "marked for deletion" but not deleted, or toggle between published and hidden data.
The most intuitive way to handle these cases is to add a column in the database, such as deleted int(1) or public int(1). This raises the concern of having to remember to specify WHERE deleted=0 every single time the table is accessed.
I considered overcoming this by creating duplicate tables for deleted/unpublished data, such as article => article_deleted, and moving the data instead of deleting it. This presents two issues:
Foreign key constraints end up being extremely annoying to maintain
Number of tables with hidden content doubles (in my case ~20 becomes ~40 tables)
My last idea is to create a duplicate of the entire database called unreleased and migrate data there.
My question isn't about the safety of the data management, but rather: what is the right way to do this from the beginning?

I have run into this exact issue before and I think it is a bad idea to create an unnecessarily cumbersome DB because you are afraid of bad code.
I think it would be a better idea to do thorough testing on your Test server before you release to production. Even I was tripped up by the "Deleted" column a few times when I first encountered it but I eventually caught on, and if you have a proper Dev/Test/Production environment you should be fine.
In summary, keep the delete column and demand more from your coders.
UPDATE:
Alternatively you could create a view that only pulls the records that aren't deleted and make sure everyone uses that for select queries.
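As a minimal sketch of such a view, assuming a table named article with a deleted flag (the names here are illustrative):

    CREATE VIEW article_visible AS
        SELECT *
        FROM article
        WHERE deleted = 0;

    -- application code reads from the view instead of the base table
    SELECT id, title FROM article_visible;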

I think your initial approach is "correct" and "right", but your concern about it being slightly error-prone is a valid one.
You'll probably just have to make sure that your test procedures are rigorous enough to catch errors.

The first approach is the best I've come up with. I call the column active instead of deleted. The record exists but it can be either active or inactive. That way, if you really do need to delete things, the terminology doesn't get screwy.
Saying "Delete the inactive records" makes sense but saying "Delete the deleted records" just gets confusing.

Related

Delete all then insert all, or update, delete and insert as needed?

I have a web form that is used to create and update personal information. On save, I collect all the info in a large multidimensional JSON array. When updating the database, the information will potentially consist of three parts: new rows to be created, rows that need to be updated, and rows that need to be deleted. These rows will also be spread across about five tables.
My question is this: how should I approach the MySQL queries? My initial thought was to DELETE all the information from all the tables and do a clean INSERT of all the new information in one go. I guess the other approach would be to do three kinds of queries: UPDATE all those with an existing ID, DELETE all those marked for deletion, and INSERT all the newly created data (data without existing IDs).
Which of these approaches would be best, or is there a better way of doing this? Thanks for any advice. I appreciate it.
Delete-all-then-insert-all should NEVER be practiced.
Reasons:
It is too costly. Most of the time the user performs an edit, so for what were just a few updates, you do one delete and a hundred inserts.
It plays havoc with ON DELETE CASCADE foreign keys.
It upsets auto-increment fields even when they were apparently not touched.
You need to implement a unit of work. I don't know which language you are working with, but some languages have built-in support for this; in .NET we have DataSets.
Basics:
Keep track of each record you fetched from the database. Quietly maintain a flag for each record to note which were loaded from the DB (i.e. untouched), which have modifications (need an update query), and which are newly added. For the deleted records, maintain a separate list (perhaps of their IDs). How to achieve this is a matter for a separate discussion.
When the user clicks Save, start a database transaction. This is not strictly part of the current discussion, but it is almost always done in situations like this.
In the transaction, first loop through the deleted items list and fire a DELETE query for each of them.
Then loop through the modified items. For each modified item you may simply update all of its columns to the latest values. If the number of columns is very large (>30), things change a bit.
Then come the newly created items: fire one INSERT for each of them.
Finally, commit the transaction.
If the language you are programming in supports try/catch blocks, then perform all of the above steps (after beginning the transaction) inside a try/catch, and roll back the transaction in the catch block.
This approach (sketched in SQL below) looks more complicated and seems to fire more queries than the simple delete-all/insert-all approach, but trust me: we have been there, done that, and then spent sleepless nights undoing all that was done. Never go the delete/insert way unless you can really justify it.
How to do the change tracking depends a lot on the language and type of application you are using. Even for .NET the approach differs between desktop and web applications. Tracking deletions is easy, as is tracking new insertions. The update marks are applied by trapping the edit event on any of the columns of that record.
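A plain-SQL sketch of the transaction described above, using a made-up person table; the application's change tracker supplies the IDs and values for each loop:

    START TRANSACTION;

    -- 1. deleted items: one DELETE per tracked ID
    DELETE FROM person WHERE id = 17;

    -- 2. modified items: one UPDATE per dirty record
    UPDATE person SET first_name = 'Anna', last_name = 'Smith' WHERE id = 9;

    -- 3. newly created items: one INSERT per new record
    INSERT INTO person (first_name, last_name) VALUES ('John', 'Doe');

    COMMIT;
    -- on any error, the application issues ROLLBACK instead of COMMIT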
EDIT
The data spans about five tables, hence the three loops (delete/update/insert) have to be done five times, once for each table. First draw the relationships among the tables. Process the top-level tables first, then the tables directly connected to them, and so on. If you have a cyclic relationship among the tables, you have to be especially careful.
The code behind the Save operation is going to grow quite long: 5 x 3 = 15 operations, each with its own SQL. None of these operations is expected to be reusable, so putting them in separate methods is futile; everything is going to end up in one large procedural block. So comment the code religiously, and mark the table boundaries and the operations.
You probably don't want to do any deletes. Just mark the obsolete entries as "inactive", or maybe timestamp them as having an ending validity.
Under this philosophy, all edits are actually insertions: no modifications (except to set the "expire" field) and no deletes. To update a name, mark the existing record as expired and insert a new record with a beginning-validity timestamp at the same time.
In such a database, auditing and data recovery are easily performed.
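A sketch of that expire-and-insert pattern, with hypothetical column names (person_key identifies the logical person, valid_from/valid_to bound each version):

    START TRANSACTION;

    -- expire the current version of the record
    UPDATE person
    SET    valid_to = NOW()
    WHERE  person_key = 9 AND valid_to IS NULL;

    -- insert the replacement version with a fresh validity period
    INSERT INTO person (person_key, name, valid_from, valid_to)
    VALUES (9, 'New Name', NOW(), NULL);

    COMMIT;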

Why do I need a UNIQUE constraint on a column if my application already validates the data before saving it?

UNIQUE is an index which makes your field, well, unique. But is it worth using it if you're already doing your validation in PHP prior to inserting new data? An extra INDEX isn't the end of the world but if you're after query optimization then UNIQUE just gets in the way, right?
Why wear a seat belt if you're a good driver and you can save two seconds of your total trip time?
One of the most important lessons for a programmer to learn is that he is human and he makes mistakes. Worse, everyone else working on this code is human, too.
Why does the UNIQUE constraint exist? To protect the database from humans making mistakes. Turning off your UNIQUE constraint says "You do not need to worry, Mr. Database, I will never give you data that doesn't match my intent."
What if something happens to your code such that your validation for uniqueness breaks? Now your code dumps duplicate records into the database. But if you had a UNIQUE constraint on that column, when your front-end code stopped working, you'd get your queries blowing up.
You're human. Accept it. Let the computer do its job and help protect you from yourself.
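As a sketch, with an assumed users table and email column:

    -- the constraint is the last line of defence if the PHP check is bypassed or broken
    ALTER TABLE users ADD CONSTRAINT uq_users_email UNIQUE (email);

    INSERT INTO users (email) VALUES ('a@example.com');   -- ok
    INSERT INTO users (email) VALUES ('a@example.com');   -- fails: ERROR 1062 Duplicate entry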
UNIQUE is not only for making sure data is valid. The primary purpose is to optimize queries: if the database knows the field is unique, it can stop searching for hits as soon as the first record is found. You can't pass that information to the database through well-crafted queries alone.
That is an interesting question.
Are you sure that there is no way for your code to be bypassed?
Are you sure nothing else will ever access the data besides the PHP application?
Are you sure the rest of your application won't fail if a duplicate is inserted?
What would be the implications of having duplicate entries? Would that cause problems for future references or calculations?
These are some of the questions that constraints at the database level help solve.
As for optimization, a constraint does not make retrieving data noticeably slower, and it can in fact be used in the execution plan at some point, since it is backed by an index.
So no, it won't get in the way of optimization and it will also protect your data from inconsistencies.
As pst mentions, at this stage in your development, you are in no position to begin optimizing your database or the application in question.
It's generally not a bad thing to add additional sanity checks in your system. Yes, you're hurting performance just that tiny little bit, but in no way will any user ever notice an extra CPU tick or two.
Think about this: Today you do your validation in php, but do not assert uniqueness in the database. In the future, you, a colleague, or some other guy who has forked your project changes the original php validation, ruins it, or forgets it altogether. At this point, you'll probably wish you had that added check in your database.
tl;dr: Transactional Integrity (in the database) handles Race Conditions (in the application).
The Concurrency and integrity section of these Rails docs explains why this is necessary with an example scenario.
Databases with transactional integrity guarantee uniqueness through isolation, while applications actually take a few separate steps (get the value, check if there are other values, then save the value) outside of transactional isolation that leave them vulnerable to race conditions, especially at scale.
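The race can be sketched with the same assumed users table; the interleaving shown in the comments is exactly what the constraint closes off:

    -- session A checks for a duplicate and sees none
    SELECT COUNT(*) FROM users WHERE email = 'a@example.com';   -- 0

    -- session B runs the same check before A has inserted, and also sees none
    SELECT COUNT(*) FROM users WHERE email = 'a@example.com';   -- 0

    -- both sessions now insert; without a UNIQUE index both rows are accepted
    INSERT INTO users (email) VALUES ('a@example.com');         -- session A
    INSERT INTO users (email) VALUES ('a@example.com');         -- session B: duplicate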

What is the best way (in Rails/AR) to ensure writes to a database table are performed synchronously, one after another, one at a time?

I have noticed that using something like delayed_job without a UNIQUE constraint on a table column would still create double entries in the DB. I had assumed delayed_job would run jobs one after another. The Rails app runs on Apache with Phusion Passenger. I am not sure if that is the reason why this happens, but I would like to make sure that every item in the queue is persisted to AR/the DB one after another, in sequence, and that there is never more than one write to this DB table happening at the same time. Is this possible? What would be some of the issues that I would have to deal with?
update
The race conditions arise because an AJAX API is used to send data to the application. The application receives batches of data, and each batch is identified as belonging together by a Session ID (SID). In the end, the final state of the database has to reflect the latest, most up-to-date AJAX PUT request to the API. Sometimes requests arrive at exactly the same time for the same SID, so I need a way to make sure they don't all try to be persisted at the same time, but one after the other, or simply that only the last one sent by AJAX request to the API is kept.
I hope that makes my particular use-case easier to understand...
You can lock a specific table (or tables) with the LOCK TABLES statement.
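For example (orders here is a made-up table name):

    LOCK TABLES orders WRITE;

    -- only this connection can read or write `orders` until the lock is released
    INSERT INTO orders (customer_id, total) VALUES (3, 19.99);

    UNLOCK TABLES;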
In general I would say that relying on this is poor design and will likely lead to scalability problems down the road, since you're creating a bottleneck in your application flow.
With your further explanations, I'd be tempted to add some extra columns to the table used by delayed_job, with a unique index on them. If (for example) you only ever wanted 1 job per user you'd add a user_id column and then do
something.delay(:user_id => user_id).some_method
You might need more attributes if the pattern is more sophisticated, e.g. there are lots of different types of jobs and you only wanted one per person, per type, but the principle is the same. You'd also want to be sure to rescue ActiveRecord::RecordNotUnique and deal with it gracefully.
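On the database side, that unique index might look something like this (assuming the gem's usual delayed_jobs table; the user_id column is the addition suggested above):

    ALTER TABLE delayed_jobs ADD COLUMN user_id INT NULL;
    CREATE UNIQUE INDEX index_delayed_jobs_on_user_id ON delayed_jobs (user_id);
    -- NULLs may repeat in a MySQL unique index, so jobs without a user are unaffected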
For non-delayed_job work, optimistic locking is often a good compromise: it handles the concurrent cases well without slowing down the non-concurrent cases.
If you are worried about multiple processes writing to the 'same' rows - as in several users updating the same order_header row - I'd suggest you set a marker bound to the current_user.id on the row once /order_headers/:id/edit is called, and remove it again once the current_user releases the row, either by updating or by canceling the edit.
Your use case (from your description) seems a bit different to me, so I'd suggest you leave it to the DB: with a fairly recent (post-5.1) MySQL you'd add a trigger/function that does the actual update, and there you could implement logic similar to the above - some marker bound to a sequenced job ID of sorts.

Database Upserts - Good or Bad Practice?

Looking for some insight as to whether an Upsert (insert or if exists, then update) procedure is considered bad practice in database programming. I work in SQL server if that bears any relevance.
At a place I worked some months ago, the resident DB guru stated in newly written DB coding standards (most of which I agreed with) that upserts should be avoided.
I can't really see a logical reason for this, and I consider myself reasonably conscious of good programming practice. I think they are useful for straightforward data management and help to avoid an excessive number of stored procedures.
Looking for some insight / discussion that will help me come to a conclusion on this.
Thanks.
Update In response to comments:
The specific context I refer to is the creation or update of a domain entity data representation in the database. Say for example a "Person" object exists as a representation of the "Person" table in the database. I simply need a mechanism for creating a new Person, or updating an existing one. Here I have the option of creating an Upsert stored procedure, or two separate stored procedures - one for Update, and one for Insert.
Any advantages or disadvantages in anyones view?
The primary problem is overwriting an existing record when the intention is to add a new record because whatever was selected as the key is duplicated. Say a login name for example. You see that login exists so you update when you should have kicked back an error that the login is a duplicate.
A second problem is with resurrecting a deleted record. Say process "A" queries the record, process "B" deletes it, and then process "A" submits a change. The record that was intended to be deleted is now back in the database rather than passing an exception back to "A" that it was deleted.
I like to program on purpose.
Either I'm creating something, in which case I would want the insert to fail (as a duplicate) if there already was an entity there. Or I'm updating something that I know is there, in which case I'd like the update to fail if the row is missing (which, with an upsert, doesn't happen).
With upsert/merge this gets kind of fuzzy. Did I or did I not succeed? Did I partially succeed? Are some of the values in the row mine (from the insert) while some of them were already there?
Having said that, upserts are useful (which is why they were implemented to begin with) and banning them would be just silly. That's like banning roads because criminals use them to get away from the cops. There are countless cases where upserts are the only reasonable way of doing things, and anyone who has worked with synchronizing data between systems knows this.
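For reference, in MySQL an upsert can be written as a single statement (the question is about SQL Server, where MERGE or an IF EXISTS branch in the procedure plays the same role; person is an assumed table):

    INSERT INTO person (id, first_name, last_name)
    VALUES (9, 'Anna', 'Smith')
    ON DUPLICATE KEY UPDATE
        first_name = VALUES(first_name),
        last_name  = VALUES(last_name);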
It depends what you are talking about. Data? Well, that is determined by your data-manipulation processes, isn't it? If I need to insert OR update, then I need to do that. If it is about schema objects, the same applies.
The objection in the first example of two processes - where a second process "resurrects" a deleted record by adding a new one with the same key - is only valid in specific instances, which would result from poor design or would happen regardless of whether an "upsert" procedure wrote the record with the identical key or two separate procedures wrote it.
Where an identical key must be avoided, an auto-incrementing identity key is used in the insert. Where an identical key does not need to be avoided, good database design has to be implemented to avoid creating "phantom joins". For example, in the telecommunications world telephone numbers are often reused and are a "unique" key. They cannot be the primary key, because person #2 might "inherit" the phone number but should not "inherit" person #1's overdue unpaid bill or call history, etc. So a combination of an auto-incrementing primary key plus service dates or other unique identifiers would be used in any join logic to prevent bad data chaining.
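A sketch of that design, with invented names; joins go through the surrogate subscriber_id, never through the reusable phone number alone:

    CREATE TABLE subscriber (
        subscriber_id INT AUTO_INCREMENT PRIMARY KEY,   -- surrogate key, never reused
        phone_number  VARCHAR(20) NOT NULL,             -- reusable "natural" identifier
        service_start DATE NOT NULL,
        service_end   DATE NULL,
        UNIQUE (phone_number, service_start)            -- a number is only unique per service period
    );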

Never delete entries? Good idea? Usual?

I am designing a system, and I don't think it's a good idea to give the end user the ability to delete entries in the database. I think that way because often the end user, once given admin rights, might end up making a mess in the database and then turn to me to fix it.
Of course, they will need to be able to remove entries, or at least think that they did, if they are set as admin.
So, I was thinking that all the entries in the database should have an "active" field. If they try to remove an entry, it will just set the flag to "false" or something similar. Then there will be some kind of super admin that would be my company's team who could change this field.
I already saw this at another company I worked for, but I was wondering if it is a good idea. I could just make regular database backups and then roll back if they make a mistake, and adding this field would add some complexity to all the queries.
What do you think? Should I do it that way? Do you use this kind of trick in your applications?
In one of our databases, we distinguished between transactional and dictionary records.
In a couple of words, transactional records are things that you cannot roll back in real life, like a call from a customer. You can change the caller's name, status etc., but you cannot dismiss the call itself.
Dictionary records are things that you can change, like assigning a city to a customer.
Transactional records and things that lead to them were never deleted, while dictionary ones could be deleted all right.
By "things that lead to them" I mean that as soon as the record appears in the business rules which can lead to a transactional record, this record also becomes transactional.
Like, a city can be deleted from the database. But when a rule appeared that said "send an SMS to all customers in Moscow", the cities became transactional records as well, or we would not be able to answer the question "why did this SMS get sent".
A rule of thumb for distinguishing was this: is it only my company's business?
If one of my employees made a decision based on data from the database (say he made a report, based on which some management decision was made, and then the data the report was based on disappeared), it was considered OK to delete that data.
But if the decision affected some immediate actions with customers (like calling, adjusting the customer's balance, etc.), everything that led to those decisions was kept forever.
It may vary from one business model to another: sometimes, it may be required to record even internal data, sometimes it's OK to delete data that affects outside world.
But for our business model, the rule from above worked fine.
A couple of reasons people do things like this are auditing and automated rollback. If a row is completely deleted, then there's no way to automatically roll back that deletion if it was made in error. Also, keeping a row around along with its previous state is important for auditing: a super user should be able to see who deleted what and when, as well as who changed what, etc.
Of course, that's all dependent on your current application's business logic. Some applications have no need for auditing and it may be proper to fully delete a row.
The downside to just setting a flag such as IsActive or DeletedDate is that all of your queries must take that flag into account when pulling data. This makes it more likely that another programmer will accidentally forget this flag when writing reports...
A slightly better alternative is to archive that record into a different database. This way it has been physically moved to a location that is not normally searched. You might add a couple of fields to capture who deleted it and when; but the point is that it won't be polluting your main database.
Further, you could provide an undo feature to bring it back fairly quickly; and do a permanent delete after 30 days or something like that.
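A sketch of that archive-and-delete step, assuming an archive_db schema on the same server and made-up column names:

    START TRANSACTION;

    -- copy the row to the archive, stamping who deleted it and when
    INSERT INTO archive_db.article (id, title, body, deleted_by, deleted_at)
    SELECT id, title, body, 42, NOW()
    FROM   main_db.article
    WHERE  id = 7;

    DELETE FROM main_db.article WHERE id = 7;

    COMMIT;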
UPDATE concerning views:
With views, the data still participates in your indexing scheme. If the amount of potentially deleted data is small, views may be just fine as they are simpler from a coding perspective.
I prefer the method that you are describing. It's nice to be able to undo a mistake. More often than not, there is no easy way of going back on a DELETE query. I've never had a problem with this method, and unless you are filling your database with 'deleted' entries, there shouldn't be an issue.
I use a combination of techniques to work around this issue. For some things, adding the extra "active" field makes sense. The user then has the impression that an item was deleted because it no longer shows up on the application screen. The scenarios where I would implement this include items that are required to keep a history... let's say invoices and payments. I wouldn't want such things being deleted for any reason.
However, there are some items in the database that are not so sensitive, let's say a list of categories that I want to be dynamic... I may then allow users with admin privileges to add and delete a category, and the delete could be permanent. However, as part of the application logic I will check whether the category is used anywhere before allowing the delete.
I suggest having a second database, such as DB_Archives, where you add every row deleted from the main DB. The is_active field negates the very purpose of foreign key constraints, and YOU have to make sure that a row is not marked as deleted while it's still referenced elsewhere. This becomes overly complicated when your DB structure is massive.
This is an accepted practice that exists in many applications (Drupal's versioning system, et al.). Since MySQL scales very quickly and easily, you should be okay.
I've been working on a project lately where all the data was kept in the DB as well. The status of each individual row was kept in an integer field (data could be active, deleted, in_need_for_manual_correction, historic).
You should consider using views to access only the active/historic/... data in each table. That way your queries won't get more complicated.
Another thing that made things easy was the use of UPDATE/INSERT/DELETE triggers that handled all the flag changing inside the DB and thus kept the complex stuff out of the application (for the most part).
I should mention that the DB was an MSSQL 2005 server, but I guess the same approach should work with MySQL, too.
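One caveat: MySQL has no INSTEAD OF triggers, so a DELETE cannot be silently converted into a flag update the way MSSQL allows. The closest MySQL sketch is an AFTER DELETE trigger that records the removed row in a history table (all names here are invented):

    DELIMITER //
    CREATE TRIGGER article_after_delete
    AFTER DELETE ON article
    FOR EACH ROW
    BEGIN
        -- keep a copy of the deleted row, flagged with its status and a timestamp
        INSERT INTO article_history (article_id, title, status, changed_at)
        VALUES (OLD.id, OLD.title, 'deleted', NOW());
    END//
    DELIMITER ;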
Yes and no.
It will complicate your application much more than you expect, since every table that does not allow deletion will be behind an extra check (IsDeleted = false) etc. It does not sound like much, but when you build a larger application and, in a query of 11 tables, 9 require a check for non-deletion... it's tedious and error-prone. (Well yes, then there are deleted/non-deleted views... when you remember to create and use them.)
Some schema upgrades will become a PITA, since you'll have to relax FKs and invent "suitable" values for very, very old data.
I've not tried it, but I have thought a moderate amount about a solution where you'd zip the row data to XML and store that in some "Historical" table. Then, in case of "must have that restored now, OMG the world is dying!", it's possible to dig it out.
I agree with all respondents that if you can afford to keep old data around forever, it's a good idea. For performance and simplicity, I agree with the suggestion of moving "logically deleted" records to "old stuff" tables rather than adding "is_deleted" flags. Moving them to a totally different database seems a bit like overkill, but you can easily switch to that more drastic approach later if the amount of accumulated data eventually turns out to be a problem for a single DB with normal and "old stuff" tables.