UNIQUE constraint vs checking before INSERT - mysql

I have a SQL server table RealEstate with columns - Id, Property, Property_Value. This table has about 5-10 million rows and can increase even more in the future. I want to insert a row only if a combination of Id, Property, Property_Value does not exist in this table.
Example Table -
1,Rooms,5
1,Bath,2
1,Address,New York
2,Rooms,2
2,Bath,1
2,Address,Miami
Inserting 2,Address,Miami should NOT be allowed. But, 2,Price,2billion is okay. I am curious to know which is the "best" way to do this and why. The why part is most important to me. The two ways of checking are -
At application level - The app should check if a row exists before it inserts a row.
At database level - Set unique constraints on all 3 columns and let the database
do the checking instead of person/app.
Is there any scenario where one would be better than the other ?
Thanks.
PS: I know there is a similar question already, but it does not answer my problem -
Unique constraint vs pre checking
Also, I think that UNIQUE is applicable to all databases, so I don't think I should remove the mysql and oracle tags.

I think in most cases the differences between the two are going to be small enough that the choice should mostly be driven by picking the implementation that ends up being most understandable to someone looking at the code for the first time.
However, I think exception handling has a few small advantages:
Exception handling avoids a potential race condition. The 'check, then insert' method might fail if another process inserts a record between your check and your insert. So, even if you're doing 'check then insert' you still want exception handling on the insert and if you're already doing exception handling anyways then you might as well do away with the initial check.
If your code is not a stored procedure and has to interact with the database via the network (i.e. the application and the db are not on the same box), then you want to avoid having two separate network calls (one for the check and the other for the insert) and doing it via exception handling provides a straightforward way of handling the whole thing with a single network call. Now, there are tons of ways to do the 'check then insert' method while still avoiding the second network call, but simply catching the exception is likely to be the simplest way to go about it.
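To make that concrete, here is a minimal sketch assuming MySQL and the table from the question (the constraint name is my own):
-- The unique index does the check and the insert atomically, so there is no race window.
ALTER TABLE RealEstate
  ADD CONSTRAINT uq_id_property_value UNIQUE (Id, Property, Property_Value);
-- The application then issues only this one statement and catches the
-- duplicate-key error (MySQL error 1062, SQLSTATE 23000) if it fires.
INSERT INTO RealEstate (Id, Property, Property_Value)
VALUES (2, 'Address', 'Miami');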
On the other hand, exception handling requires a unique constraint (which is really a unique index), which comes with a performance tradeoff:
Creating a unique constraint will be slow on very large tables and it will cause a performance hit on every single insert to that table. On truly large databases you also have to budget for the extra disk space consumed by the unique index used to enforce the constraint.
On the other hand, it might make selecting from the table faster if your queries can take advantage of that index.
I'd also note that if you're in a situation where what you actually want to do is 'update else insert' (i.e. if a record with the unique value already exists then you want to update that record, else you insert a new record) then what you actually want to use is your particular database's UPSERT method, if it has one. For SQL Server and Oracle, this would be a MERGE statement.
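Since the question is also tagged mysql: MySQL has no MERGE, but a hedged sketch of the equivalent idea is INSERT ... ON DUPLICATE KEY UPDATE. This example assumes the unique key is on (Id, Property) only, so that a re-sent value replaces the old one:
-- Insert a new row, or update the row that collides with the unique key on (Id, Property).
INSERT INTO RealEstate (Id, Property, Property_Value)
VALUES (2, 'Price', '2billion')
ON DUPLICATE KEY UPDATE Property_Value = VALUES(Property_Value);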

Provided the cost of option #1 (doing a lookup before the insert) is reasonable, I would do both. At least in Oracle, which is the database I have the most experience with.
Rationale:
Unique/primary keys should be a core part of your data model design, I can't see any reason to not implement them - if you have so much data that performance suffers from maintaining the unique index:
that's a lot of data
partition it or archive it away from your OLTP work
The more constraints you have, the safer your data is against application logic errors.
If you check that a row exists first, you can easily extract other information from that row to use as part of an error message, or otherwise fork the application logic to cope with the duplication.
In Oracle, rolling back DML statements is relatively expensive because Oracle expects to succeed (i.e. COMMIT changes that have been written) by default.
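A rough sketch of the "do both" approach, reusing the question's table (MySQL-flavoured here; names are assumed):
-- Check first, so the application can build a friendly error message
-- or branch its logic using whatever else is in the existing row...
SELECT Property_Value
FROM RealEstate
WHERE Id = 2 AND Property = 'Address' AND Property_Value = 'Miami';
-- ...but still let the unique constraint be the final arbiter, so a row
-- slipped in by another session between the two statements is rejected.
INSERT INTO RealEstate (Id, Property, Property_Value)
VALUES (2, 'Address', 'Miami');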

This does not answer the question directly, but I thought it might be helpful to post it here since it's better than Wikipedia and the link might just go dead someday.
Link - http://www.celticwolf.com/blog/2010/04/27/what-is-a-race-condition/
Wikipedia has a good description of a race condition, but it’s hard to follow if you don’t understand the basics of programming. I’m going to try to explain it in less technical terms, using the example of generating an identifier as described above. I’ll also use analogies to human activities to try to convey the ideas.
A race condition is when two or more programs (or independent parts of a single program) all try to acquire some resource at the same time, resulting in an incorrect answer or conflict. This resource can be information, like the next available appointment time, or it can be exclusive access to something, like a spreadsheet. If you’ve ever used Microsoft Excel to edit a document on a shared drive, you’ve probably had the experience of being told by Excel that someone else was already editing the spreadsheet. This error message is Excel’s way of handling the potential race condition gracefully and preventing errors.
A common task for programs is to identify the next available value of some sort and then assign it. This technique is used for invoice numbers, student IDs, etc. It’s an old problem that has been solved before. One of the most common solutions is to allow the database that is storing the data to generate the number. There are other solutions, and they all have their strengths and weaknesses.
Unfortunately, programmers who are ignorant of this area or simply bad at programming frequently try to roll their own. The smart ones discover quickly that it’s a much more complex problem than it seems and look for existing solutions. The bad ones never see the problem or, once they do, insist on making their unworkable solution ever more complex without fixing the error. Let’s take the example of a student ID. The neophyte programmer says “to know what the next student number should be, we’ll just get the last student number and increment it.” Here’s what happens under the hood:
Betty, an admin. assistant in the admissions office fires up the student management program. Note that this is really just a copy of the program that runs on her PC. It talks to the database server over the school’s network, but has no way to talk to other copies of the program running on other PCs.
Betty creates a new student record for Bob Smith, entering all of the information.
While Betty is doing her data entry, George, another admin. assistant, fires up the student management program on his PC and begins creating a record for Gina Verde.
George is a faster typist, so he finishes at the same time as Betty. They both hit the “Save” button at the same time.
Betty’s program connects to the database server and gets the highest student number in use, 5012.
George’s program, at the same time, gets the same answer to the same question.
Both programs decide that the new student ID for the record that they’re saving should be 5013. They add that information to the record and then save it in the database.
Now Bob Smith (Betty’s student) and Gina Verde (George’s student) have the same student ID.
This student ID will be attached to all sorts of other records, from grades to meal cards for the dining hall. Eventually this problem will come to light and someone will have to spend a lot of time assigning one of them a new ID and sorting out the mixed-up records.
When I describe this problem to people, the usual reaction is “But how often will that happen in practice? Never, right?”. Wrong. First, when data entry is being done by your staff, it’s generally done during a relatively small period of time by everyone. This increases the chances of an overlap. If the application in question is a web application open to the general public, the chances of two people hitting the “Save” button at the same time are even higher. I saw this in a production system recently. It was a web application in public beta. The usage rate was quite low, with only a few people signing up every day. Nevertheless, six pairs of people managed to get identical IDs over the space of a few months. In case you’re wondering, no, neither I nor anyone from my team wrote that code. We were quite surprised, however, at how many times that problem occurred. In hindsight, we shouldn’t have been. It’s really a simple application of Murphy’s Law.
How can this problem be avoided? The easiest way is to use an existing solution to the problem that has been well tested. All of the major databases (MS SQL Server, Oracle, MySQL, PostgreSQL, etc.) have a way to increment numbers without creating duplicates. MS SQL Server calls it an “identity” column, while MySQL calls it an AUTO_INCREMENT column, but the function is the same. Whenever you insert a new record, a new identifier is automatically created and is guaranteed to be unique. This would change the above scenario as follows:
Betty, an admin. assistant in the admissions office fires up the student management program. Note that this is really just a copy of the program that runs on her PC. It talks to the database server over the school’s network, but has no way to talk to other copies of the program running on other PCs.
Betty creates a new student record for Bob Smith, entering all of the information.
While Betty is doing her data entry, George, another admin. assistant, fires up the student management program on his PC and begins creating a record for Gina Verde.
George is a faster typist, so he finishes at the same time as Betty. They both hit the “Save” button at the same time.
Betty’s program connects to the database server and hands it the record to be saved.
George’s program, at the same time, hands over the other record to be saved.
The database server puts both records into a queue and saves them one at a time, assigning the next available number to them.
Now Bob Smith (Betty’s student) gets ID 5013 and Gina Verde (George’s student) gets ID 5014.
With this solution, there is no problem with duplication. The code that does this for each database server has been tested repeatedly over the years, both by the manufacturer and by users. Millions of applications around the world rely on it and continue to stress test it every day. Can anyone say the same about their homegrown solution?
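For illustration, a minimal sketch of "let the database generate the number", assuming MySQL and a made-up students table:
CREATE TABLE students (
  student_id INT NOT NULL AUTO_INCREMENT,  -- the server hands out the next number
  name VARCHAR(100) NOT NULL,
  PRIMARY KEY (student_id)
);
-- Betty's and George's programs both just insert; the server serializes the
-- inserts and each new row gets its own student_id.
INSERT INTO students (name) VALUES ('Bob Smith');
INSERT INTO students (name) VALUES ('Gina Verde');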
There is at least one well tested way to create identifiers in the software rather than in the database: uuids (Universally Unique Identifiers). However, a uuid takes the form of xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx where “x” stands for a hexadecimal digit (0-9 and a-f). Do you want to use that for an invoice number, student ID or some other identifier seen by the public? Probably not.
To summarize, a race condition occurs when two programs, or two independent parts of a program, attempt to access some information or access a resource at the same time, resulting in an error, be it an incorrect calculation, a duplicated identifier or conflicting access to a resource. There are many more types of race conditions than I’ve presented here and they affect many other areas of software and hardware.

The description of your problem is exactly why primary keys can be compound, i.e., consist of multiple fields. That way, the database will handle the uniqueness for you, and you don't need to worry about it.
In your case, the table definition could look something like the following:
CREATE TABLE `real_estate` (
  `id` int(11) NOT NULL,
  `property` varchar(255) NOT NULL,
  `property_value` varchar(255) NOT NULL,
  PRIMARY KEY (`id`, `property`, `property_value`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
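For illustration (a hedged example), a duplicate combination would then be rejected while a new one is accepted:
INSERT INTO real_estate (id, property, property_value) VALUES (2, 'Address', 'Miami');
-- fails with something like: ERROR 1062 (23000): Duplicate entry '2-Address-Miami' for key 'PRIMARY'
INSERT INTO real_estate (id, property, property_value) VALUES (2, 'Price', '2billion');
-- succeeds: this combination did not exist before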

Related

Best Approach for maintaining 'users in country' count MySql

Given a series of complex websites that all use the same user-tracking MySQL database. (This is not our exact situation, but a simplification of it to keep this post as brief/efficient as possible.)
We don't always know where a user is when he starts using a site. In fact there are about 50 points in the code where the country field might get updated. We might collect it from the IP address on use. We might get it when he uses his credit card. We might get it when he fills out a form. Heck we might get it when we talk to him on the phone.
Assume a simple structure like:
CREATE TABLE `Users` (
`ID` INT NOT NULL AUTO_INCREMENT ,
`Country` VARCHAR(45) NULL ,
PRIMARY KEY (`ID`) );
What I'm wondering is: what is the best way to keep track of one more scrap of information on this person:
`Number_of_Users_in_My_Country`.
I know I could run a simple query to get it with each record. But I constantly need two other bits of information (keep in mind that I'm not really dealing with countries but other groups that number in the 100,000s; "countries" is just to keep this post simple):
User count by Country and
Selection of countries with less than x users.
I'm wondering if I should create a trigger that updates the Number_of_Users_in_My_Country field when the country value changes?
As I'm new to MySQL, I would love to hear thoughts on this or any other approach.
Lots of people will tell you not to do that, because it's not normalized. However, if it's trivial to keep an aggregate value (to save complex joins in certain queries), I'd say go for it. Keep in mind with your triggers that you can't update the same table the trigger is defined on, so be careful in defining how certain events propagate updates to other tables, lest you get into a loop.
An additional recommendation: I would keep a table for countries, and use a foreign key reference from Users to Countries. Then in countries, have a column for total users in that country. Users_in_my_country seems to have a very specific use, and it would be easier to maintain from the countries' perspective.
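A sketch of that structure, with a trigger maintaining the count (all names below are assumptions, not from the question):
CREATE TABLE Countries (
  ID INT NOT NULL AUTO_INCREMENT,
  Name VARCHAR(45) NOT NULL,
  Total_Users INT NOT NULL DEFAULT 0,
  PRIMARY KEY (ID)
) ENGINE=InnoDB;
ALTER TABLE Users
  ADD COLUMN Country_ID INT NULL,
  ADD CONSTRAINT fk_users_country FOREIGN KEY (Country_ID) REFERENCES Countries (ID);
-- The trigger writes to Countries, not Users, so it does not run into the
-- "can't modify the table the trigger is defined on" restriction.
-- Similar triggers would be needed for INSERT and DELETE on Users.
DELIMITER //
CREATE TRIGGER users_country_count AFTER UPDATE ON Users
FOR EACH ROW
BEGIN
  IF NOT (OLD.Country_ID <=> NEW.Country_ID) THEN
    UPDATE Countries SET Total_Users = Total_Users - 1 WHERE ID = OLD.Country_ID;
    UPDATE Countries SET Total_Users = Total_Users + 1 WHERE ID = NEW.Country_ID;
  END IF;
END//
DELIMITER ;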
Given that you've simplified the question somewhat, it's hard to be totally precise.
In general, if at all possible, I prefer to calculate these derived values on the fly. And to find out if that's viable, I prefer to try it out; 100,000x records is not a particularly scary number, and I'd much prefer to spend time tuning a query/indexing scheme once than deal with maintenance craziness for the life of the application.
If you've tried that, and still can't get it to work, my next consideration would be to work with stale/cached data. It all depends on your business, but if it's okay for the "number of users in my country" value to be slightly out of date, then calculating these values and caching them in the application layer would be much better. Caching has lots of pre-existing libraries you can use, it's well understood by most developers, and with high traffic web sites, caching for even a few seconds can have a dramatic effect on your performance and scalability. Alternatively, have a script that populates a table "country_usercount" and run it every minute or so.
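For the cached-table route, a minimal sketch of the refresh script (table and column names are made up):
CREATE TABLE IF NOT EXISTS country_usercount (
  country VARCHAR(45) NOT NULL,
  user_count INT NOT NULL,
  PRIMARY KEY (country)
);
-- Run from cron every minute or so; rebuilds the cache from scratch.
DELETE FROM country_usercount;
INSERT INTO country_usercount (country, user_count)
SELECT Country, COUNT(*) FROM Users WHERE Country IS NOT NULL GROUP BY Country;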
If the data must, absolutely, be fresh, I'd include the logic to update the counts in the application layer code - it's a bit ugly, but it's easy to debug, and behaves predictably. So, every time the event fires that tells you which country the user is from, you update the country_usercount table from the application code.
The reason I dislike triggers is that they can lead to horrible, hard to replicate bugs and performance issues - if you have several of those aggregated pre-calculated fields, and you write a trigger for each, you could easily end up with lots of unexpected database activity.

Why do I need a UNIQUE constraint on a column if my application already validates the data before saving it?

UNIQUE is an index which makes your field, well, unique. But is it worth using it if you're already doing your validation in PHP prior to inserting new data? An extra INDEX isn't the end of the world but if you're after query optimization then UNIQUE just gets in the way, right?
Why wear a seat belt if you're a good driver and you can save two seconds of your total trip time?
One of the most important lessons for a programmer to learn is that he is human and he makes mistakes. Worse, everyone else working on this code is human, too.
Why does the UNIQUE constraint exist? To protect the database from humans making mistakes. Turning off your UNIQUE constraint says "You do not need to worry, Mr. Database, I will never give you data that doesn't match my intent."
What if something happens to your code such that your validation for uniqueness breaks? Now your code dumps duplicate records into the database. But if you had a UNIQUE constraint on that column, when your front-end code stopped working, you'd get your queries blowing up.
You're human. Accept it. Let the computer do its job and help protect you from yourself.
UNIQUE is not only for making sure data is valid. The primary purpose is to optimize queries: if the database knows the field is unique, it can stop searching for hits as soon as the first record is found. You can't pass that information to the database through well-crafted queries alone.
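For example (a sketch with made-up names), the same index serves both purposes:
-- Enforces validity on writes...
ALTER TABLE users ADD UNIQUE KEY uq_users_email (email);
-- ...and helps reads: the engine knows at most one row can match,
-- so it can stop searching after the first hit.
SELECT id, name FROM users WHERE email = 'someone@example.com';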
That is an interesting question.
Are you sure that there is no way for your code to be bypassed ?
Are you sure nothing else will ever access the data beside the PHP application ?
Are you sure the rest of your application won't fail in the case where a duplicate is inserted ?
What would be the implication of having duplicate entries, would that cause problem for future references or calculations ?
These are some of the questions that constraints at the database level help solve.
As for optimization, a constraint does not make the process of retrieving data noticeably slower, and it can in fact be used in the execution plan at some point, since it is backed by an index.
So no, it won't get in the way of optimization and it will also protect your data from inconsistencies.
As pst mentions, at this stage in your development, you are in no position to begin optimizing your database or the application in question.
It's generally not a bad thing to add additional sanity checks in your system. Yes, you're hurting performance just that tiny little bit, but in no way will any user ever notice an extra CPU tick or two.
Think about this: Today you do your validation in php, but do not assert uniqueness in the database. In the future, you, a colleague, or some other guy who has forked your project changes the original php validation, ruins it, or forgets it altogether. At this point, you'll probably wish you had that added check in your database.
tl;dr: Transactional Integrity (in the database) handles Race Conditions (in the application).
The Concurrency and integrity section of these Rails docs explains why this is necessary with an example scenario.
Databases with transactional integrity guarantee uniqueness through isolation, while applications actually take a few separate steps (get the value, check if there are other values, then save the value) outside of transactional isolation that leave them vulnerable to race conditions, especially at scale.

What is the best way (in Rails/AR) to ensure writes to a database table are performed synchronously, one after another, one at a time?

I have noticed that using something like delayed_job without a UNIQUE constraint on a table column would still create double entries in the DB. I had assumed delayed_job would run jobs one after another. The Rails app runs on Apache with Phusion Passenger. I am not sure if that is the reason why this would happen, but I would like to make sure that every item in the queue is persisted to AR/DB one after another, in sequence, and that there is never more than one write to this DB table happening at the same time. Is this possible? What would be some of the issues that I would have to deal with?
update
The race conditions arise because an AJAX API is used to send data to the application. The application receives a bunch of data; each batch of data is identified as belonging together by a Session ID (SID). In the end, the final state of the database has to reflect the latest, most up-to-date AJAX PUT query to the API. Sometimes queries arrive at the exact same time for the same SID -- so I need a way to make sure they don't all try to be persisted at the same time, but rather one after the other, or simply the last one sent by AJAX request to the API.
I hope that makes my particular use-case easier to understand...
You can lock a specific table (or tables) with the LOCK TABLES statement.
In general I would say that relying on this is poor design and will likely lead to scalability problems down the road, since you're creating a bottleneck in your application flow.
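For reference, the MySQL syntax looks roughly like this (a sketch; the table and columns shown are those of a stock delayed_job setup and may differ in your app):
LOCK TABLES delayed_jobs WRITE;  -- blocks reads and writes from other sessions
-- ... perform the write that must not run concurrently ...
INSERT INTO delayed_jobs (handler, run_at, created_at, updated_at)
VALUES ('--- serialized job ---', NOW(), NOW(), NOW());
UNLOCK TABLES;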
With your further explanations, I'd be tempted to add some extra columns to the table used by delayed_job, with a unique index on them. If (for example) you only ever wanted 1 job per user you'd add a user_id column and then do
something.delay(:user_id => user_id).some_method
You might need more attributes if the pattern is more sophisticated, e.g. there are lots of different types of jobs and you only wanted one per person, per type, but the principle is the same. You'd also want to be sure to rescue ActiveRecord::RecordNotUnique and deal with it gracefully.
For non delayed_job stuff, optimistic locking is often a good compromise between handling the concurrent cases well without slowing down the non concurrent cases.
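A sketch of what optimistic locking boils down to at the SQL level (Rails does this for you when the table has a lock_version column; the table name and values here are just examples):
-- Read the row, remember its lock_version, then:
UPDATE order_headers
SET status = 'shipped', lock_version = lock_version + 1
WHERE id = 42 AND lock_version = 3;  -- the version you read earlier
-- 0 affected rows means someone else updated the row first; the application
-- reloads and retries (Rails raises ActiveRecord::StaleObjectError).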
If you are worried about multiple processes writing to the 'same' rows - as in several users updating the same order_header row - I'd suggest you set a marker bound to current_user.id on the row once /order_headers/:id/edit is called, and remove it again once the current_user releases the row, either by updating or by canceling the edit.
Your use case (from your description) seems a bit different to me, so I'd suggest you leave it to the DB (in the case of a fairly recent - as in post-5.1 - MySQL, you'd add a trigger/function that does the actual update, and there you could implement logic similar to the above: some marker bound to a sequenced job id of sorts).

Database Upserts - Good or Bad Practice?

Looking for some insight as to whether an Upsert (insert or if exists, then update) procedure is considered bad practice in database programming. I work in SQL server if that bears any relevance.
At a place I worked some months ago, the resident DB guru stated in newly written DB coding standards (most of which I agreed with) that upserts should be avoided.
I can't really see a logical reason for this, and consider myself reasonably conscious of good programming practice. I think they are useful for straightforward data management and help avoid an excessive number of stored procedures.
Looking for some insight / discussion that will help me come to a conclusion on this.
Thanks.
Update In response to comments:
The specific context I refer to is the creation or update of a domain entity data representation in the database. Say for example a "Person" object exists as a representation of the "Person" table in the database. I simply need a mechanism for creating a new Person, or updating an existing one. Here I have the option of creating an Upsert stored procedure, or two separate stored procedures - one for Update, and one for Insert.
Any advantages or disadvantages in anyones view?
The primary problem is overwriting an existing record when the intention is to add a new record because whatever was selected as the key is duplicated. Say a login name for example. You see that login exists so you update when you should have kicked back an error that the login is a duplicate.
A second problem is with resurrecting a deleted record. Say process "A" queries the record, process "B" deletes it, and then process "A" submits a change. The record that was intended to be deleted is now back in the database rather than passing an exception back to "A" that it was deleted.
I like to program on purpose.
Either I'm creating something, in which case I would want the insert to fail (duplicate) if there already was an entity there. Or I'm updating something that I know is there, in which case I'd like the update to fail if the row is missing (which, in practice, doesn't happen - the update simply affects zero rows).
With upsert/merge this gets kind of fuzzy. Did I or did I not succeed? Did I partially succeed? Are some of the values in the row mine (from the insert) while some of them were there before?
Having said that, upserts are useful (which is why they were implemented to begin with) and banning them would be just silly. That's like banning roads because criminals use them to get away from the cops. There are an infinite number of cases where upserts are the only reasonable way of doing things. And anyone who has worked with synchronizing data between systems knows this.
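To make the "Person" example from the question update concrete, a hedged T-SQL sketch (column names and the parameters are assumptions, e.g. inside a stored procedure that receives @Id, @Name, @Email):
-- One statement that updates an existing Person or inserts a new one.
-- HOLDLOCK is commonly recommended so concurrent MERGEs don't race each other.
MERGE dbo.Person WITH (HOLDLOCK) AS target
USING (SELECT @Id AS Id, @Name AS Name, @Email AS Email) AS source
    ON target.Id = source.Id
WHEN MATCHED THEN
    UPDATE SET target.Name = source.Name, target.Email = source.Email
WHEN NOT MATCHED THEN
    INSERT (Id, Name, Email) VALUES (source.Id, source.Name, source.Email);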
It depends on what you are talking about. Data? Well, that is determined by your data manipulation processes, isn't it? If I need to insert OR update, then I need to do that. If it is about schema objects, something similar applies.
The objection in the example of two processes -- where a second process "resurrects" a deleted record by adding a new one with the same key -- is only valid in specific instances that result from poor design, or would happen regardless of whether an "upsert" procedure wrote the record with the identical key or two separate procedures wrote it.
Where an identical key must be avoided, an auto-incrementing identity key is used in the insert. Where an identical key does not need to be avoided, good database design has to be implemented to avoid creating "phantom joins". For example, in the telecommunications world telephone numbers are often reused and are a "unique" key. They cannot be the primary key because person #2 might "inherit" the phone number but should likely not "inherit" person #1's overdue unpaid bill or call history, etc. So a combination of an auto-incrementing primary key plus service dates or other unique identifiers would be used in any join logic to prevent bad data chaining.

Never delete entries? Good idea? Usual?

I am designing a system, and I don't think it's a good idea to give the end user the ability to delete entries in the database. I think that way because often the end user, once given admin rights, might end up making a mess in the database and then turn to me to fix it.
Of course, they will need to be able to remove entries, or at least think that they did, if they are set as admin.
So, I was thinking that all the entries in the database should have an "active" field. If they try to remove an entry, it will just set the flag to "false" or something similar. Then there will be some kind of super admin that would be my company's team who could change this field.
I already saw this at another company I worked for, but I was wondering if it was a good idea. I could just make regular database backups and roll back if they make an error, and adding this field would add some complexity to all the queries.
What do you think? Should I do it that way? Do you use this kind of trick in your applications?
In one of our databases, we distinguished between transactional and dictionary records.
In a couple of words, transactional records are things that you cannot roll back in real life, like a call from a customer. You can change the caller's name, status etc., but you cannot dismiss the call itself.
Dictionary records are things that you can change, like assigning a city to a customer.
Transactional records and things that lead to them were never deleted, while dictionary ones could be deleted all right.
By "things that lead to them" I mean that as soon as the record appears in the business rules which can lead to a transactional record, this record also becomes transactional.
Like, a city can be deleted from the database. But when a rule appeared that said "send an SMS to all customers in Moscow", the cities became transactional records as well, or we would not be able to answer the question "why did this SMS get sent".
A rule of thumb for distinguishing was this: is it only my company's business?
If one of my employees made a decision based on data from the database (like, he made a report based on which some management decision was made, and then the data the report was based on disappeared), it was considered OK to delete that data.
But if the decision affected some immediate actions with customers (like calling, messing with the customer's balance etc.), everything that led to those decisions was kept forever.
It may vary from one business model to another: sometimes, it may be required to record even internal data, sometimes it's OK to delete data that affects outside world.
But for our business model, the rule from above worked fine.
A couple of reasons people do things like this are auditing and automated rollback. If a row is completely deleted then there's no way to automatically roll back that deletion if it was in error. Also, keeping a row around and its previous state is important for auditing - a super user should be able to see who deleted what and when, as well as who changed what, etc.
Of course, that's all dependent on your current application's business logic. Some applications have no need for auditing and it may be proper to fully delete a row.
The downside to just setting a flag such as IsActive or DeletedDate is that all of your queries must take that flag into account when pulling data. This makes it more likely that another programmer will accidentally forget this flag when writing reports...
A slightly better alternative is to archive that record into a different database. This way it's been physically moved to a location that is not normally searched. You might add a couple fields to capture who deleted it and when; but the point is it won't be polluting your main database.
Further, you could provide an undo feature to bring it back fairly quickly; and do a permanent delete after 30 days or something like that.
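A minimal sketch of that archive-and-purge approach, assuming MySQL and made-up table names:
-- Ideally inside a transaction: copy the row to the archive schema with audit info...
INSERT INTO archive_db.orders (id, customer_id, total, deleted_at, deleted_by)
SELECT id, customer_id, total, NOW(), @current_user_id
FROM main_db.orders
WHERE id = @order_id;
-- ...then remove it from the main table.
DELETE FROM main_db.orders WHERE id = @order_id;
-- A scheduled job can do the permanent delete after 30 days:
DELETE FROM archive_db.orders WHERE deleted_at < NOW() - INTERVAL 30 DAY;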
UPDATE concerning views:
With views, the data still participates in your indexing scheme. If the amount of potentially deleted data is small, views may be just fine as they are simpler from a coding perspective.
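For example (names are made up):
-- Reports and application code select from the view, so nobody can forget the flag.
CREATE VIEW active_customers AS
SELECT * FROM customers WHERE deleted_date IS NULL;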
I prefer the method that you are describing. It's nice to be able to undo a mistake. More often than not, there is no easy way of going back on a DELETE query. I've never had a problem with this method, and unless you are filling your database with 'deleted' entries, there shouldn't be an issue.
I use a combination of techniques to work around this issue. For some things adding the extra "active" field makes sense. Then the user has the impression that an item was deleted because it no longer shows up on the application screen. The scenarios where I would implement this would include items that are required to keep a history... let's say invoices and payments. I wouldn't want such things being deleted for any reason.
However, there are some items in the database that are not so sensitive, let's say a list of categories that I want to be dynamic... I may then have users with admin privileges be allowed to add and delete a category, and the delete could be permanent. However, as part of the application logic I will check if the category is used anywhere before allowing the delete.
I suggest having a second database like DB_Archives where you add every row deleted from the main DB. The is_active field negates the very purpose of foreign key constraints, and YOU have to make sure that a row is not marked as deleted while it's still referenced elsewhere. This becomes overly complicated when your DB structure is massive.
There is an accepted practice that exists in many applications (Drupal's versioning system, et al.). Since MySQL scales very quickly and easily, you should be okay.
I've been working on a project lately where all the data was kept in the DB as well. The status of each individual row was kept in an integer field (data could be active, deleted, in_need_for_manual_correction, historic).
You should consider using views to access only the active/historic/... data in each table. That way your queries won't get more complicated.
Another thing that made things easy was the use of UPDATE/INSERT/DELETE triggers that handled all the flag changing inside the DB and thus kept the complex stuff out of the application (for the most part).
I should mention that the DB was an MSSQL 2005 server, but I guess the same approach should work with MySQL, too.
Yes and no.
It will complicate your application much more than you expect, since every table that does not allow deletion will need an extra check (IsDeleted = false) etc. It does not sound like much, but when you build a larger application and 9 of the 11 tables in a query require the non-deletion check, it gets tedious and error prone. (Well yeah, then there are deleted/non-deleted views... when you remember to create/use them.)
Some schema upgrades will become a PITA, since you'll have to relax FKs and invent "suitable" values for very, very old data.
I've not tried it, but I have thought a moderate amount about a solution where you'd zip the row data to XML and store that in some "Historical" table. Then, in case of "must have that restored now OMG the world is dying!1eleven", it's possible to dig it out.
I agree with all respondents that if you can afford to keep old data around forever it's a good idea; for performance and simplicity, I agree with the suggestion of moving "logically deleted" records to "old stuff" tables rather than adding "is_deleted" flags (moving to a totally different database seems a bit like overkill, but you can easily change to that more drastic approach later if eventually the amount of accumulated data turns out to be a problem for a single db with normal and "old stuff" tables).