I have a SQL Server table RealEstate with columns - Id, Property, Property_Value. This table has about 5-10 million rows and can grow even more in the future. I want to insert a row only if the combination of Id, Property, Property_Value does not already exist in this table.
Example Table -
1,Rooms,5
1,Bath,2
1,Address,New York
2,Rooms,2
2,Bath,1
2,Address,Miami
Inserting 2,Address,Miami should NOT be allowed. But, 2,Price,2billion is okay. I am curious to know which is the "best" way to do this and why. The why part is most important to me. The two ways of checking are -
At application level - The app should check if a row exists before it inserts a row.
At database level - Set unique constraints on all 3 columns and let the database do the checking instead of person/app.
Is there any scenario where one would be better than the other?
Thanks.
PS: I know there is a similar question already, but it does not answer my problem -
Unique constraint vs pre checking
Also, I think that UNIQUE is applicable to all databases, so I don't think I should remove the mysql and oracle tags.
I think in most cases the differences between the two are going to be small enough that the choice should mostly be driven by picking the implementation that ends up being most understandable to someone looking at the code for the first time.
However, I think exception handling has a few small advantages:
Exception handling avoids a potential race condition. The 'check, then insert' method might fail if another process inserts a record between your check and your insert. So, even if you're doing 'check then insert' you still want exception handling on the insert, and if you're already doing exception handling anyway then you might as well do away with the initial check.
If your code is not a stored procedure and has to interact with the database via the network (i.e. the application and the db are not on the same box), then you want to avoid having two separate network calls (one for the check and the other for the insert), and doing it via exception handling provides a straightforward way of handling the whole thing with a single network call. Now, there are tons of ways to do the 'check then insert' method while still avoiding the second network call, but simply catching the exception is likely to be the simplest way to go about it (a rough sketch follows).
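For illustration, a minimal T-SQL sketch of that single-call approach, assuming SQL Server 2012 or later and a unique constraint across the three columns (the constraint name here is invented for the example):
ALTER TABLE RealEstate
    ADD CONSTRAINT UQ_RealEstate_Combination UNIQUE (Id, Property, Property_Value);

BEGIN TRY
    INSERT INTO RealEstate (Id, Property, Property_Value)
    VALUES (2, 'Price', '2billion');
END TRY
BEGIN CATCH
    -- 2627 = unique constraint violation, 2601 = unique index violation
    IF ERROR_NUMBER() IN (2627, 2601)
        PRINT 'Combination already exists - nothing to do';  -- ignore or log it
    ELSE
        THROW;  -- re-raise anything that is not a duplicate-key error
END CATCH;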
On the other hand, exception handling requires a unique constraint (which is really a unique index), which comes with a performance tradeoff:
Creating a unique constraint will be slow on very large tables and it will cause a performance hit on every single insert to that table. On truly large databases you also have to budget for the extra disk space consumed by the unique index used to enforce the constraint.
On the other hand, it might make selecting from the table faster if your queries can take advantage of that index.
I'd also note that if you're in a situation where what you actually want to do is 'update else insert' (i.e. if a record with the unique value already exists then you want to update that record, else you insert a new record), then what you want to use is your particular database's UPSERT method, if it has one. For SQL Server and Oracle, this would be a MERGE statement.
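As a rough sketch in SQL Server syntax, treating Id plus Property as the natural key for the update-else-insert case (names follow the question, the values are just an example):
MERGE RealEstate AS target
USING (SELECT 2 AS Id, 'Price' AS Property, '2billion' AS Property_Value) AS source
    ON target.Id = source.Id AND target.Property = source.Property
WHEN MATCHED THEN
    UPDATE SET target.Property_Value = source.Property_Value
WHEN NOT MATCHED THEN
    INSERT (Id, Property, Property_Value)
    VALUES (source.Id, source.Property, source.Property_Value);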
Provided the cost of #1 (doing a lookup) is reasonable, I would do both. At least in Oracle, which is the database I have the most experience with.
Rationale:
Unique/primary keys should be a core part of your data model design; I can't see any reason not to implement them. If you have so much data that performance suffers from maintaining the unique index:
that's a lot of data
partition it or archive it away from your OLTP work
The more constraints you have, the safer your data is against application logic errors.
If you check that a row exists first, you can easily extract other information from that row to use as part of an error message, or otherwise fork the application logic to cope with the duplication (see the sketch after this list).
In Oracle, rolling back DML statements is relatively expensive because Oracle expects to succeed (i.e. COMMIT changes that have been written) by default.
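To make the "check first, but keep the constraint as the safety net" idea concrete, a small sketch against the question's table (plain SQL; the values are just an example):
SELECT Property_Value
  FROM RealEstate
 WHERE Id = 2
   AND Property = 'Address';

-- If a row comes back, the application can show its value in a friendly error message
-- or branch its logic; if not, it issues the INSERT and still traps the unique-constraint
-- violation (ORA-00001 in Oracle) in case another session wins the race.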
This does not answer the question directly, but I thought it might be helpful to post it here since it's better than Wikipedia and the link might just become dead someday.
Link - http://www.celticwolf.com/blog/2010/04/27/what-is-a-race-condition/
Wikipedia has a good description of a race condition, but it’s hard to follow if you don’t understand the basics of programming. I’m going to try to explain it in less technical terms, using the example of generating an identifier as described above. I’ll also use analogies to human activities to try to convey the ideas.
A race condition is when two or more programs (or independent parts of a single program) all try to acquire some resource at the same time, resulting in an incorrect answer or conflict. This resource can be information, like the next available appointment time, or it can be exclusive access to something, like a spreadsheet. If you’ve ever used Microsoft Excel to edit a document on a shared drive, you’ve probably had the experience of being told by Excel that someone else was already editing the spreadsheet. This error message is Excel’s way of handling the potential race condition gracefully and preventing errors.
A common task for programs is to identify the next available value of some sort and then assign it. This technique is used for invoice numbers, student IDs, etc. It’s an old problem that has been solved before. One of the most common solutions is to allow the database that is storing the data to generate the number. There are other solutions, and they all have their strengths and weaknesses.
Unfortunately, programmers who are ignorant of this area or simply bad at programming frequently try to roll their own. The smart ones discover quickly that it’s a much more complex problem than it seems and look for existing solutions. The bad ones never see the problem or, once they do, insist on making their unworkable solution ever more complex without fixing the error. Let’s take the example of a student ID. The neophyte programmer says “to know what the next student number should be, we’ll just get the last student number and increment it.” Here’s what happens under the hood:
Betty, an admin. assistant in the admissions office, fires up the student management program. Note that this is really just a copy of the program that runs on her PC. It talks to the database server over the school’s network, but has no way to talk to other copies of the program running on other PCs.
Betty creates a new student record for Bob Smith, entering all of the information.
While Betty is doing her data entry, George, another admin. assistant, fires up the student management program on his PC and begins creating a record for Gina Verde.
George is a faster typist, so he finishes at the same time as Betty. They both hit the “Save” button at the same time.
Betty’s program connects to the database server and gets the highest student number in use, 5012.
George’s program, at the same time, gets the same answer to the same question.
Both programs decide that the new student ID for the record that they’re saving should be 5013. They add that information to the record and then save it in the database.
Now Bob Smith (Betty’s student) and Gina Verde (George’s student) have the same student ID.
This student ID will be attached to all sorts of other records, from grades to meal cards for the dining hall. Eventually this problem will come to light and someone will have to spend a lot of time assigning one of them a new ID and sorting out the mixed-up records.
When I describe this problem to people, the usual reaction is “But how often will that happen in practice? Never, right?”. Wrong. First, when data entry is being done by your staff, it’s generally done during a relatively small period of time by everyone. This increases the chances of an overlap. If the application in question is a web application open to the general public, the chances of two people hitting the “Save” button at the same time are even higher. I saw this in a production system recently. It was a web application in public beta. The usage rate was quite low, with only a few people signing up every day. Nevertheless, six pairs of people managed to get identical IDs over the space of a few months. In case you’re wondering, no, neither I nor anyone from my team wrote that code. We were quite surprised, however, at how many times that problem occurred. In hindsight, we shouldn’t have been. It’s really a simple application of Murphy’s Law.
How can this problem be avoided? The easiest way is to use an existing solution to the problem that has been well tested. All of the major databases (MS SQL Server, Oracle, MySQL, PostgreSQL, etc.) have a way to increment numbers without creating duplicates. MS SQL Server calls it an “identity” column, while MySQL calls it an “AUTO_INCREMENT” column, but the function is the same. Whenever you insert a new record, a new identifier is automatically created and is guaranteed to be unique. This would change the above scenario as follows:
Betty, an admin. assistant in the admissions office, fires up the student management program. Note that this is really just a copy of the program that runs on her PC. It talks to the database server over the school’s network, but has no way to talk to other copies of the program running on other PCs.
Betty creates a new student record for Bob Smith, entering all of the information.
While Betty is doing her data entry, George, another admin. assistant, fires up the student management program on his PC and begins creating a record for Gina Verde.
George is a faster typist, so he finishes at the same time as Betty. They both hit the “Save” button at the same time.
Betty’s program connects to the database server and hands it the record to be saved.
George’s program, at the same time, hands over the other record to be saved.
The database server puts both records into a queue and saves them one at a time, assigning the next available number to them.
Now Bob Smith (Betty’s student) gets ID 5013 and Gina Verde (George’s student) gets ID 5014.
With this solution, there is no problem with duplication. The code that does this for each database server has been tested repeatedly over the years, both by the manufacturer and by users. Millions of applications around the world rely on it and continue to stress test it every day. Can anyone say the same about their homegrown solution?
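For completeness, a minimal sketch of "let the database generate the number" (MySQL syntax; the students table and its columns are invented for the example, and SQL Server would use an IDENTITY column instead):
CREATE TABLE students (
    student_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    full_name  VARCHAR(255) NOT NULL
);

INSERT INTO students (full_name) VALUES ('Bob Smith');
INSERT INTO students (full_name) VALUES ('Gina Verde');
-- Each insert receives its own student_id (e.g. 5013 and 5014), even when the two
-- statements arrive from different connections at the same moment.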
There is at least one well tested way to create identifiers in the software rather than in the database: UUIDs (Universally Unique Identifiers). However, a UUID takes the form of xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx where “x” stands for a hexadecimal digit (0-9 and a-f). Do you want to use that for an invoice number, student ID or some other identifier seen by the public? Probably not.
To summarize, a race condition occurs when two programs, or two independent parts of a program, attempt to access some information or access a resource at the same time, resulting in an error, be it an incorrect calculation, a duplicated identifier or conflicting access to a resource. There are many more types of race conditions than I’ve presented here and they affect many other areas of software and hardware.
The description of your problem is exactly why primary keys can be compound, i.e. they can consist of multiple fields. That way, the database will handle the uniqueness for you, and you don't need to care about it.
In your case, the table definition could be something like the following:
CREATE TABLE `real_estate` (
  `id` int(11) NOT NULL,
  `property` varchar(255) NOT NULL,
  `property_value` varchar(255) NOT NULL,
  PRIMARY KEY (`id`, `property`, `property_value`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
I have a web form that is used to create and update personal information. On save, I collect all the info in a large multidimensional JSON array. When updating the database, the information will potentially consist of three parts: new rows to be created, rows that need to be updated and rows that need to be deleted. These rows will also be spread across about 5 tables.
My question is this: how should I approach the MySQL queries? My initial thought was to DELETE all the information from all the tables, and do a clean INSERT of all the new information in one go. I guess the other approach would be to do 3 queries: UPDATE all those with an existing ID; DELETE all those marked for deletion and INSERT all the newly created data (data without existing IDs).
Which of these approaches would be best, or is there a better way of doing this? Thanks for any advice. I appreciate it.
Delete-all-and-insert-all should NEVER be practiced.
Reasons:
Too costly. Mostly the user performs edits, so for what were just a few updates you end up doing one delete and a hundred inserts.
It plays havoc with ON DELETE CASCADE foreign keys.
It upsets auto-increment fields even when the rows were apparently not touched.
You need to implement a unit of work. I don't know which language you are working with, but some languages have built-in support for it; in .NET we have DataSets.
Basics:
Keep track of each record you fetched from the database. Secretly maintain a flag for each record to note which were loaded from the DB (i.e. untouched), which have modifications (and need an update query) and which are newly added. For the deleted records, maintain a separate list (maybe of their IDs). How to achieve this is a matter for a separate discussion.
When the user clicks Save, start a database transaction. This is not strictly part of the current discussion, but it is almost always done in situations like this.
In the transaction, first loop through the deleted-items array and fire a delete query for each of them.
Then loop through the modified-items array. For each modified item you may simply update all of its columns to the latest values. If the number of columns is very large (>30) then things change a bit.
Then come the newly created items; fire one insert for each of them.
Finally, commit the transaction.
If the language you are programming in supports try/catch blocks, then perform all of the above steps (after beginning the transaction) inside a try/catch; in the catch block, roll back the transaction. A sketch of the resulting SQL is shown below.
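A rough sketch of what the transaction ends up looking like (MySQL syntax; the tables, columns and ids are invented for the example, and in real code the statements are parameterised and the ROLLBACK happens in the catch block):
START TRANSACTION;

-- 1. rows collected in the deleted-items list
DELETE FROM phone_numbers WHERE id IN (17, 23);

-- 2. rows flagged as modified
UPDATE people SET first_name = 'Robert', last_name = 'Smith' WHERE id = 5;

-- 3. newly created rows (no existing id yet)
INSERT INTO people (first_name, last_name) VALUES ('Gina', 'Verde');

COMMIT;  -- or ROLLBACK if any statement failed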
This approach looks more complicated and seems to fire more queries than the simple delete-all/insert-all approach, but trust me: we have been there, done that, and then spent sleepless nights undoing all that was done. Never go the delete/insert way unless you can really justify it.
How to do the change tracking depends a lot on the language and the type of application you are writing. Even for .NET the approach differs between desktop applications and web applications. Tracking deletions is easy, as is tracking new insertions. The update flags are applied by trapping the edit event on any of the columns of that record.
EDIT
The data spans about five tables, hence the three loops (delete/update/insert) have to be done five times, once for each table. First draw the relationships among the tables. Process the top-level table first, then the tables which are directly connected to the top-level tables, and so on. If you have a cyclic relationship among the tables then you have to be especially careful.
The code behind the Save operation is about to grow quite long: 5x3=15 operations, each with its own SQL. None of these operations is expected to be reusable, so putting them in separate methods is futile; everything is about to go into one large procedural block. Therefore comment the code religiously and mark the table boundaries and the operations.
You probably don't want to do any deletes. Just mark the obsolete entries as "inactive", or maybe timestamp them as having an ending validity.
Using this philosophy, all edits are actually insertions: no modifications (except to change the "expire" field) and no deletes. To update a name, mark the record as expired and insert a new record with a beginning-of-validity timestamp at the same time.
In such a database, auditing and data recovery are easily performed.
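A sketch of that pattern (the people table and its validity columns are invented for the example):
-- expire the current version of the record
UPDATE people
   SET valid_to = NOW()
 WHERE person_id = 42
   AND valid_to IS NULL;

-- insert the new version with a fresh beginning-of-validity timestamp
INSERT INTO people (person_id, name, valid_from, valid_to)
VALUES (42, 'New Name', NOW(), NULL);

-- "current" rows are simply those WHERE valid_to IS NULL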
I have noticed that using something like delayed_job without a UNIQUE constraint on a table column would still create double entries in the DB. I had assumed delayed_job would run jobs one after another. The Rails app runs on Apache with Phusion Passenger. I am not sure if that is the reason why this happens, but I would like to make sure that every item in the queue is persisted to AR/DB one after another, in sequence, and that more than one write to this DB table never happens at the same time. Is this possible? What would be some of the issues that I would have to deal with?
update
The race conditions arise because an AJAX API is used to send data to the application. The application receives batches of data; each batch is identified as belonging together by a Session ID (SID). In the end, the final state of the database has to reflect the latest, most up-to-date AJAX PUT query to the API. Sometimes queries for the same SID arrive at exactly the same time, so I need a way to make sure they are not all persisted at once, but one after the other, or simply the last one sent by AJAX request to the API.
I hope that makes my particular use-case easier to understand...
You can lock a specific table (or tables) with the LOCK TABLES statement.
In general I would say that relying on this is poor design and will likely lead to scalability problems down the road, since you're creating a bottleneck in your application flow.
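For reference, the statement mentioned above looks like this in MySQL (the table name is just an example):
LOCK TABLES delayed_jobs WRITE;
-- ... perform the work that must not run concurrently ...
UNLOCK TABLES;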
With your further explanations, I'd be tempted to add some extra columns to the table used by delayed_job, with a unique index on them. If (for example) you only ever wanted 1 job per user you'd add a user_id column and then do
something.delay(:user_id => user_id).some_method
You might need more attributes if the pattern is more sophisticated, e.g. there are lots of different types of jobs and you only wanted one per person, per type, but the principle is the same. You'd also want to be sure to rescue ActiveRecord::RecordNotUnique and deal with it gracefully.
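A sketch of the corresponding schema change (MySQL syntax; the user_id column follows the example above and the index name is invented):
ALTER TABLE delayed_jobs ADD COLUMN user_id INT NULL;
CREATE UNIQUE INDEX index_delayed_jobs_on_user_id ON delayed_jobs (user_id);
-- enqueuing a second job for the same user now fails with a duplicate-key error,
-- which surfaces in the application as ActiveRecord::RecordNotUnique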
For non delayed_job stuff, optimistic locking is often a good compromise between handling the concurrent cases well without slowing down the non concurrent cases.
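In plain SQL, optimistic locking boils down to something like the following sketch (Rails handles this for you when the table has a lock_version column; the orders table here is just an example):
UPDATE orders
   SET status = 'shipped',
       lock_version = lock_version + 1
 WHERE id = 42
   AND lock_version = 3;  -- the version number the application read earlier
-- zero affected rows means someone else updated the row first; the application
-- reloads and retries (Rails raises ActiveRecord::StaleObjectError in that case)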
If you are worried about multiple processes writing to the 'same' rows - as in multiple users updating the same order_header row - I'd suggest you set some marker bound to the current_user.id on the row once /order_headers/:id/edit is called, and remove it again once the current_user releases the row, either by updating or by cancelling the edit.
Your use-case (from your description) seems a bit different to me, so I'd suggest you leave it to the DB (in the case of a fairly recent - as in post-5.1 - MySQL, you'd add a trigger/function which would do the actual update, and there you could implement logic similar to the above suggestion: some marker bound to the sequenced job id of sorts).
I have many tables where data needs to be "marked for deletion" but not deleted, or toggle between published and hidden data.
The most intuitive way to handle these cases is to add a column in the database, deleted int(1) or public int(1). This raises the concern of never forgetting to specify WHERE deleted=0 each and every time that table is accessed.
I considered overcoming this by creating duplicate tables for deleted/unpublished data, such as article => article_deleted, and moving the data instead of deleting it. This presents two issues:
Foreign key constraints end up being extremely annoying to maintain
Number of tables with hidden content doubles (in my case ~20 becomes ~40 tables)
My last idea is to create a duplicate of the entire database called unreleased and migrate data there.
My question isn't about the safety of the data management, but rather: what is the right way of doing this from the beginning?
I have run into this exact issue before and I think it is a bad idea to create an unnecessarily cumbersome DB because you are afraid of bad code.
I think it would be a better idea to do thorough testing on your Test server before you release to production. Even I was tripped up by the "Deleted" column a few times when I first encountered it but I eventually caught on, and if you have a proper Dev/Test/Production environment you should be fine.
In summary, keep the delete column and demand more from your coders.
UPDATE:
Alternatively you could create a view that only pulls the records that aren't deleted and make sure everyone uses that for select queries.
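For example (MySQL syntax, using the article table and deleted column from the question; the view name is invented):
CREATE VIEW article_active AS
    SELECT * FROM article WHERE deleted = 0;

-- SELECTs go against article_active; inserts, updates and any genuine deletes
-- still target the article table directly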
I think your initial approach is "correct" and "right", but your concern about it being slightly error-prone is a valid one.
You'll probably just have to make sure that your test procedures are rigorous enough to catch errors.
The first approach is the best I've come up with. I call the column active instead of deleted. The record exists, but it can be either active or inactive. Then, if you really do need to delete things, the terminology doesn't get screwy.
Saying "Delete the inactive records" makes sense but saying "Delete the deleted records" just gets confusing.
I've never used triggers before, but this seems like a solid use case. I'd like to know if triggers are what I should be using, and if so, I could use a little hand-holding on how to go about it.
Essentially I have two heavily denormalized tables, goals and users_goals. Both have title columns (VARCHAR) that duplicate the title data. Thus, there will be one main goal of "Learn how to use triggers", and many (well, maybe not many in this case) users' goals with the same title. The architecture of the site demands that this be the case.
I haven't had a need to have a relationship between these two tables just yet. I link from individual users' goals to the main goals, but simply do so with a query by title (with an INDEX on the title column). Now I need to have a third table that relates these two tables, but it only needs to be eventually consistent. There would be two columns, both FOREIGN KEYs: goal_id and users_goal_id.
Are triggers the way to go with this? And if so, what would that look like?
Yes, you could do this using triggers, but the exact implementation depends on your requirements.
If you want to rebuild all your queries so they join on goal_id instead of the title, you can just build that. If you need to keep the titles in sync as well, that's an extra.
First for the join. You stated that one goal has many user goals. Does that mean that each user goal belongs to only one goal? If so, you don't need the extra table. You can just add a column goal_id to your user_goals table. Make sure there is a foreign key constraint (I hope you're using InnoDB tables), so you can enforce referential integrity.
Then the triggers. I'm not exactly sure how to write them in MySQL; I use triggers a lot on Oracle, but only seldom on MySQL. Anyway, I'd suggest you build three triggers (a sketch of the first one follows this list):
Update trigger on the goals table. This trigger should update the related user_goals rows when the title is modified.
Update trigger on the user_goals table. If user_goals.title is modified, this trigger should check if the title in the goals table differs from the new title in user_goals. If so, you have two options:
Exception: Don't allow the title to be modified in the user_goals child table.
Update: Allow the title to be changed. Update the parent record in goals. The trigger on goals will update the other related user_goals for you.
You could also silently ignore the change by changing the value back in the trigger, but that wouldn't be a good idea.
Insert trigger on user_goals. The easiest option is to query the title of the specified goal_id and not allow inserting a different value for title. You could opt to update goals if a title is given.
Insert trigger on goals. No need for this one.
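A sketch of the first trigger in MySQL syntax (it assumes goals has an id primary key, that the users_goals table from the question has gained the goal_id column suggested above, and the trigger name is invented):
DELIMITER //
CREATE TRIGGER goals_title_sync
AFTER UPDATE ON goals
FOR EACH ROW
BEGIN
    -- only touch the child rows when the title actually changed
    IF NOT (NEW.title <=> OLD.title) THEN
        UPDATE users_goals
           SET title = NEW.title
         WHERE goal_id = NEW.id;
    END IF;
END//
DELIMITER ;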
No, you should never use triggers at all if you can avoid it.
Triggers are an anti-pattern to me; they have the effect of "doing stuff behind the programmer's back".
Imagine a future maintainer of your application needs to do something: if they are not aware of the trigger (imagine they haven't checked your database schema creation scripts in detail), they could spend a long, long time trying to work out why this happens.
If you need to have several pieces of client-side code updating the tables, consider making them use a stored procedure; document this in the code maintenance manual (and comments etc) to ensure that future developers do the same.
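For instance, a sketch of what such a stored procedure could look like in MySQL (the procedure name, its parameters, and the goals.id / users_goals.goal_id columns are assumptions for the example):
DELIMITER //
CREATE PROCEDURE rename_goal(IN p_goal_id INT, IN p_title VARCHAR(255))
BEGIN
    -- update the shared title in both tables in one place
    UPDATE goals SET title = p_title WHERE id = p_goal_id;
    UPDATE users_goals SET title = p_title WHERE goal_id = p_goal_id;
END//
DELIMITER ;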
If you can get away with it, just write a common routine on the client side which is always called to update the shared column(s).
Even triggers do nothing to ensure that the columns are always in sync, so you will need to implement a periodic process which checks this anyway. They will otherwise go out of sync sooner or later (maybe just because some operations engineer decides to start doing manual updates; maybe one table gets restored from a backup and the other doesn't).