How to deal with duplicates in database?

In a program, should we use try catch to check insertion of duplicate values into tables, or should we check if the value is already present in the table and avoid insertion?

This is easy enough to enforce with a UNIQUE constraint on the database side so that's my recommendation. I try to put as much of the data integrity into the database so that I can avoid having bad data (although sometimes unavoidable).
If you already have it set up this way, you might as well just catch the MySQL exception for a duplicate-value insertion on such a table: doing the check and then the insertion is more costly than having the database do one simple lookup (and possibly an insert).

Depends upon whether you are inserting one, or a million, as well as whether the duplicate is the primary key.
If it's the primary key, read: http://database-programmer.blogspot.com/2009/06/approaches-to-upsert.html
An UPSERT or ON DUPLICATE KEY... The idea behind an UPSERT is simple.
The client issues an INSERT command. If a row already exists with the
given primary key, then instead of throwing a key violation error, it
takes the non-key values and updates the row.
This is one of those strange (and very unusual) cases where MySQL
actually supports something you will not find in all of the other more
mature databases. So if you are using MySQL, you do not need to do
anything special to make an UPSERT. You just add the term "ON
DUPLICATE KEY UPDATE" to the INSERT statement:
If it's not the primary key, and you are inserting just one row, then you can still make sure this doesn't cause a failure.
For your actual question, I don't really like the idea of using try/catch for program flow, but really, you have to evaluate readability and user experience (in this case, performance), and pick what you think is the best mix of the two.
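A minimal sketch of the upsert idea described above, using Python's built-in sqlite3 module as a stand-in for MySQL so it runs without a server. It assumes SQLite 3.24 or later, which spells the clause ON CONFLICT ... DO UPDATE; in MySQL the corresponding clause is ON DUPLICATE KEY UPDATE. The stock table and the upsert_stock helper are hypothetical names made up for illustration.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; table and helper names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock (sku TEXT PRIMARY KEY, qty INTEGER NOT NULL)")

def upsert_stock(conn, sku, qty):
    """Insert a new row, or update qty when the primary key already exists."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO stock (sku, qty) VALUES (?, ?) "
            "ON CONFLICT(sku) DO UPDATE SET qty = excluded.qty",
            (sku, qty),
        )

upsert_stock(conn, "A-1", 5)
upsert_stock(conn, "A-1", 8)  # same key: the row is updated instead of raising an error
rows = conn.execute("SELECT sku, qty FROM stock").fetchall()
```

After both calls the table holds a single row for the key, carrying the second quantity.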

You can add a UNIQUE constraint to your table, something like:
CREATE TABLE IF NOT EXISTS login
(
loginid SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
loginname CHAR(20) NOT NULL,
UNIQUE (loginname)
);
This will ensure no two login names are the same.

You can create a unique composite key:
ALTER TABLE `TableName` ADD UNIQUE KEY (KeyOne, KeyTwo, ...);
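To illustrate what a composite unique key does, here is a small sketch using Python's built-in sqlite3 module in place of MySQL. The enrollment table, its columns, and the enroll helper are hypothetical. Each column may repeat on its own; only a repeated pair is rejected.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; all names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE enrollment ("
    " student_id INTEGER NOT NULL,"
    " course_id INTEGER NOT NULL,"
    " UNIQUE (student_id, course_id))"  # uniqueness applies to the pair
)

def enroll(conn, student_id, course_id):
    """Return True if the pair was inserted, False if it already existed."""
    try:
        with conn:
            conn.execute("INSERT INTO enrollment VALUES (?, ?)", (student_id, course_id))
        return True
    except sqlite3.IntegrityError:
        return False

results = [
    enroll(conn, 1, 100),
    enroll(conn, 1, 200),  # same student, different course: allowed
    enroll(conn, 2, 100),  # same course, different student: allowed
    enroll(conn, 1, 100),  # same pair again: rejected
]
```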

You just need to create a unique key in your table so that it will not permit adding the same value again.

You should try inserting the value and catch the exception. In a busy system, if you check for the existence of a value, it might get inserted between the time you check and the time you insert it.
Let the database do its job: let the database check for the duplicate entry.
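A minimal sketch of the insert-and-catch approach, using Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server. The login table follows the one shown in an earlier answer; add_login is a hypothetical helper. With a MySQL driver the same pattern would catch that driver's integrity error (MySQL reports error 1062 for a duplicate entry).

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; add_login is a hypothetical helper.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE login ("
    " loginid INTEGER PRIMARY KEY AUTOINCREMENT,"
    " loginname TEXT NOT NULL UNIQUE)"
)

def add_login(conn, name):
    """Attempt the insert; return False if the UNIQUE constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO login (loginname) VALUES (?)", (name,))
        return True
    except sqlite3.IntegrityError:
        return False

first = add_login(conn, "alice")
second = add_login(conn, "alice")  # duplicate: rejected by the database itself
total = conn.execute("SELECT COUNT(*) FROM login").fetchone()[0]
```

There is no window between a check and the insert here, because the database performs both as one operation.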

A database is a computerized representation of a set of business rules, and a DBMS is used to enforce these business rules as constraints. Neither can verify that a proposition in the database is true in the real world. For example, if the model in question is the employees of an enterprise and the Employees table contains two people named 'Jimmy Barnes', neither the DBMS nor the database can know whether one is a duplicate, whether either is a real person, etc. A trusted source is required to determine existence and identity. In the above example, the enterprise's personnel department is responsible for checking public records, perusing references, ensuring the person is not already on the payroll, etc., and then allocating a unique employee reference number that can be used as a key. This is why we look for industry-standard identifiers with a trusted source: ISBN for books, VIN for cars, ISO 4217 for currencies, ISO 3166 for countries, etc.

I think it is better to check if the value already exists and avoid the insertion. The check for duplicate values can be done in the procedure that saves the data (using exists if your database is an SQL database).
If a duplicate exists you avoid the insertion and can return a value to your app indicating so and then show a message accordingly.
For example, a piece of SQL code could be something like this:
select @ret_val = 0
if exists (select * from employee where last_name = @param_ln and first_name = @param_fn)
    select @ret_val = -1
else
    -- your insert statement here
select @ret_val
Your condition for duplicate values will depend on what you define as a duplicate record. In your application you would use the return value to know if the data was a duplicate. Good luck!

Related

MySQL Unique Constraint based on column value

Let's say I have a table like this:
CREATE TABLE dept (
id VARCHAR(255) NOT NULL PRIMARY KEY,
code VARCHAR(255) NOT NULL,
active BIT NOT NULL,
...
);
Problem:
I want to add a unique constraint on code column. But it should be applied only if active is set to true (uniqueness should be checked only among active records). There can be many records with active = false and the same code so I can't use constraint on multiple columns.
What I tried:
I haven't found any references in the documentation proving that such constraint is possible, but I know it is possible in other databases using unique function-based indexes.
Of course I can write a trigger that will check the invariant on every add/update operation, but I hope there is a more efficient solution.
I'm using MySQL 5.7.15.

This simply isn't possible in MySQL, I'm afraid.
I have come "close" to solving this in the past by having a uniquely constrained column which is nullable (replacing both the active and code fields). When NULL - it's "inactive", when anything other than NULL - it has to be unique.
But that doesn't precisely solve the problem you're asking. (Perhaps something better can be suggested if you could update your question to include the bigger picture?)
Otherwise read/write to the table through a stored procedure or - as you've suggested yourself - do something inelegant with triggers.
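The nullable-column workaround above can be sketched as follows. It relies on the fact that a UNIQUE index admits any number of NULLs, which holds in MySQL and also in SQLite, used here through Python's built-in sqlite3 module so the example runs as-is. The active_code column name is hypothetical, and keeping a separate code column next to it is an assumption of this sketch rather than part of the answer.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; active_code is a hypothetical column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dept ("
    " id TEXT PRIMARY KEY,"
    " code TEXT NOT NULL,"
    " active_code TEXT UNIQUE)"  # NULL means inactive; any other value must be unique
)
with conn:
    conn.executemany(
        "INSERT INTO dept VALUES (?, ?, ?)",
        [
            ("d1", "HR", "HR"),  # the one active HR record
            ("d2", "HR", None),  # inactive duplicates are fine,
            ("d3", "HR", None),  # because NULLs never collide
        ],
    )

try:
    with conn:
        conn.execute("INSERT INTO dept VALUES ('d4', 'HR', 'HR')")  # second active HR
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

count = conn.execute("SELECT COUNT(*) FROM dept").fetchone()[0]
```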

To solve your problem you would need a CHECK clause, but MySQL doesn't enforce it. From the docs:
The CHECK clause is parsed but ignored by all storage engines. See Section 13.1.18, “CREATE TABLE Syntax”. The reason for accepting but ignoring syntax clauses is for compatibility, to make it easier to port code from other SQL servers, and to run applications that create tables with references.
So you can do this only by checking the data at the application level, or by inserting and updating rows in this table through stored procedures.

I'm sorry, this is not really a direct answer to your question, but:
Maybe you are better off with a different table design? The fact that something you want to do is not supported by your RDBMS is always strong evidence that you are using it wrong.
Have you thought about creating a dept and a dept_history table, with dept containing only the active records? That would solve your problem with the unique constraint.

DB design for M:N table with time interval

I would like to ask you a design question.
I am designing a table that makes me scratch my head. I'm not sure what the best approach is, and I feel like I am missing something.
There are two tables, A and B, and one M:N relationship table between them. The relationship table currently has these columns:
A.ID, B.ID, From, To
Business requirements:
At any time, the A:B relationship can only be 1:1
A:B can repeat in time as defined by the From and To datetime values, which specify an interval
example: Car/Driver.
Any car can have only 1 Driver at any time
Any Driver can drive only 1 car at any time (this is NOT topgear, ok? :) )
Driver can change the car after some time, and can return to the same car
Now, I am not sure:
- What PK should I go with? A,B is not enough, and adding From and To doesn't feel right. Maybe an auto-increment PK?
- Is there any way to enforce the business requirements by DB design?
- For business reasons, I would prefer it not to be in a historical table. Why? Well, let's assume the car is rented and I want to know, given a date, who had what car rented at that date. Splitting it into a historical table would require more joins :(
I feel like I am missing something, some kind of general pattern... or I don't know.
Thankful for any help, so thank you :)

I don't think you are actually missing anything. I think you've got a handle on what the problem is.
I've read a couple of articles about how to handle "temporal" data in a relational database.
Bottom line consensus is that the traditional relational model doesn't have any builtin mechanism for supporting temporal data.
There are several approaches, some better suited to particular requirements than others, but all of the approaches feel like they are "duct taped" on.
(I was going to say "bolted on", but I thought a tip of the hat to Red Green was in order: "... the handyman's secret weapon, duct tape", and "if the women don't find you handsome, they should at least find you handy.")
As far as a PRIMARY KEY or UNIQUE KEY for the table, you could use the combination of (a_id, b_id, from). That would give the row a unique identifier.
But, that doesn't do anything to prevent overlapping "time" ranges.
There is no declarative constraint for a MySQL table that prevents "overlapping" datetime ranges that are stored as "start","end" or "start","duration", etc. (At least, in the general case. If you had very well defined ranges, and triggers that rounded the from to an even four hour boundary, and a duration to exactly four hours, you could use a UNIQUE constraint.) In the more general case, for any ol' values of from and to, the UNIQUE constraint does not work for us.
A CHECK constraint is insufficient (since you would need to look at other rows), and even if it were possible, MySQL doesn't actually enforce check constraints.
The only way (I know of) to get the database to enforce such a constraint would be a TRIGGER that looks for the existence of another row for which the affected (inserted/updated) row would conflict.
You'd need both a BEFORE INSERT trigger and a BEFORE UPDATE trigger. The trigger would need to query the table, to check for the existence of a row that "overlaps" the new/modified row
SELECT 1
FROM mytable t
WHERE t.a_id = NEW.a_id
AND t.b_id = NEW.b_id
AND t.`from` <> OLD.`from`
AND t.`from` < NEW.`to`
AND NEW.`from` < t.`to`
The last two conditions are the overlap test: two ranges overlap when each one starts before the other ends (use <= if ranges that merely touch should also count as overlapping). Note that from and to are reserved words in MySQL, so as column names they have to be quoted with backticks.
The condition on OLD.`from` is only needed in the BEFORE UPDATE trigger, so we don't find (as a "match") the row being updated. The actual check there would really depend on the selection of the PRIMARY KEY (or UNIQUE KEY)(s).
With MySQL 5.5 and later, we can use the SIGNAL statement to return an error, if we find the new/updated row would violate the constraint. With previous versions of MySQL, we can "throw" an error by doing something that causes an actual error to occur, such as running a query against a table name that we know does not exist.
And finally, this type of functionality doesn't necessarily have to be implemented in a database trigger; this could be handled on the client side.
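As a rough illustration of the trigger approach, here is a sketch in SQLite (via Python's built-in sqlite3 module) rather than MySQL, so the syntax differs: SQLite uses RAISE(ABORT, ...) where MySQL 5.5+ would use SIGNAL. The table and column names (assignment, valid_from, valid_to) are hypothetical, and the sketch assumes the car/driver rule from the question, so it treats a row as conflicting when it shares either the car or the driver. Only the BEFORE INSERT trigger is shown; a matching BEFORE UPDATE trigger that excludes the row being updated would also be needed.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; all table/column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE assignment (
    car_id     INTEGER NOT NULL,
    driver_id  INTEGER NOT NULL,
    valid_from TEXT NOT NULL,   -- ISO dates compare correctly as text
    valid_to   TEXT NOT NULL,
    PRIMARY KEY (car_id, driver_id, valid_from)
);
CREATE TRIGGER assignment_no_overlap
BEFORE INSERT ON assignment
BEGIN
    -- Abort when an existing row for the same car or driver overlaps the new range.
    SELECT RAISE(ABORT, 'overlapping assignment')
    WHERE EXISTS (
        SELECT 1 FROM assignment t
        WHERE (t.car_id = NEW.car_id OR t.driver_id = NEW.driver_id)
          AND t.valid_from < NEW.valid_to
          AND NEW.valid_from < t.valid_to
    );
END;
""")

def assign(car, driver, start, end):
    """Return True if the row was stored, False if the trigger rejected it."""
    try:
        with conn:
            conn.execute("INSERT INTO assignment VALUES (?, ?, ?, ?)", (car, driver, start, end))
        return True
    except sqlite3.DatabaseError:
        return False

a = assign(1, 10, "2024-01-01", "2024-02-01")
b = assign(1, 11, "2024-01-15", "2024-03-01")  # car 1 is taken in that period
c = assign(2, 10, "2024-01-20", "2024-01-25")  # driver 10 is busy in that period
d = assign(1, 11, "2024-02-01", "2024-03-01")  # starts exactly when the first ends
```

The ranges are treated as half-open, so an assignment may begin at the instant the previous one ends.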

How about three tables: TCar, TDriver, TLog.
TCar
    pkCarID
    fkDriverID
    name
A unique index on fkDriverID ensures a driver is only ever in one car, turning the foreign key fkDriverID into a 1:1 relationship.
TDriver
    pkDriverID
    name
TLog
    pkLogID (surrogate pk)
    fkCarID
    fkDriverID
    from
    to
With two joins you will get any information you describe. If you just need to find car data by driverID or driver data by carID, you can do it with one join.

Thank you everyone for your input. So far I am thinking about this approach; I would be thankful for any criticism or pointing out of flaws:
Tables (pseudo-SQL code):
Car (ID pk auto_increment, name)
Driver (ID pk auto_increment, name)
Assignment (CarID unique, DriverID unique, from Datetime), composite PK (CarID, DriverID)
AssignmentHistory (CarID unique, DriverID unique, from Datetime, to Datetime) no pk
Of course, CarID is a FK to Car(ID) and DriverID is a FK to Driver(ID).
The next stage is two triggers (and boy oh boy, I hope this can be done in MySQL; it works on MSSQL, but I don't have a MySQL db handy right now to test):
!!! Warning, MSSQL for now
create trigger Assignment_Update on Assignment instead of update as
-- remove existing assignments that clash with the new car or driver
delete Assignment
from Assignment
join inserted
on ( inserted.CarID = Assignment.CarID
or inserted.DriverID = Assignment.DriverID)
and ( inserted.CarID <> Assignment.CarID or inserted.DriverID <> Assignment.DriverID)
insert into Assignment
select * from inserted;
go
-- archive every removed assignment together with its end time
create trigger Assignment_Insert on Assignment after delete as
insert into AssignmentHistory
select CarID, DriverID, [from], GETDATE() from deleted;
go
I tested it a bit and it seems that for each business case it does what I need it to do.

generate id number mysql

I want to generate an ID number for my user table.
The ID number has a unique index.
Here is my trigger:
USE `schema_epolling`;
DELIMITER $$
CREATE DEFINER=`root`@`localhost` TRIGGER `tbl_user_BINS` BEFORE INSERT ON `tbl_user`
FOR EACH ROW
BEGIN
    SET NEW.id_number = CONCAT(
        DATE_FORMAT(NOW(), '%y'),
        LPAD((SELECT auto_increment FROM information_schema.tables
              WHERE table_schema = 'schema_epolling' AND table_name = 'tbl_user'), 6, 0));
END$$
DELIMITER ;
It works if I insert rows one by one, or maybe 5 rows at a time, but if I insert rows in bulk, an error occurs.
Here's the code I use for inserting bulk rows from another schema/table:
INSERT INTO schema_epolling.tbl_user (last_name, first_name)
SELECT last_name, first_name
FROM schema_nc.tbl_person
Here's the error:
Error Code: 1062. Duplicate entry '14000004' for key 'id_number_UNIQUE'
Error Code: 1062. Duplicate entry '14000011' for key 'id_number_UNIQUE'
Error Code: 1062. Duplicate entry '14000018' for key 'id_number_UNIQUE'
Error Code: 1062. Duplicate entry '14000025' for key 'id_number_UNIQUE'
Error Code: 1062. Duplicate entry '14000032' for key 'id_number_UNIQUE'
If I use the uuid() function it works fine, but I don't want uuid(); it's too long.

You don't want to generate id values that way.
The auto-increment value for the current INSERT is not generated yet at the time the BEFORE INSERT trigger executes.
Even if it were, the INFORMATION_SCHEMA would contain the maximum auto-increment value generated by any thread, not just the thread executing the trigger. So you would have a race condition that would easily conflict with other concurrent inserts and get the wrong value.
Also, querying INFORMATION_SCHEMA on every INSERT is likely to be a bottleneck for your performance.
In this case, to get the auto-increment value formatted with the two-digit year number prepended, you could advance the table's auto-increment value up to %y million, and then when we reach January 1 2015 you would ALTER TABLE to advance it again.
Re your comments:
The answer I gave above applies to how MySQL's auto-increment works. If you don't rely on auto-increment, you can generate the values by some other means:
- Incrementing another one-row table as @Vatev suggests (though this creates a relatively long-lived lock on that table, which could be a bottleneck for your inserts).
- Generating values in your application, based on a central, atomic id-generator like memcached. See other ideas here: Generate unique IDs in a distributed environment
- Using UUID(). Yes, sorry, it's 32 hex digits long (36 characters with the hyphens). Don't truncate it or you will lose uniqueness.
But combining triggers with auto-increment in the way you show simply won't work.
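A sketch of the counter-table alternative mentioned above, written against SQLite through Python's built-in sqlite3 module so that it runs without a MySQL server. The id_counter table, its columns, and both helper functions are hypothetical; keeping one counter row per two-digit year is an assumption of this sketch. The counter is read and incremented inside the same transaction as the inserts, which is also where the lock the answer warns about comes from.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; id_counter and the helpers are hypothetical.
conn = sqlite3.connect(":memory:", isolation_level=None)  # transactions managed by hand
conn.execute("CREATE TABLE id_counter (yr TEXT PRIMARY KEY, last_seq INTEGER NOT NULL)")
conn.execute("CREATE TABLE tbl_user (id_number TEXT UNIQUE, last_name TEXT, first_name TEXT)")

def next_id_number(conn, yr):
    """Bump the counter for this year and format it, e.g. '14000001'.
    Must be called inside an open transaction."""
    conn.execute("INSERT OR IGNORE INTO id_counter (yr, last_seq) VALUES (?, 0)", (yr,))
    conn.execute("UPDATE id_counter SET last_seq = last_seq + 1 WHERE yr = ?", (yr,))
    (seq,) = conn.execute("SELECT last_seq FROM id_counter WHERE yr = ?", (yr,)).fetchone()
    return f"{yr}{seq:06d}"

def add_users(conn, people, yr="14"):
    """Insert many users in one transaction, each with a fresh id_number."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before touching the counter
    try:
        for last, first in people:
            conn.execute(
                "INSERT INTO tbl_user VALUES (?, ?, ?)",
                (next_id_number(conn, yr), last, first),
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise

add_users(conn, [("Doe", "Jane"), ("Roe", "Rick"), ("Poe", "Ed")])
ids = [row[0] for row in conn.execute("SELECT id_number FROM tbl_user ORDER BY id_number")]
```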

I'd like to add my two cents to expound on Bill Karwin's point.
It's better that you don't generate a Unique ID by attempting to manually cobble one together.
The fact that your school produces an ID in that way does not mean that's the best way to do it (assuming that is what they are using that generated value for which I can't know without more information).
Your database work will be simpler and less error prone if you accept that the purpose for an ID field (or key) is to guarantee uniqueness in each row of data, not as a reference point to store certain pieces of human readable data in a central spot.
This type of a ID/key is known as a surrogate key.
If you'd like to read more about them here's a good article: http://en.wikipedia.org/wiki/Surrogate_key
It's common for a surrogate key to also be the primary key of a table, (and when it's used in this way it can greatly simplify creating relationships between tables).
If you would like to add a secondary column that concatenates date values and other information because that's valuable for an application you are writing, or any other purpose you see fit, then create that as a separate column in your table.
Thinking of an ID column/key in this fire-and-forget way may simplify the concept enough that you experience a number of benefits in your database creation efforts.
As an example, should you require uniqueness between unassociated databases, you will more easily be able to stomach the use of a UUID.
(Because you'll know its purpose is merely to ensure uniqueness, NOT to be useful to you in any other way.)
Additionally, as you've found, taking the responsibility on yourself, instead of relying on the database, to produce a unique value adds time consuming complexity that can otherwise be avoided.
Hope this helps.

Select and insert at the same time

So, I need to get the max value of a field called chat_id, and after that I need to increment it by one and insert some data with that value, so the queries should look something like this:
SELECT MAX(`chat_id`) FROM `messages`;
Let's say it returns 10; now I need to insert the new data:
INSERT INTO `messages` SET `chat_id` = 11 -- other data here....
This would work the way I want, but my question is: what if, between the time I increment and the time I insert the new record, another user does the same? Then there would already be a record with id 11, and it could mess up my data. Is there a way to make sure that the right id goes where I need it? By the way, I can't use auto increment for this.
EDIT: as I said, I cannot use auto increment because that table already has an id field with auto increment. This id is for a different purpose; also, it's not unique and it can't be unique.
EDIT 2: Solved it by redoing my whole table structure, since no one gave me better ideas.

Don't try to do this on your own. You've already identified one of the pitfalls of that approach. I'm not sure why you're saying you can't use auto increment here. That's really the way to go.
CREATE TABLE messages (
chat_id INT NOT NULL AUTO_INCREMENT,
....
)

If you cannot use an auto-increment primary key then you will either have to exclusively lock the table (which is generally not a good idea), or be prepared to encounter failures.
Assuming that the chat_id column is UNIQUE (which it should be from what you're saying), you can put these two queries inside a loop. If the INSERT succeeds then everything is fine, you can break out of the loop and continue. Otherwise it means that someone else managed to snatch this particular id out of your hands, so repeat the process until successful.
At this point I have to mention that you should not actually use a totally naive approach in production code (e.g. you might want to put an upper limit in how many iterations are possible before you give up) and that this solution will not work well if there is a lot of contention for the database (it will work just fine to ensure that the occasional race does not cause you problems). You should examine your access patterns and load before deciding on this.
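The retry loop described above might look roughly like this. The sketch uses Python's built-in sqlite3 module instead of MySQL so it can be run directly, and it follows this answer's assumption that chat_id is UNIQUE. The helper name, the body column, and the max_attempts limit are made up for illustration.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL, and chat_id is UNIQUE as this answer assumes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (chat_id INTEGER NOT NULL UNIQUE, body TEXT)")

def insert_with_next_chat_id(conn, body, max_attempts=5):
    """Read MAX(chat_id), try to insert MAX + 1, and retry if another writer got there first."""
    for _ in range(max_attempts):
        (current,) = conn.execute("SELECT COALESCE(MAX(chat_id), 0) FROM messages").fetchone()
        candidate = current + 1
        try:
            with conn:
                conn.execute(
                    "INSERT INTO messages (chat_id, body) VALUES (?, ?)",
                    (candidate, body),
                )
            return candidate
        except sqlite3.IntegrityError:
            continue  # lost the race for this id; read MAX again
    raise RuntimeError("could not allocate a chat_id")  # give up after the upper limit

first = insert_with_next_chat_id(conn, "hello")
second = insert_with_next_chat_id(conn, "world")
```

The unique constraint is what makes the loop safe: a lost race shows up as a failed insert rather than as two rows sharing an id.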

AUTO_INCREMENT would solve this problem. But for other similar situations this would be a great use of transactions. If you're using the InnoDB engine you can use transactions to ensure that operations happen in a specific order so that your data stays consistent.

You can solve this by using MySQL's built-in uuid() function to calculate the new primary key value, instead of leaving it to the auto increment feature.
Alter your table to make messages.chat_id a char(36) and remove the AUTO_INCREMENT clause.
Then do this:
# Generate a unique primary key value before inserting.
declare new_id char(36);
select uuid() into new_id;
# Insert the new record.
insert into messages
(chat_id, ...)
values
(new_id, ...);
# Select the new record.
select *
from messages
where chat_id = new_id;
MySQL's documentation on uuid() says:
A UUID is designed as a number that is globally unique in space and time. Two calls to UUID() are expected to generate two different values, even if these calls are performed on two separate devices not connected to each other.
Meaning it's perfectly safe to use the value generated by uuid as a primary key value.
This way you can predict what the primary key value of the new record will be before you insert it and then query by it knowing for sure that no other process has "stolen" that id from you in between the insert and the select. Which in turn removes the need for a transaction.

How to restrict a column value in SQLite / MySQL

I would like to restrict a column value in a SQL table. For example, the column values can only be "car" or "bike" or "van". My question is how do you achieve this in SQL, and is it a good idea to do this on the DB side or should I let the application restrict the input.
I also have the intention to add or remove more values in the future, for example, "truck".
The type of Databases I am using are SQLite and MySQL.

Add a new table containing these means of transport, and make your column a foreign key to that table. New means of transport can be added to the table in future, and your column definition remains the same.
With this construction, I would definitely choose to regulate this at the DB level, rather than in the application.

For MySQL, you can use the ENUM data type.
column_name ENUM('small', 'medium', 'large')
See MySQL Reference: The ENUM Type
To add to this, I find it's always better to restrict on the DB side AND on the app side. An Enum plus a Select box and you're covered.

Yes, it is recommended to add check constraints. Check constraints are used to ensure the validity of data in a database and to provide data integrity. If they are used at the database level, applications that use the database will not be able to add invalid data or modify valid data so the data becomes invalid, even if the application itself accepts invalid data.
In SQLite:
create table MyTable
(
name text check (name in ('car', 'bike', 'van'))
);
In MySQL:
create table MyTable
(
name ENUM('car', 'bike', 'van')
);
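To see the SQLite CHECK constraint reject a value, here is a short sketch using Python's built-in sqlite3 module. The table name my_table and the try_insert helper are hypothetical.

```python
import sqlite3

# Assumption: my_table and try_insert are hypothetical names for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (name TEXT CHECK (name IN ('car', 'bike', 'van')))")

def try_insert(conn, name):
    """Return True if the value passes the CHECK constraint, False otherwise."""
    try:
        with conn:
            conn.execute("INSERT INTO my_table (name) VALUES (?)", (name,))
        return True
    except sqlite3.IntegrityError:
        return False

ok = try_insert(conn, "bike")
bad = try_insert(conn, "truck")  # not in the allowed list
```

Allowing "truck" later means changing the constraint, which in SQLite generally requires recreating the table.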

You would use a check constraint. In SQL Server it works like this
ALTER TABLE Vehicles
ADD CONSTRAINT chkVehicleType CHECK (VehicleType in ('car','bike','van'));
CHECK constraints are part of standard SQL, but note that MySQL parsed and ignored them before version 8.0.16.

If you want to go with DB-side validation, you can use triggers. See this for SQLite, and this detailed how-to for MySQL.
So the question is really whether you should use Database validation or not. If you have multiple clients -- whether they are different programs, or multiple users (with possibly different versions of the program) -- then going the database route is definitely best. The database is (hopefully) centralized, so you can decouple some of the details of validation. In your particular case, you can verify that the value being inserted into the column is contained in a separate table that simply lists valid values.
On the other hand, if you have little experience with databases, plan to target several different databases, and don't have the time to develop expertise, perhaps simple application level validation is the most expedient choice.

To add some beginner-level context to the excellent answer of @NGLN above.
First, one needs to check that foreign key enforcement is active, otherwise SQLite won't limit the input of the column to the values in the reference table:
PRAGMA foreign_keys;
...which gives a response of 0 or 1, indicating off or on.
To set the foreign key constraint:
PRAGMA foreign_keys = ON;
This needs to be set to ensure that sqlite3 enforces the constraint.
I found it simplest to just set the primary key of the reference table to be the type. In the OP's example:
CREATE TABLE IF NOT EXISTS vehicle_types(
vehicle_type text PRIMARY KEY);
Then, one can insert 'car', 'bike' etc into the vehicle_types table (and more in the future) and reference that table in the foreign key constraint in the child table (the table in which the OP wished to reference the type of vehicle):
CREATE TABLE IF NOT EXISTS ops_original_table(
col_id integer PRIMARY KEY,
...many other columns...
vehicle_type text NOT NULL,
FOREIGN KEY (vehicle_type) REFERENCES vehicle_types(vehicle_type));
Outwith the scope of the OP's question but also take note that when setting up a foreign key constraint thought should be given to what happens to the column in child table (ops_original_table) if a parent table value (vehicle_types) is deleted or updated. See this page for info
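Putting the steps above together, a runnable sketch with Python's built-in sqlite3 module could look like this. The vehicles table and the add_vehicle helper are hypothetical; vehicle_types follows the reference table above. Note that the PRAGMA has to be issued on every new connection.

```python
import sqlite3

# Assumption: the vehicles table and add_vehicle helper are hypothetical names.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default, per connection
conn.execute("CREATE TABLE vehicle_types (vehicle_type TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE vehicles ("
    " id INTEGER PRIMARY KEY,"
    " vehicle_type TEXT NOT NULL REFERENCES vehicle_types(vehicle_type))"
)
with conn:
    conn.executemany("INSERT INTO vehicle_types VALUES (?)", [("car",), ("bike",), ("van",)])

def add_vehicle(conn, vehicle_type):
    """Return True if the row was stored, False if the foreign key rejected it."""
    try:
        with conn:
            conn.execute("INSERT INTO vehicles (vehicle_type) VALUES (?)", (vehicle_type,))
        return True
    except sqlite3.IntegrityError:
        return False

fk_on = conn.execute("PRAGMA foreign_keys").fetchone()[0]
ok_car = add_vehicle(conn, "car")
truck_before = add_vehicle(conn, "truck")  # not in the reference table yet
with conn:
    conn.execute("INSERT INTO vehicle_types VALUES ('truck')")  # extend the allowed values
truck_after = add_vehicle(conn, "truck")
```

Adding a new allowed value is an ordinary insert into the reference table, with no schema change.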
