I have a database design that makes use of compound primary keys to ensure uniqueness and which are also foreign keys.
These tables are then linked to other tables in the same way, so that in the end the compound key can get up to 4 or 5 columns. This led to some rather large JOINs, so I thought a simple solution would be to use an autoincrement column which is not part of the primary key but which is used as part of the primary key of other table(s).
Here is some pseudo code showing the general layout :
CREATE TABLE Item (
id AUTO_INCREMENT,
...
PRIMARY KEY (id)
) ENGINE = InnoDB;
CREATE TABLE PriceCategory (
id AUTO_INCREMENT,
...
PRIMARY KEY (id)
)
CREATE TABLE ItemPriceCategory (
itemId,
priceCategoryId,
id AUTO_INCREMENT,
...
UNIQUE INDEX id,
PRIMARY KEY (eventId, priceCategoryId)
)
CREATE TABLE ClientType (
id AUTO_INCREMENT,
...
PRIMARY KEY (id)
)
CREATE TABLE Price (
itemPriceCategoryId,
clientTypeId,
id AUTO_INCREMENT,
...
UNIQUE INDEX id,
PRIMARY KEY (itemPriceCategoryId, clientTypeId)
)
table Purchase (
priceId,
userId,
amount,
PRIMARY KEY (priceId, userId)
)
The names of tables have been changed to protect the innocent ;-) Also the actual layout is a little deeper in terms of references.
So, my question is, is this a viable strategy, from a performance and data integrity point of view ? Is it better to have all keys from all the referenced tables in the Purchase table ?
Thanks in advance.
Generally, the advice on primary keys is to have "meaningless", immutable primary keys with a single column. Auto incrementing integers are nice.
So, I would reverse your design - your join tables should also have meaningless primary keys. For instance:
CREATE TABLE ItemPriceCategory (
itemId,
priceCategoryId,
id AUTO_INCREMENT,
...
PRIMARY KEY id,
UNIQUE INDEX (eventId, priceCategoryId)
)
That way, the itemPriceCategoryId column in price is a proper foreign key, linking to the primary key of the ItemPriceCategory table.
You can then use http://dev.mysql.com/doc/refman/5.5/en/innodb-foreign-key-constraints.html foreign keys to ensure the consistency of your database.
In terms of performance, broadly speaking, this strategy should be faster than querying compound keys in a join, but with a well-indexed database, you may not actually notice the difference...
I think that something has been lost in translation over here, but I did my best to make an ER diagram of this.
In general, there are two approaches. The first one is to propagate keys and the second one is to have an auto-increment integer as a PK for each table.
The second approach is often driven by ORM tools which use a DB as object-persistence storage, while the first one (using key propagation) is more common for hand-crafted DB design.
In general, the model with key propagation offers better performance for "random queries", mostly because you can "skip tables" in joins. For example, in the model with key propagation you can join the Purchase table directly to the Item table to report purchases by ItemName. In the other model you would have to join Price and ItemPriceCategory tables too -- just to get to the ItemID.
Basically, the model with key propagation is essentially relational -- while the other one is object-driven. ORM tools either prefer or enforce the model with separate ID (second case), but offer other advantages for development.
Your example seems to be trying to use some kind of a combination of these two -- not necessarily bad, it would help if you could talk to original designer.
With key propagation
Independent keys for each table
Related
Lets say I have table A with two junction tables B and C, how would I go about creating primary keys for table A? I have two of these types of table in a diagram I drew, the circle keys are foreign keys btw.
Image with junction tables
Your games table needs only one primary key: this identifies each specific game. In the junction tables, the primary keys are composed of the game primary key and the directors (or types) primary key.
Taken from the reference in the tutorial MySQL Primary Key:
CREATE TABLE roles(
role_id INT AUTO_INCREMENT,
role_name VARCHAR(50),
PRIMARY KEY(role_id)
);
It is difficult to provide information about your specific question because there is too little details in it.
From your comment "if a table has two junction tables attached to it, would it need to have two primary keys?". No.
A primary key is actually a logical concept (a design mechanism) used to define a logical model. A primary key is a set of attributes (columns) that together uniquely identify each end every Tuple (row) in a relation (table). One of the rules of a primary key is that there is only one per relation.
The logical model is used, as mentioned, as the design to create the physical model, relations become tables, attributes become columns, Primary keys may become unique indexes. Foreign Keys may become indexes in the related table and so on.
Many RDBMS's allow the specification of a PRIMARY KEY in a physical table definition. Most also allow definition of FOREIGN KEYs on a physical table also. What they do with them may vary from one implementation to another. Many use the definition of a PRIMARY KEY to define a UNIQUE INDEX of some sort to enforce the "must uniquely identify" each and every record in the table.
So, No, your games_directors table does not need, nor can it have, two primary keys. if you did choose to specify a PRIMARY KEY, you would need to specify all the columns that uniquely identify records in the games_directors table - most likely PRIMARY KEY (game_id, director_id).
Similarly, the PRIMARY KEY for the games table would likely be PRIMARY KEY (game_id), for the directors would likely be PRIMARY KEY (director_id) and for game types it would likely be PRIMARY KEY (game_type_id).
You might use a foreign key from your games_directors table to ensure that when records are added to it that the corresponding director exists in the games table and the directors table. In this case, your games_directors table will have two foreign key relationships (one to games and another to directors). But only one PRIMARY KEY.
So you might end up with something like this:
create table games (
game_id integer,
PRIMARY KEY (game_id)
);
create table directors (
director_id integer,
PRIMARY KEY (director_id)
);
CREATE TABLE games_directors (
game_id INTEGER NOT NULL,
director_id INTEGER NOT NULL,
commission_paid DECIMAL(10,2),
PRIMARY KEY (game_id, director_id),
FOREIGN KEY (game_id) REFERENCES games(game_id),
FOREIGN KEY (director_id) REFERENCES directors(director_id)
);
NB: I didn't tested the above using PostgreSql. The syntax should work for most RDBMS's, but some may require tweaking slightly.
Indexes can be used to speed up access to individual records within table. For example, you might want to create an index on director name or director id (depending upon how you most frequenytly access that table). If you mostly access the director table with an equality condition like this : where director_name = 'fred' then an index on director_name might make sense.
Indexes become more useful as the number of records in the tables grows.
I hope this answers your question. :-)
At work we have a big database with unique indexes instead of primary keys and all works fine.
I'm designing new database for a new project and I have a dilemma:
In DB theory, primary key is fundamental element, that's OK, but in REAL projects what are advantages and disadvantages of both?
What do you use in projects?
EDIT: ...and what about primary keys and replication on MS SQL server?
What is a unique index?
A unique index on a column is an index on that column that also enforces the constraint that you cannot have two equal values in that column in two different rows. Example:
CREATE TABLE table1 (foo int, bar int);
CREATE UNIQUE INDEX ux_table1_foo ON table1(foo); -- Create unique index on foo.
INSERT INTO table1 (foo, bar) VALUES (1, 2); -- OK
INSERT INTO table1 (foo, bar) VALUES (2, 2); -- OK
INSERT INTO table1 (foo, bar) VALUES (3, 1); -- OK
INSERT INTO table1 (foo, bar) VALUES (1, 4); -- Fails!
Duplicate entry '1' for key 'ux_table1_foo'
The last insert fails because it violates the unique index on column foo when it tries to insert the value 1 into this column for a second time.
In MySQL a unique constraint allows multiple NULLs.
It is possible to make a unique index on mutiple columns.
Primary key versus unique index
Things that are the same:
A primary key implies a unique index.
Things that are different:
A primary key also implies NOT NULL, but a unique index can be nullable.
There can be only one primary key, but there can be multiple unique indexes.
If there is no clustered index defined then the primary key will be the clustered index.
You can see it like this:
A Primary Key IS Unique
A Unique value doesn't have to be the Representaion of the Element
Meaning?; Well a primary key is used to identify the element, if you have a "Person" you would like to have a Personal Identification Number ( SSN or such ) which is Primary to your Person.
On the other hand, the person might have an e-mail which is unique, but doensn't identify the person.
I always have Primary Keys, even in relationship tables ( the mid-table / connection table ) I might have them. Why? Well I like to follow a standard when coding, if the "Person" has an identifier, the Car has an identifier, well, then the Person -> Car should have an identifier as well!
Foreign keys work with unique constraints as well as primary keys. From Books Online:
A FOREIGN KEY constraint does not have
to be linked only to a PRIMARY KEY
constraint in another table; it can
also be defined to reference the
columns of a UNIQUE constraint in
another table
For transactional replication, you need the primary key. From Books Online:
Tables published for transactional
replication must have a primary key.
If a table is in a transactional
replication publication, you cannot
disable any indexes that are
associated with primary key columns.
These indexes are required by
replication. To disable an index, you
must first drop the table from the
publication.
Both answers are for SQL Server 2005.
The choice of when to use a surrogate primary key as opposed to a natural key is tricky. Answers such as, always or never, are rarely useful. I find that it depends on the situation.
As an example, I have the following tables:
CREATE TABLE toll_booths (
id INTEGER NOT NULL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
...
UNIQUE(name)
)
CREATE TABLE cars (
vin VARCHAR(17) NOT NULL PRIMARY KEY,
license_plate VARCHAR(10) NOT NULL,
...
UNIQUE(license_plate)
)
CREATE TABLE drive_through (
id INTEGER NOT NULL PRIMARY KEY,
toll_booth_id INTEGER NOT NULL REFERENCES toll_booths(id),
vin VARCHAR(17) NOT NULL REFERENCES cars(vin),
at TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
amount NUMERIC(10,4) NOT NULL,
...
UNIQUE(toll_booth_id, vin)
)
We have two entity tables (toll_booths and cars) and a transaction table (drive_through). The toll_booth table uses a surrogate key because it has no natural attribute that is not guaranteed to change (the name can easily be changed). The cars table uses a natural primary key because it has a non-changing unique identifier (vin). The drive_through transaction table uses a surrogate key for easy identification, but also has a unique constraint on the attributes that are guaranteed to be unique at the time the record is inserted.
http://database-programmer.blogspot.com has some great articles on this particular subject.
There are no disadvantages of primary keys.
To add just some information to #MrWiggles and #Peter Parker answers, when table doesn't have primary key for example you won't be able to edit data in some applications (they will end up saying sth like cannot edit / delete data without primary key). Postgresql allows multiple NULL values to be in UNIQUE column, PRIMARY KEY doesn't allow NULLs. Also some ORM that generate code may have some problems with tables without primary keys.
UPDATE:
As far as I know it is not possible to replicate tables without primary keys in MSSQL, at least without problems (details).
If something is a primary key, depending on your DB engine, the entire table gets sorted by the primary key. This means that lookups are much faster on the primary key because it doesn't have to do any dereferencing as it has to do with any other kind of index. Besides that, it's just theory.
In addition to what the other answers have said, some databases and systems may require a primary to be present. One situation comes to mind; when using enterprise replication with Informix a PK must be present for a table to participate in replication.
As long as you do not allow NULL for a value, they should be handled the same, but the value NULL is handled differently on databases(AFAIK MS-SQL do not allow more than one(1) NULL value, mySQL and Oracle allow this, if a column is UNIQUE)
So you must define this column NOT NULL UNIQUE INDEX
There is no such thing as a primary key in relational data theory, so your question has to be answered on the practical level.
Unique indexes are not part of the SQL standard. The particular implementation of a DBMS will determine what are the consequences of declaring a unique index.
In Oracle, declaring a primary key will result in a unique index being created on your behalf, so the question is almost moot. I can't tell you about other DBMS products.
I favor declaring a primary key. This has the effect of forbidding NULLs in the key column(s) as well as forbidding duplicates. I also favor declaring REFERENCES constraints to enforce entity integrity. In many cases, declaring an index on the coulmn(s) of a foreign key will speed up joins. This kind of index should in general not be unique.
There are some disadvantages of CLUSTERED INDEXES vs UNIQUE INDEXES.
As already stated, a CLUSTERED INDEX physically orders the data in the table.
This mean that when you have a lot if inserts or deletes on a table containing a clustered index, everytime (well, almost, depending on your fill factor) you change the data, the physical table needs to be updated to stay sorted.
In relative small tables, this is fine, but when getting to tables that have GB's worth of data, and insertrs/deletes affect the sorting, you will run into problems.
I almost never create a table without a numeric primary key. If there is also a natural key that should be unique, I also put a unique index on it. Joins are faster on integers than multicolumn natural keys, data only needs to change in one place (natural keys tend to need to be updated which is a bad thing when it is in primary key - foreign key relationships). If you are going to need replication use a GUID instead of an integer, but for the most part I prefer a key that is user readable especially if they need to see it to distinguish between John Smith and John Smith.
The few times I don't create a surrogate key are when I have a joining table that is involved in a many-to-many relationship. In this case I declare both fields as the primary key.
My understanding is that a primary key and a unique index with a not‑null constraint, are the same (*); and I suppose one choose one or the other depending on what the specification explicitly states or implies (a matter of what you want to express and explicitly enforce). If it requires uniqueness and not‑null, then make it a primary key. If it just happens all parts of a unique index are not‑null without any requirement for that, then just make it a unique index.
The sole remaining difference is, you may have multiple not‑null unique indexes, while you can't have multiple primary keys.
(*) Excepting a practical difference: a primary key can be the default unique key for some operations, like defining a foreign key. Ex. if one define a foreign key referencing a table and does not provide the column name, if the referenced table has a primary key, then the primary key will be the referenced column. Otherwise, the the referenced column will have to be named explicitly.
Others here have mentioned DB replication, but I don't know about it.
Unique Index can have one NULL value. It creates NON-CLUSTERED INDEX.
Primary Key cannot contain NULL value. It creates CLUSTERED INDEX.
In MSSQL, Primary keys should be monotonically increasing for best performance on the clustered index. Therefore an integer with identity insert is better than any natural key that might not be monotonically increasing.
If it were up to me...
You need to satisfy the requirements of the database and of your applications.
Adding an auto-incrementing integer or long id column to every table to serve as the primary key takes care of the database requirements.
You would then add at least one other unique index to the table for use by your application. This would be the index on employee_id, or account_id, or customer_id, etc. If possible, this index should not be a composite index.
I would favor indices on several fields individually over composite indices. The database will use the single field indices whenever the where clause includes those fields, but it will only use a composite when you provide the fields in exactly the correct order - meaning it can't use the second field in a composite index unless you provide both the first and second in your where clause.
I am all for using calculated or Function type indices - and would recommend using them over composite indices. It makes it very easy to use the function index by using the same function in your where clause.
This takes care of your application requirements.
It is highly likely that other non-primary indices are actually mappings of that indexes key value to a primary key value, not rowid()'s. This allows for physical sorting operations and deletes to occur without having to recreate these indices.
imagine you have a N-N relationship with three tables created. By example, image you have a table for CUSTOMER(id_customer, name), the car table CAR(id_car, model, num_doors, ...) (id_customer and id_car are the primary keys for their respective tables).
Now you create the table PURCHASE to relate CUSTOMER with CARS and create the FOREIGN KEYS... The question is what is the best choice?
1.PURCHASE(id_customer, id_car, date_purchase) -> Primary keys are (id_customer, id_car). Foreign keys are (id_customer->CUSTOMER.id_customer and id_car->CAR.id_car).
2.PURCHASE(id, id_customer, id_car, date_purchase) -> Primary keys are (id), Foreign keys are (id_customer->CUSTOMER.id_customer and id_car->CAR.id_car).
My question is what is the best option, 1 or 2, taking into account efficiency and optimization in database.
For a Many-to-Many mapping, do it this way.
CREATE TABLE XtoY (
# No surrogate id for this table
x_id MEDIUMINT UNSIGNED NOT NULL, -- For JOINing to one table
y_id MEDIUMINT UNSIGNED NOT NULL, -- For JOINing to the other table
# Include other fields specific to the 'relation'
PRIMARY KEY(x_id, y_id), -- When starting with X
INDEX (y_id, x_id) -- When starting with Y
) ENGINE=InnoDB;
Notes:
Lack of an AUTO_INCREMENT id for this table -- The PK given is the 'natural' PK; there is no good reason for a surrogate.
"MEDIUMINT" -- This is a reminder that all INTs should be made as small as is safe (smaller ⇒ faster). Of course the declaration here must match the definition in the table being linked to.
"UNSIGNED" -- Nearly all INTs may as well be declared non-negative
"NOT NULL" -- Well, that's true, isn't it?
"InnoDB" -- More effecient than MyISAM because of the way the PRIMARY KEY is clustered with the data in InnoDB.
"INDEX(y_id, x_id)" -- The PRIMARY KEY makes it efficient to go one direction; the makes the other direction efficient. No need to say UNIQUE; that would be extra effort on INSERTs.
In the secondary index, saying justINDEX(y_id) would work because it would implicit include x_id. But I would rather make it more obvious that I am hoping for a 'covering' index.
See Cookbook on building indexes
I have 2 tables, each with their own auto incremented IDs, which are of course primary keys.
When I want to create a 3rd table to establish the relation between these 2 tables, I always have an error.
First one is that you can have only 1 automatically-incremented column, the second one occurs when I delete the auto_increment statement from those 2, therefore AQL doesn't allow me to make them foreign keys, because of the type matching failure.
Is there a way that I can create a relational table without losing auto increment features?
Another possible (but not preferred) solution may be there is another primary key in the first table, which is the username of the user, not with an auto_increment statement, of course. Is it inevitable?
Thanks in advance.
1 Concept
You have misunderstood some basic concepts, and the difficulties result from that. We have to address the concepts first, not the problem as you perceive it, and consequently, your problem will disappear.
auto incremented IDs, which are of course primary keys.
No, they are not. That is a common misconception. And problems are guaranteed to ensue.
An ID field cannot be a Primary Key in the English or technical or Relational senses.
Sure, in SQL, you can declare any field to be a PRIMARY KEY, but that doesn't magically transform it into a Primary Key in the English, technical, or Relational senses. You can name a chihuahua "Rottweiller", but that doesn't transform it into a Rottweiller, it remains a chihuahua. Like any language, SQL simply executes the commands that you give it, it does not understand PRIMARY KEY to mean something Relational, it just whacks an unique index on the column (or field).
The problem is, since you have declared the ID to be a PRIMARY KEY, you think of it as a Primary Key, and you may expect that it has some of qualities of a Primary Key. Except for the uniqueness of the ID value, it provides no benefit. It has none of the qualities of a Primary Key, or any sort of Relational Key for that matter. It is not a Key in the English, technical, or Relational senses. By declaring a non-key to be a key, you will only confuse yourself, and you will find out that there is something terribly wrong only when the user complains about duplicates in the table.
2 Relational Model
2.1 Relational tables must have row uniqueness
A PRIMARY KEY on an ID field does not provide row uniqueness. Therefore it is not a Relational table containing rows, and if it isn't that, then it is a file containing records. It doesn't have any of the integrity, or power (at this stage you will be aware of join power only), or speed, that a table in a Relational database has.
Execute this code (MS SQL) and prove it to yourself. Please do not simply read this and understand it, and then proceed to read the rest of this Answer, this code must be executed before reading further. It has curative value.
-- [1] Dumb, broken file
-- Ensures unique RECORDS, allows duplicate ROWS
CREATE TABLE dumb_file (
id INT IDENTITY PRIMARY KEY,
name_first CHAR(30),
name_last CHAR(30)
)
INSERT dumb_file VALUES
( 'Mickey', 'Mouse' ),
( 'Mickey', 'Mouse' ),
( 'Mickey', 'Mouse' )
SELECT *
FROM dumb_file
Notice that you have duplicate rows. Relational tables are required to have unique rows. Further proof that you do not have a relational table, or any of the qualities of one.
Notice that in your report, the only thing that is unique is the ID field, which no user cares about, no user sees, because it is not data, it is some additional nonsense that some very stupid "teacher" told you to put in every file. You have record uniqueness but not row uniqueness.
In terms of the data (the real data minus the extraneous additions), the data name_last and name_first can exist without the ID field. A person has a first name and last name without an ID being stamped on their forehead.
The second thing that you are using that confuses you is the AUTOINCREMENT. If you are implementing a record filing system with no Relational capability, sure, it is helpful, you don't have to code the increment when inserting records. But if you are implementing a Relational Database, it serves no purpose at all, because you will never use it. There are many features in SQL that most people never use.
2.2 Corrective Action
So how do you upgrade, elevate, that dumb_file that is full of duplicate rows to a Relational table, in order to get some of the qualities and benefits of a Relational table ? There are three steps to this.
You need to understand Keys
And since we have progressed from ISAM files of the 1970's, to the Relational Model, you need to understand Relational Keys. That is, if you wish to obtain the benefits (integrity, power, speed) of a Relational Database.
In Codd's Relational Model:
a key is made up from the data
and
the rows in a table must be unique
Your "key" is not made up from the data. It is some additional, non-data parasite, caused by your being infected with the disease of your "teacher". Recognise it as such, and allow yourself the full mental capacity that God gave you (notice that I do not ask you to think in isolated or fragmented or abstract terms, all the elements in a database must be integrated with each other).
Make up a real key from the data, and only from the data. In this case, there is only one possible Key: (name_last, name_first).
Try this code, declare an unique constraint on the data:
-- [2] dumb_file fixed, elevated to table, prevents duplicate rows
-- still dumb
CREATE TABLE dumb_table (
id INT IDENTITY PRIMARY KEY,
name_first CHAR(30),
name_last CHAR(30),
CONSTRAINT UK
UNIQUE ( name_last, name_first )
)
INSERT dumb_table VALUES
( 'Mickey', 'Mouse' ),
( 'Minnie', 'Mouse' )
SELECT *
FROM dumb_table
INSERT dumb_table VALUES
( 'Mickey', 'Mouse' )
Now we have row uniqueness. That is the sequence that happens to most people: they create a file which allows dupes; they have no idea why dupes are appearing in the drop-downs; the user screams; they tweak the file and add an index to prevent dupes; they go to the next bug fix. (They may do so correctly or not, that is a different story.)
The second level. For thinking people who think beyond the fix-its. Since we have now row uniqueness, what in Heaven's name is the purpose of the ID field, why do we even have it ??? Oh, because the chihuahua is named Rotty and we are afraid to touch it.
The declaration that it is a PRIMARY KEY is false, but it remains, causing confusion and false expectations. The only genuine Key there is, is the (name_last, name_fist), and it is a Alternate Key at this point.
Therefore the ID field is totally superfluous; and so is the index that supports it; and so is the stupid AUTOINCREMENT; and so is the false declaration that it is a PRIMARY KEY; and any expectations you may have of it are false.
Therefore remove the superfluous ID field. Try this code:
-- [3] Relational Table
-- Now that we have prevented duplicate data, the id field
-- AND its additional index serves no purpose, it is superfluous,
-- like an udder on a bull. If we remove the field AND the
-- supporting index, we obtain a Relational table.
CREATE TABLE relational_table (
name_first CHAR(30),
name_last CHAR(30),
CONSTRAINT PK
PRIMARY KEY ( name_last, name_first )
)
INSERT relational_table VALUES
( 'Mickey', 'Mouse' ),
( 'Minnie', 'Mouse' )
SELECT *
FROM relational_table
INSERT relational_table VALUES
( 'Mickey', 'Mouse' )
Works just fine, works as intended, without the extraneous fields and indices.
Please remember this, and do it right, every single time.
2.3 False Teachers
In these end times, as advised, we will have many of them. Note well, the "teachers" who propagate ID columns, by virtue of the detailed evidence in this post, simply do not understand the Relational Model or Relational Databases. Especially those who write books about it.
As evidenced, they are stuck in pre-1970 ISAM technology. That is all they understand, and that is all that they can teach. They use an SQL database container, for the ease of Access, recovery, backup, etc, but the content is pure Record Filing System with no Relational Integrity, Power, or speed. AFAIC, it is a serious fraud.
In addition to ID fields, of course, there are several items that are key Relational-or-not concepts, that taken together, cause me to form such a grave conclusion. Those other items are beyond the scope of this post.
One particular pair of idiots is currently mounting an assault on First Normal Form. They belong in the asylum.
3 Solution
Now for the rest of your question.
3.1 Answers
Is there a way that I can create a relational table without losing auto increment features?
That is a self-contradicting sentence. I trust you will understand from my explanation, Relational tables have no need for AUTOINCREMENT "features"; if the file has AUTOINCREMENT, it is not a Relational table.
AUTOINCREMENT or IDENTITY is good for one thing only: if, and only if, you want to create an Excel spreadsheet in the SQL database container, replete with fields named A, B, and C, across the top, and record numbers down the left side. In database terms, that is the result of a SELECT, a flattened view of the data, that is not the source of data, which is organised (Normalised).
Another possible (but not preferred) solution may be there is another primary key in the first table, which is the username of the user, not with an auto increment statement, of course. Is it inevitable?
In technical work, we don't care about preferences, because that is subjective, and it changes all the time. We care about technical correctness, because that is objective, and it does not change.
Yes, it is inevitable. Because it is just a matter of time; number of bugs; number of "can't dos"; number of user screams, until you face the facts, overcome your false declarations, and realise that:
the only way to ensure that user rows are unique, that user_names are unique, is to declare an UNIQUE constraint on it
and get rid of user_id or id in the user file
which promotes user_name to PRIMARY KEY
Yes, because your entire problem with the third table, not coincidentally, is then eliminated.
That third table is an Associative Table. The only Key required (Primary Key) is a composite of the two parent Primary Keys. That ensures uniqueness of the rows, which are identified by their Keys, not by their IDs.
I am warning you about that because the same "teachers" who taught you the error of implementing ID fields, teach the error of implementing ID fields in the Associative Table, where, just as with an ordinary table, it is superfluous, serves no purpose, introduces duplicates, and causes confusion. And it is doubly superfluous because the two keys that provide are already there, staring us in the face.
Since they do not understand the RM, or Relational terms, they call Associative Tables "link" or "map" tables. If they have an ID field, they are in fact, files.
3.2 Lookup Tables
ID fields are particularly Stupid Thing to Do for Lookup or Reference tables. Most of them have recognisable codes, there is no need to enumerate the list of codes in them, because the codes are (should be) unique.
ENUM is just as stupid, but for a different reason: it locks you into an anti-SQL method, a "feature" in that non-compliant "SQL".
Further, having the codes in the child tables as FKs, is a Good Thing: the code is much more meaningful, and it often saves an unnecessary join:
SELECT ...
FROM child_table -- not the lookup table
WHERE gender_code = "M" -- FK in the child, PK in the lookup
instead of:
SELECT ...
FROM child_table
WHERE gender_id = 6 -- meaningless to the maintainer
or worse:
SELECT ...
FROM child_table C -- that you are trying to determine
JOIN lookup_table L
ON C.gender_id = L.gender_id
WHERE L.gender_code = "M" -- meaningful, known
Note that this is something one cannot avoid: you need uniqueness on the lookup code and uniqueness on the description. That is the only method to prevent duplicates in each of the two columns:
CREATE TABLE gender (
gender_code CHAR(2) NOT NULL,
name CHAR(30) NOT NULL
CONSTRAINT PK
PRIMARY KEY ( gender_code )
CONSTRAINT AK
UNIQUE ( name )
)
3.3 Full Example
From the details in your question, I suspect that you have SQL syntax and FK definition issues, so I will give the entire solution you need as an example (since you have not given file definitions):
CREATE TABLE user ( -- Typical Identifying Table
user_name CHAR(16) NOT NULL, -- Short PK
name_first CHAR(30) NOT NULL, -- Alt Key.1
name_last CHAR(30) NOT NULL, -- Alt Key.2
birth_date DATE NOT NULL -- Alt Key.3
CONSTRAINT PK -- unique user_name
PRIMARY KEY ( user_name )
CONSTRAINT AK -- unique person identification
PRIMARY KEY ( name_last, name_first, birth_date )
)
CREATE TABLE sport ( -- Typical Lookup Table
sport_code CHAR(4) NOT NULL, -- PK Short code
name CHAR(30) NOT NULL -- AK
CONSTRAINT PK
PRIMARY KEY ( sport_code )
CONSTRAINT AK
PRIMARY KEY ( name )
)
CREATE TABLE user_sport ( -- Typical Associative Table
user_name CHAR(16) NOT NULL, -- PK.1, FK
sport_code CHAR(4) NOT NULL, -- PK.2, FK
start_date DATE NOT NULL
CONSTRAINT PK
PRIMARY KEY ( user_name, sport_code )
CONSTRAINT user_plays_sport_fk
FOREIGN KEY ( user_name )
REFERENCES user ( user_name )
CONSTRAINT sport_occupies_user_fk
FOREIGN KEY ( sport_code )
REFERENCES sport ( sport_code )
)
There, the PRIMARY KEY declaration is honest, it is a Primary Key; no ID; no AUTOINCREMENT; no extra indices; no duplicate rows; no erroneous expectations; no consequential problems.
3.4 Relational Data Model
Here is the Data Model to go with the definitions.
As a PDF
If you are not used to the Notation, please be advised that every little tick, notch, and mark, the solid vs dashed lines, the square vs round corners, means something very specific. Refer to the IDEF1X Notation.
A picture is worth a thousand words; in this case a standard-complaint picture is worth more than that; a bad one is not worth the paper it is drawn on.
Please check the Verb Phrases carefully, they comprise a set of Predicates. The remainder of the Predicates can be determined directly from the model. If this is not clear, please ask.
Here is a small design with the common NOT NULL UNIQUE constraints on the natural keys:
CREATE TABLE 'users' {
id int(10) NOT NULL AUTO_INCREMENT,
name NOT NULL UNIQUE,
email NOT NULL UNIQUE,
pass NOT NULL,
PRIMARY KEY ('id')
}
The NOT NULL UNIQUE constraint seems hackish to me. Having disjoint candidate keys seems denormalized to me, and the UNIQUE constraint seems like a bloated O(N) checking feature, so I'm inclined to use a design that has a relation for each natural key that maps the natural key to the surrogate key in the main relation.
CREATE TABLE users {
id int(10) NOT NULL AUTO_INCREMENT,
pass NOT NULL,
PRIMARY KEY ('id')
}
CREATE TABLE user_names {
name NOT NULL,
user_id NOT NULL,
PRIMARY KEY ('name')
}
CREATE TABLE user_emails {
email NOT NULL,
user_id NOT NULL,
PRIMARY KEY ('email')
}
This way, I implicitly enforce the unique constraint on user's emails and usernames while providing the luxury of being able to search for a user's info with their email or name in O(ln N + ln M) time (which I very much desire).
This only way I can ever see the first, more common design matching the performance of the second design is if the UNIQUE constraint implicitly indexed the table so that selects with, and therefore checks for uniqueness of, the natural keys can be done in O(ln N) time.
I suppose my question is, with regard to the performance insertions and selections with the natural keys, what is the best way to handle a table with 3 or more natural keys that is indexed by a surrogate key?
It seems that what you are describing is 6th Normal Form. Assuming your original table is in 5NF then your new schema consisting of 3 tables is in 6NF. Having three candidate keys does not violate 5NF but it would violate 6NF.
From the data integrity point of view however 6NF has significant disadvantages. It is normally the case that some dependencies are lost. For example your original table enforces the constraint that every user has a name and password. Your 6NF version can't do that - at least not in SQL if you want to permit inserts to all the tables. 6NF is useful for some specific situations (temporal data) but in general 5NF is more useful and desirable from a data integrity perspective.
This doesn't answer your performance question but I thought it was worth pointing out.
You are normalizing too much in my opinion. You will hurt performance not only on inserts/updates but also on selects since you are now joining 3 tables instead of doing a straight insert/select/update/delete in one table.
I disagree that the NOT NULL UNIQUE is hackish but I do find strange that there's such a constraint on a name column.