Should I create a surrogate key instead of a composite key? - mysql

Structure:
Actor <=== ActorMovie ===> Movie
ActorMovie: ActorID (fk), MovieId (fk)... ===> pk: (ActorID, MovieID)
Should I create a surrogate key for the ActorMovie table, like this?
ActorMovie: ActorMovieID (pk), ActorID (fk), MovieId (fk)...

Conventions are good if they are helpful
"SQL Antipatterns", Chapter 4, "ID Required"
Intention of primary key
A primary key is what you use to identify a row: its unique address in the table. That means a surrogate column is not the only thing that can serve as a primary key. In fact, a primary key should be:
Unique. It identifies each row. If it is compound, every combination of its columns' values must be unique.
Minimal. It cannot be reduced (i.e. if it is compound, no column can be omitted without losing uniqueness).
Single. Each table can have only one primary key.
Compound versus surrogate
There are cases where a surrogate key has benefits. The most common one is a table of people. Can the combination of first_name + last_name + taxpayer_id be unique? In most cases, yes. But in theory duplicates can occur. In that case a surrogate key guarantees that every row can still be uniquely identified.
However, for a many-to-many link between tables, the linking table should contain each pair only once. In fact, without a key on the pair you would have to check that the link does not already exist before inserting (a duplicate is a redundant row, because it holds no additional information, unless your design has a special intention to store that). Therefore your combination of ActorID + MovieID satisfies all the conditions for a primary key, and there is no need to create a surrogate key. You can create one, but it makes little sense (if any), because it has no meaning other than numbering the rows. On the other hand, with a compound key you get:
A uniqueness check by design. Rows will be unique; no duplicate links will be allowed. That makes sense, because there is no need to create a link that already exists.
No redundant (and thus harder to understand) column in the design. That makes your design simpler and more readable.
In conclusion: yes, there are cases where a surrogate key should (or even must) be used, but in your particular case it would definitely be an antipattern. Use the compound key.
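The "unique check by design" point can be sketched with a small, self-contained example. This is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the column names beyond ActorID and MovieID, the sample rows, and the link() helper are assumptions, not part of the question.

```python
import sqlite3

# Hypothetical schema mirroring the question's Actor / ActorMovie / Movie tables.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs
conn.executescript("""
    CREATE TABLE Actor (ActorID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    CREATE TABLE Movie (MovieID INTEGER PRIMARY KEY, Title TEXT NOT NULL);
    CREATE TABLE ActorMovie (
        ActorID INTEGER NOT NULL REFERENCES Actor (ActorID),
        MovieID INTEGER NOT NULL REFERENCES Movie (MovieID),
        PRIMARY KEY (ActorID, MovieID)   -- the compound key is the only key
    );
    INSERT INTO Actor VALUES (1, 'Actor A');
    INSERT INTO Movie VALUES (10, 'Movie X');
    INSERT INTO ActorMovie VALUES (1, 10);
""")


def link(actor_id, movie_id):
    """Try to insert a link; return False if the database rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO ActorMovie VALUES (?, ?)", (actor_id, movie_id)
            )
        return True
    except sqlite3.IntegrityError:
        return False


# The pair (1, 10) already exists, so the primary key rejects the second insert.
duplicate_accepted = link(1, 10)
link_count = conn.execute("SELECT COUNT(*) FROM ActorMovie").fetchone()[0]
```

No application-side "does this link already exist?" query is needed; the key does the check.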
References:
Primary keys in SQL
SQL Antipatterns by Bill Karwin

Let me just mention a detail that seems to have been missed by other posters: InnoDB tables are clustered.
If you have just a primary key, your whole table will be represented by a lone B-Tree, which is very efficient. Adding a surrogate would just create another B-Tree (and "fatter" than expected to boot, due to how clustering works), without benefit to offset the added overhead.
Surrogates have their place, but junction tables are usually not it.

I'd always go with the composite key. My reasoning:
You will probably never use the surrogate key anywhere.
You will reduce the number of indexes/constraints on the table, as you will most certainly still need indexes over actor and movie.
You will always search for either a movie or an actor anyway.
Unless you have a scenario where you will actually use the surrogate key outside of its own table, I'd go with the composite key.

If you want to associate other data elements with the join table, such as the name(s) of the role(s) played (which might be a child table) then I certainly would. If you were sure that you never wanted to then I'd consider it as optional.

Consider the first normal form (1NF) of database design normalization.
I would make ActorID and MovieID a unique key combination and then create a primary key ActorMovieID.
See the same question here: Two foreign keys instead of primary
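For comparison, the surrogate-plus-unique layout suggested in this answer might look like the sketch below. It again uses Python's sqlite3 module as a stand-in for MySQL; the foreign keys to Actor and Movie are left out for brevity, and the sample values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ActorMovie (
        ActorMovieID INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        ActorID INTEGER NOT NULL,
        MovieID INTEGER NOT NULL,
        UNIQUE (ActorID, MovieID)  -- still needed to keep each pair unique
    );
""")

conn.execute("INSERT INTO ActorMovie (ActorID, MovieID) VALUES (1, 10)")
try:
    conn.execute("INSERT INTO ActorMovie (ActorID, MovieID) VALUES (1, 10)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

rows = conn.execute(
    "SELECT ActorMovieID, ActorID, MovieID FROM ActorMovie"
).fetchall()
```

Note that the UNIQUE constraint is doing the work the composite primary key would otherwise do, so this layout carries one more column and one more index than the composite-key version.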

On this subject, my point is very simple: surrogate primary keys ALWAYS work, while composite keys MIGHT NOT ALWAYS work, for multiple reasons.
So when you start asking yourself 'is composite better than surrogate', you have already started losing time. Go for the surrogate. It always works. Then move on to the next step.

Related

Should one combine foreign keys that point to the same table if all columns are required?

I encounter this situation frequently. An example,
A user is uniquely identified by appId, externalUserId.
Table xxxContract has a foreign key (fileUploadId, appId, externalUserId) to table fileUpload that ensures the file upload belongs to the specified user.
Table xxxContract has a foreign key (businessId, appId, externalUserId) to table business that ensures the business belongs to the specified user.
With the above two, we guarantee user A's file upload won't be used as a contract for user B's business.
xxxContract also has a fileTypeId column that is STORED GENERATED to a certain value that says "This contract is of file type XXX_CONTRACT"
Table xxxContract also has a foreign key (fileUploadId, fileTypeId) to table fileUpload.
This guarantees we only use XXX_CONTRACT file uploads for xxxContract, and not accidentally use other file types.
Given the above, we have this situation where we have two foreign keys that point to the same table fileUpload, and even have overlapping columns,
(fileUploadId, appId, externalUserId)
(fileUploadId, fileTypeId)
And all the columns are NOT NULL.
So, it seems to me like it's safe to combine the foreign keys into one larger foreign key,
(fileUploadId, appId, externalUserId, fileTypeId)
And we'll still have the same guarantees as before.
My gut feeling is that I should not combine the foreign keys because separating them by meaning and giving the FKs meaningful names helps with maintainability.
But I've never had a formal education with these things so I'd like to know what the industry standard is.
Related, is there a performance benefit to combining them vs. separating them?
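To make the setup concrete, here is a minimal sketch of the two separate foreign keys, using Python's sqlite3 module as a stand-in for MySQL. Several details are assumptions made for illustration: the contractId column, the constraint names, the file type value 7, and the sample rows. The STORED GENERATED column is swapped for a plain column with a DEFAULT and a CHECK, and the foreign key to business is omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs
conn.executescript("""
    CREATE TABLE fileUpload (
        fileUploadId   INTEGER PRIMARY KEY,
        appId          INTEGER NOT NULL,
        externalUserId TEXT    NOT NULL,
        fileTypeId     INTEGER NOT NULL,
        -- Each composite foreign key needs a matching unique key to point at.
        UNIQUE (fileUploadId, appId, externalUserId),
        UNIQUE (fileUploadId, fileTypeId)
    );
    CREATE TABLE xxxContract (
        contractId     INTEGER PRIMARY KEY,
        fileUploadId   INTEGER NOT NULL,
        appId          INTEGER NOT NULL,
        externalUserId TEXT    NOT NULL,
        -- Stand-in for the generated column: always the contract file type.
        fileTypeId     INTEGER NOT NULL DEFAULT 7 CHECK (fileTypeId = 7),
        CONSTRAINT fk_upload_belongs_to_user
            FOREIGN KEY (fileUploadId, appId, externalUserId)
            REFERENCES fileUpload (fileUploadId, appId, externalUserId),
        CONSTRAINT fk_upload_is_contract_type
            FOREIGN KEY (fileUploadId, fileTypeId)
            REFERENCES fileUpload (fileUploadId, fileTypeId)
    );
    INSERT INTO fileUpload VALUES (1, 100, 'userA', 7);  -- userA's contract file
    INSERT INTO fileUpload VALUES (2, 100, 'userA', 3);  -- userA's other file
""")


def try_contract(file_upload_id, app_id, external_user_id):
    """Insert a contract row; return False if a constraint rejects it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO xxxContract (fileUploadId, appId, externalUserId) "
                "VALUES (?, ?, ?)",
                (file_upload_id, app_id, external_user_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = try_contract(1, 100, "userA")          # right user, right file type
wrong_user = try_contract(1, 100, "userB")  # fails fk_upload_belongs_to_user
wrong_type = try_contract(2, 100, "userA")  # fails fk_upload_is_contract_type
```

With separate constraints, each rule has its own name, which is the maintainability argument made in the question.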
But I've never had a formal education with these things so I'd like to know what the industry standard is.
The standard is, that there is no standard.
As you already noted, you can use multiple columns to define a primary key. When those columns carry real-world meaning, this is called a natural primary key. For instance, a user can be uniquely identified by first name, last name and birthdate (at least almost always).
Keys of this kind are often called composite keys, because no single column is enough on its own; only combined do they form a primary key.
Surrogate (or artificial) primary keys are also well known: an id column using auto-increment.
So, as to your question: yes, if you have 3 columns that already form a natural primary key, it is completely safe to add more columns. Since the 3 columns already present will uniquely identify the row, there is no harm in adding a 4th, 5th or even 6th column to the key.
Whether you use natural or surrogate primary keys depends on personal preference, I'd say. I never use natural keys, even on tables where it is possible.
Keep in mind that whenever you need to delete or update something, you need to know the primary key. Hence, with natural keys, you need to pass multiple values through many method calls, while surrogate keys offer the advantage of a single id that uniquely identifies a row. No more information is required.
Performance-wise, I assume that (integer-based) surrogate primary keys tend to be faster than (string-based) natural primary keys. There are also fewer columns to consider when writing queries and designing indexes.

Why we should have an ID column in the table of users?

It's obvious that we already have another unique piece of information about each user: the username. Then why do we need another unique thing for each user? Why should we also have an id for each user? What would happen if we omitted the id column?
Even if your username is unique, there are a few advantages to having an extra id column instead of using the varchar as your primary key.
Some people prefer to use an integer column as the primary key, to serve as a surrogate key that never needs to change, even if other columns are subject to change. Although there's nothing preventing a natural primary key from being changeable too, you'd have to use cascading foreign key constraints to ensure that the foreign keys in related tables are updated in sync with any such change.
The primary key being a 32-bit integer instead of a varchar can save space. The choice between an int and a varchar foreign key column in every other table that references your user table can be a good reason on its own.
Inserting into the primary key index is a little more efficient if you add new rows to the end of the index, compared to wedging them into the middle of the index. Indexes in MySQL tables are usually B+Tree data structures, and you can study these to understand how they perform.
Some application frameworks prefer the convention that every table in your database has a primary key column called id, instead of using natural keys or compound keys. Following such conventions can make certain programming tasks simpler.
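The cascading foreign key constraint mentioned in the first point above can be sketched as follows. This is an illustration only, run against SQLite through Python's sqlite3 module rather than MySQL; the posts table and the sample usernames are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs
conn.executescript("""
    -- Natural key: the username itself is the primary key.
    CREATE TABLE users (username TEXT PRIMARY KEY);
    CREATE TABLE posts (
        post_id INTEGER PRIMARY KEY,
        author  TEXT NOT NULL
                REFERENCES users (username) ON UPDATE CASCADE
    );
    INSERT INTO users VALUES ('alice');
    INSERT INTO posts VALUES (1, 'alice');
""")

# Renaming the user changes the primary key value; the cascade rewrites
# every referencing row so the foreign key stays in sync.
conn.execute("UPDATE users SET username = 'alice2' WHERE username = 'alice'")
author_after_rename = conn.execute(
    "SELECT author FROM posts WHERE post_id = 1"
).fetchone()[0]
```

The cascade keeps the data consistent, but every referencing row has to be rewritten, which is the cost a never-changing surrogate key avoids.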
None of these issues are deal-breakers. And there are also advantages to using natural keys:
If you look up rows by username more often than you search by id, it can be better to choose the username as the primary key, and take advantage of the index-organized storage of InnoDB. Make your primary lookup column be the primary key, if possible, because primary key lookups are more efficient in InnoDB (you should be using InnoDB in MySQL).
As you noticed, if you already have a unique constraint on username, it seems a waste of storage to keep an extra id column you don't need.
Using a natural key means that foreign keys contain a human-readable value, instead of an arbitrary integer id. This allows queries to use the foreign key value without having to join back to the parent table for the "real" value.
The point is that there's no rule that covers 100% of cases. I often recommend that you should keep your options open, and use natural keys, compound keys, and surrogate keys even in a single database.
I cover some issues of surrogate keys in the chapter "ID Required" in my book SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming.
This identifier is known as a Surrogate Key. The page I linked lists both the advantages and disadvantages.
In practice, I have found them to be advantageous because even superkey data can change over time (i.e. a user's email address may change and thus any corresponding relations must change), but a surrogate key never needs to change for the data it identifies because its value is meaningless to the relation.
It's also nice from a JOIN standpoint because it can be an integer with a smaller key length than a varchar.
I can say that in practice I prefer to use them. I have been bitten too many times by having multiple-column primary keys or a data-representative superkey used across tables having to become non-unique later due to changing requirements during development, and that is not a situation you want to deal with.
In my opinion, every table should have a unique, auto-incremented id.
Here are some practical reasons. If you have duplicate rows, you can readily determine which row to delete. If you want to know the order in which rows were inserted, you have that information in the id. As for users, there's more than one "John Smith" in the world. An id provides a key for foreign references.
Finally, just about anything that might describe a user -- a name, an address, a telephone number, an email address -- could change over time.
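The "which duplicate do I delete?" point can be illustrated with a short sketch. It uses Python's sqlite3 module as a stand-in for MySQL, and the table contents are made up. In MySQL the subquery would typically need to be wrapped in a derived table, because MySQL does not allow a DELETE to select from its own target table directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY AUTOINCREMENT,
        name  TEXT NOT NULL,
        email TEXT NOT NULL
    );
    INSERT INTO users (name, email) VALUES ('John Smith', 'john@example.com');
    INSERT INTO users (name, email) VALUES ('John Smith', 'john@example.com');
    INSERT INTO users (name, email) VALUES ('Jane Doe', 'jane@example.com');
""")

# Keep the earliest row of each (name, email) group and delete the rest.
# Without the id column there would be no way to tell the two copies apart.
conn.execute("""
    DELETE FROM users
    WHERE id NOT IN (SELECT MIN(id) FROM users GROUP BY name, email)
""")
remaining = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
```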
In MySQL we have:
1: index fields, 2: unique fields, and 3: PK fields.
Index means the column can be looked up quickly.
Unique means a value may appear in only one row of the table.
PK = index + unique (and NOT NULL).
A table may have lots of unique fields, like
username, passport code or email.
But you still need a field like ID that is both unique and indexed (= PK). First, it identifies exactly one thing and never changes; second, it is unique; third, it is simple (because it is usually a number).
One reason to have a numeric id is that an index on it is leaner than one on a text field, reducing index size and the processing time required to look up a specific user. It also takes fewer bytes to store when cross-referencing a user from a different table (relational database).

When we don't need a primary key for our table?

Will it ever happen that we design a table that doesn't need a primary key?
No.
The primary key does a lot of stuff behind-the-scenes, even if your application never uses it.
For example: clustering improves efficiency (because heap tables are a mess).
Not to mention, if ANYONE ever has to do something on your table that requires pulling a specific row and you don't have a primary key, you are the bad guy.
Yes.
If you have a table that will always be fetched completely, and is being referred-to by zero other tables, such as some kind of standalone settings or configuration table, then there is no point having a primary key, and the argument could be made by some that adding a PK in this situation would be a deception of the normal use of such a table.
It is rare, and probably when it is most often done it is done wrongly, but they do exist, and such instances can be valid.
Depends.
What is primary key / unique key?
In relational database design, a unique key can uniquely identify each row in a table, and is closely related to the Superkey concept. A unique key comprises a single column or a set of columns. No two distinct rows in a table can have the same value (or combination of values) in those columns if NULL values are not used. Depending on its design, a table may have arbitrarily many unique keys but at most one primary key.
So, when you don't have to differentiate (uniquely identify) each row, you don't have to use a primary key.
For example, a big table for logs: without a primary key, the data can be somewhat smaller and inserts faster.
A primary key is not mandatory, but it is not good practice to create tables without one. The DBMS automatically creates an index on the PK, but you can also make a column unique and index it; e.g. the user_name column in a users table is usually made unique and indexed, so you may choose to skip the PK here. But it is still a bad idea, because the PK can be referenced by foreign keys for referential integrity.
In general, you should almost always have a PK in a table unless you have a very strong reason to justify not having one.
Link tables (in a many-to-many relationship) may not have a primary key. But I personally like to have a PK in those tables as well.

Why are composite primary keys still around?

I'm assigned to migrate a database to a mid-class ERP.
The new system uses composite primary keys here and there, and from a pragmatic point of view, why?
Compared to autogenerated IDs, I can only see negative aspects;
Foreign keys become blurry
Harder migration or db-redesigns
Inflexible as business change. (My car has no reg.plate..)
Same integrity better achieved with constraints.
It's falling back to the design concept of candidate keys, which I don't see the point of either.
Is it a habit/artifact from the floppy-days (minimizing space/indexes), or am I missing something?
//edit//
Just found good SO-post: Composite primary keys versus unique object ID field
//
Composite keys are required when your primary keys are non-surrogate and inherently, um, composite, that is, breakable into several non-related parts.
Some real-world examples:
Many-to-many link tables, in which the primary keys are composed of the keys of the entities related.
Multi-tenant applications when tenant_id is a part of primary key of each entity and the entities are only linkable within the same tenant (constrained by a foreign key).
Applications processing third-party data (with already provided primary keys)
Note that logically, all this can be achieved using a UNIQUE constraint (additional to a surrogate PRIMARY KEY).
However, there are some implementation specific things:
Some systems won't let a FOREIGN KEY refer to anything that is not a PRIMARY KEY.
Some systems would only cluster a table on a PRIMARY KEY, hence making the composite the PRIMARY KEY would improve performance of the queries joining on the composite.
Personally I prefer the use of surrogate keys. However, in joining tables that consist only of the ids from two other tables (to create a many-to-many relationship), composite keys are the way to go, and taking them out would make things more difficult.
There is a school of thought that surrogate keys are always bad and that if you don't have uniqueness to record through the use of natural keys you have a bad design. I strongly disagree with this (if you aren't storing SSN or some other unique value I defy you to come up with a natural key for a person table for instance.) But many people feel that it is necessary for proper normalization.
Sometimes having a composite key reduces the need to join to another table. Sometimes it doesn't. So there are times when a composite key can boost performance as well as times when it can harm performance. If the key is relatively stable, you may be fine with faster performance on select queries. However, if it is something that is subject to change, like a company name, you could be in a world of hurt when company A changes its name and you have to update a million associated records.
There is no one size fits all in database design. There are times when composite keys are helpful and times when they are horrible. There are times when surrogate keys are helpful and times when they are not.
Composite primary keys provide better performance when they are used as foreign keys in other tables, and they reduce table reads; sometimes they can be life savers. If you use surrogate keys, you have to go to that table to get the natural key information.
For example (pure example - so we are not talking DB design here), let's say you have an ORDER table and an ORDER_ITEM table. If you use ProductId and LineNumber (UPDATE: and, as Pedro mentioned, OrderId or even better OrderNumber) as the composite primary key in ORDER_ITEM, then in your cross table for SHIPPING you would be able to have ProductId in SHIPPING_ORDERITEM. This can massively boost performance if, for example, you have run out of that product and need to find all items with that ProductId that need to be shipped, without needing a join.
On the other hand, if you use a surrogate key, you have to join, and you end up with a very inefficient SQL execution plan where it has to do bookmark lookups on several indexes.
See more on bookmark lookups, which become a major issue when using surrogate keys.
Natural primary keys are brittle.
Suppose we have built a system around a natural PK on (CountryCode, PhoneNumber), and several years down the road we need to add Extension, or change the PK to one column: Email. If these PK columns are propagated to all child tables, this becomes very expensive.
A few years ago there were some systems that were built assuming that Social Security Number is a natural PK, and had to be redesigned to use identities, when the SSN became non-unique and nullable.
Because we cannot predict the future, we don't know if later on some change will render obsolete what used to be a perfectly correct and complete model.
The very simple answer is data integrity. If the data is to be useful and accurate then the keys are presumably required. Having an "autogenerated id" doesn't remove the requirement for other keys as well. The alternative is not to enforce uniqueness and accept that data will be duplicated and will almost inevitably contain anomalies and lead to errors as a result. Why would you want that?
In short, the purpose of composite keys is to use the database to enforce one or more business rules. In other words: protect the integrity of your data.
Ex. You have a list of parts that you buy from suppliers. You could create your supplier and parts tables like this:
SUPPLIER
SupplierId
SupplierName
PART
PartId
PartName
SupplierId
Uh oh. The parts table allows for duplicate data. Since you used a surrogate key that was autogenerated, you're not enforcing the fact that a part from a supplier should only be entered once. Instead, you should create the PART table like such:
PART
SupplierId
SupplierPartId
PartName
In this example, your parts come from specific suppliers and you want to enforce the rule: "A single supplier can only supply a single part once" in the PARTS table. Hence, the composite key. Your composite key prevents accidental duplicate entry of a part.
You can always leave business rules out of your database and leave them to your application, but by keeping the rule in the database (via a composite key), you ensure that the business rule is enforced everywhere, especially if you should ever decide to allow multiple applications to access the data.
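The difference between the two PART designs above can be sketched side by side. This is an illustration using Python's sqlite3 module as a stand-in for the actual database; the table names part_surrogate and part_composite, the helper function, and the sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Surrogate key only: nothing stops the same part being entered twice.
    CREATE TABLE part_surrogate (
        PartId     INTEGER PRIMARY KEY AUTOINCREMENT,
        PartName   TEXT NOT NULL,
        SupplierId INTEGER NOT NULL
    );
    -- Composite key: one row per (supplier, supplier's part number).
    CREATE TABLE part_composite (
        SupplierId     INTEGER NOT NULL,
        SupplierPartId TEXT NOT NULL,
        PartName       TEXT NOT NULL,
        PRIMARY KEY (SupplierId, SupplierPartId)
    );
""")


def count_after_double_insert(table, columns, values):
    """Insert the same row twice and report how many rows the table holds."""
    placeholders = ", ".join("?" for _ in values)
    sql = f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"
    for _ in range(2):
        try:
            with conn:
                conn.execute(sql, values)
        except sqlite3.IntegrityError:
            pass  # the key rejected the duplicate
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


surrogate_rows = count_after_double_insert(
    "part_surrogate", "PartName, SupplierId", ("Widget", 1)
)
composite_rows = count_after_double_insert(
    "part_composite", "SupplierId, SupplierPartId, PartName", (1, "W-100", "Widget")
)
```

The surrogate-only table silently ends up with two rows for the same part, while the composite key leaves one.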
Just as functions encapsulate a set of instructions, or database views abstract base table connections, so too do surrogate keys abstract the meaning of the entity they are placed on.
If, for example, you have a table that holds vehicle data, applying a surrogate VehicleId abstracts what it means to be a vehicle from a data point of view. When you reference VehicleId = 1, you are most surely talking about a vehicle of some sort, but do we know if it is a 2008 Chevy Impala, or a 1991 Ford F-150? No. Can the underlying data of whatever Vehicle #1 is change at any time? Yes.
Short answer: Multi-column foreign keys naturally refer to multi column primary keys. There can still be an autogenerated id column that is part of the primary key.
Philosophical answer: The primary key is the identity of the row. If there is a bit of information that is an intrinsic part of the identity of the row (such as which customer the article belongs to, in a multi-customer wiki), that information should be part of the primary key.
An example: System for organizing LAN parties
The system supports several LAN parties with the same people and organizers attending thus:
CREATE TABLE users ( users_id serial PRIMARY KEY, ... );
And there are several parties:
CREATE TABLE parties ( parties_id serial PRIMARY KEY, ... );
But most of the other stuff needs to carry the information about which party it is linked to:
CREATE TABLE ticket_types (
ticket_types_id serial,
parties_id integer REFERENCES parties,
name text,
....
PRIMARY KEY(ticket_types_id, parties_id)
);
...this is because we want to refer to primary keys. Foreign key on table attendances points to table ticket_types.
CREATE TABLE attendances (
attendances_id serial,
parties_id integer REFERENCES parties,
ticket_types_id integer,
PRIMARY KEY (attendances_id, parties_id),
-- the composite foreign key points at ticket_types, whose primary key is (ticket_types_id, parties_id)
FOREIGN KEY (ticket_types_id, parties_id) REFERENCES ticket_types
);
While I prefer surrogate keys, I use composite keys in a few cases. The composite key may consist entirely or partially of surrogate key fields.
Many-to-many join tables. These usually require a unique key on the key pair anyway. In some cases additional columns may be included in the key.
Weak child tables. Things like order lines do not stand on their own. In this case I use the parent (orders) table's primary key as part of the composite key.
When there are multiple weak tables related to an entity, it may be possible to eliminate a table from the join set when querying child data. In the case of grandchild tables, it is possible to join the grandparent to grandchild without involving the table in the middle.
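The grandparent-to-grandchild join can be sketched as below. The customers / orders / order_lines tables and their columns are hypothetical, and SQLite (via Python's sqlite3 module) stands in for whatever database is actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        customer_id INTEGER NOT NULL REFERENCES customers (customer_id),
        order_no    INTEGER NOT NULL,
        PRIMARY KEY (customer_id, order_no)
    );
    -- The grandchild carries the whole ancestry in its composite key.
    CREATE TABLE order_lines (
        customer_id INTEGER NOT NULL,
        order_no    INTEGER NOT NULL,
        line_no     INTEGER NOT NULL,
        product     TEXT NOT NULL,
        PRIMARY KEY (customer_id, order_no, line_no),
        FOREIGN KEY (customer_id, order_no)
            REFERENCES orders (customer_id, order_no)
    );
    INSERT INTO customers VALUES (1, 'Acme');
    INSERT INTO orders VALUES (1, 1);
    INSERT INTO order_lines VALUES (1, 1, 1, 'bolt'), (1, 1, 2, 'nut');
""")

# customers joins straight to order_lines; the orders table is not involved.
rows = conn.execute("""
    SELECT c.name, l.product
    FROM customers AS c
    JOIN order_lines AS l ON l.customer_id = c.customer_id
    ORDER BY l.line_no
""").fetchall()
```

With a surrogate order_id on orders and order_lines, the same query would have to go through orders to reach the customer.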

Should I add a autoinc primary key for the sake of having a primary key?

I have a table which needs 2 fields. One will be a foreign key, the other is not necessarily unique. There really isn't a reason that I can find to have a primary key other than having read that "every single table ever needs needs needs a primary key".
Edit:
Some good thoughts in here.
For clarity's sake, I will give you an example that is similar to my database needs.
Let's say I have a table with product type, quantity, cost, and manufacturer.
Product type will not always be unique (say, MP3 Player), but manufacturer/product type will be unique (say, Apple MP3 Player). Forget about the various models the manufacturers make for this example. For ease, this table has an autoincrementing primary key.
I am giving a point value and logging how often these products are searched for, added to a cart, and bought for display on a list of hot items.
The way I have it laid out currently is a second table with an FK pointing to the main table, and a second column for the total number of "popularity points" this item has gained.
The answers I have seen here have made me think that perhaps I should just add a "points" column to my primary products table so that I could track it there... but that seems like I'm not normalizing my database enough.
My problem is I'm currently mostly just a hobbyist doing this for learning, and don't have the luxury of a DBA to tell me how to set up my tables, so I have to learn both the coding side and the database side.
You have to distinguish between primary key and surrogate key. Auto-incremented column would be a particular case of the latter. Your question, therefore, is twofold:
Does every table need to have a primary key?
Does every table need to have a surrogate primary key?
The answer to first question is YES except in some special cases (association table for many-to-many relationship arguably being an example of such a special case). The reason for this is that you usually need to be able (if not right now then in the future) to consistently address individual rows of that table - for updates / deletion, for example.
The answer to the second question is NO. If your table represents a core business entity, or it can be referenced from a many-to-one association, having a surrogate key is probably a good idea; but it's not absolutely necessary.
It's somewhat unclear what your table's function is; from your description it sounds like it has "collection of values" semantics (FK to "main" table + value). Certain ORMs don't support surrogate keys in such circumstances; if that's what has prompted your question it's OK to leave the surrogate (or even primary in case of bag) key off.
For the sake of having something unique and as identifier, please please please please have a primary key in every table :)
It also helps forward compatibility in case there are future schema changes and the 2 values are no longer unique. Plus, memory is much cheaper now; feel free to treat it as an investment. ;)
I am not sure what the other field looks like, but I am guessing it would be OK to have a composite primary key based on the FK and the other field. Then again, I don't know your exact scenario.
I would say that it's absolutely necessary to have some sort of primary key in every table.
Interestingly enough, one of the DBAs for a Viacom property once told me that there was really no discernible difference between using an INT UNSIGNED and a VARCHAR(n) as a primary key in MySQL. This was in reference to a user table with more than 64 million rows. I believe n can be decently large (<=100), but I forget what they limited it to. Unfortunately, I don't have any empirical data to back that up.
You don't HAVE to have a primary key on every table, but it is considered best practice to have them as they are almost always necessary on a normalized relational database design. If you're finding a bunch of tables you don't think need PKs, then you should revisit the design/layout of your tables. To read more on normalization see here.
A couple of scenarios I can think of where you may not need or want a PK on a table are a table strictly for logging (to limit the performance cost of writing the log and maintaining a unique index) and a table where you're just storing data to pump through an application for test purposes.
I'll be contrary and say you shouldn't add the key if you don't have a reason for it. It is very easy to add this column later if needed.
Strictly speaking, a surrogate key is not necessary, but a primary key is.
Many people use the term "primary key" to mean a single column that is an auto-incrementing integer. But this is not an accurate definition of a primary key.
A primary key is a constraint on one or more columns that serve to identify each row uniquely. Yes, you need some way of addressing individual rows. This is a crucial characteristic of a relation (aka a table).
You say you have a foreign key and another column that is not unique. But are these two columns taken together unique? If so, you can declare a primary key constraint over these two columns.
Defining another surrogate key (also called a pseudokey -- the auto-incrementing type) is a convenience because some people don't like to have to reference two columns when selecting a single row. Or they want the freedom to change values in the other columns easily, without changing the value of the primary key by which one addresses the individual row.
This is a technique related to normalization and a pretty good practice. A key made up of an auto incrementing number has many benefits:
You have a PK that does not pertain to the data.
You never have to change the PK value
Every row will automatically have a unique identifier