I have a database in which I have declared a primary key. Later on, while implementing the database, I realized that I will have to create an auto-incrementing surrogate key and switch my current primary key to that, as my current primary key will inevitably have multiple occurrences. I have scoured the depths of Stack Overflow and other sites searching for an answer, but I cannot find a reasonable solution.
Specifically, I am making this database for a fraternity, in which each member is initiated with a unique scroll number. It seemed like a good idea to use the scroll number as the primary key, until I realised that members with more than one major of study will have two tuples (one for each major; the database has to be in 3NF). That considered, is creating a surrogate key the way to go, or is there a far more reasonable solution to the problem?
You will need a many-to-many relationship between members and areas of study. So yes, you'll need a surrogate key.
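A minimal sketch of what such a many-to-many design could look like, run through Python's built-in sqlite3 module as a stand-in for the real database. The table and column names (member, major, member_major, scroll_number, major_id) are assumptions for illustration, not taken from the question. In this sketch the surrogate key sits on the major table, and the junction table is keyed on the pair, so a member with two majors gets two junction rows rather than two member rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE member (
    scroll_number INTEGER PRIMARY KEY,
    name          TEXT NOT NULL
);
CREATE TABLE major (
    major_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key (assumed name)
    name     TEXT NOT NULL UNIQUE
);
-- One row per (member, major) pair; the pair itself is the key.
CREATE TABLE member_major (
    scroll_number INTEGER NOT NULL REFERENCES member(scroll_number),
    major_id      INTEGER NOT NULL REFERENCES major(major_id),
    PRIMARY KEY (scroll_number, major_id)
);
""")
conn.execute("INSERT INTO member VALUES (101, 'Alex')")
conn.executemany("INSERT INTO major (name) VALUES (?)", [("Physics",), ("History",)])
conn.executemany("INSERT INTO member_major VALUES (?, ?)", [(101, 1), (101, 2)])

# All majors for one member, without duplicating the member row.
majors = [row[0] for row in conn.execute("""
    SELECT major.name
    FROM member_major JOIN major USING (major_id)
    WHERE scroll_number = 101
    ORDER BY major.name
""")]
member_rows = conn.execute("SELECT COUNT(*) FROM member").fetchone()[0]
print(majors, member_rows)
```

Whether the member table itself keeps the scroll number or gets its own auto-incrementing id is a separate choice; the junction table works either way.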
Context
I'm learning about identifying and non-identifying relationships, and I'm wondering how I'd express them in MySQL. For practice, I've been working on a database for Pokemon. For context, every few years a new version of the game comes out and updates a lot of things, e.g. a certain move that a Pokemon can use may get stronger. Each such update is called a generation. Moreover, each move has an elemental type, like fire or water.
So my three entities are move, generation, and type. Since I want to keep track of how a Pokemon move changes over time, a move is in an identifying relationship with generation. The name of the move is not enough to identify it, since, e.g. the move "Karate Chop" is different in generation 1 than in generation 2. So the corresponding primary key in generation, genID, should be part of my primary key for move.
On the other hand, I want to store type as a foreign key in move, but I believe this is a non-identifying relationship. Every move has a type, so I believe it's what's called a mandatory non-identifying relationship.
My attempt
So how would I write this in MySQL? I think it would be something like
CREATE TABLE move (
    moveID int NOT NULL,
    genID int NOT NULL,
    typeID int NOT NULL,
    PRIMARY KEY (moveID, genID),
    CONSTRAINT FK_GenMove FOREIGN KEY (genID) REFERENCES generation(genID),
    CONSTRAINT FK_TypeMove FOREIGN KEY (typeID) REFERENCES type(typeID)
);
However, I couldn't find an example where a foreign key was part of the primary key in the MySQL book I'm using (they discuss identifying relationships, but I couldn't find an example with syntax). Specifically, I'm unsure whether the order I list the constraints matters (should I declare my primary keys first, then my foreign keys?)
Indices
Also, I believe that my composite primary key will automatically become a clustered index for my table. A common query one would do is filtering move by generation/genID. So this should automatically be efficient since I have an index on genID, even though it's part of a composite key, right? Or do I need to make a separate index for genID alone?
One thing that I realized the next day is that the order in which I declare my primary key matters. (moveID, genID) will sort by moveID first, then genID, whereas (genID, moveID) would sort the other way. Since I mentioned in my original post that I wanted the behavior of the latter case (picking out all moves in a given generation), as opposed to the former case, I felt that I should point that out.
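The effect of column order can be checked by asking the database for its query plan. Below is a small sketch using Python's built-in sqlite3 module; SQLite is only a stand-in here, and its plan text differs from what MySQL's EXPLAIN prints, but the leftmost-prefix idea is the same. The table names move_a and move_b are made up so that both orderings can be compared side by side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Same columns, two different primary-key orderings.
CREATE TABLE move_a (moveID INT NOT NULL, genID INT NOT NULL, typeID INT NOT NULL,
                     PRIMARY KEY (moveID, genID));
CREATE TABLE move_b (moveID INT NOT NULL, genID INT NOT NULL, typeID INT NOT NULL,
                     PRIMARY KEY (genID, moveID));
""")

def plan(table):
    """Return SQLite's plan text for filtering `table` by genID alone."""
    rows = conn.execute(
        f"EXPLAIN QUERY PLAN SELECT typeID FROM {table} WHERE genID = 1"
    ).fetchall()
    return rows[0][3]  # fourth column holds the human-readable detail

plan_a = plan("move_a")  # genID is the second key column: full scan
plan_b = plan("move_b")  # genID is the leading key column: index search
print(plan_a)
print(plan_b)
```

With genID as the second column the filter cannot use the primary-key index and the table is scanned; with genID first the same query becomes an index search.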
I am working on an assignment and I'm a little rusty with my SQL basics as I mainly work with already created tables, not with creating them. I was given a database model and asked to create it. I was told the model may have errors and to just correct them. Here is a snippet of the part I am having an issue with:
http://i.imgur.com/0KyMquZ.jpg
I've been trying to figure it out by Googling and researching, but I'm just not sure if there's something I'm not getting or if I need to adjust the model. The issue I am having is with the operation table and the connecting tables. The primary key for operation is made up of the three primary keys from the connecting tables and another primary key, date. Can that be done? If they were foreign keys in the other tables I think I could figure it out. I've been trying to figure out how to do it, but mostly I'm just trying to wrap my head around the concept of what this is showing. I just don't understand how or why. Wouldn't that composite primary key have to be in the other 3 tables, or are they fine split up? Shouldn't that composite primary key be referencing foreign keys in other tables? I'm just really confused. I'm ok working with databases but designing, not so much.
I would just ask my professor about it but we are never on the same page. I think I understand him in the moment and then I wind up more confused. I don't think it matters for this but it's MySQL.
Although it is technically allowable, I think that it is semantically meaningless in this case as it allows the following situation to exist.
A patient can be subject to the same operation on the same day multiple times by different doctors, but ...
the patient cannot be subject to the same operation on the same day multiple times by the same doctor, but ...
the patient can be subject to different operations from the same doctor on the same day.
To me, this primary key is nonsense and you might as well add a synthetic primary key and make these simple foreign key columns where appropriate.
There is no problem with declaring a composite PK where some of the components are also declared FKs. Logically, this is quite correct.
The effects on performance will be hard to predict, as will the usefulness of the index that MySQL builds based on the PK declaration. Some queries will be sped up, others won't.
After reading the comments, I now understand my misconception. I was thinking the primary composite key is made up of primary keys, duplicates from the other tables. I realize now that the composite primary key is made up of foreign keys and one primary key: date. Thanks for the clarification.
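For reference, a rough sketch of a composite primary key whose components are foreign keys plus a date, run through Python's built-in sqlite3 module rather than MySQL. The diagram from the question is not reproduced here, so every table and column name below (patient, doctor, treatment, operation, op_date) is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE patient   (patient_id   INTEGER PRIMARY KEY);
CREATE TABLE doctor    (doctor_id    INTEGER PRIMARY KEY);
CREATE TABLE treatment (treatment_id INTEGER PRIMARY KEY);
-- Each key column is a foreign key; together with the date they form the PK.
CREATE TABLE operation (
    patient_id   INTEGER NOT NULL REFERENCES patient(patient_id),
    doctor_id    INTEGER NOT NULL REFERENCES doctor(doctor_id),
    treatment_id INTEGER NOT NULL REFERENCES treatment(treatment_id),
    op_date      TEXT    NOT NULL,
    PRIMARY KEY (patient_id, doctor_id, treatment_id, op_date)
);
INSERT INTO patient VALUES (1);
INSERT INTO doctor VALUES (1), (2);
INSERT INTO treatment VALUES (1);
""")

insert = "INSERT INTO operation VALUES (?, ?, ?, ?)"
conn.execute(insert, (1, 1, 1, "2024-01-15"))
conn.execute(insert, (1, 2, 1, "2024-01-15"))  # different doctor, same day: allowed

try:
    conn.execute(insert, (1, 1, 1, "2024-01-15"))  # exact repeat of the first row
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

try:
    conn.execute(insert, (1, 9, 1, "2024-01-15"))  # doctor 9 does not exist
    unknown_doctor_rejected = False
except sqlite3.IntegrityError:
    unknown_doctor_rejected = True

print(duplicate_rejected, unknown_doctor_rejected)
```

The primary key rejects the repeated combination and the foreign keys reject references to rows that do not exist; the two constraints are independent of each other.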
I'm just doing some basic normalisation but I don't have the answer for this, wondering if you guys can give me some info on right/wrong, do's/dont's etc.
So if I have:
I've always set a primary key (unique auto incrementer on lookup tables), in the image the lookup tables would be "page_downloads" and "page_includes" but I can guarantee those columns will never get used as they will only be queried via the page_id, same for so many definition tables.
So my question is: "Is there any point? What is the best practice thing to do? Always create the primary key even though it will never be used or don't bother creating it as it is fine to use the indexed int column which refers to a primary key in another table. Eg the relationship in the picture (page_id to page_id). Thoughts?"
Thanks
D
No. While every table should have a PRIMARY KEY, it need not be a surrogate. In this instance, (page_id,file_id) is a valid compound PRIMARY KEY (as is (file_id,page_id)).
To add some info to Strawberry's valid observations.
There's no absolute answer or best practice regarding the surrogate keys and usually this boils down to individual preference. There are both advantages and disadvantages to using surrogate keys. Among the advantages, one could consider:
Immutability
Surrogate keys do not change while the row exists. This has the following advantages:
Applications cannot lose their reference to a row in the database (since the identifier never changes).
The primary or natural key data can always be modified, even with databases that do not support cascading updates across related foreign keys.
Requirement changes
Attributes that uniquely identify an entity might change, which might invalidate the suitability of natural keys.
Consider the following example:
An employee's network user name is chosen as a natural key. Upon
merging with another company, new employees must be inserted. Some of
the new network user names create conflicts because their user names
were generated independently (when the companies were separate). In
these cases, generally a new attribute must be added to the natural
key (for example, an original_company column). With a surrogate key,
only the table that defines the surrogate key must be changed. With
natural keys, all tables (and possibly other, related software) that
use the natural key will have to change.
Some problem domains do not clearly identify a suitable natural key.
Surrogate keys avoid choosing a natural key that might be incorrect.
Performance
Surrogate keys tend to be a compact data type, such
as a four-byte integer. This allows the database to query the single
key column faster than it could multiple columns. Furthermore a
non-redundant distribution of keys causes the resulting b-tree index
to be completely balanced. Surrogate keys are also less expensive to
join (fewer columns to compare) than compound keys.
Compatibility
When using several database application development systems, drivers, and object-relational mapping systems, such as Ruby on Rails or Hibernate, it is much easier to use integer or GUID surrogate keys for every table instead of natural keys in order to support database-system-agnostic operations and object-to-row mapping.
Uniformity
When every table has a uniform surrogate key, some
tasks can be easily automated by writing the code in a
table-independent way.
Validation
It is possible to design key-values that follow a
well-known pattern or structure which can be automatically verified.
For instance, the keys that are intended to be used in some column of
some table might be designed to "look differently from" those that are
intended to be used in another column or table, thereby simplifying
the detection of application errors in which the keys have been
misplaced. However, this characteristic of the surrogate keys should
never be used to drive any of the logic of the applications
themselves, as this would violate the principles of Database
normalization.
I'm assigned to migrate a database to a mid-class ERP.
The new system uses composite primary keys here and there, and from a pragmatic point of view, why?
Compared to autogenerated IDs, I can only see negative aspects:
Foreign keys become blurry
Harder migration or db-redesigns
Inflexible as the business changes. (My car has no reg. plate..)
Same integrity better achieved with constraints.
It's falling back to the design concept of candidate keys, which I don't see the point of either.
Is it a habit/artifact from the floppy-days (minimizing space/indexes), or am I missing something?
//edit//
Just found a good SO post: Composite primary keys versus unique object ID field
//
Composite keys are required when your primary keys are non-surrogate and inherently, um, composite, that is, breakable into several non-related parts.
Some real-world examples:
Many-to-many link tables, in which the primary keys are composed of the keys of the entities related.
Multi-tenant applications when tenant_id is a part of primary key of each entity and the entities are only linkable within the same tenant (constrained by a foreign key).
Applications processing third-party data (with already provided primary keys)
Note that logically, all this can be achieved using a UNIQUE constraint (additional to a surrogate PRIMARY KEY).
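As a rough illustration of that last point, here is a sketch of a surrogate PRIMARY KEY combined with a UNIQUE constraint over the composite, using Python's built-in sqlite3 module. The enrollment table and its columns are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE enrollment (
    enrollment_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    student_id    INTEGER NOT NULL,
    course_id     INTEGER NOT NULL,
    UNIQUE (student_id, course_id)                    -- the composite candidate key
);
""")
conn.execute("INSERT INTO enrollment (student_id, course_id) VALUES (1, 10)")
conn.execute("INSERT INTO enrollment (student_id, course_id) VALUES (1, 20)")

try:
    # Same (student_id, course_id) pair again: blocked by the UNIQUE constraint.
    conn.execute("INSERT INTO enrollment (student_id, course_id) VALUES (1, 10)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM enrollment").fetchone()[0]
print(duplicate_rejected, row_count)
```

The uniqueness rule is enforced exactly as it would be with a composite PRIMARY KEY; what differs is which key other tables reference and how the engine stores the rows.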
However, there are some implementation specific things:
Some systems won't let a FOREIGN KEY refer to anything that is not a PRIMARY KEY.
Some systems would only cluster a table on a PRIMARY KEY, hence making the composite the PRIMARY KEY would improve performance of the queries joining on the composite.
Personally I prefer the use of surrogate keys. However, in joining tables that consist only of the ids from two other tables (to create a many-to-many relationship) composite keys are the way to go, and thus taking them out would make things more difficult.
There is a school of thought that surrogate keys are always bad and that if you can't make a record unique through the use of natural keys you have a bad design. I strongly disagree with this (if you aren't storing SSN or some other unique value I defy you to come up with a natural key for a person table, for instance). But many people feel that it is necessary for proper normalization.
Sometimes having a composite key reduces the need to join to another table. Sometimes it doesn't. So there are times when a composite key can boost performance as well as times when it can harm performance. If the key is relatively stable, you may be fine with faster performance on select queries. However, if it is something that is subject to change like a company name, you could be in a world of hurt when company A changes its name and you have to update a million associated records.
There is no one size fits all in database design. There are times when composite keys are helpful and times when they are horrible. There are times when surrogate keys are helpful and times when they are not.
A composite primary key provides better performance when it is used as a foreign key in other tables, because it reduces table reads. Sometimes this can be a life saver. If you use surrogate keys, you have to go to the referenced table to get the natural key information.
For example (pure example - so we are not talking DB design here), let's say you have an ORDER table and ORDER_ITEM. If you use ProductId and LineNumber (UPDATE: and as Pedro mentioned OrderId or even better OrderNumber) as composite primary key in ORDER_ITEM, then in your cross table for SHIPPING, you would be able to have ProductId in the SHIPPING_ORDERITEM. This can massively boost your performance if, for example, you have run out of that product and need to find out all products of that ProductId that need to be shipped without a need to join.
On the other hand, if you use a surrogate key, you have to join and you end up with a very inefficient SQL execution plan where it has to do bookmark lookup on several indexes.
See more on bookmark lookups, which become a major issue when using surrogate keys.
Natural primary keys are brittle.
Suppose we have built a system around a natural PK on (CountryCode, PhoneNumber), and several years down the road we need to add Extension, or change the PK to one column: Email. If these PK columns are propagated to all child tables, this becomes very expensive.
A few years ago there were some systems that were built assuming that Social Security Number is a natural PK, and had to be redesigned to use identities, when the SSN became non-unique and nullable.
Because we cannot predict the future, we don't know if later on some change will render obsolete what used to be a perfectly correct and complete model.
The very simple answer is data integrity. If the data is to be useful and accurate then the keys are presumably required. Having an "autogenerated id" doesn't remove the requirement for other keys as well. The alternative is not to enforce uniqueness and accept that data will be duplicated and almost inevitably contain anomalies and lead to errors as a result. Why would you want that?
In short, the purpose of composite keys is to use the database to enforce one or more business rules. In other words: protect the integrity of your data.
Ex. You have a list of parts that you buy from suppliers. You could create your supplier and parts tables like such:
SUPPLIER
    SupplierId
    SupplierName
PART
    PartId
    PartName
    SupplierId
Uh oh. The parts table allows for duplicate data. Since you used a surrogate key that was autogenerated, you're not enforcing the fact that a part from a supplier should only be entered once. Instead, you should create the PART table like such:
PART
    SupplierId
    SupplierPartId
    PartName
In this example, your parts come from specific suppliers and you want to enforce the rule: "A single supplier can only supply a single part once" in the PART table. Hence, the composite key. Your composite key prevents accidental duplicate entry of a part.
You can always leave business rules out of your database and leave them to your application, but by keeping the rule in the database (via a composite key), you ensure that the business rule is enforced everywhere, especially if you should ever decide to allow multiple applications to access the data.
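A quick sketch of the two PART designs side by side, using Python's built-in sqlite3 module. The name PART_SURROGATE for the first design, the column types, and the sample values are assumptions added for the demonstration; the column names follow the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SUPPLIER (
    SupplierId   INTEGER PRIMARY KEY,
    SupplierName TEXT NOT NULL
);
-- First design: autogenerated PartId only, nothing stops a repeated part.
CREATE TABLE PART_SURROGATE (
    PartId     INTEGER PRIMARY KEY AUTOINCREMENT,
    PartName   TEXT NOT NULL,
    SupplierId INTEGER NOT NULL REFERENCES SUPPLIER(SupplierId)
);
-- Second design: the supplier and the supplier's own part id form the key.
CREATE TABLE PART (
    SupplierId     INTEGER NOT NULL REFERENCES SUPPLIER(SupplierId),
    SupplierPartId TEXT NOT NULL,
    PartName       TEXT NOT NULL,
    PRIMARY KEY (SupplierId, SupplierPartId)
);
INSERT INTO SUPPLIER VALUES (1, 'Acme');
""")

# The surrogate-only table happily stores the same part twice.
for _ in range(2):
    conn.execute("INSERT INTO PART_SURROGATE (PartName, SupplierId) VALUES ('Bolt', 1)")
surrogate_rows = conn.execute("SELECT COUNT(*) FROM PART_SURROGATE").fetchone()[0]

# The composite key rejects the second entry of the same supplier part.
conn.execute("INSERT INTO PART VALUES (1, 'B-100', 'Bolt')")
try:
    conn.execute("INSERT INTO PART VALUES (1, 'B-100', 'Bolt')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(surrogate_rows, duplicate_rejected)
```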
Just as functions encapsulate a set of instructions, or database views abstract base table connections, so too do surrogate keys abstract the meaning of the entity they are placed on.
If, for example, you have a table that holds vehicle data, applying a surrogate VehicleId abstracts what it means to be a vehicle from a data point of view. When you reference VehicleId = 1, you are most surely talking about a vehicle of some sort, but do we know if it is a 2008 Chevy Impala, or a 1991 Ford F-150? No. Can the underlying data of whatever Vehicle #1 is change at any time? Yes.
Short answer: Multi-column foreign keys naturally refer to multi column primary keys. There can still be an autogenerated id column that is part of the primary key.
Philosophical answer: Primary key is the identity of the row. If there is a bit of information that is an intrinsic part of the identity of the row (such as which customer the article belongs to, in a multi-customer wiki), the information should be part of the primary key.
An example: System for organizing LAN parties
The system supports several LAN parties with the same people and organizers attending thus:
CREATE TABLE users ( users_id serial PRIMARY KEY, ... );
And there are several parties:
CREATE TABLE parties ( parties_id serial PRIMARY KEY, ... );
But most of the other stuff needs to carry the information about which party it is linked to:
CREATE TABLE ticket_types (
ticket_types_id serial,
parties_id integer REFERENCES parties,
name text,
....
PRIMARY KEY(ticket_types_id, parties_id)
);
...this is because we want to refer to primary keys. Foreign key on table attendances points to table ticket_types.
CREATE TABLE attendances (
    attendances_id serial,
    parties_id integer REFERENCES parties,
    ticket_types_id integer,
    PRIMARY KEY (attendances_id, parties_id),
    -- the composite foreign key points at ticket_types, as described above
    FOREIGN KEY (ticket_types_id, parties_id) REFERENCES ticket_types
);
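What the composite foreign key buys you can be seen in a small sketch, here run through Python's built-in sqlite3 module instead of PostgreSQL. The serial columns are replaced by explicit ids, the foreign key names its target columns in ticket_types explicitly, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE parties (parties_id INTEGER PRIMARY KEY);
CREATE TABLE ticket_types (
    ticket_types_id INTEGER NOT NULL,
    parties_id      INTEGER NOT NULL REFERENCES parties(parties_id),
    name            TEXT,
    PRIMARY KEY (ticket_types_id, parties_id)
);
CREATE TABLE attendances (
    attendances_id  INTEGER NOT NULL,
    parties_id      INTEGER NOT NULL REFERENCES parties(parties_id),
    ticket_types_id INTEGER NOT NULL,
    PRIMARY KEY (attendances_id, parties_id),
    FOREIGN KEY (ticket_types_id, parties_id)
        REFERENCES ticket_types (ticket_types_id, parties_id)
);
INSERT INTO parties VALUES (1), (2);
INSERT INTO ticket_types VALUES (10, 1, 'Weekend pass');  -- belongs to party 1
""")

# Attendance at party 1 with a party-1 ticket type: accepted.
conn.execute("INSERT INTO attendances VALUES (100, 1, 10)")

try:
    # Attendance at party 2 using party 1's ticket type: the pair (10, 2)
    # does not exist in ticket_types, so the composite FK rejects it.
    conn.execute("INSERT INTO attendances VALUES (101, 2, 10)")
    cross_party_rejected = False
except sqlite3.IntegrityError:
    cross_party_rejected = True

print(cross_party_rejected)
```

With a single-column foreign key on ticket_types_id alone, the second insert would have gone through and linked an attendance to a ticket type from a different party.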
While I prefer surrogate keys, I use composite keys in a few cases. The composite key may consist entirely or partially of surrogate key fields.
Many to many join tables. These usually require a unique key on the key pair anyway. In some cases additional columns may be included in the key.
Weak child tables. Things like order lines do not stand on their own. In this case I use the parent (orders) table's primary key as part of the composite key.
When there are multiple weak tables related to an entity, it may be possible to eliminate a table from the join set when querying child data. In the case of grandchild tables, it is possible to join the grandparent to grandchild without involving the table in the middle.
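The grandparent-to-grandchild case might look like the following sketch, using Python's built-in sqlite3 module. The three tables (orders, order_lines, line_shipments) and their columns are hypothetical; the point is only that order_id is carried down into the grandchild's composite key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer TEXT NOT NULL
);
CREATE TABLE order_lines (                 -- weak child of orders
    order_id INTEGER NOT NULL REFERENCES orders(order_id),
    line_no  INTEGER NOT NULL,
    product  TEXT NOT NULL,
    PRIMARY KEY (order_id, line_no)
);
CREATE TABLE line_shipments (              -- weak child of order_lines
    order_id    INTEGER NOT NULL,
    line_no     INTEGER NOT NULL,
    shipment_no INTEGER NOT NULL,
    quantity    INTEGER NOT NULL,
    PRIMARY KEY (order_id, line_no, shipment_no),
    FOREIGN KEY (order_id, line_no) REFERENCES order_lines (order_id, line_no)
);
INSERT INTO orders VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO order_lines VALUES (1, 1, 'widget'), (1, 2, 'gadget'), (2, 1, 'widget');
INSERT INTO line_shipments VALUES (1, 1, 1, 5), (1, 2, 1, 3), (2, 1, 1, 7);
""")

# Grandparent joined straight to grandchild; order_lines is not touched.
shipped = conn.execute("""
    SELECT o.customer, SUM(s.quantity)
    FROM orders AS o
    JOIN line_shipments AS s ON s.order_id = o.order_id
    GROUP BY o.customer
    ORDER BY o.customer
""").fetchall()
print(shipped)
```

Had line_shipments referenced order_lines through a single surrogate id, the same query would need order_lines in the middle of the join to get back to the order.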
I have a table which needs 2 fields. One will be a foreign key, the other is not necessarily unique. There really isn't a reason that I can find to have a primary key other than having read that "every single table ever needs needs needs a primary key".
Edit:
Some good thoughts in here.
For clarity's sake, I will give you an example that is similar to my database needs.
Let's say I have a table with product type, quantity, cost, and manufacturer.
Product type will not always be unique (say, MP3 Player), but manufacturer/product type will be unique (say, Apple MP3 Player). Forget about the various models the manufacturers make for this example. For ease, this table has an auto-incrementing primary key.
I am giving a point value and logging how often these products are searched for, added to a cart, and bought for display on a list of hot items.
The way I have it laid out currently is in a second table with a FK pointing to the main table, and a second column for the total number of "popularity points" this item has gained.
The answers I have seen here have made me think that perhaps I should just add a "points" column to my primary products table so that I could just track it there... but that seems like I'm not normalizing my database enough.
My problem is I'm currently mostly just a hobbyist doing this for learning, and don't have the luxury of a DBA to tell me how to set up my tables, so I have to learn both the coding side and the database side.
You have to distinguish between primary key and surrogate key. Auto-incremented column would be a particular case of the latter. Your question, therefore, is twofold:
Does every table need to have a primary key?
Does every table need to have a surrogate primary key?
The answer to first question is YES except in some special cases (association table for many-to-many relationship arguably being an example of such a special case). The reason for this is that you usually need to be able (if not right now then in the future) to consistently address individual rows of that table - for updates / deletion, for example.
The answer to the second question is NO. If your table represents a core business entity OR it can be referenced from a many-to-one association, having a surrogate key is probably a good idea; but it's not absolutely necessary.
It's somewhat unclear what your table's function is; from your description it sounds like it has "collection of values" semantics (FK to "main" table + value). Certain ORMs don't support surrogate keys in such circumstances; if that's what has prompted your question it's OK to leave the surrogate (or even primary in case of bag) key off.
For the sake of having something unique and as identifier, please please please please have a primary key in every table :)
It also helps forward compatibility in case there are future schema changes and the 2 values are no longer unique. Plus, storage is much cheaper now, so feel free to spend some of it as an investment. ;)
I am not sure what the other field looks like, but I am guessing that it would be ok to have a composite primary key based on the FK and the other field. Then again, I don't know your exact scenario.
I would say that it's absolutely necessary to have some sort of primary key in every table.
Interestingly enough, one of the DBAs for a Viacom property once told me that there was really no discernible difference in using an INT UNSIGNED or a VARCHAR(n) as a primary key in MySQL. This was in reference to a user table with more than 64 million rows. I believe n can be decently large (<=100), but I forget what they limited it to. Unfortunately, I don't have any empirical data to back that up.
You don't HAVE to have a primary key on every table, but it is considered best practice to have them as they are almost always necessary on a normalized relational database design. If you're finding a bunch of tables you don't think need PKs, then you should revisit the design/layout of your tables. To read more on normalization see here.
A couple of scenarios that I can think of where you may not need or want a PK on a table would be a table strictly for logging (to limit the performance degradation of writing the log and maintaining a unique index) and the scenario where you're just storing data used to pump through an application for test purposes.
I'll be contrary and say you shouldn't add the key if you don't have a reason for it. It is very easy to add this column later if needed.
Strictly speaking, a surrogate key is not necessary, but a primary key is.
Many people use the term "primary key" to mean a single column that is an auto-incrementing integer. But this is not an accurate definition of a primary key.
A primary key is a constraint on one or more columns that serve to identify each row uniquely. Yes, you need some way of addressing individual rows. This is a crucial characteristic of a relation (aka a table).
You say you have a foreign key and another column that is not unique. But are these two columns taken together unique? If so, you can declare a primary key constraint over these two columns.
Defining another surrogate key (also called a pseudokey -- the auto-incrementing type) is a convenience because some people don't like to have to reference two columns when selecting a single row. Or they want the freedom to change values in the other columns easily, without changing the value of the primary key by which one addresses the individual row.
This is a technique related to normalization and a pretty good practice. A key made up of an auto incrementing number has many benefits:
You have a PK that does not pertain to the data.
You never have to change the PK value
Every row will automatically have a unique identifier