Table without a primary key - mysql

So I've always been told that it's absolutely necessary to have a primary key specified with a table. I've been doing some work and ran into a situation where a primary key's unique constraint would stop data I need from being added.
If there's an example situation where a table was structured with fields:
Age, First Name, Last Name, Country, Race, Gender
Where if a TON of data was being entered all these fields don't necessarily uniquely identify a row and I don't need an index across all columns anyways. Would the only solution here be to make an auto-incrementing ID field? Would it be okay to NOT have a primary at all?

It's not always necessary to have a primary key; most DBMSs will allow you to construct a table without one (a).
But that doesn't necessarily mean it's a good idea. Have a think about the situation in which you want to use that data. Now think about if you have two twenty-year-old Australian men named Bob Smith, both from Perth.
Without a unique constraint, you can put both rows into the table, but here's the rub: how would you figure out which one you want to use in future? (b)
Now, if you just want to store the fact that there are one or more people meeting those criteria, you only need to store one row. But then, you'd probably have a composite primary key consisting of all columns.
If you have other information you want to store about the person (e.g., highest score in the "2048" game on their iPhone), then you don't want a primary key across the entire row, just across the columns you mention.
Unfortunately, that means there will undoubtedly come a time when both of those Bob Smiths try to write their high score to the database, only to find one of them loses their information.
If you want them both in the table and still want to allow for the possibility outlined above (two people with identical attributes in the columns you mention) then the best bet is to introduce an artificial key such as an auto-incrementing column, for the primary key. That will allow you to uniquely identify a row regardless of how identical the other columns are.
The other advantage of an artificial key is that, being arbitrary, it never needs to change for the thing being identified. In your example, if you use age, names, nationality or location (c) in your primary key, these are all subject to change, meaning that you will need to adjust any foreign keys referencing those rows. If the tables referencing these rows use the unchanging artificial key, that will never be a problem.
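As a rough illustration of the two-Bob-Smiths scenario, here is a minimal sketch. It assumes SQLite (via Python's standard sqlite3 module) as a stand-in for MySQL so that it runs anywhere; the table name person and the column high_score are made up for the example.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL. In MySQL the key column
# would be written as: id INT NOT NULL AUTO_INCREMENT PRIMARY KEY
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE person (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,  -- artificial key
        first_name TEXT    NOT NULL,
        last_name  TEXT    NOT NULL,
        age        INTEGER NOT NULL,
        country    TEXT    NOT NULL,
        high_score INTEGER NOT NULL DEFAULT 0
    )
""")

insert_sql = (
    "INSERT INTO person (first_name, last_name, age, country) "
    "VALUES (?, ?, ?, ?)"
)
# Two people whose descriptive columns are identical.
bob = ("Bob", "Smith", 20, "Australia")
first_id = conn.execute(insert_sql, bob).lastrowid
second_id = conn.execute(insert_sql, bob).lastrowid

# Each Bob stores his own score; the id says which row is meant.
conn.execute("UPDATE person SET high_score = ? WHERE id = ?", (2048, first_id))
conn.execute("UPDATE person SET high_score = ? WHERE id = ?", (512, second_id))

scores = dict(conn.execute("SELECT id, high_score FROM person"))
print(scores)
```

Both rows are accepted, and neither Bob overwrites the other's score, because the updates address rows by id rather than by name and age.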
(a) There are situations where a primary key doesn't really give you any performance benefit such as when the table is particularly small (such as mapping integers 1 through 12 to month name).
In other words, things where a full table scan isn't really any slower than indexing. But these situations are incredibly rare and I'd probably still use a key because it's more consistent (especially since the use of a key tends not to make a difference to the performance either way).
(b) Keep in mind that we're talking in terms of practice here rather than theory. While in practice you may create a table with no primary key, relational theory states that each row must be uniquely identifiable, otherwise relations are impossible to maintain.
C.J. Date, who along with Codd is one of the progenitors of relational database theory, states the rules of relational tables in "An Introduction to Database Systems", one of which is:
The records have a unique identifier field or field combination called the primary key.
So, in terms of relational theory, each table must have a primary key, even though it's not always required in practice.
(c) Particularly age which is guaranteed to change annually until you're dead, so perhaps date of birth may be a better choice for that column.

Would the only solution here be to make an auto-incrementing ID field?
That is a valid way, but it is not the only one: you could use other ways to generate unique keys, such as using GUIDs. Keys like that are called surrogate primary keys, because they are not related to the "payload" of the data row.
Would it be okay to NOT have a primary at all?
Since you mentioned that the actual data in rows may not be unique, you wouldn't be able to use your table effectively without a primary key. For example, you would not be able to update or delete a specific row, which may be required, for example, when a user's name changes.
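To make the "cannot update a specific row" point concrete, here is a small sketch, again assuming SQLite through Python's sqlite3 module in place of MySQL. The table name person_nokey is hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE person_nokey (
        first_name TEXT,
        last_name  TEXT,
        age        INTEGER,
        country    TEXT
    )
""")

# Two indistinguishable rows, which the table accepts because it has no key.
bob = ("Bob", "Smith", 20, "Australia")
conn.executemany("INSERT INTO person_nokey VALUES (?, ?, ?, ?)", [bob, bob])

# Attempt to rename only ONE of them. The WHERE clause can only describe
# the row by its values, so it matches both.
cur = conn.execute(
    "UPDATE person_nokey SET last_name = 'Jones' "
    "WHERE first_name = ? AND last_name = ? AND age = ? AND country = ?",
    bob,
)
rows_changed = cur.rowcount

smiths_left = conn.execute(
    "SELECT COUNT(*) FROM person_nokey WHERE last_name = 'Smith'"
).fetchone()[0]
print(rows_changed, smiths_left)
```

The single intended rename changes both rows, which is the practical cost of having nothing that tells them apart.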

The simplest solution would be to include an ID column to serve as primary key:
id int not null primary key auto_increment

From your post it looks like the table represents a person entity. In that case, a PK would identify each person entity uniquely. I would suggest having a primary key on the table that uniquely identifies each person record.
You can either create an AUTO_INCREMENT ID column (a synthetic ID column)
(OR)
You can combine multiple columns in your table that uniquely determine all the other fields, such as (First Name, Last Name), which would make it a composite primary key. But that may clash as well, since more than one person could have the same full name (first name + last name).

Typically you should avoid proliferating ID primary keys fields through your database.
That doesn't mean you shouldn't have primary keys; your primary key can be a surrogate or a composite key. And that's what you should do here.
If those fields {Age, First Name, Last Name, Country, Race, Gender} unambiguously identify each row, then make a primary key composed of all of those fields.
But if not, then you must have some other type of information to disambiguate your data.
You can also choose not to specify any kind of key and treat the table as a non-normalized, redundant data source, if that is what you need.

Use an identity column with another column such as Last Name

Related

Should one combine foreign keys that point to the same table if all columns are required?

I encounter this situation frequently. An example,
A user is uniquely identified by appId, externalUserId.
Table xxxContract has a foreign key (fileUploadId, appId, externalUserId) to table fileUpload that ensures the file upload belongs to the specified user.
Table xxxContract has a foreign key (businessId, appId, externalUserId) to table business that ensures the business belongs to the specified user.
With the above two, we guarantee user A's file upload won't be used as a contract for user B's business.
xxxContract also has a fileTypeId column that is STORED GENERATED to a certain value that says "This contract is of file type XXX_CONTRACT"
Table xxxContract also has a foreign key (fileUploadId, fileTypeId) to table fileUpload.
This guarantees we only use XXX_CONTRACT file uploads for xxxContract, and not accidentally use other file types.
Given the above, we have this situation where we have two foreign keys that point to the same table fileUpload, and even have overlapping columns,
(fileUploadId, appId, externalUserId)
(fileUploadId, fileTypeId)
And all the columns are NOT NULL.
So, it seems to me like it's safe to combine the foreign keys into one larger foreign key,
(fileUploadId, appId, externalUserId, fileTypeId)
And we'll still have the same guarantees as before.
My gut feeling is that I should not combine the foreign keys because separating them by meaning and giving the FKs meaningful names helps with maintainability.
But I've never had a formal education with these things so I'd like to know what the industry standard is.
Related, is there a performance benefit to combining them vs. separating them?
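For comparison, here is a sketch of both layouts side by side. Several things in it are assumptions: SQLite (Python's sqlite3) stands in for the real DBMS, the business table is left out, the stored generated fileTypeId is approximated with DEFAULT plus CHECK, the value 7 for XXX_CONTRACT is invented, and the constraint and table names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE fileUpload (
    fileUploadId   INTEGER PRIMARY KEY,
    appId          INTEGER NOT NULL,
    externalUserId TEXT    NOT NULL,
    fileTypeId     INTEGER NOT NULL,
    -- Each foreign key below needs a matching unique key on the parent.
    UNIQUE (fileUploadId, appId, externalUserId),
    UNIQUE (fileUploadId, fileTypeId),
    UNIQUE (fileUploadId, appId, externalUserId, fileTypeId)
);
-- Variant 1: two foreign keys, each named for the rule it enforces.
CREATE TABLE contractSeparate (
    fileUploadId   INTEGER NOT NULL,
    appId          INTEGER NOT NULL,
    externalUserId TEXT    NOT NULL,
    fileTypeId     INTEGER NOT NULL DEFAULT 7 CHECK (fileTypeId = 7),
    CONSTRAINT fk_upload_belongs_to_user
        FOREIGN KEY (fileUploadId, appId, externalUserId)
        REFERENCES fileUpload (fileUploadId, appId, externalUserId),
    CONSTRAINT fk_upload_is_contract_type
        FOREIGN KEY (fileUploadId, fileTypeId)
        REFERENCES fileUpload (fileUploadId, fileTypeId)
);
-- Variant 2: one combined foreign key over all four columns.
CREATE TABLE contractCombined (
    fileUploadId   INTEGER NOT NULL,
    appId          INTEGER NOT NULL,
    externalUserId TEXT    NOT NULL,
    fileTypeId     INTEGER NOT NULL DEFAULT 7 CHECK (fileTypeId = 7),
    CONSTRAINT fk_upload_user_and_type
        FOREIGN KEY (fileUploadId, appId, externalUserId, fileTypeId)
        REFERENCES fileUpload (fileUploadId, appId, externalUserId, fileTypeId)
);
INSERT INTO fileUpload VALUES (1, 10, 'userA', 7);  -- user A, contract type
INSERT INTO fileUpload VALUES (2, 10, 'userA', 3);  -- user A, other type
""")


def try_insert(table, upload_id, app_id, user_id):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                f"INSERT INTO {table} (fileUploadId, appId, externalUserId) "
                "VALUES (?, ?, ?)",
                (upload_id, app_id, user_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


cases = [
    (1, 10, "userA"),  # right user, right file type
    (1, 10, "userB"),  # upload belongs to a different user
    (2, 10, "userA"),  # right user, wrong file type
]
outcomes = {
    table: [try_insert(table, *case) for case in cases]
    for table in ("contractSeparate", "contractCombined")
}
print(outcomes)
```

In this sketch both variants accept and reject the same rows, so the choice comes down to naming and maintainability, plus the extra unique key the combined variant needs on the parent table. It says nothing about performance, which would have to be measured on the real system.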
But I've never had a formal education with these things so I'd like to know what the industry standard is.
The standard is that there is no standard.
As you already noted, you can use multiple columns to define a primary key. This is called a natural primary key. For instance, a user can be uniquely identified by first name, last name and birthdate (at least almost always).
Keys of this kind are often called composite keys, because no single column is unique on its own; only in combination do they form a primary key.
Surrogate (or artificial) primary keys are also well known: an id column using auto-increment.
So, as to your question: yes, if you have 3 columns that already form a natural primary key, it is completely safe to add more columns. Since the 3 columns already present will uniquely identify the row, there is no harm in adding a 4th, 5th or even 6th column to the key.
Whether you use natural or surrogate primary keys depends on personal preference, I'd say. I never use natural keys, even on tables where this is possible.
Keep in mind that whenever you need to delete or update something, you always need to know the primary key. Hence, with natural keys, you need to pass multiple values through many method calls, while surrogate keys offer the advantage of having just one id to uniquely identify a row. No more information is required.
Performance-wise, I assume that (integer-based) surrogate primary keys tend to be faster than (string-based) natural primary keys. There are also fewer columns to consider when writing queries and/or designing indexes.

Can we use any other unique constraint as primary key in database like a phone number, or national Id

I'm creating a database that has a column for a person's mobile phone number. I just want to know: without making a separate id column and making that the primary key, can I make this column the primary key for this table?
As noted above, you technically could use a phone number as a primary key, but it is not a best practice, because:
You would not be able to insert another user who happens to have the same phone number (primary keys must be unique).
You will run into what is known as an "update anomaly": if you have other tables that reference your table's primary key and you decide to change a user's mobile number, you will also have to update the mobile number in all of the dependent tables. See: How to maintain referential integrity
From a performance standpoint, indexes on numeric values are usually more efficient than indexes on varchars, and will improve the performance on your joins, and the index will take up less space.
More often than not, your best bet is to use an auto-incrementing surrogate key.
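One way to picture that recommendation is sketched below, with SQLite (Python's sqlite3) assumed as a stand-in for MySQL. The tables person and orders and the phone numbers are invented. The phone column keeps a UNIQUE constraint here, which you would drop if two people must be allowed to share a number.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE person (
    id    INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    name  TEXT NOT NULL,
    phone TEXT NOT NULL UNIQUE                -- stored as a string
);
CREATE TABLE orders (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    person_id INTEGER NOT NULL REFERENCES person (id),
    item      TEXT NOT NULL
);
INSERT INTO person (name, phone) VALUES ('Alice', '+61 400 000 001');
INSERT INTO orders (person_id, item) VALUES (1, 'book'), (1, 'lamp');
""")

# Changing the number touches exactly one row in one table; the rows in
# orders reference the id, so nothing there has to change.
new_phone = "+61 400 000 002"
cur = conn.execute("UPDATE person SET phone = ? WHERE id = ?", (new_phone, 1))
rows_touched = cur.rowcount

items = [
    row[0]
    for row in conn.execute(
        "SELECT o.item FROM orders o "
        "JOIN person p ON p.id = o.person_id "
        "WHERE p.phone = ? ORDER BY o.id",
        (new_phone,),
    )
]
print(rows_touched, items)
```

The orders are still reachable through the new number after a single-row update, which is the update anomaly avoided.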
Technically, you can define any column as primary key. The question is whether such a definition is good or bad. If you are going to use a phone number (which should be stored as a string), the column will be unique as well as the primary key, and you make sure that no attempt is made to insert the same number twice for different people, then it should be OK.

Should I create a surrogate key instead of a composite key?

Structure:
Actor <=== ActorMovie ===> Movie
ActorMovie: ActorID (fk), MovieId (fk)... ===> pk: (ActorID, MovieID)
Should I create a surrogate key for the ActorMovie table like this?
ActorMovie: ActorMovieID (pk), ActorID (fk), MovieId (fk)...
Conventions are good if they are helpful
"SQL Antipatterns", Chapter 4, "ID Required"
Intention of primary key
A primary key is something you can use to identify a row by its unique address in the table. That means a surrogate column is not the only thing that can serve as a primary key. In fact, a primary key should be:
Unique. An identifier for each row. If it's compound, every combination of column values must be unique.
Minimal. That means it can't be reduced (i.e., if it's compound, no column can be omitted without losing uniqueness).
Single. No other primary key may be defined; each table can have only one primary key.
Compound versus surrogate
There are cases where a surrogate key has benefits. The most common problem arises when you have a table of people's names. Can the combination of first_name + last_name + taxpayer_id be unique? In most cases, yes. But in theory, there could be cases where duplicates occur. So this is a case where a surrogate key provides unique identification of rows no matter what.
However, if we're talking about a many-to-many link between tables, it's obvious that the linking table will always contain each pair once. In fact, you'd even need to check that a duplicate does not exist before operating on that table (otherwise it's a redundant row, because it holds no additional information unless your design has a special intention to store that). Therefore, your combination of ActorID + MovieID satisfies all the conditions for a primary key, and there's no need to create a surrogate key. You can do that, but it will make little sense (if any), because it will have no meaning other than numbering rows. On the other hand, with a compound key, you'll have:
Unique check by design. Your rows will be unique; no duplicates will be allowed in the linking table. And that makes sense, because there's no need to create a link if it already exists.
No redundant (and thus less comprehensible) column in the design. That makes your design simpler and more readable.
In conclusion: yes, there are cases where a surrogate key should (or even must) be applied, but in your particular case it would definitely be an antipattern. Use a compound key.
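The "unique check by design" point can be sketched like this, assuming SQLite (Python's sqlite3) in place of MySQL. The snake_case table and column names and the sample rows are placeholders.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE actor (actor_id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
CREATE TABLE movie (movie_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE actor_movie (
    actor_id INTEGER NOT NULL REFERENCES actor (actor_id),
    movie_id INTEGER NOT NULL REFERENCES movie (movie_id),
    PRIMARY KEY (actor_id, movie_id)  -- compound key, no surrogate column
);
INSERT INTO actor VALUES (1, 'Actor A');
INSERT INTO movie VALUES (1, 'Movie M');
INSERT INTO actor_movie VALUES (1, 1);
""")

# Linking the same actor to the same movie again is refused by the key
# itself; no separate "does it already exist?" query is needed.
try:
    conn.execute("INSERT INTO actor_movie VALUES (1, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

link_count = conn.execute("SELECT COUNT(*) FROM actor_movie").fetchone()[0]
print(duplicate_rejected, link_count)
```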
References:
Primary keys in SQL
SQL Antipatterns by Bill Karwin
Let me just mention a detail that seems to have been missed by other posters: InnoDB tables are clustered.
If you have just a primary key, your whole table will be represented by a lone B-Tree, which is very efficient. Adding a surrogate would just create another B-Tree (and "fatter" than expected to boot, due to how clustering works), without benefit to offset the added overhead.
Surrogates have their place, but junction tables are usually not it.
I'd always go with the composite key. My reasoning:
You will probably never use the surrogate key anywhere.
You will reduce the number of indexes/constraints on the table, as you will most certainly still need indexes over actor and movie.
You will always search for either movie or an actor anyway.
Unless you have a scenario where you will actually use the surrogate key outside of its own table, I'd go with the composite key.
If you want to associate other data elements with the join table, such as the name(s) of the role(s) played (which might be a child table) then I certainly would. If you were sure that you never wanted to then I'd consider it as optional.
Consider the first normal form (1NF) of database design normalization.
I would make ActorID and MovieID a unique key combination, then create a primary key ActorMovieID.
See the same question here: Two foreign keys instead of primary
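For contrast, the surrogate-plus-unique layout suggested in some of these answers might look like the sketch below, again assuming SQLite (Python's sqlite3) instead of MySQL. The role child table, its columns, and the sample values are hypothetical, and the actor and movie tables are omitted to keep it short.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE actor_movie (
    actor_movie_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    actor_id       INTEGER NOT NULL,
    movie_id       INTEGER NOT NULL,
    UNIQUE (actor_id, movie_id)  -- still forbids duplicate links
);
-- Child table: refers to a link through one column instead of two.
CREATE TABLE role (
    role_id        INTEGER PRIMARY KEY AUTOINCREMENT,
    actor_movie_id INTEGER NOT NULL
                   REFERENCES actor_movie (actor_movie_id),
    role_name      TEXT NOT NULL
);
""")

link_id = conn.execute(
    "INSERT INTO actor_movie (actor_id, movie_id) VALUES (1, 1)"
).lastrowid
conn.executemany(
    "INSERT INTO role (actor_movie_id, role_name) VALUES (?, ?)",
    [(link_id, "Hero"), (link_id, "Narrator")],
)

# The UNIQUE constraint does the job the compound primary key would do.
try:
    conn.execute("INSERT INTO actor_movie (actor_id, movie_id) VALUES (1, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

role_count = conn.execute(
    "SELECT COUNT(*) FROM role WHERE actor_movie_id = ?", (link_id,)
).fetchone()[0]
print(duplicate_rejected, role_count)
```

The trade-off is one more column and one more index on the junction table in exchange for a single-column reference from child tables.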
On this subject, my point is very simple: surrogate primary keys ALWAYS work, while composite keys MIGHT NOT ALWAYS work one of these days, and this for multiple reasons.
So when you start asking yourself 'is composite better than surrogate', you have already started wasting your time. Go for a surrogate. It always works. Then move on to the next step.

Why we should have an ID column in the table of users?

It's obvious that we already have another unique piece of information about each user, namely the username. Then why do we need another unique thing for each user? Why should we also have an id for each user? What would happen if we omit the id column?
Even if your username is unique, there are a few advantages to having an extra id column instead of using the varchar as your primary key.
Some people prefer to use an integer column as the primary key, to serve as a surrogate key that never needs to change, even if other columns are subject to change. Although there's nothing preventing a natural primary key from being changeable too, you'd have to use cascading foreign key constraints to ensure that the foreign keys in related tables are updated in sync with any such change.
The primary key being a 32-bit integer instead of a varchar can save space. The choice between an int or a varchar foreign key column in every other table that references your user table can be a good reason.
Inserting into the primary key index is a little more efficient if you add new rows to the end of the index, compared to wedging them into the middle of the index. Indexes in MySQL tables are usually B+Tree data structures, and you can study these to understand how they perform.
Some application frameworks prefer the convention that every table in your database has a primary key column called id, instead of using natural keys or compound keys. Following such conventions can make certain programming tasks simpler.
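The cascading foreign key mentioned in the first point of this list can be sketched as follows. SQLite (Python's sqlite3) is assumed as a stand-in for MySQL/InnoDB, and the users and posts tables are invented for the example.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL/InnoDB.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # needed for SQLite to enforce FKs
conn.executescript("""
CREATE TABLE users (
    username TEXT NOT NULL PRIMARY KEY  -- natural key, no id column
);
CREATE TABLE posts (
    post_id INTEGER PRIMARY KEY,
    author  TEXT NOT NULL
            REFERENCES users (username) ON UPDATE CASCADE,
    body    TEXT NOT NULL
);
INSERT INTO users VALUES ('alice');
INSERT INTO posts VALUES (1, 'alice', 'hello');
""")

# Renaming the user changes the primary key; the cascade rewrites the
# referencing rows so they stay in sync.
conn.execute("UPDATE users SET username = 'alice_b' WHERE username = 'alice'")

# The foreign key column holds the readable value, so no join back to
# users is needed to show who wrote the post.
author = conn.execute(
    "SELECT author FROM posts WHERE post_id = 1"
).fetchone()[0]
print(author)
```

The cascade keeps the data consistent, at the cost of rewriting every referencing row whenever a username changes.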
None of these issues are deal-breakers. And there are also advantages to using natural keys:
If you look up rows by username more often than you search by id, it can be better to choose the username as the primary key, and take advantage of the index-organized storage of InnoDB. Make your primary lookup column be the primary key, if possible, because primary key lookups are more efficient in InnoDB (you should be using InnoDB in MySQL).
As you noticed, if you already have a unique constraint on username, it seems a waste of storage to keep an extra id column you don't need.
Using a natural key means that foreign keys contain a human-readable value, instead of an arbitrary integer id. This allows queries to use the foreign key value without having to join back to the parent table for the "real" value.
The point is that there's no rule that covers 100% of cases. I often recommend that you should keep your options open, and use natural keys, compound keys, and surrogate keys even in a single database.
I cover some issues of surrogate keys in the chapter "ID Required" in my book SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming.
This identifier is known as a Surrogate Key. The page I linked lists both the advantages and disadvantages.
In practice, I have found them to be advantageous because even superkey data can change over time (i.e. a user's email address may change and thus any corresponding relations must change), but a surrogate key never needs to change for the data it identifies because its value is meaningless to the relation.
It's also nice from a JOIN standpoint because it can be an integer with a smaller key length than a varchar.
I can say that in practice I prefer to use them. I have been bitten too many times by having multiple-column primary keys or a data-representative superkey used across tables having to become non-unique later due to changing requirements during development, and that is not a situation you want to deal with.
In my opinion, every table should have a unique, auto-incremented id.
Here are some practical reasons. If you have duplicate rows, you can readily determine which row to delete. If you want to know the order that rows were inserted, you have that information in the id. As for users, there's more than one "John Smith" in the world. An id provides a key for foreign references.
Finally, just about anything that might describe a user -- a name, an address, a telephone number, an email address -- could change over time.
In MySQL we have:
1: index fields, 2: unique fields, and 3: PK fields.
Index means the field can be looked up directly.
Unique means a value may appear in only one row of the table.
PK = index + unique
A table may have many unique fields, such as
username, passport code or email.
But you still need a field like ID that is both unique and indexed (= PK): first, it always refers to one thing and never changes; second, it is unique; and third, it is simple (because it is usually a number).
One reason to have a numeric id is that an index on it is leaner than one on a text field, reducing index size and the processing time required to look up a specific user. It also takes fewer bytes to store when cross-referencing a user from a different table (relational database).

Should I add a autoinc primary key for the sake of having a primary key?

I have a table which needs 2 fields. One will be a foreign key, the other is not necessarily unique. There really isn't a reason that I can find to have a primary key other than having read that "every single table ever needs needs needs a primary key".
Edit:
Some good thoughts in here.
For clarity's sake, I will give you an example that is similar to my database needs.
Let's say I have a table with product type, quantity, cost, and manufacturer.
Product type will not always be unique (say, MP3 Player), but manufacturer/product type will be unique (say, Apple MP3 Player). Forget about the various models the manufacturers make for this example. For ease, this table has an autoincrementing primary key.
I am giving a point value and logging how often these products are searched for, added to a cart, and bought for display on a list of hot items.
The way I have it laid out currently is in a second table with a FK pointing to the main table, and a second column for the total number of "popularity points" this item has gained.
The answers I have seen here have made me think that perhaps I should just add a "points" column to my primary products table so that I could just track it there... but that seems like I'm not normalizing my database enough.
My problem is I'm currently mostly just a hobbyist doing this for learning, and don't have the luxury of a DBA to tell me how to set up my tables, so I have to learn both the coding side and the database side.
You have to distinguish between primary key and surrogate key. Auto-incremented column would be a particular case of the latter. Your question, therefore, is twofold:
Does every table need to have a primary key?
Does every table need to have a surrogate primary key?
The answer to first question is YES except in some special cases (association table for many-to-many relationship arguably being an example of such a special case). The reason for this is that you usually need to be able (if not right now then in the future) to consistently address individual rows of that table - for updates / deletion, for example.
The answer to the second question is NO. If your table represents a core business entity OR it can be referenced from a many-to-one association, having a surrogate key is probably a good idea; but it's not absolutely necessary.
It's somewhat unclear what your table's function is; from your description it sounds like it has "collection of values" semantics (FK to "main" table + value). Certain ORMs don't support surrogate keys in such circumstances; if that's what has prompted your question it's OK to leave the surrogate (or even primary in case of bag) key off.
For the sake of having something unique and as identifier, please please please please have a primary key in every table :)
It also helps forward compatibility in case there are future schema changes and the two values are no longer unique. Plus, memory is much cheaper now, so feel free to treat it as an investment. ;)
I am not sure what the other field looks like, but I am guessing that it would be OK to have a composite primary key based on the FK and the other field. But then again, I don't know your exact scenario.
I would say that it's absolutely necessary to have some sort of primary key in every table.
Interestingly enough, one of the DBAs for a Viacom property once told me that there was really no discernible difference between using an INT UNSIGNED or a VARCHAR(n) as a primary key in MySQL. This was in reference to a user table with more than 64 million rows. I believe n can be decently large (<=100), but I forget what they limited it to. Unfortunately, I don't have any empirical data to back that up.
You don't HAVE to have a primary key on every table, but it is considered best practice to have them as they are almost always necessary on a normalized relational database design. If you're finding a bunch of tables you don't think need PKs, then you should revisit the design/layout of your tables. To read more on normalization see here.
A couple of scenarios I can think of where you may not need or want a PK on a table would be a table strictly for logging (to limit the performance cost of writing the log and maintaining a unique index) and a scenario where you're just storing data used to pump through an application for test purposes.
I'll be contrary and say you shouldn't add the key if you don't have a reason for it. It is very easy to add this column later if needed.
Strictly speaking, a surrogate key is not necessary, but a primary key is.
Many people use the term "primary key" to mean a single column that is an auto-incrementing integer. But this is not an accurate definition of a primary key.
A primary key is a constraint on one or more columns that serve to identify each row uniquely. Yes, you need some way of addressing individual rows. This is a crucial characteristic of a relation (aka a table).
You say you have a foreign key and another column that is not unique. But are these two columns taken together unique? If so, you can declare a primary key constraint over these two columns.
Defining another surrogate key (also called a pseudokey -- the auto-incrementing type) is a convenience because some people don't like to have to reference two columns when selecting a single row. Or they want the freedom to change values in the other columns easily, without changing the value of the primary key by which one addresses the individual row.
This is a technique related to normalization and a pretty good practice. A key made up of an auto incrementing number has many benefits:
You have a PK that does not pertain to the data.
You never have to change the PK value
Every row will automatically have a unique identifier