Say I had the following tables:
Foo
- id
- code (Unique)
- name
- description
Bar
- id
- foo_code
Is there a disadvantage with Bar.foo_code pointing to Foo.code? Generally, I see the ids being referenced instead (e.g. there would be a Bar.foo_id that points to Foo.id), but in my case, it would be a lot simpler if it actually pointed to something other than the auto-incrementing PK.
I'm curious if this is bad design or if there will be a penalty in performance somehow.
No, there are no real disadvantages, except perhaps readability, depending on the actual names of the columns. However, note that if Foo.code is unique, then it is virtually a primary key. Just make sure it is also indexed and you are golden. In fact, if Foo.code is unique, you could even get rid of Foo.id and save some space.
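For example, here is a minimal sketch of what that looks like in MySQL/InnoDB (the column types are assumptions, not part of the original schema):

CREATE TABLE Foo (
    id          INT UNSIGNED NOT NULL AUTO_INCREMENT,
    code        VARCHAR(20)  NOT NULL,
    name        VARCHAR(100) NOT NULL,
    description TEXT,
    PRIMARY KEY (id),
    UNIQUE KEY uq_foo_code (code)   -- the index the foreign key will lean on
);

CREATE TABLE Bar (
    id       INT UNSIGNED NOT NULL AUTO_INCREMENT,
    foo_code VARCHAR(20)  NOT NULL,
    PRIMARY KEY (id),
    FOREIGN KEY (foo_code) REFERENCES Foo (code)   -- works because Foo.code is indexed and unique
);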
What reason do you have to keep the auto-incrementing id around if you're really using another key as the primary key instead?
By default, a table would be sorted on the primary key. It may make more sense to have it sorted by code by default. Also, it'll take a bit more data storage, and updates get a bit slower as both indexes have to be updated. For many applications the performance difference will make very little practical difference, but I just can't think of a reason to keep the auto increment id around if you're not using it.
Surrogate primary keys do have their uses, but if you already have a unique column that you want to use as the primary key instead then using that unique column as your PK instead isn't bad design at all.
Depends on how you intend to query:
If you often need to determine the code associated with a given Bar, then referencing it directly in the FK effectively migrates it from Foo to Bar, allowing you to do it without a JOIN.
However, if you also need to fetch Foo.name or Foo.description, then the JOIN is inevitable, and referencing Foo.id allows you to avoid paying the double-lookup price caused by the MySQL/InnoDB clustering.
That being said, you may be able to remove the surrogate key Foo.id, removing the entire dilemma.
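To make the trade-off concrete, here is a rough sketch against the Foo/Bar tables above (the literal 42 is just a placeholder id):

-- Only the code is needed: no JOIN, because Bar carries foo_code directly
SELECT foo_code FROM Bar WHERE id = 42;

-- name/description are needed: the JOIN comes back either way
SELECT f.name, f.description
FROM Bar b
JOIN Foo f ON f.code = b.foo_code
WHERE b.id = 42;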
I think I understand primary keys and indexes.
In my setup, I have a table with several columns. Two of these columns are User ID, and Username.
Ideally I would like both to be unique, and non nullable.
As far as I can tell, my best use would be to have the User ID as the primary key, as this is the most important field not to NULL, and it will never change as the database grows.
I would then have to have the username column as a unique index, so that it can be the same on another row, although unfortunately, it could end up NULL.
This is what I will do unless there is a way to have both columns as unique and non NULLABLE?
You can declare the Username column as NOT NULL and put a unique index on it. Although the index itself won't force non-null values, the field definition will, so it will effectively be a unique, non-nullable field.
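A minimal sketch, assuming MySQL and a hypothetical users table (the column types are made up):

CREATE TABLE users (
    user_id  INT UNSIGNED NOT NULL,
    username VARCHAR(50)  NOT NULL,    -- non-nullability comes from the column definition
    PRIMARY KEY (user_id),
    UNIQUE KEY uq_username (username)  -- uniqueness comes from the unique index
);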
From both my application development and data warehouse experience, I would recommend having a separate primary key that is not used in any business setting, and not using User ID as the primary key. Using User ID as the primary key can lead to a whole host of problems. I would index each column (separately).
Any time you need to merge or reassign a user, or change their ID, having actually used their user ID as the primary key will lead to a lot of problems for those operations.
Also, on the web, this will open up people seeing URLs like ....user/1/details and then potentially being able to change the '1' to a '2' (for example) and seeing other people's info. It is better if the ID is unique like '57489574389ghfjghfjghf'; then it's harder to hack URLs.
The choice between a 'natural' and a 'surrogate' key is explained well here:
http://www.agiledata.org/essays/keys.html
Most of the problems people experience in this area are for edge cases such as merges and deletes. These are usually of low priority initially, but concern over them will grow over time, and poorly engineered solutions will start to break down (usually because, at the point data quality is 'recognized', there is often such a large volume of 'bad' data that going forward is untenable: the old data can't be 'fixed', and without that, rules are hard to introduce for new records which will coexist with it). This assumes that the ability to update old records is still required.
Nope, sorry to say you are incorrect, on both accounts.
1) Right about everything, except that the PK can change if you want it to.
2) A unique index is, by definition, unique; it cannot be repeated. What you mean is a plain old index, not unique, which can be repeated. Its purpose is to speed up querying if you filter often by that field. Otherwise it's better not to use it.
What you want: Column1 = Primary Key (not null), Column2 = Unique Index (not null), exactly what you said, but now you know why it does work as you need it to.
EDIT: Also, it seems you make a correlation between indexes and non-nullables. You can make a column non-nullable independently of whether it is an index or not.
Totally agree with Michael: your primary key column should not contain any meaningful data, especially something like userID. So you should add another column for the PK and fill it from a sequence.
Also agree with Darhazer: you should put a not null constraint and a unique index on both the userid and username fields.
I'm building a new application that has a number of data objects, and each one needs "history" or notes. In the past I have just created one database table called notes and had a number of foreign keys attached to the different objects. This time I would like others' thoughts. Is it good practice/efficient to use one table with ever-increasing auto_inc IDs, or should I maintain different [object]_notes type tables?
N.B. The Notes object itself would always be the same: subject, text, date, etc.
I'd use only 1 table. I assume we're not talking gazillions of history notes?
If not, then 1 table is just fine
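For example, a rough sketch of the single-table layout the question describes, with one nullable FK per object type (the customers and orders tables are hypothetical stand-ins for your own objects):

CREATE TABLE notes (
    note_id     INT UNSIGNED NOT NULL AUTO_INCREMENT,
    subject     VARCHAR(200) NOT NULL,
    note_text   TEXT,
    note_date   DATETIME     NOT NULL,
    customer_id INT UNSIGNED NULL,   -- exactly one of these FK columns is set per row
    order_id    INT UNSIGNED NULL,
    PRIMARY KEY (note_id),
    FOREIGN KEY (customer_id) REFERENCES customers (id),
    FOREIGN KEY (order_id)    REFERENCES orders (id)
);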
I think the question you are asking is if an auto incrementing ID is as good a primary key as composite natural keys, or a key composed of 2 entities.
Unless you have a good reason to do so, I would stick to the autoincrement primary key; it has a unique index and is thus optimized for read lookups. You can still do an index on composite keys. Some actually prefer it that way, as it can be argued that it makes the relationships clearer and cleaner by not having the extra column on each table, but for small applications and datasets I don't worry about that and just use the autoincrement option.
I have some MySQL tables that have auto-incrementing IDs that are primary keys, but I notice that I never actually use them... I used to think that every table must have a primary key, so I guess that is why I created them before. Should I remove them all if I don't use them at all?
Unless you are running into space problems I wouldn't remove them.
They are a life saver in case you by mistake (or oversight) populate the database with repeated/wrong data.
They also help to have related tables, where you reference the content of one table through the autogenerated id.
This is assuming you have indexes for the other columns you use to actually query the data (if you don't, then more reason to keep the autoincrement ids and use them!).
No.
You should keep them; a database always needs something that differentiates a row from another row (a "Key" of some sort).
If you have something that is guaranteed to be unique for each row, then you can use that as a key; otherwise keep the Primary Key and the Auto generated ID.
I'd personally keep them. They will be especially useful at a later date if you expand the database design and need to reference this table.
Interesting!...
I seem to hold a minority opinion here, getting both upvoted and downvoted to currently an even 0, yet no one in the majority opinion (see responses above) seems to make much of a case for keeping the id field, and the downvoters didn't even bother leaving comments hinting at why doing away with the id is such a bad idea.
In their defense, my own original response did not include any strong argument as to why it is OK to do away with the id attribute in some cases (which seem to apply to the OP). Maybe such a gratuitous response makes it, in and of itself, a downvotable response.
Please do educate me, and the OP, by leaving comments pro or against the _systematic_ (and I stress "systematic") need to include auto-incremented non-semantic primary keys in all tables. As promised, I returned and added to my response a list of reasons why it may be detrimental to [again, systematically] impose an auto-incremented PK.
My original response:
You bet! You can remove these!
Before you do anything to the database, make sure you have a backup, in particular if the DB size is significant.
Use the ALTER TABLE statement to remove the id in the tables where you want to remove it. Specifically:
ALTER TABLE myTable DROP COLUMN id
(you also need to remove the PK constraint before removing the id, if the table has such a constraint)
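Following that note, a hedged MySQL sketch of the full sequence (assuming id is an INT and no foreign keys reference it):

ALTER TABLE myTable MODIFY id INT NOT NULL;   -- strip AUTO_INCREMENT first; MySQL requires auto columns to be keyed
ALTER TABLE myTable DROP PRIMARY KEY;         -- then drop the PK constraint
ALTER TABLE myTable DROP COLUMN id;           -- and finally the column itself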
EDIT (Added later)
There are many cases where it just doesn't make sense to carry along an auto-incremented ID key, regardless of the relatively little extra storage these keys require.
In all these cases, the underlying implication is that
either the data itself supplies a primary key,
or, the application manages the key generation
The key supplied "natively" in the data doesn't necessarily need to be a single-column key; it can be a composite key, although in these cases one may wish to study the situation more closely, particularly if the overall key is a bit long.
Here are some of the drawbacks of using an auto-incremented primary key in lieu of a native or application-supplied key:
The effective data integrity may go unchecked
i.e. the server may allow record insertions or updates which create a duplicated [native] key (even though the artificial, auto-incremented primary key hides this reality)
When relying on the auto-incremented PK for the support of joins between tables, when part of the [native] key values have to be updated...
...we either create the need to delete the record in full and re-insert it with the new values,
...or run the risk of keeping outdated/incorrect links.
A common "follow-up" with auto-incremented keys is to create a clustered index on the table for this key.
This does make sense for tables without a native or application-supplied primary key, but not so much for data sets that have such keys.
Effectively this prevents choosing a key for the clustered index which may be more beneficial for the most common query patterns.
Migrating tables with an auto-incremented key can be made more difficult depending on the DBMS (you need to declare the underlying column as a plain integer prior to the copy, then need to restart the auto-increment...)
For narrow tables, i.e. tables with only a few columns, the relative cost of the auto-incremented PK can be significant, and impact performance in a non-negligible fashion.
When inserting new records along with associated records in related tables, the auto-incremented key needs to be obtained after the insertion of the main record, before the related records can be inserted; the logic is simpler when the column values supporting the link are known ahead of time.
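For illustration of that last point, a small MySQL sketch (the invoice tables and columns are hypothetical): with a surrogate key, the generated id must be read back between the two inserts, whereas a known natural key lets both statements be prepared up front.

-- Surrogate key: the parent id is only known after the first insert
INSERT INTO invoices (customer_code, invoice_date) VALUES ('ACME', CURDATE());
INSERT INTO invoice_lines (invoice_id, sku, qty) VALUES (LAST_INSERT_ID(), 'SKU-1', 2);

-- Natural/application-supplied key: both rows can be built ahead of time
INSERT INTO invoices (invoice_code, customer_code, invoice_date) VALUES ('INV-1001', 'ACME', CURDATE());
INSERT INTO invoice_lines (invoice_code, sku, qty) VALUES ('INV-1001', 'SKU-1', 2);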
To summarize, the idea that so long as the storage can carry the [relatively minimal] extra "weight" of the artificial primary key, we should include and use such a key, is not without drawbacks of its own.
A final consideration is that just like it is rather easy to remove such keys when we don't need them, they too can be easily added, post-facto, when/if it becomes apparent that they are useful in a particular situation. Neither form of refactoring (adding vs. removing the auto-incremented columns) is risk free, but neither is a major production either.
Yes, if you can figure out another primary key.
There is obviously a flaw in your table design. For example, say you had a table like
relation_id (PK), parent_id, child_id.
If the combination of parent_id and child_id is known to be unique, then you can make parent_id + child_id the primary key, and then drop the column relation_id.
There may be endless other possible cases, but just bear in mind that the primary key helps you locate data quickly, as well as helping your design make sense.
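In MySQL terms, that change is roughly the following (names taken from the example above; depending on your MySQL version you may need to split it into separate statements):

ALTER TABLE relation
    DROP COLUMN relation_id,                    -- dropping the column also drops its PK/auto_increment
    ADD PRIMARY KEY (parent_id, child_id);      -- both columns must be NOT NULL and jointly unique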
I have a table which needs 2 fields. One will be a foreign key, the other is not necessarily unique. There really isn't a reason that I can find to have a primary key, other than having read that "every single table ever needs, needs, needs a primary key".
Edit:
Some good thoughts in here.
For clarity's sake, I will give you an example that is similar to my database needs.
Let's say I have a table with product type, quantity, cost, and manufacturer.
Product type will not always be unique (say, MP3 Player), but manufacturer/product type will be unique (say, Apple MP3 Player). Forget about the various models the manufacturers make for this example. For ease, this table has an autoincrementing primary key.
I am giving a point value and logging how often these products are searched for, added to a cart, and bought for display on a list of hot items.
The way I have it laid out currently is in a second table with a FK pointing to the main table, and a second column for the total number of "popularity points" this item has gained.
The answers I have seen here have made me think that perhaps I should just add a "points" column to my primary products table so that I could just track it there... but that seems like I'm not normalizing my database enough.
My problem is I'm currently mostly just a hobbyist doing this for learning, and don't have the luxury of a DBA to tell me how to set up my tables, so I have to learn both the coding side and the database side.
You have to distinguish between primary key and surrogate key. Auto-incremented column would be a particular case of the latter. Your question, therefore, is twofold:
Does every table need to have a primary key?
Does every table need to have a surrogate primary key?
The answer to first question is YES except in some special cases (association table for many-to-many relationship arguably being an example of such a special case). The reason for this is that you usually need to be able (if not right now then in the future) to consistently address individual rows of that table - for updates / deletion, for example.
The answer to the second question is NO. If your table represents a core business entity, or if it can be referenced from a many-to-one association, having a surrogate key is probably a good idea; but it's not absolutely necessary.
It's somewhat unclear what your table's function is; from your description it sounds like it has "collection of values" semantics (FK to "main" table + value). Certain ORMs don't support surrogate keys in such circumstances; if that's what has prompted your question, it's OK to leave the surrogate (or, in the case of a bag, even the primary) key off.
For the sake of having something unique and as identifier, please please please please have a primary key in every table :)
It also helps forward compatibility in case there are future schema changes and two values are no longer unique. Plus, memory is much cheaper now; feel free to use it as an investment. ;)
I am not sure what the other field looks like, but I am guessing it would be OK to have a composite primary key based on the FK and the other field; then again, I don't know your exact scenario.
I would say that it's absolutely necessary to have some sort of primary key in every table.
Interestingly enough, one of the DBAs for a Viacom property once told me that there was really no discernible difference in using an INT UNSIGNED or a VARCHAR(n) as a primary key in MySQL. This was in reference to a user table with more than 64 million rows. I believe n can be decently large (<=100), but I forget what they limited it to. Unfortunately, I don't have any empirical data to back that up.
You don't HAVE to have a primary key on every table, but it is considered best practice to have them as they are almost always necessary on a normalized relational database design. If you're finding a bunch of tables you don't think need PKs, then you should revisit the design/layout of your tables. To read more on normalization see here.
A couple of scenarios I can think of where you may not need or want a PK on a table: a table used strictly for logging (to limit the performance degradation of writing the log and maintaining a unique index), and a table that just stores data to pump through an application for test purposes.
I'll be contrary and say you shouldn't add the key if you don't have a reason for it. It is very easy to add this column later if needed.
Strictly speaking, a surrogate key is not necessary, but a primary key is.
Many people use the term "primary key" to mean a single column that is an auto-incrementing integer. But this is not an accurate definition of a primary key.
A primary key is a constraint on one or more columns that serve to identify each row uniquely. Yes, you need some way of addressing individual rows. This is a crucial characteristic of a relation (aka a table).
You say you have a foreign key and another column that is not unique. But are these two columns taken together unique? If so, you can declare a primary key constraint over these two columns.
Defining another surrogate key (also called a pseudokey -- the auto-incrementing type) is a convenience because some people don't like to have to reference two columns when selecting a single row. Or they want the freedom to change values in the other columns easily, without changing the value of the primary key by which one addresses the individual row.
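As a quick sketch of declaring a primary key over the two columns together (the table and column names are made up, and this only works if the pair really is unique):

CREATE TABLE item_tags (
    item_id INT UNSIGNED NOT NULL,
    tag     VARCHAR(50)  NOT NULL,
    PRIMARY KEY (item_id, tag),                  -- the two columns taken together
    FOREIGN KEY (item_id) REFERENCES items (id)
);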
This is a technique related to normalization and a pretty good practice. A key made up of an auto incrementing number has many benefits:
You have a PK that does not pertain to the data.
You never have to change the PK value
Every row will automatically have a unique identifier
I am not very familiar with databases and the theories behind how they work. Is it any slower from a performance standpoint (inserting/updating/querying) to use Strings for Primary Keys than integers?
For example, I have a database that would have about 100 million rows, with columns like mobile number, name, and email. Mobile number and email would be unique, so can I have the mobile number or email as a primary key?
Will it affect my query performance when I search based on email or mobile number? Similarly, the primary key will be used as a foreign key in 5 or 6 tables, or even more.
I am using MySQL database
Technically yes, but if a string makes sense to be the primary key then you should probably use it. This all depends on the size of the table you're making it for and the length of the string that is going to be the primary key (longer strings == harder to compare). I wouldn't necessarily use a string for a table that has millions of rows, but the amount of performance slowdown you'll get by using a string on smaller tables will be minuscule compared to the headaches that you can have by having an integer that doesn't mean anything in relation to the data.
Another issue with using strings as a primary key is that, because the index is constantly kept in sequential order, when a new key is created that falls in the middle of the order, the index has to be re-sequenced; if you use an auto-number integer, the new key is just added to the end of the index.
Inserts to a table having a clustered index, where the insertion occurs in the middle of the sequence, DO NOT cause the index to be rewritten. They do not cause the pages comprising the data to be rewritten. If there is room on the page where the row will go, then it is placed in that page. The single page will be reformatted to place the row in the right place in the page. When the page is full, a page split will happen, with half of the rows on the page going to one page, and half going to the other. The pages are then relinked into the linked list of pages that comprise the data of a table with a clustered index. At most, you will end up writing two pages of the database.
Strings are slower in joins and in real life they are very rarely really unique (even when they are supposed to be). The only advantage is that they can reduce the number of joins if you are joining to the primary table only to get the name. However, strings are also often subject to change thus creating the problem of having to fix all related records when the company name changes or the person gets married. This can be a huge performance hit and if all tables that should be related somehow are not related (this happens more often than you think), then you might have data mismatches as well. An integer that will never change through the life of the record is a far safer choice from a data integrity standpoint as well as from a performance standpoint. Natural keys are usually not so good for maintenance of the data.
I also want to point out that the best of both worlds is often to use an autoincrementing key (or in some specialized cases, a GUID) as the PK and then put a unique index on the natural key. You get the faster joins, you don't get duplicate records, and you don't have to update a million child records because a company name changed.
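A minimal MySQL sketch of that pattern (the companies table and its columns are made up for illustration):

CREATE TABLE companies (
    company_id   INT UNSIGNED NOT NULL AUTO_INCREMENT,   -- surrogate PK referenced by child tables
    company_name VARCHAR(100) NOT NULL,                   -- natural key, kept unique separately
    PRIMARY KEY (company_id),
    UNIQUE KEY uq_company_name (company_name)
);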
Too many variables. It depends on the size of the table, the indexes, nature of the string key domain...
Generally, integers will be faster. But will the difference be large enough to care? It's hard to say.
Also, what is your motivation for choosing strings? Numeric auto-increment keys are often so much easier as well. Is it semantics? Convenience? Replication/disconnected concerns? Your answer here could limit your options. This also brings to mind a third "hybrid" option you're forgetting: GUIDs.
It doesn't matter what you use as a primary key so long as it is UNIQUE. If you care about speed or good database design use the int unless you plan on replicating data, then use a GUID.
If this is an access database or some tiny app then who really cares. I think the reason why most of us developers slap the old int or guid at the front is because projects have a way of growing on us, and you want to leave yourself the option to grow.
Don't worry about performance until you have got a simple and sound design that agrees with the subject matter that the data describes and fits well with the intended use of the data. Then, if performance problems emerge, you can deal with them by tweaking the system.
In this case, it's almost always better to go with a string as a natural primary key, provided you can trust it. Don't worry that it's a string, as long as the string is reasonably short, say about 25 characters max. You won't pay a big price in terms of performance.
Do the data entry people or automatic data sources always provide a value for the supposed natural key, or is it sometimes omitted? Is it occasionally wrong in the input data? If so, how are errors detected and corrected?
Are the programmers and interactive users who specify queries able to use the natural key to get what they want?
If you can't trust the natural key, invent a surrogate. If you invent a surrogate, you might as well invent an integer. Then you have to worry about whether to conceal the surrogate from the user community. Some developers who didn't conceal the surrogate key came to regret it.
Indices imply lots of comparisons.
Typically, strings are longer than integers and collation rules may be applied for comparison, so comparing strings is usually a more computationally intensive task than comparing integers.
Sometimes, though, it's faster to use a string as a primary key than to make an extra join with a string-to-numerical-id table.
Two reasons to use integers for PK columns:
We can set an identity on an integer field, which is incremented automatically.
When we create PKs, the DB creates an index (clustered or non-clustered) which sorts the data before it's stored in the table. By using an identity on a PK, the optimizer need not check the sort order before saving a record. This improves performance on big tables.
Yes, but unless you expect to have millions of rows, not using a string-based key because it's slower is usually "premature optimization." After all, strings are stored as big numbers while numeric keys are usually stored as smaller numbers.
One thing to watch out for, though, is if you have clustered indices on any key and are doing large numbers of inserts that are non-sequential in the index. Every row written will cause the index to be rewritten. If you're doing batch inserts, this can really slow the process down.
What is your reason for having a string as a primary key?
I would just set the primary key to an auto incrementing integer field, and put an index on the string field.
That way if you do searches on the table they should be relatively fast, and all of your joins and normal look ups will be unaffected in their speed.
You can also control the amount of the string field that gets indexed. In other words, you can say "only index the first 5 characters" if you think that will be enough. Or if your data can be relatively similar, you can index the whole field.
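MySQL supports this with a prefix index. A small sketch with made-up names, indexing only the first 5 characters of the string column:

CREATE TABLE items (
    item_id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    item_code VARCHAR(100) NOT NULL,
    PRIMARY KEY (item_id),
    KEY idx_item_code_prefix (item_code(5))   -- omit the (5) to index the whole field instead
);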
From a performance standpoint: yes, a string PK will slow down performance compared to an integer PK (PK = primary key).
From a requirements standpoint: although this is not part of your question, I would still like to mention it. When we are handling huge amounts of data across different tables, we generally look for the probable set of keys that can be set for a particular table. This is primarily because there are many tables, and mostly each table is related to others through some relation (the concept of a foreign key). Therefore we really cannot always choose an integer as a primary key; rather, we go for a combination of 3, 4 or 5 attributes as the primary key for that table. Those keys can then be used as a foreign key when we relate the records to some other table. This makes it useful to relate the records across different tables when required.
Therefore, for optimal usage, we always make a combination of 1 or 2 integers with 1 or 2 string attributes, but again only if it is required.
I would probably use an integer as your primary key, and then just have your string (I assume it's some sort of ID) as a separate column.
CREATE TABLE sample (
    sample_pk INT NOT NULL AUTO_INCREMENT,
    sample_id VARCHAR(100) NOT NULL,
    ...
    PRIMARY KEY (sample_pk)
);
You can always do queries and joins conditionally on the string (ID) column (where sample_id = ...).
There is a common misunderstanding about strings in databases. Almost everyone assumes that the database representation of numbers is more compact than that of strings, thinking that numbers are stored in the database the same way they are represented in memory. But that is not true: in most cases the stored representation of a number is closer to a string-like representation than to the in-memory one.
The speed of using a number or a string depends more on the indexing than on the type itself.
By default, ASPNetUserIds are 128-character strings, and performance is just fine.
If the key HAS to be unique in the table, it should be the key. Here's why:
A primary string key = correct DB relationships, 1 string key (the primary) and 1 string index (the primary).
The other option is a typical int key, but if the string HAS to be unique you'll still probably need to add an index, because of non-stop queries to validate or check that it's unique.
So using an int identity key = incorrect DB relationships, 1 int key (primary), 1 int index (primary), probably a unique string index, and manually having to validate that the same string doesn't exist (something like a SQL check, maybe).
To get better performance using an int over a string for the primary key, when the string HAS to be unique, it would have to be a very odd situation. I've always preferred to use string keys. And as a good rule of thumb, don't denormalize a database until you NEED to.