I think I should be counted as database newbie, so read the question as a newbie question. I currently create a table, which holds environment variables for a number of hosts, like this:
create table envs (
host varchar(255),
envname varchar(255),
envvalue varchar(8192),
PRIMARY KEY(host, envname)
);
Very simple, one table holding all the data I need. Common operation is to get all the environment variables for a given host, another is to get a given environment variable for a given host, third example operation would be to get a given environment variable for all hosts and list duplicates.
Performance is not expected to be an issue, it's going to be maybe tens of hosts, dozens of variables per host, average max 1 query per second.
Now I've read that having composite primary key is not necessarily a good idea. Is this true for above use case? If it is true, how should I change the database design? If not, is the above one-table database fine for the purposes I listed above?
I don't see a problem here with the primary key. The semantics of a primary key is to uniquely identify the non-key attribute values for the key values. As I assume that for one host and one envname there is at most one envvalue the primary key makes perfect sense.
It could be that some people argue against composite primary keys because they are afraid of performance issues. However performance considerations should never influence the choice of the primary key. Many database systems automatically create an index structure for the primary key; the choice of this index structure can influence performance. However this choice can mostly be changed manually and should be done at a later point if you really have performance issues.
Your one-table design and choice of primary key is fine.
Now I've read that having composite primary key is not necessarily a good idea. Is this true for above use case?
No. Use a composite primary key on (host, envname).
If it is true, how should I change the database design?
N/A.
If not, is the above one-table database fine for the purposes I listed above?
Yes: it's known as the Entity–Attribute–Value model.
It's a bad idea, because you store unique values (host, envname) several times.
What if you were to change the hostname from srv01 to *srv01_new*? You'd have to change every ocurrence of srv01 in your table. And what if, some day, you decide you need to create a new table that holds additional information about every single host.
Now, if you change the hostname, you have to change those information as well.
To get to your question: It's not an issue of performance, but of normalization.
Databases should generally be normalized as far as possible. If you are intrigued enough, read on.
You should create one table for your hosts, having a unique id (int) as primary key and a unique (index) name as the hostname.
Your table should then only reference the id of the host, not the name. This way, your hostname is only stored once in your whole database and can be altered to whatever you desire, without breaking other tables.
If your environment names are unique, too, you should create another table for those, having the same layout as the hosts table (id, name).
Your combination table then stores the id of the host and the id of the environment, along with the value. You must of course keep the combined primary key, so every combination of host/environment is unique and easily indexable.
Then, you have a many-to-many-relationship with additional attributes and perfect normalization.
Related
I'm helping with a Rails application, the intent is for that application to be multi-tenanted. What this means is that there will be data from multiple users/organisations in the database tables, and often the access path will be along the lines of "get me all the data for my organisation".
We're using MYSQL as the database.
Rails by default creates a primary key on the table using the id column. The id column is auto-incremented. This is nice in some ways - rows are always added at the end of the table. However, consider the following situation:
An object called foo. A foo has an id, and always has an
organisation_id
Over time each organisation creates foos in the database, these foos
are interleaved throughout the table (they are stored in id sequence)
A use case that involves listing all foos for this organisation
The problem I have is that the foos for an organisation are not located closely together in the database, in fact they're spread around very sub-optimally. Ideally I'd create a primary key of (organisation_id, id) on the table, which would result in all foos for a given organisation being side by side in the table.
Unfortunately, when I do this Rails gives me an 'Unknown primary key for table foos in model Foo' error. I think I could deal with this by using the composite keys gem to rails, but it seems like there should be some way to make this transparent at the database level.
Is there an alternate approach?
For reference, the command on the database to change my index was:
ALTER TABLE foos ADD KEY (id); # needed because the id column is auto-increment
ALTER TABLE foos DROP PRIMARY KEY, ADD PRIMARY KEY(organisation_id, id);
EDIT 1: A blog post that indicates success doing exactly this with composite_primary_keys gem. Which gives me a bit more confidence with that approach, problem is that it's from 2008, so things may have moved on. http://www.joehruska.com/?p=6
EDIT 2: Another option I was considering was partitioning instead - the number of organisations probably wouldn't exceed the maximum partitions, and I could probably group them a bit without losing too much benefit. Unfortunately, the key quote is every unique key on the table must use every column in the table's partitioning expression. (This also includes the table's primary key - from the MYSQL manual http://dev.mysql.com/doc/refman/5.6/en/partitioning-limitations-partitioning-keys-unique-keys.html.
So I'm still back needing a composite primary key again. I'm a little surprised that Rails cares so much about the primary key, rather than simply that a key is present.
If you don't want to use composite_primary_keys then you may be stuck just relying on a standard index on :organisation_id or [:organisation_id, :id]
My understanding is that Rails cares about PrimaryKeys so much because of the assumptions is makes with relationships between models. Perhaps it should be improved, you could always suggest it as a future feature.
I think I understand primary keys and indexes.
In my setup, I have a table with several columns. Two of these columns are User ID, and Username.
Ideally I would like both to be unique, and non nullable.
As far as I can tell, my best use would be to have the User ID as the primary key, as this is the most important field not to NULL, and it will never change as the database grows.
I would then have to have the username column as a unique index, so that it can be the same on another row, although unfortunately, could end up NULL.
This is what I will do unless there is a way to have both columns as unique and non NULLABLE?
You can declare the Username column as NOT NULL and put an unique index on it. Although the index itself won't force not-null values, the field definition will, so it will be effectively a unique non-nullable field.
From both my application development and datawarehouse experience I would recommend having a separate primary key that is not used in any business setting and do not use User ID as the primary key. Using UserID as the primary key can lead to a whole host of problems. I would index each column (separately).
Anytime you need to merge or reassign a user or change their ID, etc, having actually used their userID as the primary key will lead to a lot of problems for those operations.
Also, on the web, this will open up people seeing URL's like ....user/1/details and then potentially being able to change the '1' to a '2' (for example) and seeing other peoples info. It is better if the ID is unique like '57489574389ghfjghfjghf' and then it's harder to hack URLs with.
The choice between a 'natural' and a 'surrogate' key is explained well here:
http://www.agiledata.org/essays/keys.html
Most of the problems people experience in this area are for edge cases such as merges and deletes. These are usually of low priority initially but concern over them will grow over time and poorly engineered solutions will start to break down (usually because at the point that data quality is 'recognized' there is often such a large volume of 'bad' data that going forward is untenable - the old data can't be 'fixed' and without that rules are hard to introduce for new records which will co-exist with them. This assumes that the ability to update old records is still required.
Nop, sorry to say you are incorrect, on both accounts.
1) Right about everything, except that the PK can change if you want it to.
2) Unique index is, by definition, unique, it cannot be repeated. What you mean is a plain old index, not unique, which can be repeated. Its purpose is to speed up querying if you filter often by that field. Otherwise is better not to use it.
What you want: Column1 = Primary Key (not null), Column2 = Unique Index (not null), exactly what you said, but now you know why it does work as you need it to.
EDIT: Also, it seems you make a corelation between indexes and non-nullables. You can make a column non-nullable, independently of whether it is an index or not.
Totally agree with Michael, your primary key column should not contain any meaningful data, especially like userID. So you should add another column for the PK and fill it from a sequence.
Also agree with Darhazer: you should put a not null constraint and a unique index on both the userid and username fields.
This question is geared towards MySQL, since that is what I'm using -- but I think that it's probably the same or similar for almost every major database implementation.
How do keys work in a database? By that I mean, when you set a field to 'primary key', 'unique key' or an 'index' -- what do each of these do, and when should I use each one?
Right now I have a table containing a few fields, one of them being a GUID (minus the { and } around it). I set the GUID field to the primary key and I see that it created a binary tree. So it improves search performance -- but what differentiates that from other types of keys?
I realize this may not really be programming related (although it is development related) -- I wasn't sure where exactly to ask this but SO is what I use the most so I'll ask here. Migrate as necessary
There are probably hundreds of references for this elsewhere on the web, so a bit of Googling will help you get deep into understanding DB design. That said, the basic gist is:
primary key: a field or combination of fields which must be unique for each row, and which is/are indexed to provide rapid lookup of a row given a key value; cannot contain NULL, and a table can only have one primary key. Generally indexed in a clustered index, which means that the data in the table is reordered to match the order of the index, a process that greatly improves serial data retrieval. (This is the main reason a table can only have one primary key -- the order of the data can't match the order of more than one index!)
unique key: same as a primary key, but on some DB platforms, can contain NULL values so long as they don't violate the uniqueness constraint. (In other words, if the unique key contains a single column, there can only be one row in the table with NULL in that column; if the key contains more than one column, then the table can only contain rows with NULLs in the columns such that there's no non-unique duplication of NULL values across the columns in the key.) On other platforms (including MySQL), unique constraints can contain multiple NULLs; the uniqueness constraint only applies to non-NULL values of the referenced columns. There can be more than one of these per table. Indexed in a non-clustered index.
index: a field or combination of fields which are pre-indexed for more rapid retrieval given a value for the field(s) in the index. A table can have more than one index.
When you define a primary key, the database creates an index based on that key. It needs to be unique. In general you can create an index that to speed up access to data based on non-unique query data. The indexed retrieval time for a uniquely keyed data should be better than for non-uniquely keyed indexes, so I try to use unique indexes where possible.
At the most basic, primary keys represent how the records will be physically stored in memory / on disk, you would want the unique field you're going to search on the most to be this as it will greatly reduce searching.
Unique key's are fields that can only contain unique values.
An index is a specialized "map" to the database file that queries can reference.
These are extremely simplified answers, but I think that's the gist of it.
One more thing, any key is essentially a separate table that is sorted by the index that points directly to the row(s) that match the key.
A BTree style index is stored in a balanced tree, a balanced tree is a tree structure where traveling left is smaller and traveling right is larger.
5
3 7
2 4 6 8
Would be an example of a balanced tree. The other major type is a Hash, where a mathematical expression turns the key into the relative memory location of the key.
In order to really understand keys, you have to understand them at three levels: conceptual, logical, and physical. I'm going to reverse my habitual order, and discuss physical first.
Most programmers tend to think at the physical level. At the physical level, a key is a surrogate (stand-in) for the address of a row. When a row is to be referenced, a copy of the key can be used to specify the row. When a reference to a row is made in another row, the copy is known as a foreign key.
Most experienced programmers have a thorough understanding of pointers and addresses, and would understand exactly how the data structure worked if only it used pointers and addresses. Before the relational databases became dominant, there were in fact databases that used pointers to records embedded in other records to tie the data together.
A disadvantage to using keys instead of pointers is that the DBMS has to use an index to translate a key reference back to a pointer in order to retrieve the row in question. An advantage is that the level of indirection allows the DBMS to shuffle all the rows in a table for whatever purpose, as long as the DBMS updates all the relevant indexes accordingly.
Viewed at this level, keys might as well be simple, integer, and autoincremented. These work faster than other kinds of keys, and they sidestep certain data management issues that arise when user supplied data is missing or inconsistent. However, sidestepping data management issues at this level can create a minefield at the two higher levels.
At the logical level, a key is a minimal subset of the data in a tuple (row) that allows a single matching tuple to be specified, and when the DBMS retrieves the container for that tuple, all the attributes in the tuple are now available. Every relation has at least one candidate key. In the worst case, the entire tuple is the only candidate key. When multiple candidate keys exist for a single relation (table), common practice is to choose one candidate key as the primary key, and to make all references via this primary key.
(Actually, relation and table are not synonymous, but I'm simplifying here. Likewise, tuple and row are not synonymous, although they look identical at first glance.)
The primary reason to declare a primary key is to rule out duplicate keys or missing keys.
Sometimes database people choose to leave duplicate and missing key avoidance up to the programmers whose applications write to the database. More commonly, a primary key constraint serves to reflect an error back to a program that violates a primary key constraint.
When a DBMS sets up a primary key constraint, it also builds an index on the primary key. This allows the DBMS to find duplicates quickly, and it also speeds up certain queries that use the key column(s).
At the conceptual level, keys are the means by which the user community identifies instances of entities, whether those entities are persons (employees, travellers, etc.), things (bank accounts, hotel rooms, etc.) or whatever. The key is data and the entity identified by the key is not data. The key can thus be seen a surrogate for the entity in the database.
At the conceptual level, keys are always natural, and never automatically supplied by the system. However, in the real world, keys are often mismanaged, and the consequences of mismanagement are overcome by what is called "common sense". Instilling common sense into an automated system is generally not feasible.
I never really described an index in the above, but it's implicit in what I said. An index is a data structure that serves to map from a key to a pointer. In all the databases you are likely to use, indexes are declared by the database builder (or perhaps a DBA) and managed by the DBMS.
I have some mysql tables that have auto incrementing id's that are primary keys, but I notice that I never actually use them... I used to think that every table must have a primary key so I guess that is why I created them before. Should I remove them all if I don't use them at all?
Unless you are running into space problems I wouldn't remove them.
They are a life saver in case you by mistake (or oversight) populate the database with repeated/wrong data.
They also help to have related tables, where you reference the content on one table through the autogenerated id.
This is assuming you have indexes for the other columns you use to actually query the data (if you don't, then more reason to keep the autoincrement ids and use them!).
No.
You should keep them; a database always needs something that differentiates a row from another row (a "Key" of some sort).
If you have something that is guaranteed to be unique for each row, then you can use that as a key; otherwise keep the Primary Key and the Auto generated ID.
I'd personally keep them. They will be especially useful at a later date if you expand the database design and need to reference this table.
Interesting!...
I seem to hold a minority opinion here, getting both upvoted and downvoted to currently an even 0, yet no one in the majority opinion (see responses above) seems to make much of a case for keeping the id field, and the downvoters didn't even bother leaving comments hinting at why doing away with the id is such a bad idea.
In their defense, my own original response did not include any strong argument as to why it is ok to do away with the id attribute in some cases (which seem to apply to the OP). Maybe such a gratuitous response makes it, in of itself, a downvotable response.
Please do educate me, and the OP, by leaving comments pro or against the _systematic_ (and I stress "systematic") need to include auto-incremented non-semantic primary keys in all tables. A promised I returned and added to my response to provide a list of reasons why it may be detrimental to [again, systematically] impose a auto-incremented PK.
My original response:
You bet! you can remove these!
Before you do anything to the database make sure you have a backup, in particular is the DB size is significant.
Use the ALTER TABLE statement to remove the id in the tables where you want to remove it. Specifically
ALTER TABLE myTable DROP COLUMN id
(you also need to remove the PK constraint before removing the id, if the table has such a constraint)
EDIT (Added later)
There are many cases where it just doesn't make sense to carry along an autoincremented ID key, regardless of the relative little extra storage requirement these keys add.
In all these cases, the underlying implication is that
either the data itself supplies a primary key,
or, the application manages the key generation
The key supplied "natively" in the data doesn't necessarily neeeds to be a single column key, it can be a composite key, although in these cases one may wish to study the situation more closely, particularly is the overal key is a bit long.
Here are some of the drawbacks of using an auto-incremeted primary key in lieu of a native or application-supplied key:
The effective data integrity may go unchecked
i.e. the server may allow record insertions of updates which create a duplicated [native] key (eventhough the artificial, autoincremented primary key hides this reality)
When relying on the auto-incremented PK for the support of joins between tables, when part of the [native] key values have to be updated...
...we either create the need of deleting the record in full and and re-insert it with the news values,
...or the risk of keeping outdated/incorrect links.
A common "follow-up" with auto-incremented keys is to create a clustered index on the table for this key.
This does make sense for tables without an native or application-supplied primary key, so so much for data sets that have such keys.
Effectively this prevents choosing a key for the clustered index which may be more beneficial for the most common query patterns.
Migrating tables with an auto-incremented key can made more difficult depending on the DBMS (need to declare the underlying column as plain integer, prior to copy, then need start again the autoincrement...)
For narrow tables, i.e. tables with a few columns only, the relative cost of the auto-incremented PK can be significant, and impact performance in a non negligible fashion.
When inserting new records along with associated records in related tables, the auto-incremented key needs to be obtained after the insertion of the main record, before the related records can be inserted; the logic is simpler when the column values supporting the link are known ahead of time.
To summarize, the idea that so long as the storage can carry the [relatively minimal] extra "weight" of the artificial primary key, we should include and use such a key, is not without drawbacks of its own.
A final consideration is that just like it is rather easy to remove such keys when we don't need them, they too can be easily added, post-facto, when/if it becomes apparent that they are useful in a particular situation. Neither form of refactoring (adding vs. removing the auto-incremented columns) is risk free, but neither is a major production either.
Yes, if you can figure out another primary key.
There is obviously a flaw of your table design. For example, you had a table like
relation_id(PK), parent_id, child_id .
It is known that the combination of parent_id and child_id is unique, then you can assign the primary key to be parent_id + child_id, and then drop the column relation_id.
There should may endlessly other possible cases, but just bear in mind that primary key is helping you to locate data quickly, as well as helping you have your design making sense.
I have a table which needs 2 fields. One will be a foreign key, the other is not necessarily unique. There really isn't a reason that I can find to have a primary key other than having read that "every single tabel ever needs needs needs a primary key".
Edit:
Some good thoughts in here.
For clarity's sake, I will give you an example that is similar to my database needs.
Let's say have a table with product type, quantity, cost, and manufacturer.
Product type will not always be unique (say, MP3 Player), but manufacturer/product type will be unique (say, Apple MP3 Player). Forget about the various models the manufacturers make for this example. For ease, this table has a autoincrementing primary key.
I am giving a point value and logging how often these products are searched for, added to a cart, and bought for display on a list of hot items.
The way I have it layed out currently is in a second table with a FK pointing to the main table, and a second column for the total number of "popularity points" this item has gained.
The answers have seen here have made me think that perhaps I should just add a "points" column to my primary products table so that I could just track there... but that seems like I'm not normalizing my database enough.
My problem is I'm currently mostly just a hobbyist doing this for learning, and don't have the luxury of a DBA to tell me how to set up my tables, so I have to learn both the coding side and the database side.
You have to distinguish between primary key and surrogate key. Auto-incremented column would be a particular case of the latter. Your question, therefore, is twofold:
Does every table need to have a primary key?
Does every table need to have a surrogate primary key?
The answer to first question is YES except in some special cases (association table for many-to-many relationship arguably being an example of such a special case). The reason for this is that you usually need to be able (if not right now then in the future) to consistently address individual rows of that table - for updates / deletion, for example.
The answer to the second question is NO. If your table represents a core business entity then OR it can be referenced from many-to-one association, having a surrogate key is probably a good idea; but it's not absolutely necessary.
It's somewhat unclear what your table's function is; from your description it sounds like it has "collection of values" semantics (FK to "main" table + value). Certain ORMs don't support surrogate keys in such circumstances; if that's what has prompted your question it's OK to leave the surrogate (or even primary in case of bag) key off.
For the sake of having something unique and as identifier, please please please please have a primary key in every table :)
It also helps forward compaitability in case there are future schema changes and 2 values are no long unique. Plus, memory are much cheaper now, feel free to use them as investments. ;)
i am not sure how the other field looks like .. but i am guessing that it would be to ok to have a composite primary key , which is based on the FK and the other field .. but then again i dont know your exact scenario.
I would say that it's absolutely necessary to have some sort of primary key in every table.
Interestingly enough, one of the DBA's for a Viacom property once told me that there was really no discernible difference in using an INT UNSIGNED or a VARCHAR(n) as a primary key in MySQL. This was in reference to a user table with more than 64 million rows. I believe n can be decently large (<=100), but I forget the what they limited to. Unfortunately, I don't have any empirical data to back that up.
You don't HAVE to have a primary key on every table, but it is considered best practice to have them as they are almost always necessary on a normalized relational database design. If you're finding a bunch of tables you don't think need PKs, then you should revisit the design/layout of your tables. To read more on normalization see here.
A couple scenarios that I can think of where you may not need or want a PK on a table would be a table strictly for logging. (to limit performance degradation of writing the log and maintaining a unique index) and in the scenario where your just storing data used to pump through an application for test purposes.
I'll be contrary and say you shouldn't add the key if you don't have a reason for it. It is very easy to add this column later if needed.
Strictly speaking, a surrogate key is not necessary, but a primary key is.
Many people use the term "primary key" to mean a single column that is an auto-incrementing integer. But this is not an accurate definition of a primary key.
A primary key is a constraint on one or more columns that serve to identify each row uniquely. Yes, you need some way of addressing individual rows. This is a crucial characteristic of a relation (aka a table).
You say you have a foreign key and another column that is not unique. But are these two columns taken together unique? If so, you can declare a primary key constraint over these two columns.
Defining another surrogate key (also called a pseudokey -- the auto-incrementing type) is a convenience because some people don't like to have to reference two columns when selecting a single row. Or they want the freedom to change values in the other columns easily, without changing the value of the primary key by which one addresses the individual row.
This is a technique related to normalization and a pretty good practice. A key made up of an auto incrementing number has many benefits:
You have a PK that does not pertain to the data.
You never have to change the PK value
Every row will automatically have a unique identifier