Greetings,
I have some MySQL tables that currently use an MD5 hash as a primary key. I normally generate the hash from the value of a column. For instance, let's imagine I have a table called "Artists" with the fields id, name, num_members, and year. I tend to compute md5($name) and use it as an ID.
I would like to know the downsides of doing this. Is it just better to use integers with AUTO_INCREMENT? I tend to run away from that because it never seems worth the trouble of finding out what the last inserted id was, what the next one will be, and so on.
Can you shed some light on this?
Thank you.
If you need a surrogate primary key, using an AUTO_INCREMENT field is better than an md5 hash, because it is fewer bytes of data, and database backends optimize for integer primary keys.
mysql_insert_id can be used if you need the last inserted id.
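For example, at the SQL level (a minimal sketch: the Artists columns are from the question above, the values are made up, and the table is assumed to have an AUTO_INCREMENT id):

INSERT INTO Artists (name, num_members, year)
VALUES ('Example Band', 4, 1990);

SELECT LAST_INSERT_ID(); -- the id MySQL generated for this connection's last insert

mysql_insert_id returns the same value from PHP without issuing an extra query.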
If you are generating the primary key as a hash of other columns, why not just use those other columns as a unique key, then join on those?
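For instance, sticking with the Artists example (assuming name is the column the hash is derived from):

ALTER TABLE Artists ADD UNIQUE KEY uq_name (name);

Other tables can then reference and join on name directly; the unique index keeps those lookups fast.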
Another question is, what are the upsides of using an md5 hash? I can't think of any.
The MD5 isn't a true key in this case because it functionally depends on the name. That means that if you have two artists with the same name, you have duplicate "keys" for different records. You could make it a real key by hashing all the attributes together (and hoping that the probability gods don't send you a collision), or you could just save yourself the trouble and use an autoincrementing ID.
It seems like the way you're trying to use the MD5 isn't really buying you any benefit. If "$name" is unique, then why not just use "name" as the primary key? Calculating an MD5 hash and using it as a key for something that's already unique is redundant.
On the other hand, if "name" is not unique, then the MD5 hash won't be unique either and so it's pointless that way too.
Generally you use an MD5 hash when you don't want to store the actual value of the column. For instance, if you're storing passwords, you generally only store the MD5 hash of the password, not the password itself, so that you can't see people's passwords just by looking at the table contents.
If you don't have any unique fields, then you're stuck doing something like an auto-increment because it's at least guaranteed unique. If you use the built-in SQL auto-increment, then you'll just have to fetch the last one, one way or another. Alternately, if you can get away with keeping a unique counter locally in your application, that avoids having to use auto-increment, but isn't necessarily viable for most applications.
The first approach has one obvious disadvantage: if there are two artists of the same name there will be a primary key collision. Using an INT column with an auto-increment will ensure uniqueness.
Furthermore, though very unlikely, there is a chance that the MD5 hashes of two different strings could collide. An MD5 hash is 128 bits (32 hex characters), so the chance of two given inputs colliding is about 1 in 2^128, i.e. 1 in 16 to the power of 32.
One benefit of a hash is if you present the IDs to customers (say, in a query string for a web form, though that is another no-no): it prevents users from guessing another record's ID.
Personally, I use auto-increment and have had no problems with it (I have moved DBs to new servers and everything kept working).
Related
I have a database where I am using an auto-increment field as the primary key. However, there is another field, studyUID, that will definitely be unique to the table. This studyUID, however, is a string and is typically around 60 characters in length. My thought, perhaps incorrectly, is that it would be faster or better to use the integer from the auto-inc field as the primary key - maybe I thought that is just what you do.
As I'm writing this, I'm thinking maybe this doesn't make sense, as I will almost always search the database using the studyUID field. Based on that logic, maybe it doesn't even make sense to have the auto-inc field.
Would appreciate comments and thoughts on this subject.
I've looked for a satisfying answer a tad more specific to my particular problem for a while now, but to no avail. Whether I'm just not looking in the right places or not, I don't know, but here goes:
I'm pulling data from an application; the data is afterwards manipulated and sent to my own server. Amongst the data pulled is an identifier that was auto-incremented in the application's own database. An example of this identifier I just now retrieved is 955534861. Isn't it better and more effective design not to auto-increment my primary key and just use the value I know is and will always stay unique, or should I look into concepts such as surrogate keys?
Thanks in advance.
The situation you describe resembles my primary job which is maintaining a data warehouse. We get data from other systems and store it.
Something that happens to us is that these "other systems" change. That leads to the possibility that the new version of an "other system" will duplicate a unique identifier from the previous system. We deal with this by adding something to that record in our data warehouse to guarantee its uniqueness. It might be a field to identify the source system, or it might be a date. It is never an autogenerated number.
If there is any chance of this happening to you, you might want to expand your options.
If there is a natural key in your model, you cannot replace it by creating a surrogate key.
You can only add a surrogate key and keep the existing natural key, which has its pros and cons, as described here.
This'll get a little nerdy, but bear with me:
As long as a key value is unique, it'll serve its function. But for performance, you ideally want that key value to be as short as possible.
GUIDs are commonly used because they are statistically highly unlikely to ever be repeated. But that comes at the expense of size: they are 128 bits long, which makes them longer than a machine word. Comparing two GUIDs (as must be done repeatedly when sorting, or when walking down a B-tree index) takes multiple processor instructions to load and compare the values. And they consume more memory when cached.
The advantages of auto-incrementing key values are:
They are guaranteed to be unique. Hash and GUID values are only statistically expected to be unique.
Because they will have full value coverage over the range of their underlying datatype, the most compact possible type may be used. This makes for smaller indexes and more efficient compare operations (illustrated in the sketch after this list).
Because the smallest possible type can be used, more index values can be stored on a single database page, which means you're more likely to get a cache hit when searching or joining on that value. That means that performance will be--all other things being equal--somewhat better.
On most databases, auto-incrementing keys are worked into the database engine, so there is very small overhead in generating them.
If you employ a clustered index on your key value, new record inserts are less likely to require a random disk seek, and more likely to be read during read-ahead, so if you do any kind of sequential processing or lookup based on that key, it'll probably run faster.
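To make the compact-type point concrete, here is a minimal sketch (the table and sizes are illustrative, not from any question above):

CREATE TABLE artist (
    artist_id INT UNSIGNED NOT NULL AUTO_INCREMENT, -- 4 bytes, vs 16 for a GUID
    name VARCHAR(100) NOT NULL,
    PRIMARY KEY (artist_id) -- ids are allocated sequentially, so inserts append
);

INT UNSIGNED covers about 4.2 billion rows; if that is ever too small, BIGINT UNSIGNED is still only 8 bytes.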
The primary key, typically an auto-incrementing ID, is what MySQL uses as a row identifier as well, so it should be left alone. If you need a secondary key that's generated by your application for some other purpose, you may want to add that as another column with a UNIQUE index on it.
In other databases where there's a proper row identifier mechanism, this is less of an issue.
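For a unique application-generated string like the studyUID discussed earlier, that advice could look like this in MySQL (table and column names assumed for illustration):

ALTER TABLE study
    ADD COLUMN study_uid VARCHAR(64) NOT NULL,
    ADD UNIQUE KEY uq_study_uid (study_uid);

The auto-increment id remains the primary key MySQL uses internally, while the UNIQUE index still makes lookups by studyUID fast.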
I've got a database that stores hash values and a few pieces of data about the hash, all in one table. One of the fields is 'job_id', which is the ID for the job that the hash came from.
The problem I'm trying to solve is that with this design, a hash can only belong to one job - in reality a hash can occur in many jobs, and I'd like to know each job in which a hash occurs.
The way I'm thinking of doing this is to create a new table called 'Jobs', with fields 'job_id', 'job_name' and 'hash_value'. When a new batch of data is inserted into the DB, the job ID and name would be created here and each hash would go into here as well as the original hash table, but in the Jobs table it'd also be stored against the job.
I don't like this, because I'd be duplicating the hash column across tables. Is there a better way? I can add to the hash table but can't take away any columns because closed-source software depends on it. The hash value is the primary key. It's MySQL and the database stores many millions of records. Thanks in advance!
Adding the new job table is the way to go. It's the normative practice for representing a one-to-many relationship.
It's good to avoid unnecessary duplication of values. But in this case, you aren't really "duplicating" the hash_value column; rather, you are really defining a relationship between job and the table that has hash_value as the primary key.
The relationship is implemented by adding a column to the child table; that column holds the primary key value from the parent table. Typically, we add a FOREIGN KEY constraint on the column as well.
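A minimal sketch of that, assuming the existing table is named hashes (the question doesn't give its name) and that hash_value is a 32-character MD5 hex string:

CREATE TABLE Jobs (
    job_id INT NOT NULL AUTO_INCREMENT,
    job_name VARCHAR(100) NOT NULL,
    hash_value CHAR(32) NOT NULL, -- holds the parent table's primary key value
    PRIMARY KEY (job_id),
    FOREIGN KEY (hash_value) REFERENCES hashes (hash_value)
        ON DELETE CASCADE -- one possible cascade choice; pick what fits your data
);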
The problem I'm trying to solve is that with this design, a hash can only belong to one job - in reality a hash can occur in many jobs, and I'd like to know each job in which a hash occurs.
The way I'm thinking of doing this is to create a new table called 'Jobs', with fields 'job_id', 'job_name' and 'hash_value'.
As long as you can also get a) the foreign keys right and b) the cascades right for both "job_id" and "hash_value", that should be fine.
Duplicate data and redundant data are technical terms in relational modeling. "Technical term" here means they have meanings that you're not likely to find in a dictionary. They don't mean "the same values appear in multiple tables." That should be obvious, because if you replace the values with surrogate ID numbers, those ID numbers will then appear in multiple tables.
Those technical terms actually mean "identical values with identical meaning." (Relevant: Hugh Darwen's article for definition and use of predicates.)
There might be good, practical reasons for replacing text with an ID number, but there are no theoretical reasons to do that, and normalization certainly doesn't require it. (There's no "every row has an ID number" normal form.)
If I read your question correctly, your design is fundamentally flawed, because of these facts:
the hash is the primary key (quoted from your question)
the same hash can be generated from multiple different inputs (fact)
you have millions of hashes (from question)
With the many millions of rows/hashes, eventually you'll get a hash collision.
The only sane approach is to have job_id as the primary key and hash in a column with a non-unique index on it. Finding job(s) given a hash would be straightforward.
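In contrast to the foreign-key sketch above, here the hash is deliberately not the primary key (table name assumed for illustration):

CREATE TABLE job (
    job_id INT NOT NULL AUTO_INCREMENT,
    hash_value CHAR(32) NOT NULL,
    PRIMARY KEY (job_id),
    KEY idx_hash_value (hash_value) -- non-unique: the same hash may appear in many jobs
);

-- finding the job(s) for a given hash uses the non-unique index:
SELECT job_id FROM job WHERE hash_value = 'd41d8cd98f00b204e9800998ecf8427e';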
I think I understand primary keys and indexes.
In my setup, I have a table with several columns. Two of these columns are User ID, and Username.
Ideally I would like both to be unique and non-nullable.
As far as I can tell, my best option would be to have the User ID as the primary key, as this is the most important field not to be NULL, and it will never change as the database grows.
I would then have the username column as a unique index, so that it can't be the same on another row, although, unfortunately, it could end up NULL.
This is what I will do unless there is a way to have both columns as unique and non NULLABLE?
You can declare the Username column as NOT NULL and put a unique index on it. Although the index itself won't force non-null values, the field definition will, so it will effectively be a unique, non-nullable field.
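For example (table name and column lengths assumed):

CREATE TABLE users (
    user_id INT NOT NULL, -- the primary key is non-nullable by definition
    username VARCHAR(50) NOT NULL, -- NOT NULL comes from the column definition
    PRIMARY KEY (user_id),
    UNIQUE KEY uq_username (username) -- uniqueness comes from the index
);

Both columns end up unique and non-nullable, which is exactly what you asked for.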
From both my application development and data warehouse experience, I would recommend having a separate primary key that is not used in any business setting, and not using User ID as the primary key. Using User ID as the primary key can lead to a whole host of problems. I would index each column (separately).
Any time you need to merge or reassign a user, change their ID, etc., having actually used their User ID as the primary key will lead to a lot of problems for those operations.
Also, on the web, this opens up people seeing URLs like ....user/1/details and then potentially being able to change the '1' to a '2' (for example) and seeing other people's info. It is better if the ID is opaque, like '57489574389ghfjghfjghf' - then it's harder to hack URLs.
The choice between a 'natural' and a 'surrogate' key is explained well here:
http://www.agiledata.org/essays/keys.html
Most of the problems people experience in this area are edge cases such as merges and deletes. These are usually of low priority initially, but concern over them grows over time, and poorly engineered solutions start to break down. That is usually because, by the time data quality is 'recognized', there is such a large volume of 'bad' data that going forward is untenable: the old data can't be 'fixed', and without that, rules are hard to introduce for new records that will co-exist with the old ones. (This assumes that the ability to update old records is still required.)
Nope, sorry to say you are incorrect, on both accounts.
1) Right about everything, except that the PK can change if you want it to.
2) A unique index is, by definition, unique; it cannot be repeated. What you mean is a plain old index, not unique, which can be repeated. Its purpose is to speed up querying if you filter often by that field; otherwise it's better not to use it.
What you want: Column1 = Primary Key (not null), Column2 = Unique Index (not null), exactly what you said, but now you know why it does work as you need it to.
EDIT: Also, it seems you draw a correlation between indexes and non-nullables. You can make a column non-nullable independently of whether it is an index or not.
Totally agree with Michael: your primary key column should not contain any meaningful data, especially something like User ID. So you should add another column for the PK and fill it from a sequence.
Also agree with Darhazer: you should put a not null constraint and a unique index on both the userid and username fields.
I am not very familiar with databases and the theory behind how they work. Is it any slower, from a performance standpoint (inserting/updating/querying), to use strings for primary keys than integers?
For example, I have a database that would have about 100 million rows, with columns like mobile number, name, and email; mobile number and email would be unique. Can I have the mobile number or email as the primary key?
Will it affect my query performance when I search by email or mobile number? Similarly, the primary key will be used as a foreign key in 5 or 6 tables, or even more.
I am using MySQL database
Technically yes, but if a string makes sense as the primary key then you should probably use it. This all depends on the size of the table you're making it for and the length of the string that is going to be the primary key (longer strings == harder to compare). I wouldn't necessarily use a string for a table that has millions of rows, but on smaller tables the performance slowdown you'll get from using a string will be minuscule compared to the headaches you can get from an integer that doesn't mean anything in relation to the data.
Another issue with using strings as a primary key is that, because the index is kept in sorted order, when a new key is created that falls in the middle of the order, the index has to be resequenced... if you use an auto-number integer, the new key is just added to the end of the index.
Inserts to a table with a clustered index, where the insertion occurs in the middle of the sequence, DO NOT cause the index to be rewritten. They do not cause the pages comprising the data to be rewritten. If there is room on the page where the row belongs, it is placed on that page, and that single page is reformatted to put the row in the right place. When the page is full, a page split happens: half of the rows stay on one page and half go to the other, and the pages are relinked into the linked list of pages that comprise the table's clustered-index data. At most, you end up writing two database pages.
Strings are slower in joins and in real life they are very rarely really unique (even when they are supposed to be). The only advantage is that they can reduce the number of joins if you are joining to the primary table only to get the name. However, strings are also often subject to change thus creating the problem of having to fix all related records when the company name changes or the person gets married. This can be a huge performance hit and if all tables that should be related somehow are not related (this happens more often than you think), then you might have data mismatches as well. An integer that will never change through the life of the record is a far safer choice from a data integrity standpoint as well as from a performance standpoint. Natural keys are usually not so good for maintenance of the data.
I also want to point out that the best of both worlds is often to use an auto-incrementing key (or in some specialized cases, a GUID) as the PK, and then put a unique index on the natural key. You get the faster joins, you don't get duplicate records, and you don't have to update a million child records because a company name changed.
Too many variables. It depends on the size of the table, the indexes, nature of the string key domain...
Generally, integers will be faster. But will the difference be large enough to care? It's hard to say.
Also, what is your motivation for choosing strings? Numeric auto-increment keys are often so much easier as well. Is it semantics? Convenience? Replication/disconnected concerns? Your answer here could limit your options. This also brings to mind a third, "hybrid" option you're forgetting: GUIDs.
It doesn't matter what you use as a primary key so long as it is UNIQUE. If you care about speed or good database design, use the int - unless you plan on replicating data, in which case use a GUID.
If this is an Access database or some tiny app, then who really cares. I think the reason most of us developers slap the old int or GUID at the front is that projects have a way of growing on us, and you want to leave yourself the option to grow.
Don't worry about performance until you have got a simple and sound design that agrees with the subject matter that the data describes and fits well with the intended use of the data. Then, if performance problems emerge, you can deal with them by tweaking the system.
In this case, it's almost always better to go with a string as a natural primary key, provided you can trust it. Don't worry that it's a string, as long as the string is reasonably short, say about 25 characters max. You won't pay a big price in terms of performance.
Do the data entry people or automatic data sources always provide a value for the supposed natural key, or is it sometimes omitted? Is it occasionally wrong in the input data? If so, how are errors detected and corrected?
Are the programmers and interactive users who specify queries able to use the natural key to get what they want?
If you can't trust the natural key, invent a surrogate. If you invent a surrogate, you might as well invent an integer. Then you have to worry about whether to conceal the surrogate from the user community. Some developers who didn't conceal the surrogate key came to regret it.
Indices imply lots of comparisons.
Typically, strings are longer than integers, and collation rules may be applied for comparison, so comparing strings is usually a more computationally intensive task than comparing integers.
Sometimes, though, it's faster to use a string as a primary key than to make an extra join with a string-to-numeric-id lookup table.
Two reasons to use integers for PK columns:
We can set an identity on an integer field so that it is incremented automatically.
When we create a PK, the db creates an index (clustered or non-clustered) which sorts the data before it's stored in the table. By using an identity on a PK, the optimizer need not check the sort order before saving a record. This improves performance on big tables.
Yes, but unless you expect to have millions of rows, not using a string-based key because it's slower is usually "premature optimization." After all, strings are stored as big numbers while numeric keys are usually stored as smaller numbers.
One thing to watch out for, though, is if you have clustered indices on any key and are doing large numbers of inserts that are non-sequential in the index. Every row written will cause index pages to be rewritten; if you're doing batch inserts, this can really slow the process down.
What is your reason for having a string as a primary key?
I would just set the primary key to an auto incrementing integer field, and put an index on the string field.
That way if you do searches on the table they should be relatively fast, and all of your joins and normal look ups will be unaffected in their speed.
You can also control the amount of the string field that gets indexed. In other words, you can say "only index the first 5 characters" if you think that will be enough. Or if your data can be relatively similar, you can index the whole field.
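In MySQL that is a prefix index; a small sketch with assumed names:

CREATE TABLE items (
    item_pk INT NOT NULL AUTO_INCREMENT,
    item_id VARCHAR(100) NOT NULL,
    PRIMARY KEY (item_pk),
    KEY idx_item_id (item_id(5)) -- index only the first 5 characters
);

Just make the prefix long enough to be selective, or the index won't narrow down matches much.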
From a performance standpoint - yes, a string PK will slow down performance compared to an integer PK (PK = primary key).
From a requirement standpoint - although this is not part of your question, I would still like to mention it. When we are handling huge data across different tables, we generally look at the probable set of keys that can be set for a particular table. This is primarily because there are many tables, and mostly each table would be related to another through some relation (the concept of a foreign key). Therefore we really cannot always choose an integer as a primary key; rather, we go for a combination of 3, 4, or 5 attributes as the primary key for that table. Those keys can then be used as foreign keys when we relate the records to some other table. This makes it possible to relate the records across different tables when required.
Therefore, for optimal usage, we make a combination of 1 or 2 integers with 1 or 2 string attributes - but again, only if it is required.
I would probably use an integer as your primary key, and then just have your string (I assume it's some sort of ID) as a separate column.
create table sample (
    sample_pk INT NOT NULL AUTO_INCREMENT, -- compact surrogate key used for joins
    sample_id VARCHAR(100) NOT NULL,       -- the external string ID
    ...
    PRIMARY KEY(sample_pk)
);
You can always do queries and joins conditionally on the string (ID) column (where sample_id = ...).
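Hypothetical usage, with a made-up child table (sample_detail) joined on the integer key:

SELECT s.sample_id, d.*
FROM sample AS s
JOIN sample_detail AS d ON d.sample_fk = s.sample_pk
WHERE s.sample_id = 'ABC-123';

The query filters on the string once; the join itself compares only integers.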
There could be a very big misunderstanding here about how strings are stored in a database. Almost everyone assumes that the database representation of a number is more compact than that of a string; they think that numbers are represented in a DB the same way they are in memory. But that is not always true: in many cases a number's representation is closer to a string-like representation than to a machine-word one.
The speed of using a number or a string depends more on the indexing than on the type itself.
By default, ASPNetUserIds are 128-character strings, and performance is just fine.
If the key HAS to be unique in the table, it should be the key. Here's why:
A primary string key = correct DB relationships, 1 string key (the primary), and 1 string index (the primary).
The other option is a typical int key, but if the string HAS to be unique, you'll still probably need to add an index because of non-stop queries to validate or check that it's unique.
So using an int identity key = incorrect DB relationships, 1 int key (primary), 1 int index (primary), probably a unique string index, and manually having to validate that the same string doesn't exist (something like a SQL check, maybe).
To get better performance using an int over a string for the primary key, when the string HAS to be unique, it would have to be a very odd situation. I've always preferred to use string keys. And as a good rule of thumb, don't denormalize a database until you NEED to.