MySQL - using String as Primary Key

I saw a similar post on Stack Overflow already, but wasn't quite satisfied.
Let's say I offer a Web service. http://foo.com/SERVICEID
SERVICEID is a unique String ID used to reference the service (base 64, lower/uppercase + numbers), similar to how URL shortener services generate IDs for a URL.
I understand that there are inherent performance issues with comparing strings versus integers.
But I am curious how to maximally optimize a primary key of type String.
I am using MySQL, (currently using MyISAM engine, though I admittedly don't understand all the engine differences).
Thanks.
Update: for my purpose the string was actually just a base62-encoded integer, so the primary key was an integer. Since you're not likely ever to exceed BIGINT's size, it doesn't make much sense to use anything else (for my particular use case).
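A minimal sketch of that idea in Python, in case it helps anyone landing here. The alphabet order (digits, lowercase, uppercase) and the function names are my own assumptions, not anything a particular URL shortener prescribes; the only point is that the external string and the integer key convert back and forth losslessly.

```python
import string

# Assumed alphabet: digits, then lowercase, then uppercase (62 symbols).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)


def encode_base62(n: int) -> str:
    """Turn a non-negative integer primary key into a short URL-safe string."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, rem = divmod(n, BASE)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


def decode_base62(s: str) -> int:
    """Turn the external SERVICEID back into the integer used as the key."""
    n = 0
    for ch in s:
        n = n * BASE + ALPHABET.index(ch)  # raises ValueError on bad characters
    return n
```

With this alphabet even the largest signed BIGINT value encodes to 11 characters, so the URLs stay short while the table keeps an integer key.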

There's nothing wrong with using a CHAR or VARCHAR as a primary key.
Sure it'll take up a little more space than an INT in many cases, but there are many cases where it is the most logical choice and may even reduce the number of columns you need, improving efficiency, by avoiding the need to have a separate ID field.
For instance, country codes or state abbreviations already have standardised character codes and this would be a good reason to use a character based primary key rather than make up an arbitrary integer ID for each in addition.

If your external ID is base64, your internal ID is a binary string. Use that as the key in your database with type BINARY(n) (if fixed length) or VARBINARY if variable length. The binary version is 3/4 the length of the base64 one (that is, 25% shorter).
And just convert from/to base64 in your service.
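A rough sketch of that conversion at the service boundary, assuming the IDs use the URL-safe base64 alphabet without padding (the question does not say which variant is used, so treat that and the function names as assumptions):

```python
import base64


def external_to_key(service_id: str) -> bytes:
    """Decode a URL-safe base64 ID (padding optional) into the raw key bytes."""
    padded = service_id + "=" * (-len(service_id) % 4)
    return base64.urlsafe_b64decode(padded)


def key_to_external(key: bytes) -> str:
    """Encode raw key bytes as an unpadded URL-safe base64 string."""
    return base64.urlsafe_b64encode(key).rstrip(b"=").decode("ascii")
```

A 12-byte key becomes a 16-character external ID this way, so the column would be BINARY(12) rather than CHAR(16).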

Using a string as the type of the primary key column is not a good approach: if the values cannot be generated sequentially, in an incremental pattern, inserts may cause database fragmentation and decrease database performance.

Related

Limit of SQL 'auto increment' as primary key

I want to create an online billboard system where everyone can post a topic, as my project.
I am trying to design the database, using SQL to store the information for each topic, including the topic's id as primary key.
At first I designed the id as an integer with auto-increment, as I think it's the simplest way. Then I thought about it and realized that the integer has a limit (the number may be high, but it is there), so I'm here looking for another method.
Now I am thinking of some pseudo-random algorithm, or of using a hash of the topic's name, but it's still not clear.
I also found GUIDs while researching here, but I'm not sure whether they can be used.
I'd like you to suggest some ways to deal with the size limit of an integer primary key, or suggest some keywords for further research.
This answer assumes MySQL/MariaDB, because it uses the terminology "auto-increment" for such columns (as opposed to other databases that use identity or serial).
If int isn't big enough, you can use bigint.
Although I consider it unlikely that you'll exceed the threshold for int (it works for many applications), exceeding the maximum value of bigint would require great effort on the part of you and your computers for a long, long time.
This is explained in the documentation.
With int, the maximum value supported by SQL Server is 2,147,483,647.
Just for completeness, I will also add that yet another option is to change the data type of the column to bigint (maximum value 9,223,372,036,854,775,807 - this will allow you to insert a million rows per second, for almost 300,000 years in a row).
Or if you fear that you will overflow even that, you can consider using decimal(38,0) - the maximum here is a number consisting of 38 9's (which will allow you to maintain that same pace for roughly 3.17 * 10^24 years).
http://sqlblog.com/blogs/hugo_kornelis
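For anyone who wants to check the integer figures, here is a back-of-the-envelope sketch. It assumes 365-day years and a constant insert rate; the names are made up for illustration, and INT UNSIGNED is the MySQL variant of the type.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# Maximum values of the common auto-increment column types.
LIMITS = {
    "INT (signed)": 2**31 - 1,
    "INT UNSIGNED": 2**32 - 1,
    "BIGINT (signed)": 2**63 - 1,
}


def years_until_exhausted(max_value: int, rows_per_second: int) -> float:
    """How long a counter lasts if it is bumped rows_per_second times a second."""
    return max_value / rows_per_second / SECONDS_PER_YEAR


for name, limit in LIMITS.items():
    years = years_until_exhausted(limit, 1_000_000)
    print(f"{name}: max {limit:,}, about {years:,.4f} years at 1M rows/s")
```

At a million rows per second a signed INT is used up in well under an hour, while BIGINT lasts on the order of hundreds of thousands of years.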

Alternate field for autoincrement PK

In my tables I use an auto-increment PK on tables where I store for example posts and comments.
I don't want to expose the PK to the HTTP client, however, I still use it internally in my API implementation to perform quick lookups.
When a user wants to retrieve a post by id, I want to have an alternate unique key on the table.
I wonder what the best (most common) type for this field is.
The most obvious to me would be to use a UUID or GUID.
I wonder if there is a straightforward way to generate a random numeric key for this instead for performance.
What is your take on the best approach for this situation?
MySQL has a function that generates a 128-bit UUID, version 1 as described in RFC 4122, and returns it as a hex string with dashes, by the custom of UUID formatting.
https://dev.mysql.com/doc/refman/5.7/en/miscellaneous-functions.html#function_uuid
A true UUID is meant to be globally unique in space and time. Usually it's overkill unless you need a distributed set of independent servers to generate unique values without some central uniqueness validation, which could create a bottleneck.
MySQL also has a function UUID_SHORT() which generates a 64-bit numeric value. This does not conform with the RFC, but it might be useful for your case.
https://dev.mysql.com/doc/refman/5.7/en/miscellaneous-functions.html#function_uuid-short
Read the description of the UUID_SHORT() implementation. Since the upper bits are seldom changing, and the lower bits are simply monotonically incrementing, it avoids the performance and fragmentation issues caused by inserting random UUID values into an index.
The UUID_SHORT value also fits in a MySQL BIGINT UNSIGNED without having to use UNHEX().
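If the key is generated in the application instead of in MySQL, something along these lines might do. This is only a sketch using Python's standard library: the variable names are mine, and the random numeric ID is an alternative I am assuming fits the question, not something MySQL provides. A random ID still needs a UNIQUE index and a retry on the (rare) duplicate.

```python
import secrets
import uuid

u = uuid.uuid1()      # version 1 UUID, the same version MySQL's UUID() returns
as_text = str(u)      # 36 characters: 32 hex digits plus 4 dashes
as_binary = u.bytes   # 16 raw bytes, which would fit a BINARY(16) column

# A random 63-bit integer always fits in a signed BIGINT column.
random_id = secrets.randbits(63)

print(as_text, len(as_binary), random_id)
```

Storing the 16 raw bytes instead of the 36-character text form is the same saving that UNHEX() gives on the MySQL side.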

SHA1 sum as a primary key?

I am going to store filenames and other details in a table, and I am planning to use the SHA1 hash of the filename as the PK.
Q1. A SHA1 PK will not be a sequentially increasing/decreasing number. So will it be more resource-consuming for the database to maintain, search and index on that key if I decide to keep it in the database as a 40 char value?
Q2. I read here: https://stackoverflow.com/a/614483/986818 about storing the data as a binary(20) field. Can someone advise me in this regard:
a) do I have to create this column as: TYPE=integer, LENGTH=20, COLLATION=binary, ATTRIBUTES=binary?
b) how do I convert the sha1 value in MySQL or Perl to store it in the table?
c) is there a danger of duplicacy for this 20 char value?
UPDATE
The requirement is to search the table by filename. The user supplies a filename, I search the table, and if the filename is not there I add it. So either I index the varchar(100) filename field, or I generate a column with the sha1 of the filename, hoping it would be easier for MySQL to index than a varchar field. I can also search using the sha1 value from my program against the sha1 column. What do you say? Primary key or just an indexed key: I chose PK because DBIx likes using a PK, and a PK or INDEX+UNIQUE would be the same amount of overhead for the system (so I thought).
Ok, then use a very short hash of the filename and accept collisions. Use an integer type for it (that's much faster). E.g. you can use md5(filename), take the first 8 characters, and convert them to an integer. The SQL could look like this:
CREATE TABLE files (
  id INT AUTO_INCREMENT,
  hash INT UNSIGNED,        -- first 32 bits of md5(filename)
  filename VARCHAR(100),
  PRIMARY KEY (id),
  INDEX (hash)
);
Then you can use:
SELECT id FROM files WHERE hash=<hash> AND filename='<filename>';
The hash is then used for sorting out most other files (normally all other files) and then the filename is for selecting the right entry out of the few hash collisions.
For generating an integer hash-key in perl I suggest using md5() and pack().
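I can't offer the Perl version, but the same computation is easy to sketch in Python. The function name and the UTF-8 encoding of the filename are my assumptions; the result fits the INT UNSIGNED column because 8 hex digits are exactly 32 bits.

```python
import hashlib


def filename_hash(filename: str) -> int:
    """Return the first 8 hex digits of md5(filename) as a 32-bit unsigned int."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return int(digest[:8], 16)


print(filename_hash("report.txt"))
```

The lookup then passes both values, the hash to narrow the search through the index and the filename to pick the right row among any colliding entries.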
"If i decide to keep it in database as 40 char value."
Using a character sequence as a key will degrade performance, because every index comparison has to work on 40 characters instead of a single integer.
Also, the PK is supposed to be unique. Although it is unlikely that you will end up with collisions, using a hash function to create the PK seems theoretically inappropriate.
Additionally, anyone knowing the filename and the hash you use would know all your database ids. I am not sure whether this is something you need to consider.
Q1: Yes, it will need to build a B-tree of nodes that contain not just one integer (4 bytes) but a CHAR(40). Speed would be approximately the same, as long as the index is kept in memory. As the entries are about 10 times bigger, you need about 10 times more memory to keep it there. BUT: you probably want to look up by the hash anyway, so you'll need to have it either as the primary key or as an index.
Q2: Just create a table field like CREATE TABLE test (ID BINARY(20), ...); later you can use INSERT INTO test (ID, ...) VALUES (UNHEX(SHA1('filename')), ...);
-- Regarding: Is there a danger of duplicacy for this 20 char value?
The chance that two given values collide is 1 in 2^(8*20) = 2^160, i.e. 1 in about 1.46 * 10^48. So a collision is very, very improbable.
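To put a number on it for a whole table rather than one pair of values, here is a small sketch using the usual birthday-bound approximation. It assumes the hash output is uniformly distributed, and the function name is just illustrative.

```python
import math

SPACE = 2**160  # number of distinct 20-byte (160-bit) values


def collision_probability(n_items: int, space: int = SPACE) -> float:
    """Birthday-bound approximation: p = 1 - exp(-n(n-1) / (2 * space))."""
    return -math.expm1(-n_items * (n_items - 1) / (2 * space))


print(SPACE)                             # a 49-digit number, about 1.46e48
print(collision_probability(10**9))      # a billion rows, full 160 bits
print(collision_probability(10**9, 2**32))  # same rows, hash cut to 32 bits
```

Even with a billion rows the 160-bit figure stays vanishingly small, whereas a 32-bit truncation makes a collision practically certain, which is why the short-hash approach has to check the filename too.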
There is no reason to use a cryptographically secure hash here. Instead, if you do this, use an ordinary hash. See here: https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
The hash is NOT a 40 char value! It's a 160-bit number, and you should store it that way (as a 20-byte binary field). Edit: I see you mentioned that in comment 2. Yes, you should definitely do that. But I can't tell you how since I don't know what programming language you are using. Edit2: I see it's Perl - sorry, I don't know how to convert it in Perl, but look for "pack" functions.
No, do not create it as type integer. MySQL's largest integer type, BIGINT, is 64 bits, which cannot hold the entire 160-bit value. You could truncate the hash to fit, though, at the cost of a higher collision risk.
It's better to use a simpler hash anyway. You could risk it and ignore collisions, but if you do it properly you kind of have to handle them.
I would stick with the standard auto-incrementing integer for the primary key. If uniqueness of file names is important (which it sounds like it is), then you can add a UNIQUE constraint on the file name itself or some derived, canonical version of the file name. Most languages/frameworks have some sort of method for getting a canonical version of a path (relative to absolute, standardized case, etc).
If you implement my suggestion or pursue your original plan, then you should be aware that multiple strings can map to the same filename/path. Both versions will have different hashes/pass the uniqueness constraint but will actually both refer to the same file. This depends on operating system and may or may not be a problem for you. Just something to keep in mind.
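As an illustration of what "canonical version" could mean, here is a minimal sketch with Python's posixpath. The base directory and function name are hypothetical, and it deliberately ignores symlinks and case-insensitive filesystems, which are exactly the operating-system-dependent cases mentioned above.

```python
import posixpath


def canonical_path(path: str, base: str = "/") -> str:
    """Resolve a path against a base directory and collapse '.', '..' and '//'."""
    return posixpath.normpath(posixpath.join(base, path))


# Two different strings that refer to the same file end up identical.
a = canonical_path("docs/./a/../report.txt", "/home/user")
b = canonical_path("/home/user//docs/report.txt")
print(a, b)
```

The UNIQUE constraint (or the hash) would then be applied to the canonical form rather than to whatever string the user typed.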

MySQL: primary key is a 8-byte string. Is it better to use BIGINT or BINARY(8)?

We need to store many rows in a MySQL (InnoDB) table, all of them having an 8-byte binary string as primary key.
I was wondering whether it was best to use the BIGINT column type (which contains 64-bit, thus 8-byte, integers) or BINARY(8), which is fixed length.
Since we're using those ids as strings in our application, and not numbers, storing them as binary strings sounds more coherent to me. However, I wonder if there are performance issues with this. Does it make any difference?
If that matters, we are reading/storing these ids using hex notation (like page_id = 0x1122334455667788).
We wouldn't use integers in queries anyway, since we're writing a PHP application and, as you surely know, there isn't an "unsigned long long int" type, so all integers are of machine-dependent size.
I'd use the binary(8) if this matches your design.
Otherwise you'll always have a conversion overhead in performance or complexity somewhere. There won't be much (if any) difference between the types at the RDBMS level.
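To show what that conversion looks like if you did go the integer route, here is a sketch in Python (the question is about PHP, so take it as an illustration of the idea only; the function names are made up). It assumes big-endian byte order, which keeps the numeric order of the unsigned integers identical to the byte-wise order of the binary strings.

```python
def hex_to_key(hex_id: str) -> bytes:
    """Parse hex notation such as '1122334455667788' into an 8-byte key."""
    key = bytes.fromhex(hex_id)
    if len(key) != 8:
        raise ValueError("expected exactly 8 bytes")
    return key


def key_to_unsigned(key: bytes) -> int:
    """Interpret the 8 bytes as an unsigned 64-bit integer (big-endian)."""
    return int.from_bytes(key, "big")


def unsigned_to_key(value: int) -> bytes:
    """Convert an unsigned 64-bit integer back into the 8-byte key."""
    return value.to_bytes(8, "big")
```

Values with the top bit set only fit in BIGINT UNSIGNED, not in a signed BIGINT, which is one more place where the integer route needs care.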

Strings as Primary Keys in MYSQL Database [closed]

I am not very familiar with databases and the theories behind how they work. Is it any slower from a performance standpoint (inserting/updating/querying) to use strings for primary keys than integers?
For example, I have a database that would have about 100 million rows with fields like mobile number, name and email. Mobile number and email would be unique, so can I have the mobile number or email as the primary key?
Will it affect my query performance when I search based on email or mobile number? Similarly, the primary key will be used as a foreign key in 5 to 6 tables or even more.
I am using a MySQL database.
Technically yes, but if a string makes sense to be the primary key then you should probably use it. This all depends on the size of the table you're making it for and the length of the string that is going to be the primary key (longer strings == harder to compare). I wouldn't necessarily use a string for a table that has millions of rows, but the amount of performance slowdown you'll get by using a string on smaller tables will be minuscule compared to the headaches that you can have by having an integer that doesn't mean anything in relation to the data.
Another issue with using Strings as a primary key is that because the index is constantly put into sequential order, when a new key is created that would be in the middle of the order the index has to be resequenced... if you use an auto number integer, the new key is just added to the end of the index.
Inserts to a table having a clustered index where the insertion occurs in the middle of the sequence DO NOT cause the index to be rewritten. They do not cause the pages comprising the data to be rewritten. If there is room on the page where the row will go, then it is placed in that page. The single page will be reformatted to place the row in the right place in the page. When the page is full, a page split will happen, with half of the rows on the page going to one page, and half going on the other. The pages are then relinked into the linked list of pages that comprise the data of a table that has the clustered index. At most, you will end up writing 2 pages of database.
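A toy model of that behaviour, for anyone who wants to see it rather than take it on faith. This is a simplified sketch, not how any real storage engine is implemented: the page capacity, the list-of-lists layout and the function names are all assumptions made for illustration.

```python
import bisect
import random

PAGE_CAPACITY = 4  # hypothetical rows per page, kept tiny for illustration


def insert(pages, key):
    """Insert key into sorted, linked pages; return the number of pages written."""
    # Pick the last page whose first key is <= key (or the first page).
    first_keys = [page[0] for page in pages]
    idx = max(bisect.bisect_right(first_keys, key) - 1, 0)
    page = pages[idx]
    bisect.insort(page, key)  # only this one page is reformatted
    if len(page) <= PAGE_CAPACITY:
        return 1
    # Page overflow: split it in half and relink both halves in place.
    mid = len(page) // 2
    pages[idx:idx + 1] = [page[:mid], page[mid:]]
    return 2


def load(keys):
    """Insert keys one by one; return the pages and the worst-case pages written."""
    pages = [[keys[0]]]
    worst = 1
    for key in keys[1:]:
        worst = max(worst, insert(pages, key))
    return pages, worst


random.seed(0)
keys = random.sample(range(10_000), 500)  # non-sequential insertion order
pages, worst = load(keys)
print(len(pages), "pages, at most", worst, "pages written per insert")
```

Even with fully random insertion order, no single insert in this model touches more than two pages; what random keys do cost is more half-empty pages and less locality, not a rewrite of the whole index.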
Strings are slower in joins and in real life they are very rarely really unique (even when they are supposed to be). The only advantage is that they can reduce the number of joins if you are joining to the primary table only to get the name. However, strings are also often subject to change thus creating the problem of having to fix all related records when the company name changes or the person gets married. This can be a huge performance hit and if all tables that should be related somehow are not related (this happens more often than you think), then you might have data mismatches as well. An integer that will never change through the life of the record is a far safer choice from a data integrity standpoint as well as from a performance standpoint. Natural keys are usually not so good for maintenance of the data.
I also want to point out that the best of both worlds is often to use an autoincrementing key (or in some specialized cases, a GUID) as the PK and then put a unique index on the natural key. You get the faster joins, you don't get duplicate records, and you don't have to update a million child records because a company name changed.
Too many variables. It depends on the size of the table, the indexes, nature of the string key domain...
Generally, integers will be faster. But will the difference be large enough to care? It's hard to say.
Also, what is your motivation for choosing strings? Numeric auto-increment keys are often so much easier as well. Is it semantics? Convenience? Replication/disconnected concerns? Your answer here could limit your options. This also brings to mind a third "hybrid" option you're forgetting: Guids.
It doesn't matter what you use as a primary key so long as it is UNIQUE. If you care about speed or good database design use the int unless you plan on replicating data, then use a GUID.
If this is an access database or some tiny app then who really cares. I think the reason why most of us developers slap the old int or guid at the front is because projects have a way of growing on us, and you want to leave yourself the option to grow.
Don't worry about performance until you have got a simple and sound design that agrees with the subject matter that the data describes and fits well with the intended use of the data. Then, if performance problems emerge, you can deal with them by tweaking the system.
In this case, it's almost always better to go with a string as a natural primary key, provided you can trust it. Don't worry if it's a string, as long as the string is reasonably short, say about 25 characters max. You won't pay a big price in terms of performance.
Do the data entry people or automatic data sources always provide a value for the supposed natural key, or is it sometimes omitted? Is it occasionally wrong in the input data? If so, how are errors detected and corrected?
Are the programmers and interactive users who specify queries able to use the natural key to get what they want?
If you can't trust the natural key, invent a surrogate. If you invent a surrogate, you might as well invent an integer. Then you have to worry about whether to conceal the surrogate from the user community. Some developers who didn't conceal the surrogate key came to regret it.
Indices imply lots of comparisons.
Typically, strings are longer than integers and collation rules may be applied for comparison, so comparing strings is usually more computationally intensive task than comparing integers.
Sometimes, though, it's faster to use a string as a primary key than to make an extra join with a string to numerical id table.
Two reasons to use integers for PK columns:
1. We can set an identity on an integer field, so that it is incremented automatically.
2. When we create PKs, the db creates an index (clustered or non-clustered) which sorts the data before it's stored in the table. By using an identity on a PK, the optimizer need not check the sort order before saving a record. This improves performance on big tables.
Yes, but unless you expect to have millions of rows, not using a string-based key because it's slower is usually "premature optimization." After all, strings are stored as big numbers while numeric keys are usually stored as smaller numbers.
One thing to watch out for, though, is if you have clustered indices on any key and are doing large numbers of inserts that are non-sequential in the index. Every line written will cause the index to re-write. If you're doing batch inserts, this can really slow the process down.
What is your reason for having a string as a primary key?
I would just set the primary key to an auto incrementing integer field, and put an index on the string field.
That way if you do searches on the table they should be relatively fast, and all of your joins and normal look ups will be unaffected in their speed.
You can also control the amount of the string field that gets indexed. In other words, you can say "only index the first 5 characters" if you think that will be enough. Or if your data can be relatively similar, you can index the whole field.
From a performance standpoint: yes, a string PK (primary key) will be slower than an integer PK.
From a requirements standpoint: although this is not part of your question, I would still like to mention it. When handling huge amounts of data across different tables, we generally look for the probable set of keys that can be set for a particular table. This is primarily because there are many tables, and most of them are related to others through some relation (the concept of a foreign key). Therefore we cannot always choose an integer as the primary key; sometimes we go for a combination of 3, 4 or 5 attributes as the primary key for a table. Those keys can then be used as a foreign key when relating the records to some other table. This makes it useful for relating records across different tables when required.
Therefore, for optimal usage: we combine 1 or 2 integers with 1 or 2 string attributes, but again only if it is required.
I would probably use an integer as your primary key, and then just have your string (I assume it's some sort of ID) as a separate column.
create table sample (
sample_pk INT NOT NULL AUTO_INCREMENT,
sample_id VARCHAR(100) NOT NULL,
...
PRIMARY KEY(sample_pk)
);
You can always do queries and joins conditionally on the string (ID) column (where sample_id = ...).
There may be a big misunderstanding about strings in databases. Almost everyone thinks that the database representation of numbers is more compact than that of strings, and that numbers are represented in the database as they are in memory. But that is not true. In most cases the number representation is closer to a string-like representation than to anything else.
The speed of using a number or a string depends more on the indexing than on the type itself.
By default ASPNetUserIds are 128 char strings and performance is just fine.
If the key HAS to be unique in the table, it should be the key. Here's why:
Primary string key = correct DB relationships, 1 string key (the primary), and 1 string index (the primary).
The other option is a typical int key, but if the string HAS to be unique you'll still probably need to add an index, because of the constant queries to validate or check that it's unique.
So using an int identity key = incorrect DB relationships, 1 int key (primary), 1 int index (primary), probably a unique string index, and manually having to validate that the same string doesn't already exist (something like a SQL check, maybe).
For an int to perform better than a string as the primary key when the string HAS to be unique, it would have to be a very odd situation. I've always preferred to use string keys. And as a good rule of thumb, don't denormalize a database until you NEED to.