SHA1 sum as a primary key? - mysql

I am going to store filenames and other details in a table, and I am planning to use the SHA1 hash of the filename as the PK.
Q1. A SHA1 PK will not be a sequentially increasing/decreasing number,
so will it be more resource-consuming for the database to
maintain and search an index on that key if I decide to keep it in the database as a 40-char value?
Q2. I read here:
https://stackoverflow.com/a/614483/986818 about storing the data as a
binary(20) field. Can someone advise me in this regard:
a) Do I have to create this column as: TYPE=integer, LENGTH=20,
COLLATION=binary, ATTRIBUTES=binary?
b) How do I convert the sha1 value in MySQL or Perl to store it in the
table?
c) Is there a danger of duplicates for this 20-char value?
UPDATE:
The requirement is to search the table by filename. The user supplies a filename, I search the table, and if the filename is not there I add it. So either I index the varchar(100) filename field, or I generate a column with the sha1 of the filename, hoping it would be easier for MySQL to index than a varchar field. I can also search using the sha1 value from my program against the sha1 column. What do you say? Primary key or just an indexed key: I chose PK because DBIx likes using a PK, and I assumed a PK and INDEX+UNIQUE would cost the system about the same overhead.

Ok, then use a very short hash of the filename and accept collisions. Use an integer type for it (that's much faster). E.g. you can use md5(filename), take the first 8 hex characters and convert them to an integer. The SQL could look like this:
CREATE TABLE files (
id INT auto_increment,
hash INT unsigned,
filename VARCHAR(100),
PRIMARY KEY(id),
INDEX(hash)
);
Then you can use:
SELECT id FROM files WHERE hash=<hash> AND filename='<filename>';
The hash is then used for sorting out most other files (normally all other files) and then the filename is for selecting the right entry out of the few hash collisions.
For generating an integer hash-key in perl I suggest using md5() and pack().
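The same integer key can be sketched in Python rather than Perl. This is only an illustration of the "first 8 hex characters of md5" idea; the function name and the sample file name are made up for the example.

```python
import hashlib


def filename_hash32(filename: str) -> int:
    """Return the first 8 hex digits of md5(filename) as an unsigned 32-bit integer."""
    digest_hex = hashlib.md5(filename.encode("utf-8")).hexdigest()
    # 8 hex digits are 32 bits, which fits the INT UNSIGNED `hash` column.
    return int(digest_hex[:8], 16)


# Hypothetical file name; the result is bound to <hash> in the SELECT,
# together with the full filename to weed out collisions.
h = filename_hash32("report.txt")
```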

If i decide to keep it in database as 40 char value.
Using a character sequence as a key will degrade performance for obvious reasons.
Also the PK is supposed to be unique. Although it is unlikely that you will end up with collisions, they are theoretically possible, so using a hash function to create the PK seems inappropriate.
Additionally, anyone who knows the filename and the hash you use would know all your database ids. I am not sure whether that matters in your case.

Q1: Yes, it will need to build up a B-Tree of nodes that contain not only 1 integer (4 bytes) but a CHAR(40). Speed would be approximately the same, as long as the index is kept in memory. As the entries are about 10 times bigger, you need 10 times more memory to keep it in memory. BUT: You probably want to look up by the hash anyway. So you'll need to have it either as primary key OR as an index.
Q2: Just create a table field like CREATE TABLE test (ID BINARY(20), ...); later you can use INSERT INTO test (ID, ..) VALUES (UNHEX(SHA1('<filename>')), ...);
-- Regarding: Is there a danger of duplicacy for this 20 char value?
For any given pair of values the chance is 1 in 2^(8*20) = 2^160, which is roughly 1 in 1.46 * 10^48. So a collision is very, very improbable.
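The per-pair figure is not the whole story once the table holds many rows, so here is a rough sketch of the usual birthday-bound approximation. The function name is invented for this example, and the formula is an approximation, not an exact probability.

```python
import math


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday-bound approximation: p ~ 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = -n_items * (n_items - 1) / float(2 ** (hash_bits + 1))
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


# A billion file names hashed with a 160-bit hash such as SHA1.
p_sha1 = collision_probability(10**9, 160)
# The same formula for a 32-bit hash with about 77k rows.
p_32bit = collision_probability(77163, 32)
```

With 160 bits the probability stays negligible even for a billion rows, while a 32-bit hash reaches roughly even odds at under a hundred thousand rows.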

There is no reason to use a cryptographically secure hash here. Instead, if you do this, use an ordinary hash. See here: https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
The hash is NOT a 40 char value! It's a 160 bit number, and you should store it that way (as a 20 char binary field). Edit: I see you mentioned that in comment 2. Yes, you should definitely do that. But I can't tell you how since I don't know what programming language you are using. Edit2: I see it's perl - sorry I don't know how to convert it in perl, but look for "pack" functions.
No, do not create it as type integer. The largest integer type in MySQL is BIGINT at 64 bits, which doesn't hold the entire 160-bit value. Although you could really just truncate the hash to fit without real harm, as long as you handle collisions.
It's better to use a simpler hash anyway. You could risk it and ignore collisions, but if you do it properly you kind of have to handle them.
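The conversion to a 20-byte value mentioned above can be sketched as follows. This is Python rather than Perl, the helper name is hypothetical, and it assumes file names are encoded as UTF-8 before hashing.

```python
import hashlib


def sha1_key(filename: str) -> bytes:
    """Return the raw 20-byte SHA1 digest, suitable for a BINARY(20) column."""
    return hashlib.sha1(filename.encode("utf-8")).digest()


key = sha1_key("report.txt")   # hypothetical file name
hex_form = key.hex()           # the familiar 40-character form
# Going back from the hex text to the raw bytes is lossless.
round_trip = bytes.fromhex(hex_form)
```

The raw bytes would be passed as a bound parameter to the database driver; on the SQL side, UNHEX() and HEX() do the same conversion.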

I would stick with the standard auto-incrementing integer for the primary key. If uniqueness of file names is important (which it sounds like it is), then you can add a UNIQUE constraint on the file name itself or some derived, canonical version of the file name. Most languages/frameworks have some sort of method for getting a canonical version of a path (relative to absolute, standardized case, etc).
If you implement my suggestion or pursue your original plan, then you should be aware that multiple strings can map to the same filename/path. Both versions will have different hashes/pass the uniqueness constraint but will actually both refer to the same file. This depends on operating system and may or may not be a problem for you. Just something to keep in mind.
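A minimal sketch of deriving a canonical name before applying the uniqueness check. It assumes POSIX-style paths and a case-insensitive file system; both are assumptions, and the function name is invented for the example.

```python
import posixpath


def canonical_name(path: str) -> str:
    """Collapse '.', '..' and repeated slashes, then lower-case the result.

    Lower-casing is only appropriate if the file system ignores case
    (an assumption made for this sketch).
    """
    return posixpath.normpath(path).lower()


a = canonical_name("docs/./a/../Report.TXT")
b = canonical_name("docs//report.txt")
```

Both spellings reduce to the same string, so a UNIQUE constraint (or a hash) on the canonical form treats them as one file.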

Related

Alternate field for autoincrement PK

In my tables I use an auto-increment PK on tables where I store for example posts and comments.
I don't want to expose the PK to the HTTP client, however, I still use it internally in my API implementation to perform quick lookups.
When a user wants to retrieve a post by id, I want to have an alternate unique key on the table.
I wonder what is the best (most common) way to use as type for this field.
The most obvious to me would be to use a UUID or GUID.
I wonder if there is a straightforward way to generate a random numeric key for this instead for performance.
What is your take on the best approach for this situation?
MySQL has a function that generates a 128-bit UUID, version 1 as described in RFC 4122, and returns it as a hex string with dashes, by the custom of UUID formatting.
https://dev.mysql.com/doc/refman/5.7/en/miscellaneous-functions.html#function_uuid
A true UUID is meant to be globally unique in space and time. Usually it's overkill unless you need a distributed set of independent servers to generate unique values without some central uniqueness validation, which could create a bottleneck.
MySQL also has a function UUID_SHORT() which generates a 64-bit numeric value. This does not conform with the RFC, but it might be useful for your case.
https://dev.mysql.com/doc/refman/5.7/en/miscellaneous-functions.html#function_uuid-short
Read the description of the UUID_SHORT() implementation. Since the upper bits are seldom changing, and the lower bits are simply monotonically incrementing, it avoids the performance and fragmentation issues caused by inserting random UUID values into an index.
The UUID_SHORT value also fits in a MySQL BIGINT UNSIGNED without having to use UNHEX().
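If the key is generated in the application instead of by the database, a random 64-bit number is one option. The sketch below is an assumption about how that could look, not something MySQL provides. Unlike UUID_SHORT(), random values do not insert in index order, and uniqueness still has to be enforced with a UNIQUE index.

```python
import secrets


def new_public_id() -> int:
    """Return a random integer that fits in a BIGINT UNSIGNED column (0 .. 2^64 - 1)."""
    return secrets.randbits(64)


public_id = new_public_id()
```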

Store UUID v4 in MySQL

I'm generating UUIDs using PHP, per the function found here
Now I want to store that in a MySQL database. What is the best/most efficient MySQL field format for storing UUID v4?
I currently have varchar(256), but I'm pretty sure that's much larger than necessary. I've found lots of almost-answers, but they're generally ambiguous about what form of UUID they're referring to, so I'm asking for the specific format.
Store it as VARCHAR(36) if you're looking to have an exact fit, or VARCHAR(255) which is going to work out with the same storage cost anyway. There's no reason to fuss over bytes here.
Remember VARCHAR fields are variable length, so the storage cost is proportional to how much data is actually in them, not how much data could be in them.
Storing it as BINARY is extremely annoying, the values are unprintable and can show up as garbage when running queries. There's rarely a reason to use the literal binary representation. Human-readable values can be copy-pasted, and worked with easily.
Some other platforms, like Postgres, have a proper UUID column which stores it internally in a more compact format, but displays it as human-readable, so you get the best of both approaches.
If you always have a UUID for each row, you could store it as CHAR(36) and save 1 byte per row over VARCHAR(36).
uuid CHAR(36) CHARACTER SET ascii
In contrast to CHAR, VARCHAR values are stored as a 1-byte or 2-byte
length prefix plus data. The length prefix indicates the number of
bytes in the value. A column uses one length byte if values require no
more than 255 bytes, two length bytes if values may require more than
255 bytes.
https://dev.mysql.com/doc/refman/5.7/en/char.html
Though be careful with CHAR, it will always consume the full length defined even if the field is left empty. Also, make sure to use ASCII for character set, as CHAR would otherwise plan for worst case scenario (i.e. 3 bytes per character in utf8, 4 in utf8mb4)
[...] MySQL must reserve four bytes for each character in a CHAR
CHARACTER SET utf8mb4 column because that is the maximum possible
length. For example, MySQL must reserve 40 bytes for a CHAR(10)
CHARACTER SET utf8mb4 column.
https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html
Question is about storing an UUID in MySQL.
Since version 8.0 of MySQL you can use binary(16) with conversion via the UUID_TO_BIN/BIN_TO_UUID functions:
https://mysqlserverteam.com/mysql-8-0-uuid-support/
Be aware that MySQL also has a fast way to generate UUIDs as primary key:
INSERT INTO t VALUES(UUID_TO_BIN(UUID(), true));
Most efficient is definitely BINARY(16), storing the human-readable characters uses over double the storage space, and means bigger indices and slower lookup. If your data is small enough that storing them as text doesn't hurt performance, you probably don't need UUIDs over boring integer keys. Storing raw is really not as painful as others suggest because any decent db admin tool will display/dump the octets as hexadecimal, rather than literal bytes of "text". You shouldn't need to be looking up UUIDs manually in the db; if you have to, HEX() and x'deadbeef01' literals are your friends. It is trivial to write a function in your app – like the one you referenced – to deal with this for you. You could probably even do it in the database as virtual columns and stored procedures so the app never bothers with the raw data.
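To make the size difference concrete, here is a small sketch using Python's standard uuid module. The helper names are made up and the UUID is just a sample value.

```python
import uuid


def uuid_to_bytes(text: str) -> bytes:
    """Parse the dashed text form and return the 16 raw bytes."""
    return uuid.UUID(text).bytes


def bytes_to_uuid(raw: bytes) -> str:
    """Format 16 raw bytes back into the dashed, lower-case text form."""
    return str(uuid.UUID(bytes=raw))


text = "8c45583a-0e1f-11ec-804d-005056219395"  # sample value
raw = uuid_to_bytes(text)
```

The text form is 36 characters, the raw form is 16 bytes, and the conversion is lossless in both directions.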
I would separate the UUID generation logic from the display logic to ensure that existing data are never changed and errors are detectable:
function guidv4($prettify = false)
{
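static $native = null; // before PHP 8.3 a static initializer must be a constant expression
if ($native === null) { $native = function_exists('random_bytes'); }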
$data = $native ? random_bytes(16) : openssl_random_pseudo_bytes(16);
$data[6] = chr(ord($data[6]) & 0x0f | 0x40); // set version to 0100
$data[8] = chr(ord($data[8]) & 0x3f | 0x80); // set bits 6-7 to 10
if ($prettify) {
return guid_pretty($data);
}
return $data;
}
function guid_pretty($data)
{
return strlen($data) == 16 ?
vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($data), 4)) :
false;
}
function guid_ugly($data)
{
$data = preg_replace('/[^[:xdigit:]]+/', '', $data);
return strlen($data) == 32 ? hex2bin($data) : false;
}
Edit: If you only need the column pretty when reading the database, a statement like the following is sufficient:
ALTER TABLE test ADD uuid_pretty CHAR(36) GENERATED ALWAYS AS (CONCAT_WS('-', LEFT(HEX(uuid_ugly), 8), SUBSTR(HEX(uuid_ugly), 9, 4), SUBSTR(HEX(uuid_ugly), 13, 4), SUBSTR(HEX(uuid_ugly), 17, 4), RIGHT(HEX(uuid_ugly), 12))) VIRTUAL;
This works like a charm for me in MySQL 8.0.26
create table t (
uuid BINARY(16) default (UUID_TO_BIN(UUID()))
);
When querying you may use
select BIN_TO_UUID(uuid) uuid from t;
The result is:
# uuid
'8c45583a-0e1f-11ec-804d-005056219395'
The most space-efficient would be BINARY(16) or two BIGINT UNSIGNED.
The former might give you headaches because manual queries do not (in a straightforward way) give you readable/copyable values.
The latter might give you headaches because of having to map between one value and two columns.
If this is a primary key, I would definitely not waste any space on it, as it becomes part of every secondary index as well. In other words, I would choose one of these types.
For performance, the randomness of random UUIDs (i.e. UUID v4, which is randomized) will hurt severely. This applies when the UUID is your primary key or if you do a lot of range queries on it. Your insertions into the primary index will be all over the place rather than all at (or near) the end. Your data loses temporal locality, which was a helpful property in various cases.
My main improvement would be to use something similar to a UUID v1, which uses a timestamp as part of its data, and ensure that the timestamp is in the highest bits. For example, the UUID might be composed something like this:
Timestamp | Machine Identifier | Counter
This way, we get a locality similar to auto-increment values.
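A rough sketch of such a layout follows. The field widths (48-bit millisecond timestamp, 48-bit machine identifier, 32-bit counter) are assumptions chosen to add up to 128 bits; they are not taken from any UUID standard.

```python
def make_ordered_id(timestamp_ms: int, machine_id: int, counter: int) -> bytes:
    """Pack timestamp | machine identifier | counter into 16 big-endian bytes."""
    if not (0 <= timestamp_ms < 1 << 48):
        raise ValueError("timestamp_ms must fit in 48 bits")
    if not (0 <= machine_id < 1 << 48):
        raise ValueError("machine_id must fit in 48 bits")
    if not (0 <= counter < 1 << 32):
        raise ValueError("counter must fit in 32 bits")
    # Timestamp occupies the highest bits, so byte order follows time order.
    value = (timestamp_ms << 80) | (machine_id << 32) | counter
    return value.to_bytes(16, "big")


earlier = make_ordered_id(1000, 7, 5)
later = make_ordered_id(1001, 1, 0)
```

Because the bytes compare in timestamp order, new rows land at or near the end of a BINARY(16) primary index.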
This could be useful if you use binary(16) data type:
INSERT INTO t (uuid) VALUES
(UNHEX(REPLACE(UUID(), '-', '')));
I just found a nice article going in more depth on these topics: https://www.xaprb.com/blog/2009/02/12/5-ways-to-make-hexadecimal-identifiers-perform-better-on-mysql/
It covers the storage of values, with the same options already expressed in the different answers on this page:
One: watch out for character set
Two: use fixed-length, non-nullable values
Three: Make it BINARY
But also adds some interesting insight about indexes:
Four: use prefix indexes
In many but not all cases, you don’t need to index the full length of
the value. I usually find that the first 8 to 10 characters are
unique. If it’s a secondary index, this is generally good enough. The
beauty of this approach is that you can apply it to existing
applications without any need to modify the column to BINARY or
anything else—it’s an indexing-only change and doesn’t require the
application or the queries to change.
Note that the article doesn't tell you how to create such a "prefix" index. Looking at MySQL documentation for Column Indexes we find:
[...] you can create an index that uses only the first N characters of the
column. Indexing only a prefix of column values in this way can make
the index file much smaller. When you index a BLOB or TEXT column, you
must specify a prefix length for the index. For example:
CREATE TABLE test (blob_col BLOB, INDEX(blob_col(10)));
[...] the prefix length in
CREATE TABLE, ALTER TABLE, and CREATE INDEX statements is interpreted
as number of characters for nonbinary string types (CHAR, VARCHAR,
TEXT) and number of bytes for binary string types (BINARY, VARBINARY,
BLOB).
Five: build hash indexes
What you can do is generate a checksum of the values and index that.
That’s right, a hash-of-a-hash. For most cases, CRC32() works pretty
well (if not, you can use a 64-bit hash function). Create another
column. [...] The CRC column isn’t guaranteed to be unique, so you
need both criteria in the WHERE clause or this technique won’t work.
Hash collisions happen quickly; you will probably get a collision with
about 100k values, which is much sooner than you might think—don’t
assume that a 32-bit hash means you can put 4 billion rows in your
table before you get a collision.
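The "both criteria" point from the article can be illustrated outside the database. In this sketch a list of tuples stands in for the table and zlib.crc32 stands in for the checksum column; the names are invented, and whether the values match MySQL's CRC32() byte for byte depends on the encoding used.

```python
import zlib


def crc32_of(value: str) -> int:
    """32-bit checksum of the UTF-8 bytes (unsigned in Python 3)."""
    return zlib.crc32(value.encode("utf-8"))


def find(rows, value):
    """rows is a list of (crc, value) tuples standing in for the table.

    The checksum narrows the candidates; the full value picks the real match.
    """
    crc = crc32_of(value)
    return [row for row in rows if row[0] == crc and row[1] == value]


# The second row simulates a checksum collision with a different value.
rows = [(crc32_of("x"), "x"), (crc32_of("x"), "other")]
matches = find(rows, "x")
```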
This is a fairly old post but still relevant and comes up in search results often, so I will add my answer to the mix. Since you already have to use a trigger or your own call to UUID() in your query, here are a pair of functions that I use to keep the UUID as text for easy viewing in the database, while reducing the footprint from 36 down to 24 characters (a 33% saving).
delimiter //
DROP FUNCTION IF EXISTS `base64_uuid`//
DROP FUNCTION IF EXISTS `uuid_from_base64`//
CREATE definer='root'@'localhost' FUNCTION base64_uuid() RETURNS varchar(24)
DETERMINISTIC
BEGIN
/* converting INTO base 64 is easy, just turn the uuid into binary and base64 encode */
return to_base64(unhex(replace(uuid(),'-','')));
END//
CREATE definer='root'@'localhost' FUNCTION uuid_from_base64(base64_uuid varchar(24)) RETURNS varchar(36)
DETERMINISTIC
BEGIN
/* Getting the uuid back from the base 64 version requires a little more work as we need to put the dashes back */
set @hex = hex(from_base64(base64_uuid));
return lower(concat(substring(@hex,1,8),'-',substring(@hex,9,4),'-',substring(@hex,13,4),'-',substring(@hex,17,4),'-',substring(@hex,-12)));
END//
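The same round trip can be done in application code. This is a sketch with Python's standard library; the function names mirror the SQL ones for readability only, and the UUID is a sample value.

```python
import base64
import uuid


def base64_uuid(text: str) -> str:
    """Encode a dashed UUID string as 24 base64 characters (16 bytes plus padding)."""
    return base64.b64encode(uuid.UUID(text).bytes).decode("ascii")


def uuid_from_base64(encoded: str) -> str:
    """Decode the base64 form back into the dashed, lower-case UUID string."""
    return str(uuid.UUID(bytes=base64.b64decode(encoded)))


original = "8c45583a-0e1f-11ec-804d-005056219395"  # sample value
encoded = base64_uuid(original)
decoded = uuid_from_base64(encoded)
```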

MySQL - using String as Primary Key

I saw a similar post on Stack Overflow already, but wasn't quite satisfied.
Let's say I offer a Web service. http://foo.com/SERVICEID
SERVICEID is a unique String ID used to reference the service (base 64, lower/uppercase + numbers), similar to how URL shortener services generate ID's for a URL.
I understand that there are inherent performance issues with comparing strings versus integers.
But I am curious of how to maximally optimize a primary key of type String.
I am using MySQL, (currently using MyISAM engine, though I admittedly don't understand all the engine differences).
Thanks.
Update: for my purpose the string was actually just a base62-encoded integer, so the primary key was an integer, and since you're not likely to ever exceed BIGINT's size it just doesn't make much sense to use anything else (for my particular use case).
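For reference, a base62 round trip between the integer key and the external string might look like the sketch below. The alphabet order (digits, upper case, lower case) is an assumption; any fixed order works as long as encoding and decoding agree.

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def base62_encode(n: int) -> str:
    """Encode a non-negative integer using the 62-character alphabet above."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def base62_decode(s: str) -> int:
    """Decode a base62 string back into the integer primary key."""
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)  # raises ValueError on invalid characters
    return n
```

The service would decode SERVICEID once per request and query the integer primary key directly.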
There's nothing wrong with using a CHAR or VARCHAR as a primary key.
Sure it'll take up a little more space than an INT in many cases, but there are many cases where it is the most logical choice and may even reduce the number of columns you need, improving efficiency, by avoiding the need to have a separate ID field.
For instance, country codes or state abbreviations already have standardised character codes and this would be a good reason to use a character based primary key rather than make up an arbitrary integer ID for each in addition.
If your external ID is base64, your internal ID is a binary string. Use that as the key in your database with type BINARY(n) (if fixed length) or VARBINARY if variable length. The binary version is 3/4 the length of the base64 one.
And just convert from/to base64 in your service.
Using a string as the type of the primary key column is not a good approach, because if the values cannot be generated sequentially in an increasing pattern, inserts may fragment the index and decrease database performance.

MySQL Table primary keys

Greetings,
I have some MySQL tables that are currently using an md5 hash as a primary key. I normally generate the hash from the value of a column. For instance, let's imagine I have a table called "Artists" with the fields id, name, num_members, year. I tend to compute md5($name) and use it as an ID.
I would like to know what the downsides of doing this are. Is it just better to use integers with AUTO_INCREMENT? I tend to run away from this because it's just not worth the trouble of finding out what the last id inserted was, what the next will be, etc.
Can you shed some light on this?
Thank you.
If you need a surrogate primary key, using an AUTO_INCREMENT field is better than an md5 hash, because it is fewer bytes of data, and database backends optimize for integer primary keys.
mysql_insert_id can be used if you need the last inserted id.
If you are generating the primary key as a hash of other columns, why not just use those other columns as a unique key, then join on those?
Another question is, what are the upsides of using an md5 hash? I can't think of any.
The MD5 isn't a true key in this case because it functionally depends on the name. That means that if you have two artists with the same name, you have duplicate "keys" for different records. You could make it a real key by hashing all the attributes together (and hoping that the probability gods don't send you a collision), or you could just save yourself the trouble and use an autoincrementing ID.
It seems like the way you're trying to use the MD5 isn't really buying you any benefit. If "$name" is unique, then why not just use "name" as the primary key? Calculating an MD5 hash and using it as a key for something that's already unique is redundant.
On the other hand, if "name" is not unique, then the MD5 hash won't be unique either and so it's pointless that way too.
Generally you use an MD5 hash when you don't want to store the actual value of the column. For instance, if you're storing passwords, you generally only store the MD5 hash of the password, not the password itself, so that you can't see people's passwords just by looking at the table contents.
If you don't have any unique fields, then you're stuck doing something like an auto-increment because it's at least guaranteed unique. If you use the built-in SQL auto-increment, then you'll just have to fetch the last one way or another. Alternately, if you can get away with keeping a unique counter locally in your application, that avoids having to use auto-increment, but isn't necessarily viable for most applications.
The first approach has one obvious disadvantage: if there are two artists of the same name there will be a primary key collision. Using an INT column with an auto-increment will ensure uniqueness.
Furthermore, though very unlikely, there is a chance that MD5 hashes of different strings could collide (an MD5 hash is 128 bits, so for a given pair of strings the probability is about 1 in 16 to the power of 32, i.e. 1 in 2^128).
The benefits are if you present the IDs to customers (say in a query string for a web form, though that is another no-no)... it prevents users guessing another one.
Personally I use auto-increment without problems (have moved DBs to new servers and everything without problems)

Facebook user_id : big_int, int or string?

Facebook's user ids go up to 2^32, which by my count is 4294967296.
mySQL's unsigned int's range is 0 to 4294967295 (which is 1 short - or my math is wrong)
and its unsigned big int's range is 0 to 18446744073709551615
int = 4 bytes, bigint = 8 bytes
OR
Do I store it as a string?
varchar(10) = ? bytes
How will it affect efficiency? I heard that MySQL handles numbers far better than strings (performance-wise). So what do you guys recommend?
Because Facebook assigns the IDs, and not you, you must use BIGINTs.
Facebook does not assign the IDs sequentially, and I suspect they have some regime for assigning numbers.
I recently fixed exactly this bug, so it is a real problem.
I would make it UNSIGNED, simply because that is what it is.
I would not use a string. That makes comparisons painful and your indexes clunkier than they need to be.
You can't use INT any more. Last night I had two user ids that maxed out INT(10).
I use a bigint to store the facebook id, because that's what it is.
but internally for the primary and foreign keys of the tables, i use a smallint, because it is smaller. But also because if the bigint should ever have to become a string (to find users by username instead of id), i can easily change it.
so i have a table that looks like this:
profile
- profile_key smallint primary key
- profile_name varchar
- fb_profile_id bigint
and one that looks like this
something_else
- profile_key smallint primary key
- something_else_key smallint primary key
- something_else_name varchar
and my queries for a single page could be something like this:
select profile_key, profile_name
from profile
where fb_profile_id = ?
now i take the profile_key and use it in the next query
select something_else_key, something_else_name
from something_else
where profile_key = ?
the profile table almost always gets queried for almost any request anyway, so i don't consider it an extra step.
And of course it is also quite easy to cache the first query for some extra performance.
If you are reading this in 2015: Facebook has upgraded their API to version 2.0 and added a note in their documentation stating that user ids will change and be app-scoped. So there is a real possibility that they might later change all the ids to alphanumeric.
https://developers.facebook.com/docs/apps/upgrading#upgrading_v2_0_user_ids
So I would suggest keeping the type as varchar to avoid any future migration pains.
Your math is a little wrong... remember that the largest unsigned number you can store in N bits is 2^N - 1, not 2^N. There are 2^N possible values (including 0), so the largest number you can store is 1 less than that.
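The arithmetic for fixed-width unsigned integers is easy to check; a tiny sketch (the function name is invented for the example):

```python
def max_unsigned(bits: int) -> int:
    """Largest value of an unsigned integer with the given number of bits."""
    return 2 ** bits - 1


int_unsigned_max = max_unsigned(32)     # 4-byte INT UNSIGNED
bigint_unsigned_max = max_unsigned(64)  # 8-byte BIGINT UNSIGNED
```

These match the ranges quoted in the question: 4294967295 for INT UNSIGNED and 18446744073709551615 for BIGINT UNSIGNED.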
If Facebook uses an unsigned big int, then you should use that. They probably don't assign them sequentially.
Yes, you could get away with a varchar... however it would be slower (but probably not as much as you are thinking).
Store them as strings.
The Facebook Graph API returns ids as strings, so if you want comparisons to work without having to cast, you should use strings. IMO this trumps other considerations.
I would just stick with INT. It's easy, it's small, it works and you can always change the column to a larger size in the future if you need to.
FYI:
VARCHAR(n) ==> variable, up to n + 1 bytes
CHAR(n) ==> fixed, n bytes
Unless you expect more than 60% of the world's population to sign up, int should do?