MySQL schema for table of hashes - mysql

I need to store, query and update a large number of file hashes. What would be the optimal MySQL schema for this kind of table? Should I use a hash index, e.g.
CREATE INDEX hash_index on Hashes(id) USING HASH;
Can I reuse the PK hash for the index? (As I understand it, "USING HASH" will create a hash of the hash.)

File hashes are fixed-length data items (unless you change the hash type after you have created some rows). If you represent your file hashes in hexadecimal or Base64, they'll contain only letters and digits. For example, SHA-256 hashes in hex take 64 characters (each character encodes four bits).
These characters are all 8-bit characters, so you don't need Unicode. If you're careful about how you fill them in, you don't need case sensitivity either. Eliminating these features from the column definition makes the values slightly faster to search.
So, make your hashes fixed-length ASCII columns, using ddl like this:
hash CHAR(64) CHARACTER SET ascii COLLATE ascii_bin
You can certainly use such a column as a primary key.
Raymond correctly pointed out that MySQL doesn't offer hash indexes except for certain table types (MEMORY tables, for example). That's OK: ordinary BTREE indexes work reasonably well for this kind of information.
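A minimal sketch of such a table, assuming hex-encoded SHA-256 values; the table and column names are only illustrative:
CREATE TABLE Hashes (
    -- 64 hex characters = 256 bits; ascii_bin keeps comparisons byte-wise
    hash CHAR(64) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
    file_name VARCHAR(255) NOT NULL,
    PRIMARY KEY (hash)   -- InnoDB implements this as an ordinary BTREE index
) ENGINE=InnoDB;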

Related

MySQL using MATCH AGAINST for long unique values (8.0.27)

I have a situation where we're storing long unique IDs (up to 200 characters) that are single TEXT entries in our database. The problem is we're using a FULLTEXT index for speed purposes and it works great for the smaller GUID style entries. The problem is it won't work for the entries > 84 characters due to the limitations of innodb_ft_max_token_size, which apparently cannot be set > 84. This means any entries more than 84 characters are omitted from the Index.
Sample Entries (actual data from different sources I need to match):
AQMkADk22NgFmMTgzLTQ3MzEtNDYwYy1hZTgyLTBiZmU0Y2MBNDljMwBGAAADVJvMxLfANEeAePRRtVpkXQcAmNmJjI_T7kK7mrTinXmQXgAAAgENAAAAmNmJjI_T7kK7mrTinXmQXgABYpfCdwAAAA==
AND
<j938ir9r-XfrwkECA8Bxz6iqxVth-BumZCRIQ13On_inEoGIBnxva8BfxOoNNgzYofGuOHKOzldnceaSD0KLmkm9ET4hlomDnLu8PBktoi9-r-pLzKIWbV0eNadC3RIxX3ERwQABAgA=#t2.msgid.quoramail.com>
AND
["ca97826d-3bea-4986-b112-782ab312aq23","ca97826d-3bea-4986-b112-782ab312aaf7","ca97826d-3bea-4986-b112-782ab312a326"]
So what are my options here? Is there any way to get the unique strings of 160 (or so) characters working with a FULLTEXT index?
What's the most efficient Index I can use for large string values without spaces (up to 200 characters)?
Here's a summary of the discussion in comments:
The ids have multiple formats: either a single token of variable length up to 200 characters, or even an "array," i.e. a JSON-formatted document containing multiple tokens. These entries come from different sources, and the format is outside of your control.
The FULLTEXT index implementation in MySQL has a maximum token size of 84 characters, so it cannot be used to search for longer tokens.
You could use a conventional B-tree index (not FULLTEXT) to index longer strings, up to 3072 bytes in current versions of MySQL. But this would not support cases of JSON arrays of multiple tokens. You can't use a B-tree index to search for words in the middle of a string. Nor can you use an index with the LIKE predicate to match a substring using a wildcard in the front of the pattern.
Therefore to use a B-tree index, you must store one token per row. If you receive a JSON array, you would have to split this into individual tokens and store each one on a row by itself. This means writing some code to transform the content you receive as id's before inserting them into the database.
MySQL 8.0.17 supports a new kind of index on a JSON array, called a Multi-Value Index. If you could store all your tokens as a JSON array, even those that are received as single tokens, you could use this type of index. But this also would require writing some code to transform the singular form of id's into a JSON array.
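For illustration, here is a hedged sketch of such a multi-valued index, assuming MySQL 8.0.17+ and that every id (even a single token) has been normalized into a JSON array under a hypothetical tokens key; it follows the pattern in the MySQL documentation but is untested:
CREATE TABLE entry_ids (
    entry_id BIGINT UNSIGNED NOT NULL,
    ids JSON NOT NULL,
    -- multi-valued index over every token in the array
    INDEX idx_tokens ((CAST(ids->'$.tokens' AS CHAR(200) ARRAY)))
);
SELECT entry_id
FROM entry_ids
WHERE 'ca97826d-3bea-4986-b112-782ab312aq23' MEMBER OF (ids->'$.tokens');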
The bottom line is that there is no single solution for indexing the text if you must support any and all formats. You either have to suffer with non-optimized searches, or else you need to find a way to modify the data so you can index it.
Create a new table with 2 columns: a VARCHAR(200) CHARACTER SET ascii COLLATE ascii_bin token column (Base64 needs case sensitivity) and a reference to the corresponding row in your main table.
That table may have multiple rows for one row in your main table.
Use some simple parsing to find the string (or strings) in each value and add them to this new table.
PRIMARY KEY(that-big-column)
Update your code to also do the INSERT of new rows for new data.
Now a simple BTree lookup plus a join will handle all your lookups.
TEXT columns cannot be fully indexed, but VARCHAR up to a certain limit can be. 200 characters of ascii is only 200 bytes, far below the 3072-byte limit.
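A hedged sketch of that side table, assuming a main table named entries with an id primary key (all names are hypothetical):
-- One row per token; several rows may point back to the same main-table row.
CREATE TABLE entry_tokens (
    token VARCHAR(200) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
    entry_id BIGINT UNSIGNED NOT NULL,  -- reference to your main table
    PRIMARY KEY (token)
) ENGINE=InnoDB;
-- Lookup: a point query on the BTree PK, then a join back to the main table.
SELECT e.*
FROM entry_tokens t
JOIN entries e ON e.id = t.entry_id
WHERE t.token = 'ca97826d-3bea-4986-b112-782ab312aq23';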

SQL - max 255 length unique index - Hash Solution

We have a table where we store tokens for users (i.e. accessTokens).
The problem is, sometimes tokens can be longer than 255 characters, and MySQL/MariaDB is unable to store them in a table that has a unique index on this column.
We need unique indexes, therefore one solution is to add an additional column containing a hash of the token that is at most 255 characters long, and put the unique index on that. Any search/save will go through this hash; after a match, we select the whole token and send it back. After a lot of thinking and googling, this is probably the only viable solution for this use-case (but you can try to give us another one).
Every single token we generate right now is at least partially random, therefore a slight chance of hash collision is "ok": the user is not stuck forever, the next request should pass.
Do you know any good modern method in 2017? Having some statistical data about hash collision for this method would be appreciated.
The hash is only for internal use - we don't need it to be secure (a fast insecure hash is best for us). It should be long enough to have a low chance of collision, but must never exceed the 255-character limit.
PS: Setting up a special version of the database/table that allows a greater length is not viable; we also need this in some older systems without migration.
Are these access tokens representable with 8-bit characters? That is, are all the characters in them taken from the ASCII or iso-8859-1 character sets?
If so, you can get a longer unique index than 255 by declaring the access-token column with COLLATE latin1_bin. The limit of an index prefix is 767 bytes, but utf8 characters in VARCHAR columns take 3 bytes per character.
So a column with 767 unique latin1 characters should be uniquely indexable. That may solve your problem if your unique hashes all fit in about 750 bytes.
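A hedged sketch of that approach (table and index names are made up):
-- latin1_bin is 1 byte per character, so the whole token fits in the 767-byte key limit.
CREATE TABLE access_tokens (
    token VARCHAR(767) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
    UNIQUE KEY uq_token (token)
) ENGINE=InnoDB;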
If not ...
You've asked for a hash function for your long tokens with a "low" risk of collision. SHA1 is pretty good, and is available as a function in MySQL. SHA512 is even better, but doesn't work in all MySQL servers. But the question is this: What is the collision risk of taking the first, or last, 250 characters of your long tokens and using them as a hash?
Why do I ask? Because your spec calls for a unique index on a column that's too long for a MySQL unique index. You're proposing to solve that problem by using a hash function that is also not guaranteed to be unique. That gives you two choices, both of which require you to live with a small collision probability.
Add a hash column that's computed by SHA2('token', 512) and live with the tiny probability of collision.
Add a hash column that's computed by LEFT('token', 255) and live with the tiny probability of collision.
You can implement the second choice simply by removing the unique constraint on your index on the token column. (In other words, by doing very little.)
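If you do want a hash column, here is a sketch of the first choice, assuming MySQL 5.7+ generated columns and a hypothetical tokens table with a token TEXT column (on older servers you would compute the value in application code instead):
ALTER TABLE tokens
    ADD COLUMN token_hash CHAR(128) CHARACTER SET ascii COLLATE ascii_bin
        GENERATED ALWAYS AS (SHA2(token, 512)) STORED,
    ADD UNIQUE KEY uq_token_hash (token_hash);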
The SHA hash family has well-known collision characteristics. To evaluate some other hash function would require knowing the collision characteristics of your long tokens, and you haven't told us those.
Comments on HASHing
UNHEX(MD5(token)) fits in 16 bytes - BINARY(16).
As for collisions: Theoretically, there is only one chance in 9 trillion that you will get a collision in a table of 9 trillion rows.
For SHA() in BINARY(20) the odds are even less. Bigger shas are, in my opinion, overkill.
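A hedged sketch of that BINARY(16) approach, again on a hypothetical tokens table with a token column:
ALTER TABLE tokens ADD COLUMN token_md5 BINARY(16);
UPDATE tokens SET token_md5 = UNHEX(MD5(token));
ALTER TABLE tokens ADD UNIQUE KEY uq_token_md5 (token_md5);
-- look up a token by hashing the incoming value the same way
SELECT * FROM tokens WHERE token_md5 = UNHEX(MD5('the-incoming-token'));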
Going beyond the 767 limit to 3072
⚈ Upgrade to 5.7.7 (MariaDB 10.2.2?) for 3072 byte limit -- but your cloud may not provide this;
⚈ Reconfigure (if staying with 5.6.3 - 5.7.6 (MariaDB 10.1?)) -- 4 things to change: Barracuda + innodb_file_per_table + innodb_large_prefix + dynamic or compressed.
Later versions of 5.5 can probably perform the 'reconfigure'.
Similar Question: Does MariaDB allow 255 character unique indexes?

Store UUID v4 in MySQL

I'm generating UUIDs using PHP, per the function found here
Now I want to store that in a MySQL database. What is the best/most efficient MySQL field format for storing UUID v4?
I currently have varchar(256), but I'm pretty sure that's much larger than necessary. I've found lots of almost-answers, but they're generally ambiguous about what form of UUID they're referring to, so I'm asking for the specific format.
Store it as VARCHAR(36) if you're looking to have an exact fit, or VARCHAR(255) which is going to work out with the same storage cost anyway. There's no reason to fuss over bytes here.
Remember VARCHAR fields are variable length, so the storage cost is proportional to how much data is actually in them, not how much data could be in them.
Storing it as BINARY is extremely annoying, the values are unprintable and can show up as garbage when running queries. There's rarely a reason to use the literal binary representation. Human-readable values can be copy-pasted, and worked with easily.
Some other platforms, like Postgres, have a proper UUID column which stores it internally in a more compact format, but displays it as human-readable, so you get the best of both approaches.
If you always have a UUID for each row, you could store it as CHAR(36) and save 1 byte per row over VARCHAR(36).
uuid CHAR(36) CHARACTER SET ascii
In contrast to CHAR, VARCHAR values are stored as a 1-byte or 2-byte
length prefix plus data. The length prefix indicates the number of
bytes in the value. A column uses one length byte if values require no
more than 255 bytes, two length bytes if values may require more than
255 bytes.
https://dev.mysql.com/doc/refman/5.7/en/char.html
Though be careful with CHAR, it will always consume the full length defined even if the field is left empty. Also, make sure to use ASCII for character set, as CHAR would otherwise plan for worst case scenario (i.e. 3 bytes per character in utf8, 4 in utf8mb4)
[...] MySQL must reserve four bytes for each character in a CHAR
CHARACTER SET utf8mb4 column because that is the maximum possible
length. For example, MySQL must reserve 40 bytes for a CHAR(10)
CHARACTER SET utf8mb4 column.
https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html
The question is about storing a UUID in MySQL.
Since version 8.0 of MySQL you can use BINARY(16) with automatic conversion via the UUID_TO_BIN/BIN_TO_UUID functions:
https://mysqlserverteam.com/mysql-8-0-uuid-support/
Be aware that MySQL also has a fast way to generate UUIDs for use as a primary key:
INSERT INTO t VALUES(UUID_TO_BIN(UUID(), true))
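For context, a minimal sketch of the round trip this implies; the true flag rearranges the timestamp bytes of the v1 UUIDs that UUID() generates so that inserts arrive in roughly increasing order, and the same flag must be passed to BIN_TO_UUID when reading:
CREATE TABLE t (
    id BINARY(16) NOT NULL PRIMARY KEY
);
INSERT INTO t VALUES (UUID_TO_BIN(UUID(), true));
SELECT BIN_TO_UUID(id, true) AS id FROM t;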
Most efficient is definitely BINARY(16): storing the human-readable characters uses over double the storage space, and means bigger indices and slower lookup. If your data is small enough that storing them as text doesn't hurt performance, you probably don't need UUIDs over boring integer keys. Storing raw is really not as painful as others suggest because any decent db admin tool will display/dump the octets as hexadecimal, rather than literal bytes of "text". You shouldn't need to be looking up UUIDs manually in the db; if you have to, HEX() and x'deadbeef01' literals are your friends. It is trivial to write a function in your app – like the one you referenced – to deal with this for you. You could probably even do it in the database as virtual columns and stored procedures so the app never bothers with the raw data.
I would separate the UUID generation logic from the display logic to ensure that existing data are never changed and errors are detectable:
function guidv4($prettify = false)
{
    // Static initializers cannot contain function calls before PHP 8.3,
    // so check for random_bytes() at call time instead.
    $data = function_exists('random_bytes')
        ? random_bytes(16)
        : openssl_random_pseudo_bytes(16);
    $data[6] = chr(ord($data[6]) & 0x0f | 0x40); // set version to 0100
    $data[8] = chr(ord($data[8]) & 0x3f | 0x80); // set bits 6-7 to 10
    if ($prettify) {
        return guid_pretty($data);
    }
    return $data;
}
function guid_pretty($data)
{
    // 16 raw bytes -> canonical 8-4-4-4-12 hex representation
    return strlen($data) == 16 ?
        vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($data), 4)) :
        false;
}
function guid_ugly($data)
{
    // Strip any non-hex characters (dashes, braces) and re-pack to 16 raw bytes
    $data = preg_replace('/[^[:xdigit:]]+/', '', $data);
    return strlen($data) == 32 ? hex2bin($data) : false;
}
Edit: If you only need the column pretty when reading the database, a statement like the following is sufficient:
ALTER TABLE test ADD uuid_pretty CHAR(36) GENERATED ALWAYS AS (CONCAT_WS('-', LEFT(HEX(uuid_ugly), 8), SUBSTR(HEX(uuid_ugly), 9, 4), SUBSTR(HEX(uuid_ugly), 13, 4), SUBSTR(HEX(uuid_ugly), 17, 4), RIGHT(HEX(uuid_ugly), 12))) VIRTUAL;
This works like a charm for me in MySQL 8.0.26
create table t (
    uuid BINARY(16) default (UUID_TO_BIN(UUID()))
);
When querying you may use
select BIN_TO_UUID(uuid) uuid from t;
The result is:
# uuid
'8c45583a-0e1f-11ec-804d-005056219395'
The most space-efficient would be BINARY(16) or two BIGINT UNSIGNED.
The former might give you headaches because manual queries do not (in a straightforward way) give you readable/copyable values.
The latter might give you headaches because of having to map between one value and two columns.
If this is a primary key, I would definitely not waste any space on it, as it becomes part of every secondary index as well. In other words, I would choose one of these types.
For performance, the randomness of UUID v4 values will hurt severely. This applies when the UUID is your primary key or when you do a lot of range queries on it. Your insertions into the primary index will be all over the place rather than all at (or near) the end. Your data loses temporal locality, which was a helpful property in various cases.
My main improvement would be to use something similar to a UUID v1, which uses a timestamp as part of its data, and ensure that the timestamp is in the highest bits. For example, the UUID might be composed something like this:
Timestamp | Machine Identifier | Counter
This way, we get a locality similar to auto-increment values.
This could be useful if you use the BINARY(16) data type:
INSERT INTO `table` (UUID) VALUES
    (UNHEX(REPLACE(UUID(), '-', '')));
I just found a nice article going in more depth on these topics: https://www.xaprb.com/blog/2009/02/12/5-ways-to-make-hexadecimal-identifiers-perform-better-on-mysql/
It covers the storage of values, with the same options already expressed in the different answers on this page:
One: watch out for character set
Two: use fixed-length, non-nullable values
Three: Make it BINARY
But also adds some interesting insight about indexes:
Four: use prefix indexes
In many but not all cases, you don’t need to index the full length of
the value. I usually find that the first 8 to 10 characters are
unique. If it’s a secondary index, this is generally good enough. The
beauty of this approach is that you can apply it to existing
applications without any need to modify the column to BINARY or
anything else—it’s an indexing-only change and doesn’t require the
application or the queries to change.
Note that the article doesn't tell you how to create such a "prefix" index. Looking at MySQL documentation for Column Indexes we find:
[...] you can create an index that uses only the first N characters of the
column. Indexing only a prefix of column values in this way can make
the index file much smaller. When you index a BLOB or TEXT column, you
must specify a prefix length for the index. For example:
CREATE TABLE test (blob_col BLOB, INDEX(blob_col(10)));
[...] the prefix length in
CREATE TABLE, ALTER TABLE, and CREATE INDEX statements is interpreted
as number of characters for nonbinary string types (CHAR, VARCHAR,
TEXT) and number of bytes for binary string types (BINARY, VARBINARY,
BLOB).
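Applied to a UUID-style column, a hedged example of such a prefix index (table, column, and index names are made up):
-- index only the first 8 characters of the stored value
CREATE INDEX idx_uuid_prefix ON my_table (uuid_col(8));
-- an equality search on the full value can still use the prefix index
-- to narrow the candidate rows before checking the rest of the string
SELECT * FROM my_table
WHERE uuid_col = '8c45583a-0e1f-11ec-804d-005056219395';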
Five: build hash indexes
What you can do is generate a checksum of the values and index that.
That’s right, a hash-of-a-hash. For most cases, CRC32() works pretty
well (if not, you can use a 64-bit hash function). Create another
column. [...] The CRC column isn’t guaranteed to be unique, so you
need both criteria in the WHERE clause or this technique won’t work.
Hash collisions happen quickly; you will probably get a collision with
about 100k values, which is much sooner than you might think—don’t
assume that a 32-bit hash means you can put 4 billion rows in your
table before you get a collision.
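A hedged sketch of that hash-index technique (names are made up):
ALTER TABLE my_table ADD COLUMN uuid_crc INT UNSIGNED NOT NULL DEFAULT 0;
UPDATE my_table SET uuid_crc = CRC32(uuid_col);
CREATE INDEX idx_uuid_crc ON my_table (uuid_crc);
-- both criteria are required because CRC32 values are not unique
SELECT * FROM my_table
WHERE uuid_crc = CRC32('8c45583a-0e1f-11ec-804d-005056219395')
  AND uuid_col = '8c45583a-0e1f-11ec-804d-005056219395';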
This is a fairly old post but still relevant and comes up in search results often, so I will add my answer to the mix. Since you already have to use a trigger or your own call to UUID() in your query, here are a pair of functions that I use to keep the UUID as text for easy viewing in the database, while reducing the footprint from 36 down to 24 characters (a 33% saving).
delimiter //
DROP FUNCTION IF EXISTS `base64_uuid`//
DROP FUNCTION IF EXISTS `uuid_from_base64`//
CREATE DEFINER='root'@'localhost' FUNCTION base64_uuid() RETURNS varchar(24)
DETERMINISTIC
BEGIN
    /* Converting into base 64 is easy: turn the uuid into binary and base64-encode it */
    RETURN to_base64(unhex(replace(uuid(), '-', '')));
END//
CREATE DEFINER='root'@'localhost' FUNCTION uuid_from_base64(base64_uuid varchar(24)) RETURNS varchar(36)
DETERMINISTIC
BEGIN
    /* Getting the uuid back from the base64 version takes a little more work: we need to put the dashes back */
    SET @hex = hex(from_base64(base64_uuid));
    RETURN lower(concat(substring(@hex, 1, 8), '-', substring(@hex, 9, 4), '-', substring(@hex, 13, 4), '-', substring(@hex, 17, 4), '-', substring(@hex, -12)));
END//

Indexes on BLOBs that contain encrypted data

I have a bunch of columns in a table that are of the type BLOB. The data that's contained in these columns are encrypted with MySQL's AES_ENCRYPT() function. Some of these fields are being used in a search section of an application I'm building. Is it worth it to put indexes on the columns that are being frequently accessed? I wasn't sure if the fact that they are BLOBs or the fact that the data itself is encrypted would make an index useless.
EDIT: Here are some more details about my specific case. There is a table with ~10 columns or so that are each BLOBs. Each record that is inserted into this table will be encrypted using the AES_ENCRYPT() function. In the search portion of my application users will be able to type in their query. I take their query and run the search against decrypted values, like this: SELECT AES_DECRYPT(fname, MYSTATICKEY) AS fname FROM some_table, so that I can perform a search using a LIKE clause. What I am curious about is whether the index will index the encrypted data and not the actual data that is returned from the decryption. I am guessing that if the index applies only to the encrypted binary string then it would not help performance at all. Am I wrong on that?
Note the following:
You can't add an index of type FULLTEXT to a BLOB column (http://dev.mysql.com/doc/refman/5.5/en//fulltext-search.html)
Therefore, you will need to use another type of index. For BLOBs, you will have to specify a prefix length (http://dev.mysql.com/doc/refman/5.0/en/create-index.html) - the length will depend on the storage engine (e.g. up to 1000 bytes long for MyISAM tables, and 767 bytes for InnoDB tables). Therefore, unless the values you are storing are short you won't be able to index all the data.
AES_ENCRYPT() encrypts a string and returns a binary string. This binary string will be the value that is indexed.
Therefore, IMO, your guess is right - an index won't help the performance of your searches.
Note that 'indexing an encrypted column' is a fairly common problem - there are quite a few articles online about it. For example (although this is quite old and for MS SQL, it does cover some ideas): http://blogs.msdn.com/b/raulga/archive/2006/03/11/549754.aspx
Also see: What's the best way to store and yet still index encrypted customer data? (the top answer links to the same article I found above)

SHA1 sum as a primary key?

I am going to store filenames and other details in a table where I am planning to use the SHA-1 hash of the filename as the PK.
Q1. A SHA-1 PK will not be a sequentially increasing/decreasing number. So will it be more resource-consuming for the database to maintain and search an index on that key, if I decide to keep it in the database as a 40-char value?
Q2. I read here: https://stackoverflow.com/a/614483/986818 about storing the data as a binary(20) field. Can someone advise me in this regard:
a) do I have to create this column as: TYPE=integer, LENGTH=20, COLLATION=binary, ATTRIBUTES=binary?
b) how do I convert the sha1 value in MySQL or Perl to store it into the table?
c) is there a danger of duplicates for this 20-char value?
---------UPDATE-------------
The requirement is to search the table on filename. The user supplies a filename; I search the table and, if the filename is not there, add it. So I can either index a varchar(100) filename field, or generate a column with the SHA-1 of the filename - hoping that would be easier for MySQL to index than a varchar field. I can also search using the SHA-1 value from my program against the SHA-1 column. What do you say? Primary key or just an indexed key: I chose PK because DBIx likes using a PK, and PK or INDEX+UNIQUE would be the same amount of overhead for the system (so I thought).
OK, then use a very short hash of the filename and accept collisions. Use an integer type for it (that's much faster!). E.g. you can use md5(filename), take the first 8 hex characters, and convert them to an integer. The SQL could look like this:
CREATE TABLE files (
    id INT AUTO_INCREMENT,
    hash INT UNSIGNED,
    filename VARCHAR(100),
    PRIMARY KEY(id),
    INDEX(hash)
);
Then you can use:
SELECT id FROM files WHERE hash=<hash> AND filename='<filename>';
The hash is then used for sorting out most other files (normally all other files) and then the filename is for selecting the right entry out of the few hash collisions.
For generating an integer hash-key in perl I suggest using md5() and pack().
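If you would rather compute the hash in SQL than in Perl, a hedged equivalent takes the first 8 hex digits of the MD5 (32 bits) and converts them to a number that fits the INT UNSIGNED column (the filename here is just an example):
-- CONV() turns the 8 hex digits into a decimal number
INSERT INTO files (hash, filename)
VALUES (CONV(LEFT(MD5('some/file/name.txt'), 8), 16, 10), 'some/file/name.txt');
SELECT id FROM files
WHERE hash = CONV(LEFT(MD5('some/file/name.txt'), 8), 16, 10)
  AND filename = 'some/file/name.txt';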
If I decide to keep it in the database as a 40-char value.
Using a character sequence as a key will degrade performance for obvious reasons.
Also, the PK is supposed to be unique. Although it is probably unlikely that you will end up with collisions, theoretically using such a function to create the PK seems inappropriate.
Additionally, anyone who knows the filename and the hash function you use would know all your database ids. I am not sure whether this is something you need to consider.
Q1: Yes, it will need to build up a B-Tree of nodes that contain not just one integer (4 bytes) but a CHAR(40). Speed would be approximately the same, as long as the index is kept in memory. As the entries are about 10 times bigger, you need 10 times more memory to keep it in memory. BUT: you probably want to look up by the hash anyway, so you'll need to have it either as the primary key OR as an index.
Q2: Just create a table field like CREATE TABLE test (ID BINARY(20), ...); later you can use INSERT INTO test (ID, ..) VALUES (UNHEX('4D7953514C'), ...); (a raw SHA-1 digest is 20 bytes, i.e. 40 hex characters).
-- Regarding: Is there a danger of duplicates for this 20-char value?
The chance is 1 in 2^(8*20) = 2^160, i.e. about 1 in 1.46 × 10^48. So a collision is extremely improbable.
There is no reason to use a cryptographically secure hash here. Instead, if you do this, use an ordinary hash. See here: https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
The hash is NOT a 40 char value! It's a 160 bit number, and you should store it that way (as a 20 char binary field). Edit: I see you mentioned that in comment 2. Yes, you should definitely do that. But I can't tell you how since I don't know what programming language you are using. Edit2: I see it's perl - sorry I don't know how to convert it in perl, but look for "pack" functions.
No, do not create it as type integer. MySQL's largest integer type (BIGINT) is only 64 bits, which doesn't hold the entire 160-bit value. Although you could really just truncate the hash to something that fits without real harm.
It's better to use a simpler hash anyway. You could risk it and ignore collisions, but if you do it properly you kind of have to handle them.
I would stick with the standard auto-incrementing integer for the primary key. If uniqueness of file names is important (which it sounds like it is), then you can add a UNIQUE constraint on the file name itself or some derived, canonical version of the file name. Most languages/frameworks have some sort of method for getting a canonical version of a path (relative to absolute, standardized case, etc).
If you implement my suggestion or pursue your original plan, then you should be aware that multiple strings can map to the same filename/path. Both versions will have different hashes/pass the uniqueness constraint but will actually both refer to the same file. This depends on operating system and may or may not be a problem for you. Just something to keep in mind.