SQL - max 255 length unique index - Hash Solution - mysql

We have a table where we store tokens for users (e.g. accessTokens).
The problem is that tokens can sometimes be longer than 255 characters, and MySQL/MariaDB cannot store them in a table that has a unique index on that column.
We need unique indexes, so one solution is to add an additional column holding a hash of the token, at most 255 characters long, and put the unique index on that. Any search/save goes through this hash; after a match, we select the whole token and send it back. After a lot of thinking and googling, this is probably the only viable solution for this use case (but feel free to suggest another one).
Every token we generate right now is at least partially random, so a slight chance of hash collision is "ok": the user is not stuck forever, and the next request should pass.
Do you know any good modern method (as of 2017)? Some statistical data about hash collisions for that method would be appreciated.
The hash is only for internal use - we don't need it to be secure (a fast, insecure hash is best for us). It should be long enough to have a low chance of collision, but must never exceed the 255-character limit.
PS: Setting up a special version of the database/table that allows longer indexes is not viable; we also need this to work on some older systems without migration.

Are these access tokens representable with 8-bit characters? That is, are all the characters in them taken from the ASCII or iso-8859-1 character sets?
If so, you can get a unique index longer than 255 characters by declaring the access-token column with COLLATE latin1_bin. The limit of an index prefix is 767 bytes, but utf8 characters in VARCHAR columns take 3 bytes per character.
So a column with 767 unique latin1 characters should be uniquely indexable. That may solve your problem if your unique hashes all fit in about 750 bytes.
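A minimal sketch of that approach (the table and column names here are made up for illustration):
CREATE TABLE tokens_latin1 (
    id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    -- 750 latin1 characters = 750 bytes, under the 767-byte index prefix limit
    token VARCHAR(750) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
    UNIQUE KEY uniq_token (token)
) ENGINE=InnoDB;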
If not ...
You've asked for a hash function for your long tokens with a "low" risk of collision. SHA1 is pretty good, and is available as a function in MySQL. SHA512 is even better, but doesn't work in all MySQL servers. But the question is this: What is the collision risk of taking the first, or last, 250 characters of your long tokens and using them as a hash?
Why do I ask? Because your spec calls for a unique index on a column that's too long for a MySQL unique index. You're proposing to solve that problem by using a hash function that is also not guaranteed to be unique. That gives you two choices, both of which require you to live with a small collision probability.
Add a hash column that's computed by SHA2('token', 512) and live with the tiny probability of collision.
Add a hash column that's computed by LEFT('token', 255) and live with the tiny probability of collision.
You can implement the second choice simply by removing the unique constraint on your index on the token column. (In other words, by doing very little.)
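A sketch of the first choice (hypothetical names; the hash is computed in the statements themselves rather than by a trigger or generated column):
CREATE TABLE tokens_hashed (
    id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    -- the token itself carries no unique index; uniqueness is enforced
    -- on the fixed-length hash instead
    token      TEXT NOT NULL,
    -- SHA2(..., 512) returns 128 hex characters
    token_hash CHAR(128) CHARACTER SET latin1 NOT NULL,
    UNIQUE KEY uniq_token_hash (token_hash)
) ENGINE=InnoDB;

SET @tok = '...a token longer than 255 characters...';
INSERT INTO tokens_hashed (token, token_hash) VALUES (@tok, SHA2(@tok, 512));
SELECT token FROM tokens_hashed WHERE token_hash = SHA2(@tok, 512);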
The SHA hash family has well-known collision characteristics. To evaluate some other hash function would require knowing the collision characteristics of your long tokens, and you haven't told us those.

Comments on HASHing
UNHEX(MD5(token)) fits in 16 bytes - BINARY(16).
As for collisions: Theoretically, there is only one chance in 9 trillion that you will get a collision in a table of 9 trillion rows.
For SHA() in BINARY(20) the odds are even lower. Bigger SHAs are, in my opinion, overkill.
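As a sketch, the MD5 variant might look like this (hypothetical table name; note the column is populated before the unique key is added):
ALTER TABLE tokens ADD COLUMN token_md5 BINARY(16);
UPDATE tokens SET token_md5 = UNHEX(MD5(token));
ALTER TABLE tokens ADD UNIQUE KEY uniq_token_md5 (token_md5);

-- lookups hash the incoming token the same way
SET @tok = '...incoming token...';
SELECT token FROM tokens WHERE token_md5 = UNHEX(MD5(@tok));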
Going beyond the 767 limit to 3072
⚈ Upgrade to 5.7.7 (MariaDB 10.2.2?) for 3072 byte limit -- but your cloud may not provide this;
⚈ Reconfigure (if staying with 5.6.3 - 5.7.6 (MariaDB 10.1?)) -- 4 things to change: Barracuda + innodb_file_per_table + innodb_large_prefix + dynamic or compressed.
Later versions of 5.5 can probably perform the 'reconfigure'.
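A rough sketch of the 'reconfigure' route for 5.6.3 - 5.7.6 (verify the settings against your version's documentation):
-- in my.cnf:
--   innodb_file_format    = Barracuda
--   innodb_file_per_table = ON
--   innodb_large_prefix   = ON

-- each table wanting long index columns needs a DYNAMIC (or COMPRESSED) row format
CREATE TABLE t (
    token VARCHAR(1000) CHARACTER SET utf8 NOT NULL,   -- 3000 bytes, under 3072
    UNIQUE KEY uniq_token (token)
) ENGINE=InnoDB ROW_FORMAT=DYNAMIC;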
Similar Question: Does MariaDB allow 255 character unique indexes?

Related

MySQL create index error

The issue is creating an index for the table 'visits_visit' (a Django visits app), because every query takes at least 60 ms and is going to get worse.
CREATE INDEX resource ON visits_visit (object_app(200), object_model(200), object_id(200));
It returns:
ERROR 1071 (42000): Specified key was too long; max key length is 1000 bytes
What should I do? The table structure is shown in the screenshot.
See the possible duplicate question already referenced in the comments under your question - a canonical duplicate target to close this question to, if it does close. That said, there is not much in that reference about storage engines or character sets.
In your case, the character set matters because of the string-type columns in your composite index.
A side note is certainly performance: don't expect it to be great with what you are attempting. Your index is far too wide and may very well not even be used as intended. Indexes and their benefit need careful scrutiny, which can be done with MySQL's EXPLAIN. See the following, in particular the General Comments section.
Please see the article Using Innodb_large_prefix to Avoid ERROR 1071; below is an excerpt.
The character limit depends on the character set you use. For example
if you use latin1 then the largest column you can index is
varchar(767), but if you use utf8 then the limit is varchar(255).
There is also a separate 3072 byte limit per index. The 767 byte limit
is per column, so you can include multiple columns (each 767 bytes or
smaller) up to 3072 total bytes per index, but no column longer than
767 bytes. (MyISAM is a little different. It has a 1000 byte index
length limit, but no separate column length limit within that). One
workaround for these limits is to only index a prefix of the longer
columns, but what if you want to index more than 767 bytes of a column
in InnoDB? In that case you should consider using innodb_large_prefix,
which was introduced in MySQL 5.5.14 and allows you to include columns
up to 3072 bytes long in InnoDB indexes. It does not affect the index
limit, which is still 3072 bytes.
Also see the Min and Max section of the MySQL manual page Limits on InnoDB Tables.
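For the statement in the question, one workaround along those lines is to index only a prefix of each column so the total stays under the limit (the prefix lengths below are illustrative; pick ones that remain selective for your data):
CREATE INDEX resource ON visits_visit (object_app(50), object_model(50), object_id(50));
-- 150 utf8 characters * 3 bytes = 450 bytes, under the 1000-byte MyISAM limit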
The 'right' answer is to shorten the fields and/or normalize them.
Do you really have 200-character-long apps, models, etc? If not, shorten the fields.
Probably model is repeated in the table a lot? If so, normalize it and replace the column with the id from normalizing it.
You seem to be using MyISAM; you could (should) also switch to InnoDB. That will change the error message, or it might make it go away.
Are you using utf8 characters? Are you doing everything in English? Changing the CHARACTER SET could make 200 characters mean 200 bytes, not 600 (utf8) or 800 (utf8mb4).
Changing the character set for ip_address would shrink its footprint from 15 * (bytes/char). So would changing from CHAR to VARCHAR. Note also that 15 is insufficient to handle IPv6.
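As a sketch of that last point (ip_bin is a made-up column name; INET6_ATON requires MySQL 5.6.3 or later):
-- VARBINARY(16) holds a packed IPv4 (4 bytes) or IPv6 (16 bytes) address
ALTER TABLE visits_visit ADD COLUMN ip_bin VARBINARY(16);
UPDATE visits_visit SET ip_bin = INET6_ATON(ip_address);
SELECT INET6_NTOA(ip_bin) FROM visits_visit LIMIT 10;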

Does always using 255 chars for varchar fields decrease performance?

I usually use the maximum number of chars possible for varchar fields, so in most cases I set 255 even though I'm only using 16 chars in the columns...
Does this decrease performance for my database?
When it comes to storage, a VARCHAR(255) column will take up 1 byte to store the length of the actual value plus the bytes required to store the actual value.
For a latin1 VARCHAR(255) column, that's at most 256 bytes. For a UTF8 column, where each character can take up to 3 bytes (though rarely), the maximum size is 766 bytes. As we know, the maximum index length for a single column in InnoDB is 767 bytes, which is perhaps the reason some declare 255 as the maximum supported column length.
So, again, when storing the value, it only takes up as much room as is actually needed.
However, if the column is indexed, the index automatically allocates the maximum possible size so that each node in the index has enough room to store any possible value. When searching through an index, MySQL loads the nodes in chunks of a specific byte size. Larger nodes mean fewer nodes per read, which means it takes longer to search the index.
MySQL will also use the maximum size when storing the values in a temp table for sorting.
So, even if you aren't using indexes, but are ever performing a query that can't utilize an index for sorting, you will get a performance hit.
Therefore, if performance is your goal, setting any VARCHAR column to 255 characters should not be a rule of thumb. Instead, you should use the minimum required.
There may be edge cases where you'd rather suffer the performance every day so that you never have to lock a table completely to increase the size of a column, but I don't think that's the norm.
One possible exception is if you are joining on a VARCHAR column between two tables. MySQL says:
MySQL can use indexes on columns more efficiently if they are declared
as the same type and size.
In that case, you might use the max size between the two.
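For instance, declaring both join columns identically lets the index be used for the join (a sketch with made-up table names):
-- both sides VARCHAR(32) with the same character set and collation
CREATE TABLE customers (
    customer_code VARCHAR(32) NOT NULL PRIMARY KEY
) ENGINE=InnoDB;

CREATE TABLE orders (
    id            INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    customer_code VARCHAR(32) NOT NULL,
    KEY idx_customer_code (customer_code)
) ENGINE=InnoDB;

SELECT o.id FROM orders o JOIN customers c ON c.customer_code = o.customer_code;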
Whenever you're talking about "performance" you can only find out one way: Benchmarking.
In theoretical terms, there's no difference between VARCHAR(20) and VARCHAR(255) if they're both populated with the same data. Keep in mind that if you get your length wrong, you will have massive truncation problems, and MySQL does not warn you before it starts chopping data to fit.
I try to avoid setting limits on VARCHAR columns unless the data would be completely invalid if it was longer. For instance, two-character ISO country codes can be stored in VARCHAR(2) because longer strings are meaningless. For other things, especially names or phone numbers, limiting the length is potentially and probably harmful.
Still, you will want to test any schema you create to be sure it meets your performance requirements. I expect you'd have a hard time detecting any difference at all between VARCHAR(25) and VARCHAR(16).
There are two ways in which this will decrease performance.
1. If you're loading those columns many, many times, performing a join on the column, or doing anything else that means they need to be accessed a large number of times. The number of times depends on your machine, but think on the order of millions.
2. If you're always filling the field (using 20 chars in a varchar(20)), the length checks add a little overhead whenever you perform an insert.
The best way to determine this, though, is to benchmark your database.

Database MySQL design - varchar length for utf8 fields :: 1. password 2. username 3. email

Most of the time I define varchar(255) automatically.
But now I'm wondering what varchar length would be best to define for these utf8 fields:
password
username
email
If these fields should be defined as less than varchar(255), how much will that improve performance?
Thanks
'password' should be char(40) if you use SHA1 hashes. This might have binary collation if you are sure the case of the hash is always the same; this gives you better performance. If you're not sure, use latin1, but don't use utf8.
'email'... use 255, you cannot know how long someone's email address is.
For the username I'd just use whatever your max username length is. 20 or 30 would probably be good.
If you have an index on a character field (especially if it's part of the PK), choose the length very carefully, because longer indexes can reduce performance heavily (and increase memory usage).
Also, if you use a UTF8 char field in an index, be aware that MySQL reserves three times as many bytes as the character length of the field, preparing for the worst case (UTF8 might store certain characters on 3 bytes). This can also cause a lack of memory.
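Putting those suggestions together, a sketch of such a table might be (lengths follow the advice above; names are illustrative):
CREATE TABLE users (
    id       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    username VARCHAR(30) CHARACTER SET utf8 NOT NULL,
    -- 40 hex characters of SHA1; latin1 with binary collation since
    -- the hash is always ASCII
    password CHAR(40) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
    email    VARCHAR(255) CHARACTER SET utf8 NOT NULL,
    UNIQUE KEY uniq_username (username)
) ENGINE=InnoDB;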
If you index any of those fields (and you don't use a prefix as the index), bear in mind that MySQL will index the field as though it were CHAR rather than VARCHAR, and each index record will use the maximum potential space (so 3n bytes for a VARCHAR(n), since a UTF8 character can be up to 3 bytes long). That could mean the index will be larger than necessary. To get around this, make the field smaller, or index on a prefix.
(I should say: I'm sure I've read that this is the case somewhere in the MySQL documentation, but I couldn't find it when I looked just now.)
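A prefix index along those lines might look like this (a sketch assuming a users table with a utf8 email column; the 50-character prefix is arbitrary):
-- indexes only the first 50 characters (150 bytes in utf8) of the column
CREATE INDEX idx_email ON users (email(50));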
Changing that won't have a big effect on performance (depending on how many rows are in that table - probably you won't notice any effect), but maybe it will make your database use less disk space. (I use a length of 30 for user names, 64 for passwords (the length of the hash), and 50 for email addresses.)

MySQL: Why use VARCHAR(20) instead of VARCHAR(255)? [duplicate]

Possible Duplicate:
Are there disadvantages to using a generic varchar(255) for all text-based fields?
In MySQL you can choose a length for the VARCHAR field type. Possible values are 1-255.
But what are the advantages of using VARCHAR(255), the maximum, instead of VARCHAR(20)? As far as I know, the size of the entries depends only on the real length of the inserted string.
size (bytes) = length+1
So if you have the word "Example" in a VARCHAR(255) field, it would have 8 bytes. If you have it in a VARCHAR(20) field, it would have 8 bytes, too. What is the difference?
I hope you can help me. Thanks in advance!
Check out: Reference for Varchar
In short, there isn't much difference unless you go over the size of 255 in your VARCHAR, which will require another byte for the length prefix.
The length indicates more of a constraint on the data stored in the column than anything else. This inherently constrains the MAXIMUM storage size for the column as well. IMHO, the length should make sense with respect to the data. If you're storing a Social Security # it makes no sense to set the length to 128, even though it doesn't cost you anything in storage if all you actually store is an SSN.
There are many valid reasons for choosing a value smaller than the maximum that are not related to performance. Setting a size helps indicate the type of data you are storing and also can also act as a last-gasp form of validation.
For instance, if you are storing a UK postcode then you only need 8 characters. Setting this limit helps make clear the type of data you are storing. If you chose 255 characters it would just confuse matters.
I don't know about MySQL, but in SQL Server it will let you define fields such that the total number of bytes used is greater than the total number of bytes that can actually be stored in a record. This is a bad thing. Sooner or later you will get a row where the limit is reached and you cannot insert the data.
It is far better to design your database structure to consider row size limits.
Additionally yes, you do not want people to put 200 characters in a field where the maximum value should be 10. If they do, it is almost always bad data.
You say, well I can limit that at the application level. But data does not get into the database just from one application. Sometimes multiple applications use it, sometimes data is imported and sometimes it is fixed manually from the query window (update all the records to add 10% to the price for instance). If any of these other sources of data don't know about the rules you put in your application, you will have bad, useless data in your database. Data integrity must be enforced at the database level (which doesn't stop you from also checking before you try to enter data) or you have no integrity. Plus it has been my experience that people who are too lazy to design their database are often also too lazy to actually put the limits into the application and there is no data integrity check at all.
They have a word for databases with no data integrity - useless.
There is a semantic difference (and I believe that's the only difference): if you try to fill 30 non-space characters into varchar(20), it will produce an error, whereas it will succeed for varchar(255). So it is primarily an additional constraint.
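A quick sketch of that behavior (in strict mode the insert errors; in non-strict mode the value is truncated with a warning instead):
CREATE TABLE demo (v20 VARCHAR(20), v255 VARCHAR(255));
SET SESSION sql_mode = 'STRICT_ALL_TABLES';
INSERT INTO demo (v255) VALUES (REPEAT('x', 30));  -- succeeds
INSERT INTO demo (v20)  VALUES (REPEAT('x', 30));  -- ERROR 1406 (Data too long)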
Well, if you want to allow for a larger entry, or limit the entry size perhaps.
For example, you may have first_name as a VARCHAR 20, but perhaps street_address as a VARCHAR 50 since 20 may not be enough space. At the same time, you may want to control how large that value can get.
In other words, you have set a ceiling of how large a particular value can be, in theory to prevent the table (and potentially the index/index entries) from getting too large.
You could just use CHAR, which is fixed-width as well, but unlike VARCHAR, which can be smaller, CHAR pads the values (although this makes for quicker SQL access).
From a database perspective performance wise I do not believe there is going to be a difference.
However, I think a lot of the decision on the length to use comes down to what you are trying to accomplish and documenting the system to accept just the data that it needs.

MySQL performance of unique varchar field vs unique bigint

I'm working on an application that will be implementing a hex value as a business key (in addition to an auto increment field as primary key) similar to the URL id seen in Gmail. I will be adding a unique constraint to the column and was originally thinking of storing the value as a bigint to get away from searching a varchar field but was wondering if that's necessary if the field is unique.
Internal joins would be done using the auto increment field and the hex value would be used in the where clause for filtering.
What sort of performance hit would there be from simply storing the value as a varchar(x), or perhaps a char(x), compared with the additional work of converting to and from hex to store the value as an integer in the database? Is it worth the additional complexity?
I did a quick test on a small number of rows (50k) and had similar search result times. If there is a large performance issue would it be linear, or exponential?
I'm using InnoDB as the engine.
Is your hex value a GUID? Although I used to worry about the performance of such long items as indexes, I have found that on modern databases the performance difference on even millions of records is fairly insignificant.
A potentially larger problem is the memory that the index consumes (16 byte vs 4 byte int, for example), but on servers that I control I can allocate for that. As long as the index can be in memory, I find that there is more overhead from other operations that the size of the index element doesn't make a noticeable difference.
On the upside, if you use a GUID you gain server independence for records created and more flexibility in merging data on multiple servers (which is something I care about, as our system aggregates data from child systems).
There is a graph on this article that seems to back up my suspicion: Myths, GUID vs Autoincrement
The hex value is generated from a UUID (Java's implementation); it's hashed and truncated to a smaller length (likely 16 characters). The algorithm for this is still under discussion (currently SHA). An advantage I see of storing the value as hex vs integer is that if we needed to grow the size (which I don't see happening with this application at 16 chars), we could simply increase the truncated length and leave the old values without fear of collision. Converting to integer values wouldn't work as nicely for that.
The reason for the truncation vs simply using a GUID/UUID is simply to make the URL's and API's (which is where these will be used) more friendly.
All else being equal, keeping the data smaller will make it run faster, mostly because it'll take less space: less disk I/O, less memory needed to hold the index, and so on. 50k rows isn't enough to notice that, though...
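As a sketch of what 'smaller' can mean here, a 16-hex-character key packs into half the bytes as BINARY(8) (names and the sample value are made up):
CREATE TABLE url_keys (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    pub_key BINARY(8) NOT NULL,              -- 16 hex chars packed into 8 bytes
    UNIQUE KEY uniq_pub_key (pub_key)
) ENGINE=InnoDB;

INSERT INTO url_keys (pub_key) VALUES (UNHEX('a1b2c3d4e5f60718'));
SELECT id, HEX(pub_key) FROM url_keys WHERE pub_key = UNHEX('a1b2c3d4e5f60718');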