MySQL: UNIQUE text field using additional HASH field

In my MySQL DB I have a table defined like:
CREATE TABLE `mytablex_cs` (
`id` mediumint(8) unsigned NOT NULL AUTO_INCREMENT,
`tag` varchar(6) COLLATE utf8_bin NOT NULL DEFAULT '',
`value` text COLLATE utf8_bin NOT NULL,
PRIMARY KEY (`id`),
KEY `kt` (`tag`),
KEY `kv` (`value`(200))
) ENGINE=MyISAM AUTO_INCREMENT=7 DEFAULT CHARSET=utf8 COLLATE=utf8_bin
I need to implement a UNIQUE constraint (key) on the value field.
I know that it is not yet possible to define a unique index on the entire value of a BLOB or TEXT field, but there is an open feature request (see this page) where it has been suggested to create a unique key using a hash, as is already implemented for other fields.
Now I would like to use a similar approach, adding another field to the table that will contain the hash and creating a unique key on that field.
I took a look at possible ways to create this hash and, since I would like to avoid collisions (I need to insert several million entries), it seems that the RIPEMD-160 algorithm is the best one, even though a quick search turned up several similar solutions that use SHA-256 or even SHA-1 and MD5.
I totally lack knowledge in cryptography, so what are the downsides of choosing this approach?
Another question I have is: which algorithm is currently used by MySQL to create such a hash?

Let's look at your requirements:
You need to ensure that the value field is unique. The value field is a TEXT column, and due to its nature there is (for now) no way to create a unique index on the entire value. So using an extra field that holds a hash of value is your only real option here.
Advantages to this approach:
It is easy to calculate the hash.
It is extremely rare for two different values to produce the same hash, so your hash values are almost guaranteed to be unique.
Hashes are normally a numeric value (expressed as hexadecimal) that can be efficiently indexed.
The hashes won't take up a lot of space. Different hashing functions return different length hashes, so play around with the different algorithms and test them to find one that suits your needs.
Disadvantages of this approach:
There is an extra field to cater for during INSERTs and UPDATEs, i.e. there is more work to do.
If you already have data in the table and it is in production, you will have to update the current data, and hopefully you don't already have duplicates. The update will also take time to run, so it might be tricky to apply the change to an already working system.
Hashing functions are CPU intensive, so they can add noticeable load when you insert or update many rows.
I assume you understand what a hash function does and conceptually how it works.
You can find a list of cryptographic functions here: http://dev.mysql.com/doc/refman/5.5/en/encryption-functions.html
As far as I know, MySQL supports the MD5, SHA, SHA1 and SHA2 hashing functions. Most if not all of these should be sufficient for simple hashing. Some functions like MD5 have known issues when used in cryptographic applications, i.e. when used in PKI as a signature algorithm etc. However, those issues are not that important here, since you are using the hash only to create a unique value, not in a cryptographic context.
To use the MySQL hashing functions you can try the following examples:
SELECT MD5('1234');
SELECT SHA('1234');
SELECT SHA1('1234');
SELECT SHA2('1234',224);
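Putting this together for the table in the question, a minimal sketch might look like the following (the value_hash column and uk_value_hash key names are my own choices, and SHA2 assumes MySQL 5.5.5 or later):
-- Add a column to hold the hash; SHA2-256 always yields 64 hex characters.
ALTER TABLE `mytablex_cs` ADD COLUMN `value_hash` char(64) NOT NULL DEFAULT '';
-- Backfill the hash for existing rows.
UPDATE `mytablex_cs` SET `value_hash` = SHA2(`value`, 256);
-- Enforce uniqueness only after backfilling (this fails if duplicates already exist).
ALTER TABLE `mytablex_cs` ADD UNIQUE KEY `uk_value_hash` (`value_hash`);
-- Every INSERT (and any UPDATE of `value`) must keep the hash in sync.
INSERT INTO `mytablex_cs` (`tag`, `value`, `value_hash`)
VALUES ('mytag', 'some long text', SHA2('some long text', 256));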
As with everything new, you should try all the approaches and find the one that will be most successful in your case.

Related

How to save varchar field for query performance in mysql innodb

I made the following sample table.
CREATE TABLE `User` (
`userId` INT(11) NOT NULL AUTO_INCREMENT,
`userKey` VARCHAR(20) NOT NULL,
PRIMARY KEY (`userId`),
UNIQUE KEY (`userKey`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
I want faster query speed on the userKey field.
SELECT * FROM `User` WHERE userKey='KEY123'
In order to do that, what else should I consider besides the index?
For example, when saving the userKey, would using a sortable value such as a DATE as a prefix be beneficial to the search speed?
Or, if I save the userKey as a hash value, will that improve things?
Or is the index alone enough for the above query?
You've declared userKey as UNIQUE, so MySQL will automatically create an index for you. That should generally be enough, given that you are searching by "exact match" or by "starting with" type of queries.
If you consider storing hash values, you would only be able to perform "exact match" queries. However, for short strings that may turn out not to be worth it. Also, with hashes you need to consider the risk of a collision; if you use a smaller hash value (say 4 bytes) that risk increases. Those are just some guidelines; it's not possible to provide a definitive solution without delving into more details of your case.
One way to speed up the table is to get rid of it.
Building a mapping from an under-20-char string to a 4-byte INT does not save much space. Getting rid of the lookup saves a little time.
If you need to do the 'right thing' and have one copy of each string to avoid redundancy, then what you have is probably the best possible.
If there are more columns, then there may be a slight improvement.
But first, answer this: do you do the lookup by UserKey more often than by UserId?
I give a resounding NO to a hash. (Such might be necessary if the strings were too big to index.)
A BTree is very efficient at doing a "point query" (FROM User WHERE UserId = 123 or FROM User WHERE UserKey = 'xyz'). Adding a date or hash would only make it slower.
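For what it's worth, you can verify that the point query uses the automatically created unique index with EXPLAIN; for a unique-key equality lookup you should see the const access type:
EXPLAIN SELECT * FROM `User` WHERE userKey = 'KEY123';
-- Expect: type = const, key = userKey (the index created by the UNIQUE constraint)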

TIME Columns as primary key

For a long time, and for several reasons, I have understood that DATETIME columns should not form part of the primary key of a table. Among these reasons, I think it is a bad idea given the high precision of this type. For example, 2014-06-26 15:35:12 won't match 2014-06-26 15:35:13.
Questions like Use timestamp(or datetime) as part of primary key (or part of clustered index) seem to support this "phobia".
However, I am now facing a very concrete problem: I want to map into a MySQL table some values of a function like
f:(TimeInDay,TimeInDay) -> Integer
Where the arguments represent a time interval (with second precision) within the same day.
Each unique (TimeInDay, TimeInDay) pair results in a concrete output value. So I came to this table structure:
CREATE TABLE sessions_schedule
(
tIni TIME NOT NULL,
tEnd TIME NOT NULL,
X TINYINT,
CONSTRAINT pk PRIMARY KEY (tIni, tEnd)
);
Where TIMEs compose the primary key.
In the MySQL online manual I found:
MySQL recognizes TIME values in several formats,... Some of these
formats can include a trailing fractional seconds part in up to
microseconds (6 digits) precision. Although this fractional part is
recognized, it is discarded from values stored into TIME columns.
So, it seems to me, that in this case the inclusion of TIME fields in the primary key is justified. Am I right?
For a long time, and for several reasons, I have understood that DATETIME columns should not form part of the primary key of a table.
That's not true for the relational model, it's not true of SQL in general, and it's not true of MySQL in particular.
Among these reasons, I think it is a bad idea given the high precision of this type. For example, 2014-06-26 15:35:12 won't match 2014-06-26 15:35:13.
Your example isn't a good one. Think about using integers instead. Would you expect the integer 3 to match the integer 4? Of course not. So why would you think '2014-06-26 15:35:12' would match '2014-06-26 15:35:13'? They're different values. Different values aren't supposed to match.
So, it seems to me, that in this case the inclusion of TIME fields in
the primary key is justified. Am I right?
Quite likely. You just have to make sure that you
don't store any values more precise than a second, and
tIni is before tEnd.
(MySQL can store trailing microseconds.)
On other platforms, you'd probably use CHECK constraints to enforce those requirements, but MySQL doesn't enforce CHECK constraints. You'll need to write triggers, or revoke permissions on the tables, and require changes to go through a stored procedure.
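As an illustration, here is a minimal sketch of such a trigger (the trigger name is my own invention; SIGNAL assumes MySQL 5.5+, and a matching BEFORE UPDATE trigger would be needed as well). Since, per the manual excerpt above, TIME columns already discard fractional seconds, only the ordering rule needs enforcing:
DELIMITER //
CREATE TRIGGER sessions_schedule_bi BEFORE INSERT ON sessions_schedule
FOR EACH ROW
BEGIN
    -- Reject intervals whose start is not strictly before their end.
    IF NEW.tIni >= NEW.tEnd THEN
        SIGNAL SQLSTATE '45000'
            SET MESSAGE_TEXT = 'tIni must be before tEnd';
    END IF;
END//
DELIMITER ;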

Partitioning a MySQL table

I am considering partitioning a MySQL table that has the potential to grow very big. The table as it stands goes like this:
DROP TABLE IF EXISTS `uidlist`;
CREATE TABLE IF NOT EXISTS `uidlist` (
`uid` varchar(9) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
`chcs` varchar(16) NOT NULL DEFAULT '',
UNIQUE KEY `uid` (`uid`)
) ENGINE=InnoDB DEFAULT CHARSET=ascii;
where
uid is a 9-character id string starting with a lowercase letter
chcs is a checksum that is used internally.
I suspect that the best way to partition this table would be based on the first letter of the uid field. This would give
Partition 1
abcd1234,acbd1234,adbc1234...
Partition 2
bacd1234,bcad1234,bdac1234...
However, I have never done partitioning before, so I have no idea how to go about it. Is the partitioning scheme I have outlined possible? If so, how do I go about implementing it?
I would much appreciate any help with this.
Check out the manual for a start :)
http://dev.mysql.com/tech-resources/articles/partitioning.html
MySQL is pretty feature-rich when it comes to partitioning, and choosing the correct strategy depends on your use case (can partitioning help your sequential scans?) and on the way your data grows, since you don't want any single partition to become too large to handle.
If your data tends to grow somewhat steadily over time, you might want a create-date based partitioning scheme so that (for example) all records generated in a single year end up in the last partition and previous partitions are never written to. For this to happen you may have to introduce another column to regulate it; see http://dev.mysql.com/doc/refman/5.1/en/partitioning-hash.html.
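As an illustration only (the created column, table name and yearly boundaries are invented for this sketch, not taken from the question; note that every unique key on a partitioned table must include the partitioning column, which is why the key here widens to (uid, created) - a real tradeoff, since uid alone is then no longer enforced unique):
CREATE TABLE uidlist_by_year (
    uid  varchar(9) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
    chcs varchar(16) NOT NULL DEFAULT '',
    created DATE NOT NULL,
    UNIQUE KEY uid_created (uid, created)
) ENGINE=InnoDB DEFAULT CHARSET=ascii
PARTITION BY RANGE (YEAR(created))
(
    PARTITION p2013 VALUES LESS THAN (2014),
    PARTITION p2014 VALUES LESS THAN (2015),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);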
An added optimization benefit of this approach is that you can keep the most recent partition on a disk with fast writes (a solid-state drive, for example) and the older partitions on a cheaper disk with decent read speed.
Anyway, knowing more about your use case would help people give you more concrete answers (possibly including SQL code).
EDIT: also check out http://www.tokutek.com/products/tokudb-for-mysql/
The main question you need to ask yourself before partitioning is "why". What is the goal you are trying to achieve by partitioning the table?
Since all the table's data will still exist on a single MySQL server and, I assume, new rows will arrive in "random" order (with respect to which partition they are inserted into), you won't gain much by partitioning. Your point-select queries might be slightly faster, but likely not by much.
The main benefit I've seen using MySQL partitioning is for data that needs to be purged according to a set retention policy. Partitioning data by week or month makes it very easy to delete old data quickly.
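With a scheme like the yearly one sketched above, purging a whole period is a quick metadata operation instead of millions of row deletes:
ALTER TABLE uidlist_by_year DROP PARTITION p2013;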
It sounds more likely to me that you want to shard your data (spread it across many servers), and since your data design as shown is really just key-value, I'd recommend looking at database solutions that include sharding as a feature.
I have upvoted both of the answers here since they both make useful points. @bbozo - a move to TokuDB is planned, but there are constraints that stop it from being made right now.
I am going off the idea of partitioning the uidlist table as I had originally wanted to. However, for the benefit of anyone who finds this thread while trying to do something similar, here is the "how to":
DROP TABLE IF EXISTS `uidlist`;
CREATE TABLE IF NOT EXISTS `uidlist` (
`uid` varchar(9) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
`chcs` varchar(16) NOT NULL DEFAULT '',
UNIQUE KEY `uid` (`uid`)
) ENGINE=InnoDB DEFAULT CHARSET=ascii
PARTITION BY RANGE COLUMNS(uid)
(
-- RANGE COLUMNS compares strings literally, so '%' is not a wildcard
-- here; plain letters make the boundaries clearer.
PARTITION p0 VALUES LESS THAN ('f'),
PARTITION p1 VALUES LESS THAN ('k'),
PARTITION p2 VALUES LESS THAN ('p'),
PARTITION p3 VALUES LESS THAN ('u'),
-- Without a catch-all partition, uids starting with 'u'..'z' would be
-- rejected with "Table has no partition for value".
PARTITION p4 VALUES LESS THAN (MAXVALUE)
);
which creates five partitions covering the full range of uids.
I suspect that the long-term solution here is to use a key-value store as suggested by @tmcallaghan rather than just stuffing everything into a MySQL table. I will probably post back in due course once I have established the right way to accomplish that.

MySQL - Speed of string comparison for primary key

I have a MySQL table where I would like my primary key to be a string. This string may potentially be rather long (hundreds of characters).
A very common query would be an INSERT ... ON DUPLICATE KEY UPDATE, which means MySQL would have to check very often whether the primary key already exists in the table. If this is done with a naive strcmp, I imagine it might take quite a while the longer the strings get. Would it thus be better to hash the string manually (either to a shorter string or some other data type) and use that as my primary key, or can I just use the long string directly? Does MySQL hash primary key strings internally?
First off, when you have an index on a varchar field, MySQL doesn't do a strcmp against every entry to find the correct one; instead it uses a B-tree, which needs far fewer comparisons to navigate to the proper entry.
Note: I include some info below on improving performance if need be, but please do not do that until you hit an actual problem. Varchar indexes are quick; they have been optimized by a lot of very smart people, and in the large majority of cases they will be way more than you need.
With that said, if you have a lot of entries and/or very long keys, it can be nice performance-wise to use an index of hashes on top of it:
CREATE TABLE users
(
username varchar(255) not null,        -- VARCHAR needs an explicit length
username_hashed varchar(32) not null,  -- an MD5 hash in hex is 32 chars
primary key (username),
index (username_hashed)
);
When you insert, you can set username_hashed = MD5(username), for example. When you search, you then compare both the hash and the original string, so that a rare hash collision can never return the wrong row.
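As a minimal sketch of that usage (the value 'alice' is just a stand-in for whatever username your application is handling):
INSERT INTO users (username, username_hashed)
VALUES ('alice', MD5('alice'));

SELECT username                     -- plus whatever other fields you need
FROM users
WHERE username_hashed = MD5('alice')
  AND username = 'alice';           -- guards against hash collisions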
Note that MySQL seems to support hash indexes natively for some storage engines (MEMORY, for example), which would allow you to avoid doing this by hand.
Does the primary key need to be a string? Can't it just be a unique index, with an auto-increment integer primary key?
Searching will always be faster with integers, and it might take a bit of code rearrangement in your app, but you'll always be better off searching numeric primary keys than strings.
Look at these two posts that show the difference in memory for int and varchar:
What is the size of column of int(11) in mysql in bytes?
Memory usage of storing strings as varchar in MySQL

Suggestions for string lookup optimizations

We have the MySQL database table below with about 75,000 entries. Each entry in the table represents a symbol in the system for which further data can be retrieved. This table is queried for autocomplete purposes: a user looks up a symbol, which is then matched either to the symbol's name or to its tags (a semicolon-separated list of strings). When the user selects the correct symbol, the relevant data is fetched. Here is the table's description:
CREATE TABLE `symbols` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(512) NOT NULL,
`tags` varchar(512) DEFAULT NULL,
`type` enum('1','2','3','4','5','6','7','8','9') NOT NULL,
`popularity` int(11) DEFAULT '0',
PRIMARY KEY (`id`),
UNIQUE KEY `uc_symbol_name` (`type`,`name`),
KEY `symbol_idx` (`name`),
KEY `type_popularity_idx` (`type`,`popularity`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
The above table is stored, alongside copious amounts of data, on a backend machine which serves this data over a JSON API. Currently, our frontend JavaScript code queries the backend server directly via AJAX in order to do the autocomplete. Instead, to speed things up, we want to create a local cached version of the symbols table on the server from which the frontend is served (the frontend is written in Django). This is possible since the table contains under 100,000 symbols, and because the table only gets updated about once a minute. Furthermore, it will allow us to implement better matching algorithms like Levenshtein distance.
How would you go about creating this type of cached symbol table? Obviously the lookup will have to happen in code (probably Python), but how would you store the data, and how would you sync it once a minute? We have a Redis server running on the Django frontend server, but that introduces the question of persistence... Any thoughts are very welcome!
Just use a simple hash table, together with a "last updated time". Every time you do a lookup in the hash, check the "last updated" time. If it is more than a minute in the past, dump the data in the hash and reload it from the DB. Of course you have to make sure to avoid race conditions...
There are other alternatives, but this is the simplest way and will be the easiest to code correctly. If you find that making one of your transactions per minute pay the extra latency of a big DB operation is not acceptable, you can come up with something a bit more complicated (such as running the DB reload asynchronously on a background thread). To prepare for that eventuality, encapsulate this code in a class. (Then, if it's too slow, you can play with the implementation without affecting any of your other code.)
Of course, there are other things you could do if you need more performance. You could add an updated_time column to your DB records, and then only load the ones which have been updated since last time. Will this actually make things faster? If it does, will the difference be big enough to even matter? There's no way to know unless you try it out. That's why it's better to try the simpler solution first and see if it reaches your performance goals.
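A minimal sketch of that incremental approach (the updated_time column comes from the suggestion above; the timestamp literal is a placeholder for the last sync time your code remembers):
ALTER TABLE `symbols`
    ADD COLUMN `updated_time` TIMESTAMP NOT NULL
        DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;

-- On each sync, fetch only the rows changed since the previous sync.
SELECT `id`, `name`, `tags`, `type`, `popularity`
FROM `symbols`
WHERE `updated_time` >= '2014-01-01 12:00:00';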