Hash method for database string search? - mysql

I have a MySQL InnoDB database and one of the fields in a table is term VARCHAR(255) CHARACTER SET utf8 NOT NULL
This is too large, as it can be 255*3 = 765 bytes. It's still within the limit of 767 bytes InnoDB has, but I want to speed up searches based on the term as well as save space by reducing the size of the indexes.
Instead of using the term as a key, I decided to use a hash of term.
What kind of hash method should I use?
edit: I am storing search terms, e.g. "how to find a new car", "iphone 5", "best yugioh card" etc

The best way is to use MD5 like this:
CREATE TABLE termtable
(
id int not null auto_increment,
term VARCHAR(255) CHARACTER SET utf8 NOT NULL,
termhash char(32) not null,
primary key (id),
key (termhash)
);
If you are looking for one specific value and those values could be lengths well beyond 32 characters, you could store the hash value:
INSERT INTO termtable (term,termhash)
VALUES ('a long string',MD5('a long string'));
That way, you just search for hash values to retrieve results:
SELECT * FROM termtable WHERE termhash = MD5('a long string');

MySQL has a built-in MD5() function. The resulting hash is only 32 hex characters, or 16 bytes if stored in binary form.
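As a quick check of the sizes mentioned above, here is a minimal Python sketch using the standard hashlib module (not MySQL itself). The sample term comes from the question; the variable names are only for illustration.

```python
import hashlib

term = "how to find a new car"  # sample search term from the question

h = hashlib.md5(term.encode("utf-8"))
hex_form = h.hexdigest()  # hex string, the same form MySQL's MD5() returns
raw_form = h.digest()     # the same hash as raw bytes

print(len(hex_form))  # 32
print(len(raw_form))  # 16
```

The hex form is what fits the char(32) column in the table above; the raw form is half the size if you are willing to store binary data.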

Related

Cakephp 3 create i18n table in phpmyadmin issue

I have a problem creating the i18n table for the CakePHP 3 Translate Behavior. My database is managed in phpMyAdmin, and the error occurs when I execute this piece of code from the official cookbook:
CREATE TABLE i18n (
id int NOT NULL auto_increment,
locale varchar(6) NOT NULL,
model varchar(255) NOT NULL,
foreign_key int(10) NOT NULL,
field varchar(255) NOT NULL,
content text,
PRIMARY KEY (id),
UNIQUE INDEX I18N_LOCALE_FIELD(locale, model, foreign_key, field),
INDEX I18N_FIELD(model, foreign_key, field)
);
phpMyAdmin says:
1071 - Specified key was too long; max key length is 767 bytes
I'm using utf8_unicode_ci. Should I go for utf8_general_ci?
Thanks for your help.
There is no difference in size requirements between utf8_unicode_ci and utf8_general_ci; they only differ with regard to sorting.
By default, the index (key prefix) limit is 767 bytes for InnoDB tables (and 1000 bytes for MyISAM). If applicable, enable the innodb_large_prefix option (it is enabled by default as of MySQL 5.7), which raises the limit to 3072 bytes. Alternatively, make the VARCHAR columns smaller and/or change their collation. The locale column (which holds ISO locale/country codes) surely doesn't use Unicode characters, and chances are that your model and column/field names also use only ASCII characters and are far shorter than 255 characters.
With an ASCII collation, the VARCHAR columns require only 1 byte per character, unlike UTF-8, which can require up to 3 bytes (or 4 bytes for the mb4 variants). With UTF-8, the two VARCHAR(255) columns alone already exceed the index size limit (3 * 255 * 2 = 1530).
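To make that arithmetic concrete, here is a small Python sketch that reproduces the numbers. The helper name and the charset table are my own for illustration; how MySQL applies the limit in practice depends on the storage engine and row format.

```python
# Maximum bytes per character reserved for each character set.
BYTES_PER_CHAR = {"ascii": 1, "latin1": 1, "utf8": 3, "utf8mb4": 4}

def varchar_key_bytes(length, charset):
    """Worst-case bytes a VARCHAR(length) column contributes to an index key."""
    return length * BYTES_PER_CHAR[charset]

# model and field are both VARCHAR(255) in the i18n table.
utf8_total = 2 * varchar_key_bytes(255, "utf8")
ascii_total = 2 * varchar_key_bytes(255, "ascii")

print(utf8_total)   # 1530
print(ascii_total)  # 510
```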
See also
MySQL 5.7 Manual > Character Sets and Collations
MySQL 5.7 Manual > Limits on InnoDB Tables > Maximums and Minimums
MySQL 5.7 Manual > InnoDB Startup Options and System Variables > innodb_large_prefix
I have shortened the columns in my statement to:
model varchar(85) NOT NULL,
field varchar(85) NOT NULL,
With model and field at 85 characters, which I think is enough, MySQL accepts it.
Hope that will help someone.

Database field lengths for storing data with unknown and technically unlimited length?

I have to store DOIs in a MySQL database. The handbook says:
There is no limitation on the length of a DOI name.
So far, the maximum length of a DOI in my current data is 78 chars. Which field length would you recommend in order to not waste storage space and to be on the safe side? In general:
How do you handle the problem of not knowing the maximum length of input data that has to be stored in a database, considering space and the efficiency of transactions?
EDIT
There are these two (simplified) tables document and topic with a one-to-many relationship:
CREATE TABLE document
(
ID int(11) NOT NULL,
DOI ??? NOT NULL,
PRIMARY KEY (ID)
);
CREATE TABLE topic
(
ID int(11) NOT NULL,
DocID int(11) NOT NULL,
Name varchar(255) NOT NULL,
PRIMARY KEY (ID),
FOREIGN KEY (DocID) REFERENCES Document(ID), UNIQUE(DocID)
);
I have to run the following (simplified) query for statistics, returning the total value of referenced topic-categories per document (if there are any references):
SELECT COUNT(topic.Name) AS number, document.DOI
FROM document LEFT OUTER JOIN topic
ON document.ID = topic.DocID
GROUP BY document.DOI;
The collation used is utf8_general_ci.
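TEXT and VARCHAR can store up to 64KB. If you're being extra paranoid, use LONGTEXT, which allows 4GB, though if the names are actually longer than 64KB then that is a really abusive standard. A large VARCHAR is probably a reasonable accommodation, but note that the 64KB limit is in bytes and is shared by the whole row, so with utf8 (up to 3 bytes per character) the largest length you can declare is roughly 21,000 characters rather than 65535.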
Since VARCHAR is variable length then you really only pay for the extra storage if and when it's used. The limit is just there to cap how much data can, theoretically, be put in the field.
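A rough Python sketch of that point, assuming the documented VARCHAR rule of the data bytes plus a 1-byte length prefix when the column can never need more than 255 bytes, and a 2-byte prefix otherwise. The DOI value and the function name are hypothetical, and real on-disk sizes vary by storage engine and row format.

```python
def varchar_storage_bytes(value, declared_chars, max_bytes_per_char=3):
    """Approximate bytes needed to store `value` in VARCHAR(declared_chars).

    Assumes a 1-byte length prefix if the column can never exceed 255 bytes,
    and a 2-byte prefix otherwise.
    """
    data = value.encode("utf-8")
    prefix = 1 if declared_chars * max_bytes_per_char <= 255 else 2
    return len(data) + prefix

doi = "10.1000/example"  # hypothetical DOI, 15 characters

print(varchar_storage_bytes(doi, 255))    # 17
print(varchar_storage_bytes(doi, 21000))  # 17
```

The declared maximum changes nothing here: the same value costs the same number of bytes under either declaration.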
Space is not a problem; indexing may be a problem. Please provide the queries that will need an index on this column. Also provide the CHARACTER SET needed. With those, we can discuss the ramifications of various cutoffs: 191, 255, 767, 3072, etc.
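For context, the cutoffs 191 and 255 fall out of dividing the index byte limits by the maximum bytes per character. A small Python sketch of that division (the function name is made up for illustration):

```python
def max_indexable_chars(limit_bytes, bytes_per_char):
    """Longest fully indexable string, in characters, under a byte limit."""
    return limit_bytes // bytes_per_char

print(max_indexable_chars(767, 4))   # 191  (utf8mb4, 767-byte limit)
print(max_indexable_chars(767, 3))   # 255  (utf8, 767-byte limit)
print(max_indexable_chars(3072, 4))  # 768  (utf8mb4, 3072-byte limit)
print(max_indexable_chars(3072, 3))  # 1024 (utf8, 3072-byte limit)
```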

mySQL VARCHAR(256) + mySQL INT = how many bytes?

CREATE SCHEMA IF NOT EXISTS `utftest` DEFAULT CHARACTER SET utf16;
CREATE TABLE IF NOT EXISTS `metadata_labels` (`metadata_id` INT NOT NULL , `label` VARCHAR(256) NOT NULL , PRIMARY KEY (`metadata_id`, `label`));
however I get the following error msg:
Specified key was too long; max key length is 767 bytes
Please advise
utf16 uses up to 32 bits (4 bytes) per character in MySQL. 4 x 256 > 767.
If possible, I would recommend using something other than UTF16 VARCHAR for your key.
In UTF8, it would require 3 x 256 + 4 = 772 bytes. UTF16 would take about a third more (4 x 256 + 4 = 1028 bytes).
You shouldn't use a primary key that's so wide; for an index to be efficient, the storage for each index should be kept to a minimum.
If you need to prevent duplicates, I would recommend adding a calculated field that contains a hash of the contents (e.g. sha1) and create a unique constraint on that instead.
Alternatively, use latin1 as the character encoding for the label field to reduce the number of bytes to 256 + 4 = 260.
If Unicode is a must and hashes are out of the picture you should reduce the column to either UTF8 (250 chars) or UTF16 (190 chars)
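The key sizes discussed in these answers can be worked out with a short Python sketch. It assumes, as the answers do, that the whole (INT, label) key has to fit within 767 bytes; the variable names are mine.

```python
INT_BYTES = 4       # size of the metadata_id INT column
LABEL_CHARS = 256   # declared length of the label column
LIMIT = 767         # InnoDB key limit from the error message

WIDTHS = {"latin1": 1, "utf8": 3, "utf16": 4}  # max bytes per character

# Worst-case size of the (metadata_id, label) key for each character set.
key_bytes = {cs: LABEL_CHARS * w + INT_BYTES for cs, w in WIDTHS.items()}

# Longest label that still fits next to the INT within the limit.
max_label_chars = {cs: (LIMIT - INT_BYTES) // w for cs, w in WIDTHS.items()}

print(key_bytes)        # {'latin1': 260, 'utf8': 772, 'utf16': 1028}
print(max_label_chars)  # {'latin1': 763, 'utf8': 254, 'utf16': 190}
```

Under that assumption the suggested 190 characters for UTF16 is exact, and 250 for UTF8 is a round number just under the 254 that would fit.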

MySQL index for long strings

I have MySQL InnoDb table where I want to store long (limit is 20k symbols) strings. Is there any way to create index for this field?
You can put an MD5 of the field into another field and index that. Then, when you do a search, you match against both the full field, which is not indexed, and the MD5 field, which is indexed.
SELECT *
FROM my_table -- placeholder; the original query omitted the table name
WHERE large_field_md5 = md5("hello world hello world ...")
AND large_field = "hello world hello world ..."
large_field_md5 is indexed, so we go directly to the record that matches. Once in a blue moon it might need to test 2 records if there is a duplicate MD5.
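The same lookup logic can be simulated in memory with Python. This is only a sketch of the idea (a dict stands in for the index, a list for the table), with made-up row values:

```python
import hashlib
from collections import defaultdict

def md5_hex(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# Stand-in for the table: each entry plays the role of large_field.
rows = ["hello world hello world", "another long string", "yet another one"]

# Stand-in for the index on large_field_md5: hash -> row positions.
index = defaultdict(list)
for position, text in enumerate(rows):
    index[md5_hex(text)].append(position)

def find(text):
    """Return positions of rows whose full value equals `text`."""
    candidates = index.get(md5_hex(text), [])
    # Compare the full value too, in case two different strings share an MD5.
    return [p for p in candidates if rows[p] == text]

print(find("another long string"))  # [1]
print(find("not stored"))           # []
```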
You will need to limit the length of the index, otherwise you are likely to get error 1071 ("Specified key was too long"). The MySQL manual entry on CREATE INDEX describes this:
Indexes can be created that use only the leading part of column values, using col_name(length) syntax to specify an index prefix length:
Prefixes can be specified for CHAR, VARCHAR, BINARY, and VARBINARY columns.
BLOB and TEXT columns also can be indexed, but a prefix length must be given.
Prefix lengths are given in characters for nonbinary string types and in bytes for binary string types. That is, index entries consist of the first length characters of each column value for CHAR, VARCHAR, and TEXT columns, and the first length bytes of each column value for BINARY, VARBINARY, and BLOB columns.
It also adds this:
Prefix support and lengths of prefixes (where supported) are storage engine dependent. For example, a prefix can be up to 1000 bytes long for MyISAM tables, and 767 bytes for InnoDB tables.
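To illustrate the difference between a prefix length in characters and its size in bytes, here is a small Python sketch. The sample string and the prefix length are arbitrary choices of mine, and the 4-bytes-per-character worst case assumes utf8mb4.

```python
text = "na\u00efve caf\u00e9 search"  # contains two non-ASCII characters
prefix_chars = 5

prefix = text[:prefix_chars]               # first 5 characters
actual_bytes = len(prefix.encode("utf-8"))
worst_case_bytes = prefix_chars * 4        # what a utf8mb4 prefix must allow for

print(len(prefix))       # 5
print(actual_bytes)      # 6
print(worst_case_bytes)  # 20
```

So a prefix of N characters on a nonbinary column counts against the byte limit at the character set's maximum width, not at the size of the data actually stored.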
Here is an example of how you could do that. As @Gidon Wise mentioned in his answer, you can index the additional field. In this case it will be query_md5.
CREATE TABLE `searches` (
`id` int(10) UNSIGNED NOT NULL,
`query` varchar(10000) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`query_md5` varchar(32) COLLATE utf8mb4_unicode_ci DEFAULT NULL
) ENGINE=InnoDB;
ALTER TABLE `searches`
ADD PRIMARY KEY (`id`),
ADD KEY `searches_query_md5_index` (`query_md5`);
Two different strings can produce the same MD5 hash, so to make sure you get the right row, double-check by also adding and `query` = '...' to the condition.
The query will look like this:
select * from `searches` where `query_md5` = "b6d31dc40a78c646af40b82af6166676" and `query` = 'long string ...'
b6d31dc40a78c646af40b82af6166676 is the MD5 hash of the 'long string ...' value. I think this can improve query performance, and you can be sure that you will get the right results.
Use the sha2 function with a specific length. Add this to your table:
`hash` varbinary(32) GENERATED ALWAYS AS (unhex(sha2(`your_text`,256)))
ADD UNIQUE KEY `ix_hash` (`hash`);
Read about the SHA2 function
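As a sanity check on the varbinary(32) size, this Python sketch mirrors unhex(sha2(..., 256)) using hashlib. The input string is just a placeholder.

```python
import hashlib

text = "some long text value"  # placeholder for the contents of your_text

hex_digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
raw = bytes.fromhex(hex_digest)  # plays the role of UNHEX() in the SQL above

print(len(hex_digest))  # 64
print(len(raw))         # 32
```

Storing the unhexed form halves the index entry compared with keeping the 64-character hex string.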

Is there any point in using CHAR if you have a VARCHAR in the same table?

I just read the accepted answer of this question, which left me with this question.
Here's a quote from that answer:
"But since you tagged this question with MySQL, I'll mention a MySQL-specific tip: when your query implicitly generates a temporary table, for instance while sorting or GROUP BY, VARCHAR fields are converted to CHAR to gain the advantage of working with fixed-width rows. If you use a lot of VARCHAR(255) fields for data that doesn't need to be that long, this can make the temporary table very large."
As I understand it, the advantage of CHAR is that you get fixed-width rows, so doesn't a VARCHAR in the same table mess that up? Are there any advantages of using CHAR when you have a VARCHAR in the same table?
Here's an example:
Table with CHAR:
CREATE TABLE address (
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
street VARCHAR(100) NOT NULL,
postcode CHAR(8) NOT NULL,
PRIMARY KEY (id)
);
Table without CHAR:
CREATE TABLE address (
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
street VARCHAR(100) NOT NULL,
postcode VARCHAR(8) NOT NULL,
PRIMARY KEY (id)
);
Will the table with CHAR perform any better than the table without CHAR, and if so, in what situations?
"VARCHAR" basically sets a maximum length for the field and only stores the data that is entered into it, thus saving space. The "CHAR" type has a fixed length, so if you set "CHAR(100)", 100 characters' worth of space will be used regardless of what the contents are.
The only time you will gain a speed advantage is if you have no variable-length fields in your record ("VARCHAR", "TEXT", etc.). You may notice that MySQL internally changes all your "CHAR" fields to "VARCHAR" as soon as a variable-length field type is added.
Also "CHAR" is less efficient from a space storage point of view, but more efficient for searching and adding. It's faster because the database only has to read an offset value to get a record rather than reading parts until it finds the end of a record. And fixed length records will minimize fragmentation, since deleted record space can be reused for new records.
Hope it helps.
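To show what "only has to read an offset" means for fixed-length records, here is a Python sketch using the struct module. It illustrates the general idea only and is not MySQL's actual on-disk format; the postcodes are made up.

```python
import struct

# Fixed-width layout: 4-byte unsigned id followed by an 8-byte postcode.
RECORD = struct.Struct("<I8s")

postcodes = [b"AB1 2CD", b"EF3 4GH", b"IJ5 6KL"]
data = b"".join(RECORD.pack(i + 1, pc) for i, pc in enumerate(postcodes))

def read_record(n):
    """Jump straight to record n by computing its byte offset."""
    record_id, postcode = RECORD.unpack_from(data, n * RECORD.size)
    return record_id, postcode.rstrip(b"\x00")  # drop the fixed-width padding

print(RECORD.size)     # 12
print(read_record(2))  # (3, b'IJ5 6KL')
```

With variable-length rows there is no such multiplication: the reader has to follow stored lengths or pointers to find where record n starts.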