Using binary text comparison in MySQL - efficiency pitfalls?

I'm using binary string comparison on user names (defined as varchar) to be sure the string matches exactly:
... where binary Owner = '$user' ...
or
... from Records join Users on binary Records.Owner = Users.User ....
But I'm not sure if it has some (either positive or negative) impact on efficiency. The manual states:
Note that in some contexts, if you cast an indexed column to BINARY,
MySQL is not able to use the index efficiently.
Is that an issue in this case? If anything, I would have expected binary to be faster, because it doesn't have to ignore case, whitespace, some accents etc.

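Indeed, an expression in a WHERE clause can make it impossible to use an index on a column. You wouldn't expect an index on a float column called x to be of any use for searching
WHERE SIN(x) BETWEEN 0.0 AND 0.1
for example. The same applies to BINARY Owner: the index is ordered by the column's own collation, not by the result of the cast, so MySQL cannot search it directly for the cast value.
The answer to your problem is to define your Owner column with a binary collation rather than a case-insensitive one, so that a plain Owner = '$user' comparison is already exact and can use the index. See here for how to do that: http://dev.mysql.com/doc/refman/5.0/en/charset-column.html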
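To illustrate why the ordering matters, here is a minimal sketch in Python. It is only an analogy, not how MySQL implements its indexes: a sorted list stands in for an index on a case-insensitive column, and the user names and helper functions are made up for the example. A binary search that compares byte-wise on a list sorted case-insensitively can miss an existing entry, whereas narrowing by the list's own ordering and then comparing exactly finds it.

```python
from bisect import bisect_left, bisect_right

# Hypothetical user names, kept in case-insensitive order to imitate
# an index on a column with a case-insensitive collation.
names = ["alice", "Bob", "carol", "Dave", "dave"]
index = sorted(names, key=str.casefold)
keys = [n.casefold() for n in index]  # the ordering the "index" really uses


def binary_lookup_naive(target):
    """Binary search using byte-wise comparison on a list that is
    sorted case-insensitively; the orderings disagree, so it can miss."""
    i = bisect_left(index, target)
    return i < len(index) and index[i] == target


def exact_lookup(target):
    """Narrow the candidates with the index's own ordering, then
    apply the exact (case-sensitive) comparison to those few rows."""
    lo = bisect_left(keys, target.casefold())
    hi = bisect_right(keys, target.casefold())
    return target in index[lo:hi]


print(binary_lookup_naive("Dave"))  # False: the entry exists but is missed
print(exact_lookup("Dave"))         # True
print(exact_lookup("DAVE"))         # False: no exact match
```

With a binary collation on the column, the stored order and the comparison agree, so no such two-step lookup is needed.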

Related

SQL string literal hexadecimal key to binary and back

After an extensive search I am resorting to Stack Overflow's wisdom to help me.
Problem:
I have a database table that should effectively store values of the format (UserKey, data0, data1, ..), where the UserKey is to be handled as the primary key, or at least as an index. The UserKey itself (externally defined) is a string of 32 characters representing a checksum, which happens to be a (very big) hexadecimal number, i.e. it looks like this: UserKey = "000000003abc4f6e000000003abc4f6e".
Now I can certainly store this UserKey in a CHAR(32) field, but I feel this is mighty inefficient, as I store a series of in principle arbitrary characters, i.e. I reserve space for more information per character than the 4 bits I need to store a hexadecimal digit (0-9, A-F).
So my thought was to convert this string literal into the hexadecimal number it really represents, and store that. But this number (32 * 4 bits = 16 bytes) is much too big to store or handle as an integer, as SQL only handles BIGINTs of 8 bytes.
My second thought was to convert this into a BINARY(16) representation, which should be compact and efficient concerning memory. However, I do not know how to efficiently convert between these two formats, as SQL internally also only handles numbers up to a maximum of 8 bytes.
Maybe there is a way to convert this string to binary block by block and stitch the binary together somehow, in the way of:
UserKey == concat( stringblock1, stringblock2, ..)
UserKey_binary = concat( toBinary( stringblock1 ), toBinary( stringblock2 ), ..)
So my question is: is there any such mechanism in SQL that would solve this for me? What would a custom solution look like? (I find it hard to believe that I am the first to encounter such a problem, as it has become quite common to use ridiculously long hash keys in many applications.)
Also, UserKey_binary should then act as the relational key for the table, so I hope to gain a bit of speed from this more compact representation, as differences can be determined on a minimal number of bits. Additionally, I would like to do any conversion on the server side if possible, so that user scripts do not have to be altered (the user side should, if possible, still transmit a string literal in the insert statement, not [partially] converted values).

Contrary to my previous statement, it seems that MySQL's UNHEX() function converts a string block by block and then concatenates the result, much like I described above, so the method also works for hex literals that are bigger than BIGINT's 8-byte limit. Here is an example table that illustrates this:
CREATE TABLE `testdb`.`tab` (
`hexcol_binary` BINARY(16) GENERATED ALWAYS AS (UNHEX(charcol)) STORED,
`charcol` CHAR(32) NOT NULL,
PRIMARY KEY (`hexcol_binary`));
The primary key is a generated column, so that updates to charcol are the designated way of interacting with the table using string literals from the outside:
REPLACE into tab (charcol) VALUES ('1010202030304040A0A0B0B0C0C0D0D0');
SELECT HEX(hexcol_binary) as HEXstring, tab.* FROM tab;
As seen, building keys and indexes on hexcol_binary works as intended.
To verify the speedup, add an index on charcol and compare the two lookups:
ALTER TABLE `testdb`.`tab`
ADD INDEX `charkey` (`charcol` ASC);
EXPLAIN SELECT * FROM tab WHERE hexcol_binary = UNHEX('1010202030304040A0A0B0B0C0C0D0D0'); # key_len 16
EXPLAIN SELECT * FROM tab WHERE charcol = '1010202030304040A0A0B0B0C0C0D0D0'; # key_len 97
The lookup on the hexcol_binary column uses a much shorter key and performs much better, especially if it is additionally made unique.
Note: the hex conversion does not care whether the hex characters A through F are upper or lower case, whereas a comparison on charcol can be sensitive to this, depending on the column's collation.
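For anyone who wants to check the size and case behaviour outside the database, here is a small Python sketch of the same round trip. The helper names to_binary_key and to_hex_key are made up for this example; bytes.fromhex and bytes.hex merely play the role that UNHEX() and HEX() play in the table above.

```python
def to_binary_key(hex_key: str) -> bytes:
    """Convert a 32-character hex string to 16 raw bytes (similar to UNHEX)."""
    if len(hex_key) != 32:
        raise ValueError("expected exactly 32 hex characters")
    return bytes.fromhex(hex_key)  # raises ValueError on non-hex input


def to_hex_key(binary_key: bytes) -> str:
    """Convert raw bytes back to an uppercase hex string (similar to HEX)."""
    return binary_key.hex().upper()


key = "1010202030304040A0A0B0B0C0C0D0D0"
raw = to_binary_key(key)
print(len(raw))                           # 16
print(to_hex_key(raw) == key)             # True
print(to_binary_key(key.lower()) == raw)  # True: letter case is irrelevant
```

The 16-byte value is what ends up in the BINARY(16) column, half the size of the 32 characters in charcol.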

Finding the set of wildcards for representing a binary interval

In a switch flow table there is a field called the match field, where a list of matching conditions is maintained. In the match fields, binary sequences with wildcards are used to represent some matching conditions (a wildcard character * in a bit position means that this bit can be either 1 or 0).
For example, we can use the following binary sequences with wildcards to represent the matching condition '38798 <= Port <= 56637':
100101111000111*
100101111**1****
100101111*1*****
1001011111******
10011***********
101*************
110111010011110*
1101110100***0**
1101110100**0***
1101110100*0****
11011101000*****
11011100********
110**0**********
110*0***********
1100************
Does anyone know (or can anyone suggest) a way to obtain such a set of sequences? So far I have used a brute-force strategy (generating all possible combinations for the interval), but it is not computationally feasible (memory explosion) and suffers from redundancy (several wildcard patterns representing the same interval). So now I'm trying to use range splitting and grid-of-tries to get a solution, but without success yet.
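One simple form of the range splitting mentioned above can be sketched as follows. This is an assumption-laden starting point, not the method that produced the list in the question: it only emits prefix patterns (wildcards at the end), so it may need more entries than a cover that allows wildcards in arbitrary positions. The function names range_to_prefixes and matches are made up for the example.

```python
def range_to_prefixes(lo, hi, width=16):
    """Split the integer range [lo, hi] into aligned power-of-two blocks
    and return each block as a bit pattern with trailing '*' wildcards."""
    if not 0 <= lo <= hi < (1 << width):
        raise ValueError("need 0 <= lo <= hi < 2**width")
    patterns = []
    while lo <= hi:
        # Largest block that can start at lo: bounded by lo's alignment...
        size = lo & -lo if lo else 1 << width
        # ...and by the number of values left in the range.
        while size > hi - lo + 1:
            size >>= 1
        bits = size.bit_length() - 1  # number of wildcard bits
        prefix = format(lo >> bits, "b").zfill(width - bits) if bits < width else ""
        patterns.append(prefix + "*" * bits)
        lo += size
    return patterns


def matches(pattern, value):
    """Check whether value fits a pattern of '0', '1' and '*' characters."""
    bits = format(value, "b").zfill(len(pattern))
    return all(p in ("*", b) for p, b in zip(pattern, bits))


print(range_to_prefixes(2, 5, width=3))  # ['01*', '10*']

# Sanity check for the port range from the question.
cover = range_to_prefixes(38798, 56637)
print(len(cover))
print(all(any(matches(p, v) for p in cover) == (38798 <= v <= 56637)
          for v in range(1 << 16)))
```

Each block is disjoint from the others, so there is no redundancy, and the work is proportional to the number of patterns rather than to the size of the interval.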

What data type could I use for an ID number that has a length of 13 digits in SQL Server 2008?

Normally, the INTEGER data type would suffice, but in South Africa ID numbers have a length of 13 digits and the INTEGER data type only goes up to 10 digits. I am not fond of using character types like VARCHAR, since that would not restrict the ID number to digits only. The only solution I see (other than using VARCHAR) is to use DECIMAL. The only problems I see are that I can't restrict the maximum length as with VARCHAR, and that the input could contain ',' and '.'. Any comments?

Just use BIGINT, it ranges from -9223372036854775808 to 9223372036854775807 which should be enough for your application.

Assuming that you're referring to South African national ID numbers, which according to Wikipedia always have 13 digits, then I would go for CHAR(13) with a CHECK constraint (a CLR user-defined data type might also be an option).
The main reason is that the 'number' is not a number, it's an ID. You can't add, subtract, multiply etc. the values so there is no benefit in using a numeric data type. Furthermore, the ID is composed of components that have their own meaning, so being able to parse them out is presumably important (and easier when using character data types).
In fact, depending on how you use this data, you could also add columns that store the individual components of the ID (DOB, sequence, citizenship), either as computed columns or real columns. This could be convenient for querying and reporting (and indexing), especially if you converted the DOB to a date or datetime column.
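As a rough illustration of the "ID, not number" point, here is a Python sketch of the kind of validation and splitting that a CHECK constraint plus computed columns would perform. Everything about the layout is an assumption made for the example (6 digits of date of birth as YYMMDD, 4 digits of sequence, 1 citizenship digit, and a final Luhn check digit), so verify it against the official specification before relying on it. The sample value is made up and merely satisfies those assumptions.

```python
from datetime import datetime


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def parse_id(id_number: str) -> dict:
    """Validate the format and split the ID into its assumed components."""
    if len(id_number) != 13 or not (id_number.isascii() and id_number.isdigit()):
        raise ValueError("ID must be exactly 13 digits")
    # Raises ValueError if the first six digits are not a valid YYMMDD date.
    datetime.strptime(id_number[:6], "%y%m%d")
    return {
        "dob_yymmdd": id_number[:6],     # assumed position of the date of birth
        "sequence": id_number[6:10],     # assumed position of the sequence
        "citizenship": id_number[10],    # assumed position of the citizenship digit
        "checksum_ok": luhn_ok(id_number),
    }


print(parse_id("8001015009087"))
# A numeric type would drop leading zeros, a string keeps them:
print(int("0001015009087"), len("0001015009087"))
```

Slicing by position like this is trivial on a CHAR(13) value and awkward on a BIGINT, which is the practical argument for the character type.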

I would indeed use VARCHAR with a CHECK that matches the format. You can even be more sophisticated if there is internal validation, e.g. a check digit. Now you are all set for other countries that have an alphabetic character, or if you need to handle a leading zero.
I wouldn't use an integer unless it makes sense to do some sort of arithmetic on the field, which is almost certainly not true here.

You could use money as well, although it appears you only get 4 digits after the decimal place. The money type is 8 bytes, giving you a range from -922,337,203,685,477.5808 to 922,337,203,685,477.5807.
declare @num as money
select @num = '1,300,000.45'
select @num
Results in:
1300000.45
The parsing of commas and periods might be dependent on your specific culture settings, although I don't know that for sure.

Hash Index algo in MySQL

I was reading an article on hash indexing, and it seems similar to the md5 function of PHP, in that both take a string value and return a value representing that string, and this representation is consistent. Is this similarity really there, or am I missing something? Also, does anybody have an idea which hashing algorithm MySQL employs for its hash-based index structure?

I don't claim to give a complete description of MySQL's algorithm, but a few things can be guessed.
First of all, the Hash table wiki article is a must-read. Then we have this notice from the MySQL documentation:
They are used only for equality comparisons that use the = or <=> operators (but are very fast). They are not used for comparison operators such as < that find a range of values. Systems that rely on this type of single-value lookup are known as “key-value stores”; to use MySQL for such applications, use hash indexes wherever possible.
The optimizer cannot use a hash index to speed up ORDER BY operations. (This type of index cannot be used to search for the next entry in order.)
MySQL cannot determine approximately how many rows there are between two values (this is used by the range optimizer to decide which index to use). This may affect some queries if you change a MyISAM table to a hash-indexed MEMORY table.
Only whole keys can be used to search for a row. (With a B-tree index, any leftmost prefix of the key can be used to find rows.)
This points to the following (rather common) properties:
The MySQL hash function operates on a fixed-length "full-key" record (it is an open question, though, how varchars are treated, e.g. they might be padded with zeros up to the maximum length).
There is a max_heap_table_size global value and a MAX_ROWS parameter that the engine is likely to use when guessing an upper row count for the hash function.
MySQL allows non-unique keys, but warns about proportional slowdowns. This at least suggests that there is no second hash function, just a linked list used for collision resolution.
As for the actual function used, I don't think there is much to tell. MySQL may even use different functions according to some key heuristics (e.g. one for mostly sequential data, such as an ID, and another for CHARs), and of course its output is adjusted according to the estimated row count. However, you should only consider hash indexes when BTREE cannot give you good enough performance, or when you never use any of BTREE's advantages, which is, I suppose, a rare case.
UPDATE
A look into the sources: /storage/heap/hp_hash.c contains a few implementations of hash functions. At least the assumption that they use different techniques for different types was right, as far as TEXT and VARCHAR are concerned:
/*
 * Fowler/Noll/Vo hash
 *
 * The basis of the hash algorithm was taken from an idea sent by email to the
 * IEEE Posix P1003.2 mailing list from Phong Vo (kpv@research.att.com) and
 * Glenn Fowler (gsf@research.att.com). Landon Curt Noll (chongo@toad.com)
 * later improved on their algorithm.
 *
 * The magic is in the interesting relationship between the special prime
 * 16777619 (2^24 + 403) and 2^32 and 2^8.
 *
 * This hash produces the fewest collisions of any function that we've seen so
 * far, and works well on both numbers and strings.
 */
I'll try to give a simplified explanation.
ulong nr= 1, nr2= 4;
for (seg=keydef->seg,endseg=seg+keydef->keysegs ; seg < endseg ; seg++)
Every part of a compound key is processed separately, and the result is accumulated in nr.
if (seg->null_bit)
{
  if (rec[seg->null_pos] & seg->null_bit)
  {
    nr^= (nr << 1) | 1;
    continue;
  }
}
NULL values are treated separately.
if (seg->type == HA_KEYTYPE_TEXT)
{
  uint char_length= seg->length; /* TODO: fix to use my_charpos() */
  seg->charset->coll->hash_sort(seg->charset, pos, char_length,
                                &nr, &nr2);
}
else if (seg->type == HA_KEYTYPE_VARTEXT1) /* Any VARCHAR segments */
{
  uint pack_length= seg->bit_start;
  uint length= (pack_length == 1 ? (uint) *(uchar*) pos : uint2korr(pos));
  seg->charset->coll->hash_sort(seg->charset, pos+pack_length,
                                length, &nr, &nr2);
}
TEXT and VARCHAR are treated separately as well. hash_sort is presumably some other function that takes the collation into account. VARCHARs have a 1- or 2-byte length prefix.
else
{
  uchar *end= pos+seg->length;
  for ( ; pos < end ; pos++)
  {
    nr *=16777619;
    nr ^=(uint) *pos;
  }
}
And every other type is processed byte by byte with multiplication and xor.
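To make that last branch concrete, here is the multiply-and-xor loop transcribed into Python. It is a sketch under assumptions, not the engine's actual behaviour: the 32-bit mask emulates a 32-bit unsigned long (on many platforms ulong is wider), and bucket_for with its plain modulo is a hypothetical stand-in for whatever bucket selection the MEMORY engine really uses.

```python
from collections import defaultdict

FNV_PRIME = 16777619   # the prime named in the source comment (2^24 + 403)
MASK_32 = 0xFFFFFFFF   # assumption: emulate a 32-bit unsigned long


def hash_bytes(data: bytes, nr: int = 1) -> int:
    """Multiply-then-xor over each byte, starting from nr = 1 as in the C loop."""
    for byte in data:
        nr = (nr * FNV_PRIME) & MASK_32
        nr ^= byte
    return nr


def bucket_for(key: bytes, buckets: int) -> int:
    """Hypothetical bucket selection: plain modulo over the hash value."""
    return hash_bytes(key) % buckets


# Equal keys always land in the same bucket, so duplicates of a
# non-unique key pile up in one chain.
table = defaultdict(list)
for row_id, key in enumerate([b"alice", b"bob", b"alice", b"alice"]):
    table[bucket_for(key, 8)].append((key, row_id))

print(hash_bytes(b""))      # 1: nothing hashed, initial value unchanged
print(hash_bytes(b"\x01"))  # 16777618
print(len(table[bucket_for(b"alice", 8)]) >= 3)  # True
```

This also shows why only whole keys can be looked up: changing or dropping any byte changes the hash value, so a key prefix says nothing about the bucket.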

Which MySQL data type to use for storing boolean values

Since MySQL doesn't seem to have any 'boolean' data type, which data type do you 'abuse' for storing true/false information in MySQL?
Especially in the context of writing and reading from/to a PHP script.
Over time I have used and seen several approaches:
tinyint, varchar fields containing the values 0/1,
varchar fields containing the strings '0'/'1' or 'true'/'false'
and finally enum Fields containing the two options 'true'/'false'.
None of the above seems optimal. I tend to prefer the tinyint 0/1 variant, since automatic type conversion in PHP gives me boolean values rather simply.
So which data type do you use? Is there a type designed for boolean values which I have overlooked? Do you see any advantages/disadvantages by using one type or another?

For MySQL 5.0.3 and higher, you can use BIT. The manual says:
As of MySQL 5.0.3, the BIT data type is used to store bit-field values. A type of BIT(M) enables storage of M-bit values. M can range from 1 to 64.
Otherwise, according to the MySQL manual you can use BOOL or BOOLEAN, which are at the moment aliases of tinyint(1):
Bool, Boolean: These types are synonyms for TINYINT(1). A value of zero is considered false. Non-zero values are considered true.
MySQL also states that:
We intend to implement full boolean type handling, in accordance with standard SQL, in a future MySQL release.
References: http://dev.mysql.com/doc/refman/5.5/en/numeric-type-overview.html

BOOL and BOOLEAN are synonyms of TINYINT(1). Zero is false, anything else is true. More information here.

This is an elegant solution that I quite appreciate because it uses zero data bytes:
some_flag CHAR(0) DEFAULT NULL
To set it to true, set some_flag = '' and to set it to false, set some_flag = NULL.
Then to test for true, check if some_flag IS NOT NULL, and to test for false, check if some_flag IS NULL.
(This method is described in "High Performance MySQL: Optimization, Backups, Replication, and More" by Jon Warren Lentz, Baron Schwartz and Arjen Lentz.)

This question has been answered but I figured I'd throw in my $0.02.
I often use a CHAR(0), where '' == true and NULL == false.
From MySQL docs:
CHAR(0) is also quite nice when you need a column that can take only two values: A column that is defined as CHAR(0) NULL occupies only one bit and can take only the values NULL and '' (the empty string).

If you use the BOOLEAN type, this is aliased to TINYINT(1). This is best if you want to use standardised SQL and don't mind that the field could contain an out of range value (basically anything that isn't 0 will be 'true').
ENUM('False', 'True') will let you use the strings in your SQL, and MySQL will store the field internally as an integer where 'False'=0 and 'True'=1 based on the order the Enum is specified.
In MySQL 5+ you can use a BIT(1) field to indicate a 1-bit numeric type. I don't believe this actually uses any less space in the storage but again allows you to constrain the possible values to 1 or 0.
All of the above will use approximately the same amount of storage, so it's best to pick the one you find easiest to work with.

I use TINYINT(1) to store boolean values in MySQL.
I don't know if there is any advantage to this, but if I'm not wrong, MySQL accepts the BOOL type and stores it as a tinyint(1):
http://dev.mysql.com/doc/refman/5.0/en/other-vendor-data-types.html

Bit is only advantageous over the various byte options (tinyint, enum, char(1)) if you have a lot of boolean fields. One bit field still takes up a full byte. Two bit fields fit into that same byte, as do three, four, five, six, seven or eight, after which they start filling up the next byte. Ultimately the savings are so small that there are thousands of other optimizations you should focus on first. Unless you're dealing with an enormous amount of data, those few bytes aren't going to add up to much. If you're using bit with PHP, you need to typecast the values going in and out.

I got fed up with trying to get zeroes, NULLs and '' accurately around a loop of PHP, MySQL and POST values, so I just use 'Yes' and 'No'.
This works flawlessly and needs no special treatment that isn't obvious and easy to do.

Until MySQL implements a bit datatype, if your processing is truly pressed for space and/or time, such as with high volume transactions, create a TINYINT field called bit_flags for all your boolean variables, and mask and shift the boolean bit you desire in your SQL query.
For instance, if your left-most bit represents your bool field, and the 7 rightmost bits represent nothing, then your bit_flags field will equal 128 (binary 10000000). Mask (hide) the seven rightmost bits (using the bitwise operator &), and shift the 8th bit seven spaces to the right, ending up with 00000001. Now the entire number (which, in this case, is 1) is your value.
SELECT (t.bit_flags & 128) >> 7 AS myBool FROM myTable t;
if bit_flags = 128 ==> 1 (true)
if bit_flags = 0 ==> 0 (false)
You can run statements like these as you test
SELECT (128 & 128) >> 7;
SELECT (0 & 128) >> 7;
etc.
Since you have 8 bits, you have potentially 8 boolean variables from one byte. Some future programmer will invariably use the next seven bits, so you must mask. Don’t just shift, or you will create hell for yourself and others in the future. Make sure you have MySQL do your masking and shifting — this will be significantly faster than having the web-scripting language (PHP, ASP, etc.) do it. Also, make sure that you place a comment in the MySQL comment field for your bit_flags field.
You’ll find these sites useful when implementing this method:
MySQL — Bit Functions and Operators
Decimal/Binary Conversion Tool
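The mask-and-shift arithmetic itself can be tried out with a few lines of Python. This is only to make the bit manipulation easy to experiment with; the answer above recommends doing it in the SQL query. The flag names and the helpers get_flag and set_flag are made up for the example.

```python
FLAG_ACTIVE = 0b10000000  # 128: hypothetical flag stored in the leftmost bit
FLAG_ADMIN = 0b00000001   # 1: hypothetical flag stored in the rightmost bit


def get_flag(bit_flags: int, mask: int) -> bool:
    """Mask out every other bit; any non-zero result means the flag is set."""
    return (bit_flags & mask) != 0


def set_flag(bit_flags: int, mask: int, value: bool) -> int:
    """Return bit_flags with the masked bit set or cleared, kept within one byte."""
    if value:
        return (bit_flags | mask) & 0xFF
    return bit_flags & ~mask & 0xFF


# Same arithmetic as SELECT (t.bit_flags & 128) >> 7
print((128 & 128) >> 7)  # 1
print((0 & 128) >> 7)    # 0
print((255 & 128) >> 7)  # 1: the other seven bits do not leak into the result

flags = set_flag(0, FLAG_ACTIVE, True)
flags = set_flag(flags, FLAG_ADMIN, True)
print(flags)                                      # 129
print(get_flag(flags, FLAG_ACTIVE))               # True
print(set_flag(flags, FLAG_ACTIVE, False))        # 1
```

The third print shows why masking matters: once other bits in the byte are in use, only the masked result stays correct.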

Since MySQL (8.0.16) and MariaDB (10.2.1) both implemented the CHECK constraint, I would now use
bool_val TINYINT CHECK(bool_val IN(0,1))
You will only be able to store 0, 1 or NULL, as well as values which can be converted to 0 or 1 without errors like '1', 0x00, b'1' or TRUE/FALSE.
If you don't want to permit NULLs, add the NOT NULL option
bool_val TINYINT NOT NULL CHECK(bool_val IN(0,1))
Note that there is virtually no difference if you use TINYINT, TINYINT(1) or TINYINT(123).
If you want your schema to be upwards compatible, you can also use BOOL or BOOLEAN
bool_val BOOL CHECK(bool_val IN(TRUE,FALSE))

Referring to this link
Boolean datatype in MySQL: depending on the application, if only 0 or 1 should ever be stored, bit(1) is the better choice.

After reading the answers here I decided to use bit(1) and yes, it is somehow better in space/time, BUT after a while I changed my mind and I will never use it again. It complicated my development a lot, when using prepared statements, libraries etc (php).
Since then, I always use tinyint(1), seems good enough.

You can use BOOL, BOOLEAN data type for storing boolean values.
These types are synonyms for TINYINT(1)
The BIT(1) data type arguably makes more sense for storing a boolean value (either true [1] or false [0]), but TINYINT(1) is easier to work with when you're outputting and querying the data, and it gives better interoperability between MySQL and other databases. You can also check this answer or thread.
MySQL also converts the BOOL and BOOLEAN data types to TINYINT(1).
For more, read the documentation.
