How can I migrate from "float" to "points" in MySQL? - mysql

I'm looking for a faster way to calculate Euclidean distances in SQL.
Problem I want to solve
The following "Euclidean distance calculation" is slow.
SELECT
id,
sqrt(
power(f1 - (-0.09077361), 2) +
power(f2 - (0.10373443), 2) +
...
...
power(f127 - (0.0778369), 2) +
power(f128 - (0.00951046), 2)
) as distance
FROM
face_feature
ORDER BY
distance
LIMIT
1
;
What I want to know
Can you share how to migrate from "float" to "points"?
I received the following advice, but I don't understand how.
Switch to POINTs and a SPATIAL index. It may be possible to make your task orders of magnitude faster.
MySQL
mysql> SHOW VARIABLES LIKE '%version%';
+--------------------------+------------------------------+
| Variable_name | Value |
+--------------------------+------------------------------+
| version | 8.0.29 |
| version_comment | MySQL Community Server - GPL |
| version_compile_machine | x86_64 |
| version_compile_os | Linux |
+--------------------------+------------------------------+
Table
mysql> desc face_feature;
+-------+------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+----------------+
| id | int | NO | PRI | NULL | auto_increment |
| f1 | float(9,8) | NO | | NULL | |
| f2 | float(9,8) | NO | | NULL | |
..
| f127 | float(9,8) | NO | | NULL | |
| f128 | float(9,8) | NO | | NULL | |
+-------+------------+------+-----+---------+----------------+
Data
mysql> SELECT count(*) FROM face_feature;
+----------+
| count(*) |
+----------+
| 100003 |
+----------+
mysql> SELECT * FROM face_feature LIMIT 1\G;
id: 1
f1: -0.07603023
f2: 0.13605964
...
f127: 0.09608927
f128: 0.00082345
Reference (My other question)
How can I make "euclidean distance calculation" faster in MySQL?

Don't use FLOAT(M,N); it adds an extra rounding step that only hurts various operations.
FLOAT(9,8) will lose some precision if the numbers are near "1.0", because there are only 24 bits of precision in any FLOAT.
(m,n) on FLOAT and DOUBLE has been deprecated (as useless and misleading) in newer versions of MySQL.
There are helper functions to convert numeric strings to POINT values. Internally, a POINT contains two DOUBLEs. Hence the original DECIMAL(9,8) loses only a round-from-decimal-to-binary at the 53rd significant bit.
But the real question is about using SPATIAL indexing when the universe has 128 dimensions. I don't think it will work. (I have not even heard of using SPATIAL for 3 dimensions, though it should be practical.)
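The precision point in this answer can be checked outside MySQL. The following is a minimal Python sketch, assuming that a MySQL FLOAT behaves like an IEEE 754 single-precision value; the helper name to_float32 is made up for illustration. It rounds a Python float (a double) to single precision and back.

```python
import struct

def to_float32(value: float) -> float:
    """Round a double to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]

near_one = to_float32(0.99999999)  # 8 decimals, close to 1.0
small = to_float32(0.10373443)     # a feature value taken from the question

print(near_one)                    # the 8th decimal does not survive near 1.0
print(abs(small - 0.10373443))     # rounding error for the smaller value
```

With only 24 bits of precision, the value next to 1.0 cannot keep its eighth decimal place, while the smaller value is off by a much smaller amount.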

Related

How can I make "euclidean distance calculation" faster in MySQL?

I am creating a face recognition system, but the search is very slow. Can you share how to speed up the search?
It takes about 6 seconds for 100,000 data items.
MySQL
mysql> SHOW VARIABLES LIKE '%version%';
+--------------------------+------------------------------+
| Variable_name | Value |
+--------------------------+------------------------------+
| version | 8.0.29 |
| version_comment | MySQL Community Server - GPL |
| version_compile_machine | x86_64 |
| version_compile_os | Linux |
+--------------------------+------------------------------+
Table
CREATE TABLE `face_feature` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`f1` decimal(9,8) NOT NULL,
`f2` decimal(9,8) NOT NULL,
...
...
`f127` decimal(9,8) NOT NULL,
`f128` decimal(9,8) NOT NULL,
PRIMARY KEY (id)
);
Data
mysql> SELECT count(*) FROM face_feature;
+----------+
| count(*) |
+----------+
| 110004 |
+----------+
mysql> SELECT * FROM face_feature LIMIT 1\G;
id: 1
f1: -0.07603023
f2: 0.13605964
...
f127: 0.09608927
f128: 0.00082345
SQL
SELECT
id,
sqrt(
power(f1 - (-0.09077361), 2) +
power(f2 - (0.10373443), 2) +
...
...
power(f127 - (0.0778369), 2) +
power(f128 - (0.00951046), 2)
) as distance
FROM
face_feature
ORDER BY
distance
LIMIT
1
;
Result
+----+--------------------+
| id | distance |
+----+--------------------+
| 1 | 0.3376853491771237 |
+----+--------------------+
1 row in set (6.18 sec)
Update 1:
Changed from decimal(9,8) to float(9,8)
The time then improved from approximately 4 sec to 3.26 sec.
mysql> desc face_feature;
+-------+------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+----------------+
| id | int | NO | PRI | NULL | auto_increment |
| f1 | float(9,8) | NO | | NULL | |
| f2 | float(9,8) | NO | | NULL | |
..
| f127 | float(9,8) | NO | | NULL | |
| f128 | float(9,8) | NO | | NULL | |
+-------+------------+------+-----+---------+----------------+
Update 2:
Changed from POWER(z, 2) to z*z
The time then went from 3.26 sec to 4.65 sec (slower).
SELECT
id,
sqrt(
((f1 - (-0.09077361)) * (f1 - (-0.09077361))) +
((f2 - (0.10373443)) * (f2 - (0.10373443))) +
((f3 - (0.00798536)) * (f3 - (0.00798536))) +
...
...
((f126 - (0.07803915)) * (f126 - (0.07803915))) +
((f127 - (0.0778369)) * (f127 - (0.0778369))) +
((f128 - (0.00951046)) * (f128 - (0.00951046)))
) as distance
FROM
face_feature
ORDER BY
distance
LIMIT
1
;
Update 3
I am looking into the usage of MySQL GIS.
How can I migrate from "float" to "points" in MySQL?
Update 4
I'm also looking at PostgreSQL because I can't find a way to handle 128 dimensions in MySQL.
DECIMAL(9,8) -- that's a lot of significant digits. Do you need that much precision?
FLOAT -- about 7 significant digits; faster arithmetic.
POWER(z, 2) -- probably a lot slower than z*z. (This may be the slowest part.)
SQRT -- In many situations, you can simply work with the squares. In this case:
SELECT SQRT(closest)
FROM ( SELECT -- leave out SQRT
... ORDER BY .. LIMIT 1 )
Here are some other thoughts. They are not necessarily relevant to the query being discussed:
Precise testing -- Beware of comparing for 'equal'. Roundoff error is likely to make things unequal unexpectedly. Imprecise measurements add to the issue. If I measure something twice, I might get 1.23456789 one time and 1.23456788 the next time. (Especially at that level of "precision".)
Trade complexity vs speed -- Use ABS(a - b) as the distance formula; find the 10 items closest in that way, then use the Euclidean distance to get the 'right' distance.
Break the face into regions. Find which region the item is in, then check only the subset of the 128 points that are in that region. (Being near a boundary -- put some points in two regions.)
Think out of the box -- I'm not familiar with your facial recognition, so I have run out of mathematical tricks.
Switch to POINTs and a SPATIAL index. It may be possible to make your task orders of magnitude faster. (This is probably not practical for 128-dimensional space.)
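Two of the suggestions in this answer (rank by the squared distance and take a single SQRT at the end; pre-filter with ABS(a - b) and then re-rank exactly) can be sketched outside SQL. This is a minimal Python illustration, not the query itself: the function names and the random data standing in for face_feature are assumptions made for the example.

```python
import math
import random

def squared_distance(a, b):
    """Sum of squared differences; no square root is taken."""
    return sum((x - y) * (x - y) for x, y in zip(a, b))

def manhattan_distance(a, b):
    """Sum of absolute differences, the cheaper ABS(a - b) measure."""
    return sum(abs(x - y) for x, y in zip(a, b))

def nearest(rows, query):
    """Rank by squared distance, then take one square root for the winner."""
    best_id = min(rows, key=lambda i: squared_distance(rows[i], query))
    return best_id, math.sqrt(squared_distance(rows[best_id], query))

def nearest_with_prefilter(rows, query, k=10):
    """Keep the k closest rows by Manhattan distance, then rank those exactly."""
    candidates = sorted(rows, key=lambda i: manhattan_distance(rows[i], query))[:k]
    return nearest({i: rows[i] for i in candidates}, query)

random.seed(0)  # made-up data standing in for the face_feature table
rows = {i: [random.uniform(-0.2, 0.2) for _ in range(128)] for i in range(1, 201)}
query = [random.uniform(-0.2, 0.2) for _ in range(128)]

print(nearest(rows, query))
print(nearest_with_prefilter(rows, query, k=10))
```

Because the square root is monotonic, ordering by the squared distance picks the same row as ordering by the distance. The pre-filter is only a heuristic: the true nearest row is not guaranteed to be among the k candidates.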

How should I construct a database to store a lot of SHA1 data

I'm having trouble constructing a database to store a lot of SHA1 data and efficiently return results.
I will admit SQL is not my strongest skill, but as an exercise I am trying to use the data from https://haveibeenpwned.com/Passwords, which returns results pretty quickly.
This is my data:
mysql> describe pwnd;
+----------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------+------------------+------+-----+---------+----------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| pwndpass | binary(20) | NO | | NULL | |
+----------+------------------+------+-----+---------+----------------+
mysql> select id, hex(pwndpass) from pwnd order by id desc limit 10;
+-----------+------------------------------------------+
| id | hex(pwndpass) |
+-----------+------------------------------------------+
| 306259512 | FFFFFFFEE791CBAC0F6305CAF0CEE06BBE131160 |
| 306259511 | FFFFFFF8A0382AA9C8D9536EFBA77F261815334D |
| 306259510 | FFFFFFF1A63ACC70BEA924C5DBABEE4B9B18C82D |
| 306259509 | FFFFFFE3C3C05FCB0B211FD0C23404F75E397E8F |
| 306259508 | FFFFFFD691D669D3364161E05538A6E81E80B7A3 |
| 306259507 | FFFFFFCC6BD39537AB7398B59CEC917C66A496EB |
| 306259506 | FFFFFFBFAD0B653BDAC698485C6D105F3C3682B2 |
| 306259505 | FFFFFFBBFC923A29A3B4931B63684CAAE48EAC4F |
| 306259504 | FFFFFFB58E389A0FB9A27D153798956187B1B786 |
| 306259503 | FFFFFFB54953F45EA030FF13619B930C96A9C0E3 |
+-----------+------------------------------------------+
10 rows in set (0.01 sec)
My question relates to quickly finding entries, as it currently takes over 6 minutes:
mysql> select hex(pwndpass) from pwnd where hex(pwndpass) = '0000000A1D4B746FAA3FD526FF6D5BC8052FDB38';
+------------------------------------------+
| hex(pwndpass) |
+------------------------------------------+
| 0000000A1D4B746FAA3FD526FF6D5BC8052FDB38 |
+------------------------------------------+
1 row in set (6 min 31.82 sec)
Do I have the correct data types? I searched for how to store SHA1 data, and a BINARY(20) field is advised, but I'm not sure how to optimise it for searching the data.
My MySQL install is a clean TurnKey VM (https://www.turnkeylinux.org/mysql). I have not adjusted any settings other than giving the VM more disk space.
The two most obvious tips are:
Create an index on the column.
Don't convert every single row to hexadecimal on every search:
select hex(pwndpass)
from pwnd
where hex(pwndpass) = '0000000A1D4B746FAA3FD526FF6D5BC8052FDB38';
-- ^^^ This is forcing MySQL to convert every hash stored from binary to hexadecimal
-- so it can determine whether there's a match
In fact, you don't even need hexadecimal at all, save for display purposes:
select id, hex(pwndpass) -- This is fine, will just convert matching rows
from pwnd
where pwndpass = ?
... where ? is a placeholder that, in your client language, corresponds to a binary string.
If you need to run the query right in the command line, you can also use a hexadecimal literal:
select id, hex(pwndpass) -- This is fine, will just convert matching rows
from pwnd
where pwndpass = 0x0000000A1D4B746FAA3FD526FF6D5BC8052FDB38
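On the client side, the binary string bound to the ? placeholder is just the 20 raw bytes of the digest. A minimal Python sketch, assuming the client is Python; the helper names are made up for illustration and no database driver is involved:

```python
import hashlib

def sha1_bytes(password: str) -> bytes:
    """Raw 20-byte SHA-1 digest of a password, as a BINARY(20) column stores it."""
    return hashlib.sha1(password.encode("utf-8")).digest()

def hex_to_bytes(hex_digest: str) -> bytes:
    """Convert a 40-character hexadecimal digest into its 20 raw bytes."""
    raw = bytes.fromhex(hex_digest)
    if len(raw) != 20:
        raise ValueError("a SHA-1 digest is exactly 20 bytes")
    return raw

# The value that would be passed as the parameter for "where pwndpass = ?"
param = hex_to_bytes("0000000A1D4B746FAA3FD526FF6D5BC8052FDB38")

print(len(param), param.hex().upper())
print(len(sha1_bytes("password")))
```

Either helper yields a value that can be compared directly against the stored column, so no per-row HEX() conversion is needed in the WHERE clause.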

Finding closest value. How to tell MySQL that the data is already ordered?

Let's say I have a table like the following:
+------------+------------+------+-----+---------+
| Field | Type | Null | Key | Default |
+------------+------------+------+-----+---------+
| datetime | double | NO | PRI | NULL |
| some_value | float | NO | | NULL |
+------------+------------+------+-----+---------+
The date has to be stored as a double: it is recorded as Unix time with fractional seconds (there is no possibility of installing MySQL 5.6 to use fractional DATETIME). In addition, the values of the datetime field are not only the primary key, they are also always increasing. I would like to find the row closest to a certain value. Usually you can use something like:
select * from table order by abs(datetime - $myvalue) limit 1
However, I'm afraid that this implementation will be slow for hundreds of thousands of values, because it is going to search the whole table. Since the data is ordered, I know a binary search could speed up the process, but I have no idea how to tell MySQL to perform that kind of search.
To test the performance, I run the following:
SET profiling = 1;
SELECT * FROM table order by abs(datetime - $myvalue) limit 1;
SHOW PROFILE FOR QUERY 1;
With the following results:
+--------------------------------+----------+
| Status | Duration |
+--------------------------------+----------+
| starting | 0.000122 |
| Waiting for query cache lock | 0.000051 |
| checking query cache for query | 0.000191 |
| checking permissions | 0.000038 |
| Opening tables | 0.000094 |
| System lock | 0.000047 |
| Waiting for query cache lock | 0.000085 |
| init | 0.000103 |
| optimizing | 0.000031 |
| statistics | 0.000057 |
| preparing | 0.000049 |
| executing | 0.000023 |
| Sorting result | 2.806665 |
| Sending data | 0.000359 |
| end | 0.000049 |
| query end | 0.000033 |
| closing tables | 0.000050 |
| freeing items | 0.000089 |
| logging slow query | 0.000067 |
| cleaning up | 0.000032 |
+--------------------------------+----------+
As I understand it, sorting the result takes 2.8 seconds, even though my data is already sorted. As additional information, I have around 240,000 rows.
It won't scan the entire database. A primary key is indexed by a B-tree. Forcing it into a binary search would be slower, if you could do it, which you can't.
Try making it a field:
select abs(datetime - $myvalue) as date_diff, table.*
from table
order by date_diff
limit 1
RDBMSs support indexes. Define an index on the datetime column, or on whichever field you are interested in, and the database will not do a complete table scan.
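The binary-search idea from the question can be illustrated outside MySQL. This is a minimal Python sketch over an in-memory sorted list, not something MySQL can be told to run; the function name and sample keys are assumptions. It looks only at the two neighbours of the insertion point, which is also the idea behind asking an ordered index for the last key at or before the target and the first key at or after it.

```python
from bisect import bisect_left

def closest(sorted_keys, target):
    """Return the key nearest to target, assuming sorted_keys is ascending."""
    if not sorted_keys:
        raise ValueError("sorted_keys must not be empty")
    pos = bisect_left(sorted_keys, target)
    # Only the key just before and the key at the insertion point can be nearest.
    candidates = sorted_keys[max(pos - 1, 0):pos + 1]
    return min(candidates, key=lambda key: abs(key - target))

# Made-up Unix timestamps with fractional seconds, always increasing
keys = [1000.0, 1000.5, 1002.25, 1010.0]

print(closest(keys, 1001.0))
print(closest(keys, 5.0))
print(closest(keys, 2000.0))
```

The lookup inspects at most two keys after a logarithmic search, instead of computing abs(datetime - value) for every row and sorting the result.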

How do I compare average runtime of two functions in MySQL?

I wanted to compare average runtime of two functions in MySQL -
Square distance: pow(x1 - x2, 2) + pow(y1 - y2, 2) + pow(z1 - z2, 2)
vs
Dot product: x1 * x2 + y1 * y2 + z1 * z2
Now, whichever function I choose is going to run around 50,000,000,000 times in a single query! So, even the tiniest of difference in their runtime matters.
So, I tried profiling. Here's what I got,
mysql> show profiles;
+----------+------------+-----------------------------------------------------------------------+
| Query_ID | Duration | Query |
+----------+------------+-----------------------------------------------------------------------+
| 4 | 0.00014400 | select pow(rand()-rand(),2)+pow(rand()-rand(),2)+pow(rand()-rand(),2) |
| 5 | 0.00012800 | select pow(rand()-rand(),2)+pow(rand()-rand(),2)+pow(rand()-rand(),2) |
| 6 | 0.00017000 | select pow(rand()-rand(),2)+pow(rand()-rand(),2)+pow(rand()-rand(),2) |
| 7 | 0.00024800 | select pow(rand()-rand(),2)+pow(rand()-rand(),2)+pow(rand()-rand(),2) |
| 8 | 0.00014400 | select pow(rand()-rand(),2)+pow(rand()-rand(),2)+pow(rand()-rand(),2) |
| 9 | 0.00014000 | select pow(rand()-rand(),2)+pow(rand()-rand(),2)+pow(rand()-rand(),2) |
| 10 | 0.00014900 | select pow(rand()-rand(),2)+pow(rand()-rand(),2)+pow(rand()-rand(),2) |
| 11 | 0.00015000 | select rand()*rand()+rand()*rand()+rand()*rand() |
| 12 | 0.00012000 | select rand()*rand()+rand()*rand()+rand()*rand() |
| 13 | 0.00015200 | select rand()*rand()+rand()*rand()+rand()*rand() |
| 14 | 0.00022500 | select rand()*rand()+rand()*rand()+rand()*rand() |
| 15 | 0.00012700 | select rand()*rand()+rand()*rand()+rand()*rand() |
| 16 | 0.00013200 | select rand()*rand()+rand()*rand()+rand()*rand() |
| 17 | 0.00013400 | select rand()*rand()+rand()*rand()+rand()*rand() |
| 18 | 0.00013800 | select rand()*rand()+rand()*rand()+rand()*rand() |
+----------+------------+-----------------------------------------------------------------------+
15 rows in set, 1 warning (0.00 sec)
This is not very helpful at all; the runtimes fluctuate so much that I have no clue which one is faster and by how much.
I need to run each of these functions something like 10,000 times to get a nice and consistent average runtime. How do I accomplish this in MySQL?
(Note that rand() is called 6 times in both functions, so its runtime doesn't really make a difference.)
Edit:
Sure, I can create a temp table (slightly inconvenient), fill it with random values (which again is not straightforward; see How do I populate a mysql table with many random numbers) and then proceed to compare my functions.
I wanted to know if a better way exists in MySQL.
In the best case, the function pow detects that the exponent is the integer 2 and performs the exponentiation with a single multiply. There is no reason it could beat a pure multiply.
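The repeat-many-times measurement the question asks for can be sketched outside MySQL. The following Python example uses the standard timeit module and measures Python's own pow and multiply, so its numbers say nothing about MySQL's implementation; the function names and the choice of 10,000 calls repeated 5 times are assumptions. MySQL also has a built-in BENCHMARK(count, expr) function that evaluates an expression repeatedly, which this sketch does not use.

```python
import random
import timeit

def squared_distance(x1, y1, z1, x2, y2, z2):
    return pow(x1 - x2, 2) + pow(y1 - y2, 2) + pow(z1 - z2, 2)

def dot_product(x1, y1, z1, x2, y2, z2):
    return x1 * x2 + y1 * y2 + z1 * z2

def seconds_per_call(func, args, number=10_000, repeat=5):
    """Time `number` calls, `repeat` times over, and report the best per-call time."""
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeat, number=number)) / number

random.seed(0)
args = tuple(random.random() for _ in range(6))  # fixed inputs for both functions

square_time = seconds_per_call(squared_distance, args)
dot_time = seconds_per_call(dot_product, args)
print(f"squared distance: {square_time:.3e} s per call")
print(f"dot product:      {dot_time:.3e} s per call")
```

Taking the minimum over several repeats, rather than a single run, damps the fluctuation that made the individual profile rows hard to compare.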

Floating Point Types comparisons

I have inserted different values of pi (see below):
3.14
3.1415
3.14159
3.14159265359
I do not see any difference in how the different floating point types handle the same values.
Code:
mysql> select * from test_types;
+---------+---------+---------+----------+
| flo | dub | deci | noomeric |
+---------+---------+---------+----------+
| 3.14000 | 3.14000 | 3.14000 | 3.14000 |
| 3.14150 | 3.14150 | 3.14150 | 3.14150 |
| 3.14159 | 3.14159 | 3.14159 | 3.14159 |
| 3.14150 | 3.14150 | 3.14150 | 3.14150 |
| 3.14159 | 3.14159 | 3.14159 | 3.14159 |
+---------+---------+---------+----------+
5 rows in set (0.00 sec)
mysql> describe test_types;
+----------+---------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------+---------------+------+-----+---------+-------+
| flo | float(10,5) | YES | | NULL | |
| noomeric | decimal(10,5) | YES | | NULL | |
| deci | decimal(10,5) | YES | | NULL | |
| dub | double(10,5) | YES | | NULL | |
+----------+---------------+------+-----+---------+-------+
4 rows in set (0.00 sec)
I can see here that when creating the table the field with numeric type used DECIMAL (see describe command table).
Does anybody know an example showing differences between FLOAT, DECIMAL and DOUBLE please?
FLOAT and DOUBLE are meant for very small values or very large values.
Essentially they are the same thing, except that they differ in storage size: FLOAT takes 4 bytes and DOUBLE takes 8 bytes (see Data Type Storage Requirements).
The main thing about them is that they are approximate (quoted from the Oracle website):
Because floating-point values are approximate and not stored as exact
values, attempts to treat them as exact in comparisons may lead to
problems. They are also subject to platform or implementation
dependencies.
DECIMAL allows for an exact representation. The reason your DECIMAL column did not work very well for pi is that you allowed only 5 decimal places but then fed it 11 decimal places.
The best way to store the value of pi accurate to 11 decimal places is something like DECIMAL(12,11).
For an actual example of a value being treated differently when stored as a DECIMAL as opposed to a FLOAT, see below:
CREATE TABLE decimal_vs_float_test
( `dec` DECIMAL(12,11) -- DEC is a reserved word in MySQL, so the column name is quoted
, fl FLOAT
);
INSERT INTO decimal_vs_float_test VALUES
( 3.947947949 , 3.947947949 )
,( 3.777777777 , 3.777777777 )
,( 3.555555555 , 3.555555555 )
,( 3.333333333 , 3.333333333 )
,( 3.111111111 , 3.111111111 )
;
SELECT * FROM decimal_vs_float_test WHERE fl = `dec`;
Now you can see the values being treated differently as a DECIMAL and as a FLOAT.
Hope that helps.
Additionally, FLOAT and DOUBLE are binary floating-point types, whereas DECIMAL stores decimal digits (in MySQL it is an exact fixed-point type).
See this answer for more exact details on what that means, the difference between how the types are encoded and when it is best to use which type (it's meant for C#, but it's still interesting).
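The difference can also be seen without a database. The following Python sketch is an analogy only, assuming that a FLOAT column behaves like an IEEE 754 single-precision value and that a DECIMAL column keeps the digits exactly; it does not reproduce MySQL's own comparison rules, and the helper name to_float32 is made up. It uses the same five literals as the INSERT statement in this answer.

```python
import struct
from decimal import Decimal

def to_float32(value: float) -> float:
    """Round a double to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]

literals = ["3.947947949", "3.777777777", "3.555555555",
            "3.333333333", "3.111111111"]

matches = []
for text in literals:
    exact = Decimal(text)                      # the digits a DECIMAL would keep
    approx = Decimal(to_float32(float(text)))  # exact value of the nearest float32
    if exact == approx:
        matches.append(text)
    print(text, "->", approx)

print("values that compare equal:", matches)
```

None of these literals can be represented exactly in single precision, so the stored approximation never equals the exact decimal value.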