Hamming Distance optimization for MySQL or PostgreSQL? - mysql

I am trying to speed up the search for similar images whose pHash values are stored in a MySQL database.
Right now I compare pHashes by computing the Hamming distance like this:
SELECT * FROM images WHERE BIT_COUNT(hash ^ 2028359052535108275) <= 4
Results of the query (MyISAM engine):
20000 rows ; query time < 20ms
100000 rows ; query time ~ 60ms # this was just fine until the table reached 150000 rows
300000 rows ; query time ~ 150ms
So the query time grows with the number of rows in the table.
I also tried a solution found on Stack Overflow,
"Hamming distance on binary strings in SQL":
SELECT * FROM images WHERE
BIT_COUNT(h1 ^ b'11110011') +
BIT_COUNT(h2 ^ b'10110100') +
BIT_COUNT(h3 ^ b'11001001') +
BIT_COUNT(h4 ^ b'11010001') +
BIT_COUNT(h5 ^ b'00100011') +
BIT_COUNT(h6 ^ b'00010100') +
BIT_COUNT(h7 ^ b'00011111') +
BIT_COUNT(h8 ^ b'00001111') <= 4
300000 rows ; query time ~ 240ms
I then switched the database engine to PostgreSQL (see "Translate this MySQL query to PyGreSQL"),
without success:
300000 rows ; query time ~ 18s
Is there any way to optimize the queries above?
I mean an optimization whose cost does not depend on the number of rows.
I have limited tools to solve this problem.
MySQL has so far seemed to be the simplest solution, but I can deploy code on any open source database engine that works with Ruby on a dedicated machine.
There is a ready-made solution for MS SQL at https://stackoverflow.com/a/5930944/766217 (not tested). Maybe someone knows how to translate it to MySQL or PostgreSQL.
Please post answers based on code or observations. There are already plenty of theoretical discussions of Hamming distance on stackoverflow.com.
Thanks!

When considering the efficiency of algorithms, computer scientists use the concept of the order denoted O(something) where something is a function of n, the number of things being computed, in this case rows. So we get, in increasing time:
O(1) - independent of the number of items
O(log(n)) - increases as the logarithm of the number of items
O(n) - increases in proportion to the number of items (what you have)
O(n^2) - increases as the square of the number of items
O(n^3) - etc
O(2^n) - increases exponentially
O(n!) - increases with the factorial of the number of items
The last two are effectively intractable once n is even moderately large (80+).
Only the most significant term matters, since it dominates for large n, so n^2 and 65*n^2+787*n+4656566 are both O(n^2).
Bear in mind that this is a mathematical construction; the time an algorithm takes with real software on real hardware using real data may be heavily influenced by other things (e.g. an O(n^2) memory operation may take less time than an O(n) disk operation).
For your problem, you need to run through each row and compute BIT_COUNT(hash ^ 2028359052535108275) <= 4. This is an O(n) operation.
The only way this could be improved is by utilizing an index since a b-tree index retrieval is an O(log(n)) operation.
However, because your column is wrapped in a function call, an index on that column cannot be used. You have two possibilities:
1. Create a persisted computed column in your table with the formula BIT_COUNT(hash ^ 2028359052535108275) and put an index on it. This is a SQL Server solution and I don't know whether it is portable to MySQL. It will not be suitable if you need to change the bit mask.
2. Work out a way of doing the bitwise arithmetic without using the BIT_COUNT function.
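To make the O(n) cost concrete, here is a minimal Python sketch of what the query does for every row: XOR the stored hash with the target and count the set bits. The function names and the sample list of stored hashes are made up for illustration; only the target value and the threshold of 4 come from the question.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits, the same quantity as BIT_COUNT(a ^ b)."""
    return bin(a ^ b).count("1")


def find_similar(hashes, target, max_distance=4):
    """Linear scan: every stored hash is examined once, hence O(n)."""
    return [h for h in hashes if hamming_distance(h, target) <= max_distance]


target = 2028359052535108275
# Hypothetical stored hashes: the target itself, one with 2 bits flipped,
# and one with 8 bits flipped.
stored = [target, target ^ 0b11, target ^ 0xFF]
matches = find_similar(stored, target)
```

Without an index that understands bit distance, the work grows with the length of the stored list, which matches the timings in the question.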

This solution made things a bit faster for me.
It builds a derived table for each hash comparison and passes on only the rows whose running total is still within the Hamming distance limit. This way, it does not run BIT_COUNT on the remaining parts of a pHash that has already exceeded the limit. It returns all matches in about 2.25 seconds on 2.6 million records.
The table is InnoDB, and I have very few indexes.
If somebody can make it faster, I would appreciate it.
SELECT *, BIT_COUNT(pHash3 ^ 42597524) + BC2 AS BC3
FROM (
    SELECT *, BIT_COUNT(pHash2 ^ 258741369) + BC1 AS BC2
    FROM (
        SELECT *, BIT_COUNT(pHash1 ^ 5678910) + BC0 AS BC1
        FROM (
            SELECT `Key`, pHash0, pHash1, pHash2, pHash3, BIT_COUNT(pHash0 ^ 1234567) AS BC0
            FROM files
            WHERE BIT_COUNT(pHash0 ^ 1234567) <= 3
        ) AS BCQ0
        WHERE BIT_COUNT(pHash1 ^ 5678910) + BC0 <= 3
    ) AS BCQ1
    WHERE BIT_COUNT(pHash2 ^ 258741369) + BC1 <= 3
) AS BCQ2
WHERE BIT_COUNT(pHash3 ^ 42597524) + BC2 <= 3
This is the equivalent query without the derived tables. It takes almost 3 times as long to return.
SELECT `Key`, pHash0, pHash1, pHash2, pHash3
FROM files
WHERE BIT_COUNT(pHash0 ^ 1234567) + BIT_COUNT(pHash1 ^ 5678910) + BIT_COUNT(pHash2 ^ 258741369) + BIT_COUNT(pHash3 ^ 42597524) <= 3
Keep in mind that the lower the Hamming distance limit on the first (innermost) comparison, the faster the query will run.
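The idea behind the nested version, stopping as soon as the running total exceeds the limit, can be sketched in Python as follows. This only illustrates the early-exit logic and is not what MySQL executes internally; the function name and the sample rows are hypothetical, while the four query constants are the ones used above.

```python
def chunked_hamming_within(stored_chunks, query_chunks, limit):
    """Return the total distance, or None once the running total exceeds limit."""
    total = 0
    for stored, query in zip(stored_chunks, query_chunks):
        total += bin(stored ^ query).count("1")
        if total > limit:
            return None  # early exit: the remaining chunks are never examined
    return total


query = (1234567, 5678910, 258741369, 42597524)
# Hypothetical rows: one differs from the query by 2 bits in total,
# the other already differs by 4 bits in its first chunk.
near = (1234567 ^ 0b1, 5678910, 258741369 ^ 0b10, 42597524)
far = (1234567 ^ 0b1111, 5678910, 258741369, 42597524)

near_result = chunked_hamming_within(near, query, 3)
far_result = chunked_hamming_within(far, query, 3)
```

A row like `far` is rejected after the first chunk, which is why a tight limit on the first comparison helps the most.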

Here are the results of my tests. The pHash is calculated with the imagehash library in Python and stored as two BIGINTs in the database.
This test was run on 858,433 images in a MariaDB database that does not use sharding. I found that sharding actually slowed the process down; however, that was with the function method, so it may be different without it or on a larger database.
The table these queries run on is an in-memory table. A persistent local table is kept, and when the database starts up the id, phash1, and phash2 columns are copied to the in-memory table. The id is returned so it can be matched against the InnoDB table once something is found.
Total Images: 858433
Image 1: ece0455d6b8e9470
Function HAMMINGDISTANCE_16:
RETURN BIT_COUNT(A0 ^ B0) + BIT_COUNT(A1 ^ B1)
Method: HAMMINGDISTANCE_16 Function
Query:
SELECT `id` FROM `phashs` WHERE HAMMINGDISTANCE_16(filephash_1, filephash_2, CONV(SUBSTRING('ece0455d6b8e9470', 1, 8), 16, 10), CONV(SUBSTRING('ece0455d6b8e9470', 9, 8), 16, 10)) <= 3;
Time: 2.1760 seconds
Method: BIT_COUNT
Query:
SELECT `id` FROM `phashs` WHERE BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('ece0455d6b8e9470', 1, 8), 16, 10)) + BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('ece0455d6b8e9470', 9, 8), 16, 10)) <= 3;
Time: 0.1547 seconds
Method: Multi-Select BIT_COUNT inner is filephash_1
Query:
SELECT `id` FROM ( SELECT `id`, `filephash_2`, BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('ece0455d6b8e9470', 1, 8), 16, 10)) as BC0 FROM `phashs` WHERE BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('ece0455d6b8e9470', 1, 8), 16, 10)) <= 3 ) BCQ0 WHERE BC0 + BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('ece0455d6b8e9470', 9, 8), 16, 10)) <= 3;
Time: 0.1878 seconds
Method: Multi-Select BIT_COUNT inner is filephash_2
Query:
SELECT `id` FROM (SELECT `id`, `filephash_1`, BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('ece0455d6b8e9470', 9, 8), 16, 10)) as BC1 FROM `phashs` WHERE BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('ece0455d6b8e9470', 9, 8), 16, 10)) <= 3) BCQ1 WHERE BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('ece0455d6b8e9470', 1, 8), 16, 10)) + BC1 <= 3;
Time: 0.1860 seconds
Image 2: 813ed36913ec8639
Function HAMMINGDISTANCE_16:
RETURN BIT_COUNT(A0 ^ B0) + BIT_COUNT(A1 ^ B1)
Method: HAMMINGDISTANCE_16 Function
Query:
SELECT `id` FROM `phashs` WHERE HAMMINGDISTANCE_16(filephash_1, filephash_2, CONV(SUBSTRING('813ed36913ec8639', 1, 8), 16, 10), CONV(SUBSTRING('813ed36913ec8639', 9, 8), 16, 10)) <= 3;
Time: 2.1440 seconds
Method: BIT_COUNT
Query:
SELECT `id` FROM `phashs` WHERE BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('813ed36913ec8639', 1, 8), 16, 10)) + BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('813ed36913ec8639', 9, 8), 16, 10)) <= 3;
Time: 0.1588 seconds
Method: Multi-Select BIT_COUNT inner is filephash_1
Query:
SELECT `id` FROM ( SELECT `id`, `filephash_2`, BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('813ed36913ec8639', 1, 8), 16, 10)) as BC0 FROM `phashs` WHERE BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('813ed36913ec8639', 1, 8), 16, 10)) <= 3 ) BCQ0 WHERE BC0 + BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('813ed36913ec8639', 9, 8), 16, 10)) <= 3;
Time: 0.1671 seconds
Method: Multi-Select BIT_COUNT inner is filephash_2
Query:
SELECT `id` FROM (SELECT `id`, `filephash_1`, BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('813ed36913ec8639', 9, 8), 16, 10)) as BC1 FROM `phashs` WHERE BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('813ed36913ec8639', 9, 8), 16, 10)) <= 3) BCQ1 WHERE BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('813ed36913ec8639', 1, 8), 16, 10)) + BC1 <= 3;
Time: 0.1686 seconds
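For reference, the CONV(SUBSTRING(...), 16, 10) expressions in these queries split the 16-digit hex hash into two 32-bit halves. A minimal Python sketch of the same split, in case you prefer to compute the halves in application code and pass them to the query as plain integers (the function names here are hypothetical):

```python
def split_phash(hex_hash):
    """Split a 16-digit hex pHash into (first 8 digits, last 8 digits) as integers."""
    if len(hex_hash) != 16:
        raise ValueError("expected 16 hex digits")
    return int(hex_hash[:8], 16), int(hex_hash[8:], 16)


def phash_distance(stored_halves, query_hex):
    """Same sum as BIT_COUNT(filephash_1 ^ q1) + BIT_COUNT(filephash_2 ^ q2)."""
    q1, q2 = split_phash(query_hex)
    s1, s2 = stored_halves
    return bin(s1 ^ q1).count("1") + bin(s2 ^ q2).count("1")


query_halves = split_phash("ece0455d6b8e9470")
```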

Related

How to auto increment a string with sql query

I am stuck at a point where I have to increment a string. My strings look like C001, SC001, B001.
in my data base they are defined like
What I am trying to do is write a query that finds the highest code currently in my database and increments it by 1,
for example C001 -> C002, C009 -> C010, C099 -> C100 and so on.
Similarly SC001 -> SC002, SC009 -> SC010, SC099 -> SC100 and so on.
Similarly B001 -> B002, B009 -> B010, B099 -> B100 and so on.
A friend suggested the query below, but it only handles increments such as AAAA -> AAAA01, AAAA09 -> AAAA10.
The query is:
SELECT id AS PrevID, CONCAT(
SUBSTRING(id, 1, 4),
IF(CAST(SUBSTRING(id, 5) AS UNSIGNED) <= 9, '0', ''),
CAST(SUBSTRING(id, 5) AS UNSIGNED) + 1
) AS NextID
FROM (
-- since you allow strings such as AAAA20 and AAAA100 you can no longer use MAX
SELECT id
FROM t
ORDER BY SUBSTRING(id, 1, 4) DESC, CAST(SUBSTRING(id, 5) AS UNSIGNED) DESC
LIMIT 1
) x
When I replace id with CategoryCode, it gives me PrevID C004 and NextID C00401, which is not what I need; I want PrevID C004 and NextID C005.
Note: I am using MySQL Server 5.1.
Just try this one:
-- Keep the letter prefix, add 1 to the highest 3-digit suffix, and zero-pad it again.
SELECT CONCAT(LEFT(MAX(CategoryCode), 1),
    LPAD(MAX(CAST(RIGHT(CategoryCode, 3) AS UNSIGNED)) + 1, 3, '0')) AS NextCategoryCode
FROM test;
SELECT CONCAT(LEFT(MAX(SubCategoryCode), 2),
    LPAD(MAX(CAST(RIGHT(SubCategoryCode, 3) AS UNSIGNED)) + 1, 3, '0')) AS NextSubCategoryCode
FROM test;
SELECT CONCAT(LEFT(MAX(BrandCode), 1),
    LPAD(MAX(CAST(RIGHT(BrandCode, 3) AS UNSIGNED)) + 1, 3, '0')) AS NextBrandCode
FROM test;
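If it is easier to compute the next code in application code, the same rule (keep the letter prefix, add 1 to the number, keep the zero padding) could be sketched in Python like this. The function name is hypothetical, and the sketch assumes every code is letters followed by digits, as in the examples from the question.

```python
import re


def next_code(code: str) -> str:
    """Increment the numeric suffix of a code such as C001 or SC099."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", code)
    if match is None:
        raise ValueError(f"unexpected code format: {code!r}")
    prefix, digits = match.groups()
    # zfill keeps the original width and lets the number grow past it if needed.
    return prefix + str(int(digits) + 1).zfill(len(digits))
```

For instance, `next_code("C099")` gives `"C100"` and `next_code("SC009")` gives `"SC010"`.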

Mysql : Use IF in TRUNCATE

This is the query, simplified.
SELECT `a`, TRUNCATE(`b` / 1000, 3) AS `b`
FROM (
...
) AS `m`
GROUP BY `a`
ORDER BY `a`
What I'm trying to do is change the number of decimal places (currently 3) based on the value of b.
So I tried this:
SELECT `a`, TRUNCATE(`b` / 1000, IF(`b` < 10, 2, 3)) AS `b` ...
and this:
SELECT `a`, IF(`b` < 10, TRUNCATE(`b` / 1000, 2), TRUNCATE(`b` / 1000, 3)) AS `b`
If b is less than 10, I want 3 decimal places, otherwise 2.
But this doesn't seem to work ...
Resources : https://dev.mysql.com/doc/refman/8.0/en/control-flow-functions.html#function_if
Just swap the positions of the values in your query:
SELECT `a`, IF(b < 10, TRUNCATE(b / 1000, 3), TRUNCATE(b / 1000, 2))
AS b
IF(a < 1, 2, 3) returns 2 when a < 1 and 3 otherwise, so you have to swap the positions of your values.
Alternatively, use ROUND:
SELECT a, IF(b < 10, ROUND(b / 1000, 3), ROUND(b / 1000, 2)) AS b
The ROUND() function rounds a number to a specified number of decimal places.
Example: SELECT ROUND(345.156, 2); result = 345.16
SELECT ROUND(345.156, 3); result = 345.156
If you don't want rounding, note that TRUNCATE(b / 1000, 2) shows 0.00 whenever b is less than 10, so what exactly do you mean by "not working"?
You need 3 decimal places when b < 10, so you have to swap the two result expressions in your query.
You have the true and false branches of IF() in the wrong order. The following may work:
SELECT `a`,
    IF(`b` < 10,
        TRUNCATE(`b` / 1000, 3),
        TRUNCATE(`b` / 1000, 2)
    ) AS `b`
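To check the intended behaviour outside the database, here is a small Python sketch of the same rule: 3 decimal places when b is below 10, otherwise 2, always truncating rather than rounding. The function name is hypothetical, and Decimal is used only to avoid floating-point surprises in the illustration.

```python
from decimal import Decimal, ROUND_DOWN


def truncate_scaled(b) -> Decimal:
    """Mimic IF(b < 10, TRUNCATE(b / 1000, 3), TRUNCATE(b / 1000, 2))."""
    quantum = Decimal("0.001") if b < 10 else Decimal("0.01")
    scaled = Decimal(str(b)) / Decimal(1000)
    return scaled.quantize(quantum, rounding=ROUND_DOWN)
```

With this rule a value of 5 keeps its third decimal place (0.005), whereas truncating it to 2 places would show 0.00.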

Why does this MySQL XOR query return 0?

I am trying to compute the hamming distance between two hex strings. First, the strings are converted from base 16 to base 10, then they are xor'd and the bits are counted:
SELECT (CONV('b4124b0d195b2507', 16, 10)) ^ (CONV('eae26aebf1f139f9', 16, 10));
This results in 0.
Independently running
SELECT (CONV('b4124b0d195b2507', 16, 10));
and
SELECT (CONV('eae26aebf1f139f9', 16, 10));
give me the answers I would expect (12975515996039881991 and 16925207911220722169).
Where is the flaw in my logic?
SELECT CONVERT((CONV('b4124b0d195b2507', 16, 10)), SIGNED) ^ CONVERT((CONV('eae26aebf1f139f9', 16, 10)), SIGNED)
is what you want.
As per the docs
http://dev.mysql.com/doc/refman/5.0/en/mathematical-functions.html#function_conv
CONV() "returns a string representation of the number N, converted from base
from_base to base to_base".
You need to convert the strings back to numbers before applying XOR.
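The same point, that the values must be integers rather than strings before the XOR, can be illustrated with a short Python sketch. The function name is hypothetical; the two hex strings are the ones from the question.

```python
def hamming_hex(a_hex: str, b_hex: str) -> int:
    """Hamming distance between two hex strings of equal bit width."""
    a = int(a_hex, 16)  # parse to an integer first ...
    b = int(b_hex, 16)
    return bin(a ^ b).count("1")  # ... then XOR and count the set bits


distance = hamming_hex("b4124b0d195b2507", "eae26aebf1f139f9")
```

This is handy for cross-checking what the SQL query returns for a given pair of hashes.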

Mysql Order By Problematic

Okay, I'm having some difficulties with ORDER BY. Here is the problem I need to solve:
In the database I have stored every tile of a map that is 101 x 101 tiles in size. The table has 3 columns (ID, x, y), and I need to select all the tiles within some radius. For example, I used this query:
SELECT *
FROM tile
WHERE ((x >= -3 AND x <= 3)
AND (y >= -3 AND y <= 3))
ORDER BY x ASC, y DESC;
This query selects all tiles within a radius of 3 of the given coordinate, (0|0) for now.
But it doesn't sort them the way I want. Basically, the output must be like this.
But this is the closest I got.
http://prntscr.com/zqjd7
Edit:
Disregard the duplicate values; I had duplicate inputs for each coordinate and hadn't noticed.
It seems that your problem is with the ASC / DESC modifiers.
But since we're here, wouldn't you prefer to use a distance formula? Something like
SELECT x, y FROM tile WHERE
(
POW(x-@var1, 2) + POW(y-@var2, 2) <= POW(3, 2)
)
ORDER BY x DESC, y ASC;
Here, given a point P (m,n), the distance to a point Q (x,y) is D(P,Q) = SQRT( (x-m)² + (y-n)² ). Since it has to be less than or equal to your desired radius (= 3), we have SQRT( (x-m)² + (y-n)² ) <= 3, or better, (x-m)² + (y-n)² <= 3², after squaring both sides.
In SQL we write POW(x-m, 2) + POW(y-n, 2) <= POW(3, 2), meaning that the distance between (x,y) and (m,n) is less than or equal to 3.
As for @var1 and @var2, that is where you enter your input values. Strictly speaking they are session variables, but you don't need them just to run a select; simply substitute any number you want, e.g. you can choose the origin (0,0) by putting 0 in place of @var1 and @var2.
[Update]
Well... It's always a good idea to test your code before answering. In fact I should have suggested ordering by y first, since we care first about ordering the rows to display on screen. The following code was (finally) tested (on a test DB); my last suggestion is to create the following index (index_y_x):
USE `test` ;
CREATE TABLE IF NOT EXISTS `test`.`tile` (
`id` INT(11) UNSIGNED NOT NULL AUTO_INCREMENT ,
`x` INT(11) NULL DEFAULT 0 ,
`y` INT(11) NULL DEFAULT 0 ,
PRIMARY KEY (`id`) ,
INDEX `index_y_x` (`y` DESC, `x` ASC) )
ENGINE = InnoDB
DEFAULT CHARACTER SET = utf8;
INSERT tile (x,y) VALUES
(-2,-2),(-2, -1),(-2, 0),(-2, 1),(-2, 2),
(-1,-2),(-1, -1),(-1, 0),(-1, 1),(-1, 2),
(0,-2), (0, -1), (0, 0), (0, 1), (0, 2),
(1,-2), (1, -1), (1, 0), (1, 1), (1, 2),
(2,-2), (2, -1), (2, 0), (2, 1), (2, 2);
SELECT x, y FROM tile
WHERE POW(x-3, 2) + POW(y-3, 2) <= POW(3, 2)
ORDER BY y DESC, x ASC;
This returns the tiles within 3 units of the point (3,3).
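As a cross-check of the filter and the ordering, here is a minimal Python sketch of the same logic: keep the tiles whose squared distance to the centre is at most the squared radius, then sort by y descending and x ascending. The function name is hypothetical, and the grid is the 5 x 5 test data from the INSERT above.

```python
def tiles_within(tiles, center, radius):
    """Tiles within `radius` of `center`, ordered by y DESC, then x ASC."""
    cx, cy = center
    inside = [
        (x, y)
        for x, y in tiles
        if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2
    ]
    return sorted(inside, key=lambda tile: (-tile[1], tile[0]))


grid = [(x, y) for x in range(-2, 3) for y in range(-2, 3)]
near_corner = tiles_within(grid, (3, 3), 3)
near_origin = tiles_within(grid, (0, 0), 1)
```

Comparing squared distances avoids the square root, just as the SQL condition does.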

MySQL: an efficient binary value comparison

My table has 8 VARCHAR fields, each holding a binary string of 64 bits. My goal is to get the Hamming distance for each row. I was doing it with the following query:
SELECT
BIT_COUNT(CONV(fp.bin_str0, 2, 10 ) ^ CONV('0000000001101111000000000101011100000000001010100000000001111101', 2, 10 )) +
BIT_COUNT(CONV(fp.bin_str1, 2, 10 ) ^ CONV('0000000010110001000000001000000000000000011000010000000011110100', 2, 10 )) +
BIT_COUNT(CONV(fp.bin_str2, 2, 10 ) ^ CONV('0000000010010100000000000010101100000000110001000000000011100100', 2, 10 )) +
BIT_COUNT(CONV(fp.bin_str3, 2, 10 ) ^ CONV('0000000011101011000000000001110000000000101100010000000000011001', 2, 10 )) +
BIT_COUNT(CONV(fp.bin_str4, 2, 10 ) ^ CONV('0000000000010000000000000011010100000000111011100000000001001101', 2, 10 )) +
BIT_COUNT(CONV(fp.bin_str5, 2, 10 ) ^ CONV('0000000000101111000000000110101000000000000010100000000000101101', 2, 10 )) +
BIT_COUNT(CONV(fp.bin_str6, 2, 10 ) ^ CONV('0000000000011000000000000101011000000000001010000000000000001011', 2, 10 )) +
BIT_COUNT(CONV(fp.bin_str7, 2, 10 ) ^ CONV('0000000000101011000000000011100100000000000100000000000000111010', 2, 10 )) from mytable fp
This query is extremely slow. There are a few reasons: mytable has 3M rows and the fp.bin_stri fields are of VARCHAR type.
Since MySQL has a BINARY type, can I run the same query with fp.bin_stri stored as BINARY? And how?
I'm confused because, when I changed fp.bin_stri to BINARY, the data in this field showed up as a BLOB, and now I don't know what the query should look like. Should it use CONV?
A 64-bit binary string is the same size as MySQL's BIGINT type (the standard size on modern hardware of a double-precision float or long integer). Use a BIGINT UNSIGNED to store each field; then you can compare it to other bit fields using the b'1010...' syntax instead of CONV().
BIT_COUNT(fp.bin_strN ^ b'0000000001101111000000000101011100000000001010100000000001111101')
This should be really fast, since the hardware is designed to do bit operations on 64-bit values.
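If the existing VARCHAR data has to be migrated, the conversion from a string of 0s and 1s to the integer that goes into the BIGINT UNSIGNED column could look like this minimal Python sketch. The function name is hypothetical; the sample string is the first constant from the question.

```python
def bits_to_int(bit_string: str) -> int:
    """Convert a string of '0'/'1' characters to the integer it encodes."""
    if not bit_string or set(bit_string) - {"0", "1"}:
        raise ValueError("expected a non-empty string of 0s and 1s")
    return int(bit_string, 2)


sample = "0000000001101111000000000101011100000000001010100000000001111101"
value = bits_to_int(sample)  # integer to store in the BIGINT UNSIGNED column
```

Once the values are stored as integers, the per-row work is a single XOR and bit count with no string parsing.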