MySQL: an efficient binary value comparison - mysql

My table has 8 VARCHAR fields, each holding a 64-bit binary string. My goal is to get the Hamming distance for each row. I was doing it with the following query:
SELECT
BIT_COUNT(CONV(fp.bin_str0, 2, 10 ) ^ CONV('0000000001101111000000000101011100000000001010100000000001111101', 2, 10 )) +
BIT_COUNT(CONV(fp.bin_str1, 2, 10 ) ^ CONV('0000000010110001000000001000000000000000011000010000000011110100', 2, 10 )) +
BIT_COUNT(CONV(fp.bin_str2, 2, 10 ) ^ CONV('0000000010010100000000000010101100000000110001000000000011100100', 2, 10 )) +
BIT_COUNT(CONV(fp.bin_str3, 2, 10 ) ^ CONV('0000000011101011000000000001110000000000101100010000000000011001', 2, 10 )) +
BIT_COUNT(CONV(fp.bin_str4, 2, 10 ) ^ CONV('0000000000010000000000000011010100000000111011100000000001001101', 2, 10 )) +
BIT_COUNT(CONV(fp.bin_str5, 2, 10 ) ^ CONV('0000000000101111000000000110101000000000000010100000000000101101', 2, 10 )) +
BIT_COUNT(CONV(fp.bin_str6, 2, 10 ) ^ CONV('0000000000011000000000000101011000000000001010000000000000001011', 2, 10 )) +
BIT_COUNT(CONV(fp.bin_str7, 2, 10 ) ^ CONV('0000000000101011000000000011100100000000000100000000000000111010', 2, 10 )) from mytable fp
This query is extremely slow, for two reasons: mytable has 3M rows, and the fp.bin_stri fields are of VARCHAR type.
Since MySQL has a BINARY type, can I run the same query with fp.bin_stri as BINARY? And how?
I'm confused because when I changed fp.bin_stri to BINARY, the data in the field showed up as a BLOB, and now I don't know what the query should look like. Should it still use CONV?

A 64-bit binary string is the same size as MySQL's BIGINT type (8 bytes, the size of a double-precision float or a long integer on modern hardware). Store each field as BIGINT UNSIGNED; then you can compare it to other bit fields using the b'1010...' literal syntax instead of CONV().
BIT_COUNT(fp.strN ^ b'0000000001101111000000000101011100000000001010100000000001111101')
Should be really fast since the hardware is designed to do bit ops on 64-bit values.
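As a cross-check outside the database, the same computation can be sketched in Python. This is only an illustration, not part of the answer above: the function name hamming_distance_bits is hypothetical, and it assumes both inputs are strings made of '0' and '1' characters.

```python
def hamming_distance_bits(a: str, b: str) -> int:
    """Hamming distance between two binary strings such as '0110...'."""
    # int(s, 2) plays the role of CONV(s, 2, 10); XOR leaves a 1 bit
    # wherever the two inputs differ.
    diff = int(a, 2) ^ int(b, 2)
    # Counting the 1 bits is what BIT_COUNT() does in MySQL.
    return bin(diff).count("1")


stored = "0000000001101111000000000101011100000000001010100000000001111101"
# Hypothetical query value: the stored string with its last bit flipped.
query = stored[:-1] + "0"
print(hamming_distance_bits(stored, query))  # prints 1
```

Summing this over the 8 fields gives the same total as the SQL expression in the question.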

Related

Convert IP address (IPv4) into an Integer in R

I was looking for a way to write a function in R which converts an IP address into an integer.
My dataframe looks like this:
total IP
626 189.14.153.147
510 67.201.11.8
509 64.22.53.140
483 180.9.85.10
403 98.8.136.126
391 64.06.187.68
I export this data from a MySQL database. I have a query that converts an IP address into an integer in MySQL:
mysql> select CAST(SUBSTRING_INDEX(SUBSTRING_INDEX('75.19.168.155', '.', 1), '.', -1) << 24 AS UNSIGNED) + CAST(SUBSTRING_INDEX(SUBSTRING_INDEX('75.19.168.155', '.', 2), '.', -1) << 16 AS UNSIGNED) + CAST(SUBSTRING_INDEX(SUBSTRING_INDEX('75.19.168.155', '.', 3), '.', -1) << 8 AS UNSIGNED) + CAST(SUBSTRING_INDEX(SUBSTRING_INDEX('75.19.168.155', '.', 4), '.', -1) AS UNSIGNED) FINAL;
But I want to do this conversion in R. Any help would be awesome.
You were not entirely specific about what conversion you wanted, so I multiplied the decimal values by what I thought might be appropriate (treating the four dotted parts as digits of a base-256 number, then redisplaying the result in base 10). If you want the order of the positions reversed, as I have seen suggested elsewhere, reverse the indexing of 'vals' in both solutions.
convIP <- function(IP) { vals <- read.table(text=as.character(IP), sep=".")
return( vals[1] + 256*vals[2] + 256^2*vals[3] + 256^3*vals[4]) }
> convIP(dat$IP)
V1
1 2476281533
2 134990147
3 2352289344
4 173345204
5 2122844258
6 1153107520
(It's usually better IT practice to specify what you think to be the correct answer so testing can be done. Bertelson's comment above would be faster and implicitly uses 1000, 1000^2 and 1000^3 as the factors.)
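I am taking a crack at simplifying the code but fear that the need to use Reduce("+", ...) may make it more complex. You cannot use sum because it is not vectorized. Note that multiplying the data frame by the weight vector directly (vals*256^(0:3)) recycles the weights down each column instead of across the columns, so each column is paired with its weight through Map. The weights follow the same order as the first solution.
convIP <- function(IP) { vals <- read.table(text=as.character(IP), sep=".")
return( Reduce("+", Map("*", vals, 256^(0:3)))) }
> convIP(dat$IP)
[1] 2476281533  134990147 2352289344  173345204 2122844258 1153107520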
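For comparison, here is a minimal Python sketch of both byte orders. It is not part of the R answer; the function names ip_to_int and ip_to_int_reversed are hypothetical. The first treats the leftmost octet as most significant, which is what the MySQL expression in the question computes; the second uses the reversed order of the first R function.

```python
import ipaddress


def ip_to_int(ip: str) -> int:
    """First octet is the most significant byte (as in the MySQL query)."""
    return int(ipaddress.IPv4Address(ip))


def ip_to_int_reversed(ip: str) -> int:
    """First octet is the least significant byte (as in the first R function)."""
    octets = [int(part) for part in ip.split(".")]
    return sum(octet * 256 ** i for i, octet in enumerate(octets))


print(ip_to_int("75.19.168.155"))            # prints 1259579547
print(ip_to_int_reversed("189.14.153.147"))  # prints 2476281533
```

The second value agrees with the first row of the R output above, which confirms that the R function uses the reversed order.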

Cast number of bytes from blob field to number

I have a table with one blob field named bindata. bindata always contains 7 bytes. The first four of them are an integer (unsigned, I think; the db is not mine).
My question is: how can I select only the first four bytes from bindata and convert them to a number?
I am new to MySQL, but from the documentation I see that I may have to use the CONV function by doing something like this:
SELECT CONV(<Hex String of first 4 bytes of bindata>,16,10) as myNumber
But I don't have a clue how to select only the first four bytes of the blob field. I am really stuck here.
Thanks
You can use string functions to extract individual bytes from the blob. For example:
SELECT id,
((ORD(SUBSTR(`data`, 1, 1)) << 24) +
(ORD(SUBSTR(`data`, 2, 1)) << 16) +
(ORD(SUBSTR(`data`, 3, 1)) << 8) +
ORD(SUBSTR(`data`, 4, 1))) AS num
FROM test;
Here is Demo in SQLFiddle
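The shift-and-add logic of that query can be sketched in Python to check expected values. This is an illustration only: the function name is hypothetical, and it assumes the integer is stored big-endian (most significant byte first), which is the same assumption the SQL above makes.

```python
def first_four_bytes_as_uint(blob: bytes) -> int:
    """Interpret the first four bytes as a big-endian unsigned integer."""
    # Equivalent to (b1 << 24) + (b2 << 16) + (b3 << 8) + b4.
    return int.from_bytes(blob[:4], byteorder="big", signed=False)


# Hypothetical 7-byte value; only the first four bytes are used.
sample = bytes([0x00, 0x00, 0x01, 0x02, 0xAA, 0xBB, 0xCC])
print(first_four_bytes_as_uint(sample))  # prints 258
```

If the data turns out to be little-endian, the byte order in the SQL shifts (and byteorder here) would need to be reversed.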

how to update flag bit in mysql query?

This is my SQL query. In flag (00000), every bit position has a different meaning, e.g. the 4th bit position is changed to 1 when the user is inactive. Here flag is of VARCHAR datatype (a string).
$sql="select flag from user where id =1"
I got
flag=10001 #it may be flag="00001" or flag="00101"
I want to update the 2nd bit of this flag to 1.
$sql="update user set flag='-1---' where id=1" #it may be flag='11001' or flag='01001' or flag='01110'
Actually, I want to update the 2nd bit of this flag to 1 without writing out the whole value, as in flag='11001'. I want to do something like this:
$sql="update user set flag='--change(flag,2bit,to1)--' where id =1" #this is wrong
How can I do this using only one SQL query? Is it possible?
update user
set flag = lpad(conv((conv(flag, 2, 10) | 1 << 3), 10, 2), 5, '0')
where id = 1
conv(flag, 2, 10) converts the flag string from binary to decimal.
1 << 3 shifts a 1 bit 3 binary places to the left
| performs a binary OR of this, to set that bit. This arithmetic operation will automatically coerce the decimal string to a number; you can use an explicit CAST if you prefer.
conv(..., 10, 2) will convert the decimal string back to a binary string
lpad(..., 5, '0') adds leading zeroes to make the string 5 characters long
FIDDLE DEMO
To set the bit to 0, you use:
set flag = lpad(conv((conv(flag, 2, 10) & ~(1 << 3)), 10, 2), 5, '0')
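The same convert, OR/AND, convert back, pad sequence can be sketched in Python to verify what the statements should produce. This is only an illustration; set_flag_bit is a hypothetical helper, and bit positions are counted from the left starting at 1, as in the question.

```python
def set_flag_bit(flag: str, position_from_left: int, value: int) -> str:
    """Return the flag string with one bit set to 1 or cleared to 0."""
    width = len(flag)
    # Position 2 of a 5-character flag corresponds to a shift of 3.
    shift = width - position_from_left
    number = int(flag, 2)                      # conv(flag, 2, 10)
    if value:
        number |= 1 << shift                   # set the bit
    else:
        number &= ~(1 << shift)                # clear the bit
    return format(number, "b").zfill(width)    # conv(..., 10, 2) + lpad


print(set_flag_bit("10001", 2, 1))  # prints 11001
print(set_flag_bit("11001", 2, 0))  # prints 10001
```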
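You want to use the bitwise OR operator |. If flag is stored as an integer:
update user set flag = flag | (1 << 1) where id =1
if flag was 101 (binary) flag will now be 111
if flag was 000 flag will now be 010
1 << 1 shifts 1 up one bit - making it 10 (binary 2)
edit - not tested. For a VARCHAR flag, CAST(flag AS SIGNED) would read the digits as a decimal number instead of binary, so convert with CONV:
update user set flag = lpad(conv(conv(flag, 2, 10) | (1 << 1), 10, 2), 5, '0') where id =1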
If you are going to use a VARCHAR, you are better off using string manipulation functions: http://dev.mysql.com/doc/refman/5.0/en/string-functions.html
UPDATE user
SET flag = CONCAT(LEFT(flag, 1), '1', RIGHT(flag, 3))
WHERE id = 1
However, you probably want to convert this field to an INT so that you can use the bit functions: http://dev.mysql.com/doc/refman/5.0/en/bit-functions.html

Hamming Distance optimization for MySQL or PostgreSQL?

I am trying to improve the search for similar images whose pHashes are stored in a MySQL database.
Right now I compare pHashes by counting the Hamming distance like this:
SELECT * FROM images WHERE BIT_COUNT(hash ^ 2028359052535108275) <= 4
Results for selecting (engine MyISAM)
20000 rows ; query time < 20ms
100000 rows ; query time ~ 60ms # this was just fine, until its reached 150000 rows
300000 rows ; query time ~ 150ms
So query time increases with the number of rows in the table.
I also tried solutions found on Stack Overflow, such as
"Hamming distance on binary strings in SQL":
SELECT * FROM images WHERE
BIT_COUNT(h1 ^ 11110011) +
BIT_COUNT(h2 ^ 10110100) +
BIT_COUNT(h3 ^ 11001001) +
BIT_COUNT(h4 ^ 11010001) +
BIT_COUNT(h5 ^ 00100011) +
BIT_COUNT(h6 ^ 00010100) +
BIT_COUNT(h7 ^ 00011111) +
BIT_COUNT(h8 ^ 00001111) <= 4
rows 300000 ; query time ~ 240ms
I changed the database engine to PostgreSQL and translated this MySQL query to PyGreSQL,
without success.
rows 300000 ; query time ~ 18s
Is there any way to optimize the above queries?
I mean an optimization that does not depend on the number of rows.
I have limited ways (tools) to solve this problem.
MySQL has so far seemed to be the simplest solution, but I can deploy code on any open source database engine that works with Ruby on a dedicated machine.
There are some ready-made solutions for MS SQL, https://stackoverflow.com/a/5930944/766217 (not tested). Maybe someone knows how to translate them to MySQL or PostgreSQL.
Please post answers based on code or observations. There are already plenty of theoretical discussions about Hamming distance on stackoverflow.com.
Thanks!
When considering the efficiency of algorithms, computer scientists use the concept of the order denoted O(something) where something is a function of n, the number of things being computed, in this case rows. So we get, in increasing time:
O(1) - independent of the number of items
O(log(n)) - increases as the logarithm of the items
O(n) - increases in proportion to the number of items (what you have)
O(n^2) - increases as the square of the items
O(n^3) - etc
O(2^n) - increases exponentially
O(n!) - increases with the factorial of the number
The last two are effectively uncomputable for any reasonable value of n (80+).
Only the most significant term matters, since it dominates for large n, so n^2 and 65*n^2+787*n+4656566 are both O(n^2).
Bear in mind that this is a mathematical construction; the time an algorithm takes with real software on real hardware using real data may be heavily influenced by other things (e.g. an O(n^2) memory operation may take less time than an O(n) disk operation).
For your problem, you need to run through each row and compute BIT_COUNT(hash ^ 2028359052535108275) <= 4. This is an O(n) operation.
The only way this could be improved is by utilizing an index since a b-tree index retrieval is an O(log(n)) operation.
However, because your column field is contained within a function, an index on that column cannot be used. You have 2 possibilities:
This is an SQL server solution and I don't know if it is portable to MySQL. Create a persisted calculated column in your table with the formula BIT_COUNT(hash ^ 2028359052535108275) and put an index on it. This will not be suitable if you need to change the bit mask.
Work out a way of doing the bitwise arithmetic without using the BIT_COUNT function.
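To make the O(n) point concrete, here is a minimal Python sketch of what the database has to do for this predicate when no index applies: evaluate the distance for every stored hash. The names hamming and linear_scan are hypothetical, and the stored values are made up for the example.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits, the same as BIT_COUNT(a ^ b)."""
    return bin(a ^ b).count("1")


def linear_scan(hashes, query, max_distance):
    """Check every row once: the work grows in proportion to len(hashes)."""
    return [h for h in hashes if hamming(h, query) <= max_distance]


query = 2028359052535108275
# Made-up rows: an exact match, one at distance 3, one at distance 16.
rows = [query, query ^ 0b111, query ^ 0xFFFF]
print(linear_scan(rows, query, 4) == [query, query ^ 0b111])  # prints True
```

Doubling the number of rows doubles the number of hamming() calls, which matches the roughly linear timings reported in the question.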
This solution made things a bit faster for me.
It builds a derived table for each hash comparison and returns only the rows whose running total is still within the Hamming distance limit. This way, it does not compute BIT_COUNT on a pHash that has already exceeded the limit. It returns all matches in about 2.25 seconds on 2.6 million records.
It's InnoDB, and I have very few indexes.
If somebody can make it faster, I'd appreciate it.
SELECT *, BIT_COUNT(pHash3 ^ 42597524) + BC2 AS BC3
FROM (
SELECT *, BIT_COUNT(pHash2 ^ 258741369) + BC1 AS BC2
FROM (
SELECT *, BIT_COUNT(pHash1 ^ 5678910) + BC0 AS BC1
FROM (
SELECT `Key`, pHash0, pHash1, pHash2, pHash3, BIT_COUNT(pHash0 ^ 1234567) as BC0
FROM files
WHERE BIT_COUNT(pHash0 ^ 1234567) <= 3
) AS BCQ0
WHERE BIT_COUNT(pHash1 ^ 5678910) + BC0 <= 3
) AS BCQ1
WHERE BIT_COUNT(pHash2 ^ 258741369) + BC1 <= 3
) AS BCQ2
WHERE BIT_COUNT(pHash3 ^ 42597524) + BC2 <= 3
This is the equivalent query, but without the derived tables. Its return time is almost 3 times as long.
SELECT `Key`, pHash0, pHash1, pHash2, pHash3
FROM files
WHERE BIT_COUNT(pHash0 ^ 1234567) + BIT_COUNT(pHash1 ^ 5678910) + BIT_COUNT(pHash2 ^ 258741369) + BIT_COUNT(pHash3 ^ 42597524) <=3
Keep in mind that the lower the Hamming distance limit on the first comparison, the faster it will run.
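The idea behind the nested derived tables is an early exit: stop adding up partial distances as soon as the running total exceeds the limit. A minimal Python sketch of that idea follows; within_distance is a hypothetical name, and the constants are the ones used in the query above.

```python
def within_distance(stored, query, limit):
    """True if the summed bit differences over all chunks stay within limit."""
    total = 0
    for s, q in zip(stored, query):
        total += bin(s ^ q).count("1")
        if total > limit:
            # Remaining chunks cannot lower the total, so skip them.
            return False
    return True


query = (1234567, 5678910, 258741369, 42597524)
close = (1234567 ^ 0b1, 5678910, 258741369 ^ 0b10, 42597524)  # distance 2
far = (1234567 ^ 0b1111, 5678910, 258741369, 42597524)        # distance 4
print(within_distance(close, query, 3))  # prints True
print(within_distance(far, query, 3))    # prints False
```

Whether the SQL optimizer actually skips the later BIT_COUNT calls depends on how it executes the derived tables, so the timings above are the real evidence; the sketch only shows the intended logic.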
Here are the results for my tests. Phash is calculated with the imagehash library in Python and stored as two BIGINTs in the database.
This test was run on 858,433 images in a MariaDB database that does not use sharding. I found that sharding actually slowed down the process; however, that was with the function method, so it may be different without it or on a larger database.
The table these are running on is an in-memory only table. A local table is kept and upon startup of the database the id, phash1, and phash2 are copied to an in-memory table. The id is returned to match to the innodb table once something is found.
Total Images: 858433
Image 1: ece0455d6b8e9470
Function HAMMINGDISTANCE_16:
RETURN BIT_COUNT(A0 ^ B0) + BIT_COUNT(A1 ^ B1)
Method: HAMMINGDISTANCE_16 Function
Query:
SELECT `id` FROM `phashs` WHERE HAMMINGDISTANCE_16(filephash_1, filephash_2, CONV(SUBSTRING('ece0455d6b8e9470', 1, 8), 16, 10), CONV(SUBSTRING('ece0455d6b8e9470', 9, 8), 16, 10)) <= 3;
Time: 2.1760 seconds
Method: BIT_COUNT
Query:
SELECT `id` FROM `phashs` WHERE BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('ece0455d6b8e9470', 1, 8), 16, 10)) + BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('ece0455d6b8e9470', 9, 8), 16, 10)) <= 3;
Time: 0.1547 seconds
Method: Multi-Select BIT_COUNT inner is filephash_1
Query:
SELECT `id` FROM ( SELECT `id`, `filephash_2`, BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('ece0455d6b8e9470', 1, 8), 16, 10)) as BC0 FROM `phashs` WHERE BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('ece0455d6b8e9470', 1, 8), 16, 10)) <= 3 ) BCQ0 WHERE BC0 + BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('ece0455d6b8e9470', 9, 8), 16, 10)) <= 3;
Time: 0.1878 seconds
Method: Multi-Select BIT_COUNT inner is filephash_2
Query:
SELECT `id` FROM (SELECT `id`, `filephash_1`, BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('ece0455d6b8e9470', 9, 8), 16, 10)) as BC1 FROM `phashs` WHERE BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('ece0455d6b8e9470', 9, 8), 16, 10)) <= 3) BCQ1 WHERE BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('ece0455d6b8e9470', 1, 8), 16, 10)) + BC1 <= 3;
Time: 0.1860 seconds
Image 2: 813ed36913ec8639
Function HAMMINGDISTANCE_16:
RETURN BIT_COUNT(A0 ^ B0) + BIT_COUNT(A1 ^ B1)
Method: HAMMINGDISTANCE_16 Function
Query:
SELECT `id` FROM `phashs` WHERE HAMMINGDISTANCE_16(filephash_1, filephash_2, CONV(SUBSTRING('813ed36913ec8639', 1, 8), 16, 10), CONV(SUBSTRING('813ed36913ec8639', 9, 8), 16, 10)) <= 3;
Time: 2.1440 seconds
Method: BIT_COUNT
Query:
SELECT `id` FROM `phashs` WHERE BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('813ed36913ec8639', 1, 8), 16, 10)) + BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('813ed36913ec8639', 9, 8), 16, 10)) <= 3;
Time: 0.1588 seconds
Method: Multi-Select BIT_COUNT inner is filephash_1
Query:
SELECT `id` FROM ( SELECT `id`, `filephash_2`, BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('813ed36913ec8639', 1, 8), 16, 10)) as BC0 FROM `phashs` WHERE BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('813ed36913ec8639', 1, 8), 16, 10)) <= 3 ) BCQ0 WHERE BC0 + BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('813ed36913ec8639', 9, 8), 16, 10)) <= 3;
Time: 0.1671 seconds
Method: Multi-Select BIT_COUNT inner is filephash_2
Query:
SELECT `id` FROM (SELECT `id`, `filephash_1`, BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('813ed36913ec8639', 9, 8), 16, 10)) as BC1 FROM `phashs` WHERE BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('813ed36913ec8639', 9, 8), 16, 10)) <= 3) BCQ1 WHERE BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('813ed36913ec8639', 1, 8), 16, 10)) + BC1 <= 3;
Time: 0.1686 seconds
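The hash handling in these queries can be sketched in Python to sanity-check results. This is an illustration under assumptions: split_phash and hamming_distance_16 are hypothetical names that mirror the CONV(SUBSTRING(...)) calls and the HAMMINGDISTANCE_16 function above, taking a 16-character hex pHash and splitting it into two 8-character halves.

```python
def split_phash(hex_hash: str):
    """Split a 16-hex-digit pHash into two integers (first and second half)."""
    # Mirrors CONV(SUBSTRING(h, 1, 8), 16, 10) and CONV(SUBSTRING(h, 9, 8), 16, 10).
    return int(hex_hash[:8], 16), int(hex_hash[8:16], 16)


def hamming_distance_16(a0: int, a1: int, b0: int, b1: int) -> int:
    """BIT_COUNT(A0 ^ B0) + BIT_COUNT(A1 ^ B1)."""
    return bin(a0 ^ b0).count("1") + bin(a1 ^ b1).count("1")


image1 = split_phash("ece0455d6b8e9470")
image2 = split_phash("813ed36913ec8639")
print(hamming_distance_16(*image1, *image1))  # prints 0
print(hamming_distance_16(*image1, *image2))  # distance between the two test images
```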

Is there way to match IP with IP+CIDR straight from SELECT query?

Something like
SELECT COUNT(*) AS c FROM BANS WHERE typeid=6 AND (SELECT ipaddr,cidr FROM BANS) MATCH AGAINST 'this_ip';
The idea is to avoid first fetching all records from the DB and then matching them one by one.
If c > 0, then the IP matched.
BANS table:
id int auto incr PK
typeid TINYINT (1=hostname, 4=ipv4, 6=ipv6)
ipaddr BINARY(128)
cidr INT
host VARCHAR(255)
DB: MySQL 5
IP and IPv type (4 or 6) is known when querying.
IP is for example ::1 in binary format
BANNED IP is for example ::1/64
Remember that IPs are not a textual address, but a numeric ID. I have a similar situation (we're doing geo-ip lookups), and if you store all your IP addresses as integers (for example, my IP address is 192.115.22.33 so it is stored as 3228767777), then you can lookup IPs easily by using right shift operators.
The downside of all these types of lookups is that you can't benefit from indexes and you have to do a full table scan whenever you do a lookup. The above scheme can be improved by storing both the network IP address of the CIDR network (the beginning of the range) and the broadcast address (the end of the range), so for example to store 192.168.1.0/24 you can store two columns:
network broadcast
3232235776, 3232236031
Then, to match an address, you simply do:
SELECT count(*) FROM bans WHERE 3232235876 >= network AND 3232235876 <= broadcast
This would let you store CIDR networks in the database and match them against IP addresses quickly and efficiently by taking advantage of quick numeric indexes.
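The network and broadcast values for a CIDR block can be computed before insertion, for example with Python's standard ipaddress module. The sketch below is an illustration; cidr_to_range and is_banned are hypothetical helper names, and is_banned only mimics the range comparison that the SQL query performs.

```python
import ipaddress


def cidr_to_range(cidr: str):
    """Return (network, broadcast) as integers for a CIDR block."""
    net = ipaddress.ip_network(cidr, strict=True)
    return int(net.network_address), int(net.broadcast_address)


def is_banned(ip: str, ranges) -> bool:
    """Same test as: value >= network AND value <= broadcast."""
    value = int(ipaddress.ip_address(ip))
    return any(low <= value <= high for low, high in ranges)


ranges = [cidr_to_range("192.168.1.0/24")]
print(ranges[0])                           # prints (3232235776, 3232236031)
print(is_banned("192.168.1.100", ranges))  # prints True
```

The address 192.168.1.100 is 3232235876 as an integer, which is the value used in the SELECT above.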
Note from discussion below:
MySQL 5.0 includes a ranged query optimization called "index merge intersect" which can speed up such queries (and avoid full table scans), as long as:
There is a multi-column index that matches exactly the columns in the query, in order. So - for the above query example, the index would need to be (network, broadcast).
All the data can be retrieved from the index. This is true for COUNT(*), but is not true for SELECT * ... LIMIT 1.
MySQL 5.6 includes an optimization called MRR which would also speed up full row retrieval, but that is out of scope of this answer.
For IPv4, you can use:
SET @length = 4;
SELECT INET_NTOA(ipaddr), INET_NTOA(searchaddr), INET_NTOA(mask)
FROM (
SELECT
(1 << (@length * 8)) - 1 & ~((1 << (@length * 8 - cidr)) - 1) AS mask,
CAST(CONV(SUBSTR(HEX(ipaddr), 1, @length * 2), 16, 10) AS DECIMAL(20)) AS ipaddr,
CAST(CONV(SUBSTR(HEX(@myaddr), 1, @length * 2), 16, 10) AS DECIMAL(20)) AS searchaddr
FROM ip
) ipo
WHERE ipaddr & mask = searchaddr & mask
IPv4 addresses, network addresses and netmasks are all UINT32 numbers and are presented in human-readable form as "dotted-quads". The routing table code in the kernel performs a very fast bit-wise AND comparison when checking if an address is in a given network space (network/netmask). The trick here is to store the dotted-quad IP addresses, network addresses and netmasks in your tables as UINT32, and then perform the same 32-bit bit-wise AND for your matching. eg
SET @test_addr = inet_aton('1.2.3.4');
SET @network_one = inet_aton('1.2.3.0');
SET @network_two = inet_aton('4.5.6.0');
SET @network_netmask = inet_aton('255.255.255.0');
SELECT (@test_addr & @network_netmask) = @network_one AS IS_MATCHED;
+------------+
| IS_MATCHED |
+------------+
|          1 |
+------------+
SELECT (@test_addr & @network_netmask) = @network_two AS IS_NOT_MATCHED;
+----------------+
| IS_NOT_MATCHED |
+----------------+
|              0 |
+----------------+
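The same bitwise AND test can be reproduced in Python to check values before putting them in a table. This is a minimal sketch; in_network is a hypothetical name, and the addresses are the ones from the example above.

```python
import ipaddress


def in_network(addr: str, network: str, netmask: str) -> bool:
    """True if (addr AND netmask) equals the network address."""
    a = int(ipaddress.IPv4Address(addr))     # like inet_aton(addr)
    n = int(ipaddress.IPv4Address(network))
    m = int(ipaddress.IPv4Address(netmask))
    return (a & m) == n


print(in_network("1.2.3.4", "1.2.3.0", "255.255.255.0"))  # prints True
print(in_network("1.2.3.4", "4.5.6.0", "255.255.255.0"))  # prints False
```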
Generating IP Address Ranges as Integers
If your database doesn't support fancy bitwise operations, you can use a simplified integer based approach.
The following example is using PostgreSQL:
select (cast(split_part(split_part('4.0.0.0/8', '/', 1), '.', 1) as bigint) * (256 * 256 * 256) +
cast(split_part(split_part('4.0.0.0/8', '/', 1), '.', 2) as bigint) * (256 * 256 ) +
cast(split_part(split_part('4.0.0.0/8', '/', 1), '.', 3) as bigint) * (256 ) +
cast(split_part(split_part('4.0.0.0/8', '/', 1), '.', 4) as bigint))
as network,
(cast(split_part(split_part('4.0.0.0/8', '/', 1), '.', 1) as bigint) * (256 * 256 * 256) +
cast(split_part(split_part('4.0.0.0/8', '/', 1), '.', 2) as bigint) * (256 * 256 ) +
cast(split_part(split_part('4.0.0.0/8', '/', 1), '.', 3) as bigint) * (256 ) +
cast(split_part(split_part('4.0.0.0/8', '/', 1), '.', 4) as bigint)) + cast(
pow(256, (32 - cast(split_part('4.0.0.0/8', '/', 2) as bigint)) / 8) - 1 as bigint
) as broadcast;
Hmmm. You could build a table of the CIDR masks, join it, and then compare the IP ANDed with the mask (& in MySQL) against the ban block's IP address. Would that do what you want?
If you don't want to build a mask table, you could compute the mask as -1 << (x-cidr) with x = 64 or 32 depending.
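A minimal Python sketch of computing the mask from the prefix length, assuming 32-bit IPv4 addresses stored as integers. The names prefix_mask and matches_block are hypothetical. Python integers are unbounded, so the shifted value is trimmed to the address width explicitly, which a fixed-width integer in the database does implicitly.

```python
def prefix_mask(cidr: int, bits: int = 32) -> int:
    """Mask with the top `cidr` bits set, e.g. /24 -> 255.255.255.0."""
    full = (1 << bits) - 1
    # Same idea as -1 << (bits - cidr), trimmed to `bits` bits.
    return (full << (bits - cidr)) & full


def matches_block(ip: int, block_ip: int, cidr: int, bits: int = 32) -> bool:
    """True if ip falls inside block_ip/cidr."""
    mask = prefix_mask(cidr, bits)
    return (ip & mask) == (block_ip & mask)


print(prefix_mask(24))                                 # prints 4294967040
print(matches_block(3232235876, 3232235776, 24))       # prints True
```

Here 3232235776 is 192.168.1.0 and 3232235876 is 192.168.1.100.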