Finding the set of wildcards for representing a binary interval - binary

In a switch flow table there is a field called the match field, which holds a list of matching conditions. In the match fields, binary sequences with wildcards are used to represent some of these conditions (a wildcard bit, written *, can be either 1 or 0).
For example, we can use the following binary sequences with wildcards to represent the matching condition '38798 <= Port <= 56637':
100101111000111*
100101111**1****
100101111*1*****
1001011111******
10011***********
101*************
110111010011110*
1101110100***0**
1101110100**0***
1101110100*0****
11011101000*****
11011100********
110**0**********
110*0***********
1100************
Does anyone know (or can anyone suggest) a way to obtain such a set of sequences? So far I have used a brute-force strategy (generating all possible combinations for the interval), but it is not computationally feasible (memory explosion) and it produces redundant results (several wildcards representing the same interval). I am now trying range splitting and grid-of-tries, but without success so far.
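One possible starting point for the range splitting mentioned above is the classic range-to-prefix expansion: repeatedly carve off the largest power-of-two block that is aligned at the current lower bound and still fits inside the interval. The Python below is only a minimal sketch under that assumption; the names range_to_prefixes and matches are made up for illustration. It emits prefix patterns only (wildcards at the tail), so it will not reproduce entries such as 100101111**1**** from the list in the question, and it makes no claim of being the smallest possible set of ternary patterns.

```python
def range_to_prefixes(lo, hi, width=16):
    """Cover the integer interval [lo, hi] with prefix wildcard patterns."""
    patterns = []
    while lo <= hi:
        # Largest power-of-two block that is aligned at lo.
        size = (lo & -lo) if lo else (1 << width)
        # Shrink the block until it no longer overshoots hi.
        while size > hi - lo + 1:
            size >>= 1
        k = size.bit_length() - 1            # number of wildcard bits
        prefix = format(lo >> k, "b").zfill(width - k) if k < width else ""
        patterns.append(prefix + "*" * k)
        lo += size
    return patterns


def matches(pattern, value):
    """Return True if value (an int) matches a pattern made of 0, 1 and *."""
    bits = format(value, "b").zfill(len(pattern))
    return len(bits) == len(pattern) and all(
        p in ("*", b) for p, b in zip(pattern, bits)
    )


if __name__ == "__main__":
    for p in range_to_prefixes(38798, 56637):
        print(p)
```

Each pattern covers a disjoint block, so there is no redundancy to filter out afterwards, and for a w-bit field the expansion needs at most 2w - 2 patterns. Merging some of these prefixes into non-prefix patterns would be a separate step on top of this.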

Related

Using large numbers in sql query

I have a field called size which is a BIGINT storing the number of bytes in a file. To get a file that is larger than 1GB I am currently doing:
size > (1024*1024*1024)
But this looks a bit hairy. Is there another way to write this that makes it more clear that the result of 1024*1024*1024 is 1GiB?
Additionally, is an exponent operator built into MySQL? I've used
select power(2, 30)
But I was wondering whether there is a shorthand to do that directly in the query, such as 2^30.
^ is the bitwise xor operator.
Either POW(2,30) or POWER(2,30) (or POWER(1024,3)) will work; I believe of the two POWER is the more standard. There is no typographic operator for exponentiation.
I would just leave it as 1024*1024*1024; to me that provides the best readability (and makes it clear it is 1 GiB, not 1 GB).
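To make the arithmetic concrete, here is a small Python check (not MySQL; the names GIB and exceeds_one_gib are just for illustration). Python's ^ happens to be bitwise XOR as well, so it shows why 2^30 is not a power:

```python
GIB = 1024 * 1024 * 1024   # 1 GiB in bytes


def exceeds_one_gib(size_in_bytes):
    """Mirror of the SQL condition: size > (1024*1024*1024)."""
    return size_in_bytes > GIB


if __name__ == "__main__":
    print(GIB == 2 ** 30)   # the product is the same number as 2 to the power 30
    print(2 ^ 30)           # XOR of 0b00010 and 0b11110, not exponentiation
```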

How can I determine which regular expressions from a list possibly overlap

I have a MySQL table of regular expressions that I match text against.
Is there a way, using MySQL or any other language (preferably Perl) that I can take this list of expressions and determine which of them MAY overlap. This should be independent of whatever text may be supplied to the expressions.
All of the expressions have anchors.
Here is an example of what I am trying to get:
Expressions:
^a$
^b$
^ab
^b.*c
^batch
^catch
Result:
'^b.*c' and '^batch' MAY overlap
Thoughts?
Thanks,
Scott
Further explanation:
I have a list of user-created regexes and an imported list of strings that are to be matched against the regexes. In this case the strings are "clean" data (ie they are not user-created but imported from another source - they must not change).
When a user adds to the list of regexes I do not want any collisions on either the existing list of strings nor any future strings (which can not be guessed ahead of time - the only constraints being they are ASCII printable characters no longer than 255 characters).
A brute-force method would be to create a "rainbow" table of all of the permutations of strings and each time a regex is added run all of the regexes against the rainbow table. However I'd like to avoid this (I'm not even sure of the cost) and so was wondering aloud as to the possibility of an algorithm that would AT LEAST show which regexes in a list MAY collide.
I will punt on full REs. Even limiting to BREs and/or MySQL-pre-8.0 will be challenging. Here are some thoughts.
If the RE is end-anchored and has no + or *, then calculate the length. The fixed length can be used as a discriminator. Also, it could be used for toning back the "brute force" by perhaps an order of magnitude.
Anything followed by + or * gets turned into .* for simplicity. (Re the "may collide" rule.)
Any RE with explicit characters (including those followed by +) becomes a discriminator in some situations. Eg, ^a.*b$ vs ^a.*c$.
For those anchored at the end, reverse the pattern and test it that way. (I don't know how difficult reversing is.)
If you can say that a particular character must be at any position, then use it as a discriminator: ^a.b.*c$ -- a in pos 1; b in pos 3; c at end. Perhaps this can be extended to character classes: ^\w may match, but ^\d and ^a.*\d$ can't.
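As a rough illustration of these ideas, here is a Python sketch for a deliberately tiny pattern subset: a leading ^, an optional trailing $, literal characters, and the dot. Following the simplification above, anything followed by + or * is treated as .* (which over-approximates, in line with the "may collide" rule). Escapes, character classes, alternation and groups are not handled. All names (tokenize, may_overlap, overlapping_pairs) are hypothetical. The check walks both patterns in lockstep and asks whether some string could satisfy both.

```python
from collections import deque
from itertools import combinations

STAR = "<star>"   # stands for .* (any run of characters, possibly empty)
ANY = "<any>"     # stands for a single arbitrary character


def tokenize(regex):
    """Turn a start-anchored pattern into a list of tokens."""
    body = regex[1:] if regex.startswith("^") else regex  # assumes ^ anchor
    anchored_end = body.endswith("$")
    if anchored_end:
        body = body[:-1]
    tokens, i = [], 0
    while i < len(body):
        if i + 1 < len(body) and body[i + 1] in "*+":
            tokens.append(STAR)        # x* and x+ are widened to .*
            i += 2
        elif body[i] == ".":
            tokens.append(ANY)
            i += 1
        else:
            tokens.append(body[i])     # literal character
            i += 1
    if not anchored_end:
        tokens.append(STAR)            # no $ means anything may follow
    return tokens


def may_overlap(regex_a, regex_b):
    """True if some string could match both simplified patterns."""
    a, b = tokenize(regex_a), tokenize(regex_b)
    seen = {(0, 0)}
    queue = deque(seen)
    while queue:
        i, j = queue.popleft()
        if i == len(a) and j == len(b):
            return True
        steps = []
        if i < len(a) and a[i] == STAR:
            steps.append((i + 1, j))   # leave the star without consuming
        if j < len(b) and b[j] == STAR:
            steps.append((i, j + 1))
        if i < len(a) and j < len(b):
            ta, tb = a[i], b[j]
            if ta in (STAR, ANY) or tb in (STAR, ANY) or ta == tb:
                # both sides consume the same character
                steps.append((i if ta == STAR else i + 1,
                              j if tb == STAR else j + 1))
        for state in steps:
            if state not in seen:
                seen.add(state)
                queue.append(state)
    return False


def overlapping_pairs(regexes):
    return [(x, y) for x, y in combinations(regexes, 2) if may_overlap(x, y)]


if __name__ == "__main__":
    exprs = ["^a$", "^b$", "^ab", "^b.*c", "^batch", "^catch"]
    print(overlapping_pairs(exprs))
```

Because repetition is widened to .*, a reported pair only means the two patterns may collide; a pair that is not reported cannot collide within this restricted subset.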

SQL string literal hexadecimal key to binary and back

After an extensive search I am resorting to Stack Overflow's wisdom to help me.
Problem:
I have a database table that should effectively store values of the format (UserKey, data0, data1, ..) where the UserKey is to be handled as primary key but at least as an index. The UserKey itself (externally defined) is a string of 32 characters representing a checksum, which happens to be (a very big) hexadecimal number, i.e. it looks like this UserKey = "000000003abc4f6e000000003abc4f6e".
Now I can certainly store this UserKey in a char(32) field, but I feel this is mighty inefficient, as I would store a series of in principle arbitrary characters, i.e. reserve space for more information per character than the 4 bits I need to store the hexadecimal characters (0..9, A-F).
So my thought was to convert this string literal into the hex-number it really represents, and store that. But this number (32*4 bits = 16Bytes) is much too big to store/handle as SQL only handles BIGINTS of 8Bytes.
My second thought was to convert this into a BINARY(16) representation, which should be compact and efficient concerning memory. However, I do not know how to efficiently convert between these two formats, as SQL also internally only handles numbers up to the maximum of 8 Bytes.
Maybe there is a way to convert this string to binary block by block and stitch the binary together somehow, in the way of:
UserKey == concat( stringblock1, stringblock2, ..)
UserKey_binary = concat( toBinary( stringblock1 ), toBinary( stringblock2 ), ..)
So my question is: is there any such mechanism foreseen in SQL that would solve this for me? How would a custom solution look like? (I find it hard to believe that I should be the first to encounter such a problem, as it has become quite modern to use ridiculously long hashkeys in many applications)
Also, the UserKey_binary should then act as the relational key for the table, so I hope for a bit of speed from this more compact representation, as differences have to be determined on a minimal number of bits. Additionally, I would like to do any conversion on the server side if possible, so that user scripts do not have to be altered (the user side should, if possible, still transmit a string literal, not [partially] converted values, in the insert statement).
Contrary to my previous statement, it seems that MySQL's UNHEX() function converts a string block by block and concatenates the results, much like I described above, so the method also works for hex literals that are larger than BIGINT's 8-byte limit. Here is an example table that illustrates this:
CREATE TABLE `testdb`.`tab` (
`hexcol_binary` BINARY(16) GENERATED ALWAYS AS (UNHEX(charcol)) STORED,
`charcol` CHAR(32) NOT NULL,
PRIMARY KEY (`hexcol_binary`));
The primary key is a generated column, so that updates to charcol are the designated way of interacting with the table with string literals from the outside:
REPLACE into tab (charcol) VALUES ('1010202030304040A0A0B0B0C0C0D0D0');
SELECT HEX(hexcol_binary) as HEXstring, tab.* FROM tab;
As seen, building keys and indexes on hexcol_binary works as intended.
To verify the speedup, take
ALTER TABLE `testdb`.`tab`
ADD INDEX `charkey` (`charcol` ASC);
EXPLAIN SELECT * from tab where hexcol_binary = UNHEX('1010202030304040A0A0B0B0C0C0D0D0'); # key_len 16
EXPLAIN SELECT * from tab where charcol = '1010202030304040A0A0B0B0C0C0D0D0'; # key_len 97
The lookup on the hexcol_binary column performs much better, especially if it is additionally made unique.
Note: the hex conversion does not care if the hex-characters A through F are capitalized or not for the conversion process, however the charcol will be very sensitive to this.
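For comparison, the same round trip can be sketched outside the database. This is Python, not MySQL, and the helper names hex_key_to_bytes and bytes_to_hex_key are made up; it only illustrates that a 32-character hex key packs into 16 bytes and that the conversion ignores the case of A through F:

```python
def hex_key_to_bytes(user_key):
    """Pack a 32-character hex string into 16 raw bytes (like UNHEX)."""
    if len(user_key) != 32:
        raise ValueError("expected a 32-character hex key")
    return bytes.fromhex(user_key)


def bytes_to_hex_key(raw):
    """Turn 16 raw bytes back into an upper-case hex string (like HEX)."""
    return raw.hex().upper()


if __name__ == "__main__":
    key = "000000003abc4f6e000000003abc4f6e"
    packed = hex_key_to_bytes(key)
    print(len(packed))
    print(bytes_to_hex_key(packed))
    print(hex_key_to_bytes(key.upper()) == packed)
```

Note that the string coming back is upper-case, so comparing it against the original text would have to ignore case or normalize both sides first.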

Will hashing two equal strings give the same hash value

I need to anonymize personal data in our MySQL database. The problem is that I still need to be able to link two persons together after they have been anonymized.
I thought this could be done by hashing their social security number or e-mail address, which lead to my question:
When hashing two equal strings (s1 and s2) I get two hash values (h1 and h2). How sure can I be that:
1) the hash values are equal (h1 = h2)
2) no unequal string (s3 != s1) will produce the same hash value
1) Same strings will always produce equal hash values
2) Different strings can theoretically produce the same hash if you choose a small hash length compared to the data volume. But using the default hash lengths (32 or 40) won't cause such problems.
1) (h1 = h2) is always true for equal strings (s1 and s2) per definition, when using a correct hash function.
2) Two different strings can have the same hash value. This is called a "collision". The probability depends on the hash function used and the length of the resulting hash. For MD5, for example, there are websites and tables for finding collisions, which is quite interesting.
I'm not sure what you mean by linking persons together or what your requirements are, so I cannot help you with that. But you could link two persons together with their ids.
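Point 1 is easy to check for yourself. A minimal Python sketch, assuming SHA-256 as the hash function (the question does not name one) and using a made-up helper called pseudonym with example addresses:

```python
import hashlib


def pseudonym(value):
    """Return the SHA-256 hex digest of a string (UTF-8 encoded)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    h1 = pseudonym("alice@example.com")
    h2 = pseudonym("alice@example.com")
    h3 = pseudonym("bob@example.com")
    print(h1 == h2)   # equal input, equal hash
    print(h1 == h3)   # different input
    print(len(h1))    # digest length in hex characters
```

The hash is computed over the exact bytes, so two spellings of the same address that differ in case or whitespace would count as different strings and would no longer link up.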

using binary text comparison in mysql - efficiency pitfalls?

I'm using binary string comparison on user names (defined as varchar) to be sure the string matches exactly:
... where binary Owner = '$user' ...
or
... from Records join Users on binary Records.Owner = Users.User ....
But I'm not sure if it has some (either positive or negative) impact on efficiency. The manual states:
Note that in some contexts, if you cast an indexed column to BINARY,
MySQL is not able to use the index efficiently.
Is it an issue in this case? I would expect binary to be faster on the contrary, because it doesn't have to ignore case, whitespace, some accents etc.
Indeed, it's possible for an expression in a WHERE clause to make it impossible to use an index on a column. You wouldn't expect an index on a float column called x to be any use for searching
WHERE SIN(x) BETWEEN 0.0 and 0.1
for example.
The answer to your problem is to define your Owner column to have a binary collation rather than a case-insensitive collation. See here for how to do that: http://dev.mysql.com/doc/refman/5.0/en/charset-column.html