Creating a hash of a string that's sortable - language-agnostic

Is there any way to create hashes of strings such that the hashes can be sorted and give the same order as if the strings themselves were sorted?

This won't be possible, at least if you allow strings longer than the hash size: by the pigeonhole principle, you have 256^(max. string size) possible strings mapped onto only 256^(hash size) hash values, so some distinct strings must share a hash value and end up unsorted relative to each other.
Just imagine the simplest such hash: truncating every string to (hash size) bytes.
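To make the pigeonhole argument concrete, here is a trivial Python sketch of that truncation "hash" (the example strings are mine):
def trunc_hash(s, size=4):
    # Truncate to `size` bytes: trivially order-friendly, but lossy.
    return s[:size]

a, b = "apple-pie", "apple-tart"       # a < b as strings
print(trunc_hash(a) == trunc_hash(b))  # True: both hash to "appl"
# With equal hashes, no sort on the hashes can recover that a < b.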

Yes. It's called using the entire input string as the hash.

As others have pointed out, it's not practical to do exactly what you've asked. You'd have to use the string itself as the hash, which would constrain the lengths of strings that could be "hashed", and so on.
The obvious approach to maintaining a "sorted hash" data structure would be to maintain both a sorted list (heap or binary tree, for example) and a hashed mapping of the data. Inserts and removals would be O(log(n)) while retrievals would be O(1). Off hand I'm not sure how often this would be worth the additional complexity and overhead.
If you had a particularly large data set, mostly read-only, and such that logarithmic-time retrieval was overly expensive, then I suppose it might be useful. Note that the cost of updates is actually the sum of the constant-time (hash) and the logarithmic-time (binary tree or heap) operations. However, O(1) + O(log(n)) reduces to the larger of the two terms during asymptotic analysis. (The underlying cost is still there, and it is relevant to any implementation effort regardless of its theoretical irrelevance.)
For a significant range of data set sizes, the cost of maintaining this hypothetical hybrid data structure could be estimated as "twice" the cost of maintaining either of the pure ones. (In other words, many implementations of a binary tree can scale to billions of elements (2^32 or so) with time costs comparable to those of typical hash functions.) So I'd be hard-pressed to convince myself that the added code complexity and run-time cost of such a hybrid data structure would actually benefit a given project.
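To illustrate, here is a minimal Python sketch of such a hybrid (the class and its names are mine; a production version would use a balanced tree, since inserting into a flat sorted list shifts elements in O(n)):
import bisect

class SortedHash:
    # Hash map for O(1) lookups plus a sorted key list for ordered traversal.
    def __init__(self):
        self._map = {}    # key -> value: constant-time retrieval
        self._keys = []   # keys, kept sorted at all times

    def insert(self, key, value):
        if key not in self._map:
            bisect.insort(self._keys, key)  # O(n) here; O(log n) with a tree
        self._map[key] = value

    def remove(self, key):
        del self._map[key]
        self._keys.pop(bisect.bisect_left(self._keys, key))

    def get(self, key):
        return self._map[key]

    def sorted_items(self):
        return [(k, self._map[k]) for k in self._keys]

sh = SortedHash()
for k in ["pear", "apple", "fig"]:
    sh.insert(k, len(k))
print(sh.get("fig"))      # 3 -- hash-side lookup
print(sh.sorted_items())  # [('apple', 5), ('fig', 3), ('pear', 4)]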
(Note: I saw that Python 3.1.1 added the notion of "ordered" dictionaries ... and this is similar to being sorted, but not quite the same. From what I gather, the ordered dictionary preserves the order in which elements were inserted into the collection. I also seem to remember some talk of "views" ... objects in the language which can access keys of a dictionary in some particular manner (sorted, reversed, reverse sorted, ...) at (possibly) lower cost than passing the set of keys through the built-in "sorted()" and "reversed()". I haven't used these, nor have I looked at the implementation details. I would guess that one of these "views" would be something like a lazily evaluated index, performing the necessary sorting on call, and storing the results with some sort of flag or trigger (observer pattern or listener) that's reset when the back-end source collection is updated. In that scheme a call to the "view" would update its index; subsequent calls would be able to use those results as long as no insertions or deletions had been made to the dictionary. Any call to the view subsequent to key changes would incur the cost of updating the view. However, this is all pure speculation on my part. I mention it because it might also provide insight into some alternative ways to approach the question.)

Not unless there are fewer strings than hashes, and the hashes are perfect. Even then, you still have to ensure the hash order is the same as the string order; this is probably not possible unless you know all the strings ahead of time.

No. The hash would have to contain the same amount of information as the string it is replacing. Otherwise, if two strings mapped to the same hash value, how could you possibly sort them?
Another way of thinking about it is this: If I have two strings, "a" and "b", then I hash both of them with this sort preserving hash function and get f(a) and f(b). However, there are an infinite number of strings that are greater than "a" but less than "b". This would require hashing the strings to arbitrary precision Real values (because of cardinality). In the end, you would basically just have the string encoded as a number.
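To make the "encoded as a number" idea concrete, here is a small Python sketch (my own construction): read the bytes as digits of a base-257 fraction, using digit b+1 so that a prefix sorts before its extensions:
from fractions import Fraction

def as_real(s):
    # Exact rational in [0, 1) whose ordering matches string ordering.
    x = Fraction(0)
    for i, b in enumerate(s.encode("utf-8")):
        x += Fraction(b + 1, 257 ** (i + 1))
    return x

words = ["b", "ab", "a", "aa"]
print(sorted(words) == sorted(words, key=as_real))  # True
Note that this only restates the problem: the "hash" needs arbitrary precision, i.e. as much information as the string itself.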

You're essentially asking if you can compress the key strings into smaller keys while preserving their collation order. So it depends on your data. If your strings are composed of only hexadecimal digits, for example, they can be replaced with 4-bit codes.
But for the general case, it can't be done. You'd end up "hashing" each source key into itself.
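For the hexadecimal case, a short Python sketch (it assumes even-length, consistently lower-cased keys):
def pack_hex(s):
    # Two hex digits per byte: each character shrinks to a 4-bit code.
    return bytes.fromhex(s)

keys = ["00ff", "0a0a", "ffee", "0001"]
packed = [pack_hex(k) for k in keys]
# Byte-wise comparison of the packed keys preserves the string collation:
print([p.hex() for p in sorted(packed)] == sorted(keys))  # True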

I stumbled upon this, and although everyone's answers are correct, I needed a solution exactly like this to use in Elasticsearch (don't ask why). Sometimes we don't need a perfect solution for all cases; we just need one that works within acceptable constraints. My solution is able to generate a sortable hash code for the first n chars of the string; I did some preliminary tests and didn't have any collisions. You need to define beforehand the charset that is used, tune n to the number of leading characters needed for sorting, and try to keep the resulting hash code within the positive interval of the chosen type. In my case, with Java's long type, I could go up to 13 chars.
Below is my code in Java; hopefully it will help someone else who needs this.
String charset = "abcdefghijklmnopqrstuvwxyz";

public long orderedHash(final String s, final String charset, final int n) {
    long hash = 0L;
    if (s.isEmpty() || n == 0)
        return hash;
    // Position of the leading character within the ordered charset.
    long charIndex = charset.indexOf(s.charAt(0));
    if (charIndex == -1)
        return hash;
    // Weight the leading character by the number of strings that can
    // follow it: charset.length()^i for i = 1 .. n-1.
    for (int i = 1; i < n; i++)
        hash += (long) (charIndex * Math.pow(charset.length(), i));
    // The +1 makes a present character sort after the empty string;
    // then recurse on the rest of the string with one slot fewer.
    hash += charIndex + 1 + orderedHash(s.substring(1), charset, n - 1);
    return hash;
}
Examples:
orderedHash("a", charset, 13) // 1
orderedHash("abc", charset, 13) // 4110785825426312
orderedHash("b", charset, 13) // 99246114928149464
orderedHash("google", charset, 13) // 651008600709057847
orderedHash("stackoverflow", charset, 13) // 1858969664686174756
orderedHash("stackunderflow", charset, 13) // 1858969712216171093
orderedHash("stackunderflo", charset, 13) // 1858969712216171093 same, 13 chars limitation
orderedHash("z", charset, 13) // 2481152873203736576
orderedHash("zzzzzzzzzzzzz", charset, 13) // 2580398988131886038
orderedHash("zzzzzzzzzzzzzz", charset, 14) // -4161820175519153195 no good, overflow
orderedHash("ZZZZZZZZZZZZZ", charset, 13) // 0 no good, not in charset
If more precision is needed, use an unsigned type, or a composite one made of two longs for example, and compute the hash code over substrings.
Edit: Although the previous algorithm sufficed for my use, I noticed that it was not ordering strings correctly if their length was not greater than the chosen n. With this new algorithm it should be OK now.

Related

SQL string literal hexadecimal key to binary and back

After extensive searching I am resorting to Stack Overflow's wisdom to help me.
Problem:
I have a database table that should effectively store values of the format (UserKey, data0, data1, ..) where the UserKey is to be handled as the primary key, or at least as an index. The UserKey itself (externally defined) is a string of 32 characters representing a checksum, which happens to be (a very big) hexadecimal number, i.e. it looks like this: UserKey = "000000003abc4f6e000000003abc4f6e".
Now I can certainly store this UserKey in a char(32) field, but this feels mighty inefficient, as I would store a series of in-principle arbitrary characters, i.e. reserve space for far more information per character than the 4 bits I need to store the hexadecimal characters (0-9, A-F).
So my thought was to convert this string literal into the hex number it really represents, and store that. But this number (32 * 4 bits = 16 bytes) is much too big to store and handle, as SQL only handles BIGINTs of 8 bytes.
My second thought was to convert this into a BINARY(16) representation, which should be compact and memory-efficient. However, I do not know how to convert efficiently between these two formats, as SQL internally also only handles numbers up to a maximum of 8 bytes.
Maybe there is a way to convert this string to binary block by block and stitch the binary together somehow, in the way of:
UserKey == concat( stringblock1, stringblock2, ..)
UserKey_binary = concat( toBinary( stringblock1 ), toBinary( stringblock2 ), ..)
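(Outside SQL, the block-wise idea is easy to sanity-check; a quick Python sketch with the example key:)
user_key = "000000003abc4f6e000000003abc4f6e"
block1, block2 = user_key[:16], user_key[16:]
# Converting per block and concatenating the binary pieces...
by_blocks = bytes.fromhex(block1) + bytes.fromhex(block2)
# ...yields the same 16 bytes as converting the whole literal at once:
print(by_blocks == bytes.fromhex(user_key))  # True
print(len(by_blocks))                        # 16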
So my question is: is there any such mechanism foreseen in SQL that would solve this for me? What would a custom solution look like? (I find it hard to believe that I should be the first to encounter such a problem, as it has become quite common to use ridiculously long hash keys in many applications.)
Also, the UserKey_binary should then act as the relational key for the table, so I hope for a bit of speed from this more compact representation, as differences can be determined on a minimal number of bits. Additionally, I want to mention that I would like to do any conversion, if possible, on the server side, so that user scripts do not have to be altered (the user side should, if possible, still transmit a string literal, not [partially] converted values, in the INSERT statement).
In contradiction to my previous statement, it seems that MySQL's UNHEX() function converts a string block by block and concatenates the results, much as I described above, so the method also works for HEX literal values bigger than BIGINT's 8-byte limitation. Here is an example table that illustrates this:
CREATE TABLE `testdb`.`tab` (
  `hexcol_binary` BINARY(16) GENERATED ALWAYS AS (UNHEX(charcol)) STORED,
  `charcol` CHAR(32) NOT NULL,
  PRIMARY KEY (`hexcol_binary`));
The primary key is a generated column, so that updates to charcol are the designated way of interacting with the table via string literals from the outside:
REPLACE into tab (charcol) VALUES ('1010202030304040A0A0B0B0C0C0D0D0');
SELECT HEX(hexcol_binary) as HEXstring, tab.* FROM tab;
As seen, building keys and indexes on hexcol_binary works as intended.
To verify the speedup, take
ALTER TABLE `testdb`.`tab`
ADD INDEX `charkey` (`charcol` ASC);
EXPLAIN SELECT * from tab where hexcol_binary = UNHEX('1010202030304040A0A0B0B0C0C0D0D0') #keylength 16
EXPLAIN SELECT * from tab where charcol = '1010202030304040A0A0B0B0C0C0D0D0' #keylength 97
The lookup on the hexcol_binary column performs much better, especially if it's additionally made unique.
Note: the hex conversion does not care whether the hex characters A through F are capitalized or not, however charcol itself will be very sensitive to this.

json boolean vs integer - which takes up less space?

When sending a value in JSON over the wire, is it better to use a boolean or an integer to use up less space?
e.g:
{
foo: false
}
Or:
{
foo: 0
}
Would using a number use less space, considering it's just one digit, compared to 4 or 5 characters for a boolean value (true/false)?
Also is there a speed difference between the two approaches if you convert them from JSON to object format?
Firstly, this is micro-optimisation, and very unlikely to be important. If you are transporting thousands or millions of such values, it might become significant; but in that case, you probably want something much more efficient than JSON anyway (a plain CSV would be better in many cases, but ideally you'd use some packed binary format).
Secondly, JSON is a way of representing data in a string; so storing or sending JSON means you are storing or sending strings. Measuring the size of the data is therefore trivial: how long is the string? The string 0 has one character; the string false has five characters.
Thirdly, if you're optimising for space, you'd remove all insignificant whitespace, so your examples should be {"foo":false} (13 characters) and {"foo":0} (9 characters). Note that you can't, as you have in your example, skip the quote marks around foo - that is not valid JSON.
Fourthly, how much memory or other resources the structure will take up when you convert it from JSON into an object depends on what language you're using, what implementation of that language, and any number of other factors, so is completely unanswerable (and, again, a micro-optimisation that is very unlikely to be important).
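The byte counts are easy to verify; for example, a quick check in Python:
import json

compact = {"separators": (",", ":")}  # drop insignificant whitespace
print(len(json.dumps({"foo": False}, **compact)))  # 13 -> {"foo":false}
print(len(json.dumps({"foo": 0}, **compact)))      # 9  -> {"foo":0}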
I think the integer is a better solution because, besides using less space (and consequently being potentially faster to parse), it is also more future-proof. Someone can easily convert it into a three-(or more-)state variable if needed just by assigning other values like -1, 2, 3, ..., while the conversion from a boolean would be less straightforward.

Will hashing two equal strings give the same hash value?

I need to anonymize personal data in our MySQL database. The problem is that I still need to be able to link two persons together after they have been anonymized.
I thought this could be done by hashing their social security number or e-mail address, which lead to my question:
When hashing two equal strings (s1 and s2) I get two hash values (h1 and h2); how sure can I be that:
1) the hash values are equal (h1 = h2)
2) no unequal string (s3 ≠ s1) will produce the same hash value
1) The same string will always produce the same hash value.
2) Different strings theoretically might produce the same hash if you choose a small hash length compared to the data volume, but using default hash lengths (32 or 40 hex digits, e.g. MD5 or SHA-1) won't cause such problems.
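A quick demonstration with Python's hashlib (the social-security value is a made-up example; SHA-256 yields 64 hex characters, so accidental collisions are no practical concern):
import hashlib

def anonymize(s):
    # Deterministic: the same input always yields the same digest.
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

print(anonymize("555-12-3456") == anonymize("555-12-3456"))  # True: h1 == h2
print(anonymize("555-12-3456") == anonymize("555-12-3457"))  # False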
1) (h1 = h2) is always true for equal strings (s1 and s2), by definition, when using a correct hash function.
2) Two different strings can have the same hash value. This is called a "collision". The probability depends on the hash function used and the length of the resulting hash. For MD5, for example, there are websites and tables for finding collisions, which is quite interesting.
I'm not sure what you mean by linking persons together or what your requirements are, so I cannot help you with that. But you could link two persons together with their ids.

Octave force deepcopy

The question
What are the ways of coercing octave to create a real copy of whatever object? Structures are the main interest.
My underlying problem
In my problem I'm obtaining a rather large structure from another function in a loop but for the current task only a few pieces of it are needed. For example:
for i = 1:many
  res = solver(params);
  store1{i} = res.string1;
  store2{i} = res.arr(:,1);
end
res is a sizable chunk of data, and due to lazy copying those stores are references to tiny portions of bytes within that chunk. After I store those tiny portions I don't need res itself; however, since the middle of that chunk is referenced by the stores, the memory area is unfit for the res obtained on the next iteration (they are of the same size), and thus another sizable piece of memory is allocated, which is then again pinned by a few tiny references, and so on.
Without storing parts of res, the program successfully keeps the memory consumption same after first couple of iterations.
So how do I make a complete copy of structure field?
I've tried using struct-related functions like rmfield but those keep references instead of their own objects.
I've tried to wrap the assignment in its own function:
new_struct = copy(rmfield(old_struct, "bigdata"));

function c = copy(a)
  c = a;
end
This by the way doesn't work even for arrays.
I'm interested in method applicable to any generic variable.
Minimal working example of the problem
a = cell(3,1);
for i = 1:length(a)
  r = rand(100000,1000);
  a{i} = r(1:100,end);
  whos; fflush(stdout);
  pause(2);
end
The above code will cause memory usage to grow gradually, by far more than the 8.08 kB reported by whos, due to the references stored in a{i} keeping a bigger memory block alive than they actually need. If you force a proper copy, the problem is not present.
Numerical arrays
For numeric types, adding zero is enough to force creation of a new array:
c=a+0;
Strings
For a string, which is a 1 x n char array, something along the following lines will work:
c=[a "a"](1:end-1);
Multidimensional char arrays will require concatenation with a column:
c=[a true(size(a,1),1)](:,1:end-1);
Here true is used to generate a dummy array of a size compatible with char (there seems to be no procedural method of generating a char array of arbitrary size). char(zeros(size(a,1),1)) and char(true(size(a,1),1)) caused excess memory usage during their creation on some calls.
Note that the empty concatenation c=[a ""]; will not result in a copy. It is also possible to do c=[a+0 ""];, which will result in a copy due to the +0, but that one infers type conversions to and from double, which is 8 times larger in size (char(zeros( doesn't seem to cause that).
Other types
In general you can use casting, for the types that allow it, in order not to tailor the expressions manually as I had to do above:
typelist = {"double", "single", "char"}; % full list of supported types is available in the link
class_of_a = typelist{ cellfun(@(t) isa(a, t), typelist) };
% append one dummy single, drop it again, cast back (assumes column vectors):
c = typecast( [typecast(a, "single"); single(1)](1:end-1), class_of_a );
single is seemingly the smallest datatype available in Octave.
Note that logical is not supported by this method.
Copying structures
Apparently you'd have to write your own function that walks the struct fields, copies them with the above methods, and recurses into substructs.
(As it doesn't involve complexities relevant here, I'd rather leave that to be done by those who actually need it, my own problem being solved by +0's.)

Hash Index algo in MySQL

I was reading an article on hash indexing, and it seems that it is similar to PHP's md5 function, in that both take a string value and return an integer representing that string, and this representation is consistent. Is this similarity really there, or am I missing something? Also, does anybody have an idea about the hashing algorithm MySQL employs for its hash-based index structure?
I'm not pretending to give a complete description of MySQL's algorithm, but a few things can be guessed.
First of all, the Hash table wiki article is a must-read. Then we have a notice from the MySQL documentation:
They are used only for equality comparisons that use the = or <=> operators (but are very fast). They are not used for comparison operators such as < that find a range of values. Systems that rely on this type of single-value lookup are known as "key-value stores"; to use MySQL for such applications, use hash indexes wherever possible.
The optimizer cannot use a hash index to speed up ORDER BY operations. (This type of index cannot be used to search for the next entry in order.)
MySQL cannot determine approximately how many rows there are between two values (this is used by the range optimizer to decide which index to use). This may affect some queries if you change a MyISAM table to a hash-indexed MEMORY table.
Only whole keys can be used to search for a row. (With a B-tree index, any leftmost prefix of the key can be used to find rows.)
This points to the following (rather common) properties:
The MySQL hash function operates on a fixed-length "full-key" record (it is a question, though, how varchars are treated; they might, e.g., be padded with zeros up to the maximum length).
There is a max_heap_table_size global value and a MAX_ROWS parameter that the engine is likely to use when guessing the upper row count for the hash function.
MySQL allows non-unique keys, but warns about proportional slowdowns. At least this may tell that there is no second hash function, but a mere linked list used for collision resolution, as sketched below.
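A schematic Python sketch of that kind of chaining (a toy table of my own, not MySQL's code):
class ChainedHash:
    # Toy hash table: one chain (list) per bucket.
    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def insert(self, key, row):
        # Non-unique keys pile up in the same chain, so lookups
        # slow down in proportion to the chain length.
        self.buckets[hash(key) % len(self.buckets)].append((key, row))

    def find_all(self, key):
        return [row for k, row in self.buckets[hash(key) % len(self.buckets)]
                if k == key]

t = ChainedHash()
t.insert("id1", "row A")
t.insert("id1", "row B")   # duplicate key: the same chain grows
print(t.find_all("id1"))   # ['row A', 'row B']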
As for the actual function used, I don't think there is much to tell. MySQL may even use different functions according to some key heuristics (e.g. one for mostly sequential data, such as an ID, and another for CHARs), and of course its output is changed according to the estimated row count. However, you should only consider hash indices when a BTREE cannot give you good enough performance, or when you just never use any of its advantages, which is, I suppose, a rare case.
UPDATE
A bit into the sources: /storage/heap/hp_hash.c contains a few implementations of hash functions. At least it was a correct assumption that they use different techniques for different types, as can be seen for TEXT and VARCHAR:
/*
 * Fowler/Noll/Vo hash
 *
 * The basis of the hash algorithm was taken from an idea sent by email to the
 * IEEE Posix P1003.2 mailing list from Phong Vo (kpv@research.att.com) and
 * Glenn Fowler (gsf@research.att.com). Landon Curt Noll (chongo@toad.com)
 * later improved on their algorithm.
 *
 * The magic is in the interesting relationship between the special prime
 * 16777619 (2^24 + 403) and 2^32 and 2^8.
 *
 * This hash produces the fewest collisions of any function that we've seen so
 * far, and works well on both numbers and strings.
 */
I'll try to give a simplified explanation.
ulong nr= 1, nr2= 4;
for (seg=keydef->seg,endseg=seg+keydef->keysegs ; seg < endseg ; seg++)
Every part of a compound key is processed separately; the result is accumulated in nr.
if (seg->null_bit)
{
  if (rec[seg->null_pos] & seg->null_bit)
  {
    nr^= (nr << 1) | 1;
    continue;
  }
}
NULL values are treated separately.
if (seg->type == HA_KEYTYPE_TEXT)
{
  uint char_length= seg->length; /* TODO: fix to use my_charpos() */
  seg->charset->coll->hash_sort(seg->charset, pos, char_length,
                                &nr, &nr2);
}
else if (seg->type == HA_KEYTYPE_VARTEXT1) /* Any VARCHAR segments */
{
  uint pack_length= seg->bit_start;
  uint length= (pack_length == 1 ? (uint) *(uchar*) pos : uint2korr(pos));
  seg->charset->coll->hash_sort(seg->charset, pos+pack_length,
                                length, &nr, &nr2);
}
So are TEXT and VARCHAR: hash_sort is presumably some other function that takes collation into account. VARCHARs have a prefixed 1- or 2-byte length.
else
{
  uchar *end= pos+seg->length;
  for ( ; pos < end ; pos++)
  {
    nr *=16777619;
    nr ^=(uint) *pos;
  }
}
And every other type is treated byte by byte, with multiplication and XOR.
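That inner loop is essentially FNV-1 (multiply by the prime, then XOR in the byte), here seeded with 1 rather than the usual FNV offset basis. A Python transliteration of just that byte loop, assuming a 32-bit accumulator:
FNV_PRIME = 16777619  # 2^24 + 403, the "special prime" from the comment above

def heap_hash_bytes(data, nr=1):
    # Mirrors the else-branch: nr *= 16777619; nr ^= byte, per byte.
    for byte in data:
        nr = (nr * FNV_PRIME) & 0xFFFFFFFF  # emulate fixed-width C arithmetic
        nr ^= byte
    return nr

print(heap_hash_bytes(b"\x2a\x00\x00\x00"))  # e.g. a 4-byte little-endian INT key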