How does the get value function work in sparse Merkle trees? - ethereum

I have just started reading about sparse Merkle trees and I came across a function (get value) which is used to find the value for a specified key. I can't find an explanation on the internet of how the get value function works.
My understanding is that each node is 256 bits, so there can be 2^256 leaf nodes, and keys are indexed. So we start from the root and keep choosing the left or right node based on whether the bit is 0 or 1, but I'm not able to understand the v = db.get(v)[32:] statement. How does it lead me to the value for the key provided?
def get(db, root, key):
    v = root
    path = key_to_path(key)
    for i in range(256):
        if (path >> 255) & 1:
            v = db.get(v)[32:]
        else:
            v = db.get(v)[:32]
        path <<= 1
    return v

"A Merkle tree [21] is a binary tree that incorporates the use of cryptographic hash
functions. One or many attributes are inserted into the leaves, and every node
derives a digest which is recursively dependent on all attributes in its subtree.
That is, leaves compute the hash of their own attributes, and parents derive the
hash of their children’s digests concatenated left-to-right."
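This is a quotation from "https://eprint.iacr.org/2016/683.pdf"
Each hash has a roadmap to all its dependent hashes: judging by the slicing in the code, db.get(v) returns the 64-byte concatenation of a node's two child hashes, so [:32] selects the left child and [32:] selects the right child.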
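To make the db.get(v)[32:] step concrete, here is a minimal sketch of a sparse Merkle tree in Python. It is not taken from the question's source: it assumes SHA-256 as the hash, a plain dict as db, 32-byte values stored directly at the leaves, 32 zero bytes as the empty leaf, and a key_to_path that reads the key as a 256-bit big-endian integer. The helpers new_tree and update are hypothetical and exist only so that get has something to walk.

```python
import hashlib

DEPTH = 256
EMPTY_LEAF = b"\x00" * 32  # assumption: an empty leaf is 32 zero bytes


def sha(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def key_to_path(key: bytes) -> int:
    # Assumption: the 32-byte key is read as a 256-bit big-endian integer.
    return int.from_bytes(key, "big")


def new_tree(db: dict) -> bytes:
    """Hypothetical helper: store an all-empty tree and return its root hash."""
    v = EMPTY_LEAF
    for _ in range(DEPTH):
        db[sha(v + v)] = v + v  # parent hash -> left child || right child
        v = sha(v + v)
    return v


def get(db: dict, root: bytes, key: bytes) -> bytes:
    v = root
    path = key_to_path(key)
    for _ in range(DEPTH):
        children = db[v]          # 64 bytes: left hash, then right hash
        if (path >> 255) & 1:
            v = children[32:]     # bit is 1: descend into the right child
        else:
            v = children[:32]     # bit is 0: descend into the left child
        path <<= 1
    return v                      # after 256 steps, v is the leaf value


def update(db: dict, root: bytes, key: bytes, value: bytes) -> bytes:
    """Hypothetical helper: return the root of a tree where key -> value."""
    path = key_to_path(key)
    v, siblings = root, []
    for depth in range(DEPTH):    # walk down, remembering each sibling
        left, right = db[v][:32], db[v][32:]
        if (path >> (255 - depth)) & 1:
            siblings.append(left)
            v = right
        else:
            siblings.append(right)
            v = left
    v = value
    for depth in reversed(range(DEPTH)):  # rebuild the hashes bottom-up
        if (path >> (255 - depth)) & 1:
            node = siblings[depth] + v
        else:
            node = v + siblings[depth]
        v = sha(node)
        db[v] = node
    return v
```

With this layout, each of the 256 loop iterations in get consumes one bit of the key and replaces v with the hash of the chosen child, so the final v is whatever was stored at the leaf.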

Related

Generate unique serial from id number

I have a database that increases id incrementally. I need a function that converts that id to a unique number between 0 and 1000. (the actual max is much larger but just for simplicity's sake.)
1 => 3301,
2 => 0234,
3 => 7928,
4 => 9821
The number generated cannot have duplicates.
It can not be incremental.
Need it generated on the fly (not create a table of uniform numbers to read from)
I thought of a hash function, but there is a possibility of collisions.
Random numbers could also have duplicates.
I need a minimal perfect hash function but cannot find a simple solution.
Since the criteria are sort of vague (good enough to fool the average person), I am unsure exactly which route to take. Here are some ideas:
You could use a Pearson hash. According to the Wikipedia page:
Given a small, privileged set of inputs (e.g., reserved words for a compiler), the permutation table can be adjusted so that those inputs yield distinct hash values, producing what is called a perfect hash function.
You could just use a complicated looking one-to-one mathematical function. The drawback of this is that it would be difficult to make one that was not strictly increasing or strictly decreasing due to the one-to-one requirement. If you did something like (id ^ 2) + id * 2, the interval between ids would change and it wouldn't be immediately obvious what the function was without knowing the original ids.
You could do something like this:
new_id = (old_id << 4) + arbitrary_4bit_hash(old_id);
This would give the unique IDs and it wouldn't be immediately obvious that the first 4 bits are just garbage (especially when reading the numbers in decimal format). Like the last option, the new IDs would be in the same order as the old ones. I don't know if that would be a problem.
You could just hardcode all ID conversions by making a lookup array full of "random" numbers.
You could use some kind of hash function generator like gperf.
GNU gperf is a perfect hash function generator. For a given list of strings, it produces a hash function and hash table, in form of C or C++ code, for looking up a value depending on the input string. The hash function is perfect, which means that the hash table has no collisions, and the hash table lookup needs a single string comparison only.
You could encrypt the ids with a key using a cryptographically secure mechanism.
Hopefully one of these works for you.
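As a rough illustration of the shift-and-append idea from the list above (new_id = (old_id << 4) + arbitrary_4bit_hash(old_id)), here is a small Python sketch. The function four_bit_tag is a hypothetical stand-in for arbitrary_4bit_hash; any function that returns a value from 0 to 15 would do, because uniqueness comes from the shifted id, not from the tag.

```python
def four_bit_tag(old_id: int) -> int:
    # Hypothetical stand-in for arbitrary_4bit_hash: returns a value in 0..15.
    return (old_id * 11 + 5) & 0xF


def encode_id(old_id: int) -> int:
    # The upper bits carry the real id; the low 4 bits are filler.
    return (old_id << 4) + four_bit_tag(old_id)


def decode_id(new_id: int) -> int:
    # Dropping the low 4 bits recovers the original id.
    return new_id >> 4
```

Because the original id survives in the upper bits, two different ids can never collide, and the ordering of the new ids matches the ordering of the old ones, as noted above.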
Update
Here is the rotational shift the OP requested:
function map($number)
{
    // Shift the high bits down to the low end and the low bits
    // up to the high end.
    // Also, mask out all but 10 bits. This allows unique mappings
    // from 0-1023 to 0-1023
    $high_bits = 0b0000001111111000 & $number;
    $new_low_bits = $high_bits >> 3;
    $low_bits = 0b0000000000000111 & $number;
    $new_high_bits = $low_bits << 7;
    // Recombine bits
    $new_number = $new_high_bits | $new_low_bits;
    return $new_number;
}
function demap($number)
{
    // Shift the high bits down to the low end and the low bits
    // up to the high end
    $high_bits = 0b0000001110000000 & $number;
    $new_low_bits = $high_bits >> 7;
    $low_bits = 0b0000000001111111 & $number;
    $new_high_bits = $low_bits << 3;
    // Recombine bits
    $new_number = $new_high_bits | $new_low_bits;
    return $new_number;
}
This method has its advantages and disadvantages. The main disadvantage that I can think of (besides the security aspect) is that for lower IDs consecutive numbers will be exactly the same (multiplicative) interval apart until digits start wrapping around. That is to say
map(1) * 2 == map(2)
map(1) * 3 == map(3)
This happens, of course, because with lower numbers, all the higher bits are 0, so the map function is equivalent to just shifting. This is why I suggested using pseudo-random data for the lower bits rather than the higher bits of the number. It would make the regular interval less noticeable. To help mitigate this problem, the function I wrote shifts only the first 3 bits and rotates the rest. By doing this, the regular interval will be less noticeable for all IDs greater than 7.
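For readers who want to check these properties, the following is a Python transcription of the two PHP functions above (the names map_id and demap_id are my own). It is only a sketch for experimenting with the 10-bit rotation, not a recommendation for production use.

```python
def map_id(number: int) -> int:
    # Bits 3..9 move down to positions 0..6; bits 0..2 move up to 7..9.
    new_low_bits = (number & 0b1111111000) >> 3
    new_high_bits = (number & 0b0000000111) << 7
    return new_high_bits | new_low_bits


def demap_id(number: int) -> int:
    # Inverse rotation: bits 7..9 move back down, bits 0..6 move back up.
    new_low_bits = (number & 0b1110000000) >> 7
    new_high_bits = (number & 0b0001111111) << 3
    return new_high_bits | new_low_bits
```

Running map_id over 0..1023 yields every value in 0..1023 exactly once, and the regular interval described above shows up directly: map_id(1) is 128 and map_id(2) is 256.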
It seems that it doesn't have to be numerical? What about an MD5-Hash?
select md5(id+rand(10000)) from ...

Generate a powerset with the help of a binary representation

I know that "a powerset is simply any number between 0 and 2^N-1 where N is number of set members and one in binary presentation denotes presence of corresponding member".
(Hynek -Pichi- Vychodil)
I would like to generate a powerset using this mapping from the binary representation to the actual set elements.
How can I do this with Erlang?
I have tried to modify this, but with no success.
UPD: My goal is to write an iterative algorithm that generates a powerset of a set without keeping a stack. I tend to think that binary representation could help me with that.
Here is the successful solution in Ruby, but I need to write it in Erlang.
UPD2: Here is the solution in pseudocode, I would like to make something similar in Erlang.
First of all, I would note that with Erlang a recursive solution does not necessarily imply it will consume extra stack. When a method is tail-recursive (i.e., the last thing it does is the recursive call), the compiler will re-write it into modifying the parameters followed by a jump to the beginning of the method. This is fairly standard for functional languages.
To generate a list of all the numbers A to B, use the library method lists:seq(A, B).
To translate a list of values (such as the list from 0 to 2^N-1) into another list of values (such as the set generated from its binary representation), use lists:map or a list comprehension.
Instead of splitting a number into its binary representation, you might want to consider turning that around and checking whether the corresponding bit is set in each M value (in 0 to 2^N-1) by generating a list of power-of-2-bitmasks. Then, you can do a binary AND to see if the bit is set.
Putting all of that together, you get a solution such as:
generate_powerset(List) ->
    % Do some pre-processing of the list to help with checks later.
    % This involves modifying the list to combine the element with
    % the bitmask it will need later on, such as:
    % [a, b, c, d, e] ==> [{1,a}, {2,b}, {4,c}, {8,d}, {16,e}]
    PowersOf2 = [1 bsl (X-1) || X <- lists:seq(1, length(List))],
    ListWithMasks = lists:zip(PowersOf2, List),
    % Generate the list from 0 to 2^N - 1
    AllMs = lists:seq(0, (1 bsl length(List)) - 1),
    % For each value, generate the corresponding subset
    lists:map(fun (M) -> generate_subset(M, ListWithMasks) end, AllMs).
    % or, using a list comprehension:
    % [generate_subset(M, ListWithMasks) || M <- AllMs].
generate_subset(M, ListWithMasks) ->
    % List comprehension: choose each element where the Mask value has
    % the corresponding bit set in M.
    [Element || {Mask, Element} <- ListWithMasks, M band Mask =/= 0].
However, you can also achieve the same thing using tail recursion without consuming stack space. It also doesn't need to generate or keep around the list from 0 to 2^N-1.
generate_powerset(List) ->
    % same preliminary steps as above...
    PowersOf2 = [1 bsl (X-1) || X <- lists:seq(1, length(List))],
    ListWithMasks = lists:zip(PowersOf2, List),
    % call tail-recursive helper method -- it can have the same name
    % as long as it has different arity.
    generate_powerset(ListWithMasks, (1 bsl length(List)) - 1, []).
generate_powerset(_ListWithMasks, -1, Acc) -> Acc;
generate_powerset(ListWithMasks, M, Acc) ->
    generate_powerset(ListWithMasks, M-1,
                      [generate_subset(M, ListWithMasks) | Acc]).
% same as above...
generate_subset(M, ListWithMasks) ->
    [Element || {Mask, Element} <- ListWithMasks, M band Mask =/= 0].
Note that when generating the list of subsets, you'll want to put new elements at the head of the list. Lists are singly-linked and immutable, so if you want to put an element anywhere but the beginning, it has to update the "next" pointers, which causes the list to be copied. That's why the helper function puts the Acc list at the tail instead of doing Acc ++ [generate_subset(...)]. In this case, since we're counting down instead of up, we're already going backwards, so it ends up coming out in the same order.
So, in conclusion,
Looping in Erlang is idiomatically done via a tail recursive function or using a variation of lists:map.
In many (most?) functional languages, including Erlang, tail recursion does not consume extra stack space since it is implemented using jumps.
List construction is typically done backwards (i.e., [NewElement | ExistingList]) for efficiency reasons.
You generally don't want to find the Nth item in a list (using lists:nth) since lists are singly-linked: it would have to iterate the list over and over again. Instead, find a way to iterate the list once, such as how I pre-processed the bit masks above.
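The bitmask mapping itself is language-independent, so here it is sketched in Python rather than Erlang, purely as an illustration of the idea. Each integer M from 0 to 2^N - 1 selects the elements whose bit is set in M, and a generator yields one subset at a time without recursion or an explicit stack. The name powerset is my own.

```python
def powerset(items):
    """Yield every subset of items, one per integer M in 0 .. 2^N - 1."""
    n = len(items)
    for m in range(1 << n):
        # Element i belongs to the subset when bit i of m is set.
        yield [item for i, item in enumerate(items) if m & (1 << i)]


subsets = list(powerset(["a", "b", "c"]))
```

The subsets come out in counting order of M, which is the same order produced by the Erlang versions above.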

Real world use cases of bitwise operators [closed]

What are some real world use cases of the following bitwise operators?
AND
XOR
NOT
OR
Left/Right shift
Bit fields (flags)
They're the most efficient way of representing something whose state is defined by several "yes or no" properties. ACLs are a good example; if you have let's say 4 discrete permissions (read, write, execute, change policy), it's better to store this in 1 byte rather than waste 4. These can be mapped to enumeration types in many languages for added convenience.
Communication over ports/sockets
Always involves checksums, parity, stop bits, flow control algorithms, and so on, which usually depend on the logic values of individual bytes as opposed to numeric values, since the medium may only be capable of transmitting one bit at a time.
Compression, Encryption
Both of these are heavily dependent on bitwise algorithms. Look at the deflate algorithm for an example - everything is in bits, not bytes.
Finite State Machines
I'm speaking primarily of the kind embedded in some piece of hardware, although they can be found in software too. These are combinatorial in nature - they might literally be getting "compiled" down to a bunch of logic gates, so they have to be expressed as AND, OR, NOT, etc.
Graphics
There's hardly enough space here to get into every area where these operators are used in graphics programming. XOR (or ^) is particularly interesting here because applying the same input a second time will undo the first. Older GUIs used to rely on this for selection highlighting and other overlays, in order to eliminate the need for costly redraws. They're still useful in slow graphics protocols (e.g. remote desktop).
Those were just the first few examples I came up with - this is hardly an exhaustive list.
Is it odd?
(value & 0x1) > 0
Is it divisible by two (even)?
(value & 0x1) == 0
I've used bitwise operations in implementing a security model for a CMS. It had pages which could be accessed by users if they were in appropriate groups. A user could be in multiple groups, so we needed to check if there was an intersection between the user's groups and the page's groups. So we assigned each group a unique power-of-2 identifier, e.g.:
Group A = 1 --> 00000001
Group B = 2 --> 00000010
Group C = 4 --> 00000100
We OR these values together, and store the value (as a single int) with the page. E.g. if a page could be accessed by groups A & B, we store the value 3 (which in binary is 00000011) as the page's access control. In much the same way, we store a value of ORed group identifiers with a user to represent which groups they are in.
So to check if a given user can access a given page, you just need to AND the values together and check if the value is non-zero. This is very fast as this check is implemented in a single instruction, no looping, no database round-trips.
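A minimal sketch of that check in Python, assuming each group is given a distinct power of two. The constant names and the can_access function are illustrative and not taken from the CMS described above.

```python
# Each group gets its own bit (a distinct power of two).
GROUP_A = 1 << 0  # 00000001
GROUP_B = 1 << 1  # 00000010
GROUP_C = 1 << 2  # 00000100


def can_access(user_groups: int, page_groups: int) -> bool:
    # A non-zero AND means the user shares at least one group with the page.
    return (user_groups & page_groups) != 0


page = GROUP_A | GROUP_B  # page visible to groups A and B
```

A user stored as GROUP_B | GROUP_C can open this page because the two masks share bit 1, while a user in GROUP_C only cannot.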
Here are some common idioms dealing with flags stored as individual bits.
enum CDRIndicators {
    Local = 1 << 0,
    External = 1 << 1,
    CallerIDMissing = 1 << 2,
    Chargeable = 1 << 3
};
unsigned int flags = 0;
Set the Chargeable flag:
flags |= Chargeable;
Clear CallerIDMissing flag:
flags &= ~CallerIDMissing;
Test whether CallerIDMissing and Chargeable are set:
if((flags & (CallerIDMissing | Chargeable )) == (CallerIDMissing | Chargeable)) {
}
Low-level programming is a good example. You may, for instance, need to write a specific bit to a memory-mapped register to make some piece of hardware do what you want it to:
#include <stdint.h>
volatile uint32_t *reg = (volatile uint32_t *)0x87000000; // named "reg" because "register" is a C keyword
uint32_t value;
uint32_t set_bit = 0x00010000;
uint32_t clear_bit = 0x00001000;
value = *reg; // get current value from the register
value = value & ~clear_bit; // clear a bit
value = value | set_bit; // set a bit
*reg = value; // write it back to the register
Also, htonl() and htons() are implemented using the & and | operators (on machines whose endianness (byte order) doesn't match network order):
#define htons(a) ((((a) & 0xff00) >> 8) | \
                  (((a) & 0x00ff) << 8))
#define htonl(a) ((((a) & 0xff000000) >> 24) | \
                  (((a) & 0x00ff0000) >> 8) | \
                  (((a) & 0x0000ff00) << 8) | \
                  (((a) & 0x000000ff) << 24))
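The same mask-and-shift byte swaps can be tried out in Python. This is a sketch that mirrors the two macros above, with function names of my own choosing; the standard struct module is used only to cross-check the result.

```python
import struct


def swap16(a: int) -> int:
    # Exchange the two bytes of a 16-bit value.
    return ((a & 0xFF00) >> 8) | ((a & 0x00FF) << 8)


def swap32(a: int) -> int:
    # Reverse the four bytes of a 32-bit value.
    return (((a & 0xFF000000) >> 24) |
            ((a & 0x00FF0000) >> 8) |
            ((a & 0x0000FF00) << 8) |
            ((a & 0x000000FF) << 24))


# Cross-check: pack as big-endian, then reinterpret as little-endian.
reinterpreted = struct.unpack("<I", struct.pack(">I", 0x12345678))[0]
```

Here reinterpreted equals swap32(0x12345678), which is the conversion a little-endian host has to perform to reach network byte order.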
I use them to get RGB(A) values from packed colorvalues, for instance.
When I have a bunch of boolean flags, I like to store them all in an int.
I get them out using bitwise-AND. For example:
int flags;
if (flags & 0x10) {
    // Turn this feature on.
}
if (flags & 0x08) {
    // Turn a second feature on.
}
etc.
& = AND:
Mask out specific bits. You define which bits are kept and which are cleared: 0x0 & x clears all bits in a byte, while 0xFF & x leaves x unchanged. 0x0F keeps only the bits in the lower nibble.
Conversion:
To cast shorter variables into longer ones with bit identity it is necessary to adjust the bits, because -1 in an int is 0xFFFFFFFF while -1 in a long is 0xFFFFFFFFFFFFFFFF. To preserve the identity you apply a mask after conversion.
| = OR:
Set bits. Each bit is set regardless of whether it was already set. Many data structures (bit fields) have flags like IS_HSET = 0, IS_VSET = 1 which can be set independently.
To set the flags, you apply IS_HSET | IS_VSET (in C and assembly this is very convenient to read).
^ = XOR:
Find bits which are the same or different.
~ = NOT:
Flip bits.
It can be shown that all possible local bit operations can be implemented by these operations.
So if you like you can implement an ADD instruction solely by bit operations.
Some wonderful hacks:
http://www.ugcs.caltech.edu/~wnoise/base2.html
http://www.jjj.de/bitwizardry/bitwizardrypage.html
Encryption is all bitwise operations.
You can use them as a quick and dirty way to hash data.
int a = 1230123;
int b = 1234555;
int c = 5865683;
int hash = a ^ b ^ c;
I just used bitwise-XOR (^) about three minutes ago to calculate a checksum for serial communication with a PLC...
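A simple XOR checksum of the kind mentioned above might look like the following Python sketch. The framing (payload followed by a single checksum byte) is an assumption made for illustration; real PLC protocols define their own frame layout.

```python
from functools import reduce
from operator import xor


def xor_checksum(payload: bytes) -> int:
    # XOR all bytes together; the result always fits in one byte.
    return reduce(xor, payload, 0)


def frame(payload: bytes) -> bytes:
    # Assumed layout: payload followed by its one-byte checksum.
    return payload + bytes([xor_checksum(payload)])


def is_valid(framed: bytes) -> bool:
    # XOR over payload plus checksum cancels to zero for an intact frame.
    return xor_checksum(framed) == 0
```

Because x ^ x is 0, XOR-ing the whole frame including its checksum gives zero, so the receiver needs no separate comparison step. Note that this detects any single flipped bit but not, for example, two identical errors in the same bit position.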
This is an example to read colours from a bitmap image in byte format
int imagePixel = 0xCCDDEE; /* Pixel in RRGGBB format: R=Red, G=Green, B=Blue */
// Keep only the red bits
int redColour = imagePixel & 0xFF0000; /* Bitmasking with the AND operator */
// Shift the red bits down into the low byte
redColour = (redColour >> 16) & 0xFF; /* This now returns a red colour between 0x00 and 0xFF */
I hope this tiny example helps.
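For a 24-bit RRGGBB pixel, the red channel sits in bits 16 to 23, so it has to be shifted right by 16 (not 24) before masking. A small Python sketch of unpacking and repacking all three channels, with function names of my own:

```python
def unpack_rgb(pixel: int):
    # Shift each channel down to the low byte, then mask off everything else.
    red = (pixel >> 16) & 0xFF
    green = (pixel >> 8) & 0xFF
    blue = pixel & 0xFF
    return red, green, blue


def pack_rgb(red: int, green: int, blue: int) -> int:
    # Inverse operation: shift each channel into place and OR them together.
    return (red << 16) | (green << 8) | blue
```

For the pixel 0xCCDDEE this gives the channels 0xCC, 0xDD and 0xEE.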
In the abstracted world of today's modern language, not too many. File IO is an easy one that comes to mind, though that's exercising bitwise operations on something already implemented and is not implementing something that uses bitwise operations. Still, as an easy example, this code demonstrates removing the read-only attribute on a file (so that it can be used with a new FileStream specifying FileMode.Create) in c#:
//Hidden files possess some extra attributes that make the FileStream throw an exception
//even with FileMode.Create (if exists -> overwrite) so delete it and don't worry about it!
if(File.Exists(targetName))
{
    FileAttributes attributes = File.GetAttributes(targetName);
    if ((attributes & FileAttributes.ReadOnly) == FileAttributes.ReadOnly)
        File.SetAttributes(targetName, attributes & (~FileAttributes.ReadOnly));
    File.Delete(targetName);
}
As far as custom implementations, here's a recent example:
I created a "message center" for sending secure messages from one installation of our distributed application to another. Basically, it's analogous to email, complete with Inbox, Outbox, Sent, etc, but it also has guaranteed delivery with read receipts, so there are additional subfolders beyond "inbox" and "sent." What this amounted to was a requirement for me to define generically what's "in the inbox" or what's "in the sent folder". Of the sent folder, I need to know what's read and what's unread. Of what's unread, I need to know what's received and what's not received. I use this information to build a dynamic where clause which filters a local datasource and displays the appropriate information.
Here's how the enum is put together:
public enum MemoView :int
{
    InboundMemos = 1,                 // 0000 0001
    InboundMemosForMyOrders = 3,      // 0000 0011
    SentMemosAll = 16,                // 0001 0000
    SentMemosNotReceived = 48,        // 0011 0000
    SentMemosReceivedNotRead = 80,    // 0101 0000
    SentMemosRead = 144,              // 1001 0000
    Outbox = 272,                     // 0001 0001 0000
    OutBoxErrors = 784                // 0011 0001 0000
}
Do you see what this does? By anding (&) with the "Inbox" enum value, InboundMemos, I know that InboundMemosForMyOrders is in the inbox.
Here's a boiled down version of the method that builds and returns the filter that defines a view for the currently selected folder:
private string GetFilterForView(MemoView view, DefaultableBoolean readOnly)
{
    string filter = string.Empty;
    if((view & MemoView.InboundMemos) == MemoView.InboundMemos)
    {
        filter = "<inbox filter conditions>";
        if((view & MemoView.InboundMemosForMyOrders) == MemoView.InboundMemosForMyOrders)
        {
            filter += "<my memo filter conditions>";
        }
    }
    else if((view & MemoView.SentMemosAll) == MemoView.SentMemosAll)
    {
        //all sent items have originating system = to local
        filter = "<memos leaving current system>";
        if((view & MemoView.Outbox) == MemoView.Outbox)
        {
            ...
        }
        else
        {
            //sent sub folders
            filter += "<all sent items>";
            if((view & MemoView.SentMemosNotReceived) == MemoView.SentMemosNotReceived)
            {
                if((view & MemoView.SentMemosReceivedNotRead) == MemoView.SentMemosReceivedNotRead)
                {
                    filter += "<not received and not read conditions>";
                }
                else
                    filter += "<received and not read conditions>";
            }
        }
    }
    return filter;
}
Extremely simple, but a neat implementation at a level of abstraction that doesn't typically require bitwise operations.
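Usually bitwise operations are faster than doing multiply/divide. So if you need to multiply a variable x by, say, 9, you can write (x << 3) + x. The parentheses matter: in C, + binds tighter than <<, so x<<3 + x would mean x << (3 + x). On a processor without a fast multiplier this can be a few cycles faster than x*9, although optimizing compilers usually make this substitution themselves. If this code is inside an ISR, you will save on response time.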
Similarly, if you want to use an array as a circular queue, it is faster (and more elegant) to handle wrap-around checks with bitwise operations (your array size must be a power of 2). E.g., you can use tail = (tail + 1) & MASK, where MASK = size - 1, instead of tail = ((tail + 1) < size) ? tail + 1 : 0, if you want to insert/delete.
Also if you want a error flag to hold multiple error codes together, each bit can hold a separate value. You can AND it with each individual error code as a check. This is used in Unix error codes.
Also a n-bit bitmap can be a really cool and compact data structure. If you want to allocate a resource pool of size n, we can use a n-bits to represent the current status.
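As a sketch of the n-bit bitmap idea for a resource pool, the following Python class tracks which of n slots are in use with a single integer. The class and method names are hypothetical; the point is only that allocation, release and "find a free slot" all reduce to a few bitwise operations.

```python
class BitmapPool:
    """Track n resources with one integer: bit i set means slot i is in use."""

    def __init__(self, n: int):
        self.n = n
        self.used = 0

    def allocate(self):
        # Bits that are free: invert the used mask and keep only n bits.
        free = ~self.used & ((1 << self.n) - 1)
        if free == 0:
            return None                    # pool exhausted
        lowest = free & -free              # isolate the lowest free bit
        self.used |= lowest                # mark it as used
        return lowest.bit_length() - 1     # convert the bit to a slot index

    def release(self, index: int) -> None:
        self.used &= ~(1 << index)         # clear the slot's bit
```

Allocation always hands out the lowest free slot, so a released slot is reused before any higher-numbered one.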
Bitwise & is used to mask/extract a certain part of a byte.
1-byte variable:
  01110010
& 00001111   bitmask of 0x0F to find out the lower nibble
  --------
  00000010
The shift operators (<< and >>) in particular are often used for calculations.
Bitwise operators are useful for looping over arrays whose length is a power of 2. As many people mentioned, bitwise operators are extremely useful and are used in flags, graphics, networking and encryption. Not only that, but they are extremely fast. My personal favorite use is to loop over an array without conditionals. Suppose you have a zero-based array (i.e. the first element's index is 0) and you need to loop over it indefinitely. By indefinitely I mean going from the first element to the last and then returning to the first. One way to implement this is:
int[] arr = new int[8];
int i = 0;
while (true) {
    print(arr[i]);
    i = i + 1;
    if (i >= arr.length)
        i = 0;
}
This is the simplest approach. If you'd like to avoid the if statement, you can use the modulus approach like so:
int[] arr = new int[8];
int i = 0;
while (true) {
    print(arr[i]);
    i = i + 1;
    i = i % arr.length;
}
The downside of these two methods is that the modulus operator is expensive, since it computes a remainder after integer division, and the first method runs an if statement on each iteration. With a bitwise operator, however, if the length of your array is a power of 2, you can easily generate a sequence like 0 .. length - 1 by using the & (bitwise AND) operator like so: i & (length - 1). So knowing this, the code from above becomes
int[] arr = new int[8];
int i = 0;
while (true) {
    print(arr[i]);
    i = i + 1;
    i = i & (arr.length - 1);
}
Here is how it works. In binary, every number that is one less than a power of 2 consists only of ones. For example 3 in binary is 11, 7 is 111, 15 is 1111 and so on, you get the idea. Now, what happens if you & any number against a number consisting only of ones in binary? Let's say we do this:
num & 7;
If num is smaller than or equal to 7 then the result will be num, because each bit &-ed with 1 is itself. If num is bigger than 7, the & operation also takes 7's leading zeros into account; those positions become zero in the result, so only the trailing three bits of num remain. For example, 9 & 7 in binary looks like
1001 & 0111
and the result is 0001, which is 1 in decimal and addresses the second element in the array.
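The equivalence between masking and the modulus holds only when the length is a power of two. The short Python sketch below (the function name is my own) checks that precondition and then applies the mask.

```python
def wrap_with_mask(i: int, length: int) -> int:
    # A power of two has exactly one bit set, so length & (length - 1) is 0.
    if length <= 0 or length & (length - 1):
        raise ValueError("length must be a power of two")
    return i & (length - 1)


# For a power-of-two length the mask agrees with the modulus...
agrees = all(wrap_with_mask(i, 8) == i % 8 for i in range(100))
# ...but for other lengths it does not: 9 & 5 is 1, while 9 % 6 is 3.
differs = (9 & (6 - 1)) != (9 % 6)
```

This is why the trick is usually paired with array sizes that are deliberately rounded up to a power of two.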
Base64 encoding is an example. Base64 encoding is used to represent binary data as printable characters for sending over email systems (and other purposes). Base64 encoding converts a series of 8-bit bytes into 6-bit character lookup indexes. Bit operations such as shifting, AND-ing, OR-ing and NOT-ing are very useful for implementing Base64 encoding and decoding.
This of course is only 1 of countless examples.
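To show where the shifts and masks come in, here is a Python sketch that encodes exactly three bytes into four Base64 characters by hand. It deliberately ignores padding and longer inputs; the standard base64 module is imported only to compare results, and encode_triplet is a name of my own.

```python
import base64
import string

# Standard Base64 alphabet: A-Z, a-z, 0-9, '+', '/'.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def encode_triplet(chunk: bytes) -> str:
    if len(chunk) != 3:
        raise ValueError("this sketch handles exactly 3 bytes")
    # Pack three 8-bit bytes into one 24-bit integer.
    n = (chunk[0] << 16) | (chunk[1] << 8) | chunk[2]
    # Slice the 24 bits into four 6-bit lookup indexes.
    indexes = [(n >> 18) & 0x3F, (n >> 12) & 0x3F, (n >> 6) & 0x3F, n & 0x3F]
    return "".join(ALPHABET[i] for i in indexes)


matches_stdlib = encode_triplet(b"Man") == base64.b64encode(b"Man").decode()
```

A full encoder repeats this step for each group of three bytes and pads the final group with '=' characters.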
I'm surprised no one picked the obvious answer for the Internet age: calculating valid network addresses for a subnet.
http://www.topwebhosts.org/tools/netmask.php
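A sketch of that calculation in Python for IPv4: the network address is the IP AND-ed with the netmask, and the broadcast address is the network OR-ed with the inverted mask. The function name is my own; the standard ipaddress module is used here only to convert between dotted strings and integers.

```python
import ipaddress


def network_and_broadcast(address: str, prefix_len: int):
    ip = int(ipaddress.IPv4Address(address))
    # Netmask: prefix_len one-bits followed by zero-bits, kept to 32 bits.
    mask = (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF
    network = ip & mask                         # clear the host bits
    broadcast = network | (~mask & 0xFFFFFFFF)  # set all host bits
    return (str(ipaddress.IPv4Address(network)),
            str(ipaddress.IPv4Address(broadcast)))


result = network_and_broadcast("192.168.1.130", 26)
```

For 192.168.1.130/26 the host part is the low 6 bits, so the network is 192.168.1.128 and the broadcast address is 192.168.1.191.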
Nobody seems to have mentioned fixed point maths.
(Yeah, I'm old, ok?)
Is a number x a power of 2? (Useful for example in algorithms where a counter is incremented, and an action is to be taken only logarithmic number of times)
(x & (x - 1)) == 0
Which is the highest bit of an integer x? (This for example can be used to find the minimum power of 2 that is larger than x)
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
return x - (x >>> 1); // ">>>" is unsigned right shift
Which is the lowest 1 bit of an integer x? (Helps find number of times divisible by 2.)
x & -x
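The three tricks above translate directly to Python, as sketched below with names of my own. Two assumptions are worth stating: the power-of-two test needs an extra x > 0 check because (x & (x - 1)) == 0 is also true for 0, and the bit-smearing version of the highest-bit function assumes a non-negative value below 2^32.

```python
def is_power_of_two(x: int) -> bool:
    # Clearing the lowest set bit leaves 0 only if exactly one bit was set.
    return x > 0 and (x & (x - 1)) == 0


def highest_set_bit(x: int) -> int:
    # Smear the top bit into every lower position (assumes 0 <= x < 2**32).
    x |= x >> 1
    x |= x >> 2
    x |= x >> 4
    x |= x >> 8
    x |= x >> 16
    # x is now all ones below the top bit; subtract the lower half.
    return x - (x >> 1)


def lowest_set_bit(x: int) -> int:
    # Two's complement negation flips every bit above the lowest set bit.
    return x & -x
```

For example, 100 is 1100100 in binary, so its highest set bit is 64, and 12 is 1100, so its lowest set bit is 4.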
If you ever want to calculate your number mod (%) a certain power of 2, you can use yourNumber & (2^N - 1), which for non-negative numbers is the same as yourNumber % 2^N (here ^ denotes exponentiation, not XOR).
number % 16 = number & 15;
number % 128 = number & 127;
This is probably only useful being an alternative to modulus operation with a very big dividend that is 2^N... But even then its speed boost over the modulus operation is negligible in my test on .NET 2.0. I suspect modern compilers already perform optimizations like this. Anyone know more about this?
I use them for multi-select options; this way I only store one value instead of 10 or more.
It can also be handy in a SQL relational model. Let's say you have the following tables: BlogEntry, BlogCategory.
Traditionally you could create an n-n relationship between them using a BlogEntryCategory table,
or, when there are not that many BlogCategory records, you could use one value in BlogEntry to link to multiple BlogCategory records, just like you would do with flagged enums.
Most RDBMSs also have very fast operators to select on that 'flagged' column...
When you only want to change some bits of a microcontroller's Outputs, but the register to write to is a byte, you do something like this (pseudocode):
char newOut = OutRegister & 0b00011111 //clear 3 msb's
newOut = newOut | 0b10100000 //write '101' to the 3 msb's
OutRegister = newOut //Update Outputs
Of course, many microcontrollers allow you to change each bit individually...
I've seen them used in role based access control systems.
There is a real world use in my question here -
Respond to only the first WM_KEYDOWN notification?
When consuming a WM_KEYDOWN message in the Windows C API, bit 30 specifies the previous key state. The value is 1 if the key is down before the message is sent, or it is zero if the key is up.
They are mostly used for bitwise operations (surprise). Here are a few real-world examples found in PHP codebase.
Character encoding:
if (s <= 0 && (c & ~MBFL_WCSPLANE_MASK) == MBFL_WCSPLANE_KOI8R) {
Data structures:
ar_flags = other->ar_flags & ~SPL_ARRAY_INT_MASK;
Database drivers:
dbh->transaction_flags &= ~(PDO_TRANS_ACCESS_MODE^PDO_TRANS_READONLY);
Compiler implementation:
opline->extended_value = (opline->extended_value & ~ZEND_FETCH_CLASS_MASK) | ZEND_FETCH_CLASS_INTERFACE;
I've seen it in a few game development books as a more efficient way to multiply and divide.
2 << 3 == 2 * 8
32 >> 4 == 32 / 16
Whenever I first started C programming, I understood truth tables and all that, but it didn't all click with how to actually use it until I read this article http://www.gamedev.net/reference/articles/article1563.asp (which gives real life examples)
I don't think this counts as bitwise, but Ruby's Array defines set operations through the normal integer bitwise operators. So [1,2,4] & [1,2,3] # => [1,2]. Similarly a | b #=> union. Set difference is a - b; Array does not define ^, although Ruby's Set uses it for the symmetric difference.

Creating a hash of a string that's sortable

Is there any way to create hashes of strings where the hashes can be sorted and give the same results as if the strings themselves were sorted?
This won't be possible, at least if you allow strings longer than the hash size. You have 256^(max. string size) possible strings mapped to 256^(hash size) hash values, so you'll end up with some of the strings unsorted.
Just imagine the simplest hash: Truncating every string to (hash size) bytes.
Yes. It's called using the entire input string as the hash.
As others have pointed out it's not practical to do exactly what you've asked. You'd have to use the string itself as the hash which would constrain the lengths of strings that could be "hashed" and so on.
The obvious approach to maintaining a "sorted hash" data structure would be to maintain both a sorted list (heap or binary tree, for example) and a hashed mapping of the data. Inserts and removals would be O(log(n)) while retrievals would be O(1). Off hand I'm not sure how often this would be worth the additional complexity and overhead.
If you had a particularly large data set, mostly read-only and such that logarithmic time retrieval was overly expensive then I suppose it might be useful. Note that the cost of updates is actually the sum of the constant time (hash) and the logarithmic time (binary tree or heap) operations. However O(1) + O(log(n)) reduces to the larger of the two terms during asymptotic analysis. (The underlying cost is still there --- relevant to any implementation effort regardless of its theoretical irrelevance).
For a significant range of data set sizes the cost of maintaining this hypothetical hybrid data structure could be estimated as "twice" the cost of maintaining either of the pure ones. (In other words, many implementations of a binary tree can scale to billions of elements (2^~32 or so) at a time cost that's comparable to the cost of the typical hash functions). So I'd be hard-pressed to convince myself that such added code complexity and run-time cost (of a hybrid data structure) would actually be of benefit to a given project.
(Note: I saw that Python 3.1.1 added the notion of "ordered" dictionaries ... and this is similar to being sorted, but not quite the same. From what I gather the ordered dictionary preserves the order in which elements were inserted into the collection. I also seem to remember some talk of "views" ... objects in the language which can access keys of a dictionary in some particular manner (sorted, reversed, reverse sorted, ...) at (possibly) lower cost than passing the set of keys through the built-in "sorted()" and "reversed()." I haven't used these, nor have I looked at the implementation details. I would guess that one of these "views" would be something like a lazily evaluated index, performing the necessary sorting on call, and storing the results with some sort of flag or trigger (observer pattern or listener) that's reset when the back-end source collection is updated. In that scheme a call to the "view" would update its index; subsequent calls would be able to use those results so long as no insertions or deletions had been made to the dictionary. Any call to the view subsequent to key changes would incur the cost of updating the view. However, this is all pure speculation on my part. I mention it because it might also provide insight into some alternative ways to approach the question).
Not unless there are fewer strings than hashes, and the hashes are perfect. Even then you still have to ensure the hash order is the same as the string order, this is probably not possible unless you know all the strings ahead of time.
No. The hash would have to contain the same amount of information as the string it is replacing. Otherwise, if two strings mapped to the same hash value, how could you possibly sort them?
Another way of thinking about it is this: If I have two strings, "a" and "b", then I hash both of them with this sort preserving hash function and get f(a) and f(b). However, there are an infinite number of strings that are greater than "a" but less than "b". This would require hashing the strings to arbitrary precision Real values (because of cardinality). In the end, you would basically just have the string encoded as a number.
You're essentially asking if you can compress the key strings into smaller keys while preserving their collation order. So it depends on your data. If your strings are composed of only hexadecimal digits, for example, they can be replaced with 4-bit codes.
But for the general case, it can't be done. You'd end up "hashing" each source key into itself.
I stumbled upon this, and although everyone is correct with their answers, I needed a solution exactly like this to use in elasticsearch (don't ask why). Sometimes we don't need a perfect solution for all cases, we just need one that works within acceptable constraints. My solution generates a sortable hash code for the first n chars of the string; I did some preliminary tests and didn't have any collisions. You need to define beforehand the charset that is used, and tune n, the number of leading chars used for sorting, so that the resulting hash code stays in the positive range of the type you store it in. In my case, for the Java Long type, I could go up to 13 chars.
Below is my code in Java, hopefully, it will help someone else that needs this.
String charset = "abcdefghijklmnopqrstuvwxyz";
public long orderedHash(final String s, final String charset, final int n) {
    Long hash = 0L;
    if(s.isEmpty() || n == 0)
        return hash;
    Long charIndex = (long)(charset.indexOf(s.charAt(0)));
    if(charIndex == -1)
        return hash;
    for(int i = 1 ; i < n; i++)
        hash += (long)(charIndex * Math.pow(charset.length(), i));
    hash += charIndex + 1 + orderedHash(s.substring(1), charset, n - 1);
    return hash;
}
Examples:
orderedHash("a", charset, 13) // 1
orderedHash("abc", charset, 13) // 4110785825426312
orderedHash("b", charset, 13) // 99246114928149464
orderedHash("google", charset, 13) // 651008600709057847
orderedHash("stackoverflow", charset, 13) // 1858969664686174756
orderedHash("stackunderflow", charset, 13) // 1858969712216171093
orderedHash("stackunderflo", charset, 13) // 1858969712216171093 same, 13 chars limitation
orderedHash("z", charset, 13) // 2481152873203736576
orderedHash("zzzzzzzzzzzzz", charset, 13) // 2580398988131886038
orderedHash("zzzzzzzzzzzzzz", charset, 14) // -4161820175519153195 no good, overflow
orderedHash("ZZZZZZZZZZZZZ", charset, 13) // 0 no good, not in charset
If more precision is needed, use an unsigned type or a composite one made of two longs for example and compute the hashcode with substrings.
Edit: Although the previous algorithm sufficed for my use, I noticed that it did not order strings correctly when their length was not greater than the chosen n. With this new algorithm it should be OK now.
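For readers who want to experiment with the overflow limit, here is a rough Python port of the Java method above (my own sketch, not the author's code). Python integers are arbitrary precision, so n can go past 13 without wrapping. The name `ordered_hash` and the iterative form are assumptions of this sketch; the value computed is the same as in the recursive version, namely the lexicographic rank of the first n characters.

```python
def ordered_hash(s: str, charset: str, n: int) -> int:
    """Order-preserving code for the first n characters of s over charset."""
    size = len(charset)
    total = 0
    for remaining in range(n, 0, -1):
        if not s:
            break
        index = charset.find(s[0])
        if index == -1:  # character outside the charset: stop, as the Java code does
            break
        # Number of strings of length < remaining that can follow one character.
        block = sum(size ** i for i in range(remaining))
        total += index * block + 1
        s = s[1:]
    return total


charset = "abcdefghijklmnopqrstuvwxyz"
words = ["google", "b", "stackoverflow", "a", "abc", "z"]
# Sorting by the hash gives the same order as sorting the strings.
print(sorted(words) == sorted(words, key=lambda w: ordered_hash(w, charset, 13)))
```

Strings that share their first n characters still collide, exactly as in the Java version.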

Rot13 for numbers

EDIT: Now a Major Motion Blog Post at http://messymatters.com/sealedbids
The idea of rot13 is to obscure text, for example to prevent spoilers. It's not meant to be cryptographically secure but to simply make sure that only people who are sure they want to read it will read it.
I'd like to do something similar for numbers, for an application involving sealed bids. Roughly I want to send someone my number and trust them to pick their own number, uninfluenced by mine, but then they should be able to reveal mine (purely client-side) when they're ready. They should not require further input from me or any third party.
(Added: Note the assumption that the recipient is being trusted not to cheat.)
It's not as simple as rot13 because certain numbers, like 1 and 2, will recur often enough that you might remember that, say, 34.2 is really 1.
Here's what I'm looking for specifically:
A function seal() that maps a real number to a real number (or a string). It should not be deterministic -- seal(7) should not map to the same thing every time. But the corresponding function unseal() should be deterministic -- unseal(seal(x)) should equal x for all x. I don't want seal or unseal to call any webservices or even get the system time (because I don't want to assume synchronized clocks). (Added: It's fine to assume that all bids will be less than some maximum, known to everyone, say a million.)
Sanity check:
> seal(7)
482.2382 # some random-seeming number or string.
> seal(7)
71.9217 # a completely different random-seeming number or string.
> unseal(seal(7))
7 # we always recover the original number by unsealing.
You can pack your number as a 4-byte float together with another random float into a double and send that. The client then just has to pick up the first four bytes. In Python:
import struct, random
def seal(f):
    # two 4-byte floats (the bid, then random noise) reinterpreted as one 8-byte double
    return struct.unpack("d", struct.pack("ff", f, random.random()))[0]
def unseal(f):
    # split the double back into two floats and keep the first one
    return struct.unpack("ff", struct.pack("d", f))[0]
>>> unseal( seal( 3))
3.0
>>> seal(3)
4.4533985422978706e-009
>>> seal(3)
9.0767582382536571e-010
Here's a solution inspired by Svante's answer.
M = 9999 # Upper bound on bid.
seal(x) = M * randInt(9,99) + x
unseal(x) = x % M
Sanity check:
> seal(7)
709936
> seal(7)
509956
> unseal(seal(7))
7
This needs tweaking to allow negative bids though:
M = 9999 # Numbers between -M/2 and M/2 can be sealed.
seal(x) = M * randInt(9,99) + x
unseal(x) =
    m = x % M;
    if m > M/2 return m - M else return m
A nice thing about this solution is how trivial it is for the recipient to decode: just mod by 9999 (and if the result is 5000 or more, it was a negative bid, so subtract another 9999). It's also nice that the obscured bid will be at most 6 digits long. (This is plenty of security for what I have in mind. If the bids could possibly exceed $5k then I'd use a more secure method, though of course the max bid in this method can be set as high as you want.)
Instructions for Lay Folk
Pick a number between 9 and 99 and multiply it by 9999, then add your bid.
This will yield a 5 or 6-digit number that encodes your bid.
To unseal it, divide by 9999, subtract the part to the left of the decimal point, then multiply by 9999.
(This is known to children and mathematicians as "finding the remainder when dividing by 9999" or "mod'ing by 9999", respectively.)
This works for nonnegative bids less than 9999 (if that's not enough, use 99999 or as many digits as you want).
If you want to allow negative bids, then the magic 9999 number needs to be twice the biggest possible bid.
And when decoding, if the result is greater than half of 9999, i.e., 5000 or more, then subtract 9999 to get the actual (negative) bid.
Again, note that this is on the honor system: there's nothing technically preventing you from unsealing the other person's number as soon as you see it.
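The scheme above, including the tweak for negative bids, can be sketched in Python as follows. This is only an illustration under the same assumptions as the text (M = 9999, integer bids with absolute value below M/2, and a multiplier drawn from 9 to 99); the range check is my own addition.

```python
import random

M = 9999  # shared modulus; bids must lie strictly between -M/2 and M/2


def seal(bid: int) -> int:
    """Hide the bid by adding a random multiple of M."""
    if abs(bid) > M // 2:
        raise ValueError("bid out of range for this M")
    return M * random.randint(9, 99) + bid


def unseal(sealed: int) -> int:
    """Recover the bid as the remainder mod M, mapping large remainders to negatives."""
    m = sealed % M
    return m - M if m > M // 2 else m


print(unseal(seal(7)))   # 7
print(unseal(seal(-7)))  # -7
```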
If you're relying on the honesty of the user and only dealing with integer bids, a simple XOR operation with a random number should be all you need. An example in C#:
static Random rng = new Random();
static string EncodeBid(int bid)
{
    int i = rng.Next();
    return String.Format("{0}:{1}", i, bid ^ i);
}
static int DecodeBid(string encodedBid)
{
    string[] d = encodedBid.Split(":".ToCharArray());
    return Convert.ToInt32(d[0]) ^ Convert.ToInt32(d[1]);
}
Use:
int bid = 500;
string encodedBid = EncodeBid(bid); // encodedBid is something like 54017514:54017054 and will be different each time
int decodedBid = DecodeBid(encodedBid); // decodedBid is 500
Converting the decode process to a client side construct should be simple enough.
Is there a maximum bid? If so, you could do this:
Let max-bid be the maximum bid and a-bid the bid you want to encode. Multiply max-bid by a rather large random number (if you want to use base64 encoding in the last step, max-rand should be (2^24/max-bid)-1, and min-rand perhaps half of that), then add a-bid. Encode this, e.g. through base64.
The recipient then just has to decode and find the remainder modulo max-bid.
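A minimal Python sketch of this recipe, under my reading of it: the sealed value is kept below 2^24 so that it fits in three bytes, which base64 turns into exactly four characters. The function names are made up, and the bid is assumed to be a non-negative integer strictly below max_bid (with max_bid small enough that max_rand is at least 1).

```python
import base64
import random


def seal(bid: int, max_bid: int) -> str:
    """Add a random multiple of max_bid, then base64-encode the 3-byte result."""
    if not 0 <= bid < max_bid:
        raise ValueError("bid must satisfy 0 <= bid < max_bid")
    max_rand = 2 ** 24 // max_bid - 1
    min_rand = max_rand // 2
    n = max_bid * random.randint(min_rand, max_rand) + bid
    return base64.b64encode(n.to_bytes(3, "big")).decode("ascii")


def unseal(sealed: str, max_bid: int) -> int:
    """Decode and take the remainder modulo max_bid."""
    n = int.from_bytes(base64.b64decode(sealed), "big")
    return n % max_bid


token = seal(1423, 10000)
print(token, unseal(token, 10000))
```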
What you want to do (a Commitment scheme) is impossible to do client-side-only. The best you could do is encrypt with a shared key.
If the client doesn't need your cooperation to reveal the number, they can just modify the program to reveal the number. You might as well have just sent it and not displayed it.
To do it properly, you could send a secure hash of your bid + a random salt. That commits you to your bid. The other client can commit to their bid in the same way. Then you each share your bid and salt.
[edit] Since you trust the other client:
Sender:
Let M be your message
K = random 4-byte key
C1 = M xor hash(K) //hash optional: hides patterns in M xor K
//(you can repeat or truncate hash(K) as necessary to cover the message)
//(could also xor with output of a PRNG instead)
C2 = K append C1 //they need to know K to reveal the message
send C2 //(convert bytes to hex representation if needed)
Receiver:
receive C2
K = C2[:4]
C1 = C2[4:]
M = C1 xor hash(K)
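In Python the sender and receiver steps might look like the sketch below. SHA-256 as the hash, hex as the wire format, and the helper name `_pad` are my assumptions; the pad simply repeats hash(K) to cover the message, as suggested in the comments above. Like the pseudocode, this only obscures the message and provides no security against a recipient who wants to peek.

```python
import hashlib
import os


def _pad(key: bytes, length: int) -> bytes:
    """Repeat hash(K) until it covers `length` bytes, then truncate."""
    digest = hashlib.sha256(key).digest()
    return (digest * (length // len(digest) + 1))[:length]


def seal(message: bytes) -> str:
    key = os.urandom(4)                                   # K = random 4-byte key
    pad = _pad(key, len(message))
    c1 = bytes(m ^ p for m, p in zip(message, pad))       # C1 = M xor hash(K)
    return (key + c1).hex()                               # C2 = K append C1, as hex


def unseal(sealed: str) -> bytes:
    c2 = bytes.fromhex(sealed)
    key, c1 = c2[:4], c2[4:]
    pad = _pad(key, len(c1))
    return bytes(c ^ p for c, p in zip(c1, pad))          # M = C1 xor hash(K)


print(unseal(seal(b"14.23")))
```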
Are you aware that you need a larger 'sealed' set of numbers than your original, if you want that to work?
So you need to restrict your real numbers somehow, or store extra info that you don't show.
One simple way is to write a message like:
"my bid is: $14.23: aduigfurjwjnfdjfugfojdjkdskdfdhfddfuiodrnfnghfifyis"
All that junk is randomly-generated, and different every time.
Send the other person the SHA256 hash of the message. Have them send you the hash of their bid. Then, once you both have the hashes, send the full message, and confirm that their bid corresponds to the hash they gave you.
This gives rather stronger guarantees than you need: it's actually not possible for them to work out your bid before you send them your full message. However, there is no unseal() function as you describe.
This simple scheme has various weaknesses that a full zero-knowledge scheme would not have. For example, if they fake you out by sending you a random number instead of a hash, then they can work out your bid without revealing their own. But you didn't ask for bullet-proof. This prevents both accidental and (I think) undetectable cheating, and uses only a commonly-available command line utility, plus a random number generator (dice will do).
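The hash-then-reveal exchange described above can be sketched with Python's standard library instead of a command-line tool. The function names and the exact message layout are illustrative assumptions; the random junk comes from `secrets.token_hex`.

```python
import hashlib
import secrets


def commit(bid: str):
    """Build the full message (bid plus random junk) and its SHA-256 hash."""
    junk = secrets.token_hex(16)  # different every time
    message = f"my bid is: {bid}: {junk}"
    digest = hashlib.sha256(message.encode("utf-8")).hexdigest()
    return message, digest


def verify(message: str, digest: str) -> bool:
    """Check that a revealed message matches the hash received earlier."""
    return hashlib.sha256(message.encode("utf-8")).hexdigest() == digest


message, digest = commit("$14.23")
# Step 1: send only `digest`. Step 2: once both hashes are exchanged, send `message`.
print(verify(message, digest))
```

Each side keeps its message private until it holds the other side's hash, then checks the revealed message with `verify`.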
If, as you say, you want them to be able to recover your bid without any further input from you, and you are willing to trust them only to do it after posting their bid, then just encrypt using any old symmetric cipher (gpg --symmetric, perhaps) and the key, "rot13". This will prevent accidental cheating, but allow undetectable cheating.
One idea that popped into my mind was to base your algorithm on the mathematics used for secure key sharing.
If you want to give two people, Bob and Alice, half a key each so that only by combining them can they open whatever the key locks, how do you do that? The solution comes from mathematics. Say you have two points A (-2,2) and B (2,0) in an x/y coordinate system.
               |
       A       +
               |
               C
               |
---+---+---+---|---+---B---+---+---+---
               |
               +
               |
               +
If you draw a straight line between them it will cross the y axis at exactly one single point, C (0,1).
If you only know one of the points A or B it is impossible to tell where it will cross.
Thus you can let the points A and B be the shared keys which when combined will reveal the y-value
of the crossing point (i.e. 1 in this example) and this value is then typically used as
a real key for something.
For your bidding application you could let seal() and unseal() swap the y-value between the C and B points
(deterministic) but have the A point vary from time to time.
This way seal(y-value of point B) will give completely different results depending on point A,
but unseal(seal(y-value of point B)) should return the y-value of B which is what you ask for.
PS
It is not required to have A and B on different sides of the y-axis, but is much simpler conceptually to think of it this way (and I recommend implementing it that way as well).
With this straight line you can then share keys between several people so that only two of
them are needed to unlock whatever it is. It is possible to use curve types other than straight lines to create other
key sharing properties (e.g. 3 out of 3 keys are required).
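To make the geometry concrete, here is a small Python sketch of the two-point idea. It is my own illustration rather than a full sealing scheme: the secret is the y-value where the line crosses the y-axis (point C), and the two shares are the points at x = -2 and x = 2. The function names and the use of exact fractions are assumptions of the sketch.

```python
from fractions import Fraction
import random


def split_secret(secret, slope, x_a=-2, x_b=2):
    """Return two points on the line y = slope * x + secret."""
    return (x_a, slope * x_a + secret), (x_b, slope * x_b + secret)


def y_intercept(a, b):
    """Recover the secret: the y-value where the line through a and b crosses x = 0."""
    (x1, y1), (x2, y2) = a, b
    if x1 == x2:
        raise ValueError("points must have different x-coordinates")
    slope = Fraction(y2 - y1, x2 - x1)
    return y1 - slope * x1


# The example from the text: A (-2,2) and B (2,0) cross the y-axis at 1.
print(y_intercept((-2, 2), (2, 0)))

# A different random slope gives different points A and B for the same secret.
a, b = split_secret(42, Fraction(random.randint(-100, 100), 7))
print(a, b, y_intercept(a, b))
```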
Pseudo code:
encode:
    value = 2000
    key = random(0..65535); // our key is only 2 bytes
    // 'sealing it'
    value = value XOR key;
    // add key
    sealed = (value << 16) | key
decode:
    key = sealed & 0xFFFF
    unsealed = key XOR (sealed >> 16)
Would that work?
Since it seems that you are assuming that the other person doesn't want to know your bid until after they've placed their own, and can be trusted not to cheat, you could try a variable rotation scheme:
from random import randint

def seal(text):
    # shift every character code by a random amount r and prepend r itself
    r = randint(0, 50)
    obfuscate = [str(r)] + [str(ord(c) + r) for c in '%s' % text]
    return ':'.join(obfuscate)

def unseal(text):
    # the first field is r; subtract it from every remaining character code
    tmp = text.split(':')
    r = int(tmp.pop(0))
    deobfuscate = [chr(int(c) - r) for c in tmp]
    return ''.join(deobfuscate)

# I suppose you would put your bid in here, for 100 dollars
tmp = seal('$100.00')  # --> '1:37:50:49:49:47:49:49' (output varies)
print(unseal(tmp))     # --> '$100.00'
At some point (I think we may have already passed it) this becomes silly, and because it is so easy, you should just use simple encryption, where the message recipient always knows the key - the person's username, perhaps.
If the bids are fairly large numbers, how about a bitwise XOR with some predetermined random-ish number? XORing again will then retrieve the original value.
You can change the number as often as you like, as long as both client and server know it.
You could set a different base (like 16, 17, 18, etc.) and keep track of which base you've "sealed" the bid with...
Of course, this presumes large numbers (> the base you're using, at least). If they were decimal, you could drop the point (for example, 27.04 becomes 2704, which you then translate to base 29...)
You'd probably want to use base 17 to 36 (only because some people might recognize hex and be able to translate it in their head...)
This way, you would have numbers like G4 or Z3 or KW (depending on the numbers you're sealing)...
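A quick Python sketch of the base-switching idea, assuming the bid has already been turned into a non-negative integer (for example 27.04 becomes 2704) and that the chosen base travels with the sealed text. The helper names are made up; decoding relies on the built-in `int(text, base)`, which accepts bases up to 36.

```python
import random
import string

DIGITS = string.digits + string.ascii_uppercase  # 0-9 then A-Z, enough for base 36


def to_base(n: int, base: int) -> str:
    """Write a non-negative integer in the given base (2 to 36)."""
    if n < 0 or not 2 <= base <= 36:
        raise ValueError("need n >= 0 and 2 <= base <= 36")
    if n == 0:
        return "0"
    out = []
    while n:
        n, r = divmod(n, base)
        out.append(DIGITS[r])
    return "".join(reversed(out))


def seal(bid: int):
    base = random.randint(17, 36)  # skip bases people might read in their head
    return base, to_base(bid, base)


def unseal(base: int, text: str) -> int:
    return int(text, base)


base, text = seal(2704)
print(base, text, unseal(base, text))
```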
Here's a cheap way to piggyback off rot13:
Assume we have a function gibberish() that generates something like "fdjk alqef lwwqisvz" and a function words(x) that converts a number x to words, eg, words(42) returns "forty two" (no hyphens).
Then define
seal(x) = rot13(gibberish() + words(x) + gibberish())
and
unseal(x) = rot13(x)
Of course the output of unseal is not an actual number and is only useful to a human, but that might be ok.
You could make it a little more sophisticated with a words-to-number function that would also just throw away all the gibberish words (defined as anything that's not one of the number words; there are fewer than a hundred of those, I think).
Sanity check:
> seal(7)
fhrlls hqufw huqfha frira afsb ht ahuqw ajaijzji
> seal(7)
qbua adfshua hqgya ubiwi ahp wqwia qhu frira wge
> unseal(seal(7))
sueyyf udhsj seven ahkua snsfo ug nuhdj nwnvwmwv
I know this is silly but it's a way to do it "by hand" if all you have is rot13 available.
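For completeness, the rot13 trick can be sketched in Python with the standard `rot_13` text codec. The `NUMBER_WORDS` table is a tiny stand-in for the hypothetical words() function (it covers only 0 to 9), and `gibberish()` here just emits random lowercase letters; both are assumptions of this sketch.

```python
import codecs
import random
import string

# Stand-in for words(x); a real version would cover larger numbers.
NUMBER_WORDS = ["zero", "one", "two", "three", "four",
                "five", "six", "seven", "eight", "nine"]


def gibberish(count: int = 3) -> str:
    """Return a few random lowercase 'words'."""
    return " ".join(
        "".join(random.choice(string.ascii_lowercase)
                for _ in range(random.randint(2, 8)))
        for _ in range(count)
    )


def seal(x: int) -> str:
    plain = f"{gibberish()} {NUMBER_WORDS[x]} {gibberish()}"
    return codecs.encode(plain, "rot_13")


def unseal(sealed: str) -> str:
    return codecs.encode(sealed, "rot_13")  # rot13 is its own inverse


print(seal(7))
print(unseal(seal(7)))
```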