Assignment of mercurial global changeset id - mercurial

Apparently Mercurial assigns a global changeset id to each change. How do they ensure that this is unique?

As Zach says, the changeset ID is computed using the SHA-1 hash function. This is an example of a cryptographically secure hash function. A cryptographic hash function takes an input string of arbitrary length and produces a fixed-length digest from it. In the case of SHA-1, the output length is fixed at 160 bits, of which Mercurial by default only shows you the first 48 bits (12 hexadecimal digits).
Cryptographic hash functions have the property that it is extremely difficult to find two different inputs that produce the same output, that is, it is hard to find strings x != y such that H(x) == H(y). This is called collision resistance.
Since Mercurial uses the SHA-1 function to compute the changeset ID, you get the same changeset ID for identical inputs (identical changes, identical committer names and dates). However, if you use different inputs (x != y), then you will get different outputs (changeset IDs) because of the collision resistance.
Put differently, if you do not get different changeset IDs for different input, then you have found a collision for SHA-1! So far, nobody has ever found a collision for SHA-1, so this would be a major discovery.
In more detail, the SHA-1 hash function is used in a recursive way in Mercurial. Each changeset hash is computed by concatenating:
manifest ID
commit username
commit date
affected files
commit message
first parent changeset ID
second parent changeset ID
and then running SHA-1 on all this (see changelog.py and revlog.py). Because the hash function is used recursively, the changeset hash will fix the entire history all the way back to the root in the changeset graph.
This also means that you won't get the same changeset ID if you add the line Hello World! to two different projects at the same time with the same commit message -- when their histories are different (different parent changesets), the two new changesets will get different IDs.
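The recursive construction can be sketched in Python. This is only an illustration of the idea, not Mercurial's exact byte layout (for that, see changelog.py and revlog.py):

```python
import hashlib

def changeset_id(manifest_id, user, date, files, message,
                 parent1, parent2):
    # Concatenate the same ingredients Mercurial uses, then SHA-1 them.
    # The parent IDs are themselves hashes, so the result fixes the
    # entire history back to the root of the changeset graph.
    payload = "\n".join(
        [manifest_id, user, date, ",".join(files), message,
         parent1, parent2]
    ).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()

NULL_ID = "0" * 40
root = changeset_id(NULL_ID, "alice", "2010-01-01", ["a.txt"],
                    "initial commit", NULL_ID, NULL_ID)
# The same change on top of a different parent gets a different ID:
child_a = changeset_id(NULL_ID, "alice", "2010-01-02", ["a.txt"],
                       "add Hello World!", root, NULL_ID)
child_b = changeset_id(NULL_ID, "alice", "2010-01-02", ["a.txt"],
                       "add Hello World!", NULL_ID, NULL_ID)
```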

Mercurial's changeset IDs are SHA-1 hashes of the "manifest" for each changeset. It only prints the first dozen hex digits of the global ID, but it uses the full SHA-1 for internal operations. There's no actual guarantee that they are unique, but it is sufficiently unlikely for practical purposes.

Related

Mercurial, Get history of branch since last tag - including merged in commits

With Mercurial, I'm trying to get the history of a branch since the last tag.
BUT I want to include all the commits that were merged in as well.
Our devs usually create a branch, do some work, possibly multiple commits, then merge the branch back in.
Using: hg log -b . -r "last(tagged())::" --template "{desc|firstline}\n"
I'll get entries like "Merge" - with no information on what commits were included in that merge.
How do I get it to include the merged commits?
We also have multiple active branches, so just including ALL commits for ALL branches won't work.
There are at least two issues here that seem (after analysis) to stem from using -b .. There may be more issues as well. I will take them in some order, not necessarily the best one, perhaps even the worst one. :-)
Combining -b and last(set) seems unwise in general
Your -b . constraint means you get only commits that are on the current branch. If your revset would otherwise include commits on another branch, those commits will be excluded. Or, to put it another (more set-theoretic) way, using -b . within hg log is a bit like taking whatever revset specifier you have and adding:
(revset) & branch(.)
—though simply asking this question brought up one point I was unsure about: is the limiting done before calculating tagged(), or after? Some poking-about in hg --debugger tells me that it's "after", which means we get:
last(tagged()) & branch(.)
which means that if there are tags on, e.g., revs 1, 7, and 34, we'll select rev 34 first, then select revisions whose branch is the current branch. Suppose that rev 7 is a member of the current branch, but rev 34 is not. The result of the & is then the empty set.
That's probably not the issue here—the actual final expression is, or might as well be, branch(.) & descendants(last(tagged()))—but in at least some cases, it would probably be better to use:
last(tagged() & branch(.))
so that you start with the last revision that is both tagged, and on the current branch. (If this revlist is empty it's not clear what you should do next, but that is hard to program at this level, so let's just assume the revlist has one revision in it, e.g., rev 7 in our example here.)
This is probably not what you want after all, though; see the last section below.
Combining -b and a DAG range
A DAG-range operator like X::Y in Mercurial simply means: all commits/revisions that are descendants of X, including X itself, and ancestors of Y, including Y itself. Omitting Y entirely means all descendants of X. Without a -b limiter, you will get all such commits, but with -b ., you once again restrict yourself to those commits that are on the current branch.
If you merge commits within one branch, then, you will get the merge commits and their ancestors and descendants that are on this branch. (Remember that in Mercurial, any commit is on exactly one branch, forever: this is the branch that was current when the commit itself was made.) But if you are using branches at all in Mercurial, you are probably merging commits that are in other branches. If you want to see any of those commits, you cannot use -b . here.
Getting what you want
Let's go back to your first statement above:
With Mercurial, I'm trying to get the history of a branch since the last tag. BUT I want to include all the [commits] that were merged in as well.
Let's draw a quick example or two and see which commits you might want.
Here's a horizontal graph, with newer commits toward the right. Each commit is represented by an o unless it is tagged, in which case it is represented by a *. There are several merges. Commits on the first two rows are on branch B1 and commits on the third row are on branch B2.
    o--o---o--o---o--*--o--o--o
b1:  \    /               /
      *--o               /
       \                /
b2:     o--*--o--o--o--o
It's not clear to me which commits you wish to see. The last tagged commit on b1 is the top row *, but there is a tagged commit on b2 as well (whose rev number is probably lower than the one on b1). Or suppose we had a slightly different graph, so that the highest numbered tagged revision were one on b2:
    o--o---o--o---*--o--o--o--o
b1:  \    /               /
      *--o               /
       \                /
b2:     o--o--*--o--o--o
If we use the expression last(tagged()) without any branch masking, we will choose the rightmost starred commit. If we then feed that into a DAG operator (e.g., as X in X:: or using descendants()), we get all the commits that are "after" that one.
When we start with the single starred commit on b2—as in the last graph—we get that commit and the remaining three commits that are on b2, plus the last two commits that are on b1. That may be what you want, but perhaps you also want some commits that are on b1 that come before the merge, but after (and maybe including) the final starred commit that is on b1 itself.
Note that this is what you get with just descendants(last(tagged())), i.e., if you remove -b . from your original hg log command.
When we start with the last starred commit on b1, though, as in the earlier graph, we get just that commit plus the final three commits on branch b1. None of the commits on branch b2 that get merged are descendants of the starred commit we chose. So the DAG-range approach itself is suspect, here. Still, if eliminating tagged commits that are directly on b1 suffices, note that we can use:
descendants(last(tagged() and not branch(b1)))
(there is no difference between and and & here, I just spelled it out because I spelled out not).
There is another possibility that I see here: perhaps you want any commits that are ancestors of the current branch's final commit, but stopping at:
any tagged commit, or
any predecessor merge for any other branch than the first-merged branch arrived-at by traversing ancestors.
Visualizing this last case requires a more complex branch topology, with more than two total named branches. Since it's (a) hard and (b) not at all clear to me that this is what you want, I'm not going to write an expression to produce it.

Analysis of open addressing

I am currently learning hash tables from "Introduction to Algorithms, 3rd edition". I got quite confused while trying to understand open addressing from a statistical point of view. Linear probing and quadratic probing can only generate m possible probe sequence, assuming m is hash table length. However, as defined in open addressing, the possible key value number is greater than the number of hash values, i.e. load factor n/m< 1. In reality, if the hash function is predefined, there exists only n possible probe sequence, which is less than m. The same thing applies to double hashing. If the book says, one hash function is randomly chosen from a set of universal hash functions, then, I can understand. Without introducing randomness in open addressing analysis, the analysis of its performance based on universal hashing is obscured. I have never used hash table in practice, maybe I dive too much into the details. But I also have such doubt in hash table's practical usage:
Q: In reality, if the load factor is less than 1, why would we bother open addressing ? Why not project each key to an integer and arrange them in an array ?
Q: In reality, if the load factor is less than 1, why would we bother open addressing? Why not project each key to an integer and arrange them in an array ?
Because in many situations when hash tables are used, there's no good O(1) way to "project each key to an [distinct, not-absurdly-sparse] integer" array index.
A simple thought experiment illustrates this: say you expect the user to type four three-uppercase-letter keys, and you want to store them somewhere in an array with dimension 10. You have 26³ possible inputs, so no matter what your logic is, on average 26³/10 of them will "project... to an integer" indicating the same array position. When you realise the "project[ion]" can't avoid potential "collisions", that projection is a logically identical operation to "hashing" and modding to a "bucket", and that some collision-handling logic will be needed, your proposed "alternative" morphs back into a hash table....
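The pigeonhole arithmetic from the thought experiment can be checked directly (the projection rule here is an arbitrary example; any deterministic rule has the same problem):

```python
from itertools import product

# Every possible three-uppercase-letter key, "projected" into an array
# of dimension 10 by some fixed rule -- here, summing character codes
# modulo 10.
slots = [0] * 10
for letters in product("ABCDEFGHIJKLMNOPQRSTUVWXYZ", repeat=3):
    index = sum(ord(c) for c in letters) % 10
    slots[index] += 1

# 26**3 == 17576 keys into 10 positions: on average ~1758 keys share
# each position, so collision handling is unavoidable.
```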
Linear probing and quadratic probing can only generate m possible probe sequence, assuming m is hash table length. However, as defined in open addressing, the possible key value number is greater than the number of hash values, i.e. load factor n/m< 1.
They are very confusing statements. The "number of hash values" is not arbitrarily limited - you could use a 32 bit hash generating any of ~4 billion hash values, a 512-bit hash, or whatever other size you feel like. Given the structure of your statement is "a > b, i.e. load factor n/m < 1", and "n/m < 1" can be rewritten as "n < m" or "m > n", you imply "a" and "m" are meant to be the same thing, as are "b" and "n":
you're referring to m - which "load factor n/m" requires be the number of buckets in the hash table - as "the possible key value number": it's not, and what could that even mean?
you're referring to n - which "load factor n/m" requires be the number of keys stored in the hash table - as "the number of hash values": it's not, except in the trivial sense of that many (not necessarily distinct) hash values being generated when the keys are hashed
In reality, if the hash function is predefined, there exists only n possible probe sequence, which is less than m.
Again, that's a very poorly defined statement. The hashing of n keys can identify at most n distinct buckets from which collision-handling would kick in, but those n could begin pretty much anywhere within the m buckets, given the hash function's job is to spray them around. And, so what?
The same thing applies to double hashing. If the book says, one hash function is randomly chosen from a set of universal hash functions, then, I can understand.
Understand what?
Without introducing randomness in open addressing analysis, the analysis of its performance based on universal hashing is obscured.
For sure. "Repeatable randomness" of hashing is a very convenient and tangible benchmark against which specific implementations can be compared.
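For reference, the probe sequence of linear probing is fully determined by the starting bucket, which is where the asker's "only m probe sequences" observation comes from; a quick sketch:

```python
def probe_sequence(start, m):
    # Linear probing: from the initial bucket, step through the table
    # one slot at a time, wrapping around at the end.
    return [(start + i) % m for i in range(m)]

m = 8
sequences = {tuple(probe_sequence(h, m)) for h in range(m)}
# Only m distinct sequences exist -- far fewer than the m! orderings
# assumed by an idealised uniform-hashing analysis.
```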

sequence of branch taken or not-taken that reduces the branch misprediction rate

Increasing the size of a branch prediction table implies that the two branches in a program are less likely to share a common predictor. A single predictor predicting a single branch instruction is generally more accurate than is the same predictor serving more than one branch instruction.
List a sequence of branch taken and not-taken actions to show a simple example of a 2-bit predictor sharing (several different branch instructions are mapped into the same entry of the prediction table) that reduces the branch misprediction rate, compared to the situation where separate predictor entries are used for each branch. (Note: Be sure to show the outcomes of two different branch instructions and specifically indicate the order of these outcomes and which branch they correspond to)
Can someone explain to me what this question is asking for specifically? Also, what does "2-bit predictor sharing (several different branch instructions are mapped into the same entry of the prediction table)" and "separate predictor entries are used for each branch" mean? I've been reading and rereading my notes but I couldn't figure it out. I tried to find some branch prediction examples online but couldn't come across any.
"2-bit predictor" could be referring to either of two things, but much more likely one than the other.
The unlikely possibility is that they mean a branch table with only four entries, so two bits are used to associate a particular branch with an entry in the table. That's unlikely because a 4-entry table is so small that lots of branches would share the same table entries, so the branch predictor wouldn't be much more accurate than static branch prediction (e.g., always predicting backward branches as taken, since they're typically used to form loops).
The much more likely possibility is using two bits to indicate whether a branch is likely to be taken or not. Some of the earliest microprocessors that included branch prediction (e.g., Pentium, PowerPC 604) worked roughly this way. The basic idea is that you keep a two-bit saturating counter, and make a prediction based on its current state. Intel called the states strongly not taken, weakly not taken, weakly taken, and strongly taken. These would be numbered as (say) 0, 1, 2 and 3, so you can use a two-bit counter to track the states. Every time a branch is taken, you increment the number (unless it's already 3) and every time it's not taken, you decrement it (again, unless it's already 0). When you need to predict a branch, if the counter is 0 or 1 you predict the branch not taken, and if it's 2 or 3 you predict it taken.¹
A separate predictor entry used for each branch means each branch instruction in the program has its own entry in the branch prediction table. The alternative is some sort of mapping from branch instructions to table entries. For example, if you had a table with 2²⁰ entries, you could use 20 bits from a branch instruction's address, and use those bits as the index into the table. Assuming a machine with 32-bit addressing, and 32-bit instructions, you'd have up to 1024 branch instructions that could map to any one entry in the table (32 − 20 − 2 = 10, 2¹⁰ = 1024). In reality you expect only a small percentage of instructions to be branches, some of the address space to be used for data, etc., so probably only a few branches would map to one entry in the table.
As far as the basic question of what it's asking for: they want a sequence of branch instructions that will (by what amounts to coincidence) be predicted more accurately when two branches map to the same slot in the branch predictor table than when/if each maps to a separate slot in the table. To go into just slightly more detail (but hopefully without giving away the whole puzzle), start with a pattern of branches where the branch predictor will usually be wrong. What the predictor basically does is assume that if the branch was taken the last time, that indicates that it's more likely to be taken this time (and conversely, if it wasn't taken last time, it probably won't be this time either).
So, you start with a pattern of branches exactly the opposite of that. Then, you want to add a second branch mapping to the same spot in the branch prediction table that will follow a pattern of branches that will adjust the data in the branch predictor table so that it more accurately reflects the upcoming branch rather than the previous branch.
¹ Technically, the Pentium didn't actually work this way, but it's how it was documented to work, and probably intended to work; the discrepancy in how it actually did work seems to have been a bug.
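For illustration, the two-bit saturating counter described above can be sketched like this (the branch outcome pattern is just an example, not part of the exercise's answer):

```python
def predict(state):
    # States 0..3: strongly not taken, weakly not taken,
    # weakly taken, strongly taken. 2 or above means "predict taken".
    return state >= 2

def update(state, taken):
    # Saturating counter: increment on taken, decrement on not taken,
    # clamped to the 0..3 range.
    return min(state + 1, 3) if taken else max(state - 1, 0)

# A loop branch taken 7 times and then falling through once:
state, mispredictions = 3, 0
for taken in [True] * 7 + [False]:
    if predict(state) != taken:
        mispredictions += 1
    state = update(state, taken)
# Only the final, not-taken branch is mispredicted; the counter stays
# in "strongly taken" throughout the loop body.
```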

Mercurial update to local revision or hash changeset

I use Mercurial and I have a weird problem: my history is very big, and the local revision numbers now have 5 digits.
In Mercurial you can execute "hg up <rev>" and it can choose between the local revision number or the changeset hash (I had no idea what policy it uses to choose between the two). In my case, a local revision number coincides with the first 5 characters of another changeset hash. For example:
I want to update to the local revision: 80145
If i execute:
"hg up 80145"
Mercurial doesn't update to the revision I want; it updates to an older one, because that revision's changeset hash is:
801454d1cd5e
So, does anyone know if there is a way to specify which type of revision you want to update to: local revision number or changeset hash?
Thanks all!
====
Problem solved. After some investigation I realized that Mercurial always updates to the local revision number if it exists, and to the changeset hash otherwise.
In my case the local revision number didn't exist, so it was updating to the changeset hash.
Sounds like you found your own answer (and should enter it as an answer instead of a comment and then select it -- that's not just allowed but encouraged around here), but for reference here's where that information lived:
$ hg help revisions
Specifying Single Revisions
Mercurial supports several ways to specify individual revisions.
A plain integer is treated as a revision number. Negative integers are
treated as sequential offsets from the tip, with -1 denoting the tip, -2
denoting the revision prior to the tip, and so forth.
A 40-digit hexadecimal string is treated as a unique revision identifier.
A hexadecimal string less than 40 characters long is treated as a unique
revision identifier and is referred to as a short-form identifier. A
short-form identifier is only valid if it is the prefix of exactly one
full-length identifier.
Any other string is treated as a bookmark, tag, or branch name. A bookmark
is a movable pointer to a revision. A tag is a permanent name associated
with a revision. A branch name denotes the tipmost revision of that
branch. Bookmark, tag, and branch names must not contain the ":"
character.
The reserved name "tip" always identifies the most recent revision.
The reserved name "null" indicates the null revision. This is the revision
of an empty repository, and the parent of revision 0.
The reserved name "." indicates the working directory parent. If no
working directory is checked out, it is equivalent to null. If an
uncommitted merge is in progress, "." is the revision of the first parent.
So, as you found, the first interpretation tried was as a revision number, and when that didn't match anything it was tried as the prefix of a revision id. In theory this could happen even with the number 1, if your only changeset was revision 0 and its hash started with 1.

1-1 mappings for id obfuscation

I'm using sequential ids as primary keys and there are cases where I don't want those ids to be visible to users, for example I might want to avoid urls like ?invoice_id=1234 that allow users to guess how many invoices the system as a whole is issuing.
I could add a database field with a GUID or something conjured up from hash functions, random strings and/or numeric base conversions, but schemes of that kind have three issues that I find annoying:
Having to allocate the extra database field. I know I could use the GUID as my primary key, but my auto-increment integer PK's are the right thing for most purposes, and I don't want to change that.
Having to think about the possibility of hash/GUID collisions. I give my full assent to all the arguments about GUID collisions being as likely as spontaneous combustion or whatever, but disregarding exceptional cases because they're exceptional goes against everything else I've been taught, and it continues to bother me even when I know I should be more bothered about other things.
I don't know how to safely trim hash-based identifiers, so even if my private ids are 16 or 32 bits, I'm stuck with 128 bit generated identifiers that are a nuisance in urls.
I'm interested in 1-1 mappings of an id range, stretchable or shrinkable so that for example 16-bit ids are mapped to 16 bit ids, 32 bit ids mapped to 32 bit ids, etc, and that would stop somebody from trying to guess the total number of ids allocated or the rate of id allocation over a period.
For example, if my user ids are 16 bit integers (0..65535), then an example of a transformation that somewhat obfuscates the id allocation is the function f(x) = (x × 1001) mod 65536. The internal id sequence of 1, 2, 3 becomes the public id sequence of 1001, 2002, 3003. With a further layer of obfuscation from base conversion, for example to base 36, the sequence becomes 'rt', '1jm', '2bf'. When the system gets a request to the url ?userid=2bf, it converts from base 36 to get 3003 and it applies the inverse transformation g(x) = (x × 1113) mod 65536 to get back to the internal id=3.
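A minimal sketch of that transformation, using the numbers from the example above:

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def obfuscate(x):
    return (x * 1001) % 65536

def deobfuscate(y):
    # 1113 is the multiplicative inverse of 1001 modulo 65536:
    # (1001 * 1113) % 65536 == 1
    return (y * 1113) % 65536

def to36(q):
    # Convert a non-negative integer to base 36.
    s = ""
    while True:
        q, r = divmod(q, 36)
        s = DIGITS[r] + s
        if q == 0:
            return s

# internal id 3 -> public id "2bf", and back again
public = to36(obfuscate(3))
internal = deobfuscate(int(public, 36))
```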
A scheme of that kind is enough to stop casual observation by casual users, but it's easily solvable by someone who's interested enough to try to puzzle it through. Can anyone suggest something that's a bit stronger, but is easily implementable in say PHP without special libraries? This is getting close to a roll-your-own encryption scheme, so maybe there is a proper encryption algorithm that's widely available and has the stretchability property mentioned above?
EDIT: Stepping back a little bit, some discussion at codinghorror about choosing from three kinds of keys - surrogate (guid-based), surrogate (integer-based), natural. In those terms, I'm trying to hide an integer surrogate key from users but I'm looking for something shrinkable that makes urls that aren't too long, which I don't know how to do with the standard 128-bit GUID. Sometimes, as commenter Princess suggests below, the issue can be sidestepped with a natural key.
EDIT 2/SUMMARY:
Given the constraints of the question I asked (stretchability, reversibility, ease of implementation), the most suitable solution so far seems to be the XOR-based obfuscation suggested by Someone and Breton.
It would be irresponsible of me to assume that I can achieve anything more than obfuscation/security by obscurity. The knowledge that it's an integer sequence is probably a crib that any competent attacker would be able to take advantage of.
I've given some more thought to the idea of the extra database field. One advantage of the extra field is that it makes it a lot more straightforward for future programmers who are trying to familiarise themselves with the system by looking at the database. Otherwise they'd have to dig through the source code (or documentation, ahem) to work out how a request to a given url is resolved to a given record in the database.
If I allow the extra database field, then some of the other assumptions in the question become irrelevant (for example the transformation doesn't need to be reversible). That becomes a different question, so I'll leave it there.
I find that simple XOR encryption is best suited for URL obfuscation. You can continue using whatever serial number you are using without change. Further, XOR encryption doesn't increase the length of the source string. If your text is 22 bytes, the encrypted string will be 22 bytes too. It's not as trivially guessable as ROT13, but not heavyweight like DES/RSA.
Search the net for PHP XOR encryption to find some implementation.
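A minimal sketch of the idea (the key value is an arbitrary example; PHP implementations usually XOR strings byte by byte, but the principle is identical):

```python
KEY = 0x5CA1  # arbitrary 16-bit example key

def xor_obfuscate(n):
    # XOR with a fixed key is its own inverse, and the result never
    # needs more bits than the wider of the input and the key.
    return n ^ KEY

public = xor_obfuscate(1234)
original = xor_obfuscate(public)  # applying it twice restores the id
```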
I've toyed with this sort of thing myself, in my amateurish way, and arrived at a kind of kooky number scrambling algorithm, involving mixed radices. Basically I have a function that maps a number between 0-N to another number in the 0-N range. For URLS I then map that number to a couple of english words. (words are easier to remember).
A simplified version of what I do, without mixed radices: You have a number that is 32 bits, so ahead of time, have a passkey which is 32 bits long, and XOR the passkey with your input number. Then shuffle the bits around in a deterministic reordering (possibly based on your passkey).
The nice thing about this is
No collisions, as long as you shuffle and xor the same way each time
No need to store the obfuscated keys in the database
Still use your ordered IDS internally, since you can reverse the obfuscation
You can repeat the operation several times to get more obfuscated results.
If you're up for the mixed-radix version, it's basically the same, except that I add the steps of converting the input to a mixed-radix number, using the maximum range's prime factors as the digit bases. Then I shuffle the digits around, keeping the bases with the digits, and turn it back into a standard integer.
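A minimal sketch of the XOR-and-shuffle step, without the mixed radices (the passkey and the bit permutation are arbitrary examples):

```python
KEY = 0x5A5A5A5A                          # example 32-bit passkey
PERM = [(i * 7) % 32 for i in range(32)]  # bit i moves to bit (7*i) % 32;
                                          # 7 is coprime to 32, so this is
                                          # a permutation of 0..31
INV = [0] * 32
for src, dst in enumerate(PERM):
    INV[dst] = src                        # inverse permutation

def shuffle(n, perm):
    # Move bit src of n to position perm[src].
    out = 0
    for src, dst in enumerate(perm):
        out |= ((n >> src) & 1) << dst
    return out

def obfuscate(n):
    return shuffle(n ^ KEY, PERM)

def deobfuscate(n):
    # Undo the shuffle, then undo the XOR.
    return shuffle(n, INV) ^ KEY
```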
You might find it useful to revisit the idea of using a GUID, because you can construct GUIDs in a way that isn't subject to collision.
Check out the Wikipedia page on GUIDs - the "Type 1" algorithm uses both the MAC address of the PC and the current date/time as inputs. This makes collisions effectively impossible in practice.
Alternatively, if you create a GUID column in your database as an alternative-key (keep using your auto-increment primary keys), define it as unique. Then, if your GUID generation approach does give a duplicate, you'll get an appropriate error on insert that you can handle.
I saw this question yesterday: how reddit generates an alphanum id
I think it's a reasonably good method (and particularly clever)
it uses Python
def to_base(q, alphabet):
    if q < 0:
        raise ValueError("must supply a positive integer")
    l = len(alphabet)
    converted = []
    while q != 0:
        q, r = divmod(q, l)
        converted.insert(0, alphabet[r])
    return "".join(converted) or '0'

def to36(q):
    return to_base(q, '0123456789abcdefghijklmnopqrstuvwxyz')
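To resolve such an id from a URL you also need the inverse mapping. A sketch, repeating to_base for completeness in Python 3 syntax (from_base is a hypothetical helper, not part of the reddit code):

```python
def to_base(q, alphabet):
    if q < 0:
        raise ValueError("must supply a positive integer")
    l = len(alphabet)
    converted = []
    while q != 0:
        q, r = divmod(q, l)
        converted.insert(0, alphabet[r])
    return "".join(converted) or '0'

def from_base(s, alphabet):
    # Inverse of to_base: fold each character's index back in.
    n = 0
    for ch in s:
        n = n * len(alphabet) + alphabet.index(ch)
    return n

ALPHABET = '0123456789abcdefghijklmnopqrstuvwxyz'
# from_base(to_base(q, ALPHABET), ALPHABET) == q for any q >= 0
```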
Add a char(10) field to your order table... call it 'order_number'.
After you create a new order, randomly generate an integer from 1...9999999999. Check to see if it exists in the database under 'order_number'. If not, update your latest row with this value. If it does exist, pick another number at random.
Use 'order_number' for publicly viewable URLs, maybe always padded with zeros.
There's a race condition concern for when two threads attempt to add the same number at the same time... you could do a table lock if you were really concerned, but that's a big hammer. Add a second check after updating, re-select to ensure it's unique. Call recursively until you get a unique entry. Dwell for a random number of milliseconds between calls, and use the current time as a seed for the random number generator.
UPDATED: As with the GUID approach described by Bevan, if the column is constrained as unique, then you don't have to sweat it. I guess this is no different than using a GUID, except that the customer and Customer Service will have an easier time referring to the order.
I've found a much simpler way. Say you want to map N digits pseudorandomly to N digits. You find the next highest prime after N, and you make your function
prandmap(x) = x * nextPrime(N) % N
this will produce a function that repeats (or has a period) every N; no number is produced twice until x = N. It always starts at 0, but is pseudorandom thereafter.
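A quick sketch of that mapping, with N = 1000 (chosen so that the next prime, 1009, is not N + 1; note that when the next prime happens to be N + 1, as with N = 65536 and 65537, the map degenerates to the identity):

```python
def next_prime(n):
    # Smallest prime greater than n (trial division; fine for a demo).
    def is_prime(k):
        if k < 2:
            return False
        d = 2
        while d * d <= k:
            if k % d == 0:
                return False
            d += 1
        return True
    k = n + 1
    while not is_prime(k):
        k += 1
    return k

def prandmap(x, n):
    # next_prime(n) is coprime to n, so x -> x * p mod n is a
    # bijection on 0..n-1: no value repeats within one period.
    return (x * next_prime(n)) % n

N = 1000  # next_prime(1000) == 1009, so the multiplier is 9 mod 1000
mapped = [prandmap(x, N) for x in range(N)]
# every value 0..N-1 appears exactly once; 0 maps to 0
```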
I honestly think encrypting/decrypting query string data is a bad approach to this problem. The easiest solution is sending data using POST instead of GET. If users are clicking on links with querystring data, you have to resort to some javascript hacks to send data by POST (keep accessibility in mind for users with Javascript turned off). This doesn't prevent users from viewing source, but at the very least it keeps sensitive data from being indexed by search engines, assuming the data you're trying to hide is really that sensitive in the first place.
Another approach is to use a natural unique key. For example, if you're issuing invoices to customers on a monthly basis, then "yyyyMM[customerID]" uniquely identifies a particular invoice for a particular user.
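A sketch of that kind of natural key (the field widths and order are assumptions):

```python
def invoice_key(year, month, customer_id):
    # "yyyyMM" plus the customer id uniquely names one monthly invoice
    # per customer, with no sequential counter to leak volume information.
    return f"{year:04d}{month:02d}{customer_id}"

key = invoice_key(2009, 3, 42)  # -> "20090342"
```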
From your description, personally, I would start off by working with whatever standard encryption library is available (I'm a Java programmer, but I assume, say, a basic AES encryption library must be available for PHP):
on the database, just key things as you normally would
whenever you need to transmit a key to/from a client, use a fairly strong, standard encryption system (e.g. AES) to convert the key to/from a string of garbage. As your plain text, use a (say) 128-byte buffer containing: a (say) 4-byte key, 60 random bytes, and then a 64-byte medium-quality hash of the previous 64 bytes (see Numerical Recipes for an example)-- obviously when you receive such a string, you decrypt it then check if the hash matches before hitting the DB. If you're being a bit more paranoid, send an AES-encrypted buffer of random bytes with your key in an arbitrary position, plus a secure hash of that buffer as a separate parameter. The first option is probably a reasonable tradeoff between performance and security for your purposes, though, especially when combined with other security measures.
the day that you're processing so many invoices a second that AES-encrypting them in transit is too expensive, go out and buy yourself a big fat server with lots of CPUs to celebrate.
Also, if you want to hide that the variable is an invoice ID, you might consider calling it something other than "invoice_id".