Finding which rev more closely matches a given code - mercurial

This has happened to me a couple of times:
I find a zip with an unversioned snapshot of the source code (i.e., the .hg folder was not included in the zip, or was removed). For varying reasons, I've had to figure out which rev this code belongs to, or which rev matches it "more closely" if the exact match is somewhere between two revs.
How can I find which rev more closely matches a given code?
I suppose I could write a script that cycles through every rev and counts the number of diff lines, but I wonder if this already exists.
Maybe some internal hg command is already able to do it? (And could do it a lot more efficiently than I could in a script.)
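The brute-force idea above can be sketched roughly as follows. This is only an illustration of the "count the diff lines for every rev" approach: it assumes each candidate revision has already been exported into an in-memory mapping of file path to text (for instance by running hg archive for each rev and reading the files back), and the names diff_line_count and closest_revision are hypothetical helpers, not Mercurial commands.

```python
import difflib


def diff_line_count(snapshot, candidate):
    """Count lines added or removed between two {path: text} mappings."""
    total = 0
    for path in set(snapshot) | set(candidate):
        old = candidate.get(path, "").splitlines()
        new = snapshot.get(path, "").splitlines()
        matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                # Lines dropped from the candidate plus lines new in the snapshot.
                total += (i2 - i1) + (j2 - j1)
    return total


def closest_revision(snapshot, revisions):
    """Return the rev whose files differ least from the snapshot."""
    return min(revisions, key=lambda rev: diff_line_count(snapshot, revisions[rev]))


# Made-up data: three revisions and a snapshot that sits closest to rev 11.
revisions = {
    10: {"main.py": "print('a')\n"},
    11: {"main.py": "print('a')\nprint('b')\n"},
    12: {"main.py": "print('a')\nprint('b')\nprint('c')\n", "util.py": "x = 1\n"},
}
snapshot = {"main.py": "print('a')\nprint('b')\nprint('B2')\n"}
best = closest_revision(snapshot, revisions)
```

On a real repository the expensive part is materialising every revision, so limiting the candidates (for example to a date range) would probably matter more than the diff itself.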


Mercurial, Get history of branch since last tag - including merged in commits

With Mercurial, I'm trying to get the history of a branch since the last tag.
BUT I want to include all the comments that were merged in as well.
Our devs usually create a branch, do some work, possibly multiple commits, then merge the branch back in.
Using: hg log -b . -r "last(tagged())::" --template "{desc|firstline}\n"
I'll get entries like "Merge" - with no information on what commits were included in that merge.
How do I get it to include the merged commits?
We also have multiple active branches, so just including ALL commits for ALL branches won't work.
There are at least two issues here that seem (after analysis) to stem from using -b .. There may be more issues as well. I will take them in some order, not necessarily the best one, perhaps even the worst one. :-)
Combining -b and last(set) seems unwise in general
Your -b . constraint means you get only commits that are on the current branch. If your revset would otherwise include commits on another branch, those commits will be excluded. Or, to put it another (more set-theoretic) way, using -b . within hg log is a bit like taking whatever revset specifier you have and adding:
(revset) & branch(.)
Simply asking this question, though, brought up one point I was unsure about: is the limiting done before calculating tagged(), or after? Some poking-about in hg --debugger tells me that it's "after", which means we get:
last(tagged()) & branch(.)
which means that if there are tags on, e.g., revs 1, 7, and 34, we'll select rev 34 first, then select revisions whose branch is the current branch. Suppose that rev 7 is a member of the current branch, but rev 34 is not. The result of the & is then the empty set.
That's probably not the issue here—the actual final expression is, or might as well be, branch(.) & descendants(last(tagged()))—but in at least some cases, it would probably be better to use:
last(tagged() & branch(.))
so that you start with the last revision that is both tagged, and on the current branch. (If this revlist is empty it's not clear what you should do next, but that is hard to program at this level, so let's just assume the revlist has one revision in it, e.g., rev 7 in our example here.)
This is probably not what you want after all, though; see the last section below.
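The difference between the two orderings can be modelled with plain Python sets. This is a toy model, not Mercurial's revset engine: revisions are bare integers, the tagged and on_branch sets are made up to match the example above (tags on revs 1, 7 and 34, with rev 34 on some other branch), and last is a hypothetical helper that imitates last(set) returning a single revision.

```python
def last(revs):
    """Imitate revset last(set): the highest-numbered member, as a set."""
    return {max(revs)} if revs else set()


tagged = {1, 7, 34}             # revisions carrying a tag
on_branch = {5, 6, 7, 8, 40}    # revisions on the current branch (assumed)

# Branch filter applied after last(): rev 34 is picked, then filtered away.
after = last(tagged) & on_branch

# Branch filter applied inside last(): only tagged revs on the branch compete.
inside = last(tagged & on_branch)
```

Here after is the empty set while inside is {7}, which is the behaviour described in the paragraphs above.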
Combining -b and a DAG range
A DAG-range operator like X::Y in Mercurial simply means: All commits/revisions that are descendants of X, including X itself, and ancestors of Y, including Y itself. Omitting Y entirely means all descendants of X. Without a -b limiter, you will get all such commits, but with -b ., you once again restrict yourself to those commits that are on the current branch.
If you merge commits within one branch, then, you will get the merge commits and their ancestors and descendants that are on this branch. (Remember that in Mercurial, any commit is on exactly one branch, forever: this is the branch that was current when the commit itself was made.) But if you are using branches at all in Mercurial, you are probably merging commits that are in other branches. If you want to see any of those commits, you cannot use -b . here.
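As a sketch of the definition above, X::Y can be computed on a small parent map as the intersection of the descendants of X and the ancestors of Y. The graph, the branch labels and the helper names are all invented for illustration; Mercurial's own implementation is different and far more efficient.

```python
# child -> list of parents; revs 3 and 4 sit on a hypothetical "feature" branch
parents = {0: [], 1: [0], 2: [1], 3: [1], 4: [3], 5: [2, 4], 6: [5]}
branch = {0: "default", 1: "default", 2: "default",
          3: "feature", 4: "feature", 5: "default", 6: "default"}


def ancestors(rev):
    """All ancestors of rev, including rev itself."""
    seen, stack = set(), [rev]
    while stack:
        r = stack.pop()
        if r not in seen:
            seen.add(r)
            stack.extend(parents[r])
    return seen


def descendants(rev):
    """All descendants of rev, including rev itself."""
    return {r for r in parents if rev in ancestors(r)}


def dag_range(x, y):
    """Toy model of the revset X::Y."""
    return descendants(x) & ancestors(y)


full = dag_range(1, 6)
# Rough equivalent of adding -b . while on the "default" branch.
limited = {r for r in full if branch[r] == "default"}
```

In this model full contains the merged-in revs 3 and 4, while limited loses them, which is exactly the information the question wants to keep.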
Getting what you want
Let's go back to your first statement above:
With Mercurial, I'm trying to get the history of a branch since the last tag. BUT I want to include all the [commits] that were merged in as well.
Let's draw a quick example or two and see which commits you might want.
Here's a horizontal graph, with newer commits toward the right. Each commit is represented by an o unless it is tagged, in which case it is represented by a *. There are several merges. Commits on the first two rows are on branch B1 and commits on the third row are on branch B2.
    o--o---o--o---o--*--o--o--o
b1:  \    /               /
      *--o               /
       \                /
b2:     o--*--o--o--o--o
It's not clear to me which commits you wish to see. The last tagged commit on b1 is the top row *, but there is a tagged commit on b2 as well (whose rev number is probably lower than the one on b1). Or suppose we had a slightly different graph, so that the highest numbered tagged revision were one on b2:
    o--o---o--o---*--o--o--o--o
b1:  \    /               /
      *--o               /
       \                /
b2:     o--o--*--o--o--o
If we use the expression last(tagged()) without any branch masking, we will choose the rightmost starred commit. If we then feed that into a DAG operator (e.g., as X in X:: or using descendants()), we get all the commits that are "after" that one.
When we start with the single starred commit on b2—as in the last graph—we get that commit and the remaining three commits that are on b2, plus the last two commits that are on b1. That may be what you want, but perhaps you also want some commits that are on b1 that come before the merge, but after (and maybe including) the final starred commit that is on b1 itself.
Note that this is what you get with just descendants(last(tagged())), i.e., if you remove -b . from your original hg log command.
When we start with the last starred commit on b1, though, as in the earlier graph, we get just that commit plus the final three commits on branch b1. None of the commits on branch b2 that get merged are descendants of the starred commit we chose. So the DAG-range approach itself is suspect, here. Still, if eliminating tagged commits that are directly on b1 suffices, note that we can use:
descendants(last(tagged() and not branch(b1)))
(there is no difference between and and & here, I just spelled it out because I spelled out not).
There is another possibility that I see here: perhaps you want any commits that are ancestors of the current branch's final commit, but stopping at:
any tagged commit, or
any predecessor merge for any other branch than the first-merged branch arrived-at by traversing ancestors.
Visualizing this last case requires a more complex branch topology, with more than two total named branches. Since it's (a) hard and (b) not at all clear to me that this is what you want, I'm not going to write an expression to produce it.

Do lower MediaWiki page revision IDs always mean earlier edits?

In general it seems true, at least for a single page, that lower revision IDs in MediaWiki page histories mean an earlier edit time. Is this true in general? Are there ever exceptions? How does revision ID minting work?
I am trying to write a function with Pywikipedia that will give the page text as of an arbitrary timestamp. It would be more efficient to sort on revision ID alone, rather than building a dict of revision IDs to timestamps and then sorting the timestamps.
I found the answer for this on IRC thanks to user:halfak. The answer is that there is no guarantee for at least two reasons.
If pages are imported from a secondary wiki, then the timestamps can be unrelated to the revision IDs.
If two edits occur within the same second, they will not be properly ordered, which happens sometimes.
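Given that, a lookup by time is safer when it compares timestamps and uses the revision ID only to break ties. The sketch below works on plain (revision ID, timestamp, text) tuples rather than on real Pywikipedia objects; fetching the history is left out, the function name and sample data are hypothetical, and falling back on the ID for same-second edits is a guess rather than something the wiki guarantees.

```python
from datetime import datetime


def text_as_of(revisions, when):
    """Return the text of the newest revision made at or before `when`.

    `revisions` is an iterable of (rev_id, timestamp, text) tuples, in any order.
    """
    candidates = [r for r in revisions if r[1] <= when]
    if not candidates:
        return None
    # Order by timestamp first; the revision ID only breaks same-second ties.
    return max(candidates, key=lambda r: (r[1], r[0]))[2]


history = [
    (900, datetime(2012, 5, 1, 12, 0, 0), "imported text"),  # high ID, old edit
    (100, datetime(2012, 6, 1, 9, 30, 0), "local edit A"),
    (101, datetime(2012, 6, 1, 9, 30, 0), "local edit B"),   # same second as 100
]
```

Sorting this sample by revision ID alone would treat the imported revision 900 as the newest, even though it carries the oldest timestamp.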

Find all log messages explaining differences between two changesets

I'd like to find all differences between two mercurial revisions. I'd primarily like to see the history of the differences (i.e. the changeset log messages), not the internal details of what changed in the files.
Example: compare revisions 105 and 106
    /---101---103---105
100         \
    \---102---104---106
Here, revision 106 includes changesets 106, 104 and 102, which 105 doesn't have, and 105 in turn includes 103 and 105, which 106 doesn't have. How can I easily get this list, ideally taking grafts into account too?
The following revision set query almost works:
(ancestors(105) - ancestors(106)) + (ancestors(106) - ancestors(105))
However, that's a fairly long query for something that seems like a fairly common question: why exactly does this branch differ from my local version? I also believe it fails to take into account grafts and it unfortunately includes uninteresting changesets such as merges.
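To check the query against the example graph, the two set differences can be reproduced with ordinary Python sets. The parent map below is read off the drawing, with 104 taken to be a merge of 102 and 101; the helper is a hypothetical stand-in for the revset function ancestors(), not Mercurial code.

```python
parents = {
    100: [], 101: [100], 102: [100], 103: [101],
    104: [102, 101], 105: [103], 106: [104],
}


def ancestors(rev):
    """All ancestors of rev, including rev itself."""
    seen, stack = set(), [rev]
    while stack:
        r = stack.pop()
        if r not in seen:
            seen.add(r)
            stack.extend(parents[r])
    return seen


only_105 = ancestors(105) - ancestors(106)
only_106 = ancestors(106) - ancestors(105)
differences = sorted(only_105 | only_106)

# Dropping merges (changesets with two parents) leaves the ones worth reading.
non_merges = [r for r in differences if len(parents[r]) < 2]
```

In revset terms the last filter would presumably be a matter of subtracting merge() from the query, though that still leaves the graft question open.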
Bonus points for including the git equivalent.
Edit: The reason I want this is to explain to humans how these versions differ. I've got a complex source tree, and I need to be able to tell people that version X includes features A & B and bugfix P, but version Y includes features C & D and bugfix Q - and that they're otherwise the same.
If I go back to my example: merges themselves aren't interesting (so in the example above 104 isn't interesting), but the changesets the merges consist of are very interesting - meaning 101 and 102. Merges combine lots of changes into one changeset that lacks reasonable log information. In particular, if I just find the nearest ancestor, I'd find 101, and then it'd look like 102 isn't of particular interest. In terms of the actual patches applied, this information is complete - I don't need to see how merge changeset 104 was constructed, only the result. However, if I want to know why it contains those changes, I need the log messages from 102.
Hrm, I've not tested it, but would:
ancestor(X,Y)::X + ancestor(X,Y)::Y
get you the same list? I think it would, and it would also likely be faster.

sequence of branch taken or not-taken that reduces the branch misprediction rate

Increasing the size of a branch prediction table implies that the two branches in a program are less likely to share a common predictor. A single predictor predicting a single branch instruction is generally more accurate than is the same predictor serving more than one branch instruction.
List a sequence of branch taken and not-taken actions to show a simple example of a 2-bit predictor sharing (several different branch instructions are mapped into the same entry of the prediction table) that reduces the branch misprediction rate, compared to the situation where separate predictor entries are used for each branch. (Note: Be sure to show the outcomes of two different branch instructions and specifically indicate the order of these outcomes and which branch they correspond to)
Can someone explain to me what this question is asking for specifically? Also, what does "2-bit predictor sharing (several different branch instructions are mapped into the same entry of the prediction table)" and "separate predictor entries are used for each branch" mean? I've been reading and rereading my notes but I couldn't figure it out. I tried to find some branch prediction examples online but couldn't come across any.
"2-bit predictor" could be referring to either of two things, but much more likely one than the other.
The unlikely possibility is that they mean a branch table with only four entries, so two bits are used to associate a particular branch with an entry in the table. That's unlikely because a 4-entry table is so small that lots of branches would share the same table entries, so the branch predictor wouldn't be much more accurate than static branch prediction (e.g., always predicting backward branches as taken, since they're typically used to form loops).
The much more likely possibility is using two bits to indicate whether a branch is likely to be taken or not. Some of the earliest microprocessors that included branch prediction (e.g., Pentium, PowerPC 604) worked roughly this way. The basic idea is that you keep a two-bit saturating counter, and make a prediction based on its current state. Intel called the states strongly not taken, weakly not taken, weakly taken, strongly taken. These would be numbered as (say) 0, 1, 2 and 3, so you can use a two-bit counter to track the states. Every time a branch is taken, you increment the number (unless it's already 3) and every time it's not taken, you decrement it (again, unless it's already 0). When you need to predict a branch, if the counter is 0 or 1 you predict the branch not taken, and if it's 2 or 3 you predict it taken [1].
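The counter described above can be written down in a few lines. This is a minimal model of the scheme as documented, not of any particular processor; the class and function names, the starting state and the sample loop pattern are all chosen for illustration.

```python
class TwoBitCounter:
    """Two-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state >= 2          # True means "predict taken"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)   # saturate at 3
        else:
            self.state = max(0, self.state - 1)   # saturate at 0


def mispredictions(outcomes, counter):
    """Run a list of taken/not-taken outcomes through one counter."""
    wrong = 0
    for taken in outcomes:
        if counter.predict() != taken:
            wrong += 1
        counter.update(taken)
    return wrong


# A typical loop branch: taken nine times, then it falls through once.
loop = [True] * 9 + [False]
misses = mispredictions(loop, TwoBitCounter(state=0))
```

Starting from state 0, the counter needs two taken branches to warm up and then misses only the final loop exit, which gives three mispredictions out of ten.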
A separate predictor entry used for each branch means each branch instruction in the program has its own entry in the branch prediction table. The alternative is some sort of mapping from branch instructions to table entries. For example, if you had a table with 2^20 entries, you could use 20 bits from a branch instruction's address, and use those bits as the index into the table. Assuming a machine with 32-bit addressing, and 32-bit instructions, you'd have up to 1024 branch instructions that could map to any one entry in the table (32 - 20 - 2 = 10, 2^10 = 1024). In reality you expect only a small percentage of instructions to be branches, some of the address space to be used for data, etc., so probably only a few branches would map to one entry in the table.
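The mapping and the arithmetic can be checked with a short sketch. Which 20 bits of the address are used is an assumption here (the two alignment bits are dropped and the next 20 bits are kept); the constant names, the function name and the sample addresses are hypothetical.

```python
INDEX_BITS = 20     # table with 2**20 entries
ALIGN_BITS = 2      # 32-bit instructions are 4-byte aligned


def table_index(address):
    """Drop the alignment bits, then keep the low INDEX_BITS bits."""
    return (address >> ALIGN_BITS) & ((1 << INDEX_BITS) - 1)


a = 0x00401000
b = a + (1 << (INDEX_BITS + ALIGN_BITS))    # differs only in the upper 10 bits

# Number of distinct 32-bit instruction addresses that share one entry.
aliases_per_entry = 1 << (32 - INDEX_BITS - ALIGN_BITS)
```

Addresses a and b land on the same table entry, so two branches at those addresses would share one predictor, while a and a + 4 get separate entries.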
As far as the basic question of what it's asking for: they want a sequence of branch instructions that will (by whatever coincidence) be predicted more accurately when two branches map to the same slot in the branch predictor table than when each maps to a separate slot in the table. To go into just slightly more detail (but hopefully without giving away the whole puzzle), start with a pattern of branches where the branch predictor will usually be wrong. What the predictor basically does is assume that if the branch was taken the last time, that indicates that it's more likely to be taken this time (and conversely, if it wasn't taken last time, it probably won't be this time either).
So, you start with a pattern of branches exactly the opposite of that. Then, you want to add a second branch mapping to the same spot in the branch prediction table that will follow a pattern of branches that will adjust the data in the branch predictor table so that it more accurately reflects the upcoming branch rather than the previous branch.
[1] Technically, the Pentium didn't actually work this way, but it's how it was documented to work, and probably intended to work; the discrepancy in how it actually did work seems to have been a bug.

Assignment of mercurial global changeset id

Apparently Mercurial assigns a global changeset id to each change. How do they ensure that this is unique?
As Zach says, the changeset ID is computed using the SHA-1 hash function. This is an example of a cryptographically secure hash function. Cryptographic hash functions take an input string of arbitrary length and produce a fixed-length digest from this string. In the case of SHA-1, the output length is fixed at 160 bits, of which Mercurial by default only shows you the first 48 bits (12 hexadecimal digits).
Cryptographic hash functions have the property that it is extremely difficult to find two different inputs that produce the same output, that is, it is hard to find strings x != y such that H(x) == H(y). This is called collision resistance.
Since Mercurial uses the SHA-1 function to compute the changeset ID, you get the same changeset ID for identical inputs (identical changes, identical committer names and dates). However, if you use different inputs (x != y) then you will get different outputs (changeset IDs) because of the collision resistance.
Put differently, if you do not get different changeset IDs for different inputs, then you have found a collision for SHA-1! Nobody will stumble on one by accident: the first publicly known SHA-1 collision (announced in 2017) had to be constructed deliberately with an enormous amount of computation, so this would be a major discovery.
In more detail, the SHA-1 hash function is used in a recursive way in Mercurial. Each changeset hash is computed by concatenating:
manifest ID
commit username
commit date
affected files
commit message
first parent changeset ID
second parent changeset ID
and then running SHA-1 on all this (see changelog.py and revlog.py). Because the hash function is used recursively, the changeset hash will fix the entire history all the way back to the root in the changeset graph.
This also means that you won't get the same changeset ID if you add the line Hello World! to two different projects at the same time with the same commit message: when their histories are different (different parent changesets), the two new changesets will get different IDs.
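The recursive construction can be imitated with hashlib. This is a simplified model rather than Mercurial's exact byte layout (the real code lives in the changelog.py and revlog.py files mentioned above): the function name, the newline separator, the all-zero stand-in for a missing parent and the sample values are assumptions made for illustration.

```python
import hashlib

NULL_ID = "0" * 40   # stand-in for "no parent" (assumed for this sketch)


def changeset_id(manifest_id, user, date, files, message, p1=NULL_ID, p2=NULL_ID):
    """Hash the changeset fields together with both parent IDs (toy model)."""
    fields = [manifest_id, user, date, " ".join(files), message, p1, p2]
    return hashlib.sha1("\n".join(fields).encode("utf-8")).hexdigest()


# Two projects whose histories start differently.
root_a = changeset_id("m0", "alice", "0 0", ["a.txt"], "initial")
root_b = changeset_id("m0", "bob", "0 0", ["a.txt"], "initial")

# The same change, by the same user at the same time, on top of each history.
child_a = changeset_id("m1", "carol", "10 0", ["a.txt"], "Hello World!", p1=root_a)
child_b = changeset_id("m1", "carol", "10 0", ["a.txt"], "Hello World!", p1=root_b)

short_a = child_a[:12]   # 12 hex digits, i.e. 48 of the 160 bits
```

Because each parent ID is part of the hashed input, child_a and child_b differ even though every other field is identical, and changing anything further back in either history would change them again.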
Mercurial's changeset IDs are SHA-1 hashes of the "manifest" for each changeset. It only prints the first dozen hex digits of the global ID, but it uses the full SHA-1 for internal operations. There's no actual guarantee that they are unique, but it is sufficiently unlikely for practical purposes.
See here for gory details.