Mercurial: which addremove --similarity default is recommended? - mercurial

For Hg's addremove command, I'm looking for a reasonable --similarity default.
Which value works well for all kinds of files?
Which value works well for which programming language?

There is no one good value. If there was, it would already be the default. Language and filetypes are largely irrelevant. What matters is the size of the file relative to the size of the changes you're making.
If you are simply moving files around without edits, 100 is the best answer. Mercurial probably won't make any mistakes guessing what adds and removes were actually renames.
If you are making small edits to large files while moving them around, then 90 might work well. Mercurial probably won't be fooled unless you've got some very similar files.
If you're making large changes to small files, you might need to go down to 50 to guess your rename. But the odds of Mercurial mistaking two different but similar files is now quite large.
(For comparison, Git uses a 50% heuristic by default for diffs and merges, but since it doesn't actually record renames in history, there's less permanent downside of guessing wrong.)

You can find out a suitable value for a specific situation by running Mercurial with the option --dry-run. Mercurial won't perform any action but show what it would do. You also see the similarity of each file, so you can adjust your value before performing the real action.
Example:
hg addremove --similarity 50 --dry-run
...
recording removal of ../contact_form/views.py as rename to contact_form/views.py (98% similar)

Note that the bug 3430 and the release note 2012-06-01 mention for addremove:
addremove: document default similarity behavior -s100
So, while not the "best value", -s100 (identical, simple moves without edits) is officially the default one.

Related

Is it safe to use Mercurial Queues and the Share extension together as long as your working directories are on separate branches?

I've thought this through and I think I understand the implications, but I wanted to get a sanity check because the caveats on https://www.mercurial-scm.org/wiki/ShareExtension are pretty general.
Specifically, the warning is "It's probably not a good idea to mix MQ and shared clones; if you do so, you should definitely avoid pushing/popping patches in one clone while another clone has patches applied."
However based on my understanding of how Mq works, it's only unsafe to push/pop patches (create/destroy history) if you have two shares whose working directory parent would be affected by such changes. That is, if you have two shares that are updated to separate named branches, pushing/popping patches from one should only have the effect on the other of creating/destroying history that is unrelated to the working directory and thus should NOT have any undesirable side-effects.
There will be small side effects, such as revision sequence number changes in some situations, but nothing that should jeopardize correctness or cause problems with the working directory.
Is this correct or am I missing something?
I'm not sure about this, but AFAIK if you end up having the exact same file content in both branches (repos), this might still wind up as shared storage and wreak havoc.
This is obviously not a definitive answer, but I just wanted to report back in case anyone else is interested in this situation. I've been running for several months now with multiple shares on a "central copy" of a large repo, with each share being dedicated to its own branch, and using MQ freely within each share. I have not hit any problems. History changes on other branches just look the same as pulls/strips would -- unrelated changesets being added, modified, and removed.

Optimizing Mercurial repository size

How does one optimize Mercurial repositories so that older revisions take the minimum required space?
I am aware that Mercurial does some magic already to group and compress existing commits. However, is there a way to enforce a manual run of this operation, so that as much space as possible is saved, disregarding speed? Is it possible to pack as many repos in one stream, change compression algorithm -- anything to better compress old changesets?
I don't have a lot of large-sized repositories right now, but I do have some medium-to-big sized ones that could use some shrinkage in the early history.
Git appears to have git gc [--aggressive] which, to a git non-expert, seems to do some magic cutting down the cruft and compressing the repos. It also has git repack which also seems to be doing the same thing, albeit with some additional expert options. At least that's how it seems to me: change sets can be 'packed' differently.
Have you tried using the shrink-revlog.py extension in the mercurial/contrib directory? On very branchy repositories it may cut down the size of the manifest significantly (OTOH, it had zero effect for me on a nearly 1GB manifest in a repo converted from subversion).

Is it possible to change the username on a mercurial changeset based on the commit message?

I am converting a CVS repository (actually a number of them) into a Mercurial setup, and was wondering if any one had experience with updating and correcting usernames in the changesets.
The issue is that over quite a long period of time a default user has been used fairly often to commit to the CVS repository, and the commit message has then been supplied with '(initials)' at the end to identify the actual person committing. And now that I am converting to Mercurial I would like to clean this up, by setting the correct username.
Having done some research this seems non-trivial and I was thinking that something along the lines of this:
convert to hg and tag/branch commits with a specific initial in the commit message, using --config convert.cvsps.mergeto='{{mergetobranch ([-\w]+)}}'
converting this new repository and then use the --authormap to edit the default user to the persons initials.
But I am unsure if it is possible to selectively convert out branches and then get it back into its original place in the history.
Any help or ideas would be greatly appreciated. I have of course fully control of all the repository clones since it is not published in any way.
You need to use the Convert extension. You can follow the answer from this post.
The Convert extension sadly doesn't provide a convenient way to have conditional author mappings, but it does permit performing the conversion incrementally. You can then perform your conversion in small steps, inside of which a single author map is valid. You can use also script the generation of these "fence post" commits.
I had a similar problem today, where I wanted to conditionally change the username in some old commits in a Mercurial repository. I've bundled up all of my steps into a convenient script, which you can find here. The downside is that it depends on some Mercurial specific behavior, but you can work around this by first converting with a basic, non conditional author map, and then running the conditional conversion on the that repository.
This question is rather old, but if you don't have too many clones out in the wild, it may still be worth your while to run the author-name correcting conversion. Remember: hg convert necessarily changes the hash for every changeset, so you have to deal with all the usual issues of EditingHistory. I suspect this is part of the reason why both of the usual ways of dealing with this, Mercurial Queues and the newer HistEdit Extension, don't play nice with merge changesets -- it's difficult and messy to redo a non-leaf merge, and the presence of merges often indicates that you've shared the repository, which means you should reconsider editing history.

mercurial temporarily ignore versioned files

My question is essentially the same as here but applies to mercurial. I have a set of files that are under version control, and one save operation changes quite a lot of files. Some of the resulting changes are important for revision control, and some of the changes are just junk. I can "partition" off the junk into separate files. These junk files need to be part of a basic checkout in order for it to work, but their contents (and changes over time) aren't that important for revision control. Right now I just tell all our developers not to commit these files, but we all forget and it creates a lot of extra baggage in the repository. I don't really like the svn solution proposed because there are quite a lot of files and I want a simple clone to just work without all this extra manual work, so I was wondering if mercurial has a better alternative. It's kind of like hg shelve but not quite, and kind of like ignore, but not quite. Is there some hg extension that allows for this? Can git do it?
Mercurial doesn't support this. The correct way to do it is to commit thefile.sample and then have your developers (or better you deploy script) do a copy from thefile.sample to thefile if thefile doesn't exist. That way anyone can update the example file, but there's no risk of them committing their local changes (say their personal database connect string).
Aha! So TortoiseHG's repository and global settings have an Auto Exclude List where you can define a list of files that will be unchecked by default when the status, commit, and shelve dialogs open. So they still show up, but the user has to check them in order to actually do a commit. The setting is stored in hgrc, but it's under the [tortoisehg] heading so it's not supported by mercurial per se. Nevertheless, it fits my needs.
One solution to this is to use nested tree support (submodule in git), where the "junk" would be put in a different repository (to avoid cluttering the main repo), while enabling checking out the whole thing out in a consistent manner (right version of both repos in sync).
https://www.mercurial-scm.org/wiki/Subrepository?action=show&redirect=subrepos
In git, submodules are one solution to this issue - but they are not that great UI-wise. What I do instead is to keep two completely independent repositories, and using the subtree merge strategy when I need to update the main repo with the junk repo: http://progit.org/book/ch6-7.html

A mercurial merge chose the wrong changes, what is the correct way to fix this?

Changes were made to our .vcproj to fix an issue on the build machine (changeset 1700). Later, a developer merged his changes (changes 1710 through 1715) into the trunk, but the mercurial auto-merge overwrote the changes from 1700. I assume this happened because he chose the wrong branch as the "parent" of the merge (see part 2 of the question).
1) What is the "correct" mercurial way to fix this issue, considering out of all the merged files, only one file was merged incorrectly, and
2) what should the developer have done differently in order to make sure this didn't occur? Are there ways we can enforce the "correct" way?
Edit: I probably wasn't clear enough on what happened. Developer A modified a line in our .vcproj file that removed an option for the compiler. His check-in became changeset 1700. Developer B, working from a previous parent (let's say changeset 1690), made some changes to completely different parts of the project, but he did touch the .vcproj file (just not anywhere near the changes made by Developer A). When Developer B merged his changes (becoming changes 1710 through 1715), the merge process overwrote the changes from 1700.
To fix this, I just re-modified the .vcproj file to include the change again, and checked it in. I just wanted to know why Mercurial thought that it shouldn't keep the changes in 1700, and whether or not there was an "official" way to fix this.
Edit the second: Developer B swears up and down that Mercurial merged the .vcproj file without prompting him for conflict resolution, but it is of course possible that he's just misremembering, in which case this whole exercise is academic.
I will address the 2nd part of you question first...
If there is a conflict, the automated merge tools should force the programmer to decide how the merge happens. But the general assumption is that a conflict will involve two edits to the same set of lines. If somehow a conflict arises because of edits to lines that are not close to each other the automated merge will blithely choose both of the edits and a bug will appear.
The general case of a merge tool always merging properly is very hard to solve, and really can't be with current technology. Here is an example of what I mean from C:
int i; // Someone replaces this with 'short i' in one changeset stating
// that a short is more efficient.
// ... lots of code;
// Someone else replaces all the 65000s with 100000s in another changeset,
// saying that more precision is needed.
for (i = 0; i < 65000; ++i) {
integral_approximation_piece(start + i/65000.0, end + (i + 1) / 65000.0);
}
No merge tool is going to catch this kind of conflict. The tool would have to actually compile the code to see that those two parts of the code have anything to do with eachother, and while that would likely be enough in this case, I can construct an example that would require the code to be run and the results examined to catch the conflict.
This means that what you really ought to do is rigorously test your code after a merge, just like you should after any other change. The vast majority of merges will result in obvious conflicts that a developer will have to resolve (even though that resolution is often fairly obvious), or will merge cleanly. But the very few merges that don't fit either category can't easily be handled in an automated fashion.
This can also be fixed by development practices that encourage locality. For example a coding standard that states "Variables should be declared near where they're used.".
I'm guessing that .vcproj files are particularly prone to this problem since they are not well understood by developers and so if conflicts do appear they will not be sure what to do with them. My guess is that this happened and your developer simply did a revert back to the revision (s)he checked in.
As for part 1...
What to do in this case depends a lot on your development process. You can either strip the merge changeset out and redo it, though that won't work very well if lots of people have already pulled it, and it will work especially poorly if there are lots of changesets that have already been checked in that are based on the merge changeset.
You can also check in a new change that fixes the problem with the merge.
Those are basically your two options.
The tone of your post seems to me to indicate that you may have some politics surrounding this issue in your organization, and people are blaming this error on the frequent merges of Mercurial. So I will point out that any change control system can have this problem. In the case of Subversion, for example, every time a developer does an update while they have outstanding changes in their working directory, they are doing a merge, and this kind of problem can arise with any merge.
In mercurial a merge doesn't have a single parent, it by definition has two and only two parents. When someone is merging they're making two choices:
What two changesets will constitute the two changes
Which of those changesets will be the left-parent and which will be the right-parent
Of those two questions the first is very important, and the second barely matters at all, though it took me a while to come to understand that.
You select the left-parent by using hg update X. That changes the output of hg parents (or in newer versions hg summary) and essentially determines what's in your working directory before the merge.
You select the right-parent by using hg merge Y. That says merge X (the working directory's parent) with changeset Y. As a special case, if there are only two heads in your repository and your parent is already one of them then Y will default to the the other.
I'd have to see your resulting graph to know just what the developer did, but it's possible he didn't update to one head or another before invoking merge, which would have him merging one head with some point back in history.
If your developer picked the right parents for the merge then the left vs. right doesn't much matter -- the only real difference is that when one uses hg diff or hg log -p or some other command that shows the patch for a merge changeset, it's displayed relative to the left-parent. That's, however, mostly a factor in display only. Functionally they're pretty much identical.
Assuming your developer picked the right changesets then what he should have done was test the result of the merge before committing it. Merging is software development, not an annoying VCS side effect, and not testing before committing is the error.
Fixing
To fix this, just re-do the merge correctly. Use hg update to set one parent, use hg merge to pick the other. Make sure your current working directory is correct and then commit. You can get rid of his bad merge using something like hg strip or better, just close down his branch with hg commit --close-branch after updating to it.
Avoiding
You say "mercurial auto-merge", but mercurial doesn't really auto-merge. It does a premerge which is an extremely cautious combination of obvious changes, but it's so careful it won't even merge for you if each merge parent adds code in the same region because it can't know which block of code you'd rather have first.
You can disable this premerge entirely or on a file-by-file basis using the merge tool configuration options:
https://www.mercurial-scm.org/wiki/MergeToolConfiguration?highlight=premerge