Is cvs2hg still potentially producing corrupted repositories?

Trying to migrate a repository from CVS to hg, I found the tool cvs2hg, and it seems to do the job nicely (the conversion goes fine, and I have all the tags and branches).
However, the hg documentation warns about "fixup commits" making the repository somewhat corrupted or at least dangerous.
Is this still a problem? Maybe hg or cvs2hg have benefited from fixes since this warning was written.
If it potentially is, how can I check whether the resulting hg repository is in such a dangerous state?

Fixup commits are good and necessary. And cvs2hg does a much better job than hg convert.
But first, about the problem. In a CVS repository you can play various dirty tricks with tags and branches. For example, you can manually fine-tune a tag so that it marks today's version of 3 files, yesterday's version of 4 others, and a month-old version of yet another. In practice, I did this many times to make "patch tags" (there is some old tag, I make various commits afterwards, a bug turns up, I fix the bug, and I make a fixup tag from the old tag by moving it on 1-2 files).
As a result, you get a tag that points to a release which never has existed, and never will exist, at any point of the repository history, if the history is taken for the whole repo.
Similar tricks can be played with branches. Or branches can start from an "ugly" tag.
Any kind of "natural" conversion from CVS to HG is hopelessly lost in such cases. There is no place in the time-based history at which such a tag or branch could be hooked. So hg convert just binds such tags at more-or-less random places, and branches at very ugly places.
Fixup commits are simply those missing revisions: artificial commits which are bound at an appropriate place and introduce the changes that put the repository into the state it should be in at the given tag. With those, we get both "artificial" tags and branches properly bound to the proper code.
So if you:
committed a.c(1.1), b.c(1.1) and c.c(1.1)
committed a.c(1.2), b.c(1.2)
committed c.c(1.2)
artificially created tag blah_1.0 which points to a.c(1.1), b.c(1.1) and c.c(1.2)
committed a.c(1.3), b.c(1.3)
...
then the hg convert based history will have 4 edit changesets (just like those above) and blah_1.0 bound at some ugly place with the wrong content. cvs2hg, by contrast, will create a "fixup commit": an artificial changeset in which we really have a.c(1.1), b.c(1.1) and c.c(1.2), and it will place the tag there. In the history, such a changeset is reasonably similar to a transplanted/grafted/cherry-picked commit.
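To make the example concrete, here is a small Python sketch of it. This is only a toy model, not how cvs2hg is implemented: the snapshots are written out by hand from the list above, and the choice of the first commit as the parent of the fixup changeset is my assumption for illustration.

```python
# Whole-repository snapshots after each commit in the example above.
history = [
    {"a.c": "1.1", "b.c": "1.1", "c.c": "1.1"},  # first commit
    {"a.c": "1.2", "b.c": "1.2", "c.c": "1.1"},  # second commit
    {"a.c": "1.2", "b.c": "1.2", "c.c": "1.2"},  # third commit
    {"a.c": "1.3", "b.c": "1.3", "c.c": "1.2"},  # commit made after tagging
]

# What the hand-made CVS tag blah_1.0 points to.
blah_1_0 = {"a.c": "1.1", "b.c": "1.1", "c.c": "1.2"}


def find_snapshot(history, wanted):
    """Return the index of the snapshot equal to `wanted`, or None."""
    for index, snapshot in enumerate(history):
        if snapshot == wanted:
            return index
    return None


def fixup_changes(parent, wanted):
    """Files a fixup commit on top of `parent` must change to reach `wanted`."""
    return {name: rev for name, rev in wanted.items() if parent.get(name) != rev}


print(find_snapshot(history, blah_1_0))     # None: no point in history matches
print(fixup_changes(history[0], blah_1_0))  # {'c.c': '1.2'}
```

No snapshot in the time-based history matches the tag, so there is nowhere "natural" to hang it. An artificial changeset that only changes c.c on top of the first commit gives the tag a real place to live.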

You should carefully check the resulting repository to make sure it represents your code history and doesn't contain any of these crappy fixup commits.
BTW, it might be worthwhile to check out the newer http://www.catb.org/esr/reposurgeon/ tool.

Related

Mercurial: Automatically tagging a build

In a mercurial set up, I'd like to automatically tag certain builds based on continuous integration scripts. For example, a tag such as branchName-buildId whenever a build of a branch is deployed, or perhaps latest-stable whenever a build passes all integration tests.
However, I'm worried that the straightforward approach of simply calling hg tag will cause problems:
Some tags may be duplicate - i.e. latest-stable. I don't really care which build gets tagged in this situation, but I don't want any conflicts because a script can't resolve those.
Tags cause commits. However, this means that those commits need to be pushed and they need to be robust in the face of concurrent pushes by humans and other scripts. In particular, the automatic push can create additional heads, which is Not Good. But by the time the additional head is detected (at push) the local tag commit has already happened, and even though the new heads are likely trivially mergeable, sometimes tags cause conflicts.
How can I automatically let the CI server tag a build robustly? Here it's more important that the end result is consistent (i.e. that it doesn't mess up the CI server or the repo), and it's less important that tags are reliably applied in the face of duplicates or conflicts (which should be very unlikely anyhow).

I think you're right to be cautious. Robots aren't always the best citizens, and can often do silly things.
What you end up doing depends on what you see the tags being used for. For example, if you only see the CI system using them, then I'd suggest keeping them local. No pull/push/merge issues at all.
"Some tags may be duplicate - i.e. latest-stable. I don't really care which build gets tagged in this situation, but I don't want any conflicts because a script can't resolve those."
If a tag is already defined, and you call hg tag again, it will fail unless you force it, but what this does is add a newer, later definition of the same tag, and the latest one wins. On one hand this is good, because the merge is simple, but think about the case when you do:
hg update -r latest-stable
hg update -r latest-stable
hg update -r latest-stable
hg update -r latest-stable
Each time you'll update to the version you'll get a version before the tag was made (as normal), and at that version latest-stable will point to the previous latest-stable. The result is that this sequence of commands will move you back through time.
Hence I'd say it's better either to have unique tags (i.e. stable-2013-02-18) or tag in two commits; One that removes the old tag, and one to add the new one.
hg update -r latest-stable # You're now at the commit that removed the tag.
hg update -r latest-stable # This one will error because tag doesn't exist
"Tags cause commits. However, this means that those commits need to be pushed and they need to be robust in the face of concurrent pushes by humans and other scripts. In particular, the automatic push can create additional heads, which is Not Good. But by the time the additional head is detected (at push) the local tag commit has already happened, and even though the new heads are likely trivially mergeable, sometimes tags cause conflicts."
The CI robot should tag; pull; merge (if necessary); push. If the merge fails, don't push; raise an alarm. If the push fails (i.e. more changesets have arrived in the time it took to merge), pull and merge again. I'd just make sure your script is very explicit about the revisions it's merging. This process should leave you with no extra heads.
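A rough Python sketch of that loop, as one possible shape for the CI script. The function names, the retry limit and the return strings are made up for illustration. It is also a slight variation on the order above: it tries the push first and only pulls and merges when the push is refused, because the exit code of hg merge alone does not tell "nothing to merge" apart from other failures. A real script should pass explicit revisions to hg merge.

```python
import subprocess


def hg(*args):
    """Run an hg command in the current directory and return its exit code."""
    return subprocess.run(["hg", *args]).returncode


def tag_and_push(tag, rev, run=hg, max_attempts=3):
    """Tag `rev`, then push; on a refused push, pull, merge and try again."""
    if run("tag", "-r", rev, tag) != 0:
        return "tag-failed"
    for _ in range(max_attempts):
        if run("push") == 0:
            return "pushed"
        # Push refused, most likely because new changesets arrived upstream.
        if run("pull") != 0:
            return "alarm"
        if run("merge") != 0:
            return "alarm"  # conflict (or nothing to merge): leave it to a human
        if run("commit", "-m", "Merge CI tag commit") != 0:
            return "alarm"
    return "gave-up"
```

Calling tag_and_push("stable-2013-02-18", "tip") from the CI job would run the real commands. The run parameter is only there so the loop can be exercised with a fake command runner.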
I believe Mercurial treats the .hgtags file differently for merging because it knows about the content, so conflicts should be very rare. Also, tag commits are, in general, easy to merge because all that changes is .hgtags, so a merge from the CI head should never conflict. The only reason it could is because someone else is using the same tag names as the CI server, and if they are doing that then they need to have honey poured on their keyboard so they can't do any more damage.
The situation I can see causing problems is if you're doing CI tagging on multiple heads with the same tag names. e.g. Development and release branches both have CI run on them, both have tests-clean tags assigned, but to different revisions, and are then merged later. Solution is, don't do that.
Hope some of that is helpful.

If you care about the history of builds, then consider creating a named branch just for the build process. In Mercurial, all tags from all branches are visible in the whole repository.
If you don't care about history, bookmarks should do the trick. The build process can set the bookmark latest-stable after the tests are run and then execute hg push --bookmark latest-stable to push that bookmark to the server.
Either way, you have to take care not to run tests on revisions whose child has already been tested. Mercurial revsets are a very powerful query language and should help.

Pull commits on repo post-rebase

I'm looking for a simple way to pull in additional commits after rebasing or a good reason to tell someone not to rebase.
Essentially we have a project, crons. I make changes to this frequently, and the maintainer of the project pulls in changes when I request it and rebases every time.
This is usually okay, but it can lead to problems in two scenarios:
Releasing from two branches simultaneously
Having to release an additional commit afterwards.
For example, I commit revision 1000. Maintainer pulls and rebases to create revision 1000', but at around the same time I realize a horrible mistake and create revision 1001 (child of 1000). Since 1000 doesn't exist in the target branch, this creates an unusable merge, and the maintainer usually laughs at me and tells me to try again (which requires me getting a fresh checkout of the main branch at 1000' and creating and importing a patch manually from the other checkout). I'm sure you can see how the same problem could occur with me trying to release from two separate branches simultaneously as well.
Anyway, once the main branch has 1000', is there anything that can be done to pull in 1001 without having to merge the same changes again? Or does rebasing ruin this? Regardless is there anything I can say to get Maintainer to stop rebasing? Is he using it incorrectly?

Tell your maintainer to stop being a jacka**.
Rebasing is something that should only be done by you, the one that created the changesets you want to rebase, and not done to changesets that are:
already shared with someone else
gotten from someone else
Your maintainer probably wants a non-distributed version control system, like Subversion, where changesets follow a straight line, instead of the branchy nature of a DVCS. In that respect, the choice of Mercurial is wrong, or the usage of Mercurial is wrong.
Also note that rebasing is one way of changing history, and since Mercurial discourages that (changing history), rebasing is only available as an extension, not available "out of the box" of a vanilla Mercurial configuration.
So to answer your question: No, since your maintainer insists on breaking the nature of a DVCS, the tools will fight against you (and him), and you're going to have a hard time getting the tools to cooperate with you.
Tell your maintainer to embrace how a DVCS really works. Now, he may still insist on not accepting new branches or heads in his repository, and insist on you pulling and merging before pushing back a single head to his repository, but that's OK.
Rebasing shared changesets, however, is not.
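To see why, here is a toy Python model of the 1000 / 1000' / 1001 situation from the question. It is only a sketch: the changeset_id function is a stand-in I made up, and real Mercurial changeset ids are computed from more than this, but they do depend on the parent, which is the point.

```python
import hashlib


def changeset_id(parent_id, content):
    """Toy changeset id: a hash over the parent id and the content."""
    return hashlib.sha1(f"{parent_id}:{content}".encode()).hexdigest()[:12]


base = changeset_id("", "main branch state")
other = changeset_id(base, "someone else's work")

# Contributor: 1000 on top of base, then the follow-up fix 1001 on top of 1000.
rev_1000 = changeset_id(base, "my feature")
rev_1001 = changeset_id(rev_1000, "fix horrible mistake")

# Maintainer: rebases 1000 onto `other`, which produces 1000'.
rev_1000_prime = changeset_id(other, "my feature")
maintainer_repo = {base, other, rev_1000_prime}

print(rev_1000 == rev_1000_prime)   # False: same change, different identity
print(rev_1000 in maintainer_repo)  # False: the parent of 1001 is unknown there
```

When 1001 is pulled, it drags the original 1000 along with it, so the same change shows up twice and has to be merged against its own rebased copy.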
If you really want to use rebasing, the correct way to do it is like this:
You pull the latest changes from some source repository
You commit a lot of changesets locally, fixing bugs, adding new features, whatnot
You then try to push, and get told that this will create new heads in the target repository. This tells you that there are new changesets in the target repository that you did not get when you last pulled, because they were added after that
Instead, you pull, this will add a new head in your local repository. Now you have the head that was created from your new changesets, and the head that was retrieved from the source repository created by others.
You then rebase your changesets on top of the ones you got from the source repository, in essence moving your changesets in the history to appear that you started your work from the latest changeset in the current source repository
You then attempt a new push, succeeding
The end result is that the target repository, and your own repository, will have a more linear changeset history, instead of a branch and then a merge.
However, since multiple branches are perfectly fine in a DVCS, you don't have to go through all of this. You can just merge, and continue working. This is how a DVCS is supposed to work. Rebasing is just an extra tool you can use if you really want to.

Help understanding the benefits of branching in Mercurial

I've struggled to understand how branching is beneficial. I can't push to a repo with 2 heads, or 2 branches... so why would I ever need/use them?

First of all, you can push even with two heads, but since you probably don't want to do that, the default behavior is to prevent you from doing it. You can, however, force the push to go through.
Now, as for branching, let's take a simple scenario in a non-distributed version control system, like Subversion.
Let's assume you have a colleague that is working in the same project as you. The current latest changeset in the Subversion repository is revision 100, you both update to this locally so that now both of you have the same files.
Ok, now your colleague has already been working on his changes for a couple of hours now, and so he commits. This brings the central repository up to revision 101. You're still on revision 100 locally, and you're still working on your changes.
At some point, you complete, and you want to commit, but Subversion won't let you. It says you have to update first, so you start the update process.
The update process wants to take your changes, and pretend you actually started with revision 101 instead of 100. If your changes are not in conflict with whatever it was your colleague committed, all is hunky dory, but if your changes are in conflict, you have a problem.
Now you have to merge your changes with his changes, and things can go haywire. For instance, you might end up merging one file OK, the second file OK, or so you think, and then the third file, and you suddenly discover that you've got some of the details wrong, it would've been better to merge the second file differently.
Unless you made a backup of your changes before updating, and sooner or later you will forget, you have a problem.
Now, the above scenario is actually quite common. Well, perhaps not the merging part, it depends on how many people are working in the same area or files at the same time, but the "must update before committing" part is quite common with Subversion.
So how does Mercurial do it?
Well, Mercurial commits locally, it doesn't talk to any remote repository at all, so it won't stop you from committing.
So, let's try the above scenario again, just in Mercurial this time.
The tipmost changeset in the remote repository is revision 100. You both have cloned this down, and you're both starting to work on the changes, from revision 100.
Your colleague completes his changes and commits, locally. He then pushes his changeset up to the central repository, bringing the tip there up to revision 101.
You then complete your changes and commit, also locally. Then you want to push, but you get the error message you've already discovered and are asking about.
So how is this different?
Well, your changes are now committed; unless you try really hard, there is no way to accidentally lose or destroy them.
Here's the 3 repositories in play and their current state:
Colleague ---98---99---100---A
Central   ---98---99---100---A
You       ---98---99---100---B
If you were to push, and were allowed to do this (or forced the push through), the Central repository would look like this:
Central ---98---99---100---A
                        \
                         +--B
Two heads. If your colleague now pulled, which one should he continue working from? This question is the reason Mercurial will by default prevent you from causing this.
So instead you pull, and you get the above state in your own repository.
In other words, you can choose to impact your own repository and create multiple heads there, but you are not imposing that problem on anyone else.
You then merge, the same type of operation you had to do in Subversion, except your changeset is safe, it was committed, and you won't accidentally corrupt or destroy it. If, mid-merge, you want to start over, you can, nothing lost, no harm done.
After the merge, your local repository looks like this:
You ---98---99---100---A----M
                    \      /
                     +--B--+
This is now safe to push, and if your colleague now pulls, he knows that he has to continue from the M changeset, the one that merged his and your changes.
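If it helps, the "two heads" versus "one head after the merge" idea can be sketched in a few lines of Python. This is just a toy graph using the changeset names from the diagrams above, not anything Mercurial itself runs.

```python
def heads(parents):
    """Changesets that no other changeset lists as a parent."""
    has_child = {p for ps in parents.values() for p in ps}
    return sorted(node for node in parents if node not in has_child)


# Central repository if the push of B had been forced through.
forced = {"100": [], "A": ["100"], "B": ["100"]}

# Your repository after pulling A and committing the merge M.
merged = {"100": [], "A": ["100"], "B": ["100"], "M": ["A", "B"]}

print(heads(forced))  # ['A', 'B']
print(heads(merged))  # ['M']
```

With two heads there is no single obvious changeset to continue from. After the merge there is exactly one, which is why the push is accepted without forcing.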
The above description is what happens due to Mercurial's distributed nature.
You can also name branches, to make them more permanent. For instance, you might want to name a branch "stable", to signal that any changesets on that branch have been thoroughly tested and are safe for release to customers or to put into production. Then you would only merge changes onto that branch when said testing has been completed.
The nature, however, is the same as the above description. Whenever more than one person works on a project with Mercurial, you will get branches, and that's a good thing.

Whenever more than one clone of a repo is made and commits are made in those clones, branches happen, whether you name them by using the hg branch command or not. My philosophy is, you might as well give them a name. It makes things less confusing.
A good explanation of mercurial branches: http://stevelosh.com/blog/2009/08/a-guide-to-branching-in-mercurial/

what's the difference between hg tag and hg bookmark?

What's the difference between a tag and a bookmark in Mercurial? I can't seem to find any discussion of how the two differ.

Let's consider your repository as a "choose your own adventure" book, with different points of view.
A tag is like a stamp that the editor put on your manuscript to say "ok, we keep a trace of your current work, in case shit happens."
A named branch would be a chapter. You have to choose at one point which chapter you'll have to write, and they are there to stay. Some will merge back, some will end (sorry, you died.)
A bookmark is, well, a bookmark. It follows you while you're reading (committing) the book. It helps you keep track of "what you were reading at that time", so you can remove bookmarks or move them to a different "chapter". When you share the book (push), you usually don't share your bookmarks, unless you explicitly want to. So you usually use them on anonymous branches because their life cycle is shorter than that of named branches.

Bookmarks are used when you want a mnemonic (foo_feature) that points to a changing commit id as your work progresses. They're more light-weight than regular Mercurial branches, and somewhat similar to the way git branches work.
Tags generally point to fixed commit ids. They can be reassigned manually, but this is discouraged.

There are actually five concepts to play with:
tags
local tags
bookmarks
lightweight branches
named branches
Lightweight branches are what happens if you just use mercurial. Your repository history forks and sometimes merges as you change things and move around your history.
The other four are ways of annotating lightweight branches and the changesets that make them up.
Named branches and tags are Mercurial-only concepts where the branch names and tags actually get recorded in the repository by making more commits to the repository. They'll tend to propagate to other repositories in ways which are not necessarily obvious.
Local tags and bookmarks are much more like what git calls tags and branches. They're metadata rather than being mixed in with the versioned objects. So they're not represented as part of the repository history. They tend to be local to your repository, and won't propagate unless you propagate them deliberately.
At least I think that's how they all work. After about twelve months of using mercurial daily I haven't really got to grips with its model(s). If anyone knows better than me then feel free to edit this answer so it's correct.
How I actually use these things in practice.
I'm working on a single shared repository with about 20 other people. I make many experiments and lightweight branches in my own private repository, which never get pushed to our main central repository. Occasionally once an experiment has worked out I'll modify the main line and push a changeset into the central repository, from which it will find its way to everyone else's machine.
I'll occasionally push some changesets to a co-worker if they're one of the people who's comfy with how mercurial works. But several people are a bit scared of it and prefer if I send them diffs that they can apply with patch.
For experiments I expect to be short lived and private, I just let lightweight branches happen where they may, and remember what's going on. If I feel my memory slipping about a twig that's been around for a bit, I bookmark it.
I use local tags to mark revisions I might like to come back to one day. They make interesting past states easier to find.
I myself almost never make non-local tags or named branches (except by accident, and I destroy them if I do). But our release people do. Our released major versions all have their own named branches off from the main line, and minor versions have tags on those branches. That ensures that these important branches and tags look the same to everyone.
Again, I've no idea whether this is how one's supposed to use mercurial, but it seems to be a model that works well for our size of team.
If three or four of us wanted to collaborate on an experiment, that would probably be worth a named branch, which we'd probably share between ourselves but not push to the central repo. I don't know how well that would work out!

The biggest difference is that a bookmark is automatically moved forward when you commit. Here's an example:
hg init
echo one > file.txt              # edit a file
hg commit -A -m 'my commit'      # creates revision 0
hg tag -r 0 mytag                # creates revision 1
hg bookmark -r 0 mybookmark      # doesn't create a revision
hg update mybookmark             # get back to r0 and make the bookmark active
echo two >> file.txt             # edit a file
hg commit -m 'another commit'    # creates revision 2
At that point mytag is still pointing to revision 0 and mybookmark is now pointing at revision 2. Also the tagging created a changeset and the bookmark didn't.
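The same behaviour as a toy Python model, in case the mechanics are easier to follow that way. The ToyRepo class is invented for this illustration and only tracks revision numbers. It assumes the bookmark is the active one when the commit is made, which is what makes it move.

```python
class ToyRepo:
    """Toy model: a fixed tag versus a bookmark that follows commits."""

    def __init__(self):
        self.revisions = []   # commit messages, index = revision number
        self.tags = {}        # name -> revision (stays put)
        self.bookmarks = {}   # name -> revision (can move)
        self.active = None    # name of the active bookmark, if any
        self.parent = None    # revision the working directory is at

    def commit(self, message):
        self.revisions.append(message)
        self.parent = len(self.revisions) - 1
        if self.active is not None:
            # The active bookmark follows the new commit.
            self.bookmarks[self.active] = self.parent
        return self.parent

    def tag(self, name, rev):
        self.tags[name] = rev
        # Recording a tag is itself a commit (it changes .hgtags).
        return self.commit(f"Added tag {name} for changeset {rev}")

    def bookmark(self, name, rev):
        self.bookmarks[name] = rev  # no commit involved

    def update(self, name):
        """Update to a bookmark and make it the active one."""
        self.parent = self.bookmarks[name]
        self.active = name


repo = ToyRepo()
repo.commit("my commit")             # revision 0
repo.tag("mytag", 0)                 # revision 1
repo.bookmark("mybookmark", 0)       # no new revision
repo.update("mybookmark")            # back at r0, bookmark active
repo.commit("another commit")        # revision 2

print(repo.tags["mytag"])            # 0
print(repo.bookmarks["mybookmark"])  # 2
```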

Doing without partial commits the "Mercurial way"

Subversion shop considering switching to Mercurial, trying to figure out in advance what all the complaints from developers are going to be. There's one fairly common use case here that I can't see how to handle.
I'm working on some largish feature, and I have a significant part of the code -- or possibly several significant parts of the code -- in pieces all over the garage floor, totally unsuitable for checkin, maybe not even compiling.
An urgent bugfix request comes in. The fix is nice and local and doesn't touch any of the code I've been working on.
I make the fix in my working copy.
Now what?
I've looked at "Mercurial cherry picking changes for commit" and "best practices in mercurial: branch vs. clone, and partial merges?" and all the suggestions seem to be extensions of varying complexity, from Record and Shelve to Queues.
The fact that there apparently isn't any core functionality for this makes me suspect that in some sense this working style is Doing It Wrong. What would a Mercurial-like solution to this use case look like?
Edited to add: git, by contrast, seems designed for this workflow: git add the bugfix files, don't git add anything else (or git reset HEAD anything you might have already added), git commit.

Here's how I would handle the case:
have a dev branch
have feature branches
have a personal branch
have a stable branch.
In your scenario, I would be committing frequently to my branch off the feature branch.
When the request came in, I would hg up -r XYZ, where XYZ is the rev number that they are running, then branch a new feature branch off of that (or up branchname, whatever).
Perform work, then merge into the stable branch after the work is tested.
Switch back to my work and merge up from the top feature branch commit node, thus integrating the two streams of effort.

Lots of useful functionality for Mercurial is provided in the form of extensions -- don't be afraid to use them.
As for your question, record provides what you call partial commits (it allows you to select which hunks of changes you want to commit). On the other hand, shelve lets you temporarily make your working copy clean, while keeping the changes locally. Once you commit the bug fix, you can unshelve the changes and continue working.
The canonical way to go around this (i.e. using only core) would probably be to make a clone (note that local clones are cheap as hardlinks are created instead of copies).

You would clone the repository (i.e. create a bug-fix branch in SVN terms) and do the fix from there.
Alternatively if it really is a quick fix you can use the -I option on commit to explicitly check-in individual files.

As with any DVCS, branching is your friend. Branching a repository multiple ways is the bread and butter of these systems. Here's a git model you might consider adopting that works quite well with Mercurial, also.

In addition to what Santa said about branching being your friend...
Small-granularity commits are your friend. Rather than making lots of code changes in a single commit, make each logically self-contained code change in its own commit. Then it will be a lot easier to cherry-pick changes to merge between branches.

Don't use Mercurial without using the Mq Extension (it comes pre-packaged in the default installation). In addition to solving your specific problem, it solves a lot of other general problems and really should be the default way that you work (especially if you're using an IDE that doesn't integrate directly with Hg, making switching branches on the fly a difficult way to work).
