Sync Mercurial repositories with two central servers without connection - mercurial

I'm new to Mercurial and I have problems with the solution we're working to implement in my company. I work in a Lab with a strict security environment, and the production Mercurial server is on an isolated network. Everyone has two computers: one to work in the "real world" and another to work in the isolated and secure environment.
The problem is that we have other Labs distributed around the world, and in some cases two or more Labs need to work together on a project. Every Lab has an HG server to manage its own projects locally, but I'm not sure whether our method for syncing common projects is the best solution. To share work, we use a bundle to send the new changesets from one Lab to another. My question is about how good this method is, because the procedure is a little bit complicated. It goes more or less like this:
In Lab B: hg pull and update to make sure the local folder is at the latest version.
Ask the other Lab for its hg log, to find the last common changeset.
In Lab A: hg pull and update to make sure the local folder is at the latest version.
In Lab A: make a bundle, "hg bundle --base XX project.bundle" (where XX is the last common changeset).
Send it to Lab B (with a complicated method required by the security rules: encrypt files, encrypt drives, secure erases, etc.).
In Lab B: "hg unbundle projectYY.bundle" in the local folder.
This process creates two heads, which sometimes forces you to merge.
Once the changesets from Lab A are correctly integrated at Lab B, we have to repeat the whole process in the opposite direction to bring Lab B's work on the project back to Lab A.
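In commands, one round trip looks roughly like this (XX is the last common changeset, as above):

    # Lab A: make sure the local clone is current
    hg pull && hg update
    # Lab A: bundle everything since the last common changeset XX
    hg bundle --base XX project.bundle
    # ...transfer project.bundle with the secure exchange procedure...
    # Lab B: import the bundle and merge the extra head it creates
    hg unbundle project.bundle
    hg merge
    hg commit -m "Merge changes received from Lab A"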
Could anyone point me toward a better way out of this dilemma?
Anyone have a better solution?
Thanks a lot for your help.

Bundles are the right vehicle for propagating changes without a direct connection. But you can simplify the bundle-building process by modeling communication locally:
In Lab A, maintain repoA (the central repo for local use), as well as repoB, which represents the state of the repository in lab B. Lab B has a complementary set-up.
You can use this dual set-up to model the relationship between the labs as if you had a direct connection, but changeset sharing proceeds via bundles instead of push/pull.
From the perspective of Lab A: Update repoA the regular way, but update repoB only with bundles that you receive from Lab B and bundles (or changesets) that you are sending to Lab B.
More specifically (again from the perspective of Lab A):
In the beginning the repos are synchronized, but as development progresses, changes are committed only to repoA.
When it's time to bring lab B up to speed, just go to repoA and run hg outgoing path/to/repoB. You now know what to bundle without having to request and study lab B's logs. In fact, hg bundle bundlename.bzip repoB will bundle the right changesets for you.
Encrypt and send off your bundle.
You can assume that the bundle will be integrated into Lab B's home repo, so update your local repoB as well, either by pushing directly or (for assured consistency) by unbundling (importing) the bundle that was mailed off.
When lab B receives the bundle, they will import it into their own copy of repoA -- it is now updated to the same state as repoA in lab A. Lab B can now push or pull changes into their own repoB, and merge them (in repoB) with their own unshared changesets. This will generate one or more merge changesets, which are handled just like any other check-ins to lab B's repoB.
And that's that. When lab B sends a bundle back to lab A, it will use the same process, steps 1 to 5. Everything stays synchronized just as it would if the repositories were directly connected. As always, it pays to synchronize frequently so as to avoid diverging too far and encountering merge conflicts.
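From Lab A's side, the whole cycle might look like this on the command line (paths are just examples):

    cd /repos/repoA
    hg outgoing /repos/repoB                  # what does lab B not have yet?
    hg bundle project.bundle /repos/repoB     # bundle exactly those changesets
    # ...encrypt and send project.bundle to lab B...
    hg unbundle -R /repos/repoB project.bundle    # keep the local model of lab B in sync
    # later, when a bundle arrives from lab B:
    hg unbundle -R /repos/repoB bundle-from-labB.hg
    hg pull /repos/repoB                      # pull lab B's changes into repoA
    hg merge
    hg commit -m "Merge changes from lab B"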
In fact you have more than two labs. The approaches to keeping them synchronized are the same as if you had a direct connection: Do you want a "star topology" with a central server that is the only node the other labs communicate with directly? Then each lab only needs a local copy of this server. Do you need lots of bilateral communication before some work is shared with everyone? Then keep a local model of every lab you want to exchange changesets with.

If you have no direct network connection between the two Mercurial repositories, then the method you describe seems like the easiest way to sync them.
You could probably trim some of the boilerplate around finding the new changesets that need bundling; how exactly depends on your setup.
For one, you don't need to update your working copy in order to create the bundles; having the repository itself is enough, no working copy required.
And if you know the date and time of the last sync, you can simply bundle all changesets added since that time, using an appropriate revset, e.g. all revisions since 30 March this year: hg log -r'date(">2015-03-30")'. That way you can skip a lengthy manual review process.
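For example, something along these lines should bundle everything after that date in one go (I believe --base accepts a revset here; adjust the date to your last sync):

    # list what is new since the last sync
    hg log -r "date('>2015-03-30')"
    # bundle exactly those changesets; --base marks everything older
    # as already present on the other side
    hg bundle --base "not date('>2015-03-30')" since-sync.bundle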
If your repository is not too big (and thus fits on the media you use for exchange), simply copy it there in its entirety and do a local pull from that exchange disk to sync, skipping those review processes, too.
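For instance (paths hypothetical):

    # create a full copy (no working directory needed) on the exchange disk
    hg clone -U /repos/project /media/exchange/project
    # at the receiving lab, just pull from the exchange disk
    cd /repos/project && hg pull /media/exchange/project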
Of course you will not be able to avoid making the merges - they are the price you have to pay when several people work on the same thing at the same time and both commit to their own repos.

Related

How do large companies deal with Mercurial?

I am investigating how to migrate our source control from SVN to Mercurial. One thing I am not sure how to deal with is usernames in commits. From what I've seen, there is no way to force an HG user to use a specific username; even if one is specified in Mercurial.ini, the user can override it with the -u flag to hg commit.
How do companies deal with this? There is nothing to prevent developer A from committing something in his repository as developer B and then pushing it to someone else.
Thanks.
I wouldn't say our company is large (4 developers), but it's never been an issue for us so far. I haven't seen any way to prevent that behavior either in my searching. I guess it comes down to an issue of trust amongst your developers.
Unrelated, we did successfully migrate from SVN to Mercurial about two years ago so I may be able to answer other questions you have.
EDIT: An idea:
I'm not sure how you were planning on setting up your topology, but we have a server that functions as the central repository for all our repos. It is possible to push changes between developers (bypassing the central server), but we never do that. We always commit locally and then push/pull from/to the central server. Additionally, we use https and windows authentication to authenticate with this central server.
If you're planning on having something like this, you could create a hook on the server (see repository events) (maybe the pretxnchangegroup event) that would verify that the user name in each commit being pushed is the same as the authenticated user from the web server.
Not sure if this would work, but it sounds plausible.
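Something along these lines might do it, assuming your web server exposes the authenticated name to the hook in the REMOTE_USER environment variable (script name and path are made up):

    # in the central repository's .hg/hgrc
    [hooks]
    pretxnchangegroup.checkauthor = /srv/hg/hooks/check_author.sh

    # /srv/hg/hooks/check_author.sh
    #!/bin/sh
    # HG_NODE is the first incoming changeset; reject the push if any
    # incoming author does not match the authenticated web-server user.
    for author in $(hg log -r "$HG_NODE:tip" --template '{author|user}\n'); do
        if [ "$author" != "$REMOTE_USER" ]; then
            echo "author '$author' does not match authenticated user '$REMOTE_USER'" >&2
            exit 1
        fi
    done
    exit 0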
Another attempt(s)
Path-based ACLs in pseudo-CVCS workflow
If you use a "controlled anarchy" workflow (p2p communications aren't controlled or restricted, but are trusted, and a single authoritative source is the common push target), you can use the "Branch Per Developer" paradigm, i.e. with the ACL extension on the central repo the following restrictions apply:
Nobody can push to the default branch
Each developer can push only to his personal branch (under any name; the name itself means nothing, the branch name is the authentication data used for tracking)
Only trusted mergers can work with repo-Central (merging dev branches to default, NO rebase, NO history rewriting in dev branches)
Each merge changeset in the default branch carries an authentication trace: the source branch
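With the ACL extension, the central repo's configuration could look roughly like this (user and branch names are hypothetical; the trusted mergers are the only ones allowed on default):

    # central repo's .hg/hgrc
    [hooks]
    pretxnchangegroup.acl = python:hgext.acl.hook

    [acl]
    # only check changesets arriving over the network
    sources = serve

    [acl.allow.branches]
    # only the trusted mergers may push to default
    default = merger-1, merger-2
    # each developer may push only to his personal branch
    dev-alice = alice
    dev-bob = bob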
Signing branches
If you can't trust (and you must not trust) the username in commits, you can trust strong crypto. Mercurial has at least two extensions which allow digitally signing commits, thus providing reasonably accurate (with caveats, see the notes below) information about authorship, each with its own advantages and disadvantages:
The Commitsigs extension: the Commitsigs Extension wiki page and the "Signing Mercurial Changesets on Windows" mini-HowTo are complete enough to understand and demonstrate how to get started. Pro: no additional commits are needed for signing; you can't (by design) sign old historic commits. Contra: not-so-nice output from the relevant commands (see the screenshots in Damian's post for log and verifysigs); and because it's GnuPG (no PKI), it's theoretically possible to create and use a key pair for any name/email, and only an extra comparison will show two different keys for one user.
The GPG extension, with the Approval Reports wiki page as a quick start. Pro: can use PGP keys or OpenSSL certs (TBT!!!) (where OpenSSL means one corporate source of issued certs); more readable and informative output from the sigcheck command. Contra:
    commiting changes to a .hgsigs file in the root of the working copy and so it requires extra changesets to be made. This makes it infeasible to sign all changesets. The .hgsigs file must also be merged like any other file when branches are merged.
and, finally, the .hgsigs file can be modified by hand by a malicious user just like any other file in the working copy.
Edit and bugfixing
OpenSSL can be used with the Commitsigs extension, not with the GPG extension.
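For the GPG-extension route (using PGP keys), a minimal sketch (the key ID is hypothetical):

    # enable the extension in ~/.hgrc or the repo's .hg/hgrc
    [extensions]
    gpg =

    # sign the current tip with your key, then verify signatures later
    hg sign -k ABCD1234 tip
    hg sigcheck tip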

Mercurial distributed repositories

Having found myself in the role of a build engineer and systems guy, I had to learn and figure out a few things - namely, how to set up our infrastructure. Before I came on board they didn't have any. With this in mind, please excuse me if I ask anything that should be obvious.
We currently have Mercurial repositories distributed across three levels: level one on each developer's machine, level two on a central (trunk) server - only accessible from the local network - and the third level on BitBucket. The workflow is as follows:
Local development: the developer pulls changesets from the local network server, commits locally, and pushes to our local server once merge conflicts are resolved. A scheduled script backs everything up to BitBucket overnight.
Working from home: the developer pulls changesets from BitBucket, commits to their local repo, and pushes to BitBucket.
TeamCity picks up repo changes from the local network server for each project and runs a build / automated deploy to the test environment.
The issue I'm hitting is scenario 2: at the moment, if someone pushes something to BitBucket, it's their responsibility to merge it back when they're back in the office. That's a bit of a time waster when it could be automated.
In case you're wondering, the reason we have a central repo on the local network is that it would be slow to run TeamCity builds against BitBucket repositories. I haven't tested this, so it's just an educated guess.
Anyhow, the scheduled script that pushes all changes from the central repository on the local network just runs an "hg push" for each of the repositories. It would have to do a pull / merge beforehand. How do I do this right?
This is what the pull would need switches for:
- update after pull
- in case of merge conflicts, always take newer file
- in case of error, send an email to system administrator(s)
- anything extra?
Please feel free to share your own setup as long as it's not vastly different to what's described.
UPDATE: In light of recent answers, I feel an important aspect of the intended approach needs to be clarified. The idea is not to force merges on our local network central repo. Instead it should resolve merge conflicts in the same way as HgWorkbench does on developer machines with post-pull: update + merge. All developers have this on by default, so it should be OK.
So the script / batch file on the server would do the following:
1. Pull from BitBucket
2. Update + auto merge
3. Any auto-merge conflicts?
3.1 Yes -> Send an email to administrators to manually merge -> Break
3.2 No -> Carry on
4. Get outgoing changesets. Will the push create multiple heads? (This might be redundant because of the pull / update.)
4.1 Yes -> Prompt administrators. Break.
4.2 No -> Push changes
Hope this clears things up a bit. Now, can this be done using hg commands alone - in a batch file - or do I have to script it? Specifically, can it send emails?
Thanks.
So all your work is available on BitBucket, right? Why not make BitBucket (as it's available from anywhere) your primary repo source and drop your local servers? You can pull changes from BitBucket with TeamCity for your nightly builds, and developers would always work with the current repo on BitBucket and resolve all merging problems themselves, so there wouldn't be any subsequent merges for you.
I would not try to automatically merge the changes if they are conflicting; this will only lead to broken and inconsistent versions and "lost" changes, causing confusion and chaos. Don't merge automatically if it isn't clear what the merge should look like.
A better alternative would be to just keep the two heads around and push/pull them without merging. This way everybody can still get the version of the data they were working on, from work or from home. A manual merge will have to be done, but this can also be done at work or from home, enabling developers to resolve the issue from wherever they are. You can also send emails around in this scenario to make sure everybody is aware of the problem.
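A rough sketch of what the scheduled script could do in that spirit (repo path, URL and addresses are made up, and a standard mail command is assumed):

    #!/bin/sh
    cd /srv/hg/project
    # fetch new work from BitBucket, but do not merge anything
    hg pull https://bitbucket.org/yourteam/project
    # if that left more than one head on default, ask a human to merge
    heads=$(hg heads default --template '{node|short} {author}\n')
    if [ "$(printf '%s\n' "$heads" | wc -l)" -gt 1 ]; then
        printf 'Manual merge needed on default:\n%s\n' "$heads" \
            | mail -s "hg: multiple heads in project" admins@example.com
    fi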
I guess that you could automate this using a script; I would try PowerShell if I were you. However, sometimes this may require manual merges when there are conflicts (because when developers commit changes to both BitBucket and the local repos, these changes might conflict).

Git workflow for dev and production - where should I create the repos?

I've read somewhere* a setup like this would be nice:
Two main branches, one for each server.
Pushing to master sends changes on to live;
Pushing to dev/stage (or whatever you call it) sends changes to staging;
Workflow:
Create branch from dev;
work locally until you're ready to test;
merge back to dev;
push to Hub, which sends changes to dev/staging server.
Once you're ready with those to go live:
merge from dev to master,
then push master to Hub, which sends those changes on to the live server.
Two main branches, one for each server.
So I have one branch "production" on "webroot/myliveapp/"
and another branch "development" on "webroot/devapp/"
Where should the repository be?
UPDATE:
I mean:
We will have, according to this flow:
Prime repo;
Bare repo hub;
Clones;
The development and production branches should belong to one repository, right?
If this is correct, then where should we issue the FIRST git init command?
On our Prime repo?
So we will have:
"webroot/myliveapp/" - production branch;
"webroot/devapp/" - development branch;
"webroot/.git" - Prime repository;
Does this make sense?
Or should the Prime repository correspond to our production branch location ?
*Note: if you need context about the workflow I'm trying to implement, it is this one:
http://joemaller.com/990/a-web-focused-git-workflow/
Thanks for the update on your question, it is more clear now.
I believe the problem you're having is based on a misunderstanding of Git workflow; Git doesn't equate directories to branches, it equates a view of your filesystem to branches. This is powerful - but easy to shoot yourself in the foot. Let me explain.
Git acts more like a database-backed, differentially-versioned, history tracking filesystem in itself. It is "above" your filesystem, not "part of" it. It doesn't use your filesystem to represent branches, rather, when you check out a different branch, all the files in your filesystem will change to be the files in that branch. You are asking Git to make your filesystem represent the alternate reality of that branch.
If you are on branch master, and it has a file root/foo.txt committed, and you check out branch experiment, which does not have root/foo.txt committed, you will find that file gone when you look for it. It is a part of master, not experiment, and so it is not present in your filesystem. This is why Git is really picky about your current branch being committed before it lets you switch branches - if you have unstaged changes on your filesystem that Git doesn't know about, it refuses to blow them away by overwriting them with a different reality. You have to intervene to make things right first.
So, to answer the question, don't create subdirectories for "myliveapp" and "devapp" - create different branches. Just have your one codebase under "webroot". Then, hack away on, say, the "unstable" branch, committing your changes as usual. You can then switch all of the files in your repository to be at the version of your dev server's files by switching to the "devapp" branch, and you can similarly switch back to "unstable" at any time.
When you want to update a branch, e.g. doing an update of your dev server, you can merge "unstable" into "devapp". This will make all of the files of "devapp" look like those of "unstable", bringing it up to date.
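In commands, with the branch names used above:

    cd webroot
    git checkout unstable        # hack away here
    git add -A && git commit -m "work in progress"
    git checkout devapp          # the same files now show the dev branch's state
    git merge unstable           # bring devapp up to date with unstable
    git checkout unstable        # and switch back whenever you like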
One other thing to note: the difference between a prime repo, a bare repo, and clones is almost nil. There is virtually no difference in the software; rather, it's a human convention to say "Linus' kernel is the canonical Linux kernel". With that understanding:
A prime repo is just one repository that everyone agrees holds the "canonical" version of the software. That is, whenever a developer has made a change they want everyone to see, rather than saying, "Pull my version of devapp", they can say, "I've published my changes to our prime repo." It's simply an easy convention for people to rally around.
A clone is a copy of some other repo. I could clone the prime repo, make changes, and then you can clone my repo. If you make changes, you can push them either onto the prime repo or onto mine, as long as the merge is valid and you have permissions on the computer.
A bare repo simply has no "working copy" - there is no "webroot" directory on that computer. It contains only what would normally live in the .git directory - which is fine for servers where nobody needs to alter the files.
Finally, the .git dir doesn't hold the files of your repo, it holds the git configuration and database. It's your entire repository history in database form, which is used to populate the rest of the repo with a particular version of your software. That's why I made the comment: you can locally check out any version of any alternate reality of the repository, with no network communication, at any time - because it's all there in the .git dir. The only network communication necessary is for when you want to sync your local repository to some other repository, using push or pull.
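For example, a common arrangement looks like this (paths and URLs are hypothetical):

    # on the server: a bare "hub" repository, no working copy
    git init --bare /srv/git/project.git

    # on a workstation: a clone with a working copy under webroot
    git clone ssh://server/srv/git/project.git webroot
    cd webroot
    # add the initial code, then create and publish the dev branch
    git add -A && git commit -m "Initial import"
    git checkout -b devapp
    git push origin master devapp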

How to setup a DVCS system that keeps some of the stuff we like about our centralized VCS?

Right now, we have a small team of developers using TFS for our version control. I'm evaluating the possibility of us moving to a DVCS, and am wondering if we'd need to give up some of the stuff we like about our current system if we moved to DVCS, or if we can find a way to support it.
Right now we have a Stable branch and one branch for each developer (you can think of each dev's branch as a feature branch that is reused from feature to feature). The things we like are:
1) Each time a dev checks in changes to his branch, we do a build and test of everything on the servers, followed by automatic deployment of all projects to that developer's test environment.
2) Merging from any dev's branch to Stable is done by me, so that I can have 1 last check on what is happening to our stable branch before committing the changes.
3) If I want to help a dev with something, I can just grab latest from their branch and look at it on my machine.
I'm trying to understand how this could work with a DVCS (specifically we are testing with Mercurial).
I'm hoping to be able to pull off something like this:
1) We setup a central repository, and we create Main and Release branches in addition to 1 branch for each developer.
2) All devs clone the repository to their local machine.
3) All work is done in their personal branch against the local repository.
4) When they are done, they would pull from the central repository and perform a local forward-integration merge from Main to their branch, to integrate any changes that have happened on Main in the meantime (see the command sketch after this list).
5) They would then push their changes to the central repository.
6) Some CI service would pickup this change, causing a build/test/deploy-to-dev of all our projects in that branch.
7) If everything was ok the dev would shoot me an email saying their branch was ready for merging to Main.
8) I could then merge in their changes, either by somehow connecting directly to the remote repository, or by doing a pull -> merge -> push.
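For steps 4 and 5, each developer's local routine might look like this (branch names follow the scheme above; DevX is a placeholder):

    hg pull                          # get everything new from the central repository
    hg update DevX                   # work on the personal branch
    hg merge Main                    # forward-integrate Main into the personal branch
    hg commit -m "Merge Main into DevX"
    hg push                          # publish the branch back to the central repository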
So to deal with our requests:
1) I'm assuming there are CI tools that can watch a branch in Mercurial and kick off a build/test/deploy process (like CC.Net).
2) I can still manage the final merge process from DevX to Main either by connecting to the remote repo, or by pulling, merging and pushing through my local repo.
3) I believe I could either pull changes directly from another dev's repo, or I could just pull from the central repo, and then update my working directory to work on their code.
So do I have this mostly right?
Yep, you have all that right. Regarding your final assumptions:
1) There's a Mercurial Source Control Block for CruiseControl.NET. Another CI server I've heard of in use with Mercurial is Jenkins.
2) Correct. For integration with Main, I would prefer pulling (from either) and merging on my own machine before pushing, rather than merging on the server.
3) Exactly so.
It sounds like your developers are fairly disciplined, but just in case you need better control over certain aspects of your operations:
You can use hooks to issue warnings when someone tries to merge their branch to Main. In-process hooks have to be written in Python, but they have access to the Mercurial API that way. You could also place hooks on the server that reject pushes containing a merge to Main not done by certain users.
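For instance, a server-side hook along these lines could reject pushed merges into Main that weren't made by the designated integrator; it checks the author recorded in the merge changesets (paths and the user name are hypothetical):

    # central repository's .hg/hgrc
    [hooks]
    pretxnchangegroup.mainmerges = /srv/hg/hooks/check_main_merges.sh

    # /srv/hg/hooks/check_main_merges.sh
    #!/bin/sh
    # HG_NODE is the first changeset of the incoming push
    bad=$(hg log -r "$HG_NODE:tip and branch('Main') and merge() and not user('integrator')" \
          --template '{node|short}\n')
    if [ -n "$bad" ]; then
        echo "merge(s) into Main not done by the integrator: $bad" >&2
        exit 1    # a non-zero exit rolls the push back
    fi
    exit 0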
One way some organizations control integration is a pull-only scenario. Only a few developers can push to the official repository and other developers send them pull requests. The Mercurial book's Chapter 6 covers this a bit, too.
A branch per developer is good. A branch per feature is also useful, allowing each developer to work on multiple things in parallel, then merging each to their branch when done. They just have to remember to close that feature branch before doing so, so the branch name doesn't keep popping up. This can be done with clones as well, but I find myself preferring named branches since I have to keep work/backup/laptop development clones all synced so I can work on whatever, whenever. I still do expendable work as a clone first.
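Closing and folding in a finished feature branch looks like this (branch names are hypothetical):

    hg update feature-login
    hg commit --close-branch -m "Close feature-login"
    hg update dev-alice              # back to the developer's own branch
    hg merge feature-login           # fold the finished feature in
    hg commit -m "Merge feature-login into dev-alice"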

Moving from Subversion to Mercurial - how to adapt the workflow and staging/integration systems?

We got all psyched about moving from svn to hg, and while the development workflow is more or less fleshed out, the most difficult part remains - the staging and integration system.
Hopefully this question goes a bit further than your common "how do I move from xxx to Mercurial". Please forgive the long and probably poorly written question :)
We are a web shop that does a lot of projects (mainly PHP and Zend), so we have one huge svn repo with 100+ folders, each representing a project with its own tags, branches and trunk, of course. On our integration and testing server (where QA and clients look at work results and test stuff) everything is pretty much automated - Apache is set up to pick up new projects automatically, creating a vhost for each project/trunk; MySQL migration scripts sit right there in trunk too, and developers can apply them through a simple web interface. Long story short, our workflow is now this:
Checkout code, do work, commit
Run an update on the server via the web interface (this basically does an svn up on the server for a particular project and also runs the db migration script if needed)
QA changes on the server
This approach is certainly suboptimal for large projects when we have 2+ developers working on the same code. Branching in svn was only causing more headaches - hence the move to Mercurial. And here is where the question lies - how does one organize an efficient staging/integration/testing server for this type of work (where you have many projects, and a single developer could be working on, say, 3 different projects in one day)?
We decided to have the 'default' branch track production, essentially, and then make all changes in individual branches. In this case, though, how can we automate staging updates for each branch? Whereas earlier, for one project, we were almost always working on trunk, so we needed one DB, one vhost, etc., now we're potentially talking about N databases per project, N vhost configs, and so on. Then what about the CI stuff (such as running phpDocumentor and/or unit tests)? Should it only be done on 'default'? On branches?
I wonder how other teams solve this issue, perhaps some best practices that we're not using or overlooking?
Additional notes:
Probably worth mentioning that we've picked Kiln as a repo hosting service (mostly since we're using FogBugz anyway)
This is by no means the complete answer you'll eventually pick, but here are some tools that will likely factor into it:
repositories without working directories -- if you clone -U or hg update null you get a repository with no working directory (only the .hg). They're better on the server because they take up less room and no one is tempted to edit there
changegroup hooks
For that last one the changegroup hook runs whenever one or more changesets arrive via push or pull and you can have it do some interesting things such as:
push the changesets on to another repo depending on what has arrived
update the receiving repo's working directory
For example one could automate something like this using only the tools described above:
developer pushes five changesets to central-repo/project1/main
last changeset is on branch 'my-experiment', so the csets are automatically re-pushed to the (created on demand) repo central-repo/project1/my-experiment
central-repo/project1/my-experiment automatically does hg update tip, which is certain to be on the my-experiment branch
central-repo/project1/my-experiment automatically runs the tests in its working dir and, if they pass, does a 'make dist' that deploys, which might set up the database and vhost too
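A rough sketch of the hook behind steps 2-4, collapsed into one script for brevity (paths are hypothetical):

    # on central-repo/project1/main, in .hg/hgrc
    [hooks]
    changegroup.autodeploy = /srv/hg/hooks/autodeploy.sh

    # /srv/hg/hooks/autodeploy.sh
    #!/bin/sh
    branch=$(hg log -r tip --template '{branch}')
    target="/srv/hg/central-repo/project1/$branch"
    # create the per-branch repo on demand, then forward that branch's changesets
    [ -d "$target" ] || hg clone -U . "$target"
    hg push -f -b "$branch" "$target"
    # update that repo's working directory and run the tests / deploy
    cd "$target" && hg update tip && make test && make dist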
The biggie, and chapter 10 in the Mercurial book covers this, is to not have the user waiting on that process. You want the user to push to a repo that contains possibly-okay code and have the automated processes do the CI and deploy work; what passes ends up in a likely-okay repo.
In the largest Mercurial setup in which I've worked (20 or so developers) we got to the point where our CI system (Hudson) was pulling from the maybe-okay repos for each project periodically, then building and testing, and handling each branch separately.
Bottom line: all the tools you need to setup whatever you'd like probably already exist, but gluing them together will be one-off sort of work.
What you need to remember is that DVCS (vs. CVCS) introduces another dimension to versioning:
You no longer have to rely only on branching (and getting a staging workspace from the right branch)
With a DVCS you also have the publication workflow (push/pull between repos)
Meaning your staging environment is now a repo (with the full history of the project), checked out at a certain branch:
Many developers can push many different branches to that staging repo: the reconciliation process can be done in isolation within that repo, in a "main" branch of your choice.
Or they can pull that staging branch in their repo and test things out before pushing back.
From Joel's tutorial on Mercurial HgInit
A developer doesn't necessarily have to commit for others to see their work: the publication process in a DVCS allows him/her to pull the staging branch first, reconcile any conflicts locally, and then push to the staging repo.
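In practice that round trip looks something like this (the URL and branch names are hypothetical):

    # bring in the current state of the staging branch
    hg pull -b staging https://staging.example.com/project
    hg update my-feature
    hg merge staging                 # reconcile any conflicts locally
    hg commit -m "Merge staging into my-feature"
    # publish the reconciled work back to the staging repo
    hg push -b my-feature https://staging.example.com/project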