How can I retrieve only a subdirectory from a Mercurial Repository? - mercurial

I'm trying to sell our group on using Mercurial as a source repository rather than VSS. In the process of updating our build scripts, I'm running into an issue trying to retrieve files from the Hg repository.
Our builds are automated with NAnt and currently work for local builds or builds from VSS (ie, pull the source as needed from VSS). I'm trying to update them to work with Mercurial as well.
Basically, when I'm working with single files, I don't have any issues since I can just use NAnt's 'get' task (after getting the appropriate revision hash) to retrieve the individual file.
The problem that I'm having is when I need to work with a directory (and subdirectories) of files that aren't at the root of the repository. I can't seem to figure out the proper commands to retrieve/copy a subdirectory from the repository to my 'working' directory for the builds. I've spent basically the whole afternoon trying to figure out how to do this with the mercurial executables (so I can use a NAnt 'exec' task), and have basically hit a wall so I figured I'd try posting here.
Can someone confirm whether this is possible, and provide some suggestions as to how I might be able to do this? I realize that Mercurial tracks changes by files and not directories, but it seems odd to me that this isn't available out of the box (from what I can tell).
If it's just not possible, the only workarounds I see are either maintaining NAnt fileset lists of expected files to work with (ugh!), or cloning the entire repository to a temporary directory and then just copying the files from that source as needed (this feels like a cludge to me).
I realize that I could simply create another repository for the directory that I want to work with, but I'd prefer to not go that route since I think that would increase the complexity of what I'm trying to do by a significant amount (I would have to apply this a large number of times for all of the different libraries that we build..).

Mercurial doesn't let you get only part of a repository. You have to get the whole tree. It's much more whole-repo focused than svn is.

You could try and segment your repository into multiple repos and manage them using the subrepos feature. Then you can pull the subdirectories independently.

Related

Mercurial Repo Living Archive

We have an Hg repo that is over 6GB and 150,000 changesets. It has 8 years of history on a large application. We have used a branching strategy over the last 8 years. In this approach, we create a new branch for a feature and when finished, close the branch and merge it to default/trunk. We don't prune branches after changes are pushed into default.
As our repo grows, it is getting more painful to work with. We love having the full history on each file and don't want to lose that, but we want to make our repo size much smaller.
One approach I've been looking into would be to have two separate repos, a 'Working' repo and an 'Archive' repo. The Working repo would contain the last 1 to 2 years of history and would be the repo developers cloned and pushed/pulled from on a daily basis. The Archive repo would contain the full history, including the new changesets pushed into the working repo.
I cannot find the right Hg commands to enable this. I was able to create a Working repo using hg convert <src> <dest> --config convert.hg.startref=<rev>. However, Mecurial sees this as a completely different repo, breaking any association between our Working and Archive repos. I'm unable to find a way to merge/splice changesets pushed to the Working repo into the Archive repo and maintain a unified file history. I tried hg transplant -s <src>, but that resulted in several 'skipping emptied changeset' messages. It's not clear to my why the hg transplant command felt those changeset were empty. Also, if I were to get this working, does anyone know if it maintains a file's history, or is my repo going to see the transplanted portion as separate, maybe showing up as a delete/create or something?
Anyone have a solution to either enable this Working/Archive approach or have a different approach that may work for us? It is critical that we maintain full file history, to make historical research simple.
Thanks
You might be hitting a known bug with the underlying storage compression. 6GB for 150,000 revision is a lot.
This storage issue is usually encountered on very branchy repositories, on an internal data structure storing the content of each revision. The current fix for this bug can reduce repository size up to ten folds.
Possible Quick Fix
You can blindly try to apply the current fix for the issue and see if it shrinks your repository.
upgrade to Mercurial 4.7,
add the following to your repository configuration:
[format]
sparse-revlog = yes
run hg debugupgraderepo --optimize redeltaall --run (this will take a while)
Some other improvements are also turned on by default in 4.7. So upgrade to 4.7 and running the debugupgraderepo should help in all cases.
Finer Diagnostic
Can you tell us what is the size of the .hg/store/00manifest.d file compared to the full size of .hg/store ?
In addition, can you provide use with the output of hg debugrevlog -m
Other reason ?
Another reason for repository size to grow is for large (usually binary file) to be committed in it. Do you have any them ?
The problem is that the hash id for each revision is calculated based on a number of items including the parent id. So when you change the parent you change the id.
As far as I'm aware there is no nice way to do this, but I have done something similar with several of my repos. The bad news is that it required a chain of repos, batch files and splice maps to get it done.
The bulk of the work I'm describing is ideally done one time only and then you just run the same scripts against the same existing repos every time you want to update it to pull in the latest commits.
The way I would do it is to have three repos:
Working
Merge
Archive
The first commit of Working is a squash of all the original commits in Archive, so you'll be throwing that commit away when you pull your Working code into the Archive, and reparenting the second Working commit onto the old tip of Archive.
STOP: If you're going to do this, back up your existing repos, especially the Archive repo before trying it, it might get trashed if you run this over the top of it. It might also be fine, but I'm not having any problems on my conscience!
Pull both Working and Archive into the Merge repo.
You now have a Merge repo with two completely independent trees in it.
Create a splicemap. This is just a text file giving the hash of a child node and the hash of its proposed parent node, separated by a space.
So your splicemap would just be something like:
hash-of-working-commit-2 hash-of-archive-old-tip
Then run hg convert with the splicemap option to do the reparenting of the second commit of Working onto the old tip of the Archive. E.g.
hg convert --splicemap splicemapPath.txt --config convert.hg.saverev=true Merge Archive
You might want to try writing it to a different named repo rather than Archive the first time, or you could try writing it over a copy of the existing Archive, I'm not sure if it'll work but if it does it would probably be quicker.
Once you've run this setup once, you can just run the same scripts over the existing repos again and again to update with the latest Working revisions. Just pull from Working to Merge and then run the hg convert to put it into Archive.

Mercurial: Incomplete central repository possible?

I want to realize the following setup:
AtWork:MercurialRepo <-> Internet:MercurialRepo <-> AtHome:MercurialRepo
Problem is the repository is several gigs. I already have the entire repo at home (through bundling->cdrom->unbundling). The thing is, I do not want to store the whole repository on the internet. Is there a way to temporarily exclude folders from versioning in order to push/pull only a subset of the repo I am working on through the internet? How do I best accomplish my goal? From time to time I would need to do the tedious bundling -> cdrom -> unbundling route, just to update everything else, but in general I do want to use the internet route and do not want to store the whole repo there.
So, as you've found out by now you can't selectively clone some files from a repository. The best you can do is clone a subset of all branches; but you will get the entire past history of these branches, for all files in the repository. So, unless a lot of the big files are only known in some branches and not others, this won't help you.
Since your problem is the large size of files (rather than a long and bulky history), you probably need to break it down into several "subrepositories" of manageable size. Note that the subset you are interested in cloning must be a subrepository; cloning the main repo necessarily includes the subrepositories. The mercurial subrepository documentation recommends that you make a trivial ("thin shell") main repo, and put all your project code in subrepositories.
Subrepositories are a complex solution, and are considered a "feature of last resort" by the mercurial team. It's a complex setup, there are various limitations (see the docs), and you'll have the extra complication of trying to convert your repo in a way that will preserve file history. So, it's worth considering ways to avoid this:
a) It would be best if you can avoid the middle copy of your repo; is there no way you can set up ssh access or a proxy so that your home repo can talk to your work repo directly? (Or vice versa; it's enough if one of the locations is able to contact the other).
b) You could carry the repo on a USB stick, as #vaclav's answer suggests.
c) Or maybe you should just bite the bullet and clone the entire repo on the internet.
Is there a way to temporarily exclude folders from versioning in order to push/pull only a subset of the repo I am working on through the internet?
Not folders, but some parts of repo - yes
You can push -b (only some branch(es)) or push -r (revision with ancestors: for latest work it will be -r tip), but final size of transfer is heavy dependent from type of your DAG - in case of a lot of cross-branch merges you probably skip only small part of changesets
I have small idea, bit different from what you asked, but...
If I have same issue, I would thing of using usb flash as whole repository (if you are about 10 or 20 gig it should be cheap). So at work you can copy, or clone whole repo to usb, pull new changes from it at home, and after your home working is done, push it to repo on flash, then pull it to repo at work(I use even temporary commits for undone work which I revert to working directory and strip, so I can continue where I ended).
But definitely easiest way, is to try get some connection to work servers, or to your machine at work. Or get bigger space for repo at internet. So, just another Ideat. HTH
Is not really possible. The closest thing would be to use sub-repositories which will effectively allow you to have only part of your big repo on the net.

"factorizing" a mercurial repository on kiln

Summarized questions:
What is the simplest (and best) way to shift a group of files from an existing repository to a new sub repository, so those files can be integrated with other parent repositories, some of which may not yet exist?
Do files in subrepositories need to be in discrete folders, or can they exist alongside other files?
Detailed Questions:
I have begun the process of creating multiple repositories representing several projects that have shared components, and that is going well, thanks to SO and some helpful answers to my question here
As I move on to adding a second project I notice there are a few files in my projects that are duplicated, and are essentially the same thing, with enough similarity to warrant taking them out of a main project repository and creating a new subrepository so they can be
used by any new projects I begin, and
removed from other existing repositories, since they are identical.
I am assuming the best way is to simply create a new repository, move the files across on the local file system, push both repositories, and then create a .hgsub file and proceeed as in the answer to my earlier question. This would obviously then shift the files concerned to a subfolder in the local file system under each main project, which i can live with, but it does raise the hypothetical question - is it possible to have a list of files in a repository that are effectively part of a sub repository but reside alongside other files (i.e. not in a sub folder).
If I wanted to (for example) have a "acme.h" file in each project that is part of another repository could I do this? as it happens, I don't need to do this at this point in time, and in my current situation it would be better from a design point of view to have the files I need to "refactor" into another repository in their own subfolder, however that might not always be the case. I use refactor in quotes here, as strictly speaking it's more about refactoring duplicated files that is refactoring code - however the same principle applies.
hopefully my questions are succinct enough to be answered without too much more explanation.
Thanks for summary, makes it much easier to answer!
What is the simplest (and best) way to shift a group of files from an existing repository to a new sub repository, so those files can be integrated with other parent repositories, some of which may not yet exist?
You can use the convert extension to extract a directory from an existing Mercurial repository. You'll want to use the --filemap flag and in the filemap you include the directory you want and rename it to the root. See hg help convert for more info.
After you get a smaller repository with the
Do files in subrepositories need to be in discrete folders, or can they exist alongside other files?
They must be in their own folders. This is simply because that's how a repository looks like in Mercurial, Git, Subversion, ... When you're dealing with subrepositories, then Mercurial is not tracking the files inside the subrepo: it's just asking some (other) system to make a checkout of repository foo at some location.
So when your .hgsub file has
foo = foo
bar = [git]bar
baz = [svn]baz
then Mercurial will notice this on hg update and run
hg clone default-path-of-this-repo/foo foo
git clone default-path-of-this-repo/bar bar
svn checkout default-path-of-this-repo/baz baz
for your. This explains why subrepostories are directories in the outer repository: that's simply what a clone/checkout looks like these days.
As you can see, subrepositories can be of different types. It's conceivable that someone could add a RCS subrepository type for tracking individual files. They would then not have to live in a directory.

Mercurial (Hg) and Binary Files

I am writing a set of django apps and would like to use Hg for version control. I would like each app to be independent of the others so in each app there may be a directory for static media that contains images that I would not want under version control. In other words, the binary files would not all be in one central location
I would like to find a way to clone the repository that would include copies of the image files. It also would be great if when I did a merge, if there were an image file in one repo and not another, that there would be some sort of warning.
Currently I use a python script to find images and other binary files that are in one repo, but not the other. But a lot of people must face this problem, so there must be a more robust and elegant solution.
One one other thing...for reasons I do not want to go into, usually one of my repos is on a windows machine, and the other is on Linux. So a crossplatform solution would be nice.
Since Mercurial 2.0 the extension largefiles is now included in the main distribution. That extension keeps and manages large files outside of the "normal" repository in a way that you get the benefit of DCVS but without the benefit of exponential size and processing time growth.
Other extension that work along similar lines are SnapExtension and BigFilesExtension. However, those two are not distributed with Mercurial (you have to get them manually).
Mercurial can track any kind of file, for binary files if something changes then the whole file gets replaced not just the changes.
On the getting a warning if one repo doesn't contain a file, that's kind of the point of a DVCS is that the repos are related but are autonomous. You could always check and see what files were added during a synch or merge operation.
The current Mercurial book (by Bryan O'Sullivan) says, that Mercurial stores diffs also for binary files. How efficient this is, obviously depends on the nature of changes to binary files.

Does anyone have a script to handle multiple hg repositories at once?

I have a project that is combining multiple hg repositories (different components) to build a single application. I'm looking for a cross-platform tool to support performing an operation on multiple repos at the same time (e.g. tag, pull, push, commit etc...) Essentially, I'm looking for the 'repo' script that Google wrote for Android, but for hg instead of git:
http://source.android.com/download/using-repo
I searched on stack overflow and found this:
mercurial windows batch file for pulling changes to multiple repositories
But it's still a bit manual and windows only. I know it's not that hard to write the script to either pass a command to the repos or try to encapsulate everything, but thought that it might be a common thing so maybe others already have a solution. I suppose one approach would be to port the repo script to hg (find and replace git with hg would probably get pretty far for simple operations).
What do other people do in this situation?
Definitely look at the new (in version 1.3) subrepositories feature in Mercurial. It lets you have an overarching repository that contains other repositories. The state of the top level includes a file that specifies the hash of the tip of the sub-repos, so you can effectively specify a single hash node id that encompasses the state of all the subordinate repos.
https://www.mercurial-scm.org/wiki/subrepos
I have been in a similar scenario for the past year, working daily with a project spread over 6 repositories, and being the one responsible for most branching/merging stuff etc.
At some point I found a simple bash-script for checking multiple SVN repo's. I adoptet it to Mercurial and have extended it over time. Right now I can't recall or google the original source of the script, so I unfortunately can't credit the original author (will credit in my script if I find the info later).
My "hgall" script is published on GitHub at https://github.com/JKrag/hgall
It is made for my own use, and thus directly references a few extensions that I have been using, but this can been easily edited out. (or just don't use the commands that call extensions you haven't installed).
The script is documented in the README, including my personal "tweak" of having it (by default) ignore repo's that start with underscore, as I use these as temporary clones to test messy operations (e.g. large merges).
I hope this script is useful to some of you - it has been of great help to me over the last year. Our company has just moved to Git, so I will probably not be making many updates in the near future, but might possibly be porting it to Git some day....
Additionally, Iain Lowe has an extension https://bitbucket.org/ilowe/multirepo/src which can manage multiple repositories.