Shrinking Mercurial repo size (manifests in particular)

Shrinking Mercurial repo size (manifests in particular) - mercurial

We're currently undergoing an attempt to migrate our mercurial (in this case an ancient version of Kiln) to BitBucket and we immediately ran in issues with size (if you don't know, BitBucket imposes a rather generous 2gb repo limit - that we happened to blow by).
Anyways, I've cleaned up the sins of the past:
using convert with filemaps (removing binaries/static files that should never been in the repo)
creating separate repos for other things that shouldn't have been in the main repo
attempting to use generaldelta to reduce size (as per
https://www.mercurial-scm.org/wiki/ScaleMercurial)
using branchmaps to try to consolidate old branches and their associated changesets
Even with these steps, I still have a very large manifest file, despite the "data" stored for the repo shrinking down to a "manageable" size (~600mb), my manifest file is nearly 700mb.
Some additional information: in general, we practice branch-per-feature and have two-branch track out to environments:
a release branch (deployed to staging and then to prod)
default branch (originally off of release, all features are first merged here and then to release. this branch dies and is reborn every two weeks)
One difference in this workflow is that default itself never is merged in to release (a la gitflow/hgflow). Does this uni-directional flow into default cause issues?
We "only" have 120 open branch heads, so it seems like that's manageable?
I'm obviously missing some step here (or else the repo is just completely hosed).

Just for future reference, I followed Tim's suggestion above. My full script ended up looking like this:
hg --config format.generaldelta=1 clone --pull oldrepo oldrepo-generaldelta
hg --config format.generaldelta=1 clone --pull oldrepo-generaldata oldrepo-generaldelta2
hg convert --filemap filemap.txt oldrepo-generaldelta2 newrepo
As Tim mentioned in his linked answer - our manifests went from about 700mb down to about 40mb with the second clone.
Can I optimize a Mercurial clone?

Related

What parts of the hg-git metadata state do I need to back up in order to start over?

This is currently a purely theoretical question (related to this one), but let me first give the background. Whenever you run hg gexport the initial hash will vary from invocation to invocation. This is similar to when you run git init or hg init. However, since the Mercurial and Git commits correspond to each other and build on previous hashes, there should be some way to start over from a minimal common initial state (or minimal state on the Git side, for example).
Suppose I have used hg-git in the past and now I am trying to sync again between my Mercurial and my Git states, but without (or very little of) the original .git directory from the hg gexport. What I do have, though, are the two metadata files: git-mapfile and git-tags.
There is an old Git mirror, which is sort of "behind" and the Mercurial repo which is up-to-date.
Then I configure the Mercurial repo for hg-git like so (.hg/hgrc):
[git]
intree = True
[extensions]
hgext.bookmarks=
topic=
hggit=
[paths]
default = ssh://username#hgserver.tld//project/repo
gitmirror = git+ssh://username#server.tld/project/repo.git
If I now do the naive hg pull gitmirror all I will gain is a duplication of every existing commit on an unrelated branch with unrelated commit history (and the double amount of heads, compared to prior to the pull).
It clearly makes no big difference to place those two metadata files (git-mapfile and git-tags) into .hg. The biggest difference is that the pull without these files will succeed (but duplicate everything) and the pull with them will error out at the first revision because of "abort: unknown revision ..." (even makes sense).
Question: which part(s) and how much (i.e. what's the minimum!) of the Git-side data/metadata created by hg gexport do I have to keep around in order to start over syncing with hg-git? (I was unable to find this covered in the documentation.)

The core metadata is stored in .hg/git-mapfile, and the actual Git repository is stored in .hg/git or .git dependending on intree. The git-mapfile is the only file needed to reproduce the full state; anything else is just cache. In order to recreate a repository from scratch, do the following:
Clone or initialise the Mercurial repository, somehow.
Clone or initialise the embedded Git repository, e.g. using git clone --bare git+ssh://username#server.tld/project/repo.git .hg/git.
Copy over the metadata from the original repository, and put it into .hg/git-mapfile.
Run hg git-cleanup to remove any commits from the map no longer known to Mercurial.
Pull from Git.
Push to Git.
These are the steps I'd use, off the top of my head. The three last steps are the most important. In particular, you must pull from Git to populate the repository prior to pushing; otherwise, the conversion will fail.

How to properly use hg share extension?

Say I have cloned a repo to a directory called ~/trunk and I want to share a branch named my-new-branch to the directory ~/my-new-branch. How would I do that with the hg share extension?
This is what I've been doing:
cd ~
hg share trunk my-new-branch
But then when I cd into the new directory I have to hg up to the branch?
Confused.

IMO share is a very useful command which has some great advantages over clone in some cases. But I think it is unfortunately overlooked in many instances.
What share does is to re-use the 'store' of Mercurial version control information between more than one local repository. (It has nothing directly to do with branching.)
The 'store' is a bunch of files which represents all the history Mercurial saves for you. You don't interact with it directly. Its a black box 99.99% of the time.
share differs from the more commonly-used clone command in that clone would copy the information store, taking longer to run and using potentially a lot more disk space.
The "side effect" of using share rather than clone is that you will instantly see all the same commits in every shared repository. It is as if push/pull were to happen automatically among all the shared repos. This would not be true with clone, you'd have to explicitly push/pull first. This is quite useful but something to be mindful of in your workflow because it may surprise you the first time you use it if you are only used to clone.
If you want to work in multiple branches (named or unnamed) of your project simultaneously,
either clone or share will work fine. One you have created the second repository, yes you need to update it to whatever changeset you want to begin working on.
Concrete example using share:
hg clone path\to\source\repo working1 # Create local repo working1 cloned from somewhere
cd working1
hg up branchname1
cd ..
hg share working1 working2 # shares the 'store' already used for working1 with working2
cd working2
hg up branchname2 # some other branch or point to start working from
As soon as you commit something in working1 that commit will be visible in the history of working2. But since they are not on the same branch this has no real immediate effect on working2.
working2 will retain path\to\source\repo as its default push/pull location, just like working1.
My own practice has been to create numerous locally shared repositories (quick, easy, saves space) and work in various branches. Often I'll even have a few of them on the same named branch but set to different points in history, for various reasons. I no longer find much need to actually clone locally (on the same PC).
A caveat -- I would avoid using share across a network connection - like to a repo on a mapped network drive. I think that could suffer some performance or even reliability issues. In fact, I wouldn't work off a network drive with a Mercurial repo (if avoidable) in any circumstance. Cloning locally would be safer.
Secondly -- I would read the docs, there are a few weird scenarios you might encounter; but I think these are not likely just based on my own experience.
Final note: although share is implemented as an "extension" to Mercurial, it has been effectively a part of it since forever. So there is nothing new or experimental about it, don't let the "extension" deal put you off.

Mercurial Repo Living Archive

We have an Hg repo that is over 6GB and 150,000 changesets. It has 8 years of history on a large application. We have used a branching strategy over the last 8 years. In this approach, we create a new branch for a feature and when finished, close the branch and merge it to default/trunk. We don't prune branches after changes are pushed into default.
As our repo grows, it is getting more painful to work with. We love having the full history on each file and don't want to lose that, but we want to make our repo size much smaller.
One approach I've been looking into would be to have two separate repos, a 'Working' repo and an 'Archive' repo. The Working repo would contain the last 1 to 2 years of history and would be the repo developers cloned and pushed/pulled from on a daily basis. The Archive repo would contain the full history, including the new changesets pushed into the working repo.
I cannot find the right Hg commands to enable this. I was able to create a Working repo using hg convert <src> <dest> --config convert.hg.startref=<rev>. However, Mecurial sees this as a completely different repo, breaking any association between our Working and Archive repos. I'm unable to find a way to merge/splice changesets pushed to the Working repo into the Archive repo and maintain a unified file history. I tried hg transplant -s <src>, but that resulted in several 'skipping emptied changeset' messages. It's not clear to my why the hg transplant command felt those changeset were empty. Also, if I were to get this working, does anyone know if it maintains a file's history, or is my repo going to see the transplanted portion as separate, maybe showing up as a delete/create or something?
Anyone have a solution to either enable this Working/Archive approach or have a different approach that may work for us? It is critical that we maintain full file history, to make historical research simple.
Thanks

You might be hitting a known bug with the underlying storage compression. 6GB for 150,000 revision is a lot.
This storage issue is usually encountered on very branchy repositories, on an internal data structure storing the content of each revision. The current fix for this bug can reduce repository size up to ten folds.
Possible Quick Fix
You can blindly try to apply the current fix for the issue and see if it shrinks your repository.
upgrade to Mercurial 4.7,
add the following to your repository configuration:
[format]
sparse-revlog = yes
run hg debugupgraderepo --optimize redeltaall --run (this will take a while)
Some other improvements are also turned on by default in 4.7. So upgrade to 4.7 and running the debugupgraderepo should help in all cases.
Finer Diagnostic
Can you tell us what is the size of the .hg/store/00manifest.d file compared to the full size of .hg/store ?
In addition, can you provide use with the output of hg debugrevlog -m
Other reason ?
Another reason for repository size to grow is for large (usually binary file) to be committed in it. Do you have any them ?

The problem is that the hash id for each revision is calculated based on a number of items including the parent id. So when you change the parent you change the id.
As far as I'm aware there is no nice way to do this, but I have done something similar with several of my repos. The bad news is that it required a chain of repos, batch files and splice maps to get it done.
The bulk of the work I'm describing is ideally done one time only and then you just run the same scripts against the same existing repos every time you want to update it to pull in the latest commits.
The way I would do it is to have three repos:
Working
Merge
Archive
The first commit of Working is a squash of all the original commits in Archive, so you'll be throwing that commit away when you pull your Working code into the Archive, and reparenting the second Working commit onto the old tip of Archive.
STOP: If you're going to do this, back up your existing repos, especially the Archive repo before trying it, it might get trashed if you run this over the top of it. It might also be fine, but I'm not having any problems on my conscience!
Pull both Working and Archive into the Merge repo.
You now have a Merge repo with two completely independent trees in it.
Create a splicemap. This is just a text file giving the hash of a child node and the hash of its proposed parent node, separated by a space.
So your splicemap would just be something like:
hash-of-working-commit-2 hash-of-archive-old-tip
Then run hg convert with the splicemap option to do the reparenting of the second commit of Working onto the old tip of the Archive. E.g.
hg convert --splicemap splicemapPath.txt --config convert.hg.saverev=true Merge Archive
You might want to try writing it to a different named repo rather than Archive the first time, or you could try writing it over a copy of the existing Archive, I'm not sure if it'll work but if it does it would probably be quicker.
Once you've run this setup once, you can just run the same scripts over the existing repos again and again to update with the latest Working revisions. Just pull from Working to Merge and then run the hg convert to put it into Archive.

Mercurial clone cleanup to match upstream

I have a hg clone of a repository into which I have done numerous changes locally over a few months and pushed them to my clone at google code. Unfortunately as a noob I committed a whole bunch of changes on the default branch.
Now I would like to make sure my current default is EXACTLY as upstream and then I can do proper branching off default and only working on the branches..
However how do I do that cleanup though?
For reference my clone is http://code.google.com/r/mosabua-roboguice/source/browse
PS: I got my self into the same problem with git and got that cleaned up: Cleanup git master branch and move some commit to new branch?

First, there's nothing wrong with committing on the default branch. You generally don't want to create a separate named branch for every task in Mercurial, because named branches are forever. You might want to look at the bookmark feature for something closer to git branches ("hg help bookmarks"). So if the only thing wrong with your existing changesets is that they are on the default branch, then there really is nothing wrong with them. Don't worry about it.
However, if you really want to start afresh, the obvious, straightforward thing to do is reclone from upstream. You can keep your messy changesets by moving the existing repo and recloning. Then transplant the changesets from the old repo into the new one on a branch of your choosing.
If you don't want to spend the time/bandwidth for a new clone, you can use the (advanced, dangerous, not for beginners) strip command. First, you have to enable the mq extension (google it or see the manual -- I'm deliberately not explaining it here because it's dangerous). Then run
hg strip 'outgoing("http://upstream/path/to/repo")'
Note that I'm using the revsets feature added in Mercurial 1.7 here. If you're using an older version, there's no easy way to do this.

The best way to do this is with two clones. When working with a remote repo I don't control I always keep a local clone called 'virgin' to which I make no changes. For example:
hg clone -U https://code.google.com/r/mosabua-roboguice-clean/ mosabua-roboguice-clean-virgin
hg clone mosabua-roboguice-clean-virgin mosabua-roboguice-clean-working
Note that because Mercurial uses hard links for local clones and because that first clone was a clone with -U (no working directory (bare repo in git terms)) this takes up no additional disk space.
Work all you want in robo-guice working and pull in robo-guice virgin to see what's going on upstream, and pull again in roboguice-working to get upstream changes.
You can do something like this after the fact by creating a new clone of the remote repo and if diskspace is precious use the relink extension to associate them.

Preface - all history changes have sense only for non-published repos. You'll have to push to GoogleCode's repo from scratch after editing local history (delete repo on GC, create empty, push) - otherwise you'll gust get one more HEAD in default branch
Manfred
Easy (but not short) way - default only+MQ
as Greg mentioned, install MQ
move all your commits into MQ-patches on top of upstream code
leave your changes as pathes forever
check, edit if nesessary and re-integrate patches after each upstream pull (this way your own CG-repo without MQ-patches will become identical to upstream)
More complex - MQ in the middle + separate branches
above
above
create named branch, switch to it
"Finish" patches
Pull upstream, merge with your branch changes (from defaut to yourbranch)
Commit your changes only into yourbranch
Rebasing
Enable rebase extension
Create named branch (with changeset in it? TBT)
Rebase your changesets to the new ancestor, test results
See 5-6 from "More complex" chapter

Perhaps you could try the Convert extension. It can bring a repository in a better shape, while preserving history. Of course, after the modifications have been done, you will have to delete the old repo and upload the converted one.

Mercurial Remove History

Is there a way in mercurial to remove old changesets from a database? I have a repository that is 60GB and that makes it pretty painful to do a clone. I would like to trim off everything before a certain date and put the huge database away to collect dust.

There is no simple / recommended way of doing this directly to an existing repository.
You can however "convert" your mercurial repo to a new mercurial repo and choose a revision from where to include the history onwards via the convert.hg.startrev option
hg convert --config convert.hg.startrev=1234 <source-repository> <new-repository-name>
The new repo will contain everything from the original repo minus the history previous to the starting revision.
Caveat: The new repo will have completely new changeset IDs, i.e. it is in no way related to the original repo. After creating the new repo every developer has to clone the new repo and delete their clones from the original repo.
I used this to cleanup old repos used internally within our company - combined with the --filemap option to remove unwanted files too.

You can do it, but in doing so you invalidate all the clones out there, so it's generally not wise to do unless you're working entirely alone.
Every changeset in mercurial is uniquely identified by a hashcode, which is a combination of (among other things) the source code changes, metadata, and the hashes of its one or two parents. Those parents need to exist in the repo all the way back to the start of the project. (Not having that restriction would be having shallow-clones, which aren't available (yet)).
If you're okay with changing the hashes of the newer changesets (which again breaks all the clones out there in the wild) you can do so with the commands;
hg export -o 'changeset-%r.patch' 400:tip # changesets 400 through the end for example
cd /elsewhere
hg init newrepo
cd newrepo
hg import /path/to/the/patches/*.patch
You'll probably have to do a little work to handle merge changesets, but that's the general idea.
One could also do it using hg convert with type hg as both the source and the destination types, and using a splicemap, but that's probably more involved yet.
The larger question is, how do you type up 60GB of source code, or were you adding generated files against all advice. :)

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008