Optimizing Mercurial repository size

How does one optimize Mercurial repositories so that older revisions take the minimum required space?
I am aware that Mercurial already does some magic to group and compress existing commits. However, is there a way to force a manual run of this operation, so that as much space as possible is saved, disregarding speed? Is it possible to pack more revisions into one stream, change the compression algorithm -- anything to better compress old changesets?
I don't have a lot of large-sized repositories right now, but I do have some medium-to-big sized ones that could use some shrinkage in the early history.
Git appears to have git gc [--aggressive] which, to a git non-expert, seems to do some magic cutting down the cruft and compressing the repos. It also has git repack, which also seems to do the same thing, albeit with some additional expert options. At least that's how it seems to me: changesets can be 'packed' differently.

Have you tried using the shrink-revlog.py extension in the mercurial/contrib directory? On very branchy repositories it may cut down the size of the manifest significantly (OTOH, it had zero effect for me on a nearly 1GB manifest in a repo converted from subversion).
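If you want to try it, the invocation below is only a rough sketch: the exact form depends on your Mercurial version (newer copies of the script are loaded as an extension that adds an hg shrink command, older copies could be run directly against a revlog), so the path and the --revlog option here are assumptions to check against the script's docstring. Back up the repository first, because the script rewrites the revlog in place.
# back up first; the path to the contrib script is just an example
$ cp -r myrepo myrepo.bak
$ cd myrepo
$ hg --config extensions.shrink=/usr/share/mercurial/contrib/shrink-revlog.py \
     shrink --revlog .hg/store/00manifest.i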


Mercurial sparse checkout

In this F8 conference video (starting at 8:40) from 2015 they speak about the advantages of using Mercurial and a single repository across Facebook.
How does this work in practice? Using Mercurial, can I checkout a subdirectory (live in SVN)? If so, how? Do I need a facebook-mercurial-extension for this?
P.S.: I only found answers like this or this from 2010 on SO, where I am not sure if the answers still apply with all the efforts FB put into it.
From your question it is not clear if you are looking for a workflow (the monorepo vs multiple repos debate) or for performance and scaling for a huge code base.
For the workflow, I suggest googling for monorepo. It has its pros and cons, you need to understand your situation and current workflow to decide. For the performance and scaling, keep reading.
The idea of remotefilelog is not to check out a subdirectory (as you mention); the idea is to check out everything. In order to do that in an efficient way, you need two extensions actively developed by Facebook:
remotefilelog. This gives you something conceptually similar to a shallow clone. This reduces hg clone and hg pull time.
fsmonitor (previously called hgwatchman, it is now part of mercurial core). This dramatically reduces time of local operations such as hg status. Note that fsmonitor is independent from remotefilelog. You can start experimenting with this, since it doesn't require any setup on the server side.
With a recent Mercurial (which I strongly suggest) you can shave off the additional startup time of the Python interpreter by using the command server together with chg.
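As a minimal sketch of the client-side setup (assuming a Mercurial recent enough to ship fsmonitor, plus the watchman daemon and the chg binary installed separately):
# ~/.hgrc or the repository's .hg/hgrc
[extensions]
fsmonitor =
# then use chg as a drop-in replacement for hg; it keeps a command server resident
$ alias hg=chg
$ hg status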
Some additional notes:
I tested fsmonitor extensively. It works very well: on huge repos the time of hg status drops from 10 seconds to less than 1 second (and the majority of that second is Python startup time, see above about chg). If your repository is really huge, you might need to fine-tune some inotify kernel parameters (or the equivalent on macOS). The fsmonitor documentation has all the information you need.
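For example, on Linux the inotify limits that watchman (used by fsmonitor) tends to hit can be raised like this; the values are only illustrative:
$ sudo sysctl fs.inotify.max_user_watches=524288
$ sudo sysctl fs.inotify.max_queued_events=131072
# add the same keys to /etc/sysctl.conf to make them persistent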
I didn't test remotefilelog, although I read everything I found about it and I am sure it works. Depending on how development is done (whether everybody always has Internet connectivity, whether the organization has its own master repo) there can be a caveat: it partially transforms decentralized hg into a centralized VCS like svn. Some operations that can normally be done offline (for example hg log, or the first hg update to a changeset in the past) will now require connectivity to the master repository.
Before considering remotefilelog, I used the largefiles extension extensively on a huge repo. It has the same drawbacks as remotefilelog, plus some confusing corner cases for users who want to use hg just to get things done without taking the time to understand how it works. If I were to manage another huge repo, I would use remotefilelog instead of largefiles, although their use cases are not really the same.
Mercurial has also support for subrepositories (doc1, doc2). The problem is that it changes the behavior of hg depending on where you are in the source tree. Again, if the developers don't care about really understanding how hg works, it will be just too confusing.
Additional information:
Facebook Engineering blog post
scaling mercurial wiki, although not completely up to date
whatever else you find just by googling mercurial facebook.
i am not sure if the answers still apply with all the efforts FB put into it
(Early 2017) The answers in the questions linked still apply (because they occasionally get updated) but note that you will have to read all the comments and answers.
remotefilelog essentially allows on demand shallow clones (so you don't fetch the history for everything for all time) but you still fetch the essential metadata for, and checkout across, all the directories of the repo at the desired revision.
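For reference, the remotefilelog README describes a setup roughly like the following; treat the option names, paths and the --shallow flag as assumptions to verify against the version you actually install:
# server-side .hg/hgrc
[extensions]
remotefilelog = /path/to/remotefilelog
[remotefilelog]
server = True

# client-side ~/.hgrc
[extensions]
remotefilelog = /path/to/remotefilelog
[remotefilelog]
cachepath = /home/me/.hgcache

# then clone without full file history
$ hg clone --shallow ssh://server/repo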
Using Mercurial, can i checkout a subdirectory (li[k]e in SVN)? If so, how?
https://stackoverflow.com/a/40355673/7836056 discusses how you might use third-party extensions to allow narrow/sparse checkouts (Facebook's sparse.py) or narrow clones (Google's NarrowHG) with Mercurial, thus only "creating" a single directory from within the main repository (albeit with radically different tradeoffs).
(Note phrasing matters: "sparse checkout" means a very specific action when referring to distributed version control in a manner that doesn't exist when using it to refer to centralised version control)
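Command names differ between Facebook's sparse.py, Google's NarrowHG and the experimental sparse extension that later landed in core Mercurial, so the following is only an illustrative sketch of the sparse-checkout idea, not a definitive recipe; the directory pattern is an example:
# enable the experimental sparse extension (core Mercurial) in .hg/hgrc
[extensions]
sparse =

# restrict the working directory to one subdirectory of the monorepo
$ hg debugsparse --include 'myproject/**'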

How do you handle documents, images (PSD), etc. in your repository?

This might be a noob question, but I'm really torn about adding documents to my repository, in this case Mercurial.
By documents I mean files that don't really go into your program, like PSD, doc, xls.
What's the best way to handle those files, or how do you handle your documents?
Take a look at the Largefiles extension that shipped with Mercurial 2.0 (with bugfixes since). It's designed to treat files that are binary and rarely updated in a different, more efficient way.
Basically it stores those files without trying to compute diffs between versions, and anybody cloning the repo just gets the versions they need, and not all the history. This leads to faster cloning / pulling, but updates may need a connection to the remote repository to read versions of files into the local cache.
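A minimal sketch of how that looks in practice; the file names and thresholds are just examples:
# enable the bundled extension in .hg/hgrc or ~/.hgrc
[extensions]
largefiles =

# add a binary asset as a largefile explicitly...
$ hg add --large mockups/homepage.psd

# ...or let size/pattern rules decide automatically
[largefiles]
minsize = 10            # files over 10 MB are added as largefiles
patterns = **.psd **.zip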
I toss them in my repository. It's nice to track changes of them and see old revisions anyway. I can see old revisions of a design document or see what the previous art was for an asset (maybe a graphic designer removed the alpha channel and he/she wasn't supposed to). Throw it in there. If it doesn't change, it's not taking up any more space with a good source control system than storing it outside of source control.

mercurial temporarily ignore versioned files

My question is essentially the same as here but applies to mercurial. I have a set of files that are under version control, and one save operation changes quite a lot of files. Some of the resulting changes are important for revision control, and some of the changes are just junk. I can "partition" off the junk into separate files. These junk files need to be part of a basic checkout in order for it to work, but their contents (and changes over time) aren't that important for revision control.
Right now I just tell all our developers not to commit these files, but we all forget and it creates a lot of extra baggage in the repository. I don't really like the svn solution proposed because there are quite a lot of files and I want a simple clone to just work without all this extra manual work, so I was wondering if mercurial has a better alternative. It's kind of like hg shelve but not quite, and kind of like ignore, but not quite. Is there some hg extension that allows for this? Can git do it?
Mercurial doesn't support this. The correct way to do it is to commit thefile.sample and then have your developers (or better, your deploy script) copy thefile.sample to thefile if thefile doesn't exist. That way anyone can update the example file, but there's no risk of them committing their local changes (say, their personal database connect string).
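A sketch of that approach, with hypothetical file names:
# commit only the template and ignore the generated copy
$ hg add settings.local.sample
$ echo "settings.local" >> .hgignore
$ hg commit -m "track only the sample settings file"

# in your build/deploy script, create the real file only if it is missing
[ -f settings.local ] || cp settings.local.sample settings.local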
Aha! So TortoiseHg's repository and global settings have an Auto Exclude List where you can define a list of files that will be unchecked by default when the status, commit, and shelve dialogs open. So they still show up, but the user has to check them explicitly for them to be included in a commit. The setting is stored in hgrc, but it's under the [tortoisehg] heading, so it's not supported by mercurial per se. Nevertheless, it fits my needs.
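For reference, the value ends up in hgrc looking roughly like this; the key recent TortoiseHg versions use for the Auto Exclude List is ciexclude, but double-check in your version's settings dialog (the file names are examples):
[tortoisehg]
ciexclude = generated/output.xml, build/cache.dat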
One solution to this is to use nested repository support (subrepositories in Mercurial, submodules in git), where the "junk" would be put in a different repository (to avoid cluttering the main repo), while still allowing you to check the whole thing out in a consistent manner (the right version of both repos in sync).
https://www.mercurial-scm.org/wiki/Subrepository?action=show&redirect=subrepos
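As a sketch, with a hypothetical URL: the mapping lives in a .hgsub file at the root of the main repository, and committing it turns the path into a subrepository.
# .hgsub in the main repo: "checkout path = source of the subrepo"
junk = https://hg.example.com/junk

# the junk/ directory must already contain a clone of that repository
$ hg add .hgsub
$ hg commit -m "track generated junk as a subrepository"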
In git, submodules are one solution to this issue, but they are not that great UI-wise. What I do instead is to keep two completely independent repositories and use the subtree merge strategy when I need to update the main repo with the junk repo: http://progit.org/book/ch6-7.html
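The subtree merge recipe from that chapter looks roughly like this; the remote name and branch are examples, and newer git also needs --allow-unrelated-histories on the first merge:
$ git remote add junk ../junk-repo
$ git fetch junk
$ git merge -s ours --no-commit junk/master
$ git read-tree --prefix=junk/ -u junk/master
$ git commit -m "Merge junk repository as a subtree"

# later, to pull in new junk changes:
$ git merge -s subtree junk/master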

Mercurial (Hg) and Binary Files

I am writing a set of Django apps and would like to use hg for version control. I would like each app to be independent of the others, so in each app there may be a directory for static media that contains images that I would not want under version control. In other words, the binary files would not all be in one central location.
I would like to find a way to clone the repository that would include copies of the image files. It would also be great if, when I did a merge and there were an image file in one repo but not the other, I got some sort of warning.
Currently I use a python script to find images and other binary files that are in one repo, but not the other. But a lot of people must face this problem, so there must be a more robust and elegant solution.
One other thing... for reasons I do not want to go into, usually one of my repos is on a Windows machine and the other is on Linux, so a cross-platform solution would be nice.
Since Mercurial 2.0 the extension largefiles is included in the main distribution. That extension keeps and manages large files outside of the "normal" repository in a way that you get the benefits of a DVCS but without the cost of exponential growth in size and processing time.
Other extensions that work along similar lines are SnapExtension and BigFilesExtension. However, those two are not distributed with Mercurial (you have to get them manually).
Mercurial can track any kind of file; for binary files, if something changes then the whole file gets replaced, not just the changes.
As for getting a warning if one repo doesn't contain a file: that's kind of the point of a DVCS, the repos are related but autonomous. You could always check and see which files were added during a sync or merge operation.
The current Mercurial book (by Bryan O'Sullivan) says that Mercurial stores diffs also for binary files. How efficient this is obviously depends on the nature of the changes to the binary files.

It is said that Mercurial's "hg clone" is very cheap... but it is 400MB on my hard drive? (on Mac OS X Snow Leopard)

I have a project I cloned over the network to the Mac hard drive (OS X Snow Leopard).
The project is about 1GB on the hard drive:
du -s
2073848 .
so when I run hg clone proj proj2 and then check:
MacBook-Pro ~/development $ du -s proj
2073848 proj
MacBook-Pro ~/development $ du -s proj2
894840 proj2
MacBook-Pro ~/development $ du -s
2397928 .
so the clone seems not so cheap... probably around 400MB... is that so? Also, the whole folder grew by only about 200MB, which is not the total of proj and proj2, by the way. Are some of the files hardlinks, and that's why the overlapping data is not counted twice?
When possible, Mercurial will use hardlinks for the repository data, but it will not use hardlinks for the working directory. Therefore, the only space it can save is that of the .hg folder.
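You can see the sharing for yourself; a quick sketch, using the clone names from the question:
# store files with a link count greater than 1 are shared between proj and proj2
$ find proj2/.hg/store -type f -links +1 | head

# when both trees are measured in a single invocation, du counts hardlinked data once
$ du -s proj proj2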
If you're using an editor that can break hardlinks, you can cp -al REPO REPOCLONE to use hardlinks on the entire directory, including the working directory, but be aware that it has some caveats. Quoting from the manual:
For efficiency, hardlinks are used for cloning whenever the source and destination are on the same filesystem (note this applies only to the repository data, not to the working directory). Some filesystems, such as AFS, implement hardlinking incorrectly, but do not report errors. In these cases, use the --pull option to avoid hardlinking.
In some cases, you can clone repositories and the working directory using full hardlinks with
$ cp -al REPO REPOCLONE
This is the fastest way to clone, but it is not always safe. The operation is not atomic (making sure REPO is not modified during the operation is up to you) and you have to make sure your editor breaks hardlinks (Emacs and most Linux Kernel tools do so). Also, this is not compatible with certain extensions that place their metadata under the .hg directory, such as mq.
Cheap is not the same as free. Cloning creates a new repository, that inherently has space costs - if you didn't want it to be located somewhere else on the disk, why would you bother cloning? However it is cheap in comparison, as you note, cloning your 1GB repo only adds ~200MB to the space taken up in the parent directory, because Mercurial is smart enough to identify information that doesn't need to be duplicated.
I think more generally, you need to stop worrying about the intricacies of how Mercurial (or any DVCS/VCS) works. It is a given that using version control takes more disk space, and takes time. As the amount of data and number of changes increases, the space and time demands increase too. What you're failing to realize is that these costs are far outweighed by the benefits of version control. The peace of mind that your work is safe, that you can't accidentally screw anything up, and the ability to look at your past work, along with the ease of distribution in the case of DVCS's are all far more valuable.
If your concerns really outweigh these benefits, you should just stick to a plain file system, and use FTP to share/distribute/commit the source code.
Update
Regarding romkyns' comment: you're downloading a large quantity of data. Downloading lots of data takes time, regardless of what it is. There is no way around that fact, and no way Mercurial or any other VCS can make it go faster.
The benefit of Mercurial and the distributed model, however, is that you only pay that cost once. Since all work is done locally, you can commit, revert, update, and the like to your heart's content without any network overhead, and only make network operations to pull and push changes, which is relatively rare. In a centralized VCS, you're forced to make network operations any time you want to do something with your source code.
Additionally, I just tried cloning mozilla-central myself to see how long it would take, and it took 5 minutes to download changesets and manifests, 20 minutes to download the file chunks, and then updating to default (which is not network limited) took 10 minutes. 35 minutes to get the entire codebase for Mozilla along with the entire history of revisions isn't that bad. And even on this massive project with ~500,000 files and ~62,000 changes the repository is only 15% larger than the working directory, which goes back to the original point of the question.
It is worth mentioning, though, that cloning a repository is not the best way to download source code. If you just want the codebase, you can get releases. The Mercurial web interface can also let you browse the codebase without downloading anything, and you can download complete archives of any revision via the archive links (bz2, zip, gz) at the top of each page. All of these options are faster than a full clone. Cloning the repository is only necessary when you want to actively develop the Mozilla codebase, not when you just want the files.
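For instance, with a typical hgweb server you can grab a single-revision snapshot without any clone at all (the URL below is just the usual hgweb pattern, adjust it to the server you're hitting), or export one from an existing local clone with hg archive:
# download a snapshot of one revision from an hgweb server
$ wget https://hg.example.org/repo/archive/tip.tar.bz2

# or export a clean snapshot of any revision from a local clone
$ hg archive -r tip /tmp/repo-snapshot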
When you can get 1TB of disk space for £60, 400MB is cheap (~ 2p).