How does Mercurial compress files in a repository?

I'm seeing that mercurial efficiently compresses the files in repository
(repo/.hg/store/data)
Does anybody know what kind of compression is used for repository files?
Thanks.

There are two levels of compression in Mercurial repositories: delta storage, and zlib compression.
In addition, various other parts also employ compression. For example, bundles can be compressed with either gzip or bzip2, as can archive tarballs, but I don't think you were asking about these.

You might find Mercurial author Matt Mackall's paper on the revlog format interesting.

Initial versions of files are compressed using deflate (same algorithm as zip), but for updated files, Mercurial stores only a (binary) diff against a previous version.
It also tries to do the right thing: When a deflated JPEG turns out bigger than the original, it will not store it "compressed", for example.
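To make the two ideas above concrete, here is a minimal Python sketch, not Mercurial's actual revlog code. It assumes `difflib` as a stand-in for Mercurial's own binary diff, and the one-byte markers `b"z"` and `b"u"` as well as the function names are hypothetical, chosen only for illustration.

```python
import zlib
from difflib import SequenceMatcher


def make_delta(old: bytes, new: bytes) -> list:
    """Return (start, end, replacement) edits that turn old into new."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    return [
        (i1, i2, new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # unchanged regions are not stored
    ]


def apply_delta(old: bytes, delta: list) -> bytes:
    """Rebuild the new version from the old version plus a delta."""
    out = bytearray()
    pos = 0
    for start, end, replacement in delta:
        out += old[pos:start]  # copy the unchanged bytes before this edit
        out += replacement     # then the bytes that replace old[start:end]
        pos = end
    out += old[pos:]
    return bytes(out)


def pack(data: bytes) -> bytes:
    """Deflate data, but fall back to raw storage if that is not smaller."""
    compressed = zlib.compress(data)
    if len(compressed) < len(data):
        return b"z" + compressed  # hypothetical marker: zlib-compressed
    return b"u" + data            # hypothetical marker: stored uncompressed


def unpack(blob: bytes) -> bytes:
    """Reverse pack() by looking at the leading marker byte."""
    marker, payload = blob[:1], blob[1:]
    return zlib.decompress(payload) if marker == b"z" else payload
```

Storing a new revision would then amount to something like `pack(repr(make_delta(old, new)).encode())` in this toy model, while already-compressed data such as a JPEG would take the uncompressed branch of `pack`.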

Related

Optimizing Mercurial repository size

How does one optimize Mercurial repositories so that older revisions take the minimum required space?
I am aware that Mercurial does some magic already to group and compress existing commits. However, is there a way to enforce a manual run of this operation, so that as much space as possible is saved, disregarding speed? Is it possible to pack as many repos in one stream, change compression algorithm -- anything to better compress old changesets?
I don't have a lot of large-sized repositories right now, but I do have some medium-to-big sized ones that could use some shrinkage in the early history.
Git appears to have git gc [--aggressive] which, to a git non-expert, seems to do some magic cutting down the cruft and compressing the repos. It also has git repack which also seems to be doing the same thing, albeit with some additional expert options. At least that's how it seems to me: change sets can be 'packed' differently.
Have you tried using the shrink-revlog.py extension in the mercurial/contrib directory? On very branchy repositories it may cut down the size of the manifest significantly (OTOH, it had zero effect for me on a nearly 1GB manifest in a repo converted from subversion).

How to store my binary assets with Mercurial?

I'm starting a game development project and my team and I will be using Mercurial for version control and I was wondering what a more appropriate way to store the binary assets for the game would be. Basically, I have 2 options:
1. Mercurial 2.1 has the largefiles extension, but I don't know too much about it. It seems like it'll solve the 'repository bloat' problem but doesn't solve binary merge conflict issues.
2. Keeping the binary assets in a SVN checkout, as a subrepo. This way we can lock files for editing and avoid merge conflicts, but I would really like to avoid having to use 2 version control systems (especially one that I don't really like that much).
Any insight/advice or other options I haven't thought of?
As you surmised, largefiles will do what you need. For merging binaries you can set up a merge tool if one exists for your file type. Something like this:
[merge-tools]
# "myimgmerge" is a placeholder tool name; point executable at a real program
myimgmerge.executable = SOME-PROGRAM-THAT-WILL-MERGE-IMAGES-FOR-YOU
myimgmerge.priority = 100
myimgmerge.premerge = False
myimgmerge.args = $local $other $base -o $output
[merge-patterns]
**.jpg = myimgmerge
# never attempt to merge executables
**.exe = internal:fail
In general though, merging non-text things will always be a pain using a source control tool. Digital Asset Management applications exist to make that less painful, but they're not decentralized or very pleasant to work with.
You're correct that the largefiles extension will avoid bloating the repository. What happens is that you only download the large files needed by the revision you're checking out. So if you have a 50 MB file and it has been edited radically 10 times, then the versions might take up 500 MB on the server. However, when you do hg update you only download the 50 MB version you need for that revision.
You're also correct that the largefiles extension doesn't help with merges. In fact, it skips the merge step completely and only prompts you like this:
largefile <some large file> has a merge conflict
keep (l)ocal or take (o)ther?
You don't get a chance to use the normal merge machinery.
To do locking, you could use the lock extension I wrote for a client. They wanted it for their documentation department, where people work with files that cannot be easily merged. It basically turns Mercurial into a centralized system similar to Subversion: locks are stored in a file in the central repository, and clients contact this repository when they run hg locks and before hg commit.

How do you handle documents, images (PSD), etc. in your repository?

This might be a noob question, but I'm really torn about adding documents to my repository, in this case Mercurial.
By documents I mean files that don't really go into your program, like PSD, doc, xls.
What's the best way to handle those files, or how do you handle your documents?
Take a look at the Largefiles extension that shipped with Mercurial 2.0 (with bugfixes since). It's designed to treat files that are binary and update rarely in a different, more efficient way.
Basically it stores those files without trying to compute diffs between versions, and anybody cloning the repo just gets the versions they need, and not all the history. This leads to faster cloning / pulling, but updates may need a connection to the remote repository to read versions of files into the local cache.
I toss them in my repository. It's nice to track changes of them and see old revisions anyway. I can see old revisions of a design document or see what the previous art was for an asset (maybe a graphic designer removed the alpha channel and he/she wasn't supposed to). Throw it in there. If it doesn't change, it's not taking up any more space with a good source control system than storing it outside of source control.

Mercurial (Hg) and Binary Files

I am writing a set of Django apps and would like to use Hg for version control. I would like each app to be independent of the others, so in each app there may be a directory for static media that contains images that I would not want under version control. In other words, the binary files would not all be in one central location.
I would like to find a way to clone the repository that would include copies of the image files. It also would be great if when I did a merge, if there were an image file in one repo and not another, that there would be some sort of warning.
Currently I use a python script to find images and other binary files that are in one repo, but not the other. But a lot of people must face this problem, so there must be a more robust and elegant solution.
One other thing: for reasons I do not want to go into, usually one of my repos is on a Windows machine, and the other is on Linux. So a cross-platform solution would be nice.
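For reference, the kind of comparison script described in the question can be sketched as follows. This is only an illustration under assumptions: the extension list, the function names, and the choice to skip the `.hg` directory are all hypothetical and not taken from the asker's actual script. Paths are normalized to forward slashes so that results from Windows and Linux checkouts compare equal.

```python
import os

# Assumed set of extensions that count as "binary" for this sketch.
BINARY_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".ico"}


def binary_files(root: str) -> set:
    """Collect relative paths of binary files below root."""
    found = set()
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip Mercurial's own metadata directory.
        dirnames[:] = [d for d in dirnames if d != ".hg"]
        for name in filenames:
            if os.path.splitext(name)[1].lower() in BINARY_EXTENSIONS:
                rel = os.path.relpath(os.path.join(dirpath, name), root)
                # Use "/" everywhere so both platforms yield the same keys.
                found.add(rel.replace(os.sep, "/"))
    return found


def missing_between(repo_a: str, repo_b: str):
    """Return (files only in repo_a, files only in repo_b), each sorted."""
    a, b = binary_files(repo_a), binary_files(repo_b)
    return sorted(a - b), sorted(b - a)
```

Calling `missing_between("path/to/repo_a", "path/to/repo_b")` returns two lists that could be printed as warnings before a merge.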
Since Mercurial 2.0, the largefiles extension is included in the main distribution. It keeps and manages large files outside of the "normal" repository, so that you get the benefits of a DVCS without the repository size and processing time growing out of control.
Other extensions that work along similar lines are SnapExtension and BigFilesExtension. However, those two are not distributed with Mercurial (you have to get them manually).
Mercurial can track any kind of file. For binary files, if something changes then the whole file gets replaced, not just the changes.
As for getting a warning if one repo doesn't contain a file: the point of a DVCS is that the repos are related but autonomous. You could always check which files were added during a sync or merge operation.
The current Mercurial book (by Bryan O'Sullivan) says that Mercurial stores diffs for binary files as well. How efficient this is obviously depends on the nature of the changes to the binary files.

cvs2svn conversion?

I converted my CVS repository into SVN repository.
It worked great, but one problem occurred.
I converted using a dumpfile, and the command was:
cvs2svn --encoding=( ) --sort=(PATH TO sort.exe) --default-eol=native --dumpfile=PATH\name.svn_dump --svnadmin=(PATH TO SVN ADMIN) (PATH TO REP)
loading the dump file:
svnadmin load PATH (to repository location) < PATH\name.svn_dump
Now some binary files, which in CVS are marked with -kb, have been corrupted. If I open both versions of a file in WinMerge, they look the same when "Ignore Carriage Return Differences" is checked.
What seems to be the problem?
Did I miss something during the conversion?
thanks,
Oded.
Since you used the --default-eol=native option, any binary files that were not marked as binary in CVS will be stored to Subversion in "native" EOL encoding and will typically have problems like you described when checked out of Subversion. So, are you really sure that the files in question were marked as binary in CVS?
Please also note that there is a more proprietary CVS-like program called CVSNT whose repository format is different in several details to that of CVS. For example, it stores file modes in a way that is incompatible with CVS. cvs2svn does not support converting CVSNT repositories. If your repository was ever touched by a CVSNT client, you might have difficulties with your conversion. In that case, follow the tips in the above link and also consider setting the files in question to binary explicitly, for example using cvs2svn's --auto-props option.
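To confirm that a file was damaged by EOL translation rather than by something else, a small check along the following lines may help. It is a rough sketch that approximates what comparing with carriage return differences ignored does; the function names are hypothetical and it is not part of cvs2svn or WinMerge.

```python
def normalize_eol(data: bytes) -> bytes:
    """Collapse CRLF pairs to LF so only other differences remain."""
    return data.replace(b"\r\n", b"\n")


def damaged_by_eol_translation(original: bytes, converted: bytes) -> bool:
    """True if the two byte strings differ only in CRLF versus LF."""
    return (
        original != converted
        and normalize_eol(original) == normalize_eol(converted)
    )
```

Reading the CVS and Subversion copies of a file in binary mode (`open(path, "rb").read()`) and passing both to `damaged_by_eol_translation` indicates whether line-ending conversion alone explains the difference.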