Mercurial (Hg) and Binary Files - mercurial

I am writing a set of django apps and would like to use Hg for version control. I would like each app to be independent of the others so in each app there may be a directory for static media that contains images that I would not want under version control. In other words, the binary files would not all be in one central location
I would like to find a way to clone the repository that would include copies of the image files. It also would be great if when I did a merge, if there were an image file in one repo and not another, that there would be some sort of warning.
Currently I use a python script to find images and other binary files that are in one repo, but not the other. But a lot of people must face this problem, so there must be a more robust and elegant solution.
One one other thing...for reasons I do not want to go into, usually one of my repos is on a windows machine, and the other is on Linux. So a crossplatform solution would be nice.

Since Mercurial 2.0 the extension largefiles is now included in the main distribution. That extension keeps and manages large files outside of the "normal" repository in a way that you get the benefit of DCVS but without the benefit of exponential size and processing time growth.
Other extension that work along similar lines are SnapExtension and BigFilesExtension. However, those two are not distributed with Mercurial (you have to get them manually).

Mercurial can track any kind of file, for binary files if something changes then the whole file gets replaced not just the changes.
On the getting a warning if one repo doesn't contain a file, that's kind of the point of a DVCS is that the repos are related but are autonomous. You could always check and see what files were added during a synch or merge operation.

The current Mercurial book (by Bryan O'Sullivan) says, that Mercurial stores diffs also for binary files. How efficient this is, obviously depends on the nature of changes to binary files.

Related

How do you handle documents, images(psd), etc in you repository?

This might be a noob question. but I'm really torned between adding documents to my repository, in this case Mercurial.
by documents i meant, files that doesn't really go into your program. like PSD, doc, xls.
what's the best way to handle those files, or how do you handle your documents.
Take a look at the Largefiles extension that shipped with Mercurial 2.0 (with bugfixes since). It's designed to treat files that are binary and update rarely in a different, more efficient way.
Basically it stores those files without trying to compute diffs between versions, and anybody cloning the repo just gets the versions they need, and not all the history. This leads to faster cloning / pulling, but updates may need a connection to the remote repository to read versions of files into the local cache.
I toss them in my repository. It's nice to track changes of them and see old revisions anyway. I can see old revisions of a design document or see what the previous art was for an asset (maybe a graphic designer removed the alpha channel and he/she wasn't supposed to). Throw it in there. If it doesn't change, it's not taking up any more space with a good source control system than storing it outside of source control.

Mercurial: Lightweight copies/renames

From Mercurial wiki - GSoC Ideas 2010:
Project Ideas
Lightweight copies/renames
(very difficult - a successful student
will become an expert in Mercurial's
storage format and transmission
protocol)
Copies and renames currently are not
too efficient. Mercurial copies the
copied/renamed source file to the new
initial revision of the target file in
its internal history store. For
renames, this is especially
counter-intuitive, as renaming a large
file grows the store by the file's
size. It would be better if Mercurial
had some way of referring to the
existing revision from the new file,
while preserving backwards
compatbility and bounded I/O
guarantees for retrieving revisions.
See issue883 for discussion.
There's an mq from an old attempt at
this located here.
Sorry if this is an obvious question (I'm not good at English and programming). I'm wondering, what does the "Lightweight copies" mean?
Is it mean: when this feature is implemented, multiple files with same content (same hash value different file names) will be stored only once in repository (just like Git)?
Update:
Thanks everyone for your answers. One of Mercurial's developers - tonfa also answered this question in a comment of this answer:
caveman: When light-weight copies are
implemented, will two files with same
content (same hash value different
names) store only once in repository
(just like Git)?
tonfa: no, this feature isn't planned
(it would break other optimizations to
minimize disk access)
Right now, when you copy a file, a new file is created in the repository that contains a compressed snapshot of the file you just copied. The idea would be to set it up so the copy references the old file somehow and then has revlog entries based on that instead of having to have its own snapshot to base the revlog entries off of.
This will not be like how git works. Changing Mercurial to work that way would be really interesting, and not the easiest proposition.
I'd better say that copied/renamed file wouldn't store as twice space more as it now, but will just point to the same revision.
Not sure that this will be true for the files added separately with the same content. According to the description they will be treated as completely independent files and will occupy 2x space.

How to cleanly handle source code and data in a repository

I'm working on a collaborative scientific project that is made up by a handful of Python scripts (1M max) and a relatively large dataset (1.5 GB). The datasets are tightly linked to the python scripts since the datasets themselves are the science and the scripts are a simple interface to them.
I'm using Mercurial as my source control tool, but I am not clear on a good mechanism to define the repository. Logistically it makes sense to bundle these together so that by cloning the repository you'd get the entire package. On the other hand, I'm concerned about the source control tool dealing with large amounts of data.
Is there a clean mechanism to handle this?
If the data files change rarely and you normally need all of them anyway, then just add them to Mercurial and be done with it. All your clones will be 1.5 GB, but that is just the way it has to be with that amount of data.
if the data is binary data and changed often, then you might try to avoid downloading all the old data. One way to do this is to use a Subversion subrepository. You will have a .hgsub file with
data = [svn]http://svn.some.edu/me/ourdata
which tells Mercurial to make a svn checkout from the right-hand side URL and put the Subversion working copy into your Mercurial clone as data. Mercurial will maintain an additional file for you called .hgsubstate, in which it records the SVN revision number to checkout for any given Mercurial changeset. By using Subversion like this, you only end up with the latest version of the data on your machine, but Mercurial will know how to get older versions of the data when needed. Please see this guide to subrepositories if you go down this route.
There is an article on the official wiki about large binary files. But the proposition of #MartinGeisler is a really nice new alternative.
My first inclination is to separate the python scripts out into their own repository, but I really need more domain information to make the "right" call.
On the one hand, if new datasets will be created then you would want a core set of tools to be able to handle all of them, right? But I can also see how new datasets may introduce cases that the scripts may not have previously handled... although it seems like in an ideal world you would want scripts that are written in a general way so they can handle future data and existing datasets??

mercurial temporarily ignore versioned files

My question is essentially the same as here but applies to mercurial. I have a set of files that are under version control, and one save operation changes quite a lot of files. Some of the resulting changes are important for revision control, and some of the changes are just junk. I can "partition" off the junk into separate files. These junk files need to be part of a basic checkout in order for it to work, but their contents (and changes over time) aren't that important for revision control. Right now I just tell all our developers not to commit these files, but we all forget and it creates a lot of extra baggage in the repository. I don't really like the svn solution proposed because there are quite a lot of files and I want a simple clone to just work without all this extra manual work, so I was wondering if mercurial has a better alternative. It's kind of like hg shelve but not quite, and kind of like ignore, but not quite. Is there some hg extension that allows for this? Can git do it?
Mercurial doesn't support this. The correct way to do it is to commit thefile.sample and then have your developers (or better you deploy script) do a copy from thefile.sample to thefile if thefile doesn't exist. That way anyone can update the example file, but there's no risk of them committing their local changes (say their personal database connect string).
Aha! So TortoiseHG's repository and global settings have an Auto Exclude List where you can define a list of files that will be unchecked by default when the status, commit, and shelve dialogs open. So they still show up, but the user has to check them in order to actually do a commit. The setting is stored in hgrc, but it's under the [tortoisehg] heading so it's not supported by mercurial per se. Nevertheless, it fits my needs.
One solution to this is to use nested tree support (submodule in git), where the "junk" would be put in a different repository (to avoid cluttering the main repo), while enabling checking out the whole thing out in a consistent manner (right version of both repos in sync).
https://www.mercurial-scm.org/wiki/Subrepository?action=show&redirect=subrepos
In git, submodules are one solution to this issue - but they are not that great UI-wise. What I do instead is to keep two completely independent repositories, and using the subtree merge strategy when I need to update the main repo with the junk repo: http://progit.org/book/ch6-7.html

How can I retrieve only a subdirectory from a Mercurial Repository?

I'm trying to sell our group on using Mercurial as a source repository rather than VSS. In the process of updating our build scripts, I'm running into an issue trying to retrieve files from the Hg repository.
Our builds are automated with NAnt and currently work for local builds or builds from VSS (ie, pull the source as needed from VSS). I'm trying to update them to work with Mercurial as well.
Basically, when I'm working with single files, I don't have any issues since I can just use NAnt's 'get' task (after getting the appropriate revision hash) to retrieve the individual file.
The problem that I'm having is when I need to work with a directory (and subdirectories) of files that aren't at the root of the repository. I can't seem to figure out the proper commands to retrieve/copy a subdirectory from the repository to my 'working' directory for the builds. I've spent basically the whole afternoon trying to figure out how to do this with the mercurial executables (so I can use a NAnt 'exec' task), and have basically hit a wall so I figured I'd try posting here.
Can someone confirm whether this is possible, and provide some suggestions as to how I might be able to do this? I realize that Mercurial tracks changes by files and not directories, but it seems odd to me that this isn't available out of the box (from what I can tell).
If it's just not possible, the only workarounds I see are either maintaining NAnt fileset lists of expected files to work with (ugh!), or cloning the entire repository to a temporary directory and then just copying the files from that source as needed (this feels like a cludge to me).
I realize that I could simply create another repository for the directory that I want to work with, but I'd prefer to not go that route since I think that would increase the complexity of what I'm trying to do by a significant amount (I would have to apply this a large number of times for all of the different libraries that we build..).
Mercurial doesn't let you get only part of a repository. You have to get the whole tree. It's much more whole-repo focused than svn is.
You could try and segment your repository into multiple repos and manage them using the subrepos feature. Then you can pull the subdirectories independently.