How to cleanly handle source code and data in a repository - Mercurial

I'm working on a collaborative scientific project that is made up of a handful of Python scripts (1 MB at most) and a relatively large dataset (1.5 GB). The datasets are tightly linked to the Python scripts, since the datasets themselves are the science and the scripts are a simple interface to them.
I'm using Mercurial as my source control tool, but I am not clear on a good mechanism to define the repository. Logistically it makes sense to bundle these together so that by cloning the repository you'd get the entire package. On the other hand, I'm concerned about the source control tool dealing with large amounts of data.
Is there a clean mechanism to handle this?

If the data files change rarely and you normally need all of them anyway, then just add them to Mercurial and be done with it. All your clones will be 1.5 GB, but that is just the way it has to be with that amount of data.
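If you go that route, a minimal sketch (the data/ directory name is just a placeholder):

    # track the dataset right alongside the scripts
    hg add data/
    hg commit -m "Add 1.5 GB dataset alongside the analysis scripts"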
If the data is binary and changes often, then you might try to avoid downloading all the old data. One way to do this is to use a Subversion subrepository. You will have a .hgsub file with
data = [svn]http://svn.some.edu/me/ourdata
which tells Mercurial to make an SVN checkout from the right-hand side URL and put the Subversion working copy into your Mercurial clone as data. Mercurial will maintain an additional file for you called .hgsubstate, in which it records the SVN revision number to check out for any given Mercurial changeset. By using Subversion like this, you only end up with the latest version of the data on your machine, but Mercurial will know how to get older versions of the data when needed. Please see this guide to subrepositories if you go down this route.
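A minimal sketch of wiring that up, using the placeholder URL above (the Subversion working copy has to exist before the first commit that records it):

    # from the root of the Mercurial clone, with an svn client installed
    svn checkout http://svn.some.edu/me/ourdata data
    echo "data = [svn]http://svn.some.edu/me/ourdata" > .hgsub
    hg add .hgsub
    hg commit -m "Add dataset as a Subversion subrepository"
    # Mercurial now creates and tracks .hgsubstate automatically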

There is an article on the official wiki about handling large binary files. But the proposal from Martin Geisler above is a really nice new alternative.

My first inclination is to separate the Python scripts out into their own repository, but I really need more domain information to make the "right" call.
On the one hand, if new datasets will be created, then you would want a core set of tools to be able to handle all of them, right? But I can also see how new datasets may introduce cases that the scripts may not have previously handled... although it seems like in an ideal world you would want scripts written in a general way, so they can handle both future data and existing datasets?

Related

How do you handle documents, images (PSD), etc. in your repository?

This might be a noob question, but I'm really torn about adding documents to my repository, in this case Mercurial.
By documents I mean files that don't really go into your program, like PSD, DOC, and XLS files.
What's the best way to handle those files, or how do you handle your documents?
Take a look at the Largefiles extension that shipped with Mercurial 2.0 (with bugfixes since). It's designed to treat files that are binary and rarely updated in a different, more efficient way.
Basically it stores those files without trying to compute diffs between versions, and anybody cloning the repo just gets the versions they need, and not all the history. This leads to faster cloning / pulling, but updates may need a connection to the remote repository to read versions of files into the local cache.
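A minimal sketch of using it, assuming Mercurial 2.0 or later (the filename is a placeholder). First enable the bundled extension in your .hgrc (Mercurial.ini on Windows):

    [extensions]
    largefiles =

Then add big binaries with the --large flag so they are stored as largefiles rather than as ordinary revlog entries:

    hg add --large artwork/mockup.psd
    hg commit -m "Track the PSD mockup as a largefile"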
I toss them in my repository. It's nice to track changes of them and see old revisions anyway. I can see old revisions of a design document or see what the previous art was for an asset (maybe a graphic designer removed the alpha channel and he/she wasn't supposed to). Throw it in there. If it doesn't change, it's not taking up any more space with a good source control system than storing it outside of source control.

Mercurial common/local files

I've been an hg user for a couple of years and I'm happy about that!
I have to start a kind of project I've never done before.
The idea is to develop a piece of software with a batch mode and a GUI.
So there will be common sources to both batch and GUI mode but each one will also contain specific sources.
And, basically, I would like my coworkers to be able to clone the GUI version, work on it and commit changes.
Then, I'd like to be able to merge their changes to the common files into the batch version.
How can I deal with that?
I've been reading a bit on this topic, and I would really appreciate any help!
Thank you.
binoua
As the creator of subrepos, I strongly recommend against using subrepos for this.
While subrepos can be used for breaking up a larger project into smaller pieces, the benefits of this are often outweighed by the additional complexity and fragility that subrepos involve. Unless your project is going to be really large, you should just stick to one project repo for simplicity.
So what are subrepos for, then? Subrepos are best for managing collections of otherwise independent projects. For instance, let's say you're building a large GUI tool that wraps around an existing SCM. I'd recommend you structure it something like this:
scm-gui-build/ <- master build repo with subrepos:
scm-gui/ <- independent repo for all the code in your GUI tool
scm/ <- repo for the third-party SCM itself
gui-toolkit/ <- a third-party GUI toolkit you depend on
extensions/ <- some third-party extension to bundle
extension-foo/
Here you do all your work in a plain old repo (scm-gui), but use a master repo at a higher level to manage building/packaging/versioning/tagging/releasing the whole collection. The master scm-gui-build repo is just a thin wrapper around other normal repos, which means that if something breaks (like one of the repo's URLs goes offline) you can keep working in your project without problems.
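For that layout, the master repo's .hgsub might look roughly like this (the URLs are hypothetical):

    scm-gui = https://hg.example.org/scm-gui
    scm = https://hg.example.org/scm
    gui-toolkit = https://hg.example.org/gui-toolkit
    extensions/extension-foo = https://hg.example.org/extension-foo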
(see also: https://www.mercurial-scm.org/wiki/Subrepository#Recommendations)

Monolithic vs multiple Mercurial repos for modules within a suite of related applications

At my organization, we'd like to switch from using CVS to Mercurial for a variety of reasons. We've done a lot of investigation when trying to determine how we should organize our Hg repositories based on what we have in our codebase and how we tend to work. We've come up with satisfactory answers to most of our questions, but there's one point that's stumping us a little, and it's driving me mad because the most convenient way to organize the repo for our workflow just seems like the wrong way to go from a conceptual standpoint. I'm trying to figure out whether our perception of how this is "supposed" to work is wrong, or whether we're just bumping up against a legitimate feature gap in the available tooling.
Primarily we maintain a medium-sized codebase consisting of a suite of applications that all get released in the same package. Conceptually you can divide our code into three categories:
Shared code
Application code for our primary suite (uses the shared code)
Miscellaneous small utilities that are infrequently maintained (uses the shared code)
This doesn't seem unusual to me, but I want to stress the point that we maintain the application code and the shared code at the same time and always want them to be bleeding edge with respect to each other. That is, we want all our application builds to always use the latest version and the same version of the shared code. We frequently add to or modify application code and shared code at the same time. Currently, the shared code is in one CVS module, and the applications are all in their own separate modules. The shared code and application modules are checked out such that the shared code gets built once and then linked into each application. We frequently do cvs commits that include changes across shared and application modules at once. We would really like to keep that ability.
I understand that commits in Hg are atomic within repositories -- that's fine but I'd like to be able to diff and commit to an application and a shared library at the "same time" (i.e. I don't care if it's really atomic but I don't want to have to manually do two separate diffs and two separate commit actions).
Conceptually, it seems like it would be correct to have one or a few repos for the shared code, and a separate repo for each application and each little utility program. This means you'd need to check out multiple repos for each build but that isn't a problem for us. The problem is there doesn't seem to be any tooling that lets you view updates or changes on multiple repos at once, or diff multiple repos at once and then sequentially commit them for you. This would be easy to script, but that wouldn't help developers who want to use various GUI frontends to complement the command line.
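For reference, the kind of script meant here might look like this (the repository list and commit-message handling are made up):

    #!/bin/sh
    # Diff, then commit, a set of peer repositories in sequence.
    # Usage: ./commit-all.sh "commit message"
    REPOS="shared app1 app2"
    for r in $REPOS; do
        (cd "$r" && hg diff)
    done
    for r in $REPOS; do
        # hg commit exits non-zero when there is nothing to commit,
        # so a real script would check hg status first
        (cd "$r" && hg commit -m "$1")
    done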
It seems like in order to be able to commit across multiple codebases at once (from a user's perspective) and keep everything on the bleeding edge together, the only two solutions are:
Use a monolithic repo with EVERYTHING in it.
Create some subrepos but access/commit everything through a big monolithic "main" repo that contains all the subrepos to keep everything on the latest revisions (which doesn't seem any better than (1) to me).
It can't be that unusual to want to work with multiple "peer" repositories at the same time, even if the actions aren't truly atomic across all of them -- and yet I'm not finding tons of articles or posts clamoring for this ability.
In summary:
We would like to organize our code such that we can diff and commit application code and shared code at the same time from the user's perspective (they need not truly be atomic).
It seems like we should be putting application code and shared code in separate repositories.
Subrepositories tie parent repositories to specific revisions, which we do not want.
What am I missing here?
In my shop, we have many projects that are simply in separate repos, but the main application's repo has 2 other projects in it. One is a module that shares a significant amount of code with the main application, and the other is for database migrations for the application (it's even in a different language). I wanted related changes in both the application and the migrator to be committed together, inseparably. Altogether, all source files in this repo are between 10 and 11 MB.
So if putting everything in one repository is really what makes sense because you don't want to deal with subrepositories, then there's nothing wrong with putting everything in one repository. Mine is on the small side of medium, in my opinion. TortoiseHg's source is around 20 MB, and OGRE's is over 100 MB.
Without knowing more about your projects and their relationships, the impression I get is that a single repository would work just fine, and that you're not looking at this incorrectly.
If you change your mind, hg convert can help you extract projects into their own repository, maintaining the history of those files.
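A hedged sketch of that extraction (the paths and filemap contents are placeholders; the bundled convert extension must be enabled via "convert =" in the [extensions] section of your .hgrc):

    # filemap.txt -- keep one directory and make it the new repo's root:
    #   include utils/foo
    #   rename utils/foo .
    hg convert --filemap filemap.txt main-repo foo-repo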
If the one-repository approach is not for you, then I think subrepos should be given a chance, as that is the only other method I know of for treating multiple repos cohesively that is supported in TortoiseHg (see the Recommendations section).
However, I'm not sure how you would deal with the inter-department access, given that it doesn't seem there is an established subset already shared with others.

Mercurial (Hg) and Binary Files

I am writing a set of Django apps and would like to use Hg for version control. I would like each app to be independent of the others, so in each app there may be a directory for static media that contains images that I would not want under version control. In other words, the binary files would not all be in one central location.
I would like to find a way to clone the repository that would include copies of the image files. It also would be great if when I did a merge, if there were an image file in one repo and not another, that there would be some sort of warning.
Currently I use a python script to find images and other binary files that are in one repo, but not the other. But a lot of people must face this problem, so there must be a more robust and elegant solution.
One other thing... for reasons I do not want to go into, usually one of my repos is on a Windows machine, and the other is on Linux. So a cross-platform solution would be nice.
Since Mercurial 2.0, the largefiles extension has been included in the main distribution. That extension keeps and manages large files outside of the "normal" repository in a way that gives you the benefits of a DVCS but without the corresponding growth in size and processing time.
Other extensions that work along similar lines are SnapExtension and BigFilesExtension. However, those two are not distributed with Mercurial (you have to get them manually).
Mercurial can track any kind of file; for binary files, if something changes then the whole file gets replaced, not just the changes.
As for getting a warning if one repo doesn't contain a file: the point of a DVCS is that the repos are related but autonomous. You could always check which files were added during a sync or merge operation.
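A sketch of that kind of check (the peer repository path is a placeholder):

    # preview which files an incoming pull would touch
    hg incoming --stat ../other-repo
    # after pulling or merging, list files added by the latest changeset
    hg status --added --rev 'tip^' --rev tip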
The current Mercurial book (by Bryan O'Sullivan) says that Mercurial stores diffs for binary files as well. How efficient that is obviously depends on the nature of the changes to the binary files.

How can I retrieve only a subdirectory from a Mercurial Repository?

I'm trying to sell our group on using Mercurial as a source repository rather than VSS. In the process of updating our build scripts, I'm running into an issue trying to retrieve files from the Hg repository.
Our builds are automated with NAnt and currently work for local builds or builds from VSS (ie, pull the source as needed from VSS). I'm trying to update them to work with Mercurial as well.
Basically, when I'm working with single files, I don't have any issues since I can just use NAnt's 'get' task (after getting the appropriate revision hash) to retrieve the individual file.
The problem that I'm having is when I need to work with a directory (and subdirectories) of files that aren't at the root of the repository. I can't seem to figure out the proper commands to retrieve/copy a subdirectory from the repository to my 'working' directory for the builds. I've spent basically the whole afternoon trying to figure out how to do this with the mercurial executables (so I can use a NAnt 'exec' task), and have basically hit a wall so I figured I'd try posting here.
Can someone confirm whether this is possible, and provide some suggestions as to how I might be able to do this? I realize that Mercurial tracks changes by files and not directories, but it seems odd to me that this isn't available out of the box (from what I can tell).
If it's just not possible, the only workarounds I see are either maintaining NAnt fileset lists of expected files to work with (ugh!), or cloning the entire repository to a temporary directory and then just copying the files from that source as needed (this feels like a kludge to me).
I realize that I could simply create another repository for the directory that I want to work with, but I'd prefer to not go that route since I think that would increase the complexity of what I'm trying to do by a significant amount (I would have to apply this a large number of times for all of the different libraries that we build..).
Mercurial doesn't let you get only part of a repository. You have to get the whole tree. It's much more whole-repo focused than svn is.
You could try segmenting your repository into multiple repos and managing them using the subrepos feature. Then you can pull the subdirectories independently.
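If subrepos are overkill, one way to script the clone-and-copy workaround from the question a bit more cleanly (the paths and revision are placeholders) is to keep one full local clone and export just the needed subdirectory from it with hg archive:

    # from inside an up-to-date full clone
    hg pull -u
    hg archive --rev tip -I "libs/mylib/**" ../build-src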