Let's say I create a repository, add x files to it and commit. Say the size is a MB after the initial commit.
Is there any way to estimate how large the repository is going to be in one year's time?
If the lines of code have increased by 10%, will the repository have grown accordingly?
How does number of commits, branches, tags etc. factor into the repository size?
Will 10000 commits in the same year make the repository grow (noticeably) more than, say, 1000 commits?
Maybe my question is wrongly phrased?
Changes to a Mercurial repository are stored as either a complete file or as a compressed delta against the previous version:
https://www.mercurial-scm.org/wiki/FAQ#FAQ.2BAC8-TechnicalDetails.How_does_Mercurial_store_its_data.3F
Mercurial makes the decision about whether to store a complete file versus a delta based on the amount of changes made.
This means that it's not just adding lines of code that will increase the total size of a repository, but also:
The number of changes made to existing code.
The number of changes made to each file per commit.
The number of files that are added and subsequently deleted.
Mercurial retains all deleted files. You could add a 1GB file to your repository and then delete it; the number of lines hasn't increased, but because the file remains in the repository, the repository will be considerably larger.
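A quick way to see this (a hedged sketch; the file name and sizes are made up) is to check the history store before and after adding and deleting a large, incompressible file:
du -sh .hg/store                                  # size before
dd if=/dev/urandom of=big.bin bs=1M count=1024    # create a ~1GB throwaway file of random data
hg add big.bin
hg commit -m "add big binary"
hg remove big.bin
hg commit -m "remove big binary"
du -sh .hg/store                                  # still roughly 1GB larger: the deleted file stays in history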
To answer your questions in turn:
I imagine it's feasible to roughly estimate the size of a repository after x months, assuming that you maintain a steady rate of change to the repository in total (i.e. you add/remove/alter files at the same rate, changing roughly the same number of lines per commit).
Increasing the number of lines of code by 10% doesn't tell us how many lines were deleted/altered, so an increase in lines of code won't necessarily correspond to the same increase in repo size.
Tags don't affect Mercurial repo size by more than a handful of bytes. Nor do branches, until you start working on them, at which point they add the same overhead as working on the tip. Number of commits should be reasonably proportional to the repo size, assuming the same rate of change occurs.
Committing 10x as often probably won't increase the file size, as it is the rate of change that is the main influence on repo size, not number of commits.
Directly estimating the size in a year is obviously impossible, unless you have some idea of the number of commits and the final size of the work tree.
That said, git is pretty disk-space efficient. It absolutely never stores more than one copy of a given version of a file (this is internally represented as a blob), and older blobs are delta-compressed into packs. This means that it is very efficient at storing plain text, and very inefficient with large binary files. If your project is predominantly plain text, you almost certainly have nothing to worry about.
Branches and tags have essentially no effect on size. Sure, a branch's reflog could get up to a few KB, but that's nothing to worry about. Lightweight tags are pretty much just a stored SHA1, and annotated tags just add a tiny bit of metadata to that.
As for lines of code and number of commits, it's hard to say exactly. Generally the commits are a much bigger factor than the lines of code; you can have many, many versions of files all adding up (even represented as deltas), but the actual content only has to be stored once. This is backed up by the fact that work trees tend to be much smaller than the .git directory. For example, my clone of git.git has a 17MB work tree and a 39MB .git directory. Other projects I examined had similar ratios.
More commits of equal size would certainly make the repository grow more, but taking 1000 commits and splitting them up into 10000 (encompassing the same changes) wouldn't make the repository much bigger. The commit objects themselves are small; it's the differences in the files that take space. You might see an initial spike in size, as commits are only periodically delta-compressed, but once git gc --auto gets triggered, those commits will get compressed back down.
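One way to watch this in practice (a hedged sketch using standard git commands) is to compare the object counts before and after a repack:
git count-objects -v    # "count" is the number of loose objects, "size-pack" the packed size in KiB
git gc                  # repacks loose objects into a pack and recomputes deltas
git count-objects -v    # loose object count drops; most history now lives in packfiles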
The best generalization I can make is that a repository's .git directory will tend to grow at a rate proportional to the amount of delta per time, which in general should be proportional to work tree size and the rate at which people are modifying the project. This is of course so general as to be completely unhelpful, but there you are.
If you want to estimate, I'd just take some data over the first month or so and try to fit a curve.
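For example, a minimal sketch for collecting those data points (the log file name is made up) could just record the .git size on a schedule and fit a trend line later:
echo "$(date +%F) $(du -sk .git | cut -f1)" >> repo-size.log   # append "date size-in-KB", say, once a week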
Take a look at the GitBenchmarks page on the Git wiki, the sections "Repository size benchmarks" and "Other benchmarks and references" (taking into account when each benchmark was made and which versions it used), in particular the entry at the end of the page:
DVCS Round-up: One System to Rule Them All? -- Part 3 by Robert Fendt on Linux Developer Network, from 27-01-2009, contains the results of two synthetic benchmarks testing how a system behaves under stress (number of commits in the repository, or number of files committed).
The test system was a VM running Ubuntu 8.10, and the software versions used were SVK 2.0.2 (last is 2.2.3), darcs 2.1.0 (last is 2.4.4), monotone 0.42 (last is 0.48), Bazaar 1.10 (last is 2.2.1), Mercurial 1.1.2 (last is 1.6.4), and Git 1.6.1 (last is 1.7.3).
If you're worried about size mushrooming, go and clone some online projects and examine the size of their repositories. There are plenty of large projects to choose from, with branches, commits, etc. My experience is that Git and Mercurial are pretty good about keeping size down; the size is a reflection more of the files you put into them (and their size) than of any overhead.
Related
How does one optimize Mercurial repositories so that older revisions take the minimum required space?
I am aware that Mercurial does some magic already to group and compress existing commits. However, is there a way to enforce a manual run of this operation, so that as much space as possible is saved, disregarding speed? Is it possible to pack more revisions into one stream, change the compression algorithm -- anything to better compress old changesets?
I don't have a lot of large-sized repositories right now, but I do have some medium-to-big sized ones that could use some shrinkage in the early history.
Git appears to have git gc [--aggressive] which, to a git non-expert, seems to do some magic to cut down the cruft and compress the repo. It also has git repack, which seems to do much the same thing, albeit with some additional expert options. At least that's how it seems to me: changesets can be 'packed' differently.
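For reference, the git commands I'm thinking of look roughly like this (a hedged sketch; I'd run them on a throwaway clone first, since aggressive repacking can be slow):
git gc --aggressive --prune=now               # drop unreachable objects and recompute deltas more thoroughly
git repack -a -d -f --window=250 --depth=250  # repack everything with larger delta search windows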
Have you tried using the shrink-revlog.py extension in the mercurial/contrib directory? On very branchy repositories it may cut down the size of the manifest significantly (OTOH, it had zero effect for me on a nearly 1GB manifest in a repo converted from subversion).
For larger teams, having to pull/update/merge then commit each time makes no sense to me, specifically when the files that were changed by other developers have nothing to do with my changeset files.
i.e. I change file1.txt, and someone else changes file10.txt. Why must I merge on my computer before being allowed to push?
It makes pushing a big pain, as you have to constantly pull/update/merge if many developers are committing.
Also, it makes your changeset look much larger than it was, since it shows your merges as separate commits.
Mercurial makes you do this since its atomic unit isn't a file but a changeset, that is, a node containing a group of changes. Each changeset is an individual node in history and represents what that person did. This does result in you having to merge even if no common files were changed (which would be a simple automatic merge). These merge nodes are important since they are part of your repository's history and give Mercurial more ancestry information for future merges.
That said, there is an extension you can use that would clean up your history a bit (but won't resolve your issue with needing to pull before you push). It is called the rebase extension; it is shipped with Mercurial but disabled by default. It adds a new argument to pull that looks like:
hg pull --rebase
This pulls new changes and moves your local changesets linearly on top of them without creating a merge changeset. However, I would urge against using this, since you lose information about your repository by rewriting its history. Read this post for information about some issues that this may cause.
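For completeness, enabling the extension is just a configuration change (a minimal sketch to add to your .hgrc or Mercurial.ini):
[extensions]
rebase =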
Well, you could try using rebase, which will avoid the merge commits, but it is not without its own perils. You can also collapse to one step by doing "hg pull --update", rather than separate hg pull; hg update commands.
As for why you must merge on your computer: this is a direct consequence of mercurial being a distributed version control system. There is no central server which can be considered canonical (unless you create one by convention), so there is no other "place" where the merge could occur. You are the only one who can decide how the information in your repo should be combined with the information in the remote repo. The results of these decisions must be recorded, and that is the origin of the merge commit.
Also, in your example the merge would happen without user interaction since there are no conflicts (the same would be true with rebase), so I don't see why that is a problem.
Because having changes in disjoint files does not guarantee that they are independent.
When you pull in changes, even if they are in files that are untouched by your local changes, it can cause your local changes to stop working. E.g. an interface that you access from newly written code could have been changed.
This is why there is always a merge step in between, so that a human can review the changes, test for issues, and address them before integrating the changes back into the main repository. This step is very important, because skipping it risks blocking all those 50-100 colleagues (which is very expensive).
I would take Lasse’s advice and push less often. Merging isn’t a big deal if you only need to do it twice or thrice a day. Also maybe create smaller team repositories (or branches) that are merged with the main repository daily by a designated person.
At my organization, we'd like to switch from using CVS to Mercurial for a variety of reasons. We've done a lot of investigation when trying to determine how we should organize our Hg repositories based on what we have in our codebase and how we tend to work. We've come up with satisfactory answers to most of our questions, but there's one point that's stumping us a little, and it's driving me mad because the most convenient way to organize the repo for our workflow just seems like the wrong way to go from a conceptual standpoint. I'm trying to figure out whether our perception of how this is "supposed" to work is wrong, or whether we're just bumping up against a legitimate feature gap in the available tooling.
Primarily we maintain a medium-sized codebase consisting of a suite of applications that all get released in the same package. Conceptually you can divide our code into three categories:
Shared code
Application code for our primary suite (uses the shared code)
Miscellaneous small utilities that are infrequently maintained (uses the shared code)
This doesn't seem unusual to me, but I want to stress the point that we maintain the application code and the shared code at the same time and always want them to be bleeding edge with respect to each other. That is, we want all our application builds to always use the latest version and the same version of the shared code. We frequently add to or modify application code and shared code at the same time. Currently, the shared code is in one CVS module, and the applications are all in their own separate modules. The shared code and application modules are checked out such that the shared code gets built once and then linked into each application. We frequently do cvs commits that include changes across shared and application modules at once. We would really like to keep that ability.
I understand that commits in Hg are atomic within repositories -- that's fine but I'd like to be able to diff and commit to an application and a shared library at the "same time" (i.e. I don't care if it's really atomic but I don't want to have to manually do two separate diffs and two separate commit actions).
Conceptually, it seems like it would be correct to have one or a few repos for the shared code, and a separate repo for each application and each little utility program. This means you'd need to check out multiple repos for each build but that isn't a problem for us. The problem is there doesn't seem to be any tooling that lets you view updates or changes on multiple repos at once, or diff multiple repos at once and then sequentially commit them for you. This would be easy to script, but that wouldn't help developers who want to use various GUI frontends to complement the command line.
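For instance, a throwaway wrapper along these lines (the repository names are hypothetical) could show status or diffs across the peer repos in one pass, though it does nothing for developers who prefer GUI frontends:
for repo in shared app1 app2; do
    echo "== $repo =="
    (cd "$repo" && hg status)    # or: hg diff, hg commit
done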
It seems like in order to be able to commit across multiple codebases at once (from a user's perspective) and keep everything on the bleeding edge together, the only two solutions are:
Use a monolithic repo with EVERYTHING in it.
Create some subrepos but access/commit everything through a big monolithic "main" repo that contains all the subrepos to keep everything on the latest revisions (which doesn't seem any better than (1) to me).
It can't be that unusual to want to work with multiple "peer" repositories at the same time, even if the actions aren't truly atomic across all of them -- and yet I'm not finding tons of articles or posts clamoring for this ability.
In summary:
We would like to organize our code such that we can diff and commit application code and shared code at the same time from the user's perspective (they need not truly be atomic).
It seems like we should be putting application code and shared code in separate repositories.
Subrepositories tie parent repositories to specific revisions, which we do not want.
What am I missing here?
In my shop, we have many projects that are simply in separate repos, but the main application's repo has 2 other projects in it. One is a module that shares a significant amount of code with the main application, and the other is for database migrations for the application (it's even in a different language). I wanted related changes in both the application and the migrator to be committed together, inseparably. Altogether, all source files in this repo are between 10 and 11 MB.
So if putting everything in one repository is really what makes sense because you don't want to deal with subrepositories, then there's nothing wrong with putting everything in one repository. Mine is on the small side of medium, in my opinion. TortoiseHg's source is around 20 MB, OGRE is over 100 MB.
Without knowing more about your projects and their relationships, the impression I get is that a single repository would work just fine, and that you're not looking at this incorrectly.
If you change your mind, hg convert can help you extract projects into their own repository, maintaining the history of those files.
If the one-repository approach is not for you, then I think subrepos should be given a chance, as that is the only other method I know of for treating multiple repos cohesively that is supported in TortoiseHg (see the Recommendations section).
However, I'm not sure how you would deal with the inter-department access, given that it doesn't seem there is an established subset already shared with others.
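For reference, a subrepo setup is just a checked-in .hgsub file mapping paths to sources (the paths and URLs below are hypothetical); on every commit the parent repo also records the exact subrepo revisions in .hgsubstate, which is the revision-pinning behaviour mentioned in the question:
shared = https://hg.example.com/shared
utils/reportgen = https://hg.example.com/reportgen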
From Mercurial wiki - GSoC Ideas 2010:
Project Ideas
Lightweight copies/renames
(very difficult - a successful student will become an expert in Mercurial's storage format and transmission protocol)
Copies and renames currently are not too efficient. Mercurial copies the copied/renamed source file to the new initial revision of the target file in its internal history store. For renames, this is especially counter-intuitive, as renaming a large file grows the store by the file's size. It would be better if Mercurial had some way of referring to the existing revision from the new file, while preserving backwards compatibility and bounded I/O guarantees for retrieving revisions. See issue883 for discussion. There's an mq from an old attempt at this located here.
Sorry if this is an obvious question (I'm not good at English or programming). I'm wondering, what does "Lightweight copies" mean?
Does it mean that when this feature is implemented, multiple files with the same content (same hash value, different file names) will be stored only once in the repository (just like Git)?
Update:
Thanks everyone for your answers. One of Mercurial's developers, tonfa, also answered this question in a comment on this answer:
caveman: When light-weight copies are implemented, will two files with the same content (same hash value, different names) be stored only once in the repository (just like Git)?
tonfa: no, this feature isn't planned (it would break other optimizations to minimize disk access)
Right now, when you copy a file, a new file is created in the repository that contains a compressed snapshot of the file you just copied. The idea would be to set things up so the copy references the old file somehow and then has revlog entries based on that, instead of needing its own snapshot to base the revlog entries on.
This will not be like how git works. Changing Mercurial to work that way would be really interesting, and not the easiest proposition.
I'd rather say that a copied/renamed file wouldn't take twice the space, as it does now, but would just point to the same revision.
I'm not sure this will be true for files added separately with the same content. According to the description, they will be treated as completely independent files and will occupy twice the space.
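To see the current behaviour described above (a hedged sketch; the file names are made up), copy a large tracked file and watch the store grow by roughly the size of a second compressed snapshot:
du -sh .hg/store
hg copy big-asset.bin big-asset-copy.bin
hg commit -m "copy a large file"
du -sh .hg/store    # grows by roughly the compressed size of the copied file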
I have a project I cloned over the network to the Mac hard drive (OS X Snow Leopard).
The project is about 1GB on the hard drive
du -s
2073848 .
so when I hg clone proj proj2
then when I
MacBook-Pro ~/development $ du -s proj
2073848 proj
MacBook-Pro ~/development $ du -s proj2
894840 proj2
MacBook-Pro ~/development $ du -s
2397928 .
so the clone seems not so cheap... probably around 400MB... is that so? Also, the whole folder only grew by about 200MB, which is not the total of proj and proj2, by the way. Are some files hardlinked and others not, so that the overlap isn't counted twice?
When possible, Mercurial will use hardlinks for the repository data, but it will not use hardlinks for the working directory. Therefore, the only space it can save is that of the .hg folder.
If you're using an editor that can break hardlinks, you can cp -al REPO REPOCLONE to use hardlinks on the entire directory, including the working directory, but be aware that it has some caveats. Quoting from the manual:
For efficiency, hardlinks are used for cloning whenever the source and destination are on the same filesystem (note this applies only to the repository data, not to the working directory). Some filesystems, such as AFS, implement hardlinking incorrectly, but do not report errors. In these cases, use the --pull option to avoid hardlinking.
In some cases, you can clone repositories and the working directory using full hardlinks with
$ cp -al REPO REPOCLONE
This is the fastest way to clone, but it is not always safe. The operation is not atomic (making sure REPO is not modified during the operation is up to you) and you have to make sure your editor breaks hardlinks (Emacs and most Linux Kernel tools do so). Also, this is not compatible with certain extensions that place their metadata under the .hg directory, such as mq.
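If you want to check what is actually hardlinked, comparing a store file's inode in both clones is one way (a sketch; 00manifest.i is just one of the files under .hg/store):
ls -li proj/.hg/store/00manifest.i proj2/.hg/store/00manifest.i
# identical inode numbers and a link count of 2 mean the file is shared between the clones, not copied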
Cheap is not the same as free. Cloning creates a new repository, which inherently has space costs - if you didn't want it to be located somewhere else on the disk, why would you bother cloning? However, it is cheap in comparison: as you note, cloning your 1GB repo only adds ~200MB to the space taken up in the parent directory, because Mercurial is smart enough to identify information that doesn't need to be duplicated.
I think more generally, you need to stop worrying about the intricacies of how Mercurial (or any DVCS/VCS) works. It is a given that using version control takes more disk space, and takes time. As the amount of data and number of changes increases, the space and time demands increase too. What you're failing to realize is that these costs are far outweighed by the benefits of version control. The peace of mind that your work is safe, that you can't accidentally screw anything up, and the ability to look at your past work, along with the ease of distribution in the case of DVCS's are all far more valuable.
If your concerns really outweigh these benefits, you should just stick to a plain file system, and use FTP to share/distribute/commit the source code.
Update
Regarding romkyns' comment: You're downloading a large quantity of data. Downloading lots of data takes time, regardless of what it is. There is no way around that fact, and no way Mercurial nor any other VCS can make it go faster.
The benefit of Mercurial and the distributed model, however, is that you only pay that cost once. Since all work is done locally, you can commit, revert, update, and the like to your heart's content without any network overhead, and only perform network operations to pull and push changes, which is relatively rare. In a centralized VCS, you're forced to perform network operations any time you want to do something with your source code.
Additionally, I just tried cloning mozilla-central myself to see how long it would take, and it took 5 minutes to download changesets and manifests, 20 minutes to download the file chunks, and then updating to default (which is not network limited) took 10 minutes. 35 minutes to get the entire codebase for Mozilla along with the entire history of revisions isn't that bad. And even on this massive project with ~500,000 files and ~62,000 changes the repository is only 15% larger than the working directory, which goes back to the original point of the question.
It is worth mentioning, though, that cloning a repository is not the best way to download source code. If you just want the codebase, you can get releases. The Mercurial Web Interface can also let you browse the codebase without downloading anything, and you can download complete archives of any revision via the archive links (bz2, zip, gz) at the top of each page. All of these options are faster than a full clone. Cloning the repository is only necessary when you want to actively develop the Mozilla codebase, not when you just want the files.
When you can get 1TB of disk space for £60, 400MB is cheap (~ 2p).