My TortoiseHg Workbench is sometimes very slow.
I am on Windows 7, using the latest version of TortoiseHg.
I have several large repositories being tracked by Hg locally
For example, I have one repository that
has been converted from CVS using cvs2hg (from the cvs2svn project)
has more than 7000 changesets
is more than 1 GB in size
uses the extensions largefiles, hgk and mercurial_keyring
has several binary files tracked as largefiles
has a few binary files not tracked as largefiles (including one file of 300 MB)
With this repo, it can take
10 to 30 seconds to open the repository,
3 to 10 seconds to switch to another changeset,
and sometimes filtering the changeset list can take several minutes.
Note 1: this is the case for the whole team (approx. 10 people) using this repository, not only on my PC.
Note 2: in the past it was much, much slower, but that was due to the "Changes" column being displayed. Now it is not as bad, but still far from good.
Note 3: this repo was cloned from a RhodeCode server, but the problem is in the local configuration, not during push or pull operations, so I don't think that is related.
Note 4: after the cvs2hg conversion, I had an end-of-line problem (I had been unable to make EOL conversion work correctly with the converter, I don't know why). So after the conversion I had to apply an EOL conversion to all my text files, commit, and push. That made quite a big changeset. I don't think that is the reason for my problem, because as far as I remember it was already slow before this conversion, but I preferred to mention it anyway.
Would anybody have any idea what's wrong with my configuration?
Is there a way to get a profile log, or even a general log, that could tell me what is taking so long?
Thanks in advance
Antoine
Related
We have an Hg repo that is over 6GB and 150,000 changesets. It has 8 years of history on a large application. We have used a branching strategy over the last 8 years. In this approach, we create a new branch for a feature and when finished, close the branch and merge it to default/trunk. We don't prune branches after changes are pushed into default.
As our repo grows, it is getting more painful to work with. We love having the full history on each file and don't want to lose that, but we want to make our repo size much smaller.
One approach I've been looking into would be to have two separate repos, a 'Working' repo and an 'Archive' repo. The Working repo would contain the last 1 to 2 years of history and would be the repo developers cloned and pushed/pulled from on a daily basis. The Archive repo would contain the full history, including the new changesets pushed into the working repo.
I cannot find the right Hg commands to enable this. I was able to create a Working repo using hg convert <src> <dest> --config convert.hg.startref=<rev>. However, Mercurial sees this as a completely different repo, breaking any association between our Working and Archive repos. I'm unable to find a way to merge/splice changesets pushed to the Working repo into the Archive repo and maintain a unified file history. I tried hg transplant -s <src>, but that resulted in several 'skipping emptied changeset' messages. It's not clear to me why the hg transplant command considered those changesets empty. Also, if I were to get this working, does anyone know whether it maintains a file's history, or is my repo going to see the transplanted portion as separate, maybe showing up as a delete/create or something?
Anyone have a solution to either enable this Working/Archive approach or have a different approach that may work for us? It is critical that we maintain full file history, to make historical research simple.
Thanks
You might be hitting a known bug in the underlying storage compression. 6 GB for 150,000 revisions is a lot.
This storage issue is usually encountered on very branchy repositories, in an internal data structure storing the content of each revision. The current fix for this bug can reduce repository size by up to tenfold.
Possible Quick Fix
You can blindly try to apply the current fix for the issue and see if it shrinks your repository.
upgrade to Mercurial 4.7,
add the following to your repository configuration:
[format]
sparse-revlog = yes
run hg debugupgraderepo --optimize redeltaall --run (this will take a while)
Some other improvements are also turned on by default in 4.7, so upgrading to 4.7 and running debugupgraderepo should help in all cases.
Finer Diagnostic
Can you tell us the size of the .hg/store/00manifest.d file compared to the full size of .hg/store?
In addition, can you provide us with the output of hg debugrevlog -m?
Other reasons?
Another reason for a repository to grow is large files (usually binary files) being committed to it. Do you have any of those?
The problem is that the hash ID of each revision is calculated from a number of items, including the parent ID. So when you change the parent, you change the ID.
As far as I'm aware there is no nice way to do this, but I have done something similar with several of my repos. The bad news is that it required a chain of repos, batch files and splice maps to get it done.
The bulk of the work I'm describing is ideally done one time only and then you just run the same scripts against the same existing repos every time you want to update it to pull in the latest commits.
The way I would do it is to have three repos:
Working
Merge
Archive
The first commit of Working is a squash of all the original commits in Archive, so you'll be throwing that commit away when you pull your Working code into the Archive, and reparenting the second Working commit onto the old tip of Archive.
STOP: If you're going to do this, back up your existing repos, especially the Archive repo before trying it, it might get trashed if you run this over the top of it. It might also be fine, but I'm not having any problems on my conscience!
Pull both Working and Archive into the Merge repo.
You now have a Merge repo with two completely independent trees in it.
Create a splicemap. This is just a text file giving the hash of a child node and the hash of its proposed parent node, separated by a space.
So your splicemap would just be something like:
hash-of-working-commit-2 hash-of-archive-old-tip
Then run hg convert with the splicemap option to do the reparenting of the second commit of Working onto the old tip of the Archive. E.g.
hg convert --splicemap splicemapPath.txt --config convert.hg.saverev=true Merge Archive
You might want to try writing it to a different named repo rather than Archive the first time, or you could try writing it over a copy of the existing Archive, I'm not sure if it'll work but if it does it would probably be quicker.
Once you've run this setup once, you can just run the same scripts over the existing repos again and again to update with the latest Working revisions. Just pull from Working to Merge and then run the hg convert to put it into Archive.
I have to work with an hg repository that has millions of lines of code and hundreds of thousands of changesets. As you can imagine, this really slows down mercurial and TortoiseHg.
Is it possible for me to create a local repository that only has the latest few thousand changesets?
Not only would this hopefully make things run snappier, but it might also save me some hard drive space.
No you can't, but you can optimise your local clone. Have a look at my answer to https://stackoverflow.com/a/19294645/479199.
There has been some work on shallow clones, but it's still very much a work in progress (and there doesn't seem to have been much progress):
https://www.mercurial-scm.org/wiki/ShallowClone
It seems that Facebook released an extension that is supposed to solve this problem.
See https://bitbucket.org/facebook/remotefilelog
No you can't. That's called a "Shallow Clone" and it's not implemented/supported. Millions of lines of code and thousands of changesets isn't particularly large, and once you've cloned it down once almost every action should be near instantaneous.
Further, the compressed binary deltas in the .hg directory are usually smaller than the entirety of the uncompressed code in the working directory, so the space savings shouldn't be substantial either.
Once you've cloned the repo once, make sure to do any further clones on the same machine from your local clone, and you'll never have to wait for the whole repo to clone again.
The Problem
Pulling changes from our Sourcerepo mercurial repository used to take very little time, e.g. a couple of seconds. Now it takes enough time for me to go get a drink and have a wee (a few minutes). It is getting incrementally slower as, I presume, a function of the size of the history on the repository. This is really annoying, there's only so much drinking and weeing I need to do, it's starting to hit my productivity.
Context
The repo is on Sourcerepo.
I'm connecting to the repo over SSH with key-based authentication.
I'm using TortoiseHG as my tool of choice, though the issue is equally seen via the command line.
It takes a long time to check for incoming changesets; it's the pull, not the update, that's taking the time.
If there are no changes it only takes a few seconds to tell me that.
We use a lot of named branches, and we do close them behind ourselves, so there are 5-20 open at a time but hundreds in the history.
We've got a little under 3k revisions in the repo.
Pushing is still really fast.
I thought Hg used some sort of delta encoding to transfer only the changes; it shouldn't be taking this long. I wondered if there is an option I'm missing in Hg, or whether anyone else has experienced this behaviour.
Thanks in advance :)
You can try
hg incoming -v --time --profile --debug REPO-URL and read the output - it can shed some light on the slowest parts of the process
Clone Sourcerepo to local repository and test incoming against this repo (eliminate network-related latency)
If nothing helps:
clone smaller parts of the repo (from revision 1 to tip) and test incoming against these truncated clones, in order to identify the changeset (I hope) that brings this headache. I can't see how to use bisect on the single big repo to test incoming operation time.
I'm starting a game development project, and my team and I will be using Mercurial for version control. I was wondering what the more appropriate way to store the game's binary assets would be. Basically, I have two options:
Mercurial 2.1 has the largefiles extension, but I don't know too much about it. It seems like it'll solve the 'repository bloat' problem but doesn't solve binary merge conflict issues.
Keeping the binary assets in a SVN checkout, as a subrepo. This way we can lock files for editing and avoid merge conflicts, but I would really like to avoid having to use 2 version control systems (especially one that I don't really like that much).
Any insight/advice or other options I haven't thought of?
As you surmised, largefiles will do what you need. For merging binaries you can set up a merge tool, if one exists for your file type. Something like this:
[merge-tools]
myimgmerge.executable = SOME-PROGRAM-THAT-WILL-MERGE-IMAGES-FOR-YOU
myimgmerge.priority = 100
myimgmerge.premerge = False
myimgmerge.args = $local $other $base -o $output

[merge-patterns]
**.jpg = myimgmerge
**.exe = internal:fail
In general though, merging non-text things will always be a pain using a source control tool. Digital Asset Management applications exist to make that less painful, but they're not decentralized, nor very pleasant to work with.
You're correct that the largefiles extension will avoid bloating the repository. What happens is that you only download the large files needed by the revision you're checking out. So if you have a 50 MB file and it has been edited radically 10 times, then the versions might take up 500 MB on the server. However, when you do hg update you only download the 50 MB version you need for that revision.
You're also correct that the largefiles extension doesn't help with merges. In fact, it skips the merge step completely and only prompts you like this:
largefile <some large file> has a merge conflict
keep (l)ocal or take (o)ther?
You don't get a chance to use the normal merge machinery.
To do locking, you could use the lock extension I wrote for a client. They wanted it for their documentation department, where people work with files that cannot be easily merged. It basically turns Mercurial into a centralized system similar to Subversion: locks are stored in a file in the central repository, and clients contact this repository when you run hg locks and before hg commit.
I have a project I cloned over the network to my Mac hard drive (OS X Snow Leopard).
The project is about 1 GB on the hard drive:
du -s
2073848 .
Then I run hg clone proj proj2 and check the sizes:
MacBook-Pro ~/development $ du -s proj
2073848 proj
MacBook-Pro ~/development $ du -s proj2
894840 proj2
MacBook-Pro ~/development $ du -s
2397928 .
So the clone doesn't seem so cheap... probably around 400 MB. Is that right? Also, the whole folder only grew by about 200 MB, which is not the sum of proj and proj2. Are some of the files hardlinks and some not, so the overlap isn't counted twice?
When possible, Mercurial uses hardlinks for the repository data, but it will not use hardlinks for the working directory. Therefore, the only space it can save is that of the .hg folder.
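You can see this sharing directly (a sketch; `proj` and `proj2` are the directories from the question): files under `.hg/store` that are hardlinked between the two clones show a link count greater than 1, while working-directory files are full copies with a link count of 1.

```shell
# Link count (second column of ls -l) > 1 means the file is shared
# between the two clones via a hardlink:
ls -l proj/.hg/store/00changelog.i
ls -l proj2/.hg/store/00changelog.i

# Working-directory files are independent copies, link count 1:
ls -l proj2/
```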
If you're using an editor that can break hardlinks, you can cp -al REPO REPOCLONE to use hardlinks on the entire directory, including the working directory, but be aware that it has some caveats. Quoting from the manual:
For efficiency, hardlinks are used for cloning whenever the source and destination are on the same filesystem (note this applies only to the repository data, not to the working directory). Some filesystems, such as AFS, implement hardlinking incorrectly, but do not report errors. In these cases, use the --pull option to avoid hardlinking.
In some cases, you can clone repositories and the working directory using full hardlinks with
$ cp -al REPO REPOCLONE
This is the fastest way to clone, but it is not always safe. The operation is not atomic (making sure REPO is not modified during the operation is up to you) and you have to make sure your editor breaks hardlinks (Emacs and most Linux Kernel tools do so). Also, this is not compatible with certain extensions that place their metadata under the .hg directory, such as mq.
Cheap is not the same as free. Cloning creates a new repository, that inherently has space costs - if you didn't want it to be located somewhere else on the disk, why would you bother cloning? However it is cheap in comparison, as you note, cloning your 1GB repo only adds ~200MB to the space taken up in the parent directory, because Mercurial is smart enough to identify information that doesn't need to be duplicated.
I think more generally, you need to stop worrying about the intricacies of how Mercurial (or any DVCS/VCS) works. It is a given that using version control takes more disk space, and takes time. As the amount of data and number of changes increases, the space and time demands increase too. What you're failing to realize is that these costs are far outweighed by the benefits of version control. The peace of mind that your work is safe, that you can't accidentally screw anything up, and the ability to look at your past work, along with the ease of distribution in the case of DVCS's are all far more valuable.
If your concerns really outweigh these benefits, you should just stick to a plain file system, and use FTP to share/distribute/commit the source code.
Update
Regarding romkyns' comment: you're downloading a large quantity of data, and downloading lots of data takes time, regardless of what it is. There is no way around that fact, and no way Mercurial or any other VCS can make it go faster.
The benefit of Mercurial and the distributed model, however, is that you only pay that cost once. Since all work is done locally, you can commit, revert, update, and the like to your heart's content without any network overhead, and only perform network operations to pull and push changes, which is relatively rare. In a centralized VCS, you're forced to make network operations any time you want to do something with your source code.
Additionally, I just tried cloning mozilla-central myself to see how long it would take, and it took 5 minutes to download changesets and manifests, 20 minutes to download the file chunks, and then updating to default (which is not network limited) took 10 minutes. 35 minutes to get the entire codebase for Mozilla along with the entire history of revisions isn't that bad. And even on this massive project with ~500,000 files and ~62,000 changes the repository is only 15% larger than the working directory, which goes back to the original point of the question.
It is worth mentioning though, cloning a repository is not the best way to download source code. If you just want the codebase, you can get releases. The Mercurial Web Interface can also let you browse the codebase without downloading anything, and you can download complete archives of any revision via the archive links (bz2, zip, gz) at the top of each page. All of these options are faster than a full clone. Cloning the repository is only necessary when you want to actively develop the Mozilla codebase, not when you just want the files.
When you can get 1TB of disk space for £60, 400MB is cheap (~ 2p).