Finding Large Files in Mercurial Repository

Finding Large Files in Mercurial Repository - mercurial

Similar to this link but for mercurial. I'd like to find the files that are most contributing to the size of my mercurial repository.
I intend to use hg convert to create a new, smaller repository. I'm just not sure yet which files are contributing to the repository size. They could be files that have already been deleted.
What is a good way to find these anywhere in the repository history? There are over 20,000 commits. I'm thinking a powershell script, but I'm not sure what the best way to go about this is.

Check hg help fileset. Something like
hg files "set:size('>1M')"
should do the trick for you. You might need to operate over all revisions, though as it only operates on one revision. In bash I'd try something like
for i in `hg log -r"all()" "set:size('>400k')" --template="{rev}\n"`; do hg files -r$i "set:size('>400k')"; done | sort | uniq
might do the trick. Maybe it can be optimized as it's currently a bit duplication and might run for quite a bit; on the OpenTTD repository with 22000 commits it took on my laptop just short of 10 minutes.
(Also check hg help on templates, files and grep)

Related

Mercurial Repo Living Archive

We have an Hg repo that is over 6GB and 150,000 changesets. It has 8 years of history on a large application. We have used a branching strategy over the last 8 years. In this approach, we create a new branch for a feature and when finished, close the branch and merge it to default/trunk. We don't prune branches after changes are pushed into default.
As our repo grows, it is getting more painful to work with. We love having the full history on each file and don't want to lose that, but we want to make our repo size much smaller.
One approach I've been looking into would be to have two separate repos, a 'Working' repo and an 'Archive' repo. The Working repo would contain the last 1 to 2 years of history and would be the repo developers cloned and pushed/pulled from on a daily basis. The Archive repo would contain the full history, including the new changesets pushed into the working repo.
I cannot find the right Hg commands to enable this. I was able to create a Working repo using hg convert <src> <dest> --config convert.hg.startref=<rev>. However, Mecurial sees this as a completely different repo, breaking any association between our Working and Archive repos. I'm unable to find a way to merge/splice changesets pushed to the Working repo into the Archive repo and maintain a unified file history. I tried hg transplant -s <src>, but that resulted in several 'skipping emptied changeset' messages. It's not clear to my why the hg transplant command felt those changeset were empty. Also, if I were to get this working, does anyone know if it maintains a file's history, or is my repo going to see the transplanted portion as separate, maybe showing up as a delete/create or something?
Anyone have a solution to either enable this Working/Archive approach or have a different approach that may work for us? It is critical that we maintain full file history, to make historical research simple.
Thanks

You might be hitting a known bug with the underlying storage compression. 6GB for 150,000 revision is a lot.
This storage issue is usually encountered on very branchy repositories, on an internal data structure storing the content of each revision. The current fix for this bug can reduce repository size up to ten folds.
Possible Quick Fix
You can blindly try to apply the current fix for the issue and see if it shrinks your repository.
upgrade to Mercurial 4.7,
add the following to your repository configuration:
[format]
sparse-revlog = yes
run hg debugupgraderepo --optimize redeltaall --run (this will take a while)
Some other improvements are also turned on by default in 4.7. So upgrade to 4.7 and running the debugupgraderepo should help in all cases.
Finer Diagnostic
Can you tell us what is the size of the .hg/store/00manifest.d file compared to the full size of .hg/store ?
In addition, can you provide use with the output of hg debugrevlog -m
Other reason ?
Another reason for repository size to grow is for large (usually binary file) to be committed in it. Do you have any them ?

The problem is that the hash id for each revision is calculated based on a number of items including the parent id. So when you change the parent you change the id.
As far as I'm aware there is no nice way to do this, but I have done something similar with several of my repos. The bad news is that it required a chain of repos, batch files and splice maps to get it done.
The bulk of the work I'm describing is ideally done one time only and then you just run the same scripts against the same existing repos every time you want to update it to pull in the latest commits.
The way I would do it is to have three repos:
Working
Merge
Archive
The first commit of Working is a squash of all the original commits in Archive, so you'll be throwing that commit away when you pull your Working code into the Archive, and reparenting the second Working commit onto the old tip of Archive.
STOP: If you're going to do this, back up your existing repos, especially the Archive repo before trying it, it might get trashed if you run this over the top of it. It might also be fine, but I'm not having any problems on my conscience!
Pull both Working and Archive into the Merge repo.
You now have a Merge repo with two completely independent trees in it.
Create a splicemap. This is just a text file giving the hash of a child node and the hash of its proposed parent node, separated by a space.
So your splicemap would just be something like:
hash-of-working-commit-2 hash-of-archive-old-tip
Then run hg convert with the splicemap option to do the reparenting of the second commit of Working onto the old tip of the Archive. E.g.
hg convert --splicemap splicemapPath.txt --config convert.hg.saverev=true Merge Archive
You might want to try writing it to a different named repo rather than Archive the first time, or you could try writing it over a copy of the existing Archive, I'm not sure if it'll work but if it does it would probably be quicker.
Once you've run this setup once, you can just run the same scripts over the existing repos again and again to update with the latest Working revisions. Just pull from Working to Merge and then run the hg convert to put it into Archive.

Find likely base revision in a mercurial repository

I received some code as a tar archive (without the .hg directory). I know which repository this code is based on, but not which revision was used as a base for these modifications. Is there some way to find this out by just looking at the files? This is similar to Given a file, how to find out which revision in a mercurial repository this is? but I cannot reach the author of the code, so I cannot control how the files are extracted from the repository. I am also dealing with modified files here so the diff to the base revision would not be empty.
My fall-back plan would be to loop through all revisions and using the one with the smallest diff, but I'm still hoping there is a better solution.

There's no automated way to do it, but you could possibly reduce the time by using a hg bisect --command diff ... command to zero in on it.
As a tip if you (I know it probably wasn't) you ever has to give someone a snapshot again, use hg archive to make it. It includes a .hg_archive.txt file with version info that'll help if you have to do this again.

Mercurial - Look into history of file without updating

I'm working in this file and I come across a piece of code which I think has changed at some point in history, and I would like to know where it changed.
It's a pretty big file with a lot of history, so when I use hg diff, I get a enormous list and I don't think it's efficient to search through that.
It would be really neat if I can look into an old revision of the file, to see what the file looked like at a certain point in time. Then I can see how the code worked back then so I can conclude how the bug evolved. Of course, I want to do this without updating the file, because I'm currently working in it and have made changes in it.
So, is there any way you can look into the history of a file without updating it?

There are a few tools to help you:
To get the history of a file you can just use hg log FILE which is probably the best starting point.
You can also use hg annotate FILE which lists every line in the file and says which revision changed it to be like it currently is. It can also take a revision using the --rev REV command tail to look at older versions of the file.
To just list the contents of a file at a given revision you can use hg cat FILE --rev REV.
If it proves too hard to track down the bug using those tools, you can just clone your repository somewhere else and use hg bisect to track it down.

hg bisect lets you find the changeset that intoduced a problem. To start the search run the hg bisect --reset command. It is well document in Mercurial: The Definitive Guide.

How do I permanently remove (obliterate) files from history?

I commited (not pushed) a lot of files locally (including binary files removing & adding...) and now when I try to push it takes a lot of time. Actually I messed up my local repo history.
How could I avoid this mistake in the future ? Can I transform a set of local revision 1->2->3->4 to 1->2 with 2 being the final revision of the local clone ?
edit: since I was in hurry I started a new remote repo from scratch with revision 4. In the future I will go with the marked answer as it seems easier but I will dig other solutions to see the truth. Thx for your support.

It's not clear from your question whether those changes got pushed. If they're still local, you can more or less get rid of them easily. convert is one option. You can also use MQ (mercurial queues). Check the EditingHistory wiki article for a detailed explanation. It recommends MQ being the simplest approach.
To prevent that kind of mistakes, you should probably add a hook to reject 'bad' commits, given that you can describe them programatically ;)

Mercurial history is immutable, you can't delete using the normal tools. You can, however, create a new repo without those files:
$ hg clone -r 1 repo-with-too-much new-repo
that takes only revisions zero and one from the old repo and puts them into a new repo. Now copy the files from revision four into the new repo and commit.
This gets rid of those interstitial changesets, but any repo you have out there in the wild still has them, so when you pull you'll get them back.
In general once you've pushed a changeset it's out there and unless you can get everyone with a clone to delete it and reclone you're out of luck.

Fetching a single file from another mercurial repository [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Mercurial: copying ONE file and its history to another repository
I have several repositories in my local machine.
One is my main code, another is an assortment of useful code/tools.
These are two fundamentally different repos. It might make sense to setup a new repo and pull these two in as sub-repos, but I want to wait until Mercurial devs mark sub-repos as non-experimental before I do that.
One of the useful code files has become so useful, I want to put it into my main code area...but I want to maintain its history. This will, of course, result in some variant of a fork, but that's acceptable. (best case would be being able to push-pull it back and forth and keep updating its history).

I'd just use the subrepo feature that came online in 1.3. It might change slightly, but you won't be left high and dry backwards compatibility wise.
If you can't bring yourself to so, then what you need to do is:
use hg convert with a filemap that deletes all files except the one you want and convert from the repo with the single useful file to a new repo containing only that file and all its history
then hg pull from the new single-file-full-history repo into the target repo
hg merge in the target repo and you'll have that file with all it's history
The other option would be to hg export the entire tools repo, use grepdiff (part of difftools) to limit to only one file, and then import into the target repo, but that's crazy.

The short answer is you can't copy a file and its history simply, as stated in this thread:
https://www.mercurial-scm.org/pipermail/mercurial/2009-April/025105.html
I'm relatively new to DVCS and you really have to think of each repo as a self contained package. Not like svn or p4, where you can hang different projects off the root and configure it how you like and do partial repo checkouts. (That said, I really like the flexibility of being able to clone repos quickly to try things out. And being able to commit on a local machine.)
I'm just looking at a similar problem. I've branched a repo to make changes and I only want one file out of one changeset. And it is nice to have the history.
You could look at:
hg cat
This would probably involve writing a script to transfer history, i.e. commit N changesets in the target repo with the hg cat results from the source. Wonder if there is an extension to do this?
You could get the log of the file you want to copy and paste that into a commit comment. It's not in the metadata, but you do have a record and all the hashes etc.

may be
hg export
also can help you.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008