Are deleted files still downloaded with an hg clone?

In our Mercurial repo we added a really big file (and did an hg push), then deleted the big file (and did another push).
Now if someone does an hg clone will they still pull down that big file? I know it won't appear in their working directory as it was deleted, but will the file still be pulled down and stored in Mercurial internal storage?
I'd like to ensure people don't have to pull down the file. I've learned that really big files should be stored outside of Mercurial, so I deleted the file. But I was wondering if people will still be pulling down the big file - in which case I guess I will recreate the repository from scratch.

Of course it will still be in the repository.
You can always update back to older revisions, and if you update back to the revision you got when you committed the file, it'll be there in all its glory.
There are two ways to mitigate this (when you're committing, not now):
One of the big-files extensions (largefiles, for example): these essentially add big files to a secondary repository and link the two, so that if you update to a revision where the file doesn't exist, it won't be fetched at all; i.e., it's more of an "on-demand" style of pulling
If the file never changes, keep it available on the network and just create some kind of link to it instead of a full copy
Right now, you have four options:
Strip away the changeset that added the file, along with all the changesets that came after it. You can do that using the Mercurial Queues extension. Note that you need to do this stripping in all clones: if just one of your users pushes a clone that still has that file in its history back to the central repository, the changesets are back.
Rebuild the repository from scratch manually
Rebuild the repository using the hg convert command and some filtering; the --filemap option can be used for this (see the sketch after this list)
Leave it as is. How big is the file? Will it really be much of a problem?
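For illustration, options 1 and 3 might look roughly like this (the revision number, file name, and repo names are placeholders):

# Option 1: strip the changeset that added the file, plus everything after it
# (hg strip is provided by the mq extension)
hg strip <rev-that-added-the-file>

# Option 3: rewrite the repository without the file; filemap.txt contains
# a single line:  exclude "path/to/bigfile.bin"
hg convert --filemap filemap.txt old-repo rebuilt-repo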
Note that rebuilding the repository, either manually or through hg convert, will invalidate all clones. Anyone trying to push to your new central clone from an old clone will get a message about unrelated repositories. If any of your users are stupi^H^H^H^H^Hnot smart enough to realize that forcing the push is a bad idea, then you will have problems with this approach.

Yes, the file is still in the history. If you want to delete it completely, you need to use Mercurial Queues — see Editing History on Mercurial wiki.
Just keep in mind this breaks clones as revision IDs change.

Related

Mercurial Repo Living Archive

We have an Hg repo that is over 6GB and 150,000 changesets. It has 8 years of history on a large application. We have used a branching strategy over the last 8 years. In this approach, we create a new branch for a feature and when finished, close the branch and merge it to default/trunk. We don't prune branches after changes are pushed into default.
As our repo grows, it is getting more painful to work with. We love having the full history on each file and don't want to lose that, but we want to make our repo size much smaller.
One approach I've been looking into would be to have two separate repos, a 'Working' repo and an 'Archive' repo. The Working repo would contain the last 1 to 2 years of history and would be the repo developers cloned and pushed/pulled from on a daily basis. The Archive repo would contain the full history, including the new changesets pushed into the working repo.
I cannot find the right Hg commands to enable this. I was able to create a Working repo using hg convert <src> <dest> --config convert.hg.startref=<rev>. However, Mercurial sees this as a completely different repo, breaking any association between our Working and Archive repos. I'm unable to find a way to merge/splice changesets pushed to the Working repo into the Archive repo while maintaining a unified file history. I tried hg transplant -s <src>, but that resulted in several 'skipping emptied changeset' messages. It's not clear to me why the hg transplant command considered those changesets empty. Also, if I were to get this working, does anyone know if it maintains a file's history, or will my repo see the transplanted portion as separate, maybe showing up as a delete/create or something?
Anyone have a solution to either enable this Working/Archive approach or have a different approach that may work for us? It is critical that we maintain full file history, to make historical research simple.
Thanks
You might be hitting a known bug with the underlying storage compression. 6GB for 150,000 revisions is a lot.
This storage issue is usually encountered on very branchy repositories and affects an internal data structure storing the content of each revision. The current fix for this bug can reduce repository size up to tenfold.
Possible Quick Fix
You can blindly try to apply the current fix for the issue and see if it shrinks your repository.
upgrade to Mercurial 4.7,
add the following to your repository configuration:
[format]
sparse-revlog = yes
run hg debugupgraderepo --optimize redeltaall --run (this will take a while)
Some other improvements are also turned on by default in 4.7, so upgrading to 4.7 and running hg debugupgraderepo should help in all cases.
Finer Diagnostic
Can you tell us the size of the .hg/store/00manifest.d file compared to the full size of .hg/store?
In addition, can you provide us with the output of hg debugrevlog -m?
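For example, from the root of the repository:

du -h .hg/store/00manifest.d   # manifest data size
du -sh .hg/store               # full store size
hg debugrevlog -m              # manifest revlog statistics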
Other reasons?
Another reason for repository size to grow is large files (usually binary files) committed to it. Do you have any of them?
The problem is that the hash id for each revision is calculated based on a number of items including the parent id. So when you change the parent you change the id.
As far as I'm aware there is no nice way to do this, but I have done something similar with several of my repos. The bad news is that it required a chain of repos, batch files and splice maps to get it done.
The bulk of the work I'm describing is ideally done one time only and then you just run the same scripts against the same existing repos every time you want to update it to pull in the latest commits.
The way I would do it is to have three repos:
Working
Merge
Archive
The first commit of Working is a squash of all the original commits in Archive, so you'll be throwing that commit away when you pull your Working code into the Archive, and reparenting the second Working commit onto the old tip of Archive.
STOP: If you're going to do this, back up your existing repos, especially the Archive repo, before trying it; it might get trashed if you run this over the top of it. It might also be fine, but I don't want any problems on my conscience!
Pull both Working and Archive into the Merge repo.
You now have a Merge repo with two completely independent trees in it.
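A minimal sketch of that setup, assuming the three repos sit side by side (the names are placeholders):

hg init Merge
hg -R Merge pull -f ../Working   # -f permits pulling from an unrelated repo
hg -R Merge pull -f ../Archive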
Create a splicemap. This is just a text file giving the hash of a child node and the hash of its proposed parent node, separated by a space.
So your splicemap would just be something like:
hash-of-working-commit-2 hash-of-archive-old-tip
Then run hg convert with the splicemap option to do the reparenting of the second commit of Working onto the old tip of the Archive. E.g.
hg convert --splicemap splicemapPath.txt --config convert.hg.saverev=true Merge Archive
You might want to try writing it to a differently named repo rather than Archive the first time, or you could try writing it over a copy of the existing Archive; I'm not sure if that will work, but if it does it would probably be quicker.
Once you've run this setup once, you can just run the same scripts over the existing repos again and again to update with the latest Working revisions. Just pull from Working to Merge and then run the hg convert to put it into Archive.
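So the repeatable update step might boil down to something like this (paths are placeholders):

hg -R Merge pull -f ../Working
hg convert --splicemap splicemapPath.txt --config convert.hg.saverev=true Merge Archive

hg convert records a revision map in the destination, so re-running it only converts the changesets it hasn't seen before.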

Mercurial: How do I fix an incorrectly published rename/copy?

I've somehow marked a bunch of files as being "copy/renamed" instead of just marking them as new files.
I've also published these changes to our central repository and cannot strip them out (we've already got some changes on top, and I don't have the permissions/ability to fiddle with the central repository anyway).
Context:
I've actually done it to a whole bunch of files, but here's an example of what I did to one: I've copied a file called "Labels.properties", to a new file called "Labels_ja.properties" (the new file contains a subset of the original properties, and eventually all the values will be translated to a different language). The "Labels_ja.properties" file has been marked as a copy/rename of "Labels.properties" (the only way I've been able to see exactly what's happened is by looking with SourceTree, which shows "File copied/renamed from Labels.properties").
Our environment is a sort of central repository with automated "pull request" style merging tools built on top, so solutions involving all our developers magically knowing exactly how to drive Hg in order to resolve these conflicts aren't going to work - it's the scripts that are getting the merge conflicts.
These copy/renames are causing a lot of hassles: when people touch the original versions of the property files (they don't even know about the copied files yet) - it looks like Hg is trying to merge those changes onto the copies, but because the files are very different, those merges are failing with conflicts.
Problem:
What can I do to sort out all these merge conflicts that our automated merge scripts are getting?
Ideally, I'd like to go back in time and just mark all these files "new" - but there's no going back now that the changesets have been published.
Can I just make a big backout commit, then re-add the files (making sure that they are marked "new" and not "copy/renamed")?
I cannot think of a good solution. Even if you delete the files, Mercurial will still complain repeatedly when they get changed and merged. The simplest solution I can think of is moving the bad files into some "attic" directory where they don't hurt anyone and never touching them again. Then add the real files where they belong.
If I understand correctly, the problem is in fact that the developers are still based on a revision that does not have your move.
When they change the original file, an 'hg merge' will then smash together the changes from the developer and the changes to the new file in that new file.
I see a few solutions for this:
You can tell the developers to rebase their changes on top of the new changes. This will actually not work, since rebase will make the same mistake as merge.
The developers can turn their changes into patches and apply their patches on top of your changes. Their patches will refer to the original files, not the changed files, so this should work fine for the tools that merge later on.
The above flow can be done in a more sophisticated way using Mercurial Queues: import the existing developer commits as patches, then qpop all of these, update to your newer revision and qpush all of the revisions again (see the sketch after this list).
Mark the head with your changes as 'closed' using '--close-branch'. However, if your tools are not equipped to handle this correctly (by ignoring the closed branch), this may cause issues. Branches are reopened when a new commit is made on top of them, so if you merge developer changes with the closed branch, it will open again.
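As a sketch, the Mercurial Queues flow from the third option would look something like this (revision numbers are placeholders, and mq must be enabled):

hg qimport -r <first-unpushed>:<last-unpushed>   # turn the developer's commits into patches
hg qpop -a                                       # pop them all off
hg update <revision-with-your-changes>           # move to the corrected head
hg qpush -a                                      # re-apply the patches on top
hg qfinish -a                                    # turn the applied patches back into commits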

Teaching a mercurial repository about a bad rename after it is pushed

We have moves and renames of files in our published mercurial history that were not properly recorded, so that they appear in the history as unrelated deletions and adds.
Is there any way to tell the repository about the connections so that --follow commands can work again?
(For non-pushed changes, here is a question discussing how to get mercurial to properly record moves/renames before you commit, as well as a useful tip here.)
One solution that works but is a bit brutal: You can login remotely on your central Mercurial server, fix the renames there as a local change and then ask everyone to clone the repo again.
This works since all Mercurial repos are equal. You just think of one as "central" but in fact it's a repo like any other. So if you have access to that, you can rewrite history there. The drawback is of course that every developer will notice, so they'll have to export any non-pushed changes they made, clone the repo again and then import the patch.
[EDIT] A possible workaround would be to create a new branch just before you did the bad rename, rename the files properly and then cherry pick all the changes after that into the new branch.
I'm not sure if it's a good idea to merge this branch into your original branch. If you did, then Mercurial would see two kinds of file renames in the past and I'm not sure which one it would follow.
So before you merge, I suggest that you create a small test/demo repo where you reproduce the situation and then try it out.
You can start a new head from before the rename, do the rename properly, then merge it into the head from the upstream server. It will give you a conflict. When this happens, revert the offending files to the revision from your new head (with the good renames; hg revert -r <rev> <files...>).
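In command form, that recipe looks roughly like this (revisions and file names are placeholders):

hg update -r <rev-before-bad-rename>
hg rename <old-name> <new-name>                # record the rename properly this time
hg commit -m "Redo rename with copy tracking"
hg merge <head-from-upstream>                  # expect conflicts on the renamed files
hg revert -r <rev-with-good-rename> <files...> # keep the properly renamed versions
hg commit -m "Merge, preferring the recorded renames"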

How do I permanently remove (obliterate) files from history?

I committed (not pushed) a lot of files locally (including removing & adding binary files) and now when I try to push it takes a lot of time. Actually, I messed up my local repo history.
How can I avoid this mistake in the future? Can I transform a set of local revisions 1->2->3->4 into 1->2, with 2 being the final revision of the local clone?
edit: since I was in a hurry, I started a new remote repo from scratch with revision 4. In the future I will go with the marked answer, as it seems easier, but I will dig into the other solutions to see for myself. Thx for your support.
It's not clear from your question whether those changes got pushed. If they're still local, you can get rid of them more or less easily. convert is one option; you can also use MQ (Mercurial Queues). Check the EditingHistory wiki article for a detailed explanation; it recommends MQ as the simplest approach.
To prevent that kind of mistake, you should probably add a hook to reject 'bad' commits, given that you can describe them programmatically ;)
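As an illustration, a size-based hook in the repository's hgrc might look like this (a sketch, untested; the fileset syntax needs a reasonably modern Mercurial, and the 10 MB limit is arbitrary):

[hooks]
# reject the commit if the resulting revision contains a file larger than 10 MB
pretxncommit.maxsize = ! hg files -r tip "set:size('>10MB')"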
Mercurial history is immutable; you can't delete it using the normal tools. You can, however, create a new repo without those files:
$ hg clone -r 1 repo-with-too-much new-repo
That takes only revisions zero and one from the old repo and puts them into a new repo. Now copy the files from revision four into the new repo and commit.
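The "copy the files" step could go through hg archive, something like this (paths are placeholders):

hg -R repo-with-too-much archive -r 4 /tmp/rev4   # snapshot of revision four's files
rm /tmp/rev4/.hg_archival.txt                     # archive adds this metadata file
cp -r /tmp/rev4/* new-repo/
hg -R new-repo addremove
hg -R new-repo commit -m "Recreate the state of revision four"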
This gets rid of those interstitial changesets, but any repo you have out there in the wild still has them, so when you pull you'll get them back.
In general once you've pushed a changeset it's out there and unless you can get everyone with a clone to delete it and reclone you're out of luck.

Merging changes to a workspace with uncommitted changes

We've just recently switched over from SVN to Mercurial, but now we are running into problems with our workflow. Example:
I have my local clone of the repository which I work on. I'm making some highly experimental changes to our code base, something that I don't want to commit before I'm sure it works the way it is supposed to; I don't want to commit it even locally. Now, simultaneously, my co-worker has made some significant improvements/bug fixes which I need. He pushes his commits to our main repository. The question is: how can I merge his changes into my workspace without being required to commit all my changes first, since I need his changes to test my own code?
A more day-to-day problem we have with the exact same workflow is our configuration files, which are in the repository. Each developer makes a couple of small environment-specific changes to the configuration files but does not commit them. These uncommitted files prevent us from merging anything into our workspace, just as in the example above. Ideally, the configuration files probably shouldn't be in the repository; unfortunately, that's just how it has to be, for reasons that shall remain unnamed here.
If you don't want to clone, you can do it the following way.
hg diff > mylocalchanges.txt
hg revert -a
# Do your merge here, once you are done, import back your local mods
hg import --no-commit mylocalchanges.txt
There are two operations, as you've discovered, that make changes from one person available to someone else (or many, on either side).
There's pulling, which takes changes from some other clone of the repository and puts them into your clone.
There's pushing, which takes changes from your repository and puts them into another clone.
In your case, your coworker has pushed his changes into what I assume is your central master of the repository.
After he has done this, you can pull the latest changes down into your repository, and merge them into your branch. This will incorporate any bugfixes or changes your coworker did into your experimental code.
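In command form, that cycle is just (assuming the central repository is your clone's default path):

hg pull                      # fetch your coworker's changesets
hg merge                     # merge them into your code
hg commit -m "Merge in latest fixes"

Note that hg merge insists on a clean working copy, so genuinely uncommitted experimental changes still have to be set aside first, for example with the diff/revert/import or shelve recipes described in the other answers here.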
This gives you the freedom to stay current with your coworkers' development on the project, without having to release your experimental code until it is ready (or even at all).
So, as long as you stay away from the Push command, you're safe.
Of course, this also assumes nobody is pulling directly from your clone of the repository; if they do, they will of course get your experimental changes, but it doesn't sound like you've set it up this way (and it is highly unlikely as well).
As for the configuration files, the typical way to handle this is to commit only a master template into the repository under a different name (i.e. with an extra extension like .template), and then add the name of the real configuration file to the ignore filter.
Each developer then has to make his or her own copy of the template, rename it, and change it in any way they want, without the risk of committing database connection strings, passwords, or local paths to the repository.
If necessary, provide a script that will help the developer make the real configuration file if it is long and complex.
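For example, with a committed template and an ignored real file (the names are just an illustration):

cp app.config.template app.config   # each developer keeps a private, untracked copy

# in .hgignore:
syntax: glob
app.config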
Regarding your experimental changes, you should commit them. Often.
Simply commit them in a clone you never push. You only pull, to merge whatever updates you need from other repos.
As for config files, don't commit them.
Commit template files, and a script able to generate the complete config files from the templates.
That way, developers will only modify "private" (i.e. not committed) config files with their own private values.
If you know your uncommitted changes will not collide with the merge commit that you are creating - then you can do the following...
1) Shelve the uncommitted changes
2) Do the pull and merge
3) Unshelve the uncommitted changes
Shelving effectively stores your uncommitted changes away as a diff (relative to your last commit) and then rolls those files back in your local workspace. Un-shelving then applies that diff, bringing back your uncommitted changes.
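With the shelve extension enabled (it ships with Mercurial as of 2.8), the flow is roughly:

hg shelve                 # 1) stash the uncommitted changes
hg pull                   # 2) pull and merge...
hg merge
hg commit -m "Merge"      #    ...and commit the merge
hg unshelve               # 3) re-apply the stashed changes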
Tools such as TortoiseHg have shelve built in.