How do I anonymise a mercurial repository?

How do I anonymise a mercurial repository? - mercurial

I have a programming assignment I'm to hand in at my University for the end of this week and they have strict rules about anonymity of the assignments to maintain impartiality, so if my name (or any other obvious identifying info) appears anywhere in the work it may be automatically disqualified.
While preparing to burn everything to disc, I've just noticed/remembered that my HG repo is full of copies of my name. The code is all clean, but the author of every changeset is either my full name and email or my university login ID and the hostname of a lab computer (depends where I was working).
I need to create an anonymised version of the repo (or swap out all names for my student ID number) without losing any of the other information it holds.
So, as the headline says, how do I anonymise a mercurial repository?

You can use Mercurial's Convert extension with the --authors option to "convert" your repository into a new Mercurial repository, changing the authors' names during the conversion.
Quote from the second link:
Convert can also remap author names during conversion, if the
--authors option is provided. The argument should be a simple text file maps each source commit author to a destination commit author. It
is handy for source SCMs that use UNIX logins to identify authors (eg:
CVS). Example:
john=John Smith <John.Smith#someplace.net>
tom=Tom Johnson <Tom.Johnson#bigcity.com>

If you don't have any merge changesets, then you could try using the graft command in Mercurial 2.0 to graft your repository to a new repository while changing the recorded user name.
If you do have merge changesets, then it might be possible to use the transplant extension in Mercurial 2.2, although changing the recorded user name appears to be harder.

Related

Mercurial: how do I create a new repository containing a subrange of revisions from an existing repo?

I have a repo with subrepos, with a long history. At some point the main subrepo became fully self-contained (doesn't depend on other sister subrepos).
I don't care anymore about the history of the whole thing before the main subrepo became self-contained. So I want to start a new repo that contains just what the subrepo has in it from that moment on. If possible, please describe in terms of TortoiseHg commands.

You probably want to make use of mercurial's convert extension. You can specify revisions to be converted, paths, branches and files to include or exclude in the newly-created repository.
hg convert from-path new-repo
Convert is a default extension which just needs activation.

You might be able to weed out any other changesets you don't need by using the hg strip command.
It will entirely remove changesets (and their ancestors) from a repository. Most likely you would want to make a fresh clone and then work on stripping it down.
One potential pitfall is that the final stripped repo would still have shared parentage with the original; therefore the potential exists for someone to accidentally pull down again changesets which were stripped.
The hg convert command (noted in another answer) does not have this downside.

Rename a local-only mercurial branch?

I find a lot of questions about renaming hg branches that have already been pushed/pulled by other people, but I've got a situation here where work has been done on a set of branhces (10 in total) that all have the "wrong" name - wrong in the sense that the repo (on bitbucket.org) has restrictions on the naming of branches.
Any developer can open a new branch named app-feature-xxxx (where xxxx can be anything), but a boatload of work has been done by a new dev, and none of the branches follow this naming pattern (the branches are effectively the xxxx part without the app-feature- prefix)
Currently these branches are known only on his machine - they've never been pushed to BitBucket.org, nor pulled by anyone else
Can they be renamed in-situ, before they're pushed? Right now hg is attempting to commit his history to BitBucket with these branch names and it's failing. If the branches can be renamed before that happens, everything should be golden.. And there aren't the usual "but what about everyone else's history?" problems, because only one person has these commits..
The easiest I've been able to come up with right now is to clone the repo again, just make one app-feature-lotsofupdates branch, and then keep switching working copy in the original repo, and use a diff tool to apply the code from the original repo (with the wrong names) to this newly cloned repo, committing after every diff/copy - effectively a manual merge of all the various features into one branch (that will then be merged into production)

You can do this using the convert extension to Mercurial, and the branchmap option.
The branchmap is a file that allows you to rename a branch when it is
being brought in from whatever external repository. When used in
conjunction with a splicemap, it allows for a powerful combination to
help fix even the most badly mismanaged repositories and turn them
into nicely structured Mercurial repositories. The branchmap contains
lines of the form
original_branch_name new_branch_name
original_branch_name is the name
of the branch in the source repository, and new_branch_name is the
name of the branch is the destination repository.
The command line to do this would be something like:
hg convert --branchmap branchmap.txt path\to\source\repo path\to\converted\repo
Documentation.

Given a file, how to find out which revision in a mercurial repository this is?

Assume that there is a file under hg version control. I have a particular version of that file, and I would like to find out in which revision this file was in this version.
I suspect that there are two possible ways to do this.
Do hg update in a loop and diff the file against subsequent versions (sloooow, but should work).
Make Mercurial put the rev number in a, say, comment in the second line of the file right before committing. From what I have read, a precommit hook might be of use. Then I don't have to compare anything, just look at the file itself (I'm assuming no-one will change this, of course, but this is rather safe assumption in my case).
My use case is a joint paper, written in LaTeX, with two coauthors who have no idea about version control at all, but I prefer to use it (for obvious reasons). We communicate by email, and there's effectively a human-based lock system ("I will not work on this file until you send me the next version, ok?"). The only problem that arises is that I'm sending version X to author B to proofread, then author C sends me a corrected version Y and I commit it into my repo, then author B sends his corrections Z (to version X) and I'm starting to get lost-but I can check the attachment in the email sent to B, and I only need to find out which revision it is.
So, my question is: which of the two ideas above would be better, or maybe there's yet another one to help me deal with this mess?

hg archive is good method for future work, but I can suggest at least 3 alternative work-styles and 1 fix for find-correct-version with updates
Future work
You can use separate named branches for co-authors and default for merged results, send co-author always head from his branch, update his branch after getting corrections (you'll always know, that you sent) and merge branches to default
One branch, revision-of-coworker marked with bookmark, which you later move to next point
Mercurial keywords considered somehow as a "feature of last resort", but in your case it's obvious and usable solution: just add keyword with hash-id in file (defaul extension instead of hook - easier and more reliable)
Current state
For finding changeset with source of file, you can try to use bisect (example) and test in test-script, f.e, CRC of file (you have needed CRC of unversioned file, check versioned file across history)

If you're happy to rely on finding the emails you send the reviewers, why not just include the revision hashes in them along with the files?
You can get this for almost zero extra effort by generating your attachment using hg archive, which will create a file containing 1) your files for review, and 2) .hg_archival.txt, complete with revision hash.
Though I'd be surprised if there isn't a more elegant way, even if your collaborators are dead-set against using version control.

Creating Mercurial subrepositories, while maintaining the history

I am about to make some major changes to my Mercurial repositories. As I am going to be using a Feature of Last Resort, I am looking for some advice and reassurance that I am not doing something stupid.
Where I Am:
I have a Mercurial repository with a complete history of all of these files:
/source
/secret_subsystem
/unclassified_subsystem
/common_files
Source is the Mercurial repository.
The secret subsystem folder contains code which is intellectual property we want to keep in-house.
The unclassified subsystem folder contains code which we want to outsource to a third-party to maintain.
The common files folder contains code that both subsystems depend on. We will be keeping ownership, but we want to share it with the third-party.
Obviously, I can't just push out my whole repository to the third-party company. The third-party would see too much.
Where I Want To Be:
Having read up on subrepositories, this is where I think I need to be:
Have THREE subrepositories: secret_subsystem, unclassified_subsystem, common_files.
Ensure there are no other files at the /source level, due to this recommendation.
Have the outsourcers create a brand new respository at the source level on their machines, and two corresponding subrepositories.
Push the unclassified_subsystem and common_files to the out-sourcer, pulling back unclassified_subsystem as required, pushing out new common_files repositories as required.
Maintaining History:
I would like to maintain the commit history, as much as practical, for all of the subsystems.
To do this, I will run the hg convert extension command three times, once for each subrepository. I will filter down to only the files that belong in each subrepository. I may also need to map filenames to move the files from ./common_files/foo.py to ./foo.py (for example).
My Questions:
1) Is dividing up a repository into subrepository a reasonable way of implementing security - viz. that a third-party can only see and edit some of our files?
2) Is using hg convert a reasonable way to create a subrepository from an existing repository, while still maintaining the history?
3) Will hg convert's filter strip out (a) all commits messages about files NOT in the filtered respository? Will it filter out all diffs for files NOT in the filtered repository?
There is another implied question: Am I heading into a world of hurt? If so, I will simply give up on retaining file histories, or even make them seperate repositories and forget about cross-repository commits.

I've not used subrepos so far, but I can answer 2) and 3):
2) Yes, sounds reasonable.
3) Yes.
There was a similar question just 2 days ago: Convert mercurial repository to subrepositories with full history (like hg log -f)

Mercurial : user friendly way to display exact revision number of files?

When I was using Subversion as part of the build process I'd run an 'svn info' and capture the unique ID number and echo it to a header file for inclusion by other programs. This made it easy for users to say for example, 'I'm running build 456' and given the number 456 I could always cross reference exactly what they were running.
I'm trying to figure out how to achieve the same thing with Mercurial. 'hg summary' displays an integer id as well as the hex hash code. From what I was reading the integer id could be different for different people. I'm supposing the hash code is unique, but it's not very user friendly.
Is the hg hash code the only unique way of identifying a particular version of files in Mercurial?
Thank you,
Fred

Yes it is the only way to uniquely identify a changeset.
More details in the documentation : ChangeSet and ChangeSetID
If you want to use an integer number, I see two possible solution depending on your build process.
If the build always happens on the same machine (ie: same repository), you can use the integer id because it never changes on a particular repo (except if you do history rewriting)
If the build of a particular version only happens once, you can use a variable that you increment each time in your build script.

hg id command will give you needed changeset. You can add someoptions to command also, but most useful and permanent part is changeset id
For the same repo
>hg id -nibt
6c4d15d8cfbd 841 default tip
>hg id
6c4d15d8cfbd tip
you can also think about some commands, which support templating of output, and combine nice output from template-keywords mix: hg help templating
Example for already mentioned repo
>hg log --template "{rev}:{node|short}-{latesttag}+{latesttagdistance}" -r tip
841:6c4d15d8cfbd-1.3+3

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008