BFG Repo-Cleaner: deleted many folders with the same name, but the size is still the same - git-rewrite-history

I have a huge git repository and I want to reduce its size. To achieve that, I want to delete a folder that occurs many times, under the same name, in different subdirectories.
The git repository has the following folder structure:
- ID1
  - Graphs
- ID2
  - Graphs
I want to delete all Graphs folders, because they are huge.
I ran java -jar ~/Downloads/bfg-1.13.0.jar --delete-folders Graphs --no-blob-protection ~/path/to/Repo, which produced the following output:
Using repo : ~/path/to/Repo/.git
Found 0 objects to protect
Found 3 tag-pointing refs : refs/tags/VERSION1.4, refs/tags/VERSION1.5, refs/tags/VERSION3.1
Found 5 commit-pointing refs : HEAD, refs/heads/master, refs/remotes/origin/HEAD, ...
Protected commits
-----------------
You're not protecting any commits, which means the BFG will modify the contents of even *current* commits.
This isn't recommended - ideally, if your current commits are dirty, you should fix up your working copy and commit that, check that your build still works, and only then run the BFG to clean up your history.
Cleaning
--------
Found 284 commits
Cleaning commits: 100% (284/284)
Cleaning commits completed in 268 ms.
Updating 3 Refs
---------------
Ref                           Before     After
-------------------------------------------------
refs/heads/master           | 7f5ba511 | fcd2600c
refs/remotes/origin/develop | c30fa798 | 7e345ac0
refs/remotes/origin/master  | 7f5ba511 | fcd2600c
Updating references: 100% (3/3)
...Ref update completed in 20 ms.
Commit Tree-Dirt History
------------------------
Earliest Latest
| |
...DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDm
D = dirty commits (file tree fixed)
m = modified commits (commit message or parents changed)
. = clean commits (no changes to file tree)
                        Before     After
-------------------------------------------
First modified commit | 3fba1762 | 7a24f280
Last dirty commit     | 0ea27985 | 1cc26472
In total, 490 object ids were changed. Full details are logged here:
~/path/to/Repo.bfg-report/2019-04-03/16-35-48
BFG run is complete! When ready, run: git reflog expire --expire=now --all && git gc --prune=now --aggressive
Then I followed the instruction and ran git reflog expire --expire=now --all && git gc --prune=now --aggressive.
That took at least 5h to complete.
Output (translated from German):
Enumerating objects: 11191, done.
Counting objects: 100% (11191/11191), done.
Delta compression using up to 12 threads
Compressing objects: 100% (11115/11115), done.
Writing objects: 100% (11191/11191), done.
Total 11191 (delta 1866), reused 5522 (delta 0)
After it completed, the repository was only slightly smaller.
The checked-out working directory is under 50 MB,
but the .git folder is still over 19 GB.
Looking through the history, the Graphs folders do appear to be gone.
I do not understand why the repository is still that large when the folders are gone from the commit history.
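One way to investigate is to list the largest blobs that are still reachable from any ref. The following is only a diagnostic sketch, not a confirmed explanation of the size. The function name largest_blobs is made up for illustration. It relies on git rev-list --objects --all and git cat-file --batch-check, and must be run from inside the repository.

```shell
# largest_blobs [N]
# Print the N largest blobs reachable from any ref (default 10),
# one per line as "<size-in-bytes> blob <object-id> <path>".
largest_blobs() {
    # rev-list emits "<object-id> <path>"; cat-file looks up type and size
    # and passes the path through as %(rest).
    git rev-list --objects --all |
        git cat-file --batch-check='%(objectsize) %(objecttype) %(objectname) %(rest)' |
        awk '$2 == "blob"' |
        sort -rn |
        head -n "${1:-10}"
}
```

If paths under Graphs still show up in the output of largest_blobs 20, some ref still reaches the old history. Running git count-objects -vH alongside it shows how much of .git is packed objects as opposed to loose or garbage objects.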

Related

Condensing a Mercurial repository - recommended way?

Let's say I have a repository 'Main', and Max and co. each work on a clone. Max has some local commits ('f' and 'g') that are not yet pushed to 'Main'. This is how it looks now (the pipes are pushes/pulls):
A--B1--B2--C--D1--D2--D3--E (Main)
| | | |
A--B1--B2--C--D1--D2--D3--E--f--g (Max)
'B1' and 'B2' as well as 'D1', 'D2' and 'D3' are changes that only make sense together. We would like to combine 'B1' and 'B2' to a single changeset 'B' and combine 'D1', 'D2' and 'D3' to a single changeset 'D'. The new structure should look like this:
A--B--C--D--E (Main)
| | |
A--B--C--D--E--f--g (Max)
My (main) question is: What is the recommended way of doing this?
Now let's make things worse:
We have a branch that was merged within the change-sets that we want to collapse. It would look like this:
A--B1--B2--C--D1--D2------D4--E (Main)
| | \-------D3-/ |
| | |
A--B1--B2--C--D1--D2------D4--E--f--g (Max)
\-------D3-/
The new history should look like this:
A--B--C--D--E (Main)
| | |
A--B--C--D--E--f--g (Max)
How would you do that?
Thanks in advance.
It depends on how much effort you want to put into this. I don't know of a solution within Mercurial itself (the history-editing functions I know can't cope with merges), but Git does have the functionality you need.
If I really had to do such an operation, I would:
1. Try to convince management that this is not worth it.
2. Try harder to convince management that this is not worth it.
3. Make a backup! The following steps involve destructive operations, so consider this not optional. You have been warned.
4. Export the repo with hg-git into a Git repository.
5. Export the complete (Git) history into a fast-import stream with git fast-export --no-data --all > history.fi
6. Create a pseudo-history by editing history.fi, dropping your unwanted revisions.
7. Import the adjusted history into the Git repo with git fast-import --force < history.fi
8. Check extensively that the newly created history really is the way you want it.
9. Clone Max into a local work repository.
10. Remove the successors of commit A in the local work repository.
11. Pull your updated history back from Git (again with hg-git) into the local work repository.
12. Check that the Mercurial history matches your expectations (diffs of commits between the new and old repos, metadata such as time stamps and committer names, ...).
13. Remove the successors of commit A in every repo (Main, Max and every developer clone).
14. Push the partial history back to Main from the work repository: hg push -r E Main
15. Push the complete history back to Max from the work repository: hg push -r g Max

With Mercurial, how to run a job for each changeset when doing an update?

Question: When I update my working directory from one revision to the other, I'd like to run a script for each revision passed. How can I do this?
Important Constraints
The script is always the same
Traversal should happen iteratively (I don't want to do all this by hand for 100 revisions...)
The incoming hook is not an option. It must happen not only after pushes or pulls, but for all updates, no matter how often I switch between revisions.
For illustration:
r1(*) -- r2 -- r3 -- r4(head)
Basically, I'd like to do
r1(*) --> r2 (then run script) --> r3 (then run script) --> r4 (then...
Let's say my working directory is currently at r1 and I want to update it to r4. Instead of doing a direct update (as with hg update), I'd like to update to r2 first and then run a script (update-my-database, for example). Afterwards I'd like to update to r3, run the same script, and so on.
It is doable but we must be careful to select a linear piece from the revision graph.
Consider a minimal repository with two heads (the same applies to two branches or two bookmarks):
$ hg log -G --template "{rev}\n"
o 4
|
| o 3
| |
| o 2
|/
o 1
|
o 0
Say we start from 1. We want to go along either head 4 (so the linear piece would be: 1, 4) or along head 3 (so the linear piece would be: 1, 2, 3).
So we will use a script with two parameters: the beginning and the end of the linear piece.
This is bare-bones but works:
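# Walk every revision between 1 and 3 (inclusive), oldest first
for rev in $(hg log -r "descendants(1) and ancestors(3)" --template "{rev}\n")
do
    echo "Updating to $rev:"
    hg update "$rev"
    echo "Executing our script:"
    # Placeholder for the real script
    hg id --id --num
done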
Here is the output:
Updating to 1:
0 files updated, 0 files merged, 2 files removed, 0 files unresolved
Executing our script:
4d5aba4313ce 1
Updating to 2:
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
Executing our script:
f0de0712ec00 2
Updating to 3:
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
Executing our script:
d052bd7c310b 3
The script works regardless of the current revision of the working directory.
Before starting it, run hg pull to get the changes from the remote without applying them to the working directory, and hg log --graph to understand the topology and so select the starting and ending points of the linear piece.
You also have to replace the hardcoded 1 and 3 with the $1 and $2 shell positional parameters and replace the hg id line with your actual script :-)
To understand how it works: it uses a Mercurial feature called revsets (run hg help revsets to learn more), which lets you run queries on the history graph (a bit like SQL :-)
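The parameterised version described above could look roughly like this. It is a minimal sketch: the function name walk_revs is made up for illustration, and it assumes the two revisions really do bound a linear piece of history, which the revset alone does not guarantee.

```shell
# walk_revs START END COMMAND [ARGS...]
# Update the working directory to each revision from START to END (inclusive)
# and run COMMAND after every update. Stops at the first failure.
walk_revs() {
    start=$1
    end=$2
    shift 2
    # Same revset as in the answer: descendants of START that are also
    # ancestors of END.
    for rev in $(hg log -r "descendants($start) and ancestors($end)" --template "{rev}\n")
    do
        echo "Updating to $rev:"
        hg update "$rev" || return 1
        echo "Running $*:"
        "$@" || return 1
    done
}
```

With the question's example this would be called as walk_revs 1 4 ./update-my-database, where update-my-database stands for whatever script has to run per revision.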

How to get the revision count for file in Mercurial

Using templates, I want to find out how many times a file has been revised across all changesets. So, put another way, how many changesets feature that file.
Is there a way to do it? And can it be done with the Keywords extension?
And yes, I realise it's not really what Mercurial is about. I have sucky requirements:)
hg log -q filename | wc -l outputs the number of changesets.
It is a normal feature of a VCS to track when a file was changed; just run hg log THE_FILENAME to see all changesets that affect one specific file.
To count them, run for example hg log THE_FILENAME | grep -c "^changeset".
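Since the question asks about templates: the same count can be obtained by printing exactly one line per changeset through --template, which avoids depending on the layout of the default log output. This is a small sketch, and the function name count_changesets is made up for illustration.

```shell
# count_changesets FILE
# Print how many changesets touch FILE.
count_changesets() {
    # The template emits one line per matching changeset; wc -l counts the
    # lines and tr removes the padding some wc implementations add.
    hg log --template "{rev}\n" "$1" | wc -l | tr -d ' '
}
```

For example, count_changesets THE_FILENAME prints a single number.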
I thought I'd just add one more option to the list here since grep and wc (word count) may not be available in your console (Windows users especially). There is an equivalent functionality in PowerShell:
hg log -q filename | Measure-Object
This will return the count by default (and as you can see there are other options you can play with using Measure-Object)
Count : 14
Average :
Sum :
Maximum :
Minimum :
Property :
And if you are interested in how many commits the entire repository has, omit the filename parameter (keep -q so that each changeset prints a single line):
hg log -q | Measure-Object
Count : 492
Average :
Sum :
Maximum :
Minimum :
Property :

Did the behavior of `hg backout` change since the hg book was written?

I created a new repository, test-backout, and added a new file in it, file. I then made 4 commits, each time, appending the number of the commit to file using
echo [manually entered number] >> file
hg commit -m '[manually entered number]'
In effect, file had:
init
1
2
3
According to the hg book, if I run hg backout --merge 2, I should have:
init
1
3
but instead, it fails to merge and opens up my difftool (vimdiff), and I get 3 options:
init | init | init
1 | 1 |
2 | |
3 | |
I initially tried it with the --merge option, then again without it. My question now is, is there still a way for me to get:
init
1
3
Did I just make a mistake or miss something, or am I stuck with those options?
A big factor in why you got the 3-way merge is that your example is too artificial; I will get to that.
If I take a 50-line text file, change a different part of it each time, and commit each change, I don't have to resolve any conflicts. That is, I have 4 changesets: rev 0 adds the file, and revs 1, 2, and 3 each change one area of the file: the beginning, middle, or end.
In this situation, when I run hg backout 2, it creates the reverse of rev 2 and merges those changes into my working directory, and when I commit, the graph is linear:
@ backout 2
|
o 3
|
o 2
|
o 1
|
o initial
If I instead run hg backout 2 --merge, it automatically commits the backout as a child of the revision being backed out and then merges that with the tip, producing a branched graph after I commit the merge:
@ merge
|\
| o backout 2
| |
o | 3
|/
o 2
|
o 1
|
o initial
In both situations, I didn't have to do any 3-way merging. The reason you don't automatically get
init
1
3
and instead have to do a 3-way merge is that the changes are too close together. The context and changes of the changesets overlap completely: a diff hunk carries 3 lines of context by default, which still covers the entire file as of your 4th changeset.
A similar example would be 3 changesets that each modify the same line. If you backed out the middle change as you are doing here, you would still be presented with a 3-way merge that you would likely have to edit manually to get right.
By the way, behavior did change in 1.7, as attested by hg help backout:
Before version 1.7, the behavior without --merge was equivalent to specifying --merge followed by "hg update --clean ." to cancel the merge and leave the child of REV as a head to be merged separately.
However, I don't think that's quite what you suspected.

Mercurial: how can I see only the changes introduced by a merge?

I'm trying to get in the habit of doing code reviews, but merges have been making the process difficult because I don't know how to ask Mercurial to "show only changes introduced by the merge which were not present in either of its parents."
Or, slightly more formally (thanks to Steve Losh):
Show me every hunk in the merge that wasn't present in either of its parents, and show me every hunk present in either of its parents that isn't also present in 3.
For example, assume I have a repository with two files, a and b. If "a" is changed in revision 1, "b" is changed in revision 2 (which is on a separate branch) and these two changes are merged in revision 3, I'll get a history which looks like this:
@    changeset: 3
|\   summary: Merged.
| |
| o  changeset: 2
| |  summary: Changing b
| |
o |  changeset: 1
|/   summary: Changing a
|
o  changeset: 0
   summary: Adding a and b
But if I ask to see the changes introduced by revision 3, hg di -c 3, Mercurial will show me the same thing as if I asked to see the changes introduced in revision 1, hg di -c 1:
$ hg di -c 3
--- a/a
+++ b/a
@@ -1,1 +1,1 @@
-a
+Change to a
$ hg di -c 1
--- a/a
+++ b/a
@@ -1,1 +1,1 @@
-a
+Change to a
But, obviously, this isn't very helpful - instead, I would like to be told that no new changes were introduced by revision 3 (or, if there was a conflict during the merge, I would like to see only the resolution to that conflict). Something like:
$ hg di -c 3
$
So, how can I do this?
PS: I know that I can reduce the number of merges in my repository using rebase... But that's not my problem - my problem is figuring out what was changed with a merge.
The short answer: you can't do this with any stock Mercurial command.
Running hg diff -c 3 will show you the changes between 3 and its first parent -- i.e. the changeset you were at when you ran hg merge.
This makes sense when you think of branches as more than just simple changesets. When you run hg up 1 && hg merge 2 you're telling Mercurial: "Merge changeset 2 into changeset 1".
It's more obvious if you're using named branches. Say changeset 2 in your example was on a named branch called rewrite-ui. When you run hg update 1 && hg merge rewrite-ui you're effectively saying: "Merge all the changes in the rewrite-ui branch into the current branch." When you later run hg diff -c on this changeset it's showing you everything that was introduced to the default branch (or whatever branch 1 happens to be on) by the merge, which makes sense.
From your question, though, it looks like you're looking for a way to say:
Show me every hunk in this changeset that wasn't present in either of its parents, and show me every hunk present in either of its parents that isn't also present in 3.
This isn't a simple thing to calculate (I'm not even sure I got the description right just now). I can definitely see how it would be useful, though, so if you can define it unambiguously you might convince one of us Mercurial contributors that read SO to implement it.
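As a partial workaround for the first-parent behaviour described above, a merge can at least be diffed against each parent in turn. This is only a sketch: the function name diff_merge_parents is made up for illustration, and it does not compute the "present in neither parent" hunks the question asks for. It just puts both views side by side, using the p1() and p2() revset functions.

```shell
# diff_merge_parents REV
# Show the diff of merge changeset REV against each of its two parents.
diff_merge_parents() {
    echo "== changes relative to first parent =="
    hg diff -r "p1($1)" -r "$1"
    echo "== changes relative to second parent =="
    hg diff -r "p2($1)" -r "$1"
}
```

For the example repository, diff_merge_parents 3 would show the change to a in one section and the change to b in the other. Anything in the merge that matches neither parent's side, such as a conflict resolution, shows up in both sections.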
To do code reviews, you really want to see just the changes in the piece of work you are reviewing. To that end we use a new branch for each story and use pull requests to spotlight the changes, having merged all other changes into the story branch before creating the pull request. We host our code on Bitbucket, and its review / pull request tools are really very good, offering a side-by-side diff.