Can bfg-repo-cleaner list affected files before deleting them?

From the command-line options I cannot see a way to preview what BFG will do before it actually does it.
If I run this command:
$ bfg --strip-blobs-bigger-than 1M --replace-text banned.txt repo.git
Can I get a list of the files larger than 1M before actually deleting them?

See https://github.com/rtyley/bfg-repo-cleaner/issues/17 for a discussion on a dry-run feature.
Essence of my comments there: Git makes it very cheap to create additional local clones where you can do test runs of BFG, and those runs give you real output to verify in addition to BFG's report; that is superior to a report-only or dry-run mode.
Just make an additional local clone of the repo, run BFG against it, then read the reports it produces and examine the resulting repo.
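A minimal sketch of that workflow, using an illustrative repo URL; the plain-Git listing at the end is just one way to preview which blobs exceed the 1M threshold before running BFG at all:

# Make a throwaway mirror clone and test-run BFG against it.
git clone --mirror https://example.com/repo.git repo-test.git
bfg --strip-blobs-bigger-than 1M --replace-text banned.txt repo-test.git
# Inspect the report directory BFG writes next to the repo, then examine
# the cleaned history before repeating the run against the real repo.

# Optional preview with plain Git: list all blobs over ~1 MB (1048576 bytes).
git -C repo-test.git rev-list --objects --all |
  git -C repo-test.git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk '$1 == "blob" && $3 > 1048576 {print $3, $4}' | sort -nr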

Related

Running a snakemake pipeline with different configs from same working directory

Can a snakemake pipeline be run with two different configs from the same working directory?
Config files here would have a "project name" parameter that would define the input and output path for the pipeline. Since snakemake locks the working directory, I wonder if running the same pipeline with different config files in the same working directory would result in some conflict. If yes, is there any viable alternative strategy for this scenario?
Yes, you can choose the config file with snakemake --configfile my_config_file, and you can run two instances of Snakemake at the same time. Snakemake does not lock the working directory itself; it takes locks on input and output files. If there is no overlap between the files the two workflows create, they can run simultaneously. If there is overlap in the files the workflows will create, create those files first; overlap in input files is not a problem. A workflow only releases its locks after it completes or is interrupted. Note that it takes Snakemake a moment to set up its locks, so launching two instances at exactly the same time can occasionally cause problems.
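For example, a minimal sketch assuming two hypothetical config files, project_a.yaml and project_b.yaml, whose project-name parameter points each run at its own output paths so the output locks do not overlap:

snakemake --configfile project_a.yaml --cores 4 &
sleep 10    # stagger the second launch so lock setup doesn't race
snakemake --configfile project_b.yaml --cores 4 &
wait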

Is it safe to remove bundles in strip-backup folders?

Recently I rewrote a lot of history (Forgive me Father, for I have sinned). Our old repository had a lot of sensitive information as well as unnecessary merges (up to 20 anonymous branches running simultaneously and being merged back indiscriminately), so I stripped several commits, pruned dead branches, rebased / squashed commits, rolled back unnecessary merges, created bookmarks, etc.
We now have a clean repo. I have also run unit tests at several revisions to make sure that I haven't broken anything important. Yesterday I forked the old repo (for backup purposes) and pushed the clean repository upstream. We are a small team and synchronizing changes was not a problem; every developer on my team is already working with the new repo.
Anyway, my local repository now has a .hg/strip-backup folder of around 2 GB.
From what I was able to understand, this folder contains backup bundles for every one of the destructive commands that I have run. I no longer need those.
My question is: Is it safe to remove the bundles inside .hg/strip-backup? Or will I corrupt my local repository if I delete those files?
Bonus question: Is there a built-in mercurial command to remove backups or should I just use rm .hg/strip-backup/*?
Yes, it is safe to remove the whole folder. The information contained in the folder is not relevant to the repo.
As a bonus answer, the simplest way to clean up the backup/cache folders is to re-clone the repo locally. A fresh clone starts clean, and all the temporary files stay behind in the original repo. Replace the original repo with the clone and you won't have to bother with the accumulated backup files for a while.
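Both options as a minimal sketch (the repo path is illustrative):

# Option 1: delete the backup bundles; they are not part of the repo's history.
rm -rf myrepo/.hg/strip-backup

# Option 2: re-clone locally and swap the directories.
hg clone myrepo myrepo-clean
mv myrepo myrepo-old
mv myrepo-clean myrepo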

Mercurial - a simple way to lock a repository

My scenario:
A set of shared repositories needs to be locked while a process runs; after the process is done, I want to unlock them. The process does not run on the repositories themselves but on a different system.
The repositories are not what the process is working on; I just need a time frame during which they are "protected" and cannot change while the process is running.
I want a simple way to lock a repository, so no one can push to it.
If I manually create a .hg/store/lock file with dummy content, do you see any problem with that?
Initial testing shows it works, but I'm concerned that I might not be aware of the implications.
If you just need to generally deny access to the repos for a given period, then you can do it that way. There shouldn't be any side-effects or other consequences.
Clone the repository and then run your process against the cloned repo.
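A minimal sketch of the manual-lock approach described in the question (the repo path is illustrative; verify the behaviour on a test repo first):

echo "locked for maintenance" > /srv/hg/myrepo/.hg/store/lock   # pushes now block
# ... run the external process ...
rm /srv/hg/myrepo/.hg/store/lock                                # unlock again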

Sync projects and databases between 2 computers with git

I was wondering what ways there are to sync web projects (initialized with Git) and MySQL databases between two computers without using a third one as a "server".
I already know that I could use a service like Dropbox and sync data with it, but I don't want to do it that way.
If the two servers aren't always available (in particular not available at the same time), then you need an external third-party source for your synchronization.
One solution for the Git repo is git bundle, which lets you create a kind of "bare repo" in a single file.
Having only one file to move around makes any sync operation easier.
You copy the bundle from one machine to the other (by whatever means you want), and the second repo (on the second machine) pulls from it; a git bundle acts as a bare repo that you can pull from.
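A minimal sketch of that round trip (machine names and branch name are illustrative):

# On computer A: pack the whole repo into a single file.
git bundle create project.bundle --all

# Move project.bundle to computer B (USB stick, scp, ...), then on B:
git clone project.bundle project      # first time only
# Later, after copying over a fresh bundle:
cd project
git pull ../project.bundle main       # pull the new commits from the bundle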
Just clone from one to the other. In Git, there is no real difference between server repos and local repos in terms of pulling and cloning. Pushing from one to the other is tricky if neither is created as bare; generally in that case, rather than pushing from one to the other, we pull back and forth as needed.
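For instance, if the two machines can reach each other over SSH (hostname and paths are illustrative), pulling back and forth looks like this:

# On computer A, clone directly from computer B:
git clone ssh://user@computer-b/home/user/project project
# Later, sync by pulling whichever side has new commits:
git pull            # run on A to fetch B's changes, and vice versa on B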

rsync and MyISAM tables

I'm trying to use rsync to backup MySQL data. The tables use the MyISAM storage engine.
My expectation was that after the first rsync, subsequent rsyncs would be very fast. It turns out, if the table data was changed at all, the operation slows way down.
I did an experiment with a 989 MB MYD file containing real data:
Test 1 - recopying unmodified data
rsync -a orig.MYD copy.MYD
takes a while as expected
rsync -a orig.MYD copy.MYD
instantaneous - speedup is in the millions
Test 2 - recopying slightly modified data
rsync -a orig.MYD copy.MYD
takes a while as expected
UPDATE table SET counter = counter + 1 WHERE id = 12345
rsync -a orig.MYD copy.MYD
takes as long as the original copy!
What gives? Why is rsync taking forever just to copy a tiny change?
Edit: In fact, the second rsync in Test 2 takes as long as the first. rsync is apparently copying the whole file again.
Edit: Turns out when copying from local to local, --whole-file is implied. Even with --no-whole-file, the performance is still terrible.
rsync still has to read the file and calculate block hashes to determine what has changed. The no-modification case is most likely a shortcut that only looks at file modification time and size.
rsync uses an algorithm that first checks whether a file has changed and then works out which parts of it changed. In a large database file, changes are commonly spread throughout a large portion of the file, and that is rsync's worst-case scenario.
rsync is file-based. If you found a way to do this with a block-based system, you could back up only the blocks/bytes that actually changed.
LVM snapshots might be one way of doing this; see the sketch below.
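A minimal sketch of one way to use an LVM snapshot here, as a frozen, consistent source to copy from (volume group and logical volume names are illustrative; by itself this gives consistency rather than block-incremental transfer). The FLUSH and the snapshot must happen in the same MySQL session, hence the client-side system call:

mysql <<'SQL'
FLUSH TABLES WITH READ LOCK;
system lvcreate --size 1G --snapshot --name mysql_snap /dev/vg0/mysql_data
UNLOCK TABLES;
SQL

mount /dev/vg0/mysql_snap /mnt/mysql_snap
rsync -a /mnt/mysql_snap/ /backup/mysql/     # or to a remote host
umount /mnt/mysql_snap
lvremove -f /dev/vg0/mysql_snap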
When doing local copies, rsync defaults to --whole-file for a reason: it's faster than running the delta checks.
If you want the fastest local copy, you already got it.
If you want to see the rsync speedup, copy over the network. It's impressive, but won't be faster than a local full copy.
For local copies, rsync is a nice replacement for cp when you have a big directory where only some files change: it copies the changed files whole but quickly skips unmodified ones (checking just timestamps and file size). For a single big file, it's no better than cp.
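To actually see the delta transfer pay off, a sketch of the over-the-network variant (host and destination path are illustrative); --stats reports how many bytes were really sent:

rsync -a --stats orig.MYD backuphost:/backups/orig.MYD
# ...apply the small UPDATE, then re-run:
rsync -a --stats orig.MYD backuphost:/backups/orig.MYD
# The second run still reads the whole file on both ends, but sends only
# the changed blocks over the wire.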