My requirement is alike with Dropbox. I was told to research Mercurial, but it stores all history at every repository, and diff file at local, transfer reversion only.
But I must use remote-diff (rsync). I think replacing Mercurial's diff algorithm with rsync is not easy,it must change most of the code of Mercurial.
And if I implement this only based on librsync, too many things left me to write a reliable “dropbox”. whether syncML will be helpful? but because I only sync files, whether it's also too complex?
Because my synchronization is not distributed, maybe SVN will suit my requirement more? But SVN is also not using rsync.
Related
I have my Banana Pi set up as my Mercurial server. It works well for me for my software as generally speaking I have firmware and that's about it in my repositories. I can access it via open VPN from anywhere in the world. However, I have started to use version control for my PCB files as well now, due to a new CAD system which complicates my old, crude but effective way of doing my PCB archiving and backup. (Also, everything in my new CAD system, all the PCBs and schamtics, are text files which makes version control work nicely.)
So, with Mercurial I started doing as I did with software and creating a new repo for my PCB for one of the boards I'm updating for a customer, and immediately came across an issue that svn seems to cope with easily and I was wondering whether Mercurial can do the same.
I have my BH0001 project repository which has all the embedded C in it and I have started creating a new issue of the PCB for which the C code is used. I had to create a new Mercurial repo called BH0001_pcb to differentiate between code and PCB. With svn you can have a project repo and then Hardware and Software directories within the project number, but still be able to check out the two different types of files to different places independently.
I could, of course, clone the BH0001 software repository to a local machine, add the PCB info in a new folder in the local Mercurial repo send it all back to the server and it would be perfectly happy. The problem then comes when checking out because I would be cloning both firmware and PCB on to a machine when I might only want one or the other.
Also, this goes against how I store stuff locally. In my /username/home directory I have a Software directory and a CAD directory and within those I have projects. So I would have:
home/CAD/CustomerName/BH0001
and
home/Software/CustomerName/BH0001.
If I'm to carry on using my current method do I have to:
Change my local directory structures to be something like:
home/Projects/CustomerName/BH0001/CAD
and
home/Projects/CustomerName/BH0001/Software
Suck it up and use things like ProjectName_pcb for separate repos.
Some other way I can't think of/can't find/am unaware of? e.g. There's a way of checking out part of a Mercurial repository to one directory and a different part of the repo to a different directory.
Or should I just use svn if I really want to carry on as I have?
With default mercurial you currently cannot do partial repository clones as you can do with SVN. So your approach to use separate repositories is a good choice.
However there ways to achieve a similar result: sub-repositories. In your case I'd create a parent repository which contains your two current repositories as sub-repositories. Mind though, sub-repositories have some rough edges, so read the linked page carefully - I'd like to especially stress that it's good practise to have a parent repo which basically only contains the 'real' repos but not much on its own.
There exist ideas like a thin or narrow clone (which is somewhat identical to what SVN does), but I haven't seen them in production.
In this F8 conference video(starting 8:40) from 2015 they speak about the advantages of using Mercurial and a single repository across facebook.
How does this work in practice? Using Mercurial, can i checkout a subdirectory (live in SVN)? If so, how? Do i need a facebook-mercurial-extension for this
P.S.: I only found answers like this or this from 2010 on SO where i am not sure if the answers still apply with all the efforts FB put into it.
From your question it is not clear if you are looking for a workflow (the monorepo vs multiple repos debate) or for performance and scaling for a huge code base.
For the workflow, I suggest googling for monorepo. It has its pros and cons, you need to understand your situation and current workflow to decide. For the performance and scaling, keep reading.
The idea of remotefilelog is not to checkout a subdirectory (as you mention), the idea is to checkout everything. In order to do that in an efficient way, you need two extensions actively developed by Facebook:
remotefilelog. This gives you something conceptually similar to a shallow clone. This reduces hg clone and hg pull time.
fsmonitor (previously called hgwatchman, it is now part of mercurial core). This dramatically reduces time of local operations such as hg status. Note that fsmonitor is independent from remotefilelog. You can start experimenting with this, since it doesn't require any setup on the server side.
With a recent mercurial (which I strongly suggest) you can shave off the additional startup time of the Python interpreter using CommandServer + CHg.
Some additional notes:
I tested extensively fsmonitor. It works very well, on huge repos the time of hg status is reduced from 10 secs to less than 1 sec (and the majority of this 1 sec is Python startup time, see above for CHg). If your repository is really huge, you might need to fine tune some inotify kernel parameters (or the equivalent on MacOSX). The fsmonitor documentation has all the information you need.
I didn't test remotefilelog, although I read everything I found about it and I am sure it works. Depending on how development is done (everybody has always Internet connectivity or not, the organization has its own master repo or not) there can be a caveat: it partially transforms the decentralized hg into a centralized VCS like svn: some operations that normally can be done offline (for example: hg log and the first hg update to a changeset in the past) will now require connectivity to the master repository.
Before considering remotefilelog, I used extensively the largefiles extension on a huge repo. It has the same drawbacks than remotefilelog and some confusing corner cases for users that want to use hg just to get things done without taking the time to understand how it works. If I were to manage another huge repo, I would use remotefilelog instead than largefiles, although their use case is not really the same.
Mercurial has also support for subrepositories (doc1, doc2). The problem is that it changes the behavior of hg depending on where you are in the source tree. Again, if the developers don't care about really understanding how hg works, it will be just too confusing.
Additional information:
Facebook Engineering blog post
scaling mercurial wiki, although not completely up to date
just by googling mercurial facebook.
i am not sure if the answers still apply with all the efforts FB put
into it
(Early 2017) The answers in the questions linked still apply (because they occasionally get updated) but note that you will have to read all the comments and answers.
remotefilelog essentially allows on demand shallow clones (so you don't fetch the history for everything for all time) but you still fetch the essential metadata for, and checkout across, all the directories of the repo at the desired revision.
Using Mercurial, can i checkout a subdirectory (li[k]e in SVN)? If so, how?
https://stackoverflow.com/a/40355673/7836056 discusses how you might use third party extensions to allow narrow/sparse checkouts (Facebook's sparse.py) or narrow clones (Google's NarrowHG) with Mercurial thus only "creating" a single directory from within the main repository (albeit with radically different tradeoffs).
(Note phrasing matters: "sparse checkout" means a very specific action when referring to distributed version control in a manner that doesn't exist when using it to refer to centralised version control)
This might be a noob question. but I'm really torned between adding documents to my repository, in this case Mercurial.
by documents i meant, files that doesn't really go into your program. like PSD, doc, xls.
what's the best way to handle those files, or how do you handle your documents.
Take a look at the Largefiles extension that shipped with Mercurial 2.0 (with bugfixes since). It's designed to treat files that are binary and update rarely in a different, more efficient way.
Basically it stores those files without trying to compute diffs between versions, and anybody cloning the repo just gets the versions they need, and not all the history. This leads to faster cloning / pulling, but updates may need a connection to the remote repository to read versions of files into the local cache.
I toss them in my repository. It's nice to track changes of them and see old revisions anyway. I can see old revisions of a design document or see what the previous art was for an asset (maybe a graphic designer removed the alpha channel and he/she wasn't supposed to). Throw it in there. If it doesn't change, it's not taking up any more space with a good source control system than storing it outside of source control.
I'm working on a collaborative scientific project that is made up by a handful of Python scripts (1M max) and a relatively large dataset (1.5 GB). The datasets are tightly linked to the python scripts since the datasets themselves are the science and the scripts are a simple interface to them.
I'm using Mercurial as my source control tool, but I am not clear on a good mechanism to define the repository. Logistically it makes sense to bundle these together so that by cloning the repository you'd get the entire package. On the other hand, I'm concerned about the source control tool dealing with large amounts of data.
Is there a clean mechanism to handle this?
If the data files change rarely and you normally need all of them anyway, then just add them to Mercurial and be done with it. All your clones will be 1.5 GB, but that is just the way it has to be with that amount of data.
if the data is binary data and changed often, then you might try to avoid downloading all the old data. One way to do this is to use a Subversion subrepository. You will have a .hgsub file with
data = [svn]http://svn.some.edu/me/ourdata
which tells Mercurial to make a svn checkout from the right-hand side URL and put the Subversion working copy into your Mercurial clone as data. Mercurial will maintain an additional file for you called .hgsubstate, in which it records the SVN revision number to checkout for any given Mercurial changeset. By using Subversion like this, you only end up with the latest version of the data on your machine, but Mercurial will know how to get older versions of the data when needed. Please see this guide to subrepositories if you go down this route.
There is an article on the official wiki about large binary files. But the proposition of #MartinGeisler is a really nice new alternative.
My first inclination is to separate the python scripts out into their own repository, but I really need more domain information to make the "right" call.
On the one hand, if new datasets will be created then you would want a core set of tools to be able to handle all of them, right? But I can also see how new datasets may introduce cases that the scripts may not have previously handled... although it seems like in an ideal world you would want scripts that are written in a general way so they can handle future data and existing datasets??
I am writing a set of django apps and would like to use Hg for version control. I would like each app to be independent of the others so in each app there may be a directory for static media that contains images that I would not want under version control. In other words, the binary files would not all be in one central location
I would like to find a way to clone the repository that would include copies of the image files. It also would be great if when I did a merge, if there were an image file in one repo and not another, that there would be some sort of warning.
Currently I use a python script to find images and other binary files that are in one repo, but not the other. But a lot of people must face this problem, so there must be a more robust and elegant solution.
One one other thing...for reasons I do not want to go into, usually one of my repos is on a windows machine, and the other is on Linux. So a crossplatform solution would be nice.
Since Mercurial 2.0 the extension largefiles is now included in the main distribution. That extension keeps and manages large files outside of the "normal" repository in a way that you get the benefit of DCVS but without the benefit of exponential size and processing time growth.
Other extension that work along similar lines are SnapExtension and BigFilesExtension. However, those two are not distributed with Mercurial (you have to get them manually).
Mercurial can track any kind of file, for binary files if something changes then the whole file gets replaced not just the changes.
On the getting a warning if one repo doesn't contain a file, that's kind of the point of a DVCS is that the repos are related but are autonomous. You could always check and see what files were added during a synch or merge operation.
The current Mercurial book (by Bryan O'Sullivan) says, that Mercurial stores diffs also for binary files. How efficient this is, obviously depends on the nature of changes to binary files.