I have some SQL log backups scheduled to run every 15 minutes including a robocopy with the /MIR option, to an archive folder on a cloud storage volume using CloudBerry.
Sometimes, after a full backup and with a slow network, the full-backup archive copy has not completed by the time the log backup runs, and I suspect a problem caused by the second robocopy now also trying to copy the large full-backup file in addition to the new log backup.
What should happen? If the retry flag is set to /R:60, should the second instance somehow skip files already being copied by another robocopy instance, or will the two instances of robocopy step all over each other? Or must the second instance be run with the /R:0 option set to skip the first file still being copied?
I know this answer is a little late and I hope you found a solution, but here are my two cents:
Robocopy has options to "monitor a source" for changes: the /MON and /MOT options. This would prevent robocopy from being rerun; it would always be running, in what is essentially a hot-folder scenario.
From the help of robocopy:
/MON:n :: MONitor source; run again when more than n changes seen.
/MOT:m :: MOnitor source; run again in m minutes Time, if changed.
While this is quite an old question, I have not found a proper answer and find it to still be relevant, so here are my findings:
I ran a couple of tests and it seems that RoboCopy takes a snapshot of the source and destination directories and compares which files need to be copied from the point of the snapshot.
This means that if one RoboCopy instance starts immediately after another, the two instances will keep clashing and overwriting each other, as neither instance is aware of the changes the other is making in the destination directory.
If one instance (instance A) attempts to copy the same file that the other instance (instance B) is copying, instance A will error and either retry (if using /R) or skip to the next file (if using /R:0). Once instance B has finished with that file, it will try to copy the next file on its list, which will either error (if instance A is still copying it) or get overwritten (if instance A has already moved on to the next file).
So in the case of the question, the most likely behavior (assuming network speed and file sizes remain somewhat consistent) is that the new RoboCopy instance will overwrite the backup files at the beginning of the list while the original instance is still copying the last files on the list.
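One way to sidestep the clash entirely (assuming the copies are launched from a scheduler script you control, which the question doesn't state) is to wrap the robocopy call so a new run is skipped while a previous one still holds a lock file. This is only a sketch; the wrapper, paths, and command line are all hypothetical:

```python
import os
import subprocess

def run_exclusive(cmd, lock_path):
    """Run cmd only if no other wrapper instance holds the lock file.
    Returns the command's exit code, or None if the run was skipped."""
    try:
        # O_EXCL makes creation atomic: open fails if the lock already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None  # a previous copy is still running; skip this cycle
    try:
        return subprocess.call(cmd)
    finally:
        os.close(fd)
        os.remove(lock_path)

# Hypothetical usage from the 15-minute scheduled task:
# run_exclusive(["robocopy", r"D:\SQLBackups", r"\\archive\SQLBackups", "/MIR"],
#               r"D:\SQLBackups\copy.lock")
```

With this in place, the log-backup cycle that fires while the full backup is still copying simply does nothing, and the next cycle picks up the changes.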
I am working on designing a little project where I need to use Consul to manage application configuration dynamically, so that all my app machines can get the configuration at the same time without any inconsistency issues. We are already using Consul for service discovery, so I was reading more about it, and it looks like it has a key/value store that I can use to manage my configurations.
All our configurations are JSON files, so we make a zip file with all our JSON config files in it and store a reference from which this zip file can be downloaded in a particular key in the Consul key/value store. All our app machines need to download this zip file from that reference (stored in a key in Consul) and save it to disk on each app machine. Now I need all app machines to switch to the new config at approximately the same time to avoid any inconsistency issues.
Let's say I have 10 app machines, and all 10 machines need to download the zip file with all my configs and then switch to the new configs at the same time, atomically, to avoid any inconsistency (since they are taking traffic). Below are the steps I came up with, but I am confused about how loading the new files into memory and switching to the new configs will work:
All 10 machines are already up and running with default config files as of now which is also there on the disk.
Some outside process will update the key in my consul key/value store with latest zip file reference.
All 10 machines have a watch on that key, so once someone updates the value of the key, the watch will be triggered, and then all 10 machines will download the zip file onto disk and uncompress it to get all the config files.
(..)
(..)
(..)
Now this is where I am confused about how the remaining steps should work.
How should the apps load these config files into memory and then all switch at the same time?
Do I need to use leader election with Consul, or anything else, to achieve any of this?
What would the logic around this look like, since all 10 apps are already running with default configs in memory (which are also stored on disk)? Do we need two separate directories, one for the default configs and one for the new configs, and then work with these two directories?
Let's say this is the node I have in Consul, just a rough design (which could be wrong here):
{"path":"path-to-new-config", "machines":"ip1:ip2:ip3:ip4:ip5:ip6:ip7:ip8:ip9:ip10", ...}
where path holds the new zip file reference, and machines could be a key listing all the machines, so each machine can add its IP address to that key as soon as it has downloaded the file successfully? Once the machines list has a size of 10, I can say we are ready to switch? If yes, how can I atomically update the machines key in that node? Maybe this logic is wrong, but I just wanted to throw something out. I would also need to clean up the machines list after the switch, since I'll need to do the same exercise for the next config update.
Can someone outline the logic for how I can efficiently manage configuration on all my app machines dynamically while also avoiding inconsistency issues? Maybe I need one more node, status, with details about each machine's config: when it was downloaded, when the machine switched, and so on?
I can think of several possible solutions, depending on your scenario.
The simplest solution is not to store your config in memory and files at all: just store the config directly in the Consul KV store. And I'm not talking about a single key that maps to the entire JSON (I'm assuming your JSON is big, otherwise you wouldn't zip it), but extracting smaller key/value sets from the JSON (this way you won't need to pull the whole thing every time you make a query to Consul).
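To illustrate that extraction (a sketch; the slash-separated key naming is my assumption, not a Consul requirement), a nested JSON config can be flattened into individual key paths before being written to the KV store:

```python
def flatten(config, prefix=""):
    """Flatten a nested JSON-style dict into flat key paths,
    e.g. {"db": {"host": "x"}} -> {"db/host": "x"}."""
    flat = {}
    for key, value in config.items():
        path = prefix + key
        if isinstance(value, dict):
            # Recurse into nested objects, extending the key path.
            flat.update(flatten(value, path + "/"))
        else:
            flat[path] = value
    return flat

# Each flat pair would then be written as its own Consul key, e.g.
# PUT /v1/kv/myapp/db/host with body "x", so a query touches only
# the keys it actually needs.
```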
If you get the config directly from Consul, your consistency guarantees match Consul's consistency guarantees. I'm guessing you're worried about performance if you lose your in-memory config; that's something you need to measure. If you can tolerate the performance loss, though, this will save you a lot of pain.
If performance is a problem here, a variation on this might be to use fsconsul. With this, you'll still extract your json into multiple key/value sets in consul, and then fsconsul will map that to files for your apps.
If that's off the table, then the question is how much inconsistency you are willing to tolerate.
If you can stand a few seconds of inconsistency, your best bet might be to put a TTL (time-to-live) on your in-memory config. You'd still have the watch on Consul, but you'd combine it with evicting your in-memory cache every few seconds, as a fallback in case the watch fails (or stalls) for some reason. This gives you a worst case of a few seconds of inconsistency (depending on the value you set for the TTL), while the normal case should (I think) be fast.
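A minimal sketch of that TTL fallback (the class and the loader callable are my invention, not a Consul API; the loader stands in for whatever reads the config from Consul):

```python
import time

class TTLConfig:
    """Keep a config in memory, but treat it as stale after ttl seconds,
    so a broken watch can leave you inconsistent for at most ttl."""

    def __init__(self, loader, ttl=5.0, clock=time.monotonic):
        self._loader = loader      # e.g. a function that reads from Consul
        self._ttl = ttl
        self._clock = clock        # injectable for testing
        self._config = None
        self._loaded_at = None

    def get(self):
        stale = (self._loaded_at is None
                 or self._clock() - self._loaded_at >= self._ttl)
        if stale:
            self.refresh()
        return self._config

    def refresh(self):
        """Called on expiry, and also by the Consul watch handler."""
        self._config = self._loader()
        self._loaded_at = self._clock()
```

The watch handler calls refresh() on a key change for the fast path; get() re-reads on its own only when the cached copy has aged past the TTL.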
If that's not acceptable (does downloading the zip take a long time, maybe?), you can go down the route you mentioned. To update a value atomically you can use Consul's cas (check-and-set) operation. It will return an error if an update happened between the time you read the value and the time Consul tried to apply your write. In that case you pull the list of machines again, reapply your change, and retry until it succeeds.
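That read-modify-retry loop can be sketched like this. The kv_get/kv_put callables are stand-ins I've invented for Consul's KV HTTP API (where a GET on /v1/kv/&lt;key&gt; returns the value with its ModifyIndex, and a PUT with ?cas=&lt;index&gt; reports failure on a conflict); the field layout follows the node design from the question:

```python
def register_download(kv_get, kv_put, key, machine_ip):
    """Append machine_ip to the colon-separated 'machines' field at key,
    retrying until the check-and-set write succeeds.
    kv_get(key) -> (value_dict, modify_index)
    kv_put(key, value_dict, cas_index) -> bool (False on CAS conflict)."""
    while True:
        value, index = kv_get(key)
        machines = [m for m in value.get("machines", "").split(":") if m]
        if machine_ip not in machines:
            machines.append(machine_ip)
        updated = dict(value, machines=":".join(machines))
        # Only succeeds if nobody else modified the key since our read.
        if kv_put(key, updated, cas_index=index):
            return updated
```

Each machine calls this after its download finishes; whichever write lands on a stale index simply loops and reapplies its IP on top of the fresh list.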
I don't see why you would need two directories, but maybe I'm misunderstanding the question. When your app starts, before you do anything else, check whether there's a new config; if there is, download it and load it into memory. So you shouldn't have a "default config" if you want to be consistent. Once you've downloaded the config on startup, you're up and alive. When your watch signals a key change, you can download the config and directly overwrite your old one. This assumes the watch-triggered code runs on a single thread, so you won't be downloading the file multiple times in parallel. If the download fails, you're not going to load the corrupt file into memory; and if you crash mid-download, you'll download again on startup, so you should be fine.
Can a snakemake pipeline be run with two different configs from the same working directory?
The config files here would have a "project name" parameter that defines the input and output paths for the pipeline. Since snakemake locks the working directory, I wonder whether running the same pipeline with different config files in the same working directory would result in some conflict. If so, is there any viable alternative strategy for this scenario?
Yes, you can choose the config file using snakemake --configfile my_config_file, and you can run two instances of snakemake at the same time. Snakemake does not lock the directory itself; it has two types of locks, input locks and output locks. If there is no overlap between the files created by the two workflows, they can run simultaneously. If there is an overlap in the files the workflows will create, you should create those files first. Overlap in input files is not a problem. A workflow only releases its locks after it completes or is interrupted. Note that it takes a bit of time for snakemake to set up its locks, so launching two instances at exactly the same time can occasionally cause problems.
I am facing a very weird issue with DFS-R. Recently I had to recreate two Replication Groups (Data Collection) in order to restore file replication after a disaster with one of the servers. Everything went well during the whole process, and both servers are already in sync with each other. However, when I run a WMI query (Wmic /namespace:\root\microsoftdfs path dfsrreplicatedfolderinfo get replicationgroupname,replicatedfoldername,state) to see the status of the Replication Groups, I see duplicate values for a given RG with a state of '0' (Uninitialized).
I have already checked everything I could: if I open the DFS Management console, I don't see any duplicate Replication Groups there; the contents of the XML configuration files under "C:\System Volume Information\DFSR" and "E:\System Volume Information\DFSR" (the latter being where the replicated data resides) are OK (no duplicate entries); and in the registry (HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\DFSR), I also only see what's really configured (again, no duplicate values).
I strongly suspect it's some sort of leftover in the WMI repository that was not handled properly by DFS-R when the Replication Groups were re-created, but as I've never had to delete specific instances from WMI, I am wondering whether anyone out there has faced the same issue or has any clues on how to get rid of the disturbing duplicate results returned by "Wmic /namespace:\root\microsoftdfs path dfsrreplicatedfolderinfo get replicationgroupname,replicatedfoldername,state".
Just wanted to emphasize that replication is working fine between the two boxes (in both directions), the only issue being faced here is the duplicate thing.
I managed to figure out what was wrong by myself. As it turns out, the duplicate entries returned by "Wmic /namespace:\root\microsoftdfs path dfsrreplicatedfolderinfo get replicationgroupname,replicatedfoldername,state" were due to leftovers in the folder dedicated to the staging area: when the Replication Groups were recreated, the contents of the staging folder were not manually deleted, and DFS-R didn't handle that very well. To resolve it, I created a new folder for staging, pointed the DFS Replication Group in the DFS Management console to the new folder, gave it over 12 hours (I was really cautious here due to the size and criticality of the data in scope) to ensure DFS-R had fully recognized the new configuration, stopped the DFS Replication service, deleted the old staging folder, and finally restarted the DFS Replication service. Now everything is back on track and looking good. :-)
I'm developing a small photo sharing Rails app which will read and display photos from a library of photos on the local filesystem.
In order to avoid scanning the filesystem every time the user loads the page, I want to set up an hourly cron job that indexes all the files and stores the results in a local MySQL table.
What's the best way to scan the local filesystem and store metadata about local files (e.g. size, file type, modified date, etc.)? Is there a convenient Ruby-based library? I'd also like to be able to "watch" the filesystem to know when files have disappeared since the last scan, so that they can be deleted from my table.
Thanks!
You will want to look into inotify.
https://github.com/nex3/rb-inotify
You can set a watch (register a callback in the Linux kernel) on a file or a directory, and every time something changes in that file/directory, the kernel will notify you immediately with a list of what has changed.
Common events are listed here: https://en.wikipedia.org/wiki/Inotify
You will notice that IN_CREATE + IN_DELETE are the events you are looking for.
Side note: IN_CREATE fires when the file is merely created (it's still empty), so you will need to wait for IN_CLOSE_WRITE to know that data has finished being written to the file.
I see an apparently random problem about once a month that is doing my head in. Google appears to be changing the naming convention for disks additional to the root disk, and how they are presented under /dev/disk/by-id/ at boot.
All the time the root disk is available as /dev/disk/by-id/google-persistent-disk-0
MOST of the time the single extra disk we mount is presented as /dev/disk/by-id/google-persistent-disk-1
We didn't choose this name, but we wrote our provisioning scripts to expect this convention.
Every now and then, on rebooting the VM, our startup scripts fail in executing a safe mount:
/usr/share/google/safe_format_and_mount -m "mkfs.ext4 -F" /dev/disk/by-id/google-persistent-disk-1 /mountpoint
They fail because something has changed the name of the disk. It's no longer /dev/disk/by-id/google-persistent-disk-1; it's now /dev/disk/by-id/google-{the name we gave it when we created it}.
Last time I updated our startup scripts to use this new naming convention it switched back an hour later. WTF?
Any clues appreciated. Thanks.
A naming convention beyond your control is not a stable API, and you should not write your management tooling to assume the convention will never change; as you can see, it's changing for reasons you have nothing to do with, and it's likely to change again. If you need the list of disks on the system, query it through udev, or consider using /dev/disk/by-uuid/ instead of /dev/disk/by-id/; the UUID is generated at filesystem creation and will not change.