What will happen with a gsutil command if a DRA bucket's contents are unavailable? - google-compute-engine

I'm on a DRA (Durable Reduced Availability) bucket and I run the gsutil rsync command quite often to upload/download files to/from the bucket.
Since files could be unavailable (because of the DRA), what exactly will happen during a gsutil rsync session when that scenario is hit?
Will gsutil just wait until the unavailable files become available and complete the task, thus always downloading everything from the bucket?
Or will gsutil exit with a warning about a certain file not being available, and if so exactly what output is being used (so that I can make a script to look for this type of message)?
What will the return code be of the gsutil command in a session where files are found to be unavailable?
I need to be 100% sure that I download everything from the bucket, which I'm guessing can be difficult to keep track of when downloading hundreds of gigabytes of data. In case gsutil rsync completes without downloading unavailable files, is it possible to construct a command which retries the unavailable files until all such files have been successfully downloaded?

1. If your files exceed the resumable threshold (as of 4.7, this is 8MB), any availability issues will be retried with exponential backoff according to the num_retries and max_retry_delay configuration variables. If the file is smaller than the threshold, it will not be retried (this will be improved in 4.8 so small files also get retries).
2. If any file(s) fail to transfer successfully, gsutil will halt and output an exception depending on the failure encountered. If you are using gsutil -m rsync or gsutil rsync -C, gsutil will continue on errors and at the end you'll get a CommandException with the message 'N file(s)/object(s) could not be copied/removed'.
3. If retries are exhausted and/or either of the failure conditions described in #2 occur, the exit code will be nonzero.
4. In order to ensure that you download all files from the bucket, you can simply rerun gsutil rsync until you get a zero exit code.
5. Note that gsutil rsync relies on listing objects. Listing in Google Cloud Storage is eventually consistent. So if you upload files to the bucket and then immediately run gsutil rsync, it is possible you will miss newly uploaded files, but the next run of gsutil rsync should pick them up.
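The "rerun until the exit code is zero" advice can be sketched as a small shell helper. This is only an illustration: the function name retry_until_success, the attempt limit and the pause are assumptions, and the bucket and directory names in the usage comment are placeholders.

```shell
#!/usr/bin/env bash
# Rerun a command until it exits with status 0, up to a maximum number of attempts.
# Usage: retry_until_success MAX_ATTEMPTS PAUSE_SECONDS command [args...]
retry_until_success() {
  local max_attempts=$1 pause=$2
  shift 2
  local attempt=1
  while true; do
    # A zero exit status means every file was transferred.
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$pause"
  done
}

# Example (placeholder names):
#   retry_until_success 10 60 gsutil -m rsync -r gs://my-bucket ./local_dir
```

Because rsync skips objects that are already in sync, each rerun should only have to deal with whatever failed last time.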

I did some tests on a project and could not get gsutil to throw any errors. Afaik, gsutil operates on the directory level, it is not looking for a specific file.
When you run, for example $ gsutil rsync local_dir gs://bucket , gsutil is not expecting any particular file, it just takes whatever you have in "local_dir" and uploads it to gs://bucket, so :
gsutil will not wait, it will complete.
you will not get any errors - the only errors I got is when the local directory or bucket are missing entirely.
if, let's say, a file is missing in local_dir but is available in the bucket, and you then run $ gsutil rsync -r local_dir gs://bucket, nothing will change in the bucket. With the "-d" option, the file will be deleted on the bucket side.
As a suggestion, you could just add a crontab entry to rerun the gsutil command a couple of times a day or at night.
Another way is to create a simple script and add it to your crontab to run every hour or so. This will check if your file exists, and if so it will run the gsutil command:
#!/bin/bash
# Run the sync only when the expected file is present locally.
FILE=/home/user/test.txt
if [ -f "$FILE" ]; then
    gsutil rsync /home/user gs://bucket
else
    echo "$FILE not found, skipping sync" >&2
fi
UPDATE :
I think this may be what you need. In ~/ you should have a .boto file.
~$ grep -E 'num_retries|max_retry_delay' .boto
# num_retries = <integer value>
# max_retry_delay = <integer value>
Uncomment those lines and set your own values. The default is 6 retries, so you could use something like 24 retries with a maximum delay of 3600 s between them. In theory this should keep gsutil retrying for a very long time.
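If you would rather not edit .boto, the same two settings can probably be overridden for a single run with gsutil's top-level -o option. A minimal sketch, assuming both keys live in the [Boto] section; the function name rsync_with_retries and the GSUTIL override variable are made up for this example:

```shell
# Run gsutil rsync with retry settings overridden for this invocation only.
# GSUTIL may point at another executable (assumption, handy for a dry run).
rsync_with_retries() {
  local num_retries=$1 max_delay=$2
  shift 2
  "${GSUTIL:-gsutil}" \
    -o "Boto:num_retries=${num_retries}" \
    -o "Boto:max_retry_delay=${max_delay}" \
    rsync "$@"
}

# Example (placeholder paths):
#   rsync_with_retries 24 3600 -r gs://bucket /home/user
```

Setting GSUTIL=echo prints the command line instead of running it, which is a cheap way to check the arguments before a long transfer.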
Hope this helps !

Related

How to resume a download using gsutil

I have been downloading a file using gsutil, and the process crashed.
The gsutil documentation is located at:
https://cloud.google.com/storage/docs/gsutil_install#redhat
The file location is described at: https://genebass.org/downloads
How can I resume the download instead of starting from scratch?
I have been looking for answers to similar questions, although the ones I found were written for different problems. For example:
GSutil resume download using tracker files
As mentioned in GCP docs, using the gsutil cp command:
gsutil automatically performs a resumable upload whenever you use the cp command to upload an object that is larger than 8 MiB. You do not need to specify any special command line options to make this happen. [. . .] Similarly, gsutil automatically performs resumable downloads (using standard HTTP Range GET operations) whenever you use the cp command, unless the destination is a stream. In this case, a partially downloaded temporary file will be visible in the destination directory. Upon completion, the original file is deleted and overwritten with the downloaded contents.
If you're also using gsutil in large production tasks, you may find useful information on Scripting Production Transfers.
Alternatively, you can achieve resumable download from Google Cloud Storage using the Range header (just take note of the HTTP specification threshold).
I'm not sure which command you're using (cp or rsync), but either way gsutil will fortunately take care of resuming downloads for you.
From the docs for gsutil cp:
gsutil automatically resumes interrupted downloads and interrupted resumable uploads, except when performing streaming transfers.
So, if you're using gsutil cp, it will automatically resume the partially downloaded files without starting them over. However, resuming with cp will also re-download the files that were already completed. To avoid this, use the -n flag so the files you've already downloaded are skipped, something like:
gsutil -m cp -n -r gs://ukbb-exome-public/300k/results/variant_results.mt .
If instead you're using gsutil rsync, then it will simply resume downloading.

How does gsutil rsync determine whether a file is old or new when syncing two folders?

I'm rsyncing files to a DRA bucket, and I need to make sure that when a file is newer in the source folder it must be synced to the destination folder.
Right now I am using MD5 checksums to be 100% sure, but this is too slow on an 8 TB data set with a very large number of files.
If I disable the MD5 checking, how does gsutil rsync determine whether a file should get synced or not?
From gsutil rsync --help:
CHANGE DETECTION ALGORITHM
To determine if a file or object has changed gsutil rsync first checks whether
the source and destination sizes match. If they match, it next checks if their
checksums match, using checksums if available (see below).
Unlike the Unix
rsync command, gsutil rsync does not use timestamps to determine if the
file/object changed, because the GCS API does not permit the caller to set an
object's timestamp (hence, timestamps of identical files/objects cannot be
made to match).
Checksums will not be available in two cases:
When synchronizing to or from a file system. By default, gsutil does not
checksum files, because of the slowdown caused when working with large
files. You can cause gsutil to checksum files by using the gsutil rsync -c
option, at the cost of increased local disk I/O and run time when working
with large files. You should consider using the -c option if your files can
change without changing sizes (e.g., if you have files that contain fixed
width data, such as timestamps).
When comparing composite GCS objects with objects at a cloud provider that
does not support CRC32C (which is the only checksum available for composite
objects). See 'gsutil help compose' for details about composite objects.
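To make the size-then-checksum order concrete, here is a rough local imitation of the decision for two files on disk. It is not gsutil's code: needs_sync is a hypothetical helper, and the POSIX cksum CRC stands in for the MD5/CRC32C checksums gsutil actually compares.

```shell
# Return 0 (true) if DST should be re-synced from SRC, 1 if it looks unchanged.
# Usage: needs_sync SRC DST [checksum]
needs_sync() {
  local src=$1 dst=$2 mode=${3:-}
  # A missing destination always needs syncing.
  [ -e "$dst" ] || return 0
  local src_size dst_size
  # Arithmetic expansion strips any padding that wc adds on some systems.
  src_size=$(($(wc -c < "$src")))
  dst_size=$(($(wc -c < "$dst")))
  # Step 1: different sizes mean the file changed.
  [ "$src_size" -eq "$dst_size" ] || return 0
  # Step 2: with checksumming requested (like rsync -c), compare checksums.
  if [ "$mode" = "checksum" ]; then
    [ "$(cksum < "$src")" = "$(cksum < "$dst")" ] || return 0
  fi
  return 1
}
```

Without the checksum argument, two files of equal size but different content are treated as identical, which is exactly the case the help text warns about for fixed-width data.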
Regards,
Paolo

Importing large datasets into Couchbase

I am having difficulty importing large datasets into Couchbase. I have experience doing this very fast with Redis via the command line but I have not seen anything yet for Couchbase.
I have tried using the PHP SDK and it imports about 500 documents / second. I have also tried the cbdocloader script in the Couchbase bin folder, but it seems to want each document in its own JSON file. It is a bit of work to create all these files and then load them. Is there some other import process I am missing? If cbdocloader is the only way to load data fast, is it possible to put multiple documents into one JSON file?
Take the file that has all the JSON documents in it and zip up the file:
zip somefile.zip somefile.json
Place the zip file(s) into a directory. I used ~/json_files/ in my home directory.
Then load the file or files by the following command:
cbdocloader -u Administrator -p s3kre7Pa55 -b MyBucketToLoad -n 127.0.0.1:8091 -s 1000 \
~/json_files/somefile.zip
Note: '-s 1000' is the memory size. You'll need to adjust this value for your bucket.
If successful you'll see output stating how many documents were loaded, success, etc.
Here is a brief script to load up a lot of .zip files in a given directory:
#!/bin/bash
# Load every .zip file in the directory into the bucket, one at a time.
JSON_Dir=~/json_files
for ZipFile in "$JSON_Dir"/*.zip; do
    /Applications/Couchbase\ Server.app/Contents/Resources/couchbase-core/bin/cbdocloader \
        -u Administrator -p s3kre7Pa55 -b MyBucketToLoad \
        -n 127.0.0.1:8091 -s 1000 "$ZipFile"
done
UPDATED: Keep in mind this script will only work if your data is formatted correctly and each document is smaller than the maximum single document size of 20MB (not the zip file, but any document extracted from the zip).
I have created a blog post describing bulk loading from a single file as well and it is listed here:
Bulk Loading Documents Into Couchbase

How to solve jenkins 'Disk space is too low' issue?

I have deployed Jenkins on my CentOS machine. Jenkins was working well for 3 days, but yesterday it reported the problem "Disk space is too low. Only 1.019GB left."
How can I solve this problem? It took my master offline for hours.
You can easily change the threshold from the Jenkins UI (my version is 1.651.3), under Manage Jenkins > Manage Nodes > Configure.
Update: How to ensure high disk space
This feature is meant to prevent working on slaves with low free disk space. Lowering the threshold would not solve the fact that some jobs do not properly cleanup after they finish.
Depending on what you're building:
Make sure you understand what is the disk output of your build - if possible - restrict the output to happen only to the job workspace. Use workspace cleanup plugin to cleanup the workspace as post build step.
If the process must write some data to external folders - clean them up manually on post build steps.
Alternative1 - provision a new slave per job (use spot slaves - there are many plugins that integrate with different cloud provider to provision on the fly machines on demand)
Alternative2 - run the build inside a container. Everything will be discarded once the build is finished
Besides the solutions above, there is a more common approach: delete the largest space consumers directly on the Linux machine. You can follow these steps:
Log in to the Jenkins machine (e.g. with PuTTY)
cd to the Jenkins installation path
Use ls -lart to list hidden folders as well; the Jenkins installation is normally placed in the .jenkins/ folder
[xxxxx ~]$ ls -lart
drwxrwxr-x 12 xxxx 4096 Feb 8 02:08 .jenkins/
List the space used by each folder
Use df -h to show disk space at a high level
du -sh ./*/ lists the total disk usage of each subfolder in the current path.
du -a /etc/ | sort -n -r | head -n 10 lists the 10 largest files and directories under /etc/
Delete old builds or other large folders
Normally the ./job/ or ./workspace/ folder is the largest. Go inside and delete what you no longer need (DO NOT delete the entire folder).
rm -rf theFolderToDelete
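Before running rm -rf by hand, it can help to list which build directories are actually old. The sketch below only prints candidates and deletes nothing. It assumes the usual $JENKINS_HOME/jobs/<job>/builds/<number> layout, which may differ on your installation, and old_builds is a made-up helper name.

```shell
# Print build directories older than DAYS days under a Jenkins home directory.
# Usage: old_builds JENKINS_HOME DAYS
old_builds() {
  local home=$1 days=$2
  # -type d skips any symlinks kept next to the numbered build directories.
  find "$home"/jobs/*/builds -mindepth 1 -maxdepth 1 -type d -mtime +"$days" 2>/dev/null
}

# Example (placeholder path): review the list first, then delete selectively.
#   old_builds ~/.jenkins 30
```

Letting Jenkins discard old builds itself is still the safer route, since it keeps its own bookkeeping consistent.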
You can limit disk usage by discarding old builds. There's a checkbox for this in the project configuration.
This is actually a legitimate question so I don't understand the downvotes, perhaps it belongs on Superuser or Serverfault. This is a soft warning threshold not hard limit where the disk is out of space.
For hudson see where to configure hudson node disk temp space thresholds - this is talking about the host, not nodes
Jenkins is the same. The conclusion is for many small projects the system property called hudson.diagnosis.HudsonHomeDiskUsageChecker.freeSpaceThreshold could be decreased.
In saying that I haven't tested it and there is a disclaimer
No compatibility guarantee
In general, these switches are often experimental in nature, and subject to change without notice. If you find some of those useful, please file a ticket to promote it to the official feature.
I got the same issue. My Jenkins version is 2.3 and its UI is slightly different. Putting it here in case it helps someone. Increasing both disk space thresholds to 5GB fixed the issue.
I have a cleanup job with the following build steps. You can schedule it @daily or @weekly.
Execute system groovy script build step to clean up old jobs:
import jenkins.model.Jenkins
import hudson.model.Job

BUILDS_TO_KEEP = 5

for (job in Jenkins.instance.items) {
    println job.name
    // The most recent builds, which must be kept
    def recent = job.builds.limit(BUILDS_TO_KEEP)
    for (build in job.builds) {
        if (!recent.contains(build)) {
            println "Preparing to delete: " + build
            build.delete()
        }
    }
}
You'd need to have Groovy plugin installed.
Execute shell build step to clean cache directories
rm -r ~/.gradle/
rm -r ~/.m2/
echo "Disk space"
du -h -s /
To check the free space as Jenkins Job:
Parameters
FREE_SPACE: Needed free space in GB.
Job
#!/usr/bin/env bash
# Free space on the current filesystem in human-readable form, e.g. "12G"
free_space="$(df -Ph . | awk 'NR==2 {print $4}')"
if [[ "${free_space}" = *G* ]]; then
    # Keep only the leading integer part, e.g. "12G" -> "12"
    free_space_gb=${free_space/[^0-9]*/}
    if [[ ${free_space_gb} -lt ${FREE_SPACE} ]]; then
        echo "Warning! Low space: ${free_space}"
        exit 2
    fi
else
    echo "Warning! Unknown: ${free_space}"
    exit 1
fi
echo "Free space: ${free_space}"
Plugins
Set build description
Post-Build Actions
Regular expression: Free space: (.*)
Description: Free space: \1
Regular expression for failed builds: Warning! (.*)
Description for failed builds: \1
For people who do not know where the configs are, download the tmpcleaner from
https://updates.jenkins-ci.org/download/plugins/tmpcleaner/
You will get an hpi file here. Go to Manage Jenkins-> Manage plugins-> Advanced and then upload the hpi file here and restart jenkins
You can immediately see a difference if you go to Manage Nodes.
Since my Jenkins was installed on a Debian server, I did not understand most of the answers related to this, because I cannot find an /etc/default folder or a jenkins file.
If someone knows where the /tmp folder is or how to configure it for Debian, do let me know in the comments.

Redirect output to different directories for sun grid engine array jobs

I'm running a lot of jobs with Sun Grid Engine. Since there are so many jobs (~100000), I would like to use array jobs, which seem to be easier on the queue.
Another problem is that each job produces a stdout and a stderr file, which I need in order to track errors. If I define them in qsub -t 1-100000 -o outputdir -e errordir, I will end up with directories containing 100000 files each, which is too many.
Is there a way to have each job write its output file to a different directory (say, a directory named after the first 2 characters of the job ID, which is random hex letters, or the job number modulo 1000, or something of that sort)?
Thanks
I can't think of a good way to do this with qsub as there are no programmatic interfaces into the -o and -e options. There is, however, a way to accomplish what you want.
Run your qsub with -o and -e pointing to /dev/null. Make the command you run be some type of wrapper that redirects its own stdout and stderr to files in whatever fashion you want (e.g., your broken-down directory structure) before it execs the real job.
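A minimal sketch of such a wrapper, assuming the array task number is available in the SGE_TASK_ID environment variable and using the "job number modulo 1000" layout from the question. The names task_log_dir, run_task and LOG_ROOT are invented for illustration.

```shell
#!/usr/bin/env bash
# Wrapper for an SGE array job, submitted for example as:
#   qsub -t 1-100000 -o /dev/null -e /dev/null wrapper.sh real_job args...

# Map a task ID to a bucketed log directory: ROOT/(task_id mod 1000).
task_log_dir() {
  local root=$1 task_id=$2
  echo "$root/$((task_id % 1000))"
}

# Redirect stdout/stderr into the bucketed directory, then exec the real job.
run_task() {
  local root=$1
  shift
  local id=${SGE_TASK_ID:?SGE_TASK_ID is not set}
  local dir
  dir=$(task_log_dir "$root" "$id")
  mkdir -p "$dir"
  # From here on, this process's output goes to per-task files.
  exec > "$dir/$id.out" 2> "$dir/$id.err"
  # Replace the wrapper with the real job so signals and exit status pass through.
  exec "$@"
}

# The last line of the submitted script would be something like:
#   run_task "${LOG_ROOT:-$HOME/sge_logs}" "$@"
```

With 100000 tasks and 1000 buckets, each directory ends up with about 100 .out and 100 .err files.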