How to resume a download using gsutil - google-apps-script

I have been downloading a file using gsutil, and the process has crashed.
The documentation on gsutil is located at:
https://cloud.google.com/storage/docs/gsutil_install#redhat
The file location is described at: https://genebass.org/downloads
How can I resume the file download instead of starting from scratch?
I have been looking for answers to similar questions, but those were given for different problems. For example:
GSutil resume download using tracker files

As mentioned in the GCP docs about the gsutil cp command:
gsutil automatically performs a resumable upload whenever you use the cp command to upload an object that is larger than 8 MiB. You do not need to specify any special command line options to make this happen. [. . .] Similarly, gsutil automatically performs resumable downloads (using standard HTTP Range GET operations) whenever you use the cp command, unless the destination is a stream. In this case, a partially downloaded temporary file will be visible in the destination directory. Upon completion, the original file is deleted and overwritten with the downloaded contents.
If you're also using gsutil in large production tasks, you may find useful information on Scripting Production Transfers.
Alternatively, you can achieve a resumable download from Google Cloud Storage using the Range header (just take note of the HTTP specification threshold).
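As a rough sketch of that approach, assuming a placeholder BUCKET/OBJECT and a gcloud-issued access token (only needed for non-public objects; none of these names come from the question):
PARTIAL=object.part
OFFSET=$(wc -c < "$PARTIAL" | tr -d '[:space:]')   # bytes already downloaded
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Range: bytes=${OFFSET}-" \
     "https://storage.googleapis.com/BUCKET/OBJECT" >> "$PARTIAL"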

I'm not sure which command you're using (cp or rsync), but either way gsutil will fortunately take care of resuming downloads for you.
From the docs for gsutil cp:
gsutil automatically resumes interrupted downloads and interrupted resumable uploads, except when performing streaming transfers.
So, if you're using gsutil cp, it will automatically resume the partially downloaded files without starting them over. However, resuming with cp will also re-download the files that were already completed. To avoid this, use the -n flag so the files you've already downloaded are skipped, something like:
gsutil -m cp -n -r gs://ukbb-exome-public/300k/results/variant_results.mt .
If instead you're using gsutil rsync, then it will simply resume downloading.
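For reference, a roughly equivalent rsync invocation could look like this, reusing the bucket path from the cp example above purely for illustration:
gsutil -m rsync -r gs://ukbb-exome-public/300k/results/variant_results.mt .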

Related

How does gsutil rsync determine whether a file is old or new when syncing two folders?

I'm rsyncing files to a DRA bucket, and I need to make sure that when a file is newer in the source folder it gets synced to the destination folder.
Right now I am using MD5 checksums to be 100% sure, but this is too slow on a dataset of 8 TB with a very large number of files.
If I disable the MD5 checking, how does gsutil rsync determine whether a file should be synced or not?
From gsutil rsync --help:
CHANGE DETECTION ALGORITHM
To determine if a file or object has changed, gsutil rsync first checks whether the source and destination sizes match. If they match, it next checks if their checksums match, using checksums if available (see below). Unlike the Unix rsync command, gsutil rsync does not use timestamps to determine if the file/object changed, because the GCS API does not permit the caller to set an object's timestamp (hence, timestamps of identical files/objects cannot be made to match).
Checksums will not be available in two cases:
1. When synchronizing to or from a file system. By default, gsutil does not checksum files, because of the slowdown caused when working with large files. You can cause gsutil to checksum files by using the gsutil rsync -c option, at the cost of increased local disk I/O and run time when working with large files. You should consider using the -c option if your files can change without changing sizes (e.g., if you have files that contain fixed width data, such as timestamps).
2. When comparing composite GCS objects with objects at a cloud provider that does not support CRC32C (which is the only checksum available for composite objects). See 'gsutil help compose' for details about composite objects.
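For example, forcing checksum comparison on a local-to-bucket sync might look like this sketch (the directory and bucket names are placeholders, and the extra local disk I/O described above applies):
gsutil rsync -c -r ./local_dir gs://my-bucket/backup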
Regards,
Paolo

What will happen with a gsutil command if a DRA bucket's contents are unavailable?

I'm using a DRA (Durable Reduced Availability) bucket and I perform the gsutil rsync command quite often to upload/download files to/from the bucket.
Since files could be unavailable (because of the DRA), what exactly will happen during a gsutil rsync session when such a scenario is hit?
Will gsutil just wait until the unavailable files become available and complete the task, thus always downloading everything from the bucket?
Or will gsutil exit with a warning about a certain file not being available, and if so exactly what output is being used (so that I can make a script to look for this type of message)?
What will the return code be of the gsutil command in a session where files are found to be unavailable?
I need to be 100% sure that I download everything from the bucket, which I'm guessing can be difficult to keep track of when downloading hundreds of gigabytes of data. In case gsutil rsync completes without downloading unavailable files, is it possible to construct a command which retries the unavailable files until all such files have been successfully downloaded?
If your files exceed the resumable threshold (as of 4.7, this is 8MB), any availability issues will be retried with exponential backoff according to the num_retries and max_retry_delay configuration variables. If the file is smaller than the threshold, it will not be retried (this will be improved in 4.8 so small files also get retries).
If any file(s) fail to transfer successfully, gsutil will halt and output an exception depending on the failure encountered. If you are using gsutil -m rsync or gsutil rsync -C, gsutil will continue on errors, and at the end you'll get a CommandException with the message 'N file(s)/object(s) could not be copied/removed'.
If retries are exhausted and/or either of the failure conditions described in #2 occur, the exit code will be nonzero.
In order to ensure that you download all files from the bucket, you can simply rerun gsutil rsync until it finishes with a zero exit code (i.e., successfully).
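A minimal sketch of that rerun loop, with placeholder bucket and local paths:
#!/bin/bash
# Keep rerunning rsync until it exits with status 0, i.e. everything transferred.
until gsutil -m rsync -r gs://my-bucket/data ./data; do
    echo "rsync reported failures, retrying in 60 seconds..." >&2
    sleep 60
done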
Note that gsutil rsync relies on listing objects. Listing in Google Cloud Storage is eventually consistent, so if you upload files to the bucket and then immediately run gsutil rsync, it is possible you will miss newly uploaded files, but the next run of gsutil rsync should pick them up.
I did some tests on a project and could not get gsutil to throw any errors. As far as I know, gsutil operates at the directory level; it is not looking for a specific file.
When you run, for example, $ gsutil rsync local_dir gs://bucket, gsutil is not expecting any particular file; it just takes whatever you have in "local_dir" and uploads it to gs://bucket, so:
gsutil will not wait; it will complete.
you will not get any errors - the only errors I got were when the local directory or the bucket was missing entirely.
if, let's say, a file is missing in local_dir but is available in the bucket, and you run $ gsutil rsync -r local_dir gs://bucket, then nothing will change in the bucket. With the -d option, the file will be deleted on the bucket side, as illustrated below.
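A quick illustration of that difference (the directory and bucket names are placeholders):
# Without -d, files that exist only in the bucket are left untouched:
gsutil rsync -r local_dir gs://bucket
# With -d, the bucket is made to mirror local_dir, so bucket-only files are deleted:
gsutil rsync -r -d local_dir gs://bucket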
As a suggestion, you could just add a crontab entry to rerun the gsutil command a couple of times a day or at night.
Another way is to create a simple script and add it to your crontab to run every hour or so. This will check whether your file already exists locally; if it does not, it will run the gsutil command:
#!/bin/bash
FILE=/home/user/test.txt
if [ -f "$FILE" ]; then
    echo "file exists..or something"
else
    gsutil rsync /home/user gs://bucket
fi
UPDATE:
I think this may be what you need. In ~/ you should have a .boto file.
~$ more .boto | grep max
# num_retries = <integer value>
# max_retry_delay = <integer value>
Uncomment those lines and add your numbers. The default is 6 retries, so you could do something like 24 retries with up to 3600 seconds between them. In theory this should keep the command retrying for a long time.
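For example, after uncommenting and editing those lines, checking ~/.boto might show something like this (the values are illustrative, not defaults):
~$ grep -E "num_retries|max_retry_delay" ~/.boto
num_retries = 24
max_retry_delay = 3600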
Hope this helps!

How can I download a Google Compute Engine image

How can I download a Google Compute Engine image that was created from a snapshot of a persistent disk? There doesn't seem to be a direct way to do this through the console.
There isn't a direct way to download an image or snapshot from GCE, but there is a way to save an image and store it in Google Cloud Storage (GCS), where it can be downloaded. You can use the standard gcimagebundle tool to do this.
You can also create this image using the dd command. On a temporary disk that’s bigger than the one you want to image, run this:
dd if=/dev/disk/by-id/google-diskname of=disk.img bs=5M
You can then run this command to copy it over to GCS:
gsutil cp disk.img gs://bucket/image.img
And later, you can:
gsutil cat gs://bucket/image.img | dd of=/dev/disk/by-id/google-newdisk bs=5M
This will allow you to make an image of your disk and then send it to GCS where you can download it using either the web interface or gsutil.
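For completeness, pulling the stored image back down to a local machine with gsutil could be as simple as this sketch (the bucket and object names follow the illustrative ones above):
gsutil cp gs://bucket/image.img ./image.img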
As an addition to the current answer, you can also download a file directly using SSH/SCP by adding your public key to the "SSH Keys" section. Then, using your own terminal:
sheryl:~ sangprabo$ scp prabowo.murti@123.456.789.012:/var/www/my-file.tar.gz .
Enter passphrase for key '/Users/sangprabo/.ssh/id_rsa':
I prefer that way so I don't need to create a bucket first. CMIIW.

Get the latest updated file from FTP Folder

Kindly see this screencast to get a better idea of our requirement:
https://www.screenr.com/QmDN
We want to automate the Text Datasource generation and the connection to MS Excel, in order to make it easier for the end user to connect the Text Datasource (CSV) to MS Excel so that they can generate their own reports.
The steps I have in mind:
1. Use WinSCP FTP Client with scripting
2. Write a script to get the most recently updated file from the FTP folder
3. Or, instead of step 2, download all generated files from FTP to a shared folder on the network
4. Get the most recent version of the generated CSV file
5. Rename the file to the standard naming convention. This must be the name used in MS Excel as the CSV Text Datasource
6. Delete all other files
I developed a sample script that can be used by WinSCP to download the files from the FTP folder:
# Automatically abort script on errors
option batch abort
# Disable overwrite confirmations that conflict with the previous
option confirm off
# Connect
open CSOD
# Change remote directory
cd /Reports/CAD
# Force binary mode transfer
option transfer binary
# Download file to the local directory d:\
#get "Training Attendance Data - Tarek_22_10_21_2014_05_05.CSV" "D:\MyData\Business\Talent Management System\Reports\WinCSP\"
get "*.CSV" "D:\MyData\Business\Talent Management System\Reports\WinCSP\Files\"
# Disconnect
close
exit
Then, I can schedule the above code to run periodically using this command:
winscp.com /script=example.txt
The above sample is working fine, but the main problem is how to identify the most recent file so that I can rename it and delete all the other files.
Appreciate your help.
Tarek
Just add the -latest switch to the get command:
get -latest "*.CSV" "D:\MyData\Business\Talent Management System\Reports\WinCSP\Files\"
For more details, see WinSCP article Downloading the most recent file.
You don't specify the language you use, so here is a Ruby script that downloads the most recent file from an FTP path, just to demonstrate how easily and tersely this can be done with a scripting language like Ruby.
require 'net/ftp'

Net::FTP.open('url of ftpsite') do |ftp|
  ftp.login("username", "password")
  path = "/private/transfer/*.*"
  # file[55..-1] gives the filename part of the returned listing line
  most_recent_file = ftp.list(path)[2..-1].sort_by { |file| ftp.mtime(file[55..-1]) }.reverse.first[55..-1]
  puts "downloading #{most_recent_file}"
  ftp.getbinaryfile(most_recent_file, File.basename(most_recent_file))
  puts "done"
end

Importing large datasets into Couchbase

I am having difficulty importing large datasets into Couchbase. I have experience doing this very quickly with Redis via the command line, but I have not seen anything comparable for Couchbase.
I have tried using the PHP SDK, and it imports about 500 documents per second. I have also tried the cbdocloader script in the Couchbase bin folder, but it seems to want each document in its own JSON file. It is a bit of work to create all these files and then load them. Is there some other import process I am missing? If cbdocloader is the only way to load data fast, is it possible to put multiple documents into one JSON file?
Take the file that has all the JSON documents in it and zip up the file:
zip somefile.zip somefile.json
Place the zip file(s) into a directory. I used ~/json_files/ in my home directory.
Then load the file or files by the following command:
cbdocloader -u Administrator -p s3kre7Pa55 -b MyBucketToLoad -n 127.0.0.1:8091 -s 1000 \
~/json_files/somefile.zip
Note: '-s 1000' is the memory size. You'll need to adjust this value for your bucket.
If successful you'll see output stating how many documents were loaded, success, etc.
Here is a brief script to load up a lot of .zip files in a given directory:
#!/bin/bash
JSON_Dir=~/json_files/
for ZipFile in "$JSON_Dir"/*.zip; do
    /Applications/Couchbase\ Server.app/Contents/Resources/couchbase-core/bin/cbdocloader \
        -u Administrator -p s3kre7Pa55 -b MyBucketToLoad \
        -n 127.0.0.1:8091 -s 1000 "$ZipFile"
done
UPDATED: Keep in mind this script will only work if your data is formatted correctly and if each document is less than the maximum single-document size of 20 MB (this limit applies to the documents extracted from the zip, not the zip file itself).
I have created a blog post describing bulk loading from a single file as well and it is listed here:
Bulk Loading Documents Into Couchbase