How to extract a very big file in Google Colab - google-drive-api

I am trying to extract a 14.6 GB 7z file (https://archive.org/download/stackexchange/stackoverflow.com-Posts.7z).
I have it downloaded and saved in my Google Drive. I mount my drive to Google Colab and then change the current directory to where the file is located: os.chdir('/content/drive/My Drive/.../')
When I try to extract the file with !p7zip -k -d stackoverflow.com-Posts.7z, it uses the instance's HDD space, and during the process it runs out of all the available allocated space, so the extraction terminates abruptly.
Is there a way to extract the file without using the instance's HDD space, OR to do it in chunks, so that the extraction succeeds?
PS: I believe the decompressed size is ~100 GB.

You can try to read the data in blocks using libarchive, without extracting the archive first.
https://github.com/dsoprea/PyEasyArchive
Here's an example notebook
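A minimal sketch of that block-by-block approach, assuming PyEasyArchive is installed (pip install libarchive, which may also require the libarchive system package) and that the archive sits at the path from the question; the processing inside the loop is just a placeholder:
import libarchive.public

archive_path = '/content/drive/My Drive/stackoverflow.com-Posts.7z'  # assumed location

with libarchive.public.file_reader(archive_path) as entries:
    for entry in entries:
        total = 0
        # stream each entry block by block instead of extracting the whole file to disk
        for block in entry.get_blocks():
            total += len(block)  # replace with your own processing of the decompressed data
        print(entry, total, 'bytes read')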

How to resume the download by using gsutil

I have been downloading the file by using gsutil, and the process has crashed.
The documentation for gsutil is located at:
https://cloud.google.com/storage/docs/gsutil_install#redhat
The file location is described at: https://genebass.org/downloads
How can I resume the file download instead of starting from scratch?
I have been looking for answers to similar questions, but those were provided for different situations. For example:
GSutil resume download using tracker files
As mentioned in the GCP docs on the gsutil cp command:
gsutil automatically performs a resumable upload whenever you use the cp command to upload an object that is larger than 8 MiB. You do not need to specify any special command line options to make this happen. [. . .] Similarly, gsutil automatically performs resumable downloads (using standard HTTP Range GET operations) whenever you use the cp command, unless the destination is a stream. In this case, a partially downloaded temporary file will be visible in the destination directory. Upon completion, the original file is deleted and overwritten with the downloaded contents.
If you're also using gsutil in large production tasks, you may find useful information on Scripting Production Transfers.
Alternatively, you can achieve resumable download from Google Cloud Storage using the Range header (just take note of the HTTP specification threshold).
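A rough sketch of such a Range-based resume in Python (the URL and file name are placeholders, and this assumes the object is publicly readable or the request is otherwise authorized):
import os
import requests

url = 'https://storage.googleapis.com/your-bucket/your-object'  # placeholder
dest = 'your-object.download'                                   # placeholder

# resume from however many bytes are already on disk
start = os.path.getsize(dest) if os.path.exists(dest) else 0

with requests.get(url, headers={'Range': f'bytes={start}-'}, stream=True, timeout=60) as r:
    r.raise_for_status()  # 206 Partial Content when the server honors the Range header
    with open(dest, 'ab') as f:
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            f.write(chunk)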
I'm not sure which command you're using (cp or rsync), but either way gsutil will fortunately take care of resuming downloads for you.
From the docs for gsutil cp:
gsutil automatically resumes interrupted downloads and interrupted resumable uploads, except when performing streaming transfers.
So, if you're using gsutil cp, it will automatically resume the partially downloaded files without starting them over. However, resuming with cp will also re-download the files that were already completed. To avoid this, use the -n flag so the files you've already downloaded are skipped, something like:
gsutil -m cp -n -r gs://ukbb-exome-public/300k/results/variant_results.mt .
If instead you're using gsutil rsync, then it will simply resume downloading.

Too slow to unzip dataset in google colab from google drive

I have a dataset of around 1.2 GB that I want to use in Google Colab. I compressed the dataset into a zip file, which came out to 479 MB, uploaded the zip file to Google Drive, and ran the following command in Google Colab.
!unzip Archive.zip
It starts to unzip the file, but it is too slow (I waited for three hours and it didn't finish). I am using a GPU runtime in Google Colab, and a smaller zip file was unzipped correctly. Is there a faster way to get the dataset into Google Colab?

Google colab and google drive: Copy file from colab to Google Drive

There seem to be lots of ways to access a file on Google Drive from Colab but no simple way to save a file from Google Colab back to Google Drive.
For example, to access a Google Drive file from Colab, you can mount the Google Drive using
from google.colab import drive
drive.mount('/content/drive')
However, to save an output file you've generated in Colab back to Google Drive, the methods seem very complicated, as in:
Upload File From Colab to Google Drive Folder
Once Google Drive is mounted, you can even view the drive files in the Table of Contents from Colab. Is there no simple way to save or copy a file created in Colab and visible in the Colab directory back to Google Drive?
Note: I don't want to save it to a local machine using something like
from google.colab import files
files.download('example.txt')
as the file is very large
After you have mounted the drive, you can just copy it there.
# mount it
from google.colab import drive
drive.mount('/content/drive')
# copy it there
!cp example.txt /content/drive/MyDrive
Other answers show how to copy a specific file; I would like to mention that you can also copy an entire directory, which is useful when copying callback logs from Colab to Drive:
from google.colab import drive
drive.mount('/content/drive')
In my case, the folder names were:
%cp -av "/content/logs/scalars/20201228-215414" "/content/drive/MyDrive/Colab Notebooks/logs/scalars/manual_add"
You can use shutil to copy or move files between Colab and Google Drive:
import shutil

# assumes Google Drive is mounted at /content/gdrive
shutil.copy("/content/file.doc", "/content/gdrive/My Drive/file.doc")
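For a whole directory, shutil.copytree works the same way (a sketch; both paths are assumptions and depend on where the drive is mounted):
import shutil

# copies the folder and everything inside it; dirs_exist_ok needs Python 3.8+
shutil.copytree('/content/logs', '/content/gdrive/My Drive/logs', dirs_exist_ok=True)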
When you are saving files, simply specify the Google Drive path for saving the file.
When using large files, Colab sometimes syncs the VM and Drive asynchronously. To force the sync, simply run:
from google.colab import drive
drive.flush_and_unmount()
In my case, I use the common approach with the !cp command.
But sometimes it doesn't work in Colab because the file path isn't entered correctly.
Basic syntax: !cp source_filepath destination_filepath
Example:
!cp /content/myfolder/myitem.txt /content/gdrive/MyDrive/mydrivefolder/
In addition, to enter the path correctly, you can copy it from the Table of Contents on the left side by clicking the dot menu -> Copy path.
Once you see the file in the Table of Contents of Colab on the left, simply drag that file into the "/content/drive/My Drive/" directory located on the same panel. Once the file is inside your "My Drive", you will be able to see it inside your Google Drive.
After you mount your drive...
from google.colab import drive
drive.mount('/content/drive')
...just prepend the full path, including the mounted path (/content/drive) to the file you want to write.
someList = []

with open('/content/drive/My Drive/data/file.txt', 'w', encoding='utf8') as output:
    for line in someList:
        output.write(line + '\n')
In this case we save it in a folder called data located in the root of your Google Drive.
You may often run into quota limits using the gdown library.
Access denied with the following error:
Too many users have viewed or downloaded this file recently. Please
try accessing the file again later. If the file you are trying to
access is particularly large or is shared with many people, it may
take up to 24 hours to be able to view or download the file. If you
still can't access a file after 24 hours, contact your domain
administrator.
You may still be able to access the file from the browser:
https://drive.google.com/uc?id=FILE_ID
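For reference, the kind of gdown call that can trigger this quota error looks roughly like the following (the file ID and output name are placeholders):
import gdown

# downloads a publicly shared Drive file by ID; large, heavily shared files can hit the quota above
gdown.download('https://drive.google.com/uc?id=FILE_ID', 'dataset.zip', quiet=False)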
No doubt gdown is faster, but I copy my files using the command below and avoid the quota limits:
!cp /content/drive/MyDrive/Dataset/test1.zip /content/dataset

how to upload image folder to colab?

I was trying to upload a big image folder to Google Drive and GitHub, but GitHub did not allow it and Google Drive was taking too long. How can I upload the local folder to Colab?
Sorry, I don't think there's a solution to your issue. If your fundamental problem is limited upload capacity from the machine with the images, you'll just need to wait.
A nice property of uploading to Drive is that you can use programs like Backup and Sync to retry the transfer until it's successful. And, once the images have been uploaded to Drive once, you'll be able to access them quickly in Colab thereafter without uploading again. (See this example notebook showing how to connect your Google Drive files to Colab as a filesystem.)
Convert the folder to a zip file and then upload it to Colab.
You can then unzip the folder with the following command:
!unzip "your path"
The unzip method only works for csv files.
If you use a kaggle dataset, use
os.environ['KAGGLE_USERNAME'] = 'enter_username_here' # username
os.environ['KAGGLE_KEY'] = 'enter_key_here' # key
!kaggle datasets download -d dataset_api_command_here
If you have the images in Google Drive, use
from google.colab import drive
drive.mount('/content/drive')

Importing large datasets into Couchbase

I am having difficulty importing large datasets into Couchbase. I have experience doing this very fast with Redis via the command line but I have not seen anything yet for Couchbase.
I have tried using the PHP SDK, and it imports about 500 documents per second. I have also tried the cbdocloader script in the Couchbase bin folder, but it seems to want each document in its own JSON file. It is a bit of work to create all these files and then load them. Is there some other import process I am missing? If cbdocloader is the only way to load data fast, is it possible to put multiple documents into one JSON file?
Take the file that has all the JSON documents in it and zip up the file:
zip somefile.zip somefile.json
Place the zip file(s) into a directory. I used ~/json_files/ in my home directory.
Then load the file or files by the following command:
cbdocloader -u Administrator -p s3kre7Pa55 -b MyBucketToLoad -n 127.0.0.1:8091 -s 1000 \
~/json_files/somefile.zip
Note: '-s 1000' is the memory size. You'll need to adjust this value for your bucket.
If successful you'll see output stating how many documents were loaded, success, etc.
Here is a brief script to load up a lot of .zip files in a given directory:
#!/bin/bash
JSON_Dir=~/json_files/
for ZipFile in "$JSON_Dir"/*.zip; do
    /Applications/Couchbase\ Server.app/Contents/Resources/couchbase-core/bin/cbdocloader \
        -u Administrator -p s3kre7Pa55 -b MyBucketToLoad \
        -n 127.0.0.1:8091 -s 1000 "$ZipFile"
done
UPDATE: Keep in mind this script will only work if your data is formatted correctly and each file is under the maximum single-document size of 20 MB (not the zip file, but any document extracted from the zip).
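If your source is one large file holding a JSON array of documents, a small helper along these lines (a sketch using only the standard library; the paths and the doc_N.json naming are assumptions) can split it into one .json file per document and zip them up for cbdocloader:
import json
import os
import zipfile

src = os.path.expanduser('~/json_files/somefile.json')       # one big JSON array of documents
zip_path = os.path.expanduser('~/json_files/somefile.zip')

with open(src) as f:
    documents = json.load(f)

with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
    for i, doc in enumerate(documents):
        # one document per .json file inside the zip, named doc_0.json, doc_1.json, ...
        zf.writestr(f'doc_{i}.json', json.dumps(doc))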
I have created a blog post describing bulk loading from a single file as well and it is listed here:
Bulk Loading Documents Into Couchbase