Colab: how to get the file ID for an existing file - google-drive-api

I am starting with Colab for ML and I have a problem importing files from my Google Drive into the notebook. Say I have a file pretrained_vgg19.mat in my Drive at drive/jupyter/pretrained_vgg19.mat. The code snippet for importing files from Drive says that I need to use the file_ID, which looks like laggVyWshwcyP6kEI-y_W3P8D26sz. How do I get this file_ID?

See the PyDrive documentation for the ListFile command:
from pydrive.drive import GoogleDrive

# Create a GoogleDrive instance with an authenticated GoogleAuth instance
drive = GoogleDrive(gauth)

# Auto-iterate through all files in the root folder.
file_list = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()
for file1 in file_list:
    print('title: %s, id: %s' % (file1['title'], file1['id']))
Now all you need to do is tweak the search parameters, since you already know the title of the file (PyDrive wraps the Drive v2 API, where the search field is title). See the docs.
file_list = drive.ListFile({'q': "name='pretrained_vgg19.mat' and trashed=false"}).GetList()
for file in file_list:
print('%s' % (file['id']))
Note that it is possible to have multiple files with the same folder name and file name, because Google Drive lets you create folders (and files) with identical paths. If there is even a chance of this, the list operation will return multiple files and you will need some other criterion to select the correct one.
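For example, if you know the ID of the folder that contains the file, a minimal sketch (the folder ID below is a placeholder) that narrows the query to that one folder:
folder_id = 'YOUR_FOLDER_ID'  # placeholder: the parent folder's ID
query = "title='pretrained_vgg19.mat' and '%s' in parents and trashed=false" % folder_id
for f in drive.ListFile({'q': query}).GetList():
    print('title: %s, id: %s' % (f['title'], f['id']))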

user244343's answer didn't work for me since the gauth object doesn't exist. I did this instead (test.zip needs to point to the right folder and file in your Drive!):
!apt-get install -qq xattr
filename = "/content/drive/My\ Drive/test.zip"
# Retrieving the file ID for a file in `"/content/drive/My Drive/"`:
id = !xattr -p 'user.drive.id' {filename}
print(id)
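Once you have the ID, a minimal sketch for actually pulling the file into the notebook (assuming the authenticated PyDrive drive object from the first answer; the ID is a placeholder):
downloaded = drive.CreateFile({'id': 'YOUR_FILE_ID'})  # placeholder: the ID found above
downloaded.GetContentFile('pretrained_vgg19.mat')      # saves into the local working directory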

Related

Azure ADLS Gen2 file created by Azure Databricks doesn't inherit ACL

I have a Databricks notebook that writes a dataframe to a file in ADLS Gen2 storage.
It creates a temp folder, outputs the file, and then copies that file to a permanent folder. For some reason the file doesn't inherit the ACL correctly, while the folder it creates has the correct ACL.
The code for the notebook:
# Get data into a dataframe
df_export = spark.sql(SQL)

# Output the file to a temp directory; coalesce(1) creates a single output data file
(df_export.coalesce(1).write.format("parquet")
    .mode("overwrite")
    .save(TempFolder))

# Get the parquet file name. It's always the last in the folder, as the other files are created starting with _
file = dbutils.fs.ls(TempFolder)[-1][0]

# Create the permanent copy
dbutils.fs.cp(file, FullPath)
The temp folder that is created shows the expected ACL entries for the relevant account, but the file inside it does not. There is also a mask entry; I'm not really familiar with masks, so I'm not sure how the mask permission on the folder differs from the one on the file. (Screenshots of the folder and file ACLs, including the mask permissions, were attached to the original question.)
Does anyone have any idea why the file wouldn't be inheriting the ACL from the parent folder?
I've had a response from Microsoft support which has resolved this issue for me.
Cause: files stored by Databricks have the service principal as their owner with permission -rw-r--r--, which forces the effective permission of the rest of the batch users in ADLS from rwx (the directory permission) down to r--, which in turn causes jobs to fail.
Resolution: to resolve this, we need to change the default mask (022) to a custom mask (000) on the Databricks end. You can set the following in the Spark configuration settings under your cluster configuration: spark.hadoop.fs.permissions.umask-mode 000
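For reference (a hedged note on where this goes, not part of the original answer): in the Databricks cluster UI the setting is entered under the cluster's Advanced Options > Spark configuration as a plain key/value line, and should take effect after the cluster is restarted:
spark.hadoop.fs.permissions.umask-mode 000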
Wow, that's great! I was looking for a solution. Passthrough authentication might be a proper solution now.
I had the feeling it was part of this ancient Hadoop bug:
https://issues.apache.org/jira/browse/HDFS-6962 (solved in Hadoop 3, now part of Spark 3+).
Spark tries to set the ACLs after moving the files, but fails. First the files are created somewhere else in a tmp dir, and the tmp dir's rights are inherited by the default ADLS behaviour.

Is it possible to use the Google Drive API to get a file from within a shared .zip file

Assume the following .zip file:
unzip -l myarchive.zip
Archive:  myarchive.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
     3663  1980-00-00 00:00   sub_dir1/file1.txt
     4573  1980-00-00 00:00   sub_dir1/file2.txt
     6021  1980-00-00 00:00   sub_dir2/file1.txt
     6627  1980-00-00 00:00   file1.txt
The following command extracts the file sub_dir1/file1.txt from the .zip file when it is in the file system.
unzip -p myarchive.zip sub_dir1/file1.txt > file1.txt
But if the .zip file is in Google Drive with a shared link (e.g. the fileId is 1234567...v4rzj), is it possible to make a Google Drive API query to get a specific file (e.g. sub_dir1/file1.txt) from within the .zip file?
I am attempting to do a similar action. Take a look at my question here:
How to read file names of items in a Zipped Folder? Google App Script
This portion of the code can unzip the file on Google Drive and place it in any location you need. However, it will run through the entire zip folder.
/// "var zfi" define a zip file iterator ///
while (zfi.hasNext()){ // loops through ZIP file iterator
var file = zfi.next(); // every loop sets active file to next
Logger.log("Zip Folder: %s", file.getName());
var fileBlob = file.getBlob(); // get file blob
fileBlob.setContentType("application/zip");
var unZippedfile = Utilities.unzip(fileBlob); // unzipped file iterator
//// loops all blob elements ////
for (i=0; i<unZippedfile.length; i++) {
var uzf = temp.createFile(unZippedfile[i]);
Google Drive is simply a file storage system; in and of itself it does not have the ability to unzip files in this manner or to inspect the contents of a file. The Google Drive API just gives you the ability to create, update, delete, upload and download files.
Other options:
Your unzip command works on a file stored locally on your machine, so you will need to download the file from Google Drive first and then run your unzip on it.
As you have not mentioned which programming language you intend to use, I recommend checking the documentation for examples.
This is an example using Java; you will need the authorization code as well.
String fileId = "0BwwA4oUTeiV1UVNwOHItT0xfa2M";
OutputStream outputStream = new ByteArrayOutputStream();
driveService.files().get(fileId)
.executeMediaAndDownloadTo(outputStream);
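For completeness, a rough Python sketch of the same approach (assuming an already-authorized google-api-python-client Drive service called drive_service): download the whole .zip first, then extract the single entry locally with the zipfile module.
import io
import zipfile
from googleapiclient.http import MediaIoBaseDownload

file_id = "YOUR_ZIP_FILE_ID"  # placeholder for the shared file's ID
buffer = io.BytesIO()
downloader = MediaIoBaseDownload(buffer, drive_service.files().get_media(fileId=file_id))
done = False
while not done:
    _, done = downloader.next_chunk()  # stream the whole archive into memory

with zipfile.ZipFile(buffer) as archive:
    data = archive.read("sub_dir1/file1.txt")  # pull out just this one entry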

How to create a cblite2 file with a Python script

I was trying to use Couchbase in one of my Android applications. This app will have static data that is scraped from the web. So I want to generate a cblite2 file with my Python script, insert that data, and then use the cblite2 file in Android. I can load data from an existing file according to this. But how can I generate my initial cblite2 file?
You could use the cblite command line tool to create the database. There are a couple of ways to do the import. I'll describe what seems to me like the simplest way.
Have your script save the JSON documents to a directory. For this example, let's call the directory json-documents. Use the desired document ID as the base name, and .json as the extension. For example, if you want a document's ID to be "foo", the filename would be foo.json.
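A minimal sketch of that export step (standard library only, placeholder data): each document is written to json-documents/<doc_id>.json so that the base name becomes the document ID.
import json
import os

documents = {
    "foo": {"type": "example", "value": 1},  # doc ID -> document body (placeholder data)
    "bar": {"type": "example", "value": 2},
}

os.makedirs("json-documents", exist_ok=True)
for doc_id, body in documents.items():
    with open(os.path.join("json-documents", doc_id + ".json"), "w") as fp:
        json.dump(body, fp)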
Then use the cblite cp command to import the documents into a new database:
cblite cp json-documents/ myNewDatabase.cblite2
You can use the other cblite subcommands to verify the import was successful.
List the document IDs:
cblite ls myNewDatabase.cblite2
Display the contents of a document:
cblite cat myNewDatabase.cblite2 someDocumentId

Python os.walk misses a few files to process in the directory

Out of 10 files in the directory, only 8 files are processed and 2 files are not. But if I delete the 8 processed files and rerun with just the 2 missed files, it works. Why is os.walk missing files? Also, is there a way to process all the files in the directory one after another without missing any?
Note: the solution will be used for a folder that contains 100K JSON files.
import os

for root, dirs, files in os.walk('D:/M'):
    for file in files:
        if file.endswith(".json"):
            Strfil = os.path.join(root, file)
            with open(Strfil, 'r') as json_file:
                ...  # per-file processing (omitted in the question)
For file system related things it is better to use the pathlib module.
With pathlib you can do something like this:
from pathlib import Path

json_files = list(Path("D:/M").glob("**/*.json"))
for f in json_files:
    with open(f, 'r') as json_file:
        ...  # process each file as before
I think any file whose path is longer than about 250 characters will be skipped by Windows as 'too long'. What I suggest is to map the network drive to make the path much shorter,
e.g. z:\myfile.xlsx instead of c:\a\b\c\d\e\f\g\myfile.xlsx.
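If mapping a drive isn't an option, a hedged workaround sketch: on Windows, prefixing an absolute path with \\?\ opts into extended-length paths, which sidesteps the classic ~260-character MAX_PATH limit (the deep path below is just a placeholder).
long_path = r"C:\a\b\c\d\e\f\g\myfile.json"   # placeholder deep path
with open("\\\\?\\" + long_path, "r") as fp:  # the \\?\ prefix enables extended-length paths
    data = fp.read()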

Download folder with Google Drive API

I have some data on Google Drive, organized in folders, which I want to propagate to other servers. I have a script for the propagation, but I need to download the data from Google Drive. Is there a method for downloading folders via the Google Drive API that also maintains the whole folder structure?
Folders are also files on Google Drive. The only difference is the MIME type: for folders it is mimeType = 'application/vnd.google-apps.folder'.
There is no single method that will allow you to download everything within a folder. You're going to have to do a files.list, searching for files whose parents collection contains the file ID of your parent folder, using search parameters (tip: '1234567' in parents). This will return a list of the files contained within your folder; then download each one.
Update from a comment: you need to loop through each directory, or just list everything on your Drive account and process the data locally.
File 1 (folder)
----> File 2 (folder )
--------> File 3 (actually a file)
---->File Four (actually a file)
'File 1' in parents
returns everything within the File 1 directory. If the MIME type of a returned item is a folder (mimeType = 'application/vnd.google-apps.folder'), make a request to get its contents:
'File 2' in parents
returns everything within the File 2 directory.
FileList result = service.files().list()
        .setQ("'1NzSAZEwAFARWegk42ANrWQrTopWQTdGB' in parents")
        .setSpaces("drive")
        .setFields("nextPageToken, files(id, name)")
        .setPageToken(pageToken)
        .execute();
The ID in the in parents query term is the fileId of the folder you want to download files from. Doing this you can get all the files in that particular folder.
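To put the recursive approach together, a rough Python sketch (assuming an authorized Drive v3 client called drive_service; error handling and Google-native formats such as Docs/Sheets are omitted) that lists a folder's children, recurses into sub-folders, and downloads plain files into a matching local directory tree:
import io
import os
from googleapiclient.http import MediaIoBaseDownload

FOLDER_MIME = "application/vnd.google-apps.folder"

def download_folder(folder_id, local_dir):
    os.makedirs(local_dir, exist_ok=True)
    page_token = None
    while True:
        result = drive_service.files().list(
            q="'%s' in parents and trashed=false" % folder_id,
            fields="nextPageToken, files(id, name, mimeType)",
            pageToken=page_token,
        ).execute()
        for item in result.get("files", []):
            if item["mimeType"] == FOLDER_MIME:
                # Folders are just files with the folder MIME type, so recurse into them
                download_folder(item["id"], os.path.join(local_dir, item["name"]))
            else:
                # Download a regular file into the mirrored local directory
                request = drive_service.files().get_media(fileId=item["id"])
                with open(os.path.join(local_dir, item["name"]), "wb") as fh:
                    downloader = MediaIoBaseDownload(fh, request)
                    done = False
                    while not done:
                        _, done = downloader.next_chunk()
        page_token = result.get("nextPageToken")
        if page_token is None:
            break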