I have a databricks notebook that is writing a dataframe to a file in ADLS Gen2 storage.
It creates a temp folder, outputs the file and then copies that file to a permanent folder. For some reason the file doesn't inherit the ACL correctly. The folder it creates has the correct ACL.
The code for the notebook:
#Get data into dataframe
df_export = spark.sql(SQL)
# Output the file to a temp directory; coalesce(1) creates a single output data file
(df_export.coalesce(1).write.format("parquet")
.mode("overwrite")
.save(TempFolder))
# Get the parquet file name. It's always the last in the folder, as the other files' names start with _
file = dbutils.fs.ls(TempFolder)[-1][0]
#create permanent copy
dbutils.fs.cp(file,FullPath)
The temp folder that is created shows the expected ACL entries for the relevant account, whereas the file shows a different, reduced set of permissions. There is also a mask; I'm not really familiar with masks, so I'm not sure how this differs. The mask permission on the folder also differs from the mask shown on the file.
Does anyone have any idea why this wouldn't be inheriting the ACL from the parent folder?
I've had a response from Microsoft support which has resolved this issue for me.
Cause: Files written by Databricks are owned by the service principal with permission -rw-r--r--, consequently forcing the effective permission of the rest of the batch users in ADLS down from rwx (the directory permission) to r--, which in turn causes jobs to fail.
Resolution: To resolve this, we need to change the default umask (022) to a custom umask (000) on the Databricks side. You can set the following in the Spark configuration settings under your cluster configuration: spark.hadoop.fs.permissions.umask-mode 000
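For illustration, a minimal sketch of where this setting goes, assuming a standard Databricks cluster; the cluster-level Spark config line is the supported route, while the notebook-level call below relies on an internal PySpark handle and only affects the current session:
# Cluster level: add this single line under the cluster's Spark config
#   spark.hadoop.fs.permissions.umask-mode 000
# Rough notebook-level equivalent (illustrative only; _jsc is an internal
# PySpark attribute, and "spark" is the SparkSession Databricks provides)
spark.sparkContext._jsc.hadoopConfiguration().set("fs.permissions.umask-mode", "000")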
Wow, that's great! I was looking for a solution. Passthrough Authentication might be a proper solution now.
I had the feeling it was part of this ancient Hadoop bug:
https://issues.apache.org/jira/browse/HDFS-6962 (solved in Hadoop 3, now part of Spark 3+).
The files are first created somewhere else, in a tmp dir, and the tmp dir's rights are inherited through the default ADLS behaviour. Spark then tries to set the ACLs after moving the files, but fails.
Related
I have zip files in my container; I get one or more files every day, and as they come in I want to process them. I have some questions.
Can I use the Databricks Autoloader feature to process zip files? Are zip files supported by Autoloader?
What settings need to be enabled to use Autoloader? I have my container and SAS token.
Once the zip file is processed (unzip, read each of the files in the zip file), I should not read the zip again. How can I do this when I use Autoloader? Is there any specific setting?
Are there any samples available? I'm new to this area and trying to get more info.
Unfortunately, processing zip files with Azure Databricks Auto Loader is not possible; zip is not a supported file format.
Auto Loader supports two modes for detecting new files: directory listing and file notification.
Auto Loader provides a Structured Streaming source called cloudFiles.
Given an input directory path on the cloud file storage, the cloudFiles source automatically processes new files as they arrive, with the option of also processing existing files in that directory. Auto Loader can scale from loading data from storage accounts that contain billions of files that need to be backfilled, to pipelines where millions of files are loaded in an hour.
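To make that concrete, here is a minimal PySpark sketch of the cloudFiles source; the container, paths, format and table name are placeholder assumptions, and the checkpoint location is also what lets Auto Loader track files it has already ingested so they are not read again:
# Minimal Auto Loader sketch; paths, format and table name are placeholders
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")  # e.g. json/csv/parquet; zip is not among the supported formats
    .option("cloudFiles.schemaLocation", "abfss://<container>@<account>.dfs.core.windows.net/_schemas/events")
    .load("abfss://<container>@<account>.dfs.core.windows.net/input/"))

(df.writeStream
    .option("checkpointLocation", "abfss://<container>@<account>.dfs.core.windows.net/_checkpoints/events")
    .trigger(availableNow=True)  # process what is currently there, then stop
    .toTable("events_bronze"))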
For more information, you can refer to the Microsoft documentation on Auto Loader.
I am trying to use CSV Data Set Config to get some data from a CSV file to be used in a JMeter script, but I don't want to hardcode the file path as it will change according to the test environment. Is there a way I can pick this path up from system properties, i.e. some export set in my .bashrc file?
Export in my .bashrc:
export NIMBUS4_PERFORMANCE_TEST_REPO=/Users/rahul/Documents/verecloud/performancetest/data/user.csv
I would suggest the following workaround:
Change "Filename" setting of the CSV Data Set Config to following:
${__BeanShell(System.getenv().get("NIMBUS4_PERFORMANCE_TEST_REPO"))}
Where:
System.getenv() - a method which provides access to the underlying operating system's environment variables
__BeanShell() - a JMeter built-in function which allows execution of arbitrary Beanshell code
You could create a soft link at some static path. For example, say we have created a soft link to the /user/data/csvs folder. Suppose you are in ~/Documents; there, run:
ln -s /user/data/csvs
Now we can access it in JMeter, and you also have the flexibility to point the soft link to some other location later.
The only constraint I see is that the name of the linked directory shouldn't change.
Hope this will help!!!
You can have just users.csv if the file is in the same folder as the .jmx itself;
You can have ${location}\users.csv
And in your User Defined Variables you'll have location defined as ${__P(loc)} so it picks up the loc property,
and in non-GUI mode you'll refer to it as
%RUNNER_HOME%\Test.jmx -Jloc=%RUNNER_HOME%\users.csv -Jusers=100 -Jloop=1 -Jrampup=5
I created a new Azure WebJobs project, which is a console app. I placed a settings.json file in the root and I'm trying to access it using the following code, but I keep getting an error that says it cannot locate the file. I think it's looking for it under the Debug folder, but I don't want to move the file there. How do I reference that file?
var config = new Configuration();
config.AddJsonFile("settings.json");
I tried "~/settings.json" but that didn't work either.
You need to identify if it's a deployment or runtime issue, per this article.
Make sure that your file is in fact getting deployed:
In VS, check that it has Copy to Output Directory set to Copy if newer
Use Kudu Console to look at the relevant WebJob folder under D:\home\site\wwwroot\App_Data\jobs\... and make sure that the json file made it to there next to the exe.
You can try adding your JSON file to your WebJob project's Resources:
Remember to set the file type as Text and encoding to UTF-8.
In your code, you can easily access your JSON file as a string, as below:
// The Resources property depends on your actual file name being referenced
var settingsJson = Resources.settings;
Hope this helps!
I am trying to automate weekly generation of a database. As a first step in this process, I need to obtain a set of files from network location M:\. The process is as follows:
Delete any possibly remaining old source files from my local folder (REMOVE_OLD_FILES).
Obtain the names of the required files using regular expressions (GET_FILES).
Copy the files from the network location to my local folder for further processing (COPY/MOVE FILES)
Step 3 is where I run into trouble, I frequently receive the below error:
Error processing files. Exception : org.apache.commons.vfs.FileNotFoundException: Could not read from "file:///M:/FILESOURCE/FILENAME.zip" because it is a not a file.
However, when I manually locate the 'erroneous' file on the network location and try to open or copy it, there are no problems. If I then re-run the Spoon job, no errors occur for this file (although the next file might lead to an error).
So far, I have verified that steps 1 and 2 run correctly: more specifically, there are no errors in the file names returned from step 2.
Obviously, I would prefer not having to manually open all the files first to ensure that Spoon can correctly copy them. Does anyone have an idea what might be causing this behaviour?
For completeness, below are the parameters selected in the COPY/MOVE FILES step.
I was facing the same issue with different clients, and finally I tried a basic approach and it got resolved. It might help in your case as well.
Other users can follow this rule too.
Just try this: create all required folders with the Spoon job entry "Create a folder", and deactivate/delete those hops from your job or transformation once your folders are created.
This happens because the user you are using to delete the file(s) is not recognized as a Windows user. Once your folders are in place you can remove the "Create a folder" steps from your job.
The path to the file is wrong. If you are running Spoon in a Windows environment, you should use the Windows format for file paths. Try changing from
"file:///M:/FILESOURCE/FILENAME.zip"
To
"M:\FILESOURCE\FILENAME.zip"
By the way, it will only work if M: is an actual drive on the machine. If you want to access a file on the network, you should use the network path to the shared folder, like this:
"\\MachineName\M$\FILESOURCE\FILENAME.zip"
or
"\\MachineName\FILESOURCE\FILENAME.zip"
If you try to access a file on a network-mounted drive, it won't work.
:image => StorageRoom::Image.new_with_filename(path)
I have to get the path of the image. So far I have specified the path manually and it worked, but now I have deployed to Heroku and it shows a LoadError - no such file present.
How can I get the path of the file on the local system using the browse button?
Your problem may not be related to path names, but to the fact that Heroku has a read-only file system. If you try to write files onto disk in a Heroku app, it simply doesn't work -- the file will not be saved.
The exception is the "temp" directory. You can save files there, but they are not guaranteed to persist for longer than the duration of a single request.
Is the file you are trying to open actually saved in your Git repo? If so, it will be on the disk in your Heroku app, and you should be able to open it.
To see what the filesystem layout looks like on your Heroku instance, you can create a controller method like:
render :inline => Dir['**/*'].inspect
You can use File.expand_path to turn a relative path into the absolute path of the file on the local system.
Reference: http://saaridev.blogspot.com/2006/11/ruby-finding-absolute-path-of-running.html
You don't need the full path. As far as the file path on the client machine is concerned, for file uploads the path is irrelevant, and exposing it poses security risks for the user.
Most modern browsers don't send the file path for file uploads. You could get the path using JavaScript or Flash, but I still don't see the logic behind doing this.
When a user clicks the submit button, the browser should at least send you the file name along with the file data, together with a bunch of other information like the MIME type. Your web server will either write the file to disk or process it in memory, assuming you have near-infinite memory resources. Look at RFC 1867 (form-based file upload) for more on this.