I am reading CSV files from a Data Lake store using multiple paths, but if any one of the paths does not exist an exception is thrown. I want to avoid this exception.
I think if you check for multiple paths, the check will fail as soon as one path does not exist. Perhaps you could try a different approach.
For the given example, if you want to subselect subfolders you could try the following instead.
Read sub-directories of a given directory:
# list all subfolders and files in directory demo
dir = dbutils.fs.ls("/mnt/adls2/demo")
Filter out the relevant sub-directories:
paths = []
for file_info in dir:  # dbutils.fs.ls returns a list of FileInfo objects
    subpath = file_info.path
    # select the dirs to read; the parentheses matter, since `and` binds tighter than `or`
    if ('/corr' in subpath or '/deci' in subpath) and subpath.startswith('dbfs:/'):
        paths.append(subpath)
Use the resulting list to read the dataframe:
df = (spark.read
      .csv(paths))
Where path is a specific variable, you can check whether it exists like this (Scala):
dbutils.fs.ls("/mnt").map(_.name).contains(s"$path/")
Note that you must use block blobs for a mount point to work from Databricks and to be able to see the blobs in it.
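For completeness, here is a minimal Python sketch of the same idea applied directly to the original problem: probe each candidate path with dbutils.fs.ls, which raises an exception for missing paths, and read only the ones that exist. The candidate paths below are made up for the example.

# hypothetical candidate paths - replace with your own
candidate_paths = ["/mnt/adls2/demo/corr", "/mnt/adls2/demo/deci"]

def path_exists(p):
    # dbutils.fs.ls raises an exception when the path is missing,
    # so treat that as "does not exist"
    try:
        dbutils.fs.ls(p)
        return True
    except Exception:
        return False

existing = [p for p in candidate_paths if path_exists(p)]
df = spark.read.csv(existing)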
I'm using SparkKubernetesOperator, which has a template_field called application_file. Normally, on giving this field a file name, Airflow reads that file and templates the Jinja variables in it (just like the script field in the BashOperator).
So this works and the file information is shown in the Rendered Template tab with the jinja variables replaced with the correct values.
start_streaming = SparkKubernetesOperator(
    task_id='start_streaming',
    namespace='spark',
    application_file='user_profiles_streaming_dev.yaml',
    ...
    dag=dag,
)
I want to use different files in the application_file field for different environments, so I used a Jinja template in the field. But when I change application_file to user_profiles_streaming_{{ var.value.env }}.yaml, the rendered output is just the string user_profiles_streaming_dev.yaml and not the file contents.
I know that recursive Jinja variable replacement is not possible in Airflow, but I was wondering if there is any workaround for having different template files.
What I have tried -
I tried using a different operator and doing an XCom push to read the file contents and send them to SparkKubernetesOperator. While this was good for reading different files based on the environment, it did not solve the issue of having the Jinja variables replaced.
I also tried making a custom operator which inherits from SparkKubernetesOperator and has a template_field application_file_name, thinking that Jinja replacement would take place twice, but this didn't work either.
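For reference, a sketch of what that attempted custom operator could look like; the class name and extra field are my own inventions, and as noted above the extra templated field still only gets one rendering pass, so it does not solve the nested replacement:

from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import SparkKubernetesOperator

class EnvAwareSparkKubernetesOperator(SparkKubernetesOperator):
    # hypothetical extra field so Jinja renders the file *name*;
    # the file *contents* are still not rendered a second time
    template_fields = tuple(SparkKubernetesOperator.template_fields) + ('application_file_name',)

    def __init__(self, *, application_file_name, **kwargs):
        super().__init__(application_file=application_file_name, **kwargs)
        self.application_file_name = application_file_name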
I made an env file which held the environment name (dev/prod). Then I added this code to the start of my DAG file:
ENV = None

with open('/home/airflow/env', 'r') as env_file:
    value = env_file.read().strip()  # strip the trailing newline
    if not value:
        raise Exception("ENV FILE NOT PRESENT")
    ENV = value
and then accessed the environment in the code like this:
submit_job = SparkKubernetesOperator(
    task_id='submit_job',
    namespace="spark",
    application_file=f"adhoc_{ENV}.yaml",
    do_xcom_push=True,
    dag=dag,
)
This way I could have separate dev and prod files.
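A variant of the same idea, assuming you keep the environment in an Airflow Variable instead of a file (a sketch, not what I ran): Variable.get is resolved when the DAG file is parsed, so the operator receives a concrete filename just like with the env file.

from airflow.models import Variable

# resolved at DAG-parse time, before any templating happens
ENV = Variable.get("env", default_var="dev")

submit_job = SparkKubernetesOperator(
    task_id='submit_job',
    namespace='spark',
    application_file=f"adhoc_{ENV}.yaml",
    do_xcom_push=True,
    dag=dag,
)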
I have about 100 input files which, after processing, generate more than 2000 output files. I would like to name the output files based on the names of the input files.
Here is the command I run:
Start cmd /k "G:\Path\eachGeo.bat G:\Path\InputGeo\*.csv"
The input files are read via cmd by executing the .bat file. Output is stored at a different path:
outputfilename = 'Path\outputGeo\\' + Time.now.to_i.to_s +
                 '_' + eachTag[45..54] + '_output.csv'
In the code above I am using Time.now.to_i.to_s to name the output files based on the current system time.
I would like to change this to be the name of the input file.
Normally you'd tackle it like this, using File.basename to extract the relevant part of the original file path:
require 'csv'

Dir.glob("path/*.csv") do |path|
  CSV.open(path) do |csv_in|
    # ...
    out_path = "output_path/%s_%s.csv" % [
      File.basename(path, ".csv"),
      each_tag[45..54]
    ]
    CSV.open(out_path, "w") do |csv_out|
      # ...
    end
  end
end
This is a really simple example. I'd avoid putting your output files in the same directory as the input ones so you don't mistakenly read them in again when you run the program a second time.
I'm trying to use a loop to read in multiple CSVs (for now, but a mix of CSV and XLS files in the future).
I'd like each pandas data frame in my folder to have the same name as its source file, excluding the file extension.
import os
import pandas as pd
files = list(filter(os.path.isfile, os.listdir(os.curdir)))
files  # this shows a list of the files that I want to use/have in my directory - they are all CSVs if that matters
# i want to load these into pandas data frames with the corresponding filenames
# not sure if this is the right approach....
# but what is wrong is the variable is named 'weather_today.csv'... i need to drop the .csv or .xlsx or whatever it might be
for each_file in files:
    frame = pd.read_csv(each_file)
    each_file = frame
Bernie's answer seems great, but there's one problem:
for each_file in files:
    frame = pd.read_csv(each_file)
    filename_only = os.path.splitext(each_file)[0]
    # Right below I am assigning my looped data frame to the literal variable name
    # "filename_only" rather than the value that filename_only represents,
    # i.e. rather than what happens if I print(filename_only)
    filename_only = frame
For example, if my two files are weather_today.csv and earthquakes.csv (in that order) in my files list, then neither an 'earthquakes' nor a 'weather_today' variable will be created.
However, if I simply type filename_only and press enter in Python, I will see the earthquakes dataframe. If I have 100 files, then only the last data frame in the loop will be bound, under the name filename_only; the other 99 won't exist, because the previous assignments are never made under their own names and the 100th one overwrites them.
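To make the point concrete: an assignment can only rebind the name filename_only itself. To create a variable whose name is the *value* of filename_only, you would have to write into a namespace explicitly, e.g. via globals() - shown here only to illustrate why the plain assignment can't work; the dictionary approach below is the sane fix:

for each_file in files:
    frame = pd.read_csv(each_file)
    # creates e.g. a module-level variable named 'earthquakes'
    globals()[os.path.splitext(each_file)[0]] = frame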
You can use os.path.splitext() for this to "split the pathname path into a pair (root, ext) such that root + ext == path, and ext is empty or begins with a period and contains at most one period."
for each_file in files:
    frame = pd.read_csv(each_file)
    filename_only = os.path.splitext(each_file)[0]
    filename_only = frame
As asked in a comment, if you'd like a way to filter for just CSV files, you can do something like this:
files = [file for file in os.listdir(os.curdir) if file.endswith(".csv")]
Use a dictionary to store your frames:
frames = {}
for each_file in files:
    frames[os.path.splitext(each_file)[0]] = pd.read_csv(each_file)
Now you can get the DataFrame of your choice with:
frames[filename_without_ext]
Simple, right? Be careful about RAM usage, though: reading a bunch of files can quickly fill up system memory and cause a crash.
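Since you mentioned mixing in XLS files later, here is a sketch of the same dictionary approach dispatching on the extension; it assumes pandas' read_excel and an Excel engine (e.g. openpyxl) are installed:

import os
import pandas as pd

frames = {}
for each_file in filter(os.path.isfile, os.listdir(os.curdir)):
    root, ext = os.path.splitext(each_file)
    if ext == ".csv":
        frames[root] = pd.read_csv(each_file)
    elif ext in (".xls", ".xlsx"):
        frames[root] = pd.read_excel(each_file)  # requires an Excel engine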
Assume we have a directory with a structure like this; I marked directories with (+) and files with (-):
rootdir
  + a
    + a1
      - f1
      - f2
    + a2
      - f3
  + b
    + b1
      + b2
        - f4
        - f5
        - f6
    + b3
      - f7
      - f8
and a given list of files like
/a/a1/f1
/b/b1/b2/f5
/b/b3/f7
I am struggling to find a way to remove every file inside the root directory except the ones in the given list. After the program executes, the root directory should look like this:
rootdir
  + a
    + a1
      - f1
  + b
    + b1
      + b2
        - f5
    + b3
      - f7
This example is just to make the problem easier to understand. In reality, the given list includes around four thousand files, and the root directory is ~15 GB in size with hundreds of thousands of files inside.
It would be easy to search inside a folder and remove the files that match a given list; let's say we need to solve the inverse problem: keeping the files that match the list.
Programs written in Perl/Python are preferred.
First, store your list of files you want to keep inside an associative container like a Python dict or a map of some kind.
Second, simply iterate (in Python, os.walk) over the entire directory structure, and every time you see a file, check if it is in the associative container of paths to keep. If not, delete it (in Python, os.unlink).
Alternatively:
First, create a temporary directory on the same filesystem.
Second, move (os.renames, which generates new subdirectories as needed) all the "keep" files to the temporary directory, with the same structure.
Third, overwrite (os.removedirs followed by os.rename, or just shutil.move) the original directory with the temporary one.
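A minimal sketch of that alternative, assuming the keep-list paths are relative to rootdir and that rootdir's parent is writable (the directory names here are placeholders):

import os
import shutil

rootdir = '/path/to/rootdir'   # placeholder
tmpdir = rootdir + '.tmp'      # temporary tree on the same filesystem
keep = ['/a/a1/f1', '/b/b1/b2/f5', '/b/b3/f7']

# move every keep-file, creating intermediate directories as needed
for rel in keep:
    os.renames(rootdir + rel, tmpdir + rel)

shutil.rmtree(rootdir)         # everything left behind is unwanted
os.rename(tmpdir, rootdir)     # the temp tree becomes the new rootdir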
The os.walk path:
import os

keep = set(['/a/a1/f1', '/b/b1/b2/f5', '/b/b3/f7'])

for dirpath, dirnames, filenames in os.walk('./'):
    for name in filenames:
        # strip the leading '.' so the path matches the keep-list format
        path = os.path.join(dirpath, name).lstrip('.')
        print('check ' + path)
        if path not in keep:
            print('delete ' + path)
        else:
            print('keep ' + path)
It doesn't do anything except inform you. I don't think os.walk is too slow, and it gives you the option of keeping files by regex patterns or any other criteria.
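For instance, a sketch of the same walk keeping files by a regex instead of an exact set (the pattern is made up for the example tree, and the os.unlink line does real deletion):

import os
import re

keep_re = re.compile(r'/f[157]$')  # hypothetical: keep f1, f5, f7

for dirpath, dirnames, filenames in os.walk('./'):
    for name in filenames:
        path = os.path.join(dirpath, name)
        if not keep_re.search(path):
            os.unlink(path)  # delete; comment this out for a dry run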
This is working code for your problem:
import os

def list_files(directory):
    for root, dirs, files in os.walk(directory):
        for name in files:
            yield os.path.join(root, name)

files_to_delete = {'/home/vedang/Desktop/a.out', '/home/vedang/Desktop/ABC/temp.txt'}  # keep a set instead of a list for faster lookups

for f in list_files('/home/vedang/Desktop'):
    if f in files_to_delete:
        os.unlink(f)
Here is a function which accepts a set of files you wish to keep and the root directory from which you wish to begin deleting files.
It's a classic recursive depth-first search that removes empty directories after deleting all the unwanted files:
import os

def delete_files(keep_list: set, curr_dir):
    files = os.listdir(curr_dir)
    for f in files:
        path = f"{curr_dir}/{f}"
        if os.path.isfile(path):
            if path not in keep_list:
                os.remove(path)
        elif os.path.islink(path):
            os.unlink(path)
        elif os.path.isdir(path):
            delete_files(keep_list, path)
    # after recursing, remove this directory if it is now empty
    files = os.listdir(curr_dir)
    if not files:
        os.rmdir(curr_dir)
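Hypothetical usage for the example tree, assuming the keep list is written in the same absolute form the function builds with f"{curr_dir}/{f}":

keep = {'/rootdir/a/a1/f1', '/rootdir/b/b1/b2/f5', '/rootdir/b/b3/f7'}
delete_files(keep, '/rootdir')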
Here is a solution from a different angle. Suppose we are in a Linux environment. First,
find . -type f
gives a long list of every file path under the current directory. Second, suppose we have the keep list; even at your volume (say thousands of entries) we can just append it to the previous listing and run
| sort | uniq -c | awk '$1 == 1 {print $2}'
to get the to-delete list (paths that appear only once are not in the keep list). And third,
| xargs rm
actually does the deletion. This assumes the file names contain no whitespace.
I have a problem when I want to add a label to a Node or to a Relationship.
I do this in Neo4j with Cypher:
LOAD CSV WITH HEADERS FROM "file:c:/Users/Test/test.csv" AS line
CREATE (n:line.FROM)
and I get this error:
Invalid input '.': expected an identifier character, whitespace, NodeLabel, a property map, ')' or a relationship pattern (line 2, column 15 (offset: 99))
"CREATE (n:line.FROM)"
If there is no way of doing this with the Cypher language, can you recommend another way to do the job?
It is very important to find a solution to this problem, whether it is a Cypher solution or some Java way to do the job...
It depends on how dynamic you need it to be; for small variability:
LOAD CSV WITH HEADERS FROM "file:c:/Users/Test/test.csv" AS line
WITH line WHERE line.FROM = "Foo"
CREATE (n:Foo)
From Java you can use node.addLabel(DynamicLabel.label(line.from))
Otherwise you can look into my neo4j-shell-tools, which allow dynamic labels and rel-types with #{FROM}.
See: https://github.com/jexp/neo4j-shell-tools#cypher-import
Thank you all for your answers, but none of them helped me solve my problem.
I found a solution that does exactly what I wanted. The solution was the Neo4jImport tool (see the Neo4jImport tool page in the official manual), not the Cypher language or Java.
So here is an example of what I did, which worked for me.
A test.csv file contains the headers "PropertyTest" and ":LABEL". The import first creates one node with the label "TEST" and then adds the "proptest" property to that node. So to add a label to your node you use the :LABEL header, and to add a property to the same node you add any name you want as a header in the .csv file.
Example of test.csv file:
PropertyTest,:LABEL
proptest,TEST
On Windows, I ran the Neo4jImport.bat command as described in the manual page of Neo4j. You can find Neo4jImport.bat at "C:\Program Files\Neo4j Community\bin" and run it from the command line (cmd).
In detail: I opened cmd, navigated to the path of Neo4jImport.bat, and finally wrote:
Neo4jImport.bat --into path-to-save-your-neo4j-database --nodes path-to-your-csv\test.csv
--delimiter ","
The default delimiter of Neo4jImport is "," but you can change it. For example, if the information in your .csv file is separated by tabs, you can do the following:
Neo4jImport.bat --into path-to-save-your-neo4j-database --nodes path-to-your-csv\test.csv
--delimiter "TAB"
That is how I dynamically loaded a whole model of almost 2,000 nodes with different labels and properties.
Keep in mind from the manual that you can add as many labels and as many properties as you want to a node by adding more headers to your csv.
Example of two Labels in a node:
PropertyTest,:LABEL,:LABEL
proptest,TEST,SECOND_LABEL
Example of Neo4jImport.bat for two labels and a comma-separated CSV file:
Neo4jImport.bat --into path-to-save-your-neo4j-database --nodes path-to-your-csv\test.csv
--delimiter ","
I hope you will find this useful for this particular problem of labels from .csv files, and please read the official manual; it helped me a lot in finding a solution for my problem.
Below is the way for two csv files MIP_nodes.csv and MIP_edges.csv:
//Load csv data into the database - with dynamic label(s)
WITH "file:///MIP_nodes.csv" AS uri
LOAD CSV WITH HEADERS FROM uri AS row
WITH * WHERE row.label <> ""
CALL apoc.merge.node([row.label], {nodeId: row.nodeId, name: row.name, type: row.type, created: row.created, property1: row.property1, property2: row.property2})
YIELD node AS n1
//RETURN n1
WITH * WHERE row.label = ""
CALL apoc.merge.node(['DefaultNode'], {nodeId: row.nodeId, name: row.name, type: row.type, created: row.created, property1: row.property1, property2: row.property2})
YIELD node AS n2
RETURN n1, n2
//Load csv data into the database - with dynamic relationship(s)
//:auto USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'file:///MIP_edges.csv' AS row
MATCH (s)
WHERE s.nodeId = row.sourceId
//RETURN s
MATCH (d)
WHERE d.nodeId = row.destinationId
//RETURN d
CALL apoc.merge.relationship(s, row.label,{type:row.type, created: row.created, property1: row.property1, property2: row.property2},{}, d,{})
YIELD rel
//REMOVE rel.noOp;
RETURN rel;
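If you would rather drive the same APOC calls from Python, here is a hedged sketch using the official neo4j driver; the URI, credentials, and the reduced property map are placeholders:

from neo4j import GraphDatabase

# placeholder connection details
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

node_query = """
LOAD CSV WITH HEADERS FROM 'file:///MIP_nodes.csv' AS row
WITH row WHERE row.label <> ''
CALL apoc.merge.node([row.label], {nodeId: row.nodeId, name: row.name})
YIELD node
RETURN count(node) AS created
"""

with driver.session() as session:
    # run the import and report how many nodes were merged
    print(session.run(node_query).single()["created"])

driver.close()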