Loading Multiple CSV files across all subfolder levels with Wildcard file name - csv

I want to Load Multiple CSV files matching certain names into a dataframe. Currently i am looping through the whole folder and creating a list of filenames and then loading those csv's into the dataframe list and then concatenating that dataframe.
The approach i want to use (if possible) is to bypass all the code and read all files in a one liner kind of approach.
I know this can be done easily for single level of subfolders, but my subfolder structure is as follows
Root Folder
|
Subfolder1
|
Subfolder 2
|
X01.csv
Y01.csv
Z01.csv
|
Subfolder3
|
Subfolder4
|
X01.csv
Y01.csv
|
Subfolder5
|
X01.csv
Y01.csv
I want to read all "X01.csv" files while reading from Root Folder.
Is there a way i can read all the required files in code something like the below
filepath = "rootpath" + "/**/X*.csv"
df = spark.read.format("com.databricks.spark.csv").option("recursiveFilelookup","true").option("header","true").load(filepath)
This code works fine for single level of subfolders, is there any equivalent of this for multi level folders ? i thought the "recursiveFilelookup" option would look across all levels of subfolders, but apparently this is not the way it works.
Currently i am getting a
Path not found ... filepath
exception
any help please

Have you tried using the glob.glob function?
You can use it to search for files that match certain criteria inside a root path, and pass the list of files it finds to spark.read.csv function.
For example, I've recreated the folder structure from your example inside a Google Colab environment:
To get a list of all CSV files matching the criteria you've specified, you can use the following code:
import glob
rootpath = './Root Folder/'
# The following line of code looks through all files
# inside the rootpath recursively, trying to match the
# pattern specified. In this case, it tries to find any
# CSV file that starts with the letters X, Y, or Z,
# and ends with 2 numbers (ranging from 0 to 9).
glob.glob(rootpath + "**/[X|Y|Z][0-9][0-9].csv", recursive=True)
# Returns:
# ['./Root Folder/Subfolder5/Y01.csv',
# './Root Folder/Subfolder5/X01.csv',
# './Root Folder/Subfolder1/Subfolder 2/Y01.csv',
# './Root Folder/Subfolder1/Subfolder 2/Z01.csv',
# './Root Folder/Subfolder1/Subfolder 2/X01.csv']
Now you can combine this with spark.read.csv capability of reading a list of files to get the answer you're looking for:
import glob
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
rootpath = './Root Folder/'
spark.read.csv(glob.glob(rootpath + "**/[X|Y|Z][0-9][0-9].csv", recursive=True), inferSchema=True, header=True)
Note
You can specify more general patterns like:
glob.glob(rootpath + "**/*.csv", recursive=True)
To return a list of all csv files inside any subdirectory of rootpath.
Additionally, to consider only the immediate subdirectories files, you could use something like:
glob.glob(rootpath + "*.csv", recursive=True)
Edit
Based on your comments to this answer, does something like this works on Databricks?
from notebookutils import mssparkutils as ms
# databricks has a module called dbutils.fs.ls
# that works similarly to mssparkutils.fs, based on
# the following page of its documentation:
# https://docs.databricks.com/dev-tools/databricks-utils.html#ls-command-dbutilsfsls
def scan_dir(
initial_path: str,
search_str: str,
account_name: str,
):
"""Scan a directory and subdirectories for a string.
Parameters
----------
initial_path : str
The path to start the search. Accepts either a valid container name,
or the entire connection string.
search_str : str
The string to search.
account_name : str
The name of the account to access the container folders.
This value is only used, when the `initial_path`, doesn't
conform with the format: "abfss://<initial_path>#<account_name>.dfs.core.windows.net/"
Raises
------
FileNotFoundError
If the `initial_path` informed doesn't exist.
ValueError
If `initial_path` is not a string.
"""
if not isinstance(initial_path, str):
raise ValueError(
f'`initial_path` needs to be of type string, not {type(initial_path)}'
)
elif not initial_path.startswith('abfss'):
initial_path = f'abfss://{initial_path}#{account_name}.dfs.core.windows.net/'
try:
fdirs = ms.fs.ls(initial_path)
except Py4JJavaError as exc:
raise FileNotFoundError(
f'The path you informed \"{initial_path}\" doesn\'t exist'
) from exc
found = []
for path in fdirs:
p = path.path
if path.isDir:
found = [*found, *scan_dir(p, search_str)]
if search_str.lower() in path.name.lower():
# print(p.split('.net')[-1])
found = [*found, p.replace(path.name, "")]
return list(set(found))
Example:
# Change .parquet to .csv
spark.read.parquet(*scan_dir("abfss://CONTAINER_NAME#ACCOUNTNAME.dfs.core.windows.net/ROOT/FOLDER/", ".parquet"))
This method above worked for on Azure Synapse:

Related

how to import variables from a json file to attributes in BUILD.bazel?

I would like to import variables defined in a json file(my_info.json) as attibutes for bazel rules.
I tried this (https://docs.bazel.build/versions/5.3.1/skylark/tutorial-sharing-variables.html) and works but do not want to use a .bzl file and import variables directly to attributes to BUILD.bazel.
I want to use those variables imported from my_info.json as attributes for other BUILD.bazel files.
projects/python_web/BUILD.bazel
load("//projects/tools/parser:config.bzl", "MY_REPO","MY_IMAGE")
container_push(
name = "publish",
format = "Docker",
registry = "registry.hub.docker.com",
repository = MY_REPO,
image = MY_IMAGE,
tag = "1",
)
Asking the similar in Bazel slack I was informed the is not possible to import variables directly to Bazel and it is needed to parse the json variables and write them into a .bzl file.
I tried also this code but nothing is written in config.bzl file.
my_info.json
{
"MYREPO" : "registry.hub.docker.com",
"MYIMAGE" : "michael/monorepo-python-web"
}
WORKSPACE.bazel
load("//projects/tools/parser:jsonparser.bzl", "load_my_json")
load_my_json(
name = "myjson"
)
projects/tools/parser/jsonparser.bzl
def _load_my_json_impl(repository_ctx):
json_data = json.decode(repository_ctx.read(repository_ctx.path(Label(":my_info.json"))))
config_lines = ["%s = %s" % (key, repr(val)) for key, val in json_data.items()]
repository_ctx.file("config.bzl", "\n".join(config_lines))
load_my_json = repository_rule(
implementation = _load_my_json_impl,
attrs = {},
)
projects/tools/parser/BUILD.bazel
load("#aspect_bazel_lib//lib:yq.bzl", "yq")
load(":config.bzl", "MYREPO", "MY_IMAGE")
yq(
name = "convert",
srcs = ["my_info2.json"],
args = ["-P"],
outs = ["bar.yaml"],
)
Executing:
% bazel build projects/tools/parser:convert
ERROR: Traceback (most recent call last):
File "/Users/michael.taquia/Documents/Personal/Projects/bazel/bazel-projects/multi-language-bazel-monorepo/projects/tools/parser/BUILD.bazel", line 2, column 22, in <toplevel>
load(":config.bzl", "MYREPO", "MY_IMAGE")
Error: file ':config.bzl' does not contain symbol 'MYREPO'
When making troubleshooting I see the execution calls the jsonparser.bzl but never enters to _load_my_json_impl function (based in print statements) and does not write anything to config.bzl.
Notes: Tested on macOS 12.6 (21G115 ) Darwin Kernel Version 21.6.0
There is a better way to do that? A code snippet will be very useful.

make a list of a list out of the header from a csv file

I want to put the header of the csv file in a nested list.
It should have ann output like this:
[[name], [age], [""], [""]]
how can I do this without reading the line again (I am not allowed to and I also am not allowed to use csv module and pandas (all imports except os are forbidden))
Just map the item of the list to list. Check below
def value_to_list(tlist):
l=len(tlist)
for i in range(l):
tlist[i]=[tlist[i]]
return tlist
headers=[]
with open(r"D:\my_projects\DemoProject\test.csv","r") as file :
headers = value_to_list(file.readline().split(","))
test.csv file is "col1,col2,col3"
output :
> python -u "run.py"
[['col1'], ['col2'], ['col3']]
>

jq - How to extract domains and remove duplicates

Given the following json:
Full file here: https://pastebin.com/Hzt9bq2a
{
"name": "Visma Public",
"domains": [
"accountsettings.connect.identity.stagaws.visma.com",
"admin.stage.vismaonline.com",
"api.home.stag.visma.com",
"api.workbox.dk",
"app.workbox.dk",
"app.workbox.co.uk",
"authz.workbox.dk",
"connect.identity.stagaws.visma.com",
"eaccounting.stage.vismaonline.com",
"eaccountingprinting.stage.vismaonline.com",
"http://myservices-api.stage.vismaonline.com/",
"identity.stage.vismaonline.com",
"myservices.stage.vismaonline.com"
]
}
How can I transform the data to the below. Which is, to identify the domains in the format of site.SLD.TLD present and then remove the duplication of them. (Not including the subdomains, protocols or paths as illustrated below.)
{
"name": "Visma Public",
"domains": [
"workbox.co.uk",
"workbox.dk",
"visma.com",
"vismaonline.com"
]
}
I would like to do so in jq as that is what I've used to wrangled the data into this format so far, but at this stage any solution that I can run on Debian (I'm using bash) without any extraneous tooling ideally would be fine.
I'm aware that regex can be used within jq so I assume the best way is to regex out the domain and then pipe to unique however I'm unable to get anything working so far I'm currently trying this version which seems to me to need only the text transformation stage adding in somehow either during the jq process or with a run over with something like awk after the event perhaps:
jq '[.[] | {name: .name, domain: [.domains[]] | unique}]' testfile.json
This appears to be useful: https://github.com/stedolan/jq/issues/537
One solution was offered which does a regex match to extract the last two strings separated by . and call the unique function on that & works up to a point but doesn't cover site.SLD.TLD that has 2 parts. Like google.co.uk would return only co.uk with this jq for example:
jq '.domains |= (map(capture("(?<x>[[:alpha:]]+).(?<z>[[:alpha:]]+)(.?)$") | join(".")) | unique)'
A programming language is much more expressive than jq.
Try the following snippet with python3.
import json
import pprint
import urllib.request
from urllib.parse import urlparse
import os
def get_tlds():
f = urllib.request.urlopen("https://publicsuffix.org/list/effective_tld_names.dat")
content = f.read()
lines = content.decode('utf-8').split("\n")
# remove comments
tlds = [line for line in lines if not line.startswith("//") and not line == ""]
return tlds
def extract_domain(url, tlds):
# get domain
url = url.replace("http://", "").replace("https://", "")
url = url.split("/")[0]
# get tld/sld
parts = url.split(".")
suffix1 = parts[-1]
sld1 = parts[-2]
if len(parts) > 2:
suffix2 = ".".join(parts[-2:])
sld2 = parts[-3]
else:
suffix2 = suffix1
sld2 = sld1
# try the longger first
if suffix2 in tlds:
tld = suffix2
sld = sld2
else:
tld = suffix1
sld = sld1
return sld + "." + tld
def clean(site, tlds):
site["domains"] = list(set([extract_domain(url, tlds) for url in site["domains"]]))
return site
if __name__ == "__main__":
filename = "Hzt9bq2a.json"
cache_path = "tlds.json"
if os.path.exists(cache_path):
with open(cache_path, "r") as f:
tlds = json.load(f)
else:
tlds = get_tlds()
with open(cache_path, "w") as f:
json.dump(tlds, f)
with open(filename) as f:
d = json.load(f)
d = [clean(site, tlds) for site in d]
pprint.pprint(d)
with open("clean.json", "w") as f:
json.dump(d, f)
May I offer you achieving the same query with jtc: the same could be achieved in other languages (and of course in jq) - the query is mostly how to come up with the regex to satisfy your ask:
bash $ <file.json jtc -w'<domains>l:>((?:[a-z0-9]+\.)?[a-z0-9]+\.[a-z0-9]+)[^.]*$<R:' -u'{{$1}}' /\
-ppw'<domains>l:><q:' -w'[domains]:<[]>j:' -w'<name>l:'
{
"domains": [
"stagaws.visma.com",
"stage.vismaonline.com",
"stag.visma.com",
"api.workbox.dk",
"app.workbox.dk",
"workbox.co.uk",
"authz.workbox.dk"
],
"name": "Visma Public"
}
bash $
Note: it does extract only DOMAIN.TLD, as per your ask. If you like to extract DOMAIN.SLD.TLD, then the task becomes a bit less trivial.
Update:
Modified solution as per the comment: extract domain.sld.tld where 3 or more levels and domain.tld where there’s only 2
PS. I'm the creator of the jtc - JSON processing utility. This disclaimer is SO requirement.
One of the solutions presented on this page offers that:
A programming language is much more expressive than jq.
It may therefore be worthwhile pointing out that jq is an expressive, Turing-complete programming language, and that it would be as straightforward (and as tedious) to capture all the intricacies of the "Public Suffix List" using jq as any other programming language that does not already provide support for this list.
It may be useful to illustrate an approach to the problem that passes the (revised) test presented in the Q. This approach could easily be extended in any one of a number of ways:
def extract:
sub("^[^:]*://";"")
| sub("/.*$";"")
| split(".")
| (if (.[-1]|length) == 2 and (.[-2]|length) <= 3
then -3 else -2 end) as $ix
| .[$ix : ]
| join(".") ;
{name, domain: (.domains | map(extract) | unique)}
Output
{
"name": "Visma Public",
"domain": [
"visma.com",
"vismaonline.com",
"workbox.co.uk",
"workbox.dk"
]
}
Judging from your example, you don't actually want top-level domains (just one component, e.g. ".com"), and you probably don't really want second-level domains (last two components) either, because some domain registries don't operate at the TLD level. Given www.foo.com.br, you presumably want to find out about foo.com.br, not com.br.
To do that, you need to consult the Public Suffix List. The file format isn't too complicated, but it has support for wildcards and exceptions. I dare say that jq isn't the ideal language to use here — pick one that has a URL-parsing module (for extracting hostnames) and an existing Public Suffix List module (for extracting the domain parts from those hostnames).

writer.writerow() doesn't write to the correct column

I have three DynamoDB tables. Two tables have instance IDs that are part of an application and the other is a master table of all instances across all of my accounts and the tag metadata. I have two scans for the two tables to get the instance IDs and then query the master table for the tag metadata. However, when I try writing this to the CSV file, I want to have two separate header sections for each dynamo table's unique output. Once the first iteration is done, the second file write writes to the last row where the first iteration left off instead of starting over at the top in the second header section. Below is my code and an output example to make it clear.
CODE:
import boto3
import csv
import json
from boto3.dynamodb.conditions import Key, Attr
dynamo = boto3.client('dynamodb')
dynamodb = boto3.resource('dynamodb')
s3 = boto3.resource('s3')
# Required resource and client calls
all_instances_table = dynamodb.Table('Master')
missing_response = dynamo.scan(TableName='T1')
installed_response = dynamo.scan(TableName='T2')
# Creates CSV DictWriter object and fieldnames
with open('file.csv', 'w') as csvfile:
fieldnames = ['Agent Not Installed', 'Not Installed Account', 'Not Installed Tags', 'Not Installed Environment', " ", 'Agent Installed', 'Installed Account', 'Installed Tags', 'Installed Environment']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
# Find instances IDs from the missing table in the master table to pull tag metadata
for instances in missing_response['Items']:
instance_missing = instances['missing_instances']['S']
#print("Missing:" + instance_missing)
query_missing = all_instances_table.query(KeyConditionExpression=Key('ID').eq(instance_missing))
for item_missing in query_missing['Items']:
missing_id = item_missing['ID']
missing_account = item_missing['Account']
missing_tags = item_missing['Tags']
missing_env = item_missing['Environment']
# Write the data to the CSV file
writer.writerow({'Agent Not Installed': missing_id, 'Not Installed Account': missing_account, 'Not Installed Tags': missing_tags, 'Not Installed Environment': missing_env})
# Find instances IDs from the installed table in the master table to pull tag metadata
for instances in installed_response['Items']:
instance_installed = instances['installed_instances']['S']
#print("Installed:" + instance_installed)
query_installed = all_instances_table.query(KeyConditionExpression=Key('ID').eq(instance_installed))
for item_installed in query_installed['Items']:
installed_id = item_installed['ID']
print(installed_id)
installed_account = item_installed['Account']
installed_tags = item_installed['Tags']
installed_env = item_installed['Environment']
# Write the data to the CSV file
writer.writerow({'Agent Installed': installed_id, 'Installed Account': installed_account, 'Installed Tags': installed_tags, 'Installed Environment': installed_env})
OUTPUT:
This is what the columns/rows look like in the file.
I need all of the output to be on the same line for each header section.
DATA:
Here is a sample of what both tables look like.
SAMPLE OUTPUT:
Here is what the for loops print out and appends to the lists.
Missing:
i-0xxxxxx 333333333 foo#bar.com int
i-0yyyyyy 333333333 foo1#bar.com int
Installed:
i-0zzzzzz 44444444 foo2#bar.com int
i-0aaaaaa 44444444 foo3#bar.com int
You want to collect related rows together into a single list to write on a single row, something like:
missing = [] # collection for missing_responses
installed = [] # collection for installed_responses
# Find instances IDs from the missing table in the master table to pull tag metadata
for instances in missing_response['Items']:
instance_missing = instances['missing_instances']['S']
#print("Missing:" + instance_missing)
query_missing = all_instances_table.query(KeyConditionExpression=Key('ID').eq(instance_missing))
for item_missing in query_missing['Items']:
missing_id = item_missing['ID']
missing_account = item_missing['Account']
missing_tags = item_missing['Tags']
missing_env = item_missing['Environment']
# Update first half of row with missing list
missing.append(missing_id, missing_account, missing_tags, missing_env)
# Find instances IDs from the installed table in the master table to pull tag metadata
for instances in installed_response['Items']:
instance_installed = instances['installed_instances']['S']
#print("Installed:" + instance_installed)
query_installed = all_instances_table.query(KeyConditionExpression=Key('ID').eq(instance_installed))
for item_installed in query_installed['Items']:
installed_id = item_installed['ID']
print(installed_id)
installed_account = item_installed['Account']
installed_tags = item_installed['Tags']
installed_env = item_installed['Environment']
# update second half of row by updating installed list
installed.append(installed_id, installed_account, installed_tags, installed_env)
# combine your two lists outside a loop
this_row = []
i = 0;
for m in missing:
# iterate through the first half to concatenate with the second half
this_row.append( m + installed[i] )
i = i +1
# adding an empty column after the write operation, manually, is optional
# Write the data to the CSV file
writer.writerow(this_row)
This will work if your installed and missing tables operate on a relatable field - like a timestamp or an account ID, something that you can ensure keeps the rows being concatenated in the same order. A data sample would be useful to really answer the question.

Get a setting value, from file located in non-standard dir

I have Sublime's directory structure like this:
Packages
|-- Foo
| |-- Markdown.sublime-settings
|
|-- Bar
| |-- plugin.py
|
|-- User
|-- Markdown.sublime-settings
Then, I'm trying to get a wrap_width value, stored in Foo/Markdown.sublime-setting. For some reason, it seems that load_setting method doesn't work, although save_settings works fine.
import sublime
import sublime_plugin
class MarkdownSettings(sublime_plugin.EventListener):
def on_activated(self, view):
path = view.file_name()
if path:
e = view.file_name().split('.')[1]
if e == ("md" or "mmd"):
# Simple test. It works
x = sublime.load_settings("Markdown.sublime-settings")
wrap_width = x.get("wrap_width")
print(wrap_width) # Prints 50
# If I change directory to "../Foo", `load_setting` method would not work
x = sublime.load_settings("../Foo/Markdown.sublime-settings")
wrap_width = x.get("wrap_width")
print(wrap_width) # Prints None
# The code below is added just for demonstration purposes,
# to show that `save_setting` method works fine.
x = sublime.load_settings("../Foo/Markdown.sublime-settings")
x.set("wrap_width", 20)
sublime.save_settings("../Foo/Markdown.sublime-settings") # File updated
How I could get wrap_width value stored in Foo/Markdown.sublime-settings?
Using a path with load_settings is not supported.
From http://www.sublimetext.com/docs/3/api_reference.html#sublime:
Loads the named settings. The name should include a file name and extension, but not a path. The packages will be searched for files matching the base_name, and the results will be collated into the settings object. Subsequent calls to load_settings() with the base_name will return the same object, and not load the settings from disk again.
If you really need to do this, you should use sublime.decode_value(sublime.load_resource('Packages/Foo/Markdown.sublime-settings')) instead.