I would like to import variables defined in a json file(my_info.json) as attibutes for bazel rules.
I tried this (https://docs.bazel.build/versions/5.3.1/skylark/tutorial-sharing-variables.html) and works but do not want to use a .bzl file and import variables directly to attributes to BUILD.bazel.
I want to use those variables imported from my_info.json as attributes for other BUILD.bazel files.
projects/python_web/BUILD.bazel
load("//projects/tools/parser:config.bzl", "MY_REPO","MY_IMAGE")
container_push(
name = "publish",
format = "Docker",
registry = "registry.hub.docker.com",
repository = MY_REPO,
image = MY_IMAGE,
tag = "1",
)
Asking the similar in Bazel slack I was informed the is not possible to import variables directly to Bazel and it is needed to parse the json variables and write them into a .bzl file.
I tried also this code but nothing is written in config.bzl file.
my_info.json
{
"MYREPO" : "registry.hub.docker.com",
"MYIMAGE" : "michael/monorepo-python-web"
}
WORKSPACE.bazel
load("//projects/tools/parser:jsonparser.bzl", "load_my_json")
load_my_json(
name = "myjson"
)
projects/tools/parser/jsonparser.bzl
def _load_my_json_impl(repository_ctx):
json_data = json.decode(repository_ctx.read(repository_ctx.path(Label(":my_info.json"))))
config_lines = ["%s = %s" % (key, repr(val)) for key, val in json_data.items()]
repository_ctx.file("config.bzl", "\n".join(config_lines))
load_my_json = repository_rule(
implementation = _load_my_json_impl,
attrs = {},
)
projects/tools/parser/BUILD.bazel
load("#aspect_bazel_lib//lib:yq.bzl", "yq")
load(":config.bzl", "MYREPO", "MY_IMAGE")
yq(
name = "convert",
srcs = ["my_info2.json"],
args = ["-P"],
outs = ["bar.yaml"],
)
Executing:
% bazel build projects/tools/parser:convert
ERROR: Traceback (most recent call last):
File "/Users/michael.taquia/Documents/Personal/Projects/bazel/bazel-projects/multi-language-bazel-monorepo/projects/tools/parser/BUILD.bazel", line 2, column 22, in <toplevel>
load(":config.bzl", "MYREPO", "MY_IMAGE")
Error: file ':config.bzl' does not contain symbol 'MYREPO'
When making troubleshooting I see the execution calls the jsonparser.bzl but never enters to _load_my_json_impl function (based in print statements) and does not write anything to config.bzl.
Notes: Tested on macOS 12.6 (21G115 ) Darwin Kernel Version 21.6.0
There is a better way to do that? A code snippet will be very useful.
I want to Load Multiple CSV files matching certain names into a dataframe. Currently i am looping through the whole folder and creating a list of filenames and then loading those csv's into the dataframe list and then concatenating that dataframe.
The approach i want to use (if possible) is to bypass all the code and read all files in a one liner kind of approach.
I know this can be done easily for single level of subfolders, but my subfolder structure is as follows
Root Folder
|
Subfolder1
|
Subfolder 2
|
X01.csv
Y01.csv
Z01.csv
|
Subfolder3
|
Subfolder4
|
X01.csv
Y01.csv
|
Subfolder5
|
X01.csv
Y01.csv
I want to read all "X01.csv" files while reading from Root Folder.
Is there a way i can read all the required files in code something like the below
filepath = "rootpath" + "/**/X*.csv"
df = spark.read.format("com.databricks.spark.csv").option("recursiveFilelookup","true").option("header","true").load(filepath)
This code works fine for single level of subfolders, is there any equivalent of this for multi level folders ? i thought the "recursiveFilelookup" option would look across all levels of subfolders, but apparently this is not the way it works.
Currently i am getting a
Path not found ... filepath
exception
any help please
Have you tried using the glob.glob function?
You can use it to search for files that match certain criteria inside a root path, and pass the list of files it finds to spark.read.csv function.
For example, I've recreated the folder structure from your example inside a Google Colab environment:
To get a list of all CSV files matching the criteria you've specified, you can use the following code:
import glob
rootpath = './Root Folder/'
# The following line of code looks through all files
# inside the rootpath recursively, trying to match the
# pattern specified. In this case, it tries to find any
# CSV file that starts with the letters X, Y, or Z,
# and ends with 2 numbers (ranging from 0 to 9).
glob.glob(rootpath + "**/[X|Y|Z][0-9][0-9].csv", recursive=True)
# Returns:
# ['./Root Folder/Subfolder5/Y01.csv',
# './Root Folder/Subfolder5/X01.csv',
# './Root Folder/Subfolder1/Subfolder 2/Y01.csv',
# './Root Folder/Subfolder1/Subfolder 2/Z01.csv',
# './Root Folder/Subfolder1/Subfolder 2/X01.csv']
Now you can combine this with spark.read.csv capability of reading a list of files to get the answer you're looking for:
import glob
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
rootpath = './Root Folder/'
spark.read.csv(glob.glob(rootpath + "**/[X|Y|Z][0-9][0-9].csv", recursive=True), inferSchema=True, header=True)
Note
You can specify more general patterns like:
glob.glob(rootpath + "**/*.csv", recursive=True)
To return a list of all csv files inside any subdirectory of rootpath.
Additionally, to consider only the immediate subdirectories files, you could use something like:
glob.glob(rootpath + "*.csv", recursive=True)
Edit
Based on your comments to this answer, does something like this works on Databricks?
from notebookutils import mssparkutils as ms
# databricks has a module called dbutils.fs.ls
# that works similarly to mssparkutils.fs, based on
# the following page of its documentation:
# https://docs.databricks.com/dev-tools/databricks-utils.html#ls-command-dbutilsfsls
def scan_dir(
initial_path: str,
search_str: str,
account_name: str,
):
"""Scan a directory and subdirectories for a string.
Parameters
----------
initial_path : str
The path to start the search. Accepts either a valid container name,
or the entire connection string.
search_str : str
The string to search.
account_name : str
The name of the account to access the container folders.
This value is only used, when the `initial_path`, doesn't
conform with the format: "abfss://<initial_path>#<account_name>.dfs.core.windows.net/"
Raises
------
FileNotFoundError
If the `initial_path` informed doesn't exist.
ValueError
If `initial_path` is not a string.
"""
if not isinstance(initial_path, str):
raise ValueError(
f'`initial_path` needs to be of type string, not {type(initial_path)}'
)
elif not initial_path.startswith('abfss'):
initial_path = f'abfss://{initial_path}#{account_name}.dfs.core.windows.net/'
try:
fdirs = ms.fs.ls(initial_path)
except Py4JJavaError as exc:
raise FileNotFoundError(
f'The path you informed \"{initial_path}\" doesn\'t exist'
) from exc
found = []
for path in fdirs:
p = path.path
if path.isDir:
found = [*found, *scan_dir(p, search_str)]
if search_str.lower() in path.name.lower():
# print(p.split('.net')[-1])
found = [*found, p.replace(path.name, "")]
return list(set(found))
Example:
# Change .parquet to .csv
spark.read.parquet(*scan_dir("abfss://CONTAINER_NAME#ACCOUNTNAME.dfs.core.windows.net/ROOT/FOLDER/", ".parquet"))
This method above worked for on Azure Synapse:
I'm writing a script that will check the CVS COVID vaccine availability for cities in my state of VA. I have been successful getting the data I'm looking for, but my code is hard coded in some areas. I'm specifically asking for help improving my code in the areas number 1 & 2 below:
The JSON file can be found here:
https://www.cvs.com//immunizations/covid-19-vaccine.vaccine-status.VA.json?vaccineinfo
I'm trying to access the data in the responsePayloadData key. The only way I could figure out how to do this is to make it the only key. For that reason, I deleted the other key responseMetaData:
#remove the key that we don't need
del obj['responseMetaData']
I'm also not sure how to dynamically loop through the VA items without hard coding the number of cities I know are there in the data:
for x, y in obj.items():
for a in range(34):
Here's the full code:
import requests
import json
import time
from datetime import datetime
import urllib2
try:
import indigo
except:
pass
strAvail = "False"
strAvailCity = "None"
try:
# download raw json object from CVS Virginia Website
url = "https://www.cvs.com//immunizations/covid-19-vaccine.vaccine-status.VA.json?vaccineinfo"
data = urllib2.urlopen(url).read().decode()
except urllib2.HTTPError, err:
return {"error": err.reason, "error_code": err.code}
# parse json object
obj = json.loads(data)
# remove the key that we don't need
del obj['responseMetaData']
# loop through the JSON dictionary and check availability
# status options: {"Fully Booked", "Available"}
for x, y in obj.items():
for a in range(34):
# print('City: ' + y['data']['VA'][a]['city'])
# print('Total Available: ' + y['data']['VA'][a]['totalAvailable'])
# print('Percent Available: ' + y['data']['VA'][a]['pctAvailable'])
# print('Status: ' + y['data']['VA'][a]['status'])
# print("------------------------------")
# If there is availability anywhere in the state, take some action.
if y['data']['VA'][a]['status'] == "Available":
strAvail = True
strAvailCity = y['data']['VA'][a]['city']
# Log timestamp for this check to the JSON
now = datetime.now()
strDateTime = now.strftime("%m/%d/%Y %I:%M %p")
EDIT: Since the JSON is not available outside the US. I've pasted it below:
{"responsePayloadData":{"currentTime":"2021-02-11T14:55:00.470","data":{"VA":[{"totalAvailable":"1","city":"ABINGDON","state":"VA","pctAvailable":"0.19%","status":"Fully Booked"},{"totalAvailable":"0","city":"ALEXANDRIA","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"},{"totalAvailable":"0","city":"ARLINGTON","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"},{"totalAvailable":"0","city":"BEDFORD","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"},{"totalAvailable":"0","city":"BLACKSBURG","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"},{"totalAvailable":"0","city":"CHARLOTTESVILLE","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"},{"totalAvailable":"0","city":"CHATHAM","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"},{"totalAvailable":"0","city":"CHESAPEAKE","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"},{"totalAvailable":"1","city":"DANVILLE","state":"VA","pctAvailable":"0.19%","status":"Fully Booked"},{"totalAvailable":"2","city":"DUBLIN","state":"VA","pctAvailable":"0.39%","status":"Fully Booked"},{"totalAvailable":"0","city":"FAIRFAX","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"},{"totalAvailable":"0","city":"FREDERICKSBURG","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"},{"totalAvailable":"0","city":"GAINESVILLE","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"},{"totalAvailable":"0","city":"HAMPTON","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"},{"totalAvailable":"0","city":"HARRISONBURG","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"},{"totalAvailable":"0","city":"LEESBURG","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"},{"totalAvailable":"0","city":"LYNCHBURG","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"},{"totalAvailable":"0","city":"MARTINSVILLE","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"},{"totalAvailable":"0","city":"MECHANICSVILLE","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"},{"totalAvailable":"0","city":"MIDLOTHIAN","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"},
{"totalAvailable":"0","city":"NEWPORT NEWS","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"},{"totalAvailable":"0","city":"NORFOLK","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"},{"totalAvailable":"0","city":"PETERSBURG","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"},{"totalAvailable":"0","city":"PORTSMOUTH","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"},{"totalAvailable":"0","city":"RICHMOND","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"},{"totalAvailable":"0","city":"ROANOKE","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"},
{"totalAvailable":"0","city":"ROCKY MOUNT","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"},{"totalAvailable":"0","city":"STAFFORD","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"},{"totalAvailable":"0","city":"SUFFOLK","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"},
{"totalAvailable":"0","city":"VIRGINIA BEACH","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"},{"totalAvailable":"0","city":"WARRENTON","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"},{"totalAvailable":"0","city":"WILLIAMSBURG","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"},{"totalAvailable":"0","city":"WINCHESTER","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"},{"totalAvailable":"0","city":"WOODSTOCK","state":"VA","pctAvailable":"0.00%","status":"Fully Booked"}]}},"responseMetaData":{"statusDesc":"Success","conversationId":"Id-beb5f68730b34e6aa3bbc1fd927ea12b","refId":"Id-b4a7256078789eb59b8912b4","operation":"getInventorybyCity","statusCode":"0000"}}
Regarding problem 1, you can just access the data by key. You don't need to delete the other key:
payload = obj['responsePayloadData']
For the second problem, you can just iterate over the items in the list associated with obj['data']['VA']:
for city in payload['data']['VA']:
print(city)
{'city': 'ABINGDON',
'pctAvailable': '0.19%',
'state': 'VA',
'status': 'Fully Booked',
'totalAvailable': '1'}
{'city': 'ALEXANDRIA',
'pctAvailable': '0.00%',
'state': 'VA',
'status': 'Fully Booked',
'totalAvailable': '0'}
...
My doc-tests in Julia require a qualification with the module name, despite calling using my_module everywhere. If I do not qualify the functions, I get
ERROR: UndefVarError: add not defined
Here is the setup that gives this error. The directory structure with tree is:
.
|____docs
| |____make.jl
| |____src
| | |____index.md
|____src
| |____my_module.jl
The file docs/make.jl is:
using Documenter, my_module
makedocs(
modules = [my_module],
format = :html,
sitename = "my_module.jl",
doctest = true
)
The file docs/src/index.md is:
# Documentation
```#meta
CurrentModule = my_module
DocTestSetup = quote
using my_module
end
```
```#autodocs
Modules = [my_module]
```
The file src/my_module.jl is:
module my_module
"""
add(x, y)
Dummy function
# Examples
```jldoctest
julia> add(1, 2)
3
```
"""
function add(x::Number, y::Number)
return x + y
end
end
If I qualify the doc-test in the src/my_module.jl with my_module.add(1,2), then it works fine.
How can I avoid qualifying function names in doc-tests?
Use a setup name block
This is untested, but something like this should work:
module my_module
"""
add(x, y)
Dummy function
# Examples
```#setup abc
import my_module: add
```
```jldoctest abc
julia> add(1, 2)
3
```
"""
function add(x::Number, y::Number)
return x + y
end
end
Following the comments in this thread, the problem is that the add function is not exported, so it is not brought into scope with using. You can add this line near the top of src/my_module.jl, after the module declaration:
export add
And then the doc-testing works.
ok I am trying to create a definition which will read a list of IDS from an external Json file, Which it is doing. Its even putting the data into the database on load of the program, my issue is this. I cant seem to match the list IDs to a comparison. Here is my current code:
def check(account):
global ID_account
import json, httplib
if not hasattr(BigWorld, 'iddata'):
UID_DB = account['databaseID']
UID = ID_account
try:
conn = httplib.HTTPConnection('URL')
conn.request('GET', '/ids.json')
conn.sock.settimeout(2)
resp = conn.getresponse()
qresp = resp.read()
BigWorld.iddata = json.loads(qresp)
LOG_NOTE('[ABRO] Request of URL data successful.')
conn.close()
except:
LOG_NOTE('[ABRO] Http request to URL problem. Loading local data.')
if UID_DB is not None:
list = BigWorld.iddata["ids"]
#print (len(list) - 1)
for n in range(0, (len(list) - 1)):
#print UID_DB
#print list[n]
if UID_DB == list[n]:
#print '[ABRO] userid located:'
#print UID_DB
UID = UID_DB
else:
LOG_NOTE('[ABRO] userid not set.')
if 'databaseID' in account and account['databaseID'] != UID:
print '[ABRO] Account not active in database, game closing...... '
BigWorld.quit()
now my json file looks like this:
{
"ids":[
"1001583757",
"500687699",
"000000000"
]
}
now when I run this with all the commented out prints it seems to execute perfectly fine up till it tries to do the match inside the for loop. Even when the print shows UID_DB and list[n] being the same values, it does not set my variable, it doesn't post any errors, its just simply acting as if there was no match. am I possibly missing a loop break? here is the python log starting with the print of the length of the table print:
INFO: 2
INFO: 1001583757
INFO: 1001583757
INFO: 1001583757
INFO: 500687699
INFO: [ABRO] Account not active, game closing......
as you can see from the log, its never printing the User located print, so it is not matching them. its just continuing with the loop and using the default ID I defined above the definition. Anyone with an idea would definitely help me out as ive been poking and prodding this thing for 3 days now.
the answer to this was found by #VikasNehaOjha it was missing simply a conversion to match types before the match comparison I did this by adding in
list[n] = int(list[n])
that resolved my issue and it finally matched comparisons.