Is there a convenient way to pass lists as environment variables for Quarkus configuration other than comma separation?
MY_FOO=val1,val2,val3
Comma separation works fine (even if it does not look so nice for long lists). But if you have to pass a list of entries where each entry itself contains commas, it won't work.
I'm thinking of something similar to Spring configuration, where you can pass list entries with an index as a suffix:
MY_FOO_0_ = val1
MY_FOO_1_ = val2
MY_FOO_2_ = val3
Quarkus uses MicroProfile Config and SmallRye Config for this, and you can achieve the desired result using indexed properties:
# MicroProfile Config - Collection Values
my.collection=dog,cat,turtle
# SmallRye Config - Indexed Property
my.indexed.collection[0]=dog
my.indexed.collection[1]=cat
my.indexed.collection[2]=turtle
From the documentation:
A call to Config#getValues("my.collection", String.class), will automatically create and convert a List that contains the values dog, cat and turtle. A call to Config#getValues("my.indexed.collection", String.class) returns the exact same result.
Following SmallRye's rules for converting property names to environment variables, you would then pass the environment variables as
MY_INDEXED_COLLECTION_0_=dog
MY_INDEXED_COLLECTION_1_=cat
MY_INDEXED_COLLECTION_2_=turtle
and read them back as a list with
ConfigProvider.getConfig().getValues("my.indexed.collection", String.class);
Documentation on indexed properties: https://smallrye.io/smallrye-config/2.11.1/config/indexed-properties/
Documentation on environment variables: https://smallrye.io/smallrye-config/2.11.1/config/environment-variables
If your problem is passing a comma (,) inside one item of your array, I believe this will help you.
Below is an example where I pass the items of an array separated by commas, plus one specific item (AllowedRemoteAddresses) that itself contains commas.
Config variable inside your application
@ConfigProperty(name = "quickfix")
List<String> quickfixSessionSettings;
application.properties
In the Eclipse MicroProfile Config properties file, I just have to put double backslashes before the commas that are part of an item:
quickfix=[default],\
# Sessions,\
[session],\
BeginString=FIX.4.4,\
SenderCompID=EXEC,\
TargetCompID=BANZAI,\
ConnectionType=acceptor,\
StartTime=00:00:00,\
EndTime=00:00:00,\
# Aceptor,\
SocketAcceptPort=9880,\
# Logging,\
ScreenLogShowHeartBeats=Y,\
# Store,\
# FileStorePath=target/data/store,\
JdbcStoreMessagesTableName=messages,\
JdbcStoreSessionsTableName=sessions,\
JdbcLogHeartBeats=Y,\
JdbcLogIncomingTable=messages_log_incoming,\
JdbcLogOutgoingTable=messages_log_outgoing,\
JdbcLogEventTable=event_log,\
JdbcSessionIdDefaultPropertyValue=not_null,\
AllowedRemoteAddresses=localhost\\,127.0.0.1\\,172.0.0.2
List<String> result inside application:
[default]
# Sessions
[session]
BeginString=FIX.4.4
SenderCompID=EXEC
TargetCompID=BANZAI
ConnectionType=acceptor
StartTime=00:00:00
EndTime=00:00:00
# Aceptor
SocketAcceptPort=9880
# Logging
ScreenLogShowHeartBeats=Y
# Store
# FileStorePath=target/data/store
JdbcStoreMessagesTableName=messages
JdbcStoreSessionsTableName=sessions
JdbcLogHeartBeats=Y
JdbcLogIncomingTable=messages_log_incoming
JdbcLogOutgoingTable=messages_log_outgoing
JdbcLogEventTable=event_log
JdbcSessionIdDefaultPropertyValue=not_null
AllowedRemoteAddresses=localhost,127.0.0.1,172.0.0.2
YAML deployment file
In the YAML deployment file, I just have to put a single backslash before the comma:
environment:
- QUARKUS_DATASOURCE_JDBC_URL=jdbc:postgresql://postgresql-qfj:5432/postgres?currentSchema=exchange
- QUARKUS_DATASOURCE_USERNAME=postgres
- QUARKUS_DATASOURCE_PASSWORD=postgres
- QUICKFIX=[default],
[session],
BeginString=FIX.4.4,
SenderCompID=EXEC,
TargetCompID=BANZAI,
ConnectionType=acceptor,
StartTime=00:00:00,
EndTime=00:00:00,
SocketAcceptPort=9880,
ScreenLogShowHeartBeats=Y,
JdbcStoreMessagesTableName=messages,
JdbcStoreSessionsTableName=sessions,
JdbcLogHeartBeats=Y,
JdbcLogIncomingTable=messages_log_incoming,
JdbcLogOutgoingTable=messages_log_outgoing,
JdbcLogEventTable=event_log,
JdbcSessionIdDefaultPropertyValue=not_null,
AllowedRemoteAddresses=127.0.0.1\,172.0.0.2\,172.0.0.3\,broker-back-end
List<String> result inside application:
[default]
[session]
BeginString=FIX.4.4
SenderCompID=EXEC
TargetCompID=BANZAI
ConnectionType=acceptor
StartTime=00:00:00
EndTime=00:00:00
SocketAcceptPort=9880
ScreenLogShowHeartBeats=Y
JdbcStoreMessagesTableName=messages
JdbcStoreSessionsTableName=sessions
JdbcLogHeartBeats=Y
JdbcLogIncomingTable=messages_log_incoming
JdbcLogOutgoingTable=messages_log_outgoing
JdbcLogEventTable=event_log
JdbcSessionIdDefaultPropertyValue=not_null
AllowedRemoteAddresses=127.0.0.1,172.0.0.2,172.0.0.3,broker-back-end
I want to load multiple CSV files matching certain names into a dataframe. Currently I am looping through the whole folder, creating a list of filenames, loading those CSVs into a list of dataframes, and then concatenating that list.
The approach I want to use (if possible) is to bypass all that code and read all the files in a one-liner kind of approach.
I know this can be done easily for a single level of subfolders, but my subfolder structure is as follows:
Root Folder
|-- Subfolder1
|   |-- Subfolder 2
|       |-- X01.csv
|       |-- Y01.csv
|       |-- Z01.csv
|-- Subfolder3
|   |-- Subfolder4
|       |-- X01.csv
|       |-- Y01.csv
|-- Subfolder5
    |-- X01.csv
    |-- Y01.csv
I want to read all the "X01.csv" files while reading from the Root Folder.
Is there a way I can read all the required files with code something like the below?
filepath = "rootpath" + "/**/X*.csv"
df = spark.read.format("com.databricks.spark.csv").option("recursiveFileLookup","true").option("header","true").load(filepath)
This code works fine for a single level of subfolders; is there any equivalent of this for multi-level folders? I thought the "recursiveFileLookup" option would look across all levels of subfolders, but apparently that is not the way it works.
Currently I am getting a
Path not found ... filepath
exception.
Any help, please?
Have you tried using the glob.glob function?
You can use it to search for files that match certain criteria inside a root path, and pass the list of files it finds to the spark.read.csv function.
For example, I've recreated the folder structure from your example inside a Google Colab environment.
To get a list of all CSV files matching the criteria you've specified, you can use the following code:
import glob
rootpath = './Root Folder/'
# The following line of code looks through all files
# inside the rootpath recursively, trying to match the
# pattern specified. In this case, it tries to find any
# CSV file that starts with the letters X, Y, or Z,
# and ends with 2 numbers (ranging from 0 to 9).
glob.glob(rootpath + "**/[XYZ][0-9][0-9].csv", recursive=True)
# Returns:
# ['./Root Folder/Subfolder5/Y01.csv',
# './Root Folder/Subfolder5/X01.csv',
# './Root Folder/Subfolder1/Subfolder 2/Y01.csv',
# './Root Folder/Subfolder1/Subfolder 2/Z01.csv',
# './Root Folder/Subfolder1/Subfolder 2/X01.csv']
Now you can combine this with spark.read.csv's ability to read a list of files to get the answer you're looking for:
import glob
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
rootpath = './Root Folder/'
spark.read.csv(glob.glob(rootpath + "**/[XYZ][0-9][0-9].csv", recursive=True), inferSchema=True, header=True)
Note
You can specify more general patterns like:
glob.glob(rootpath + "**/*.csv", recursive=True)
to return a list of all CSV files inside any subdirectory of rootpath.
Additionally, to consider only the files in the immediate subdirectories of rootpath, you could use something like:
glob.glob(rootpath + "*/*.csv")
(the recursive flag only affects the ** wildcard, so it isn't needed here).
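For reference, here is a small sketch of what the different patterns would match against the folder layout from the question (the listed paths are illustrative):

import glob

rootpath = './Root Folder/'

# Every CSV at any depth below rootpath.
glob.glob(rootpath + "**/*.csv", recursive=True)
# e.g. './Root Folder/Subfolder1/Subfolder 2/X01.csv', ..., './Root Folder/Subfolder5/Y01.csv'

# Only CSVs that sit exactly one level below rootpath (no recursion needed).
glob.glob(rootpath + "*/*.csv")
# e.g. './Root Folder/Subfolder5/X01.csv', './Root Folder/Subfolder5/Y01.csv'

# Only the X01.csv files, at any depth -- matching the original requirement.
glob.glob(rootpath + "**/X*.csv", recursive=True)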
Edit
Based on your comments to this answer, does something like this work on Databricks?
from py4j.protocol import Py4JJavaError
from notebookutils import mssparkutils as ms

# Databricks has a module called dbutils.fs.ls
# that works similarly to mssparkutils.fs, based on
# the following page of its documentation:
# https://docs.databricks.com/dev-tools/databricks-utils.html#ls-command-dbutilsfsls

def scan_dir(
    initial_path: str,
    search_str: str,
    account_name: str = '',
):
    """Scan a directory and its subdirectories for a string.

    Parameters
    ----------
    initial_path : str
        The path to start the search. Accepts either a valid container name,
        or the entire connection string.
    search_str : str
        The string to search.
    account_name : str
        The name of the account used to access the container folders.
        This value is only used when `initial_path` doesn't conform
        with the format: "abfss://<initial_path>#<account_name>.dfs.core.windows.net/"

    Raises
    ------
    FileNotFoundError
        If the `initial_path` informed doesn't exist.
    ValueError
        If `initial_path` is not a string.
    """
    if not isinstance(initial_path, str):
        raise ValueError(
            f'`initial_path` needs to be of type string, not {type(initial_path)}'
        )
    elif not initial_path.startswith('abfss'):
        initial_path = f'abfss://{initial_path}#{account_name}.dfs.core.windows.net/'

    try:
        fdirs = ms.fs.ls(initial_path)
    except Py4JJavaError as exc:
        raise FileNotFoundError(
            f'The path you informed "{initial_path}" doesn\'t exist'
        ) from exc

    found = []
    for path in fdirs:
        p = path.path
        if path.isDir:
            # Recurse into subdirectories, forwarding the account name.
            found = [*found, *scan_dir(p, search_str, account_name)]
        if search_str.lower() in path.name.lower():
            # print(p.split('.net')[-1])
            found = [*found, p.replace(path.name, "")]
    return list(set(found))
Example:
# Change .parquet to .csv
spark.read.parquet(*scan_dir("abfss://CONTAINER_NAME#ACCOUNTNAME.dfs.core.windows.net/ROOT/FOLDER/", ".parquet"))
The method above worked for me on Azure Synapse.
Let's say I load a dataset
import pyarrow.dataset as ds
import pyarrow.compute as pc

myds = ds.dataset('mypath', format='parquet', partitioning='hive')
myds.schema
# On/Off_Peak: string
# area: string
# price: decimal128(8, 4)
# date: date32[day]
# hourbegin: int32
# hourend: int32
# inflation: string rename to Inflation
# Price_Type: string
# Reference_Year: int32
# Case: string
# region: string rename to Region
My end goal is to resave the dataset with the following projection:
projection={'Region':ds.field('region'),
'Date':ds.field('date'),
'isPeak':pc.equal(ds.field('On/Off_Peak'),ds.scalar('On')),
'Hourbegin':ds.field('hourbegin'),
'Hourend':ds.field('hourend'),
'Inflation':ds.field('inflation'),
'Price_Type':ds.field('Price_Type'),
'Area':ds.field('area'),
'Price':ds.field('price'),
'Reference_Year':ds.field('Reference_Year'),
'Case':ds.field('Case'),
}
I make a scanner
scanner=myds.scanner(columns=projection)
Now I try to save my new dataset with
ds.write_dataset(scanner, 'newpath',
partitioning=['Reference_Year', 'Case', 'Region'], partitioning_flavor='hive',
format='parquet')
but I get
KeyError: 'Column Region does not exist in schema'
I can work around this by changing my partitioning to ['Reference_Year', 'Case', 'region'] to match the non-projected columns (and then later changing the name of all those directories) but is there a way to do it directly?
Suppose my partitioning needed the compute for more than just the column name changing. Would I have to save a non-partitioned dataset in one step to get the new column and then do another save operation to create the partitioned dataset?
EDIT: this bug has been fixed in pyarrow 10.0.0
It looks like a bug to me. It's as if write_dataset is looking at the dataset_schema rather than the projected_schema
I think you can get around it by calling to_reader on the scanner.
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.Table.from_arrays(
    [
        pa.array(['a', 'b', 'c'], pa.string()),
        pa.array(['a', 'b', 'c'], pa.string()),
    ],
    names=['region', 'Other']
)
table_dataset = ds.dataset(table)
columns = {
    "Region": ds.field('region'),
    "Other": ds.field('Other'),
}
scanner = table_dataset.scanner(columns=columns)

ds.write_dataset(
    scanner.to_reader(),
    'newpath',
    partitioning=['Region'], partitioning_flavor='hive',
    format='parquet')
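As a side note: per the edit on the question, the underlying bug is fixed in pyarrow 10.0.0, so on that version or later a sketch like the following (passing the scanner directly, as originally attempted) should also work without the to_reader workaround:

ds.write_dataset(
    scanner,  # no .to_reader() needed on pyarrow >= 10.0.0
    'newpath',
    partitioning=['Region'], partitioning_flavor='hive',
    format='parquet')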
I've reported the issue here
My use case is: I have an S3 event which triggers a Lambda (upon an S3 createObject event), which in turn invokes an Airflow DAG, passing in a couple of --conf values (bucketname, filekey).
I then extract the key value using a PythonOperator and store it in an xcom variable. I then want to extract this xcom value within an S3ToSnowflakeOperator and essentially load the file into a Snowflake table.
All parts of the process are working bar the extraction of the xcom value within the S3ToSnowflakeOperator task. I basically get the following in my logs:
query: [COPY INTO "raw".SOURCE_PARAMS_JSON FROM @MYSTAGE_PARAMS_DEMO/ files=('{{ ti.xcom...]
which suggests the Jinja template is not correctly resolving the xcom value.
My code is as follows:
from airflow import DAG
from airflow.utils import timezone
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash import BashOperator
from airflow.providers.snowflake.transfers.s3_to_snowflake import S3ToSnowflakeOperator

FILEPATH = "demo/tues-29-03-2022-6.json"

args = {
    'start_date': timezone.utcnow(),
    'owner': 'airflow',
}

with DAG(
    dag_id='example_dag_conf',
    default_args=args,
    schedule_interval=None,
    catchup=False,
    tags=['params demo'],
) as dag:

    def run_this_func(**kwargs):
        outkey = '{}'.format(kwargs['dag_run'].conf['key'])
        print(outkey)
        ti = kwargs['ti']
        ti.xcom_push(key='FILE_PATH', value=outkey)

    run_this = PythonOperator(
        task_id='run_this',
        python_callable=run_this_func
    )

    get_param_val = BashOperator(
        task_id='get_param_val',
        bash_command='echo "{{ ti.xcom_pull(key="FILE_PATH") }}"',
        dag=dag)

    copy_into_table = S3ToSnowflakeOperator(
        task_id='copy_into_table',
        s3_keys=["{{ ti.xcom_pull(key='FILE_PATH') }}"],
        snowflake_conn_id=SNOWFLAKE_CONN_ID,
        stage=SNOWFLAKE_STAGE,
        schema="""\"{0}\"""".format(SNOWFLAKE_RAW_SCHEMA),
        table=SNOWFLAKE_RAW_TABLE,
        file_format="(type = 'JSON')",
        dag=dag,
    )

    run_this >> get_param_val >> copy_into_table
If I replace
s3_keys=["{{ ti.xcom_pull(key='FILE_PATH') }}"],
with
s3_keys=[FILEPATH]
my operator works fine and the data is loaded into Snowflake. So the error is centered on resolving s3_keys=["{{ ti.xcom_pull(key='FILE_PATH') }}"], I believe?
Any guidance/help would be appreciated. I am using Airflow 2.2.2.
I removed the S3ToSnowflakeOperator and replaced it with the SnowflakeOperator.
I was then able to reference the xcom value (as above) for the sql param value.
Note: my xcom value was a derived COPY INTO statement, effectively replicating the functionality of the S3ToSnowflakeOperator, with the added advantage of being able to store the file metadata within the table columns too.
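A rough sketch of that approach, reusing the DAG and constants from the question (the xcom key COPY_SQL and the exact statement built here are illustrative assumptions, not my actual code):

from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

def build_copy_sql(**kwargs):
    # Derive a full COPY INTO statement from the --conf value and push it
    # to xcom (the key name is an assumption made for this sketch).
    file_key = kwargs['dag_run'].conf['key']
    copy_sql = (
        'COPY INTO "raw".SOURCE_PARAMS_JSON '
        'FROM @MYSTAGE_PARAMS_DEMO/ '
        "files=('{0}') "
        "file_format=(type='JSON')"
    ).format(file_key)
    kwargs['ti'].xcom_push(key='COPY_SQL', value=copy_sql)

# build_copy_sql replaces run_this_func in the PythonOperator above, and the
# copy task becomes a SnowflakeOperator whose templated sql field pulls the xcom:
copy_into_table = SnowflakeOperator(
    task_id='copy_into_table',
    snowflake_conn_id=SNOWFLAKE_CONN_ID,
    sql="{{ ti.xcom_pull(key='COPY_SQL') }}",
    dag=dag,
)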
I'm wondering if I could get JRuby-internal Java objects (e.g. org.jruby.RubyString, org.jruby.RubyTime) in Ruby code, and call their Java methods from Ruby. Does anyone know how to do it?
str = "foobar"
rubystring_str = str.toSomethingConversion # <== What I want
# http://jruby.org/apidocs/org/jruby/RubyString.html#getEncoding()
rubystring_str.getEncoding() # Java::org.jcodings.Encoding
# http://jruby.org/apidocs/org/jruby/RubyString.html#getBytes()
rubystring_str.getBytes() # [Java::byte]
time = Time.now
rubytime_time = time.toSomethingConversion # <== What I want
# http://jruby.org/apidocs/org/jruby/RubyTime.html#getDateTime()
rubytime_time.getDateTime() # Java::org.joda.time.DateTime
I know I can do it like that using Java code as below, but here I'd like to do it purely in Ruby.
public org.joda.time.DateTime getJodaDateTime(RubyTime rubytime) {
return rubytime.getDateTime();
}
Ah, I found the answer through trial and error.
The following works:
"foobar".to_java(Java::org.jruby.RubyString).getEncoding()
Time.now.to_java(Java::org.jruby.RubyTime).getDateTime()
Here I am trying to use a plugin to check whether the service is running or not, whether there is any warning or critical action required, and at the same time to read the performance parameters.
We have used the plugin below to check if a server is alive or not and to read its performance data as JSON:
https://github.com/drewkerrigan/nagios-http-json
I am trying to read a JSON file as below, which is hosted at http://localhost:8080/sample.json
The plugin works perfectly on the command line; it shows me all the metrics available.
$:/usr/lib/nagios/plugins$ ./check_http_json.py -H localhost:8080 -p sample.json -m metrics.etp_count metrics.atc_count
OK: Status OK.|'metrics.etp_count'=101 'metrics.atc_count'=0
But when I try the same in the Icinga2 configuration, it doesn't show me these performance metrics; it doesn't give any error, but at the same time it doesn't show any value.
Find the JSON, commands.conf and services.conf below.
{
    "metrics": {
        "etp_count": "0",
        "atc_count": "101",
        "mean_time": -1.0
    }
}
Below are my commands.conf and services.conf
commands.conf
/* Json Read Command */
object CheckCommand "json_check" {
    import "plugin-check-command"
    command = [ PluginDir + "/check_http_json.py" ]
    arguments = {
        "-H" = "$server_port$"
        "-p" = "$json_path$"
        "-w" = "$warning_value$"
        "-c" = "$critical_value$"
        "-m" = "$Metrics1$,$Metrics2$"
    }
}
services.conf
apply Service "json"{
import "generic-service"
check_command = "json_check"
vars.server_port="localhost:8080"
vars.json_path="sample.json"
vars.warning_value="metrics.etp_count,1:100"
vars.critical_value="metrics.etp_count,101:1000"
vars.Metrics1="metrics.etp_count"
vars.Metrics2="metrics.atc_count"
assign where host.name == NodeName
}
Does anyone have any idea how we can pass multiple values in commands.conf and services.conf?
I have resolved the issue.
I had to change the plugin file "check_http_json.py". The original code:
def checkMetrics(self):
    """Return a Nagios specific performance metrics string given keys and parameter definitions"""
    metrics = ''
    warning = ''
    critical = ''
    if self.rules.metric_list != None:
        for metric in self.rules.metric_list:
was replaced with:
def checkMetrics(self):
    """Return a Nagios specific performance metrics string given keys and parameter definitions"""
    metrics = ''
    warning = ''
    critical = ''
    if self.rules.metric_list != None:
        for metric in self.rules.metric_list[0].split():
The issue was that the metric list was not handled properly: because of how the value arrives from the services.conf configuration, the whole thing was treated as a single string, so the loop could not iterate over the individual items.
The string had to be split further to get the individual items out of the metrics string.
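A minimal, hypothetical illustration of the difference (the exact combined value depends on how Icinga renders the macros; a whitespace-joined string is assumed here to match the .split() fix):

metric_list = ["metrics.etp_count metrics.atc_count"]   # both names arrive in one string

# Original loop: a single iteration over the whole combined string.
for metric in metric_list:
    print(metric)          # prints the combined string once

# Patched loop: split the first element, giving one metric key per iteration.
for metric in metric_list[0].split():
    print(metric)          # prints metrics.etp_count, then metrics.atc_count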