How to get python dictionary or list from Jinja templated string in Airflow? - jinja2

Let's assume I have an operator which needs a Python list (or dict) as an argument for one of its properties:
doExampleTask = ExampleOperator(
    task_id="doExampleTask",
    property_needs_list=[
        ("a", "x"),
        ("b", "y"),
    ],
    property_needs_dict={
        "dynamic_field_1": "dynamic_value",
        # ...
        "dynamic_field_N": "dynamic_value",
    },
)
The problem is that I can't define the Python data structure of the list (how many elements it needs) or the dict (which fields it contains) at DAG creation time.
I can only get this structure dynamically by executing a previous task or a macro:
The task could write the data structure with dynamic fields into XCom.
The macro could return a data structure.
But in both of the above cases there is no way to convert the dynamic data structure (returned via XCom or a custom macro) into a Python data structure and use it as a property of the operator.
This will not return a list or dict:
doExampleTask = ExampleOperator(
    task_id="doExampleTask",
    property_needs_list='{{ generate_list() }}',
    property_needs_dict='{{ generate_dict() }}',
)
This will also not return a dict or list:
doExampleTask = ExampleOperator(
    task_id="doExampleTask",
    property_needs_list='{{ ti.xcom_pull(task_ids="PreviousTask", key="list_structure") }}',
    property_needs_dict='{{ ti.xcom_pull(task_ids="PreviousTask", key="dict_structure") }}',
)
If I use something like the eval() function, it will not evaluate the string argument at task execution time. It will try to evaluate it at DAG creation time, when the values obviously won't be there yet.
doExampleTask = ExampleOperator(
    task_id="doExampleTask",
    property_needs_list=eval('{{ ti.xcom_pull(task_ids="PreviousTask", key="list_structure") }}'),
    property_needs_dict=eval('{{ ti.xcom_pull(task_ids="PreviousTask", key="dict_structure") }}'),
)
or
doExampleTask = ExampleOperator(
    task_id="doExampleTask",
    property_needs_list=eval('{{ generate_list() }}'),
    property_needs_dict=eval('{{ generate_dict() }}'),
)
How can I work around this problem?
I'm mostly interested in Airflow 1.x, but I'm open to an Airflow 2.x solution.
Thank you!

In Airflow 1, Jinja expressions are always evaluated as strings. You'll have to either subclass the operator or build logic into your custom operator to translate the stringified list/dict argument as necessary.
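For illustration, here is a minimal sketch of that subclassing approach for Airflow 1.x. The operator and field names are hypothetical, and it assumes the upstream task pushes a value whose string form is a valid Python literal, so ast.literal_eval can turn the rendered string back into a list or dict.

import ast

from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class NativeExampleOperator(BaseOperator):
    # Fields listed here are rendered with Jinja before execute() runs.
    template_fields = ("property_needs_list", "property_needs_dict")

    @apply_defaults
    def __init__(self, property_needs_list=None, property_needs_dict=None, *args, **kwargs):
        super(NativeExampleOperator, self).__init__(*args, **kwargs)
        self.property_needs_list = property_needs_list
        self.property_needs_dict = property_needs_dict

    def execute(self, context):
        # After rendering, the values are strings such as "[('a', 'x'), ('b', 'y')]";
        # ast.literal_eval converts them back into real Python objects.
        needs_list = ast.literal_eval(self.property_needs_list)
        needs_dict = ast.literal_eval(self.property_needs_dict)
        self.log.info("list=%s dict=%s", needs_list, needs_dict)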
However, in Airflow 2.1, an option was added to render templates as native Python types. You can set render_template_as_native_obj=True at the DAG level and lists will render as a true list, dicts as a true dict, etc. Check out the docs here.
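A minimal sketch of the Airflow 2.1+ option, with a plain PythonOperator standing in for the ExampleOperator from the question and an upstream task pushing the structures to XCom (task ids and keys are illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def generate_structures(ti):
    # Illustrative upstream task: push a list and a dict to XCom under explicit keys.
    ti.xcom_push(key="list_structure", value=[("a", "x"), ("b", "y")])
    ti.xcom_push(key="dict_structure", value={"dynamic_field_1": "dynamic_value"})


def consume_structures(needs_list, needs_dict):
    # With render_template_as_native_obj=True these arrive as a real list and dict.
    print(type(needs_list), type(needs_dict))


with DAG(
    dag_id="render_native_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
    render_template_as_native_obj=True,
) as dag:
    previous_task = PythonOperator(task_id="PreviousTask", python_callable=generate_structures)

    do_example_task = PythonOperator(
        task_id="doExampleTask",
        python_callable=consume_structures,
        op_kwargs={
            "needs_list": "{{ ti.xcom_pull(task_ids='PreviousTask', key='list_structure') }}",
            "needs_dict": "{{ ti.xcom_pull(task_ids='PreviousTask', key='dict_structure') }}",
        },
    )

    previous_task >> do_example_task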

Related

Airflow Jinja Template dag_run.conf not parsing

I have this dag code below.
import pendulum
from airflow import DAG
from airflow.decorators import dag, task
from custom_operators.profile_data_and_update_test_suite_operator import ProfileDataAndUpdateTestSuiteOperator
from custom_operators.validate_data_operator import ValidateDataOperator
from airflow.models import Variable

connstring = Variable.get("SECRET_SNOWFLAKE_DEV_CONNECTION_STRING")

@dag('profile_and_validate_data', schedule_interval=None, start_date=pendulum.datetime(2021, 1, 1, tz="UTC"), catchup=False)
def taskflow():
    profile_data = ProfileDataAndUpdateTestSuiteOperator(
        task_id="profile_data",
        asset_name="{{ dag_run.conf['asset_name'] }}",
        data_format="sql",
        connection_string=connstring
    )
    validate_data = ValidateDataOperator(
        task_id="validate_data",
        asset_name="{{ dag_run.conf['asset_name'] }}",
        data_format="sql",
        connection_string=connstring,
        trigger_rule="all_done"
    )
    profile_data >> validate_data

dag = taskflow()
But the asset_name parameter shows up as the raw string "{{ dag_run.conf['asset_name'] }}" rather than the value parsed with Jinja from the configuration you supply when you trigger the DAG.
What am I doing wrong here?
BaseOperator has a field template_fields that lists all the field names whose values Airflow will render from Jinja templates at run time.
You need to declare the field "asset_name" in your custom operators (ProfileDataAndUpdateTestSuiteOperator, ValidateDataOperator):
template_fields: Sequence[str] = ("asset_name",)
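A minimal sketch of what one of the custom operators could look like with the templated field declared (the constructor signature here is assumed, not taken from the actual custom operators):

from typing import Sequence

from airflow.models import BaseOperator


class ProfileDataAndUpdateTestSuiteOperator(BaseOperator):
    # Airflow only renders Jinja templates for fields listed here.
    template_fields: Sequence[str] = ("asset_name",)

    def __init__(self, asset_name, data_format, connection_string, **kwargs):
        super().__init__(**kwargs)
        self.asset_name = asset_name
        self.data_format = data_format
        self.connection_string = connection_string

    def execute(self, context):
        # By the time execute() runs, self.asset_name holds the rendered value.
        self.log.info("Profiling asset %s", self.asset_name)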
render_template_as_native_obj is set to False by default on the DAG. Leaving it False returns strings; change it to True to get the native object.
@dag('profile_and_validate_data', schedule_interval=None, start_date=pendulum.datetime(2021, 1, 1, tz="UTC"), catchup=False, render_template_as_native_obj=True)

Access XCom in S3ToSnowflakeOperator of Airflow

My use case is: I have an S3 event which triggers a Lambda (upon an S3 CreateObject event), which in turn invokes an Airflow DAG, passing in a couple of --conf values (bucketname, filekey).
I am then extracting the key value using a Python operator and storing it in an XCom variable. I then want to extract this XCom value within an S3ToSnowflakeOperator and essentially load the file into a Snowflake table.
All parts of the process are working except for the extraction of the XCom value within the S3ToSnowflakeOperator task. I basically get the following in my logs:
query: [COPY INTO "raw".SOURCE_PARAMS_JSON FROM @MYSTAGE_PARAMS_DEMO/ files=('{{ ti.xcom...]
which looks like the jinja template is not correctly resolving the xcom value.
My code is as follows:
from airflow import DAG
from airflow.utils import timezone
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash import BashOperator
from airflow.providers.snowflake.transfers.s3_to_snowflake import S3ToSnowflakeOperator

FILEPATH = "demo/tues-29-03-2022-6.json"

args = {
    'start_date': timezone.utcnow(),
    'owner': 'airflow',
}

with DAG(
    dag_id='example_dag_conf',
    default_args=args,
    schedule_interval=None,
    catchup=False,
    tags=['params demo'],
) as dag:

    def run_this_func(**kwargs):
        outkey = '{}'.format(kwargs['dag_run'].conf['key'])
        print(outkey)
        ti = kwargs['ti']
        ti.xcom_push(key='FILE_PATH', value=outkey)

    run_this = PythonOperator(
        task_id='run_this',
        python_callable=run_this_func
    )

    get_param_val = BashOperator(
        task_id='get_param_val',
        bash_command='echo "{{ ti.xcom_pull(key="FILE_PATH") }}"',
        dag=dag)

    copy_into_table = S3ToSnowflakeOperator(
        task_id='copy_into_table',
        s3_keys=["{{ ti.xcom_pull(key='FILE_PATH') }}"],
        snowflake_conn_id=SNOWFLAKE_CONN_ID,
        stage=SNOWFLAKE_STAGE,
        schema="""\"{0}\"""".format(SNOWFLAKE_RAW_SCHEMA),
        table=SNOWFLAKE_RAW_TABLE,
        file_format="(type = 'JSON')",
        dag=dag,
    )

    run_this >> get_param_val >> copy_into_table
If I replace
s3_keys=["{{ ti.xcom_pull(key='FILE_PATH') }}"],
with
s3_keys=[FILEPATH]
my operator works fine and the data is loaded into Snowflake. So the error is centered on resolving s3_keys=["{{ ti.xcom_pull(key='FILE_PATH') }}"], I believe?
Any guidance/help would be appreciated. I am using Airflow 2.2.2.
I removed the S3ToSnowflakeOperator and replaced it with the SnowflakeOperator.
I was then able to reference the XCom value (as above) for the sql param value.
Note: my XCom value was a derived COPY INTO statement, effectively replicating the functionality of the S3ToSnowflakeOperator, with the added advantage of being able to store the file metadata within the table columns too.
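A rough sketch of what that can look like, assuming an upstream task pushes the derived COPY INTO statement to XCom under a key such as 'COPY_SQL' (the key is illustrative; the connection id is reused from the question). The sql field of SnowflakeOperator is a templated field, so the statement is rendered at run time:

from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

copy_into_table = SnowflakeOperator(
    task_id="copy_into_table",
    snowflake_conn_id=SNOWFLAKE_CONN_ID,
    # The COPY INTO statement was pushed to XCom by a previous task.
    sql="{{ ti.xcom_pull(key='COPY_SQL') }}",
)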

Split Jinja string in Airflow

I trigger my DAG with the API from a Lambda function with a trigger on a file upload. I get the file path from the Lambda context,
i.e.: ingestion.archive.dev/yolo/PMS_2_DXBTD_RTBD_2021032800000020210328000000SD_20210329052822.XML
I put this variable in the API call to get it back as "{{ dag_run.conf['file_path'] }}".
At some point, I need to extract information from this string by splitting it on / inside the DAG, in order to use the S3CopyObjectOperator.
So here is the first approach I had:
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.s3_copy_object import S3CopyObjectOperator
from airflow.operators.python_operator import PythonOperator

default_args = {
    'owner': 'me',
}

s3_final_destination = {
    "bucket_name": "ingestion.archive.dev",
    "verification_failed": "validation_failed",
    "processing_failed": "processing_failed",
    "processing_success": "processing_success"
}


def print_var(file_path,
              file_split,
              source_bucket,
              source_path,
              file_name):
    data = {
        "file_path": file_path,
        "file_split": file_split,
        "source_bucket": source_bucket,
        "source_path": source_path,
        "file_name": file_name
    }
    print(data)


with DAG(
    f"test_s3_transfer",
    default_args=default_args,
    description='Test',
    schedule_interval=None,
    start_date=datetime(2021, 4, 24),
    tags=['ingestion', "test", "context"],
) as dag:
    # {"file_path": "ingestion.archive.dev/yolo/PMS_2_DXBTD_RTBD_2021032800000020210328000000SD_20210329052822.XML"}
    file_path = "{{ dag_run.conf['file_path'] }}"
    file_split = file_path.split('/')
    source_bucket = file_split[0]
    source_path = "/".join(file_split[1:])
    file_name = file_split[-1]

    test_var = PythonOperator(
        task_id="test_var",
        python_callable=print_var,
        op_kwargs={
            "file_path": file_path,
            "file_split": file_split,
            "source_bucket": source_bucket,
            "source_path": source_path,
            "file_name": file_name
        }
    )

    file_verification_fail_to_s3 = S3CopyObjectOperator(
        task_id="file_verification_fail_to_s3",
        source_bucket_key=source_bucket,
        source_bucket_name=source_path,
        dest_bucket_key=s3_final_destination["bucket_name"],
        dest_bucket_name=f'{s3_final_destination["verification_failed"]}/{file_name}'
    )

    test_var >> file_verification_fail_to_s3
I use the PythonOperator to check the values I got, for debugging.
I have the right value in file_path, but in file_split I get -> ['ingestion.archive.dev/yolo/PMS_2_DXBTD_RTBD_2021032800000020210328000000SD_20210329052822.XML']
It's my whole string in a single-element list, not each part split out like ["ingestion.archive.dev", "yolo", "PMS_2_DXBTD_RTBD_2021032800000020210328000000SD_20210329052822.XML"].
So what's wrong here?
In Airflow, Jinja rendering is not done until task runtime. However, since the parsing of the file_path value as written is performed as top-level code (i.e. outside of an Operator's execute() method or the DAG instantiation), the Scheduler initializes file_split as ["{{ dag_run.conf['file_path'] }}"]: the split produces a single-element list because there is no "/" in the raw template string. When the task then executes, Jinja rendering is applied to that one element, which is why you see ["ingestion.archive.dev/yolo/PMS_2_DXBTD_RTBD_2021032800000020210328000000SD_20210329052822.XML"] as the value.
Even if you explicitly split the string within the Jinja expression, like file_split="{{ dag_run.conf.file_path.split('/') }}", the value will then be the string representation of the list and not a list object.
However, in Airflow 2.1, you can set render_template_as_native_obj=True as a DAG parameter which will render templated values to a native Python object. Now the string split will render as a list as you expect:
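For illustration, a minimal sketch of that combination, with the split done inside the Jinja expression and the whole expression passed via op_kwargs (the DAG and task names are illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def print_parts(file_split):
    # With render_template_as_native_obj=True, file_split arrives as a real list.
    print(file_split[0], "/".join(file_split[1:]), file_split[-1])


with DAG(
    "test_s3_transfer",
    schedule_interval=None,
    start_date=datetime(2021, 4, 24),
    render_template_as_native_obj=True,
) as dag:
    test_var = PythonOperator(
        task_id="test_var",
        python_callable=print_parts,
        op_kwargs={"file_split": "{{ dag_run.conf['file_path'].split('/') }}"},
    )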
As best practice, you should avoid top-level code since it's executed on every Scheduler heartbeat, which could lead to performance issues in your DAG and environment. I would suggest passing the "{{ dag_run.conf['file_path'] }}" expression as an argument to the function which needs it and then executing the parsing logic within the function itself.
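A sketch of that suggestion (the callable name is illustrative); the expression renders to the real path string at run time, so the splitting inside the callable works even without render_template_as_native_obj:

from airflow.operators.python import PythonOperator


def split_file_path(file_path):
    # file_path is the rendered string by the time the task runs.
    file_split = file_path.split('/')
    source_bucket = file_split[0]
    source_path = "/".join(file_split[1:])
    file_name = file_split[-1]
    print(source_bucket, source_path, file_name)


test_var = PythonOperator(
    task_id="test_var",
    python_callable=split_file_path,
    op_kwargs={"file_path": "{{ dag_run.conf['file_path'] }}"},
)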

How Do I Consume an Array of JSON Objects using Plumber in R

I have been experimenting with Plumber in R recently, and am having success when I pass the following data using a POST request:
{"Gender": "F", "State": "AZ"}
This allows me to write a function like the following to return the data.
#* @post /score
score <- function(Gender, State){
  data <- list(
    Gender = as.factor(Gender)
    , State = as.factor(State))
  return(data)
}
However, when I try to POST an array of JSON objects, I can't seem to access the data through the function
[{"Gender":"F","State":"AZ"},{"Gender":"F","State":"NY"},{"Gender":"M","State":"DC"}]
I get the following error
{
  "error": [
    "500 - Internal server error"
  ],
  "message": [
    "Error in is.factor(x): argument \"Gender\" is missing, with no default\n"
  ]
}
Does anyone have an idea of how Plumber parses JSON? I'm not sure how to access and assign the fields to vectors to score the data.
Thanks in advance
I see two possible solutions here. The first would be a command-line-based approach, which I assume you were attempting. I tested this on a Windows OS and used column-based data.frame encoding, which I prefer due to shorter JSON string lengths. Make sure to escape quotation marks correctly to avoid 'argument "..." is missing, with no default' errors:
curl -H "Content-Type: application/json" --data "{\"Gender\":[\"F\",\"F\",\"M\"],\"State\":[\"AZ\",\"NY\",\"DC\"]}" http://localhost:8000/score
# [["F","F","M"],["AZ","NY","DC"]]
The second approach is R native and has the advantage of having everything in one place:
library(jsonlite)
library(httr)

## sample data
lst = list(
  Gender = c("F", "F", "M")
  , State = c("AZ", "NY", "DC")
)

## jsonify
jsn = lapply(
  lst
  , toJSON
)

## query
request = POST(
  url = "http://localhost:8000/score?"
  , query = jsn # values must be length 1
)

response = content(
  request
  , as = "text"
  , encoding = "UTF-8"
)

fromJSON(
  response
)
#      [,1]
# [1,] "[\"F\",\"F\",\"M\"]"
# [2,] "[\"AZ\",\"NY\",\"DC\"]"
Be aware that httr::POST() expects a list of length-1 values as query input, so the array data should be jsonified beforehand. If you want to avoid the additional package imports altogether, some system(), sprintf(), etc. magic should do the trick.
Finally, here is my plumber endpoint (living in R/plumber.R and condensed a little bit):
#* @post /score
score = function(Gender, State){
  lapply(
    list(Gender, State)
    , as.factor
  )
}
and code to fire up the API:
pr = plumber::plumb("R/plumber.R")
pr$run(port = 8000)

Postgres JSON data type Rails query

I am using Postgres' json data type but want to do a query/ordering with data that is nested within the json.
I want to order or query with .where on the json data type. For example, I want to query for users that have a follower count > 500 or I want to order by follower or following count.
Thanks!
Example:
model User
data: {
  "photos"=>[
    {"type"=>"facebook", "type_id"=>"facebook", "type_name"=>"Facebook", "url"=>"facebook.com"}
  ],
  "social_profiles"=>[
    {"type"=>"vimeo", "type_id"=>"vimeo", "type_name"=>"Vimeo", "url"=>"http://vimeo.com/", "username"=>"v", "id"=>"1"},
    {"bio"=>"I am not a person, but a series of plants", "followers"=>1500, "following"=>240, "type"=>"twitter", "type_id"=>"twitter", "type_name"=>"Twitter", "url"=>"http://www.twitter.com/", "username"=>"123", "id"=>"123"}
  ]
}
For anyone who stumbles upon this: I have come up with a list of queries using ActiveRecord and Postgres' JSON data type. Feel free to edit this to make it clearer.
Documentation to the JSON operators used below: https://www.postgresql.org/docs/current/functions-json.html.
# Sort based on the Hstore data:
Post.order("data->'hello' DESC")
=> #<ActiveRecord::Relation [
#<Post id: 4, data: {"hi"=>"23", "hello"=>"22"}>,
#<Post id: 3, data: {"hi"=>"13", "hello"=>"21"}>,
#<Post id: 2, data: {"hi"=>"3", "hello"=>"2"}>,
#<Post id: 1, data: {"hi"=>"2", "hello"=>"1"}>]>
# Where inside a JSON object:
Record.where("data ->> 'likelihood' = '0.89'")
# Example json object:
r.column_data
=> {"data1"=>[1, 2, 3],
"data2"=>"data2-3",
"array"=>[{"hello"=>1}, {"hi"=>2}],
"nest"=>{"nest1"=>"yes"}}
# Nested search:
Record.where("column_data -> 'nest' ->> 'nest1' = 'yes' ")
# Search within array:
Record.where("column_data #>> '{data1,1}' = '2' ")
# Search within a value that's an array:
Record.where("column_data #> '{array,0}' ->> 'hello' = '1' ")
# this only find for one element of the array.
# All elements:
Record.where("column_data ->> 'array' LIKE '%hello%' ") # bad
Record.where("column_data ->> 'array' LIKE ?", "%hello%") # good
According to this http://edgeguides.rubyonrails.org/active_record_postgresql.html#json
there's a difference in using -> and ->>:
# db/migrate/20131220144913_create_events.rb
create_table :events do |t|
  t.json 'payload'
end

# app/models/event.rb
class Event < ActiveRecord::Base
end

# Usage
Event.create(payload: { kind: "user_renamed", change: ["jack", "john"]})
event = Event.first
event.payload # => {"kind"=>"user_renamed", "change"=>["jack", "john"]}

## Query based on JSON document
# The -> operator returns the original JSON type (which might be an object), whereas ->> returns text
Event.where("payload->>'kind' = ?", "user_renamed")
So you should try Record.where("data ->> 'status' = '200'") (note that ->> returns text, so compare against a string) or the operator that suits your query (http://www.postgresql.org/docs/current/static/functions-json.html).
Your question doesn't seem to correspond to the data you've shown, but if your table is named users and data is a field in that table with JSON like {count:123}, then the query
SELECT * FROM users WHERE (data->>'count')::int > 500
will work. Take a look at your database schema to make sure you understand the layout and check that the query works before complicating it with Rails conventions.
JSON filtering in Rails
Event.create(payload: [{ "name": 'Jack', "age": 12 },
                       { "name": 'John', "age": 13 },
                       { "name": 'Dohn', "age": 24 }])
Event.where('payload #> ?', '[{"age": 12}]')
#You can also filter by name key
Event.where('payload #> ?', '[{"name": "John"}]')
#You can also filter by {"name":"Jack", "age":12}
Event.where('payload #> ?', {"name":"Jack", "age":12}.to_json)
You can find more about this here