In Airflow, we've created several DAGs, some of which share common properties, for example the directory to read files from. Currently, these properties are duplicated in each separate DAG, which will obviously become problematic in the future: if the directory name were to change, we'd have to go into each DAG and update that piece of code (possibly even missing one).
I was looking into creating some sort of configuration file which can be read by Airflow and used by the various DAGs when a certain property is required, but I cannot find any documentation or guide on how to do this. The most I could find was the documentation on setting up Connection IDs, but that does not meet my use case.
My question: is the above scenario possible, and if so, how?
Thanks in advance.
There are a few ways you can accomplish this based on your setup:
You can use a DAG-factory type approach, where a function generates DAGs. You can find an example of what that looks like here.
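A minimal sketch of what such a factory might look like (the DAG names, schedule, shared property, and registration via globals() are purely illustrative):
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# Shared properties live in one place and are passed into the factory
SHARED_CONFIG = {"input_dir": "/data/incoming"}  # hypothetical shared property


def create_dag(dag_id, config, schedule="@daily"):
    dag = DAG(dag_id, start_date=datetime(2019, 1, 1), schedule_interval=schedule)
    with dag:
        # Replace with real tasks that use config["input_dir"]
        DummyOperator(task_id="start")
    return dag


# Register the generated DAGs at module level so the scheduler picks them up
for name in ["dag_one", "dag_two"]:
    globals()[name] = create_dag(name, SHARED_CONFIG)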
You can store a JSON config as an Airflow Variable, and parse it to generate a DAG. You can store something like this under Admin -> Variables:
[
  {
    "table": "users",
    "schema": "app_one",
    "s3_bucket": "etl_bucket",
    "s3_key": "app_one_users",
    "redshift_conn_id": "postgres_default"
  },
  {
    "table": "users",
    "schema": "app_two",
    "s3_bucket": "etl_bucket",
    "s3_key": "app_two_users",
    "redshift_conn_id": "postgres_default"
  }
]
Your DAG could get generated as:
import json

from airflow.models import Variable
from airflow.operators.dummy_operator import DummyOperator
# Import path for RedshiftToS3Transfer varies by Airflow version
from airflow.operators.redshift_to_s3_operator import RedshiftToS3Transfer

# Parse the JSON config stored as the "sync_config" Airflow Variable
sync_config = json.loads(Variable.get("sync_config"))

# `dag` is your DAG object, defined as usual elsewhere in this file
with dag:
    start = DummyOperator(task_id='begin_dag')
    # One transfer task per table entry in the config
    for table in sync_config:
        d1 = RedshiftToS3Transfer(
            task_id='{0}'.format(table['s3_key']),
            table=table['table'],
            schema=table['schema'],
            s3_bucket=table['s3_bucket'],
            s3_key=table['s3_key'],
            redshift_conn_id=table['redshift_conn_id']
        )
        start >> d1
Similarly, you can just store that config as a local file and open it as you would any other file (see the sketch below). Keep in mind that the best answer to this will depend on your infrastructure and use case.
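For instance, assuming a sync_config.json that sits next to your DAG files (the file name and location are illustrative):
import json
import os

# Illustrative path: a shared config file living alongside the DAG files
CONFIG_PATH = os.path.join(os.path.dirname(__file__), "sync_config.json")

with open(CONFIG_PATH) as f:
    sync_config = json.load(f)

# From here, generate the tasks exactly as in the Variable-based example above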
I have a pipeline that processes some files, and in some cases "groups" of files, meaning files that should be processed together and are correlated by a timestamp.
Ex.
Timestamp#Customer.csv
Timestamp#Customer_Offices.csv
Timestamp_1#Customer.csv
Timestamp_1#Customer_Offices.csv
...
I have a table with all the scopes and their files with the respective file masks, and I have populated a variable at the beginning of the pipeline based on a parameter.
The Get Files activity goes to an sFTP location and grabs the files from a folder. Then I only want to process the "Customer.csv" and "Customer_Offices.csv" files, because the folder contains other file types or scopes that are processed by other pipelines. If I don't filter, the next activities end up processing metadata for files they are not supposed to. In terms of efficiency and performance this is bad, and it is even causing some issues with files being left behind.
I've tried something like
#variables('FilesToSearch').contains(#endswith(item().name, 'do I need this 2nd parm in arrays ?'))
but no luck... :(
Any help will be highly appreciated,
Best regards,
Manuel
The contains function can also search a string for a substring, so you can try an expression like @contains(item().name,'Customer')
and there is no need to create a variable.
Or use the endsWith function with this expression:
@or(endswith(item().name,'Customer.csv'),endswith(item().name,'Customer_Offices.csv'))
I have two columns, col1 and col2, in a config file.
Now I have to import this config file into my main Python transform and then extract the column values in order to create dynamic output paths from these values by iterating over all the possible values.
For example:
output_path1=Constant+value1+value2
output_path2=Constant+value3+value4
Please suggest a solution for generating the output files in Palantir Foundry (Code Repositories).
What you probably want to use is a transform generator. In the "Python Transforms" chapter of the documentation, there's a section "Transform generation" which outlines the basics of this.
The most straightforward path is likely to generate multiple transforms, but if you want just one transform that outputs to multiple datasets, that would be possible too (if a little more complicated.)
For the former approach, you would add a .yaml file (or similar) to your repo, in which you define your values, and then you read the .yaml file and generate multiple transforms based on the values. The documentation gives an example that does pretty much exactly this.
For the latter approach, you would probably want to read the .yaml file in your pipeline definer, and then dynamically add outputs to a single transform. In your transforms code, you then need to be able to handle an arbitrary number of outputs in some way (which I presume you have a plan for.) I suspect you might need to fall back to manual transform registration for this, or you might need to construct a transforms object without using the decorator. If this is the solution you need, I can construct an example for you.
Before you proceed with this though, I want to note that the number of inputs and outputs is fixed at "CI-time" or "compile-time". When you press the "commit" button in Authoring (or you merge a PR), it is at this point that the code is run that generates the transforms/outputs. At a later time, when you build the actual dataset (i.e. you run the transforms) it is not possible to add/remove inputs, outputs and transforms anymore.
So to change the number of inputs/outputs/transforms, you will need to go to the repo, modify the .yaml file (or whatever you chose to use) and then press the commit button. This will cause the CI checks to run, and publish the new code, including any new transforms that might have been generated in the process.
If this doesn't work for you (i.e. you want to decide at dataset build-time which outputs to generate) you'll have to fundamentally re-think your approach. Otherwise you should be good with one of the two solutions I roughly outlined above.
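To make the first approach a bit more concrete, here is a rough sketch. It assumes you keep a config.yaml packaged alongside your transforms module and that PyYAML is available as a repo dependency; the file name, dataset paths and transformation logic are all illustrative:
# config.yaml (illustrative), checked in next to this module:
# datasets:
#   - input: /Project/raw/table_a
#     output: /Project/clean/table_a
#   - input: /Project/raw/table_b
#     output: /Project/clean/table_b

import os

import yaml  # assumes PyYAML is declared as a repo dependency

from transforms.api import transform_df, Input, Output


def generate_transforms(config_path):
    with open(config_path) as f:
        config = yaml.safe_load(f)

    transforms = []
    for entry in config["datasets"]:
        @transform_df(
            Output(entry["output"]),
            source_df=Input(entry["input"]),
        )
        def compute(source_df):
            # ... your actual transformation logic goes here
            return source_df

        transforms.append(compute)
    return transforms


TRANSFORMS = generate_transforms(
    os.path.join(os.path.dirname(__file__), "config.yaml")
)
You would then register the generated transforms in your pipeline definer (for example via my_pipeline.add_transforms(*TRANSFORMS)), as the "Transform generation" section of the documentation describes.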
You cannot programmatically create transforms based on another dataset's contents; the transforms are created at CI time. You can, however, have a constants file inside your code repo, which can be read at CI time, and use that to generate transforms. For example:
myconfig.py:
dataset_pairs = [
    {
        "in": "/path/to/input/dataset",
        "out": "/path/to/output/dataset",
    },
    {
        "in": "/path/to/input/dataset2",
        "out": "/path/to/output/dataset2",
    },
    # ...
    {
        "in": "/path/to/input/datasetN",
        "out": "/path/to/output/datasetN",
    },
]
///////////////////////////
anotherfile.py:
from transforms.api import transform_df, Input, Output

from myconfig import dataset_pairs

TRANSFORMS = []
for conf in dataset_pairs:
    @transform_df(Output(conf["out"]), my_input=Input(conf["in"]))
    def my_generated_transform(my_input):
        df = my_input
        # ... your transformation logic here
        return df

    TRANSFORMS.append(my_generated_transform)
To reiterate, you cannot create myconfig.py programmatically based on a dataset's contents, because this code runs at CI time, so it doesn't have access to the datasets.
With Apache Drill, when querying files from the filesystem, is there any way to set a shortcut for long directory paths?
For example, in:
> SELECT * FROM dfs.`/Users/me/Clients/foo/current-data/sample/releases/test*.json`
Is there any way I can shorten /Users/me/Dropbox/Clients/foo/current-data/sample/releases/ to a local variable so I don't have to type the full path each time?
I've looked through the docs, but can't see any reference to this (but maybe I'm being dumb).
There are a couple options here:
You could create a view from your long query so you don't have to type the monstrosity every time. This is less flexible than the second solution. For more information, check out: https://drill.apache.org/docs/create-view
You could modify the dfs storage plugin settings (in the Drill web UI on port 8047, under the Storage tab, dfs plugin) and create a new workspace pointing directly to the "/Users/me/Clients/foo/current-data/sample/releases" directory.
For example:
"releases": {
"location":
"/mapr/demo.mapr.com/data/a/university/student/records/grades/",
"writable": true,
"defaultInputFormat": null
}
Then you would be able to query: select * from dfs.releases.`test*.json`
I was led to believe that you can wildcard the filename property in an Azure Blob Table source object.
I want to pick up only certain csv files from blob storage that exist in the same directory as other files I don't want to process:
i.e.
root/data/GUJH-01.csv
root/data/GUJH-02.csv
root/data/DFGT-01.csv
I want to process GUJH*.csv and not DFGT-01.csv
Is this possible? If so, why is my blob source validation failing, informing me that the file does not exist? (The message reports that the root/data blob does not exist.)
Thanks in advance.
Answering my own question..
There's not a wildcard but there is a 'Starts With' which will work in my scenario:
Instead of root/data/GUJH*.csv I can use root/data/GUJH on the folderPath property and it will bring in all the root/data/GUJH files.
:)
Just adding some more detail here because I'm finding this a very difficult learning curve and I'd like to document this for my sake and others.
Given a sample file like this (no extensions in this case) in blob storage,
ZZZZ_20170727_1324
We can see the middle part is in yyyyMMdd format.
This is uploaded to the folder Landing inside the container MyContainer.
This was part of my dataset definition:
"typeProperties": {
"folderPath": "MyContainer/Landing/ZZZZ_{DayCode}",
"format": {
"type": "TextFormat",
"columnDelimiter": "\u0001"
},
"partitionedBy": [
{
"name": "DayCode",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyyMMdd"
}
}
]
},
Note that it's a 'prefix', which you will see in the log / error messages, if you can find them (good luck).
If you want to test loading this particular file you need to press the 'Diagram' button, then drill into your pipeline until you find the target dataset - the one the file is being loaded into (I am loading this into SQL Azure). Click on the target dataset, then go and find the correct timeslice. In my case I need to find the timeslice with a start of 20170727 and run that one.
This will make sure the correct file is picked up and loaded into SQL Azure.
Forget about manually running pipelines or activities - that's just not how it works. You need to run the output dataset under a timeslice to pull it through.
I have two different versions of Linux/Unix, each running CFEngine 3. Is it possible to have one promises.cf file I can put on both machines that will copy different files based on what OS is on the client? I have been searching around the internet for a few hours now and have not found anything useful yet.
There are several ways of doing this. At the simplest, you can simply have different files: promises depending on the operating system, for example:
files:
  ubuntu_10::
    "/etc/hosts"
      copy_from => mycopy("$(repository)/etc.hosts.ubuntu_10");
  suse_9::
    "/etc/hosts"
      copy_from => mycopy("$(repository)/etc.hosts.suse_9");
  redhat_5::
    "/etc/hosts"
      copy_from => mycopy("$(repository)/etc.hosts.redhat_5");
  windows_7::
    "/etc/hosts"
      copy_from => mycopy("$(repository)/etc.hosts.windows_7");
This example can be easily simplified by realizing that the built-in CFEngine variable $(sys.flavor) contains the type and version of the operating system, so we could rewrite this example as follows:
"/etc/hosts"
copy_from => mycopy("$(repository)/etc.$(sys.flavor)");
A more flexible way to achieve this task is known in CFEngine terminology as "hierarchical copy." In this pattern, you specify an arbitrary list of variables by which you want files to be differentiated, and the order in which they should be considered, from most specific to most general. When the copy promise is executed, the most-specific file found will be copied.
This pattern is very simple to implement:
# Use single copy for all files
body agent control
{
  files_single_copy => { ".*" };
}

bundle agent test
{
  vars:
    "suffixes" slist => { ".$(sys.fqhost)", ".$(sys.uqhost)", ".$(sys.domain)",
                          ".$(sys.flavor)", ".$(sys.ostype)", "" };

  files:
    "/etc/hosts"
      copy_from => local_dcp("$(repository)/etc/hosts$(suffixes)");
}
As you can see, we are defining a list variable called $(suffixes) that contains the criteria by which we want to differentiate the files. All the variables contained in this list are automatically defined by CFEngine, although you could use any arbitrary CFEngine variables. Then we simply include that variable, as a scalar, in our copy_from parameter. Because CFEngine does automatic list expansion, it will try each value in turn, executing the copy promise multiple times (once for each value in the list) and copying the first file that exists. For example, for a Linux SuSE 11 machine called superman.justiceleague.com, the $(suffixes) variable will contain the following values:
{ ".superman.justiceleague.com", ".superman", ".justiceleague.com", ".suse_11",
".linux", "" }
When the file-copy promise is executed, implicit looping will cause these strings to be appended in sequence to "$(repository)/etc/hosts", so the following filenames will be attempted in sequence: hosts.superman.justiceleague.com, hosts.superman, hosts.justiceleague.com, hosts.suse_11, hosts.linux and hosts. The first one to exist will be copied over /etc/hosts on the client, and the rest will be skipped.
For this technique to work, we have to enable "single copy" on all the files you want to process. This is a configuration parameter that tells CFEngine to copy each file at most once, ignoring successive copy operations for the same destination file. The files_single_copy parameter in the agent control body specifies a list of regular expressions to match filenames to which single-copy should apply. By setting it to ".*" we match all filenames.
For hosts that don't match any of the existing files, the last item on the list (an empty string) will cause the generic hosts file to be copied. Note that the dot for each of the filenames is included in $(suffixes), except for the last element.
I hope this helps.
(p.s. and shameless plug: this is taken from my upcoming book, "Learning CFEngine 3", published by O'Reilly)