With Apache Drill, when querying files from the filesystem, is there any way to set a shortcut for long directory paths?
For example, in:
> SELECT * FROM dfs.`/Users/me/Clients/foo/current-data/sample/releases/test*.json`
Is there any way I can shorten /Users/me/Dropbox/Clients/foo/current-data/sample/releases/ to a local variable so I don't have to type the full path each time?
I've looked through the docs, but can't see any reference to this (but maybe I'm being dumb).
There are a couple options here:
You could create a view from your long query so you don't have to type the monstrosity every time. This is less flexible than the second solution. For more information, check out: https://drill.apache.org/docs/create-view
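For example, a rough sketch (assuming the default writable dfs.tmp workspace is available to store the view; the view name is just an example):

CREATE VIEW dfs.tmp.releases_view AS
SELECT * FROM dfs.`/Users/me/Clients/foo/current-data/sample/releases/test*.json`;

SELECT * FROM dfs.tmp.releases_view;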
You could modify the DFS storage settings (in the web UI at http://<host>:8047, under the Storage tab, dfs plugin) and create a new workspace pointing directly to the "/Users/me/Clients/foo/current-data/sample/releases" directory.
For example:
"releases": {
"location":
"/mapr/demo.mapr.com/data/a/university/student/records/grades/",
"writable": true,
"defaultInputFormat": null
}
Then you would be able to query files in that workspace directly, for example: SELECT * FROM dfs.releases.`test*.json`
Related
A couple of weeks ago I tried spinning up a local IPFS node, published a file, and was able to access it via a public gateway. I thought the file would have been stored by lots of nodes, so I deleted it from my local machine; now I can't access the file via its CID (QmNvxsaXqWoLR1NNJpiRXTEo57ptyg3CjSGBrgeyiyFiPm) anymore.
I noticed that I can still somehow access the data from the webui, but I'm only able to see the raw data instead of the file. Is there any way to retrieve the file?
I actually can retrieve this CID via a simple ipfs cat QmNvxsaXqWoLR1NNJpiRXTEo57ptyg3CjSGBrgeyiyFiPm:
{
"0x9a39f286e1cd710da14e45ac124e38f2b6242622": "4.705",
"0x7c981d31b2ab65ce9f9cce49feac9e9e11e8ca64": "0.174481",
"0xa83cdaaadbb0e01d5de8df4a670947eacbb11f7e": "0.860812",
"0x445f4b54039cb1f86644351f2ef324c6876f6d76": "0.036128",
"0x29eab4341629aa1ae5e996f76ea0750548311ecf": "5.4",
"0xbbccf6cab5b3aec26b0cbc6095b5b6ddbacfd59a": "17.172011",
"0x33d5ae030cf11723f9b34ecc6fe5cfe00c6dc133": "0.001909",
"0x03886228bb749eeba43426d2d6b70eba472f4876": "6.8",
"0x1eb8e88a563fde7b3b8ebbbb0e1ac117c3d80800": "1821.138157",
"0x62ba33ccc4a404456e388456c332d871dae7ae9e": "0.000145",
"0x63e62588330657c99ba79139e7c21af0c0db1e7e": "12.560212",
"0xcd45fdaa6a72740e1d092f458213ff39d3d94a10": "280.592062",
"0xb92667e34cb6753449adf464f18ce1833caf26e0": "0.647424",
"0x9a5179e08acf37b3d84c9a0c0d6f3ea2417f9175": "10.097725",
"0xc43cffc5db578879cc5d0d4cfe07ad514c934d3b": "6.365907",
"0x34915628fc56ae8ff6684be39462e7ba398164b8": "0.00069",
"0x47e2bc7475ef8a9a5e10aef076da53d3c446291e": "5.305",
"0xf432d70c941ebe657ca8cff0b70d1649d5781eea": "0.153823",
"0xff90d66d41fc97b223e8005dba51635b5d49632b": "0.002298",
"0x1cf41ad63f67f3e7f8a1db240d812f5392b9a9c4": "6.05013",
"0xc418aaa0d1e018ded3efc0f72a089519b3d58683": "0.179902",
"0x7d209486a3562fe406b72d65b3703884c50bac81": "2.191224",
"0xe782657a1043062087232b3c20c4d25e2a982cb3": "0.110927",
"0xd998e5a4777e1b47c1441a88bb553cbf16802e4c": "0.095045",
"0x9f3ef50ea64adad5b33f1f8222760cfbf42007f7": "0.069055",
"0x40c1efa324fd80329117409c65081f13e7a08a42": "2790.399058",
"0x9ef8c5ae4a320ef0984695af9a85d07f5be13792": "0.139741",
"0xf46422c1b6c2135dbca9b55771fd6e7869a8691c": "995.479262",
"0xf6f3bc09782d3c0df474eb3cec5cac8423bfedf3": "0.00012",
"0x4f2769e87c7d96ed9ca72084845ee05e7de5dda2": "0.000509",
"0x92f1e9a52c1a81fdb76ee6477c0c605917cddbe5": "0.811623",
"0x1e6424a481e6404ed2858d540aec37399671f5e0": "19253.760913",
"0xc9b2c3a6a8e1896aadcf236b88019c7574d75069": "781.127767",
"0xb08f95dbc639621dbaf48a472ae8fce0f6f56a6e": "34.704074"
}
> I thought the file would have been stored by lots of nodes, so I deleted it from my local machine
It's important to note that data is only stored by other nodes temporarily, and only if they access the content themselves. If you want data to live reliably long-term, you can use a pinning service like Pinata, since you pay them to keep your data pinned.
Otherwise you have to rely on other nodes pinning your data to ensure it remains available.
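As a rough sketch of getting the data back under your own node's control (standard go-ipfs commands; the filename is just an example):

# save the content while it is still reachable
ipfs cat QmNvxsaXqWoLR1NNJpiRXTEo57ptyg3CjSGBrgeyiyFiPm > balances.json

# re-add it to your local node (ipfs add pins the content by default)
ipfs add balances.json

# or simply pin the existing CID so your node keeps providing it
ipfs pin add QmNvxsaXqWoLR1NNJpiRXTEo57ptyg3CjSGBrgeyiyFiPm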
I have a pipeline that processes some files, and in some cases "groups" of files, meaning the files should be processed together and are correlated by a timestamp.
Ex.
Timestamp#Customer.csv
Timestamp#Customer_Offices.csv
Timestamp_1#Customer.csv
Timestamp_1#Customer_Offices.csv
...
I have a table with all the scopes and their files, with the respective file masks. I populate a variable at the beginning of the pipeline based on a parameter.
The Get files activity goes to an sFTP location and grabs files from a folder. Then I only want to process the "Customer.csv" and "Customer_Offices.csv" files, because the folder location has more file types or scopes to be processed by other pipelines. If I don't filter, the next activities end up processing metadata of files they are not supposed to. That's bad for efficiency and performance, and it is even causing some issues with files being left behind.
I've tried something like
@variables('FilesToSearch').contains(@endswith(item().name, 'do I need this 2nd parm in arrays ?'))
but no luck... :(
Any help will be highly appreciated,
Best regards,
Manuel
The contains function can work directly on a string to find a substring, so you can try an expression like @contains(item().name,'Customer')
and no need to create a variable.
Or use the endswith function with this expression:
@or(endswith(item().name,'Customer.csv'),endswith(item().name,'Customer_Offices.csv'))
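For reference, a rough sketch of how that condition could sit inside a Filter activity (activity names are placeholders, and this assumes the file listing comes from a Get Metadata activity whose output exposes childItems):

{
    "name": "Filter customer files",
    "type": "Filter",
    "typeProperties": {
        "items": {
            "value": "@activity('Get files').output.childItems",
            "type": "Expression"
        },
        "condition": {
            "value": "@or(endswith(item().name,'Customer.csv'), endswith(item().name,'Customer_Offices.csv'))",
            "type": "Expression"
        }
    }
}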
I have two columns in config file col1 and col2.
Now I have to import this config file into my main Python transform and then extract the values of the columns, in order to create dynamic output paths from these values by iterating over all the possible values.
For example
output_path1 = Constant + value1 + value2
output_path2 = Constant + value3 + value4
Please suggest a solution for generating the output files in Palantir Foundry (Code Repositories).
What you probably want to use is a transform generator. In the "Python Transforms" chapter of the documentation, there's a section "Transform generation" which outlines the basics of this.
The most straightforward path is likely to generate multiple transforms, but if you want just one transform that outputs to multiple datasets, that would be possible too (if a little more complicated.)
For the former approach, you would add a .yaml file (or similar) to your repo, in which you define your values, and then you read the .yaml file and generate multiple transforms based on the values. The documentation gives an example that does pretty much exactly this.
For the latter approach, you would probably want to read the .yaml file in your pipeline definer, and then dynamically add outputs to a single transform. In your transforms code, you then need to be able to handle an arbitrary number of outputs in some way (which I presume you have a plan for.) I suspect you might need to fall back to manual transform registration for this, or you might need to construct a transforms object without using the decorator. If this is the solution you need, I can construct an example for you.
Before you proceed with this though, I want to note that the number of inputs and outputs is fixed at "CI-time" or "compile-time". When you press the "commit" button in Authoring (or you merge a PR), it is at this point that the code is run that generates the transforms/outputs. At a later time, when you build the actual dataset (i.e. you run the transforms) it is not possible to add/remove inputs, outputs and transforms anymore.
So to change the number of inputs/outputs/transforms, you will need to go to the repo, modify the .yaml file (or whatever you chose to use) and then press the commit button. This will cause the CI checks to run, and publish the new code, including any new transforms that might have been generated in the process.
If this doesn't work for you (i.e. you want to decide at dataset build-time which outputs to generate) you'll have to fundamentally re-think your approach. Otherwise you should be good with one of the two solutions I roughly outlined above.
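For the former approach, a rough sketch of what the generation could look like (the file name, keys and dataset paths are all illustrative, and this assumes pyyaml is available in the repo; the example further down shows the same pattern with a plain Python config file):

# values.yaml, checked into the repo:
#
# pairs:
#   - [value1, value2]
#   - [value3, value4]

import os

import yaml
from transforms.api import transform_df, Input, Output

# resolve the file relative to this module so it is found at CI time
_CONFIG_PATH = os.path.join(os.path.dirname(__file__), "values.yaml")

with open(_CONFIG_PATH) as f:
    pairs = yaml.safe_load(f)["pairs"]

TRANSFORMS = []

for value_a, value_b in pairs:
    # dynamic output path built from the config values
    output_path = "/Constant/{0}/{1}".format(value_a, value_b)

    @transform_df(Output(output_path), source_df=Input("/path/to/input/dataset"))
    def generated_transform(source_df):
        return source_df  # ... your transformation logic here

    TRANSFORMS.append(generated_transform)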
You cannot programmatically create transforms based on another dataset's contents. The datasets are created at CI time.
You can however have a constants file inside your code repo, which can be read at CI time, and use that to generate transforms. I.e.:
myconfig.py:
dataset_pairs = [
    {
        "in": "/path/to/input/dataset",
        "out": "/path/to/output/dataset",
    },
    {
        "in": "/path/to/input/dataset2",
        "out": "/path/to/output/dataset2",
    },
    # ...
    {
        "in": "/path/to/input/datasetN",
        "out": "/path/to/output/datasetN",
    },
]
///////////////////////////
anotherfile.py:
from transforms.api import transform_df, Input, Output

from myconfig import dataset_pairs

TRANSFORMS = []

for conf in dataset_pairs:

    @transform_df(Output(conf["out"]), my_input=Input(conf["in"]))
    def my_generated_transform(my_input):
        df = my_input  # ... your transformation logic here
        return df

    TRANSFORMS.append(my_generated_transform)
To re-iterate: you cannot create the config.py programmatically based on a dataset's contents, because this code runs at CI time, so it doesn't have access to the datasets.
In Airflow, we've created several DAGs, some of which share common properties, for example the directory to read files from. Currently, these properties are listed as a property in each separate DAG, which will obviously become problematic in the future: if the directory name were to change, we'd have to go into each DAG and update this piece of code (possibly even missing one).
I was looking into creating some sort of configuration file, which can be parsed by Airflow and used by the various DAGs when a certain property is required, but I cannot seem to find any documentation or guide on how to do this. The most I could find was the documentation on setting up Connection IDs, but that does not meet my use case.
So the question of my post: is it possible to do the above scenario, and how?
Thanks in advance.
There are a few ways you can accomplish this based on your setup:
You can use a DagFactory-type approach where you have a function generate DAGs; a rough sketch of what that can look like is below.
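Purely as an illustration (DAG names, schedule and the shared property are placeholders; imports follow the Airflow 1.x layout used in the Variable example further down):

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# shared properties defined once, used by every generated DAG
COMMON_CONFIG = {"input_directory": "/data/incoming"}

def create_dag(dag_id, config, schedule="@daily"):
    dag = DAG(dag_id, start_date=datetime(2019, 1, 1), schedule_interval=schedule)
    with dag:
        # real tasks would read config["input_directory"] here
        DummyOperator(task_id="process_files")
    return dag

# one DAG per application, all reading the same shared config
for app in ["app_one", "app_two"]:
    dag_id = "process_{0}".format(app)
    globals()[dag_id] = create_dag(dag_id, COMMON_CONFIG)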
You can store a JSON config as an Airflow Variable, and parse through that to generate a DAG. You can store something like this under Admin -> Variables:
[
    {
        "table": "users",
        "schema": "app_one",
        "s3_bucket": "etl_bucket",
        "s3_key": "app_one_users",
        "redshift_conn_id": "postgres_default"
    },
    {
        "table": "users",
        "schema": "app_two",
        "s3_bucket": "etl_bucket",
        "s3_key": "app_two_users",
        "redshift_conn_id": "postgres_default"
    }
]
Your DAG could get generated as:
import json
from datetime import datetime
from airflow import DAG
from airflow.models import Variable
from airflow.operators.dummy_operator import DummyOperator
# Airflow 1.10 import path; adjust for your version
from airflow.operators.redshift_to_s3_operator import RedshiftToS3Transfer

sync_config = json.loads(Variable.get("sync_config"))

# illustrative DAG settings
dag = DAG("redshift_to_s3_sync", start_date=datetime(2019, 1, 1), schedule_interval="@daily")

with dag:
    start = DummyOperator(task_id='begin_dag')
    for table in sync_config:
        d1 = RedshiftToS3Transfer(
            task_id='{0}'.format(table['s3_key']),
            table=table['table'],
            schema=table['schema'],
            s3_bucket=table['s3_bucket'],
            s3_key=table['s3_key'],
            redshift_conn_id=table['redshift_conn_id']
        )
        start >> d1
Similarly, you can just store that config as a local file and open it as you would any other file. Keep in mind the best answer to this will depend on your infrastructure and use case.
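A minimal sketch of the local-file variant (the path and filename are just examples):

import json
import os

# resolve the config relative to the DAG file so it works regardless of the working directory
CONFIG_PATH = os.path.join(os.path.dirname(__file__), "sync_config.json")

with open(CONFIG_PATH) as f:
    sync_config = json.load(f)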
I was led to believe that you can wildcard the filename property in an Azure Blob Table source object.
I want to pick up only certain csv files from blob storage that exist in the same directory as other files I don't want to process:
i.e.
root/data/GUJH-01.csv
root/data/GUJH-02.csv
root/data/DFGT-01.csv
I want to process GUJH*.csv and not DFGT-01.csv
Is this possible? If so, why is my blob source validation failing, informing me that the file does not exist? (The message reports that the root/data blob does not exist.)
Thanks in advance.
Answering my own question..
There's not a wildcard but there is a 'Starts With' which will work in my scenario:
Instead of root/data/GUJH*.csv I can do root/data/GUJH on the folderPath property and it will bring in all root/data/GUJH files..
:)
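For reference, a rough sketch of the relevant dataset properties with that prefix approach (the format and delimiter are illustrative):

"typeProperties": {
    "folderPath": "root/data/GUJH",
    "format": {
        "type": "TextFormat",
        "columnDelimiter": ","
    }
}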
Just adding some more detail here because I'm finding this a very difficult learning curve and I'd like to document this for my sake and others.
Given a sample file like this (no extensions in this case) in blob storage,
ZZZZ_20170727_1324
We can see the middle part is in yyyyMMdd format.
This is uploaded to folder Landing inside container MyContainer
This was part of my dataset definition:
"typeProperties": {
    "folderPath": "MyContainer/Landing/ZZZZ_{DayCode}",
    "format": {
        "type": "TextFormat",
        "columnDelimiter": "\u0001"
    },
    "partitionedBy": [
        {
            "name": "DayCode",
            "value": {
                "type": "DateTime",
                "date": "SliceStart",
                "format": "yyyyMMdd"
            }
        }
    ]
},
Note that it's a 'prefix', which you will see in the log / error messages, if you can find them (good luck)
If you want to test loading this particular file you need to press the 'Diagram' button, then drill into your pipeline until you find the target dataset - the one the file is being loaded into (I am loading this into SQL Azure). Click on the target dataset, now go and find the correct timeslice. In my case I need to find the timeslice with a start timeslice of 20170727 and run that one.
This will make sure the correct file is picked up and loaded in to SQL Azure
Forget about manually running pipelines or activities - that's just not how it works. You need to run the output dataset under a timeslice to pull it through.