Data Factory Azure Blob source - wildcard - CSV

I was led to believe that you can wildcard the fileName property in an Azure Blob source dataset.
I want to pick up only certain csv files from blob storage that exist in the same directory as other files I don't want to process:
i.e.
root/data/GUJH-01.csv
root/data/GUJH-02.csv
root/data/DFGT-01.csv
I want to process GUJH*.csv and not DFGT-01.csv
Is this possible? If so, why is my blob source validation failing, informing me that the file does not exist (the message reports that the root/data blob does not exist)?
Thanks in advance.

Answering my own question..
There's no wildcard, but there is 'Starts With' behaviour, which works in my scenario:
Instead of root/data/GUJH*.csv I can set root/data/GUJH on the folderPath property and it will bring in all root/data/GUJH files.
:)
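For anyone who wants to sanity-check which blobs a given prefix actually matches before wiring up the dataset, here is a rough sketch using the azure-storage-blob v12 Python SDK (the container name, prefix, and connection-string environment variable are assumptions for illustration, not part of the original setup):

    import os
    from azure.storage.blob import ContainerClient

    # Hypothetical container name; adjust to match your storage account layout.
    container = ContainerClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"],
        container_name="root",
    )

    # Data Factory's 'Starts With' is plain prefix matching, same as name_starts_with here.
    for blob in container.list_blobs(name_starts_with="data/GUJH"):
        print(blob.name)  # expect data/GUJH-01.csv and data/GUJH-02.csv, but not data/DFGT-01.csv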

Just adding some more detail here, because I'm finding this a very difficult learning curve and I'd like to document it for my own sake and for others.
Given a sample file like this (no extension in this case) in blob storage:
ZZZZ_20170727_1324
We can see the middle part is in yyyyMMdd format.
This is uploaded to the folder Landing inside the container MyContainer.
This was part of my dataset definition:
"typeProperties": {
"folderPath": "MyContainer/Landing/ZZZZ_{DayCode}",
"format": {
"type": "TextFormat",
"columnDelimiter": "\u0001"
},
"partitionedBy": [
{
"name": "DayCode",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyyMMdd"
}
}
]
},
Note that it's a 'prefix', which you will see in the log / error messages, if you can find them (good luck).
If you want to test loading this particular file you need to press the 'Diagram' button, then drill into your pipeline until you find the target dataset - the one the file is being loaded into (I am loading this into SQL Azure). Click on the target dataset, then go and find the correct timeslice. In my case I need to find the timeslice with a start time of 20170727 and run that one.
This will make sure the correct file is picked up and loaded into SQL Azure.
Forget about manually running pipelines or activities - that's just not how it works. You need to run the output dataset under a timeslice to pull it through.

Related

How to generate dynamic files using a config file in Palantir Foundry

I have two columns in a config file, col1 and col2.
Now I have to import this config file into my main Python transform, extract the values of the columns, and create dynamic output paths from these values by iterating over all the possible values.
For example:
output_path1 = Constant + value1 + value2
output_path2 = Constant + value3 + value4
Please suggest a solution for generating the output files in Palantir Foundry (Code Repositories).
What you probably want to use is a transform generator. In the "Python Transforms" chapter of the documentation, there's a section "Transform generation" which outlines the basics of this.
The most straightforward path is likely to generate multiple transforms, but if you want just one transform that outputs to multiple datasets, that would be possible too (if a little more complicated).
For the former approach, you would add a .yaml file (or similar) to your repo, in which you define your values, and then you read the .yaml file and generate multiple transforms based on the values. The documentation gives an example that does pretty much exactly this.
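As a rough sketch of that first approach (the file name pipelines.yaml, the helper names, and the pass-through logic are placeholders made up for illustration, and it assumes PyYAML is available in your repo's environment):

    import os

    import yaml  # assumes PyYAML is declared as a dependency of the repo
    from transforms.api import transform_df, Input, Output

    def _load_pairs():
        # pipelines.yaml is a hypothetical file sitting next to this module,
        # e.g. a list of {"in": "/path/to/input", "out": "/path/to/output"} entries.
        config_path = os.path.join(os.path.dirname(__file__), "pipelines.yaml")
        with open(config_path) as fh:
            return yaml.safe_load(fh)

    def _make_transform(input_path, output_path):
        # Factory function so each generated transform captures its own paths.
        @transform_df(
            Output(output_path),
            source_df=Input(input_path),
        )
        def compute(source_df):
            # Pass-through for the sketch; real transformation logic goes here.
            return source_df
        return compute

    TRANSFORMS = [_make_transform(p["in"], p["out"]) for p in _load_pairs()]

The generated transforms still need to be registered with your pipeline (for example with something like my_pipeline.add_transforms(*TRANSFORMS) in your pipeline definer); the documentation's "Transform generation" section shows the exact registration step.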
For the latter approach, you would probably want to read the .yaml file in your pipeline definer, and then dynamically add outputs to a single transform. In your transforms code, you then need to be able to handle an arbitrary number of outputs in some way (which I presume you have a plan for.) I suspect you might need to fall back to manual transform registration for this, or you might need to construct a transforms object without using the decorator. If this is the solution you need, I can construct an example for you.
Before you proceed with this though, I want to note that the number of inputs and outputs is fixed at "CI-time" or "compile-time". When you press the "commit" button in Authoring (or you merge a PR), it is at this point that the code is run that generates the transforms/outputs. At a later time, when you build the actual dataset (i.e. you run the transforms) it is not possible to add/remove inputs, outputs and transforms anymore.
So to change the number of inputs/outputs/transforms, you will need to go to the repo, modify the .yaml file (or whatever you chose to use) and then press the commit button. This will cause the CI checks to run, and publish the new code, including any new transforms that might have been generated in the process.
If this doesn't work for you (i.e. you want to decide at dataset build-time which outputs to generate) you'll have to fundamentally re-think your approach. Otherwise you should be good with one of the two solutions I roughly outlined above.
You cannot programmatically create transforms based on another dataset's contents. The datasets are created at CI time.
You can, however, have a constants file inside your code repo, which can be read at CI time, and use that to generate transforms. For example:
myconfig.py:
dataset_pairs = [
    {
        "in": "/path/to/input/dataset",
        "out": "/path/to/output/dataset",
    },
    {
        "in": "/path/to/input/dataset2",
        "out": "/path/to/output/dataset2",
    },
    # ...
    {
        "in": "/path/to/input/datasetN",
        "out": "/path/to/output/datasetN",
    },
]
///////////////////////////
anotherfile.py:
from transforms.api import transform_df, Input, Output
from myconfig import dataset_pairs

TRANSFORMS = []
for conf in dataset_pairs:
    @transform_df(Output(conf["out"]), my_input=Input(conf["in"]))
    def my_generated_transform(my_input):
        df = my_input  # ... your real transformation logic goes here
        return df
    TRANSFORMS.append(my_generated_transform)
To reiterate, you cannot create the config.py programmatically based on a dataset's contents, because this code runs at CI time, so it doesn't have access to the datasets.

How to work with configuration files in Airflow

In Airflow, we've created several DAGs, some of which share common properties, for example the directory to read files from. Currently, these properties are listed as a property in each separate DAG, which will obviously become problematic in the future. Say the directory name were to change: we'd have to go into each DAG and update that piece of code (possibly even missing one).
I was looking into creating some sort of configuration file which can be parsed by Airflow and used by the various DAGs when a certain property is required, but I cannot seem to find any documentation or guide on how to do this. The most I could find was the documentation on setting up Connection IDs, but that does not meet my use case.
My question: is the above scenario possible, and how?
Thanks in advance.
There are a few ways you can accomplish this based on your setup:
You can use a DAG-factory-type approach, where you have a function that generates DAGs. You can find an example of what that looks like here.
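As a rough, hedged sketch of that factory approach (the function name create_dag, the config keys, the DAG ids, and the schedule are all made-up placeholders):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator  # import path varies by Airflow version

    # Shared properties live in one place instead of being copied into every DAG file.
    COMMON_CONFIG = {"input_dir": "/data/incoming", "owner": "data-team"}

    def create_dag(dag_id, config):
        # Build a DAG from the shared config; replace the placeholder task with real work.
        dag = DAG(
            dag_id=dag_id,
            start_date=datetime(2018, 1, 1),
            schedule_interval="@daily",
            default_args={"owner": config["owner"]},
        )
        with dag:
            DummyOperator(task_id="placeholder")  # tasks here would read config["input_dir"]
        return dag

    # Airflow picks up any DAG object bound at module level.
    for name in ("app_one_sync", "app_two_sync"):
        globals()[name] = create_dag(name, COMMON_CONFIG)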
You can store a JSON config as an Airflow Variable, and parse through that to generate a DAG. You can store something like this under Admin -> Variables:
[
    {
        "table": "users",
        "schema": "app_one",
        "s3_bucket": "etl_bucket",
        "s3_key": "app_one_users",
        "redshift_conn_id": "postgres_default"
    },
    {
        "table": "users",
        "schema": "app_two",
        "s3_bucket": "etl_bucket",
        "s3_key": "app_two_users",
        "redshift_conn_id": "postgres_default"
    }
]
Your DAG could get generated as:
import json
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.redshift_to_s3_operator import RedshiftToS3Transfer  # import paths vary by Airflow version

sync_config = json.loads(Variable.get("sync_config"))

# Placeholder DAG settings; use whatever id/schedule fits your pipeline.
dag = DAG('sync_tables', start_date=datetime(2018, 1, 1), schedule_interval='@daily')

with dag:
    start = DummyOperator(task_id='begin_dag')
    for table in sync_config:
        d1 = RedshiftToS3Transfer(
            task_id='{0}'.format(table['s3_key']),
            table=table['table'],
            schema=table['schema'],
            s3_bucket=table['s3_bucket'],
            s3_key=table['s3_key'],
            redshift_conn_id=table['redshift_conn_id']
        )
        start >> d1
Similarly, you can just store that config as a local file and open it as you would any other file. Keep in mind the best answer to this will depend on your infrastructure and use case.
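For the local-file option, a minimal sketch (the file name common_settings.json and its keys are hypothetical, and this assumes the file is deployed alongside your DAG files):

    import json
    import os

    # Hypothetical shared config deployed next to the DAG files.
    CONFIG_PATH = os.path.join(os.path.dirname(__file__), "common_settings.json")

    with open(CONFIG_PATH) as fh:
        common_settings = json.load(fh)  # e.g. {"input_dir": "/data/incoming"}

    INPUT_DIR = common_settings["input_dir"]  # every DAG module reads the same value

Keep in mind this runs every time the scheduler parses your DAG files, so keep the file small and the parsing cheap.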

Save long directory path to local variable in Apache Drill?

With Apache Drill, when querying files from the filesystem, is there any way to set a shortcut for long directory paths?
For example, in:
> SELECT * FROM dfs.`/Users/me/Clients/foo/current-data/sample/releases/test*.json`
Is there any way I can shorten /Users/me/Dropbox/Clients/foo/current-data/sample/releases/ to a local variable so I don't have to type the full path each time?
I've looked through the docs, but can't see any reference to this (but maybe I'm being dumb).
There are a couple options here:
You could create a view from your long query so you don't have to type the monstrosity every time. This is less flexible than the second solution. For more information, check out: https://drill.apache.org/docs/create-view
You could modify the DFS storage settings (in the web UI at http://<drill-host>:8047 under the Storage tab, dfs plugin) and create a new workspace pointing directly to the "/Users/me/Clients/foo/current-data/sample/releases" directory.
For example:
"releases": {
"location":
"/mapr/demo.mapr.com/data/a/university/student/records/grades/",
"writable": true,
"defaultInputFormat": null
}
Then you would be able to query: select * from dfs.releases.`tests.csv`
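If you also want to run such queries from a script, here is a rough sketch against Drill's REST API (assuming Drill is reachable on localhost:8047 and the workspace is named releases; the file name is a placeholder):

    import requests

    # Drill's REST endpoint accepts a SQL query as JSON and returns the rows as JSON.
    resp = requests.post(
        "http://localhost:8047/query.json",
        json={
            "queryType": "SQL",
            # The workspace keeps the query short instead of spelling out the full path.
            "query": "SELECT * FROM dfs.releases.`tests.csv` LIMIT 10",
        },
    )
    resp.raise_for_status()
    print(resp.json()["rows"])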

How to Update Parts of a document in Couchbase

In scanning the docs I cannot find how to update part of a document.
For example, say the whole document looks like this:
{
    "Active": true,
    "Barcode": "123456789",
    "BrandID": "9f3751ef-f14f-464a-bb86-854e99cf14c0",
    "BuyCurrencyOverride": ".37",
    "BuyDiscountAmount": "45.00",
    "ID": "003565a3-4a0d-47d9-befb-0ac642cb8057"
}
but I only want to work with part of the document as I don't want to be selecting / updating the whole document in many cases:
{
    "Active": false,
    "Barcode": "999999999",
    "BrandID": "9f3751ef-f14f-464a-bb86-854e99cf14c0",
    "ID": "003565a3-4a0d-47d9-befb-0ac642cb8057"
}
How can I use N1QL to update just those fields? UPSERT completely replaces the whole document, and the UPDATE statement documentation is not that clear.
Thanks
The answer to your question depends on why you want to update only part of the document (e.g., are you concerned about network bandwidth?), and how you want to perform the update (e.g., from the web console? from a program using the SDK?).
The 4.5 sub-document API, for which you provided a link in your comment, is a feature only available via the SDK (e.g., from Go or Java programs), and the goal of that feature is to reduce network bandwidth by not transmitting entire documents around. Does your use case include programmatic document modifications via the SDK? If so, then the sub-document API is a good way to go.
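As a rough sketch of what that looks like with the Python SDK's sub-document API (SDK 2.x style; the connection string, bucket name, and document key are placeholders taken from the example above):

    from couchbase.bucket import Bucket
    import couchbase.subdocument as SD

    # Placeholder connection string and bucket name.
    bucket = Bucket('couchbase://localhost/mybucket')

    # Only the listed paths are sent and changed; the rest of the document is untouched.
    bucket.mutate_in(
        '003565a3-4a0d-47d9-befb-0ac642cb8057',   # document key
        SD.replace('Active', False),
        SD.replace('Barcode', '999999999'),
    )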
Using the "UPDATE" statement in N1QL is a good way to change any number of documents that match a pattern for which you can specify a "WHERE" clause. As noted above, it works very similarly to the "UPDATE" statement in SQL. To use your example above, you could change the "Active" field to false in any documents where the BuyDiscountAmount was "45.00":
UPDATE `mybucket` SET Active = false WHERE BuyDiscountAmount = "45.00"
When running N1QL UPDATE queries, almost all the network traffic will be between the Query, Index, and Data nodes of your cluster, so a N1QL update does not cause much network traffic into/out-of your cluster.
If you provide more details about your use case, and why you want to update only part of your documents, I could provide more specific advice on the right approach to take.
The sub-document API introduced in Couchbase 4.5 is currently not used by N1QL. However, you can use the UPDATE statement to update parts of one or more documents.
http://developer.couchbase.com/documentation/server/current/n1ql/n1ql-language-reference/update.html
Let me know if you have any questions.
-Prasad
It is simple, like a SQL query:
update `Employee` set District='SambalPur' where EmpId="1003"
and here is the response:
{
    "Employee": {
        "Country": "India",
        "District": "SambalPur",
        "EmpId": "1003",
        "EmpName": "shyam",
        "Location": "New-Delhi"
    }
}

Getting Sphider to output JSON

I've recently added the Sphider crawler to my site in order to add search functionality. But the default search.php that comes with the distribution of Sphider that I downloaded is too plain and doesn't integrate well with the rest of my site. I have a little navigation bar at the top of the site which has a search box in it, and I'd like to be able to access Sphider's search results through that search field using Ajax. To do this, I figure I need to get Sphider to return its results in JSON format.
The way I did that is I used a "theme" that outputs JSON (Sphider supports "theming" its output). I found that theme on this thread on Sphider's site. It seems to work, but more strict JSON parsers will not parse it. Here's some example JSON output:
{"result_report":"Displaying results 1 - 1 of 1 match (0 seconds) ", "results":[ { "idented":"false", "num":"1", "weight":"[100.00%]", "link":"http://www.avtainsys.com/articles/Triple_Contraints", "title":"Triple Contraints", "description":" on 01/06/12 Project triple constraints are time, cost, and quality. These are the three constraints that control the performance of the project. Think about this triple-constraint as a three-leg tripod. If one of the legs is elongated or", "link2":"http://www.avtainsys.com/articles/Triple_Contraints", "size":"3.3kb" }, { "num":"-1" } ], "other_pages":[ { "title":"1", "link":"search.php?query=constraints&start=1&search=1&results=10&type=and&domain=", "active":"true" }, ] }
The issue is that there is a trailing comma near the end. According to this, "trailing commas are not allowed" when using PHP's json_decode() function. This JSON also failed to parse using this online formatter. But when I took the comma out, it worked and I got this better-formatted JSON:
{
    "result_report":"Displaying results 1 - 1 of 1 match (0 seconds) ",
    "results":[
        {
            "idented":"false",
            "num":"1",
            "weight":"[100.00%]",
            "link":"http://www.avtainsys.com/articles/Triple_Contraints",
            "title":"Triple Contraints",
            "description":" on 01/06/12 Project triple constraints are time, cost, and quality. These are the three constraints that control the performance of the project. Think about this triple-constraint as a three-leg tripod. If one of the legs is elongated or",
            "link2":"http://www.avtainsys.com/articles/Triple_Contraints",
            "size":"3.3kb"
        },
        {
            "num":"-1"
        }
    ],
    "other_pages":[
        {
            "title":"1",
            "link":"search.php?query=constraints&start=1&search=1&results=10&type=and&domain=",
            "active":"true"
        }
    ]
}
Now, how would I do this programmatically? And (perhaps more importantly), is there a more elegant way of accomplishing this? And you should know that PHP is the only language I can run on my shared hosting account, so a Java solution for example would not work for me.
In search_result.html, you can wrap the comma at the end of the foreach loop in a condition so that it only prints when the index is strictly less than the number of pages minus 1.