Using Apache Nifi to collect files from 3rd party Rest APi - Flow advice - json

I am trying to create a flow within Apache-Nifi to collect files from a 3rd party RESTful APi and I have set my flow with the following:
InvokeHTTP - ExtractText - PutFile
I can collect the file that I am after, as I have specified this within my Remote URL however when I get all of the data from said file it is outputting multiple (100's) of the same files to my output directory.
3 things I need help with:
1: How do I get the flow to output the file in a readable .csv rather than just a file with no ext
2: How can I stop the processor once I have all of the data that I need
3: The Json file that I have been supplied with gives me the option to get files from a certain date range:
https://api.3rdParty.com/reports/v1/scheduledReports/877800/1553731200000
Or I can choose a specific file:
https://api.3rdParty.com/reports/v1/scheduledReports/download/877800/201904/CTDDaily/2019-04-02T01:50:00Z.csv
But how can I create a command in Nifi to automatically check for newer files, as this process will be running daily and we will be looking at downloading a new file each day.
If this is too broad, please help me by letting me know so I can edit this post.
Thanks.
Note: 3rdParty host name has been renamed to comply with security - therefore links will not directly work. Thanks.

1) You change the filename of the flow file to anything you want using the UpdateAttribute processor. If you want to make it have a ".csv" extension then you can add a property named "filename" with a value of "${filename}.csv" (without the quotes when you enter it).
2) By default most processors have a scheduling strategy of timer-driver 0 seconds, which means keep running as fast as possible. Go to the configuration of the processor on the scheduling tab and configure the appropriate schedule, it sounds like you probably want CRON scheduling to schedule it daily.
3) You can use NiFi expression language statements to create dynamic time ranges. I don't fully understand the syntax for the API that you have to communicate with, but you could do something like this for the URL:
https://api.3rdParty.com/reports/v1/scheduledReports/877800/${now()}
Where now() would return the current timestamp as an epoch.
You can also format it to a date string if necessary:
${now():format('yyyy-MM-dd')}
https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html

Related

Which is the best way of parsing CSV-data in a logic app without using a custom connector?

I have an SFTP trigger in a logic app which fires when a file is added to a certain file area. It is a CSV-formatted file and I want the rows to be parsed and coverted into json. Which is the best way to convert CSV-data into json without using any custom connectors?
I cannot find any built-in connectors doing this job. And as far as I know there are no logic apps functions doing the job either.
Right now, there is no connector/action in logic app that can provide the out of box solution for your requirement. You need to loop in through the array and perform the calculation as per your requirement but I will not suggest you leverage the loop, variables action as it may take time and cost you more.
The alternative would be leveraging the inline code (JavaScript code) to do the calculation as per your requirement. Please note that you will need Integration Account to run your inline code.
Please refer to javascript code and modified if needed according to your needs. I have used '_' for differentiating the nested objects. For more details you can refer to previous discussion here.
For complex calculation you can offload this functionality to azure function and write your code as per the supported languages and call azure function from logic app.
1.Created logic app as shown below:
2 .Created container in storage account and uploaded a CSV file in container.
3.Next using compose action to split the contents of the CSV file on every new line into an array.
a. Here is the expression used in SplitLines compose action:
split(body('Get_blob_content_(V2)'),decodeUriComponent('%0D%0A'))
b. Follow the below MS Doc to write expressions:
4. Removing last(empty) line from previous output using another compose action as shown below ,
take(outputs('SplitLines'),add(length(outputs('SplitLines')),-1))
5.Separating filed names using compose action
split(first(outputs('SplitLines')), ',')
Forming json as shown below using Select action,
**From**: **`skip(outputs('RemoveLastLine'), 1)`**
**Map:**
**`outputs('SplitFieldName')[0]`** **`split(item(), ',')?[0]`**
**`outputs('SplitFieldName')[1]`** **`split(item(), ',')?[1]`**
Tested logic app and it is running successfully. 
Content of CSV file is as shown below:
Csv data is formatted as json:
Reference:Use data operations in Power Automate (contains video) — Power Automate | Microsoft Docs
Credit: #Iason Koulas

Best data processing software to parse CSV file and make API call per row

I'm looking for ideas for an Open Source ETL or Data Processing software that can monitor a folder for CSV files, then open and parse the CSV.
For each CSV row the software will transform the CSV into a JSON format and make an API call to start a Camunda BPM process, passing the cell data as variables into the process.
Looking for ideas,
Thanks
You can use a Java WatchService or Spring FileSystemWatcher as discussed here with examples:
How to monitor folder/directory in spring?
referencing also:
https://www.baeldung.com/java-nio2-watchservice
Once you have picked up the CSV you can use my example here as inspiration or extend it: https://github.com/rob2universe/csv-process-starter specifically
https://github.com/rob2universe/csv-process-starter/blob/main/src/main/java/com/camunda/example/service/CsvConverter.java#L48
The example starts a configurable process for every row in the CSV and includes the content of the row as a JSON process data.
I wanted to limit the dependencies of this example. The CSV parsing logic applied is very simple. Commas in the file may break the example, special characters may not be handled correctly. A more robust implementation could replace the simple Java String .split(",") with an existing CSV parser library such as Open CSV
The file watcher would actually be a nice extension to the example. I may add it when I get around to it, but would also accept a pull request in case you fork my project.

Using Azure Data Factory to read only one file from blob storage and load into a DB

I'd like to read just one file from a blob storage container and load it into a copy operation into a DB, after the arrival of the file has set off a trigger.
Using Microsoft Documentation, the closest I seem to do is read all the file in order of Modified Date.
Would anyone out there know how to read one file after it has arrived in my blob storage?
EDIT:
Just to clarify, I would look to read only the latest file automatically. Without hardcoding the filename.
You can specify a single Blob in the DataSet. This value can be hard coded or variables (using Data Set Parameters):
If you need to run this process whenever a new blob is created/updated, you can use the Event Trigger:
EDIT:
Based on your addition of "only the latest", I don't have a direct solution. Normally, you could use Lookup or GetMetadata activities, but neither they nor the expression language support sorting or ordering. One option would be to use an Azure Function to determine the file to process.
However - if you think about the Event Trigger I mention above, every time it fires the file (blob) is the most recent one in the folder. If you want to coalesce this across a certain period of time, something like this might work:
Logic App 1 on event trigger: store the blob name in a log [blob, SQL, whatever works for you].
Logic App 2 OR ADF pipeline on recurrence trigger: read the log to grab the "most recent" blob name.

Totally new to Talend ESB

I'm completely brand new to Talend ESB (not so much Talend for data integration, but ESB totally.)
That being said, I'm trying to build a simple route that watches a specific file path and get the filename of any file dropped into it. Then it will pass that filename to the childjob (cTalendJob) and the child job will do something to the file.
I'm able to watch the directory, procure the filename itself and System.out.println the filename. but I can't seem to 'pass' it down to the child job. When it runs, the route goes into an endless loop.
Any help is GREATLY appreciated.
You must add a context parameter to your Talend job, and then pass the filename from the route to the job by assigning it to the parameter.
In my example I added a parameter named "Param" to my job. In the Context Param view of cTalendJob, click the + button and select it from the list of available parameters, and assign a value to it.
You can then do context.Param in your child job to use the filename.
I think you are making this more difficult than you need...
I don't think you need your cProcessor or cSetBody steps.
In your tRouteInput if you want the filename, then map "${header.CamelFileName}" to a field in your schema, and you will get the filename. Mapping "${in.body}" would give you the file contents, but if you don't need that you can just map the required heading. If your job would read the file as a whole, you could skip that step and just map the message body.
Also, check the default behaviour of the camel file component - it is intended to put the contents of the file into a message, moving the file to a .camel subdirectory once complete. If your job writes to the directory cFile is monitoring, it will keep running indefinitely, as it keeps finding a "new" file - you would want to write any updated files to a different directory, or a filename mask that isn't monitored by the cFile component.

SSIS - "switchable" file output for debug?

In an SSIS data-flow task, I'm using a Multicast transform at a key part of the flow which I want to hang a File Output destination off.
This, in itself, is no problem to do. However I only want output in the file if I enable it; i.e., I'd be using it for debugging the data if the flow fails unexpectedly and it's not immediately obvious from the default log message output why this occured.
My initial thought was to create a File Output whose output file was obtained from a variable, and by default, the variable would contain 'nul' - i.e., the Windows bit-bucket - which I could override through configuration in the event of needing to dig further.
Unfortuantly this isn't working: the File Output complains saying that "The filename is a device or contains invalid characters". So it looks like I can't use the bit-bucket.
Is anyone aware of a way to make output "switchable"? This would make enabling debug a less risky proposition than editing the package and dropping a File Output in directly.
I suppose I could have a Conditional Split off the multi-cast which basically sends output if a variable is set to some given value, but this seems overly messy, I'll be poking other options, but if anyone has any suggestions/solutions, they'd be welcome.
I'd go for the conditional split, redirecting rows to the konesans trash destination adaptor if your variable wasn't set, otherwise send to your file.