Using Azure Data Factory to read only one CSV file from blob storage and load it into a DB

I'd like to read just one file from a blob storage container and load it into a DB with a copy operation, after the arrival of the file has set off a trigger.
Using the Microsoft documentation, the closest I seem to get is reading all the files in order of modified date.
Would anyone out there know how to read a single file after it has arrived in my blob storage?
EDIT:
Just to clarify, I'm looking to read only the latest file automatically, without hardcoding the filename.

You can specify a single blob in the dataset. This value can be hard-coded or supplied as a variable (using dataset parameters).
If you need to run this process whenever a new blob is created/updated, you can use the Event Trigger.
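If you go the Event Trigger route, the storage event trigger exposes the triggering blob's folder path and file name, which you can map to pipeline and dataset parameters so the pipeline reads only the file that just arrived. The expressions below go in the trigger's parameter mapping (the parameter names you map them to are up to you):
@triggerBody().folderPath
@triggerBody().fileName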
EDIT:
Based on your addition of "only the latest", I don't have a direct solution. Normally, you could use Lookup or GetMetadata activities, but neither they nor the expression language support sorting or ordering. One option would be to use an Azure Function to determine the file to process.
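If you do go the Azure Function route, the core logic is just "list the blobs and keep the newest". Here is a minimal Python sketch, assuming the azure-storage-blob SDK and a placeholder container name and connection string, that you could wrap in a function called from ADF:

import os
from azure.storage.blob import ContainerClient

def get_latest_blob_name(container_name="incoming"):
    # Placeholder connection string pulled from an app setting.
    conn_str = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
    container = ContainerClient.from_connection_string(conn_str, container_name)

    latest = None
    for blob in container.list_blobs():
        # Keep whichever blob has the newest last_modified timestamp.
        if latest is None or blob.last_modified > latest.last_modified:
            latest = blob
    return latest.name if latest else None

The returned name could then be passed to a parameterized dataset used by the Copy activity.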
However - if you think about the Event Trigger I mention above, every time it fires the file (blob) is the most recent one in the folder. If you want to coalesce this across a certain period of time, something like this might work:
Logic App 1 on event trigger: store the blob name in a log [blob, SQL, whatever works for you].
Logic App 2 OR ADF pipeline on recurrence trigger: read the log to grab the "most recent" blob name.

Related

How to take "access token" value inside an output json file and pass the same "access token" to another REST GET request in Azure Data Factory?

I have got an access token and expiry time as two fields in a JSON file, produced by a POST request and stored in Blob storage.
Now I need to look inside the JSON file that I stored before, take the value of the access token, and use it as a parameter in another REST request.
Please help...
Depending on your scenario, there are a couple of ways you can do this. I assume that you need the access token for a completely different pipeline, since you are storing the get-access-token output to a file in Blob storage.
So in order to reference the values within the JSON blob file, you can just use a Lookup activity in Azure Data Factory. Within this Lookup activity you will use a dataset for a JSON file referencing a linked service connection to your Azure Blob Storage.
Here is an illustration with a JSON file in my Blob container:
The screenshot above shows a Lookup activity using the JSON file dataset on a Blob Storage linked service to get the contents of the file. It then saves the contents of the file's output to variables, one for the access token and another for the expiration time. You don't have to save them to variables; you can instead call the output of the activity directly in the subsequent Web activity. Here are the details of the outputs and settings:
Hopefully this helps, and let me know if you need clarification on anything.
EDIT:
I forgot to mention: if you get the access token with a Web activity and then just need to use it again in another Web activity in the same pipeline, you can simply take the access token value from the first Web activity's output and reference that output in the next Web activity, just like I showed with the Lookup activity, except using the response of the first Web activity that retrieves the access token. Apologies if that's hard to follow, so here is an illustration of what I mean:
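For example, if the first Web activity were named GetToken and its JSON response had an access_token property (both names are just illustrative), the second Web activity's Authorization header could be built as:
@{concat('Bearer ', activity('GetToken').output.access_token)}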
A simple way to read JSON files into a pipeline is to use the Lookup activity.
Here is a test JSON file loaded into Blob Storage in a container named json:
Create a JSON Dataset that just points to the container; you won't need to configure or parameterize the Folder or file name values, although you certainly can if that suits your purpose:
Use a Lookup activity that references the Dataset. Populate the Wildcard folder and file name fields. [This example leaves the "Wildcard folder path" blank, because the file is in the container root.] To read a single JSON file, leave "First row only" checked.
This will load the file contents into the Lookup Activity's output:
The Lookup activity's output.firstRow property will become your JSON root. Process the object as you would any other JSON structure:
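For instance, if the Lookup activity were named Lookup JSON and the file contained an access_token property (both names are illustrative here), a downstream activity could reference the value with:
@activity('Lookup JSON').output.firstRow.access_token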
NOTE: the Lookup activity has a limit of 5,000 rows and 4MB.

Using Apache NiFi to collect files from a 3rd-party REST API - Flow advice

I am trying to create a flow within Apache NiFi to collect files from a 3rd-party REST API, and I have set up my flow with the following:
InvokeHTTP - ExtractText - PutFile
I can collect the file that I am after, as I have specified it within my Remote URL; however, when I get all of the data from said file, it outputs multiple (hundreds of) copies of the same file to my output directory.
3 things I need help with:
1: How do I get the flow to output the file as a readable .csv rather than just a file with no extension?
2: How can I stop the processor once I have all of the data that I need?
3: The JSON file that I have been supplied with gives me the option to get files from a certain date range:
https://api.3rdParty.com/reports/v1/scheduledReports/877800/1553731200000
Or I can choose a specific file:
https://api.3rdParty.com/reports/v1/scheduledReports/download/877800/201904/CTDDaily/2019-04-02T01:50:00Z.csv
But how can I create a command in NiFi to automatically check for newer files, as this process will be running daily and we will be looking at downloading a new file each day?
If this is too broad, please help me by letting me know so I can edit this post.
Thanks.
Note: 3rdParty host name has been renamed to comply with security - therefore links will not directly work. Thanks.
1) You can change the filename of the flow file to anything you want using the UpdateAttribute processor. If you want to give it a ".csv" extension, add a property named "filename" with a value of "${filename}.csv" (without the quotes when you enter it).
2) By default most processors have a scheduling strategy of timer-driven with a run schedule of 0 seconds, which means keep running as fast as possible. Go to the processor's configuration, and on the Scheduling tab configure an appropriate schedule; it sounds like you probably want CRON scheduling to run it daily.
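For example, a Quartz-style cron expression (fields: seconds, minutes, hours, day-of-month, month, day-of-week) like the one below would fire once a day at 6 AM; the time of day is just an example to adjust for your feed:
0 0 6 * * ?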
3) You can use NiFi expression language statements to create dynamic time ranges. I don't fully understand the syntax for the API that you have to communicate with, but you could do something like this for the URL:
https://api.3rdParty.com/reports/v1/scheduledReports/877800/${now()}
Where now() would return the current timestamp as an epoch.
You can also format it to a date string if necessary:
${now():format('yyyy-MM-dd')}
https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html
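If the endpoint expects yesterday's date rather than today's, one option (assuming a yyyy-MM-dd value is what the API wants and that subtracting a fixed day's worth of milliseconds is close enough for your schedule) is to convert the date to milliseconds, subtract a day, and then format it:
${now():toNumber():minus(86400000):format('yyyy-MM-dd')}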

Altering source file not updating DB table SSIS

I am working on a Microsoft Integration Services project, and I have a flat source file (Product.txt) that contains some data that I save into a SQL Server DB when I run the project.
The data is saved successfully, but when I change some values in my source Product.txt and re-run the project, the data in the SQL Server database is not updated.
Is there any thing that must be done to enable the update? Thank you.
There are several things you can do, but you haven't provided enough info. I am guessing here based on the phrase "changed file"; to me that means an update.
That generally means that in your data flow you should start with the source, then use a Lookup against your destination to see whether your "key" already exists. Set the Lookup to redirect no-match rows.
Map the no-match output to your inserts, and map the matches to an UPDATE SQL statement.

Temporarily disable MS Access data macros

I have several Access files with data from a group of users that I'm importing into one master file. The tables in the user files are each configured with a Before Change data macro that adds a timestamp each time the user edits the data.
("Data macros" are similar to triggers in SQL Server. They are different from UI macros. For more info, see this page.)
I'd like to import these timestamps into the master file, but since the master file is a clone of the user files, it also contains the same set of data macros. Thus, when I import the data, the timestamps get changed to the time of the import, which is unhelpful.
The only way I can find to edit data macros is by opening each table in Design View and then using the Ribbon to change the settings. There must be an easier way.
I'm using VBA code to perform the merge, and I'm wondering if I can also use it to temporarily disable the data macro feature until the merge has been completed. If there is another way to turn the data macros off for all files/tables at once, even on the users' files/tables, I'd be open to that too.
Disable the code? No. Bypass the code? Yes.
Use a table/field as a flag. Set the status before importing, check the status of this flag in your data macro code, and decide whether you want to skip the rest of the code, e.g.:
If [tblSkipFlag].[SkipFlag] = False Then
    {rest of data macro actions}
End If
Another answer here explains how you can use the (almost-)undocumented SaveAsText and LoadFromText methods with the acTableDataMacro argument to save and retrieve the Data Macros to a text file in XML format. If you were to save the Data Macro XML text for each table, replace ...
<DataMacro Event="BeforeChange"><Statements>
... with ...
<DataMacro Event="BeforeChange"><Statements><Action Name="StopMacro"/>
... and then write the updated macros back to the table, that would presumably have the effect of "short-circuiting" those macros.

Google Drive - Changes:list API - Detect changes at folder level

I'm testing out the Google Drive 'Changes' API and have some questions.
Expectation:
I have a folder tree structure under 'My Drive' with files in it. I would like to call the Changes:list API to detect whether items have been added/edited/deleted within a specific folder id.
APIs Explorer tried: https://developers.google.com/drive/v2/reference/changes/list
Questions:
I don't see any option to pass a specific folder id to this API for getting the 'largestChangeId'. It looks like this API doesn't support the 'q' parameter. Is that possible?
As an alternative solution, I thought of storing the 'modifiedDate' attribute for that folder and using it for comparison next time. But it doesn't get updated when items are updated within that folder. Should it not work like in Windows, where a folder's modified date gets updated when its contents change?
Would like to hear if it's possible to detect changes at folder level.
Thanks
[EDIT]
Solution that worked:
Use Files:list to list all files.
selected fields: items/id,items/modifiedDate,items/parents/id,items/title
Get the starting folder id (e.g., for 'MyRootFolder', using a title search).
Traverse through the sub-tree structure (linking parents:id and file id) using a recursive array search, and get the max modifiedDate and the total file count.
Store the max modifiedDate and file count in app storage.
For subsequent calls, compare the new max modifiedDate with the stored one, and also compare the total file count with the stored one. If either one doesn't match, the contents of 'MyRootFolder' have been updated (see the sketch below).
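A rough Python sketch of those steps, using google-api-python-client against Drive API v2 (the folder title is a placeholder, and the single-page listing without pagination is a simplifying assumption):

def get_folder_state(service, root_title="MyRootFolder"):
    # 1. List all files, selecting only the fields needed (pagination omitted for brevity).
    items = service.files().list(
        fields="items/id,items/modifiedDate,items/parents/id,items/title"
    ).execute().get("items", [])

    # 2. Find the starting folder id by its title.
    root_id = next(i["id"] for i in items if i["title"] == root_title)

    # 3. Walk the sub-tree by following parents/id links.
    def children_of(folder_id):
        return [i for i in items
                if any(p["id"] == folder_id for p in i.get("parents", []))]

    max_modified, count, stack = "", 0, [root_id]
    while stack:
        for child in children_of(stack.pop()):
            count += 1
            max_modified = max(max_modified, child["modifiedDate"])
            stack.append(child["id"])

    # 4. Compare these two values against the ones stored from the previous run.
    return max_modified, count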
This is not currently possible directly with the API -- sorry!
I think the best current solution would be to use the regular changes feed and filter results in your code to ones with the correct parents set.
The drive.changes.list Google Drive API now allows users to pass the "startChangeId" parameter.
I'm using the value I get for "largestChangeId" from the previous API result, so I can incrementally build up the changes made by the user, avoiding the need to rebuild the entire tree.
However, I'm surprised that they don't support the "includeDeleted" parameter together with the "startChangeId" parameter.
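For reference, a minimal Python sketch of that incremental poll with google-api-python-client against Drive API v2 (the stored id and the change handling are placeholders):

def poll_changes(service, stored_largest_change_id):
    # Ask only for changes after the id saved on the previous run
    # (startChangeId is inclusive, hence the +1).
    start_id = int(stored_largest_change_id) + 1
    response = service.changes().list(startChangeId=start_id).execute()
    for change in response.get("items", []):
        print(change["fileId"], change.get("deleted", False))  # placeholder handling
    # Persist this value for the next poll.
    return response["largestChangeId"]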