Split CSV file in Azure Data Factory based on additional headers in file

I currently receive csv files in the following format:
Col1, Col2, Col3, Col4
Header1, , ,
Val1, Val2, Val3, Val4
Val1, Val2, Val3, Val4
Val1, Val2, Val3, Val4
Header2, , ,
Val1, Val2, Val3, Val4
Header3, , ,
Val1, Val2, Val3, Val4
Val1, Val2, Val3, Val4
The number of rows per header can vary and the Headers can contain any words.
The expected result should be one of:
Option 1: Save headers to additional column in 1 file
File: abc/abc/complete_output
Col1, Col2, Col3, Col4, Col5
Val1, Val2, Val3, Val4, Header1
Val1, Val2, Val3, Val4, Header1
Val1, Val2, Val3, Val4, Header1
Val1, Val2, Val3, Val4, Header2
Val1, Val2, Val3, Val4, Header3
Val1, Val2, Val3, Val4, Header3
Option 2: Create a different file per header
File1: abc/abc/Header1
Col1, Col2, Col3, Col4
Val1, Val2, Val3, Val4
Val1, Val2, Val3, Val4
Val1, Val2, Val3, Val4
File2: abc/abc/Header2
Col1, Col2, Col3, Col4
Val1, Val2, Val3, Val4
File3: abc/abc/Header3
Col1, Col2, Col3, Col4
Val1, Val2, Val3, Val4
Val1, Val2, Val3, Val4
The received file should either be split into different files, or the header rows should be mapped to an additional column. Can this be done in Azure Data Factory, including Data Flow options? There is no access to a Databricks cluster.
P.S. I know this would be easy with a Python script, but I hope to be able to build the complete flow in ADF.
I tried splitting the file with a conditional split. However, this does not work, as it only allows selecting rows; it could only be used if (one of) the row values gave an indication of the header.
Nothing else seems usable to me.
Edit: added desired output options as asked

You can achieve this in Data Factory using variables, loops, conditionals, and the Copy data activity.
First, read the source file with a Lookup activity, without selecting the First row as header option and with row and column delimiters that do not occur in the data, so that the entire file content lands in a single column (Prop_0) of a single row. I used ; as the column delimiter and | as the row delimiter.
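For reference, a sketch of the matching DelimitedText dataset settings for the Lookup (standard dataset typeProperties; the delimiters just need to be characters that never occur in your data):
"typeProperties": {
    "columnDelimiter": ";",
    "rowDelimiter": "|",
    "firstRowAsHeader": false
}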
Now, I have used multiple Set variable activities. The header activity extracts the header row (Col1,Col2,Col3,Col4) from the lookup output:
@first(array(split(activity('file as text').output.value[0]['Prop_0'],decodeUriComponent('%0A'))))
The each file Set variable activity stores the accumulated data for the current output file; I initialized it with the header variable's value.
The get first filename activity extracts the name of the first file (Header1 in this case) using collection and string functions:
@replace(first(skip(array(split(activity('file as text').output.value[0]['Prop_0'],decodeUriComponent('%0A'))),1)),',,,','')
Now take the rest of the data (after the Header1 line), use it as the items value of a ForEach loop, and do the further processing there:
@skip(array(split(activity('file as text').output.value[0]['Prop_0'],decodeUriComponent('%0A'))),2)
Inside the ForEach, an If Condition activity checks whether each line is a header (to be used as the filename) or a data row; data rows are appended to the each file variable, and when a header row is found, a Copy data activity writes out the file accumulated so far.
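The header check and the row-accumulation expressions (verbatim from the pipeline JSON below) are:
@contains(item(),',,,')
@concat(variables('each file'),decodeUriComponent('%0A'),item())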
The following is the entire pipeline JSON (you can use it directly, except that you have to create your own datasets).
{
"name": "pipeline1",
"properties": {
"activities": [
{
"name": "file as text",
"type": "Lookup",
"dependsOn": [],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"storeSettings": {
"type": "AzureBlobFSReadSettings",
"recursive": true,
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
},
"dataset": {
"referenceName": "csv1",
"type": "DatasetReference"
},
"firstRowOnly": false
}
},
{
"name": "header",
"type": "SetVariable",
"dependsOn": [
{
"activity": "file as text",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"variableName": "header",
"value": {
"value": "#first(array(split(activity('file as text').output.value[0]['Prop_0'],decodeUriComponent('%0A'))))",
"type": "Expression"
}
}
},
{
"name": "each file",
"type": "SetVariable",
"dependsOn": [
{
"activity": "header",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"variableName": "each file",
"value": {
"value": "#variables('header')",
"type": "Expression"
}
}
},
{
"name": "get first filename",
"type": "SetVariable",
"dependsOn": [
{
"activity": "each file",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"variableName": "filename",
"value": {
"value": "#replace(first(skip(array(split(activity('file as text').output.value[0]['Prop_0'],decodeUriComponent('%0A'))),1)),',,,','')",
"type": "Expression"
}
}
},
{
"name": "ForEach1",
"type": "ForEach",
"dependsOn": [
{
"activity": "get first filename",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"items": {
"value": "#skip(array(split(activity('file as text').output.value[0]['Prop_0'],decodeUriComponent('%0A'))),2)",
"type": "Expression"
},
"isSequential": true,
"activities": [
{
"name": "If Condition1",
"type": "IfCondition",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"expression": {
"value": "#contains(item(),',,,')",
"type": "Expression"
},
"ifFalseActivities": [
{
"name": "each row",
"type": "SetVariable",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"variableName": "each row",
"value": {
"value": "#concat(variables('each file'),decodeUriComponent('%0A'),item())",
"type": "Expression"
}
}
},
{
"name": "complete data",
"type": "SetVariable",
"dependsOn": [
{
"activity": "each row",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"variableName": "each file",
"value": {
"value": "#variables('each row')",
"type": "Expression"
}
}
}
],
"ifTrueActivities": [
{
"name": "create each file",
"type": "Copy",
"dependsOn": [],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"additionalColumns": [
{
"name": "req",
"value": {
"value": "#variables('each file')",
"type": "Expression"
}
}
],
"storeSettings": {
"type": "AzureBlobFSReadSettings",
"recursive": true,
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
},
"sink": {
"type": "DelimitedTextSink",
"storeSettings": {
"type": "AzureBlobFSWriteSettings"
},
"formatSettings": {
"type": "DelimitedTextWriteSettings",
"quoteAllText": true,
"fileExtension": ".txt"
}
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": {
"name": "req",
"type": "String"
},
"sink": {
"type": "String",
"physicalType": "String",
"ordinal": 1
}
}
],
"typeConversion": true,
"typeConversionSettings": {
"allowDataTruncation": true,
"treatBooleanAsNumber": false
}
}
},
"inputs": [
{
"referenceName": "demo",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "op_files",
"type": "DatasetReference",
"parameters": {
"fileName": {
"value": "#variables('filename')",
"type": "Expression"
}
}
}
]
},
{
"name": "change filename",
"type": "SetVariable",
"dependsOn": [
{
"activity": "create each file",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"variableName": "filename",
"value": {
"value": "#replace(item(),',,,','')",
"type": "Expression"
}
}
},
{
"name": "re initialise each file value",
"type": "SetVariable",
"dependsOn": [
{
"activity": "change filename",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"variableName": "each file",
"value": {
"value": "#variables('header')",
"type": "Expression"
}
}
}
]
}
}
]
}
},
{
"name": "for last file within csv",
"type": "Copy",
"dependsOn": [
{
"activity": "ForEach1",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"additionalColumns": [
{
"name": "req",
"value": {
"value": "#variables('each file')",
"type": "Expression"
}
}
],
"storeSettings": {
"type": "AzureBlobFSReadSettings",
"recursive": true,
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
},
"sink": {
"type": "DelimitedTextSink",
"storeSettings": {
"type": "AzureBlobFSWriteSettings"
},
"formatSettings": {
"type": "DelimitedTextWriteSettings",
"quoteAllText": true,
"fileExtension": ".txt"
}
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": {
"name": "req",
"type": "String"
},
"sink": {
"type": "String",
"physicalType": "String",
"ordinal": 1
}
}
],
"typeConversion": true,
"typeConversionSettings": {
"allowDataTruncation": true,
"treatBooleanAsNumber": false
}
}
},
"inputs": [
{
"referenceName": "demo",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "op_files",
"type": "DatasetReference",
"parameters": {
"fileName": {
"value": "#variables('filename')",
"type": "Expression"
}
}
}
]
}
],
"variables": {
"header": {
"type": "String"
},
"each file": {
"type": "String"
},
"filename": {
"type": "String"
},
"each row": {
"type": "String"
}
},
"annotations": []
}
}
In the two Copy data activities, the actual file content is injected through the source's additionalColumns property (the req column carries the each file variable), and the translator maps only that column to the sink, so the demo source dataset just needs to point at a small dummy file.
The sink dataset (op_files) is parameterized on fileName; the same source and sink datasets are used in both Copy data activities.
For the given sample data, this produces one output file per header, each containing the column header row followed by its data rows.
The final Copy activity after the ForEach covers a file that ends with data rows rather than a header: the loop only writes a file when it reaches the next header row, so without this activity the data accumulated after the last header would never be written.

If the input data is static and the second option is the requirement, you can go with the following Data Flow approach (a data flow script sketch follows the steps):
Add a Filter transformation after the source with the expression !startsWith(Col1, 'Header')
Add a Surrogate key transformation to create an incremental identity column (Id)
Add a Conditional split transformation to split the data into three parts with these expressions:
stream1: Id>=1 && Id<=3
stream2: Id==4
stream3: Default
Use a Select transformation to deselect the Id column
Add a Sink transformation to load the data to a csv file
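For reference, a data flow script sketch of those transformations (a sketch only: transformation and stream names are illustrative, and the source projection is assumed from the sample data):
source(output(
        Col1 as string,
        Col2 as string,
        Col3 as string,
        Col4 as string
    ),
    allowSchemaDrift: true,
    validateSchema: false) ~> source1
source1 filter(!startsWith(Col1, 'Header')) ~> Filter1
Filter1 keyGenerate(output(Id as long),
    startAt: 1L) ~> SurrogateKey1
SurrogateKey1 split(Id >= 1 && Id <= 3,
    Id == 4,
    disjoint: false) ~> ConditionalSplit1@(stream1, stream2, stream3)
stream1 select(mapColumn(Col1, Col2, Col3, Col4),
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> Select1
Select1 sink(allowSchemaDrift: true,
    validateSchema: false) ~> sink1
stream2 and stream3 would each get the same Select-plus-Sink pair, writing to their own csv files.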


Azure data factory appending to Json

Wanted to pick your brains on something.
So, in Azure Data Factory, I am running a set of activities which at the end of the run produce a JSON segment:
{"name":"myName", "email":"email@somewhere.com", .. <more elements> }
This set of activities occurs in a loop - Loop Until activity.
My goal is to have a final JSON object like this:
"profiles":[
{"name":"myName", "email":"email#somewhere.com", .. <more elements> },
{"name":"myName", "email":"email#somewhere.com", .. <more elements> },
{"name":"myName", "email":"email#somewhere.com", .. <more elements> },
...
{"name":"myName", "email":"email#somewhere.com", .. <more elements> }
]
That is a concatenation of all the individual ones.
To put it in perspective, each individual item is paged data from a REST API, and all of them together constitute the final response. I have no control over how many there are.
I understand how to concatenate individual items using 2 variables
jsonTemp = @concat(finalJson, individualResponse)
finalJson = jsonTemp
But I do not know how to bring it all under the single "profiles" root afterwards.
This is a bit of a hacky way of doing it, and I'm happy to hear a better solution.
I'm assuming you have stored all your results in an array variable (let's call this A).
The first step is to find the number of elements in this array. You can do this using the length(..) function.
Then you go into a loop, iterating a counter variable and concatenating each of the elements of the array, making sure you add a ',' between elements. You have to make sure you do not add the ',' after the last element (you will need an If condition to check whether your counter has reached the length of the array). At the end of this you should have one string variable like this:
{"name":"myName","email":"email#somewhere.com"},{"name":"myName","email":"email#somewhere.com"},{"name":"myName","email":"email#somewhere.com"},{"name":"myName","email":"email#somewhere.com"}
Now all you need is to use this expression when you are pushing the response anywhere:
@json(concat('{"profiles":[',<your_string_variable_here>,']}'))
I agree with @Anupam Chand and I am following the same process with a different second step.
You mentioned that your object data comes from API pages, and that to end the Until loop you have to give a condition on the number of pages (number of iterations).
This is my pipeline:
First I initialized a counter and used it in each web page URL and in the Until condition to stop after a certain number of pages. As ADF does not support self-referencing variables, I used another temporary variable to increment the counter, as shown below.
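The two expressions (taken from the pipeline JSON below) are:
tempCounter: @variables('counter')
counter:     @string(add(int(variables('tempCounter')),1))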
To store the objects from each iteration of the web activity, I created an array variable in the pipeline.
Inside the Until loop, use an Append variable activity to append each object to the array.
For my sample web activity the dynamic content is @activity('Data from REST').output.data[0]; for you it will be something like @activity('Web1').output. Change it as per your requirement.
This is my Pipeline JSON:
{
"name": "pipeline3",
"properties": {
"activities": [
{
"name": "Until1",
"type": "Until",
"dependsOn": [
{
"activity": "Counter intialization",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"expression": {
"value": "#equals('3', variables('counter'))",
"type": "Expression"
},
"activities": [
{
"name": "Data from REST",
"type": "WebActivity",
"dependsOn": [
{
"activity": "counter in temp variable",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"url": {
"value": "https://reqres.in/api/users?page=#{variables('counter')}",
"type": "Expression"
},
"method": "GET"
}
},
{
"name": "counter in temp variable",
"type": "SetVariable",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"variableName": "tempCounter",
"value": {
"value": "#variables('counter')",
"type": "Expression"
}
}
},
{
"name": "Counter increment using temp",
"type": "SetVariable",
"dependsOn": [
{
"activity": "Data from REST",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"variableName": "counter",
"value": {
"value": "#string(add(int(variables('tempCounter')),1))",
"type": "Expression"
}
}
},
{
"name": "Append web output to array",
"type": "AppendVariable",
"dependsOn": [
{
"activity": "Counter increment using temp",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"variableName": "arr",
"value": {
"value": "#activity('Data from REST').output.data[0]",
"type": "Expression"
}
}
}
],
"timeout": "0.12:00:00"
}
},
{
"name": "Counter intialization",
"type": "SetVariable",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"variableName": "counter",
"value": {
"value": "#string('1')",
"type": "Expression"
}
}
},
{
"name": "To show res array",
"type": "SetVariable",
"dependsOn": [
{
"activity": "Until1",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"variableName": "res_show",
"value": {
"value": "#variables('arr')",
"type": "Expression"
}
}
}
],
"variables": {
"arr": {
"type": "Array"
},
"counter": {
"type": "String"
},
"tempCounter": {
"type": "String"
},
"res_show": {
"type": "Array"
},
"arr_string": {
"type": "String"
}
},
"annotations": []
}
}
The result is collected in the array variable (res_show in this pipeline).
You can access this array by the variable name. If you want the output shaped like yours, you can use the following dynamic content:
@json(concat('{"profiles":',string(variables('res_show')),'}'))
However, if you want to store this in a variable, you have to wrap it in @string(), as ADF variables currently support only the String, Boolean, and Array types.
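For example, to keep the result in the arr_string variable that this pipeline declares but never uses (a sketch; any String variable would do):
arr_string: @string(json(concat('{"profiles":',string(variables('res_show')),'}')))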

How to Parse Json Object Element that contains Dynamic Attribute

We have a Json object with the following structure.
{
"results": {
"timesheets": {
"135288482": {
"id": 135288482,
"user_id": 1242515,
"jobcode_id": 17288283,
"customfields": {
"19142": "Item 1",
"19144": "Item 2"
},
"attached_files": [
50692,
44878
],
"last_modified": "1970-01-01T00:00:00+00:00"
},
"135288514": {
"id": 135288514,
"user_id": 1242509,
"jobcode_id": 18080900,
"customfields": {
"19142": "Item 1",
"19144": "Item 2"
},
"attached_files": [
50692,
44878
],
"last_modified": "1970-01-01T00:00:00+00:00"
}
}
}
}
We need to access the elements that are inside results --> timesheets --> dynamic id.
Example:
{
"id": 135288482,
"user_id": 1242515,
"jobcode_id": 17288283,
"customfields": {
"19142": "Item 1",
"19144": "Item 2"
},
"attached_files": [
50692,
44878
],
"last_modified": "1970-01-01T00:00:00+00:00"
}
The problem is that "135288482": { is dynamic. How do we access what is inside of it?
We are trying to create a data flow to parse the data. The data is dynamic, so accessing it via attribute name is not possible.
AFAIK, as per your JSON structure and dynamic keys, it might not be possible to get the desired result using Data Flow.
I have reproduced the above and was able to get it done using Set variable and ForEach activities as below.
In your JSON, the keys and the id values are the same, so I used that to get the list of keys first. Then, using that list of keys, I can access each inner JSON object.
The variables used in the pipeline are listed in the pipeline JSON below.
First, I created a String variable and stored the text "id": in it (the for ids activity).
Then I used a Lookup activity to get the above JSON file from blob storage.
I stored the timesheets object from the lookup output as a string in a variable using the following dynamic content:
@string(activity('Lookup1').output.value[0].results.timesheets)
I split that string on the "id": text and stored the resulting array in an array variable:
@split(variables('jsonasstring'), variables('ids'))
This gives an array in which each element after the first begins with the 9-character id key, followed by the rest of that object's text.
Now I ran a ForEach over this array, skipping the first element. In each iteration I appended the first 9 characters of the string to an array variable; that substring is the key. (Note that this assumes every id is exactly 9 characters long.)
If your id values and keys are not the same, you can instead skip the last element of the split array and take the keys from the end of each string, as per your requirement.
This is the dynamic content of the Append variable activity inside the ForEach: @take(item(), 9)
Then I took another ForEach and gave it this list of keys. Inside it, you can access each inner JSON object with the following dynamic content:
@string(activity('Lookup1').output.value[0].results.timesheets[item()])
This is my pipeline JSON:
{
"name": "pipeline1",
"properties": {
"activities": [
{
"name": "Lookup1",
"type": "Lookup",
"dependsOn": [
{
"activity": "for ids",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "JsonSource",
"storeSettings": {
"type": "AzureBlobFSReadSettings",
"recursive": true,
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "JsonReadSettings"
}
},
"dataset": {
"referenceName": "Json1",
"type": "DatasetReference"
},
"firstRowOnly": false
}
},
{
"name": "JSON as STRING",
"type": "SetVariable",
"dependsOn": [
{
"activity": "Lookup1",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"variableName": "jsonasstring",
"value": {
"value": "#string(activity('Lookup1').output.value[0].results.timesheets)",
"type": "Expression"
}
}
},
{
"name": "for ids",
"type": "SetVariable",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"variableName": "ids",
"value": {
"value": "#string('\"id\":')",
"type": "Expression"
}
}
},
{
"name": "after split",
"type": "SetVariable",
"dependsOn": [
{
"activity": "JSON as STRING",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"variableName": "split_array",
"value": {
"value": "#split(variables('jsonasstring'), variables('ids'))",
"type": "Expression"
}
}
},
{
"name": "ForEach to append keys to array",
"type": "ForEach",
"dependsOn": [
{
"activity": "after split",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"items": {
"value": "#skip(variables('split_array'),1)",
"type": "Expression"
},
"isSequential": true,
"activities": [
{
"name": "Append variable1",
"type": "AppendVariable",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"variableName": "key_ids_array",
"value": {
"value": "#take(item(), 9)",
"type": "Expression"
}
}
}
]
}
},
{
"name": "ForEach to access inner object",
"type": "ForEach",
"dependsOn": [
{
"activity": "ForEach to append keys to array",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"items": {
"value": "#variables('key_ids_array')",
"type": "Expression"
},
"isSequential": true,
"activities": [
{
"name": "Each object",
"type": "SetVariable",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"variableName": "show_res",
"value": {
"value": "#string(activity('Lookup1').output.value[0].results.timesheets[item()])",
"type": "Expression"
}
}
}
]
}
}
],
"variables": {
"jsonasstring": {
"type": "String"
},
"ids": {
"type": "String"
},
"split_array": {
"type": "Array"
},
"key_ids_array": {
"type": "Array"
},
"show_res": {
"type": "String"
}
},
"annotations": []
}
}
Each iteration sets the show_res variable to one inner object as a string; use @json() in the dynamic content (for example, @json(variables('show_res'))) to convert that string to an object.
If you want to store these JSON objects in a file, use a ForEach, and inside it use a SQL script that inserts each object as a row into a table in each iteration (using JSON_VALUE). Then, outside the ForEach, use a Copy activity to copy that SQL table to your destination as per your requirement.

How do I add time-key to properties in geoJSON during SELECT from Postgres

I have a PHP script to SELECT data from Postgres in geoJSON format. That works fine. This is the SQL code:
SELECT jsonb_build_object(
'type', 'FeatureCollection',
'features', json_agg(features.feature)
)
FROM (SELECT jsonb_build_object(
'type', 'Feature',
'id', id,
'geometry', st_AsGeojson(st_SetSrid(st_MakePoint(split_part(to_jsonb(row)->'data'->'location'->>'value', ',', 2)::double precision, split_part(to_jsonb(row)->'data'->'location'->>'value', ',', 1)::double precision), 4326))::json,
'properties', to_jsonb(row) - 'id'
) AS feature
FROM (SELECT * FROM cs_fiets_json WHERE last_updated > '$sel_f') row) features;
I need to include a time key and value in the properties (to be able to use TimeDimension in Leaflet), and I am stuck on that.
The time value I need is somewhere deep in the row.
I have played with jsonb_set and jsonb_insert, but don't know if, how, and where to use them.
So I need "time": "2022-03-02T14:32:37.00Z" directly under "properties".
Thanks in advance!
{
"id": 1581083302,
"type": "Feature",
"geometry": {
"type": "Point",
"coordinates": [
5.0545584,
52.1574455
]
},
"properties": {
"data": {
"id": "359215101322999",
"voc": {
"type": "Number",
"value": 42,
"metadata": {
"dateCreated": {
"type": "DateTime",
"value": "2021-11-25T10:49:10.00Z"
},
"dateModified": {
"type": "DateTime",
"value": "2022-03-02T14:32:37.00Z"
}
}
},
"pm10": {
"type": "Number",
"value": 14,
"metadata": {
"dateCreated": {
"type": "DateTime",
"value": "2021-11-25T10:49:10.00Z"
},
"dateModified": {
"type": "DateTime",
"value": "2022-03-02T14:32:37.00Z"
}
}
},

Azure DataFactory ForEach Copy activity is not iterating through but instead pulling all files in blob. Why?

I have a pipeline in DF2 that has to look at a folder in blob and process each of the 145 files sequentially into a database table. After each file has been loaded into the table, a stored procedure should be triggered that will check each record and either insert it, or update an existing record, into a master table.
Looking online, I feel as though I have tried every combination of "Get Metadata", "ForEach", "Lookup" and "Append Variable" activities that has been suggested, but for some reason my Copy Data STILL picks up all files at the same time and runs 145 times.
I recently found a blog online that I followed to use "Append Variable", as it will be useful for multiple file locations, but it does not work for me. I need to read the files as CSVs into tables and not as binary objects, so I think this is my issue.
{
"name": "BulkLoadPipeline",
"properties": {
"activities": [
{
"name": "GetFileNames",
"type": "GetMetadata",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"dataset": {
"referenceName": "DelimitedText1",
"type": "DatasetReference",
"parameters": {
"fileName": "#item()"
}
},
"fieldList": [
"childItems"
],
"storeSettings": {
"type": "AzureBlobStorageReadSetting"
},
"formatSettings": {
"type": "DelimitedTextReadSetting"
}
}
},
{
"name": "CopyDataRunDeltaCheck",
"type": "ForEach",
"dependsOn": [
{
"activity": "BuildList",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"items": {
"value": "#variables('fileList')",
"type": "Expression"
},
"isSequential": true,
"activities": [
{
"name": "WriteToTables",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"storeSettings": {
"type": "AzureBlobStorageReadSetting",
"wildcardFileName": "*.*"
},
"formatSettings": {
"type": "DelimitedTextReadSetting"
}
},
"sink": {
"type": "AzureSqlSink"
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": {
"name": "myID",
"type": "String"
},
"sink": {
"name": "myID",
"type": "String"
}
},
{
"source": {
"name": "Col1",
"type": "String"
},
"sink": {
"name": "Col1",
"type": "String"
}
},
{
"source": {
"name": "Col2",
"type": "String"
},
"sink": {
"name": "Col2",
"type": "String"
}
},
{
"source": {
"name": "Col3",
"type": "String"
},
"sink": {
"name": "Col3",
"type": "String"
}
},
{
"source": {
"name": "Col4",
"type": "String"
},
"sink": {
"name": "Col4",
"type": "String"
}
},
{
"source": {
"name": "DW Date Created",
"type": "String"
},
"sink": {
"name": "DW_Date_Created",
"type": "String"
}
},
{
"source": {
"name": "DW Date Updated",
"type": "String"
},
"sink": {
"name": "DW_Date_Updated",
"type": "String"
}
}
]
}
},
"inputs": [
{
"referenceName": "DelimitedText1",
"type": "DatasetReference",
"parameters": {
"fileName": "#item()"
}
}
],
"outputs": [
{
"referenceName": "myTable",
"type": "DatasetReference"
}
]
},
{
"name": "CheckDeltas",
"type": "SqlServerStoredProcedure",
"dependsOn": [
{
"activity": "WriteToTables",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"storedProcedureName": "[TL].[uspMyCheck]"
},
"linkedServiceName": {
"referenceName": "myService",
"type": "LinkedServiceReference"
}
}
]
}
},
{
"name": "BuildList",
"type": "ForEach",
"dependsOn": [
{
"activity": "GetFileNames",
"dependencyConditions": [
"Succeeded"
]
}
],
"typeProperties": {
"items": {
"value": "#activity('GetFileNames').output.childItems",
"type": "Expression"
},
"isSequential": true,
"activities": [
{
"name": "Create list from variables",
"type": "AppendVariable",
"typeProperties": {
"variableName": "fileList",
"value": "#item().name"
}
}
]
}
}
],
"variables": {
"fileList": {
"type": "Array"
}
}
}
}
The Details screen of the pipeline output shows the pipeline looping for the number of items in the blob, but each time the Copy Data and Stored Procedure run for every file in the list at once, as opposed to one at a time.
I feel like I am close to the answer but missing one vital part. Any help or suggestions are GREATLY appreciated.
Your payload is not correct.
The GetMetadata activity should not use the same dataset as the Copy activity.
The GetMetadata activity should reference a dataset pointing at a folder; that folder contains all the files you want to deal with, but your dataset has a 'fileName' parameter.
Use the output of the GetMetadata activity as the input of the ForEach activity. Also note that your Copy source sets "wildcardFileName": "*.*", which overrides the per-file fileName passed to the dataset and makes every iteration pick up all files.
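A sketch of the corrected wiring, reusing the names from the question (the folder-level dataset is hypothetical):
GetFileNames (GetMetadata): reference a dataset that points at the folder only, with no fileName parameter, and fieldList = childItems
ForEach items: @activity('GetFileNames').output.childItems
WriteToTables (Copy) source: the parameterized DelimitedText1 dataset with fileName = @item().name, and no wildcardFileName in the storeSettings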

Azure Data Factory Copy Activity

I have been working on this for a couple of days and cannot get past this error. I have 2 activities in this pipeline. The first activity copies data from an ODBC connection to an Azure database, which is successful. The 2nd activity transfers the data from one Azure table to another Azure table and keeps failing.
The error message is:
Copy activity met invalid parameters: 'UnknownParameterName', Detailed message: An item with the same key has already been added..
I do not see any invalid parameters or unknown parameter names. I have rewritten this multiple times, using their add-activity code template and by myself, but do not receive any errors when deploying or when it is running. Below is the JSON pipeline code.
Only the 2nd activity is receiving an error.
Thanks.
Source Data set
{
"name": "AnalyticsDB-SHIPUPS_06shp-01src_AZ-915PM",
"properties": {
"structure": [
{
"name": "UPSD_BOL",
"type": "String"
},
{
"name": "UPSD_ORDN",
"type": "String"
}
],
"published": false,
"type": "AzureSqlTable",
"linkedServiceName": "Source-SQLAzure",
"typeProperties": {},
"availability": {
"frequency": "Day",
"interval": 1,
"offset": "04:15:00"
},
"external": true,
"policy": {}
}
}
Destination Data set
{
"name": "AnalyticsDB-SHIPUPS_06shp-02dst_AZ-915PM",
"properties": {
"structure": [
{
"name": "SHIP_SYS_TRACK_NUM",
"type": "String"
},
{
"name": "SHIP_TRACK_NUM",
"type": "String"
}
],
"published": false,
"type": "AzureSqlTable",
"linkedServiceName": "Destination-Azure-AnalyticsDB",
"typeProperties": {
"tableName": "[olcm].[SHIP_Tracking]"
},
"availability": {
"frequency": "Day",
"interval": 1,
"offset": "04:15:00"
},
"external": false,
"policy": {}
}
}
Pipeline
{
"name": "SHIPUPS_FC_COPY-915PM",
"properties": {
"description": "copy shipments ",
"activities": [
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "RelationalSource",
"query": "$$Text.Format('SELECT COMPANY, UPSD_ORDN, UPSD_BOL FROM \"orupsd - UPS interface Dtl\" WHERE COMPANY = \\'01\\'', WindowStart, WindowEnd)"
},
"sink": {
"type": "SqlSink",
"sqlWriterCleanupScript": "$$Text.Format('delete imp_fc.SHIP_UPS_IntDtl_Tracking', WindowStart, WindowEnd)",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "COMPANY:COMPANY, UPSD_ORDN:UPSD_ORDN, UPSD_BOL:UPSD_BOL"
}
},
"inputs": [
{
"name": "AnalyticsDB-SHIPUPS_03shp-01src_FC-915PM"
}
],
"outputs": [
{
"name": "AnalyticsDB-SHIPUPS_03shp-02dst_AZ-915PM"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3,
"longRetry": 0,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Day",
"interval": 1,
"offset": "04:15:00"
},
"name": "915PM-SHIPUPS-fc-copy->[imp_fc]_[SHIP_UPS_IntDtl_Tracking]"
},
{
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "$$Text.Format('select distinct ups.UPSD_BOL, ups.UPSD_BOL from imp_fc.SHIP_UPS_IntDtl_Tracking ups LEFT JOIN olcm.SHIP_Tracking st ON ups.UPSD_BOL = st.SHIP_SYS_TRACK_NUM WHERE st.SHIP_SYS_TRACK_NUM IS NULL', WindowStart, WindowEnd)"
},
"sink": {
"type": "SqlSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
},
"translator": {
"type": "TabularTranslator",
"columnMappings": "UPSD_BOL:SHIP_SYS_TRACK_NUM, UPSD_BOL:SHIP_TRACK_NUM"
}
},
"inputs": [
{
"name": "AnalyticsDB-SHIPUPS_06shp-01src_AZ-915PM"
}
],
"outputs": [
{
"name": "AnalyticsDB-SHIPUPS_06shp-02dst_AZ-915PM"
}
],
"policy": {
"timeout": "1.00:00:00",
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"style": "StartOfInterval",
"retry": 3,
"longRetryInterval": "00:00:00"
},
"scheduler": {
"frequency": "Day",
"interval": 1,
"offset": "04:15:00"
},
"name": "915PM-SHIPUPS-AZ-update->[olcm]_[SHIP_Tracking]"
}
],
"start": "2017-08-22T03:00:00Z",
"end": "2099-12-31T08:00:00Z",
"isPaused": false,
"hubName": "adf-tm-prod-01_hub",
"pipelineMode": "Scheduled"
}
}
Have you seen this link? They get the same error message and suggest using AzureTableSink instead of SqlSink:
"sink": {
"type": "AzureTableSink",
"writeBatchSize": 0,
"writeBatchTimeout": "00:00:00"
}
It would make sense for you too, since your 2nd copy activity is Azure to Azure.
It could be a red herring, but I'm pretty sure "tableName" is a required entry in the typeProperties of an AzureSqlTable dataset. Yours is missing this for the input dataset. I appreciate you have a join in the sqlReaderQuery, so it's probably best to put a dummy (but real) table name in there.
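For example, mirroring the destination dataset's typeProperties (the table name here is only a placeholder; any real table will do):
"typeProperties": {
    "tableName": "[dbo].[AnyExistingTable]"
}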
Btw, it's not clear why you are using $$Text.Format and WindowStart/WindowEnd in your queries if you're not transposing these values into the query; you could just put the query between double quotes.
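For instance, the second activity's source could simply be (the query is taken verbatim from the question, minus the wrapper):
"sqlReaderQuery": "select distinct ups.UPSD_BOL, ups.UPSD_BOL from imp_fc.SHIP_UPS_IntDtl_Tracking ups LEFT JOIN olcm.SHIP_Tracking st ON ups.UPSD_BOL = st.SHIP_SYS_TRACK_NUM WHERE st.SHIP_SYS_TRACK_NUM IS NULL"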