Advanced mapping of JSON in Azure Data Factory - some guidance requested

I'm trying to map a JSON document (sensor data) into a more meaningful representation using Mapping Data Flows. However, I'm having a hard time getting this to work and would really appreciate some insight or recommendations on how to solve the following:
The input is
What I would like to end up with is the following:
Any pointers as to how this can be implemented are more than welcome.

This can be accomplished with a Copy activity followed by the split function in a Derived Column transformation in Azure Data Factory.
Use the Copy activity to read the JSON file as the source and use a SQL database as the sink to store the data as a table. On the Mapping tab, import the schema and map the JSON records to the corresponding column names. Refer to this third-party tutorial for guidance - https://sqlkover.com/dynamically-map-json-to-sql-in-azure-data-factory/
Finally, use the Data Flow activity and choose as its source the SQL table that you used as the sink above.
Select the Derived Column transformation.
Use the split function.
Add the column that will hold the split values.
Use the split(<column_name_to_split>, '_') function to split the column on the _ delimiter. Change <column_name_to_split> to the name of the column you want to split.
Preview the data to check the result.
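To make the split step concrete, here is a small plain-Python sketch of what splitting a column on _ produces. It is not Data Flow expression syntax, only an illustration of the result; the column name sensor_key and the sample values are hypothetical, since the original input was not included in the question.

```python
# Plain-Python analogue of a Derived Column that applies split(<column>, '_').
# The column name "sensor_key" and the sample rows are hypothetical.

def split_column(rows, column, delimiter="_"):
    """Return copies of the rows with an extra '<column>_parts' list column."""
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay untouched
        new_row[column + "_parts"] = row[column].split(delimiter)
        result.append(new_row)
    return result


rows = [
    {"sensor_key": "room1_temperature", "value": 21.5},
    {"sensor_key": "room2_humidity", "value": 40.0},
]

for row in split_column(rows, "sensor_key"):
    print(row["sensor_key_parts"], row["value"])
```

In the data flow itself, the individual parts would then be picked out of the resulting array in further derived columns.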

Related

JSON flattening in AWS Glue ETL job creates inferred schema with duplicated columns

I'm relatively new to AWS Glue and using the visual AWS Glue studio at the moment. Kind of a niche issue I'm having here...
Context:
I'm building an ETL job that, among other things, should parse/flatten JSON from a string column and replace it with separate fields in different formats, which I can then select to load into my data warehouse table.
Approach:
I first extract my data from the Glue catalog as a dynamicFrame (in this case only one table).
Then I'm trying to use the approach of unboxing and unnesting.
Let's call that json column data:
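    from awsglue.dynamicframe import DynamicFrameCollection
    from awsglue.transforms import Unbox, UnnestFrame

    def transformTable(glueContext, dfc) -> DynamicFrameCollection:
        dyf = dfc.select(list(dfc.keys())[0])
        # Parse the JSON string in the "data" column into a structured field.
        dyf = Unbox.apply(frame=dyf, path="data", format="json")
        # Flatten the nested fields into top-level columns.
        dyf = UnnestFrame.apply(frame=dyf)
        return DynamicFrameCollection({"TranformedTable": dyf}, glueContext)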
(Then I have a step to select the right frame from the frame collection, and then I can apply mapping to my fields and load.)
My issue:
Glue automatically infers the data types of my frame schema (rather successfully),
but it duplicates certain fields into several columns when the data type is unclear (similar to make_cols in the resolveChoice method). For example, I end up with two fields in the output schema, price_int and price_double, where price_int contains only the values that happened to be round numbers and null everywhere else, and so on.
So it seems like the default behavior of this method is to split columns in case of data type doubt (make_cols).
I understand that I could write a resolveChoice for each field, but with this approach they are already split in separate columns in the output schema.
Note: There are dozens of fields in this json, so I'm trying to devise a blanket solution that automatically makes all the fields of the json available in the schema to select and map in the next step, and avoid having to add one line of code for each field I want to extract. (And the json structure will grow with new fields in the future, so I'm trying to limit future ETL maintenance...)
Questions/help needed:
Any idea if there's a way to change this default behavior (like in the resolveChoice method)?
Alternatively, is there a way to apply a kind of default resolveChoice to all problematic fields from the json unboxing? For instance, I could force all problematic fields into string (similar to 'project:string'), and then reformat if needed in the applyMapping step. But resolveChoice seems to need to be applied field by field...
What's a different/better approach I could try? I would like to keep it as dynamic/automated as possible... e.g.:
I think I could maybe extract specific fields from the JSON line by line, but I'm not sure how (looks like the Unbox method is already splitting columns by format). And as explained, it's dozens of fields and growing... so it requires updating the code regularly, instead of just ticking boxes in the list of available fields.
The Relationalize method could be an option, but it creates distinct frames and this quickly becomes much more complex (there are actually several columns with JSON, which all need to be flattened...).
Creating crawlers or classifiers which are run automatically regularly for extracting the schema from that specific string column from a table should be an option as well...
Thanks in advance!
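For reference, the "force all problematic fields into string" idea from the questions above can be sketched in plain Python. This is not a Glue API: it assumes the rows are already available as dictionaries with the split <name>_<type> columns, and the suffix list is an assumption (the question only mentions _int and _double). Separately, the Glue documentation for resolveChoice describes a single choice argument meant to apply one resolution to every ambiguous column, which may be worth checking before falling back to something like this.

```python
import re

# Type suffixes assumed for illustration; only _int and _double appear in the question.
TYPE_SUFFIXES = ("int", "long", "double", "string")
SUFFIX_PATTERN = re.compile(
    r"^(?P<base>.+)_(?P<type>" + "|".join(TYPE_SUFFIXES) + r")$"
)


def find_ambiguous_bases(column_names):
    """Return base names that appear with more than one type suffix."""
    seen = {}
    for name in column_names:
        match = SUFFIX_PATTERN.match(name)
        if match:
            seen.setdefault(match.group("base"), set()).add(match.group("type"))
    return {base for base, types in seen.items() if len(types) > 1}


def merge_split_columns(row, ambiguous_bases):
    """Collapse <base>_<type> columns back into one string-valued <base> column."""
    merged = {}
    for name, value in row.items():
        match = SUFFIX_PATTERN.match(name)
        if match and match.group("base") in ambiguous_bases:
            base = match.group("base")
            if value is not None and merged.get(base) is None:
                # Keep the first non-null variant, cast to string.
                merged[base] = str(value)
            else:
                merged.setdefault(base, None)
        else:
            merged[name] = value
    return merged


rows = [
    {"id": 1, "price_int": 10, "price_double": None},
    {"id": 2, "price_int": None, "price_double": 9.99},
]
bases = find_ambiguous_bases(rows[0].keys())
merged_rows = [merge_split_columns(row, bases) for row in rows]
print(merged_rows)
```

Because the ambiguous columns are detected from the column names, new JSON fields would be picked up without adding a line of code per field; the strings can then be cast to their final types in the mapping step.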

Map nested JSON (Mongo ATLAS) to SQL [Azure Data Factory]

I want to map nested JSON to a SQL table (Microsoft SSMS).
The source is a Mongo Atlas dataset and
the sink is an Azure SQL Database Managed Instance dataset.
I am able to map parentArray using a collection reference,
but I am not able to select the child arrays under it.
Also, the child arrays are scalar arrays (they don't have any keys).
Note: I tried the option "Map complex values to string",
but it puts the values into a single column cell like ["ABC", PQR], which I don't want.
Is there any way to map this?
Expected output for Table : childarray2
Currently, the Copy activity in ADF supports mapping arrays only one level deep.
There is no way to map nested arrays.
For this I had to use Data Flows.
The limitation was that MongoDB/Mongo Atlas cannot be used as an input source in a Data Flow, so the workaround was:
Convert Mongo to Azure Blob JSON (Copy activity task)
Use the Azure Blob JSON files as the input source and the SQL tables as the sink
Note: You can select the option to delete your blob files, to save storage space.
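As an illustration of the target shape (one row per value of a keyless child array), here is a minimal plain-Python sketch. The document layout and the field names other than parentArray and childArray2 are hypothetical, because the original sample and expected output were not included; in a Data Flow the equivalent step would be done with its own flatten transformation rather than code like this.

```python
# Hypothetical document: a parent array whose items hold keyless (scalar) child arrays.
document = {
    "_id": "doc1",
    "parentArray": [
        {"name": "p1", "childArray1": ["ABC", "PQR"], "childArray2": ["X"]},
        {"name": "p2", "childArray1": [], "childArray2": ["Y", "Z"]},
    ],
}


def flatten_child_array(doc, parent_field, child_field):
    """Yield one row per scalar value found in each parent's child array."""
    for parent in doc.get(parent_field, []):
        for value in parent.get(child_field, []):
            yield {
                "doc_id": doc["_id"],
                "parent_name": parent["name"],
                "value": value,
            }


childarray2_rows = list(flatten_child_array(document, "parentArray", "childArray2"))
for row in childarray2_rows:
    print(row)
```

Each child array would get its own table this way, with the parent key carried along so the rows can be joined back.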

How can I reference a JSON source for a derived column action in Azure Data Factory

I'm new to Azure Data Factory. I've been able to generate a set of JSON files from a REST API source using a Pipeline. Each file consists of one top level JSON object with an array of up to 100 child objects. The output is saved to an Azure Blob Storage container.
I now want to use a Mapping Data Flow to modify the JSON before I write it to Azure SQL, however I'm struggling with the syntax. I've configured the source to point to the directory containing the JSON files. The Source Projection tab displays the correct schema. I can preview the data and I see a row for each file and I can expand the child objects to see the full structure.
However, when I add a Derived Column action, the Input Schema is blank in the Expression Builder. I can refer to the top level elements in the source using the byName and byPosition functions, but I don't know how I can reference the child elements.
The examples that I have been able to find online use a SQL table or CSV file as a source. I can't find any examples that use hierarchical data as the source for a derived column.
Am I missing something? Is this scenario supported?
I found a way to achieve what I want. This may not be the best approach, but it works.
It seems that it is difficult to deal with JSON that has multiple hierarchies as a source for copy data activities. You can choose one level of repeating data to map to a table structure (the Collection Reference property on the Mapping tab).
In my scenario, there was additional repeating data within the data I was mapping to my table. I updated the mapping to write the child JSON data to a text field in my SQL table. To do this, I needed to use the Azure Data Factory JSON editor for my pipeline. You can access this from the "Code" link in the top right corner of the pipeline visual editor.
I added the following line after the closing bracket for the "mappings" array for my copy activity:
"mapComplexValuesToString": true
The full path to the mapping array in the activity definition is typeProperties - translator - mappings. Make sure your commas are correct after you add the new element.
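To show where the flag sits relative to the "mappings" array, here is a sketch that builds the relevant fragment as a Python dictionary and prints it as JSON. The column names and JSON paths are hypothetical placeholders, and the rest of the activity definition is omitted.

```python
import json

# Sketch of the translator section of a Copy activity (typeProperties -> translator).
# Column names, paths and the collection reference are hypothetical placeholders.
translator = {
    "type": "TabularTranslator",
    "mappings": [
        {"source": {"path": "['id']"}, "sink": {"name": "Id"}},
        {"source": {"path": "['details']"}, "sink": {"name": "data"}},
    ],
    "collectionReference": "$['items']",
    # Sibling of "mappings", added after the array's closing bracket.
    "mapComplexValuesToString": True,
}

copy_activity_fragment = {"typeProperties": {"translator": translator}}
print(json.dumps(copy_activity_fragment, indent=2))
```

Printing the fragment through a JSON serializer also sidesteps the comma mistakes that are easy to make when editing the pipeline code by hand.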
With this approach, I had a row in my SQL table for each array item in my Collection Reference. The scalar child elements in the array items are mapped to table columns and the child JSON element is written to a data column in the same table.
To extract the values I need within the child JSON, I created a SQL view that uses the CROSS APPLY OPENJSON syntax. This allows me to treat the JSON in the data field similar to a related table. You can specify the structure that your JSON is in. If you have nested data in your JSON, you can apply the same approach for each level.
The OPENJSON command is only supported by more recent versions of SQL Server. I'm using Azure SQL, so that works for me.
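The effect of the view can be illustrated with a plain-Python analogue (this is not T-SQL, only a sketch of the row shape that CROSS APPLY OPENJSON produces). The table rows and the field names code and qty are hypothetical.

```python
import json

# Hypothetical rows as they might land in the SQL table: scalar columns plus
# a "data" column holding the child JSON as text.
table_rows = [
    {"id": 1, "name": "first", "data": '[{"code": "A", "qty": 2}, {"code": "B", "qty": 5}]'},
    {"id": 2, "name": "second", "data": "[]"},
]


def cross_apply_json(rows, json_column, fields):
    """Join each row to the objects in its JSON array column.

    Like CROSS APPLY, a row whose array is empty produces no output rows.
    """
    for row in rows:
        for child in json.loads(row[json_column] or "[]"):
            joined = {k: v for k, v in row.items() if k != json_column}
            for field in fields:
                joined[field] = child.get(field)
            yield joined


view_rows = list(cross_apply_json(table_rows, "data", ["code", "qty"]))
for row in view_rows:
    print(row)
```

Nested levels would be handled the same way, by applying the function again to a JSON column produced by the previous step.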

Cannot map input column 'CurrencyDate', to lookup column, 'FullDateAlternateKey', because the data types do not match in AdventureWorksDW2008R2

I am attempting to learn SSIS by completing the Microsoft-supplied AdventureWorksDW sample database. I am up to step 6 - Adding and Configuring the Lookup Transformations. I can't seem to map CurrencyDate to FullDateAlternateKey, no matter what data type I change the CurrencyDate field to.
Has anyone been able to complete this and if so how? thanks

ServiceNow - JSON Web Service, display related tables

I'm working on a C# program that retrieves data from a ServiceNow database and converts that data into C# .NET objects. I'm using the JSON Web Service to return my data in JSON format.
What I want to achieve is as follows: If there is a relational mapping behind a value (for example: I have a table called Company, where CEO is not a TEXT field but a sys_id referencing an Employee table), I want to be able to output that data not as a sys_id (or just the name property obtained via the 'displayvariable' parameter) but as an object in the JSON.
This means that the value of a property should be an object in JSON instead of just a single value.
A few examples:
// I don't want the JSON like this
{"Company":{"CEO":"b181e841c9212c008aeb36850331fab2"}}
// Or by displaying the name of the sys_id table
{"Company":{"CEO":"James Henderson" }}
// I want the data as follows, so I can have all the data I need inside a single JSON record.
{"Company":{"CEO":{"name":"James Henderson", "age":34, "sex":"male", "office":"SBN Left Floor 23"}}}
From reading the documentation I couldn't find anything in the JSON Web Service that allowed me to display the information like this, nor could I find any other alternative. It should have something to do with joining the tables and displaying it all in the right format.
I have been using SNC for almost three years and have not found a way to automatically join tables in a web service. Your best option would be to use a scripted web service, which could take a query parameter and a table parameter. Then you can JSON-serialize your result as you see fit.
Another option would be to create a new processor that traverses the GlideRecord object. The ?JSON parameter you pass in the URL is merely a flag that routes your request to a particular processor. Unfortunately, I believe the OOB one is a Java class and not a JS script, so you would need to write a script much like the one I mentioned earlier to traverse the object path, serializing the object graph as far down as you want to go.
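The traversal idea can be sketched independently of ServiceNow. The following plain-Python example is not a ServiceNow script and uses no GlideRecord calls; the in-memory tables and the table of reference fields are hypothetical stand-ins, and the depth limit is there so circular references cannot recurse forever.

```python
import json

# Hypothetical in-memory stand-ins for ServiceNow tables, keyed by sys_id.
TABLES = {
    "employee": {
        "b181e841c9212c008aeb36850331fab2": {
            "name": "James Henderson",
            "age": 34,
            "sex": "male",
            "office": "SBN Left Floor 23",
        },
    },
}

# Which fields are references and which table they point to (assumed known up front).
REFERENCE_FIELDS = {"company": {"CEO": "employee"}, "employee": {}}


def expand_record(table, record, max_depth=2):
    """Replace reference sys_ids with the referenced record, up to max_depth levels."""
    expanded = {}
    for field, value in record.items():
        target = REFERENCE_FIELDS.get(table, {}).get(field)
        if target and max_depth > 0 and value in TABLES.get(target, {}):
            expanded[field] = expand_record(target, TABLES[target][value], max_depth - 1)
        else:
            expanded[field] = value  # plain value, unknown sys_id, or depth exhausted
    return expanded


company = {"CEO": "b181e841c9212c008aeb36850331fab2"}
print(json.dumps({"Company": expand_record("company", company)}))
```

A scripted web service or custom processor would follow the same pattern, looking up each reference field on the server side before serializing.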