JSON flattening in AWS Glue ETL job creates inferred schema with duplicated columns

I'm relatively new to AWS Glue and using the visual AWS Glue studio at the moment. Kind of a niche issue I'm having here...
Context:
I'm building an ETL job that, among other things, should parse and flatten JSON held in a string column, replacing it with separate typed fields that I can select and load into my data warehouse table.
Approach:
I first extract my data from the Glue catalog as a dynamicFrame (in this case only one table).
Then I'm trying to use the approach of unboxing and unnesting.
Let's call that json column data:
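from awsglue.dynamicframe import DynamicFrameCollection
from awsglue.transforms import Unbox, UnnestFrame

def transformTable(glueContext, dfc) -> DynamicFrameCollection:
    # Take the first (and only) frame in the collection
    dyf = dfc.select(list(dfc.keys())[0])
    # Parse the JSON string in "data", then flatten nested structs into columns
    dyf = Unbox.apply(frame=dyf, path="data", format="json")
    dyf = UnnestFrame.apply(frame=dyf)
    return DynamicFrameCollection({"TranformedTable": dyf}, glueContext)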
(Then I have a step to select the right frame from the frame collection, and then I can apply mapping to my fields and load.)
My issue:
Glue automatically infers the data types of my frame schema (rather successfully),
but it duplicates certain fields into several when the data type is unclear (similar to make_cols in the resolveChoice method), e.g. I end up with 2 fields in the output schema price_int and price_double, where price_int contains only the values that were round numbers by chance and null values everywhere else, etc.
So it seems like the default behavior of this method is to split columns in case of data type doubt (make_cols).
I understand that I could write a resolveChoice for each field, but with this approach they are already split in separate columns in the output schema.
Note: There are dozens of fields in this json, so I'm trying to devise a blanket solution that automatically makes all the fields of the json available in the schema to select and map in the next step, and avoid having to add one line of code for each field I want to extract. (And the json structure will grow with new fields in the future, so I'm trying to limit future ETL maintenance...)
Questions/help needed:
Any idea if there's a way to change this default behavior (like in the resolveChoice method)?
Alternatively, is there a way to apply a kind of default resolveChoice to all problematic fields from the json unboxing? For instance, I could force all problematic fields into string (similar to 'project:string'), and then reformat if needed in the applyMapping step. But resolveChoice seems to need to be applied field by field...
What's a different/better approach I could try? I would like to keep it as dynamic/automated as possible... e.g.:
I think I could maybe extract specific fields from the JSON line by line, but I'm not sure how (looks like the Unbox method is already splitting columns by format). And as explained, it's dozens of fields and growing... so it requires updating the code regularly, instead of just ticking boxes in the list of available fields.
The Relationalize method could be an option, but it creates distinct frames and this quickly becomes much more complex (there are actually several columns with json, which all need to be flattened...).
Creating crawlers or classifiers that run regularly to extract the schema of that specific string column might be an option as well...
Thanks in advance!
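One blanket approach, sketched here in plain Python rather than against the Glue API: parse the JSON string yourself, flatten nested keys, and coerce every number to float so that a field such as price can never split into price_int and price_double. The helper names flatten_json and normalize_record, the dotted key naming, and the sample rows are all hypothetical; how this would be wired into a Glue job is left open.

```python
import json

def flatten_json(obj, prefix=""):
    """Flatten nested dicts into one dict with dotted keys (lists are left as-is)."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_json(value, name))
        elif isinstance(value, bool):
            flat[name] = value  # bool is a subclass of int, so handle it first
        elif isinstance(value, (int, float)):
            flat[name] = float(value)  # a single numeric type per field
        else:
            flat[name] = value
    return flat

def normalize_record(record, json_column="data"):
    """Replace the JSON string column with its flattened fields."""
    out = {k: v for k, v in record.items() if k != json_column}
    out.update(flatten_json(json.loads(record[json_column])))
    return out

# Hypothetical rows: price is an integer in one record and a decimal in the other.
rows = [
    {"id": 1, "data": '{"price": 10, "item": {"sku": "a1"}}'},
    {"id": 2, "data": '{"price": 12.5, "item": {"sku": "b2"}}'},
]
flattened = [normalize_record(r) for r in rows]
```

New JSON fields show up as new keys without code changes. Separately, as far as I know, DynamicFrame.resolveChoice also accepts a single choice argument (for example choice="cast:string") that applies one resolution to every ambiguous column, which may be worth trying on the frame before hand-rolling anything.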

Related

Advanced mapping of JSON in Azure Data Factory - some guidance requested

I'm trying to map a JSON document (sensor data) into a more meaningful representation using Mapping Dataflows. However, I'm having a hard time getting this to work and would really appreciate some insight or recommendations on how to solve the following:
The input is
What I would like to end up with is the following:
Any pointers as to how this can be implemented are more than welcome.
This can be accomplished using the Copy activity and then split function in Derived Column transformation in Azure Data Factory.
Use the copy activity to read the JSON file as the source and, in the sink, use a SQL database to store the data as a table. In the Mapping tab, import the schema and map the JSON records to the corresponding column names. Refer to this third-party tutorial for guidance - https://sqlkover.com/dynamically-map-json-to-sql-in-azure-data-factory/
Finally, use the Data Flow activity and choose the SQL table as source now which you have used as sink above.
Select the Derived Column transformation.
Use split function.
Add the column that will hold the split values.
Use the split(<column_name_to_split>, '_') function to split the column on the _ delimiter. Change <column_name_to_split> to the name of the column you want to split.
Preview the data to check the result.
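For comparison, the effect of splitting a column on the _ delimiter can be sketched in Python. The value below is a made-up stand-in, since the original input is not shown, and the three-part layout is an assumption.

```python
# Hypothetical value standing in for the column to split.
sensor_key = "device01_temperature_celsius"

# Split on the "_" delimiter, as split(<column_name_to_split>, '_') would.
parts = sensor_key.split("_")

# Unpack the pieces into separate fields (assumes exactly three parts).
device, measure, unit = parts
```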

AWS glue: crawler does not identify metadata when CSV contains string and timestamp/date values

I have come across an issue when using a CSV file as input to a crawler:
the crawler doesn't identify the column headers when all the data in the CSV is in string format.
#P1 Headers are displayed as col0,col1...colN.
#P2 The actual column names are treated as data.
#P3 Metadata is wrong: every column datatype is shown as string even when the CSV dataset contains date/timestamp values.
If we use a custom (CSV) classifier, we have to specify the column headers manually.
#P2 is then covered, i.e. the column names are removed from the data; however,
#P1 still remains the same: the column headers are displayed as col0,col1...colN.
There are 3 things I want to resolve to get the expected result:
A CSV with strings only should show the actual column names instead of col0,col1...colN.
The metadata of the generated table should be correct (i.e. date/timestamp, string) once it is crawled by the crawler.
If a custom classifier is used, we need to specify the column header names manually in the classifier, and the result is still not satisfactory.
I need a generic solution instead of manual intervention.
I have gone through this document: here
If anyone has already implemented a solution, please help.
I found a solution to one of the above points: the headers, i.e. the first line of the CSV, are picked up by using 'Has heading' in the CSV classifier.
However, I have yet to figure out a solution for the following:
The metadata of the CSV file is shown as string even if a column contains timestamp/date values. The crawler reads these datatypes as string.
The custom classifier needs manual intervention. I have specified all the column names in the classifier. Is there a generic solution?
If we are using pandas' DataFrame.to_csv to write the dataframe, then to avoid getting column names as col1, col2 and so on, add the parameter
index_label='index', such as:
df.to_csv(path, index_label='index')  # path: the output CSV file
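Outside the crawler, a generic check of which columns hold timestamps can be sketched with the standard library. This is only a local pre-check under stated assumptions, not what the crawler does: infer_type is a hypothetical helper, the sample data is invented, and only ISO 8601 style values are recognised.

```python
import csv
import io
from datetime import datetime

def infer_type(values):
    """Return 'timestamp' if every non-empty value parses as ISO 8601, else 'string'."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return "string"
    try:
        for v in non_empty:
            datetime.fromisoformat(v)
    except ValueError:
        return "string"
    return "timestamp"

# Hypothetical CSV content; every cell is text, as in the question.
sample = io.StringIO(
    "name,created_at\n"
    "alice,2021-03-01 10:15:00\n"
    "bob,2021-03-02 08:00:00\n"
)
reader = csv.DictReader(sample)  # the first line is used as the header
rows = list(reader)
schema = {col: infer_type([r[col] for r in rows]) for col in reader.fieldnames}
```

The resulting schema dictionary could then drive whatever table definition is created downstream, without listing the column names by hand.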

SSIS consolidate and concatenate multiple rows into single rows without using SQL

I am trying to accomplish something that is pretty easy to do in SQL, but seemingly very challenging to do in SSIS without using SQL. Basically, I need to consolidate and concatenate a field of a many-to-one relationship.
Given entities: [Contract Item] (many) to (one) [Account]
There is a field [ari_productsummary] that contains the product listed on the Contract Item entity. We want to write that value to the Account as [ari_activecontractitems]. However, an Account may have more than one Contract Item record associated to it, in which case, we want to concatenate those values. We also only want the distinct values to be concatenated (distinct rows already solved within my data flow).
This can be accomplished by writing to a temporary table, and then using a query or view to obtain the summarized results as follows. I created a SQL table called TESTTABLE that contains the [ari_productsummary] from the Contract Item entity along with the referring [accountid] to map it back to Account. I then wrote the following query as a view:
SELECT distinct accountid,
(SELECT TT2.ari_productsummary + '; '
FROM TESTTABLE TT2
WHERE TT2.accountid = TT.accountid
FOR XML PATH ('')
) AS 'ari_activecontractitems'
FROM TESTTABLE TT
Executing that Query provides me the results that I want, which I can then use for importing into the Account entity as shown below:
But how do I do this in a SSIS dataflow without writing to a SQL table as a temporary placeholder for the data?? I want to do the entire process inside one dataflow container, without using a temporary SQL table/view. The whole summarization process needs to be done on the fly:
Does anyone have a solution that doesn't require a temporary SQL table/view/query, but is contained entirely within the data flow?
I am using VS 2017 and the KingswaySoft Dynamic CRM 365 ETL toolset to develop my solution/package.
Spitballing here, as I don't have Dynamics, nor do I have the custom components.
Data Flow 1 - Contract aggregation
The purpose of this data flow is to replicate your logic in the elegant query you provided and shove that into a Cache Connection Manager (see Notes for 2008+ at the end)
KingswaySoft Dynamics Source -> Script Task -> Cache Transform
If you want to keep the sort in there, do it before the script task. The implementation I'll take with the Script Task is that it's fully blocking - that is all the rows must arrive before it can send any on. Tasks like the Merge Join are only partially blocking because the requirement of sorted data means that once you no longer have a match for the current item, you can send it on down the pipeline.
The Script Task is going to be an asynchronous transformation. You'll have two output columns, your key accountid and your new derived column of ari_activecontractitems. That column might need to be big - you'll know your data best, but if it's a blob type in Dynamics (> 4k unicode or > 8k ascii characters) then you'll have to define the data type as DT_TEXT/DT_NTEXT.
As inputs, you'll select accountid and ari_productsummary from your source.
The code should be pretty easy. We're going to accumulate the inbound data into a Dictionary.
// member variable
Dictionary<string, List<string>> accumulator;
In the PreExecute method, we'll tack this in to initialize our variable
// initialize in PreExecute method
accumulator = new Dictionary<string, List<string>>();
In the per-row input method, Input0_ProcessInputRow in the generated code (name approx)
// simulate the inbound queue
// row_id would be something like Row.accountid,
// invoice something like Row.ari_productsummary
if (!accumulator.ContainsKey(row_id))
{
    // Create an empty list for this key
    accumulator.Add(row_id, new List<string>());
}
// add the value only if we don't already have it
if (!accumulator[row_id].Contains(invoice))
{
    accumulator[row_id].Add(invoice);
}
Once you get the signal that no more data is available, that's when you start sending output rows. The auto generated code will have placeholders for all this.
// This is how we shove data out the pipe
foreach (var kvp in accumulator)
{
    // approximately thus; the output columns are the two defined earlier
    OutputBuffer1.AddRow();
    OutputBuffer1.accountid = kvp.Key;
    OutputBuffer1.ari_activecontractitems = string.Join("; ", kvp.Value);
}
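The accumulate-then-join logic above is independent of SSIS, so it can be checked in isolation. Here is a minimal Python sketch of the same idea; concat_distinct and the sample rows are hypothetical, and the "; " separator follows the string.Join call above.

```python
def concat_distinct(rows, sep="; "):
    """Group values by key, keep each distinct value once, then join them."""
    accumulator = {}
    for account_id, product in rows:
        # Create an empty list the first time a key is seen
        products = accumulator.setdefault(account_id, [])
        # Add the value only if it is not already present
        if product not in products:
            products.append(product)
    # Emit one row per key once all input has been consumed
    return {key: sep.join(values) for key, values in accumulator.items()}

# Hypothetical (accountid, ari_productsummary) pairs
rows = [("A1", "Widget"), ("A1", "Gadget"), ("A1", "Widget"), ("B2", "Gizmo")]
summary = concat_distinct(rows)
```

Like the fully blocking script described above, nothing is emitted until every input row has been read.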
We have an upcoming release that comes with a component that does exactly what you are trying to achieve without the need of writing custom code. The feature is currently under preview, please reach out to us for private access to the feature. You can find our contact information on our website.
UPDATE - June 5, 2020, we have made the components available for public access at https://www.kingswaysoft.com/products/ssis-productivity-pack/ as a result of our 2020 Release Wave 1. We have two components available that serve this kind of purpose. The Composition component will take input values and transform into a composite value in a SSIS column. The Decomposition component does the opposite, it would take an input value and split it into multiple rows using either delimiter-based text splitting or XML/JSON array splitting.

How can I reference a JSON source for a derived column action in Azure Data Factory

I'm new to Azure Data Factory. I've been able to generate a set of JSON files from a REST API source using a Pipeline. Each file consists of one top level JSON object with an array of up to 100 child objects. The output is saved to an Azure Blob Storage container.
I now want to use a Mapping Data Flow to modify the JSON before I write it to Azure SQL, however I'm struggling with the syntax. I've configured the source to point to the directory containing the JSON files. The Source Projection tab displays the correct schema. I can preview the data and I see a row for each file and I can expand the child objects to see the full structure.
However, when I add a Derived Column action, the Input Schema is blank in the Expression Builder. I can refer to the top level elements in the source using the byName and byPosition functions, but I don't know how I can reference the child elements.
The examples that I have been able to find online use a SQL table or CSV file as a source. I can't find any examples that use hierarchical data as the source for a derived column.
Am I missing something? Is this scenario supported?
I found a way to achieve what I want. This may not be the best approach, but it works.
It seems that it is difficult to deal with JSON that has multiple hierarchies as a source for copy data activities. You can choose one level of repeating data to map to a table structure (the Collection Reference property on the Mapping tab).
In my scenario, there was additional repeating data within the data I was mapping to my table. I updated the mapping to write the child JSON data to a text field in my SQL table. To do this, I needed to use the Azure Data Factory JSON editor for my pipeline. You can access this from the "Code" link in the top right corner of the pipeline visual editor.
I added the following line after the closing bracket for the "mappings" array for my copy activity:
"mapComplexValuesToString": true
The full path to the mapping array in the activity definition is typeProperties - translator - mappings. Make sure your commas are correct after you add the new element.
With this approach, I had a row in my SQL table for each array item in my Collection Reference. The scalar child elements in the array items are mapped to table columns and the child JSON element is written to a data column in the same table.
To extract the values I need within the child JSON, I created a SQL view that uses the CROSS APPLY OPENJSON syntax. This allows me to treat the JSON in the data field similar to a related table. You can specify the structure that your JSON is in. If you have nested data in your JSON, you can apply the same approach for each level.
The OPENJSON command is only supported by more recent versions of SQL Server. I'm using Azure SQL, so that works for me.
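To illustrate what the CROSS APPLY OPENJSON view does conceptually, here is a rough Python sketch: each row's JSON text column is parsed and every child object becomes its own output row alongside the parent's scalar columns. The function name expand_children and the order_id, sku and qty fields are hypothetical; only the data column name comes from the description above.

```python
import json

def expand_children(rows, json_column="data"):
    """Produce one output row per child object stored in the JSON text column."""
    expanded = []
    for row in rows:
        parent = {k: v for k, v in row.items() if k != json_column}
        for child in json.loads(row[json_column]):
            # Combine the parent's scalar columns with the child's fields
            expanded.append({**parent, **child})
    return expanded

# Hypothetical table rows with child JSON written to a text column
rows = [
    {"order_id": 1, "data": '[{"sku": "a1", "qty": 2}, {"sku": "b2", "qty": 1}]'},
    {"order_id": 2, "data": "[]"},
]
result = expand_children(rows)
```

As with CROSS APPLY, a parent whose JSON array is empty produces no output rows.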

Best way to parse a big and intricate JSON file with OpenRefine (or R)

I know how to parse JSON cells in OpenRefine, but this one is too tricky for me.
I've used an API to extract the calendars of 4730 Airbnb rooms, identified by their IDs.
Here is an example of one Json file : https://fr.airbnb.com/api/v2/calendar_months?key=d306zoyjsyarp7ifhu67rjxn52tv0t20&currency=EUR&locale=fr&listing_id=4212133&month=11&year=2016&count=12&_format=with_conditions
For each ID and each day of the year from now until November 2017, I would like to extract the availability of the room (true or false) and its price on that day.
I can't figure out how to parse out this information. I guess that it implies a series of nested forEach calls, but I can't find the right way to do this with OpenRefine.
I've tried, of course,
forEach(value.parseJson().calendar_months, e, e.days)
The result is an array of arrays of dictionaries, which confuses me.
Any help would be appreciated. If the operation is too difficult in OpenRefine, a solution with R (or Python) would also be fine for me.
Rather than just creating your Project as text, and working with GREL to parse out...
The best way is to just select the JSON record part that you want to work with using our visual importer wizard for JSON and XML files (you can even use a URL pointing to a JSON file, as in your example). (A video tutorial shows how here: https://www.youtube.com/watch?v=vUxdB-nl0Bw )
Select the JSON part that contains your records that you want to parse and work with (this can be any repeating part, just select one of them and OpenRefine will extract all the rest)
Limit the amount of data rows that you want to load in during creation, or leave default of all rows.
Click Create Project and now you're in Rows mode. However, if you think that Records mode might be better suited to the context, just import the project again as JSON and then select the next outer area of the content, perhaps a larger array that contains a key field, etc. In the example, the key field would probably be the Date, which is why I highlighted the whole record for a given date. This way OpenRefine will have keys for each record, and Records mode lets you work with them better than Rows mode.
Feel free to take this example, make it better and even more helpful for all, and add it to our Wiki section on How to Use.
I think you are on the right track. The output of:
forEach(value.parseJson().calendar_months, e, e.days)
is hard to read because OpenRefine and JSON both use square brackets to indicate arrays. What you are getting from this expression is an OR array containing twelve items (one for each month of the year). The items in the OR array are JSON - each one an array of days in the month.
To keep the steps manageable I'd suggest tackling it like this:
First use
forEach(value.parseJson().calendar_months,m,m.days).join("|")
You have to use 'join' because OR can't store OR arrays directly in a cell - it has to be a string.
Then use "Edit Cells->Split multi-valued cells" - this will get you 12 rows per ID, each containing a JSON expression. Now for each ID you have 12 rows in OR
Then use:
forEach(value.parseJson(),d,d).join("|")
This splits the JSON down into the individual days
Then use "Edit Cells->Split multi-valued cells" again to split the details for each day into its own cell.
Using the JSON from example URL above - this gives me 441 rows for the single ID - each contains the JSON describing the availability & price for a single day. At this point you can use the 'fill down' function on the ID column to fill in the ID for each of the rows.
You've now got some pretty easy JSON in each cell - so you can extract availability using
value.parseJson().available
etc.
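Since a Python solution was also welcome, the same month-then-day traversal can be sketched as two nested loops. The calendar_months, days and available keys come from the expressions above; the date and price field names, the extract_days helper and the sample payload are assumptions and should be checked against the real API response.

```python
import json

def extract_days(listing_id, payload):
    """Return one row per day found under calendar_months[].days[]."""
    rows = []
    for month in json.loads(payload)["calendar_months"]:
        for day in month["days"]:
            rows.append({
                "listing_id": listing_id,
                "date": day.get("date"),            # assumed field name
                "available": day.get("available"),
                "price": day.get("price"),          # assumed field name
            })
    return rows

# Hypothetical, heavily trimmed payload with the same nesting
sample = json.dumps({
    "calendar_months": [
        {"days": [
            {"date": "2016-11-01", "available": True, "price": 80},
            {"date": "2016-11-02", "available": False, "price": 80},
        ]},
        {"days": [{"date": "2016-12-01", "available": True, "price": 95}]},
    ]
})
days = extract_days(4212133, sample)
```

The 441 rows reported above for twelve months is more than a year of days, which suggests that some days may appear under more than one month; if so, de-duplicating on listing_id and date would be needed.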