Losing data from Source to Sink in Copy Data - JSON

In my MS Azure Data Factory, I have a REST API connection to a nested JSON dataset.
The Source "Preview data" shows all the data (7 orders from the online store).
The Copy Data activity has a "Mapping" tab where I map JSON fields to the sink SQL table columns. If I select None under "Collection Reference", all 7 orders are copied over.
But if I want the nested metadata and select the meta field in "Collection Reference", I do get my nested data as multiple order lines, each with one metadata point, but only from 1 order, not all 7.
I think I know the reason for my problem: one of the fields in the nested metadata is sometimes a string and sometimes an array. But I still don't have a solution.
(screenshot of the metadata)

Your sense is right; it is caused by your nested metadata structure. Based on the documentation of the Collection Reference property:

If you want to iterate and extract data from the objects inside an array field with the same pattern and convert to per row per object, specify the JSON path of that array to do cross-apply. This property is supported only when hierarchical data is source.

"Same pattern" is the key point here, I think. However, the objects inside your metadata array do not all share the same pattern, as your screenshot shows.
My workaround is using Azure Blob Storage as a transition: REST API ---> Azure Blob Storage ---> your sink dataset. Inside the Blob Storage dataset, you could flatten the incoming JSON data with the Cross-apply nested JSON array setting.
You could refer to this blog to learn about this feature, then copy the flattened data into your destination.
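As an aside, what cross-apply produces is easy to see outside ADF. Here is a minimal Python sketch of the flattening, including a normalization step for the field that is sometimes a string and sometimes an array (the field names order_id and meta are hypothetical placeholders for your schema):

import json

def cross_apply(orders, array_field="meta"):
    """Flatten each order into one row per element of its array field."""
    rows = []
    for order in orders:
        values = order.get(array_field, [])
        # Normalize: the field is sometimes a plain string, sometimes an array.
        # Wrapping strings in a one-element list gives every order the same pattern.
        if not isinstance(values, list):
            values = [values]
        for item in values:
            row = {k: v for k, v in order.items() if k != array_field}
            row[array_field] = item
            rows.append(row)
    return rows

orders = [
    {"order_id": 1, "meta": ["a", "b"]},  # array: flattens to two rows
    {"order_id": 2, "meta": "c"},         # plain string: one row after normalizing
]
print(json.dumps(cross_apply(orders), indent=2))

If every order has the same pattern after a normalization step like this, cross-apply on the array field should return rows from all 7 orders.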

Related

Spark from_avro 2nd argument is a constant string, any way to obtain schema string from some column of each record?

Suppose we are developing an application that pulls Avro records from a source stream (e.g. Kafka/Kinesis/etc.), parses them into JSON, then further processes that JSON with additional transformations. Further assume these records can have a varying schema (which we can look up and fetch from a registry).
We would like to use Spark's built-in from_avro function, but it is pretty clear that from_avro wants you to hard-code a fixed schema into your code. It doesn't seem to allow the schema to vary row by incoming row.
That sort of makes sense if you are parsing the Avro into Spark's internal row format, since one would need a consistent structure for the dataframe. But what if we wanted something like from_avro which grabbed the bytes from some column in the row, grabbed the string representation of the Avro schema from some other column in the row, and then parsed that Avro into a JSON string?
Does such a built-in method exist? Or is such functionality available in a 3rd-party library?
Thanks!
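For reference, one way to get this behavior without a built-in is a plain UDF that parses the schema string carried in each row. A minimal PySpark sketch using the fastavro library, assuming the bytes are plain schemaless Avro (Confluent's wire format prepends a 5-byte header you would strip first) and that the column names avro_bytes and schema_str are placeholders for your data:

import io
import json

import fastavro
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def avro_to_json(avro_bytes, schema_str):
    # Parse the schema carried in the same row; in practice you would
    # memoize parse_schema per distinct schema string, since it is costly.
    schema = fastavro.parse_schema(json.loads(schema_str))
    record = fastavro.schemaless_reader(io.BytesIO(bytes(avro_bytes)), schema)
    return json.dumps(record)

# df has a binary column `avro_bytes` and a string column `schema_str`:
# df = df.withColumn("json", avro_to_json("avro_bytes", "schema_str"))

The trade-off versus from_avro is that the result is an opaque JSON string rather than typed columns, which is unavoidable when the schema varies per row.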

ADF: Split a JSON file with an Array of Objects into Single JSON files containing One Element Each

I'm using Azure Data Factory and trying to convert a JSON file that is an array of JSON objects into separate JSON files that each contain one element, e.g. the input:
[
{"Animal":"Cat","Colour":"Red","Age":12,"Visits":[{"Reason":"Injections","Date":"2020-03-15"},{"Reason":"Check-up","Date":"2020-01-02"}]},
{"Animal":"Dog","Colour":"Blue","Age":1,"Visits":[{"Reason":"Check-up","Date":"2020-02-08"}]},
{"Animal":"Guinea Pig","Colour":"Green","Age":5,"Visits":[{"Reason":"Injections","Date":"2019-12-01"},{"Reason":"Check-up","Date":"2020-02-26"}]}
]
I've tried using a Data Flow to split this array into single files, one per element of the JSON array, but cannot work it out. Ideally I would also want to name each file dynamically, e.g. Cat.json, Dog.json and "Guinea Pig.json".
Is Data Flow the correct tool for this with Azure Data Factory (version 2)?
Data Flows should do it for you. Your JSON snippet above will generate 3 rows. Each of those rows can be sent to a single sink. Set the Sink as a JSON sink with no filename in the dataset. In the Sink transformation, use the 'File Name Option' of 'As Data in Column'. Add a Derived Column before that which sets a new column called 'filename' with this expression:
Animal + '.json'
Use the column name 'filename' as the 'data in column' value in the sink.
You'll get a separate file for each row.
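For comparison, outside ADF the same split is a few lines of Python; this sketch only illustrates the output the Data Flow should produce (input.json and the out directory are placeholders):

import json
from pathlib import Path

records = json.loads(Path("input.json").read_text())
out_dir = Path("out")
out_dir.mkdir(exist_ok=True)

for record in records:
    # One file per array element, named after the Animal field, e.g. Cat.json
    (out_dir / f"{record['Animal']}.json").write_text(json.dumps(record))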

Handling big JSONs in Azure Data Factory

I'm trying to use ADF for the following scenario:
a JSON file is uploaded to an Azure Storage Blob, containing an array of similar objects
this JSON is read by ADF with a Lookup Activity and uploaded via a Web Activity to an external sink
I cannot use the Copy Activity, because I need to build a JSON payload for the Web Activity, so I have to look up the array and paste it in like this (payload of the Web Activity):
{
"some field": "value",
"some more fields": "value",
...
"items": #{activity('GetJsonLookupActivity').output.value}
}
The Lookup Activity has a known limitation of an upper limit of 5000 rows at a time. If the JSON is larger, only the top 5000 rows will be read and everything else will be ignored.
I know this, so I have a system that chops payloads into chunks of 5000 rows before uploading them to storage. But I'm not the only user, so there's a valid concern that someone else will try uploading bigger files and the pipeline will silently pass with a partial upload, while the user would obviously expect all rows to be uploaded.
I've come up with two concepts for a workaround, but I don't see how to implement either:
Is there any way for me to check whether the JSON file is too large and fail the pipeline if so? The Lookup Activity doesn't seem to allow row counting, and the Get Metadata Activity only returns the size in bytes.
Alternatively, the MSDN docs propose a workaround of copying data in a foreach loop. But I cannot figure out how I'd use Lookup to first get rows 1-5000 and then 5001-10000, etc. from a JSON file. It's easy enough with SQL using OFFSET N FETCH NEXT 5000 ROWS ONLY, but how do you do it with JSON?
You can't set an index range (1-5000, 5001-10000) when you use the Lookup Activity. The workaround mentioned in the doc doesn't mean you can use the Lookup Activity with pagination, in my opinion.
My workaround is writing an Azure Function to get the total length of the JSON array before the data transfer. Inside the function, divide the data into separate temporary sub-files with pagination, e.g. sub1.json, sub2.json, ..., then output an array containing the file names.
Grab that array with a ForEach Activity and execute the Lookup Activity inside the loop, with the file path set as a dynamic value. Then run the next Web Activity.
Surely my idea could be improved. For example, if the total length of the JSON array is under the 5000 limit, the function could just return {"NeedIterate": false}. Evaluate that response with an If Condition Activity to decide which branch comes next: if the value is false, execute the Lookup Activity directly. Everything can be split across the two branches.
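The core of such a function is only a few lines. A minimal Python sketch of the chunking logic, assuming a 5000-row chunk size to match the Lookup limit and hypothetical sub-file names (a real Azure Function would read from and write back to Blob Storage rather than the local file system):

import json

CHUNK_SIZE = 5000  # matches the Lookup Activity row limit

def split_json_array(blob_text):
    """Return the response the pipeline branches on: either NeedIterate=false,
    or a list of sub-file names for the ForEach loop."""
    items = json.loads(blob_text)
    if len(items) <= CHUNK_SIZE:
        return {"NeedIterate": False}
    names = []
    for i in range(0, len(items), CHUNK_SIZE):
        name = "sub{}.json".format(i // CHUNK_SIZE + 1)
        # A real Azure Function would upload each chunk back to Blob Storage
        # (e.g. with azure.storage.blob's BlobClient.upload_blob).
        with open(name, "w") as f:
            json.dump(items[i:i + CHUNK_SIZE], f)
        names.append(name)
    return {"NeedIterate": True, "files": names}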

Other data input format than flat json for crossfilter?

When using crossfilter (for example with dc.js), do I always need to transform my data into flat JSON for input?
Flat JSON data read from AJAX requests tends to be a lot larger than it needs to be (compared to, for example, nested JSON, value-to-array, or CSV data).
Is there an API available which can read in other formats than flat JSON? Are there plans to add one?
I would like to avoid making the client transform the data before using it.
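For context, the flattening in question is a small transform wherever it runs. A minimal sketch (in Python, purely illustrative, with hypothetical field names) of turning a value-to-array payload into the flat records crossfilter consumes:

def flatten(columns):
    """Turn a value-to-array payload like {"date": [...], "amount": [...]}
    into the flat list of records crossfilter consumes."""
    keys = list(columns)
    return [dict(zip(keys, row)) for row in zip(*columns.values())]

payload = {"date": ["2020-01-01", "2020-01-02"], "amount": [10, 25]}
print(flatten(payload))
# [{'date': '2020-01-01', 'amount': 10}, {'date': '2020-01-02', 'amount': 25}]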

Using JSON.stringify, but the data still has arrays - what does it mean?

I have data in the form of an object array, containing nested object arrays in a tree structure. I use JSON.stringify(myArray), but the result still contains arrays, because I see [] inside the converted data.
In my case, I want all the data converted into JSON objects, not arrays, because I need to use the data in the TreeTable of SAPUI5.
Maybe I misunderstand something. Please help me clear this up.
This is an example of the data that I got from JSON.stringify:
[{"value":{"Id":"00145E5BB2641EE284F811A7907717A3",
"Text":"BI-RA Reporting, analysis, and dashboards",
"Parent":"00145E5BB2641EE284F811A79076F7A3","Type":"BMF"},
"children":[{"value":{"Id":"00145E5BB2641EE284F811A7907737A3",
"Text":"WebIntelligence_4.1","Parent":"00145E5BB2641EE284F811A7907717A3",
"Type":"TWB"},"children":[{"value":{"Id":"00145E5BB2641EE284F811A7907757A3",
"Text":"Functional Areas","Parent":"00145E5BB2641EE284F811A7907737A3","Type":"TWB"},
"children":[{"value":{"Id":"00145E5BB2641EE284F811A7907777A3",
"Text":"CHARTING","Parent":"00145E5BB2641EE284F811A7907757A3","Type":"TWB"},
"children":[{"value":{"Id":"001999E0B9081EE28AB706BE26631E93",
"Text":"Drill","Parent":"00145E5BB2641EE284F811A7907777A3","Type":"TWB"},
"children":[{"value":{"Id":"001999E0B9081EE28AB706BE26633E93",
"Text":"[AUTO][ACCEPT] Drill on charts DHTML","Parent":"001999E0B9081EE28AB706BE26631E93",
"Type":"TWB","Ref":"UT_WEBI_CHARTS_DRILL_HTML"}},{"value":{"Id":"001999E0B9081EE28AB706BE26635E93",
"Text":"[AUTO][ACCEPT] Drill on charts JAVA","Parent":"001999E0B9081EE28AB706BE26631E93",
"Type":"TWB","Ref":"UT_WEBI_CHARTS_DRILL_JAVA"}}]},...
The output that I want shouldn't be an array of objects, but something like...
{{"value":{
"Id":"00145E5BB2641EE284F811A7907717A3",
"Text":"BI-RA Reporting, analysis, and dashboards",
"Parent":"00145E5BB2641EE284F811A79076F7A3","Type":"BMF"},
"children":{
{"value":{
"Id":"00145E5BB2641EE284F811A7907737A3",
"Text":"WebIntelligence_4.1",
"Parent":"00145E5BB2641EE284F811A7907717A3",
"Type":"TWB"},
"children":{
{"value":{
"Id":"00145E5BB2641EE284F811A7907757A3",
"Text":"Functional Areas",
"Parent":"00145E5BB2641EE284F811A7907737A3",
"Type":"TWB"},...
JSON.stringify merely converts JavaScript data structures to a JSON-formatted string for consumption by other parsers (including JSON.parse). If you want it to stringify to a different value, you must change the source data structures first.
However, it seems that this can't be represented as anything other than an array because you have duplicate keys (i.e. value appears more than once). That would not be valid for a JavaScript object or a JSON representation of such.
I think what you want is
JSON.stringify(data[0]);
or perhaps
JSON.stringify(data[0].value);
where data is the array you posted in the question