How to remove escaped characters when parsing XML to JSON with a Copy Data activity in Azure Data Factory?

I have an ADF pipeline exporting from an XML dataset (ADLS) to a JSON dataset (ADLS) with a Copy Data activity. Due to the complex XML structure, I need to parse the nested XML to nested JSON and then use T-SQL to parse the nested JSON into a Synapse table.
However, the nested JSON output has double backslashes (they look like escape characters) at nodes that contain a comma. You can check a sample of the XML input and JSON output below:
xml input
<Address2>test, test</Address2>
json output
"Address2":"test\\, test"
How can I remove the double backslashes in the output JSON with a Copy Data activity in Azure Data Factory?

Unfortunately there is no such provision in the Copy Data activity.
However, I tried with just the lines you provided as sample source and sink in a Copy Data activity, and it copies them as-is; I don't see any \\. Perhaps you could share the exact pipeline you have, with details of the nested XML, JSON, and T-SQL that you are using.
Repro: (with all default settings and properties)
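If the escaped comma does show up in the copied file, one workaround (a sketch only, not an ADF setting; the variable and sample value below are illustrative) is to strip the escape sequence in the T-SQL step that parses the JSON into the Synapse table:

declare @json nvarchar(max) = N'{"Address2":"test\\, test"}';  -- stands in for the copied output
-- remove the stray escape sequence, then parse as normal JSON
select json_value(replace(@json, N'\\,', N','), '$.Address2') as Address2;
-- returns: test, test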

Related

How to prevent adding backslash to JSON string

I would like to read events from Event Hub using Databricks. The events are in JSON format, but they can have different schemas (this is important because the solutions I found pass a schema to the from_json(jsonStr, schema) function, which I cannot use in my case). When I use
.withColumn('Value', col('value').cast(StringType()))
the dataframe returns JSON output with backslashes: "{\"time\": 1432826855000,\"host\":...... .
I found a solution, How to prevent spark sql with kafka from adding backslash to JSON string in dataframe, but in the Delta Live Tables framework we create streaming tables by returning a dataframe, so I can't use it.
Should I use non-PySpark functions in the ETL process, such as the approach in How to remove backslash from decoded JSON string?
Will it be efficient while streaming from Event Hub to bronze?
You shouldn't worry about those backslashes - they are just part of the visual representation of your string when you display data that has the " character embedded in it. Internally, the data is stored without backslashes, like: {"time": 1432826855000,"host":.......
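A minimal sketch of that point in plain Python (outside Spark, with a made-up event payload): the backslashes only show up when the string is serialized or rendered inside another quoted context, not in the value that is stored.

import json

s = '{"time": 1432826855000, "host": "eventhub"}'

# Embedding the string in another JSON document escapes the inner quotes,
# but only in the serialized form:
print(json.dumps({"body": s}))
# {"body": "{\"time\": 1432826855000, \"host\": \"eventhub\"}"}

# The stored string itself contains no backslashes and parses cleanly:
print('\\' in s)              # False
print(json.loads(s)["host"])  # eventhub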

Reading JSON in Azure Synapse

I'm trying to understand the code for reading a JSON file in Synapse Analytics. Here's the code provided by the Microsoft documentation:
Query JSON files using serverless SQL pool in Azure Synapse Analytics
select top 10 *
from openrowset(
        bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.jsonl',
        format = 'csv',
        fieldterminator = '0x0b',
        fieldquote = '0x0b'
    ) with (doc nvarchar(max)) as rows
go
I wonder why the format = 'csv'. Is it trying to convert JSON to CSV to flatten the file?
Why they didn't just read the file as a SINGLE_CLOB, I don't know.
When you use SINGLE_CLOB, the entire file is imported as one value, and the content of the file in doc is not well formed as a single JSON document. Using SINGLE_CLOB would make us do more work after the openrowset, before we could use the content as JSON (since it is not valid JSON, we would need to parse the value first). It can be done, but it would probably require more work.
The file actually contains multiple JSON-like strings, each on a separate line - "line-delimited JSON", as the document calls it.
By the way, if you check the history of the document on GitHub, you will find that originally this was not the case. As far as I remember, the file originally included a single JSON document with an array of objects (it was wrapped with [] once loaded). Someone named "Ronen Ariely" in fact found this issue in the document, which is why you can see my name in the list of authors of the document :-)
I wonder why the format = 'csv'. Is it trying to convert json to csv to flatten the hierarchy?
(1) JSON is not a data type in SQL Server. There is no data type named JSON. What we have in SQL Server are tools, such as functions, that work on text and provide support for strings in JSON-like format. Therefore, we do not CONVERT to JSON or from JSON.
(2) The format parameter has nothing to do with JSON. It specifies that the content of the file is a comma-separated-values file. You can (and should) use it whenever your file is well formatted as a comma-separated-values file (also commonly known as a CSV file).
In this specific sample in the document, the values in the CSV file are strings, each of which has a valid JSON format. Only after the file is read using openrowset do we start to parse the content of the text as JSON.
Notice that the document only starts to talk about parsing the text as JSON after the heading "Parse JSON documents".
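For context, here is a sketch of that parsing step built on the same query (the property names are assumptions about the fields in the ECDC file, used for illustration only):

select
    json_value(doc, '$.date_rep')                  as date_rep,
    json_value(doc, '$.countries_and_territories') as country,
    cast(json_value(doc, '$.cases') as int)        as cases
from openrowset(
        bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.jsonl',
        format = 'csv',
        fieldterminator = '0x0b',
        fieldquote = '0x0b'
    ) with (doc nvarchar(max)) as rows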

Is format always json when SELECTing from stage?

Snowflake supports multiple file types via CREATE FILE FORMAT (Avro, JSON, CSV, etc.).
Now I have tested SELECTing from a Snowflake stage (S3) with both:
*.avro files (generated by a NiFi processor batching 10k records from a source Oracle table).
*.json files (one JSON document per line).
When I SELECT $1 FROM @myStg, Snowflake expands as many rows as there are records in the Avro or JSON files (cool), but the $1 variant is in JSON format in both cases, and now I wonder: whatever Snowflake FILE_FORMAT we use, do records always arrive as JSON in the $1 variant?
I haven't tested CSV or other Snowflake FILE_FORMATs.
Or I wonder if I get JSON from the Avro files (from the Oracle table) because maybe the NiFi processor creates Avro files that internally use JSON format.
Maybe I'm confusing things here. I know Avro files contain both:
an Avro schema - a language similar to JSON key/value.
compressed data (binary).
Thanks,
Emanuel O.
I tried with CSV. With CSV it parses each record in the file into separate positional columns ($1, $2, ...) rather than a single variant.
With JSON it treats one complete JSON document as one record, so it is displayed in JSON format in $1.
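A minimal sketch of the difference (the stage, the file format names, and the host field are made up for illustration):

-- Semi-structured formats (JSON, Avro, ORC, Parquet): $1 is one VARIANT per record,
-- which is why it displays as JSON and can be traversed with : paths.
create or replace file format my_json_fmt type = 'json';
select $1, $1:host::string
from @myStg (file_format => 'my_json_fmt');

-- CSV: $1, $2, ... are plain positional columns, not a single VARIANT.
create or replace file format my_csv_fmt type = 'csv';
select $1, $2
from @myStg (file_format => 'my_csv_fmt');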

Can an ndjson file be saved using xdmp.save?

I have a requirement to gather certain JSON documents from my database and save them in an outside drive as one file for a downstream consumer.
Using server-side JavaScript I can combine the documents into a JSON object or array. However, they need to be saved into this single file in ndjson format.
Is there any way to do this using xdmp.save in MarkLogic? I thought of saving the documents as a sequence but that throws an error.
xdmp.save() expects a node() for the second parameter.
You could serialize the JSON docs and delimit them with a newline to generate the newline-delimited JSON, then create a text() node from that string.
const ndjson = new NodeBuilder()
  .addText(cts.search(cts.collectionQuery("json")).toArray().join("\n"))
  .toNode();
xdmp.save("/temp/ndjson.json", ndjson);

converting avro record to string and back

I have to develop a MapReduce program that needs to perform a join on two different data sets.
One of them is a CSV file and the other is an Avro file.
I am using MultipleInputs to process both sources. However, to process both datasets in one single reducer, I am converting the Avro data to Text by using
new Text(key.datum().toString())
My challenge is to convert the JSON string generated above back into an Avro record in the reducer, as the final output needs to be in Avro format.
Is there a particular function or class that can be used to do this?
If yes, can you please quote an example as well?
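One common approach, sketched below, is Avro's JsonDecoder together with a GenericDatumReader. This assumes the JSON produced by datum.toString() matches the Avro JSON encoding for the schema (which is not guaranteed, notably for union types); the class and method names here are illustrative, not from the original post.

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;

public class JsonToAvro {
    // Rebuilds a GenericRecord from the JSON string produced in the mapper,
    // given the original Avro schema.
    public static GenericRecord fromJson(String json, Schema schema) throws IOException {
        DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
        Decoder decoder = DecoderFactory.get().jsonDecoder(schema, json);
        return reader.read(null, decoder);
    }
}

In the reducer, the resulting GenericRecord can then be wrapped in an AvroKey/AvroValue (or written via AvroMultipleOutputs) so the job's final output stays in Avro format.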