Can an ndjson file be saved using xdmp.save? - json

I have a requirement to gather certain JSON documents from my database and save them in an outside drive as one file for a downstream consumer.
Using server-side Javascript I can combine the documents in a JSON object or array. However, they need to be saved into this singular file in ndjson format.
Is there any way to do this using xdmp.save in MarkLogic? I thought of saving the documents as a sequence but that throws an error.

xdmp.save() expects a node() for the second parameter.
You could serialize the JSON docs and delimit with a carriage return to generate the Newline Delimited JSON, and then create a text() node from that string.
const ndjson = new NodeBuilder()
.addText(cts.search(cts.collectionQuery("json")).toArray().join("\n"))
.toNode();
xdmp.save("/temp/ndjson.json", ndjson);

Related

How to remove escaped character when parsing xml to json with copy data activity in Azure Data Factory?

I have an ADF pipeline exporting from xml dataset (ADLS) to json dataset (ADLS) with a copy Data activity. Due to the complex xml structure, I need to parse the nested xml to nested json then use T-SQL to parse the nested json into Synapse table.
However, the output nested has double backslash (It seems like escape characters) at nodes which have comma in it. You can check a sample of xml input and json output below:
xml input
<Address2>test, test</Address2>
json output
"Address2":"test\\, test"
How can I remove the double backslash in the output json with copy data activity in Azure Data Factory ?
Unfortunately there is no such provision in CopyData Activity.
However, I just tried with just the lines you provided as sample source and sink with CopyData Activity and it just copies as is. I don't see any \\. Perhaps you could share the exact pipeline you have, with details of the nested XML, JSON and T-SQL that you are using.
Repro: (with all default settings and properties)

Reading JSON in Azure Synapse

I'm trying to understand the code for reading JSON file in Synapse Analytics. And here's the code provided by Microsoft documentation:
Query JSON files using serverless SQL pool in Azure Synapse Analytics
select top 10 *
from openrowset(
bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.jsonl',
format = 'csv',
fieldterminator ='0x0b',
fieldquote = '0x0b'
) with (doc nvarchar(max)) as rows
go
I wonder why the format = 'csv'. Is it trying to convert JSON to CSV to flatten the file?
Why they didn't just read the file as a SINGLE_CLOB I don't know
When you use SINGLE_CLOB then the entire file is important as one value and the content of the file in the doc is not well formatted as a single JSON. Using SINGLE_CLOB will make us do more work after using the openrowset, before we can use the content as JSON (since it is not valid JSON we will need to parse the value). It can be done but will require more work probably.
The format of the file is multiple JSON's like strings, each in separate line. "line-delimited JSON", as the document call it.
By the way, If you will check the history of the document at GitHub, then you will find that originally this was not the case. As much as I remember, originally the file included a single JSON document with an array of objects (was wrapped with [] after loaded). Someone named "Ronen Ariely" in fact found this issue in the document, which is why you can see my name in the list if the Authors of the document :-)
I wonder why the format = 'csv'. Is it trying to convert json to csv to flatten the hierarchy?
(1) JSON is not a data type in SQL Server. There is no data type name JSON. What we have in SQL Server are tools like functions which work on text and provide support for strings which are JSON's like format. Therefore, we do not CONVERT to JSON or from JSON.
(2) The format parameter has nothing to do with JSON. It specifies that the content of the file is a comma separated values file. You can (and should) use it whenever your file is well formatted as comma separated values file (also commonly known as csv file).
In this specific sample in the document, the values in the csv file are strings, which each one of them has a valid JSON format. Only after you read the file using the openrowset, we start to parse the content of the text as JSON.
Notice that only after the title "Parse JSON documents" in the document, the document starts to speak about parsing the text as JSON.

How do I read a Large JSON Array File in PySpark

Issue
I recently encountered a challenge in Azure Data Lake Analytics when I attempted to read in a Large UTF-8 JSON Array file and switched to HDInsight PySpark (v2.x, not 3) to process the file. The file is ~110G and has ~150m JSON Objects.
HDInsight PySpark does not appear to support Array of JSON file format for input, so I'm stuck. Also, I have "many" such files with different schemas in each containing hundred of columns each, so creating the schemas for those is not an option at this point.
Question
How do I use out-of-the-box functionality in PySpark 2 on HDInsight to enable these files to be read as JSON?
Thanks,
J
Things I tried
I used the approach at the bottom of this page:
from Databricks that supplied the below code snippet:
import json
df = sc.wholeTextFiles('/tmp/*.json').flatMap(lambda x: json.loads(x[1])).toDF()
display(df)
I tried the above, not understanding how "wholeTextFiles" works, and of course ran into OutOfMemory errors that killed my executors quickly.
I attempted loading to an RDD and other open methods, but PySpark appears to support only the JSONLines JSON file format, and I have the Array of JSON Objects due to ADLA's requirement for that file format.
I tried reading in as a text file, stripping Array characters, splitting on the JSON object boundaries and converting to JSON like the above, but that kept giving errors about being unable to convert unicode and/or str (ings).
I found a way through the above, and converted to a dataframe containing one column with Rows of strings that were the JSON Objects. However, I did not find a way to output only the JSON Strings from the data frame rows to an output file by themselves. The always came out as
{'dfColumnName':'{...json_string_as_value}'}
I also tried a map function that accepted the above rows, parsed as JSON, extracted the values (JSON I wanted), then parsed the values as JSON. This appeared to work, but when I would try to save, the RDD was type PipelineRDD and had no saveAsTextFile() method. I then tried the toJSON method, but kept getting errors about "found no valid JSON Object", which I did not understand admittedly, and of course other conversion errors.
I finally found a way forward. I learned that I could read json directly from an RDD, including a PipelineRDD. I found a way to remove the unicode byte order header, wrapping array square brackets, split the JSON Objects based on a fortunate delimiter, and have a distributed dataset for more efficient processing. The output dataframe now had columns named after the JSON elements, inferred the schema, and dynamically adapts for other file formats.
Here is the code - hope it helps!:
#...Spark considers arrays of Json objects to be an invalid format
# and unicode files are prefixed with a byteorder marker
#
thanksMoiraRDD = sc.textFile( '/a/valid/file/path', partitions ).map(
lambda x: x.encode('utf-8','ignore').strip(u",\r\n[]\ufeff")
)
df = sqlContext.read.json(thanksMoiraRDD)

Extracting json records from sequence files in spark scala

I have a sequence file containing multiple json records. I want to send every json record to a function . How can I extract one json record at a time?
Unfortunately there is no standard way to do this.
Unlike YAML which has a well-defined way to allow one file contain multiple YAML "documents", JSON does not have such standards.
One way to solve your problem is to invent your own "object separator". For example, you can use newline characters to separate adjacent JSON objects. You can tell your JSON encoder not to output any newline characters (by forcing escaping it into \ and n). As long as your JSON decoder is sure that it will not see any newline character unless it separates two JSON objects, it can read the stream one line at a time and decode each line.
It has also been suggested that you can use JSON arrays to store multiple JSON objects, but it will no longer be a "stream".
You can read content of your sequence files to RDD[String] and convert it to Spark Dataframe.
val seqFileContent = sc
.sequenceFile[LongWritable, BytesWritable](inputFilename)
.map(x => new String(x._2.getBytes))
val dataframeFromJson = sqlContext.read.json(seqFileContent)

converting avro record to string and back

I have to develop a mapreduce program that is needed to perform a join on two different data sets.
One of them is a csv file and other is an avro file.
I am using MultipleInputs to process both sources. However to process both dataset in one single reducer, I am converting the Avro Data to Text by using
new Text(key.datum.toString())
My challenge is to convert the Json String generated above to Avro rcord back in reducer as the final output needs to be in avro format.
Is there a particular function or class that can be used to do this?
If yes, can you please quote an example as well?