Differentiating XBRL, XML, CSV, and JSON

Currently I'm trying to differentiate between different serialized text formats, mainly XBRL, XML, CSV, and JSON.
My idea is to check in steps: if an XML parser parses the document and returns without throwing an exception, then it's a valid XML document, and it needs a further check to see whether it is plain XML or XBRL.
If the first check fails, try parsing it as CSV. If the CSV parsing throws an exception, try parsing it as JSON. If none of the above works, it's an invalid document.
Would this be a reasonable way of identifying the type of text format the document is in? Or is there a better way (e.g. reading the first few bytes of the document, etc.)?
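Roughly, what I have in mind is something like the sketch below (Python only for illustration; the namespace strings are placeholders, and I check JSON before CSV since almost any text passes a CSV check):

import csv
import json
import xml.etree.ElementTree as ET

def detect_format(content):
    # Step 1: XML / XBRL -- a well-formed XML document parses without raising.
    try:
        ET.fromstring(content)
    except ET.ParseError:
        pass
    else:
        # Placeholder check: XBRL instances declare an XBRL instance namespace.
        if ("http://www.xbrl.org/2003/instance" in content
                or "http://www.xbrl.org/2001/instance" in content):
            return "xbrl"
        return "xml"

    # Step 2: JSON -- json.loads raises on anything that is not a JSON text.
    try:
        json.loads(content)
        return "json"
    except json.JSONDecodeError:
        pass

    # Step 3: CSV -- the Sniffer raises csv.Error if it cannot find a dialect.
    try:
        csv.Sniffer().sniff(content)
        return "csv"
    except csv.Error:
        return "unknown"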
thanks

If you know the JSON will be an object or array, and that the content HAS to be one of those four...
if (content.charAt(0) == "[" || content.charAt(0) == "{") {
    // JSON
} else if (content.charAt(0) == "<") {
    if (content.indexOf("xmlns=\"http://www.xbrl.org/2001/instance\"") >= 0) {
        // XBRL
    } else {
        // XML
    }
} else {
    // CSV ?...
    // first remove strings
    var testCSV = content.replace(/""/g, ""); // remove escaped quotes
    testCSV = testCSV.replace(/".*?"/g, ""); // match-remove quoted strings
    var lines = testCSV.split("\n");
    if (lines.length === 1 && lines[0].split(",").length > 1) {
        // only 1 row, so all we can verify is that there are two or more columns
        // CSV
    } else if (lines.length > 1 && lines[0].split(",").length > 1 && lines[0].split(",").length === lines[1].split(",").length) {
        // multiple lines with the same number of columns
        // CSV
    }
    // otherwise we can't be sure what it is
    // ???
}
The above will give you a reasonable amount of certainty.
EDIT: I added a quick CSV test as well.

I would like to specifically address the difference between XML and XBRL.
XML is a syntax. An XML parser may be tasked with parsing out the elements, checking the elements against a schema, and performing other syntax-level validations against the structure of the document. For the most part, parsing XML is a syntax check against the structure of the document.
XBRL leverages the XML format, so all XBRL documents are also XML documents. However, the XBRL specification goes above and beyond an XML parser to ensure that the semantics of the data encoded in the XML format are correct. An XBRL parser, for example, loads a calculation linkbase, if one is defined, and ensures that the numeric values that participate in the calculation add up correctly as defined by the calculation linkbase. Tools such as Gepsio perform this XBRL-specific semantic check work to ensure that the data encoded in the XML format conforms to all of the rules defined in the XBRL Specification.
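To make the calculation-linkbase point concrete, that part of the semantic check boils down to something like this sketch (the concept names, values, and weights here are invented for illustration; a real processor such as Gepsio reads them from the instance document and its linkbases):

# Invented facts and calculation relationships, purely for illustration.
facts = {"Assets": 150.0, "CurrentAssets": 100.0, "NoncurrentAssets": 50.0}
calculations = {"Assets": [("CurrentAssets", 1.0), ("NoncurrentAssets", 1.0)]}

for parent, children in calculations.items():
    # The weighted sum of the child facts must equal the reported parent value.
    total = sum(weight * facts[child] for child, weight in children)
    if abs(total - facts[parent]) > 1e-6:
        print(f"Inconsistent: {parent}={facts[parent]} but children sum to {total}")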
XBRL is semantic rules against XML-encoded data. Valid XBRL is also valid XML, but the reverse is not necessarily true.

XBRL is no longer seen by users as a "language"; it has become a semantic standard for financial business documents. Initially, XML was widely adopted by companies simply because JSON did not even exist at the time (we are talking about the '90s).
Today, XML is used mainly because of how easily it expresses large amounts of linked data (through XLink, schemas, and linkbases). However, you are not stuck with the XML format; you can use any of these technologies to represent an XBRL file: XML, JSON, or CSV.
If you already have an XBRL-XML file, you can convert it to the XBRL-JSON format with free and open-source tools - e.g.: https://youtu.be/Xr6v4jL535w.

Related

JSON variable indent for different entries

Background: I want to store a dict object in JSON format that has, say, 2 entries:
(1) Some object that describes the data in (2). This is small data, mostly definitions, controlling parameters, and other things (call it metadata) that one would like to read before using the actual data in (2). In short, I want good human readability for this portion of the file.
(2) The data itself, a large chunk that only needs to be machine readable (no need for a human to look it over when opening the file).
Problem: How do I specify a custom indent, say 4, for (1) and None for (2)? If I use something like json.dump(data, trig_file, indent=4) where data = {'meta_data': small_description, 'actual_data': big_chunk}, the large data also gets a lot of whitespace, making the file large.
Assuming you can append json to a file:
Write {"meta_data":\n to the file.
Append the json for small_description formatted appropriately to the file.
Append ,\n"actual_data":\n to the file.
Append the json for big_chunk formatted appropriately to the file.
Append \n} to the file.
The idea is to do the JSON formatting of the outer "container" object by hand, and to use your JSON formatter as appropriate for each of the contained objects.
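A sketch of those steps with Python's standard json module (small_description and big_chunk are the objects from the question; the values here are placeholders):

import json

small_description = {"version": 1, "units": "seconds"}   # placeholder metadata
big_chunk = list(range(10000))                           # placeholder bulk data

with open("out.json", "w") as trig_file:
    trig_file.write('{"meta_data":\n')
    json.dump(small_description, trig_file, indent=4)          # human-readable part
    trig_file.write(',\n"actual_data":\n')
    json.dump(big_chunk, trig_file, separators=(",", ":"))     # compact part
    trig_file.write("\n}")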
Consider a different file format, interleaving keys and values as distinct documents concatenated together within a single file:
{"next_item": "meta_data"}
{
"description": "human-readable content goes here",
"split over": "several lines"
}
{"next_item": "actual_data"}
["big","machine-readable","unformatted","content","here","....."]
That way you can pass any indent parameters you want to each write, and you aren't doing any serialization by hand.
See How do I use the 'json' module to read in one JSON object at a time? for how one would read a file in this format. One of its answers wisely suggests the ijson library, which accepts a multiple_values=True argument.
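For reading, a minimal sketch with ijson (assuming ijson 3.x and that "mixed.json" contains the documents shown above, concatenated):

import ijson

with open("mixed.json", "rb") as f:
    docs = ijson.items(f, "", multiple_values=True)   # yields each top-level document
    for header in docs:
        name = header["next_item"]    # "meta_data" or "actual_data"
        value = next(docs)            # the document that follows its header
        print(name, value)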

Reading JSON in Azure Synapse

I'm trying to understand the code for reading a JSON file in Synapse Analytics. Here's the code provided by the Microsoft documentation:
Query JSON files using serverless SQL pool in Azure Synapse Analytics
select top 10 *
from openrowset(
bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.jsonl',
format = 'csv',
fieldterminator ='0x0b',
fieldquote = '0x0b'
) with (doc nvarchar(max)) as rows
go
I wonder why the format = 'csv'. Is it trying to convert JSON to CSV to flatten the file?
Why they didn't just read the file as a SINGLE_CLOB I don't know
When you use SINGLE_CLOB, the entire file is imported as one value, and the content of the file is not well formed as a single JSON document. Using SINGLE_CLOB would make us do more work after using openrowset before we could use the content as JSON (since it is not valid JSON, we would need to parse the value). It can be done, but it would probably require more work.
The format of the file is multiple JSON documents, each on a separate line - "line-delimited JSON", as the document calls it.
By the way, if you check the history of the document on GitHub, you will find that originally this was not the case. As far as I remember, the file originally included a single JSON document with an array of objects (wrapped with []). Someone named "Ronen Ariely" in fact found this issue in the document, which is why you can see my name in the list of the authors of the document :-)
I wonder why the format = 'csv'. Is it trying to convert json to csv to flatten the hierarchy?
(1) JSON is not a data type in SQL Server; there is no data type named JSON. What we have in SQL Server are tools, such as functions, that work on text and provide support for strings in JSON format. Therefore, we do not CONVERT to or from JSON.
(2) The format parameter has nothing to do with JSON. It specifies that the content of the file is a comma-separated values file. You can (and should) use it whenever your file is well formatted as a comma-separated values file (commonly known as a CSV file).
In this specific sample in the document, the values in the CSV file are strings, each of which is in valid JSON format. Only after the file has been read using openrowset do we start to parse the content of the text as JSON.
Notice that only after the heading "Parse JSON documents" does the document start to talk about parsing the text as JSON.
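The trick is easy to reproduce outside Synapse: pick a field terminator and field quote (0x0b, the vertical-tab character) that never occur in the data, so every line comes back as one untouched string, and only then parse each string as JSON. A rough Python illustration of the same idea (the sample lines are invented):

import csv
import io
import json

# Two line-delimited JSON documents, invented to stand in for ecdc_cases.jsonl.
jsonl = '{"country":"Italy","cases":3}\n{"country":"Spain","cases":5}\n'

# Treat the text as "CSV" whose delimiter (0x0b) never appears and whose quotes
# are not interpreted, so each line is returned as a single untouched field...
reader = csv.reader(io.StringIO(jsonl), delimiter="\x0b", quoting=csv.QUOTE_NONE)
for row in reader:
    doc = json.loads(row[0])   # ...and only then parse that field as JSON
    print(doc["country"], doc["cases"])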

Does JSON to XML lose me anything?

We have a program that accepts as data XML, JSON, SQL, OData, etc. For the XML we use Saxon and its XPath support and that works fantastic.
For JSON we use the jsonPath library which is not as powerful as XPath 3.1. And jsonPath is a little squirrelly in some corner cases.
So... what if we convert the JSON we get to XML and then use Saxon? Are there limitations to that approach? Are there JSON constructs that won't convert to XML, like anonymous arrays?
The headline question: The json-to-xml() function in XPath 3.1 is lossless, except that by default, characters that are invalid in XML (such as NUL, or unpaired surrogates) are replaced by a SUB character -- you can change this behaviour with the option escape=true.
The losslessness has been achieved at some cost in convenience. For example, JSON property names are not translated to XML element or attribute names, but rather to values of the key attribute.
Lots of different people have come up with lots of different conversions of JSON to XML. As already pointed out, the XPath 3.1 and the XSLT 3.0 spec have a loss-less, round-tripping conversion with json-to-xml and xml-to-json that can handle any JSON.
There are simpler conversions that handle limited subsets of JSON; the main problem is how to represent JSON property names that don't map to XML names. For example, { "prop 1" : "value" } is represented by json-to-xml as <string key="prop 1">value</string>, while conversions that try to map the property name to an element or attribute name either fail to create well-formed XML (e.g. <prop 1>value</prop 1>) or have to escape the space in the element name (e.g. <prop_1>value</prop_1>, or with some hex representation of the Unicode code point of the space inserted).
In the end I guess you want to select the property foo in { "foo" : "value" } as foo, which the simple conversion would give you; in XPath 3.1 you would need ?foo for the XDM map or fn:string[@key = 'foo'] for the json-to-xml result format.
With { "prop 1" : "value" } the latter simply remains fn:string[@key = 'prop 1'], while the ? approach needs to be changed to ?('prop 1') or .('prop 1'). Any conversion that has escaped the space in an element name requires you to change the path to e.g. prop_1.
There is no ideal way for all kind of JSON I think, in the end it depends on the JSON formats you expect and the willingness or time of users to learn a new selection/querying approach.
Of course you can use JSON to XML conversions other than json-to-xml and then use XPath 3.1 on any XML format; I think that is what the oXygen guys opted for: they had a JSON to XML conversion before XPath 3.1 provided one and are mainly sticking with it, so in oXygen you can write "path" expressions against JSON, as under the hood the path is evaluated against an XML conversion of the JSON. I am not sure how much effort it takes to indicate which JSON values in the original JSON have been selected by XPath expressions against the XML format; that is probably not easy or straightforward.

Is it true that a JSON document will not be parseable until the last byte is written?

Is the following hypothesis valid?
Leaving aside whitespace, once the first character of a JSON document
has been written, the resulting stream will not parse as valid JSON
until the last character has been written.
I'm interested in using this assumption so that when I have one process writing a file and another reading it, I can safely ignore partially-written files by ignoring anything that doesn't parse as valid JSON.
I'm sure it depends on the parser you are using... it seems that any scrupulous parser would follow that rule due to the structure of JSON: curly brackets around every "object" of key/value pairs, including the wrapping document { }.
As always with programming, test rather than assume.
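A quick way to test it (a Python sketch; json.loads stands in for whatever parser you use). Note that the hypothesis is solid when the top-level value is an object or array, but it breaks down for top-level scalars, where a prefix such as 12 of 123 is already valid JSON on its own:

import json

def parses(text):
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

doc = '{"items": [1, 2, 3]}'
# No proper prefix of this object document parses as valid JSON...
assert not any(parses(doc[:i]) for i in range(1, len(doc)))
assert parses(doc)

# ...but with a bare top-level number, the partial write "12" already parses.
assert parses("12") and parses("123")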

Is it valid to define functions in JSON results?

Part of a website's JSON response had this (... added for context):
{..., now:function(){return(new Date).getTime()}, ...}
Is adding anonymous functions to JSON valid? I would expect that each time you access 'now' it would return a different value.
No.
JSON is purely meant to be a data description language. As noted on http://www.json.org, it is a "lightweight data-interchange format." - not a programming language.
Per http://en.wikipedia.org/wiki/JSON, the "basic types" supported are:
Number (integer, real, or floating point)
String (double-quoted Unicode with backslash escaping)
Boolean (true and false)
Array (an ordered sequence of values, comma-separated and enclosed in square brackets)
Object (collection of key:value pairs, comma-separated and enclosed in curly braces)
null
The problem is that JSON as a data definition language evolved out of JSON as a JavaScript Object Notation. Since JavaScript supports eval on JSON, it is legitimate to put JavaScript code inside JSON (in that use case). If you're using JSON to pass data remotely, then I would say it is bad practice to put methods in the JSON, because you may not have modeled your client-server interaction well. Furthermore, when using JSON as a data description language, you can get yourself into trouble by embedding methods, because some JSON parsers were written with only data description in mind and may not support method definitions in the structure.
The Wikipedia JSON entry makes a good case for not including methods in JSON, citing security concerns:
Unless you absolutely trust the source of the text, and you have a need to parse and accept text that is not strictly JSON compliant, you should avoid eval() and use JSON.parse() or another JSON specific parser instead. A JSON parser will recognize only JSON text and will reject other text, which could contain malevolent JavaScript. In browsers that provide native JSON support, JSON parsers are also much faster than eval. It is expected that native JSON support will be included in the next ECMAScript standard.
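For example, feeding a payload like the one in the question to a strict parser fails immediately (a quick check with Python's json module; JSON.parse in a browser behaves the same way):

import json

payload = '{"now": function(){return(new Date).getTime()}}'
try:
    json.loads(payload)
except json.JSONDecodeError as err:
    print("rejected:", err)   # a function expression is not a JSON value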
Let's quote one of the specs - https://www.rfc-editor.org/rfc/rfc7159#section-12
The JavaScript Object Notation (JSON) Data Interchange Format specification states:
JSON is a subset of JavaScript but excludes assignment and invocation.
Since JSON's syntax is borrowed from JavaScript, it is possible to use that language's "eval()" function to parse JSON texts. This generally constitutes an unacceptable security risk, since the text could contain executable code along with data declarations. The same consideration applies to the use of eval()-like functions in any other programming language in which JSON texts conform to that language's syntax.
So all answers which state that functions are not part of the JSON standard are correct.
The official answer is: No, it is not valid to define functions in JSON results!
The answer could be yes, because "code is data" and "data is code".
Even though JSON is used as a language-independent data serialization format, tunneling "code" through other types will work.
A JSON string might be used to pass a JS function to the client-side browser for execution.
[{"data":[["1","2"],["3","4"]],"aFunction":"function(){return \"foo bar\";}"}]
This leads to questions like: https://stackoverflow.com/questions/939326/execute-javascript-code-stored-as-a-string (how to execute JavaScript code stored as a string).
Be prepared to raise your "eval() is evil" flag and to stick your "do not tunnel functions through JSON" flag next to it.
It is not standard as far as I know. A quick look at http://json.org/ confirms this.
Nope, definitely not.
If you use a decent JSON serializer, it won't let you serialize a function like that. It's a valid OBJECT, but not valid JSON. Whatever that website's intent, it's not sending valid JSON.
JSON explicitly excludes functions because it isn't meant to be a JavaScript-only data structure (despite the JS in the name).
A short answer is NO...
JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.
Look at the reason why:
When exchanging data between a browser and a server, the data can only be text.
JSON is text, and we can convert any JavaScript object into JSON, and send JSON to the server.
We can also convert any JSON received from the server into JavaScript objects.
This way we can work with the data as JavaScript objects, with no complicated parsing and translations.
But wait...
There are still ways to store your function; it's widely discouraged, but still possible:
We said you can save a string... so how about converting your function to a string?
const data = {func: '()=>"a FUNC"'};
Then you can stringify data using JSON.stringify(data) and later parse it with JSON.parse (if that step is needed)...
And use eval to execute the function stored as a string (before doing that, just to let you know, using eval is widely discouraged):
eval(data.func)(); //return "a FUNC"
Using NodeJS (CommonJS syntax) I was able to get this type of functionality working. I originally had just a JSON structure inside an external JS file, but I wanted that structure to be more like a class, with methods that could be decided at run time.
The declaration of 'Executor' in myJSON is not required.
var myJSON = {
    "Hello": "World",
    "Executor": ""
};

module.exports = {
    init: () => { return { ...myJSON, "Executor": (first, last) => { return first + last } } }
};
Function expressions in JSON are entirely possible as long as they are stored as strings; just do not forget to wrap them in double quotes. Here is an example taken from a NoSQL database design:
{
    "_id": "_design/testdb",
    "views": {
        "byName": {
            "map": "function(doc){if(doc.name){emit(doc.name,doc.code)}}"
        }
    }
}
Although eval is not recommended, this works:
<!DOCTYPE html>
<html>
<body>
<h2>Convert a string written in JSON format, into a JavaScript function.</h2>
<p id="demo"></p>
<script>
function test(val){ return val + " it's OK"; }
var someVar = "yup";
var myObj = { "func": "test(someVar);" };
document.getElementById("demo").innerHTML = eval(myObj.func);
</script>
</body>
</html>
Leave the quotes off...
var a = {"b":function(){alert('hello world');} };
a.b();