Delimiter for multiple JSON strings

I'd like to save multiple JSON strings to a file, separated by a delimiter, so that it is easy to read the list back in, split on the delimiter, and work with each JSON document separately.
Serializing using a JSON array is not an option due to external reasons.
I would like to use a delimiter that is illegal in JSON (delimiting with a comma, for example, would be a bad idea since there are commas within the JSON strings).
Are there any characters that are not considered legal in JSON serialized strings?

I know it's not exactly what you asked for, but you can use this SO answer to write the JSON strings to a CSV, then read them back on the other side using a good streaming CSV reader such as this one.
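As a rough illustration of that idea, here is a minimal sketch, assuming Apache Commons CSV (the linked answer and reader may use something different): one JSON document per CSV record, with the CSV layer handling the quoting of commas, quotes and newlines inside the JSON text.

import java.io.FileReader;
import java.io.FileWriter;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVPrinter;
import org.apache.commons.csv.CSVRecord;

public class JsonViaCsv {
    public static void main(String[] args) throws Exception {
        // Write: one JSON document per record; Commons CSV quotes and escapes as needed.
        try (CSVPrinter printer = new CSVPrinter(new FileWriter("docs.csv"), CSVFormat.DEFAULT)) {
            printer.printRecord("{\"some\":\"thing\"}");
            printer.printRecord("{\"foo\":17,\"bar\":false}");
        }
        // Read back: each record's first column is one intact JSON string.
        try (CSVParser parser = CSVFormat.DEFAULT.parse(new FileReader("docs.csv"))) {
            for (CSVRecord record : parser) {
                System.out.println(record.get(0));
            }
        }
    }
}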

NDJSON
Have a look at NDJSON (Newline delimited JSON).
http://ndjson.org/
It seems to me to be exactly how you should do things, though it's not exactly what you asked for. (If you can't flatten your JSON objects into single lines, it's not for you, though!) You asked for a delimiter that is not allowed in JSON. Newline is legal whitespace within a JSON document, but no document needs to contain one, so each object can be serialized onto a single line and the newline becomes a safe separator.
The format is used for log files amongst other things. I discovered it when looking at the Lichess API documentation.
You can start listening in to a broadcast stream of NDJSON data part way through, wait for the next newline character and then start processing objects as and when they arrive.
If you go for NDJSON, you are at least following a standard and I think you'd be hard pressed to find an alternative standard to follow.
Example NDJSON
{"some":"thing"}
{"foo":17,"bar":false,"quux":true}
{"may":{"include":"nested","objects":["and","arrays"]}}

An old question, but hopefully this answer will be useful.
Most JSON parsers choke on the ASCII control character 0x1E, Record Separator (also called Information Separator Two), reporting an "unexpected token": unescaped control characters are not legal anywhere in a JSON document, so to appear inside a string the character would have to be wrapped in an escape (\u001E). That makes the raw character a safe delimiter between documents; the JSON text sequence format (RFC 7464) is built on exactly this idea.
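A minimal sketch of that approach (the file name docs.jsonseq is hypothetical): write the documents separated by the raw 0x1E character and split on it when reading back.

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class RsDelimited {
    private static final String RS = "\u001E"; // ASCII 0x1E, Record Separator

    public static void main(String[] args) throws Exception {
        Path file = Path.of("docs.jsonseq"); // hypothetical file name
        List<String> docs = List.of("{\"some\":\"thing\"}", "{\"foo\":17}");

        // Write: documents separated by the raw RS character.
        Files.writeString(file, String.join(RS, docs), StandardCharsets.UTF_8);

        // Read: split on RS; it can never occur unescaped in valid JSON.
        for (String doc : Files.readString(file, StandardCharsets.UTF_8).split(RS)) {
            System.out.println(doc);
        }
    }
}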

Regex for replacing unnecessary quotation marks within a JSON object containing an array

I am currently trying to format a JSON object using LabVIEW and have run into the issue where it adds additional quotation marks, invalidating my JSON formatting. I have not found a way around this, so I thought formatting the string manually would be enough.
Here is the JSON object that I have:
{
    "contentType":"application/json",
    "content":{
        "msgType":2,
        "objects":"["cat","dog","bird"]",
        "count":3
    }
}
Here is the JSON object I want with the quotation marks removed.
{
    "contentType":"application/json",
    "content":{
        "msgType":2,
        "objects":["cat","dog","bird"],
        "count":3
    }
}
I am still not an expert with regex; using a regex tester I was only able to grab the "objects" and "count" fields, and I feel I would still have to use substrings to remove the quotation marks.
Example I am using (I would use "count" to find the start of the next field and work backwards from there):
"([objects]*)"
Additionally, all the other regexes I have been looking at remove all instances of quotation marks, whereas I only need a specific area trimmed. Thus, I feel a targeted regex replace would be a much more elegant solution.
If there is a better way to go about this I am happy to hear any suggestions!
Your question suggests that the built-in LabVIEW JSON tools are insufficient for your use case.
The built-in library converts LabVIEW clusters to JSON in a one-shot approach: bundle all your data into a cluster and then convert it to JSON.
When it comes to parsing JSON, you use the path input terminal and the default-type terminals to control what data is parsed from a JSON string.
If you need to handle JSON in a manner similar to, say, JavaScript, I would recommend something like the JSONText Toolkit, which is free to use (and distribute) under the BSD licence. It allows more complex and iterative building of JSON strings from LabVIEW types, and has text-path style element access along with many more features.
The Output controls from both my examples are identical, although JSONText provides a handy Pretty Print VI.
After using a regex from one of the comments, I ended up with this one, which allowed me to match the array itself:
(\[(?:"[^"]*"|[^"])+\])
I was able to split the JSON string into before-match, match and after-match parts, remove the quotation marks from the end of the before part and the start of the after part, and concatenate the strings again to form the new output.
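For anyone who wants the same trick outside LabVIEW, here is a minimal sketch in Java of the split-and-trim approach described above (the class name and sample input are illustrative):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteTrimmer {
    // The regex from above: a [...] block that may contain quoted strings.
    private static final Pattern ARRAY = Pattern.compile("(\\[(?:\"[^\"]*\"|[^\"])+\\])");

    static String fixArrayQuotes(String json) {
        Matcher m = ARRAY.matcher(json);
        if (!m.find()) {
            return json; // no array present, nothing to trim
        }
        String before = json.substring(0, m.start());
        String match = m.group(1);
        String after = json.substring(m.end());
        // Remove the stray quote just before '[' and just after ']'.
        if (before.endsWith("\"")) before = before.substring(0, before.length() - 1);
        if (after.startsWith("\"")) after = after.substring(1);
        return before + match + after;
    }

    public static void main(String[] args) {
        String broken = "{\"msgType\":2,\"objects\":\"[\"cat\",\"dog\",\"bird\"]\",\"count\":3}";
        System.out.println(fixArrayQuotes(broken));
        // prints: {"msgType":2,"objects":["cat","dog","bird"],"count":3}
    }
}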

Spark - load numbers from a CSV file with non-US number format

I have a CSV file which I want to convert to Parquet for further processing. Using
sqlContext.read()
.format("com.databricks.spark.csv")
.schema(schema)
.option("delimiter",";")
.(other options...)
.load(...)
.write()
.parquet(...)
works fine when my schema contains only Strings. However, some of the fields are numbers that I'd like to be able to store as numbers.
The problem is that the file is not an actual "CSV" but a semicolon-delimited file, and the numbers are formatted with German notation, i.e. a comma is used as the decimal separator.
For example, what in the US would be 123.01 is stored in this file as 123,01.
Is there a way to force reading the numbers in a different Locale, or some other workaround that would let me convert this file without first converting the CSV to a different format? I looked in the Spark code, and one nasty thing that seems to be causing the issue is in CSVInferSchema.scala line 268 (Spark 2.1.0): the parser enforces US formatting rather than, e.g., relying on the Locale set for the JVM, or allowing this to be configured somehow.
I thought of using a UDT but got nowhere with that - I can't work out how to get it to let me handle the parsing myself (I couldn't really find a good example of using UDTs...).
Any suggestions on a way of achieving this directly, i.e. at the parsing step, or will I be forced to do an intermediate conversion and only then convert it into Parquet?
For anybody else who might be looking for answer - the workaround I went with (in Java) for now is:
JavaRDD<Row> convertedRDD = sqlContext.read()
.format("com.databricks.spark.csv")
.schema(stringOnlySchema)
.option("delimiter",";")
.(other options...)
.load(...)
.javaRDD()
.map(this::conversionFunction);
sqlContext.createDataFrame(convertedRDD, schemaWithNumbers).write().parquet(...);
The conversion function takes a Row and needs to return a new Row with the fields converted to numerical values as appropriate (in fact, it could perform any conversion). Rows in Java can be created with RowFactory.create(newFields).
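A minimal sketch of such a conversion function, under the assumption that column 0 is a name and column 1 holds the German-formatted number (the class name and column layout are illustrative):

import java.io.Serializable;
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;

public class CsvConversion implements Serializable {
    public Row conversionFunction(Row row) {
        try {
            // NumberFormat parses "123,01" as 123.01 under Locale.GERMANY.
            // It is not thread-safe, so create it per call (or per partition
            // via mapPartitions if performance matters).
            NumberFormat german = NumberFormat.getInstance(Locale.GERMANY);
            double value = german.parse(row.getString(1)).doubleValue();
            return RowFactory.create(row.getString(0), value);
        } catch (ParseException e) {
            throw new RuntimeException("Unparseable number in row: " + row, e);
        }
    }
}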
I'd be happy to hear any other suggestions how to approach this but for now this works. :)

CSV parsing nested quotes

I am trying to parse a fairly complex CSV with Apache Spark's CSV reader (https://github.com/databricks/spark-csv), which internally relies on the Apache Commons library.
I tried different combinations of quoteMode and escape but could not get it to work, i.e. prevent the exceptions. Do you have any hints on which parameters would support such a nested structure?
ERROR CsvRelation$: Exception while parsing line: "Gabriella's Song" From The Motion Picture "The Mission";
java.io.IOException: (line 1) invalid char between encapsulated token and delimiter
I know that sed could be used to pre-process the data. However, it would be great if this were integrated into Spark, i.e. if no further pre-processing were needed. I did not find a way to specify a regex or anything similar.
The CSV file looks like:
"Gabriella's Song" From The Motion Picture "The Mission";
This is related to https://github.com/databricks/spark-csv/issues/295
Some more special fields, like & or "Eccoli; attenti ben (Don Pasquale)", cause these problems. We will write our own CSV pre-processor for Apache Camel.
Try this; it worked very well for me.
HDFS file -
spark.read.option("WholeFile", true).option("delimiter", ",").csv(s"hdfs://{my-hdfs-file-path}")
Non-HDFS file -
spark.read.option("WholeFile", true).option("delimiter", ",").csv(my-hdfs-file-path)
The above approach works for any delimited file; just change the delimiter value.
You could also use a regex, but that would be very inefficient for large files.
Hope this is helpful.

Which of the following are valid JSON documents?

{"name":"Fred Flintstone";"occupation":"Miner";"wife":"Wilma"}
{"title":"Star Wars", "quotes":["Use The Force","These are not the Droids you are looking for"],"director":"George Lucas"}
{}
{"city":"New York", "population", 7999034, boros:{"queens", "manhattan", "staten island", "the bronx", "brooklyn"}}
{"a":1, "b":{"b":1, "c":"foo", "d":"bar", "e":[1,2,4]}}
This is obviously homework, so I will try to help you come to the correct solution yourself rather than just hand it to you.
Look up which character is used to separate the individual key/value pairs from each other in a JSON document. One of the documents is using the wrong character.
Look up the difference between objects and arrays in JSON. What's the difference and which characters are used to mark the beginning and end of either? The author of one of the documents tried to create an array, but uses the syntax for objects.
The official JSON specification can serve as a reference for you.
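For reference, and deliberately not using any of the homework documents, a minimal valid object and a minimal valid array look like this:

{"name":"Fred", "age":42}
["a", "b", "c"]

Compare the separators and brackets above against the five documents and the invalid ones should stand out.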

Is it true that a JSON document will not be parseable until the last byte is written?

Is the following hypothesis valid?
Leaving aside whitespace, once the first character of a JSON document
has been written, the resulting stream will not parse as valid JSON
until the last character has been written.
I'm interested in using this assumption so that when I have one process writing a file and another reading it, I can safely ignore partially-written files by ignoring anything that doesn't parse as valid JSON.
I'm sure it depends on the parser you are using... it seems that any scrupulous parser would follow that rule due to the structure of JSON: curly brackets wrap every object, including the outermost document, so the closing } only arrives with the last byte. Note, though, that a top-level scalar breaks the hypothesis: while writing the number 123, the partial stream 12 already parses as valid JSON.
As always with programming, test rather than assume.
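If you do go the "ignore anything that doesn't parse" route, here is a minimal sketch, assuming the Jackson library; enabling FAIL_ON_TRAILING_TOKENS also rejects files with leftover data after the first value:

import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonCompleteness {
    private static final ObjectMapper MAPPER = new ObjectMapper()
            // Without this, Jackson stops after the first complete value
            // and would accept "{}garbage" as valid.
            .enable(DeserializationFeature.FAIL_ON_TRAILING_TOKENS);

    static boolean parsesAsJson(String fileContents) {
        try {
            MAPPER.readTree(fileContents);
            return true;  // complete, well-formed document
        } catch (Exception e) {
            return false; // partially written or malformed: skip for now
        }
    }
}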