How to parse JSONB array of strings in Ruby - json

I found many answers for parsing JSON or JSONB but it seems PostgreSQL has its own way of doing things.
I have a CSV dump from PostgreSQL that I want to parse in Ruby.
A column of the dump was made with array_agg. It is an array of strings. Unfortunately, PostgreSQL decided not to double-quote some values. It may believe they are numbers.
require 'json'
original_names = "[\"1800 Reposado 750ML\",1800Jalisco750ml,\"1800 Tequila Reposado, Jalisco, Mexico (750ml)\"]"
array = JSON.parse original_names
# Raises: 434: unexpected token at 'Jalisco750ml,"...' (JSON::ParserError)
I have tried to add the quotes myself, but it fails because of the other string that contain commas.
array = JSON.parse original_names.gsub(/,([^"|,]+),/, ',"\1",')
# Raises: 434: unexpected token at 'Jalisco", Mexico (750ml)"]' (JSON::ParserError)
I spent an hour trying workarounds and now I believe I need to implement the parser myself. Does anyone have a better way?

"Array_agg in postgres selectively quotes" shows that PostgreSQL's json_agg returns proper JSON, which prevents my issue.

Related

Regex for replacing unnecessary quotation marks within a JSON object containing an array

I am currently trying to format a JSON object using LabVIEW and have ran into the issue where it adds additional quotation marks invalidating my JSON formatting. I have not found a way around this so I thought just formatting the string manually would be enough.
Here is the JSON object that I have:
{
"contentType":"application/json",
"content":{
"msgType":2,
"objects":"["cat","dog","bird"]",
"count":3
}
}
Here is the JSON object I want with the quotation marks removed.
{
"contentType":"application/json",
"content":{
"msgType":2,
"objects":["cat","dog","bird"],
"count":3
}
}
I am still not an expert with regex and using a regex tester I was only able to grab the "objects" and "count" fields but I would still feel I would have to utilize substrings to remove the quotation marks.
Example I am using (would use a "count" to find the start of the next field and work backwards from there)
"([objects]*)"
Additionally, all the other Regex I have been looking at removes all instances of quotation marks whereas I only need a specific area trimmed. Thus, I feel that a specific regex replace would be a much more elegant solution.
If there is a better way to go about this I am happy to hear any suggestions!
Your question suggests that the built-in LabVIEW JSON tools are insufficient for your use case.
The built-in library converts LabVIEW clusters to JSON in a one-shot approach. Bundle all your data into a cluster and then convert it to JSON.
When it comes to parsing JSON, you use the path input terminal and the default type terminals to control what data is parsed from a JSON string.
If you need to handle JSON in a manner similar to say JavaScript, I would recommend something like the JSONText Toolkit which is free to use (and distribute) under the BSD licence. This allows more complex and iterative building of JSON strings from LabVIEW types and has text-path style element access along with many more features.
The Output controls from both my examples are identical - although JSONText provides a handy Pretty Print vi.
After using a regex from one of the comments, I ended up with this regex which allowed me to match the array itself.
(\[(?:"[^"]*"|[^"])+\])
I was able to split the the JSON string into before match, match and after match and removed the quotation marks from the end of 'before match' and start of 'after match' and concatenated the strings again to form a new output.

How to prevent adding backslash to JSON string

I would like to read events from eventhub using Databricks, events are in json format but they can have different schema (it's important because i find solutions in which the schema was given to from_json(jsonStr,schema) function, but i cannot use it in my use case). When i use
.withColumn('Value', col('value').cast(StringType() in dataframe returns json output with backslashes "{\"time\": 1432826855000,\"host\":...... .
I found a solution How to prevent spark sql with kafka from adding backslash to JSON string in dataframe but in Delta Live Tables framework we create streaming tables by returning a dataframe, so i cant use this solution.
Should i use non pyspark functions in etl process such as
How to remove backslash from decoded JSON string? ?
Will it be efficient during streaming from eventhub to bronze?
You shouldn't worry about that backslashes - it's just a visual representation of your string when you display data and it has " character embedded into a string. Internally, data will be stored without backslashes, like: {"time": 1432826855000,"host":.......

Processing JSON from a .txt file and converting to a DataFrame in Julia

Cross posting from Julia Discourse in case anyone here has any leads.
I’m just looking for some insight into why the below code is returning a dataframe containing just the first line of my json file. If you’d like to try working with the file I’m working with, you can download the aminer_papers_0.zip from the Microsoft Open Academic Graph site, I’m using the first file in that group of files.
using JSON3, DataFrames, CSV
file_name = "path/aminer_papers_0.txt"
json_string = read(file_name, String)
js = JSON3.read(json_string)
df = DataFrame([js])
The resulting DataFrame has just one line, but the column titles are correct, as is the first line. To me the mystery is why the rest isn’t getting processed. I think I can rule out that read() is only reading the first JSON object, because I can index into the resulting object and see many JSON objects:
enter image description here
My first guess was maybe the newline \n was causing escape issues, and tried to use chomp to get rid of them, but couldn’t get it to work.
Anyway - any help would be greatly appreciated!
I think the problem is that the file is in JSON Lines format, and the JSON3 library only returns the first valid JSON value that it finds at the start of a string unless told otherwise.
tl;dr
Call JSON3.read with the keyword argument jsonlines=true.
Why?
By default, JSON3 interprets a string passed to its read function as a single "JSON text", defined by RFC 8259 section 1.3.2:
A JSON text is a serialized value....
(My emphasis on the use of the indefinite singular article "a.") A "JSON value" is defined in section 1.3.3:
A JSON value MUST be an object, array, number, or string, or one of the following three literal names: false, null, true.
A string with multiple JSON values in it is technically multiple "JSON texts." It is up to the parser to determine what part of the string argument you give it is a JSON text, and the authors of JSON3 chose as the default behavior to parse from the start of the string to the end of the first valid JSON value.
In order to get JSON3 to read the string as multiple JSON values, you have to give it the keyword option jsonlines=true, which is documented as:
jsonlines: A Bool indicating that the json_str contains newline delimited JSON strings, which will be read into a JSON3.Array of the JSON values. See jsonlines for reference. [default false]
Example
Take for example this simple string:
two_values = "3.14\n2.72"
Each one of these lines is a valid JSON serialization of a number. However, when passed to JSON3.read, only the first is parsed:
using JSON3
#assert JSON3.read(two_values) == 3.14
Using jsonlines=true, both values are parsed and returned as a JSON3.Array struct:
#assert JSON3.read(two_values, jsonlines=true) == [3.14, 2.72]
Other Packages
The JSON.jl library, which people might use by default given the name, does not implement parsing of JSON Lines strings at all, leaving it up to the caller to properly split the string as needed:
using JSON
JSON.parse(two_values)
# ERROR: Expected end of input
# Line: 1
# Around: ...3.14 2.72...
# ^
A simple way to implement reading multiple values is to use eachline:
#assert [JSON.parse(line) for line in eachline(IOBuffer(two_values))] == [3.14, 2.72]

'JSON.parse' replaces ':' with '=>'

I have a string:
{"name":"hector","time":"1522379137221"}
I want to parse the string into JSON and expect to get:
{"name":"hector","time":"1522379137221"}
I am doing:
require 'json'
JSON.parse
which produces this:
{"name"=>"hector","time"=>"1522379137221"}
Can someone tell me how I can keep :? I don't understand why it adds =>.
After you parse the json data you should see it in the programming language that you are using.
Ruby uses => to separated the key from the value in hash (while json uses :).
So the ruby output is correct and the data is ready for you to manipute in your code. When you convert your hash to json, the json library will convert the => back to :.
JSON does not have symbol class. Hence, nothing in JSON data corresponds to Ruby symbol. Under a trivial conversion from JSON to Ruby like JSON.parse, you cannot have a symbol in the output.

Delimiter for multiple json strings

I'd like to save multiple json strings to a file and separate them by a delimiter, such that it will be easy to read this list in, split on the delimiter and work with each json doc separately.
Serializing using a json array is not an option due to external reasons.
I would like to use a delimiter that is illegal in JSON (e.g. delimiting using a comma would be a bad idea since there are commas within the json strings).
Are there any characters that are not considered legal in JSON serialized strings?
I know it's not exactly what you needed, but you can use this SO answer to write the json string to a CSV, then read it on the other side by using a good streaming CSV reader such as this one
NDJSON
Have a look at NDJSON (Newline delimited JSON).
http://ndjson.org/
It seems to me to be exactly how you should do things, though its not exactly what you asked for. (If you can't flatten your JSON objects into single lines then it's not for you though!) You asked for a delimiter that is not allowed in JSON. Newline is allowed in JSON, but it is not necessary for JSON to contain newlines.
The format is used for log files amongst other things. I discovered it when looking at the Lichess API documentation.
You can start listening in to a broadcast stream of NDJSON data part way through, wait for the next newline character and then start processing objects as and when they arrive.
If you go for NDJSON, you are at least following a standard and I think you'd be hard pressed to find an alternative standard to follow.
Example NDJSON
{"some":"thing"}
{"foo":17,"bar":false,"quux":true}
{"may":{"include":"nested","objects":["and","arrays"]}}
An old question, but hopefully this answer will be useful.
Most JSON readers crash on this character: , which is information separator two. They declare it "unexpected token", so I guess it has to be wrapped to pass or something.