How can I use serde_json to deserialize arbitrary JSON into a Value-like object of raw bytes?

I'm writing a library to deserialize a subset of JSON into predefined Python types.
I want to deserialize arbitrary JSON into an object that quacks like serde_json's Value. However, I don't want it to deserialize into Strings, Numbers, and Bools. Instead, when the deserializer hits one of these, I would prefer that it simply keep a reference to the respective byte string, so I can efficiently (i.e. without the additional type conversion) parse the byte strings into the correct arbitrary Python types. Something like this:
use serde::Deserialize;
use serde_json::value::RawValue;
use serde_json::Map;

#[derive(Deserialize)]
pub enum MyValue<'a> {
    Null,
    Bytes(&'a RawValue),
    Array(Vec<MyValue<'a>>),
    Object(Map<String, MyValue<'a>>),
}
This will require implementing a lot of traits so that it behaves like Value, and I'm not even sure whether it will deserialize the structural parts at all or just put everything into a RawValue.
What is the cleanest way to do this?

Related

A better solution to validate JSON unmarshal to nested structs

There appear to be few options for validating the source JSON when unmarshalling into a struct. By validate I mean 3 main things:
a required field exists in the JSON
the field is the correct type (e.g. don't force a string into an integer)
the field contains a valid value (value range / enum)
For nested structs, I simply mean where an attribute in one struct has the type of another struct:
type Example struct {
	Attr1 int        `json:"attr1"`
	Attr2 ExampleToo `json:"attr2"`
}
type ExampleToo struct {
	Attr3 int `json:"attr3"`
}
And this JSON would be valid:
{"attr1": 5, "attr2": {"attr3": 0}}
To keep this simple, I'll focus on integers. The concept of "zero values" is the first issue. I could create an UnmarshalJSON method, which is detected by JSON packages, including the standard encoding/json package. The problem with this approach is that, as far as I can tell, it does not support nested structs: if ExampleToo has an UnmarshalJSON method, the ExampleToo.UnmarshalJSON() method is never called when unmarshalling into an Example object. It would be possible to write an Example.UnmarshalJSON() method that handles validation recursively, but that seems extremely complex, especially if ExampleToo is reused in many places.
There appear to be some packages, like go-playground/validator, where validation can be specified both as functions and as tags. However, these work on the resulting struct, not on the JSON itself. So if an integer field is tagged validate:"required" and its value is 0, validation returns an error, because 0 is both a valid value and the "zero value" for integers.
An example of the latter here: https://go.dev/play/p/zqSUksPzUiq
I could also use pointers for everything, checking for nil as missing values. The main problem with that is that it requires dereferencing on each use and is a pretty uncommon practice for things like integers and strings.
One thing that I have also considered is a "sister struct" that uses pointers to do validation for required fields. The process would basically be to write a validation method for each struct, then validate that sister struct. If it works, then deserialize the main struct (without pointers). I haven't started on this, just a concept I've thought about, but I'm hoping there are better validation options.
So... is there a better way to do JSON/YAML input validation on nested structs? I'm happy to mix methods where say UnmarshalJSON is used for doing some work like verifying fields exist, but I'd like to pass that back to the library to let it continue to call UnmarshalJSON on subsequent nested structs. I'd also rather defer to the JSON library for casting values into the struct, etc.
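The "sister struct" idea described above could be sketched roughly as follows. This is only a minimal illustration, assuming the Example and ExampleToo types from the question; the exampleCheck and exampleTooCheck types, the validate method, and the decodeExample helper are hypothetical names introduced here. Note that with pointer fields, an explicit JSON null is indistinguishable from a missing key.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

type ExampleToo struct {
	Attr3 int `json:"attr3"`
}

type Example struct {
	Attr1 int        `json:"attr1"`
	Attr2 ExampleToo `json:"attr2"`
}

// Hypothetical "sister" structs: same JSON keys, but pointer fields,
// so a key that is absent from the input leaves the field nil.
type exampleTooCheck struct {
	Attr3 *int `json:"attr3"`
}

type exampleCheck struct {
	Attr1 *int             `json:"attr1"`
	Attr2 *exampleTooCheck `json:"attr2"`
}

// validate reports the first required field that was not present.
func (c *exampleCheck) validate() error {
	if c.Attr1 == nil {
		return errors.New("attr1 is required")
	}
	if c.Attr2 == nil {
		return errors.New("attr2 is required")
	}
	if c.Attr2.Attr3 == nil {
		return errors.New("attr2.attr3 is required")
	}
	return nil
}

// decodeExample validates with the sister struct first, then lets
// encoding/json fill the pointer-free struct as usual.
func decodeExample(data []byte) (Example, error) {
	var check exampleCheck
	if err := json.Unmarshal(data, &check); err != nil {
		return Example{}, err
	}
	if err := check.validate(); err != nil {
		return Example{}, err
	}
	var out Example
	err := json.Unmarshal(data, &out)
	return out, err
}

func main() {
	ex, err := decodeExample([]byte(`{"attr1": 5, "attr2": {"attr3": 0}}`))
	fmt.Println(ex, err)

	_, err = decodeExample([]byte(`{"attr1": 5, "attr2": {}}`))
	fmt.Println(err)
}
```

The obvious cost of this approach is that the input is parsed twice and every struct needs a hand-maintained twin, which is the complexity the question is trying to avoid.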

What's the difference between UnmarshalText and UnmarshalJSON?

In decode.go, it mentions:
// To unmarshal JSON into a value implementing the Unmarshaler interface,
// Unmarshal calls that value's UnmarshalJSON method, including
// when the input is a JSON null.
// Otherwise, if the value implements encoding.TextUnmarshaler
// and the input is a JSON quoted string, Unmarshal calls that value's
// UnmarshalText method with the unquoted form of the string.
What are the differences between UnmarshalText and UnmarshalJSON? Which one is preferred?
Simply:
UnmarshalText unmarshals a text-encoded value.
UnmarshalJSON unmarshals a JSON-encoded value.
Which is preferred depends on what you're doing.
JSON encoding is defined by RFC 8259 (which obsoletes RFC 7159). If you're consuming or producing JSON documents, you should use JSON encoding.
Text encoding has no standard, and is entirely implementation-dependent. Go implements Text-(un)marshalers for a few types, but there's no guarantee that any other application will understand these formats.
Text-encoding is most commonly used for things like URL query parameters, HTML forms, or other loosely-defined formats.
If you have a choice in the matter, using JSON is probably the better way to go. But again, what makes the most sense depends on what you're doing.
As it relates to Go's JSON unmarshaler: it will call a type's UnmarshalJSON method if one is defined, and otherwise fall back to UnmarshalText if that is defined and the input is a JSON quoted string.
If you know you'll be using JSON, you should absolutely define an UnmarshalJSON function.
You would generally create an UnmarshalText only if you expected it to be used in non-JSON contexts, with the added benefit that the JSON unmarshaler would also use it, without having to duplicate it (if indeed the same implementation would work for JSON).
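As a rough sketch of that last point, the following defines only UnmarshalText on a type and uses it both directly and through encoding/json. The Level type, the Config struct, and the parseConfig helper are hypothetical names made up for this illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Level is a hypothetical type used only for illustration.
type Level int

const (
	Low Level = iota
	High
)

// UnmarshalText parses the plain-text form ("low" or "high").
func (l *Level) UnmarshalText(text []byte) error {
	switch strings.ToLower(string(text)) {
	case "low":
		*l = Low
	case "high":
		*l = High
	default:
		return fmt.Errorf("unknown level %q", text)
	}
	return nil
}

type Config struct {
	Level Level `json:"level"`
}

// parseConfig decodes JSON. Level has no UnmarshalJSON method, so
// encoding/json passes the unquoted string to Level.UnmarshalText.
func parseConfig(data string) (Config, error) {
	var c Config
	err := json.Unmarshal([]byte(data), &c)
	return c, err
}

func main() {
	// Non-JSON context: call UnmarshalText directly, e.g. on a query parameter.
	var l Level
	err := l.UnmarshalText([]byte("high"))
	fmt.Println(l, err)

	// JSON context: the same method is reused by the JSON unmarshaler.
	c, err := parseConfig(`{"level": "high"}`)
	fmt.Println(c.Level, err)
}
```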
Per the documentation:
To unmarshal JSON into a value implementing the Unmarshaler interface,
Unmarshal calls that value's UnmarshalJSON method, including when the
input is a JSON null. Otherwise, if the value implements
encoding.TextUnmarshaler and the input is a JSON quoted string,
Unmarshal calls that value's UnmarshalText method with the unquoted
form of the string.
Meaning: if you want to take some JSON and unmarshal it with some custom logic, you would use UnmarshalJSON. If you want to take the text in a string field of a JSON document and decode that in some special way (i.e. parse it rather than just write it into a string-typed field), you would use UnmarshalText. For example, net.IP implements UnmarshalText so that you can provide a string value like "ipAddress": "1.2.3.4" and unmarshal it into a net.IP field. If net.IP did not implement UnmarshalText, you would only be able to unmarshal the JSON representation of the underlying type ([]byte).
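The net.IP case can be tried out with a short sketch like the one below. The Host struct and the parseHost helper are hypothetical names chosen for this example; net.IP and its UnmarshalText method come from the standard library.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
)

// Host is a hypothetical struct used only for illustration.
type Host struct {
	IPAddress net.IP `json:"ipAddress"`
}

// parseHost decodes the JSON document and returns the parsed address
// in its canonical string form.
func parseHost(data string) (string, error) {
	var h Host
	if err := json.Unmarshal([]byte(data), &h); err != nil {
		return "", err
	}
	return h.IPAddress.String(), nil
}

func main() {
	// The quoted string is handed to net.IP's UnmarshalText.
	fmt.Println(parseHost(`{"ipAddress": "1.2.3.4"}`))

	// A string that is not an IP address is rejected by UnmarshalText.
	fmt.Println(parseHost(`{"ipAddress": "not-an-ip"}`))
}
```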

UnmarshalJSON any type from []interface{} possible?

When you unmarshal JSON to []interface{} is there any way to automatically detect the type besides some standard types like bool, int and string?
What I noticed is the following. Let's say I marshal [uuid.UUID, bool]; then the JSON I get looks like:
[[234,50,7,116,194,41,64,225,177,151,60,195,60,45,123,106],true]
When I unmarshal it again, I get the types as shown through reflect:
[]interface{}, bool
I don't understand why it picked []interface{}. If it cannot detect it, shouldn't it be at least interface{}?
In any case, my question is, is it possible to unmarshal any type when the target is of type []interface{}? It seems to work for standard types like string, bool, int but for custom types I don't think that's possible, is it? You can define custom JSON marshal/unmarshal methods but that only works if you decode it into a target type so that it can look up which custom marshal/unmarshal methods to use.
You can unmarshal any type into a value of type interface{}. If you use a value of type []interface{}, you can only unmarshal JSON arrays into it, but yes, the elements of the array may be of any type.
Since you're using interface{} or []interface{}, yes, type information is not available, and it's up to the encoding/json package to choose the best it sees fit. For example, for JSON objects it will choose map[string]interface{}. The full list of default types is documented in json.Unmarshal():
To unmarshal JSON into an interface value, Unmarshal stores one of these in the interface value:
bool, for JSON booleans
float64, for JSON numbers
string, for JSON strings
[]interface{}, for JSON arrays
map[string]interface{}, for JSON objects
nil for JSON null
Obviously if your JSON marshaling/unmarshaling logic needs some pre- / postprocessing, the json package will not miraculously find it out. It can know about those only if you unmarshal into values of specific types (which implement json.Unmarshaler). The json package will still be able to unmarshal them to the default types, but custom logic will obviously not run on them.
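To make the default mapping concrete, here is a small sketch that inspects what ends up in a []interface{} and then contrasts it with decoding into specific target types. The helper names kindOf, elementKinds, and decodePair are hypothetical, and [16]byte merely stands in for a UUID type that marshals as an array of numbers.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// kindOf names the default Go type that encoding/json stored in v.
func kindOf(v interface{}) string {
	switch v.(type) {
	case nil:
		return "nil"
	case bool:
		return "bool"
	case float64:
		return "float64"
	case string:
		return "string"
	case []interface{}:
		return "[]interface{}"
	case map[string]interface{}:
		return "map[string]interface{}"
	default:
		return "unknown"
	}
}

// elementKinds unmarshals a JSON array into []interface{} and reports
// the type chosen for each element.
func elementKinds(data string) ([]string, error) {
	var items []interface{}
	if err := json.Unmarshal([]byte(data), &items); err != nil {
		return nil, err
	}
	kinds := make([]string, len(items))
	for i, item := range items {
		kinds[i] = kindOf(item)
	}
	return kinds, nil
}

// decodePair recovers specific types by deferring element decoding
// with json.RawMessage and then unmarshalling into typed targets.
func decodePair(data string) ([16]byte, bool, error) {
	var id [16]byte
	var flag bool
	var raw []json.RawMessage
	if err := json.Unmarshal([]byte(data), &raw); err != nil {
		return id, flag, err
	}
	if len(raw) != 2 {
		return id, flag, fmt.Errorf("expected 2 elements, got %d", len(raw))
	}
	if err := json.Unmarshal(raw[0], &id); err != nil {
		return id, flag, err
	}
	err := json.Unmarshal(raw[1], &flag)
	return id, flag, err
}

func main() {
	input := `[[234,50,7,116,194,41,64,225,177,151,60,195,60,45,123,106],true]`
	fmt.Println(elementKinds(input))
	fmt.Println(decodePair(input))
}
```

The second function only works because the caller already knows which type belongs at which position; the JSON itself carries no such information.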

name json variable and jsonString variable convention?

"JSON" could mean a JSON type or a JSON string.
It starts to confuse me when different libraries use "json" with two different meanings.
I wonder how other people name those variables differently.
For all practical purposes, "JSON" has exactly one meaning, which is a string representing a JavaScript object following certain specific syntax.
JSON is parsed into a JavaScript object using JSON.parse, and a JavaScript object is converted into a JSON string using JSON.stringify.
The problem is that all too many people have gotten into the bad habit of referring to plain old JavaScript objects as JSON. That is either confused or sloppy or both. {a: 1} is a JS object. '{"a": 1}' is a JSON string.
In the same vein, many people use variable names like json to refer to JavaScript objects derived from JSON. For example:
$.getJSON('foo.php').then(function(json) { ...
In the above case, the variable name json is ill-advised. The actual payload returned from the server is a JSON string, but internally $.getJSON has already transformed that into a plain old JavaScript object, which is what is being passed to the then handler. Therefore, it would be preferable to use the variable name data, for example.
If a library uses the term "json" to refer to things which are not JSON, but actually are JavaScript objects, it is a mark of poor design, and I'd suggest looking around for a different library.

How to convert between BSON and JSON, especially for those special objects?

I am not asking for a library to do this; I am writing the code for bson_to_json and json_to_bson myself.
So here is the BSON specification.
For regular double, doc, array, string, it is fine and it is easy to convert between BSON and JSON.
However, for the special objects, such as:
Timestamp and UTC datetime:
When converting from JSON to BSON, how can I know that a value is a timestamp or a UTC datetime?
Regex (string, string) and JavaScript code with scope (string, doc):
Their structures have multiple parts, so how can I represent them in JSON?
Binary data (generic, function, etc.):
How can I represent the type of binary data in JSON?
int32 and int64:
How can I represent them in JSON so that the BSON side can know which is 32-bit and which is 64-bit?
Thanks
As we know, JSON cannot express objects, so you will need to decide how you want the stringified version of the BSON objects (field types) to be represented in the output of your OCaml driver.
Some of the data types are easy. Timestamp is not needed, since it is internal to sharding only, and JavaScript blocks are best left out, since they are best used only within system.js as saved functions for use in MRs.
You also gotta consider that some of these fields are actually both in and out. What I mean by in and out is that some are used to specify input documents to be serialised to BSON and some are part of output document that need deserialising from BSON into JSON.
Regex is one which will most likely be a field type you send down. As such, you will need to serialise your OCaml object to the BSON equivalent of {$regex: 'd', $options: 'ig'} from the /d/ig PCRE representation.
Dates can be represented in JSON either as an ISODate string or as a timestamp. The output will be something like {$sec:556675,$usec:6787} and you can convert $sec to the display you need.
Binary data in JSON can be represented by taking the data (if I remember right) property from the output document, encoding it as base64, and storing it as a string in the field.
int32 and int64 have no real distinction in JSON, except that 64-bit ints can be bigger than 2147483647, so I am unsure whether you can keep the data types distinct there.
That should help get you started.
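One way to keep the integer width and the binary subtype through a JSON round trip is to wrap such values in small tagged objects. The sketch below is only an illustration of that idea and is not part of the answer above: the key names ($numberInt, $numberLong, $binary with base64 and subType) are borrowed from MongoDB's Extended JSON convention, and the helper functions wrapInt32, wrapInt64, wrapBinary, intWidth, and toJSON are hypothetical names introduced here.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"strconv"
)

// wrapInt32 tags a value as a 32-bit integer.
func wrapInt32(v int32) map[string]string {
	return map[string]string{"$numberInt": strconv.FormatInt(int64(v), 10)}
}

// wrapInt64 tags a value as a 64-bit integer.
func wrapInt64(v int64) map[string]string {
	return map[string]string{"$numberLong": strconv.FormatInt(v, 10)}
}

// wrapBinary stores the payload as base64 next to its BSON subtype byte.
func wrapBinary(data []byte, subtype byte) map[string]interface{} {
	return map[string]interface{}{
		"$binary": map[string]string{
			"base64":  base64.StdEncoding.EncodeToString(data),
			"subType": fmt.Sprintf("%02x", subtype),
		},
	}
}

// intWidth reports 32 or 64 for a tagged integer, or 0 if the JSON
// text is not one of the integer wrappers.
func intWidth(jsonText string) int {
	var obj map[string]string
	if err := json.Unmarshal([]byte(jsonText), &obj); err != nil {
		return 0
	}
	if _, ok := obj["$numberInt"]; ok {
		return 32
	}
	if _, ok := obj["$numberLong"]; ok {
		return 64
	}
	return 0
}

// toJSON marshals v and returns an empty string on failure.
func toJSON(v interface{}) string {
	out, err := json.Marshal(v)
	if err != nil {
		return ""
	}
	return string(out)
}

func main() {
	fmt.Println(toJSON(wrapInt32(7)))
	fmt.Println(toJSON(wrapInt64(7)))
	fmt.Println(toJSON(wrapBinary([]byte{1, 2, 3}, 0x00)))
	fmt.Println(intWidth(toJSON(wrapInt64(7))))
}
```

The same wrapping approach would extend to regexes and dates, at the cost of making the JSON output more verbose than a plain document.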