What's the difference between UnmarshalText and UnmarshalJson? - json

In decode.go, it mentions:
// To unmarshal JSON into a value implementing the Unmarshaler interface,
// Unmarshal calls that value's UnmarshalJSON method, including
// when the input is a JSON null.
// Otherwise, if the value implements encoding.TextUnmarshaler
// and the input is a JSON quoted string, Unmarshal calls that value's
// UnmarshalText method with the unquoted form of the string.
What are the differences between UnmarshalText and UnmarshalJSON? Which one is preferred?

Simply:
UnmarshalText unmarshals a text-encoded value.
UnmarshalJSON unmarshals a JSON-encoded value.
Which is preferred depends on what you're doing.
JSON encoding is defined by RFC 7159. If you're consuming or producing JSON documents, you should use JSON encoding.
Text encoding has no standard, and is entirely implementation-dependent. Go implements Text-(un)marshalers for a few types, but there's no guarantee that any other application will understand these formats.
Text-encoding is most commonly used for things like URL query parameters, HTML forms, or other loosely-defined formats.
If you have a choice in the matter, using JSON is probably a better way to go. But again, it depends on what you're doing what makes the most sense.
As it relates to Go's JSON unmarshaler, the JSON unmarshaler will call a type's UnmarshalJSON method, if it's defined, and fall back to UnmarshalText if that is defined.
If you know you'll be using JSON, you should absolutely define an UnmarshalJSON function.
You would generally create an UnmarshalText only if you expected it to be used in non-JSON contexts, with the added benefit that the JSON unmarshaler would also use it, without having to duplicate it (if indeed the same implementation would work for JSON).

Per the documentation:
To unmarshal JSON into a value implementing the Unmarshaler interface,
Unmarshal calls that value's UnmarshalJSON method, including when the
input is a JSON null. Otherwise, if the value implements
encoding.TextUnmarshaler and the input is a JSON quoted string,
Unmarshal calls that value's UnmarshalText method with the unquoted
form of the string.
Meaning: if you want to take some JSON and unmarshal it with some custom logic, you would use UnmarshalJSON. If you want to take the text in a string field of a JSON document and decode that in some special way (i.e. parse it rather than just write it into a string-typed field), you would use UnmarshalText. For example, net.IP implements UnmarshalText so that you can provide a string value like "ipAddress": "1.2.3.4" and unmarshal it into a net.IP field. If net.IP did not implement UnmarshalText, you would only be able to unmarshal the JSON representation of the underlying type ([]byte).

Related

A better solution to validate JSON unmarshal to nested structs

There appears to be few options to validate the source JSON used when unmarshalling to a struct. By validate I mean 3 main things:
a required field exists in the JSON
the field is the correct type (e.g. don't force a string into an integer)
the field contains a valid value (value range / enum)
For nested structs, I simply mean where an attribute in one struct has the type of another struct:
type Example struct {
Attr1 int `json:"attr1"`
Attr2 ExampleToo `json:"attr2"`
}
type ExampleToo struct {
Attr3 int `json:"attr3"`
}
And this JSON would be valid:
{"attr1": 5, "attr2": {"attr3": 0}}
To keep this simple, I'll focus simply on integers. The concept of "zero values" is the first issue. I could create an UnmarshalJSON method, which is detected by JSON packages, including the standard encoding/json package. The problem with this approach is that is that is does not support nested structs. If ExampleToo has an UnmarshalJSON method, the ExampleToo.UnmarshalJSON() method is never called if unmarshalling to an Example object. It would be possible to write a method Example.UnmarshalJSON() that recursively handled validation, but that seems extremely complex, especially if ExampleToo is reused in many places.
So there appears to be some packages like the go-playground/validator where validation can be specified both as functions and tags. However, this works on the struct created, and not the JSON itself. So if a field is tagged as validation:"required" on an integer, and the integer value is 0, this will return an error because 0 is both a valid value and the "zero value" for integers.
An example of the latter here: https://go.dev/play/p/zqSUksPzUiq
I could also use pointers for everything, checking for nil as missing values. The main problem with that is that it requires dereferencing on each use and is a pretty uncommon practice for things like integers and strings.
One thing that I have also considered is a "sister struct" that uses pointers to do validation for required fields. The process would basically be to write a validation method for each struct, then validate that sister struct. If it works, then deserialize the main struct (without pointers). I haven't started on this, just a concept I've thought about, but I'm hoping there are better validation options.
So... is there a better way to do JSON/YAML input validation on nested structs? I'm happy to mix methods where say UnmarshalJSON is used for doing some work like verifying fields exist, but I'd like to pass that back to the library to let it continue to call UnmarshalJSON on subsequent nested structs. I'd also rather defer to the JSON library for casting values into the struct, etc.

How can I use serde_json to deserialize arbitrary JSON into a Value-like object of raw bytes?

I'm writing a library to deserialize a subset of JSON into predefined Python types.
I want to deserialize arbitrary JSON into an object that quacks like serde-json's Value. However, I don't want it to deserialize into String's, Number's and Bool's - instead when the deserializer hits one of these I would prefer it simply keeps a reference to the respective byte string so I can efficiently (i.e. without the additional type conversion) parse the byte strings into the correct arbitrary Python types. Something like this:
use serde::Deserialize;
use serde_json::value::RawValue;
use serde_json::Map;
#[derive(Deserialize)]
pub enum MyValue<'a> {
Null,
Bytes(&'a RawValue),
Array(Vec<MyValue<'a>>),
Object(Map<String, MyValue<'a>>),
}
This will require writing a lot of traits so that it behaves like Value, and I'm not even sure if it won't just ignore deserializing the structural parts and put everything into a RawValue.
What is the cleanest way to do this?

UnmarshalJSON any type from []interface{} possible?

When you unmarshal JSON to []interface{} is there any way to automatically detect the type besides some standard types like bool, int and string?
What I noticed is the following, Let's say I marshal [uuid.UUID, bool] then the JSON I get looks like:
[[234,50,7,116,194,41,64,225,177,151,60,195,60,45,123,106],true]
When I unmarshal it again, I get the types as shown through reflect:
[]interface{}, bool
I don't understand why it picked []interface{}. If it cannot detect it, shouldn't it be at least interface{}?
In any case, my question is, is it possible to unmarshal any type when the target is of type []interface{}? It seems to work for standard types like string, bool, int but for custom types I don't think that's possible, is it? You can define custom JSON marshal/unmarshal methods but that only works if you decode it into a target type so that it can look up which custom marshal/unmarshal methods to use.
You can unmarshal any type into a value of type interface{}. If you use a value of type []interface{}, you can only unmarshal JSON arrays into it, but yes, the elements of the array may be of any type.
Since you're using interface{} or []interface{}, yes, type information is not available, and it's up to the encoding/json package to choose the best it sees fit. For example, for JSON objects it will choose map[string]interface{}. The full list of default types is documented in json.Unmarshal():
To unmarshal JSON into an interface value, Unmarshal stores one of these in the interface value:
bool, for JSON booleans
float64, for JSON numbers
string, for JSON strings
[]interface{}, for JSON arrays
map[string]interface{}, for JSON objects
nil for JSON null
Obviously if your JSON marshaling/unmarshaling logic needs some pre- / postprocessing, the json package will not miraculously find it out. It can know about those only if you unmarshal into values of specific types (which implement json.Unmarshaler). The json package will still be able to unmarshal them to the default types, but custom logic will obviously not run on them.

Why doesn't the "official" json checker allow top-level primitives? [duplicate]

I've carefully read the JSON description http://json.org/ but I'm not sure I know the answer to the simple question. What strings are the minimum possible valid JSON?
"string" is the string valid JSON?
42 is the simple number valid JSON?
true is the boolean value a valid JSON?
{} is the empty object a valid JSON?
[] is the empty array a valid JSON?
At the time of writing, JSON was solely described in RFC4627. It describes (at the start of "2") a JSON text as being a serialized object or array.
This means that only {} and [] are valid, complete JSON strings in parsers and stringifiers which adhere to that standard.
However, the introduction of ECMA-404 changes that, and the updated advice can be read here. I've also written a blog post on the issue.
To confuse the matter further however, the JSON object (e.g. JSON.parse() and JSON.stringify()) available in web browsers is standardised in ES5, and that clearly defines the acceptable JSON texts like so:
The JSON interchange format used in this specification is exactly that described by RFC 4627 with two exceptions:
The top level JSONText production of the ECMAScript JSON grammar may consist of any JSONValue rather than being restricted to being a JSONObject or a JSONArray as specified by RFC 4627.
snipped
This would mean that all JSON values (including strings, nulls and numbers) are accepted by the JSON object, even though the JSON object technically adheres to RFC 4627.
Note that you could therefore stringify a number in a conformant browser via JSON.stringify(5), which would be rejected by another parser that adheres to RFC4627, but which doesn't have the specific exception listed above. Ruby, for example, would seem to be one such example which only accepts objects and arrays as the root. PHP, on the other hand, specifically adds the exception that "it will also encode and decode scalar types and NULL".
There are at least four documents which can be considered JSON standards on the Internet. The RFCs referenced all describe the mime type application/json. Here is what each has to say about the top-level values, and whether anything other than an object or array is allowed at the top:
RFC-4627: No.
A JSON text is a sequence of tokens. The set of tokens includes six
structural characters, strings, numbers, and three literal names.
A JSON text is a serialized object or array.
JSON-text = object / array
Note that RFC-4627 was marked "informational" as opposed to "proposed standard", and that it is obsoleted by RFC-7159, which in turn is obsoleted by RFC-8259.
RFC-8259: Yes.
A JSON text is a sequence of tokens. The set of tokens includes six
structural characters, strings, numbers, and three literal names.
A JSON text is a serialized value. Note that certain previous
specifications of JSON constrained a JSON text to be an object or an
array. Implementations that generate only objects or arrays where a
JSON text is called for will be interoperable in the sense that all
implementations will accept these as conforming JSON texts.
JSON-text = ws value ws
RFC-8259 is dated December 2017 and is marked "INTERNET STANDARD".
ECMA-262: Yes.
The JSON Syntactic Grammar defines a valid JSON text in terms of tokens defined by the JSON lexical
grammar. The goal symbol of the grammar is JSONText.
Syntax
JSONText :
JSONValue
JSONValue :
JSONNullLiteral
JSONBooleanLiteral
JSONObject
JSONArray
JSONString
JSONNumber
ECMA-404: Yes.
A JSON text is a sequence of tokens formed from Unicode code points that conforms to the JSON value
grammar. The set of tokens includes six structural tokens, strings, numbers, and three literal name tokens.
According to the old definition in RFC 4627 (which was obsoleted in March 2014 by RFC 7159), those were all valid "JSON values", but only the last two would constitute a complete "JSON text":
A JSON text is a serialized object or array.
Depending on the parser used, the lone "JSON values" might be accepted anyway. For example (sticking to the "JSON value" vs "JSON text" terminology):
the JSON.parse() function now standardised in modern browsers accepts any "JSON value"
the PHP function json_decode was introduced in version 5.2.0 only accepting a whole "JSON text", but was amended to accept any "JSON value" in version 5.2.1
Python's json.loads accepts any "JSON value" according to examples on this manual page
the validator at http://jsonlint.com expects a full "JSON text"
the Ruby JSON module will only accept a full "JSON text" (at least according to the comments on this manual page)
The distinction is a bit like the distinction between an "XML document" and an "XML fragment", although technically <foo /> is a well-formed XML document (it would be better written as <?xml version="1.0" ?><foo />, but as pointed out in comments, the <?xml declaration is technically optional).
JSON stands for JavaScript Object Notation. Only {} and [] define a Javascript object. The other examples are value literals. There are object types in Javascript for working with those values, but the expression "string" is a source code representation of a literal value and not an object.
Keep in mind that JSON is not Javascript. It is a notation that represents data. It has a very simple and limited structure. JSON data is structured using {},:[] characters. You can only use literal values inside that structure.
It is perfectly valid for a server to respond with either an object description or a literal value. All JSON parsers should be handle to handle just a literal value, but only one value. JSON can only represent a single object at a time. So for a server to return more than one value it would have to structure it as an object or an array.
The ecma specification might be useful for reference:
http://www.ecma-international.org/ecma-262/5.1/
The parse function parses a JSON text (a JSON-formatted String) and produces an ECMAScript value. The
JSON format is a restricted form of ECMAScript literal. JSON objects are realized as ECMAScript objects.
JSON arrays are realized as ECMAScript arrays. JSON strings, numbers, booleans, and null are realized as
ECMAScript Strings, Numbers, Booleans, and null. JSON uses a more limited set of white space characters
than WhiteSpace and allows Unicode code points U+2028 and U+2029 to directly appear in JSONString literals
without using an escape sequence. The process of parsing is similar to 11.1.4 and 11.1.5 as constrained by
the JSON grammar.
JSON.parse("string"); // SyntaxError: Unexpected token s
JSON.parse(43); // 43
JSON.parse("43"); // 43
JSON.parse(true); // true
JSON.parse("true"); // true
JSON.parse(false);
JSON.parse("false");
JSON.parse("trueee"); // SyntaxError: Unexpected token e
JSON.parse("{}"); // {}
JSON.parse("[]"); // []
Yes, yes, yes, yes, and yes. All of them are valid JSON value literals.
However, the official RFC 4627 states:
A JSON text is a serialized object or array.
So a whole "file" should consist of an object or array as the outermost structure, which of course can be empty. Yet, many JSON parsers accept primitive values as well for input.
Just follow the railroad diagrams given on the json.org page. [] and {} are the minimum possible valid JSON objects. So the answer is [] and {}.
var x;
console.log(JSON.stringify(x)); // will output "{}"
So your answer is "{}" which denotes an empty object.

name json variable and jsonString variable convention?

JSON could mean JSON type or json string.
It starts confuse me when different library use json in two different meanings.
I wonder how other people name those variables differently.
For all practical purposes, "JSON" has exactly one meaning, which is a string representing a JavaScript object following certain specific syntax.
JSON is parsed into a JavaScript object using JSON.parse, and an JavaScript object is converted into a JSON string using JSON.stringify.
The problem is that all too many people have gotten into the bad habit of referring to plain old JavaScript objects as JSON. That is either confused or sloppy or both. {a: 1} is a JS object. '{"a": 1}' is a JSON string.
In the same vein, many people use variable names like json to refer to JavaScript objects derived from JSON. For example:
$.getJSON('foo.php') . then(function(json) { ...
In the above case, the variable name json is ill-advised. The actual payload returned from the server is a JSON string, but internally $.getJSON has already transformed that into a plain old JavaScript object, which is what is being passed to the then handler. Therefore, it would be preferable to use the variable name data, for example.
If a library uses the term "json" to refer to things which are not JSON, but actually are JavaScript objects, it is a mark of poor design, and I'd suggest looking around for a different library.