Unmarshalling JSON with non-printable ASCII characters

Using Go, how can I unmarshal a JSON string that contains unprintable ASCII characters?
For example:
testJsonString := "{\"test_one\" : \"123\x10456\x0B789\v123\a456\"}"
var dat map[string]interface{}
err := json.Unmarshal([]byte(testJsonString), &dat)
if err != nil {
panic(err)
}
Yields:
panic: invalid character '\x10' in string literal
goroutine 1 [running]:
main.main()
/tmp/sandbox903140350/main.go:14 +0x180
https://play.golang.org/p/mFGWzndDK8V
Unfortunately I do not have control over the source data, so I need a way to ignore or strip out the unprintable characters.
Similarly, another data issue I'm encountering is C escape sequences such as \0 and \a, which I also need to strip out. If I replace the string listed above with the string below, the program fails as well. Essentially it fails on any C escape sequence that JSON does not define (https://en.wikipedia.org/wiki/Escape_sequences_in_C):
testJsonString := "{\"test_one\" : \"123456789\\a123456\"}"
will error out with
panic: invalid character 'a' in string escape code
goroutine 1 [running]:
main.main()
/tmp/sandbox322770276/main.go:12 +0x100
This input cannot be unmarshaled either, and it cannot be caught by checking rune values or Unicode categories, since Go sees it as a backslash followed by the character 'a', both of which are legal on their own.
Is there a good way to handle these edge cases?

JSON does not allow raw control characters (U+0000 through U+001F) inside string literals; RFC 8259 requires them to be escaped, for example as \u0010. (The page https://jsonapi.org/format/ describes JSON:API, which is a separate specification.) One workaround is to URI-escape the offending characters before decoding.
So here's a converter that turns non-printable characters into their URI-escaped forms. The result can then be fed into Unmarshal.
If this isn't exactly the behaviour you need, modify the converter to remove the characters (with continue), replace them with a question mark rune, or whatever else suits you.
BTW, the second problem with \\a does not "print out as expected" for me. Please give a better example that actually shows the problem you are experiencing.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/url"
	"unicode"
)

// safety drops backslashes and URI-escapes every non-printable rune, so the
// result no longer contains raw control characters.
func safety(d string) []byte {
	var buffer bytes.Buffer
	for _, c := range d {
		s := string(c)
		if c == '\\' { // note: this also removes valid JSON escapes such as \n
			continue
		}
		if unicode.IsPrint(c) {
			buffer.WriteString(s)
		} else {
			buffer.WriteString(url.QueryEscape(s)) // e.g. 0x10 becomes %10
		}
	}
	return buffer.Bytes()
}

func main() {
	testJsonString := "{\"test_one\" : \"123\x10456\x0B789\v123\a456\"}"
	var dat map[string]interface{}
	err := json.Unmarshal(safety(testJsonString), &dat)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%v", dat)
}
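If you would rather keep the original characters than percent-encode them, another option is to turn them into valid JSON escapes before decoding. The following is only a sketch: sanitizeJSON is a hypothetical helper name, not something from encoding/json, and it assumes the input is otherwise well-formed JSON. Inside string literals it rewrites raw control bytes as \u00XX and doubles the backslash of escapes JSON does not define (such as \a), so they survive as literal text.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"strings"
)

// sanitizeJSON is a hypothetical helper (an assumption for illustration).
// Inside string literals it rewrites raw control bytes as \u00XX escapes and
// doubles the backslash of escapes that JSON does not define, such as \a.
func sanitizeJSON(data []byte) []byte {
	const validEscapes = `"\/bfnrtu` // characters allowed after a backslash
	var out bytes.Buffer
	inString := false
	for i := 0; i < len(data); i++ {
		b := data[i]
		switch {
		case !inString:
			if b == '"' {
				inString = true
			}
			out.WriteByte(b)
		case b == '\\' && i+1 < len(data):
			if next := data[i+1]; strings.IndexByte(validEscapes, next) >= 0 {
				out.WriteByte(b)
				out.WriteByte(next)
				i++ // the escaped byte has been consumed
			} else {
				out.WriteString(`\\`) // keep the backslash as literal text
			}
		case b == '"':
			inString = false
			out.WriteByte(b)
		case b < 0x20:
			fmt.Fprintf(&out, `\u%04x`, b) // raw control byte -> JSON escape
		default:
			out.WriteByte(b)
		}
	}
	return out.Bytes()
}

func main() {
	inputs := []string{
		"{\"test_one\" : \"123\x10456\x0B789\v123\a456\"}", // raw control bytes
		"{\"test_one\" : \"123456789\\a123456\"}",          // invalid escape \a
	}
	for _, in := range inputs {
		var dat map[string]interface{}
		if err := json.Unmarshal(sanitizeJSON([]byte(in)), &dat); err != nil {
			panic(err)
		}
		fmt.Printf("%q\n", dat["test_one"])
	}
}
```

To strip the characters instead of preserving them, the two rewriting branches can simply write nothing. The sketch does not validate \u sequences or handle a backslash immediately followed by a raw control byte.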

Related

How to remove unicode control characters from a string in Go before decoding JSON

I want to create a function which does the same as json.Unmarshal, but before that it needs to remove all JSON encoded unicode control characters which are not spaces.
func UnmarshalJSON(data []byte, v interface{}) error {
// TODO: implement the removeControlCharacters function
// cleanData, err := removeControlCharacters(data)
// if err != nil { return err }
return json.Unmarshal(cleanData, v)
}
I already got that same removal function working for characters which are not JSON encoded like so:
cleanData := strings.Map(func(r rune) rune {
if unicode.IsControl(r) && !unicode.IsPrint(r) && !unicode.IsSpace(r) {
return -1
}
return r
}, string(data))
However, in JSON these characters are encoded as escapes like \u0000. There is also the problem of escaped backslashes: \\u0002 is a literal backslash followed by the text u0002 and must be left alone. So when the JSON data (as a string) looks like this:
{"name":"\b\t\u0009Some\u0000thing\\u0002"}
I expect an output like this (also as a string):
{"name":"\t\u0009Something\\u0002"}
Is there a way to remove these characters in a relatively clean way? Is there some preexisting code somewhere which I could use? It's a bit annoying to create very low level code for something quite basic (in my opinion)
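One possible approach is a small scanner that walks the raw JSON and consumes escapes pairwise, so that \\ is never mistaken for the start of a \u sequence. This is a sketch only: removeControlEscapes is a hypothetical name, it assumes the input is valid JSON (so backslashes only occur inside strings), and it does not touch raw, unescaped control bytes.

```go
package main

import (
	"bytes"
	"fmt"
	"strconv"
	"unicode"
)

// removeControlEscapes is a hypothetical helper (an assumption for
// illustration). It drops JSON escape sequences that encode control
// characters other than whitespace and copies everything else unchanged.
func removeControlEscapes(data []byte) []byte {
	var out bytes.Buffer
	for i := 0; i < len(data); i++ {
		if data[i] != '\\' || i+1 >= len(data) {
			out.WriteByte(data[i])
			continue
		}
		switch data[i+1] {
		case 'b':
			i++ // \b is U+0008: a control character that is not a space
		case 'u':
			if i+6 <= len(data) {
				n, err := strconv.ParseUint(string(data[i+2:i+6]), 16, 32)
				if err == nil {
					r := rune(n)
					if !unicode.IsControl(r) || unicode.IsSpace(r) {
						out.Write(data[i : i+6]) // keep the escape as written
					}
					i += 5 // skip the remaining five bytes of \uXXXX
					continue
				}
			}
			out.Write(data[i : i+2]) // malformed \u: copy and move on
			i++
		default:
			// \\, \", \t, \n and friends: copy both bytes so that the
			// second one is never read as the start of a new escape.
			out.Write(data[i : i+2])
			i++
		}
	}
	return out.Bytes()
}

func main() {
	in := []byte(`{"name":"\b\t\u0009Some\u0000thing\\u0002"}`)
	fmt.Println(string(removeControlEscapes(in)))
	// prints {"name":"\t\u0009Something\\u0002"}
}
```

The cleaned bytes can then be passed to json.Unmarshal as in the wrapper function above.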

Replace characters in go serialization by using custom MarshalJSON method

I wrote a custom MarshalJSON method in order to replace the characters \u003c and \u003e: https://go.dev/play/p/xJ-qcMN9QXl
In the example above, I marshal an auxiliary struct that contains the same fields, then replace the escaped characters in the result and return it.
As the print statement placed just before the return in MarshalJSON shows, the special characters were replaced, but after calling json.Marshal the special characters are back.
I'm missing something here but cannot figure out what. I'd appreciate your help.
Thankies :)
In the Marshal documentation of the json package https://pkg.go.dev/encoding/json#Marshal you will find the following paragraph:
String values encode as JSON strings coerced to valid UTF-8, replacing invalid bytes with the Unicode replacement rune. So that the JSON will be safe to embed inside HTML <script> tags, the string is encoded using HTMLEscape, which replaces "<", ">", "&", U+2028, and U+2029 are escaped to "\u003c","\u003e", "\u0026", "\u2028", and "\u2029". This replacement can be disabled when using an Encoder, by calling SetEscapeHTML(false).
So try it using an Encoder, for example:
package main
import (
"bytes"
"encoding/json"
"fmt"
)
type Foo struct {
Name string
Surname string
Likes map[string]interface{}
Hates map[string]interface{}
newGuy bool //rpcclonable
}
func main() {
foo := &Foo{
Name: "George",
Surname: "Denkin",
Likes: map[string]interface{}{
"Sports": "volleyball",
"Message": "<Geroge> play volleyball <usually>",
},
}
buf := &bytes.Buffer{} // or &strings.Builder{} as from the example of #mkopriva
enc := json.NewEncoder(buf)
enc.SetEscapeHTML(false)
err := enc.Encode(foo)
if err != nil {
return
}
fmt.Println(buf.String())
}
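If you want something that behaves like json.Marshal but without the HTML escaping, the Encoder approach can be wrapped in a small function. This is a minimal sketch; marshalNoEscape is a hypothetical helper name. Note that Encode appends a trailing newline, which the helper trims. As far as I can tell, json.Marshal also re-escapes the bytes returned from a custom MarshalJSON method, which would explain why the replacement in the question did not stick.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// marshalNoEscape is a hypothetical helper (an assumption for illustration).
// It encodes v like json.Marshal, but leaves <, > and & unescaped.
func marshalNoEscape(v interface{}) ([]byte, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false)
	if err := enc.Encode(v); err != nil {
		return nil, err
	}
	// Encode terminates the value with a newline; remove it.
	return bytes.TrimSuffix(buf.Bytes(), []byte("\n")), nil
}

func main() {
	msg := map[string]string{"Message": "<George> & friends"}

	escaped, err := json.Marshal(msg)
	if err != nil {
		panic(err)
	}
	plain, err := marshalNoEscape(msg)
	if err != nil {
		panic(err)
	}

	fmt.Println(string(escaped)) // {"Message":"\u003cGeorge\u003e \u0026 friends"}
	fmt.Println(string(plain))   // {"Message":"<George> & friends"}
}
```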

How to convert escape characters in HTML tags?

How can we directly convert "\u003chtml\u003e" to "<html>"? Conversion of "<html>" to "\u003chtml\u003e" is quite easy using json.Marshal(), but json.Unmarshal() is quite lengthy and cumbersome. Is there any direct way to do that in golang?
You can use the strconv.Unquote() to do the conversion.
One thing you should be aware of is that strconv.Unquote() can only unquote strings that are in quotes (e.g. start and end with a quote char " or a back quote char `), so we have to manually append that.
Example:
// Important to use backtick ` (raw string literal)
// else the compiler will unquote it (interpreted string literal)!
s := `\u003chtml\u003e`
fmt.Println(s)
s2, err := strconv.Unquote(`"` + s + `"`)
if err != nil {
panic(err)
}
fmt.Println(s2)
Output (try it on the Go Playground):
\u003chtml\u003e
<html>
Note: To do HTML text escaping and unescaping, you can use the html package. Quoting its doc:
Package html provides functions for escaping and unescaping HTML text.
But the html package (specifically html.UnescapeString()) does not decode unicode sequences of the form \uxxxx, only &#decimal; or &#xHH;.
Example:
fmt.Println(html.UnescapeString(`\u003chtml\u003e`)) // wrong
fmt.Println(html.UnescapeString(`&#60;html&#62;`)) // good
fmt.Println(html.UnescapeString(`&#x3c;html&#x3e;`)) // good
Output (try it on the Go Playground):
\u003chtml\u003e
<html>
<html>
Note #2:
You should also note that if you write a code like this:
s := "\u003chtml\u003e"
This quoted string will be unquoted by the compiler itself as it is an interpreted string literal, so you can't really test that. To specify a quoted string in the source, you may use the backtick to specify a raw string literal, or you may use an interpreted string literal with escaped backslashes:
s := "\u003chtml\u003e" // Interpreted string literal (unquoted by the compiler!)
fmt.Println(s)
s2 := `\u003chtml\u003e` // Raw string literal (no unquoting will take place)
fmt.Println(s2)
s3 := "\\u003chtml\\u003e" // Interpreted string literal with escaped backslashes
// (the compiler turns each \\ into a single backslash)
fmt.Println(s3)
Output:
<html>
\u003chtml\u003e
\u003chtml\u003e
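If the escaped text is an interpreted string literal in your source code, the compiler has already decoded the escapes, so printing it with the fmt package shows the unescaped form: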
fmt.Printf("%v","\u003chtml\u003e") // will output <html>
https://play.golang.org/p/ZEot6bxO1H
I think it's a common problem. This is how I got it to work.
func _UnescapeUnicodeCharactersInJSON(_jsonRaw json.RawMessage) (json.RawMessage, error) {
str, err := strconv.Unquote(strings.Replace(strconv.Quote(string(_jsonRaw)), `\\u`, `\u`, -1))
if err != nil {
return nil, err
}
return []byte(str), nil
}
func main() {
// Both are valid JSON.
var jsonRawEscaped json.RawMessage // json raw with escaped unicode chars
var jsonRawUnescaped json.RawMessage // json raw with unescaped unicode chars
// '\u263a' == '☺'
jsonRawEscaped = []byte(`{"HelloWorld": "\uC548\uB155, \uC138\uC0C1(\u4E16\u4E0A). \u263a"}`) // "\\u263a"
jsonRawUnescaped, _ = _UnescapeUnicodeCharactersInJSON(jsonRawEscaped) // "☺"
fmt.Println(string(jsonRawEscaped)) // {"HelloWorld": "\uC548\uB155, \uC138\uC0C1(\u4E16\u4E0A). \u263a"}
fmt.Println(string(jsonRawUnescaped)) // {"HelloWorld": "안녕, 세상(世上). ☺"}
}
https://play.golang.org/p/pUsrzrrcDG-
Hope this helps someone.
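For a single escaped string, json.Unmarshal itself does not have to be cumbersome: wrapping the text in double quotes and decoding it into a string gives the same result as the strconv.Unquote approach, using JSON's escape rules (including surrogate pairs) rather than Go's. This is a sketch; unescapeJSONString is a hypothetical helper name, and it assumes the input contains no unescaped double quotes or raw control characters.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// unescapeJSONString is a hypothetical helper (an assumption for
// illustration). It decodes JSON-style escapes such as \u003c by treating
// s as the body of a JSON string literal.
func unescapeJSONString(s string) (string, error) {
	var out string
	err := json.Unmarshal([]byte(`"`+s+`"`), &out)
	return out, err
}

func main() {
	// Raw string literal, so the compiler leaves the escapes alone.
	s, err := unescapeJSONString(`\u003chtml\u003e`)
	if err != nil {
		panic(err)
	}
	fmt.Println(s) // <html>
}
```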

Why are string and []bytes treated differently when unmarshaling JSON?

My understanding from reading the documentation was that string is essentially an immutable []byte and that one can easily convert between the two.
However when unmarshaling from JSON this doesn't seem to be true. Take the following example program:
package main
import (
"encoding/json"
"fmt"
)
type STHRaw struct {
Hash []byte `json:"hash"`
}
type STHString struct {
Hash string `json:"hash"`
}
func main() {
bytes := []byte(`{"hash": "nuyHN9wx4lZL2L3Ir3dhZpmggTQEIHEZcC3DUNCtQsk="}`)
stringHead := new(STHString)
if err := json.Unmarshal(bytes, &stringHead); err != nil {
return
}
rawHead := new(STHRaw)
if err := json.Unmarshal(bytes, &rawHead); err != nil {
return
}
fmt.Printf("String:\t\t%x\n", stringHead.Hash)
fmt.Printf("Raw:\t\t%x\n", rawHead.Hash)
fmt.Printf("Raw to string:\t%x\n", string(rawHead.Hash[:]))
}
This gives the following output:
String: 6e7579484e397778346c5a4c324c3349723364685a706d67675451454948455a63433344554e437451736b3d
Raw: 9eec8737dc31e2564bd8bdc8af77616699a0813404207119702dc350d0ad42c9
Raw to string: 9eec8737dc31e2564bd8bdc8af77616699a0813404207119702dc350d0ad42c9
Instead I would have expected to receive the same value each time.
What is the difference?
The designers of the encoding/json package made the decision that applications must provide valid UTF-8 text in string values and that applications can put arbitrary byte sequences in []byte values. The package encodes []byte values as base64 to ensure that the resulting JSON string is valid UTF-8, and decodes base64 when unmarshaling into a []byte.
The encoding of []byte values is described in the Marshal function documentation.
This decision was not dictated by the design of the Go language. The string type can contain arbitrary byte sequences. The []byte type can contain valid UTF-8 text.
The designers could have used a flag in the field tag to indicate that a string or []byte value should be encoded and which encoder to use, but that's not what they did.
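To make the difference concrete, the following sketch decodes the string field by hand with base64.StdEncoding, which yields the same bytes the []byte field receives, and shows that marshaling a []byte goes the other way. The helper name decodeHashHex and the struct sth are made up for this illustration.

```go
package main

import (
	"encoding/base64"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// sth mirrors STHRaw from the question; the name is an assumption.
type sth struct {
	Hash []byte `json:"hash"`
}

// decodeHashHex is a hypothetical helper: it base64-decodes s, which is what
// encoding/json does for []byte fields, and returns the bytes as hex.
func decodeHashHex(s string) (string, error) {
	raw, err := base64.StdEncoding.DecodeString(s)
	if err != nil {
		return "", err
	}
	return hex.EncodeToString(raw), nil
}

func main() {
	h, err := decodeHashHex("nuyHN9wx4lZL2L3Ir3dhZpmggTQEIHEZcC3DUNCtQsk=")
	if err != nil {
		panic(err)
	}
	fmt.Println(h) // same hex as the "Raw" line in the question

	// Marshaling a []byte produces a base64 string.
	out, err := json.Marshal(sth{Hash: []byte("hi")})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // {"hash":"aGk="}
}
```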

Remove invalid UTF-8 characters from a string

I get this on json.Marshal of a list of strings:
json: invalid UTF-8 in string: "...ole\xc5\"
The reason is obvious, but how can I delete/replace such strings in Go? I've been reading the docs on the unicode and unicode/utf8 packages and there seems to be no obvious/quick way to do it.
In Python for example you have methods for it where the invalid characters can be deleted, replaced by a specified character or strict setting which raises exception on invalid chars. How can I do equivalent thing in Go?
UPDATE: I meant the reason for getting an exception (panic?) - illegal char in what json.Marshal expects to be valid UTF-8 string.
(how the illegal byte sequence got into that string is not important, the usual way - bugs, file corruption, other programs that do not conform to unicode, etc)
In Go 1.13+, you can do this:
strings.ToValidUTF8("a\xc5z", "")
In Go 1.11+, it's also very easy to do the same using the Map function and utf8.RuneError like this:
fixUtf := func(r rune) rune {
if r == utf8.RuneError {
return -1
}
return r
}
fmt.Println(strings.Map(fixUtf, "a\xc5z"))
fmt.Println(strings.Map(fixUtf, "posic�o"))
Output:
az
posico
Playground: Here.
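The three Python-style modes from the question (delete, replace, strict) can be sketched with the standard library roughly as follows. The function names are made up for this illustration. One difference from the strings.Map approach above: comparing against utf8.RuneError also removes a genuine U+FFFD character that was validly encoded in the input, whereas strings.ToValidUTF8 only touches invalid bytes.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode/utf8"
)

// dropInvalid removes invalid UTF-8 byte sequences (hypothetical name).
func dropInvalid(s string) string {
	return strings.ToValidUTF8(s, "")
}

// replaceInvalid substitutes U+FFFD for each run of invalid bytes
// (hypothetical name).
func replaceInvalid(s string) string {
	return strings.ToValidUTF8(s, "\uFFFD")
}

// strict returns an error instead of repairing the string (hypothetical name).
func strict(s string) (string, error) {
	if !utf8.ValidString(s) {
		return "", errors.New("invalid UTF-8 in string")
	}
	return s, nil
}

func main() {
	s := "a\xc5z"
	fmt.Printf("%q\n", dropInvalid(s))    // "az"
	fmt.Printf("%q\n", replaceInvalid(s)) // "a�z"
	if _, err := strict(s); err != nil {
		fmt.Println("strict:", err)
	}
}
```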
For example,
package main
import (
"fmt"
"unicode/utf8"
)
func main() {
s := "a\xc5z"
fmt.Printf("%q\n", s)
if !utf8.ValidString(s) {
v := make([]rune, 0, len(s))
for i, r := range s {
if r == utf8.RuneError {
_, size := utf8.DecodeRuneInString(s[i:])
if size == 1 {
continue
}
}
v = append(v, r)
}
s = string(v)
}
fmt.Printf("%q\n", s)
}
Output:
"a\xc5z"
"az"
Unicode Standard
FAQ - UTF-8, UTF-16, UTF-32 & BOM
Q: Are there any byte sequences that are not generated by a UTF? How
should I interpret them?
A: None of the UTFs can generate every arbitrary byte sequence. For
example, in UTF-8 every byte of the form 110xxxxx₂ must be followed
with a byte of the form 10xxxxxx₂. A sequence such as <110xxxxx₂
0xxxxxxx₂> is illegal, and must never be generated. When faced with
this illegal byte sequence while transforming or interpreting, a UTF-8
conformant process must treat the first byte 110xxxxx₂ as an illegal
termination error: for example, either signaling an error, filtering
the byte out, or representing the byte with a marker such as FFFD
(REPLACEMENT CHARACTER). In the latter two cases, it will continue
processing at the second byte 0xxxxxxx₂.
A conformant process must not interpret illegal or ill-formed byte
sequences as characters, however, it may take error recovery actions.
No conformant process may use irregular byte sequences to encode
out-of-band information.
Another way to do this, according to this answer, could be
s = string([]rune(s))
Example:
package main
import (
"fmt"
"unicode/utf8"
)
func main() {
s := "...ole\xc5"
fmt.Println(s, utf8.Valid([]byte(s)))
// Output: ...ole� false
s = string([]rune(s))
fmt.Println(s, utf8.Valid([]byte(s)))
// Output: ...ole� true
}
Even though the result doesn't look "pretty", the string is now valid UTF-8: the conversion to []rune turns each invalid byte into U+FFFD (the replacement character).