Remove invalid UTF-8 characters from a string


I get this on json.Marshal of a list of strings:
json: invalid UTF-8 in string: "...ole\xc5\"
The reason is obvious, but how can I delete or replace such strings in Go? I've been reading the docs for the unicode and unicode/utf8 packages, and there seems to be no obvious or quick way to do it.
In Python, for example, there are methods that can delete invalid characters, replace them with a specified character, or run in a strict mode that raises an exception on invalid characters. How can I do the equivalent in Go?
UPDATE: By "the reason" I meant the reason for getting the error (panic?): an illegal character in what json.Marshal expects to be a valid UTF-8 string.
(How the illegal byte sequence got into that string is not important. It is the usual story: bugs, file corruption, other programs that do not conform to Unicode, etc.)

In Go 1.13+, you can do this:
strings.ToValidUTF8("a\xc5z", "")
In Go 1.11+, it's also very easy to do the same using the Map function and utf8.RuneError like this:
fixUtf := func(r rune) rune {
	if r == utf8.RuneError {
		return -1
	}
	return r
}
fmt.Println(strings.Map(fixUtf, "a\xc5z"))
fmt.Println(strings.Map(fixUtf, "posic�o"))
Output:
az
posico
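The three Python-style modes the question asks about (delete, replace, strict) can be sketched on top of the functions named above. The helper names deleteInvalid, replaceInvalid and strict are made up for this illustration, and the sketch assumes Go 1.13+ for strings.ToValidUTF8.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"unicode/utf8"
)

// deleteInvalid drops every byte that is not part of valid UTF-8.
func deleteInvalid(s string) string {
	return strings.ToValidUTF8(s, "")
}

// replaceInvalid substitutes repl for each run of invalid bytes.
func replaceInvalid(s, repl string) string {
	return strings.ToValidUTF8(s, repl)
}

// strict leaves the input untouched and reports an error instead.
func strict(s string) (string, error) {
	if !utf8.ValidString(s) {
		return "", fmt.Errorf("invalid UTF-8 in %q", s)
	}
	return s, nil
}

func main() {
	in := []string{"a\xc5z", "...ole\xc5"}

	// Clean every element before handing the slice to json.Marshal.
	clean := make([]string, len(in))
	for i, s := range in {
		clean[i] = deleteInvalid(s)
	}
	out, err := json.Marshal(clean)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))

	fmt.Println(replaceInvalid("a\xc5z", "?"))

	_, err = strict("a\xc5z")
	fmt.Println(err)
}
```

Which mode fits depends on whether silently losing bytes is acceptable for the data at hand; the strict variant is the only one that never changes the input.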

For example,
package main
import (
	"fmt"
	"unicode/utf8"
)
func main() {
	s := "a\xc5z"
	fmt.Printf("%q\n", s)
	if !utf8.ValidString(s) {
		v := make([]rune, 0, len(s))
		for i, r := range s {
			if r == utf8.RuneError {
				_, size := utf8.DecodeRuneInString(s[i:])
				if size == 1 {
					continue
				}
			}
			v = append(v, r)
		}
		s = string(v)
	}
	fmt.Printf("%q\n", s)
}
Output:
"a\xc5z"
"az"
Unicode Standard
FAQ - UTF-8, UTF-16, UTF-32 & BOM
Q: Are there any byte sequences that are not generated by a UTF? How
should I interpret them?
A: None of the UTFs can generate every arbitrary byte sequence. For
example, in UTF-8 every byte of the form 110xxxxx₂ must be followed
with a byte of the form 10xxxxxx₂. A sequence such as <110xxxxx₂
0xxxxxxx₂> is illegal, and must never be generated. When faced with
this illegal byte sequence while transforming or interpreting, a UTF-8
conformant process must treat the first byte 110xxxxx₂ as an illegal
termination error: for example, either signaling an error, filtering
the byte out, or representing the byte with a marker such as FFFD
(REPLACEMENT CHARACTER). In the latter two cases, it will continue
processing at the second byte 0xxxxxxx₂.
A conformant process must not interpret illegal or ill-formed byte
sequences as characters, however, it may take error recovery actions.
No conformant process may use irregular byte sequences to encode
out-of-band information.
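The FAQ's example maps directly onto the string used in this answer: \xc5 is 11000101 in binary (form 110xxxxx) and z is 01111010 (form 0xxxxxxx). A minimal sketch of the "mark with FFFD and continue at the second byte" recovery follows; markInvalid is a name invented for this illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// markInvalid replaces each ill-formed byte with U+FFFD and resumes
// decoding at the very next byte, as the FAQ describes.
func markInvalid(s string) string {
	var b strings.Builder
	for len(s) > 0 {
		r, size := utf8.DecodeRuneInString(s)
		if r == utf8.RuneError && size == 1 {
			// Ill-formed byte: emit the marker, skip exactly one byte.
			b.WriteRune(utf8.RuneError)
		} else {
			// Well-formed rune, including a genuinely encoded U+FFFD.
			b.WriteRune(r)
		}
		s = s[size:]
	}
	return b.String()
}

func main() {
	fmt.Printf("%+q\n", markInvalid("\xc5z"))
}
```

The size == 1 check is what distinguishes a broken byte from a correctly encoded U+FFFD, which decodes with a size of 3.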

Another way to do this, according to this answer, could be
s = string([]rune(s))
Example:
package main
import (
	"fmt"
	"unicode/utf8"
)
func main() {
	s := "...ole\xc5"
	fmt.Println(s, utf8.Valid([]byte(s)))
	// Output: ...ole� false
	s = string([]rune(s))
	fmt.Println(s, utf8.Valid([]byte(s)))
	// Output: ...ole� true
}
The result doesn't look "pretty", because each invalid byte is replaced with U+FFFD (the replacement character) rather than removed, but the string is now valid UTF-8.

Related

Go JSON Marshaller errors when converting int64 bytes

I am writing a time alias to JSON-marshal a time into Unix format (I left some of my experimental test code in there):
type NotifyTime time.Time
// MarshalJSON implements the marshaller interface. Marshals time.Time into unix time in bytes
func (t NotifyTime) MarshalJSON() ([]byte, error) {
	// unixTime := time.Time(t).Unix()
	unixTime := 1626132059 // unixTime := time.Now().Unix()
	buffer := make([]byte, 8)
	binary.PutVarint(buffer, int64(unixTime))
	// binary.LittleEndian.PutUint64(buffer, uint64(unixTime))
	// try to convert back
	fmt.Println(string(buffer))
	unixIntValue := int64(binary.LittleEndian.Uint64(buffer))
	fmt.Println(unixIntValue)
	return buffer, nil
}
When I run json.Marshal on a struct with a NotifyTime field, it fails with the following error:
json: error calling MarshalJSON for type notify.NotifyTime: invalid character '¶' looking for beginning of value
type TestMe struct {
	Time NotifyTime `json:"time"`
}
testJSON := TestMe{
	Time: NotifyTime(time.Now()),
}
marshalled, err := json.Marshal(testJSON)
I have switched to marshalling it as a string, but I am still curious as to why this happens. When stepping through the code, it seems to happen in the function compact at go/1.16.2/libexec/src/encoding/json/indent.go:17, which loops over the marshalled bytes of the JSON; the first (0th) byte fails the checks in go/1.16.2/libexec/src/encoding/json/scanner.go:247.

Let's put aside the aspect of encoding a time.Time and focus on how the int64 is being turned into JSON.
binary.PutVarint uses an encoding that is appropriate for low-level wire or file formats. For the constant 1626132059, it writes [182 185 230 142 12 0 0 0] into buffer. The first byte is 182 (0xB6), which the error message renders as the character U+00B6 PILCROW SIGN. This is where '¶' comes from. You are getting an error like:
json: error calling MarshalJSON for type main.NotifyTime: invalid character '¶' looking for beginning of value
This byte is not the beginning of a valid JSON value. You will need to find an encoding of the int64 that is a JSON value, such as the decimal number 1626132059 or a string of hexadecimal digits "60ecce5b".
In general, you need to be careful when putting binary string values into JSON, as these can contain special characters that need to be escaped.
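One way to produce the decimal form mentioned above is strconv.AppendInt. This is a minimal sketch rather than the asker's final code; the encode helper is hypothetical and exists only to make the example easy to call.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"time"
)

type NotifyTime time.Time

// MarshalJSON writes the Unix time as a decimal JSON number.
func (t NotifyTime) MarshalJSON() ([]byte, error) {
	return strconv.AppendInt(nil, time.Time(t).Unix(), 10), nil
}

type TestMe struct {
	Time NotifyTime `json:"time"`
}

// encode is a hypothetical helper that marshals a TestMe for the
// given Unix timestamp in seconds.
func encode(sec int64) (string, error) {
	b, err := json.Marshal(TestMe{Time: NotifyTime(time.Unix(sec, 0))})
	return string(b), err
}

func main() {
	out, err := encode(1626132059)
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // {"time":1626132059}
}
```

Because the returned bytes are plain ASCII digits, they form a valid JSON number and pass the validity check that json.Marshal applies to MarshalJSON output.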

Unmarshalling JSON with non-printable ASCII characters

Using Go, how can I unmarshal a JSON string that contains unprintable ASCII characters?
For Example
testJsonString := "{\"test_one\" : \"123\x10456\x0B789\v123\a456\"}"
var dat map[string]interface{}
err := json.Unmarshal([]byte(testJsonString), &dat)
if err != nil {
panic(err)
}
Yields:
panic: invalid character '\x10' in string literal
goroutine 1 [running]:
main.main()
/tmp/sandbox903140350/main.go:14 +0x180
https://play.golang.org/p/mFGWzndDK8V
Unfortunately I do not have control over the source data, so I need a way to ignore or strip out the unprintable characters.
Similarly, another data issue I'm encountering is stripping out a few C escape sequences as well - like \0 and \a. If I replace string listed above with this string below, the program fails as well. Essentially it also fails on any C escape sequence https://en.wikipedia.org/wiki/Escape_sequences_in_C
testJsonString := "{\"test_one\" : \"123456789\\a123456\"}"
will error out with
panic: invalid character 'a' in string escape code
goroutine 1 [running]:
main.main()
/tmp/sandbox322770276/main.go:12 +0x100
This string cannot be unmarshalled either, and it cannot be fixed by checking rune values or Unicode categories, since Go treats it as a backslash followed by the character 'a', both of which are legal.
Is there a good way to handle these edge cases?

JSON does not allow raw control characters (U+0000 through U+001F) inside a string, so non-printable characters have to be escaped before parsing, either URI-escaped or converted to valid Unicode escapes (the answer originally pointed to https://jsonapi.org/format/ for this).
So here's a converter that turns non-printable characters into their URI-escaped forms. The result can then be fed into Unmarshal.
If this isn't exactly the behaviour you need, modify the converter to remove the characters (with continue), replace them with a question mark rune, or whatever suits you.
BTW, the second problem with \\a does not "print out as expected" for me. Please give a better example that actually shows the problem you are experiencing.
package main
import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/url"
	"unicode"
)
func safety(d string) []byte {
	var buffer bytes.Buffer
	for _, c := range d {
		s := string(c)
		if c == 92 { // 92 is a backslash
			continue
		}
		if unicode.IsPrint(c) {
			buffer.WriteString(s)
		} else {
			buffer.WriteString(url.QueryEscape(s))
		}
		fmt.Println(buffer.String())
	}
	return buffer.Bytes()
}
func main() {
	testJsonString := "{\"test_one\" : \"123\x10456\x0B789\v123\a456\"}"
	var dat map[string]interface{}
	err := json.Unmarshal(safety(testJsonString), &dat)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%v", dat)
}
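If the goal is to strip the characters rather than escape them, as the question suggests, a shorter variant is possible with strings.Map and unicode.IsControl. This is only a sketch: stripControl and parse are names chosen for this example, and it does nothing about the separate backslash-a escape problem.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"unicode"
)

// stripControl removes every control character from the raw document.
// Newlines and tabs between JSON tokens are removed too, which is
// harmless because they are insignificant whitespace there.
func stripControl(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsControl(r) {
			return -1 // a negative value drops the rune
		}
		return r
	}, s)
}

// parse cleans the document and then unmarshals it into a map.
func parse(doc string) (map[string]interface{}, error) {
	var dat map[string]interface{}
	err := json.Unmarshal([]byte(stripControl(doc)), &dat)
	return dat, err
}

func main() {
	dat, err := parse("{\"test_one\" : \"123\x10456\x0B789\v123\a456\"}")
	if err != nil {
		panic(err)
	}
	fmt.Println(dat["test_one"])
}
```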

Runtime error when parsing JSON array and map elements with trailing commas

Dave Cheney, one of the leading subject matter experts on Go, wrote: "When initializing a variable with a composite literal, Go requires that each line of the composite literal end with a comma, even the last line of your declaration. This is the result of the semicolon rule."
However, when I am trying to apply that beautiful rule to JSON text, the parser doesn't seem to agree with this philosophy. In the code below, removing the comma works. Is there a fix for this so I can just see one line change when I add elements in my diffs?
package main
import (
	"fmt"
	"encoding/json"
)
type jsonobject struct {
	Objects []ObjectType `json:"objects"`
}
type ObjectType struct {
	Name string `json:"name"`
}
func main() {
	bytes := []byte(`{ "objects":
[
{"name": "foo"}, // REMOVE THE COMMA TO MAKE THE CODE WORK!
]}`)
	jsontype := &jsonobject{}
	json.Unmarshal(bytes, &jsontype)
	fmt.Printf("Results: %v\n", jsontype.Objects[0].Name) // panic: runtime error: index out of range
}

There is not. The JSON specification does not allow a trailing comma.
This is not a valid JSON:
{ "objects":
[
{"name": "foo"},
]}
It is Go syntax, not JSON syntax, that requires a trailing comma when the closing bracket of the list is not on the same line as the last element, e.g.:
// Slice literal:
s := []int {
	1,
	2,
}
// Function call:
fmt.Println(
	"Slice:",
	s,
)
Even if you could "convince" one specific JSON parser to silently swallow it, other, valid JSON parsers would report an error, rightfully. Don't do it.
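The index-out-of-range panic in the question only appears because the error returned by json.Unmarshal is discarded. A small sketch that checks it instead is shown here; parseNames is a hypothetical helper, and the JSON is the question's document without the Go-style comment.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseNames decodes {"objects":[{"name":...}, ...]} and returns the
// names, or the syntax error reported by encoding/json.
func parseNames(data []byte) ([]string, error) {
	var doc struct {
		Objects []struct {
			Name string `json:"name"`
		} `json:"objects"`
	}
	if err := json.Unmarshal(data, &doc); err != nil {
		return nil, err
	}
	names := make([]string, 0, len(doc.Objects))
	for _, o := range doc.Objects {
		names = append(names, o.Name)
	}
	return names, nil
}

func main() {
	bad := []byte(`{"objects":[{"name":"foo"},]}`)
	good := []byte(`{"objects":[{"name":"foo"}]}`)

	fmt.Println(json.Valid(bad), json.Valid(good))

	if _, err := parseNames(bad); err != nil {
		fmt.Println("rejected:", err)
	}
	names, err := parseNames(good)
	fmt.Println(names, err)
}
```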

While trailing commas are not valid JSON, some languages, notably JavaScript, allow them natively, so you may see them in your data.
It's better to remove trailing commas, but if you cannot change your data, use a parser that accepts them, such as HuJSON (aka Human JSON), which supports both trailing commas and comments in JSON. It's a soft fork of encoding/json and is maintained by noted Xoogler and ex-Go team member Brad Fitzpatrick and others.
repo: https://github.com/tailscale/hujson
docs: https://pkg.go.dev/github.com/tailscale/hujson
The Unmarshal syntax is the same as encoding/json, just use:
err := hujson.Unmarshal(data, v)
I've used it and it works as described.

Why does json.Encoder add an extra line?

json.Encoder seems to behave slightly different than json.Marshal. Specifically it adds a new line at the end of the encoded value. Any idea why is that? It looks like a bug to me.
package main
import "fmt"
import "encoding/json"
import "bytes"
func main() {
	var v string
	v = "hello"
	buf := bytes.NewBuffer(nil)
	json.NewEncoder(buf).Encode(v)
	b, _ := json.Marshal(&v)
	fmt.Printf("%q, %q", buf.Bytes(), b)
}
This outputs
"\"hello\"\n", "\"hello\""

Because they explicitly added a new line character when using Encoder.Encode. Here's the source code to that func, and it actually states it adds a newline character in the documentation (see comment, which is the documentation):
https://golang.org/src/encoding/json/stream.go?s=4272:4319
// Encode writes the JSON encoding of v to the stream,
// followed by a newline character.
//
// See the documentation for Marshal for details about the
// conversion of Go values to JSON.
func (enc *Encoder) Encode(v interface{}) error {
	if enc.err != nil {
		return enc.err
	}
	e := newEncodeState()
	err := e.marshal(v)
	if err != nil {
		return err
	}
	// Terminate each value with a newline.
	// This makes the output look a little nicer
	// when debugging, and some kind of space
	// is required if the encoded value was a number,
	// so that the reader knows there aren't more
	// digits coming.
	e.WriteByte('\n')
	if _, err = enc.w.Write(e.Bytes()); err != nil {
		enc.err = err
	}
	encodeStatePool.Put(e)
	return err
}
Now, why did the Go developers do it, other than because it "makes the output look a little nicer"? One answer:
Streaming
The Go json Encoder is optimized for streaming (e.g. MB/GB/PB of JSON data). When streaming, you typically need a way to delimit where each value in the stream ends. In the case of Encoder.Encode(), that is a \n newline character. Sure, you can write to a buffer, but you can also write to any io.Writer, which streams out the encoding of v.
This is in contrast to json.Marshal, which is generally discouraged when the data comes from an untrusted source of unknown size (e.g. an ajax POST method to your web service: what if someone posts a 100MB JSON file?). Also, json.Marshal produces one final, complete JSON value; you wouldn't expect to concatenate a few hundred Marshal results together. You'd use Encoder.Encode() for that, to build a large set and write it to a buffer, stream, file, io.Writer, etc.
Whenever I'm in doubt about whether something is a bug, I look up the source. That's one of the advantages of Go: its source and compiler are pure Go. Within [n]vim I use \gb to open the source definition in a browser with my .vimrc settings.

You can erase the newline by seeking backward in the stream:
f, _ := os.OpenFile(fname, ...)
encoder := json.NewEncoder(f)
encoder.Encode(v)
f.Seek(-1, io.SeekCurrent) // step back over the trailing newline
f.WriteString("other data ...")
They should let the user control this strange behavior, for example with one of:
a build option to disable it
Encoder.SetEOF(eof string)
Encoder.SetIndent(prefix, indent, eof string)
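When the destination is not seekable, another option is to encode into a buffer and trim the terminator before writing. The sketch below assumes a reason to prefer the Encoder over json.Marshal, here SetEscapeHTML(false); encodeTrimmed is a name invented for the example.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// encodeTrimmed encodes v with a json.Encoder (HTML escaping turned
// off) and strips the single newline that Encode appends.
func encodeTrimmed(v interface{}) ([]byte, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false)
	if err := enc.Encode(v); err != nil {
		return nil, err
	}
	return bytes.TrimSuffix(buf.Bytes(), []byte("\n")), nil
}

func main() {
	b, err := encodeTrimmed("a<b")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", b)
}
```

If no Encoder option is needed, json.Marshal already returns the value without a trailing newline, and its result can be written to the io.Writer directly.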

The Encoder writes a stream of documents. The extra whitespace terminates a JSON document in the stream.
A terminator is required for stream readers. Consider a stream containing these JSON documents: 1, 2, 3. Without the extra whitespace, the data on the wire is the sequence of bytes 123. This is a single JSON document with the number 123, not three documents.
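The 1, 2, 3 example can be reproduced with an Encoder and a Decoder. This is a small sketch; encodeAll and decodeAll are helper names chosen for the illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// encodeAll writes each value as its own JSON document.
func encodeAll(vals ...int) (string, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	for _, v := range vals {
		if err := enc.Encode(v); err != nil {
			return "", err
		}
	}
	return buf.String(), nil
}

// decodeAll reads documents from the stream until EOF.
func decodeAll(stream string) ([]int, error) {
	dec := json.NewDecoder(strings.NewReader(stream))
	var out []int
	for {
		var v int
		err := dec.Decode(&v)
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		out = append(out, v)
	}
}

func main() {
	s, _ := encodeAll(1, 2, 3)
	fmt.Printf("%q\n", s) // "1\n2\n3\n"

	withNewlines, _ := decodeAll(s)
	withoutNewlines, _ := decodeAll("123")
	fmt.Println(len(withNewlines), len(withoutNewlines)) // 3 1
}
```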

Go json.Unmarshal key with \u0000 \x00

Basically there are some special characters ('\u0000') in my JSON string key:
var j = []byte(`{"Page":1,"Fruits":["5","6"],"\u0000*\u0000_errorMessages":{"x":"123"},"*_successMessages":{"ok":"hi"}}`)
I want to Unmarshal it into a struct:
type Response1 struct {
	Page   int
	Fruits []string
	Msg    interface{} `json:"*_errorMessages"`
	Msg1   interface{} `json:"\\u0000*\\u0000_errorMessages"`
	Msg2   interface{} `json:"\u0000*\u0000_errorMessages"`
	Msg3   interface{} `json:"\0*\0_errorMessages"`
	Msg4   interface{} `json:"\\0*\\0_errorMessages"`
	Msg5   interface{} `json:"\x00*\x00_errorMessages"`
	Msg6   interface{} `json:"\\x00*\\x00_errorMessages"`
	SMsg   interface{} `json:"*_successMessages"`
}
I tried a lot but it's not working.
This link might help golang.org/src/encoding/json/encode_test.go.

Short answer: With the current json implementation it is not possible using only struct tags.
Note: It's an implementation restriction, not a specification restriction. (It's the restriction of the json package implementation, not the restriction of the struct tags specification.)
Some background: you specified your tags with a raw string literal:
The value of a raw string literal is the string composed of the uninterpreted (implicitly UTF-8-encoded) characters between the quotes...
So no unescaping or unquoting happens in the content of the raw string literal by the compiler.
The convention for struct tag values quoted from reflect.StructTag:
By convention, tag strings are a concatenation of optionally space-separated key:"value" pairs. Each key is a non-empty string consisting of non-control characters other than space (U+0020 ' '), quote (U+0022 '"'), and colon (U+003A ':'). Each value is quoted using U+0022 '"' characters and Go string literal syntax.
What this means is that by convention tag values are a list of (key:"value") pairs separated by spaces. There are quite a few restrictions for keys, but values may be anything, and values (should) use "Go string literal syntax", this means that these values will be unquoted at runtime from code (by a call to strconv.Unquote(), called from StructTag.Get(), in source file reflect/type.go, currently line #809).
So no need for double quoting. See your simplified example:
type Response1 struct {
	Page   int
	Fruits []string
	Msg    interface{} `json:"\u0000_abc"`
}
Now the following code:
t := reflect.TypeOf(Response1{})
fmt.Printf("%#v\n", t.Field(2).Tag)
fmt.Printf("%#v\n", t.Field(2).Tag.Get("json"))
Prints:
"json:\"\\u0000_abc\""
"\x00_abc"
As you can see, the value part for the json key is "\x00_abc" so it properly contains the zero character.
But how will the json package use this?
The json package uses the value returned by StructTag.Get() (from the reflect package), exactly what we did. You can see it in the json/encode.go source file, typeFields() function, currently line #1032. So far so good.
Then it calls the unexported json.parseTag() function, in json/tags.go source file, currently line #17. This cuts the part after the comma (which becomes the "tag options").
And finally the json.isValidTag() function is called with the previous value, in source file json/encode.go, currently line #731. This function checks the runes of the passed string, and (besides a set of pre-defined allowed characters "!#$%&()*+-./:<=>?@[]^_{|}~ ") rejects everything that is not a unicode letter or digit (as defined by unicode.IsLetter() and unicode.IsDigit()):
if !unicode.IsLetter(c) && !unicode.IsDigit(c) {
	return false
}
'\u0000' is not part of the pre-defined allowed characters, and as you can guess now, it is neither a letter nor a digit:
// Following code prints "INVALID":
c := '\u0000'
if !unicode.IsLetter(c) && !unicode.IsDigit(c) {
	fmt.Println("INVALID")
}
And since isValidTag() returns false, the name (which is the value for the json key, without the "tag options" part) will be discarded (name = "") and not used. So no match will be found for the struct field containing a unicode zero.
For an alternative solution use a map, or a custom json.Unmarshaler or use json.RawMessage.
But I would highly discourage using such ugly json keys. I understand likely you are just trying to parse such json response and it may be out of your reach, but you should fight against using these keys as they will just cause more problems later on (e.g. if stored in db, by inspecting records it will be very hard to spot that there are '\u0000' characters in them as they may be displayed as nothing).
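The custom json.Unmarshaler route mentioned above could look roughly like this. It is a sketch under assumptions: the Response type, the decode helper, and the choice of map[string]string for Msg are illustrative and not taken from the question.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type Response struct {
	Page   int
	Fruits []string
	// Msg is excluded from normal decoding and filled in by hand.
	Msg map[string]string `json:"-"`
}

// UnmarshalJSON decodes the ordinary fields as usual, then looks up
// the key containing NUL characters in a raw map.
func (r *Response) UnmarshalJSON(data []byte) error {
	type plain Response // same fields, but no UnmarshalJSON method
	var p plain
	if err := json.Unmarshal(data, &p); err != nil {
		return err
	}
	var raw map[string]json.RawMessage
	if err := json.Unmarshal(data, &raw); err != nil {
		return err
	}
	if v, ok := raw["\x00*\x00_errorMessages"]; ok {
		if err := json.Unmarshal(v, &p.Msg); err != nil {
			return err
		}
	}
	*r = Response(p)
	return nil
}

// decode is a hypothetical convenience wrapper for the example.
func decode(data []byte) (Response, error) {
	var r Response
	err := json.Unmarshal(data, &r)
	return r, err
}

func main() {
	j := []byte(`{"Page":1,"Fruits":["5","6"],"\u0000*\u0000_errorMessages":{"x":"123"}}`)
	r, err := decode(j)
	if err != nil {
		panic(err)
	}
	fmt.Println(r.Page, r.Fruits, r.Msg)
}
```

The document is parsed twice, which keeps the code short; for large payloads, decoding once into the raw map and filling every field from it would avoid the extra pass.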

You cannot do it this way because of the rules for struct tags: http://golang.org/ref/spec#Struct_types
But you can unmarshal into a map[string]interface{} and then check the field names of that object with a regexp.

I don't think this is possible with struct tags. The best thing you can do is unmarshal it into map[string]interface{} and then get the values manually:
var b = []byte(`{"\u0000abc":42}`)
var m map[string]interface{}
err := json.Unmarshal(b, &m)
if err != nil {
	panic(err)
}
fmt.Println(m, m["\x00abc"])
Playground: http://play.golang.org/p/RtS7Nst0d7.
