I'm trying to store UTF-8 text in a table whose collation is latin1_swedish_ci (i.e. the latin1 character set). I can't change the encoding since I don't have direct access to the db. So what I'm trying is to encode the text into Latin-1 with this Go library that provides the encoder, and this one that has a function wrapping the encoder so it replaces the unsupported characters instead of returning an error.
But when I try to insert the row, MySQL complains: Error 1366: Incorrect string value: '\xE7\xE3o pa...' for column 'description' at row 1.
I tried writing the same text to a file, and file -I reports: file.txt: application/octet-stream; charset=binary.
Example
package main

import (
	"fmt"
	"os"

	"golang.org/x/text/encoding"
	"golang.org/x/text/encoding/charmap"
)

func main() {
	s := "foo – bar"

	encoder := charmap.ISO8859_1.NewEncoder()
	encoder = encoding.ReplaceUnsupported(encoder)

	encoded, err := encoder.String(s)
	if err != nil {
		panic(err)
	}

	fmt.Println(s)
	fmt.Println(encoded)
	fmt.Printf("%q\n", encoded)

	/* file test */
	f, err := os.Create("file.txt")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	w := encoder.Writer(f)
	w.Write([]byte(s))
}
I'm probably missing something very obvious, but my knowledge of encodings is very poor.
Thanks in advance.
Were you expecting çã?
The problem is easily solved. MySQL will gladly translate from latin1 to utf8 while INSERTing text, but you must tell it that your client is using latin1. That is done when connecting to MySQL, and the connection charset probably defaults to utf8 or utf8mb4 these days. The setting is something like
charset=latin1
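For example, with the go-sql-driver/mysql driver (an assumption on my part; the question doesn't say which driver is in use), the charset goes in the DSN:

package main

import (
	"database/sql"

	_ "github.com/go-sql-driver/mysql" // registers the "mysql" driver
)

func main() {
	// charset=latin1 tells the server that this client sends latin1 bytes;
	// MySQL then transcodes them to the column's character set on INSERT.
	db, err := sql.Open("mysql", "user:password@tcp(127.0.0.1:3306)/dbname?charset=latin1")
	if err != nil {
		panic(err)
	}
	defer db.Close()
}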
Related
I'm trying to decode CSV files encoded in UTF-16BE in Go. Which charmap (or other) encoding constant do I have to use for the new reader?
I want to invoke
csv.NewReader(charmap.XXXX.NewDecoder().Reader(file))
What should the value of XXXX be?
Have you tried this?
https://godoc.org/golang.org/x/text/encoding/unicode#UTF16
unicode.UTF16(unicode.BigEndian, unicode.UseBOM)
After some review, a simple way to decode UTF-16 into UTF-8 is provided by this code:
https://gist.github.com/bradleypeabody/185b1d7ed6c0c2ab6cec#file-gistfile1-go
You can use golang.org/x/text/encoding/unicode.UTF16 to create a decoder from your target UTF-16 Little/Big-Endian encoding into UTF-8.
The code below shows a working example for UTF-16 LE (Go playground):
dec := unicode.UTF16(unicode.LittleEndian, unicode.UseBOM).NewDecoder()
utf16r := getUTF16LittleEndianCSVReader()
utf8r := transform.NewReader(utf16r, dec)
csvr := csv.NewReader(utf8r)
records, err := csvr.ReadAll()
// TODO: handle err
fmt.Printf("%#v", records)
// [][]string{[]string{"id", "name"}, []string{"1", "foo"}}
Switching to big-endian should be as simple as:
dec := unicode.UTF16(unicode.BigEndian, unicode.UseBOM).NewDecoder()
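For reference, here is a self-contained version of the example, with inline UTF-16LE bytes standing in for the hypothetical getUTF16LittleEndianCSVReader:

package main

import (
	"bytes"
	"encoding/csv"
	"fmt"

	"golang.org/x/text/encoding/unicode"
	"golang.org/x/text/transform"
)

func main() {
	// "id,name\n1,foo" encoded as UTF-16LE with a leading BOM.
	utf16le := []byte{
		0xFF, 0xFE, // BOM
		'i', 0, 'd', 0, ',', 0, 'n', 0, 'a', 0, 'm', 0, 'e', 0, '\n', 0,
		'1', 0, ',', 0, 'f', 0, 'o', 0, 'o', 0,
	}

	dec := unicode.UTF16(unicode.LittleEndian, unicode.UseBOM).NewDecoder()
	csvr := csv.NewReader(transform.NewReader(bytes.NewReader(utf16le), dec))

	records, err := csvr.ReadAll()
	if err != nil {
		panic(err)
	}
	fmt.Printf("%#v\n", records)
	// [][]string{[]string{"id", "name"}, []string{"1", "foo"}}
}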
Using Go, how can I unmarshal a JSON string that contains unprintable ASCII characters?
For example:
testJsonString := "{\"test_one\" : \"123\x10456\x0B789\v123\a456\"}"
var dat map[string]interface{}
err := json.Unmarshal([]byte(testJsonString), &dat)
if err != nil {
	panic(err)
}
Yields:
panic: invalid character '\x10' in string literal
goroutine 1 [running]:
main.main()
/tmp/sandbox903140350/main.go:14 +0x180
https://play.golang.org/p/mFGWzndDK8V
Unfortunately I do not have control over the source data, so I need a way to ignore or strip out the unprintable characters.
Similarly, another data issue I'm encountering is stripping out a few C escape sequences as well, like \0 and \a. If I replace the string listed above with the string below, the program fails as well; essentially it fails on any C escape sequence (https://en.wikipedia.org/wiki/Escape_sequences_in_C):
testJsonString := "{\"test_one\" : \"123456789\\a123456\"}"
will error out with
panic: invalid character 'a' in string escape code
goroutine 1 [running]:
main.main()
/tmp/sandbox322770276/main.go:12 +0x100
This string also fails to unmarshal, and the bad escape can't be caught by checking rune values or Unicode properties, since Go treats it as a backslash followed by the character 'a', both of which are legal on their own.
Is there a good way to handle these edge cases?
According to the JSON spec (https://jsonapi.org/format/), non-printable characters should be URI-escaped (or converted to valid Unicode escapes).
So here's a converter that turns non-printable characters into their URI-escaped forms. The result can then be fed into Unmarshal.
If this isn't exactly the behaviour you need, modify the converter to remove the characters (with continue), replace them with a question-mark rune, or whatever.
BTW, the second problem with \\a does not "print out as expected" for me. Please give a better example that actually shows the problem you are experiencing.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/url"
	"unicode"
)

// safety rewrites d so that json.Unmarshal can parse it: backslashes
// are dropped and non-printable runes are URI-escaped.
func safety(d string) []byte {
	var buffer bytes.Buffer
	for _, c := range d {
		if c == '\\' { // drop backslashes to avoid broken escape codes
			continue
		}
		if unicode.IsPrint(c) {
			buffer.WriteRune(c)
		} else {
			buffer.WriteString(url.QueryEscape(string(c)))
		}
	}
	return buffer.Bytes()
}

func main() {
	testJsonString := "{\"test_one\" : \"123\x10456\x0B789\v123\a456\"}"

	var dat map[string]interface{}
	err := json.Unmarshal(safety(testJsonString), &dat)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%v", dat)
}
json.Encoder seems to behave slightly differently than json.Marshal. Specifically, it adds a newline at the end of the encoded value. Any idea why that is? It looks like a bug to me.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

func main() {
	v := "hello"

	buf := bytes.NewBuffer(nil)
	json.NewEncoder(buf).Encode(v)

	b, _ := json.Marshal(&v)
	fmt.Printf("%q, %q", buf.Bytes(), b)
}
This outputs
"\"hello\"\n", "\"hello\""
Try it in the Playground
Because Encoder.Encode explicitly adds a newline character. Here's the source for that func, and it actually states that it adds a newline character in its documentation (see the doc comment):
https://golang.org/src/encoding/json/stream.go?s=4272:4319
// Encode writes the JSON encoding of v to the stream,
// followed by a newline character.
//
// See the documentation for Marshal for details about the
// conversion of Go values to JSON.
func (enc *Encoder) Encode(v interface{}) error {
	if enc.err != nil {
		return enc.err
	}
	e := newEncodeState()
	err := e.marshal(v)
	if err != nil {
		return err
	}

	// Terminate each value with a newline.
	// This makes the output look a little nicer
	// when debugging, and some kind of space
	// is required if the encoded value was a number,
	// so that the reader knows there aren't more
	// digits coming.
	e.WriteByte('\n')

	if _, err = enc.w.Write(e.Bytes()); err != nil {
		enc.err = err
	}
	encodeStatePool.Put(e)
	return err
}
Now, why did the Go developers do it, other than that it "makes the output look a little nicer"? One answer:
Streaming
The Go json Encoder is optimized for streaming (e.g. MB/GB/PB of JSON data). When streaming, you typically need a way to delimit where one value ends and the next begins. In the case of Encoder.Encode(), that delimiter is a \n newline character. Sure, you can certainly write to a buffer, but you can also write to an io.Writer, which would stream the encoding of v.
This is opposed to json.Marshal, which is generally discouraged if your input comes from an untrusted (and unbounded) source (e.g. an ajax POST to your web service: what if someone posts a 100MB JSON file?). Also, json.Marshal produces one final, complete piece of JSON; you wouldn't expect to concatenate a few hundred Marshal results together. You'd use Encoder.Encode() for that, building a large set and writing to a buffer, stream, file, io.Writer, etc.
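Here's a minimal sketch of that streaming round trip, using only the standard library: each Encode call appends one newline-terminated document, and a Decoder on the other end reads them back one at a time.

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
)

func main() {
	var buf bytes.Buffer

	// Write three documents to the stream: the bytes are "1\n2\n3\n".
	enc := json.NewEncoder(&buf)
	for _, n := range []int{1, 2, 3} {
		if err := enc.Encode(n); err != nil {
			panic(err)
		}
	}

	// Read them back one document at a time.
	dec := json.NewDecoder(&buf)
	for {
		var n int
		if err := dec.Decode(&n); err == io.EOF {
			break
		} else if err != nil {
			panic(err)
		}
		fmt.Println(n) // 1, then 2, then 3
	}
}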
Whenever I'm in doubt whether something is a bug, I look up the source; that's one of the advantages of Go, since its source and compiler are just pure Go. Within [n]vim I use \gb to open the source definition in a browser with my .vimrc settings.
You can erase the newline by seeking backward before the next write:
f, _ := os.OpenFile(fname, ...)
encoder := json.NewEncoder(f)
encoder.Encode(v)
f.Seek(-1, io.SeekCurrent) // step back over the trailing '\n'
f.WriteString("other data ...")
They should let the user control this strange behavior, for example via:
a build option to disable it
Encoder.SetEOF(eof string)
Encoder.SetIndent(prefix, indent, eof string)
The Encoder writes a stream of documents. The extra whitespace terminates a JSON document in the stream.
A terminator is required for stream readers. Consider a stream containing these JSON documents: 1, 2, 3. Without the extra whitespace, the data on the wire is the sequence of bytes 123. This is a single JSON document with the number 123, not three documents.
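To see the point concretely (a small sketch; the string stands in for bytes on the wire):

package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

func main() {
	// The documents 1, 2, 3 written with no separator arrive as "123".
	var n int
	if err := json.NewDecoder(strings.NewReader("123")).Decode(&n); err != nil {
		panic(err)
	}
	fmt.Println(n) // 123: a single document, not three
}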
I'm running into an issue while attempting to manage a DynamoDB instance using godynamo.
My code is meant to take a gob-encoded byte array and put it into DynamoDB.
func (c *checkPointManager) CommitGraph(pop *Population) {
	blob, err := pop.GobEncodeColorGraphs()
	fitness := pop.GetTotalFitness()
	if err != nil {
		log.Fatal(err)
	}

	put1 := put.NewPutItem()
	put1.TableName = "CheckPoint"
	put1.Item["fitnessScore"] = &attributevalue.AttributeValue{N: string(fitness)}
	put1.Item["population"] = &attributevalue.AttributeValue{N: string(1)}
	put1.Item["graph"] = &attributevalue.AttributeValue{B: string(blob)}

	body, code, err := put1.EndpointReq()
	if err != nil || code != http.StatusOK {
		log.Fatalf("put failed %d %v %s\n", code, err, body)
	}
	fmt.Printf("values checkpointed: %d\n %v\n %s\n", code, err, body)
}
Every time I run this code though, I get the following error.
can not be converted to a Blob: Base64 encoded length is expected a multiple of 4 bytes but found: 25
Does godynamo not take care of converting a binary array to base64? Is there an easy way for me to handle this issue?
"Client applications must encode binary values in base64 format" according to the binary data type description of Amazon DynamoDB Data Types.
Your code could encode the value itself if you want; see Go's base64 package:
https://golang.org/pkg/encoding/base64
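For instance, a sketch based on the question's code (only the graph attribute changes):

import "encoding/base64"

// DynamoDB's B (binary) type expects base64 text, so encode the gob
// blob before handing it to godynamo.
put1.Item["graph"] = &attributevalue.AttributeValue{
	B: base64.StdEncoding.EncodeToString(blob),
}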
Alternatively, the godynamo library provides functions that will encode it for you; have a look at AttributeValue:
// InsertB_unencoded adds a new plain string to the B field.
// The argument is assumed to be plaintext and will be base64 encoded.
func (a *AttributeValue) InsertB_unencoded(k string) error {
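Based on that signature, usage would look something like the sketch below (unverified beyond the quoted documentation, so treat it as an assumption about godynamo's API):

av := &attributevalue.AttributeValue{}
if err := av.InsertB_unencoded(string(blob)); err != nil {
	log.Fatal(err)
}
put1.Item["graph"] = av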
I am writing a package to read CSV files in Go, and I need to open CSV files that may be encoded in different formats (such as UTF-8, Latin-1, or others). Is there a way to specify the encoding of the CSV file to read?
Package csv
import "encoding/csv"
func NewReader
func NewReader(r io.Reader) *Reader
NewReader returns a new Reader that reads from r.
Provide an io.Reader to csv.NewReader that maps the CSV file character set to Unicode UTF-8.
For example,
import (
	"encoding/csv"
	"os"

	"golang.org/x/text/encoding/charmap"
)

file, err := os.Open(filename)
if err != nil {
	return err
}
defer file.Close()

// Decode Latin-9 (ISO 8859-15) to UTF-8 as the CSV reader consumes it.
rdr := csv.NewReader(charmap.ISO8859_15.NewDecoder().Reader(file))
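To support several encodings with the same reader, one option is to select the decoder from a caller-supplied name. This is only a sketch: the names and the set of encodings here are illustrative, not a fixed API.

import (
	"encoding/csv"
	"io"

	"golang.org/x/text/encoding/charmap"
	"golang.org/x/text/encoding/unicode"
)

// newCSVReader wraps r with the decoder chosen by name, so the
// csv.Reader always consumes UTF-8.
func newCSVReader(r io.Reader, enc string) *csv.Reader {
	switch enc {
	case "latin1":
		r = charmap.ISO8859_1.NewDecoder().Reader(r)
	case "latin9":
		r = charmap.ISO8859_15.NewDecoder().Reader(r)
	case "utf16":
		r = unicode.UTF16(unicode.LittleEndian, unicode.UseBOM).NewDecoder().Reader(r)
	}
	// UTF-8 (and anything unlisted) passes through unchanged.
	return csv.NewReader(r)
}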