Quicker way to deepcopy objects in golang, JSON vs gob

I am using Go 1.9 and I want to deep-copy the value of one object into another object. I tried to do it with encoding/gob and encoding/json, but gob encoding takes more time than json encoding. I have seen some other questions like this one, and they suggest that gob encoding should be quicker, but I see the exact opposite behaviour. Can someone tell me if I am doing something wrong? Or is there a better and quicker way to deep-copy than these two? My object's struct is complex and nested.
The test code:
package main

import (
	"bytes"
	"encoding/gob"
	"encoding/json"
	"log"
	"strconv"
	"time"
)

// Test ...
type Test struct {
	Prop1 int
	Prop2 string
}

// Clone deep-copies a to b
func Clone(a, b interface{}) {
	buff := new(bytes.Buffer)
	enc := gob.NewEncoder(buff)
	dec := gob.NewDecoder(buff)
	enc.Encode(a)
	dec.Decode(b)
}

// DeepCopy deep-copies a to b using json marshaling
func DeepCopy(a, b interface{}) {
	byt, _ := json.Marshal(a)
	json.Unmarshal(byt, b)
}

func main() {
	i := 0
	tClone := time.Duration(0)
	tCopy := time.Duration(0)
	end := 3000
	for {
		if i == end {
			break
		}
		r := Test{Prop1: i, Prop2: strconv.Itoa(i)}
		var rNew Test
		t0 := time.Now()
		Clone(r, &rNew)
		t2 := time.Now().Sub(t0)
		tClone += t2

		r2 := Test{Prop1: i, Prop2: strconv.Itoa(i)}
		var rNew2 Test
		t0 = time.Now()
		DeepCopy(&r2, &rNew2)
		t2 = time.Now().Sub(t0)
		tCopy += t2

		i++
	}
	log.Printf("Total items %+v, Clone avg. %+v, DeepCopy avg. %+v, Total Difference %+v\n", i, tClone/3000, tCopy/3000, (tClone - tCopy))
}
I get following output:
Total items 3000, Clone avg. 30.883µs, DeepCopy avg. 6.747µs, Total Difference 72.409084ms

JSON vs gob difference
The encoding/gob package needs to transmit type definitions:
The implementation compiles a custom codec for each data type in the stream and is most efficient when a single Encoder is used to transmit a stream of values, amortizing the cost of compilation.
When you "first" serialize a value of a type, the definition of the type also has to be included / transmitted, so the decoder can properly interpret and decode the stream:
A stream of gobs is self-describing. Each data item in the stream is preceded by a specification of its type, expressed in terms of a small set of predefined types.
This is explained in great detail here: Efficient Go serialization of struct to disk
So in your case, since a new gob encoder and decoder is created for every single clone, the type specification has to be transmitted every single time; this is the "bottleneck", the part that makes it slow. When encoding to / decoding from JSON, no type description is included in the representation.
To prove it, make this simple change:
type Test struct {
Prop1 [1000]int
Prop2 [1000]string
}
What we did here is change the field types to arrays, "multiplying" the values a thousand times, while the type information effectively remains the same (all elements in the arrays have the same type). We create values of it like this:
r := Test{Prop1: [1000]int{}, Prop2: [1000]string{}}
Now running your test program, the output on my machine:
Original:
2017/10/17 14:55:53 Total items 3000, Clone avg. 33.63µs, DeepCopy avg. 2.326µs, Total Difference 93.910918ms
Modified version:
2017/10/17 14:56:38 Total items 3000, Clone avg. 119.899µs, DeepCopy avg. 462.608µs, Total Difference -1.02812648s
As you can see, in the original version JSON is faster, but in the modified version gob becomes faster, as the cost of transmitting the type info is amortized.
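To illustrate the amortization the package documentation talks about, here is a minimal sketch (my own addition, not from the original code) that reuses a single encoder/decoder pair over one buffer, so the type specification is written into the stream only on the first Encode. Note it is not safe for concurrent use; the names are illustrative:
var (
	cloneBuf bytes.Buffer
	cloneEnc = gob.NewEncoder(&cloneBuf)
	cloneDec = gob.NewDecoder(&cloneBuf)
)

// CloneReused reuses the encoder/decoder pair, so the type definition of
// the encoded type is only transmitted once, on the first call.
func CloneReused(a, b interface{}) error {
	if err := cloneEnc.Encode(a); err != nil {
		return err
	}
	return cloneDec.Decode(b)
}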
Testing / benching method
Now on to your testing method. This way of measuring performance is unreliable and can yield quite inaccurate results. Instead, you should use Go's built-in testing and benchmarking tools. For details, read Order of the code and performance.
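A minimal benchmark sketch of what that could look like, using the Clone and DeepCopy helpers from the question (place it in a _test.go file with "testing" imported and run go test -bench .; the benchmark names are just illustrative):
func BenchmarkClone(b *testing.B) {
	r := Test{Prop1: 1, Prop2: "1"}
	var rNew Test
	for i := 0; i < b.N; i++ {
		Clone(r, &rNew)
	}
}

func BenchmarkDeepCopy(b *testing.B) {
	r := Test{Prop1: 1, Prop2: "1"}
	var rNew Test
	for i := 0; i < b.N; i++ {
		DeepCopy(&r, &rNew)
	}
}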
Caveats of these cloning
These methods work with reflection and thus can only "clone" fields that are accessible via reflection, that is: exported fields. They also do not preserve pointer identity: if a struct has two pointer fields pointing to the same object (the pointers being equal), then after marshaling and unmarshaling you will get two different pointers pointing to two different values. This may even cause problems in certain situations. They also don't handle self-referencing structures, which at best returns an error, and at worst causes an infinite loop or exhausts the goroutine stack.
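A small sketch of the pointer-identity caveat, using the DeepCopy helper from the question (the Node and Pair types here are made up purely for illustration):
type Node struct{ V int }
type Pair struct{ A, B *Node }

n := &Node{V: 1}
p := Pair{A: n, B: n} // both fields point to the same Node

var q Pair
DeepCopy(&p, &q)

fmt.Println(p.A == p.B) // true: one shared Node
fmt.Println(q.A == q.B) // false: the copy holds two distinct Nodes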
The "proper" way of cloning
Considering the caveats mentioned above, often the proper way of cloning needs help from the "inside". That is, cloning a specific type is often only possible if that type (or the package of that type) provides this functionality.
Yes, providing "manual" cloning functionality is less convenient, but on the other hand it will outperform the above methods (possibly by orders of magnitude) and will require the least amount of "working" memory for the cloning process.
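For example, a hand-written clone for a hypothetical nested type might look like this (a sketch; the Nested type is made up, and each field is copied explicitly):
type Nested struct {
	Values []int
	Child  *Nested
}

// Clone returns a deep copy of n, copying the slice's backing array
// and recursing into the pointer field.
func (n *Nested) Clone() *Nested {
	if n == nil {
		return nil
	}
	return &Nested{
		Values: append([]int(nil), n.Values...),
		Child:  n.Child.Clone(),
	}
}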

Related

Golang reading csv consuming more than 2x space in memory than on disk

I am loading a lot of CSV files into a struct using Go.
The struct is
type csvData struct {
	Index   []time.Time
	Columns map[string][]float64
}
I have a parser that uses:
csv.NewReader(file).ReadAll()
Then I iterate over the rows, and convert the values into their types: time.Time or float64.
The problem is that on disk these files consume 5GB space.
Once I load them into memory they consume 12GB!
I used ioutil.ReadFile(path) and found that this was, as expected, almost exactly the on-disk size.
Here is the code for my parser, with errors omitted for readability, if you could help me troubleshoot:
var inMemoryRepo = make([]csvData, 0)

func LoadCSVIntoMemory(path string) {
	parsedData := csvData{make([]time.Time, 0), make(map[string][]float64)}
	file, _ := os.Open(path)
	reader := csv.NewReader(file)
	columnNames, _ := reader.Read()
	columnData, _ := reader.ReadAll()
	for _, row := range columnData {
		parsedData.Index = append(parsedData.Index, parseTime(row[0])) // parseTime is a simple wrapper for time.Parse
		for i := range row[1:] { // parse non-index numeric columns
			parsedData.Columns[columnNames[i+1]] = append(parsedData.Columns[columnNames[i+1]], parseFloat(row[i+1])) // parseFloat is a wrapper for strconv.ParseFloat
		}
	}
	inMemoryRepo = append(inMemoryRepo, parsedData)
}
I tried troubleshooting by setting columnData and reader to nil at the end of the function, but there was no change.
There is nothing surprising in this. On your disk there are just the characters (bytes) of your CSV text. When you load them into memory, you create data structures from your text.
For example, a float64 value requires 64 bits in memory, that is: 8 bytes. If you have the input text "1", that is 1 single byte. Yet, if you create a float64 value equal to 1, it will still consume 8 bytes.
Further, strings are stored with a string header (reflect.StringHeader), which consists of 2 integer values (16 bytes on 64-bit architectures); this header points to the actual string data. See String memory usage in Golang for details.
Slices are similar data structures: reflect.SliceHeader. The slice header consists of 3 integer values, which is 24 bytes on 64-bit architectures, even if there are no elements in the slice.
Structs on top of this may have padding (fields must be aligned to certain values), which again adds overhead. For details, see Spec: Size and alignment guarantees.
Go maps are hashmaps, which again carry quite some overhead; for details see Why slice values can sometimes go stale but never map values?, and for memory usage see How much memory do golang maps reserve?
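A quick way to see the header and padding sizes mentioned above for yourself (a small sketch; the printed values assume a 64-bit architecture, and the padded type is made up for illustration):
package main

import (
	"fmt"
	"unsafe"
)

type padded struct {
	Flag  bool
	Value float64
}

func main() {
	fmt.Println(unsafe.Sizeof(""))          // 16: string header (pointer + length)
	fmt.Println(unsafe.Sizeof([]float64{})) // 24: slice header (pointer + len + cap)
	fmt.Println(unsafe.Sizeof(float64(0)))  // 8
	fmt.Println(unsafe.Sizeof(padded{}))    // 16: 1 byte of bool padded to 8, plus 8 for the float64
}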
Reading an entire file into memory is rarely a good idea.
What if your CSV is 100 GiB?
If your transformation does not need to look at several records at once, you could apply the following algorithm (a Go sketch follows the pseudocode):
open csv_reader (source file)
open csv_writer (destination file)
for row in csv_reader:
    transform row
    write row into csv_writer
close csv_reader and csv_writer
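As mentioned above, a Go sketch of that algorithm could look like the following (imports encoding/csv, io, and os; the function and parameter names are illustrative, and the row-by-row transform is supplied by the caller):
func transformCSV(src, dst string, transform func([]string) []string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()

	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer out.Close()

	r := csv.NewReader(in)
	w := csv.NewWriter(out)
	defer w.Flush()

	for {
		row, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}
		// Transform and write one row at a time; only a single row is in memory.
		if err := w.Write(transform(row)); err != nil {
			return err
		}
	}
	return nil
}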

Unmarshalling Entities from multiple JSON arrays without using reflect or duplicating code

I'm making a JSON API wrapper client that needs to fetch paginated results, where the URL of the next page is provided by the previous page. To reduce code duplication for the 100+ entities that share the same response format, I would like to have a single client method that fetches and unmarshals the different entities from all paginated pages.
My current approach in a simplified (pseudo) version (without errors etc):
type ListResponse struct {
	Data struct {
		Results []interface{} `json:"results"`
		Next    string        `json:"__next"`
	} `json:"d"`
}

func (c *Client) ListRequest(uri string) (listResponse ListResponse) {
	// Do a http request to uri and get the body
	body := []byte(`{ "d": { "__next": "URL", "results": []}}`)
	json.Unmarshal(body, &listResponse)
	return
}

func (c *Client) ListRequestAll(uri string, v interface{}) {
	var a []interface{}
	f := c.ListRequest(uri)
	a = append(a, f.Data.Results...)
	next := f.Data.Next
	for next != "" {
		r := c.ListRequest(next)
		a = append(a, r.Data.Results...)
		next = r.Data.Next
	}
	b, _ := json.Marshal(a)
	json.Unmarshal(b, v)
}
// Then in a method requesting all results for a single entity
var entities []Entity1
client.ListRequestAll("https://foo.bar/entities1.json", &entities)
// and somewhere else
var entities []Entity2
client.ListRequestAll("https://foo.bar/entities2.json", &entities)
The problem however is that this approach is inefficient and uses too much memory: it first unmarshals into a general ListResponse with the results as []interface{} (to see the next URL and concatenate the results into a single slice), and then marshals the []interface{} again only to unmarshal it directly afterwards into the destination slice []Entity1.
I might be able to use the reflect package to dynamically make new slices of these entities, directly unmarshal into them and concat/append them afterwards; however, if I understand correctly, I had better not use reflect unless strictly necessary...
Take a look at the RawMessage type in the encoding/json package. It allows you to defer the decoding of json values until later. For example:
Results []json.RawMessage `json:"results"`
or even...
Results json.RawMessage `json:"results"`
Since json.RawMessage is just a slice of bytes, this will be much more efficient than the intermediate []interface{} you are unmarshalling into.
As for the second part, how to assemble these into a single slice given multiple page reads, you could punt that question to the caller by making the caller use a slice-of-slices type.
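A sketch of what the RawMessage-based response type and a later per-item decode could look like (this wiring is my own assumption, untested; resp stands for an already-fetched ListResponse):
type ListResponse struct {
	Data struct {
		Results []json.RawMessage `json:"results"`
		Next    string            `json:"__next"`
	} `json:"d"`
}

// Later, decode a single raw result into a concrete entity:
var e Entity1
if err := json.Unmarshal(resp.Data.Results[0], &e); err != nil {
	// handle error
}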
// Then in a method requesting all results for a single entity
var entityPages [][]Entity1
client.ListRequestAll("https://foo.bar/entities1.json", &entityPages)
This still has the unbounded memory consumption problem your general design has, however, since you have to load all of the pages / items at once. You might want to consider changing to an Open/Read abstraction like working with files. You'd have some Open method that returns another type that, like os.File, provides a method for reading a subset of data at a time, while internally requesting pages and buffering as needed.
Perhaps something like this (untested):
type PagedReader struct {
	c      *Client
	buffer []json.RawMessage
	next   string
}

func (r *PagedReader) getPage() {
	f := r.c.ListRequest(r.next)
	r.next = f.Data.Next
	r.buffer = append(r.buffer, f.Data.Results...)
}

func (r *PagedReader) ReadItems(output []interface{}) int {
	for len(output) > len(r.buffer) && r.next != "" {
		r.getPage()
	}
	n := 0
	for i := 0; i < len(output) && i < len(r.buffer); i++ {
		json.Unmarshal(r.buffer[i], output[i])
		n++
	}
	r.buffer = r.buffer[n:]
	return n
}

Store and retrieve interfaces in golang

How can we store an array of different structs in a file and retrieve it back in the same format without losing its properties (the methods it provides)?
For example: I have data structs A and B, both implementing a common interface X with some methods.
One option is to write both the save and the retrieve method to accept a slice of interface X.
However, the problem is how to unmarshal it back in some generic way that is not tied to my data structs; i.e., every time I add a new data struct I should not need to change my save or retrieve functions, and they should still give back the slice of interface X so that its methods can be used independently of the concrete data struct.
Example where unmarshalling returns an error:
Go PlayGround Link with a small Example
However the problem is how to unmarshal it back in some generic way which is not tied to my Data struct.
Yes, this is undoable. Redesign.
Interesting question; like Volkel says, it is undoable exactly as you want it. But if you redesign, there are possibilities. Normally, try to avoid using reflection: it is not idiomatic, but it is very powerful in particular cases and it may be just what you are looking for. Often there are not many candidate structs that are applicable, and you can keep the unmarshal function generic in its parameter, because the type-discovery procedure lives inside the function and is not visible in the function call.
So sometimes you can solve it by using reflection inside the unmarshal function:
if f.Type() == reflect.TypeOf(Entity{}) {
//reflected type is of type "Entity", you can now Unmarshal
}
It is easier if you think only of the data ...
Your X should not be an interface but a struct, so that you can marshal it.
To make the process generic, you can have X hold a choice:
type A struct {
	A int64
}

type B struct {
	S string
}

type Choice int

const (
	XisA Choice = iota
	XisB
)

type X struct {
	Choice
	A
	B
}
Before marshalling, you just need to set the choice for each item of your array
a := A{
	A: 1,
}
b := B{
	S: "2",
}
x1 := X{
	Choice: XisA,
	A:      a,
}
x2 := X{
	Choice: XisB,
	B:      b,
}
x := [2]X{x1, x2}
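The round trip between the two snippets could simply use encoding/json (a minimal sketch; error handling omitted), which also produces the decoded value used below:
data, _ := json.Marshal(x)

var decoded [2]X
_ = json.Unmarshal(data, &decoded)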
After unmarshalling, you just need to retrieve the choice you made for each item of the array
for _, item := range decoded {
	switch {
	case item.Choice == XisA:
		println(item.A.GetKey())
	case item.Choice == XisB:
		println(item.B.GetKey())
	}
}
Here is an example: https://play.golang.org/p/RtzF6DmNlKL

Speeding up JSON parsing in Go

We have transaction log files in which each transaction is a single line in JSON format. We often need to take selected parts of the data, perform a single time conversion, and feed results into another system in a specific format. I wrote a Python script that does this as we need, but I hoped that Go would be faster, and would give me a chance to start learning Go. So, I wrote the following:
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"time"
)

func main() {
	sep := ","
	reader := bufio.NewReader(os.Stdin)
	for {
		data, _ := reader.ReadString('\n')
		byt := []byte(data)
		var dat map[string]interface{}
		if err := json.Unmarshal(byt, &dat); err != nil {
			break
		}
		status := dat["status"].(string)
		a_status := dat["a_status"].(string)
		method := dat["method"].(string)
		path := dat["path"].(string)
		element_uid := dat["element_uid"].(string)
		time_local := dat["time_local"].(string)
		etime, _ := time.Parse("[02/Jan/2006:15:04:05 -0700]", time_local)
		fmt.Print(status, sep, a_status, sep, method, sep, path, sep, element_uid, sep, etime.Unix(), "\n")
	}
}
That compiles without complaint, but I'm surprised at the lack of performance improvement. To test, I placed 2,000,000 lines of logs into a tmpfs (to ensure that disk I/O would not be a limitation) and compared the two versions of the script. My results:
$ time cat /mnt/ramdisk/logfile | ./stdin_conv > /dev/null
real 0m51.995s
$ time cat /mnt/ramdisk/logfile | ./stdin_conv.py > /dev/null
real 0m52.471s
$ time cat /mnt/ramdisk/logfile > /dev/null
real 0m0.149s
How can this be made faster? I have made some rudimentary efforts. The ffjson project, for example, proposes to create static functions that make reflection unnecessary; however, I have failed so far to get it to work, getting the error:
Error: Go Run Failed for: /tmp/ffjson-inception810284909.go
STDOUT:
STDERR:
/tmp/ffjson-inception810284909.go:9:2: import "json_parse" is a program, not an importable package
:
Besides, wouldn't what I have above be considered statically typed? Possibly not; I am positively dripping behind the ears where Go is concerned. I have tried selectively disabling different attributes in the Go code to see if one is especially problematic, but none had an appreciable effect on performance. Any suggestions on improving performance, or is this simply a case where compiled languages have no substantial benefit over others?
Try using a concrete type to remove all the unnecessary map lookups and type assertions:
type RenameMe struct {
	Status     string `json:"status"`
	Astatus    string `json:"a_status"`
	Method     string `json:"method"`
	Path       string `json:"path"`
	ElementUid string `json:"element_uid"`
	TimeLocal  string `json:"time_local"`
	Etime      time.Time // deal with this after the fact
}

data := &RenameMe{}
if err := json.Unmarshal(byt, data); err != nil {
	break
}
data.Etime, _ = time.Parse("[02/Jan/2006:15:04:05 -0700]", data.TimeLocal)
I'm not going to test this to ensure it outperforms your code but I bet it does by a large margin. Give it a try and let me know please.
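For reference, a sketch of how the typed struct could replace the map in the original loop, using a json.Decoder over stdin so each newline-delimited record is decoded directly into the struct (sep is the separator from the original main; untested, names as above):
dec := json.NewDecoder(bufio.NewReader(os.Stdin))
for {
	var rec RenameMe
	if err := dec.Decode(&rec); err != nil {
		break // io.EOF when the input is exhausted, or a malformed record
	}
	rec.Etime, _ = time.Parse("[02/Jan/2006:15:04:05 -0700]", rec.TimeLocal)
	fmt.Print(rec.Status, sep, rec.Astatus, sep, rec.Method, sep, rec.Path, sep, rec.ElementUid, sep, rec.Etime.Unix(), "\n")
}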
http://jsoniter.com/ declares itself to be the fastest JSON parser; Go and Java implementations are provided. Two kinds of API can be used, and defining the JSON object's Go type up front is optional.
Check https://github.com/pquerna/ffjson
I saw a 3x improvement over the marshal/unmarshal functions of the standard encoding/json package. It does so by rewriting the source and removing the need for reflection.

How do I substitute bytes in a stream using Go standard library?

I have an io.Reader, which I get from http.Request.Body, that reads a JSON byte slice from a server.
I would like to stream this to json.NewDecoder. However I would also like to intercept the JSON before it hits json.NewDecoder and substitute certain parts of it. For example, the JSON string contains empty hashes "{}" which I would like to remove due to a bug in the server's JSON output.
I am currently achieving my goal using json.Unmarshal but not using the JSON streaming parser:
data, _ := ioutil.ReadAll(r.Body)
data = bytes.Replace(data, []byte("{}"), []byte(""), -1)
json.Unmarshal(data, [my struct])
How can I achieve the same thing as above but using json.NewDecoder so I can save the many times the above code has to parse through r.Body's data? Here's some code using a pseudo function ReplaceStream(r io.Reader, old, new []byte):
reader := ReplaceStream(r.Body, []byte("{}"), []byte(""))
dec := json.NewDecoder(reader)
dec.Decode([my struct])
I know ReplaceStream might be fairly trivial to make, but is there anything in the standard library to do this that I am unaware of?
My advice is to just treat that kind of message as a special case and avoid the extra parsing / substituting for all the other requests:
data, _ := ioutil.ReadAll(r.Body)
// FIXME: overcome bug #12312 of json server
if string(data) == `{"list": [{}]}` {
	return nil
}
// Normal datastruct ..
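If a streaming substitution is still wanted, here is a rough sketch of the ReplaceStream helper mentioned in the question, built only from io.Pipe and bytes.Replace in the standard library; the buffer size is arbitrary, and the boundary handling keeps back any suffix that could still be the start of a match:
// ReplaceStream returns a reader that yields r's data with all occurrences
// of old replaced by new. This is a sketch, not production code.
func ReplaceStream(r io.Reader, old, new []byte) io.Reader {
	pr, pw := io.Pipe()
	go func() {
		var pending []byte
		buf := make([]byte, 4096)
		for {
			n, err := r.Read(buf)
			pending = append(pending, buf[:n]...)
			if err != nil {
				// Flush whatever is left, then close the pipe.
				pw.Write(bytes.Replace(pending, old, new, -1))
				if err == io.EOF {
					pw.Close()
				} else {
					pw.CloseWithError(err)
				}
				return
			}
			// Keep back the longest suffix of pending that could still be
			// the beginning of old, so matches spanning reads are not missed.
			keep := 0
			for k := len(old) - 1; k > 0; k-- {
				if k <= len(pending) && bytes.HasPrefix(old, pending[len(pending)-k:]) {
					keep = k
					break
				}
			}
			emit := bytes.Replace(pending[:len(pending)-keep], old, new, -1)
			if _, werr := pw.Write(emit); werr != nil {
				return
			}
			pending = append(pending[:0], pending[len(pending)-keep:]...)
		}
	}()
	return pr
}
Wired up as in the question, this would be reader := ReplaceStream(r.Body, []byte("{}"), []byte("")) feeding json.NewDecoder(reader).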